infra

Author	SHA1	Message	Date
Viktor Barzin	e5250f417e	k8s-version-upgrade: compat gate must not false-block patch upgrades The compat gate compared every addon's matrix ceiling against the target k8s minor unconditionally. That is correct for a minor JUMP, but it also blocked patch upgrades within the minor the cluster is ALREADY running: ESO v0.12's matrix ceiling is 1.31, the cluster runs 1.34.9, so a target of 1.34.10 (a patch) was refused with "external-secrets supports k8s <= 1.31; target 1.34 exceeds it" — even though the running cluster is itself proof ESO 0.12 works on 1.34. That silently defeats autonomous patching (it would have bitten the moment a 1.34.10 was published). Fix: a target at or below the running minor crosses into no new k8s minor, so every installed addon is already empirically proven on it — check_addons now returns no reasons when target_minor <= running_minor. Added running_minor() (oldest kubelet across nodes, mirroring the detector; RUNNING_K8S env override for tests) and pass it in. Minor jumps are unchanged: 1.34->1.35 still blocks on ESO 0.12 + kyverno 1.16. removed-API + containerd checks are naturally inert for patches (no API removal / containerd floor inside a minor) and keep running as defence. Added test_compat_gate.py (8 cases) covering both paths. Verified end-to-end against live Prometheus: target 1.34.10 -> EXIT 0 (safe), target 1.35.6 -> EXIT 2 (blocked on ESO+kyverno). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 08:14:50 +00:00
Viktor Barzin	38675b7922	crowdsec: register kvsync + firewall bouncer keys in LAPI Seeds two new bouncers at LAPI startup (BOUNCER_KEY_kvsync, BOUNCER_KEY_firewall) from Vault secret/platform, mirroring the existing BOUNCER_KEY_traefik wiring. These are the two halves of the real enforcement that replaces the dead Yaegi plugin: kvsync authenticates the LAPI->Cloudflare-KV sync (proxied edge Worker), firewall authenticates the cs-firewall-bouncer DaemonSet (direct-host nftables). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 08:12:38 +00:00
Viktor Barzin	a9384a4067	Merge remote-tracking branch 'origin/master'	2026-06-20 08:09:16 +00:00
Viktor Barzin	44a98d408e	k8s-version-upgrade: detector next-minor probe must follow 302 (curl -sfL) The next-minor Packages query used `curl -sf` without -L. pkgs.k8s.io 302-redirects every request to a backing host, so without -L curl returned an empty body, NEXT_MINOR_PATCH came back empty, and the detector fell through to "No upgrade needed". That is exactly why last night's 23:00 chain no-op'd instead of resolving the 1.35 next-minor target (1.35.6) and handing it to the compat gate. `curl -sfL` follows the redirect and returns the Packages file (verified: -sf -> empty, -sfL -> 1.35.6). Mirrors the same -L fix already applied to the Release availability probe (-sILo) above. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 08:09:08 +00:00
Viktor Barzin	910d589205	fix(forgejo): raise git-op timeouts + lower gc.auto to stop push-mirror timeouts The tripit Forgejo->GitHub push-mirror silently stalled: `git cat-file --batch-all-objects` over the NFS-backed repo exceeded the default git deadline once ~4500 loose objects accumulated (gc.auto's 6700 threshold hadn't fired), so pushes stopped reaching GitHub and prod deploys stalled. Raise [git.timeout] (DEFAULT/MIRROR/GC) so a slow object enumeration can't abort the mirror, and set [git.config] gc.auto=1000 so post-push autogc + the git_gc_repos cron keep repos packed (the real fix). A one-off forced gc already unblocked tripit; this prevents recurrence across all repos. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 08:08:50 +00:00
Viktor Barzin	45bed1c133	Merge remote-tracking branch 'origin/master'	2026-06-20 08:07:23 +00:00
Viktor Barzin	e1736d2e5c	calico: hop 3.28.5->3.30.7 (operator v1.38.13) — restores a SUPPORTED Calico/k8s-1.34 pairing. Disabled new-in-3.30 Goldmane/Whisker (their CRs render before crds/ install on helm upgrade; we use Prometheus/Loki). calico-node 7/7 on quay/v3.30.7, tigerastatus green. Applied manually + verified overnight.	2026-06-20 08:07:08 +00:00
Viktor Barzin	4d9fdbc7f7	rybbit: add CrowdSec LAPI -> Cloudflare KV sync script (proxied edge control plane) Pure-stdlib script (alert_digest pattern, runs on stock python:3.12-alpine) that projects CrowdSec Ip-scope ban/captcha decisions into the Workers KV namespace the edge Worker reads on each proxied request. Full-reconcile per run so an un-ban clears from the edge within one interval; fail-safe (a LAPI read error skips the run and leaves existing bans to expire by TTL = fail-open, never a stale all-block). TF wiring (KV namespace + CronJob + key registration) follows. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 08:05:11 +00:00
Viktor Barzin	0ac176da01	crowdsec: whitelist internal/LAN/tailnet CIDRs at the decision layer Preparing for real CrowdSec enforcement (edge Cloudflare Worker for proxied hosts + cs-firewall-bouncer for direct hosts). Both enforce by dropping the real source IP, so if an internal/RFC1918 address ever ended up in a ban decision it could blackhole legitimate internal traffic. Whitelisting the cluster/LAN/tailnet ranges (10/8, 172.16/12, 192.168/16, 100.64/10) at the CrowdSec parser layer makes that structurally impossible — a trusted source can never produce a decision in the first place. Public IP already whitelisted. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 08:03:46 +00:00
Viktor Barzin	3e3fdb34f0	homelab: v0.6.0 — usage telemetry (usage top), evidence-driven verb prioritization Some checks are pending Build infra CLI / build (push) Waiting to run Details ci/woodpecker/push/default Pipeline was successful Details Answers the question that drove the whole CLI — which verbs to add next — with data instead of one maintainer's habits, and resolves the cross-user-usage ask in-bounds (no reading anyone's home). - emit on dispatch: every verb fire-and-forgets one Loki line {job,user,verb} + "exit=N ver=X". ONLY the verb path + exit code — never args, paths, flags, or secrets (the emit never sees arguments). Best-effort: 800ms timeout, errors swallowed, never affects the command; opt-out HOMELAB_TELEMETRY=0. Discovery verbs (manifest/version/help) and usage itself don't self-record. - usage top [--since 30d] [--user U] [--json]: ranks verbs via sum by (verb)(count_over_time({job="homelab-usage"}[…])) against the shared Loki. Cross-user analytics WITHOUT touching ~/.claude — the privacy-preserving answer to "what does the team use". - Loki sink (zero new infra, dogfoods v0.5 logs path); push verified HTTP 204 no auth. ADR docs/adr/0011. Live-verified: ran 4 verbs, usage top ranked them correctly (metrics query=2). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 22:29:01 +00:00
Viktor Barzin	666fefd22b	calico: hop 3.26->3.28.5 (operator v1.34.13); calico-node 7/7 healthy, tigerastatus green, kube-controller-manager restarted (3.28 UID change). Applied manually + verified.	2026-06-19 22:09:23 +00:00
Viktor Barzin	8ed5368be9	calico: bring tigera-operator under Terraform via Helm (adopt at 3.26.1) Base for the stepped 3.26->3.28->3.30->3.32 upgrade (k8s 1.36 prereq; 3.26 is already unsupported on k8s 1.34). Manage ONLY the operator via the official tigera-operator Helm chart (chart ver == Calico ver); installation.enabled=false keeps the live Installation CR operator-managed so Helm never touches calico-node. Adopted in place: existing operator Deployment/SA/ClusterRole/ClusterRoleBinding pre-stamped with Helm ownership metadata (transient migration step), then the release imported via a plan-verified create (1 to add, 0 destroy). Verified clean: calico-node 7/7 unchanged, tigerastatus green. Closes the year-deferred adoption (code-3ad) for the operator without adopting the Installation CR.	2026-06-19 21:50:34 +00:00
Viktor Barzin	dd029ca7fb	traefik/crowdsec: switch bouncer to live mode (stream cache doesn't enforce under Yaegi) After bumping to v1.6.0 (stream goroutine runs) and disabling redis (in-memory cache), the plugin logs `handleStreamCache:updated` but still does NOT enforce: a ban present in the LAPI stream AND pulled by the plugin still let the banned IP through. Stream-mode decision matching is unreliable under Traefik's Yaegi interpreter here. Switch crowdsecMode stream->live: the plugin queries LAPI synchronously per request (result cached per-IP for defaultDecisionSeconds), which enforces reliably and picks up new decisions immediately. LAPI is 3-replica + in-cluster so per-request latency is small; fail-open preserved (updateMaxFailure=-1). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 17:49:26 +00:00
Viktor Barzin	0cc48d83ac	traefik/crowdsec: disable bouncer redis cache (broken under Yaegi → in-memory) With the plugin on v1.6.0 the stream goroutine finally runs, and its slog output revealed the real blocker: `handleStreamTicker ... isCrowdsecStreamHealthy:true cache:unreachable`. The LAPI stream is healthy, but the plugin's redis client cannot reach the cache under Traefik's Yaegi interpreter — even though redis-master.redis.svc is reachable AND writable from the traefik namespace (SET/GET verified via busybox; no NetworkPolicies; no auth). Same interpreter-incompat class as the stream goroutine itself. With redisCacheUnreachableBlock=false the bouncer then failed open and enforced nothing. Disable the redis cache so the plugin uses its in-memory decision store (works under Yaegi). Removes redisCacheHost/redisCacheUnreachableBlock. Trade-off: captcha already-solved grace is per-pod across the 3 Traefik replicas (at worst an occasional re-solve) — acceptable; bans/captcha decisions enforce correctly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 17:49:26 +00:00
Viktor Barzin	531efb218d	traefik: bump crowdsec-bouncer plugin v1.4.2 -> v1.6.0 (fix stream not pulling) The crowdsec-bouncer Yaegi plugin pinned at v1.4.2 loads on Traefik 3.7.5 but its decision-stream goroutine never runs — no Traefik pod ever calls the LAPI stream (verified: no traefik-pod bouncer entry / no @pod-ip auto-registration), and it logs nothing. All deps are healthy (LAPI 200 + full ban list reachable from the traefik ns, key valid, redis PONG, config correct, no NetworkPolicies), so CrowdSec enforced nothing despite the bouncer now being registered. This is the Traefik-v3 / Yaegi plugin-incompat class that already killed rewrite-body here. v1.4.2 predates Nov 2025; latest is v1.6.0. Bump to v1.6.0 (initContainer download URL + state.json + experimental.plugins version). Config-verified compatible: every key we use survives (crowdsecMode, crowdsecLapiKey/Host, updateMaxFailure, redisCache, clientTrustedIPs, all captcha incl. turnstile); v1.6.0 also moves logging to slog/trace for future diagnosis. Pinned, not auto-updated (Keel can't manage a Yaegi plugin, and plugin bumps must be tested against the running Traefik/Yaegi). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 17:49:26 +00:00
viktor	78095aa273	docs(forgejo): runbook reflects Authentik disabled + zero-click GitHub Authentik OAuth2 source is now disabled (login_source.is_active=0) and GitHub auto-registration (zero-click sign-up) is on. Document why (global auto-reg + Authentik's email-as-username 500; Forgejo/Authentik email mismatch blocks account-linking) and how to re-enable Authentik later. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 17:37:46 +00:00
viktor	7d99203fc6	forgejo: re-enable ENABLE_AUTO_REGISTRATION for zero-click GitHub sign-up Per Viktor: GitHub sign-up must work zero-click (account created on first login, no form). This global [oauth2_client] setting enables it. It conflicts with Authentik (preferred_username is an email → invalid Forgejo username → 500 on auto-create), and Viktor's Forgejo email (me@viktorbarzin.me) doesn't match his Authentik email (vbarzin@gmail.com) so account-linking can't bridge it — so the Authentik OAuth2 source is DISABLED (login_source.is_active=0; DB-managed, out-of-band) per his directive. Forgejo sign-in is now GitHub + native login. Committed via API to land on origin without pushing a concurrent agent's unpushed local commit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 17:34:17 +00:00
viktor	ef530b7d38	forgejo: drop ENABLE_AUTO_REGISTRATION — it broke Authentik sign-in ENABLE_AUTO_REGISTRATION is a global [oauth2_client] setting (all OAuth sources). On Authentik sign-in, Forgejo auto-created an account and derived the username from Authentik's preferred_username claim — which is the user's email (vbarzin@gmail.com), invalid as a Forgejo username (no '@') → CreateUser failed → 500 on the OAuth callback. (GitHub's username claim is valid, so only Authentik broke.) Reverting to the standard link/register flow fixes both; GitHub sign-up still works via a one-step register form. Committed via API to touch only main.tf (forgejo-only CI apply) so it doesn't collide with concurrent crowdsec work. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 17:24:29 +00:00
Viktor Barzin	a5bb4db9c5	crowdsec: register the Traefik bouncer with LAPI (fix fail-open) The Traefik bouncer plugin's API key was never registered with LAPI — the crowdsec stack reads many keys from Vault but not ingress_crowdsec_api_key, and the chart registers no bouncer. So LAPI returned 403 to the plugin, which with updateMaxFailure=-1 failed open and enforced NOTHING: no community-blocklist bans, and the (now-Turnstile-wired) captcha never fired. cscli bouncers list was empty; the registration was likely lost in the MySQL->PostgreSQL DB migration with no IaC to recreate it. Seed the bouncer at LAPI startup via BOUNCER_KEY_traefik, valued from the same Vault key the middleware presents — so they match by construction, and the bouncer re-registers automatically on every LAPI start (survives DB wipes). - stacks/crowdsec/main.tf: read ingress_crowdsec_api_key, pass to module. - module main.tf: new sensitive var + thread into the values templatefile. - values.yaml: BOUNCER_KEY_traefik on lapi.env. - docs/architecture/security.md: document registration + fail-open history and the proxied-app coverage caveat. Activates enforcement (community blocklist bans + captcha) on non-proxied apps; internal IPs stay bypassed (clientTrustedIPs), fail-open-on-LAPI-down preserved. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 17:08:28 +00:00
Viktor Barzin	56dadda453	traefik: pin helm chart to 40.2.0 (deployed version) The traefik helm_release had no chart version pin, so a refreshed helm repo index resolves `chart = "traefik"` to the latest (41.0.0), whose values schema rejects this stack's `logs` block ("Additional property logs is not allowed") — an unpinned apply attempts that upgrade and fails (atomic rollback). Pin to the deployed 40.2.0 (release rev 57, since 2026-05-30) so applies are deterministic; chart bumps must be deliberate with a values migration. Follow-up to `fd0c7493` (Turnstile captcha), which was applied with this pin already in live TF state — this lands the pin in git to remove the drift. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 16:58:33 +00:00
Viktor Barzin	4a66377425	forgejo: add "Sign in with GitHub" (OAuth2 source + auto-registration) All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor wanted people to be able to sign up with GitHub, not just the native form or Authentik SSO. - Added a GitHub OAuth2 login source via `forgejo admin auth add-oauth --provider github` (name "github", matching the callback registered on the GitHub OAuth App). Like the existing Authentik source, it lives in Forgejo's DB rather than Terraform — there's no clean TF resource for login sources. Client id/secret mirrored to Vault secret/viktor (forgejo_github_oauth_client_id / _secret) for recovery. - This commit's TF change: ENABLE_AUTO_REGISTRATION=true in [oauth2_client], so a first GitHub sign-in creates the account directly ("sign up with GitHub") instead of a link-to-existing detour. The GitHub identity is the trust gate for this path; Turnstile + email confirmation still gate the native form. Verified: GitHub recognises the client id, Forgejo's /user/oauth2/github redirects to GitHub's authorize URL with the correct client id + callback, and the login page renders the button. Final browser click-through is the user's to do. Runbook updated: docs/runbooks/forgejo-open-signups.md (GitHub section + secret-rotation + DB-loss recreate steps). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 16:41:49 +00:00
Viktor Barzin	fd0c7493c3	traefik/crowdsec: serve Cloudflare Turnstile for captcha remediation CrowdSec LAPI already issues `captcha`-type decisions for lower-severity abuse (http-429-abuse, http-403-abuse, http-crawl-non_statics, http-sensitive-files), but the Traefik bouncer plugin had no captcha provider configured — so those decisions silently fell through to a 403 ban (traced in the plugin's bouncer.go @ v1.4.2: captchaClient.Valid==false => handleBanServeHTTP). Flagged users had no way to self-unblock, contradicting the profile's stated intent. Wire Cloudflare Turnstile as the bouncer's captcha provider so a captcha decision now renders a solvable challenge instead of a hard block: - New cloudflare_turnstile_widget.crowdsec_captcha (managed mode), scoped to viktorbarzin.me so one widget covers every subdomain the bouncer fronts. Mirrors the existing Forgejo-signup Turnstile pattern; sitekey + secret are passed into the traefik module. - middleware.tf: captchaProvider=turnstile + site/secret keys + grace 1800s + captchaHTMLFilePath=/captcha/captcha.html. - Vendor the plugin's captcha.html and mount it into the Traefik container at /captcha via the chart `volumes` value — the pulled Yaegi plugin does not expose its bundled template to Traefik. - docs/architecture/security.md: document the ban-vs-captcha remediation split. - Remove the dead crowdsec-ingress-bouncer.yaml (unused nginx bouncer with placeholder reCAPTCHA keys; referenced by zero .tf). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 16:38:38 +00:00
Viktor Barzin	963e4fcdde	forgejo: open native self-signups, gated by Turnstile + email confirmation All checks were successful ci/woodpecker/push/default Pipeline was successful Details Viktor wants Forgejo open for anyone to sign up, but without bot/spam account floods. Flip the deployment from OAuth-only registration (ALLOW_ONLY_EXTERNAL_REGISTRATION=true) to allowing native local sign-up, and add two bot gates on the registration form: - Cloudflare Turnstile captcha (CAPTCHA_TYPE=cfturnstile). The widget is managed in Terraform (turnstile.tf) via the CF Global API key, so the sitekey/secret are IaC, not a dashboard artifact. - Mandatory email confirmation (REGISTER_EMAIL_CONFIRM=true). Wire the Forgejo mailer to the cluster mailserver as noreply@viktorbarzin.me (mail.viktorbarzin.me:587 STARTTLS), reusing the same Vault-sourced credential Authentik uses (email-secret.tf ESO -> secret/authentik smtp_password). Existing Authentik OAuth2 login is unchanged (additive). Deployment env appended (not inserted) so the diff stays purely additive; a reloader annotation rolls the pod on secret rotation. Verified live: signup page renders the Turnstile widget, mailer delivers a test message end-to-end, Forgejo healthy, plan-to-zero after apply. Runbook: docs/runbooks/forgejo-open-signups.md Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 16:05:07 +00:00
Viktor Barzin	21dbd79ae4	Merge remote-tracking branch 'origin/master' into wizard/homelab-obs	2026-06-19 11:27:44 +00:00
Viktor Barzin	e91e1612dd	homelab: v0.5.0 — net/dns/metrics/logs probes (endpoint resolution) The remaining verbs that pass the "saves reasoning, not just typing" test the user posed mid-session: each encodes the non-obvious which-endpoint-reached-how resolution otherwise re-derived every time. (Same test deprioritized node-ssh and secret-get aliasing — thin wrappers over commands already known.) - net check <host> [path]: two-legged reachability — external (public DNS→CF) vs internal (Traefik LB) — so you see WHERE a break is, not just that one path works. (live: surfaced the LB at 6ms vs CF 77ms.) - dns lookup <name> [type]: Technitium (10.0.20.201) vs public (1.1.1.1) diff. - metrics query "<promql>" / metrics alerts: Prometheus via the LB (prometheus-query.viktorbarzin.lan); alerts uses the synthetic ALERTS series since the query frontend has no /api/v1/alerts and Alertmanager has no ingress. - logs query "<logql>" [--since 1h] [--limit N]: Loki range query via the LB. All reach auth-free internal ingresses through the LB (Go form of curl --resolve host:443:10.0.20.203) — no port-forward, no kubectl. In-cluster- only endpoints (Alertmanager v2) deliberately out of scope. Verified live before building; all five smoke-tested green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 11:27:31 +00:00
Viktor Barzin	6cb823e431	k8s-version-upgrade: complete autonomy P0 — blocked alert + deeper postflight + runbook Some checks failed ci/woodpecker/push/default Pipeline was canceled Details Builds on the compat gate (prev commit) to finish "auto-upgrade when safe, halt + alert when not": - monitoring: K8sUpgradeBlocked alert (k8s_upgrade_blocked==1, for 10m, warning) in the Upgrade Gates group — the clean "a k8s auto-upgrade was refused, see Slack for why" signal. (Until monitoring is applied, a block still surfaces via the already-live K8sUpgradeChainJobFailed.) - upgrade-step.sh phase_postflight: deeper post-upgrade smoke tests — apiserver /readyz + /livez, in-cluster DNS (resolve kubernetes.default), and core kube-system pods (apiserver/controller-manager/scheduler/etcd/coredns) Running. Any failure halts + alerts (exit 1; no rollback — kubeadm can't downgrade). Catches a "pods look Running but cluster is broken" upgrade. - runbook: documents the compat gate, the blocked alert, how to clear a block, matrix maintenance, and the detector minor-probe fix. After deploy, the nightly chain detects 1.35 (minor detection now works) and correctly BLOCKS on Calico 3.26 / ESO 0.12 / kyverno 1.16 (all behind), alerting via K8sUpgradeBlocked — the autonomy working as designed until the catch-up clears those addons.	2026-06-19 11:27:17 +00:00
Viktor Barzin	cecd9fe247	k8s-version-upgrade: compat gate — auto-upgrade when safe, halt + alert when not Make k8s upgrades (patch AND minor) autonomous without being reckless: the chain attempts every upgrade but refuses unless it can prove the target is safe. A refusal is a BLOCK (not a crash) — it halts the chain and signals for attention. - compat-gate.py: read-only preflight check. Blocks if (a) a critical addon's running version doesn't support the target k8s minor, (b) an in-use deprecated API (apiserver_requested_deprecated_apis) is removed at/before the target, or (c) a node's containerd is below the target's floor. Validated against the live cluster: correctly blocks 1.35/1.36 today on Calico 3.26 / ESO 0.12 / kyverno 1.16 (all behind), which is exactly the auto-halt we want until they're bumped. - addon-compat.json: curated addon -> max-supported-k8s matrix (Calico, ESO, kyverno, gpu-operator + containerd floor), sourced from each project's compat docs (2026-06-19). The keystone data the gate reads; keep current. - upgrade-step.sh: phase_preflight runs the gate FIRST (before any mutation); block() pushes k8s_upgrade_blocked=1 + Slacks the reasons + halts. - main.tf: detector minor-probe fix (curl -sILo so the 302 from pkgs.k8s.io resolves to 200 — minors were never being detected). Gated behind the compat gate above, so enabling minor detection can't roll an unsafe minor. Not pushed yet: deploys with the K8sUpgradeBlocked alert + deeper postflight + runbook (next commit) so the detector fix only goes live with the full net.	2026-06-19 11:23:30 +00:00
Viktor Barzin	9189560ac3	homelab: v0.4.0 — ci/deploy verbs (watch what you trigger) Some checks are pending Build infra CLI / build (push) Waiting to run Details ci/woodpecker/push/default Pipeline was successful Details Adds the verb-group that kills the single biggest reasoning sink in agent sessions — watching a build/deploy to completion (proven the session that built it: hours hand-rolling Woodpecker polling + DB-schema spelunking for one CI incident). - ci status/watch: Woodpecker REST API (version-stable, not its DB schema), reached via the internal Traefik LB (dial 10.0.20.203, SNI=ci.viktorbarzin.me so the cert verifies — the Go form of the house `curl --resolve` pattern), token from WOODPECKER_TOKEN/Vault, repo id resolved from the cwd remote, with retries that ride Woodpecker's intermittent empty responses. watch matches the HEAD/given commit (avoids the post-push race) and exits non-zero on failure. - deploy wait: image-sha match THEN rollout status (rollout status alone returns success on the old ReplicaSet); kubectl-based. - work land now auto-watches CI to green on the landed commit (--no-ci-watch to skip), closing the v0.1 gap. - ci logs deferred to v0.4.1 (Woodpecker detail/log endpoints were the least reliable; status/watch use the working list endpoint). Live-verified ci status/watch against the live API. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 10:59:14 +00:00
Viktor Barzin	787ce4edfa	homelab: v0.3.1 — fix k8s db PG target (resolve CNPG primary pod, not the Service) `k8s db <app>` (Postgres path) execed `pg-cluster-rw`, which is the CNPG read-write SERVICE, not a pod — so kubectl exec failed with `pods "pg-cluster-rw" not found`. The unit test only checked the plan; the verb was never fired at live state (the gap flagged in v0.2), so it shipped broken. Fix: the PG plan now carries a label selector (cnpg.io/instanceRole=primary) instead of a pod name, and k8s db resolves the actual primary POD via `kubectl get pod -l <selector>` before exec. MySQL path (real pod mysql-standalone-0) unchanged. Live-verified both paths (psql + mysql). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 09:09:34 +00:00
Viktor Barzin	90c944a265	woodpecker: disable partial clone (partial: false) — fix intermittent git exit-128 All checks were successful ci/woodpecker/push/default Pipeline was successful Details Infra pipelines were failing intermittently across all authors (e.g. #241-244, #247) with the git clone step exiting 128: git fetch --depth=1 --filter=tree:0 ... (partial/treeless clone) git reset --hard <sha> fatal: could not fetch <tree-sha> from promisor remote remote: 404 page not found The plugin-git clone defaulted to a partial (treeless) clone. The initial ref fetch carries credentials, but the lazy promisor object fetch triggered by `git reset --hard` hits the PRIVATE Forgejo repo without creds -> 404 -> exit 128. Whether it fired was luck-of-the-draw, hence the ~50% intermittent failures fleet-wide (not specific to any commit). Fix: set `partial: false` on every clone block so all objects for the (still shallow) commit are fetched upfront with creds — no fragile lazy promisor fetch. Diagnosed against the woodpecker Postgres DB (steps/log_entries) since the Woodpecker HTTP API was itself flapping. Earlier "permission for ViktorBarzin" log lines were an unrelated cross-forge red herring. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 09:06:44 +00:00
Viktor Barzin	fd77c0dc4f	monitoring: RpiSofiaUndervoltage alerts on new brown-out, not until reboot Some checks failed ci/woodpecker/push/default Pipeline failed Details The rpi-sofia under-voltage alert keyed off the sticky firmware bit (rpi_under_voltage_occurred == 1), which latches on the first brown-out and stays 1 until the Pi reboots. With alert-on-change routing it re-paged on every boot cycle and sat firing for ~211h of the last 14d — Viktor reported "getting a few of these lately" — and it disagreed with the HA-sofia dashboard, which shows the live state and reads OK once voltage recovers. Can't just switch to the live bit: rpi_under_voltage_now never registered once in 14d (brown-outs are sub-second and fall between the 1-min textfile-collector samples), so the sticky bit is the only reliable detector. Fix: edge-trigger on a NEW latch via increase(rpi_under_voltage_occurred[1h]) > 0. Fires once per brown-out and auto-resolves ~1h later (~2h active over the same 14d instead of ~211h); counter-reset handling makes a clean reboot a no-op. Both real brown-out events in the window are still caught. Docs updated in the same commit (monitoring.md). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 08:45:39 +00:00
Viktor Barzin	fbf6f11038	feat(tripit): #96 cutover — /api self-authenticates (remove forward-auth, add strip-auth-headers) ADR-0028 #96 (website half): /api drops Authentik forward-auth so the browser can carry a TripIt session cookie (the outpost 302'd cookie-only requests). The app self-authenticates (TripIt-session-first in get_current_user); no session -> 401 -> SPA landing. strip-auth-headers is REQUIRED now: with forward-auth gone, the hybrid forward-auth arm would otherwise trust a client-injected X-authentik-email — stripping inbound X-authentik-* closes that. /metrics split into its own still-gated ingress. Shell keeps Authentik bearers on tripit-api.* until #94; full AUTH_MODE collapse follows then. Verified live: no-session->401, valid TripIt cookie->200, injected header->401, Shell->200. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 08:27:39 +00:00
Viktor Barzin	8559c4574a	fix(tripit): pin Authentik invalidation_flow literal (data source flakes null in CI under provider skew) Pipeline 244 failed: data.authentik_flow.default_provider_invalidation resolved null in CI (goauthentik 2024.x provider vs 2026.2 server), silently blocking every tripit-stack apply incl. the ADR-0028 #90 signing-key + redirect-URI delivery. Pin the literal UUID (what the slug resolves to) — matches the data-source-skew workaround used for the Vault binding. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 08:10:25 +00:00
Viktor Barzin	e5bb16e02a	feat(tripit): activate TripIt-native session auth — signing key + Authentik web redirect (ADR-0028 #90) Adds SESSION_SIGNING_KEY (Vault secret/tripit -> tripit-secrets ExternalSecret -> env_from) so TripIt's own session JWTs are signed with a real key (the app fails closed under the dev default until this lands), and adds the website OIDC redirect URI https://tripit.viktorbarzin.me/api/auth/callback/authentik to the public tripit-app provider so 'Log in with Authentik' works. Reuses the Shell's existing public OAuth2 app. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 06:06:43 +00:00
Viktor Barzin	077ac97df5	k8s-version-upgrade: auto-restore apiserver OIDC after control-plane bumps kubeadm upgrade apply regenerates the apiserver static-pod manifest and drops the --authentication-config flag, silently breaking SSO (kubectl/kubelogin + the k8s dashboard) until someone manually re-applied the rbac stack. That manual step ran after every control-plane upgrade — the one thing keeping autonomous patch upgrades from being truly hands-off (it bit us this cycle: an earlier master bump left SSO broken until we noticed). Automate it: the rbac stack now publishes its existing OIDC restore script (the same one its null_resource runs) to a kube-system/apiserver-oidc-restore ConfigMap, and the upgrade chain's phase_master re-runs it on master right after the kubeadm upgrade — while tigera-operator is still quiesced so the flag-add apiserver restart can't crashloop it. The script is idempotent and health-gates /livez with auto-rollback; the step is non-fatal (a failure only lags SSO until the next rbac apply, it won't abort the upgrade). phase_master already self-skips when master is at target, so this only fires when master was actually upgraded. The chain SA gets a name-scoped get on that one ConfigMap. Runbook updated: the manual restore is now a documented fallback (command corrected — it needs -replace, since the null_resource trigger hash never changes). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 06:04:30 +00:00
Viktor Barzin	48b63ffa6f	homelab: add memory verb-group (v0.3.0) — direct claude-memory HTTP client Lets agents search/navigate memory via the CLI, as the first step toward deprecating the memory MCP. claude-memory is a FastAPI service (the MCP is just one frontend); homelab memory is a thin Bearer-auth HTTP client over the same API, using the env the hooks already set (CLAUDE_MEMORY_API_URL/KEY). It works even when the MCP frontend is down — the recurring disconnect that took the MCP offline for this whole session. Verbs: recall (server-side semantic search), list, categories, tags, stats, secret (read); store, update, delete (write). Validated against the live API including a store→recall→delete round-trip — full data-plane parity with the MCP. The deprecation itself (rewiring the per-prompt auto-recall + auto-learn hooks to the CLI, then uninstalling the MCP) is a deliberate follow-up, sequenced after the CLI is proven in the hooks — see docs/adr/0008. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 05:56:25 +00:00
Viktor Barzin	3594485f77	homelab: v0.2.0 — docs + version for the k8s verb-group Bump cli/VERSION to v0.2.0; document the k8s verbs (README table + resolver note), add docs/adr/0007 (resolver, read/write split, config-mutation stays raw, db dbaas pattern), and extend the AGENTS.md discovery pointer with the Kubernetes surface. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 22:30:41 +00:00
Viktor Barzin	1f7438bb18	homelab: add k8s verb-group (v0.2) — the biggest remaining surface Mining the post-v0.1 corpus showed kubectl is the dominant remaining domain by far: 11,291 commands across 243 sessions (more than everything else combined). This adds the full k8s verb-group built on an app→namespace→pod resolver (most namespaces hold one app, so <app> defaults to the namespace and the target defaults to deploy/<app>, letting kubectl resolve the pod; -n/--pod/-c/-l/--tty override). Read: status (pods + non-Normal events), get, logs, describe, debug (one-shot triage), pf, rollout-status. Write/operational: db (the dbaas psql/mysql exec pattern — PG via pg-cluster-rw -c postgres, MySQL via mysql-standalone-0 with the env-password bash wrapper, never inline), exec, rm-pod (pods/jobs ONLY), restart. Config-mutation verbs (apply/edit/patch/scale/create) are deliberately NOT exposed — they stay raw per the Terraform-only policy. Smoke-verified read verbs against the live cluster (get/logs/rollout-status); write verbs are unit-tested (resolver, db-plan, shell-quoting) but not fired at live state. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 22:29:51 +00:00
Viktor Barzin	66caa0bf7f	homelab: v0.1 docs, distribution wiring, and version Completes v0.1: documentation, build/install path, and version stamping. - cli/VERSION (v0.1.0) stamped into the binary via ldflags. - cli/README.md rewritten as the homelab overview (verbs + tiers, manifest, build, the preserved legacy webhook use-cases). - docs/adr/0004-0006: why homelab exists (grown in place from infra/cli, not a separate repo), v0.1 scope + everything-allowed/tiers-recorded, and the work/tf behaviour (native worktree entry, verification-gated auto-land, presence-coupled apply). - setup-devvm.sh builds cli/ -> /usr/local/bin/homelab each provisioning run (t3-dispatch pattern), so every devvm user gets the current binary. - AGENTS.md: discovery pointer under Common Operations. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 19:25:51 +00:00
Viktor Barzin	087b415f73	homelab: add work verbs (start/land/clean) with a land verification gate Completes the infra-loop verb surface. work start creates .worktrees/<topic> on <user>/<topic> off <remote>/master (git-crypt-aware, ensures .worktrees is ignored) and prints the path for native EnterWorktree entry. work land fetches, merges master in, verifies, pushes HEAD:master with non-fast-forward retry, and falls back to pushing the feature branch for a PR when the direct push is rejected (branch protection). work clean removes the worktree + branch. Safety: work land REFUSES to push when it cannot verify (no --verify-cmd and no auto-detected suite) unless --no-verify is passed. This was added after an accidental smoke-test invocation pushed unverified WIP to master (benign — the infra CI applied 0 stacks since the diff was cli/-only — but the gate makes an unverified land a deliberate choice, not the default). Known v0.1 limitation: land does not yet block on CI to green; that arrives with the ci/deploy watch verbs. It prints a reminder to follow the pipeline manually. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 19:24:08 +00:00
Viktor Barzin	36d562c15c	homelab: add tf verbs + stack/git-crypt substrate Adds the tf verb-group and the resolver substrate beneath it, continuing the v0.1 infra-loop build. - substrate: findInfraRoot (walk up to terragrunt.hcl + stacks/), stack→dir resolver, and repo/remote/git-crypt detection (preferRemote forgejo>origin, hasGitCryptAttr, gitCryptFlags) — the last is for `work` next. - tf plan/validate/fmt/force-unlock/apply, resolving the stack from cwd and delegating to scripts/tg (which owns state decrypt/encrypt, the Vault lock, and the ingress auth-comment check) rather than calling terragrunt directly. - tf apply is presence-coupled: claims stack:<name>, ALWAYS releases on exit (normal, error, or SIGINT/SIGTERM via sync.Once + signal handler) — fixing the documented ~200-claim leak — and prints an out-of-band reminder since CI applies canonically on push. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 19:16:33 +00:00
Viktor Barzin	ed6f22fd53	homelab: scaffold unified CLI (registry, manifest, claim/release) in infra/cli Begin evolving the existing infra/cli into the agent-facing "homelab" CLI decided in the design/grilling session: one composable, JSON-capable surface for the operations agents run over and over (mined from 51k commands across 2,225 past sessions; the infra inner-loop is ~29% of them). v0.1 targets that loop — work/tf/claim — and ships here, in place, in infra/cli. This first slice: - command registry + dispatcher (longest-prefix verb matching) and a `manifest`/`manifest --json` progressive-discovery entrypoint; every verb declares a read|write tier so write-gating can be added later (everything is allowed for now). - claim/release verbs wrapping the existing presence script (not reimplemented), with label-taxonomy validation. - main() front-dispatches the homelab verb surface but falls through to the legacy webhook -use-case path verbatim, so the in-cluster infra-cli image is unaffected. - fix a pre-existing vet error (glog.Infof missing format directive) that blocked `go test`. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 19:12:57 +00:00
Viktor Barzin	70e217db24	k8s-version-upgrade: preflight skips kubeadm-plan gate when master already at target The autonomous 1.34.9 version-upgrade chain has been failing its preflight every night. A prior run left k8s-master + k8s-node1 on 1.34.9 while node2-6 stayed on 1.34.8, and preflight's gate-4 runs `kubeadm upgrade plan` on master. On an already-at-target master, kubeadm prints no "kubeadm upgrade apply vX.Y.Z" line, so the parsed target came back empty and the `!= requested` check aborted the whole chain before any worker was touched. Deterministic — it self-cleaned and re-failed identically each night, so it would have failed again tonight, leaving node2-6 stuck on the old patch. Fix: skip the kubeadm-plan-target gate when master is already on TARGET_VERSION — the same at-target self-skip that phase_master and phase_worker already do. The remaining workers are still validated by their own per-node phases, and the detector already confirmed the target is installable via apt-cache. This lets tonight's unattended chain resume and finish node2-6 -> 1.34.9. Runbook updated: node count 5 -> 7, the gate skip note, and a Past Incidents writeup (incl. the collateral apiserver OIDC wipe, restored via the rbac stack). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 09:17:46 +00:00
Viktor Barzin	8787d361dc	claude-memory: HA (replicas 2 + PDB) to stop recurring MCP disconnects The claude-memory MCP backend ran as a single replica with no PDB, so every voluntary disruption took it to zero for ~30-90s — which surfaced as the memory MCP "keeps getting disconnected" problem. Disruption sources hitting the lone pod: the descheduler (every-5-min CronJob, LowNodeUtilization — caught evicting it live), Keel image bumps, Reloader restarts on the 7-day DB-password rotation, node drains, and CI deploys. The local stdio MCP subprocess itself was proven healthy (fast non-blocking startup, stderr suppressed, graceful degradation), so the fault was purely backend availability, not the MCP plumbing. Fix: run 2 replicas (the backend is stateless FastAPI over shared CNPG Postgres and already has hostname anti-affinity) + restore the PDB at minAvailable=1 (safe now — the drain deadlock that justified removing it only existed at 1 replica) + descheduler evict=false to stop the needless 5-min churn. All five disruption sources become zero-downtime rolling events. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 09:13:36 +00:00
Viktor Barzin	48b7be3b14	feat(tripit): live lodging-price scrape — LODGING_PROVIDER=playwright Viktor asked to turn lodging prices on and stop using the fake provider. Mirrors the existing FARE_PROVIDER wiring: point the Booking.com/Airbnb lodging scraper at the shared chrome-service browser over CDP (the namespace is already admitted through chrome-service's NetworkPolicy for the fare scrape). The lodging code (ADR-0025, tripit #78) is live in tripit 03973b5, so the env lands after that rollout. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 06:53:19 +00:00
Viktor Barzin	d709d338c6	service-catalog: add paperless-ai (RAG semantic search + auto-tagging) Document the new paperless-ai service and the two non-obvious operational facts: runtime config lives in the PVC .env (not TF env, which would shadow it), and Qwen3 needs /no_think for parseable tagging output. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 06:44:00 +00:00
Viktor Barzin	4977153dfb	paperless-ai: make the PVC .env the single source of config truth Auto-tagging silently no-op'd: the container env vars set in the deployment shadowed the app's own /app/data/.env, because paperless-ai's dotenv loader does not override process.env. A stale PROCESS_PREDEFINED_DOCUMENTS=yes (with no TAGS) made the scan select zero documents. Strip the wizard-owned behavioural config (Paperless URL, AI provider, model, scan interval, tagging flags) from the container env, keeping only infrastructural env (PUID/PGID/port/RAG/HF cache) and the Vault-sourced secret refs. The app's setup-written .env on the PVC is now authoritative, so processing runs and tags all documents. Qwen3 thinking is disabled via SYSTEM_PROMPT=/no_think in that .env to keep the model's JSON output parseable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 06:41:29 +00:00
Viktor Barzin	aeee0d02e2	paperless-ai: deploy clusterzx/paperless-ai for semantic doc search + AI tagging Viktor wanted real semantic search over his ~300 Paperless documents and preferred a ready-made solution over building one. paperless-ai provides local-embedding RAG (ChromaDB + sentence-transformers, GPU-free) plus LLM-driven auto-analysis/tagging. Wiring: - LLM (chat answers + tagging) -> in-cluster llama-swap qwen3-8b (OpenAI-compatible); embeddings + vector store are local on the PVC. - Reads Paperless over the internal service via a dedicated `paperless-ai` superuser token (Vault secret/paperless-ai); app-admin creds also in Vault. - Encrypted PVC for /app/data (SQLite + ChromaDB + model cache). - Ingress paperless-ai.viktorbarzin.me behind Authentik (auth=required). - Third-party image pinned (docker.io/clusterzx/paperless-ai:3.0.9), no Keel. Runtime config persists to the PVC .env via the app's one-time setup; the deployment env vars are pre-fill/documentation only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-18 06:23:00 +00:00
Viktor Barzin	605cf99a1b	portal-tts: docker.io/ prefix on edge-tts image (Kyverno trusted-registries) The edge-tts apply was blocked by the require-trusted-registries Kyverno policy — a bare `travisvn/openai-edge-tts` isn't in the allowlist. The policy blanket-trusts `docker.io/*`, so prefixing the image with `docker.io/` passes admission with no policy change. Verified live: bg synth round-trips through Whisper verbatim and a full gateway /v1/talk bg turn returns a coherent spoken Bulgarian reply ("Добър ден! Добре съм, благодаря!..."). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 21:24:34 +00:00
Viktor Barzin	ab55cb5dcd	portal-stt: drop setup_tls_secret module (ClusterIP-only, no fullchain.pem) The landed portal-stt source still declared the setup_tls_secret module + tls_secret_name variable, which file()-reads secrets/fullchain.pem — a file this stack does not ship. portal-stt is ClusterIP-only (no ingress; the Gateway is the sole externally-exposed component, ADR-0001), so it needs no TLS secret. The live deployment never had it (removed during the original apply); this aligns the source with reality so CI applies cleanly. Fixes the pipeline-229 portal-stt apply failure. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 20:29:31 +00:00

4508 commits