infra

Author	SHA1	Message	Date
Viktor Barzin	600f1f933c	Create Claude auth state directories The first live renewal run showed systemd could not create state beneath a read-only home sandbox. Provision each user's writable state directory before enabling the timer so automatic renewal can run.	2026-06-20 20:25:55 +00:00
Viktor Barzin	7f1788a106	Merge remote-tracking branch 'origin/master' into wizard/claude-auth-renew	2026-06-20 20:22:20 +00:00
Viktor Barzin	ff67e9d422	Fix workstation package manifest parsing The approved Claude token renewal deployment could not run because setup-devvm passed inline package comments to apt as package names. Strip inline comments so the persisted all-user setup remains reproducible.	2026-06-20 20:22:05 +00:00
Viktor Barzin	524b874036	state(vault): update encrypted state	2026-06-20 20:14:53 +00:00
Viktor Barzin	7050b0441e	Merge remote-tracking branch 'origin/master' into wizard/claude-auth-renew	2026-06-20 20:11:09 +00:00
Viktor Barzin	bc2fbc712c	Merge remote-tracking branch 'origin/master' into wizard/claude-auth-renew	2026-06-20 20:10:48 +00:00
Viktor Barzin	02d14796cc	feat(mailserver): add trips@ send-as alias for TripIt native auth email (ADR-0028) TripIt's native signup-verification + account-recovery mail (ADR-0028) sends From: trips@viktorbarzin.me while authenticating SMTP as spam@. With SPOOF_PROTECTION on, Postfix smtpd_sender_login_maps requires an EXPLICIT alias (the @domain catch-all doesn't satisfy it) — mirrors the existing plans@->spam@ grant. Must be applied + verified before TripIt flips SMTP_FROM to trips@, else every verification/recovery send is rejected 550. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 20:10:47 +00:00
Viktor Barzin	5549fc3672	Add per-user Claude auth renewal Each workstation user needs a continuously valid Claude token under their own Enterprise identity. Store only that user's OAuth state in an isolated Vault path, renew and verify it automatically, recover from Vault when possible, and alert when interactive SSO is required.	2026-06-20 20:10:40 +00:00
Viktor Barzin	3278588325	chore(authentik): tear down obsolete tripit-enrollment (ADR-0020 superseded by ADR-0028) TripIt external users are now LOCAL TripIt accounts (ADR-0028 native passkey + Authentik OIDC), so the Authentik-side self-enrollment machinery is dead. Removes the tripit-enrollment + tripit-recovery flows and all their stages/prompts/policies/bindings, the tripit-email-stages blueprint (+yaml), and the 'TripIt External' group; reverts the admin-services-restriction fence branch that contained those users (its sole member, the leftover tripit-demo@ test account, was deleted first, so the revert affects zero live principals). Real external collaborators (type=external) are untouched. tg plan: 0 add, 1 change (the policy expression), 20 destroy (all tripit_*). Closes tripit#97; moots the B2 per-app OIDC fences. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 20:04:24 +00:00
viktor	834c5e6a2a	Merge pull request 'CrowdSec proxied: single CF list (block-only) + firewall-bouncer re-apply' (#5) from wizard/crowdsec-1list into master	2026-06-20 19:31:01 +00:00
Viktor Barzin	7cf93a0587	crowdsec+rybbit: proxied edge to single CF list (block-only) + retrigger firewall-bouncer apply CF account hard-limits to 1 Rules List, so proxied enforcement uses one crowdsec_ban list + one WAF block rule; the sync writes both ban and captcha decisions into it (captcha downgraded to block at the edge). Drops the second list + managed_challenge rule. Trivial touch to firewall_bouncer.tf to make CI re-apply crowdsec and recreate the DaemonSet (tar fix already in master; stale orphan was cleared). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 19:29:43 +00:00
viktor	1406d8a391	Merge pull request 'Fix CF ruleset import id + depends_on' (#4) from wizard/crowdsec-fix2 into master	2026-06-20 19:13:03 +00:00
Viktor Barzin	f2b089e267	rybbit: fix cloudflare_ruleset import id (zone/ 3-part form) + depends_on lists v4.52.7 import id must be zone/<zone_id>/<ruleset_id>; add depends_on so the crowdsec_ban/captcha lists exist before the WAF rules reference them. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 19:12:29 +00:00
viktor	58fc6d5061	Merge pull request 'Fix CrowdSec firewall-bouncer tar + CF WAF ruleset import' (#3) from wizard/crowdsec-fixes into master	2026-06-20 19:06:15 +00:00
Viktor Barzin	a351a66843	crowdsec+rybbit: fix firewall-bouncer tar extraction (busybox) + import existing CF WAF ruleset - initContainer used GNU tar --wildcards which fails on the busybox curl image (pod Init:Error); switch to extract-all + cp via shell glob. - cloudflare_ruleset hit the per-zone singleton conflict; import the existing 'default' http_request_firewall_custom ruleset and manage all rules — CrowdSec ban/captcha first, the pre-existing disabled skip rule preserved verbatim. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 19:04:30 +00:00
viktor	70e8ce1021	Merge pull request 'CrowdSec real enforcement: edge WAF (proxied) + firewall-bouncer (direct)' (#2) from wizard/crowdsec-enforcement into master	2026-06-20 09:42:41 +00:00
Viktor Barzin	ca8d617e72	rybbit: use 'Account Rule Lists' permission group for the CF sync token (v4) tg plan verified the agent's guess 'Account Filter Lists Edit/Read' is not a key in the v4.52.7 permission-group map; the live CF API lists the correct account-scoped groups as 'Account Rule Lists Read'/'Write'. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 09:41:41 +00:00
Viktor Barzin	0c56290af0	chore(forgejo): re-trigger apply of git.timeout/gc.auto (changed-stack skip) `910d5892` landed the [git.timeout] + [git.config] env in master, but the CI apply skipped stacks/forgejo (the changed-stack-diff race after a sync-merge), so the Forgejo deployment never picked it up. A trivial comment touch to force a clean apply of the stack so the durable push-mirror fix actually takes effect. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 09:19:53 +00:00
Viktor Barzin	cc4bfb593b	rybbit: proxied CrowdSec enforcement via Cloudflare IP Lists + WAF rule Replaces the Worker+KV approach (which only covered the ~27 routed hosts) with a zone-wide mechanism that covers ALL proxied hosts: two CF account IP Lists (crowdsec_ban, crowdsec_captcha) + one zone WAF custom rule that blocks `(ip.src in $crowdsec_ban)` and managed-challenges `(ip.src in $crowdsec_captcha)`. No per-request Worker, no cookie machinery — the rybbit Worker stays analytics-only. lapi_kv_sync.py now full-reconciles the two lists from LAPI (fail-safe: a LAPI blip skips the run and freezes the last-known-good block set; serializes CF bulk ops since CF allows one pending op per account). A least-privilege CF API token (Account Filter Lists Edit) is minted in TF. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 09:18:33 +00:00
Viktor Barzin	7e646e1c7c	crowdsec: add cs-firewall-bouncer DaemonSet (direct-host nftables enforcement) Drops banned source IPs in-kernel via nftables (hooks input+forward, so DNAT'd LoadBalancer traffic is caught before reaching Traefik) for DIRECT hosts — the direct-side replacement for the dead Traefik plugin, zero per-request hop. No published image exists, so an initContainer fetches the pinned official static binary (v0.0.34) onto a stock debian-slim base (nftables backend uses netlink directly, no nft CLI needed). hostNetwork + NET_ADMIN/NET_RAW (not privileged). Config (with api_key) in a Secret, Reloader-annotated. crowdsec ns is already in the Kyverno wave-1 exclude list, so the privileged/hostNetwork pod is admitted. Pinned to k8s-node2 (runs a Traefik pod) for one-node validation before the nodeSelector is removed to roll cluster-wide. Fail-open by element timeout if the bouncer stops. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 09:11:08 +00:00
Viktor Barzin	53117b193a	portal-realtime: deploy the v2 full-duplex voice agent (Pipecat) New stack for the realtime voice agent — v2 of the portal-assistant brain path. One persistent WebSocket per conversation: continuous mic audio -> Silero VAD turn-taking -> Whisper STT (portal-stt) -> streaming Claude brain (claude-agent-service) -> edge-tts (portal-tts) -> audio out, with barge-in. Reuses all three upstream cluster services; nothing new is spun up. Public Cloudflare ingress (proxied, WebSocket) at portal-realtime.viktorbarzin.me with the app's own DEVICE_TOKEN as the edge gate (auth="app" — Authentik would break the native Portal client). No buffering middleware: it would break the streaming WebSocket. Image ghcr.io/viktorbarzin/portal-assistant-realtime (private ghcr, pulled with ghcr_pull_token). Sibling to the v1 portal-assistant gateway, which stays live. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 08:23:17 +00:00
Viktor Barzin	44cac6f4e2	gitignore: ignore Python test artifacts (__pycache__, *.pyc, .pytest_cache) Introduced the first pytest file in the tree (stacks/k8s-version-upgrade/scripts/test_compat_gate.py); running it leaves an untracked __pycache__/ dir. Ignore the standard Python build artifacts so test runs don't show up as working-tree noise or get committed by accident. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 08:17:03 +00:00
Viktor Barzin	b58fe8cb1a	docs(k8s-upgrade): record detector Packages-probe -L fix + compat-gate patch scope Two corrections to the runbook matching today's code fixes: - The next-minor patch probe (GET .../Packages) also needs `-L`; it lacked it until 2026-06-20 and silently no-op'd the 2026-06-19 nightly run. Both probes now follow the 302. - The compat gate's addon check is scoped to minor jumps — patches within the running minor are never addon-blocked (target_minor <= running_minor returns early), so a conservative ceiling like ESO 0.12 -> 1.31 no longer false-blocks a 1.34.x patch. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 08:16:20 +00:00
Viktor Barzin	e5250f417e	k8s-version-upgrade: compat gate must not false-block patch upgrades The compat gate compared every addon's matrix ceiling against the target k8s minor unconditionally. That is correct for a minor JUMP, but it also blocked patch upgrades within the minor the cluster is ALREADY running: ESO v0.12's matrix ceiling is 1.31, the cluster runs 1.34.9, so a target of 1.34.10 (a patch) was refused with "external-secrets supports k8s <= 1.31; target 1.34 exceeds it" — even though the running cluster is itself proof ESO 0.12 works on 1.34. That silently defeats autonomous patching (it would have bitten the moment a 1.34.10 was published). Fix: a target at or below the running minor crosses into no new k8s minor, so every installed addon is already empirically proven on it — check_addons now returns no reasons when target_minor <= running_minor. Added running_minor() (oldest kubelet across nodes, mirroring the detector; RUNNING_K8S env override for tests) and pass it in. Minor jumps are unchanged: 1.34->1.35 still blocks on ESO 0.12 + kyverno 1.16. removed-API + containerd checks are naturally inert for patches (no API removal / containerd floor inside a minor) and keep running as defence. Added test_compat_gate.py (8 cases) covering both paths. Verified end-to-end against live Prometheus: target 1.34.10 -> EXIT 0 (safe), target 1.35.6 -> EXIT 2 (blocked on ESO+kyverno). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 08:14:50 +00:00
Viktor Barzin	38675b7922	crowdsec: register kvsync + firewall bouncer keys in LAPI Seeds two new bouncers at LAPI startup (BOUNCER_KEY_kvsync, BOUNCER_KEY_firewall) from Vault secret/platform, mirroring the existing BOUNCER_KEY_traefik wiring. These are the two halves of the real enforcement that replaces the dead Yaegi plugin: kvsync authenticates the LAPI->Cloudflare-KV sync (proxied edge Worker), firewall authenticates the cs-firewall-bouncer DaemonSet (direct-host nftables). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 08:12:38 +00:00
Viktor Barzin	a9384a4067	Merge remote-tracking branch 'origin/master'	2026-06-20 08:09:16 +00:00
Viktor Barzin	44a98d408e	k8s-version-upgrade: detector next-minor probe must follow 302 (curl -sfL) The next-minor Packages query used `curl -sf` without -L. pkgs.k8s.io 302-redirects every request to a backing host, so without -L curl returned an empty body, NEXT_MINOR_PATCH came back empty, and the detector fell through to "No upgrade needed". That is exactly why last night's 23:00 chain no-op'd instead of resolving the 1.35 next-minor target (1.35.6) and handing it to the compat gate. `curl -sfL` follows the redirect and returns the Packages file (verified: -sf -> empty, -sfL -> 1.35.6). Mirrors the same -L fix already applied to the Release availability probe (-sILo) above. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 08:09:08 +00:00
Viktor Barzin	910d589205	fix(forgejo): raise git-op timeouts + lower gc.auto to stop push-mirror timeouts The tripit Forgejo->GitHub push-mirror silently stalled: `git cat-file --batch-all-objects` over the NFS-backed repo exceeded the default git deadline once ~4500 loose objects accumulated (gc.auto's 6700 threshold hadn't fired), so pushes stopped reaching GitHub and prod deploys stalled. Raise [git.timeout] (DEFAULT/MIRROR/GC) so a slow object enumeration can't abort the mirror, and set [git.config] gc.auto=1000 so post-push autogc + the git_gc_repos cron keep repos packed (the real fix). A one-off forced gc already unblocked tripit; this prevents recurrence across all repos. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 08:08:50 +00:00
Viktor Barzin	45bed1c133	Merge remote-tracking branch 'origin/master'	2026-06-20 08:07:23 +00:00
Viktor Barzin	e1736d2e5c	calico: hop 3.28.5->3.30.7 (operator v1.38.13) — restores a SUPPORTED Calico/k8s-1.34 pairing. Disabled new-in-3.30 Goldmane/Whisker (their CRs render before crds/ install on helm upgrade; we use Prometheus/Loki). calico-node 7/7 on quay/v3.30.7, tigerastatus green. Applied manually + verified overnight.	2026-06-20 08:07:08 +00:00
Viktor Barzin	4d9fdbc7f7	rybbit: add CrowdSec LAPI -> Cloudflare KV sync script (proxied edge control plane) Pure-stdlib script (alert_digest pattern, runs on stock python:3.12-alpine) that projects CrowdSec Ip-scope ban/captcha decisions into the Workers KV namespace the edge Worker reads on each proxied request. Full-reconcile per run so an un-ban clears from the edge within one interval; fail-safe (a LAPI read error skips the run and leaves existing bans to expire by TTL = fail-open, never a stale all-block). TF wiring (KV namespace + CronJob + key registration) follows. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 08:05:11 +00:00
Viktor Barzin	0ac176da01	crowdsec: whitelist internal/LAN/tailnet CIDRs at the decision layer Preparing for real CrowdSec enforcement (edge Cloudflare Worker for proxied hosts + cs-firewall-bouncer for direct hosts). Both enforce by dropping the real source IP, so if an internal/RFC1918 address ever ended up in a ban decision it could blackhole legitimate internal traffic. Whitelisting the cluster/LAN/tailnet ranges (10/8, 172.16/12, 192.168/16, 100.64/10) at the CrowdSec parser layer makes that structurally impossible — a trusted source can never produce a decision in the first place. Public IP already whitelisted. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-20 08:03:46 +00:00
Viktor Barzin	3e3fdb34f0	homelab: v0.6.0 — usage telemetry (usage top), evidence-driven verb prioritization Answers the question that drove the whole CLI — which verbs to add next — with data instead of one maintainer's habits, and resolves the cross-user-usage ask in-bounds (no reading anyone's home). - emit on dispatch: every verb fire-and-forgets one Loki line {job,user,verb} + "exit=N ver=X". ONLY the verb path + exit code — never args, paths, flags, or secrets (the emit never sees arguments). Best-effort: 800ms timeout, errors swallowed, never affects the command; opt-out HOMELAB_TELEMETRY=0. Discovery verbs (manifest/version/help) and usage itself don't self-record. - usage top [--since 30d] [--user U] [--json]: ranks verbs via sum by (verb)(count_over_time({job="homelab-usage"}[…])) against the shared Loki. Cross-user analytics WITHOUT touching ~/.claude — the privacy-preserving answer to "what does the team use". - Loki sink (zero new infra, dogfoods v0.5 logs path); push verified HTTP 204 no auth. ADR docs/adr/0011. Live-verified: ran 4 verbs, usage top ranked them correctly (metrics query=2). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 22:29:01 +00:00
Viktor Barzin	666fefd22b	calico: hop 3.26->3.28.5 (operator v1.34.13); calico-node 7/7 healthy, tigerastatus green, kube-controller-manager restarted (3.28 UID change). Applied manually + verified.	2026-06-19 22:09:23 +00:00
Viktor Barzin	8ed5368be9	calico: bring tigera-operator under Terraform via Helm (adopt at 3.26.1) Base for the stepped 3.26->3.28->3.30->3.32 upgrade (k8s 1.36 prereq; 3.26 is already unsupported on k8s 1.34). Manage ONLY the operator via the official tigera-operator Helm chart (chart ver == Calico ver); installation.enabled=false keeps the live Installation CR operator-managed so Helm never touches calico-node. Adopted in place: existing operator Deployment/SA/ClusterRole/ClusterRoleBinding pre-stamped with Helm ownership metadata (transient migration step), then the release imported via a plan-verified create (1 to add, 0 destroy). Verified clean: calico-node 7/7 unchanged, tigerastatus green. Closes the year-deferred adoption (code-3ad) for the operator without adopting the Installation CR.	2026-06-19 21:50:34 +00:00
Viktor Barzin	dd029ca7fb	traefik/crowdsec: switch bouncer to live mode (stream cache doesn't enforce under Yaegi) After bumping to v1.6.0 (stream goroutine runs) and disabling redis (in-memory cache), the plugin logs `handleStreamCache:updated` but still does NOT enforce: a ban present in the LAPI stream AND pulled by the plugin still let the banned IP through. Stream-mode decision matching is unreliable under Traefik's Yaegi interpreter here. Switch crowdsecMode stream->live: the plugin queries LAPI synchronously per request (result cached per-IP for defaultDecisionSeconds), which enforces reliably and picks up new decisions immediately. LAPI is 3-replica + in-cluster so per-request latency is small; fail-open preserved (updateMaxFailure=-1). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 17:49:26 +00:00
Viktor Barzin	0cc48d83ac	traefik/crowdsec: disable bouncer redis cache (broken under Yaegi → in-memory) With the plugin on v1.6.0 the stream goroutine finally runs, and its slog output revealed the real blocker: `handleStreamTicker ... isCrowdsecStreamHealthy:true cache:unreachable`. The LAPI stream is healthy, but the plugin's redis client cannot reach the cache under Traefik's Yaegi interpreter — even though redis-master.redis.svc is reachable AND writable from the traefik namespace (SET/GET verified via busybox; no NetworkPolicies; no auth). Same interpreter-incompat class as the stream goroutine itself. With redisCacheUnreachableBlock=false the bouncer then failed open and enforced nothing. Disable the redis cache so the plugin uses its in-memory decision store (works under Yaegi). Removes redisCacheHost/redisCacheUnreachableBlock. Trade-off: captcha already-solved grace is per-pod across the 3 Traefik replicas (at worst an occasional re-solve) — acceptable; bans/captcha decisions enforce correctly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 17:49:26 +00:00
Viktor Barzin	531efb218d	traefik: bump crowdsec-bouncer plugin v1.4.2 -> v1.6.0 (fix stream not pulling) The crowdsec-bouncer Yaegi plugin pinned at v1.4.2 loads on Traefik 3.7.5 but its decision-stream goroutine never runs — no Traefik pod ever calls the LAPI stream (verified: no traefik-pod bouncer entry / no @pod-ip auto-registration), and it logs nothing. All deps are healthy (LAPI 200 + full ban list reachable from the traefik ns, key valid, redis PONG, config correct, no NetworkPolicies), so CrowdSec enforced nothing despite the bouncer now being registered. This is the Traefik-v3 / Yaegi plugin-incompat class that already killed rewrite-body here. v1.4.2 predates Nov 2025; latest is v1.6.0. Bump to v1.6.0 (initContainer download URL + state.json + experimental.plugins version). Config-verified compatible: every key we use survives (crowdsecMode, crowdsecLapiKey/Host, updateMaxFailure, redisCache, clientTrustedIPs, all captcha incl. turnstile); v1.6.0 also moves logging to slog/trace for future diagnosis. Pinned, not auto-updated (Keel can't manage a Yaegi plugin, and plugin bumps must be tested against the running Traefik/Yaegi). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 17:49:26 +00:00
viktor	78095aa273	docs(forgejo): runbook reflects Authentik disabled + zero-click GitHub Authentik OAuth2 source is now disabled (login_source.is_active=0) and GitHub auto-registration (zero-click sign-up) is on. Document why (global auto-reg + Authentik's email-as-username 500; Forgejo/Authentik email mismatch blocks account-linking) and how to re-enable Authentik later. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 17:37:46 +00:00
viktor	7d99203fc6	forgejo: re-enable ENABLE_AUTO_REGISTRATION for zero-click GitHub sign-up Per Viktor: GitHub sign-up must work zero-click (account created on first login, no form). This global [oauth2_client] setting enables it. It conflicts with Authentik (preferred_username is an email → invalid Forgejo username → 500 on auto-create), and Viktor's Forgejo email (me@viktorbarzin.me) doesn't match his Authentik email (vbarzin@gmail.com) so account-linking can't bridge it — so the Authentik OAuth2 source is DISABLED (login_source.is_active=0; DB-managed, out-of-band) per his directive. Forgejo sign-in is now GitHub + native login. Committed via API to land on origin without pushing a concurrent agent's unpushed local commit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 17:34:17 +00:00
viktor	ef530b7d38	forgejo: drop ENABLE_AUTO_REGISTRATION — it broke Authentik sign-in ENABLE_AUTO_REGISTRATION is a global [oauth2_client] setting (all OAuth sources). On Authentik sign-in, Forgejo auto-created an account and derived the username from Authentik's preferred_username claim — which is the user's email (vbarzin@gmail.com), invalid as a Forgejo username (no '@') → CreateUser failed → 500 on the OAuth callback. (GitHub's username claim is valid, so only Authentik broke.) Reverting to the standard link/register flow fixes both; GitHub sign-up still works via a one-step register form. Committed via API to touch only main.tf (forgejo-only CI apply) so it doesn't collide with concurrent crowdsec work. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 17:24:29 +00:00
Viktor Barzin	a5bb4db9c5	crowdsec: register the Traefik bouncer with LAPI (fix fail-open) The Traefik bouncer plugin's API key was never registered with LAPI — the crowdsec stack reads many keys from Vault but not ingress_crowdsec_api_key, and the chart registers no bouncer. So LAPI returned 403 to the plugin, which with updateMaxFailure=-1 failed open and enforced NOTHING: no community-blocklist bans, and the (now-Turnstile-wired) captcha never fired. cscli bouncers list was empty; the registration was likely lost in the MySQL->PostgreSQL DB migration with no IaC to recreate it. Seed the bouncer at LAPI startup via BOUNCER_KEY_traefik, valued from the same Vault key the middleware presents — so they match by construction, and the bouncer re-registers automatically on every LAPI start (survives DB wipes). - stacks/crowdsec/main.tf: read ingress_crowdsec_api_key, pass to module. - module main.tf: new sensitive var + thread into the values templatefile. - values.yaml: BOUNCER_KEY_traefik on lapi.env. - docs/architecture/security.md: document registration + fail-open history and the proxied-app coverage caveat. Activates enforcement (community blocklist bans + captcha) on non-proxied apps; internal IPs stay bypassed (clientTrustedIPs), fail-open-on-LAPI-down preserved. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 17:08:28 +00:00
Viktor Barzin	56dadda453	traefik: pin helm chart to 40.2.0 (deployed version) The traefik helm_release had no chart version pin, so a refreshed helm repo index resolves `chart = "traefik"` to the latest (41.0.0), whose values schema rejects this stack's `logs` block ("Additional property logs is not allowed") — an unpinned apply attempts that upgrade and fails (atomic rollback). Pin to the deployed 40.2.0 (release rev 57, since 2026-05-30) so applies are deterministic; chart bumps must be deliberate with a values migration. Follow-up to `fd0c7493` (Turnstile captcha), which was applied with this pin already in live TF state — this lands the pin in git to remove the drift. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 16:58:33 +00:00
Viktor Barzin	4a66377425	forgejo: add "Sign in with GitHub" (OAuth2 source + auto-registration) Viktor wanted people to be able to sign up with GitHub, not just the native form or Authentik SSO. - Added a GitHub OAuth2 login source via `forgejo admin auth add-oauth --provider github` (name "github", matching the callback registered on the GitHub OAuth App). Like the existing Authentik source, it lives in Forgejo's DB rather than Terraform — there's no clean TF resource for login sources. Client id/secret mirrored to Vault secret/viktor (forgejo_github_oauth_client_id / _secret) for recovery. - This commit's TF change: ENABLE_AUTO_REGISTRATION=true in [oauth2_client], so a first GitHub sign-in creates the account directly ("sign up with GitHub") instead of a link-to-existing detour. The GitHub identity is the trust gate for this path; Turnstile + email confirmation still gate the native form. Verified: GitHub recognises the client id, Forgejo's /user/oauth2/github redirects to GitHub's authorize URL with the correct client id + callback, and the login page renders the button. Final browser click-through is the user's to do. Runbook updated: docs/runbooks/forgejo-open-signups.md (GitHub section + secret-rotation + DB-loss recreate steps). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 16:41:49 +00:00
Viktor Barzin	fd0c7493c3	traefik/crowdsec: serve Cloudflare Turnstile for captcha remediation CrowdSec LAPI already issues `captcha`-type decisions for lower-severity abuse (http-429-abuse, http-403-abuse, http-crawl-non_statics, http-sensitive-files), but the Traefik bouncer plugin had no captcha provider configured — so those decisions silently fell through to a 403 ban (traced in the plugin's bouncer.go @ v1.4.2: captchaClient.Valid==false => handleBanServeHTTP). Flagged users had no way to self-unblock, contradicting the profile's stated intent. Wire Cloudflare Turnstile as the bouncer's captcha provider so a captcha decision now renders a solvable challenge instead of a hard block: - New cloudflare_turnstile_widget.crowdsec_captcha (managed mode), scoped to viktorbarzin.me so one widget covers every subdomain the bouncer fronts. Mirrors the existing Forgejo-signup Turnstile pattern; sitekey + secret are passed into the traefik module. - middleware.tf: captchaProvider=turnstile + site/secret keys + grace 1800s + captchaHTMLFilePath=/captcha/captcha.html. - Vendor the plugin's captcha.html and mount it into the Traefik container at /captcha via the chart `volumes` value — the pulled Yaegi plugin does not expose its bundled template to Traefik. - docs/architecture/security.md: document the ban-vs-captcha remediation split. - Remove the dead crowdsec-ingress-bouncer.yaml (unused nginx bouncer with placeholder reCAPTCHA keys; referenced by zero .tf). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 16:38:38 +00:00
Viktor Barzin	963e4fcdde	forgejo: open native self-signups, gated by Turnstile + email confirmation Viktor wants Forgejo open for anyone to sign up, but without bot/spam account floods. Flip the deployment from OAuth-only registration (ALLOW_ONLY_EXTERNAL_REGISTRATION=true) to allowing native local sign-up, and add two bot gates on the registration form: - Cloudflare Turnstile captcha (CAPTCHA_TYPE=cfturnstile). The widget is managed in Terraform (turnstile.tf) via the CF Global API key, so the sitekey/secret are IaC, not a dashboard artifact. - Mandatory email confirmation (REGISTER_EMAIL_CONFIRM=true). Wire the Forgejo mailer to the cluster mailserver as noreply@viktorbarzin.me (mail.viktorbarzin.me:587 STARTTLS), reusing the same Vault-sourced credential Authentik uses (email-secret.tf ESO -> secret/authentik smtp_password). Existing Authentik OAuth2 login is unchanged (additive). Deployment env appended (not inserted) so the diff stays purely additive; a reloader annotation rolls the pod on secret rotation. Verified live: signup page renders the Turnstile widget, mailer delivers a test message end-to-end, Forgejo healthy, plan-to-zero after apply. Runbook: docs/runbooks/forgejo-open-signups.md Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 16:05:07 +00:00
Viktor Barzin	21dbd79ae4	Merge remote-tracking branch 'origin/master' into wizard/homelab-obs	2026-06-19 11:27:44 +00:00
Viktor Barzin	e91e1612dd	homelab: v0.5.0 — net/dns/metrics/logs probes (endpoint resolution) The remaining verbs that pass the "saves reasoning, not just typing" test the user posed mid-session: each encodes the non-obvious which-endpoint-reached-how resolution otherwise re-derived every time. (Same test deprioritized node-ssh and secret-get aliasing — thin wrappers over commands already known.) - net check <host> [path]: two-legged reachability — external (public DNS→CF) vs internal (Traefik LB) — so you see WHERE a break is, not just that one path works. (live: surfaced the LB at 6ms vs CF 77ms.) - dns lookup <name> [type]: Technitium (10.0.20.201) vs public (1.1.1.1) diff. - metrics query "<promql>" / metrics alerts: Prometheus via the LB (prometheus-query.viktorbarzin.lan); alerts uses the synthetic ALERTS series since the query frontend has no /api/v1/alerts and Alertmanager has no ingress. - logs query "<logql>" [--since 1h] [--limit N]: Loki range query via the LB. All reach auth-free internal ingresses through the LB (Go form of curl --resolve host:443:10.0.20.203) — no port-forward, no kubectl. In-cluster- only endpoints (Alertmanager v2) deliberately out of scope. Verified live before building; all five smoke-tested green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-19 11:27:31 +00:00
Viktor Barzin	6cb823e431	k8s-version-upgrade: complete autonomy P0 — blocked alert + deeper postflight + runbook Builds on the compat gate (prev commit) to finish "auto-upgrade when safe, halt + alert when not": - monitoring: K8sUpgradeBlocked alert (k8s_upgrade_blocked==1, for 10m, warning) in the Upgrade Gates group — the clean "a k8s auto-upgrade was refused, see Slack for why" signal. (Until monitoring is applied, a block still surfaces via the already-live K8sUpgradeChainJobFailed.) - upgrade-step.sh phase_postflight: deeper post-upgrade smoke tests — apiserver /readyz + /livez, in-cluster DNS (resolve kubernetes.default), and core kube-system pods (apiserver/controller-manager/scheduler/etcd/coredns) Running. Any failure halts + alerts (exit 1; no rollback — kubeadm can't downgrade). Catches a "pods look Running but cluster is broken" upgrade. - runbook: documents the compat gate, the blocked alert, how to clear a block, matrix maintenance, and the detector minor-probe fix. After deploy, the nightly chain detects 1.35 (minor detection now works) and correctly BLOCKS on Calico 3.26 / ESO 0.12 / kyverno 1.16 (all behind), alerting via K8sUpgradeBlocked — the autonomy working as designed until the catch-up clears those addons.	2026-06-19 11:27:17 +00:00
Viktor Barzin	cecd9fe247	k8s-version-upgrade: compat gate — auto-upgrade when safe, halt + alert when not Make k8s upgrades (patch AND minor) autonomous without being reckless: the chain attempts every upgrade but refuses unless it can prove the target is safe. A refusal is a BLOCK (not a crash) — it halts the chain and signals for attention. - compat-gate.py: read-only preflight check. Blocks if (a) a critical addon's running version doesn't support the target k8s minor, (b) an in-use deprecated API (apiserver_requested_deprecated_apis) is removed at/before the target, or (c) a node's containerd is below the target's floor. Validated against the live cluster: correctly blocks 1.35/1.36 today on Calico 3.26 / ESO 0.12 / kyverno 1.16 (all behind), which is exactly the auto-halt we want until they're bumped. - addon-compat.json: curated addon -> max-supported-k8s matrix (Calico, ESO, kyverno, gpu-operator + containerd floor), sourced from each project's compat docs (2026-06-19). The keystone data the gate reads; keep current. - upgrade-step.sh: phase_preflight runs the gate FIRST (before any mutation); block() pushes k8s_upgrade_blocked=1 + Slacks the reasons + halts. - main.tf: detector minor-probe fix (curl -sILo so the 302 from pkgs.k8s.io resolves to 200 — minors were never being detected). Gated behind the compat gate above, so enabling minor detection can't roll an unsafe minor. Not pushed yet: deploys with the K8sUpgradeBlocked alert + deeper postflight + runbook (next commit) so the detector fix only goes live with the full net.	2026-06-19 11:23:30 +00:00
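
The three block conditions described in cecd9fe247 (addon support, removed deprecated APIs, containerd floor) can be sketched roughly as follows. This is a minimal illustration, not the repository's compat-gate.py: the matrix layout, the `ClusterState` type, the function names, and every version number in the example data are assumptions made up for this sketch, and it only handles purely numeric dotted versions.

```python
from dataclasses import dataclass


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '1.35' or 'v1.7.13' into a comparable tuple of ints (numeric parts only)."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def minor_of(text: str) -> tuple[int, ...]:
    """Keep only major.minor, so '1.35.2' and '1.35' compare equal."""
    return parse_version(text)[:2]


@dataclass
class ClusterState:
    """Hypothetical snapshot of what a read-only preflight would collect."""
    addons: dict[str, str]           # addon name -> running version
    deprecated_apis: dict[str, str]  # in-use deprecated API -> k8s minor that removes it
    containerd: dict[str, str]       # node name -> containerd version


def check_upgrade(target: str, matrix: dict, state: ClusterState) -> list[str]:
    """Return the reasons to block; an empty list means the target looks safe."""
    target_minor = minor_of(target)
    reasons: list[str] = []

    # (a) every addon's running version must support the target minor
    for addon, running in sorted(state.addons.items()):
        max_k8s = matrix["addons"].get(addon, {}).get(running)
        if max_k8s is None:
            # No entry means safety cannot be proven, so treat it as a block.
            reasons.append(f"{addon} {running}: no compatibility entry")
        elif minor_of(max_k8s) < target_minor:
            reasons.append(f"{addon} {running} supports k8s <= {max_k8s}")

    # (b) no in-use deprecated API may be removed at or before the target
    for api, removed_in in sorted(state.deprecated_apis.items()):
        if minor_of(removed_in) <= target_minor:
            reasons.append(f"{api} is in use and removed in k8s {removed_in}")

    # (c) every node's containerd must meet the target's floor
    key = ".".join(str(n) for n in target_minor)
    floor = matrix["containerd_floor"].get(key)
    if floor is None:
        reasons.append(f"no containerd floor recorded for k8s {key}")
    else:
        for node, version in sorted(state.containerd.items()):
            if parse_version(version) < parse_version(floor):
                reasons.append(f"{node}: containerd {version} is below {floor}")

    return reasons


# Invented example data; names and versions are placeholders, not real compat facts.
EXAMPLE_MATRIX = {
    "addons": {
        "example-cni": {"3.26": "1.34"},
        "example-policy": {"1.16": "1.35"},
    },
    "containerd_floor": {"1.34": "1.7.0", "1.35": "2.0.0"},
}
EXAMPLE_STATE = ClusterState(
    addons={"example-cni": "3.26", "example-policy": "1.16"},
    deprecated_apis={"widgets.v1beta1.example.io": "1.35"},
    containerd={"node1": "1.7.13"},
)

if __name__ == "__main__":
    for reason in check_upgrade("1.35", EXAMPLE_MATRIX, EXAMPLE_STATE):
        print("BLOCK:", reason)
```

Returning a list of reasons rather than raising keeps a refusal distinct from a crash, which matches the commit's framing of a block as a halt-and-signal outcome; a caller could push a metric and post the reasons when the list is non-empty.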

4431 commits