Commit graph

4419 commits

Author SHA1 Message Date
Viktor Barzin
f2b089e267 rybbit: fix cloudflare_ruleset import id (zone/ 3-part form) + depends_on lists
v4.52.7 import id must be zone/<zone_id>/<ruleset_id>; add depends_on so the
crowdsec_ban/captcha lists exist before the WAF rules reference them.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 19:12:29 +00:00
58fc6d5061 Merge pull request 'Fix CrowdSec firewall-bouncer tar + CF WAF ruleset import' (#3) from wizard/crowdsec-fixes into master
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-20 19:06:15 +00:00
Viktor Barzin
a351a66843 crowdsec+rybbit: fix firewall-bouncer tar extraction (busybox) + import existing CF WAF ruleset
- initContainer used GNU tar --wildcards which fails on the busybox curl image (pod Init:Error); switch to extract-all + cp via shell glob.
- cloudflare_ruleset hit the per-zone singleton conflict; import the existing 'default' http_request_firewall_custom ruleset and manage all rules — CrowdSec ban/captcha first, the pre-existing disabled skip rule preserved verbatim.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 19:04:30 +00:00
70e8ce1021 Merge pull request 'CrowdSec real enforcement: edge WAF (proxied) + firewall-bouncer (direct)' (#2) from wizard/crowdsec-enforcement into master
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-20 09:42:41 +00:00
Viktor Barzin
ca8d617e72 rybbit: use 'Account Rule Lists' permission group for the CF sync token (v4)
tg plan verified the agent's guess 'Account Filter Lists Edit/Read' is not a key in the v4.52.7 permission-group map; the live CF API lists the correct account-scoped groups as 'Account Rule Lists Read'/'Write'.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 09:41:41 +00:00
Viktor Barzin
0c56290af0 chore(forgejo): re-trigger apply of git.timeout/gc.auto (changed-stack skip)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
910d5892 landed the [git.timeout] + [git.config] env in master, but the CI apply
skipped stacks/forgejo (the changed-stack-diff race after a sync-merge), so the
Forgejo deployment never picked it up. A trivial comment touch to force a clean
apply of the stack so the durable push-mirror fix actually takes effect.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 09:19:53 +00:00
Viktor Barzin
cc4bfb593b rybbit: proxied CrowdSec enforcement via Cloudflare IP Lists + WAF rule
Replaces the Worker+KV approach (which only covered the ~27 routed hosts) with a
zone-wide mechanism that covers ALL proxied hosts: two CF account IP Lists
(crowdsec_ban, crowdsec_captcha) + one zone WAF custom rule that blocks
`(ip.src in $crowdsec_ban)` and managed-challenges `(ip.src in $crowdsec_captcha)`.
No per-request Worker, no cookie machinery — the rybbit Worker stays
analytics-only. lapi_kv_sync.py now full-reconciles the two lists from LAPI
(fail-safe: a LAPI blip skips the run and freezes the last-known-good block set;
serializes CF bulk ops since CF allows one pending op per account). A
least-privilege CF API token (Account Filter Lists Edit) is minted in TF.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 09:18:33 +00:00
Viktor Barzin
7e646e1c7c crowdsec: add cs-firewall-bouncer DaemonSet (direct-host nftables enforcement)
Drops banned source IPs in-kernel via nftables (hooks input+forward, so DNAT'd
LoadBalancer traffic is caught before reaching Traefik) for DIRECT hosts — the
direct-side replacement for the dead Traefik plugin, zero per-request hop.

No published image exists, so an initContainer fetches the pinned official
static binary (v0.0.34) onto a stock debian-slim base (nftables backend uses
netlink directly, no nft CLI needed). hostNetwork + NET_ADMIN/NET_RAW (not
privileged). Config (with api_key) in a Secret, Reloader-annotated. crowdsec ns
is already in the Kyverno wave-1 exclude list, so the privileged/hostNetwork pod
is admitted. Pinned to k8s-node2 (runs a Traefik pod) for one-node validation
before the nodeSelector is removed to roll cluster-wide. Fail-open by element
timeout if the bouncer stops.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 09:11:08 +00:00
Viktor Barzin
53117b193a portal-realtime: deploy the v2 full-duplex voice agent (Pipecat)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
New stack for the realtime voice agent — v2 of the portal-assistant brain
path. One persistent WebSocket per conversation: continuous mic audio ->
Silero VAD turn-taking -> Whisper STT (portal-stt) -> streaming Claude brain
(claude-agent-service) -> edge-tts (portal-tts) -> audio out, with barge-in.
Reuses all three upstream cluster services; nothing new is spun up.

Public Cloudflare ingress (proxied, WebSocket) at portal-realtime.viktorbarzin.me
with the app's own DEVICE_TOKEN as the edge gate (auth="app" — Authentik would
break the native Portal client). No buffering middleware: it would break the
streaming WebSocket. Image ghcr.io/viktorbarzin/portal-assistant-realtime
(private ghcr, pulled with ghcr_pull_token). Sibling to the v1 portal-assistant
gateway, which stays live.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:23:17 +00:00
Viktor Barzin
44cac6f4e2 gitignore: ignore Python test artifacts (__pycache__, *.pyc, .pytest_cache)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Introduced the first pytest file in the tree
(stacks/k8s-version-upgrade/scripts/test_compat_gate.py); running it leaves an
untracked __pycache__/ dir. Ignore the standard Python build artifacts so test
runs don't show up as working-tree noise or get committed by accident.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:17:03 +00:00
Viktor Barzin
b58fe8cb1a docs(k8s-upgrade): record detector Packages-probe -L fix + compat-gate patch scope
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Two corrections to the runbook matching today's code fixes:
- The next-minor *patch* probe (GET .../Packages) also needs `-L`; it lacked it
  until 2026-06-20 and silently no-op'd the 2026-06-19 nightly run. Both probes
  now follow the 302.
- The compat gate's addon check is scoped to minor jumps — patches within the
  running minor are never addon-blocked (target_minor <= running_minor returns
  early), so a conservative ceiling like ESO 0.12 -> 1.31 no longer false-blocks
  a 1.34.x patch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:16:20 +00:00
Viktor Barzin
e5250f417e k8s-version-upgrade: compat gate must not false-block patch upgrades
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The compat gate compared every addon's matrix ceiling against the target
k8s minor unconditionally. That is correct for a minor JUMP, but it also
blocked patch upgrades within the minor the cluster is ALREADY running:
ESO v0.12's matrix ceiling is 1.31, the cluster runs 1.34.9, so a target of
1.34.10 (a patch) was refused with "external-secrets supports k8s <= 1.31;
target 1.34 exceeds it" — even though the running cluster is itself proof ESO
0.12 works on 1.34. That silently defeats autonomous patching (it would have
bitten the moment a 1.34.10 was published).

Fix: a target at or below the running minor crosses into no new k8s minor, so
every installed addon is already empirically proven on it — check_addons now
returns no reasons when target_minor <= running_minor. Added running_minor()
(oldest kubelet across nodes, mirroring the detector; RUNNING_K8S env override
for tests) and pass it in. Minor jumps are unchanged: 1.34->1.35 still blocks
on ESO 0.12 + kyverno 1.16. removed-API + containerd checks are naturally
inert for patches (no API removal / containerd floor inside a minor) and keep
running as defence. Added test_compat_gate.py (8 cases) covering both paths.

Verified end-to-end against live Prometheus: target 1.34.10 -> EXIT 0 (safe),
target 1.35.6 -> EXIT 2 (blocked on ESO+kyverno).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:14:50 +00:00
Viktor Barzin
38675b7922 crowdsec: register kvsync + firewall bouncer keys in LAPI
Seeds two new bouncers at LAPI startup (BOUNCER_KEY_kvsync, BOUNCER_KEY_firewall)
from Vault secret/platform, mirroring the existing BOUNCER_KEY_traefik wiring.
These are the two halves of the real enforcement that replaces the dead Yaegi
plugin: kvsync authenticates the LAPI->Cloudflare-KV sync (proxied edge Worker),
firewall authenticates the cs-firewall-bouncer DaemonSet (direct-host nftables).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:12:38 +00:00
Viktor Barzin
a9384a4067 Merge remote-tracking branch 'origin/master'
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-20 08:09:16 +00:00
Viktor Barzin
44a98d408e k8s-version-upgrade: detector next-minor probe must follow 302 (curl -sfL)
The next-minor Packages query used `curl -sf` without -L. pkgs.k8s.io
302-redirects every request to a backing host, so without -L curl returned
an empty body, NEXT_MINOR_PATCH came back empty, and the detector fell
through to "No upgrade needed". That is exactly why last night's 23:00 chain
no-op'd instead of resolving the 1.35 next-minor target (1.35.6) and handing
it to the compat gate. `curl -sfL` follows the redirect and returns the
Packages file (verified: -sf -> empty, -sfL -> 1.35.6). Mirrors the same
-L fix already applied to the Release availability probe (-sILo) above.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:09:08 +00:00
Viktor Barzin
910d589205 fix(forgejo): raise git-op timeouts + lower gc.auto to stop push-mirror timeouts
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
The tripit Forgejo->GitHub push-mirror silently stalled: `git cat-file
--batch-all-objects` over the NFS-backed repo exceeded the default git deadline
once ~4500 loose objects accumulated (gc.auto's 6700 threshold hadn't fired), so
pushes stopped reaching GitHub and prod deploys stalled. Raise [git.timeout]
(DEFAULT/MIRROR/GC) so a slow object enumeration can't abort the mirror, and set
[git.config] gc.auto=1000 so post-push autogc + the git_gc_repos cron keep repos
packed (the real fix). A one-off forced gc already unblocked tripit; this prevents
recurrence across all repos.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:08:50 +00:00
Viktor Barzin
45bed1c133 Merge remote-tracking branch 'origin/master'
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-20 08:07:23 +00:00
Viktor Barzin
e1736d2e5c calico: hop 3.28.5->3.30.7 (operator v1.38.13) — restores a SUPPORTED Calico/k8s-1.34 pairing. Disabled new-in-3.30 Goldmane/Whisker (their CRs render before crds/ install on helm upgrade; we use Prometheus/Loki). calico-node 7/7 on quay/v3.30.7, tigerastatus green. Applied manually + verified overnight. 2026-06-20 08:07:08 +00:00
Viktor Barzin
4d9fdbc7f7 rybbit: add CrowdSec LAPI -> Cloudflare KV sync script (proxied edge control plane)
Pure-stdlib script (alert_digest pattern, runs on stock python:3.12-alpine) that
projects CrowdSec Ip-scope ban/captcha decisions into the Workers KV namespace
the edge Worker reads on each proxied request. Full-reconcile per run so an
un-ban clears from the edge within one interval; fail-safe (a LAPI read error
skips the run and leaves existing bans to expire by TTL = fail-open, never a
stale all-block). TF wiring (KV namespace + CronJob + key registration) follows.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:05:11 +00:00
Viktor Barzin
0ac176da01 crowdsec: whitelist internal/LAN/tailnet CIDRs at the decision layer
Preparing for real CrowdSec enforcement (edge Cloudflare Worker for proxied
hosts + cs-firewall-bouncer for direct hosts). Both enforce by dropping the
real source IP, so if an internal/RFC1918 address ever ended up in a ban
decision it could blackhole legitimate internal traffic. Whitelisting the
cluster/LAN/tailnet ranges (10/8, 172.16/12, 192.168/16, 100.64/10) at the
CrowdSec parser layer makes that structurally impossible — a trusted source
can never produce a decision in the first place. Public IP already whitelisted.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-20 08:03:46 +00:00
Viktor Barzin
3e3fdb34f0 homelab: v0.6.0 — usage telemetry (usage top), evidence-driven verb prioritization
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Answers the question that drove the whole CLI — which verbs to add next — with
data instead of one maintainer's habits, and resolves the cross-user-usage ask
in-bounds (no reading anyone's home).

- emit on dispatch: every verb fire-and-forgets one Loki line {job,user,verb} +
  "exit=N ver=X". ONLY the verb path + exit code — never args, paths, flags, or
  secrets (the emit never sees arguments). Best-effort: 800ms timeout, errors
  swallowed, never affects the command; opt-out HOMELAB_TELEMETRY=0. Discovery
  verbs (manifest/version/help) and usage itself don't self-record.
- usage top [--since 30d] [--user U] [--json]: ranks verbs via
  sum by (verb)(count_over_time({job="homelab-usage"}[…])) against the shared
  Loki. Cross-user analytics WITHOUT touching ~/.claude — the privacy-preserving
  answer to "what does the team use".
- Loki sink (zero new infra, dogfoods v0.5 logs path); push verified HTTP 204 no
  auth. ADR docs/adr/0011.

Live-verified: ran 4 verbs, usage top ranked them correctly (metrics query=2).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 22:29:01 +00:00
Viktor Barzin
666fefd22b calico: hop 3.26->3.28.5 (operator v1.34.13); calico-node 7/7 healthy, tigerastatus green, kube-controller-manager restarted (3.28 UID change). Applied manually + verified.
Some checks failed
ci/woodpecker/push/default Pipeline failed
2026-06-19 22:09:23 +00:00
Viktor Barzin
8ed5368be9 calico: bring tigera-operator under Terraform via Helm (adopt at 3.26.1)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Base for the stepped 3.26->3.28->3.30->3.32 upgrade (k8s 1.36 prereq; 3.26 is
already unsupported on k8s 1.34). Manage ONLY the operator via the official
tigera-operator Helm chart (chart ver == Calico ver); installation.enabled=false
keeps the live Installation CR operator-managed so Helm never touches calico-node.
Adopted in place: existing operator Deployment/SA/ClusterRole/ClusterRoleBinding
pre-stamped with Helm ownership metadata (transient migration step), then the
release imported via a plan-verified create (1 to add, 0 destroy). Verified clean:
calico-node 7/7 unchanged, tigerastatus green. Closes the year-deferred adoption
(code-3ad) for the operator without adopting the Installation CR.
2026-06-19 21:50:34 +00:00
Viktor Barzin
dd029ca7fb traefik/crowdsec: switch bouncer to live mode (stream cache doesn't enforce under Yaegi)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
After bumping to v1.6.0 (stream goroutine runs) and disabling redis (in-memory
cache), the plugin logs `handleStreamCache:updated` but still does NOT enforce:
a ban present in the LAPI stream AND pulled by the plugin still let the banned IP
through. Stream-mode decision matching is unreliable under Traefik's Yaegi
interpreter here. Switch crowdsecMode stream->live: the plugin queries LAPI
synchronously per request (result cached per-IP for defaultDecisionSeconds), which
enforces reliably and picks up new decisions immediately. LAPI is 3-replica +
in-cluster so per-request latency is small; fail-open preserved (updateMaxFailure=-1).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:49:26 +00:00
Viktor Barzin
0cc48d83ac traefik/crowdsec: disable bouncer redis cache (broken under Yaegi → in-memory)
With the plugin on v1.6.0 the stream goroutine finally runs, and its slog output
revealed the real blocker: `handleStreamTicker ... isCrowdsecStreamHealthy:true
cache:unreachable`. The LAPI stream is healthy, but the plugin's redis client
cannot reach the cache under Traefik's Yaegi interpreter — even though
redis-master.redis.svc is reachable AND writable from the traefik namespace
(SET/GET verified via busybox; no NetworkPolicies; no auth). Same interpreter
-incompat class as the stream goroutine itself. With redisCacheUnreachableBlock
=false the bouncer then failed open and enforced nothing.

Disable the redis cache so the plugin uses its in-memory decision store (works
under Yaegi). Removes redisCacheHost/redisCacheUnreachableBlock. Trade-off:
captcha already-solved grace is per-pod across the 3 Traefik replicas (at worst
an occasional re-solve) — acceptable; bans/captcha decisions enforce correctly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:49:26 +00:00
Viktor Barzin
531efb218d traefik: bump crowdsec-bouncer plugin v1.4.2 -> v1.6.0 (fix stream not pulling)
The crowdsec-bouncer Yaegi plugin pinned at v1.4.2 loads on Traefik 3.7.5 but
its decision-stream goroutine never runs — no Traefik pod ever calls the LAPI
stream (verified: no traefik-pod bouncer entry / no @pod-ip auto-registration),
and it logs nothing. All deps are healthy (LAPI 200 + full ban list reachable
from the traefik ns, key valid, redis PONG, config correct, no NetworkPolicies),
so CrowdSec enforced nothing despite the bouncer now being registered. This is
the Traefik-v3 / Yaegi plugin-incompat class that already killed rewrite-body
here. v1.4.2 predates Nov 2025; latest is v1.6.0.

Bump to v1.6.0 (initContainer download URL + state.json + experimental.plugins
version). Config-verified compatible: every key we use survives (crowdsecMode,
crowdsecLapiKey/Host, updateMaxFailure, redisCache*, clientTrustedIPs, all
captcha* incl. turnstile); v1.6.0 also moves logging to slog/trace for future
diagnosis. Pinned, not auto-updated (Keel can't manage a Yaegi plugin, and
plugin bumps must be tested against the running Traefik/Yaegi).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:49:26 +00:00
78095aa273 docs(forgejo): runbook reflects Authentik disabled + zero-click GitHub
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Authentik OAuth2 source is now disabled (login_source.is_active=0) and GitHub
auto-registration (zero-click sign-up) is on. Document why (global auto-reg +
Authentik's email-as-username 500; Forgejo/Authentik email mismatch blocks
account-linking) and how to re-enable Authentik later.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:37:46 +00:00
7d99203fc6 forgejo: re-enable ENABLE_AUTO_REGISTRATION for zero-click GitHub sign-up
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Per Viktor: GitHub sign-up must work zero-click (account created on first login,
no form). This global [oauth2_client] setting enables it. It conflicts with
Authentik (preferred_username is an email → invalid Forgejo username → 500 on
auto-create), and Viktor's Forgejo email (me@viktorbarzin.me) doesn't match his
Authentik email (vbarzin@gmail.com) so account-linking can't bridge it — so the
Authentik OAuth2 source is DISABLED (login_source.is_active=0; DB-managed,
out-of-band) per his directive. Forgejo sign-in is now GitHub + native login.

Committed via API to land on origin without pushing a concurrent agent's unpushed
local commit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:34:17 +00:00
ef530b7d38 forgejo: drop ENABLE_AUTO_REGISTRATION — it broke Authentik sign-in
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ENABLE_AUTO_REGISTRATION is a global [oauth2_client] setting (all OAuth sources).
On Authentik sign-in, Forgejo auto-created an account and derived the username
from Authentik's preferred_username claim — which is the user's email
(vbarzin@gmail.com), invalid as a Forgejo username (no '@') → CreateUser failed
→ 500 on the OAuth callback. (GitHub's username claim is valid, so only Authentik
broke.) Reverting to the standard link/register flow fixes both; GitHub sign-up
still works via a one-step register form. Committed via API to touch only main.tf
(forgejo-only CI apply) so it doesn't collide with concurrent crowdsec work.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:24:29 +00:00
Viktor Barzin
a5bb4db9c5 crowdsec: register the Traefik bouncer with LAPI (fix fail-open)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The Traefik bouncer plugin's API key was never registered with LAPI — the
crowdsec stack reads many keys from Vault but not ingress_crowdsec_api_key, and
the chart registers no bouncer. So LAPI returned 403 to the plugin, which with
updateMaxFailure=-1 failed open and enforced NOTHING: no community-blocklist
bans, and the (now-Turnstile-wired) captcha never fired. cscli bouncers list was
empty; the registration was likely lost in the MySQL->PostgreSQL DB migration
with no IaC to recreate it.

Seed the bouncer at LAPI startup via BOUNCER_KEY_traefik, valued from the same
Vault key the middleware presents — so they match by construction, and the
bouncer re-registers automatically on every LAPI start (survives DB wipes).

- stacks/crowdsec/main.tf: read ingress_crowdsec_api_key, pass to module.
- module main.tf: new sensitive var + thread into the values templatefile.
- values.yaml: BOUNCER_KEY_traefik on lapi.env.
- docs/architecture/security.md: document registration + fail-open history and
  the proxied-app coverage caveat.

Activates enforcement (community blocklist bans + captcha) on non-proxied apps;
internal IPs stay bypassed (clientTrustedIPs), fail-open-on-LAPI-down preserved.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 17:08:28 +00:00
Viktor Barzin
56dadda453 traefik: pin helm chart to 40.2.0 (deployed version)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The traefik helm_release had no chart version pin, so a refreshed helm repo
index resolves `chart = "traefik"` to the latest (41.0.0), whose values schema
rejects this stack's `logs` block ("Additional property logs is not allowed") —
an unpinned apply attempts that upgrade and fails (atomic rollback). Pin to the
deployed 40.2.0 (release rev 57, since 2026-05-30) so applies are deterministic;
chart bumps must be deliberate with a values migration.

Follow-up to fd0c7493 (Turnstile captcha), which was applied with this pin
already in live TF state — this lands the pin in git to remove the drift.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 16:58:33 +00:00
Viktor Barzin
4a66377425 forgejo: add "Sign in with GitHub" (OAuth2 source + auto-registration)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wanted people to be able to sign up with GitHub, not just the
native form or Authentik SSO.

- Added a GitHub OAuth2 login source via `forgejo admin auth add-oauth
  --provider github` (name "github", matching the callback registered on
  the GitHub OAuth App). Like the existing Authentik source, it lives in
  Forgejo's DB rather than Terraform — there's no clean TF resource for
  login sources. Client id/secret mirrored to Vault secret/viktor
  (forgejo_github_oauth_client_id / _secret) for recovery.
- This commit's TF change: ENABLE_AUTO_REGISTRATION=true in
  [oauth2_client], so a first GitHub sign-in creates the account directly
  ("sign up with GitHub") instead of a link-to-existing detour. The
  GitHub identity is the trust gate for this path; Turnstile + email
  confirmation still gate the native form.

Verified: GitHub recognises the client id, Forgejo's /user/oauth2/github
redirects to GitHub's authorize URL with the correct client id +
callback, and the login page renders the button. Final browser
click-through is the user's to do.

Runbook updated: docs/runbooks/forgejo-open-signups.md (GitHub section +
secret-rotation + DB-loss recreate steps).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 16:41:49 +00:00
Viktor Barzin
fd0c7493c3 traefik/crowdsec: serve Cloudflare Turnstile for captcha remediation
CrowdSec LAPI already issues `captcha`-type decisions for lower-severity abuse
(http-429-abuse, http-403-abuse, http-crawl-non_statics, http-sensitive-files),
but the Traefik bouncer plugin had no captcha provider configured — so those
decisions silently fell through to a 403 ban (traced in the plugin's bouncer.go
@ v1.4.2: captchaClient.Valid==false => handleBanServeHTTP). Flagged users had
no way to self-unblock, contradicting the profile's stated intent.

Wire Cloudflare Turnstile as the bouncer's captcha provider so a captcha
decision now renders a solvable challenge instead of a hard block:

- New cloudflare_turnstile_widget.crowdsec_captcha (managed mode), scoped to
  viktorbarzin.me so one widget covers every subdomain the bouncer fronts.
  Mirrors the existing Forgejo-signup Turnstile pattern; sitekey + secret are
  passed into the traefik module.
- middleware.tf: captchaProvider=turnstile + site/secret keys + grace 1800s +
  captchaHTMLFilePath=/captcha/captcha.html.
- Vendor the plugin's captcha.html and mount it into the Traefik container at
  /captcha via the chart `volumes` value — the pulled Yaegi plugin does not
  expose its bundled template to Traefik.
- docs/architecture/security.md: document the ban-vs-captcha remediation split.
- Remove the dead crowdsec-ingress-bouncer.yaml (unused nginx bouncer with
  placeholder reCAPTCHA keys; referenced by zero .tf).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 16:38:38 +00:00
Viktor Barzin
963e4fcdde forgejo: open native self-signups, gated by Turnstile + email confirmation
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wants Forgejo open for anyone to sign up, but without bot/spam
account floods. Flip the deployment from OAuth-only registration
(ALLOW_ONLY_EXTERNAL_REGISTRATION=true) to allowing native local
sign-up, and add two bot gates on the registration form:

  - Cloudflare Turnstile captcha (CAPTCHA_TYPE=cfturnstile). The widget
    is managed in Terraform (turnstile.tf) via the CF Global API key, so
    the sitekey/secret are IaC, not a dashboard artifact.
  - Mandatory email confirmation (REGISTER_EMAIL_CONFIRM=true). Wire the
    Forgejo mailer to the cluster mailserver as noreply@viktorbarzin.me
    (mail.viktorbarzin.me:587 STARTTLS), reusing the same Vault-sourced
    credential Authentik uses (email-secret.tf ESO -> secret/authentik
    smtp_password).

Existing Authentik OAuth2 login is unchanged (additive). Deployment env
appended (not inserted) so the diff stays purely additive; a reloader
annotation rolls the pod on secret rotation.

Verified live: signup page renders the Turnstile widget, mailer delivers
a test message end-to-end, Forgejo healthy, plan-to-zero after apply.

Runbook: docs/runbooks/forgejo-open-signups.md

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 16:05:07 +00:00
Viktor Barzin
21dbd79ae4 Merge remote-tracking branch 'origin/master' into wizard/homelab-obs
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
2026-06-19 11:27:44 +00:00
Viktor Barzin
e91e1612dd homelab: v0.5.0 — net/dns/metrics/logs probes (endpoint resolution)
The remaining verbs that pass the "saves reasoning, not just typing" test the
user posed mid-session: each encodes the non-obvious which-endpoint-reached-how
resolution otherwise re-derived every time. (Same test deprioritized node-ssh
and secret-get aliasing — thin wrappers over commands already known.)

- net check <host> [path]: two-legged reachability — external (public DNS→CF)
  vs internal (Traefik LB) — so you see WHERE a break is, not just that one path
  works. (live: surfaced the LB at 6ms vs CF 77ms.)
- dns lookup <name> [type]: Technitium (10.0.20.201) vs public (1.1.1.1) diff.
- metrics query "<promql>" / metrics alerts: Prometheus via the LB
  (prometheus-query.viktorbarzin.lan); alerts uses the synthetic ALERTS series
  since the query frontend has no /api/v1/alerts and Alertmanager has no ingress.
- logs query "<logql>" [--since 1h] [--limit N]: Loki range query via the LB.

All reach auth-free internal ingresses through the LB (Go form of
curl --resolve host:443:10.0.20.203) — no port-forward, no kubectl. In-cluster-
only endpoints (Alertmanager v2) deliberately out of scope. Verified live before
building; all five smoke-tested green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 11:27:31 +00:00
Viktor Barzin
6cb823e431 k8s-version-upgrade: complete autonomy P0 — blocked alert + deeper postflight + runbook
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Builds on the compat gate (prev commit) to finish "auto-upgrade when safe, halt +
alert when not":
- monitoring: K8sUpgradeBlocked alert (k8s_upgrade_blocked==1, for 10m, warning)
  in the Upgrade Gates group — the clean "a k8s auto-upgrade was refused, see
  Slack for why" signal. (Until monitoring is applied, a block still surfaces via
  the already-live K8sUpgradeChainJobFailed.)
- upgrade-step.sh phase_postflight: deeper post-upgrade smoke tests —
  apiserver /readyz + /livez, in-cluster DNS (resolve kubernetes.default), and
  core kube-system pods (apiserver/controller-manager/scheduler/etcd/coredns)
  Running. Any failure halts + alerts (exit 1; no rollback — kubeadm can't
  downgrade). Catches a "pods look Running but cluster is broken" upgrade.
- runbook: documents the compat gate, the blocked alert, how to clear a block,
  matrix maintenance, and the detector minor-probe fix.

After deploy, the nightly chain detects 1.35 (minor detection now works) and
correctly BLOCKS on Calico 3.26 / ESO 0.12 / kyverno 1.16 (all behind), alerting
via K8sUpgradeBlocked — the autonomy working as designed until the catch-up
clears those addons.
2026-06-19 11:27:17 +00:00
Viktor Barzin
cecd9fe247 k8s-version-upgrade: compat gate — auto-upgrade when safe, halt + alert when not
Make k8s upgrades (patch AND minor) autonomous without being reckless: the chain
attempts every upgrade but refuses unless it can prove the target is safe. A
refusal is a BLOCK (not a crash) — it halts the chain and signals for attention.

- compat-gate.py: read-only preflight check. Blocks if (a) a critical addon's
  running version doesn't support the target k8s minor, (b) an in-use deprecated
  API (apiserver_requested_deprecated_apis) is removed at/before the target, or
  (c) a node's containerd is below the target's floor. Validated against the live
  cluster: correctly blocks 1.35/1.36 today on Calico 3.26 / ESO 0.12 / kyverno
  1.16 (all behind), which is exactly the auto-halt we want until they're bumped.
- addon-compat.json: curated addon -> max-supported-k8s matrix (Calico, ESO,
  kyverno, gpu-operator + containerd floor), sourced from each project's compat
  docs (2026-06-19). The keystone data the gate reads; keep current.
- upgrade-step.sh: phase_preflight runs the gate FIRST (before any mutation);
  block() pushes k8s_upgrade_blocked=1 + Slacks the reasons + halts.
- main.tf: detector minor-probe fix (curl -sILo so the 302 from pkgs.k8s.io
  resolves to 200 — minors were never being detected). Gated behind the compat
  gate above, so enabling minor detection can't roll an unsafe minor.

Not pushed yet: deploys with the K8sUpgradeBlocked alert + deeper postflight +
runbook (next commit) so the detector fix only goes live with the full net.
2026-06-19 11:23:30 +00:00
Viktor Barzin
9189560ac3 homelab: v0.4.0 — ci/deploy verbs (watch what you trigger)
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Adds the verb-group that kills the single biggest reasoning sink in agent
sessions — watching a build/deploy to completion (proven the session that built
it: hours hand-rolling Woodpecker polling + DB-schema spelunking for one CI
incident).

- ci status/watch: Woodpecker REST API (version-stable, not its DB schema),
  reached via the internal Traefik LB (dial 10.0.20.203, SNI=ci.viktorbarzin.me
  so the cert verifies — the Go form of the house `curl --resolve` pattern),
  token from WOODPECKER_TOKEN/Vault, repo id resolved from the cwd remote, with
  retries that ride Woodpecker's intermittent empty responses. watch matches the
  HEAD/given commit (avoids the post-push race) and exits non-zero on failure.
- deploy wait: image-sha match THEN rollout status (rollout status alone returns
  success on the old ReplicaSet); kubectl-based.
- work land now auto-watches CI to green on the landed commit (--no-ci-watch to
  skip), closing the v0.1 gap.
- ci logs deferred to v0.4.1 (Woodpecker detail/log endpoints were the least
  reliable; status/watch use the working list endpoint).

Live-verified ci status/watch against the live API.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 10:59:14 +00:00
Viktor Barzin
787ce4edfa homelab: v0.3.1 — fix k8s db PG target (resolve CNPG primary pod, not the Service)
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
`k8s db <app>` (Postgres path) execed `pg-cluster-rw`, which is the CNPG
read-write SERVICE, not a pod — so kubectl exec failed with
`pods "pg-cluster-rw" not found`. The unit test only checked the plan; the verb
was never fired at live state (the gap flagged in v0.2), so it shipped broken.

Fix: the PG plan now carries a label selector (cnpg.io/instanceRole=primary)
instead of a pod name, and k8s db resolves the actual primary POD via
`kubectl get pod -l <selector>` before exec. MySQL path (real pod
mysql-standalone-0) unchanged. Live-verified both paths (psql + mysql).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 09:09:34 +00:00
Viktor Barzin
90c944a265 woodpecker: disable partial clone (partial: false) — fix intermittent git exit-128
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Infra pipelines were failing intermittently across all authors (e.g. #241-244,
#247) with the git clone step exiting 128:

  git fetch --depth=1 --filter=tree:0 ...   (partial/treeless clone)
  git reset --hard <sha>
  fatal: could not fetch <tree-sha> from promisor remote
  remote: 404 page not found

The plugin-git clone defaulted to a partial (treeless) clone. The initial ref
fetch carries credentials, but the lazy *promisor* object fetch triggered by
`git reset --hard` hits the PRIVATE Forgejo repo without creds -> 404 -> exit
128. Whether it fired was luck-of-the-draw, hence the ~50% intermittent failures
fleet-wide (not specific to any commit).

Fix: set `partial: false` on every clone block so all objects for the (still
shallow) commit are fetched upfront with creds — no fragile lazy promisor fetch.

Diagnosed against the woodpecker Postgres DB (steps/log_entries) since the
Woodpecker HTTP API was itself flapping. Earlier "permission for ViktorBarzin"
log lines were an unrelated cross-forge red herring.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 09:06:44 +00:00
Viktor Barzin
fd77c0dc4f monitoring: RpiSofiaUndervoltage alerts on new brown-out, not until reboot
Some checks failed
ci/woodpecker/push/default Pipeline failed
The rpi-sofia under-voltage alert keyed off the sticky firmware bit
(rpi_under_voltage_occurred == 1), which latches on the first brown-out and
stays 1 until the Pi reboots. With alert-on-change routing it re-paged on every
boot cycle and sat firing for ~211h of the last 14d — Viktor reported "getting a
few of these lately" — and it disagreed with the HA-sofia dashboard, which shows
the live state and reads OK once voltage recovers.

Can't just switch to the live bit: rpi_under_voltage_now never registered once in
14d (brown-outs are sub-second and fall between the 1-min textfile-collector
samples), so the sticky bit is the only reliable detector.

Fix: edge-trigger on a NEW latch via increase(rpi_under_voltage_occurred[1h]) > 0.
Fires once per brown-out and auto-resolves ~1h later (~2h active over the same
14d instead of ~211h); counter-reset handling makes a clean reboot a no-op. Both
real brown-out events in the window are still caught. Docs updated in the same
commit (monitoring.md).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 08:45:39 +00:00
Viktor Barzin
fbf6f11038 feat(tripit): #96 cutover — /api self-authenticates (remove forward-auth, add strip-auth-headers)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
ADR-0028 #96 (website half): /api drops Authentik forward-auth so the browser can carry a TripIt session cookie (the outpost 302'd cookie-only requests). The app self-authenticates (TripIt-session-first in get_current_user); no session -> 401 -> SPA landing. strip-auth-headers is REQUIRED now: with forward-auth gone, the hybrid forward-auth arm would otherwise trust a client-injected X-authentik-email — stripping inbound X-authentik-* closes that. /metrics split into its own still-gated ingress. Shell keeps Authentik bearers on tripit-api.* until #94; full AUTH_MODE collapse follows then. Verified live: no-session->401, valid TripIt cookie->200, injected header->401, Shell->200.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 08:27:39 +00:00
Viktor Barzin
8559c4574a fix(tripit): pin Authentik invalidation_flow literal (data source flakes null in CI under provider skew)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Pipeline 244 failed: data.authentik_flow.default_provider_invalidation resolved null in CI (goauthentik 2024.x provider vs 2026.2 server), silently blocking every tripit-stack apply incl. the ADR-0028 #90 signing-key + redirect-URI delivery. Pin the literal UUID (what the slug resolves to) — matches the data-source-skew workaround used for the Vault binding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 08:10:25 +00:00
Viktor Barzin
e5bb16e02a feat(tripit): activate TripIt-native session auth — signing key + Authentik web redirect (ADR-0028 #90)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Adds SESSION_SIGNING_KEY (Vault secret/tripit -> tripit-secrets ExternalSecret -> env_from) so TripIt's own session JWTs are signed with a real key (the app fails closed under the dev default until this lands), and adds the website OIDC redirect URI https://tripit.viktorbarzin.me/api/auth/callback/authentik to the public tripit-app provider so 'Log in with Authentik' works. Reuses the Shell's existing public OAuth2 app.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 06:06:43 +00:00
Viktor Barzin
077ac97df5 k8s-version-upgrade: auto-restore apiserver OIDC after control-plane bumps
Some checks failed
ci/woodpecker/push/default Pipeline failed
kubeadm upgrade apply regenerates the apiserver static-pod manifest and drops
the --authentication-config flag, silently breaking SSO (kubectl/kubelogin + the
k8s dashboard) until someone manually re-applied the rbac stack. That manual step
ran after every control-plane upgrade — the one thing keeping autonomous patch
upgrades from being truly hands-off (it bit us this cycle: an earlier master bump
left SSO broken until we noticed).

Automate it: the rbac stack now publishes its existing OIDC restore script (the
same one its null_resource runs) to a kube-system/apiserver-oidc-restore
ConfigMap, and the upgrade chain's phase_master re-runs it on master right after
the kubeadm upgrade — while tigera-operator is still quiesced so the flag-add
apiserver restart can't crashloop it. The script is idempotent and health-gates
/livez with auto-rollback; the step is non-fatal (a failure only lags SSO until
the next rbac apply, it won't abort the upgrade). phase_master already self-skips
when master is at target, so this only fires when master was actually upgraded.

The chain SA gets a name-scoped get on that one ConfigMap. Runbook updated: the
manual restore is now a documented fallback (command corrected — it needs
-replace, since the null_resource trigger hash never changes).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 06:04:30 +00:00
Viktor Barzin
48b63ffa6f homelab: add memory verb-group (v0.3.0) — direct claude-memory HTTP client
Some checks failed
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline failed
Lets agents search/navigate memory via the CLI, as the first step toward
deprecating the memory MCP. claude-memory is a FastAPI service (the MCP is just
one frontend); homelab memory is a thin Bearer-auth HTTP client over the same
API, using the env the hooks already set (CLAUDE_MEMORY_API_URL/KEY). It works
even when the MCP frontend is down — the recurring disconnect that took the MCP
offline for this whole session.

Verbs: recall (server-side semantic search), list, categories, tags, stats,
secret (read); store, update, delete (write). Validated against the live API
including a store→recall→delete round-trip — full data-plane parity with the MCP.

The deprecation itself (rewiring the per-prompt auto-recall + auto-learn hooks to
the CLI, then uninstalling the MCP) is a deliberate follow-up, sequenced after
the CLI is proven in the hooks — see docs/adr/0008.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 05:56:25 +00:00
Viktor Barzin
3594485f77 homelab: v0.2.0 — docs + version for the k8s verb-group
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Bump cli/VERSION to v0.2.0; document the k8s verbs (README table + resolver
note), add docs/adr/0007 (resolver, read/write split, config-mutation stays
raw, db dbaas pattern), and extend the AGENTS.md discovery pointer with the
Kubernetes surface.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 22:30:41 +00:00
Viktor Barzin
1f7438bb18 homelab: add k8s verb-group (v0.2) — the biggest remaining surface
Mining the post-v0.1 corpus showed kubectl is the dominant remaining domain by
far: 11,291 commands across 243 sessions (more than everything else combined).
This adds the full k8s verb-group built on an app→namespace→pod resolver (most
namespaces hold one app, so <app> defaults to the namespace and the target
defaults to deploy/<app>, letting kubectl resolve the pod; -n/--pod/-c/-l/--tty
override).

Read: status (pods + non-Normal events), get, logs, describe, debug (one-shot
triage), pf, rollout-status. Write/operational: db (the dbaas psql/mysql exec
pattern — PG via pg-cluster-rw -c postgres, MySQL via mysql-standalone-0 with the
env-password bash wrapper, never inline), exec, rm-pod (pods/jobs ONLY), restart.
Config-mutation verbs (apply/edit/patch/scale/create) are deliberately NOT
exposed — they stay raw per the Terraform-only policy.

Smoke-verified read verbs against the live cluster (get/logs/rollout-status);
write verbs are unit-tested (resolver, db-plan, shell-quoting) but not fired at
live state.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 22:29:51 +00:00
Viktor Barzin
66caa0bf7f homelab: v0.1 docs, distribution wiring, and version
Some checks are pending
Build infra CLI / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
Completes v0.1: documentation, build/install path, and version stamping.

- cli/VERSION (v0.1.0) stamped into the binary via ldflags.
- cli/README.md rewritten as the homelab overview (verbs + tiers, manifest,
  build, the preserved legacy webhook use-cases).
- docs/adr/0004-0006: why homelab exists (grown in place from infra/cli, not a
  separate repo), v0.1 scope + everything-allowed/tiers-recorded, and the
  work/tf behaviour (native worktree entry, verification-gated auto-land,
  presence-coupled apply).
- setup-devvm.sh builds cli/ -> /usr/local/bin/homelab each provisioning run
  (t3-dispatch pattern), so every devvm user gets the current binary.
- AGENTS.md: discovery pointer under Common Operations.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 19:25:51 +00:00