Now that the native-auth rollout is complete: (1) AUTH_MODE hybrid->normal — the legacy Authentik OIDC-bearer + forward-auth arms were removed in #96, and 'hybrid' already resolved to 'normal' via backward-compat parsing; this makes it explicit and corrects the now-false comment. (2) SMTP_FROM plans@->trips@ — the dedicated native-auth sender; the trips@->spam@ send-as alias is live + verified (RCPT 250). (3) TRUST_FORWARDED_FOR=true — so #95's per-IP signup rate-limit keys on the real client behind Traefik, not the shared ingress pod IP. Env-only; the Deployment image is KEEL_IGNORE_IMAGE (lifecycle-ignored), so this does NOT touch the running image. Reloader restarts the pod to pick up the new env.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor wanted a glanceable number on the Wealth dashboard for how much
he can spend for the rest of his life — spending the whole net worth
down to zero by age 100.
Adds a third line of six stat tiles to the Overview section, two
equations × three cadences (per day / month / year):
• FLOOR — net worth ÷ time remaining to age 100. Treats the money as
cash (no growth, no inflation): a conservative lower bound.
≈ £43/day, £1.3k/mo, £15.8k/yr.
• 4% REAL — die-with-zero annuity: the constant, inflation-adjusted
spend that drains the balance to £0 at 100 while it keeps earning
4% real. PMT = NW·r/(1−(1+r)^−n). ≈ £133/day, £4.0k/mo, £48.5k/yr.
Horizon is today → his 100th birthday (DOB 1998-10-04 → 2098-10-04),
computed live so the figures tick as net worth and the horizon move.
Net worth reuses the existing latest-per-account dav_corrected math, so
the tiles always agree with the "Net worth (current)" stat (pension
included; target £0). The 4% real rate is hard-coded per his "keep it
simple, just a number" steer — a one-line SQL edit to change later.
Layout: tiles inserted at y=9; all sections below shifted down 4 rows.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
kyverno v1.16 supports k8s <=1.34, so it was one of the two addons blocking the
autonomous 1.35 upgrade (compat gate, nightly). v1.18 supports 1.35.
Stepped one minor at a time per the kyverno upgrade guide (per-minor CRD notes):
3.6.1 (1.16) -> 3.7.2 (1.17.2) -> 3.8.1 (1.18.1), each hop applied + verified
supervised. atomic=true (auto-rollback on a failed rollout) + forceFailurePolicyIgnore
(admissions stay open mid-roll) kept it safe. Values schema confirmed compatible
across 3.6->3.8 (forceFailurePolicyIgnore still under features:).
Verified after each hop: all 17 ClusterPolicies stayed Ready, admission controller
2/2, no destroys/replaces in plan. Final 1.18.1: images v1.18.1, mutating webhook
live (server-side dry-run injects ndots:2 in a non-excluded ns). compat-gate vs
1.35.6 now lists ONLY external-secrets (kyverno cleared). ESO 0.12->2.x
(v1beta1->v1, 73 files) is the last remaining 1.35 blocker — to be planned.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds a daily visibility layer so every night's autonomous-upgrade outcome is
reviewable at a glance during the upgrade-cleanup window (Viktor: "track every
night's upgrade for the next 7 days; clean up all bugs and blockers").
Last night (2026-06-20) confirmed BOTH prior fixes work in production: the
detector resolved target 1.35.6 (k8s_upgrade_available) and the compat gate
correctly REFUSED it (k8s_upgrade_blocked=1 -> K8sUpgradeBlocked) because ESO
v0.12 (<=1.31) and kyverno v1.16 (<=1.34) don't support 1.35.
What's here:
- CronJob k8s-upgrade-nightly-report (06:07 UTC) -> one Slack summary/morning:
running version, detector freshness, detected target, outcome (no-op /
blocked+live reasons / upgraded / in-progress / detector-stale), recent jobs.
Read-only: reads Pushgateway gauges + live nodes/jobs, re-runs compat-gate.py
for fresh blockers; reuses the chain SA + slack_webhook + scripts ConfigMap.
Pure helpers unit-tested (test_nightly_report.py, 8 cases incl. a real
v-prefix bug TDD caught). Verified end-to-end in-cluster (posted to Slack).
- K8sUpgradeChainJobFailed regex scoped from `k8s-upgrade-.*` to
`k8s-upgrade-(preflight|master|worker|postflight)-.*` so the new report job
(or any future helper) can't false-trip the chain-wedged alarm.
Manual state repair (no git artifact): imported the orphaned `alert-digest`
CronJob into the monitoring stack state
(`tg import module.monitoring.kubernetes_cron_job_v1.alert_digest monitoring/alert-digest`).
Root cause: when alert_digest was added (2026-06-12) the apply recorded its
ConfigMap + Secret but not the CronJob, so every full monitoring apply since has
failed with `cronjobs.batch "alert-digest" already exists` (Woodpecker pipeline
298 today) — surviving only via targeted prometheus applies. Now in state, so
monitoring CI applies cleanly again.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Last night (2026-06-20) the detector + compat-gate fixes worked: the chain
resolved target 1.35.6 and the gate correctly REFUSED it (ESO 0.12 + kyverno
1.16 don't support 1.35), pushing k8s_upgrade_blocked=1 -> K8sUpgradeBlocked
fired as designed. But the refusal also made the preflight Job exit 1
(block() exits 1 on purpose so the Failed Job re-spawns nightly), which tripped
K8sUpgradeChainJobFailed too — a duplicate, misleading "pipeline wedged" alarm
for what is the intended halt-and-alert outcome.
Fix: gate the alert with `unless on() k8s_upgrade_blocked == 1`. A deliberate
block sets that gauge (and it stays 1 until the next preflight resets it), so
the chain-job-failed alert is suppressed for the blocked period; a genuine
wedge / crash / halt-on-alert exits 1 WITHOUT setting it, so it still fires
(preserving the alert's original purpose — catching the pre-in_flight preflight
failure that hid the 5-day 1.34.9 wedge). Runbook + automated-upgrades docs
updated to match.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor wanted the freshness tile to cover all three main holdings
(META, VUAG, VUSA), not only the single stalest one. Dropped LIMIT 1 so
the stat renders one value per held position (worst-first), switched the
tile to horizontal orientation so the three values sit side-by-side, and
updated the description. Each value is coloured by its own age threshold
(META red ~2mo, the Vanguard ETFs green ~2d). No threshold or datasource
change.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Zero live ingresses reference traefik-crowdsec@kubernetescrd (PR1 + a
cluster-wide targeted ingress re-apply confirmed 0), so the crowdsec Middleware
CRD and the broken Yaegi bouncer plugin can be removed without orphaning any
router. Removes: the `crowdsec` Middleware, the crowdsec-bouncer plugin (static
config + initContainer download + state.json entry), the captcha template
ConfigMap + volume + captcha.html, the Turnstile widget + data.cloudflare_accounts,
and the 3 now-unused module vars. Also drops the `crowdsec` middleware from the
catch-all error-pages IngressRoute chain (the one remaining CRD-level reference,
which an Ingress-annotation grep does not surface) so that router is not orphaned
when the Middleware is deleted; it keeps rate-limit. Enforcement is fully handled
out-of-band now: cs-firewall-bouncer (in-kernel nftables, direct hosts) +
Cloudflare IP-List/WAF (proxied hosts). The api-token-middleware plugin is
deliberately preserved (still used by paperless-mcp).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The council-complaints app (Islington civic-reporting pilot) has been
abandoned. It was already dead in the cluster (deployments scaled 0/0,
image only on the decommissioned registry.viktorbarzin.me which 404s),
and it was never in Terraform — only docs + a kyverno comment referenced
it. Its live cluster resources (namespace, both NFS-backed PVs, ingresses)
were torn down out-of-band via kubectl (nothing in TF to drift from); the
DB-dump PVC was backed up to NFS first.
This removes the remaining repo references to the live app:
- service-catalog.md: drop the council-complaints row
- ci-cd.md + .claude/CLAUDE.md: drop it from the GHA->ghcr app list
- kyverno require-trusted-registries: the registry.viktorbarzin.me/*
allowlist comment claimed council-complaints as the last referencer;
rewrite it (no live workload pulls from that registry now; only stale
completed Job records still carry the ref). The allowlist line itself
is kept (registry-scoped, not app-specific).
Historical point-in-time plan docs (docs/plans/2026-05-16-auto-upgrade-
apps-{design,plan}.md) still mention it inside a frozen "10 GHA-migrated
repos (memory id=388)" snapshot; left as-is so the dated record stays
accurate.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two follow-ups Viktor asked for on the Price freshness panel:
- Expand every section by default. Grafana's collapsed rows hide their
child panels; just flipping collapsed=false leaves a non-canonical shape
(confirmed via the Grafana API that it keeps the panels nested rather
than hoisting them), so each row is now collapsed=false + panels=[] with
its children hoisted to top-level -- the exact form Grafana writes when
you expand-and-save. Row headers revert to their original y (the child
y-coords were already expanded-layout coordinates).
- Stop the freshness stat from taking its own line. It's now the 6th tile
in the existing returns row (1d/7d/30d/90d/12mo + freshness), all width 4
at y=5; the collapsed-row y-shift from the previous commit is undone.
No query or threshold changes. The large diff is mechanical: 12 child
panels re-indent from nested to top-level.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor was worried about stale prices silently distorting net worth.
Confirmed it's real: META's quote has been frozen at 2026-04-17 (65 days
old) while the dashboard keeps valuing the ~55-share position at that
stale close; the Vanguard ETFs are current. Nothing flagged it.
Adds one compact stat to the Overview row showing the most out-of-date
HELD position's quote age (symbol + humanised age), colour-coded: green
<=4d (weekend/bank-holiday tolerant), amber 5-9d, red >=10d. Pure read of
the quote_latest mirror via the wealth-pg datasource, held positions
only, LEFT JOIN so a held symbol with no quote at all sorts as max-stale.
The six collapsed rows below shift down 4 grid units to make room; no
other panel touched.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`ha token` originally read openclaw/openclaw-secrets -> skill_secrets, which only
cluster admins can read — so it hung/failed for the non-admin operator it was
built for (emo = emil.barzin@gmail.com, OIDC group "Home Server Admins", whose
identity is deliberately barred from secrets in the openclaw namespace).
Split the HA tokens into a dedicated secret openclaw/ha-tokens (keys sofia/london)
with a Role + RoleBinding granting `get` on JUST that secret to the Home Server
Admins group (k8s RBAC can't scope to a JSON sub-key, hence a separate object).
emo now resolves the HA token with their own identity, WITHOUT gaining the rest
of skill_secrets (slack_webhook, uptime_kuma_password). openclaw's own deployment
keeps reading openclaw-secrets — purely additive.
- stacks/openclaw/ha_tokens.tf: new secret + least-privilege Role/RoleBinding
- cli/cmd_ha.go: read openclaw/ha-tokens (raw base64 per-instance key); drop JSON parse
- README + ADR-0012 updated; VERSION -> v0.7.1
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The first PR1 commit only dropped the ingress_factory reference + the 8
exclude_crowdsec call sites. But the crowdsec middleware is ALSO hard-coded
(not via the variable) in 6 more ingresses that build their middleware chain by
hand: owntracks, the monitoring Helm values (grafana + prometheus +
alertmanager), and the reverse-proxy module + its own separate ingress factory.
Remove all 6 so that after the full-cluster apply NO live ingress references
traefik-crowdsec@kubernetescrd — the precondition for PR2 deleting the CRD.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Traefik CrowdSec (Yaegi) bouncer plugin enforces nothing on Traefik 3.7.5
(handler never invoked) and is fully superseded by the cs-firewall-bouncer
(in-kernel nftables drop on direct hosts) + the Cloudflare IP-List/WAF rule
(proxied hosts). Drop the `traefik-crowdsec@kubernetescrd` middleware from the
ingress_factory chain and the 8 explicit `exclude_crowdsec = true` call sites,
and delete the now-unused `exclude_crowdsec` variable.
This is PR1 of a 2-phase removal: the reference is removed FIRST (a shared-module
change → full-cluster apply re-renders every ingress without the middleware) so
that PR2 can delete the `crowdsec` Middleware CRD + the plugin itself WITHOUT
leaving any ingress pointing at a missing middleware (which would error those
routers). PR2 MUST NOT land until this has fully applied and zero live ingresses
reference traefik-crowdsec@kubernetescrd.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
One-node validation on k8s-node2 passed: kernel nftables sets created in both
input and forward chains (policy accept), ~31k decisions loaded, a known banned
scanner confirmed in the drop set, pod stable 4h+ with no collateral. Remove the
nodeSelector so the DaemonSet runs on every node — direct-host enforcement now
survives a MetalLB VIP failover to any worker.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The edge CF IP List can't hold the ~31k CAPI community blocklist (already
enforced in-kernel by the firewall-bouncer), so the sync now skips origin=CAPI
and carries only high-signal local/curated decisions (+ a 9000 safety cap).
Also fixes the list-items GET: per_page=1000 returned a misleading CF 400
'invalid or expired cursor' (10027); the endpoint max is 500. Verified live:
crowdsec_ban populates (4 IPs) and the sync exits 0.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor's passkeys all vanished and he was suddenly being asked to log in
multiple times a day instead of ~monthly. Root cause: on 2026-06-18 an ad-hoc
tripit passkey E2E test (run from the devvm as akadmin via python-httpx) cleaned
up "the demo user's" passkeys with GET /core/users/?search={demo} then DELETE
each device of users[0] — but the fuzzy search returned the REAL account, so it
wiped all 6 real passkeys. Losing passkeys forced fallback to Google login, and
the social-login stage (default-source-authentication-login) had the provider
default session_duration=seconds=0, which falls back to UNAUTHENTICATED_AGE=2h —
hence the constant re-logins. (Password + passkey logins were already weeks=4.)
Changes:
- authentik: adopt default-source-authentication-login into Terraform (import)
and pin session_duration=weeks=4, so Google/GitHub/Facebook logins last as long
as password/passkey. Immediate relief without re-enrolling.
- authentik: document the provider-schema gotcha — authentik_stage_identification
exposes no webauthn_stage / enable_remember_me attribute, so they must NOT be in
ignore_changes (commit 4e882989 removed them for this reason; re-adding breaks
every apply). The passkey break was purely the missing device records, not drift.
- edge (rybbit): shield auth so a CrowdSec hit can never wall a user out of login —
carve authentik.viktorbarzin.me + public-auth out of the zone WAF block rule,
make the LAPI->edge sync ban-only (stop downgrading captcha to a hard block),
and set exclude_crowdsec on the Authentik UI ingress (auth keeps rate-limiting).
- docs: record the session-duration change, the edge enforcement + auth carve-out
(previously undocumented), and the pre-existing broken crowdsec-cf-sync CronJob
(CF cursor pagination 400 + ~31k IPs vs list capacity -> edge list inert).
Passkey re-enrollment is a manual user action (devices are gone from the DB);
nothing auto-re-deletes them.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
TripIt's native signup-verification + account-recovery mail (ADR-0028) sends From: trips@viktorbarzin.me while authenticating SMTP as spam@. With SPOOF_PROTECTION on, Postfix smtpd_sender_login_maps requires an EXPLICIT alias (the @domain catch-all doesn't satisfy it) — mirrors the existing plans@->spam@ grant. Must be applied + verified before TripIt flips SMTP_FROM to trips@, else every verification/recovery send is rejected 550.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Each workstation user needs a continuously valid Claude token under their own Enterprise identity. Store only that user's OAuth state in an isolated Vault path, renew and verify it automatically, recover from Vault when possible, and alert when interactive SSO is required.
TripIt external users are now LOCAL TripIt accounts (ADR-0028 native passkey + Authentik OIDC), so the Authentik-side self-enrollment machinery is dead. Removes the tripit-enrollment + tripit-recovery flows and all their stages/prompts/policies/bindings, the tripit-email-stages blueprint (+yaml), and the 'TripIt External' group; reverts the admin-services-restriction fence branch that contained those users (its sole member, the leftover tripit-demo@ test account, was deleted first, so the revert affects zero live principals). Real external collaborators (type=external) are untouched. tg plan: 0 add, 1 change (the policy expression), 20 destroy (all tripit_*). Closes tripit#97; moots the B2 per-app OIDC fences.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CF account hard-limits to 1 Rules List, so proxied enforcement uses one crowdsec_ban
list + one WAF block rule; the sync writes both ban and captcha decisions into it
(captcha downgraded to block at the edge). Drops the second list + managed_challenge
rule. Trivial touch to firewall_bouncer.tf to make CI re-apply crowdsec and recreate
the DaemonSet (tar fix already in master; stale orphan was cleared).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
v4.52.7 import id must be zone/<zone_id>/<ruleset_id>; add depends_on so the
crowdsec_ban/captcha lists exist before the WAF rules reference them.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- initContainer used GNU tar --wildcards which fails on the busybox curl image (pod Init:Error); switch to extract-all + cp via shell glob.
- cloudflare_ruleset hit the per-zone singleton conflict; import the existing 'default' http_request_firewall_custom ruleset and manage all rules — CrowdSec ban/captcha first, the pre-existing disabled skip rule preserved verbatim.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
tg plan verified the agent's guess 'Account Filter Lists Edit/Read' is not a key in the v4.52.7 permission-group map; the live CF API lists the correct account-scoped groups as 'Account Rule Lists Read'/'Write'.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
910d5892 landed the [git.timeout] + [git.config] env in master, but the CI apply
skipped stacks/forgejo (the changed-stack-diff race after a sync-merge), so the
Forgejo deployment never picked it up. A trivial comment touch to force a clean
apply of the stack so the durable push-mirror fix actually takes effect.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replaces the Worker+KV approach (which only covered the ~27 routed hosts) with a
zone-wide mechanism that covers ALL proxied hosts: two CF account IP Lists
(crowdsec_ban, crowdsec_captcha) + one zone WAF custom rule that blocks
`(ip.src in $crowdsec_ban)` and managed-challenges `(ip.src in $crowdsec_captcha)`.
No per-request Worker, no cookie machinery — the rybbit Worker stays
analytics-only. lapi_kv_sync.py now full-reconciles the two lists from LAPI
(fail-safe: a LAPI blip skips the run and freezes the last-known-good block set;
serializes CF bulk ops since CF allows one pending op per account). A
least-privilege CF API token (Account Filter Lists Edit) is minted in TF.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Drops banned source IPs in-kernel via nftables (hooks input+forward, so DNAT'd
LoadBalancer traffic is caught before reaching Traefik) for DIRECT hosts — the
direct-side replacement for the dead Traefik plugin, zero per-request hop.
No published image exists, so an initContainer fetches the pinned official
static binary (v0.0.34) onto a stock debian-slim base (nftables backend uses
netlink directly, no nft CLI needed). hostNetwork + NET_ADMIN/NET_RAW (not
privileged). Config (with api_key) in a Secret, Reloader-annotated. crowdsec ns
is already in the Kyverno wave-1 exclude list, so the privileged/hostNetwork pod
is admitted. Pinned to k8s-node2 (runs a Traefik pod) for one-node validation
before the nodeSelector is removed to roll cluster-wide. Fail-open by element
timeout if the bouncer stops.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
New stack for the realtime voice agent — v2 of the portal-assistant brain
path. One persistent WebSocket per conversation: continuous mic audio ->
Silero VAD turn-taking -> Whisper STT (portal-stt) -> streaming Claude brain
(claude-agent-service) -> edge-tts (portal-tts) -> audio out, with barge-in.
Reuses all three upstream cluster services; nothing new is spun up.
Public Cloudflare ingress (proxied, WebSocket) at portal-realtime.viktorbarzin.me
with the app's own DEVICE_TOKEN as the edge gate (auth="app" — Authentik would
break the native Portal client). No buffering middleware: it would break the
streaming WebSocket. Image ghcr.io/viktorbarzin/portal-assistant-realtime
(private ghcr, pulled with ghcr_pull_token). Sibling to the v1 portal-assistant
gateway, which stays live.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The compat gate compared every addon's matrix ceiling against the target
k8s minor unconditionally. That is correct for a minor JUMP, but it also
blocked patch upgrades within the minor the cluster is ALREADY running:
ESO v0.12's matrix ceiling is 1.31, the cluster runs 1.34.9, so a target of
1.34.10 (a patch) was refused with "external-secrets supports k8s <= 1.31;
target 1.34 exceeds it" — even though the running cluster is itself proof ESO
0.12 works on 1.34. That silently defeats autonomous patching (it would have
bitten the moment a 1.34.10 was published).
Fix: a target at or below the running minor crosses into no new k8s minor, so
every installed addon is already empirically proven on it — check_addons now
returns no reasons when target_minor <= running_minor. Added running_minor()
(oldest kubelet across nodes, mirroring the detector; RUNNING_K8S env override
for tests) and pass it in. Minor jumps are unchanged: 1.34->1.35 still blocks
on ESO 0.12 + kyverno 1.16. removed-API + containerd checks are naturally
inert for patches (no API removal / containerd floor inside a minor) and keep
running as defence. Added test_compat_gate.py (8 cases) covering both paths.
Verified end-to-end against live Prometheus: target 1.34.10 -> EXIT 0 (safe),
target 1.35.6 -> EXIT 2 (blocked on ESO+kyverno).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Seeds two new bouncers at LAPI startup (BOUNCER_KEY_kvsync, BOUNCER_KEY_firewall)
from Vault secret/platform, mirroring the existing BOUNCER_KEY_traefik wiring.
These are the two halves of the real enforcement that replaces the dead Yaegi
plugin: kvsync authenticates the LAPI->Cloudflare-KV sync (proxied edge Worker),
firewall authenticates the cs-firewall-bouncer DaemonSet (direct-host nftables).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The next-minor Packages query used `curl -sf` without -L. pkgs.k8s.io
302-redirects every request to a backing host, so without -L curl returned
an empty body, NEXT_MINOR_PATCH came back empty, and the detector fell
through to "No upgrade needed". That is exactly why last night's 23:00 chain
no-op'd instead of resolving the 1.35 next-minor target (1.35.6) and handing
it to the compat gate. `curl -sfL` follows the redirect and returns the
Packages file (verified: -sf -> empty, -sfL -> 1.35.6). Mirrors the same
-L fix already applied to the Release availability probe (-sILo) above.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The tripit Forgejo->GitHub push-mirror silently stalled: `git cat-file
--batch-all-objects` over the NFS-backed repo exceeded the default git deadline
once ~4500 loose objects accumulated (gc.auto's 6700 threshold hadn't fired), so
pushes stopped reaching GitHub and prod deploys stalled. Raise [git.timeout]
(DEFAULT/MIRROR/GC) so a slow object enumeration can't abort the mirror, and set
[git.config] gc.auto=1000 so post-push autogc + the git_gc_repos cron keep repos
packed (the real fix). A one-off forced gc already unblocked tripit; this prevents
recurrence across all repos.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pure-stdlib script (alert_digest pattern, runs on stock python:3.12-alpine) that
projects CrowdSec Ip-scope ban/captcha decisions into the Workers KV namespace
the edge Worker reads on each proxied request. Full-reconcile per run so an
un-ban clears from the edge within one interval; fail-safe (a LAPI read error
skips the run and leaves existing bans to expire by TTL = fail-open, never a
stale all-block). TF wiring (KV namespace + CronJob + key registration) follows.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Preparing for real CrowdSec enforcement (edge Cloudflare Worker for proxied
hosts + cs-firewall-bouncer for direct hosts). Both enforce by dropping the
real source IP, so if an internal/RFC1918 address ever ended up in a ban
decision it could blackhole legitimate internal traffic. Whitelisting the
cluster/LAN/tailnet ranges (10/8, 172.16/12, 192.168/16, 100.64/10) at the
CrowdSec parser layer makes that structurally impossible — a trusted source
can never produce a decision in the first place. Public IP already whitelisted.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Base for the stepped 3.26->3.28->3.30->3.32 upgrade (k8s 1.36 prereq; 3.26 is
already unsupported on k8s 1.34). Manage ONLY the operator via the official
tigera-operator Helm chart (chart ver == Calico ver); installation.enabled=false
keeps the live Installation CR operator-managed so Helm never touches calico-node.
Adopted in place: existing operator Deployment/SA/ClusterRole/ClusterRoleBinding
pre-stamped with Helm ownership metadata (transient migration step), then the
release imported via a plan-verified create (1 to add, 0 destroy). Verified clean:
calico-node 7/7 unchanged, tigerastatus green. Closes the year-deferred adoption
(code-3ad) for the operator without adopting the Installation CR.
After bumping to v1.6.0 (stream goroutine runs) and disabling redis (in-memory
cache), the plugin logs `handleStreamCache:updated` but still does NOT enforce:
a ban present in the LAPI stream AND pulled by the plugin still let the banned IP
through. Stream-mode decision matching is unreliable under Traefik's Yaegi
interpreter here. Switch crowdsecMode stream->live: the plugin queries LAPI
synchronously per request (result cached per-IP for defaultDecisionSeconds), which
enforces reliably and picks up new decisions immediately. LAPI is 3-replica +
in-cluster so per-request latency is small; fail-open preserved (updateMaxFailure=-1).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
With the plugin on v1.6.0 the stream goroutine finally runs, and its slog output
revealed the real blocker: `handleStreamTicker ... isCrowdsecStreamHealthy:true
cache:unreachable`. The LAPI stream is healthy, but the plugin's redis client
cannot reach the cache under Traefik's Yaegi interpreter — even though
redis-master.redis.svc is reachable AND writable from the traefik namespace
(SET/GET verified via busybox; no NetworkPolicies; no auth). Same interpreter
-incompat class as the stream goroutine itself. With redisCacheUnreachableBlock
=false the bouncer then failed open and enforced nothing.
Disable the redis cache so the plugin uses its in-memory decision store (works
under Yaegi). Removes redisCacheHost/redisCacheUnreachableBlock. Trade-off:
captcha already-solved grace is per-pod across the 3 Traefik replicas (at worst
an occasional re-solve) — acceptable; bans/captcha decisions enforce correctly.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The crowdsec-bouncer Yaegi plugin pinned at v1.4.2 loads on Traefik 3.7.5 but
its decision-stream goroutine never runs — no Traefik pod ever calls the LAPI
stream (verified: no traefik-pod bouncer entry / no @pod-ip auto-registration),
and it logs nothing. All deps are healthy (LAPI 200 + full ban list reachable
from the traefik ns, key valid, redis PONG, config correct, no NetworkPolicies),
so CrowdSec enforced nothing despite the bouncer now being registered. This is
the Traefik-v3 / Yaegi plugin-incompat class that already killed rewrite-body
here. v1.4.2 predates Nov 2025; latest is v1.6.0.
Bump to v1.6.0 (initContainer download URL + state.json + experimental.plugins
version). Config-verified compatible: every key we use survives (crowdsecMode,
crowdsecLapiKey/Host, updateMaxFailure, redisCache*, clientTrustedIPs, all
captcha* incl. turnstile); v1.6.0 also moves logging to slog/trace for future
diagnosis. Pinned, not auto-updated (Keel can't manage a Yaegi plugin, and
plugin bumps must be tested against the running Traefik/Yaegi).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Per Viktor: GitHub sign-up must work zero-click (account created on first login,
no form). This global [oauth2_client] setting enables it. It conflicts with
Authentik (preferred_username is an email → invalid Forgejo username → 500 on
auto-create), and Viktor's Forgejo email (me@viktorbarzin.me) doesn't match his
Authentik email (vbarzin@gmail.com) so account-linking can't bridge it — so the
Authentik OAuth2 source is DISABLED (login_source.is_active=0; DB-managed,
out-of-band) per his directive. Forgejo sign-in is now GitHub + native login.
Committed via API to land on origin without pushing a concurrent agent's unpushed
local commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ENABLE_AUTO_REGISTRATION is a global [oauth2_client] setting (all OAuth sources).
On Authentik sign-in, Forgejo auto-created an account and derived the username
from Authentik's preferred_username claim — which is the user's email
(vbarzin@gmail.com), invalid as a Forgejo username (no '@') → CreateUser failed
→ 500 on the OAuth callback. (GitHub's username claim is valid, so only Authentik
broke.) Reverting to the standard link/register flow fixes both; GitHub sign-up
still works via a one-step register form. Committed via API to touch only main.tf
(forgejo-only CI apply) so it doesn't collide with concurrent crowdsec work.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Traefik bouncer plugin's API key was never registered with LAPI — the
crowdsec stack reads many keys from Vault but not ingress_crowdsec_api_key, and
the chart registers no bouncer. So LAPI returned 403 to the plugin, which with
updateMaxFailure=-1 failed open and enforced NOTHING: no community-blocklist
bans, and the (now-Turnstile-wired) captcha never fired. cscli bouncers list was
empty; the registration was likely lost in the MySQL->PostgreSQL DB migration
with no IaC to recreate it.
Seed the bouncer at LAPI startup via BOUNCER_KEY_traefik, valued from the same
Vault key the middleware presents — so they match by construction, and the
bouncer re-registers automatically on every LAPI start (survives DB wipes).
- stacks/crowdsec/main.tf: read ingress_crowdsec_api_key, pass to module.
- module main.tf: new sensitive var + thread into the values templatefile.
- values.yaml: BOUNCER_KEY_traefik on lapi.env.
- docs/architecture/security.md: document registration + fail-open history and
the proxied-app coverage caveat.
Activates enforcement (community blocklist bans + captcha) on non-proxied apps;
internal IPs stay bypassed (clientTrustedIPs), fail-open-on-LAPI-down preserved.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The traefik helm_release had no chart version pin, so a refreshed helm repo
index resolves `chart = "traefik"` to the latest (41.0.0), whose values schema
rejects this stack's `logs` block ("Additional property logs is not allowed") —
an unpinned apply attempts that upgrade and fails (atomic rollback). Pin to the
deployed 40.2.0 (release rev 57, since 2026-05-30) so applies are deterministic;
chart bumps must be deliberate with a values migration.
Follow-up to fd0c7493 (Turnstile captcha), which was applied with this pin
already in live TF state — this lands the pin in git to remove the drift.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor wanted people to be able to sign up with GitHub, not just the
native form or Authentik SSO.
- Added a GitHub OAuth2 login source via `forgejo admin auth add-oauth
--provider github` (name "github", matching the callback registered on
the GitHub OAuth App). Like the existing Authentik source, it lives in
Forgejo's DB rather than Terraform — there's no clean TF resource for
login sources. Client id/secret mirrored to Vault secret/viktor
(forgejo_github_oauth_client_id / _secret) for recovery.
- This commit's TF change: ENABLE_AUTO_REGISTRATION=true in
[oauth2_client], so a first GitHub sign-in creates the account directly
("sign up with GitHub") instead of a link-to-existing detour. The
GitHub identity is the trust gate for this path; Turnstile + email
confirmation still gate the native form.
Verified: GitHub recognises the client id, Forgejo's /user/oauth2/github
redirects to GitHub's authorize URL with the correct client id +
callback, and the login page renders the button. Final browser
click-through is the user's to do.
Runbook updated: docs/runbooks/forgejo-open-signups.md (GitHub section +
secret-rotation + DB-loss recreate steps).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CrowdSec LAPI already issues `captcha`-type decisions for lower-severity abuse
(http-429-abuse, http-403-abuse, http-crawl-non_statics, http-sensitive-files),
but the Traefik bouncer plugin had no captcha provider configured — so those
decisions silently fell through to a 403 ban (traced in the plugin's bouncer.go
@ v1.4.2: captchaClient.Valid==false => handleBanServeHTTP). Flagged users had
no way to self-unblock, contradicting the profile's stated intent.
Wire Cloudflare Turnstile as the bouncer's captcha provider so a captcha
decision now renders a solvable challenge instead of a hard block:
- New cloudflare_turnstile_widget.crowdsec_captcha (managed mode), scoped to
viktorbarzin.me so one widget covers every subdomain the bouncer fronts.
Mirrors the existing Forgejo-signup Turnstile pattern; sitekey + secret are
passed into the traefik module.
- middleware.tf: captchaProvider=turnstile + site/secret keys + grace 1800s +
captchaHTMLFilePath=/captcha/captcha.html.
- Vendor the plugin's captcha.html and mount it into the Traefik container at
/captcha via the chart `volumes` value — the pulled Yaegi plugin does not
expose its bundled template to Traefik.
- docs/architecture/security.md: document the ban-vs-captcha remediation split.
- Remove the dead crowdsec-ingress-bouncer.yaml (unused nginx bouncer with
placeholder reCAPTCHA keys; referenced by zero .tf).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Viktor wants Forgejo open for anyone to sign up, but without bot/spam
account floods. Flip the deployment from OAuth-only registration
(ALLOW_ONLY_EXTERNAL_REGISTRATION=true) to allowing native local
sign-up, and add two bot gates on the registration form:
- Cloudflare Turnstile captcha (CAPTCHA_TYPE=cfturnstile). The widget
is managed in Terraform (turnstile.tf) via the CF Global API key, so
the sitekey/secret are IaC, not a dashboard artifact.
- Mandatory email confirmation (REGISTER_EMAIL_CONFIRM=true). Wire the
Forgejo mailer to the cluster mailserver as noreply@viktorbarzin.me
(mail.viktorbarzin.me:587 STARTTLS), reusing the same Vault-sourced
credential Authentik uses (email-secret.tf ESO -> secret/authentik
smtp_password).
Existing Authentik OAuth2 login is unchanged (additive). Deployment env
appended (not inserted) so the diff stays purely additive; a reloader
annotation rolls the pod on secret rotation.
Verified live: signup page renders the Turnstile widget, mailer delivers
a test message end-to-end, Forgejo healthy, plan-to-zero after apply.
Runbook: docs/runbooks/forgejo-open-signups.md
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>