The in-cluster svc name mailserver.mailserver.svc.cluster.local fails Authentik's strict STARTTLS hostname verification (CERTIFICATE_VERIFY_FAILED): the mailserver serves the *.viktorbarzin.me wildcard cert, which doesn't cover the svc DNS name. Use the public name mail.viktorbarzin.me, which resolves in-cluster (10.0.20.1) and matches the cert. Verified end-to-end from an authentik pod (verified TLS + SASL auth + send) before this change.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Authentik email was unconfigured (localhost), so the TripIt enrollment flow's email-verification stage couldn't send. Add AUTHENTIK_EMAIL__* to server.env + worker.env pointing at the in-cluster mailserver as noreply@viktorbarzin.me (587/STARTTLS), with the SASL password synced from Vault secret/authentik.smtp_password via a new authentik-email ExternalSecret (reloader-annotated). Image pin unchanged (2026.2.4 == live). Prereq for the tripit-enrollment flow.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The first apply of the signin-speedup change triggered a ~50min authentik
outage (and a shared CNPG primary failover): the helm chart pin (2026.2.2)
silently DOWNGRADED the Keel-managed live image (2026.2.4) against an
already-migrated DB, default liveness probes kill-looped pods queuing on
authentik's migration advisory lock, and kills mid-migration left ghost
idle-in-transaction sessions holding that lock. Full analysis in
docs/post-mortems/2026-06-10-authentik-downgrade-boot-storm.md.
Hardening (all root causes):
- values.yaml: pin global.image.tag to the Keel-managed live tag (2026.2.4)
so helm applies can never downgrade under Keel again
- values.yaml: server livenessProbe 6x10s/5s (was chart-default 3x10s/3s)
- values.yaml: REMOVE AUTHENTIK_POSTGRESQL__CONN_MAX_AGE (session-mode
pgbouncer pins persistent conns 1:1 -> pool saturation, 58s/s waits)
- pgbouncer.ini: idle_transaction_timeout=300 reaps ghost lock holders;
pgbouncer.tf gets a config-checksum annotation so ini changes roll pods
- authentik_provider.tf: drop the completed import stanza (adoption rule)
- traefik: suppress pre-existing keel.sh annotation/tier-label drift on
auth-proxy/bot-block/x402/error-pages deployments (KEEL_LIFECYCLE_V1
pattern) so applies stop stripping live Keel state
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Viktor asked to review Authentik and the web tier and make first-time
signin to apps faster. Review found the slowness is screens and round
trips, not server time. Changes:
- values.yaml: the authentik.* Helm values (gunicorn workers, cache
timeouts, conn_max_age) were silently INERT because existingSecret
skips chart env rendering — pods ran defaults (2 workers, 300s
caches, no persistent DB conns). Moved all tuning into
server.env/worker.env, which actually reaches the pods.
- authentik_provider.tf: adopt the identification stage and pin
password_stage so username+password render on ONE screen (the
separate order-20 password binding is deleted via API — authentik
requires that when embedding). Outpost log_level trace->info and
1->2 replicas (it is on the hot path of every forward-auth request;
PG-backed sessions make 2 replicas safe).
- authentik module: /static ingress carve-out with immutable
Cache-Control (assets are version-fingerprinted but served with no
max-age — internal split-horizon users got zero caching).
- traefik auth-proxy nginx: upstream keepalive 32 + HTTP/1.1 (was
opening a fresh TCP connection to the outpost per subrequest) +
config-checksum annotation so config changes roll the pods.
- docs: authentication.md + authentik-state.md updated; fixed stale
'postgresql.dbaas has no endpoints' claim in CLAUDE.md/CONTEXT.md
(it is a live CNPG primary-selector compatibility service).
Done via API in the same change (UI-managed objects): 6 OIDC providers
(Vault, Forgejo, Immich, Headscale, linkwarden, Cloudflare Access)
switched from explicit to implicit consent — all first-party, the
4-weekly consent screen only slowed first-time signin.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and
rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto
it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard +
--max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault
secret/stem95su. Requires the GCP OAuth app published to Production or the
refresh token expires ~weekly.
Lands the gdrive-sync stack on master (it had landed on a feature branch
by accident on the shared devvm checkout).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Workers handle background tasks only (LDAP sync, email, certificate
renewal) — no user-facing traffic, so 2-of-3 redundancy isn't load-
bearing. Reduces sustained CPU by ~100m.
Server replicas unchanged at 3 (PDB minAvailable=2 — user-facing).
PgBouncer pool unchanged at 3 (DB connection pooling).
- Adopt UserLoginStage (default-authentication-login) into Terraform
and pin session_duration=weeks=4 so users stay logged in across
browser restarts. There is no Brand.session_duration in 2026.2.x;
UserLoginStage is the only correct lever.
- Cap anonymous Django sessions at 2h via
AUTHENTIK_SESSIONS__UNAUTHENTICATED_AGE on server + worker pods
(default is days=1). Bots, healthcheckers, and partial flows now
get reaped within 2h instead of accumulating for a day.
Implementation note: the env var is injected via server.env /
worker.env rather than authentik.sessions.unauthenticated_age,
because authentik.existingSecret.secretName is set, which makes the
chart skip rendering its own AUTHENTIK_* Secret. authentik.* values
are therefore inert in this stack -- this is documented in
.claude/reference/authentik-state.md so future edits use the right
surface.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- linkwarden: add Reloader match annotation to DB secret so pods
auto-restart on Vault credential rotation (was causing 100% 5xx)
- authentik: increase memory limits (server 1Gi→1.5Gi, worker 896Mi→1Gi)
to prevent OOM kills
- prometheus: drop 113k high-cardinality series to reduce HDD write rate
from ~8.8 to ~6.0 MB/s (31% reduction):
- drop all traefik/apiserver/etcd histogram bucket metrics
- drop goflow2_flow_process_nf_templates_total (9.3k series)
- drop container_tasks_state and container_memory_failures_total
- rewrite HighServiceLatency alert to use avg latency (_sum/_count)
- update cluster_health dashboard to match
- raise KubeletRuntimeOperationsLatency threshold from 30s to 60s
Phase 1 of platform stack split for parallel CI applies.
All 3 modules were fully independent (no cross-module refs).
State migrated via terraform state mv. All 3 stacks applied
with zero changes (dbaas had pre-existing ResourceQuota drift).
Woodpecker pipeline updated to run extracted stacks in parallel.