infra

Author	SHA1	Message	Date
Viktor Barzin	d94f267c93	immich: upgrade v2.7.5 → v3.0.0 (postgres → vectorchord 0.4.3, frames → immich_v3 tag) [ci/woodpecker/push/default: pipeline successful] Viktor asked to upgrade Immich to the just-released v3.0.0 (release notes, migration guide and release discussion #29439 reviewed — no config-breaking changes for this stack: we already use the split MACHINE_LEARNING_PRELOAD vars, don't set DB_VECTOR_EXTENSION, OAuth goes through Authentik over HTTPS, and the GPU node's CPU meets the new x86-64-v2 requirement). The Immich Postgres image moves to VectorChord 0.4.3 to match the upstream v3 reference stack (0.3.0 is still within v3's supported range '>=0.3 <2'; Immich upgrades the extension itself at startup). Both photo frames switch to ImmichFrame's immich_v3 compatibility tag because every versioned ImmichFrame release (≤ v1.0.33.0) crashes deserializing Immich v3 API responses; repin to a versioned tag once upstream ships stable v3 support. Deployment images are Keel-managed (KEEL_IGNORE_IMAGE, policy=patch), so this commit is the source-of-truth record; the live rollout happens via kubectl set image in the same session. Pre-upgrade pg_dumpall taken (job postgresql-backup-pre-v3). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 14:18:22 +00:00
Viktor Barzin	6f03ccd1aa	excalidraw: grant emo-browser SA port-forward for drawing uploads [ci/woodpecker/push/default: pipeline successful] Viktor asked to fix emo's permission so his Claude can upload to the Excalidraw service. emo's recent sessions show the documented upload recipe (kubectl port-forward svc/draw + X-Authentik-Username header, from his ~/.claude/CLAUDE.md) failing with: pods/portforward forbidden for system:serviceaccount:chrome-service:emo-browser in namespace excalidraw because his default kubeconfig is the read-only emo-browser SA (its port-forward grant covers only chrome-service) and his old admin kubeconfig at /home/emo/code/config expired and was removed. Add a namespace-scoped Role (pods/portforward create) + RoleBinding for that SA in the excalidraw namespace, mirroring the 2026-06-28 chrome-service grant. Trade-off (any-user drawings via the trusted username header) documented in the file and accepted. Also record the grant in docs/architecture/chrome-service.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 11:08:28 +00:00
Viktor Barzin	a64d2ba2b9	upgrades: fix hourly gotenberg error + cap update notifications at weekly [ci/woodpecker/push/default: pipeline successful] Viktor was getting upgrade-error Slack messages every hour and wants update notifications at most weekly. Root cause of the errors: Keel kept trying to roll gotenberg 8.25->8.25.1 in paperless-ngx but kyverno's require-trusted-registries denied it — gotenberg/* (and apache/*, which tika will hit next) were never allowlisted, and Keel's Slack notifier at info level re-posted the identical failure to #general on every hourly poll since Jun 28. Changes: allowlist gotenberg/* + apache/* so the patch applies cleanly; disable Keel's direct Slack notifier and replace failure visibility with a KeelUpdateFailing Loki-ruler alert (alert-on-change: one notification plus the daily digest, never an hourly drip); remove diun's Slack notifier whose default message @channel-pinged #image-updates for every new upstream tag every 6h (the n8n upgrade-agent webhook feed is untouched). The k8s upgrade report is already weekly (Mon 06:07 UTC). Paperless-ngx itself stays paused (keel policy=never, user-managed) while the ingest runs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 07:16:50 +00:00
Viktor Barzin	dab307f9f8	Merge remote-tracking branch 'origin/master' [ci/woodpecker/push/default: pipeline successful]	2026-07-02 05:39:15 +00:00
Viktor Barzin	f1e81772d5	broker-sync: repoint image to ghcr (was frozen on pre-migration DockerHub) [ci/woodpecker/push/default: pipeline successful] The nightly ibkr sync failed with 'No such command ibkr': every broker-sync CronJob still pulled viktorbarzin/broker-sync:latest from DockerHub, which nothing has pushed to since the ADR-0002 move to GHA->ghcr on 2026-06-13 — the jobs were silently running a frozen pre-ibkr build. The migration had allowlisted only the wealthfolio namespace for the private ghcr.io/viktorbarzin/wealthfolio-sync image, so broker-sync also lacked pull credentials. Repoint the image, add ghcr-credentials imagePullSecrets to all eight CronJobs, and allowlist the broker-sync namespace (wealthfolio stays — its own monthly sync pulls the same image). Related: code-9ko8. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 05:31:00 +00:00
Viktor Barzin	ac41e7c017	nvidia: run advertise-gpumem provisioner under bash (dash rejects pipefail) First apply of ADR-0016 failed: terraform local-exec defaults to /bin/sh, which on Ubuntu is dash — 'set -euo pipefail' exits 2 before running kubectl. Pin the interpreter to bash. Everything else in the gpumem apply succeeded. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-02 05:21:47 +00:00
Viktor Barzin	968b2b9c64	Merge remote-tracking branch 'origin/master' into wizard/gpu-vram-budget	2026-07-02 05:18:34 +00:00
Viktor Barzin	a12b09af04	broker-sync: pin data-mounting CronJobs to k8s-node4 (stop nightly RWO wedge) [ci/woodpecker/push/default: pipeline successful] All broker-sync CronJobs share one RWO proxmox-lvm volume. With free scheduling the nightly 02:00-04:15 runs land on different nodes, forcing a detach/attach cycle whose QMP hotplug intermittently ghost-attaches on disk-heavy VMs — every job then sits in ContainerCreating for hours (happened 2026-06-30, 07-01 and again 07-02; fires PodsStuckContainerCreating and skips the day's trade syncs). Pinning all seven volume-mounting jobs to k8s-node4 (fewest CSI disks, 11) makes the volume attach once and stay put — no hotplug dance, no wedge. version_probe mounts nothing and stays unpinned. Durable fix for the recurrence tracked in beads code-9ko8. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 05:16:38 +00:00
Viktor Barzin	3c85af2dc2	fire-countdown dashboard: SQL guards + tax regime + honesty fixes [ci/woodpecker/push/default: pipeline successful] From the flaw-hunt workflow (all verified): - Projected-FIRE-date panels (solo/household/family) now guard savings £/yr: 0 / empty / negative all render "Set savings £/yr" instead of a blank tile, a SQL error, or a nonsensical past date ("Jan 1849"). Verified across cases. - New "Tax regime" panel surfaces the per-country jurisdiction — 14/22 countries fall back to the neutral 'nomad' 1% assumption, which was previously invisible. - Intro no longer hard-codes "£139k pension" (contradicted the £328k tranche panel); pension value is now only shown data-bound in the tranche panel. - Intro adds caveats: Anca's spend is an estimate (pending live re-pull), and non-modelled countries use the nomad tax fallback. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-07-01 22:44:17 +00:00
Viktor Barzin	339f5d89b9	onlyoffice: decommission (stack destroyed, dir removed) [ci/woodpecker/push/default: pipeline successful] The document server had been deliberately scaled to 0/0 for 184 days, but its ingress kept the uptime-kuma monitors alive, so 'onlyoffice down' showed up in every daily alert digest. Viktor approved tearing it down. terragrunt destroy ran clean (11 resources) before this commit; the kuma monitors auto-prune with the ingress. Also drops the onlyoffice/* image prefix from the kyverno trusted-registries allowlist, the service-catalog rows, and updates the nextcloud collabora comment. Document data (if any) remains on the PVE NFS share. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-01 22:35:22 +00:00
Viktor Barzin	3c476dab32	postiz+portal: remove broken alert sources (stale backup CronJob, bogus scrape annotations) Viktor is getting daily Slack alert noise; these two were the recurring generators. The postiz-postgres-backup CronJob still dumped from the old in-namespace postiz-postgresql service that was removed in the CNPG migration (2026-06-28) — it failed every night at 03:00 and re-fired BackupCronJobFailed each day. The postiz DB now lives on the shared CNPG cluster and is already covered by the dbaas per-db dumps, so the CronJob (and its NFS backup volume) is redundant and removed rather than repaired. portal-stt/portal-tts advertised prometheus.io scrape annotations that never worked: the deployed Speaches build 404s /metrics, and openai-edge-tts has no metrics at all (its annotation pointed at a JSON endpoint, which fails exposition parsing regardless). Both produced a permanently firing ScrapeTargetDown. Annotations removed until the apps actually serve metrics. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-01 22:35:21 +00:00
Viktor Barzin	5a312563c6	monitoring/wealth: dash the in-progress year on the hourly-rate panel [ci/woodpecker/push/default: pipeline successful] The current, still-accruing calendar year read misleadingly high (e.g. 2026 at 5 months showed £149/h gross, above all of 2025) because the full-year bonus - paid every March - plus front-loaded quarterly RSU vests get divided by only the months worked so far. It settles lower as the year completes. Split each line into a solid series (complete years) and a dashed series (the latest, still-accruing year), so the provisional point is visually flagged. The split auto-detects the in-progress year (latest year with < 12 months of payslips), so it needs no per-year maintenance. Panel description now explains the caveat. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 12:45:51 +00:00
Viktor Barzin	28984dda9a	monitoring/wealth: add per-year effective hourly-rate panel (gross vs net) [ci/woodpecker/push/default: pipeline successful] Viktor wanted to see, on the wealth dashboard, the hourly wage he earned each year - both gross and net - with year on the X axis. New timeseries (line) panel "Effective hourly rate - gross vs net": - hourly = annual pay / hours worked; hours = contractual 40h/week (2,080h per full year, confirmed from the Facebook/Meta UK offer letter: Mon-Fri 09:00-18:00 less a 1h lunch), prorated by the months actually worked so partial years (2019, 2020, 2026) read correctly. - Gross = gross_pay incl. notional RSU vest; Net = take-home. - timeFrom 10y so all years show under the dashboard's default 180d range. Source data: a duplicate March-2023 payslip (Paperless doc 347, a re-upload of doc 33) was removed separately, so 2023 is no longer double-counted; this also corrects the existing net-pay panel. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 12:28:46 +00:00
Viktor Barzin	82371d1ef8	dbaas/mysql: innodb_doublewrite=DETECT_ONLY to halve page-flush writes [ci/woodpecker/push/default: pipeline successful] MySQL device-write investigation (code-oflt): after the nextcloud webcal throttle settled (the earlier 3.4-8.8 MB/s were post-restart transients), MySQL is ~1.74 MB/s at the InnoDB level — and HALF of that (~0.86 MB/s, ~55 pages/s) is the doublewrite buffer writing every flushed page twice. Redo is negligible (0.01 MB/s), no temp-table spilling. Set innodb_doublewrite=DETECT_ONLY (dynamic, no restart; persisted in the cnf): InnoDB stops writing full page CONTENT to the doublewrite buffer (~halves MySQL's page-flush writes on the IOPS-bound sdc) but keeps torn-page DETECTION metadata — a crash-torn page is flagged on recovery (restore from the daily mysqldump) rather than silently corrupt. Chosen over full OFF: same write saving, keeps detection, and OFF requires a shutdown ("cannot change to OFF if doublewrite is enabled"). Acceptable risk given the PERC BBU cache + UPS (in-flight writes complete on power loss) + daily per-db backups. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 08:47:09 +00:00
Viktor Barzin	74819d4061	feat(nvidia): GPU VRAM budget + watchdog to stop T4 overallocation The single time-sliced Tesla T4 has no per-tenant memory isolation, so its ~9 GPU workloads can collectively overallocate VRAM. On 2026-06-02 immich-ml's onnxruntime arena grew to 10.7 GB and silently starved llama-swap, breaking recruiter-responder for ~5h. Viktor asked for memory protection so we don't overallocate GPU memory, and chose to do it at the scheduling level (no device-plugin swap) after weighing HAMi and MPS. Make the scheduler VRAM-aware and add runtime teeth, all repo-native, time-slicing untouched: - Advertise a node extended resource viktorbarzin.me/gpumem (~14000 MiB) via a reconcile null_resource (immediate, apply-time) + hourly re-assert CronJob. - Each always-on GPU tenant declares a gpumem budget (immich-ml 3000, llama-swap 5000, frigate 2000, immich-server 1800, portal-stt 1500; sum 13300 <= advertised) so the scheduler refuses to co-schedule past the card (overflow -> Pending). - gpu-vram-watchdog Deployment recycles the biggest over-budget tenant ONLY when actual free VRAM < floor. Ships DRY_RUN=true (observe-then-enforce); flip to false after a few cycles look right. - Prometheus alerts GPUVRAMLow / GPUVRAMTelemetryDown / GPUVRAMWatchdogDown -- the 2026-06-02 post-mortem's never-built free-VRAM follow-up. - Docs: ADR-0016 (records why HAMi/MPS were rejected), CONTEXT.md GPU-sharing glossary; fix the stale "whole T4 / scale immich-ml to 0" llama-cpp comment. HITL GPU-node change: apply nvidia FIRST (advertise gpumem), verify the node shows the capacity, THEN the consumer stacks -- the cutover bounces GPU pods. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 07:57:40 +00:00
Viktor Barzin	82c9e69b77	dbaas/mysql: 2Gi InnoDB buffer pool + 6Gi limit + ignore VCT drift [ci/woodpecker/push/default: pipeline canceled] Cut MySQL's write-IOPS footprint on the contended PVE sdc HDD (code-oflt). Standalone MySQL was the #1 sdc bandwidth writer (~2.8-3.5 MB/s). Live attribution found ~60% of its writes were nextcloud webcal calendar churn (throttled separately at the app layer); this addresses write amplification on the remainder: - innodb_buffer_pool_size 1Gi -> 2Gi: the pool was too small for the ~5.6Gi hot set (Innodb_buffer_pool_wait_free=1.78M = threads stalling for a free page -> constant flush-to-make-room write IOPS). - container memory limit 4Gi -> 6Gi (requests 3->4Gi): the pod was already at ~3.7Gi/4Gi (near OOM) with the 1Gi pool, so the 2Gi pool needs the headroom. One-time MySQL pod restart to apply. - ignore_changes on the StatefulSet volume_claim_template: the VCT is immutable post-creation and pvc-autoresizer rewrites its annotations on the live object, so TF's desired VCT could never apply and errored every broad dbaas apply. Ignoring it (autoresizer owns PVC sizing) removes the long-standing need to -target around it. Applied + verified live: buffer_pool=2.0GiB, limit=6Gi, pod healthy, 24 DBs reachable, restart clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 07:55:18 +00:00
ebarzin	469cdd7507	frigate: expose go2rtc on a dedicated MetalLB LB IP (RTSP 8554 + WebRTC 8555) [ci/woodpecker/push/default: pipeline successful] HA live video from the cluster Frigate hangs/fails because the only path to Frigate is the Traefik HTTP(S) ingress (frigate-lan -> 10.0.20.203), which cannot carry RTSP or WebRTC. The container already listens on 8554+8555 but only RTSP had a Service (NodePort), and WebRTC (8555) was never exposed. Convert frigate-rtsp to a LoadBalancer on a dedicated MetalLB IP (.204, ETP=Local, pod pinned to the GPU node) carrying RTSP 8554 + WebRTC 8555 (TCP+UDP), giving HA Sofia + LAN browsers a stable cross-VLAN endpoint for native HLS/WebRTC live (parity with the Hikvision NVR). Companion non-Terraform steps are in the PR body. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-30 07:15:22 +00:00
Viktor Barzin	9ea9cae073	rightsize: reconcile batch-2/3 stacks blocked by killed #427 (job-hunter, wealthfolio, f1-stream) [ci/woodpecker/push/default: pipeline failed] Memory limits were committed (batch 2/3) but pipeline #427 was killed mid-apply and the local homelab tf apply hit a stale backend-init; this comment-only diff re-triggers a clean CI apply for the three stacks so live matches master (job-hunter 768Mi, wealthfolio 512Mi, f1-stream 384Mi). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 15:59:41 +00:00
Viktor Barzin	7cc9cde5b1	external-secrets: enable ESO Vault token cache to cut sdc write churn [ci/woodpecker/push/default: pipeline successful] Add --enable-vault-token-cache to the ESO controller (a graduated, non-experimental flag in chart 2.6.0). Until now ESO authenticated to Vault with login -> lookup-self -> revoke-self on every secret fetch. Across 92 ExternalSecrets refreshing every 15m that measured ~0.22 logins/s + ~0.22 revoke-self/s on the active Vault member, and each cycle is a token create+revoke (plus its lease) written to the Raft log on all three members. Those fsync-heavy writes land on the contended PVE RAID1 7200rpm HDD (sdc) -- one of the write sources behind the recurring control-plane flaps (code-oflt write-reduction). The eso kubernetes-auth role already issues a 240h periodic, unlimited- use token, so the churn was pure waste: ESO discarded a perfectly good token after a single use. With token caching ESO mints one token and reuses/renews it, collapsing logins from ~13/min to a handful per token lifetime. Verified live: vault cache initialized, 112/113 ExternalSecrets Ready (the one failure, instagram-poster, is pre-existing data drift unrelated to auth), logins dropped to ~0 after warm-up. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 15:32:37 +00:00
Viktor Barzin	bc626a2d89	rightsize: raise OOM-tight memory limits (batch 3/N — spike protection) [ci/woodpecker/push/default: pipeline failed] shlink 512->704Mi, linkwarden 1Gi->1280Mi, chrome-service 2Gi->2624Mi, forgejo 4Gi->5Gi, f1-stream 256->384Mi. All were request==limit with 30d peak at 91-100% of the ceiling — a spike would OOM-kill them. Raising the limit (now Burstable, request<limit) gives real burst headroom. This is the genuine 'don't OOM on occasional spike' fix. Small add (~2.2Gi limits) vs the ~20Gi of fat removed in batches 1-2, so net overcommit keeps dropping. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 15:28:11 +00:00
Viktor Barzin	418d1efb4b	rightsize: trim over-provisioned memory (batch 2/N) [ci/woodpecker/push/default: pipeline canceled] claude-agent-service 12Gi->3Gi (peak 585Mi — the single biggest fat, ~9Gi of limit-overcommit removed), job-hunter 1280->768Mi (kept chromium headroom; 30d peak 118Mi), fire-planner 1024->320Mi, wealthfolio 1Gi->512Mi (kept history-growth headroom). Burstable, limits kept >= generous peak headroom, never below peak. ~10.7Gi of limit overcommit removed. paperless-ai intentionally LEFT at 4Gi (documented in-process RAG model load). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 15:27:17 +00:00
Viktor Barzin	c3553731c7	dbaas: CNPG write-reduction — archive_timeout=0, commit_delay, wal_compression=zstd Part of code-oflt (cut sdc write IOPS before the SSD move; analysis #6922). - archive_timeout 300->0: CNPG forces archive_mode=on but .spec.backup is empty (no ObjectStore), so a 16MB WAL segment switch every 5min shipped NOWHERE = ~4.6 GB/day of pure-waste WAL on the contended sdc. archive_mode stays CNPG-on (reserved); 0 just stops the timed switch. Daily pg_dump cron unchanged. - commit_delay 0->2500us: group-commit coalesces concurrent fsyncs. SAFE for every DB incl financial -- data still fsynced before COMMIT acks, only <=2.5ms added latency under concurrency. - wal_compression pglz->zstd: ~30-50% smaller full-page images. All sighup-reloadable. Applied via targeted apply of module.dbaas.null_resource.pg_cluster (trigger bumped) to avoid the pre-existing mysql VCT drift that breaks broad dbaas applies. Refs: code-oflt. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 15:16:38 +00:00
Viktor Barzin	5d059786a1	rightsize: trim over-provisioned memory limits+requests (batch 1/N) [ci/woodpecker/push/default: pipeline successful] claude-breakglass 4Gi->512Mi, stirling-pdf 1536->512Mi, insta2spotify 2Gi->256Mi, recruiter-responder 768->256Mi. These idle/utility services had memory LIMITS sitting 4-15x above their 30d peak, inflating cluster limit-overcommit to 142% across the 5 post-node6 nodes. Burstable (request<limit), limits capped at ~peak x1.5 (never below peak), so no OOM risk (verified zero OOMKills cluster-wide in 30d). Reduces phantom limit overcommit + frees scheduler requests. Follows the 3-reviewer adversarial review: raising limits on an already-overcommitted cluster worsens correlated node-OOM; the real fix is trimming the fat. Limits only lowered where peak is far below; tuned/DB/GPU limits untouched. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 14:46:58 +00:00
Viktor Barzin	256122ff5b	monitoring: make ClusterCannotTolerateNonGpuNodeLoss topology-agnostic [ci/woodpecker/push/default: pipeline successful] The N-1 capacity alert was hardcoded to k8s-node[234]/[1234], predating node5/node6 (added 2026-05-26) and the 2026-06-29 removal of node6 — so it no longer reflected the real cluster and gave no trustworthy N-1 signal. Generalize node selection via metrics: GPU node by nvidia_com_gpu capacity, drained/cordoned by kube_node_spec_unschedulable, down by the Ready condition. Control-plane excluded by name (node!~"k8s-master.*") because this cluster's kube-state-metrics exposes neither kube_node_role nor node taints/labels (verified live). Also fixes a latent bug (multiplying by kube_node_spec_unschedulable==0 zeroed the result) and refreshes the remediation text (krr, not the removed Goldilocks). With node6 gone the rule now correctly evaluates LHS 31.0Gi > RHS 27.9Gi (fires) — the honest signal that removing node6 tightened requests-based N-1; trimming the inflated requests clears it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 12:34:01 +00:00
Viktor Barzin	c0e0911afa	dbaas: bump pg_cluster trigger so the checkpoint/WAL params actually apply `a2c8f906` added checkpoint_timeout=15min + max/min_wal_size to the CNPG Cluster YAML, but the cluster is applied via null_resource.pg_cluster + local-exec kubectl apply, which only re-runs when its `triggers` change. The YAML edit didn't bump a trigger, so the change was inert and never applied (incl. via CI). Bump the pg_params trigger so the kubectl apply re-runs and CNPG hot-reloads the new params (reloadable, no restart). Landing it via a targeted apply (-target=null_resource.pg_cluster) to avoid 3 pre-existing unrelated drifts in this stack -- notably a mysql_standalone volumeClaimTemplate annotation diff the apiserver rejects as immutable, which is what fails broad dbaas applies (and silently blocked `a2c8f906`). Refs: code-oflt. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 12:25:37 +00:00
Viktor Barzin	a2c8f906ec	dbaas: stretch CNPG checkpoint timer 5->15min + raise WAL size (cut sdc write IOPS) [ci/woodpecker/push/default: pipeline successful] Viktor asked to reduce CNPG checkpoint/WAL writes as part of the sdc IOPS-isolation work (code-oflt). The IOPS deep-dive found CNPG checkpoints fire 100% on the 5-min timer (checkpoints_timed >> checkpoints_req), each triggering a full-page-write burst + flush onto the contended 7200rpm sdc spindle -- a top write-IOPS source after etcd. Set checkpoint_timeout=15min + max_wal_size=4GB + min_wal_size=1GB so checkpoints fire ~1/3 as often (fewer FPW) and WAL segments are recycled rather than churned. All three are sighup-reloadable -> CNPG applies them without a restart or failover. checkpoint_completion_target stays 0.9 so each checkpoint's IO is still smeared across the interval. Bounded recovery-time tradeoff (more WAL to replay on crash), acceptable for the write relief. wal_compression left at pglz ('on') pending image zstd-support verification. Also refreshes the stale CNPG tuning note in .claude/CLAUDE.md (it listed shared_buffers=512MB / effective_cache_size=1536MB / 2Gi; live is 1024MB / 2560MB / 3Gi). Refs: code-oflt (etcd/sdc IO isolation). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 11:41:09 +00:00
Viktor Barzin	3398873a16	k8s-upgrade: move version-check cadence from daily to weekly (Sun check, Mon report) [ci/woodpecker/push/default: pipeline successful] Viktor asked to move the upgrade checks to weekly. With the actionable-vs-held gate now quieting the routine 'held' churn (e.g. 1.36), a daily check + attempt buys little; weekly is enough. Accepted trade-off: k8s patch (incl. security) uptake now lags up to 7 days instead of <=1. - var.schedule: 0 23 * * * -> 0 23 * * 0 (detector: weekly Sunday 23:00 UTC) - var.report_schedule: 7 6 * * * -> 7 6 * * 1 (report: Monday 06:07 UTC, ~7h after the Sunday check, so nightly-report.py's ~25h staleness threshold stays valid AND still flags a missed weekly run; no STALE_SECONDS change needed) The report CronJob keeps its historical name k8s-upgrade-nightly-report (rename = churn). Cadence wording updated across main.tf comments, nightly-report.py docstring, and the runbook. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 06:22:20 +00:00
Viktor Barzin	e43e64c666	kyverno: disable reports-controller to stop etcd ephemeralreport load [ci/woodpecker/push/default: pipeline successful] Viktor flagged not wanting to wear the single non-RAID SSD with useless etcd writes if etcd moves there. Investigation found the avoidable load is kyverno reporting: the 2026-06-12 etcd-load-reduction disabled the report features but left the reports-controller running (default --enableReporting + --validatingAdmissionPolicyReports=true), so the 2026-06-21 kyverno upgrade left a one-time pile of ~10.5k cluster/namespaced ephemeralreports (~114MB in etcd) that nothing reaps (aggregation off). Listing that range starves etcd's fdatasync enough to flap the apiserver (observed live 2026-06-28). Disable the reports-controller outright (reportsController.enabled=false), completing the 2026-06-12 intent. Reports are not consumed (violations surface via Loki->Slack); admission enforcement (deny-* policies) and Keel mutation are independent of it. The ~10.5k stale reports already in etcd are cleared separately (throttled, out-of-band) since bulk-deleting them is itself etcd-heavy. Refs: code-oflt (etcd IO isolation), code-at4f (etcd starvation alerting). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-29 05:35:36 +00:00
Viktor Barzin	cf42042cba	monitoring: re-trigger apply to persist state after CI cancel-race [ci/woodpecker/push/default: pipeline successful] No-op comment touch in loki.tf to force a clean `terragrunt apply monitoring`. The pfSense egress-monitoring apply (commit `7fe2d978`, CI pipeline #414) was cancelled by a newer push and SIGKILLed mid-helm-upgrade: the live resources applied (probes green, rules loaded) but the Terraform state write and the helm release finalize were lost, leaving the prometheus release stuck in pending-upgrade (manually unstuck). This commit re-applies the unchanged monitoring stack so state matches live, with zero resource changes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 16:58:49 +00:00
Viktor Barzin	f92075b7c5	fire-planner: solve FIRE targets to age 100 (horizon 60→72) [ci/woodpecker/push/default: pipeline successful] Viktor plans to live to 100, so the portfolio must last that long. The fire-targets CronJob was solving a 60-year horizon (≈ to age 88); set it to 72 (retire ~age 28 → age 100). Raises every case's FIRE number modestly (more years to fund). A one-off in-cluster job re-solves the existing rows at the new horizon. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 16:49:20 +00:00
Viktor Barzin	7fe2d9780e	monitoring: add pfSense WAN/egress alerting + probes [ci/woodpecker/push/default: pipeline canceled] On 2026-06-27 pfSense (Proxmox VMID 101) stopped passing internet egress for ~20 min while internal routing + Unbound stayed up; recovery needed a manual reboot and NOTHING alerted — there was no egress probe and the cloudflared replica metric stayed green. Add first-class egress monitoring so the next occurrence pages in ~2 min instead of being noticed by a human. - blackbox-exporter: new icmp_egress + dns_external probe modules (+ NET_RAW so ICMP can use raw sockets). - Three in-cluster probe jobs exercising the pod->node->pfSense-NAT path that failed: wan-gateway-icmp (192.168.1.1), internet-egress-icmp (9.9.9.9 + 1.1.1.1), internet-egress-dns (cloudflare.com via both resolvers). - Prometheus alerts (group "Egress / pfSense"): WANGatewayUnreachable, InternetEgressDown (both providers dead), ExternalDNSResolutionDown, EgressOnlyDivergence (reuses the existing t3-probe legs — the incident's exact "external down while internal up" signature), PfSenseVMDown. - Loki ruler: CloudflaredTunnelConnLoss — the canary that fired first; the cloudflared replica metric is blind to tunnel-connection loss. Threshold calibrated against live Loki (steady-state ~2/6h vs 37-85/5m in-incident). - Alertmanager inhibit: WAN/egress-down suppresses the downstream egress symptom alerts so one root alert pages, not a storm. - Runbook docs/runbooks/pfsense-egress.md + .claude/CLAUDE.md. All metric names + the cloudflared threshold verified against live Prometheus/Loki. Pure GitOps, no pfSense change. Firewall-side hardening (dpinger retargeting, failover gateway, pfSense syslog -> Loki) is deferred and documented in the runbook. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 16:46:30 +00:00
Viktor Barzin	6f042ee239	fix(fire-planner): grafana fire-planner-pg datasource survives pw rotation [ci/woodpecker/push/default: pipeline failed] The fire-planner-pg Grafana datasource baked the rotating fire_planner DB password into its provisioning ConfigMap at terraform plan-time, so on every 7-day static-role rotation the password went stale and ALL fire-planner-pg dashboards (fire-planner, cost-of-living, and the new wealth FIRE Countdown) silently failed with "password authentication failed for user fire_planner" until the next stack apply. Switch to the same live-env pattern wealth-pg / payslips-pg already use: - new ExternalSecret grafana-fire-planner-pg-creds (monitoring ns, Reloader match) mirrors the rotating Vault static-creds/pg-fire-planner password - datasource ConfigMap now references $__env{FIRE_PLANNER_PG_PASSWORD} - Grafana mounts it via envFromSecrets; reloader (auto) restarts Grafana on rotation so the provisioned datasource never goes stale Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 16:14:42 +00:00
Viktor Barzin	35c0057d83	chrome-service: raise noVNC sidecar memory limit 96Mi->256Mi (fix OOMKill) [ci/woodpecker/push/default: pipeline successful] The noVNC sidecar (x11vnc + websockify) was OOMKilled (exit 137) repeatedly whenever someone actively opened chrome.viktorbarzin.me — the view connected then froze/hung. Idle usage is ~37Mi, but x11vnc + websockify framebuffer/encode buffers spike past the 96Mi cap when streaming the 1280x720 screen to a client. Raised request 32Mi->64Mi, limit 96Mi->256Mi (Burstable, aux tier). Already applied live via a transient kubectl patch (Recreate rollout, verified 0 restarts since); this lands the durable state so the next apply / daily drift-detection doesn't revert it to 96Mi. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 15:39:17 +00:00
Viktor Barzin	2e50c1235c	chrome-service: grant emo shared browser access (noVNC + homelab browser CLI) [ci/woodpecker/push/default: pipeline successful] Viktor asked to give emo access to the cluster's headed Chrome so he can fill in forms and get past anti-bot / captcha pages. emo was deliberately locked out of chrome-service (noVNC Authentik allowlist was Viktor-only + his power-user RBAC has no pods/portforward). Viktor's explicit decision: SHARE his existing browser rather than stand up an isolated per-user instance, accepting that emo can therefore reach Viktor's warmed logged-in sessions (CDP has no per-context auth, so the single shared persistent profile is reachable by anyone who can drive the browser). emo's CLI use is hands-off (his agent can run it unattended). - authentik: add emo (emil.barzin / emil.barzin@gmail.com) to CHROME_ALLOWED so the admin-services-restriction policy admits him to chrome.viktorbarzin.me (noVNC). Reverses the prior Viktor-only lock; comment updated to record why. - chrome-service/rbac.tf (new): emo-browser ServiceAccount + long-lived token (dashboard-sa.tf pattern), a chrome-service-portforward Role granting pods/portforward, and a cluster read-only binding (oidc-power-user-readonly) so the SA can resolve the Service and emo's normal read access doesn't regress. - t3-provision-users.sh: install_browser_kubeconfig installs a dual-context kubeconfig for any user with a <user>-browser SA — SA token as the default context (non-interactive, works headless), personal OIDC retained as the oidc@homelab named context. emo's OIDC-only kubeconfig can't authenticate the headless agent session that homelab browser needs. - docs/architecture/chrome-service.md: document the shared-browser multi-user access model, the session-exposure trade-off, and how to grant/revoke a user. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 15:20:07 +00:00
Viktor Barzin	50077b43d4	paperless-ngx: drop TASK_WORKERS 6->4 (6 OOMKilled the pod mid-import) [ci/woodpecker/push/default: pipeline successful] 6 OCR workers crept past the 8Gi per-container memory cap over ~6h and OOMKilled paperless at 15:00 during the Emo bulk import. The import auto-recovered (the consume dir lives on the PVC, so a restart re-scans and reprocesses — nothing lost), but it left the queue inflated with re-queued duplicates and spiked etcd on each restart. The 8Gi cap is the shared edge-tier `tier-defaults` LimitRange, not worth raising for one namespace. 4 workers fit with headroom (4 measured ~1.3Gi). Matches the value applied live via `kubectl set env` during incident response; this removes the drift so the next apply keeps it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 15:06:46 +00:00
Viktor Barzin	8236ae309d	postiz: reconcile HCL to live (adopt unmerged stack config), keep parked All checks were successful ci/woodpecker/push/default Pipeline was successful Details postiz's live deployment (Helm + Temporal + Elasticsearch + Authentik OIDC + static-DB password) came from the never-merged branch `wizard/postiz-cnpg-oidc`, so master's HCL was stale and a `terragrunt apply` would have DESTROYED the stack. This lands that postiz config to master so HCL == state == live (CI green; destroy-landmine gone). Kept PARKED (postiz + temporal replicas = 0): IG-via-postiz is Meta- blocked (it hardcodes retired Instagram scopes → OAuth "Invalid Scopes"), which is why it was parked; IG runs via the instagram-poster service. To revive later: flip postiz `replicaCount` + temporal `replicas` back to 1 and re-check image pins. Notes captured in this reconcile: - ES image pinned to 7.17.28 (the branch's 7.17.24 was a DOWNGRADE vs the live data → ES refused to start "cannot downgrade node 7.17.28→7.17.24"; caught + rolled back during this work). - The 4 Authentik resources (app/provider/group/binding) were re-imported into state (adopted, not recreated — no duplicate AK objects); the obsolete `external_secret_jwt` ExternalSecret was removed (Retain → its synced secret was kept). - Vault-side cleanup (removing the unused pg-postiz rotated role) is deliberately NOT included here — deferred, postiz uses a static secret/postiz database_url. State was already reconciled by a local `scripts/tg apply`; this commit is the HCL catch-up (CI re-apply is a no-op). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 12:54:59 +00:00
Viktor Barzin	e518ada3d4	authentik: repoint to overlay patch3 (all-iOS SFE + SFE social links) + docs All checks were successful ci/woodpecker/push/default Pipeline was successful Details global.image -> 2026.2.4-patch3. Old iPad Chrome (and any iOS browser) now gets the SFE too, and the SFE login shows social-login buttons (emo is Google-only with no password, so the password form alone was a dead end). Docs: .claude/CLAUDE.md + authentication.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 11:53:26 +00:00
Viktor Barzin	4fc09b7a61	Merge remote-tracking branch 'origin/master' into wizard/authentik-sfe-social Some checks failed ci/woodpecker/push/default Pipeline was canceled Details Build Custom Authentik Image / build (push) Has been cancelled Details	2026-06-28 11:53:04 +00:00
Viktor Barzin	916516eeab	authentik overlay patch3: SFE for ALL old iOS browsers + social-login links Two follow-ups to patch2 (both in patch-compat-sfe.py, guarded): 1. compat_needs_sfe() now also serves the SFE to ANY iOS browser on iOS<=16.3, not just Safari. iOS Chrome/Firefox are WebKit skins (Apple mandate) reporting a non-Safari UA family, so the Safari-only check missed them and they still got the blank modern SPA. Added an os.family=="iOS" + version<=16.3 branch. 2. Inject static social-login <a> links (Continue with Google/GitHub/Facebook -> /source/oauth/login/<slug>/) into the SFE shell (flow-sfe.html). The SFE architecturally can't render Identification-stage sources (authentik docs), and emo's account (emil.barzin@gmail.com) is Google-only with NO password — so the SFE's username/password form was a dead end. The links are plain redirects that work on any browser. Slugs are static; re-verify on source changes. Tag -> 2026.2.4-patch3; values repoint + docs land once GHA builds it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 11:53:03 +00:00
Viktor Barzin	08bdf32aa0	feat(fire-planner): FIRE Countdown dashboard section + monthly target solve Some checks failed ci/woodpecker/push/default Pipeline was canceled Details Add a "FIRE Countdown" section to the wealth Grafana dashboard plus a monthly CronJob that computes the targets it reads. Viktor wanted a £ countdown to retirement in today's money, per life-case (Solo / Household / Family) and per country, with progress, a projected date, runway, and his safety guardrails — so he can see how close he is to FIRE (ideally lean) without ever coming back to work. - wealth.json: new country / with_home / savings_per_year template vars + a per-Case row (target NW at the 99% GK bar, progress gauge, still-needed, projected FIRE date, runway) and safety-valve panels (re-entry trigger vs £1.0M, 2.5yr cash buffer, pension tranche @57, Anca-bridge note). Reads fire_planner.fire_target via the fire-planner-pg datasource (Mixed). - fire-planner stack: fire-planner-fire-targets CronJob (monthly, 2nd 10:00 UTC) runs `recompute-fire-targets --countries all`. Targets come from the solver shipped in fire-planner edb4d11. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 11:52:17 +00:00
Viktor Barzin	6ba60cbb2d	authentik: repoint to overlay patch2 (SFE for old Safari) + docs All checks were successful ci/woodpecker/push/default Pipeline was successful Details global.image -> 2026.2.4-patch2 (adds the compat_needs_sfe SFE patch on top of the SLOW-1a query patch). Old Safari/WebKit (<=16.3) now gets authentik's no-JS SFE login instead of a blank page — fixes emo's iPadOS-15.8 iPad with no auth downgrade. Docs: .claude/CLAUDE.md Authentik row + docs/architecture/authentication.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 11:39:29 +00:00
Viktor Barzin	f10bb71562	authentik overlay: serve the no-JS SFE login to old Safari (patch #2) Old Safari/WebKit (<=16.3, e.g. iPadOS<=16.3) can't parse authentik's modern ES2022 flow SPA and gets a COMPLETELY BLANK login — exactly what emo's iPadOS-15.8 iPad hit. authentik already ships a no-JS Simplified Flow Executor (SFE, ES5) and serves it via compat_needs_sfe(), but only for IE/old-Edge/PKeyAuth. Extend that to old Safari so those clients get the REAL authentik login (password + MFA + reputation, identity preserved — NO auth downgrade, no new credential store). Chosen over a Traefik basic-auth fallback after an adversarial review: that route would put a single, spoofable-UA password in front of vbarzin->wizard (passwordless root on the cluster-controlling devvm) — an MFA->single-factor path to cluster root. SFE keeps full authentik auth and is generic for any old browser. Shipped as patch #2 in the existing overlay image (patch-compat-sfe.py — guarded: asserts the upstream anchor + ast-parses; verified against the live interface.py). Tag -> 2026.2.4-patch2; the values repoint lands once GHA builds the image. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 11:38:05 +00:00
Viktor Barzin	87a450e9a3	vault: grant emo full read/write on his own secret/emo tree Viktor asked that emo be able to edit his own secrets with full access. emo's personal-emo policy was read-only (read on data, read/list on metadata), so he could view but not change his personal secrets. Widen it to the same self-service capability set every namespace-owner already has over their own tree: create/read/update/delete/list on secret/data/emo(+/) and list/read/delete on secret/metadata/emo(+/). Scope is unchanged — still only emo's own secret/emo subtree, still a named exception that does not widen the power-user tier in general. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 11:07:22 +00:00
Viktor Barzin	a1cf7ccaf6	authentik: repoint to the SLOW-1a overlay image + un-enroll Keel All checks were successful ci/woodpecker/push/default Pipeline was successful Details GHA built ghcr.io/viktorbarzin/authentik-server:2026.2.4-patch1 (public, verified anonymously pullable). Point global.image at it (repository + tag pinned explicitly so neither helm's appVersion default nor Keel can downgrade it — the 2026-06-10 boot-storm class) and remove keel.sh/enrolled from the namespace so Keel won't auto-bump the custom tag. authentik is now manual-upgrade: bump the Dockerfile FROM + this tag together on each authentik version bump. Net effect once rolled: the identification-stage query drops ~1.4s -> ~14ms, so the cold login-flow first-load stops being slow. (Does NOT affect old-browser clients — iPadOS<=15/Safari<=15.6 still can't run the SPA; that's unfixable server-side.) Docs: .claude/CLAUDE.md Authentik row. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 10:46:21 +00:00
Viktor Barzin	7ec64ed5ff	authentik: custom-image overlay to fix the 1.4s login-flow query (SLOW-1a) Some checks are pending Build Custom Authentik Image / build (push) Waiting to run Details ci/woodpecker/push/default Pipeline was successful Details The login flow's identification stage runs a bare select_subclasses() that LEFT-JOINs every Source subtype table — ~1.4s server-side on every cold login (verified live: 1527ms vs 14ms). Narrow it to only the subtypes that render a UI login button (oauth/saml/plex/telegram/kerberos — not the sync-only ldap/scim), via django-model-utils string accessors so no import is needed. Byte-identical output, ~100x faster, robust to adding new login source types. Shipped as a thin overlay over the official image (mirrors the diun/excalidraw precedent): stacks/authentik/Dockerfile (FROM ghcr.io/goauthentik/server:2026.2.4 + a guarded sed) built by .github/workflows/build-authentik.yml -> ghcr.io/ viktorbarzin/authentik-server:2026.2.4-patch1. The values repoint + Keel freeze land in a follow-up commit once the image is built. Upstream bug still present in main (no fix/PR) — drop this overlay once upstream narrows the query. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 10:42:58 +00:00
Viktor Barzin	eebb6c8594	k8s-upgrade: classify compat-gate blocks as actionable vs held; quiet the held case The nightly upgrade chain detected 1.36, the preflight compat-gate refused it, and that produced a Failed preflight Job + a K8sUpgradeBlocked alert EVERY night — even though the block is unactionable (no kyverno/ESO release supports 1.36 yet, and gpu-operator is pinned to its current version because bumping it needs a newer NVIDIA driver image + Ubuntu/kernel we're not ready for). Viktor asked to teach the checker to tell 'we can fix this' apart from 'nothing to do but wait', and stop the nightly Failed-Job + alert noise for the latter. compat-gate.py now classifies each blocker: - ACTIONABLE: a newer addon version in addon-compat.json supports the target -> exit 2, k8s_upgrade_blocked=1 -> K8sUpgradeBlocked alert (reasons in the nightly report). - WAITING-on-upstream: no released version supports the target yet -> held. - PINNED: addon marked pinned in the matrix (gpu-operator) -> held. Held wins on a mix -> exit 4, k8s_upgrade_held=1 (NEW gauge), NO alert. Tidy the block path (Viktor's scope choice): deliberate gate decisions now make the preflight Job Complete cleanly (HALT_CHAIN stops chain progression without a non-zero exit), so they no longer create Failed Jobs. Dropped the now-obsolete 'unless k8s_upgrade_blocked==1' suppression from K8sUpgradeChainJobFailed. Gauge is pushed definitively once per run (no 1->0->1 flap that re-notifies). The detector re-spawns a refused-but-Complete preflight nightly (silently) so a standing hold still re-evaluates, and only announces genuine new/Failed spawns. nightly-report gains a quiet '⏸️ HELD' headline with reasons grouped by class. gpu-operator pinned in addon-compat.json (unpin = delete pinned + pin_reason). Net effect on 1.36: HELD + quiet (waiting on kyverno/ESO, gpu-operator pinned; Calico the lone actionable piece) — no nightly Failed Job, no alert, just the morning report's HELD line. Design: docs/plans/2026-06-28-k8s-upgrade-gate-held-classification.md Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 10:08:20 +00:00
Viktor Barzin	b3c419e108	Merge remote-tracking branch 'origin/master' into wizard/authentik-perf-fix All checks were successful ci/woodpecker/push/default Pipeline was successful Details	2026-06-28 09:55:25 +00:00
Viktor Barzin	a3eb309e26	calico: fix empty Whisker UI — allow whisker egress to the kube-dns ClusterIP All checks were successful ci/woodpecker/push/default Pipeline was successful Details Real root cause of the 2026-06-28 "Whisker UI empty" incident (the watchdog added in `8d1d2fb9` was treating a symptom). The tigera operator's own `whisker` NetworkPolicy is policyTypes:[Ingress,Egress]; its egress allows DNS only to the kube-dns pods (podSelector k8s-app=kube-dns). But whisker-backend resolves goldmane.calico-system.svc via the kube-dns ClusterIP (10.96.0.10), and Calico drops UDP DNS to a ClusterIP under a podSelector-only egress rule. Verified in an isolated repro: from the whisker pod's netns, ClusterIP DNS = 100% timeout while direct kube-dns pod-IP DNS = OK; a pod with no egress policy resolves fine; a test pod with the operator's podSelector-only egress rule reproduces the failure, and adding an ipBlock(ClusterIP) egress rule flips it to 100% ok. whisker-backend resolves goldmane once in the brief startup window before the policy programs, holds its long-lived gRPC stream, and only re-resolves when that stream breaks (e.g. a node-reboot blip) — then the blocked ClusterIP DNS wedges its Go resolver and the UI goes empty. The durable aggregator (separate pod, unrestricted namespace) was never affected. Fix: additive egress NetworkPolicy whisker-allow-dns-clusterip (whisker -> 10.96.0.10/32 on 53 UDP+TCP); k8s egress policies are additive so the operator NP is untouched. The whisker-watchdog CronJob is kept as a backstop (repurposed comment). Applied + verified: ClusterIP DNS from the whisker netns now 8/8 ok, whisker-backend 0 errors, flow API returns 828 flows / the namespace list. Docs (runbook + CLAUDE.md) updated to the real root cause. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 09:32:28 +00:00
Viktor Barzin	385dfff0e7	authentik: fix episodic blank-screen + 30s-hang login (reliability R2) The login screen would sometimes hang/blank for everyone for ~30s at a time. Root-caused: the readiness probe (/-/health/ready/) queries the DB, and on a transient PG/pgbouncer blip it 503s; with the chart-default ~30s tolerance all 3 goauthentik-server pods dropped out of the Service at once, so Traefik had no healthy backend -> 502/503/504. Compounded by a silent drift: the repo set the rollout strategy under `strategy:`, but the chart reads `deploymentStrategy:` — so live ran the chart-default 25%/25% and dropped a pod out of rotation on every roll. (Redis was removed upstream in authentik 2026.2, so sessions+cache are on PostgreSQL and request-serving is coupled to PG — verified there is no external-cache option to put back, so a SHORT transient is now survived but a total CNPG outage still takes authentik down.) Reliability package (R2, approved): - readinessProbe.failureThreshold 3->8 (~80s) — absorbs a full CNPG failover reconnect without dropping the whole fleet from the Service. - rename server+worker `strategy:` -> `deploymentStrategy:` (the real chart key) and set maxSurge:1/maxUnavailable:0 so a roll never dips below 3 ready. - gunicorn AUTHENTIK_WEB__MAX_REQUESTS 1000->10000 / JITTER 50->1000 so the 9 workers' recycles don't cluster on a DB blip. - / and /static ingresses switch to the dedicated authentik-rate-limit (100/1000) from the previous commit (skip_default_rate_limit) — fixes the cold-load 429 blank screen. Liveness intentionally left DB-coupled-but-shallow (LiveView always returns 200, so it can't kill a DB-blocked pod). CONN_MAX_AGE intentionally NOT set (pins the pgbouncer pool, reverted 2026-06-10). Docs: .claude/CLAUDE.md + authentication.md (also corrected a stale "60s persistent DB connections" note). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 09:17:05 +00:00
Viktor Barzin	b84b0021c2	authentik: dedicated rate-limit carve-out + per-router 5xx observability All checks were successful ci/woodpecker/push/default Pipeline was successful Details Unauthenticated users were getting a blank login screen (and the screen would sometimes just hang). Root-caused via a read-only fan-out + adversarial verify: the login SPA cold-loads ~70 flow-executor JS/CSS chunks from /static through the SHARED 10/50 Traefik limiter, so a fresh/empty-cache load 429s the tail and a failed ES-module import aborts SPA bootstrap -> permanent blank. authentik was the only first-party SPA still on the default limiter (8 siblings already have a carve-out). NAT-shared clients trip it especially easily (shared per-IP bucket). - traefik: new `authentik-rate-limit` Middleware (average 100 / burst 1000, mirroring the existing health/tripit carve-outs). The authentik / and /static ingresses switch to it in the authentik-stack commit. - monitoring: the `traefik` scrape job's drop-regex was a blanket `traefik_router_.`, which also dropped `traefik_router_requests_total` — so per-router 4xx/5xx (incl. 429/503) was neither queryable nor alertable. Narrowed it to keep the counter while still dropping the high-cardinality `_duration_seconds_bucket` histogram, and added `AuthentikRootRouter5xxHigh` for the episodic all-3-server-pods-NotReady 502/503/504 cascade. Docs updated (networking.md rate-limit list, .claude/CLAUDE.md). GitOps CI applies. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-28 09:10:34 +00:00


1626 commits
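
The blocker classification described in commit eebb6c8594 (compat-gate.py) could be sketched roughly as below. This is an illustration only, not the repository's code. The `Addon` fields, the function names, the pinned-before-actionable check order and exit code 0 for a clean pass are all assumptions; only the three class names, exit codes 2 and 4, and the "held wins on a mix" rule come from the commit message.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Tuple


class BlockClass(Enum):
    ACTIONABLE = "actionable"        # a newer addon release supports the target
    WAITING = "waiting-on-upstream"  # no released version supports the target yet
    PINNED = "pinned"                # addon is deliberately pinned in the matrix


@dataclass
class Addon:
    # Hypothetical shape; the real addon-compat.json schema is not shown in the log.
    name: str
    supports_target: bool          # installed version already supports the target
    newer_supports_target: bool    # some newer released version supports it
    pinned: bool = False


EXIT_OK = 0        # assumption: no blockers
EXIT_BLOCKED = 2   # from the commit message: actionable blockers only
EXIT_HELD = 4      # from the commit message: any held blocker wins


def classify(addon: Addon) -> Optional[BlockClass]:
    """Return the blocker class for one addon, or None if it does not block."""
    if addon.supports_target:
        return None
    if addon.pinned:               # assumption: a pin outranks an available bump
        return BlockClass.PINNED
    if addon.newer_supports_target:
        return BlockClass.ACTIONABLE
    return BlockClass.WAITING


def gate(addons: List[Addon]) -> Tuple[int, Dict[str, BlockClass]]:
    """Classify every blocker and pick the exit code; held wins on a mix."""
    blockers: Dict[str, BlockClass] = {}
    for addon in addons:
        cls = classify(addon)
        if cls is not None:
            blockers[addon.name] = cls
    if not blockers:
        return EXIT_OK, blockers
    held = any(c in (BlockClass.WAITING, BlockClass.PINNED) for c in blockers.values())
    return (EXIT_HELD if held else EXIT_BLOCKED), blockers


# The 1.36 situation from the commit message, with made-up field values.
addons_136 = [
    Addon("kyverno", supports_target=False, newer_supports_target=False),
    Addon("external-secrets", supports_target=False, newer_supports_target=False),
    Addon("gpu-operator", supports_target=False, newer_supports_target=True, pinned=True),
    Addon("calico", supports_target=False, newer_supports_target=True),
]
exit_code, blockers = gate(addons_136)
```

With these inputs the mix of waiting, pinned and actionable blockers resolves to the held exit code, which matches the "HELD + quiet" outcome the commit describes for 1.36.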