infra/stacks/platform/modules/authentik/values.yaml

authentik:
  log_level: warning
  # log_level: trace  # swap in when debugging
  secret_key: "${secret_key}"
  # Sends anonymous usage data, stack traces on errors, and performance data
  # to authentik.error-reporting.a7k.io. Reporting is opt-in; it is enabled here.
  error_reporting:
    enabled: true
  postgresql:
    # Direct connection (bypasses PgBouncer):
    # host: postgresql.dbaas
    host: pgbouncer.authentik
    port: 6432
    user: authentik
    # Quoted so that characters such as '#' or ': ' in the rendered value do not break the YAML.
    password: "${postgres_password}"
  redis:
    host: ${redis_host}

server:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
  resources:
    requests:
      cpu: 100m
      memory: 1Gi
    limits:
      memory: 1Gi
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app.kubernetes.io/component: server
  ingress:
    enabled: false
    # hosts:
    #   - authentik.viktorbarzin.me
  podAnnotations:
    # The worker does not need these annotations because it uses the same image.
    # Annotation values are strings in Kubernetes, so the boolean is quoted.
    diun.enable: "true"
    diun.include_tags: "^202[0-9].[0-9]+.*$"
  pdb:
    enabled: true
    minAvailable: 2
global:
  addPrometheusAnnotations: true

worker:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
  resources:
    requests:
      cpu: 100m
      memory: 896Mi
    limits:
      memory: 896Mi
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app.kubernetes.io/component: worker
  pdb:
    enabled: true
    maxUnavailable: 1
add authentik [ci skip] 2024-11-12 20:20:10 +00:00			`authentik:`
set authentik log level to warning to reduce db noise [ci skip] 2025-09-06 21:41:22 +00:00			`log_level: warning`
add debug option in authentik helm [ci skip] 2025-12-28 20:03:37 +00:00			`# log_level: trace`
add authentik [ci skip] 2024-11-12 20:20:10 +00:00			`secret_key: "${secret_key}"`
			`# This sends anonymous usage-data, stack traces on errors and`
			`# performance data to authentik.error-reporting.a7k.io, and is fully opt-in`
			`error_reporting:`
			`enabled: true`
			`postgresql:`
add pgbouncer in front of authentik to reduce postgres connections [ci skip] 2025-10-08 21:56:03 +00:00			`# host: postgresql.dbaas`
			`host: pgbouncer.authentik`
			`port: 6432`
add authentik [ci skip] 2024-11-12 20:20:10 +00:00			`user: authentik`
			`password: ${postgres_password}`
			`redis:`
[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability Phase 1 - Critical Security: - Netbox: move hardcoded DB/superuser passwords to variables - MeshCentral: disable public registration, add Authentik auth - Traefik: disable insecure API dashboard (api.insecure=false) - Traefik: configure forwarded headers with Cloudflare trusted IPs Phase 2 - Security Hardening: - Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.) - Add Kyverno pod security policies in audit mode (privileged, host namespaces, SYS_ADMIN, trusted registries) - Tighten rate limiting (avg=10, burst=50) - Add Authentik protection to grampsweb Phase 3 - Monitoring & Alerting: - Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale, Authentik, Loki) - Increase Loki retention from 7 to 30 days (720h) - Add predictive PV filling alert (predict_linear) - Re-enable Hackmd and Privatebin down alerts Phase 4 - Reliability: - Add resource requests/limits to Redis, DBaaS, Technitium, Headscale, Vaultwarden, Uptime Kuma - Increase Alloy DaemonSet memory to 512Mi/1Gi Phase 6 - Maintainability: - Extract duplicated tiers locals to terragrunt.hcl generate block (removed from 67 stacks) - Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114 instances across 63 files) - Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references with variables across ~35 stacks - Migrate xray raw ingress resources to ingress_factory modules 2026-02-23 22:05:28 +00:00			`host: ${redis_host}`
add authentik [ci skip] 2024-11-12 20:20:10 +00:00
			`server:`
scale authentik worker and server to 3 replicas [ci skip] 2025-03-16 18:26:32 +00:00			`replicas: 3`
mitigate cluster instability during terraform applies - Recreate strategy for heavy single-replica deployments (onlyoffice, stirling-pdf) - Reduce maxSurge on multi-replica deployments (traefik, authentik, grafana, kyverno) to prevent memory request surge overwhelming scheduler - Weekly etcd defrag CronJob (Sunday 3 AM) to prevent fragmentation buildup - Disable Kyverno policy reports (ephemeral report cleanup) - Cloud-init: journald persistence + 4Gi swap for worker nodes - Kubelet: LimitedSwap behavior for memory pressure relief 2026-03-15 17:23:39 +00:00			`strategy:`
			`type: RollingUpdate`
			`rollingUpdate:`
			`maxSurge: 0`
			`maxUnavailable: 1`
[ci skip] fix OOM crashes: add resource limits for osrm-bicycle, aiostreams, listenarr, authentik - osrm-bicycle: 1Gi limit (loads 403MB routing graph) - aiostreams: 768Mi limit (loads 44K anime entries) - listenarr: 1Gi limit (.NET + Playwright/Chromium) - authentik server: 1Gi limit, worker: 1Gi limit (Django + gunicorn) - servarr: pass nfs_server variable to all submodules 2026-02-28 17:03:33 +00:00			`resources:`
			`requests:`
			`cpu: 100m`
equalize memory req=lim across 70+ containers using Prometheus 7d max data After node2 OOM incident, right-size memory across the cluster by setting requests=limits based on max_over_time(container_memory_working_set_bytes[7d]) with 1.3x headroom. Eliminates ~37Gi overcommit gap. Categories: - Safe equalization (50 containers): set req=lim where max7d well within target - Limit increases (8 containers): raise limits for services spiking above current - No Prometheus data (12 containers): conservatively set lim=req - Exception: nextcloud keeps req=256Mi/lim=8Gi due to Apache memory spikes Also increased dbaas namespace quota from 12Gi to 16Gi to accommodate mysql 4Gi limits across 3 replicas. 2026-03-14 21:46:49 +00:00			`memory: 1Gi`
[ci skip] fix OOM crashes: add resource limits for osrm-bicycle, aiostreams, listenarr, authentik - osrm-bicycle: 1Gi limit (loads 403MB routing graph) - aiostreams: 768Mi limit (loads 44K anime entries) - listenarr: 1Gi limit (.NET + Playwright/Chromium) - authentik server: 1Gi limit, worker: 1Gi limit (Django + gunicorn) - servarr: pass nfs_server variable to all submodules 2026-02-28 17:03:33 +00:00			`limits:`
			`memory: 1Gi`
resource quota review: fix OOM risks, close quota gaps, add HA protections Phase 1 - OOM fixes: - dashy: increase memory limit 512Mi→1Gi (was at 99% utilization) - caretta DaemonSet: set explicit resources 300Mi/512Mi (was at 85-98%) - mysql-operator: add Helm resource values 256Mi/512Mi, create namespace with tier label (was at 92% of LimitRange default) - prowlarr, flaresolverr, annas-archive-stacks: add explicit resources (outgrowing 256Mi LimitRange defaults) - real-estate-crawler celery: add resources 512Mi/3Gi (608Mi actual, no explicit resources) Phase 2 - Close quota gaps: - nvidia, real-estate-crawler, trading-bot: remove custom-quota=true labels so Kyverno generates tier-appropriate quotas - descheduler: add tier=1-cluster label for proper classification Phase 3 - Reduce excessive quotas: - monitoring: limits.memory 240Gi→64Gi, limits.cpu 120→64 - woodpecker: limits.memory 128Gi→32Gi, limits.cpu 64→16 - GPU tier default: limits.memory 96Gi→32Gi, limits.cpu 48→16 Phase 4 - Kubelet protection: - Add cpu: 200m to systemReserved and kubeReserved in kubelet template Phase 5 - HA improvements: - cloudflared: add topology spread (ScheduleAnyway) + PDB (maxUnavailable:1) - grafana: add topology spread + PDB via Helm values - crowdsec LAPI: add topology spread + PDB via Helm values - authentik server: add topology spread via Helm values - authentik worker: add topology spread + PDB via Helm values 2026-03-08 18:17:46 +00:00			`topologySpreadConstraints:`
			`- maxSkew: 1`
			`topologyKey: kubernetes.io/hostname`
			`whenUnsatisfiable: ScheduleAnyway`
			`labelSelector:`
			`matchLabels:`
			`app.kubernetes.io/component: server`
add authentik [ci skip] 2024-11-12 20:20:10 +00:00			`ingress:`
			`enabled: false`
			`# hosts:`
			`# - authentik.viktorbarzin.me`
update diun annotations to correctly monitor for image version updates and update some services alongside[ci skip] 2024-12-30 14:01:38 +00:00			`podAnnotations:`
			`diun.enable: true`
			`diun.include_tags: "^202[0-9].[0-9]+.*$" # no need to annotate the worker as it uses the same image`
[ci skip] add Authentik PDB (minAvailable=2) 2026-03-01 14:24:47 +00:00			`pdb:`
			`enabled: true`
			`minAvailable: 2`
update diun annotations to correctly monitor for image version updates and update some services alongside[ci skip] 2024-12-30 14:01:38 +00:00			`global:`
			`addPrometheusAnnotations: true`
scale authentik worker and server to 3 replicas [ci skip] 2025-03-16 18:26:32 +00:00
			`worker:`
			`replicas: 3`
mitigate cluster instability during terraform applies - Recreate strategy for heavy single-replica deployments (onlyoffice, stirling-pdf) - Reduce maxSurge on multi-replica deployments (traefik, authentik, grafana, kyverno) to prevent memory request surge overwhelming scheduler - Weekly etcd defrag CronJob (Sunday 3 AM) to prevent fragmentation buildup - Disable Kyverno policy reports (ephemeral report cleanup) - Cloud-init: journald persistence + 4Gi swap for worker nodes - Kubelet: LimitedSwap behavior for memory pressure relief 2026-03-15 17:23:39 +00:00			`strategy:`
			`type: RollingUpdate`
			`rollingUpdate:`
			`maxSurge: 0`
			`maxUnavailable: 1`
[ci skip] fix OOM crashes: add resource limits for osrm-bicycle, aiostreams, listenarr, authentik - osrm-bicycle: 1Gi limit (loads 403MB routing graph) - aiostreams: 768Mi limit (loads 44K anime entries) - listenarr: 1Gi limit (.NET + Playwright/Chromium) - authentik server: 1Gi limit, worker: 1Gi limit (Django + gunicorn) - servarr: pass nfs_server variable to all submodules 2026-02-28 17:03:33 +00:00			`resources:`
			`requests:`
Right-size CPU requests cluster-wide and remove missed CPU limits Increase requests for under-requested pods (dashy 50m→250m, frigate 500m→1500m, clickhouse 100m→500m, otp 100m→300m, linkwarden 25m→50m, authentik worker 50m→100m). Reduce requests for over-requested pods (crowdsec agent/lapi 500m→25m each, prometheus 200m→100m, dbaas mysql 1800m→100m, pg-cluster 250m→50m, shlink-web 250m→10m, gpu-pod-exporter 50m→10m, stirling-pdf 100m→25m, technitium 100m→25m, celery 50m→15m). Reduce crowdsec quota from 8→1 CPU. Remove missed CPU limits in prometheus (cpu: "2") and dbaas (cpu: "3600m") tpl files. 2026-03-14 09:22:24 +00:00			`cpu: 100m`
right-size cluster memory: reduce overprovisioned, fix under-provisioned services Phase 1 - Quick wins (~4.5 Gi saved): - democratic-csi: add explicit sidecar resources (64-80Mi vs 256Mi LimitRange default) - caretta: 768Mi → 600Mi (VPA upper 485Mi) - immich-ml: 4Gi → 3584Mi (VPA upper 2.95Gi, GPU margin) - onlyoffice: 3Gi → 2304Mi (VPA upper 1.82Gi) Phase 2 - Safety fixes (prevent OOMKills): - frigate: 2Gi/8Gi → 5Gi/10Gi (VPA upper 7.7Gi, was 4% headroom) - openclaw: 1280Mi req → 2Gi req=limit (documented 2Gi requirement) Phase 3 - Additional right-sizing: - authentik workers: 1Gi → 896Mi x3 (VPA upper 722Mi) - shlink: 512Mi/768Mi → 960Mi req=limit (VPA upper 780Mi, safety increase) Phase 4 - Burstable QoS for lower tiers: - tier-3-edge: 128Mi/128Mi → 96Mi req / 192Mi limit - tier-4-aux: 128Mi/128Mi → 64Mi req / 256Mi limit Phase 5 - Monitoring: - Add ClusterMemoryRequestsHigh alert (>85% allocatable, 15m) - Add ContainerNearOOM alert (>85% limit, 30m) - Add PodUnschedulable alert (5m, critical) Cluster: 92.7% → 90.8% memory requests. Stirling-pdf now schedulable. 2026-03-15 15:30:18 +00:00			`memory: 896Mi`
[ci skip] fix OOM crashes: add resource limits for osrm-bicycle, aiostreams, listenarr, authentik - osrm-bicycle: 1Gi limit (loads 403MB routing graph) - aiostreams: 768Mi limit (loads 44K anime entries) - listenarr: 1Gi limit (.NET + Playwright/Chromium) - authentik server: 1Gi limit, worker: 1Gi limit (Django + gunicorn) - servarr: pass nfs_server variable to all submodules 2026-02-28 17:03:33 +00:00			`limits:`
right-size cluster memory: reduce overprovisioned, fix under-provisioned services Phase 1 - Quick wins (~4.5 Gi saved): - democratic-csi: add explicit sidecar resources (64-80Mi vs 256Mi LimitRange default) - caretta: 768Mi → 600Mi (VPA upper 485Mi) - immich-ml: 4Gi → 3584Mi (VPA upper 2.95Gi, GPU margin) - onlyoffice: 3Gi → 2304Mi (VPA upper 1.82Gi) Phase 2 - Safety fixes (prevent OOMKills): - frigate: 2Gi/8Gi → 5Gi/10Gi (VPA upper 7.7Gi, was 4% headroom) - openclaw: 1280Mi req → 2Gi req=limit (documented 2Gi requirement) Phase 3 - Additional right-sizing: - authentik workers: 1Gi → 896Mi x3 (VPA upper 722Mi) - shlink: 512Mi/768Mi → 960Mi req=limit (VPA upper 780Mi, safety increase) Phase 4 - Burstable QoS for lower tiers: - tier-3-edge: 128Mi/128Mi → 96Mi req / 192Mi limit - tier-4-aux: 128Mi/128Mi → 64Mi req / 256Mi limit Phase 5 - Monitoring: - Add ClusterMemoryRequestsHigh alert (>85% allocatable, 15m) - Add ContainerNearOOM alert (>85% limit, 30m) - Add PodUnschedulable alert (5m, critical) Cluster: 92.7% → 90.8% memory requests. Stirling-pdf now schedulable. 2026-03-15 15:30:18 +00:00			`memory: 896Mi`
resource quota review: fix OOM risks, close quota gaps, add HA protections Phase 1 - OOM fixes: - dashy: increase memory limit 512Mi→1Gi (was at 99% utilization) - caretta DaemonSet: set explicit resources 300Mi/512Mi (was at 85-98%) - mysql-operator: add Helm resource values 256Mi/512Mi, create namespace with tier label (was at 92% of LimitRange default) - prowlarr, flaresolverr, annas-archive-stacks: add explicit resources (outgrowing 256Mi LimitRange defaults) - real-estate-crawler celery: add resources 512Mi/3Gi (608Mi actual, no explicit resources) Phase 2 - Close quota gaps: - nvidia, real-estate-crawler, trading-bot: remove custom-quota=true labels so Kyverno generates tier-appropriate quotas - descheduler: add tier=1-cluster label for proper classification Phase 3 - Reduce excessive quotas: - monitoring: limits.memory 240Gi→64Gi, limits.cpu 120→64 - woodpecker: limits.memory 128Gi→32Gi, limits.cpu 64→16 - GPU tier default: limits.memory 96Gi→32Gi, limits.cpu 48→16 Phase 4 - Kubelet protection: - Add cpu: 200m to systemReserved and kubeReserved in kubelet template Phase 5 - HA improvements: - cloudflared: add topology spread (ScheduleAnyway) + PDB (maxUnavailable:1) - grafana: add topology spread + PDB via Helm values - crowdsec LAPI: add topology spread + PDB via Helm values - authentik server: add topology spread via Helm values - authentik worker: add topology spread + PDB via Helm values 2026-03-08 18:17:46 +00:00			`topologySpreadConstraints:`
			`- maxSkew: 1`
			`topologyKey: kubernetes.io/hostname`
			`whenUnsatisfiable: ScheduleAnyway`
			`labelSelector:`
			`matchLabels:`
			`app.kubernetes.io/component: worker`
			`pdb:`
			`enabled: true`
			`maxUnavailable: 1`
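
As a rough illustration of what the server `pdb` values might render to, the sketch below shows a PodDisruptionBudget for the server pods. The resource name and namespace are hypothetical, and the selector reuses the `app.kubernetes.io/component: server` label from the topology spread constraint as an assumption; the chart's real selector probably carries additional labels. Only `minAvailable: 2` is taken directly from the values file.

```yaml
# Sketch only: an approximation of the PDB implied by server.pdb, not the chart's actual output.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: authentik-server        # hypothetical name
  namespace: authentik          # assumed namespace
spec:
  minAvailable: 2               # from server.pdb.minAvailable
  selector:
    matchLabels:
      app.kubernetes.io/component: server   # assumed selector label
```

With `replicas: 3`, `minAvailable: 2` allows at most one voluntary disruption at a time, which lines up with `maxUnavailable: 1` in the rolling update strategy. The worker expresses the same budget from the other side with `maxUnavailable: 1`.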