authentik:
  log_level: warning
  # log_level: trace
  secret_key: "${secret_key}"
  # This sends anonymous usage-data, stack traces on errors and
  # performance data to authentik.error-reporting.a7k.io, and is fully opt-in
  error_reporting:
    enabled: true
  postgresql:
    # host: postgresql.dbaas
    host: pgbouncer.authentik
    port: 6432
    user: authentik
    password: ${postgres_password}
  redis:
    host: ${redis_host}

server:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
  resources:
    requests:
      cpu: 100m
      memory: 1Gi
    limits:
      memory: 1Gi
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app.kubernetes.io/component: server
  ingress:
    enabled: false
    # hosts:
    # - authentik.viktorbarzin.me
  podAnnotations:
    diun.enable: true
    diun.include_tags: "^202[0-9].[0-9]+.*$" # no need to annotate the worker as it uses the same image
  pdb:
    enabled: true
    minAvailable: 2
global:
  addPrometheusAnnotations: true

worker:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
  resources:
    requests:
      cpu: 100m
      memory: 896Mi
    limits:
      memory: 896Mi
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
      labelSelector:
        matchLabels:
          app.kubernetes.io/component: worker
  pdb:
    enabled: true
    maxUnavailable: 1