loki:
  commonConfig:
    replication_factor: 1
  schemaConfig:
    configs:
      - from: "2025-04-01"
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  ingester:
    chunk_idle_period: 12h
    max_chunk_age: 24h
    chunk_retain_period: 1m
    chunk_target_size: 1572864
    wal:
      dir: /loki-wal
  pattern_ingester:
    enabled: true
  limits_config:
    allow_structured_metadata: true
    volume_enabled: true
    retention_period: 720h  # 30 days
  compactor:
    retention_enabled: true
    working_directory: /var/loki/compactor
    compaction_interval: 1h
    delete_request_store: filesystem
  ruler:
    enable_api: true
    storage:
      type: local
      local:
        directory: /var/loki/rules
    alertmanager_url: http://prometheus-alertmanager.monitoring.svc.cluster.local:9093
    ring:
      kvstore:
        store: inmemory
    rule_path: /var/loki/scratch
  storage:
    type: "filesystem"
  auth_enabled: false

minio:
  enabled: false

deploymentMode: SingleBinary

singleBinary:
  replicas: 1
  persistence:
    enabled: true
    size: 50Gi
    storageClass: "iscsi-truenas"
  extraVolumes:
    - name: wal
      emptyDir:
        medium: Memory
        sizeLimit: 2Gi
    - name: rules
      configMap:
        name: loki-alert-rules
  extraVolumeMounts:
    - name: wal
      mountPath: /loki-wal
    - name: rules
      # "fake" is the tenant ID Loki uses when auth_enabled is false
      mountPath: /var/loki/rules/fake
  resources:
    requests:
      cpu: 250m
      memory: 2Gi
    limits:
      memory: 4Gi

# Zero out replica counts of other deployment modes
backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0
ingester:
  replicas: 0
querier:
  replicas: 0
queryFrontend:
  replicas: 0
queryScheduler:
  replicas: 0
distributor:
  replicas: 0
compactor:
  replicas: 0
indexGateway:
  replicas: 0
bloomCompactor:
  replicas: 0
bloomGateway:
  replicas: 0

# Disable optional components for single binary mode
gateway:
  enabled: false
chunksCache:
  enabled: false
resultsCache:
  enabled: false