infra/stacks
Viktor Barzin 6377a8b85b Monitoring overhaul: reduce noise, add coverage gaps, auto-load dashboards

Noise reduction (8 alerts tuned):
- PoisonFountainDown: 2m→5m, critical→warning (fail-open service)
- NodeExporterDown: 2m→5m (flaps during node restarts)
- PowerOutage: add for:1m (debounce transient voltage dips)
- New Tailscale client: add for:5m (debounce headscale reauths)
- NoNodeLoadData: use absent() instead of OR vector(0)==0
- NodeHighCPUUsage: 30%→60% (normal for 70+ services)
- HighMemoryUsage GPU: 12GB/5m→14GB/15m (T4=16GB, model loading)
- PrometheusStorageFull: 50GiB→150GiB (TSDB cap is 180GB)
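
The `for:` changes above rely on the alert condition having to hold continuously before the alert fires. A minimal Python sketch of that debounce behaviour, as an illustration rather than Prometheus's actual implementation (the function name and the sample format are hypothetical):

```python
from typing import Iterable, Optional, Tuple


def first_firing_time(
    samples: Iterable[Tuple[float, bool]], for_seconds: float
) -> Optional[float]:
    """Return the evaluation timestamp at which the alert would first fire.

    `samples` is a sequence of (timestamp_seconds, condition_is_true) pairs,
    one per rule evaluation. The alert becomes pending when the condition
    first turns true and fires once it has stayed true for `for_seconds`.
    Returns None if the alert never fires.
    """
    pending_since: Optional[float] = None
    for ts, condition in samples:
        if not condition:
            # Condition cleared: drop the pending state, so a short dip
            # never accumulates towards the `for` duration.
            pending_since = None
            continue
        if pending_since is None:
            pending_since = ts
        if ts - pending_since >= for_seconds:
            return ts
    return None


# A 30-second dip, evaluated every 15 seconds, with for: 1m.
transient_dip = [(0, False), (15, True), (30, True), (45, False), (60, False)]
# A condition that stays true across several evaluations.
sustained = [(0, True), (15, True), (30, True), (45, True), (60, True), (75, True)]
```

With `for_seconds=60`, `transient_dip` never fires while `sustained` does, which is the effect the PowerOutage and New Tailscale client changes are after.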

Alert regrouping:
- Move MailServerDown, HackmdDown, PrivatebinDown → new "Application Health"
- Move New Tailscale client → "Infrastructure Health"

New alerts (14):
- Networking: Cloudflared (2), MetalLB (2), Technitium DNS
- Storage: NFS CSI, iSCSI CSI controllers
- Critical Services: PgBouncer, CNPG operator, MySQL operator
- Infra Health: CrowdSec, Kyverno, Sealed Secrets, Woodpecker

Inhibit rules:
- Consolidate 3 NodeDown rules into 1 comprehensive rule
- Extend NFS rule to suppress NFS-dependent services
- Add PowerOutage → downstream suppression
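
The inhibit rules above can be pictured with a small Python sketch of how a source alert suppresses matching targets. It is a simplified model of Alertmanager-style inhibition, not the real matcher code; the `node` label and the empty target matcher are assumptions, not taken from this repository's configuration:

```python
from typing import Dict, Sequence

Alert = Dict[str, str]  # an alert is modelled as its label set


def matches(alert: Alert, matchers: Dict[str, str]) -> bool:
    """True if every matcher label has exactly the given value on the alert."""
    return all(alert.get(k) == v for k, v in matchers.items())


def is_inhibited(
    target: Alert,
    firing: Sequence[Alert],
    source_match: Dict[str, str],
    target_match: Dict[str, str],
    equal: Sequence[str],
) -> bool:
    """Return True if some firing source alert suppresses `target`.

    A source suppresses a target when the source matches `source_match`,
    the target matches `target_match`, and both carry the same value for
    every label in `equal` (a missing label counts as an empty value).
    """
    if not matches(target, target_match):
        return False
    for source in firing:
        if source is target:
            continue  # an alert does not inhibit itself
        if not matches(source, source_match):
            continue
        if all(source.get(l, "") == target.get(l, "") for l in equal):
            return True
    return False


# Hypothetical rule: NodeDown suppresses every other alert on the same node.
node_down = {"alertname": "NodeDown", "node": "n1"}
exporter_n1 = {"alertname": "NodeExporterDown", "node": "n1"}
exporter_n2 = {"alertname": "NodeExporterDown", "node": "n2"}
firing = [node_down, exporter_n1, exporter_n2]
rule = dict(source_match={"alertname": "NodeDown"}, target_match={}, equal=["node"])
```

Here `is_inhibited(exporter_n1, firing, **rule)` is true and `is_inhibited(exporter_n2, firing, **rule)` is false, so only alerts on the affected node are silenced.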

Dashboard loading:
- Add for_each ConfigMap in grafana.tf to auto-load all 18 dashboards
- Remove duplicate caretta dashboard ConfigMap from caretta.tf
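
The dashboard auto-loading amounts to generating one ConfigMap per dashboard file. A Python sketch of the same idea, not the actual `grafana.tf` code: the `monitoring` namespace, the `grafana-dashboard-` name prefix, and the `grafana_dashboard` label (the Grafana Helm chart's default sidecar label) are all assumptions:

```python
from pathlib import Path
from typing import Dict


def dashboard_configmaps(
    dashboard_dir: str, namespace: str = "monitoring"
) -> Dict[str, dict]:
    """Build one ConfigMap manifest per *.json file in `dashboard_dir`.

    Mirrors a Terraform `for_each` over a file set: the dictionary key is
    the file name, and the value is the manifest created for that file.
    """
    manifests: Dict[str, dict] = {}
    for path in sorted(Path(dashboard_dir).glob("*.json")):
        # Kubernetes object names must be lowercase and may not contain "_".
        name = f"grafana-dashboard-{path.stem}".lower().replace("_", "-")
        manifests[path.name] = {
            "apiVersion": "v1",
            "kind": "ConfigMap",
            "metadata": {
                "name": name,
                "namespace": namespace,
                # Assumed label for a dashboard-loading sidecar to select on.
                "labels": {"grafana_dashboard": "1"},
            },
            "data": {path.name: path.read_text(encoding="utf-8")},
        }
    return manifests
```

Keying the result by file name means adding or removing a dashboard file changes exactly one ConfigMap, which is what makes the per-dashboard ConfigMap in `caretta.tf` redundant.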
2026-03-14 10:25:31 +00:00
actualbudget Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
affine Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
audiobookshelf Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
blog Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
calibre Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
changedetection Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
city-guesser Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
claude-memory feat(claude-memory): add stack and update image to standalone repo 2026-03-14 09:49:38 +00:00
coturn Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
cyberchef Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
dashy Right-size CPU requests cluster-wide and remove missed CPU limits 2026-03-14 09:22:24 +00:00
dawarich Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
descheduler Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
diun Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
ebook2audiobook Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
echo Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
excalidraw Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
f1-stream Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
forgejo Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
freedify Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
freshrss Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
frigate Right-size CPU requests cluster-wide and remove missed CPU limits 2026-03-14 09:22:24 +00:00
grampsweb Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
hackmd Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
health Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
homepage Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
immich Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
infra Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
isponsorblocktv Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
jsoncrack Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
k8s-dashboard Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
kms Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
linkwarden Right-size CPU requests cluster-wide and remove missed CPU limits 2026-03-14 09:22:24 +00:00
matrix Migrate Matrix Synapse from SQLite to PostgreSQL 2026-03-13 23:21:59 +00:00
meshcentral Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
n8n Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
navidrome Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
netbox Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
networking-toolbox Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
nextcloud Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
ntfy Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
ollama Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
onlyoffice Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
openclaw Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
osm_routing Right-size CPU requests cluster-wide and remove missed CPU limits 2026-03-14 09:22:24 +00:00
owntracks Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
paperless-ngx Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
platform Monitoring overhaul: reduce noise, add coverage gaps, auto-load dashboards 2026-03-14 10:25:31 +00:00
plotting-book Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
poison-fountain Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
privatebin Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
real-estate-crawler Right-size CPU requests cluster-wide and remove missed CPU limits 2026-03-14 09:22:24 +00:00
reloader [ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars 2026-03-07 14:30:36 +00:00
resume Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
rybbit Right-size CPU requests cluster-wide and remove missed CPU limits 2026-03-14 09:22:24 +00:00
send Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
servarr Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
shadowsocks Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
speedtest Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
stirling-pdf Right-size CPU requests cluster-wide and remove missed CPU limits 2026-03-14 09:22:24 +00:00
tandoor Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
terminal Add terminal stack - reverse proxy to ttyd behind authentik 2026-03-10 23:46:01 +00:00
tor-proxy Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
trading-bot Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
travel_blog Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
tuya-bridge Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
url Right-size CPU requests cluster-wide and remove missed CPU limits 2026-03-14 09:22:24 +00:00
wealthfolio Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
webhook_handler Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
whisper Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
woodpecker Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00
ytdlp Remove all CPU limits cluster-wide to eliminate CFS throttling 2026-03-14 08:51:45 +00:00