infra/stacks
Viktor Barzin ce79bd5c04 Add node hang instrumentation and scale down chromium services
- Add journald collection to Alloy (loki.source.journal) for kernel OOM,
  panic, hung task, and soft lockup detection — ships system logs off-node
  so they survive hard resets
- Add 5 Loki alerting rules (KernelOOMKiller, KernelPanic, KernelHungTask,
  KernelSoftLockup, ContainerdDown) evaluating against node-journal logs
- Fix Loki ruler config: correct rules mount path (/var/loki/rules/fake),
  add alertmanager_url and enable_api
- Add Prometheus alerts: NodeMemoryPressureTrending (>85%), NodeExporterDown,
  NodeHighIOWait (>30%)
- Add caretta tolerations for control-plane and GPU nodes
- Scale down chromium-based services to 0 for cluster stability:
  f1-stream, flaresolverr, changedetection, resume/printer
2026-03-13 22:20:28 +00:00
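To illustrate what the five Loki alerting rules in the commit above are looking for, here is a minimal Python sketch of the line matching they might perform on journal entries. It is not the repository's rule file: the regular expressions are assumptions based on common upstream kernel and systemd message wording, and the helper names (`ALERT_PATTERNS`, `classify`, `firing_alerts`) are hypothetical. The actual rules are LogQL expressions evaluated by the Loki ruler, and they are not shown on this page.

```python
import re

# Assumed patterns: typical kernel/systemd message fragments. The real LogQL
# filters used by the repository's rules are not visible in this listing.
ALERT_PATTERNS = {
    "KernelOOMKiller": re.compile(r"Out of memory: Killed process|invoked oom-killer"),
    "KernelPanic": re.compile(r"Kernel panic - not syncing"),
    "KernelHungTask": re.compile(r"blocked for more than \d+ seconds"),
    "KernelSoftLockup": re.compile(r"soft lockup - CPU#\d+ stuck"),
    "ContainerdDown": re.compile(r"containerd\.service: (Failed|Main process exited)"),
}


def classify(line: str) -> list[str]:
    """Return the name of every alert whose pattern matches the journal line."""
    return [name for name, pattern in ALERT_PATTERNS.items() if pattern.search(line)]


def firing_alerts(lines):
    """Count matching lines per alert name.

    This loosely mirrors a rule that fires when the number of matching
    log lines in a time window is greater than zero.
    """
    counts = {}
    for line in lines:
        for name in classify(line):
            counts[name] = counts.get(name, 0) + 1
    return counts


if __name__ == "__main__":
    sample = [
        "kernel: Out of memory: Killed process 4242 (chrome)",
        "kernel: INFO: task kworker/0:1:57 blocked for more than 120 seconds.",
        "systemd[1]: Started Daily apt download activities.",
    ]
    # Any alert name present in the result would be considered firing.
    print(firing_alerts(sample))
```

Because the journal is shipped off-node before it is evaluated, matching of this kind still works when the node itself has hard-reset, which is the motivation stated in the commit message.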
actualbudget fix cluster health: pin actualbudget, spread MySQL, scale grampsweb, fix GPU toleration 2026-03-11 11:43:34 +00:00
affine [ci skip] add Homepage gethomepage.dev annotations to all services 2026-03-07 20:39:54 +00:00
audiobookshelf [ci skip] add widgets for audiobookshelf, changedetection, prowlarr, headscale 2026-03-07 20:39:55 +00:00
blog [ci skip] add Homepage gethomepage.dev annotations to all services 2026-03-07 20:39:54 +00:00
calibre resource quota review: fix OOM risks, close quota gaps, add HA protections 2026-03-08 18:17:46 +00:00
changedetection Add node hang instrumentation and scale down chromium services 2026-03-13 22:20:28 +00:00
city-guesser [ci skip] fix invalid Homepage dashboard icons for 9 services 2026-03-07 21:14:17 +00:00
coturn [ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars 2026-03-07 14:30:36 +00:00
cyberchef [ci skip] add Homepage gethomepage.dev annotations to all services 2026-03-07 20:39:54 +00:00
dashy resource quota review: fix OOM risks, close quota gaps, add HA protections 2026-03-08 18:17:46 +00:00
dawarich [ci skip] add Homepage gethomepage.dev annotations to all services 2026-03-07 20:39:54 +00:00
descheduler resource quota review: fix OOM risks, close quota gaps, add HA protections 2026-03-08 18:17:46 +00:00
diun [ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars 2026-03-07 14:30:36 +00:00
ebook2audiobook [ci skip] add Homepage gethomepage.dev annotations to all services 2026-03-07 20:39:54 +00:00
echo [ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars 2026-03-07 14:30:36 +00:00
excalidraw [ci skip] add Homepage gethomepage.dev annotations to all services 2026-03-07 20:39:54 +00:00
f1-stream Add node hang instrumentation and scale down chromium services 2026-03-13 22:20:28 +00:00
forgejo [ci skip] add Forgejo task pipeline for OpenClaw AI agent 2026-03-07 21:11:07 +00:00
freedify [ci skip] add Homepage gethomepage.dev annotations to all services 2026-03-07 20:39:54 +00:00
freshrss [ci skip] add widgets for qbittorrent, navidrome, nextcloud, freshrss, linkwarden, uptime-kuma 2026-03-07 20:39:55 +00:00
frigate fix Frigate GPU stall: add inference speed check to liveness probe 2026-03-13 10:25:46 +00:00
grampsweb fix cluster health: pin actualbudget, spread MySQL, scale grampsweb, fix GPU toleration 2026-03-11 11:43:34 +00:00
hackmd [ci skip] add Homepage gethomepage.dev annotations to all services 2026-03-07 20:39:54 +00:00
health [ci skip] add Homepage gethomepage.dev annotations to all services 2026-03-07 20:39:54 +00:00
homepage add nginx caching proxy for Homepage widget API requests 2026-03-07 21:11:07 +00:00
immich [ci skip] fix widget issues: ports, Immich v2 API, Nextcloud trusted domains 2026-03-07 20:39:56 +00:00
infra feat(monitoring): Enhance disk monitoring and containerd GC after node2 incident 2026-03-13 07:16:56 +00:00
isponsorblocktv [ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars 2026-03-07 14:30:36 +00:00
jsoncrack [ci skip] add Homepage gethomepage.dev annotations to all services 2026-03-07 20:39:54 +00:00
k8s-dashboard [ci skip] add Homepage gethomepage.dev annotations to all services 2026-03-07 20:39:54 +00:00
kms [ci skip] add Homepage gethomepage.dev annotations to all services 2026-03-07 20:39:54 +00:00
linkwarden [ci skip] fix widget URLs: use correct k8s service ports 2026-03-07 20:39:56 +00:00
matrix [ci skip] add Homepage gethomepage.dev annotations to all services 2026-03-07 20:39:54 +00:00
meshcentral [ci skip] add Homepage gethomepage.dev annotations to all services 2026-03-07 20:39:54 +00:00
n8n [ci skip] add Forgejo task pipeline for OpenClaw AI agent 2026-03-07 21:11:07 +00:00
navidrome [ci skip] fix widget URLs: use correct k8s service ports 2026-03-07 20:39:56 +00:00
netbox [ci skip] add Homepage gethomepage.dev annotations to all services 2026-03-07 20:39:54 +00:00
networking-toolbox [ci skip] fix Homepage icons for Tandoor, Listenarr, Networking Toolbox, Goldilocks 2026-03-07 21:29:51 +00:00
nextcloud fix(nextcloud): Database corruption recovery and conservative Apache tuning 2026-03-12 13:38:37 +00:00
ntfy [ci skip] add Homepage gethomepage.dev annotations to all services 2026-03-07 20:39:54 +00:00
ollama [ci skip] add Homepage gethomepage.dev annotations to all services 2026-03-07 20:39:54 +00:00
onlyoffice [ci skip] add Homepage gethomepage.dev annotations to all services 2026-03-07 20:39:54 +00:00
openclaw [ci skip] add Forgejo task pipeline for OpenClaw AI agent 2026-03-07 21:11:07 +00:00
osm_routing [ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars 2026-03-07 14:30:36 +00:00
owntracks [ci skip] add Homepage gethomepage.dev annotations to all services 2026-03-07 20:39:54 +00:00
paperless-ngx [ci skip] add widgets for audiobookshelf, changedetection, prowlarr, headscale 2026-03-07 20:39:55 +00:00
platform Add node hang instrumentation and scale down chromium services 2026-03-13 22:20:28 +00:00
plotting-book set Recreate strategy for plotting-book deployment 2026-03-10 23:47:30 +00:00
poison-fountain [ci skip] fix invalid Homepage dashboard icons for 9 services 2026-03-07 21:14:17 +00:00
privatebin [ci skip] add Homepage gethomepage.dev annotations to all services 2026-03-07 20:39:54 +00:00
real-estate-crawler resource quota review: fix OOM risks, close quota gaps, add HA protections 2026-03-08 18:17:46 +00:00
reloader [ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars 2026-03-07 14:30:36 +00:00
resume Add node hang instrumentation and scale down chromium services 2026-03-13 22:20:28 +00:00
rybbit [ci skip] add Homepage gethomepage.dev annotations to all services 2026-03-07 20:39:54 +00:00
send [ci skip] add liveness probe to Send deployment 2026-03-07 20:39:57 +00:00
servarr Add node hang instrumentation and scale down chromium services 2026-03-13 22:20:28 +00:00
shadowsocks [ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars 2026-03-07 14:30:36 +00:00
speedtest [ci skip] fix broken Homepage widgets + add service API tokens to SOPS 2026-03-07 20:39:55 +00:00
stirling-pdf [ci skip] add Homepage gethomepage.dev annotations to all services 2026-03-07 20:39:54 +00:00
tandoor [ci skip] fix Homepage icons for Tandoor, Listenarr, Networking Toolbox, Goldilocks 2026-03-07 21:29:51 +00:00
terminal Add terminal stack - reverse proxy to ttyd behind authentik 2026-03-10 23:46:01 +00:00
tor-proxy [ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars 2026-03-07 14:30:36 +00:00
trading-bot resource quota review: fix OOM risks, close quota gaps, add HA protections 2026-03-08 18:17:46 +00:00
travel_blog [ci skip] add Homepage gethomepage.dev annotations to all services 2026-03-07 20:39:54 +00:00
tuya-bridge [ci skip] fix invalid Homepage dashboard icons for 9 services 2026-03-07 21:14:17 +00:00
url [ci skip] add Homepage widget credentials for Authentik, Shlink, Home Assistant 2026-03-07 20:39:54 +00:00
wealthfolio [ci skip] fix Wealthfolio Homepage icon: wealthfolio.png → mdi-finance 2026-03-07 21:32:58 +00:00
webhook_handler [ci skip] add Homepage gethomepage.dev annotations to all services 2026-03-07 20:39:54 +00:00
whisper [ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars 2026-03-07 14:30:36 +00:00
woodpecker resource quota review: fix OOM risks, close quota gaps, add HA protections 2026-03-08 18:17:46 +00:00
ytdlp [ci skip] add Homepage gethomepage.dev annotations to all services 2026-03-07 20:39:54 +00:00