DECISION: Disable Loki because its operational overhead outweighs its benefit
EVIDENCE FROM NODE2 INCIDENT:
- Loki was the root cause of a major cluster outage (PVC storage exhaustion)
- Centralized logging was unavailable exactly when it was needed most (Loki itself was down)
- All debugging was accomplished with simpler tools (kubectl logs, events, describe)
- Prometheus metrics proved more valuable than centralized logs (see the query sketch below)
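For reference, the PVC-exhaustion condition behind this incident can be checked directly against the Prometheus HTTP API. A minimal sketch; the Prometheus address and the 10% threshold are assumptions, and the kubelet_volume_stats_* metrics must be scraped for this to return data:

    # Flag PVCs with less than 10% free space (hypothetical Prometheus address;
    # adjust host/port for this cluster).
    curl -s 'http://prometheus.monitoring.svc:9090/api/v1/query' \
      --data-urlencode 'query=kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.10'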
OPERATIONAL OVERHEAD ELIMINATED:
✅ 50GB of expensive iSCSI storage freed
✅ ~3.5GB of memory freed (Loki plus Alloy agents across the cluster)
✅ ~2 CPU cores freed for actual workloads (reclaim verified in the sketch after this list)
✅ Reduced complexity: fewer services to maintain and troubleshoot
✅ Eliminated a single point of failure that could cascade cluster-wide
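A quick way to confirm the reclaim took effect, sketched below. Matching on "loki"/"alloy" in resource names is an assumption about how the releases were labeled, and kubectl top requires metrics-server:

    # Expect no output from either grep once the teardown is complete.
    kubectl get pods -A | grep -Ei 'loki|alloy'
    kubectl get pvc -A | grep -i 'loki'
    # Check per-node headroom after the cleanup (needs metrics-server).
    kubectl top nodes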
CONFIGURATION PRESERVED:
✅ All Terraform resources commented out (not deleted); state check sketched after this list
✅ loki.yaml preserved with its 50GB storage configuration
✅ alloy.yaml preserved with its log-shipping configuration
✅ Alert rules and Grafana datasource preserved (commented out)
✅ Easy re-enabling: uncomment the resources and apply (see RE-ENABLING below)
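One way to double-check that the commented-out resources actually left the Terraform state after the destroy apply; a sketch assuming the resource addresses contain "loki" or "alloy":

    # List any remaining Loki/Alloy resources in state; expect no output.
    terraform state list | grep -Ei 'loki|alloy'
    # A clean follow-up plan should report no pending changes.
    terraform plan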
ALTERNATIVE DEBUGGING APPROACH:
✅ kubectl logs (always works, no storage dependency); command sketch after this list
✅ kubectl get events (built-in Kubernetes events)
✅ Prometheus metrics (more reliable for monitoring)
✅ Enhanced health check scripts (direct status verification)
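The kinds of commands this approach relies on, sketched below. The namespace and resource names are placeholders, not values from this cluster:

    # Tail current and previous container logs (no storage backend needed).
    kubectl logs deploy/<app> -n <namespace> --tail=100
    kubectl logs <pod> -n <namespace> --previous
    # Recent events, oldest first.
    kubectl get events -n <namespace> --sort-by=.lastTimestamp
    # Full status, conditions, and recent events for a single pod.
    kubectl describe pod <pod> -n <namespace>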
RE-ENABLING:
To restore Loki later, uncomment all /* ... */ blocks in loki.tf
and apply via Terraform, as sketched below. All configuration is preserved.
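The restore workflow as a sketch; the -target address shown is a placeholder, and targeting is optional:

    # 1. Uncomment the /* ... */ blocks in loki.tf.
    # 2. Review exactly what will be recreated.
    terraform plan
    # 3. Apply everything, or target the Loki resources only.
    terraform apply
    # terraform apply -target=helm_release.loki   # placeholder address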
[ci skip] - Infrastructure changes applied locally first, since this commit tears down running resources