infra

1754 commits 2 branches 0 tags 2.3 GiB

Author	SHA1	Message	Date
Viktor Barzin	b00f810d3d	Remove all CPU limits cluster-wide to eliminate CFS throttling CPU limits cause CFS throttling even when nodes have idle capacity. Move to a request-only CPU model: keep CPU requests for scheduling fairness but remove all CPU limits. Memory limits stay (incompressible). Changes across 108 files: - Kyverno LimitRange policy: remove cpu from default/max in all 6 tiers - Kyverno ResourceQuota policy: remove limits.cpu from all 5 tiers - Custom ResourceQuotas: remove limits.cpu from 8 namespace quotas - Custom LimitRanges: remove cpu from default/max (nextcloud, onlyoffice) - RBAC module: remove cpu_limits variable and quota reference - Freedify factory: remove cpu_limit variable and limits reference - 86 deployment files: remove cpu from all limits blocks - 6 Helm values files: remove cpu under limits sections	2026-03-14 08:51:45 +00:00
Viktor Barzin	ce79bd5c04	Add node hang instrumentation and scale down chromium services - Add journald collection to Alloy (loki.source.journal) for kernel OOM, panic, hung task, and soft lockup detection — ships system logs off-node so they survive hard resets - Add 5 Loki alerting rules (KernelOOMKiller, KernelPanic, KernelHungTask, KernelSoftLockup, ContainerdDown) evaluating against node-journal logs - Fix Loki ruler config: correct rules mount path (/var/loki/rules/fake), add alertmanager_url and enable_api - Add Prometheus alerts: NodeMemoryPressureTrending (>85%), NodeExporterDown, NodeHighIOWait (>30%) - Add caretta tolerations for control-plane and GPU nodes - Scale down chromium-based services to 0 for cluster stability: f1-stream, flaresolverr, changedetection, resume/printer	2026-03-13 22:20:28 +00:00
OpenClaw	523f0ba7eb	fix(monitoring): Expand Loki PVC from 15GB to 50GB to resolve storage exhaustion ISSUE RESOLVED: - Root cause: Loki's 15GB iSCSI PVC was completely full - Symptom: 'no space left on device' errors during TSDB operations - Impact: Loki service completely down, logging unavailable - Side effects: Contributed to node2 containerd corruption incident SOLUTION APPLIED: - Expanded PVC storage: 15Gi → 50Gi via direct kubectl patch - Triggered pod restart to complete filesystem resize - Verified successful expansion and service recovery CURRENT STATUS: ✅ PVC: 50Gi capacity (iscsi-truenas storage class) ✅ Loki StatefulSet: 1/1 ready ✅ Loki Pod: 2/2 containers running ✅ Service: Successfully processing log streams ✅ No storage errors in recent logs TERRAFORM ALIGNED: - Updated loki.yaml persistence.size to match actual PVC - Infrastructure code now reflects deployed state [ci skip] - Emergency fix applied locally first due to service outage	2026-03-13 08:13:05 +00:00
Viktor Barzin	100a876dfe	[ci skip] migrate Redis, Prometheus, Loki storage to iSCSI - Redis: local-path → iscsi-truenas (master + replica persistence) - Prometheus: NFS PV+PVC → dynamic iSCSI PVC (prometheus-data) - Loki: NFS PV → dynamic iSCSI via storageClass in Helm values - Deleted 2 orphaned Released iSCSI PVs (31Gi freed)	2026-03-06 20:50:55 +00:00
Viktor Barzin	87ef313888	[ci skip] fix post-NFS-migration issues: MySQL GR, Loki, grampsweb, alerts - Loki: reduce memory limit from 6Gi to 4Gi (within LimitRange max) - Grampsweb: increase memory to 2Gi (was OOMKilled at 512Mi) - Fix PostgreSQLDown alert: check pod readiness instead of deployment - Fix MySQLDown alert: check StatefulSet replicas instead of deployment - Fix RedisDown alert: check StatefulSet replicas instead of deployment - Fix NFSServerUnresponsive: aggregate all NFS versions cluster-wide - Fix Uptime Kuma healthcheck: handle nested list heartbeat format - Update etcd backup image to registry.k8s.io/etcd:3.6.5-0	2026-03-03 21:10:26 +00:00
Viktor Barzin	89a6e08245	[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability Phase 1 - Critical Security: - Netbox: move hardcoded DB/superuser passwords to variables - MeshCentral: disable public registration, add Authentik auth - Traefik: disable insecure API dashboard (api.insecure=false) - Traefik: configure forwarded headers with Cloudflare trusted IPs Phase 2 - Security Hardening: - Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.) - Add Kyverno pod security policies in audit mode (privileged, host namespaces, SYS_ADMIN, trusted registries) - Tighten rate limiting (avg=10, burst=50) - Add Authentik protection to grampsweb Phase 3 - Monitoring & Alerting: - Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale, Authentik, Loki) - Increase Loki retention from 7 to 30 days (720h) - Add predictive PV filling alert (predict_linear) - Re-enable Hackmd and Privatebin down alerts Phase 4 - Reliability: - Add resource requests/limits to Redis, DBaaS, Technitium, Headscale, Vaultwarden, Uptime Kuma - Increase Alloy DaemonSet memory to 512Mi/1Gi Phase 6 - Maintainability: - Extract duplicated tiers locals to terragrunt.hcl generate block (removed from 67 stacks) - Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114 instances across 63 files) - Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references with variables across ~35 stacks - Migrate xray raw ingress resources to ingress_factory modules	2026-02-23 22:05:28 +00:00
Viktor Barzin	e6420c7b36	[ci skip] Move Terraform modules into stack directories Move all 88 service modules (66 individual + 22 platform) from modules/kubernetes/<service>/ into their corresponding stack directories: - Service stacks: stacks/<service>/module/ - Platform stack: stacks/platform/modules/<service>/ This collocates module source code with its Terragrunt definition. Only shared utility modules remain in modules/kubernetes/: ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy. All cross-references to shared modules updated to use correct relative paths. Verified with terragrunt run --all -- plan: 0 adds, 0 destroys across all 68 stacks.	2026-02-22 14:38:14 +00:00

Renamed from modules/kubernetes/monitoring/loki.yaml

7 commits