Commit graph

41 commits

Author SHA1 Message Date
Viktor Barzin
ce79bd5c04 Add node hang instrumentation and scale down chromium services
- Add journald collection to Alloy (loki.source.journal) for kernel OOM,
  panic, hung task, and soft lockup detection — ships system logs off-node
  so they survive hard resets
- Add 5 Loki alerting rules (KernelOOMKiller, KernelPanic, KernelHungTask,
  KernelSoftLockup, ContainerdDown) evaluating against node-journal logs
- Fix Loki ruler config: correct rules mount path (/var/loki/rules/fake),
  add alertmanager_url and enable_api
- Add Prometheus alerts: NodeMemoryPressureTrending (>85%), NodeExporterDown,
  NodeHighIOWait (>30%)
- Add caretta tolerations for control-plane and GPU nodes
- Scale down chromium-based services to 0 for cluster stability:
  f1-stream, flaresolverr, changedetection, resume/printer
2026-03-13 22:20:28 +00:00
OpenClaw
50d539908c feat(monitoring): Disable Loki centralized logging while preserving configuration
DECISION: Disable Loki due to operational overhead vs benefit analysis

EVIDENCE FROM NODE2 INCIDENT:
- Loki was the root cause of major cluster outage (PVC storage exhaustion)
- Centralized logging was unavailable when needed most (Loki was down)
- All debugging was accomplished with simpler tools (kubectl logs, events, describe)
- Prometheus metrics proved more valuable than centralized logs

OPERATIONAL OVERHEAD ELIMINATED:
 50GB iSCSI storage freed up (expensive)
 ~3.5GB memory freed up (Loki + Alloy agents across cluster)
 ~2+ CPU cores freed up for actual workloads
 Reduced complexity - fewer services to maintain and troubleshoot
 Eliminated single point of failure that can cascade cluster-wide

CONFIGURATION PRESERVED:
 All Terraform resources commented out (not deleted)
 loki.yaml preserved with 50GB configuration
 alloy.yaml preserved with log shipping configuration
 Alert rules and Grafana datasource preserved (commented)
 Easy re-enabling: just uncomment resources and apply

ALTERNATIVE DEBUGGING APPROACH:
 kubectl logs (always works, no storage dependency)
 kubectl get events (built-in Kubernetes events)
 Prometheus metrics (more reliable for monitoring)
 Enhanced health check scripts (direct status verification)

RE-ENABLING:
To restore Loki later, uncomment all /* ... */ blocks in loki.tf
and apply via Terraform. All configuration is preserved.

[ci skip] - Infrastructure changes applied locally first due to resource cleanup
2026-03-13 08:41:23 +00:00
OpenClaw
523f0ba7eb fix(monitoring): Expand Loki PVC from 15GB to 50GB to resolve storage exhaustion
ISSUE RESOLVED:
- Root cause: Loki's 15GB iSCSI PVC was completely full
- Symptom: 'no space left on device' errors during TSDB operations
- Impact: Loki service completely down, logging unavailable
- Side effects: Contributed to node2 containerd corruption incident

SOLUTION APPLIED:
- Expanded PVC storage: 15Gi → 50Gi via direct kubectl patch
- Triggered pod restart to complete filesystem resize
- Verified successful expansion and service recovery

CURRENT STATUS:
 PVC: 50Gi capacity (iscsi-truenas storage class)
 Loki StatefulSet: 1/1 ready
 Loki Pod: 2/2 containers running
 Service: Successfully processing log streams
 No storage errors in recent logs

TERRAFORM ALIGNED:
- Updated loki.yaml persistence.size to match actual PVC
- Infrastructure code now reflects deployed state

[ci skip] - Emergency fix applied locally first due to service outage
2026-03-13 08:13:05 +00:00
Viktor Barzin
d8bcdfef2e revert MaxRequestWorkers to 50, exclude nextcloud from 5xx alert
- MaxRequestWorkers 25→50: too few workers caused ALL workers to block
  on SQLite locks, making liveness probes fail even faster (131 restarts
  vs 50 before). 50 is a compromise — enough workers for probes.
- Excluded nextcloud from HighServiceErrorRate alert (chronic SQLite issue)
- MySQL migration attempted but hit: GR error 3100 (fixed with GIPK),
  emoji in calendar/filecache (stripped), SQLite corruption (pre-existing
  from crash-looping). Migration rolled back, Nextcloud restored to SQLite.
2026-03-09 22:05:20 +00:00
Viktor Barzin
eed991a27b exclude nextcloud from HighServiceErrorRate alert
Nextcloud has chronic 5xx errors due to SQLite lock contention causing
Apache worker exhaustion. Excluding from alert until MySQL migration.
2026-03-09 20:26:30 +00:00
Viktor Barzin
ad8b90575e fix noisy JobFailed and duplicate mail server alerts
- JobFailed: only alert on jobs started within the last hour, so stale
  failed CronJob runs don't keep firing after subsequent runs succeed
- Mail server alert: renamed to MailServerDown, now targets the specific
  mailserver deployment instead of all deployments in the namespace
  (was falsely triggering on roundcubemail going down)
- Updated inhibition rule to use new MailServerDown alert name
2026-03-08 21:22:43 +00:00
Viktor Barzin
33c7976630 reduce alert noise: add cascade inhibitions, increase for durations, drop Loki alerts
- NodeDown now suppresses workload and service alerts (PodCrashLooping,
  DeploymentReplicasMismatch, StatefulSetReplicasMismatch, etc.)
- NFSServerUnresponsive suppresses pod-level alerts
- Increased for durations on transient alerts (e.g. 15m→30m for replica mismatches)
- NodeDown for: 1m→3m to avoid flapping
- Removed all 3 Loki log-based alerts (duplicated Prometheus alerts)
- Downgraded HeadscaleDown critical→warning, mail server page→warning
2026-03-08 21:13:16 +00:00
Viktor Barzin
d352d6e7f8 resource quota review: fix OOM risks, close quota gaps, add HA protections
Phase 1 - OOM fixes:
- dashy: increase memory limit 512Mi→1Gi (was at 99% utilization)
- caretta DaemonSet: set explicit resources 300Mi/512Mi (was at 85-98%)
- mysql-operator: add Helm resource values 256Mi/512Mi, create namespace
  with tier label (was at 92% of LimitRange default)
- prowlarr, flaresolverr, annas-archive-stacks: add explicit resources
  (outgrowing 256Mi LimitRange defaults)
- real-estate-crawler celery: add resources 512Mi/3Gi (608Mi actual, no
  explicit resources)

Phase 2 - Close quota gaps:
- nvidia, real-estate-crawler, trading-bot: remove custom-quota=true
  labels so Kyverno generates tier-appropriate quotas
- descheduler: add tier=1-cluster label for proper classification

Phase 3 - Reduce excessive quotas:
- monitoring: limits.memory 240Gi→64Gi, limits.cpu 120→64
- woodpecker: limits.memory 128Gi→32Gi, limits.cpu 64→16
- GPU tier default: limits.memory 96Gi→32Gi, limits.cpu 48→16

Phase 4 - Kubelet protection:
- Add cpu: 200m to systemReserved and kubeReserved in kubelet template

Phase 5 - HA improvements:
- cloudflared: add topology spread (ScheduleAnyway) + PDB (maxUnavailable:1)
- grafana: add topology spread + PDB via Helm values
- crowdsec LAPI: add topology spread + PDB via Helm values
- authentik server: add topology spread via Helm values
- authentik worker: add topology spread + PDB via Helm values
2026-03-08 18:17:46 +00:00
Viktor Barzin
1f1700c4ff [ci skip] fix broken Homepage widgets + add service API tokens to SOPS
- Grafana: fix service URL (grafana not monitoring-grafana)
- Uptime Kuma: remove widget (no status page configured)
- Speedtest/Frigate/Immich: use internal k8s service URLs (external
  goes through Authentik forward auth, blocking API calls)
- pfSense: clean up annotations
- SOPS: add headscale, prowlarr, changedetection, audiobookshelf tokens
2026-03-07 20:39:55 +00:00
Viktor Barzin
6bd3970579 [ci skip] add Homepage gethomepage.dev annotations to all services
Add Kubernetes ingress annotations for Homepage auto-discovery across
~88 services organized into 11 groups. Enable serviceAccount for RBAC,
configure group layouts, and add Grafana/Frigate/Speedtest widgets.
2026-03-07 20:39:54 +00:00
Viktor Barzin
1f2c1ca361 [ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars
Phase 5 — CI pipelines:
- default.yml: add SOPS decrypt in prepare step, change git add . to
  specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure
- renew-tls.yml: change git add . to git add secrets/ state/

Phase 6 — sensitive=true:
- Add sensitive = true to 256 variable declarations across 149 stack files
- Prevents secret values from appearing in terraform plan output
- Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid
  breaking module interface contracts

Note: CI pipeline SOPS decryption requires sops_age_key Woodpecker secret
to be created before the pipeline will work with SOPS. Until then, the old
terraform.tfvars path continues to function.
2026-03-07 14:30:36 +00:00
Viktor Barzin
614d14c47d [ci skip] expand Prometheus PVC to 200Gi, increase retention to 180GB for 1-year history
Storage analysis: ~10.5 GB/month ingestion rate, 1 year = ~125 GB + overhead.
PVC: 30Gi → 200Gi, retention.size: 45GB → 180GB.
Historical TSDB data restored from NFS (39.8 GB total including all blocks).
2026-03-06 23:16:32 +00:00
Viktor Barzin
a52a371e35 [ci skip] expand Prometheus iSCSI PVC to 30Gi for historical data restore 2026-03-06 22:51:38 +00:00
Viktor Barzin
100a876dfe [ci skip] migrate Redis, Prometheus, Loki storage to iSCSI
- Redis: local-path → iscsi-truenas (master + replica persistence)
- Prometheus: NFS PV+PVC → dynamic iSCSI PVC (prometheus-data)
- Loki: NFS PV → dynamic iSCSI via storageClass in Helm values
- Deleted 2 orphaned Released iSCSI PVs (31Gi freed)
2026-03-06 20:50:55 +00:00
Viktor Barzin
a48915ee02 [ci skip] exclude linkwarden from HighService4xxRate alert 2026-03-06 20:15:58 +00:00
Viktor Barzin
87ef313888 [ci skip] fix post-NFS-migration issues: MySQL GR, Loki, grampsweb, alerts
- Loki: reduce memory limit from 6Gi to 4Gi (within LimitRange max)
- Grampsweb: increase memory to 2Gi (was OOMKilled at 512Mi)
- Fix PostgreSQLDown alert: check pod readiness instead of deployment
- Fix MySQLDown alert: check StatefulSet replicas instead of deployment
- Fix RedisDown alert: check StatefulSet replicas instead of deployment
- Fix NFSServerUnresponsive: aggregate all NFS versions cluster-wide
- Fix Uptime Kuma healthcheck: handle nested list heartbeat format
- Update etcd backup image to registry.k8s.io/etcd:3.6.5-0
2026-03-03 21:10:26 +00:00
Viktor Barzin
22223ec0fd [ci skip] fix OOMKill: prometheus (4Gi), kyverno-reports (512Mi), grampsweb (512Mi)
- Prometheus server: explicit 1Gi req / 4Gi limit (was inheriting 512Mi LimitRange default)
- Kyverno reports controller: 128Mi req / 512Mi limit (was 128Mi Helm default)
- Grampsweb: 256Mi req / 512Mi limit for both containers (was 256Mi LimitRange default)
2026-03-02 21:39:14 +00:00
Viktor Barzin
307b356f06 [ci skip] fix: add mount_options to all NFS PVs (soft,timeo=30,retrans=3)
Critical fix: StorageClass mountOptions only apply during dynamic
provisioning. Our static PVs (created by Terraform) were missing
mount_options, so all NFS mounts defaulted to hard,timeo=600 —
the exact stale mount behavior we were trying to eliminate.

Adds mount_options directly to the nfs_volume module PV spec and
to the monitoring PVs (prometheus, loki, alertmanager).

Requires re-applying all stacks to propagate to existing PVs.
2026-03-02 20:23:36 +00:00
Viktor Barzin
0abae33c71 [ci skip] complete NFS CSI migration: complex stacks + platform modules
Migrate remaining multi-volume stacks and all platform modules from
inline NFS volumes to CSI-backed PV/PVC with nfs-truenas StorageClass
(soft,timeo=30,retrans=3 mount options).

Complex stacks: openclaw (4 vols), immich (8 vols), frigate (2 vols),
nextcloud (2 vols + old PV replaced), rybbit (1 vol)

Remaining stacks: affine, ebook2audiobook, f1-stream, osm_routing,
real-estate-crawler

Platform modules: monitoring (prometheus, loki, alertmanager PVs
converted from native NFS to CSI), redis, dbaas, technitium,
headscale, vaultwarden, uptime-kuma, mailserver, infra-maintenance
2026-03-02 01:24:07 +00:00
Viktor Barzin
84f76a15c2 [ci skip] add PoisonFountainDown and ForwardAuthFallbackActive alerts with inhibition 2026-03-01 15:05:57 +00:00
Viktor Barzin
3974827ea9 [ci skip] color only public IPs red in service map, private IPs (10.x, 192.168.x) get light blue 2026-02-28 19:44:16 +00:00
Viktor Barzin
c229f6ccab [ci skip] set network observability dashboard auto-refresh to 1h 2026-02-28 19:32:49 +00:00
Viktor Barzin
11eb256511 [ci skip] fix service map coloring: remove arc system, use color field for namespace-based node colors 2026-02-28 19:25:52 +00:00
Viktor Barzin
79775fa2cc [ci skip] improve network observability dashboard: namespace coloring, layered layout, full-width service map 2026-02-28 19:14:20 +00:00
Viktor Barzin
5e177a8889 [ci skip] combine caretta and goflow2 into unified network observability dashboard 2026-02-28 19:04:53 +00:00
Viktor Barzin
c5c3b092e5 [ci skip] fix caretta helm values and goflow2 transport args 2026-02-28 18:51:02 +00:00
Viktor Barzin
87fc11121d fix: use plain string for cache_from/cache_to and fix caretta helm_release
- cache_from/cache_to must be plain strings, not YAML lists — the
  plugin-docker-buildx treats them as single string values and the
  Woodpecker settings layer was splitting comma-separated list items
  into separate --cache-from flags (type=registry and ref=... separately)
- caretta.tf: replace deprecated set{} blocks with values=[yamlencode()]
  to fix Terraform plan error with newer Helm provider
2026-02-28 18:47:20 +00:00
Viktor Barzin
e8997ec430 [ci skip] add caretta, goflow2, and prometheus scrape targets to monitoring module 2026-02-28 18:30:20 +00:00
Viktor Barzin
bf3404bf6b [ci skip] add goflow2 netflow collector to monitoring module 2026-02-28 18:29:07 +00:00
Viktor Barzin
9d52acd286 [ci skip] add caretta eBPF pod topology to monitoring module 2026-02-28 18:28:09 +00:00
Viktor Barzin
c35bef2fd8 [ci skip] fix cluster health: GPU tolerations, actualbudget nfs_server, AuthentikDown alert
- Add missing nvidia.com/gpu toleration to ollama and yt-highlights deployments
- Add node_selector gpu=true to ollama deployment
- Pass nfs_server variable through to actualbudget factory modules
- Fix AuthentikDown alert to match actual deployment name (goauthentik-server)
2026-02-24 22:55:58 +00:00
Viktor Barzin
4fab38da1f [ci skip] wrongmove dashboard: add per-path latency table, fix layout, sort top offenders
Add "Per-Path Latency Breakdown" table with p50/p95/p99 and request rate
per endpoint. Fix bar gauge position to sit next to timeseries. Add sort
transformation to "Top Offenders (Avg Duration)" panel.
2026-02-24 22:31:41 +00:00
Viktor Barzin
0a1d53b6dd [ci skip] platform: add ndots=2 dns_config to all deployment pod specs
Prevents Terraform from reverting the Kyverno inject-ndots mutation
on every apply. 23 pod specs across 19 platform module files.
2026-02-23 22:43:05 +00:00
Viktor Barzin
a0df23f565 [ci skip] monitoring: increase resource quota limits
Bump limits.cpu 80→120 and limits.memory 160Gi→240Gi to provide
headroom. Previous values were at 87% and 92% utilization.
2026-02-23 22:42:30 +00:00
Viktor Barzin
89a6e08245 [ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability
Phase 1 - Critical Security:
- Netbox: move hardcoded DB/superuser passwords to variables
- MeshCentral: disable public registration, add Authentik auth
- Traefik: disable insecure API dashboard (api.insecure=false)
- Traefik: configure forwarded headers with Cloudflare trusted IPs

Phase 2 - Security Hardening:
- Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.)
- Add Kyverno pod security policies in audit mode (privileged, host
  namespaces, SYS_ADMIN, trusted registries)
- Tighten rate limiting (avg=10, burst=50)
- Add Authentik protection to grampsweb

Phase 3 - Monitoring & Alerting:
- Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale,
  Authentik, Loki)
- Increase Loki retention from 7 to 30 days (720h)
- Add predictive PV filling alert (predict_linear)
- Re-enable Hackmd and Privatebin down alerts

Phase 4 - Reliability:
- Add resource requests/limits to Redis, DBaaS, Technitium, Headscale,
  Vaultwarden, Uptime Kuma
- Increase Alloy DaemonSet memory to 512Mi/1Gi

Phase 6 - Maintainability:
- Extract duplicated tiers locals to terragrunt.hcl generate block
  (removed from 67 stacks)
- Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114
  instances across 63 files)
- Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references
  with variables across ~35 stacks
- Migrate xray raw ingress resources to ingress_factory modules
2026-02-23 22:05:28 +00:00
Viktor Barzin
1b4737c90c Reorder realestate-crawler Grafana dashboard sections
Move API Performance and Per-Endpoint Latency to the top.
Move Scraping Overview, Scraping Activity, and Throttling & Errors
to the bottom. Keeps the most operationally relevant panels visible
first.
2026-02-23 22:03:27 +00:00
Viktor Barzin
5fdd9d7f04 Sync realestate-crawler Grafana dashboard with per-endpoint latency panels 2026-02-23 21:31:01 +00:00
Viktor Barzin
b0aaa7b813 [ci skip] monitoring: enable mailserver-down Prometheus alert
Uncomment the mailserver availability alert so we get paged if
the mail server pod has no available replicas for 5 minutes.
2026-02-23 20:29:33 +00:00
Viktor Barzin
65ca327ed0 Sync realestate-crawler dashboard with navigation & usage metrics panels 2026-02-23 20:28:55 +00:00
Viktor Barzin
cc7f119578 [ci skip] Reduce node config drift: GPU label, OIDC idempotency, node-exporter, rebuild docs
- Add gpu=true label to Terraform (nvidia null_resource alongside taint)
- Improve API server OIDC config to detect value changes, not just flag presence
- Add policy_hash trigger to audit-policy so rule changes auto-reapply
- Enable prometheus-node-exporter sub-chart, delete unused Ansible playbook
- Document full node rebuild procedure in CLAUDE.md
- Save Talos Linux migration evaluation for future reference
2026-02-22 22:59:38 +00:00
Viktor Barzin
e6420c7b36 [ci skip] Move Terraform modules into stack directories
Move all 88 service modules (66 individual + 22 platform) from
modules/kubernetes/<service>/ into their corresponding stack directories:

- Service stacks: stacks/<service>/module/
- Platform stack: stacks/platform/modules/<service>/

This collocates module source code with its Terragrunt definition.
Only shared utility modules remain in modules/kubernetes/:
ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy.

All cross-references to shared modules updated to use correct
relative paths. Verified with terragrunt run --all -- plan:
0 adds, 0 destroys across all 68 stacks.
2026-02-22 14:38:14 +00:00