Upstream v2.4.1 accidentally dropped idrac_power_supply_input_voltage from
the legacy RefreshPowerOld code path during a Huawei OEM support refactor.
Built a patched image that restores the single missing line:
mc.NewPowerSupplyInputVoltage(ch, psu.LineInputVoltage, id)
Image: viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Vault DB engine rotates passwords weekly, but 5 stacks baked the password in
at Terraform plan time, leaving stale credentials until the next apply. Switch
them to runtime secret references (sketch after the list):
- real-estate-crawler: add vault-database ESO, use secret_key_ref in 3 deployments
- nextcloud: switch Helm chart to existingSecret for DB password
- grafana: add vault-database ESO, use envFromSecrets in Helm values
- woodpecker: use extraSecretNamesForEnvFrom, remove plan-time data source chain
- affine: add vault-database ESO, use secret_key_ref in deployment + init container
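A rough sketch of the pattern; the store kind, Vault path, secret names, and env
var are placeholders, not the actual values:

    apiVersion: external-secrets.io/v1beta1
    kind: ExternalSecret
    metadata:
      name: grafana-db                    # placeholder name
    spec:
      refreshInterval: 1h                 # re-read well within the weekly rotation
      secretStoreRef:
        kind: ClusterSecretStore          # kind is an assumption
        name: vault-database
      target:
        name: grafana-db
      data:
        - secretKey: password
          remoteRef:
            key: database/creds/grafana   # placeholder Vault path
            property: password
    ---
    # In the consuming deployment, the password is read at runtime, not plan time:
    env:
      - name: DB_PASSWORD                 # placeholder env var name
        valueFrom:
          secretKeyRef:
            name: grafana-db
            key: password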
- Add container!="" filter to ContainerNearOOM to exclude system-level cadvisor entries (rule sketch after this list)
- Add $value to summaries: ContainerOOMKilled, ClusterMemoryRequestsHigh,
ContainerNearOOM, PVPredictedFull, NFSServerUnresponsive, NewTailscaleClient
- Add fallback field to all Slack receivers for clean push notifications
- Multiply ratio exprs by 100 for readable percentages
- Rename "New Tailscale client" to CamelCase "NewTailscaleClient"
- Add actionable hints to PodUnschedulable, NodeConditionBad, ForwardAuthFallbackActive
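For illustration, the reworked ContainerNearOOM could look roughly like this; the
exact expression and threshold are assumptions, only the container!="" filter, the
*100 scaling, and $value in the summary come from the bullets above:

    - alert: ContainerNearOOM
      expr: |
        100 * max by (namespace, pod, container) (
          container_memory_working_set_bytes{container!=""}
            / (container_spec_memory_limit_bytes{container!=""} > 0)
        ) > 90
      for: 10m
      annotations:
        summary: '{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is at {{ $value | humanize }}% of its memory limit'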
A Kyverno ClusterPolicy reads the dependency.kyverno.io/wait-for annotation
and injects busybox init containers that block until each dependency is
reachable (nc -z); the effect is illustrated below. Annotations added to
18 stacks (24 deployments).
Includes graceful-db-maintenance.sh script for planned DB maintenance
(scales dependents to 0, saves replica counts, restores on startup).
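To illustrate the effect (host and port here are hypothetical), a deployment
annotated like this:

    metadata:
      annotations:
        dependency.kyverno.io/wait-for: "postgres.dbaas:5432"   # hypothetical value

ends up with an injected init container along these lines:

    initContainers:
      - name: wait-for-postgres
        image: busybox
        command: ["sh", "-c", "until nc -z postgres.dbaas 5432; do echo waiting; sleep 2; done"]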
Compacting 5 years of TSDB blocks kept OOM-killing Prometheus at its 3Gi limit
(18 restarts in 8h), causing sustained IO pressure on the PVE host's spinning
disk. Increase the liveness probe initial delay to 300s so WAL replay can
complete before the probe kills the pod.
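A minimal sketch of the probe change on the Prometheus container; everything
except the 300s delay is illustrative:

    livenessProbe:
      httpGet:
        path: /-/healthy
        port: 9090
      initialDelaySeconds: 300   # let WAL replay finish before the first probe
      periodSeconds: 15
      failureThreshold: 6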
After the node2 OOM incident, right-size memory across the cluster by setting
requests=limits based on max_over_time(container_memory_working_set_bytes[7d])
with 1.3x headroom (sizing query sketched after the category list). Eliminates
the ~37Gi overcommit gap.
Categories:
- Safe equalization (50 containers): set req=lim where max7d well within target
- Limit increases (8 containers): raise limits for services spiking above current
- No Prometheus data (12 containers): conservatively set lim=req
- Exception: nextcloud keeps req=256Mi/lim=8Gi due to Apache memory spikes
Also increased dbaas namespace quota from 12Gi to 16Gi to accommodate mysql
4Gi limits across 3 replicas.
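The sizing query, wrapped here as a throwaway recording rule purely for
illustration (rule name and grouping labels are assumptions):

    groups:
      - name: memory-right-sizing
        rules:
          - record: container:memory_target_bytes:max7d
            expr: |
              1.3 * max by (namespace, container) (
                max_over_time(container_memory_working_set_bytes{container!=""}[7d])
              )

Rounded up, the per-container result is what gets written as both request and
limit for the safe-equalization group.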
CrowdSec Helm fix:
- Increase ResourceQuota requests.cpu from 1 to 4 — pods were at 302%
of quota, preventing scheduling during rolling upgrades
- Reduce Helm timeout from 3600s to 600s — 1 hour hang is excessive
- Add wait=true and wait_for_jobs=true for proper readiness checking
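Roughly, the quota change amounts to the following (quota name and namespace are
placeholders):

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: crowdsec
      namespace: crowdsec
    spec:
      hard:
        requests.cpu: "4"    # was "1"; pods needed ~3x the old quota during rolling upgrades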
Prometheus startup guard:
- Add startup guard to 8 rate/increase-based alerts that false-fire
  after Prometheus restarts (rate() needs 2 scrapes to produce a value):
PodCrashLooping, ContainerOOMKilled, CoreDNSErrors,
HighServiceErrorRate, HighService4xxRate, HighServiceLatency,
SSDHighWriteRate, HDDHighWriteRate
- Guard: and on() (time() - process_start_time_seconds) > 900
suppresses alerts for 15m after Prometheus startup
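As one example of how the guard attaches to an alert (the CoreDNS expression and
threshold are assumptions, and a job selector is added to
process_start_time_seconds here so the guard only looks at the Prometheus
process):

    - alert: CoreDNSErrors
      expr: |
        sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m])) > 1
        and on() (time() - process_start_time_seconds{job="prometheus"}) > 900
      for: 10m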
Root cause: sum(rate(node_nfs_requests_total[5m])) == 0 was too fragile:
- rate() returns nothing after Prometheus restarts (needs 2 scrapes)
- Individual nodes show zero NFS rate during scrape gaps or low activity
- The sum() could hit zero during quiet hours + scrape gaps
New expression uses:
- changes() instead of rate() — works with a single scrape
- Per-instance aggregation: count nodes with any NFS counter change
- Threshold < 2 nodes: single-node restarts won't trigger; a real NFS
  outage (all nodes affected) will
- Prometheus startup guard: skip first 15m after restart to avoid
false positives from empty TSDB
- Wider 15m changes() window to smooth out scrape gaps
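Putting those pieces together, the new expression could look roughly like this;
the metric selectors, the or vector(0) fallback, and the Prometheus job label
are assumptions:

    - alert: NFSServerUnresponsive
      expr: |
        (
          count(count by (instance) (changes(node_nfs_requests_total[15m]) > 0))
          or vector(0)
        ) < 2
        and on() (time() - process_start_time_seconds{job="prometheus"}) > 900
      for: 10m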
CPU limits cause CFS throttling even when nodes have idle capacity.
Move to a request-only CPU model: keep CPU requests for scheduling
fairness but remove all CPU limits. Memory limits stay (incompressible).
Changes across 108 files:
- Kyverno LimitRange policy: remove cpu from default/max in all 6 tiers
- Kyverno ResourceQuota policy: remove limits.cpu from all 5 tiers
- Custom ResourceQuotas: remove limits.cpu from 8 namespace quotas
- Custom LimitRanges: remove cpu from default/max (nextcloud, onlyoffice)
- RBAC module: remove cpu_limits variable and quota reference
- Freedify factory: remove cpu_limit variable and limits reference
- 86 deployment files: remove cpu from all limits blocks
- 6 Helm values files: remove cpu under limits sections
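The resulting resources block on a typical deployment (values are illustrative):

    resources:
      requests:
        cpu: 200m        # kept so the scheduler still balances CPU across nodes
        memory: 512Mi
      limits:
        memory: 512Mi    # memory stays limited (incompressible); no cpu limit, so no CFS throttling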
DECISION: Disable Loki; its operational overhead outweighs its benefit
EVIDENCE FROM NODE2 INCIDENT:
- Loki was the root cause of major cluster outage (PVC storage exhaustion)
- Centralized logging was unavailable when needed most (Loki was down)
- All debugging was accomplished with simpler tools (kubectl logs, events, describe)
- Prometheus metrics proved more valuable than centralized logs
OPERATIONAL OVERHEAD ELIMINATED:
✅ 50GB iSCSI storage freed up (expensive)
✅ ~3.5GB memory freed up (Loki + Alloy agents across cluster)
✅ ~2+ CPU cores freed up for actual workloads
✅ Reduced complexity - fewer services to maintain and troubleshoot
✅ Eliminated single point of failure that can cascade cluster-wide
CONFIGURATION PRESERVED:
✅ All Terraform resources commented out (not deleted)
✅ loki.yaml preserved with 50GB configuration
✅ alloy.yaml preserved with log shipping configuration
✅ Alert rules and Grafana datasource preserved (commented)
✅ Easy re-enabling: just uncomment resources and apply
ALTERNATIVE DEBUGGING APPROACH:
✅ kubectl logs (always works, no storage dependency)
✅ kubectl get events (built-in Kubernetes events)
✅ Prometheus metrics (more reliable for monitoring)
✅ Enhanced health check scripts (direct status verification)
RE-ENABLING:
To restore Loki later, uncomment all /* ... */ blocks in loki.tf
and apply via Terraform. All configuration is preserved.
[ci skip] - Infrastructure changes applied locally first due to resource cleanup
ISSUE RESOLVED:
- Root cause: Loki's 15GB iSCSI PVC was completely full
- Symptom: 'no space left on device' errors during TSDB operations
- Impact: Loki service completely down, logging unavailable
- Side effects: Contributed to node2 containerd corruption incident
SOLUTION APPLIED:
- Expanded PVC storage: 15Gi → 50Gi via direct kubectl patch
- Triggered pod restart to complete filesystem resize
- Verified successful expansion and service recovery
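The patch itself just bumps the storage request on the PVC, applied with
kubectl patch using a body roughly like:

    spec:
      resources:
        requests:
          storage: 50Gi

(PVC expansion only works when the storage class has allowVolumeExpansion
enabled.)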
CURRENT STATUS:
✅ PVC: 50Gi capacity (iscsi-truenas storage class)
✅ Loki StatefulSet: 1/1 ready
✅ Loki Pod: 2/2 containers running
✅ Service: Successfully processing log streams
✅ No storage errors in recent logs
TERRAFORM ALIGNED:
- Updated loki.yaml persistence.size to match actual PVC
- Infrastructure code now reflects deployed state
[ci skip] - Emergency fix applied locally first due to service outage
- MaxRequestWorkers 25→50: with too few workers, every worker ended up blocked
  on SQLite locks and liveness probes failed even faster (131 restarts vs 50
  before). 50 is a compromise: enough spare workers to keep answering probes.
- Excluded nextcloud from HighServiceErrorRate alert (chronic SQLite issue)
- MySQL migration attempted but hit: GR error 3100 (fixed with GIPK),
emoji in calendar/filecache (stripped), SQLite corruption (pre-existing
from crash-looping). Migration rolled back, Nextcloud restored to SQLite.
- JobFailed: only alert on jobs started within the last hour, so stale
  failed CronJob runs don't keep firing after subsequent runs succeed
  (sketch after this list)
- Mail server alert: renamed to MailServerDown, now targets the specific
mailserver deployment instead of all deployments in the namespace
(was falsely triggering on roundcubemail going down)
- Updated inhibition rule to use new MailServerDown alert name
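A sketch of the recent-start guard on JobFailed; the exact kube-state-metrics
expression is an assumption, only the one-hour window comes from the bullet
above:

    - alert: JobFailed
      expr: |
        max by (namespace, job_name) (kube_job_status_failed) > 0
        and on (namespace, job_name)
        (time() - max by (namespace, job_name) (kube_job_status_start_time)) < 3600
      for: 5m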
Add Kubernetes ingress annotations for Homepage auto-discovery across
~88 services organized into 11 groups. Enable serviceAccount for RBAC,
configure group layouts, and add Grafana/Frigate/Speedtest widgets.
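The per-ingress annotations follow Homepage's gethomepage.dev/* convention;
the values here are illustrative:

    metadata:
      annotations:
        gethomepage.dev/enabled: "true"
        gethomepage.dev/name: "Grafana"
        gethomepage.dev/group: "Monitoring"
        gethomepage.dev/icon: "grafana.png"
        gethomepage.dev/description: "Dashboards and alerting"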
Phase 5 — CI pipelines:
- default.yml: add SOPS decrypt in prepare step, change git add . to
specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure
- renew-tls.yml: change git add . to git add secrets/ state/
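A sketch of what the added decrypt step in default.yml might look like; the
image, file paths, and env wiring are assumptions, and sops_age_key is the
Woodpecker secret mentioned in the note below:

    steps:
      prepare:
        image: my-registry/terraform-sops:latest   # placeholder; any image with sops and a shell
        environment:
          SOPS_AGE_KEY:
            from_secret: sops_age_key
        commands:
          - sops --decrypt secrets/terraform.tfvars.enc > terraform.tfvars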
Phase 6 — sensitive=true:
- Add sensitive = true to 256 variable declarations across 149 stack files
- Prevents secret values from appearing in terraform plan output
- Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid
breaking module interface contracts
Note: CI pipeline SOPS decryption requires sops_age_key Woodpecker secret
to be created before the pipeline will work with SOPS. Until then, the old
terraform.tfvars path continues to function.
Critical fix: StorageClass mountOptions only apply during dynamic
provisioning. Our static PVs (created by Terraform) were missing
mount_options, so all NFS mounts defaulted to hard,timeo=600 —
the exact stale mount behavior we were trying to eliminate.
Adds mount_options directly to the nfs_volume module PV spec and
to the monitoring PVs (prometheus, loki, alertmanager).
Requires re-applying all stacks to propagate to existing PVs.
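On a static PV the options land directly in spec.mountOptions (the nfs_volume
module sets the equivalent mount_options attribute in Terraform); the specific
options below are placeholders, not the values chosen:

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: example-nfs-pv            # placeholder
    spec:
      accessModes: [ReadWriteMany]
      capacity:
        storage: 10Gi                 # placeholder
      mountOptions:
        - soft                        # placeholder; the point is to override hard,timeo=600
        - timeo=50
        - retrans=3
      nfs:
        server: 10.0.0.10             # placeholder
        path: /mnt/tank/example       # placeholder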
- cache_from/cache_to must be plain strings, not YAML lists: plugin-docker-buildx
  expects a single string value, but when given a list the Woodpecker settings
  layer split the comma-separated value, so type=registry and ref=... arrived as
  separate --cache-from flags (see the sketch after this list)
- caretta.tf: replace deprecated set{} blocks with values=[yamlencode()]
to fix Terraform plan error with newer Helm provider
- Add missing nvidia.com/gpu toleration to ollama and yt-highlights deployments
- Add node_selector gpu=true to ollama deployment
- Pass nfs_server variable through to actualbudget factory modules
- Fix AuthentikDown alert to match actual deployment name (goauthentik-server)
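The working form of the buildx cache settings is a single comma-separated string
per option; registry and ref values here are placeholders:

    steps:
      build:
        image: woodpeckerci/plugin-docker-buildx
        settings:
          repo: registry.example.com/app                                   # placeholder
          cache_from: "type=registry,ref=registry.example.com/app:cache"   # one plain string, not a list
          cache_to: "type=registry,ref=registry.example.com/app:cache,mode=max"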
Add "Per-Path Latency Breakdown" table with p50/p95/p99 and request rate
per endpoint. Fix bar gauge position to sit next to timeseries. Add sort
transformation to "Top Offenders (Avg Duration)" panel.