Upstream v2.4.1 accidentally dropped idrac_power_supply_input_voltage from
the legacy RefreshPowerOld code path during a Huawei OEM support refactor.
Built a patched image that restores the single missing line:
mc.NewPowerSupplyInputVoltage(ch, psu.LineInputVoltage, id)
Image: viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Vault DB engine rotates passwords weekly, but 5 stacks baked the password in
at Terraform plan time, leaving stale credentials until the next apply. Switch
them to runtime secret references (sketch after the list):
- real-estate-crawler: add vault-database ESO, use secret_key_ref in 3 deployments
- nextcloud: switch Helm chart to existingSecret for DB password
- grafana: add vault-database ESO, use envFromSecrets in Helm values
- woodpecker: use extraSecretNamesForEnvFrom, remove plan-time data source chain
- affine: add vault-database ESO, use secret_key_ref in deployment + init container
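A rough sketch of the pattern; the store kind, Vault path, secret names, and env
var are placeholders, not the actual values:

    apiVersion: external-secrets.io/v1beta1
    kind: ExternalSecret
    metadata:
      name: grafana-db                    # placeholder name
    spec:
      refreshInterval: 1h                 # re-read well within the weekly rotation
      secretStoreRef:
        kind: ClusterSecretStore          # kind is an assumption
        name: vault-database
      target:
        name: grafana-db
      data:
        - secretKey: password
          remoteRef:
            key: database/creds/grafana   # placeholder Vault path
            property: password
    ---
    # In the consuming deployment, the password is read at runtime, not plan time:
    env:
      - name: DB_PASSWORD                 # placeholder env var name
        valueFrom:
          secretKeyRef:
            name: grafana-db
            key: password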
- Add container!="" filter to ContainerNearOOM to exclude system-level cadvisor entries (rule sketch after this list)
- Add $value to summaries: ContainerOOMKilled, ClusterMemoryRequestsHigh,
ContainerNearOOM, PVPredictedFull, NFSServerUnresponsive, NewTailscaleClient
- Add fallback field to all Slack receivers for clean push notifications
- Multiply ratio exprs by 100 for readable percentages
- Rename "New Tailscale client" to CamelCase "NewTailscaleClient"
- Add actionable hints to PodUnschedulable, NodeConditionBad, ForwardAuthFallbackActive
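For illustration, the reworked ContainerNearOOM could look roughly like this; the
exact expression and threshold are assumptions, only the container!="" filter, the
*100 scaling, and $value in the summary come from the bullets above:

    - alert: ContainerNearOOM
      expr: |
        100 * max by (namespace, pod, container) (
          container_memory_working_set_bytes{container!=""}
            / (container_spec_memory_limit_bytes{container!=""} > 0)
        ) > 90
      for: 10m
      annotations:
        summary: '{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is at {{ $value | humanize }}% of its memory limit'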
A Kyverno ClusterPolicy reads the dependency.kyverno.io/wait-for annotation
and injects busybox init containers that block until each dependency is
reachable (nc -z); the effect is illustrated below. Annotations added to
18 stacks (24 deployments).
Includes graceful-db-maintenance.sh script for planned DB maintenance
(scales dependents to 0, saves replica counts, restores on startup).
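To illustrate the effect (host and port here are hypothetical), a deployment
annotated like this:

    metadata:
      annotations:
        dependency.kyverno.io/wait-for: "postgres.dbaas:5432"   # hypothetical value

ends up with an injected init container along these lines:

    initContainers:
      - name: wait-for-postgres
        image: busybox
        command: ["sh", "-c", "until nc -z postgres.dbaas 5432; do echo waiting; sleep 2; done"]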
Compacting 5 years of TSDB blocks kept OOM-killing Prometheus at its 3Gi limit
(18 restarts in 8h), causing sustained IO pressure on the PVE host's spinning
disk. Increase the liveness probe initial delay to 300s so WAL replay can
complete before the probe kills the pod.
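A minimal sketch of the probe change on the Prometheus container; everything
except the 300s delay is illustrative:

    livenessProbe:
      httpGet:
        path: /-/healthy
        port: 9090
      initialDelaySeconds: 300   # let WAL replay finish before the first probe
      periodSeconds: 15
      failureThreshold: 6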
After the node2 OOM incident, right-size memory across the cluster by setting
requests=limits based on max_over_time(container_memory_working_set_bytes[7d])
with 1.3x headroom (sizing query sketched after the category list). Eliminates
the ~37Gi overcommit gap.
Categories:
- Safe equalization (50 containers): set req=lim where max7d well within target
- Limit increases (8 containers): raise limits for services spiking above current
- No Prometheus data (12 containers): conservatively set lim=req
- Exception: nextcloud keeps req=256Mi/lim=8Gi due to Apache memory spikes
Also increased dbaas namespace quota from 12Gi to 16Gi to accommodate mysql
4Gi limits across 3 replicas.
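The sizing query, wrapped here as a throwaway recording rule purely for
illustration (rule name and grouping labels are assumptions):

    groups:
      - name: memory-right-sizing
        rules:
          - record: container:memory_target_bytes:max7d
            expr: |
              1.3 * max by (namespace, container) (
                max_over_time(container_memory_working_set_bytes{container!=""}[7d])
              )

Rounded up, the per-container result is what gets written as both request and
limit for the safe-equalization group.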
CrowdSec Helm fix:
- Increase ResourceQuota requests.cpu from 1 to 4 — pods were at 302%
of quota, preventing scheduling during rolling upgrades
- Reduce Helm timeout from 3600s to 600s — 1 hour hang is excessive
- Add wait=true and wait_for_jobs=true for proper readiness checking
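Roughly, the quota change amounts to the following (quota name and namespace are
placeholders):

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: crowdsec
      namespace: crowdsec
    spec:
      hard:
        requests.cpu: "4"    # was "1"; pods needed ~3x the old quota during rolling upgrades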
Prometheus startup guard:
- Add startup guard to 8 rate/increase-based alerts that false-fire
  after Prometheus restarts (rate() needs 2 scrapes to produce a value):
PodCrashLooping, ContainerOOMKilled, CoreDNSErrors,
HighServiceErrorRate, HighService4xxRate, HighServiceLatency,
SSDHighWriteRate, HDDHighWriteRate
- Guard: and on() (time() - process_start_time_seconds) > 900
suppresses alerts for 15m after Prometheus startup
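As one example of how the guard attaches to an alert (the CoreDNS expression and
threshold are assumptions, and a job selector is added to
process_start_time_seconds here so the guard only looks at the Prometheus
process):

    - alert: CoreDNSErrors
      expr: |
        sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m])) > 1
        and on() (time() - process_start_time_seconds{job="prometheus"}) > 900
      for: 10m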
Root cause: sum(rate(node_nfs_requests_total[5m])) == 0 was too fragile:
- rate() returns nothing after Prometheus restarts (needs 2 scrapes)
- Individual nodes show zero NFS rate during scrape gaps or low activity
- The sum() could hit zero during quiet hours + scrape gaps
New expression uses:
- changes() instead of rate() — works with a single scrape
- Per-instance aggregation: count nodes with any NFS counter change
- Threshold < 2 nodes: single-node restarts won't trigger; a real NFS
  outage (all nodes affected) will
- Prometheus startup guard: skip first 15m after restart to avoid
false positives from empty TSDB
- Wider 15m changes() window to smooth out scrape gaps
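Putting those pieces together, the new expression could look roughly like this;
the metric selectors, the or vector(0) fallback, and the Prometheus job label
are assumptions:

    - alert: NFSServerUnresponsive
      expr: |
        (
          count(count by (instance) (changes(node_nfs_requests_total[15m]) > 0))
          or vector(0)
        ) < 2
        and on() (time() - process_start_time_seconds{job="prometheus"}) > 900
      for: 10m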
CPU limits cause CFS throttling even when nodes have idle capacity.
Move to a request-only CPU model: keep CPU requests for scheduling
fairness but remove all CPU limits. Memory limits stay (incompressible).
Changes across 108 files:
- Kyverno LimitRange policy: remove cpu from default/max in all 6 tiers
- Kyverno ResourceQuota policy: remove limits.cpu from all 5 tiers
- Custom ResourceQuotas: remove limits.cpu from 8 namespace quotas
- Custom LimitRanges: remove cpu from default/max (nextcloud, onlyoffice)
- RBAC module: remove cpu_limits variable and quota reference
- Freedify factory: remove cpu_limit variable and limits reference
- 86 deployment files: remove cpu from all limits blocks
- 6 Helm values files: remove cpu under limits sections
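The resulting resources block on a typical deployment (values are illustrative):

    resources:
      requests:
        cpu: 200m        # kept so the scheduler still balances CPU across nodes
        memory: 512Mi
      limits:
        memory: 512Mi    # memory stays limited (incompressible); no cpu limit, so no CFS throttling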
DECISION: Disable Loki; its operational overhead outweighs its benefit
EVIDENCE FROM NODE2 INCIDENT:
- Loki was the root cause of major cluster outage (PVC storage exhaustion)
- Centralized logging was unavailable when needed most (Loki was down)
- All debugging was accomplished with simpler tools (kubectl logs, events, describe)
- Prometheus metrics proved more valuable than centralized logs
OPERATIONAL OVERHEAD ELIMINATED:
✅ 50GB iSCSI storage freed up (expensive)
✅ ~3.5GB memory freed up (Loki + Alloy agents across cluster)
✅ ~2+ CPU cores freed up for actual workloads
✅ Reduced complexity - fewer services to maintain and troubleshoot
✅ Eliminated single point of failure that can cascade cluster-wide
CONFIGURATION PRESERVED:
✅ All Terraform resources commented out (not deleted)
✅ loki.yaml preserved with 50GB configuration
✅ alloy.yaml preserved with log shipping configuration
✅ Alert rules and Grafana datasource preserved (commented)
✅ Easy re-enabling: just uncomment resources and apply
ALTERNATIVE DEBUGGING APPROACH:
✅ kubectl logs (always works, no storage dependency)
✅ kubectl get events (built-in Kubernetes events)
✅ Prometheus metrics (more reliable for monitoring)
✅ Enhanced health check scripts (direct status verification)
RE-ENABLING:
To restore Loki later, uncomment all /* ... */ blocks in loki.tf
and apply via Terraform. All configuration is preserved.
[ci skip] - Infrastructure changes applied locally first due to resource cleanup
ISSUE RESOLVED:
- Root cause: Loki's 15GB iSCSI PVC was completely full
- Symptom: 'no space left on device' errors during TSDB operations
- Impact: Loki service completely down, logging unavailable
- Side effects: Contributed to node2 containerd corruption incident
SOLUTION APPLIED:
- Expanded PVC storage: 15Gi → 50Gi via direct kubectl patch
- Triggered pod restart to complete filesystem resize
- Verified successful expansion and service recovery
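The patch itself just bumps the storage request on the PVC, applied with
kubectl patch using a body roughly like:

    spec:
      resources:
        requests:
          storage: 50Gi

(PVC expansion only works when the storage class has allowVolumeExpansion
enabled.)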
CURRENT STATUS:
✅ PVC: 50Gi capacity (iscsi-truenas storage class)
✅ Loki StatefulSet: 1/1 ready
✅ Loki Pod: 2/2 containers running
✅ Service: Successfully processing log streams
✅ No storage errors in recent logs
TERRAFORM ALIGNED:
- Updated loki.yaml persistence.size to match actual PVC
- Infrastructure code now reflects deployed state
[ci skip] - Emergency fix applied locally first due to service outage
- MaxRequestWorkers 25→50: with too few workers, every worker ended up blocked
  on SQLite locks and liveness probes failed even faster (131 restarts vs 50
  before). 50 is a compromise: enough spare workers to keep answering probes.
- Excluded nextcloud from HighServiceErrorRate alert (chronic SQLite issue)
- MySQL migration attempted but hit: GR error 3100 (fixed with GIPK),
emoji in calendar/filecache (stripped), SQLite corruption (pre-existing
from crash-looping). Migration rolled back, Nextcloud restored to SQLite.
- JobFailed: only alert on jobs started within the last hour, so stale
  failed CronJob runs don't keep firing after subsequent runs succeed
  (sketch after this list)
- Mail server alert: renamed to MailServerDown, now targets the specific
mailserver deployment instead of all deployments in the namespace
(was falsely triggering on roundcubemail going down)
- Updated inhibition rule to use new MailServerDown alert name
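A sketch of the recent-start guard on JobFailed; the exact kube-state-metrics
expression is an assumption, only the one-hour window comes from the bullet
above:

    - alert: JobFailed
      expr: |
        max by (namespace, job_name) (kube_job_status_failed) > 0
        and on (namespace, job_name)
        (time() - max by (namespace, job_name) (kube_job_status_start_time)) < 3600
      for: 5m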
Add Kubernetes ingress annotations for Homepage auto-discovery across
~88 services organized into 11 groups. Enable serviceAccount for RBAC,
configure group layouts, and add Grafana/Frigate/Speedtest widgets.
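The per-ingress annotations follow Homepage's gethomepage.dev/* convention;
the values here are illustrative:

    metadata:
      annotations:
        gethomepage.dev/enabled: "true"
        gethomepage.dev/name: "Grafana"
        gethomepage.dev/group: "Monitoring"
        gethomepage.dev/icon: "grafana.png"
        gethomepage.dev/description: "Dashboards and alerting"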
Phase 5 — CI pipelines:
- default.yml: add SOPS decrypt in prepare step, change git add . to
specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure
- renew-tls.yml: change git add . to git add secrets/ state/
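A sketch of what the added decrypt step in default.yml might look like; the
image, file paths, and env wiring are assumptions, and sops_age_key is the
Woodpecker secret mentioned in the note below:

    steps:
      prepare:
        image: my-registry/terraform-sops:latest   # placeholder; any image with sops and a shell
        environment:
          SOPS_AGE_KEY:
            from_secret: sops_age_key
        commands:
          - sops --decrypt secrets/terraform.tfvars.enc > terraform.tfvars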
Phase 6 — sensitive=true:
- Add sensitive = true to 256 variable declarations across 149 stack files
- Prevents secret values from appearing in terraform plan output
- Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid
breaking module interface contracts
Note: CI pipeline SOPS decryption requires sops_age_key Woodpecker secret
to be created before the pipeline will work with SOPS. Until then, the old
terraform.tfvars path continues to function.
Critical fix: StorageClass mountOptions only apply during dynamic
provisioning. Our static PVs (created by Terraform) were missing
mount_options, so all NFS mounts defaulted to hard,timeo=600 —
the exact stale mount behavior we were trying to eliminate.
Adds mount_options directly to the nfs_volume module PV spec and
to the monitoring PVs (prometheus, loki, alertmanager).
Requires re-applying all stacks to propagate to existing PVs.
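On a static PV the options land directly in spec.mountOptions (the nfs_volume
module sets the equivalent mount_options attribute in Terraform); the specific
options below are placeholders, not the values chosen:

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: example-nfs-pv            # placeholder
    spec:
      accessModes: [ReadWriteMany]
      capacity:
        storage: 10Gi                 # placeholder
      mountOptions:
        - soft                        # placeholder; the point is to override hard,timeo=600
        - timeo=50
        - retrans=3
      nfs:
        server: 10.0.0.10             # placeholder
        path: /mnt/tank/example       # placeholder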
- cache_from/cache_to must be plain strings, not YAML lists: plugin-docker-buildx
  expects a single string value, but when given a list the Woodpecker settings
  layer split the comma-separated value, so type=registry and ref=... arrived as
  separate --cache-from flags (see the sketch after this list)
- caretta.tf: replace deprecated set{} blocks with values=[yamlencode()]
to fix Terraform plan error with newer Helm provider
- Add missing nvidia.com/gpu toleration to ollama and yt-highlights deployments
- Add node_selector gpu=true to ollama deployment
- Pass nfs_server variable through to actualbudget factory modules
- Fix AuthentikDown alert to match actual deployment name (goauthentik-server)
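The working form of the buildx cache settings is a single comma-separated string
per option; registry and ref values here are placeholders:

    steps:
      build:
        image: woodpeckerci/plugin-docker-buildx
        settings:
          repo: registry.example.com/app                                   # placeholder
          cache_from: "type=registry,ref=registry.example.com/app:cache"   # one plain string, not a list
          cache_to: "type=registry,ref=registry.example.com/app:cache,mode=max"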
Add "Per-Path Latency Breakdown" table with p50/p95/p99 and request rate
per endpoint. Fix bar gauge position to sit next to timeseries. Add sort
transformation to "Top Offenders (Avg Duration)" panel.