ISSUE: Automated cron health checks were failing with 'cluster unreachable'
ROOT CAUSE: Cron jobs lack access to kubeconfig (KUBECONFIG env var not set)
SOLUTION: Created setup-monitoring.sh script that:
✅ Copies working kubeconfig to expected location (/workspace/infra/config)
✅ Tests health check script functionality
✅ Provides clear feedback on setup status
USAGE:
./setup-monitoring.sh (run once to enable automated health checks)
REASONING:
- Kubeconfig contains secrets and shouldn't be committed to git
- Health check script logic: KUBECONFIG_PATH="${KUBECONFIG:-$(pwd)/config}"
- Cron jobs run without the KUBECONFIG env var, so the script falls back to /workspace/infra/config
- This script bridges the gap between persistent kubeconfig and cron environment
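A minimal sketch of how the pieces fit together (the health-check script name and cron schedule shown here are assumptions, not the exact committed files):

    # In the health check script: prefer KUBECONFIG, else the repo-local copy
    KUBECONFIG_PATH="${KUBECONFIG:-$(pwd)/config}"
    export KUBECONFIG="$KUBECONFIG_PATH"
    kubectl get nodes >/dev/null 2>&1 || { echo "cluster unreachable"; exit 1; }

    # Cron entry: no KUBECONFIG in the environment, so cd to /workspace/infra
    # first so that $(pwd)/config resolves to the copied kubeconfig
    */15 * * * * cd /workspace/infra && ./health-check.sh >> /var/log/health-check.log 2>&1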
VERIFICATION:
✅ Automated health checks now show realistic results (21 PASS, 4 WARN, 1 FAIL)
✅ No more false 'cluster unreachable' alerts from cron jobs
The script is idempotent and can be run multiple times safely.
The existing probe only checked nvidia-smi + API availability, which passed
even when the detector fell back to CPU. The probe now also checks /api/stats and
restarts the pod if inference latency exceeds 100ms (normal GPU: ~20ms,
CPU fallback: 200ms+).
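Roughly what the new check does (a sketch only; the /api/stats port and field name are assumptions):

    #!/bin/sh
    # livenessProbe exec: exit non-zero when inference has degraded to CPU speeds
    STATS=$(curl -sf http://localhost:8000/api/stats) || exit 1
    MS=$(echo "$STATS" | jq -r '.inference_ms // empty')   # assumed field: avg inference time in ms
    [ -n "$MS" ] || exit 1
    # >100ms implies CPU fallback (GPU is ~20ms); non-zero exit lets kubelet restart the pod
    awk -v ms="$MS" 'BEGIN { if (ms > 100) exit 1; exit 0 }'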
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DECISION: Disable Loki due to operational overhead vs benefit analysis
EVIDENCE FROM NODE2 INCIDENT:
- Loki was the root cause of a major cluster outage (PVC storage exhaustion)
- Centralized logging was unavailable when needed most (Loki was down)
- All debugging was accomplished with simpler tools (kubectl logs, events, describe)
- Prometheus metrics proved more valuable than centralized logs
OPERATIONAL OVERHEAD ELIMINATED:
✅ 50GB of expensive iSCSI storage freed up
✅ ~3.5GB of memory freed up (Loki + Alloy agents across the cluster)
✅ Roughly 2 CPU cores freed up for actual workloads
✅ Reduced complexity - fewer services to maintain and troubleshoot
✅ Eliminated single point of failure that can cascade cluster-wide
CONFIGURATION PRESERVED:
✅ All Terraform resources commented out (not deleted)
✅ loki.yaml preserved with 50GB configuration
✅ alloy.yaml preserved with log shipping configuration
✅ Alert rules and Grafana datasource preserved (commented)
✅ Easy re-enabling: just uncomment resources and apply
ALTERNATIVE DEBUGGING APPROACH:
✅ kubectl logs (always works, no storage dependency)
✅ kubectl get events (built-in Kubernetes events)
✅ Prometheus metrics (more reliable for monitoring)
✅ Enhanced health check scripts (direct status verification)
RE-ENABLING:
To restore Loki later, uncomment all /* ... */ blocks in loki.tf
and apply via Terraform. All configuration is preserved.
[ci skip] - Infrastructure changes applied locally first due to resource cleanup
ISSUE RESOLVED:
- Root cause: Loki's 15GB iSCSI PVC was completely full
- Symptom: 'no space left on device' errors during TSDB operations
- Impact: Loki service completely down, logging unavailable
- Side effects: Contributed to node2 containerd corruption incident
SOLUTION APPLIED:
- Expanded PVC storage: 15Gi → 50Gi via direct kubectl patch
- Triggered pod restart to complete filesystem resize
- Verified successful expansion and service recovery
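The expansion above boils down to the following (namespace and PVC/pod names are illustrative):

    # Grow the claim (the iscsi-truenas StorageClass must allow volume expansion)
    kubectl -n monitoring patch pvc storage-loki-0 --type merge \
      -p '{"spec":{"resources":{"requests":{"storage":"50Gi"}}}}'
    # Restart the pod so the filesystem resize completes on remount
    kubectl -n monitoring delete pod loki-0
    kubectl -n monitoring get pvc storage-loki-0   # capacity should now report 50Gi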
CURRENT STATUS:
✅ PVC: 50Gi capacity (iscsi-truenas storage class)
✅ Loki StatefulSet: 1/1 ready
✅ Loki Pod: 2/2 containers running
✅ Service: Successfully processing log streams
✅ No storage errors in recent logs
TERRAFORM ALIGNED:
- Updated loki.yaml persistence.size to match actual PVC
- Infrastructure code now reflects deployed state
[ci skip] - Emergency fix applied locally first due to service outage
SECTIONS ADDED:
- Section 25: Advanced CPU Monitoring (Prometheus node_exporter metrics)
- Section 26: Power Monitoring (DCGM GPU power + host power)
FEATURES:
- 5-minute CPU usage averages (more accurate than kubectl top)
- Tesla T4 GPU power consumption monitoring
- CPU thresholds: 70% warn, 85% critical
- GPU power thresholds: 50W active, 65W high
- Maps IP addresses to friendly node names
- Integrates with existing health check infrastructure
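The underlying queries look roughly like this (the Prometheus URL is an assumption; the metric names are the standard node_exporter / dcgm-exporter ones):

    PROM=http://prometheus.monitoring.svc:9090
    # 5-minute CPU usage per node
    curl -sG "$PROM/api/v1/query" --data-urlencode \
      'query=100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))'
    # Tesla T4 power draw in watts
    curl -sG "$PROM/api/v1/query" --data-urlencode 'query=DCGM_FI_DEV_POWER_USAGE'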
CURRENT STATUS:
- All nodes have healthy disk usage (~10%)
- k8s-node4 flagged at 87% CPU (explains resource pressure)
- GPU operating normally at 30.9W
- Enhanced monitoring helps prevent incidents like the node2 containerd corruption
Total health check sections: 26 (was 24)
Addresses node2 incident prevention requirements
- Restored clean SQLite database from pre-migration backup
- Fixed severe database corruption (rowid ordering, page corruption, index issues)
- Applied conservative MaxRequestWorkers=15 for SQLite stability
- Database integrity now perfect, all health checks passing
- Ready for future MySQL migration with clean data
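Integrity was confirmed with SQLite's built-in checks, along these lines (the default Nextcloud data path is assumed):

    sqlite3 /var/www/html/data/owncloud.db 'PRAGMA integrity_check;'   # expect: ok
    sqlite3 /var/www/html/data/owncloud.db 'PRAGMA foreign_key_check;' # expect: no output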
[ci skip]
- Increase MaxRequestWorkers from 10 to 25 for 4 CPU + 3Gi memory container
- Update Apache tuning for Redis + SQLite backend (not pure SQLite)
- Resolves CrashLoopBackOff caused by health probe timeouts
- Allows handling concurrent users without MaxRequestWorkers limit errors
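The tuning amounts to an mpm_prefork override along these lines (only MaxRequestWorkers=25 is fixed by this change; the other values are illustrative):

    <IfModule mpm_prefork_module>
        StartServers             3
        MinSpareServers          3
        MaxSpareServers          6
        MaxRequestWorkers       25
        MaxConnectionsPerChild 500
    </IfModule>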
[ci skip]
The check was querying Technitium DNS directly at 10.0.20.101:53, which
refuses connections from non-cluster hosts. Use the system resolver
(no @server flag) so it works from any host or pod environment.
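For example (hostname is illustrative):

    # Before: direct query to Technitium, refused from non-cluster hosts
    dig +short @10.0.20.101 grafana.internal.example
    # After: system resolver, works from any host or pod
    dig +short grafana.internal.example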
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MySQL (3 OOM kills):
- Cap group_replication_message_cache_size to 128MB (default 1GB caused OOM)
- Reduce innodb_log_buffer_size from 64MB to 16MB
- Lower max_connections from 151 to 80 (peak usage ~40)
- Increase memory limit from 3Gi to 4Gi for headroom
Nextcloud (30+ apache2 OOM kills per incident):
- Reduce MaxRequestWorkers from 50 to 10 to prevent fork bomb
when SQLite locks cause request pileup
- Lower StartServers/MinSpare/MaxSpare proportionally
Uptime Kuma (Node.js memory leak):
- Increase memory limit from 256Mi to 512Mi
- Increase CPU limit from 200m to 500m
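The MySQL items above map to my.cnf-style overrides like these (values in bytes; how they are injected, e.g. via the operator's InnoDBCluster spec, is not shown here and may differ):

    [mysqld]
    group_replication_message_cache_size = 134217728   # 128MB (default 1GB)
    innodb_log_buffer_size               = 16777216    # 16MB (was 64MB)
    max_connections                      = 80          # peak observed usage ~40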
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The GPU operator needs roughly 19 CPU of limits across 16 pods (NFD, device plugin,
driver, validators, exporters). The Kyverno auto-generated quota of 16 CPU
was insufficient, blocking the NFD worker and GC pods from scheduling.
- Add custom-quota label to nvidia namespace to exempt from Kyverno generation
- Add explicit ResourceQuota with limits.cpu=32, limits.memory=48Gi
- Fix: the nvidia namespace tier label was missing after a CI re-apply, causing
Kyverno to use the fallback LimitRange instead of the tier-2-gpu-specific one
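Applied roughly as follows (the label value and quota object name are assumptions; the numbers are the ones above):

    kubectl label namespace nvidia custom-quota=true --overwrite

    # nvidia-resourcequota.yaml (kubectl apply -f nvidia-resourcequota.yaml)
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: nvidia-custom-quota
      namespace: nvidia
    spec:
      hard:
        limits.cpu: "32"
        limits.memory: 48Gi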
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SQLite on NFS caused persistent 500 errors on WebDAV PROPFIND due to
missing fsync guarantees and database locking under concurrent access.
iSCSI (ext4) provides proper fsync and block-level I/O.
- Replace nfs_volume module with iscsi-truenas PVC (20Gi)
- Update Helm chart to use nextcloud-data-iscsi claim
- Excluded 12.5GB nextcloud.log and corrupted DB from migration
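The replacement claim looks roughly like this (namespace is an assumption):

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: nextcloud-data-iscsi
      namespace: nextcloud
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: iscsi-truenas
      resources:
        requests:
          storage: 20Gi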
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Removed from git tracking and added to .gitignore.
File stays on disk locally for reference.
config.tfvars + secrets.auto.tfvars.json are the active var sources.
[ci skip]
- Pin actualbudget/actual-server from edge to 26.3.0 (all 3 instances) to
prevent recurring migration breakage from rolling nightly builds
- Add podAntiAffinity to MySQL InnoDB Cluster to spread replicas across nodes,
relieving memory pressure on k8s-node4
- Scale grampsweb to 0 replicas (unused, consuming 1.7Gi memory)
- Add the GPU toleration Kyverno policy to Terraform using patchesJson6902 instead
of patchStrategicMerge, fixing the tolerations array being overwritten (which left
the caretta DaemonSet pod unable to schedule on k8s-master)
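The json6902 variant appends to the tolerations array instead of replacing it; a simplified sketch (the real policy scopes its match to GPU workloads, and the toleration key shown is an assumption):

    apiVersion: kyverno.io/v1
    kind: ClusterPolicy
    metadata:
      name: add-gpu-toleration
    spec:
      rules:
        - name: append-gpu-toleration
          match:
            any:
              - resources:
                  kinds: ["Pod"]
          mutate:
            patchesJson6902: |-
              - op: add
                path: "/spec/tolerations/-"
                value:
                  key: "nvidia.com/gpu"
                  operator: "Exists"
                  effect: "NoSchedule"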
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- MaxRequestWorkers 25→50: with too few workers, ALL of them blocked on SQLite
locks, making liveness probes fail even faster (131 restarts vs 50 before).
50 is a compromise that leaves enough free workers to keep serving the probes.
- Excluded nextcloud from HighServiceErrorRate alert (chronic SQLite issue)
- MySQL migration was attempted but hit: GR error 3100 (fixed with GIPK),
emoji in calendar/filecache tables (stripped), and pre-existing SQLite corruption
(from the crash-looping). The migration was rolled back and Nextcloud restored to SQLite.
The Nextcloud Helm chart expects extraVolumes/extraVolumeMounts nested
under the nextcloud: key. Also mount to mods-available/ (the actual file)
not mods-enabled/ (which is a symlink).
Verified: MaxRequestWorkers 150→25, workers dropped from 49 to 6.
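In values form (the ConfigMap name is an assumption):

    nextcloud:
      extraVolumes:
        - name: apache-mpm
          configMap:
            name: nextcloud-apache-mpm
      extraVolumeMounts:
        - name: apache-mpm
          # mods-available/ holds the real file; mods-enabled/ is only a symlink to it
          mountPath: /etc/apache2/mods-available/mpm_prefork.conf
          subPath: mpm_prefork.conf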
- JobFailed: only alert on jobs started within the last hour, so stale
failed CronJob runs don't keep firing after subsequent runs succeed
- Mail server alert: renamed to MailServerDown, now targets the specific
mailserver deployment instead of all deployments in the namespace
(was falsely triggering on roundcubemail going down)
- Updated inhibition rule to use new MailServerDown alert name
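The JobFailed expression now looks roughly like this (a sketch assuming kube-state-metrics job metrics):

    kube_job_status_failed > 0
    and on (namespace, job_name)
    ((time() - kube_job_status_start_time) < 3600)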
Adds Sealed Secrets (Bitnami) to the platform stack so cluster users can
encrypt secrets with a public key and commit SealedSecret YAMLs to git.
The in-cluster controller decrypts them into regular K8s Secrets.
- New module: sealed-secrets (namespace + Helm chart v2.18.3, cluster tier)
- k8s-portal setup script: adds kubeseal CLI install for Linux and Mac
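Typical flow once the controller is up (secret name and values are illustrative; the controller namespace is assumed to match the module name):

    kubectl create secret generic db-credentials \
      --from-literal=password=changeme --dry-run=client -o yaml \
      | kubeseal --controller-namespace sealed-secrets --format yaml \
      > db-credentials-sealed.yaml
    # Safe to commit; applying it produces a regular Secret in-cluster
    kubectl apply -f db-credentials-sealed.yaml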
Previously manualStartOnBoot=true and exitStateAction=ABORT_SERVER meant
any ungraceful shutdown required manual rebootClusterFromCompleteOutage().
New settings:
- group_replication_start_on_boot=ON: auto-start GR after crash
- autorejoin_tries=2016: keep retrying to rejoin (one attempt every 5 minutes, roughly a week of retries)
- exit_state_action=OFFLINE_MODE: stay alive on expulsion (don't abort)
- member_expel_timeout=30s: tolerate brief unresponsiveness
- unreachable_majority_timeout=60s: leave group cleanly if majority lost
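As full server variables (how they are injected, e.g. operator spec vs my.cnf, may differ):

    [mysqld]
    group_replication_start_on_boot                = ON
    group_replication_autorejoin_tries             = 2016
    group_replication_exit_state_action            = OFFLINE_MODE
    group_replication_member_expel_timeout         = 30
    group_replication_unreachable_majority_timeout = 60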
LimitRange defaults had a 4-8x limit/request ratio causing the scheduler
to overpack nodes. When pods burst, nodes OOM-thrashed and became
unresponsive (k8s-node3 and k8s-node4 both went down today).
Changes:
- Increase default memory requests across all tiers (ratio now 2x):
- core/cluster: 64Mi → 256Mi request (512Mi limit)
- gpu: 256Mi → 1Gi request (2Gi limit)
- edge/aux/fallback: 64Mi → 128Mi request (256Mi limit)
- Add kubelet memory reservation and eviction thresholds:
- systemReserved: 512Mi, kubeReserved: 512Mi
- evictionHard: 500Mi (was 100Mi), evictionSoft: 1Gi (was unset)
- Applied to all nodes and future node template
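Sketch of the core/cluster tier after the change, plus the matching kubelet fragment (the object name and the soft-eviction grace period are assumptions):

    apiVersion: v1
    kind: LimitRange
    metadata:
      name: tier-core-defaults
    spec:
      limits:
        - type: Container
          defaultRequest:
            memory: 256Mi   # was 64Mi; limit/request ratio now 2x
          default:
            memory: 512Mi

    # KubeletConfiguration fragment
    systemReserved:
      memory: 512Mi
    kubeReserved:
      memory: 512Mi
    evictionHard:
      memory.available: "500Mi"
    evictionSoft:
      memory.available: "1Gi"
    evictionSoftGracePeriod:
      memory.available: "90s"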
Use correct dashboard-icons names where available (changedetection,
gramps-web), Material Design Icons for custom apps (city-guesser,
plotting-book, resume, tuya-bridge, trading-bot, poison-fountain),
and Simple Icons for F1 Stream.
Stale-while-revalidate cache in front of Homepage reduces first-paint
latency by serving cached /api/ responses instantly while refreshing
upstream in the background. Non-API paths pass through uncached.
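Assuming an nginx-style front proxy (upstream name, port, and timings here are illustrative), the behaviour maps to:

    proxy_cache_path /var/cache/nginx keys_zone=homepage:10m max_size=64m;

    server {
        listen 8080;

        location /api/ {
            proxy_pass http://homepage:3000;
            proxy_cache homepage;
            proxy_cache_valid 200 30s;
            # serve the cached copy immediately, refresh upstream in the background
            proxy_cache_use_stale updating error timeout;
            proxy_cache_background_update on;
        }

        location / {
            proxy_pass http://homepage:3000;   # non-API paths bypass the cache
        }
    }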
Deploy GOOGLE_CLIENT_ID, GOOGLE_CLIENT_SECRET, and GOOGLE_CALLBACK_URL
to the plotting-book container. Update CSP to allow accounts.google.com
for connect-src and form-action directives.
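i.e. the relevant directives gain accounts.google.com (existing sources abridged to 'self' here):

    Content-Security-Policy: connect-src 'self' https://accounts.google.com; form-action 'self' https://accounts.google.com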
Prevents stale Redis connections from silently breaking file uploads.
The old Node.js Redis client doesn't auto-reconnect after Redis restarts,
causing all files to appear expired.
- qBittorrent: use service port 80 (not container port 8080)
- Immich: add version=2 for new API endpoints (/api/server/*)
- Nextcloud: use external URL (the internal one rejects untrusted Host headers)
- HA London: remove widget (token expired, needs manual regeneration)
- Headscale: remove widget (requires nodeId param, not overview)
Services expose port 80 via ClusterIP, but the widgets were using container
target ports (5000, 3001, 4533, 3000). Calibre was using the external URL
through Authentik. All widgets now use the correct internal service URLs.
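For example, the qBittorrent widget entry now points at the service port (namespace and credential placeholders are assumptions):

    - qBittorrent:
        icon: qbittorrent
        widget:
          type: qbittorrent
          url: http://qbittorrent.media.svc.cluster.local:80   # service port, not container port 8080
          username: "{{HOMEPAGE_VAR_QBIT_USERNAME}}"
          password: "{{HOMEPAGE_VAR_QBIT_PASSWORD}}"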