Commit graph

1629 commits

Viktor Barzin
f7c2c06009 right-size memory: set requests=limits based on actual usage
- Set memory requests = limits across 56 stacks to prevent overcommit
- Right-sized limits based on actual pod usage (2x actual, rounded up)
- Scaled down trading-bot (replicas=0) to free memory
- Fixed OOMKilled services: forgejo, dawarich, health, meshcentral,
  paperless-ngx, vault auto-unseal, rybbit, whisper, openclaw, clickhouse
- Added startup+liveness probes to calibre-web
- Bumped inotify limits on nodes 2,3 (max_user_instances 128->8192)
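The persistent form of that inotify bump would be a sysctl drop-in along these lines (the file path is illustrative; the commit doesn't say how the limit was applied on the nodes):

```
# /etc/sysctl.d/99-inotify.conf (hypothetical path) on nodes 2 and 3
fs.inotify.max_user_instances = 8192
```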

Post node2 OOM incident (2026-03-14). Previous kubelet config had no
kubeReserved/systemReserved set, allowing pods to starve the kernel.
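For reference, a KubeletConfiguration with those reservations set might look like the fragment below (values are illustrative, not the ones applied here):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Illustrative reservations: carve capacity out for the kubelet/runtime
# and the OS so pods cannot consume every byte on the node.
kubeReserved:
  cpu: 500m
  memory: 1Gi
systemReserved:
  cpu: 500m
  memory: 1Gi
evictionHard:
  memory.available: 500Mi
```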
2026-03-14 21:01:24 +00:00
Viktor Barzin
2c296d4d7c add novelapp deployment [ci skip]
Deploy NovelApp (web novel reading tracker) to k8s cluster.
- Namespace: novelapp, tier: aux
- iSCSI PVC for SQLite persistence
- Ingress at novelapp.viktorbarzin.me
- Browser scraping disabled
2026-03-14 18:51:14 +00:00
Viktor Barzin
98d7c2a4a5 fix: resolve HCL semicolons and vault-platform dependency cycle
- Replace semicolons with newlines in vault/main.tf variable blocks
  (HCL does not support semicolons)
- Remove dependency "vault" from platform/terragrunt.hcl to break
  cycle (vault already depends on platform)
2026-03-14 17:37:25 +00:00
Viktor Barzin
a8d944eb9b migrate all secrets from SOPS to Vault KV
- Add vault provider to root terragrunt.hcl (generated providers.tf)
- Delete stacks/vault/vault_provider.tf (now in generated providers.tf)
- Add 124 variable declarations + 43 vault_kv_secret_v2 resources to
  vault/main.tf to populate Vault KV at secret/<stack-name>
- Migrate 43 consuming stacks to read secrets from Vault KV via
  data "vault_kv_secret_v2" instead of SOPS var-file
- Add dependency "vault" to all migrated stacks' terragrunt.hcl
- Complex types (maps/lists) stored as JSON strings, decoded with
  jsondecode() in locals blocks
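A consuming stack following this pattern might look like the sketch below (mount, secret name, and keys are hypothetical):

```hcl
data "vault_kv_secret_v2" "secrets" {
  mount = "secret"
  name  = "my-stack" # hypothetical stack name
}

locals {
  # Simple values come straight out of the KV map.
  api_token = data.vault_kv_secret_v2.secrets.data["api_token"]

  # Complex types are stored as JSON strings and decoded here.
  allowed_ips = jsondecode(data.vault_kv_secret_v2.secrets.data["allowed_ips"])
}
```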

Bootstrap secrets (vault_root_token, vault_authentik_client_id,
vault_authentik_client_secret) remain in SOPS permanently.

Apply order: vault stack first (populates KV), then all others.
2026-03-14 17:15:48 +00:00
Viktor Barzin
39b7dac1a9 fix: bump openclaw memory limit to 1536Mi
Was hitting V8 heap OOM at 768Mi during LLM orchestration.
2026-03-14 16:45:57 +00:00
Viktor Barzin
683361b55e fix: bump calibre memory limit to 512Mi
Calibre binary installation was timing out at 256Mi, leaving
the web server unable to start.
2026-03-14 16:38:09 +00:00
Viktor Barzin
8612eb3fc7 fix: bump affine migration init container memory to 512Mi
Init container was OOMKilled (137) with default 128Mi LimitRange limit.
Prisma/Node.js migrations need more memory.
2026-03-14 16:37:06 +00:00
Viktor Barzin
ef4cfc146a scale down ollama-ui, netbox, tandoor to free cluster memory
Disabled after 2026-03-14 node2 OOM incident. Frees ~5GB memory limits.
2026-03-14 16:30:08 +00:00
Viktor Barzin
2be858f616 fix: eliminate memory overcommit to prevent node OOM crashes
Set requests = limits (Guaranteed QoS) across LimitRange defaults and
explicit pod resources. Node2 crashed 2026-03-14 from 250% memory
overcommit (61GB limits on 24GB node).

Changes:
- LimitRange: default = defaultRequest for all 6 tiers
- Grafana: 3 → 2 replicas
- Grampsweb: document why replicas=0
- Prometheus: 1Gi/4Gi → 3Gi/3Gi
- OpenClaw: 512Mi/2Gi → 768Mi/768Mi
- Immich server: 256Mi/2Gi → 512Mi/512Mi
- Immich postgresql: 256Mi/1Gi → 512Mi/512Mi
- Calibre: 256Mi/1536Mi → 256Mi/256Mi
- Linkwarden: 256Mi/1536Mi → 768Mi/768Mi
- N8N: 256Mi/1Gi → 512Mi/512Mi
- MySQL cluster: 1Gi/3-4Gi → 2Gi/2Gi
- pg-cluster (CNPG): 512Mi/4Gi → 512Mi/512Mi
- DBaaS ResourceQuota limits.memory: 64Gi → 12Gi
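Setting default equal to defaultRequest in a LimitRange gives any container without its own resources block requests = limits for memory, e.g. (illustrative values, not one of the six tiers verbatim):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: tier-defaults # hypothetical name
spec:
  limits:
    - type: Container
      default:          # memory limit applied when the pod sets none
        memory: 512Mi
      defaultRequest:   # equal request => no overcommit from defaults
        memory: 512Mi
```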

[ci skip]
2026-03-14 16:01:41 +00:00
Viktor Barzin
27fa8ea18f Hide Vault OIDC from main login dropdown
OIDC popup flow hangs due to Authentik X-Frame-Options.
Keep OIDC accessible via the "Other" tab instead.
2026-03-14 14:12:16 +00:00
Viktor Barzin
1dec7e6bea Add Vault OIDC authentication via Authentik
Configure Vault to use Authentik as OIDC identity provider for SSO login.
Creates OAuth2 provider/application in Authentik, adds OIDC auth backend,
admin policy, and maps "authentik Admins" group to full vault-admin access.
2026-03-14 13:53:05 +00:00
Viktor Barzin
44aa6d61c2 Reduce downtime during platform stack applies
CrowdSec fixes:
- Increase ResourceQuota requests.cpu 1→4 (was at 302%, blocking upgrades)
- Add LAPI startupProbe: 30 attempts × 10s = 5min startup window
  (LAPI pods were failing the default startup probe during rolling upgrades)
- Reduce Helm timeout 3600s→900s with wait=true, wait_for_jobs=true

Prometheus startup guard on 8 rate-based alerts:
- PodCrashLooping, ContainerOOMKilled, CoreDNSErrors,
  HighServiceErrorRate, HighService4xxRate, HighServiceLatency,
  SSDHighWriteRate, HDDHighWriteRate
- Suppresses false positives for 15m after Prometheus restart
2026-03-14 12:47:56 +00:00
Viktor Barzin
4ea3ffe9d3 Reduce downtime during platform stack applies
CrowdSec Helm fix:
- Increase ResourceQuota requests.cpu from 1 to 4 — pods were at 302%
  of quota, preventing scheduling during rolling upgrades
- Reduce Helm timeout from 3600s to 600s — 1 hour hang is excessive
- Add wait=true and wait_for_jobs=true for proper readiness checking

Prometheus startup guard:
- Add startup guard to 8 rate/increase-based alerts that false-fire
  after Prometheus restarts (needs 2 scrapes for rate() to work):
  PodCrashLooping, ContainerOOMKilled, CoreDNSErrors,
  HighServiceErrorRate, HighService4xxRate, HighServiceLatency,
  SSDHighWriteRate, HDDHighWriteRate
- Guard: and on() (time() - process_start_time_seconds) > 900
  suppresses alerts for 15m after Prometheus startup
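Applied to one of the listed alerts, the guarded rule could read as follows — only the guard clause is quoted from this commit; the base expression, selector, and threshold are illustrative:

```yaml
- alert: ContainerOOMKilled
  expr: |
    increase(kube_pod_container_status_restarts_total[15m]) > 0
    and on() (time() - process_start_time_seconds{job="prometheus"}) > 900
  for: 5m
```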
2026-03-14 12:09:09 +00:00
Viktor Barzin
4635d3b826 remember: CrowdSec Helm upgrade timeout [ci skip] 2026-03-14 12:04:07 +00:00
Viktor Barzin
17065304dc Fix NFSServerUnresponsive false positives
Root cause: sum(rate(node_nfs_requests_total[5m])) == 0 was too fragile:
- rate() returns nothing after Prometheus restarts (needs 2 scrapes)
- Individual nodes show zero NFS rate during scrape gaps or low activity
- The sum() could hit zero during quiet hours + scrape gaps

New expression uses:
- changes() instead of rate() — works with a single scrape
- Per-instance aggregation: count nodes with any NFS counter change
- Threshold < 2 nodes: single-node restarts won't trigger, real NFS
  outage (all nodes affected) will
- Prometheus startup guard: skip first 15m after restart to avoid
  false positives from empty TSDB
- Wider 15m changes() window to smooth out scrape gaps
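Assembled from the bullets above, the expression might read like this sketch (not the literal rule file; the `or vector(0)` guards the all-nodes-quiet case, where a bare count() would return an empty vector and never fire):

```yaml
- alert: NFSServerUnresponsive
  expr: |
    (count(sum by (instance) (changes(node_nfs_requests_total[15m])) > 0)
      or vector(0)) < 2
    and on() (time() - process_start_time_seconds{job="prometheus"}) > 900
```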
2026-03-14 11:28:17 +00:00
Viktor Barzin
6377a8b85b Monitoring overhaul: reduce noise, add coverage gaps, auto-load dashboards
Noise reduction (8 alerts tuned):
- PoisonFountainDown: 2m→5m, critical→warning (fail-open service)
- NodeExporterDown: 2m→5m (flaps during node restarts)
- PowerOutage: add for:1m (debounce transient voltage dips)
- New Tailscale client: add for:5m (debounce headscale reauths)
- NoNodeLoadData: use absent() instead of OR vector(0)==0
- NodeHighCPUUsage: 30%→60% (normal for 70+ services)
- HighMemoryUsage GPU: 12GB/5m→14GB/15m (T4=16GB, model loading)
- PrometheusStorageFull: 50GiB→150GiB (TSDB cap is 180GB)

Alert regrouping:
- Move MailServerDown, HackmdDown, PrivatebinDown → new "Application Health"
- Move New Tailscale client → "Infrastructure Health"

New alerts (14):
- Networking: Cloudflared (2), MetalLB (2), Technitium DNS
- Storage: NFS CSI, iSCSI CSI controllers
- Critical Services: PgBouncer, CNPG operator, MySQL operator
- Infra Health: CrowdSec, Kyverno, Sealed Secrets, Woodpecker

Inhibit rules:
- Consolidate 3 NodeDown rules into 1 comprehensive rule
- Extend NFS rule to suppress NFS-dependent services
- Add PowerOutage → downstream suppression

Dashboard loading:
- Add for_each ConfigMap in grafana.tf to auto-load all 18 dashboards
- Remove duplicate caretta dashboard ConfigMap from caretta.tf
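A for_each ConfigMap loader of that shape might look like the sketch below (directory layout, namespace, and the sidecar label are assumptions):

```hcl
resource "kubernetes_config_map" "dashboards" {
  # One ConfigMap per dashboard JSON file found in the module.
  for_each = fileset("${path.module}/dashboards", "*.json")

  metadata {
    name      = "dashboard-${replace(each.value, ".json", "")}"
    namespace = "monitoring"
    labels = {
      grafana_dashboard = "1" # picked up by the Grafana sidecar
    }
  }

  data = {
    (each.value) = file("${path.module}/dashboards/${each.value}")
  }
}
```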
2026-03-14 10:25:31 +00:00
Viktor Barzin
a6f71fc6f0 feat(claude-memory): add stack and update image to standalone repo
- Add claude-memory stack (was previously untracked)
- Update Docker image from viktorbarzin/claude-memory to
  viktorbarzin/claude-memory-mcp (standalone open-source repo)
- CI/CD now lives in the standalone repo's .woodpecker.yml
2026-03-14 09:49:38 +00:00
Viktor Barzin
2102cb2d73 Right-size CPU requests cluster-wide and remove missed CPU limits
Increase requests for under-requested pods (dashy 50m→250m, frigate 500m→1500m,
clickhouse 100m→500m, otp 100m→300m, linkwarden 25m→50m, authentik worker 50m→100m).

Reduce requests for over-requested pods (crowdsec agent/lapi 500m→25m each,
prometheus 200m→100m, dbaas mysql 1800m→100m, pg-cluster 250m→50m,
shlink-web 250m→10m, gpu-pod-exporter 50m→10m, stirling-pdf 100m→25m,
technitium 100m→25m, celery 50m→15m). Reduce crowdsec quota from 8→1 CPU.

Remove missed CPU limits in prometheus (cpu: "2") and dbaas (cpu: "3600m") tpl files.
2026-03-14 09:22:24 +00:00
Viktor Barzin
b00f810d3d Remove all CPU limits cluster-wide to eliminate CFS throttling
CPU limits cause CFS throttling even when nodes have idle capacity.
Move to a request-only CPU model: keep CPU requests for scheduling
fairness but remove all CPU limits. Memory limits stay (incompressible).

Changes across 108 files:
- Kyverno LimitRange policy: remove cpu from default/max in all 6 tiers
- Kyverno ResourceQuota policy: remove limits.cpu from all 5 tiers
- Custom ResourceQuotas: remove limits.cpu from 8 namespace quotas
- Custom LimitRanges: remove cpu from default/max (nextcloud, onlyoffice)
- RBAC module: remove cpu_limits variable and quota reference
- Freedify factory: remove cpu_limit variable and limits reference
- 86 deployment files: remove cpu from all limits blocks
- 6 Helm values files: remove cpu under limits sections
2026-03-14 08:51:45 +00:00
Viktor Barzin
120f83ce93 Nextcloud performance tuning and fix backup cron job
- Set loglevel=2 (warnings) and disable mail_smtpdebug via configs
- Enable opcache.enable_file_override for faster file checks
- Increase APCu shared memory from 32M to 128M
- Fix broken module.nfs_nextcloud_data reference in backup cron job
  to use the iSCSI PVC directly
2026-03-14 08:20:51 +00:00
Viktor Barzin
8aaa75b57b Migrate Matrix Synapse from SQLite to PostgreSQL
SQLite over NFS caused database corruption (malformed disk image).
Recovered the DB, migrated data to PostgreSQL via synapse_port_db,
and updated the deployment to use psycopg2 with an init container.

Database: matrix on postgresql.dbaas.svc.cluster.local
Scaled replicas from 0 to 1.
2026-03-13 23:21:59 +00:00
Viktor Barzin
1fefffeebc Add OTP resource limits and scale up
OTP was crash-looping with Java OOM at the default 256Mi LimitRange.
Added explicit resource limits (1Gi request, 2Gi limit) and -Xmx1536m
JVM flag. Scaled replicas from 0 back to 1.
2026-03-13 22:33:27 +00:00
Viktor Barzin
c8d15adc16 Remove LokiDown alert rule and inhibit reference
Loki has been turned off — remove the orphaned alert rule and its
reference in the NodeDown inhibit configuration.
2026-03-13 22:21:11 +00:00
Viktor Barzin
b323e567e4 Add HAProxy for Redis HA master-only routing
The Redis K8s Service was load-balancing across both master and replica
nodes, causing READONLY errors when clients hit the replica. This broke
Nextcloud (DAV 500s, liveness probe timeouts, crash loops) and
potentially other services.

Replace the direct Service with HAProxy (2 replicas) that health-checks
each Redis node via INFO replication and only routes to role:master.
On Sentinel failover, HAProxy detects the new master within ~9 seconds.
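The role:master health check is typically expressed in haproxy.cfg like this (a sketch with hypothetical server addresses and no Redis AUTH; with `inter 3s fall 3`, a failed master is marked down in roughly 9 seconds, matching the failover time above):

```
backend redis_master
    mode tcp
    option tcp-check
    tcp-check send PING\r\n
    tcp-check expect string +PONG
    tcp-check send info\ replication\r\n
    tcp-check expect string role:master
    tcp-check send QUIT\r\n
    tcp-check expect string +OK
    server redis-0 redis-node-0.redis:6379 check inter 3s fall 3 rise 2
    server redis-1 redis-node-1.redis:6379 check inter 3s fall 3 rise 2
```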
2026-03-13 22:21:10 +00:00
Viktor Barzin
d05ff57b11 authentik: auto-assign invitation group via expression policy [ci skip]
Added invitation-group-assignment expression policy bound to the
enrollment-login stage. Reads group name from invitation fixed_data
and auto-adds the user to the target group on enrollment.
No more manual assign step needed after signup.
2026-03-13 22:21:10 +00:00
Viktor Barzin
160fda882f authentik: cleanup unused resources + add invitation enrollment flow [ci skip]
Cleanup:
- Deleted 5 unused flows (enrollment-inviation, headscale-auth/authz, default-enrollment, oauth-enrollment)
- Deleted 8 orphaned stages bound only to deleted flows
- Deleted authentik Read-only group and role (0 users)
- Deleted 2 unbound policies (map github username, Map Google Attributes)

Invitation enrollment:
- Created invitation-enrollment flow with 5 stages (invitation validation,
  identification with social login, prompt, user write, auto-login)
- Set all OAuth sources (Google/GitHub/Facebook) enrollment_flow to invitation-enrollment
- New users can only sign up via single-use invitation links
- Added authentik-invite.sh script for invitation management
- Updated reference docs and authentik skill
2026-03-13 22:21:10 +00:00
Viktor Barzin
af5f6a659b right-size Nextcloud resources after MySQL migration
Under SQLite, usage was 4.7 CPU / 2GB; on MySQL it is ~95m / 95Mi.
Reduced limits from 16 CPU / 6Gi to 2 CPU / 1Gi.
Reduced requests from 100m / 1Gi to 50m / 256Mi.
Frees ~14 CPU cores and 5Gi memory for other workloads.
2026-03-13 22:21:10 +00:00
Viktor Barzin
3e03fbec63 increase MaxRequestWorkers to 150 now that Nextcloud is on MySQL
With SQLite, 50 workers caused all workers to block on DB locks.
On MySQL, CPU is ~20m and memory ~143Mi — no resource pressure.
The crash-looping was caused by hitting MaxRequestWorkers=50 limit
("server reached MaxRequestWorkers setting"), not by DB contention.
2026-03-13 22:20:52 +00:00
Viktor Barzin
aa3d3d0e66 migrate Nextcloud from SQLite to MySQL InnoDB Cluster
SQLite was causing constant crash-looping (138 restarts/day) due to
write lock contention with concurrent sync clients.

Migration required workarounds for multiple occ db:convert-type bugs:
- GR error 3100: SET GLOBAL sql_generate_invisible_primary_key = ON
- utf8mb3 column creation: stripped 4-byte emoji + invalid UTF-8 from
  SQLite (F1 calendar events, filecache)
- SQLite index corruption: repaired via .dump + INSERT OR IGNORE reimport
- kubectl exec timeouts: used nohup + detached process

Verified: all 136 tables migrated, 100% row count match across 15 key
tables (users, files, calendars, contacts, shares, activity).

Also fixed typo: databse → database in chart values.
2026-03-13 22:20:28 +00:00
Viktor Barzin
ce79bd5c04 Add node hang instrumentation and scale down chromium services
- Add journald collection to Alloy (loki.source.journal) for kernel OOM,
  panic, hung task, and soft lockup detection — ships system logs off-node
  so they survive hard resets
- Add 5 Loki alerting rules (KernelOOMKiller, KernelPanic, KernelHungTask,
  KernelSoftLockup, ContainerdDown) evaluating against node-journal logs
- Fix Loki ruler config: correct rules mount path (/var/loki/rules/fake),
  add alertmanager_url and enable_api
- Add Prometheus alerts: NodeMemoryPressureTrending (>85%), NodeExporterDown,
  NodeHighIOWait (>30%)
- Add caretta tolerations for control-plane and GPU nodes
- Scale down chromium-based services to 0 for cluster stability:
  f1-stream, flaresolverr, changedetection, resume/printer
2026-03-13 22:20:28 +00:00
OpenClaw
8029823f79 fix(monitoring): Add setup script for automated health check environment
ISSUE: Automated cron health checks were failing with 'cluster unreachable'
ROOT CAUSE: Cron jobs lack access to kubeconfig (KUBECONFIG env var not set)

SOLUTION: Created setup-monitoring.sh script that:
- Copies the working kubeconfig to the expected location (/workspace/infra/config)
- Tests health check script functionality
- Provides clear feedback on setup status

USAGE:
./setup-monitoring.sh (run once to enable automated health checks)

REASONING:
- Kubeconfig contains secrets, shouldn't be committed to git
- Health check script logic: KUBECONFIG_PATH="${KUBECONFIG:-$(pwd)/config}"
- Cron jobs run without KUBECONFIG env var, so fall back to /workspace/infra/config
- This script bridges the gap between persistent kubeconfig and cron environment
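The fallback line quoted above works as in this minimal sketch (paths illustrative):

```shell
#!/bin/sh
# Use KUBECONFIG from the environment when set (interactive shells),
# otherwise fall back to a config file in the working directory (cron).
KUBECONFIG_PATH="${KUBECONFIG:-$(pwd)/config}"
echo "using kubeconfig at: $KUBECONFIG_PATH"
```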

VERIFICATION:
- Automated health checks now show realistic results (21 PASS, 4 WARN, 1 FAIL)
- No more false 'cluster unreachable' alerts from cron jobs

The script is idempotent and can be run multiple times safely.
2026-03-13 13:57:11 +00:00
Viktor Barzin
dfcef89c35 fix Frigate GPU stall: add inference speed check to liveness probe
The existing probe only checked nvidia-smi + API availability, which passes
even when the detector falls back to CPU. Now also checks /api/stats and
restarts the pod if inference speed exceeds 100ms (normal GPU: ~20ms,
CPU fallback: 200ms+).
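One way to express such a probe (the endpoint is Frigate's real /api/stats; the port, threshold, and availability of wget/jq in the image are assumptions):

```yaml
livenessProbe:
  exec:
    command:
      - sh
      - -c
      - |
        # Fail the probe if the GPU is gone OR the slowest detector
        # reports inference above 100ms (CPU-fallback territory).
        nvidia-smi > /dev/null &&
        speed=$(wget -qO- http://127.0.0.1:5000/api/stats \
          | jq '[.detectors[].inference_speed] | max') &&
        awk -v s="$speed" 'BEGIN { exit !(s < 100) }'
  periodSeconds: 60
  failureThreshold: 3
```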

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-13 10:25:46 +00:00
OpenClaw
50d539908c feat(monitoring): Disable Loki centralized logging while preserving configuration
DECISION: Disable Loki due to operational overhead vs benefit analysis

EVIDENCE FROM NODE2 INCIDENT:
- Loki was the root cause of major cluster outage (PVC storage exhaustion)
- Centralized logging was unavailable when needed most (Loki was down)
- All debugging was accomplished with simpler tools (kubectl logs, events, describe)
- Prometheus metrics proved more valuable than centralized logs

OPERATIONAL OVERHEAD ELIMINATED:
- 50GB of iSCSI storage freed up (expensive)
- ~3.5GB of memory freed up (Loki + Alloy agents across the cluster)
- 2+ CPU cores freed up for actual workloads
- Reduced complexity: fewer services to maintain and troubleshoot
- Eliminated a single point of failure that could cascade cluster-wide

CONFIGURATION PRESERVED:
- All Terraform resources commented out (not deleted)
- loki.yaml preserved with the 50GB configuration
- alloy.yaml preserved with the log shipping configuration
- Alert rules and Grafana datasource preserved (commented out)
- Easy re-enabling: just uncomment the resources and apply

ALTERNATIVE DEBUGGING APPROACH:
- kubectl logs (always works, no storage dependency)
- kubectl get events (built-in Kubernetes events)
- Prometheus metrics (more reliable for monitoring)
- Enhanced health check scripts (direct status verification)

RE-ENABLING:
To restore Loki later, uncomment all /* ... */ blocks in loki.tf
and apply via Terraform. All configuration is preserved.

[ci skip] - Infrastructure changes applied locally first due to resource cleanup
2026-03-13 08:41:23 +00:00
OpenClaw
523f0ba7eb fix(monitoring): Expand Loki PVC from 15GB to 50GB to resolve storage exhaustion
ISSUE RESOLVED:
- Root cause: Loki's 15GB iSCSI PVC was completely full
- Symptom: 'no space left on device' errors during TSDB operations
- Impact: Loki service completely down, logging unavailable
- Side effects: Contributed to node2 containerd corruption incident

SOLUTION APPLIED:
- Expanded PVC storage: 15Gi → 50Gi via direct kubectl patch
- Triggered pod restart to complete filesystem resize
- Verified successful expansion and service recovery

CURRENT STATUS:
- PVC: 50Gi capacity (iscsi-truenas storage class)
- Loki StatefulSet: 1/1 ready
- Loki pod: 2/2 containers running
- Service: successfully processing log streams
- No storage errors in recent logs

TERRAFORM ALIGNED:
- Updated loki.yaml persistence.size to match actual PVC
- Infrastructure code now reflects deployed state

[ci skip] - Emergency fix applied locally first due to service outage
2026-03-13 08:13:05 +00:00
OpenClaw
4a9bd89b11 feat(health-check): Add Prometheus-based CPU and power monitoring
SECTIONS ADDED:
- Section 25: Advanced CPU Monitoring (Prometheus node_exporter metrics)
- Section 26: Power Monitoring (DCGM GPU power + host power)

FEATURES:
- 5-minute CPU usage averages (more accurate than kubectl top)
- Tesla T4 GPU power consumption monitoring
- CPU thresholds: 70% warn, 85% critical
- GPU power thresholds: 50W active, 65W high
- Maps IP addresses to friendly node names
- Integrates with existing health check infrastructure

CURRENT STATUS:
- All nodes have healthy disk usage (~10%)
- k8s-node4 flagged at 87% CPU (explains resource pressure)
- GPU operating normally at 30.9W
- Enhanced monitoring prevents issues like node2 containerd corruption

Total health check sections: 26 (was 24)
Addresses node2 incident prevention requirements
2026-03-13 07:32:36 +00:00
OpenClaw
a09967e098 feat(monitoring): Enhance disk monitoring and containerd GC after node2 incident
IMMEDIATE CHANGES (Active Now):
- Lower disk warning threshold: 70% → WARN, 85% → FAIL (was 80%/90%)
- More aggressive alerting to prevent containerd corruption
- Enhanced cluster health check disk monitoring

INFRASTRUCTURE CHANGES (Requires Terraform Apply):
- Add containerd garbage collection configuration (30min intervals)
- More aggressive kubelet eviction policies (15%/20% vs 10%/15%)
- Enhanced disk space protection to prevent node2-type failures

Root Cause: node2 disk exhaustion corrupted containerd image store
Prevention: Proactive monitoring + aggressive cleanup policies

[ci skip] - Infrastructure changes require SOPS access for apply
2026-03-13 07:16:56 +00:00
OpenClaw
fd6c1cca93 fix(nextcloud): Database corruption recovery and conservative Apache tuning
- Restored clean SQLite database from pre-migration backup
- Fixed severe database corruption (rowid ordering, page corruption, index issues)
- Applied conservative MaxRequestWorkers=15 for SQLite stability
- Database integrity now perfect, all health checks passing
- Ready for future MySQL migration with clean data

[ci skip]
2026-03-12 13:38:37 +00:00
OpenClaw
db1e301eea fix(nextcloud): Increase Apache MaxRequestWorkers to resolve health check timeouts
- Increase MaxRequestWorkers from 10 to 25 for 4 CPU + 3Gi memory container
- Update Apache tuning for Redis + SQLite backend (not pure SQLite)
- Resolves CrashLoopBackOff caused by health probe timeouts
- Allows handling concurrent users without MaxRequestWorkers limit errors

[ci skip]
2026-03-12 13:14:20 +00:00
OpenClaw
cedb90be33 Clean up: Remove test push file 2026-03-12 12:38:46 +00:00
OpenClaw
84b616de41 Test: Verify git push functionality from OpenClaw 2026-03-12 12:38:36 +00:00
Viktor Barzin
3f0cf4ff4d stabilize Nextcloud: relax probes, reduce resources for 2-client SQLite workload
SQLite locks cause slow responses under concurrent access, triggering
liveness probe failures and restarts. With only 2 sync clients:

- Liveness: period 30→60s, timeout 10→30s, failures 6→10 (tolerates 10min)
- Readiness: period 30→60s, timeout 10→30s, failures 3→5
- Startup: timeout 10→30s
- Resources: CPU 16→4, memory 6Gi→3Gi (10 workers × 200MB = 2GB max)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-12 10:01:20 +00:00
Viktor Barzin
bef0c073d5 fix DNS health check: use system resolver instead of hardcoded 10.0.20.101
The check was querying Technitium DNS directly at 10.0.20.101:53, which
refuses connections from non-cluster hosts. Use the system resolver
(no @server flag) so it works from any host or pod environment.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-12 09:08:34 +00:00
Viktor Barzin
81bfccaefc fix OOM kills: tune MySQL memory, reduce Nextcloud workers, increase Uptime Kuma limit
MySQL (3 OOM kills):
- Cap group_replication_message_cache_size to 128MB (default 1GB caused OOM)
- Reduce innodb_log_buffer_size from 64MB to 16MB
- Lower max_connections from 151 to 80 (peak usage ~40)
- Increase memory limit from 3Gi to 4Gi for headroom
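In my.cnf form the three server-side changes above amount to (placement within the cluster's config templates is an assumption):

```ini
[mysqld]
# Cap the GR message cache well below the 1GB default that caused OOMs.
group_replication_message_cache_size = 128M
innodb_log_buffer_size               = 16M
max_connections                      = 80
```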

Nextcloud (30+ apache2 OOM kills per incident):
- Reduce MaxRequestWorkers from 50 to 10 to prevent fork bomb
  when SQLite locks cause request pileup
- Lower StartServers/MinSpare/MaxSpare proportionally

Uptime Kuma (Node.js memory leak):
- Increase memory limit from 256Mi to 512Mi
- Increase CPU limit from 200m to 500m

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-12 07:26:08 +00:00
Viktor Barzin
f2c7444159 fix nvidia quota: use custom quota (32 CPU) instead of Kyverno-generated (16 CPU)
The GPU operator needs ~19 CPU limits across 16 pods (NFD, device plugin,
driver, validators, exporters). The Kyverno auto-generated quota of 16 CPU
was insufficient, blocking NFD worker and GC pods from scheduling.

- Add custom-quota label to nvidia namespace to exempt from Kyverno generation
- Add explicit ResourceQuota with limits.cpu=32, limits.memory=48Gi
- Fix: nvidia namespace tier label was missing after CI re-apply, causing
  Kyverno to use fallback LimitRange instead of tier-2-gpu specific one

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-12 07:04:34 +00:00
Viktor Barzin
f07f05f9bb migrate Nextcloud data volume from NFS to iSCSI for fsync support
SQLite on NFS caused persistent 500 errors on WebDAV PROPFIND due to
missing fsync guarantees and database locking under concurrent access.
iSCSI (ext4) provides proper fsync and block-level I/O.

- Replace nfs_volume module with iscsi-truenas PVC (20Gi)
- Update Helm chart to use nextcloud-data-iscsi claim
- Excluded 12.5GB nextcloud.log and corrupted DB from migration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-11 23:24:03 +00:00
Viktor Barzin
4427530e65 Archive terraform.tfvars — secrets now in SOPS
Removed from git tracking and added to .gitignore.
File stays on disk locally for reference.
config.tfvars + secrets.auto.tfvars.json are the active var sources.

[ci skip]
2026-03-11 21:16:11 +00:00
Viktor Barzin
d7953322dd fix cluster health: pin actualbudget, spread MySQL, scale grampsweb, fix GPU toleration
- Pin actualbudget/actual-server from edge to 26.3.0 (all 3 instances) to
  prevent recurring migration breakage from rolling nightly builds
- Add podAntiAffinity to MySQL InnoDB Cluster to spread replicas across nodes,
  relieving memory pressure on k8s-node4
- Scale grampsweb to 0 replicas (unused, consuming 1.7Gi memory)
- Add GPU toleration Kyverno policy to Terraform using patchesJson6902 instead
  of patchStrategicMerge to fix toleration array being overwritten (caused
  caretta DaemonSet pod to be unable to schedule on k8s-master)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-11 11:43:34 +00:00
Viktor Barzin
6bdcd88d25 set Recreate strategy for plotting-book deployment
iSCSI volumes are ReadWriteOnce and cannot multi-attach, so the old
pod must terminate before the new one starts.
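In the Deployment spec this is a small fragment:

```yaml
spec:
  strategy:
    type: Recreate # old pod terminates before the new one attaches the RWO iSCSI volume
```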
2026-03-10 23:47:30 +00:00
Viktor Barzin
5a9881337d Add terminal stack - reverse proxy to ttyd behind authentik
Exposes ttyd at 10.0.10.10:7681 via terminal.viktorbarzin.me with
Cloudflare DNS and Authentik forward-auth protection.
2026-03-10 23:46:01 +00:00
Viktor Barzin
d8bcdfef2e revert MaxRequestWorkers to 50, exclude nextcloud from 5xx alert
- MaxRequestWorkers 25→50: too few workers caused ALL workers to block
  on SQLite locks, making liveness probes fail even faster (131 restarts
  vs 50 before). 50 is a compromise — enough workers for probes.
- Excluded nextcloud from HighServiceErrorRate alert (chronic SQLite issue)
- MySQL migration attempted but hit: GR error 3100 (fixed with GIPK),
  emoji in calendar/filecache (stripped), SQLite corruption (pre-existing
  from crash-looping). Migration rolled back, Nextcloud restored to SQLite.
2026-03-09 22:05:20 +00:00