Commit graph

23 commits

Author SHA1 Message Date
Viktor Barzin
46afa85b01 fix openclaw config mount and OOM: use init container, increase memory to 2Gi
- Replace subPath ConfigMap mount with init container that copies openclaw.json
  to writable NFS home (OpenClaw writes back to the file at runtime)
- Remove invalid memory-api plugin references causing "Config invalid"
- Increase memory to 2Gi (req+limit) with NODE_OPTIONS=--max-old-space-size=1536
- Fix tg wrapper to inject -auto-approve when apply --non-interactive is used
2026-03-14 23:42:17 +00:00
Viktor Barzin
76a4987eef [ci skip] add Forgejo task pipeline for OpenClaw AI agent
Forgejo issues as a task queue for OpenClaw:
- Forgejo OAuth2 with Authentik SSO, self-registration disabled
- Webhook-triggered task processing (instant) + CronJob backup (5min poll)
- Tasks processed via Mistral Large 3 (NVIDIA NIM API)
- Results posted as issue comments, auto-labeled and closed
- Comment follow-ups and reopened issues supported
- n8n RBAC for OpenClaw pod exec (future workflow integration)
2026-03-07 21:11:07 +00:00
Viktor Barzin
39333033a6 [ci skip] phase 1: SOPS tooling setup (.sops.yaml, scripts/tg, .gitignore)
Part of SOPS multi-user secrets migration.
- .sops.yaml: defines age recipients (Viktor + CI)
- scripts/tg: wrapper that decrypts secrets before running terragrunt
- .gitignore: excludes decrypted secrets.auto.tfvars.json

No functional change — terraform.tfvars still works as before.
2026-03-07 13:57:42 +00:00
Viktor Barzin
422dadafe5 [ci skip] replace resource overcommitment check with actual usage
Check real CPU/memory usage via kubectl top nodes instead of
limits-vs-allocatable ratios. Thresholds: >80% WARN, >90% FAIL.
Limits overcommit is expected with 70+ services on 3 worker nodes.
2026-03-06 20:28:55 +00:00
Viktor Barzin
87ef313888 [ci skip] fix post-NFS-migration issues: MySQL GR, Loki, grampsweb, alerts
- Loki: reduce memory limit from 6Gi to 4Gi (within LimitRange max)
- Grampsweb: increase memory to 2Gi (was OOMKilled at 512Mi)
- Fix PostgreSQLDown alert: check pod readiness instead of deployment
- Fix MySQLDown alert: check StatefulSet replicas instead of deployment
- Fix RedisDown alert: check StatefulSet replicas instead of deployment
- Fix NFSServerUnresponsive: aggregate all NFS versions cluster-wide
- Fix Uptime Kuma healthcheck: handle nested list heartbeat format
- Update etcd backup image to registry.k8s.io/etcd:3.6.5-0
2026-03-03 21:10:26 +00:00
Viktor Barzin
14b1c43713 [ci skip] expand k8s worker nodes to 256G, update inventory and extend script
- k8s-node2: 128G → 256G (160GB free)
- k8s-node3: 128G → 256G (135GB free)
- k8s-node4: 128G → 256G (127GB free)
- k8s-node1: already 256G (51GB free)
- extend_vm_storage.sh: increase drain timeout to 300s, add --force flag
- Remove Vaultwarden from SQLite migration plan (too risky)
2026-02-28 16:00:16 +00:00
Viktor Barzin
69c4c0c76e [ci skip] VPA: reduce LimitRange defaults, add overcommit check, protect tier-0
- Reduce Kyverno LimitRange default limits ~4x across all tiers to fix
  800-900% memory overcommitment on worker nodes
- Add cluster health check #25: per-node resource overcommitment
  showing requests and limits vs allocatable capacity
- Add Kyverno policy for Goldilocks VPA mode by tier: tier-0 namespaces
  get VPA Off mode (recommend only, no evictions) to prevent downtime
  on critical infra (traefik, cloudflared, authentik, technitium, etc.)
- Non-tier-0 namespaces get VPA Auto mode for active right-sizing
2026-02-26 23:15:43 +00:00
Viktor Barzin
d041459ef2 [ci skip] Upgrade Woodpecker CI v3.5.1 → v3.13.0, fix helm healthcheck for v4 2026-02-23 20:14:30 +00:00
Viktor Barzin
c8de2c4803 [ci skip] Sunset Drone CI: remove all artifacts, DNS, configs, and references
Drone CI has been fully replaced by Woodpecker CI at ci.viktorbarzin.me.
Destroys K8s resources (12), removes DNS records, NFS exports, Uptime Kuma
monitor, dashboard entry, and all code/doc references across 18 files.
2026-02-23 19:38:55 +00:00
Viktor Barzin
a9ba8899be [ci skip] Phase 3: Create 66 service stacks and migrate state
Generated individual stack directories for all 66 services under stacks/.
Each stack has terragrunt.hcl (depends on platform) and main.tf (thin
wrapper calling existing module). Migrated all 64 active service states
from root terraform.tfstate to individual state files. Root state is now
empty. Verified with terragrunt plan on multiple stacks (no changes).
2026-02-22 13:56:34 +00:00
Viktor Barzin
db659b1f7a [ci skip] Fix dashy OOMKilled and healthcheck DNS false-failure
- Add explicit resource limits to dashy (2Gi memory) to prevent OOMKilled
  during webpack build on startup
- Rewrite DNS healthcheck to test from inside the Technitium pod via
  kubectl exec, since MetalLB virtual IPs aren't reachable from outside
  the L2 network
- Deleted orphaned kured/tls-secret (expired Oct 2025, module disabled,
  not mounted by kured DaemonSet)
2026-02-22 12:46:12 +00:00
Viktor Barzin
00dc78e0d2 [ci skip] Fix Uptime Kuma false-down reports: use bulk heartbeat API instead of per-monitor calls 2026-02-22 01:37:28 +00:00
Viktor Barzin
98b711ff8d [ci skip] Extend cluster healthcheck from 14 to 24 checks
Add 10 new checks covering gaps discovered during incident response:
ResourceQuota pressure, StatefulSets, node disk usage, Helm release
health, Kyverno policy engine, NFS connectivity, DNS resolution,
TLS certificate expiry, GPU health, and Cloudflare tunnel status.
2026-02-21 23:57:04 +00:00
Viktor Barzin
038d4434c4 [ci skip] Fix health check false positives for completed CronJob pods 2026-02-21 19:56:39 +00:00
Viktor Barzin
2bae6ccce3 Add Uptime Kuma monitor check to cluster health script [ci skip]
Adds check #14 that queries Uptime Kuma API for application-level
monitor status, complementing the kubectl-level checks with HTTP/ping
health data. Reports down monitors by name with PASS/WARN/FAIL thresholds.
2026-02-15 17:49:40 +00:00
Viktor Barzin
9c4ff21d58 Add cluster health check script with 13 diagnostic sections [ci skip] 2026-02-15 17:34:22 +00:00
Viktor Barzin
a67a6f350e [ci skip] Fix pull-through cache for all registries
Replace deprecated wildcard containerd mirror with per-registry
config_path approach. Add proxy containers for ghcr.io, quay.io,
registry.k8s.io, and reg.kyverno.io on the docker-registry VM.
Set static IP for docker-registry VM to avoid DHCP issues.
2026-02-15 14:35:52 +00:00
Viktor Barzin
08ea489fe0 [ci skip] Add extend-vm-storage script and skills
- Script to automate K8s node VM disk expansion (drain, shutdown, resize, boot, expand FS, uncordon)
- Skill docs for the workflow and troubleshooting pitfalls (growpart, macOS grep -P, drain timeouts)
- Successfully tested on k8s-node2, k8s-node3, k8s-node4 (64G → 128G)
2026-02-13 22:08:46 +00:00
Viktor Barzin
a926a5022c [ci skip] sync tfstate and add frigate helper scripts 2026-02-12 23:11:23 +00:00
Viktor Barzin
7441538d6e upgrade to k8s 1.34.2 [ci skip] 2025-12-18 12:37:14 +00:00
Viktor Barzin
e1ec44c81d scale down calibre-web-automated instead of calibre [ci skip] 2025-12-06 22:04:41 +00:00
Viktor Barzin
c7c69905c0 some nits on the registry manager script - note it is still not working correctly [ci skip] 2025-10-17 19:23:43 +00:00
Viktor Barzin
8da88f9f6d move helper scripts in scripts dir [ci skip] 2025-10-11 17:14:59 +00:00