infra

Author	SHA1	Message	Date
Viktor Barzin	197cef7f3f	[ci skip] add auto-generated tiers.tf, planning docs, and helm chart cache - tiers.tf: Terragrunt-generated tier locals for all standalone stacks - .planning/: resource audit research and plans - docs/plans/: cluster hardening design doc - redis-25.3.2.tgz: Bitnami Redis Helm chart cache	2026-03-06 23:55:57 +00:00
Viktor Barzin	0abae33c71	[ci skip] complete NFS CSI migration: complex stacks + platform modules Migrate remaining multi-volume stacks and all platform modules from inline NFS volumes to CSI-backed PV/PVC with nfs-truenas StorageClass (soft,timeo=30,retrans=3 mount options). Complex stacks: openclaw (4 vols), immich (8 vols), frigate (2 vols), nextcloud (2 vols + old PV replaced), rybbit (1 vol) Remaining stacks: affine, ebook2audiobook, f1-stream, osm_routing, real-estate-crawler Platform modules: monitoring (prometheus, loki, alertmanager PVs converted from native NFS to CSI), redis, dbaas, technitium, headscale, vaultwarden, uptime-kuma, mailserver, infra-maintenance	2026-03-02 01:24:07 +00:00
Viktor Barzin	9e4fb23b10	[ci skip] right-size all pod resources based on VPA + live metrics audit Full cluster resource audit: cross-referenced Goldilocks VPA recommendations, live kubectl top metrics, and Terraform definitions for 100+ containers. Critical fixes: - dashy: CPU throttled at 98% (490m/500m) → 2 CPU limit - stirling-pdf: CPU throttled at 99.7% (299m/300m) → 2 CPU limit - traefik auth-proxy/bot-block-proxy: mem limit 32Mi → 128Mi Added explicit resources to ~40 containers that had none: - audiobookshelf, changedetection, cyberchef, dawarich, diun, echo, excalidraw, freshrss, hackmd, isponsorblocktv, linkwarden, n8n, navidrome, ntfy, owntracks, privatebin, send, shadowsocks, tandoor, tor-proxy, wealthfolio, networking-toolbox, rybbit, mailserver, cloudflared, pgadmin, phpmyadmin, crowdsec-web, xray, wireguard, k8s-portal, tuya-bridge, ollama-ui, whisper, piper, immich-server, immich-postgresql, osrm-foot GPU containers: added CPU/mem alongside GPU limits: - ollama: removed CPU/mem limits (models vary in size), keep GPU only - frigate: req 500m/2Gi, lim 4/8Gi + GPU - immich-ml: req 100m/1Gi, lim 2/4Gi + GPU Right-sized ~25 over-provisioned containers: - kms-web-page: 500m/512Mi → 50m/64Mi (was using 0m/10Mi) - onlyoffice: CPU 8 → 2 (VPA upper 45m) - realestate-crawler-api: CPU 2000m → 250m - blog/travel-blog/webhook-handler: 500m → 100m - coturn/health/plotting-book: reduced to match actual usage Conservative methodology: limits = max(VPA upper * 2, live usage * 2)	2026-03-01 19:18:50 +00:00
Viktor Barzin	ccf0b2232f	[ci skip] switch VPA to off mode globally, fix Ollama/MySQL resources - Kyverno policy: VPA mode set to 'off' for all namespaces (was 'initial' for non-core). Terraform is now sole authority for container resources. Goldilocks provides recommendations only. - Ollama: add explicit CPU/memory resources (500m/4Gi req, 4/12Gi limit) alongside GPU allocation. Fixes OOMKill from VPA scaling down resources. - MySQL InnoDB Cluster: bump memory limit from 2Gi to 3Gi. - Remove redundant per-namespace VPA opt-out labels from onlyoffice, openclaw, trading-bot (now handled globally by Kyverno policy).	2026-03-01 19:03:49 +00:00
Viktor Barzin	80dfc58fea	[ci skip] openclaw: fix workspace permissions — chown to node user Init container clones repo as root but main container runs as node (UID 1000). Added chown -R 1000:1000 /workspace/infra so OpenClaw can write to workspace.	2026-03-01 17:20:36 +00:00
Viktor Barzin	f203e7bd2c	[ci skip] openclaw: set workspace + enable elevated + native commands - Set workspace to /workspace/infra (was defaulting to ~/.openclaw/workspace) - Enable tools.elevated for unrestricted access - Enable commands.native and commands.nativeSkills - All tools, commands, and skills now fully accessible	2026-03-01 17:12:03 +00:00
Viktor Barzin	b2ac69e12b	[ci skip] openclaw: disable sandbox mode for unrestricted execution - Set agents.defaults.sandbox.mode = off - Combined with exec.host=gateway and exec.security=full, OpenClaw can now run any command on the container host	2026-03-01 16:51:35 +00:00
Viktor Barzin	99881b28e3	[ci skip] openclaw: fix exec host — use gateway instead of node host=node requires a companion app (not available in container). host=gateway runs commands directly on the gateway process host.	2026-03-01 16:47:14 +00:00
Viktor Barzin	6efc1e56c0	[ci skip] openclaw: fix exec config — use host=node, security=full Valid options: host=sandbox\|gateway\|node, security=deny\|allowlist\|full. Using node (run on container host) with full (no command restrictions).	2026-03-01 16:42:22 +00:00
Viktor Barzin	c83f3aab90	[ci skip] openclaw: disable sandbox, run commands on container host - exec.host: sandbox → local (run directly on container, no Docker sandbox) - exec.security: full → off (no restrictions on command execution)	2026-03-01 16:18:53 +00:00
Viktor Barzin	b10d43b7a7	[ci skip] openclaw: persist home directory on NFS - Switch openclaw-home from emptyDir to NFS (/mnt/main/openclaw/home) - Persists SOUL.md, IDENTITY.md, sessions, memory DB, telegram state, device identity, and all runtime files across pod restarts - Init container still refreshes openclaw.json and kubeconfig on each start	2026-03-01 16:12:07 +00:00
Viktor Barzin	0f7e7e5969	[ci skip] openclaw: remove all tool/command restrictions - Set tools.deny = [] (was blocking sessions, subagents, browser) - All tools now available: sessions, subagents, browser, etc.	2026-03-01 15:58:12 +00:00
Viktor Barzin	f031a6bcf6	[ci skip] openclaw: add modelrelay sidecar as fallback model router - Deploy modelrelay as sidecar container (auto-routes to fastest free model) - Configured with NVIDIA NIM + OpenRouter API keys - Primary: Mistral Large 3 (NIM), Fallback 1: Nemotron Ultra (NIM), Fallback 2: modelrelay/auto-fastest (80+ free models) - Modelrelay web UI available at pod:7352	2026-03-01 15:57:31 +00:00
Viktor Barzin	207164050c	[ci skip] openclaw: fix Telegram, update to v2026.2.26, fix startup issues - Update OpenClaw from v2026.2.9 to v2026.2.26 (fixes Telegram channel) - Add gateway.mode=local + wizard block (required for channel startup) - Add dangerouslyAllowHostHeaderOriginFallback (v2026.2.26 requirement) - Run doctor --fix at container startup to auto-enable Telegram - Create required dirs (canvas, devices, cron, sessions, credentials) - Fix permissions: chown -R 1000:1000 for node user - Telegram: DM allowlist, user 8281953845 only	2026-03-01 15:47:54 +00:00
Viktor Barzin	0da6f90ad2	[ci skip] openclaw: fix slow startup — proper resources + readiness probe + VPA off - Set explicit CPU (2 cores) and memory (2Gi) limits Root cause: Goldilocks VPA was throttling to 300m CPU, causing gateway to take 5+ minutes to start, and 1Gi memory caused OOM crashes - Add TCP readiness probe on port 18789 to prevent 502 Bad Gateway during startup (Traefik was routing before gateway was listening) - Disable Goldilocks VPA via namespace label (vpa-update-mode: off)	2026-03-01 14:44:22 +00:00
Viktor Barzin	e8ff760aff	[ci skip] openclaw: cache tools on NFS for fast restarts - Switch /tools volume from emptyDir to NFS (/mnt/main/openclaw/tools) - Skip download of kubectl, terraform, terragrunt, pip packages if cached - Startup time: ~2.5min → ~38s on subsequent restarts	2026-03-01 13:59:07 +00:00
Viktor Barzin	e728f4c106	[ci skip] openclaw: add Telegram channel + install terragrunt in init container - Add Telegram bot integration (DM allowlist, user 8281953845 only) - Install terragrunt v0.99.4 in init container alongside terraform - Remove terraform init from init (terragrunt handles this per-stack) - Add openclaw_telegram_bot_token variable	2026-03-01 13:44:58 +00:00
Viktor Barzin	014f6cad5a	[ci skip] openclaw: switch to free agentic models via NVIDIA NIM, OpenRouter, Llama API - Primary: Mistral Large 3 (675B) on NIM - always warm, excellent tool calling - Fallback 1: Nemotron Ultra 253B on NIM - Fallback 2: Llama 4 Maverick on Llama API (different provider for resilience) - 10 models total across 3 providers, all free - Removed: Modal (GLM-5), Gemini, Ollama providers - Added: NVIDIA NIM provider with DeepSeek V3.2, Qwen 3.5, Qwen 3 Coder, GLM-5 - Bumped maxTokens from 8192 to 16384 for agentic output room	2026-03-01 13:22:47 +00:00
Viktor Barzin	f64c979ba5	[ci skip] tune resource limits and requests across 10 services Critical OOM fixes (add/increase limits): - netbox: add 512Mi limit (was at 98.8% of Kyverno default 256Mi) - speedtest: add 512Mi limit (was at 80.9%) - meshcentral: add 384Mi limit (was at 72.7%) - ytdlp: uncomment resources, set 512Mi limit (was at 74.6%) Over-provisioned (reduce limits): - dashy: 2Gi → 512Mi (was using 135Mi) - redis master: 2Gi → 256Mi (was using 14Mi) - redis replica: 1Gi → 256Mi (was using 12Mi) - resume printer: 2Gi → 512Mi (was using 108Mi) - resume app: 1Gi → 384Mi (was using 125Mi) - openclaw: 4Gi → 1Gi (was using 372Mi) Under-provisioned requests (increase): - authentik server: 256Mi → 512Mi request (actual ~560Mi) - authentik worker: 256Mi → 384Mi request (actual ~400Mi) New explicit resources (previously Kyverno defaults): - forgejo: add 512Mi limit, 64Mi request	2026-02-28 21:59:08 +00:00
Viktor Barzin	89a6e08245	[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability Phase 1 - Critical Security: - Netbox: move hardcoded DB/superuser passwords to variables - MeshCentral: disable public registration, add Authentik auth - Traefik: disable insecure API dashboard (api.insecure=false) - Traefik: configure forwarded headers with Cloudflare trusted IPs Phase 2 - Security Hardening: - Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.) - Add Kyverno pod security policies in audit mode (privileged, host namespaces, SYS_ADMIN, trusted registries) - Tighten rate limiting (avg=10, burst=50) - Add Authentik protection to grampsweb Phase 3 - Monitoring & Alerting: - Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale, Authentik, Loki) - Increase Loki retention from 7 to 30 days (720h) - Add predictive PV filling alert (predict_linear) - Re-enable Hackmd and Privatebin down alerts Phase 4 - Reliability: - Add resource requests/limits to Redis, DBaaS, Technitium, Headscale, Vaultwarden, Uptime Kuma - Increase Alloy DaemonSet memory to 512Mi/1Gi Phase 6 - Maintainability: - Extract duplicated tiers locals to terragrunt.hcl generate block (removed from 67 stacks) - Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114 instances across 63 files) - Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references with variables across ~35 stacks - Migrate xray raw ingress resources to ingress_factory modules	2026-02-23 22:05:28 +00:00
Viktor Barzin	ddb293b2b7	[ci skip] Reduce healthcheck frequency to 8h, fix apiserver audit duplication bug Change cluster-healthcheck CronJob from every 30min to every 8h. Replace fragile sed-based audit config in apiserver manifest with idempotent Python script that deduplicates by name/mountPath, preventing the duplicate volume entries that crashed the API server.	2026-02-22 23:18:30 +00:00
Viktor Barzin	c7c7047f1c	[ci skip] Flatten module wrappers into stack roots Remove the module "xxx" { source = "./module" } indirection layer from all 66 service stacks. Resources are now defined directly in each stack's main.tf instead of through a wrapper module. - Merge module/main.tf contents into stack main.tf - Apply variable replacements (var.tier -> local.tiers.X, renamed vars) - Fix shared module paths (one fewer ../ at each level) - Move extra files/dirs (factory/, chart_values, subdirs) to stack root - Update state files to strip module.<name>. prefix - Update CLAUDE.md to reflect flat structure Verified: terragrunt plan shows 0 add, 0 destroy across all stacks.	2026-02-22 15:13:55 +00:00
Viktor Barzin	e6420c7b36	[ci skip] Move Terraform modules into stack directories Move all 88 service modules (66 individual + 22 platform) from modules/kubernetes/<service>/ into their corresponding stack directories: - Service stacks: stacks/<service>/module/ - Platform stack: stacks/platform/modules/<service>/ This collocates module source code with its Terragrunt definition. Only shared utility modules remain in modules/kubernetes/: ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy. All cross-references to shared modules updated to use correct relative paths. Verified with terragrunt run --all -- plan: 0 adds, 0 destroys across all 68 stacks.	2026-02-22 14:38:14 +00:00
Viktor Barzin	945a5f35b0	[ci skip] Fix path.root references for git-crypt key in openclaw and drone Modules used filebase64("${path.root}/.git/git-crypt/keys/default") which breaks with Terragrunt since path.root is now stacks/<service>/ instead of repo root. Changed to accept git_crypt_key_base64 variable and resolve the path in the stack wrapper.	2026-02-22 14:01:02 +00:00
Viktor Barzin	a9ba8899be	[ci skip] Phase 3: Create 66 service stacks and migrate state Generated individual stack directories for all 66 services under stacks/. Each stack has terragrunt.hcl (depends on platform) and main.tf (thin wrapper calling existing module). Migrated all 64 active service states from root terraform.tfstate to individual state files. Root state is now empty. Verified with terragrunt plan on multiple stacks (no changes).	2026-02-22 13:56:34 +00:00

25 commits