infra

Author	SHA1	Message	Date
Viktor Barzin	307b356f06	[ci skip] fix: add mount_options to all NFS PVs (soft,timeo=30,retrans=3) Critical fix: StorageClass mountOptions only apply during dynamic provisioning. Our static PVs (created by Terraform) were missing mount_options, so all NFS mounts defaulted to hard,timeo=600 — the exact stale mount behavior we were trying to eliminate. Adds mount_options directly to the nfs_volume module PV spec and to the monitoring PVs (prometheus, loki, alertmanager). Requires re-applying all stacks to propagate to existing PVs.	2026-03-02 20:23:36 +00:00
Viktor Barzin	220aa739ce	[ci skip] migrate servarr sub-stacks + actualbudget factory NFS to CSI PV/PVC Final batch: servarr (aiostreams, listenarr, readarr, soulseek, prowlarr, qbittorrent, lidarr) and actualbudget factory. All use ../../../modules/kubernetes/nfs_volume (3 levels deep).	2026-03-02 02:04:22 +00:00
Viktor Barzin	0abae33c71	[ci skip] complete NFS CSI migration: complex stacks + platform modules Migrate remaining multi-volume stacks and all platform modules from inline NFS volumes to CSI-backed PV/PVC with nfs-truenas StorageClass (soft,timeo=30,retrans=3 mount options). Complex stacks: openclaw (4 vols), immich (8 vols), frigate (2 vols), nextcloud (2 vols + old PV replaced), rybbit (1 vol) Remaining stacks: affine, ebook2audiobook, f1-stream, osm_routing, real-estate-crawler Platform modules: monitoring (prometheus, loki, alertmanager PVs converted from native NFS to CSI), redis, dbaas, technitium, headscale, vaultwarden, uptime-kuma, mailserver, infra-maintenance	2026-03-02 01:24:07 +00:00
Viktor Barzin	79a2aa3784	[ci skip] migrate 29 services from inline NFS to CSI-backed PV/PVC Batch migration of all single-volume and simple multi-volume stacks. All services verified healthy after migration. Uses nfs-truenas StorageClass with soft,timeo=30,retrans=3 mount options to eliminate stale NFS mount hangs. Services: atuin, audiobookshelf, calibre, changedetection, diun, excalidraw, forgejo, freshrss, grampsweb, hackmd, health, isponsorblocktv, matrix, meshcentral, n8n, navidrome, ntfy, ollama, onlyoffice, owntracks, paperless-ngx, poison-fountain, send, stirling-pdf, tandoor, wealthfolio, whisper, woodpecker, ytdlp	2026-03-02 00:15:39 +00:00
Viktor Barzin	853a96cb57	[ci skip] migrate privatebin, resume, speedtest NFS volumes to CSI PV/PVC Pilot migration: replace inline nfs {} volumes with CSI-backed PV/PVC using nfs-truenas StorageClass (soft,timeo=30,retrans=3 mount options).	2026-03-01 23:42:23 +00:00
Viktor Barzin	c702fd2565	[ci skip] add NFS CSI driver + nfs_volume shared module - Deploy csi-driver-nfs Helm chart as platform module (nfs-csi) - Create nfs-truenas StorageClass with soft,timeo=30,retrans=3 mount options - Add shared nfs_volume module for PV/PVC boilerplate (modules/kubernetes/nfs_volume/)	2026-03-01 23:38:58 +00:00
Viktor Barzin	de598996f1	[ci skip] remove low-traffic pull-through caches (registry.k8s.io, quay.io, reg.kyverno.io) Pull-through cache at 10.0.20.10 was serving corrupted/truncated images for low-traffic registries, breaking VPA certgen (ImagePullBackOff) and previously causing Kyverno image pull failures. Kept: docker.io (port 5000) and ghcr.io (port 5010) — high traffic, Docker Hub rate limits make caching essential. Removed from cloud-init template and all 5 live nodes: - registry.k8s.io (port 5030) — 14 system images, very low churn - quay.io (port 5020) — 11 images - reg.kyverno.io (port 5040) — 5 images The registry containers on the 10.0.20.10 VM still run but nodes no longer route to them. They can be stopped/removed from the VM later.	2026-03-01 21:46:41 +00:00
Viktor Barzin	ab7c655776	[ci skip] frigate: add liveness/startup probes for GPU recovery When the GPU becomes unavailable (overloaded, CUDA context corruption), Frigate silently falls back to CPU detection burning 4 cores with no automatic recovery. Add liveness probe checking nvidia-smi + API health every 60s (3 failures = restart), and startup probe allowing up to 5min for TensorRT model loading.	2026-03-01 20:36:49 +00:00
Viktor Barzin	858377e257	[ci skip] f1-stream: add Discord token and channel env vars	2026-03-01 20:17:38 +00:00
Viktor Barzin	28dd218590	[ci skip] rybbit: add CronJob to truncate ClickHouse system logs every 6h ClickHouse system log tables (metric_log, trace_log, text_log, etc.) were growing unboundedly on NFS (~10GiB, 1.3B rows) with no TTL, causing continuous background merge operations that burned ~920m CPU. Mounting custom config.d XML files crashes ClickHouse (exit code 36) so instead add a CronJob that truncates the tables via the HTTP API every 6 hours. Also removed the broken ConfigMap/volume mount that was causing crashes.	2026-03-01 19:41:39 +00:00
Viktor Barzin	9e4fb23b10	[ci skip] right-size all pod resources based on VPA + live metrics audit Full cluster resource audit: cross-referenced Goldilocks VPA recommendations, live kubectl top metrics, and Terraform definitions for 100+ containers. Critical fixes: - dashy: CPU throttled at 98% (490m/500m) → 2 CPU limit - stirling-pdf: CPU throttled at 99.7% (299m/300m) → 2 CPU limit - traefik auth-proxy/bot-block-proxy: mem limit 32Mi → 128Mi Added explicit resources to ~40 containers that had none: - audiobookshelf, changedetection, cyberchef, dawarich, diun, echo, excalidraw, freshrss, hackmd, isponsorblocktv, linkwarden, n8n, navidrome, ntfy, owntracks, privatebin, send, shadowsocks, tandoor, tor-proxy, wealthfolio, networking-toolbox, rybbit, mailserver, cloudflared, pgadmin, phpmyadmin, crowdsec-web, xray, wireguard, k8s-portal, tuya-bridge, ollama-ui, whisper, piper, immich-server, immich-postgresql, osrm-foot GPU containers: added CPU/mem alongside GPU limits: - ollama: removed CPU/mem limits (models vary in size), keep GPU only - frigate: req 500m/2Gi, lim 4/8Gi + GPU - immich-ml: req 100m/1Gi, lim 2/4Gi + GPU Right-sized ~25 over-provisioned containers: - kms-web-page: 500m/512Mi → 50m/64Mi (was using 0m/10Mi) - onlyoffice: CPU 8 → 2 (VPA upper 45m) - realestate-crawler-api: CPU 2000m → 250m - blog/travel-blog/webhook-handler: 500m → 100m - coturn/health/plotting-book: reduced to match actual usage Conservative methodology: limits = max(VPA upper * 2, live usage * 2)	2026-03-01 19:18:50 +00:00
Viktor Barzin	ccf0b2232f	[ci skip] switch VPA to off mode globally, fix Ollama/MySQL resources - Kyverno policy: VPA mode set to 'off' for all namespaces (was 'initial' for non-core). Terraform is now sole authority for container resources. Goldilocks provides recommendations only. - Ollama: add explicit CPU/memory resources (500m/4Gi req, 4/12Gi limit) alongside GPU allocation. Fixes OOMKill from VPA scaling down resources. - MySQL InnoDB Cluster: bump memory limit from 2Gi to 3Gi. - Remove redundant per-namespace VPA opt-out labels from onlyoffice, openclaw, trading-bot (now handled globally by Kyverno policy).	2026-03-01 19:03:49 +00:00
Viktor Barzin	35822e9207	[ci skip] onlyoffice: revert font cache NFS mounts, rebuild on startup NFS font caching caused issues. Reverted to default GENERATE_FONTS=true with 8 CPU burst limit for fast regeneration on startup.	2026-03-01 18:07:37 +00:00
Viktor Barzin	678f92ffb4	[ci skip] onlyoffice: cache fonts/themes on NFS for fast restarts Persist font cache (159MB) and theme images (10MB) to NFS volume. Set GENERATE_FONTS=false to skip regeneration on startup since cache is warm. Startup time: ~3 min -> 5 seconds.	2026-03-01 18:02:38 +00:00
Viktor Barzin	c91eb5c071	[ci skip] onlyoffice: bump CPU limit to 8, add custom LimitRange/Quota Startup was throttled by allthemesgen and font generation hitting 2 CPU ceiling. Bumped to 8 CPU burst limit with custom LimitRange (max 8 CPU) and custom ResourceQuota. Disabled VPA and goldilocks opt-out labels.	2026-03-01 17:58:26 +00:00
Viktor Barzin	12552b8ea4	[ci skip] onlyoffice: disable VPA to prevent CPU resource override Goldilocks VPA in Initial mode was overriding the explicit 2 CPU limit down to 700m, throttling the document server. Set vpa-update-mode=off.	2026-03-01 17:55:06 +00:00
Viktor Barzin	4558688baf	[ci skip] nextcloud: bump CPU limit to 16, add custom ResourceQuota CPU was pegged at 2000m/2000m (100% throttled). Add custom-quota opt-out label and ResourceQuota allowing 32 CPU limits to accommodate the 16 CPU container limit plus sidecar defaults.	2026-03-01 17:41:18 +00:00
Viktor Barzin	80dfc58fea	[ci skip] openclaw: fix workspace permissions — chown to node user Init container clones repo as root but main container runs as node (UID 1000). Added chown -R 1000:1000 /workspace/infra so OpenClaw can write to workspace.	2026-03-01 17:20:36 +00:00
Viktor Barzin	f2678d3494	[ci skip] fix MySQL cluster RBAC, Kyverno policy bugs, Nextcloud memory - dbaas: add mysql-sidecar-extra ClusterRole for namespaces/CRD list/watch needed by kopf framework in sidecar containers - kyverno: restrict inject-priority-class-from-tier to CREATE operations only (was blocking pod patches with immutable spec error) - kyverno: add resource-governance/custom-limitrange label opt-out to LimitRange generation policy (mirrors existing custom-quota) - nextcloud: bump memory limit 4Gi -> 6Gi, add custom LimitRange with 8Gi max, opt out of Kyverno-managed LimitRange	2026-03-01 17:16:03 +00:00
Viktor Barzin	f491073cca	[ci skip] redis: pin service to master pod to fix read-only errors The Bitnami Redis Sentinel chart's service selects all nodes (master + replicas). Clients using plain redis:// URLs (paperless-ngx, etc.) randomly hit read-only replicas, causing write failures. Pin the service to redis-node-0 (master).	2026-03-01 17:13:25 +00:00
Viktor Barzin	f203e7bd2c	[ci skip] openclaw: set workspace + enable elevated + native commands - Set workspace to /workspace/infra (was defaulting to ~/.openclaw/workspace) - Enable tools.elevated for unrestricted access - Enable commands.native and commands.nativeSkills - All tools, commands, and skills now fully accessible	2026-03-01 17:12:03 +00:00
Viktor Barzin	b2ac69e12b	[ci skip] openclaw: disable sandbox mode for unrestricted execution - Set agents.defaults.sandbox.mode = off - Combined with exec.host=gateway and exec.security=full, OpenClaw can now run any command on the container host	2026-03-01 16:51:35 +00:00
Viktor Barzin	99881b28e3	[ci skip] openclaw: fix exec host — use gateway instead of node host=node requires a companion app (not available in container). host=gateway runs commands directly on the gateway process host.	2026-03-01 16:47:14 +00:00
Viktor Barzin	2994ab4f29	[ci skip] calibre: increase resources to 1 CPU / 1Gi to fix sluggish web UI Was getting LimitRange defaults (250m CPU, 256Mi) causing throttling.	2026-03-01 16:42:35 +00:00
Viktor Barzin	6efc1e56c0	[ci skip] openclaw: fix exec config — use host=node, security=full Valid options: host=sandbox|gateway|node, security=deny|allowlist|full. Using node (run on container host) with full (no command restrictions).	2026-03-01 16:42:22 +00:00
Viktor Barzin	c83f3aab90	[ci skip] openclaw: disable sandbox, run commands on container host - exec.host: sandbox → local (run directly on container, no Docker sandbox) - exec.security: full → off (no restrictions on command execution)	2026-03-01 16:18:53 +00:00
Viktor Barzin	6af163f66f	[ci skip] shlink: increase memory limit to 512Mi to prevent OOMKill Shlink with GeoLite2 DB requires more than the 256Mi LimitRange default.	2026-03-01 16:13:50 +00:00
Viktor Barzin	b10d43b7a7	[ci skip] openclaw: persist home directory on NFS - Switch openclaw-home from emptyDir to NFS (/mnt/main/openclaw/home) - Persists SOUL.md, IDENTITY.md, sessions, memory DB, telegram state, device identity, and all runtime files across pod restarts - Init container still refreshes openclaw.json and kubeconfig on each start	2026-03-01 16:12:07 +00:00
Viktor Barzin	0f7e7e5969	[ci skip] openclaw: remove all tool/command restrictions - Set tools.deny = [] (was blocking sessions, subagents, browser) - All tools now available: sessions, subagents, browser, etc.	2026-03-01 15:58:12 +00:00
Viktor Barzin	f031a6bcf6	[ci skip] openclaw: add modelrelay sidecar as fallback model router - Deploy modelrelay as sidecar container (auto-routes to fastest free model) - Configured with NVIDIA NIM + OpenRouter API keys - Primary: Mistral Large 3 (NIM), Fallback 1: Nemotron Ultra (NIM), Fallback 2: modelrelay/auto-fastest (80+ free models) - Modelrelay web UI available at pod:7352	2026-03-01 15:57:31 +00:00
Viktor Barzin	207164050c	[ci skip] openclaw: fix Telegram, update to v2026.2.26, fix startup issues - Update OpenClaw from v2026.2.9 to v2026.2.26 (fixes Telegram channel) - Add gateway.mode=local + wizard block (required for channel startup) - Add dangerouslyAllowHostHeaderOriginFallback (v2026.2.26 requirement) - Run doctor --fix at container startup to auto-enable Telegram - Create required dirs (canvas, devices, cron, sessions, credentials) - Fix permissions: chown -R 1000:1000 for node user - Telegram: DM allowlist, user 8281953845 only	2026-03-01 15:47:54 +00:00
Viktor Barzin	da943c71ac	[ci skip] dbaas: add custom resource quota (12Gi req mem) to support 3-node MySQL cluster	2026-03-01 15:47:11 +00:00
Viktor Barzin	5676ee746e	[ci skip] f1-stream: use latest tag, CI manages image via kubectl set image	2026-03-01 15:15:14 +00:00
Viktor Barzin	6ac6dffe03	[ci skip] f1-stream: bump image to v5.2.0	2026-03-01 15:13:38 +00:00
Viktor Barzin	84f76a15c2	[ci skip] add PoisonFountainDown and ForwardAuthFallbackActive alerts with inhibition	2026-03-01 15:05:57 +00:00
Viktor Barzin	0da6f90ad2	[ci skip] openclaw: fix slow startup — proper resources + readiness probe + VPA off - Set explicit CPU (2 cores) and memory (2Gi) limits Root cause: Goldilocks VPA was throttling to 300m CPU, causing gateway to take 5+ minutes to start, and 1Gi memory caused OOM crashes - Add TCP readiness probe on port 18789 to prevent 502 Bad Gateway during startup (Traefik was routing before gateway was listening) - Disable Goldilocks VPA via namespace label (vpa-update-mode: off)	2026-03-01 14:44:22 +00:00
Viktor Barzin	7ff3c61bd7	[ci skip] add retry middleware (2 attempts, 100ms) to default ingress chain	2026-03-01 14:35:53 +00:00
Viktor Barzin	51b8081594	f1-stream: add real F1 stream extractors and iframe player support Add three new extractors (Streamed.pk, DaddyLive, Aceztrims) for live F1 streams. Extend ExtractedStream model with stream_type/embed_url fields, skip health checks for embed streams, fix broken Akamai demo stream, add variant playlist validation, and add iframe player support in the frontend for embed-type streams.	2026-03-01 14:35:19 +00:00
Viktor Barzin	78c0956ab5	[ci skip] add Authentik PDB (minAvailable=2)	2026-03-01 14:24:47 +00:00
Viktor Barzin	0639719e5c	[ci skip] add Traefik topology spread, PDB (minAvailable=2), and 30s response timeout	2026-03-01 14:18:54 +00:00
Viktor Barzin	f37bcf4717	[ci skip] add auth resilience proxy: basicAuth fallback when Authentik is down	2026-03-01 14:13:05 +00:00
Viktor Barzin	cd4c5ead1e	[ci skip] add bot-block resilience proxy: fail-open when Poison Fountain is down	2026-03-01 14:05:41 +00:00
Viktor Barzin	e8ff760aff	[ci skip] openclaw: cache tools on NFS for fast restarts - Switch /tools volume from emptyDir to NFS (/mnt/main/openclaw/tools) - Skip download of kubectl, terraform, terragrunt, pip packages if cached - Startup time: ~2.5min → ~38s on subsequent restarts	2026-03-01 13:59:07 +00:00
Viktor Barzin	fd0c85c9cc	[ci skip] bump poison-fountain tier from aux to cluster (critical path for all ingress)	2026-03-01 13:57:54 +00:00
Viktor Barzin	e728f4c106	[ci skip] openclaw: add Telegram channel + install terragrunt in init container - Add Telegram bot integration (DM allowlist, user 8281953845 only) - Install terragrunt v0.99.4 in init container alongside terraform - Remove terraform init from init (terragrunt handles this per-stack) - Add openclaw_telegram_bot_token variable	2026-03-01 13:44:58 +00:00
Viktor Barzin	014f6cad5a	[ci skip] openclaw: switch to free agentic models via NVIDIA NIM, OpenRouter, Llama API - Primary: Mistral Large 3 (675B) on NIM - always warm, excellent tool calling - Fallback 1: Nemotron Ultra 253B on NIM - Fallback 2: Llama 4 Maverick on Llama API (different provider for resilience) - 10 models total across 3 providers, all free - Removed: Modal (GLM-5), Gemini, Ollama providers - Added: NVIDIA NIM provider with DeepSeek V3.2, Qwen 3.5, Qwen 3 Coder, GLM-5 - Bumped maxTokens from 8192 to 16384 for agentic output room	2026-03-01 13:22:47 +00:00
Viktor Barzin	c07870d068	[ci skip] fix MySQL service: point at mysqld pods, pin to healthy primary The InnoDB Cluster Router (mysqlrouter) doesn't deploy when the cluster lacks quorum. Changed service selector from mysqlrouter to mysqld with publishNotReadyAddresses=true to bypass the operator's readiness gate. Pinned to mysql-cluster-1 (healthy primary) until full cluster recovers.	2026-03-01 12:16:28 +00:00
Viktor Barzin	1101242036	[ci skip] MySQL: deploy InnoDB Cluster via Oracle MySQL Operator - MySQL Operator v2.2.7 in mysql-operator namespace (on control-plane) - InnoDB Cluster: 3 MySQL 9.2.0 servers + 1 Router, local-path storage - Group Replication with automatic failover via MySQL Router - Compatibility service: mysql.dbaas:3306 → Router port 6446 - Images from container-registry.oracle.com (not Docker Hub) - Init containers are slow (~20 min) due to mysqlsh plugin loading - Data restore from mysqldump pending after cluster is ONLINE	2026-03-01 03:00:21 +00:00
Viktor Barzin	6139052104	[ci skip] add graceful degradation to CrowdSec bouncer middleware P0: Set updateMaxFailure=-1 (fail-open) Previously defaulted to 0 which blocked ALL traffic on first LAPI failure. Now serves from cached decisions when LAPI is unreachable. P1: Enable Redis cache for CrowdSec decisions Decisions are now shared across all 3 Traefik replicas and survive pod restarts. redisCacheUnreachableBlock=false prevents Redis from becoming another SPOF. P1: Add clientTrustedIPs for internal cluster traffic Node CIDR (10.0.20.0/24) and pod CIDR (10.10.0.0/16) bypass CrowdSec entirely, preventing internal cascade failures.	2026-03-01 02:36:53 +00:00
Viktor Barzin	fcb7d6780e	[ci skip] fix nextcloud: increase memory to 4Gi, extend startup probe - Memory limit: 2Gi → 4Gi (VPA target is 2.8Gi, was OOMKilling) - Memory request: 512Mi → 1Gi - Startup probe: 30s delay, 10s timeout, 60 failures (10min total) Previous 5min window was too short for NFS-backed SQLite init	2026-02-28 23:32:28 +00:00


140 commits
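
Commit 9e4fb23b10 states its sizing rule as "limits = max(VPA upper * 2, live usage * 2)". The following is a minimal Python sketch of that arithmetic only, not the tooling used in the audit. The helper names `parse_cpu` and `conservative_limit` are hypothetical, and the live-usage figure in the example is made up for illustration. The commit does not say whether a floor or rounding was applied on top of the rule, so none is modelled here.

```python
def parse_cpu(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as '250m' or '2' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    # Whole or fractional cores, e.g. '2' or '0.5'
    return round(float(quantity) * 1000)


def conservative_limit(vpa_upper_m: int, live_usage_m: int, factor: int = 2) -> int:
    """Apply the rule from the commit message:
    limit = max(VPA upper * factor, live usage * factor), in millicores.
    """
    return max(vpa_upper_m * factor, live_usage_m * factor)


if __name__ == "__main__":
    # VPA upper bound of 45m is quoted in the commit for onlyoffice;
    # the 30m live-usage value is a hypothetical input.
    limit = conservative_limit(parse_cpu("45m"), parse_cpu("30m"))
    print(f"{limit}m")
```

Because both inputs are doubled, the rule reduces to twice the larger of the two observations; the limits actually committed (for example 2 CPU for onlyoffice) were evidently set with additional judgment beyond this formula.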