The Bitnami Redis Sentinel chart's service selects all nodes (master + replicas).
Clients using plain redis:// URLs (paperless-ngx, etc.) randomly hit read-only
replicas, causing write failures. Pin the service to redis-node-0 (master).
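A minimal sketch of the pinned Service, assuming the chart's default
app.kubernetes.io labels and the statefulset.kubernetes.io/pod-name label
that the StatefulSet controller sets on every pod; the Service name and
namespace are illustrative:

    apiVersion: v1
    kind: Service
    metadata:
      name: redis-master        # illustrative
      namespace: redis          # illustrative
    spec:
      selector:
        app.kubernetes.io/name: redis
        statefulset.kubernetes.io/pod-name: redis-node-0   # pin to the master
      ports:
        - name: tcp-redis
          port: 6379
          targetPort: 6379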
- Set workspace to /workspace/infra (was defaulting to ~/.openclaw/workspace)
- Enable tools.elevated for unrestricted access
- Enable commands.native and commands.nativeSkills
- All tools, commands, and skills now fully accessible
- Set agents.defaults.sandbox.mode = off
- Combined with exec.host=gateway and exec.security=full,
  OpenClaw can now run any command on the container host (settings sketched below)
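The settings above, rendered as YAML for readability (the actual file is
openclaw.json; the nesting is inferred from the dotted key names in this
change):

    workspace: /workspace/infra
    tools:
      elevated: true
    commands:
      native: true
      nativeSkills: true
    agents:
      defaults:
        sandbox:
          mode: "off"        # quoted: bare off parses as a YAML boolean
    exec:
      host: gateway
      security: full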
- Switch openclaw-home from emptyDir to NFS (/mnt/main/openclaw/home; sketch below)
- Persists SOUL.md, IDENTITY.md, sessions, memory DB, telegram state,
device identity, and all runtime files across pod restarts
- Init container still refreshes openclaw.json and kubeconfig on each start
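Volume swap sketch; the NFS server address is a placeholder (only the export
path comes from this change):

    volumes:
      - name: openclaw-home
        # was: emptyDir: {}
        nfs:
          server: 10.0.20.5            # placeholder
          path: /mnt/main/openclaw/home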
- Deploy modelrelay as a sidecar container (auto-routes to the fastest free model; sketch below)
- Configured with NVIDIA NIM + OpenRouter API keys
- Primary: Mistral Large 3 (NIM), Fallback 1: Nemotron Ultra (NIM),
Fallback 2: modelrelay/auto-fastest (80+ free models)
- Modelrelay web UI available at pod:7352
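Sidecar sketch; the image reference, env var names, and secret names are
placeholders:

    containers:
      - name: modelrelay
        image: modelrelay:latest       # placeholder
        ports:
          - containerPort: 7352        # web UI
        env:
          - name: NIM_API_KEY          # placeholder
            valueFrom:
              secretKeyRef: {name: modelrelay-keys, key: nim}
          - name: OPENROUTER_API_KEY   # placeholder
            valueFrom:
              secretKeyRef: {name: modelrelay-keys, key: openrouter}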
- Set explicit CPU (2 cores) and memory (2Gi) limits
  Root cause: Goldilocks VPA was throttling the gateway to 300m CPU, so it
  took 5+ minutes to start, and the 1Gi memory limit caused OOM crashes
- Add TCP readiness probe on port 18789 to prevent 502 Bad Gateway
during startup (Traefik was routing before gateway was listening)
- Disable Goldilocks VPA via namespace label (vpa-update-mode: off; combined sketch below)
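Combined sketch of the three changes; the container context and namespace
name are illustrative, and the label uses the fairwinds.com prefix Goldilocks
documents for namespace-level overrides:

    # container spec
    resources:
      limits:
        cpu: "2"
        memory: 2Gi
    readinessProbe:
      tcpSocket:
        port: 18789            # only route traffic once the gateway listens
      periodSeconds: 5
    ---
    apiVersion: v1
    kind: Namespace
    metadata:
      name: openclaw           # illustrative
      labels:
        goldilocks.fairwinds.com/vpa-update-mode: "off"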
Add three new extractors (Streamed.pk, DaddyLive, Aceztrims) for live
F1 streams. Extend ExtractedStream model with stream_type/embed_url
fields, skip health checks for embed streams, fix broken Akamai demo
stream, add variant playlist validation, and add iframe player support
in the frontend for embed-type streams.
- Primary: Mistral Large 3 (675B) on NIM - always warm, excellent tool calling
- Fallback 1: Nemotron Ultra 253B on NIM
- Fallback 2: Llama 4 Maverick on Llama API (different provider for resilience)
- 10 models total across 3 providers, all free
- Removed: Modal (GLM-5), Gemini, Ollama providers
- Added: NVIDIA NIM provider with DeepSeek V3.2, Qwen 3.5, Qwen 3 Coder, GLM-5
- Bumped maxTokens from 8192 to 16384 for agentic output room (config sketch below)
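Hypothetical shape of the provider config; the key names are illustrative,
and only the model IDs, providers, and maxTokens value come from this change:

    maxTokens: 16384
    models:
      - provider: nim
        id: mistral-large-3        # primary
      - provider: nim
        id: nemotron-ultra-253b    # fallback 1
      - provider: llama-api
        id: llama-4-maverick       # fallback 2, different provider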
The InnoDB Cluster Router (mysqlrouter) doesn't deploy when the cluster
lacks quorum. Changed service selector from mysqlrouter to mysqld with
publishNotReadyAddresses=true to bypass the operator's readiness gate.
Pinned to mysql-cluster-1 (healthy primary) until full cluster recovers.
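Sketch of the workaround Service, assuming the statefulset.kubernetes.io/pod-name
label that StatefulSets apply; the name matches the compatibility service
noted below:

    apiVersion: v1
    kind: Service
    metadata:
      name: mysql
      namespace: dbaas
    spec:
      publishNotReadyAddresses: true     # bypass the readiness gate
      selector:
        statefulset.kubernetes.io/pod-name: mysql-cluster-1   # healthy primary
      ports:
        - port: 3306
          targetPort: 3306               # mysqld directly, not the Router's 6446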
- MySQL Operator v2.2.7 in mysql-operator namespace (on control-plane)
- InnoDB Cluster: 3 MySQL 9.2.0 servers + 1 Router, local-path storage
- Group Replication with automatic failover via MySQL Router
- Compatibility service: mysql.dbaas:3306 → Router port 6446 (sketch below)
- Images from container-registry.oracle.com (not Docker Hub)
- Init containers are slow (~20 min) due to mysqlsh plugin loading
- Data restore from mysqldump pending after cluster is ONLINE
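Intended steady state of the compatibility service once the Router deploys;
the router selector label is an assumption about the operator's conventions:

    apiVersion: v1
    kind: Service
    metadata:
      name: mysql
      namespace: dbaas
    spec:
      selector:
        component: mysqlrouter     # assumed operator label
      ports:
        - port: 3306               # what clients dial
          targetPort: 6446         # MySQL Router read-write port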
P0: Set updateMaxFailure=-1 (fail-open)
  Previously defaulted to 0, which blocked ALL traffic on the first LAPI
  failure. Now serves from cached decisions when the LAPI is unreachable.
P1: Enable Redis cache for CrowdSec decisions
Decisions are now shared across all 3 Traefik replicas and survive
pod restarts. redisCacheUnreachableBlock=false prevents Redis from
becoming another SPOF.
P1: Add clientTrustedIPs for internal cluster traffic
Node CIDR (10.0.20.0/24) and pod CIDR (10.10.0.0/16) bypass
CrowdSec entirely, preventing internal cascade failures.
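Roughly how the three settings land in the bouncer middleware; the plugin
registration name, namespace, and Redis endpoint are assumptions:

    apiVersion: traefik.io/v1alpha1
    kind: Middleware
    metadata:
      name: crowdsec-bouncer
      namespace: traefik                      # assumption
    spec:
      plugin:
        crowdsec-bouncer-traefik-plugin:      # assumed registration name
          updateMaxFailure: "-1"              # P0: fail open
          redisCacheEnabled: "true"           # P1: shared decision cache
          redisCacheHost: "redis:6379"        # assumption
          redisCacheUnreachableBlock: "false" # Redis outage must not block
          clientTrustedIPs:                   # P1: internal bypass
            - 10.0.20.0/24                    # node CIDR
            - 10.10.0.0/16                    # pod CIDR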
Bitnami MySQL images can't be pulled (not found on Docker Hub, likely
moved to a different registry). Reverted MySQL to single instance on
NFS as the known-working state. MySQL replication to be revisited
once image availability is resolved.
PostgreSQL and Redis remain on local disk with replication.
VPA Auto mode modifies Deployment specs at runtime, causing conflicts
with Terraform on every apply (drift → reset → VPA evict loop). Initial
mode only mutates Pod resource requests at creation time via the
admission webhook, leaving the Deployment spec unchanged. This means
terraform plan shows no drift while pods still get VPA-optimized
resources on every restart (manifest sketch after the list).
- 171 VPAs switched from Auto to Initial
- 20 VPAs remain Off (tier-0 critical services)
- Goldilocks dashboard continues to show recommendations
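A minimal Initial-mode VPA using the stock autoscaling.k8s.io API; the
target is illustrative:

    apiVersion: autoscaling.k8s.io/v1
    kind: VerticalPodAutoscaler
    metadata:
      name: hackmd               # illustrative
    spec:
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: hackmd
      updatePolicy:
        updateMode: Initial      # set requests only at pod creation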
- Add local-path PVC for MySQL data (10Gi requested, actual usage ~27GB;
  local-path doesn't enforce the requested size)
- Init container seeds data from NFS on first run (cp -a; sketch below)
- NFS volume kept as read-only seed source in init container
- MySQL 9.2.0 running on local disk with proper fsync
- All dependent services verified running:
hackmd, speedtest, onlyoffice, paperless-ngx
- mysqldump backup taken before migration
- Existing daily mysqldump CronJob unchanged (writes to NFS)
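Seeding sketch; the mount paths and the datadir existence check are
illustrative:

    initContainers:
      - name: seed-from-nfs
        image: busybox:1.36
        command:
          - sh
          - -c
          - |
            # copy once: skip if the PVC already holds a datadir
            if [ ! -d /var/lib/mysql/mysql ]; then
              cp -a /nfs-seed/. /var/lib/mysql/
            fi
        volumeMounts:
          - name: mysql-data         # local-path PVC
            mountPath: /var/lib/mysql
          - name: mysql-nfs          # read-only seed source
            mountPath: /nfs-seed
            readOnly: true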
- Switch from redis/redis-stack:latest to redis:7-alpine
(modules were completely unused — zero module commands in stats)
- Move data from NFS (/mnt/main/redis) to local-path PVC
(RDB saves: 39s on NFS → <1s on local disk)
- Start fresh (old RDB had redis-stack module data incompatible with plain redis;
all Redis data is transient — queues and caches rebuild automatically)
- Add hourly redis-backup CronJob: redis-cli --rdb to NFS for the backup pipeline (sketch below)
- Remove RedisInsight UI ingress (port 8001, only in redis-stack)
- Add redis-backup to NFS exports
- 110 clients reconnected immediately after switchover
- Memory savings: ~100MB from dropping unused modules
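CronJob sketch; the Redis host, NFS server, and export path are placeholders:

    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: redis-backup
    spec:
      schedule: "0 * * * *"            # hourly
      jobTemplate:
        spec:
          template:
            spec:
              restartPolicy: OnFailure
              containers:
                - name: backup
                  image: redis:7-alpine
                  command:
                    - sh
                    - -c
                    - redis-cli -h redis --rdb /backup/dump.rdb   # host is a placeholder
                  volumeMounts:
                    - name: backup-nfs
                      mountPath: /backup
              volumes:
                - name: backup-nfs
                  nfs:
                    server: 10.0.20.5              # placeholder
                    path: /mnt/main/redis-backup   # placeholder export path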