- Primary: Mistral Large 3 (675B) on NIM - always warm, excellent tool calling
- Fallback 1: Nemotron Ultra 253B on NIM
- Fallback 2: Llama 4 Maverick on Llama API (different provider for resilience)
- 10 models total across 3 providers, all free
- Removed: Modal (GLM-5), Gemini, Ollama providers
- Added: NVIDIA NIM provider with DeepSeek V3.2, Qwen 3.5, Qwen 3 Coder, GLM-5
- Bumped maxTokens from 8192 to 16384 for agentic output room
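The chain above could be expressed as a config sketch; the schema and field names here are illustrative assumptions, not the actual router config:

```yaml
# Hypothetical sketch of the provider chain described above —
# field names are illustrative, not the real config schema.
models:
  primary:
    provider: nim
    model: mistral-large-3        # always warm, strong tool calling
  fallbacks:
    - provider: nim
      model: nemotron-ultra-253b
    - provider: llama-api         # different provider for resilience
      model: llama-4-maverick
defaults:
  maxTokens: 16384                # raised from 8192 for agentic output room
```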
The InnoDB Cluster Router (mysqlrouter) doesn't deploy when the cluster
lacks quorum. Changed service selector from mysqlrouter to mysqld with
publishNotReadyAddresses=true to bypass the operator's readiness gate.
Pinned to mysql-cluster-1 (healthy primary) until full cluster recovers.
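A minimal sketch of that Service change, assuming the operator's pod label conventions (the `component: mysqld` label and namespace are assumptions to verify against the actual cluster):

```yaml
# Workaround Service: select the mysqld server pods directly and publish
# not-ready addresses so traffic flows while the cluster lacks quorum.
apiVersion: v1
kind: Service
metadata:
  name: mysql
  namespace: dbaas                 # assumption
spec:
  publishNotReadyAddresses: true   # bypass the operator's readiness gate
  selector:
    component: mysqld              # was: mysqlrouter
  ports:
    - port: 3306
      targetPort: 3306
```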
- MySQL Operator v2.2.7 in mysql-operator namespace (on control-plane)
- InnoDB Cluster: 3 MySQL 9.2.0 servers + 1 Router, local-path storage
- Group Replication with automatic failover via MySQL Router
- Compatibility service: mysql.dbaas:3306 → Router port 6446
- Images from container-registry.oracle.com (not Docker Hub)
- Init containers are slow (~20 min) due to mysqlsh plugin loading
- Data restore from mysqldump pending after cluster is ONLINE
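The topology above maps onto an InnoDBCluster resource roughly like this; the secret name is an assumption and other required fields are trimmed for brevity:

```yaml
# Sketch of the 3-server + 1-router topology for MySQL Operator v2.x.
apiVersion: mysql.oracle.com/v2
kind: InnoDBCluster
metadata:
  name: mysql-cluster
  namespace: dbaas
spec:
  instances: 3
  version: "9.2.0"
  secretName: mysql-root-creds     # assumption: root credential secret
  tlsUseSelfSigned: true
  router:
    instances: 1
  datadirVolumeClaimTemplate:
    storageClassName: local-path
    accessModes: [ReadWriteOnce]
    resources:
      requests:
        storage: 10Gi              # assumption
```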
P0: Set updateMaxFailure=-1 (fail-open)
Previously this defaulted to 0, which blocked ALL traffic after the first
LAPI failure. Traefik now serves from cached decisions when the LAPI is
unreachable.
P1: Enable Redis cache for CrowdSec decisions
Decisions are now shared across all 3 Traefik replicas and survive
pod restarts. redisCacheUnreachableBlock=false prevents Redis from
becoming another SPOF.
P1: Add clientTrustedIPs for internal cluster traffic
Node CIDR (10.0.20.0/24) and pod CIDR (10.10.0.0/16) bypass
CrowdSec entirely, preventing internal cascade failures.
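Taken together, the three changes land in the bouncer middleware config roughly as below; option names follow the changes listed here but should be double-checked against the plugin version in use:

```yaml
# Sketch of the CrowdSec bouncer middleware settings (Traefik dynamic config).
http:
  middlewares:
    crowdsec:
      plugin:
        bouncer:
          updateMaxFailure: -1               # P0: fail open, serve cached decisions
          redisCacheEnabled: true            # P1: share decisions across replicas
          redisCacheUnreachableBlock: false  # Redis outage must not become a SPOF
          clientTrustedIPs:                  # P1: internal traffic bypasses CrowdSec
            - 10.0.20.0/24                   # node CIDR
            - 10.10.0.0/16                   # pod CIDR
```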
Bitnami MySQL images can't be pulled (not found on Docker Hub, likely
moved to a different registry). Reverted MySQL to single instance on
NFS as the known-working state. MySQL replication to be revisited
once image availability is resolved.
PostgreSQL and Redis remain on local disk with replication.
VPA Auto mode modifies Deployment specs at runtime, causing conflicts
with Terraform on every apply (drift → reset → VPA evict loop).
Initial mode only mutates Pod resource requests at creation time via
the admission webhook, leaving the Deployment spec unchanged. This
means terraform plan shows no drift while pods still get VPA-optimized
resources on every restart.
- 171 VPAs switched from Auto to Initial
- 20 VPAs remain Off (tier-0 critical services)
- Goldilocks dashboard continues to show recommendations
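The switch is a one-field change on each VPA; a representative manifest (the target name is illustrative):

```yaml
# One of the switched VPAs: only updatePolicy.updateMode changes.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-app          # illustrative target
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  updatePolicy:
    updateMode: "Initial"    # was "Auto"; apply recommendations only at pod creation
```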
- Add local-path PVC for MySQL data (10Gi requested, actual usage ~27GB;
  the local-path provisioner does not enforce capacity limits)
- Init container seeds data from NFS on first run (cp -a)
- NFS volume kept as read-only seed source in init container
- MySQL 9.2.0 running on local disk with proper fsync
- All dependent services verified running:
hackmd, speedtest, onlyoffice, paperless-ngx
- mysqldump backup taken before migration
- Existing daily mysqldump CronJob unchanged (writes to NFS)
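The seed-on-first-run step can be sketched as an init container; the volume names and mount paths are assumptions:

```yaml
# Init container: copy data from the read-only NFS seed into the
# local-path PVC, but only when the PVC is still empty.
initContainers:
  - name: seed-from-nfs
    image: busybox:1.36
    command:
      - sh
      - -c
      - |
        # First run only: skip the copy if the local volume already has data
        if [ -z "$(ls -A /var/lib/mysql)" ]; then
          cp -a /nfs-seed/. /var/lib/mysql/
        fi
    volumeMounts:
      - name: mysql-data        # local-path PVC
        mountPath: /var/lib/mysql
      - name: nfs-seed          # NFS volume, kept read-only
        mountPath: /nfs-seed
        readOnly: true
```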
- Switch from redis/redis-stack:latest to redis:7-alpine
(modules were completely unused — zero module commands in stats)
- Move data from NFS (/mnt/main/redis) to local-path PVC
(RDB saves: 39s on NFS → <1s on local disk)
- Start fresh (old RDB had redis-stack module data incompatible with plain redis;
all Redis data is transient — queues and caches rebuild automatically)
- Add hourly redis-backup CronJob: redis-cli --rdb to NFS for backup pipeline
- Remove RedisInsight UI ingress (port 8001, only in redis-stack)
- Add redis-backup to NFS exports
- 110 clients reconnected immediately after switchover
- Memory savings: ~100MB from dropping unused modules
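The hourly backup job could look like the following; the Service name, NFS server, and paths are assumptions:

```yaml
# Sketch of the hourly redis-backup CronJob: redis-cli --rdb streams a
# snapshot to an NFS-backed volume for the backup pipeline.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: redis-backup
spec:
  schedule: "0 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: redis:7-alpine
              command: ["redis-cli", "-h", "redis", "--rdb", "/backup/dump.rdb"]
              volumeMounts:
                - name: backup
                  mountPath: /backup
          volumes:
            - name: backup
              nfs:
                server: nas.example.lan      # assumption
                path: /mnt/main/redis-backup # assumption
```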
New variant documents ghost Running pods with frozen processes after kured
rolling reboots. Key diagnostic: Running 1/1 but zero listening sockets
from ss -tlnp. Fix: force-delete pods to get fresh NFS mounts.
The BuildKit builder cannot push to the insecure HTTP registry at
registry.viktorbarzin.lan:5050 because buildkit_config is not being
applied by the plugin. Simplified to DockerHub-only push for now.
Private registry caching and push can be re-added once buildkit_config
issue is resolved.
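For when the plugin issue is fixed, the intended shape is a BuildKit TOML stanza passed through the plugin's buildkit_config setting; the image name here is an assumption:

```yaml
# Sketch: insecure HTTP registry config BuildKit needs for the private push.
steps:
  build:
    image: woodpeckerci/plugin-docker-buildx
    settings:
      repo: registry.viktorbarzin.lan:5050/example   # image name is an assumption
      buildkit_config: |
        [registry."registry.viktorbarzin.lan:5050"]
          http = true
          insecure = true
```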
Major milestone - shared PostgreSQL moved from NFS to CloudNativePG:
- CNPG cluster (pg-cluster) running in dbaas namespace on local-path storage
- PostGIS image (ghcr.io/cloudnative-pg/postgis:16) for dawarich compatibility
- All 20 databases and 19 roles restored from pg_dumpall backup
- postgresql.dbaas Service patched to point at CNPG primary
- Old PG deployment scaled to 0 (NFS data intact for rollback)
- All 12+ dependent services verified running:
authentik, n8n, dawarich, tandoor, linkwarden, netbox, woodpecker,
rybbit, affine, health, resume, trading-bot, atuin
- Authentik PgBouncer working through the switched endpoint
TODO: codify CNPG cluster in Terraform, add 2nd replica, update backup CronJob
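As a starting point for the codification TODO, the running cluster could be captured roughly like this; the storage size is an assumption and the planned second replica is left as a comment:

```yaml
# Sketch of pg-cluster as a CloudNativePG Cluster resource.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-cluster
  namespace: dbaas
spec:
  instances: 1          # TODO: bump to 2 for the planned replica
  imageName: ghcr.io/cloudnative-pg/postgis:16   # PostGIS for dawarich
  storage:
    storageClass: local-path
    size: 50Gi          # assumption
```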
The plugin-docker-buildx (Codeberg version) changed CacheFrom from
string to StringSlice, which causes urfave/cli to split on commas.
The cache_images setting properly handles registry refs by generating
both --cache-from and --cache-to flags automatically.
- cache_from/cache_to must be plain strings, not YAML lists: the
  plugin-docker-buildx treats them as single string values, while the
  Woodpecker settings layer was splitting comma-separated list items
  into separate --cache-from flags (type=registry and ref=... ended
  up as two distinct flags)
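The working shape is a single string per setting, so the settings layer has nothing to split; the cache ref below is illustrative:

```yaml
# cache_from/cache_to as plain strings (not YAML lists).
settings:
  cache_from: "type=registry,ref=docker.io/user/app:buildcache"
  cache_to: "type=registry,ref=docker.io/user/app:buildcache,mode=max"
```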
- caretta.tf: replace deprecated set{} blocks with values=[yamlencode()]
to fix Terraform plan error with newer Helm provider
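The caretta.tf change follows this pattern; the repository URL and value keys are illustrative, not copied from the actual file:

```hcl
# helm_release after the fix: deprecated set {} blocks replaced with a
# single yamlencode()'d entry in the values list.
resource "helm_release" "caretta" {
  name       = "caretta"
  repository = "https://groundcover-com.github.io/caretta" # illustrative
  chart      = "caretta"

  # was: multiple set { name = "..."  value = "..." } blocks
  values = [yamlencode({
    tolerations = [{
      key      = "node-role.kubernetes.io/control-plane"
      operator = "Exists"
      effect   = "NoSchedule"
    }]
  })]
}
```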