Commit graph

98 commits

Author SHA1 Message Date
Viktor Barzin
e8ff760aff [ci skip] openclaw: cache tools on NFS for fast restarts
- Switch /tools volume from emptyDir to NFS (/mnt/main/openclaw/tools)
- Skip download of kubectl, terraform, terragrunt, pip packages if cached
- Startup time: ~2.5min → ~38s on subsequent restarts
2026-03-01 13:59:07 +00:00
Viktor Barzin
fd0c85c9cc [ci skip] bump poison-fountain tier from aux to cluster (critical path for all ingress) 2026-03-01 13:57:54 +00:00
Viktor Barzin
e728f4c106 [ci skip] openclaw: add Telegram channel + install terragrunt in init container
- Add Telegram bot integration (DM allowlist, user 8281953845 only)
- Install terragrunt v0.99.4 in init container alongside terraform
- Remove terraform init from init (terragrunt handles this per-stack)
- Add openclaw_telegram_bot_token variable
2026-03-01 13:44:58 +00:00
Viktor Barzin
014f6cad5a [ci skip] openclaw: switch to free agentic models via NVIDIA NIM, OpenRouter, Llama API
- Primary: Mistral Large 3 (675B) on NIM - always warm, excellent tool calling
- Fallback 1: Nemotron Ultra 253B on NIM
- Fallback 2: Llama 4 Maverick on Llama API (different provider for resilience)
- 10 models total across 3 providers, all free
- Removed: Modal (GLM-5), Gemini, Ollama providers
- Added: NVIDIA NIM provider with DeepSeek V3.2, Qwen 3.5, Qwen 3 Coder, GLM-5
- Bumped maxTokens from 8192 to 16384 for agentic output room
2026-03-01 13:22:47 +00:00
Viktor Barzin
c07870d068 [ci skip] fix MySQL service: point at mysqld pods, pin to healthy primary
The InnoDB Cluster Router (mysqlrouter) doesn't deploy when the cluster
lacks quorum. Changed service selector from mysqlrouter to mysqld with
publishNotReadyAddresses=true to bypass the operator's readiness gate.
Pinned to mysql-cluster-1 (healthy primary) until full cluster recovers.
2026-03-01 12:16:28 +00:00
Viktor Barzin
1101242036 [ci skip] MySQL: deploy InnoDB Cluster via Oracle MySQL Operator
- MySQL Operator v2.2.7 in mysql-operator namespace (on control-plane)
- InnoDB Cluster: 3 MySQL 9.2.0 servers + 1 Router, local-path storage
- Group Replication with automatic failover via MySQL Router
- Compatibility service: mysql.dbaas:3306 → Router port 6446
- Images from container-registry.oracle.com (not Docker Hub)
- Init containers are slow (~20 min) due to mysqlsh plugin loading
- Data restore from mysqldump pending after cluster is ONLINE
2026-03-01 03:00:21 +00:00
Viktor Barzin
6139052104 [ci skip] add graceful degradation to CrowdSec bouncer middleware
P0: Set updateMaxFailure=-1 (fail-open)
  Previously defaulted to 0 which blocked ALL traffic on first LAPI
  failure. Now serves from cached decisions when LAPI is unreachable.

P1: Enable Redis cache for CrowdSec decisions
  Decisions are now shared across all 3 Traefik replicas and survive
  pod restarts. redisCacheUnreachableBlock=false prevents Redis from
  becoming another SPOF.

P1: Add clientTrustedIPs for internal cluster traffic
  Node CIDR (10.0.20.0/24) and pod CIDR (10.10.0.0/16) bypass
  CrowdSec entirely, preventing internal cascade failures.
2026-03-01 02:36:53 +00:00
Viktor Barzin
fcb7d6780e [ci skip] fix nextcloud: increase memory to 4Gi, extend startup probe
- Memory limit: 2Gi → 4Gi (VPA target is 2.8Gi, was OOMKilling)
- Memory request: 512Mi → 1Gi
- Startup probe: 30s delay, 10s timeout, 60 failures (10min total)
  Previous 5min window was too short for NFS-backed SQLite init
2026-02-28 23:32:28 +00:00
Viktor Barzin
c20312e41f [ci skip] revert MySQL to NFS — Bitnami images unavailable on Docker Hub
Bitnami MySQL images can't be pulled (not found on Docker Hub, likely
moved to a different registry). Reverted MySQL to single instance on
NFS as the known-working state. MySQL replication to be revisited
once image availability is resolved.

PostgreSQL and Redis remain on local disk with replication.
2026-02-28 22:53:33 +00:00
Viktor Barzin
f9a4823ccc [ci skip] switch VPA from Auto to Initial mode for Terraform compatibility
VPA Auto mode modifies Deployment specs at runtime, causing conflicts
with Terraform on every apply (drift -> reset -> VPA evict loop).

Initial mode only mutates Pod resource requests at creation time via
the admission webhook, leaving the Deployment spec unchanged. This
means terraform plan shows no drift while pods still get VPA-optimized
resources on every restart.

- 171 VPAs switched from Auto to Initial
- 20 VPAs remain Off (tier-0 critical services)
- Goldilocks dashboard continues to show recommendations
2026-02-28 22:43:29 +00:00
Viktor Barzin
f64c979ba5 [ci skip] tune resource limits and requests across 10 services
Critical OOM fixes (add/increase limits):
- netbox: add 512Mi limit (was at 98.8% of Kyverno default 256Mi)
- speedtest: add 512Mi limit (was at 80.9%)
- meshcentral: add 384Mi limit (was at 72.7%)
- ytdlp: uncomment resources, set 512Mi limit (was at 74.6%)

Over-provisioned (reduce limits):
- dashy: 2Gi → 512Mi (was using 135Mi)
- redis master: 2Gi → 256Mi (was using 14Mi)
- redis replica: 1Gi → 256Mi (was using 12Mi)
- resume printer: 2Gi → 512Mi (was using 108Mi)
- resume app: 1Gi → 384Mi (was using 125Mi)
- openclaw: 4Gi → 1Gi (was using 372Mi)

Under-provisioned requests (increase):
- authentik server: 256Mi → 512Mi request (actual ~560Mi)
- authentik worker: 256Mi → 384Mi request (actual ~400Mi)

New explicit resources (previously Kyverno defaults):
- forgejo: add 512Mi limit, 64Mi request
2026-02-28 21:59:08 +00:00
Viktor Barzin
ac482b5324 [ci skip] Phase 3: migrate MySQL from NFS to local disk
- Add local-path PVC for MySQL data (10Gi, actual usage ~27GB)
- Init container seeds data from NFS on first run (cp -a)
- NFS volume kept as read-only seed source in init container
- MySQL 9.2.0 running on local disk with proper fsync
- All dependent services verified running:
  hackmd, speedtest, onlyoffice, paperless-ngx
- mysqldump backup taken before migration
- Existing daily mysqldump CronJob unchanged (writes to NFS)
2026-02-28 20:41:07 +00:00
Viktor Barzin
379c7e261f [ci skip] fix nextcloud OOMKilled: increase memory limit to 2Gi 2026-02-28 20:21:00 +00:00
Viktor Barzin
58644e036f [ci skip] Redis: upgrade to Bitnami Helm chart with Sentinel HA
- Replace manual redis:7-alpine deployment with Bitnami Redis Helm chart v25.3.2
- Architecture: replication with Sentinel (1 master + 1 replica + sentinels)
- Automatic failover via Sentinel (quorum=2, masterSet=mymaster)
- Service 'redis.redis' always points at current master (transparent to clients)
- 120 clients connected immediately after deployment
- Sentinel confirmed tracking redis-node-0 as master
- Local-path PVCs for persistence (2Gi per node)
- Auth disabled (matches previous setup)
- Hourly RDB backup CronJob to NFS preserved
- OCI chart pulled via pull-through cache (10.0.20.10:5000)
2026-02-28 19:59:58 +00:00
Viktor Barzin
3974827ea9 [ci skip] color only public IPs red in service map, private IPs (10.x, 192.168.x) get light blue 2026-02-28 19:44:16 +00:00
Viktor Barzin
2b22c90a56 [ci skip] Phase 2: migrate Redis from NFS to local disk
- Switch from redis/redis-stack:latest to redis:7-alpine
  (modules were completely unused — zero module commands in stats)
- Move data from NFS (/mnt/main/redis) to local-path PVC
  (RDB saves: 39s on NFS → <1s on local disk)
- Start fresh (old RDB had redis-stack module data incompatible with plain redis;
  all Redis data is transient — queues and caches rebuild automatically)
- Add hourly redis-backup CronJob: redis-cli --rdb to NFS for backup pipeline
- Remove RedisInsight UI ingress (port 8001, only in redis-stack)
- Add redis-backup to NFS exports
- 110 clients reconnected immediately after switchover
- Memory savings: ~100MB from dropping unused modules
2026-02-28 19:44:08 +00:00
Viktor Barzin
96c0353c13 [ci skip] add TLS to private registry, switch to registry.viktorbarzin.me 2026-02-28 19:40:38 +00:00
Viktor Barzin
c229f6ccab [ci skip] set network observability dashboard auto-refresh to 1h 2026-02-28 19:32:49 +00:00
Viktor Barzin
11eb256511 [ci skip] fix service map coloring: remove arc system, use color field for namespace-based node colors 2026-02-28 19:25:52 +00:00
Viktor Barzin
882350250a [ci skip] codify CNPG PostgreSQL in Terraform, decommission old NFS-backed PG
Phase 1 complete — PostgreSQL fully migrated off NFS:

dbaas module changes:
- Replace old kubernetes_deployment.postgres with null_resource.pg_cluster
  (CNPG Cluster CR managed via kubectl apply due to webhook mutation issues)
- Update postgresql Service selector: app=postgresql → cnpg primary
- Update backup CronJob: use postgres user + read password from CNPG secret
  (pg-cluster-superuser) instead of hardcoded root password
- Add kube_config_path variable for kubectl in null_resource
- Old deployment deleted from cluster (was scaled to 0)

CNPG cluster status:
- 2 instances: primary (k8s-node4), replica (k8s-node2)
- PostGIS image (ghcr.io/cloudnative-pg/postgis:16)
- 20Gi local-path storage per instance
- All 13 dependent services verified running
- Backup CronJob verified working with new endpoint
2026-02-28 19:23:36 +00:00
Viktor Barzin
79775fa2cc [ci skip] improve network observability dashboard: namespace coloring, layered layout, full-width service map 2026-02-28 19:14:20 +00:00
Viktor Barzin
a1ba218cd2 [ci skip] Phase 1: PostgreSQL migrated to CNPG on local disk
Major milestone - shared PostgreSQL moved from NFS to CloudNativePG:
- CNPG cluster (pg-cluster) running in dbaas namespace on local-path storage
- PostGIS image (ghcr.io/cloudnative-pg/postgis:16) for dawarich compatibility
- All 20 databases and 19 roles restored from pg_dumpall backup
- postgresql.dbaas Service patched to point at CNPG primary
- Old PG deployment scaled to 0 (NFS data intact for rollback)
- All 12+ dependent services verified running:
  authentik, n8n, dawarich, tandoor, linkwarden, netbox, woodpecker,
  rybbit, affine, health, resume, trading-bot, atuin
- Authentik PgBouncer working through the switched endpoint

TODO: codify CNPG cluster in Terraform, add 2nd replica, update backup CronJob
2026-02-28 19:08:06 +00:00
Viktor Barzin
5a62d7b9a5 [ci skip] add resource limits to paperless-ngx and stirling-pdf to prevent OOMKills
Both services were hitting the 256Mi Kyverno LimitRange default and getting
OOMKilled. Set explicit limits to 1Gi memory for both.
2026-02-28 19:07:59 +00:00
Viktor Barzin
5e177a8889 [ci skip] combine caretta and goflow2 into unified network observability dashboard 2026-02-28 19:04:53 +00:00
Viktor Barzin
c5c3b092e5 [ci skip] fix caretta helm values and goflow2 transport args 2026-02-28 18:51:02 +00:00
Viktor Barzin
87fc11121d fix: use plain string for cache_from/cache_to and fix caretta helm_release
- cache_from/cache_to must be plain strings, not YAML lists — the
  plugin-docker-buildx treats them as single string values and the
  Woodpecker settings layer was splitting comma-separated list items
  into separate --cache-from flags (type=registry and ref=... separately)
- caretta.tf: replace deprecated set{} blocks with values=[yamlencode()]
  to fix Terraform plan error with newer Helm provider
2026-02-28 18:47:20 +00:00
Viktor Barzin
00d78c2337 fix: remove orphaned git submodule reference blocking CI clone 2026-02-28 18:34:59 +00:00
Viktor Barzin
e8997ec430 [ci skip] add caretta, goflow2, and prometheus scrape targets to monitoring module 2026-02-28 18:30:20 +00:00
Viktor Barzin
bf3404bf6b [ci skip] add goflow2 netflow collector to monitoring module 2026-02-28 18:29:07 +00:00
Viktor Barzin
9d52acd286 [ci skip] add caretta eBPF pod topology to monitoring module 2026-02-28 18:28:09 +00:00
Viktor Barzin
504e2b01ab [ci skip] add private registry to Terraform cloud-init provisioning 2026-02-28 17:57:24 +00:00
Viktor Barzin
3633c195cf [ci skip] install CloudNativePG operator as platform module
- CNPG v0.27.1 operator in cnpg-system namespace
- CRDs installed: clusters, backups, poolers, databases, etc.
- local-path StorageClass already exists (from cloud-init template)
- Prerequisite for PostgreSQL migration off NFS
2026-02-28 17:22:53 +00:00
Viktor Barzin
eb32190461 [ci skip] fix OOM crashes: add resource limits for osrm-bicycle, aiostreams, listenarr, authentik
- osrm-bicycle: 1Gi limit (loads 403MB routing graph)
- aiostreams: 768Mi limit (loads 44K anime entries)
- listenarr: 1Gi limit (.NET + Playwright/Chromium)
- authentik server: 1Gi limit, worker: 1Gi limit (Django + gunicorn)
- servarr: pass nfs_server variable to all submodules
2026-02-28 17:03:33 +00:00
Viktor Barzin
c6beefc845 [ci skip] nextcloud: increase resource limits to prevent OOM crash loop
Default LimitRange (256Mi) was too low — pod was using 227Mi/256Mi and
getting OOM killed under sync client load, causing 500s and blank web UI.
2026-02-28 16:26:19 +00:00
Viktor Barzin
71d4801cca [ci skip] audiblez-web: switch from digest to tag for CI-driven deploys
Woodpecker CI pipeline now pushes tagged images and patches the
deployment with the build number tag. Using :latest as the Terraform
baseline so CI can override with specific build tags.
2026-02-28 14:17:19 +00:00
Viktor Barzin
0274cc0722 [ci skip] technitium: add primary-secondary DNS HA with AXFR zone replication
Secondary instance on a separate node replicates all zones from primary via
zone transfer. LoadBalancer routes DNS queries to both pods. PDB ensures at
least 1 DNS pod survives voluntary disruptions. Setup job automates zone
transfer enablement and secondary zone creation via Technitium REST API.
2026-02-28 14:14:20 +00:00
Viktor Barzin
69c4c0c76e [ci skip] VPA: reduce LimitRange defaults, add overcommit check, protect tier-0
- Reduce Kyverno LimitRange default limits ~4x across all tiers to fix
  800-900% memory overcommitment on worker nodes
- Add cluster health check #25: per-node resource overcommitment
  showing requests and limits vs allocatable capacity
- Add Kyverno policy for Goldilocks VPA mode by tier: tier-0 namespaces
  get VPA Off mode (recommend only, no evictions) to prevent downtime
  on critical infra (traefik, cloudflared, authentik, technitium, etc.)
- Non-tier-0 namespaces get VPA Auto mode for active right-sizing
2026-02-26 23:15:43 +00:00
Viktor Barzin
250f805c32 [ci skip] Deploy VPA + Goldilocks for dynamic resource right-sizing
Add Vertical Pod Autoscaler (recommender, updater, admission-controller)
and Goldilocks dashboard to monitor resource recommendations across all
namespaces. Dashboard at goldilocks.viktorbarzin.me behind Authentik.
2026-02-25 21:54:01 +00:00
Viktor Barzin
f1d90ff840 [ci skip] poison-fountain: fix single point of failure causing transient service outages
- Scale to 2 replicas with RollingUpdate (maxUnavailable=0)
- Add topology spread constraint to place pods on different nodes
- Switch from single-threaded to ThreadingMixIn HTTP server so tarpit
  slow-drip requests no longer block /auth and /healthz endpoints
2026-02-25 21:05:14 +00:00
Viktor Barzin
071a1a1d93 [ci skip] rybbit: increase clickhouse memory limit to fix OOMKilled crash loop 2026-02-25 20:53:08 +00:00
Viktor Barzin
7bc975aa16 [ci skip] kyverno: scale to 2 replicas, eliminate API calls from policies
- Scale admission controller to 2 replicas with topology spread across nodes
- Rewrite inject-priority-class-from-tier: use namespaceSelector instead of
  API call per pod admission (eliminates Kyverno→API server round-trip)
- Rewrite sync-tier-label-from-namespace: same namespaceSelector approach
- Extract governance_tiers local to DRY up tier definitions
2026-02-24 23:09:56 +00:00
Viktor Barzin
dcb465a7e5 [ci skip] Fix Woodpecker GitHub forge: add explicit GITHUB_URL to prevent Forgejo URL bleed
When both WOODPECKER_GITHUB and WOODPECKER_FORGEJO are enabled without
an explicit WOODPECKER_GITHUB_URL, the GitHub forge inherits the Forgejo
URL causing all GitHub API calls to hit forgejo.viktorbarzin.me with
GitHub OAuth credentials, resulting in 401 Unauthorized on repo add and
cron jobs. Also adds Forgejo forge variables to Terraform.
2026-02-24 23:02:33 +00:00
Viktor Barzin
e7e4faa57a [ci skip] kyverno: fix crash loop — failurePolicy Ignore, increase memory, pin chart
Admission controller was restarting every ~5min due to API server timeouts
causing leader election loss. failurePolicy:Fail meant the webhook blocked
all pod creation cluster-wide when Kyverno was unavailable.
2026-02-24 23:00:45 +00:00
Viktor Barzin
c35bef2fd8 [ci skip] fix cluster health: GPU tolerations, actualbudget nfs_server, AuthentikDown alert
- Add missing nvidia.com/gpu toleration to ollama and yt-highlights deployments
- Add node_selector gpu=true to ollama deployment
- Pass nfs_server variable through to actualbudget factory modules
- Fix AuthentikDown alert to match actual deployment name (goauthentik-server)
2026-02-24 22:55:58 +00:00
Viktor Barzin
4fab38da1f [ci skip] wrongmove dashboard: add per-path latency table, fix layout, sort top offenders
Add "Per-Path Latency Breakdown" table with p50/p95/p99 and request rate
per endpoint. Fix bar gauge position to sit next to timeseries. Add sort
transformation to "Top Offenders (Avg Duration)" panel.
2026-02-24 22:31:41 +00:00
Viktor Barzin
a3f66c88fd [ci skip] f1-stream: use v5.0.0 tag to bypass stale pull-through cache 2026-02-24 00:28:12 +00:00
Viktor Barzin
540fffdf3f f1-stream: fix frontend routing with catch-all handler for SvelteKit SPA 2026-02-24 00:18:28 +00:00
Viktor Barzin
c9b187ed65 f1-stream: fix SvelteKit routing - add trailingSlash for static adapter 2026-02-24 00:11:44 +00:00
Viktor Barzin
d48551609d f1-stream: add pydantic dependency and trigger CI build 2026-02-24 00:00:02 +00:00
Viktor Barzin
9fd788b158 [ci skip] f1-stream: add CDN token refresh, SvelteKit frontend, multi-stream layout (Phases 6-8)
- Phase 6: CDN token lifecycle with 3-strategy URL matching and periodic refresh
- Phase 7: SvelteKit 2/Svelte 5 frontend with schedule calendar and hls.js player
- Phase 8: Multi-stream layout supporting up to 4 simultaneous HLS streams
- Update Dockerfile to multi-stage build (Node.js frontend + Python backend)
- Switch deployment to :latest tag with Always pull policy for CI-driven deploys
- Update Woodpecker CI to use explicit latest tag
2026-02-23 23:59:35 +00:00