Commit graph

1455 commits

Author SHA1 Message Date
Viktor Barzin
c20312e41f [ci skip] revert MySQL to NFS — Bitnami images unavailable on Docker Hub
Bitnami MySQL images can't be pulled (not found on Docker Hub, likely
moved to a different registry). Reverted MySQL to single instance on
NFS as the known-working state. MySQL replication to be revisited
once image availability is resolved.

PostgreSQL and Redis remain on local disk with replication.
2026-02-28 22:53:33 +00:00
Viktor Barzin
f9a4823ccc [ci skip] switch VPA from Auto to Initial mode for Terraform compatibility
VPA Auto mode modifies Deployment specs at runtime, causing conflicts
with Terraform on every apply (drift -> reset -> VPA evict loop).

Initial mode only mutates Pod resource requests at creation time via
the admission webhook, leaving the Deployment spec unchanged. This
means terraform plan shows no drift while pods still get VPA-optimized
resources on every restart.

- 171 VPAs switched from Auto to Initial
- 20 VPAs remain Off (tier-0 critical services)
- Goldilocks dashboard continues to show recommendations
2026-02-28 22:43:29 +00:00
Viktor Barzin
f64c979ba5 [ci skip] tune resource limits and requests across 10 services
Critical OOM fixes (add/increase limits):
- netbox: add 512Mi limit (was at 98.8% of Kyverno default 256Mi)
- speedtest: add 512Mi limit (was at 80.9%)
- meshcentral: add 384Mi limit (was at 72.7%)
- ytdlp: uncomment resources, set 512Mi limit (was at 74.6%)

Over-provisioned (reduce limits):
- dashy: 2Gi → 512Mi (was using 135Mi)
- redis master: 2Gi → 256Mi (was using 14Mi)
- redis replica: 1Gi → 256Mi (was using 12Mi)
- resume printer: 2Gi → 512Mi (was using 108Mi)
- resume app: 1Gi → 384Mi (was using 125Mi)
- openclaw: 4Gi → 1Gi (was using 372Mi)

Under-provisioned requests (increase):
- authentik server: 256Mi → 512Mi request (actual ~560Mi)
- authentik worker: 256Mi → 384Mi request (actual ~400Mi)

New explicit resources (previously Kyverno defaults):
- forgejo: add 512Mi limit, 64Mi request
2026-02-28 21:59:08 +00:00
Viktor Barzin
ac482b5324 [ci skip] Phase 3: migrate MySQL from NFS to local disk
- Add local-path PVC for MySQL data (10Gi, actual usage ~27GB)
- Init container seeds data from NFS on first run (cp -a)
- NFS volume kept as read-only seed source in init container
- MySQL 9.2.0 running on local disk with proper fsync
- All dependent services verified running:
  hackmd, speedtest, onlyoffice, paperless-ngx
- mysqldump backup taken before migration
- Existing daily mysqldump CronJob unchanged (writes to NFS)
2026-02-28 20:41:07 +00:00
Viktor Barzin
379c7e261f [ci skip] fix nextcloud OOMKilled: increase memory limit to 2Gi 2026-02-28 20:21:00 +00:00
Viktor Barzin
f2e239b2b9 trigger CI: test registry push with port-preserving Host header 2026-02-28 20:17:47 +00:00
Viktor Barzin
09a810f8fb [ci skip] fix: use $http_host in nginx to preserve port in registry redirects 2026-02-28 20:16:03 +00:00
Viktor Barzin
58644e036f [ci skip] Redis: upgrade to Bitnami Helm chart with Sentinel HA
- Replace manual redis:7-alpine deployment with Bitnami Redis Helm chart v25.3.2
- Architecture: replication with Sentinel (1 master + 1 replica + sentinels)
- Automatic failover via Sentinel (quorum=2, masterSet=mymaster)
- Service 'redis.redis' always points at current master (transparent to clients)
- 120 clients connected immediately after deployment
- Sentinel confirmed tracking redis-node-0 as master
- Local-path PVCs for persistence (2Gi per node)
- Auth disabled (matches previous setup)
- Hourly RDB backup CronJob to NFS preserved
- OCI chart pulled via pull-through cache (10.0.20.10:5000)
2026-02-28 19:59:58 +00:00
Viktor Barzin
691c74aa30 fix: use inline cache to avoid cache_from comma splitting bug 2026-02-28 19:56:00 +00:00
Viktor Barzin
3974827ea9 [ci skip] color only public IPs red in service map, private IPs (10.x, 192.168.x) get light blue 2026-02-28 19:44:16 +00:00
Viktor Barzin
2b22c90a56 [ci skip] Phase 2: migrate Redis from NFS to local disk
- Switch from redis/redis-stack:latest to redis:7-alpine
  (modules were completely unused — zero module commands in stats)
- Move data from NFS (/mnt/main/redis) to local-path PVC
  (RDB saves: 39s on NFS → <1s on local disk)
- Start fresh (old RDB had redis-stack module data incompatible with plain redis;
  all Redis data is transient — queues and caches rebuild automatically)
- Add hourly redis-backup CronJob: redis-cli --rdb to NFS for backup pipeline
- Remove RedisInsight UI ingress (port 8001, only in redis-stack)
- Add redis-backup to NFS exports
- 110 clients reconnected immediately after switchover
- Memory savings: ~100MB from dropping unused modules
2026-02-28 19:44:08 +00:00
Viktor Barzin
eb74aeaa6a trigger CI: test BuildKit caching with TLS registry 2026-02-28 19:42:34 +00:00
Viktor Barzin
96c0353c13 [ci skip] add TLS to private registry, switch to registry.viktorbarzin.me 2026-02-28 19:40:38 +00:00
Viktor Barzin
f7acc31d83 [ci skip] update NFS mount skill: add stale mount variant after node reboots
New variant documents ghost Running pods with frozen processes after kured
rolling reboots. Key diagnostic: Running 1/1 but zero listening sockets
from ss -tlnp. Fix: force-delete pods to get fresh NFS mounts.
2026-02-28 19:38:30 +00:00
Viktor Barzin
c229f6ccab [ci skip] set network observability dashboard auto-refresh to 1h 2026-02-28 19:32:49 +00:00
Viktor Barzin
11eb256511 [ci skip] fix service map coloring: remove arc system, use color field for namespace-based node colors 2026-02-28 19:25:52 +00:00
Viktor Barzin
882350250a [ci skip] codify CNPG PostgreSQL in Terraform, decommission old NFS-backed PG
Phase 1 complete — PostgreSQL fully migrated off NFS:

dbaas module changes:
- Replace old kubernetes_deployment.postgres with null_resource.pg_cluster
  (CNPG Cluster CR managed via kubectl apply due to webhook mutation issues)
- Update postgresql Service selector: app=postgresql → cnpg primary
- Update backup CronJob: use postgres user + read password from CNPG secret
  (pg-cluster-superuser) instead of hardcoded root password
- Add kube_config_path variable for kubectl in null_resource
- Old deployment deleted from cluster (was scaled to 0)

CNPG cluster status:
- 2 instances: primary (k8s-node4), replica (k8s-node2)
- PostGIS image (ghcr.io/cloudnative-pg/postgis:16)
- 20Gi local-path storage per instance
- All 13 dependent services verified running
- Backup CronJob verified working with new endpoint
2026-02-28 19:23:36 +00:00
Viktor Barzin
140f48c6ee fix: remove private registry from docker builds, push only to DockerHub
The BuildKit builder cannot push to the insecure HTTP registry at
registry.viktorbarzin.lan:5050 because buildkit_config is not being
applied by the plugin. Simplified to DockerHub-only push for now.
Private registry caching and push can be re-added once buildkit_config
issue is resolved.
2026-02-28 19:15:54 +00:00
Viktor Barzin
79775fa2cc [ci skip] improve network observability dashboard: namespace coloring, layered layout, full-width service map 2026-02-28 19:14:20 +00:00
Viktor Barzin
809407c60b trigger CI: test insecure registry and buildkit_config fixes for build-cli 2026-02-28 19:09:30 +00:00
Viktor Barzin
a1ba218cd2 [ci skip] Phase 1: PostgreSQL migrated to CNPG on local disk
Major milestone - shared PostgreSQL moved from NFS to CloudNativePG:
- CNPG cluster (pg-cluster) running in dbaas namespace on local-path storage
- PostGIS image (ghcr.io/cloudnative-pg/postgis:16) for dawarich compatibility
- All 20 databases and 19 roles restored from pg_dumpall backup
- postgresql.dbaas Service patched to point at CNPG primary
- Old PG deployment scaled to 0 (NFS data intact for rollback)
- All 12+ dependent services verified running:
  authentik, n8n, dawarich, tandoor, linkwarden, netbox, woodpecker,
  rybbit, affine, health, resume, trading-bot, atuin
- Authentik PgBouncer working through the switched endpoint

TODO: codify CNPG cluster in Terraform, add 2nd replica, update backup CronJob
2026-02-28 19:08:06 +00:00
Viktor Barzin
5a62d7b9a5 [ci skip] add resource limits to paperless-ngx and stirling-pdf to prevent OOMKills
Both services were hitting the 256Mi Kyverno LimitRange default and getting
OOMKilled. Set explicit limits to 1Gi memory for both.
2026-02-28 19:07:59 +00:00
Viktor Barzin
5e177a8889 [ci skip] combine caretta and goflow2 into unified network observability dashboard 2026-02-28 19:04:53 +00:00
Viktor Barzin
b3eaf76684 fix: use cache_images instead of cache_from to avoid comma splitting
The plugin-docker-buildx (Codeberg version) changed CacheFrom from
string to StringSlice, which causes urfave/cli to split on commas.
The cache_images setting properly handles registry refs by generating
both --cache-from and --cache-to flags automatically.
2026-02-28 18:53:08 +00:00
Viktor Barzin
c5c3b092e5 [ci skip] fix caretta helm values and goflow2 transport args 2026-02-28 18:51:02 +00:00
Viktor Barzin
87fc11121d fix: use plain string for cache_from/cache_to and fix caretta helm_release
- cache_from/cache_to must be plain strings, not YAML lists — the
  plugin-docker-buildx treats them as single string values and the
  Woodpecker settings layer was splitting comma-separated list items
  into separate --cache-from flags (type=registry and ref=... separately)
- caretta.tf: replace deprecated set{} blocks with values=[yamlencode()]
  to fix Terraform plan error with newer Helm provider
2026-02-28 18:47:20 +00:00
Viktor Barzin
0ebf850893 fix: use YAML list for cache_from/cache_to to prevent comma splitting 2026-02-28 18:40:55 +00:00
Viktor Barzin
00d78c2337 fix: remove orphaned git submodule reference blocking CI clone 2026-02-28 18:34:59 +00:00
Viktor Barzin
be47592e08 fix: remove deprecated secrets field from slack step 2026-02-28 18:32:10 +00:00
Viktor Barzin
e8997ec430 [ci skip] add caretta, goflow2, and prometheus scrape targets to monitoring module 2026-02-28 18:30:20 +00:00
Viktor Barzin
b378f30430 trigger CI: test BuildKit caching (retry) 2026-02-28 18:29:43 +00:00
Viktor Barzin
bf3404bf6b [ci skip] add goflow2 netflow collector to monitoring module 2026-02-28 18:29:07 +00:00
Viktor Barzin
9d52acd286 [ci skip] add caretta eBPF pod topology to monitoring module 2026-02-28 18:28:09 +00:00
Viktor Barzin
4b6ade7b08 fix: replace removed woodpeckerci/plugin-slack with curl-based webhook 2026-02-28 18:25:23 +00:00
Viktor Barzin
73cb0fe9b3 trigger CI: test BuildKit caching 2026-02-28 18:23:50 +00:00
Viktor Barzin
052662540b [ci skip] add network visualization implementation plan 2026-02-28 18:19:36 +00:00
Viktor Barzin
887075189a [ci skip] add network traffic visualization design doc 2026-02-28 18:14:42 +00:00
Viktor Barzin
c36d953573 trigger CI: test BuildKit caching for build-cli pipeline 2026-02-28 18:04:23 +00:00
Viktor Barzin
504e2b01ab [ci skip] add private registry to Terraform cloud-init provisioning 2026-02-28 17:57:24 +00:00
Viktor Barzin
925dbe39c1 [ci skip] add registry-private service to Docker Compose stack 2026-02-28 17:57:04 +00:00
Viktor Barzin
64c55a6710 [ci skip] add nginx upstream and server block for private registry on port 5050 2026-02-28 17:57:03 +00:00
Viktor Barzin
c7bb324f64 [ci skip] add BuildKit layer caching and dual-push to f1-stream pipeline 2026-02-28 17:56:56 +00:00
Viktor Barzin
4a0fc18c60 [ci skip] add BuildKit layer caching and dual-push to build-cli pipeline 2026-02-28 17:56:52 +00:00
Viktor Barzin
2102ffdb8b [ci skip] add private R/W registry config for CI build caching 2026-02-28 17:56:50 +00:00
Viktor Barzin
4651b67479 [ci skip] update CI caching plan: add Terraform provisioning for private registry 2026-02-28 17:51:55 +00:00
Viktor Barzin
2adfa86401 [ci skip] add CI build caching implementation plan 2026-02-28 17:46:44 +00:00
Viktor Barzin
5ef03cc0e0 [ci skip] add CI build caching design doc 2026-02-28 17:43:42 +00:00
Viktor Barzin
3633c195cf [ci skip] install CloudNativePG operator as platform module
- CNPG v0.27.1 operator in cnpg-system namespace
- CRDs installed: clusters, backups, poolers, databases, etc.
- local-path StorageClass already exists (from cloud-init template)
- Prerequisite for PostgreSQL migration off NFS
2026-02-28 17:22:53 +00:00
Viktor Barzin
eb32190461 [ci skip] fix OOM crashes: add resource limits for osrm-bicycle, aiostreams, listenarr, authentik
- osrm-bicycle: 1Gi limit (loads 403MB routing graph)
- aiostreams: 768Mi limit (loads 44K anime entries)
- listenarr: 1Gi limit (.NET + Playwright/Chromium)
- authentik server: 1Gi limit, worker: 1Gi limit (Django + gunicorn)
- servarr: pass nfs_server variable to all submodules
2026-02-28 17:03:33 +00:00
Viktor Barzin
c6beefc845 [ci skip] nextcloud: increase resource limits to prevent OOM crash loop
Default LimitRange (256Mi) was too low — pod was using 227Mi/256Mi and
getting OOM killed under sync client load, causing 500s and blank web UI.
2026-02-28 16:26:19 +00:00