1444 commits

Author SHA1 Message Date
Viktor Barzin
eb74aeaa6a trigger CI: test BuildKit caching with TLS registry 2026-02-28 19:42:34 +00:00
Viktor Barzin
96c0353c13 [ci skip] add TLS to private registry, switch to registry.viktorbarzin.me 2026-02-28 19:40:38 +00:00
Viktor Barzin
f7acc31d83 [ci skip] update NFS mount skill: add stale mount variant after node reboots
New variant documents ghost Running pods with frozen processes after kured
rolling reboots. Key diagnostic: Running 1/1 but zero listening sockets
from ss -tlnp. Fix: force-delete pods to get fresh NFS mounts.
2026-02-28 19:38:30 +00:00
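The diagnostic described in the commit above (a pod that reports Running 1/1 but has zero listening sockets in `ss -tlnp`) can be sketched as a small check. This is a minimal illustration, assuming the text output of `ss -tlnp` has already been captured from inside the pod. The function names `count_listening_sockets` and `looks_like_stale_mount` are hypothetical and do not come from the repository.

```python
def count_listening_sockets(ss_output: str) -> int:
    """Count LISTEN rows in the captured text output of `ss -tlnp`."""
    count = 0
    for line in ss_output.splitlines():
        fields = line.split()
        # Data rows start with the socket state; the header row starts with "State".
        if fields and fields[0] == "LISTEN":
            count += 1
    return count


def looks_like_stale_mount(phase: str, ready: bool, ss_output: str) -> bool:
    """Heuristic from the commit message: Running and ready, yet nothing is listening."""
    return phase == "Running" and ready and count_listening_sockets(ss_output) == 0
```

The heuristic only applies to pods that are expected to listen on a TCP port; a batch job with no sockets would be flagged incorrectly.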
Viktor Barzin
c229f6ccab [ci skip] set network observability dashboard auto-refresh to 1h 2026-02-28 19:32:49 +00:00
Viktor Barzin
11eb256511 [ci skip] fix service map coloring: remove arc system, use color field for namespace-based node colors 2026-02-28 19:25:52 +00:00
Viktor Barzin
882350250a [ci skip] codify CNPG PostgreSQL in Terraform, decommission old NFS-backed PG
Phase 1 complete — PostgreSQL fully migrated off NFS:

dbaas module changes:
- Replace old kubernetes_deployment.postgres with null_resource.pg_cluster
  (CNPG Cluster CR managed via kubectl apply due to webhook mutation issues)
- Update postgresql Service selector: app=postgresql → cnpg primary
- Update backup CronJob: use postgres user + read password from CNPG secret
  (pg-cluster-superuser) instead of hardcoded root password
- Add kube_config_path variable for kubectl in null_resource
- Old deployment deleted from cluster (was scaled to 0)

CNPG cluster status:
- 2 instances: primary (k8s-node4), replica (k8s-node2)
- PostGIS image (ghcr.io/cloudnative-pg/postgis:16)
- 20Gi local-path storage per instance
- All 13 dependent services verified running
- Backup CronJob verified working with new endpoint
2026-02-28 19:23:36 +00:00
Viktor Barzin
140f48c6ee fix: remove private registry from docker builds, push only to DockerHub
The BuildKit builder cannot push to the insecure HTTP registry at
registry.viktorbarzin.lan:5050 because buildkit_config is not being
applied by the plugin. Simplified to DockerHub-only push for now.
Private registry caching and push can be re-added once buildkit_config
issue is resolved.
2026-02-28 19:15:54 +00:00
Viktor Barzin
79775fa2cc [ci skip] improve network observability dashboard: namespace coloring, layered layout, full-width service map 2026-02-28 19:14:20 +00:00
Viktor Barzin
809407c60b trigger CI: test insecure registry and buildkit_config fixes for build-cli 2026-02-28 19:09:30 +00:00
Viktor Barzin
a1ba218cd2 [ci skip] Phase 1: PostgreSQL migrated to CNPG on local disk
Major milestone - shared PostgreSQL moved from NFS to CloudNativePG:
- CNPG cluster (pg-cluster) running in dbaas namespace on local-path storage
- PostGIS image (ghcr.io/cloudnative-pg/postgis:16) for dawarich compatibility
- All 20 databases and 19 roles restored from pg_dumpall backup
- postgresql.dbaas Service patched to point at CNPG primary
- Old PG deployment scaled to 0 (NFS data intact for rollback)
- All 12+ dependent services verified running:
  authentik, n8n, dawarich, tandoor, linkwarden, netbox, woodpecker,
  rybbit, affine, health, resume, trading-bot, atuin
- Authentik PgBouncer working through the switched endpoint

TODO: codify CNPG cluster in Terraform, add 2nd replica, update backup CronJob
2026-02-28 19:08:06 +00:00
Viktor Barzin
5a62d7b9a5 [ci skip] add resource limits to paperless-ngx and stirling-pdf to prevent OOMKills
Both services were hitting the 256Mi Kyverno LimitRange default and getting
OOMKilled. Set explicit limits to 1Gi memory for both.
2026-02-28 19:07:59 +00:00
Viktor Barzin
5e177a8889 [ci skip] combine caretta and goflow2 into unified network observability dashboard 2026-02-28 19:04:53 +00:00
Viktor Barzin
b3eaf76684 fix: use cache_images instead of cache_from to avoid comma splitting
The plugin-docker-buildx (Codeberg version) changed CacheFrom from
string to StringSlice, which causes urfave/cli to split on commas.
The cache_images setting properly handles registry refs by generating
both --cache-from and --cache-to flags automatically.
2026-02-28 18:53:08 +00:00
Viktor Barzin
c5c3b092e5 [ci skip] fix caretta helm values and goflow2 transport args 2026-02-28 18:51:02 +00:00
Viktor Barzin
87fc11121d fix: use plain string for cache_from/cache_to and fix caretta helm_release
- cache_from/cache_to must be plain strings, not YAML lists — the
  plugin-docker-buildx treats them as single string values and the
  Woodpecker settings layer was splitting comma-separated list items
  into separate --cache-from flags (type=registry and ref=... separately)
- caretta.tf: replace deprecated set{} blocks with values=[yamlencode()]
  to fix Terraform plan error with newer Helm provider
2026-02-28 18:47:20 +00:00
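The comma-splitting problem described in the commit above can be reproduced with a small model. This is only a sketch of the behaviour the message describes, not the actual code of Woodpecker or plugin-docker-buildx. The helper names and the registry reference are placeholders chosen for illustration.

```python
def split_on_commas(values: list[str]) -> list[str]:
    """Model a settings layer that splits every value on commas."""
    parts: list[str] = []
    for value in values:
        parts.extend(value.split(","))
    return parts


def to_cache_from_flags(values: list[str]) -> list[str]:
    """Render one --cache-from flag per value."""
    return [f"--cache-from={value}" for value in values]


# Hypothetical cache reference; the registry host and image path are placeholders.
cache_ref = "type=registry,ref=registry.example.com/app:buildcache"

# Passed through untouched, the reference stays a single flag.
intact = to_cache_from_flags([cache_ref])

# Split on commas, type=... and ref=... end up in separate flags.
broken = to_cache_from_flags(split_on_commas([cache_ref]))
```

Here `intact` holds one flag carrying both `type=registry` and `ref=...`, while `broken` holds two flags that each carry only half of the cache specification.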
Viktor Barzin
0ebf850893 fix: use YAML list for cache_from/cache_to to prevent comma splitting 2026-02-28 18:40:55 +00:00
Viktor Barzin
00d78c2337 fix: remove orphaned git submodule reference blocking CI clone 2026-02-28 18:34:59 +00:00
Viktor Barzin
be47592e08 fix: remove deprecated secrets field from slack step 2026-02-28 18:32:10 +00:00
Viktor Barzin
e8997ec430 [ci skip] add caretta, goflow2, and prometheus scrape targets to monitoring module 2026-02-28 18:30:20 +00:00
Viktor Barzin
b378f30430 trigger CI: test BuildKit caching (retry) 2026-02-28 18:29:43 +00:00
Viktor Barzin
bf3404bf6b [ci skip] add goflow2 netflow collector to monitoring module 2026-02-28 18:29:07 +00:00
Viktor Barzin
9d52acd286 [ci skip] add caretta eBPF pod topology to monitoring module 2026-02-28 18:28:09 +00:00
Viktor Barzin
4b6ade7b08 fix: replace removed woodpeckerci/plugin-slack with curl-based webhook 2026-02-28 18:25:23 +00:00
Viktor Barzin
73cb0fe9b3 trigger CI: test BuildKit caching 2026-02-28 18:23:50 +00:00
Viktor Barzin
052662540b [ci skip] add network visualization implementation plan 2026-02-28 18:19:36 +00:00
Viktor Barzin
887075189a [ci skip] add network traffic visualization design doc 2026-02-28 18:14:42 +00:00
Viktor Barzin
c36d953573 trigger CI: test BuildKit caching for build-cli pipeline 2026-02-28 18:04:23 +00:00
Viktor Barzin
504e2b01ab [ci skip] add private registry to Terraform cloud-init provisioning 2026-02-28 17:57:24 +00:00
Viktor Barzin
925dbe39c1 [ci skip] add registry-private service to Docker Compose stack 2026-02-28 17:57:04 +00:00
Viktor Barzin
64c55a6710 [ci skip] add nginx upstream and server block for private registry on port 5050 2026-02-28 17:57:03 +00:00
Viktor Barzin
c7bb324f64 [ci skip] add BuildKit layer caching and dual-push to f1-stream pipeline 2026-02-28 17:56:56 +00:00
Viktor Barzin
4a0fc18c60 [ci skip] add BuildKit layer caching and dual-push to build-cli pipeline 2026-02-28 17:56:52 +00:00
Viktor Barzin
2102ffdb8b [ci skip] add private R/W registry config for CI build caching 2026-02-28 17:56:50 +00:00
Viktor Barzin
4651b67479 [ci skip] update CI caching plan: add Terraform provisioning for private registry 2026-02-28 17:51:55 +00:00
Viktor Barzin
2adfa86401 [ci skip] add CI build caching implementation plan 2026-02-28 17:46:44 +00:00
Viktor Barzin
5ef03cc0e0 [ci skip] add CI build caching design doc 2026-02-28 17:43:42 +00:00
Viktor Barzin
3633c195cf [ci skip] install CloudNativePG operator as platform module
- CNPG v0.27.1 operator in cnpg-system namespace
- CRDs installed: clusters, backups, poolers, databases, etc.
- local-path StorageClass already exists (from cloud-init template)
- Prerequisite for PostgreSQL migration off NFS
2026-02-28 17:22:53 +00:00
Viktor Barzin
eb32190461 [ci skip] fix OOM crashes: add resource limits for osrm-bicycle, aiostreams, listenarr, authentik
- osrm-bicycle: 1Gi limit (loads 403MB routing graph)
- aiostreams: 768Mi limit (loads 44K anime entries)
- listenarr: 1Gi limit (.NET + Playwright/Chromium)
- authentik server: 1Gi limit, worker: 1Gi limit (Django + gunicorn)
- servarr: pass nfs_server variable to all submodules
2026-02-28 17:03:33 +00:00
Viktor Barzin
c6beefc845 [ci skip] nextcloud: increase resource limits to prevent OOM crash loop
Default LimitRange (256Mi) was too low — pod was using 227Mi/256Mi and
getting OOM killed under sync client load, causing 500s and blank web UI.
2026-02-28 16:26:19 +00:00
Viktor Barzin
14b1c43713 [ci skip] expand k8s worker nodes to 256G, update inventory and extend script
- k8s-node2: 128G → 256G (160GB free)
- k8s-node3: 128G → 256G (135GB free)
- k8s-node4: 128G → 256G (127GB free)
- k8s-node1: already 256G (51GB free)
- extend_vm_storage.sh: increase drain timeout to 300s, add --force flag
- Remove Vaultwarden from SQLite migration plan (too risky)
2026-02-28 16:00:16 +00:00
Viktor Barzin
517acd95af [ci skip] revise storage reliability design based on research agent findings
Key changes from v1:
- Drop 3-instance replication → 2-instance CNPG, single Redis/MySQL
- Remove Headscale from PG migration (project discourages it)
- Remove MeshCentral from PG migration (NeDB, not SQLite)
- Replace Redis Sentinel with single redis:7 on local disk (modules unused)
- Add RAM overcommit warning and mitigation
- Add explicit single-host limitation acknowledgment
- Add per-component rollback plans
- Fix backup strategy (CNPG can't archive WAL to NFS natively)
- Reorder migration: low-risk services first, authentik last
- Add research gate before each service migration
2026-02-28 14:38:01 +00:00
Viktor Barzin
415d8704d4 [ci skip] add storage reliability design: DB replication + SQLite consolidation 2026-02-28 14:24:42 +00:00
Viktor Barzin
71d4801cca [ci skip] audiblez-web: switch from digest to tag for CI-driven deploys
Woodpecker CI pipeline now pushes tagged images and patches the
deployment with the build number tag. Using :latest as the Terraform
baseline so CI can override with specific build tags.
2026-02-28 14:17:19 +00:00
Viktor Barzin
0274cc0722 [ci skip] technitium: add primary-secondary DNS HA with AXFR zone replication
Secondary instance on a separate node replicates all zones from primary via
zone transfer. LoadBalancer routes DNS queries to both pods. PDB ensures at
least 1 DNS pod survives voluntary disruptions. Setup job automates zone
transfer enablement and secondary zone creation via Technitium REST API.
2026-02-28 14:14:20 +00:00
Viktor Barzin
3ebf4557f5 [ci skip] update claude knowledge: never restart NFS, NFS export dir prereq 2026-02-28 12:20:36 +00:00
Viktor Barzin
69c4c0c76e [ci skip] VPA: reduce LimitRange defaults, add overcommit check, protect tier-0
- Reduce Kyverno LimitRange default limits ~4x across all tiers to fix
  800-900% memory overcommitment on worker nodes
- Add cluster health check #25: per-node resource overcommitment
  showing requests and limits vs allocatable capacity
- Add Kyverno policy for Goldilocks VPA mode by tier: tier-0 namespaces
  get VPA Off mode (recommend only, no evictions) to prevent downtime
  on critical infra (traefik, cloudflared, authentik, technitium, etc.)
- Non-tier-0 namespaces get VPA Auto mode for active right-sizing
2026-02-26 23:15:43 +00:00
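The per-node overcommitment figure mentioned in the commit above (requests and limits compared with allocatable capacity) comes down to a simple ratio. The sketch below assumes memory quantities are integers with a binary suffix (`Ki`, `Mi`, `Gi`, `Ti`) or plain bytes. The function names are hypothetical, and this is not the repository's health check.

```python
_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '256Mi' (or plain bytes) to bytes."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def overcommit_percent(container_values: list[str], allocatable: str) -> float:
    """Sum of container requests or limits as a percentage of node allocatable memory."""
    total = sum(parse_memory(value) for value in container_values)
    return 100.0 * total / parse_memory(allocatable)
```

For example, limits of `1Gi`, `1Gi`, and `2Gi` on a node with `2Gi` allocatable give 200%, meaning the limits add up to twice what the node can actually provide.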
Viktor Barzin
250f805c32 [ci skip] Deploy VPA + Goldilocks for dynamic resource right-sizing
Add Vertical Pod Autoscaler (recommender, updater, admission-controller)
and Goldilocks dashboard to monitor resource recommendations across all
namespaces. Dashboard at goldilocks.viktorbarzin.me behind Authentik.
2026-02-25 21:54:01 +00:00
Viktor Barzin
f1d90ff840 [ci skip] poison-fountain: fix single point of failure causing transient service outages
- Scale to 2 replicas with RollingUpdate (maxUnavailable=0)
- Add topology spread constraint to place pods on different nodes
- Switch from single-threaded to ThreadingMixIn HTTP server so tarpit
  slow-drip requests no longer block /auth and /healthz endpoints
2026-02-25 21:05:14 +00:00
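The switch to a ThreadingMixIn HTTP server described in the commit above can be illustrated with Python's standard library. This is a minimal sketch, not the poison-fountain source. The handler, the `TARPIT_DELAY_SECONDS` constant, and the `make_server` helper are assumptions for illustration, and only `/healthz` is modelled (the `/auth` endpoint is omitted).

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer
from socketserver import ThreadingMixIn

# Hypothetical delay standing in for a slow-drip tarpit response.
TARPIT_DELAY_SECONDS = 2.0


class ThreadedHTTPServer(ThreadingMixIn, HTTPServer):
    """HTTP server that handles each request in its own thread."""

    daemon_threads = True  # slow requests must not block interpreter shutdown


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            self._reply(b"ok\n")
        else:
            # Stand-in for the tarpit: hold the connection before answering.
            time.sleep(TARPIT_DELAY_SECONDS)
            self._reply(b"...\n")

    def _reply(self, body: bytes) -> None:
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def make_server(port: int = 0) -> ThreadedHTTPServer:
    """Bind to localhost; port 0 lets the OS pick a free port."""
    return ThreadedHTTPServer(("127.0.0.1", port), Handler)
```

Calling `make_server(8080).serve_forever()` would run it. With a plain `HTTPServer`, a request stuck in the tarpit branch would delay `/healthz` until it finished; with the mixin, each request gets its own thread, so the health probe answers immediately.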
Viktor Barzin
071a1a1d93 [ci skip] rybbit: increase clickhouse memory limit to fix OOMKilled crash loop 2026-02-25 20:53:08 +00:00
Viktor Barzin
7bc975aa16 [ci skip] kyverno: scale to 2 replicas, eliminate API calls from policies
- Scale admission controller to 2 replicas with topology spread across nodes
- Rewrite inject-priority-class-from-tier: use namespaceSelector instead of
  API call per pod admission (eliminates Kyverno→API server round-trip)
- Rewrite sync-tier-label-from-namespace: same namespaceSelector approach
- Extract governance_tiers local to DRY up tier definitions
2026-02-24 23:09:56 +00:00