infra

Author	SHA1	Message	Date
Viktor Barzin	f7acc31d83	[ci skip] update NFS mount skill: add stale mount variant after node reboots New variant documents ghost Running pods with frozen processes after kured rolling reboots. Key diagnostic: Running 1/1 but zero listening sockets from ss -tlnp. Fix: force-delete pods to get fresh NFS mounts.	2026-02-28 19:38:30 +00:00
Viktor Barzin	c229f6ccab	[ci skip] set network observability dashboard auto-refresh to 1h	2026-02-28 19:32:49 +00:00
Viktor Barzin	11eb256511	[ci skip] fix service map coloring: remove arc system, use color field for namespace-based node colors	2026-02-28 19:25:52 +00:00
Viktor Barzin	882350250a	[ci skip] codify CNPG PostgreSQL in Terraform, decommission old NFS-backed PG Phase 1 complete — PostgreSQL fully migrated off NFS: dbaas module changes: - Replace old kubernetes_deployment.postgres with null_resource.pg_cluster (CNPG Cluster CR managed via kubectl apply due to webhook mutation issues) - Update postgresql Service selector: app=postgresql → cnpg primary - Update backup CronJob: use postgres user + read password from CNPG secret (pg-cluster-superuser) instead of hardcoded root password - Add kube_config_path variable for kubectl in null_resource - Old deployment deleted from cluster (was scaled to 0) CNPG cluster status: - 2 instances: primary (k8s-node4), replica (k8s-node2) - PostGIS image (ghcr.io/cloudnative-pg/postgis:16) - 20Gi local-path storage per instance - All 13 dependent services verified running - Backup CronJob verified working with new endpoint	2026-02-28 19:23:36 +00:00
Viktor Barzin	140f48c6ee	fix: remove private registry from docker builds, push only to DockerHub The BuildKit builder cannot push to the insecure HTTP registry at registry.viktorbarzin.lan:5050 because buildkit_config is not being applied by the plugin. Simplified to DockerHub-only push for now. Private registry caching and push can be re-added once buildkit_config issue is resolved.	2026-02-28 19:15:54 +00:00
Viktor Barzin	79775fa2cc	[ci skip] improve network observability dashboard: namespace coloring, layered layout, full-width service map	2026-02-28 19:14:20 +00:00
Viktor Barzin	809407c60b	trigger CI: test insecure registry and buildkit_config fixes for build-cli	2026-02-28 19:09:30 +00:00
Viktor Barzin	a1ba218cd2	[ci skip] Phase 1: PostgreSQL migrated to CNPG on local disk Major milestone - shared PostgreSQL moved from NFS to CloudNativePG: - CNPG cluster (pg-cluster) running in dbaas namespace on local-path storage - PostGIS image (ghcr.io/cloudnative-pg/postgis:16) for dawarich compatibility - All 20 databases and 19 roles restored from pg_dumpall backup - postgresql.dbaas Service patched to point at CNPG primary - Old PG deployment scaled to 0 (NFS data intact for rollback) - All 12+ dependent services verified running: authentik, n8n, dawarich, tandoor, linkwarden, netbox, woodpecker, rybbit, affine, health, resume, trading-bot, atuin - Authentik PgBouncer working through the switched endpoint TODO: codify CNPG cluster in Terraform, add 2nd replica, update backup CronJob	2026-02-28 19:08:06 +00:00
Viktor Barzin	5a62d7b9a5	[ci skip] add resource limits to paperless-ngx and stirling-pdf to prevent OOMKills Both services were hitting the 256Mi Kyverno LimitRange default and getting OOMKilled. Set explicit limits to 1Gi memory for both.	2026-02-28 19:07:59 +00:00
Viktor Barzin	5e177a8889	[ci skip] combine caretta and goflow2 into unified network observability dashboard	2026-02-28 19:04:53 +00:00
Viktor Barzin	b3eaf76684	fix: use cache_images instead of cache_from to avoid comma splitting The plugin-docker-buildx (Codeberg version) changed CacheFrom from string to StringSlice, which causes urfave/cli to split on commas. The cache_images setting properly handles registry refs by generating both --cache-from and --cache-to flags automatically.	2026-02-28 18:53:08 +00:00
Viktor Barzin	c5c3b092e5	[ci skip] fix caretta helm values and goflow2 transport args	2026-02-28 18:51:02 +00:00
Viktor Barzin	87fc11121d	fix: use plain string for cache_from/cache_to and fix caretta helm_release - cache_from/cache_to must be plain strings, not YAML lists — the plugin-docker-buildx treats them as single string values and the Woodpecker settings layer was splitting comma-separated list items into separate --cache-from flags (type=registry and ref=... separately) - caretta.tf: replace deprecated set{} blocks with values=[yamlencode()] to fix Terraform plan error with newer Helm provider	2026-02-28 18:47:20 +00:00
Viktor Barzin	0ebf850893	fix: use YAML list for cache_from/cache_to to prevent comma splitting	2026-02-28 18:40:55 +00:00
Viktor Barzin	00d78c2337	fix: remove orphaned git submodule reference blocking CI clone	2026-02-28 18:34:59 +00:00
Viktor Barzin	be47592e08	fix: remove deprecated secrets field from slack step	2026-02-28 18:32:10 +00:00
Viktor Barzin	e8997ec430	[ci skip] add caretta, goflow2, and prometheus scrape targets to monitoring module	2026-02-28 18:30:20 +00:00
Viktor Barzin	b378f30430	trigger CI: test BuildKit caching (retry)	2026-02-28 18:29:43 +00:00
Viktor Barzin	bf3404bf6b	[ci skip] add goflow2 netflow collector to monitoring module	2026-02-28 18:29:07 +00:00
Viktor Barzin	9d52acd286	[ci skip] add caretta eBPF pod topology to monitoring module	2026-02-28 18:28:09 +00:00
Viktor Barzin	4b6ade7b08	fix: replace removed woodpeckerci/plugin-slack with curl-based webhook	2026-02-28 18:25:23 +00:00
Viktor Barzin	73cb0fe9b3	trigger CI: test BuildKit caching	2026-02-28 18:23:50 +00:00
Viktor Barzin	052662540b	[ci skip] add network visualization implementation plan	2026-02-28 18:19:36 +00:00
Viktor Barzin	887075189a	[ci skip] add network traffic visualization design doc	2026-02-28 18:14:42 +00:00
Viktor Barzin	c36d953573	trigger CI: test BuildKit caching for build-cli pipeline	2026-02-28 18:04:23 +00:00
Viktor Barzin	504e2b01ab	[ci skip] add private registry to Terraform cloud-init provisioning	2026-02-28 17:57:24 +00:00
Viktor Barzin	925dbe39c1	[ci skip] add registry-private service to Docker Compose stack	2026-02-28 17:57:04 +00:00
Viktor Barzin	64c55a6710	[ci skip] add nginx upstream and server block for private registry on port 5050	2026-02-28 17:57:03 +00:00
Viktor Barzin	c7bb324f64	[ci skip] add BuildKit layer caching and dual-push to f1-stream pipeline	2026-02-28 17:56:56 +00:00
Viktor Barzin	4a0fc18c60	[ci skip] add BuildKit layer caching and dual-push to build-cli pipeline	2026-02-28 17:56:52 +00:00
Viktor Barzin	2102ffdb8b	[ci skip] add private R/W registry config for CI build caching	2026-02-28 17:56:50 +00:00
Viktor Barzin	4651b67479	[ci skip] update CI caching plan: add Terraform provisioning for private registry	2026-02-28 17:51:55 +00:00
Viktor Barzin	2adfa86401	[ci skip] add CI build caching implementation plan	2026-02-28 17:46:44 +00:00
Viktor Barzin	5ef03cc0e0	[ci skip] add CI build caching design doc	2026-02-28 17:43:42 +00:00
Viktor Barzin	3633c195cf	[ci skip] install CloudNativePG operator as platform module - CNPG v0.27.1 operator in cnpg-system namespace - CRDs installed: clusters, backups, poolers, databases, etc. - local-path StorageClass already exists (from cloud-init template) - Prerequisite for PostgreSQL migration off NFS	2026-02-28 17:22:53 +00:00
Viktor Barzin	eb32190461	[ci skip] fix OOM crashes: add resource limits for osrm-bicycle, aiostreams, listenarr, authentik - osrm-bicycle: 1Gi limit (loads 403MB routing graph) - aiostreams: 768Mi limit (loads 44K anime entries) - listenarr: 1Gi limit (.NET + Playwright/Chromium) - authentik server: 1Gi limit, worker: 1Gi limit (Django + gunicorn) - servarr: pass nfs_server variable to all submodules	2026-02-28 17:03:33 +00:00
Viktor Barzin	c6beefc845	[ci skip] nextcloud: increase resource limits to prevent OOM crash loop Default LimitRange (256Mi) was too low — pod was using 227Mi/256Mi and getting OOM killed under sync client load, causing 500s and blank web UI.	2026-02-28 16:26:19 +00:00
Viktor Barzin	14b1c43713	[ci skip] expand k8s worker nodes to 256G, update inventory and extend script - k8s-node2: 128G → 256G (160GB free) - k8s-node3: 128G → 256G (135GB free) - k8s-node4: 128G → 256G (127GB free) - k8s-node1: already 256G (51GB free) - extend_vm_storage.sh: increase drain timeout to 300s, add --force flag - Remove Vaultwarden from SQLite migration plan (too risky)	2026-02-28 16:00:16 +00:00
Viktor Barzin	517acd95af	[ci skip] revise storage reliability design based on research agent findings Key changes from v1: - Drop 3-instance replication → 2-instance CNPG, single Redis/MySQL - Remove Headscale from PG migration (project discourages it) - Remove MeshCentral from PG migration (NeDB, not SQLite) - Replace Redis Sentinel with single redis:7 on local disk (modules unused) - Add RAM overcommit warning and mitigation - Add explicit single-host limitation acknowledgment - Add per-component rollback plans - Fix backup strategy (CNPG can't archive WAL to NFS natively) - Reorder migration: low-risk services first, authentik last - Add research gate before each service migration	2026-02-28 14:38:01 +00:00
Viktor Barzin	415d8704d4	[ci skip] add storage reliability design: DB replication + SQLite consolidation	2026-02-28 14:24:42 +00:00
Viktor Barzin	71d4801cca	[ci skip] audiblez-web: switch from digest to tag for CI-driven deploys Woodpecker CI pipeline now pushes tagged images and patches the deployment with the build number tag. Using :latest as the Terraform baseline so CI can override with specific build tags.	2026-02-28 14:17:19 +00:00
Viktor Barzin	0274cc0722	[ci skip] technitium: add primary-secondary DNS HA with AXFR zone replication Secondary instance on a separate node replicates all zones from primary via zone transfer. LoadBalancer routes DNS queries to both pods. PDB ensures at least 1 DNS pod survives voluntary disruptions. Setup job automates zone transfer enablement and secondary zone creation via Technitium REST API.	2026-02-28 14:14:20 +00:00
Viktor Barzin	3ebf4557f5	[ci skip] update claude knowledge: never restart NFS, NFS export dir prereq	2026-02-28 12:20:36 +00:00
Viktor Barzin	69c4c0c76e	[ci skip] VPA: reduce LimitRange defaults, add overcommit check, protect tier-0 - Reduce Kyverno LimitRange default limits ~4x across all tiers to fix 800-900% memory overcommitment on worker nodes - Add cluster health check #25: per-node resource overcommitment showing requests and limits vs allocatable capacity - Add Kyverno policy for Goldilocks VPA mode by tier: tier-0 namespaces get VPA Off mode (recommend only, no evictions) to prevent downtime on critical infra (traefik, cloudflared, authentik, technitium, etc.) - Non-tier-0 namespaces get VPA Auto mode for active right-sizing	2026-02-26 23:15:43 +00:00
Viktor Barzin	250f805c32	[ci skip] Deploy VPA + Goldilocks for dynamic resource right-sizing Add Vertical Pod Autoscaler (recommender, updater, admission-controller) and Goldilocks dashboard to monitor resource recommendations across all namespaces. Dashboard at goldilocks.viktorbarzin.me behind Authentik.	2026-02-25 21:54:01 +00:00
Viktor Barzin	f1d90ff840	[ci skip] poison-fountain: fix single point of failure causing transient service outages - Scale to 2 replicas with RollingUpdate (maxUnavailable=0) - Add topology spread constraint to place pods on different nodes - Switch from single-threaded to ThreadingMixIn HTTP server so tarpit slow-drip requests no longer block /auth and /healthz endpoints	2026-02-25 21:05:14 +00:00
Viktor Barzin	071a1a1d93	[ci skip] rybbit: increase clickhouse memory limit to fix OOMKilled crash loop	2026-02-25 20:53:08 +00:00
Viktor Barzin	7bc975aa16	[ci skip] kyverno: scale to 2 replicas, eliminate API calls from policies - Scale admission controller to 2 replicas with topology spread across nodes - Rewrite inject-priority-class-from-tier: use namespaceSelector instead of API call per pod admission (eliminates Kyverno→API server round-trip) - Rewrite sync-tier-label-from-namespace: same namespaceSelector approach - Extract governance_tiers local to DRY up tier definitions	2026-02-24 23:09:56 +00:00
Viktor Barzin	dcb465a7e5	[ci skip] Fix Woodpecker GitHub forge: add explicit GITHUB_URL to prevent Forgejo URL bleed When both WOODPECKER_GITHUB and WOODPECKER_FORGEJO are enabled without an explicit WOODPECKER_GITHUB_URL, the GitHub forge inherits the Forgejo URL causing all GitHub API calls to hit forgejo.viktorbarzin.me with GitHub OAuth credentials, resulting in 401 Unauthorized on repo add and cron jobs. Also adds Forgejo forge variables to Terraform.	2026-02-24 23:02:33 +00:00
Viktor Barzin	e7e4faa57a	[ci skip] kyverno: fix crash loop — failurePolicy Ignore, increase memory, pin chart Admission controller was restarting every ~5min due to API server timeouts causing leader election loss. failurePolicy:Fail meant the webhook blocked all pod creation cluster-wide when Kyverno was unavailable.	2026-02-24 23:00:45 +00:00

1442 commits