infra

Author	SHA1	Message	Date
Viktor Barzin	fffc2ed0ab	fix node OOM: reduce memory overcommit ratio and add kubelet eviction thresholds LimitRange defaults had a 4-8x limit/request ratio causing the scheduler to overpack nodes. When pods burst, nodes OOM-thrashed and became unresponsive (k8s-node3 and k8s-node4 both went down today). Changes: - Increase default memory requests across all tiers (ratio now 2x): - core/cluster: 64Mi → 256Mi request (512Mi limit) - gpu: 256Mi → 1Gi request (2Gi limit) - edge/aux/fallback: 64Mi → 128Mi request (256Mi limit) - Add kubelet memory reservation and eviction thresholds: - systemReserved: 512Mi, kubeReserved: 512Mi - evictionHard: 500Mi (was 100Mi), evictionSoft: 1Gi (was unset) - Applied to all nodes and future node template	2026-03-08 10:33:38 +00:00
Viktor Barzin	9d031290cc	[ci skip] fix Homepage icons for Tandoor, Listenarr, Networking Toolbox, Goldilocks tandoor.png → tandoor-recipes.png (dashboard-icons), podcast.png → mdi-podcast, networking.png → mdi-lan, goldilocks.png → mdi-scale-balance	2026-03-07 21:29:51 +00:00
Viktor Barzin	7af0024473	[ci skip] fix pfSense widget: wan interface is vtnet0 not vmx0	2026-03-07 20:39:56 +00:00
Viktor Barzin	3ec643a897	[ci skip] fix pfSense widget: remove wanStatus (API v2 missing gateway endpoint) Replace wanStatus with temp field. Remove wan interface param since the pfSense REST API v2 package doesn't expose /status/gateway.	2026-03-07 20:39:56 +00:00
Viktor Barzin	f3042f318e	[ci skip] fix widget issues: ports, Immich v2 API, Nextcloud trusted domains - qBittorrent: use service port 80 (not container port 8080) - Immich: add version=2 for new API endpoints (/api/server/*) - Nextcloud: use external URL (internal rejects untrusted Host header) - HA London: remove widget (token expired, needs manual regeneration) - Headscale: remove widget (requires nodeId param, not overview)	2026-03-07 20:39:56 +00:00
Viktor Barzin	17256c8f76	[ci skip] fix widget URLs: use correct k8s service ports Services expose port 80 via ClusterIP but widgets were using container target ports (5000, 3001, 4533, 3000). Calibre was using external URL through Authentik. All now use correct internal service URLs.	2026-03-07 20:39:56 +00:00
Viktor Barzin	57eed07370	[ci skip] add widgets for qbittorrent, navidrome, nextcloud, freshrss, linkwarden, uptime-kuma Add API credentials to SOPS and wire homepage_credentials through stacks. Re-add Uptime Kuma widget with new "infra" status page slug.	2026-03-07 20:39:55 +00:00
Viktor Barzin	10acdcd5a2	[ci skip] add widgets for audiobookshelf, changedetection, prowlarr, headscale Wire homepage_credentials through servarr parent stack for prowlarr. Fix paperless-ngx widget to use internal service URL.	2026-03-07 20:39:55 +00:00
Viktor Barzin	1f1700c4ff	[ci skip] fix broken Homepage widgets + add service API tokens to SOPS - Grafana: fix service URL (grafana not monitoring-grafana) - Uptime Kuma: remove widget (no status page configured) - Speedtest/Frigate/Immich: use internal k8s service URLs (external goes through Authentik forward auth, blocking API calls) - pfSense: clean up annotations - SOPS: add headscale, prowlarr, changedetection, audiobookshelf tokens	2026-03-07 20:39:55 +00:00
Viktor Barzin	a9daf50142	[ci skip] add Homepage widget credentials for Authentik, Shlink, Home Assistant Wire homepage_credentials tokens through platform stack to enable live widgets for Authentik, Shlink (URL shortener), and Home Assistant London. Update SOPS with new credential entries.	2026-03-07 20:39:54 +00:00
Viktor Barzin	6bd3970579	[ci skip] add Homepage gethomepage.dev annotations to all services Add Kubernetes ingress annotations for Homepage auto-discovery across ~88 services organized into 11 groups. Enable serviceAccount for RBAC, configure group layouts, and add Grafana/Frigate/Speedtest widgets.	2026-03-07 20:39:54 +00:00
Viktor Barzin	b6aacf7b02	[ci skip] fix Svelte 5 table structure (thead/tbody required) + use versioned image tag Fixed architecture and services pages to wrap table rows in <thead>/<tbody> as required by Svelte 5's strict HTML validation. E2E test passed: clean Alpine container → setup script → kubectl installed → CA cert verified against API server → TLS SUCCESS	2026-03-07 15:34:32 +00:00
Viktor Barzin	6f8b48a73c	[ci skip] k8s portal: fix setup script + add onboarding hub (5 new pages) Bug fixes: - CA cert now populated in ConfigMap (was empty → TLS failures) - Remove useless heredoc quote escaping in setup script - Fix homepage: VPN callout, correct verification command (get namespaces) - Fix false-positive sensitive=true on ingress_path, tls_secret_name, truenas_host, ollama_host, client_certificate_secret_name New pages (direct Svelte, no mdsvex dependency): - /onboarding: step-by-step guide (VPN, kubectl, git, first PR) - /architecture: cluster topology, storage, networking, tiers - /services: catalog of 70+ services with URLs - /contributing: PR workflow, what you can/can't change, NEVER list - /troubleshooting: common issues and fixes Navigation bar added to layout. All pages use consistent docs styling. Requires Docker image rebuild: cd stacks/platform/modules/k8s-portal/files && docker build -t viktorbarzin/k8s-portal:latest . && docker push	2026-03-07 15:06:26 +00:00
Viktor Barzin	1f2c1ca361	[ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars Phase 5 — CI pipelines: - default.yml: add SOPS decrypt in prepare step, change git add . to specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure - renew-tls.yml: change git add . to git add secrets/ state/ Phase 6 — sensitive=true: - Add sensitive = true to 256 variable declarations across 149 stack files - Prevents secret values from appearing in terraform plan output - Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid breaking module interface contracts Note: CI pipeline SOPS decryption requires sops_age_key Woodpecker secret to be created before the pipeline will work with SOPS. Until then, the old terraform.tfvars path continues to function.	2026-03-07 14:30:36 +00:00
Viktor Barzin	614d14c47d	[ci skip] expand Prometheus PVC to 200Gi, increase retention to 180GB for 1-year history Storage analysis: ~10.5 GB/month ingestion rate, 1 year = ~125 GB + overhead. PVC: 30Gi → 200Gi, retention.size: 45GB → 180GB. Historical TSDB data restored from NFS (39.8 GB total including all blocks).	2026-03-06 23:16:32 +00:00
Viktor Barzin	a52a371e35	[ci skip] expand Prometheus iSCSI PVC to 30Gi for historical data restore	2026-03-06 22:51:38 +00:00
Viktor Barzin	100a876dfe	[ci skip] migrate Redis, Prometheus, Loki storage to iSCSI - Redis: local-path → iscsi-truenas (master + replica persistence) - Prometheus: NFS PV+PVC → dynamic iSCSI PVC (prometheus-data) - Loki: NFS PV → dynamic iSCSI via storageClass in Helm values - Deleted 2 orphaned Released iSCSI PVs (31Gi freed)	2026-03-06 20:50:55 +00:00
Viktor Barzin	a48915ee02	[ci skip] exclude linkwarden from HighService4xxRate alert	2026-03-06 20:15:58 +00:00
Viktor Barzin	0638e2cc2e	[ci skip] iSCSI migration, healthcheck fixes, health probes, etcd backup - Migrate MySQL/PostgreSQL storage from local-path to iscsi-truenas - Add democratic-csi iSCSI driver module for TrueNAS - Add open-iscsi to cloud-init VM template - Fix Shlink health probe path (/api/v3 -> /rest/v3 for Shlink 5.0) - Fix etcd backup: use etcd 3.5.21-0 (3.6.x is distroless, no /bin/sh) - Fix cluster healthcheck CronJob: always exit 0 to prevent circular JobFailed alerts (reporting via Slack, not exit codes) - Fix Uptime Kuma nested list handling in cluster-health.sh - Add health probes to: audiobookshelf, immich ML, ntfy, headscale, uptime-kuma, vaultwarden, rybbit (clickhouse + server + client), shlink, shlink-web - Add iSCSI storage documentation to CLAUDE.md	2026-03-06 19:54:21 +00:00
Viktor Barzin	87ef313888	[ci skip] fix post-NFS-migration issues: MySQL GR, Loki, grampsweb, alerts - Loki: reduce memory limit from 6Gi to 4Gi (within LimitRange max) - Grampsweb: increase memory to 2Gi (was OOMKilled at 512Mi) - Fix PostgreSQLDown alert: check pod readiness instead of deployment - Fix MySQLDown alert: check StatefulSet replicas instead of deployment - Fix RedisDown alert: check StatefulSet replicas instead of deployment - Fix NFSServerUnresponsive: aggregate all NFS versions cluster-wide - Fix Uptime Kuma healthcheck: handle nested list heartbeat format - Update etcd backup image to registry.k8s.io/etcd:3.6.5-0	2026-03-03 21:10:26 +00:00
Viktor Barzin	22223ec0fd	[ci skip] fix OOMKill: prometheus (4Gi), kyverno-reports (512Mi), grampsweb (512Mi) - Prometheus server: explicit 1Gi req / 4Gi limit (was inheriting 512Mi LimitRange default) - Kyverno reports controller: 128Mi req / 512Mi limit (was 128Mi Helm default) - Grampsweb: 256Mi req / 512Mi limit for both containers (was 256Mi LimitRange default)	2026-03-02 21:39:14 +00:00
Viktor Barzin	307b356f06	[ci skip] fix: add mount_options to all NFS PVs (soft,timeo=30,retrans=3) Critical fix: StorageClass mountOptions only apply during dynamic provisioning. Our static PVs (created by Terraform) were missing mount_options, so all NFS mounts defaulted to hard,timeo=600 — the exact stale mount behavior we were trying to eliminate. Adds mount_options directly to the nfs_volume module PV spec and to the monitoring PVs (prometheus, loki, alertmanager). Requires re-applying all stacks to propagate to existing PVs.	2026-03-02 20:23:36 +00:00
Viktor Barzin	0abae33c71	[ci skip] complete NFS CSI migration: complex stacks + platform modules Migrate remaining multi-volume stacks and all platform modules from inline NFS volumes to CSI-backed PV/PVC with nfs-truenas StorageClass (soft,timeo=30,retrans=3 mount options). Complex stacks: openclaw (4 vols), immich (8 vols), frigate (2 vols), nextcloud (2 vols + old PV replaced), rybbit (1 vol) Remaining stacks: affine, ebook2audiobook, f1-stream, osm_routing, real-estate-crawler Platform modules: monitoring (prometheus, loki, alertmanager PVs converted from native NFS to CSI), redis, dbaas, technitium, headscale, vaultwarden, uptime-kuma, mailserver, infra-maintenance	2026-03-02 01:24:07 +00:00
Viktor Barzin	c702fd2565	[ci skip] add NFS CSI driver + nfs_volume shared module - Deploy csi-driver-nfs Helm chart as platform module (nfs-csi) - Create nfs-truenas StorageClass with soft,timeo=30,retrans=3 mount options - Add shared nfs_volume module for PV/PVC boilerplate (modules/kubernetes/nfs_volume/)	2026-03-01 23:38:58 +00:00
Viktor Barzin	9e4fb23b10	[ci skip] right-size all pod resources based on VPA + live metrics audit Full cluster resource audit: cross-referenced Goldilocks VPA recommendations, live kubectl top metrics, and Terraform definitions for 100+ containers. Critical fixes: - dashy: CPU throttled at 98% (490m/500m) → 2 CPU limit - stirling-pdf: CPU throttled at 99.7% (299m/300m) → 2 CPU limit - traefik auth-proxy/bot-block-proxy: mem limit 32Mi → 128Mi Added explicit resources to ~40 containers that had none: - audiobookshelf, changedetection, cyberchef, dawarich, diun, echo, excalidraw, freshrss, hackmd, isponsorblocktv, linkwarden, n8n, navidrome, ntfy, owntracks, privatebin, send, shadowsocks, tandoor, tor-proxy, wealthfolio, networking-toolbox, rybbit, mailserver, cloudflared, pgadmin, phpmyadmin, crowdsec-web, xray, wireguard, k8s-portal, tuya-bridge, ollama-ui, whisper, piper, immich-server, immich-postgresql, osrm-foot GPU containers: added CPU/mem alongside GPU limits: - ollama: removed CPU/mem limits (models vary in size), keep GPU only - frigate: req 500m/2Gi, lim 4/8Gi + GPU - immich-ml: req 100m/1Gi, lim 2/4Gi + GPU Right-sized ~25 over-provisioned containers: - kms-web-page: 500m/512Mi → 50m/64Mi (was using 0m/10Mi) - onlyoffice: CPU 8 → 2 (VPA upper 45m) - realestate-crawler-api: CPU 2000m → 250m - blog/travel-blog/webhook-handler: 500m → 100m - coturn/health/plotting-book: reduced to match actual usage Conservative methodology: limits = max(VPA upper * 2, live usage * 2)	2026-03-01 19:18:50 +00:00
Viktor Barzin	ccf0b2232f	[ci skip] switch VPA to off mode globally, fix Ollama/MySQL resources - Kyverno policy: VPA mode set to 'off' for all namespaces (was 'initial' for non-core). Terraform is now sole authority for container resources. Goldilocks provides recommendations only. - Ollama: add explicit CPU/memory resources (500m/4Gi req, 4/12Gi limit) alongside GPU allocation. Fixes OOMKill from VPA scaling down resources. - MySQL InnoDB Cluster: bump memory limit from 2Gi to 3Gi. - Remove redundant per-namespace VPA opt-out labels from onlyoffice, openclaw, trading-bot (now handled globally by Kyverno policy).	2026-03-01 19:03:49 +00:00
Viktor Barzin	f2678d3494	[ci skip] fix MySQL cluster RBAC, Kyverno policy bugs, Nextcloud memory - dbaas: add mysql-sidecar-extra ClusterRole for namespaces/CRD list/watch needed by kopf framework in sidecar containers - kyverno: restrict inject-priority-class-from-tier to CREATE operations only (was blocking pod patches with immutable spec error) - kyverno: add resource-governance/custom-limitrange label opt-out to LimitRange generation policy (mirrors existing custom-quota) - nextcloud: bump memory limit 4Gi -> 6Gi, add custom LimitRange with 8Gi max, opt out of Kyverno-managed LimitRange	2026-03-01 17:16:03 +00:00
Viktor Barzin	f491073cca	[ci skip] redis: pin service to master pod to fix read-only errors The Bitnami Redis Sentinel chart's service selects all nodes (master + replicas). Clients using plain redis:// URLs (paperless-ngx, etc.) randomly hit read-only replicas, causing write failures. Pin the service to redis-node-0 (master).	2026-03-01 17:13:25 +00:00
Viktor Barzin	da943c71ac	[ci skip] dbaas: add custom resource quota (12Gi req mem) to support 3-node MySQL cluster	2026-03-01 15:47:11 +00:00
Viktor Barzin	84f76a15c2	[ci skip] add PoisonFountainDown and ForwardAuthFallbackActive alerts with inhibition	2026-03-01 15:05:57 +00:00
Viktor Barzin	7ff3c61bd7	[ci skip] add retry middleware (2 attempts, 100ms) to default ingress chain	2026-03-01 14:35:53 +00:00
Viktor Barzin	78c0956ab5	[ci skip] add Authentik PDB (minAvailable=2)	2026-03-01 14:24:47 +00:00
Viktor Barzin	0639719e5c	[ci skip] add Traefik topology spread, PDB (minAvailable=2), and 30s response timeout	2026-03-01 14:18:54 +00:00
Viktor Barzin	f37bcf4717	[ci skip] add auth resilience proxy: basicAuth fallback when Authentik is down	2026-03-01 14:13:05 +00:00
Viktor Barzin	cd4c5ead1e	[ci skip] add bot-block resilience proxy: fail-open when Poison Fountain is down	2026-03-01 14:05:41 +00:00
Viktor Barzin	c07870d068	[ci skip] fix MySQL service: point at mysqld pods, pin to healthy primary The InnoDB Cluster Router (mysqlrouter) doesn't deploy when the cluster lacks quorum. Changed service selector from mysqlrouter to mysqld with publishNotReadyAddresses=true to bypass the operator's readiness gate. Pinned to mysql-cluster-1 (healthy primary) until full cluster recovers.	2026-03-01 12:16:28 +00:00
Viktor Barzin	1101242036	[ci skip] MySQL: deploy InnoDB Cluster via Oracle MySQL Operator - MySQL Operator v2.2.7 in mysql-operator namespace (on control-plane) - InnoDB Cluster: 3 MySQL 9.2.0 servers + 1 Router, local-path storage - Group Replication with automatic failover via MySQL Router - Compatibility service: mysql.dbaas:3306 → Router port 6446 - Images from container-registry.oracle.com (not Docker Hub) - Init containers are slow (~20 min) due to mysqlsh plugin loading - Data restore from mysqldump pending after cluster is ONLINE	2026-03-01 03:00:21 +00:00
Viktor Barzin	6139052104	[ci skip] add graceful degradation to CrowdSec bouncer middleware P0: Set updateMaxFailure=-1 (fail-open) Previously defaulted to 0 which blocked ALL traffic on first LAPI failure. Now serves from cached decisions when LAPI is unreachable. P1: Enable Redis cache for CrowdSec decisions Decisions are now shared across all 3 Traefik replicas and survive pod restarts. redisCacheUnreachableBlock=false prevents Redis from becoming another SPOF. P1: Add clientTrustedIPs for internal cluster traffic Node CIDR (10.0.20.0/24) and pod CIDR (10.10.0.0/16) bypass CrowdSec entirely, preventing internal cascade failures.	2026-03-01 02:36:53 +00:00
Viktor Barzin	c20312e41f	[ci skip] revert MySQL to NFS — Bitnami images unavailable on Docker Hub Bitnami MySQL images can't be pulled (not found on Docker Hub, likely moved to a different registry). Reverted MySQL to single instance on NFS as the known-working state. MySQL replication to be revisited once image availability is resolved. PostgreSQL and Redis remain on local disk with replication.	2026-02-28 22:53:33 +00:00
Viktor Barzin	f9a4823ccc	[ci skip] switch VPA from Auto to Initial mode for Terraform compatibility VPA Auto mode modifies Deployment specs at runtime, causing conflicts with Terraform on every apply (drift -> reset -> VPA evict loop). Initial mode only mutates Pod resource requests at creation time via the admission webhook, leaving the Deployment spec unchanged. This means terraform plan shows no drift while pods still get VPA-optimized resources on every restart. - 171 VPAs switched from Auto to Initial - 20 VPAs remain Off (tier-0 critical services) - Goldilocks dashboard continues to show recommendations	2026-02-28 22:43:29 +00:00
Viktor Barzin	f64c979ba5	[ci skip] tune resource limits and requests across 10 services Critical OOM fixes (add/increase limits): - netbox: add 512Mi limit (was at 98.8% of Kyverno default 256Mi) - speedtest: add 512Mi limit (was at 80.9%) - meshcentral: add 384Mi limit (was at 72.7%) - ytdlp: uncomment resources, set 512Mi limit (was at 74.6%) Over-provisioned (reduce limits): - dashy: 2Gi → 512Mi (was using 135Mi) - redis master: 2Gi → 256Mi (was using 14Mi) - redis replica: 1Gi → 256Mi (was using 12Mi) - resume printer: 2Gi → 512Mi (was using 108Mi) - resume app: 1Gi → 384Mi (was using 125Mi) - openclaw: 4Gi → 1Gi (was using 372Mi) Under-provisioned requests (increase): - authentik server: 256Mi → 512Mi request (actual ~560Mi) - authentik worker: 256Mi → 384Mi request (actual ~400Mi) New explicit resources (previously Kyverno defaults): - forgejo: add 512Mi limit, 64Mi request	2026-02-28 21:59:08 +00:00
Viktor Barzin	ac482b5324	[ci skip] Phase 3: migrate MySQL from NFS to local disk - Add local-path PVC for MySQL data (10Gi, actual usage ~27GB) - Init container seeds data from NFS on first run (cp -a) - NFS volume kept as read-only seed source in init container - MySQL 9.2.0 running on local disk with proper fsync - All dependent services verified running: hackmd, speedtest, onlyoffice, paperless-ngx - mysqldump backup taken before migration - Existing daily mysqldump CronJob unchanged (writes to NFS)	2026-02-28 20:41:07 +00:00
Viktor Barzin	58644e036f	[ci skip] Redis: upgrade to Bitnami Helm chart with Sentinel HA - Replace manual redis:7-alpine deployment with Bitnami Redis Helm chart v25.3.2 - Architecture: replication with Sentinel (1 master + 1 replica + sentinels) - Automatic failover via Sentinel (quorum=2, masterSet=mymaster) - Service 'redis.redis' always points at current master (transparent to clients) - 120 clients connected immediately after deployment - Sentinel confirmed tracking redis-node-0 as master - Local-path PVCs for persistence (2Gi per node) - Auth disabled (matches previous setup) - Hourly RDB backup CronJob to NFS preserved - OCI chart pulled via pull-through cache (10.0.20.10:5000)	2026-02-28 19:59:58 +00:00
Viktor Barzin	3974827ea9	[ci skip] color only public IPs red in service map, private IPs (10.x, 192.168.x) get light blue	2026-02-28 19:44:16 +00:00
Viktor Barzin	2b22c90a56	[ci skip] Phase 2: migrate Redis from NFS to local disk - Switch from redis/redis-stack:latest to redis:7-alpine (modules were completely unused — zero module commands in stats) - Move data from NFS (/mnt/main/redis) to local-path PVC (RDB saves: 39s on NFS → <1s on local disk) - Start fresh (old RDB had redis-stack module data incompatible with plain redis; all Redis data is transient — queues and caches rebuild automatically) - Add hourly redis-backup CronJob: redis-cli --rdb to NFS for backup pipeline - Remove RedisInsight UI ingress (port 8001, only in redis-stack) - Add redis-backup to NFS exports - 110 clients reconnected immediately after switchover - Memory savings: ~100MB from dropping unused modules	2026-02-28 19:44:08 +00:00
Viktor Barzin	c229f6ccab	[ci skip] set network observability dashboard auto-refresh to 1h	2026-02-28 19:32:49 +00:00
Viktor Barzin	11eb256511	[ci skip] fix service map coloring: remove arc system, use color field for namespace-based node colors	2026-02-28 19:25:52 +00:00
Viktor Barzin	882350250a	[ci skip] codify CNPG PostgreSQL in Terraform, decommission old NFS-backed PG Phase 1 complete — PostgreSQL fully migrated off NFS: dbaas module changes: - Replace old kubernetes_deployment.postgres with null_resource.pg_cluster (CNPG Cluster CR managed via kubectl apply due to webhook mutation issues) - Update postgresql Service selector: app=postgresql → cnpg primary - Update backup CronJob: use postgres user + read password from CNPG secret (pg-cluster-superuser) instead of hardcoded root password - Add kube_config_path variable for kubectl in null_resource - Old deployment deleted from cluster (was scaled to 0) CNPG cluster status: - 2 instances: primary (k8s-node4), replica (k8s-node2) - PostGIS image (ghcr.io/cloudnative-pg/postgis:16) - 20Gi local-path storage per instance - All 13 dependent services verified running - Backup CronJob verified working with new endpoint	2026-02-28 19:23:36 +00:00
Viktor Barzin	79775fa2cc	[ci skip] improve network observability dashboard: namespace coloring, layered layout, full-width service map	2026-02-28 19:14:20 +00:00
Viktor Barzin	5e177a8889	[ci skip] combine caretta and goflow2 into unified network observability dashboard	2026-02-28 19:04:53 +00:00


86 commits
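
Commit 9e4fb23b10 states its sizing rule as "limits = max(VPA upper * 2, live usage * 2)". The sketch below shows one way that rule could be written down. It is not taken from the repository: the function name, the `headroom` parameter, and the sample figures are assumptions made for illustration, and the commit does not say whether floors or rounding were also applied.

```python
def conservative_limit(vpa_upper: float, live_usage: float, headroom: float = 2.0) -> float:
    """Return the larger of the two inputs after scaling each by `headroom`.

    Mirrors the rule quoted in the commit message:
        limit = max(VPA upper * 2, live usage * 2)
    Both inputs must share a unit (for example MiB or millicores).
    """
    if vpa_upper < 0 or live_usage < 0:
        raise ValueError("resource figures must be non-negative")
    return max(vpa_upper * headroom, live_usage * headroom)


# Hypothetical figures in MiB, not measurements from the cluster.
print(conservative_limit(vpa_upper=180, live_usage=135))  # 360.0
```

Taking the maximum of both signals means a container is never sized below twice its observed usage, even when the VPA upper bound lags behind a recent increase.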