infra

Author	SHA1	Message	Date
Viktor Barzin	0cce9d350a	[ci skip] Redis: upgrade to Bitnami Helm chart with Sentinel HA - Replace manual redis:7-alpine deployment with Bitnami Redis Helm chart v25.3.2 - Architecture: replication with Sentinel (1 master + 1 replica + sentinels) - Automatic failover via Sentinel (quorum=2, masterSet=mymaster) - Service 'redis.redis' always points at current master (transparent to clients) - 120 clients connected immediately after deployment - Sentinel confirmed tracking redis-node-0 as master - Local-path PVCs for persistence (2Gi per node) - Auth disabled (matches previous setup) - Hourly RDB backup CronJob to NFS preserved - OCI chart pulled via pull-through cache (10.0.20.10:5000)	2026-02-28 19:59:58 +00:00
Viktor Barzin	00717d0c7e	[ci skip] color only public IPs red in service map, private IPs (10.x, 192.168.x) get light blue	2026-02-28 19:44:16 +00:00
Viktor Barzin	fdd4e3e467	[ci skip] Phase 2: migrate Redis from NFS to local disk - Switch from redis/redis-stack:latest to redis:7-alpine (modules were completely unused — zero module commands in stats) - Move data from NFS (/mnt/main/redis) to local-path PVC (RDB saves: 39s on NFS → <1s on local disk) - Start fresh (old RDB had redis-stack module data incompatible with plain redis; all Redis data is transient — queues and caches rebuild automatically) - Add hourly redis-backup CronJob: redis-cli --rdb to NFS for backup pipeline - Remove RedisInsight UI ingress (port 8001, only in redis-stack) - Add redis-backup to NFS exports - 110 clients reconnected immediately after switchover - Memory savings: ~100MB from dropping unused modules	2026-02-28 19:44:08 +00:00
Viktor Barzin	3e3699bbc6	[ci skip] add TLS to private registry, switch to registry.viktorbarzin.me	2026-02-28 19:40:38 +00:00
Viktor Barzin	5d745376bf	[ci skip] set network observability dashboard auto-refresh to 1h	2026-02-28 19:32:49 +00:00
Viktor Barzin	849116d08a	[ci skip] fix service map coloring: remove arc system, use color field for namespace-based node colors	2026-02-28 19:25:52 +00:00
Viktor Barzin	5be70fb955	[ci skip] codify CNPG PostgreSQL in Terraform, decommission old NFS-backed PG Phase 1 complete — PostgreSQL fully migrated off NFS: dbaas module changes: - Replace old kubernetes_deployment.postgres with null_resource.pg_cluster (CNPG Cluster CR managed via kubectl apply due to webhook mutation issues) - Update postgresql Service selector: app=postgresql → cnpg primary - Update backup CronJob: use postgres user + read password from CNPG secret (pg-cluster-superuser) instead of hardcoded root password - Add kube_config_path variable for kubectl in null_resource - Old deployment deleted from cluster (was scaled to 0) CNPG cluster status: - 2 instances: primary (k8s-node4), replica (k8s-node2) - PostGIS image (ghcr.io/cloudnative-pg/postgis:16) - 20Gi local-path storage per instance - All 13 dependent services verified running - Backup CronJob verified working with new endpoint	2026-02-28 19:23:36 +00:00
Viktor Barzin	39e3fae488	[ci skip] improve network observability dashboard: namespace coloring, layered layout, full-width service map	2026-02-28 19:14:20 +00:00
Viktor Barzin	9d9c8fdc12	[ci skip] Phase 1: PostgreSQL migrated to CNPG on local disk Major milestone - shared PostgreSQL moved from NFS to CloudNativePG: - CNPG cluster (pg-cluster) running in dbaas namespace on local-path storage - PostGIS image (ghcr.io/cloudnative-pg/postgis:16) for dawarich compatibility - All 20 databases and 19 roles restored from pg_dumpall backup - postgresql.dbaas Service patched to point at CNPG primary - Old PG deployment scaled to 0 (NFS data intact for rollback) - All 12+ dependent services verified running: authentik, n8n, dawarich, tandoor, linkwarden, netbox, woodpecker, rybbit, affine, health, resume, trading-bot, atuin - Authentik PgBouncer working through the switched endpoint TODO: codify CNPG cluster in Terraform, add 2nd replica, update backup CronJob	2026-02-28 19:08:06 +00:00
Viktor Barzin	7724214054	[ci skip] add resource limits to paperless-ngx and stirling-pdf to prevent OOMKills Both services were hitting the 256Mi Kyverno LimitRange default and getting OOMKilled. Set explicit limits to 1Gi memory for both.	2026-02-28 19:07:59 +00:00
Viktor Barzin	5e7dc5ba4a	[ci skip] combine caretta and goflow2 into unified network observability dashboard	2026-02-28 19:04:53 +00:00
Viktor Barzin	c24ae7e8df	[ci skip] fix caretta helm values and goflow2 transport args	2026-02-28 18:51:02 +00:00
Viktor Barzin	205d9f9fc4	fix: use plain string for cache_from/cache_to and fix caretta helm_release - cache_from/cache_to must be plain strings, not YAML lists — the plugin-docker-buildx treats them as single string values and the Woodpecker settings layer was splitting comma-separated list items into separate --cache-from flags (type=registry and ref=... separately) - caretta.tf: replace deprecated set{} blocks with values=[yamlencode()] to fix Terraform plan error with newer Helm provider	2026-02-28 18:47:20 +00:00
Viktor Barzin	085334fbb9	fix: remove orphaned git submodule reference blocking CI clone	2026-02-28 18:34:59 +00:00
Viktor Barzin	22ffd5001d	[ci skip] add caretta, goflow2, and prometheus scrape targets to monitoring module	2026-02-28 18:30:20 +00:00
Viktor Barzin	ca6c7c865a	[ci skip] add goflow2 netflow collector to monitoring module	2026-02-28 18:29:07 +00:00
Viktor Barzin	be89d3c48f	[ci skip] add caretta eBPF pod topology to monitoring module	2026-02-28 18:28:09 +00:00
Viktor Barzin	d232cf72c2	[ci skip] add private registry to Terraform cloud-init provisioning	2026-02-28 17:57:24 +00:00
Viktor Barzin	3f558bd4da	[ci skip] install CloudNativePG operator as platform module - CNPG v0.27.1 operator in cnpg-system namespace - CRDs installed: clusters, backups, poolers, databases, etc. - local-path StorageClass already exists (from cloud-init template) - Prerequisite for PostgreSQL migration off NFS	2026-02-28 17:22:53 +00:00
Viktor Barzin	5318761336	[ci skip] fix OOM crashes: add resource limits for osrm-bicycle, aiostreams, listenarr, authentik - osrm-bicycle: 1Gi limit (loads 403MB routing graph) - aiostreams: 768Mi limit (loads 44K anime entries) - listenarr: 1Gi limit (.NET + Playwright/Chromium) - authentik server: 1Gi limit, worker: 1Gi limit (Django + gunicorn) - servarr: pass nfs_server variable to all submodules	2026-02-28 17:03:33 +00:00
Viktor Barzin	de4dffbab7	[ci skip] nextcloud: increase resource limits to prevent OOM crash loop Default LimitRange (256Mi) was too low — pod was using 227Mi/256Mi and getting OOM killed under sync client load, causing 500s and blank web UI.	2026-02-28 16:26:19 +00:00
Viktor Barzin	2bbf52ac9b	[ci skip] audiblez-web: switch from digest to tag for CI-driven deploys Woodpecker CI pipeline now pushes tagged images and patches the deployment with the build number tag. Using :latest as the Terraform baseline so CI can override with specific build tags.	2026-02-28 14:17:19 +00:00
Viktor Barzin	6aa29e9f77	[ci skip] technitium: add primary-secondary DNS HA with AXFR zone replication Secondary instance on a separate node replicates all zones from primary via zone transfer. LoadBalancer routes DNS queries to both pods. PDB ensures at least 1 DNS pod survives voluntary disruptions. Setup job automates zone transfer enablement and secondary zone creation via Technitium REST API.	2026-02-28 14:14:20 +00:00
Viktor Barzin	16343475b1	[ci skip] VPA: reduce LimitRange defaults, add overcommit check, protect tier-0 - Reduce Kyverno LimitRange default limits ~4x across all tiers to fix 800-900% memory overcommitment on worker nodes - Add cluster health check #25: per-node resource overcommitment showing requests and limits vs allocatable capacity - Add Kyverno policy for Goldilocks VPA mode by tier: tier-0 namespaces get VPA Off mode (recommend only, no evictions) to prevent downtime on critical infra (traefik, cloudflared, authentik, technitium, etc.) - Non-tier-0 namespaces get VPA Auto mode for active right-sizing	2026-02-26 23:15:43 +00:00
Viktor Barzin	a1b669ae97	[ci skip] Deploy VPA + Goldilocks for dynamic resource right-sizing Add Vertical Pod Autoscaler (recommender, updater, admission-controller) and Goldilocks dashboard to monitor resource recommendations across all namespaces. Dashboard at goldilocks.viktorbarzin.me behind Authentik.	2026-02-25 21:54:01 +00:00
Viktor Barzin	a0799d525d	[ci skip] poison-fountain: fix single point of failure causing transient service outages - Scale to 2 replicas with RollingUpdate (maxUnavailable=0) - Add topology spread constraint to place pods on different nodes - Switch from single-threaded to ThreadingMixIn HTTP server so tarpit slow-drip requests no longer block /auth and /healthz endpoints	2026-02-25 21:05:14 +00:00
Viktor Barzin	9caec43351	[ci skip] rybbit: increase clickhouse memory limit to fix OOMKilled crash loop	2026-02-25 20:53:08 +00:00
Viktor Barzin	ef158ed431	[ci skip] kyverno: scale to 2 replicas, eliminate API calls from policies - Scale admission controller to 2 replicas with topology spread across nodes - Rewrite inject-priority-class-from-tier: use namespaceSelector instead of API call per pod admission (eliminates Kyverno→API server round-trip) - Rewrite sync-tier-label-from-namespace: same namespaceSelector approach - Extract governance_tiers local to DRY up tier definitions	2026-02-24 23:09:56 +00:00
Viktor Barzin	900f72d60d	[ci skip] Fix Woodpecker GitHub forge: add explicit GITHUB_URL to prevent Forgejo URL bleed When both WOODPECKER_GITHUB and WOODPECKER_FORGEJO are enabled without an explicit WOODPECKER_GITHUB_URL, the GitHub forge inherits the Forgejo URL causing all GitHub API calls to hit forgejo.viktorbarzin.me with GitHub OAuth credentials, resulting in 401 Unauthorized on repo add and cron jobs. Also adds Forgejo forge variables to Terraform.	2026-02-24 23:02:33 +00:00
Viktor Barzin	882df4cc5c	[ci skip] kyverno: fix crash loop — failurePolicy Ignore, increase memory, pin chart Admission controller was restarting every ~5min due to API server timeouts causing leader election loss. failurePolicy:Fail meant the webhook blocked all pod creation cluster-wide when Kyverno was unavailable.	2026-02-24 23:00:45 +00:00
Viktor Barzin	c06cca288a	[ci skip] fix cluster health: GPU tolerations, actualbudget nfs_server, AuthentikDown alert - Add missing nvidia.com/gpu toleration to ollama and yt-highlights deployments - Add node_selector gpu=true to ollama deployment - Pass nfs_server variable through to actualbudget factory modules - Fix AuthentikDown alert to match actual deployment name (goauthentik-server)	2026-02-24 22:55:58 +00:00
Viktor Barzin	18a873a630	[ci skip] wrongmove dashboard: add per-path latency table, fix layout, sort top offenders Add "Per-Path Latency Breakdown" table with p50/p95/p99 and request rate per endpoint. Fix bar gauge position to sit next to timeseries. Add sort transformation to "Top Offenders (Avg Duration)" panel.	2026-02-24 22:31:41 +00:00
Viktor Barzin	650d785dcd	[ci skip] f1-stream: use v5.0.0 tag to bypass stale pull-through cache	2026-02-24 00:28:12 +00:00
Viktor Barzin	6d88a5df1e	f1-stream: fix frontend routing with catch-all handler for SvelteKit SPA	2026-02-24 00:18:28 +00:00
Viktor Barzin	0dc3fec3f9	f1-stream: fix SvelteKit routing - add trailingSlash for static adapter	2026-02-24 00:11:44 +00:00
Viktor Barzin	fd6a8bb6f2	f1-stream: add pydantic dependency and trigger CI build	2026-02-24 00:00:02 +00:00
Viktor Barzin	a050db616e	[ci skip] f1-stream: add CDN token refresh, SvelteKit frontend, multi-stream layout (Phases 6-8) - Phase 6: CDN token lifecycle with 3-strategy URL matching and periodic refresh - Phase 7: SvelteKit 2/Svelte 5 frontend with schedule calendar and hls.js player - Phase 8: Multi-stream layout supporting up to 4 simultaneous HLS streams - Update Dockerfile to multi-stage build (Node.js frontend + Python backend) - Switch deployment to :latest tag with Always pull policy for CI-driven deploys - Update Woodpecker CI to use explicit latest tag	2026-02-23 23:59:35 +00:00
Viktor Barzin	9bf0523ea9	[ci skip] f1-stream: add stream health checker and HLS proxy (Phases 4-5) Phase 4 - Stream Health and Fallback: - StreamHealthChecker with partial GET validation of m3u8 content - Bitrate extraction from BANDWIDTH tags - Response time measurement for quality ranking - Fallback ordering: live first, fastest response time first - GET /streams now only returns health-verified streams Phase 5 - HLS Proxy Core: - GET /proxy?url= - m3u8 playlist fetch with full URI rewriting - GET /relay?url= - chunked segment relay (never buffers full segment) - m3u8 rewriter handles master, variant, and segment URIs - Base64url encoding for URL parameters - CORS middleware for browser playback - Range header forwarding for seeking support	2026-02-23 23:41:16 +00:00
Viktor Barzin	b29f5ddb06	[ci skip] f1-stream: add extractor framework with demo streams (Phase 3) - BaseExtractor ABC with health_check method - ExtractorRegistry with concurrent fan-out extraction - ExtractionService with in-memory cache and background polling - DemoExtractor with 3 public HLS test streams - Adaptive polling: 5min during live sessions, 30min otherwise - GET /streams, GET /extractors, POST /extract endpoints	2026-02-23 23:02:56 +00:00
Viktor Barzin	4fd3e2d770	[ci skip] f1-stream: add gitignore for __pycache__, remove committed .pyc	2026-02-23 22:55:38 +00:00
Viktor Barzin	d7d347de27	[ci skip] f1-stream: add F1 schedule subsystem (Phase 2) - Fetch 2026 F1 race calendar from jolpica API with all sessions (FP1-3, Qualifying, Sprint, Race) and UTC timestamps - Persist schedule to NFS as JSON, load on startup if fresh - APScheduler daily refresh at 03:00 UTC - GET /schedule endpoint with live/upcoming/past session status - POST /schedule/refresh for manual refresh trigger	2026-02-23 22:55:13 +00:00
Viktor Barzin	f423a4d60c	[ci skip] f1-stream: replace Go service with Python/FastAPI skeleton Replaces the existing Go-based f1-stream service with a new Python/FastAPI backend as the foundation for the rebuilt F1 streaming aggregation service. - New FastAPI backend with health and root endpoints - Python 3.13 slim Dockerfile (replaces Go multi-stage build) - Updated Terraform deployment (port 8000, reduced resources) - Buildx-based redeploy.sh with --platform linux/amd64 - Added Woodpecker CI pipeline for automated builds - Removed all old Go source, node_modules, static assets	2026-02-23 22:47:06 +00:00
Viktor Barzin	c665544f41	[ci skip] fix plotting-book: add SESSION_SECRET env var Session secret stored in encrypted terraform.tfvars, referenced via variable to avoid committing secrets in plain text.	2026-02-23 22:44:04 +00:00
Viktor Barzin	85f88bf167	[ci skip] platform: add ndots=2 dns_config to all deployment pod specs Prevents Terraform from reverting the Kyverno inject-ndots mutation on every apply. 23 pod specs across 19 platform module files.	2026-02-23 22:43:05 +00:00
Viktor Barzin	a2a83d30aa	[ci skip] monitoring: increase resource quota limits Bump limits.cpu 80→120 and limits.memory 160Gi→240Gi to provide headroom. Previous values were at 87% and 92% utilization.	2026-02-23 22:42:30 +00:00
Viktor Barzin	e982a8ad81	[ci skip] fix redis OOMKilled: increase memory limits to 2Gi Redis was CrashLoopBackOff due to OOMKilled - 512Mi limit was insufficient for 650MB RDB dataset plus redis-stack modules.	2026-02-23 22:37:56 +00:00
Viktor Barzin	2789c0fa5c	[ci skip] add trading-bot Terraform stack	2026-02-23 22:29:59 +00:00
Viktor Barzin	2d919c4d34	[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability Phase 1 - Critical Security: - Netbox: move hardcoded DB/superuser passwords to variables - MeshCentral: disable public registration, add Authentik auth - Traefik: disable insecure API dashboard (api.insecure=false) - Traefik: configure forwarded headers with Cloudflare trusted IPs Phase 2 - Security Hardening: - Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.) - Add Kyverno pod security policies in audit mode (privileged, host namespaces, SYS_ADMIN, trusted registries) - Tighten rate limiting (avg=10, burst=50) - Add Authentik protection to grampsweb Phase 3 - Monitoring & Alerting: - Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale, Authentik, Loki) - Increase Loki retention from 7 to 30 days (720h) - Add predictive PV filling alert (predict_linear) - Re-enable Hackmd and Privatebin down alerts Phase 4 - Reliability: - Add resource requests/limits to Redis, DBaaS, Technitium, Headscale, Vaultwarden, Uptime Kuma - Increase Alloy DaemonSet memory to 512Mi/1Gi Phase 6 - Maintainability: - Extract duplicated tiers locals to terragrunt.hcl generate block (removed from 67 stacks) - Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114 instances across 63 files) - Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references with variables across ~35 stacks - Migrate xray raw ingress resources to ingress_factory modules	2026-02-23 22:05:28 +00:00
Viktor Barzin	48083bb1fd	Reorder realestate-crawler Grafana dashboard sections Move API Performance and Per-Endpoint Latency to the top. Move Scraping Overview, Scraping Activity, and Throttling & Errors to the bottom. Keeps the most operationally relevant panels visible first.	2026-02-23 22:03:27 +00:00
Viktor Barzin	449937e22e	Sync realestate-crawler Grafana dashboard with per-endpoint latency panels	2026-02-23 21:31:01 +00:00

85 commits