## Context
Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.
## This change:
- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
`*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
dns_type. 17 hostnames remain centrally managed (Helm ingresses,
special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.
```
BEFORE AFTER
config.tfvars (manual list) stacks/<svc>/main.tf
| module "ingress" {
v dns_type = "proxied"
stacks/cloudflared/ }
for_each = list |
cloudflare_record auto-creates
tunnel per-hostname cloudflare_record + annotation
```
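A per-service stack then only needs the one new argument. A minimal sketch, assuming the other `ingress_factory` inputs shown here (`name`, `namespace`) keep their existing names:

```hcl
# Sketch of the new per-service DNS wiring; only dns_type is the
# parameter added by this change, the other inputs are illustrative.
module "ingress" {
  source    = "../../modules/kubernetes/ingress_factory"
  name      = "myservice"
  namespace = "myservice"

  # "proxied"     -> CNAME to the Cloudflare tunnel (orange cloud)
  # "non-proxied" -> A/AAAA record to the public IP (grey cloud)
  dns_type = "proxied"
}
```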
## What is NOT in this change:
- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Blob caching (content-addressed by SHA256) is unaffected — only manifest
re-validation changes. Every pull now checks upstream for the current
manifest digest, eliminating stale :latest tag issues.
The cleanup-tags.sh + garbage-collect cycle can delete blob data while
leaving _layers/ link files intact. The registry then returns HTTP 200
with 0 bytes for those layers, causing "unexpected EOF" on image pulls.
fix-broken-blobs.sh walks all repositories, checks each layer link
against actual blob data, and removes orphaned links so the registry
re-fetches from upstream on next pull.
Schedule: daily at 2:30am (after tag cleanup) and Sunday 3:30am
(after garbage collection). First run found 2335/2556 (91%) of
layer links were orphaned.
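The link-vs-blob check at the heart of the script can be sketched as below. This is a self-contained illustration that builds a throwaway mock of the registry v2 layout rather than touching a real /var/lib/registry; the actual fix-broken-blobs.sh is assumed to do more (manifest links, dry-run, logging):

```shell
#!/bin/sh
# Sketch of the orphaned-link check (assumed logic, not the real script).
# Builds a throwaway mock of the registry v2 layout, then removes
# _layers link files whose referenced blob has no data file.
set -eu

root=$(mktemp -d)
repos="$root/docker/registry/v2/repositories"
blobs="$root/docker/registry/v2/blobs/sha256"

good="1111111111111111111111111111111111111111111111111111111111111111"
bad="2222222222222222222222222222222222222222222222222222222222222222"

# One healthy layer (link + blob data) and one orphaned link (no data).
for d in "$good" "$bad"; do
  mkdir -p "$repos/myimg/_layers/sha256/$d"
  printf 'sha256:%s' "$d" > "$repos/myimg/_layers/sha256/$d/link"
done
mkdir -p "$blobs/$(printf %s "$good" | cut -c1-2)/$good"
printf 'layer-bytes' > "$blobs/$(printf %s "$good" | cut -c1-2)/$good/data"

# Walk every link file; drop it if the blob data is missing or empty.
find "$repos" -name link | while read -r link; do
  digest=${link%/link}; digest=${digest##*/}
  data="$blobs/$(printf %s "$digest" | cut -c1-2)/$digest/data"
  if [ ! -s "$data" ]; then
    echo "orphaned: $digest"
    rm -f "$link"
  fi
done

find "$repos" -name link | wc -l   # only the healthy link remains
```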
localhost resolves to IPv6 ::1, but the containers bind to 0.0.0.0 (IPv4
only), so wget fails with "Connection refused". The nginx proxy had
18,462 consecutive health check failures because of this.
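Assuming the checks live in the Compose file on the registry VM, one fix is to target the IPv4 loopback explicitly (service name, port, and path are illustrative):

```yaml
services:
  nginx:
    # ...
    healthcheck:
      # Use 127.0.0.1, not localhost: the container binds 0.0.0.0 (IPv4),
      # while localhost resolves to ::1 first and the check is refused.
      test: ["CMD", "wget", "-q", "-O", "/dev/null", "http://127.0.0.1:5000/healthz"]
      interval: 30s
      timeout: 5s
      retries: 3
```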
Also cleared corrupted pull-through cache for mghee/novelapp — the
registry had layer link files pointing to non-existent blob data,
causing containerd to get 200 responses with 0 bytes (unexpected EOF).
- ingress_factory now injects gethomepage.dev/* annotations on all ingresses
(name, group, href, icon) with namespace-to-group mapping
- Stacks with explicit annotations override defaults via merge order
- New homepage_enabled var allows opt-out for internal-only ingresses
- Homepage search widget switched to in-page quicklaunch (Ctrl+K / tap)
- Added hideErrors and quicklaunch settings for clean service directory
- Result: 116/134 ingresses now discoverable (up from ~30)
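Inside ingress_factory the override behavior falls out of merge() argument order. A sketch, with illustrative variable and local names apart from `homepage_enabled`:

```hcl
locals {
  # Illustrative namespace-to-group mapping.
  namespace_groups = {
    monitoring = "Observability"
    media      = "Media"
  }

  # Defaults injected for every ingress; skipped when opted out.
  homepage_defaults = var.homepage_enabled ? {
    "gethomepage.dev/enabled" = "true"
    "gethomepage.dev/name"    = var.name
    "gethomepage.dev/group"   = lookup(local.namespace_groups, var.namespace, "Services")
    "gethomepage.dev/href"    = "https://${var.host}"
    "gethomepage.dev/icon"    = "${var.name}.png"
  } : {}

  # Later arguments to merge() win, so explicit per-stack
  # annotations override the injected defaults.
  annotations = merge(local.homepage_defaults, var.extra_annotations)
}
```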
21+ stale TXT records had accumulated from previous runs, causing the
certbot DNS-01 challenge to fail. Now deletes all _acme-challenge records
from Cloudflare before certbot creates fresh ones.
- Add proxy_intercept_errors + error_page for 502/503/504 on blob locations
to prevent caching truncated upstream responses (root cause of repeated
ImagePullBackOff across services)
- Reduce proxy_cache_lock_timeout from 15m to 5m — fail fast, let containerd
retry instead of all concurrent pulls waiting on a failed first download
- Add proxy_cache_valid any 0 — never cache error responses
- Add /healthz endpoints on Docker Hub and GHCR servers
- Add draintimeout and proxy.ttl to registry proxy configs
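In nginx terms, the blob-location changes amount to roughly the following (cache zone, upstream, and fallback names are illustrative):

```nginx
location ~ ^/v2/.+/blobs/ {
    proxy_pass http://registry_upstream;
    proxy_cache registry_cache;

    # Hand upstream 5xx to error_page instead of caching/serving
    # a truncated response body.
    proxy_intercept_errors on;
    error_page 502 503 504 = @upstream_down;

    # Fail fast so containerd retries, rather than every concurrent
    # pull waiting on a failed first download.
    proxy_cache_lock on;
    proxy_cache_lock_timeout 5m;

    # Cache successful blob fetches; never cache error responses.
    proxy_cache_valid 200 24h;
    proxy_cache_valid any 0;
}

location @upstream_down {
    return 502;
}
```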
- Add auth.htpasswd section to config-private.yml
- Mount htpasswd file in registry-private container, fix healthcheck for 401
- Rename registry UI from registry.viktorbarzin.me → docker.viktorbarzin.me
- Add Docker CLI ingress at registry.viktorbarzin.me (HTTPS backend, no rate-limit, unlimited body)
- Add docker to cloudflare_proxied_names (registry stays non-proxied)
- Add Kyverno ClusterPolicy to sync registry-credentials secret to all namespaces
- Update infra provisioning to install apache2-utils and generate htpasswd from Vault
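The secret sync is the standard Kyverno generate-with-clone pattern; a sketch, with the source namespace and policy name as assumptions:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: sync-registry-credentials
spec:
  rules:
    - name: clone-registry-credentials
      match:
        any:
          - resources:
              kinds: ["Namespace"]
      generate:
        apiVersion: v1
        kind: Secret
        name: registry-credentials
        namespace: "{{request.object.metadata.name}}"
        synchronize: true        # keep the copies updated with the source
        clone:
          namespace: default     # illustrative source namespace
          name: registry-credentials
```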
Deploy error-pages service to show themed error pages instead of raw
Traefik 502/503/504 responses. Adds catch-all IngressRoute (priority 1)
for 404 on unknown hosts. Only 5xx intercepted to avoid breaking JSON APIs.
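The catch-all route can be expressed as a low-priority IngressRoute (Traefik v3 CRD syntax assumed; names and port are illustrative):

```yaml
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: error-pages-catchall
  namespace: error-pages
spec:
  entryPoints: ["websecure"]
  routes:
    # Priority 1 loses to every host-specific route, so this rule
    # only matches requests for hosts no other ingress claims.
    - match: HostRegexp(`.+`)
      priority: 1
      kind: Rule
      services:
        - name: error-pages
          port: 8080
```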
The kured sentinel gate DaemonSet requires /sentinel to exist on
all nodes. Without it, kured pods get stuck in ContainerCreating
with hostPath mount failure. Previously created manually; now
provisioned automatically for new nodes.
Root cause: storage.filesystem.maxsize (5GiB) caused Docker Registry to
delete blob data while keeping metadata. Registry then served 200 OK with
correct Content-Length but 0 bytes body. nginx cached these broken responses.
Fixes:
- Remove maxsize from dockerhub/ghcr proxy configs (rely on weekly GC)
- nginx: don't cache 206 responses, require 2 requests before caching
- Wiped corrupted cache on registry VM and fixed corrupted pause container
blobs on node3/node4
Split /v2/ location into two: regex match for blobs (cached 24h, immutable
content-addressed by SHA256) and prefix match for everything else including
manifests (proxy_cache off, mutable tags). Also remove disabled registries
(quay, k8s, kyverno) whose containers/configs don't exist on the VM.
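The resulting split looks roughly like this (zone and upstream names are illustrative):

```nginx
# Blobs are content-addressed by SHA256, so a cache hit can never be stale.
location ~ ^/v2/.+/blobs/sha256: {
    proxy_pass http://registry_upstream;
    proxy_cache blob_cache;
    proxy_cache_valid 200 24h;
    proxy_cache_min_uses 2;      # require 2 requests before caching
    proxy_no_cache $http_range;  # skip caching when the client sent Range
}

# Everything else, including manifests, where tags are mutable.
location /v2/ {
    proxy_pass http://registry_upstream;
    proxy_cache off;
}
```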
StorageClass mountOptions only apply during dynamic provisioning.
Static PVs (created by Terraform) need mount_options set explicitly.
Without this, all CSI NFS mounts default to hard,timeo=600 — the
exact problem we were trying to fix.
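Concretely, each static PV needs the options inline. A Terraform sketch, with the NFS server, share, and the mount options themselves as illustrative values:

```hcl
resource "kubernetes_persistent_volume" "example" {
  metadata {
    name = "example-nfs"
  }
  spec {
    capacity     = { storage = "10Gi" }
    access_modes = ["ReadWriteMany"]

    # StorageClass mountOptions are ignored for statically created PVs;
    # without this the CSI mount falls back to hard,timeo=600.
    mount_options = ["soft", "timeo=100", "retrans=3"]

    persistent_volume_source {
      csi {
        driver        = "nfs.csi.k8s.io"
        volume_handle = "example-nfs"
        volume_attributes = {
          server = "10.0.0.10"        # illustrative
          share  = "/exports/example" # illustrative
        }
      }
    }
  }
}
```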
Replace individual `docker run` commands with a Docker Compose stack managed
by systemd. Nginx now fronts all 5 registry ports (5000/5010/5020/5030/5040)
with proxy_cache_lock to serialize concurrent blob pulls and prevent
corrupt partial responses. Also adds the QEMU guest agent for remote management.
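The systemd side can be a small oneshot unit; a sketch with an illustrative unit name and working directory:

```ini
# /etc/systemd/system/registry-stack.service (illustrative path)
[Unit]
Description=Registry mirror stack (nginx + registries)
Requires=docker.service
After=docker.service network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/opt/registry
ExecStart=/usr/bin/docker compose up -d --remove-orphans
ExecStop=/usr/bin/docker compose down

[Install]
WantedBy=multi-user.target
```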
Delete 20 orphaned module directories and 3 stray files from
modules/kubernetes/ that are no longer referenced by any stack.
Remove 7 root-level legacy files, including the empty tfstate, a
27MB terraform zip, the commented-out main.tf, and migration notes.
Clean up commented-out dockerhub_secret and oauth-proxy references
in blog, travel_blog, and city-guesser stacks. Remove stale
frigate config.yaml entry from .gitignore. Remove ephemeral
docs/plans/ directory.
Move all 88 service modules (66 individual + 22 platform) from
modules/kubernetes/<service>/ into their corresponding stack directories:
- Service stacks: stacks/<service>/module/
- Platform stack: stacks/platform/modules/<service>/
This collocates module source code with its Terragrunt definition.
Only shared utility modules remain in modules/kubernetes/:
ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy.
All cross-references to shared modules updated to use correct
relative paths. Verified with terragrunt run --all -- plan:
0 adds, 0 destroys across all 68 stacks.
Modules used filebase64("${path.root}/.git/git-crypt/keys/default"),
which breaks under Terragrunt because path.root is now stacks/<service>/
instead of the repo root. Changed them to accept a git_crypt_key_base64
variable and resolve the path in the stack wrapper.
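A sketch of the new wiring (the relative path depth in the stack is illustrative):

```hcl
# modules/.../variables.tf: the module no longer reads the file itself.
variable "git_crypt_key_base64" {
  type        = string
  sensitive   = true
  description = "Base64-encoded git-crypt key, resolved by the stack wrapper."
}

# stacks/<service>/main.tf: resolve relative to the stack, not path.root.
module "service" {
  source               = "./module"
  git_crypt_key_base64 = filebase64("${path.module}/../../.git/git-crypt/keys/default")
}
```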
All 66 service modules removed from modules/kubernetes/main.tf (now
just a migration notice). The kubernetes_cluster module block removed
from root main.tf. All services now managed via stacks/<service>/.
Migrated to stacks/platform/: metallb, dbaas, redis, traefik, technitium,
headscale, authentik, rbac, k8s-portal, crowdsec, monitoring, vaultwarden,
reverse-proxy, metrics-server, nvidia, kyverno, uptime-kuma, wireguard,
xray, mailserver, cloudflared, infra-maintenance.
Also removed null_resource.core_services and all depends_on references to it
from the remaining ~66 service modules.
- Add explicit resource limits to dashy (2Gi memory) to prevent OOMKilled
during webpack build on startup
- Rewrite DNS healthcheck to test from inside the Technitium pod via
kubectl exec, since MetalLB virtual IPs aren't reachable from outside
the L2 network
- Delete orphaned kured/tls-secret (expired Oct 2025; module disabled,
  not mounted by the kured DaemonSet)
Each build pod has 8-10 containers inheriting 1 CPU / 2Gi limits from
LimitRange defaults. With 4+ concurrent builds the old quota (48 CPU /
96Gi / 30 pods) was exhausted, blocking new builds. Increase to 64 CPU /
128Gi / 60 pods to safely support 5-6 concurrent builds.
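As a sanity check: ~10 containers x 1 CPU / 2Gi is up to ~10 CPU / 20Gi per build pod, so 6 builds fit in 64 CPU / 128Gi with headroom. The new quota as a sketch (namespace name is illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: build-quota
  namespace: builds   # illustrative namespace
spec:
  hard:
    limits.cpu: "64"
    limits.memory: 128Gi
    pods: "60"
```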
- Add HLS proxy (hlsproxy) for rewriting m3u8 playlists and proxying
segments with correct Referer/Origin headers (uses ?domain= param)
- Add playerconfig service for detecting stream types (VIPLeague,
DaddyLive, HLS) and extracting auth params from ksohls pages
- Add VIPLeague URL resolution: extract slug from URL path, match
against DaddyLive 24/7 channel index with token-based scoring
- Replace Clappr with direct HLS.js player for better compatibility
- Add CryptoJS CDN for DaddyLive auth module support
- Disable CrowdSec on f1-stream ingress to prevent false positives
- Bump image to v1.3.1
Authentik runs ~10 pods (3 server + 3 worker + 3 pgbouncer + outpost)
which exceeds the default tier-1-cluster quota limits. Add custom-quota
label to opt out of Kyverno-generated quotas and define a Terraform-managed
ResourceQuota with limits appropriate for authentik's workload.
Added `tier = var.tier` to kubernetes_namespace labels in ~73 service
modules. This enables Kyverno to generate LimitRange defaults,
ResourceQuotas, and PriorityClass injection for all namespaces.
Previously only 11 namespaces had tier labels; now all 80 active
namespaces are labeled. All pods restarted in rolling waves to pick
up the new policies.
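The change itself is one label; a sketch of the namespace block as it now looks in each module:

```hcl
resource "kubernetes_namespace" "this" {
  metadata {
    name = var.namespace
    labels = {
      # Kyverno matches on this label to generate LimitRange defaults,
      # a ResourceQuota, and PriorityClass injection for the namespace.
      tier = var.tier
    }
  }
}
```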
When upstream JS constructs URLs via location.origin + '/path', the rw()
function stripped the origin but returned a bare '/path', which hit our
server's HTML index. It now correctly prefixes the path with
/proxy/{b64origin} so XHR/fetch requests for scripts reach the upstream
via the proxy.
Bump image to v1.2.7
Video:
- Add allow="autoplay; encrypted-media; fullscreen" to iframe for media playback
Anti-debug:
- Strip ad/popup scripts (acscdn, popunder) and context menu blockers from HTML
- Strip debugger statements from inline HTML scripts and proxied JS responses
- Intercept setTimeout (not just setInterval) for debugger-based detection
- Override eval() and Function() constructor to strip debugger statements
- Bump image to v1.2.6
The priority injection policy was setting priorityClassName on pods, but
Kubernetes had already defaulted priority=0 and preemptionPolicy=PreemptLowerPriority
on those pods, so the admission controller rejected the mismatch.
Switch from patchStrategicMerge to patchesJson6902 to explicitly remove
the priority and preemptionPolicy fields before setting priorityClassName.
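In the Kyverno mutate rule this becomes an RFC 6902 patch; a sketch, with the class name as an illustrative value (the remove ops are safe here because, per the above, those fields are always present when the policy fires):

```yaml
mutate:
  # Order matters: drop the defaulted fields first, then set the class.
  patchesJson6902: |-
    - op: remove
      path: /spec/priority
    - op: remove
      path: /spec/preemptionPolicy
    - op: add
      path: /spec/priorityClassName
      value: tier-1-default
```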