infra

Author	SHA1	Message	Date
Viktor Barzin	bcad200a23	chore: add untracked stacks, scripts, and agent configs - New stacks: beads-server, hermes-agent - Terragrunt tiers.tf for infra, phpipam, status-page - Secrets symlinks for vault, phpipam, hermes-agent - Scripts: cluster_manager, image_pull, containerd pullthrough setup - Frigate config, audiblez-web app source, n8n workflows dir - Claude agent: service-upgrade, reference: upgrade-config.json - Removed: claudeception skill, excalidraw empty submodule, temp listings [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 09:33:06 +00:00
Viktor Barzin	6101fb99f9	Reduce disk write amplification across cluster (~200-350 GB/day savings) [ci skip] - Prometheus: persist metric whitelist (keep rules) to Helm template, preventing regression from 33K to 250K samples/scrape on next apply. Reduce retention 52w→26w. - MySQL InnoDB: aggressive write reduction — flush_log_at_trx_commit=0, sync_binlog=0, doublewrite=OFF, io_capacity=100/200, redo_log=1GB, flush_neighbors=1, reduced page cleaners. - etcd: increase snapshot-count 10000→50000 to reduce WAL snapshot frequency. - VM disks: enable TRIM/discard passthrough to LVM thin pool via create-vm module. - Cloud-init: enable fstrim.timer, journald limits (500M/7d/compress). - Kubelet: containerLogMaxSize=10Mi, containerLogMaxFiles=3. - Technitium: DNS query log retention 0→30 days (was unlimited writes to MySQL). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 19:01:21 +00:00
Viktor Barzin	0de2fef9c9	misc: actualbudget, authentik, headscale, rybbit, terminal, dbaas updates - actualbudget: adjust resource config - authentik: add configuration - headscale: minor fix - rybbit: add resources - terminal: add terminal stack config - platform/dbaas: add config - infra: update lock file	2026-04-06 11:58:00 +03:00
Viktor Barzin	c946e5fdc9	tune controller-manager + apiserver for faster volume detach - kube-controller-manager: --attach-detach-reconcile-sync-period=15s (was 1m default) - kube-apiserver: --default-unreachable-toleration-seconds=60 (was 300s default) - kube-apiserver: --default-not-ready-toleration-seconds=60 (was 300s default) Reduces VolumeAttachment auto-detach from ~6 min to ~2 min on node failure. Applied live + codified in cloud-init template. [ci skip]	2026-04-05 22:14:23 +03:00
Viktor Barzin	2eeb73fc57	add priority-based kubelet graceful shutdown ordering Replace 2-bucket shutdownGracePeriod (240s non-critical / 60s critical) with shutdownGracePeriodByPodPriority for 9-tier ordered shutdown: unclassified(20s) → tier-4-aux(20s) → tier-3-edge(30s) → tier-2-gpu(30s) → tier-1-cluster/DBs(90s) → tier-0-core(30s) → gpu(30s) → sys-cluster(30s) → sys-node(30s). Total: 310s. Apps stop before databases, databases stop before infrastructure. VM shutdown timeout: 300s → 420s. InhibitDelay: 300 → 480. [ci skip]	2026-04-05 20:54:36 +03:00
Viktor Barzin	337da2184d	add upstream fallback to containerd registry mirrors When the pull-through proxy (10.0.20.10) is down, containerd now falls back to the official upstream registries (registry-1.docker.io, ghcr.io) instead of failing. Also cleans up stale disabled registry mirror dirs and removes unnecessary containerd restart from the rollout script.	2026-04-02 11:05:30 +03:00
Viktor Barzin	dd461beb33	add registry blob integrity checker to self-heal corrupted cache The cleanup-tags.sh + garbage-collect cycle can delete blob data while leaving _layers/ link files intact. The registry then returns HTTP 200 with 0 bytes for those layers, causing "unexpected EOF" on image pulls. fix-broken-blobs.sh walks all repositories, checks each layer link against actual blob data, and removes orphaned links so the registry re-fetches from upstream on next pull. Schedule: daily at 2:30am (after tag cleanup) and Sunday 3:30am (after garbage collection). First run found 2335/2556 (91%) of layer links were orphaned.	2026-03-29 22:31:39 +03:00
Viktor Barzin	6bfade3013	update infra stack terraform lock file (helm/kubernetes/vault providers)	2026-03-23 02:24:47 +02:00
Viktor Barzin	1f4e8cb278	use registry.viktorbarzin.me hostname for private images + protect ingress - Switch priority-pass images from 10.0.20.10:5050 to registry.viktorbarzin.me - Add containerd hosts.toml for registry.viktorbarzin.me on all nodes + template (redirects to 10.0.20.10:5050 LAN direct, avoids Traefik round-trip) - Enable Authentik protection on priority-pass ingress	2026-03-23 01:02:27 +02:00
Viktor Barzin	ab7e18c07c	fix registry auth: add Kyverno RBAC for Secrets + containerd TLS skip-verify - Grant kyverno-admission-controller and kyverno-background-controller permissions to manage Secrets (required for generate clone rules) - Add containerd hosts.toml for 10.0.20.10:5050 with skip_verify=true (wildcard cert doesn't cover IP SANs) — applied to all nodes + template	2026-03-22 23:47:29 +02:00
Viktor Barzin	36171bcda4	add htpasswd auth to private docker registry + expose at registry.viktorbarzin.me - Add auth.htpasswd section to config-private.yml - Mount htpasswd file in registry-private container, fix healthcheck for 401 - Rename registry UI from registry.viktorbarzin.me → docker.viktorbarzin.me - Add Docker CLI ingress at registry.viktorbarzin.me (HTTPS backend, no rate-limit, unlimited body) - Add docker to cloudflare_proxied_names (registry stays non-proxied) - Add Kyverno ClusterPolicy to sync registry-credentials secret to all namespaces - Update infra provisioning to install apache2-utils and generate htpasswd from Vault	2026-03-22 22:10:10 +02:00
Viktor Barzin	1c13af142d	sync regenerated providers.tf + upstream changes - Terragrunt-regenerated providers.tf across stacks (vault_root_token variable removed from root generate block) - Upstream monitoring/openclaw/CLAUDE.md changes from rebase	2026-03-22 02:56:04 +02:00
Viktor Barzin	3d22599f7f	fix infra stack: use overwrite strategy for provider generation The child generate block needs if_exists="overwrite" to properly override the root terragrunt's k8s_providers block.	2026-03-22 01:28:25 +02:00
Viktor Barzin	728fbcd3bd	fix infra stack: add vault provider to terragrunt generate block The infra stack's provider override only included proxmox but main.tf uses data "vault_kv_secret_v2" which requires the vault provider.	2026-03-22 01:17:00 +02:00
Viktor Barzin	21bb3036af	state(dbaas): update encrypted state	2026-03-19 20:23:59 +00:00
Viktor Barzin	745e43c983	fix DB password desync + migrate remaining tfvars to Vault DB desync fix: Stacks with Vault DB engine rotation (24h) now read the password from vault-database ClusterSecretStore instead of vault-kv. 9 stacks updated with db ExternalSecrets reading from static-creds/*. Stacks fixed: speedtest, hackmd, health, trading-bot, claude-memory, woodpecker, linkwarden, nextcloud, url. terraform.tfvars migration: - plotting-book: google_client_id/secret → Vault KV + secret_key_ref - tandoor: email_password var removed (was default="", now optional ESO) - infra: ssh_private_key, vm_wizard_password, dockerhub_registry_password → Vault KV at secret/infra + data source	2026-03-15 21:39:45 +00:00
Viktor Barzin	c034adab5f	mitigate cluster instability during terraform applies - Recreate strategy for heavy single-replica deployments (onlyoffice, stirling-pdf) - Reduce maxSurge on multi-replica deployments (traefik, authentik, grafana, kyverno) to prevent memory request surge overwhelming scheduler - Weekly etcd defrag CronJob (Sunday 3 AM) to prevent fragmentation buildup - Disable Kyverno policy reports (ephemeral report cleanup) - Cloud-init: journald persistence + 4Gi swap for worker nodes - Kubelet: LimitedSwap behavior for memory pressure relief	2026-03-15 17:23:39 +00:00
Viktor Barzin	6f562b5da6	add vaultwarden daily backup CronJob to NFS SQLite backup via Online Backup API + copy of RSA keys, attachments, sends, and config. 30-day retention with rotation. Pod affinity ensures co-scheduling with vaultwarden for RWO PVC access.	2026-03-15 00:03:59 +00:00
Viktor Barzin	a8d944eb9b	migrate all secrets from SOPS to Vault KV - Add vault provider to root terragrunt.hcl (generated providers.tf) - Delete stacks/vault/vault_provider.tf (now in generated providers.tf) - Add 124 variable declarations + 43 vault_kv_secret_v2 resources to vault/main.tf to populate Vault KV at secret/<stack-name> - Migrate 43 consuming stacks to read secrets from Vault KV via data "vault_kv_secret_v2" instead of SOPS var-file - Add dependency "vault" to all migrated stacks' terragrunt.hcl - Complex types (maps/lists) stored as JSON strings, decoded with jsondecode() in locals blocks Bootstrap secrets (vault_root_token, vault_authentik_client_id, vault_authentik_client_secret) remain in SOPS permanently. Apply order: vault stack first (populates KV), then all others.	2026-03-14 17:15:48 +00:00
Viktor Barzin	b00f810d3d	Remove all CPU limits cluster-wide to eliminate CFS throttling CPU limits cause CFS throttling even when nodes have idle capacity. Move to a request-only CPU model: keep CPU requests for scheduling fairness but remove all CPU limits. Memory limits stay (incompressible). Changes across 108 files: - Kyverno LimitRange policy: remove cpu from default/max in all 6 tiers - Kyverno ResourceQuota policy: remove limits.cpu from all 5 tiers - Custom ResourceQuotas: remove limits.cpu from 8 namespace quotas - Custom LimitRanges: remove cpu from default/max (nextcloud, onlyoffice) - RBAC module: remove cpu_limits variable and quota reference - Freedify factory: remove cpu_limit variable and limits reference - 86 deployment files: remove cpu from all limits blocks - 6 Helm values files: remove cpu under limits sections	2026-03-14 08:51:45 +00:00
OpenClaw	a09967e098	feat(monitoring): Enhance disk monitoring and containerd GC after node2 incident IMMEDIATE CHANGES (Active Now): - Lower disk warning threshold: 70% → WARN, 85% → FAIL (was 80%/90%) - More aggressive alerting to prevent containerd corruption - Enhanced cluster health check disk monitoring INFRASTRUCTURE CHANGES (Requires Terraform Apply): - Add containerd garbage collection configuration (30min intervals) - More aggressive kubelet eviction policies (15%/20% vs 10%/15%) - Enhanced disk space protection to prevent node2-type failures Root Cause: node2 disk exhaustion corrupted containerd image store Prevention: Proactive monitoring + aggressive cleanup policies [ci skip] - Infrastructure changes require SOPS access for apply	2026-03-13 07:16:56 +00:00
Viktor Barzin	d352d6e7f8	resource quota review: fix OOM risks, close quota gaps, add HA protections Phase 1 - OOM fixes: - dashy: increase memory limit 512Mi→1Gi (was at 99% utilization) - caretta DaemonSet: set explicit resources 300Mi/512Mi (was at 85-98%) - mysql-operator: add Helm resource values 256Mi/512Mi, create namespace with tier label (was at 92% of LimitRange default) - prowlarr, flaresolverr, annas-archive-stacks: add explicit resources (outgrowing 256Mi LimitRange defaults) - real-estate-crawler celery: add resources 512Mi/3Gi (608Mi actual, no explicit resources) Phase 2 - Close quota gaps: - nvidia, real-estate-crawler, trading-bot: remove custom-quota=true labels so Kyverno generates tier-appropriate quotas - descheduler: add tier=1-cluster label for proper classification Phase 3 - Reduce excessive quotas: - monitoring: limits.memory 240Gi→64Gi, limits.cpu 120→64 - woodpecker: limits.memory 128Gi→32Gi, limits.cpu 64→16 - GPU tier default: limits.memory 96Gi→32Gi, limits.cpu 48→16 Phase 4 - Kubelet protection: - Add cpu: 200m to systemReserved and kubeReserved in kubelet template Phase 5 - HA improvements: - cloudflared: add topology spread (ScheduleAnyway) + PDB (maxUnavailable:1) - grafana: add topology spread + PDB via Helm values - crowdsec LAPI: add topology spread + PDB via Helm values - authentik server: add topology spread via Helm values - authentik worker: add topology spread + PDB via Helm values	2026-03-08 18:17:46 +00:00
Viktor Barzin	fffc2ed0ab	fix node OOM: reduce memory overcommit ratio and add kubelet eviction thresholds LimitRange defaults had a 4-8x limit/request ratio causing the scheduler to overpack nodes. When pods burst, nodes OOM-thrashed and became unresponsive (k8s-node3 and k8s-node4 both went down today). Changes: - Increase default memory requests across all tiers (ratio now 2x): - core/cluster: 64Mi → 256Mi request (512Mi limit) - gpu: 256Mi → 1Gi request (2Gi limit) - edge/aux/fallback: 64Mi → 128Mi request (256Mi limit) - Add kubelet memory reservation and eviction thresholds: - systemReserved: 512Mi, kubeReserved: 512Mi - evictionHard: 500Mi (was 100Mi), evictionSoft: 1Gi (was unset) - Applied to all nodes and future node template	2026-03-08 10:33:38 +00:00
Viktor Barzin	1f2c1ca361	[ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars Phase 5 — CI pipelines: - default.yml: add SOPS decrypt in prepare step, change git add . to specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure - renew-tls.yml: change git add . to git add secrets/ state/ Phase 6 — sensitive=true: - Add sensitive = true to 256 variable declarations across 149 stack files - Prevents secret values from appearing in terraform plan output - Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid breaking module interface contracts Note: CI pipeline SOPS decryption requires sops_age_key Woodpecker secret to be created before the pipeline will work with SOPS. Until then, the old terraform.tfvars path continues to function.	2026-03-07 14:30:36 +00:00
Viktor Barzin	de598996f1	[ci skip] remove low-traffic pull-through caches (registry.k8s.io, quay.io, reg.kyverno.io) Pull-through cache at 10.0.20.10 was serving corrupted/truncated images for low-traffic registries, breaking VPA certgen (ImagePullBackOff) and previously causing Kyverno image pull failures. Kept: docker.io (port 5000) and ghcr.io (port 5010) — high traffic, Docker Hub rate limits make caching essential. Removed from cloud-init template and all 5 live nodes: - registry.k8s.io (port 5030) — 14 system images, very low churn - quay.io (port 5020) — 11 images - reg.kyverno.io (port 5040) — 5 images The registry containers on the 10.0.20.10 VM still run but nodes no longer route to them. They can be stopped/removed from the VM later.	2026-03-01 21:46:41 +00:00
Viktor Barzin	96c0353c13	[ci skip] add TLS to private registry, switch to registry.viktorbarzin.me	2026-02-28 19:40:38 +00:00
Viktor Barzin	504e2b01ab	[ci skip] add private registry to Terraform cloud-init provisioning	2026-02-28 17:57:24 +00:00
Viktor Barzin	865b68ce77	[ci skip] Rebuild docker-registry with nginx serialization on all ports Replace individual `docker run` commands with Docker Compose stack managed by systemd. Nginx now fronts all 5 registry ports (5000/5010/5020/5030/5040) with proxy_cache_lock to serialize concurrent blob pulls and prevent corrupt partial responses. Adds QEMU guest agent for remote management.	2026-02-22 21:45:53 +00:00
Viktor Barzin	9415823ab8	Use --queue-ignore-errors for CI (infra stack needs Proxmox SSH)	2026-02-22 18:29:27 +00:00
Viktor Barzin	2547a155ed	Skip infra stack in CI, remove DRONE_IMAGE_CLONE setting	2026-02-22 18:21:10 +00:00
Viktor Barzin	e6420c7b36	[ci skip] Move Terraform modules into stack directories Move all 88 service modules (66 individual + 22 platform) from modules/kubernetes/<service>/ into their corresponding stack directories: - Service stacks: stacks/<service>/module/ - Platform stack: stacks/platform/modules/<service>/ This collocates module source code with its Terragrunt definition. Only shared utility modules remain in modules/kubernetes/: ingress_factory, setup_tls_secret, dockerhub_secret, oauth-proxy. All cross-references to shared modules updated to use correct relative paths. Verified with terragrunt run --all -- plan: 0 adds, 0 destroys across all 68 stacks.	2026-02-22 14:38:14 +00:00
Viktor Barzin	f096a889d6	[ci skip] Add infra stack (Proxmox VMs)	2026-02-22 13:04:49 +00:00
Viktor Barzin	f962349465	[ci skip] Add Terragrunt directory skeleton and root config	2026-02-22 13:01:37 +00:00

33 commits