infra

Author	SHA1	Message	Date
Viktor Barzin	db68067925	[ci skip] phase 5+6: update CI pipelines for SOPS, add sensitive=true to secret vars Phase 5 — CI pipelines: - default.yml: add SOPS decrypt in prepare step, change git add . to specific paths (stacks/ state/ .woodpecker/), cleanup on success+failure - renew-tls.yml: change git add . to git add secrets/ state/ Phase 6 — sensitive=true: - Add sensitive = true to 256 variable declarations across 149 stack files - Prevents secret values from appearing in terraform plan output - Does NOT modify shared modules (ingress_factory, nfs_volume) to avoid breaking module interface contracts Note: CI pipeline SOPS decryption requires sops_age_key Woodpecker secret to be created before the pipeline will work with SOPS. Until then, the old terraform.tfvars path continues to function.	2026-03-07 14:30:36 +00:00
Viktor Barzin	154f8ff0c1	[ci skip] phase 3: switch terragrunt to load config.tfvars + SOPS secrets terragrunt.hcl now loads: - config.tfvars (required, plaintext) - terraform.tfvars (optional, git-crypt — backward compat) - secrets.auto.tfvars.json (optional, SOPS-decrypted) before_hook checks that at least one secrets source exists. Use `scripts/tg` wrapper for SOPS-based workflow. Old terraform.tfvars kept for reference and backward compatibility.	2026-03-07 14:16:28 +00:00
Viktor Barzin	22267fe386	[ci skip] phase 2: split terraform.tfvars into config.tfvars + secrets.sops.json config.tfvars (29 vars, plaintext): hostnames, IPs, DNS records, IDs secrets.sops.json (140 vars, SOPS-encrypted): passwords, tokens, keys, maps Both files coexist with terraform.tfvars — no functional change yet. Complex types preserved: maps (mailserver_accounts, k8s_users, homepage_credentials), lists (xray_reality_clients), heredocs as \n-escaped JSON strings (SSH keys, WireGuard conf, headscale config).	2026-03-07 14:04:40 +00:00
Viktor Barzin	7f5dbb82f4	[ci skip] phase 1: SOPS tooling setup (.sops.yaml, scripts/tg, .gitignore) Part of SOPS multi-user secrets migration. - .sops.yaml: defines age recipients (Viktor + CI) - scripts/tg: wrapper that decrypts secrets before running terragrunt - .gitignore: excludes decrypted secrets.auto.tfvars.json No functional change — terraform.tfvars still works as before.	2026-03-07 13:57:42 +00:00
Viktor Barzin	88989cfad3	[ci skip] add SOPS multi-user secrets migration design (v3, reviewed 3x) Replaces git-crypt all-or-nothing encryption with SOPS per-value encryption. Operators push PRs → Viktor reviews → CI applies. No encryption keys needed for operators. 7-phase migration plan, reviewed by 2 agents across 3 iterations (0 remaining CRITICALs).	2026-03-07 13:55:05 +00:00
Viktor Barzin	b73b2eac33	fix(actualbudget): raise http-api resources to prevent OOM [ci skip]	2026-03-07 00:28:02 +00:00
Viktor Barzin	7d68be870d	[ci skip] add auto-generated tiers.tf, planning docs, and helm chart cache - tiers.tf: Terragrunt-generated tier locals for all standalone stacks - .planning/: resource audit research and plans - docs/plans/: cluster hardening design doc - redis-25.3.2.tgz: Bitnami Redis Helm chart cache	2026-03-06 23:55:57 +00:00
Viktor Barzin	dc85c34069	[ci skip] add AGENTS.md for model-agnostic knowledge, slim CLAUDE.md to Claude-specific layer AGENTS.md (63 lines): shared infra knowledge for any AI tool (Codex, Claude, Cursor). Covers: critical rules, architecture, storage, tiers, common ops. CLAUDE.md (23 lines): Claude-specific addons — skills, agents, user preferences. References AGENTS.md for shared knowledge. Removed generic agents (devops-engineer, fullstack-developer).	2026-03-06 23:50:26 +00:00
Viktor Barzin	f02f2a5a4d	[ci skip] refactor claude files: compact CLAUDE.md, clean memory, remove generic agents CLAUDE.md: 260→72 lines. Moved detailed patterns (NFS, iSCSI, Kyverno tables, anti-AI, node rebuild) to .claude/reference/patterns.md. Kept: critical rules, quick patterns, key commands, tier overview, prefs. Memory: CLAUDE.md is now single source of truth. Auto-memory reduced to scratch pad (67→25 lines, 5→1 files). MetaClaw DB cleaned from 40→16 entries (removed all infra-specific duplicates, kept cross-project prefs). Agents: removed generic devops-engineer (885L) and fullstack-developer (234L). Kept custom cluster-health-checker (48L).	2026-03-06 23:27:46 +00:00
Viktor Barzin	51f2b040a6	[ci skip] archive 28 unused skills, add runbook index to CLAUDE.md, add cluster-health agent - Move 28 never-invoked troubleshooting runbook skills to .claude/skills/archived/ - Keep 7 active workflow skills: cluster-health, uptime-kuma, pfsense, home-assistant, setup-project, extend-vm-storage, k8s-ndots - Add one-line runbook index to CLAUDE.md for quick reference - Create cluster-health-checker custom agent (haiku model, read-only + bash) for autonomous health checks without consuming main context	2026-03-06 23:17:40 +00:00
Viktor Barzin	1824d2be67	[ci skip] expand Prometheus PVC to 200Gi, increase retention to 180GB for 1-year history Storage analysis: ~10.5 GB/month ingestion rate, 1 year = ~125 GB + overhead. PVC: 30Gi → 200Gi, retention.size: 45GB → 180GB. Historical TSDB data restored from NFS (39.8 GB total including all blocks).	2026-03-06 23:16:32 +00:00
Viktor Barzin	a7f3d432ee	[ci skip] expand Prometheus iSCSI PVC to 30Gi for historical data restore	2026-03-06 22:51:38 +00:00
Viktor Barzin	3eaac4b9db	[ci skip] update claude knowledge: iSCSI migration for Redis, Prometheus, Loki	2026-03-06 21:05:21 +00:00
Viktor Barzin	63fb6201c8	[ci skip] migrate Redis, Prometheus, Loki storage to iSCSI - Redis: local-path → iscsi-truenas (master + replica persistence) - Prometheus: NFS PV+PVC → dynamic iSCSI PVC (prometheus-data) - Loki: NFS PV → dynamic iSCSI via storageClass in Helm values - Deleted 2 orphaned Released iSCSI PVs (31Gi freed)	2026-03-06 20:50:55 +00:00
Viktor Barzin	ce0cab7554	[ci skip] replace resource overcommitment check with actual usage Check real CPU/memory usage via kubectl top nodes instead of limits-vs-allocatable ratios. Thresholds: >80% WARN, >90% FAIL. Limits overcommit is expected with 70+ services on 3 worker nodes.	2026-03-06 20:28:55 +00:00
Viktor Barzin	bfa4a3ffdf	[ci skip] reduce resource limits per VPA recommendations dashy: 4Gi→512Mi mem, 2→500m cpu (actual: 206Mi) affine: 4Gi→512Mi mem, 2→1 cpu (actual: 186Mi) rybbit clickhouse: 4Gi→2Gi mem, 2→1 cpu (actual: 618Mi)	2026-03-06 20:23:21 +00:00
Viktor Barzin	94dcf22db4	[ci skip] exclude linkwarden from HighService4xxRate alert	2026-03-06 20:15:58 +00:00
Viktor Barzin	d4400f8283	[ci skip] remove atuin: destroy stack, DNS, NFS export, PostgreSQL credentials	2026-03-06 20:11:14 +00:00
Viktor Barzin	1d80c49201	[ci skip] iSCSI migration, healthcheck fixes, health probes, etcd backup - Migrate MySQL/PostgreSQL storage from local-path to iscsi-truenas - Add democratic-csi iSCSI driver module for TrueNAS - Add open-iscsi to cloud-init VM template - Fix Shlink health probe path (/api/v3 -> /rest/v3 for Shlink 5.0) - Fix etcd backup: use etcd 3.5.21-0 (3.6.x is distroless, no /bin/sh) - Fix cluster healthcheck CronJob: always exit 0 to prevent circular JobFailed alerts (reporting via Slack, not exit codes) - Fix Uptime Kuma nested list handling in cluster-health.sh - Add health probes to: audiobookshelf, immich ML, ntfy, headscale, uptime-kuma, vaultwarden, rybbit (clickhouse + server + client), shlink, shlink-web - Add iSCSI storage documentation to CLAUDE.md	2026-03-06 19:54:21 +00:00
Viktor Barzin	a8e07ad930	[ci skip] fix post-NFS-migration issues: MySQL GR, Loki, grampsweb, alerts - Loki: reduce memory limit from 6Gi to 4Gi (within LimitRange max) - Grampsweb: increase memory to 2Gi (was OOMKilled at 512Mi) - Fix PostgreSQLDown alert: check pod readiness instead of deployment - Fix MySQLDown alert: check StatefulSet replicas instead of deployment - Fix RedisDown alert: check StatefulSet replicas instead of deployment - Fix NFSServerUnresponsive: aggregate all NFS versions cluster-wide - Fix Uptime Kuma healthcheck: handle nested list heartbeat format - Update etcd backup image to registry.k8s.io/etcd:3.6.5-0	2026-03-03 21:10:26 +00:00
Viktor Barzin	065090dfe0	[ci skip] fix calibre: bump CPU/memory to prevent SIGBUS during calibre_postinstall	2026-03-03 19:48:45 +00:00
Viktor Barzin	31f3fc0773	[ci skip] fix OOMKill: prometheus (4Gi), kyverno-reports (512Mi), grampsweb (512Mi) - Prometheus server: explicit 1Gi req / 4Gi limit (was inheriting 512Mi LimitRange default) - Kyverno reports controller: 128Mi req / 512Mi limit (was 128Mi Helm default) - Grampsweb: 256Mi req / 512Mi limit for both containers (was 256Mi LimitRange default)	2026-03-02 21:39:14 +00:00
Viktor Barzin	ea67939525	[ci skip] add security observability layer design document Tetragon-centric approach: eBPF runtime security, pfSense syslog collection, CoreDNS query logging, Calico NetworkPolicies, on-demand mitmproxy, unified Grafana security dashboard. ~625MB steady-state, <5GB budget.	2026-03-02 21:13:01 +00:00
Viktor Barzin	51d77369de	[ci skip] fix: add mount_options to all NFS PVs (soft,timeo=30,retrans=3) Critical fix: StorageClass mountOptions only apply during dynamic provisioning. Our static PVs (created by Terraform) were missing mount_options, so all NFS mounts defaulted to hard,timeo=600 — the exact stale mount behavior we were trying to eliminate. Adds mount_options directly to the nfs_volume module PV spec and to the monitoring PVs (prometheus, loki, alertmanager). Requires re-applying all stacks to propagate to existing PVs.	2026-03-02 20:23:36 +00:00
Viktor Barzin	8137c8df63	[ci skip] fix: add mount_options to nfs_volume PV spec StorageClass mountOptions only apply during dynamic provisioning. Static PVs (created by Terraform) need mount_options set explicitly. Without this, all CSI NFS mounts default to hard,timeo=600 — the exact problem we were trying to fix.	2026-03-02 20:22:47 +00:00
Viktor Barzin	70dd172ba7	[ci skip] update CLAUDE.md: NFS volume pattern now uses CSI-backed nfs_volume module	2026-03-02 02:04:47 +00:00
Viktor Barzin	395bd94f0f	[ci skip] migrate servarr sub-stacks + actualbudget factory NFS to CSI PV/PVC Final batch: servarr (aiostreams, listenarr, readarr, soulseek, prowlarr, qbittorrent, lidarr) and actualbudget factory. All use ../../../modules/kubernetes/nfs_volume (3 levels deep).	2026-03-02 02:04:22 +00:00
Viktor Barzin	0e324df545	[ci skip] complete NFS CSI migration: complex stacks + platform modules Migrate remaining multi-volume stacks and all platform modules from inline NFS volumes to CSI-backed PV/PVC with nfs-truenas StorageClass (soft,timeo=30,retrans=3 mount options). Complex stacks: openclaw (4 vols), immich (8 vols), frigate (2 vols), nextcloud (2 vols + old PV replaced), rybbit (1 vol) Remaining stacks: affine, ebook2audiobook, f1-stream, osm_routing, real-estate-crawler Platform modules: monitoring (prometheus, loki, alertmanager PVs converted from native NFS to CSI), redis, dbaas, technitium, headscale, vaultwarden, uptime-kuma, mailserver, infra-maintenance	2026-03-02 01:24:07 +00:00
Viktor Barzin	11b3d92684	[ci skip] migrate 29 services from inline NFS to CSI-backed PV/PVC Batch migration of all single-volume and simple multi-volume stacks. All services verified healthy after migration. Uses nfs-truenas StorageClass with soft,timeo=30,retrans=3 mount options to eliminate stale NFS mount hangs. Services: atuin, audiobookshelf, calibre, changedetection, diun, excalidraw, forgejo, freshrss, grampsweb, hackmd, health, isponsorblocktv, matrix, meshcentral, n8n, navidrome, ntfy, ollama, onlyoffice, owntracks, paperless-ngx, poison-fountain, send, stirling-pdf, tandoor, wealthfolio, whisper, woodpecker, ytdlp	2026-03-02 00:15:39 +00:00
Viktor Barzin	8faad47994	[ci skip] migrate privatebin, resume, speedtest NFS volumes to CSI PV/PVC Pilot migration: replace inline nfs {} volumes with CSI-backed PV/PVC using nfs-truenas StorageClass (soft,timeo=30,retrans=3 mount options).	2026-03-01 23:42:23 +00:00
Viktor Barzin	481e4fa46e	[ci skip] add NFS CSI driver + nfs_volume shared module - Deploy csi-driver-nfs Helm chart as platform module (nfs-csi) - Create nfs-truenas StorageClass with soft,timeo=30,retrans=3 mount options - Add shared nfs_volume module for PV/PVC boilerplate (modules/kubernetes/nfs_volume/)	2026-03-01 23:38:58 +00:00
Viktor Barzin	2c115f2dc5	[ci skip] add NFS CSI migration design doc and implementation plan	2026-03-01 23:30:27 +00:00
Viktor Barzin	00197c931e	[ci skip] remove low-traffic pull-through caches (registry.k8s.io, quay.io, reg.kyverno.io) Pull-through cache at 10.0.20.10 was serving corrupted/truncated images for low-traffic registries, breaking VPA certgen (ImagePullBackOff) and previously causing Kyverno image pull failures. Kept: docker.io (port 5000) and ghcr.io (port 5010) — high traffic, Docker Hub rate limits make caching essential. Removed from cloud-init template and all 5 live nodes: - registry.k8s.io (port 5030) — 14 system images, very low churn - quay.io (port 5020) — 11 images - reg.kyverno.io (port 5040) — 5 images The registry containers on the 10.0.20.10 VM still run but nodes no longer route to them. They can be stopped/removed from the VM later.	2026-03-01 21:46:41 +00:00
Viktor Barzin	f30ef660e1	[ci skip] add clickhouse-k8s-nfs-system-log-bloat skill, update GPU skill with auto-recovery New skill: ClickHouse on K8s/NFS burns CPU from unbounded system log tables and background merges. Covers config.d mount crash (exit code 36), CronJob truncation workaround, and diagnostic commands. Updated: k8s-gpu-no-nvidia-devices v1.1.0 — added automatic GPU recovery via liveness probe pattern (nvidia-smi + app health check).	2026-03-01 21:04:19 +00:00
Viktor Barzin	14a5b4d7d5	[ci skip] frigate: add liveness/startup probes for GPU recovery When the GPU becomes unavailable (overloaded, CUDA context corruption), Frigate silently falls back to CPU detection burning 4 cores with no automatic recovery. Add liveness probe checking nvidia-smi + API health every 60s (3 failures = restart), and startup probe allowing up to 5min for TensorRT model loading.	2026-03-01 20:36:49 +00:00
Viktor Barzin	78d5aeb5db	[ci skip] f1-stream: add Discord token and channel env vars	2026-03-01 20:17:38 +00:00
Viktor Barzin	6c1dffbfd8	[ci skip] rybbit: add CronJob to truncate ClickHouse system logs every 6h ClickHouse system log tables (metric_log, trace_log, text_log, etc.) were growing unboundedly on NFS (~10GiB, 1.3B rows) with no TTL, causing continuous background merge operations that burned ~920m CPU. Mounting custom config.d XML files crashes ClickHouse (exit code 36) so instead add a CronJob that truncates the tables via the HTTP API every 6 hours. Also removed the broken ConfigMap/volume mount that was causing crashes.	2026-03-01 19:41:39 +00:00
Viktor Barzin	ca648ff9bb	[ci skip] right-size all pod resources based on VPA + live metrics audit Full cluster resource audit: cross-referenced Goldilocks VPA recommendations, live kubectl top metrics, and Terraform definitions for 100+ containers. Critical fixes: - dashy: CPU throttled at 98% (490m/500m) → 2 CPU limit - stirling-pdf: CPU throttled at 99.7% (299m/300m) → 2 CPU limit - traefik auth-proxy/bot-block-proxy: mem limit 32Mi → 128Mi Added explicit resources to ~40 containers that had none: - audiobookshelf, changedetection, cyberchef, dawarich, diun, echo, excalidraw, freshrss, hackmd, isponsorblocktv, linkwarden, n8n, navidrome, ntfy, owntracks, privatebin, send, shadowsocks, tandoor, tor-proxy, wealthfolio, networking-toolbox, rybbit, mailserver, cloudflared, pgadmin, phpmyadmin, crowdsec-web, xray, wireguard, k8s-portal, tuya-bridge, ollama-ui, whisper, piper, immich-server, immich-postgresql, osrm-foot GPU containers: added CPU/mem alongside GPU limits: - ollama: removed CPU/mem limits (models vary in size), keep GPU only - frigate: req 500m/2Gi, lim 4/8Gi + GPU - immich-ml: req 100m/1Gi, lim 2/4Gi + GPU Right-sized ~25 over-provisioned containers: - kms-web-page: 500m/512Mi → 50m/64Mi (was using 0m/10Mi) - onlyoffice: CPU 8 → 2 (VPA upper 45m) - realestate-crawler-api: CPU 2000m → 250m - blog/travel-blog/webhook-handler: 500m → 100m - coturn/health/plotting-book: reduced to match actual usage Conservative methodology: limits = max(VPA upper * 2, live usage * 2)	2026-03-01 19:18:50 +00:00
Viktor Barzin	32762a0916	[ci skip] switch VPA to off mode globally, fix Ollama/MySQL resources - Kyverno policy: VPA mode set to 'off' for all namespaces (was 'initial' for non-core). Terraform is now sole authority for container resources. Goldilocks provides recommendations only. - Ollama: add explicit CPU/memory resources (500m/4Gi req, 4/12Gi limit) alongside GPU allocation. Fixes OOMKill from VPA scaling down resources. - MySQL InnoDB Cluster: bump memory limit from 2Gi to 3Gi. - Remove redundant per-namespace VPA opt-out labels from onlyoffice, openclaw, trading-bot (now handled globally by Kyverno policy).	2026-03-01 19:03:49 +00:00
Viktor Barzin	304b5e4b3d	[ci skip] add nfsv4-idmapd-uid-mapping skill, cross-ref from NFS troubleshooting New skill documenting the NFSv4 idmapd UID mapping crisis where all file UIDs show as 65534 (nobody) inside K8s containers. Root cause: containers auto-negotiate NFSv4.2, and idmapd domain mismatch maps all UIDs to nobody. Fix: v4_v3owner=true on TrueNAS for numeric UID passthrough.	2026-03-01 18:14:37 +00:00
Viktor Barzin	7a467c75ae	[ci skip] add openclaw-k8s-deployment skill from claudeception Extracts all non-obvious gotchas from deploying OpenClaw on Kubernetes: - wizard block required for Telegram, exec.host valid values, - VPA resource overrides, file permissions, startup command, - modelrelay sidecar, NFS caching strategy	2026-03-01 18:10:33 +00:00
Viktor Barzin	ae9565c3e6	[ci skip] onlyoffice: revert font cache NFS mounts, rebuild on startup NFS font caching caused issues. Reverted to default GENERATE_FONTS=true with 8 CPU burst limit for fast regeneration on startup.	2026-03-01 18:07:37 +00:00
Viktor Barzin	b8b41a9408	[ci skip] update claude knowledge: kyverno fixes, nextcloud, onlyoffice learnings	2026-03-01 18:07:04 +00:00
Viktor Barzin	a82f86b3e4	[ci skip] onlyoffice: cache fonts/themes on NFS for fast restarts Persist font cache (159MB) and theme images (10MB) to NFS volume. Set GENERATE_FONTS=false to skip regeneration on startup since cache is warm. Startup time: ~3 min -> 5 seconds.	2026-03-01 18:02:38 +00:00
Viktor Barzin	81e128acc4	[ci skip] onlyoffice: bump CPU limit to 8, add custom LimitRange/Quota Startup was throttled by allthemesgen and font generation hitting 2 CPU ceiling. Bumped to 8 CPU burst limit with custom LimitRange (max 8 CPU) and custom ResourceQuota. Disabled VPA and goldilocks opt-out labels.	2026-03-01 17:58:26 +00:00
Viktor Barzin	07874f8021	[ci skip] onlyoffice: disable VPA to prevent CPU resource override Goldilocks VPA in Initial mode was overriding the explicit 2 CPU limit down to 700m, throttling the document server. Set vpa-update-mode=off.	2026-03-01 17:55:06 +00:00
Viktor Barzin	beec5acbc7	[ci skip] nextcloud: bump CPU limit to 16, add custom ResourceQuota CPU was pegged at 2000m/2000m (100% throttled). Add custom-quota opt-out label and ResourceQuota allowing 32 CPU limits to accommodate the 16 CPU container limit plus sidecar defaults.	2026-03-01 17:41:18 +00:00
Viktor Barzin	ecc3445860	[ci skip] openclaw: fix workspace permissions — chown to node user Init container clones repo as root but main container runs as node (UID 1000). Added chown -R 1000:1000 /workspace/infra so OpenClaw can write to workspace.	2026-03-01 17:20:36 +00:00
Viktor Barzin	79af6fff47	[ci skip] fix MySQL cluster RBAC, Kyverno policy bugs, Nextcloud memory - dbaas: add mysql-sidecar-extra ClusterRole for namespaces/CRD list/watch needed by kopf framework in sidecar containers - kyverno: restrict inject-priority-class-from-tier to CREATE operations only (was blocking pod patches with immutable spec error) - kyverno: add resource-governance/custom-limitrange label opt-out to LimitRange generation policy (mirrors existing custom-quota) - nextcloud: bump memory limit 4Gi -> 6Gi, add custom LimitRange with 8Gi max, opt out of Kyverno-managed LimitRange	2026-03-01 17:16:03 +00:00
Viktor Barzin	a8da2e3790	[ci skip] redis: pin service to master pod to fix read-only errors The Bitnami Redis Sentinel chart's service selects all nodes (master + replicas). Clients using plain redis:// URLs (paperless-ngx, etc.) randomly hit read-only replicas, causing write failures. Pin the service to redis-node-0 (master).	2026-03-01 17:13:25 +00:00
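
Note on ca648ff9bb: the right-sizing rule quoted in that commit, limits = max(VPA upper * 2, live usage * 2), can be sketched as below. This is only an illustration of the stated formula, not the script used for the audit. The function name, the millicore/MiB units, and the sample figures are assumptions; the log also shows that some limits were set by hand above what the rule alone would give.

```python
def recommended_limit(vpa_upper: int, live_usage: int, factor: int = 2) -> int:
    """Return a conservative limit from two observations in the same unit.

    vpa_upper  -- upper-bound recommendation from VPA (e.g. millicores or MiB)
    live_usage -- usage observed live (e.g. from `kubectl top`), same unit
    factor     -- headroom multiplier; the commit message uses 2
    """
    if vpa_upper < 0 or live_usage < 0:
        raise ValueError("observations must be non-negative")
    # limits = max(VPA upper * 2, live usage * 2)
    return max(vpa_upper * factor, live_usage * factor)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show both branches of max().
    print(recommended_limit(vpa_upper=45, live_usage=30))    # VPA bound dominates
    print(recommended_limit(vpa_upper=100, live_usage=206))  # live usage dominates
```

Taking the larger of the two doubled values keeps the limit above whichever signal is higher, so a stale VPA recommendation cannot pull a limit below current real usage.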

1881 commits