infra

Author	SHA1	Message	Date
Viktor Barzin	ca5039f8aa	switch backup + offsite sync from weekly to daily — RPO 7d → 1d [ci skip] - weekly-backup.timer: Sun 05:00 → daily 05:00 - offsite-sync-backup.timer: Sun 08:00 → daily 06:00 - Monthly full rsync --delete unchanged (1st-7th of month) - Total daily I/O cost: ~20GB sdc reads, ~3.5GB sda writes, seconds of network - Updated script headers and service descriptions Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 18:24:38 +00:00
Viktor Barzin	28ad11d12c	consolidate offsite backup: inotify change tracking, deduplicate Synology paths [ci skip] Architecture overhaul: - Synology truenas/ renamed to nfs/, immich paths flattened to match source - Created nfs-ssd/ on Synology for SSD data (thumbs, ML cache) - Deleted pve-backup/nfs-mirror (53GB duplication eliminated) - New inotifywait daemon (nfs-change-tracker.service) watches /srv/nfs + /srv/nfs-ssd - offsite-sync Step 2: reads inotify change log, rsync --files-from only changed files - weekly-backup: removed NFS mirror step entirely (NFS goes direct to Synology) - Cleaned 9 orphaned LVs (101GB + 38 snapshots reclaimed from thin pool) Performance: incremental sync completes in seconds (vs 30+ min with full rsync) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 18:06:20 +00:00
Viktor Barzin	aa4c125f9c	improve 3-2-1 backup: auto-discover dirs, Immich offsite sync, SQLite backup [ci skip] - weekly-backup.sh: replace hardcoded BACKUP_DIRS with glob auto-discovery (catches nextcloud-backup, council-complaints-backup, future dirs) - weekly-backup.sh: add auto SQLite backup from PVC snapshots (magic number check, ?mode=ro URI, fallback to raw copy) - offsite-sync-backup.sh: add NFS media direct-to-Synology sync (Immich, calibre, audiobookshelf — reuses existing TrueNAS Cloud Sync paths) - Cleaned up 9 orphaned LVs + 38 snapshots on PVE host (101GB reclaimed) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 15:47:56 +00:00
Viktor Barzin	38d51ab0af	deprecate TrueNAS: migrate Immich NFS to Proxmox, remove all 10.0.10.15 references [ci skip] - Migrate Immich (8 NFS PVs, 1.1TB) from TrueNAS to Proxmox host NFS - Update config.tfvars nfs_server to 192.168.1.127 (Proxmox) - Update nfs-csi StorageClass share to /srv/nfs - Update scripts (weekly-backup, cluster-healthcheck) to Proxmox IP - Delete obsolete TrueNAS scripts (nfs_exports.sh, truenas-status.sh) - Rewrite nfs-health.sh for Proxmox NFS monitoring - Update Freedify nfs_music_server default to Proxmox - Mark CloudSync monitor CronJob as deprecated - Update Prometheus alert summaries - Update all architecture docs, AGENTS.md, and reference docs - Zero PVs remain on TrueNAS — VM ready for decommission Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 14:42:07 +00:00
Viktor Barzin	d2af5339af	fix offsite sync: use --chmod for Synology permission compatibility Synology Administrator user can't create dirs with root-owned permissions from PVC snapshots. Switch from -az to -rltz --chmod to set writable permissions on destination. Also updated Cloud Sync Task 1 excludes to prevent duplication of backup dirs on Synology.	2026-04-06 16:01:42 +03:00
Viktor Barzin	9e2ac5fbb5	feat: add hardware exporter checks to cluster healthcheck (check #30) Verifies snmp-exporter, idrac-redfish-exporter, proxmox-exporter, and tuya-bridge pods are running, plus checks Prometheus scrape targets (snmp-idrac, snmp-ups, redfish-idrac, proxmox-host) are UP.	2026-04-06 14:58:46 +03:00
Viktor Barzin	d009f9a0f2	add 3-2-1 backup pipeline: weekly PVC file copy, NFS mirror, pfsense, offsite sync - weekly-backup.sh: mounts LVM thin snapshots ro, rsyncs files to /mnt/backup/pvc-data with --link-dest versioning (4 weeks). Also mirrors NFS backup dirs from TrueNAS, backs up pfsense (config.xml + full tar), PVE host config, and prunes >7d snapshots. - offsite-sync-backup.sh: rsync --files-from manifest to Synology (no full dir walk). Monthly full --delete sync on 1st Sunday. After=weekly-backup.service dependency. - lvm-pvc-snapshot.timer: changed to daily 03:00 (was 2x daily) - Prometheus alerts: WeeklyBackupStale, WeeklyBackupFailing, PfsenseBackupStale, OffsiteBackupSyncStale, BackupDiskFull. LVMSnapshotStale threshold 24h→48h.	2026-04-06 14:53:28 +03:00
Viktor Barzin	72d832fee7	add HA Sofia checks (26-29) to cluster healthcheck and backup-dr docs - Healthcheck: add entity availability, integration health, automation status, and system resources checks for Home Assistant Sofia - Docs: add backup-dr architecture documentation	2026-04-06 11:57:36 +03:00
Viktor Barzin	337da2184d	add upstream fallback to containerd registry mirrors When the pull-through proxy (10.0.20.10) is down, containerd now falls back to the official upstream registries (registry-1.docker.io, ghcr.io) instead of failing. Also cleans up stale disabled registry mirror dirs and removes unnecessary containerd restart from the rollout script.	2026-04-02 11:05:30 +03:00
Viktor Barzin	4e74f816bc	cleanup: remove calibre and audiobookshelf stacks after ebooks migration [ci skip] Both services migrated to unified ebooks namespace. Remove: - Old stack directories and Terraform state - calibre references from monitoring namespace lists - calibre/audiobookshelf from operational scripts	2026-03-25 23:56:07 +02:00
Viktor Barzin	77143dfd6b	state: per-stack Transit keys for namespace-owner access control - Each stack gets its own Vault Transit key (transit/keys/sops-state-<stack>) - state-sync passes per-stack Transit URI + age keys on encrypt - Vault policies scope namespace-owners to their stacks only: - sops-admin: wildcard access to all transit keys - sops-user-<name>: access only to owned stack keys - Anca (plotting-book) can only decrypt plotting-book state - Admin can decrypt everything (via admin Transit policy or age fallback) - External group sops-plotting-book maps Authentik group to Vault policy - Updated CLAUDE.md with state sync documentation	2026-03-17 23:08:18 +00:00
Viktor Barzin	4e7ca1ad61	state: add Vault Transit as primary SOPS backend, age as fallback - .sops.yaml: add hc_vault_transit_uri for transit/keys/sops-state - state-sync: try Vault Transit first, fall back to age key on disk - Re-encrypted all 101 state files with both Vault Transit + age - Normal workflow: vault login → decrypt via Transit (no key files) - Bootstrap/DR: age key at ~/.config/sops/age/keys.txt	2026-03-17 22:56:33 +00:00
Viktor Barzin	b6faa24349	state: add SOPS-encrypted terraform state to git - SOPS + age encrypts all 101 .tfstate files (JSON-aware: keys visible, values encrypted) - scripts/state-sync: encrypt/decrypt/commit wrapper - scripts/tg: auto-decrypt before ops, auto-encrypt+commit after apply/destroy - terragrunt.hcl: -backup=- prevents backup file accumulation - .gitignore: track .tfstate.enc, ignore plaintext .tfstate - Cleaned 964MB of stale backups (state/backups/, .backup files)	2026-03-17 22:37:56 +00:00
Viktor Barzin	0f262ceda3	add pod dependency management via Kyverno init container injection Kyverno ClusterPolicy reads dependency.kyverno.io/wait-for annotation and injects busybox init containers that block until each dependency is reachable (nc -z). Annotations added to 18 stacks (24 deployments). Includes graceful-db-maintenance.sh script for planned DB maintenance (scales dependents to 0, saves replica counts, restores on startup).	2026-03-15 19:17:57 +00:00
Viktor Barzin	3aba29e7a3	remove SOPS pipeline, deploy ESO + Vault DB/K8s engines Vault is now the sole source of truth for secrets. SOPS pipeline removed entirely — auth via `vault login -method=oidc`. Part A: SOPS removal - vault/main.tf: delete 990 lines (93 vars + 43 KV write resources), add self-read data source for OIDC creds from secret/vault - terragrunt.hcl: remove SOPS var loading, vault_root_token, check_secrets hook - scripts/tg: remove SOPS decryption, keep -auto-approve logic - .woodpecker/default.yml: replace SOPS with Vault K8s auth via curl - Delete secrets.sops.json, .sops.yaml Part B: External Secrets Operator - New stack stacks/external-secrets/ with Helm chart + 2 ClusterSecretStores (vault-kv for KV v2, vault-database for DB engine) Part C: Database secrets engine (in vault/main.tf) - MySQL + PostgreSQL connections with static role rotation (24h) - 6 MySQL roles (speedtest, wrongmove, codimd, nextcloud, shlink, grafana) - 6 PostgreSQL roles (trading, health, linkwarden, affine, woodpecker, claude_memory) Part D: Kubernetes secrets engine (in vault/main.tf) - RBAC for Vault SA to manage K8s tokens - Roles: dashboard-admin, ci-deployer, openclaw, local-admin - New scripts/vault-kubeconfig helper for dynamic kubeconfig K8s auth method with scoped policies for CI, ESO, OpenClaw, Woodpecker sync.	2026-03-15 16:37:38 +00:00
Viktor Barzin	46afa85b01	fix openclaw config mount and OOM: use init container, increase memory to 2Gi - Replace subPath ConfigMap mount with init container that copies openclaw.json to writable NFS home (OpenClaw writes back to the file at runtime) - Remove invalid memory-api plugin references causing "Config invalid" - Increase memory to 2Gi (req+limit) with NODE_OPTIONS=--max-old-space-size=1536 - Fix tg wrapper to inject -auto-approve when apply --non-interactive is used	2026-03-14 23:42:17 +00:00
Viktor Barzin	76a4987eef	[ci skip] add Forgejo task pipeline for OpenClaw AI agent Forgejo issues as a task queue for OpenClaw: - Forgejo OAuth2 with Authentik SSO, self-registration disabled - Webhook-triggered task processing (instant) + CronJob backup (5min poll) - Tasks processed via Mistral Large 3 (NVIDIA NIM API) - Results posted as issue comments, auto-labeled and closed - Comment follow-ups and reopened issues supported - n8n RBAC for OpenClaw pod exec (future workflow integration)	2026-03-07 21:11:07 +00:00
Viktor Barzin	39333033a6	[ci skip] phase 1: SOPS tooling setup (.sops.yaml, scripts/tg, .gitignore) Part of SOPS multi-user secrets migration. - .sops.yaml: defines age recipients (Viktor + CI) - scripts/tg: wrapper that decrypts secrets before running terragrunt - .gitignore: excludes decrypted secrets.auto.tfvars.json No functional change — terraform.tfvars still works as before.	2026-03-07 13:57:42 +00:00
Viktor Barzin	422dadafe5	[ci skip] replace resource overcommitment check with actual usage Check real CPU/memory usage via kubectl top nodes instead of limits-vs-allocatable ratios. Thresholds: >80% WARN, >90% FAIL. Limits overcommit is expected with 70+ services on 3 worker nodes.	2026-03-06 20:28:55 +00:00
Viktor Barzin	87ef313888	[ci skip] fix post-NFS-migration issues: MySQL GR, Loki, grampsweb, alerts - Loki: reduce memory limit from 6Gi to 4Gi (within LimitRange max) - Grampsweb: increase memory to 2Gi (was OOMKilled at 512Mi) - Fix PostgreSQLDown alert: check pod readiness instead of deployment - Fix MySQLDown alert: check StatefulSet replicas instead of deployment - Fix RedisDown alert: check StatefulSet replicas instead of deployment - Fix NFSServerUnresponsive: aggregate all NFS versions cluster-wide - Fix Uptime Kuma healthcheck: handle nested list heartbeat format - Update etcd backup image to registry.k8s.io/etcd:3.6.5-0	2026-03-03 21:10:26 +00:00
Viktor Barzin	14b1c43713	[ci skip] expand k8s worker nodes to 256G, update inventory and extend script - k8s-node2: 128G → 256G (160GB free) - k8s-node3: 128G → 256G (135GB free) - k8s-node4: 128G → 256G (127GB free) - k8s-node1: already 256G (51GB free) - extend_vm_storage.sh: increase drain timeout to 300s, add --force flag - Remove Vaultwarden from SQLite migration plan (too risky)	2026-02-28 16:00:16 +00:00
Viktor Barzin	69c4c0c76e	[ci skip] VPA: reduce LimitRange defaults, add overcommit check, protect tier-0 - Reduce Kyverno LimitRange default limits ~4x across all tiers to fix 800-900% memory overcommitment on worker nodes - Add cluster health check #25: per-node resource overcommitment showing requests and limits vs allocatable capacity - Add Kyverno policy for Goldilocks VPA mode by tier: tier-0 namespaces get VPA Off mode (recommend only, no evictions) to prevent downtime on critical infra (traefik, cloudflared, authentik, technitium, etc.) - Non-tier-0 namespaces get VPA Auto mode for active right-sizing	2026-02-26 23:15:43 +00:00
Viktor Barzin	d041459ef2	[ci skip] Upgrade Woodpecker CI v3.5.1 → v3.13.0, fix helm healthcheck for v4	2026-02-23 20:14:30 +00:00
Viktor Barzin	c8de2c4803	[ci skip] Sunset Drone CI: remove all artifacts, DNS, configs, and references Drone CI has been fully replaced by Woodpecker CI at ci.viktorbarzin.me. Destroys K8s resources (12), removes DNS records, NFS exports, Uptime Kuma monitor, dashboard entry, and all code/doc references across 18 files.	2026-02-23 19:38:55 +00:00
Viktor Barzin	a9ba8899be	[ci skip] Phase 3: Create 66 service stacks and migrate state Generated individual stack directories for all 66 services under stacks/. Each stack has terragrunt.hcl (depends on platform) and main.tf (thin wrapper calling existing module). Migrated all 64 active service states from root terraform.tfstate to individual state files. Root state is now empty. Verified with terragrunt plan on multiple stacks (no changes).	2026-02-22 13:56:34 +00:00
Viktor Barzin	db659b1f7a	[ci skip] Fix dashy OOMKilled and healthcheck DNS false-failure - Add explicit resource limits to dashy (2Gi memory) to prevent OOMKilled during webpack build on startup - Rewrite DNS healthcheck to test from inside the Technitium pod via kubectl exec, since MetalLB virtual IPs aren't reachable from outside the L2 network - Deleted orphaned kured/tls-secret (expired Oct 2025, module disabled, not mounted by kured DaemonSet)	2026-02-22 12:46:12 +00:00
Viktor Barzin	00dc78e0d2	[ci skip] Fix Uptime Kuma false-down reports: use bulk heartbeat API instead of per-monitor calls	2026-02-22 01:37:28 +00:00
Viktor Barzin	98b711ff8d	[ci skip] Extend cluster healthcheck from 14 to 24 checks Add 10 new checks covering gaps discovered during incident response: ResourceQuota pressure, StatefulSets, node disk usage, Helm release health, Kyverno policy engine, NFS connectivity, DNS resolution, TLS certificate expiry, GPU health, and Cloudflare tunnel status.	2026-02-21 23:57:04 +00:00
Viktor Barzin	038d4434c4	[ci skip] Fix health check false positives for completed CronJob pods	2026-02-21 19:56:39 +00:00
Viktor Barzin	2bae6ccce3	Add Uptime Kuma monitor check to cluster health script [ci skip] Adds check #14 that queries Uptime Kuma API for application-level monitor status, complementing the kubectl-level checks with HTTP/ping health data. Reports down monitors by name with PASS/WARN/FAIL thresholds.	2026-02-15 17:49:40 +00:00
Viktor Barzin	9c4ff21d58	Add cluster health check script with 13 diagnostic sections [ci skip]	2026-02-15 17:34:22 +00:00
Viktor Barzin	a67a6f350e	[ci skip] Fix pull-through cache for all registries Replace deprecated wildcard containerd mirror with per-registry config_path approach. Add proxy containers for ghcr.io, quay.io, registry.k8s.io, and reg.kyverno.io on the docker-registry VM. Set static IP for docker-registry VM to avoid DHCP issues.	2026-02-15 14:35:52 +00:00
Viktor Barzin	08ea489fe0	[ci skip] Add extend-vm-storage script and skills - Script to automate K8s node VM disk expansion (drain, shutdown, resize, boot, expand FS, uncordon) - Skill docs for the workflow and troubleshooting pitfalls (growpart, macOS grep -P, drain timeouts) - Successfully tested on k8s-node2, k8s-node3, k8s-node4 (64G → 128G)	2026-02-13 22:08:46 +00:00
Viktor Barzin	a926a5022c	[ci skip] sync tfstate and add frigate helper scripts	2026-02-12 23:11:23 +00:00
Viktor Barzin	7441538d6e	upgrade to k8s 1.34.2 [ci skip]	2025-12-18 12:37:14 +00:00
Viktor Barzin	e1ec44c81d	scale down calibre-web-automated instead of calibre [ci skip]	2025-12-06 22:04:41 +00:00
Viktor Barzin	c7c69905c0	some nits on the registry manager script - note it is still not working correctly [ci skip]	2025-10-17 19:23:43 +00:00
Viktor Barzin	8da88f9f6d	move helper scripts in scripts dir [ci skip]	2025-10-11 17:14:59 +00:00

38 commits