infra

Author	SHA1	Message	Date
Viktor Barzin	e9311915cb	add agent route to k8s-portal	2026-03-23 02:25:08 +02:00
Viktor Barzin	6bfade3013	update infra stack terraform lock file (helm/kubernetes/vault providers)	2026-03-23 02:24:47 +02:00
Viktor Barzin	2dcdc65db5	add weekly SQLite backup for plotting-book to NFS	2026-03-23 02:24:43 +02:00
Viktor Barzin	e4cf0dee83	add TrueNAS Cloud Sync monitor CronJob and bump Prometheus Helm timeout - New cloudsync-monitor CronJob: queries TrueNAS API every 6h, pushes metrics to Pushgateway - Increase Prometheus Helm timeout to 900s for slow iSCSI reattach	2026-03-23 02:24:39 +02:00
Viktor Barzin	e463281205	optimize backup schedules: compress dumps, stagger to weekly, extend retention - dbaas: gzip MySQL/PostgreSQL dumps, stagger to 0:30, clean old uncompressed - infra-maintenance: etcd backup daily→weekly Sunday 1am - redis: backup hourly→weekly Sunday 3am, retention 7→28 days - vault: raft backup daily→weekly Sunday 2am	2026-03-23 02:24:34 +02:00
Viktor Barzin	644562454c	add IPv6 connectivity via Hurricane Electric 6in4 tunnel - Add public_ipv6 variable and AAAA records for all 34 non-proxied services - Fix stale DNS records (85.130.108.6 → 176.12.22.76, old IPv6 → HE tunnel) - Update SPF record with current IPv4/IPv6 addresses - Add AAAA update support to Technitium DNS updater CLI - Pin mailserver MetalLB IP to 10.0.20.201 for stable pfSense NAT - pfSense: HE_IPv6 interface, strict firewall (80,443,25,465,587,993 + ICMPv6), socat IPv6→IPv4 proxy, removed dangerous "Allow all DEBUG" rules	2026-03-23 02:22:00 +02:00
Viktor Barzin	1f4e8cb278	use registry.viktorbarzin.me hostname for private images + protect ingress - Switch priority-pass images from 10.0.20.10:5050 to registry.viktorbarzin.me - Add containerd hosts.toml for registry.viktorbarzin.me on all nodes + template (redirects to 10.0.20.10:5050 LAN direct, avoids Traefik round-trip) - Enable Authentik protection on priority-pass ingress	2026-03-23 01:02:27 +02:00
Viktor Barzin	e9919d8fc9	fix priority-pass: bump backend memory to 512Mi (OOM with OpenCV)	2026-03-23 00:58:39 +02:00
Viktor Barzin	0674d6e538	deploy priority-pass app to cluster via private registry - SvelteKit frontend + FastAPI backend in single pod with sidecar pattern - Images pushed to 10.0.20.10:5050 private registry (v4/v1) - SvelteKit server route proxies /api/transform to backend on 127.0.0.1:8000 - Exposed at priority-pass.viktorbarzin.me (Cloudflare-proxied, no auth) - Uses imagePullSecrets for authenticated registry pulls	2026-03-23 00:55:41 +02:00
Viktor Barzin	311ff5dd9e	add hourly SQLite integrity check for vaultwarden with Prometheus alerting - New CronJob runs PRAGMA integrity_check every hour - Pushes vaultwarden_sqlite_integrity_ok metric to Prometheus pushgateway - VaultwardenSQLiteCorrupt alert fires immediately on corruption (critical) - VaultwardenIntegrityCheckStale alert if check hasn't run in 2h (warning) - Prevents running for days on a corrupted DB unnoticed	2026-03-23 00:50:15 +02:00
Viktor Barzin	3b89a7d7e4	add VaultwardenDown alert and tighten backup staleness threshold - Add dedicated VaultwardenDown Prometheus alert (critical, 5m) - Reduce backup staleness threshold from 8d to 24h to match 6h schedule - Fixes monitoring gap where VW downtime went undetected	2026-03-23 00:47:00 +02:00
Viktor Barzin	a44f35bcf8	harden vaultwarden iSCSI storage and increase backup frequency - Increase backup from daily to every 6 hours (0 */6 * * *) - Add pre/post-flight SQLite integrity checks to backup job - Harden iSCSI on all nodes: increase recovery timeout (300s), enable CRC32C data/header digests for bit-flip detection - Fix restore runbook PVC name (vaultwarden-data-iscsi) Motivated by SQLite corruption from iSCSI I/O errors.	2026-03-23 00:36:11 +02:00
Viktor Barzin	ab7e18c07c	fix registry auth: add Kyverno RBAC for Secrets + containerd TLS skip-verify - Grant kyverno-admission-controller and kyverno-background-controller permissions to manage Secrets (required for generate clone rules) - Add containerd hosts.toml for 10.0.20.10:5050 with skip_verify=true (wildcard cert doesn't cover IP SANs) — applied to all nodes + template	2026-03-22 23:47:29 +02:00
Viktor Barzin	36171bcda4	add htpasswd auth to private docker registry + expose at registry.viktorbarzin.me - Add auth.htpasswd section to config-private.yml - Mount htpasswd file in registry-private container, fix healthcheck for 401 - Rename registry UI from registry.viktorbarzin.me → docker.viktorbarzin.me - Add Docker CLI ingress at registry.viktorbarzin.me (HTTPS backend, no rate-limit, unlimited body) - Add docker to cloudflare_proxied_names (registry stays non-proxied) - Add Kyverno ClusterPolicy to sync registry-credentials secret to all namespaces - Update infra provisioning to install apache2-utils and generate htpasswd from Vault	2026-03-22 22:10:10 +02:00
Viktor Barzin	e4f478b490	switch claude-memory server to multi-user API_KEYS auth Enables isolated memory namespaces per user (wizard/emo) by switching from single API_KEY to API_KEYS JSON map env var.	2026-03-22 20:08:07 +02:00
Viktor Barzin	c103a1ee05	fix OOMKilled containers: bump immich/actualbudget memory, disable changedetection, cap clickhouse - immich-server: 512Mi/1Gi → 1700Mi/1700Mi (VPA upperBound 1.39Gi, 34 OOM restarts) - actualbudget http-api: 384Mi → 768Mi (VPA upperBound 615Mi, 3 OOM restarts) - changedetection: replicas 1 → 0 (chronic OOM at 64Mi, not worth memory cost) - rybbit clickhouse: add ConfigMap capping max_server_memory_usage to 800Mi (within 1Gi limit)	2026-03-22 15:22:29 +02:00
Viktor Barzin	ad689076d8	scale down non-critical services to free cluster memory - authentik server: 3→2, worker: 3→2, PDB minAvailable: 2→1 - tuya-bridge: 3→1 - realestate-crawler-api: 2→1 - claude-memory: 2→1 - grafana: 2→1 (config only, apply pending) - alertmanager: 2→1 (config only, apply pending) Estimated savings: ~1.2 Gi total	2026-03-22 03:10:12 +02:00
Viktor Barzin	bd98b84ded	scale grafana and alertmanager to 1 replica to free cluster memory Grafana: 2 → 1 (saves ~312 Mi) Alertmanager: 2 → 1 (saves ~150 Mi) Matrix already scaled to 0 (saves ~212 Mi)	2026-03-22 03:02:17 +02:00
Viktor Barzin	1c13af142d	sync regenerated providers.tf + upstream changes - Terragrunt-regenerated providers.tf across stacks (vault_root_token variable removed from root generate block) - Upstream monitoring/openclaw/CLAUDE.md changes from rebase	2026-03-22 02:56:04 +02:00
Viktor Barzin	2e016d7df2	fix nextcloud db-username + k8s-dashboard chart repo - nextcloud: add db-username to ESO secret template and usernameKey to chart values (required by newer chart version) - k8s-dashboard: update chart repo URL to kubernetes-retired.github.io (old kubernetes.github.io/dashboard returns 404)	2026-03-22 02:50:48 +02:00
Viktor Barzin	3d22599f7f	fix infra stack: use overwrite strategy for provider generation The child generate block needs if_exists="overwrite" to properly override the root terragrunt's k8s_providers block.	2026-03-22 01:28:25 +02:00
Viktor Barzin	728fbcd3bd	fix infra stack: add vault provider to terragrunt generate block The infra stack's provider override only included proxmox but main.tf uses data "vault_kv_secret_v2" which requires the vault provider.	2026-03-22 01:17:00 +02:00
Viktor Barzin	1d1549b8af	state(descheduler): update encrypted state	2026-03-21 15:13:02 +00:00
Viktor Barzin	21cfa8c072	bump memory limits for OOM-prone services FreshRSS: 64Mi → 256Mi (171 restarts, VPA upper ~204Mi) Actual Budget HTTP API: 128Mi → 384Mi (17 restarts, VPA upper ~297Mi) n8n: 768Mi → 1Gi (18 restarts, VPA upper ~765Mi) Dawarich: 768Mi → 896Mi (2 restarts, VPA upper ~628Mi) Traefik: 384Mi → 768Mi (2 restarts, VPA upper ~584Mi)	2026-03-21 11:12:12 +00:00
Viktor Barzin	b3c9c45a17	multi-user access: fix template memory default, add storage quota, add CONTRIBUTING.md [ci skip] - Template: bump default memory from 128Mi to 256Mi (matches deploy-app skill guidance) - ResourceQuota: add requests.storage (20Gi) and persistentvolumeclaims (5) defaults - CONTRIBUTING.md: agent-friendly contributor guide for namespace-owners	2026-03-19 23:49:15 +00:00
Viktor Barzin	6b8ce04d44	fix(openclaw): change agent workspace from /workspace/infra to /workspace Keeps infra repo as a subdirectory, allows OpenClaw to write to /workspace directly.	2026-03-19 23:32:28 +00:00
Viktor Barzin	e823b795f7	fix(dbaas,vault): fix backup CronJob failures and mysql-operator memory - Add docker.io/library/ prefix to mysql and postgres backup images to satisfy Kyverno require-trusted-registries policy (both CronJobs were blocked for 46h, triggering MySQLBackupStale alert) - Document mysql-operator chart ignoring resources values key — the LimitRange default (256Mi) was silently applied, putting the operator at 97% memory. Patched live to 512Mi via kubectl. - Increase vault-raft-backup backoff_limit to 6 for transient failures (also fixed NFS export: vault-backup was a separate ZFS dataset not in the TrueNAS NFS share — destroyed dataset, created directory)	2026-03-19 23:26:05 +00:00
Viktor Barzin	250a058627	feat(traefik): add custom error pages with tarampampam/error-pages Deploy error-pages service to show themed error pages instead of raw Traefik 502/503/504 responses. Adds catch-all IngressRoute (priority 1) for 404 on unknown hosts. Only 5xx intercepted to avoid breaking JSON APIs.	2026-03-19 23:14:27 +00:00
Viktor Barzin	d95144bd05	fix(immich): bump postgres memory 512Mi → 1Gi for v2.6.1 geodata migration v2.6.1 bulk-inserts into geodata_places on first boot, OOM-killing postgres at 512Mi. Raise to 1Gi to accommodate the migration.	2026-03-19 22:50:36 +00:00
Viktor Barzin	da630b8869	upgrade immich v2.5.6 → v2.6.1	2026-03-19 22:45:04 +00:00
Viktor Barzin	af2222fce8	backup & DR: add alerting, fix rotation, secure MySQL password, add runbooks Phase 1: Add 12 PrometheusRules for backup health alerting - PostgreSQL, MySQL, Vault, Vaultwarden, Redis staleness + never-succeeded alerts - CSIDriverCrashLoop alert for nfs-csi/iscsi-csi namespaces - Generic BackupCronJobFailed alert Phase 2: Fix backup rotation - etcd: timestamped snapshots instead of overwriting single file - Redis: timestamped RDB files with 7-day retention purge - PostgreSQL: retention increased from 7 to 14 days Phase 3: Fix MySQL password exposure - Move root password from command line arg to MYSQL_PWD env var via secretKeyRef Phase 5: Add restore runbooks - PostgreSQL, MySQL, Vault, etcd, Vaultwarden, full cluster rebuild	2026-03-19 20:34:33 +00:00
Viktor Barzin	e54bc016ba	reduce alert noise: raise memory thresholds, exclude claude-memory 4xx, right-size mysql-operator - ContainerNearOOM: 85% → 90% (silences forgejo, changedetection, immich-pg, mysql-cluster) - ClusterMemoryRequestsHigh: 85% → 92% (intentional overcommit) - NodeMemoryPressureTrending: 85% → 92% - HighService4xxRate: exclude claude-memory (401s from unauth requests are expected) - mysql-operator memory limit: 512Mi → 580Mi (VPA upperBound 481Mi × 1.2)	2026-03-19 20:25:36 +00:00
Viktor Barzin	21bb3036af	state(dbaas): update encrypted state	2026-03-19 20:23:59 +00:00
Viktor Barzin	01eb9dd121	fix(monitoring): patch idrac-redfish-exporter to restore PSU voltage metric Upstream v2.4.1 accidentally dropped idrac_power_supply_input_voltage from the legacy RefreshPowerOld code path during a Huawei OEM support refactor. Built a patched image that restores the single missing line: mc.NewPowerSupplyInputVoltage(ch, psu.LineInputVoltage, id) Image: viktorbarzin/idrac-redfish-exporter:2.4.1-voltage-fix [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-19 13:37:14 +00:00
Viktor Barzin	b05421dbb5	add comment explaining prometheus 4Gi minimum memory requirement [ci skip]	2026-03-18 21:45:26 +00:00
Viktor Barzin	9d87ce605f	revert prometheus memory 3Gi→4Gi: WAL tmpfs shares cgroup limit The 2Gi WAL tmpfs (medium: Memory) counts against the container's memory limit. At 3Gi, Prometheus OOM-kills during WAL replay on startup (heap + tmpfs > 3Gi). Reverting to 4Gi restores headroom.	2026-03-18 21:44:14 +00:00
Viktor Barzin	410c893647	fix(provision): security hardening from code review - Add input validation: username regex + email format check in pipeline - Quote variables in .provision-env to prevent shell injection - Remove dead source command (each Woodpecker command is separate shell) - Use jq to build JSON payloads (prevents injection via group names) - Clean up git-crypt key on failure (use ; instead of &&) - Add Kyverno ndots lifecycle ignore to webhook-handler deployment	2026-03-18 21:25:03 +00:00
Viktor Barzin	fd130971aa	feat(provision): automated user provisioning via Authentik webhook - Expand CI Vault policy: write secret/data/platform + Transit SOPS keys - Add Woodpecker provision-user.yml pipeline (manual event, API-triggered) - Add env vars to webhook-handler deployment for Woodpecker/Authentik integration - Update add-user skill with automated flow documentation - Update Woodpecker repo ID list in CLAUDE.md	2026-03-17 23:56:30 +00:00
Viktor Barzin	0fff155f17	feat(k8s-portal): update onboarding + architecture with SOPS state docs Onboarding (namespace-owner): - Add steps for sops/terragrunt install, state decrypt, apply workflow - Add flow diagram showing auth → decrypt → apply → encrypt → push - Add architecture overview with security model table - Add access control callout explaining per-stack Transit keys Architecture: - Add secrets & state encryption section with ASCII diagrams - Add request flow diagram (Cloudflare → Traefik → pods) - Add CI/CD pipeline diagram (GHA → Woodpecker → K8s) [ci skip]	2026-03-17 23:17:47 +00:00
Viktor Barzin	ccbcebb670	feat(vault): automate SOPS onboarding for namespace-owners - Add Transit mount + per-stack Transit keys to vault stack TF - Auto-create sops-user-<name> policy scoping decrypt to owned stacks - Auto-create sops-<name> external group + alias for Authentik mapping - Add sops-admin policy to authentik-admins group - Attach sops-user policy to namespace-owner identity entities - Update add-user skill with SOPS onboarding steps and Authentik group - Adding a user to k8s_users + applying vault stack = full SOPS access [ci skip]	2026-03-17 23:15:25 +00:00
Viktor Barzin	12a51c4ffa	right-size memory requests to unblock GPU workloads and fix dbaas quota [ci skip] - nvidia: custom LimitRange (128Mi default, was 1Gi from Kyverno) to stop inflating GPU operator init containers; saves ~2.5Gi on GPU node - nvidia: dcgm-exporter 1536Mi → 768Mi (actual usage 489Mi) - monitoring: prometheus server 4Gi → 3Gi (actual usage 2.6Gi) - onlyoffice: 2304Mi → 1536Mi (actual usage 1.3Gi) - immich: frame explicit 64Mi resources (was getting 1Gi LimitRange default) - dbaas: quota limits.memory 20Gi → 24Gi to fit 3rd MySQL replica Root cause: Kyverno tier-2-gpu LimitRange injected 1Gi on every NVIDIA init container (no explicit resources), wasting ~2.5Gi scheduling overhead on the GPU node. Combined with over-requesting, frigate and immich-ml couldn't schedule.	2026-03-17 22:35:54 +00:00
Viktor Barzin	73511b1230	extract remaining 19 modules from platform, complete stack split [ci skip] Phase 3: all 27 platform modules now run as independent stacks. Platform reduced to empty shell (outputs only) for backward compat with 72 app stacks that declare dependency "platform". Fixed technitium cross-module dashboard reference by copying file. Woodpecker pipeline applies all 27+1 stacks in parallel via loop. All applied with zero destroys.	2026-03-17 21:42:16 +00:00
Viktor Barzin	ae36dc253b	extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] Phase 2 of platform stack split. 5 more modules extracted into independent stacks. All applied successfully with zero destroys. Cloudflared now reads k8s_users from Vault directly to compute user_domains. Woodpecker pipeline runs all 8 extracted stacks in parallel. Memory bumped to 6Gi for 9 concurrent TF processes. Platform reduced from 27 to 19 modules.	2026-03-17 21:34:11 +00:00
Viktor Barzin	3c804aedf8	extract dbaas, authentik, crowdsec from platform into independent stacks [ci skip] Phase 1 of platform stack split for parallel CI applies. All 3 modules were fully independent (no cross-module refs). State migrated via terraform state mv. All 3 stacks applied with zero changes (dbaas had pre-existing ResourceQuota drift). Woodpecker pipeline updated to run extracted stacks in parallel.	2026-03-17 18:11:53 +00:00
Viktor Barzin	c8b42f78df	fix DB password rotation desync in 5 stacks Vault DB engine rotates passwords weekly but 5 stacks baked passwords at Terraform plan time, causing stale credentials until next apply. - real-estate-crawler: add vault-database ESO, use secret_key_ref in 3 deployments - nextcloud: switch Helm chart to existingSecret for DB password - grafana: add vault-database ESO, use envFromSecrets in Helm values - woodpecker: use extraSecretNamesForEnvFrom, remove plan-time data source chain - affine: add vault-database ESO, use secret_key_ref in deployment + init container	2026-03-17 07:39:29 +00:00
Viktor Barzin	8d8c8db737	increase DB password rotation from 24h to weekly (604800s)	2026-03-16 23:17:01 +00:00
Viktor Barzin	c31ba2c50c	k8s-portal: use Recreate strategy, limit revision history to 3 Prevents stale pods serving old content during rapid successive deploys. With 1 replica + RollingUpdate, old and new pods briefly coexist.	2026-03-16 22:55:15 +00:00
Viktor Barzin	fb66676d7b	post-mortem: kured + containerd cascade outage — alerts + report 26h outage caused by unattended-upgrades kernel update → kured reboot → containerd overlayfs snapshotter corruption → image pull failures → calico down → cascading cluster outage. Remediation: - Add "Node Runtime Health" Prometheus alert group (6 alerts): KubeletImagePullErrors, KubeletPLEGUnhealthy, PodsStuckContainerCreating, KubeletRuntimeOperationsLatency, KubeletRunningContainersDrop, CalicoNodeNotReady - Add containerd cascade inhibition rule - Save post-mortem report as HTML in post-mortems/ Also applied via kubectl (needs Terraform codification): - Sentinel gate DaemonSet gating kured reboots on cluster health - Fixed kured Helm values: reboot window + gated sentinel path	2026-03-16 22:06:10 +00:00
Viktor Barzin	327c021a90	fix: improve Slack alert formatting — add values, fix ContainerNearOOM filter - Add container!="" filter to ContainerNearOOM to exclude system-level cadvisor entries - Add $value to summaries: ContainerOOMKilled, ClusterMemoryRequestsHigh, ContainerNearOOM, PVPredictedFull, NFSServerUnresponsive, NewTailscaleClient - Add fallback field to all Slack receivers for clean push notifications - Multiply ratio exprs by 100 for readable percentages - Rename "New Tailscale client" to CamelCase "NewTailscaleClient" - Add actionable hints to PodUnschedulable, NodeConditionBad, ForwardAuthFallbackActive	2026-03-16 19:35:24 +00:00
Viktor Barzin	b2d07556d5	fix: migrate woodpecker database credentials to runtime-refreshed ExternalSecret The woodpecker server was crashing repeatedly with database authentication failures because Vault rotates the database password every 24 hours, but the Helm release had hardcoded the password into WOODPECKER_DATABASE_DATASOURCE at plan time. Changes: - Updated ExternalSecret to provide the full DATABASE_DATASOURCE URI dynamically - Modified Helm values to use envFrom to inject the secret instead of hardcoding - ExternalSecret refreshes every 15 minutes, automatically picking up rotated passwords - Pod will auto-restart when secret changes (via reloader.stakater.com annotation) - This eliminates the plan-time password snapshot that goes stale within 24h The pod still has an unrelated image pull issue on k8s-node4 (containerd blob corruption), but the database credentials mechanism is now correctly implemented.	2026-03-16 19:12:01 +00:00

323 commits