infra

Author	SHA1	Message	Date
Viktor Barzin	e097b7eb29	fix(ingress): wire up backend_protocol, remove dead ssl_redirect variable Post nginx→Traefik migration cleanup: - backend_protocol now sets serversscheme + serverstransport annotations for HTTPS backends (k8s-dashboard, pfsense, nas, idrac, proxmox, etc.) - Remove ssl_redirect variable (nginx-only, silently ignored by Traefik) and all 9 caller references [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 08:45:56 +00:00
Viktor Barzin	8c4942779f	feat(ingress): wire up max_body_size as Traefik buffering middleware The max_body_size variable existed but was never used. Now creates a per-ingress buffering middleware enforcing maxRequestBodyBytes (default 50MB). [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 08:36:01 +00:00
Viktor Barzin	586f8345d1	fix(dbaas,actualbudget): apply OOM fixes — sync live cluster with Terraform code Live cluster had stale resource limits causing OOMKills: - actualbudget-http-api: 128Mi → 512Mi (code already correct) - pg-cluster CNPG: 512Mi → 4Gi (code already correct) - dbaas ResourceQuota: 20Gi → 24Gi live (TF code has 64Gi) Formatting cleanup from terraform fmt included. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 08:04:10 +00:00
Viktor Barzin	332dbdee59	state: add devvm as SOPS recipient Add devvm age public key to .sops.yaml and re-encrypt all 101 state files with both laptop and devvm keys.	2026-03-18 08:04:10 +00:00
Viktor Barzin	3583994efe	state: add SOPS-encrypted terraform state to git - SOPS + age encrypts all 101 .tfstate files (JSON-aware: keys visible, values encrypted) - scripts/state-sync: encrypt/decrypt/commit wrapper - scripts/tg: auto-decrypt before ops, auto-encrypt+commit after apply/destroy - terragrunt.hcl: -backup=- prevents backup file accumulation - .gitignore: track .tfstate.enc, ignore plaintext .tfstate - Cleaned 964MB of stale backups (state/backups/, .backup files)	2026-03-18 08:04:10 +00:00
Viktor Barzin	7ec627f365	right-size memory requests to unblock GPU workloads and fix dbaas quota [ci skip] - nvidia: custom LimitRange (128Mi default, was 1Gi from Kyverno) to stop inflating GPU operator init containers; saves ~2.5Gi on GPU node - nvidia: dcgm-exporter 1536Mi → 768Mi (actual usage 489Mi) - monitoring: prometheus server 4Gi → 3Gi (actual usage 2.6Gi) - onlyoffice: 2304Mi → 1536Mi (actual usage 1.3Gi) - immich: frame explicit 64Mi resources (was getting 1Gi LimitRange default) - dbaas: quota limits.memory 20Gi → 24Gi to fit 3rd MySQL replica Root cause: Kyverno tier-2-gpu LimitRange injected 1Gi on every NVIDIA init container (no explicit resources), wasting ~2.5Gi scheduling overhead on the GPU node. Combined with over-requesting, frigate and immich-ml couldn't schedule.	2026-03-18 08:04:08 +00:00
Viktor Barzin	263d97bea2	extract remaining 19 modules from platform, complete stack split [ci skip] Phase 3: all 27 platform modules now run as independent stacks. Platform reduced to empty shell (outputs only) for backward compat with 72 app stacks that declare dependency "platform". Fixed technitium cross-module dashboard reference by copying file. Woodpecker pipeline applies all 27+1 stacks in parallel via loop. All applied with zero destroys.	2026-03-18 08:04:08 +00:00
Viktor Barzin	f7c3a338a5	extract monitoring, nvidia, mailserver, cloudflared, kyverno from platform [ci skip] Phase 2 of platform stack split. 5 more modules extracted into independent stacks. All applied successfully with zero destroys. Cloudflared now reads k8s_users from Vault directly to compute user_domains. Woodpecker pipeline runs all 8 extracted stacks in parallel. Memory bumped to 6Gi for 9 concurrent TF processes. Platform reduced from 27 to 19 modules.	2026-03-18 08:04:07 +00:00
Viktor Barzin	5b11761d22	extract dbaas, authentik, crowdsec from platform into independent stacks [ci skip] Phase 1 of platform stack split for parallel CI applies. All 3 modules were fully independent (no cross-module refs). State migrated via terraform state mv. All 3 stacks applied with zero changes (dbaas had pre-existing ResourceQuota drift). Woodpecker pipeline updated to run extracted stacks in parallel.	2026-03-18 08:04:06 +00:00
Viktor Barzin	94717dcd32	fix DB password rotation desync in 5 stacks Vault DB engine rotates passwords weekly but 5 stacks baked passwords at Terraform plan time, causing stale credentials until next apply. - real-estate-crawler: add vault-database ESO, use secret_key_ref in 3 deployments - nextcloud: switch Helm chart to existingSecret for DB password - grafana: add vault-database ESO, use envFromSecrets in Helm values - woodpecker: use extraSecretNamesForEnvFrom, remove plan-time data source chain - affine: add vault-database ESO, use secret_key_ref in deployment + init container	2026-03-18 08:04:05 +00:00
Viktor Barzin	6656743968	increase DB password rotation from 24h to weekly (604800s)	2026-03-18 08:04:05 +00:00
Viktor Barzin	7282e37294	k8s-portal: use Recreate strategy, limit revision history to 3 Prevents stale pods serving old content during rapid successive deploys. With 1 replica + RollingUpdate, old and new pods briefly coexist.	2026-03-18 08:04:05 +00:00
Viktor Barzin	ebc1aca791	add GitHub Pages for post-mortems - Index page listing all incident reports - GHA workflow deploys post-mortems/ on push - Available at viktorbarzin.github.io/infra/	2026-03-18 08:04:04 +00:00
Viktor Barzin	12918dd491	post-mortem: kured + containerd cascade outage — alerts + report 26h outage caused by unattended-upgrades kernel update → kured reboot → containerd overlayfs snapshotter corruption → image pull failures → calico down → cascading cluster outage. Remediation: - Add "Node Runtime Health" Prometheus alert group (6 alerts): KubeletImagePullErrors, KubeletPLEGUnhealthy, PodsStuckContainerCreating, KubeletRuntimeOperationsLatency, KubeletRunningContainersDrop, CalicoNodeNotReady - Add containerd cascade inhibition rule - Save post-mortem report as HTML in post-mortems/ Also applied via kubectl (needs Terraform codification): - Sentinel gate DaemonSet gating kured reboots on cluster health - Fixed kured Helm values: reboot window + gated sentinel path	2026-03-18 08:04:04 +00:00
Viktor Barzin	6efaed096d	post-mortem v2: pipeline team architecture with 4-stage agents [ci skip] Split monolithic orchestrator into triage (haiku), historian (sonnet), and report-writer (opus) stages. Each stage gets its own tool budget. Added sev-context.sh for structured cluster context gathering.	2026-03-18 08:04:04 +00:00
Viktor Barzin	66c70ce10f	fix: improve Slack alert formatting — add values, fix ContainerNearOOM filter - Add container!="" filter to ContainerNearOOM to exclude system-level cadvisor entries - Add $value to summaries: ContainerOOMKilled, ClusterMemoryRequestsHigh, ContainerNearOOM, PVPredictedFull, NFSServerUnresponsive, NewTailscaleClient - Add fallback field to all Slack receivers for clean push notifications - Multiply ratio exprs by 100 for readable percentages - Rename "New Tailscale client" to CamelCase "NewTailscaleClient" - Add actionable hints to PodUnschedulable, NodeConditionBad, ForwardAuthFallbackActive	2026-03-18 08:04:04 +00:00
Viktor Barzin	1cd767652d	fix: migrate woodpecker database credentials to runtime-refreshed ExternalSecret The woodpecker server was crashing repeatedly with database authentication failures because Vault rotates the database password every 24 hours, but the Helm release had hardcoded the password into WOODPECKER_DATABASE_DATASOURCE at plan time. Changes: - Updated ExternalSecret to provide the full DATABASE_DATASOURCE URI dynamically - Modified Helm values to use envFrom to inject the secret instead of hardcoding - ExternalSecret refreshes every 15 minutes, automatically picking up rotated passwords - Pod will auto-restart when secret changes (via reloader.stakater.com annotation) - This eliminates the plan-time password snapshot that goes stale within 24h The pod still has an unrelated image pull issue on k8s-node4 (containerd blob corruption), but the database credentials mechanism is now correctly implemented.	2026-03-18 08:04:04 +00:00
Viktor Barzin	04084a8f0f	add deploy-app skill and agent for automated repo→app deployment [ci skip]	2026-03-18 08:04:04 +00:00
Viktor Barzin	3b195d661a	fix pull-through cache: remove maxsize, harden nginx caching [ci skip] Root cause: storage.filesystem.maxsize (5GiB) caused Docker Registry to delete blob data while keeping metadata. Registry then served 200 OK with correct Content-Length but 0 bytes body. nginx cached these broken responses. Fixes: - Remove maxsize from dockerhub/ghcr proxy configs (rely on weekly GC) - nginx: don't cache 206 responses, require 2 requests before caching - Wiped corrupted cache on registry VM and fixed corrupted pause container blobs on node3/node4	2026-03-18 08:04:04 +00:00
Viktor Barzin	b8f4f14bda	update claude knowledge: GHA builds architecture, postgresql_host fix [ci skip]	2026-03-18 08:04:04 +00:00
Viktor Barzin	042151ad11	fix: update postgresql_host to pg-cluster-rw (old service had no endpoints) The legacy `postgresql.dbaas` service had no endpoints after CNPG migration, causing Woodpecker and other stacks to fail DB connections. Changed to `pg-cluster-rw.dbaas` which points to the CNPG primary.	2026-03-18 08:04:04 +00:00
Viktor Barzin	36850a7e40	scale up f1-stream and changedetection [ci skip]	2026-03-18 08:04:04 +00:00
Viktor Barzin	e383dfb443	fix platform stack: k8s_users.domains and sensitive for_each errors [ci skip] - Use lookup(user, "domains", []) for missing domains attribute - Wrap user_domains in nonsensitive() for Cloudflare for_each	2026-03-18 08:04:03 +00:00
Viktor Barzin	2eb9ff0156	update claude knowledge: secret/viktor is go-to for all personal secrets [ci skip]	2026-03-18 08:04:03 +00:00
Viktor Barzin	46f329ba48	trigger CI: json webhook	2026-03-18 08:04:03 +00:00
Viktor Barzin	5b24c8d437	fix woodpecker sync: single $ in heredoc, alpine image for jq, port 80 not 8000	2026-03-18 08:04:03 +00:00
Viktor Barzin	171d03086e	right-size 14 services and scale down GPU-heavy workloads [ci skip] Memory right-sizing based on VPA upperBound analysis: - Increases: stirling-pdf 1200→1536Mi, claude-memory 64→128Mi, dawarich 512→768Mi, kyverno-cleanup 128→192Mi, linkwarden 768→1Gi, navidrome 64→128Mi, listenarr 768→896Mi, privatebin 64→128Mi, ntfy 64→128Mi, health 128→256Mi, dbaas quota 16→20Gi, mysql-operator 384→512Mi - Decreases: rybbit 768→384Mi, nvidia-exporter added explicit 192Mi, dcgm-exporter 2560→1536Mi - Scale to 0: ebook2audiobook/audiblez-web, whisper (GPU node pressure) Net effect: -496Mi cluster-wide, 13 ContainerNearOOM alerts resolved, all ResourceQuota pressures cleared, GPU health green.	2026-03-18 08:04:03 +00:00
Viktor Barzin	2694f884d6	fix: increase terragrunt-apply step memory to 2Gi LimitRange defaults containers to 192Mi which is insufficient for terragrunt apply on the platform stack (48 vault refs, many modules). Set explicit 1Gi request / 2Gi limit via backend_options.	2026-03-18 08:04:03 +00:00
Viktor Barzin	24c4cd32a5	fix: CI pipeline - disable corrupted cache, add pull before push - build-cli.yml: comment out cache_from/cache_to to avoid BuildKit "short read" errors from corrupted registry cache - default.yml: add git pull --rebase before push in cleanup-and-push to handle remote having newer commits	2026-03-18 08:04:03 +00:00
Viktor Barzin	6f94d6fc13	update claude knowledge: final ESO migration state [ci skip]	2026-03-18 08:04:03 +00:00
Viktor Barzin	aec6215000	add add-user skill for cluster onboarding Interactive skill that collects user info, updates Vault KV k8s_users, and applies vault/platform/woodpecker stacks. Includes verification checklist and auto-generated resource table.	2026-03-18 08:04:03 +00:00
Viktor Barzin	ecb8273ba5	add user onboarding and admin instructions to README Admin section: how to add a new namespace-owner (Authentik group, Vault KV entry, three terragrunt applies). Includes auto-generated resource table. User section: VPN setup, tool install, Vault/kubectl auth, first app deployment from template, CI/CD pipeline example, useful commands, and important rules.	2026-03-18 08:04:03 +00:00
Viktor Barzin	642b0e578d	fix ollama: remove conditional count on basicAuth (incompatible with ESO data source)	2026-03-18 08:04:03 +00:00
Viktor Barzin	0610ea30d4	add generic multi-user cluster onboarding system Data-driven user onboarding: add a JSON entry to Vault KV k8s_users, apply vault + platform + woodpecker stacks, and everything is auto-generated. Vault stack: namespace creation, per-user Vault policies with secret isolation via identity entities/aliases, K8s deployer roles, CI policy update. Platform stack: domains field in k8s_users type, TLS secrets per user namespace, user domains merged into Cloudflare DNS, user-roles ConfigMap mounted in portal. Woodpecker stack: admin list auto-generated from k8s_users, WOODPECKER_OPEN=true. K8s-portal: dual-track onboarding (general/namespace-owner), namespace-owner dashboard with Vault/kubectl commands, setup script adds Vault+Terraform+Terragrunt, contributing page with CI pipeline template, versioned image tags in CI pipeline. New: stacks/_template/ with copyable stack template for namespace-owners.	2026-03-18 08:04:03 +00:00
Viktor Barzin	5bc50af99e	migrate 16 plan-time stacks: vault data source → ESO + kubernetes_secret Replaced data "vault_kv_secret_v2" with: 1. ExternalSecret (ESO syncs Vault KV → K8s Secret) 2. data "kubernetes_secret" (reads ESO-created secret at plan time) This removes the Vault provider dependency at plan time for these stacks — they now only need K8s API access, not a Vault token. Stacks: actualbudget, affine, audiobookshelf, calibre, changedetection, coturn, freedify, freshrss, grampsweb, navidrome, novelapp, ollama, owntracks, real-estate-crawler, servarr, ytdlp	2026-03-18 08:04:03 +00:00
Viktor Barzin	f7f8e4beba	fix health DB ExternalSecret: use pg-health not postgresql-health role name	2026-03-18 08:04:03 +00:00
root	7dad3ec54b	Woodpecker CI deploy commit [CI SKIP]	2026-03-18 08:04:03 +00:00
Viktor Barzin	fca99fd418	fix DB password desync + migrate remaining tfvars to Vault DB desync fix: Stacks with Vault DB engine rotation (24h) now read the password from vault-database ClusterSecretStore instead of vault-kv. 9 stacks updated with db ExternalSecrets reading from static-creds/*. Stacks fixed: speedtest, hackmd, health, trading-bot, claude-memory, woodpecker, linkwarden, nextcloud, url. terraform.tfvars migration: - plotting-book: google_client_id/secret → Vault KV + secret_key_ref - tandoor: email_password var removed (was default="", now optional ESO) - infra: ssh_private_key, vm_wizard_password, dockerhub_registry_password → Vault KV at secret/infra + data source	2026-03-18 08:04:03 +00:00
Viktor Barzin	19e0aef67b	regenerate providers.tf: remove vault_root_token variable [ci skip]	2026-03-18 08:04:03 +00:00
Viktor Barzin	9ed19e1b42	fix realestate-crawler: access nested notification_settings correctly Vault KV stores notification_settings as nested JSON ({"slack":{"webhook_url":""}}). TF code was passing the map object directly as a string env var value. Fix: access ["slack"]["webhook_url"] with try() fallback.	2026-03-18 08:04:03 +00:00
Viktor Barzin	e9a14f7df9	update claude knowledge: vault-native secrets migration decisions [ci skip]	2026-03-18 08:04:03 +00:00
Viktor Barzin	7e3540e56a	fix woodpecker sync script: escape $ and %{} for HCL heredoc HCL heredocs always interpolate — use $$ for literal $ and %%{} for literal %{}. Fixes terraform plan errors.	2026-03-18 08:04:03 +00:00
Viktor Barzin	14125c1b9b	add pod dependency management via Kyverno init container injection Kyverno ClusterPolicy reads dependency.kyverno.io/wait-for annotation and injects busybox init containers that block until each dependency is reachable (nc -z). Annotations added to 18 stacks (24 deployments). Includes graceful-db-maintenance.sh script for planned DB maintenance (scales dependents to 0, saves replica counts, restores on startup).	2026-03-18 08:04:02 +00:00
root	8af7e20527	Woodpecker CI deploy commit [CI SKIP]	2026-03-18 08:04:02 +00:00
Viktor Barzin	a833363e1d	fix gpu-workload Kyverno policy: use replace with explicit priority value The API server doesn't re-resolve priority from PriorityClassName after webhook mutation. Changed from remove+add to replace with explicit priority=1200000 and preemptionPolicy=PreemptLowerPriority.	2026-03-18 08:04:02 +00:00
Viktor Barzin	f82ece8d1f	add Vault→Woodpecker secret sync CronJob (Part E) Syncs secrets from Vault KV at secret/ci/global to Woodpecker global secrets via REST API every 6 hours. Authenticates via K8s SA JWT (woodpecker-sync role). New repos just add secrets to Vault and use from_secret: in pipeline files. Also removes k8s-dashboard static admin token — use vault write kubernetes/creds/dashboard-admin instead.	2026-03-18 08:04:02 +00:00
Viktor Barzin	850ab5277f	migrate consuming stacks to ESO + remove k8s-dashboard static token Phase 9: ExternalSecret migration across 26 stacks: Fully migrated (vault data source removed, ESO delivers secrets): - speedtest, shadowsocks, wealthfolio, plotting-book, f1-stream, tandoor - n8n, dawarich, diun, netbox, onlyoffice, tuya-bridge - hackmd (ESO template for DB URL), health (ESO template for DB URL) - trading-bot (ESO template for DATABASE_URL + 7 secret env vars) - forgejo (removed unused vault data source) Partially migrated (vault kept for plan-time, ESO added for runtime): - immich, linkwarden, nextcloud, paperless-ngx (jsondecode for homepage) - claude-memory, rybbit, url, webhook_handler (plan-time in locals/jobs) - woodpecker, openclaw, resume (plan-time in helm values/jobs/modules) 17 stacks unchanged (all plan-time: homepage annotations, configmaps, module inputs) — vault data source works with OIDC auth. Phase 17a: Remove k8s-dashboard static admin token secret. Users now get tokens via: vault write kubernetes/creds/dashboard-admin	2026-03-18 08:04:02 +00:00
Viktor Barzin	484f2f29fb	enhance devops-engineer agent: deploy + monitor pod health [ci skip] - Upgrade model from sonnet to opus for subagent orchestration - Add Write, Edit, Agent tools for spawning monitor subagents - Add mandatory deployment workflow: pre-deploy snapshot, apply, spawn background haiku pod monitor, react to results - Monitor detects CrashLoopBackOff, OOM, ImagePullBackOff, stuck Pending, and probe failures within 3 min timeout - Allow terragrunt apply and kubectl set image as safe operations	2026-03-18 08:04:02 +00:00
Viktor Barzin	0bae93a097	claude-memory: read DB password from Vault KV instead of tfvars Vault DB engine rotates the password every 24h, so the static tfvars value was stale. Now reads from secret/claude-memory db_password key.	2026-03-18 08:04:02 +00:00
Viktor Barzin	91e5d728a2	etcd defrag cronjob: add --command-timeout=60s Default 5s timeout causes defrag to fail on fragmented DBs. Discovered during manual defrag that took ~7s.	2026-03-18 08:04:02 +00:00
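
The two woodpecker sync script fixes above (7e3540e56a, then 5b24c8d437) both concern literal `$` characters inside an HCL heredoc. As a minimal sketch, assuming the script text starts out unescaped: Terraform's template syntax treats `$${` as a literal `${` and `%%{` as a literal `%{`, while a `$` or `%` that is not followed by `{` needs no escaping. The helper name `escape_hcl_template` is hypothetical and is not part of this repository.

```python
import re


def escape_hcl_template(text: str) -> str:
    """Escape `${` and `%{` so an HCL heredoc emits them literally.

    Assumption: `text` is a raw shell script that has not been escaped yet.
    """
    # Double the sigil only when it opens a template sequence:
    #   ${  ->  $${      %{  ->  %%{
    # A bare `$VAR` or `100%` is left untouched.
    return re.sub(r"([$%])\{", r"\1\1{", text)


if __name__ == "__main__":
    script = 'echo "$HOME" "${TOKEN}" "%{http_code}"'
    print(escape_hcl_template(script))
```

Only the brace-opening forms are rewritten, which matches the later fix that went back to a single `$` for plain shell variables.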

2060 commits