infra

Author	SHA1	Message	Date
Viktor Barzin	d3d86ce433	[woodpecker] Bump WOODPECKER_FORGE_TIMEOUT 3s → 30s The default forge-API timeout is 3 seconds. The config-loader makes 4-6 sequential calls per pipeline trigger (probing for .woodpecker dir then each .woodpecker.{yaml,yml} variant), and Forgejo responses on this cluster spike to 1-2s under load — easy to trip the cumulative 3s deadline. Result: 'could not load config from forge: context deadline exceeded' on virtually every pipeline trigger. This was the actual root cause of the 'Woodpecker forge-API bug' that v3.13 → v3.14 was supposed to fix — turns out v3.14 didn't change the timeout default, and the v3.13 successes I saw earlier were warm-cache flukes.	2026-05-07 23:10:48 +00:00
Viktor Barzin	8f1ee9a55f	[woodpecker] Bump server + agent v3.13.0 → v3.14.0 Fixes the 'could not load config from forge: context deadline exceeded' issue that blocked every Forgejo-triggered pipeline during the forgejo-registry-consolidation cutover. Helm chart 3.5.1 stays (no 3.6 yet); only the image tag overrides change.	2026-05-07 22:28:56 +00:00
Viktor Barzin	39e7409f11	[woodpecker] Persist hostAliases patch via null_resource (chart doesn't expose it) Helm chart 3.5.1 has no `server.hostAliases` field, so the YAML addition I made earlier was a no-op. Apply via kubectl patch in a null_resource keyed on helm revision so it re-asserts on every chart upgrade. Same pattern as the CoreDNS replicas/affinity patch in stacks/technitium/. Without this, every helm upgrade on woodpecker reverts the hostAliases fix and the Forgejo pipeline triggers start failing with context-deadline-exceeded again. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-07 17:18:57 +00:00
Viktor Barzin	00fc0cf5bb	[woodpecker] Pin forgejo.viktorbarzin.me to in-cluster Traefik LB Pipeline triggers from Forgejo were failing with "could not load config from forge: context deadline exceeded" — Woodpecker's forge-API fetch path was round-tripping through Cloudflare via the public IP, hitting 30s deadline timeouts on cold connections. The in-cluster path via the Traefik LB (10.0.20.200) is consistently sub-100ms. Same trick we use for the containerd hosts.toml redirect on each node — Traefik serves the *.viktorbarzin.me wildcard cert so SNI verification still passes. OAuth callbacks still use the public hostname (correct, those come from the user's browser). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-07 17:13:51 +00:00
Viktor Barzin	601a83d84e	fix: CI pipeline image pull auth + shallow clone resilience [ci skip] - Add WOODPECKER_BACKEND_K8S_PULL_SECRET_NAMES to agent env so step pods can pull from private registry (registry.viktorbarzin.me:5050) - Add fallback in default.yml when HEAD~1 is unavailable (shallow clone with depth=1): fetch more history, or apply all platform stacks as safe default - Root cause: pipeline #243 failed because infra-ci:latest image couldn't be pulled (no imagePullSecrets on step pods) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 14:41:08 +00:00
Viktor Barzin	82b0f6c4cb	truenas deprecation: migrate all non-immich storage to proxmox NFS - Migrate 7 backup CronJobs to Proxmox host NFS (192.168.1.127) (etcd, mysql, postgresql, nextcloud, redis, vaultwarden, plotting-book) - Migrate headscale backup, ebook2audiobook, osm_routing to Proxmox NFS - Migrate servarr (lidarr, readarr, soulseek) NFS refs to Proxmox - Remove 79 orphaned TrueNAS NFS module declarations from 49 stacks - Delete stacks/platform/modules/ (27 dead module copies, 65MB) - Update nfs-truenas StorageClass to point to Proxmox (192.168.1.127) - Remove iscsi DNS record from config.tfvars - Fix woodpecker persistence config and alertmanager PV Only Immich (8 PVCs, ~1.4TB) remains on TrueNAS.	2026-04-12 14:35:39 +01:00
Viktor Barzin	c8b42f78df	fix DB password rotation desync in 5 stacks Vault DB engine rotates passwords weekly but 5 stacks baked passwords at Terraform plan time, causing stale credentials until next apply. - real-estate-crawler: add vault-database ESO, use secret_key_ref in 3 deployments - nextcloud: switch Helm chart to existingSecret for DB password - grafana: add vault-database ESO, use envFromSecrets in Helm values - woodpecker: use extraSecretNamesForEnvFrom, remove plan-time data source chain - affine: add vault-database ESO, use secret_key_ref in deployment + init container	2026-03-17 07:39:29 +00:00
Viktor Barzin	b2d07556d5	fix: migrate woodpecker database credentials to runtime-refreshed ExternalSecret The woodpecker server was crashing repeatedly with database authentication failures because Vault rotates the database password every 24 hours, but the Helm release had hardcoded the password into WOODPECKER_DATABASE_DATASOURCE at plan time. Changes: - Updated ExternalSecret to provide the full DATABASE_DATASOURCE URI dynamically - Modified Helm values to use envFrom to inject the secret instead of hardcoding - ExternalSecret refreshes every 15 minutes, automatically picking up rotated passwords - Pod will auto-restart when secret changes (via reloader.stakater.com annotation) - This eliminates the plan-time password snapshot that goes stale within 24h The pod still has an unrelated image pull issue on k8s-node4 (containerd blob corruption), but the database credentials mechanism is now correctly implemented.	2026-03-16 19:12:01 +00:00
Viktor Barzin	50620e6047	add generic multi-user cluster onboarding system Data-driven user onboarding: add a JSON entry to Vault KV k8s_users, apply vault + platform + woodpecker stacks, and everything is auto-generated. Vault stack: namespace creation, per-user Vault policies with secret isolation via identity entities/aliases, K8s deployer roles, CI policy update. Platform stack: domains field in k8s_users type, TLS secrets per user namespace, user domains merged into Cloudflare DNS, user-roles ConfigMap mounted in portal. Woodpecker stack: admin list auto-generated from k8s_users, WOODPECKER_OPEN=true. K8s-portal: dual-track onboarding (general/namespace-owner), namespace-owner dashboard with Vault/kubectl commands, setup script adds Vault+Terraform+Terragrunt, contributing page with CI pipeline template, versioned image tags in CI pipeline. New: stacks/_template/ with copyable stack template for namespace-owners.	2026-03-15 22:23:36 +00:00
Viktor Barzin	1acf8cc4e8	migrate consuming stacks to ESO + remove k8s-dashboard static token Phase 9: ExternalSecret migration across 26 stacks: Fully migrated (vault data source removed, ESO delivers secrets): - speedtest, shadowsocks, wealthfolio, plotting-book, f1-stream, tandoor - n8n, dawarich, diun, netbox, onlyoffice, tuya-bridge - hackmd (ESO template for DB URL), health (ESO template for DB URL) - trading-bot (ESO template for DATABASE_URL + 7 secret env vars) - forgejo (removed unused vault data source) Partially migrated (vault kept for plan-time, ESO added for runtime): - immich, linkwarden, nextcloud, paperless-ngx (jsondecode for homepage) - claude-memory, rybbit, url, webhook_handler (plan-time in locals/jobs) - woodpecker, openclaw, resume (plan-time in helm values/jobs/modules) 17 stacks unchanged (all plan-time: homepage annotations, configmaps, module inputs) — vault data source works with OIDC auth. Phase 17a: Remove k8s-dashboard static admin token secret. Users now get tokens via: vault write kubernetes/creds/dashboard-admin	2026-03-15 19:05:04 +00:00
Viktor Barzin	dcb465a7e5	[ci skip] Fix Woodpecker GitHub forge: add explicit GITHUB_URL to prevent Forgejo URL bleed When both WOODPECKER_GITHUB and WOODPECKER_FORGEJO are enabled without an explicit WOODPECKER_GITHUB_URL, the GitHub forge inherits the Forgejo URL causing all GitHub API calls to hit forgejo.viktorbarzin.me with GitHub OAuth credentials, resulting in 401 Unauthorized on repo add and cron jobs. Also adds Forgejo forge variables to Terraform.	2026-02-24 23:02:33 +00:00
Viktor Barzin	89a6e08245	[ci skip] Infrastructure hardening: security, monitoring, reliability, maintainability Phase 1 - Critical Security: - Netbox: move hardcoded DB/superuser passwords to variables - MeshCentral: disable public registration, add Authentik auth - Traefik: disable insecure API dashboard (api.insecure=false) - Traefik: configure forwarded headers with Cloudflare trusted IPs Phase 2 - Security Hardening: - Add security headers middleware (HSTS, X-Frame-Options, nosniff, etc.) - Add Kyverno pod security policies in audit mode (privileged, host namespaces, SYS_ADMIN, trusted registries) - Tighten rate limiting (avg=10, burst=50) - Add Authentik protection to grampsweb Phase 3 - Monitoring & Alerting: - Add critical service alerts (PostgreSQL, MySQL, Redis, Headscale, Authentik, Loki) - Increase Loki retention from 7 to 30 days (720h) - Add predictive PV filling alert (predict_linear) - Re-enable Hackmd and Privatebin down alerts Phase 4 - Reliability: - Add resource requests/limits to Redis, DBaaS, Technitium, Headscale, Vaultwarden, Uptime Kuma - Increase Alloy DaemonSet memory to 512Mi/1Gi Phase 6 - Maintainability: - Extract duplicated tiers locals to terragrunt.hcl generate block (removed from 67 stacks) - Replace hardcoded NFS IP 10.0.10.15 with var.nfs_server (114 instances across 63 files) - Replace hardcoded Redis/PostgreSQL/MySQL/Ollama/mail host references with variables across ~35 stacks - Migrate xray raw ingress resources to ingress_factory modules	2026-02-23 22:05:28 +00:00
Viktor Barzin	d041459ef2	[ci skip] Upgrade Woodpecker CI v3.5.1 → v3.13.0, fix helm healthcheck for v4	2026-02-23 20:14:30 +00:00
Viktor Barzin	c8de2c4803	[ci skip] Sunset Drone CI: remove all artifacts, DNS, configs, and references Drone CI has been fully replaced by Woodpecker CI at ci.viktorbarzin.me. Destroys K8s resources (12), removes DNS records, NFS exports, Uptime Kuma monitor, dashboard entry, and all code/doc references across 18 files.	2026-02-23 19:38:55 +00:00
Viktor Barzin	cbf041bcc9	[ci skip] Add Woodpecker CI stack (WIP) and claude agents - Add stacks/woodpecker/ with Helm-based deployment config - Add .woodpecker/ CI pipeline configs (default, build-cli, renew-tls) - Add NFS export entry for woodpecker - Add .claude/agents/ definitions	2026-02-22 21:30:25 +00:00

15 commits