infra

Author	SHA1	Message	Date
Viktor Barzin	69474fae96	docs: add comprehensive DNS architecture documentation Covers Technitium HA (3-instance AXFR replication), CoreDNS config, Cloudflare external DNS, Split Horizon hairpin NAT fix, DHCP-DNS auto-registration, 6 automation CronJobs, and troubleshooting guides. Also fixes stale NFS reference in networking.md. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 18:10:27 +00:00
Viktor Barzin	2053776d1c	chore: sort outage report service list alphabetically [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 18:01:54 +00:00
Viktor Barzin	0a448c2bae	docs: rewrite incident-response as user contribution guide Complete rewrite of the user-facing documentation: - How to report outages and request features - Mermaid flow diagrams for both incident and feature request paths - SLA expectations (automated vs human response times) - Self-service checks before reporting - Severity level definitions - Status page explanation - Full technical architecture section with component inventory - Safety guardrails, labels, and commit conventions [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 17:59:09 +00:00
Viktor Barzin	cf578516e9	feat: auto-cleanup failed/evicted pods via Kyverno ClusterCleanupPolicy Add cleanup-failed-pods policy that runs hourly (at :15) to delete all pods in Failed phase cluster-wide. Prevents stale evicted and failed CronJob pods from accumulating and creating healthcheck noise. Also adds ClusterRole + ClusterRoleBinding to grant Kyverno cleanup controller permission to delete Pods (not included by default). [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 17:37:49 +00:00
Viktor Barzin	f726d1c3fd	fix: stash local changes before git pull in CI pipelines DevVM may have unstaged changes from active sessions. Use git stash before pull to avoid 'cannot pull with rebase: unstaged changes' errors. Stash pop after to restore working state. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 17:37:10 +00:00
Viktor Barzin	dcce76403a	fix: use direct env vars for Woodpecker pipeline variables Woodpecker injects manual pipeline variables as direct env vars (e.g., $ISSUE_NUMBER), not as CI_PIPELINE_VARIABLE_* prefixed vars. The provision-user pipeline already uses this pattern correctly. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 17:34:42 +00:00
Viktor Barzin	704fa09185	fix: remove manual event from build-ci-image to fix issue automation build-ci-image.yml had event:[push,manual] which caused it to run on every manual pipeline trigger. Its registry_user/registry_password secrets don't have the manual event, causing all manual pipelines to error. Removed manual from its event list since it only needs push. Reverted evaluate conditions (Woodpecker evaluates secrets before conditions, so evaluate can't prevent missing-secret errors). [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 17:31:25 +00:00
Viktor Barzin	a583b11484	fix: guard manual Woodpecker pipelines with evaluate conditions When GHA triggers a manual pipeline for issue automation, ALL pipelines with event:manual fire. Added evaluate conditions: - issue-automation.yml: only runs when ISSUE_NUMBER is set - provision-user.yml: only runs when ISSUE_NUMBER is NOT set - build-ci-image.yml: only runs when ISSUE_NUMBER is NOT set This prevents build-ci-image from failing on missing registry_password secret when issue automation triggers. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 17:29:35 +00:00
Viktor Barzin	de42acd68e	fix: backup LUKS rsync tolerance, stale mapping cleanup, tier-4-aux quota bump - daily-backup: handle rsync exit 23 (partial transfer) as OK for LUKS noload mounts — in-flight writes have corrupt metadata from skipped journal replay, but core data is intact - daily-backup: clean up stale LUKS dm mappings from previous crashed runs before attempting to open - daily-backup: capture rsync exit code safely with set -e (|| pattern) - kyverno: bump tier-4-aux requests.memory 2Gi→3Gi (servarr was at 83%) - actualbudget: patched custom quota 5Gi→6Gi (was at 82%) Verified: backup now completes status=0 (96 PVCs OK, 0 failed) [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 17:21:51 +00:00
Viktor Barzin	92495d0fc3	fix: start Claude from ~/code to load root CLAUDE.md correctly Both issue-automation and postmortem pipelines were cd'ing into ~/code/infra before running Claude, missing the root CLAUDE.md with beads config and project-wide instructions. Now cd to ~/code and use relative agent paths from there. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 17:15:32 +00:00
Viktor Barzin	7bb9ec2934	Add agent task tracking documentation Documents the centralized Beads/Dolt task tracking system used by all Claude Code sessions. Covers architecture, session lifecycle, settings hierarchy, known issues, and E2E test verification. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 17:11:26 +00:00
Viktor Barzin	9baefa22ab	fix: technitium CronJob scheduling, LUKS backup support, speedtest scrape - technitium-password-sync: remove RWO encrypted PVC mount that caused pods to stick in ContainerCreating on wrong nodes. Plugin install now warns instead of failing when zip unavailable. - daily-backup: add LUKS decryption support for encrypted PVC snapshots using /root/.luks-backup-key. Uses noload mount option to skip ext4 journal replay. Also installed cryptsetup-bin on PVE host. - speedtest: disable prometheus.io/scrape annotation (no /prometheus endpoint exists, causing ScrapeTargetDown alert). [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 15:12:32 +00:00
Viktor Barzin	25ef5176bb	state(speedtest): update encrypted state	2026-04-15 15:09:37 +00:00
Viktor Barzin	8b9c335f9c	state(technitium): update encrypted state	2026-04-15 15:09:18 +00:00
Viktor Barzin	601a83d84e	fix: CI pipeline image pull auth + shallow clone resilience [ci skip] - Add WOODPECKER_BACKEND_K8S_PULL_SECRET_NAMES to agent env so step pods can pull from private registry (registry.viktorbarzin.me:5050) - Add fallback in default.yml when HEAD~1 is unavailable (shallow clone with depth=1): fetch more history, or apply all platform stacks as safe default - Root cause: pipeline #243 failed because infra-ci:latest image couldn't be pulled (no imagePullSecrets on step pods) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 14:41:08 +00:00
Viktor Barzin	e23153cf03	chore: add pre-commit size guard and harden .gitignore - Add .githooks/pre-commit that blocks files >2MB (configurable via GIT_MAX_FILE_SIZE). Activate with: git config core.hooksPath .githooks - Expand .gitignore to block common binary/archive patterns (.tar.gz, .tgz, .iso, .img, .bin, .exe, *.dmg) - Add explicit root-level terraform.tfstate ignore rules - Remove stale redis-25.3.2.tgz helm chart (unreferenced) Prevents re-accumulation of large blobs after git history cleanup that reduced .git from 2.6GB to 128MB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 14:13:18 +00:00
Viktor Barzin	b0192d9545	chore: untrack binary build artifacts from git Remove cli/cli (12.5MB), cli/infra_cli (12MB), clipboard-upload (8.7MB) from git tracking. These are build outputs that should be generated by CI. Add patterns to .gitignore to prevent re-committing. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 11:33:43 +00:00
Viktor Barzin	36454b87d1	feat: CI/CD performance overhaul - New custom CI Docker image (ci/Dockerfile) with TF 1.5.7, TG 0.99.4, git-crypt, sops, kubectl pre-installed. Pushed to private registry. Eliminates 17 apk add calls + binary downloads per pipeline run. - Unified CI pipeline: merge default.yml + app-stacks.yml into one. Changed-stacks-only detection (git diff, with global-file fallback). Concurrency limit (xargs -P 4). Step consolidation (2 steps vs 4). Shallow clone (depth=2). Provider cache (TF_PLUGIN_CACHE_DIR). - Per-stack Vault advisory locks in scripts/tg. 30min TTL with stale lock detection. Blocks concurrent applies to same stack. - TF_PLUGIN_CACHE_DIR enabled by default in scripts/tg for local dev. - Daily drift detection pipeline (.woodpecker/drift-detection.yml). Runs terraform plan on all stacks, Slack alert on drift. - CI image build pipeline (.woodpecker/build-ci-image.yml). Expected speedup: ~5-10 min per pipeline run → ~2-4 min. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 11:22:26 +00:00
Viktor Barzin	bcad200a23	chore: add untracked stacks, scripts, and agent configs - New stacks: beads-server, hermes-agent - Terragrunt tiers.tf for infra, phpipam, status-page - Secrets symlinks for vault, phpipam, hermes-agent - Scripts: cluster_manager, image_pull, containerd pullthrough setup - Frigate config, audiblez-web app source, n8n workflows dir - Claude agent: service-upgrade, reference: upgrade-config.json - Removed: claudeception skill, excalidraw empty submodule, temp listings [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 09:33:06 +00:00
Viktor Barzin	bd41bb9230	fix: cluster healthcheck fixes + Authentik upgrade to 2026.2.2 - Authentik: upgrade 2025.10.3 → 2025.12.4 → 2026.2.2 with DB restore and stepped migration. Switch to existingSecret, PgBouncer session mode. - Mailserver: migrate email roundtrip probe from Mailgun to Brevo API - Redis: fix HAProxy tcp-check regex (rstring), faster health intervals - Nextcloud: fix Redis fallback to HAProxy service, update dependency - MeshCentral: fix TLSOffload + certUrl init container for first-run - Monitoring: remove authentik from latency alert exclusion - Diun: simplify to webhook notifier, remove git auto-update [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 06:41:56 +00:00
Viktor Barzin	d31bbc9a18	docs: update monitoring and backup docs for external monitors and per-db backups - CLAUDE.md: document external monitoring (ExternalAccessDivergence alert, external-monitor-sync CronJob) and per-database backup/restore paths - backup-dr.md: add per-db backup CronJobs to inventory table and daily timeline, update restore runbook references - monitoring.md: add External Monitor Sync component and external monitoring architecture section [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 06:37:07 +00:00
Viktor Barzin	0256ccdccc	feat: add per-database backups for PostgreSQL and MySQL Add separate CronJobs that dump each database individually: - postgresql-backup-per-db: pg_dump -Fc per DB (daily 00:15) - mysql-backup-per-db: mysqldump per DB (daily 00:45) Dumps go to /backup/per-db/<dbname>/ on the same NFS PVC. Enables single-database restore without affecting other databases. Also fixed CNPG superuser password sync and added --single-transaction --set-gtid-purged=OFF to MySQL per-db dumps. Updated restore runbooks with per-database restore procedures. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 22:39:33 +00:00
Viktor Barzin	ca1ae23f34	state(dbaas): update encrypted state	2026-04-14 22:39:24 +00:00
Viktor Barzin	99b08e8e50	state(dbaas): update encrypted state	2026-04-14 22:27:40 +00:00
Viktor Barzin	79a41ffde0	state(nextcloud): update encrypted state	2026-04-14 20:29:21 +00:00
Viktor Barzin	554c2e3b5e	state(nextcloud): update encrypted state	2026-04-14 20:28:34 +00:00
Viktor Barzin	3343cc7ba4	state(redis): update encrypted state	2026-04-14 20:28:09 +00:00
Viktor Barzin	41f2b00fa3	state(monitoring): update encrypted state	2026-04-14 20:17:40 +00:00
Viktor Barzin	34a5301123	state(monitoring): update encrypted state	2026-04-14 20:17:24 +00:00
Viktor Barzin	909c4c2744	state(authentik): update encrypted state	2026-04-14 20:16:40 +00:00
Viktor Barzin	5b443c2311	state(authentik): update encrypted state	2026-04-14 20:11:51 +00:00
Viktor Barzin	3e9231ae0d	feat: augment outage report template with debugging context - Expand service list: add Home Assistant, Actual Budget, Audiobookshelf, Linkwarden, Matrix, Paperless, Tandoor, FreshRSS, Frigate, HackMD, Excalidraw, Wealthfolio, Send, Stirling PDF - Add structured debugging fields: error type, scope (just me vs others), when it started, URL accessed - Fix user report parser to extract all form fields into status.json - Show error type, scope, and start time in status page report cards [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 20:03:44 +00:00
Viktor Barzin	c00a908610	state(status-page): update encrypted state	2026-04-14 20:03:29 +00:00
Viktor Barzin	460c68e015	feat: add incident management system with user reporting - Status page (status.viktorbarzin.me): incident cards with SEV badges, expandable timelines, postmortem links, user report rendering - Issue templates on infra repo for user outage reports - CronJob reads incidents + user-reports from ViktorBarzin/infra - "Report an Outage" button on status page links to infra repo - Post-mortem agents restored (4-stage pipeline: triage → investigation → historian → report writer) with updated paths and issue linking - Post-mortem skill/template updated to link reports to GitHub Issues and manage postmortem-required/postmortem-done labels - Labels: incident, sev1-3, user-report, postmortem-required, postmortem-done on infra repo [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 20:00:31 +00:00
Viktor Barzin	24a23709a5	fix: update healthcheck to report internal and external monitors separately - Increase Uptime Kuma API timeout to 120s with wait_events=0.2 - Remove hardcoded password, use Vault or UPTIME_KUMA_PASSWORD env var - Report internal and external monitor status separately - Install uptime-kuma-api in local venv [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 19:44:20 +00:00
Viktor Barzin	26cc6fdf2f	state(authentik): update encrypted state	2026-04-14 19:43:04 +00:00
Viktor Barzin	d05e13e3cc	state(status-page): update encrypted state	2026-04-14 19:36:45 +00:00
Viktor Barzin	c69eba9b46	fix: increase Uptime Kuma API timeout and fix status code format - Increase socket timeout from 30s to 120s (121+ monitors need time to sync) - Add wait_events=0.2 for reliable login - Fix accepted_statuscodes format: use 100-increment ranges not arbitrary [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 19:28:18 +00:00
Viktor Barzin	3d189ce4d7	state(uptime-kuma): update encrypted state	2026-04-14 19:21:19 +00:00
Viktor Barzin	3a6fa6d34f	state(uptime-kuma): update encrypted state	2026-04-14 19:18:55 +00:00
Viktor Barzin	43342f860c	state(mailserver): update encrypted state	2026-04-14 19:13:46 +00:00
Viktor Barzin	c9fa482996	state(uptime-kuma): update encrypted state	2026-04-14 19:07:36 +00:00
Viktor Barzin	ff360a8807	feat: add external monitoring for all Cloudflare-proxied services Add automatic external HTTPS monitors to Uptime Kuma for ~96 services exposed via Cloudflare tunnel. A sync CronJob (every 10min) reads from a Terraform-generated ConfigMap and creates/deletes [External] monitors to match cloudflare_proxied_names. Status page groups these separately as "External Reachability" and pushes a divergence metric to Pushgateway when services are externally down but internally up. Prometheus alert ExternalAccessDivergence fires after 15min of divergence. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 19:04:45 +00:00
Viktor Barzin	3258ff6cb7	state(monitoring): update encrypted state	2026-04-14 19:04:18 +00:00
Viktor Barzin	647d73d499	state(status-page): update encrypted state	2026-04-14 19:03:30 +00:00
Viktor Barzin	74c1fb83cb	state(uptime-kuma): update encrypted state	2026-04-14 19:02:56 +00:00
Viktor Barzin	30c3450c61	docs: add user-facing issue reporting guide Adds "Reporting an Issue" section with: - Where to report (Slack, GitHub, DM) - What to include (examples of good vs bad reports) - What happens after reporting (flow diagram) - Self-service status checks (Uptime Kuma, Grafana, K8s Dashboard) [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 18:37:37 +00:00
Viktor Barzin	dfad89ef81	fix: remove br tags from mermaid diagram (GitHub compat) [ci skip]	2026-04-14 18:35:13 +00:00
Viktor Barzin	f658b18a50	docs: add incident response & post-mortem pipeline architecture Documents the full E2E automated incident response pipeline: - /post-mortem skill for generating structured post-mortems - Woodpecker pipeline triggered on post-mortem file changes - Claude Code headless agent for implementing safe TODOs - Safety guardrails, commit conventions, secrets, limitations [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 18:31:29 +00:00
Viktor Barzin	cf87a747d8	docs: update post-mortem follow-up implementation [PM-2026-04-14] [ci skip] Mark all 8 safe TODOs as Done. Add Follow-up Implementation table with commit SHAs. Flag 3 Migration TODOs as needing human review. Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>	2026-04-14 18:09:11 +00:00

2790 commits