infra

Author	SHA1	Message	Date
Viktor Barzin	25ef5176bb	state(speedtest): update encrypted state	2026-04-15 15:09:37 +00:00
Viktor Barzin	8b9c335f9c	state(technitium): update encrypted state	2026-04-15 15:09:18 +00:00
Viktor Barzin	601a83d84e	fix: CI pipeline image pull auth + shallow clone resilience [ci skip] - Add WOODPECKER_BACKEND_K8S_PULL_SECRET_NAMES to agent env so step pods can pull from private registry (registry.viktorbarzin.me:5050) - Add fallback in default.yml when HEAD~1 is unavailable (shallow clone with depth=1): fetch more history, or apply all platform stacks as safe default - Root cause: pipeline #243 failed because infra-ci:latest image couldn't be pulled (no imagePullSecrets on step pods) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 14:41:08 +00:00
Viktor Barzin	e23153cf03	chore: add pre-commit size guard and harden .gitignore - Add .githooks/pre-commit that blocks files >2MB (configurable via GIT_MAX_FILE_SIZE). Activate with: git config core.hooksPath .githooks - Expand .gitignore to block common binary/archive patterns (.tar.gz, .tgz, .iso, .img, .bin, .exe, *.dmg) - Add explicit root-level terraform.tfstate ignore rules - Remove stale redis-25.3.2.tgz helm chart (unreferenced) Prevents re-accumulation of large blobs after git history cleanup that reduced .git from 2.6GB to 128MB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 14:13:18 +00:00
Viktor Barzin	b0192d9545	chore: untrack binary build artifacts from git Remove cli/cli (12.5MB), cli/infra_cli (12MB), clipboard-upload (8.7MB) from git tracking. These are build outputs that should be generated by CI. Add patterns to .gitignore to prevent re-committing. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 11:33:43 +00:00
Viktor Barzin	36454b87d1	feat: CI/CD performance overhaul - New custom CI Docker image (ci/Dockerfile) with TF 1.5.7, TG 0.99.4, git-crypt, sops, kubectl pre-installed. Pushed to private registry. Eliminates 17 apk add calls + binary downloads per pipeline run. - Unified CI pipeline: merge default.yml + app-stacks.yml into one. Changed-stacks-only detection (git diff, with global-file fallback). Concurrency limit (xargs -P 4). Step consolidation (2 steps vs 4). Shallow clone (depth=2). Provider cache (TF_PLUGIN_CACHE_DIR). - Per-stack Vault advisory locks in scripts/tg. 30min TTL with stale lock detection. Blocks concurrent applies to same stack. - TF_PLUGIN_CACHE_DIR enabled by default in scripts/tg for local dev. - Daily drift detection pipeline (.woodpecker/drift-detection.yml). Runs terraform plan on all stacks, Slack alert on drift. - CI image build pipeline (.woodpecker/build-ci-image.yml). Expected speedup: ~5-10 min per pipeline run → ~2-4 min. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 11:22:26 +00:00
Viktor Barzin	bcad200a23	chore: add untracked stacks, scripts, and agent configs - New stacks: beads-server, hermes-agent - Terragrunt tiers.tf for infra, phpipam, status-page - Secrets symlinks for vault, phpipam, hermes-agent - Scripts: cluster_manager, image_pull, containerd pullthrough setup - Frigate config, audiblez-web app source, n8n workflows dir - Claude agent: service-upgrade, reference: upgrade-config.json - Removed: claudeception skill, excalidraw empty submodule, temp listings [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 09:33:06 +00:00
Viktor Barzin	bd41bb9230	fix: cluster healthcheck fixes + Authentik upgrade to 2026.2.2 - Authentik: upgrade 2025.10.3 → 2025.12.4 → 2026.2.2 with DB restore and stepped migration. Switch to existingSecret, PgBouncer session mode. - Mailserver: migrate email roundtrip probe from Mailgun to Brevo API - Redis: fix HAProxy tcp-check regex (rstring), faster health intervals - Nextcloud: fix Redis fallback to HAProxy service, update dependency - MeshCentral: fix TLSOffload + certUrl init container for first-run - Monitoring: remove authentik from latency alert exclusion - Diun: simplify to webhook notifier, remove git auto-update [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 06:41:56 +00:00
Viktor Barzin	d31bbc9a18	docs: update monitoring and backup docs for external monitors and per-db backups - CLAUDE.md: document external monitoring (ExternalAccessDivergence alert, external-monitor-sync CronJob) and per-database backup/restore paths - backup-dr.md: add per-db backup CronJobs to inventory table and daily timeline, update restore runbook references - monitoring.md: add External Monitor Sync component and external monitoring architecture section [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 06:37:07 +00:00
Viktor Barzin	0256ccdccc	feat: add per-database backups for PostgreSQL and MySQL Add separate CronJobs that dump each database individually: - postgresql-backup-per-db: pg_dump -Fc per DB (daily 00:15) - mysql-backup-per-db: mysqldump per DB (daily 00:45) Dumps go to /backup/per-db/<dbname>/ on the same NFS PVC. Enables single-database restore without affecting other databases. Also fixed CNPG superuser password sync and added --single-transaction --set-gtid-purged=OFF to MySQL per-db dumps. Updated restore runbooks with per-database restore procedures. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 22:39:33 +00:00
Viktor Barzin	ca1ae23f34	state(dbaas): update encrypted state	2026-04-14 22:39:24 +00:00
Viktor Barzin	99b08e8e50	state(dbaas): update encrypted state	2026-04-14 22:27:40 +00:00
Viktor Barzin	79a41ffde0	state(nextcloud): update encrypted state	2026-04-14 20:29:21 +00:00
Viktor Barzin	554c2e3b5e	state(nextcloud): update encrypted state	2026-04-14 20:28:34 +00:00
Viktor Barzin	3343cc7ba4	state(redis): update encrypted state	2026-04-14 20:28:09 +00:00
Viktor Barzin	41f2b00fa3	state(monitoring): update encrypted state	2026-04-14 20:17:40 +00:00
Viktor Barzin	34a5301123	state(monitoring): update encrypted state	2026-04-14 20:17:24 +00:00
Viktor Barzin	909c4c2744	state(authentik): update encrypted state	2026-04-14 20:16:40 +00:00
Viktor Barzin	5b443c2311	state(authentik): update encrypted state	2026-04-14 20:11:51 +00:00
Viktor Barzin	3e9231ae0d	feat: augment outage report template with debugging context - Expand service list: add Home Assistant, Actual Budget, Audiobookshelf, Linkwarden, Matrix, Paperless, Tandoor, FreshRSS, Frigate, HackMD, Excalidraw, Wealthfolio, Send, Stirling PDF - Add structured debugging fields: error type, scope (just me vs others), when it started, URL accessed - Fix user report parser to extract all form fields into status.json - Show error type, scope, and start time in status page report cards [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 20:03:44 +00:00
Viktor Barzin	c00a908610	state(status-page): update encrypted state	2026-04-14 20:03:29 +00:00
Viktor Barzin	460c68e015	feat: add incident management system with user reporting - Status page (status.viktorbarzin.me): incident cards with SEV badges, expandable timelines, postmortem links, user report rendering - Issue templates on infra repo for user outage reports - CronJob reads incidents + user-reports from ViktorBarzin/infra - "Report an Outage" button on status page links to infra repo - Post-mortem agents restored (4-stage pipeline: triage → investigation → historian → report writer) with updated paths and issue linking - Post-mortem skill/template updated to link reports to GitHub Issues and manage postmortem-required/postmortem-done labels - Labels: incident, sev1-3, user-report, postmortem-required, postmortem-done on infra repo [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 20:00:31 +00:00
Viktor Barzin	24a23709a5	fix: update healthcheck to report internal and external monitors separately - Increase Uptime Kuma API timeout to 120s with wait_events=0.2 - Remove hardcoded password, use Vault or UPTIME_KUMA_PASSWORD env var - Report internal and external monitor status separately - Install uptime-kuma-api in local venv [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 19:44:20 +00:00
Viktor Barzin	26cc6fdf2f	state(authentik): update encrypted state	2026-04-14 19:43:04 +00:00
Viktor Barzin	d05e13e3cc	state(status-page): update encrypted state	2026-04-14 19:36:45 +00:00
Viktor Barzin	c69eba9b46	fix: increase Uptime Kuma API timeout and fix status code format - Increase socket timeout from 30s to 120s (121+ monitors need time to sync) - Add wait_events=0.2 for reliable login - Fix accepted_statuscodes format: use 100-increment ranges not arbitrary [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 19:28:18 +00:00
Viktor Barzin	3d189ce4d7	state(uptime-kuma): update encrypted state	2026-04-14 19:21:19 +00:00
Viktor Barzin	3a6fa6d34f	state(uptime-kuma): update encrypted state	2026-04-14 19:18:55 +00:00
Viktor Barzin	43342f860c	state(mailserver): update encrypted state	2026-04-14 19:13:46 +00:00
Viktor Barzin	c9fa482996	state(uptime-kuma): update encrypted state	2026-04-14 19:07:36 +00:00
Viktor Barzin	ff360a8807	feat: add external monitoring for all Cloudflare-proxied services Add automatic external HTTPS monitors to Uptime Kuma for ~96 services exposed via Cloudflare tunnel. A sync CronJob (every 10min) reads from a Terraform-generated ConfigMap and creates/deletes [External] monitors to match cloudflare_proxied_names. Status page groups these separately as "External Reachability" and pushes a divergence metric to Pushgateway when services are externally down but internally up. Prometheus alert ExternalAccessDivergence fires after 15min of divergence. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 19:04:45 +00:00
Viktor Barzin	3258ff6cb7	state(monitoring): update encrypted state	2026-04-14 19:04:18 +00:00
Viktor Barzin	647d73d499	state(status-page): update encrypted state	2026-04-14 19:03:30 +00:00
Viktor Barzin	74c1fb83cb	state(uptime-kuma): update encrypted state	2026-04-14 19:02:56 +00:00
Viktor Barzin	30c3450c61	docs: add user-facing issue reporting guide Adds "Reporting an Issue" section with: - Where to report (Slack, GitHub, DM) - What to include (examples of good vs bad reports) - What happens after reporting (flow diagram) - Self-service status checks (Uptime Kuma, Grafana, K8s Dashboard) [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 18:37:37 +00:00
Viktor Barzin	dfad89ef81	fix: remove br tags from mermaid diagram (GitHub compat) [ci skip]	2026-04-14 18:35:13 +00:00
Viktor Barzin	f658b18a50	docs: add incident response & post-mortem pipeline architecture Documents the full E2E automated incident response pipeline: - /post-mortem skill for generating structured post-mortems - Woodpecker pipeline triggered on post-mortem file changes - Claude Code headless agent for implementing safe TODOs - Safety guardrails, commit conventions, secrets, limitations [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 18:31:29 +00:00
Viktor Barzin	cf87a747d8	docs: update post-mortem follow-up implementation [PM-2026-04-14] [ci skip] Mark all 8 safe TODOs as Done. Add Follow-up Implementation table with commit SHAs. Flag 3 Migration TODOs as needing human review. Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>	2026-04-14 18:09:11 +00:00
Viktor Barzin	4498f61402	fix(post-mortem): add /etc/exports to git, NFS health check in daily-backup, document CSI requirements [PM-2026-04-14] - scripts/pve-nfs-exports: git-managed copy of PVE host /etc/exports with detailed comments explaining fsid=0 danger and NFSv3 disable rationale. Deploy: scp scripts/pve-nfs-exports root@192.168.1.127:/etc/exports && ssh root@192.168.1.127 exportfs -ra - scripts/daily-backup.sh: add check_nfs_exports() that runs before backup starts. Detects: missing /etc/exports, dangerous fsid=0 on /srv/nfs, nfs-server not running, no active exports. Warns but doesn't abort (block-storage PVC backups can still run). - .claude/CLAUDE.md: document NFS CSI mount option requirements — nfsvers=4 mandatory, fsid=0 forbidden, /etc/exports is git-managed, critical services must use proxmox-lvm-encrypted. Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>	2026-04-14 18:08:24 +00:00
Viktor Barzin	ca2680c189	fix(post-mortem): add NFSHighRPCRetransmissions alert + migrate alertmanager to proxmox-lvm-encrypted [PM-2026-04-14] - Add PrometheusRule: NFSHighRPCRetransmissions fires when node_nfs_rpc_retransmissions_total rate exceeds 5/s for 5m — catches NFS server degradation before pod failures cascade - Migrate alertmanager PV from NFS (192.168.1.127:/srv/nfs/alertmanager) to proxmox-lvm-encrypted eliminating the circular dependency where alertmanager couldn't alert about NFS failures - Set force_update=true on prometheus helm_release to handle StatefulSet volumeClaimTemplate changes Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>	2026-04-14 18:05:33 +00:00
Viktor Barzin	0901dd5f61	state(monitoring): update encrypted state	2026-04-14 17:52:13 +00:00
Viktor Barzin	b1b408ff0e	fix: use full path to claude CLI for non-interactive SSH	2026-04-14 17:44:50 +00:00
Viktor Barzin	7674cf8c5c	docs: final E2E pipeline test	2026-04-14 17:43:38 +00:00
Viktor Barzin	f2e7367401	fix: use sh instead of bash in pipeline (Alpine compat)	2026-04-14 17:29:14 +00:00
Viktor Barzin	91b97709b7	docs: trigger postmortem pipeline with TODO	2026-04-14 17:27:45 +00:00
Viktor Barzin	c742fa3dfb	fix: scan all post-mortems for TODOs (no git diff needed)	2026-04-14 17:14:22 +00:00
Viktor Barzin	f336e5ed53	docs: E2E test postmortem pipeline with deep clone	2026-04-14 17:12:46 +00:00
Viktor Barzin	0b2f5a4729	fix: use depth 5 clone for postmortem pipeline (need HEAD~1)	2026-04-14 17:12:41 +00:00
Viktor Barzin	59367cc588	fix: handle Woodpecker shallow clone in postmortem pipeline	2026-04-14 17:12:02 +00:00
Viktor Barzin	60c04e51b7		2026-04-14 17:10:45 +00:00
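
Commits 36454b87d1 and 601a83d84e describe changed-stacks-only detection with two fallbacks: apply everything when a global file changes, and apply everything when the diff itself is unavailable (shallow clone without HEAD~1). The following is a minimal Python sketch of that selection logic, not the repository's actual pipeline script. The `stacks/<name>/` layout, the `global_prefixes` values, and the function name are hypothetical and chosen only for illustration.

```python
def stacks_to_apply(changed_paths, all_stacks,
                    global_prefixes=("terragrunt.hcl", "modules/")):
    """Pick the stacks to apply from a list of changed file paths.

    changed_paths is None when the diff could not be computed (for example
    a shallow clone where HEAD~1 is missing); in that case every stack is
    returned as the safe default.
    The stacks/<name>/ layout and global_prefixes are assumptions.
    """
    if changed_paths is None:
        return sorted(all_stacks)

    # A change to a shared (global) file can affect any stack.
    if any(path.startswith(global_prefixes) for path in changed_paths):
        return sorted(all_stacks)

    touched = set()
    for path in changed_paths:
        parts = path.split("/")
        if len(parts) >= 2 and parts[0] == "stacks" and parts[1] in all_stacks:
            touched.add(parts[1])
    return sorted(touched)
```

In a real pipeline the path list would presumably come from `git diff --name-only` between the previous and current commit, with `None` passed when that command fails.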


2578 commits
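
Commit e23153cf03 adds a pre-commit guard that blocks files larger than 2MB, with the limit configurable via GIT_MAX_FILE_SIZE. The check itself can be sketched in Python as below. This is an illustration under stated assumptions, not the contents of `.githooks/pre-commit`: it assumes the variable holds a byte count and that 2MB means 2 × 1024 × 1024 bytes, and the function names are hypothetical.

```python
import os

# Assumption: "2MB" is interpreted as 2 * 1024 * 1024 bytes.
DEFAULT_MAX_BYTES = 2 * 1024 * 1024


def max_file_size(env=None):
    """Return the size limit in bytes.

    Assumption: GIT_MAX_FILE_SIZE, when set, is a plain byte count.
    Anything else falls back to the default.
    """
    env = os.environ if env is None else env
    raw = env.get("GIT_MAX_FILE_SIZE", "")
    return int(raw) if raw.isdigit() else DEFAULT_MAX_BYTES


def oversized(staged_sizes, limit):
    """Return the paths whose size exceeds the limit, sorted for stable output."""
    return sorted(path for path, size in staged_sizes.items() if size > limit)
```

A hook built on this would collect the staged paths (for example from `git diff --cached --name-only --diff-filter=ACM`), look up each size with `os.path.getsize`, and exit non-zero if `oversized` returns anything.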