infra

Author	SHA1	Message	Date
Viktor Barzin	c69eba9b46	fix: increase Uptime Kuma API timeout and fix status code format - Increase socket timeout from 30s to 120s (121+ monitors need time to sync) - Add wait_events=0.2 for reliable login - Fix accepted_statuscodes format: use 100-increment ranges not arbitrary [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 19:28:18 +00:00
Viktor Barzin	3d189ce4d7	state(uptime-kuma): update encrypted state	2026-04-14 19:21:19 +00:00
Viktor Barzin	3a6fa6d34f	state(uptime-kuma): update encrypted state	2026-04-14 19:18:55 +00:00
Viktor Barzin	43342f860c	state(mailserver): update encrypted state	2026-04-14 19:13:46 +00:00
Viktor Barzin	c9fa482996	state(uptime-kuma): update encrypted state	2026-04-14 19:07:36 +00:00
Viktor Barzin	ff360a8807	feat: add external monitoring for all Cloudflare-proxied services Add automatic external HTTPS monitors to Uptime Kuma for ~96 services exposed via Cloudflare tunnel. A sync CronJob (every 10min) reads from a Terraform-generated ConfigMap and creates/deletes [External] monitors to match cloudflare_proxied_names. Status page groups these separately as "External Reachability" and pushes a divergence metric to Pushgateway when services are externally down but internally up. Prometheus alert ExternalAccessDivergence fires after 15min of divergence. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 19:04:45 +00:00
Viktor Barzin	3258ff6cb7	state(monitoring): update encrypted state	2026-04-14 19:04:18 +00:00
Viktor Barzin	647d73d499	state(status-page): update encrypted state	2026-04-14 19:03:30 +00:00
Viktor Barzin	74c1fb83cb	state(uptime-kuma): update encrypted state	2026-04-14 19:02:56 +00:00
Viktor Barzin	30c3450c61	docs: add user-facing issue reporting guide Adds "Reporting an Issue" section with: - Where to report (Slack, GitHub, DM) - What to include (examples of good vs bad reports) - What happens after reporting (flow diagram) - Self-service status checks (Uptime Kuma, Grafana, K8s Dashboard) [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 18:37:37 +00:00
Viktor Barzin	dfad89ef81	fix: remove br tags from mermaid diagram (GitHub compat) [ci skip]	2026-04-14 18:35:13 +00:00
Viktor Barzin	f658b18a50	docs: add incident response & post-mortem pipeline architecture Documents the full E2E automated incident response pipeline: - /post-mortem skill for generating structured post-mortems - Woodpecker pipeline triggered on post-mortem file changes - Claude Code headless agent for implementing safe TODOs - Safety guardrails, commit conventions, secrets, limitations [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 18:31:29 +00:00
Viktor Barzin	cf87a747d8	docs: update post-mortem follow-up implementation [PM-2026-04-14] [ci skip] Mark all 8 safe TODOs as Done. Add Follow-up Implementation table with commit SHAs. Flag 3 Migration TODOs as needing human review. Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>	2026-04-14 18:09:11 +00:00
Viktor Barzin	4498f61402	fix(post-mortem): add /etc/exports to git, NFS health check in daily-backup, document CSI requirements [PM-2026-04-14] - scripts/pve-nfs-exports: git-managed copy of PVE host /etc/exports with detailed comments explaining fsid=0 danger and NFSv3 disable rationale. Deploy: scp scripts/pve-nfs-exports root@192.168.1.127:/etc/exports && ssh root@192.168.1.127 exportfs -ra - scripts/daily-backup.sh: add check_nfs_exports() that runs before backup starts. Detects: missing /etc/exports, dangerous fsid=0 on /srv/nfs, nfs-server not running, no active exports. Warns but doesn't abort (block-storage PVC backups can still run). - .claude/CLAUDE.md: document NFS CSI mount option requirements — nfsvers=4 mandatory, fsid=0 forbidden, /etc/exports is git-managed, critical services must use proxmox-lvm-encrypted. Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>	2026-04-14 18:08:24 +00:00
Viktor Barzin	ca2680c189	fix(post-mortem): add NFSHighRPCRetransmissions alert + migrate alertmanager to proxmox-lvm-encrypted [PM-2026-04-14] - Add PrometheusRule: NFSHighRPCRetransmissions fires when node_nfs_rpc_retransmissions_total rate exceeds 5/s for 5m — catches NFS server degradation before pod failures cascade - Migrate alertmanager PV from NFS (192.168.1.127:/srv/nfs/alertmanager) to proxmox-lvm-encrypted eliminating the circular dependency where alertmanager couldn't alert about NFS failures - Set force_update=true on prometheus helm_release to handle StatefulSet volumeClaimTemplate changes Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>	2026-04-14 18:05:33 +00:00
Viktor Barzin	0901dd5f61	state(monitoring): update encrypted state	2026-04-14 17:52:13 +00:00
Viktor Barzin	b1b408ff0e	fix: use full path to claude CLI for non-interactive SSH	2026-04-14 17:44:50 +00:00
Viktor Barzin	7674cf8c5c	docs: final E2E pipeline test	2026-04-14 17:43:38 +00:00
Viktor Barzin	f2e7367401	fix: use sh instead of bash in pipeline (Alpine compat)	2026-04-14 17:29:14 +00:00
Viktor Barzin	91b97709b7	docs: trigger postmortem pipeline with TODO	2026-04-14 17:27:45 +00:00
Viktor Barzin	c742fa3dfb	fix: scan all post-mortems for TODOs (no git diff needed)	2026-04-14 17:14:22 +00:00
Viktor Barzin	f336e5ed53	docs: E2E test postmortem pipeline with deep clone	2026-04-14 17:12:46 +00:00
Viktor Barzin	0b2f5a4729	fix: use depth 5 clone for postmortem pipeline (need HEAD~1)	2026-04-14 17:12:41 +00:00
Viktor Barzin	59367cc588	fix: handle Woodpecker shallow clone in postmortem pipeline	2026-04-14 17:12:02 +00:00
Viktor Barzin	60c04e51b7		2026-04-14 17:10:45 +00:00
Viktor Barzin	933c562aa9	docs: trigger postmortem pipeline E2E test	2026-04-14 16:49:07 +00:00
Viktor Barzin	ce7a4e6e76	fix: Woodpecker v3 secrets→environment migration	2026-04-14 16:47:17 +00:00
Viktor Barzin	8540f48a28	fix: move pipeline logic to shell script (avoid YAML quoting issues)	2026-04-14 16:46:42 +00:00
Viktor Barzin	df95f52d08	docs: test postmortem with TODO for pipeline E2E	2026-04-14 16:45:44 +00:00
Viktor Barzin	7f5115f9fe	fix: Woodpecker pipeline YAML quoting + trigger test [ci skip]	2026-04-14 16:45:27 +00:00
Viktor Barzin	b3cc5fcc32	test: trigger postmortem pipeline webhook	2026-04-14 16:44:11 +00:00
Viktor Barzin	777450cb19	docs: test post-mortem for pipeline E2E validation	2026-04-14 15:55:32 +00:00
Viktor Barzin	8ad674e7b1	fix: postmortem pipeline uses Vault for SSH key (not Woodpecker secrets) Pipeline authenticates to Vault via K8s SA JWT, fetches devvm_ssh_key from secret/ci/infra, SSHes to DevVM to run Claude Code headlessly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 15:55:12 +00:00
Viktor Barzin	a703c6e84f	docs: update post-mortem follow-up implementation [PM-2026-04-14] [ci skip] Added Uptime Kuma TCP monitor for PVE NFS (192.168.1.127:2049), ID 328, Tier 1 (30s/3 retries). Investigation TODO flagged for human review. Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>	2026-04-14 15:48:11 +00:00
Viktor Barzin	8badb8181a	feat: post-mortem automation pipeline E2E workflow for incident post-mortems: 1. /post-mortem skill generates structured post-mortem markdown 2. Woodpecker pipeline triggers on docs/post-mortems/*.md changes 3. parse-postmortem-todos.sh extracts safe TODOs (Alert/Config/Monitor) 4. postmortem-todo-resolver agent implements TODOs headlessly 5. Agent updates post-mortem with Follow-up Implementation table Components: - .claude/skills/post-mortem/ — writer skill + template - .claude/agents/postmortem-todo-resolver.md — headless agent - .woodpecker/postmortem-todos.yml — CI pipeline - scripts/parse-postmortem-todos.sh — TODO extractor - cluster-health skill — auto-suggest post-mortem after recovery Safety: only auto-implements Alert/Config/Monitor types. Architecture/Migration/Investigation items are skipped. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 15:34:42 +00:00
Viktor Barzin	e832581caf	docs: update Apr 14 post-mortem with Phase 2 findings Key additions: - NFSv3 broke after NFS restart (kernel lockd bug on PVE 6.14) - All 52 PVs migrated to NFSv4, NFSv3 disabled on PVE - DNS zone sync gap: secondary/tertiary had no custom zones - Converted one-time setup Job to recurring zone-sync CronJob - MySQL, Redis, Vault collateral damage and fixes - 3 new lessons learned (zone replication, NFS client state, operator rollout) [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 12:26:11 +00:00
Viktor Barzin	803cb5fd26	fix: convert Technitium zone sync from one-time Job to CronJob Secondary/tertiary DNS instances had no custom zones — only the primary had viktorbarzin.lan and viktorbarzin.me. The old setup Job ran once at deployment and never synced new zones. New CronJob runs every 30 minutes: - Gets all zones from primary - Enables zone transfer on primary - Creates missing zones as Secondary type on replicas - Resyncs existing zones via AXFR Fixes .lan resolution failures (2/3 queries returned NXDOMAIN). [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 12:18:19 +00:00
Viktor Barzin	c0a33b5157	state(technitium): update encrypted state	2026-04-14 12:17:29 +00:00
Viktor Barzin	5ff26dd8ef	state(technitium): update encrypted state	2026-04-14 12:13:27 +00:00
Viktor Barzin	30cdeefb1c	chore: sync terraform state after nfsvers=4 convergence Applied all 20 NFS stacks to converge PV mount_options (nfsvers=4). State files encrypted and committed. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 11:20:18 +00:00
Viktor Barzin	bb2731256b	state(immich): update encrypted state	2026-04-14 11:19:06 +00:00
Viktor Barzin	99e2bc1bef	state(immich): update encrypted state	2026-04-14 11:19:00 +00:00
Viktor Barzin	ac6ec06afe	state(ollama): update encrypted state	2026-04-14 11:18:57 +00:00
Viktor Barzin	7a60108e97	state(ollama): update encrypted state	2026-04-14 11:18:20 +00:00
Viktor Barzin	a39b90bbcc	state(ollama): update encrypted state	2026-04-14 11:18:10 +00:00
Viktor Barzin	a25739a572	state(poison-fountain): update encrypted state	2026-04-14 11:13:32 +00:00
Viktor Barzin	d9ddf102ec	state(plotting-book): update encrypted state	2026-04-14 11:13:02 +00:00
Viktor Barzin	6d209fffad	state(meshcentral): update encrypted state	2026-04-14 11:11:59 +00:00
Viktor Barzin	d0805ed2a8	state(infra-maintenance): update encrypted state	2026-04-14 11:11:09 +00:00
Viktor Barzin	28264e69c6	state(headscale): update encrypted state	2026-04-14 11:11:05 +00:00
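
Note on ff360a8807: the message describes a sync CronJob that creates and deletes [External] monitors so they match cloudflare_proxied_names. The planning step of that reconciliation might look roughly like the sketch below. This is a minimal illustration, not the repository's actual script: the function name, the `SyncPlan` type, and the exact "[External] " name prefix are assumptions, and no Uptime Kuma API calls are made.

```python
from dataclasses import dataclass

# Assumed naming convention: managed monitors carry this prefix (hypothetical).
EXTERNAL_PREFIX = "[External] "


@dataclass(frozen=True)
class SyncPlan:
    """Monitor names to create and to delete, both sorted for stable output."""
    to_create: list
    to_delete: list


def plan_external_monitor_sync(desired_hosts, existing_monitor_names):
    """Compare the desired host list with the monitors that already exist.

    Only monitors whose name starts with EXTERNAL_PREFIX are treated as
    managed; every other monitor is ignored and never scheduled for deletion.
    """
    desired = {EXTERNAL_PREFIX + host for host in desired_hosts}
    managed = {
        name for name in existing_monitor_names
        if name.startswith(EXTERNAL_PREFIX)
    }
    return SyncPlan(
        to_create=sorted(desired - managed),
        to_delete=sorted(managed - desired),
    )


if __name__ == "__main__":
    plan = plan_external_monitor_sync(
        ["a.example.com", "b.example.com"],
        ["[External] b.example.com", "[External] old.example.com", "internal-db"],
    )
    print(plan)
```

Computing the plan separately from applying it keeps the sync idempotent: a second run against an already-converged monitor list yields two empty lists, which suits a job that repeats every 10 minutes.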

2553 commits