infra

Author	SHA1	Message	Date
Viktor Barzin	ca1ae23f34	state(dbaas): update encrypted state	2026-04-14 22:39:24 +00:00
Viktor Barzin	99b08e8e50	state(dbaas): update encrypted state	2026-04-14 22:27:40 +00:00
Viktor Barzin	79a41ffde0	state(nextcloud): update encrypted state	2026-04-14 20:29:21 +00:00
Viktor Barzin	554c2e3b5e	state(nextcloud): update encrypted state	2026-04-14 20:28:34 +00:00
Viktor Barzin	3343cc7ba4	state(redis): update encrypted state	2026-04-14 20:28:09 +00:00
Viktor Barzin	41f2b00fa3	state(monitoring): update encrypted state	2026-04-14 20:17:40 +00:00
Viktor Barzin	34a5301123	state(monitoring): update encrypted state	2026-04-14 20:17:24 +00:00
Viktor Barzin	909c4c2744	state(authentik): update encrypted state	2026-04-14 20:16:40 +00:00
Viktor Barzin	5b443c2311	state(authentik): update encrypted state	2026-04-14 20:11:51 +00:00
Viktor Barzin	3e9231ae0d	feat: augment outage report template with debugging context - Expand service list: add Home Assistant, Actual Budget, Audiobookshelf, Linkwarden, Matrix, Paperless, Tandoor, FreshRSS, Frigate, HackMD, Excalidraw, Wealthfolio, Send, Stirling PDF - Add structured debugging fields: error type, scope (just me vs others), when it started, URL accessed - Fix user report parser to extract all form fields into status.json - Show error type, scope, and start time in status page report cards [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 20:03:44 +00:00
Viktor Barzin	c00a908610	state(status-page): update encrypted state	2026-04-14 20:03:29 +00:00
Viktor Barzin	460c68e015	feat: add incident management system with user reporting - Status page (status.viktorbarzin.me): incident cards with SEV badges, expandable timelines, postmortem links, user report rendering - Issue templates on infra repo for user outage reports - CronJob reads incidents + user-reports from ViktorBarzin/infra - "Report an Outage" button on status page links to infra repo - Post-mortem agents restored (4-stage pipeline: triage → investigation → historian → report writer) with updated paths and issue linking - Post-mortem skill/template updated to link reports to GitHub Issues and manage postmortem-required/postmortem-done labels - Labels: incident, sev1-3, user-report, postmortem-required, postmortem-done on infra repo [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 20:00:31 +00:00
Viktor Barzin	24a23709a5	fix: update healthcheck to report internal and external monitors separately - Increase Uptime Kuma API timeout to 120s with wait_events=0.2 - Remove hardcoded password, use Vault or UPTIME_KUMA_PASSWORD env var - Report internal and external monitor status separately - Install uptime-kuma-api in local venv [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 19:44:20 +00:00
Viktor Barzin	26cc6fdf2f	state(authentik): update encrypted state	2026-04-14 19:43:04 +00:00
Viktor Barzin	d05e13e3cc	state(status-page): update encrypted state	2026-04-14 19:36:45 +00:00
Viktor Barzin	c69eba9b46	fix: increase Uptime Kuma API timeout and fix status code format - Increase socket timeout from 30s to 120s (121+ monitors need time to sync) - Add wait_events=0.2 for reliable login - Fix accepted_statuscodes format: use 100-increment ranges not arbitrary [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 19:28:18 +00:00
Viktor Barzin	3d189ce4d7	state(uptime-kuma): update encrypted state	2026-04-14 19:21:19 +00:00
Viktor Barzin	3a6fa6d34f	state(uptime-kuma): update encrypted state	2026-04-14 19:18:55 +00:00
Viktor Barzin	43342f860c	state(mailserver): update encrypted state	2026-04-14 19:13:46 +00:00
Viktor Barzin	c9fa482996	state(uptime-kuma): update encrypted state	2026-04-14 19:07:36 +00:00
Viktor Barzin	ff360a8807	feat: add external monitoring for all Cloudflare-proxied services Add automatic external HTTPS monitors to Uptime Kuma for ~96 services exposed via Cloudflare tunnel. A sync CronJob (every 10min) reads from a Terraform-generated ConfigMap and creates/deletes [External] monitors to match cloudflare_proxied_names. Status page groups these separately as "External Reachability" and pushes a divergence metric to Pushgateway when services are externally down but internally up. Prometheus alert ExternalAccessDivergence fires after 15min of divergence. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 19:04:45 +00:00
Viktor Barzin	3258ff6cb7	state(monitoring): update encrypted state	2026-04-14 19:04:18 +00:00
Viktor Barzin	647d73d499	state(status-page): update encrypted state	2026-04-14 19:03:30 +00:00
Viktor Barzin	74c1fb83cb	state(uptime-kuma): update encrypted state	2026-04-14 19:02:56 +00:00
Viktor Barzin	30c3450c61	docs: add user-facing issue reporting guide Adds "Reporting an Issue" section with: - Where to report (Slack, GitHub, DM) - What to include (examples of good vs bad reports) - What happens after reporting (flow diagram) - Self-service status checks (Uptime Kuma, Grafana, K8s Dashboard) [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 18:37:37 +00:00
Viktor Barzin	dfad89ef81	fix: remove br tags from mermaid diagram (GitHub compat) [ci skip]	2026-04-14 18:35:13 +00:00
Viktor Barzin	f658b18a50	docs: add incident response & post-mortem pipeline architecture Documents the full E2E automated incident response pipeline: - /post-mortem skill for generating structured post-mortems - Woodpecker pipeline triggered on post-mortem file changes - Claude Code headless agent for implementing safe TODOs - Safety guardrails, commit conventions, secrets, limitations [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 18:31:29 +00:00
Viktor Barzin	cf87a747d8	docs: update post-mortem follow-up implementation [PM-2026-04-14] [ci skip] Mark all 8 safe TODOs as Done. Add Follow-up Implementation table with commit SHAs. Flag 3 Migration TODOs as needing human review. Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>	2026-04-14 18:09:11 +00:00
Viktor Barzin	4498f61402	fix(post-mortem): add /etc/exports to git, NFS health check in daily-backup, document CSI requirements [PM-2026-04-14] - scripts/pve-nfs-exports: git-managed copy of PVE host /etc/exports with detailed comments explaining fsid=0 danger and NFSv3 disable rationale. Deploy: scp scripts/pve-nfs-exports root@192.168.1.127:/etc/exports && ssh root@192.168.1.127 exportfs -ra - scripts/daily-backup.sh: add check_nfs_exports() that runs before backup starts. Detects: missing /etc/exports, dangerous fsid=0 on /srv/nfs, nfs-server not running, no active exports. Warns but doesn't abort (block-storage PVC backups can still run). - .claude/CLAUDE.md: document NFS CSI mount option requirements — nfsvers=4 mandatory, fsid=0 forbidden, /etc/exports is git-managed, critical services must use proxmox-lvm-encrypted. Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>	2026-04-14 18:08:24 +00:00
Viktor Barzin	ca2680c189	fix(post-mortem): add NFSHighRPCRetransmissions alert + migrate alertmanager to proxmox-lvm-encrypted [PM-2026-04-14] - Add PrometheusRule: NFSHighRPCRetransmissions fires when node_nfs_rpc_retransmissions_total rate exceeds 5/s for 5m — catches NFS server degradation before pod failures cascade - Migrate alertmanager PV from NFS (192.168.1.127:/srv/nfs/alertmanager) to proxmox-lvm-encrypted eliminating the circular dependency where alertmanager couldn't alert about NFS failures - Set force_update=true on prometheus helm_release to handle StatefulSet volumeClaimTemplate changes Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>	2026-04-14 18:05:33 +00:00
Viktor Barzin	0901dd5f61	state(monitoring): update encrypted state	2026-04-14 17:52:13 +00:00
Viktor Barzin	b1b408ff0e	fix: use full path to claude CLI for non-interactive SSH	2026-04-14 17:44:50 +00:00
Viktor Barzin	7674cf8c5c	docs: final E2E pipeline test	2026-04-14 17:43:38 +00:00
Viktor Barzin	f2e7367401	fix: use sh instead of bash in pipeline (Alpine compat)	2026-04-14 17:29:14 +00:00
Viktor Barzin	91b97709b7	docs: trigger postmortem pipeline with TODO	2026-04-14 17:27:45 +00:00
Viktor Barzin	c742fa3dfb	fix: scan all post-mortems for TODOs (no git diff needed)	2026-04-14 17:14:22 +00:00
Viktor Barzin	f336e5ed53	docs: E2E test postmortem pipeline with deep clone	2026-04-14 17:12:46 +00:00
Viktor Barzin	0b2f5a4729	fix: use depth 5 clone for postmortem pipeline (need HEAD~1)	2026-04-14 17:12:41 +00:00
Viktor Barzin	59367cc588	fix: handle Woodpecker shallow clone in postmortem pipeline	2026-04-14 17:12:02 +00:00
Viktor Barzin	60c04e51b7		2026-04-14 17:10:45 +00:00
Viktor Barzin	933c562aa9	docs: trigger postmortem pipeline E2E test	2026-04-14 16:49:07 +00:00
Viktor Barzin	ce7a4e6e76	fix: Woodpecker v3 secrets→environment migration	2026-04-14 16:47:17 +00:00
Viktor Barzin	8540f48a28	fix: move pipeline logic to shell script (avoid YAML quoting issues)	2026-04-14 16:46:42 +00:00
Viktor Barzin	df95f52d08	docs: test postmortem with TODO for pipeline E2E	2026-04-14 16:45:44 +00:00
Viktor Barzin	7f5115f9fe	fix: Woodpecker pipeline YAML quoting + trigger test [ci skip]	2026-04-14 16:45:27 +00:00
Viktor Barzin	b3cc5fcc32	test: trigger postmortem pipeline webhook	2026-04-14 16:44:11 +00:00
Viktor Barzin	777450cb19	docs: test post-mortem for pipeline E2E validation	2026-04-14 15:55:32 +00:00
Viktor Barzin	8ad674e7b1	fix: postmortem pipeline uses Vault for SSH key (not Woodpecker secrets) Pipeline authenticates to Vault via K8s SA JWT, fetches devvm_ssh_key from secret/ci/infra, SSHes to DevVM to run Claude Code headlessly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 15:55:12 +00:00
Viktor Barzin	a703c6e84f	docs: update post-mortem follow-up implementation [PM-2026-04-14] [ci skip] Added Uptime Kuma TCP monitor for PVE NFS (192.168.1.127:2049), ID 328, Tier 1 (30s/3 retries). Investigation TODO flagged for human review. Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>	2026-04-14 15:48:11 +00:00
Viktor Barzin	8badb8181a	feat: post-mortem automation pipeline E2E workflow for incident post-mortems: 1. /post-mortem skill generates structured post-mortem markdown 2. Woodpecker pipeline triggers on docs/post-mortems/*.md changes 3. parse-postmortem-todos.sh extracts safe TODOs (Alert/Config/Monitor) 4. postmortem-todo-resolver agent implements TODOs headlessly 5. Agent updates post-mortem with Follow-up Implementation table Components: - .claude/skills/post-mortem/ — writer skill + template - .claude/agents/postmortem-todo-resolver.md — headless agent - .woodpecker/postmortem-todos.yml — CI pipeline - scripts/parse-postmortem-todos.sh — TODO extractor - cluster-health skill — auto-suggest post-mortem after recovery Safety: only auto-implements Alert/Config/Monitor types. Architecture/Migration/Investigation items are skipped. [ci skip] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 15:34:42 +00:00
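
Note on 8badb8181a: the safety rule in that commit (auto-implement only Alert/Config/Monitor items, skip Architecture/Migration/Investigation) can be pictured as a simple allow-list filter. The following is a minimal sketch under stated assumptions, not the logic of scripts/parse-postmortem-todos.sh. The `Todo` record, `SAFE_TYPES`, and `split_todos` are hypothetical names introduced for illustration, and the real TODO format in the post-mortem files is not shown in this log.

```python
# Hypothetical sketch: the Todo record shape is an assumption, not the format
# parsed by scripts/parse-postmortem-todos.sh.
from dataclasses import dataclass

# Types the commit message names as safe to auto-implement.
SAFE_TYPES = {"alert", "config", "monitor"}


@dataclass(frozen=True)
class Todo:
    kind: str         # e.g. "Alert", "Migration"
    description: str  # free-text action item


def split_todos(todos):
    """Return (safe, needs_review), preserving the input order in each list."""
    safe, needs_review = [], []
    for todo in todos:
        # Anything outside the allow-list is left for a human, including
        # unknown or misspelled types.
        is_safe = todo.kind.strip().lower() in SAFE_TYPES
        (safe if is_safe else needs_review).append(todo)
    return safe, needs_review
```

Using an allow-list rather than a deny-list means a new or mistyped TODO type defaults to human review, which matches the conservative intent described in the commit.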


2568 commits
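
Note on ff360a8807: that commit describes a sync job that creates and deletes [External] monitors to match a desired list, plus a divergence signal for services that are externally down but internally up. The sketch below illustrates both ideas as pure set logic. It makes no Uptime Kuma, Pushgateway, or Terraform calls, and every name in it (`EXTERNAL_PREFIX`, `plan_sync`, `diverged`) is hypothetical. In particular, the assumption that external monitors are identified by a "[External] " name prefix is inferred from the commit message and may not match the actual implementation.

```python
# Hypothetical sketch: monitor naming and data shapes are assumptions.
EXTERNAL_PREFIX = "[External] "


def plan_sync(desired_names, existing_monitor_names):
    """Return (to_create, to_delete) monitor names, each sorted.

    Only monitors carrying the external prefix are considered for deletion,
    so internal monitors are never touched by the plan.
    """
    desired = {EXTERNAL_PREFIX + name for name in desired_names}
    existing = {
        name for name in existing_monitor_names
        if name.startswith(EXTERNAL_PREFIX)
    }
    return sorted(desired - existing), sorted(existing - desired)


def diverged(internal_up, external_up):
    """Return services that are up internally but explicitly down externally.

    Services with no external status recorded are not reported.
    """
    return sorted(
        service
        for service, up in internal_up.items()
        if up and external_up.get(service) is False
    )
```

For example, `plan_sync(["a", "b"], ["[External] b", "[External] c", "internal-x"])` plans one creation for "a" and one deletion for "c", and leaves "internal-x" alone.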