Commit graph

2537 commits

Author SHA1 Message Date
Viktor Barzin
b1b408ff0e fix: use full path to claude CLI for non-interactive SSH 2026-04-14 17:44:50 +00:00
Viktor Barzin
7674cf8c5c docs: final E2E pipeline test 2026-04-14 17:43:38 +00:00
Viktor Barzin
f2e7367401 fix: use sh instead of bash in pipeline (Alpine compat) 2026-04-14 17:29:14 +00:00
Viktor Barzin
91b97709b7 docs: trigger postmortem pipeline with TODO 2026-04-14 17:27:45 +00:00
Viktor Barzin
c742fa3dfb fix: scan all post-mortems for TODOs (no git diff needed) 2026-04-14 17:14:22 +00:00
Viktor Barzin
f336e5ed53 docs: E2E test postmortem pipeline with deep clone 2026-04-14 17:12:46 +00:00
Viktor Barzin
0b2f5a4729 fix: use depth 5 clone for postmortem pipeline (need HEAD~1) 2026-04-14 17:12:41 +00:00
Viktor Barzin
59367cc588 fix: handle Woodpecker shallow clone in postmortem pipeline 2026-04-14 17:12:02 +00:00
Viktor Barzin
60c04e51b7 2026-04-14 17:10:45 +00:00
Viktor Barzin
933c562aa9 docs: trigger postmortem pipeline E2E test 2026-04-14 16:49:07 +00:00
Viktor Barzin
ce7a4e6e76 fix: Woodpecker v3 secrets→environment migration 2026-04-14 16:47:17 +00:00
Viktor Barzin
8540f48a28 fix: move pipeline logic to shell script (avoid YAML quoting issues) 2026-04-14 16:46:42 +00:00
Viktor Barzin
df95f52d08 docs: test postmortem with TODO for pipeline E2E 2026-04-14 16:45:44 +00:00
Viktor Barzin
7f5115f9fe fix: Woodpecker pipeline YAML quoting + trigger test [ci skip] 2026-04-14 16:45:27 +00:00
Viktor Barzin
b3cc5fcc32 test: trigger postmortem pipeline webhook 2026-04-14 16:44:11 +00:00
Viktor Barzin
777450cb19 docs: test post-mortem for pipeline E2E validation 2026-04-14 15:55:32 +00:00
Viktor Barzin
8ad674e7b1 fix: postmortem pipeline uses Vault for SSH key (not Woodpecker secrets)
Pipeline authenticates to Vault via K8s SA JWT, fetches devvm_ssh_key
from secret/ci/infra, SSHes to DevVM to run Claude Code headlessly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 15:55:12 +00:00
Viktor Barzin
a703c6e84f docs: update post-mortem follow-up implementation [PM-2026-04-14] [ci skip]
Added Uptime Kuma TCP monitor for PVE NFS (192.168.1.127:2049), ID 328,
Tier 1 (30s/3 retries). Investigation TODO flagged for human review.

Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
2026-04-14 15:48:11 +00:00
Viktor Barzin
8badb8181a feat: post-mortem automation pipeline
E2E workflow for incident post-mortems:
1. /post-mortem skill generates structured post-mortem markdown
2. Woodpecker pipeline triggers on docs/post-mortems/*.md changes
3. parse-postmortem-todos.sh extracts safe TODOs (Alert/Config/Monitor)
4. postmortem-todo-resolver agent implements TODOs headlessly
5. Agent updates post-mortem with Follow-up Implementation table

Components:
- .claude/skills/post-mortem/ — writer skill + template
- .claude/agents/postmortem-todo-resolver.md — headless agent
- .woodpecker/postmortem-todos.yml — CI pipeline
- scripts/parse-postmortem-todos.sh — TODO extractor
- cluster-health skill — auto-suggest post-mortem after recovery

Safety: only auto-implements Alert/Config/Monitor types.
Architecture/Migration/Investigation items are skipped.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 15:34:42 +00:00
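
Note on 8badb8181a: the safety rule in this commit (auto-implement only Alert/Config/Monitor items, skip the rest) can be illustrated with a minimal sketch. It assumes a hypothetical TODO line format of `- [ ] [Type] description`; the format actually parsed by scripts/parse-postmortem-todos.sh is not shown in this log, and the names `SAFE_TYPES`, `TODO_RE` and `extract_safe_todos` are illustrative only.

```python
import re

# Types the commit message names as safe to auto-implement.
SAFE_TYPES = {"Alert", "Config", "Monitor"}

# Assumed (hypothetical) line format: "- [ ] [Type] description".
# Checked-off items ("- [x] ...") deliberately do not match.
TODO_RE = re.compile(r"^\s*-\s*\[ \]\s*\[(?P<type>[A-Za-z]+)\]\s*(?P<text>.+)$")


def extract_safe_todos(markdown: str) -> list[tuple[str, str]]:
    """Return (type, text) pairs for open TODOs whose type is in SAFE_TYPES."""
    safe = []
    for line in markdown.splitlines():
        match = TODO_RE.match(line)
        if match and match.group("type") in SAFE_TYPES:
            safe.append((match.group("type"), match.group("text").strip()))
    return safe
```

Using an allowlist rather than a blocklist means a TODO with an unrecognized type is skipped by default, which matches the intent of leaving Architecture/Migration/Investigation items for human review.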
Viktor Barzin
e832581caf docs: update Apr 14 post-mortem with Phase 2 findings
Key additions:
- NFSv3 broke after NFS restart (kernel lockd bug on PVE 6.14)
- All 52 PVs migrated to NFSv4, NFSv3 disabled on PVE
- DNS zone sync gap: secondary/tertiary had no custom zones
- Converted one-time setup Job to recurring zone-sync CronJob
- MySQL, Redis, Vault collateral damage and fixes
- 3 new lessons learned (zone replication, NFS client state, operator rollout)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 12:26:11 +00:00
Viktor Barzin
803cb5fd26 fix: convert Technitium zone sync from one-time Job to CronJob
Secondary/tertiary DNS instances had no custom zones — only the
primary had viktorbarzin.lan and viktorbarzin.me. The old setup Job
ran once at deployment and never synced new zones.

New CronJob runs every 30 minutes:
- Gets all zones from primary
- Enables zone transfer on primary
- Creates missing zones as Secondary type on replicas
- Resyncs existing zones via AXFR

Fixes .lan resolution failures (2/3 queries returned NXDOMAIN).

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 12:18:19 +00:00
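
Note on 803cb5fd26: the per-replica decision described in this commit (create missing zones as Secondary, resync the ones that already exist) can be sketched as a pure function. This is an assumption-laden illustration, not the CronJob's code: `plan_zone_sync` is a hypothetical name, zones are plain strings, and the Technitium API calls that would list zones, enable zone transfer, create zones, and trigger AXFR are intentionally left out.

```python
def plan_zone_sync(primary_zones, replica_zones):
    """Split the primary's zones into (to_create, to_resync) for one replica.

    to_create: zones on the primary that the replica lacks; per the commit
               message these would be created as Secondary type.
    to_resync: zones the replica already has; these would be resynced via AXFR.
    Zones that exist only on the replica are left untouched.
    """
    existing = set(replica_zones)
    to_create = [zone for zone in primary_zones if zone not in existing]
    to_resync = [zone for zone in primary_zones if zone in existing]
    return to_create, to_resync
```

For a replica with no custom zones, every zone on the primary lands in `to_create`, which is the state the commit describes for the secondary and tertiary instances before the fix.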
Viktor Barzin
c0a33b5157 state(technitium): update encrypted state 2026-04-14 12:17:29 +00:00
Viktor Barzin
5ff26dd8ef state(technitium): update encrypted state 2026-04-14 12:13:27 +00:00
Viktor Barzin
30cdeefb1c chore: sync terraform state after nfsvers=4 convergence
Applied all 20 NFS stacks to converge PV mount_options (nfsvers=4).
State files encrypted and committed.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 11:20:18 +00:00
Viktor Barzin
bb2731256b state(immich): update encrypted state 2026-04-14 11:19:06 +00:00
Viktor Barzin
99e2bc1bef state(immich): update encrypted state 2026-04-14 11:19:00 +00:00
Viktor Barzin
ac6ec06afe state(ollama): update encrypted state 2026-04-14 11:18:57 +00:00
Viktor Barzin
7a60108e97 state(ollama): update encrypted state 2026-04-14 11:18:20 +00:00
Viktor Barzin
a39b90bbcc state(ollama): update encrypted state 2026-04-14 11:18:10 +00:00
Viktor Barzin
a25739a572 state(poison-fountain): update encrypted state 2026-04-14 11:13:32 +00:00
Viktor Barzin
d9ddf102ec state(plotting-book): update encrypted state 2026-04-14 11:13:02 +00:00
Viktor Barzin
6d209fffad state(meshcentral): update encrypted state 2026-04-14 11:11:59 +00:00
Viktor Barzin
d0805ed2a8 state(infra-maintenance): update encrypted state 2026-04-14 11:11:09 +00:00
Viktor Barzin
28264e69c6 state(headscale): update encrypted state 2026-04-14 11:11:05 +00:00
Viktor Barzin
1738c3437c state(frigate): update encrypted state 2026-04-14 11:09:30 +00:00
Viktor Barzin
fe42993446 state(ebook2audiobook): update encrypted state 2026-04-14 11:08:37 +00:00
Viktor Barzin
23140cf780 state(real-estate-crawler): update encrypted state 2026-04-14 11:08:24 +00:00
Viktor Barzin
d24e4aac0b state(osm_routing): update encrypted state 2026-04-14 11:08:09 +00:00
Viktor Barzin
94b7097789 state(openclaw): update encrypted state 2026-04-14 11:08:05 +00:00
Viktor Barzin
25f4682dc0 state(nextcloud): update encrypted state 2026-04-14 11:06:41 +00:00
Viktor Barzin
aac81e0a1f state(vault): update encrypted state 2026-04-14 11:06:27 +00:00
Viktor Barzin
047f695129 state(ytdlp): update encrypted state 2026-04-14 11:06:11 +00:00
Viktor Barzin
20e86e96a3 state(servarr): update encrypted state 2026-04-14 11:05:54 +00:00
Viktor Barzin
0d6b6cbd95 state(navidrome): update encrypted state 2026-04-14 11:05:10 +00:00
Viktor Barzin
9ea3b33a55 state(ebooks): update encrypted state 2026-04-14 10:54:47 +00:00
Viktor Barzin
ea18116da9 fix: NFS outage recovery — migrate to NFSv4, add alerting
NFS server restart broke NFSv3 (lockd kernel bug on PVE 6.14).
All 52 NFS PVs patched to nfsvers=4, NFSv3 disabled on PVE.

Changes:
- nfs_volume module: add nfsvers=4 mount option
- nfs-csi StorageClass: add nfsvers=4 mount option
- dbaas: MySQL serverInstances 3→1, mysql-native-password=ON
- monitoring: add NFSCSINodeDown and NFSMountFailures alerts

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 10:28:27 +00:00
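
Note on ea18116da9: the effect of adding `nfsvers=4` to a volume's mount options can be sketched as a small, idempotent helper. This is only an illustration under stated assumptions: `with_nfsv4` is a hypothetical name, mount options are modeled as a list of strings, and the actual change lives in the nfs_volume module and the nfs-csi StorageClass, whose contents are not shown in this log.

```python
def with_nfsv4(mount_options):
    """Return mount options pinned to NFSv4.

    Any existing version option is dropped first ("vers=" is an alias
    for "nfsvers=" in Linux NFS mount options), then "nfsvers=4" is appended.
    """
    kept = [
        option
        for option in mount_options
        if not option.startswith(("nfsvers=", "vers="))
    ]
    return kept + ["nfsvers=4"]
```

Dropping a previous version option before appending avoids leaving two conflicting values in the list, and applying the helper twice yields the same result as applying it once.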
Viktor Barzin
92900b5e08 state(dbaas): update encrypted state 2026-04-14 10:27:04 +00:00
Viktor Barzin
b4b6fd5946 state(nfs-csi): update encrypted state 2026-04-14 09:32:41 +00:00
Viktor Barzin
30e5150ecd state(status-page): update encrypted state 2026-04-14 09:31:50 +00:00
Viktor Barzin
ac3a6a96dd state(hermes-agent): update encrypted state 2026-04-14 09:04:35 +00:00