2525 commits

Author SHA1 Message Date
Viktor Barzin
df95f52d08 docs: test postmortem with TODO for pipeline E2E 2026-04-14 16:45:44 +00:00
Viktor Barzin
7f5115f9fe fix: Woodpecker pipeline YAML quoting + trigger test [ci skip] 2026-04-14 16:45:27 +00:00
Viktor Barzin
b3cc5fcc32 test: trigger postmortem pipeline webhook 2026-04-14 16:44:11 +00:00
Viktor Barzin
777450cb19 docs: test post-mortem for pipeline E2E validation 2026-04-14 15:55:32 +00:00
Viktor Barzin
8ad674e7b1 fix: postmortem pipeline uses Vault for SSH key (not Woodpecker secrets)
Pipeline authenticates to Vault via K8s SA JWT, fetches devvm_ssh_key
from secret/ci/infra, SSHes to DevVM to run Claude Code headlessly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 15:55:12 +00:00
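The Vault login described in this commit can be sketched as two small helpers that build the HTTP requests without sending them. This is a minimal sketch, not the pipeline's actual code. The auth mount name `kubernetes`, the role name, and the use of a KV version 2 mount are assumptions. Only the secret path `secret/ci/infra` and the key name `devvm_ssh_key` come from the commit message.

```python
import json

# Standard location of the projected service account token inside a pod.
SA_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def build_login_request(vault_addr, role, jwt, mount="kubernetes"):
    """Return (url, body) for a Vault Kubernetes auth login POST.

    `mount` and `role` are assumptions; the commit does not name them.
    """
    url = f"{vault_addr.rstrip('/')}/v1/auth/{mount}/login"
    body = json.dumps({"role": role, "jwt": jwt}).encode()
    return url, body


def kv2_read_url(vault_addr, mount, path):
    """Return the read URL for a secret, assuming a KV version 2 mount."""
    return f"{vault_addr.rstrip('/')}/v1/{mount}/data/{path}"
```

Under these assumptions, the pipeline would POST the login body, take `auth.client_token` from the response, and then GET `kv2_read_url(addr, "secret", "ci/infra")` with that token in the `X-Vault-Token` header.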
Viktor Barzin
a703c6e84f docs: update post-mortem follow-up implementation [PM-2026-04-14] [ci skip]
Added Uptime Kuma TCP monitor for PVE NFS (192.168.1.127:2049), ID 328,
Tier 1 (30s/3 retries). Investigation TODO flagged for human review.

Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
2026-04-14 15:48:11 +00:00
Viktor Barzin
8badb8181a feat: post-mortem automation pipeline
E2E workflow for incident post-mortems:
1. /post-mortem skill generates structured post-mortem markdown
2. Woodpecker pipeline triggers on docs/post-mortems/*.md changes
3. parse-postmortem-todos.sh extracts safe TODOs (Alert/Config/Monitor)
4. postmortem-todo-resolver agent implements TODOs headlessly
5. Agent updates post-mortem with Follow-up Implementation table

Components:
- .claude/skills/post-mortem/ — writer skill + template
- .claude/agents/postmortem-todo-resolver.md — headless agent
- .woodpecker/postmortem-todos.yml — CI pipeline
- scripts/parse-postmortem-todos.sh — TODO extractor
- cluster-health skill — auto-suggest post-mortem after recovery

Safety: only auto-implements Alert/Config/Monitor types.
Architecture/Migration/Investigation items are skipped.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 15:34:42 +00:00
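The safety rule in this commit (auto-implement only Alert/Config/Monitor items, skip everything else) can be illustrated with a short Python sketch. The real extractor is the shell script `scripts/parse-postmortem-todos.sh`, whose contents are not shown here. The TODO line format below (`- [ ] [Type] description`) and the function name are hypothetical.

```python
import re

# Types the commit message lists as safe to auto-implement.
SAFE_TYPES = {"Alert", "Config", "Monitor"}

# Hypothetical line format: "- [ ] [Type] description".
# The real post-mortem template may use a different layout.
TODO_LINE = re.compile(r"^- \[ \] \[(?P<type>\w+)\]\s+(?P<text>.+)$")


def split_todos(markdown):
    """Return (safe, skipped) lists of (type, text) tuples."""
    safe, skipped = [], []
    for line in markdown.splitlines():
        match = TODO_LINE.match(line.strip())
        if match is None:
            continue  # not an open TODO line in the assumed format
        item = (match.group("type"), match.group("text"))
        # Anything not explicitly allow-listed is skipped.
        (safe if item[0] in SAFE_TYPES else skipped).append(item)
    return safe, skipped
```

Using an allow-list means an unrecognized type lands in `skipped` rather than being implemented by the agent.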
Viktor Barzin
e832581caf docs: update Apr 14 post-mortem with Phase 2 findings
Key additions:
- NFSv3 broke after NFS restart (kernel lockd bug on PVE 6.14)
- All 52 PVs migrated to NFSv4, NFSv3 disabled on PVE
- DNS zone sync gap: secondary/tertiary had no custom zones
- Converted one-time setup Job to recurring zone-sync CronJob
- MySQL, Redis, Vault collateral damage and fixes
- 3 new lessons learned (zone replication, NFS client state, operator rollout)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 12:26:11 +00:00
Viktor Barzin
803cb5fd26 fix: convert Technitium zone sync from one-time Job to CronJob
Secondary/tertiary DNS instances had no custom zones — only the
primary had viktorbarzin.lan and viktorbarzin.me. The old setup Job
ran once at deployment and never synced new zones.

New CronJob runs every 30 minutes:
- Gets all zones from primary
- Enables zone transfer on primary
- Creates missing zones as Secondary type on replicas
- Resyncs existing zones via AXFR

Fixes .lan resolution failures (2/3 queries returned NXDOMAIN).

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 12:18:19 +00:00
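The reconciliation the CronJob performs for each replica can be sketched as plain set logic. This is an illustration only: it makes no Technitium API calls, and the function name is hypothetical.

```python
def plan_zone_sync(primary_zones, replica_zones):
    """Return (to_create, to_resync) zone-name lists for one replica.

    to_create: zones on the primary that the replica lacks; these would
               be created as Secondary zones.
    to_resync: zones present on both; these would be refreshed via AXFR.
    """
    primary = set(primary_zones)
    replica = set(replica_zones)
    to_create = sorted(primary - replica)
    to_resync = sorted(primary & replica)
    return to_create, to_resync
```

For a replica that holds only `viktorbarzin.me`, the plan creates `viktorbarzin.lan` and resyncs `viktorbarzin.me`.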
Viktor Barzin
c0a33b5157 state(technitium): update encrypted state 2026-04-14 12:17:29 +00:00
Viktor Barzin
5ff26dd8ef state(technitium): update encrypted state 2026-04-14 12:13:27 +00:00
Viktor Barzin
30cdeefb1c chore: sync terraform state after nfsvers=4 convergence
Applied all 20 NFS stacks to converge PV mount_options (nfsvers=4).
State files encrypted and committed.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 11:20:18 +00:00
Viktor Barzin
bb2731256b state(immich): update encrypted state 2026-04-14 11:19:06 +00:00
Viktor Barzin
99e2bc1bef state(immich): update encrypted state 2026-04-14 11:19:00 +00:00
Viktor Barzin
ac6ec06afe state(ollama): update encrypted state 2026-04-14 11:18:57 +00:00
Viktor Barzin
7a60108e97 state(ollama): update encrypted state 2026-04-14 11:18:20 +00:00
Viktor Barzin
a39b90bbcc state(ollama): update encrypted state 2026-04-14 11:18:10 +00:00
Viktor Barzin
a25739a572 state(poison-fountain): update encrypted state 2026-04-14 11:13:32 +00:00
Viktor Barzin
d9ddf102ec state(plotting-book): update encrypted state 2026-04-14 11:13:02 +00:00
Viktor Barzin
6d209fffad state(meshcentral): update encrypted state 2026-04-14 11:11:59 +00:00
Viktor Barzin
d0805ed2a8 state(infra-maintenance): update encrypted state 2026-04-14 11:11:09 +00:00
Viktor Barzin
28264e69c6 state(headscale): update encrypted state 2026-04-14 11:11:05 +00:00
Viktor Barzin
1738c3437c state(frigate): update encrypted state 2026-04-14 11:09:30 +00:00
Viktor Barzin
fe42993446 state(ebook2audiobook): update encrypted state 2026-04-14 11:08:37 +00:00
Viktor Barzin
23140cf780 state(real-estate-crawler): update encrypted state 2026-04-14 11:08:24 +00:00
Viktor Barzin
d24e4aac0b state(osm_routing): update encrypted state 2026-04-14 11:08:09 +00:00
Viktor Barzin
94b7097789 state(openclaw): update encrypted state 2026-04-14 11:08:05 +00:00
Viktor Barzin
25f4682dc0 state(nextcloud): update encrypted state 2026-04-14 11:06:41 +00:00
Viktor Barzin
aac81e0a1f state(vault): update encrypted state 2026-04-14 11:06:27 +00:00
Viktor Barzin
047f695129 state(ytdlp): update encrypted state 2026-04-14 11:06:11 +00:00
Viktor Barzin
20e86e96a3 state(servarr): update encrypted state 2026-04-14 11:05:54 +00:00
Viktor Barzin
0d6b6cbd95 state(navidrome): update encrypted state 2026-04-14 11:05:10 +00:00
Viktor Barzin
9ea3b33a55 state(ebooks): update encrypted state 2026-04-14 10:54:47 +00:00
Viktor Barzin
ea18116da9 fix: NFS outage recovery — migrate to NFSv4, add alerting
NFS server restart broke NFSv3 (lockd kernel bug on PVE 6.14).
All 52 NFS PVs patched to nfsvers=4, NFSv3 disabled on PVE.

Changes:
- nfs_volume module: add nfsvers=4 mount option
- nfs-csi StorageClass: add nfsvers=4 mount option
- dbaas: MySQL serverInstances 3→1, mysql-native-password=ON
- monitoring: add NFSCSINodeDown and NFSMountFailures alerts

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 10:28:27 +00:00
Viktor Barzin
92900b5e08 state(dbaas): update encrypted state 2026-04-14 10:27:04 +00:00
Viktor Barzin
b4b6fd5946 state(nfs-csi): update encrypted state 2026-04-14 09:32:41 +00:00
Viktor Barzin
30e5150ecd state(status-page): update encrypted state 2026-04-14 09:31:50 +00:00
Viktor Barzin
ac3a6a96dd state(hermes-agent): update encrypted state 2026-04-14 09:04:35 +00:00
Viktor Barzin
37b3395017 state(hermes-agent): update encrypted state 2026-04-14 09:00:30 +00:00
Viktor Barzin
8b2d3b7e6c state(hermes-agent): update encrypted state 2026-04-14 08:58:36 +00:00
Viktor Barzin
8787ad9f1d state(hermes-agent): update encrypted state 2026-04-14 08:39:18 +00:00
Viktor Barzin
71182f2867 state(openclaw): update encrypted state 2026-04-14 08:39:06 +00:00
Viktor Barzin
aa3af753a6 state(openclaw): update encrypted state 2026-04-14 08:38:54 +00:00
Viktor Barzin
4e059b138c docs: consolidate all post-mortems under docs/post-mortems/
Move HTML post-mortems from repo root post-mortems/ to docs/post-mortems/.
Update index.html with all 3 incidents (newest first).

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 08:24:36 +00:00
Viktor Barzin
bdba15a387 docs: move post-mortems to docs/post-mortems/
Consolidate all outage reports under docs/ for better discoverability.
Moved from .claude/post-mortems/ (agent-internal) to docs/post-mortems/
(repo documentation).

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 08:20:09 +00:00
Viktor Barzin
68c8c5b4a0 fix(technitium): migrate primary to proxmox-lvm-encrypted + post-mortem
SEV1 outage: fsid=0 in PVE /etc/exports broke all NFS subdirectory
mounts from k8s (NFSv4 pseudo-root path resolution). Combined with the
lockd failure, this left both the NFSv4 and NFSv3 mount paths broken.
The outage cascaded into DNS primary, Vault (2/3 pods), Alertmanager,
and 20+ services.

Changes:
- Primary PVC: NFS (nfs-truenas) → proxmox-lvm-encrypted
- Secondary/tertiary PVCs: proxmox-lvm → proxmox-lvm-encrypted
- Removed NFS module dependency from technitium stack
- Added full post-mortem with prevention plan

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 08:18:59 +00:00
Viktor Barzin
b239af9b6d state(technitium): update encrypted state 2026-04-14 08:04:07 +00:00
Viktor Barzin
6afba3b338 state(hermes-agent): update encrypted state 2026-04-13 22:33:35 +00:00
Viktor Barzin
e8ef81b276 state(hermes-agent): update encrypted state 2026-04-13 22:31:39 +00:00
Viktor Barzin
ab52b8eec2 state(hermes-agent): update encrypted state 2026-04-13 22:26:19 +00:00