Commit graph

2588 commits

Author SHA1 Message Date
Viktor Barzin
0a448c2bae docs: rewrite incident-response as user contribution guide
Complete rewrite of the user-facing documentation:
- How to report outages and request features
- Mermaid flow diagrams for both incident and feature request paths
- SLA expectations (automated vs human response times)
- Self-service checks before reporting
- Severity level definitions
- Status page explanation
- Full technical architecture section with component inventory
- Safety guardrails, labels, and commit conventions

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 17:59:09 +00:00
Viktor Barzin
cf578516e9 feat: auto-cleanup failed/evicted pods via Kyverno ClusterCleanupPolicy
Add cleanup-failed-pods policy that runs hourly (at :15) to delete all
pods in Failed phase cluster-wide. Prevents stale evicted and failed
CronJob pods from accumulating and creating healthcheck noise.

Also adds ClusterRole + ClusterRoleBinding to grant Kyverno cleanup
controller permission to delete Pods (not included by default).

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 17:37:49 +00:00
Viktor Barzin
f726d1c3fd fix: stash local changes before git pull in CI pipelines
DevVM may have unstaged changes from active sessions. Use git stash
before pull to avoid 'cannot pull with rebase: unstaged changes' errors.
Stash pop after to restore working state.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 17:37:10 +00:00
Viktor Barzin
dcce76403a fix: use direct env vars for Woodpecker pipeline variables
Woodpecker injects manual pipeline variables as direct env vars
(e.g., $ISSUE_NUMBER), not as CI_PIPELINE_VARIABLE_* prefixed vars.
The provision-user pipeline already uses this pattern correctly.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 17:34:42 +00:00
Viktor Barzin
704fa09185 fix: remove manual event from build-ci-image to fix issue automation
build-ci-image.yml had event:[push,manual] which caused it to run
on every manual pipeline trigger. Its registry_user/registry_password
secrets don't have the manual event, causing all manual pipelines to
error. Removed manual from its event list since it only needs push.

Reverted evaluate conditions (Woodpecker evaluates secrets before
conditions, so evaluate can't prevent missing-secret errors).

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 17:31:25 +00:00
Viktor Barzin
a583b11484 fix: guard manual Woodpecker pipelines with evaluate conditions
When GHA triggers a manual pipeline for issue automation, ALL pipelines
with event:manual fire. Added evaluate conditions:
- issue-automation.yml: only runs when ISSUE_NUMBER is set
- provision-user.yml: only runs when ISSUE_NUMBER is NOT set
- build-ci-image.yml: only runs when ISSUE_NUMBER is NOT set

This prevents build-ci-image from failing on missing registry_password
secret when issue automation triggers.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 17:29:35 +00:00
Viktor Barzin
de42acd68e fix: backup LUKS rsync tolerance, stale mapping cleanup, tier-4-aux quota bump
- daily-backup: handle rsync exit 23 (partial transfer) as OK for LUKS
  noload mounts — in-flight writes have corrupt metadata from skipped
  journal replay, but core data is intact
- daily-backup: clean up stale LUKS dm mappings from previous crashed
  runs before attempting to open
- daily-backup: capture rsync exit code safely with set -e (|| pattern)
- kyverno: bump tier-4-aux requests.memory 2Gi→3Gi (servarr was at 83%)
- actualbudget: patched custom quota 5Gi→6Gi (was at 82%)

Verified: backup now completes status=0 (96 PVCs OK, 0 failed)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 17:21:51 +00:00
Viktor Barzin
92495d0fc3 fix: start Claude from ~/code to load root CLAUDE.md correctly
Both issue-automation and postmortem pipelines were cd'ing into
~/code/infra before running Claude, missing the root CLAUDE.md
with beads config and project-wide instructions. Now cd to ~/code
and use relative agent paths from there.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 17:15:32 +00:00
Viktor Barzin
7bb9ec2934 Add agent task tracking documentation
Documents the centralized Beads/Dolt task tracking system used by all
Claude Code sessions. Covers architecture, session lifecycle, settings
hierarchy, known issues, and E2E test verification.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 17:11:26 +00:00
Viktor Barzin
9baefa22ab fix: technitium CronJob scheduling, LUKS backup support, speedtest scrape
- technitium-password-sync: remove RWO encrypted PVC mount that caused
  pods to stick in ContainerCreating on wrong nodes. Plugin install now
  warns instead of failing when zip unavailable.
- daily-backup: add LUKS decryption support for encrypted PVC snapshots
  using /root/.luks-backup-key. Uses noload mount option to skip ext4
  journal replay. Also installed cryptsetup-bin on PVE host.
- speedtest: disable prometheus.io/scrape annotation (no /prometheus
  endpoint exists, causing ScrapeTargetDown alert).

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 15:12:32 +00:00
Viktor Barzin
25ef5176bb state(speedtest): update encrypted state 2026-04-15 15:09:37 +00:00
Viktor Barzin
8b9c335f9c state(technitium): update encrypted state 2026-04-15 15:09:18 +00:00
Viktor Barzin
601a83d84e fix: CI pipeline image pull auth + shallow clone resilience [ci skip]
- Add WOODPECKER_BACKEND_K8S_PULL_SECRET_NAMES to agent env so step
  pods can pull from private registry (registry.viktorbarzin.me:5050)
- Add fallback in default.yml when HEAD~1 is unavailable (shallow
  clone with depth=1): fetch more history, or apply all platform
  stacks as safe default
- Root cause: pipeline #243 failed because infra-ci:latest image
  couldn't be pulled (no imagePullSecrets on step pods)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 14:41:08 +00:00
Viktor Barzin
e23153cf03 chore: add pre-commit size guard and harden .gitignore
- Add .githooks/pre-commit that blocks files >2MB (configurable via
  GIT_MAX_FILE_SIZE). Activate with: git config core.hooksPath .githooks
- Expand .gitignore to block common binary/archive patterns
  (*.tar.gz, *.tgz, *.iso, *.img, *.bin, *.exe, *.dmg)
- Add explicit root-level terraform.tfstate ignore rules
- Remove stale redis-25.3.2.tgz helm chart (unreferenced)

Prevents re-accumulation of large blobs after git history cleanup
that reduced .git from 2.6GB to 128MB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 14:13:18 +00:00
Viktor Barzin
b0192d9545 chore: untrack binary build artifacts from git
Remove cli/cli (12.5MB), cli/infra_cli (12MB), clipboard-upload (8.7MB)
from git tracking. These are build outputs that should be generated by CI.
Add patterns to .gitignore to prevent re-committing.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 11:33:43 +00:00
Viktor Barzin
36454b87d1 feat: CI/CD performance overhaul
- New custom CI Docker image (ci/Dockerfile) with TF 1.5.7, TG 0.99.4,
  git-crypt, sops, kubectl pre-installed. Pushed to private registry.
  Eliminates 17 apk add calls + binary downloads per pipeline run.

- Unified CI pipeline: merge default.yml + app-stacks.yml into one.
  Changed-stacks-only detection (git diff, with global-file fallback).
  Concurrency limit (xargs -P 4). Step consolidation (2 steps vs 4).
  Shallow clone (depth=2). Provider cache (TF_PLUGIN_CACHE_DIR).

- Per-stack Vault advisory locks in scripts/tg. 30min TTL with stale
  lock detection. Blocks concurrent applies to same stack.

- TF_PLUGIN_CACHE_DIR enabled by default in scripts/tg for local dev.

- Daily drift detection pipeline (.woodpecker/drift-detection.yml).
  Runs terraform plan on all stacks, Slack alert on drift.

- CI image build pipeline (.woodpecker/build-ci-image.yml).

Expected speedup: ~5-10 min per pipeline run → ~2-4 min.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 11:22:26 +00:00
Viktor Barzin
bcad200a23 chore: add untracked stacks, scripts, and agent configs
- New stacks: beads-server, hermes-agent
- Terragrunt tiers.tf for infra, phpipam, status-page
- Secrets symlinks for vault, phpipam, hermes-agent
- Scripts: cluster_manager, image_pull, containerd pullthrough setup
- Frigate config, audiblez-web app source, n8n workflows dir
- Claude agent: service-upgrade, reference: upgrade-config.json
- Removed: claudeception skill, excalidraw empty submodule, temp listings

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 09:33:06 +00:00
Viktor Barzin
bd41bb9230 fix: cluster healthcheck fixes + Authentik upgrade to 2026.2.2
- Authentik: upgrade 2025.10.3 → 2025.12.4 → 2026.2.2 with DB restore
  and stepped migration. Switch to existingSecret, PgBouncer session mode.
- Mailserver: migrate email roundtrip probe from Mailgun to Brevo API
- Redis: fix HAProxy tcp-check regex (rstring), faster health intervals
- Nextcloud: fix Redis fallback to HAProxy service, update dependency
- MeshCentral: fix TLSOffload + certUrl init container for first-run
- Monitoring: remove authentik from latency alert exclusion
- Diun: simplify to webhook notifier, remove git auto-update

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 06:41:56 +00:00
Viktor Barzin
d31bbc9a18 docs: update monitoring and backup docs for external monitors and per-db backups
- CLAUDE.md: document external monitoring (ExternalAccessDivergence alert,
  external-monitor-sync CronJob) and per-database backup/restore paths
- backup-dr.md: add per-db backup CronJobs to inventory table and daily
  timeline, update restore runbook references
- monitoring.md: add External Monitor Sync component and external monitoring
  architecture section

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 06:37:07 +00:00
Viktor Barzin
0256ccdccc feat: add per-database backups for PostgreSQL and MySQL
Add separate CronJobs that dump each database individually:
- postgresql-backup-per-db: pg_dump -Fc per DB (daily 00:15)
- mysql-backup-per-db: mysqldump per DB (daily 00:45)

Dumps go to /backup/per-db/<dbname>/ on the same NFS PVC.
Enables single-database restore without affecting other databases.
Also fixed CNPG superuser password sync and added --single-transaction
--set-gtid-purged=OFF to MySQL per-db dumps.

Updated restore runbooks with per-database restore procedures.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 22:39:33 +00:00
Viktor Barzin
ca1ae23f34 state(dbaas): update encrypted state 2026-04-14 22:39:24 +00:00
Viktor Barzin
99b08e8e50 state(dbaas): update encrypted state 2026-04-14 22:27:40 +00:00
Viktor Barzin
79a41ffde0 state(nextcloud): update encrypted state 2026-04-14 20:29:21 +00:00
Viktor Barzin
554c2e3b5e state(nextcloud): update encrypted state 2026-04-14 20:28:34 +00:00
Viktor Barzin
3343cc7ba4 state(redis): update encrypted state 2026-04-14 20:28:09 +00:00
Viktor Barzin
41f2b00fa3 state(monitoring): update encrypted state 2026-04-14 20:17:40 +00:00
Viktor Barzin
34a5301123 state(monitoring): update encrypted state 2026-04-14 20:17:24 +00:00
Viktor Barzin
909c4c2744 state(authentik): update encrypted state 2026-04-14 20:16:40 +00:00
Viktor Barzin
5b443c2311 state(authentik): update encrypted state 2026-04-14 20:11:51 +00:00
Viktor Barzin
3e9231ae0d feat: augment outage report template with debugging context
- Expand service list: add Home Assistant, Actual Budget, Audiobookshelf,
  Linkwarden, Matrix, Paperless, Tandoor, FreshRSS, Frigate, HackMD,
  Excalidraw, Wealthfolio, Send, Stirling PDF
- Add structured debugging fields: error type, scope (just me vs others),
  when it started, URL accessed
- Fix user report parser to extract all form fields into status.json
- Show error type, scope, and start time in status page report cards

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 20:03:44 +00:00
Viktor Barzin
c00a908610 state(status-page): update encrypted state 2026-04-14 20:03:29 +00:00
Viktor Barzin
460c68e015 feat: add incident management system with user reporting
- Status page (status.viktorbarzin.me): incident cards with SEV badges,
  expandable timelines, postmortem links, user report rendering
- Issue templates on infra repo for user outage reports
- CronJob reads incidents + user-reports from ViktorBarzin/infra
- "Report an Outage" button on status page links to infra repo
- Post-mortem agents restored (4-stage pipeline: triage → investigation
  → historian → report writer) with updated paths and issue linking
- Post-mortem skill/template updated to link reports to GitHub Issues
  and manage postmortem-required/postmortem-done labels
- Labels: incident, sev1-3, user-report, postmortem-required,
  postmortem-done on infra repo

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 20:00:31 +00:00
Viktor Barzin
24a23709a5 fix: update healthcheck to report internal and external monitors separately
- Increase Uptime Kuma API timeout to 120s with wait_events=0.2
- Remove hardcoded password, use Vault or UPTIME_KUMA_PASSWORD env var
- Report internal and external monitor status separately
- Install uptime-kuma-api in local venv

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 19:44:20 +00:00
Viktor Barzin
26cc6fdf2f state(authentik): update encrypted state 2026-04-14 19:43:04 +00:00
Viktor Barzin
d05e13e3cc state(status-page): update encrypted state 2026-04-14 19:36:45 +00:00
Viktor Barzin
c69eba9b46 fix: increase Uptime Kuma API timeout and fix status code format
- Increase socket timeout from 30s to 120s (121+ monitors need time to sync)
- Add wait_events=0.2 for reliable login
- Fix accepted_statuscodes format: use 100-increment ranges not arbitrary

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 19:28:18 +00:00
Viktor Barzin
3d189ce4d7 state(uptime-kuma): update encrypted state 2026-04-14 19:21:19 +00:00
Viktor Barzin
3a6fa6d34f state(uptime-kuma): update encrypted state 2026-04-14 19:18:55 +00:00
Viktor Barzin
43342f860c state(mailserver): update encrypted state 2026-04-14 19:13:46 +00:00
Viktor Barzin
c9fa482996 state(uptime-kuma): update encrypted state 2026-04-14 19:07:36 +00:00
Viktor Barzin
ff360a8807 feat: add external monitoring for all Cloudflare-proxied services
Add automatic external HTTPS monitors to Uptime Kuma for ~96 services
exposed via Cloudflare tunnel. A sync CronJob (every 10min) reads from
a Terraform-generated ConfigMap and creates/deletes [External] monitors
to match cloudflare_proxied_names. Status page groups these separately
as "External Reachability" and pushes a divergence metric to Pushgateway
when services are externally down but internally up. Prometheus alert
ExternalAccessDivergence fires after 15min of divergence.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 19:04:45 +00:00
Viktor Barzin
3258ff6cb7 state(monitoring): update encrypted state 2026-04-14 19:04:18 +00:00
Viktor Barzin
647d73d499 state(status-page): update encrypted state 2026-04-14 19:03:30 +00:00
Viktor Barzin
74c1fb83cb state(uptime-kuma): update encrypted state 2026-04-14 19:02:56 +00:00
Viktor Barzin
30c3450c61 docs: add user-facing issue reporting guide
Adds "Reporting an Issue" section with:
- Where to report (Slack, GitHub, DM)
- What to include (examples of good vs bad reports)
- What happens after reporting (flow diagram)
- Self-service status checks (Uptime Kuma, Grafana, K8s Dashboard)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 18:37:37 +00:00
Viktor Barzin
dfad89ef81 fix: remove br tags from mermaid diagram (GitHub compat) [ci skip] 2026-04-14 18:35:13 +00:00
Viktor Barzin
f658b18a50 docs: add incident response & post-mortem pipeline architecture
Documents the full E2E automated incident response pipeline:
- /post-mortem skill for generating structured post-mortems
- Woodpecker pipeline triggered on post-mortem file changes
- Claude Code headless agent for implementing safe TODOs
- Safety guardrails, commit conventions, secrets, limitations

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 18:31:29 +00:00
Viktor Barzin
cf87a747d8 docs: update post-mortem follow-up implementation [PM-2026-04-14] [ci skip]
Mark all 8 safe TODOs as Done. Add Follow-up Implementation table with commit
SHAs. Flag 3 Migration TODOs as needing human review.

Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
2026-04-14 18:09:11 +00:00
Viktor Barzin
4498f61402 fix(post-mortem): add /etc/exports to git, NFS health check in daily-backup, document CSI requirements [PM-2026-04-14]
- scripts/pve-nfs-exports: git-managed copy of PVE host /etc/exports with
  detailed comments explaining fsid=0 danger and NFSv3 disable rationale.
  Deploy: scp scripts/pve-nfs-exports root@192.168.1.127:/etc/exports && ssh root@192.168.1.127 exportfs -ra
- scripts/daily-backup.sh: add check_nfs_exports() that runs before backup starts.
  Detects: missing /etc/exports, dangerous fsid=0 on /srv/nfs, nfs-server not running,
  no active exports. Warns but doesn't abort (block-storage PVC backups can still run).
- .claude/CLAUDE.md: document NFS CSI mount option requirements — nfsvers=4 mandatory,
  fsid=0 forbidden, /etc/exports is git-managed, critical services must use proxmox-lvm-encrypted.

Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
2026-04-14 18:08:24 +00:00
Viktor Barzin
ca2680c189 fix(post-mortem): add NFSHighRPCRetransmissions alert + migrate alertmanager to proxmox-lvm-encrypted [PM-2026-04-14]
- Add PrometheusRule: NFSHighRPCRetransmissions fires when node_nfs_rpc_retransmissions_total
  rate exceeds 5/s for 5m — catches NFS server degradation before pod failures cascade
- Migrate alertmanager PV from NFS (192.168.1.127:/srv/nfs/alertmanager) to proxmox-lvm-encrypted
  eliminating the circular dependency where alertmanager couldn't alert about NFS failures
- Set force_update=true on prometheus helm_release to handle StatefulSet volumeClaimTemplate changes

Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
2026-04-14 18:05:33 +00:00