Commit graph

2578 commits

Author SHA1 Message Date
Viktor Barzin
25ef5176bb state(speedtest): update encrypted state 2026-04-15 15:09:37 +00:00
Viktor Barzin
8b9c335f9c state(technitium): update encrypted state 2026-04-15 15:09:18 +00:00
Viktor Barzin
601a83d84e fix: CI pipeline image pull auth + shallow clone resilience [ci skip]
- Add WOODPECKER_BACKEND_K8S_PULL_SECRET_NAMES to agent env so step
  pods can pull from private registry (registry.viktorbarzin.me:5050)
- Add fallback in default.yml when HEAD~1 is unavailable (shallow
  clone with depth=1): fetch more history, or apply all platform
  stacks as safe default
- Root cause: pipeline #243 failed because infra-ci:latest image
  couldn't be pulled (no imagePullSecrets on step pods)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 14:41:08 +00:00
Viktor Barzin
e23153cf03 chore: add pre-commit size guard and harden .gitignore
- Add .githooks/pre-commit that blocks files >2MB (configurable via
  GIT_MAX_FILE_SIZE). Activate with: git config core.hooksPath .githooks
- Expand .gitignore to block common binary/archive patterns
  (*.tar.gz, *.tgz, *.iso, *.img, *.bin, *.exe, *.dmg)
- Add explicit root-level terraform.tfstate ignore rules
- Remove stale redis-25.3.2.tgz helm chart (unreferenced)

Prevents re-accumulation of large blobs after git history cleanup
that reduced .git from 2.6GB to 128MB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 14:13:18 +00:00
Viktor Barzin
b0192d9545 chore: untrack binary build artifacts from git
Remove cli/cli (12.5MB), cli/infra_cli (12MB), clipboard-upload (8.7MB)
from git tracking. These are build outputs that should be generated by CI.
Add patterns to .gitignore to prevent re-committing.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 11:33:43 +00:00
Viktor Barzin
36454b87d1 feat: CI/CD performance overhaul
- New custom CI Docker image (ci/Dockerfile) with TF 1.5.7, TG 0.99.4,
  git-crypt, sops, kubectl pre-installed. Pushed to private registry.
  Eliminates 17 apk add calls + binary downloads per pipeline run.

- Unified CI pipeline: merge default.yml + app-stacks.yml into one.
  Changed-stacks-only detection (git diff, with global-file fallback).
  Concurrency limit (xargs -P 4). Step consolidation (2 steps vs 4).
  Shallow clone (depth=2). Provider cache (TF_PLUGIN_CACHE_DIR).

- Per-stack Vault advisory locks in scripts/tg. 30min TTL with stale
  lock detection. Blocks concurrent applies to same stack.

- TF_PLUGIN_CACHE_DIR enabled by default in scripts/tg for local dev.

- Daily drift detection pipeline (.woodpecker/drift-detection.yml).
  Runs terraform plan on all stacks, Slack alert on drift.

- CI image build pipeline (.woodpecker/build-ci-image.yml).

Expected speedup: ~5-10 min per pipeline run → ~2-4 min.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 11:22:26 +00:00
Viktor Barzin
bcad200a23 chore: add untracked stacks, scripts, and agent configs
- New stacks: beads-server, hermes-agent
- Terragrunt tiers.tf for infra, phpipam, status-page
- Secrets symlinks for vault, phpipam, hermes-agent
- Scripts: cluster_manager, image_pull, containerd pullthrough setup
- Frigate config, audiblez-web app source, n8n workflows dir
- Claude agent: service-upgrade, reference: upgrade-config.json
- Removed: claudeception skill, excalidraw empty submodule, temp listings

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 09:33:06 +00:00
Viktor Barzin
bd41bb9230 fix: cluster healthcheck fixes + Authentik upgrade to 2026.2.2
- Authentik: upgrade 2025.10.3 → 2025.12.4 → 2026.2.2 with DB restore
  and stepped migration. Switch to existingSecret, PgBouncer session mode.
- Mailserver: migrate email roundtrip probe from Mailgun to Brevo API
- Redis: fix HAProxy tcp-check regex (rstring), faster health intervals
- Nextcloud: fix Redis fallback to HAProxy service, update dependency
- MeshCentral: fix TLSOffload + certUrl init container for first-run
- Monitoring: remove authentik from latency alert exclusion
- Diun: simplify to webhook notifier, remove git auto-update

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 06:41:56 +00:00
Viktor Barzin
d31bbc9a18 docs: update monitoring and backup docs for external monitors and per-db backups
- CLAUDE.md: document external monitoring (ExternalAccessDivergence alert,
  external-monitor-sync CronJob) and per-database backup/restore paths
- backup-dr.md: add per-db backup CronJobs to inventory table and daily
  timeline, update restore runbook references
- monitoring.md: add External Monitor Sync component and external monitoring
  architecture section

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 06:37:07 +00:00
Viktor Barzin
0256ccdccc feat: add per-database backups for PostgreSQL and MySQL
Add separate CronJobs that dump each database individually:
- postgresql-backup-per-db: pg_dump -Fc per DB (daily 00:15)
- mysql-backup-per-db: mysqldump per DB (daily 00:45)

Dumps go to /backup/per-db/<dbname>/ on the same NFS PVC.
Enables single-database restore without affecting other databases.
Also fixed CNPG superuser password sync and added --single-transaction
--set-gtid-purged=OFF to MySQL per-db dumps.

Updated restore runbooks with per-database restore procedures.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 22:39:33 +00:00
Viktor Barzin
ca1ae23f34 state(dbaas): update encrypted state 2026-04-14 22:39:24 +00:00
Viktor Barzin
99b08e8e50 state(dbaas): update encrypted state 2026-04-14 22:27:40 +00:00
Viktor Barzin
79a41ffde0 state(nextcloud): update encrypted state 2026-04-14 20:29:21 +00:00
Viktor Barzin
554c2e3b5e state(nextcloud): update encrypted state 2026-04-14 20:28:34 +00:00
Viktor Barzin
3343cc7ba4 state(redis): update encrypted state 2026-04-14 20:28:09 +00:00
Viktor Barzin
41f2b00fa3 state(monitoring): update encrypted state 2026-04-14 20:17:40 +00:00
Viktor Barzin
34a5301123 state(monitoring): update encrypted state 2026-04-14 20:17:24 +00:00
Viktor Barzin
909c4c2744 state(authentik): update encrypted state 2026-04-14 20:16:40 +00:00
Viktor Barzin
5b443c2311 state(authentik): update encrypted state 2026-04-14 20:11:51 +00:00
Viktor Barzin
3e9231ae0d feat: augment outage report template with debugging context
- Expand service list: add Home Assistant, Actual Budget, Audiobookshelf,
  Linkwarden, Matrix, Paperless, Tandoor, FreshRSS, Frigate, HackMD,
  Excalidraw, Wealthfolio, Send, Stirling PDF
- Add structured debugging fields: error type, scope (just me vs others),
  when it started, URL accessed
- Fix user report parser to extract all form fields into status.json
- Show error type, scope, and start time in status page report cards

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 20:03:44 +00:00
Viktor Barzin
c00a908610 state(status-page): update encrypted state 2026-04-14 20:03:29 +00:00
Viktor Barzin
460c68e015 feat: add incident management system with user reporting
- Status page (status.viktorbarzin.me): incident cards with SEV badges,
  expandable timelines, postmortem links, user report rendering
- Issue templates on infra repo for user outage reports
- CronJob reads incidents + user-reports from ViktorBarzin/infra
- "Report an Outage" button on status page links to infra repo
- Post-mortem agents restored (4-stage pipeline: triage → investigation
  → historian → report writer) with updated paths and issue linking
- Post-mortem skill/template updated to link reports to GitHub Issues
  and manage postmortem-required/postmortem-done labels
- Labels: incident, sev1-3, user-report, postmortem-required,
  postmortem-done on infra repo

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 20:00:31 +00:00
Viktor Barzin
24a23709a5 fix: update healthcheck to report internal and external monitors separately
- Increase Uptime Kuma API timeout to 120s with wait_events=0.2
- Remove hardcoded password, use Vault or UPTIME_KUMA_PASSWORD env var
- Report internal and external monitor status separately
- Install uptime-kuma-api in local venv

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 19:44:20 +00:00
Viktor Barzin
26cc6fdf2f state(authentik): update encrypted state 2026-04-14 19:43:04 +00:00
Viktor Barzin
d05e13e3cc state(status-page): update encrypted state 2026-04-14 19:36:45 +00:00
Viktor Barzin
c69eba9b46 fix: increase Uptime Kuma API timeout and fix status code format
- Increase socket timeout from 30s to 120s (121+ monitors need time to sync)
- Add wait_events=0.2 for reliable login
- Fix accepted_statuscodes format: use 100-increment ranges not arbitrary

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 19:28:18 +00:00
Viktor Barzin
3d189ce4d7 state(uptime-kuma): update encrypted state 2026-04-14 19:21:19 +00:00
Viktor Barzin
3a6fa6d34f state(uptime-kuma): update encrypted state 2026-04-14 19:18:55 +00:00
Viktor Barzin
43342f860c state(mailserver): update encrypted state 2026-04-14 19:13:46 +00:00
Viktor Barzin
c9fa482996 state(uptime-kuma): update encrypted state 2026-04-14 19:07:36 +00:00
Viktor Barzin
ff360a8807 feat: add external monitoring for all Cloudflare-proxied services
Add automatic external HTTPS monitors to Uptime Kuma for ~96 services
exposed via Cloudflare tunnel. A sync CronJob (every 10min) reads from
a Terraform-generated ConfigMap and creates/deletes [External] monitors
to match cloudflare_proxied_names. Status page groups these separately
as "External Reachability" and pushes a divergence metric to Pushgateway
when services are externally down but internally up. Prometheus alert
ExternalAccessDivergence fires after 15min of divergence.

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 19:04:45 +00:00
Viktor Barzin
3258ff6cb7 state(monitoring): update encrypted state 2026-04-14 19:04:18 +00:00
Viktor Barzin
647d73d499 state(status-page): update encrypted state 2026-04-14 19:03:30 +00:00
Viktor Barzin
74c1fb83cb state(uptime-kuma): update encrypted state 2026-04-14 19:02:56 +00:00
Viktor Barzin
30c3450c61 docs: add user-facing issue reporting guide
Adds "Reporting an Issue" section with:
- Where to report (Slack, GitHub, DM)
- What to include (examples of good vs bad reports)
- What happens after reporting (flow diagram)
- Self-service status checks (Uptime Kuma, Grafana, K8s Dashboard)

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 18:37:37 +00:00
Viktor Barzin
dfad89ef81 fix: remove br tags from mermaid diagram (GitHub compat) [ci skip] 2026-04-14 18:35:13 +00:00
Viktor Barzin
f658b18a50 docs: add incident response & post-mortem pipeline architecture
Documents the full E2E automated incident response pipeline:
- /post-mortem skill for generating structured post-mortems
- Woodpecker pipeline triggered on post-mortem file changes
- Claude Code headless agent for implementing safe TODOs
- Safety guardrails, commit conventions, secrets, limitations

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 18:31:29 +00:00
Viktor Barzin
cf87a747d8 docs: update post-mortem follow-up implementation [PM-2026-04-14] [ci skip]
Mark all 8 safe TODOs as Done. Add Follow-up Implementation table with commit
SHAs. Flag 3 Migration TODOs as needing human review.

Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
2026-04-14 18:09:11 +00:00
Viktor Barzin
4498f61402 fix(post-mortem): add /etc/exports to git, NFS health check in daily-backup, document CSI requirements [PM-2026-04-14]
- scripts/pve-nfs-exports: git-managed copy of PVE host /etc/exports with
  detailed comments explaining fsid=0 danger and NFSv3 disable rationale.
  Deploy: scp scripts/pve-nfs-exports root@192.168.1.127:/etc/exports && ssh root@192.168.1.127 exportfs -ra
- scripts/daily-backup.sh: add check_nfs_exports() that runs before backup starts.
  Detects: missing /etc/exports, dangerous fsid=0 on /srv/nfs, nfs-server not running,
  no active exports. Warns but doesn't abort (block-storage PVC backups can still run).
- .claude/CLAUDE.md: document NFS CSI mount option requirements — nfsvers=4 mandatory,
  fsid=0 forbidden, /etc/exports is git-managed, critical services must use proxmox-lvm-encrypted.

Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
2026-04-14 18:08:24 +00:00
Viktor Barzin
ca2680c189 fix(post-mortem): add NFSHighRPCRetransmissions alert + migrate alertmanager to proxmox-lvm-encrypted [PM-2026-04-14]
- Add PrometheusRule: NFSHighRPCRetransmissions fires when node_nfs_rpc_retransmissions_total
  rate exceeds 5/s for 5m — catches NFS server degradation before pod failures cascade
- Migrate alertmanager PV from NFS (192.168.1.127:/srv/nfs/alertmanager) to proxmox-lvm-encrypted
  eliminating the circular dependency where alertmanager couldn't alert about NFS failures
- Set force_update=true on prometheus helm_release to handle StatefulSet volumeClaimTemplate changes

Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
2026-04-14 18:05:33 +00:00
Viktor Barzin
0901dd5f61 state(monitoring): update encrypted state 2026-04-14 17:52:13 +00:00
Viktor Barzin
b1b408ff0e fix: use full path to claude CLI for non-interactive SSH 2026-04-14 17:44:50 +00:00
Viktor Barzin
7674cf8c5c docs: final E2E pipeline test 2026-04-14 17:43:38 +00:00
Viktor Barzin
f2e7367401 fix: use sh instead of bash in pipeline (Alpine compat) 2026-04-14 17:29:14 +00:00
Viktor Barzin
91b97709b7 docs: trigger postmortem pipeline with TODO 2026-04-14 17:27:45 +00:00
Viktor Barzin
c742fa3dfb fix: scan all post-mortems for TODOs (no git diff needed) 2026-04-14 17:14:22 +00:00
Viktor Barzin
f336e5ed53 docs: E2E test postmortem pipeline with deep clone 2026-04-14 17:12:46 +00:00
Viktor Barzin
0b2f5a4729 fix: use depth 5 clone for postmortem pipeline (need HEAD~1) 2026-04-14 17:12:41 +00:00
Viktor Barzin
59367cc588 fix: handle Woodpecker shallow clone in postmortem pipeline 2026-04-14 17:12:02 +00:00
Viktor Barzin
60c04e51b7 2026-04-14 17:10:45 +00:00