Complete rewrite of the user-facing documentation:
- How to report outages and request features
- Mermaid flow diagrams for both incident and feature request paths
- SLA expectations (automated vs human response times)
- Self-service checks before reporting
- Severity level definitions
- Status page explanation
- Full technical architecture section with component inventory
- Safety guardrails, labels, and commit conventions
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add cleanup-failed-pods policy that runs hourly (at :15) to delete all
pods in Failed phase cluster-wide. Prevents stale evicted and failed
CronJob pods from accumulating and creating healthcheck noise.
Also adds ClusterRole + ClusterRoleBinding to grant Kyverno cleanup
controller permission to delete Pods (not included by default).
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
DevVM may have unstaged changes from active sessions. Use git stash
before pull to avoid 'cannot pull with rebase: unstaged changes' errors.
Stash pop after to restore working state.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Woodpecker injects manual pipeline variables as direct env vars
(e.g., $ISSUE_NUMBER), not as CI_PIPELINE_VARIABLE_* prefixed vars.
The provision-user pipeline already uses this pattern correctly.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
build-ci-image.yml had event:[push,manual] which caused it to run
on every manual pipeline trigger. Its registry_user/registry_password
secrets don't have the manual event, causing all manual pipelines to
error. Removed manual from its event list since it only needs push.
Reverted evaluate conditions (Woodpecker evaluates secrets before
conditions, so evaluate can't prevent missing-secret errors).
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When GHA triggers a manual pipeline for issue automation, ALL pipelines
with event:manual fire. Added evaluate conditions:
- issue-automation.yml: only runs when ISSUE_NUMBER is set
- provision-user.yml: only runs when ISSUE_NUMBER is NOT set
- build-ci-image.yml: only runs when ISSUE_NUMBER is NOT set
This prevents build-ci-image from failing on missing registry_password
secret when issue automation triggers.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- daily-backup: handle rsync exit 23 (partial transfer) as OK for LUKS
noload mounts — in-flight writes have corrupt metadata from skipped
journal replay, but core data is intact
- daily-backup: clean up stale LUKS dm mappings from previous crashed
runs before attempting to open
- daily-backup: capture rsync exit code safely with set -e (|| pattern)
- kyverno: bump tier-4-aux requests.memory 2Gi→3Gi (servarr was at 83%)
- actualbudget: patched custom quota 5Gi→6Gi (was at 82%)
Verified: backup now completes status=0 (96 PVCs OK, 0 failed)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both issue-automation and postmortem pipelines were cd'ing into
~/code/infra before running Claude, missing the root CLAUDE.md
with beads config and project-wide instructions. Now cd to ~/code
and use relative agent paths from there.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Documents the centralized Beads/Dolt task tracking system used by all
Claude Code sessions. Covers architecture, session lifecycle, settings
hierarchy, known issues, and E2E test verification.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- technitium-password-sync: remove RWO encrypted PVC mount that caused
pods to stick in ContainerCreating on wrong nodes. Plugin install now
warns instead of failing when zip unavailable.
- daily-backup: add LUKS decryption support for encrypted PVC snapshots
using /root/.luks-backup-key. Uses noload mount option to skip ext4
journal replay. Also installed cryptsetup-bin on PVE host.
- speedtest: disable prometheus.io/scrape annotation (no /prometheus
endpoint exists, causing ScrapeTargetDown alert).
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add WOODPECKER_BACKEND_K8S_PULL_SECRET_NAMES to agent env so step
pods can pull from private registry (registry.viktorbarzin.me:5050)
- Add fallback in default.yml when HEAD~1 is unavailable (shallow
clone with depth=1): fetch more history, or apply all platform
stacks as safe default
- Root cause: pipeline #243 failed because infra-ci:latest image
couldn't be pulled (no imagePullSecrets on step pods)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add .githooks/pre-commit that blocks files >2MB (configurable via
GIT_MAX_FILE_SIZE). Activate with: git config core.hooksPath .githooks
- Expand .gitignore to block common binary/archive patterns
(*.tar.gz, *.tgz, *.iso, *.img, *.bin, *.exe, *.dmg)
- Add explicit root-level terraform.tfstate ignore rules
- Remove stale redis-25.3.2.tgz helm chart (unreferenced)
Prevents re-accumulation of large blobs after git history cleanup
that reduced .git from 2.6GB to 128MB.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove cli/cli (12.5MB), cli/infra_cli (12MB), clipboard-upload (8.7MB)
from git tracking. These are build outputs that should be generated by CI.
Add patterns to .gitignore to prevent re-committing.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- New custom CI Docker image (ci/Dockerfile) with TF 1.5.7, TG 0.99.4,
git-crypt, sops, kubectl pre-installed. Pushed to private registry.
Eliminates 17 apk add calls + binary downloads per pipeline run.
- Unified CI pipeline: merge default.yml + app-stacks.yml into one.
Changed-stacks-only detection (git diff, with global-file fallback).
Concurrency limit (xargs -P 4). Step consolidation (2 steps vs 4).
Shallow clone (depth=2). Provider cache (TF_PLUGIN_CACHE_DIR).
- Per-stack Vault advisory locks in scripts/tg. 30min TTL with stale
lock detection. Blocks concurrent applies to same stack.
- TF_PLUGIN_CACHE_DIR enabled by default in scripts/tg for local dev.
- Daily drift detection pipeline (.woodpecker/drift-detection.yml).
Runs terraform plan on all stacks, Slack alert on drift.
- CI image build pipeline (.woodpecker/build-ci-image.yml).
Expected speedup: ~5-10 min per pipeline run → ~2-4 min.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add separate CronJobs that dump each database individually:
- postgresql-backup-per-db: pg_dump -Fc per DB (daily 00:15)
- mysql-backup-per-db: mysqldump per DB (daily 00:45)
Dumps go to /backup/per-db/<dbname>/ on the same NFS PVC.
Enables single-database restore without affecting other databases.
Also fixed CNPG superuser password sync and added --single-transaction
--set-gtid-purged=OFF to MySQL per-db dumps.
Updated restore runbooks with per-database restore procedures.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Expand service list: add Home Assistant, Actual Budget, Audiobookshelf,
Linkwarden, Matrix, Paperless, Tandoor, FreshRSS, Frigate, HackMD,
Excalidraw, Wealthfolio, Send, Stirling PDF
- Add structured debugging fields: error type, scope (just me vs others),
when it started, URL accessed
- Fix user report parser to extract all form fields into status.json
- Show error type, scope, and start time in status page report cards
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Status page (status.viktorbarzin.me): incident cards with SEV badges,
expandable timelines, postmortem links, user report rendering
- Issue templates on infra repo for user outage reports
- CronJob reads incidents + user-reports from ViktorBarzin/infra
- "Report an Outage" button on status page links to infra repo
- Post-mortem agents restored (4-stage pipeline: triage → investigation
→ historian → report writer) with updated paths and issue linking
- Post-mortem skill/template updated to link reports to GitHub Issues
and manage postmortem-required/postmortem-done labels
- Labels: incident, sev1-3, user-report, postmortem-required,
postmortem-done on infra repo
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Increase Uptime Kuma API timeout to 120s with wait_events=0.2
- Remove hardcoded password, use Vault or UPTIME_KUMA_PASSWORD env var
- Report internal and external monitor status separately
- Install uptime-kuma-api in local venv
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Increase socket timeout from 30s to 120s (121+ monitors need time to sync)
- Add wait_events=0.2 for reliable login
- Fix accepted_statuscodes format: use 100-increment ranges not arbitrary
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add automatic external HTTPS monitors to Uptime Kuma for ~96 services
exposed via Cloudflare tunnel. A sync CronJob (every 10min) reads from
a Terraform-generated ConfigMap and creates/deletes [External] monitors
to match cloudflare_proxied_names. Status page groups these separately
as "External Reachability" and pushes a divergence metric to Pushgateway
when services are externally down but internally up. Prometheus alert
ExternalAccessDivergence fires after 15min of divergence.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds "Reporting an Issue" section with:
- Where to report (Slack, GitHub, DM)
- What to include (examples of good vs bad reports)
- What happens after reporting (flow diagram)
- Self-service status checks (Uptime Kuma, Grafana, K8s Dashboard)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mark all 8 safe TODOs as Done. Add Follow-up Implementation table with commit
SHAs. Flag 3 Migration TODOs as needing human review.
Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>