- Add WOODPECKER_BACKEND_K8S_PULL_SECRET_NAMES to agent env so step
pods can pull from the private registry (registry.viktorbarzin.me:5050)
- Add fallback in default.yml for when HEAD~1 is unavailable (shallow
clone with depth=1): fetch more history, or apply all platform
stacks as a safe default
- Root cause: pipeline #243 failed because infra-ci:latest image
couldn't be pulled (no imagePullSecrets on step pods)
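
A minimal sketch of the HEAD~1 fallback described above, not the actual default.yml logic. The callables `has_ref`, `deepen`, and `changed_stacks` are hypothetical stand-ins: in a real pipeline they might wrap `git rev-parse --verify --quiet HEAD~1`, `git fetch --deepen=<n>`, and a `git diff --name-only` based lookup.

```python
from typing import Callable, Iterable, List, Optional


def resolve_base_ref(has_ref: Callable[[str], bool],
                     deepen: Callable[[], None]) -> Optional[str]:
    """Return 'HEAD~1' if it is reachable, deepening the clone once if needed."""
    if has_ref("HEAD~1"):
        return "HEAD~1"
    deepen()  # shallow clone: try to fetch more history first
    if has_ref("HEAD~1"):
        return "HEAD~1"
    return None  # still unavailable, caller falls back to "apply everything"


def stacks_to_apply(base_ref: Optional[str],
                    changed_stacks: Callable[[str], Iterable[str]],
                    all_platform_stacks: Iterable[str]) -> List[str]:
    """Changed stacks when a base ref exists, otherwise all platform stacks."""
    if base_ref is None:
        return sorted(all_platform_stacks)
    return sorted(changed_stacks(base_ref))
```

Applying every platform stack when no base ref can be found is slower but cannot skip a stack that actually changed, which is why it serves as the safe default.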
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add .githooks/pre-commit that blocks files >2MB (configurable via
GIT_MAX_FILE_SIZE). Activate with: git config core.hooksPath .githooks
- Expand .gitignore to block common binary/archive patterns
(*.tar.gz, *.tgz, *.iso, *.img, *.bin, *.exe, *.dmg)
- Add explicit root-level terraform.tfstate ignore rules
- Remove stale redis-25.3.2.tgz helm chart (unreferenced)
Prevents re-accumulation of large blobs after the git history cleanup
that reduced .git from 2.6GB to 128MB.
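
For illustration, the size check such a hook performs could look like the sketch below. It is written in Python rather than as the real .githooks/pre-commit script, and it assumes GIT_MAX_FILE_SIZE is given in bytes and that 2MB means 2 * 1024 * 1024; neither detail is confirmed by this commit.

```python
import os
from typing import Dict, List, Mapping, Optional

# Assumption: "2MB" is interpreted as 2 MiB.
DEFAULT_LIMIT_BYTES = 2 * 1024 * 1024


def max_file_size(env: Optional[Mapping[str, str]] = None) -> int:
    """Read the limit from GIT_MAX_FILE_SIZE (assumed bytes), else use the default."""
    env = os.environ if env is None else env
    raw = env.get("GIT_MAX_FILE_SIZE")
    return int(raw) if raw else DEFAULT_LIMIT_BYTES


def oversized(staged_sizes: Dict[str, int], limit: int) -> List[str]:
    """Return the staged paths whose size is strictly above the limit."""
    return sorted(path for path, size in staged_sizes.items() if size > limit)
```

A hook built on this would exit non-zero whenever `oversized(...)` returns a non-empty list, which is what blocks the commit.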
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove cli/cli (12.5MB), cli/infra_cli (12MB), clipboard-upload (8.7MB)
from git tracking. These are build outputs that should be generated by CI.
Add patterns to .gitignore to prevent re-committing.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- New custom CI Docker image (ci/Dockerfile) with Terraform 1.5.7,
Terragrunt 0.99.4, git-crypt, sops, and kubectl pre-installed. Pushed
to the private registry. Eliminates 17 apk add calls + binary downloads
per pipeline run.
- Unified CI pipeline: merge default.yml + app-stacks.yml into one.
Changed-stacks-only detection (git diff, with global-file fallback).
Concurrency limit (xargs -P 4). Step consolidation (2 steps vs 4).
Shallow clone (depth=2). Provider cache (TF_PLUGIN_CACHE_DIR).
- Per-stack Vault advisory locks in scripts/tg. 30min TTL with stale
lock detection. Blocks concurrent applies to same stack.
- TF_PLUGIN_CACHE_DIR enabled by default in scripts/tg for local dev.
- Daily drift detection pipeline (.woodpecker/drift-detection.yml).
Runs terraform plan on all stacks, Slack alert on drift.
- CI image build pipeline (.woodpecker/build-ci-image.yml).
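
The changed-stacks detection and the concurrency limit from the list above can be sketched as follows. This is an illustration only: the `stacks/<name>/...` directory layout, the set of global files, and the `apply_one` callable are assumptions, and a thread pool stands in for `xargs -P 4`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Set


def changed_stacks(changed_files: Iterable[str],
                   all_stacks: Set[str],
                   global_files: Set[str],
                   stacks_dir: str = "stacks") -> List[str]:
    """Map changed paths to stacks; a changed global file selects every stack."""
    files = list(changed_files)
    if any(f in global_files for f in files):
        return sorted(all_stacks)  # global-file fallback
    prefix = stacks_dir + "/"
    hit = set()
    for f in files:
        if f.startswith(prefix):
            parts = f[len(prefix):].split("/", 1)
            if len(parts) == 2 and parts[0] in all_stacks:
                hit.add(parts[0])
    return sorted(hit)


def apply_all(stacks: List[str],
              apply_one: Callable[[str], object],
              max_parallel: int = 4) -> Dict[str, object]:
    """Run apply_one per stack with at most max_parallel running at once."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return dict(zip(stacks, pool.map(apply_one, stacks)))
```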
Expected speedup: pipeline runs drop from ~5-10 min to ~2-4 min.
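
As a rough model of the per-stack advisory lock with a 30min TTL, the sketch below uses an in-memory dict in place of Vault. The record shape and the function name are hypothetical, and a real implementation in scripts/tg would need an atomic write against Vault that this sketch does not attempt.

```python
import time
from typing import Dict, Optional

LOCK_TTL_SECONDS = 30 * 60  # 30min TTL from the commit description


def try_acquire(locks: Dict[str, dict],
                stack: str,
                owner: str,
                now: Optional[float] = None,
                ttl: int = LOCK_TTL_SECONDS) -> bool:
    """Take the lock for a stack unless a non-stale lock already exists."""
    now = time.time() if now is None else now
    held = locks.get(stack)
    if held is not None and now - held["acquired_at"] < ttl:
        return False  # live lock: block the concurrent apply
    # No lock, or the existing one is older than the TTL (stale): take it over.
    locks[stack] = {"owner": owner, "acquired_at": now}
    return True
```

The TTL is what keeps a crashed apply from blocking a stack indefinitely: once the lock is older than 30 minutes, the next caller treats it as stale and replaces it.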
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add separate CronJobs that dump each database individually:
- postgresql-backup-per-db: pg_dump -Fc per DB (daily 00:15)
- mysql-backup-per-db: mysqldump per DB (daily 00:45)
Dumps go to /backup/per-db/<dbname>/ on the same NFS PVC.
Enables single-database restore without affecting other databases.
Also fix CNPG superuser password sync and add --single-transaction
--set-gtid-purged=OFF to MySQL per-db dumps.
Update restore runbooks with per-database restore procedures.
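
To make the per-database layout concrete, here is a sketch of how the dump commands could be assembled. The file naming (`<dbname>-<stamp>.dump` / `.sql`) and the use of `-f` / `--result-file` are assumptions; only the `-Fc`, `--single-transaction`, and `--set-gtid-purged=OFF` flags and the /backup/per-db/<dbname>/ directory come from this commit.

```python
from pathlib import PurePosixPath
from typing import List

BACKUP_ROOT = PurePosixPath("/backup/per-db")


def pg_dump_cmd(db: str, stamp: str) -> List[str]:
    """Custom-format pg_dump of one database into its own directory."""
    out = BACKUP_ROOT / db / f"{db}-{stamp}.dump"  # hypothetical file name
    return ["pg_dump", "-Fc", "-f", str(out), db]


def mysqldump_cmd(db: str, stamp: str) -> List[str]:
    """mysqldump of one database with the flags named in the commit."""
    out = BACKUP_ROOT / db / f"{db}-{stamp}.sql"  # hypothetical file name
    return [
        "mysqldump",
        "--single-transaction",
        "--set-gtid-purged=OFF",
        f"--result-file={out}",
        db,
    ]
```

Because each database lands in its own directory, a restore can target one dump file without touching the others.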
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Expand service list: add Home Assistant, Actual Budget, Audiobookshelf,
Linkwarden, Matrix, Paperless, Tandoor, FreshRSS, Frigate, HackMD,
Excalidraw, Wealthfolio, Send, Stirling PDF
- Add structured debugging fields: error type, scope (just me vs others),
when it started, URL accessed
- Fix user report parser to extract all form fields into status.json
- Show error type, scope, and start time in status page report cards
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Status page (status.viktorbarzin.me): incident cards with SEV badges,
expandable timelines, postmortem links, user report rendering
- Issue templates on infra repo for user outage reports
- CronJob reads incidents + user-reports from ViktorBarzin/infra
- "Report an Outage" button on status page links to infra repo
- Post-mortem agents restored (4-stage pipeline: triage → investigation
→ historian → report writer) with updated paths and issue linking
- Post-mortem skill/template updated to link reports to GitHub Issues
and manage postmortem-required/postmortem-done labels
- Labels: incident, sev1-3, user-report, postmortem-required,
postmortem-done on infra repo
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Increase Uptime Kuma API timeout to 120s with wait_events=0.2
- Remove hardcoded password, use Vault or UPTIME_KUMA_PASSWORD env var
- Report internal and external monitor status separately
- Install uptime-kuma-api in local venv
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Increase socket timeout from 30s to 120s (121+ monitors need time to sync)
- Add wait_events=0.2 for reliable login
- Fix accepted_statuscodes format: use 100-increment ranges, not arbitrary ones
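
One way to normalise a list of status codes into 100-increment ranges is sketched below. The `"NNN-NNN"` string form and the helper name are assumptions made for illustration; the commit only states that the ranges must be 100-increment.

```python
from typing import Iterable, List


def hundred_ranges(codes: Iterable[int]) -> List[str]:
    """Collapse status codes into the 100-increment ranges that contain them."""
    buckets = sorted({(code // 100) * 100 for code in codes})
    return [f"{start}-{start + 99}" for start in buckets]
```

For example, codes 200, 204, and 301 collapse into the two ranges covering the 2xx and 3xx classes.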
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add automatic external HTTPS monitors to Uptime Kuma for ~96 services
exposed via Cloudflare tunnel. A sync CronJob (every 10min) reads from
a Terraform-generated ConfigMap and creates/deletes [External] monitors
to match cloudflare_proxied_names. Status page groups these separately
as "External Reachability" and pushes a divergence metric to Pushgateway
when services are externally down but internally up. Prometheus alert
ExternalAccessDivergence fires after 15min of divergence.
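
The reconcile step and the divergence check described above can be sketched as below. This is a simplified model, not the CronJob's code: the exact monitor naming (`"[External] " + hostname`) and the boolean status maps are assumptions.

```python
from typing import Dict, Iterable, List, Tuple

PREFIX = "[External] "  # assumed naming scheme for managed monitors


def plan_sync(desired_hosts: Iterable[str],
              existing_monitor_names: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (to_create, to_delete), touching only [External] monitors."""
    desired = {PREFIX + host for host in desired_hosts}
    managed = {name for name in existing_monitor_names if name.startswith(PREFIX)}
    return sorted(desired - managed), sorted(managed - desired)


def divergent(internal_up: Dict[str, bool],
              external_up: Dict[str, bool]) -> List[str]:
    """Services that are up internally but explicitly down externally."""
    return sorted(
        svc for svc, up in internal_up.items()
        if up and external_up.get(svc) is False
    )
```

Restricting deletions to names carrying the prefix keeps the sync from removing monitors that were created by hand.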
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds "Reporting an Issue" section with:
- Where to report (Slack, GitHub, DM)
- What to include (examples of good vs bad reports)
- What happens after reporting (flow diagram)
- Self-service status checks (Uptime Kuma, Grafana, K8s Dashboard)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mark all 8 safe TODOs as Done. Add Follow-up Implementation table with commit
SHAs. Flag 3 Migration TODOs as needing human review.
Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
- Add PrometheusRule: NFSHighRPCRetransmissions fires when the rate of
node_nfs_rpc_retransmissions_total exceeds 5/s for 5m, catching NFS server
degradation before pod failures cascade
- Migrate alertmanager PV from NFS (192.168.1.127:/srv/nfs/alertmanager) to proxmox-lvm-encrypted,
eliminating the circular dependency where alertmanager couldn't alert about NFS failures
- Set force_update=true on prometheus helm_release to handle StatefulSet volumeClaimTemplate changes
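
To illustrate the "exceeds 5/s for 5m" condition, the sketch below models the hold period in plain Python. It is a simplification for reasoning about the threshold, not how Prometheus evaluates rules, and it takes precomputed per-second rates rather than raw counter samples.

```python
from typing import Iterable, Optional, Tuple


def alert_fires(rates: Iterable[Tuple[float, float]],
                threshold: float = 5.0,
                hold_seconds: float = 300.0) -> bool:
    """rates: (timestamp_seconds, retransmissions_per_second), ascending by time."""
    breach_start: Optional[float] = None
    for ts, rate in rates:
        if rate > threshold:
            if breach_start is None:
                breach_start = ts  # first sample above the threshold
            if ts - breach_start >= hold_seconds:
                return True  # stayed above the threshold for the full hold period
        else:
            breach_start = None  # any dip below resets the hold period
    return False
```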
Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>