The rewrite-body plugin (rybbit analytics, anti-AI trap links) requires
strip-accept-encoding to work, which killed HTTP compression for 50+
services. This adds Traefik's built-in compress middleware at the
websecure entrypoint level to re-compress responses to clients after
rewrite-body has modified them.
Uses includedContentTypes whitelist (not excludedContentTypes) so only
text-based types are compressed. SSE, WebSocket, gRPC, and binary
downloads are unaffected.
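A minimal sketch of the wiring, written as Traefik file-provider YAML (this repo drives Traefik from Terraform, so the middleware name and the exact content-type list are illustrative; `includedContentTypes` requires a reasonably recent Traefik):

```yaml
# Static config: attach the middleware to every router on websecure.
entryPoints:
  websecure:
    address: ":443"
    http:
      middlewares:
        - compress-text@file

# Dynamic config (file provider): whitelist text-based types only, so
# SSE, WebSocket, gRPC, and binary downloads pass through untouched.
http:
  middlewares:
    compress-text:
      compress:
        includedContentTypes:
          - text/html
          - text/plain
          - text/css
          - application/javascript
          - application/json
```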
Measured improvement on ha-sofia:
- app.js: 540KB → 167KB (3.2x)
- core.js: 52KB → 19KB (2.7x)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
10.0.20.200:5432/terraform_state with native pg_advisory_lock.
Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.
Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)
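The conditional backend in terragrunt.hcl reduces to roughly this (locals are simplified; the credential plumbing lives in scripts/tg, and the connection user and `sslmode` here are assumptions):

```hcl
locals {
  stack    = basename(get_terragrunt_dir())
  tier0    = ["infra", "platform", "cnpg", "vault", "dbaas", "external-secrets"]
  is_tier0 = contains(local.tier0, local.stack)
}

remote_state {
  backend = local.is_tier0 ? "local" : "pg"

  config = local.is_tier0 ? {
    # Tier 0: SOPS-encrypted local state in git, required for bootstrap
    path = "${get_terragrunt_dir()}/terraform.tfstate"
  } : {
    # Tier 1: Terraform's pg backend; locking via native pg_advisory_lock
    conn_str    = "postgres://terraform@10.0.20.200:5432/terraform_state?sslmode=disable"
    schema_name = local.stack
  }

  generate = {
    path      = "backend.tf"
    if_exists = "overwrite"
  }
}
```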
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
Disk write analysis showed MySQL InnoDB Cluster writing ~95 GB/day for only
~35 MB of actual data due to Group Replication overhead (binlog, relay log,
GR apply log). The operator enforces GR even with serverInstances=1.
Bitnami Helm charts were deprecated by Broadcom in Aug 2025 — no free
container images available. Using official mysql:8.4 image instead.
## This change:
- Replace helm_release.mysql_cluster with a raw kubernetes_stateful_set_v1
  using the official mysql:8.4 image
- ConfigMap mysql-standalone-cnf: skip-log-bin, innodb_flush_log_at_trx_commit=2,
innodb_doublewrite=ON (re-enabled for standalone safety)
- Service selector switched to standalone pod labels
- Technitium: disable SQLite query logging (18 GB/day write amplification),
keep PostgreSQL-only logging (90-day retention)
- Grafana datasource and dashboards migrated from MySQL to PostgreSQL
- Dashboard SQL queries fixed for PG integer division (::float cast)
- Updated CLAUDE.md service-specific notes
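The integer-division fix is the usual one-operand cast; a hypothetical dashboard query (table and column names are illustrative):

```sql
-- MySQL returns 0.42 for 42/100, but PostgreSQL truncates integer
-- division to 0, flat-lining any ratio panel. Casting one operand
-- makes the whole expression float; NULLIF guards division by zero.
SELECT sum(hits)::float / NULLIF(sum(total), 0) AS hit_ratio
FROM   dns_stats;
```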
## What is NOT in this change:
- InnoDB Cluster + operator removal (Phase 4, 7+ days from now)
- Stale Vault role cleanup (Phase 4)
- Old PVC deletion (Phase 4)
Expected write reduction: ~113 GB/day (MySQL 95 + Technitium 18)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
Version 1.3.0+ changed the recommended command from `bin/dev` (development)
to `bin/rails server -p 3000 -b ::` (production). Also requires RAILS_ENV=production,
SECRET_KEY_BASE, and RAILS_LOG_TO_STDOUT env vars.
## This change
- Command: `bin/dev` → `bin/rails server -p 3000 -b ::`
- Add RAILS_ENV=production
- Add SECRET_KEY_BASE (stored in Vault secret/dawarich, synced via ESO)
- Add RAILS_LOG_TO_STDOUT=true
## What happened
1. Initial upgrade applied version 1.6.1 — DB migrations ran but pod
CrashLooped due to wrong entrypoint (bin/dev exits in production mode)
2. Rollback to 0.37.1 failed because 1.6.1 migrations already ran
(ActiveRecord::UnknownPrimaryKey on rails_pulse_routes)
3. Rolled forward with corrected entrypoint + env vars
4. Service now stable: 20/20 health checks passed over 5 minutes
Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
v2.20.14 was OOMKilled at 1Gi during the search index rebuild on upgrade.
Bumped to 2Gi (request = limit) to handle startup index operations.
Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
DB migrations from 1.6.1 already ran, making 0.37.1 incompatible
(ActiveRecord::UnknownPrimaryKey on rails_pulse_routes table).
Rolling forward is the correct path.
Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
Changelog summary: Major version bump spanning 13 releases. v9.0.0 adds PDF editor
API, macro recording, Service Worker caching. v9.2.1 fixes critical security vulns
(XSS, memory manipulation leading to RCE in XLS conversion). v9.3.0 adds GIF animations,
multiple pages view, signature settings, hyperlinks on images/shapes.
Risk: CAUTION (major version bump 8→9)
Breaking changes: none affecting Docker+MySQL deployment. PostgreSQL schema change
in v9.0.0 (irrelevant — we use MySQL). API endpoint deprecations (ConvertService.ashx,
GET requests to converter/command) — not removals. Config parameter renames
(leftMenu->layout.leftMenu etc.) are editor JS API, not server config.
DB backup: yes (job: pre-upgrade-onlyoffice-1776357277, MySQL full dump)
Config changes applied: none required
Flagged for manual review: none
Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
Changelog summary: Major version bump. v5.0.0 removes QR code generation,
REDIRECT_APPEND_EXTRA_PATH env var, and trusted proxy auto-detection.
Various CLI option removals. v4.4-4.6 added REDIRECT_EXTRA_PATH_MODE,
DB_USE_ENCRYPTION, TRUSTED_PROXIES, CORS controls, FrankenPHP support.
Risk: CAUTION (major version bump 4→5)
Breaking changes: QR codes removed, REDIRECT_APPEND_EXTRA_PATH removed,
trusted proxy auto-detection removed, CLI option renames
DB backup: yes (job: pre-upgrade-url-1776357271, completed)
Config changes applied: none (no affected env vars in current config)
Flagged for manual review: TRUSTED_PROXIES env var may be needed
(Shlink behind Cloudflare + Traefik = 2 proxies, auto-detection removed in 5.0.0)
Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
Changelog summary: Security fixes (CVE-2025-69217, CVE-2026-27624,
CVE-2026-40613), performance improvements (recvmmsg, lock-free atomics),
memory safety fixes, and DDoS handling improvements.
Risk: CAUTION (4.7.0 has breaking changes for deprecated config options)
Breaking changes: 4.7.0 removed keep-address-family,
response-origin-only-with-rfc5780, inverted no-stun-backward-compatibility.
None of these are in our config — no impact.
DB backup: no (not DB-backed)
Config changes applied: none (no-tlsv1, no-tlsv1_1, no-cli now unnecessary
but still accepted — no removal needed)
Flagged for manual review: none
Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
Terragrunt now generates cloudflare_provider.tf (Vault-sourced API key)
and includes cloudflare in required_providers. These are the generated
files from running `terragrunt init -upgrade` across all stacks.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Changelog summary: Security fixes (IDOR vulnerabilities in sessions/progress/bookmarks),
DB index + query parallelization for discover performance, crash fixes, HTML sanitization
on playlist/collection/podcast endpoints, API key enabled/disabled fix.
Risk: SAFE
Breaking changes: none
DB backup: no (not DB-backed)
Config changes applied: none
Flagged for manual review: none
Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
## Context
Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.
## This change:
- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
`*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
dns_type. 17 hostnames remain centrally managed (Helm ingresses,
special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.
```
BEFORE                           AFTER

config.tfvars (manual list)      stacks/<svc>/main.tf
        |                          module "ingress" {
        v                            dns_type = "proxied"
stacks/cloudflared/                }
  for_each = list                        |
  cloudflare_record                      v
  tunnel per-hostname            auto-creates
                                 cloudflare_record + annotation
```
## What is NOT in this change:
- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)
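Per-service wiring then looks roughly like this (module arguments besides `dns_type` are illustrative):

```hcl
module "ingress" {
  source   = "../../modules/ingress_factory"
  name     = "myapp"                 # hypothetical service
  hostname = "myapp.viktorbarzin.me"

  # "proxied"     -> CNAME to the Cloudflare tunnel (orange cloud)
  # "non-proxied" -> A/AAAA record to the public IP
  dns_type = "proxied"
}
```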
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dolt Workbench hardcodes http://localhost:9002/graphql in the built JS.
For k8s hosting, an init container patches this to the relative /graphql path.
Second ingress routes /graphql to port 9002 behind Authentik auth.
- Init container copies static JS to writable emptyDir, patches URL
- Pre-seeds store.json with Dolt connection config
- Added /graphql ingress with Authentik forward-auth
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Traefik records websocket connection lifetimes (minutes to hours) as
"request duration." When websockets close, the full lifetime pollutes
the average latency metric — Authentik showed 6.7s avg (201s websocket
avg) vs 0.065s actual HTTP avg. This caused ~90 false alerts/day across
12 services (Authentik, Vaultwarden, Terminal, HA, etc.).
Changes:
- Add protocol!="websocket" filter to HighServiceLatency alert expr
- Raise minimum traffic threshold from 0.01 to 0.05 rps to filter
statistical noise from services with <3 req/min
- Remove .githooks/pre-commit file-size hook (blocked state commits)
Validated against 7-day historical data: 637 breaches → ~2 with both
filters applied (99.7% reduction).
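With both filters, the alert expression looks roughly like this (Traefik's standard metric names; the 1s latency threshold and aggregation by `service` are assumptions, not copied from the rule file):

```yaml
- alert: HighServiceLatency
  expr: |
    (
        sum by (service) (rate(traefik_service_request_duration_seconds_sum{protocol!="websocket"}[10m]))
      /
        sum by (service) (rate(traefik_service_request_duration_seconds_count{protocol!="websocket"}[10m]))
    ) > 1
    and
    sum by (service) (rate(traefik_service_requests_total{protocol!="websocket"}[10m])) > 0.05
  for: 15m
```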
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pipeline pods pull from registry.viktorbarzin.me:5050 but the
registry-credentials secret only had auth for registry.viktorbarzin.me
(without port). Containerd requires exact hostname:port match.
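Conceptually the fix is a second auth entry keyed by the exact host:port (decoded `.dockerconfigjson`; credential values are placeholders):

```json
{
  "auths": {
    "registry.viktorbarzin.me":      { "auth": "<base64 user:pass>" },
    "registry.viktorbarzin.me:5050": { "auth": "<base64 user:pass>" }
  }
}
```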
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The vault-woodpecker-sync script was creating global secrets with only
push/tag/deployment events. Manual and cron-triggered pipelines couldn't
access secrets, causing "secret not found" errors and pipeline failures.
Also fixes three root causes of CI failures:
1. Pull-through cache corruption: purged stale blobs, added post-GC
registry restart cron to prevent recurrence
2. Missing repo-level secrets: added registry_user/registry_password
for the infra repo's build-ci-image workflow
3. Stuck pipelines: cleaned up 3 pipelines stuck in "running" since March
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Set protected=true on ingress (Authentik forward-auth)
- Remove unused DATABASE_URL env var (Workbench uses browser-based connection config)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Deploy dolthub/dolt-workbench alongside the Dolt server in beads-server
namespace. Provides SQL console, spreadsheet editor, and commit graph
visualization for the centralized beads task database.
- Workbench at dolt-workbench.viktorbarzin.me (Cloudflare-proxied)
- Connects to Dolt server via in-cluster service DNS
- Added to cloudflare_proxied_names for external access
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pipeline pods were failing with "authorization failed: no basic auth
credentials" when pulling from the private registry. The
WOODPECKER_BACKEND_K8S_PULL_SECRET_NAMES env var was in values.yaml but
never deployed to the agents.
Also removes the stale db-init job that used `-U root` (incompatible
with CNPG's `postgres` superuser). The database already exists.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Image ghcr.io/toeverything/affine:0.20.7 was removed from ghcr.io,
causing persistent ImagePullBackOff. Updated to latest stable 0.26.6.
Prisma migrations run via init container on startup.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add cleanup-failed-pods policy that runs hourly (at :15) to delete all
pods in Failed phase cluster-wide. Prevents stale evicted and failed
CronJob pods from accumulating and creating healthcheck noise.
Also adds ClusterRole + ClusterRoleBinding to grant Kyverno cleanup
controller permission to delete Pods (not included by default).
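A sketch of the policy (the `apiVersion` depends on the installed Kyverno version; shape follows Kyverno's cleanup-policy docs):

```yaml
apiVersion: kyverno.io/v2
kind: ClusterCleanupPolicy
metadata:
  name: cleanup-failed-pods
spec:
  schedule: "15 * * * *"      # hourly at :15
  match:
    any:
      - resources:
          kinds:
            - Pod             # cluster-wide: no namespace selector
  conditions:
    all:
      - key: "{{ target.status.phase }}"
        operator: Equals
        value: Failed
```

The accompanying ClusterRole grants the Kyverno cleanup controller `delete` on `pods`, which its default RBAC does not include.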
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- daily-backup: handle rsync exit 23 (partial transfer) as OK for LUKS
noload mounts — in-flight writes have corrupt metadata from skipped
journal replay, but core data is intact
- daily-backup: clean up stale LUKS dm mappings from previous crashed
runs before attempting to open
- daily-backup: capture rsync exit code safely with set -e (|| pattern)
- kyverno: bump tier-4-aux requests.memory 2Gi→3Gi (servarr was at 83%)
- actualbudget: patched custom quota 5Gi→6Gi (was at 82%)
Verified: backup now completes status=0 (96 PVCs OK, 0 failed)
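The exit-23 handling can be sketched as a small helper (paths in the comment are illustrative):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Treat rsync exit 23 (partial transfer) as OK; anything else is fatal.
# Exit 23 is expected for LUKS snapshots mounted with `-o noload`:
# files that were mid-write have unreplayed journal metadata.
check_rsync_rc() {
  local rc=$1
  if [[ $rc -eq 0 || $rc -eq 23 ]]; then
    echo "OK (rsync exit $rc)"
  else
    echo "FAIL (rsync exit $rc)" >&2
    return "$rc"
  fi
}

# In the real job this wraps the copy, capturing the code without
# tripping `set -e` (the `|| rc=$?` arm runs only on failure):
#   rc=0
#   rsync -aHAX /mnt/snap/ /backup/ || rc=$?
#   check_rsync_rc "$rc"
check_rsync_rc 23
```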
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- technitium-password-sync: remove RWO encrypted PVC mount that caused
pods to stick in ContainerCreating on wrong nodes. Plugin install now
warns instead of failing when zip unavailable.
- daily-backup: add LUKS decryption support for encrypted PVC snapshots
using /root/.luks-backup-key. Uses noload mount option to skip ext4
journal replay. Also installed cryptsetup-bin on PVE host.
- speedtest: disable prometheus.io/scrape annotation (no /prometheus
endpoint exists, causing ScrapeTargetDown alert).
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add WOODPECKER_BACKEND_K8S_PULL_SECRET_NAMES to agent env so step
pods can pull from private registry (registry.viktorbarzin.me:5050)
- Add fallback in default.yml when HEAD~1 is unavailable (shallow
clone with depth=1): fetch more history, or apply all platform
stacks as safe default
- Root cause: pipeline #243 failed because infra-ci:latest image
couldn't be pulled (no imagePullSecrets on step pods)
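The shallow-clone fallback can be sketched like this ("ALL" stands in for the pipeline's "apply all platform stacks" default; the deepen amount is an assumption):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Decide which stacks changed; fall back safely when the clone is
# shallow (depth=1) and HEAD~1 does not exist.
changed_paths() {
  if git rev-parse --verify --quiet HEAD~1 >/dev/null; then
    git diff --name-only HEAD~1 HEAD
    return
  fi
  # Try to fetch more history first (a no-op if there is no remote).
  git fetch --deepen=50 2>/dev/null || true
  if git rev-parse --verify --quiet HEAD~1 >/dev/null; then
    git diff --name-only HEAD~1 HEAD
  else
    echo ALL
  fi
}
```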
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add .githooks/pre-commit that blocks files >2MB (configurable via
GIT_MAX_FILE_SIZE). Activate with: git config core.hooksPath .githooks
- Expand .gitignore to block common binary/archive patterns
(*.tar.gz, *.tgz, *.iso, *.img, *.bin, *.exe, *.dmg)
- Add explicit root-level terraform.tfstate ignore rules
- Remove stale redis-25.3.2.tgz helm chart (unreferenced)
Prevents re-accumulation of large blobs after git history cleanup
that reduced .git from 2.6GB to 128MB.
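The hook's core check is a size filter over the staged file list (env var name from this change; the function name is illustrative):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Block any listed file larger than GIT_MAX_FILE_SIZE bytes
# (2 MB default). Reads file paths, one per line, on stdin.
max=${GIT_MAX_FILE_SIZE:-2097152}

check_sizes() {
  local f size bad=0
  while IFS= read -r f; do
    [ -f "$f" ] || continue
    size=$(wc -c < "$f")
    if [ "$size" -gt "$max" ]; then
      echo "blocked: $f (${size} bytes > ${max})" >&2
      bad=1
    fi
  done
  return "$bad"
}

# The real hook feeds it the staged files:
#   git diff --cached --name-only --diff-filter=AM | check_sizes
```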
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add separate CronJobs that dump each database individually:
- postgresql-backup-per-db: pg_dump -Fc per DB (daily 00:15)
- mysql-backup-per-db: mysqldump per DB (daily 00:45)
Dumps go to /backup/per-db/<dbname>/ on the same NFS PVC.
Enables single-database restore without affecting other databases.
Also fixed CNPG superuser password sync and added --single-transaction
--set-gtid-purged=OFF to MySQL per-db dumps.
Updated restore runbooks with per-database restore procedures.
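The PostgreSQL side of the loop can be sketched as follows (connection flags omitted; the helper name is illustrative, the destination layout is from this change):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Emit the dump command for one database.
# Destination layout: /backup/per-db/<dbname>/
pg_dump_cmd() {
  local db=$1
  echo "pg_dump -Fc -f /backup/per-db/${db}/${db}.dump ${db}"
}

# In the CronJob this drives the loop:
#   psql -Atc "SELECT datname FROM pg_database WHERE NOT datistemplate" |
#   while IFS= read -r db; do
#     mkdir -p "/backup/per-db/$db"
#     $(pg_dump_cmd "$db")
#   done
#
# The MySQL variant dumps with:
#   mysqldump --single-transaction --set-gtid-purged=OFF "$db"
pg_dump_cmd grafana
```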
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Expand service list: add Home Assistant, Actual Budget, Audiobookshelf,
Linkwarden, Matrix, Paperless, Tandoor, FreshRSS, Frigate, HackMD,
Excalidraw, Wealthfolio, Send, Stirling PDF
- Add structured debugging fields: error type, scope (just me vs others),
when it started, URL accessed
- Fix user report parser to extract all form fields into status.json
- Show error type, scope, and start time in status page report cards
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Status page (status.viktorbarzin.me): incident cards with SEV badges,
expandable timelines, postmortem links, user report rendering
- Issue templates on infra repo for user outage reports
- CronJob reads incidents + user-reports from ViktorBarzin/infra
- "Report an Outage" button on status page links to infra repo
- Post-mortem agents restored (4-stage pipeline: triage → investigation
→ historian → report writer) with updated paths and issue linking
- Post-mortem skill/template updated to link reports to GitHub Issues
and manage postmortem-required/postmortem-done labels
- Labels: incident, sev1-3, user-report, postmortem-required,
postmortem-done on infra repo
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Increase socket timeout from 30s to 120s (121+ monitors need time to sync)
- Add wait_events=0.2 for reliable login
- Fix accepted_statuscodes format: use 100-increment ranges (e.g. "200-299"),
  not arbitrary ranges
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add automatic external HTTPS monitors to Uptime Kuma for ~96 services
exposed via Cloudflare tunnel. A sync CronJob (every 10min) reads from
a Terraform-generated ConfigMap and creates/deletes [External] monitors
to match cloudflare_proxied_names. Status page groups these separately
as "External Reachability" and pushes a divergence metric to Pushgateway
when services are externally down but internally up. Prometheus alert
ExternalAccessDivergence fires after 15min of divergence.
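The divergence plumbing reduces to roughly this (the alert name is from this change; the metric name, labels, and severity are assumptions):

```yaml
# Pushed to Pushgateway by the sync CronJob when a service is down
# externally but up internally, e.g.:
#   external_access_divergence{service="vaultwarden"} 1
- alert: ExternalAccessDivergence
  expr: external_access_divergence > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: >-
      {{ $labels.service }} is unreachable via Cloudflare but healthy
      internally (possible tunnel or DNS issue)
```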
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add PrometheusRule: NFSHighRPCRetransmissions fires when node_nfs_rpc_retransmissions_total
rate exceeds 5/s for 5m — catches NFS server degradation before pod failures cascade
- Migrate alertmanager PV from NFS (192.168.1.127:/srv/nfs/alertmanager) to proxmox-lvm-encrypted
eliminating the circular dependency where alertmanager couldn't alert about NFS failures
- Set force_update=true on prometheus helm_release to handle StatefulSet volumeClaimTemplate changes
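The new rule as described reduces to roughly this (metric name and thresholds from this change; severity is an assumption):

```yaml
- alert: NFSHighRPCRetransmissions
  expr: rate(node_nfs_rpc_retransmissions_total[5m]) > 5
  for: 5m
  labels:
    severity: warning
```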
Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
Secondary/tertiary DNS instances had no custom zones — only the
primary had viktorbarzin.lan and viktorbarzin.me. The old setup Job
ran once at deployment and never synced new zones.
New CronJob runs every 30 minutes:
- Gets all zones from primary
- Enables zone transfer on primary
- Creates missing zones as Secondary type on replicas
- Resyncs existing zones via AXFR
Fixes .lan resolution failures (2/3 queries returned NXDOMAIN).
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Applied all 20 NFS stacks to converge PV mount_options (nfsvers=4).
State files encrypted and committed.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>