The rewrite-body plugin (rybbit analytics, anti-AI trap links) requires
strip-accept-encoding to work, which killed HTTP compression for 50+
services. This adds Traefik's built-in compress middleware at the
websecure entrypoint level to re-compress responses to clients after
rewrite-body has modified them.
Uses includedContentTypes whitelist (not excludedContentTypes) so only
text-based types are compressed. SSE, WebSocket, gRPC, and binary
downloads are unaffected.
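A minimal sketch of the wiring, written as Traefik file-provider YAML (this repo drives Traefik from Terraform, so the middleware name and the exact content-type list are illustrative; `includedContentTypes` requires a reasonably recent Traefik):

```yaml
# Static config: attach the middleware to every router on websecure.
entryPoints:
  websecure:
    address: ":443"
    http:
      middlewares:
        - compress-text@file

# Dynamic config (file provider): whitelist text-based types only, so
# SSE, WebSocket, gRPC, and binary downloads pass through untouched.
http:
  middlewares:
    compress-text:
      compress:
        includedContentTypes:
          - text/html
          - text/plain
          - text/css
          - application/javascript
          - application/json
```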
Measured improvement on ha-sofia:
- app.js: 540KB → 167KB (3.2x)
- core.js: 52KB → 19KB (2.7x)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
10.0.20.200:5432/terraform_state with native pg_advisory_lock.
Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.
Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)
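The conditional backend in terragrunt.hcl reduces to roughly this (locals are simplified; the credential plumbing lives in scripts/tg, and the connection user and `sslmode` here are assumptions):

```hcl
locals {
  stack    = basename(get_terragrunt_dir())
  tier0    = ["infra", "platform", "cnpg", "vault", "dbaas", "external-secrets"]
  is_tier0 = contains(local.tier0, local.stack)
}

remote_state {
  backend = local.is_tier0 ? "local" : "pg"

  config = local.is_tier0 ? {
    # Tier 0: SOPS-encrypted local state in git, required for bootstrap
    path = "${get_terragrunt_dir()}/terraform.tfstate"
  } : {
    # Tier 1: Terraform's pg backend; locking via native pg_advisory_lock
    conn_str    = "postgres://terraform@10.0.20.200:5432/terraform_state?sslmode=disable"
    schema_name = local.stack
  }

  generate = {
    path      = "backend.tf"
    if_exists = "overwrite"
  }
}
```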
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
Disk write analysis showed MySQL InnoDB Cluster writing ~95 GB/day for only
~35 MB of actual data due to Group Replication overhead (binlog, relay log,
GR apply log). The operator enforces GR even with serverInstances=1.
Bitnami Helm charts were deprecated by Broadcom in Aug 2025 — no free
container images available. Using official mysql:8.4 image instead.
## This change:
- Replace helm_release.mysql_cluster with a raw kubernetes_stateful_set_v1
  using the official mysql:8.4 image
- ConfigMap mysql-standalone-cnf: skip-log-bin, innodb_flush_log_at_trx_commit=2,
innodb_doublewrite=ON (re-enabled for standalone safety)
- Service selector switched to standalone pod labels
- Technitium: disable SQLite query logging (18 GB/day write amplification),
keep PostgreSQL-only logging (90-day retention)
- Grafana datasource and dashboards migrated from MySQL to PostgreSQL
- Dashboard SQL queries fixed for PG integer division (::float cast)
- Updated CLAUDE.md service-specific notes
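The integer-division fix is the usual one-operand cast; a hypothetical dashboard query (table and column names are illustrative):

```sql
-- MySQL returns 0.42 for 42/100, but PostgreSQL truncates integer
-- division to 0, flat-lining any ratio panel. Casting one operand
-- makes the whole expression float; NULLIF guards division by zero.
SELECT sum(hits)::float / NULLIF(sum(total), 0) AS hit_ratio
FROM   dns_stats;
```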
## What is NOT in this change:
- InnoDB Cluster + operator removal (Phase 4, 7+ days from now)
- Stale Vault role cleanup (Phase 4)
- Old PVC deletion (Phase 4)
Expected write reduction: ~113 GB/day (MySQL 95 + Technitium 18)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
Version 1.3.0+ changed the recommended command from `bin/dev` (development)
to `bin/rails server -p 3000 -b ::` (production). Also requires RAILS_ENV=production,
SECRET_KEY_BASE, and RAILS_LOG_TO_STDOUT env vars.
## This change
- Command: `bin/dev` → `bin/rails server -p 3000 -b ::`
- Add RAILS_ENV=production
- Add SECRET_KEY_BASE (stored in Vault secret/dawarich, synced via ESO)
- Add RAILS_LOG_TO_STDOUT=true
## What happened
1. Initial upgrade applied version 1.6.1 — DB migrations ran but pod
CrashLooped due to wrong entrypoint (bin/dev exits in production mode)
2. Rollback to 0.37.1 failed because 1.6.1 migrations already ran
(ActiveRecord::UnknownPrimaryKey on rails_pulse_routes)
3. Rolled forward with corrected entrypoint + env vars
4. Service now stable: 20/20 health checks passed over 5 minutes
Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
v2.20.14 was OOMKilled at 1Gi during the search index rebuild on upgrade.
Bumped to 2Gi (request = limit) to handle startup index operations.
Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
DB migrations from 1.6.1 already ran, making 0.37.1 incompatible
(ActiveRecord::UnknownPrimaryKey on rails_pulse_routes table).
Rolling forward is the correct path.
Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
Changelog summary: Major version bump spanning 13 releases. v9.0.0 adds PDF editor
API, macro recording, Service Worker caching. v9.2.1 fixes critical security vulns
(XSS, memory manipulation leading to RCE in XLS conversion). v9.3.0 adds GIF animations,
multiple pages view, signature settings, hyperlinks on images/shapes.
Risk: CAUTION (major version bump 8→9)
Breaking changes: none affecting Docker+MySQL deployment. PostgreSQL schema change
in v9.0.0 (irrelevant — we use MySQL). API endpoint deprecations (ConvertService.ashx,
GET requests to converter/command) — not removals. Config parameter renames
(leftMenu->layout.leftMenu etc.) are editor JS API, not server config.
DB backup: yes (job: pre-upgrade-onlyoffice-1776357277, MySQL full dump)
Config changes applied: none required
Flagged for manual review: none
Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
Changelog summary: Major version bump. v5.0.0 removes QR code generation,
REDIRECT_APPEND_EXTRA_PATH env var, and trusted proxy auto-detection.
Various CLI option removals. v4.4-4.6 added REDIRECT_EXTRA_PATH_MODE,
DB_USE_ENCRYPTION, TRUSTED_PROXIES, CORS controls, FrankenPHP support.
Risk: CAUTION (major version bump 4→5)
Breaking changes: QR codes removed, REDIRECT_APPEND_EXTRA_PATH removed,
trusted proxy auto-detection removed, CLI option renames
DB backup: yes (job: pre-upgrade-url-1776357271, completed)
Config changes applied: none (no affected env vars in current config)
Flagged for manual review: TRUSTED_PROXIES env var may be needed
(Shlink behind Cloudflare + Traefik = 2 proxies, auto-detection removed in 5.0.0)
Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
Changelog summary: Security fixes (CVE-2025-69217, CVE-2026-27624,
CVE-2026-40613), performance improvements (recvmmsg, lock-free atomics),
memory safety fixes, and DDoS handling improvements.
Risk: CAUTION (4.7.0 has breaking changes for deprecated config options)
Breaking changes: 4.7.0 removed keep-address-family,
response-origin-only-with-rfc5780, inverted no-stun-backward-compatibility.
None of these are in our config — no impact.
DB backup: no (not DB-backed)
Config changes applied: none (no-tlsv1, no-tlsv1_1, no-cli now unnecessary
but still accepted — no removal needed)
Flagged for manual review: none
Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
Terragrunt now generates cloudflare_provider.tf (Vault-sourced API key)
and includes cloudflare in required_providers. These are the generated
files from running `terragrunt init -upgrade` across all stacks.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Changelog summary: Security fixes (IDOR vulnerabilities in sessions/progress/bookmarks),
DB index + query parallelization for discover performance, crash fixes, HTML sanitization
on playlist/collection/podcast endpoints, API key enabled/disabled fix.
Risk: SAFE
Breaking changes: none
DB backup: no (not DB-backed)
Config changes applied: none
Flagged for manual review: none
Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
## Context
Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.
## This change:
- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
`*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
dns_type. 17 hostnames remain centrally managed (Helm ingresses,
special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.
```
BEFORE                           AFTER

config.tfvars (manual list)      stacks/<svc>/main.tf
        |                          module "ingress" {
        v                            dns_type = "proxied"
stacks/cloudflared/                }
  for_each = list                        |
  cloudflare_record                      v
  tunnel per-hostname            auto-creates
                                 cloudflare_record + annotation
```
## What is NOT in this change:
- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)
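Per-service wiring then looks roughly like this (module arguments besides `dns_type` are illustrative):

```hcl
module "ingress" {
  source   = "../../modules/ingress_factory"
  name     = "myapp"                 # hypothetical service
  hostname = "myapp.viktorbarzin.me"

  # "proxied"     -> CNAME to the Cloudflare tunnel (orange cloud)
  # "non-proxied" -> A/AAAA record to the public IP
  dns_type = "proxied"
}
```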
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dolt Workbench hardcodes http://localhost:9002/graphql in the built JS.
For k8s hosting, an init container patches this to the relative /graphql path.
Second ingress routes /graphql to port 9002 behind Authentik auth.
- Init container copies static JS to writable emptyDir, patches URL
- Pre-seeds store.json with Dolt connection config
- Added /graphql ingress with Authentik forward-auth
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Traefik records websocket connection lifetimes (minutes to hours) as
"request duration." When websockets close, the full lifetime pollutes
the average latency metric — Authentik showed 6.7s avg (201s websocket
avg) vs 0.065s actual HTTP avg. This caused ~90 false alerts/day across
12 services (Authentik, Vaultwarden, Terminal, HA, etc.).
Changes:
- Add protocol!="websocket" filter to HighServiceLatency alert expr
- Raise minimum traffic threshold from 0.01 to 0.05 rps to filter
statistical noise from services with <3 req/min
- Remove .githooks/pre-commit file-size hook (blocked state commits)
Validated against 7-day historical data: 637 breaches → ~2 with both
filters applied (99.7% reduction).
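With both filters, the alert expression looks roughly like this (Traefik's standard metric names; the 1s latency threshold and aggregation by `service` are assumptions, not copied from the rule file):

```yaml
- alert: HighServiceLatency
  expr: |
    (
        sum by (service) (rate(traefik_service_request_duration_seconds_sum{protocol!="websocket"}[10m]))
      /
        sum by (service) (rate(traefik_service_request_duration_seconds_count{protocol!="websocket"}[10m]))
    ) > 1
    and
    sum by (service) (rate(traefik_service_requests_total{protocol!="websocket"}[10m])) > 0.05
  for: 15m
```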
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pipeline pods pull from registry.viktorbarzin.me:5050 but the
registry-credentials secret only had auth for registry.viktorbarzin.me
(without port). Containerd requires exact hostname:port match.
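Conceptually the fix is a second auth entry keyed by the exact host:port (decoded `.dockerconfigjson`; credential values are placeholders):

```json
{
  "auths": {
    "registry.viktorbarzin.me":      { "auth": "<base64 user:pass>" },
    "registry.viktorbarzin.me:5050": { "auth": "<base64 user:pass>" }
  }
}
```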
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The vault-woodpecker-sync script was creating global secrets with only
push/tag/deployment events. Manual and cron-triggered pipelines couldn't
access secrets, causing "secret not found" errors and pipeline failures.
Also fixes three root causes of CI failures:
1. Pull-through cache corruption: purged stale blobs, added post-GC
registry restart cron to prevent recurrence
2. Missing repo-level secrets: added registry_user/registry_password
for the infra repo's build-ci-image workflow
3. Stuck pipelines: cleaned up 3 pipelines stuck in "running" since March
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Set protected=true on ingress (Authentik forward-auth)
- Remove unused DATABASE_URL env var (Workbench uses browser-based connection config)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Deploy dolthub/dolt-workbench alongside the Dolt server in beads-server
namespace. Provides SQL console, spreadsheet editor, and commit graph
visualization for the centralized beads task database.
- Workbench at dolt-workbench.viktorbarzin.me (Cloudflare-proxied)
- Connects to Dolt server via in-cluster service DNS
- Added to cloudflare_proxied_names for external access
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pipeline pods were failing with "authorization failed: no basic auth
credentials" when pulling from the private registry. The
WOODPECKER_BACKEND_K8S_PULL_SECRET_NAMES env var was in values.yaml but
never deployed to the agents.
Also removes the stale db-init job that used `-U root` (incompatible
with CNPG's `postgres` superuser). The database already exists.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Image ghcr.io/toeverything/affine:0.20.7 was removed from ghcr.io,
causing persistent ImagePullBackOff. Updated to latest stable 0.26.6.
Prisma migrations run via init container on startup.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add cleanup-failed-pods policy that runs hourly (at :15) to delete all
pods in Failed phase cluster-wide. Prevents stale evicted and failed
CronJob pods from accumulating and creating healthcheck noise.
Also adds ClusterRole + ClusterRoleBinding to grant Kyverno cleanup
controller permission to delete Pods (not included by default).
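A sketch of the policy (the `apiVersion` depends on the installed Kyverno version; shape follows Kyverno's cleanup-policy docs):

```yaml
apiVersion: kyverno.io/v2
kind: ClusterCleanupPolicy
metadata:
  name: cleanup-failed-pods
spec:
  schedule: "15 * * * *"      # hourly at :15
  match:
    any:
      - resources:
          kinds:
            - Pod             # cluster-wide: no namespace selector
  conditions:
    all:
      - key: "{{ target.status.phase }}"
        operator: Equals
        value: Failed
```

The accompanying ClusterRole grants the Kyverno cleanup controller `delete` on `pods`, which its default RBAC does not include.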
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- daily-backup: handle rsync exit 23 (partial transfer) as OK for LUKS
noload mounts — in-flight writes have corrupt metadata from skipped
journal replay, but core data is intact
- daily-backup: clean up stale LUKS dm mappings from previous crashed
runs before attempting to open
- daily-backup: capture rsync exit code safely with set -e (|| pattern)
- kyverno: bump tier-4-aux requests.memory 2Gi→3Gi (servarr was at 83%)
- actualbudget: patched custom quota 5Gi→6Gi (was at 82%)
Verified: backup now completes status=0 (96 PVCs OK, 0 failed)
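The exit-23 handling can be sketched as a small helper (paths in the comment are illustrative):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Treat rsync exit 23 (partial transfer) as OK; anything else is fatal.
# Exit 23 is expected for LUKS snapshots mounted with `-o noload`:
# files that were mid-write have unreplayed journal metadata.
check_rsync_rc() {
  local rc=$1
  if [[ $rc -eq 0 || $rc -eq 23 ]]; then
    echo "OK (rsync exit $rc)"
  else
    echo "FAIL (rsync exit $rc)" >&2
    return "$rc"
  fi
}

# In the real job this wraps the copy, capturing the code without
# tripping `set -e` (the `|| rc=$?` arm runs only on failure):
#   rc=0
#   rsync -aHAX /mnt/snap/ /backup/ || rc=$?
#   check_rsync_rc "$rc"
check_rsync_rc 23
```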
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- technitium-password-sync: remove RWO encrypted PVC mount that caused
pods to stick in ContainerCreating on wrong nodes. Plugin install now
warns instead of failing when zip unavailable.
- daily-backup: add LUKS decryption support for encrypted PVC snapshots
using /root/.luks-backup-key. Uses noload mount option to skip ext4
journal replay. Also installed cryptsetup-bin on PVE host.
- speedtest: disable prometheus.io/scrape annotation (no /prometheus
endpoint exists, causing ScrapeTargetDown alert).
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add WOODPECKER_BACKEND_K8S_PULL_SECRET_NAMES to agent env so step
pods can pull from private registry (registry.viktorbarzin.me:5050)
- Add fallback in default.yml when HEAD~1 is unavailable (shallow
clone with depth=1): fetch more history, or apply all platform
stacks as safe default
- Root cause: pipeline #243 failed because infra-ci:latest image
couldn't be pulled (no imagePullSecrets on step pods)
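The shallow-clone fallback can be sketched like this ("ALL" stands in for the pipeline's "apply all platform stacks" default; the deepen amount is an assumption):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Decide which stacks changed; fall back safely when the clone is
# shallow (depth=1) and HEAD~1 does not exist.
changed_paths() {
  if git rev-parse --verify --quiet HEAD~1 >/dev/null; then
    git diff --name-only HEAD~1 HEAD
    return
  fi
  # Try to fetch more history first (a no-op if there is no remote).
  git fetch --deepen=50 2>/dev/null || true
  if git rev-parse --verify --quiet HEAD~1 >/dev/null; then
    git diff --name-only HEAD~1 HEAD
  else
    echo ALL
  fi
}
```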
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add .githooks/pre-commit that blocks files >2MB (configurable via
GIT_MAX_FILE_SIZE). Activate with: git config core.hooksPath .githooks
- Expand .gitignore to block common binary/archive patterns
(*.tar.gz, *.tgz, *.iso, *.img, *.bin, *.exe, *.dmg)
- Add explicit root-level terraform.tfstate ignore rules
- Remove stale redis-25.3.2.tgz helm chart (unreferenced)
Prevents re-accumulation of large blobs after git history cleanup
that reduced .git from 2.6GB to 128MB.
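The hook's core check is a size filter over the staged file list (env var name from this change; the function name is illustrative):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Block any listed file larger than GIT_MAX_FILE_SIZE bytes
# (2 MB default). Reads file paths, one per line, on stdin.
max=${GIT_MAX_FILE_SIZE:-2097152}

check_sizes() {
  local f size bad=0
  while IFS= read -r f; do
    [ -f "$f" ] || continue
    size=$(wc -c < "$f")
    if [ "$size" -gt "$max" ]; then
      echo "blocked: $f (${size} bytes > ${max})" >&2
      bad=1
    fi
  done
  return "$bad"
}

# The real hook feeds it the staged files:
#   git diff --cached --name-only --diff-filter=AM | check_sizes
```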
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add separate CronJobs that dump each database individually:
- postgresql-backup-per-db: pg_dump -Fc per DB (daily 00:15)
- mysql-backup-per-db: mysqldump per DB (daily 00:45)
Dumps go to /backup/per-db/<dbname>/ on the same NFS PVC.
Enables single-database restore without affecting other databases.
Also fixed CNPG superuser password sync and added --single-transaction
--set-gtid-purged=OFF to MySQL per-db dumps.
Updated restore runbooks with per-database restore procedures.
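The PostgreSQL side of the loop can be sketched as follows (connection flags omitted; the helper name is illustrative, the destination layout is from this change):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Emit the dump command for one database.
# Destination layout: /backup/per-db/<dbname>/
pg_dump_cmd() {
  local db=$1
  echo "pg_dump -Fc -f /backup/per-db/${db}/${db}.dump ${db}"
}

# In the CronJob this drives the loop:
#   psql -Atc "SELECT datname FROM pg_database WHERE NOT datistemplate" |
#   while IFS= read -r db; do
#     mkdir -p "/backup/per-db/$db"
#     $(pg_dump_cmd "$db")
#   done
#
# The MySQL variant dumps with:
#   mysqldump --single-transaction --set-gtid-purged=OFF "$db"
pg_dump_cmd grafana
```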
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Expand service list: add Home Assistant, Actual Budget, Audiobookshelf,
Linkwarden, Matrix, Paperless, Tandoor, FreshRSS, Frigate, HackMD,
Excalidraw, Wealthfolio, Send, Stirling PDF
- Add structured debugging fields: error type, scope (just me vs others),
when it started, URL accessed
- Fix user report parser to extract all form fields into status.json
- Show error type, scope, and start time in status page report cards
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Status page (status.viktorbarzin.me): incident cards with SEV badges,
expandable timelines, postmortem links, user report rendering
- Issue templates on infra repo for user outage reports
- CronJob reads incidents + user-reports from ViktorBarzin/infra
- "Report an Outage" button on status page links to infra repo
- Post-mortem agents restored (4-stage pipeline: triage → investigation
→ historian → report writer) with updated paths and issue linking
- Post-mortem skill/template updated to link reports to GitHub Issues
and manage postmortem-required/postmortem-done labels
- Labels: incident, sev1-3, user-report, postmortem-required,
postmortem-done on infra repo
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Increase socket timeout from 30s to 120s (121+ monitors need time to sync)
- Add wait_events=0.2 for reliable login
- Fix accepted_statuscodes format: use 100-increment ranges (e.g. "200-299"),
  not arbitrary ranges
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add automatic external HTTPS monitors to Uptime Kuma for ~96 services
exposed via Cloudflare tunnel. A sync CronJob (every 10min) reads from
a Terraform-generated ConfigMap and creates/deletes [External] monitors
to match cloudflare_proxied_names. Status page groups these separately
as "External Reachability" and pushes a divergence metric to Pushgateway
when services are externally down but internally up. Prometheus alert
ExternalAccessDivergence fires after 15min of divergence.
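The divergence plumbing reduces to roughly this (the alert name is from this change; the metric name, labels, and severity are assumptions):

```yaml
# Pushed to Pushgateway by the sync CronJob when a service is down
# externally but up internally, e.g.:
#   external_access_divergence{service="vaultwarden"} 1
- alert: ExternalAccessDivergence
  expr: external_access_divergence > 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: >-
      {{ $labels.service }} is unreachable via Cloudflare but healthy
      internally (possible tunnel or DNS issue)
```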
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add PrometheusRule: NFSHighRPCRetransmissions fires when node_nfs_rpc_retransmissions_total
rate exceeds 5/s for 5m — catches NFS server degradation before pod failures cascade
- Migrate alertmanager PV from NFS (192.168.1.127:/srv/nfs/alertmanager) to proxmox-lvm-encrypted
eliminating the circular dependency where alertmanager couldn't alert about NFS failures
- Set force_update=true on prometheus helm_release to handle StatefulSet volumeClaimTemplate changes
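The new rule as described reduces to roughly this (metric name and thresholds from this change; severity is an assumption):

```yaml
- alert: NFSHighRPCRetransmissions
  expr: rate(node_nfs_rpc_retransmissions_total[5m]) > 5
  for: 5m
  labels:
    severity: warning
```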
Co-Authored-By: postmortem-todo-resolver <noreply@anthropic.com>
Secondary/tertiary DNS instances had no custom zones — only the
primary had viktorbarzin.lan and viktorbarzin.me. The old setup Job
ran once at deployment and never synced new zones.
New CronJob runs every 30 minutes:
- Gets all zones from primary
- Enables zone transfer on primary
- Creates missing zones as Secondary type on replicas
- Resyncs existing zones via AXFR
Fixes .lan resolution failures (2/3 queries returned NXDOMAIN).
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Applied all 20 NFS stacks to converge PV mount_options (nfsvers=4).
State files encrypted and committed.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>