Commit graph

2788 commits

Viktor Barzin
d3be9b50af [frigate] Remove orphan config.yaml with leaked RTSP passwords
## Context
A Frigate configuration file was added to modules/kubernetes/frigate/ in
bcad200a (2026-04-15, ~2 days ago) as part of a bulk `chore: add untracked
stacks, scripts, and agent configs` commit. The file contains 14 inline
rtsp://admin:<password>@<host>:554/... URLs, leaking two distinct RTSP
passwords for the cameras at 192.168.1.10 (LAN-only) and
valchedrym.ddns.net (confirmed reachable from public internet on port
554). Both remotes are public, so the creds have been exposed for ~2 days.

Grep across the repo confirms nothing references this config.yaml — the
active stacks/frigate/main.tf stack reads its configuration from a
persistent volume claim named `frigate-config-encrypted`, not from this
file. The file is therefore an orphan from the bulk add, with no
production function.

## This change
- git rm modules/kubernetes/frigate/config.yaml

## What is NOT in this change
- Camera password rotation. The user does not own the cameras; rotation
  must be coordinated out-of-band with the camera operators. The DDNS
  camera (valchedrym.ddns.net:554) is internet-reachable, so the leaked
  password is high-priority to rotate from the device side.
- Git-history rewrite. The file plus its leaked strings remain in all
  commits from bcad200a forward. Scheduled to be purged via
  `git filter-repo --path modules/kubernetes/frigate/config.yaml
  --invert-paths --replace-text <list>` in the broader remediation pass.
- Future Frigate config provisioning. If the stack is re-platformed to
  source config from Git rather than the PVC, the replacement should go
  through ExternalSecret + env-var interpolation, not an inline YAML.

## Test plan
### Automated
  $ grep -rn 'frigate/config\.yaml' --include='*.tf' --include='*.hcl' \
       --include='*.yaml' --include='*.yml' --include='*.sh'
  (no output — confirms orphan status)

### Manual Verification
1. `git show HEAD --stat` shows exactly one deletion:
     modules/kubernetes/frigate/config.yaml | 229 ---------------------------------
2. `test ! -e modules/kubernetes/frigate/config.yaml` returns true.
3. `kubectl -n frigate get pvc frigate-config-encrypted` still shows the
   PVC bound (unaffected by this change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 19:39:35 +00:00
Viktor Barzin
7a884a0b97 [monitoring] Fix alerts for intentionally scaled-down services
PoisonFountainDown and ForwardAuthFallbackActive both fired because
poison-fountain was scaled to 0 replicas (intentional). Updated both
alert expressions to check kube_deployment_spec_replicas > 0 before
alerting on missing available replicas — if desired replicas is 0,
the service is intentionally down and should not alert.
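
A quick sanity check of the gated expression against the live Prometheus API (the
deployment label and Prometheus address are placeholders; the spec-replicas guard is
the one named above, the available-replicas metric name is an assumption):
```
# Expect an empty result for poison-fountain: desired replicas is 0, so the
# "available == 0" condition is masked by the "spec replicas > 0" guard.
curl -sG 'http://prometheus.monitoring.svc:9090/api/v1/query' \
  --data-urlencode 'query=kube_deployment_status_replicas_available{deployment="poison-fountain"} == 0
    and kube_deployment_spec_replicas{deployment="poison-fountain"} > 0'
```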

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 19:17:41 +00:00
Viktor Barzin
a19581e32b fix(beads-server): fix Workbench timeout — use internal GraphQL URL
GRAPHQLAPI_URL must point to localhost:9002 (internal), not the external
URL, which goes through Authentik. SSR can't authenticate to Authentik.
Also removed Authentik from the /graphql ingress — browser fetch() can't
follow 302 redirects on POST requests.
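
A minimal spot-check sketch (deployment, namespace, and hostname here are placeholders,
not taken from the stack):
```
# SSR must resolve GraphQL locally; the external hostname would 302 to Authentik.
kubectl -n beads-server exec deploy/beads-workbench -- printenv GRAPHQLAPI_URL
# expected: an internal URL on port 9002, not the public hostname
curl -s -o /dev/null -w '%{http_code}\n' -X POST https://beads.viktorbarzin.me/graphql \
  -H 'Content-Type: application/json' -d '{"query":"{ __typename }"}'
# expected: 200 (no Authentik 302 on the POST path after the ingress change)
```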

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 19:05:47 +00:00
Viktor Barzin
da6b82ed5c fix(beads-server): persist GRAPHQLAPI_URL in Terraform
The env var was only set via kubectl and got overwritten on next apply.
Now permanently in the deployment spec.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 18:58:59 +00:00
Viktor Barzin
afb8a16623 [infra] Scale down unused services + remove DoH ingress
Scale to 0 replicas:
- ollama: low usage, saves ~2Gi memory + 59GB NFS-SSD model data idle
- poison-fountain: RSS link archiver, not actively used
- travel-blog: Hugo blog, not actively used

Remove technitium DoH ingress (dns.viktorbarzin.me): externally unreachable
and unused. DNS is served on UDP/TCP port 53 via LoadBalancer (10.0.20.201).

Clears 3 of 5 ExternalAccessDivergence services. The remaining 2 (pdf, travel)
should clear once the Uptime Kuma monitors report both as down.
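
A post-apply spot-check sketch (namespace names are assumptions; the LoadBalancer IP is
the one from this change):
```
for d in ollama poison-fountain travel-blog; do
  kubectl -n "$d" get deploy "$d" -o jsonpath='{.metadata.name} replicas={.spec.replicas}{"\n"}'
done
# DNS should still answer on the LoadBalancer even with the DoH ingress removed:
dig +short viktorbarzin.me @10.0.20.201
```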

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 18:55:52 +00:00
Viktor Barzin
cdc851fc63 [alerts] Fix status-page-pusher crash + Prometheus backup push
## status-page-pusher (ExternalAccessDivergence false positive)
The pusher was crashing with `AttributeError: 'list' object has no attribute
'get'` at line 122 — the uptime-kuma-api library changed the heartbeats return
format. Fixed by making beat flattening more robust: handle any nesting of
lists/dicts in the heartbeat data, and add an isinstance check before calling
`.get()` on the latest beat.

## Prometheus backup (PrometheusBackupNeverRun)
The backup sidecar's Pushgateway push was silently failing because `wget
--post-file=-` needs `--header="Content-Type: text/plain"` for Pushgateway
to accept the Prometheus exposition format. Added the header. Also manually
pushed the metric to clear the `absent()` alert immediately.
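
The corrected push reduces to something like this (the `--header` flag is the actual fix;
the metric name and Pushgateway address are placeholders):
```
echo "prometheus_backup_last_success_timestamp_seconds $(date +%s)" \
  | wget -qO- --header='Content-Type: text/plain' \
      --post-file=- http://pushgateway.monitoring.svc:9091/metrics/job/prometheus-backup
```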

Note: ExternalAccessDivergence still fires because 5 services (ollama, pdf,
poison, dns, travel) ARE genuinely externally unreachable but internally up.
This is a real issue (likely Cloudflare tunnel routing), not a false positive.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 18:29:43 +00:00
Viktor Barzin
eef4242408 fix(beads-server): auto-connect Workbench to Dolt on startup
The Workbench's database connection is in-memory and lost on pod restart.
Added a startup script that waits for GraphQL server readiness, then calls the
addDatabaseConnection mutation automatically. No more manual reconnection.
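
A sketch of the wait-then-connect flow; the port follows the earlier GRAPHQLAPI_URL fix,
while the mutation arguments are placeholders for whatever the Workbench schema actually takes:
```
#!/bin/sh
GQL=http://localhost:9002/graphql
# Block until the GraphQL server answers a trivial query.
until curl -sf -H 'Content-Type: application/json' \
      -d '{"query":"{ __typename }"}' "$GQL" >/dev/null; do
  sleep 2
done
# Re-register the Dolt connection (argument names here are illustrative only).
curl -s -H 'Content-Type: application/json' \
  -d '{"query":"mutation { addDatabaseConnection(url: \"mysql://root@dolt:3306/beads\") { __typename } }"}' \
  "$GQL"
```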

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 18:12:31 +00:00
Viktor Barzin
5e9e487661 feat(setup-project): auto-PR working Dockerfiles back to upstream
## Context
The setup-project skill treats "build from a Dockerfile" as priority 6 — "last
resort, avoid if possible" — with no formalized path for apps whose upstream
lacks a working Dockerfile. When we end up writing one to get the deploy green,
that Dockerfile stays private in the infra repo and upstream never benefits.

## This change
Adds a closed-loop flow: when we author a new Dockerfile (or fix a broken
upstream one) and the deploy is healthy for 10 minutes, auto-open a PR against
the upstream repo so the self-hosting community gets the working recipe.

Flow:
1. Classify dockerfile_state during research phase (image-used / used-as-is /
   fixed-broken-upstream / written-from-scratch). Persist to
   modules/kubernetes/<service>/.contribution-state.json.
2. After Terraform apply, run scripts/stability-gate.sh — polls pod Ready +
   HTTP 200 every 30s x 20 iterations, requires 18/20 successes.
3. On pass with a trigger state, scripts/contribute-dockerfile.sh does the
   GitHub API dance: fork → merge-upstream → branch → commit Dockerfile /
   .dockerignore / BUILD.md via Contents API → open PR with body rendered from
   templates/PR_BODY.md. Idempotent (skips on recorded PR URL, existing fork,
   existing branch, open PR, upstream landed a Dockerfile mid-deploy).

GitHub API via curl (gh CLI is sandbox-blocked per .claude/CLAUDE.md); token
pulled from Vault (`secret/viktor` → `github_pat`). Commits include
Signed-off-by for DCO-enforcing repos. Fork branch name is `add-dockerfile`
for written-from-scratch or `fix-dockerfile` for fixed-broken-upstream, with
timestamp suffix on collision.
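
Condensed sketch of the curl-only API dance (the endpoints are standard GitHub REST v3;
repo, branch, and token variable names are placeholders):
```
GH_API=https://api.github.com
AUTH="Authorization: Bearer ${GITHUB_PAT}"   # pulled from Vault secret/viktor -> github_pat
UPSTREAM=owner/app                           # placeholder upstream repo

# 1. Fork (no-op if the fork already exists).
curl -s -X POST -H "$AUTH" "$GH_API/repos/$UPSTREAM/forks"
# 2. Sync the fork's default branch with upstream.
curl -s -X POST -H "$AUTH" "$GH_API/repos/ViktorBarzin/app/merge-upstream" \
  -d '{"branch":"main"}'
# 3. Commit the Dockerfile via the Contents API (base64 body).
#    (creating the add-dockerfile ref via POST .../git/refs is elided here)
curl -s -X PUT -H "$AUTH" "$GH_API/repos/ViktorBarzin/app/contents/Dockerfile" \
  -d "{\"message\":\"Add Dockerfile\",\"branch\":\"add-dockerfile\",\"content\":\"$(base64 -w0 Dockerfile)\"}"
# 4. Open the PR against upstream with the rendered template as body.
curl -s -X POST -H "$AUTH" "$GH_API/repos/$UPSTREAM/pulls" \
  -d '{"title":"Add Dockerfile","head":"ViktorBarzin:add-dockerfile","base":"main","body":"..."}'
```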

## Files
- SKILL.md — state classification table, quality bar checklist, §8b stability
  gate, §10 contribute-upstream step, checklist updates
- scripts/stability-gate.sh — 10-minute health probe
- scripts/contribute-dockerfile.sh — GitHub API orchestrator
- templates/PR_BODY.md — `{{VAR}}` placeholder template for PR description
- templates/Dockerfile.README.md — BUILD.md template shipped with the PR

## What is NOT in this change
- No Woodpecker / GHA changes (skill-local flow).
- No auto-tracking of merge/reject outcomes upstream (manual follow-up).
- Not yet exercised end-to-end; first real-world run will validate the API
  dance. Plan to dry-run against a throwaway sink repo before pointing at a
  real upstream.

## Test Plan
### Automated
- bash -n on both scripts → pass
- Manual read-through of SKILL.md — step numbering coherent, existing
  §1-9 semantics untouched, new §8b/§10 reference real files

### Manual Verification
1. Next time setup-project onboards a Dockerfile-less app:
   - Confirm .contribution-state.json is written with `written-from-scratch`
   - Run stability-gate.sh — expect 18/20 passes on a healthy deploy
   - Run contribute-dockerfile.sh — expect a fork + branch + PR on ViktorBarzin
   - Verify contribution_pr_url is back-written to the state file
2. Re-run contribute-dockerfile.sh → must be a no-op (idempotent)
3. Upstream-archived case: manually archive a test upstream → re-run →
   expect SKIP, no PR created

[ci skip]

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 18:12:13 +00:00
Viktor Barzin
1860cd1dfb state(vault): update encrypted state 2026-04-17 14:14:05 +00:00
Viktor Barzin
f0ddfb8cae state(dbaas): update encrypted state 2026-04-17 14:08:49 +00:00
Viktor Barzin
b034c868db [traefik] Remove broken rewrite-body plugin and all rybbit/anti-AI injection
The rewrite-body Traefik plugin (both packruler/rewrite-body v1.2.0 and
the-ccsn/traefik-plugin-rewritebody v0.1.3) silently fails on Traefik
v3.6.12 due to Yaegi interpreter issues with ResponseWriter wrapping.
Both plugins load without errors but never inject content.

Removed:
- rewrite-body plugin download (init container) and registration
- strip-accept-encoding middleware (only existed for rewrite-body bug)
- anti-ai-trap-links middleware (used rewrite-body for injection)
- rybbit_site_id variable from ingress_factory and reverse_proxy factory
- rybbit_site_id from 25 service stacks (39 instances)
- Per-service rybbit-analytics middleware CRD resources

Kept:
- compress middleware (entrypoint-level, working correctly)
- ai-bot-block middleware (ForwardAuth to bot-block-proxy)
- anti-ai-headers middleware (X-Robots-Tag: noai, noimageai)
- All CrowdSec, Authentik, rate-limit middleware unchanged

Next: Cloudflare Workers with HTMLRewriter for edge-side injection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 12:41:17 +00:00
Viktor Barzin
b24545ffdb fix(beads-server): fix BeadBoard project ID + install bd binary
- Fixed project_id mismatch (was "beadboard", should be actual DB project ID)
- Rebuilt Docker image with bd v1.0.2 binary (node:20-slim for glibc compat)
- Ran bd migrate to update schema from 1.0.0 → 1.0.2 (adds started_at, etc.)
- Task creation and bd CLI now work inside the container

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 11:57:45 +00:00
Viktor Barzin
f2037545b3 fix(beads-server): make BeadBoard .beads dir writable
BeadBoard needs to create templates/ and archetypes/ subdirectories
inside .beads/. ConfigMap mounts are read-only, causing ENOENT errors
and 503 responses. Fix: init container copies ConfigMap to emptyDir.
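
In effect the init container is just a copy step, roughly (mount paths are assumptions):
```
# ConfigMap mounted read-only at /config-ro; writable emptyDir shared with the app at /beads.
# -L dereferences the ConfigMap's symlinked keys so BeadBoard gets plain files it can
# create templates/ and archetypes/ next to.
cp -rL /config-ro/. /beads/
```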

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 11:37:26 +00:00
Viktor Barzin
00e2f15a5d feat(beads-server): deploy BeadBoard task visualization dashboard
Add BeadBoard (zenchantlive/beadboard) alongside Dolt server and Workbench
for task dependency graph, kanban, and agent coordination views.

- Built custom Docker image (registry.viktorbarzin.me:5050/beadboard)
- ConfigMap provides .beads/metadata.json pointing to Dolt server
- Behind Authentik auth at beadboard.viktorbarzin.me
- Also fixed: GraphQL ingress now has Authentik middleware
- Also fixed: Workbench store.json type enum (mysql → Mysql)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 11:30:43 +00:00
Viktor Barzin
26abd8fe94 [skill] Add /disk-wear skill for periodic disk write analysis
## Context
After the MySQL standalone migration + Technitium SQLite disable saved ~130 GB/day
of disk writes, this methodology should be reusable for periodic health reviews.

## This change:
Adds `/disk-wear` skill that combines three data sources:
- SSH to PVE host for real-time 30s I/O snapshots and SSD SMART health
- Prometheus PromQL for per-app write attribution (node_disk_written_bytes_total
  joined with node_disk_device_mapper_info for dm->LVM mapping)
- kubectl for PVC UUID -> pod/namespace mapping

Produces ranked breakdowns by physical disk, VM, k8s namespace, and individual PVC.
Includes baselines, red flag detection, and annualized wear projections.

Note: container_fs_writes_bytes_total has 0 series (cadvisor doesn't track
block device writes per container), so per-app attribution uses the PVE host's
dm-device level metrics mapped through Prometheus and kubectl.
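
The per-app attribution boils down to a join like the one below (label names follow
node_exporter conventions; the exact grouping and Prometheus address are assumptions):
```
# Bytes written per LVM volume (dm device) per day, via the Prometheus HTTP API.
curl -sG http://prometheus.monitoring.svc:9090/api/v1/query --data-urlencode \
  'query=sum by (name) (
     rate(node_disk_written_bytes_total{device=~"dm-.*"}[24h])
     * on (instance, device) group_left(name) node_disk_device_mapper_info
   ) * 86400'
```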

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 11:15:26 +00:00
Viktor Barzin
366e2ab083 [uptime-kuma] Opt-out external monitoring for every public ingress [ci skip]
## Context
After the previous commit migrated monitor discovery to per-ingress annotation
(opt-in via `uptime.viktorbarzin.me/external-monitor=true`), coverage expanded
from 13 → 26 monitors but still left ~99 public ingresses uncovered — notably
Helm-managed services (authentik, grafana, vault, forgejo, ntfy) that don't
go through `ingress_factory`, plus any `dns_type = "non-proxied"` ingress
(Immich was a direct victim: `dns_type = "non-proxied"` → no annotation added
→ no monitor → invisible outage).

The user's concern: "I should have known external Immich was down before
users tried to open it."

## This change
Flipped the semantic from opt-in to **opt-out by default**:
- Every ingress whose host ends in `.viktorbarzin.me` gets an `[External] <label>`
  monitor automatically
- Only ingresses with annotation `uptime.viktorbarzin.me/external-monitor=false`
  are skipped
- Host dedup via a `seen` set (one monitor per hostname, regardless of how
  many Ingress resources share it)
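
The same selection logic, sketched in shell against the live cluster (the annotation key
is the real one; the jq shape is an assumption about the script's Python equivalent):
```
kubectl get ingress -A -o json | jq -r '
  .items[]
  | select((.metadata.annotations["uptime.viktorbarzin.me/external-monitor"] // "true") != "false")
  | .spec.rules[]?.host // empty
  | select(endswith(".viktorbarzin.me"))
' | sort -u    # one monitor per hostname, mirroring the "seen" set
```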

## Verification
Triggered a manual CronJob run post-apply:
```
Sync complete: 102 created, 1 deleted, 23 unchanged
```
Coverage jumped from 26 → ~124 external monitors. All 6 Helm-managed services
now have dedicated monitors:
- [External] immich, authentik, forgejo, grafana, ntfy, vault

## Scope
Only `stacks/uptime-kuma/modules/uptime-kuma/main.tf` (Python script in the
CronJob resource). No RBAC or service account changes — the ones added in the
previous commit still cover this path.

## Test plan

### Automated
```
$ kubectl -n uptime-kuma logs -l job-name=manual-sync-optout-1776422993 --tail=50 | grep -iE 'immich|authentik|grafana|forgejo|vault|ntfy'
Creating monitor: [External] authentik -> https://authentik.viktorbarzin.me
Creating monitor: [External] forgejo   -> https://forgejo.viktorbarzin.me
Creating monitor: [External] immich    -> https://immich.viktorbarzin.me
Creating monitor: [External] grafana   -> https://grafana.viktorbarzin.me
Creating monitor: [External] ntfy      -> https://ntfy.viktorbarzin.me
Creating monitor: [External] vault     -> https://vault.viktorbarzin.me
```

### Manual Verification
1. Open `https://uptime.viktorbarzin.me` → confirm `[External] immich` exists
2. Simulate an Immich outage (scale deploy to 0 briefly) → external monitor
   should go red within the probe interval (5min); internal monitor stays up
   (pod-level from a different probe angle) → `ExternalAccessDivergence`
   alert fires after 15 min

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 11:12:00 +00:00
Viktor Barzin
66d2d9916b [infra] Per-ingress external-monitor annotation + actualbudget plan-time fix [ci skip]
## Context
Two operational gaps surfaced during a healthcheck sweep today:

1. **External monitoring coverage**: Only ~13 hostnames (via `cloudflare_proxied_names`
   in `config.tfvars`) had `[External]` monitors in Uptime Kuma. Any service deployed via
   `ingress_factory` with `dns_type = "proxied"` auto-created its DNS record but was NOT
   registered for external probing — so outages like Immich going down externally were
   invisible until a user complained. 99 of ~125 public ingresses had no external
   monitor.

2. **actualbudget stack unplannable**: `count = var.budget_encryption_password != null
   ? 1 : 0` in `factory/main.tf:152` failed with "Invalid count argument" because the
   value flows from a `data.kubernetes_secret` whose contents are `(known after apply)`
   at plan time. Blocked CI applies and drift reconciliation.

## This change

### Per-ingress external-monitor annotation (ingress_factory + reverse_proxy/factory)
- New variables `external_monitor` (bool, nullable) + `external_monitor_name` (string,
  nullable). Default is "follow dns_type" — enabled for any public DNS record
  (`dns_type != "none"`, covers both proxied and non-proxied so Immich and other
  direct-A records are also monitored).
- Emits two annotations on the Ingress:
  - `uptime.viktorbarzin.me/external-monitor = "true"`
  - `uptime.viktorbarzin.me/external-monitor-name = "<label>"` (optional override)

### external-monitor-sync CronJob (uptime-kuma stack)
- Discovers targets from live Ingress objects via the K8s API first (filter by
  annotation), falls back to the legacy `external-monitor-targets` ConfigMap on any
  API error (zero rollout risk).
- New `ServiceAccount` + cluster-wide `ClusterRole`/`ClusterRoleBinding` giving
  `list`/`get` on `networking.k8s.io/ingresses`.
- `API_SERVER` now uses the `KUBERNETES_SERVICE_HOST` env var (always injected by K8s)
  instead of `kubernetes.default.svc` — the search-domain expansion failed in the
  CronJob pod's DNS config. Verified working: CronJob now logs
  `Loaded N external monitor targets (source=k8s-api)`.
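
The equivalent in-cluster lookup, sketched in shell (the service-account paths are the
standard mounts; the ClusterRole above is what authorizes the list):
```
API_SERVER="https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT:-443}"
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
CACERT=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
# List ingresses cluster-wide and filter on the external-monitor annotation client-side.
curl -s --cacert "$CACERT" -H "Authorization: Bearer ${TOKEN}" \
  "${API_SERVER}/apis/networking.k8s.io/v1/ingresses?limit=500"
```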

### actualbudget count-on-unknown refactor
- Replaced `count = var.budget_encryption_password != null ? 1 : 0` with two explicit
  plan-time booleans: `enable_http_api` and `enable_bank_sync`. Values are known at
  plan; no `-target` workaround needed.
- Callers (`stacks/actualbudget/main.tf`) pass `true` explicitly. Runtime behaviour is
  unchanged — the secret is still consumed via env var.
- Also aligned the factory with live state (the 3 budget-* PVCs had been migrated
  `proxmox-lvm` → `proxmox-lvm-encrypted` outside Terraform): PVC resource renamed
  `data_proxmox` → `data_encrypted`, storage class updated, orphaned `nfs_data` module
  removed. State was rm'd + re-imported with matching UIDs, so no data was moved.

## Rollout status (already partially applied in this session)
- `stacks/uptime-kuma` applied — SA + RBAC + CronJob changes live; FQDN fix verified
- `stacks/actualbudget` applied — budget-{viktor,anca,emo} all 200 OK externally
- `stacks/mailserver` + 21 other ingress_factory consumers applied — annotations live
- CronJob `external-monitor-sync` latest run: `source=k8s-api`, 26 monitors active
  (was 13 on the central list)

## Deferred (separate work)
- 4 stacks show pre-existing DESTRUCTIVE drift in plan (metallb namespace, claude-memory,
  rbac, redis) — NOT triggered by this commit but will be by CI's global-file cascade.
  `[ci skip]` here so those don't auto-apply; they will be fixed manually before the
  next CI push.
- Cleanup of `cloudflare_proxied_names` list once Helm-managed ingresses (authentik,
  grafana, vault, forgejo) are annotated — separate PR.

## Test plan

### Automated
```
$ kubectl -n uptime-kuma logs $(kubectl -n uptime-kuma get pods -l job-name -o name | tail -1)
Loaded 26 external monitor targets (source=k8s-api)
Sync complete: 7 created, 0 deleted, 17 unchanged

$ curl -sk -o /dev/null -w "%{http_code}\n" -H "Accept: text/html" \
    https://dawarich.viktorbarzin.me/ https://nextcloud.viktorbarzin.me/ \
    https://budget-viktor.viktorbarzin.me/
200 302 200

$ kubectl -n actualbudget get deploy,pvc -l app=budget-viktor
deployment.apps/budget-viktor     1/1 1 1 Ready
persistentvolumeclaim/budget-viktor-data-encrypted  Bound  10Gi  RWO  proxmox-lvm-encrypted
```

### Manual Verification
1. Confirm the annotation is present on an ingress_factory ingress:
   ```
   kubectl -n dawarich get ingress dawarich -o \
     jsonpath='{.metadata.annotations.uptime\.viktorbarzin\.me/external-monitor}'
   # Expected: "true"
   ```
2. Confirm the new `[External] <name>` monitor appears in Uptime Kuma within 10 min
   (CronJob interval). For Immich specifically, it will appear after the immich stack
   is re-applied.
3. Verify actualbudget plan is clean:
   ```
   cd stacks/actualbudget && scripts/tg plan --non-interactive
   # Expected: no "Invalid count argument" errors
   ```

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 10:34:32 +00:00
Viktor Barzin
0c4fe98d75 state(dbaas): update encrypted state 2026-04-17 10:08:04 +00:00
Viktor Barzin
996bdfc9b6 [technitium] Uninstall MySQL+SQLite query log plugins instead of just disabling
## Context
Disabling MySQL/SQLite query logging via config was not durable — Technitium
re-enables disabled plugins on pod restart, causing 46 GB/day of writes to
the standalone MySQL (15M inserts to technitium.dns_logs between CronJob runs).

## This change:
The password-sync CronJob now UNINSTALLS MySQL and SQLite query log plugins
via `/api/apps/uninstall` instead of setting `enableLogging:false`. This is
permanent — the plugin files are removed from the PVC, so they can't re-enable
on restart. The CronJob checks if the plugins are present first (idempotent).

Only PostgreSQL query logging remains (90-day retention).
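
The uninstall step reduces to something like this (`/api/apps/uninstall` is the endpoint
named above; the list endpoint, service address, token variable, and app names are assumptions):
```
DNS_API=http://technitium.technitium.svc:5380
for app in "Query Logs (MySQL)" "Query Logs (Sqlite)"; do
  if curl -s "${DNS_API}/api/apps/list?token=${TECHNITIUM_TOKEN}" | grep -qF "$app"; then
    # Removes the plugin files from the PVC, so a pod restart cannot re-enable logging.
    curl -sG "${DNS_API}/api/apps/uninstall" \
      --data-urlencode "token=${TECHNITIUM_TOKEN}" \
      --data-urlencode "name=${app}"
  fi
done
```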

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 08:20:55 +00:00
Viktor Barzin
f0a73815d8 [freedify] Remove stale sed patches from container startup
The audio-engine.js, dom.js, and dj.js files were refactored/removed
in the upstream Freedify repo. The sed patches that disabled iOS EQ
auto-init and visualizer no longer have targets, causing the container
to crash on startup. Use the image's default CMD instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 06:17:13 +00:00
Viktor Barzin
f8facf44dd [infra] Fix rewrite-body plugin + cleanup TrueNAS + version bumps
## Context

The rewrite-body Traefik plugin (packruler/rewrite-body v1.2.0) silently
broke on Traefik v3.6.12 — every service using rybbit analytics or anti-AI
injection returned HTTP 200 with "Error 404: Not Found" body. Root cause:
middleware specs referenced plugin name `rewrite-body` but Traefik registered
it as `traefik-plugin-rewritebody`.

Migrated to maintained fork `the-ccsn/traefik-plugin-rewritebody` v0.1.3
which uses the correct plugin name. Also added `lastModified = true` and
`methods = ["GET"]` to anti-AI middleware to avoid rewriting non-HTML
responses.

## This change

- Replace packruler/rewrite-body v1.2.0 with the-ccsn/traefik-plugin-rewritebody v0.1.3
- Fix plugin name in all 3 middleware locations (ingress_factory, reverse-proxy factory, traefik anti-AI)
- Remove deprecated TrueNAS cloud sync monitor (VM decommissioned 2026-04-13)
- Remove CloudSyncStale/CloudSyncFailing/CloudSyncNeverRun alerts
- Fix PrometheusBackupNeverRun alert (for: 48h → 32d to match monthly sidecar schedule)
- Bump versions: rybbit v1.0.21→v1.1.0, wealthfolio v1.1.0→v3.2,
  networking-toolbox 1.1.1→1.6.0, cyberchef v10.24.0→v9.55.0
- MySQL standalone storage_limit 30Gi → 50Gi
- beads-server: fix Dolt workbench type casing, remove Authentik on GraphQL endpoint

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 05:51:52 +00:00
Viktor Barzin
8b206a63ad state(dbaas): update encrypted state 2026-04-16 22:55:52 +00:00
Viktor Barzin
4c8e5bea0b [traefik] Add global compress middleware to fix response compression
The rewrite-body plugin (rybbit analytics, anti-AI trap links) requires
strip-accept-encoding to work, which killed HTTP compression for 50+
services. This adds Traefik's built-in compress middleware at the
websecure entrypoint level to re-compress responses to clients after
rewrite-body has modified them.

Uses includedContentTypes whitelist (not excludedContentTypes) so only
text-based types are compressed. SSE, WebSocket, gRPC, and binary
downloads are unaffected.

Measured improvement on ha-sofia:
- app.js: 540KB → 167KB (3.2x)
- core.js: 52KB → 19KB (2.7x)
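
A quick external spot-check of the re-compression (the host is the one measured above;
the asset path is an assumption):
```
# Response should carry Content-Encoding: gzip now that compress runs at the entrypoint.
curl -skI -H 'Accept-Encoding: gzip' https://ha-sofia.viktorbarzin.me/ | grep -i '^content-encoding'
# Compare transferred bytes with and without compression for a large JS bundle.
curl -sk -o /dev/null -w 'gzip:  %{size_download} bytes\n' -H 'Accept-Encoding: gzip' \
  https://ha-sofia.viktorbarzin.me/frontend_latest/app.js
curl -sk -o /dev/null -w 'plain: %{size_download} bytes\n' -H 'Accept-Encoding: identity' \
  https://ha-sofia.viktorbarzin.me/frontend_latest/app.js
```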

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 22:18:51 +00:00
Viktor Barzin
e80b2f026f [infra] Migrate Terraform state from local SOPS to PostgreSQL backend
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
  state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
  10.0.20.200:5432/terraform_state with native pg_advisory_lock.

Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.

Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
  skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
  for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
  service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)
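
The tier switch in scripts/tg amounts to logic like this (the tier-0 list matches the one
above; the Vault path and connection-string details are assumptions):
```
TIER0="infra platform cnpg vault dbaas external-secrets"
stack=$(basename "$PWD")
if printf ' %s ' $TIER0 | grep -q " ${stack} "; then
  # Tier 0: SOPS-encrypted local state, unchanged bootstrap path.
  echo "tier0: local state + SOPS"
else
  # Tier 1: fetch PG creds from Vault, point Terraform's pg backend at the CNPG cluster.
  PG_PASS=$(vault kv get -field=password secret/terraform-state-pg)   # path is a placeholder
  export PG_CONN_STR="postgres://terraform:${PG_PASS}@10.0.20.200:5432/terraform_state"
  echo "tier1: PostgreSQL backend (locking via pg_advisory_lock)"
fi
```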

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 19:33:12 +00:00
Viktor Barzin
f538115c43 [dbaas] Migrate MySQL from InnoDB Cluster to standalone StatefulSet
## Context
Disk write analysis showed MySQL InnoDB Cluster writing ~95 GB/day for only
~35 MB of actual data due to Group Replication overhead (binlog, relay log,
GR apply log). The operator enforces GR even with serverInstances=1.

Bitnami Helm charts were deprecated by Broadcom in Aug 2025 — no free
container images available. Using official mysql:8.4 image instead.

## This change:
- Replace helm_release.mysql_cluster service selector with raw
  kubernetes_stateful_set_v1 using official mysql:8.4 image
- ConfigMap mysql-standalone-cnf: skip-log-bin, innodb_flush_log_at_trx_commit=2,
  innodb_doublewrite=ON (re-enabled for standalone safety)
- Service selector switched to standalone pod labels
- Technitium: disable SQLite query logging (18 GB/day write amplification),
  keep PostgreSQL-only logging (90-day retention)
- Grafana datasource and dashboards migrated from MySQL to PostgreSQL
- Dashboard SQL queries fixed for PG integer division (::float cast)
- Updated CLAUDE.md service-specific notes
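
The ConfigMap contents reduce to a few lines of my.cnf; a sketch of generating it with
the values listed above (the namespace is an assumption):
```
kubectl -n dbaas create configmap mysql-standalone-cnf --dry-run=client -o yaml \
  --from-literal=my.cnf='[mysqld]
# no binlog: single instance, no replication to feed
skip-log-bin
# flush the redo log once per second instead of per commit
innodb_flush_log_at_trx_commit = 2
# doublewrite stays on for crash safety without Group Replication
innodb_doublewrite = ON
'
```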

## What is NOT in this change:
- InnoDB Cluster + operator removal (Phase 4, 7+ days from now)
- Stale Vault role cleanup (Phase 4)
- Old PVC deletion (Phase 4)

Expected write reduction: ~113 GB/day (MySQL 95 + Technitium 18)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 19:01:06 +00:00
Viktor Barzin
ef30f27ac9 state(dbaas): update encrypted state 2026-04-16 18:56:59 +00:00
Viktor Barzin
b6fc1e63a6 state(dbaas): import postgresql-lb service 2026-04-16 18:55:40 +00:00
Viktor Barzin
14fa2b9762 state(vault): update encrypted state 2026-04-16 18:43:06 +00:00
Viktor Barzin
1a42f750f8 state(dbaas): update encrypted state 2026-04-16 18:41:34 +00:00
Viktor Barzin
0a43b5c2ac state(dbaas): update encrypted state 2026-04-16 18:31:33 +00:00
Viktor Barzin
cd513a2226 state(dbaas): update encrypted state 2026-04-16 18:24:31 +00:00
Viktor Barzin
0368601eff state(dbaas): update encrypted state 2026-04-16 18:24:20 +00:00
Viktor Barzin
a237ac97e0 docs(upgrades): add bulk upgrade results from first production run
12 services upgraded in 30 min: audiobookshelf, owntracks, open-webui,
immich, coturn, shlink, phpipam, onlyoffice, paperless-ngx, linkwarden,
synapse, dawarich. Documents auto-rollback behavior, resource awareness
(paperless memory bump), bulk upgrade procedure, and rate limit reset.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 17:34:27 +00:00
Viktor Barzin
39b5ed04a7 upgrade: dawarich 0.37.1 -> 1.6.1 (fix entrypoint + add production env)
## Context
Version 1.3.0+ changed the recommended command from `bin/dev` (development)
to `bin/rails server -p 3000 -b ::` (production). Also requires RAILS_ENV=production,
SECRET_KEY_BASE, and RAILS_LOG_TO_STDOUT env vars.

## This change
- Command: `bin/dev` → `bin/rails server -p 3000 -b ::`
- Add RAILS_ENV=production
- Add SECRET_KEY_BASE (stored in Vault secret/dawarich, synced via ESO)
- Add RAILS_LOG_TO_STDOUT=true

## What happened
1. Initial upgrade applied version 1.6.1 — DB migrations ran but pod
   CrashLooped due to wrong entrypoint (bin/dev exits in production mode)
2. Rollback to 0.37.1 failed because 1.6.1 migrations already ran
   (ActiveRecord::UnknownPrimaryKey on rails_pulse_routes)
3. Rolled forward with corrected entrypoint + env vars
4. Service now stable: 20/20 health checks passed over 5 minutes

Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
2026-04-16 17:25:29 +00:00
Viktor Barzin
8bd2ace00d state(technitium): update encrypted state 2026-04-16 17:21:06 +00:00
Viktor Barzin
7680d4e009 state(dawarich): update encrypted state 2026-04-16 17:19:29 +00:00
Viktor Barzin
59e99f2a3a upgrade: paperless-ngx increase memory 1Gi -> 2Gi
v2.20.14 OOMKills at 1Gi during search index rebuild on upgrade.
Bumped to 2Gi request=limit to handle startup index operations.

Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
2026-04-16 17:16:23 +00:00
Viktor Barzin
611c67b92c state(paperless-ngx): update encrypted state 2026-04-16 17:10:57 +00:00
Viktor Barzin
5f1b14ad53 upgrade: dawarich re-apply 1.6.1 (forward-fix after failed rollback)
DB migrations from 1.6.1 already ran, making 0.37.1 incompatible
(ActiveRecord::UnknownPrimaryKey on rails_pulse_routes table).
Rolling forward is the correct path.

Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
2026-04-16 17:08:20 +00:00
Viktor Barzin
e9275534b6 state(dawarich): update encrypted state 2026-04-16 17:08:15 +00:00
Viktor Barzin
1f589a403c state(dawarich): update encrypted state 2026-04-16 17:04:44 +00:00
Viktor Barzin
f5883be981 Revert "upgrade: dawarich 0.37.1 -> 1.6.1"
This reverts commit ec8b4dbaac.
2026-04-16 17:04:06 +00:00
Viktor Barzin
178fc4b398 state(matrix): update encrypted state 2026-04-16 17:01:28 +00:00
Viktor Barzin
449f1af9d6 state(immich): update encrypted state 2026-04-16 17:00:59 +00:00
Viktor Barzin
0ec48d942f state(paperless-ngx): update encrypted state 2026-04-16 17:00:58 +00:00
Viktor Barzin
88c47efa1d state(url): update encrypted state 2026-04-16 16:54:31 +00:00
Viktor Barzin
7b69641357 upgrade: linkwarden v2.9.1 -> v2.14.0
Changelog summary: 23 intermediate releases spanning text highlighting, drag & drop
organization, tag management page, compact sidebar, mobile app (iOS/Android),
SingleFile upload support, performance refactors, and NextJS CVE patches.
Risk: SAFE
Breaking changes: none
DB backup: yes (job: pre-upgrade-linkwarden-1776357253, 254M dump confirmed)
Config changes applied: none
Flagged for manual review: none

Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
2026-04-16 16:54:30 +00:00
Viktor Barzin
e1aa59ce53 upgrade: paperless-ngx 2.16.4 -> 2.20.14
Changelog summary: 4 minor versions of bug fixes, security patches (7+ CVEs),
and features (nested tags, PDF editor, advanced workflow filters, Trixie base).

Risk: CAUTION
Breaking changes:
- v2.17.0: Scheduled workflow offset sign corrected (restores pre-2.16 behavior)
- v2.20.7: Filename template rendering restricted to safe document context
DB backup: yes (job: pre-upgrade-paperless-ngx-1776357314, 254MiB)
Config changes applied: none
Flagged for manual review:
- Check scheduled workflow offsets if any were modified during 2.16.x
- Verify storage path templates still render correctly after v2.20.7 restriction

Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
2026-04-16 16:53:42 +00:00
Viktor Barzin
237126eb3a upgrade: onlyoffice 8.2.3 -> 9.3.1
Changelog summary: Major version bump spanning 13 releases. v9.0.0 adds PDF editor
API, macro recording, Service Worker caching. v9.2.1 fixes critical security vulns
(XSS, memory manipulation leading to RCE in XLS conversion). v9.3.0 adds GIF animations,
multiple pages view, signature settings, hyperlinks on images/shapes.
Risk: CAUTION (major version bump 8->9)
Breaking changes: none affecting Docker+MySQL deployment. PostgreSQL schema change
in v9.0.0 (irrelevant — we use MySQL). API endpoint deprecations (ConvertService.ashx,
GET requests to converter/command) — not removals. Config parameter renames
(leftMenu->layout.leftMenu etc.) are editor JS API, not server config.
DB backup: yes (job: pre-upgrade-onlyoffice-1776357277, MySQL full dump)
Config changes applied: none required
Flagged for manual review: none

Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
2026-04-16 16:53:24 +00:00
Viktor Barzin
8637d82817 upgrade: phpipam v1.7.0 -> v1.7.4
Changelog summary: Bugfixes (PHP8 compat, UI performance, jQuery errors)
and security fixes (XSS reflected, CSRF cookie, RCE via ping_path,
DB credential exposure via mysqldump).

Risk: SAFE
Breaking changes: none
DB backup: yes (job: pre-upgrade-phpipam-1776357227, mysql, dbaas ns)
Config changes applied: none
Flagged for manual review: none

Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
2026-04-16 16:53:21 +00:00