Commit graph

2788 commits

Viktor Barzin
d3be9b50af [frigate] Remove orphan config.yaml with leaked RTSP passwords
## Context
A Frigate configuration file was added to modules/kubernetes/frigate/ in
bcad200a (2026-04-15, ~2 days ago) as part of a bulk `chore: add untracked
stacks, scripts, and agent configs` commit. The file contains 14 inline
rtsp://admin:<password>@<host>:554/... URLs, leaking two distinct RTSP
passwords for the cameras at 192.168.1.10 (LAN-only) and
valchedrym.ddns.net (confirmed reachable from public internet on port
554). Both remotes are public, so the creds have been exposed for ~2 days.

Grep across the repo confirms nothing references this config.yaml — the
active stacks/frigate/main.tf stack reads its configuration from a
persistent volume claim named `frigate-config-encrypted`, not from this
file. The file is therefore an orphan from the bulk add, with no
production function.

## This change
- git rm modules/kubernetes/frigate/config.yaml

## What is NOT in this change
- Camera password rotation. The user does not own the cameras; rotation
  must be coordinated out-of-band with the camera operators. The DDNS
  camera (valchedrym.ddns.net:554) is internet-reachable, so the leaked
  password is high-priority to rotate from the device side.
- Git-history rewrite. The file plus its leaked strings remain in all
  commits from bcad200a forward. Scheduled to be purged via
  `git filter-repo --path modules/kubernetes/frigate/config.yaml
  --invert-paths --replace-text <list>` in the broader remediation pass.
- Future Frigate config provisioning. If the stack is re-platformed to
  source config from Git rather than the PVC, the replacement should go
  through ExternalSecret + env-var interpolation, not an inline YAML.

## Test plan
### Automated
  $ grep -rn 'frigate/config\.yaml' --include='*.tf' --include='*.hcl' \
       --include='*.yaml' --include='*.yml' --include='*.sh'
  (no output — confirms orphan status)

### Manual Verification
1. `git show HEAD --stat` shows exactly one deletion:
     modules/kubernetes/frigate/config.yaml | 229 ---------------------------------
2. `test ! -e modules/kubernetes/frigate/config.yaml` returns true.
3. `kubectl -n frigate get pvc frigate-config-encrypted` still shows the
   PVC bound (unaffected by this change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 19:39:35 +00:00
Viktor Barzin
7a884a0b97 [monitoring] Fix alerts for intentionally scaled-down services
PoisonFountainDown and ForwardAuthFallbackActive both fired because
poison-fountain was scaled to 0 replicas (intentional). Updated both
alert expressions to check kube_deployment_spec_replicas > 0 before
alerting on missing available replicas — if desired replicas is 0,
the service is intentionally down and should not alert.
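
A quick sanity check of the gated expression against the live Prometheus API (the
deployment label and Prometheus address are placeholders; the spec-replicas guard is
the one named above, the available-replicas metric name is an assumption):
```
# Expect an empty result for poison-fountain: desired replicas is 0, so the
# "available == 0" condition is masked by the "spec replicas > 0" guard.
curl -sG 'http://prometheus.monitoring.svc:9090/api/v1/query' \
  --data-urlencode 'query=kube_deployment_status_replicas_available{deployment="poison-fountain"} == 0
    and kube_deployment_spec_replicas{deployment="poison-fountain"} > 0'
```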

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 19:17:41 +00:00
Viktor Barzin
a19581e32b fix(beads-server): fix Workbench timeout — use internal GraphQL URL
GRAPHQLAPI_URL must point to localhost:9002 (internal), not the external
URL, which goes through Authentik. SSR can't authenticate to Authentik.
Also removed Authentik from the /graphql ingress — browser fetch() can't
follow 302 redirects on POST requests.
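
A minimal spot-check sketch (deployment, namespace, and hostname here are placeholders,
not taken from the stack):
```
# SSR must resolve GraphQL locally; the external hostname would 302 to Authentik.
kubectl -n beads-server exec deploy/beads-workbench -- printenv GRAPHQLAPI_URL
# expected: an internal URL on port 9002, not the public hostname
curl -s -o /dev/null -w '%{http_code}\n' -X POST https://beads.viktorbarzin.me/graphql \
  -H 'Content-Type: application/json' -d '{"query":"{ __typename }"}'
# expected: 200 (no Authentik 302 on the POST path after the ingress change)
```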

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 19:05:47 +00:00
Viktor Barzin
da6b82ed5c fix(beads-server): persist GRAPHQLAPI_URL in Terraform
The env var was only set via kubectl and got overwritten on next apply.
Now permanently in the deployment spec.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 18:58:59 +00:00
Viktor Barzin
afb8a16623 [infra] Scale down unused services + remove DoH ingress
Scale to 0 replicas:
- ollama: low usage, saves ~2Gi memory + 59GB NFS-SSD model data idle
- poison-fountain: RSS link archiver, not actively used
- travel-blog: Hugo blog, not actively used

Remove technitium DoH ingress (dns.viktorbarzin.me): externally unreachable
and unused. DNS is served on UDP/TCP port 53 via LoadBalancer (10.0.20.201).

Clears 3 of 5 ExternalAccessDivergence services. The remaining 2 (pdf, travel)
should clear once the Uptime Kuma monitors report both as down.
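
A post-apply spot-check sketch (namespace names are assumptions; the LoadBalancer IP is
the one from this change):
```
for d in ollama poison-fountain travel-blog; do
  kubectl -n "$d" get deploy "$d" -o jsonpath='{.metadata.name} replicas={.spec.replicas}{"\n"}'
done
# DNS should still answer on the LoadBalancer even with the DoH ingress removed:
dig +short viktorbarzin.me @10.0.20.201
```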

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 18:55:52 +00:00
Viktor Barzin
cdc851fc63 [alerts] Fix status-page-pusher crash + Prometheus backup push
## status-page-pusher (ExternalAccessDivergence false positive)
The pusher was crashing with `AttributeError: 'list' object has no attribute
'get'` at line 122 — the uptime-kuma-api library changed the heartbeats return
format. Fixed by making beat flattening more robust: handle any nesting of
lists/dicts in the heartbeat data, and add an isinstance check before calling
`.get()` on the latest beat.

## Prometheus backup (PrometheusBackupNeverRun)
The backup sidecar's Pushgateway push was silently failing because `wget
--post-file=-` needs `--header="Content-Type: text/plain"` for Pushgateway
to accept the Prometheus exposition format. Added the header. Also manually
pushed the metric to clear the `absent()` alert immediately.
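
The corrected push reduces to something like this (the `--header` flag is the actual fix;
the metric name and Pushgateway address are placeholders):
```
echo "prometheus_backup_last_success_timestamp_seconds $(date +%s)" \
  | wget -qO- --header='Content-Type: text/plain' \
      --post-file=- http://pushgateway.monitoring.svc:9091/metrics/job/prometheus-backup
```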

Note: ExternalAccessDivergence still fires because 5 services (ollama, pdf,
poison, dns, travel) ARE genuinely externally unreachable but internally up.
This is a real issue (likely Cloudflare tunnel routing), not a false positive.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 18:29:43 +00:00
Viktor Barzin
eef4242408 fix(beads-server): auto-connect Workbench to Dolt on startup
The Workbench's database connection is in-memory and lost on pod restart.
Added a startup script that waits for GraphQL server readiness, then calls the
addDatabaseConnection mutation automatically. No more manual reconnection.
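
A sketch of the wait-then-connect flow; the port follows the earlier GRAPHQLAPI_URL fix,
while the mutation arguments are placeholders for whatever the Workbench schema actually takes:
```
#!/bin/sh
GQL=http://localhost:9002/graphql
# Block until the GraphQL server answers a trivial query.
until curl -sf -H 'Content-Type: application/json' \
      -d '{"query":"{ __typename }"}' "$GQL" >/dev/null; do
  sleep 2
done
# Re-register the Dolt connection (argument names here are illustrative only).
curl -s -H 'Content-Type: application/json' \
  -d '{"query":"mutation { addDatabaseConnection(url: \"mysql://root@dolt:3306/beads\") { __typename } }"}' \
  "$GQL"
```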

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 18:12:31 +00:00
Viktor Barzin
5e9e487661 feat(setup-project): auto-PR working Dockerfiles back to upstream
## Context
The setup-project skill treats "build from a Dockerfile" as priority 6 — "last
resort, avoid if possible" — with no formalized path for apps whose upstream
lacks a working Dockerfile. When we end up writing one to get the deploy green,
that Dockerfile stays private in the infra repo and upstream never benefits.

## This change
Adds a closed-loop flow: when we author a new Dockerfile (or fix a broken
upstream one) and the deploy is healthy for 10 minutes, auto-open a PR against
the upstream repo so the self-hosting community gets the working recipe.

Flow:
1. Classify dockerfile_state during research phase (image-used / used-as-is /
   fixed-broken-upstream / written-from-scratch). Persist to
   modules/kubernetes/<service>/.contribution-state.json.
2. After Terraform apply, run scripts/stability-gate.sh — polls pod Ready +
   HTTP 200 every 30s x 20 iterations, requires 18/20 successes.
3. On pass with a trigger state, scripts/contribute-dockerfile.sh does the
   GitHub API dance: fork → merge-upstream → branch → commit Dockerfile /
   .dockerignore / BUILD.md via Contents API → open PR with body rendered from
   templates/PR_BODY.md. Idempotent (skips on recorded PR URL, existing fork,
   existing branch, open PR, upstream landed a Dockerfile mid-deploy).

GitHub API via curl (gh CLI is sandbox-blocked per .claude/CLAUDE.md); token
pulled from Vault (`secret/viktor` → `github_pat`). Commits include
Signed-off-by for DCO-enforcing repos. Fork branch name is `add-dockerfile`
for written-from-scratch or `fix-dockerfile` for fixed-broken-upstream, with
timestamp suffix on collision.
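
Condensed sketch of the curl-only API dance (the endpoints are standard GitHub REST v3;
repo, branch, and token variable names are placeholders):
```
GH_API=https://api.github.com
AUTH="Authorization: Bearer ${GITHUB_PAT}"   # pulled from Vault secret/viktor -> github_pat
UPSTREAM=owner/app                           # placeholder upstream repo

# 1. Fork (no-op if the fork already exists).
curl -s -X POST -H "$AUTH" "$GH_API/repos/$UPSTREAM/forks"
# 2. Sync the fork's default branch with upstream.
curl -s -X POST -H "$AUTH" "$GH_API/repos/ViktorBarzin/app/merge-upstream" \
  -d '{"branch":"main"}'
# 3. Commit the Dockerfile via the Contents API (base64 body).
#    (creating the add-dockerfile ref via POST .../git/refs is elided here)
curl -s -X PUT -H "$AUTH" "$GH_API/repos/ViktorBarzin/app/contents/Dockerfile" \
  -d "{\"message\":\"Add Dockerfile\",\"branch\":\"add-dockerfile\",\"content\":\"$(base64 -w0 Dockerfile)\"}"
# 4. Open the PR against upstream with the rendered template as body.
curl -s -X POST -H "$AUTH" "$GH_API/repos/$UPSTREAM/pulls" \
  -d '{"title":"Add Dockerfile","head":"ViktorBarzin:add-dockerfile","base":"main","body":"..."}'
```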

## Files
- SKILL.md — state classification table, quality bar checklist, §8b stability
  gate, §10 contribute-upstream step, checklist updates
- scripts/stability-gate.sh — 10-minute health probe
- scripts/contribute-dockerfile.sh — GitHub API orchestrator
- templates/PR_BODY.md — `{{VAR}}` placeholder template for PR description
- templates/Dockerfile.README.md — BUILD.md template shipped with the PR

## What is NOT in this change
- No Woodpecker / GHA changes (skill-local flow).
- No auto-tracking of merge/reject outcomes upstream (manual follow-up).
- Not yet exercised end-to-end; first real-world run will validate the API
  dance. Plan to dry-run against a throwaway sink repo before pointing at a
  real upstream.

## Test Plan
### Automated
- bash -n on both scripts → pass
- Manual read-through of SKILL.md — step numbering coherent, existing
  §1-9 semantics untouched, new §8b/§10 reference real files

### Manual Verification
1. Next time setup-project onboards a Dockerfile-less app:
   - Confirm .contribution-state.json is written with `written-from-scratch`
   - Run stability-gate.sh — expect 18/20 passes on a healthy deploy
   - Run contribute-dockerfile.sh — expect a fork + branch + PR on ViktorBarzin
   - Verify contribution_pr_url is back-written to the state file
2. Re-run contribute-dockerfile.sh → must be a no-op (idempotent)
3. Upstream-archived case: manually archive a test upstream → re-run →
   expect SKIP, no PR created

[ci skip]

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 18:12:13 +00:00
Viktor Barzin
1860cd1dfb state(vault): update encrypted state 2026-04-17 14:14:05 +00:00
Viktor Barzin
f0ddfb8cae state(dbaas): update encrypted state 2026-04-17 14:08:49 +00:00
Viktor Barzin
b034c868db [traefik] Remove broken rewrite-body plugin and all rybbit/anti-AI injection
The rewrite-body Traefik plugin (both packruler/rewrite-body v1.2.0 and
the-ccsn/traefik-plugin-rewritebody v0.1.3) silently fails on Traefik
v3.6.12 due to Yaegi interpreter issues with ResponseWriter wrapping.
Both plugins load without errors but never inject content.

Removed:
- rewrite-body plugin download (init container) and registration
- strip-accept-encoding middleware (only existed for rewrite-body bug)
- anti-ai-trap-links middleware (used rewrite-body for injection)
- rybbit_site_id variable from ingress_factory and reverse_proxy factory
- rybbit_site_id from 25 service stacks (39 instances)
- Per-service rybbit-analytics middleware CRD resources

Kept:
- compress middleware (entrypoint-level, working correctly)
- ai-bot-block middleware (ForwardAuth to bot-block-proxy)
- anti-ai-headers middleware (X-Robots-Tag: noai, noimageai)
- All CrowdSec, Authentik, rate-limit middleware unchanged

Next: Cloudflare Workers with HTMLRewriter for edge-side injection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 12:41:17 +00:00
Viktor Barzin
b24545ffdb fix(beads-server): fix BeadBoard project ID + install bd binary
- Fixed project_id mismatch (was "beadboard", should be actual DB project ID)
- Rebuilt Docker image with bd v1.0.2 binary (node:20-slim for glibc compat)
- Ran bd migrate to update schema from 1.0.0 → 1.0.2 (adds started_at, etc.)
- Task creation and bd CLI now work inside the container

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 11:57:45 +00:00
Viktor Barzin
f2037545b3 fix(beads-server): make BeadBoard .beads dir writable
BeadBoard needs to create templates/ and archetypes/ subdirectories
inside .beads/. ConfigMap mounts are read-only, causing ENOENT errors
and 503 responses. Fix: init container copies ConfigMap to emptyDir.
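
In effect the init container is just a copy step, roughly (mount paths are assumptions):
```
# ConfigMap mounted read-only at /config-ro; writable emptyDir shared with the app at /beads.
# -L dereferences the ConfigMap's symlinked keys so BeadBoard gets plain files it can
# create templates/ and archetypes/ next to.
cp -rL /config-ro/. /beads/
```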

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 11:37:26 +00:00
Viktor Barzin
00e2f15a5d feat(beads-server): deploy BeadBoard task visualization dashboard
Add BeadBoard (zenchantlive/beadboard) alongside Dolt server and Workbench
for task dependency graph, kanban, and agent coordination views.

- Built custom Docker image (registry.viktorbarzin.me:5050/beadboard)
- ConfigMap provides .beads/metadata.json pointing to Dolt server
- Behind Authentik auth at beadboard.viktorbarzin.me
- Also fixed: GraphQL ingress now has Authentik middleware
- Also fixed: Workbench store.json type enum (mysql → Mysql)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 11:30:43 +00:00
Viktor Barzin
26abd8fe94 [skill] Add /disk-wear skill for periodic disk write analysis
## Context
After the MySQL standalone migration + Technitium SQLite disable saved ~130 GB/day
of disk writes, this methodology should be reusable for periodic health reviews.

## This change:
Adds `/disk-wear` skill that combines three data sources:
- SSH to PVE host for real-time 30s I/O snapshots and SSD SMART health
- Prometheus PromQL for per-app write attribution (node_disk_written_bytes_total
  joined with node_disk_device_mapper_info for dm->LVM mapping)
- kubectl for PVC UUID -> pod/namespace mapping

Produces ranked breakdowns by physical disk, VM, k8s namespace, and individual PVC.
Includes baselines, red flag detection, and annualized wear projections.

Note: container_fs_writes_bytes_total has 0 series (cadvisor doesn't track
block device writes per container), so per-app attribution uses the PVE host's
dm-device level metrics mapped through Prometheus and kubectl.
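
The per-app attribution boils down to a join like the one below (label names follow
node_exporter conventions; the exact grouping and Prometheus address are assumptions):
```
# Bytes written per LVM volume (dm device) per day, via the Prometheus HTTP API.
curl -sG http://prometheus.monitoring.svc:9090/api/v1/query --data-urlencode \
  'query=sum by (name) (
     rate(node_disk_written_bytes_total{device=~"dm-.*"}[24h])
     * on (instance, device) group_left(name) node_disk_device_mapper_info
   ) * 86400'
```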

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 11:15:26 +00:00
Viktor Barzin
366e2ab083 [uptime-kuma] Opt-out external monitoring for every public ingress [ci skip]
## Context
After the previous commit migrated monitor discovery to per-ingress annotation
(opt-in via `uptime.viktorbarzin.me/external-monitor=true`), coverage expanded
from 13 → 26 monitors but still left ~99 public ingresses uncovered — notably
Helm-managed services (authentik, grafana, vault, forgejo, ntfy) that don't
go through `ingress_factory`, plus any `dns_type = "non-proxied"` ingress
(Immich was a direct victim: `dns_type = "non-proxied"` → no annotation added
→ no monitor → invisible outage).

The user's concern: "I should have known external Immich was down before
users tried to open it."

## This change
Flipped the semantic from opt-in to **opt-out by default**:
- Every ingress whose host ends in `.viktorbarzin.me` gets an `[External] <label>`
  monitor automatically
- Only ingresses with annotation `uptime.viktorbarzin.me/external-monitor=false`
  are skipped
- Host dedup via a `seen` set (one monitor per hostname, regardless of how
  many Ingress resources share it)
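
The same selection logic, sketched in shell against the live cluster (the annotation key
is the real one; the jq shape is an assumption about the script's Python equivalent):
```
kubectl get ingress -A -o json | jq -r '
  .items[]
  | select((.metadata.annotations["uptime.viktorbarzin.me/external-monitor"] // "true") != "false")
  | .spec.rules[]?.host // empty
  | select(endswith(".viktorbarzin.me"))
' | sort -u    # one monitor per hostname, mirroring the "seen" set
```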

## Verification
Triggered a manual CronJob run post-apply:
```
Sync complete: 102 created, 1 deleted, 23 unchanged
```
Coverage jumped from 26 → ~124 external monitors. All 6 Helm-managed services
now have dedicated monitors:
- [External] immich, authentik, forgejo, grafana, ntfy, vault

## Scope
Only `stacks/uptime-kuma/modules/uptime-kuma/main.tf` (Python script in the
CronJob resource). No RBAC or service account changes — the ones added in the
previous commit still cover this path.

## Test plan

### Automated
```
$ kubectl -n uptime-kuma logs -l job-name=manual-sync-optout-1776422993 --tail=50 | grep -iE 'immich|authentik|grafana|forgejo|vault|ntfy'
Creating monitor: [External] authentik -> https://authentik.viktorbarzin.me
Creating monitor: [External] forgejo   -> https://forgejo.viktorbarzin.me
Creating monitor: [External] immich    -> https://immich.viktorbarzin.me
Creating monitor: [External] grafana   -> https://grafana.viktorbarzin.me
Creating monitor: [External] ntfy      -> https://ntfy.viktorbarzin.me
Creating monitor: [External] vault     -> https://vault.viktorbarzin.me
```

### Manual Verification
1. Open `https://uptime.viktorbarzin.me` → confirm `[External] immich` exists
2. Simulate an Immich outage (scale deploy to 0 briefly) → external monitor
   should go red within the probe interval (5min); internal monitor stays up
   (pod-level from a different probe angle) → `ExternalAccessDivergence`
   alert fires after 15 min

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 11:12:00 +00:00
Viktor Barzin
66d2d9916b [infra] Per-ingress external-monitor annotation + actualbudget plan-time fix [ci skip]
## Context
Two operational gaps surfaced during a healthcheck sweep today:

1. **External monitoring coverage**: Only ~13 hostnames (via `cloudflare_proxied_names`
   in `config.tfvars`) had `[External]` monitors in Uptime Kuma. Any service deployed via
   `ingress_factory` with `dns_type = "proxied"` auto-created its DNS record but was NOT
   registered for external probing — so outages like Immich going down externally were
   invisible until a user complained. 99 of ~125 public ingresses had no external
   monitor.

2. **actualbudget stack unplannable**: `count = var.budget_encryption_password != null
   ? 1 : 0` in `factory/main.tf:152` failed with "Invalid count argument" because the
   value flows from a `data.kubernetes_secret` whose contents are `(known after apply)`
   at plan time. Blocked CI applies and drift reconciliation.

## This change

### Per-ingress external-monitor annotation (ingress_factory + reverse_proxy/factory)
- New variables `external_monitor` (bool, nullable) + `external_monitor_name` (string,
  nullable). Default is "follow dns_type" — enabled for any public DNS record
  (`dns_type != "none"`, covers both proxied and non-proxied so Immich and other
  direct-A records are also monitored).
- Emits two annotations on the Ingress:
  - `uptime.viktorbarzin.me/external-monitor = "true"`
  - `uptime.viktorbarzin.me/external-monitor-name = "<label>"` (optional override)

### external-monitor-sync CronJob (uptime-kuma stack)
- Discovers targets from live Ingress objects via the K8s API first (filter by
  annotation), falls back to the legacy `external-monitor-targets` ConfigMap on any
  API error (zero rollout risk).
- New `ServiceAccount` + cluster-wide `ClusterRole`/`ClusterRoleBinding` giving
  `list`/`get` on `networking.k8s.io/ingresses`.
- `API_SERVER` now uses the `KUBERNETES_SERVICE_HOST` env var (always injected by K8s)
  instead of `kubernetes.default.svc` — the search-domain expansion failed in the
  CronJob pod's DNS config. Verified working: CronJob now logs
  `Loaded N external monitor targets (source=k8s-api)`.
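
The equivalent in-cluster lookup, sketched in shell (the service-account paths are the
standard mounts; the ClusterRole above is what authorizes the list):
```
API_SERVER="https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT:-443}"
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
CACERT=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
# List ingresses cluster-wide and filter on the external-monitor annotation client-side.
curl -s --cacert "$CACERT" -H "Authorization: Bearer ${TOKEN}" \
  "${API_SERVER}/apis/networking.k8s.io/v1/ingresses?limit=500"
```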

### actualbudget count-on-unknown refactor
- Replaced `count = var.budget_encryption_password != null ? 1 : 0` with two explicit
  plan-time booleans: `enable_http_api` and `enable_bank_sync`. Values are known at
  plan; no `-target` workaround needed.
- Callers (`stacks/actualbudget/main.tf`) pass `true` explicitly. Runtime behaviour is
  unchanged — the secret is still consumed via env var.
- Also aligned the factory with live state (the 3 budget-* PVCs had been migrated
  `proxmox-lvm` → `proxmox-lvm-encrypted` outside Terraform): PVC resource renamed
  `data_proxmox` → `data_encrypted`, storage class updated, orphaned `nfs_data` module
  removed. State was rm'd + re-imported with matching UIDs, so no data was moved.

## Rollout status (already partially applied in this session)
- `stacks/uptime-kuma` applied — SA + RBAC + CronJob changes live; FQDN fix verified
- `stacks/actualbudget` applied — budget-{viktor,anca,emo} all 200 OK externally
- `stacks/mailserver` + 21 other ingress_factory consumers applied — annotations live
- CronJob `external-monitor-sync` latest run: `source=k8s-api`, 26 monitors active
  (was 13 on the central list)

## Deferred (separate work)
- 4 stacks show pre-existing DESTRUCTIVE drift in plan (metallb namespace, claude-memory,
  rbac, redis) — NOT triggered by this commit but will be by CI's global-file cascade.
  `[ci skip]` here so those don't auto-apply; they will be fixed manually before the
  next CI push.
- Cleanup of `cloudflare_proxied_names` list once Helm-managed ingresses (authentik,
  grafana, vault, forgejo) are annotated — separate PR.

## Test plan

### Automated
```
$ kubectl -n uptime-kuma logs $(kubectl -n uptime-kuma get pods -l job-name -o name | tail -1)
Loaded 26 external monitor targets (source=k8s-api)
Sync complete: 7 created, 0 deleted, 17 unchanged

$ curl -sk -o /dev/null -w "%{http_code}\n" -H "Accept: text/html" \
    https://dawarich.viktorbarzin.me/ https://nextcloud.viktorbarzin.me/ \
    https://budget-viktor.viktorbarzin.me/
200 302 200

$ kubectl -n actualbudget get deploy,pvc -l app=budget-viktor
deployment.apps/budget-viktor     1/1 1 1 Ready
persistentvolumeclaim/budget-viktor-data-encrypted  Bound  10Gi  RWO  proxmox-lvm-encrypted
```

### Manual Verification
1. Confirm the annotation is present on an ingress_factory ingress:
   ```
   kubectl -n dawarich get ingress dawarich -o \
     jsonpath='{.metadata.annotations.uptime\.viktorbarzin\.me/external-monitor}'
   # Expected: "true"
   ```
2. Confirm the new `[External] <name>` monitor appears in Uptime Kuma within 10 min
   (CronJob interval). For Immich specifically, it will appear after the immich stack
   is re-applied.
3. Verify actualbudget plan is clean:
   ```
   cd stacks/actualbudget && scripts/tg plan --non-interactive
   # Expected: no "Invalid count argument" errors
   ```

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 10:34:32 +00:00
Viktor Barzin
0c4fe98d75 state(dbaas): update encrypted state 2026-04-17 10:08:04 +00:00
Viktor Barzin
996bdfc9b6 [technitium] Uninstall MySQL+SQLite query log plugins instead of just disabling
## Context
Disabling MySQL/SQLite query logging via config was not durable — Technitium
re-enables disabled plugins on pod restart, causing 46 GB/day of writes to
the standalone MySQL (15M inserts to technitium.dns_logs between CronJob runs).

## This change:
The password-sync CronJob now UNINSTALLS MySQL and SQLite query log plugins
via `/api/apps/uninstall` instead of setting `enableLogging:false`. This is
permanent — the plugin files are removed from the PVC, so they can't re-enable
on restart. The CronJob checks if the plugins are present first (idempotent).

Only PostgreSQL query logging remains (90-day retention).
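
The uninstall step reduces to something like this (`/api/apps/uninstall` is the endpoint
named above; the list endpoint, service address, token variable, and app names are assumptions):
```
DNS_API=http://technitium.technitium.svc:5380
for app in "Query Logs (MySQL)" "Query Logs (Sqlite)"; do
  if curl -s "${DNS_API}/api/apps/list?token=${TECHNITIUM_TOKEN}" | grep -qF "$app"; then
    # Removes the plugin files from the PVC, so a pod restart cannot re-enable logging.
    curl -sG "${DNS_API}/api/apps/uninstall" \
      --data-urlencode "token=${TECHNITIUM_TOKEN}" \
      --data-urlencode "name=${app}"
  fi
done
```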

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 08:20:55 +00:00
Viktor Barzin
f0a73815d8 [freedify] Remove stale sed patches from container startup
The audio-engine.js, dom.js, and dj.js files were refactored/removed
in the upstream Freedify repo. The sed patches that disabled iOS EQ
auto-init and visualizer no longer have targets, causing the container
to crash on startup. Use the image's default CMD instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 06:17:13 +00:00
Viktor Barzin
f8facf44dd [infra] Fix rewrite-body plugin + cleanup TrueNAS + version bumps
## Context

The rewrite-body Traefik plugin (packruler/rewrite-body v1.2.0) silently
broke on Traefik v3.6.12 — every service using rybbit analytics or anti-AI
injection returned HTTP 200 with "Error 404: Not Found" body. Root cause:
middleware specs referenced plugin name `rewrite-body` but Traefik registered
it as `traefik-plugin-rewritebody`.

Migrated to maintained fork `the-ccsn/traefik-plugin-rewritebody` v0.1.3
which uses the correct plugin name. Also added `lastModified = true` and
`methods = ["GET"]` to anti-AI middleware to avoid rewriting non-HTML
responses.

## This change

- Replace packruler/rewrite-body v1.2.0 with the-ccsn/traefik-plugin-rewritebody v0.1.3
- Fix plugin name in all 3 middleware locations (ingress_factory, reverse-proxy factory, traefik anti-AI)
- Remove deprecated TrueNAS cloud sync monitor (VM decommissioned 2026-04-13)
- Remove CloudSyncStale/CloudSyncFailing/CloudSyncNeverRun alerts
- Fix PrometheusBackupNeverRun alert (for: 48h → 32d to match monthly sidecar schedule)
- Bump versions: rybbit v1.0.21→v1.1.0, wealthfolio v1.1.0→v3.2,
  networking-toolbox 1.1.1→1.6.0, cyberchef v10.24.0→v9.55.0
- MySQL standalone storage_limit 30Gi → 50Gi
- beads-server: fix Dolt workbench type casing, remove Authentik on GraphQL endpoint

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 05:51:52 +00:00
Viktor Barzin
8b206a63ad state(dbaas): update encrypted state 2026-04-16 22:55:52 +00:00
Viktor Barzin
4c8e5bea0b [traefik] Add global compress middleware to fix response compression
The rewrite-body plugin (rybbit analytics, anti-AI trap links) requires
strip-accept-encoding to work, which killed HTTP compression for 50+
services. This adds Traefik's built-in compress middleware at the
websecure entrypoint level to re-compress responses to clients after
rewrite-body has modified them.

Uses includedContentTypes whitelist (not excludedContentTypes) so only
text-based types are compressed. SSE, WebSocket, gRPC, and binary
downloads are unaffected.

Measured improvement on ha-sofia:
- app.js: 540KB → 167KB (3.2x)
- core.js: 52KB → 19KB (2.7x)
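
A quick external spot-check of the re-compression (the host is the one measured above;
the asset path is an assumption):
```
# Response should carry Content-Encoding: gzip now that compress runs at the entrypoint.
curl -skI -H 'Accept-Encoding: gzip' https://ha-sofia.viktorbarzin.me/ | grep -i '^content-encoding'
# Compare transferred bytes with and without compression for a large JS bundle.
curl -sk -o /dev/null -w 'gzip:  %{size_download} bytes\n' -H 'Accept-Encoding: gzip' \
  https://ha-sofia.viktorbarzin.me/frontend_latest/app.js
curl -sk -o /dev/null -w 'plain: %{size_download} bytes\n' -H 'Accept-Encoding: identity' \
  https://ha-sofia.viktorbarzin.me/frontend_latest/app.js
```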

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 22:18:51 +00:00
Viktor Barzin
e80b2f026f [infra] Migrate Terraform state from local SOPS to PostgreSQL backend
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
  state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
  10.0.20.200:5432/terraform_state with native pg_advisory_lock.

Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.

Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
  skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
  for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
  service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)
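
The tier switch in scripts/tg amounts to logic like this (the tier-0 list matches the one
above; the Vault path and connection-string details are assumptions):
```
TIER0="infra platform cnpg vault dbaas external-secrets"
stack=$(basename "$PWD")
if printf ' %s ' $TIER0 | grep -q " ${stack} "; then
  # Tier 0: SOPS-encrypted local state, unchanged bootstrap path.
  echo "tier0: local state + SOPS"
else
  # Tier 1: fetch PG creds from Vault, point Terraform's pg backend at the CNPG cluster.
  PG_PASS=$(vault kv get -field=password secret/terraform-state-pg)   # path is a placeholder
  export PG_CONN_STR="postgres://terraform:${PG_PASS}@10.0.20.200:5432/terraform_state"
  echo "tier1: PostgreSQL backend (locking via pg_advisory_lock)"
fi
```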

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 19:33:12 +00:00
Viktor Barzin
f538115c43 [dbaas] Migrate MySQL from InnoDB Cluster to standalone StatefulSet
## Context
Disk write analysis showed MySQL InnoDB Cluster writing ~95 GB/day for only
~35 MB of actual data due to Group Replication overhead (binlog, relay log,
GR apply log). The operator enforces GR even with serverInstances=1.

Bitnami Helm charts were deprecated by Broadcom in Aug 2025 — no free
container images available. Using official mysql:8.4 image instead.

## This change:
- Replace helm_release.mysql_cluster service selector with raw
  kubernetes_stateful_set_v1 using official mysql:8.4 image
- ConfigMap mysql-standalone-cnf: skip-log-bin, innodb_flush_log_at_trx_commit=2,
  innodb_doublewrite=ON (re-enabled for standalone safety)
- Service selector switched to standalone pod labels
- Technitium: disable SQLite query logging (18 GB/day write amplification),
  keep PostgreSQL-only logging (90-day retention)
- Grafana datasource and dashboards migrated from MySQL to PostgreSQL
- Dashboard SQL queries fixed for PG integer division (::float cast)
- Updated CLAUDE.md service-specific notes
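
The ConfigMap contents reduce to a few lines of my.cnf; a sketch of generating it with
the values listed above (the namespace is an assumption):
```
kubectl -n dbaas create configmap mysql-standalone-cnf --dry-run=client -o yaml \
  --from-literal=my.cnf='[mysqld]
# no binlog: single instance, no replication to feed
skip-log-bin
# flush the redo log once per second instead of per commit
innodb_flush_log_at_trx_commit = 2
# doublewrite stays on for crash safety without Group Replication
innodb_doublewrite = ON
'
```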

## What is NOT in this change:
- InnoDB Cluster + operator removal (Phase 4, 7+ days from now)
- Stale Vault role cleanup (Phase 4)
- Old PVC deletion (Phase 4)

Expected write reduction: ~113 GB/day (MySQL 95 + Technitium 18)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 19:01:06 +00:00
Viktor Barzin
ef30f27ac9 state(dbaas): update encrypted state 2026-04-16 18:56:59 +00:00
Viktor Barzin
b6fc1e63a6 state(dbaas): import postgresql-lb service 2026-04-16 18:55:40 +00:00
Viktor Barzin
14fa2b9762 state(vault): update encrypted state 2026-04-16 18:43:06 +00:00
Viktor Barzin
1a42f750f8 state(dbaas): update encrypted state 2026-04-16 18:41:34 +00:00
Viktor Barzin
0a43b5c2ac state(dbaas): update encrypted state 2026-04-16 18:31:33 +00:00
Viktor Barzin
cd513a2226 state(dbaas): update encrypted state 2026-04-16 18:24:31 +00:00
Viktor Barzin
0368601eff state(dbaas): update encrypted state 2026-04-16 18:24:20 +00:00
Viktor Barzin
a237ac97e0 docs(upgrades): add bulk upgrade results from first production run
12 services upgraded in 30 min: audiobookshelf, owntracks, open-webui,
immich, coturn, shlink, phpipam, onlyoffice, paperless-ngx, linkwarden,
synapse, dawarich. Documents auto-rollback behavior, resource awareness
(paperless memory bump), bulk upgrade procedure, and rate limit reset.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 17:34:27 +00:00
Viktor Barzin
39b5ed04a7 upgrade: dawarich 0.37.1 -> 1.6.1 (fix entrypoint + add production env)
## Context
Version 1.3.0+ changed the recommended command from `bin/dev` (development)
to `bin/rails server -p 3000 -b ::` (production). Also requires RAILS_ENV=production,
SECRET_KEY_BASE, and RAILS_LOG_TO_STDOUT env vars.

## This change
- Command: `bin/dev` → `bin/rails server -p 3000 -b ::`
- Add RAILS_ENV=production
- Add SECRET_KEY_BASE (stored in Vault secret/dawarich, synced via ESO)
- Add RAILS_LOG_TO_STDOUT=true

## What happened
1. Initial upgrade applied version 1.6.1 — DB migrations ran but pod
   CrashLooped due to wrong entrypoint (bin/dev exits in production mode)
2. Rollback to 0.37.1 failed because 1.6.1 migrations already ran
   (ActiveRecord::UnknownPrimaryKey on rails_pulse_routes)
3. Rolled forward with corrected entrypoint + env vars
4. Service now stable: 20/20 health checks passed over 5 minutes

Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
2026-04-16 17:25:29 +00:00
Viktor Barzin
8bd2ace00d state(technitium): update encrypted state 2026-04-16 17:21:06 +00:00
Viktor Barzin
7680d4e009 state(dawarich): update encrypted state 2026-04-16 17:19:29 +00:00
Viktor Barzin
59e99f2a3a upgrade: paperless-ngx increase memory 1Gi -> 2Gi
v2.20.14 OOMKills at 1Gi during search index rebuild on upgrade.
Bumped to 2Gi request=limit to handle startup index operations.

Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
2026-04-16 17:16:23 +00:00
Viktor Barzin
611c67b92c state(paperless-ngx): update encrypted state 2026-04-16 17:10:57 +00:00
Viktor Barzin
5f1b14ad53 upgrade: dawarich re-apply 1.6.1 (forward-fix after failed rollback)
DB migrations from 1.6.1 already ran, making 0.37.1 incompatible
(ActiveRecord::UnknownPrimaryKey on rails_pulse_routes table).
Rolling forward is the correct path.

Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
2026-04-16 17:08:20 +00:00
Viktor Barzin
e9275534b6 state(dawarich): update encrypted state 2026-04-16 17:08:15 +00:00
Viktor Barzin
1f589a403c state(dawarich): update encrypted state 2026-04-16 17:04:44 +00:00
Viktor Barzin
f5883be981 Revert "upgrade: dawarich 0.37.1 -> 1.6.1"
This reverts commit ec8b4dbaac.
2026-04-16 17:04:06 +00:00
Viktor Barzin
178fc4b398 state(matrix): update encrypted state 2026-04-16 17:01:28 +00:00
Viktor Barzin
449f1af9d6 state(immich): update encrypted state 2026-04-16 17:00:59 +00:00
Viktor Barzin
0ec48d942f state(paperless-ngx): update encrypted state 2026-04-16 17:00:58 +00:00
Viktor Barzin
88c47efa1d state(url): update encrypted state 2026-04-16 16:54:31 +00:00
Viktor Barzin
7b69641357 upgrade: linkwarden v2.9.1 -> v2.14.0
Changelog summary: 23 intermediate releases spanning text highlighting, drag & drop
organization, tag management page, compact sidebar, mobile app (iOS/Android),
SingleFile upload support, performance refactors, and NextJS CVE patches.
Risk: SAFE
Breaking changes: none
DB backup: yes (job: pre-upgrade-linkwarden-1776357253, 254M dump confirmed)
Config changes applied: none
Flagged for manual review: none

Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
2026-04-16 16:54:30 +00:00
Viktor Barzin
e1aa59ce53 upgrade: paperless-ngx 2.16.4 -> 2.20.14
Changelog summary: 4 minor versions of bug fixes, security patches (7+ CVEs),
and features (nested tags, PDF editor, advanced workflow filters, Trixie base).

Risk: CAUTION
Breaking changes:
- v2.17.0: Scheduled workflow offset sign corrected (restores pre-2.16 behavior)
- v2.20.7: Filename template rendering restricted to safe document context
DB backup: yes (job: pre-upgrade-paperless-ngx-1776357314, 254MiB)
Config changes applied: none
Flagged for manual review:
- Check scheduled workflow offsets if any were modified during 2.16.x
- Verify storage path templates still render correctly after v2.20.7 restriction

Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
2026-04-16 16:53:42 +00:00
Viktor Barzin
237126eb3a upgrade: onlyoffice 8.2.3 -> 9.3.1
Changelog summary: Major version bump spanning 13 releases. v9.0.0 adds PDF editor
API, macro recording, Service Worker caching. v9.2.1 fixes critical security vulns
(XSS, memory manipulation leading to RCE in XLS conversion). v9.3.0 adds GIF animations,
multiple pages view, signature settings, hyperlinks on images/shapes.
Risk: CAUTION (major version bump 8->9)
Breaking changes: none affecting Docker+MySQL deployment. PostgreSQL schema change
in v9.0.0 (irrelevant — we use MySQL). API endpoint deprecations (ConvertService.ashx,
GET requests to converter/command) — not removals. Config parameter renames
(leftMenu->layout.leftMenu etc.) are editor JS API, not server config.
DB backup: yes (job: pre-upgrade-onlyoffice-1776357277, MySQL full dump)
Config changes applied: none required
Flagged for manual review: none

Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
2026-04-16 16:53:24 +00:00
Viktor Barzin
8637d82817 upgrade: phpipam v1.7.0 -> v1.7.4
Changelog summary: Bugfixes (PHP8 compat, UI performance, jQuery errors)
and security fixes (XSS reflected, CSRF cookie, RCE via ping_path,
DB credential exposure via mysqldump).

Risk: SAFE
Breaking changes: none
DB backup: yes (job: pre-upgrade-phpipam-1776357227, mysql, dbaas ns)
Config changes applied: none
Flagged for manual review: none

Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
2026-04-16 16:53:21 +00:00