## Context
Stage 2 of ollama decommission. The Traefik `ollama-tcp` entrypoint on port
11434 forwarded TCP traffic to the ollama service. With the IngressRouteTCP
already deleted (previous commit), the entrypoint is now orphaned — removing
it cleans up the Helm values and closes the port on the LB IP.
## This change
- Deletes the `ollama-tcp` entry from the `ports` map in traefik Helm values.
- Apply: `0 added, 4 changed, 0 destroyed` — helm_release.traefik rolled out
new config, 3 auxiliary deployments picked up benign Kyverno ndots drift
(already accepted per user approval).
## Verification
- `kubectl get svc -n traefik traefik -o jsonpath='{.spec.ports[*].name}'`
output: `piper-tcp web websecure websecure-http3 whisper-tcp`
- `ollama-tcp` no longer listed.
## Test plan
### Automated
- `scripts/tg plan` showed 4 in-place updates, 0 destroy.
- `scripts/tg apply` → "Apply complete! Resources: 0 added, 4 changed, 0 destroyed."
### Manual Verification
1. `kubectl get svc -n traefik traefik -o jsonpath='{.spec.ports[*].name}'`
2. Confirm `ollama-tcp` is absent from the output.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Uptime Kuma TTFB was bimodal — fast ~150ms responses mixed with slow
~3s responses — median 1.7s, p95 3.2s across 20 samples. CPU request
was 50m (5% of one core) against a Node.js process that handles ~190
monitors plus SQLite DB maintenance. Memory request was 64Mi while
actual RSS sat around 221Mi, so the pod was also running above its
guaranteed memory floor and subject to eviction pressure when nodes got
tight.
CPU limits are intentionally absent cluster-wide (CFS throttling caused
more pain than it solved), so the only knob to give the scheduler a
higher floor is the request itself. Raising the request makes the node
reserve more CPU for the pod and lets the kernel's CFS weight it more
generously when the node is busy — should reduce the tail on the slow
path without introducing throttling.
## This change
- requests.cpu: 50m -> 100m
- requests.memory: 64Mi -> 128Mi
- limits.memory: unchanged at 512Mi
- limits.cpu: still unset (explicit — cluster-wide rule)
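A quick post-apply check that the new floor is what the pod actually runs with (deployment and namespace names are assumed from the stack layout):
```
kubectl -n uptime-kuma get deploy uptime-kuma \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'
# Expect: {"limits":{"memory":"512Mi"},"requests":{"cpu":"100m","memory":"128Mi"}}
```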
## What is NOT in this change
- No CPU limit added
- No readiness/liveness probe tuning
- No replica count change (still 1, Recreate strategy)
- No DB layer / SQLite tuning
## Measurements (20 curl samples of https://uptime.viktorbarzin.me/)
          before     after
min       0.143s     0.149s
median    1.727s     1.228s
p95       3.163s     3.154s
max       3.204s     3.283s
mean      1.768s     1.590s
Median dropped ~29% (1.73s -> 1.23s). Tail (p95/max) essentially
unchanged — the slow bucket appears driven by something other than
CPU scheduling (likely socket.io / SSR render path inside the app,
or TLS/cf-tunnel handshake — worth a separate investigation).
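The exact collection command isn't recorded here; a minimal sketch that reproduces the same 20-sample TTFB measurement:
```
# 20 TTFB samples against the public URL, then crude min/median/p95/max/mean.
for i in $(seq 1 20); do
  curl -so /dev/null -w '%{time_starttransfer}\n' https://uptime.viktorbarzin.me/
done | sort -n | awk '{v[NR]=$1; s+=$1} END {
  printf "min %.3fs median %.3fs p95 %.3fs max %.3fs mean %.3fs\n",
         v[1], v[int(NR/2)], v[int(NR*0.95)], v[NR], s/NR }'
```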
Closes: code-79d
## Context
Ollama is being decommissioned. The `ollama_tcp_ingressroute` manifest in
stacks/whisper routed Traefik TCP entrypoint 11434 → ollama service in the
ollama namespace. With ollama going away, this route is dead weight and
blocks the subsequent destroy of the ollama stack.
## This change
- Deletes `kubernetes_manifest.ollama_tcp_ingressroute` from stacks/whisper/main.tf
- Apply result: 0 added, 5 changed, 0 destroyed (the manifest destroy happened in a
previous partial-apply; the 5 "changed" resources are benign Kyverno ndots /
PVC ownership drift which was already accepted per the user's approval).
- Verified `kubectl get ingressroutetcp -n traefik ollama-tcp` returns NotFound.
## What is NOT in this change
- Traefik entrypoint 11434 still exists (stage 2)
- Ollama namespace, deployments, services still present (stage 8)
## Test plan
### Automated
- `scripts/tg plan` showed 1 destroy (ollama_tcp_ingressroute), 1 create (data_proxmox
PVC import), 4 benign updates.
- `scripts/tg apply -auto-approve` → "Apply complete! Resources: 0 added, 5 changed, 0 destroyed."
### Manual Verification
- kubectl get ingressroutetcp -n traefik ollama-tcp → NotFound (confirmed)
- kubectl get ingressroutetcp -n traefik whisper-tcp piper-tcp → still present
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the stale "Dev VM SSH key" secret entry with the current
`claude-agent-service` bearer token path (synced to both consumer +
caller namespaces). Adds an "n8n workflow gotchas" section documenting:
1. The workflow is DB-state, not Terraform-managed — the JSON in the
repo is a backup, not authoritative.
2. Header-expression syntax: `=Bearer {{ $env.X }}` works; JS concat
`='Bearer ' + $env.X` does NOT and causes silent 401s.
3. `N8N_BLOCK_ENV_ACCESS_IN_NODE=false` requirement.
4. 401-troubleshooting steps and the UPDATE pattern for in-place
workflow patches.
Follow-up to 99180bec which fixed the actual pipeline break.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
DIUN has been detecting image updates and firing Slack + webhook
notifications for weeks, but zero automated upgrades ran because the
handoff from n8n to claude-agent-service was silently 401-ing.
The pipeline (DIUN → n8n webhook → claude-agent-service /execute →
service-upgrade agent) was migrated from DevVM SSH to K8s HTTP in
42f1c3cf. The migration wired `claude-agent-service` (API_BEARER_TOKEN
env set), updated the n8n workflow JSON to POST with `Authorization:
Bearer $env.CLAUDE_AGENT_API_TOKEN`, but missed two things on the n8n
side:
1. The deployment didn't expose `CLAUDE_AGENT_API_TOKEN` to the n8n
container — workflow sent `Authorization: Bearer ` (empty).
2. The workflow header expression used JS concat (`='Bearer ' + $env.X`)
which n8n 1.x does NOT evaluate in HTTP Request node header params.
It needs template-literal form: `=Bearer {{ $env.X }}`.
Evidence: `claude-agent-service` logs showed only `/health` probes —
zero `/execute` calls over 12h despite DIUN firing webhooks. n8n PG
execution 2250 returned `401 Missing bearer token`.
## This change
- Adds ExternalSecret `claude-agent-token` in the `n8n` namespace that
pulls `api_bearer_token` from Vault `secret/claude-agent-service`
(same source as the receiving service's token).
- Wires the token into the n8n container as env var
`CLAUDE_AGENT_API_TOKEN` via `secret_key_ref`.
- Sets `N8N_BLOCK_ENV_ACCESS_IN_NODE=false` so expressions CAN read
`$env.*` at all (default in 1.x is false already, but setting
explicitly guards against upstream default flips).
- Fixes the workflow JSON backup (`workflows/diun-upgrade.json`) header
expression to use `{{ $env.X }}` template syntax.
The live workflow in n8n's PG DB was also patched in place (one-time
`UPDATE workflow_entity SET nodes = REPLACE(...)` — workflows are not
TF-managed; they were imported once).
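For reference, a minimal sketch of that one-time patch. The table, column, and REPLACE pattern come from this change; the Postgres deployment name, DB/user, and workflow name are illustrative:
```
kubectl -n n8n exec -i deploy/n8n-postgres -- psql -U n8n -d n8n <<'SQL'
UPDATE workflow_entity
SET    nodes = REPLACE(
         nodes::text,
         '=''Bearer '' + $env.CLAUDE_AGENT_API_TOKEN',
         '=Bearer {{ $env.CLAUDE_AGENT_API_TOKEN }}'
       )::json
WHERE  name = 'DIUN upgrade';   -- illustrative workflow name
SQL
```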
## What is NOT in this change
- No retroactive re-run of skipped DIUN events. They'll be rediscovered
in future scans.
- No change to the `claude-agent-service` side — its token and endpoint
were already correct.
- No Slack alert on n8n HTTP-node failures — future work; right now a
broken workflow fails silently unless you check Execution History.
## End-to-end verification
```
$ curl -X POST n8n.viktorbarzin.me/webhook/30805ab6-... \
-d '{"diun_entry_status":"update","diun_entry_image":"docker.io/library/httpd","diun_entry_imagetag":"2.4.66",...}'
{"message":"Workflow was started"} HTTP 200
# n8n PG: execution_entity latest row → status=success
# claude-agent-service logs → "POST /execute HTTP/1.1" 202 Accepted
```
## Reproduce locally
```
1. vault login -method=oidc
2. cd stacks/n8n && ../../scripts/tg apply
3. kubectl -n n8n exec deploy/n8n -- printenv CLAUDE_AGENT_API_TOKEN
(should print 64-char hex)
4. Fire synthetic webhook with non-critical image (httpd / alpine)
5. Check n8n execution is success, claude-agent-service shows 202
```
Closes: code-ekz
Related: code-bck
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
The claude-agent-service K8s pod (deployed 2026-04-15) provides an HTTP API
for running Claude headless agents. Three workflows still SSH'd to the DevVM
(10.0.10.10) to invoke `claude -p`. This eliminates that dependency.
## This change
Pipeline migrations (SSH → HTTP POST to claude-agent-service):
- `.woodpecker/issue-automation.yml` — Vault auth fetches API token instead
of SSH key; curl POST /execute + poll /jobs/{id} replaces the SSH invocation (sketched below)
- `scripts/postmortem-pipeline.sh` — same pattern; uses jq for safe JSON
construction of TODO payloads
- `.woodpecker/postmortem-todos.yml` — drop openssh-client from apk install
- `stacks/n8n/workflows/diun-upgrade.json` — SSH node replaced with HTTP
Request node; API token via $env.CLAUDE_AGENT_API_TOKEN (added to Vault
secret/n8n)
Documentation updates:
- `docs/architecture/incident-response.md` — Mermaid diagram: DevVM → K8s
- `docs/architecture/automated-upgrades.md` — pipeline diagram + n8n action
- `AGENTS.md` — pipeline description updated
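A hedged sketch of the POST /execute + poll /jobs/{id} pattern the pipelines now use. The endpoint paths and the Vault-sourced token come from this work; the service URL/port, payload shape, and response field names (`job_id`, `status`) are illustrative:
```
TOKEN="$(vault kv get -field=api_bearer_token secret/claude-agent-service)"
JOB_ID="$(curl -fsS -X POST \
  -H "Authorization: Bearer ${TOKEN}" -H "Content-Type: application/json" \
  -d '{"prompt": "upgrade docker.io/library/httpd to 2.4.66"}' \
  http://claude-agent-service.claude-agent.svc:8080/execute | jq -r '.job_id')"

# Poll until the job leaves the running state.
while :; do
  STATUS="$(curl -fsS -H "Authorization: Bearer ${TOKEN}" \
    http://claude-agent-service.claude-agent.svc:8080/jobs/"${JOB_ID}" | jq -r '.status')"
  [ "${STATUS}" != "running" ] && break
  sleep 10
done
echo "job ${JOB_ID} finished: ${STATUS}"
```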
## What is NOT in this change
- DevVM decommissioning (still hosts terminal/foolery services)
- Removal of SSH key secrets from Vault (kept for rollback)
- n8n workflow import (must be done manually in n8n UI)
[ci skip]
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
MySQL migrated from InnoDB Cluster (Bitnami chart + mysql-operator) to
a standalone StatefulSet on 2026-04-16. Two Prometheus alerts still
referenced the old topology and were firing falsely against resources
that no longer exist:
- MySQLDown: queried kube_statefulset_status_replicas_ready{statefulset="mysql-cluster"}
— that StatefulSet was deleted as part of Phase 1 of the migration.
- MySQLOperatorDown: queried kube_deployment_status_replicas_available{namespace="mysql-operator"}
— the operator Deployment was removed in Phase 1.
Replacement availability monitoring for the standalone MySQL pod will
be handled via an Uptime Kuma MySQL-connection monitor (out of scope
for this change — no Prometheus replacement alert is being added, per
the migration plan's "simpler is better" principle).
MySQLBackupStale and MySQLBackupNeverSucceeded are retained — they
query the mysql-backup CronJob which is unchanged by the migration.
Also removes MySQLDown from the two inhibition rules (NodeDown and
NFSServerUnresponsive) that previously suppressed it during cascade
outages — the alert no longer exists so the reference became dead.
Closes: code-3sa
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
After commit f6812fe6 every external-monitor-sync run updated all ~107
monitors without any real change — because the new code always appended
`/` to the host (default path), while historical monitors had been
created with bare `https://host` URLs. Sync saw `https://host` !=
`https://host/` and re-wrote every monitor on each cycle: noisy logs,
wasted Uptime Kuma writes.
## This change
When the `uptime.viktorbarzin.me/external-monitor-path` annotation is
absent, build the URL WITHOUT a trailing slash so it matches the shape
of pre-existing monitors. When the annotation is set, append it as
before (e.g. `https://forgejo.viktorbarzin.me/api/healthz`).
Also flip the lenient/strict codes branch to trigger off the same
"annotation set?" signal instead of comparing against DEFAULT_PATH.
## Verification
Verified via two consecutive manual triggers of the CronJob against the
live stack:
Pass 1 (migration): 0 created, 107 updated, 0 deleted, 1 unchanged
Pass 2 (stable): 0 created, 0 updated, 0 deleted, 108 unchanged
`[External] forgejo` still probes `https://forgejo.viktorbarzin.me/api/healthz`
with strict `200-299`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
The null_resource.mysql_static_user provisioner in commit 2033e767 used
a bash -c wrapper with nested single quotes (`'"$DB"'`-style injection)
to interpolate the app-specific database name and credentials. The outer
bash -c '...' single-quoted string was broken by the inner ' characters
long before reaching the container, so the local (tg) shell saw `$DB`
and `$USER` unset and produced an empty database name:
ERROR 1102 (42000) at line 1: Incorrect database name ''
Apply failed for both forgejo and roundcubemail.
## This change
Feed the SQL to mysql on the pod via stdin through `kubectl exec -i`:
- Outer command: `kubectl exec -i ... -- sh -c 'exec mysql -uroot -p"$MYSQL_ROOT_PASSWORD"'`
- Single-quoted shell heredoc (`<<'SQL'`) carries the SQL statements
- HCL interpolates `${each.key}`, `${each.value.database}`,
`${each.value.password}` into the heredoc body before the shell runs
- No nested quoting — one single-quote layer, one double-quote layer,
one heredoc layer
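Outside Terraform, the shape looks like this (pod/namespace taken from the dbaas stack; `forgejo` and `<pw>` stand in for the values HCL interpolates before the shell runs):
```
kubectl exec -i -n dbaas mysql-standalone-0 -c mysql -- \
  sh -c 'exec mysql -uroot -p"$MYSQL_ROOT_PASSWORD"' <<'SQL'
CREATE DATABASE IF NOT EXISTS forgejo;
CREATE USER IF NOT EXISTS 'forgejo'@'%' IDENTIFIED WITH caching_sha2_password BY '<pw>';
SQL
```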
Plan/apply verified on the live stack: 2 added (forgejo + roundcubemail),
7 pre-existing drift items changed, 0 destroyed. Both users now log in
with their app-cached passwords.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
The 2026-04-16 MySQL InnoDB Cluster → standalone migration recreated the
MySQL user table but scripted fresh passwords for every app user. Two apps
(forgejo, roundcubemail) store their DB password inside their own
application config — forgejo in `/data/gitea/conf/app.ini` (baked into the
PVC), roundcubemail in the ROUNDCUBEMAIL_DB_PASSWORD env from the
mailserver stack (sourced from Vault `secret/platform`). Neither app
could be restarted with a new password without rewriting its own config.
Both apps silently broke with `Access denied for user 'X'@'%'` after the
migration. Remediation on 2026-04-17 was a manual `ALTER USER ... IDENTIFIED
BY '<app_password>'` to re-sync MySQL with what each app already has. With
nothing in Terraform managing those users, the next migration would break
them again — that's the gap this change closes.
## What this change does
Codifies both MySQL users in `stacks/dbaas/modules/dbaas/` using the same
`null_resource` + `local-exec` + `kubectl exec` pattern already used for
`pg_terraform_state_db` (line 1373 of the same file). Rejected alternatives:
- `petoju/mysql` Terraform provider — no existing usage in the repo; would
be a net-new dependency. Module-level `for_each` over `mysql_user` +
`mysql_grant` is cleaner, but the added machinery (new provider block,
extra auth path via `MYSQL_HOST`/`MYSQL_USERNAME`/`MYSQL_PASSWORD` TF
env vars, state-dependent password reads) outweighs the benefit for two
static users.
- K8s Job — adds lifecycle management for a one-shot resource; needs
secret mounts and is harder to retry. `local-exec` is exactly what the
existing PG bootstrap uses.
Idempotency contract:
CREATE DATABASE IF NOT EXISTS <db>;
CREATE USER IF NOT EXISTS '<user>'@'%' IDENTIFIED WITH caching_sha2_password BY '<pw>';
ALTER USER '<user>'@'%' IDENTIFIED WITH caching_sha2_password BY '<pw>';
GRANT ALL PRIVILEGES ON <db>.* TO '<user>'@'%';
FLUSH PRIVILEGES;
The `ALTER USER` on every re-run re-syncs the password if Vault was rotated
out-of-band (healing drift). The `sha256(password)` trigger also re-runs
the provisioner when the Vault password legitimately changes, so the
resource is responsive to both new and rotated passwords. `caching_sha2_password`
matches the live plugin returned by `SHOW CREATE USER`; forcing it prevents
silent drift to `mysql_native_password`.
Flow (apply-time):
scripts/tg apply
│
├── data.vault_kv_secret_v2.viktor ── reads mysql_{forgejo,roundcubemail}_password
│
▼
module.dbaas
│
├── mysql-standalone-0 (StatefulSet, already running)
│
├── null_resource.mysql_static_user["forgejo"]
│ └── kubectl exec ... mysql -uroot -p$ROOT_PASSWORD ... CREATE/ALTER/GRANT
│
└── null_resource.mysql_static_user["roundcubemail"]
└── (same, for roundcubemail)
## Secrets
Two new keys added to Vault `secret/viktor`:
mysql_forgejo_password # bound to forgejo `[database]` in app.ini
mysql_roundcubemail_password # duplicates secret/platform
# mailserver_roundcubemail_db_password;
# secret/viktor is the personal vault of
# record per .claude/CLAUDE.md
Passwords are never written to the repo — both come from Vault via
`data "vault_kv_secret_v2" "viktor"` in the dbaas root module.
## What is NOT in this change
- PG-side users (managed by Vault DB engine static-roles already — see
MEMORY.md "Database rotation")
- Other MySQL users (speedtest, wrongmove, codimd, nextcloud, shlink,
grafana, phpipam are all rotated by Vault DB engine; root users
excluded by design)
- Removing the old mysql-operator / InnoDB Cluster helm releases (Phase 4
cleanup tracked under the MySQL standalone migration work — still
pending)
## Test plan
### Automated
`terraform fmt -check -recursive stacks/dbaas` → exit 0
`scripts/tg plan` in stacks/dbaas →
Plan: 2 to add, 7 to change, 0 to destroy.
# module.dbaas.null_resource.mysql_static_user["forgejo"] will be created
# module.dbaas.null_resource.mysql_static_user["roundcubemail"] will be created
The 7 "update in-place" entries are pre-existing drift (Kyverno labels on
LimitRange, MetalLB ip-allocated-from-pool annotation on postgresql_lb,
Kyverno-injected `dns_config` on 4 CronJobs lacking the
`ignore_changes` workaround, `resize.topolvm.io/storage_limit` bump
30Gi→50Gi on mysql-standalone PVC). None of those are introduced by this
commit and all are benign (no data loss, no pod restart).
### Manual Verification
# 1. Sanity check pre-apply — users are in their current (manually-fixed) state.
kubectl exec -n dbaas mysql-standalone-0 -c mysql -- bash -c \
'mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -N -e \
"SELECT user,host,plugin FROM mysql.user WHERE user IN (\"forgejo\",\"roundcubemail\");"'
# Expected:
# forgejo % caching_sha2_password
# roundcubemail % caching_sha2_password
# 2. Apply and confirm the provisioner exits 0.
cd stacks/dbaas && ../../scripts/tg apply
# Expect: null_resource.mysql_static_user["forgejo"]: Creation complete
# null_resource.mysql_static_user["roundcubemail"]: Creation complete
# 3. App-level smoke: log in to forgejo.viktorbarzin.me (any git push)
# and load https://mail.viktorbarzin.me/roundcube (IMAP login). Both
# must succeed.
# 4. Destructive test (run ONCE, off-hours):
kubectl exec -n dbaas mysql-standalone-0 -c mysql -- bash -c \
'mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e "DROP USER '\''forgejo'\''@'\''%'\''"'
cd stacks/dbaas && ../../scripts/tg apply
# Expected: apply recreates the user with the Vault password, forgejo UI
# recovers without touching /data/gitea/conf/app.ini.
### Reproduce locally
1. vault login -method=oidc
2. cd infra/stacks/dbaas
3. ../../scripts/tg plan
4. Expected: "Plan: 2 to add, 7 to change, 0 to destroy." with the two
null_resource.mysql_static_user additions. 7 changes are pre-existing
drift unrelated to this commit.
Closes: code-6th
Closes: code-96w
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Forgejo's /api/healthz verifies cache + DB and returns 503 when
degraded, where / returns 200 even with a broken backend. Prevents
recurrence of the false-negative from the 2026-04-17 outage.
Closes: code-ut0
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
The `external-monitor-sync` CronJob probed `https://<host>/` for every
`*.viktorbarzin.me` ingress. Homepages frequently return 200 (or
allow-listed 30x/40x) even when the backend or DB is broken, producing
false-negatives — the forgejo outage on 2026-04-17 was not caught for
this reason: `/` returned a login page while `/api/healthz` returned
503 from the DB probe.
Manual monitor edits don't stick: the next sync is create-if-missing
only, so a deleted monitor gets recreated pointing at `/` again.
## This change
Teaches the sync three things:
1. **Reads a new annotation** `uptime.viktorbarzin.me/external-monitor-path`.
The annotation value is appended as the probe path; default `/`
preserves today's behaviour for every ingress that hasn't opted in.
2. **Tightens accepted status codes** when an explicit path is set:
`['200-299']` (strict — we expect a real healthz). The default `/`
path keeps the existing lenient set `['200-299','300-399','400-499']`
because homepages routinely 30x redirect or 40x on missing auth.
3. **Updates existing monitors** when the target URL or accepted
status codes drift. Previously the loop was create-if-missing only,
so annotating an already-monitored ingress had no effect until the
monitor was deleted. Now re-running the sync after changing the
annotation converges the live monitor.
## What is NOT in this change
- No change to the Ingress annotations on any individual stack. Each
service that wants a non-`/` probe path opts in separately.
- No change to the ConfigMap fallback payload shape — legacy entries
still get the lenient status codes.
- Monitor DB state in Uptime Kuma's SQLite is untouched at plan time;
the sync CronJob is what reconciles state on each run.
## Flow
```
ingress annotation CronJob Python
------------------ --------------
(none) --> url = https://host/ codes = lenient
external-monitor-path --> url = https://host<path> codes = strict ['200-299']
^^ "/api/healthz" https://host/api/healthz codes = ['200-299']
existing monitor + drifted target url --> api.edit_monitor(id, url=..., accepted_statuscodes=...)
```
## Test Plan
### Automated
- `terraform fmt -check -recursive stacks/uptime-kuma` — exit 0.
- `scripts/tg plan` on `stacks/uptime-kuma` — `Plan: 0 to add, 1 to
change, 0 to destroy`. The single in-place change is the CronJob
command (Python heredoc re-rendered). No other resources drift.
- Embedded Python compiles: extracted the `PYEOF` block and ran
`python3 -m py_compile` — OK.
### Manual Verification
1. Annotate an ingress: `kubectl annotate ingress/<name> -n <ns> uptime.viktorbarzin.me/external-monitor-path=/api/healthz`
2. Trigger sync early: `kubectl -n uptime-kuma create job --from=cronjob/external-monitor-sync external-monitor-sync-manual`
3. Expected log line:
`Updating monitor [External] <name>: https://host/ -> https://host/api/healthz (codes ['200-299','300-399','400-499'] -> ['200-299'])`
4. Inspect monitor in Uptime Kuma UI: URL and accepted status codes
reflect the annotation.
5. Final summary line includes updated count:
`Sync complete: 0 created, 1 updated, 0 deleted, N unchanged`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the broken Traefik rewrite-body plugin with a Cloudflare Worker
using HTMLRewriter to inject the rybbit tracking script into HTML responses
at the CDN edge.
- Wildcard route: *.viktorbarzin.me/* covers all proxied services
- 28 services have explicit site ID mappings
- Unmapped hosts pass through without injection
- Zero Traefik dependency, zero performance impact
Closes: code-sed
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Duplicate bug fix
The external-monitor-sync deduped targets by hostname (`host in seen`) but
multiple ingresses can share the same hostname. Changed to dedupe by final
monitor name (`f"{PREFIX}{label}" in seen`) — prevents creating duplicate
[External] monitors on every sync run. The old hostname-keyed dedup had
already produced 90 duplicates.
## Monitor cleanup
Deleted 118 monitors total:
- 90 duplicate [External] monitors (kept lower ID of each pair)
- 14 paused internal monitors for decommissioned services
- 14 external monitors for non-existent, scaled-down, or non-HTTP services
(xray-vless, complaints, hermes-agent, etc.)
## Opt-outs
Added `uptime.viktorbarzin.me/external-monitor=false` annotation to ingresses
that shouldn't have external HTTP monitors: xray (non-HTTP protocol),
council-complaints, hermes-agent, task-webhook, torrserver, www (no CF DNS).
329 monitors → ~210 monitors. Zero down monitors expected.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both services were running against empty unencrypted PVCs after the
proxmox-lvm-encrypted migration. Data copied from old Released PVs
via LUKS-unlock on PVE host, deployments switched to encrypted PVCs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* [f1-stream] Remove committed cluster-admin kubeconfig
## Context
A kubeconfig granting cluster-admin access was accidentally committed into
the f1-stream stack's application bundle in c7c7047f (2026-02-22). It
contained the cluster CA certificate plus the kubernetes-admin client
certificate and its RSA private key. Both remotes (github.com, forgejo)
are public, so the credential has been reachable for ~2 months.
Grep across the repo confirms no .tf / .hcl / .sh / .yaml file references
this path; the file is a stray local artifact, likely swept in during a
bulk `git add`.
## This change
- git rm stacks/f1-stream/files/.config
## What is NOT in this change
- Cluster-admin cert rotation on the control plane. The leaked client cert
must be invalidated separately via `kubeadm certs renew admin.conf` or
CA regeneration. Tracked in the broader secrets-remediation plan.
- Git-history rewrite. The file is still reachable in every commit since
c7c7047f. A `git filter-repo --path ... --invert-paths` pass against a
fresh mirror is planned and will be force-pushed to both remotes.
## Test plan
### Automated
No tests needed for a file removal. Sanity:
$ grep -rn 'f1-stream/files/\.config' --include='*.tf' --include='*.hcl' \
--include='*.yaml' --include='*.yml' --include='*.sh'
(no output)
### Manual Verification
1. `git show HEAD --stat` shows exactly one path deleted:
stacks/f1-stream/files/.config | 19 -------------------
2. `test ! -e stacks/f1-stream/files/.config` returns true.
3. A copy of the leaked file is at /tmp/leaked.conf for post-rotation
verification (confirming `kubectl --kubeconfig /tmp/leaked.conf get ns`
fails with 401/403 once the admin cert is renewed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [frigate] Remove orphan config.yaml with leaked RTSP passwords
## Context
A Frigate configuration file was added to modules/kubernetes/frigate/ in
bcad200a (2026-04-15, ~2 days ago) as part of a bulk `chore: add untracked
stacks, scripts, and agent configs` commit. The file contains 14 inline
rtsp://admin:<password>@<host>:554/... URLs, leaking two distinct RTSP
passwords for the cameras at 192.168.1.10 (LAN-only) and
valchedrym.ddns.net (confirmed reachable from public internet on port
554). Both remotes are public, so the creds have been exposed for ~2 days.
Grep across the repo confirms nothing references this config.yaml — the
active stacks/frigate/main.tf stack reads its configuration from a
persistent volume claim named `frigate-config-encrypted`, not from this
file. The file is therefore an orphan from the bulk add, with no
production function.
## This change
- git rm modules/kubernetes/frigate/config.yaml
## What is NOT in this change
- Camera password rotation. The user does not own the cameras; rotation
must be coordinated out-of-band with the camera operators. The DDNS
camera (valchedrym.ddns.net:554) is internet-reachable, so the leaked
password is high-priority to rotate from the device side.
- Git-history rewrite. The file plus its leaked strings remain in all
commits from bcad200a forward. Scheduled to be purged via
`git filter-repo --path modules/kubernetes/frigate/config.yaml
--invert-paths --replace-text <list>` in the broader remediation pass.
- Future Frigate config provisioning. If the stack is re-platformed to
source config from Git rather than the PVC, the replacement should go
through ExternalSecret + env-var interpolation, not an inline YAML.
## Test plan
### Automated
$ grep -rn 'frigate/config\.yaml' --include='*.tf' --include='*.hcl' \
--include='*.yaml' --include='*.yml' --include='*.sh'
(no output — confirms orphan status)
### Manual Verification
1. `git show HEAD --stat` shows exactly one deletion:
modules/kubernetes/frigate/config.yaml | 229 ---------------------------------
2. `test ! -e modules/kubernetes/frigate/config.yaml` returns true.
3. `kubectl -n frigate get pvc frigate-config-encrypted` still shows the
PVC bound (unaffected by this change).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [setup-tls-secret] Delete deprecated renew.sh with hardcoded Technitium token
## Context
modules/kubernetes/setup_tls_secret/renew.sh is a 2.5-year-old
expect(1) script for manual Let's Encrypt wildcard-cert renewal via
Technitium DNS TXT-record challenges. It hardcodes a 64-char Technitium
API token on line 7 (as an expect variable) and line 27 (inside a
certbot-cleanup heredoc). Both remotes are public, so the token has been
exposed for ~2.5 years.
The script is not invoked by the module's Terraform (main.tf only creates
a kubernetes.io/tls Secret from PEM files); it is a standalone
run-it-yourself tool. grep across the repo confirms nothing references
`renew.sh` — neither the 20+ stacks that consume the `setup_tls_secret`
module, nor any CI pipeline, nor any shell wrapper.
A replacement script `renew2.sh` (4 weeks old) lives alongside it. It
sources the Technitium token from `$TECHNITIUM_API_KEY` env var and also
supports Cloudflare DNS-01 challenges via `$CLOUDFLARE_TOKEN`. It is the
current renewal path.
## This change
- git rm modules/kubernetes/setup_tls_secret/renew.sh
## What is NOT in this change
- Technitium token rotation. The leaked token still works against
`technitium-web.technitium.svc.cluster.local:5380` until revoked in the
Technitium admin UI. Rotation is a prerequisite for the upcoming
git-history scrub, which will remove the token from every commit via
`git filter-repo --replace-text`.
- renew2.sh is retained as-is (already env-var-sourced; clean).
- The setup_tls_secret module's main.tf is not touched; 20+ consuming
stacks keep working.
## Test plan
### Automated
$ grep -rn 'renew\.sh' --include='*.tf' --include='*.hcl' \
--include='*.yaml' --include='*.yml' --include='*.sh'
(no output — confirms no consumer)
$ git grep -n 'e28818f309a9ce7f72f0fcc867a365cf5d57b214751b75e2ef3ea74943ef23be'
(no output in HEAD after this commit)
### Manual Verification
1. `git show HEAD --stat` shows exactly one deletion:
modules/kubernetes/setup_tls_secret/renew.sh | 136 ---------
2. `test ! -e modules/kubernetes/setup_tls_secret/renew.sh` returns true.
3. `renew2.sh` still exists and is executable:
ls -la modules/kubernetes/setup_tls_secret/renew2.sh
4. Next cert-renewal run uses renew2.sh with env-var-sourced token; no
behavioral regression because renew.sh was never part of the automated
flow.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [monitoring] Delete orphan server-power-cycle/main.sh with iDRAC default creds
## Context
stacks/monitoring/modules/monitoring/server-power-cycle/main.sh is an old
shell implementation of a power-cycle watchdog that polled the Dell iDRAC
on 192.168.1.4 for PSU voltage. It hardcoded the Dell iDRAC default
credentials (root:calvin) in 5 `curl -u root:calvin` calls. Both remotes
are public, so those credentials — and the implicit statement that 'this
host has not rotated the default BMC password' — have been exposed.
The current implementation is main.py in the same directory. It reads
iDRAC credentials from the environment variables `idrac_user` and
`idrac_password` (see module's iDRAC_USER_ENV_VAR / iDRAC_PASSWORD_ENV_VAR
constants), which are populated from Vault via ExternalSecret at runtime.
main.sh is not referenced by any Terraform, ConfigMap, or deploy script —
grep confirms no `file()` / `templatefile()` / `filebase64()` call loads
it, and no hand-rolled shell wrapper invokes it.
## This change
- git rm stacks/monitoring/modules/monitoring/server-power-cycle/main.sh
main.py is retained unchanged.
## What is NOT in this change
- iDRAC password rotation on 192.168.1.4. The BMC should be moved off the
vendor default `calvin` regardless; rotation is tracked in the broader
remediation plan and in the iDRAC web UI.
- A separate finding in stacks/monitoring/modules/monitoring/idrac.tf
(the redfish-exporter ConfigMap has `default: username: root, password:
calvin` as a fallback for iDRAC hosts not explicitly listed) is NOT
addressed here — filed as its own task so the fix (drop the default
block vs. source from env) can be considered in isolation.
- Git-history scrub of main.sh is pending the broader filter-repo pass.
## Test plan
### Automated
$ grep -rn 'server-power-cycle/main\.sh\|main\.sh' \
--include='*.tf' --include='*.hcl' --include='*.yaml' \
--include='*.yml' --include='*.sh'
(no consumer references)
### Manual Verification
1. `git show HEAD --stat` shows only the one deletion.
2. `test ! -e stacks/monitoring/modules/monitoring/server-power-cycle/main.sh`
3. `kubectl -n monitoring get deploy idrac-redfish-exporter` still shows
the exporter running — unrelated to this file.
4. main.py continues to run its watchdog loop without regression, because
it was never coupled to main.sh.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [tls] Move 3 outlier stacks from per-stack PEMs to root-wildcard symlink
## Context
foolery, terminal, and claude-memory each had their own
`stacks/<x>/secrets/` directory with a plaintext EC-256 private key
(privkey.pem, 241 B) and matching TLS certificate (fullchain.pem, 2868 B)
for *.viktorbarzin.me. The 92 other stacks under stacks/ symlink
`secrets/` → `../../secrets`, which resolves to the repo-root
/secrets/ directory covered by the `secrets/** filter=git-crypt`
.gitattributes rule — i.e., every other stack consumes the same
git-crypt-encrypted root wildcard cert.
The 3 outliers shipped their keys in plaintext because the `.gitattributes`
`secrets/**` rule matches only repo-root /secrets/, not
stacks/*/secrets/. Both remotes are public, so the 6 plaintext PEM files
have been exposed for 1–6 weeks (commits 5a988133 2026-03-11,
a6f71fc6 2026-03-18, 9820f2ce 2026-04-10).
Verified:
- Root wildcard cert subject = CN viktorbarzin.me,
SAN *.viktorbarzin.me + viktorbarzin.me — covers the 3 subdomains.
- Root privkey + fullchain are a valid key pair (pubkey SHA256 match).
- All 3 outlier certs have the same subject/SAN as root; the cert material
  is distinct but the coverage is equivalent.
## This change
- Delete plaintext PEMs in all 3 outlier stacks (6 files total).
- Replace each stacks/<x>/secrets directory with a symlink to
../../secrets, matching the fleet pattern.
- Add `stacks/**/secrets/** filter=git-crypt diff=git-crypt` to
.gitattributes as a regression guard — any future real file placed
under stacks/<x>/secrets/ gets git-crypt-encrypted automatically.
setup_tls_secret module (modules/kubernetes/setup_tls_secret/main.tf) is
unchanged. It still reads `file("${path.root}/secrets/fullchain.pem")`,
which via the symlink resolves to the root wildcard.
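A hedged sketch of the per-stack conversion (repeated for terminal and claude-memory); the exact commands aren't recorded in this message:
```
cd stacks/foolery
git rm secrets/privkey.pem secrets/fullchain.pem   # the plaintext PEMs
rmdir secrets
ln -s ../../secrets secrets
git add secrets
```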
## What is NOT in this change
- Revocation of the 3 leaked per-stack certs. Backed up the leaked PEMs
to /tmp/leaked-certs/ for `certbot revoke --reason keycompromise`
once the user's LE account is authenticated. Revocation must happen
before or alongside the history-rewrite force-push to both remotes.
- Git-history scrub. The leaked PEM blobs are still reachable in every
commit from 2026-03-11 forward. Scheduled for removal via
`git filter-repo --path stacks/<x>/secrets/privkey.pem --invert-paths`
(and fullchain.pem for each stack) in the broader remediation pass.
- cert-manager introduction. The fleet does not use cert-manager today;
this commit matches the existing symlink-to-wildcard pattern rather
than introducing a new component.
## Test plan
### Automated
$ readlink stacks/foolery/secrets
../../secrets
(likewise for terminal, claude-memory)
$ for s in foolery terminal claude-memory; do
openssl x509 -in stacks/$s/secrets/fullchain.pem -noout -subject
done
subject=CN = viktorbarzin.me (x3 — all resolve via symlink to root wildcard)
$ git check-attr filter -- stacks/foolery/secrets/fullchain.pem
stacks/foolery/secrets/fullchain.pem: filter: git-crypt
(now matched by the new rule, though for the symlink target the
repo-root rule already applied)
### Manual Verification
1. `terragrunt plan` in stacks/foolery, stacks/terminal, stacks/claude-memory
shows only the K8s TLS secret being re-created with the root-wildcard
material. No ingress changes.
2. `terragrunt apply` for each stack → `kubectl -n <ns> get secret
<name>-tls -o yaml` → tls.crt decodes to CN viktorbarzin.me with
the root serial (different from the pre-change per-stack serials).
3. `curl -v https://foolery.viktorbarzin.me/` (and likewise terminal,
claude-memory) → cert chain presents the new serial, handshake OK.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add broker-sync Terraform stack (pending apply)
Context
-------
Part of the broker-sync rollout — see the plan at
~/.claude/plans/let-s-work-on-linking-temporal-valiant.md and the
companion repo at ViktorBarzin/broker-sync.
This change
-----------
New stack `stacks/broker-sync/`:
- `broker-sync` namespace, aux tier.
- ExternalSecret pulling `secret/broker-sync` via vault-kv
ClusterSecretStore.
- `broker-sync-data-encrypted` PVC (1Gi, proxmox-lvm-encrypted,
auto-resizer) — holds the sync SQLite db, FX cache, Wealthfolio
cookie, CSV archive, watermarks.
- Five CronJobs (all under `viktorbarzin/broker-sync:<tag>`, public
DockerHub image; no pull secret):
* `broker-sync-version` — daily 01:00 liveness probe (`broker-sync
version`), used to smoke-test each new image.
* `broker-sync-trading212` — daily 02:00 `broker-sync trading212
--mode steady`.
* `broker-sync-imap` — daily 02:30, SUSPENDED (Phase 2).
* `broker-sync-csv` — daily 03:00, SUSPENDED (Phase 3).
* `broker-sync-fx-reconcile` — 7th of month 05:05, SUSPENDED
(Phase 1 tail).
- `broker-sync-backup` — daily 04:15, snapshots /data into
NFS `/srv/nfs/broker-sync-backup/` with 30-day retention, matches
the convention in infra/.claude/CLAUDE.md §3-2-1.
NOT in this commit:
- Old `wealthfolio-sync` CronJob retirement in
stacks/wealthfolio/main.tf — happens in the same commit that first
applies this stack, per the plan's "clean cutover" decision.
- Vault seed. `secret/broker-sync` must be populated before apply;
required keys documented in the ExternalSecret comment block.
Test plan
---------
## Automated
- `terraform fmt` — clean (ran before commit).
- `terraform validate` needs `terragrunt init` first; deferred to
apply time.
## Manual Verification
1. Seed Vault `secret/broker-sync/*` (see comment block on the
ExternalSecret in main.tf).
2. `cd stacks/broker-sync && scripts/tg apply`.
3. `kubectl -n broker-sync get cronjob` — expect 6 CJs, 3 suspended.
4. `kubectl -n broker-sync create job smoke --from=cronjob/broker-sync-version`.
5. `kubectl -n broker-sync logs -l job-name=smoke` — expect
`broker-sync 0.1.0`.
* fix(beads-server): disable Authentik + CrowdSec on Workbench
Authentik forward-auth returns 400 for dolt-workbench (no Authentik
application configured for this domain). CrowdSec bouncer also
intermittently returns 400. Both disabled — Workbench is accessible
via Cloudflare tunnel only.
TODO: Create Authentik application for dolt-workbench.viktorbarzin.me
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PoisonFountainDown and ForwardAuthFallbackActive both fired because
poison-fountain was scaled to 0 replicas (intentional). Updated both
alert expressions to check kube_deployment_spec_replicas > 0 before
alerting on missing available replicas — if desired replicas is 0,
the service is intentionally down and should not alert.
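The guarded expression shape, queryable ad hoc. The alert/metric names come from this change; the Prometheus address and label matchers are illustrative:
```
curl -sG 'http://prometheus.monitoring.svc:9090/api/v1/query' --data-urlencode \
  'query=(kube_deployment_status_replicas_available{deployment="poison-fountain"} == 0)
         and (kube_deployment_spec_replicas{deployment="poison-fountain"} > 0)'
```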
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GRAPHQLAPI_URL must point to localhost:9002 (internal), not the external
URL which goes through Authentik. SSR can't authenticate to Authentik.
Also removed Authentik from /graphql ingress — browser fetch() can't
follow 302 redirects on POST requests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The env var was only set via kubectl and got overwritten on next apply.
Now permanently in the deployment spec.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Scale to 0 replicas:
- ollama: low usage, saves ~2Gi memory + 59GB NFS-SSD model data idle
- poison-fountain: RSS link archiver, not actively used
- travel-blog: Hugo blog, not actively used
Remove technitium DoH ingress (dns.viktorbarzin.me): externally unreachable
and unused. DNS is served on UDP/TCP port 53 via LoadBalancer (10.0.20.201).
Clears 3 of 5 ExternalAccessDivergence services. The remaining 2 (pdf, travel)
should clear once the Uptime Kuma monitors report both as down.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## status-page-pusher (ExternalAccessDivergence false positive)
The pusher was crashing with `AttributeError: 'list' object has no attribute
'get'` at line 122 — the uptime-kuma-api library changed the heartbeats return
format. Fixed by making beat flattening more robust: handle any nesting of
lists/dicts in the heartbeat data, and add isinstance check before calling
`.get()` on the latest beat.
## Prometheus backup (PrometheusBackupNeverRun)
The backup sidecar's Pushgateway push was silently failing because `wget
--post-file=-` needs `--header="Content-Type: text/plain"` for Pushgateway
to accept the Prometheus exposition format. Added the header. Also manually
pushed the metric to clear the `absent()` alert immediately.
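A hedged sketch of the fixed push; the metric name, job label, and Pushgateway address are illustrative, while the wget flags mirror the sidecar's invocation described above:
```
echo "prometheus_backup_last_run_timestamp_seconds $(date +%s)" | \
  wget -qO- --header="Content-Type: text/plain" --post-file=- \
  http://prometheus-pushgateway.monitoring.svc:9091/metrics/job/prometheus-backup
```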
Note: ExternalAccessDivergence still fires because 5 services (ollama, pdf,
poison, dns, travel) ARE genuinely externally unreachable but internally up.
This is a real issue (likely Cloudflare tunnel routing) not a false positive.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Workbench's database connection is in-memory and lost on pod restart.
Added startup script that waits for GraphQL server readiness, then calls
addDatabaseConnection mutation automatically. No more manual reconnection.
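A hedged sketch of what such a startup script does. The GraphQL port comes from this stack (localhost:9002) and the mutation name from this change; the mutation arguments and connection string are illustrative:
```
# Wait for the GraphQL server, then register the Dolt connection once.
until curl -fsS -H 'Content-Type: application/json' \
      -d '{"query":"{ __typename }"}' http://localhost:9002/graphql >/dev/null; do
  sleep 2
done
curl -fsS -H 'Content-Type: application/json' -d '{"query":
  "mutation { addDatabaseConnection(connectionUrl: \"mysql://root:<pw>@dolt:3306/beads\") { __typename } }"}' \
  http://localhost:9002/graphql
```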
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
The setup-project skill treats "build from a Dockerfile" as priority 6 — "last
resort, avoid if possible" — with no formalized path for apps whose upstream
lacks a working Dockerfile. When we end up writing one to get the deploy green,
that Dockerfile stays private in the infra repo and upstream never benefits.
## This change
Adds a closed-loop flow: when we author a new Dockerfile (or fix a broken
upstream one) and the deploy is healthy for 10 minutes, auto-open a PR against
the upstream repo so the self-hosting community gets the working recipe.
Flow:
1. Classify dockerfile_state during research phase (image-used / used-as-is /
fixed-broken-upstream / written-from-scratch). Persist to
modules/kubernetes/<service>/.contribution-state.json.
2. After Terraform apply, run scripts/stability-gate.sh — polls pod Ready +
HTTP 200 every 30s x 20 iterations, requires 18/20 successes (sketched below).
3. On pass with a trigger state, scripts/contribute-dockerfile.sh does the
GitHub API dance: fork → merge-upstream → branch → commit Dockerfile /
.dockerignore / BUILD.md via Contents API → open PR with body rendered from
templates/PR_BODY.md. Idempotent (skips on recorded PR URL, existing fork,
existing branch, open PR, upstream landed a Dockerfile mid-deploy).
GitHub API via curl (gh CLI is sandbox-blocked per .claude/CLAUDE.md); token
pulled from Vault (`secret/viktor` → `github_pat`). Commits include
Signed-off-by for DCO-enforcing repos. Fork branch name is `add-dockerfile`
for written-from-scratch or `fix-dockerfile` for fixed-broken-upstream, with
timestamp suffix on collision.
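A condensed sketch of the stability-gate loop from step 2; namespace, deployment name, and URL are placeholders, and the real logic lives in scripts/stability-gate.sh:
```
PASS=0
for i in $(seq 1 20); do
  READY="$(kubectl -n "${NS}" get deploy "${APP}" -o jsonpath='{.status.readyReplicas}')"
  CODE="$(curl -sk -o /dev/null -w '%{http_code}' "https://${APP}.viktorbarzin.me/")"
  [ "${READY:-0}" -ge 1 ] && [ "${CODE}" = "200" ] && PASS=$((PASS + 1))
  sleep 30
done
[ "${PASS}" -ge 18 ] || exit 1
echo "stability gate passed (${PASS}/20)"
```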
## Files
- SKILL.md — state classification table, quality bar checklist, §8b stability
gate, §10 contribute-upstream step, checklist updates
- scripts/stability-gate.sh — 10-minute health probe
- scripts/contribute-dockerfile.sh — GitHub API orchestrator
- templates/PR_BODY.md — `{{VAR}}` placeholder template for PR description
- templates/Dockerfile.README.md — BUILD.md template shipped with the PR
## What is NOT in this change
- No Woodpecker / GHA changes (skill-local flow).
- No auto-tracking of merge/reject outcomes upstream (manual follow-up).
- Not yet exercised end-to-end; first real-world run will validate the API
dance. Plan to dry-run against a throwaway sink repo before pointing at a
real upstream.
## Test Plan
### Automated
- bash -n on both scripts → pass
- Manual read-through of SKILL.md — step numbering coherent, existing
§1-9 semantics untouched, new §8b/§10 reference real files
### Manual Verification
1. Next time setup-project onboards a Dockerfile-less app:
- Confirm .contribution-state.json is written with `written-from-scratch`
- Run stability-gate.sh — expect 18/20 passes on a healthy deploy
- Run contribute-dockerfile.sh — expect a fork + branch + PR on ViktorBarzin
- Verify contribution_pr_url is back-written to the state file
2. Re-run contribute-dockerfile.sh → must be a no-op (idempotent)
3. Upstream-archived case: manually archive a test upstream → re-run →
expect SKIP, no PR created
[ci skip]
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The rewrite-body Traefik plugin (both packruler/rewrite-body v1.2.0 and
the-ccsn/traefik-plugin-rewritebody v0.1.3) silently fails on Traefik
v3.6.12 due to Yaegi interpreter issues with ResponseWriter wrapping.
Both plugins load without errors but never inject content.
Removed:
- rewrite-body plugin download (init container) and registration
- strip-accept-encoding middleware (only existed for rewrite-body bug)
- anti-ai-trap-links middleware (used rewrite-body for injection)
- rybbit_site_id variable from ingress_factory and reverse_proxy factory
- rybbit_site_id from 25 service stacks (39 instances)
- Per-service rybbit-analytics middleware CRD resources
Kept:
- compress middleware (entrypoint-level, working correctly)
- ai-bot-block middleware (ForwardAuth to bot-block-proxy)
- anti-ai-headers middleware (X-Robots-Tag: noai, noimageai)
- All CrowdSec, Authentik, rate-limit middleware unchanged
Next: Cloudflare Workers with HTMLRewriter for edge-side injection.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fixed project_id mismatch (was "beadboard", should be actual DB project ID)
- Rebuilt Docker image with bd v1.0.2 binary (node:20-slim for glibc compat)
- Ran bd migrate to update schema from 1.0.0 → 1.0.2 (adds started_at, etc.)
- Task creation and bd CLI now work inside the container
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
BeadBoard needs to create templates/ and archetypes/ subdirectories
inside .beads/. ConfigMap mounts are read-only, causing ENOENT errors
and 503 responses. Fix: init container copies ConfigMap to emptyDir.
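A hedged sketch of what the init container runs (mount paths are illustrative):
```
# Seed the writable emptyDir from the read-only ConfigMap mount, then create the
# subdirectories BeadBoard wants to write at runtime.
cp -a /config-ro/. /data/.beads/
mkdir -p /data/.beads/templates /data/.beads/archetypes
```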
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add BeadBoard (zenchantlive/beadboard) alongside Dolt server and Workbench
for task dependency graph, kanban, and agent coordination views.
- Built custom Docker image (registry.viktorbarzin.me:5050/beadboard)
- ConfigMap provides .beads/metadata.json pointing to Dolt server
- Behind Authentik auth at beadboard.viktorbarzin.me
- Also fixed: GraphQL ingress now has Authentik middleware
- Also fixed: Workbench store.json type enum (mysql → Mysql)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
The disk-write analysis behind the MySQL standalone migration + Technitium SQLite
disable (together saving ~130 GB/day of writes) should be reusable for periodic
health reviews.
## This change:
Adds `/disk-wear` skill that combines three data sources:
- SSH to PVE host for real-time 30s I/O snapshots and SSD SMART health
- Prometheus PromQL for per-app write attribution (node_disk_written_bytes_total
joined with node_disk_device_mapper_info for dm->LVM mapping)
- kubectl for PVC UUID -> pod/namespace mapping
Produces ranked breakdowns by physical disk, VM, k8s namespace, and individual PVC.
Includes baselines, red flag detection, and annualized wear projections.
Note: container_fs_writes_bytes_total has 0 series (cadvisor doesn't track
block device writes per container), so per-app attribution uses the PVE host's
dm-device level metrics mapped through Prometheus and kubectl.
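A hedged sketch of the dm-to-LVM attribution query; the metric names come from this change, while the Prometheus address and label matchers are illustrative:
```
# Bytes written per LVM volume over the last 24h, attributed via the dm mapping.
curl -sG 'http://prometheus.monitoring.svc:9090/api/v1/query' --data-urlencode \
  'query=sum by (name) (
     increase(node_disk_written_bytes_total{device=~"dm-.*"}[24h])
     * on (device, instance) group_left(name) node_disk_device_mapper_info
   )'
```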
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
After the previous commit migrated monitor discovery to per-ingress annotation
(opt-in via `uptime.viktorbarzin.me/external-monitor=true`), coverage expanded
from 13 → 26 monitors but still left ~99 public ingresses uncovered — notably
Helm-managed services (authentik, grafana, vault, forgejo, ntfy) that don't
go through `ingress_factory`, plus any `dns_type = "non-proxied"` ingress
(Immich was a direct victim: `dns_type = "non-proxied"` → no annotation added
→ no monitor → invisible outage).
The user's concern: "I should have known external Immich was down before
users tried to open it."
## This change
Flipped the semantic from opt-in to **opt-out by default**:
- Every ingress whose host ends in `.viktorbarzin.me` gets a `[External] <label>`
monitor automatically
- Only ingresses with annotation `uptime.viktorbarzin.me/external-monitor=false`
are skipped
- Host dedup via a `seen` set (one monitor per hostname, regardless of how
many Ingress resources share it)
## Verification
Triggered a manual CronJob run post-apply:
```
Sync complete: 102 created, 1 deleted, 23 unchanged
```
Coverage jumped from 26 → ~124 external monitors. All 6 Helm-managed services
now have dedicated monitors:
- [External] immich, authentik, forgejo, grafana, ntfy, vault
## Scope
Only `stacks/uptime-kuma/modules/uptime-kuma/main.tf` (Python script in the
CronJob resource). No RBAC or service account changes — the ones added in the
previous commit still cover this path.
## Test plan
### Automated
```
$ kubectl -n uptime-kuma logs -l job-name=manual-sync-optout-1776422993 --tail=50 | grep -iE 'immich|authentik|grafana|forgejo|vault|ntfy'
Creating monitor: [External] authentik -> https://authentik.viktorbarzin.me
Creating monitor: [External] forgejo -> https://forgejo.viktorbarzin.me
Creating monitor: [External] immich -> https://immich.viktorbarzin.me
Creating monitor: [External] grafana -> https://grafana.viktorbarzin.me
Creating monitor: [External] ntfy -> https://ntfy.viktorbarzin.me
Creating monitor: [External] vault -> https://vault.viktorbarzin.me
```
### Manual Verification
1. Open `https://uptime.viktorbarzin.me` → confirm `[External] immich` exists
2. Simulate an Immich outage (scale deploy to 0 briefly) → external monitor
should go red within the probe interval (5min); internal monitor stays up
(pod-level from a different probe angle) → `ExternalAccessDivergence`
alert fires after 15 min
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Two operational gaps surfaced during a healthcheck sweep today:
1. **External monitoring coverage**: Only ~13 hostnames (via `cloudflare_proxied_names`
in `config.tfvars`) had `[External]` monitors in Uptime Kuma. Any service deployed via
`ingress_factory` with `dns_type = "proxied"` auto-created its DNS record but was NOT
registered for external probing — so outages like Immich going down externally were
invisible until a user complained. 99 of ~125 public ingresses had no external
monitor.
2. **actualbudget stack unplannable**: `count = var.budget_encryption_password != null
? 1 : 0` in `factory/main.tf:152` failed with "Invalid count argument" because the
value flows from a `data.kubernetes_secret` whose contents are `(known after apply)`
at plan time. Blocked CI applies and drift reconciliation.
## This change
### Per-ingress external-monitor annotation (ingress_factory + reverse_proxy/factory)
- New variables `external_monitor` (bool, nullable) + `external_monitor_name` (string,
nullable). Default is "follow dns_type" — enabled for any public DNS record
(`dns_type != "none"`, covers both proxied and non-proxied so Immich and other
direct-A records are also monitored).
- Emits two annotations on the Ingress:
- `uptime.viktorbarzin.me/external-monitor = "true"`
- `uptime.viktorbarzin.me/external-monitor-name = "<label>"` (optional override)
### external-monitor-sync CronJob (uptime-kuma stack)
- Discovers targets from live Ingress objects via the K8s API first (filter by
annotation), falls back to the legacy `external-monitor-targets` ConfigMap on any
API error (zero rollout risk).
- New `ServiceAccount` + cluster-wide `ClusterRole`/`ClusterRoleBinding` giving
`list`/`get` on `networking.k8s.io/ingresses`.
- `API_SERVER` now uses the `KUBERNETES_SERVICE_HOST` env var (always injected by K8s)
instead of `kubernetes.default.svc` — the search-domain expansion failed in the
CronJob pod's DNS config. Verified working: CronJob now logs
`Loaded N external monitor targets (source=k8s-api)`.
### actualbudget count-on-unknown refactor
- Replaced `count = var.budget_encryption_password != null ? 1 : 0` with two explicit
plan-time booleans: `enable_http_api` and `enable_bank_sync`. Values are known at
plan; no `-target` workaround needed.
- Callers (`stacks/actualbudget/main.tf`) pass `true` explicitly. Runtime behaviour is
unchanged — the secret is still consumed via env var.
- Also aligned the factory with live state (the 3 budget-* PVCs had been migrated
`proxmox-lvm` → `proxmox-lvm-encrypted` outside Terraform): PVC resource renamed
`data_proxmox` → `data_encrypted`, storage class updated, orphaned `nfs_data` module
removed. State was rm'd + re-imported with matching UIDs, so no data was moved.
## Rollout status (already partially applied in this session)
- `stacks/uptime-kuma` applied — SA + RBAC + CronJob changes live; FQDN fix verified
- `stacks/actualbudget` applied — budget-{viktor,anca,emo} all 200 OK externally
- `stacks/mailserver` + 21 other ingress_factory consumers applied — annotations live
- CronJob `external-monitor-sync` latest run: `source=k8s-api`, 26 monitors active
(was 13 on the central list)
## Deferred (separate work)
- 4 stacks show pre-existing DESTRUCTIVE drift in plan (metallb namespace, claude-memory,
rbac, redis) — NOT triggered by this commit but will be by CI's global-file cascade.
`[ci skip]` here so those don't auto-apply; they will be fixed manually before the
next CI push.
- Cleanup of `cloudflare_proxied_names` list once Helm-managed ingresses (authentik,
grafana, vault, forgejo) are annotated — separate PR.
## Test plan
### Automated
```
$ kubectl -n uptime-kuma logs $(kubectl -n uptime-kuma get pods -l job-name -o name | tail -1)
Loaded 26 external monitor targets (source=k8s-api)
Sync complete: 7 created, 0 deleted, 17 unchanged
$ curl -sk -o /dev/null -w "%{http_code}\n" -H "Accept: text/html" \
    https://dawarich.viktorbarzin.me/ https://nextcloud.viktorbarzin.me/ \
    https://budget-viktor.viktorbarzin.me/
200 302 200
$ kubectl -n actualbudget get deploy,pvc -l app=budget-viktor
deployment.apps/budget-viktor 1/1 1 1 Ready
persistentvolumeclaim/budget-viktor-data-encrypted Bound 10Gi RWO proxmox-lvm-encrypted
```
### Manual Verification
1. Confirm the annotation is present on an ingress_factory ingress:
```
kubectl -n dawarich get ingress dawarich -o \
  jsonpath='{.metadata.annotations.uptime\.viktorbarzin\.me/external-monitor}'
# Expected: "true"
```
2. Confirm the new `[External] <name>` monitor appears in Uptime Kuma within 10 min
(CronJob interval). For Immich specifically, it will appear after the immich stack
is re-applied.
3. Verify actualbudget plan is clean:
```
cd stacks/actualbudget && scripts/tg plan --non-interactive
# Expected: no "Invalid count argument" errors
```
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Disabling MySQL/SQLite query logging via config was not durable — Technitium
re-enables disabled plugins on pod restart, causing 46 GB/day of writes to
the standalone MySQL (15M inserts to technitium.dns_logs between CronJob runs).
## This change:
The password-sync CronJob now UNINSTALLS MySQL and SQLite query log plugins
via `/api/apps/uninstall` instead of setting `enableLogging:false`. This is
permanent — the plugin files are removed from the PVC, so they can't re-enable
on restart. The CronJob checks if the plugins are present first (idempotent).
Only PostgreSQL query logging remains (90-day retention).
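A hedged sketch of the uninstall step. The service address appears elsewhere in this repo and /api/apps/uninstall comes from this change; the app names, token variable, and list endpoint are illustrative:
```
TECHNITIUM="http://technitium-web.technitium.svc.cluster.local:5380"
for APP in "Query Logs (MySQL)" "Query Logs (Sqlite)"; do
  if curl -fsS "${TECHNITIUM}/api/apps/list?token=${TECHNITIUM_TOKEN}" | grep -qF "\"${APP}\""; then
    curl -fsS -G "${TECHNITIUM}/api/apps/uninstall" \
      --data-urlencode "token=${TECHNITIUM_TOKEN}" --data-urlencode "name=${APP}"
  fi
done
```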
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The audio-engine.js, dom.js, and dj.js files were refactored/removed
in the upstream Freedify repo. The sed patches that disabled iOS EQ
auto-init and visualizer no longer have targets, causing the container
to crash on startup. Use the image's default CMD instead.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The rewrite-body plugin (rybbit analytics, anti-AI trap links) requires
strip-accept-encoding to work, which killed HTTP compression for 50+
services. This adds Traefik's built-in compress middleware at the
websecure entrypoint level to re-compress responses to clients after
rewrite-body has modified them.
Uses includedContentTypes whitelist (not excludedContentTypes) so only
text-based types are compressed. SSE, WebSocket, gRPC, and binary
downloads are unaffected.
Measured improvement on ha-sofia:
- app.js: 540KB → 167KB (3.2x)
- core.js: 52KB → 19KB (2.7x)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
10.0.20.200:5432/terraform_state with native pg_advisory_lock.
Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.
Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)
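Roughly what scripts/tg does for a Tier 1 stack; the static-role name comes from this change, while the Vault mount path, DB username, and backend wiring are illustrative:
```
PGPASS="$(vault read -field=password database/static-creds/pg-terraform-state)"
export PG_CONN_STR="postgres://pg-terraform-state:${PGPASS}@10.0.20.200:5432/terraform_state"
terragrunt init -backend-config="conn_str=${PG_CONN_STR}"
```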
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
Disk write analysis showed MySQL InnoDB Cluster writing ~95 GB/day for only
~35 MB of actual data due to Group Replication overhead (binlog, relay log,
GR apply log). The operator enforces GR even with serverInstances=1.
Bitnami Helm charts were deprecated by Broadcom in Aug 2025 — no free
container images available. Using official mysql:8.4 image instead.
## This change:
- Replace helm_release.mysql_cluster service selector with raw
kubernetes_stateful_set_v1 using official mysql:8.4 image
- ConfigMap mysql-standalone-cnf: skip-log-bin, innodb_flush_log_at_trx_commit=2,
innodb_doublewrite=ON (re-enabled for standalone safety; quick check below)
- Service selector switched to standalone pod labels
- Technitium: disable SQLite query logging (18 GB/day write amplification),
keep PostgreSQL-only logging (90-day retention)
- Grafana datasource and dashboards migrated from MySQL to PostgreSQL
- Dashboard SQL queries fixed for PG integer division (::float cast)
- Updated CLAUDE.md service-specific notes
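A quick check that the standalone tuning from the ConfigMap bullet actually applies on the running pod (pod/namespace names as used elsewhere in this repo; the expected `ON` assumes MySQL 8.x's enum form of innodb_doublewrite):
```
kubectl -n dbaas exec mysql-standalone-0 -c mysql -- bash -c \
  'mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -N -e \
   "SELECT @@log_bin, @@innodb_flush_log_at_trx_commit, @@innodb_doublewrite"'
# Expect: 0  2  ON
```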
## What is NOT in this change:
- InnoDB Cluster + operator removal (Phase 4, 7+ days from now)
- Stale Vault role cleanup (Phase 4)
- Old PVC deletion (Phase 4)
Expected write reduction: ~113 GB/day (MySQL 95 + Technitium 18)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>