Commit graph

2794 commits

Author SHA1 Message Date
Viktor Barzin
f6812fe69f [uptime-kuma] Support per-ingress probe path annotation
## Context

The `external-monitor-sync` CronJob probed `https://<host>/` for every
`*.viktorbarzin.me` ingress. Homepages frequently return 200 (or
allow-listed 30x/40x) even when the backend or DB is broken, producing
false negatives — the forgejo outage on 2026-04-17 was not caught for
this reason: `/` returned a login page while `/api/healthz` returned
503 from the DB probe.

Manual monitor edits don't stick: the next sync is create-if-missing
only, so a deleted monitor gets recreated pointing at `/` again.

## This change

Teaches the sync three things:

1. **Reads a new annotation** `uptime.viktorbarzin.me/external-monitor-path`.
   The annotation value is appended as the probe path; default `/`
   preserves today's behaviour for every ingress that hasn't opted in.
2. **Tightens accepted status codes** when an explicit path is set:
   `['200-299']` (strict — we expect a real healthz). The default `/`
   path keeps the existing lenient set `['200-299','300-399','400-499']`
   because homepages routinely 30x redirect or 40x on missing auth.
3. **Updates existing monitors** when the target URL or accepted
   status codes drift. Previously the loop was create-if-missing only,
   so annotating an already-monitored ingress had no effect until the
   monitor was deleted. Now re-running the sync after changing the
   annotation converges the live monitor.

## What is NOT in this change

- No change to the Ingress annotations on any individual stack. Each
  service that wants a non-`/` probe path opts in separately.
- No change to the ConfigMap fallback payload shape — legacy entries
  still get the lenient status codes.
- Monitor DB state in Uptime Kuma's SQLite is untouched at plan time;
  the sync CronJob is what reconciles state on each run.

## Flow

```
  ingress annotation           CronJob Python
  ------------------           --------------
  (none)                 -->   url = https://host/        codes = lenient
  external-monitor-path  -->   url = https://host<path>   codes = strict ['200-299']
  ^^ "/api/healthz"            https://host/api/healthz   codes = ['200-299']

  existing monitor + drifted target url  -->  api.edit_monitor(id, url=..., accepted_statuscodes=...)
```

## Test Plan

### Automated

- `terraform fmt -check -recursive stacks/uptime-kuma` — exit 0.
- `scripts/tg plan` on `stacks/uptime-kuma` — `Plan: 0 to add, 1 to
  change, 0 to destroy`. The single in-place change is the CronJob
  command (Python heredoc re-rendered). No other resources drift.
- Embedded Python compiles: extracted the `PYEOF` block and ran
  `python3 -m py_compile` — OK.

### Manual Verification

1. Annotate an ingress: `kubectl annotate ingress/<name> -n <ns> uptime.viktorbarzin.me/external-monitor-path=/api/healthz`
2. Trigger sync early: `kubectl -n uptime-kuma create job --from=cronjob/external-monitor-sync external-monitor-sync-manual`
3. Expected log line:
   `Updating monitor [External] <name>: https://host/ -> https://host/api/healthz (codes ['200-299','300-399','400-499'] -> ['200-299'])`
4. Inspect monitor in Uptime Kuma UI: URL and accepted status codes
   reflect the annotation.
5. Final summary line includes updated count:
   `Sync complete: 0 created, 1 updated, 0 deleted, N unchanged`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 22:06:23 +00:00
Claude Agent
842646ea4f [ci skip] e2e: test commit from claude-agent-service 2026-04-17 22:03:50 +00:00
Viktor Barzin
65b0f30d5e [docs] Update anti-AI and rybbit docs after rewrite-body removal
- Anti-AI: 5-layer → 3 active layers (bot-block, X-Robots-Tag, tarpit)
- Layer 3 (trap links via rewrite-body) removed — Yaegi v3 incompatible
- Rybbit analytics now injected via Cloudflare Worker (HTMLRewriter)
- strip-accept-encoding middleware removed from all references

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 21:43:13 +00:00
Viktor Barzin
4117809a54 [rybbit] Deploy Cloudflare Worker for analytics injection
Replaces the broken Traefik rewrite-body plugin with a Cloudflare Worker
using HTMLRewriter to inject the rybbit tracking script into HTML responses
at the CDN edge.

- Wildcard route: *.viktorbarzin.me/* covers all proxied services
- 28 services have explicit site ID mappings
- Unmapped hosts pass through without injection
- Zero Traefik dependency, zero performance impact

Closes: code-sed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 21:26:16 +00:00
Viktor Barzin
498e7f3305 [uptime-kuma] Fix duplicate monitor creation + clean up down monitors
## Duplicate bug fix
The external-monitor-sync deduped targets by hostname (`host in seen`) but
multiple ingresses can share the same hostname. Changed to dedupe by final
monitor name (`f"{PREFIX}{label}" in seen`) — prevents creating duplicate
[External] monitors on every sync run. This caused 90 duplicates.
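
The fixed dedupe can be sketched as follows, with targets simplified to `(host, label)` tuples (a hypothetical shape; the real script iterates Ingress objects):

```python
# Sketch of the name-based dedupe; keying the seen-set on the final
# monitor name rather than the hostname is the fix described above.
PREFIX = "[External] "

def dedupe_by_name(targets):
    """Keep one target per final monitor name f'{PREFIX}{label}'."""
    seen, out = set(), []
    for host, label in targets:
        name = f"{PREFIX}{label}"
        if name in seen:   # old code checked `host in seen` instead
            continue
        seen.add(name)
        out.append((host, name))
    return out
```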

## Monitor cleanup
Deleted 118 monitors total:
- 90 duplicate [External] monitors (kept lower ID of each pair)
- 14 paused internal monitors for decommissioned services
- 14 external monitors for non-existent, scaled-down, or non-HTTP services
  (xray-vless, complaints, hermes-agent, etc.)

## Opt-outs
Added `uptime.viktorbarzin.me/external-monitor=false` annotation to ingresses
that shouldn't have external HTTP monitors: xray (non-HTTP protocol),
council-complaints, hermes-agent, task-webhook, torrserver, www (no CF DNS).

329 monitors → ~210 monitors. Zero down monitors expected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 21:12:31 +00:00
Viktor Barzin
5319f03ebc [storage] Fix owntracks + wealthfolio: switch to encrypted PVCs
Both services were running against empty unencrypted PVCs after the
proxmox-lvm-encrypted migration. Data copied from old Released PVs
via LUKS-unlock on PVE host, deployments switched to encrypted PVCs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 20:29:57 +00:00
Viktor Barzin
e51bdb2af8 Add broker-sync Terraform stack (#7)
* [f1-stream] Remove committed cluster-admin kubeconfig

## Context
A kubeconfig granting cluster-admin access was accidentally committed into
the f1-stream stack's application bundle in c7c7047f (2026-02-22). It
contained the cluster CA certificate plus the kubernetes-admin client
certificate and its RSA private key. Both remotes (github.com, forgejo)
are public, so the credential has been reachable for ~2 months.

Grep across the repo confirms no .tf / .hcl / .sh / .yaml file references
this path; the file is a stray local artifact, likely swept in during a
bulk `git add`.

## This change
- git rm stacks/f1-stream/files/.config

## What is NOT in this change
- Cluster-admin cert rotation on the control plane. The leaked client cert
  must be invalidated separately via `kubeadm certs renew admin.conf` or
  CA regeneration. Tracked in the broader secrets-remediation plan.
- Git-history rewrite. The file is still reachable in every commit since
  c7c7047f. A `git filter-repo --path ... --invert-paths` pass against a
  fresh mirror is planned and will be force-pushed to both remotes.

## Test plan
### Automated
No tests needed for a file removal. Sanity:
  $ grep -rn 'f1-stream/files/\.config' --include='*.tf' --include='*.hcl' \
       --include='*.yaml' --include='*.yml' --include='*.sh'
  (no output)

### Manual Verification
1. `git show HEAD --stat` shows exactly one path deleted:
     stacks/f1-stream/files/.config | 19 -------------------
2. `test ! -e stacks/f1-stream/files/.config` returns true.
3. A copy of the leaked file is at /tmp/leaked.conf for post-rotation
   verification (confirming `kubectl --kubeconfig /tmp/leaked.conf get ns`
   fails with 401/403 once the admin cert is renewed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [frigate] Remove orphan config.yaml with leaked RTSP passwords

## Context
A Frigate configuration file was added to modules/kubernetes/frigate/ in
bcad200a (2026-04-15, ~2 days ago) as part of a bulk `chore: add untracked
stacks, scripts, and agent configs` commit. The file contains 14 inline
rtsp://admin:<password>@<host>:554/... URLs, leaking two distinct RTSP
passwords for the cameras at 192.168.1.10 (LAN-only) and
valchedrym.ddns.net (confirmed reachable from public internet on port
554). Both remotes are public, so the creds have been exposed for ~2 days.

Grep across the repo confirms nothing references this config.yaml — the
active stacks/frigate/main.tf stack reads its configuration from a
persistent volume claim named `frigate-config-encrypted`, not from this
file. The file is therefore an orphan from the bulk add, with no
production function.

## This change
- git rm modules/kubernetes/frigate/config.yaml

## What is NOT in this change
- Camera password rotation. The user does not own the cameras; rotation
  must be coordinated out-of-band with the camera operators. The DDNS
  camera (valchedrym.ddns.net:554) is internet-reachable, so the leaked
  password is high-priority to rotate from the device side.
- Git-history rewrite. The file plus its leaked strings remain in all
  commits from bcad200a forward. Scheduled to be purged via
  `git filter-repo --path modules/kubernetes/frigate/config.yaml
  --invert-paths --replace-text <list>` in the broader remediation pass.
- Future Frigate config provisioning. If the stack is re-platformed to
  source config from Git rather than the PVC, the replacement should go
  through ExternalSecret + env-var interpolation, not an inline YAML.

## Test plan
### Automated
  $ grep -rn 'frigate/config\.yaml' --include='*.tf' --include='*.hcl' \
       --include='*.yaml' --include='*.yml' --include='*.sh'
  (no output — confirms orphan status)

### Manual Verification
1. `git show HEAD --stat` shows exactly one deletion:
     modules/kubernetes/frigate/config.yaml | 229 ---------------------------------
2. `test ! -e modules/kubernetes/frigate/config.yaml` returns true.
3. `kubectl -n frigate get pvc frigate-config-encrypted` still shows the
   PVC bound (unaffected by this change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [setup-tls-secret] Delete deprecated renew.sh with hardcoded Technitium token

## Context
modules/kubernetes/setup_tls_secret/renew.sh is a 2.5-year-old
expect(1) script for manual Let's Encrypt wildcard-cert renewal via
Technitium DNS TXT-record challenges. It hardcodes a 64-char Technitium
API token on line 7 (as an expect variable) and line 27 (inside a
certbot-cleanup heredoc). Both remotes are public, so the token has been
exposed for ~2.5 years.

The script is not invoked by the module's Terraform (main.tf only creates
a kubernetes.io/tls Secret from PEM files); it is a standalone
run-it-yourself tool. grep across the repo confirms nothing references
`renew.sh` — neither the 20+ stacks that consume the `setup_tls_secret`
module, nor any CI pipeline, nor any shell wrapper.

A replacement script `renew2.sh` (4 weeks old) lives alongside it. It
sources the Technitium token from `$TECHNITIUM_API_KEY` env var and also
supports Cloudflare DNS-01 challenges via `$CLOUDFLARE_TOKEN`. It is the
current renewal path.

## This change
- git rm modules/kubernetes/setup_tls_secret/renew.sh

## What is NOT in this change
- Technitium token rotation. The leaked token still works against
  `technitium-web.technitium.svc.cluster.local:5380` until revoked in the
  Technitium admin UI. Rotation is a prerequisite for the upcoming
  git-history scrub, which will remove the token from every commit via
  `git filter-repo --replace-text`.
- renew2.sh is retained as-is (already env-var-sourced; clean).
- The setup_tls_secret module's main.tf is not touched; 20+ consuming
  stacks keep working.

## Test plan
### Automated
  $ grep -rn 'renew\.sh' --include='*.tf' --include='*.hcl' \
       --include='*.yaml' --include='*.yml' --include='*.sh'
  (no output — confirms no consumer)
  $ git grep -n 'e28818f309a9ce7f72f0fcc867a365cf5d57b214751b75e2ef3ea74943ef23be'
  (no output in HEAD after this commit)

### Manual Verification
1. `git show HEAD --stat` shows exactly one deletion:
     modules/kubernetes/setup_tls_secret/renew.sh | 136 ---------
2. `test ! -e modules/kubernetes/setup_tls_secret/renew.sh` returns true.
3. `renew2.sh` still exists and is executable:
     ls -la modules/kubernetes/setup_tls_secret/renew2.sh
4. Next cert-renewal run uses renew2.sh with env-var-sourced token; no
   behavioral regression because renew.sh was never part of the automated
   flow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [monitoring] Delete orphan server-power-cycle/main.sh with iDRAC default creds

## Context
stacks/monitoring/modules/monitoring/server-power-cycle/main.sh is an old
shell implementation of a power-cycle watchdog that polled the Dell iDRAC
on 192.168.1.4 for PSU voltage. It hardcoded the Dell iDRAC default
credentials (root:calvin) in 5 `curl -u root:calvin` calls. Both remotes
are public, so those credentials — and the implicit statement that 'this
host has not rotated the default BMC password' — have been exposed.

The current implementation is main.py in the same directory. It reads
iDRAC credentials from the environment variables `idrac_user` and
`idrac_password` (see module's iDRAC_USER_ENV_VAR / iDRAC_PASSWORD_ENV_VAR
constants), which are populated from Vault via ExternalSecret at runtime.
main.sh is not referenced by any Terraform, ConfigMap, or deploy script —
grep confirms no `file()` / `templatefile()` / `filebase64()` call loads
it, and no hand-rolled shell wrapper invokes it.
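
For illustration, the env-var pattern main.py relies on can be sketched as below (variable names from the constants above; the fail-closed behaviour is this sketch's choice, not a claim about main.py):

```python
import os

# Env var names taken from the module constants mentioned above.
IDRAC_USER_ENV_VAR = "idrac_user"
IDRAC_PASSWORD_ENV_VAR = "idrac_password"

def load_idrac_credentials(env=os.environ):
    """Read BMC credentials injected at runtime (Vault via
    ExternalSecret); refuse to run rather than fall back to a
    hardcoded default like root:calvin."""
    user = env.get(IDRAC_USER_ENV_VAR)
    password = env.get(IDRAC_PASSWORD_ENV_VAR)
    if not user or not password:
        raise RuntimeError("iDRAC credentials not set in environment")
    return user, password
```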

## This change
- git rm stacks/monitoring/modules/monitoring/server-power-cycle/main.sh

main.py is retained unchanged.

## What is NOT in this change
- iDRAC password rotation on 192.168.1.4. The BMC should be moved off the
  vendor default `calvin` regardless; rotation is tracked in the broader
  remediation plan and in the iDRAC web UI.
- A separate finding in stacks/monitoring/modules/monitoring/idrac.tf
  (the redfish-exporter ConfigMap has `default: username: root, password:
  calvin` as a fallback for iDRAC hosts not explicitly listed) is NOT
  addressed here — filed as its own task so the fix (drop the default
  block vs. source from env) can be considered in isolation.
- Git-history scrub of main.sh is pending the broader filter-repo pass.

## Test plan
### Automated
  $ grep -rn 'server-power-cycle/main\.sh\|main\.sh' \
       --include='*.tf' --include='*.hcl' --include='*.yaml' \
       --include='*.yml' --include='*.sh'
  (no consumer references)

### Manual Verification
1. `git show HEAD --stat` shows only the one deletion.
2. `test ! -e stacks/monitoring/modules/monitoring/server-power-cycle/main.sh`
3. `kubectl -n monitoring get deploy idrac-redfish-exporter` still shows
   the exporter running — unrelated to this file.
4. main.py continues to run its watchdog loop without regression, because
   it was never coupled to main.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* [tls] Move 3 outlier stacks from per-stack PEMs to root-wildcard symlink

## Context
foolery, terminal, and claude-memory each had their own
`stacks/<x>/secrets/` directory with a plaintext EC-256 private key
(privkey.pem, 241 B) and matching TLS certificate (fullchain.pem, 2868 B)
for *.viktorbarzin.me. The 92 other stacks under stacks/ symlink
`secrets/` → `../../secrets`, which resolves to the repo-root
/secrets/ directory covered by the `secrets/** filter=git-crypt`
.gitattributes rule — i.e., every other stack consumes the same
git-crypt-encrypted root wildcard cert.

The 3 outliers shipped their keys in plaintext because the
`.gitattributes` `secrets/**` rule matches only repo-root /secrets/, not
stacks/*/secrets/. Both remotes are public, so the 6 plaintext PEM files
have been exposed for 1–6 weeks (commits 5a988133 2026-03-11,
a6f71fc6 2026-03-18, 9820f2ce 2026-04-10).

Verified:
- Root wildcard cert subject = CN viktorbarzin.me,
  SAN *.viktorbarzin.me + viktorbarzin.me — covers the 3 subdomains.
- Root privkey + fullchain are a valid key pair (pubkey SHA256 match).
- All 3 outlier certs have the same subject/SAN as root; distinct
  cert material but equivalent coverage.

## This change
- Delete plaintext PEMs in all 3 outlier stacks (6 files total).
- Replace each stacks/<x>/secrets directory with a symlink to
  ../../secrets, matching the fleet pattern.
- Add `stacks/**/secrets/** filter=git-crypt diff=git-crypt` to
  .gitattributes as a regression guard — any future real file placed
  under stacks/<x>/secrets/ gets git-crypt-encrypted automatically.

setup_tls_secret module (modules/kubernetes/setup_tls_secret/main.tf) is
unchanged. It still reads `file("${path.root}/secrets/fullchain.pem")`,
which via the symlink resolves to the root wildcard.

## What is NOT in this change
- Revocation of the 3 leaked per-stack certs. Backed up the leaked PEMs
  to /tmp/leaked-certs/ for `certbot revoke --reason keycompromise`
  once the user's LE account is authenticated. Revocation must happen
  before or alongside the history-rewrite force-push to both remotes.
- Git-history scrub. The leaked PEM blobs are still reachable in every
  commit from 2026-03-11 forward. Scheduled for removal via
  `git filter-repo --path stacks/<x>/secrets/privkey.pem --invert-paths`
  (and fullchain.pem for each stack) in the broader remediation pass.
- cert-manager introduction. The fleet does not use cert-manager today;
  this commit matches the existing symlink-to-wildcard pattern rather
  than introducing a new component.

## Test plan
### Automated
  $ readlink stacks/foolery/secrets
  ../../secrets
  (likewise for terminal, claude-memory)

  $ for s in foolery terminal claude-memory; do
      openssl x509 -in stacks/$s/secrets/fullchain.pem -noout -subject
    done
  subject=CN = viktorbarzin.me  (x3 — all resolve via symlink to root wildcard)

  $ git check-attr filter -- stacks/foolery/secrets/fullchain.pem
  stacks/foolery/secrets/fullchain.pem: filter: git-crypt
  (now matched by the new rule, though for the symlink target the
   repo-root rule already applied)

### Manual Verification
1. `terragrunt plan` in stacks/foolery, stacks/terminal, stacks/claude-memory
   shows only the K8s TLS secret being re-created with the root-wildcard
   material. No ingress changes.
2. `terragrunt apply` for each stack → `kubectl -n <ns> get secret
   <name>-tls -o yaml` → tls.crt decodes to CN viktorbarzin.me with
   the root serial (different from the pre-change per-stack serials).
3. `curl -v https://foolery.viktorbarzin.me/` (and likewise terminal,
   claude-memory) → cert chain presents the new serial, handshake OK.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add broker-sync Terraform stack (pending apply)

Context
-------
Part of the broker-sync rollout — see the plan at
~/.claude/plans/let-s-work-on-linking-temporal-valiant.md and the
companion repo at ViktorBarzin/broker-sync.

This change
-----------
New stack `stacks/broker-sync/`:
- `broker-sync` namespace, aux tier.
- ExternalSecret pulling `secret/broker-sync` via vault-kv
  ClusterSecretStore.
- `broker-sync-data-encrypted` PVC (1Gi, proxmox-lvm-encrypted,
  auto-resizer) — holds the sync SQLite db, FX cache, Wealthfolio
  cookie, CSV archive, watermarks.
- Five CronJobs (all under `viktorbarzin/broker-sync:<tag>`, public
  DockerHub image; no pull secret):
    * `broker-sync-version` — daily 01:00 liveness probe (`broker-sync
      version`), used to smoke-test each new image.
    * `broker-sync-trading212` — daily 02:00 `broker-sync trading212
      --mode steady`.
    * `broker-sync-imap` — daily 02:30, SUSPENDED (Phase 2).
    * `broker-sync-csv` — daily 03:00, SUSPENDED (Phase 3).
    * `broker-sync-fx-reconcile` — 7th of month 05:05, SUSPENDED
      (Phase 1 tail).
- `broker-sync-backup` — daily 04:15, snapshots /data into
  NFS `/srv/nfs/broker-sync-backup/` with 30-day retention, matches
  the convention in infra/.claude/CLAUDE.md §3-2-1.

NOT in this commit:
- Old `wealthfolio-sync` CronJob retirement in
  stacks/wealthfolio/main.tf — happens in the same commit that first
  applies this stack, per the plan's "clean cutover" decision.
- Vault seed. `secret/broker-sync` must be populated before apply;
  required keys documented in the ExternalSecret comment block.

Test plan
---------
## Automated
- `terraform fmt` — clean (ran before commit).
- `terraform validate` needs `terragrunt init` first; deferred to
  apply time.

## Manual Verification
1. Seed Vault `secret/broker-sync/*` (see comment block on the
   ExternalSecret in main.tf).
2. `cd stacks/broker-sync && scripts/tg apply`.
3. `kubectl -n broker-sync get cronjob` — expect 6 CJs, 3 suspended.
4. `kubectl -n broker-sync create job smoke --from=cronjob/broker-sync-version`.
5. `kubectl -n broker-sync logs -l job-name=smoke` — expect
   `broker-sync 0.1.0`.

* fix(beads-server): disable Authentik + CrowdSec on Workbench

Authentik forward-auth returns 400 for dolt-workbench (no Authentik
application configured for this domain). CrowdSec bouncer also
intermittently returns 400. Both disabled — Workbench is accessible
via Cloudflare tunnel only.

TODO: Create Authentik application for dolt-workbench.viktorbarzin.me

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 21:17:45 +01:00
Viktor Barzin
7a884a0b97 [monitoring] Fix alerts for intentionally scaled-down services
PoisonFountainDown and ForwardAuthFallbackActive both fired because
poison-fountain was scaled to 0 replicas (intentional). Updated both
alert expressions to check kube_deployment_spec_replicas > 0 before
alerting on missing available replicas — if desired replicas is 0,
the service is intentionally down and should not alert.
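
A hypothetical shape of the guarded expression, using standard kube-state-metrics series (the real alert rules are not reproduced in this message):

```promql
# Only fire when the deployment *wants* replicas but has none available.
kube_deployment_spec_replicas{deployment="poison-fountain"} > 0
and
kube_deployment_status_replicas_available{deployment="poison-fountain"} == 0
```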

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 19:17:41 +00:00
Viktor Barzin
a19581e32b fix(beads-server): fix Workbench timeout — use internal GraphQL URL
GRAPHQLAPI_URL must point to localhost:9002 (internal), not the external
URL which goes through Authentik. SSR can't authenticate to Authentik.
Also removed Authentik from /graphql ingress — browser fetch() can't
follow 302 redirects on POST requests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 19:05:47 +00:00
Viktor Barzin
da6b82ed5c fix(beads-server): persist GRAPHQLAPI_URL in Terraform
The env var was only set via kubectl and got overwritten on next apply.
Now permanently in the deployment spec.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 18:58:59 +00:00
Viktor Barzin
afb8a16623 [infra] Scale down unused services + remove DoH ingress
Scale to 0 replicas:
- ollama: low usage, saves ~2Gi memory + 59GB NFS-SSD model data idle
- poison-fountain: RSS link archiver, not actively used
- travel-blog: Hugo blog, not actively used

Remove technitium DoH ingress (dns.viktorbarzin.me): externally unreachable
and unused. DNS is served on UDP/TCP port 53 via LoadBalancer (10.0.20.201).

Clears 3 of 5 ExternalAccessDivergence services. Remaining 2 (pdf, travel)
should clear now that the Uptime Kuma monitors will report both down.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 18:55:52 +00:00
Viktor Barzin
cdc851fc63 [alerts] Fix status-page-pusher crash + Prometheus backup push
## status-page-pusher (ExternalAccessDivergence false positive)
The pusher was crashing with `AttributeError: 'list' object has no attribute
'get'` at line 122 — the uptime-kuma-api library changed the heartbeats return
format. Fixed by making beat flattening more robust: handle any nesting of
lists/dicts in the heartbeat data, and add isinstance check before calling
`.get()` on the latest beat.
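
The flattening fix can be sketched as below (a hypothetical reimplementation; the real pusher's heartbeat shape may differ between library versions):

```python
def flatten_beats(data):
    """Recursively flatten any nesting of lists/dicts into a flat
    list of heartbeat dicts."""
    if isinstance(data, dict):
        if "status" in data:      # looks like a single beat
            return [data]
        out = []                  # else a mapping of id -> beats
        for v in data.values():
            out.extend(flatten_beats(v))
        return out
    if isinstance(data, list):
        out = []
        for item in data:
            out.extend(flatten_beats(item))
        return out
    return []                     # drop anything unrecognised

def latest_status(beats):
    """isinstance-guard before calling .get() on the latest beat."""
    latest = beats[-1] if beats else None
    return latest.get("status") if isinstance(latest, dict) else None
```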

## Prometheus backup (PrometheusBackupNeverRun)
The backup sidecar's Pushgateway push was silently failing because `wget
--post-file=-` needs `--header="Content-Type: text/plain"` for Pushgateway
to accept the Prometheus exposition format. Added the header. Also manually
pushed the metric to clear the `absent()` alert immediately.
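
The same push sketched in Python instead of wget, to isolate the part that matters (gateway URL and metric name are illustrative):

```python
import urllib.request

def build_push_request(gateway, job, body):
    """Build a Pushgateway POST; without the text/plain Content-Type
    header the gateway rejects the exposition-format body."""
    return urllib.request.Request(
        url=f"{gateway}/metrics/job/{job}",
        data=body.encode(),
        method="POST",
        headers={"Content-Type": "text/plain"},  # the missing header
    )

# Hypothetical in-cluster gateway address and metric name.
req = build_push_request(
    "http://pushgateway.monitoring.svc:9091",
    "prometheus-backup",
    "prometheus_backup_last_success_timestamp_seconds 1e9\n",
)
```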

Note: ExternalAccessDivergence still fires because 5 services (ollama, pdf,
poison, dns, travel) ARE genuinely externally unreachable but internally up.
This is a real issue (likely Cloudflare tunnel routing), not a false positive.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 18:29:43 +00:00
Viktor Barzin
eef4242408 fix(beads-server): auto-connect Workbench to Dolt on startup
The Workbench's database connection is in-memory and lost on pod restart.
Added startup script that waits for GraphQL server readiness, then calls
addDatabaseConnection mutation automatically. No more manual reconnection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 18:12:31 +00:00
Viktor Barzin
5e9e487661 feat(setup-project): auto-PR working Dockerfiles back to upstream
## Context
The setup-project skill treats "build from a Dockerfile" as priority 6 — "last
resort, avoid if possible" — with no formalized path for apps whose upstream
lacks a working Dockerfile. When we end up writing one to get the deploy green,
that Dockerfile stays private in the infra repo and upstream never benefits.

## This change
Adds a closed-loop flow: when we author a new Dockerfile (or fix a broken
upstream one) and the deploy is healthy for 10 minutes, auto-open a PR against
the upstream repo so the self-hosting community gets the working recipe.

Flow:
1. Classify dockerfile_state during research phase (image-used / used-as-is /
   fixed-broken-upstream / written-from-scratch). Persist to
   modules/kubernetes/<service>/.contribution-state.json.
2. After Terraform apply, run scripts/stability-gate.sh — polls pod Ready +
   HTTP 200 every 30s x 20 iterations, requires 18/20 successes.
3. On pass with a trigger state, scripts/contribute-dockerfile.sh does the
   GitHub API dance: fork → merge-upstream → branch → commit Dockerfile /
   .dockerignore / BUILD.md via Contents API → open PR with body rendered from
   templates/PR_BODY.md. Idempotent (skips on recorded PR URL, existing fork,
   existing branch, open PR, upstream landed a Dockerfile mid-deploy).

GitHub API via curl (gh CLI is sandbox-blocked per .claude/CLAUDE.md); token
pulled from Vault (`secret/viktor` → `github_pat`). Commits include
Signed-off-by for DCO-enforcing repos. Fork branch name is `add-dockerfile`
for written-from-scratch or `fix-dockerfile` for fixed-broken-upstream, with
timestamp suffix on collision.
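
The gate's pass/fail accounting can be sketched in Python with the probe injected (the real stability-gate.sh is bash; names here are illustrative):

```python
import time

def stability_gate(probe, iterations=20, required=18, interval=30,
                   sleep=time.sleep):
    """Poll probe() (pod Ready + HTTP 200) every `interval` seconds,
    `iterations` times; pass if at least `required` probes succeed."""
    passes = 0
    for i in range(iterations):
        if probe():
            passes += 1
        if i < iterations - 1:
            sleep(interval)
    return passes >= required
```

Injecting `probe` and `sleep` keeps the 18-of-20 threshold testable without a live deploy.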

## Files
- SKILL.md — state classification table, quality bar checklist, §8b stability
  gate, §10 contribute-upstream step, checklist updates
- scripts/stability-gate.sh — 10-minute health probe
- scripts/contribute-dockerfile.sh — GitHub API orchestrator
- templates/PR_BODY.md — `{{VAR}}` placeholder template for PR description
- templates/Dockerfile.README.md — BUILD.md template shipped with the PR

## What is NOT in this change
- No Woodpecker / GHA changes (skill-local flow).
- No auto-tracking of merge/reject outcomes upstream (manual follow-up).
- Not yet exercised end-to-end; first real-world run will validate the API
  dance. Plan to dry-run against a throwaway sink repo before pointing at a
  real upstream.

## Test Plan
### Automated
- bash -n on both scripts → pass
- Manual read-through of SKILL.md — step numbering coherent, existing
  §1-9 untouched semantics, new §8b/§10 reference real files

### Manual Verification
1. Next time setup-project onboards a Dockerfile-less app:
   - Confirm .contribution-state.json is written with `written-from-scratch`
   - Run stability-gate.sh — expect 18/20 passes on a healthy deploy
   - Run contribute-dockerfile.sh — expect a fork + branch + PR on ViktorBarzin
   - Verify contribution_pr_url is back-written to the state file
2. Re-run contribute-dockerfile.sh → must be a no-op (idempotent)
3. Upstream-archived case: manually archive a test upstream → re-run →
   expect SKIP, no PR created

[ci skip]

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 18:12:13 +00:00
Viktor Barzin
1860cd1dfb state(vault): update encrypted state 2026-04-17 14:14:05 +00:00
Viktor Barzin
f0ddfb8cae state(dbaas): update encrypted state 2026-04-17 14:08:49 +00:00
Viktor Barzin
b034c868db [traefik] Remove broken rewrite-body plugin and all rybbit/anti-AI injection
The rewrite-body Traefik plugin (both packruler/rewrite-body v1.2.0 and
the-ccsn/traefik-plugin-rewritebody v0.1.3) silently fails on Traefik
v3.6.12 due to Yaegi interpreter issues with ResponseWriter wrapping.
Both plugins load without errors but never inject content.

Removed:
- rewrite-body plugin download (init container) and registration
- strip-accept-encoding middleware (only existed for rewrite-body bug)
- anti-ai-trap-links middleware (used rewrite-body for injection)
- rybbit_site_id variable from ingress_factory and reverse_proxy factory
- rybbit_site_id from 25 service stacks (39 instances)
- Per-service rybbit-analytics middleware CRD resources

Kept:
- compress middleware (entrypoint-level, working correctly)
- ai-bot-block middleware (ForwardAuth to bot-block-proxy)
- anti-ai-headers middleware (X-Robots-Tag: noai, noimageai)
- All CrowdSec, Authentik, rate-limit middleware unchanged

Next: Cloudflare Workers with HTMLRewriter for edge-side injection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 12:41:17 +00:00
Viktor Barzin
b24545ffdb fix(beads-server): fix BeadBoard project ID + install bd binary
- Fixed project_id mismatch (was "beadboard", should be actual DB project ID)
- Rebuilt Docker image with bd v1.0.2 binary (node:20-slim for glibc compat)
- Ran bd migrate to update schema from 1.0.0 → 1.0.2 (adds started_at, etc.)
- Task creation and bd CLI now work inside the container

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 11:57:45 +00:00
Viktor Barzin
f2037545b3 fix(beads-server): make BeadBoard .beads dir writable
BeadBoard needs to create templates/ and archetypes/ subdirectories
inside .beads/. ConfigMap mounts are read-only, causing ENOENT errors
and 503 responses. Fix: init container copies ConfigMap to emptyDir.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 11:37:26 +00:00
Viktor Barzin
00e2f15a5d feat(beads-server): deploy BeadBoard task visualization dashboard
Add BeadBoard (zenchantlive/beadboard) alongside Dolt server and Workbench
for task dependency graph, kanban, and agent coordination views.

- Built custom Docker image (registry.viktorbarzin.me:5050/beadboard)
- ConfigMap provides .beads/metadata.json pointing to Dolt server
- Behind Authentik auth at beadboard.viktorbarzin.me
- Also fixed: GraphQL ingress now has Authentik middleware
- Also fixed: Workbench store.json type enum (mysql → Mysql)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 11:30:43 +00:00
Viktor Barzin
26abd8fe94 [skill] Add /disk-wear skill for periodic disk write analysis
## Context
After the MySQL standalone migration + Technitium SQLite disable saved ~130 GB/day
of disk writes, this methodology should be reusable for periodic health reviews.

## This change:
Adds `/disk-wear` skill that combines three data sources:
- SSH to PVE host for real-time 30s I/O snapshots and SSD SMART health
- Prometheus PromQL for per-app write attribution (node_disk_written_bytes_total
  joined with node_disk_device_mapper_info for dm->LVM mapping)
- kubectl for PVC UUID -> pod/namespace mapping

Produces ranked breakdowns by physical disk, VM, k8s namespace, and individual PVC.
Includes baselines, red flag detection, and annualized wear projections.

Note: container_fs_writes_bytes_total has 0 series (cadvisor doesn't track
block device writes per container), so per-app attribution uses the PVE host's
dm-device level metrics mapped through Prometheus and kubectl.
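
The Prometheus join described above might look like this (a sketch: label names follow node_exporter's diskstats and device-mapper collectors, but the exact selectors and range are assumptions):

```promql
# Per-LVM-volume write rate: join dm-* device writes with the
# device-mapper info metric to recover the LVM volume name.
sum by (name) (
    rate(node_disk_written_bytes_total{device=~"dm-.*"}[1h])
  * on (instance, device) group_left (name)
    node_disk_device_mapper_info
)
```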

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 11:15:26 +00:00
Viktor Barzin
366e2ab083 [uptime-kuma] Opt-out external monitoring for every public ingress [ci skip]
## Context
After the previous commit migrated monitor discovery to per-ingress annotation
(opt-in via `uptime.viktorbarzin.me/external-monitor=true`), coverage expanded
from 13 → 26 monitors but still left ~99 public ingresses uncovered — notably
Helm-managed services (authentik, grafana, vault, forgejo, ntfy) that don't
go through `ingress_factory`, plus any `dns_type = "non-proxied"` ingress
(Immich was a direct victim: `dns_type = "non-proxied"` → no annotation added
→ no monitor → invisible outage).

The user's concern: "I should have known external Immich was down before
users tried to open it."

## This change
Flipped the semantic from opt-in to **opt-out by default**:
- Every ingress whose host ends in `.viktorbarzin.me` gets a `[External] <label>`
  monitor automatically
- Only ingresses with annotation `uptime.viktorbarzin.me/external-monitor=false`
  are skipped
- Host dedup via a `seen` set (one monitor per hostname, regardless of how
  many Ingress resources share it)
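
A minimal sketch of the flipped discovery logic (function and variable names are illustrative, not the actual CronJob script):

```python
# Opt-out semantics: every *.viktorbarzin.me host gets a monitor unless the
# Ingress carries the explicit opt-out annotation. A `seen` set dedups hosts
# shared by multiple Ingress resources.
OPT_OUT = "uptime.viktorbarzin.me/external-monitor"

def discover_targets(ingresses):
    targets, seen = [], set()
    for ing in ingresses:
        annotations = ing.get("metadata", {}).get("annotations", {}) or {}
        if annotations.get(OPT_OUT) == "false":
            continue  # explicit opt-out
        for rule in ing.get("spec", {}).get("rules", []):
            host = rule.get("host", "")
            if not host.endswith(".viktorbarzin.me") or host in seen:
                continue  # not public, or already covered by another Ingress
            seen.add(host)
            targets.append((f"[External] {host.split('.')[0]}",
                            f"https://{host}"))
    return targets
```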

## Verification
Triggered a manual CronJob run post-apply:
```
Sync complete: 102 created, 1 deleted, 23 unchanged
```
Coverage jumped from 26 → ~124 external monitors. All 6 Helm-managed services
now have dedicated monitors:
- [External] immich, authentik, forgejo, grafana, ntfy, vault

## Scope
Only `stacks/uptime-kuma/modules/uptime-kuma/main.tf` (Python script in the
CronJob resource). No RBAC or service account changes — the ones added in the
previous commit still cover this path.

## Test plan

### Automated
```
$ kubectl -n uptime-kuma logs -l job-name=manual-sync-optout-1776422993 --tail=50 | grep -iE 'immich|authentik|grafana|forgejo|vault|ntfy'
Creating monitor: [External] authentik -> https://authentik.viktorbarzin.me
Creating monitor: [External] forgejo   -> https://forgejo.viktorbarzin.me
Creating monitor: [External] immich    -> https://immich.viktorbarzin.me
Creating monitor: [External] grafana   -> https://grafana.viktorbarzin.me
Creating monitor: [External] ntfy      -> https://ntfy.viktorbarzin.me
Creating monitor: [External] vault     -> https://vault.viktorbarzin.me
```

### Manual Verification
1. Open `https://uptime.viktorbarzin.me` → confirm `[External] immich` exists
2. Simulate an Immich outage (briefly scale the deployment to 0) → the external
   monitor should go red within the probe interval (5 min); the internal monitor
   stays up (it probes pod-level, from a different vantage point) → the
   `ExternalAccessDivergence` alert fires after 15 min

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 11:12:00 +00:00
Viktor Barzin
66d2d9916b [infra] Per-ingress external-monitor annotation + actualbudget plan-time fix [ci skip]
## Context
Two operational gaps surfaced during a healthcheck sweep today:

1. **External monitoring coverage**: Only ~13 hostnames (via `cloudflare_proxied_names`
   in `config.tfvars`) had `[External]` monitors in Uptime Kuma. Any service deployed via
   `ingress_factory` with `dns_type = "proxied"` auto-created its DNS record but was NOT
   registered for external probing — so outages like Immich going down externally were
   invisible until a user complained. 99 of ~125 public ingresses had no external
   monitor.

2. **actualbudget stack unplannable**: `count = var.budget_encryption_password != null
   ? 1 : 0` in `factory/main.tf:152` failed with "Invalid count argument" because the
   value flows from a `data.kubernetes_secret` whose contents are `(known after apply)`
   at plan time. Blocked CI applies and drift reconciliation.

## This change

### Per-ingress external-monitor annotation (ingress_factory + reverse_proxy/factory)
- New variables `external_monitor` (bool, nullable) + `external_monitor_name` (string,
  nullable). Default is "follow dns_type" — enabled for any public DNS record
  (`dns_type != "none"`, covers both proxied and non-proxied so Immich and other
  direct-A records are also monitored).
- Emits two annotations on the Ingress:
  - `uptime.viktorbarzin.me/external-monitor = "true"`
  - `uptime.viktorbarzin.me/external-monitor-name = "<label>"` (optional override)

### external-monitor-sync CronJob (uptime-kuma stack)
- Discovers targets from live Ingress objects via the K8s API first (filter by
  annotation), falls back to the legacy `external-monitor-targets` ConfigMap on any
  API error (zero rollout risk).
- New `ServiceAccount` + cluster-wide `ClusterRole`/`ClusterRoleBinding` giving
  `list`/`get` on `networking.k8s.io/ingresses`.
- `API_SERVER` now uses the `KUBERNETES_SERVICE_HOST` env var (always injected by K8s)
  instead of `kubernetes.default.svc` — the short service name failed to resolve given
  the CronJob pod's DNS search-domain configuration. Verified working: the CronJob now
  logs `Loaded N external monitor targets (source=k8s-api)`.
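
The endpoint construction can be sketched as (the function shape is an assumption; the env vars themselves are standard K8s injections into every pod):

```python
import os

def api_server_url():
    # Build the API server URL from injected env vars, avoiding the DNS
    # lookup that failed for the bare `kubernetes.default.svc` name.
    host = os.environ["KUBERNETES_SERVICE_HOST"]
    port = os.environ.get("KUBERNETES_SERVICE_PORT", "443")
    return f"https://{host}:{port}"
```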

### actualbudget count-on-unknown refactor
- Replaced `count = var.budget_encryption_password != null ? 1 : 0` with two explicit
  plan-time booleans: `enable_http_api` and `enable_bank_sync`. Values are known at
  plan; no `-target` workaround needed.
- Callers (`stacks/actualbudget/main.tf`) pass `true` explicitly. Runtime behaviour is
  unchanged — the secret is still consumed via env var.
- Also aligned the factory with live state (the 3 budget-* PVCs had been migrated
  `proxmox-lvm` → `proxmox-lvm-encrypted` outside Terraform): PVC resource renamed
  `data_proxmox` → `data_encrypted`, storage class updated, orphaned `nfs_data` module
  removed. State was rm'd + re-imported with matching UIDs, so no data was moved.
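
The count refactor, sketched (variable name from the commit; the resource it gates is illustrative):

```hcl
# Before (fails at plan when the secret contents are "(known after apply)"):
#   count = var.budget_encryption_password != null ? 1 : 0
#
# After: the caller passes a plan-time-known boolean explicitly.
variable "enable_http_api" {
  type    = bool
  default = false
}

resource "kubernetes_secret_v1" "http_api" {
  count = var.enable_http_api ? 1 : 0
  # ... body unchanged; the password itself is still consumed at apply time
}
```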

## Rollout status (already partially applied in this session)
- `stacks/uptime-kuma` applied — SA + RBAC + CronJob changes live; FQDN fix verified
- `stacks/actualbudget` applied — budget-{viktor,anca,emo} all 200 OK externally
- `stacks/mailserver` + 21 other ingress_factory consumers applied — annotations live
- CronJob `external-monitor-sync` latest run: `source=k8s-api`, 26 monitors active
  (was 13 on the central list)

## Deferred (separate work)
- 4 stacks show pre-existing DESTRUCTIVE drift in plan (metallb namespace, claude-memory,
  rbac, redis) — NOT triggered by this commit but will be by CI's global-file cascade.
  `[ci skip]` here so those don't auto-apply; they will be fixed manually before the
  next CI push.
- Cleanup of `cloudflare_proxied_names` list once Helm-managed ingresses (authentik,
  grafana, vault, forgejo) are annotated — separate PR.

## Test plan

### Automated
```
$ kubectl -n uptime-kuma logs $(kubectl -n uptime-kuma get pods -l job-name -o name | tail -1)
Loaded 26 external monitor targets (source=k8s-api)
Sync complete: 7 created, 0 deleted, 17 unchanged

$ curl -sk -o /dev/null -w "%{http_code}\n" -H "Accept: text/html" \
    https://dawarich.viktorbarzin.me/ https://nextcloud.viktorbarzin.me/ \
    https://budget-viktor.viktorbarzin.me/
200 302 200

$ kubectl -n actualbudget get deploy,pvc -l app=budget-viktor
deployment.apps/budget-viktor     1/1 1 1 Ready
persistentvolumeclaim/budget-viktor-data-encrypted  Bound  10Gi  RWO  proxmox-lvm-encrypted
```

### Manual Verification
1. Confirm the annotation is present on an ingress_factory ingress:
   ```
   kubectl -n dawarich get ingress dawarich -o \
     jsonpath='{.metadata.annotations.uptime\.viktorbarzin\.me/external-monitor}'
   # Expected: "true"
   ```
2. Confirm the new `[External] <name>` monitor appears in Uptime Kuma within 10 min
   (CronJob interval). For Immich specifically, it will appear after the immich stack
   is re-applied.
3. Verify actualbudget plan is clean:
   ```
   cd stacks/actualbudget && scripts/tg plan --non-interactive
   # Expected: no "Invalid count argument" errors
   ```

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 10:34:32 +00:00
Viktor Barzin
0c4fe98d75 state(dbaas): update encrypted state 2026-04-17 10:08:04 +00:00
Viktor Barzin
996bdfc9b6 [technitium] Uninstall MySQL+SQLite query log plugins instead of just disabling
## Context
Disabling MySQL/SQLite query logging via config was not durable — Technitium
re-enables disabled plugins on pod restart, causing 46 GB/day of writes to
the standalone MySQL (15M inserts to technitium.dns_logs between CronJob runs).

## This change:
The password-sync CronJob now UNINSTALLS MySQL and SQLite query log plugins
via `/api/apps/uninstall` instead of setting `enableLogging:false`. This is
permanent — the plugin files are removed from the PVC, so they can't re-enable
on restart. The CronJob checks if the plugins are present first (idempotent).

Only PostgreSQL query logging remains (90-day retention).
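
The idempotent check-then-uninstall step, sketched (function shape and app names are assumptions, not the actual CronJob code):

```python
def uninstall_query_log_plugins(installed_apps, uninstall):
    """Uninstall MySQL/SQLite query-log plugins if present; no-op otherwise."""
    targets = {"Query Logs (MySQL)", "Query Logs (Sqlite)"}  # assumed app names
    removed = []
    for app in installed_apps:
        if app in targets:
            uninstall(app)  # e.g. a call to /api/apps/uninstall for this app
            removed.append(app)
    return removed
```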

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 08:20:55 +00:00
Viktor Barzin
f0a73815d8 [freedify] Remove stale sed patches from container startup
The audio-engine.js, dom.js, and dj.js files were refactored/removed
in the upstream Freedify repo. The sed patches that disabled iOS EQ
auto-init and visualizer no longer have targets, causing the container
to crash on startup. Use the image's default CMD instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 06:17:13 +00:00
Viktor Barzin
f8facf44dd [infra] Fix rewrite-body plugin + cleanup TrueNAS + version bumps
## Context

The rewrite-body Traefik plugin (packruler/rewrite-body v1.2.0) silently
broke on Traefik v3.6.12 — every service using rybbit analytics or anti-AI
injection returned HTTP 200 with "Error 404: Not Found" body. Root cause:
middleware specs referenced plugin name `rewrite-body` but Traefik registered
it as `traefik-plugin-rewritebody`.

Migrated to maintained fork `the-ccsn/traefik-plugin-rewritebody` v0.1.3
which uses the correct plugin name. Also added `lastModified = true` and
`methods = ["GET"]` to anti-AI middleware to avoid rewriting non-HTML
responses.

## This change

- Replace packruler/rewrite-body v1.2.0 with the-ccsn/traefik-plugin-rewritebody v0.1.3
- Fix plugin name in all 3 middleware locations (ingress_factory, reverse-proxy factory, traefik anti-AI)
- Remove deprecated TrueNAS cloud sync monitor (VM decommissioned 2026-04-13)
- Remove CloudSyncStale/CloudSyncFailing/CloudSyncNeverRun alerts
- Fix PrometheusBackupNeverRun alert (for: 48h → 32d to match monthly sidecar schedule)
- Bump versions: rybbit v1.0.21→v1.1.0, wealthfolio v1.1.0→v3.2,
  networking-toolbox 1.1.1→1.6.0, cyberchef v10.24.0→v9.55.0
- MySQL standalone storage_limit 30Gi → 50Gi
- beads-server: fix Dolt workbench type casing, remove Authentik on GraphQL endpoint

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 05:51:52 +00:00
Viktor Barzin
8b206a63ad state(dbaas): update encrypted state 2026-04-16 22:55:52 +00:00
Viktor Barzin
4c8e5bea0b [traefik] Add global compress middleware to fix response compression
The rewrite-body plugin (rybbit analytics, anti-AI trap links) requires
strip-accept-encoding to work, which killed HTTP compression for 50+
services. This adds Traefik's built-in compress middleware at the
websecure entrypoint level to re-compress responses to clients after
rewrite-body has modified them.

Uses includedContentTypes whitelist (not excludedContentTypes) so only
text-based types are compressed. SSE, WebSocket, gRPC, and binary
downloads are unaffected.
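
The middleware, sketched as a Terraform-managed Traefik CRD (the `includedContentTypes` option is from this commit; resource wiring and the exact type list are assumptions):

```hcl
resource "kubernetes_manifest" "compress" {
  manifest = {
    apiVersion = "traefik.io/v1alpha1"
    kind       = "Middleware"
    metadata   = { name = "compress", namespace = "traefik" }
    spec = {
      compress = {
        # Whitelist: only text-based types are re-compressed, so SSE,
        # WebSocket, gRPC, and binary downloads pass through untouched.
        includedContentTypes = [
          "text/html", "text/css", "text/plain",
          "application/javascript", "application/json",
        ]
      }
    }
  }
}
```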

Measured improvement on ha-sofia:
- app.js: 540KB → 167KB (3.2x)
- core.js: 52KB → 19KB (2.7x)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 22:18:51 +00:00
Viktor Barzin
e80b2f026f [infra] Migrate Terraform state from local SOPS to PostgreSQL backend
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
  state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
  10.0.20.200:5432/terraform_state with native pg_advisory_lock.

Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.

Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
  skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
  for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
  service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)
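
The generated Tier 1 backend block, sketched (connection target from this commit; credential handling and schema naming are assumptions):

```hcl
terraform {
  backend "pg" {
    # CNPG LoadBalancer from stacks/dbaas; locking uses the pg backend's
    # native advisory locks, so no extra lock table is needed.
    conn_str    = "postgres://terraform@10.0.20.200:5432/terraform_state"
    schema_name = "tf_<stack-name>"
  }
}
```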

[ci skip]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 19:33:12 +00:00
Viktor Barzin
f538115c43 [dbaas] Migrate MySQL from InnoDB Cluster to standalone StatefulSet
## Context
Disk write analysis showed MySQL InnoDB Cluster writing ~95 GB/day for only
~35 MB of actual data due to Group Replication overhead (binlog, relay log,
GR apply log). The operator enforces GR even with serverInstances=1.

Bitnami Helm charts were deprecated by Broadcom in Aug 2025 — no free
container images available. Using official mysql:8.4 image instead.

## This change:
- Replace helm_release.mysql_cluster service selector with raw
  kubernetes_stateful_set_v1 using official mysql:8.4 image
- ConfigMap mysql-standalone-cnf: skip-log-bin, innodb_flush_log_at_trx_commit=2,
  innodb_doublewrite=ON (re-enabled for standalone safety)
- Service selector switched to standalone pod labels
- Technitium: disable SQLite query logging (18 GB/day write amplification),
  keep PostgreSQL-only logging (90-day retention)
- Grafana datasource and dashboards migrated from MySQL to PostgreSQL
- Dashboard SQL queries fixed for PG integer division (::float cast)
- Updated CLAUDE.md service-specific notes
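
The standalone my.cnf, sketched (settings named in this commit; the surrounding file layout is an assumption):

```ini
[mysqld]
skip-log-bin                         # single node, no replicas -> no binlog writes
innodb_flush_log_at_trx_commit = 2   # flush redo log ~once per second, not per commit
innodb_doublewrite = ON              # re-enabled for standalone crash safety
```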

## What is NOT in this change:
- InnoDB Cluster + operator removal (Phase 4, 7+ days from now)
- Stale Vault role cleanup (Phase 4)
- Old PVC deletion (Phase 4)

Expected write reduction: ~113 GB/day (MySQL 95 + Technitium 18)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 19:01:06 +00:00
Viktor Barzin
ef30f27ac9 state(dbaas): update encrypted state 2026-04-16 18:56:59 +00:00
Viktor Barzin
b6fc1e63a6 state(dbaas): import postgresql-lb service 2026-04-16 18:55:40 +00:00
Viktor Barzin
14fa2b9762 state(vault): update encrypted state 2026-04-16 18:43:06 +00:00
Viktor Barzin
1a42f750f8 state(dbaas): update encrypted state 2026-04-16 18:41:34 +00:00
Viktor Barzin
0a43b5c2ac state(dbaas): update encrypted state 2026-04-16 18:31:33 +00:00
Viktor Barzin
cd513a2226 state(dbaas): update encrypted state 2026-04-16 18:24:31 +00:00
Viktor Barzin
0368601eff state(dbaas): update encrypted state 2026-04-16 18:24:20 +00:00
Viktor Barzin
a237ac97e0 docs(upgrades): add bulk upgrade results from first production run
12 services upgraded in 30 min: audiobookshelf, owntracks, open-webui,
immich, coturn, shlink, phpipam, onlyoffice, paperless-ngx, linkwarden,
synapse, dawarich. Documents auto-rollback behavior, resource awareness
(paperless memory bump), bulk upgrade procedure, and rate limit reset.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 17:34:27 +00:00
Viktor Barzin
39b5ed04a7 upgrade: dawarich 0.37.1 -> 1.6.1 (fix entrypoint + add production env)
## Context
Version 1.3.0+ changed the recommended command from `bin/dev` (development)
to `bin/rails server -p 3000 -b ::` (production). Also requires RAILS_ENV=production,
SECRET_KEY_BASE, and RAILS_LOG_TO_STDOUT env vars.

## This change
- Command: `bin/dev` → `bin/rails server -p 3000 -b ::`
- Add RAILS_ENV=production
- Add SECRET_KEY_BASE (stored in Vault secret/dawarich, synced via ESO)
- Add RAILS_LOG_TO_STDOUT=true

## What happened
1. Initial upgrade applied version 1.6.1 — DB migrations ran but pod
   CrashLooped due to wrong entrypoint (bin/dev exits in production mode)
2. Rollback to 0.37.1 failed because 1.6.1 migrations already ran
   (ActiveRecord::UnknownPrimaryKey on rails_pulse_routes)
3. Rolled forward with corrected entrypoint + env vars
4. Service now stable: 20/20 health checks passed over 5 minutes

Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
2026-04-16 17:25:29 +00:00
Viktor Barzin
8bd2ace00d state(technitium): update encrypted state 2026-04-16 17:21:06 +00:00
Viktor Barzin
7680d4e009 state(dawarich): update encrypted state 2026-04-16 17:19:29 +00:00
Viktor Barzin
59e99f2a3a upgrade: paperless-ngx increase memory 1Gi -> 2Gi
v2.20.14 OOMKills at 1Gi during search index rebuild on upgrade.
Bumped to 2Gi request=limit to handle startup index operations.

Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
2026-04-16 17:16:23 +00:00
Viktor Barzin
611c67b92c state(paperless-ngx): update encrypted state 2026-04-16 17:10:57 +00:00
Viktor Barzin
5f1b14ad53 upgrade: dawarich re-apply 1.6.1 (forward-fix after failed rollback)
DB migrations from 1.6.1 already ran, making 0.37.1 incompatible
(ActiveRecord::UnknownPrimaryKey on rails_pulse_routes table).
Rolling forward is the correct path.

Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
2026-04-16 17:08:20 +00:00
Viktor Barzin
e9275534b6 state(dawarich): update encrypted state 2026-04-16 17:08:15 +00:00
Viktor Barzin
1f589a403c state(dawarich): update encrypted state 2026-04-16 17:04:44 +00:00
Viktor Barzin
f5883be981 Revert "upgrade: dawarich 0.37.1 -> 1.6.1"
This reverts commit ec8b4dbaac.
2026-04-16 17:04:06 +00:00
Viktor Barzin
178fc4b398 state(matrix): update encrypted state 2026-04-16 17:01:28 +00:00
Viktor Barzin
449f1af9d6 state(immich): update encrypted state 2026-04-16 17:00:59 +00:00