## Context
The `external-monitor-sync` CronJob probed `https://<host>/` for every
`*.viktorbarzin.me` ingress. Homepages frequently return 200 (or
allow-listed 30x/40x) even when the backend or DB is broken, producing
false-negatives — the forgejo outage on 2026-04-17 was not caught for
this reason: `/` returned a login page while `/api/healthz` returned
503 from the DB probe.
Manual monitor edits don't stick: the next sync is create-if-missing
only, so a deleted monitor gets recreated pointing at `/` again.
## This change
Teaches the sync three things:
1. **Reads a new annotation** `uptime.viktorbarzin.me/external-monitor-path`.
The annotation value is appended as the probe path; default `/`
preserves today's behaviour for every ingress that hasn't opted in.
2. **Tightens accepted status codes** when an explicit path is set:
`['200-299']` (strict — we expect a real healthz). The default `/`
path keeps the existing lenient set `['200-299','300-399','400-499']`
because homepages routinely 30x redirect or 40x on missing auth.
3. **Updates existing monitors** when the target URL or accepted
status codes drift. Previously the loop was create-if-missing only,
so annotating an already-monitored ingress had no effect until the
monitor was deleted. Now re-running the sync after changing the
annotation converges the live monitor.
## What is NOT in this change
- No change to the Ingress annotations on any individual stack. Each
service that wants a non-`/` probe path opts in separately.
- No change to the ConfigMap fallback payload shape — legacy entries
still get the lenient status codes.
- Monitor DB state in Uptime Kuma's SQLite is untouched at plan time;
the sync CronJob is what reconciles state on each run.
## Flow
```
ingress annotation CronJob Python
------------------ --------------
(none) --> url = https://host/ codes = lenient
external-monitor-path --> url = https://host<path> codes = strict ['200-299']
^^ "/api/healthz" https://host/api/healthz codes = ['200-299']
existing monitor + drifted target url --> api.edit_monitor(id, url=..., accepted_statuscodes=...)
```
## Test Plan
### Automated
- `terraform fmt -check -recursive stacks/uptime-kuma` — exit 0.
- `scripts/tg plan` on `stacks/uptime-kuma` — `Plan: 0 to add, 1 to
change, 0 to destroy`. The single in-place change is the CronJob
command (Python heredoc re-rendered). No other resources drift.
- Embedded Python compiles: extracted the `PYEOF` block and ran
`python3 -m py_compile` — OK.
### Manual Verification
1. Annotate an ingress: `kubectl annotate ingress/<name> -n <ns> uptime.viktorbarzin.me/external-monitor-path=/api/healthz`
2. Trigger sync early: `kubectl -n uptime-kuma create job --from=cronjob/external-monitor-sync external-monitor-sync-manual`
3. Expected log line:
`Updating monitor [External] <name>: https://host/ -> https://host/api/healthz (codes ['200-299','300-399','400-499'] -> ['200-299'])`
4. Inspect monitor in Uptime Kuma UI: URL and accepted status codes
reflect the annotation.
5. Final summary line includes updated count:
`Sync complete: 0 created, 1 updated, 0 deleted, N unchanged`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the broken Traefik rewrite-body plugin with a Cloudflare Worker
using HTMLRewriter to inject the rybbit tracking script into HTML responses
at the CDN edge.
- Wildcard route: *.viktorbarzin.me/* covers all proxied services
- 28 services have explicit site ID mappings
- Unmapped hosts pass through without injection
- Zero Traefik dependency, zero performance impact
Closes: code-sed
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Duplicate bug fix
The external-monitor-sync deduped targets by hostname (`host in seen`) but
multiple ingresses can share the same hostname. Changed to dedupe by final
monitor name (`f"{PREFIX}{label}" in seen`) — prevents creating duplicate
[External] monitors on every sync run. This caused 90 duplicates.
## Monitor cleanup
Deleted 118 monitors total:
- 90 duplicate [External] monitors (kept lower ID of each pair)
- 14 paused internal monitors for decommissioned services
- 14 external monitors for non-existent, scaled-down, or non-HTTP services
(xray-vless, complaints, hermes-agent, etc.)
## Opt-outs
Added `uptime.viktorbarzin.me/external-monitor=false` annotation to ingresses
that shouldn't have external HTTP monitors: xray (non-HTTP protocol),
council-complaints, hermes-agent, task-webhook, torrserver, www (no CF DNS).
329 monitors → ~210 monitors. Zero down monitors expected.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both services were running against empty unencrypted PVCs after the
proxmox-lvm-encrypted migration. Data copied from old Released PVs
via LUKS-unlock on PVE host, deployments switched to encrypted PVCs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* [f1-stream] Remove committed cluster-admin kubeconfig
## Context
A kubeconfig granting cluster-admin access was accidentally committed into
the f1-stream stack's application bundle in c7c7047f (2026-02-22). It
contained the cluster CA certificate plus the kubernetes-admin client
certificate and its RSA private key. Both remotes (github.com, forgejo)
are public, so the credential has been reachable for ~2 months.
Grep across the repo confirms no .tf / .hcl / .sh / .yaml file references
this path; the file is a stray local artifact, likely swept in during a
bulk `git add`.
## This change
- git rm stacks/f1-stream/files/.config
## What is NOT in this change
- Cluster-admin cert rotation on the control plane. The leaked client cert
must be invalidated separately via `kubeadm certs renew admin.conf` or
CA regeneration. Tracked in the broader secrets-remediation plan.
- Git-history rewrite. The file is still reachable in every commit since
c7c7047f. A `git filter-repo --path ... --invert-paths` pass against a
fresh mirror is planned and will be force-pushed to both remotes.
## Test plan
### Automated
No tests needed for a file removal. Sanity:
$ grep -rn 'f1-stream/files/\.config' --include='*.tf' --include='*.hcl' \
--include='*.yaml' --include='*.yml' --include='*.sh'
(no output)
### Manual Verification
1. `git show HEAD --stat` shows exactly one path deleted:
stacks/f1-stream/files/.config | 19 -------------------
2. `test ! -e stacks/f1-stream/files/.config` returns true.
3. A copy of the leaked file is at /tmp/leaked.conf for post-rotation
verification (confirming `kubectl --kubeconfig /tmp/leaked.conf get ns`
fails with 401/403 once the admin cert is renewed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [frigate] Remove orphan config.yaml with leaked RTSP passwords
## Context
A Frigate configuration file was added to modules/kubernetes/frigate/ in
bcad200a (2026-04-15, ~2 days ago) as part of a bulk `chore: add untracked
stacks, scripts, and agent configs` commit. The file contains 14 inline
rtsp://admin:<password>@<host>:554/... URLs, leaking two distinct RTSP
passwords for the cameras at 192.168.1.10 (LAN-only) and
valchedrym.ddns.net (confirmed reachable from public internet on port
554). Both remotes are public, so the creds have been exposed for ~2 days.
Grep across the repo confirms nothing references this config.yaml — the
active stacks/frigate/main.tf stack reads its configuration from a
persistent volume claim named `frigate-config-encrypted`, not from this
file. The file is therefore an orphan from the bulk add, with no
production function.
## This change
- git rm modules/kubernetes/frigate/config.yaml
## What is NOT in this change
- Camera password rotation. The user does not own the cameras; rotation
must be coordinated out-of-band with the camera operators. The DDNS
camera (valchedrym.ddns.net:554) is internet-reachable, so the leaked
password is high-priority to rotate from the device side.
- Git-history rewrite. The file plus its leaked strings remain in all
commits from bcad200a forward. Scheduled to be purged via
`git filter-repo --path modules/kubernetes/frigate/config.yaml
--invert-paths --replace-text <list>` in the broader remediation pass.
- Future Frigate config provisioning. If the stack is re-platformed to
source config from Git rather than the PVC, the replacement should go
through ExternalSecret + env-var interpolation, not an inline YAML.
## Test plan
### Automated
$ grep -rn 'frigate/config\.yaml' --include='*.tf' --include='*.hcl' \
--include='*.yaml' --include='*.yml' --include='*.sh'
(no output — confirms orphan status)
### Manual Verification
1. `git show HEAD --stat` shows exactly one deletion:
modules/kubernetes/frigate/config.yaml | 229 ---------------------------------
2. `test ! -e modules/kubernetes/frigate/config.yaml` returns true.
3. `kubectl -n frigate get pvc frigate-config-encrypted` still shows the
PVC bound (unaffected by this change).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [setup-tls-secret] Delete deprecated renew.sh with hardcoded Technitium token
## Context
modules/kubernetes/setup_tls_secret/renew.sh is a 2.5-year-old
expect(1) script for manual Let's Encrypt wildcard-cert renewal via
Technitium DNS TXT-record challenges. It hardcodes a 64-char Technitium
API token on line 7 (as an expect variable) and line 27 (inside a
certbot-cleanup heredoc). Both remotes are public, so the token has been
exposed for ~2.5 years.
The script is not invoked by the module's Terraform (main.tf only creates
a kubernetes.io/tls Secret from PEM files); it is a standalone
run-it-yourself tool. grep across the repo confirms nothing references
`renew.sh` — neither the 20+ stacks that consume the `setup_tls_secret`
module, nor any CI pipeline, nor any shell wrapper.
A replacement script `renew2.sh` (4 weeks old) lives alongside it. It
sources the Technitium token from `$TECHNITIUM_API_KEY` env var and also
supports Cloudflare DNS-01 challenges via `$CLOUDFLARE_TOKEN`. It is the
current renewal path.
## This change
- git rm modules/kubernetes/setup_tls_secret/renew.sh
## What is NOT in this change
- Technitium token rotation. The leaked token still works against
`technitium-web.technitium.svc.cluster.local:5380` until revoked in the
Technitium admin UI. Rotation is a prerequisite for the upcoming
git-history scrub, which will remove the token from every commit via
`git filter-repo --replace-text`.
- renew2.sh is retained as-is (already env-var-sourced; clean).
- The setup_tls_secret module's main.tf is not touched; 20+ consuming
stacks keep working.
## Test plan
### Automated
$ grep -rn 'renew\.sh' --include='*.tf' --include='*.hcl' \
--include='*.yaml' --include='*.yml' --include='*.sh'
(no output — confirms no consumer)
$ git grep -n 'e28818f309a9ce7f72f0fcc867a365cf5d57b214751b75e2ef3ea74943ef23be'
(no output in HEAD after this commit)
### Manual Verification
1. `git show HEAD --stat` shows exactly one deletion:
modules/kubernetes/setup_tls_secret/renew.sh | 136 ---------
2. `test ! -e modules/kubernetes/setup_tls_secret/renew.sh` returns true.
3. `renew2.sh` still exists and is executable:
ls -la modules/kubernetes/setup_tls_secret/renew2.sh
4. Next cert-renewal run uses renew2.sh with env-var-sourced token; no
behavioral regression because renew.sh was never part of the automated
flow.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [monitoring] Delete orphan server-power-cycle/main.sh with iDRAC default creds
## Context
stacks/monitoring/modules/monitoring/server-power-cycle/main.sh is an old
shell implementation of a power-cycle watchdog that polled the Dell iDRAC
on 192.168.1.4 for PSU voltage. It hardcoded the Dell iDRAC default
credentials (root:calvin) in 5 `curl -u root:calvin` calls. Both remotes
are public, so those credentials — and the implicit statement that 'this
host has not rotated the default BMC password' — have been exposed.
The current implementation is main.py in the same directory. It reads
iDRAC credentials from the environment variables `idrac_user` and
`idrac_password` (see module's iDRAC_USER_ENV_VAR / iDRAC_PASSWORD_ENV_VAR
constants), which are populated from Vault via ExternalSecret at runtime.
main.sh is not referenced by any Terraform, ConfigMap, or deploy script —
grep confirms no `file()` / `templatefile()` / `filebase64()` call loads
it, and no hand-rolled shell wrapper invokes it.
## This change
- git rm stacks/monitoring/modules/monitoring/server-power-cycle/main.sh
main.py is retained unchanged.
## What is NOT in this change
- iDRAC password rotation on 192.168.1.4. The BMC should be moved off the
vendor default `calvin` regardless; rotation is tracked in the broader
remediation plan and in the iDRAC web UI.
- A separate finding in stacks/monitoring/modules/monitoring/idrac.tf
(the redfish-exporter ConfigMap has `default: username: root, password:
calvin` as a fallback for iDRAC hosts not explicitly listed) is NOT
addressed here — filed as its own task so the fix (drop the default
block vs. source from env) can be considered in isolation.
- Git-history scrub of main.sh is pending the broader filter-repo pass.
## Test plan
### Automated
$ grep -rn 'server-power-cycle/main\.sh\|main\.sh' \
--include='*.tf' --include='*.hcl' --include='*.yaml' \
--include='*.yml' --include='*.sh'
(no consumer references)
### Manual Verification
1. `git show HEAD --stat` shows only the one deletion.
2. `test ! -e stacks/monitoring/modules/monitoring/server-power-cycle/main.sh`
3. `kubectl -n monitoring get deploy idrac-redfish-exporter` still shows
the exporter running — unrelated to this file.
4. main.py continues to run its watchdog loop without regression, because
it was never coupled to main.sh.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [tls] Move 3 outlier stacks from per-stack PEMs to root-wildcard symlink
## Context
foolery, terminal, and claude-memory each had their own
`stacks/<x>/secrets/` directory with a plaintext EC-256 private key
(privkey.pem, 241 B) and matching TLS certificate (fullchain.pem, 2868 B)
for *.viktorbarzin.me. The 92 other stacks under stacks/ symlink
`secrets/` → `../../secrets`, which resolves to the repo-root
/secrets/ directory covered by the `secrets/** filter=git-crypt`
.gitattributes rule — i.e., every other stack consumes the same
git-crypt-encrypted root wildcard cert.
The 3 outliers shipped their keys in plaintext because `.gitattributes`
secrets/** rule matches only repo-root /secrets/, not
stacks/*/secrets/. Both remotes are public, so the 6 plaintext PEM files
have been exposed for 1–6 weeks (commits 5a988133 2026-03-11,
a6f71fc6 2026-03-18, 9820f2ce 2026-04-10).
Verified:
- Root wildcard cert subject = CN viktorbarzin.me,
SAN *.viktorbarzin.me + viktorbarzin.me — covers the 3 subdomains.
- Root privkey + fullchain are a valid key pair (pubkey SHA256 match).
- All 3 outlier certs have the same subject/SAN as root; different
distinct cert material but equivalent coverage.
## This change
- Delete plaintext PEMs in all 3 outlier stacks (6 files total).
- Replace each stacks/<x>/secrets directory with a symlink to
../../secrets, matching the fleet pattern.
- Add `stacks/**/secrets/** filter=git-crypt diff=git-crypt` to
.gitattributes as a regression guard — any future real file placed
under stacks/<x>/secrets/ gets git-crypt-encrypted automatically.
setup_tls_secret module (modules/kubernetes/setup_tls_secret/main.tf) is
unchanged. It still reads `file("${path.root}/secrets/fullchain.pem")`,
which via the symlink resolves to the root wildcard.
## What is NOT in this change
- Revocation of the 3 leaked per-stack certs. Backed up the leaked PEMs
to /tmp/leaked-certs/ for `certbot revoke --reason keycompromise`
once the user's LE account is authenticated. Revocation must happen
before or alongside the history-rewrite force-push to both remotes.
- Git-history scrub. The leaked PEM blobs are still reachable in every
commit from 2026-03-11 forward. Scheduled for removal via
`git filter-repo --path stacks/<x>/secrets/privkey.pem --invert-paths`
(and fullchain.pem for each stack) in the broader remediation pass.
- cert-manager introduction. The fleet does not use cert-manager today;
this commit matches the existing symlink-to-wildcard pattern rather
than introducing a new component.
## Test plan
### Automated
$ readlink stacks/foolery/secrets
../../secrets
(likewise for terminal, claude-memory)
$ for s in foolery terminal claude-memory; do
openssl x509 -in stacks/$s/secrets/fullchain.pem -noout -subject
done
subject=CN = viktorbarzin.me (x3 — all resolve via symlink to root wildcard)
$ git check-attr filter -- stacks/foolery/secrets/fullchain.pem
stacks/foolery/secrets/fullchain.pem: filter: git-crypt
(now matched by the new rule, though for the symlink target the
repo-root rule already applied)
### Manual Verification
1. `terragrunt plan` in stacks/foolery, stacks/terminal, stacks/claude-memory
shows only the K8s TLS secret being re-created with the root-wildcard
material. No ingress changes.
2. `terragrunt apply` for each stack → `kubectl -n <ns> get secret
<name>-tls -o yaml` → tls.crt decodes to CN viktorbarzin.me with
the root serial (different from the pre-change per-stack serials).
3. `curl -v https://foolery.viktorbarzin.me/` (and likewise terminal,
claude-memory) → cert chain presents the new serial, handshake OK.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add broker-sync Terraform stack (pending apply)
Context
-------
Part of the broker-sync rollout — see the plan at
~/.claude/plans/let-s-work-on-linking-temporal-valiant.md and the
companion repo at ViktorBarzin/broker-sync.
This change
-----------
New stack `stacks/broker-sync/`:
- `broker-sync` namespace, aux tier.
- ExternalSecret pulling `secret/broker-sync` via vault-kv
ClusterSecretStore.
- `broker-sync-data-encrypted` PVC (1Gi, proxmox-lvm-encrypted,
auto-resizer) — holds the sync SQLite db, FX cache, Wealthfolio
cookie, CSV archive, watermarks.
- Five CronJobs (all under `viktorbarzin/broker-sync:<tag>`, public
DockerHub image; no pull secret):
* `broker-sync-version` — daily 01:00 liveness probe (`broker-sync
version`), used to smoke-test each new image.
* `broker-sync-trading212` — daily 02:00 `broker-sync trading212
--mode steady`.
* `broker-sync-imap` — daily 02:30, SUSPENDED (Phase 2).
* `broker-sync-csv` — daily 03:00, SUSPENDED (Phase 3).
* `broker-sync-fx-reconcile` — 7th of month 05:05, SUSPENDED
(Phase 1 tail).
- `broker-sync-backup` — daily 04:15, snapshots /data into
NFS `/srv/nfs/broker-sync-backup/` with 30-day retention, matches
the convention in infra/.claude/CLAUDE.md §3-2-1.
NOT in this commit:
- Old `wealthfolio-sync` CronJob retirement in
stacks/wealthfolio/main.tf — happens in the same commit that first
applies this stack, per the plan's "clean cutover" decision.
- Vault seed. `secret/broker-sync` must be populated before apply;
required keys documented in the ExternalSecret comment block.
Test plan
---------
## Automated
- `terraform fmt` — clean (ran before commit).
- `terraform validate` needs `terragrunt init` first; deferred to
apply time.
## Manual Verification
1. Seed Vault `secret/broker-sync/*` (see comment block on the
ExternalSecret in main.tf).
2. `cd stacks/broker-sync && scripts/tg apply`.
3. `kubectl -n broker-sync get cronjob` — expect 6 CJs, 3 suspended.
4. `kubectl -n broker-sync create job smoke --from=cronjob/broker-sync-version`.
5. `kubectl -n broker-sync logs -l job-name=smoke` — expect
`broker-sync 0.1.0`.
* fix(beads-server): disable Authentik + CrowdSec on Workbench
Authentik forward-auth returns 400 for dolt-workbench (no Authentik
application configured for this domain). CrowdSec bouncer also
intermittently returns 400. Both disabled — Workbench is accessible
via Cloudflare tunnel only.
TODO: Create Authentik application for dolt-workbench.viktorbarzin.me
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PoisonFountainDown and ForwardAuthFallbackActive both fired because
poison-fountain was scaled to 0 replicas (intentional). Updated both
alert expressions to check kube_deployment_spec_replicas > 0 before
alerting on missing available replicas — if desired replicas is 0,
the service is intentionally down and should not alert.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GRAPHQLAPI_URL must point to localhost:9002 (internal), not the external
URL which goes through Authentik. SSR can't authenticate to Authentik.
Also removed Authentik from /graphql ingress — browser fetch() can't
follow 302 redirects on POST requests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The env var was only set via kubectl and got overwritten on next apply.
Now permanently in the deployment spec.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Scale to 0 replicas:
- ollama: low usage, saves ~2Gi memory + 59GB NFS-SSD model data idle
- poison-fountain: RSS link archiver, not actively used
- travel-blog: Hugo blog, not actively used
Remove technitium DoH ingress (dns.viktorbarzin.me): externally unreachable
and unused. DNS is served on UDP/TCP port 53 via LoadBalancer (10.0.20.201).
Clears 3 of 5 ExternalAccessDivergence services. Remaining 2 (pdf, travel)
should clear now that the Uptime Kuma monitors will report both down.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## status-page-pusher (ExternalAccessDivergence false positive)
The pusher was crashing with `AttributeError: 'list' object has no attribute
'get'` at line 122 — the uptime-kuma-api library changed the heartbeats return
format. Fixed by making beat flattening more robust: handle any nesting of
lists/dicts in the heartbeat data, and add isinstance check before calling
`.get()` on the latest beat.
## Prometheus backup (PrometheusBackupNeverRun)
The backup sidecar's Pushgateway push was silently failing because `wget
--post-file=-` needs `--header="Content-Type: text/plain"` for Pushgateway
to accept the Prometheus exposition format. Added the header. Also manually
pushed the metric to clear the `absent()` alert immediately.
Note: ExternalAccessDivergence still fires because 5 services (ollama, pdf,
poison, dns, travel) ARE genuinely externally unreachable but internally up.
This is a real issue (likely Cloudflare tunnel routing) not a false positive.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Workbench's database connection is in-memory and lost on pod restart.
Added startup script that waits for GraphQL server readiness, then calls
addDatabaseConnection mutation automatically. No more manual reconnection.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
The setup-project skill treats "build from a Dockerfile" as priority 6 — "last
resort, avoid if possible" — with no formalized path for apps whose upstream
lacks a working Dockerfile. When we end up writing one to get the deploy green,
that Dockerfile stays private in the infra repo and upstream never benefits.
## This change
Adds a closed-loop flow: when we author a new Dockerfile (or fix a broken
upstream one) and the deploy is healthy for 10 minutes, auto-open a PR against
the upstream repo so the self-hosting community gets the working recipe.
Flow:
1. Classify dockerfile_state during research phase (image-used / used-as-is /
fixed-broken-upstream / written-from-scratch). Persist to
modules/kubernetes/<service>/.contribution-state.json.
2. After Terraform apply, run scripts/stability-gate.sh — polls pod Ready +
HTTP 200 every 30s x 20 iterations, requires 18/20 successes.
3. On pass with a trigger state, scripts/contribute-dockerfile.sh does the
GitHub API dance: fork → merge-upstream → branch → commit Dockerfile /
.dockerignore / BUILD.md via Contents API → open PR with body rendered from
templates/PR_BODY.md. Idempotent (skips on recorded PR URL, existing fork,
existing branch, open PR, upstream landed a Dockerfile mid-deploy).
GitHub API via curl (gh CLI is sandbox-blocked per .claude/CLAUDE.md); token
pulled from Vault (`secret/viktor` → `github_pat`). Commits include
Signed-off-by for DCO-enforcing repos. Fork branch name is `add-dockerfile`
for written-from-scratch or `fix-dockerfile` for fixed-broken-upstream, with
timestamp suffix on collision.
## Files
- SKILL.md — state classification table, quality bar checklist, §8b stability
gate, §10 contribute-upstream step, checklist updates
- scripts/stability-gate.sh — 10-minute health probe
- scripts/contribute-dockerfile.sh — GitHub API orchestrator
- templates/PR_BODY.md — `{{VAR}}` placeholder template for PR description
- templates/Dockerfile.README.md — BUILD.md template shipped with the PR
## What is NOT in this change
- No Woodpecker / GHA changes (skill-local flow).
- No auto-tracking of merge/reject outcomes upstream (manual follow-up).
- Not yet exercised end-to-end; first real-world run will validate the API
dance. Plan to dry-run against a throwaway sink repo before pointing at a
real upstream.
## Test Plan
### Automated
- bash -n on both scripts → pass
- Manual read-through of SKILL.md — step numbering coherent, existing
§1-9 untouched semantics, new §8b/§10 reference real files
### Manual Verification
1. Next time setup-project onboards a Dockerfile-less app:
- Confirm .contribution-state.json is written with `written-from-scratch`
- Run stability-gate.sh — expect 18/20 passes on a healthy deploy
- Run contribute-dockerfile.sh — expect a fork + branch + PR on ViktorBarzin
- Verify contribution_pr_url is back-written to the state file
2. Re-run contribute-dockerfile.sh → must be a no-op (idempotent)
3. Upstream-archived case: manually archive a test upstream → re-run →
expect SKIP, no PR created
[ci skip]
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The rewrite-body Traefik plugin (both packruler/rewrite-body v1.2.0 and
the-ccsn/traefik-plugin-rewritebody v0.1.3) silently fails on Traefik
v3.6.12 due to Yaegi interpreter issues with ResponseWriter wrapping.
Both plugins load without errors but never inject content.
Removed:
- rewrite-body plugin download (init container) and registration
- strip-accept-encoding middleware (only existed for rewrite-body bug)
- anti-ai-trap-links middleware (used rewrite-body for injection)
- rybbit_site_id variable from ingress_factory and reverse_proxy factory
- rybbit_site_id from 25 service stacks (39 instances)
- Per-service rybbit-analytics middleware CRD resources
Kept:
- compress middleware (entrypoint-level, working correctly)
- ai-bot-block middleware (ForwardAuth to bot-block-proxy)
- anti-ai-headers middleware (X-Robots-Tag: noai, noimageai)
- All CrowdSec, Authentik, rate-limit middleware unchanged
Next: Cloudflare Workers with HTMLRewriter for edge-side injection.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fixed project_id mismatch (was "beadboard", should be actual DB project ID)
- Rebuilt Docker image with bd v1.0.2 binary (node:20-slim for glibc compat)
- Ran bd migrate to update schema from 1.0.0 → 1.0.2 (adds started_at, etc.)
- Task creation and bd CLI now work inside the container
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
BeadBoard needs to create templates/ and archetypes/ subdirectories
inside .beads/. ConfigMap mounts are read-only, causing ENOENT errors
and 503 responses. Fix: init container copies ConfigMap to emptyDir.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add BeadBoard (zenchantlive/beadboard) alongside Dolt server and Workbench
for task dependency graph, kanban, and agent coordination views.
- Built custom Docker image (registry.viktorbarzin.me:5050/beadboard)
- ConfigMap provides .beads/metadata.json pointing to Dolt server
- Behind Authentik auth at beadboard.viktorbarzin.me
- Also fixed: GraphQL ingress now has Authentik middleware
- Also fixed: Workbench store.json type enum (mysql → Mysql)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
After the MySQL standalone migration + Technitium SQLite disable saved ~130 GB/day
of disk writes, this methodology should be reusable for periodic health reviews.
## This change:
Adds `/disk-wear` skill that combines three data sources:
- SSH to PVE host for real-time 30s I/O snapshots and SSD SMART health
- Prometheus PromQL for per-app write attribution (node_disk_written_bytes_total
joined with node_disk_device_mapper_info for dm->LVM mapping)
- kubectl for PVC UUID -> pod/namespace mapping
Produces ranked breakdowns by physical disk, VM, k8s namespace, and individual PVC.
Includes baselines, red flag detection, and annualized wear projections.
Note: container_fs_writes_bytes_total has 0 series (cadvisor doesn't track
block device writes per container), so per-app attribution uses the PVE host's
dm-device level metrics mapped through Prometheus and kubectl.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
After the previous commit migrated monitor discovery to per-ingress annotation
(opt-in via `uptime.viktorbarzin.me/external-monitor=true`), coverage expanded
from 13 → 26 monitors but still left ~99 public ingresses uncovered — notably
Helm-managed services (authentik, grafana, vault, forgejo, ntfy) that don't
go through `ingress_factory`, plus any `dns_type = "non-proxied"` ingress
(Immich was a direct victim: `dns_type = "non-proxied"` → no annotation added
→ no monitor → invisible outage).
The user's concern: "I should have known external Immich was down before
users tried to open it."
## This change
Flipped the semantic from opt-in to **opt-out by default**:
- Every ingress whose host ends in `.viktorbarzin.me` gets a `[External] <label>`
monitor automatically
- Only ingresses with annotation `uptime.viktorbarzin.me/external-monitor=false`
are skipped
- Host dedup via a `seen` set (one monitor per hostname, regardless of how
many Ingress resources share it)
## Verification
Triggered a manual CronJob run post-apply:
```
Sync complete: 102 created, 1 deleted, 23 unchanged
```
Coverage jumped from 26 → ~124 external monitors. All 6 Helm-managed services
now have dedicated monitors:
- [External] immich, authentik, forgejo, grafana, ntfy, vault
## Scope
Only `stacks/uptime-kuma/modules/uptime-kuma/main.tf` (Python script in the
CronJob resource). No RBAC or service account changes — the ones added in the
previous commit still cover this path.
## Test plan
### Automated
\`\`\`
\$ kubectl -n uptime-kuma logs -l job-name=manual-sync-optout-1776422993 --tail=50 | grep -iE 'immich|authentik|grafana|forgejo|vault|ntfy'
Creating monitor: [External] authentik -> https://authentik.viktorbarzin.me
Creating monitor: [External] forgejo -> https://forgejo.viktorbarzin.me
Creating monitor: [External] immich -> https://immich.viktorbarzin.me
Creating monitor: [External] grafana -> https://grafana.viktorbarzin.me
Creating monitor: [External] ntfy -> https://ntfy.viktorbarzin.me
Creating monitor: [External] vault -> https://vault.viktorbarzin.me
\`\`\`
### Manual Verification
1. Open `https://uptime.viktorbarzin.me` → confirm `[External] immich` exists
2. Simulate an Immich outage (scale deploy to 0 briefly) → external monitor
should go red within the probe interval (5min); internal monitor stays up
(pod-level from a different probe angle) → `ExternalAccessDivergence`
alert fires after 15 min
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Two operational gaps surfaced during a healthcheck sweep today:
1. **External monitoring coverage**: Only ~13 hostnames (via `cloudflare_proxied_names`
in `config.tfvars`) had `[External]` monitors in Uptime Kuma. Any service deployed via
`ingress_factory` with `dns_type = "proxied"` auto-created its DNS record but was NOT
registered for external probing — so outages like Immich going down externally were
invisible until a user complained. 99 of ~125 public ingresses had no external
monitor.
2. **actualbudget stack unplannable**: `count = var.budget_encryption_password != null
? 1 : 0` in `factory/main.tf:152` failed with "Invalid count argument" because the
value flows from a `data.kubernetes_secret` whose contents are `(known after apply)`
at plan time. Blocked CI applies and drift reconciliation.
## This change
### Per-ingress external-monitor annotation (ingress_factory + reverse_proxy/factory)
- New variables `external_monitor` (bool, nullable) + `external_monitor_name` (string,
nullable). Default is "follow dns_type" — enabled for any public DNS record
(`dns_type != "none"`, covers both proxied and non-proxied so Immich and other
direct-A records are also monitored).
- Emits two annotations on the Ingress:
- `uptime.viktorbarzin.me/external-monitor = "true"`
- `uptime.viktorbarzin.me/external-monitor-name = "<label>"` (optional override)
### external-monitor-sync CronJob (uptime-kuma stack)
- Discovers targets from live Ingress objects via the K8s API first (filter by
annotation), falls back to the legacy `external-monitor-targets` ConfigMap on any
API error (zero rollout risk).
- New `ServiceAccount` + cluster-wide `ClusterRole`/`ClusterRoleBinding` giving
`list`/`get` on `networking.k8s.io/ingresses`.
- `API_SERVER` now uses the `KUBERNETES_SERVICE_HOST` env var (always injected by K8s)
instead of `kubernetes.default.svc` — the search-domain expansion failed in the
CronJob pod's DNS config. Verified working: CronJob now logs
`Loaded N external monitor targets (source=k8s-api)`.
### actualbudget count-on-unknown refactor
- Replaced `count = var.budget_encryption_password != null ? 1 : 0` with two explicit
plan-time booleans: `enable_http_api` and `enable_bank_sync`. Values are known at
plan; no `-target` workaround needed.
- Callers (`stacks/actualbudget/main.tf`) pass `true` explicitly. Runtime behaviour is
unchanged — the secret is still consumed via env var.
- Also aligned the factory with live state (the 3 budget-* PVCs had been migrated
`proxmox-lvm` → `proxmox-lvm-encrypted` outside Terraform): PVC resource renamed
`data_proxmox` → `data_encrypted`, storage class updated, orphaned `nfs_data` module
removed. State was rm'd + re-imported with matching UIDs, so no data was moved.
## Rollout status (already partially applied in this session)
- `stacks/uptime-kuma` applied — SA + RBAC + CronJob changes live; FQDN fix verified
- `stacks/actualbudget` applied — budget-{viktor,anca,emo} all 200 OK externally
- `stacks/mailserver` + 21 other ingress_factory consumers applied — annotations live
- CronJob `external-monitor-sync` latest run: `source=k8s-api`, 26 monitors active
(was 13 on the central list)
## Deferred (separate work)
- 4 stacks show pre-existing DESTRUCTIVE drift in plan (metallb namespace, claude-memory,
rbac, redis) — NOT triggered by this commit but will be by CI's global-file cascade.
`[ci skip]` here so those don't auto-apply; they will be fixed manually before the
next CI push.
- Cleanup of `cloudflare_proxied_names` list once Helm-managed ingresses (authentik,
grafana, vault, forgejo) are annotated — separate PR.
## Test plan
### Automated
\`\`\`
\$ kubectl -n uptime-kuma logs \$(kubectl -n uptime-kuma get pods -l job-name -o name | tail -1)
Loaded 26 external monitor targets (source=k8s-api)
Sync complete: 7 created, 0 deleted, 17 unchanged
\$ curl -sk -o /dev/null -w "%{http_code}\n" -H "Accept: text/html" \\
https://dawarich.viktorbarzin.me/https://nextcloud.viktorbarzin.me/ \\
https://budget-viktor.viktorbarzin.me/
200 302 200
\$ kubectl -n actualbudget get deploy,pvc -l app=budget-viktor
deployment.apps/budget-viktor 1/1 1 1 Ready
persistentvolumeclaim/budget-viktor-data-encrypted Bound 10Gi RWO proxmox-lvm-encrypted
\`\`\`
### Manual Verification
1. Confirm the annotation is present on an ingress_factory ingress:
\`\`\`
kubectl -n dawarich get ingress dawarich -o \\
jsonpath='{.metadata.annotations.uptime\.viktorbarzin\.me/external-monitor}'
# Expected: "true"
\`\`\`
2. Confirm the new `[External] <name>` monitor appears in Uptime Kuma within 10 min
(CronJob interval). For Immich specifically, it will appear after the immich stack
is re-applied.
3. Verify actualbudget plan is clean:
\`\`\`
cd stacks/actualbudget && scripts/tg plan --non-interactive
# Expected: no "Invalid count argument" errors
\`\`\`
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Disabling MySQL/SQLite query logging via config was not durable — Technitium
re-enables disabled plugins on pod restart, causing 46 GB/day of writes to
the standalone MySQL (15M inserts to technitium.dns_logs between CronJob runs).
## This change:
The password-sync CronJob now UNINSTALLS MySQL and SQLite query log plugins
via `/api/apps/uninstall` instead of setting `enableLogging:false`. This is
permanent — the plugin files are removed from the PVC, so they can't re-enable
on restart. The CronJob checks if the plugins are present first (idempotent).
Only PostgreSQL query logging remains (90-day retention).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The audio-engine.js, dom.js, and dj.js files were refactored/removed
in the upstream Freedify repo. The sed patches that disabled iOS EQ
auto-init and visualizer no longer have targets, causing the container
to crash on startup. Use the image's default CMD instead.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The rewrite-body plugin (rybbit analytics, anti-AI trap links) requires
strip-accept-encoding to work, which killed HTTP compression for 50+
services. This adds Traefik's built-in compress middleware at the
websecure entrypoint level to re-compress responses to clients after
rewrite-body has modified them.
Uses includedContentTypes whitelist (not excludedContentTypes) so only
text-based types are compressed. SSE, WebSocket, gRPC, and binary
downloads are unaffected.
Measured improvement on ha-sofia:
- app.js: 540KB → 167KB (3.2x)
- core.js: 52KB → 19KB (2.7x)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
10.0.20.200:5432/terraform_state with native pg_advisory_lock.
Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.
Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
Disk write analysis showed MySQL InnoDB Cluster writing ~95 GB/day for only
~35 MB of actual data due to Group Replication overhead (binlog, relay log,
GR apply log). The operator enforces GR even with serverInstances=1.
Bitnami Helm charts were deprecated by Broadcom in Aug 2025 — no free
container images available. Using official mysql:8.4 image instead.
## This change:
- Replace helm_release.mysql_cluster service selector with raw
kubernetes_stateful_set_v1 using official mysql:8.4 image
- ConfigMap mysql-standalone-cnf: skip-log-bin, innodb_flush_log_at_trx_commit=2,
innodb_doublewrite=ON (re-enabled for standalone safety)
- Service selector switched to standalone pod labels
- Technitium: disable SQLite query logging (18 GB/day write amplification),
keep PostgreSQL-only logging (90-day retention)
- Grafana datasource and dashboards migrated from MySQL to PostgreSQL
- Dashboard SQL queries fixed for PG integer division (::float cast)
- Updated CLAUDE.md service-specific notes
## What is NOT in this change:
- InnoDB Cluster + operator removal (Phase 4, 7+ days from now)
- Stale Vault role cleanup (Phase 4)
- Old PVC deletion (Phase 4)
Expected write reduction: ~113 GB/day (MySQL 95 + Technitium 18)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
Version 1.3.0+ changed the recommended command from `bin/dev` (development)
to `bin/rails server -p 3000 -b ::` (production). Also requires RAILS_ENV=production,
SECRET_KEY_BASE, and RAILS_LOG_TO_STDOUT env vars.
## This change
- Command: `bin/dev` → `bin/rails server -p 3000 -b ::`
- Add RAILS_ENV=production
- Add SECRET_KEY_BASE (stored in Vault secret/dawarich, synced via ESO)
- Add RAILS_LOG_TO_STDOUT=true
## What happened
1. Initial upgrade applied version 1.6.1 — DB migrations ran but pod
CrashLooped due to wrong entrypoint (bin/dev exits in production mode)
2. Rollback to 0.37.1 failed because 1.6.1 migrations already ran
(ActiveRecord::UnknownPrimaryKey on rails_pulse_routes)
3. Rolled forward with corrected entrypoint + env vars
4. Service now stable: 20/20 health checks passed over 5 minutes
Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
v2.20.14 OOMKills at 1Gi during search index rebuild on upgrade.
Bumped to 2Gi request=limit to handle startup index operations.
Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>
DB migrations from 1.6.1 already ran, making 0.37.1 incompatible
(ActiveRecord::UnknownPrimaryKey on rails_pulse_routes table).
Rolling forward is the correct path.
Co-Authored-By: Service Upgrade Agent <noreply@viktorbarzin.me>