After fixing the threshold=80% misconfig and seeing two PVCs
(prometheus + technitium primary) get stuck Terminating, a 3rd round
showed four more PVCs (frigate, hackmd, immich-postgresql,
paperless-ngx) in the same state. Same root cause: TF spec'd a
smaller storage size than the autoresizer-grown live value, K8s
rejected the shrink, TF force-replaced the PVC, and the
pvc-protection finalizer held it in Terminating while the pod kept
using the underlying volume.
Bulk-inject lifecycle.ignore_changes = [spec[0].resources[0].requests]
on every kubernetes_persistent_volume_claim block that has
resize.topolvm.io/threshold annotations. The pattern was already
documented in .claude/CLAUDE.md but ~63 stacks were missing it.
Live PVCs are unaffected; this only prevents future TF applies from
attempting the destroy+recreate.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
topolvm/pvc-autoresizer's threshold annotation is the FREE-SPACE
percentage below which expansion fires (per upstream README). Setting
it to "80%" means "expand when free-space drops below 80%", i.e. as
soon as the PVC crosses 20% utilization — which caused
prometheus-data-proxmox to be repeatedly expanded from 200Gi to 433Gi
in 70 minutes (six 10% bumps, all when the volume was only ~14% used).
Once the SC opt-in fix landed (1e4eac53) and the inode metrics fix
landed (02a12f1a), the autoresizer started actively misfiring across
75+ PVCs cluster-wide.
Flip the value to "10%" everywhere — that's "expand when free-space
drops below 10%", i.e. at 90% utilization, which is the conventional
semantic and matches the alert thresholds in
prometheus_chart_values.tpl (PVAutoExpanding fires at 80%, PVFillingUp
at 95%).
The CLAUDE.md PVC template was the source of the misconfig, so update
it too. Live PVC annotations were patched in parallel via kubectl
annotate; TF apply on each affected stack will be a no-op against
those live values.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The TP-Link gateway was wired via ExternalName `gw.viktorbarzin.lan`, but
Technitium has no record for that name (the router isn't a DHCP client and
Kea DDNS never registers it), so the ingress backend returned NXDOMAIN and
the `[External] gw` Uptime Kuma monitor was permanently failing.
Factory now accepts `backend_ip` as an alternative to `external_name`: it
creates a selector-less ClusterIP Service + manual EndpointSlice pointing
at the given IP, bypassing cluster DNS entirely. Used for gw (192.168.1.1);
the old ExternalName path is retained for every other service.
Also add a direct `port` monitor for the router in uptime-kuma's
internal_monitors list so we can tell a Cloudflare/tunnel outage apart
from the router itself being down. Extended the internal-monitor-sync
script to handle non-DB monitor types (hostname + port fields).
The Redis monitor (id=53) was created manually with a connection string
pointing at redis-master.redis-headless.redis.svc.cluster.local, which
doesn't resolve — headless only exposes pod DNS (redis-node-N.redis-headless),
not a synthetic "redis-master" name. Status had been DOWN with ENOTFOUND
for weeks.
Declare it in local.internal_monitors using redis-master.redis.svc.cluster.local
(the HAProxy-fronted ClusterIP that already routes to the Sentinel-elected
master). Verified RESP PING through HAProxy returns PONG.
Tighten intervals to 60s / 30s retry / 3 retries — Redis is core (Paperless,
Immich, Authentik, Dawarich all depend on it), a 5-minute detection window
was way too loose given the blast radius.
Also teach the sync CronJob to handle no-password monitors (auth disabled
on the Bitnami chart), via an optional database_password_vault_key.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Wave 3A (commit c9d221d5) added the `# KYVERNO_LIFECYCLE_V1` marker to the
27 pre-existing `ignore_changes = [...dns_config]` sites so they could be
grepped and audited. It did NOT address pod-owning resources that were
simply missing the suppression entirely. Post-Wave-3A sampling (2026-04-18)
found that navidrome, f1-stream, frigate, servarr, monitoring, crowdsec,
and many other stacks showed perpetual `dns_config` drift every plan
because their `kubernetes_deployment` / `kubernetes_stateful_set` /
`kubernetes_cron_job_v1` resources had no `lifecycle {}` block at all.
Root cause (same as Wave 3A): Kyverno's admission webhook stamps
`dns_config { option { name = "ndots"; value = "2" } }` on every pod's
`spec.template.spec.dns_config` to prevent NxDomain search-domain flooding
(see `k8s-ndots-search-domain-nxdomain-flood` skill). Without `ignore_changes`
on every Terraform-managed pod-owner, Terraform repeatedly tries to strip
the injected field.
## This change
Extends the Wave 3A convention by sweeping EVERY `kubernetes_deployment`,
`kubernetes_stateful_set`, `kubernetes_daemon_set`, `kubernetes_cron_job_v1`,
`kubernetes_job_v1` (+ their `_v1` variants) in the repo and ensuring each
carries the right `ignore_changes` path:
- **kubernetes_deployment / stateful_set / daemon_set / job_v1**:
`spec[0].template[0].spec[0].dns_config`
- **kubernetes_cron_job_v1**:
`spec[0].job_template[0].spec[0].template[0].spec[0].dns_config`
(extra `job_template[0]` nesting — the CronJob's PodTemplateSpec is
one level deeper)
Each injection / extension is tagged `# KYVERNO_LIFECYCLE_V1: Kyverno
admission webhook mutates dns_config with ndots=2` inline so the
suppression is discoverable via `rg 'KYVERNO_LIFECYCLE_V1' stacks/`.
Two insertion paths are handled by a Python pass (`/tmp/add_dns_config_ignore.py`):
1. **No existing `lifecycle {}`**: inject a brand-new block just before the
resource's closing `}`. 108 new blocks on 93 files.
2. **Existing `lifecycle {}` (usually for `DRIFT_WORKAROUND: CI owns image tag`
from Wave 4, commit a62b43d1)**: extend its `ignore_changes` list with the
dns_config path. Handles both inline (`= [x]`) and multiline
(`= [\n x,\n]`) forms; ensures the last pre-existing list item carries
a trailing comma so the extended list is valid HCL. 34 extensions.
The script skips anything already mentioning `dns_config` inside an
`ignore_changes`, so re-running is a no-op.
## Scale
- 142 total lifecycle injections/extensions
- 93 `.tf` files touched
- 108 brand-new `lifecycle {}` blocks + 34 extensions of existing ones
- Every Tier 0 and Tier 1 stack with a pod-owning resource is covered
- Together with Wave 3A's 27 pre-existing markers → **169 greppable
`KYVERNO_LIFECYCLE_V1` dns_config sites across the repo**
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entirely commented-out block (`/* … */`).
Python script touched the file, reverted manually.
- `_template/main.tf.example` skeleton — kept minimal on purpose; any
future stack created from it should either inherit the Wave 3A one-line
form or add its own on first `kubernetes_deployment`.
- `terraform fmt` fixes to pre-existing alignment issues in meshcentral,
nvidia/modules/nvidia, vault — unrelated to this commit. Left for a
separate fmt-only pass.
- Non-pod resources (`kubernetes_service`, `kubernetes_secret`,
`kubernetes_manifest`, etc.) — they don't own pods so they don't get
Kyverno dns_config mutation.
## Verification
Random sample post-commit:
```
$ cd stacks/navidrome && ../../scripts/tg plan → No changes.
$ cd stacks/f1-stream && ../../scripts/tg plan → No changes.
$ cd stacks/frigate && ../../scripts/tg plan → No changes.
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \
| awk -F: '{s+=$2} END {print s}'
169
```
## Reproduce locally
1. `git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` → 169+
3. `cd stacks/navidrome && ../../scripts/tg plan` → expect 0 drift on
the deployment's dns_config field.
Refs: code-seq (Wave 3B dns_config class closed; kubernetes_manifest
annotation class handled separately in 8d94688d for tls_secret)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Wave 3B-continued: the Goldilocks VPA dashboard (stacks/vpa) runs a Kyverno
ClusterPolicy `goldilocks-vpa-auto-mode` that mutates every namespace with
`metadata.labels["goldilocks.fairwinds.com/vpa-update-mode"] = "off"`. This
is intentional — Terraform owns container resource limits, and Goldilocks
should only provide recommendations, never auto-update. The label is how
Goldilocks decides per-namespace whether to run its VPA in `off` mode.
Effect on Terraform: every `kubernetes_namespace` resource shows the label
as pending-removal (`-> null`) on every `scripts/tg plan`. Dawarich survey
2026-04-18 confirmed the drift. Cluster-side count: 88 namespaces carry the
label (`kubectl get ns -o json | jq ... | wc -l`). Every TF-managed namespace
is affected.
This commit brings the intentional admission drift under the same
`# KYVERNO_LIFECYCLE_V1` discoverability marker introduced in c9d221d5 for
the ndots dns_config pattern. The marker now stands generically for any
Kyverno admission-webhook drift suppression; the inline comment records
which specific policy stamps which specific field so future grep audits
show why each suppression exists.
## This change
107 `.tf` files touched — every stack's `resource "kubernetes_namespace"`
resource gets:
```hcl
lifecycle {
# KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode ClusterPolicy stamps this label on every namespace
ignore_changes = [metadata[0].labels["goldilocks.fairwinds.com/vpa-update-mode"]]
}
```
Injection was done with a brace-depth-tracking Python pass (`/tmp/add_goldilocks_ignore.py`):
match `^resource "kubernetes_namespace" ` → track `{` / `}` until the
outermost closing brace → insert the lifecycle block before the closing
brace. The script is idempotent (skips any file that already mentions
`goldilocks.fairwinds.com/vpa-update-mode`) so re-running is safe.
Vault stack picked up 2 namespaces in the same file (k8s-users produces
one, plus a second explicit ns) — confirmed via file diff (+8 lines).
## What is NOT in this change
- `stacks/trading-bot/main.tf` — entire file is `/* … */` commented out
(paused 2026-04-06 per user decision). Reverted after the script ran.
- `stacks/_template/main.tf.example` — per-stack skeleton, intentionally
minimal. User keeps it that way. Not touched by the script (file
has no real `resource "kubernetes_namespace"` — only a placeholder
comment).
- `.terraform/` copies (e.g. `stacks/metallb/.terraform/modules/...`) —
gitignored, won't commit; the live path was edited.
- `terraform fmt` cleanup of adjacent pre-existing alignment issues in
authentik, freedify, hermes-agent, nvidia, vault, meshcentral. Reverted
to keep the commit scoped to the Goldilocks sweep. Those files will
need a separate fmt-only commit or will be cleaned up on next real
apply to that stack.
## Verification
Dawarich (one of the hundred-plus touched stacks) showed the pattern
before and after:
```
$ cd stacks/dawarich && ../../scripts/tg plan
Before:
Plan: 0 to add, 2 to change, 0 to destroy.
# kubernetes_namespace.dawarich will be updated in-place
(goldilocks.fairwinds.com/vpa-update-mode -> null)
# module.tls_secret.kubernetes_secret.tls_secret will be updated in-place
(Kyverno generate.* labels — fixed in 8d94688d)
After:
No changes. Your infrastructure matches the configuration.
```
Injection count check:
```
$ rg -c 'KYVERNO_LIFECYCLE_V1: goldilocks-vpa-auto-mode' stacks/ | awk -F: '{s+=$2} END {print s}'
108
```
## Reproduce locally
1. `git pull`
2. Pick any stack: `cd stacks/<name> && ../../scripts/tg plan`
3. Expect: no drift on the namespace's goldilocks.fairwinds.com/vpa-update-mode label.
Closes: code-dwx
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Phase 1 of the state-drift consolidation audit (plan Wave 3) identified that
the entire repo leans on a repeated `lifecycle { ignore_changes = [...dns_config] }`
snippet to suppress Kyverno's admission-webhook dns_config mutation (the ndots=2
override that prevents NxDomain search-domain flooding). 27 occurrences across
19 stacks. Without this suppression, every pod-owning resource shows perpetual
TF plan drift.
The original plan proposed a shared `modules/kubernetes/kyverno_lifecycle/`
module emitting the ignore-paths list as an output that stacks would consume in
their `ignore_changes` blocks. That approach is architecturally impossible:
Terraform's `ignore_changes` meta-argument accepts only static attribute paths
— it rejects module outputs, locals, variables, and any expression (the HCL
spec evaluates `lifecycle` before the regular expression graph). So a DRY
module cannot exist. The canonical pattern IS the repeated snippet.
What the snippet was missing was a *discoverability tag* so that (a) new
resources can be validated for compliance, (b) the existing 27 sites can be
grep'd in a single command, and (c) future maintainers understand the
convention rather than each reinventing it.
## This change
- Introduces `# KYVERNO_LIFECYCLE_V1` as the canonical marker comment.
Attached inline on every `spec[0].template[0].spec[0].dns_config` line
(or `spec[0].job_template[0].spec[0]...` for CronJobs) across all 27
existing suppression sites.
- Documents the convention with rationale and copy-paste snippets in
`AGENTS.md` → new "Kyverno Drift Suppression" section.
- Expands the existing `.claude/CLAUDE.md` Kyverno ndots note to reference
the marker and explain why the module approach is blocked.
- Updates `_template/main.tf.example` so every new stack starts compliant.
## What is NOT in this change
- The `kubernetes_manifest` Kyverno annotation drift (beads `code-seq`)
— that is Phase B with a sibling `# KYVERNO_MANIFEST_V1` marker.
- Behavioral changes — every `ignore_changes` list is byte-identical
save for the inline comment.
- The fallback module the original plan anticipated — skipped because
Terraform rejects expressions in `ignore_changes`.
- `terraform fmt` cleanup on adjacent unrelated blocks in three files
(claude-agent-service, freedify/factory, hermes-agent). Reverted to
keep this commit scoped to the convention rollout.
## Before / after
Before (cannot distinguish accidental-forgotten from intentional-convention):
```hcl
lifecycle {
ignore_changes = [spec[0].template[0].spec[0].dns_config]
}
```
After (greppable, self-documenting, discoverable by tooling):
```hcl
lifecycle {
ignore_changes = [spec[0].template[0].spec[0].dns_config] # KYVERNO_LIFECYCLE_V1
}
```
## Test Plan
### Automated
```
$ rg -c 'KYVERNO_LIFECYCLE_V1' stacks/ --include='*.tf' --include='*.tf.example' \
| awk -F: '{s+=$2} END {print s}'
27
$ git diff --stat | grep -E '\.(tf|tf\.example|md)$' | wc -l
21
# All code-file diffs are 1 insertion + 1 deletion per marker site,
# except beads-server (3), ebooks (4), immich (3), uptime-kuma (2).
$ git diff --stat stacks/ | tail -1
20 files changed, 45 insertions(+), 28 deletions(-)
```
### Manual Verification
No apply required — HCL comments only. Zero effect on any stack's plan output.
Future audits: `rg 'KYVERNO_LIFECYCLE_V1' stacks/ | wc -l` must grow as new
pod-owning resources are added.
## Reproduce locally
1. `cd infra && git pull`
2. `rg 'KYVERNO_LIFECYCLE_V1' stacks/` → expect 27 hits in 19 files
3. Grep any new `kubernetes_deployment` for the marker; absence = missing
suppression.
Closes: code-28m
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Monitor id 663 "MySQL Standalone (dbaas)" was created manually yesterday via
the `uptime-kuma-api` Python library when the dbaas stack migrated from
InnoDB Cluster to standalone MySQL. It worked and was UP, but lived only in
Uptime Kuma's MariaDB — if UK's DB were wiped or restored from an older
backup, the monitor would be lost.
## This change
Adds declarative, self-healing management for internal-service monitors
(databases, non-HTTP endpoints) that can't be discovered from ingress
annotations. Modelled on the existing `external-monitor-sync` CronJob.
- `local.internal_monitors` — list of desired monitors (name, type,
connection string, Vault password key, interval, retries). Seeded with
the MySQL Standalone monitor. Add new entries here to manage more.
- `kubernetes_secret.internal_monitor_sync` — pulls admin password and all
referenced DB passwords from Vault `secret/viktor` at apply time. Secret
key names are derived from monitor name (`DB_PASSWORD_<upper_snake>`).
- `kubernetes_config_map_v1.internal_monitor_targets` — renders the target
list to JSON for the sync container.
- `kubernetes_cron_job_v1.internal_monitor_sync` — runs every 10 min,
looks up monitors by name, creates if missing, patches if drifted,
leaves id and history untouched when already in desired state.
## Why this approach (Option B, not a Terraform provider)
The `louislam/uptime-kuma` Terraform provider does NOT exist in the public
registry (verified — only a CLI tool of the same name). Option A from the
task brief was therefore unavailable. Option B (idempotent K8s CronJob)
matches the established pattern in the same module for
`external-monitor-sync` — no new machinery introduced.
## Monitor 663: no-op on first sync
Manual import was not possible (no provider → no state to import). The
sync job correctly identifies the existing monitor by name and reports:
Monitor MySQL Standalone (dbaas) (id=663) already in desired state
Internal monitor sync complete
DB heartbeats confirm monitor 663 stayed UP throughout with `status=1` and
`Rows: 1` responses every 60s — no disruption.
## Vault key — left manual (by design)
`secret/viktor` is not Terraform-managed anywhere in the repo (only read
via `data "vault_kv_secret_v2"`). It is a user-edited Vault entry holding
135 keys. The `uptimekuma_db_password` key was added manually yesterday;
this change does NOT codify it. Codifying the whole `secret/viktor` entry
is out of scope for this task (would need a separate migration + rotation
story). The sync job reads the existing value at apply time — so if the
value is ever rotated in Vault, the next sync picks it up.
## Plan + apply
Plan: 3 to add, 0 to change, 0 to destroy.
Apply complete! Resources: 3 added, 0 changed, 0 destroyed.
Re-plan: No changes. Your infrastructure matches the configuration.
Also updated `.claude/skills/uptime-kuma/SKILL.md` with the new pattern.
Closes: code-ed2
## Context
Uptime Kuma TTFB was bimodal — fast ~150ms responses mixed with slow
~3s responses — median 1.7s, p95 3.2s across 20 samples. CPU request
was 50m (5% of one core) against a Node.js process that handles ~190
monitors plus SQLite DB maintenance. Memory request was 64Mi while
actual RSS sat around 221Mi, so the pod was also running above its
guaranteed memory floor and subject to eviction pressure when nodes got
tight.
CPU limits are intentionally absent cluster-wide (CFS throttling caused
more pain than it solved), so the only knob to give the scheduler a
higher floor is the request itself. Raising the request makes the node
reserve more CPU for the pod and lets the kernel's CFS weight it more
generously when the node is busy — should reduce the tail on the slow
path without introducing throttling.
## This change
- requests.cpu: 50m -> 100m
- requests.memory: 64Mi -> 128Mi
- limits.memory: unchanged at 512Mi
- limits.cpu: still unset (explicit — cluster-wide rule)
## What is NOT in this change
- No CPU limit added
- No readiness/liveness probe tuning
- No replica count change (still 1, Recreate strategy)
- No DB layer / SQLite tuning
## Measurements (20 curl samples of https://uptime.viktorbarzin.me/)
Before:
min 0.143s
median 1.727s
p95 3.163s
max 3.204s
mean 1.768s
After:
min 0.149s
median 1.228s
p95 3.154s
max 3.283s
mean 1.590s
Median dropped ~29% (1.73s -> 1.23s). Tail (p95/max) essentially
unchanged — the slow bucket appears driven by something other than
CPU scheduling (likely socket.io / SSR render path inside the app,
or TLS/cf-tunnel handshake — worth a separate investigation).
Closes: code-79d
## Context
After commit f6812fe6 every external-monitor-sync run updated all ~107
monitors without any real change — because the new code always appended
`/` to the host (default path), while historical monitors had been
created with bare `https://host` URLs. Sync saw `https://host` !=
`https://host/` and re-wrote every monitor on each cycle: noisy logs,
wasted Uptime Kuma writes.
## This change
When the `uptime.viktorbarzin.me/external-monitor-path` annotation is
absent, build the URL WITHOUT a trailing slash so it matches the shape
of pre-existing monitors. When the annotation is set, append it as
before (e.g. `https://forgejo.viktorbarzin.me/api/healthz`).
Also flip the lenient/strict codes branch to trigger off the same
"annotation set?" signal instead of comparing against DEFAULT_PATH.
## Verification
Verified via two consecutive manual triggers of the CronJob against the
live stack:
Pass 1 (migration): 0 created, 107 updated, 0 deleted, 1 unchanged
Pass 2 (stable): 0 created, 0 updated, 0 deleted, 108 unchanged
`[External] forgejo` still probes `https://forgejo.viktorbarzin.me/api/healthz`
with strict `200-299`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
The `external-monitor-sync` CronJob probed `https://<host>/` for every
`*.viktorbarzin.me` ingress. Homepages frequently return 200 (or
allow-listed 30x/40x) even when the backend or DB is broken, producing
false-negatives — the forgejo outage on 2026-04-17 was not caught for
this reason: `/` returned a login page while `/api/healthz` returned
503 from the DB probe.
Manual monitor edits don't stick: the next sync is create-if-missing
only, so a deleted monitor gets recreated pointing at `/` again.
## This change
Teaches the sync three things:
1. **Reads a new annotation** `uptime.viktorbarzin.me/external-monitor-path`.
The annotation value is appended as the probe path; default `/`
preserves today's behaviour for every ingress that hasn't opted in.
2. **Tightens accepted status codes** when an explicit path is set:
`['200-299']` (strict — we expect a real healthz). The default `/`
path keeps the existing lenient set `['200-299','300-399','400-499']`
because homepages routinely 30x redirect or 40x on missing auth.
3. **Updates existing monitors** when the target URL or accepted
status codes drift. Previously the loop was create-if-missing only,
so annotating an already-monitored ingress had no effect until the
monitor was deleted. Now re-running the sync after changing the
annotation converges the live monitor.
## What is NOT in this change
- No change to the Ingress annotations on any individual stack. Each
service that wants a non-`/` probe path opts in separately.
- No change to the ConfigMap fallback payload shape — legacy entries
still get the lenient status codes.
- Monitor DB state in Uptime Kuma's SQLite is untouched at plan time;
the sync CronJob is what reconciles state on each run.
## Flow
```
ingress annotation CronJob Python
------------------ --------------
(none) --> url = https://host/ codes = lenient
external-monitor-path --> url = https://host<path> codes = strict ['200-299']
^^ "/api/healthz" https://host/api/healthz codes = ['200-299']
existing monitor + drifted target url --> api.edit_monitor(id, url=..., accepted_statuscodes=...)
```
## Test Plan
### Automated
- `terraform fmt -check -recursive stacks/uptime-kuma` — exit 0.
- `scripts/tg plan` on `stacks/uptime-kuma` — `Plan: 0 to add, 1 to
change, 0 to destroy`. The single in-place change is the CronJob
command (Python heredoc re-rendered). No other resources drift.
- Embedded Python compiles: extracted the `PYEOF` block and ran
`python3 -m py_compile` — OK.
### Manual Verification
1. Annotate an ingress: `kubectl annotate ingress/<name> -n <ns> uptime.viktorbarzin.me/external-monitor-path=/api/healthz`
2. Trigger sync early: `kubectl -n uptime-kuma create job --from=cronjob/external-monitor-sync external-monitor-sync-manual`
3. Expected log line:
`Updating monitor [External] <name>: https://host/ -> https://host/api/healthz (codes ['200-299','300-399','400-499'] -> ['200-299'])`
4. Inspect monitor in Uptime Kuma UI: URL and accepted status codes
reflect the annotation.
5. Final summary line includes updated count:
`Sync complete: 0 created, 1 updated, 0 deleted, N unchanged`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Duplicate bug fix
The external-monitor-sync deduped targets by hostname (`host in seen`) but
multiple ingresses can share the same hostname. Changed to dedupe by final
monitor name (`f"{PREFIX}{label}" in seen`) — prevents creating duplicate
[External] monitors on every sync run. This caused 90 duplicates.
## Monitor cleanup
Deleted 118 monitors total:
- 90 duplicate [External] monitors (kept lower ID of each pair)
- 14 paused internal monitors for decommissioned services
- 14 external monitors for non-existent, scaled-down, or non-HTTP services
(xray-vless, complaints, hermes-agent, etc.)
## Opt-outs
Added `uptime.viktorbarzin.me/external-monitor=false` annotation to ingresses
that shouldn't have external HTTP monitors: xray (non-HTTP protocol),
council-complaints, hermes-agent, task-webhook, torrserver, www (no CF DNS).
329 monitors → ~210 monitors. Zero down monitors expected.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The rewrite-body Traefik plugin (both packruler/rewrite-body v1.2.0 and
the-ccsn/traefik-plugin-rewritebody v0.1.3) silently fails on Traefik
v3.6.12 due to Yaegi interpreter issues with ResponseWriter wrapping.
Both plugins load without errors but never inject content.
Removed:
- rewrite-body plugin download (init container) and registration
- strip-accept-encoding middleware (only existed for rewrite-body bug)
- anti-ai-trap-links middleware (used rewrite-body for injection)
- rybbit_site_id variable from ingress_factory and reverse_proxy factory
- rybbit_site_id from 25 service stacks (39 instances)
- Per-service rybbit-analytics middleware CRD resources
Kept:
- compress middleware (entrypoint-level, working correctly)
- ai-bot-block middleware (ForwardAuth to bot-block-proxy)
- anti-ai-headers middleware (X-Robots-Tag: noai, noimageai)
- All CrowdSec, Authentik, rate-limit middleware unchanged
Next: Cloudflare Workers with HTMLRewriter for edge-side injection.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
After the previous commit migrated monitor discovery to per-ingress annotation
(opt-in via `uptime.viktorbarzin.me/external-monitor=true`), coverage expanded
from 13 → 26 monitors but still left ~99 public ingresses uncovered — notably
Helm-managed services (authentik, grafana, vault, forgejo, ntfy) that don't
go through `ingress_factory`, plus any `dns_type = "non-proxied"` ingress
(Immich was a direct victim: `dns_type = "non-proxied"` → no annotation added
→ no monitor → invisible outage).
The user's concern: "I should have known external Immich was down before
users tried to open it."
## This change
Flipped the semantic from opt-in to **opt-out by default**:
- Every ingress whose host ends in `.viktorbarzin.me` gets a `[External] <label>`
monitor automatically
- Only ingresses with annotation `uptime.viktorbarzin.me/external-monitor=false`
are skipped
- Host dedup via a `seen` set (one monitor per hostname, regardless of how
many Ingress resources share it)
## Verification
Triggered a manual CronJob run post-apply:
```
Sync complete: 102 created, 1 deleted, 23 unchanged
```
Coverage jumped from 26 → ~124 external monitors. All 6 Helm-managed services
now have dedicated monitors:
- [External] immich, authentik, forgejo, grafana, ntfy, vault
## Scope
Only `stacks/uptime-kuma/modules/uptime-kuma/main.tf` (Python script in the
CronJob resource). No RBAC or service account changes — the ones added in the
previous commit still cover this path.
## Test plan
### Automated
\`\`\`
\$ kubectl -n uptime-kuma logs -l job-name=manual-sync-optout-1776422993 --tail=50 | grep -iE 'immich|authentik|grafana|forgejo|vault|ntfy'
Creating monitor: [External] authentik -> https://authentik.viktorbarzin.me
Creating monitor: [External] forgejo -> https://forgejo.viktorbarzin.me
Creating monitor: [External] immich -> https://immich.viktorbarzin.me
Creating monitor: [External] grafana -> https://grafana.viktorbarzin.me
Creating monitor: [External] ntfy -> https://ntfy.viktorbarzin.me
Creating monitor: [External] vault -> https://vault.viktorbarzin.me
\`\`\`
### Manual Verification
1. Open `https://uptime.viktorbarzin.me` → confirm `[External] immich` exists
2. Simulate an Immich outage (scale deploy to 0 briefly) → external monitor
should go red within the probe interval (5min); internal monitor stays up
(pod-level from a different probe angle) → `ExternalAccessDivergence`
alert fires after 15 min
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
Two operational gaps surfaced during a healthcheck sweep today:
1. **External monitoring coverage**: Only ~13 hostnames (via `cloudflare_proxied_names`
in `config.tfvars`) had `[External]` monitors in Uptime Kuma. Any service deployed via
`ingress_factory` with `dns_type = "proxied"` auto-created its DNS record but was NOT
registered for external probing — so outages like Immich going down externally were
invisible until a user complained. 99 of ~125 public ingresses had no external
monitor.
2. **actualbudget stack unplannable**: `count = var.budget_encryption_password != null
? 1 : 0` in `factory/main.tf:152` failed with "Invalid count argument" because the
value flows from a `data.kubernetes_secret` whose contents are `(known after apply)`
at plan time. Blocked CI applies and drift reconciliation.
## This change
### Per-ingress external-monitor annotation (ingress_factory + reverse_proxy/factory)
- New variables `external_monitor` (bool, nullable) + `external_monitor_name` (string,
nullable). Default is "follow dns_type" — enabled for any public DNS record
(`dns_type != "none"`, covers both proxied and non-proxied so Immich and other
direct-A records are also monitored).
- Emits two annotations on the Ingress:
- `uptime.viktorbarzin.me/external-monitor = "true"`
- `uptime.viktorbarzin.me/external-monitor-name = "<label>"` (optional override)
### external-monitor-sync CronJob (uptime-kuma stack)
- Discovers targets from live Ingress objects via the K8s API first (filter by
annotation), falls back to the legacy `external-monitor-targets` ConfigMap on any
API error (zero rollout risk).
- New `ServiceAccount` + cluster-wide `ClusterRole`/`ClusterRoleBinding` giving
`list`/`get` on `networking.k8s.io/ingresses`.
- `API_SERVER` now uses the `KUBERNETES_SERVICE_HOST` env var (always injected by K8s)
instead of `kubernetes.default.svc` — the search-domain expansion failed in the
CronJob pod's DNS config. Verified working: CronJob now logs
`Loaded N external monitor targets (source=k8s-api)`.
### actualbudget count-on-unknown refactor
- Replaced `count = var.budget_encryption_password != null ? 1 : 0` with two explicit
plan-time booleans: `enable_http_api` and `enable_bank_sync`. Values are known at
plan; no `-target` workaround needed.
- Callers (`stacks/actualbudget/main.tf`) pass `true` explicitly. Runtime behaviour is
unchanged — the secret is still consumed via env var.
- Also aligned the factory with live state (the 3 budget-* PVCs had been migrated
`proxmox-lvm` → `proxmox-lvm-encrypted` outside Terraform): PVC resource renamed
`data_proxmox` → `data_encrypted`, storage class updated, orphaned `nfs_data` module
removed. State was rm'd + re-imported with matching UIDs, so no data was moved.
## Rollout status (already partially applied in this session)
- `stacks/uptime-kuma` applied — SA + RBAC + CronJob changes live; FQDN fix verified
- `stacks/actualbudget` applied — budget-{viktor,anca,emo} all 200 OK externally
- `stacks/mailserver` + 21 other ingress_factory consumers applied — annotations live
- CronJob `external-monitor-sync` latest run: `source=k8s-api`, 26 monitors active
(was 13 on the central list)
## Deferred (separate work)
- 4 stacks show pre-existing DESTRUCTIVE drift in plan (metallb namespace, claude-memory,
rbac, redis) — NOT triggered by this commit but will be by CI's global-file cascade.
`[ci skip]` here so those don't auto-apply; they will be fixed manually before the
next CI push.
- Cleanup of `cloudflare_proxied_names` list once Helm-managed ingresses (authentik,
grafana, vault, forgejo) are annotated — separate PR.
## Test plan
### Automated
\`\`\`
\$ kubectl -n uptime-kuma logs \$(kubectl -n uptime-kuma get pods -l job-name -o name | tail -1)
Loaded 26 external monitor targets (source=k8s-api)
Sync complete: 7 created, 0 deleted, 17 unchanged
\$ curl -sk -o /dev/null -w "%{http_code}\n" -H "Accept: text/html" \\
https://dawarich.viktorbarzin.me/https://nextcloud.viktorbarzin.me/ \\
https://budget-viktor.viktorbarzin.me/
200 302 200
\$ kubectl -n actualbudget get deploy,pvc -l app=budget-viktor
deployment.apps/budget-viktor 1/1 1 1 Ready
persistentvolumeclaim/budget-viktor-data-encrypted Bound 10Gi RWO proxmox-lvm-encrypted
\`\`\`
### Manual Verification
1. Confirm the annotation is present on an ingress_factory ingress:
\`\`\`
kubectl -n dawarich get ingress dawarich -o \\
jsonpath='{.metadata.annotations.uptime\.viktorbarzin\.me/external-monitor}'
# Expected: "true"
\`\`\`
2. Confirm the new `[External] <name>` monitor appears in Uptime Kuma within 10 min
(CronJob interval). For Immich specifically, it will appear after the immich stack
is re-applied.
3. Verify actualbudget plan is clean:
\`\`\`
cd stacks/actualbudget && scripts/tg plan --non-interactive
# Expected: no "Invalid count argument" errors
\`\`\`
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two-tier state architecture:
- Tier 0 (infra, platform, cnpg, vault, dbaas, external-secrets): local
state with SOPS encryption in git — unchanged, required for bootstrap.
- Tier 1 (105 app stacks): PostgreSQL backend on CNPG cluster at
10.0.20.200:5432/terraform_state with native pg_advisory_lock.
Motivation: multi-operator friction (every workstation needed SOPS + age +
git-crypt), bootstrap complexity for new operators, and headless agents/CI
needing the full encryption toolchain just to read state.
Changes:
- terragrunt.hcl: conditional backend (local vs pg) based on tier0 list
- scripts/tg: tier detection, auto-fetch PG creds from Vault for Tier 1,
skip SOPS and Vault KV locking for Tier 1 stacks
- scripts/state-sync: tier-aware encrypt/decrypt (skips Tier 1)
- scripts/migrate-state-to-pg: one-shot migration script (idempotent)
- stacks/vault/main.tf: pg-terraform-state static role + K8s auth role
for claude-agent namespace
- stacks/dbaas: terraform_state DB creation + MetalLB LoadBalancer
service on shared IP 10.0.20.200
- Deleted 107 .tfstate.enc files for migrated Tier 1 stacks
- Cleaned up per-stack tiers.tf (now generated by root terragrunt.hcl)
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Context
Deploying new services required manually adding hostnames to
cloudflare_proxied_names/cloudflare_non_proxied_names in config.tfvars —
a separate file from the service stack. This was frequently forgotten,
leaving services unreachable externally.
## This change:
- Add `dns_type` parameter to `ingress_factory` and `reverse_proxy/factory`
modules. Setting `dns_type = "proxied"` or `"non-proxied"` auto-creates
the Cloudflare DNS record (CNAME to tunnel or A/AAAA to public IP).
- Simplify cloudflared tunnel from 100 per-hostname rules to wildcard
`*.viktorbarzin.me → Traefik`. Traefik still handles host-based routing.
- Add global Cloudflare provider via terragrunt.hcl (separate
cloudflare_provider.tf with Vault-sourced API key).
- Migrate 118 hostnames from centralized config.tfvars to per-service
dns_type. 17 hostnames remain centrally managed (Helm ingresses,
special cases).
- Update docs, AGENTS.md, CLAUDE.md, dns.md runbook.
```
BEFORE AFTER
config.tfvars (manual list) stacks/<svc>/main.tf
| module "ingress" {
v dns_type = "proxied"
stacks/cloudflared/ }
for_each = list |
cloudflare_record auto-creates
tunnel per-hostname cloudflare_record + annotation
```
## What is NOT in this change:
- Uptime Kuma monitor migration (still reads from config.tfvars)
- 17 remaining centrally-managed hostnames (Helm, special cases)
- Removal of allow_overwrite (keep until migration confirmed stable)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Increase socket timeout from 30s to 120s (121+ monitors need time to sync)
- Add wait_events=0.2 for reliable login
- Fix accepted_statuscodes format: use 100-increment ranges not arbitrary
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add automatic external HTTPS monitors to Uptime Kuma for ~96 services
exposed via Cloudflare tunnel. A sync CronJob (every 10min) reads from
a Terraform-generated ConfigMap and creates/deletes [External] monitors
to match cloudflare_proxied_names. Status page groups these separately
as "External Reachability" and pushes a divergence metric to Pushgateway
when services are externally down but internally up. Prometheus alert
ExternalAccessDivergence fires after 15min of divergence.
[ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add proxmox-lvm PVCs with pvc-autoresizer annotations for all
SQLite-backed services. Deployments updated to use new block storage
PVCs. Old NFS modules retained for 1-week rollback.
Services: ntfy, freshrss, insta2spotify, actualbudget (x3),
wealthfolio, navidrome (DB only), audiobookshelf config,
headscale, forgejo, uptime-kuma.
Also: set Recreate strategy on ntfy, forgejo, insta2spotify,
wealthfolio (required for RWO volumes).
Phase 3: all 27 platform modules now run as independent stacks.
Platform reduced to empty shell (outputs only) for backward compat
with 72 app stacks that declare dependency "platform".
Fixed technitium cross-module dashboard reference by copying file.
Woodpecker pipeline applies all 27+1 stacks in parallel via loop.
All applied with zero destroys.