When a zone is created against a stale primary IP (e.g. the old primary
pod IP 10.10.36.189 before the technitium-primary ClusterIP service
existed), AXFR refresh keeps failing forever while every other zone on
the same replica refreshes fine from 10.110.37.186. The resync-only
branch didn't touch zone options, so the bad IP was pinned indefinitely.
This surfaced as rpi-sofia.viktorbarzin.lan returning 192.168.1.16
(pre-move) on secondaries while primary had the correct .10 from
2026-04-22 morning — Uptime Kuma Sofia RPI monitor DOWN,
cluster_healthcheck FAIL.
The sync loop now re-applies primaryNameServerAddresses on every run
for existing zones. Idempotent — Technitium accepts identical values
— and self-heals any drift within 30 min. Env renamed PRIMARY_IP →
PRIMARY_HOST for consistency with the reconcile semantics.
The hostname form (technitium-primary.technitium.svc.cluster.local) was
tried but Technitium's own resolver doesn't forward svc.cluster.local,
so the field must stay a literal IP. Terraform tracks the ClusterIP on
every apply and the reconcile loop propagates it to replicas.
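A hedged sketch of what the reconcile now does for each existing zone on a
replica (endpoint and parameter shapes assumed from Technitium's HTTP API;
TOKEN/ZONE and the pod address are illustrative):

    # re-pin the secondary zone's primary on every sync run — idempotent by design
    curl -s "http://$REPLICA_POD_IP:5380/api/zones/options/set?token=$TOKEN&zone=$ZONE&primaryNameServerAddresses=10.110.37.186"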
Main container crashes with "mkdir: cannot create directory '/opt/data':
Permission denied". Init container writes fine but main container runs
with different fsGroup/runAsUser. Scaling to 0 until the PVC permission
model is reworked.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
After k8s-node1 was silently cordoned and broke Frigate camera streams,
existing alerts (NvidiaExporterDown, PodUnschedulable) didn't catch the
root cause proactively. This alert fires within 5m of the GPU node being
cordoned, before any pod restart attempts to schedule and fails.
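A hedged spot-check of the signal the new alert watches
(kube_node_spec_unschedulable is the standard kube-state-metrics gauge; the
real rule's GPU-node scoping may differ):

    # non-empty result = a cordoned node; the rule's `for: 5m` gives the 5-minute window
    curl -sG http://prometheus.monitoring.svc:9090/api/v1/query \
      --data-urlencode 'query=kube_node_spec_unschedulable == 1'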
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove every hardcoded reference to k8s-node1 that pinned GPU
scheduling to a specific host:
- GPU workload nodeSelectors: gpu=true -> nvidia.com/gpu.present=true
(frigate, immich, whisper, piper, ytdlp, ebook2audiobook, audiblez,
audiblez-web, nvidia-exporter, gpu-pod-exporter). The NFD label is
auto-applied by gpu-feature-discovery on any node carrying an
NVIDIA PCI device, so the selector follows the card.
- null_resource.gpu_node_config: rewrite to enumerate NFD-labeled
nodes (feature.node.kubernetes.io/pci-10de.present=true) and taint
each with nvidia.com/gpu=true:PreferNoSchedule. Drop the manual
'kubectl label gpu=true' since NFD handles labeling.
- MySQL anti-affinity: kubernetes.io/hostname NotIn [k8s-node1] ->
nvidia.com/gpu.present NotIn [true]. Same intent (keep MySQL off
the GPU node) but portable when the card relocates.
Net effect: moving the GPU card between nodes no longer requires any
Terraform edit. Verified no-op for current scheduling — both old and
new labels resolve to node1 today.
Docs updated to match: AGENTS.md, compute.md, overview.md,
proxmox-inventory.md, k8s-portal agent-guidance string.
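The by-hand equivalent of what the rewritten null_resource automates (node
name illustrative — today both commands resolve to k8s-node1):

    kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true -o name
    kubectl taint node k8s-node1 nvidia.com/gpu=true:PreferNoSchedule --overwrite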
The 2026-04-22 Vault outage caught kubelet in a chown loop that never
completed: the default fsGroupChangePolicy (Always) walks every file on
the NFS-backed data PVC. With retrans=3,timeo=30 NFS options and a 1GB
audit log, each recursive chown outlasted kubelet's 2-minute deadline and
restarted forever — blocking raft quorum recovery. OnRootMismatch makes chown a
no-op when the volume root is already correct, which it always is after
initial setup.
The breakglass fix was applied live via kubectl patch at 10:54 UTC; this
commit persists it in Terraform so the next apply doesn't revert.
The post-mortem also documents the upstream raft stuck-leader pattern,
NFS kernel client corruption after force-kill, and the path to migrate
Vault off NFS to proxmox-lvm-encrypted.
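The live patch at 10:54 UTC was roughly this shape (statefulset and
namespace names assumed):

    kubectl -n vault patch statefulset vault --type merge -p \
      '{"spec":{"template":{"spec":{"securityContext":{"fsGroupChangePolicy":"OnRootMismatch"}}}}}'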
Threshold was 48h + 30m for a job that runs daily. We don't need
to wait 2.5 days to detect a broken timer — bring it down to 30h
+ 30m (just over a day of cadence + minor drift/retry grace). Also
add a description pointing to the restore runbook so the alert
text surfaces the fix path directly.
Threshold change: 172800s → 108000s. Docs in backup-dr.md synced.
Re-triggers the default.yml apply now that ci/Dockerfile is rebuilt
with the vault CLI — this is the first commit touching a stack that
will actually succeed since the e80b2f02 regression.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- HighPowerUsage: add subsystem:gpu (line 724) + subsystem:r730 (line 775)
labels so the two same-named alerts are distinguishable in routing.
- HeadscaleDown (deployment-replicas flavor, line 1414) → rename to
HeadscaleReplicasMismatch. Line 2039 keeps HeadscaleDown as the real
up-metric critical check. NodeDown inhibit rule updated to suppress
the renamed alert too.
- EmailRoundtripStale (line 1816): `for` 10m → 20m. Survives one missed
20-min probe cycle before firing, cutting the flapping (12 short-burst
fires over the last 24h).
ATSOverload tuning skipped: 24h fire-count is 0; it's continuously
firing, not flapping — the already-known sustained 83% ATS load.
Tuning would not change behavior.
8 backup *NeverSucceeded rules audited: the 7 using
kube_cronjob_status_last_successful_time all target real K8s CronJobs
with active metrics (not Pushgateway-sourced), and
PrometheusBackupNeverRun already uses absent() correctly. No fixes needed.
Registry integrity probe surfaced 38 broken manifest references
(34 unique repo:tag pairs, same OCI-index orphan pattern as the 04-19
infra-ci incident). Deleted all via registry HTTP API + ran GC;
reclaimed ~3GB blob storage.
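The deletion pattern, per tag (registry host, repo, and container name
illustrative; standard Registry v2 API):

    # resolve tag → digest, delete by digest, then GC inside the registry container
    DIGEST=$(curl -sI -H 'Accept: application/vnd.oci.image.index.v1+json' \
      "https://registry.example/v2/freedify/manifests/latest" \
      | awk 'tolower($1)=="docker-content-digest:"{print $2}' | tr -d '\r')
    curl -sX DELETE "https://registry.example/v2/freedify/manifests/$DIGEST"
    docker exec registry registry garbage-collect /etc/docker/registry/config.yml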
beads-server CronJobs were stuck ImagePullBackOff on
claude-agent-service:0c24c9b6 for >6h — bumped variable default to
2fd7670d (canonical tag in claude-agent-service stack, already healthy
in registry) so new ticks can fire.
Rebuilt in-use broken tags: freedify:{latest,c803de02} and
beadboard:{17a38e43,latest} on registry VM; priority-pass via
Woodpecker pipeline #8. wealthfolio-sync:latest deferred (monthly
CronJob, next run 2026-05-01).
Probe now reports 0/39 failures. RegistryManifestIntegrityFailure
alert cleared.
Closes: code-8hk
Closes: code-jh3c
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Primary has only the Primary-type zones it owns (10). Replicas have those
+ built-in zones (localhost, in-addr.arpa reverse, etc.), so their count
(14) can never match primary. The alert expr compared max-min across all
instances, so it fired chronically.
Fix: instance!="primary" filter. The real signal this alert wants is
"did one replica drift from the others" — replica-to-replica comparison
captures that; primary was never comparable.
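Shape of the fixed expression (gauge name assumed from the zone-sync
Pushgateway metrics):

    curl -sG http://prometheus:9090/api/v1/query --data-urlencode \
      'query=max(technitium_zone_count{instance!="primary"}) - min(technitium_zone_count{instance!="primary"}) > 0'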
Both alerts fired as expected noise while the MAM account is in new-member
Mouse class — tracker refuses announces and the 72h seed-gate can't be met
until ratio recovers. Keeping the rest of the MAM rules (cookie expiry,
ratio, farming/janitor stalls, qbt disconnect) which still signal real
pipeline failures.
Firing count drops from 7 → 3 in healthcheck.
"Monthly cash flow — tax impact (RSU excluded)" was already stacking
group A in normal mode but rendered as 70%-opacity filled lines — the
overlap made the total-per-month figure visually inaccessible.
Switch drawStyle to bars (100% fill, lineWidth 0, no per-point
markers) so each month reads as a single stacked bar whose top edge is
the total cash-side deduction. Add "sum" to legend.calcs so the
tax-year totals per series show in the legend table alongside last and
max.
Panel 11 (Tax & pension — monthly, RSU-inclusive) retains the line/
area style so the two panels remain visually distinct.
Activates the Schwab/InvestEngine IMAP ingest CronJob that's been
scaffolded-but-suspended since Phase 2 of broker-sync, now that the
Schwab parser can detect vest-confirmation emails. Runs nightly 02:30 UK.
Current behaviour once deployed:
- Trade confirmations (Schwab sell-to-cover, InvestEngine orders) →
Activity rows posted to Wealthfolio. Unchanged.
- Release Confirmations (Schwab RSU vests) → parser returns gross-vest
BUY + sell-to-cover SELL Activities (to Wealthfolio) and a VestEvent
object (NOT YET persisted — Postgres sink + DB grant pending; see
follow-up under code-860). Vest detection uses a subject/body
heuristic that will need tightening against a real email fixture.
Adds Panel 15 to the UK payslip dashboard: per-vest-month join of
payslip.rsu_vest vs rsu_vest_events (gross_value_gbp, tax_withheld_gbp)
with delta columns. Tax-delta-percent coloured green/orange/red at
0/2%/5% thresholds. Table is empty until broker-sync starts persisting
VestEvents — harmless until then.
Before applying:
- Verify IMAP creds in Vault (secret/broker-sync: imap_host,
imap_user, imap_password, imap_directory) are still valid.
- Empty vest-event table is expected; delta columns show NULL until
the postgres sink lands.
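A quick way to run the first check (assumes the KV v2 mount):

    vault kv get -format=json secret/broker-sync | jq -r '.data.data | keys[]'
    # expect: imap_directory imap_host imap_password imap_user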
Part of: code-860
Wires the daily ActualBudget deposit sync from the payslip-ingest app into
K8s as a CronJob, and adds dashboard Panel 14 to overlay bank deposits
against payslip net_pay.
CronJob: actualbudget-payroll-sync in payslip-ingest namespace, runs
02:00 UTC. Calls `python -m payslip_ingest sync-meta-deposits`, which
hits budget-http-api-viktor in the actualbudget namespace and upserts
matching Meta payroll deposits into payslip_ingest.external_meta_deposits.
ExternalSecret extended with three new Vault keys:
- ACTUALBUDGET_API_KEY (same as actualbudget-http-api-viktor's env API_KEY)
- ACTUALBUDGET_ENCRYPTION_PASSWORD (Viktor's budget password)
- ACTUALBUDGET_BUDGET_SYNC_ID (Viktor's sync_id)
These must be seeded at secret/payslip-ingest in Vault before the
CronJob will run — it'll CrashLoop on missing env vars otherwise. First
run can be triggered on demand via `kubectl -n payslip-ingest create
job --from=cronjob/actualbudget-payroll-sync initial-sync`.
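Seeding sketch (placeholder values; KV v2 mount assumed):

    vault kv patch secret/payslip-ingest \
      ACTUALBUDGET_API_KEY='<api-key>' \
      ACTUALBUDGET_ENCRYPTION_PASSWORD='<budget-password>' \
      ACTUALBUDGET_BUDGET_SYNC_ID='<sync-id>'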
Panel 14 plots monthly SUM(external_meta_deposits.amount) vs
SUM(payslip.net_pay), plus a delta bar series — |delta| > £50 flags
likely parser drift on net_pay.
Part of: code-860
Phase A of RSU tax spike fix. Two changes:
1. Panel 2 "Monthly cash flow (RSU stripped)" plotted raw income_tax despite
the title. Switch to COALESCE(cash_income_tax, income_tax) so the chart
is honest once the Phase B back-fill populates cash_income_tax on
variant-A slips. For slips where cash_income_tax is already populated
(variant B, 2024+) the spike is removed immediately.
2. Panel 4 "Data integrity" now surfaces rows where cash_income_tax is NULL
on vest months (rsu_vest > 0). New status value NULL_CASH_TAX (orange)
highlights the remaining back-fill population — expected to drop to 0
after Phase B lands.
Part of: code-860
Locally-built image pushed to registry.viktorbarzin.me/job-hunter:9c42eac9
after Woodpecker v3.13 Forgejo webhook parsing bug left CI unable to build
the initial image (server/forge/forgejo/helper.go:57 nil pointer panic on
parse — repaired webhooks still don't trigger pipelines).
Unblocks code-97n (TF apply) without waiting for CI recovery.
Refs: code-snp, code-0c6
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Vertical orange markers at every vest month added more visual noise
than signal. Panel 13 (cash-only) already conveys the "no spike on
vest months" story without needing markers across panels 1/2/3/7/11/12.
Panel 11 stacks RSU-attributed income tax on top of cash PAYE, which
is mathematically correct but emotionally misleading since RSU tax is
withheld at source via sell-to-cover and never hits the bank. Adopts
the two-view convention: Panel 11 keeps the full PAYE picture; new
Panel 13 shows cash-only deductions. Dashboard-level "RSU vests"
annotation paints orange markers on every vest month across all
timeseries panels, with tooltips like "RSU vest: £31232 gross /
£15257 tax withheld".
Shifts Panels 4/5/6/8/9/10 down by 9 rows to make room for Panel 13
at y=29.
Replaces dataFrom.extract with per-key `data` entries so the Secret
keys in K8s (and therefore env vars in the pod) are always UPPERCASE:
WEBHOOK_BEARER_TOKEN, CDIO_API_KEY, SMTP_USERNAME, SMTP_PASSWORD,
DIGEST_TO_ADDRESS, DIGEST_FROM_ADDRESS. Vault KV keys at
secret/job-hunter stay lowercase (webhook_bearer_token etc.).
Refs: code-snp
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New service stack at stacks/job-hunter/ mirroring the payslip-ingest
pattern: per-service CNPG database + role (via dbaas null_resource),
Vault static role pg-job-hunter (7d rotation), ExternalSecrets for app
secrets and DB creds, Deployment with alembic-migrate init container,
ClusterIP Service, Grafana datasource ConfigMap.
Grafana dashboard job-hunter.json in Finance folder: new roles per
day, source breakdown, top companies, GBP salary distribution, recent
roles table (sorted by parse confidence then salary).
n8n weekly-digest workflow calls POST /digest/generate with bearer
auth every Monday 07:00 London; digest_runs table provides
idempotency.
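What the workflow sends each Monday (in-cluster URL and token variable are
illustrative):

    curl -sX POST http://job-hunter.job-hunter.svc.cluster.local/digest/generate \
      -H "Authorization: Bearer $WEBHOOK_BEARER_TOKEN"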
Refs: code-snp
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second identical registry incident on 2026-04-19 (first 2026-04-13): the
infra-ci:latest image index resolved to child manifests whose blobs had been
garbage-collected out from under the index. Pipelines P366→P376 all exited
126 "image can't be pulled". Hot fix (a05d63e / 6371e75 / c113be4) restored
green CI but left the underlying bug unaddressed.
Root cause: cleanup-tags.sh rmtrees tag dirs on the registry VM daily at
02:00, and registry:2's GC (Sunday 03:25) walks OCI index children
imperfectly (the distribution/distribution#3324 class of bug). Nothing
verified pushes end-to-end; nothing probed the registry for fetchability;
nothing caught orphan indexes.
Phase 1 — Detection:
- .woodpecker/build-ci-image.yml: after build-and-push, a verify-integrity
step walks the just-pushed manifest (index + children + config + every
layer blob) via HEAD and fails the pipeline on any non-200. Catches
broken pushes at the source.
- stacks/monitoring: new registry-integrity-probe CronJob (every 15m) and
three alerts — RegistryManifestIntegrityFailure,
RegistryIntegrityProbeStale, RegistryCatalogInaccessible — closing the
"registry serves 404 for a tag that exists" gap that masked the incident
for 2+ hours.
- docs/post-mortems/2026-04-19-registry-orphan-index.md: root cause,
timeline, monitoring gaps, permanent fix.
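Rough shape of the verify-integrity HEAD-walk above (repo and digest
variable illustrative; the real step derives digests from the manifest JSON):

    REG=https://registry.viktorbarzin.me
    curl -sfI -H 'Accept: application/vnd.oci.image.index.v1+json' \
      "$REG/v2/infra-ci/manifests/latest" >/dev/null || exit 1
    # ...then HEAD each child manifest, its config blob, and every layer:
    curl -sfI "$REG/v2/infra-ci/blobs/$LAYER_DIGEST" >/dev/null || exit 1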
Phase 2 — Prevention:
- modules/docker-registry/docker-compose.yml: pin registry:2 → registry:2.8.3
across all six registry services. Removes the floating-tag footgun.
- modules/docker-registry/fix-broken-blobs.sh: new scan walks every
_manifests/revisions/sha256/<digest> that is an image index and logs a
loud WARNING when a referenced child blob is missing. Does NOT auto-
delete — deleting a published image is a conscious decision. Layer-link
scan preserved.
Phase 3 — Recovery:
- build-ci-image.yml: accept `manual` event so Woodpecker API/UI rebuilds
don't need a cosmetic Dockerfile edit (matches convention from
pve-nfs-exports-sync.yml).
- docs/runbooks/registry-rebuild-image.md: exact command sequence for
diagnosing + rebuilding after an orphan-index incident, plus a fallback
for building directly on the registry VM if Woodpecker itself is down.
- docs/runbooks/registry-vm.md + .claude/reference/service-catalog.md:
cross-references to the new runbook.
Out of scope (verified healthy or intentionally deferred):
- Pull-through DockerHub/GHCR mirrors (74.5% hit rate, no 404s).
- Registry HA/replication (single-VM SPOF is a known architectural
choice; Synology offsite covers RPO < 1 day).
- Diun exclude for registry:2 — not applicable; Diun only watches
k8s (DIUN_PROVIDERS_KUBERNETES=true), not the VM's docker-compose.
Verified locally:
- fix-broken-blobs.sh --dry-run on a synthetic registry directory correctly
flags both orphan layer links and orphan OCI-index children.
- terraform fmt + validate on stacks/monitoring: success (only unrelated
deprecation warnings).
- python3 yaml.safe_load on .woodpecker/build-ci-image.yml and
modules/docker-registry/docker-compose.yml: both parse clean.
Closes: code-4b8
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TrueNAS VM 9000 at 10.0.10.15 was operationally decommissioned 2026-04-13.
The subagent-driven doc sweep in 5a0b24f5 covered the prose. This commit
removes the remaining in-code references:
- reverse-proxy: drop truenas Traefik ingress + Cloudflare record
(truenas.viktorbarzin.me was 502-ing since the VM stopped), drop
truenas_homepage_token variable.
- config.tfvars: drop deprecated `truenas IN A 10.0.10.15`, `iscsi CNAME
truenas`, and the commented-out `iscsi`/`zabbix` A records.
- dashy/conf.yml: remove Truenas dashboard entry (&ref_28).
- monitoring/loki.yaml: change storageClass from the decommissioned
`iscsi-truenas` to `proxmox-lvm` so a future re-enable has a valid SC
(Loki is currently disabled).
- actualbudget/main.tf + freedify/main.tf: update new-deployment
docstrings to cite Proxmox host NFS instead of TrueNAS.
- nfs-csi: add an explanatory comment to the `nfs-truenas` StorageClass
noting the name is historical — 48 bound PVs reference it, SC names
are immutable on PVs, rename not worth the churn.
Also cleaned out-of-band:
- Technitium DNS: deleted `truenas.viktorbarzin.lan` A and
`iscsi.viktorbarzin.lan` CNAME records.
- Vault: `secret/viktor` → removed `truenas_api_key` and
`truenas_ssh_private_key`; `secret/platform.homepage_credentials.reverse_proxy.truenas_token` removed.
- Terraform-applied: `scripts/tg apply -target=module.reverse-proxy.module.truenas`
destroyed the 3 K8s/Cloudflare resources cleanly.
Deferred:
- VM 9000 is still stopped on PVE. Deletion (destructive) awaits explicit
user go-ahead.
- `nfs-truenas` StorageClass name retained (see nfs-csi comment above).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TrueNAS VM 9000 was operationally decommissioned 2026-04-13; NFS has been
served by Proxmox host (192.168.1.127) since. This commit scrubs remaining
references from active docs. VM 9000 itself remains on PVE in stopped state
pending user decision on deletion.
In-session cleanup already landed: reverse-proxy ingress + Cloudflare record
removed; Technitium DNS records deleted; Vault truenas_{api_key,ssh_private_key}
purged; homepage_credentials.reverse_proxy.truenas_token removed;
truenas_homepage_token variable + module deleted; Loki + Dashy cleaned;
config.tfvars deprecated DNS lines removed; historical-name comment added to
the nfs-truenas StorageClass (48 bound PVs, immutable name — kept).
Historical records (docs/plans/, docs/post-mortems/, .planning/) intentionally
untouched — they describe state at a point in time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New Panel 11 (monthly) + Panel 12 (YTD cumulative), side-by-side at
y=19. Six series each: cash income tax, RSU-attributed income tax, NI,
student loan, employee pension, employer pension. Employer pension
included to show full retirement contribution picture (paid on top of
salary, not deducted from take-home). Downstream panels shifted down
by 10.
Bringing the 2026-04-19 rework to its end-state. Cutover soaked for ~1h
with 0 alerts firing and 127 ops/sec on the v2 master — skipped the
nominal 24h rollback window per user direction.
- Removed `helm_release.redis` (Bitnami chart v25.3.2) from TF. Helm
destroy cleaned up the StatefulSet redis-node (already scaled to 0),
ConfigMaps, ServiceAccount, RBAC, and the deprecated `redis` + `redis-headless`
ClusterIP services that the chart owned.
- Removed `null_resource.patch_redis_service` — the kubectl-patch hack
that worked around the Bitnami chart's broken service selector. No
Helm chart, no patch needed.
- Removed the dead `depends_on = [helm_release.redis]` from the HAProxy
deployment.
- `kubectl delete pvc -n redis redis-data-redis-node-{0,1}` for the two
orphan PVCs the StatefulSet template left behind (K8s doesn't cascade-delete).
- Simplified the top-of-file comment and the redis-v2 architecture
comment — they talked about the parallel-cluster migration state that
no longer exists. Folded in the sentinel hostname gotcha, the redis
8.x image requirement, and the BGSAVE+AOF-rewrite memory reasoning
so the rationale survives in the code rather than only in beads.
- `RedisDown` alert no longer matches `redis-node|redis-v2` — just
`redis-v2` since that's the only StatefulSet now. Kept the `or on()
vector(0)` so the alert fires when kube_state_metrics has no sample
(e.g. after accidental delete).
- `docs/architecture/databases.md` trimmed: no more "pending TF removal"
or "cold rollback for 24h" language.
Verification after apply:
- kubectl get all -n redis: redis-v2-{0,1,2} (3/3 Running) + redis-haproxy-*
(3 pods, PDB minAvailable=2). Services: redis-master + redis-v2-headless only.
- PVCs: data-redis-v2-{0,1,2} only (redis-data-redis-node-* deleted).
- Sentinel: all 3 agree mymaster = redis-v2-0 hostname.
- HAProxy: PING PONG, DBSIZE 92, 127 ops/sec on master.
- Prometheus: 0 firing redis alerts.
Closes: code-v2b
Closes: code-2mw
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 3 — replication chain (old → v2):
- Discovered the v2 cluster was running redis:7.4-alpine, but the
Bitnami old master ships redis 8.6.2 which writes RDB format 13 —
the 7.4 replicas rejected the stream with "Can't handle RDB format
version 13". Bumped v2 image to redis:8-alpine (also 8.6.2) to
restore PSYNC compatibility.
- Discovered that sentinel on BOTH v2 and old Bitnami clusters
auto-discovered the cross-cluster replication chain when v2-0
REPLICAOF'd the old master, triggering a failover that reparented
old-master to a v2 replica and took HAProxy's backend offline.
Mitigation: `SENTINEL REMOVE mymaster` on all 5 sentinels (both
clusters) during the REPLICAOF surgery, then re-MONITOR after
cutover. This must be done on the OLD sentinels too, not just v2 —
they're the ones that kept fighting our REPLICAOF.
- Set up the chain: v2-0 REPLICAOF old-master; v2-{1,2} REPLICAOF v2-0.
All keys synced (db0:76, db1:22, db4:16), including `immich_bull:*`
BullMQ queues and `_kombu.*` Celery queues — the user-stated
must-survive data class.
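The sentinel surgery as commands (pod/container names illustrative; 26379
is the standard sentinel port):

    for pod in redis-node-0 redis-node-1 redis-v2-0 redis-v2-1 redis-v2-2; do
      kubectl -n redis exec "$pod" -- redis-cli -p 26379 SENTINEL REMOVE mymaster
    done
    OLD_MASTER=redis-node-0.redis-headless   # whichever old pod holds role:master
    kubectl -n redis exec redis-v2-0 -- redis-cli REPLICAOF "$OLD_MASTER" 6379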
Phase 4 — HAProxy cutover:
- Updated `kubernetes_config_map.haproxy` to point at
`redis-v2-{0,1,2}.redis-v2-headless` for both redis_master and
redis_sentinel backends (removed redis-node-{0,1}).
- Promoted v2-0 (`REPLICAOF NO ONE`) at the same time as the
ConfigMap apply so HAProxy's 1s health-check interval found a
role:master within a few seconds. Cutover disruption on HAProxy
rollout was brief; old clients naturally moved to new HAProxy pods
within the rolling update window.
- Re-enabled sentinel monitoring on v2 with `SENTINEL MONITOR
mymaster <hostname> 6379 2` after verifying `resolve-hostnames yes`
+ `announce-hostnames yes` were active — this ensures sentinel
stores the hostname (not resolved IP) in its rewritten config, so
pod-IP churn on restart doesn't break failover.
Phase 5 — chaos:
- Round 1: killed master v2-0 mid-probe. First run exposed the
sentinel IP-storage issue (stored 10.10.107.222, went stale on
restart) — ~12s probe disruption. Fixed hostname persistence and
re-MONITORed.
- Round 2: killed new master v2-2 with hostnames correctly stored.
Sentinel elected v2-0, HAProxy re-routed, 1/40 probe failures over
60s — within the <3s target for user-visible disruption.
Phase 6 — Nextcloud simplification:
- `zzz-redis.config.php` no longer queries sentinel in-process —
just points at `redis-master.redis.svc.cluster.local`. Removed 20
lines of PHP. HAProxy handles master tracking transparently now
that it's scaled to 3 + PDB minAvailable=2.
Phase 7 step 1:
- `kubectl scale statefulset/redis-node --replicas=0` (transient —
TF removal in a 24h follow-up). Old PVCs `redis-data-redis-node-{0,1}`
preserved as cold rollback.
Docs:
- Rewrote `databases.md` Redis section to reflect post-cutover reality
and the sentinel hostname gotcha (so future sessions don't relearn it).
- `.claude/reference/service-catalog.md` entry updated.
The parallel-bootstrap race documented in the previous commit is still
worth watching — the init container now defaults to pod-0 as master
when no peer reports role:master-with-slaves, so fresh boots land in
a deterministic topology.
Closes: code-7n4
Closes: code-9y6
Closes: code-cnf
Closes: code-tc4
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns the stack with the repo HEAD carrying migration 0004
(cash_income_tax + ytd_rsu_* columns), migration 0005 (p60_reference
table), the bonus-dedup logic, and the Woodpecker path-filter fix.
Applied + verified:
- pod rolled out with the new image, Alembic ran 0003→0004→0005
- cash_income_tax backfilled on 71/71 existing rows
- dashboard Panel 7 YTD split query returns real numbers
- no existing (tax_year, bonus) duplicates found — guard ships for future
Closes: code-7z0
## Context
After the Mouse-class unblock on 2026-04-19, end-to-end testing of the
grabber revealed three issues with the plan's original filter values:
1. **`SEEDER_CEILING=50` rejects ~99% of MAM's catalogue.** MAM is a
well-seeded private tracker — 100-700 seeders per torrent is normal.
A ceiling of 50 makes the filter too tight: across 140 FL torrents
sampled in one loop, only 0-1 matched. The intent ("avoid oversupplied
swarms") is still valid; the threshold was wrong for MAM's shape.
2. **`RATIO_FLOOR=1.2` was sized for Mouse-class defence and is now
over-tight.** Its job is preventing the death spiral where Mouse-class
accounts can't announce, so any grab deepens the ratio hole. Once
class > Mouse, MAM serves peer lists normally and demand-first
filtering (`leechers>=1`) keeps new grabs upload-positive on average.
With ratio sitting at 0.7 post-recovery (we over-downloaded while
unblocking), 1.2 was preventing the very grabs that would earn us
back to healthy ratio.
3. **`parse_size` crashed on `"1,002.9 MiB"`.** MAM's pretty-printed
sizes use thousands separators; `float("1,002.9")` raises
`ValueError`. Every grabber run that hit a ≥1000-MiB candidate on
the page crashed with a traceback instead of skipping the candidate.
## This change
- `SEEDER_CEILING`: 50 → 200 — live catalogue evidence showed 50 was
rejecting viable demand-first candidates like `Zen and the Art of
Motorcycle Maintenance` (S=156, L=1, score=125).
- `RATIO_FLOOR`: 1.2 → 0.5 — still a tripwire for catastrophic dips,
but no longer a steady-state block. Class == Mouse remains an
absolute skip (separate branch).
- `parse_size`: `s.replace(",", "").split()` before the float parse.
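The crash and the fix, demonstrable in one line each (fix #3 above):

    python3 -c 'print(float("1,002.9"))'                   # ValueError — old behaviour
    python3 -c 'print(float("1,002.9".replace(",", "")))'  # 1002.9 — post-fix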
## Verified post-change
Manual grabber loop (5 runs at random offsets) after applying:
run=1 parse_size crash on "1,002.9" (this crash motivated fix#3)
run=2 GRABBED 3 torrents:
Dean and Me: A Love Story (240.7 MiB, S:18, L:1) score=194
Digital Nature Photography (83.7 MiB, S:42, L:1) score=182
Zen and the Art of Motorcycle (830.3 MiB, S:156, L:1) score=125
run=3-5 grabbed=0 at offsets that landed on pages with no matches
(expected — MAM returns 20/page, many offsets yield nothing)
MAM profile: class=User, ratio=0.7 (recovering from the Mouse unblock),
BP=24,053. 28 mam-farming torrents in forcedUP state, actively uploading
~8 MiB to MAM this session across 2 of the Maxximized comic issues.
## What is NOT in this change
- No alert threshold changes — `MAMRatioBelowOne` (24h) and `MAMMouseClass`
(1h) already handle the "going back to Mouse" case; lowering the floor
on the grabber doesn't change alerting.
- No janitor changes — the janitor rules are H&R-based and independent
of ratio/class state.
## Test plan
### Automated
$ cd infra/stacks/servarr && ../../scripts/tg apply --non-interactive
Apply complete! Resources: 0 added, 2 changed, 0 destroyed.
$ python3 -c 'import ast; ast.parse(open(
"infra/stacks/servarr/mam-farming/files/freeleech-grabber.py").read())'
### Manual Verification
1. Trigger the grabber and confirm it doesn't skip-for-ratio at ratio 0.7:
$ kubectl -n servarr create job --from=cronjob/mam-freeleech-grabber g1
$ kubectl -n servarr logs job/g1 | head -5
Profile: ratio=0.7 class=User | Farming: 33, 2.0 GiB, tracked IDs: 4
Search offset=<random>, found=1323, page_results=20
Added (score=...) ...
2. Repeat 3-5× at different random offsets. Over the course of a 30-min
cron cadence, expect 2-5 grabs across the day given MAM's catalogue
churn and our filter intersection.
## Reproduce locally
cd infra/stacks/servarr
../../scripts/tg plan # expect: 0 to add, 2 to change (configmap + cronjob)
../../scripts/tg apply --non-interactive
kubectl -n servarr create job --from=cronjob/mam-freeleech-grabber g1
kubectl -n servarr logs job/g1
Follow-up: `bd close code-qfs` already completed in the parent commit;
this is a post-shipping tune, no beads action needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds per-node DNS cache that transparently intercepts pod queries on
10.96.0.10 (kube-dns ClusterIP) AND 169.254.20.10 (link-local) via
hostNetwork + NET_ADMIN iptables NOTRACK rules. Pods keep using their
existing /etc/resolv.conf (nameserver 10.96.0.10) unchanged — no kubelet
rollout needed for transparent mode.
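Shape of the interception rules node-cache installs on each host (sketch —
the real rule set also covers TCP and the OUTPUT chain):

    iptables -t raw -A PREROUTING -d 10.96.0.10/32    -p udp --dport 53 -j NOTRACK
    iptables -t raw -A PREROUTING -d 169.254.20.10/32 -p udp --dport 53 -j NOTRACK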
Layout mirrors existing stacks (technitium, descheduler, kured):
stacks/nodelocal-dns/
main.tf # module wiring + IP params
modules/nodelocal-dns/main.tf # SA, Services, ConfigMap, DS
Key decisions:
- Image: registry.k8s.io/dns/k8s-dns-node-cache:1.23.1
- Co-listens on 169.254.20.10 + 10.96.0.10 (transparent interception)
- Upstream path: kube-dns-upstream (new headless svc) → CoreDNS pods
(separate ClusterIP avoids cache looping back through itself)
- viktorbarzin.lan zone forwards directly to Technitium ClusterIP
(10.96.0.53), bypassing CoreDNS for internal names
- priorityClassName: system-node-critical
- tolerations: operator=Exists (runs on master + all tainted nodes)
- No CPU limit (cluster-wide policy); mem requests=32Mi, limit=128Mi
- Kyverno dns_config drift suppressed on the DaemonSet
- Kubelet clusterDNS NOT changed — transparent mode is sufficient;
rolling 5 nodes just to switch to 169.254.20.10 adds no benefit and
would expand the blast radius for no reason.
Verified:
- DaemonSet 5/5 Ready across k8s-master + 4 workers
- dig @169.254.20.10 idrac.viktorbarzin.lan -> 192.168.1.4
- dig @169.254.20.10 github.com -> 140.82.121.3
- Deleted all 3 CoreDNS pods; cached queries still resolved via
NodeLocal DNSCache (resilience confirmed)
Docs: architecture/dns.md — adds NodeLocal DNSCache to Components table,
graph diagram, stacks table; rewrites pod DNS resolution paths to show
the cache layer; adds troubleshooting entry.
Closes: code-2k6
Zone-count parity meant hitting /api/zones/list, which requires auth. The
null_resource has no access to the Technitium admin password (it's declared
`sensitive = true` on the module variable), so we were probing with an empty
token and getting 200 OK with an error JSON — silently returning 0 zones for
every instance.
Replaced the HTTP probe with a second DNS check: dig idrac.viktorbarzin.lan
on each pod, require the same A record from all three. This catches both
"zone not loaded on an instance" and "zone drift between primary and
replicas" without needing any HTTP client or credentials. The AXFR chain
guarantees all three should converge on the same value.
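The probe, roughly (namespace and pod selection illustrative):

    for pod in $(kubectl -n technitium get pods -o name); do
      kubectl -n technitium exec "${pod#pod/}" -- dig +short idrac.viktorbarzin.lan
    done | sort -u
    # gate passes only if exactly one A record comes back across all three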
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Panel 7 (YTD uses): replace the single `ytd_income_tax` stack segment
with two — `ytd_cash_income_tax` (full red, same color as before) and
`ytd_rsu_income_tax` (desaturated orange) — computed from the new
`cash_income_tax` column on payslip. RSU-vest months now visually
separate the cash tax from the PAYE attributable to the grossed-up
RSU, matching user mental model of "what I actually paid in cash tax".
Panel 8 (Sankey): split the single `Gross → Income Tax` edge into two
edges (`Gross → Income Tax (cash)` and `Gross → Income Tax (RSU)`)
sourcing the same two figures.
Panel 3 (effective rate): left untouched — it's the "all-in" rate and
keeps using raw `income_tax`.
Panel 9 (P60 reconciliation — new): per-tax-year table comparing HMRC
P60 annual figures against SUM(payslip) via LATERAL JOIN on
payslip_ingest.p60_reference. Threshold-coloured delta columns (|Δ|<1
green, 1-50 yellow, >50 red) surface missing months or parser drift.
Panel 10 (HMRC Tax Year Reconciliation — new): placeholder for the
hmrc-sync service (code scaffolded, awaiting HMRC prod approval to
activate). Queries `hmrc_sync.tax_year_snapshot`; renders empty until
that schema lands. Delta > £10 → red.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The zone-count parity check was trivially passing when the ephemeral
curl pod failed to reach the Technitium web API: all three counts came
back as 0, UNIQ=1, gate claimed "PASSED". This happened during today's
DNS hardening apply when CoreDNS was in CrashLoopBackOff and the curl
pod couldn't resolve service names.
Added a MIN > 0 sanity check. Technitium always has built-in zones
(localhost, standard reverse PTRs), so a zero count means the probe
didn't reach the API, not that the instance truly has zero zones.
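The guard, roughly (COUNTS holding the three per-instance counts is
illustrative):

    MIN=$(printf '%s\n' "${COUNTS[@]}" | sort -n | head -1)
    [ "$MIN" -gt 0 ] || { echo "FAIL: zone count 0 — probe never reached the API"; exit 1; }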
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds the target 3-node raw StatefulSet alongside the legacy Bitnami Helm
release so data can migrate via REPLICAOF during a future short maintenance
window (Phase 3-7). No traffic touches the new cluster yet — HAProxy still
points at redis-node-{0,1}.
Architecture:
- 3 redis pods, each co-locating redis + sentinel + oliver006/redis_exporter
- podManagementPolicy=Parallel + init container that writes fresh
sentinel.conf on every boot by probing peer sentinels and redis for
consensus master (priority: sentinel vote > role:master with slaves >
pod-0 fallback). Kills the stale-state bug that broke sentinel on Apr 19 PM.
- redis.conf `include /shared/replica.conf` — init container writes
`replicaof <master> 6379` for non-master pods so they come up already in
the correct role. No bootstrap race.
- master+replica memory 768Mi (was 512Mi) for concurrent BGSAVE+AOF fork
COW headroom. auto-aof-rewrite-percentage=200 tunes down rewrite churn.
- RDB (save 900 1 / 300 100 / 60 10000) + AOF appendfsync=everysec.
- PodDisruptionBudget minAvailable=2.
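Compressed sketch of the init container's election order (step 2's peer
probe elided to a comment; hostnames from the text):

    # 1) trust a live peer sentinel's vote for mymaster
    master=$(redis-cli -h redis-v2-0.redis-v2-headless -p 26379 \
      SENTINEL get-master-addr-by-name mymaster 2>/dev/null | head -1)
    # 2) else: any peer already reporting role:master with connected slaves
    # 3) else: deterministic pod-0 fallback
    [ -n "$master" ] || master=redis-v2-0.redis-v2-headless
    [ "$(hostname)" = redis-v2-0 ] || echo "replicaof $master 6379" > /shared/replica.conf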
Also:
- HAProxy scaled 2→3 replicas + PodDisruptionBudget minAvailable=2, since
Phase 6 drops Nextcloud's sentinel-query fallback and HAProxy becomes
the sole client-facing path for all 17 consumers.
- New Prometheus alerts: RedisMemoryPressure, RedisEvictions,
RedisReplicationLagHigh, RedisForkLatencyHigh, RedisAOFRewriteLong,
RedisReplicasMissing. Updated RedisDown to cover both statefulsets
during the migration.
- databases.md updated to describe the interim parallel-cluster state.
Verified live: redis-v2-0 master, redis-v2-{1,2} replicas, master_link_status
up, all 3 sentinels agree on get-master-addr-by-name. All new alerts loaded
into Prometheus and inactive.
Beads: code-v2b (still in progress — Phase 3-7 await maintenance window).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CoreDNS refused to load the new Corefile with `serve_stale 3600s 86400s`:
plugin/cache: invalid value for serve_stale refresh mode: 86400s
serve_stale takes one DURATION and an optional refresh_mode keyword
("immediate" or "verify"), not two durations. Simplified to
`serve_stale 86400s` (serve cached entries for up to 24h when upstream
is unreachable). The new CoreDNS pods were CrashLoopBackOff; the two
old pods kept serving traffic so there was no outage, but the partial
apply left the cluster wedged with the bad ConfigMap.
Also collapses the inline viktorbarzin.lan cache block.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `external-monitor-sync` script treats any *.viktorbarzin.me ingress
as opted in by default, so a missing annotation means "monitored."
Both ingress factories previously OMITTED the annotation when
`external_monitor = false`, which silently left monitors in place.
Fix: when the caller sets `external_monitor = false` explicitly, emit
`uptime.viktorbarzin.me/external-monitor = "false"` so the sync script
deletes the monitor. Keep the previous behavior (no annotation) for
callers that leave external_monitor null — otherwise 19 publicly-reachable
services with `dns_type="none"` would lose monitoring.
Set external_monitor=false on family (grampsweb) and mladost3 (reverse-proxy)
to match the other two already-flagged services. Delete the r730 ingress
module entirely — the Dell server has been decommissioned.
- scripts/cluster_healthcheck.sh: add 12 new checks (cert-manager
readiness/expiry/requests, backup freshness per-DB/offsite/LVM,
monitoring prom+AM/vault-sealed/CSS, external reachability
cloudflared+authentik/ExternalAccessDivergence/traefik-5xx). Bump
TOTAL_CHECKS to 42, add --no-fix flag.
- Remove the duplicate pod-version .claude/cluster-health.sh (1728
lines) and the openclaw cluster_healthcheck CronJob (local CLI is
now the single authoritative runner). Keep the healthcheck SA +
Role + RoleBinding — still reused by task_processor CronJob.
- Remove SLACK_WEBHOOK_URL env from openclaw deployment and delete
the unused setup-monitoring.sh.
- Rewrite .claude/skills/cluster-health/SKILL.md: mandates running
the script first, refreshes the 42-check table, drops stale
CronJob/Slack/post-mortem sections, documents the monorepo-canonical
+ hardlink layout. File is hardlinked to
/home/wizard/code/.claude/skills/cluster-health/SKILL.md for
dual discovery.
- AGENTS.md + k8s-portal agent page: 25-check → 42-check.
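Typical invocation after this change:

    ./scripts/cluster_healthcheck.sh --no-fix   # all 42 checks, no auto-remediation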
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Primary was at 401Mi / 512Mi (78%) before the first bump; the plan's 1Gi
leaves enough headroom for normal operation but thin margin if blocklists or
cache grow. User escalated: OOM cascades are the exact failure mode that
causes user-visible DNS outages, so give a full 2x safety margin across all
three instances. Replicas currently use 124-155Mi steady-state so they have
enormous headroom at 2Gi — accepted for symmetry and future growth (OISD
blocklists, in-memory cache).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The TP-Link gateway was wired via ExternalName `gw.viktorbarzin.lan`, but
Technitium has no record for that name (the router isn't a DHCP client and
Kea DDNS never registers it), so the ingress backend returned NXDOMAIN and
the `[External] gw` Uptime Kuma monitor was permanently failing.
Factory now accepts `backend_ip` as an alternative to `external_name`: it
creates a selector-less ClusterIP Service + manual EndpointSlice pointing
at the given IP, bypassing cluster DNS entirely. Used for gw (192.168.1.1);
the old ExternalName path is retained for every other service.
Also add a direct `port` monitor for the router in uptime-kuma's
internal_monitors list so we can tell a Cloudflare/tunnel outage apart
from the router itself being down. Extended the internal-monitor-sync
script to handle non-DB monitor types (hostname + port fields).
Technitium pods don't ship wget/curl, only dig/nslookup. Switched the per-pod
health check from wget against /api to dig +short against 127.0.0.1. This
probes the actual DNS serving path, which is what we care about anyway.
Zone-count parity can't be done inside the Technitium pod (no HTTP client),
so it spawns a short-lived curlimages/curl pod via kubectl run --rm that
curls the three internal web services and exits.
Added retry loop on the dig check (6 × 10s) to tolerate zone-load delay after
a pod restart — viktorbarzin.lan is ~864KB and can take tens of seconds to
load into memory on a cold start.
Relaxed the A-record regex to match any IPv4 rather than 10.x — records may
legitimately live outside that range.
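The retry loop plus relaxed regex, roughly:

    for i in $(seq 6); do
      dig +short idrac.viktorbarzin.lan @127.0.0.1 \
        | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$' && exit 0
      sleep 10
    done
    exit 1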
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Workstreams A, B, G, H, I of the DNS reliability plan (code-q2e).
Follow-ups for C, D, E, F filed as code-2k6, code-k0d, code-o6j, code-dw8.
**Technitium (WS A)**
- Primary deployment: add Kyverno lifecycle ignore_changes on dns_config
(secondary/tertiary already had it) — eliminates per-apply ndots drift.
- All 3 instances: raise memory request+limit from 512Mi to 1Gi (primary
was restarting near the ceiling; CPU limits stay off per cluster policy).
- zone-sync CronJob: parse API responses, push status/failures/last-run and
per-instance zone_count gauges to Pushgateway, fail the job on any
create error (was silently passing).
**CoreDNS (WS B)**
- Corefile: add policy sequential + health_check 5s + max_fails 2 on root
forward, health_check on viktorbarzin.lan forward, serve_stale
3600s/86400s on both cache blocks — pfSense flap no longer takes the
cluster down; upstream outage keeps cached names resolving for 24h.
- Scale deploy/coredns to 3 replicas with required pod anti-affinity on
hostname via null_resource (hashicorp/kubernetes v3 dropped the _patch
resources); readiness gate asserts state post-apply.
- PDB coredns with minAvailable=2.
**Observability (WS G)**
- Fix DNSQuerySpike — rewrite to compare against
avg_over_time(dns_anomaly_total_queries[1h] offset 15m); previous
dns_anomaly_avg_queries was computed from a per-pod /tmp file so always
equalled the current value (alert could never fire).
- New: DNSQueryRateDropped, TechnitiumZoneSyncFailed,
TechnitiumZoneSyncStale, TechnitiumZoneCountMismatch,
CoreDNSForwardFailureRate.
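Shape of the fixed DNSQuerySpike comparison (the 2x threshold is
illustrative):

    curl -sG http://prometheus:9090/api/v1/query --data-urlencode \
      'query=dns_anomaly_total_queries > 2 * avg_over_time(dns_anomaly_total_queries[1h] offset 15m)'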
**Post-apply readiness gate (WS H)**
- null_resource.technitium_readiness_gate runs at end of apply:
kubectl rollout status on all 3 deployments (180s), per-pod
/api/stats/get probe, zone-count parity across the 3 instances.
Fails the apply on any check fail. Override: -var skip_readiness=true.
**Docs (WS I)**
- docs/architecture/dns.md: CoreDNS Corefile hardening, new alerts table,
zone-sync metrics reference, why DNSQuerySpike was broken.
- docs/runbooks/technitium-apply.md (new): what the gate checks, failure
modes, emergency override.
Out of scope for this commit (see beads follow-ups):
- WS C: NodeLocal DNSCache (code-2k6)
- WS D: pfSense Unbound replaces dnsmasq (code-k0d)
- WS E: Kea multi-IP DHCP + TSIG (code-o6j)
- WS F: static-client DNS fixes (code-dw8)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Context
For weeks, every push to infra has resulted in `build-cli` workflow
failure AND `default` workflow succeed — but the `default` workflow's
"success" was a lie. Inside the apply-loop we were swallowing per-stack
failures with `set +e ... echo FAILED` and the step exited 0 regardless.
Discovered during bd code-3o3 e2e test (qbittorrent 5.0.4 → 5.1.4):
agent commit landed, CI reported `default=success`, but cluster was
unchanged. Log inside the step showed:
[servarr] Starting apply...
ERROR: Cannot read PG credentials from Vault.
Run: vault login -method=oidc
[servarr] FAILED (exit 1)
Two root causes, two fixes here.
### 1. Vault `ci` role lacks Tier-1 PG backend creds
The Tier-1 PG state backend (2026-04-16 migration, memory 407) uses
the `pg-terraform-state` static DB role. `scripts/tg` reads it via
`vault read database/static-creds/pg-terraform-state`. That path is
permitted by the separate `terraform-state` Vault policy, which is
bound only to a role in namespace `claude-agent`. The CI runner is in
namespace `woodpecker` using role `ci`, whose policy grants only KV
+ K8s-creds + transit. Net: every Tier-1 stack apply from CI has
been dying at the PG-creds fetch since the migration.
**Fix**: attach `vault_policy.terraform_state` to
`vault_kubernetes_auth_backend_role.ci`'s `token_policies`. No new
policy needed — reuses the minimal one from 2026-04-16.
### 2. Apply-loop swallows stack failures
`.woodpecker/default.yml`'s platform + app apply loops use
`set +e; OUTPUT=$(... tg apply ...); EXIT=$?; set -e; [ $EXIT -ne 0 ]
&& echo FAILED` and then continue the while-loop. The step never
re-raises, so it exits 0 regardless of how many stacks failed.
**Fix**: accumulate failed stack names (excluding lock-skipped ones)
into `FAILED_PLATFORM_STACKS` / `FAILED_APP_STACKS`, serialise the
platform list to `.platform_failed` so it survives the step boundary,
and at the end of the app-stack step exit 1 if either list is
non-empty. Lock-skipped stacks remain non-fatal.
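Sketch of the accumulate-then-fail pattern (variable names from above,
loop body elided):
```
set +e; OUTPUT=$(../../scripts/tg apply --non-interactive); EXIT=$?; set -e
[ "$EXIT" -ne 0 ] && FAILED_APP_STACKS="$FAILED_APP_STACKS $STACK"
# ...at the end of the app-stack step:
if [ -n "$FAILED_PLATFORM_STACKS$FAILED_APP_STACKS" ]; then
  echo "=== FAILED STACKS ===$FAILED_PLATFORM_STACKS$FAILED_APP_STACKS"
  exit 1
fi
```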
Together, (1) unblocks real apply and (2) ensures the Woodpecker
pipeline + the service-upgrade agent can both trust `default`
workflow state again.
## What is NOT in this change
- Re-running the qbittorrent upgrade to converge the cluster — the
TF file is already at 5.1.4 in git; once CI picks up this commit
it'll apply on its own, or Viktor can run `tg apply` locally now
that the ci role has access too.
- Retiring the `set +e ... continue` pattern entirely — keeping the
per-stack continuation so a single bad stack doesn't hide the
others' plans from the log. Just making the final status honest.
## Test Plan
### Automated
`terraform plan` / apply clean (Tier-0 via scripts/tg):
```
Plan: 0 to add, 2 to change, 0 to destroy.
# vault_kubernetes_auth_backend_role.ci will be updated in-place
~ token_policies = [
+ "terraform-state",
# (1 unchanged element hidden)
]
# vault_jwt_auth_backend.oidc will be updated in-place
~ tune = [...] # cosmetic provider-schema drift, pre-existing
Apply complete! Resources: 0 added, 2 changed, 0 destroyed.
```
State re-encrypted via `scripts/state-sync encrypt vault`; enc file
committed.
### Manual Verification
```
# Before (on previous commit — expect failure):
$ kubectl -n woodpecker exec woodpecker-server-0 -- sh -c '
SA=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token);
TOK=$(curl -s -X POST http://vault-active.vault.svc:8200/v1/auth/kubernetes/login \
-d "{\"role\":\"ci\",\"jwt\":\"$SA\"}" | jq -r .auth.client_token);
curl -s -H "X-Vault-Token: $TOK" \
http://vault-active.vault.svc:8200/v1/database/static-creds/pg-terraform-state'
→ {"errors":["1 error occurred:\n\t* permission denied\n\n"]}
# After (this commit):
→ {"data":{"username":"terraform_state","password":"..."},...}
```
Pipeline-level: the next infra push will exercise
`.woodpecker/default.yml`; expected first push is this very commit.
Watch `ci.viktorbarzin.me` — the `default` workflow should either
succeed for real (and land actual changes) or exit 1 with
"=== FAILED STACKS ===" so the cause is visible.
Refs: bd code-e1x
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a ha-sofia-retry Middleware (attempts=3, initialInterval=100ms)
and ha-sofia-transport ServersTransport (dialTimeout=500ms) wired into
ha-sofia + music-assistant ingresses. Absorbs the 67-156ms connect/DNS
stalls that were surfacing as 18 x 502s/day without disturbing the
global 2-attempt retry or Immich's 60s dialTimeout. depends_on the new
manifests to avoid the dangling-reference pattern from the 2026-04-17
Traefik P0.
Closes: code-rd1
Three changes:
1. Split panel 1 (YTD overlay of 6 non-additive lines) into two accounting-
clean stacked-area panels side-by-side:
- "YTD sources": salary + bonus + rsu_vest + residual (= gross)
- "YTD uses": net + income_tax + NI + pension_employee + student_loan
+ rsu_offset (= gross, per validate_totals identity)
Green for take-home, red/orange for taxes, purple for pension, teal
for RSU offset — visually encodes "what you earned vs what was taken".
2. Panel 3 effective rate switched from per-slip attribution to YTD
cumulative (SUM OVER w / SUM OVER w). Kills the vest-month >100% spike:
the old SQL subtracted `rsu_vest × ytd_avg_rate` from income_tax, but
Meta's variant-C grossup means actual RSU tax is on `rsu_grossup × top
marginal`, not rsu_vest × average. Cumulative approach blends both
proportionally, no attribution hack needed. Also adds a third series:
the all-deductions rate ((income_tax + NI + student_loan) / gross).
3. New panel 8 — Sankey (netsage-sankey-panel) showing sources → Gross →
uses over the selected time range. Plugin added to grafana Helm values.