Sweep through the 30+ stacks that predated the auth = "app" tier
and were tagged auth = "none" without a comment explaining why
they weren't behind Authentik. Each is now self-documenting at the
call site, so the tg-level anti-exposure guard passes and future
readers don't have to reverse-engineer the intent.
Flipped 6 stacks from "none" to "app" — their backends have their
own user auth and the new tier records that more accurately:
- navidrome (Subsonic user/password)
- ntfy (deny-all default + user.db tokens)
- nextcloud (WebDAV/CalDAV/CardDAV app passwords)
- vaultwarden (Bitwarden-compatible token auth)
- headscale (OIDC + preauth keys for Tailscale nodes)
- paperless-ngx (app-layer login + API tokens)
Kept "none" with a comment on the rest — they're genuinely public,
webhook receivers, native-protocol endpoints, OAuth callbacks, or
Anubis-fronted: authentik (×2 + guest outpost), beads-server (dolt),
claude-memory (bearer-token MCP), dawarich, ebooks/book-search-api,
fire-planner /api, forgejo (git/OCI native clients), frigate (HA
integration), immich/frame, insta2spotify /api, instagram-poster
(meta fetcher), k8s-portal, matrix (native bearer), monitoring×2
(HA REST scrapes), n8n (webhooks), nvidia, onlyoffice (JWT),
owntracks (HTTP Basic), postiz, privatebin (client-side enc),
rybbit (analytics tracker), send (E2E file drop), tuya-bridge
(API key), vault (own auth + CLI), webhook_handler, woodpecker
(forgejo webhooks + OAuth), xray (×3 VPN transports).
real-estate-crawler/main.tf:400 already had its comment from a
prior edit — not touched here.
No live state changes — auth = "app" produces the same middleware
chain as auth = "none" (verified earlier this session). This commit
is purely documentation + intent-tagging.
Cluster grew past the 100-conn default — steady-state idle was 90/100,
leaving zero headroom for terragrunt applies or transient surges. The
ceiling was being discovered by Terraform crashing (pq: "remaining
connection slots are reserved for roles with the SUPERUSER attribute"),
not by alerting, because we had no PG scrape config at all.
dbaas (Tier 0):
* max_connections: 100 → 200
* shared_buffers: 512MB → 1GB (Postgres recommends ~25% of pod memory)
* effective_cache_size: 1536MB → 2560MB (scaled with pod memory)
* pod memory: 2Gi → 3Gi (rough rule of thumb: enough for shared_buffers
+ ~16MB work_mem * concurrent sorts + OS cache + overhead)
* Triggers bump on null_resource.pg_cluster forces CNPG to re-apply,
which rolls the cluster (standby first, then primary failover).
monitoring:
* New scrape job 'cnpg' on dbaas namespace pods labeled
cnpg.io/podRole=instance, port name=metrics (9187). Relabels add
cnpg_cluster + cnpg_role labels for alert grouping.
* PGConnectionsHigh (warning, >85% for 10m) — heads-up before exhaustion.
* PGConnectionsCritical (critical, >95% for 3m) — last call before
refusing connections.
Verified: cnpg targets up, sum(cnpg_backends_total)=84, max_connections
metric=200, alert ratio 0.42 → both alerts inactive.
The bank-sync CronJob was posting to /accounts/banksync which fans out to
ALL accounts in a single call. With PSD2/GoCardless's 4-successful-pulls
per-account per-24h quota, a single rate-limited account would 500 the
whole call, and `bank_sync_success` would flip to 0 even though the data
itself was still flowing through manual UI syncs. Result: BankSyncFailing
fired routinely whenever the user had been active in the UI that day —
a structural false positive.
Fix:
* CronJob: enumerate accounts via GET /accounts, POST per-account
/accounts/{id}/banksync, emit bank_sync_account_success and
bank_sync_account_last_success_timestamp labelled by account name.
Roll up bank_sync_success = 1 iff any account succeeded.
* Alerts: drop BankSyncFailing (noise generator). Keep BankSyncStale
at 48h (global drought). Add BankSyncAccountStale at 72h (catches
single-account auth expiry — the real signal we wanted).
Verified: manual run on bank-sync-viktor pushes 6 per-account success +
timestamp series; roll-up bank_sync_success=1; no firing alerts.
Two real issues found while triaging HomeAssistantCriticalSensorUnavailable
alerts and the prometheus + technitium PVC Terminating-but-in-use
state from the earlier session.
1. idrac-redfish-exporter + snmp-exporter ingresses: auth=required →
auth=none. HA Sofia REST sensors scrape these endpoints
programmatically; with Authentik forward-auth in front, every
request got a 302 to authentik.viktorbarzin.me and the REST
sensors parsed the HTML login page instead of metrics — leaving
the R730, UPS, and ~20 other sensors permanently unavailable.
The allow_local_access_only IP allowlist (192.168.0.0/16 +
10.0.0.0/8) already gates external access, so authentik on top
was breaking machine-to-machine traffic for no security gain.
2. prometheus_server_pvc + technitium primary_config_encrypted:
add lifecycle.ignore_changes = [spec[0].resources[0].requests].
The autoresizer expands these PVCs; PVCs can't shrink. Without
the ignore, every TF apply tried to revert the live size back
to the TF spec value, hit K8s's shrink-forbidden rule, and
force-replaced the PVC. Because the pod still mounted it, the
PVC went into Terminating-but-protected limbo — fine until a
pod restart would have orphaned the volume. Root cause of the
2026-05-10 PVC Terminating incident.
Bonus: prometheus_server_pvc threshold was the inverted "90%" (the
same bug the bulk fecfa211 sweep fixed elsewhere; my regex only
matched "80%" so this one slipped through). Now "10%".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.
The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
-> etcd snapshot save
-> optional master containerd skew fix
-> apt repo URL rewrite (minor bumps only)
-> drain/upgrade/uncordon master via ssh < update_k8s.sh
-> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
-> post-flight verification
Two new Upgrade Gates alerts catch failure modes:
- K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
- EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)
update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.
Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.
Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).
pvc-autoresizer's GetMetrics() returns volume stats for a PVC only if
all four kubelet_volume_stats metrics (available_bytes, capacity_bytes,
inodes_free, inodes) are retrieved. The keep-list in the
kubernetes-nodes scrape job had available_bytes and capacity_bytes
(post 9d5da4d8) but was missing the two inode metrics, so the
autoresizer's reconcile logged "failed to get volume stats" for every
PVC and never resized anything.
Add kubelet_volume_stats_inodes and kubelet_volume_stats_inodes_free
to the regex.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CI image (ci/Dockerfile) is alpine + jq, no python3. The
grafana_admin_only_folder_acl null_resource was parsing /api/folders
with a python3 oneliner, which crashed every CI apply with
"python3: command not found" and made every monitoring stack apply
fail in CI (worked locally because the dev VM has python3).
jq is already in the CI image and produces the same output.
Reverses the March 2026 outage mitigation that disabled unattended-
upgrades cluster-wide. Now re-enables it on the k8s template VM with:
- Allowed-Origins limited to security/updates pockets
- Package-Blacklist for k8s/containerd/runc/calico-node (apt-mark
hold on the cluster-critical components)
- Automatic-Reboot disabled — kured drives the actual reboots
- Compatible with the existing kured + sentinel-gate flow
kured side:
- rebootDelay 30s, concurrency 1
- Sentinel cool-down stretched 30m → 24h (aligns with the 24h soak
window from the post-mortem)
- prometheusUrl + alertFilterRegexp wired so any firing non-ignored
alert halts the rollout. Ignore-list excludes self-referential
alerts (Watchdog/RebootRequired/KuredNodeWasNotDrained/
InfoInhibitor) that would otherwise deadlock kured.
Prometheus side (already partly landed in 6c4e0966 — the "Upgrade
Gates" rule group):
- Refine `KubeQuotaAlmostFull` to include the resourcequota label in
both the on-clause and the summary, so multi-quota namespaces
(authentik, beads-server, frigate) report the quota name correctly.
grafana.tf: terraform fmt whitespace only.
Together with the post-mortem 2026-03-22 (memory id=390) the loop is
closed: unattended-upgrades runs again, kernel-class updates can land,
but only when cluster health is green and the reboot window is open.
- add traefik-authentik-forward-auth to grafana ingress middleware list
- disable auth.anonymous (was Viewer-by-default for the public)
- enable auth.proxy with X-authentik-username so Authentik users get
signed in seamlessly (no double-login UX)
Prometheus and Alertmanager already had forward-auth — no change.
Add `AuthentikForwardAuthFallbackActive` Prometheus alert: fires on
sustained 401/s spike on the websecure entrypoint (>5/s for 5m), which
is the symptom of the auth-proxy Emergency-Access fallback firing —
in turn caused by zero ready endpoints on the outpost service.
Why this rule and not `kube_endpoint_address_available == 0`:
kube-state-metrics endpoint metrics exist as series names but never
have current values in this Prometheus pipeline (something is dropping
them silently). Detecting the failure at the edge via Traefik is more
reliable than instrumenting the broken middle.
Also fix the pre-existing `AuthentikOutpostForwardAuth400Spike` regex
— the service label is `authentik-ak-outpost-...`, not
`authentik-authentik-outpost-...`, so the alert never matched any
series and never could have fired. Verified in Prometheus before/after
the fix.
Add an "Upgrade Validation Checklist" section to
`.claude/reference/authentik-state.md` with the seven-step smoke test
to run after Authentik chart bumps, provider bumps, or outpost pod
recreation. Covers the brittle surfaces (Service selector, JSON
patches, postgres backend wiring, access_token_validity TTL, edge
auth flow, plan-to-zero).
f1.viktorbarzin.me is a SPA whose JS fetches /schedule, /embed,
/embed-asset, … on the same path tree. With Anubis fronting `/`,
those XHRs land on the challenge HTML even when the cookie *should*
be valid, breaking the page with `Unexpected token '<', "<!doctype "
... is not valid JSON`. Removed Anubis from f1 — would need a path
carve-out (the way wrongmove does for /api) to re-enable. Added a
top-of-block comment so future me remembers why.
Plus four new Prometheus alerts in `Slow Ingress Latency` group
(stacks/monitoring/.../prometheus_chart_values.tpl):
- IngressTTFBHigh (warn, 10m, avg latency >1s)
- IngressTTFBCritical (crit, 5m, avg latency >3s)
- IngressErrorRate5xxHigh (crit, 5m, 5xx >5%)
- AnubisChallengeStoreErrors (crit, 5m, any 5xx on *anubis* services
via Traefik — proxies for the in-pod challenge-store error since
Anubis itself only exposes Go-runtime metrics)
Notes from the alert author: avg-not-p95 because the existing
Prometheus scrape config drops traefik bucket series; once those
are restored, swap to histogram_quantile(0.95). TraefikDown inhibit
rule extended to suppress these four during a Traefik outage.
Wealth, Payslips, and Job-Hunter Grafana datasources all baked the
rotating PG password into their ConfigMap at TF-apply time, so every
7-day Vault static-role rotation silently broke the panels until a
manual `terragrunt apply`. Same family as the recurring grafana-mysql
backend bug — Grafana caches creds at startup and never picks up the
new ESO-synced password without a restart.
Fix:
- Each source stack now creates an ExternalSecret in `monitoring`
exposing the rotating password as `<NAME>_PG_PASSWORD` env-var.
- Grafana mounts those via `envFromSecrets` (optional=true so a
missing source stack doesn't block boot) and the datasource
ConfigMaps reference `$__env{<NAME>_PG_PASSWORD}` instead of a
literal password.
- `reloader.stakater.com/auto: "true"` on the Grafana pod restarts
it whenever any of the four DB-cred Secrets is updated.
Tested end-to-end: forced `vault write -force database/rotate-role/
pg-wealthfolio-sync` → ESO synced (~30s) → reloader fired →
Grafana booted with new env in ~50s total → all three /api/datasources
/uid/*/health endpoints return "Database Connection OK".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Prometheus scrape config for the kubernetes-nodes job kept
capacity_bytes + used_bytes but dropped available_bytes. pvc-autoresizer
computes utilization from available/capacity, so without that metric it
was silent for every PVC in the cluster — including mailserver, which
filled to 89% (1.7G/2.0G) and started rejecting all inbound mail with
'452 4.3.1 Insufficient system storage' (15+ hours, all real senders:
Brevo, Gmail, Facebook).
Also bumps the floors of mailserver (2Gi -> 5Gi, limit 10Gi) and forgejo
(15Gi -> 30Gi) PVCs to recover from the immediate outage, and adds
ignore_changes on requests.storage so future autoresizer expansions
don't cause TF drift.
User asked for two lines instead of side-by-side bars at monthly
granularity. Converts panel 25 from barchart to timeseries:
* type: barchart -> timeseries
* format: table -> time_series, SELECT month::timestamp AS time
* drawStyle line, lineWidth 2, fillOpacity 0, showPoints auto
* Same blue (contributions) / green (market gain) colour overrides
Where the green line rises above the blue line is the visual cue that
the market out-earned new contributions for that month -- the trend
the user wants to track.
Diff is small (15 ins / 28 del) because the bar-chart-only fields
(barRadius, barWidth, groupWidth, stacking, xField, xTickLabelRotation)
are dropped.
Goal stated by user: see when monthly market gain starts to exceed
monthly contributions, i.e. the inflection point where the market is
out-earning savings rather than the other way around.
New panel id=25 between the annual decomposition (13) and per-account
ROI (14): bar chart with two side-by-side bars per month --
contributions (blue) and market gain (green). Same calculation as
panel 13 but month-grain instead of year-grain. Months where the
green bar dwarfs the blue one are visible at a glance.
SQL: same endpoints CTE pattern as panel 13, with date_trunc('month',
valuation_date) as the grouping key. Uses max_complete cutoff so
partial-today doesn't skew the latest month.
Layout: panels at y >= 75 shifted down by 11 (chart height). New
chart at y=75; panel 14 (per-account ROI) -> y=86; panel 10
(activity log) -> y=96.
Spot check (recent months from PG):
2025-07: contrib +£5,601 market +£42,295 <- big market month
2025-09: contrib +£1,501 market +£24,206
2026-02: contrib +£35,501 market +£41,382
2026-03: contrib +£5,501 market -£38,483 <- correction
2026-04: contrib +£73,267 market +£21,448
End of forgejo-registry-consolidation. After Phase 0/1 already landed
(Forgejo ready, dual-push CI, integrity probe, retention CronJob,
images migrated via forgejo-migrate-orphan-images.sh), this commit
flips everything off registry.viktorbarzin.me onto Forgejo and
removes the legacy infrastructure.
Phase 3 — image= flips:
* infra/stacks/{payslip-ingest,job-hunter,claude-agent-service,
fire-planner,freedify/factory,chrome-service,beads-server}/main.tf
— image= now points to forgejo.viktorbarzin.me/viktor/<name>.
* infra/stacks/claude-memory/main.tf — also moved off DockerHub
(viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...).
* infra/.woodpecker/{default,drift-detection}.yml — infra-ci pulled
from Forgejo. build-ci-image.yml dual-pushes still until next
build cycle confirms Forgejo as canonical.
* /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated.
Phase 4 — decommission registry-private:
* registry-credentials Secret: dropped registry.viktorbarzin.me /
registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries.
Forgejo entry is the only one left.
* infra/stacks/infra/main.tf cloud-init: dropped containerd
hosts.toml entries for registry.viktorbarzin.me +
10.0.20.10:5050. (Existing nodes already had the file removed
manually by `setup-forgejo-containerd-mirror.sh` rollout — the
cloud-init template only fires on new VM provision.)
* infra/modules/docker-registry/docker-compose.yml: registry-private
service block removed; nginx 5050 port mapping dropped. Pull-
through caches for upstream registries (5000/5010/5020/5030/5040)
stay on the VM permanently.
* infra/modules/docker-registry/nginx_registry.conf: upstream
`private` block + port 5050 server block removed.
* infra/stacks/monitoring/modules/monitoring/main.tf: registry_
integrity_probe + registry_probe_credentials resources stripped.
forgejo_integrity_probe is the only manifest probe now.
Phase 5 — final docs sweep:
* infra/docs/runbooks/registry-vm.md — VM scope reduced to pull-
through caches; forgejo-registry-breakglass.md cross-ref added.
* infra/docs/architecture/ci-cd.md — registry component table +
diagram now reflect Forgejo. Pre-migration root-cause sentence
preserved as historical context with a pointer to the design doc.
* infra/docs/architecture/monitoring.md — Registry Integrity Probe
row updated to point at the Forgejo probe.
* infra/.claude/CLAUDE.md — Private registry section rewritten end-
to-end (auth, retention, integrity, where the bake came from).
* prometheus_chart_values.tpl — RegistryManifestIntegrityFailure
alert annotation simplified now that only one registry is in
scope.
Operational follow-up (cannot be done from a TF apply):
1. ssh root@10.0.20.10 — edit /opt/registry/docker-compose.yml to
match the new template AND `docker compose up -d --remove-orphans`
to actually stop the registry-private container. Memory id=1078
confirms cloud-init won't redeploy on TF apply alone.
2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/`
on the VM (~2.6GB freed).
3. Open the dual-push step in build-ci-image.yml and drop
registry.viktorbarzin.me:5050 from the `repo:` list — at that
point the post-push integrity check at line 33-107 also needs
to be repointed at Forgejo or removed (the per-build verify is
redundant with the every-15min Forgejo probe).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wrap the three new Vault key reads in try(...) so the first apply
succeeds even when forgejo_pull_token / forgejo_cleanup_token /
secret/ci/global haven't been populated yet. Without this, CI
auto-apply blocks on the very push that introduces the references —
chicken-and-egg with the runbook order (which is: apply Forgejo bumps,
then create users + PATs, then apply the rest).
Empty tokens are intentionally visible-broken (auth fails, probe
reports auth failure, cleanup CronJob errors) — that's the signal
to run the bootstrap runbook. Subsequent apply picks up the real
values.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stage 1 of moving private images off the registry:2 container at
registry.viktorbarzin.me:5050 (which has hit distribution#3324 corruption
3x in 3 weeks) onto Forgejo's built-in OCI registry. No cutover risk —
pods still pull from the existing registry until Phase 3.
What changes:
* Forgejo deployment: memory 384Mi→1Gi, PVC 5Gi→15Gi (cap 50Gi).
Explicit FORGEJO__packages__ENABLED + CHUNKED_UPLOAD_PATH (defensive,
v11 default-on).
* ingress_factory: max_body_size variable was declared but never wired
in after the nginx→Traefik migration. Now creates a per-ingress
Buffering middleware when set; default null = no limit (preserves
existing behavior). Forgejo ingress sets max_body_size=5g to allow
multi-GB layer pushes.
* Cluster-wide registry-credentials Secret: 4th auths entry for
forgejo.viktorbarzin.me, populated from Vault secret/viktor/
forgejo_pull_token (cluster-puller PAT, read:package). Existing
Kyverno ClusterPolicy syncs cluster-wide — no policy edits.
* Containerd hosts.toml redirect: forgejo.viktorbarzin.me → in-cluster
Traefik LB 10.0.20.200 (avoids hairpin NAT for in-cluster pulls).
Cloud-init for new VMs + scripts/setup-forgejo-containerd-mirror.sh
for existing nodes.
* Forgejo retention CronJob (0 4 * * *): keeps newest 10 versions per
package + always :latest. First 7 days dry-run (DRY_RUN=true);
flip the local in cleanup.tf after log review.
* Forgejo integrity probe CronJob (*/15): same algorithm as the
existing registry-integrity-probe. Existing Prometheus alerts
(RegistryManifestIntegrityFailure et al) made instance-aware so
they cover both registries during the bake.
* Docs: design+plan in docs/plans/, setup runbook in docs/runbooks/.
Operational note — the apply order is non-trivial because the new
Vault keys (forgejo_pull_token, forgejo_cleanup_token,
secret/ci/global/forgejo_*) must exist BEFORE terragrunt apply in the
kyverno + monitoring + forgejo stacks. The setup runbook documents
the bootstrap sequence.
Phase 1 (per-project dual-push pipelines) follows in subsequent
commits. Bake clock starts when the last project goes dual-push.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stdlib-only Python exporter ($1) reads ~/.openclaw/agents/*/sessions/*.jsonl
(assistant messages with usage) plus auth-profiles.json (OAuth expiry,
Plus-tier label) and exposes Prometheus text format on :9099/metrics.
Container is python:3.12-slim; pod template gets prometheus.io/scrape
annotations so the existing kubernetes-pods job picks it up — no
ServiceMonitor needed.
Metrics exported:
openclaw_codex_messages_total{provider,model,session_kind} counter
openclaw_codex_input/output/cache_read/cache_write_tokens_total
openclaw_codex_message_errors_total{reason}
openclaw_codex_active_sessions{kind} gauge
openclaw_codex_oauth_expiry_seconds{provider,account,plan} gauge
openclaw_codex_last_run_timestamp gauge
Grafana dashboard "OpenClaw — Codex Usage" (Applications folder, 30s
refresh): messages/5h vs Plus rate-card, % of 1,200 floor, tokens/5h,
cache hit %, OAuth expiry days, active sessions, last-turn age, errors,
plus per-model timeseries + bar gauge + error table.
Plus rate-card thresholds in the gauge are conservative (1,200/5h floor;
real cap is dynamic 1,200–7,000). Re-baseline if throttling shows up
below 80%.
Better visual grouping: instead of 8 paired panels in a single row at
w=3 (cramped, hard to scan), arrange as a 2x4 grid at w=6. Top row
("all" — wealth change incl new money), bottom row ("mkt" — pure
market gain). Columns are timeframes 1d / 7d / 30d / 90d.
Reading vertically: same window, two interpretations side by side.
Reading horizontally: same metric across timeframes.
Layout shift: delta row goes from y=4 (4 wide) to y=4..11 (8 high).
All chart/log panels with y >= 8 shift down by another 4 rows
(net-worth chart 8->12, activity log 81->85, etc.).
User feedback: net-worth delta panels (1d/7d/30d/90d) confused
because +£174k over 90d looked too big against the £271k cumulative
unrealised gain. Decomposition showed the 90d delta was £114k of new
money in (contributions) + £60k of actual market gain.
So now the delta row shows BOTH:
Δ Nd (all) — net-worth change incl new money (the original number)
Δ Nd (mkt) — pure market gain, contributions stripped out
Pattern for "(mkt)" panels: same now_snap / past_snap CTEs but
selecting both total_value and net_contribution, then computing
(nw_delta - contrib_delta) = market_gain over window.
Layout: 8 panels at w=3 each on the y=4 row, paired by window
(all next to mkt for each timeframe), so you can see "wealth
change vs investment performance" at a glance.
Verified live (90d): all=+£174,612, mkt=+£60,343, contrib=+£114,268.
New row at y=4 with 4 stat panels showing net-worth change over the
trailing windows. Each uses the latest-per-account stitching pattern
(skew-resilient against partial-day syncs) and computes:
delta = SUM(latest per account) - SUM(latest per account at or
before max_complete - N)
Where max_complete is the most recent date all accounts have a row.
For each window: 1d, 7d, 30d, 90d.
Verified live values: +£8,575 / +£22,696 / +£144,633 / +£174,612.
All panels at y >= 4 shifted down by 4 rows to make room (Net worth
chart 4->8, Per-account stacked 24->28, Activity log 77->81, etc.).
Note: this commit also reformats the dashboard JSON from compact-
object form to indented form (json.dump indent=2 side effect from the
Python patch script). No semantic changes outside the new panels and
y-shifts.
Bug: timeseries panels were empty before 2024-04-10. Cause was the
complete_dates CTE filtering to "every active account has a row for
this date" -- which excluded every day before the most-recently-added
account first appeared. The 6th account (Trading212 Invest GIA) only
started 2024-04-10, so 4 years of legitimate historical data
(2020-06-07 onwards, when the user genuinely had fewer accounts) got
hidden.
New pattern across panels 5/6/7/8/9/12/13: replace complete_dates with
max_complete cutoff. Compute the most-recent date where all current
accounts have a row, then include every historical date up to and
including that day. Partial-today is still excluded automatically.
Historical days with fewer accounts now show as their actual smaller
sums -- which is the correct historical net worth at the time.
Verified via PG: new pattern returns 2,159 distinct days from
2020-06-07 to 2026-05-05 (vs the previous 391 from 2024-04-10).
Per-account first-seen dates:
InvestEngine ISA - 2020-06-07
Schwab US workplace - 2020-11-17
InvestEngine GIA - 2022-03-17
Fidelity UK Pension - 2022-05-16
Trading212 ISA - 2024-04-08
Trading212 Invest GIA - 2024-04-10 (was the bottleneck)
Re-applies the milestone annotation commit reverted in 0ef36aec. The
earlier "nothing loads / syntax error" was a red herring: Vault had
rotated the wealthfolio_sync DB password 7 days prior, the K8s Secret
picked it up automatically (pg-sync sidecar still working), but the
Grafana datasource ConfigMap is baked at TF-apply time so Grafana was
sending the old password. Every panel + the new annotation alike
failed with: pq password authentication failed for user wealthfolio_sync.
Fix today: refresh the datasource ConfigMap and roll Grafana.
scripts/tg apply -target=kubernetes_config_map.grafana_wealth_datasource
kubectl -n monitoring rollout restart deploy/grafana
Annotation source verified live via /api/ds/query: SQL returns 5
milestone rows correctly. Dashboard charts now show vertical dashed
lines at GBP100k 2021-11-01, GBP250k 2023-07-18, GBP500k 2024-09-19,
GBP750k 2025-08-26, GBP1M 2026-04-18.
KNOWN FOLLOW-UP: Vault rotates pg-wealthfolio-sync every 7 days
(static role). Todays failure will recur unless the Grafana
datasource auto-refreshes. Options:
1. Annotate Grafana deploy with stakater/reloader so it restarts
when wealthfolio-sync-db-creds Secret changes.
2. Switch datasource provisioning to read password from an env var
sourced from the Secret instead of baking into the ConfigMap.
Combined with reloader, picks up rotation cleanly.
Inspired by the user's "Journey to £1M" reference — adds vertical
dashed lines on every timeseries panel at the date net worth first
crossed each round threshold (£100k, £250k, £500k, £750k, £1M).
Implementation: a dashboard-level annotation source ("Milestones",
purple) backed by a PG query that finds the MIN(valuation_date) where
SUM(total_value) >= each threshold. The query returns (time, text)
pairs, e.g. "2026-04-18 → £1M 🎉". Annotations attach to all
timeseries panels automatically; auto-extends as future thresholds
are crossed.
Verified against current data:
£100k → 2021-11-01 £250k → 2023-07-18 £500k → 2024-09-19
£750k → 2025-08-26 £1M → 2026-04-18 🎉
Future work (per user request): add a "Journey" stat-card row at the
top mirroring the reference (date achieved + months from previous).
Make daily movements visible on the line charts. The y-axis still spans
~£700k–£1M so an £8k daily move is ~1% of vertical range and easy to
miss when only the line is drawn.
Changes per panel:
* 5 (Net worth): showPoints never→always, pointSize 4→5, fillOpacity 20→10
* 6 (Net contrib vs market): showPoints never→always, pointSize 4→5
* 7 (Growth over time): showPoints never→always, pointSize 4→5, fillOpacity 50→25
* 8 (Per-account stacked): showPoints never→always (kept stacking fill at 70)
* 9 (Cash vs invested stacked): showPoints never→always (kept stacking fill at 70)
Each daily value now renders as a visible dot, so even if the line
appears flat at this scale, the per-day points trace the wiggle. Lighter
fill on the unstacked panels lets the line + points dominate visually.
Caveat: the fundamental "£8k on a £1M base" visibility issue is best
solved with a dedicated "Daily change" delta panel — happy to add one
on next pass if this isn't enough.
Fix: panels 5–9 had `AS \"time\"` (literal backslash-quote sequence
embedded in the SQL string). PostgreSQL parsed that as a syntax error
at the leading backslash:
ERROR: syntax error at or near "\"
LINE 1: ...complete_dates)) SELECT valuation_date::timestamp AS \"time\"
Root cause: the patch script for the skew-resilient queries (commit
628f5a0d) used a Python f-string with `\\\"time\\\"`, which produces
a literal backslash-quote in the Python string. When that string
was JSON-encoded the backslash was preserved verbatim instead of
collapsed to plain `"time"`.
Replaces all five occurrences with the correct `AS "time"` form.
Verified the corrected query against PG returns 7 daily net-worth
rows for 04-25..05-01 as expected.
Two adjustments to make daily movements visible:
1. Default time range: now-5y → now-180d. The timeseries charts (Net
worth, Net contribution vs market value, Growth, Per-account
stacked, Cash vs invested) auto-fit their y-axis to the data range
in view. Over 5 years, daily £1k–£10k moves are ~1% of axis range
and visually invisible against the cumulative trend. Over 6
months, the same daily moves dominate. Yearly bar charts (12, 13)
are unaffected — they aggregate by calendar year and don't filter
on $__timeFilter.
2. Decimals → 2 on every currency panel (1, 2, 3, 5–9, 13, 15, 16)
and every percent panel (4, 14). Stat panels now show pennies on
currency and 0.01% on rates; chart y-axis ticks are likewise more
precise. Honest caveat: pennies on a £1M number don't make the
absolute readout easier — to see "today changed by £8,358" cleanly
we'd want a dedicated delta panel; pending user direction.
Widen the time picker manually to recover the 5-year view; default
just zooms into the last 6 months.
Bug witnessed 2026-05-01: dashboard "Net worth (current)" showed £88k
instead of £1.03M because at 02:00 UTC an external trigger refreshed
ONE account (Trading212 ISA), creating its 05-01 daily_account_valuation
row. The 5 other accounts still had their last row at 04-30. The panel
SQL `WHERE valuation_date = (SELECT MAX(valuation_date))` then summed
only the single account that had a 05-01 row.
Two new SQL patterns adopted across all 15 affected panels:
1. Stat / barchart "current snapshot" panels (1, 2, 3, 4, 11, 14, 15,
16): latest-per-account stitching —
WITH latest AS (SELECT DISTINCT ON (d.account_id) ...
FROM daily_account_valuation d
JOIN accounts a ON a.id = d.account_id
ORDER BY d.account_id, d.valuation_date DESC)
gives a coherent "now" snapshot regardless of refresh skew, and
the inner join filters out orphan/deleted accounts (one such was
adding a stale £33k from 04-17). 12-month panels add a parallel
`ago` CTE picking each account's row closest to (d_now - 12mo).
2. Time-series / yearly panels (5, 6, 7, 8, 9, 12, 13): complete-days-
only filter —
WITH active_accounts AS (SELECT COUNT(*) FROM accounts),
complete_dates AS (SELECT valuation_date
FROM daily_account_valuation d
JOIN accounts a ON a.id=d.account_id
GROUP BY valuation_date
HAVING COUNT(*) >= active.n)
so a partial today never renders as a chart dip. The day rejoins
the chart automatically once the daily 16:00 UTC sync writes rows
for every account.
Verified end-to-end against live PG: new queries produce £1,033,734
(matches the 6 active accounts' true latest sum) where the old query
gave £88k.
Top row goes from 5 → 7 stat panels (widths 4+4+4+3+3+3+3=24):
- Net worth, Net contribution, Growth shrink from w=5 to w=4.
- ROI % shrinks from w=5 to w=3 (now sits at x=12).
- 12mo return slides from x=20/w=4 to x=15/w=3.
- New: 12mo contrib (id=15, currency, blue) at x=18 — net contributions
added in the trailing 12 months.
- New: 12mo gain (id=16, currency, red/green) at x=21 — pure market gain
in £ over the trailing 12 months (12mo Δnet-worth − 12mo contribs).
Live values verified against PG: contrib_12mo=£245k, gain_12mo=£172k,
sum = £417k = nw_now − nw_ago, return = 23.51%.
wealth: move Activity log table from y=45 to y=77; the three barcharts
(Yearly return, Annual change, Per-account ROI) shift up by 14 to fill
the gap.
uk-payslip: move Sankey "where the money went" from y=80 to y=48 (right
above the table block); the three tables (Data integrity, All payslips,
YTD reconciliation) shift down by 14 so all four tables (4, 5, 6, 9) sit
contiguously at the bottom.
fire-planner and job-hunter still have intentional side-by-side
table/chart pairings; left untouched pending user direction on whether
to break them.
Trailing 12-month investment return % was a full-width stat at y=59.
Now sits inline with Net worth / Contribution / Growth / ROI as the
fifth headline number — top-row stats reflowed from w=6 (×4) to w=5
(×4) + w=4 (×1). Title shortened to "12mo return" so it fits.
Panels below the old row shifted up by 4 rows to close the gap.
Switch the RSU stack from "after band-aware tax" to gross. Receipt
total is now pre-sacrifice gross compensation; bar − pension stack
≈ ytd_gross reported on the final March payslip / P60.
Verified alignment for 2025/26: bar−pension = £266,752 vs P60
ytd_gross = £268,127 — gap of £1,375 ≈ "other taxable" (benefits,
overtime). Remaining year-level gaps are upstream parser/ingest
issues, not dashboard logic:
- 2024/25 +£27k: March 2025 payslip parsed bonus=£26,969 but never
propagated it into gross_pay/income_tax. Receipt is more
accurate than ytd_gross here.
- 2023/24 −£36k: Feb 2024 payslip row appears to be missing from
the table; ytd_gross has it, sum(gross_pay) doesn't.
- 2022/23 −£10k: variant A→B transition residual.
SQL simplified — band-aware CTE chain dropped (no longer needed for
this panel since RSU is shown gross).
The salary field on the payslip is pre-pension-sacrifice, so the
"Salary (gross)" stack already silently included the salary-sacrifice
pension contribution. Split it out so pension is explicitly visible:
- Salary (cash, post-sacrifice) = salary - pension_sacrifice
- Pension (salary sacrifice, untaxed) = pension_sacrifice
- Bonus
- RSU vest (after band-aware tax)
Bar total unchanged (just relabels what was already there). Pension
is now visibly counted as income — consistent with "untaxed but real"
framing.
Caveat documented in panel description: receipt total ≠ P60 gross
because P60 reports pre-RSU-tax gross. Receipt shows RSU net of tax
per earlier intent. To exactly match P60, swap rsu_after_tax →
rsu_vest gross.
Move both barchart/timeseries panels into row 4 (y=29, side-by-side
w=12 each, h=10) so the per-tax-year overviews appear right after
the income-tax-and-pension YTD row. Shift panels 13, 4, 5, 6, 8, 9
down by 10 to accommodate.
Final ordering: rows 1–3 = monthly + YTD timeseries (panels 1/7/2/3/11/12),
row 4 = yearly receipt + YTD gross YoY (16/17), then the wider
deduction/integrity/table panels below.
Removed:
- Panel 10 "HMRC Tax Year Reconciliation — Individual Tax API"
→ references hmrc_sync.tax_year_snapshot schema. The hmrc-sync
service / DB has not been deployed, so the panel always errored
with "relation does not exist".
- Panel 14 "Meta payroll: bank deposit vs payslip net pay"
→ references payslip_ingest.external_meta_deposits, which is
created by alembic migration 0007. The deployed payslip-ingest
image is at 0005, so the table doesn't exist.
- Panel 15 "RSU vest reconciliation — payslip vs Schwab"
→ references payslip_ingest.rsu_vest_events, created by migration
0008. Same image-staleness story.
Verified all 14 remaining panels return without error via Grafana
/api/ds/query. SQL for the removed panels is preserved in git history;
re-add when the data sources are actually deployed.
Replace the 7-stack "where total comp went" decomposition with a 3-stack
"what I actually earned" view: salary (gross), bonus (gross), and RSU
vest after band-aware tax (PAYE+NI withheld via sell-to-cover). Skips
income tax / NI / student loan / pension / RSU offset.
Bar height = real income kept across all components. RSU is net of tax
because it's withheld at source and never hits the bank account; salary
and bonus are gross because they're paid in full and taxes are deducted
elsewhere. This is the income-side view where tax is implicit, not the
deduction waterfall.
Per-year RSU after tax: 2020/21 £18k · 2021/22 £39k · 2022/23 £50k ·
2023/24 £26k · 2024/25 £71k · 2025/26 £73k.
Two bugs:
1. Synthetic dates projected onto 1970/71 fell outside the dashboard's
default time range (now-10y → now), so Grafana filtered out every
point. Switched to a sliding 12-month window
(CURRENT_DATE - INTERVAL '12 months') as the projection base, plus
a per-panel timeFrom: "13M" override so the panel always shows the
last 13 months regardless of the dashboard's time picker.
2. ORDER BY tax_year, pay_date violated Grafana's long→wide conversion
requirement (data must be ascending by time). Wrapped in a CTE and
re-ordered by the synthetic time column. Pivoted result is now a
single wide frame with 7 series (2019/20…2025/26).
The default fieldConfig unit (percent on Yearly investment return %,
currencyGBP on Annual change decomposition) was being applied to the
"year" string column too — so x-axis labels rendered as "2024%" and
"£2,024" respectively. Add field overrides on the "year" column to
force unit=string. The earlier "tax_year" panels weren't affected
because "2024/25" doesn't parse as a number; "2024" did.
Wealth dashboard:
- "Yearly growth %" → "Yearly investment return %": switched to
modified-Dietz formula `market_gain / (nw_start + 0.5 × contributions)`
so contributions don't inflate the return. New money in is excluded —
this is portfolio performance, not net-worth change.
- "Trailing 12-month growth %" → "Trailing 12-month investment return %":
same formula, applied to the trailing 12mo window.
Pre-fix vs post-fix:
2020: 155.0% → 5.12% (large contributions on small base)
2021: 344.7% → 26.45%
2022: 26.9% → -25.65% (the actual 2022 bear market)
2023: 123.2% → 41.60%
2024: 87.4% → 25.70%
2025: 46.8% → 8.43%
2026: 16.7% → 3.28% (YTD)
UK Payslip dashboard:
- Replaced the per-tax-year stacked bar with a year-over-year line chart:
one line per tax year, X = month-of-tax-year (April→March, projected
onto a 1970/71 fiscal calendar so years overlay), Y = cumulative YTD
gross. Five+ lines visible at a glance for trend comparison.
Wealth (4 new panels at the bottom):
- Trailing 12-month growth % (stat) — % change in net worth over last 12mo.
- Yearly growth % (bar per calendar year) — first→last valuation each year.
- Annual change decomposition (stacked bar) — splits each year's NW change
into "net contributions" (new money in) and "market gain" (everything
else: appreciation, dividends, FX). Answers "did I grow because I saved
or because the market did the work?".
- Per-account ROI % (horizontal bar) — (value − contribution) / contribution
× 100, latest snapshot. Excludes accounts with zero/negative net
contribution (Schwab — distorts ratio after RSU sells).
UK Payslip (1 new panel below the yearly receipt):
- Gross composition by tax year (stacked bar) — salary / bonus / RSU vest /
other components per tax year. Bar height = gross pay. Trends in salary
growth, bonus levels, and RSU vest sizing at a glance.
All queries spot-checked via Grafana /api/ds/query.
Folder ACL:
- Move uk-payslip + wealth dashboards to a new "Finance (Personal)"
folder; job-hunter + fire-planner stay in "Finance" (open).
- New null_resource calls Grafana's folder permissions API after the
dashboard sidecar materialises the folder, setting an admin-only
ACL ({Admin: 4}). Default Viewer/Editor inheritance is overridden,
so anonymous-Viewer (auth.anonymous=true) is denied. Server-admin
always retains access.
- Verified: anonymous → 403 on uk-payslip + wealth, 200 on
control dashboards (node-exporter); admin → 200 on all.
Wealth cash fix:
- Wealthfolio dumps WORKPLACE_PENSION wrappers entirely into
cash_balance because it doesn't track underlying fund holdings.
Reclassify pension cash as invested in the "Cash vs invested"
panel so the cash series reflects actual uninvested broker cash
(~£16k T212 ISA + Schwab) instead of phantom £154k.
Pre-fix: cash=£153,789 / invested=£870,282 / total=£1,024,071
Post-fix: cash=£16,064 / invested=£1,008,008 / total=£1,024,071
New stacks/fire-planner/ mirrors payslip-ingest layout:
- ExternalSecret pulling RECOMPUTE_BEARER_TOKEN from Vault secret/fire-planner
- DB ExternalSecret templating DB_CONNECTION_STRING via static role pg-fire-planner
- FastAPI Deployment (serve), CronJob (recompute-all monthly on 2nd at 09:00 UTC,
scheduled after wealthfolio-sync's 1st at 08:00), ClusterIP Service
- Grafana datasource ConfigMap "FirePlanner" — `database` inside jsonData
(cc56ba29 fix; otherwise Grafana 11.2+ hits "you do not have default database")
Plus:
- vault/main.tf: pg-fire-planner static role (7d rotation), allowed_roles
- dbaas/modules/dbaas/main.tf: null_resource creates fire_planner DB+role
- monitoring/dashboards/fire-planner.json: 9-panel Finance-folder dashboard
(NW timeseries, MC fan chart, success heatmap, lifetime tax bars,
years-to-ruin table, optimal leave-UK stat, ending wealth stat,
UK success-by-strategy bars, sequence-risk correlation table)
- monitoring/modules/monitoring/grafana.tf: register "fire-planner.json" in Finance folder
Apply order:
1. vault stack — creates the static role
2. dbaas stack — creates the database & role
3. external-secrets stack picks up vault-database refs (no change needed)
4. fire-planner stack — first apply with -target=kubernetes_manifest.db_external_secret
before full apply, per the plan-time-data-source pattern
5. monitoring stack — picks up the new dashboard ConfigMap
[ci skip]
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirrors Wealthfolio's daily_account_valuation / accounts / activities
from SQLite into a new PG database (wealthfolio_sync) every hour, so
Grafana can chart net worth, contributions, and growth over time.
Components:
- dbaas: null_resource creates wealthfolio_sync DB + role on the CNPG
cluster (dynamic primary lookup so it survives failover).
- vault: pg-wealthfolio-sync static role rotates the password every 7d.
- wealthfolio: ExternalSecret pulls the rotated password into the WF
namespace; new pg-sync sidecar (alpine + sqlite + postgresql-client +
busybox crond) does sqlite3 .backup → TSV dump → truncate-and-reload
psql, hourly at :07. Plus a grafana-wealth-datasource ConfigMap in
the monitoring namespace (uid: wealth-pg).
- monitoring: new Wealth dashboard (wealth.json, 10 panels) — current
net worth / contribution / growth / ROI% stats, then time-series
for net worth, contribution-vs-market, growth area, per-account
stacked area, cash-vs-invested, and a 100-row activity log.
Initial sync: 6 accounts, 10,798 daily valuations, 518 activities.
Verified PG totals match SQLite latest snapshot exactly.
New panel 16 (barchart, h=11, y=179): one stacked bar per tax year showing
total comp split into net pay (bank deposit), cash income tax, RSU tax
(band-aware marginal: PAYE+NI), cash NI, student loan, pension salary-
sacrifice, and RSU offset (Variant A only).
X-axis = tax_year (categorical), y-axis = currencyGBP. Bar height ≈
gross_pay + pension_sacrifice (small over-attribution in Variant A years
where the band-aware model exceeds recorded payslip PAYE).
Replaces the flat 47% (45 PAYE + 2 NI) RSU marginal across panels 3, 7, 8, 11,
and 12 with an exact piecewise band-aware computation. Each row computes
ani_prior/ani_pre/ani_post over the tax-year YTD (chronological model — the
RSU is taxed at the band its YTD ANI position occupies at the vest date,
mirroring PAYE withholding behaviour).
Bands (2024/25+, applied to all years):
IT: 0% / 20% / 40% / 60% (PA-taper) / 45% at 12,570 / 50,270 / 100k / 125,140
NI: 0% / 8% / 2% at 12,570 / 50,270
PA-taper modelled as 60% effective IT marginal in £100k–£125,140
(40% on the £1 + 40% on the £0.50 of lost PA = 60%).
Spot-checked per tax-year totals via psql; numbers diverge from the flat
47% baseline most for years where vests cross PA-taper or basic-rate bands
(2020/21 ~35%, 2024/25 ~41%, 2025/26 ~43%).
Drop the two misleading series in "Effective rate & take-home % (YTD
cumulative)" — both used SUM(gross_pay) as denominator while only
counting cash deductions/net in the numerator, which understated
take-home by 25-30 pp because RSU shares are absent from the cash
deposit but present in gross. Replaced with three semantically clean
angles:
- ytd_paye_rate_pct: SUM(income_tax) / SUM(taxable_pay) — HMRC audit
rate (~41-42% in additional-rate band), kept as before.
- ytd_cash_take_home_pct: SUM(net_pay) / SUM(gross_pay - rsu_vest) —
what fraction of cash earnings hits the bank (~62-65%).
- ytd_total_keep_pct: (SUM(net_pay) + 0.53 × SUM(rsu_vest)) /
SUM(gross_pay) — true "what I actually keep" including post-tax RSU
shares (47% marginal applied to vest value), ~55-60%.
Added field overrides for clear color-coding (red/green/blue).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>