Commit graph

3045 commits

Author SHA1 Message Date
Viktor Barzin
2722260ce9 monitoring(wealth): unbreak timeseries SQL — over-escaped time alias
Fix: panels 5–9 had `AS \"time\"` (literal backslash-quote sequence
embedded in the SQL string). PostgreSQL parsed that as a syntax error
at the leading backslash:

  ERROR:  syntax error at or near "\"
  LINE 1: ...complete_dates)) SELECT valuation_date::timestamp AS \"time\"

Root cause: the patch script for the skew-resilient queries (commit
628f5a0d) used a Python f-string with `\\\"time\\\"`, which produces
a literal backslash-quote in the Python string. When that string
was JSON-encoded the backslash was preserved verbatim instead of
collapsed to plain `"time"`.

Replaces all five occurrences with the correct `AS "time"` form.
Verified the corrected query against PG returns 7 daily net-worth
rows for 04-25..05-01 as expected.
2026-05-01 16:19:07 +00:00
Viktor Barzin
d67416d4ca monitoring(wealth): tighten default time range, bump decimals for granularity
Two adjustments to make daily movements visible:

1. Default time range: now-5y → now-180d. The timeseries charts (Net
   worth, Net contribution vs market value, Growth, Per-account
   stacked, Cash vs invested) auto-fit their y-axis to the data range
   in view. Over 5 years, daily £1k–£10k moves are ~1% of axis range
   and visually invisible against the cumulative trend. Over 6
   months, the same daily moves dominate. Yearly bar charts (12, 13)
   are unaffected — they aggregate by calendar year and don't filter
   on $__timeFilter.

2. Decimals → 2 on every currency panel (1, 2, 3, 5–9, 13, 15, 16)
   and every percent panel (4, 14). Stat panels now show pennies on
   currency and 0.01% on rates; chart y-axis ticks are likewise more
   precise. Honest caveat: pennies on a £1M number don't make the
   absolute readout easier — to see "today changed by £8,358" cleanly
   we'd want a dedicated delta panel; pending user direction.

Widen the time picker manually to recover the 5-year view; default
just zooms into the last 6 months.
2026-05-01 16:15:39 +00:00
Viktor Barzin
628f5a0d26 monitoring(wealth): skew-resilient queries, no more partial-day dips
Bug witnessed 2026-05-01: dashboard "Net worth (current)" showed £88k
instead of £1.03M because at 02:00 UTC an external trigger refreshed
ONE account (Trading212 ISA), creating its 05-01 daily_account_valuation
row. The 5 other accounts still had their last row at 04-30. The panel
SQL `WHERE valuation_date = (SELECT MAX(valuation_date))` then summed
only the single account that had a 05-01 row.

Two new SQL patterns adopted across all 15 affected panels:

  1. Stat / barchart "current snapshot" panels (1, 2, 3, 4, 11, 14, 15,
     16): latest-per-account stitching —
       WITH latest AS (SELECT DISTINCT ON (d.account_id) ...
                       FROM daily_account_valuation d
                       JOIN accounts a ON a.id = d.account_id
                       ORDER BY d.account_id, d.valuation_date DESC)
     gives a coherent "now" snapshot regardless of refresh skew, and
     the inner join filters out orphan/deleted accounts (one such was
     adding a stale £33k from 04-17). 12-month panels add a parallel
     `ago` CTE picking each account's row closest to (d_now - 12mo).

  2. Time-series / yearly panels (5, 6, 7, 8, 9, 12, 13): complete-days-
     only filter —
       WITH active_accounts AS (SELECT COUNT(*) FROM accounts),
            complete_dates AS (SELECT valuation_date
                               FROM daily_account_valuation d
                               JOIN accounts a ON a.id=d.account_id
                               GROUP BY valuation_date
                               HAVING COUNT(*) >= active.n)
     so a partial today never renders as a chart dip. The day rejoins
     the chart automatically once the daily 16:00 UTC sync writes rows
     for every account.

Verified end-to-end against live PG: new queries produce £1,033,734
(matches the 6 active accounts' true latest sum) where the old query
gave £88k.
2026-05-01 16:08:18 +00:00
Viktor Barzin
1d3ae01aac wealthfolio(daily-sync): API call CronJob, replaces rollout-restart
Restart-only didn't refresh the wealth Grafana dashboard — verified
empirically: a fresh `daily_account_valuation` row only lands when a
PortfolioJob runs with ValuationRecalcMode != None, and Wealthfolio's
internal schedulers don't trigger that path:

  - 6h quotes scheduler refreshes the `quotes` table only.
  - 4h broker scheduler short-circuits on missing `sync_refresh_token`.

The right knob is `POST /api/v1/market-data/sync`. Replaced the
rollout-restart CronJob (+ its SA/Role/RoleBinding) with a curl-based
CronJob that logs in (`POST /api/v1/auth/login`) then POSTs to
`/api/v1/market-data/sync` with the session cookie. Backfills missing
days via IncrementalFromLast in one call.

Schedule 16:00 UTC (= 17:00 BST):
  * After UK market close (15:30 UTC BST), EOD UK prices settled.
  * US market open ~2.5h, intra-day US quotes fresh.
  * pg-sync next :07 tick mirrors → Grafana refresh ≤5m → fresh data
    by ~17:12 BST, comfortably before the 18:00 BST target.

Plaintext password lives in Vault `secret/wealthfolio.web_password`,
flows via the existing `dataFrom.extract` ExternalSecret — no extra
ESO wiring needed. Verified end-to-end: API call backfilled 04-26
through 04-29, pg-sync mirrored, PG now shows rows up to today.
2026-04-29 21:21:24 +00:00
Viktor Barzin
31b9e5d4a9 monitoring(wealth): add 12mo contrib + 12mo gain to top row
Top row goes from 5 → 7 stat panels (widths 4+4+4+3+3+3+3=24):

- Net worth, Net contribution, Growth shrink from w=5 to w=4.
- ROI % shrinks from w=5 to w=3 (now sits at x=12).
- 12mo return slides from x=20/w=4 to x=15/w=3.
- New: 12mo contrib (id=15, currency, blue) at x=18 — net contributions
  added in the trailing 12 months.
- New: 12mo gain (id=16, currency, red/green) at x=21 — pure market gain
  in £ over the trailing 12 months (12mo Δnet-worth − 12mo contribs).

Live values verified against PG: contrib_12mo=£245k, gain_12mo=£172k,
sum = £417k = nw_now − nw_ago, return = 23.51%.
2026-04-27 06:32:53 +00:00
Viktor Barzin
cd96fb64a8 phpipam-pfsense-import: every 5min → hourly
Reduces 5-min disk-write spikes on PVE sdc. The cronjob was the
heaviest single contributor in our hourly fan-out investigation
(11.2 MB/s burst when it fired). Kea DDNS still handles real-time
DNS auto-registration; phpIPAM inventory just lags by up to 1h,
which we don't need fresher.

Docs (dns.md, networking.md, .claude/CLAUDE.md) updated to match.
2026-04-26 22:48:43 +00:00
Viktor Barzin
6ad5292128 immich: bump server to 8Gi + override tier-2-gpu quota to 20Gi
Eliminates the OOM-on-face-detection-burst class of incidents (2026-04-26).
VPA upper for immich-server is 2.98Gi steady-state; the prior 4Gi limit was
1.34x upper and still got SIGKILL'd when face-detection bursts pushed
transient RSS past 4Gi. 8Gi gives 2.7x VPA upper headroom.

The kyverno tier-2-gpu default quota is 12Gi requests.memory which can't fit
8Gi (server) + 3.5Gi (ML) + 3Gi (PG) + backup CronJobs simultaneously. Opts
the namespace into the kyverno custom-quota exclude rule and overrides with
20Gi (~4.5Gi headroom) — same pattern as woodpecker/nvidia.
2026-04-26 20:02:28 +00:00
Viktor Barzin
d093aed7f6 immich(server,ml): bump server to 4Gi + Recreate strategy on tight quota
Root cause of 502/503/decode errors clustered at 19:20 BST 2026-04-26: immich-server
hit its 3500Mi memory limit during a face-detection burst and was OOMKilled (Exit Code
137). VPA upperBound is 3050Mi but real-world bursts crossed it; with the single pod
running both API and microservices workers, the OOM took the API down for ~30s of
restart, surfacing as PlatformException image decode + 502 on uploads + 503 on
ActivityService to the iOS app.

Bump immich-server requests=limits to 4096Mi (per CLAUDE.md "upperBound x 1.3 for
volatile workloads" rule, with headroom over the OOM mark). Quota math: 9680Mi used -
2000Mi old req + 4096Mi new req = 11776Mi, fits the tier-2-gpu 12Gi cap.

Switch both immich-server and immich-machine-learning to Recreate strategy: the
namespace tier-2-gpu quota is too tight for RollingUpdate to keep an old + new pod up
during apply (transient 13776Mi > 12Gi cap, see "ResourceQuota blocks rolling updates"
in CLAUDE.md). With single replicas and Recreate, future memory tweaks no longer
require manual scale-to-0 dance.

Verified: new pod has limits.memory=4Gi, quota usage stable at 11776Mi/12Gi, immich
API serving normally.

Note: a pending node_selector drift on immich-machine-learning (gpu=true ->
nvidia.com/gpu.present=true) also reconciled in this apply; the canonical NVIDIA
operator label already on the GPU node, no scheduling impact.
2026-04-26 19:11:50 +00:00
Viktor Barzin
07bc0098e3
ci(woodpecker): show full terraform error on stack apply failure
The default workflow truncated the failed-stack output at `tail -5`,
which only captured the trailing source-line indicator (`│  45: resource
…`) and dropped the actual `Error: …` line above it. Bump to `tail -50`
so the real error is visible without re-running locally to reproduce.

Also fix the pre-warm step's FIRST_STACK detection — `head -1 file1
file2 | head -1` returns the file header (`==> .platform_apply <==`),
not the first stack name, so the cd then fails with "no such file or
directory". Use `cat | head -1` instead.

Pure logging-and-pre-warm change; no stacks touched, so this commit is
a no-op for the apply step.
2026-04-26 18:39:46 +00:00
Viktor Barzin
215717c90f monitoring(dashboards): tables at the bottom convention
wealth: move Activity log table from y=45 to y=77; the three barcharts
(Yearly return, Annual change, Per-account ROI) shift up by 14 to fill
the gap.

uk-payslip: move Sankey "where the money went" from y=80 to y=48 (right
above the table block); the three tables (Data integrity, All payslips,
YTD reconciliation) shift down by 14 so all four tables (4, 5, 6, 9) sit
contiguously at the bottom.

fire-planner and job-hunter still have intentional side-by-side
table/chart pairings; left untouched pending user direction on whether
to break them.
2026-04-26 18:30:52 +00:00
Viktor Barzin
bb28485ce0 monitoring(wealth): move 12mo return to top bar, shrink to w=4
Trailing 12-month investment return % was a full-width stat at y=59.
Now sits inline with Net worth / Contribution / Growth / ROI as the
fifth headline number — top-row stats reflowed from w=6 (×4) to w=5
(×4) + w=4 (×1). Title shortened to "12mo return" so it fits.
Panels below the old row shifted up by 4 rows to close the gap.
2026-04-26 18:19:24 +00:00
Viktor Barzin
532285e48c traefik: raise websecure idleTimeout 180s -> 600s for iOS Immich -1005
iOS NSURLSession held a dead TCP/TLS socket past Traefik's 180s idle close,
then errored with NSURLErrorDomain -1005 on the next thumbnail. Bumping the
timeout to 600s pushes the bug to "app idle for >10 min" -- much rarer in
normal use. Verified with /home/wizard/.claude/immich-scroll-sim.py
keepalive probe: 200s idle, mean reuse latency +1.8ms over warmup (was ~50ms
TLS handshake penalty before). Synthesis: ~/.claude/immich-debug/synthesis.md.
2026-04-26 12:32:05 +00:00
Viktor Barzin
3489621a45 nextcloud(backup): pin backup pod to nextcloud's node via podAffinity
The weekly backup mounts the same RWO PVC (proxmox-lvm-encrypted) as the
main nextcloud deployment. Single-node attach — the backup pod can never
mount the volume if it lands on a different node, and was stuck in
ContainerCreating for 6+ hours when cron fired today.

Add pod_affinity (required, hostname topology) so the backup co-locates
with the nextcloud app pod. Discovered via cluster-health probe; manual
verify run scheduled on k8s-node3 next to nextcloud's pod and completed
the rsync in seconds.
2026-04-26 11:03:20 +00:00
Viktor Barzin
a24cd7ceb7 monitoring(uk-payslip): yearly receipt aligns with P60 (RSU gross)
Switch the RSU stack from "after band-aware tax" to gross. Receipt
total is now pre-sacrifice gross compensation; bar − pension stack
≈ ytd_gross reported on the final March payslip / P60.

Verified alignment for 2025/26: bar−pension = £266,752 vs P60
ytd_gross = £268,127 — gap of £1,375 ≈ "other taxable" (benefits,
overtime). Remaining year-level gaps are upstream parser/ingest
issues, not dashboard logic:
  - 2024/25 +£27k: March 2025 payslip parsed bonus=£26,969 but never
    propagated it into gross_pay/income_tax. Receipt is more
    accurate than ytd_gross here.
  - 2023/24 −£36k: Feb 2024 payslip row appears to be missing from
    the table; ytd_gross has it, sum(gross_pay) doesn't.
  - 2022/23 −£10k: variant A→B transition residual.

SQL simplified — band-aware CTE chain dropped (no longer needed for
this panel since RSU is shown gross).
2026-04-26 10:24:06 +00:00
Viktor Barzin
d0152e1f38 crowdsec/traefik: stop captchaing legit Immich mobile bursts
Mobile timeline scrubs prefetch ~100 thumbs in <1s, which exhausted the
immich-rate-limit (avg=500, burst=5000) and produced a cascade of HTTP
429s. CrowdSec's local http-429-abuse scenario then fired captcha:1 on
the source IP (alert #291, 2026-04-25 — owner's Hyperoptic IPv6).

Two changes:
- crowdsec: add a second whitelist doc (viktor/immich-asset-paths-whitelist)
  filtering events by Immich asset paths so they never feed leaky buckets.
  Auth endpoints intentionally excluded — brute-force protection unchanged.
- traefik: raise immich-rate-limit avg=500->1000, burst=5000->20000 so
  legitimate mobile scrubs don't produce 429s in the first place.
2026-04-26 09:27:16 +00:00
Viktor Barzin
222013806d monitoring(uk-payslip): split salary into cash + pension on yearly receipt
The salary field on the payslip is pre-pension-sacrifice, so the
"Salary (gross)" stack already silently included the salary-sacrifice
pension contribution. Split it out so pension is explicitly visible:
- Salary (cash, post-sacrifice) = salary - pension_sacrifice
- Pension (salary sacrifice, untaxed) = pension_sacrifice
- Bonus
- RSU vest (after band-aware tax)

Bar total unchanged (just relabels what was already there). Pension
is now visibly counted as income — consistent with "untaxed but real"
framing.

Caveat documented in panel description: receipt total ≠ P60 gross
because P60 reports pre-RSU-tax gross. Receipt shows RSU net of tax
per earlier intent. To exactly match P60, swap rsu_after_tax →
rsu_vest gross.
2026-04-26 09:18:32 +00:00
root
423aac0908 Woodpecker CI Update TLS Certificates Commit 2026-04-26 00:03:26 +00:00
Viktor Barzin
21ac619fac monitoring(uk-payslip): promote yearly receipt + YTD gross YoY to row 4
Move both barchart/timeseries panels into row 4 (y=29, side-by-side
w=12 each, h=10) so the per-tax-year overviews appear right after
the income-tax-and-pension YTD row. Shift panels 13, 4, 5, 6, 8, 9
down by 10 to accommodate.

Final ordering: rows 1–3 = monthly + YTD timeseries (panels 1/7/2/3/11/12),
row 4 = yearly receipt + YTD gross YoY (16/17), then the wider
deduction/integrity/table panels below.
2026-04-25 23:58:15 +00:00
Viktor Barzin
53f555dc61 monitoring(uk-payslip): drop 3 panels referencing undeployed data
Removed:
- Panel 10 "HMRC Tax Year Reconciliation — Individual Tax API"
  → references hmrc_sync.tax_year_snapshot schema. The hmrc-sync
    service / DB has not been deployed, so the panel always errored
    with "relation does not exist".
- Panel 14 "Meta payroll: bank deposit vs payslip net pay"
  → references payslip_ingest.external_meta_deposits, which is
    created by alembic migration 0007. The deployed payslip-ingest
    image is at 0005, so the table doesn't exist.
- Panel 15 "RSU vest reconciliation — payslip vs Schwab"
  → references payslip_ingest.rsu_vest_events, created by migration
    0008. Same image-staleness story.

Verified all 14 remaining panels return without error via Grafana
/api/ds/query. SQL for the removed panels is preserved in git history;
re-add when the data sources are actually deployed.
2026-04-25 23:56:03 +00:00
Viktor Barzin
b2a25775aa monitoring(uk-payslip): simplify yearly receipt to earned-and-kept view
Replace the 7-stack "where total comp went" decomposition with a 3-stack
"what I actually earned" view: salary (gross), bonus (gross), and RSU
vest after band-aware tax (PAYE+NI withheld via sell-to-cover). Skips
income tax / NI / student loan / pension / RSU offset.

Bar height = real income kept across all components. RSU is net of tax
because it's withheld at source and never hits the bank account; salary
and bonus are gross because they're paid in full and taxes are deducted
elsewhere. This is the income-side view where tax is implicit, not the
deduction waterfall.

Per-year RSU after tax: 2020/21 £18k · 2021/22 £39k · 2022/23 £50k ·
2023/24 £26k · 2024/25 £71k · 2025/26 £73k.
2026-04-25 23:42:20 +00:00
Viktor Barzin
a17304f735 monitoring(uk-payslip): fix empty YTD gross YoY chart
Two bugs:
1. Synthetic dates projected onto 1970/71 fell outside the dashboard's
   default time range (now-10y → now), so Grafana filtered out every
   point. Switched to a sliding 12-month window
   (CURRENT_DATE - INTERVAL '12 months') as the projection base, plus
   a per-panel timeFrom: "13M" override so the panel always shows the
   last 13 months regardless of the dashboard's time picker.
2. ORDER BY tax_year, pay_date violated Grafana's long→wide conversion
   requirement (data must be ascending by time). Wrapped in a CTE and
   re-ordered by the synthetic time column. Pivoted result is now a
   single wide frame with 7 series (2019/20…2025/26).
2026-04-25 23:36:16 +00:00
Viktor Barzin
ac18c49a7b monitoring(wealth): fix x-axis label formatting on yearly bars
The default fieldConfig unit (percent on Yearly investment return %,
currencyGBP on Annual change decomposition) was being applied to the
"year" string column too — so x-axis labels rendered as "2024%" and
"£2,024" respectively. Add field overrides on the "year" column to
force unit=string. The earlier "tax_year" panels weren't affected
because "2024/25" doesn't parse as a number; "2024" did.
2026-04-25 23:31:03 +00:00
Viktor Barzin
77bed10a51 monitoring: investment-only returns + YoY YTD gross line chart
Wealth dashboard:
- "Yearly growth %" → "Yearly investment return %": switched to
  modified-Dietz formula `market_gain / (nw_start + 0.5 × contributions)`
  so contributions don't inflate the return. New money in is excluded —
  this is portfolio performance, not net-worth change.
- "Trailing 12-month growth %" → "Trailing 12-month investment return %":
  same formula, applied to the trailing 12mo window.

Pre-fix vs post-fix:
  2020: 155.0% → 5.12%   (large contributions on small base)
  2021: 344.7% → 26.45%
  2022: 26.9%  → -25.65% (the actual 2022 bear market)
  2023: 123.2% → 41.60%
  2024: 87.4%  → 25.70%
  2025: 46.8%  → 8.43%
  2026: 16.7%  → 3.28%   (YTD)

UK Payslip dashboard:
- Replaced the per-tax-year stacked bar with a year-over-year line chart:
  one line per tax year, X = month-of-tax-year (April→March, projected
  onto a 1970/71 fiscal calendar so years overlay), Y = cumulative YTD
  gross. Five+ lines visible at a glance for trend comparison.
2026-04-25 23:25:42 +00:00
Viktor Barzin
55d1da41f6 monitoring: more growth detail in Wealth + gross composition in UK Payslip
Wealth (4 new panels at the bottom):
- Trailing 12-month growth % (stat) — % change in net worth over last 12mo.
- Yearly growth % (bar per calendar year) — first→last valuation each year.
- Annual change decomposition (stacked bar) — splits each year's NW change
  into "net contributions" (new money in) and "market gain" (everything
  else: appreciation, dividends, FX). Answers "did I grow because I saved
  or because the market did the work?".
- Per-account ROI % (horizontal bar) — (value − contribution) / contribution
  × 100, latest snapshot. Excludes accounts with zero/negative net
  contribution (Schwab — distorts ratio after RSU sells).

UK Payslip (1 new panel below the yearly receipt):
- Gross composition by tax year (stacked bar) — salary / bonus / RSU vest /
  other components per tax year. Bar height = gross pay. Trends in salary
  growth, bonus levels, and RSU vest sizing at a glance.

All queries spot-checked via Grafana /api/ds/query.
2026-04-25 23:21:42 +00:00
Viktor Barzin
d48e222054 monitoring: lock Finance (Personal) folder to admin + fix cash classification
Folder ACL:
- Move uk-payslip + wealth dashboards to a new "Finance (Personal)"
  folder; job-hunter + fire-planner stay in "Finance" (open).
- New null_resource calls Grafana's folder permissions API after the
  dashboard sidecar materialises the folder, setting an admin-only
  ACL ({Admin: 4}). Default Viewer/Editor inheritance is overridden,
  so anonymous-Viewer (auth.anonymous=true) is denied. Server-admin
  always retains access.
- Verified: anonymous → 403 on uk-payslip + wealth, 200 on
  control dashboards (node-exporter); admin → 200 on all.

Wealth cash fix:
- Wealthfolio dumps WORKPLACE_PENSION wrappers entirely into
  cash_balance because it doesn't track underlying fund holdings.
  Reclassify pension cash as invested in the "Cash vs invested"
  panel so the cash series reflects actual uninvested broker cash
  (~£16k T212 ISA + Schwab) instead of phantom £154k.

  Pre-fix:  cash=£153,789 / invested=£870,282 / total=£1,024,071
  Post-fix: cash=£16,064  / invested=£1,008,008 / total=£1,024,071
2026-04-25 23:11:26 +00:00
Viktor Barzin
51bf38815c vault: record Phase 3 vault Released-PV cleanup
Deleted the 6 NFS PVs orphaned by the Phase 2 rolling and removed
their /srv/nfs/<dir> subtrees on the PVE host (~1.5 GB; vault-2 audit
log was 1.4 GB on its own). Cluster-wide Released-PV sweep on the
proxmox-lvm/encrypted side stays out of scope.
2026-04-25 23:08:45 +00:00
Viktor Barzin
498400173c wealthfolio-sync: skip the synthetic TOTAL row in ETL
Wealthfolio's daily_account_valuation includes a row with
account_id='TOTAL' that pre-aggregates the per-account values for that
day. Mirroring it into PG verbatim caused every SUM(total_value) in
the Wealth dashboard to double-count (showing ~£2M against actual
~£1M). Drop the synthetic row at the dump step so the PG mirror only
holds real-account rows.

Initial sync after fix: 8,649 DAV rows (was 10,798), net worth resolves
to £1,024,071 — matches the per-account latest snapshot.
2026-04-25 22:59:24 +00:00
Viktor Barzin
f0ce7b0363 fire-planner: add stack, Vault DB role, dashboard, DB
New stacks/fire-planner/ mirrors payslip-ingest layout:
- ExternalSecret pulling RECOMPUTE_BEARER_TOKEN from Vault secret/fire-planner
- DB ExternalSecret templating DB_CONNECTION_STRING via static role pg-fire-planner
- FastAPI Deployment (serve), CronJob (recompute-all monthly on 2nd at 09:00 UTC,
  scheduled after wealthfolio-sync's 1st at 08:00), ClusterIP Service
- Grafana datasource ConfigMap "FirePlanner" — `database` inside jsonData
  (cc56ba29 fix; otherwise Grafana 11.2+ hits "you do not have default database")

Plus:
- vault/main.tf: pg-fire-planner static role (7d rotation), allowed_roles
- dbaas/modules/dbaas/main.tf: null_resource creates fire_planner DB+role
- monitoring/dashboards/fire-planner.json: 9-panel Finance-folder dashboard
  (NW timeseries, MC fan chart, success heatmap, lifetime tax bars,
  years-to-ruin table, optimal leave-UK stat, ending wealth stat,
  UK success-by-strategy bars, sequence-risk correlation table)
- monitoring/modules/monitoring/grafana.tf: register "fire-planner.json" in Finance folder

Apply order:
  1. vault stack — creates the static role
  2. dbaas stack — creates the database & role
  3. external-secrets stack picks up vault-database refs (no change needed)
  4. fire-planner stack — first apply with -target=kubernetes_manifest.db_external_secret
     before full apply, per the plan-time-data-source pattern
  5. monitoring stack — picks up the new dashboard ConfigMap

[ci skip]

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 17:27:19 +00:00
Viktor Barzin
484b4c7190 vault: complete Phase 2 NFS-hostile migration; remove nfs-proxmox SC
All 3 vault voters now on proxmox-lvm-encrypted (vault-0 16:18, vault-1
+ vault-2 today). The NFS fsync incompatibility identified in the
2026-04-22 raft-leader-deadlock post-mortem is no longer reachable —
raft consensus log + audit log live on LUKS2 block storage with real
fsync semantics.

Cluster-wide consumers of the inline kubernetes_storage_class.nfs_proxmox
dropped to zero after the rolling, so the resource is removed from
infra/stacks/vault/main.tf. Released NFS PVs (6) remain in the cluster
and will be reclaimed in Phase 3 cleanup.

Lesson learned (recorded in plan): pvc-protection finalizer races the
StatefulSet controller — pod recreates on the OLD PVCs unless the
finalizer is patched out before pod delete. Force-finalize technique
applied to vault-1 + vault-2 successfully.

Closes: code-gy7h
2026-04-25 17:10:00 +00:00
Viktor Barzin
df2fa0a31d state(vault): update encrypted state 2026-04-25 17:09:35 +00:00
Viktor Barzin
bf4c7618d8 wealth: SQLite→PG ETL sidecar + new Grafana dashboard
Mirrors Wealthfolio's daily_account_valuation / accounts / activities
from SQLite into a new PG database (wealthfolio_sync) every hour, so
Grafana can chart net worth, contributions, and growth over time.

Components:
- dbaas: null_resource creates wealthfolio_sync DB + role on the CNPG
  cluster (dynamic primary lookup so it survives failover).
- vault: pg-wealthfolio-sync static role rotates the password every 7d.
- wealthfolio: ExternalSecret pulls the rotated password into the WF
  namespace; new pg-sync sidecar (alpine + sqlite + postgresql-client +
  busybox crond) does sqlite3 .backup → TSV dump → truncate-and-reload
  psql, hourly at :07. Plus a grafana-wealth-datasource ConfigMap in
  the monitoring namespace (uid: wealth-pg).
- monitoring: new Wealth dashboard (wealth.json, 10 panels) — current
  net worth / contribution / growth / ROI% stats, then time-series
  for net worth, contribution-vs-market, growth area, per-account
  stacked area, cash-vs-invested, and a 100-row activity log.

Initial sync: 6 accounts, 10,798 daily valuations, 518 activities.
Verified PG totals match SQLite latest snapshot exactly.
2026-04-25 17:07:33 +00:00
Viktor Barzin
7dd580972a state(vault): update encrypted state 2026-04-25 16:57:42 +00:00
Viktor Barzin
ac8d2f548b paperless-ngx: migrate to proxmox-lvm-encrypted
Document scans (receipts, contracts, IDs) are unambiguously sensitive
PII. Storage decision rule defaults sensitive data to
`proxmox-lvm-encrypted`, but paperless-ngx had been left on plain
`proxmox-lvm` by an abandoned migration attempt that left a dormant,
non-Terraform-managed encrypted PVC sitting unbound for 11 days.

Cleaned up the orphan, added the encrypted PVC properly via Terraform,
rsynced data with deployment scaled to 0, swapped claim_name. Plain
`proxmox-lvm` PVC retained for a 7-day soak before removal.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 16:48:53 +00:00
Viktor Barzin
4f5f1ff8c2 monitoring(uk-payslip): add yearly receipt stacked barchart panel
New panel 16 (barchart, h=11, y=179): one stacked bar per tax year showing
total comp split into net pay (bank deposit), cash income tax, RSU tax
(band-aware marginal: PAYE+NI), cash NI, student loan, pension salary-
sacrifice, and RSU offset (Variant A only).

X-axis = tax_year (categorical), y-axis = currencyGBP. Bar height ≈
gross_pay + pension_sacrifice (small over-attribution in Variant A years
where the band-aware model exceeds recorded payslip PAYE).
2026-04-25 16:26:57 +00:00
Viktor Barzin
288efa89b3 vault: migrate vault-0 storage to proxmox-lvm-encrypted
Phase 2 of the NFS-hostile migration: data + audit storageClass on
the vault helm release switches from nfs-proxmox to
proxmox-lvm-encrypted, then per-pod rolling swap (24h soak between).

vault-0 swap done. vault-1 + vault-2 still on NFS — the rolling part
is what makes this safe (raft quorum maintained by 2 healthy pods
while one is replaced).

Also restores chart-default pod securityContext fields. The previous
`statefulSet.securityContext.pod = {fsGroupChangePolicy = "..."}`
block REPLACED (not merged) the chart's defaults — fsGroup,
runAsGroup, runAsUser, runAsNonRoot were all silently dropped. NFS
exports were permissive enough to mask the missing fsGroup; ext4 LV
volume root is root:root and the vault user (UID 100) couldn't open
vault.db, CrashLoopBackOff. Fix: provide all five fields explicitly,
survives future chart bumps. vault-1 and vault-2 retained their
correct securityContext from when their pod specs were written to
etcd, before the partial customization landed — the bug only surfaces
when a pod is recreated.

Pre-flight raft snapshot saved at /tmp/vault-pre-migration-*.snap
(recovery anchor).

Refs: code-gy7h

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 16:19:49 +00:00
Viktor Barzin
08b13858dd state(vault): update encrypted state 2026-04-25 16:16:35 +00:00
Viktor Barzin
b3c29eda12 monitoring(uk-payslip): model UK income-tax bands + PA-taper for RSU marginal
Replaces the flat 47% (45 PAYE + 2 NI) RSU marginal across panels 3, 7, 8, 11,
and 12 with an exact piecewise band-aware computation. Each row computes
ani_prior/ani_pre/ani_post over the tax-year YTD (chronological model — the
RSU is taxed at the band its YTD ANI position occupies at the vest date,
mirroring PAYE withholding behaviour).

Bands (2024/25+, applied to all years):
  IT:  0% / 20% / 40% / 60% (PA-taper) / 45%   at 12,570 / 50,270 / 100k / 125,140
  NI:  0% / 8% / 2%                            at 12,570 / 50,270

PA-taper modelled as 60% effective IT marginal in £100k–£125,140
(40% on the £1 + 40% on the £0.50 of lost PA = 60%).

Spot-checked per tax-year totals via psql; numbers diverge from the flat
47% baseline most for years where vests cross PA-taper or basic-rate bands
(2020/21 ~35%, 2024/25 ~41%, 2025/26 ~43%).
2026-04-25 16:14:49 +00:00
Viktor Barzin
3f85cee1ef state(vault): update encrypted state 2026-04-25 16:08:38 +00:00
Viktor Barzin
43e4f3f68e immich: migrate PostgreSQL off NFS to proxmox-lvm-encrypted
Live PG data moves to a 10Gi LUKS-encrypted RWO PVC. WAL fsync per
commit on NFS contributed to the 2026-04-22 NFS writeback storm
(2h43m recovery, 3 of 4 nodes hard-reset). Backups remain on NFS
(append-only, NFS-tolerant).

The init container that writes postgresql.override.conf is now gated
on PG_VERSION presence — on a fresh PVC the file would otherwise make
initdb refuse the non-empty PGDATA. First boot skips the override and
initdb's cleanly; second boot (after a forced restart) writes the
override so vchord/vectors/pg_prewarm load before the dump restore.
Idempotent on initialised PVCs.

Migration executed: pg_dumpall (1.9GB) → restore on encrypted PVC →
REINDEX clip_index/face_index → 111,843 assets verified, external
HTTP 200, all 10 extensions present (vector minor 0.8.0→0.8.1 only).
LV created on PVE host, picked up by lvm-pvc-snapshot.

See docs/plans/2026-04-25-nfs-hostile-migration-{design,plan}.md.
Phase 2 (Vault Raft) follows under code-gy7h.

Closes: code-ahr7

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 15:47:30 +00:00
Viktor Barzin
0d5f53f337 monitoring(uk-payslip): replace misleading take-home rates in Panel 3
Drop the two misleading series in "Effective rate & take-home % (YTD
cumulative)" — both used SUM(gross_pay) as denominator while only
counting cash deductions/net in the numerator, which understated
take-home by 25-30 pp because RSU shares are absent from the cash
deposit but present in gross. Replaced with three semantically clean
angles:

- ytd_paye_rate_pct: SUM(income_tax) / SUM(taxable_pay) — HMRC audit
  rate (~41-42% in additional-rate band), kept as before.
- ytd_cash_take_home_pct: SUM(net_pay) / SUM(gross_pay - rsu_vest) —
  what fraction of cash earnings hits the bank (~62-65%).
- ytd_total_keep_pct: (SUM(net_pay) + 0.53 × SUM(rsu_vest)) /
  SUM(gross_pay) — true "what I actually keep" including post-tax RSU
  shares (47% marginal applied to vest value), ~55-60%.

Added field overrides for clear color-coding (red/green/blue).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 15:45:47 +00:00
Viktor Barzin
8f0d13282c monitoring(uk-payslip): drop cash PAYE/NI from "Tax & pension — monthly"
Same reasoning as panel 2: cash-side income_tax and NI are inherently
bumpy in vest months due to UK cumulative PAYE catching up on YTD,
and the flat-47% strip can't fix it. Panel now shows only the
explicit RSU vest tax (orange, 47% × rsu_vest), student loan, and
pensions. The smooth view of total cash deductions stays available on
panel 12 (YTD cumulative).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 15:43:32 +00:00
Viktor Barzin
2230cb6cf4 monitoring(uk-payslip): drop tax/NI from "Monthly cash flow (RSU stripped)" panel
Vest months still bumped 4-5x in this panel after the flat-47% strip
because UK cumulative PAYE genuinely catches up YTD tax in vest
months, on top of the marginal RSU portion — no arithmetic split can
make that line flat without distorting the data. The cash-flow
question this panel answers (what hits the bank, RSU aside) is
already covered cleanly by cash_gross + net_pay; the tax detail lives
on Panel 11 where the RSU split is now linear.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 15:30:46 +00:00
Viktor Barzin
cb3ffa6d8d monitoring(uk-payslip): smooth quarterly RSU tax bumps via flat 47% marginal
Replace the implicit pro-rata RSU/cash split with an explicit flat
47% marginal (45% PAYE + 2% NI) for the RSU vest tax stack. The orange
slice now scales linearly with rsu_vest instead of wobbling around the
month's effective PAYE rate; cash PAYE/NI slices have those amounts
subtracted out so the stack still totals to actual deductions.

Affects panel 7 (monthly), panel 12 (YTD cumulative), panel 7
(YTD uses), and the Sankey panel. Verified on 35 months of live data:
sum invariant holds exactly (cash + rsu_marginal + cash_ni ==
income_tax + national_insurance), no negatives in cash slices.

Out of scope (left raw): effective-rate %, data-integrity, payslip
table, P60/HMRC reconciliation — those are audit views that use
unmodified income_tax / cash_income_tax columns.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 15:13:29 +00:00
Viktor Barzin
4315ed5c2a [backup] Fix lvm-pvc-snapshot Pushgateway push (stdout pollution in cmd_prune_count)
cmd_prune_count's `log "  Pruned: ..."` wrote to stdout, which the
caller captures via `pruned=$(cmd_prune_count)`. From 2026-04-16 onward
(7d retention kicked in), pruned snapshots polluted the captured value
with multi-line log text, breaking the Prometheus exposition format
on the metric push (`lvm_snapshot_pruned_total ${pruned}` → 400 from
Pushgateway). Snapshots themselves were always fine; only the metric
push silently failed for ~9 nights, eventually triggering
LVMSnapshotNeverRun (alert has 48h `for:`).

Fix: redirect the inner log call to stderr so cmd_prune_count's stdout
contains only the count. Also adopts `infra/scripts/lvm-pvc-snapshot.sh`
as the source-of-truth (was edited only on the PVE host) and updates
backup-dr.md to point at the .sh and document the scp deploy.

Deploy: scp infra/scripts/lvm-pvc-snapshot.sh root@192.168.1.127:/usr/local/bin/lvm-pvc-snapshot

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-25 14:30:58 +00:00
Viktor Barzin
d231615ebb [monitoring] Fix fuse voltage alerts — divide raw deciVolt reading by 10
The tuya-bridge exporter reports `fuse_main_voltage` and
`fuse_garage_voltage` as raw uint16 from the Tuya protocol, which
encodes voltage in deciVolts (e.g. 2352 = 235.2V). The 200/260V
thresholds were comparing against the raw integer, so both
FuseMainVoltageAbnormal and FuseGarageVoltageAbnormal fired
continuously during normal mains conditions.

Dividing in the expression also makes `{{ $value }}V` render the
correct human-readable value in the alert summary.

Root fix would be in tuya-bridge `_decode_value()` where
`name.startswith("voltage")` returns `int.from_bytes(...)` without the
/10 scaling that `decode_voltage_threshold` applies. Leaving that
alone to avoid breaking the automatic_transfer_switch scrape which
uses a different code path (`parse_voltage_string`).
2026-04-24 11:12:56 +00:00
Viktor Barzin
a5e4db9af8 [monitoring] Tuya Cloud root-cause alert + cascade suppression
New alert TuyaCloudDown fires when any *_tuya_cloud_up gauge == 0
(i.e., the Tuya Cloud API rejects scrape calls — the symptom during
last night's iot.tuya.com trial expiry, code=28841002). 5m for-duration
beats the 15m window of the seven downstream *MetricsMissing alerts, so
the new Alertmanager inhibit rule suppresses the per-device noise and
only TuyaCloudDown pages.

Also flips helm_release.prometheus.force_update from true to false:
force_update was tripping on the pushgateway PVC added in rev 188
(commit e51c104) — Helm's --force path tried to reset spec.volumeName
on a bound PVC. Disabled here; re-enable temporarily when a
StatefulSet volumeClaimTemplate change actually needs --force.

Bundled with pre-existing working-tree additions for Fuse/Thermostat
threshold alerts and expanded PowerOutage inhibit regex (landed in the
same Helm revision 190).

Verified: rule loaded, value=7 (all 7 tuya-bridge devices report
cloud_up=0 right now), TuyaCloudDown moved pending→firing after 5m,
3 *MetricsMissing alerts currently suppressed in Alertmanager with
inhibitedBy=1 (thermostat alerts still pending their 15m window, will
be suppressed on transition).
2026-04-23 09:59:48 +00:00
Viktor Barzin
5ebd3a81c3 tuya-bridge: liveness probe hits /health so k8s restarts silently-hung bridge
The bridge was down 10h 40m on 2026-04-22 without being restarted —
the liveness probe hit `/` (trivial Flask handler) which passed while
the actual Tuya-cloud call path was stuck. /health now reports Tuya
cloud reachability via a background probe in the app; point both
probes at it. Liveness: 60s grace + 6x30s = 3min of 503s before
restart; readiness: 2x15s = 30s before removal from service.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 07:47:41 +00:00
Viktor Barzin
8e55c4357a [poison-fountain] opt ingress out of Uptime Kuma external monitor
Deployment is scaled to replicas=0 to silence ExternalAccessDivergence,
but the ingress at poison.viktorbarzin.me was still auto-annotated
`external-monitor=true` by ingress_factory (dns_type=non-proxied path),
so external-monitor-sync kept creating `[External] poison` which
probed a backend with no endpoints and flagged DOWN.

Setting `external_monitor = false` emits the explicit opt-out
annotation; next sync run deleted the orphaned monitor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 21:24:22 +00:00
Viktor Barzin
344fce3692 [monitoring][poison-fountain] pushgateway persistence + cronjob uid-0
Two independent root-cause fixes surfaced by the 2026-04-22 cluster
health check:

1. Pushgateway lost all in-memory metrics when node3 kubelet hiccuped
   at 11:42 UTC, hiding backup_last_success_timestamp{job="offsite-
   backup-sync"} until the next 06:01 UTC push — a ~18h false-negative
   window. Enable persistence on a 2Gi proxmox-lvm-encrypted PVC with
   --persistence.interval=1m. Chart note: values key is
   `prometheus-pushgateway:` (subchart alias), not `pushgateway:`.

2. poison-fountain-fetcher CronJob runs curlimages/curl as UID 100
   but the NFS mount /srv/nfs/poison-fountain is root:root 755 and
   the main Deployment runs as root, so mkdir /data/cache fails
   every 6h. Set run_as_user=0 on the CronJob container (no_root_squash
   is set on the export).

Closes the backup_offsite_sync FAIL on the next 06:01 UTC offsite
sync; closes the recurring poison-fountain evicted-pod noise on the
next 00:00 UTC cron tick.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 18:32:29 +00:00
Viktor Barzin
f1f723be83 [technitium] zone-sync now reconciles primaryNameServerAddresses
When a zone is created against a stale primary IP (e.g. the old primary
pod IP 10.10.36.189 before the technitium-primary ClusterIP service
existed), AXFR refresh keeps failing forever while every other zone on
the same replica refreshes fine from 10.110.37.186. The resync-only
branch didn't touch zone options, so the bad IP was pinned indefinitely.

This surfaced as rpi-sofia.viktorbarzin.lan returning 192.168.1.16
(pre-move) on secondaries while primary had the correct .10 from
2026-04-22 morning — Uptime Kuma Sofia RPI monitor DOWN, cluster
cluster_healthcheck FAIL.

The sync loop now re-applies primaryNameServerAddresses on every run
for existing zones. Idempotent — Technitium accepts identical values
— and self-heals any drift within 30 min. Env renamed PRIMARY_IP →
PRIMARY_HOST for consistency with the reconcile semantics.

Hostname form (technitium-primary.technitium.svc.cluster.local) was
tried but Technitium's own resolver doesn't forward svc.cluster.local,
so the field must stay a literal IP. Terraform tracks the ClusterIP on
every apply and the reconcile loop propagates it to replicas.
2026-04-22 17:47:18 +00:00