Commit graph

4674 commits

Author SHA1 Message Date
Viktor Barzin
5a312563c6 monitoring/wealth: dash the in-progress year on the hourly-rate panel
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The current, still-accruing calendar year read misleadingly high (e.g. 2026
at 5 months showed £149/h gross, above all of 2025) because the full-year
bonus - paid every March - plus front-loaded quarterly RSU vests get divided
by only the months worked so far. It settles lower as the year completes.

Split each line into a solid series (complete years) and a dashed series
(the latest, still-accruing year), so the provisional point is visually
flagged. The split auto-detects the in-progress year (latest year with
< 12 months of payslips), so it needs no per-year maintenance. Panel
description now explains the caveat.
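
The auto-detection rule described above (latest year with fewer than 12 months of payslips) can be sketched as follows. This is a minimal illustration, not the panel's actual query: the function name `split_in_progress_year` and the `(year, month)` input shape are assumptions made for the example.

```python
from collections import Counter


def split_in_progress_year(payslip_months):
    """Split years into a settled (solid) series and one provisional (dashed) year.

    payslip_months: iterable of (year, month) tuples, one per payslip.
    Returns (settled_years, in_progress_year); in_progress_year is None when
    the latest year already has all 12 months.
    """
    months_per_year = Counter()
    for year, _month in set(payslip_months):  # set() drops duplicate payslips
        months_per_year[year] += 1
    if not months_per_year:
        return [], None
    latest = max(months_per_year)
    in_progress = latest if months_per_year[latest] < 12 else None
    settled = sorted(y for y in months_per_year if y != in_progress)
    return settled, in_progress


# Hypothetical data: a full 2025 and five months of 2026.
months = [(2025, m) for m in range(1, 13)] + [(2026, m) for m in range(1, 6)]
settled, provisional = split_in_progress_year(months)
```

Because only the latest year is tested, no per-year configuration is needed; once a twelfth payslip arrives, the year moves to the solid series on its own.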

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 12:45:51 +00:00
Viktor Barzin
28984dda9a monitoring/wealth: add per-year effective hourly-rate panel (gross vs net)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor wanted to see, on the wealth dashboard, the hourly wage he earned
each year - both gross and net - with year on the X axis.

New timeseries (line) panel "Effective hourly rate - gross vs net":
- hourly = annual pay / hours worked; hours = contractual 40h/week
  (2,080h per full year, confirmed from the Facebook/Meta UK offer letter:
  Mon-Fri 09:00-18:00 less a 1h lunch), prorated by the months actually
  worked so partial years (2019, 2020, 2026) read correctly.
- Gross = gross_pay incl. notional RSU vest; Net = take-home.
- timeFrom 10y so all years show under the dashboard's default 180d range.
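
A small sketch of the proration described in the list above, assuming pay and months worked are already aggregated per year. The function name and the pay figures are invented for illustration; only the 40h/week (2,080h per full year) basis comes from the commit message.

```python
FULL_YEAR_HOURS = 40 * 52  # 2,080 contractual hours in a full year


def effective_hourly_rate(annual_pay, months_worked):
    """Pay divided by contractual hours, prorated by months actually worked."""
    if not 1 <= months_worked <= 12:
        raise ValueError("months_worked must be between 1 and 12")
    hours = FULL_YEAR_HOURS * months_worked / 12
    return annual_pay / hours


# Hypothetical figures: the same monthly pay over a full and a half year.
full_year = effective_hourly_rate(104_000, 12)  # 104000 / 2080
half_year = effective_hourly_rate(52_000, 6)    # 52000 / 1040
```

With proration both calls give the same rate, whereas dividing the half-year pay by the full 2,080h would understate it by half.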

Source data: a duplicate March-2023 payslip (Paperless doc 347, a re-upload
of doc 33) was removed separately, so 2023 is no longer double-counted; this
also corrects the existing net-pay panel.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 12:28:46 +00:00
Viktor Barzin
82371d1ef8 dbaas/mysql: innodb_doublewrite=DETECT_ONLY to halve page-flush writes
All checks were successful
ci/woodpecker/push/default Pipeline was successful
MySQL device-write investigation (code-oflt): after the nextcloud webcal
throttle settled (the earlier 3.4-8.8 MB/s were post-restart transients),
MySQL is ~1.74 MB/s at the InnoDB level — and HALF of that (~0.86 MB/s,
~55 pages/s) is the doublewrite buffer writing every flushed page twice.
Redo is negligible (0.01 MB/s), no temp-table spilling.

Set innodb_doublewrite=DETECT_ONLY (dynamic, no restart; persisted in the
cnf): InnoDB stops writing full page CONTENT to the doublewrite buffer
(~halves MySQL's page-flush writes on the IOPS-bound sdc) but keeps
torn-page DETECTION metadata — a crash-torn page is flagged on recovery
(restore from the daily mysqldump) rather than silently corrupt. Chosen
over full OFF: same write saving, keeps detection, and OFF requires a
shutdown ("cannot change to OFF if doublewrite is enabled"). Acceptable
risk given the PERC BBU cache + UPS (in-flight writes complete on power
loss) + daily per-db backups.
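
As a rough cross-check of the figures above, the doublewrite share can be reproduced from the page rate. This sketch assumes the default 16 KiB `innodb_page_size` and reads the quoted "MB/s" as MiB/s; neither is stated in the commit message.

```python
PAGE_SIZE_BYTES = 16 * 1024  # assumption: default innodb_page_size


def doublewrite_mib_per_s(pages_per_s, page_size=PAGE_SIZE_BYTES):
    """Extra write rate from writing each flushed page to the doublewrite buffer."""
    return pages_per_s * page_size / (1024 * 1024)


doublewrite = doublewrite_mib_per_s(55)  # ~55 pages/s from the investigation
share = doublewrite / 1.74               # fraction of the ~1.74 MB/s InnoDB total
```

Under those assumptions 55 pages/s works out to about 0.86 MiB/s, close to half of the measured total, which matches the commit's attribution.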

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 08:47:09 +00:00
Viktor Barzin
fbae573664 state(dbaas): update encrypted state 2026-06-30 08:46:45 +00:00
Viktor Barzin
71501be408 nodes: journald -> volatile (RAM) to cut sdc write-IOPS
Some checks failed
ci/woodpecker/push/default Pipeline failed
Node "container churn" investigation (code-oflt): container logs (~30 KB/s)
and overlayfs (~17 KB/s) are negligible; the node OS-disk churn is ext4
journal (jbd2) metadata writes driven mostly by journald's continuous
appends. node4 + node5 had drifted to uncapped persistent journald (4 GB
each, ~100 KB/s); master/node1-3 were correctly capped at 500M.

Node + pod journals already ship to Loki (alloy loki.source.journal), so
on-disk journald is pure write-IOPS overhead on the IOPS-bound sdc. Switch
journald to Storage=volatile (RAM, RuntimeMaxUse=200M) fleet-wide:
- cloud_init.yaml: drop-in 90-oflt-volatile.conf for new nodes (replaces
  the old persistent seds).
- running nodes (master + node1-5): pushed the same drop-in via qm guest
  exec + journald restart + cleared /var/log/journal.

Verified node5: OS-disk writers jbd2/sda1-8 931->46 KB/s, systemd-journal
gone (~94% drop); ~4 GB freed each on node4/node5. Logs stay queryable in
Loki. Trade-off: a hard crash loses the last unshipped journal.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 08:15:38 +00:00
Viktor Barzin
1afe41880e docs: MySQL buffer-pool/limit + nextcloud webcal throttle; VCT drift fixed
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Reflect the code-oflt MySQL write-reduction work (commit 82c9e69b + the
nextcloud webcal app-data throttle):
- MySQL row: buffer pool 1->2Gi, mem limit 4->6Gi, and the nextcloud
  webcal calendar churn that was ~60% of MySQL's writes (now throttled
  in oc_calendarsubscriptions.refreshrate — app-data, can regress).
- CNPG apply-gotcha note: the mysql_standalone VCT-annotation drift no
  longer needs -target dodging (now ignore_changes'd on the STS VCT).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 07:56:04 +00:00
Viktor Barzin
82c9e69b77 dbaas/mysql: 2Gi InnoDB buffer pool + 6Gi limit + ignore VCT drift
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Cut MySQL's write-IOPS footprint on the contended PVE sdc HDD (code-oflt).
Standalone MySQL was the #1 sdc bandwidth writer (~2.8-3.5 MB/s). Live
attribution found ~60% of its writes were nextcloud webcal calendar churn
(throttled separately at the app layer); this addresses write amplification
on the remainder:

- innodb_buffer_pool_size 1Gi -> 2Gi: the pool was too small for the ~5.6Gi
  hot set (Innodb_buffer_pool_wait_free=1.78M = threads stalling for a free
  page -> constant flush-to-make-room write IOPS).
- container memory limit 4Gi -> 6Gi (requests 3->4Gi): the pod was already
  at ~3.7Gi/4Gi (near OOM) with the 1Gi pool, so the 2Gi pool needs the
  headroom. One-time MySQL pod restart to apply.
- ignore_changes on the StatefulSet volume_claim_template: the VCT is
  immutable post-creation and pvc-autoresizer rewrites its annotations on
  the live object, so TF's desired VCT could never apply and errored every
  broad dbaas apply. Ignoring it (autoresizer owns PVC sizing) removes the
  long-standing need to -target around it.

Applied + verified live: buffer_pool=2.0GiB, limit=6Gi, pod healthy,
24 DBs reachable, restart clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 07:55:18 +00:00
Viktor Barzin
29bf275cef state(dbaas): update encrypted state 2026-06-30 07:53:48 +00:00
Viktor Barzin
308a174ad6 docs(networking): record MetalLB .204 (frigate-rtsp go2rtc) allocation
All checks were successful
ci/woodpecker/push/default Pipeline was successful
PR #17 moved frigate-rtsp to a dedicated MetalLB LoadBalancer IP
(10.0.20.204) exposing RTSP 8554 + WebRTC 8555, but the networking doc
still listed only four IPs in use / three dedicated. Add the .204 row to
the allocation table, bump the counts (five in use, four dedicated, 5-IP
layout), and add a LB-IP renumber-checklist entry for the out-of-band
consumers (the go2rtc WebRTC candidate on the frigate config PVC and the
HA-sofia rtsp_url_template). Note go2rtc cannot use a DNS name in ICE
candidates, so the Service annotation is the single source of truth.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 07:42:27 +00:00
469cdd7507 frigate: expose go2rtc on a dedicated MetalLB LB IP (RTSP 8554 + WebRTC 8555)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
HA live video from the cluster Frigate hangs/fails because the only path
to Frigate is the Traefik HTTP(S) ingress (frigate-lan -> 10.0.20.203),
which cannot carry RTSP or WebRTC. The container already listens on
8554+8555 but only RTSP had a Service (NodePort), and WebRTC (8555) was
never exposed. Convert frigate-rtsp to a LoadBalancer on a dedicated MetalLB
IP (.204, ETP=Local, pod pinned to the GPU node) carrying RTSP 8554 +
WebRTC 8555 (TCP+UDP), giving HA Sofia + LAN browsers a stable cross-VLAN
endpoint for native HLS/WebRTC live (parity with the Hikvision NVR).
Companion non-Terraform steps are in the PR body.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 07:15:22 +00:00
Viktor Barzin
9ea9cae073 rightsize: reconcile batch-2/3 stacks blocked by killed #427 (job-hunter, wealthfolio, f1-stream)
Some checks failed
ci/woodpecker/push/default Pipeline failed
Memory limits were committed (batch 2/3) but pipeline #427 was killed mid-apply and the local homelab tf apply hit a stale backend-init; this comment-only diff re-triggers a clean CI apply for the three stacks so live matches master (job-hunter 768Mi, wealthfolio 512Mi, f1-stream 384Mi).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 15:59:41 +00:00
Viktor Barzin
7cc9cde5b1 external-secrets: enable ESO Vault token cache to cut sdc write churn
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Add --enable-vault-token-cache to the ESO controller (a graduated,
non-experimental flag in chart 2.6.0). Until now ESO authenticated to
Vault with login -> lookup-self -> revoke-self on *every* secret fetch.
Across 92 ExternalSecrets refreshing every 15m that measured ~0.22
logins/s + ~0.22 revoke-self/s on the active Vault member, and each
cycle is a token create+revoke (plus its lease) written to the Raft log
on all three members. Those fsync-heavy writes land on the contended
PVE RAID1 7200rpm HDD (sdc) -- one of the write sources behind the
recurring control-plane flaps (code-oflt write-reduction).

The eso kubernetes-auth role already issues a 240h periodic, unlimited-
use token, so the churn was pure waste: ESO discarded a perfectly good
token after a single use. With token caching ESO mints one token and
reuses/renews it, collapsing logins from ~13/min to a handful per token
lifetime. Verified live: vault cache initialized, 112/113 ExternalSecrets
Ready (the one failure, instagram-poster, is pre-existing data drift
unrelated to auth), logins dropped to ~0 after warm-up.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 15:32:37 +00:00
Viktor Barzin
5e384ed762 state(external-secrets): update encrypted state 2026-06-29 15:32:37 +00:00
Viktor Barzin
bc626a2d89 rightsize: raise OOM-tight memory limits (batch 3/N — spike protection)
Some checks failed
ci/woodpecker/push/default Pipeline failed
shlink 512->704Mi, linkwarden 1Gi->1280Mi, chrome-service 2Gi->2624Mi, forgejo 4Gi->5Gi, f1-stream 256->384Mi. All were request==limit with 30d peak at 91-100% of the ceiling — a spike would OOM-kill them. Raising the limit (now Burstable, request<limit) gives real burst headroom. This is the genuine 'don't OOM on occasional spike' fix. Small add (~2.2Gi limits) vs the ~20Gi of fat removed in batches 1-2, so net overcommit keeps dropping.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 15:28:11 +00:00
Viktor Barzin
418d1efb4b rightsize: trim over-provisioned memory (batch 2/N)
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
claude-agent-service 12Gi->3Gi (peak 585Mi — the single biggest fat, ~9Gi of limit-overcommit removed), job-hunter 1280->768Mi (kept chromium headroom; 30d peak 118Mi), fire-planner 1024->320Mi, wealthfolio 1Gi->512Mi (kept history-growth headroom). Burstable, limits kept >= generous peak headroom, never below peak. ~10.7Gi of limit overcommit removed. paperless-ai intentionally LEFT at 4Gi (documented in-process RAG model load).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 15:27:17 +00:00
Viktor Barzin
a3f2c2947a docs: refresh CNPG tuning note (archive_timeout=0, commit_delay, zstd) + apply gotcha
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Reflects the write-reduction params applied in c3553731, and documents the
null_resource trigger-bump + targeted-apply gotcha so the next agent doesn't
hit the inert-change / mysql-VCT-drift traps.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 15:17:38 +00:00
Viktor Barzin
ec04963bfe state(dbaas): update encrypted state
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
2026-06-29 15:16:50 +00:00
Viktor Barzin
c3553731c7 dbaas: CNPG write-reduction — archive_timeout=0, commit_delay, wal_compression=zstd
Part of code-oflt (cut sdc write IOPS before the SSD move; analysis #6922).
- archive_timeout 300->0: CNPG forces archive_mode=on but .spec.backup is empty
  (no ObjectStore), so a 16MB WAL segment switch every 5min shipped NOWHERE =
  ~4.6 GB/day of pure-waste WAL on the contended sdc. archive_mode stays CNPG-on
  (reserved); 0 just stops the timed switch. Daily pg_dump cron unchanged.
- commit_delay 0->2500us: group-commit coalesces concurrent fsyncs. SAFE for
  every DB incl financial -- data still fsynced before COMMIT acks, only <=2.5ms
  added latency under concurrency.
- wal_compression pglz->zstd: ~30-50% smaller full-page images.
All sighup-reloadable. Applied via targeted apply of
module.dbaas.null_resource.pg_cluster (trigger bumped) to avoid the pre-existing
mysql VCT drift that breaks broad dbaas applies.
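
The ~4.6 GB/day figure for the old archive_timeout follows from simple arithmetic, sketched below. It assumes every timeout interval actually forces a segment switch (true only when some WAL was written in the interval), so it is an upper bound; the function name is illustrative.

```python
SECONDS_PER_DAY = 86_400


def forced_wal_mb_per_day(archive_timeout_s, segment_mb=16):
    """Upper bound on WAL produced by timed segment switches alone."""
    if archive_timeout_s == 0:  # 0 disables the timed switch
        return 0
    switches_per_day = SECONDS_PER_DAY // archive_timeout_s
    return switches_per_day * segment_mb


before = forced_wal_mb_per_day(300)  # 288 switches x 16 MB
after = forced_wal_mb_per_day(0)
```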

Refs: code-oflt.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 15:16:38 +00:00
Viktor Barzin
5d059786a1 rightsize: trim over-provisioned memory limits+requests (batch 1/N)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
claude-breakglass 4Gi->512Mi, stirling-pdf 1536->512Mi, insta2spotify 2Gi->256Mi, recruiter-responder 768->256Mi. These idle/utility services had memory LIMITS sitting 4-15x above their 30d peak, inflating cluster limit-overcommit to 142% across the 5 post-node6 nodes. Burstable (request<limit), limits capped at ~peak x1.5 (never below peak), so no OOM risk (verified zero OOMKills cluster-wide in 30d). Reduces phantom limit overcommit + frees scheduler requests.

Follows the 3-reviewer adversarial review: raising limits on an already-overcommitted cluster worsens correlated node-OOM; the real fix is trimming the fat. Limits only lowered where peak is far below; tuned/DB/GPU limits untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 14:46:58 +00:00
Viktor Barzin
4473b469e3 lvm-pvc-snapshot: cut retention 7->3 days (reduce sdc thin-pool CoW IOPS + free ~1TB)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Part of the sdc IOPS-reduction work (code-oflt). 462 daily thin snapshots
(66 PVCs x 7d) drive ~10-34 w/s of thin-pool metadata (tmeta) CoW writes on
the contended sdc spindle and pin ~2TB in the 70%-full pool. Cutting to 3
days roughly halves both. Instant-restore window shrinks 7->3d; daily-backup
still keeps 4 weeks of file-level PVC history, so DR coverage is unchanged.
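
The snapshot counts behind this change can be checked with a short calculation. It assumes one snapshot per PVC per retained day, as the "66 PVCs x 7d" figure implies; the helper name is made up for the example.

```python
def retained_snapshots(pvc_count, retention_days):
    """Steady-state snapshot count with one daily snapshot per PVC."""
    return pvc_count * retention_days


before = retained_snapshots(66, 7)
after = retained_snapshots(66, 3)
remaining_fraction = after / before  # share of snapshots still retained
```

At 3 days roughly 43% of the snapshots remain, so thin-pool metadata writes and pinned space should drop by a little more than half if they scale with snapshot count.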

Deployed to the PVE host via scp (these host scripts are scp-deployed, not
TF-managed). Doc updated in .claude/CLAUDE.md.

Refs: code-oflt.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 12:59:16 +00:00
Viktor Barzin
256122ff5b monitoring: make ClusterCannotTolerateNonGpuNodeLoss topology-agnostic
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The N-1 capacity alert was hardcoded to k8s-node[234]/[1234], predating node5/node6 (added 2026-05-26) and the 2026-06-29 removal of node6 — so it no longer reflected the real cluster and gave no trustworthy N-1 signal. Generalize node selection via metrics: GPU node by nvidia_com_gpu capacity, drained/cordoned by kube_node_spec_unschedulable, down by the Ready condition. Control-plane excluded by name (node!~"k8s-master.*") because this cluster's kube-state-metrics exposes neither kube_node_role nor node taints/labels (verified live).

Also fixes a latent bug (multiplying by kube_node_spec_unschedulable==0 zeroed the result) and refreshes the remediation text (krr, not the removed Goldilocks). With node6 gone the rule now correctly evaluates LHS 31.0Gi > RHS 27.9Gi (fires) — the honest signal that removing node6 tightened requests-based N-1; trimming the inflated requests clears it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 12:34:01 +00:00
Viktor Barzin
6c3619c9c6 state(dbaas): update encrypted state
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-29 12:26:21 +00:00
Viktor Barzin
682b982c78 state(dbaas): update encrypted state 2026-06-29 12:25:53 +00:00
Viktor Barzin
c0e0911afa dbaas: bump pg_cluster trigger so the checkpoint/WAL params actually apply
a2c8f906 added checkpoint_timeout=15min + max/min_wal_size to the CNPG
Cluster YAML, but the cluster is applied via null_resource.pg_cluster +
local-exec kubectl apply, which only re-runs when its `triggers` change.
The YAML edit didn't bump a trigger, so the change was inert and never
applied (incl. via CI). Bump the pg_params trigger so the kubectl apply
re-runs and CNPG hot-reloads the new params (reloadable, no restart).

Landing it via a targeted apply (-target=null_resource.pg_cluster) to avoid
3 pre-existing unrelated drifts in this stack -- notably a mysql_standalone
volumeClaimTemplate annotation diff the apiserver rejects as immutable,
which is what fails broad dbaas applies (and silently blocked a2c8f906).

Refs: code-oflt.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 12:25:37 +00:00
Viktor Barzin
bebe8fbd74 workflows: add read-only memory-overcommit + node-removal capacity analysis
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Reusable Workflow script that audits whether the cluster is memory-overcommitted and whether a single k8s worker can be removed to return RAM to the PVE host without sacrificing N-1 failover. Read-only throughout: gathers PVE host memory (qm config / free / KSM via SSH), k8s per-node capacity + cluster 30d peak working set, and per-workload right-sizing, then models N-1 two ways (physical actual-usage and scheduling-by-request) and adversarially verifies the conclusion with 3 skeptics.

Sizes requests (scheduling reservation) and limits (OOM ceiling) as SEPARATE knobs — an earlier ad-hoc pass conflated them by sizing requests to 30d peak, which manufactured a false N-1 shortfall. Invoke via Workflow {scriptPath}, or by name when cwd is the infra repo.

Requested by Viktor: identify memory overcommit and whether deployment requests can be trimmed to free PVE host RAM by removing a node, without sacrificing service reliability.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 12:06:17 +00:00
Viktor Barzin
a2c8f906ec dbaas: stretch CNPG checkpoint timer 5->15min + raise WAL size (cut sdc write IOPS)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to reduce CNPG checkpoint/WAL writes as part of the sdc
IOPS-isolation work (code-oflt). The IOPS deep-dive found CNPG checkpoints
fire 100% on the 5-min timer (checkpoints_timed >> checkpoints_req), each
triggering a full-page-write burst + flush onto the contended 7200rpm sdc
spindle -- a top write-IOPS source after etcd.

Set checkpoint_timeout=15min + max_wal_size=4GB + min_wal_size=1GB so
checkpoints fire ~1/3 as often (fewer FPW) and WAL segments are recycled
rather than churned. All three are sighup-reloadable -> CNPG applies them
without a restart or failover. checkpoint_completion_target stays 0.9 so
each checkpoint's IO is still smeared across the interval. Bounded
recovery-time tradeoff (more WAL to replay on crash), acceptable for the
write relief. wal_compression left at pglz ('on') pending image
zstd-support verification.

Also refreshes the stale CNPG tuning note in .claude/CLAUDE.md (it listed
shared_buffers=512MB / effective_cache_size=1536MB / 2Gi; live is 1024MB /
2560MB / 3Gi).

Refs: code-oflt (etcd/sdc IO isolation).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 11:41:09 +00:00
Viktor Barzin
3398873a16 k8s-upgrade: move version-check cadence from daily to weekly (Sun check, Mon report)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to move the upgrade checks to weekly. With the actionable-vs-held
gate now quieting the routine 'held' churn (e.g. 1.36), a daily check + attempt
buys little; weekly is enough. Accepted trade-off: k8s patch (incl. security)
uptake now lags up to 7 days instead of <=1.

- var.schedule:        0 23 * * *  ->  0 23 * * 0   (detector: weekly Sunday 23:00 UTC)
- var.report_schedule: 7 6 * * *   ->  7 6 * * 1    (report: Monday 06:07 UTC, ~7h
  after the Sunday check, so nightly-report.py's ~25h staleness threshold stays
  valid AND still flags a missed weekly run; no STALE_SECONDS change needed)
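
The timing argument above can be checked with a small sketch. The 25-hour threshold stands in for the "~25h" mentioned in the commit message; the exact STALE_SECONDS value and the `is_stale` helper are assumptions, not taken from nightly-report.py.

```python
from datetime import datetime, timedelta, timezone

STALE_THRESHOLD = timedelta(hours=25)  # assumed from the "~25h" in the message


def is_stale(last_check, report_time, threshold=STALE_THRESHOLD):
    """True when the newest check result is older than the threshold."""
    return report_time - last_check > threshold


sunday_check = datetime(2026, 6, 28, 23, 0, tzinfo=timezone.utc)  # 0 23 * * 0
monday_report = datetime(2026, 6, 29, 6, 7, tzinfo=timezone.utc)  # 7 6 * * 1

gap = monday_report - sunday_check                 # 7h07m, inside the threshold
on_time = is_stale(sunday_check, monday_report)    # fresh check: not stale
missed = is_stale(sunday_check - timedelta(days=7), monday_report)  # skipped run
```

A check that ran on schedule is about 7 hours old at report time, while a skipped weekly run leaves the previous result more than a week old, so the unchanged threshold still catches it.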

The report CronJob keeps its historical name k8s-upgrade-nightly-report (rename
= churn). Cadence wording updated across main.tf comments, nightly-report.py
docstring, and the runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 06:22:20 +00:00
Viktor Barzin
e43e64c666 kyverno: disable reports-controller to stop etcd ephemeralreport load
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor flagged not wanting to wear the single non-RAID SSD with useless etcd
writes if etcd moves there. Investigation found the avoidable load is kyverno
reporting: the 2026-06-12 etcd-load-reduction disabled the report *features*
but left the reports-controller running (default --enableReporting +
--validatingAdmissionPolicyReports=true), so the 2026-06-21 kyverno upgrade
left a one-time pile of ~10.5k cluster/namespaced ephemeralreports (~114MB in
etcd) that nothing reaps (aggregation off). Listing that range starves etcd's
fdatasync enough to flap the apiserver (observed live 2026-06-28).

Disable the reports-controller outright (reportsController.enabled=false),
completing the 2026-06-12 intent. Reports are not consumed (violations surface
via Loki->Slack); admission enforcement (deny-* policies) and Keel mutation are
independent of it. The ~10.5k stale reports already in etcd are cleared
separately (throttled, out-of-band) since bulk-deleting them is itself
etcd-heavy.

Refs: code-oflt (etcd IO isolation), code-at4f (etcd starvation alerting).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-29 05:35:36 +00:00
Viktor Barzin
cf42042cba monitoring: re-trigger apply to persist state after CI cancel-race
All checks were successful
ci/woodpecker/push/default Pipeline was successful
No-op comment touch in loki.tf to force a clean `terragrunt apply monitoring`.
The pfSense egress-monitoring apply (commit 7fe2d978, CI pipeline #414) was
cancelled by a newer push and SIGKILLed mid-helm-upgrade: the live resources
applied (probes green, rules loaded) but the Terraform state write and the helm
release finalize were lost, leaving the prometheus release stuck in
pending-upgrade (manually unstuck). This commit re-applies the unchanged
monitoring stack so state matches live, with zero resource changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:58:49 +00:00
Viktor Barzin
f92075b7c5 fire-planner: solve FIRE targets to age 100 (horizon 60→72)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor plans to live to 100, so the portfolio must last that long. The
fire-targets CronJob was solving a 60-year horizon (to about age 88); set it to 72
(retire ~age 28 → age 100). Raises every case's FIRE number modestly (more years
to fund). A one-off in-cluster job re-solves the existing rows at the new horizon.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:49:20 +00:00
Viktor Barzin
7fe2d9780e monitoring: add pfSense WAN/egress alerting + probes
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
On 2026-06-27 pfSense (Proxmox VMID 101) stopped passing internet egress for
~20 min while internal routing + Unbound stayed up; recovery needed a manual
reboot and NOTHING alerted — there was no egress probe and the cloudflared
replica metric stayed green. Add first-class egress monitoring so the next
occurrence pages in ~2 min instead of being noticed by a human.

- blackbox-exporter: new icmp_egress + dns_external probe modules (+ NET_RAW
  so ICMP can use raw sockets).
- Three in-cluster probe jobs exercising the pod->node->pfSense-NAT path that
  failed: wan-gateway-icmp (192.168.1.1), internet-egress-icmp (9.9.9.9 +
  1.1.1.1), internet-egress-dns (cloudflare.com via both resolvers).
- Prometheus alerts (group "Egress / pfSense"): WANGatewayUnreachable,
  InternetEgressDown (both providers dead), ExternalDNSResolutionDown,
  EgressOnlyDivergence (reuses the existing t3-probe legs — the incident's
  exact "external down while internal up" signature), PfSenseVMDown.
- Loki ruler: CloudflaredTunnelConnLoss — the canary that fired first; the
  cloudflared replica metric is blind to tunnel-connection loss. Threshold
  calibrated against live Loki (steady-state ~2/6h vs 37-85/5m in-incident).
- Alertmanager inhibit: WAN/egress-down suppresses the downstream egress
  symptom alerts so one root alert pages, not a storm.
- Runbook docs/runbooks/pfsense-egress.md + .claude/CLAUDE.md.

All metric names + the cloudflared threshold verified against live
Prometheus/Loki. Pure GitOps, no pfSense change. Firewall-side hardening
(dpinger retargeting, failover gateway, pfSense syslog -> Loki) is deferred
and documented in the runbook.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:46:30 +00:00
Viktor Barzin
279b88d2bc docs: add MetalLB L2Status-immutable PG-VIP-flap post-mortem (code-aoxk)
All checks were successful
ci/woodpecker/push/postmortem-todos Pipeline was successful
ci/woodpecker/push/default Pipeline was successful
Post-mortem for the 2026-04/05 SEV3 where a stuck MetalLB ServiceL2Status
CR (immutable status.node) flapped the PG load-balancer VIP and silently
broke Tier-1 Woodpecker terragrunt applies for ~5 days (the wrapper error
"Cannot read PG creds" masked the real cause for ~25 days). Written when
the incident closed (beads code-aoxk, 2026-05-26) but never committed;
landing it so the RCA + stuck-CR cleanup procedure live in the repo.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:25:10 +00:00
Viktor Barzin
6f042ee239 fix(fire-planner): grafana fire-planner-pg datasource survives pw rotation
Some checks failed
ci/woodpecker/push/default Pipeline failed
The fire-planner-pg Grafana datasource baked the rotating fire_planner DB
password into its provisioning ConfigMap at terraform plan-time, so on every
7-day static-role rotation the password went stale and ALL fire-planner-pg
dashboards (fire-planner, cost-of-living, and the new wealth FIRE Countdown)
silently failed with "password authentication failed for user fire_planner"
until the next stack apply.

Switch to the same live-env pattern wealth-pg / payslips-pg already use:
- new ExternalSecret grafana-fire-planner-pg-creds (monitoring ns, Reloader
  match) mirrors the rotating Vault static-creds/pg-fire-planner password
- datasource ConfigMap now references $__env{FIRE_PLANNER_PG_PASSWORD}
- Grafana mounts it via envFromSecrets; reloader (auto) restarts Grafana on
  rotation so the provisioned datasource never goes stale

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 16:14:42 +00:00
Viktor Barzin
35c0057d83 chrome-service: raise noVNC sidecar memory limit 96Mi->256Mi (fix OOMKill)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The noVNC sidecar (x11vnc + websockify) was OOMKilled (exit 137) repeatedly
whenever someone actively opened chrome.viktorbarzin.me — the view connected
then froze/hung. Idle usage is ~37Mi, but x11vnc + websockify
framebuffer/encode buffers spike past the 96Mi cap when streaming the
1280x720 screen to a client. Raised request 32Mi->64Mi, limit 96Mi->256Mi
(Burstable, aux tier). Already applied live via a transient kubectl patch
(Recreate rollout, verified 0 restarts since); this lands the durable state
so the next apply / daily drift-detection doesn't revert it to 96Mi.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 15:39:17 +00:00
Viktor Barzin
2e50c1235c chrome-service: grant emo shared browser access (noVNC + homelab browser CLI)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Viktor asked to give emo access to the cluster's headed Chrome so he can fill
in forms and get past anti-bot / captcha pages. emo was deliberately locked
out of chrome-service (noVNC Authentik allowlist was Viktor-only + his
power-user RBAC has no pods/portforward). Viktor's explicit decision: SHARE
his existing browser rather than stand up an isolated per-user instance,
accepting that emo can therefore reach Viktor's warmed logged-in sessions
(CDP has no per-context auth, so the single shared persistent profile is
reachable by anyone who can drive the browser). emo's CLI use is hands-off
(his agent can run it unattended).

- authentik: add emo (emil.barzin / emil.barzin@gmail.com) to CHROME_ALLOWED
  so the admin-services-restriction policy admits him to chrome.viktorbarzin.me
  (noVNC). Reverses the prior Viktor-only lock; comment updated to record why.
- chrome-service/rbac.tf (new): emo-browser ServiceAccount + long-lived token
  (dashboard-sa.tf pattern), a chrome-service-portforward Role granting
  pods/portforward, and a cluster read-only binding (oidc-power-user-readonly)
  so the SA can resolve the Service and emo's normal read access doesn't regress.
- t3-provision-users.sh: install_browser_kubeconfig installs a dual-context
  kubeconfig for any user with a <user>-browser SA — SA token as the default
  context (non-interactive, works headless), personal OIDC retained as the
  oidc@homelab named context. emo's OIDC-only kubeconfig can't authenticate the
  headless agent session that homelab browser needs.
- docs/architecture/chrome-service.md: document the shared-browser multi-user
  access model, the session-exposure trade-off, and how to grant/revoke a user.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 15:20:07 +00:00
Viktor Barzin
50077b43d4 paperless-ngx: drop TASK_WORKERS 6->4 (6 OOMKilled the pod mid-import)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
6 OCR workers crept past the 8Gi per-container memory cap over ~6h and
OOMKilled paperless at 15:00 during the Emo bulk import. The import
auto-recovered (the consume dir lives on the PVC, so a restart re-scans
and reprocesses — nothing lost), but it left the queue inflated with
re-queued duplicates and spiked etcd on each restart.

The 8Gi cap is the shared edge-tier `tier-defaults` LimitRange, not worth
raising for one namespace. 4 workers fit with headroom (4 measured
~1.3Gi). Matches the value applied live via `kubectl set env` during
incident response; this removes the drift so the next apply keeps it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 15:06:46 +00:00
Viktor Barzin
8236ae309d postiz: reconcile HCL to live (adopt unmerged stack config), keep parked
All checks were successful
ci/woodpecker/push/default Pipeline was successful
postiz's live deployment (Helm + Temporal + Elasticsearch + Authentik
OIDC + static-DB password) came from the never-merged branch
`wizard/postiz-cnpg-oidc`, so master's HCL was stale and a `terragrunt
apply` would have DESTROYED the stack. This lands that postiz config to
master so HCL == state == live (CI green; destroy-landmine gone).

Kept PARKED (postiz + temporal replicas = 0): Instagram posting via postiz is
blocked on Meta's side (postiz hardcodes retired Instagram scopes, so OAuth
fails with "Invalid Scopes"), which is why it was parked; Instagram posting
runs via the instagram-poster service instead. To revive later: flip postiz
`replicaCount` + temporal `replicas` back to 1 and re-check image pins.

Notes captured in this reconcile:
- ES image pinned to 7.17.28 (the branch's 7.17.24 was a DOWNGRADE vs the
  live data → ES refused to start "cannot downgrade node 7.17.28→7.17.24";
  caught + rolled back during this work).
- The 4 Authentik resources (app/provider/group/binding) were re-imported
  into state (adopted, not recreated — no duplicate AK objects); the
  obsolete `external_secret_jwt` ExternalSecret was removed (Retain → its
  synced secret was kept).
- Vault-side cleanup (removing the unused pg-postiz rotated role) is
  deliberately NOT included here (deferred); postiz uses a static
  secret/postiz database_url.

State was already reconciled by a local `scripts/tg apply`; this commit is
the HCL catch-up (CI re-apply is a no-op).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 12:54:59 +00:00
Viktor Barzin
250d0fc334 docs(authentik): document SFE forced-WebAuthn escape hatches (TOTP + social)
All checks were successful
ci/woodpecker/push/default Pipeline was successful
Old-browser users on the SFE who have a password but no MFA device hit the
default-authentication-flow's forced WebAuthn passkey enrolment, which the SFE
cannot render (the 'unsupported state: ak-stage-authenticator-webauthn' error).
emo (Google-only, iPadOS 15) hit this on the password path.

Document the two no-MFA-downgrade fixes: (1) social login, whose source flow
(default-source-authentication) has no MFA stage, so the SFE's social button
always completes; (2) enrolling TOTP, which the SFE can validate (unlike
WebAuthn) and which flips the MFA stage from force-enrol to validate. TOTP was
enrolled for emo and stored in his Vaultwarden authentik item; verified
end-to-end (a Bitwarden-generated code is accepted by authentik).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 12:24:40 +00:00
Viktor Barzin
e518ada3d4 authentik: repoint to overlay patch3 (all-iOS SFE + SFE social links) + docs
All checks were successful
ci/woodpecker/push/default Pipeline was successful
global.image -> 2026.2.4-patch3. Old iPad Chrome (and any iOS browser) now gets
the SFE too, and the SFE login shows social-login buttons (emo is Google-only with
no password, so the password form alone was a dead end). Docs: .claude/CLAUDE.md +
authentication.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:53:26 +00:00
Viktor Barzin
4fc09b7a61 Merge remote-tracking branch 'origin/master' into wizard/authentik-sfe-social
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Build Custom Authentik Image / build (push) Has been cancelled
2026-06-28 11:53:04 +00:00
Viktor Barzin
916516eeab authentik overlay patch3: SFE for ALL old iOS browsers + social-login links
Two follow-ups to patch2 (both in patch-compat-sfe.py, guarded):

1. compat_needs_sfe() now also serves the SFE to ANY iOS browser on iOS<=16.3,
   not just Safari. iOS Chrome/Firefox are WebKit skins (Apple mandate) reporting
   a non-Safari UA family, so the Safari-only check missed them and they still got
   the blank modern SPA. Added an os.family=="iOS" + version<=16.3 branch.

2. Inject static social-login <a> links (Continue with Google/GitHub/Facebook ->
   /source/oauth/login/<slug>/) into the SFE shell (flow-sfe.html). The SFE
   architecturally can't render Identification-stage sources (authentik docs), and
   emo's account (emil.barzin@gmail.com) is Google-only with NO password — so the
   SFE's username/password form was a dead end. The links are plain redirects that
   work on any browser. Slugs are static; re-verify on source changes.
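
A minimal sketch of the check described in item 1, assuming the user agent has
already been parsed into a dict with `user_agent` and `os` entries that carry
`family`, `major` and `minor` fields. The function name `needs_sfe`, the dict
shape and the Safari family names are illustrative assumptions, not
authentik's actual `compat_needs_sfe()` source; the existing IE/old-Edge
branch is left out.

```python
from typing import Optional, Tuple

# Threshold quoted in the commit message: Safari / iOS at or below 16.3.
MAX_LEGACY_WEBKIT = (16, 3)


def _version(part: dict) -> Optional[Tuple[int, int]]:
    """Return (major, minor) as ints, or None if the major is missing or non-numeric."""
    try:
        return (int(part.get("major")), int(part.get("minor") or 0))
    except (TypeError, ValueError):
        return None


def needs_sfe(ua: dict) -> bool:
    """Hypothetical stand-in for the patched check (not authentik's code)."""
    browser = ua.get("user_agent", {})
    os_info = ua.get("os", {})

    # patch2-style branch: Safari itself at or below 16.3.
    # The family names here are assumed, not taken from the source.
    if browser.get("family") in ("Safari", "Mobile Safari"):
        version = _version(browser)
        if version is not None and version <= MAX_LEGACY_WEBKIT:
            return True

    # patch3-style branch: ANY browser on iOS <= 16.3, because iOS Chrome and
    # Firefox are WebKit underneath but report a non-Safari family.
    if os_info.get("family") == "iOS":
        version = _version(os_info)
        if version is not None and version <= MAX_LEGACY_WEBKIT:
            return True

    return False
```

Keying the second branch on the OS version rather than the browser family is
what catches the non-Safari iOS browsers: their own version numbers say
nothing about the WebKit build underneath.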

Tag -> 2026.2.4-patch3; values repoint + docs land once GHA builds it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:53:03 +00:00
Viktor Barzin
08bdf32aa0 feat(fire-planner): FIRE Countdown dashboard section + monthly target solve
Some checks failed
ci/woodpecker/push/default Pipeline was canceled
Add a "FIRE Countdown" section to the wealth Grafana dashboard plus a monthly
CronJob that computes the targets it reads.

Viktor wanted a £ countdown to retirement in today's money, per life-case
(Solo / Household / Family) and per country, with progress, a projected date,
runway, and his safety guardrails — so he can see how close he is to FIRE
(ideally lean) without ever coming back to work.

- wealth.json: new country / with_home / savings_per_year template vars + a
  per-Case row (target NW at the 99% GK bar, progress gauge, still-needed,
  projected FIRE date, runway) and safety-valve panels (re-entry trigger vs
  £1.0M, 2.5yr cash buffer, pension tranche @57, Anca-bridge note). Reads
  fire_planner.fire_target via the fire-planner-pg datasource (Mixed).
- fire-planner stack: fire-planner-fire-targets CronJob (monthly, 2nd 10:00
  UTC) runs `recompute-fire-targets --countries all`.

Targets come from the solver shipped in fire-planner edb4d11.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:52:17 +00:00
Viktor Barzin
6ba60cbb2d authentik: repoint to overlay patch2 (SFE for old Safari) + docs
All checks were successful
ci/woodpecker/push/default Pipeline was successful
global.image -> 2026.2.4-patch2 (adds the compat_needs_sfe SFE patch on top of the
SLOW-1a query patch). Old Safari/WebKit (<=16.3) now gets authentik's no-JS SFE
login instead of a blank page — fixes emo's iPadOS-15.8 iPad with no auth
downgrade. Docs: .claude/CLAUDE.md Authentik row + docs/architecture/authentication.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:39:29 +00:00
Viktor Barzin
5fb2004de5 Merge remote-tracking branch 'origin/master' into wizard/authentik-perf-fix
Some checks are pending
Build Custom Authentik Image / build (push) Waiting to run
ci/woodpecker/push/default Pipeline was successful
2026-06-28 11:38:07 +00:00
Viktor Barzin
f10bb71562 authentik overlay: serve the no-JS SFE login to old Safari (patch #2)
Old Safari/WebKit (<=16.3, e.g. iPadOS<=16.3) can't parse authentik's modern
ES2022 flow SPA and gets a COMPLETELY BLANK login — exactly what emo's iPadOS-15.8
iPad hit. authentik already ships a no-JS Simplified Flow Executor (SFE, ES5) and
serves it via compat_needs_sfe(), but only for IE/old-Edge/PKeyAuth. Extend that
to old Safari so those clients get the REAL authentik login (password + MFA +
reputation, identity preserved — NO auth downgrade, no new credential store).

Chosen over a Traefik basic-auth fallback after an adversarial review: that route
would put a single, spoofable-UA password in front of vbarzin->wizard (passwordless
root on the cluster-controlling devvm) — an MFA->single-factor path to cluster root.
SFE keeps full authentik auth and is generic for any old browser.

Shipped as patch #2 in the existing overlay image (patch-compat-sfe.py — guarded:
asserts the upstream anchor + ast-parses; verified against the live interface.py).
Tag -> 2026.2.4-patch2; the values repoint lands once GHA builds the image.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:38:05 +00:00
Viktor Barzin
ec681ba6e1 ci(infra): stop double-apply + stop counting PG lock-waits as failures
All checks were successful
ci/woodpecker/push/default Pipeline was successful
The infra terragrunt-apply pipeline (.woodpecker/default.yml) was going
red ~20% of the time. Root causes (verified from the failure logs, not
guessed):

1. infra is registered in Woodpecker TWICE — canonical Forgejo (repo 82)
   AND legacy GitHub mirror (repo 1) — and BOTH run `default.yml` on every
   push. The two applies race each other for the per-stack PG state lock →
   "Error acquiring the state lock" failures + push-supersede "killed" runs.
2. The skip-not-fail lock guard only matched the Tier-0 Vault lock string
   ("is locked by"); the Tier-1 PG-backend lock ("Error acquiring the state
   lock") fell through and was counted as a hard FAILURE.
3. Transient provider-registry download timeouts (and Vault 5xx) failed the
   whole pipeline with no retry.

Fixes (all in default.yml):
- Forge guard: the push-apply runs ONLY on the canonical Forgejo forge; on
  the GitHub mirror it no-ops (exit 0). The mirror keeps running the crons
  (they live on repo 1), so we de-dup the apply without deactivating the
  registration. Fail-open on unknown forge.
- Lock-skip now matches BOTH tiers (Vault + PG) → lock-waits are SKIPPED.
- Bounded retry (3x) ONLY on transient signatures (provider download
  timeout, Vault 5xx). Config errors + helm atomic-timeouts fail fast.
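
The skip / retry / fail-fast policy above can be sketched as follows. This is
an illustrative Python rendering, not the shell logic in default.yml: the two
lock strings are the ones quoted in this message, while the transient
patterns, the function names and the reading of "3x" as three attempts in
total are assumptions.

```python
import re
from typing import Callable, Tuple

# Lock signatures quoted in the commit message (both tiers are skipped).
LOCK_PATTERNS = [
    re.compile(r"is locked by"),                    # Tier-0 Vault lock
    re.compile(r"Error acquiring the state lock"),  # Tier-1 PG-backend lock
]

# Hypothetical transient signatures; the real regexes are not in the message.
TRANSIENT_PATTERNS = [
    re.compile(r"registry\..*(timeout|timed out)", re.IGNORECASE),
    re.compile(r"vault.*\b5\d\d\b", re.IGNORECASE),
]


def classify(log: str) -> str:
    """Map a failed apply's log to 'skip', 'retry' or 'fail'."""
    if any(p.search(log) for p in LOCK_PATTERNS):
        return "skip"   # another apply holds the lock: not a failure
    if any(p.search(log) for p in TRANSIENT_PATTERNS):
        return "retry"  # transient: worth a bounded retry
    return "fail"       # config errors, helm atomic timeouts, etc.


def run_with_policy(apply: Callable[[], Tuple[bool, str]], max_attempts: int = 3) -> str:
    """Run apply() up to max_attempts times, retrying only transient failures."""
    for _ in range(max_attempts):
        ok, log = apply()
        if ok:
            return "success"
        verdict = classify(log)
        if verdict == "skip":
            return "skipped"
        if verdict == "fail":
            return "failed"
    return "failed"  # still transient after the last attempt
```

Checking the lock patterns first means a log that happens to contain both a
lock message and a transient signature is skipped rather than retried, so the
retry loop never contends for a lock that a concurrent apply holds.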

Rejected (documented in docs/architecture/ci-cd.md): an off-infra GHA
validate gate (it catches ~0 of the real failures, which are runtime,
Vault-data, SSA and lock failures; reproduced `terraform validate` passing
the exact stacks that fail at apply) and lock-reaping/force-unlock (PG
advisory locks are session-scoped and auto-release; force-unlock can't free
them and would corrupt a live concurrent apply).

Shell logic + the classification regexes were unit-tested locally against
the real decoded error strings (#359 PG lock, #353 provider timeout, #360
missing-arg, helm atomic timeout); `bash -n` clean; YAML parses.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:37:18 +00:00
Viktor Barzin
69e35efd95 Merge remote-tracking branch 'origin/master' into wizard/vault-kv
Some checks failed
ci/woodpecker/push/default Pipeline was successful
Build infra CLI / build (push) Has been cancelled
2026-06-28 11:09:38 +00:00
Viktor Barzin
e03e4719ad vault: distinguish Vaultwarden vs HashiCorp Vault, add vault kv
`homelab vault` only spoke to Vaultwarden (the password manager), but the
name reads as HashiCorp Vault (the infra secrets store — actually OpenBao
here). Make the two unmistakable and support both.

Distinction (no breakage — the existing Vaultwarden verbs are unchanged):
- bare `homelab vault` help now LEADS with the two-stores split;
- every verb summary is tagged `[vaultwarden]` or `[hashicorp-vault]`;
- HashiCorp Vault/OpenBao lives under a clearly-named `vault kv` group.

New `vault kv` (HashiCorp Vault / OpenBao, the secret/… KV store):
- `kv get <path> [--field K]` — read; --field → one value (TTY-aware
  clipboard/stdout), no field → full secret JSON (refuses a bare TTY).
- `kv list <path>` — list sub-paths (no values).
- `kv put <path> <key>` — write one key; value via stdin (piped or
  no-echo prompt, never argv); creates the path or merges (never
  clobbers siblings; uses kv patch -method=rw so no `patch` cap needed).
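
The read and merge behaviour can be illustrated with a small sketch. It is
written in Python for illustration only and is not the Go code in
cmd_vault_kv.go; the function names are hypothetical. It assumes the KV-v2
JSON read shape, where the secret's keys sit under `data.data` next to a
`metadata` object.

```python
import json
from typing import Optional


def extract_kv_data(raw: str) -> dict:
    """Strip the KV-v2 envelope and return only the secret's key/value pairs."""
    doc = json.loads(raw)
    return doc["data"]["data"]


def merge_key(existing: Optional[dict], key: str, value: str) -> dict:
    """Create the secret if absent, otherwise set one key and keep the siblings."""
    merged = dict(existing or {})  # copy, so the caller's dict is not mutated
    merged[key] = value
    return merged
```

Writing back the merged dict, instead of a dict holding only the new key, is
what keeps sibling keys intact when the path already exists.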

Critical: `kv` uses the caller's OWN Vault token (OIDC ~/.vault-token /
$VAULT_TOKEN), NOT the per-user scoped Vaultwarden token (bound only to
claude-users/<user>, which would 403 elsewhere) — handlers set VAULT_ADDR
but never inject the scoped token. Access is whatever the policy grants.

Logic in cmd_vault_kv.go (pure cores extractKVData/parseKVList/arg
builders/kvGet/List/Put; file header documents the credential split).
CLI v0.11.0. Tests: no value in put argv, create-then-merge, KV-v2
envelope strip, help names both systems. Verified e2e against live Vault
(read key-names-only + a scratch put/merge/cleanup).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:09:33 +00:00
Viktor Barzin
460f2ad42f state(vault): update encrypted state
All checks were successful
ci/woodpecker/push/default Pipeline was successful
2026-06-28 11:07:22 +00:00
Viktor Barzin
87a450e9a3 vault: grant emo full read/write on his own secret/emo tree
Viktor asked that emo be able to edit his own secrets with full access.
emo's personal-emo policy was read-only (read on data, read/list on
metadata), so he could view but not change his personal secrets.

Widen it to the same self-service capability set every namespace-owner
already has over their own tree: create/read/update/delete/list on
secret/data/emo(+/*) and list/read/delete on secret/metadata/emo(+/*).
Scope is unchanged — still only emo's own secret/emo subtree, still a
named exception that does not widen the power-user tier in general.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-28 11:07:22 +00:00