Commit graph

187 commits

Author SHA1 Message Date
a048b37f60 security(wave1): W1.1 audit-log shipping LIVE + W1.5 trusted-registries Enforce LIVE
## W1.1 — K8s API audit log shipping (LIVE)
- alloy.yaml: added control-plane toleration so Alloy DaemonSet runs on
  k8s-master node. Verified alloy-7zg7t scheduled on master, tailing
  /var/log/kubernetes/audit.log
- loki.tf "Security Wave 1" rule group: added K2-K9 alert rules
  (skipped K1 per Q7 decision):
  - K2 K8sSATokenFromUnexpectedIP
  - K3 K8sSensitiveSecretReadByUnexpectedActor
  - K4 K8sExecIntoSensitiveNamespace
  - K5 K8sMassDelete (>5 Pod/Secret/CM in 60s by single user)
  - K6 K8sAuditPolicyModified (kubeadm-config CM change)
  - K7 K8sClusterRoleWildcardCreated (verbs=* + resources=*)
  - K8 K8sAnonymousBindingGranted
  - K9 K8sViktorFromUnexpectedIP
- All rules use source-IP regex matching the wave-1 allowlist
  (10.0.20.0/22, 192.168.1.0/24, 10.10.0.0/16 pod, 10.96.0.0/12 svc,
  100.64-127 tailnet) and `lane = "security"` → #security Slack route.
- Verified: kubectl-audit logs flowing in Loki query
  {job="kubernetes-audit"} returns events with node=k8s-master.
- Verified: /loki/api/v1/rules lists all K2-K9 + V1-V7 + S1.
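
The allowlist check behind the K-rules can be sketched outside LogQL. This is a minimal Python illustration, not the deployed regex: it assumes the "100.64-127 tailnet" shorthand means the CGNAT block 100.64.0.0/10, and `is_allowlisted` is a hypothetical helper name.

```python
import ipaddress

# Wave-1 allowlist from the commit message. The last entry is an
# assumption: "100.64-127 tailnet" is read as 100.64.0.0/10.
ALLOWLIST = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.20.0/22",    # first range in the allowlist
        "192.168.1.0/24",  # second range in the allowlist
        "10.10.0.0/16",    # pod CIDR
        "10.96.0.0/12",    # service CIDR
        "100.64.0.0/10",   # tailnet (assumed CIDR form)
    )
]


def is_allowlisted(source_ip: str) -> bool:
    """Return True when source_ip falls inside any allowlisted network."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in ALLOWLIST)
```

An alert rule of this family would fire on the negation, i.e. when `is_allowlisted(source_ip)` is false for an audited request.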

## W1.5 — require-trusted-registries Enforce (LIVE)
- security-policies.tf: flipped Audit→Enforce with explicit allowlist
  built by `kubectl get pods -A -o jsonpath='{..image}'` enumeration.
- Removed `*/*` catch-all (which made Audit→Enforce a no-op).
- Pattern includes 15 explicit registries, 6 DockerHub library bare
  names, 56 DockerHub user repos.
- Verified by admission dry-run:
  - evilcorp.example/malware:v1 → BLOCKED with custom message
  - alpine:3.20 → ALLOWED (matches `alpine*`)
  - docker.io/library/alpine:3.20 → ALLOWED (matches `docker.io/*`)
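
The three dry-run results can be reproduced with plain glob matching. This is a sketch only: it uses Python's `fnmatch` as a stand-in for the policy engine's wildcard matcher, and it lists just the two patterns quoted above (the real allowlist has many more entries). `is_trusted` is a hypothetical name.

```python
from fnmatch import fnmatchcase

# Only the two patterns named in the commit message; the deployed
# policy carries a much longer list.
ALLOWED_PATTERNS = ["alpine*", "docker.io/*"]


def is_trusted(image: str) -> bool:
    """Return True when the image reference matches any allowed glob."""
    return any(fnmatchcase(image, pattern) for pattern in ALLOWED_PATTERNS)
```

One design point this makes visible: in `fnmatch` the `*` also crosses `/`, so a bare-name pattern such as `alpine*` matches any reference that merely starts with `alpine`. Whether the policy engine behaves the same way is worth confirming before relying on bare-name globs.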

## W1.6 — Calico flow logs (BLOCKED — Calico OSS limitation)
- Tried adding FelixConfiguration with flowLogsFileEnabled=true via
  kubectl_manifest in stacks/calico/main.tf
- Calico OSS rejected with "strict decoding error: unknown field
  spec.flowLogsFileEnabled" — these fields are Calico Enterprise/Tigera-only
- Removed the failed resource. Documented alternative paths in main.tf
  comment block: GNP with action=Log (iptables NFLOG → journal), Cilium
  migration, eBPF tooling, or Tigera Operator adoption.

## Docs updates
- security.md status table refreshed: W1.1/W1.2/W1.3/W1.4/W1.5 LIVE,
  W1.6/W1.7 blocked
- monitoring.md: Loki marked DEPLOYED (was incorrectly NOT-DEPLOYED in
  prior session before today's apply)

## Cleanup
- Removed stacks/kyverno/imports.tf (TF 1.5+ import blocks completed
  their job in the 2026-05-18 apply; TF docs allow removing them once
  the import has been applied)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:59 +00:00
83079758bb monitoring(wave1): re-enable Loki+Alloy, deploy wave1 alert rules, add #security Slack lane
## Loki + Alloy re-enabled (code-146x)
- Uncommented helm_release.loki, helm_release.alloy, kubernetes_daemon_set_v1.sysctl-inotify,
  kubernetes_config_map.loki_alert_rules, kubernetes_config_map.grafana_loki_datasource
- Reverses the documented "operational overhead vs benefit after node2 incident"
  decision. Re-evaluated because wave 1 security detection layer (beads code-8ywc)
  needs Loki + ruler + alert routing.
- SingleBinary mode, 2-4Gi mem, 50Gi proxmox-lvm PVC, 30-day retention, ruler enabled
  pointed at prometheus-alertmanager.monitoring.svc:9093
- Alloy DaemonSet (4 pods on worker nodes) discovers pod logs via K8s API + pushes
  to Loki
- Loki canaries running (4)
- Vault audit-tail sidecar logs now flowing to Loki: queried
  {namespace="vault",container="audit-tail"} returns live audit JSON

## Wave 1 alert rules deployed (W1.3 partial)
Added "Security Wave 1" rule group to loki_alert_rules configmap:
- V1: VaultRootTokenCreated — auth/token/create with policies=[root]
- V2: VaultAuditDeviceModified — sys/audit/* create/delete/update
- V3: VaultSealChanged — sys/seal update
- V4: VaultPolicyModified — sys/policies/acl/* create/update/delete
- V5: VaultAuthFailureSpike — >10 permission denied/min
- V7: VaultViktorFromUnexpectedIP — auth as me@viktorbarzin.me from non-allowlist source IP
  (allowlist: 10.0.20.0/22, 192.168.1.0/24, 10.10.0.0/16 pod, 10.96.0.0/12 svc, 100.64-127 tailnet)
- S1: PVEsshLoginFromUnexpectedIP — sshd "Accepted" from non-allowlist IP (rule defined,
  fires once promtail/Alloy ships sshd journal with job=sshd-pve)

Verified rules visible via /loki/api/v1/rules. K2-K9 (K8s API audit) deferred to W1.1
which needs the audit policy + apiserver log shipping codified.

## #security Slack lane (Alertmanager)
- New `slack-security` receiver in prometheus_chart_values.tpl, channel #security
- Higher-priority route at top of routes list: matchers `lane = security` →
  slack-security, continue: false (so wave 1 alerts never fall through to #alerts)
- Slack message format includes summary + description + runbook link annotation
- All wave 1 rules set `lane = "security"` label

## Resource summary
- 6 added: helm_release.loki, helm_release.alloy, kubernetes_config_map.grafana_loki_datasource,
  kubernetes_config_map.loki_alert_rules, kubernetes_daemon_set_v1.sysctl-inotify,
  + 1 other
- 5 changed: helm_release.prometheus (alertmanager config — new receiver + route),
  4 deployments (image tag drift from Keel-managed images, unrelated)
- 1 destroyed: null_resource grafana_admin_only_folder_acl["Finance (Personal)"]
  (timestamp-triggered always recreates — not destructive)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Closes: code-146x
2026-05-22 14:16:58 +00:00
87961e9ef8 monitoring(wealth): drop 6y timeFrom override on META vest cadence 2026-05-22 14:16:57 +00:00
99127939a8 monitoring(wealth): keep only FIFO-realized PNL table; pair Positions + vest-cadence side-by-side
- Removed panel 27 (META RSU vest value over time) — superseded by
  vest-cadence chart which carries the same value signal plus the
  share-count overlay.
- Removed panel 28 (per-vest value at vest vs today) — duplicative with
  panel 31's FIFO realized PNL.
- Removed panel 29 (per-sell realized PNL) — same data as panel 31,
  just rolled up by sell date instead of vest date.
- Resized panel 26 (Positions) to w=12 and moved panel 30
  (META vest cadence) to (y=32, x=12, w=12) so the two panels sit
  side-by-side.
- Moved panel 31 (FIFO realized PNL) to y=118, where the deleted RSU
  chart used to live.
2026-05-22 14:16:57 +00:00
b879481d71 monitoring(wealth): per-vest realized PNL via FIFO sell-match
New table panel below the per-sell breakdown. For each vest, FIFO-match
its shares against the subsequent sells (shares from earlier vests get
sold first), and aggregate the matched portions:

  realized_pnl = SUM(matched_qty * (sell_price - vest_price))
  pnl_pct      = realized_pnl / SUM(matched_qty * vest_price) * 100
  days_held    = AVG(sell_date - vest_date) per matched portion

Footer reducer sums shares, vest value, sell value, and realized PNL
so the bottom row is the full-portfolio realized take.
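
The FIFO match described above can be sketched in Python. The panel itself is SQL; this is only an illustration of the same arithmetic, with hypothetical names (`Lot`, `fifo_realized_pnl`) and made-up sample figures.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Lot:
    day: date
    qty: float
    price: float


def fifo_realized_pnl(vests, sells):
    """Match sells against vests oldest-first and aggregate per vest."""
    vests = sorted(vests, key=lambda lot: lot.day)
    sells = sorted(sells, key=lambda lot: lot.day)
    remaining = [v.qty for v in vests]
    matches = [[] for _ in vests]  # per vest: (qty, sell_price, days_held)
    cursor = 0  # index of the oldest vest that still has unsold shares
    for sell in sells:
        to_match = sell.qty
        while to_match > 0 and cursor < len(vests):
            take = min(to_match, remaining[cursor])
            if take > 0:
                held = (sell.day - vests[cursor].day).days
                matches[cursor].append((take, sell.price, held))
                remaining[cursor] -= take
                to_match -= take
            if remaining[cursor] == 0:
                cursor += 1
    rows = []
    for vest, matched in zip(vests, matches):
        if not matched:
            continue  # nothing from this vest has been sold yet
        cost = sum(q * vest.price for q, _, _ in matched)
        pnl = sum(q * (sp - vest.price) for q, sp, _ in matched)
        rows.append({
            "vest_date": vest.day,
            "realized_pnl": pnl,
            "pnl_pct": pnl / cost * 100,
            "days_held": sum(d for _, _, d in matched) / len(matched),
        })
    return rows


# Made-up sample: two vests of 10 shares, one sale of 15 shares.
sample = fifo_realized_pnl(
    [Lot(date(2021, 1, 1), 10, 100.0), Lot(date(2021, 4, 1), 10, 120.0)],
    [Lot(date(2021, 7, 1), 15, 150.0)],
)
# First vest: all 10 shares matched, realized_pnl 500.0, pnl_pct 50.0.
# Second vest: 5 shares matched, realized_pnl 150.0, pnl_pct 25.0.
```

`days_held` here is the unweighted average over matched portions, mirroring the `AVG(...)` in the formula; a share-weighted average would be a different design choice.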
2026-05-22 14:16:57 +00:00
8b60e6bb6d monitoring(wealth): META vest cadence chart — value vs shares (dual axis)
Per-vest event line chart. Left Y axis (blue): vest value at the
time = SUM(quantity * unit_price), in USD. Right Y axis (orange):
number of shares vested. One point per vest date (aggregated when
multiple BUY rows share a date, e.g. 2021-05-18 was 18 + 2 shares).

Lets Viktor see how vest sizes ramped (initial 18 shares -> 38 ->
60s) and how the per-vest USD value tracked META's price ride
across 2020-2026. timeFrom='6y' override pins the panel to the full
vesting window.
2026-05-22 14:16:57 +00:00
af077112cb monitoring(wealth): META vest + sell PNL tables with FIFO cost basis
Two new bottom-of-dashboard tables:

Panel 28 'META vests — value at vest vs today': one row per BUY
activity. Shows vest-day price * shares + what those same shares
would be worth at today's META quote, plus the hypo P&L if Viktor
had held everything (color-text on the gain columns).

Panel 29 'META sells — realized PNL vs if held until today':
one row per SELL with FIFO-matched cost basis (LEAST/GREATEST
overlap in cumulative-share space). Shows realized P&L, the
counterfactual P&L had he held until today, and the
'missed by' delta = (today_price - sell_price) * shares.

Both pull today_price dynamically from quote_latest via a CTE so
they self-update as Yahoo updates the META quote. Schwab account
is empty so no live activity is expected.
2026-05-22 14:16:57 +00:00
20c5965f95 monitoring(wealth): pin META RSU panel to 6y window
Dashboard default time range is now-180d, but the META vesting + sell
arc spans 2020-11 → 2026-02. With the default window the panel just
showed a flat line at $64 (the empty post-sell residual). timeFrom='6y'
override makes panel 27 always render the full vesting curve regardless
of the dashboard-level time selector.
2026-05-22 14:16:57 +00:00
018ef3790f monitoring(wealth): META RSU vest value panel (Schwab account)
Daily total_value timeseries for the Schwab workplace account
(account_id 72d34e09-...). Single-asset account holding META RSUs
that vested 2020-11 → 2026-02 and were sold opportunistically over
the same window. Currency USD (account_currency). Yahoo quote on
META powers WF's daily mark; the historical DAV mirrored into
wealthfolio_sync via pg-sync gives us ~2k days of vesting curve.
2026-05-22 14:16:57 +00:00
Viktor Barzin
5482f46125 RecentNodeReboot: 24h → 1h threshold, matching upgrade-chain preflight
The 24h kubelet-uptime threshold (process_start_time_seconds < 86400)
was a defense-in-depth duplicate of the 24h-since-Ready-transition
check in kured-sentinel-gate Check 4 — but they used different
signals (kubelet process start vs node Ready transition). Whenever
the cluster cycled through reboots, the alert kept firing for a full
day even after sentinel-gate's check passed, and blocked anything
querying halt-on-alert (kured, K8s version-upgrade preflight).

Tightened to 1h (3600s) for "node just rebooted, give it a settle
window". The cluster-wide 24h-between-reboots invariant lives
exclusively in kured-sentinel-gate Check 4 from now on (independent,
uses lastTransitionTime).

Matched the preflight's own 24h-quiet check in upgrade-step.sh
(86400 → 3600) so it doesn't act as a second blocker.

Empirically verified: all 5 kubelets are >10h up, alert cleared on
next eval after the rule reload.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:57 +00:00
Viktor Barzin
3bdba9f388 keel: enroll 15 critical-path namespaces for digest-only auto-update
Per user decision today: monitoring, mailserver, vault, descheduler,
metrics-server, traefik, technitium, crowdsec, redis, reverse-proxy,
reloader, headscale, wireguard, xray, cloudflared now participate in
the same `force + match-tag` regime as the rest of the cluster — Keel
watches the deployment's CURRENT tag for digest changes only and rolls
on push, never rewriting tag strings.

Two-part change:

stacks/kyverno/modules/kyverno/keel-annotations.tf
  Trim the policy-level namespace exclude list from 31 → 16. The 16
  remaining exclusions are the irreducible cluster-operator + state-
  coupled set: keel itself, calico-system + tigera-operator (operator
  loop), authentik (2026-05-17 pgbouncer incident bite), cnpg-system +
  dbaas (state-coupled), kyverno, metallb-system, external-secrets,
  proxmox-csi + nfs-csi + nvidia (just stabilized today, chart-pinned),
  kube-system, vpa, sealed-secrets, infra-maintenance.

stacks/<each-of-15>/.../main.tf
  Add `"keel.sh/enrolled" = "true"` label to the `kubernetes_namespace`
  resource so the Kyverno mutate policy can target the workloads via
  its namespaceSelector matchLabels.

Note on the apply path: the live ClusterPolicy was patched via
`kubectl patch` because the hashicorp/kubernetes provider v3.1.0 panics
during state refresh on Kyverno ClusterPolicy schemas with deeply
nested optional `context.celPreconditions` / `imageRegistry` fields
(see crash dump). The TF source above has the desired state, so any
clean future apply on a fixed provider version will be a no-op against
the live cluster.

Floating-tag workloads in the newly-enrolled set (will roll on every
upstream digest update — acceptable risk per user):
  - wireguard: sclevine/wg:latest (image fixed today via iptables-nft
    postStart shim)
  - xray: teddysun/xray
  - crowdsec-web: viktorbarzin/crowdsec_web
  - monitoring: prompve/prometheus-pve-exporter:latest, prom/snmp-exporter
  - traefik: nginx:1-alpine, openresty/openresty:alpine,
    ghcr.io/tarampampam/error-pages:3
  - redis: haproxy:3.1-alpine, redis:8-alpine

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:56 +00:00
Viktor Barzin
45c8e88e89 terminal: probe + alerts after Traefik replica routing-table skew
User reported "site loads but failed to connect on the tmux session". Root
cause was a Traefik replica (traefik-db7696fbf-ktjjz) that came up missing
the kubernetes_ingress-derived router for terminal.viktorbarzin.me — only
the IngressRoute CRDs registered. About 1/3 of /token preflight requests
landed on that replica and got 404 with router="-", and WS upgrades
intermittently failed the same way, so the lobby iframe stayed stuck on
"Failed to connect. Retrying...". `kubectl delete pod` on the bad replica
restored the missing router and unblocked the user.

This commit adds the long-term mitigation:

stacks/terminal/main.tf
  - kubernetes_cron_job_v1.webterminal_probe runs every 5min, hits
    /token + /ws via Cloudflare and the in-cluster ttyd Service, pushes
    4 gauges to Pushgateway (token_status, ws_status, ttyd_status,
    last_success_timestamp). Verified the probe end-to-end:
      token=302 ws=302 ttyd=200 ok=1

stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl
  - Webterminal group: WebterminalTokenDegraded (warning, 10m),
    WebterminalWebsocketDegraded (critical, 10m),
    WebterminalTtydUnreachable (critical, 10m),
    WebterminalProbeStale (warning, 15m).
  - Traefik Router Parity group: TraefikRouterCountSkew fires when any
    Traefik replica's router count diverges from siblings for >10m —
    catches the same class of issue cluster-wide, not just for terminal.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:56 +00:00
126cfb7022 wealth: dav_corrected view fixes pension gains-offset miscategorisation
The broker-sync Fidelity provider emits 'unrealised-gains-offset'
DEPOSIT activities to reconcile Wealthfolio's total with the
PlanViewer reported pot, because Wealthfolio doesn't track pension
fund units directly. Wealthfolio's data model treats that DEPOSIT as
a cash contribution, which double-inflates net_contribution and
zeroes out the implied growth.

Add a Postgres view 'dav_corrected' in wealthfolio_sync that
subtracts the cumulative gains-offset from net_contribution per
account per date (re-exporting as 'net_contribution' so it's a
drop-in replacement). All 17 wealth dashboard panels that compute
contribution/growth/ROI now read from the view. Total impact:
portfolio Growth jumps from £301,753.19 to £337,474.39 (exactly
the £35,721.20 Fidelity offset that was previously miscategorised).
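
The correction the view applies can be sketched as a running subtraction. This is a Python illustration of the idea, not the view's SQL; `corrected_contributions` and the sample rows are hypothetical.

```python
from decimal import Decimal
from itertools import accumulate


def corrected_contributions(rows):
    """rows: (day, net_contribution, gains_offset_that_day) for one
    account, in date order. Returns (day, corrected_net_contribution)."""
    cumulative_offset = accumulate(offset for _, _, offset in rows)
    return [
        (day, contribution - offset_so_far)
        for (day, contribution, _), offset_so_far in zip(rows, cumulative_offset)
    ]


# Made-up sample: a 150 gains-offset DEPOSIT lands on day 2 and inflates
# net_contribution from then on; day 3 adds a genuine 100 deposit.
rows = [
    ("2026-01-01", Decimal("1000"), Decimal("0")),
    ("2026-01-02", Decimal("1150"), Decimal("150")),
    ("2026-01-03", Decimal("1250"), Decimal("0")),
]
corrected = corrected_contributions(rows)
```

Because growth is market value minus contribution, lowering contribution by the cumulative offset raises growth by the same amount, which is why the reported jump equals the offset exactly.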
2026-05-22 14:16:52 +00:00
Viktor Barzin
95b9f7bc89 aiostreams: 1h stream cache + canary stream-count probe + 3 alerts
Hardening pass following the empty-stream-list incident:

1. STREAM_CACHE_TTL=3600 — re-enables stream payload cache (was -1 /
   disabled). Default behaviour hit all 5 upstream addons on every
   Stremio request; with a 1h TTL repeat requests for the same title
   are instant, while RD cache invalidations still propagate quickly.

2. aiostreams-stream-probe CronJob (every 5 min): fetches the user's
   encryptedPassword via the internal ClusterIP, runs a canary stream
   search for Breaking Bad S01E01, pushes streams_count + probe_success
   to Pushgateway. Uses an ExternalSecret pulling UUID + password from
   Vault secret/viktor. Same pattern as email-roundtrip-monitor.

3. Three alerts in monitoring's prometheus_chart_values.tpl:
   - AIOStreamsStreamCountLow  (< 50 streams for 30m)
   - AIOStreamsProbeFailing    (probe_success == 0 for 30m)
   - AIOStreamsProbeStale      (last_run_timestamp > 30min for 10m)

Verified: probe returned streams=411 success=1 on first run; all 3
alerts loaded into Prometheus with state=inactive health=ok.
2026-05-22 14:16:46 +00:00
2903ab9778 monitoring(wealth): move Positions table under contrib/growth row
Positions panel now sits at y=32 (immediately below the
contrib-vs-market + growth row at y=22..32), and everything from
the per-account stack down shifts 8 rows lower.
2026-05-22 14:16:46 +00:00
8461275308 wealth: positions table panel (shares + cost basis + unrealised return)
pg-sync sidecar now mirrors three extra views from the wealthfolio
SQLite: assets (id/symbol/name/currency), quote_latest (one row per
asset, preferring YAHOO over MANUAL on same-day collisions), and
positions_latest (currently-held positions extracted from the TOTAL
aggregate row of holdings_snapshots — quantity, average cost,
total cost basis).

Wealth dashboard gets a new bottom Positions table joining the three:
symbol, name, shares, avg cost, last price, market value, cost,
gain, return %. Gain and return % are color-text with red<0, green>=0
thresholds.
2026-05-22 14:16:46 +00:00
726fb25182 monitoring(wealth): paint declining segments red on growth chart
Mirror the panel 5 treatment on panel 7 (Growth = market value −
contribution). Second SQL column emits the growth value only when
the point is part of a declining segment; field override paints it
red with no fill, spanNulls=false.
2026-05-22 14:16:45 +00:00
Viktor Barzin
cbd0f71a3b monitoring: PodImagePullBackOff alert + 2 inhibitors + JobFailed for:2h
Three improvements identified in the 7d alert-noise review:

A. New PodImagePullBackOff alert. `KubeletImagePullErrors` measures
   node-level pull error rate, which doesn't catch a single pod stuck
   in ImagePullBackOff — council-complaints sat broken for ~10h on
   2026-05-12 without paging. The new rule fires per-pod after 30m.

B. Two new inhibit_rules:
   - PVFillingUp (95% used, critical) suppresses PVPredictedFull
     (linear projection, warning) on the same PVC. Pair was producing
     ~24h of redundant firing per 7d.
   - EmailRoundtripFailing (active probe failure) suppresses
     EmailRoundtripStale (derivative >60min no-success). Same outage
     windows, ~14.5h of duplicate firing per 7d.

C. JobFailed for: 30m → 2h. Most cronjobs run every 5–15min; the old
   30-minute window paged on the first failed iteration before the
   next run could recover. 2h means "still failing across at least
   two cron iterations" — much more actionable.

Verified live: rules loaded, inhibitors in alertmanager config,
PodImagePullBackOff is currently inactive (council-complaints
ImagePullBackOff actively detected — see separate fix).
2026-05-22 14:16:45 +00:00
Viktor Barzin
70292b9e23 monitoring: TraefikReplicaConfigStale — drop false-positive on stale series
The initial formulation used clamp_min(min(rate[2h]), 0.0001), which
made a recently-deleted pod's lingering rate=0 drive the ratio toward
infinity for up to 2h until the stale series aged out of the rate
window. With for: 2h, this was a near-miss for spurious firing in the
immediate aftermath of restarting the bad replica (our remediation
path).

Tighter formulation:
* 30m rate window — stale series ages out within minutes, not hours
* `min(rate) > 0.0005` floor — filters both stale-zero and fresh-pod
  ramp-up series; the bug's actual rate (~0.00076 in the 2026-05-12
  incident) sits well above it, so true positives still trip
* for: 1h — fast enough to catch the next incident, long enough that
  short rate dips don't flap

Verified: post-deploy `(max/min) > 5 AND min > 0.0005` evaluates to 0
results with the live cluster's tight rate spread (~0.00065–0.0007/s
across all three Traefik replicas).
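
The difference between the two formulations shows up when the threshold logic is written out. The sketch below models only the ratio and floor, not the rate windows or the `for:` clause; the function names and the peer rates in the usage note are hypothetical.

```python
def old_rule(rates):
    """Initial formulation: max / clamp_min(min, 0.0001) > 5."""
    return max(rates) / max(min(rates), 0.0001) > 5


def new_rule(rates):
    """Tighter formulation: (max / min) > 5 AND min > 0.0005."""
    lowest = min(rates)
    return lowest > 0.0005 and max(rates) / lowest > 5
```

With a lingering stale series, e.g. rates of `[0.0007, 0.00065, 0.0]`, the clamped minimum makes `old_rule` trip while `new_rule` stays quiet because the floor rejects the zero. A slow replica at about 0.00076 next to assumed peers at 0.005 still trips `new_rule`.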
2026-05-22 14:16:45 +00:00
Viktor Barzin
165bb7258e monitoring: detect stale Traefik replicas + reduce alert-storm cascading
Two new alertmanager inhibit rules and one new Prometheus alert,
informed by the 2026-05-12 incident where Traefik pod
traefik-db7696fbf-k42wp came back after a SIGTERM with only 6 routers
vs 119 on healthy peers (stale K8s informer cache) and served 404 for
~1/3 of viktorbarzin.me traffic.

* New alert TraefikReplicaConfigStale: fires when max/min reload-rate
  ratio across Traefik pods exceeds 5x for 2h. The 2h window + 2h
  for-clause tolerates legitimate post-restart ramp-up; the bug
  pattern persists indefinitely.
* New inhibit: TraefikReplicaConfigStale suppresses the symptom
  alerts (HighService{Error,4xx,Latency}, IngressTTFB{High,Critical},
  IngressErrorRate5xxHigh, TraefikHighOpenConnections,
  ForwardAuthFallbackActive, AnubisChallengeStoreErrors,
  ExternalAccessDivergence) so only the actionable root cause pages.
* New inhibit: HomeAssistantDown suppresses
  HomeAssistantCriticalSensorUnavailable and
  HomeAssistantMetricsMissing — when HA itself is down, every sensor
  going unavailable is noise (10x firings observed in the last 12h).
* Extend NodeDown and NFSServerUnresponsive target lists to also
  suppress HomeAssistantCriticalSensorUnavailable.
2026-05-22 14:16:45 +00:00
Viktor Barzin
448bc0c0f6 k8s-version-upgrade: decompose into Job chain to fix self-preemption
The agent-based v1 ran inside claude-agent-service (replicas=1, no
nodeSelector) and self-evicted when it tried to drain its host (k8s-node4
on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers
v1.34.2) until manual recovery.

Rewrite the pipeline as a chain of nodeSelector-pinned Jobs:

  preflight (k8s-node1)
    → master   (k8s-node1)   drains k8s-master
    → worker × 3 (k8s-node1) drains k8s-node{4,3,2}
    → worker   (k8s-master + control-plane toleration) drains k8s-node1
    → postflight (no pinning)

Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by
envsubst-ing job-template.yaml into the next Job. Deterministic names
(k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply`
idempotent — a failed Job can be re-created without duplicating
downstream.
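
The naming scheme is what makes re-creation safe, and it can be sketched in a few lines. This Python illustration is not the shell in upgrade-step.sh; it assumes only worker phases carry the node suffix and that the version string is used verbatim (both assumptions), and `job_name` / `chain` are hypothetical names.

```python
def job_name(phase, target_version, node=None):
    """Build k8s-upgrade-<phase>-<target_version>[-<node>]."""
    parts = ["k8s-upgrade", phase, target_version]
    if node:
        parts.append(node)
    return "-".join(parts)


def chain(target_version):
    """Return Job names in the order the chain creates them."""
    steps = [
        ("preflight", None),
        ("master", None),
        ("worker", "k8s-node4"),
        ("worker", "k8s-node3"),
        ("worker", "k8s-node2"),
        ("worker", "k8s-node1"),
        ("postflight", None),
    ]
    return [job_name(phase, target_version, node) for phase, node in steps]
```

Since the name depends only on phase, version and node, re-running a step yields the same object name, so applying it again updates or no-ops instead of spawning a second downstream Job.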

Also lands `predrain_unstick`: deletes pods on the target node whose PDB
has 0 disruptionsAllowed. Without this, drain loops indefinitely on
single-replica deployments (e.g. every Anubis instance — discovered the
hard way during 2026-05-11 manual recovery of k8s-node3).

Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min).
Deprecates the agent prompt (renamed to *.deprecated.md with a header
pointer to the new code).

Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps),
then monitoring (loads the new alert). Both applied 2026-05-11.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:45 +00:00
Viktor Barzin
cd13b9d062 monitoring: drop PVAutoExpanding alert — info-only noise, not actionable
PVAutoExpanding fired at >80% used (info severity), but pvc-autoresizer's
threshold is 10% free (= 90% used) — the alert always fired ~10 points
before any action would have been taken, and there was nothing for an
operator to do during that window either. It was a "heads up" that
didn't surface a problem.

Real failure modes are already covered:
  * PVFillingUp (critical, >95% for 10m) — autoresizer didn't keep up
  * PVPredictedFull (warning, predict_linear 24h) — trend toward exhaustion

Sharpened PVFillingUp's annotation to spell out the likely causes
(storage_limit reached, expansion failing, or missing autoresizer
annotations) so the responder doesn't have to recall the runbook.
2026-05-22 14:16:44 +00:00
396cce82cf monitoring(wealth): paint declining segments red on portfolio chart
Add a second SQL column on panel 5 that returns net_worth only when the
current point's previous neighbor is higher or its next neighbor is
lower, i.e. the point is part of a declining segment (including the
peak and trough endpoints).
A field override draws this 'decline' series in red with no fill and
spanNulls=false, overlaying the green base line so down periods show
up as red on top of the climb.
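
A declining segment, endpoints included, can be picked out with a neighbour comparison. The panel does this in SQL; the Python below is a sketch of the same selection, and `decline_series` is a hypothetical name.

```python
def decline_series(values):
    """Keep a value only where it belongs to a declining segment,
    including the peak that starts it and the trough that ends it."""
    out = []
    for i, value in enumerate(values):
        fell_into = i > 0 and values[i - 1] > value
        falls_next = i + 1 < len(values) and values[i + 1] < value
        out.append(value if (fell_into or falls_next) else None)
    return out
```

The `None` gaps are what make `spanNulls=false` matter: without it the red series would be joined across the rising stretches.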
2026-05-22 14:16:44 +00:00
Viktor Barzin
f10784ddb6 infra: document auth = "app|none" tier on every legacy ingress
Sweep through the 30+ stacks that predated the auth = "app" tier
and were tagged auth = "none" without a comment explaining why
they weren't behind Authentik. Each is now self-documenting at the
call site, so the tg-level anti-exposure guard passes and future
readers don't have to reverse-engineer the intent.

Flipped 6 stacks from "none" to "app" — their backends have their
own user auth and the new tier records that more accurately:
  - navidrome   (Subsonic user/password)
  - ntfy        (deny-all default + user.db tokens)
  - nextcloud   (WebDAV/CalDAV/CardDAV app passwords)
  - vaultwarden (Bitwarden-compatible token auth)
  - headscale   (OIDC + preauth keys for Tailscale nodes)
  - paperless-ngx (app-layer login + API tokens)

Kept "none" with a comment on the rest — they're genuinely public,
webhook receivers, native-protocol endpoints, OAuth callbacks, or
Anubis-fronted: authentik (×2 + guest outpost), beads-server (dolt),
claude-memory (bearer-token MCP), dawarich, ebooks/book-search-api,
fire-planner /api, forgejo (git/OCI native clients), frigate (HA
integration), immich/frame, insta2spotify /api, instagram-poster
(meta fetcher), k8s-portal, matrix (native bearer), monitoring×2
(HA REST scrapes), n8n (webhooks), nvidia, onlyoffice (JWT),
owntracks (HTTP Basic), postiz, privatebin (client-side enc),
rybbit (analytics tracker), send (E2E file drop), tuya-bridge
(API key), vault (own auth + CLI), webhook_handler, woodpecker
(forgejo webhooks + OAuth), xray (×3 VPN transports).

real-estate-crawler/main.tf:400 already had its comment from a
prior edit — not touched here.

No live state changes — auth = "app" produces the same middleware
chain as auth = "none" (verified earlier this session). This commit
is purely documentation + intent-tagging.
2026-05-22 14:16:44 +00:00
Viktor Barzin
20774f794d dbaas+monitoring: bump PG max_connections to 200, add scrape + alerts
Cluster grew past the 100-conn default — steady-state idle was 90/100,
leaving zero headroom for terragrunt applies or transient surges. The
ceiling was being discovered by Terraform crashing (pq: "remaining
connection slots are reserved for roles with the SUPERUSER attribute"),
not by alerting, because we had no PG scrape config at all.

dbaas (Tier 0):
  * max_connections: 100 → 200
  * shared_buffers: 512MB → 1GB (Postgres recommends ~25% of pod memory)
  * effective_cache_size: 1536MB → 2560MB (scaled with pod memory)
  * pod memory: 2Gi → 3Gi (rough rule of thumb: enough for shared_buffers
    + ~16MB work_mem * concurrent sorts + OS cache + overhead)
  * Triggers bump on null_resource.pg_cluster forces CNPG to re-apply,
    which rolls the cluster (standby first, then primary failover).

monitoring:
  * New scrape job 'cnpg' on dbaas namespace pods labeled
    cnpg.io/podRole=instance, port name=metrics (9187). Relabels add
    cnpg_cluster + cnpg_role labels for alert grouping.
  * PGConnectionsHigh (warning, >85% for 10m) — heads-up before exhaustion.
  * PGConnectionsCritical (critical, >95% for 3m) — last call before
    refusing connections.

Verified: cnpg targets up, sum(cnpg_backends_total)=84, max_connections
metric=200, alert ratio 0.42 → both alerts inactive.
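
The verification arithmetic can be checked with a small sketch. It models only the thresholds of the two alerts, not their `for:` durations, and `pg_alert_level` is a hypothetical name.

```python
def pg_alert_level(backends, max_connections):
    """Classify connection usage against the two alert thresholds."""
    ratio = backends / max_connections
    if ratio > 0.95:
        return "critical"  # PGConnectionsCritical threshold
    if ratio > 0.85:
        return "warning"   # PGConnectionsHigh threshold
    return None
```

With the verified figures, `pg_alert_level(84, 200)` returns `None` (ratio 0.42). The earlier steady state of 90 backends against a limit of 100 would have landed in the warning band.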
2026-05-22 14:16:44 +00:00
Viktor Barzin
665b6b2934 actualbudget+monitoring: per-account bank-sync metrics, drop noisy alert
The bank-sync CronJob was posting to /accounts/banksync which fans out to
ALL accounts in a single call. With PSD2/GoCardless's 4-successful-pulls
per-account per-24h quota, a single rate-limited account would 500 the
whole call, and `bank_sync_success` would flip to 0 even though the data
itself was still flowing through manual UI syncs. Result: BankSyncFailing
fired routinely whenever the user had been active in the UI that day —
a structural false positive.

Fix:
  * CronJob: enumerate accounts via GET /accounts, POST per-account
    /accounts/{id}/banksync, emit bank_sync_account_success and
    bank_sync_account_last_success_timestamp labelled by account name.
    Roll up bank_sync_success = 1 iff any account succeeded.
  * Alerts: drop BankSyncFailing (noise generator). Keep BankSyncStale
    at 48h (global drought). Add BankSyncAccountStale at 72h (catches
    single-account auth expiry — the real signal we wanted).
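
The roll-up and the per-account staleness check can be sketched as follows. This is an illustration in Python, not the CronJob's code; the function names and account names are hypothetical, and timestamps are Unix seconds.

```python
STALE_AFTER_SECONDS = 72 * 3600  # BankSyncAccountStale window


def roll_up(per_account_success):
    """bank_sync_success is 1 iff any account succeeded."""
    return 1 if any(per_account_success.values()) else 0


def stale_accounts(last_success_ts, now):
    """Return accounts whose last success is older than 72h."""
    return sorted(
        name
        for name, ts in last_success_ts.items()
        if now - ts > STALE_AFTER_SECONDS
    )
```

A single rate-limited account no longer zeroes the global gauge: `roll_up({"current": 1, "savings": 0})` is still 1, while an account that has not succeeded for more than 72 hours is reported individually by `stale_accounts`.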

Verified: manual run on bank-sync-viktor pushes 6 per-account success +
timestamp series; roll-up bank_sync_success=1; no firing alerts.
2026-05-22 14:16:44 +00:00
Viktor Barzin
dd2b7de291 fix: HA Sofia REST sensors + PVC drift safety
Two real issues found while triaging HomeAssistantCriticalSensorUnavailable
alerts and the prometheus + technitium PVC Terminating-but-in-use
state from the earlier session.

1. idrac-redfish-exporter + snmp-exporter ingresses: auth=required →
   auth=none. HA Sofia REST sensors scrape these endpoints
   programmatically; with Authentik forward-auth in front, every
   request got a 302 to authentik.viktorbarzin.me and the REST
   sensors parsed the HTML login page instead of metrics — leaving
   the R730, UPS, and ~20 other sensors permanently unavailable.
   The allow_local_access_only IP allowlist (192.168.0.0/16 +
   10.0.0.0/8) already gates external access, so authentik on top
   was breaking machine-to-machine traffic for no security gain.

2. prometheus_server_pvc + technitium primary_config_encrypted:
   add lifecycle.ignore_changes = [spec[0].resources[0].requests].
   The autoresizer expands these PVCs; PVCs can't shrink. Without
   the ignore, every TF apply tried to revert the live size back
   to the TF spec value, hit K8s's shrink-forbidden rule, and
   force-replaced the PVC. Because the pod still mounted it, the
   PVC went into Terminating-but-protected limbo — fine until a
   pod restart would have orphaned the volume. Root cause of the
   2026-05-10 PVC Terminating incident.

Bonus: prometheus_server_pvc threshold was the inverted "90%" (the
same bug the bulk fecfa211 sweep fixed elsewhere; my regex only
matched "80%" so this one slipped through). Now "10%".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:43 +00:00
Viktor Barzin
e75bcaf394 k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline
Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison
on master for new patches + HEAD pkgs.k8s.io for next-minor availability,
then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent.

The agent (.claude/agents/k8s-version-upgrade.md) orchestrates:
  pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match)
    -> etcd snapshot save
    -> optional master containerd skew fix
    -> apt repo URL rewrite (minor bumps only)
    -> drain/upgrade/uncordon master via ssh < update_k8s.sh
    -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each
    -> post-flight verification

Two new Upgrade Gates alerts catch failure modes:
  - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m)
  - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m)

update_k8s.sh refactored to take --role / --release args; the agent shells
it into each node via SSH pipe. update_node.sh annotated as OS-major path.

Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section
in docs/architecture/automated-upgrades.md.

Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519
keypair distributed to all 5 nodes via authorized_keys; slack_webhook
reuses kured webhook URL on initial deploy).
2026-05-22 14:16:42 +00:00
Viktor Barzin
ff5538a667 ingress_factory: replace protected bool with auth enum + audit pass across 100 stacks
Phase 3+4 of default-deny ingress plan. Replaces the `protected = bool` (default
false → unprotected) variable in `modules/kubernetes/ingress_factory` with
`auth = string` enum (default "required" → fail-closed). Touches every
ingress_factory caller so the audit decision is recorded explicitly in code.

ingress_factory (Phase 3):
- `auth = "required"`: standard Authentik forward-auth (the legacy
  `protected = true` semantic).
- `auth = "public"`: forward-auth via the new `authentik-forward-auth-public`
  middleware → dedicated public outpost → guest auto-bind. Logged-in users
  keep their real identity.
- `auth = "none"`: no Authentik middleware. For Anubis-fronted content, native
  client APIs (Git, /v2/, WebDAV), webhook receivers, the Authentik outpost
  itself.
- `effective_anti_ai` default flips ON only when `auth = "none"` (auth-gated
  ingresses don't need anti-AI noise; the auth flow already discourages bots).
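
The enum semantics can be modelled roughly as below. This is a Python
sketch of the decision logic, not the module's HCL: `resolve_ingress`
is a hypothetical helper, and the middleware name used for "required"
is an assumption (only `authentik-forward-auth-public` is named here).

```python
VALID_AUTH = {"required", "public", "none"}


def resolve_ingress(auth="required", anti_ai=None):
    """Return (middlewares, effective_anti_ai) for one ingress."""
    if auth not in VALID_AUTH:
        raise ValueError(f"auth must be one of {sorted(VALID_AUTH)}, got {auth!r}")
    middlewares = {
        "required": ["authentik-forward-auth"],       # assumed name
        "public": ["authentik-forward-auth-public"],  # name from this commit
        "none": [],
    }[auth]
    # anti-AI defaults ON only for auth = "none"; an explicit value wins.
    effective_anti_ai = (auth == "none") if anti_ai is None else anti_ai
    return middlewares, effective_anti_ai
```

Omitting `auth` yields the forward-auth middleware, which is the
fail-closed default this change is about.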

Audit pass (Phase 4) across 96 ingress_factory call sites:
- 49 explicit `protected = true`     → `auth = "required"`
- 8 explicit `protected = false`     → `auth = "none"` (5) or `auth = "public"` (3)
- 64 previously-default (no protected line) → `auth = "required"` ADDED, then
  reviewed individually:
  * 9 Anubis-fronted (blog, www, kms, travel, f1, cyberchef, jsoncrack,
    homepage, wrongmove UI, privatebin) → `auth = "none"`
  * 22 native-client / programmatic surfaces (Forgejo Git+/v2/, webhook
    handler, claude-memory MCP, Nextcloud WebDAV, Matrix, Vault CLI/OIDC,
    xray VPN, ntfy, woodpecker webhooks, n8n triggers, ntfy push, dawarich
    location ingestion, immich frame kiosk, headscale CP, send anonymous
    drops, rybbit beacon, vaultwarden API, Authentik UI itself + outposts) →
    `auth = "none"`
  * Remaining ~33 → `auth = "required"` confirmed (admin tools, internal
    UIs, services without app-level auth)
- Smoke-test promotions to `auth = "public"`: fire-planner public UI,
  k8s-portal API, insta2spotify callback.

Three call sites in wrapper modules (`stacks/freedify/factory/`,
`stacks/reverse-proxy/modules/reverse_proxy/`) keep their internal `protected`
bool — they translate to `auth` internally, out of scope for this rename.

Behavior change: previously-default ingresses now fail closed (require
Authentik login) unless explicitly flipped to `auth = "none"` or
`auth = "public"`. This is the audit goal — no more accidentally-unprotected
surfaces. Sites that were intentionally public (Anubis content, native APIs,
webhooks) are now explicitly recorded as `auth = "none"`.

Drive-by: `modules/create-vm/main.tf` picked up cosmetic alignment via
`terraform fmt -recursive` during the audit. Behavior-neutral.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:42 +00:00
Viktor Barzin
4103ea2ba0 monitoring(prometheus): keep all 4 kubelet_volume_stats metrics
pvc-autoresizer's GetMetrics() returns volume stats for a PVC only if
all four kubelet_volume_stats metrics (available_bytes, capacity_bytes,
inodes_free, inodes) are retrieved. The keep-list in the
kubernetes-nodes scrape job had available_bytes and capacity_bytes
(post 9d5da4d8) but was missing the two inode metrics, so the
autoresizer's reconcile logged "failed to get volume stats" for every
PVC and never resized anything.

Add kubelet_volume_stats_inodes and kubelet_volume_stats_inodes_free
to the regex.
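
A small check of the idea, assuming the keep-list is a single
alternation regex. The two regex strings below are illustrative
stand-ins, not the actual scrape config.

```python
import re

# The four series pvc-autoresizer needs, per the description above.
REQUIRED = [
    "kubelet_volume_stats_available_bytes",
    "kubelet_volume_stats_capacity_bytes",
    "kubelet_volume_stats_inodes_free",
    "kubelet_volume_stats_inodes",
]

# Hypothetical keep-list fragments before and after this change.
KEEP_BEFORE = r"kubelet_volume_stats_(available_bytes|capacity_bytes|used_bytes)"
KEEP_AFTER = (
    r"kubelet_volume_stats_"
    r"(available_bytes|capacity_bytes|used_bytes|inodes|inodes_free)"
)


def dropped(keep_regex, names):
    """Names a keep rule would drop."""
    # Prometheus relabel regexes are fully anchored, hence fullmatch.
    return [n for n in names if re.fullmatch(keep_regex, n) is None]
```

With the old pattern `dropped(KEEP_BEFORE, REQUIRED)` returns the two
inode series; with the new one it returns an empty list.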

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 14:16:42 +00:00
Viktor Barzin
278ef5f19b monitoring(grafana): swap python3 for jq in folder-ACL local-exec
CI image (ci/Dockerfile) is alpine + jq, no python3. The
grafana_admin_only_folder_acl null_resource was parsing /api/folders
with a python3 oneliner, which crashed every CI apply with
"python3: command not found" and made every monitoring stack apply
fail in CI (worked locally because the dev VM has python3).

jq is already in the CI image and produces the same output.
2026-05-22 14:16:41 +00:00
Viktor Barzin
5c0ea96a91 infra: re-enable unattended-upgrades with kured prometheus-gating
Reverses the March 2026 outage mitigation that disabled unattended-
upgrades cluster-wide. Now re-enables it on the k8s template VM with:

  - Allowed-Origins limited to security/updates pockets
  - Package-Blacklist for k8s/containerd/runc/calico-node (apt-mark
    hold on the cluster-critical components)
  - Automatic-Reboot disabled — kured drives the actual reboots
  - Compatible with the existing kured + sentinel-gate flow

kured side:
  - rebootDelay 30s, concurrency 1
  - Sentinel cool-down stretched 30m → 24h (aligns with the 24h soak
    window from the post-mortem)
  - prometheusUrl + alertFilterRegexp wired so any firing non-ignored
    alert halts the rollout. Ignore-list excludes self-referential
    alerts (Watchdog/RebootRequired/KuredNodeWasNotDrained/
    InfoInhibitor) that would otherwise deadlock kured.
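
The gate logic amounts to roughly the following. This is a sketch of
the behaviour, not kured's code: the exact regex string is an
assumption built from the four alert names above.

```python
import re

# Alerts matching this pattern are ignored when deciding whether to reboot.
ALERT_FILTER = re.compile(
    r"^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$"
)


def blocking_alerts(firing):
    """Firing alerts that are not on the ignore-list."""
    return [name for name in firing if not ALERT_FILTER.match(name)]


def reboot_allowed(firing):
    """A reboot may proceed only when no non-ignored alert is firing."""
    return not blocking_alerts(firing)
```

Without the ignore-list, RebootRequired would itself block the reboot
that clears it, which is the deadlock mentioned above.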

Prometheus side (already partly landed in 6c4e0966 — the "Upgrade
Gates" rule group):
  - Refine `KubeQuotaAlmostFull` to include the resourcequota label in
    both the on-clause and the summary, so multi-quota namespaces
    (authentik, beads-server, frigate) report the quota name correctly.

grafana.tf: terraform fmt whitespace only.

Together with the post-mortem 2026-03-22 (memory id=390) the loop is
closed: unattended-upgrades runs again, kernel-class updates can land,
but only when cluster health is green and the reboot window is open.
2026-05-22 14:16:41 +00:00
Viktor Barzin
fe75fad467 monitoring: protect grafana ingress with authentik + disable anonymous
- add traefik-authentik-forward-auth to grafana ingress middleware list
- disable auth.anonymous (was Viewer-by-default for the public)
- enable auth.proxy with X-authentik-username so Authentik users get
  signed in seamlessly (no double-login UX)

Prometheus and Alertmanager already had forward-auth — no change.
2026-05-22 14:16:41 +00:00
Viktor Barzin
6c294d4bb0 authentik: zero-endpoints alert + upgrade-validation checklist
Add `AuthentikForwardAuthFallbackActive` Prometheus alert: fires on
sustained 401/s spike on the websecure entrypoint (>5/s for 5m), which
is the symptom of the auth-proxy Emergency-Access fallback firing —
in turn caused by zero ready endpoints on the outpost service.

Why this rule and not `kube_endpoint_address_available == 0`:
kube-state-metrics endpoint metrics exist as series names but never
have current values in this Prometheus pipeline (something is dropping
them silently). Detecting the failure at the edge via Traefik is more
reliable than instrumenting the broken middle.

Also fix the pre-existing `AuthentikOutpostForwardAuth400Spike` regex
— the service label is `authentik-ak-outpost-...`, not
`authentik-authentik-outpost-...`, so the alert never matched any
series and never could have fired. Verified in Prometheus before/after
the fix.

Add an "Upgrade Validation Checklist" section to
`.claude/reference/authentik-state.md` with the seven-step smoke test
to run after Authentik chart bumps, provider bumps, or outpost pod
recreation. Covers the brittle surfaces (Service selector, JSON
patches, postgres backend wiring, access_token_validity TTL, edge
auth flow, plan-to-zero).
2026-05-22 14:16:41 +00:00
Viktor Barzin
a89d4a7d2a anubis: pull f1 off Anubis (XHR-vs-challenge collision) + add latency alerts
f1.viktorbarzin.me is a SPA whose JS fetches /schedule, /embed,
/embed-asset, … on the same path tree. With Anubis fronting `/`,
those XHRs land on the challenge HTML even when the cookie *should*
be valid, breaking the page with `Unexpected token '<', "<!doctype "
... is not valid JSON`. Removed Anubis from f1 — would need a path
carve-out (the way wrongmove does for /api) to re-enable. Added a
top-of-block comment so future me remembers why.

Plus four new Prometheus alerts in `Slow Ingress Latency` group
(stacks/monitoring/.../prometheus_chart_values.tpl):

- IngressTTFBHigh         (warn, 10m, avg latency >1s)
- IngressTTFBCritical     (crit, 5m,  avg latency >3s)
- IngressErrorRate5xxHigh (crit, 5m,  5xx >5%)
- AnubisChallengeStoreErrors (crit, 5m, any 5xx on *anubis* services
  via Traefik — proxies for the in-pod challenge-store error since
  Anubis itself only exposes Go-runtime metrics)

Notes from the alert author: avg-not-p95 because the existing
Prometheus scrape config drops traefik bucket series; once those
are restored, swap to histogram_quantile(0.95). TraefikDown inhibit
rule extended to suppress these four during a Traefik outage.
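
For reference, the avg-based evaluation reduces to the arithmetic
below. It is a sketch only: it works on two samples of a cumulative
duration sum and request count, and it ignores the 10m/5m `for`
durations that the real rules apply.

```python
def avg_latency(sum_start, sum_end, count_start, count_end):
    """Mean request duration over a window, from cumulative sum and count samples."""
    requests = count_end - count_start
    if requests <= 0:
        return None  # no traffic in the window, nothing to average
    return (sum_end - sum_start) / requests


def severity(avg_seconds):
    """Map an average to the thresholds above: warn above 1s, critical above 3s."""
    if avg_seconds is None:
        return "ok"
    if avg_seconds > 3:
        return "critical"
    if avg_seconds > 1:
        return "warning"
    return "ok"
```

An average hides tail latency, which is why the swap to
histogram_quantile is worth doing once the bucket series are back.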
2026-05-10 11:12:40 +00:00
Viktor Barzin
8c619278d3 grafana: env-var datasources + reloader so Vault rotations stop breaking dashboards
Wealth, Payslips, and Job-Hunter Grafana datasources all baked the
rotating PG password into their ConfigMap at TF-apply time, so every
7-day Vault static-role rotation silently broke the panels until a
manual `terragrunt apply`. Same family as the recurring grafana-mysql
backend bug — Grafana caches creds at startup and never picks up the
new ESO-synced password without a restart.

Fix:
- Each source stack now creates an ExternalSecret in `monitoring`
  exposing the rotating password as `<NAME>_PG_PASSWORD` env-var.
- Grafana mounts those via `envFromSecrets` (optional=true so a
  missing source stack doesn't block boot) and the datasource
  ConfigMaps reference `$__env{<NAME>_PG_PASSWORD}` instead of a
  literal password.
- `reloader.stakater.com/auto: "true"` on the Grafana pod restarts
  it whenever any of the four DB-cred Secrets is updated.

Tested end-to-end: forced `vault write -force database/rotate-role/
pg-wealthfolio-sync` → ESO synced (~30s) → reloader fired →
Grafana booted with new env in ~50s total → all three /api/datasources
/uid/*/health endpoints return "Database Connection OK".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-10 11:12:39 +00:00
Viktor Barzin
8c09543391 fix: restore pvc-autoresizer by allow-listing kubelet_volume_stats_available_bytes
The Prometheus scrape config for the kubernetes-nodes job kept
capacity_bytes + used_bytes but dropped available_bytes. pvc-autoresizer
computes utilization from available/capacity, so without that metric it
was silent for every PVC in the cluster — including mailserver, which
filled to 89% (1.7G/2.0G) and started rejecting all inbound mail with
'452 4.3.1 Insufficient system storage' (15+ hours, all real senders:
Brevo, Gmail, Facebook).
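
Why the missing metric meant silence rather than an error can be seen
in a simplified model of the decision. This is a sketch, not
pvc-autoresizer's code; the 10% free-space threshold is assumed for
illustration.

```python
def free_percent(available_bytes, capacity_bytes):
    """Free space as a percentage of capacity."""
    return 100.0 * available_bytes / capacity_bytes


def should_resize(available_bytes, capacity_bytes, threshold_percent=10.0):
    """True when free space is at or below the threshold."""
    if available_bytes is None:
        # Metric dropped at scrape time: no decision can be made.
        return False
    return free_percent(available_bytes, capacity_bytes) <= threshold_percent
```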

Also bumps the floors of mailserver (2Gi -> 5Gi, limit 10Gi) and forgejo
(15Gi -> 30Gi) PVCs to recover from the immediate outage, and adds
ignore_changes on requests.storage so future autoresizer expansions
don't cause TF drift.
2026-05-10 11:12:37 +00:00
Viktor Barzin
e110b40a4a monitoring(wealth): monthly contrib-vs-mkt as line chart, not bars
User asked for two lines instead of side-by-side bars at monthly
granularity. Converts panel 25 from barchart to timeseries:

  * type: barchart -> timeseries
  * format: table -> time_series, SELECT month::timestamp AS time
  * drawStyle line, lineWidth 2, fillOpacity 0, showPoints auto
  * Same blue (contributions) / green (market gain) colour overrides

Where the green line rises above the blue line is the visual cue that
the market out-earned new contributions for that month -- the trend
the user wants to track.

Diff is small (15 ins / 28 del) because the bar-chart-only fields
(barRadius, barWidth, groupWidth, stacking, xField, xTickLabelRotation)
are dropped.
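
The panel conversion can be sketched as a dict transform. This is
illustrative and not the patch that was applied: the nesting under
`options` and `fieldConfig.defaults.custom` is assumed, and the SQL
change (SELECT month::timestamp AS time) is not covered.

```python
BAR_ONLY_OPTIONS = (
    "barRadius", "barWidth", "groupWidth",
    "stacking", "xField", "xTickLabelRotation",
)


def to_timeseries(panel):
    """Return a converted copy of a barchart panel; the input is not modified."""
    out = dict(panel, type="timeseries")
    # Drop the options that only make sense for bar charts.
    out["options"] = {
        k: v for k, v in panel.get("options", {}).items()
        if k not in BAR_ONLY_OPTIONS
    }
    # Keep overrides (the colour overrides live there); set line styling.
    field_config = dict(panel.get("fieldConfig", {}))
    defaults = dict(field_config.get("defaults", {}))
    defaults["custom"] = {
        "drawStyle": "line", "lineWidth": 2,
        "fillOpacity": 0, "showPoints": "auto",
    }
    field_config["defaults"] = defaults
    out["fieldConfig"] = field_config
    out["targets"] = [
        dict(t, format="time_series") for t in panel.get("targets", [])
    ]
    return out
```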
2026-05-07 23:29:35 +00:00
Viktor Barzin
84fd752747 monitoring(wealth): monthly contributions vs market gain bar chart
Goal stated by user: see when monthly market gain starts to exceed
monthly contributions, i.e. the inflection point where the market is
out-earning savings rather than the other way around.

New panel id=25 between the annual decomposition (13) and per-account
ROI (14): bar chart with two side-by-side bars per month --
contributions (blue) and market gain (green). Same calculation as
panel 13 but month-grain instead of year-grain. Months where the
green bar dwarfs the blue one are visible at a glance.

SQL: same endpoints CTE pattern as panel 13, with date_trunc('month',
valuation_date) as the grouping key. Uses max_complete cutoff so
partial-today doesn't skew the latest month.

Layout: panels at y >= 75 shifted down by 11 (chart height). New
chart at y=75; panel 14 (per-account ROI) -> y=86; panel 10
(activity log) -> y=96.
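
The shift itself is mechanical. A minimal sketch, assuming panels are
dicts with a `gridPos.y` field; the y value for panel 13 is made up,
the other two match the numbers above.

```python
def shift_panels(panels, from_y, by):
    """Move every panel with gridPos.y >= from_y down by `by` rows, in place."""
    for panel in panels:
        if panel["gridPos"]["y"] >= from_y:
            panel["gridPos"]["y"] += by
    return panels


panels = [
    {"id": 13, "gridPos": {"y": 64}},  # hypothetical y, above the cut
    {"id": 14, "gridPos": {"y": 75}},  # per-account ROI
    {"id": 10, "gridPos": {"y": 85}},  # activity log
]
shift_panels(panels, from_y=75, by=11)
```

The new chart then takes the freed slot at y=75.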

Spot check (recent months from PG):
  2025-07: contrib +£5,601    market +£42,295   <- big market month
  2025-09: contrib +£1,501    market +£24,206
  2026-02: contrib +£35,501   market +£41,382
  2026-03: contrib +£5,501    market -£38,483   <- correction
  2026-04: contrib +£73,267   market +£21,448
2026-05-07 23:29:34 +00:00
Viktor Barzin
4ec40ea804 [forgejo] Phases 3+4+5: cutover, decommission, docs sweep
End of forgejo-registry-consolidation. After Phase 0/1 already landed
(Forgejo ready, dual-push CI, integrity probe, retention CronJob,
images migrated via forgejo-migrate-orphan-images.sh), this commit
flips everything off registry.viktorbarzin.me onto Forgejo and
removes the legacy infrastructure.

Phase 3 — image= flips:
* infra/stacks/{payslip-ingest,job-hunter,claude-agent-service,
  fire-planner,freedify/factory,chrome-service,beads-server}/main.tf
  — image= now points to forgejo.viktorbarzin.me/viktor/<name>.
* infra/stacks/claude-memory/main.tf — also moved off DockerHub
  (viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...).
* infra/.woodpecker/{default,drift-detection}.yml — infra-ci pulled
  from Forgejo. build-ci-image.yml still dual-pushes until the next
  build cycle confirms Forgejo as canonical.
* /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated.

Phase 4 — decommission registry-private:
* registry-credentials Secret: dropped registry.viktorbarzin.me /
  registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries.
  Forgejo entry is the only one left.
* infra/stacks/infra/main.tf cloud-init: dropped containerd
  hosts.toml entries for registry.viktorbarzin.me +
  10.0.20.10:5050. (Existing nodes already had the file removed
  manually by `setup-forgejo-containerd-mirror.sh` rollout — the
  cloud-init template only fires on new VM provision.)
* infra/modules/docker-registry/docker-compose.yml: registry-private
  service block removed; nginx 5050 port mapping dropped. Pull-
  through caches for upstream registries (5000/5010/5020/5030/5040)
  stay on the VM permanently.
* infra/modules/docker-registry/nginx_registry.conf: upstream
  `private` block + port 5050 server block removed.
* infra/stacks/monitoring/modules/monitoring/main.tf: registry_
  integrity_probe + registry_probe_credentials resources stripped.
  forgejo_integrity_probe is the only manifest probe now.

Phase 5 — final docs sweep:
* infra/docs/runbooks/registry-vm.md — VM scope reduced to pull-
  through caches; forgejo-registry-breakglass.md cross-ref added.
* infra/docs/architecture/ci-cd.md — registry component table +
  diagram now reflect Forgejo. Pre-migration root-cause sentence
  preserved as historical context with a pointer to the design doc.
* infra/docs/architecture/monitoring.md — Registry Integrity Probe
  row updated to point at the Forgejo probe.
* infra/.claude/CLAUDE.md — Private registry section rewritten end-
  to-end (auth, retention, integrity, where the bake came from).
* prometheus_chart_values.tpl — RegistryManifestIntegrityFailure
  alert annotation simplified now that only one registry is in
  scope.

Operational follow-up (cannot be done from a TF apply):
1. ssh root@10.0.20.10 — edit /opt/registry/docker-compose.yml to
   match the new template AND `docker compose up -d --remove-orphans`
   to actually stop the registry-private container. Memory id=1078
   confirms cloud-init won't redeploy on TF apply alone.
2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/`
   on the VM (~2.6GB freed).
3. Open the dual-push step in build-ci-image.yml and drop
   registry.viktorbarzin.me:5050 from the `repo:` list — at that
   point the post-push integrity check at lines 33-107 also needs
   to be repointed at Forgejo or removed (the per-build verify is
   redundant with the every-15min Forgejo probe).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:34 +00:00
Viktor Barzin
f793a5f50b [forgejo] Phase 0 of registry consolidation: prepare Forgejo OCI registry
Stage 1 of moving private images off the registry:2 container at
registry.viktorbarzin.me:5050 (which has hit distribution#3324 corruption
3x in 3 weeks) onto Forgejo's built-in OCI registry. No cutover risk —
pods still pull from the existing registry until Phase 3.

What changes:
* Forgejo deployment: memory 384Mi→1Gi, PVC 5Gi→15Gi (cap 50Gi).
  Explicit FORGEJO__packages__ENABLED + CHUNKED_UPLOAD_PATH (defensive,
  v11 default-on).
* ingress_factory: max_body_size variable was declared but never wired
  in after the nginx→Traefik migration. Now creates a per-ingress
  Buffering middleware when set; default null = no limit (preserves
  existing behavior). Forgejo ingress sets max_body_size=5g to allow
  multi-GB layer pushes.
* Cluster-wide registry-credentials Secret: 4th auths entry for
  forgejo.viktorbarzin.me, populated from Vault secret/viktor/
  forgejo_pull_token (cluster-puller PAT, read:package). Existing
  Kyverno ClusterPolicy syncs cluster-wide — no policy edits.
* Containerd hosts.toml redirect: forgejo.viktorbarzin.me → in-cluster
  Traefik LB 10.0.20.200 (avoids hairpin NAT for in-cluster pulls).
  Cloud-init for new VMs + scripts/setup-forgejo-containerd-mirror.sh
  for existing nodes.
* Forgejo retention CronJob (0 4 * * *): keeps newest 10 versions per
  package + always :latest. First 7 days dry-run (DRY_RUN=true);
  flip the local in cleanup.tf after log review.
* Forgejo integrity probe CronJob (*/15): same algorithm as the
  existing registry-integrity-probe. Existing Prometheus alerts
  (RegistryManifestIntegrityFailure et al) made instance-aware so
  they cover both registries during the bake.
* Docs: design+plan in docs/plans/, setup runbook in docs/runbooks/.
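
The retention rule above can be sketched as follows. This is
illustrative only and not the CronJob's script: `versions_to_delete`
is a hypothetical helper, and whether :latest counts toward the 10 is
an assumption (here it does not).

```python
def versions_to_delete(versions, keep=10):
    """versions: list of (tag, created_at) pairs; returns tags to delete."""
    newest_first = sorted(versions, key=lambda v: v[1], reverse=True)
    kept = {tag for tag, _ in newest_first[:keep]}
    kept.add("latest")  # always retained, regardless of age
    return [tag for tag, _ in newest_first if tag not in kept]
```

In dry-run mode the same list would only be logged, which is what the
first 7 days are for.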

Operational note — the apply order is non-trivial because the new
Vault keys (forgejo_pull_token, forgejo_cleanup_token,
secret/ci/global/forgejo_*) must exist BEFORE terragrunt apply in the
kyverno + monitoring + forgejo stacks. The setup runbook documents
the bootstrap sequence.

Phase 1 (per-project dual-push pipelines) follows in subsequent
commits. Bake clock starts when the last project goes dual-push.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-07 23:29:33 +00:00
Viktor Barzin
41655096c7 openclaw: realtime usage dashboard via Prometheus exporter sidecar
Stdlib-only Python exporter ($1) reads ~/.openclaw/agents/*/sessions/*.jsonl
(assistant messages with usage) plus auth-profiles.json (OAuth expiry,
Plus-tier label) and exposes Prometheus text format on :9099/metrics.
Container is python:3.12-slim; pod template gets prometheus.io/scrape
annotations so the existing kubernetes-pods job picks it up — no
ServiceMonitor needed.

Metrics exported:
  openclaw_codex_messages_total{provider,model,session_kind}    counter
  openclaw_codex_input/output/cache_read/cache_write_tokens_total
  openclaw_codex_message_errors_total{reason}
  openclaw_codex_active_sessions{kind}                          gauge
  openclaw_codex_oauth_expiry_seconds{provider,account,plan}    gauge
  openclaw_codex_last_run_timestamp                             gauge
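
For context, rendering one metric family in the Prometheus text format
needs nothing beyond the stdlib. A minimal sketch, not the exporter
itself: `render_metric` is a hypothetical helper, label-value escaping
is omitted, and the label value in the usage note is made up.

```python
def render_metric(name, mtype, samples):
    """Render one metric family; samples is a list of (labels_dict, value)."""
    lines = [f"# TYPE {name} {mtype}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

Calling `render_metric("openclaw_codex_active_sessions", "gauge",
[({"kind": "main"}, 2)])` gives a TYPE line followed by one sample line.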

Grafana dashboard "OpenClaw — Codex Usage" (Applications folder, 30s
refresh): messages/5h vs Plus rate-card, % of 1,200 floor, tokens/5h,
cache hit %, OAuth expiry days, active sessions, last-turn age, errors,
plus per-model timeseries + bar gauge + error table.

Plus rate-card thresholds in the gauge are conservative (1,200/5h floor;
real cap is dynamic 1,200–7,000). Re-baseline if throttling shows up
below 80%.
2026-05-07 23:29:32 +00:00
Viktor Barzin
f006b48566 monitoring(wealth): delta panels to 2x4 grid (rows = type, cols = window)
Better visual grouping: instead of 8 paired panels in a single row at
w=3 (cramped, hard to scan), arrange as a 2x4 grid at w=6. Top row
("all" — wealth change incl new money), bottom row ("mkt" — pure
market gain). Columns are timeframes 1d / 7d / 30d / 90d.

Reading vertically: same window, two interpretations side by side.
Reading horizontally: same metric across timeframes.
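
The grid positions follow from row and column indices. A sketch under
two assumptions: the row starts at y=4, and each panel is 4 high (two
rows filling 8).

```python
WINDOWS = ["1d", "7d", "30d", "90d"]  # columns
KINDS = ["all", "mkt"]                # rows


def delta_grid(top_y=4, w=6, h=4):
    """Map (kind, window) to a gridPos dict on a 24-column dashboard."""
    return {
        (kind, window): {"x": col * w, "y": top_y + row * h, "w": w, "h": h}
        for row, kind in enumerate(KINDS)
        for col, window in enumerate(WINDOWS)
    }


grid = delta_grid()
```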

Layout shift: delta row goes from y=4 (4 high) to y=4..11 (8 high).
All chart/log panels with y >= 8 shift down by another 4 rows
(net-worth chart 8->12, activity log 81->85, etc.).
2026-05-07 23:29:31 +00:00
Viktor Barzin
0f107aeacb monitoring(wealth): pair every delta panel with market-only twin
User feedback: net-worth delta panels (1d/7d/30d/90d) were confusing
because +£174k over 90d looked too big against the £271k cumulative
unrealised gain. Decomposition showed the 90d delta was £114k of new
money in (contributions) + £60k of actual market gain.

So now the delta row shows BOTH:
  Δ Nd (all)  — net-worth change incl new money (the original number)
  Δ Nd (mkt)  — pure market gain, contributions stripped out

Pattern for "(mkt)" panels: same now_snap / past_snap CTEs but
selecting both total_value and net_contribution, then computing
(nw_delta - contrib_delta) = market_gain over window.
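
The same arithmetic outside SQL, as a sketch. Each snapshot is assumed
to be a (total_value, net_contribution) pair.

```python
def market_gain(now_snap, past_snap):
    """Net-worth change over the window with new contributions stripped out."""
    nw_delta = now_snap[0] - past_snap[0]
    contrib_delta = now_snap[1] - past_snap[1]
    return nw_delta - contrib_delta
```

With made-up snapshots (1000, 700) now and (800, 600) before, net worth
rose by 200, of which 100 was new money, leaving a market gain of 100.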

Layout: 8 panels at w=3 each on the y=4 row, paired by window
(all next to mkt for each timeframe), so you can see "wealth
change vs investment performance" at a glance.

Verified live (90d): all=+£174,612, mkt=+£60,343, contrib=+£114,268.
2026-05-07 23:29:31 +00:00
Viktor Barzin
87069ae5c3 monitoring(wealth): add delta row (1d / 7d / 30d / 90d net-worth changes)
New row at y=4 with 4 stat panels showing net-worth change over the
trailing windows. Each uses the latest-per-account stitching pattern
(skew-resilient against partial-day syncs) and computes:

  delta = SUM(latest per account) - SUM(latest per account at or
                                       before max_complete - N)

Where max_complete is the most recent date all accounts have a row.
For each window: 1d, 7d, 30d, 90d.

Verified live values: +£8,575 / +£22,696 / +£144,633 / +£174,612.

All panels at y >= 4 shifted down by 4 rows to make room (Net worth
chart 4->8, Per-account stacked 24->28, Activity log 77->81, etc.).

Note: this commit also reformats the dashboard JSON from compact-
object form to indented form (json.dump indent=2 side effect from the
Python patch script). No semantic changes outside the new panels and
y-shifts.
2026-05-07 23:29:31 +00:00
Viktor Barzin
1cb2bb30f7 monitoring(wealth): show pre-2024 historical data on timeseries
Bug: timeseries panels were empty before 2024-04-10. Cause was the
complete_dates CTE filtering to "every active account has a row for
this date" -- which excluded every day before the most-recently-added
account first appeared. The 6th account (Trading212 Invest GIA) only
started 2024-04-10, so 4 years of legitimate historical data
(2020-06-07 onwards, when the user genuinely had fewer accounts) got
hidden.

New pattern across panels 5/6/7/8/9/12/13: replace complete_dates with
max_complete cutoff. Compute the most-recent date where all current
accounts have a row, then include every historical date up to and
including that day. Partial-today is still excluded automatically.
Historical days with fewer accounts now show as their actual smaller
sums -- which is the correct historical net worth at the time.
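
The cutoff logic, sketched in Python rather than SQL. Rows are assumed
to be (account, date, value) tuples; the function names are
hypothetical, and the account names in the usage note are made up.

```python
from datetime import date


def max_complete(rows, accounts):
    """Latest date on which every account in `accounts` has a row."""
    by_date = {}
    for account, day, _value in rows:
        by_date.setdefault(day, set()).add(account)
    wanted = set(accounts)
    complete = [day for day, seen in by_date.items() if seen >= wanted]
    return max(complete) if complete else None


def daily_totals(rows, accounts):
    """Sum per day for every date up to and including the cutoff."""
    cutoff = max_complete(rows, accounts)
    totals = {}
    for _account, day, value in rows:
        if cutoff is not None and day <= cutoff:
            totals[day] = totals.get(day, 0) + value
    return dict(sorted(totals.items()))
```

With an "isa" account on 8, 10 and 11 April and a "gia" account on
10 April only, the cutoff is 10 April: the 8th is kept with its smaller
sum and the partial 11th is excluded.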

Verified via PG: new pattern returns 2,159 distinct days from
2020-06-07 to 2026-05-05 (vs the previous 391 from 2024-04-10).

Per-account first-seen dates:
  InvestEngine ISA       - 2020-06-07
  Schwab US workplace    - 2020-11-17
  InvestEngine GIA       - 2022-03-17
  Fidelity UK Pension    - 2022-05-16
  Trading212 ISA         - 2024-04-08
  Trading212 Invest GIA  - 2024-04-10  (was the bottleneck)
2026-05-05 18:43:26 +00:00
Viktor Barzin
6715cdc51f monitoring(wealth): re-add milestone annotations (now that PG creds rotated)
Re-applies the milestone annotation commit reverted in 0ef36aec. The
earlier "nothing loads / syntax error" was a red herring: Vault had
rotated the wealthfolio_sync DB password 7 days prior, the K8s Secret
picked it up automatically (pg-sync sidecar still working), but the
Grafana datasource ConfigMap is baked at TF-apply time so Grafana was
sending the old password. Every panel + the new annotation alike
failed with: pq password authentication failed for user wealthfolio_sync.

Fix today: refresh the datasource ConfigMap and roll Grafana.

  scripts/tg apply -target=kubernetes_config_map.grafana_wealth_datasource
  kubectl -n monitoring rollout restart deploy/grafana

Annotation source verified live via /api/ds/query: SQL returns 5
milestone rows correctly. Dashboard charts now show vertical dashed
lines at GBP100k 2021-11-01, GBP250k 2023-07-18, GBP500k 2024-09-19,
GBP750k 2025-08-26, GBP1M 2026-04-18.

KNOWN FOLLOW-UP: Vault rotates pg-wealthfolio-sync every 7 days
(static role). Today's failure will recur unless the Grafana
datasource auto-refreshes. Options:
  1. Annotate Grafana deploy with stakater/reloader so it restarts
     when wealthfolio-sync-db-creds Secret changes.
  2. Switch datasource provisioning to read password from an env var
     sourced from the Secret instead of baking into the ConfigMap.
     Combined with reloader, picks up rotation cleanly.
2026-05-02 20:27:21 +00:00
Viktor Barzin
0ef36aec36 Revert "monitoring(wealth): milestone annotations on every timeseries chart"
This reverts commit 5a00b9c096.
2026-05-02 20:20:18 +00:00
Viktor Barzin
5a00b9c096 monitoring(wealth): milestone annotations on every timeseries chart
Inspired by the user's "Journey to £1M" reference — adds vertical
dashed lines on every timeseries panel at the date net worth first
crossed each round threshold (£100k, £250k, £500k, £750k, £1M).

Implementation: a dashboard-level annotation source ("Milestones",
purple) backed by a PG query that finds the MIN(valuation_date) where
SUM(total_value) >= each threshold. The query returns (time, text)
pairs, e.g. "2026-04-18 → £1M 🎉". Annotations attach to all
timeseries panels automatically; auto-extends as future thresholds
are crossed.
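
The first-crossing query can be mirrored in a few lines. A sketch,
assuming a mapping of ISO date strings to total net worth; the data in
the usage note is made up.

```python
def milestones(daily_net_worth, thresholds):
    """Return {threshold: first ISO date on which the total reached it}."""
    reached = {}
    # ISO date strings sort chronologically.
    for day in sorted(daily_net_worth):
        for threshold in thresholds:
            if threshold not in reached and daily_net_worth[day] >= threshold:
                reached[threshold] = day
    return reached
```

A later dip below a threshold does not move its date: only the first
crossing is recorded, and thresholds not yet reached are simply absent.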

Verified against current data:
  £100k → 2021-11-01    £250k → 2023-07-18    £500k → 2024-09-19
  £750k → 2025-08-26    £1M    → 2026-04-18 🎉

Future work (per user request): add a "Journey" stat-card row at the
top mirroring the reference (date achieved + months from previous).
2026-05-02 08:42:21 +00:00
Viktor Barzin
664a85ef1e Revert "monitoring(wealth): show daily points + lighter fill on timeseries"
This reverts commit 5472720c75.
2026-05-01 16:24:18 +00:00