infra

Author	SHA1	Message	Date
Viktor Barzin	a048b37f60	security(wave1): W1.1 audit-log shipping LIVE + W1.5 trusted-registries Enforce LIVE ## W1.1 — K8s API audit log shipping (LIVE) - alloy.yaml: added control-plane toleration so Alloy DaemonSet runs on k8s-master node. Verified alloy-7zg7t scheduled on master, tailing /var/log/kubernetes/audit.log - loki.tf "Security Wave 1" rule group: added K2-K9 alert rules (skipped K1 per Q7 decision): - K2 K8sSATokenFromUnexpectedIP - K3 K8sSensitiveSecretReadByUnexpectedActor - K4 K8sExecIntoSensitiveNamespace - K5 K8sMassDelete (>5 Pod/Secret/CM in 60s by single user) - K6 K8sAuditPolicyModified (kubeadm-config CM change) - K7 K8sClusterRoleWildcardCreated (verbs=* + resources=*) - K8 K8sAnonymousBindingGranted - K9 K8sViktorFromUnexpectedIP - All rules use source-IP regex matching the wave-1 allowlist (10.0.20.0/22, 192.168.1.0/24, 10.10.0.0/16 pod, 10.96.0.0/12 svc, 100.64-127 tailnet) and `lane = "security"` → #security Slack route. - Verified: kubectl-audit logs flowing in Loki query {job="kubernetes-audit"} returns events with node=k8s-master. - Verified: /loki/api/v1/rules lists all K2-K9 + V1-V7 + S1. ## W1.5 — require-trusted-registries Enforce (LIVE) - security-policies.tf: flipped Audit→Enforce with explicit allowlist built by `kubectl get pods -A -o jsonpath='{..image}'` enumeration. - Removed `*/*` catch-all (which made Audit→Enforce a no-op). - Pattern includes 15 explicit registries, 6 DockerHub library bare names, 56 DockerHub user repos.
- Verified by admission dry-run: - evilcorp.example/malware:v1 → BLOCKED with custom message - alpine:3.20 → ALLOWED (matches `alpine`) - docker.io/library/alpine:3.20 → ALLOWED (matches `docker.io/*`) ## W1.6 — Calico flow logs (BLOCKED — Calico OSS limitation) - Tried adding FelixConfiguration with flowLogsFileEnabled=true via kubectl_manifest in stacks/calico/main.tf - Calico OSS rejected with "strict decoding error: unknown field spec.flowLogsFileEnabled" — these fields are Calico Enterprise/Tigera-only - Removed the failed resource. Documented alternative paths in main.tf comment block: GNP with action=Log (iptables NFLOG → journal), Cilium migration, eBPF tooling, or Tigera Operator adoption. ## Docs updates - security.md status table refreshed: W1.1/W1.2/W1.3/W1.4/W1.5 LIVE, W1.6/W1.7 blocked - monitoring.md: Loki marked DEPLOYED (was incorrectly NOT-DEPLOYED in prior session before today's apply) ## Cleanup - Removed stacks/kyverno/imports.tf (TF 1.5+ import blocks completed their job in the 2026-05-18 apply; should not stay in tree per TF docs) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:59 +00:00
Viktor Barzin	83079758bb	monitoring(wave1): re-enable Loki+Alloy, deploy wave1 alert rules, add #security Slack lane ## Loki + Alloy re-enabled (code-146x) - Uncommented helm_release.loki, helm_release.alloy, kubernetes_daemon_set_v1.sysctl-inotify, kubernetes_config_map.loki_alert_rules, kubernetes_config_map.grafana_loki_datasource - Reverses the documented "operational overhead vs benefit after node2 incident" decision. Re-evaluated because wave 1 security detection layer (beads code-8ywc) needs Loki + ruler + alert routing. - SingleBinary mode, 2-4Gi mem, 50Gi proxmox-lvm PVC, 30-day retention, ruler enabled pointed at prometheus-alertmanager.monitoring.svc:9093 - Alloy DaemonSet (4 pods on worker nodes) discovers pod logs via K8s API + pushes to Loki - Loki canaries running (4) - Vault audit-tail sidecar logs now flowing to Loki: queried {namespace="vault",container="audit-tail"} returns live audit JSON ## Wave 1 alert rules deployed (W1.3 partial) Added "Security Wave 1" rule group to loki_alert_rules configmap: - V1: VaultRootTokenCreated — auth/token/create with policies=[root] - V2: VaultAuditDeviceModified — sys/audit/* create/delete/update - V3: VaultSealChanged — sys/seal update - V4: VaultPolicyModified — sys/policies/acl/* create/update/delete - V5: VaultAuthFailureSpike — >10 permission denied/min - V7: VaultViktorFromUnexpectedIP — auth as me@viktorbarzin.me from non-allowlist source IP (allowlist: 10.0.20.0/22, 192.168.1.0/24, 10.10.0.0/16 pod, 10.96.0.0/12 svc, 100.64-127 tailnet) - S1: PVEsshLoginFromUnexpectedIP — sshd "Accepted" from non-allowlist IP (rule defined, fires once promtail/Alloy ships sshd journal with job=sshd-pve) Verified rules visible via /loki/api/v1/rules. K2-K9 (K8s API audit) deferred to W1.1 which needs the audit policy + apiserver log shipping codified. 
## #security Slack lane (Alertmanager) - New `slack-security` receiver in prometheus_chart_values.tpl, channel #security - Higher-priority route at top of routes list: matchers `lane = security` → slack-security, continue: false (so wave 1 alerts never fall through to #alerts) - Slack message format includes summary + description + runbook link annotation - All wave 1 rules set `lane = "security"` label ## Resource summary - 6 added: helm_release.loki, helm_release.alloy, kubernetes_config_map.grafana_loki_datasource, kubernetes_config_map.loki_alert_rules, kubernetes_daemon_set_v1.sysctl-inotify, + 1 other - 5 changed: helm_release.prometheus (alertmanager config — new receiver + route), 4 deployments (image tag drift from Keel-managed images, unrelated) - 1 destroyed: null_resource grafana_admin_only_folder_acl["Finance (Personal)"] (timestamp-triggered always recreates — not destructive) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Closes: code-146x	2026-05-22 14:16:58 +00:00
Viktor Barzin	87961e9ef8	monitoring(wealth): drop 6y timeFrom override on META vest cadence	2026-05-22 14:16:57 +00:00
Viktor Barzin	99127939a8	monitoring(wealth): keep only FIFO-realized PNL table; pair Positions + vest-cadence side-by-side - Removed panel 27 (META RSU vest value over time) — superseded by vest-cadence chart which carries the same value signal plus the share-count overlay. - Removed panel 28 (per-vest value at vest vs today) — duplicative with panel 31's FIFO realized PNL. - Removed panel 29 (per-sell realized PNL) — same data as panel 31, just rolled up by sell date instead of vest date. - Resized panel 26 (Positions) to w=12 and moved panel 30 (META vest cadence) to (y=32, x=12, w=12) so they sit side-by-side next to the Positions table. - Moved panel 31 (FIFO realized PNL) to y=118, where the deleted RSU chart used to live.	2026-05-22 14:16:57 +00:00
Viktor Barzin	b879481d71	monitoring(wealth): per-vest realized PNL via FIFO sell-match New table panel below the per-sell breakdown. For each vest, FIFO-match its shares against the subsequent sells (shares from earlier vests get sold first), and aggregate the matched portions: realized_pnl = SUM(matched_qty * (sell_price - vest_price)) pnl_pct = realized_pnl / SUM(matched_qty * vest_price) * 100 days_held = AVG(sell_date - vest_date) per matched portion Footer reducer sums shares, vest value, sell value, and realized PNL so the bottom row is the full-portfolio realized take.	2026-05-22 14:16:57 +00:00
Viktor Barzin	8b60e6bb6d	monitoring(wealth): META vest cadence chart — value vs shares (dual axis) Per-vest event line chart. Left Y axis (blue): vest value at the time = SUM(quantity * unit_price), in USD. Right Y axis (orange): number of shares vested. One point per vest date (aggregated when multiple BUY rows share a date, e.g. 2021-05-18 was 18 + 2 shares). Lets Viktor see how vest sizes ramped (initial 18 shares -> 38 -> 60s) and how the per-vest USD value tracked META's price ride across 2020-2026. timeFrom='6y' override pins the panel to the full vesting window.	2026-05-22 14:16:57 +00:00
Viktor Barzin	af077112cb	monitoring(wealth): META vest + sell PNL tables with FIFO cost basis Two new bottom-of-dashboard tables: Panel 28 'META vests — value at vest vs today': one row per BUY activity. Shows vest-day price * shares + what those same shares would be worth at today's META quote, plus the hypo P&L if Viktor had held everything (color-text on the gain columns). Panel 29 'META sells — realized PNL vs if held until today': one row per SELL with FIFO-matched cost basis (LEAST/GREATEST overlap in cumulative-share space). Shows realized P&L, the counterfactual P&L had he held until today, and the 'missed by' delta = (today_price - sell_price) * shares. Both pull today_price dynamically from quote_latest via a CTE so they self-update as Yahoo updates the META quote. Schwab account is empty so no live activity is expected.	2026-05-22 14:16:57 +00:00
Viktor Barzin	20c5965f95	monitoring(wealth): pin META RSU panel to 6y window Dashboard default time range is now-180d, but the META vesting + sell arc spans 2020-11 → 2026-02. With the default window the panel just showed a flat line at $64 (the empty post-sell residual). timeFrom='6y' override makes panel 27 always render the full vesting curve regardless of the dashboard-level time selector.	2026-05-22 14:16:57 +00:00
Viktor Barzin	018ef3790f	monitoring(wealth): META RSU vest value panel (Schwab account) Daily total_value timeseries for the Schwab workplace account (account_id 72d34e09-...). Single-asset account holding META RSUs that vested 2020-11 → 2026-02 and were sold opportunistically over the same window. Currency USD (account_currency). Yahoo quote on META powers WF's daily mark; the historical DAV mirrored into wealthfolio_sync via pg-sync gives us ~2k days of vesting curve.	2026-05-22 14:16:57 +00:00
Viktor Barzin	5482f46125	RecentNodeReboot: 24h → 1h threshold, matching upgrade-chain preflight The 24h kubelet-uptime threshold (process_start_time_seconds < 86400) was a defense-in-depth duplicate of the 24h-since-Ready-transition check in kured-sentinel-gate Check 4 — but they used different signals (kubelet process start vs node Ready transition). Whenever the cluster cycled through reboots, the alert kept firing for a full day even after sentinel-gate's check passed, and blocked anything querying halt-on-alert (kured, K8s version-upgrade preflight). Tightened to 1h (3600s) for "node just rebooted, give it a settle window". The cluster-wide 24h-between-reboots invariant lives exclusively in kured-sentinel-gate Check 4 from now on (independent, uses lastTransitionTime). Matched the preflight's own 24h-quiet check in upgrade-step.sh (86400 → 3600) so it doesn't act as a second blocker. Empirically verified: all 5 kubelets are >10h up, alert cleared on next eval after the rule reload. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:57 +00:00
Viktor Barzin	3bdba9f388	keel: enroll 15 critical-path namespaces for digest-only auto-update Per user decision today: monitoring, mailserver, vault, descheduler, metrics-server, traefik, technitium, crowdsec, redis, reverse-proxy, reloader, headscale, wireguard, xray, cloudflared now participate in the same `force + match-tag` regime as the rest of the cluster — Keel watches the deployment's CURRENT tag for digest changes only and rolls on push, never rewriting tag strings. Two-part change: stacks/kyverno/modules/kyverno/keel-annotations.tf Trim the policy-level namespace exclude list from 31 → 16. The 16 remaining exclusions are the irreducible cluster-operator + state- coupled set: keel itself, calico-system + tigera-operator (operator loop), authentik (2026-05-17 pgbouncer incident bite), cnpg-system + dbaas (state-coupled), kyverno, metallb-system, external-secrets, proxmox-csi + nfs-csi + nvidia (just stabilized today, chart-pinned), kube-system, vpa, sealed-secrets, infra-maintenance. stacks/<each-of-15>/.../main.tf Add `"keel.sh/enrolled" = "true"` label to the `kubernetes_namespace` resource so the Kyverno mutate policy can target the workloads via its namespaceSelector matchLabels. Note on the apply path: the live ClusterPolicy was patched via `kubectl patch` because the hashicorp/kubernetes provider v3.1.0 panics during state refresh on Kyverno ClusterPolicy schemas with deeply nested optional `context.celPreconditions` / `imageRegistry` fields (see crash dump). The TF source above has the desired state, so any clean future apply on a fixed provider version will be a no-op against the live cluster. 
Floating-tag workloads in the newly-enrolled set (will roll on every upstream digest update — acceptable risk per user): - wireguard: sclevine/wg:latest (image fixed today via iptables-nft postStart shim) - xray: teddysun/xray - crowdsec-web: viktorbarzin/crowdsec_web - monitoring: prompve/prometheus-pve-exporter:latest, prom/snmp-exporter - traefik: nginx:1-alpine, openresty/openresty:alpine, ghcr.io/tarampampam/error-pages:3 - redis: haproxy:3.1-alpine, redis:8-alpine Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	45c8e88e89	terminal: probe + alerts after Traefik replica routing-table skew User reported "site loads but failed to connect on the tmux session". Root cause was a Traefik replica (traefik-db7696fbf-ktjjz) that came up missing the kubernetes_ingress-derived router for terminal.viktorbarzin.me — only the IngressRoute CRDs registered. About 1/3 of /token preflight requests landed on that replica and got 404 with router="-", and WS upgrades intermittently failed the same way, so the lobby iframe stayed stuck on "Failed to connect. Retrying...". `kubectl delete pod` on the bad replica restored the missing router and unblocked the user. This commit adds the long-term mitigation: stacks/terminal/main.tf - kubernetes_cron_job_v1.webterminal_probe runs every 5min, hits /token + /ws via Cloudflare and the in-cluster ttyd Service, pushes 4 gauges to Pushgateway (token_status, ws_status, ttyd_status, last_success_timestamp). Verified the probe end-to-end: token=302 ws=302 ttyd=200 ok=1 stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl - Webterminal group: WebterminalTokenDegraded (warning, 10m), WebterminalWebsocketDegraded (critical, 10m), WebterminalTtydUnreachable (critical, 10m), WebterminalProbeStale (warning, 15m). - Traefik Router Parity group: TraefikRouterCountSkew fires when any Traefik replica's router count diverges from siblings for >10m — catches the same class of issue cluster-wide, not just for terminal. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:56 +00:00
Viktor Barzin	126cfb7022	wealth: dav_corrected view fixes pension gains-offset miscategorisation The broker-sync Fidelity provider emits 'unrealised-gains-offset' DEPOSIT activities to reconcile Wealthfolio's total with the PlanViewer reported pot, because Wealthfolio doesn't track pension fund units directly. Wealthfolio's data model treats that DEPOSIT as a cash contribution, which double-inflates net_contribution and zeroes out the implied growth. Add a Postgres view 'dav_corrected' in wealthfolio_sync that subtracts the cumulative gains-offset from net_contribution per account per date (re-exporting as 'net_contribution' so it's a drop-in replacement). All 17 wealth dashboard panels that compute contribution/growth/ROI now read from the view. Total impact: portfolio Growth jumps from £301,753.19 to £337,474.39 (exactly the £35,721.20 Fidelity offset that was previously miscategorised).	2026-05-22 14:16:52 +00:00
Viktor Barzin	95b9f7bc89	aiostreams: 1h stream cache + canary stream-count probe + 3 alerts Hardening pass following the empty-stream-list incident: 1. STREAM_CACHE_TTL=3600 — re-enables stream payload cache (was -1 / disabled). Default behaviour hit all 5 upstream addons on every Stremio request; with a 1h TTL repeat requests for the same title are instant, while RD cache invalidations still propagate quickly. 2. aiostreams-stream-probe CronJob (every 5 min): fetches the user's encryptedPassword via the internal ClusterIP, runs a canary stream search for Breaking Bad S01E01, pushes streams_count + probe_success to Pushgateway. Uses an ExternalSecret pulling UUID + password from Vault secret/viktor. Same pattern as email-roundtrip-monitor. 3. Three alerts in monitoring's prometheus_chart_values.tpl: - AIOStreamsStreamCountLow (< 50 streams for 30m) - AIOStreamsProbeFailing (probe_success == 0 for 30m) - AIOStreamsProbeStale (last_run_timestamp > 30min for 10m) Verified: probe returned streams=411 success=1 on first run; all 3 alerts loaded into Prometheus with state=inactive health=ok.	2026-05-22 14:16:46 +00:00
Viktor Barzin	2903ab9778	monitoring(wealth): move Positions table under contrib/growth row Positions panel now sits at y=32 (immediately below the contrib-vs-market + growth row at y=22..32), and everything from the per-account stack down shifts 8 rows lower.	2026-05-22 14:16:46 +00:00
Viktor Barzin	8461275308	wealth: positions table panel (shares + cost basis + unrealised return) pg-sync sidecar now mirrors three extra views from the wealthfolio SQLite: assets (id/symbol/name/currency), quote_latest (one row per asset, preferring YAHOO over MANUAL on same-day collisions), and positions_latest (currently-held positions extracted from the TOTAL aggregate row of holdings_snapshots — quantity, average cost, total cost basis). Wealth dashboard gets a new bottom Positions table joining the three: symbol, name, shares, avg cost, last price, market value, cost, gain, return %. Gain and return % are color-text with red<0, green>=0 thresholds.	2026-05-22 14:16:46 +00:00
Viktor Barzin	726fb25182	monitoring(wealth): paint declining segments red on growth chart Mirror the panel 5 treatment on panel 7 (Growth = market value − contribution). Second SQL column emits the growth value only when the point is part of a declining segment; field override paints it red with no fill, spanNulls=false.	2026-05-22 14:16:45 +00:00
Viktor Barzin	cbd0f71a3b	monitoring: PodImagePullBackOff alert + 2 inhibitors + JobFailed for:2h Three improvements identified in the 7d alert-noise review: A. New PodImagePullBackOff alert. `KubeletImagePullErrors` measures node-level pull error rate, which doesn't catch a single pod stuck in ImagePullBackOff — council-complaints sat broken for ~10h on 2026-05-12 without paging. The new rule fires per-pod after 30m. B. Two new inhibit_rules: - PVFillingUp (95% used, critical) suppresses PVPredictedFull (linear projection, warning) on the same PVC. Pair was producing ~24h of redundant firing per 7d. - EmailRoundtripFailing (active probe failure) suppresses EmailRoundtripStale (derivative >60min no-success). Same outage windows, ~14.5h of duplicate firing per 7d. C. JobFailed for: 30m → 2h. Most cronjobs run every 5–15min; the old 30-minute window paged on the first failed iteration before the next run could recover. 2h means "still failing across at least two cron iterations" — much more actionable. Verified live: rules loaded, inhibitors in alertmanager config, PodImagePullBackOff is currently inactive (council-complaints ImagePullBackOff actively detected — see separate fix).	2026-05-22 14:16:45 +00:00
Viktor Barzin	70292b9e23	monitoring: TraefikReplicaConfigStale — drop false-positive on stale series The initial formulation used clamp_min(min(rate[2h]), 0.0001), which made a recently-deleted pod's lingering rate=0 drive the ratio toward infinity for up to 2h until the stale series aged out of the rate window. With for: 2h, this was a near-miss for spurious firing in the immediate aftermath of restarting the bad replica (our remediation path). Tighter formulation: * 30m rate window — stale series ages out within minutes, not hours * `min(rate) > 0.0005` floor — filters both stale-zero and fresh-pod ramp-up series; the bug's actual rate (~0.00076 in the 2026-05-12 incident) sits well above it, so true positives still trip * for: 1h — fast enough to catch the next incident, long enough that short rate dips don't flap Verified: post-deploy `(max/min) > 5 AND min > 0.0005` evaluates to 0 results with the live cluster's tight rate spread (~0.00065–0.0007/s across all three Traefik replicas).	2026-05-22 14:16:45 +00:00
Viktor Barzin	165bb7258e	monitoring: detect stale Traefik replicas + reduce alert-storm cascading Two new alertmanager inhibit rules and one new Prometheus alert, informed by the 2026-05-12 incident where Traefik pod traefik-db7696fbf-k42wp came back after a SIGTERM with only 6 routers vs 119 on healthy peers (stale K8s informer cache) and served 404 for ~1/3 of viktorbarzin.me traffic. * New alert TraefikReplicaConfigStale: fires when max/min reload-rate ratio across Traefik pods exceeds 5x for 2h. The 2h window + 2h for-clause tolerates legitimate post-restart ramp-up; the bug pattern persists indefinitely. * New inhibit: TraefikReplicaConfigStale suppresses the symptom alerts (HighService{Error,4xx,Latency}, IngressTTFB{High,Critical}, IngressErrorRate5xxHigh, TraefikHighOpenConnections, ForwardAuthFallbackActive, AnubisChallengeStoreErrors, ExternalAccessDivergence) so only the actionable root cause pages. * New inhibit: HomeAssistantDown suppresses HomeAssistantCriticalSensorUnavailable and HomeAssistantMetricsMissing — when HA itself is down, every sensor going unavailable is noise (10x firings observed in the last 12h). * Extend NodeDown and NFSServerUnresponsive target lists to also suppress HomeAssistantCriticalSensorUnavailable.	2026-05-22 14:16:45 +00:00
Viktor Barzin	448bc0c0f6	k8s-version-upgrade: decompose into Job chain to fix self-preemption The agent-based v1 ran inside claude-agent-service (replicas=1, no nodeSelector) and self-evicted when it tried to drain its host (k8s-node4 on 2026-05-11). Cluster ended half-upgraded (master v1.34.7, workers v1.34.2) until manual recovery. Rewrite the pipeline as a chain of nodeSelector-pinned Jobs: preflight (k8s-node1) → master (k8s-node1) drains k8s-master → worker × 4 (k8s-node1) drains k8s-node{4,3,2} → worker (k8s-master + control-plane toleration) drains k8s-node1 → postflight (no pinning) Each Job runs scripts/upgrade-step.sh (case-on-$PHASE) and ends by envsubst-ing job-template.yaml into the next Job. Deterministic names (k8s-upgrade-<phase>-<target_version>[-<node>]) make `kubectl apply` idempotent — a failed Job can be re-created without duplicating downstream. Also lands `predrain_unstick`: deletes pods on the target node whose PDB has 0 disruptionsAllowed. Without this, drain loops indefinitely on single-replica deployments (e.g. every Anubis instance — discovered the hard way during 2026-05-11 manual recovery of k8s-node3). Adds K8sUpgradeStalled alert (in_flight + started_timestamp > 90 min). Deprecates the agent prompt (renamed to *.deprecated.md with a header pointer to the new code). Apply order: k8s-version-upgrade first (consumes new SA + ConfigMaps), then monitoring (loads the new alert). Both applied 2026-05-11. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:45 +00:00
Viktor Barzin	cd13b9d062	monitoring: drop PVAutoExpanding alert — info-only noise, not actionable PVAutoExpanding fired at >80% used (info severity), but pvc-autoresizer's threshold is 10% free (= 90% used) — the alert always fired ~10 points before any action would have been taken, and there was nothing for an operator to do during that window either. It was a "heads up" that didn't surface a problem. Real failure modes are already covered: * PVFillingUp (critical, >95% for 10m) — autoresizer didn't keep up * PVPredictedFull (warning, predict_linear 24h) — trend toward exhaustion Sharpened PVFillingUp's annotation to spell out the likely causes (storage_limit reached, expansion failing, or missing autoresizer annotations) so the responder doesn't have to recall the runbook.	2026-05-22 14:16:44 +00:00
Viktor Barzin	396cce82cf	monitoring(wealth): paint declining segments red on portfolio chart Add a second SQL column on panel 5 that returns net_worth only when the current point's previous or next neighbor is lower — i.e. the point is part of a declining segment (including the peak and trough endpoints). A field override draws this 'decline' series in red with no fill and spanNulls=false, overlaying the green base line so down periods show up as red on top of the climb.	2026-05-22 14:16:44 +00:00
Viktor Barzin	f10784ddb6	infra: document auth = "app\|none" tier on every legacy ingress Sweep through the 30+ stacks that predated the auth = "app" tier and were tagged auth = "none" without a comment explaining why they weren't behind Authentik. Each is now self-documenting at the call site, so the tg-level anti-exposure guard passes and future readers don't have to reverse-engineer the intent. Flipped 6 stacks from "none" to "app" — their backends have their own user auth and the new tier records that more accurately: - navidrome (Subsonic user/password) - ntfy (deny-all default + user.db tokens) - nextcloud (WebDAV/CalDAV/CardDAV app passwords) - vaultwarden (Bitwarden-compatible token auth) - headscale (OIDC + preauth keys for Tailscale nodes) - paperless-ngx (app-layer login + API tokens) Kept "none" with a comment on the rest — they're genuinely public, webhook receivers, native-protocol endpoints, OAuth callbacks, or Anubis-fronted: authentik (×2 + guest outpost), beads-server (dolt), claude-memory (bearer-token MCP), dawarich, ebooks/book-search-api, fire-planner /api, forgejo (git/OCI native clients), frigate (HA integration), immich/frame, insta2spotify /api, instagram-poster (meta fetcher), k8s-portal, matrix (native bearer), monitoring×2 (HA REST scrapes), n8n (webhooks), nvidia, onlyoffice (JWT), owntracks (HTTP Basic), postiz, privatebin (client-side enc), rybbit (analytics tracker), send (E2E file drop), tuya-bridge (API key), vault (own auth + CLI), webhook_handler, woodpecker (forgejo webhooks + OAuth), xray (×3 VPN transports). real-estate-crawler/main.tf:400 already had its comment from a prior edit — not touched here. No live state changes — auth = "app" produces the same middleware chain as auth = "none" (verified earlier this session). This commit is purely documentation + intent-tagging.	2026-05-22 14:16:44 +00:00
Viktor Barzin	20774f794d	dbaas+monitoring: bump PG max_connections to 200, add scrape + alerts Cluster grew past the 100-conn default — steady-state idle was 90/100, leaving zero headroom for terragrunt applies or transient surges. The ceiling was being discovered by Terraform crashing (pq: "remaining connection slots are reserved for roles with the SUPERUSER attribute"), not by alerting, because we had no PG scrape config at all. dbaas (Tier 0): * max_connections: 100 → 200 * shared_buffers: 512MB → 1GB (Postgres recommends ~25% of pod memory) * effective_cache_size: 1536MB → 2560MB (scaled with pod memory) * pod memory: 2Gi → 3Gi (rough rule of thumb: enough for shared_buffers + ~16MB work_mem * concurrent sorts + OS cache + overhead) * Triggers bump on null_resource.pg_cluster forces CNPG to re-apply, which rolls the cluster (standby first, then primary failover). monitoring: * New scrape job 'cnpg' on dbaas namespace pods labeled cnpg.io/podRole=instance, port name=metrics (9187). Relabels add cnpg_cluster + cnpg_role labels for alert grouping. * PGConnectionsHigh (warning, >85% for 10m) — heads-up before exhaustion. * PGConnectionsCritical (critical, >95% for 3m) — last call before refusing connections. Verified: cnpg targets up, sum(cnpg_backends_total)=84, max_connections metric=200, alert ratio 0.42 → both alerts inactive.	2026-05-22 14:16:44 +00:00
Viktor Barzin	665b6b2934	actualbudget+monitoring: per-account bank-sync metrics, drop noisy alert The bank-sync CronJob was posting to /accounts/banksync which fans out to ALL accounts in a single call. With PSD2/GoCardless's 4-successful-pulls per-account per-24h quota, a single rate-limited account would 500 the whole call, and `bank_sync_success` would flip to 0 even though the data itself was still flowing through manual UI syncs. Result: BankSyncFailing fired routinely whenever the user had been active in the UI that day — a structural false positive. Fix: * CronJob: enumerate accounts via GET /accounts, POST per-account /accounts/{id}/banksync, emit bank_sync_account_success and bank_sync_account_last_success_timestamp labelled by account name. Roll up bank_sync_success = 1 iff any account succeeded. * Alerts: drop BankSyncFailing (noise generator). Keep BankSyncStale at 48h (global drought). Add BankSyncAccountStale at 72h (catches single-account auth expiry — the real signal we wanted). Verified: manual run on bank-sync-viktor pushes 6 per-account success + timestamp series; roll-up bank_sync_success=1; no firing alerts.	2026-05-22 14:16:44 +00:00
Viktor Barzin	dd2b7de291	fix: HA Sofia REST sensors + PVC drift safety Two real issues found while triaging HomeAssistantCriticalSensorUnavailable alerts and the prometheus + technitium PVC Terminating-but-in-use state from the earlier session. 1. idrac-redfish-exporter + snmp-exporter ingresses: auth=required → auth=none. HA Sofia REST sensors scrape these endpoints programmatically; with Authentik forward-auth in front, every request got a 302 to authentik.viktorbarzin.me and the REST sensors parsed the HTML login page instead of metrics — leaving the R730, UPS, and ~20 other sensors permanently unavailable. The allow_local_access_only IP allowlist (192.168.0.0/16 + 10.0.0.0/8) already gates external access, so authentik on top was breaking machine-to-machine traffic for no security gain. 2. prometheus_server_pvc + technitium primary_config_encrypted: add lifecycle.ignore_changes = [spec[0].resources[0].requests]. The autoresizer expands these PVCs; PVCs can't shrink. Without the ignore, every TF apply tried to revert the live size back to the TF spec value, hit K8s's shrink-forbidden rule, and force-replaced the PVC. Because the pod still mounted it, the PVC went into Terminating-but-protected limbo — fine until a pod restart would have orphaned the volume. Root cause of the 2026-05-10 PVC Terminating incident. Bonus: prometheus_server_pvc threshold was the inverted "90%" (the same bug the bulk `fecfa211` sweep fixed elsewhere; my regex only matched "80%" so this one slipped through). Now "10%". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:43 +00:00
Viktor Barzin	e75bcaf394	k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline Adds a weekly detection CronJob (Sun 12:00 UTC) that probes apt-cache madison on master for new patches + HEAD pkgs.k8s.io for next-minor availability, then POSTs to claude-agent-service to dispatch the k8s-version-upgrade agent. The agent (.claude/agents/k8s-version-upgrade.md) orchestrates: pre-flight (5 nodes Ready + halt-on-alert + 24h-quiet + plan target match) -> etcd snapshot save -> optional master containerd skew fix -> apt repo URL rewrite (minor bumps only) -> drain/upgrade/uncordon master via ssh < update_k8s.sh -> sequential workers k8s-node4 -> 3 -> 2 -> 1 with 10-min soak each -> post-flight verification Two new Upgrade Gates alerts catch failure modes: - K8sVersionSkew (kubelet/apiserver gitVersion mismatch >30m) - EtcdPreUpgradeSnapshotMissing (in_flight without snapshot_taken >10m) update_k8s.sh refactored to take --role / --release args; the agent shells it into each node via SSH pipe. update_node.sh annotated as OS-major path. Operator-facing docs: docs/runbooks/k8s-version-upgrade.md and a new section in docs/architecture/automated-upgrades.md. Secrets: secret/k8s-upgrade/{ssh_key,ssh_key_pub,slack_webhook} (ed25519 keypair distributed to all 5 nodes via authorized_keys; slack_webhook reuses kured webhook URL on initial deploy).	2026-05-22 14:16:42 +00:00
Viktor Barzin	ff5538a667	ingress_factory: replace `protected` bool with `auth` enum + audit pass across 100 stacks Phase 3+4 of default-deny ingress plan. Replaces the `protected = bool` (default false → unprotected) variable in `modules/kubernetes/ingress_factory` with `auth = string` enum (default "required" → fail-closed). Touches every ingress_factory caller so the audit decision is recorded explicitly in code. ingress_factory (Phase 3): - `auth = "required"`: standard Authentik forward-auth (the legacy `protected = true` semantic). - `auth = "public"`: forward-auth via the new `authentik-forward-auth-public` middleware → dedicated public outpost → guest auto-bind. Logged-in users keep their real identity. - `auth = "none"`: no Authentik middleware. For Anubis-fronted content, native client APIs (Git, /v2/, WebDAV), webhook receivers, the Authentik outpost itself. - `effective_anti_ai` default flips ON only when `auth = "none"` (auth-gated ingresses don't need anti-AI noise; the auth flow already discourages bots). 
Audit pass (Phase 4) across 96 ingress_factory call sites: - 49 explicit `protected = true` → `auth = "required"` - 8 explicit `protected = false` → `auth = "none"` (5) or `auth = "public"` (3) - 64 previously-default (no protected line) → `auth = "required"` ADDED, then reviewed individually: * 9 Anubis-fronted (blog, www, kms, travel, f1, cyberchef, jsoncrack, homepage, wrongmove UI, privatebin) → `auth = "none"` * 22 native-client / programmatic surfaces (Forgejo Git+/v2/, webhook handler, claude-memory MCP, Nextcloud WebDAV, Matrix, Vault CLI/OIDC, xray VPN, ntfy, woodpecker webhooks, n8n triggers, ntfy push, dawarich location ingestion, immich frame kiosk, headscale CP, send anonymous drops, rybbit beacon, vaultwarden API, Authentik UI itself + outposts) → `auth = "none"` * Remaining ~33 → `auth = "required"` confirmed (admin tools, internal UIs, services without app-level auth) - Smoke-test promotions to `auth = "public"`: fire-planner public UI, k8s-portal API, insta2spotify callback. Three call sites in wrapper modules (`stacks/freedify/factory/`, `stacks/reverse-proxy/modules/reverse_proxy/`) keep their internal `protected` bool — they translate to `auth` internally, out of scope for this rename. Behavior change: previously-default ingresses now fail closed (require Authentik login) unless explicitly flipped to `auth = "none"` or `auth = "public"`. This is the audit goal — no more accidentally-unprotected surfaces. Sites that were intentionally public (Anubis content, native APIs, webhooks) are now explicitly recorded as `auth = "none"`. Drive-by: `modules/create-vm/main.tf` picked up cosmetic alignment via `terraform fmt -recursive` during the audit. Behavior-neutral. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:42 +00:00
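A minimal Python sketch of the fail-closed `auth` enum described in this commit. It is not the module's actual HCL: the function name `resolve` and the middleware label for `"required"` are hypothetical, and only `authentik-forward-auth-public` and the three enum values come from the commit message.

```python
from enum import Enum


class Auth(str, Enum):
    REQUIRED = "required"
    PUBLIC = "public"
    NONE = "none"


# Middleware per mode. The "required" label is a placeholder name;
# "authentik-forward-auth-public" is the name given in the commit message.
MIDDLEWARE = {
    Auth.REQUIRED: "authentik-forward-auth",
    Auth.PUBLIC: "authentik-forward-auth-public",
    Auth.NONE: None,
}


def resolve(auth="required", anti_ai=None):
    """Return (middleware, effective_anti_ai) for one ingress.

    Omitting `auth` yields "required", so a caller that forgets the
    argument fails closed. An unknown value raises ValueError.
    """
    mode = Auth(auth)
    # Anti-AI defaults ON only for auth="none"; an explicit value wins.
    effective_anti_ai = (mode is Auth.NONE) if anti_ai is None else anti_ai
    return MIDDLEWARE[mode], effective_anti_ai
```

Under these assumptions, `resolve()` returns the forward-auth middleware with anti-AI off, while `resolve("none")` returns no middleware with anti-AI on.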
Viktor Barzin	4103ea2ba0	monitoring(prometheus): keep all 4 kubelet_volume_stats_inodes metrics pvc-autoresizer's GetMetrics() returns volume stats for a PVC only if all four kubelet_volume_stats metrics (available_bytes, capacity_bytes, inodes_free, inodes) are retrieved. The keep-list in the kubernetes-nodes scrape job had available_bytes and capacity_bytes (post `9d5da4d8`) but was missing the two inode metrics, so the autoresizer's reconcile logged "failed to get volume stats" for every PVC and never resized anything. Add kubelet_volume_stats_inodes and kubelet_volume_stats_inodes_free to the regex. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 14:16:42 +00:00
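The failure mode above can be illustrated with a small Python sketch. The two regexes are hypothetical stand-ins for the scrape job's keep-list (the real expression is not shown in the commit); `fullmatch` models the fact that Prometheus relabel regexes are fully anchored.

```python
import re

# The four series pvc-autoresizer needs, per the commit message.
REQUIRED = {
    "kubelet_volume_stats_available_bytes",
    "kubelet_volume_stats_capacity_bytes",
    "kubelet_volume_stats_inodes_free",
    "kubelet_volume_stats_inodes",
}

# Hypothetical before/after keep-list fragments.
OLD_KEEP = re.compile(r"kubelet_volume_stats_(available_bytes|capacity_bytes)")
NEW_KEEP = re.compile(
    r"kubelet_volume_stats_(available_bytes|capacity_bytes|inodes_free|inodes)"
)


def kept(keep, names):
    """Names that survive a keep rule (anchored match on the metric name)."""
    return {name for name in names if keep.fullmatch(name)}


def autoresizer_can_see(keep):
    """True only if all four required series survive the keep rule."""
    return kept(keep, REQUIRED) == REQUIRED
```

With the old pattern `autoresizer_can_see(OLD_KEEP)` is false, which corresponds to the "failed to get volume stats" log for every PVC; with the new pattern it is true.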
Viktor Barzin	278ef5f19b	monitoring(grafana): swap python3 for jq in folder-ACL local-exec CI image (ci/Dockerfile) is alpine + jq, no python3. The grafana_admin_only_folder_acl null_resource was parsing /api/folders with a python3 oneliner, which crashed every CI apply with "python3: command not found" and made every monitoring stack apply fail in CI (worked locally because the dev VM has python3). jq is already in the CI image and produces the same output.	2026-05-22 14:16:41 +00:00
Viktor Barzin	5c0ea96a91	infra: re-enable unattended-upgrades with kured prometheus-gating Reverses the March 2026 outage mitigation that disabled unattended- upgrades cluster-wide. Now re-enables it on the k8s template VM with: - Allowed-Origins limited to security/updates pockets - Package-Blacklist for k8s/containerd/runc/calico-node (apt-mark hold on the cluster-critical components) - Automatic-Reboot disabled — kured drives the actual reboots - Compatible with the existing kured + sentinel-gate flow kured side: - rebootDelay 30s, concurrency 1 - Sentinel cool-down stretched 30m → 24h (aligns with the 24h soak window from the post-mortem) - prometheusUrl + alertFilterRegexp wired so any firing non-ignored alert halts the rollout. Ignore-list excludes self-referential alerts (Watchdog/RebootRequired/KuredNodeWasNotDrained/ InfoInhibitor) that would otherwise deadlock kured. Prometheus side (already partly landed in `6c4e0966` — the "Upgrade Gates" rule group): - Refine `KubeQuotaAlmostFull` to include the resourcequota label in both the on-clause and the summary, so multi-quota namespaces (authentik, beads-server, frigate) report the quota name correctly. grafana.tf: terraform fmt whitespace only. Together with the post-mortem 2026-03-22 (memory id=390) the loop is closed: unattended-upgrades runs again, kernel-class updates can land, but only when cluster health is green and the reboot window is open.	2026-05-22 14:16:41 +00:00
Viktor Barzin	fe75fad467	monitoring: protect grafana ingress with authentik + disable anonymous - add traefik-authentik-forward-auth to grafana ingress middleware list - disable auth.anonymous (was Viewer-by-default for the public) - enable auth.proxy with X-authentik-username so Authentik users get signed in seamlessly (no double-login UX) Prometheus and Alertmanager already had forward-auth — no change.	2026-05-22 14:16:41 +00:00
Viktor Barzin	6c294d4bb0	authentik: zero-endpoints alert + upgrade-validation checklist Add `AuthentikForwardAuthFallbackActive` Prometheus alert: fires on sustained 401/s spike on the websecure entrypoint (>5/s for 5m), which is the symptom of the auth-proxy Emergency-Access fallback firing — in turn caused by zero ready endpoints on the outpost service. Why this rule and not `kube_endpoint_address_available == 0`: kube-state-metrics endpoint metrics exist as series names but never have current values in this Prometheus pipeline (something is dropping them silently). Detecting the failure at the edge via Traefik is more reliable than instrumenting the broken middle. Also fix the pre-existing `AuthentikOutpostForwardAuth400Spike` regex — the service label is `authentik-ak-outpost-...`, not `authentik-authentik-outpost-...`, so the alert never matched any series and never could have fired. Verified in Prometheus before/after the fix. Add an "Upgrade Validation Checklist" section to `.claude/reference/authentik-state.md` with the seven-step smoke test to run after Authentik chart bumps, provider bumps, or outpost pod recreation. Covers the brittle surfaces (Service selector, JSON patches, postgres backend wiring, access_token_validity TTL, edge auth flow, plan-to-zero).	2026-05-22 14:16:41 +00:00
Viktor Barzin	a89d4a7d2a	anubis: pull f1 off Anubis (XHR-vs-challenge collision) + add latency alerts f1.viktorbarzin.me is a SPA whose JS fetches /schedule, /embed, /embed-asset, … on the same path tree. With Anubis fronting `/`, those XHRs land on the challenge HTML even when the cookie should be valid, breaking the page with `Unexpected token '<', "<!doctype " ... is not valid JSON`. Removed Anubis from f1 — would need a path carve-out (the way wrongmove does for /api) to re-enable. Added a top-of-block comment so future me remembers why. Plus four new Prometheus alerts in `Slow Ingress Latency` group (stacks/monitoring/.../prometheus_chart_values.tpl): - IngressTTFBHigh (warn, 10m, avg latency >1s) - IngressTTFBCritical (crit, 5m, avg latency >3s) - IngressErrorRate5xxHigh (crit, 5m, 5xx >5%) - AnubisChallengeStoreErrors (crit, 5m, any 5xx on anubis services via Traefik — proxies for the in-pod challenge-store error since Anubis itself only exposes Go-runtime metrics) Notes from the alert author: avg-not-p95 because the existing Prometheus scrape config drops traefik bucket series; once those are restored, swap to histogram_quantile(0.95). TraefikDown inhibit rule extended to suppress these four during a Traefik outage.	2026-05-10 11:12:40 +00:00
Viktor Barzin	8c619278d3	grafana: env-var datasources + reloader so Vault rotations stop breaking dashboards Wealth, Payslips, and Job-Hunter Grafana datasources all baked the rotating PG password into their ConfigMap at TF-apply time, so every 7-day Vault static-role rotation silently broke the panels until a manual `terragrunt apply`. Same family as the recurring grafana-mysql backend bug — Grafana caches creds at startup and never picks up the new ESO-synced password without a restart. Fix: - Each source stack now creates an ExternalSecret in `monitoring` exposing the rotating password as `<NAME>_PG_PASSWORD` env-var. - Grafana mounts those via `envFromSecrets` (optional=true so a missing source stack doesn't block boot) and the datasource ConfigMaps reference `$__env{<NAME>_PG_PASSWORD}` instead of a literal password. - `reloader.stakater.com/auto: "true"` on the Grafana pod restarts it whenever any of the four DB-cred Secrets is updated. Tested end-to-end: forced `vault write -force database/rotate-role/ pg-wealthfolio-sync` → ESO synced (~30s) → reloader fired → Grafana booted with new env in ~50s total → all three /api/datasources /uid/*/health endpoints return "Database Connection OK". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-10 11:12:39 +00:00
Viktor Barzin	8c09543391	fix: restore pvc-autoresizer by allow-listing kubelet_volume_stats_available_bytes The Prometheus scrape config for the kubernetes-nodes job kept capacity_bytes + used_bytes but dropped available_bytes. pvc-autoresizer computes utilization from available/capacity, so without that metric it was silent for every PVC in the cluster — including mailserver, which filled to 89% (1.7G/2.0G) and started rejecting all inbound mail with '452 4.3.1 Insufficient system storage' (15+ hours, all real senders: Brevo, Gmail, Facebook). Also bumps the floors of mailserver (2Gi -> 5Gi, limit 10Gi) and forgejo (15Gi -> 30Gi) PVCs to recover from the immediate outage, and adds ignore_changes on requests.storage so future autoresizer expansions don't cause TF drift.	2026-05-10 11:12:37 +00:00
Viktor Barzin	e110b40a4a	monitoring(wealth): monthly contrib-vs-mkt as line chart, not bars User asked for two lines instead of side-by-side bars at monthly granularity. Converts panel 25 from barchart to timeseries: * type: barchart -> timeseries * format: table -> time_series, SELECT month::timestamp AS time * drawStyle line, lineWidth 2, fillOpacity 0, showPoints auto * Same blue (contributions) / green (market gain) colour overrides Where the green line rises above the blue line is the visual cue that the market out-earned new contributions for that month -- the trend the user wants to track. Diff is small (15 ins / 28 del) because the bar-chart-only fields (barRadius, barWidth, groupWidth, stacking, xField, xTickLabelRotation) are dropped.	2026-05-07 23:29:35 +00:00
Viktor Barzin	84fd752747	monitoring(wealth): monthly contributions vs market gain bar chart Goal stated by user: see when monthly market gain starts to exceed monthly contributions, i.e. the inflection point where the market is out-earning savings rather than the other way around. New panel id=25 between the annual decomposition (13) and per-account ROI (14): bar chart with two side-by-side bars per month -- contributions (blue) and market gain (green). Same calculation as panel 13 but month-grain instead of year-grain. Months where the green bar dwarfs the blue one are visible at a glance. SQL: same endpoints CTE pattern as panel 13, with date_trunc('month', valuation_date) as the grouping key. Uses max_complete cutoff so partial-today doesn't skew the latest month. Layout: panels at y >= 75 shifted down by 11 (chart height). New chart at y=75; panel 14 (per-account ROI) -> y=86; panel 10 (activity log) -> y=96. Spot check (recent months from PG): 2025-07: contrib +£5,601 market +£42,295 <- big market month 2025-09: contrib +£1,501 market +£24,206 2026-02: contrib +£35,501 market +£41,382 2026-03: contrib +£5,501 market -£38,483 <- correction 2026-04: contrib +£73,267 market +£21,448	2026-05-07 23:29:34 +00:00
Viktor Barzin	4ec40ea804	[forgejo] Phases 3+4+5: cutover, decommission, docs sweep End of forgejo-registry-consolidation. After Phase 0/1 already landed (Forgejo ready, dual-push CI, integrity probe, retention CronJob, images migrated via forgejo-migrate-orphan-images.sh), this commit flips everything off registry.viktorbarzin.me onto Forgejo and removes the legacy infrastructure. Phase 3 — image= flips: * infra/stacks/{payslip-ingest,job-hunter,claude-agent-service, fire-planner,freedify/factory,chrome-service,beads-server}/main.tf — image= now points to forgejo.viktorbarzin.me/viktor/<name>. * infra/stacks/claude-memory/main.tf — also moved off DockerHub (viktorbarzin/claude-memory-mcp:17 → forgejo.viktorbarzin.me/viktor/...). * infra/.woodpecker/{default,drift-detection}.yml — infra-ci pulled from Forgejo. build-ci-image.yml dual-pushes still until next build cycle confirms Forgejo as canonical. * /home/wizard/code/CLAUDE.md — claude-memory-mcp install URL updated. Phase 4 — decommission registry-private: * registry-credentials Secret: dropped registry.viktorbarzin.me / registry.viktorbarzin.me:5050 / 10.0.20.10:5050 auths entries. Forgejo entry is the only one left. * infra/stacks/infra/main.tf cloud-init: dropped containerd hosts.toml entries for registry.viktorbarzin.me + 10.0.20.10:5050. (Existing nodes already had the file removed manually by `setup-forgejo-containerd-mirror.sh` rollout — the cloud-init template only fires on new VM provision.) * infra/modules/docker-registry/docker-compose.yml: registry-private service block removed; nginx 5050 port mapping dropped. Pull- through caches for upstream registries (5000/5010/5020/5030/5040) stay on the VM permanently. * infra/modules/docker-registry/nginx_registry.conf: upstream `private` block + port 5050 server block removed. * infra/stacks/monitoring/modules/monitoring/main.tf: registry_ integrity_probe + registry_probe_credentials resources stripped. forgejo_integrity_probe is the only manifest probe now. 
Phase 5 — final docs sweep: * infra/docs/runbooks/registry-vm.md — VM scope reduced to pull- through caches; forgejo-registry-breakglass.md cross-ref added. * infra/docs/architecture/ci-cd.md — registry component table + diagram now reflect Forgejo. Pre-migration root-cause sentence preserved as historical context with a pointer to the design doc. * infra/docs/architecture/monitoring.md — Registry Integrity Probe row updated to point at the Forgejo probe. * infra/.claude/CLAUDE.md — Private registry section rewritten end- to-end (auth, retention, integrity, where the bake came from). * prometheus_chart_values.tpl — RegistryManifestIntegrityFailure alert annotation simplified now that only one registry is in scope. Operational follow-up (cannot be done from a TF apply): 1. ssh root@10.0.20.10 — edit /opt/registry/docker-compose.yml to match the new template AND `docker compose up -d --remove-orphans` to actually stop the registry-private container. Memory id=1078 confirms cloud-init won't redeploy on TF apply alone. 2. After 1 week of no incidents, `rm -rf /opt/registry/data/private/` on the VM (~2.6GB freed). 3. Open the dual-push step in build-ci-image.yml and drop registry.viktorbarzin.me:5050 from the `repo:` list — at that point the post-push integrity check at line 33-107 also needs to be repointed at Forgejo or removed (the per-build verify is redundant with the every-15min Forgejo probe). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-07 23:29:34 +00:00
Viktor Barzin	f793a5f50b	[forgejo] Phase 0 of registry consolidation: prepare Forgejo OCI registry Stage 1 of moving private images off the registry:2 container at registry.viktorbarzin.me:5050 (which has hit distribution#3324 corruption 3x in 3 weeks) onto Forgejo's built-in OCI registry. No cutover risk — pods still pull from the existing registry until Phase 3. What changes: * Forgejo deployment: memory 384Mi→1Gi, PVC 5Gi→15Gi (cap 50Gi). Explicit FORGEJO__packages__ENABLED + CHUNKED_UPLOAD_PATH (defensive, v11 default-on). * ingress_factory: max_body_size variable was declared but never wired in after the nginx→Traefik migration. Now creates a per-ingress Buffering middleware when set; default null = no limit (preserves existing behavior). Forgejo ingress sets max_body_size=5g to allow multi-GB layer pushes. * Cluster-wide registry-credentials Secret: 4th auths entry for forgejo.viktorbarzin.me, populated from Vault secret/viktor/ forgejo_pull_token (cluster-puller PAT, read:package). Existing Kyverno ClusterPolicy syncs cluster-wide — no policy edits. * Containerd hosts.toml redirect: forgejo.viktorbarzin.me → in-cluster Traefik LB 10.0.20.200 (avoids hairpin NAT for in-cluster pulls). Cloud-init for new VMs + scripts/setup-forgejo-containerd-mirror.sh for existing nodes. * Forgejo retention CronJob (0 4 * * *): keeps newest 10 versions per package + always :latest. First 7 days dry-run (DRY_RUN=true); flip the local in cleanup.tf after log review. Forgejo integrity probe CronJob (*/15): same algorithm as the existing registry-integrity-probe. Existing Prometheus alerts (RegistryManifestIntegrityFailure et al) made instance-aware so they cover both registries during the bake. Docs: design+plan in docs/plans/, setup runbook in docs/runbooks/. 
Operational note — the apply order is non-trivial because the new Vault keys (forgejo_pull_token, forgejo_cleanup_token, secret/ci/global/forgejo_*) must exist BEFORE terragrunt apply in the kyverno + monitoring + forgejo stacks. The setup runbook documents the bootstrap sequence. Phase 1 (per-project dual-push pipelines) follows in subsequent commits. Bake clock starts when the last project goes dual-push. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-07 23:29:33 +00:00
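The retention rule in this commit ("newest 10 versions per package + always :latest") could be sketched as below. This is an illustration, not the CronJob's script: the function name, the `(tag, created_at)` tuple shape, and the choice to let `latest` occupy one of the ten slots when it is among the newest are all assumptions.

```python
def versions_to_delete(versions, keep_newest=10):
    """Pick the tags a retention pass would remove for one package.

    versions: list of (tag, created_at) tuples; created_at is any
    sortable value (assumed shape, not Forgejo's API response).
    """
    newest_first = sorted(versions, key=lambda v: v[1], reverse=True)
    keep = {tag for tag, _ in newest_first[:keep_newest]}
    keep.add("latest")  # always retained, however old it is
    return [tag for tag, _ in newest_first if tag not in keep]
```

During a dry-run window, a caller would log the returned list instead of deleting anything.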
Viktor Barzin	41655096c7	openclaw: realtime usage dashboard via Prometheus exporter sidecar Stdlib-only Python exporter ($1) reads ~/.openclaw/agents//sessions/.jsonl (assistant messages with usage) plus auth-profiles.json (OAuth expiry, Plus-tier label) and exposes Prometheus text format on :9099/metrics. Container is python:3.12-slim; pod template gets prometheus.io/scrape annotations so the existing kubernetes-pods job picks it up — no ServiceMonitor needed. Metrics exported: openclaw_codex_messages_total{provider,model,session_kind} counter openclaw_codex_input/output/cache_read/cache_write_tokens_total openclaw_codex_message_errors_total{reason} openclaw_codex_active_sessions{kind} gauge openclaw_codex_oauth_expiry_seconds{provider,account,plan} gauge openclaw_codex_last_run_timestamp gauge Grafana dashboard "OpenClaw — Codex Usage" (Applications folder, 30s refresh): messages/5h vs Plus rate-card, % of 1,200 floor, tokens/5h, cache hit %, OAuth expiry days, active sessions, last-turn age, errors, plus per-model timeseries + bar gauge + error table. Plus rate-card thresholds in the gauge are conservative (1,200/5h floor; real cap is dynamic 1,200–7,000). Re-baseline if throttling shows up below 80%.	2026-05-07 23:29:32 +00:00
Viktor Barzin	f006b48566	monitoring(wealth): delta panels to 2x4 grid (rows = type, cols = window) Better visual grouping: instead of 8 paired panels in a single row at w=3 (cramped, hard to scan), arrange as a 2x4 grid at w=6. Top row ("all" — wealth change incl new money), bottom row ("mkt" — pure market gain). Columns are timeframes 1d / 7d / 30d / 90d. Reading vertically: same window, two interpretations side by side. Reading horizontally: same metric across timeframes. Layout shift: delta row goes from y=4 (4 wide) to y=4..11 (8 high). All chart/log panels with y >= 8 shift down by another 4 rows (net-worth chart 8->12, activity log 81->85, etc.).	2026-05-07 23:29:31 +00:00
Viktor Barzin	0f107aeacb	monitoring(wealth): pair every delta panel with market-only twin User feedback: net-worth delta panels (1d/7d/30d/90d) confused because +£174k over 90d looked too big against the £271k cumulative unrealised gain. Decomposition showed the 90d delta was £114k of new money in (contributions) + £60k of actual market gain. So now the delta row shows BOTH: Δ Nd (all) — net-worth change incl new money (the original number) Δ Nd (mkt) — pure market gain, contributions stripped out Pattern for "(mkt)" panels: same now_snap / past_snap CTEs but selecting both total_value and net_contribution, then computing (nw_delta - contrib_delta) = market_gain over window. Layout: 8 panels at w=3 each on the y=4 row, paired by window (all next to mkt for each timeframe), so you can see "wealth change vs investment performance" at a glance. Verified live (90d): all=+£174,612, mkt=+£60,343, contrib=+£114,268.	2026-05-07 23:29:31 +00:00
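The "(mkt)" calculation above reduces to one subtraction per window. A minimal Python sketch, assuming each snapshot carries the two columns the commit names (`total_value`, `net_contribution`); the `Snapshot` type and `decompose` function are hypothetical, and the example figures in the usage note are made up rather than taken from the dashboard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    total_value: float       # summed net worth at the snapshot date
    net_contribution: float  # cumulative money paid in up to that date


def decompose(now, past):
    """Split a window's net-worth change into new money and market gain.

    Returns (nw_delta, contrib_delta, market_gain), where
    market_gain = nw_delta - contrib_delta.
    """
    nw_delta = now.total_value - past.total_value
    contrib_delta = now.net_contribution - past.net_contribution
    return nw_delta, contrib_delta, nw_delta - contrib_delta
```

For example, a move from `Snapshot(1000, 800)` to `Snapshot(1300, 1000)` is a 300 change in net worth, of which 200 is contributions and 100 is market gain.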
Viktor Barzin	87069ae5c3	monitoring(wealth): add delta row (1d / 7d / 30d / 90d net-worth changes) New row at y=4 with 4 stat panels showing net-worth change over the trailing windows. Each uses the latest-per-account stitching pattern (skew-resilient against partial-day syncs) and computes: delta = SUM(latest per account) - SUM(latest per account at or before max_complete - N) Where max_complete is the most recent date all accounts have a row. For each window: 1d, 7d, 30d, 90d. Verified live values: +£8,575 / +£22,696 / +£144,633 / +£174,612. All panels at y >= 4 shifted down by 4 rows to make room (Net worth chart 4->8, Per-account stacked 24->28, Activity log 77->81, etc.). Note: this commit also reformats the dashboard JSON from compact- object form to indented form (json.dump indent=2 side effect from the Python patch script). No semantic changes outside the new panels and y-shifts.	2026-05-07 23:29:31 +00:00
Viktor Barzin	1cb2bb30f7	monitoring(wealth): show pre-2024 historical data on timeseries Bug: timeseries panels were empty before 2024-04-10. Cause was the complete_dates CTE filtering to "every active account has a row for this date" -- which excluded every day before the most-recently-added account first appeared. The 6th account (Trading212 Invest GIA) only started 2024-04-10, so 4 years of legitimate historical data (2020-06-07 onwards, when the user genuinely had fewer accounts) got hidden. New pattern across panels 5/6/7/8/9/12/13: replace complete_dates with max_complete cutoff. Compute the most-recent date where all current accounts have a row, then include every historical date up to and including that day. Partial-today is still excluded automatically. Historical days with fewer accounts now show as their actual smaller sums -- which is the correct historical net worth at the time. Verified via PG: new pattern returns 2,159 distinct days from 2020-06-07 to 2026-05-05 (vs the previous 391 from 2024-04-10). Per-account first-seen dates: InvestEngine ISA - 2020-06-07 Schwab US workplace - 2020-11-17 InvestEngine GIA - 2022-03-17 Fidelity UK Pension - 2022-05-16 Trading212 ISA - 2024-04-08 Trading212 Invest GIA - 2024-04-10 (was the bottleneck)	2026-05-05 18:43:26 +00:00
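The `max_complete` cutoff can be sketched in Python as follows. The dashboard does this in SQL CTEs that are not shown in the commit, so the row shape `(account, valuation_date, value)` and both function names are assumptions for illustration.

```python
from datetime import date


def max_complete(rows, accounts):
    """Most recent date on which every current account has a row."""
    seen = {}
    for account, day, _ in rows:
        seen.setdefault(day, set()).add(account)
    complete = [d for d, present in seen.items() if present >= set(accounts)]
    return max(complete) if complete else None


def daily_totals(rows, accounts):
    """Sum per day for every date up to and including the cutoff.

    Early days with fewer accounts are kept as their smaller sums;
    days after the cutoff (e.g. a partial today) are dropped.
    """
    cutoff = max_complete(rows, accounts)
    totals = {}
    if cutoff is None:
        return totals
    for _, day, value in rows:
        if day <= cutoff:
            totals[day] = totals.get(day, 0) + value
    return dict(sorted(totals.items()))
```

The old `complete_dates` filter corresponds to keeping only the days in `complete`, which is why everything before the newest account's first row disappeared.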
Viktor Barzin	6715cdc51f	monitoring(wealth): re-add milestone annotations (now that PG creds rotated) Re-applies the milestone annotation commit reverted in `0ef36aec`. The earlier "nothing loads / syntax error" was a red herring: Vault had rotated the wealthfolio_sync DB password 7 days prior, the K8s Secret picked it up automatically (pg-sync sidecar still working), but the Grafana datasource ConfigMap is baked at TF-apply time so Grafana was sending the old password. Every panel + the new annotation alike failed with: pq password authentication failed for user wealthfolio_sync. Fix today: refresh the datasource ConfigMap and roll Grafana. scripts/tg apply -target=kubernetes_config_map.grafana_wealth_datasource kubectl -n monitoring rollout restart deploy/grafana Annotation source verified live via /api/ds/query: SQL returns 5 milestone rows correctly. Dashboard charts now show vertical dashed lines at GBP100k 2021-11-01, GBP250k 2023-07-18, GBP500k 2024-09-19, GBP750k 2025-08-26, GBP1M 2026-04-18. KNOWN FOLLOW-UP: Vault rotates pg-wealthfolio-sync every 7 days (static role). Today's failure will recur unless the Grafana datasource auto-refreshes. Options: 1. Annotate Grafana deploy with stakater/reloader so it restarts when wealthfolio-sync-db-creds Secret changes. 2. Switch datasource provisioning to read password from an env var sourced from the Secret instead of baking into the ConfigMap. Combined with reloader, picks up rotation cleanly.	2026-05-02 20:27:21 +00:00
Viktor Barzin	0ef36aec36	Revert "monitoring(wealth): milestone annotations on every timeseries chart" This reverts commit `5a00b9c096`.	2026-05-02 20:20:18 +00:00
Viktor Barzin	5a00b9c096	monitoring(wealth): milestone annotations on every timeseries chart Inspired by the user's "Journey to £1M" reference — adds vertical dashed lines on every timeseries panel at the date net worth first crossed each round threshold (£100k, £250k, £500k, £750k, £1M). Implementation: a dashboard-level annotation source ("Milestones", purple) backed by a PG query that finds the MIN(valuation_date) where SUM(total_value) >= each threshold. The query returns (time, text) pairs, e.g. "2026-04-18 → £1M 🎉". Annotations attach to all timeseries panels automatically; auto-extends as future thresholds are crossed. Verified against current data: £100k → 2021-11-01 £250k → 2023-07-18 £500k → 2024-09-19 £750k → 2025-08-26 £1M → 2026-04-18 🎉 Future work (per user request): add a "Journey" stat-card row at the top mirroring the reference (date achieved + months from previous).	2026-05-02 08:42:21 +00:00
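The milestone query (earliest date where the summed net worth reaches each threshold) can be sketched in Python. The real annotation source is a PG query not reproduced in the commit; the function name and the `{date: total}` input shape are assumptions.

```python
def first_crossings(daily_totals, thresholds):
    """Map each threshold to the first date the total reached it.

    daily_totals: dict of sortable date -> summed net worth that day.
    Thresholds never reached are omitted, so new ones appear
    automatically once the data crosses them.
    """
    result = {}
    for threshold in thresholds:
        days = [d for d, total in daily_totals.items() if total >= threshold]
        if days:
            result[threshold] = min(days)
    return result
```

Because the earliest qualifying date is taken, a later dip below a threshold does not move its annotation.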
Viktor Barzin	664a85ef1e	Revert "monitoring(wealth): show daily points + lighter fill on timeseries" This reverts commit `5472720c75`.	2026-05-01 16:24:18 +00:00


187 commits