Builds on the compat gate (prev commit) to finish "auto-upgrade when safe, halt +
alert when not":
- monitoring: K8sUpgradeBlocked alert (k8s_upgrade_blocked==1, for 10m, warning)
in the Upgrade Gates group — the clean "a k8s auto-upgrade was refused, see
Slack for why" signal. (Until monitoring is applied, a block still surfaces via
the already-live K8sUpgradeChainJobFailed.)
- upgrade-step.sh phase_postflight: deeper post-upgrade smoke tests —
apiserver /readyz + /livez, in-cluster DNS (resolve kubernetes.default), and
core kube-system pods (apiserver/controller-manager/scheduler/etcd/coredns)
Running. Any failure halts + alerts (exit 1; no rollback — kubeadm can't
downgrade). Catches a "pods look Running but cluster is broken" upgrade.
- runbook: documents the compat gate, the blocked alert, how to clear a block,
matrix maintenance, and the detector minor-probe fix.
After deploy, the nightly chain detects 1.35 (minor detection now works) and
correctly BLOCKS on Calico 3.26 / ESO 0.12 / kyverno 1.16 (all behind), alerting
via K8sUpgradeBlocked — the autonomy working as designed until the catch-up
clears those addons.
The rpi-sofia under-voltage alert keyed off the sticky firmware bit
(rpi_under_voltage_occurred == 1), which latches on the first brown-out and
stays 1 until the Pi reboots. With alert-on-change routing it re-paged on every
boot cycle and sat firing for ~211h of the last 14d (Viktor reported "getting a
few of these lately"). It also disagreed with the HA-sofia dashboard, which shows
the live state and reads OK once voltage recovers.
Can't just switch to the live bit: rpi_under_voltage_now never registered once in
14d (brown-outs are sub-second and fall between the 1-min textfile-collector
samples), so the sticky bit is the only reliable detector.
Fix: edge-trigger on a NEW latch via increase(rpi_under_voltage_occurred[1h]) > 0.
Fires once per brown-out and auto-resolves ~1h later (~2h active over the same
14d instead of ~211h); counter-reset handling makes a clean reboot a no-op. Both
real brown-out events in the window are still caught. Docs updated in the same
commit (monitoring.md).
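A minimal sketch of why the edge-trigger behaves this way, assuming a simplified
model of PromQL's increase() (no window-edge extrapolation; the sample lists are
hypothetical). A drop in value is treated as a counter reset, so a reboot that
clears the latch adds nothing, while a fresh 0 -> 1 latch adds 1.

```python
def increase(samples):
    """Simplified increase(): sum of rises, treating any drop as a counter reset.

    Real Prometheus also extrapolates to the window edges; that is omitted here.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # After a reset the counter is assumed to have restarted from 0,
        # so the whole current value counts as the rise.
        total += cur if cur < prev else cur - prev
    return total


def alert_fires(window_samples):
    """Mirror of `increase(rpi_under_voltage_occurred[1h]) > 0` on one window."""
    return increase(window_samples) > 0
```

A window that stays latched ([1, 1, 1, 1]) or only sees the reboot clear the bit
([1, 1, 0, 0]) does not fire; only a window containing a new latch does.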
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Refines the new K8sUpgradeChainJobFailed alert from a bare failed-pod count to
the terminal job-condition reasons (BackoffLimitExceeded|DeadlineExceeded). A
phase whose first pod failed but whose retry SUCCEEDED must NOT fire: every
firing alert also halts kured, so a bare-count false-positive would block all
OS node reboots for the Job's 7-day TTL. Verified against kube-state-metrics:
the stuck preflight reports reason="BackoffLimitExceeded"; a Complete job has 0
for the terminal reasons.
Docs updated to match the behaviour change (per the same-commit docs rule):
- docs/runbooks/k8s-version-upgrade.md — new alert in the gates list; the
"kill a stuck Job" recovery now leads with retry-on-failure self-heal.
- docs/architecture/automated-upgrades.md — fourth Upgrade Gates alert;
retry-on-failure note on the deterministic-naming paragraph.
- .claude/skills/upgrade-state/SKILL.md — new "chain failed" status, legend
entry, and drill-down (also copied to the active ~/.claude copy).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The 1.34.9 patch auto-upgrade sat stuck for 5 days without anyone knowing.
On 2026-06-12 a transient critical alert (the ttyd web-terminal probe on the
devvm) was firing when the daily detection ran; the preflight's "halt on any
critical alert" gate aborted it, so the preflight Job Failed (backoffLimit=1).
Two design gaps then turned that blip into a multi-day wedge:
* the detection guard and spawn_next only checked whether the phase Job
EXISTED, not whether it succeeded — and the Failed Job lingers 7 days via
ttlSecondsAfterFinished, so every daily run skipped re-spawning it;
* the abort happens before the in-flight metric is pushed, so neither
K8sUpgradeStalled nor upgrade_state.sh could see it — the pipeline reported
"never ran" while actually being stuck.
Fixes:
D1 retry-on-failure: detection CronJob (main.tf) and spawn_next
(upgrade-step.sh) now delete + re-spawn a terminally-Failed phase Job
instead of skipping it, so a transient gate self-corrects next cycle
rather than wedging the pipeline for a week.
D2 WebterminalTtydUnreachable critical -> warning: a devvm developer
web-terminal is not cluster infrastructure and must not block upgrades.
D3 observability: new K8sUpgradeChainJobFailed alert
(kube_job_status_failed in k8s-upgrade ns) and upgrade_state.sh now flags
a Failed chain Job as "chain failed" — closing the pre-in-flight blind
spot so a wedge is visible immediately.
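The D1 change can be sketched as a before/after decision function. This is an
illustration only: PhaseJob, old_action and new_action are hypothetical names,
not the code in main.tf or upgrade-step.sh.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PhaseJob:
    """Hypothetical view of a phase Job's state."""
    succeeded: bool = False
    terminally_failed: bool = False  # e.g. a Failed condition after backoffLimit


def old_action(job: Optional[PhaseJob]) -> str:
    """Pre-fix behaviour: mere existence of the Job blocks a re-spawn."""
    return "spawn" if job is None else "skip"


def new_action(job: Optional[PhaseJob]) -> str:
    """Post-fix behaviour: a terminally Failed Job is deleted and re-spawned."""
    if job is None:
        return "spawn"
    if job.terminally_failed:
        return "delete-and-respawn"
    return "skip"  # still running, or already succeeded
```

With the old check a Failed Job that lingers for its TTL returns "skip" on every
run; with the new one the next cycle retries it.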
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The monitoring stack apply was create-failing on every push with `configmaps
"alert-digest-script" already exists` + `secrets "alert-digest" already exists`
(modules/monitoring/alert_digest.tf) — both resources exist in-cluster but fell
out of Terraform state, so apply tried to CREATE them and errored. Pre-existing
(failed on pipelines 203 AND 204, NOT caused by the t3 alert-rules change). Add
import {} blocks (TF 1.5+ adoption per AGENTS.md) so apply imports + reconciles
instead of failing. Idempotent once imported; safe to remove after a green apply.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Before auto-tracking t3 nightly builds (Viktor's call, risk accepted), stand up
the detection that was missing on 2026-06-09 — when an auto-pulled nightly broke
pairing for ALL users and nothing alerted. Viktor's explicit requirement: make
sure session auth keeps working and revert if the pairing fallback/failure rate
climbs. This is phase 0 (detection) of that work.
- t3-dispatch: exchangeCredential now reports WHICH pairing endpoint answered,
and autoPair logs every outcome (paired user=.. endpoint=.. fallback=..) — so
the real-user browser-session->bootstrap fallback rate is observable. A
non-zero rate flags that a build moved the pairing API (the 2026-06-09 class).
- Loki ruler alerts (devvm journal -> Alertmanager -> Slack): T3PairingBroken
(real users failing to pair), T3PairFallbackHigh (build moved the pairing API),
T3AutoUpdateRolledBack / RollbackFailed / Frozen (enforcer outcomes). Closes
the post-mortem's open "nothing monitors end-to-end pairing" detection gap.
The existing t3-probe only checks GET /api/auth/session==200, which stays 200
even when pairing is dead, so it never caught the outage class.
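For illustration, the fallback rate those log lines make observable could be
computed like this. The concrete field values (true/false for fallback, the
endpoint strings) are assumptions; the commit only gives the line shape
"paired user=.. endpoint=.. fallback=..".

```python
import re

# Assumed concrete form of the "paired user=.. endpoint=.. fallback=.." line.
PAIRED_RE = re.compile(
    r"paired user=(?P<user>\S+) endpoint=(?P<endpoint>\S+) "
    r"fallback=(?P<fallback>true|false)"
)


def fallback_rate(lines):
    """Fraction of logged pairings that went through the fallback path."""
    total = 0
    fallbacks = 0
    for line in lines:
        match = PAIRED_RE.search(line)
        if match is None:
            continue  # not a pairing outcome line
        total += 1
        if match.group("fallback") == "true":
            fallbacks += 1
    return fallbacks / total if total else 0.0
```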
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two small doc additions that also re-include these stacks in Woodpecker's
changed-stack detection. The earlier 2-commit push left chrome-service out of the
HEAD~1..HEAD diff so its ignore_changes fix never applied; the monitoring apply was
separately blocked by a stuck prometheus pending-upgrade (now cleared).
- chrome-service: note the live pod's container order had drifted from this file's
order, so a TF apply reorders them (containers[0] differs live-vs-TF until the
apply lands) -- documents the confusion this caused during diagnosis.
- mam-farming: cross-ref the grabber script that emits mam_grabber_last_run_timestamp.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
MAMFarmingStuck fired whenever the freeleech grabber added 0 torrents in 4h, but
grabbing 0 is normal: the grabber searches a random catalogue offset each run and
legitimately finds nothing when freeleech is dry (account ratio was a healthy
37.5; the alert even misreported it as "0.00" because $value was the grabbed
count, not the ratio). The alert's real intent was to catch the grabber not
running at all (CronJob Forbid-blocked / wedged), but increase(grabbed[4h])==0
cannot distinguish "didn't run" from "ran, nothing to grab" since Pushgateway
serves the last pushed value forever.
The grabber now heartbeats mam_grabber_last_run_timestamp on every completed run
(main success, ratio/mouse skip, and qBittorrent-unreachable paths). The alert
fires only when that heartbeat is >4h stale — the true stuck condition. Cookie
expiry and qBittorrent-down keep their own dedicated alerts.
Surfaced by /cluster-health as a false-firing alert.
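The heartbeat check above reduces to a staleness comparison that ignores the
grabbed count entirely. A small sketch (grabber_stuck is a hypothetical helper,
not the alert rule itself):

```python
import time

STALE_AFTER_SECONDS = 4 * 3600  # the >4h threshold from the alert


def grabber_stuck(last_run_timestamp, now=None):
    """True when the last completed run is older than the threshold.

    The number of torrents grabbed plays no part: a run that found nothing
    still refreshes the heartbeat.
    """
    now = time.time() if now is None else now
    return (now - last_run_timestamp) > STALE_AFTER_SECONDS
```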
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reviewed the last 24h of Slack alerts after the midday node-pressure blip:
the volume came far less from the outage than from (a) alerts re-pinging
every few hours while nothing changed and (b) a pod cascade that fired
uninhibited. This hardens the alerting *system* so recurrences are quiet,
rather than just clearing today's broken services.
Changes (all in the monitoring module):
* Alert-on-change routing. warning/info repeat_interval -> 8760h (notify
once, then only on a membership change or resolve); critical 1h -> 6h
(a slow nag, not an hourly drip). send_resolved stays on. The bulk of
the 24h volume was these re-pings (RpiSofiaUndervoltage alone fired
continuously for ~24h, re-notifying every 4h).
* Daily digest CronJob (alert_digest.tf + alert_digest.py) -> #alerts at
08:00 Europe/London: the full current board grouped by severity + what
resolved in the last 24h. This is the standing-state safety net for the
alert-on-change model. Stock python:3.12-alpine, pure-stdlib script
(no pip/apk at runtime -> none of the per-run disk-write footprint that
disabled status-page-pusher). Reuses the existing Alertmanager Slack
webhook via a namespaced Secret; reads Alertmanager v2 + Prometheus.
* Cascade inhibition. NodeConditionBad/NodeDiskPressure now suppress the
downstream pod-churn alerts (PodCrashLooping, PodImagePullBackOff,
PodsStuckContainerCreating, ScrapeTargetDown, *ReplicasMismatch, ...).
The midday DiskPressure event on 4 nodes fired 25 PodCrashLooping + 14
PodImagePullBackOff uninhibited because only NodeDown was a source.
* T3 probe de-duplication. T3ProbeLegDown now inhibits T3ProbeDropBurst
for the same leg — two alerts described one condition and were the #1
noise source (~3,400 alert-minutes over 24h).
* ScrapeTargetDown false positives. Scrape only Ready endpoints, so
completed CronJob pods that linger in EndpointSlices as NotReady
addresses stop firing phantom "down" alerts (tts/tripit/beads). A Ready
pod with a genuinely broken metrics endpoint still fires.
* for: 0m -> 5m on the flappy backup-status flags (LVM/Weekly/Offsite/
NfsMirror/Vzdump *Failing) and DNS spike detectors, so a single
transient Pushgateway/scrape blip no longer fires-and-resolves.
* Added an Alertmanager scrape target: it carried no prometheus.io/scrape
annotation, so notification volume was unmeasurable — now we can verify
this change worked (alertmanager_notifications_total et al.).
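A rough model of the inhibition semantics used above, assuming Alertmanager's
source/target/equal matching. The rule dict and the "leg" label name are
hypothetical, shaped after the T3 probe de-duplication.

```python
def matches(labels, matchers):
    """True when every matcher key/value pair is present in the alert labels."""
    return all(labels.get(key) == value for key, value in matchers.items())


def is_inhibited(target, firing, rule):
    """A firing source mutes a target when both match the rule and the
    `equal` labels carry the same value on both alerts."""
    if not matches(target, rule["target"]):
        return False
    return any(
        matches(source, rule["source"])
        and all(source.get(label) == target.get(label) for label in rule["equal"])
        for source in firing
    )


# Hypothetical rule: T3ProbeLegDown inhibits T3ProbeDropBurst for the same leg.
T3_RULE = {
    "source": {"alertname": "T3ProbeLegDown"},
    "target": {"alertname": "T3ProbeDropBurst"},
    "equal": ["leg"],
}
```

The `equal` list is what keeps a down cloudflare leg from hiding a drop burst on
the internal leg.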
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Closes the loop on Viktor's ask to find the t3 disconnect root cause and
definitively rule infra in or out. Server logs alone cannot separate
'client network broke' from 'Cloudflare/tunnel broke' from 't3-serve
stalled' — every cause collapses into the same 20s-watchdog reconnect.
The t3-probe (stacks/t3code) holds three permanent legs that differ only
in path segment: 'cloudflare' (WS via DoH-resolved public DNS -> WAN ->
CF edge -> tunnel -> Traefik -> dispatch), 'internal' (same WS pinned to
the Traefik LB, no Cloudflare), 't3serve' (HTTP straight to the serve
process). Whichever leg drops convicts its segment; if every leg stays clean
while a user drops, the data exonerates infra. Dispatch gains an
unauthenticated /probe/ws echo + /probe/healthz (gorilla/websocket,
test-first) behind an auth=none path carve-out, guarded by the
authentik-walloff probe.
Also starts scraping devvm's node_exporter (job 'devvm') — it ran
unscraped, so the box whose memory/IO stalls cause the drops had zero
pressure history. Alerts T3ProbeLegDown + T3ProbeDropBurst; runbook
docs/runbooks/t3-drop-attribution.md.
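One possible reading of the attribution logic, as a sketch. The mapping below is
an assumption drawn from the leg descriptions above (the internal leg shares
Traefik -> dispatch with the cloudflare leg, so a drop on both points at the
shared part); it is not taken from the runbook.

```python
def implicated_segments(dropped_legs):
    """Map the set of probe legs that dropped to the segments they implicate."""
    segments = set()
    if "t3serve" in dropped_legs:
        segments.add("t3-serve process")
    if "internal" in dropped_legs:
        # Shared by the internal and cloudflare legs.
        segments.add("Traefik -> dispatch")
    elif "cloudflare" in dropped_legs:
        # Only the part unique to the cloudflare leg is left to blame.
        segments.add("public DNS/WAN/Cloudflare edge/tunnel")
    return segments
```

An empty result while a user still dropped is the "infra exonerated" case.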
vzdump-vms pushes vzdump_last_{run,success}_timestamp + vzdump_last_status to
Pushgateway job vzdump-backup, but nothing alerted on them — a stopped/failing VM
backup would be silent (exactly how the nfs-mirror reaping went unnoticed until I
re-verified). Add the trio to the 3-2-1 group in prometheus_chart_values.tpl,
mirroring the LVM/pfSense/nfs-mirror alerts. Stale = >~50h since last success.
NOT [ci]-applied: this is a Terraform stack change — arms on the next
`scripts/tg apply` of the monitoring stack (metrics already flow, so it arms
immediately once applied). Admin-gated apply per org policy.
[ci skip]
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
6d224861 came from a --no-checkout worktree whose empty index made the
commit drop every file except two. This restores 05b50d2b's full tree and
correctly adds stacks/stem95su/gdrive-sync.tf + the service-catalog stem95su
entry. Forward-only (parent=6d224861, no force-push); [ci skip] since the
live infra was never applied from the broken commit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CronJob stem95su-gdrive-sync (*/10) mounts the content PVC RW and
rclone-syncs the read-only Drive folder "claude" (stem claude/files) onto
it (rclone/rclone:1.74.3, scope=drive.readonly, empty-source guard +
--max-delete 25). ESO ExternalSecret stem95su-rclone <- Vault
secret/stem95su. Requires the GCP OAuth app published to Production or the
refresh token expires ~weekly.
Lands the gdrive-sync stack on master (it had landed on a feature branch
by accident on the shared devvm checkout).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
User-chosen fully-open registration on tuwunel (no CAPTCHA support; browser
challenges break native clients). Bot defense is layered instead:
- Traefik rate-limit Middleware on a path-scoped /register ingress carve-out,
keyed on request Host (GLOBAL /register cap) not source IP — the host is
reachable via both Cloudflare-IPv4 (CF-Connecting-IP) and IPv6-direct (HE
tunnel, no CF header), so a per-source key let IPv6 bots bypass. 10/min,
burst 20, per replica; CrowdSec is the hard backstop on both paths.
- Loki ruler rule MatrixNewUserRegistered -> lane=security -> existing
#security Slack receiver (matches "registered on this server", never the
rejection line). tuwunel's admin bot also posts signups to the admin room.
Dropped the REGISTRATION_TOKEN env (secret/matrix + ESO kept for revert).
Applied via scripts/tg (matrix tier-1 + targeted monitoring configmap), so
[ci skip] to avoid CI full-applying monitoring (unrelated grafana-acl drift).
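To show why the key choice matters, here is a token-bucket sketch keyed on Host.
It assumes token-bucket semantics for the 10/min, burst 20 limit; TokenBucket,
allow_register and the host name are hypothetical, and each replica would hold
its own bucket.

```python
class TokenBucket:
    """Token bucket: refills at `rate` tokens/second, capped at `burst`."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.last = now

    def allow(self, now):
        # Refill for the elapsed time, never beyond the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


buckets = {}


def allow_register(host, now, rate=10 / 60, burst=20):
    """One shared bucket per Host, so every client draws from the same cap."""
    if host not in buckets:
        buckets[host] = TokenBucket(rate, burst, now)
    return buckets[host].allow(now)
```

Because the key is the Host and not the source address, it makes no difference
whether a request arrived over the Cloudflare path or IPv6-direct.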
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Resource changes/deletions are now attributable (the novelapp deletion this week
was untraceable because apiserver audit was off). Low-write policy: drops
reads/noise, Metadata level on mutations, omitStages RequestReceived. Wired into
the kube-apiserver static-pod manifest + kubeadm-config (v1beta4
extraArgs/extraVolumes -> survives kubeadm upgrade) on k8s-master; Alloy tails
/var/log/kubernetes/audit/audit.log -> Loki {job=kubernetes-audit}.
Root cause that had silently blocked this AND OIDC for weeks: a stray
kube-apiserver.yaml.bak inside /etc/kubernetes/manifests/ was a duplicate
static-pod manifest kubelet ran instead of the real one, dropping every flag
added to the real manifest. Removed it. Runbook added.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
calico-typha (~342 lines/hr across 3 pods) still WATCHes the core v1
Endpoints API, so the apiserver returns the 'v1 Endpoints is deprecated
in v1.33+' client-go warning, which typha logs. Per KEP-4974 the v1
Endpoints API will essentially never be removed (clients keep working
indefinitely), and even the latest Calico still watches Endpoints
(projectcalico/calico#11540) so a CNI upgrade would not fix it. Pure
cosmetic noise. Targeted Alloy stage.drop (calico-system ns, exact
deprecation message), mirroring the mailserver drop. Real calico
warnings/errors kept; reversible. Validated with alloy fmt (exit 0).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two mailserver-namespace log-noise cleanups (cluster's #1 Loki error
source, from the 2026-06-06 log triage):
1. TLS warning: docker-mailserver SSL_TYPE=manual writes the authoritative
smtpd_tls_chain_files at boot, so the legacy smtpd_tls_cert_file/key_file
in our postfix-main.cf override were IGNORED and triggered postfix's
'Both smtpd_tls_chain_files and ... legacy ...' warning. Dropped the two
legacy lines (functional no-op; chain_files already wins). Verified via
live postconf.
2. Scanner noise (~9k lines/hr): narrow Alloy stage.drop for the benign
public-SMTP probe patterns (unknown[unknown] SSL_accept resets, postscreen
half-open drops, rate-limit-exceeded from unknown). Real delivery logs +
real-IP SASL failures KEPT; CrowdSec bans these IPs independently, so
security posture is unchanged. Validated with 'alloy fmt' (exit 0).
Reversible.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Synology offsite backup target (/mnt/synology-backup, surfaced via
the PVE host NFS mount) sits at ~94% by design and was firing
NodeFilesystemFull continuously. Per user request, raise the threshold
to 95% (<5% free). NOTE: NodeFilesystemFull is a global node-filesystem
rule, so this also loosens the warning on k8s node/system disks;
BackupDiskFull (sda /mnt/backup) stays at 85%.
Also adds docs/runbooks/synology-storage.md: how to assess Synology
usage WITHOUT du (Storage Analyzer weekly CSVs, df/btrfs/qgroup),
btrfs async/snapshot-pinned reclaim, the 2026-06-05 capacity assessment
(94% full; Backup share 4.42TiB), and ~500GiB of homelab cleanup
candidates (redundant gphotos Takeout, old laptop VM images, archives).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The ha-sofia R730 REST sensors (via prometheus-query.lan) + Grafana iDRAC
panels were bound to the 1m snmp-idrac scrape. Halved to 30s so the
dashboard-it Server view refreshes uniformly at 30s, matching the
fan-control daemon's Pushgateway metrics. SNMP scrape ~3-4s; timeout 15s.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
goflow2 emits ~8 GB/day of per-flow NetFlow JSON to stdout (~64% of all cluster
log volume) but only its Prometheus aggregate metrics are used; vpa is ~1.3
GB/day of Goldilocks/VPA recommender chatter. Both are low-value and were
landing in Loki (PVC on the contended sdc HDD). Drop them at the Alloy relabel.
Reversible (remove the drop rule). Loki ingestion drops ~73%.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ha-sofia's 7 R730 REST sensors (CPU/exhaust/inlet temp, power, 2x PSU voltage,
fan) read the iDRAC via the slow on-demand Redfish exporter (scan_interval 120,
~16-22s/fetch, intermittent `unavailable` blips). Migrated them to a FAST
Prometheus query of the SNMP values (instant, ~1m-fresh from the snmp-idrac
scrape), scan_interval 30.
This adds the enabling ingress: `prometheus-query.viktorbarzin.lan` →
`prometheus-server:80`, auth=none, allow_local_access_only, path-scoped to
`/api/v1/query` (read-only instant-query only — not the UI/admin/federation).
ha-sofia can't use `prometheus.viktorbarzin.me` (Authentik-gated, no OIDC from
a REST sensor), so this mirrors the existing local-only `.lan` exporter
ingresses HA already queries.
The ha-sofia REST file (`/config/rest_resources/idrac_redfish_exporter.yaml`)
was edited in place (auto-version-controlled by the HA version-control add-on;
pre-migration copy at `/config/idrac_redfish_exporter.bak-pre-snmp`). The
Technitium CNAME `prometheus-query.viktorbarzin.lan -> ingress.viktorbarzin.lan`
was added manually via the API — like the other `.lan` exporter hosts it is NOT
auto-synced (the technitium-ingress-dns-sync CronJob only creates `.me`
records). Follow-up (already noted for the Loki sensor): extend that sync to
manage `.lan` CNAMEs too. The Redfish remnant's `sensors` collector is now
vestigial (HA no longer reads it).
Verified: all 7 HA sensors report correct fresh values from Prometheus (fan
10800 rpm, CPU 62.0C, power 280W, PSU 230/240V).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Cluster-wide Loki log observability now that pod logs flow (Alloy fix). New
dashboards/cluster-logs.json (Loki DS P8E80F9AEF21F6940): namespace/app/pod
dropdowns + free-text regex search; stats (lines/errors/warns/active-ns),
log-volume-by-namespace, error/warn rate, top-namespaces-by-errors,
top-pods-by-errors, a filterable live-logs panel, and a second row for the
node + rpi-sofia systemd journals (volume-by-level + error/warn journal panel).
Error/warn use case-insensitive regex line-filters so they work regardless of
level-label availability. New "Logs" Grafana folder.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Cluster pod logs were NOT reaching Loki — only node/Pi journals were. Root cause:
loki.source.file was fed the /var/log/pods/*<uid>/<container>/*.log glob directly
from discovery.relabel, but loki.source.file does NOT expand globs, so it stat()'d
the literal `*` path and shipped zero pod logs ("stat failed: no such file" for
every pod). Per Grafana Alloy docs, a local.file_match component must expand the
glob into concrete file targets first. Add it. Also add stage.cri {} so Loki
stores clean messages + real timestamps instead of raw containerd CRI-prefixed
lines. Fixes cluster-wide log observability (regression vs the working 2026-05-26
state). Ship-all-then-measure per the agreed plan; Alloy mem limits stay as the
IO-storm safeguard.
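The failure mode is easy to reproduce outside Alloy. The Python analogy below
(not Alloy config; the directory layout is hypothetical) stats a glob pattern
literally, as a non-expanding reader would, and then expands it first.

```python
import glob
import os
import tempfile


def ship_literal(path):
    """What a reader that does not expand globs does: stat the path as-is."""
    try:
        os.stat(path)
        return [path]
    except OSError:
        return []  # "stat failed: no such file"


def ship_expanded(pattern):
    """Expand the glob into concrete files first, then hand those on."""
    return sorted(glob.glob(pattern))


def demo():
    with tempfile.TemporaryDirectory() as root:
        log_dir = os.path.join(root, "ns_pod_uid", "app")  # hypothetical layout
        os.makedirs(log_dir)
        open(os.path.join(log_dir, "0.log"), "w").close()
        pattern = os.path.join(root, "*uid", "app", "*.log")
        return len(ship_literal(pattern)), len(ship_expanded(pattern))
```

demo() returns the number of files each approach would tail for the same
pattern.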
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Redfish exporter (mrlhansen, metrics:all:true) walked every BMC subtree on
each scrape — ~18.5s avg / 28s peak against the slow iDRAC — forcing a 3m
interval. Moved the fast path to SNMP via the (previously unmounted) dell_idrac
module: ~3.7s/scrape at 1m.
- snmp_exporter: merge dell_idrac into ups_snmp_values.yaml; hand-add fan-RPM
(coolingDeviceReading + location lookup) and an amperageProbeLocationName
lookup so the "System Board Pwr Consumption" watts probe is label-selectable.
- snmp-idrac job: params module=dell_idrac, auth=public_v2, 1m/30s — now the
primary source for health/thermal/power/fan/voltage (relabeled r730_idrac_*).
- Re-point 9 iDRAC alerts to SNMP metrics + DellStatus enums (OK=3, on=4) and
fix the misnamed iDRACSNMPMetricsMissing/iDRACRedfishMetricsMissing probes.
- Re-point Grafana panels (idrac.json, cluster_health.json) to SNMP names;
temps ÷10 (tenths-degC); DellStatus value-mappings updated.
- Demote the Redfish exporter to a slow remnant: trim collectors to
system/sensors/power/storage/network/memory, scrape 3m->10m. Kept only for
metrics SNMP can't serve (indicator LED, NIC Mbps, machine/BIOS, per-drive
table) AND to keep HA Sofia's sensor.r730_fan_speed working — it reads
idrac_sensors_fan_speed from the exporter directly, so no ha-sofia change.
SSD-wear alerts + SEL panel left as-is (already inert/empty today). Verified
live: snmp-idrac up, scrape 3.7s, all 9 re-pointed alerts resolve without
firing, HA fan metric (idrac_sensors_fan_speed=6) intact. Design/plan +
as-built docs: docs/plans/2026-06-05-idrac-snmp-migration-{design,plan}.md,
docs/architecture/monitoring.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
scripts/tg's check-ingress-auth-comments.py requires the `# auth = "none":`
rationale comment DIRECTLY above the `auth = "none"` line; mine was in the
module's top block comment, so the guard aborted the whole monitoring apply
(this is why the rpi-sofia scrape/alerts/ingress/dashboard never landed on the
first push). Move the rationale to the required position.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Sofia Raspberry Pi hung this morning (network wedged ~10:13, HA
sensors dead, and its local journal had been silent since Apr 27 — a
2017 SD card intermittently flipping the rootfs read-only). Nothing was
captured because logging lived only on the failing card. Ship telemetry
off-box so the next failure is diagnosable centrally:
- Prometheus scrape job `rpi-sofia` (rpi-sofia.viktorbarzin.lan:9100) —
node_exporter + a vcgencmd textfile collector on the Pi exporting
under-voltage/throttle/SoC-temp as rpi_* metrics.
- Alert group "RPi Sofia": node_exporter Down, rootfs ReadOnly (the
exact SD-failure signature), Under-voltage since boot, High SoC temp.
- LAN-gated Loki write ingress (loki.viktorbarzin.lan) so the Pi's
promtail can push its journal — Loki was ClusterIP-only.
- Grafana dashboard "RPi Sofia" (Hardware): status, undervoltage/
throttle, temp, load, memory, disk, network.
The Pi separately got a systemd hardware watchdog (auto-reboot on a hard
hang; today it stayed down ~5h until a manual power-cycle).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
At h=4 the two stacked values per window panel were too small because each
also rendered its field-name label. Switch textMode value_and_name -> value
on 9211-9215 so the numbers get the full cell height; the % suffix / £ prefix
keep them self-identifying and the window stays in the panel title. Applied
via targeted tg apply of the configmap.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The 5 per-window returns widgets (9211-9215) were too tall at h=8. Halve to
h=4 (matching the overview stat cards directly above) and pull every panel
below up by 4 so the layout stays gap-free. Layout-only change — no panel
content/query touched. Applied via targeted tg apply of the configmap.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Swap the single "Returns over time windows" table (panel 9201) for 5 stat
panels (1d/7d/30d/90d/12mo), each showing Return % (Modified-Dietz) as the
headline value + Δ market (£, net of contributions) as a second value,
colored red/green by sign.
Same per-window Modified-Dietz math as the old table, just scoped to one
interval per panel — verified against live wealthfolio_sync PG and reproduced
through Grafana's datasource API (e.g. 30d = 8.15% / £86,875, 12mo = 38.68% /
£297,846, matching the prior table exactly). Kept the same 24×8 grid footprint
so nothing else on the dashboard reflows.
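For reference, a sketch of the standard Modified-Dietz calculation for a single
window, in Python rather than the dashboard's SQL. The function name and
argument layout are hypothetical.

```python
def modified_dietz(start_value, end_value, flows, period_days):
    """Return (return_fraction, market_gain) for one window.

    flows: list of (day_offset, amount); contributions are positive,
    withdrawals negative.
    """
    net_flow = sum(amount for _, amount in flows)
    # Each flow is weighted by the fraction of the period it was invested.
    weighted = sum(
        amount * (period_days - day) / period_days for day, amount in flows
    )
    gain = end_value - start_value - net_flow  # net of contributions
    return gain / (start_value + weighted), gain
```

The second element corresponds to the "Δ market" figure: the change in value
with contributions taken out.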
Already applied via targeted `tg apply` of the wealth.json configmap; [ci skip]
because a full monitoring-stack CI apply would pull in unrelated pre-existing
drift.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
monitoring-quota requests.memory sat at 89% (17.8/20Gi), tripping the
ResourceQuota>80% WARN. Root cause was over-provisioned requests, not real
usage: loki requested 3Gi but its VPA upperBound is 364Mi and actual ~315Mi.
prometheus's 4Gi is legitimately required (2Gi tmpfs WAL shares the cgroup;
OOMs at 3Gi during WAL replay) so it stays; grafana's main container is
already 512Mi. Trimmed loki to 1Gi request (~3x its observed ceiling; 4Gi
Burstable limit preserves query-spike headroom) -> quota 78.8%, clears the
WARN. NOTE: alloy DaemonSet (562Mi/node) grows with node count, so revisit
(bump the 20Gi quota) as the cluster expands.
The prometheus-backup sidecar runs monthly on the 1st SUNDAY 04:00 UTC.
Consecutive first Sundays can be 35 days apart (e.g. May 3 -> Jun 7), but the
alert threshold was 32d (2764800s), so it false-fired for the ~3 days between
day 32 and the next run whenever the gap hit 35 days.
Raised to 40d (3456000s): clears
the max first-Sunday spacing with margin, still catches a genuinely missed
monthly backup. Backup itself is healthy (last May 3, next Jun 7). Verified:
live rule now > 3.456e6, alert state inactive.
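The spacing claim can be checked with a few lines of Python. The year 2026 is an
assumption here, chosen because May 3 and Jun 7 are both Sundays in it.

```python
from datetime import date

OLD_THRESHOLD = 2764800  # 32d in seconds
NEW_THRESHOLD = 3456000  # 40d in seconds


def first_sunday(year, month):
    first = date(year, month, 1)
    # date.weekday(): Monday=0 ... Sunday=6
    return first.replace(day=1 + (6 - first.weekday()) % 7)


def max_gap_seconds(year):
    """Longest gap between consecutive first Sundays starting in `year`."""
    runs = [first_sunday(year, m) for m in range(1, 13)]
    runs.append(first_sunday(year + 1, 1))
    return max((b - a).days for a, b in zip(runs, runs[1:])) * 86400
```

The longest gap sits above the old threshold and below the new one, which is the
whole fix.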
Context (smart) search latency was caused by the 665MB vchord clip_index
decaying out of PG shared_buffers (~33% resident -> ~1.8s cold ANN reads vs
~4ms warm), NOT by yesterday's ML MODEL_TTL/clip-keepalive change (CLIP textual
is warm ~15ms on GPU). The postStart prewarm runs once at pod start and
pg_prewarm.autoprewarm only re-warms at startup, so the index decays under job
buffer-pressure over days.
- clip-index-prewarm CronJob (immich, */5): pg_prewarm('clip_index') keeps the
whole index resident -> searches stay ~4ms.
- immich-search-probe CronJob (immich, */5): times a random-vector ANN query +
reads clip_index residency, pushes gauges to the Pushgateway.
- Prometheus alerts ImmichSmartSearchSlow / ImmichClipIndexColdCache /
ImmichSearchProbeStale (+ inhibition when the probe is stale).
- cluster_healthcheck.sh check #46 check_immich_search (TOTAL_CHECKS 45->46).
- Docs: infra CLAUDE.md immich note, monitoring.md, cluster-health skill.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Part of the L4 LB-IP review (docs/plans/2026-06-03-lb-ip-hygiene-design.md).
The 2026-05-30 Traefik .200->.203 move left consumers pointing at the dead
.200; this fixes the two in-Terraform ones and replaces the stale networking
doc with an accurate registry + a renumber checklist.
- woodpecker: forgejo.viktorbarzin.me hostAlias hardcoded 10.0.20.200
(.200:443 refuses TLS now; the next woodpecker apply would re-pin it and
break pipeline creation). Now reads the Traefik ClusterIP dynamically via a
kubernetes_service data source -- cannot rot on a future renumber and avoids
the ETP=Local hairpin trap.
- monitoring: ViktorBarzinApexDrift alert summary said "expected 10.0.20.200"
-> 10.0.20.203 (cosmetic; alert logic already correct).
- docs/architecture/networking.md: rewrote the MetalLB section (it wrongly had
KMS on .200, mailserver on a LB IP, "two dedicated") into an accurate 4-IP
registry + LB-IP renumber checklist (in-band + out-of-band consumers).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reconciles the tripit stack source with live state and adds the forward
flow. Ingest now polls vbarzin@gmail.com [Gmail]/All Mail read-only over a
rolling 12-month X-GM-RAW travel-sender window (Croatia Jet2 refs excluded),
filing trips under MAIL_DEFAULT_OWNER_EMAIL=vbarzin@gmail.com (Viktor's
Authentik login identity). Adds an ingest-plans CronJob that polls spam@
filtered to To:plans@viktorbarzin.me (the @viktorbarzin.me catch-all target)
so forwarded bookings are extracted and attached to the matching trip;
IMAP_PASSWORD is overridden per-job to spam@'s creds (PLANS_IMAP_PASSWORD).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The role panels (Top roles, Top companies by role volume, New roles/day,
Roles by source, Salary distribution) had no location filter, so they showed
all locations regardless of the $location dropdown. Add
'primary_location IN (${location:sqlstring})' to each (matching the comp
panels' pattern). Also switch the 'Your comp vs the market' panel from
hardcoded 'london' to the same $location filter for consistency. Data was
fine (all london-tagged roles genuinely contain 'london').
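As an illustration of what the filter expands to, the sketch below mimics
Grafana's sqlstring variable format in Python (each value single-quoted,
embedded quotes doubled, comma-joined). The helper names are hypothetical.

```python
def sqlstring(values):
    """Quote each selected value for SQL and join with commas."""
    return ",".join("'" + value.replace("'", "''") + "'" for value in values)


def location_filter(selected):
    """Build the predicate added to each role panel."""
    return f"primary_location IN ({sqlstring(selected)})"
```

With a single selection this yields the same predicate the hardcoded 'london'
panels used; with several, the IN list grows accordingly.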
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a barchart (panel 10) ranking every company's London p50 total comp
(COALESCE total/base) with the user's current comp shown inline, so it's a
direct "how do I compare" view. The user's figure is NOT hardcoded in the
dashboard JSON — it's a labeled comp_point in the DB (company_slug
'self-current', source 'self', "Me (Meta IC5)"), keeping the sensitive number
out of git. It's below the £500k alert bar (no Slack ping) and ranks too low
to appear in analyze leaders. Runbook documents the panel + how to update the
baseline.
[ci skip] — dashboard ConfigMap applied locally (targeted).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Completes the 36->17 consolidation:
- 3 net-pay panels -> 1 "Net pay vs market gain (${grain})" with a cumulative/
yearly/monthly dropdown (Mixed datasource: payslips-pg + wealth-pg).
- Projection rebuilt as a Trend panel (numeric "Years from today" x-axis) so it
renders regardless of the dashboard time range — fixes empty-by-default. Drops
the duplicate projection-row stat cards + the how-to-view text panel.
- Full reorg into 7 collapsed rows: Overview / Net worth over time / Returns &
contributions / Income vs market / Holdings / RSUs (META) / Projections.
All wealth-pg SQL validated live; net_pay target reuses the existing payslips-pg
source. Visual review pending.
[ci skip]
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implements the committed projections design
(docs/plans/2026-05-28-wealth-projections-{design,plan}.md): a collapsed
"Projections" row on the wealth dashboard with 5 template vars (rate_low/base/high,
monthly_contribution=auto, horizon_years=30), a multi-scenario projection panel
(Low/Base/High + trailing-3y historical line + a base-rate compounding-only
line), 3 stat cards, and a text panel with one-click future time-range links.
Projection is pure SQL over dav_corrected: compound + ordinary-annuity FV from
today's net worth; auto contribution = trailing-12mo run-rate (COALESCE/NULLIF
so $monthly_contribution=auto doesn't constant-fold 'auto'::numeric). Historical
rate = trailing-3-full-year geometric mean of per-year Modified-Dietz returns
(~10.4%) — all-time was a nonsense 83% because the all-accounts-complete window
is only ~4 months, and the true all-time geomean is skewed by 2021's +86%.
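A minimal sketch of the projection math described above, in Python rather than the panel's SQL. The function names are illustrative, and monthly compounding at annual_rate / 12 is an assumption; the dashboard query may use a different compounding convention.

```python
from math import prod


def future_value(net_worth, annual_rate, monthly_contribution, years):
    """Compound today's net worth and add an ordinary annuity of contributions.

    Assumption: monthly compounding at annual_rate / 12, with each
    contribution paid at the end of its month (ordinary annuity).
    """
    n = years * 12
    r = annual_rate / 12
    growth = (1 + r) ** n
    lump = net_worth * growth
    if r == 0:
        annuity = monthly_contribution * n
    else:
        annuity = monthly_contribution * (growth - 1) / r
    return lump + annuity


def geometric_mean_return(yearly_returns):
    """Geometric mean of per-year returns, e.g. [0.08, 0.12, 0.11]."""
    factors = [1 + x for x in yearly_returns]
    return prod(factors) ** (1 / len(factors)) - 1
```

Setting monthly_contribution to 0 gives the compounding-only line; the Low/Base/High scenarios differ only in annual_rate.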
Also aligns "Net pay vs market gain — per month" to consecutive month-end
deltas (same fix as the other monthly panels). Verified all SQL live.
[ci skip]
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
goflow2, snmp-exporter, pve-exporter, idrac-redfish and the sysctl-inotify
daemonset were missed by the cdb7d9a8 KEEL_LIFECYCLE sweep. The monitoring ns
is keel-enrolled (policy=patch) so Keel owns their image tags + injects keel.sh
annotations; TF kept trying to revert both, plus a live-stamped tier label —
which made `terragrunt plan -detailed-exitcode` return 2 every run and the
drift-detection cron fail daily. Add the standard KEEL ignore_changes (image +
keel.sh annotations) and ignore the tier label so these stop churning.
Declarative-only: takes effect at next plan, no apply needed. [ci skip] so this
does not trigger a monitoring apply. Remaining (separate) drift: the grafana
ACL null_resource (triggers.always) + tls cert refresh.
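A small sketch of why the cron failed daily: with -detailed-exitcode, plan exits 2 whenever a diff exists, so a wrapper that treats any non-zero exit as failure reports drift as an error. The helper names are hypothetical, and the "non-zero means fail" behaviour of the cron is an assumption.

```python
def classify_plan_exit(code):
    """Meaning of `terraform plan -detailed-exitcode` exit codes."""
    meanings = {
        0: "succeeded, no changes",
        1: "error",
        2: "succeeded, changes present",
    }
    return meanings.get(code, "unknown")


def cron_outcome(code):
    # Assumption: the drift-detection cron fails on any non-zero exit.
    return "pass" if code == 0 else "fail"
```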
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The redis-v2 Sentinel cluster split-brained: redis-v2-0 booted during a network
partition, hit the init script's deterministic "pod-0 = bootstrap master"
fallback, and became a SECOND master alongside the sentinel-elected redis-v2-2.
HAProxy's `expect rstring role:master` matched both and round-robined client
connections across the two diverging masters, so Immich enqueued BullMQ jobs on
one while its workers blocked on pops from the other -> every queue wedged and
new-upload thumbnails 404'd cluster-wide. Third Sentinel-class incident in ~6
weeks (after the 2026-04-19 PM quorum drift and 2026-04-22 flap cascade).
Revert to a single standalone instance: replicas=1; drop Sentinel + HAProxy +
init bootstrap configmap + both PDBs; redis container only (+ exporter).
maxmemory-policy allkeys-lru -> volatile-lru so one shared instance serves both
workload classes correctly: evict only TTL'd cache keys, never TTL-less Immich
BullMQ / Celery job keys. redis-master service name/DNS unchanged -> no consumer
edits; collapsed onto redis-v2-0's existing dataset (queued jobs preserved).
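To illustrate the policy change, a toy model of which keys each maxmemory-policy may evict. The key names and TTLs are made up for the example; real Redis additionally picks the least recently used key among the eligible ones via sampling.

```python
def eviction_candidates(keys, policy):
    """Return the key names a policy is allowed to evict.

    keys maps key name -> TTL in seconds, or None for a key with no expiry.
    """
    if policy == "allkeys-lru":
        return sorted(keys)
    if policy == "volatile-lru":
        return sorted(name for name, ttl in keys.items() if ttl is not None)
    raise ValueError(f"unsupported policy: {policy}")


# Hypothetical key names, for illustration only.
keys = {
    "cache:page:home": 300,       # TTL'd cache entry
    "bull:thumbnails:wait": None, # TTL-less job queue key
    "celery": None,               # TTL-less job queue key
}
print(eviction_candidates(keys, "volatile-lru"))
```

Under volatile-lru, if no key carries a TTL there is nothing to evict and writes start failing once maxmemory is reached, which is why a memory-pressure alert still matters as a backstop.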
Applied via tg (Tier 1 / PG-authoritative state); this commit syncs source +
docs only, hence [ci skip].
Monitoring: drop RedisReplicationLagHigh + RedisReplicasMissing (no replicas
now; the latter would false-fire); lower the RedisMemoryPressure threshold
85%->80% as a backstop for volatile-lru.
Docs: rewrite databases.md Redis section (single-instance design + incident
history); add post-mortem 2026-05-30-redis-split-brain.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two correctness fixes to the wealth dashboard, found while validating
contribution data against actual-viktor (source of truth):
1. dav_corrected (Fix 1): LOCF gap-fill scoped to the Fidelity pension.
A PlanViewer scrape gap left total_value=0 for 13 days from 2026-02-16,
which cratered net worth and produced a phantom -£97,457 "contribution"
in Feb then +£100,458 in Mar. Carry the last non-zero day forward across
the gap (a £0 pension valuation is always a scrape gap, never real).
2. wealth.json (Fix 3): "Monthly contributions vs market gain" and "Annual
change decomposition" now use consecutive period-end deltas instead of
within-period first-to-last-obs, so contributions landing near a period
boundary are no longer dropped/mis-attributed.
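Both fixes can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the dashboard's SQL: the series is a date-sorted list of (date, value) pairs, the function names are invented, and only the total change in value is shown (no split into contributions vs market gain).

```python
from datetime import date


def locf_fill(values):
    """Carry the last non-zero valuation forward across zero readings."""
    filled, last = [], None
    for v in values:
        if v == 0 and last is not None:
            filled.append(last)
        else:
            filled.append(v)
            if v != 0:
                last = v
    return filled


def within_period_change(series):
    """Old method: last observation minus first observation inside each month."""
    first, last = {}, {}
    for d, v in series:
        key = (d.year, d.month)
        first.setdefault(key, v)
        last[key] = v
    return {k: last[k] - first[k] for k in last}


def period_end_deltas(series):
    """New method: each month-end value minus the previous month-end value."""
    ends = {}
    for d, v in series:
        ends[(d.year, d.month)] = v  # sorted input, so the last write wins
    keys = sorted(ends)
    return {k: ends[k] - ends[p] for p, k in zip(keys, keys[1:])}


series = [
    (date(2026, 1, 30), 100), (date(2026, 1, 31), 100),
    (date(2026, 2, 1), 150), (date(2026, 2, 28), 150),
]
print(within_period_change(series))  # the jump between Jan 31 and Feb 1 is lost
print(period_end_deltas(series))     # the jump is attributed to February
```

In the sample series a 50-unit jump lands between 31 Jan and 1 Feb: the within-period method reports 0 for both months, while consecutive period-end deltas attribute the 50 to February.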
Verified live: Feb-2026 monthly contribution now +£34,000 (real Trading212
RSU-proceeds investment, reconciles with actual-viktor), no spurious
negatives. Brokerage contributions unchanged (already correct).
Applied via scripts/tg (wealthfolio + targeted monitoring ConfigMap).
[ci skip]
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three new panels comparing employment income to investment returns over
time, via Grafana's -- Mixed -- datasource (salary lives in payslip_ingest,
portfolio in wealthfolio_sync — separate DBs, so per-target datasources):
- cumulative net take-home pay vs cumulative market gain (line race)
- net pay vs market gain per year (grouped bars)
- net pay vs market gain per month (grouped bars)
Inserted after the "Growth over time" panel; existing panels shifted down,
full-width tables remain at the bottom.
The catchall-error-pages IngressRoute matches
HostRegexp(^(.+\.)?viktorbarzin\.me$) at priority=1; it's the wildcard
handler that returns 404 for any unmatched hostname (typos + scanner traffic).
By design its 4xx rate sits at ~100%, so HighService4xxRate was a
permanent false positive for traefik-catchall-error-pages-*@kubernetescrd.
Same exclusion pattern as nextcloud/grafana/linkwarden/claude-memory
(services with legitimately high 4xx counts).
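A quick check of what the catchall pattern accepts, using Python's `re` as a stand-in for the regexp engine Traefik uses (for this pattern the two should behave the same). The sample hostnames are invented.

```python
import re

# Same pattern as the IngressRoute's HostRegexp matcher.
CATCHALL = re.compile(r"^(.+\.)?viktorbarzin\.me$")


def matches_catchall(host):
    """True if the catchall router would match this hostname."""
    return CATCHALL.match(host) is not None


for host in ("viktorbarzin.me", "typo.viktorbarzin.me", "viktorbarzin.me.example.com"):
    print(host, matches_catchall(host))
```

The apex and every subdomain match, so any hostname that no more specific router claims ends up here and gets the 404 page.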
PVE API endpoint regularly takes ~11s with ~1035 thin LVs on the host
(1002 k8s-csi PVCs + 22 VMs + 11 system), blowing past Prometheus's
default 10s scrape_timeout and flapping ProxmoxMetricsMissing +
ScrapeTargetDown. Switch the Service annotation from prometheus.io/scrape
to prometheus.io/scrape_slow so the scrape moves to the existing
kubernetes-service-endpoints-slow job (5m interval, 30s timeout).
Keel was rewriting tag strings (not just digests) despite the
keel.sh/match-tag=true annotation injected by the Kyverno
inject-keel-annotations ClusterPolicy. That annotation was supposed to
constrain Keel to digest-only watches under the deployment's CURRENT tag.
It didn't. Casualties confirmed today (live image rewritten to a lower
version): uptime-kuma (:2 → :1, 4h CrashLoopBackOff because v1 boots into
SQLite mode and can't read the v2 db-config.json → MariaDB store);
n8n (:1.80.5 → :0.1.2, silent — EEXIST mkdir /root/.n8n loop);
beads-server/dolt-workbench (:0.3.73 → :0.1.0, GraphQL schema mismatch on
addDatabaseConnection); wealthfolio (:3.2.1 → :2.0 → :3.2 string truncate);
plus historical ones previously fixed (claude-memory :71b32438 → :17,
forgejo 11.0.14 → 1.18, onlyoffice 9.3.1 → 4.0.0.9, shlink 5.0.2 → 1.16.1).
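Every numeric rewrite listed above is a downgrade or a truncation. A hypothetical guard that would flag them is sketched below; it is not part of this commit or of Keel, and it deliberately declines to judge non-numeric tags such as commit hashes.

```python
import re

NUMERIC_TAG = re.compile(r"[0-9]+(\.[0-9]+)*")


def parse_tag(tag):
    """Parse a dotted numeric tag like ':1.80.5' into a tuple, else None."""
    tag = tag.lstrip(":")
    if not NUMERIC_TAG.fullmatch(tag):
        return None
    return tuple(int(part) for part in tag.split("."))


def is_downgrade(current, proposed):
    """True if proposed sorts below current; None if either tag is non-numeric."""
    old, new = parse_tag(current), parse_tag(proposed)
    if old is None or new is None:
        return None
    width = max(len(old), len(new))
    old += (0,) * (width - len(old))  # pad so 3.2 compares as 3.2.0
    new += (0,) * (width - len(new))
    return new < old


print(is_downgrade(":1.80.5", ":0.1.2"))
```

Padding with zeros means a truncation such as 3.2.1 -> 3.2 also counts as a downgrade, matching how the wealthfolio case is treated above.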
Changes:
* stacks/keel: replicaCount = 0 in the helm values. Pod went from 1/1 to
0/0. Keep off until either match-tag is root-caused or every enrolled
workload migrates to a content-addressed (SHA) pin.
* stacks/uptime-kuma: pin image to louislam/uptime-kuma:2.3.2 (was :2,
bumped to :1 by Keel). Full opt-out: keel.sh/policy=never on BOTH the
deployment label (matches Kyverno's exclude rule so the inject-keel-
annotations ClusterPolicy stops mutating) AND the annotation (so Keel
itself respects it). Removed keel.sh/policy from lifecycle.ignore_changes
so TF owns it as `never` and can't drift back to `force`.
* stacks/beads-server: pin dolt-workbench to dolthub/dolt-workbench:0.3.73
on both seed-config and workbench containers (was :latest, Keel rolled
to :0.1.0).
* stacks/wealthfolio: pin to afadil/wealthfolio:3.2.1 (was :3.2 truncated
by Keel from the prior live :3.2.1).
* stacks/monitoring: monitoring-quota requests.memory 16Gi → 20Gi. Cluster
grew from 5 to 7 workers (k8s-node5/6 added 2026-05-26) and alloy's
per-pod request jumped 50Mi → 562Mi earlier today; combined with new-node
DS pods (loki-canary, node-exporter, sysctl-inotify) the quota tipped to
100% and blocked every new pod create with FailedCreate. Raising the cap
unblocked the four affected DaemonSets in one shot.
* stacks/immich: tier-quota requests.memory 20Gi → 24Gi, limits.memory
32Gi → 40Gi. Was at 88% with VPA still creeping up on immich-server's
face-detection burst behaviour.
* stacks/{excalidraw,immich,n8n}: providers.tf + .terraform.lock.hcl
updated by `tg init -upgrade` to record telmate/proxmox 3.0.2-rc07
(matches the 21 other stacks that already declare it).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>