actualbudget+monitoring: per-account bank-sync metrics, drop noisy alert

The bank-sync CronJob was posting to /accounts/banksync which fans out to
ALL accounts in a single call. With PSD2/GoCardless's 4-successful-pulls
per-account per-24h quota, a single rate-limited account would 500 the
whole call, and `bank_sync_success` would flip to 0 even though the data
itself was still flowing through manual UI syncs. Result: BankSyncFailing
fired routinely whenever the user had been active in the UI that day —
a structural false positive.

Fix:
  * CronJob: enumerate accounts via GET /accounts, POST per-account
    /accounts/{id}/banksync, emit bank_sync_account_success and
    bank_sync_account_last_success_timestamp labelled by account name.
    Roll up bank_sync_success = 1 iff any account succeeded.
  * Alerts: drop BankSyncFailing (noise generator). Keep BankSyncStale
    at 48h (global drought). Add BankSyncAccountStale at 72h (catches
    single-account auth expiry — the real signal we wanted).
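
The per-account flow and roll-up above can be sketched roughly as follows.
This is an illustration only, not the CronJob's actual script: the function
names, the `sync_one` callback (standing in for the POST to
/accounts/{id}/banksync) and the choice to emit
bank_sync_last_success_timestamp whenever the roll-up is 1 are assumptions.
How the lines reach the Pushgateway, and how older timestamps are kept for
accounts that failed, is deliberately left out.

```python
from typing import Callable, Dict, List, Tuple


def sync_accounts(
    accounts: List[Tuple[str, str]],
    sync_one: Callable[[str], bool],
) -> Tuple[int, Dict[str, int]]:
    """Sync every (id, name) account on its own; one failure never aborts the rest."""
    results: Dict[str, int] = {}
    for account_id, name in accounts:
        try:
            ok = sync_one(account_id)  # hypothetical stand-in for the per-account POST
        except Exception:
            ok = False  # e.g. a rate-limited account answering with an HTTP error
        results[name] = 1 if ok else 0
    # Roll-up: 1 iff at least one account succeeded.
    rollup = 1 if any(results.values()) else 0
    return rollup, results


def escape_label(value: str) -> str:
    """Escape a label value for the Prometheus text exposition format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(rollup: int, results: Dict[str, int], now: int) -> str:
    """Render per-account and roll-up series as exposition-format lines."""
    lines: List[str] = []
    for name, ok in sorted(results.items()):
        label = escape_label(name)
        lines.append(f'bank_sync_account_success{{account="{label}"}} {ok}')
        if ok:
            # Only successful accounts get a fresh timestamp.
            lines.append(
                f'bank_sync_account_last_success_timestamp{{account="{label}"}} {now}'
            )
    lines.append(f"bank_sync_success {rollup}")
    if rollup:
        # Assumption: the global timestamp advances when any account succeeded.
        lines.append(f"bank_sync_last_success_timestamp {now}")
    return "\n".join(lines) + "\n"
```

With this shape a rate-limited account only zeroes its own
bank_sync_account_success series; the roll-up stays at 1 as long as any other
account pulled successfully.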

Verified: manual run on bank-sync-viktor pushes 6 per-account success +
timestamp series; roll-up bank_sync_success=1; no firing alerts.
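
For reference, the two staleness alerts reduce to simple threshold
comparisons on "seconds since last success". The sketch below mirrors that
arithmetic (48h = 172800s, 72h = 259200s); the function name and the returned
strings are illustrative, and the `for: 1h` hold period of the real rules is
ignored.

```python
from typing import Dict, List

GLOBAL_STALE_SECONDS = 48 * 3600   # 172800, the BankSyncStale threshold
ACCOUNT_STALE_SECONDS = 72 * 3600  # 259200, the BankSyncAccountStale threshold


def stale_alerts(
    now: int,
    global_last_success: int,
    account_last_success: Dict[str, int],
) -> List[str]:
    """Return the alerts whose threshold expression would be true at `now`."""
    alerts: List[str] = []
    # Global drought: no account at all has synced for more than 48h.
    if now - global_last_success > GLOBAL_STALE_SECONDS:
        alerts.append("BankSyncStale")
    # Single-account drought: one account has not synced for more than 72h.
    for account, last_success in sorted(account_last_success.items()):
        if now - last_success > ACCOUNT_STALE_SECONDS:
            alerts.append(f"BankSyncAccountStale:{account}")
    return alerts
```

An account whose auth expired keeps an old per-account timestamp while the
global timestamp stays fresh, so only the per-account check trips.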
Viktor Barzin 2026-05-11 18:55:15 +00:00
parent 7b6eee49c4
commit 665b6b2934
2 changed files with 92 additions and 43 deletions


@@ -2152,20 +2152,24 @@ serverFiles:
  severity: warning
  annotations:
  summary: "Mail server has no available replicas - mail may not be received"
- - alert: BankSyncFailing
- expr: bank_sync_success == 0
- for: 6h
- labels:
- severity: warning
- annotations:
- summary: "Bank sync failing. Accounts may need GoCardless reauthorization. Check Pushgateway for which instance."
+ # Note: no BankSyncFailing alert — GoCardless enforces per-account
+ # PSD2 quotas (4 successful pulls per account per 24h). Manual UI
+ # syncs consume the same quota, so the nightly cron routinely hits
+ # rate-limits without any real outage. Alert only on staleness.
  - alert: BankSyncStale
  expr: (time() - bank_sync_last_success_timestamp) > 172800
  for: 1h
  labels:
  severity: warning
  annotations:
- summary: "Bank sync has not succeeded in more than 48h. Check CronJob and account auth."
+ summary: "Bank sync (instance {{ $labels.instance }}): NO account has synced in over 48h. Likely a real outage — check CronJob, http-api logs, and GoCardless re-auth."
+ - alert: BankSyncAccountStale
+ expr: (time() - bank_sync_account_last_success_timestamp) > 259200
+ for: 1h
+ labels:
+ severity: warning
+ annotations:
+ summary: "Bank sync (instance {{ $labels.instance }}): account {{ $labels.account }} has not synced in over 72h. GoCardless requisition may have expired — re-link in Settings → Bank Sync."
  - alert: EmailRoundtripFailing
  expr: email_roundtrip_success{job="email-roundtrip-monitor"} == 0
  for: 60m