aiostreams: 1h stream cache + canary stream-count probe + 3 alerts

Hardening pass following the empty-stream-list incident:

1. STREAM_CACHE_TTL=3600: re-enables the stream payload cache (was -1 /
   disabled). The previous (default) behaviour hit all 5 upstream
   addons on every Stremio request; with a 1h TTL, repeat requests for
   the same title are served from cache, while RD cache invalidations
   still propagate within the hour.

2. aiostreams-stream-probe CronJob (every 5 min): fetches the user's
   encryptedPassword via the internal ClusterIP, runs a canary stream
   search for Breaking Bad S01E01, pushes streams_count + probe_success
   to Pushgateway. Uses an ExternalSecret pulling UUID + password from
   Vault secret/viktor. Same pattern as email-roundtrip-monitor.

3. Three alerts in monitoring's prometheus_chart_values.tpl:
   - AIOStreamsStreamCountLow  (< 50 streams for 30m)
   - AIOStreamsProbeFailing    (probe_success == 0 for 30m)
   - AIOStreamsProbeStale      (last run > 30 min ago for 10m)
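
A minimal sketch of how the probe's push step could be written. It is
not the code in this commit. The metric names and the job label come
from the alert expressions; the success threshold of 50 and the
Pushgateway base URL are assumptions made for illustration.

```python
import time
import urllib.request

# Assumption: the probe uses the same threshold as AIOStreamsStreamCountLow.
STREAM_THRESHOLD = 50


def build_payload(stream_count, fetch_ok, now=None):
    """Render the three probe metrics in Prometheus text exposition format."""
    now = time.time() if now is None else now
    # Success requires both a working fetch and a healthy stream count.
    success = 1 if fetch_ok and stream_count >= STREAM_THRESHOLD else 0
    lines = [
        "# TYPE aiostreams_stream_count gauge",
        f"aiostreams_stream_count {stream_count}",
        "# TYPE aiostreams_probe_success gauge",
        f"aiostreams_probe_success {success}",
        "# TYPE aiostreams_probe_last_run_timestamp gauge",
        f"aiostreams_probe_last_run_timestamp {int(now)}",
    ]
    # Pushgateway expects the body to end with a newline.
    return "\n".join(lines) + "\n"


def push(payload, base_url="http://pushgateway.example:9091",
         job="aiostreams-stream-probe"):
    """POST the payload to Pushgateway; base_url is a hypothetical address."""
    req = urllib.request.Request(
        f"{base_url}/metrics/job/{job}",
        data=payload.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

Always pushing the timestamp, even when the search fails, would keep
AIOStreamsProbeStale reserved for the case where the job never reaches
the push at all.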

Verified: probe returned streams=411 success=1 on first run; all 3
alerts loaded into Prometheus with state=inactive health=ok.
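
One way the "alerts loaded" check could be scripted, assuming the
response body of Prometheus's /api/v1/rules endpoint is already parsed
into a dict. The sample payload below is illustrative, not captured
output, and the helper names are hypothetical.

```python
EXPECTED = {
    "AIOStreamsStreamCountLow",
    "AIOStreamsProbeFailing",
    "AIOStreamsProbeStale",
}


def find_alerts(api_response):
    """Map each expected alert name to its (state, health) pair."""
    found = {}
    for group in api_response.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("type") == "alerting" and rule.get("name") in EXPECTED:
                found[rule["name"]] = (rule.get("state"), rule.get("health"))
    return found


def all_loaded_ok(api_response):
    """True when all three alerts exist with state=inactive and health=ok."""
    found = find_alerts(api_response)
    return set(found) == EXPECTED and all(
        pair == ("inactive", "ok") for pair in found.values()
    )


# Illustrative response shape; the group name is a placeholder.
SAMPLE = {
    "status": "success",
    "data": {
        "groups": [
            {
                "name": "example-group",
                "rules": [
                    {"type": "alerting", "name": name,
                     "state": "inactive", "health": "ok"}
                    for name in sorted(EXPECTED)
                ],
            }
        ]
    },
}

if __name__ == "__main__":
    print(all_loaded_ok(SAMPLE))
```
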
Viktor Barzin 2026-05-15 21:38:50 +00:00
parent e037b160d0
commit b6e334daab
2 changed files with 153 additions and 0 deletions

@@ -2298,6 +2298,30 @@ serverFiles:
     severity: warning
   annotations:
     summary: "Email round-trip monitor never reported - check CronJob in mailserver namespace"
+- alert: AIOStreamsStreamCountLow
+  expr: aiostreams_stream_count{job="aiostreams-stream-probe"} < 50
+  for: 30m
+  labels:
+    severity: warning
+  annotations:
+    summary: "AIOStreams returning <50 streams for the canary title for 30m"
+    description: "Probe for Breaking Bad S01E01 returned {{ $value }} streams. Could indicate an upstream addon outage, RD filter expansion, or a regression in the user's preset filters. Check `kubectl -n aiostreams get cronjob aiostreams-stream-probe` and the most recent job pod logs."
+- alert: AIOStreamsProbeFailing
+  expr: aiostreams_probe_success{job="aiostreams-stream-probe"} == 0
+  for: 30m
+  labels:
+    severity: warning
+  annotations:
+    summary: "AIOStreams stream-probe failing for 30m"
+    description: "The /api/v1/user fetch or stream search is returning errors, or stream count is below threshold. Check probe logs."
+- alert: AIOStreamsProbeStale
+  expr: (time() - aiostreams_probe_last_run_timestamp{job="aiostreams-stream-probe"}) > 1800
+  for: 10m
+  labels:
+    severity: warning
+  annotations:
+    summary: "AIOStreams stream-probe hasn't run in >30 min"
+    description: "CronJob may be unschedulable or failing before pushgateway POST."
 - alert: ClaudeOAuthTokenExpiringSoon
   expr: (claude_oauth_token_expiry_timestamp{job="claude-oauth-expiry-monitor"} - time()) < (30 * 86400)
   for: 1h