aiostreams: 1h stream cache + canary stream-count probe + 3 alerts
Hardening pass following the empty-stream-list incident:

1. STREAM_CACHE_TTL=3600: re-enables the stream payload cache (was -1 /
   disabled). Default behaviour hit all 5 upstream addons on every
   Stremio request; with a 1h TTL, repeat requests for the same title are
   instant, while RD cache invalidations still propagate quickly.

2. aiostreams-stream-probe CronJob (every 5 min): fetches the user's
   encryptedPassword via the internal ClusterIP, runs a canary stream
   search for Breaking Bad S01E01, and pushes streams_count +
   probe_success to Pushgateway. Uses an ExternalSecret pulling UUID +
   password from Vault secret/viktor. Same pattern as
   email-roundtrip-monitor.

3. Three alerts in monitoring's prometheus_chart_values.tpl:
   - AIOStreamsStreamCountLow (< 50 streams for 30m)
   - AIOStreamsProbeFailing (probe_success == 0 for 30m)
   - AIOStreamsProbeStale (last_run_timestamp > 30min for 10m)

Verified: probe returned streams=411 success=1 on first run; all 3 alerts
loaded into Prometheus with state=inactive health=ok.
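The push step in item 2 could look roughly like the sketch below. The probe script itself is not part of this diff, so this is an illustration under assumptions: the metric names and job label are taken from the alert expressions, while `PUSHGATEWAY_URL`, `build_payload` and `push` are hypothetical names.

```python
import urllib.request

JOB = "aiostreams-stream-probe"  # job label used by the alert expressions
# Hypothetical in-cluster Pushgateway address; not taken from the commit.
PUSHGATEWAY_URL = "http://pushgateway.monitoring.svc:9091"


def build_payload(stream_count: int, success: bool, now: float) -> bytes:
    """Render the three probe metrics in Prometheus text exposition format."""
    lines = [
        "# TYPE aiostreams_stream_count gauge",
        f"aiostreams_stream_count {stream_count}",
        "# TYPE aiostreams_probe_success gauge",
        f"aiostreams_probe_success {1 if success else 0}",
        "# TYPE aiostreams_probe_last_run_timestamp gauge",
        f"aiostreams_probe_last_run_timestamp {int(now)}",
    ]
    # The exposition format expects the body to end with a newline.
    return ("\n".join(lines) + "\n").encode()


def push(payload: bytes, base_url: str = PUSHGATEWAY_URL) -> int:
    """POST the payload to the job's group and return the HTTP status."""
    req = urllib.request.Request(
        f"{base_url}/metrics/job/{JOB}", data=payload, method="POST"
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

Pushing the timestamp on every run, including failed ones, is what lets the stale alert distinguish "probe ran and failed" from "probe never ran".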
parent e037b160d0
commit b6e334daab
2 changed files with 153 additions and 0 deletions
@@ -2298,6 +2298,30 @@ serverFiles:
           severity: warning
         annotations:
           summary: "Email round-trip monitor never reported - check CronJob in mailserver namespace"
+      - alert: AIOStreamsStreamCountLow
+        expr: aiostreams_stream_count{job="aiostreams-stream-probe"} < 50
+        for: 30m
+        labels:
+          severity: warning
+        annotations:
+          summary: "AIOStreams returning <50 streams for the canary title for 30m"
+          description: "Probe for Breaking Bad S01E01 returned {{ $value }} streams. Could indicate an upstream addon outage, RD filter expansion, or a regression in the user's preset filters. Check `kubectl -n aiostreams get cronjob aiostreams-stream-probe` and the most recent job pod logs."
+      - alert: AIOStreamsProbeFailing
+        expr: aiostreams_probe_success{job="aiostreams-stream-probe"} == 0
+        for: 30m
+        labels:
+          severity: warning
+        annotations:
+          summary: "AIOStreams stream-probe failing for 30m"
+          description: "The /api/v1/user fetch or stream search is returning errors, or stream count is below threshold. Check probe logs."
+      - alert: AIOStreamsProbeStale
+        expr: (time() - aiostreams_probe_last_run_timestamp{job="aiostreams-stream-probe"}) > 1800
+        for: 10m
+        labels:
+          severity: warning
+        annotations:
+          summary: "AIOStreams stream-probe hasn't run in >30 min"
+          description: "CronJob may be unschedulable or failing before pushgateway POST."
       - alert: ClaudeOAuthTokenExpiringSoon
         expr: (claude_oauth_token_expiry_timestamp{job="claude-oauth-expiry-monitor"} - time()) < (30 * 86400)
         for: 1h
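To make the timing of the three new rules concrete, here is a small sketch that mirrors their `expr` and `for` values in plain Python. It is a simplified model of how Prometheus moves an alert from pending to firing, not Prometheus code; `Rule`, `RULES` and `state_after` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Rule:
    condition: Callable[[float, float], bool]  # (value, eval_time) -> bool
    for_seconds: int


# Thresholds and durations copied from the alert definitions in the diff.
RULES = {
    "AIOStreamsStreamCountLow": Rule(lambda v, t: v < 50, 30 * 60),
    "AIOStreamsProbeFailing": Rule(lambda v, t: v == 0, 30 * 60),
    # Here the value is the last-run timestamp, compared against eval time.
    "AIOStreamsProbeStale": Rule(lambda v, t: (t - v) > 1800, 10 * 60),
}


def state_after(rule: Rule, evaluations: Iterable[Tuple[float, float]]) -> str:
    """Walk (eval_time, value) pairs in order and return the final state.

    The alert is 'pending' once the condition holds and becomes 'firing'
    when it has held continuously for at least rule.for_seconds.
    """
    active_since: Optional[float] = None
    state = "inactive"
    for eval_time, value in evaluations:
        if rule.condition(value, eval_time):
            if active_since is None:
                active_since = eval_time
            held = eval_time - active_since
            state = "firing" if held >= rule.for_seconds else "pending"
        else:
            # Any evaluation where the condition is false resets the timer.
            active_since = None
            state = "inactive"
    return state
```

Under this model a single good probe result resets the 30m timer for the first two alerts, and AIOStreamsProbeStale fires roughly 40 minutes after the last successful push (30 minutes for the condition to become true, plus the 10m `for` window).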