mam-farming: make MAMFarmingStuck a grabber heartbeat, not a grab-count check
Some checks failed
ci/woodpecker/push/default Pipeline failed
MAMFarmingStuck fired whenever the freeleech grabber added 0 torrents in 4h, but grabbing 0 is normal: the grabber searches a random catalogue offset each run and legitimately finds nothing when freeleech is dry. The account ratio was a healthy 37.5; the alert even misreported it as "0.00" because $value was the grabbed count, not the ratio.

The alert's real intent was to catch the grabber not running at all (CronJob Forbid-blocked or wedged), but increase(mam_farming_grabbed[4h]) == 0 cannot distinguish "didn't run" from "ran, nothing to grab", since Pushgateway serves the last pushed value forever.

The grabber now pushes a mam_grabber_last_run_timestamp heartbeat on every completed run (main success, ratio/mouse skip, and qBittorrent-unreachable paths). The alert fires only when that heartbeat is more than 4h stale, which is the true stuck condition. Cookie expiry and qBittorrent-down keep their own dedicated alerts.

Surfaced by /cluster-health as a false-firing alert.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent a0725ede57
commit 2479560fa2
2 changed files with 19 additions and 6 deletions
@@ -2840,15 +2840,22 @@ serverFiles:
   annotations:
     summary: "MAM ratio is {{ $value | printf \"%.2f\" }} for 24h (target: >= 1.0)"
 - alert: MAMFarmingStuck
+  # Heartbeat-based: fires only when the grabber CronJob has not COMPLETED
+  # a run in >4h (the original failure mode: Forbid-blocked / wedged in
+  # ContainerCreating). The grabber heartbeats mam_grabber_last_run_timestamp
+  # on every completed run — including legit dry runs that grab 0 (its random
+  # search offset lands on an empty/over-filtered page, which is normal). The
+  # old increase(mam_farming_grabbed[4h])==0 could not tell "didn't run" from
+  # "ran, nothing to grab" (Pushgateway serves the last value forever), so a
+  # dry freeleech period false-fired. Cookie-expiry and qBittorrent-down have
+  # their own alerts (MAM session cookie / QBittorrentDisconnected).
   expr: |
-    increase(mam_farming_grabbed[4h]) == 0
-    and mam_farming_total_seeding < 150
-    and mam_ratio >= 1.2
-  for: 4h
+    time() - mam_grabber_last_run_timestamp > 4 * 3600
+  for: 15m
   labels:
     severity: warning
   annotations:
-    summary: "Grabber has added 0 torrents in 4h despite healthy ratio ({{ $value | printf \"%.2f\" }})"
+    summary: "MAM freeleech grabber has not completed a run in {{ $value | humanizeDuration }} — CronJob stuck/blocked"
 - alert: MAMJanitorStuckBacklog
   expr: mam_janitor_skipped_active > 400
   for: 6h
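The rule comment above argues that the old expression could not separate "didn't run" from "ran, nothing to grab", while a heartbeat timestamp can. The following is a minimal sketch of that reasoning, not code from this commit: `naive_increase` is a hypothetical, heavily simplified stand-in for PromQL `increase()` (no counter-reset handling or extrapolation), and the sample values and timestamps are invented for illustration.

```python
STALE_AFTER_SECONDS = 4 * 3600  # threshold taken from the new alert expression


def naive_increase(samples):
    """Simplified stand-in for increase(): last sample minus first sample."""
    return samples[-1] - samples[0]


def is_stuck(last_run_timestamp, now):
    """Mirror of: time() - mam_grabber_last_run_timestamp > 4 * 3600."""
    return now - last_run_timestamp > STALE_AFTER_SECONDS


now = 1_700_000_000  # arbitrary fixed "current time" for the illustration

# Case A: the grabber is wedged; Pushgateway keeps serving the last pushed 0.
wedged_samples = [0, 0, 0, 0]
wedged_heartbeat = now - 6 * 3600  # last completed run was 6h ago

# Case B: the grabber ran on schedule but found nothing to grab each time.
dry_samples = [0, 0, 0, 0]
dry_heartbeat = now - 30 * 60  # last completed run was 30 minutes ago

# The grab-count signal is identical in both cases.
print(naive_increase(wedged_samples), naive_increase(dry_samples))
# The heartbeat age separates them.
print(is_stuck(wedged_heartbeat, now), is_stuck(dry_heartbeat, now))
```

Under these assumptions the grab-count check sees the same series in both cases, and only the heartbeat age tells the wedged CronJob apart from a dry freeleech period.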
@@ -134,6 +134,7 @@ def main():
             profile_metrics
             + f'mam_grabber_skipped_reason{{reason="{reason}"}} 1\n'
             + f"mam_farming_grabbed 0\n"
+            + f"mam_grabber_last_run_timestamp {int(time.time())}\n"
         )
         return
 
@@ -153,7 +154,11 @@ def main():
         ).json()
     except Exception as e:
         print(f"qBittorrent unreachable: {e}", file=sys.stderr)
-        push(profile_metrics + "mam_farming_grabbed 0\n")
+        push(
+            profile_metrics
+            + "mam_farming_grabbed 0\n"
+            + f"mam_grabber_last_run_timestamp {int(time.time())}\n"
+        )
         sys.exit(1)
 
     farming = [t for t in all_torrents if t.get("category") == "mam-farming"]
@@ -264,6 +269,7 @@ def main():
         + f"mam_farming_grabbed {grabbed}\n"
         + f"mam_farming_total_seeding {len(farming) + grabbed}\n"
         + f"mam_farming_size_bytes {total_size}\n"
+        + f"mam_grabber_last_run_timestamp {int(time.time())}\n"
     )
     push(metrics)
     print(f"Done: grabbed={grabbed}")
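The commit appends the same heartbeat line by hand in each of the three exit paths. One possible way to keep those paths from drifting apart is a small helper that appends the heartbeat to any payload. This is only a sketch: `with_heartbeat` is a hypothetical name that does not appear in the commit, and the optional `now` parameter is an assumption added to make the function easy to test.

```python
import time


def with_heartbeat(payload, now=None):
    """Return payload with a mam_grabber_last_run_timestamp line appended.

    `now` is an optional Unix time; it defaults to the current time.
    """
    ts = int(time.time() if now is None else now)
    # Make sure the existing payload ends with a newline before appending.
    if payload and not payload.endswith("\n"):
        payload += "\n"
    return payload + f"mam_grabber_last_run_timestamp {ts}\n"


# Example with a fixed timestamp so the result is deterministic.
example = with_heartbeat("mam_farming_grabbed 0\n", now=1_700_000_000)
print(example, end="")
```

With a helper along these lines, each call site could wrap its payload once (for example `push(with_heartbeat(metrics))`, assuming the existing `push` function), so a future exit path is less likely to forget the heartbeat.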