upgrade-state: filter transient registry digest-check errors

Keel polls ~175 image manifests hourly against public registries.
Transient i/o timeouts and registry 5xx responses are inherent at
that scale and auto-recover on the next poll, but they were tripping
the Apps row into ⚠ attn — pure noise.

Extend benign_re to cover:
  - failed to check digest + (i/o timeout | connection refused
    | connection reset | context deadline exceeded | TLS handshake
    timeout | no such host | EOF)
  - failed to check digest + non-successful response (status=5xx)

Real actionable digest-check failures (HTTP 401 auth, 404 removed
tag) still surface. Persistent registry-side 5xx is owned by the
registry's own monitoring (forgejo-integrity-probe +
RegistryCatalogInaccessible), not by Keel logs.

Tested locally: Apps row flips from ⚠ attn → ✓ healthy after the
filter is in place; remaining errors-line drops to "(none in last
24h)".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
Viktor Barzin 2026-05-19 22:06:21 +00:00
parent a5772060f8
commit aa05942fa5

View file

@ -192,7 +192,25 @@ except Exception:
# token. We don't want the interactive bot (no approvals; opt-out
# auto-update). The Slack NOTIFICATION sender works independently
# of the bot, so rollout messages still post to #general.
local benign_re='bot\.Run\(\): can not get configuration for bot \[slack\]|SLACK_APP_TOKEN must have the (previf|prefix)'
# - `failed to check digest` with a transient network error —
# Keel polls ~175 image manifests against public registries
# hourly. Occasional `i/o timeout` / `connection refused` /
# `TLS handshake timeout` / `no such host` / `EOF` /
# `context deadline exceeded` are inherent to public-internet
# polling at that scale and auto-recover on the next poll.
# Actionable digest-check failures surface as HTTP 401/404
# (auth, removed-tag) — those are NOT filtered.
# - `failed to check digest` with HTTP 5xx — upstream registry
# having a problem (DockerHub maintenance, Forgejo restart,
# etc.). Same recovery pattern as network errors: next hourly
# poll succeeds once upstream is back. Persistent 5xx for >24h
# would indicate a real registry-side issue, but that surfaces
# via the registry's own monitoring (e.g. forgejo-integrity-probe
# + RegistryCatalogInaccessible), not via Keel logs.
local benign_re='bot\.Run\(\): can not get configuration for bot \[slack\]'
benign_re+='|SLACK_APP_TOKEN must have the (previf|prefix)'
benign_re+='|failed to check digest.*(i/o timeout|connection refused|connection reset|context deadline exceeded|TLS handshake timeout|no such host|: EOF)'
benign_re+='|failed to check digest.*non-successful response \(status=5[0-9][0-9]'
errors=$(echo "$log_24h" | grep -iE '"level":"(error|fatal)"|level=error' | grep -vE "$benign_re" | tail -3 || true)
if [[ -z "$errors" ]]; then
APPS_ERROR_LINE="(none in last 24h)"