upgrade-state: filter transient registry digest-check errors
Keel polls ~175 image manifests hourly against public registries.
Transient i/o timeouts and registry 5xx responses are inherent at
that scale and auto-recover on the next poll, but they were tripping
the Apps row into ⚠ attn — pure noise.
Extend benign_re to cover:
- failed to check digest + (i/o timeout | connection refused
| connection reset | context deadline exceeded | TLS handshake
timeout | no such host | EOF)
- failed to check digest + non-successful response (status=5xx)
Real actionable digest-check failures (HTTP 401 auth, 404 removed
tag) still surface. Persistent registry-side 5xx is owned by the
registry's own monitoring (forgejo-integrity-probe +
RegistryCatalogInaccessible), not by Keel logs.
Tested locally: Apps row flips from ⚠ attn → ✓ healthy after the
filter is in place; remaining errors-line drops to "(none in last
24h)".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
parent
a5772060f8
commit
aa05942fa5
1 changed files with 19 additions and 1 deletions
|
|
@ -192,7 +192,25 @@ except Exception:
|
|||
# token. We don't want the interactive bot (no approvals; opt-out
|
||||
# auto-update). The Slack NOTIFICATION sender works independently
|
||||
# of the bot, so rollout messages still post to #general.
|
||||
local benign_re='bot\.Run\(\): can not get configuration for bot \[slack\]|SLACK_APP_TOKEN must have the (previf|prefix)'
|
||||
# - `failed to check digest` with a transient network error —
|
||||
# Keel polls ~175 image manifests against public registries
|
||||
# hourly. Occasional `i/o timeout` / `connection refused` /
|
||||
# `TLS handshake timeout` / `no such host` / `EOF` /
|
||||
# `context deadline exceeded` are inherent to public-internet
|
||||
# polling at that scale and auto-recover on the next poll.
|
||||
# Actionable digest-check failures surface as HTTP 401/404
|
||||
# (auth, removed-tag) — those are NOT filtered.
|
||||
# - `failed to check digest` with HTTP 5xx — upstream registry
|
||||
# having a problem (DockerHub maintenance, Forgejo restart,
|
||||
# etc.). Same recovery pattern as network errors: next hourly
|
||||
# poll succeeds once upstream is back. Persistent 5xx for >24h
|
||||
# would indicate a real registry-side issue, but that surfaces
|
||||
# via the registry's own monitoring (e.g. forgejo-integrity-probe
|
||||
# + RegistryCatalogInaccessible), not via Keel logs.
|
||||
local benign_re='bot\.Run\(\): can not get configuration for bot \[slack\]'
|
||||
benign_re+='|SLACK_APP_TOKEN must have the (previf|prefix)'
|
||||
benign_re+='|failed to check digest.*(i/o timeout|connection refused|connection reset|context deadline exceeded|TLS handshake timeout|no such host|: EOF)'
|
||||
benign_re+='|failed to check digest.*non-successful response \(status=5[0-9][0-9]'
|
||||
errors=$(echo "$log_24h" | grep -iE '"level":"(error|fatal)"|level=error' | grep -vE "$benign_re" | tail -3 || true)
|
||||
if [[ -z "$errors" ]]; then
|
||||
APPS_ERROR_LINE="(none in last 24h)"
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue