monitoring: exclude catchall-error-pages from HighService4xxRate
The catchall-error-pages IngressRoute matches `HostRegexp(^(.+\.)?viktorbarzin\.me$)` at priority=1: it is the wildcard handler that returns 404 for any unmatched hostname (typos and scanner traffic). By design its 4xx rate sits at ~100%, so HighService4xxRate was a permanent false positive for traefik-catchall-error-pages-*@kubernetescrd. This uses the same exclusion pattern as nextcloud/grafana/linkwarden/claude-memory (services with legitimately high 4xx counts).
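The exclusion works because Prometheus regex label matchers are fully anchored: a `service!~` pattern must match the whole `service` label value, which is why every alternative is padded with `.*`. The minimal Python sketch below illustrates that behaviour, using `re.fullmatch` as a stand-in for Prometheus's RE2 matching (equivalent for patterns this simple). The helper `is_excluded` and the sample labels are hypothetical; in particular the `-80` suffix is an assumption, since the commit only gives `traefik-catchall-error-pages-*@kubernetescrd`.

```python
import re

# Exclusion alternatives copied from the updated `service!~` matcher.
EXCLUDED = r".*nextcloud.*|.*grafana.*|.*linkwarden.*|.*claude-memory.*|.*catchall-error-pages.*"


def is_excluded(service: str, pattern: str = EXCLUDED) -> bool:
    """Approximate Prometheus regex matching: label regexes are fully
    anchored, so the pattern must match the entire label value."""
    return re.fullmatch(pattern, service) is not None


# Hypothetical service label; the `-80` port suffix is an assumption.
catchall = "traefik-catchall-error-pages-80@kubernetescrd"

print(is_excluded(catchall))                          # True
print(is_excluded(catchall, "catchall-error-pages"))  # False: no `.*` padding
print(is_excluded("blog-blog-80@kubernetescrd"))      # False
```

Without the surrounding `.*`, the bare name would never match the full Traefik service label, and the alert would keep firing.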
parent f677794379
commit fc5a4b66ad
1 changed file with 8 additions and 3 deletions
```diff
@@ -2141,13 +2141,18 @@ serverFiles:
         annotations:
           summary: "5xx rate on {{ $labels.service }}: {{ $value | printf \"%.1f\" }}% (threshold: 10%)"
       - alert: HighService4xxRate
+        # `.*catchall-error-pages.*` is excluded because that IngressRoute
+        # is the wildcard `HostRegexp(^(.+\.)?viktorbarzin\.me$)` handler
+        # — its entire purpose is to return 404 for unmatched hostnames
+        # (typos + scanner traffic), so its 4xx rate is permanently ~100%.
+        # Without this exclusion the alert is a perpetual false positive.
         expr: |
           (
-            sum(rate(traefik_service_requests_total{code=~"4..", service!~".*nextcloud.*|.*grafana.*|.*linkwarden.*|.*claude-memory.*"}[5m])) by (service)
-            / sum(rate(traefik_service_requests_total{service!~".*nextcloud.*|.*grafana.*|.*linkwarden.*|.*claude-memory.*"}[5m])) by (service)
+            sum(rate(traefik_service_requests_total{code=~"4..", service!~".*nextcloud.*|.*grafana.*|.*linkwarden.*|.*claude-memory.*|.*catchall-error-pages.*"}[5m])) by (service)
+            / sum(rate(traefik_service_requests_total{service!~".*nextcloud.*|.*grafana.*|.*linkwarden.*|.*claude-memory.*|.*catchall-error-pages.*"}[5m])) by (service)
             * 100
           ) > 30
-          and sum(rate(traefik_service_requests_total{service!~".*nextcloud.*|.*grafana.*|.*linkwarden.*|.*claude-memory.*"}[5m])) by (service) > 0.1
+          and sum(rate(traefik_service_requests_total{service!~".*nextcloud.*|.*grafana.*|.*linkwarden.*|.*claude-memory.*|.*catchall-error-pages.*"}[5m])) by (service) > 0.1
           and on() (time() - process_start_time_seconds{job="prometheus"}) > 900
         for: 15m
         labels:
```
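To make the thresholds in the updated expression concrete, here is a minimal Python sketch of what it evaluates at a single instant, assuming made-up per-service rates. `ServiceRates`, `breaching_services`, and the sample service names are hypothetical. PromQL's `and` is a label-set intersection; the sketch simplifies it to per-service filters. It also does not model `for: 15m`: a real alert only fires after the condition has held continuously for 15 minutes.

```python
import re
from dataclasses import dataclass

# Updated exclusion regex from the `service!~` matchers.
EXCLUDED = re.compile(
    r".*nextcloud.*|.*grafana.*|.*linkwarden.*|.*claude-memory.*|.*catchall-error-pages.*"
)

# Thresholds taken from the HighService4xxRate expression.
RATIO_THRESHOLD_PCT = 30.0  # ( 4xx / total * 100 ) > 30
MIN_TOTAL_RPS = 0.1         # and total request rate > 0.1 req/s
MIN_PROM_UPTIME_S = 900     # and Prometheus has been up for > 900 s


@dataclass
class ServiceRates:
    rate_4xx: float    # per-service 5m rate of 4xx responses (req/s)
    rate_total: float  # per-service 5m rate of all responses (req/s)


def breaching_services(rates: dict, prom_uptime_s: float) -> list:
    """Return services whose current sample satisfies the expression.
    The `for: 15m` pending period is intentionally not modelled."""
    if prom_uptime_s <= MIN_PROM_UPTIME_S:
        return []  # the `and on() (...) > 900` clause suppresses everything
    hits = []
    for service, r in rates.items():
        if EXCLUDED.fullmatch(service):
            continue  # dropped by `service!~` (fully anchored, like Prometheus)
        if r.rate_total <= MIN_TOTAL_RPS:
            continue  # below the 0.1 req/s traffic floor
        if r.rate_4xx / r.rate_total * 100 > RATIO_THRESHOLD_PCT:
            hits.append(service)
    return sorted(hits)


# Hypothetical per-service rates; the service names are illustrative only.
SAMPLE = {
    "traefik-catchall-error-pages-80@kubernetescrd": ServiceRates(2.0, 2.0),
    "blog-blog-80@kubernetescrd": ServiceRates(0.5, 1.0),
    "api-api-8080@kubernetescrd": ServiceRates(0.1, 1.0),
    "quiet-quiet-80@kubernetescrd": ServiceRates(0.05, 0.05),
}

print(breaching_services(SAMPLE, prom_uptime_s=3600))
# ['blog-blog-80@kubernetescrd']
```

The catch-all service sits at 100% 4xx but never reaches the ratio check, because the rule applies the same `service!~` list to the numerator, the denominator, and the traffic-floor selector.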