monitoring: exclude catchall-error-pages from HighService4xxRate
The catchall-error-pages IngressRoute matches HostRegexp(^(.+\.)?viktorbarzin\.me$) at priority=1: it is the wildcard handler that returns 404 for any unmatched hostname (typos + scanner traffic). By design its 4xx rate sits at ~100%, so HighService4xxRate was a permanent false positive for traefik-catchall-error-pages-*@kubernetescrd. This uses the same exclusion pattern as nextcloud/grafana/linkwarden/claude-memory (services with legitimately high 4xx counts).
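A minimal sketch of how the two regexes in this change behave, written in Python for illustration only. Traefik and Prometheus both use Go-style regular expressions, but these particular patterns behave the same under Python's `re`. Prometheus regex label matchers are fully anchored, which `re.fullmatch` emulates here. The concrete service label values and the hostnames other than `viktorbarzin.me` are hypothetical, since the commit only gives the service name as `traefik-catchall-error-pages-*@kubernetescrd`.

```python
import re

# Exclusion regex copied from the updated `service!~"..."` matcher.
EXCLUDED_SERVICES = (
    r".*nextcloud.*|.*grafana.*|.*linkwarden.*"
    r"|.*claude-memory.*|.*catchall-error-pages.*"
)

# Host pattern copied from the catchall IngressRoute's HostRegexp rule.
CATCHALL_HOST = re.compile(r"^(.+\.)?viktorbarzin\.me$")


def is_excluded(service: str) -> bool:
    """Emulate Prometheus `!~`: the regex must match the whole label value."""
    return re.fullmatch(EXCLUDED_SERVICES, service) is not None


def hits_catchall(host: str) -> bool:
    """True if the catchall route's host pattern accepts this hostname."""
    return CATCHALL_HOST.match(host) is not None


# Hypothetical label values, for illustration only.
catchall_service = "traefik-catchall-error-pages-abc123@kubernetescrd"
ordinary_service = "blog-blog-80@kubernetes"

print(is_excluded(catchall_service))          # dropped from the alert expression
print(is_excluded(ordinary_service))          # still evaluated by the alert
print(hits_catchall("typo.viktorbarzin.me"))  # any subdomain can reach the 404 handler
```

Because the matcher is anchored, the leading and trailing `.*` are what let the pattern match a service name that only contains `catchall-error-pages` as a substring.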
parent f677794379
commit fc5a4b66ad
1 changed file with 8 additions and 3 deletions
@@ -2141,13 +2141,18 @@ serverFiles:
 annotations:
 summary: "5xx rate on {{ $labels.service }}: {{ $value | printf \"%.1f\" }}% (threshold: 10%)"
 - alert: HighService4xxRate
+# `.*catchall-error-pages.*` is excluded because that IngressRoute
+# is the wildcard `HostRegexp(^(.+\.)?viktorbarzin\.me$)` handler
+# — its entire purpose is to return 404 for unmatched hostnames
+# (typos + scanner traffic), so its 4xx rate is permanently ~100%.
+# Without this exclusion the alert is a perpetual false positive.
 expr: |
 (
-sum(rate(traefik_service_requests_total{code=~"4..", service!~".*nextcloud.*|.*grafana.*|.*linkwarden.*|.*claude-memory.*"}[5m])) by (service)
-/ sum(rate(traefik_service_requests_total{service!~".*nextcloud.*|.*grafana.*|.*linkwarden.*|.*claude-memory.*"}[5m])) by (service)
+sum(rate(traefik_service_requests_total{code=~"4..", service!~".*nextcloud.*|.*grafana.*|.*linkwarden.*|.*claude-memory.*|.*catchall-error-pages.*"}[5m])) by (service)
+/ sum(rate(traefik_service_requests_total{service!~".*nextcloud.*|.*grafana.*|.*linkwarden.*|.*claude-memory.*|.*catchall-error-pages.*"}[5m])) by (service)
 * 100
 ) > 30
-and sum(rate(traefik_service_requests_total{service!~".*nextcloud.*|.*grafana.*|.*linkwarden.*|.*claude-memory.*"}[5m])) by (service) > 0.1
+and sum(rate(traefik_service_requests_total{service!~".*nextcloud.*|.*grafana.*|.*linkwarden.*|.*claude-memory.*|.*catchall-error-pages.*"}[5m])) by (service) > 0.1
 and on() (time() - process_start_time_seconds{job="prometheus"}) > 900
 for: 15m
 labels: