Two new alertmanager inhibit rules and one new Prometheus alert,
informed by the 2026-05-12 incident where Traefik pod
traefik-db7696fbf-k42wp came back after a SIGTERM with only 6 routers
vs 119 on healthy peers (stale K8s informer cache) and served 404 for
~1/3 of viktorbarzin.me traffic.
* New alert TraefikReplicaConfigStale: fires when max/min reload-rate
ratio across Traefik pods exceeds 5x for 2h. The 2h window + 2h
for-clause tolerates legitimate post-restart ramp-up; the bug
pattern persists indefinitely.
* New inhibit: TraefikReplicaConfigStale suppresses the symptom
alerts (HighService{Error,4xx,Latency}, IngressTTFB{High,Critical},
IngressErrorRate5xxHigh, TraefikHighOpenConnections,
ForwardAuthFallbackActive, AnubisChallengeStoreErrors,
ExternalAccessDivergence) so only the actionable root cause pages.
* New inhibit: HomeAssistantDown suppresses
HomeAssistantCriticalSensorUnavailable and
HomeAssistantMetricsMissing — when HA itself is down, every sensor
going unavailable is noise (10x firings observed in the last 12h).
* Extend NodeDown and NFSServerUnresponsive target lists to also
suppress HomeAssistantCriticalSensorUnavailable.