infra: re-enable unattended-upgrades with kured prometheus-gating

Reverses the March 2026 outage mitigation that disabled
unattended-upgrades cluster-wide. Re-enables it on the k8s template VM with:
- Allowed-Origins limited to the release, -security, and -updates
  pockets, plus the ESM apps/infra security pockets
- Package-Blacklist for containerd, runc, cri-tools, kubernetes-cni,
  calico, cni-plugins, and docker-ce, plus apt-mark hold on
  kubelet/kubeadm/kubectl and containerd/runc
- Automatic-Reboot disabled; kured drives the actual reboots
- Compatible with the existing kured + sentinel-gate flow

kured side:
- rebootDelay 30s, concurrency 1
- Sentinel cool-down stretched 30m → 24h (aligns with the 24h soak
  window from the post-mortem)
- prometheusUrl + alertFilterRegexp wired so any firing non-ignored
  alert halts the rollout. The ignore-list covers self-referential
  alerts (Watchdog/RebootRequired/KuredNodeWasNotDrained/
  InfoInhibitor) that would otherwise deadlock kured.

Prometheus side (already partly landed in 6c4e0966, the "Upgrade
Gates" rule group):
- Refine `KubeQuotaAlmostFull` to include the resourcequota label in
  both the on-clause and the summary, so multi-quota namespaces
  (authentik, beads-server, frigate) report the quota name correctly.

grafana.tf: terraform fmt whitespace only.

Together with the post-mortem 2026-03-22 (memory id=390) the loop is
closed: unattended-upgrades runs again, kernel-class updates can land,
but only when cluster health is green and the reboot window is open.
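
Reviewer sketch of the alert gate described under "kured side": names matching the ignore pattern are skipped, and any other firing alert halts the rollout. This is a minimal illustration only. `IGNORE_RE` is assembled from the alert names listed in this message and is an assumption, since the actual alertFilterRegexp value is not shown in this commit; `reboot_gate` is a hypothetical helper, not part of kured.

```shell
#!/bin/sh
# Assumed ignore pattern, built from the alert names in the commit message.
# The real alertFilterRegexp in the repo may differ.
IGNORE_RE='^(Watchdog|RebootRequired|KuredNodeWasNotDrained|InfoInhibitor)$'

# reboot_gate ALERT...
# Prints "halt" if any given alert name falls outside the ignore pattern,
# otherwise prints "proceed". Hypothetical helper for illustration.
reboot_gate() {
  for alert in "$@"; do
    if ! printf '%s\n' "$alert" | grep -Eq -- "$IGNORE_RE"; then
      echo "halt"
      return 0
    fi
  done
  echo "proceed"
}

reboot_gate Watchdog RebootRequired       # prints "proceed"
reboot_gate Watchdog KubeQuotaAlmostFull  # prints "halt"
```

Without the ignore pattern, always-firing alerts such as Watchdog would keep the gate permanently closed, which is the deadlock the ignore-list avoids.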
parent fe75fad467
commit 5c0ea96a91
4 changed files with 71 additions and 15 deletions
@@ -67,11 +67,44 @@ runcmd:
- sed -i 's/#Compress=yes/Compress=yes/' /etc/systemd/journald.conf
- systemctl restart systemd-journald
%{if is_k8s_template}
# Disable unattended-upgrades to prevent unexpected kernel updates that can break containerd/kubelet
# (Root cause of 26h cluster outage: unattended-upgrades → kernel update → containerd failure)
- systemctl disable --now unattended-upgrades || true
- apt-get remove -y unattended-upgrades || true
# Re-enabled 2026-05-10: unattended-upgrades is back on, but with a tight
# Allowed-Origins list, a Package-Blacklist for k8s/containerd/runc/calico,
# and Automatic-Reboot disabled (kured + sentinel-gate handles reboots in a
# 24h-soaked rolling window, gated by Prometheus alerts).
# Original outage (March 2026) was kernel update → containerd overlayfs corruption.
# Mitigations: 24h cool-down between node reboots, Prometheus halt-on-alert,
# apt-mark hold on k8s components, Package-Blacklist for runtime components.
- apt-get install -y unattended-upgrades update-notifier-common
- |
cat > /etc/apt/apt.conf.d/52unattended-upgrades-k8s <<'EOF'
Unattended-Upgrade::Allowed-Origins {
"$${distro_id}:$${distro_codename}";
"$${distro_id}:$${distro_codename}-security";
"$${distro_id}:$${distro_codename}-updates";
"$${distro_id}ESMApps:$${distro_codename}-apps-security";
"$${distro_id}ESM:$${distro_codename}-infra-security";
};
Unattended-Upgrade::Package-Blacklist {
"^containerd(\.io)?$$";
"^runc$$";
"^cri-tools$$";
"^kubernetes-cni$$";
"^calico-.*";
"^cni-plugins-.*";
"^docker-ce$$";
};
Unattended-Upgrade::DevRelease "false";
Unattended-Upgrade::Automatic-Reboot "false";
EOF
- |
cat > /etc/apt/apt.conf.d/20auto-upgrades <<'EOF'
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
EOF
- systemctl unmask unattended-upgrades 2>/dev/null || true
- systemctl enable --now unattended-upgrades
- apt-mark hold kubelet kubeadm kubectl
- apt-mark hold containerd containerd.io runc 2>/dev/null || true
- systemctl stop kubelet
- containerd config default | sudo tee /etc/containerd/config.toml
- ${containerd_config_update_command}
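
Reviewer sketch for the Package-Blacklist in this hunk: a quick way to see which package names the rendered patterns would catch. It assumes Terraform renders each `$$` as a single `$`, and it uses `grep -E` as a stand-in for the regular-expression matching that unattended-upgrades applies to package names, so treat it as an approximation. `is_blacklisted` is a hypothetical helper, not something shipped in this repo.

```shell
#!/bin/sh
# is_blacklisted PACKAGE
# Prints "blocked" if the package name matches any blacklist pattern from the
# hunk above (as rendered, with "$$" collapsed to "$"), otherwise "allowed".
# grep -E only approximates the matcher used by unattended-upgrades.
is_blacklisted() {
  pkg=$1
  for pattern in \
    '^containerd(\.io)?$' \
    '^runc$' \
    '^cri-tools$' \
    '^kubernetes-cni$' \
    '^calico-.*' \
    '^cni-plugins-.*' \
    '^docker-ce$'
  do
    if printf '%s\n' "$pkg" | grep -Eq -- "$pattern"; then
      echo "blocked"
      return 0
    fi
  done
  echo "allowed"
}

for p in containerd.io runc calico-node linux-image-generic docker-ce-cli; do
  printf '%s: %s\n' "$p" "$(is_blacklisted "$p")"
done
```

In this sketch the anchored `^docker-ce$` pattern does not cover `docker-ce-cli`, and kernel packages such as `linux-image-generic` stay upgradable, which matches the stated intent that kernel-class updates can land again.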