infra

Viktor Barzin 4713c3a6d9 k8s-version-upgrade: tigera quiesce + etcd-skip retry + IO-wait alert ignore		2026-05-23 08:40:11 +00:00

Three changes unblocking the autonomous chain for k8s patch upgrades:

1. phase_master quiesces tigera-operator before drain, restores after. Tigera crashes immediately if apiserver is unreachable (no retry logic) and crashlooping it during master static-pod swaps generates ~500MB/s disk I/O that pushes kubeadm's 5-min static-pod-hash watch past its limit. Quiesce removes the storm contributor; calico data plane keeps running unchanged (data plane is the DaemonSet+Typha, operator is just the reconciler).

2. update_k8s.sh retries with --etcd-upgrade=false on the 2nd attempt. For patch upgrades (1.34.7→1.34.8), etcd's image doesn't change: kubeadm writes an identical manifest, hash doesn't update, watch times out and rolls back forever. The skip-etcd retry sidesteps it for the legitimate no-change case while still doing a full etcd upgrade on the first attempt (correct for minor-version bumps).

3. halt_on_alert_query also ignores IngressTTFBHigh + NodeHighIOWait. Both are symptoms-not-causes: ingress latency spikes briefly during any pod-restart wave; high IOwait is exactly what upgrade activity causes (chicken-and-egg). The inline quiet-baseline check (Ready transition <10min) is the real cluster-churn gate.

RBAC: k8s-upgrade-job ClusterRole gains `patch` on deployments + scale subresource so the chain can do the scale-to-0/back-to-1 on tigera.

These three together get the chain past the cascade that's been blocking 1.34.7→1.34.8 for a week. Long-term fix is still HA control plane (beads code-n0ow); these are the bridge.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
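The second-attempt retry described in change 2 of the commit message could look roughly like the sketch below. It is an illustration under stated assumptions, not the contents of update_k8s.sh: the function names `build_upgrade_args` and `run_upgrade_with_retry` and the `KUBEADM` override variable are hypothetical. Only `kubeadm upgrade apply` and its `--yes` and `--etcd-upgrade` flags come from kubeadm itself.

```shell
# Hypothetical sketch: first attempt does a full upgrade (etcd included),
# later attempts add --etcd-upgrade=false, as the commit message describes.

# Print the kubeadm argument list for a given target version and attempt number.
build_upgrade_args() {
  version="$1"
  attempt="$2"
  args="upgrade apply $version --yes"
  if [ "$attempt" -ge 2 ]; then
    # From the 2nd attempt on, skip the etcd static-pod upgrade.
    args="$args --etcd-upgrade=false"
  fi
  printf '%s\n' "$args"
}

# Run the upgrade, retrying up to max_attempts times (default 2).
# KUBEADM is an assumed override so the loop can be dry-run with a stub.
run_upgrade_with_retry() {
  version="$1"
  max_attempts="${2:-2}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Word splitting of the argument string is intentional here.
    if ${KUBEADM:-kubeadm} $(build_upgrade_args "$version" "$attempt"); then
      return 0
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

Setting `KUBEADM=echo` before calling `run_upgrade_with_retry v1.34.8` prints the command that would be run instead of executing it. Keeping the full etcd upgrade on the first attempt preserves the behaviour the commit message says is needed for minor-version bumps.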
server_safe_poweroff	move helper scripts in scripts dir [ci skip]	2025-10-11 17:14:59 +00:00
check-ingress-auth-comments.py	infra/scripts/tg: enforce ingress_factory auth-comment convention	2026-05-11 19:18:27 +00:00
cluster_healthcheck.sh	cluster-health #43 : tighten PVE thermal threshold to 65 C	2026-05-22 14:09:08 +00:00
cluster_manager.py	chore: add untracked stacks, scripts, and agent configs	2026-04-15 09:33:06 +00:00
daily-backup.service	backup: fix daily-backup silent failures, postiz pg_dump CronJob, doc reconcile	2026-05-09 17:41:04 +00:00
daily-backup.sh	scripts: timeout rsync + sqlite calls in daily-backup	2026-05-10 18:39:07 +00:00
daily-backup.timer	rename weekly-backup → daily-backup across scripts, timers, services, and docs [ci skip]	2026-04-13 18:37:04 +00:00
extend_vm_storage.sh	[ci skip] expand k8s worker nodes to 256G, update inventory and extend script	2026-02-28 16:00:16 +00:00
forgejo-migrate-orphan-images.sh	[forgejo] Migration script: exclude empty repos, all-images full mode	2026-05-07 17:21:39 +00:00
frigate-bulk-classify.js	[ci skip] sync tfstate and add frigate helper scripts	2026-02-12 23:11:23 +00:00
frigate-inspect.mjs	[ci skip] sync tfstate and add frigate helper scripts	2026-02-12 23:11:23 +00:00
gen_service_stacks.py	cleanup: remove calibre and audiobookshelf stacks after ebooks migration [ci skip]	2026-03-25 23:56:07 +02:00
graceful-db-maintenance.sh	add pod dependency management via Kyverno init container injection	2026-03-15 19:17:57 +00:00
image_pull.sh	chore: add untracked stacks, scripts, and agent configs	2026-04-15 09:33:06 +00:00
image_pull_remote.sh	chore: add untracked stacks, scripts, and agent configs	2026-04-15 09:33:06 +00:00
kill_ns.sh	move helper scripts in scripts dir [ci skip]	2025-10-11 17:14:59 +00:00
lvm-pvc-snapshot.sh	[backup] Fix lvm-pvc-snapshot Pushgateway push (stdout pollution in cmd_prune_count)	2026-04-25 14:30:58 +00:00
lvm-pvc-snapshot.timer	add 3-2-1 backup pipeline: weekly PVC file copy, NFS mirror, pfsense, offsite sync	2026-04-06 14:53:28 +03:00
migrate-state-to-pg	[infra] Migrate Terraform state from local SOPS to PostgreSQL backend	2026-04-16 19:33:12 +00:00
migrate_service_state.sh	cleanup: remove calibre and audiobookshelf stacks after ebooks migration [ci skip]	2026-03-25 23:56:07 +02:00
nfs-change-tracker.service	consolidate offsite backup: inotify change tracking, deduplicate Synology paths [ci skip]	2026-04-13 18:06:20 +00:00
node_registry_manager.sh	some nits on the registry manager script - note it is still not working correctly [ci skip]	2025-10-17 19:23:43 +00:00
offsite-sync-backup.service	rename weekly-backup → daily-backup across scripts, timers, services, and docs [ci skip]	2026-04-13 18:37:04 +00:00
offsite-sync-backup.sh	rename weekly-backup → daily-backup across scripts, timers, services, and docs [ci skip]	2026-04-13 18:37:04 +00:00
offsite-sync-backup.timer	switch backup + offsite sync from weekly to daily — RPO 7d → 1d [ci skip]	2026-04-13 18:24:38 +00:00
parse-postmortem-todos.sh	fix: use sh instead of bash in pipeline (Alpine compat)	2026-04-14 17:29:14 +00:00
pfsense-haproxy-bootstrap.php	mailserver: split healthcheck path off PROXY-aware listeners + book-search uses ClusterIP	2026-05-05 19:45:33 +00:00
pfsense-nat-mailserver-haproxy-flip.php	[mailserver] Phase 4+5 — pfSense HAProxy cutover for all 4 mail ports [ci skip]	2026-04-19 12:24:50 +00:00
pfsense-nat-mailserver-haproxy-unflip.php	[mailserver] Phase 4+5 — pfSense HAProxy cutover for all 4 mail ports [ci skip]	2026-04-19 12:24:50 +00:00
postmortem-pipeline.sh	[claude-agent-service] Migrate all pipelines from DevVM SSH to K8s HTTP	2026-04-18 10:12:02 +00:00
pve-nfs-exports	fix(post-mortem): add /etc/exports to git, NFS health check in daily-backup, document CSI requirements [PM-2026-04-14]	2026-04-14 18:08:24 +00:00
renew_worker_certs.sh	move helper scripts in scripts dir [ci skip]	2025-10-11 17:14:59 +00:00
setup-containerd-pullthrough.sh	chore: add untracked stacks, scripts, and agent configs	2026-04-15 09:33:06 +00:00
setup-forgejo-containerd-mirror.sh	[forgejo] Phase 0 of registry consolidation: prepare Forgejo OCI registry	2026-05-07 15:51:34 +00:00
setup-task-pipeline.sh	[ci skip] add Forgejo task pipeline for OpenClaw AI agent	2026-03-07 21:11:07 +00:00
setup_containerd_mirrors.sh	add upstream fallback to containerd registry mirrors	2026-04-02 11:05:30 +03:00
state-sync	[infra] Migrate Terraform state from local SOPS to PostgreSQL backend	2026-04-16 19:33:12 +00:00
stop_storage_services.sh	cleanup: remove calibre and audiobookshelf stacks after ebooks migration [ci skip]	2026-03-25 23:56:07 +02:00
task-processor.sh	[ci skip] add Forgejo task pipeline for OpenClaw AI agent	2026-03-07 21:11:07 +00:00
tg	infra/scripts/tg: enforce ingress_factory auth-comment convention	2026-05-11 19:18:27 +00:00
update-istio-injection.sh	move helper scripts in scripts dir [ci skip]	2025-10-11 17:14:59 +00:00
update_k8s.sh	k8s-version-upgrade: tigera quiesce + etcd-skip retry + IO-wait alert ignore	2026-05-23 08:40:11 +00:00
update_node.sh	k8s-version-upgrade: automated kubeadm/kubelet/kubectl upgrade pipeline	2026-05-10 19:07:42 +00:00
upgrade_state.sh	upgrade-state: filter transient registry digest-check errors	2026-05-19 22:06:21 +00:00
vault-kubeconfig	remove SOPS pipeline, deploy ESO + Vault DB/K8s engines	2026-03-15 16:37:38 +00:00
woodpecker-register-forgejo-repo.sh	[woodpecker] Programmatic Forgejo repo registration	2026-05-07 23:33:26 +00:00