TrueNAS VM 9000 was operationally decommissioned on 2026-04-13; NFS has been
served by the Proxmox host (192.168.1.127) since then. This commit scrubs the
remaining references from active docs. VM 9000 itself remains on PVE in a
stopped state pending a user decision on deletion.
In-session cleanup already landed: reverse-proxy ingress + Cloudflare record
removed; Technitium DNS records deleted; Vault truenas_{api_key,ssh_private_key}
purged; homepage_credentials.reverse_proxy.truenas_token removed;
truenas_homepage_token variable + module deleted; Loki + Dashy cleaned;
config.tfvars deprecated DNS lines removed; historical-name comment added to
the nfs-truenas StorageClass (48 bound PVs, immutable name — kept).
Historical records (docs/plans/, docs/post-mortems/, .planning/) intentionally
untouched — they describe state at a point in time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Restore etcd
Prerequisites
- SSH access to k8s-master node
- etcd snapshot available on NFS at /mnt/main/etcd-backup/
- etcd PKI certs at /etc/kubernetes/pki/etcd/ on master node
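The prerequisites above can be checked on the master node before starting. The following is a minimal sketch, not part of the repo: `missing_paths` is a hypothetical helper name, and which paths to pass is left to the operator.

```shell
# Hypothetical helper: print every given path that does not exist.
# Returns non-zero if at least one path is missing.
missing_paths() {
  local path missing=0
  for path in "$@"; do
    if [ ! -e "$path" ]; then
      printf 'missing: %s\n' "$path"
      missing=1
    fi
  done
  return "$missing"
}
```

It might be called as `missing_paths /mnt/main/etcd-backup /etc/kubernetes/pki/etcd/ca.crt /etc/kubernetes/pki/etcd/healthcheck-client.crt /etc/kubernetes/pki/etcd/healthcheck-client.key`, using the certificate files referenced in the verification step.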
Backup Location
- NFS: /mnt/main/etcd-backup/etcd-snapshot-YYYYMMDD-HHMMSS.db
- Replicated to Synology NAS (192.168.1.13) via Proxmox host offsite-sync-backup (inotify-driven rsync)
- Retention: 30 days
- Schedule: Daily at 00:00
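Since snapshots are taken daily, a quick freshness check can catch a silently failing backup job. This is a rough sketch only: `has_fresh_snapshot` is a hypothetical helper, and it assumes file modification times on the NFS share reflect when each snapshot was written.

```shell
# Hypothetical helper: succeed if at least one snapshot in the directory
# was modified within the last max_age_minutes.
has_fresh_snapshot() {
  local dir="$1" max_age_minutes="$2"
  # find prints matching files; "grep -q ." turns "any output" into an exit status.
  find "$dir" -maxdepth 1 -name 'etcd-snapshot-*.db' -mmin "-$max_age_minutes" \
    | grep -q .
}
```

For example, `has_fresh_snapshot /mnt/main/etcd-backup 1500` allows 25 hours, which leaves some slack over the daily schedule.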
CRITICAL: etcd is the foundation of the cluster
Restoring etcd will reset the entire Kubernetes state to the snapshot time. All objects created after the snapshot will be lost. This is a last-resort operation.
Only restore etcd if the control plane is completely broken.
Restore Procedure
1. SSH to the master node
ssh k8s-master
2. Identify the snapshot to restore
ls -lt /mnt/main/etcd-backup/etcd-snapshot-*.db | head -10
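`ls -lt` sorts by modification time, which can be misleading if files were copied or touched. Because the file names embed YYYYMMDD-HHMMSS, sorting by name gives chronological order. A small sketch, assuming a hypothetical helper `latest_snapshot`:

```shell
# Hypothetical helper: print the snapshot with the latest timestamp in its name.
latest_snapshot() {
  local dir="$1" f newest=""
  for f in "$dir"/etcd-snapshot-*.db; do
    [ -e "$f" ] || continue   # the glob matched nothing
    newest="$f"               # globs expand in sorted order, so the last one wins
  done
  if [ -z "$newest" ]; then
    echo "no snapshots found in $dir" >&2
    return 1
  fi
  printf '%s\n' "$newest"
}
```

Usage could look like `SNAPSHOT=$(latest_snapshot /mnt/main/etcd-backup)`; pick an older file by hand if the newest snapshot already contains the bad state.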
3. Stop the API server and etcd
# Move static pod manifests to stop them
sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/
sudo mv /etc/kubernetes/manifests/etcd.yaml /etc/kubernetes/
# Confirm both containers have stopped; re-run until this prints nothing
sudo crictl ps | grep -E "etcd|apiserver"
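Rather than re-running the check by hand, the wait can be scripted. The sketch below is illustrative only: `wait_until_gone` is a hypothetical helper that polls any listing command once per second until no output line matches a pattern, or gives up after a timeout.

```shell
# Hypothetical helper: poll a listing command until no line matches the
# pattern. Arguments: pattern, timeout in seconds, then the command to run.
wait_until_gone() {
  local pattern="$1" timeout="$2" elapsed=0
  shift 2
  while "$@" 2>/dev/null | grep -Eq "$pattern"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1                # still listed after the timeout
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}
```

On the master node this might be invoked as `wait_until_gone 'etcd|apiserver' 60 sudo crictl ps`, where the 60-second timeout is an arbitrary choice.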
4. Back up current etcd data
sudo mv /var/lib/etcd /var/lib/etcd.bak.$(date +%Y%m%d-%H%M%S)
5. Restore the snapshot
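# Replace YYYYMMDD-HHMMSS with the timestamp of the snapshot chosen in step 2.
# On etcd 3.5 and later, etcdutl snapshot restore is preferred; the etcdctl form is deprecated there.
sudo ETCDCTL_API=3 etcdctl snapshot restore /mnt/main/etcd-backup/etcd-snapshot-YYYYMMDD-HHMMSS.db \
  --data-dir=/var/lib/etcd \
  --name=k8s-master \
  --initial-cluster=k8s-master=https://127.0.0.1:2380 \
  --initial-advertise-peer-urls=https://127.0.0.1:2380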
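Before running the restore command above, it can help to confirm that the snapshot file is usable and that step 4 really moved the old data directory away, since the restore is expected to refuse a data directory that already holds data. A minimal sketch with a hypothetical helper `preflight_restore`:

```shell
# Hypothetical helper: check restore preconditions.
# Arguments: snapshot file, target data directory.
preflight_restore() {
  local snapshot="$1" data_dir="$2"
  if [ ! -s "$snapshot" ]; then
    echo "snapshot missing or empty: $snapshot" >&2
    return 1
  fi
  if [ -e "$data_dir" ]; then
    echo "data dir still exists: $data_dir" >&2
    return 1
  fi
  return 0
}
```

For this runbook the call would be `preflight_restore /mnt/main/etcd-backup/etcd-snapshot-YYYYMMDD-HHMMSS.db /var/lib/etcd`, with the timestamp filled in.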
6. Fix permissions
sudo chown -R root:root /var/lib/etcd
7. Restart etcd and API server
sudo mv /etc/kubernetes/etcd.yaml /etc/kubernetes/manifests/
# Wait for etcd to be ready
sleep 30
sudo mv /etc/kubernetes/kube-apiserver.yaml /etc/kubernetes/manifests/
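The fixed `sleep 30` may be too short on a slow node or longer than needed on a fast one. One alternative is to poll a health command until it succeeds. The sketch below assumes a hypothetical helper `retry_until_ok`; the timeout and interval values are up to the operator.

```shell
# Hypothetical helper: re-run a command until it exits 0 or the timeout passes.
# Arguments: timeout in seconds, interval in seconds, then the command.
retry_until_ok() {
  local timeout="$1" interval="$2" elapsed=0
  shift 2
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1                # never became healthy within the timeout
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 0
}
```

It could wrap the `etcdctl ... endpoint health` command from step 8, for example `retry_until_ok 120 5 sudo ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key endpoint health`, before the API server manifest is moved back.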
8. Verify restoration
# Check etcd health
sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint health
# Check cluster status
kubectl get nodes
kubectl get pods -A | head -20
9. Reconcile state
After etcd restore, some objects may be stale:
# Re-apply critical infrastructure
cd /path/to/infra
scripts/tg apply stacks/platform
# Check for orphaned resources
kubectl get pods -A | grep -E "Terminating|Error|Unknown"
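To get a sense of how much drift there is, the pod list can be summarized by status instead of scanned line by line. A small sketch, assuming the default `kubectl get pods -A` column layout (STATUS in the fourth column) and a hypothetical helper name `summarize_pod_status`:

```shell
# Hypothetical helper: read `kubectl get pods -A` output on stdin and
# print one "STATUS count" line per distinct status, sorted by status.
summarize_pod_status() {
  awk 'NR > 1 { count[$4]++ } END { for (s in count) print s, count[s] }' | sort
}
```

Usage would be `kubectl get pods -A | summarize_pod_status`; any status other than Running or Completed is a candidate for closer inspection.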
Estimated Time
- Snapshot restore: ~10-15 minutes
- Full reconciliation: ~30-60 minutes (depends on drift)