infra/docs/runbooks/restore-vault.md

# Restore Vault (Raft)

Last updated: 2026-04-19

## Prerequisites
- `kubectl` access to the cluster
- Vault root token (from `vault-root-token` secret in `vault` namespace — manually created, independent of automation)
- Raft snapshot available on NFS at `/mnt/main/vault-backup/`
- Unseal keys (stored securely: check the emergency kit first; the copy in `secret/viktor` is readable only while Vault is already unsealed)

## Backup Location
- NFS: `/mnt/main/vault-backup/vault-raft-YYYYMMDD-HHMMSS.db`
- Mirrored to sda: `/mnt/backup/nfs-mirror/vault-backup/` (PVE host 192.168.1.127)
- Replicated to Synology NAS: `Synology/Backup/Viki/pve-backup/nfs-mirror/vault-backup/`
- Retention: 30 days (on NFS), latest only (on sda), unlimited (on Synology)
- Schedule: Weekly on Sundays at 02:00 (`0 2 * * 0`)
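
Because snapshots are taken weekly, the newest file on NFS should rarely be more than seven days old. The sketch below checks for that before you rely on the NFS copy. The function name `has_fresh_snapshot` and the eight-day threshold are assumptions for illustration and are not part of the existing backup tooling.

```bash
# Hypothetical helper (not part of the existing tooling): succeeds if at least
# one snapshot in the directory was modified within the last N days.
has_fresh_snapshot() {
  _dir=${1:-/mnt/main/vault-backup}   # snapshot directory (default: NFS path)
  _max_days=${2:-8}                   # assumed threshold: weekly schedule plus one day of slack
  _hit=$(find "$_dir" -maxdepth 1 -type f -name 'vault-raft-*.db' \
           -mtime "-$_max_days" 2>/dev/null | head -n 1)
  [ -n "$_hit" ]
}

# Usage: warn if the weekly job appears to have missed a run
if has_fresh_snapshot /mnt/main/vault-backup 8; then
  echo "recent snapshot present"
else
  echo "WARNING: no snapshot newer than 8 days"
fi
```

If the check fails, fall back to the sda or Synology copies listed above.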

## CRITICAL: Vault is a dependency for many services
Vault provides secrets to the entire cluster via ESO (External Secrets Operator). A Vault outage affects:
- All ExternalSecrets (43 secrets + 9 DB-creds secrets)
- Vault DB engine password rotation
- K8s credentials engine
- CI/CD secret sync

**Priority: Restore Vault before any other service (except etcd).**

## Restore Procedure

### 1. Identify the snapshot to restore
```bash
# List available snapshots
ls -lt /mnt/main/vault-backup/vault-raft-*.db | head -10
```
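
To avoid copying the filename by hand, the newest snapshot can be selected from its embedded timestamp. This is a minimal sketch: `latest_snapshot` is a hypothetical helper name, and it assumes every file follows the `vault-raft-YYYYMMDD-HHMMSS.db` naming shown above.

```bash
# Hypothetical helper: print the newest snapshot in a directory.
# Relies on the YYYYMMDD-HHMMSS filename timestamp sorting lexically,
# so it does not depend on file mtimes (which a copy may reset).
latest_snapshot() {
  _dir=${1:-/mnt/main/vault-backup}
  _newest=""
  for _f in "$_dir"/vault-raft-*.db; do
    # Globs expand in sorted order, so the last regular file seen is the newest.
    if [ -f "$_f" ]; then _newest=$_f; fi
  done
  if [ -z "$_newest" ]; then
    echo "no snapshot found in $_dir" >&2
    return 1
  fi
  printf '%s\n' "$_newest"
}

# Usage:
# SNAPSHOT=$(latest_snapshot /mnt/main/vault-backup) && echo "Will restore: $SNAPSHOT"
```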

### 2. Restore Raft snapshot
```bash
# Get root token
VAULT_TOKEN=$(kubectl get secret vault-root-token -n vault -o jsonpath='{.data.vault-root-token}' | base64 -d)

# Port-forward to Vault
kubectl port-forward svc/vault-active -n vault 8200:8200 &

# Restore the snapshot (this will overwrite current state)
export VAULT_ADDR=http://127.0.0.1:8200
export VAULT_TOKEN
vault operator raft snapshot restore -force /path/to/vault-raft-YYYYMMDD-HHMMSS.db
```
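
Since `-force` overwrites the current state, it can help to guard the restore with two checks: refuse an empty snapshot file, and try to save the current state first. The wrapper below is a sketch under those assumptions. `safe_restore` and the `pre-restore-*.db` filename are hypothetical, and the pre-restore save is best effort because it may fail when Vault is unhealthy.

```bash
# Hypothetical wrapper around the restore command above.
# Assumes VAULT_ADDR and VAULT_TOKEN are already exported.
safe_restore() {
  _snap=$1
  # Refuse to restore from a missing or zero-byte file.
  if [ ! -s "$_snap" ]; then
    echo "snapshot missing or empty: $_snap" >&2
    return 1
  fi
  # Best effort: keep a copy of the current state in case the restore
  # turns out to be the wrong call. This may fail if Vault is unhealthy.
  _pre="pre-restore-$(date +%Y%m%d-%H%M%S).db"
  vault operator raft snapshot save "$_pre" \
    || echo "warning: could not save current state, continuing" >&2
  vault operator raft snapshot restore -force "$_snap"
}

# Usage:
# safe_restore /path/to/vault-raft-YYYYMMDD-HHMMSS.db
```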

### 3. Unseal Vault (if sealed after restore)

> **Note:** Vault now has an auto-unseal sidecar that automatically unseals pods
> using the `vault-unseal-key` K8s Secret. The manual procedure below is a
> fallback if auto-unseal fails.

```bash
# Check seal status
vault status

# If sealed, unseal with keys (repeat with different keys until the threshold is reached)
vault operator unseal <key1>
vault operator unseal <key2>
vault operator unseal <key3>
```
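
When scripting this step, the seal state can be read from the exit code of `vault status` instead of parsing its output: the CLI returns 0 when unsealed, 2 when sealed, and 1 on error. A minimal sketch, with `seal_state` as a hypothetical helper name:

```bash
# Hypothetical helper: map the `vault status` exit code to a word.
# Exit codes per the Vault CLI: 0 = unsealed, 2 = sealed, 1 = error.
seal_state() {
  _rc=0
  vault status >/dev/null 2>&1 || _rc=$?
  case $_rc in
    0) echo "unsealed" ;;
    2) echo "sealed" ;;
    *) echo "error" ;;
  esac
}

# Usage:
# [ "$(seal_state)" = "sealed" ] && echo "auto-unseal has not run yet, unseal manually"
```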

### 4. Verify restoration
```bash
# Check Vault health
vault status

# Check raft peers
vault operator raft list-peers

# Verify key secrets exist
vault kv get secret/viktor
vault kv list secret/

# Check DB engine
vault list database/roles

# Check K8s engine
vault list kubernetes/roles
```
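
The same checks can be run in one pass with a pass/fail line per command, which is easier to scan under pressure. This is a sketch only: `verify_vault` is a hypothetical name, and the list simply mirrors the commands above.

```bash
# Hypothetical helper: run each verification command and report OK/FAIL.
# Returns non-zero if any check failed.
verify_vault() {
  _failed=0
  for _check in \
    "status" \
    "operator raft list-peers" \
    "kv get secret/viktor" \
    "list database/roles" \
    "list kubernetes/roles"
  do
    # $_check is intentionally unquoted so it splits into separate arguments.
    if vault $_check >/dev/null 2>&1; then
      echo "OK   vault $_check"
    else
      echo "FAIL vault $_check"
      _failed=1
    fi
  done
  return "$_failed"
}

# Usage:
# verify_vault || echo "one or more checks failed, inspect before continuing"
```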

### 5. Trigger ESO refresh
After Vault restore, ExternalSecrets may need a refresh:
```bash
# Restart ESO to force re-sync
kubectl rollout restart deployment -n external-secrets

# Check ExternalSecret status
kubectl get externalsecrets -A | grep -v "SecretSynced"
```
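
If only a few ExternalSecrets stay unsynced, a lighter option than restarting ESO may be to change an annotation on the affected resource, which the External Secrets Operator documentation describes as a way to trigger a manual refresh. The sketch below assumes the installed ESO version honors this. `force_sync` is a hypothetical helper, and the namespace and name are placeholders.

```bash
# Hypothetical helper: request a re-sync of a single ExternalSecret by
# updating its force-sync annotation (assumes ESO reacts to annotation changes).
force_sync() {
  _ns=$1     # namespace of the ExternalSecret
  _name=$2   # name of the ExternalSecret
  kubectl annotate externalsecret "$_name" -n "$_ns" \
    "force-sync=$(date +%s)" --overwrite
}

# Usage:
# force_sync <namespace> <externalsecret-name>
```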

## Alternative: Restore from sda Backup

If the Proxmox host NFS mount is unavailable but the PVE host itself is accessible:

```bash
# 1. List available snapshots on the PVE host
ssh root@192.168.1.127 'ls -lt /mnt/backup/nfs-mirror/vault-backup/'

# 2. Copy the chosen snapshot from the PVE host to the local workstation
scp root@192.168.1.127:/mnt/backup/nfs-mirror/vault-backup/vault-raft-YYYYMMDD-HHMMSS.db ./

# 3. Port-forward to Vault (run from the workstation)
kubectl port-forward svc/vault-active -n vault 8200:8200 &
export VAULT_ADDR=http://127.0.0.1:8200
export VAULT_TOKEN=$(kubectl get secret vault-root-token -n vault -o jsonpath='{.data.vault-root-token}' | base64 -d)

# 4. Restore the snapshot (this will overwrite current state)
vault operator raft snapshot restore -force ./vault-raft-YYYYMMDD-HHMMSS.db
```
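
Before restoring a snapshot that was copied between hosts, it is worth confirming that the copy is intact. The sketch below compares a SHA-256 hash computed on the source host with the local file. It assumes `sha256sum` is available on both sides (on macOS, `shasum -a 256` is the usual substitute), and `checksum_matches` is a hypothetical helper name.

```bash
# Hypothetical helper: succeed only if the local file's SHA-256 hash
# equals the expected hash (for example, one computed on the source host).
checksum_matches() {
  _expected=$1
  _actual=$(sha256sum "$2" | cut -d' ' -f1)
  [ -n "$_actual" ] && [ "$_expected" = "$_actual" ]
}

# Usage (hash computed on the PVE host, compared on the workstation):
# expected=$(ssh root@192.168.1.127 \
#   "sha256sum /mnt/backup/nfs-mirror/vault-backup/vault-raft-YYYYMMDD-HHMMSS.db" | cut -d' ' -f1)
# checksum_matches "$expected" ./vault-raft-YYYYMMDD-HHMMSS.db || echo "copy is corrupt, re-run scp"
```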

## Alternative: Restore from Synology (if PVE host is down)

If the PVE host itself is unavailable:

```bash
# 1. List available snapshots on the Synology NAS
ssh Administrator@192.168.1.13 'ls -lt /volume1/Backup/Viki/nfs/vault-backup/'

# 2. Copy the chosen snapshot to the local workstation
scp Administrator@192.168.1.13:/volume1/Backup/Viki/nfs/vault-backup/vault-raft-YYYYMMDD-HHMMSS.db ./

# 3. Restore via port-forward (same as in the sda procedure above)
```

## Full Vault Rebuild (from zero)
If Vault needs to be rebuilt from scratch:
1. Comment out data sources + OIDC config in `stacks/vault/main.tf`
2. Apply Helm release: `scripts/tg apply -target=helm_release.vault stacks/vault`
3. Initialize: `vault operator init`
4. Unseal with generated keys
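5. Restore raft snapshot (see "2. Restore Raft snapshot" above). After the restore, Vault expects the unseal keys and root token that belong to the snapshot, not the ones generated in step 3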
6. Populate `secret/vault` with OIDC credentials
7. Uncomment data sources + OIDC
8. Re-apply: `scripts/tg apply stacks/vault`

## Estimated Time
- Snapshot restore + unseal: ~10 minutes
- Full rebuild: ~30-45 minutes