13 commits

Author SHA1 Message Date
Viktor Barzin
6e9bffb1a3 storage docs: document the per-VM SCSI-LUN cap (proxmox-csi)
The proxmox-csi-plugin hardcodes a 29-disks-per-VM ceiling in
pkg/csi/utils.go:394 (lun < 30 loop). This is the actual block-
storage scaling bottleneck — NOT QEMU, NOT Proxmox, NOT the kernel.
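
The cap described above can be pictured with a minimal Python sketch of a first-free-slot search. This is an illustration only, not the plugin's Go source: the starting index of 1 is an assumption inferred from the 29-disk figure, and `first_free_lun` is a hypothetical name.

```python
MAX_LUN = 30  # upper bound quoted in the commit message (lun < 30)


def first_free_lun(used):
    """Return the lowest unused LUN, scanning 1..29.

    Assumption: the scan starts at 1, which is what yields 29 usable slots.
    """
    for lun in range(1, MAX_LUN):
        if lun not in used:
            return lun
    raise RuntimeError("no free lun found")


# A VM with 29 attached disks has no slot left for a 30th.
attached = set(range(1, MAX_LUN))
try:
    first_free_lun(attached)
except RuntimeError as err:
    print(err)
```

Under these assumptions the limit is a property of the search loop itself, which is consistent with the claim that changing the SCSI controller type does not raise it.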

Adds a "Per-VM SCSI-LUN cap" section to docs/architecture/storage.md
explaining:
  - the source-level hardcode and how to recognise it (FailedAttachVolume
    "no free lun found")
  - why switching scsihw to virtio-scsi-single buys ZERO additional
    capacity (perf-only)
  - levers in leverage-per-effort order (migrate non-DB to NFS,
    add a worker, fork+patch the plugin)
  - the Wave 1 NFS migration (2026-05-26) that took 5 services off
    block and skipped two more on pre-flight (plotting-book SQLite+WAL,
    stirling-pdf H2 .mv.db)

Discovered during the Wave 1 work — see remote memory ids 2788+ for
full context and 2798+ for the related postiz state-drift discovery.
2026-05-26 02:56:27 +00:00
Viktor Barzin
d6590612b2 immich: bulk-import Anca's Elements photo archive into her account
Grows pve/nfs-data 3T → 4T (online lvextend + resize2fs) to absorb ~340 GB
of new originals landing under /srv/nfs/immich/upload during the import.

Adds:
- module "nfs_anca_elements_host" — RO PVC over /srv/nfs/anca-elements,
  consumed only by the import Job (not mounted in immich-server).
- kubernetes_job_v1.anca_elements_import — immich-go v0.31.0 uploader
  posting to immich-server.immich.svc:2283 with Anca's API key (synced
  via the existing immich-secrets ExternalSecret from
  secret/immich.anca_api_key). Filters to image extensions, bans the
  non-photo top-level dirs (filme/, Music/, carti/, courses, installers,
  docs, etc.), puts every asset in the album "Poze (Elements)". Default
  `--pause-immich-jobs` is disabled — non-admin keys can't pause jobs.
- docs/architecture/storage.md — note the new 4 TB size in 3 places.
- docs/runbooks/grow-pve-nfs-lv.md — captures the one-shot lvextend
  procedure (no pve-host TF stack exists for this).
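
The selection rule described for the import Job (image extensions only, non-photo top-level directories banned) might look roughly like the Python sketch below. It does not reproduce immich-go's own options or matching logic; the extension set is an assumption, the directory set contains only the names spelled out above, and `should_import` is a hypothetical helper.

```python
from pathlib import PurePosixPath

# Assumption: illustrative extension list; the commit does not enumerate it.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".heic", ".tif", ".tiff"}
# Top-level directories named in the commit message (its list ends with "etc.").
BANNED_TOP_LEVEL = {"filme", "Music", "carti", "courses", "installers", "docs"}


def should_import(relative_path):
    """Return True if a path relative to the archive root would be uploaded."""
    path = PurePosixPath(relative_path)
    # Reject anything that lives under a banned top-level directory.
    if len(path.parts) > 1 and path.parts[0] in BANNED_TOP_LEVEL:
        return False
    # Otherwise keep the file only if it has an image extension.
    return path.suffix.lower() in IMAGE_EXTENSIONS


for candidate in ("2019/vacanta/IMG_0001.JPG", "filme/clip.jpg", "2019/notes.pdf"):
    print(candidate, should_import(candidate))
```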

Job is removed in the follow-up cleanup commit once the upload completes;
the PVC stays for a videos batch later.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 14:12:30 +00:00
Viktor Barzin
34f8c0f537 docs+scripts: lock in nextcloud-as-PVE-NFS-browser surface
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline was successful
- docs/architecture/storage.md: new "Nextcloud as PVE-NFS browser"
  section documenting mount-per-archive + applicable_users model,
  why mount-level ACL beats Files Access Control on NC 30/31, the
  manifest shape (with current applicableUsers + enableSharing
  fields), and the trade-off
- docs/runbooks/nextcloud-add-archive.md: 5-step runbook to surface
  a new directory under /srv/nfs/* to specific NC users via the
  bootstrap Job
- scripts/anca-elements-sync.sh: deployed at
  /usr/local/bin/anca-elements-sync.sh on the PVE host; fpsync from
  Synology Anca/Elements to /srv/nfs/anca-elements (idempotent +
  resumable). The PVE replica is what the NC /anca-elements mount
  serves; the offsite-sync pipeline excludes this path (committed
  earlier this session) so we don't write it back to Synology

NC usernames are admin/anca/emo (not display names — admin is
Viktor). Stale "viktor" references in the manifest example dropped.
2026-05-24 11:45:01 +00:00
Viktor Barzin
f9f19e4c54 mysql: bump to 4Gi limit / 3Gi request; grow /srv/nfs LV to 3 TiB
mysql-standalone OOMKilled May 8 18:05 (anon-rss 2 GB at the 2 Gi limit).
innodb_buffer_pool_size=1Gi plus connection buffers and InnoDB internals
don't fit in 2 Gi. Bumping limit to 4 Gi (request 3 Gi) leaves headroom
without changing the buffer pool config.

/srv/nfs was at 90% (1.7T / 2T); grew the underlying pve/nfs-data LV
1 TiB online and ran resize2fs (now 60% used). This surfaced during the
2026-05-09 IO-pressure post-mortem; the thin pool had ~4.6 TiB free.

The post-mortem also covers the stale-NFS-client trigger (legacy
/usr/local/bin/weekly-backup pointing at the decommissioned TrueNAS IP)
and the resulting wedged kthread on the PVE host. Script removed and
node_exporter restarted out-of-band; kthread will clear at next PVE
reboot. See docs/post-mortems/2026-05-09-io-pressure-stale-nfs.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-09 17:01:57 +00:00
Viktor Barzin
484b4c7190 vault: complete Phase 2 NFS-hostile migration; remove nfs-proxmox SC
All 3 vault voters now on proxmox-lvm-encrypted (vault-0 16:18, vault-1
+ vault-2 today). The NFS fsync incompatibility identified in the
2026-04-22 raft-leader-deadlock post-mortem is no longer reachable —
raft consensus log + audit log live on LUKS2 block storage with real
fsync semantics.

Cluster-wide consumers of the inline kubernetes_storage_class.nfs_proxmox
dropped to zero after the rolling migration, so the resource is removed from
infra/stacks/vault/main.tf. Released NFS PVs (6) remain in the cluster
and will be reclaimed in Phase 3 cleanup.

Lesson learned (recorded in plan): pvc-protection finalizer races the
StatefulSet controller — pod recreates on the OLD PVCs unless the
finalizer is patched out before pod delete. Force-finalize technique
applied to vault-1 + vault-2 successfully.
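
The ordering in this lesson can be sketched as a small Python helper that only builds the kubectl command lines and runs nothing. The namespace and object names in the usage line are hypothetical, and placing the PVC delete first is an assumption; the commit only states that the finalizer must be patched out before the pod is deleted. Removing pvc-protection bypasses a safety check, so treat this as a sketch rather than a procedure.

```python
import json


def force_finalize_commands(namespace, pvc, pod):
    """Build kubectl argv lists in the required order; nothing is executed."""
    drop_finalizers = json.dumps({"metadata": {"finalizers": None}})
    return [
        # Assumed first step: mark the old PVC for deletion. The
        # pvc-protection finalizer keeps it Terminating while the pod runs.
        ["kubectl", "-n", namespace, "delete", "pvc", pvc, "--wait=false"],
        # Patch the finalizer out so the old PVC is gone before the pod restarts.
        ["kubectl", "-n", namespace, "patch", "pvc", pvc,
         "--type=merge", "-p", drop_finalizers],
        # Only now delete the pod; the StatefulSet controller recreates it.
        ["kubectl", "-n", namespace, "delete", "pod", pod],
    ]


# Hypothetical names for illustration.
for argv in force_finalize_commands("vault", "data-vault-1", "vault-1"):
    print(" ".join(argv))
```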

Closes: code-gy7h
2026-04-25 17:10:00 +00:00
Viktor Barzin
ac8d2f548b paperless-ngx: migrate to proxmox-lvm-encrypted
Document scans (receipts, contracts, IDs) are unambiguously sensitive
PII. The storage decision rule defaults sensitive data to
`proxmox-lvm-encrypted`, but an abandoned migration attempt had kept
paperless-ngx on plain `proxmox-lvm` and left a dormant,
non-Terraform-managed encrypted PVC sitting unbound for 11 days.

Cleaned up the orphan, added the encrypted PVC properly via Terraform,
rsynced data with deployment scaled to 0, swapped claim_name. Plain
`proxmox-lvm` PVC retained for a 7-day soak before removal.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 16:48:53 +00:00
Viktor Barzin
5a0b24f54e [docs] TrueNAS decommission cleanup — remove references from active docs
TrueNAS VM 9000 was operationally decommissioned 2026-04-13; NFS has been
served by Proxmox host (192.168.1.127) since. This commit scrubs remaining
references from active docs. VM 9000 itself remains on PVE in stopped state
pending user decision on deletion.

In-session cleanup already landed: reverse-proxy ingress + Cloudflare record
removed; Technitium DNS records deleted; Vault truenas_{api_key,ssh_private_key}
purged; homepage_credentials.reverse_proxy.truenas_token removed;
truenas_homepage_token variable + module deleted; Loki + Dashy cleaned;
config.tfvars deprecated DNS lines removed; historical-name comment added to
the nfs-truenas StorageClass (48 bound PVs, immutable name — kept).

Historical records (docs/plans/, docs/post-mortems/, .planning/) intentionally
untouched — they describe state at a point in time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 16:55:43 +00:00
Viktor Barzin
dcc96f465e docs(storage): add encrypted LVM documentation
Update storage docs to reflect the 2026-04-15 migration of all sensitive
services to proxmox-lvm-encrypted. Add encrypted PVC template, LUKS2 flow
documentation, updated architecture diagram, and storage class decision
rules.

Files updated:
- .claude/CLAUDE.md: storage decision table, encrypted PVC template
- docs/architecture/storage.md: encrypted flow, components, diagram, Vault paths
- AGENTS.md: storage section with encrypted SC as default for sensitive data

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 21:00:37 +00:00
Viktor Barzin
b45cee5c4a docs: update backup architecture for inotify change tracking + consolidated Synology layout [ci skip]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 18:16:36 +00:00
Viktor Barzin
38d51ab0af deprecate TrueNAS: migrate Immich NFS to Proxmox, remove all 10.0.10.15 references [ci skip]
- Migrate Immich (8 NFS PVs, 1.1TB) from TrueNAS to Proxmox host NFS
- Update config.tfvars nfs_server to 192.168.1.127 (Proxmox)
- Update nfs-csi StorageClass share to /srv/nfs
- Update scripts (weekly-backup, cluster-healthcheck) to Proxmox IP
- Delete obsolete TrueNAS scripts (nfs_exports.sh, truenas-status.sh)
- Rewrite nfs-health.sh for Proxmox NFS monitoring
- Update Freedify nfs_music_server default to Proxmox
- Mark CloudSync monitor CronJob as deprecated
- Update Prometheus alert summaries
- Update all architecture docs, AGENTS.md, and reference docs
- Zero PVs remain on TrueNAS — VM ready for decommission

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 14:42:07 +00:00
Viktor Barzin
b345b086ef update backup/DR docs and runbooks for 3-2-1 architecture
- Full rewrite of backup-dr.md: 3-2-1 strategy with sda backup disk,
  PVC file-level copy from LVM snapshots, pfsense backup, two offsite
  paths. 4 Mermaid diagrams (data flow, timeline, disk layout, restore tree).
- Update storage.md: 65 proxmox-lvm PVCs, sda backup tier
- Update restore-full-cluster.md: add Phase 3.5 for PVC restore from sda
- Update restore-{mysql,postgresql,vault,vaultwarden}.md: add sda fallback paths
- New runbook: restore-pvc-from-backup.md (file-level restore from sda)
- Update CLAUDE.md Storage & Backup section for 3-2-1 architecture
2026-04-06 15:06:01 +03:00
Viktor Barzin
d49acebd8e migrate ebooks-calibre to proxmox-lvm, update storage docs [ci skip]
- Migrate ebooks-calibre-config-iscsi (2Gi, 2380 files) to proxmox-lvm
- Update docs/architecture/storage.md: document Proxmox CSI as primary
  block storage, mark democratic-csi iSCSI as deprecated
- Add full migration plan to docs/plans/
2026-04-03 19:45:34 +03:00
Viktor Barzin
5a42643176 add architecture documentation for all infrastructure subsystems [ci skip]
14 docs covering networking, VPN, storage, authentication, security,
monitoring, secrets, CI/CD, backup/DR, compute, databases, and
multi-tenancy. Each doc includes Mermaid diagrams, component tables,
configuration references, decision rationale, and troubleshooting.
2026-03-24 00:55:25 +02:00