Commit graph

3768 commits

Author SHA1 Message Date
Viktor Barzin
321c073ca0 state(infra): update encrypted state 2026-05-26 07:09:52 +00:00
Viktor Barzin
5b7b962d7c state(infra): update encrypted state 2026-05-26 07:09:33 +00:00
Viktor Barzin
6a83cee6ae state(infra): update encrypted state 2026-05-26 07:07:06 +00:00
Viktor Barzin
9b75b2817b cloud-init: fix k8s node bootstrap snippet (multi-line interp + containerd v2 quotes)
Two bugs found while rebuilding k8s-node4 (2026-05-26):

1. **runcmd YAML breakage**: `- $${containerd_config_update_command}`
   interpolated a multi-line heredoc as bare list-item content. The
   trailing lines lost their list-item prefix, breaking cloud-config
   parsing. Cloud-init silently fell back to the minimal default
   (hostname + package_upgrade only) — kubeadm join, containerd config,
   kubelet tuning, iSCSI hardening, swap, ALL skipped. No error visible
   in `cloud-init status`.

   Fix: wrap the interpolation in `- |` literal block with `indent(4, ...)`.

2. **containerd v2 single-quote mismatch**: `containerd config default`
   in v2 writes `config_path = ''` (single quotes), v1 writes `""` (double).
   The sed pattern matched only double quotes → silent no-op on fresh
   containerd 2.x nodes → registry-mirror hosts.toml ignored → all image
   pulls hit upstream registries → DNS-to-MetalLB chicken-and-egg loop.

   Fix: match any value with `config_path = .*`.
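
The second bug can be illustrated with a minimal Python sketch of the two patterns. The replacement value `/etc/containerd/certs.d` and the function names are assumptions for illustration only; the commit does not show the actual sed expression.

```python
import re

# Hypothetical replacement value; the real path written by the bootstrap
# snippet is not shown in this commit.
NEW_LINE = 'config_path = "/etc/containerd/certs.d"'


def patch_double_quotes_only(config: str) -> str:
    """Old behaviour: only rewrites the containerd v1 form (config_path = "")."""
    return re.sub(r'config_path = ""', NEW_LINE, config)


def patch_any_value(config: str) -> str:
    """New behaviour: rewrites config_path whatever its current value is."""
    return re.sub(r'config_path = .*', NEW_LINE, config)


v1_default = 'config_path = ""'   # double quotes, as written by containerd v1
v2_default = "config_path = ''"   # single quotes, as written by containerd v2
```

With the old pattern the v2 default passes through unchanged, which is the silent no-op described above.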

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 07:06:50 +00:00
Viktor Barzin
445feb118f infra: per-VM I/O caps + terragrunt v0.77 plumbing + state recovery
WHAT LANDED:
- terragrunt.hcl (root): added telmate/proxmox to k8s_providers
  required_providers. Other stacks just don't instantiate a provider
  block — harmless. Replaces the same-name override trick the infra
  stack used to do, which stopped working under Terragrunt v0.77
  ("Detected generate blocks with the same name").
- stacks/infra/terragrunt.hcl: new generate "proxmox_provider" block
  writes proxmox_provider.tf with the provider config; credentials
  read from Vault secret/viktor at plan/apply time (no env vars).
- modules/create-vm: new mbps_rd / mbps_wr number variables (default 0
  = uncapped), wired into scsi0/scsi1 disk{} blocks as
  mbps_r_concurrent / mbps_wr_concurrent. lifecycle.ignore_changes
  extended to scsi6..scsi29 (K8s nodes have many CSI-managed slots),
  plus scsihw and qemu_os (vary per-VM; non-trivial live changes).
- stacks/infra/main.tf: docker-registry-vm gains mbps_rd=40,
  mbps_wr=40 in HCL — already applied live via qm set on 2026-05-26.

WHAT FAILED AND WAS ROLLED BACK:
- Attempted import of 7 VMs (102 devvm, 103 home-assistant, 200
  k8s-master, 201 k8s-node1, 202 k8s-node2, 203 k8s-node3, 204
  k8s-node4) via import {} blocks. The telmate/proxmox v3.0.2-rc07
  provider mangled proxmox-csi PVC slots on apply for vmid 202 and
  203: every scsi slot got rewritten from `vm-9999-pvc-<uuid>` to
  the boot disk `vm-<vmid>-disk-0`. Restored both .conf files from
  the 2026-05-24 nightly PVE config backup at /mnt/backup/pve-config/
  etc-pve/nodes/pve/qemu-server/{202,203}.conf — no reboots, no data
  loss, K8s CSI reconciled PVC attachments within minutes. Removed
  the 7 imports from state via `terraform state rm` and re-encrypted.
  Tracked in beads code-xzbl: blocked on bpg/proxmox provider
  migration (telmate has the same dynamic-disk defect that bit us on
  iSCSI back in 2026-04-02; see memory id=539).

LIVE CAPS STILL IN PLACE (qm set, 2026-05-26 ~03:13 UTC; read/write MB/s):
  102 devvm 60/60   103 home-assistant 40/40   200 k8s-master 100/60
  201 k8s-node1 150/120   202 k8s-node2 150/120   203 k8s-node3 150/120
  204 k8s-node4 150/120   220 docker-registry 40/40
  (pfSense 101 BSD + Windows10 300 intentionally out of scope.)

PRE-EXISTING DRIFT EXPOSED (NOT NEW):
- HCL declares k8s-master (200) and k8s-node2 (202) but neither was
  ever imported into TF state — confirmed against the SOPS-encrypted
  state in git (lineage e1cc5bb5, serial 42, last touched 2026-04-06).
  This commit leaves both declarations in place but does NOT import
  them; that's part of the code-xzbl follow-up.

Closes: code-s9xr
2026-05-26 06:46:47 +00:00
Viktor Barzin
07bd2e0017 onlyoffice: restore replicas 0 → 1 post IO-storm recovery
Cluster is fully stable (all 5 nodes Ready, vaultwarden recovered,
node4 rebuilt 2026-05-26). Removing the TEMP-SCALEDOWN guard.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 03:08:17 +00:00
Viktor Barzin
6e9bffb1a3 storage docs: document the per-VM SCSI-LUN cap (proxmox-csi)
The proxmox-csi-plugin hardcodes a 29-disks-per-VM ceiling in
pkg/csi/utils.go:394 (lun < 30 loop). This is the actual block-
storage scaling bottleneck — NOT QEMU, NOT Proxmox, NOT the kernel.
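
A rough Python sketch of how a `lun < 30` scan produces the ceiling. This is not the plugin's code: the function name is hypothetical, and starting the scan at LUN 1 is an assumption chosen because it matches the 29-disk figure quoted here.

```python
from typing import Optional, Set

MAX_LUN_EXCLUSIVE = 30  # mirrors the `lun < 30` bound quoted above


def first_free_lun(used: Set[int]) -> Optional[int]:
    """Return the lowest free LUN in 1..29, or None when all are taken.

    Assumption: the scan starts at 1, which is what yields the
    29-disk ceiling described in the commit message.
    """
    for lun in range(1, MAX_LUN_EXCLUSIVE):
        if lun not in used:
            return lun
    return None  # corresponds to the "no free lun found" failure
```

Once 29 slots are occupied the scan has nowhere left to go, regardless of what QEMU or the SCSI controller could support.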

Adds a "Per-VM SCSI-LUN cap" section to docs/architecture/storage.md
explaining:
  - the source-level hardcode and how to recognise it (FailedAttachVolume
    "no free lun found")
  - why switching scsihw to virtio-scsi-single buys ZERO additional
    capacity (perf-only)
  - levers in leverage-per-effort order (migrate non-DB to NFS,
    add a worker, fork+patch the plugin)
  - the Wave 1 NFS migration (2026-05-26) that took 5 services off
    block and skipped two more on pre-flight (plotting-book SQLite+WAL,
    stirling-pdf H2 .mv.db)

Discovered during the Wave 1 work — see remote memory ids 2788+ for
full context and 2798+ for the related postiz state-drift discovery.
2026-05-26 02:56:27 +00:00
Viktor Barzin
7ad0e578ae f1-stream: migrate PVC from proxmox-lvm to NFS
Wave 1 LUN-cap relief. The PVC stores 5 small JSON state files
(health_state, schedule, scraped_links, sessions, streams) and a
lost+found — total 30KB, no DB, regenerable from upstream APIs.

Standard scale-to-0 → rsync → swap pattern (deployment was at
replicas=1). Pod came back up on k8s-node4 (now Ready again).

Net: -1 SCSI LUN on k8s-node1 (was the previous host).
2026-05-26 02:49:43 +00:00
Viktor Barzin
aded77d5ab monitoring: alerts for proxmox-csi LUN saturation per node
Vaultwarden + 18 pods got stuck for 7h on 2026-05-26 when k8s-node4 went
down: surviving workloads piled onto node1 and hit the
csi.proxmox.sinextra.dev/max-volume-attachments=28 cap. The Proxmox VM also
had 5 stale scsi entries (PVCs long-migrated to other nodes but never
removed from VM config), which bypassed the K8s scheduler safety until the
plugin returned 'no free lun found' at attach time.

Three new alerts on the kube_volumeattachment_info count per node:
- warning at 24/28 (>= 85%), 10m
- critical at 27/28 (1 slot left), 3m
- critical at 28/28 (cap reached), 1m
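
The three tiers can be sketched as a small Python classifier. This is illustrative only: the real alerts are Prometheus rules, the function name and return strings are made up, and the `for` durations (10m, 3m, 1m) are not modelled.

```python
from typing import Optional

CAP = 28  # csi.proxmox.sinextra.dev/max-volume-attachments from the commit


def lun_alert(attachments: int, cap: int = CAP, warn_at: int = 24) -> Optional[str]:
    """Return the alert tier for a node's volume-attachment count."""
    if attachments >= cap:
        return "critical: cap reached"
    if attachments >= cap - 1:
        return "critical: 1 slot left"
    if attachments >= warn_at:
        return "warning"
    return None  # below every threshold
```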

Also whitelisted kube_volumeattachment_info: the metric was being dropped
by the disk-write-reduction filter (id=559), so the alert queries would
return zero series until it is kept.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 02:45:13 +00:00
Viktor Barzin
a0b5cbc922 onlyoffice: migrate PVC from proxmox-lvm to NFS
Wave 1 LUN-cap relief. OnlyOffice document server keeps only 2 WOPI
key files + a .private dir on the PVC (~24K) — the real DB lives in
its external Postgres + Redis stack, not on this PVC. Service is at
replicas=0 (IO-storm temp scaledown — TEMP-SCALEDOWN comment
preserved).

Migration trivia: scheduler tried to put the rsync helper on
k8s-node4 (PVC's last-known location) but node4 had just come back
online and its proxmox-csi/nfs-csi node pods were still in
ContainerCreating — failed. Retried pinned to k8s-node2 via
nodeSelector; rsync template updated to take an optional node arg.

Net: -1 SCSI LUN once onlyoffice is brought back up.
2026-05-26 02:43:47 +00:00
Viktor Barzin
681f6daf10 whisper: migrate PVC from proxmox-lvm to NFS
Wave 1 LUN-cap relief. Whisper PVC holds Piper TTS .onnx voice
model + a HuggingFace faster-whisper-small-int8 model cache —
read-mostly model artefacts, no DB, 303M total. Both whisper and
piper deployments are at replicas=0 (GPU-node memory pressure,
unrelated).

Switched access_modes to ReadWriteMany since both whisper + piper
deployments reference the same PVC; with proxmox-lvm RWO, both would
be forced onto the same node once they are brought back up.

Net: -1 SCSI LUN once these are brought back up.
2026-05-26 02:38:34 +00:00
Viktor Barzin
a2b410f6c9 resume: migrate PVC from proxmox-lvm to NFS
Wave 1 LUN-cap relief. Reactive Resume stores user-uploaded PDFs +
3 .txt counters under uploads/ and statistics/ — no embedded DB,
112K of data. Service is at replicas=0 (browserless OOM scaledown,
unrelated to this work) so the migration was no-downtime.

Net: -1 SCSI LUN once resume is brought back up.
2026-05-26 02:36:20 +00:00
Viktor Barzin
cdbb418f45 monitoring: alert when cluster can't tolerate losing a non-GPU worker
ClusterCannotTolerateNonGpuNodeLoss fires when the most heavily reserved
non-GPU worker (k8s-node2/3/4) has more memory requests pinned to it
than the rest of the workers (incl. node1 GPU node) currently have free.
If that node went down, its pods would not fit elsewhere and would stay
Pending — exactly what happened today (2026-05-26) with node4 NotReady:
4 kyverno pods + woodpecker PVCs + several deployments stuck Pending
because node2/node3 were at 99% memory-request saturation.

Math: max(R(node X) for X in non-GPU workers) > sum(clamp_min(A(n) - R(n), 0))
over Ready workers. node1 included on the right because its taint is
PreferNoSchedule (soft) so it does absorb non-GPU pods under pressure.
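
The same check, sketched in Python under the reading that "the rest of the workers" excludes the node being lost. The `Worker` type and field names are hypothetical; the real alert is a PromQL expression.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Worker:
    name: str
    allocatable_gib: float  # A(n)
    requested_gib: float    # R(n)
    gpu: bool = False


def shortage_gib(workers: List[Worker]) -> float:
    """GiB of requests that would not fit if the busiest non-GPU worker died.

    A positive result corresponds to the alert firing.
    """
    non_gpu = [w for w in workers if not w.gpu]
    worst = max(non_gpu, key=lambda w: w.requested_gib)
    # Free room on every other Ready worker, GPU node included,
    # clamped at zero like clamp_min(A(n) - R(n), 0).
    free_elsewhere = sum(
        max(w.allocatable_gib - w.requested_gib, 0.0)
        for w in workers
        if w is not worst
    )
    return worst.requested_gib - free_elsewhere
```

The caller is expected to pass only Ready workers, since a NotReady node contributes no usable headroom.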

Currently fires with a 33.96 GiB shortage. Remediation: right-size top
reservers via Goldilocks (immich-server 8Gi, frigate 5Gi, prometheus
4.4Gi, pg-cluster 3Gi each, paperless 2Gi) or bump VM RAM on
k8s-node2/k8s-node3 from 32GB → 48GB to match node1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 02:34:13 +00:00
Viktor Barzin
467fa1631d excalidraw: migrate PVC from proxmox-lvm to NFS
Wave 1 of the per-VM SCSI-LUN cap relief. The proxmox-csi-plugin
hardcodes a `lun < 30` loop (pkg/csi/utils.go:394) — cap is 29
attachable PVCs per K8s node VM, and k8s-node1 was sitting at 29
with 4 stuck `no free lun found` PVCs queued behind it.

Excalidraw stores per-user .excalidraw scene files (no SQLite,
no embedded DB) — confirmed safe on NFS. 1.5 MiB of data,
4 active scenes. Migration:
  - Add nfs_volume module → apply
  - Scale to 0, rsync helper, swap claim_name → apply
  - Remove old proxmox-lvm PVC → apply
Net: -1 SCSI LUN on k8s-node2.

Refs: docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md
(separate concern; this is for the upstream LUN-cap pressure).
2026-05-26 02:33:41 +00:00
Viktor Barzin
16b3969ceb alloy: move resources to alloy.* (chart key bug); 1Gi limit fixes IO storm
The Alloy Helm chart maps `alloy.resources`, NOT `controller.resources`, onto
the alloy container. The block under `controller:` was silently dropped, so
the container ran with `resources: {}` and inherited the Kyverno LimitRange
`tier-defaults` 256Mi — well below Alloy's 400-450Mi steady state. The
cgroup ran at 255.8/256MB with ~50M memory-reclaim events, page-cache
thrashing drove ~185 MB/s sdc reads (12.18 TB in 24h), saturating the
Proxmox host and rippling out to all VMs + NFS.

Fix:
- Move resources to `alloy.resources` (correct chart key).
- Burstable QoS: request 512Mi, limit 1Gi. Workers are at 97-99%
  memory-request saturation cluster-wide; a 1Gi request blocks
  scheduling on node2/node3.
- Bump controller.updateStrategy.maxUnavailable to 50% so a 5-pod DS
  rolling update fits inside the helm timeout.
- Bump helm_release.alloy.timeout to 900s (default 300s was too short
  with occasional runc-stuck-Terminating on k8s-master).

Verified: all 4 alloy pods now show 1Gi/512Mi at the container level;
helm rev=8 deployed; per-pod memory 99-108Mi at steady state (well
under the new limit).

Memory ID 2726.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 02:08:35 +00:00
Viktor Barzin
b9ac942647 nvidia: fix driver install deadlock + extend startup probe
Two compounding issues prevented the GPU driver from installing after
the k8s-node1 kernel rollback to 6.8.0-117-generic (Ubuntu 24.04):

1. **Deadlock**: The k8s-driver-manager init container was stuck waiting
   for nvidia-operator-validator to shut down. The validator's
   driver-validation init container was in an infinite poll loop checking
   for /run/nvidia/validations/.driver-ctr-ready (which only appears after
   a successful driver install). The validator pod had deletionTimestamp
   set but its container remained in Terminating state indefinitely.
   Fix: force-delete the stuck Terminating validator pod to break the
   deadlock (kubectl delete --force --grace-period=0).

2. **Startup probe timeout**: Full driver install on this hardware
   (apt headers ~2min + gcc make -j16 ~12min + file copy ~7min = ~21min)
   overran the default 120×10s=20min startup probe window,
   causing SIGKILL (exit 137) at exactly 21 minutes even when the install
   was succeeding. Extended failureThreshold 120→300 (300×10s=50min window).
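
The probe arithmetic, as a short Python check. The helper name is hypothetical, and it ignores any initial delay, which the commit does not state.

```python
def probe_window_min(failure_threshold: int, period_seconds: int = 10) -> float:
    """Minutes a startup probe allows before the container is killed."""
    return failure_threshold * period_seconds / 60


# apt headers + compile + file copy, using the figures from the commit
INSTALL_MIN = 2 + 12 + 7

old_window = probe_window_min(120)  # previous failureThreshold
new_window = probe_window_min(300)  # new failureThreshold
```

The install time sits just above the old window and well inside the new one.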

Documented both root causes + recovery steps in the post-mortem.
values.yaml: add driver.startupProbe.failureThreshold: 300.

Note: the kubectl patch applied during recovery is a temporary fix;
this TF values.yaml change makes it durable via the next TF apply.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 11:53:44 +00:00
Viktor Barzin
da33919368 f1-stream: verifier — wrap m3u8 fetches through /proxy
The frontend already routes every m3u8 URL through `getProxyUrl` →
`/proxy?url=…` so CORS-restricted hosts work for users. The verifier
was the odd one out: it loaded m3u8 URLs directly into hls.js inside a
`data:` URL test page, which has Origin `null`. Hosts like
`oe1.ossfeed.store` (pitsport's playlist CDN) only set ACAO when the
request's Origin is `https://pushembdz.store`, so hls.js got an instant
`fatal_network_error` and every pitsport stream was marked dead even
though they play fine for real users.

Wrap the m3u8 URL the same way the verifier already wraps embed URLs:
`{PROXY_BASE}/proxy?url=<b64>`. Stays same-origin for hls.js, gets
ACAO:* from our proxy, and the rewritten variants are also proxy-wrapped
so subsequent fetches stay clean.
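
A minimal sketch of the wrapping step in Python. The base URL is a placeholder, and URL-safe base64 is an assumption: the commit only says `<b64>`, so the real service may use a different variant.

```python
import base64

PROXY_BASE = "https://proxy.example.invalid"  # hypothetical base URL


def wrap_m3u8(url: str, proxy_base: str = PROXY_BASE) -> str:
    """Build {PROXY_BASE}/proxy?url=<b64> for an upstream playlist URL.

    Assumption: URL-safe base64 of the UTF-8 URL; the encoding actually
    used by the service is not shown in the commit.
    """
    token = base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii")
    return f"{proxy_base}/proxy?url={token}"


def unwrap(proxied: str) -> str:
    """Inverse of wrap_m3u8, handy for checking round trips."""
    token = proxied.split("/proxy?url=", 1)[1]
    return base64.urlsafe_b64decode(token).decode("utf-8")
```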

For sites whose CDN serves any IP without Origin tricks (stremio,
dd12), this is transparent — proxy just forwards.

Side effect: every verified m3u8 hits our proxy once during extraction.
Cheap (1 cluster-internal request + 1 upstream HEAD/GET) and only
during the 5/30-min extraction cycle.
2026-05-24 22:26:56 +00:00
Viktor Barzin
7045559fee immich: harden against bulk-import load (memory + probe + Job retries)
Mid-flight stability changes from the 2026-05-24 Anca-elements import
that surfaced multiple latent issues under sustained load:

- `immich-postgresql` memory 3Gi → 5Gi. The original limit OOM-killed
  PG once the bulk insert + vector embeddings drove buffer pressure
  past 3 GiB. 5 GiB gives ~60% headroom over the observed steady
  state during ongoing imports.
- `immich-server` startup probe `failure_threshold` 30 → 360 (5min →
  1h). After any PG restart, immich-server reindexes `clip_index` +
  `face_index` (147k + 185k rows at the time of incident) before
  binding the API port. The old 5min budget was too tight, so each
  PG bounce trapped immich-server in a startup crashloop until the
  reindex was killed. 1h gives generous headroom.
- `kubernetes_job_v1.anca_elements_import.backoff_limit` 2 → 20 and
  `--concurrent-tasks` 8 → 20 on the immich-go upload. Short
  cluster blips (PG restart, KCM lease loss) were exhausting the
  Job's 3-attempt budget. The higher backoff limit tolerates a much
  rougher cluster, and 20 parallel hashers make dedup-on-resume
  ~2.5x faster.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 22:14:05 +00:00
root
445f30d955 Woodpecker CI deploy [CI SKIP] 2026-05-24 22:07:58 +00:00
Viktor Barzin
5cdac421c2 forgejo: pin to v11.0.14 + disable Keel (image-rewrite incident 2026-05-24)
On 2026-05-24T15:35:37Z Keel's force-policy rewrote the image tag from
`11.0.14 → 1.18` (codeberg.org/forgejo/forgejo). v1.18 is a Gitea-era
Forgejo (Forgejo forked from Gitea at 1.18 and used pre-Forgejo
versioning early on); the DB had already been migrated to schema 305
by 11.0.14, and 1.18 only knows up to migration 231 → pod refused to
start ("Your database (migration version: 305) is for a newer Gitea,
you can not use the newer database for this old Gitea release (231)").
Exact replay of the 2026-05-16 force-policy tag-rewriting bug
(memory id=1933).

Changes:
- Pin image to explicit `:11.0.14` (latest 11.x, published 2026-05-12)
- Add `keel.sh/policy: "never"` deploy annotation — overrides the
  Kyverno-stamped `force` policy via the chart's `+()` anchor semantics
  (memory id=1972). Keel will no longer touch this workload.
- Drop KEEL_IGNORE_IMAGE from `lifecycle.ignore_changes` (TF owns the
  image now). Restore it if you flip Keel back to `force`.
- Add the KEEL_LIFECYCLE_V1 trio (`kubernetes.io/change-cause`,
  `deployment.kubernetes.io/revision`, `keel.sh/update-time` on the
  pod template) so future TF applies don't fight K8s rollout metadata.

Verified: new pod on v11.0.14 came up Running 1/1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 22:06:59 +00:00
Viktor Barzin
5a0e4b3dac f1-stream: revive aceztrims + pitsport, more ppv variants
- aceztrims: scrape /f11/ (the actual stream page), not /f1/ (the
  cross-sport schedule). Drop the dead /iframe1?s= + onclick m3u8
  regexes (site moved to `getElementById('iframe').src = '...'` ~20
  channels ago). Strip HTML comments first so the ~20 legacy buttons
  kept inside <!-- ... --> stop showing up as false positives.
  Also pick up the default inline <iframe id='iframe' src='...'>.
  Local run: 11 channels (was 0).

- pitsport: decode the RSC payload before regex-matching in
  _parse_live_events (raw HTML had it escape-encoded, so the homepage
  card path was silently 0). Add the new /live-now route (canonical
  what's-live-right-now list). Add "f1" to MOTORSPORT_CATEGORIES — the
  site labels Formula 1 events as just "F1". Refresh the stale
  serveplay.site docstring (host rotates; pushembdz's api/stream link
  is authoritative).
  Local run: 7 m3u8 streams covering Canadian GP (EN1/EN2/MULTI/ITA/ESP)
  + NASCAR Coke 600 (was 0).

- ppv: always emit the parent embed alongside substreams (was dropping
  it whenever substreams existed). Prefer source_tag in substream titles
  so users see "Sky Sport 1 NZ" / "Apple TV (F1TV)" instead of generic
  #1/#2 suffixes.
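
The aceztrims fix (strip comments first, then match the `src` assignment) could look roughly like this in Python. The regexes and function name are illustrative guesses, not the extractor's actual code.

```python
import re
from typing import List

# Non-greedy so each <!-- ... --> block is removed on its own.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)

# Matches getElementById('iframe').src = '...' with either quote style.
SRC_ASSIGN_RE = re.compile(
    r"""getElementById\(\s*['"]iframe['"]\s*\)\.src\s*=\s*['"]([^'"]+)['"]"""
)


def extract_channels(html: str) -> List[str]:
    """Return iframe targets from live markup, ignoring commented-out buttons."""
    live = COMMENT_RE.sub("", html)
    return SRC_ASSIGN_RE.findall(live)
```

Running the match on the raw HTML instead would also pick up the legacy buttons kept inside comments.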

Diagnosed against the live cluster (curated + 7 other extractors
returning 0 cached streams, only 2 dead hmembeds curated 24/7 channels
visible to users). Each fix verified with the extractor run against
live sites this turn.
2026-05-24 22:05:37 +00:00
Viktor Barzin
d5f73ce109 backup: exclude /anca-elements/ from nfs-mirror + offsite Step 1
Anca's photos are being ingested into Immich (started 2026-05-24
afternoon), so /srv/nfs/immich/library/ becomes the canonical copy
for those photos. The separate /srv/nfs/anca-elements/ archive tree
+ its sda mirror at /mnt/backup/anca-elements/ are now redundant.

Going forward:
- nfs-mirror EXCLUDES /anca-elements/ so future weekly runs don't
  re-touch the 771G subtree (also no longer required since Immich
  has the data via its NFS library).
- offsite-sync Step 1 also excludes /anca-elements/ — the historical
  771G under /mnt/backup/anca-elements/ stays on sda for now but is
  NOT shipped to Synology pve-backup/ (Immich's library reaches
  Synology via Step 2 bypass leg anyway).

The 771G on /mnt/backup/anca-elements/ will be cleaned up manually
once Immich ingest completes and we verify all photos are in the
Immich library. Same for /srv/nfs/anca-elements/ on sdc thin pool —
freeing both would reclaim ~1.5 TB across sdc + sda.

In-flight context: today's nfs-mirror first run was killed mid-flight
at ~70% (was at /srv/nfs/postgresql/). The killed run wrote ~200G of
service NFS subtrees to /mnt/backup/<svc>/, then sda hit 95% used,
prompting this change. Next nfs-mirror run will not touch
anca-elements and will fit comfortably (~250G total for the keep-list
minus anca-elements).
2026-05-24 18:34:41 +00:00
Viktor Barzin
c948dc0dbe backup pipeline: flock manifest + cap + drop LAN -z
Three more audit fixes from the 2026-05-24 backup-pipeline review:

#5 (S1 race) — manifest flock
  daily-backup and nfs-mirror both append to /mnt/backup/.changed-files.
  If they overlap (nfs-mirror Mon 04:11 running long, daily-backup
  starting Mon 05:00), concurrent appends from `find | tee` and
  `find | sed >>` could interleave mid-line — partial paths would slip
  past rsync's --files-from. Both scripts now share a manifest_append()
  helper using `flock -x` on /mnt/backup/.changed-files.lock. The 4
  daily-backup call sites + the 1 nfs-mirror call site all pipe through
  it instead of redirecting directly.
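
  The locking pattern, sketched in Python rather than the bash helper the scripts actually use. Only the name manifest_append comes from the commit; the signature is an assumption.

```python
import fcntl
from typing import Iterable


def manifest_append(paths: Iterable[str], manifest: str, lock_path: str) -> None:
    """Append paths to the manifest while holding an exclusive flock.

    Writers that share lock_path are serialized, so lines from two
    concurrent runs cannot interleave mid-line.
    """
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            with open(manifest, "a") as out:
                for p in paths:
                    out.write(p + "\n")
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
```

  The lock lives in a separate file so that truncating or replacing the manifest never invalidates it.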

#7 (S2 unbounded manifest)
  daily-backup gains check_manifest_size() invoked after the PVE-config
  append (the last manifest writer of the run). Above MANIFEST_MAX_LINES
  (500k) it touches /mnt/backup/.force-full-sync — offsite-sync's Step 1
  now treats that flag the same as day-of-month ≤ 7 (full sync with
  --delete) and clears it on success. Catches the "Synology unreachable
  for many days" edge case where the manifest would grow unbounded.
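
  A Python sketch of the two decisions involved. Function names are hypothetical; the 500k threshold, the flag file, and the day-of-month rule are from the commit.

```python
import datetime
import os

MANIFEST_MAX_LINES = 500_000  # threshold named in the commit


def manifest_too_large(manifest: str, limit: int = MANIFEST_MAX_LINES) -> bool:
    """True when the manifest has more than `limit` lines."""
    with open(manifest, "rb") as f:
        return sum(1 for _ in f) > limit


def wants_full_sync(today: datetime.date, flag_path: str) -> bool:
    """Full sync in the first week of the month or when the flag file exists."""
    return today.day <= 7 or os.path.exists(flag_path)
```

  Clearing the flag after a successful full sync is left to the caller, as in the commit.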

#9 (wear — drop -z on LAN hops)
  offsite-sync rsync calls to Synology over the same 192.168.1.0/24
  gigabit LAN had `-rltz`. Compression burns CPU on the PVE host (already
  IO-busy) and gives nothing on a saturated GigE link. Dropped to `-rlt`
  on all 5 offsite rsync invocations (Step 1 full + Step 1 incremental +
  Step 2 full nfs + Step 2 full nfs-ssd + Step 2 incremental).

Other adjustments:
- nfs-mirror's find-after-rsync now also excludes the new state files
  (.changed-files.lock, .force-full-sync) when populating the manifest.
- offsite-sync Step 1 full-sync excludes the same .force-full-sync flag
  so it doesn't ship to Synology.

Deployed to PVE host (/usr/local/bin/{daily-backup,nfs-mirror,
offsite-sync-backup}). Currently in-flight nfs-mirror run is unaffected
(bash loaded the old script into memory at start). Next runs use the
new behaviour.

Refs: 2026-05-24 audit Section 2 items #1 (manifest race), #4 (unbounded
manifest), #6 (LAN -z wear).
2026-05-24 16:27:42 +00:00
Viktor Barzin
4798583db7 backup pipeline: S1 fixes from 2026-05-24 audit
Three immediate fixes surfaced by the backup-pipeline audit:

1. **S1 silent-loss race fix** (daily-backup.sh:142): remove the
   `> "${MANIFEST}"` truncation at the start of daily-backup. Truncation
   already lives in offsite-sync-backup at line 159, gated on a successful
   sync. With both scripts truncating, an offsite-sync failure followed by
   the next morning's daily-backup would silently wipe yesterday's
   unconsumed manifest entries — those files would only reach Synology
   via the monthly full sync (1st-7th of month). Now only offsite-sync
   truncates, and only on success.

2. **Missing alert OffsiteBackupSyncFailing**: documented in backup-dr.md
   but was never added to prometheus_chart_values.tpl. Step 1 or Step 2
   failure pushes offsite_sync_last_status=1 but nothing read it. Added.

3. **wear: drop `-z` from local-only rsyncs** (daily-backup.sh:218 PVC
   snapshot rsync + line 347 /etc/pve sync). Both are local-to-sda
   transfers — compression wastes CPU and yields nothing (gigabit local
   path, intermediate disk doesn't benefit).

Bonus cleanups (zero functional impact):
- "Weekly backup starting/complete" → "daily-backup starting/complete"
  (the timer is daily, not weekly — legacy from earlier monthly-rotation
  schedule).
- "--- Step 2: PVC file copy ---" → "Step 1:" (was numbered from 2 with no
  Step 1 above).
- **wear: pfSense full filesystem tar now Sunday-only** instead of daily.
  config.xml stays daily (it's the primary restore artifact and tiny).
  Full tar is forensic recovery only — re-tarring ~100MB+ daily writes
  ~3G/month to sda + Synology for unchanged content. Weekly is plenty.

docs/architecture/backup-dr.md: rewritten Overview + 3-2-1 breakdown to
reflect today's two-leg architecture; added a "2026-05-24 session"
changelog summary at the top; added a "Synology snapshot management"
subsection with the sudo + `synosharesnapshot` recipe (DSM API is gated
by 2FA so this is the only programmatic path); updated Key Files table
with nfs-mirror + the Synology SSH access notes.

Open follow-ups from the audit (S2 — file as beads if pursued):
- Factor two-leg invariant into /etc/backup-skip-list.conf sourced by
  both nfs-mirror.sh and offsite-sync-backup.sh.
- Manifest write-collision flock between nfs-mirror Mon 04:11 and
  daily-backup Mon 05:00.
- Unbounded manifest cap (force full sync if > 500k lines).
- Synology free-space scraper + alert.
- LVM thin pool meta-pool fill alert.
- nfs-change-tracker.service heartbeat to Pushgateway.
- Synology config drift TF surface (snap retention, share defs).
2026-05-24 16:18:44 +00:00
Viktor Barzin
9277d71d81 nfs-mirror: append transferred files to offsite-sync manifest
Step 1 of offsite-sync-backup is incremental on non-monthly days,
driven by /mnt/backup/.changed-files which only daily-backup wrote
to. nfs-mirror's writes were therefore invisible to Step 1 until the
next monthly --delete pass — which would *also* wipe data
pre-positioned on Synology pve-backup/ (e.g. the in-place btrfs
rename we just did to relocate ~160G of NFS subtrees from
/Backup/Viki/nfs/<svc>/ to /Backup/Viki/pve-backup/<svc>/).

Fix: snapshot a timestamp before rsync, then after rsync use
`find -newer $STAMP -type f -printf '%P\n'` to enumerate every file
nfs-mirror created/modified and append to the manifest. Paths are
relative to /mnt/backup/ (matches Step 1 --files-from expectation).
State files are excluded.
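
The enumeration step, approximated in Python. It mimics `find -newer $STAMP -type f -printf '%P\n'` by comparing mtimes; the function name is hypothetical and the state-file exclusion is omitted.

```python
import os
from typing import List


def files_newer_than(root: str, stamp_path: str) -> List[str]:
    """List regular files under root modified after stamp_path, relative to root."""
    stamp_mtime = os.stat(stamp_path).st_mtime_ns
    changed = []
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            full = os.path.join(dirpath, name)
            if os.path.islink(full) or not os.path.isfile(full):
                continue  # like -type f: regular files only
            if os.stat(full).st_mtime_ns > stamp_mtime:
                changed.append(os.path.relpath(full, root))
    return sorted(changed)
```

Note that rsync with `-t` preserves source mtimes, so a check based purely on mtime may behave differently from what the shell pipeline sees in practice; treat this as a sketch of the idea only.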

The current in-flight first run started before this patch was
deployed, so its writes won't auto-populate the manifest — a one-off
manual backfill will be done after it completes.
2026-05-24 15:32:22 +00:00
Viktor Barzin
15745eab2f backup: retire anca-elements-mirror + anca-elements-sync.sh
Both subsumed by nfs-mirror (deployed earlier this session) — see
commit 4d756be4. anca-elements-sync.sh is now dead code because its
upstream (Synology /volume1/Backup/Anca/Elements) was deleted today
once the sda mirror was parity-verified (109,624 files /
827,480,937,976 bytes equal both sides). PVE NFS is the source of
truth for the archive from here on.
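
The commit does not say how parity was checked. One way to get the same kind of file-count and byte-total comparison is sketched below in Python; the function names are hypothetical, and matching totals are weaker evidence than per-file checksums.

```python
import os
from typing import Tuple


def tree_totals(root: str) -> Tuple[int, int]:
    """Return (file_count, total_bytes) for regular files under root."""
    count = total = 0
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            full = os.path.join(dirpath, name)
            if os.path.isfile(full) and not os.path.islink(full):
                count += 1
                total += os.path.getsize(full)
    return count, total


def in_parity(src: str, dst: str) -> bool:
    """True when both trees hold the same number of files and bytes."""
    return tree_totals(src) == tree_totals(dst)
```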

Final script inventory on the PVE host (down from 6 to 4):
- /usr/local/bin/daily-backup           (block PVCs + sqlite + pfsense)
- /usr/local/bin/lvm-pvc-snapshot       (snapshot management)
- /usr/local/bin/nfs-mirror             (NFS local mirror to sda)
- /usr/local/bin/offsite-sync-backup    (sda + bypass-list NFS to Synology)
2026-05-24 14:58:55 +00:00
root
9e2163040b Woodpecker CI deploy [CI SKIP] 2026-05-24 14:23:44 +00:00
Viktor Barzin
d6590612b2 immich: bulk-import Anca's Elements photo archive into her account
Grows pve/nfs-data 3T → 4T (online lvextend + resize2fs) to absorb ~340 GB
of new originals landing under /srv/nfs/immich/upload during the import.

Adds:
- module "nfs_anca_elements_host" — RO PVC over /srv/nfs/anca-elements,
  consumed only by the import Job (not mounted in immich-server).
- kubernetes_job_v1.anca_elements_import — immich-go v0.31.0 uploader
  posting to immich-server.immich.svc:2283 with Anca's API key (synced
  via the existing immich-secrets ExternalSecret from
  secret/immich.anca_api_key). Filters to image extensions, bans the
  non-photo top-level dirs (filme/, Music/, carti/, courses, installers,
  docs, etc.), puts every asset in the album "Poze (Elements)". Default
  `--pause-immich-jobs` is disabled — non-admin keys can't pause jobs.
- docs/architecture/storage.md — note the new 4 TB size in 3 places.
- docs/runbooks/grow-pve-nfs-lv.md — captures the one-shot lvextend
  procedure (no pve-host TF stack exists for this).

Job is removed in the follow-up cleanup commit once the upload completes;
the PVC stays for a videos batch later.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 14:12:30 +00:00
Viktor Barzin
4d756be4f5 backup: consolidate to one local-mirror script + invert offsite filter
Before this commit, the in-flight design split anca-elements (its own
mirror script + timer) from the rest of /srv/nfs (still going to
Synology via inotify-tracked offsite-sync). It also meant Synology
received some bytes via both paths (sda → Synology AND direct NFS →
Synology), which doubled consumption.

This commit collapses both into a clean 3-2-1:

  Copy 1 (sdc):       live /srv/nfs/* + cluster block PVCs
  Copy 2 (sda):       /mnt/backup/{pvc-data,sqlite-backup,pfsense,
                                   pve-config,<critical-nfs>/}
                      ← daily-backup + nfs-mirror (one script each)
  Copy 3 (Synology):  /Backup/Viki/{pve-backup,nfs,nfs-ssd}
                      ← offsite-sync-backup Step 1 (sda → Synology)
                        + Step 2 (sda-BYPASS paths only → Synology direct)

scripts/nfs-mirror.{sh,service,timer}:
  New consolidated weekly mirror. Replaces anca-elements-mirror (to be
  removed in a follow-up after the current in-flight rsync completes,
  parity-verified, and Synology source-of-truth is deleted). Single
  rsync /srv/nfs/ → /mnt/backup/ with an explicit EXCLUDES list that
  drops paths not worth a local 2nd copy: immich (1.2T — too big),
  frigate (14d ring), prometheus/loki (rebuildable), ollama/llamacpp/
  audiblez/ebook2audiobook (re-fetchable), *-backup (already backups),
  temp/alertmanager (transient). Nice=10, IOSchedulingClass=idle.

scripts/offsite-sync-backup.sh:
  Step 2 (NFS → Synology) filter inverted: instead of `--exclude=
  anca-elements/`, it now `--include`s only the sda-BYPASS paths
  (immich, frigate, prometheus, *-backup, …). The bypass-include
  regex MUST stay in lockstep with nfs-mirror's EXCLUDES — they are
  complementary and any drift creates either gaps or duplication on
  Synology. Comment in the script flags this.
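
  A drift check for the two lists could be as small as the Python sketch below. It treats both lists as plain sets of path names, which is a simplification: the real scripts use rsync patterns and a regex, and the function name is made up.

```python
from typing import Set, Tuple


def lockstep_drift(
    mirror_excludes: Set[str], bypass_includes: Set[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (gaps, duplicates) between the two lists.

    gaps:       excluded from the local mirror but not sent directly
                offsite, so the path never reaches the offsite copy.
    duplicates: sent directly offsite although the local mirror also
                keeps it, so it reaches the offsite target twice.
    """
    gaps = mirror_excludes - bypass_includes
    duplicates = bypass_includes - mirror_excludes
    return gaps, duplicates
```

  Both sets come back empty exactly when the lists are in lockstep.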

monitoring alerts: renamed AncaElementsMirror{Stale,Failing} to
NfsMirror{Stale,Failing} matching the new metric job name
`nfs-mirror`. Thresholds unchanged.

docs/architecture/backup-dr.md: rewritten Step 1/Step 2 sections and
added the bypass-list rationale + cross-reference between scripts.

NOT YET DEPLOYED — gated on the in-flight anca-elements-mirror rsync
finishing + parity verification + Synology /volume1/Backup/Anca/
Elements deletion. The old scripts (anca-elements-{mirror,sync.sh})
remain on the PVE host until then, and will be removed in a cleanup
commit.
2026-05-24 12:49:20 +00:00
Viktor Barzin
416c2a0468 monitoring: add AncaElementsMirror{Stale,Failing} alerts
Layer 3a (anca-elements local mirror) now has the same alert coverage
as offsite-sync-backup:
- AncaElementsMirrorStale fires if the last run is more than 16d old
  (last_run_timestamp; 2 weekly cycles, matches the 8d → 9d slack used elsewhere)
- AncaElementsMirrorFailing fires if last_status != 0

BackupDiskFull (existing) covers the sda fill-up risk at 85%.

Not applied this commit — pick up on next monitoring stack apply.
2026-05-24 11:55:19 +00:00
Viktor Barzin
6db64fe060 anca-elements: weekly local mirror sdc → sda (replaces Synology as 2nd copy)
Synology is being removed as a host for the Anca/Elements archive
(770G). /srv/nfs/anca-elements on PVE becomes the source of truth;
sda /mnt/backup/anca-elements becomes the single-disk-failure mirror.
No offsite for this archive — by design.

- scripts/anca-elements-mirror.sh: rsync -rlt --delete -H, idempotent,
  pushes anca_elements_mirror_last_{run_timestamp,status,bytes} to
  Pushgateway, lockfile in /run, SIGTERM-safe (status=2 on abort).
- .service: oneshot, Nice=10, IOSchedulingClass=idle, 5h timeout.
- .timer: weekly Mon 04:00, Persistent=true, 15-min randomised delay.
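
A rough sketch of the lockfile and abort-status pattern described for the
script above. It is not the deployed script: the lock path default, the
function names, and printing the metrics to stdout (instead of pushing
them to Pushgateway) are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only. The real script keeps its lock in /run and pushes to
# Pushgateway; this version uses a /tmp default and prints the metrics.
LOCKFILE="${LOCKFILE:-/tmp/mirror-sketch.lock}"

acquire_lock() {
  # With noclobber the redirect fails if the file exists, so creation is atomic.
  ( set -o noclobber; echo "$$" > "$LOCKFILE" ) 2>/dev/null
}

render_metrics() {  # $1 = exit status, $2 = epoch seconds
  printf 'anca_elements_mirror_last_status %s\n' "$1"
  printf 'anca_elements_mirror_last_run_timestamp %s\n' "$2"
}

run_mirror() {  # "$@" = the sync command, e.g. the rsync invocation
  acquire_lock || { echo "another run holds $LOCKFILE" >&2; return 1; }
  local status=0 aborted=0
  trap 'aborted=1' TERM
  "$@" || status=$?
  trap - TERM
  [ "$aborted" -eq 0 ] || status=2   # report an interrupted run as status=2
  rm -f "$LOCKFILE"
  render_metrics "$status" "$(date +%s)"
  return "$status"
}
```

Recording the abort in a flag and mapping it to status=2 after the
command returns keeps an interrupted run distinguishable from an
ordinary rsync failure in the pushed metric.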

Deployed to PVE host; timer enabled; initial 770G sync running in
background. Synology original to be deleted after first run completes
and parity is verified.

docs/architecture/backup-dr.md: documents Layer 3a + updated path
exclusion rationale (PVE is now upstream, not downstream).
2026-05-24 11:51:52 +00:00
Viktor Barzin
34f8c0f537 docs+scripts: lock in nextcloud-as-PVE-NFS-browser surface
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline was successful
- docs/architecture/storage.md: new "Nextcloud as PVE-NFS browser"
  section documenting mount-per-archive + applicable_users model,
  why mount-level ACL beats Files Access Control on NC 30/31, the
  manifest shape (with current applicableUsers + enableSharing
  fields), and the trade-off
- docs/runbooks/nextcloud-add-archive.md: 5-step runbook to surface
  a new directory under /srv/nfs/* to specific NC users via the
  bootstrap Job
- scripts/anca-elements-sync.sh: deployed at
  /usr/local/bin/anca-elements-sync.sh on the PVE host; fpsync from
  Synology Anca/Elements to /srv/nfs/anca-elements (idempotent +
  resumable). The PVE replica is what the NC /anca-elements mount
  serves; the offsite-sync pipeline excludes this path (committed
  earlier this session) so we don't write it back to Synology

NC usernames are admin/anca/emo (not display names — admin is
Viktor). Stale "viktor" references in the manifest example dropped.
2026-05-24 11:45:01 +00:00
Viktor Barzin
c624caf65a nextcloud(external_storage): add per-mount enableSharing option
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline was successful
Lets admin natively share folders from inside an external mount with
internal users/groups or via public link. The two PVE pool browsers
(visible to admin only) get enableSharing=true so they can act as a
"share-from picker" over /srv/nfs and /srv/nfs-ssd; /anca-elements
stays false so anca manages re-sharing inside her own view.

- Manifest schema gains enableSharing on rootMounts + archiveMounts.
- Bootstrap Job adds sync_option() and reconciles enable_sharing via
  occ files_external:option (idempotent — occ no-ops same-value set).
2026-05-24 11:39:16 +00:00
root
37e563d5a9 Woodpecker CI deploy [CI SKIP] 2026-05-24 11:31:53 +00:00
Viktor Barzin
cb1a34fd00 nextcloud: expose PVE NFS roots + /anca-elements via Files External
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline was successful
Mounts the Proxmox host NFS exports (/srv/nfs and /srv/nfs-ssd) into
the NC pod and surfaces them through occ files_external:create:

- /PVE NFS Pool      → /mnt/pve-nfs       (admin group only)
- /PVE NFS-SSD Pool  → /mnt/pve-nfs-ssd   (admin group only)
- /anca-elements     → /mnt/pve-nfs/anca-elements  (admin, anca users)

Mount visibility is controlled by occ files_external:applicable; no
Files Access Control. ACL state is reconciled idempotently by a
bootstrap Job that diffs desired vs current applicable_users /
applicable_groups (via files_external:list --output=json).

Bootstrap fixes vs initial design:
- Sync loop used `[ -n "$U" ] && cmd` which returns 1 on empty input,
  triggering set -e on no-op re-runs. Switched to process substitution
  `< <(jq ...)` so empty diff -> loop body never runs -> 0 exit.
- RBAC missed `watch` verb (kubectl wait spammed reflector errors).
- Manifest used display-name "viktor" instead of NC username "admin"
  for the /anca-elements applicable list.
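
The loop fix in the first bullet can be sketched as follows. This is a
stand-alone illustration, not the bootstrap Job itself: `users_to_add`
stands in for the `jq` diff, and `apply_diff` and its output are
hypothetical.

```shell
#!/usr/bin/env bash
set -e

# Stand-in for the jq diff (hypothetical): prints one username per line,
# or nothing at all when desired and current state already match.
users_to_add() { [ "$#" -eq 0 ] || printf '%s\n' "$@"; }

apply_diff() {
  local added=0 user
  # Process substitution feeds the loop directly. Empty output means zero
  # iterations, and a loop whose body never runs exits 0 under set -e.
  while IFS= read -r user; do
    added=$((added + 1))
    echo "would add applicable user: $user"
  done < <(users_to_add "$@")
  echo "changes: $added"
}

apply_diff             # no-op re-run: prints "changes: 0"
apply_diff admin anca  # prints two "would add" lines, then "changes: 2"
```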

Chart values: added two PV-backed volume mounts at /mnt/pve-nfs[+ssd]
and pinned securityContext to fsGroup=33 with fsGroupChangePolicy:
OnRootMismatch (chart default Always would recurse 600k+ files on
every pod restart).
2026-05-24 11:27:26 +00:00
Viktor Barzin
7a649ce7eb crowdsec: pin image to v1.7.8 + remove ENROLL_KEY, CAPI restored
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline was successful
Root cause of today's CAPI 403 crashloop: chart 0.21.0 pins appVersion
to v1.7.3, but Keel had auto-bumped the running pods to v1.7.8 on
2026-05-16 and they ran fine with CAPI for 8 days. Today's TF apply
(b59acbc1 agent memory bump) re-rendered the deployment from chart
defaults, reverting the image to v1.7.3 — and v1.7.3 has a CAPI
watcher-auth bug against the current api.crowdsec.net behaviour, so
every fresh replica started 403'ing on startup.

Fix: set `image.tag: "v1.7.8"` in values.yaml so the image survives
future TF applies independently of the chart's appVersion. Verified
CAPI auth succeeds on all 3 fresh pods with v1.7.8.

Also dropped the ENROLL_KEY env block — the existing key `cmey5e636…`
is single-shot and was already consumed by the first replica;
subsequent pods hit 403 on `cscli console enroll`. CAPI works WITHOUT
console enrollment (separate flows). Re-enable console reporting by
generating a fresh enroll key at app.crowdsec.net (procedure
documented in the values.yaml comment block).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 11:11:29 +00:00
Viktor Barzin
f55eaae682 docs/backup-dr: document /srv/nfs/anca-elements offsite-sync exclusion
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline was successful
2026-05-24 11:03:50 +00:00
Viktor Barzin
05f047f290 offsite-sync-backup + nfs-change-tracker: exclude /srv/nfs/anca-elements
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline was successful
The 771G under /srv/nfs/anca-elements is a downstream replica synced
FROM Synology (/volume1/Backup/Anca/Elements) by anca-elements-sync.sh.
The offsite-sync pipeline was copying it back to Synology under
/volume1/Backup/Viki/nfs/anca-elements, creating a self-duplicate
(~122G already partially copied during the last monthly full sync).

- nfs-change-tracker.service: drop anca-elements/ from inotify watch
  (incremental syncs no longer queue these paths)
- offsite-sync-backup.sh: --exclude='anca-elements/' on the monthly
  full rsync; grep -v on the incremental files-from list
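
A sketch of the incremental-list filter, assuming the files-from list
holds paths relative to /srv/nfs (the actual list format is not shown in
this commit). `filter_files_from` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Drop anca-elements/ entries from a files-from list read on stdin.
filter_files_from() {
  # grep exits 1 when no line survives the filter; treat that as an empty
  # list, not an error, so a caller running under set -e does not abort.
  grep -v '^anca-elements/' || [ $? -eq 1 ]
}

printf 'immich/a.jpg\nanca-elements/x\nfrigate/b\n' | filter_files_from
```

The `|| [ $? -eq 1 ]` guard still lets a real grep error (exit status 2)
propagate.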

Deployed to 192.168.1.127:/usr/local/bin/offsite-sync-backup +
/etc/systemd/system/nfs-change-tracker.service; service reloaded.
2026-05-24 11:03:09 +00:00
Viktor Barzin
41786b0fca crowdsec: DISABLE_ONLINE_API=true — break the recurring 403 crashloop
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline was successful
CAPI auth at api.crowdsec.net is rejecting watcher logins from inside
the cluster within ~1h of registration, even after rotating creds via
`cscli capi register`. The same login successfully authenticates from
devvm but fails from cluster pods → IP-throttle or account-state issue
at the central API. Until that's resolved with CrowdSec support (or
the throttle window resets), running with CAPI on is just chronic
crashloops on every fresh replica.

`DISABLE_ONLINE_API=true` makes the chart entrypoint
`conf_set 'del(.api.server.online_client)'`, removing the online_client
block entirely. Pods skip CAPI auth, no 403, no crashloop. Trade-off:
no community blocklists. Local scenarios + bouncers continue
unchanged.

Side-effect of disabling CAPI in this chart (v0.21.0) — `role.yaml`
is gated on `IsOnlineAPIDisabled=false` while `cscli-lapi-register-job`
is gated on `StoreLAPICscliCredentialsInSecret=true` (orthogonal). So
the hook runs without the Role it needs, and atomic apply rolls back.
Mitigation: pre-created the `crowdsec-lapi-cscli-credentials` Secret
manually (the hook short-circuits when the secret already exists) and
re-applied the missing Role for future re-enablement.

Re-enable path documented in the comment block.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 10:31:03 +00:00
Viktor Barzin
1f6facc8e4 Merge forgejo/master — reconcile 18-day divergence with origin
Some checks failed
ci/woodpecker/push/build-cli Pipeline failed
ci/woodpecker/push/default Pipeline was successful
Origin and forgejo had drifted since 2026-05-05 (merge base b45c45e4).
Each remote was receiving Viktor's commits independently — origin since
2026-05-23 and forgejo from 2026-05-06 to 2026-05-22 14:15. Both had
~30 substantive commits. This merge brings forgejo's work into the
local branch.

13 conflict files resolved as follows (all favoured HEAD = origin/local,
which is newer in every case):

- secrets/{fullchain,privkey}.pem — kept HEAD (renewed 2026-05-24,
  vs forgejo's 2026-05-17 renewal)
- stacks/blog/main.tf — kept HEAD (ingress-www intentionally removed
  today after DNS+monitor cleanup; forgejo had the old block)
- stacks/xray/modules/xray/main.tf — kept HEAD (vless dropped today
  as dead ingress; forgejo had the old 3-port service)
- stacks/k8s-version-upgrade/scripts/upgrade-step.sh — kept HEAD
  (allowlist refactor, master-phase idempotency skip, tigera-operator
  quiesce/restore, IngressTTFBCritical ignore — all newer than forgejo)
- stacks/k8s-version-upgrade/main.tf — kept HEAD (deployments/scale
  RBAC, oldest-kubelet detection — both added 2026-05-23)
- scripts/update_k8s.sh — kept HEAD (--etcd-upgrade=false fallback)
- stacks/llama-cpp/main.tf — kept HEAD (KEEL_LIFECYCLE_V1 ignore_changes
  block added today, commit 0b1282a1)
- stacks/openclaw/main.tf — kept HEAD (nim/meta/llama-3.1-70b primary)
- stacks/trading-bot/main.tf — kept HEAD (claude-haiku-4-5 pin +
  kevin-signal-bridge container)
- stacks/postiz/modules/postiz/main.tf — kept HEAD (memory 2Gi/3Gi
  bump, despite postiz being destroyed today — kept TF intent)
- stacks/nvidia/modules/nvidia/values.yaml — kept HEAD (mem 822Mi)
- stacks/monitoring/modules/monitoring/prometheus_chart_values.tpl —
  kept HEAD (richer alert list + raised StatefulSet `for: 3m`)
- stacks/kyverno/modules/kyverno/security-policies.tf — kept HEAD
  (expanded registry allowlist + comments)
- docs/architecture/security.md — kept HEAD (detailed W1.7 analysis)
- docs/plans/2026-05-21-ha-control-plane-design.md — kept HEAD
  (178-line superset incl. 2026-05-23 deferral rationale)

Auto-merged (no conflict): broker-sync, claude-agent-service,
cloudflared, mailserver, n8n, technitium, traefik, url, proxmox-csi,
xray (deployment portion). Brings in forgejo-only substantive commits:
fire-planner, openclaw v3 flow + recruiter-responder wiring, several
k8s-version-upgrade hardening passes (kill-switch, RecentNodeReboot
ignore, pipefail fixes), HA control plane design, security wave 1
expansion to tier 3+4, alloy file-tail switch, prometheus scrape 2m,
authentik replica cut, forgejo archive disable.

Meta: forgejo and origin drift is a coordination bug. Going forward we
need to either (a) have one CI mirror to the other, or (b) standardize
on one remote. Filed mentally; not addressed in this commit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 09:41:36 +00:00
Viktor Barzin
0b1282a13c llama-cpp: ignore_changes for keel/k8s-managed annotations
Every `tg apply` was reverting the annotations that keel patches when it
detects an upstream digest change — `keel.sh/match-tag` (Kyverno-stamped),
`keel.sh/update-time` (on the pod template; what actually triggers the
rollout), plus the K8s-managed `kubernetes.io/change-cause` and
`deployment.kubernetes.io/revision`. The revert forced a rollout, then
the next keel poll re-stamped the annotations, forcing another. With
llama-swap's ~10s cold-load on each pod recreate, the user noticed.

Upstream `ghcr.io/mostlygeek/llama-swap:cuda` is a moving nightly tag —
keel still drives one legitimate rollout per day at ~07:25 UTC; this
patch stops the apply-driven extra rollouts on top of that.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 09:01:17 +00:00
67f8be4598 trading-bot: add kevin_signal_bridge container (kill-switch OFF for Phase 1)
5th worker container running in audit-only mode. Writes
kevin_signal_bridge_state rows showing what it WOULD trade but never
publishes to signals:generated. Kill-switch flipped in Phase 2.
2026-05-24 01:22:53 +00:00
Viktor Barzin
6218868ea5 xray: drop dead vless ingress + pin Service target_port
The xray-vless ingress, Service port 6443, and container port 6443 had
no backing listener — xray.config.json only binds 7443 (REALITY), 8443
(WS) and 9443 (XHTTP). The "xray-vless" hostname was returning 502
since the module was created.

Side effect: removing the first Service port slot ("vless"/6443) caused
the kubernetes provider to shift targetPort values on the remaining
two ports (defaulting only worked at create time, not on port removal).
Pinning target_port explicitly makes Service routing deterministic.

End-to-end verified: REALITY via public IP:8080 (pfSense forward 8080
-> 10.0.20.200:7443), WS via Cloudflare, XHTTP via Cloudflare — all
three transports proxied successfully through a test pod, egress IP
correctly resolves to the home WAN.
2026-05-24 01:13:54 +00:00
Viktor Barzin
ae874e028d postiz: bump memory request 512Mi → 2Gi, limit 4Gi → 3Gi (right-size for next deploy)
krr 2026-05-22 flagged postiz-app as critically under-requested when it
was running (gap 2.2 GiB above the 512Mi request). Postiz is currently
uninstalled in the cluster — this change is only for when the stack is
re-deployed later. No apply triggered now.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 01:11:25 +00:00
Viktor Barzin
b59acbc1db crowdsec/agent: bump memory request 64Mi → 128Mi
krr 2026-05-22 flagged crowdsec-agent DaemonSet (4 pods) as under-
requested by ~588 MiB across the cluster. Live usage sits around the
80-128 MiB mark during active log parsing, so the 64 MiB request risked
eviction ahead of more-needed pods. Limit stays at 512 MiB.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 01:11:16 +00:00
Viktor Barzin
7108843b38 nvidia/driver-daemonset: bump memory request 256Mi → 822Mi
krr 2026-05-22 flagged nvidia-driver-daemonset as critically
under-requested (~566 MiB gap). Live driver process holds ~600-800Mi
once the kernel module is loaded. Limit stays at 2Gi so the DKMS build
during a kernel upgrade still has headroom (documented in values.yaml
to need ~1.4 GiB peak).

May help unblock code-8vr0 (GPU driver crashloop on node1) if the
crashloop was OOM-driven.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 01:11:06 +00:00
Viktor Barzin
2711d4af05 monitoring/loki: bump memory request 2Gi → 3Gi (close gap to 4Gi limit)
krr 2026-05-22 flagged loki as under-requested by 1.9 GiB. Live working
set is sitting at ~3 GiB during normal ingestion; the existing 2 GiB
request meant the scheduler didn't reserve enough room and the pod risked
eviction. Limit stays at 4 GiB (documented ceiling in loki.yaml).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 01:10:55 +00:00
Viktor Barzin
c77984a713 proxmox-csi/node: bump memory request 64Mi → 1Gi (LUKS unlock reservation)
The CSI node plugin's LUKS2 Argon2id key derivation peaks at ~1 GiB
during unlock (see memory id=712; also already documented alongside limits=1280Mi).
Request was 64 MiB — meaning the unlock burst ran "best-effort", first in
line for OOM under node pressure. krr 2026-05-22 flagged this as a top
under-request. Bumping request matches the documented requirement.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 01:10:44 +00:00
Viktor Barzin
467460cccd k8s-version-upgrade: ignore IngressTTFBCritical in halt-on-alert check
The Synology DSM (port 5001) ingress chronically trips IngressTTFBCritical
because of NAS-side latency that is unrelated to k8s upgrades. The chain
was halting indefinitely waiting for it to clear. Add it alongside
RecentNodeReboot to the per-call ignore regex so the chain can proceed
autonomously without manual silences.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 01:10:44 +00:00
Viktor Barzin
447bfef507 blog: remove www.viktorbarzin.me ingress
The www subdomain was internal-only (no Cloudflare DNS record) but the
external uptime-kuma monitor still flagged it as down because public DNS
resolution failed. Removing the ingress along with the Technitium CNAME
makes the failure mode disappear and lets the cluster reach an
autonomous-clean state.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 01:10:44 +00:00