infra

Author	SHA1	Message	Date
Viktor Barzin	b3dcccfc41	vaultwarden: track :latest tag for Keel auto-upgrade (was 1.35.7) Earlier today Keel's hourly poll caught vaultwarden's deployment in a window where the `keel.sh/match-tag` annotation wasn't set, fell into 'watch repository tags' mode, and rewrote 1.35.7 -> 1.21.0. Vaultwarden 1.21.0 doesn't have the API endpoints the modern Bitwarden clients call (/identity/accounts/prelogin/password, /api/devices/knowndevice, /api/config), so the Chrome extension started 404-ing on login. Same race shape as the 2026-05-17 authentik/pgbouncer incident. The fundamental issue: `policy: force` on a semver-pinned tag is unsafe because Keel happily rewrites the tag string if it can't find a stable 'current tag' to digest-watch. Fix: switch to `:latest` (the mutable tag vaultwarden publishes for the newest stable release). Keel now digest-watches `:latest` (safe mode) and rolls forward on each upstream release. Matches cluster convention (128 other Keel-managed workloads use the same `:latest` + force + match-tag pattern). Also added imagePullPolicy=Always (required with :latest so the kubelet revalidates the manifest on each rollout instead of using a cached layer), and extended the lifecycle.ignore_changes to cover the match-tag annotation and kubernetes.io/change-cause (Keel rewrites this on every rollout). Current `:latest` digest -> vaultwarden 1.36.0 (released 2026-05-03). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 13:26:36 +00:00
Viktor Barzin	8ed427a7e4	cloud-init: hands-off k8s worker provisioning + 5 bug fixes Goal: re-clone the worker template, boot, and have it appear as `kubectl get nodes …Ready` with no manual steps. Adds `scripts/provision-k8s-worker NAME VMID IP` and rebuilds the cloud-init pipeline that was failing five distinct ways on a clean boot. Bugs fixed (all hit during the k8s-node5 + k8s-node6 builds today): 1. `indent(6, containerd_config_update_command)` indented the bodies of `cat >> /etc/containerd/config.toml <<'CONTAINERD_GC'` heredocs, so [plugins.*] TOML sections landed in /etc/containerd/config.toml at col 6 — containerd refused to parse them. Source is now a normal .sh file (`modules/create-template-vm/k8s-node-containerd-setup.sh`) base64-embedded into `write_files`; YAML whitespace never touches the heredoc bodies. 2. The same script tried to `cat >> /etc/containerd/config.toml` `[plugins."io.containerd.gc.v1.scheduler"]` etc., which containerd v2.2.4's `config default` ALREADY emits. Result: `toml: table … already exists`. Patched with sed-in-place overrides instead. 3. Kubelet tuning (sed against /var/lib/kubelet/config.yaml) ran from the containerd setup script — BEFORE `kubeadm join` writes that file. Sed aborted with "No such file or directory", `set -e` killed the script, post-script cloud-init steps kept going (cloud-init doesn't stop on runcmd failure). Split into a dedicated `k8s-node-post-join-tune.sh` invoked AFTER kubeadm join. 4. cloud_init.yaml fallocate'd a 4G swapfile and `swapon`'d it BEFORE kubeadm join. kubelet defaults to failSwapOn=true → exited 1 immediately. Replaced the swap setup with `swapoff -a` (node4 already runs this way and the cluster is fine). 5. Without `hostname:` in the shared user-data snippet, Proxmox's auto-generated meta-data does NOT include local-hostname when `cicustom user=…` is set — so cloud-init falls back to the cloud image's default `ubuntu` and `kubeadm join` registers the wrong node name. 
`provision-k8s-worker` now writes a per-node `<NAME>-meta.yaml` snippet and passes both via `cicustom user=…,meta=…`. Other improvements rolled in while fixing the above: - `ssh_public_key` read from Vault (`secret/viktor.ssh_public_key`, added today) instead of `var.ssh_public_key`. The last `terragrunt apply` was run with that var empty, leaving the snippet's `ssh_authorized_keys` with a single blank entry; the wizard user was effectively locked out of every fresh node. - `cloud_init.yaml` adds `/etc/systemd/resolved.conf.d/global-dns.conf` with `DNS=8.8.8.8 1.1.1.1, FallbackDNS=10.0.20.201`. Without it, systemd-resolved only consulted Technitium (link-level), which returns NXDOMAIN for `forgejo.viktorbarzin.me` — kubelet pulls from the Forgejo registry then failed DNS until I patched it manually on node5. - k8s apt repo bumped v1.32 → v1.34 (matches cluster). - The containerd setup script now creates hosts.toml for forgejo, quay, registry.k8s.io in addition to docker.io + ghcr.io. node3/4 had these added by hand post-bootstrap; now they're baked in. - `config_path` sed matches both `""` (containerd v1) and `''` (containerd v2.x). Without the v2 match, the certs.d mirror dir was silently ignored. - `proxmox-csi` node map adds k8s-node5 + k8s-node6 entries so CSI topology labels (region/zone, max-volume-attachments=28) apply on next `tg apply`. - `stacks/infra/main.tf` shed the 160-line inline containerd setup heredoc — that whole thing now lives in the module as a .sh file. Known unsolved gaps (deferred): - iscsid restart hangs ~90s on first boot before SIGKILL releases it (systemd-resolved restart kicks iscsid via dependency). Adds wall-clock time but doesn't block the join. - `provision-k8s-worker` doesn't run `tg apply` on `proxmox-csi` afterward, so the CSI topology labels need a manual apply after the node joins. Solving cleanly needs the CSI map to derive from `kubectl get nodes` instead of a static local — separate work. 
- `var.containerd_config_update_command` is now ignored when is_k8s_template=true (replaced by the bundled .sh file). Variable kept with a deprecation note to avoid breaking other call sites. E2E proof: k8s-node6 (VMID 206) boots hands-off from `provision-k8s-worker k8s-node6 206 10.0.20.106` and appears as `kubectl get nodes …Ready` ~7 min later (most of which is the apt package_upgrade — separate optimization). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 11:52:00 +00:00
Viktor Barzin	bb9d8f1b38	kyverno: GPU priority mutate uses add (was replace) — fixes silent skip The Layer 5 ClusterPolicy inject-gpu-workload-priority used JSON6902 op=replace on /spec/priorityClassName. Incoming pods (e.g. frigate) have no priorityClassName field at all — replace requires the path to exist, so the patch fails with "doc is missing key: /spec/priorityClassName" and the whole mutation chain aborts BEFORE Layer 4 (inject-priority-class-from-tier) gets a chance to add the field. Result: GPU pods never got priorityClassName set, sat at priority=0, and could not preempt lower-tier pods on the GPU node. Observed today on frigate post-node4-recovery — pod stayed Pending with "Preemption is not helpful" while 3 pg-cluster pods (tier-1-cluster, priority 800000) occupied node1's memory budget. Fix: op=add for all three paths. add works whether or not the key is present, so the policy is robust to the upstream pod shape. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 09:04:51 +00:00
Viktor Barzin	12b4f6f81a	dbaas: require pod anti-affinity on pg-cluster (one PG per node) Default CNPG affinity was `preferred` (soft). During the 2026-05-26 node4 outage, all 3 pg-cluster pods drifted onto k8s-node1 — losing that node would have taken the whole PG cluster down (no quorum) AND the 9.2 GiB pg-cluster footprint was the dominant reason frigate couldn't fit on the GPU node. With 3 instances + 4 worker nodes, `required` is safe under 1-node drain (3 distinct nodes always available, even excluding the drained one). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 09:00:37 +00:00
Viktor Barzin	400ee88967	state(dbaas): update encrypted state	2026-05-26 08:59:40 +00:00
root	daa41a2eb1	Woodpecker CI deploy [CI SKIP]	2026-05-26 08:29:09 +00:00
Viktor Barzin	00bbbe0838	url/shlink-web: containerPort 8080 -> 80 shlinkio/shlink-web-client:0.1.1 listens on port 80 (nginx default), not 8080 like the prior :latest images. Keel auto-bumped the tag on 2026-05-23; liveness/readiness probes have been failing ever since because they still hit :8080. Pod was stuck restarting, the DeploymentReplicasMismatch alert fired. Aligns containerPort + both probes + service target_port with the image.	2026-05-26 08:19:24 +00:00
Viktor Barzin	44c3770a5c	infra: pull all VMs out of Terraform — telmate provider can't represent them safely The telmate/proxmox v3.0.2-rc07 provider mangles dynamically-attached disks (id=539, 2026-05-26 incident) and doesn't refresh mbps_*_concurrent fields back from live state — every plan after a qm-set cap is applied proposes to "fix" mbps 0 → N and the apply errors with the spurious "the QEMU guest needs to be rebooted" message. lifecycle.ignore_changes does NOT block either failure mode. Decision: stop trying to manage Linux VMs in this stack. The cloud-init bootstrap stays in TF (via k8s-node-template, non-k8s-node-template, docker-registry-template above), so a fresh node still clones the right template and runs the same bootstrap. VM lifecycle stays in the Proxmox UI. I/O caps are managed via qm-set on the PVE host (idempotent script at /tmp/apply-mbps-caps.sh, tracked in beads code-9v2j). Removed from TF state + HCL: - module "k8s-master" (vmid 200) - module "k8s-node2" (vmid 202) — pre-existing drift, never in state - module "docker-registry-vm" (vmid 220) — was in state, hit refresh bug Already hand-managed (never in HCL): - 102 devvm, 103 home-assistant, 201 k8s-node1 (Tesla T4 passthrough), 203 k8s-node3, 204 k8s-node4, 101 pfSense (BSD), 300 Windows10. Live I/O caps (qm set, all verified): 102=60/60 103=40/40 200=100/60 201=150/120 202=150/120 203=150/120 204=150/120 220=40/40 Future TF adoption tracked in beads code-75ds (blocks on bpg/proxmox provider migration — telmate can't represent these VMs at all). Closes: code-75ds	2026-05-26 07:12:46 +00:00
Viktor Barzin	9b75b2817b	cloud-init: fix k8s node bootstrap snippet (multi-line interp + containerd v2 quotes) Two bugs found while rebuilding k8s-node4 (2026-05-26): 1. runcmd YAML breakage: `- $${containerd_config_update_command}` interpolated a multi-line heredoc as bare list-item content. The trailing lines lost their list-item prefix, breaking cloud-config parsing. Cloud-init silently fell back to the minimal default (hostname + package_upgrade only) — kubeadm join, containerd config, kubelet tuning, iSCSI hardening, swap, ALL skipped. No error visible in `cloud-init status`. Fix: wrap the interpolation in `- |` literal block with `indent(4, ...)`. 2. containerd v2 single-quote mismatch: `containerd config default` in v2 writes `config_path = ''` (single quotes), v1 writes `""` (double). The sed pattern matched only double quotes → silent no-op on fresh containerd 2.x nodes → registry-mirror hosts.toml ignored → all image pulls hit upstream registries → DNS-to-MetalLB chicken-and-egg loop. Fix: match any value with `config_path = .*`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 07:06:50 +00:00
Viktor Barzin	445feb118f	infra: per-VM I/O caps + terragrunt v0.77 plumbing + state recovery WHAT LANDED: - terragrunt.hcl (root): added telmate/proxmox to k8s_providers required_providers. Other stacks just don't instantiate a provider block — harmless. Replaces the same-name override trick the infra stack used to do, which stopped working under Terragrunt v0.77 ("Detected generate blocks with the same name"). - stacks/infra/terragrunt.hcl: new generate "proxmox_provider" block writes proxmox_provider.tf with the provider config; credentials read from Vault secret/viktor at plan/apply time (no env vars). - modules/create-vm: new mbps_rd / mbps_wr number variables (default 0 = uncapped), wired into scsi0/scsi1 disk{} blocks as mbps_r_concurrent / mbps_wr_concurrent. lifecycle.ignore_changes extended to scsi6..scsi29 (K8s nodes have many CSI-managed slots), plus scsihw and qemu_os (vary per-VM; non-trivial live changes). - stacks/infra/main.tf: docker-registry-vm gains mbps_rd=40, mbps_wr=40 in HCL — already applied live via qm set on 2026-05-26. WHAT FAILED AND WAS ROLLED BACK: - Attempted import of 7 VMs (102 devvm, 103 home-assistant, 200 k8s-master, 201 k8s-node1, 202 k8s-node2, 203 k8s-node3, 204 k8s-node4) via import {} blocks. The telmate/proxmox v3.0.2-rc07 provider mangled proxmox-csi PVC slots on apply for vmid 202 and 203: every scsi slot got rewritten from `vm-9999-pvc-<uuid>` to the boot disk `vm-<vmid>-disk-0`. Restored both .conf files from the 2026-05-24 nightly PVE config backup at /mnt/backup/pve-config/etc-pve/nodes/pve/qemu-server/{202,203}.conf — no reboots, no data loss, K8s CSI reconciled PVC attachments within minutes. Removed the 7 imports from state via `terraform state rm` and re-encrypted. Tracked in beads code-xzbl: blocked on bpg/proxmox provider migration (telmate has the same dynamic-disk defect that bit us on iSCSI back in 2026-04-02; see memory id=539). 
LIVE CAPS STILL IN PLACE (qm set, 2026-05-26 ~03:13 UTC): 102 devvm 60/60 103 home-assistant 40/40 200 k8s-master 100/60 201 k8s-node1 150/120 202 k8s-node2 150/120 203 k8s-node3 150/120 204 k8s-node4 150/120 220 docker-registry 40/40 (pfSense 101 BSD + Windows10 300 intentionally out of scope.) PRE-EXISTING DRIFT EXPOSED (NOT NEW): - HCL declares k8s-master (200) and k8s-node2 (202) but neither was ever imported into TF state — confirmed against the SOPS-encrypted state in git (lineage e1cc5bb5, serial 42, last touched 2026-04-06). This commit leaves both declarations in place but does NOT import them; that's part of the code-xzbl follow-up. Closes: code-s9xr	2026-05-26 06:46:47 +00:00
Viktor Barzin	07bd2e0017	onlyoffice: restore replicas 0 → 1 post IO-storm recovery Cluster is fully stable (all 5 nodes Ready, vaultwarden recovered, node4 rebuilt 2026-05-26). Removing the TEMP-SCALEDOWN guard. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 03:08:17 +00:00
Viktor Barzin	7ad0e578ae	f1-stream: migrate PVC from proxmox-lvm to NFS Wave 1 LUN-cap relief. The PVC stores 5 small JSON state files (health_state, schedule, scraped_links, sessions, streams) and a lost+found — total 30KB, no DB, regenerable from upstream APIs. Standard scale-to-0 → rsync → swap pattern (deployment was at replicas=1). Pod came back up on k8s-node4 (now Ready again). Net: -1 SCSI LUN on k8s-node1 (was the previous host).	2026-05-26 02:49:43 +00:00
Viktor Barzin	aded77d5ab	monitoring: alerts for proxmox-csi LUN saturation per node Vaultwarden + 18 pods got stuck for 7h on 2026-05-26 when k8s-node4 went down: surviving workloads piled onto node1 and hit the csi.proxmox.sinextra.dev/max-volume-attachments=28 cap. The Proxmox VM also had 5 stale scsi entries (PVCs long-migrated to other nodes but never removed from VM config), which bypassed the K8s scheduler safety until the plugin returned 'no free lun found' at attach time. Three new alerts on the kube_volumeattachment_info count per node: - warning at 24/28 (>= 85%), 10m - critical at 27/28 (1 slot left), 3m - critical at 28/28 (cap reached), 1m Also whitelisted kube_volumeattachment_info — the metric was being dropped by the disk-write-reduction filter (id=559) and the alert queries returned zero series until it's kept. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 02:45:13 +00:00
Viktor Barzin	a0b5cbc922	onlyoffice: migrate PVC from proxmox-lvm to NFS Wave 1 LUN-cap relief. OnlyOffice document server keeps only 2 WOPI key files + a .private dir on the PVC (~24K) — the real DB lives in its external Postgres + Redis stack, not on this PVC. Service is at replicas=0 (IO-storm temp scaledown — TEMP-SCALEDOWN comment preserved). Migration trivia: scheduler tried to put the rsync helper on k8s-node4 (PVC's last-known location) but node4 had just come back online and its proxmox-csi/nfs-csi node pods were still in ContainerCreating — failed. Retried pinned to k8s-node2 via nodeSelector; rsync template updated to take an optional node arg. Net: -1 SCSI LUN once onlyoffice is brought back up.	2026-05-26 02:43:47 +00:00
Viktor Barzin	681f6daf10	whisper: migrate PVC from proxmox-lvm to NFS Wave 1 LUN-cap relief. Whisper PVC holds Piper TTS .onnx voice model + a HuggingFace faster-whisper-small-int8 model cache — read-mostly model artefacts, no DB, 303M total. Both whisper and piper deployments are at replicas=0 (GPU-node memory pressure, unrelated). Switched access_modes to ReadWriteMany since both whisper + piper deployments reference the same PVC; on proxmox-lvm RWO they could only colocate on the same node when both come back. Net: -1 SCSI LUN once these are brought back up.	2026-05-26 02:38:34 +00:00
Viktor Barzin	a2b410f6c9	resume: migrate PVC from proxmox-lvm to NFS Wave 1 LUN-cap relief. Reactive Resume stores user-uploaded PDFs + 3 .txt counters under uploads/ and statistics/ — no embedded DB, 112K of data. Service is at replicas=0 (browserless OOM scaledown, unrelated to this work) so the migration was no-downtime. Net: -1 SCSI LUN once resume is brought back up.	2026-05-26 02:36:20 +00:00
Viktor Barzin	cdbb418f45	monitoring: alert when cluster can't tolerate losing a non-GPU worker ClusterCannotTolerateNonGpuNodeLoss fires when the most heavily reserved non-GPU worker (k8s-node2/3/4) has more memory requests pinned to it than the rest of the workers (incl. node1 GPU node) currently have free. If that node went down, its pods would not fit elsewhere and would stay Pending — exactly what happened today (2026-05-26) with node4 NotReady: 4 kyverno pods + woodpecker PVCs + several deployments stuck Pending because node2/node3 were at 99% memory-request saturation. Math: max(R(node X) for X in non-GPU workers) > sum(clamp_min(A(n) - R(n), 0)) over Ready workers. node1 included on the right because its taint is PreferNoSchedule (soft) so it does absorb non-GPU pods under pressure. Currently fires with a 33.96 GiB shortage. Remediation: right-size top reservers via Goldilocks (immich-server 8Gi, frigate 5Gi, prometheus 4.4Gi, pg-cluster 3Gi each, paperless 2Gi) or bump VM RAM on k8s-node2/k8s-node3 from 32GB → 48GB to match node1. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 02:34:13 +00:00
Viktor Barzin	467fa1631d	excalidraw: migrate PVC from proxmox-lvm to NFS Wave 1 of the per-VM SCSI-LUN cap relief. The proxmox-csi-plugin hardcodes a `lun < 30` loop (pkg/csi/utils.go:394) — cap is 29 attachable PVCs per K8s node VM, and k8s-node1 was sitting at 29 with 4 stuck `no free lun found` PVCs queued behind it. Excalidraw stores per-user .excalidraw scene files (no SQLite, no embedded DB) — confirmed safe on NFS. 1.5 MiB of data, 4 active scenes. Migration: - Add nfs_volume module → apply - Scale to 0, rsync helper, swap claim_name → apply - Remove old proxmox-lvm PVC → apply Net: -1 SCSI LUN on k8s-node2. Refs: docs/post-mortems/2026-05-25-immich-anca-elements-io-storm.md (separate concern; this is for the upstream LUN-cap pressure).	2026-05-26 02:33:41 +00:00
Viktor Barzin	16b3969ceb	alloy: move resources to alloy.* (chart key bug); 1Gi limit fixes IO storm The Alloy Helm chart maps `alloy.resources`, NOT `controller.resources`, onto the alloy container. The block under `controller:` was silently dropped, so the container ran with `resources: {}` and inherited the Kyverno LimitRange `tier-defaults` 256Mi — well below Alloy's 400-450Mi steady state. The cgroup ran at 255.8/256MB with ~50M memory-reclaim events, page-cache thrashing drove ~185 MB/s sdc reads (12.18 TB in 24h), saturating the Proxmox host and rippling out to all VMs + NFS. Fix: - Move resources to `alloy.resources` (correct chart key). - Burstable QoS: request 512Mi, limit 1Gi. Workers are at 97-99% memory-request saturation cluster-wide; a 1Gi request blocks scheduling on node2/node3. - Bump controller.updateStrategy.maxUnavailable to 50% so a 5-pod DS rolling update fits inside the helm timeout. - Bump helm_release.alloy.timeout to 900s (default 300s was too short with occasional runc-stuck-Terminating on k8s-master). Verified: all 4 alloy pods now show 1Gi/512Mi at the container level; helm rev=8 deployed; per-pod memory 99-108Mi at steady state (well under the new limit). Memory ID 2726. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 02:08:35 +00:00
Viktor Barzin	b9ac942647	nvidia: fix driver install deadlock + extend startup probe Two compounding issues prevented the GPU driver from installing after the k8s-node1 kernel rollback to 6.8.0-117-generic (Ubuntu 24.04): 1. Deadlock: The k8s-driver-manager init container was stuck waiting for nvidia-operator-validator to shut down. The validator's driver-validation init container was in an infinite poll loop checking for /run/nvidia/validations/.driver-ctr-ready (which only appears after a successful driver install). The validator pod had deletionTimestamp set but its container remained in Terminating state indefinitely. Fix: force-delete the stuck Terminating validator pod to break the deadlock (kubectl delete --force --grace-period=0). 2. Startup probe timeout: Full driver install on this hardware (apt headers ~2min + gcc make -j16 ~12min + file copy ~7min = ~21min) exactly exhausted the default 120×10s=20min startup probe window, causing SIGKILL (exit 137) at exactly 21 minutes even when the install was succeeding. Extended failureThreshold 120→300 (50min headroom). Documented both root causes + recovery steps in the post-mortem. values.yaml: add driver.startupProbe.failureThreshold: 300. Note: the kubectl patch applied during recovery is a temporary fix; this TF values.yaml change makes it durable via the next TF apply. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 11:53:44 +00:00
Viktor Barzin	da33919368	f1-stream: verifier — wrap m3u8 fetches through /proxy The frontend already routes every m3u8 URL through `getProxyUrl` → `/proxy?url=…` so CORS-restricted hosts work for users. The verifier was the odd one out: it loaded m3u8 URLs directly into hls.js inside a `data:` URL test page, which has Origin `null`. Hosts like `oe1.ossfeed.store` (pitsport's playlist CDN) only set ACAO when the request's Origin is `https://pushembdz.store`, so hls.js got an instant `fatal_network_error` and every pitsport stream was marked dead even though they play fine for real users. Wrap the m3u8 URL the same way the verifier already wraps embed URLs: `{PROXY_BASE}/proxy?url=<b64>`. Stays same-origin for hls.js, gets ACAO:* from our proxy, and the rewritten variants are also proxy-wrapped so subsequent fetches stay clean. For sites whose CDN serves any IP without Origin tricks (stremio, dd12), this is transparent — proxy just forwards. Side effect: every verified m3u8 hits our proxy once during extraction. Cheap (1 cluster-internal request + 1 upstream HEAD/GET) and only during the 5/30-min extraction cycle.	2026-05-24 22:26:56 +00:00
Viktor Barzin	7045559fee	immich: harden against bulk-import load (memory + probe + Job retries) Mid-flight stability changes from the 2026-05-24 Anca-elements import that surfaced multiple latent issues under sustained load: - `immich-postgresql` memory 3Gi → 5Gi. The original limit OOM-killed PG once the bulk insert + vector embeddings drove buffer pressure past 3 GiB. 5 GiB gives ~60% headroom over the observed steady state during ongoing imports. - `immich-server` startup probe `failure_threshold` 30 → 360 (5min → 1h). After any PG restart, immich-server reindexes `clip_index` + `face_index` (147k + 185k rows at the time of incident) before binding the API port. The old 5min budget was too tight, so each PG bounce trapped immich-server in a startup crashloop until the reindex was killed. 1h gives generous headroom. - `kubernetes_job_v1.anca_elements_import.backoff_limit` 2 → 20 and `--concurrent-tasks` 8 → 20 on the immich-go upload. Short cluster blips (PG restart, KCM lease loss) were exhausting the Job's 3-attempt budget. 20 attempts + 20 parallel hashers makes dedup-on-resume ~2.5x faster and tolerates a much rougher cluster. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 22:14:05 +00:00
root	445f30d955	Woodpecker CI deploy [CI SKIP]	2026-05-24 22:07:58 +00:00
Viktor Barzin	5cdac421c2	forgejo: pin to v11.0.14 + disable Keel (image-rewrite incident 2026-05-24) On 2026-05-24T15:35:37Z Keel's force-policy rewrote the image tag from `11.0.14 → 1.18` (codeberg.org/forgejo/forgejo). v1.18 is a Gitea-era Forgejo (Forgejo forked from Gitea at 1.18 and used pre-Forgejo versioning early on); the DB had already been migrated to schema 305 by 11.0.14, and 1.18 only knows up to migration 231 → pod refused to start ("Your database (migration version: 305) is for a newer Gitea, you can not use the newer database for this old Gitea release (231)"). Exact replay of the 2026-05-16 force-policy tag-rewriting bug (memory id=1933). Changes: - Pin image to explicit `:11.0.14` (latest 11.x, published 2026-05-12) - Add `keel.sh/policy: "never"` deploy annotation — overrides the Kyverno-stamped `force` policy via the chart's `+()` anchor semantics (memory id=1972). Keel will no longer touch this workload. - Drop KEEL_IGNORE_IMAGE from `lifecycle.ignore_changes` (TF owns the image now). Restore it if you flip Keel back to `force`. - Add the KEEL_LIFECYCLE_V1 trio (`kubernetes.io/change-cause`, `deployment.kubernetes.io/revision`, `keel.sh/update-time` on the pod template) so future TF applies don't fight K8s rollout metadata. Verified: new pod on v11.0.14 came up Running 1/1. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 22:06:59 +00:00
Viktor Barzin	5a0e4b3dac	f1-stream: revive aceztrims + pitsport, more ppv variants - aceztrims: scrape /f11/ (the actual stream page), not /f1/ (the cross-sport schedule). Drop the dead /iframe1?s= + onclick m3u8 regexes (site moved to `getElementById('iframe').src = '...'` ~20 channels ago). Strip HTML comments first so the ~20 legacy buttons kept inside <!-- ... --> stop showing up as false positives. Also pick up the default inline <iframe id='iframe' src='...'>. Local run: 11 channels (was 0). - pitsport: decode the RSC payload before regex-matching in _parse_live_events (raw HTML had it escape-encoded, so the homepage card path was silently 0). Add the new /live-now route (canonical what's-live-right-now list). Add "f1" to MOTORSPORT_CATEGORIES — the site labels Formula 1 events as just "F1". Refresh the stale serveplay.site docstring (host rotates; pushembdz's api/stream link is authoritative). Local run: 7 m3u8 streams covering Canadian GP (EN1/EN2/MULTI/ITA/ESP) + NASCAR Coke 600 (was 0). - ppv: always emit the parent embed alongside substreams (was dropping it whenever substreams existed). Prefer source_tag in substream titles so users see "Sky Sport 1 NZ" / "Apple TV (F1TV)" instead of generic #1/#2 suffixes. Diagnosed against the live cluster (curated + 7 other extractors returning 0 cached streams, only 2 dead hmembeds curated 24/7 channels visible to users). Each fix verified with the extractor run against live sites this turn.	2026-05-24 22:05:37 +00:00
Viktor Barzin	4798583db7	backup pipeline: S1 fixes from 2026-05-24 audit Three immediate fixes surfaced by the backup-pipeline audit: 1. S1 silent-loss race fix (daily-backup.sh:142): remove the `> "${MANIFEST}"` truncation at the start of daily-backup. Truncation already lives in offsite-sync-backup at line 159, gated on a successful sync. With both scripts truncating, an offsite-sync failure followed by the next morning's daily-backup would silently wipe yesterday's unconsumed manifest entries — those files would only reach Synology via the monthly full sync (1st-7th of month). Now only offsite-sync truncates, and only on success. 2. Missing alert OffsiteBackupSyncFailing: documented in backup-dr.md but was never added to prometheus_chart_values.tpl. Step 1 or Step 2 failure pushes offsite_sync_last_status=1 but nothing read it. Added. 3. wear: drop `-z` from local-only rsyncs (daily-backup.sh:218 PVC snapshot rsync + line 347 /etc/pve sync). Both are local-to-sda transfers — compression wastes CPU and yields nothing (gigabit local path, intermediate disk doesn't benefit). Bonus cleanups (zero functional impact): - "Weekly backup starting/complete" → "daily-backup starting/complete" (the timer is daily, not weekly — legacy from earlier monthly-rotation schedule). - "--- Step 2: PVC file copy ---" → "Step 1:" (was numbered from 2 with no Step 1 above). - wear: pfSense full filesystem tar now Sunday-only instead of daily. config.xml stays daily (it's the primary restore artifact and tiny). Full tar is forensic recovery only — re-tarring ~100MB+ daily writes ~3G/month to sda + Synology for unchanged content. Weekly is plenty. 
docs/architecture/backup-dr.md: rewritten Overview + 3-2-1 breakdown to reflect today's two-leg architecture; added a "2026-05-24 session" changelog summary at the top; added a "Synology snapshot management" subsection with the sudo + `synosharesnapshot` recipe (DSM API is gated by 2FA so this is the only programmatic path); updated Key Files table with nfs-mirror + the Synology SSH access notes. Open follow-ups from the audit (S2 — file as beads if pursued): - Factor two-leg invariant into /etc/backup-skip-list.conf sourced by both nfs-mirror.sh and offsite-sync-backup.sh. - Manifest write-collision flock between nfs-mirror Mon 04:11 and daily-backup Mon 05:00. - Unbounded manifest cap (force full sync if > 500k lines). - Synology free-space scraper + alert. - LVM thin pool meta-pool fill alert. - nfs-change-tracker.service heartbeat to Pushgateway. - Synology config drift TF surface (snap retention, share defs).	2026-05-24 16:18:44 +00:00
Viktor Barzin	9277d71d81	nfs-mirror: append transferred files to offsite-sync manifest Step 1 of offsite-sync-backup is incremental on non-monthly days, driven by /mnt/backup/.changed-files which only daily-backup wrote to. nfs-mirror's writes were therefore invisible to Step 1 until the next monthly --delete pass — which would also wipe data pre-positioned on Synology pve-backup/ (e.g. the in-place btrfs rename we just did to relocate ~160G of NFS subtrees from /Backup/Viki/nfs/<svc>/ to /Backup/Viki/pve-backup/<svc>/). Fix: snapshot a timestamp before rsync, then after rsync use `find -newer $STAMP -type f -printf '%P\n'` to enumerate every file nfs-mirror created/modified and append to the manifest. Paths are relative to /mnt/backup/ (matches Step 1 --files-from expectation). State files are excluded. The current in-flight first run started before this patch was deployed, so its writes won't auto-populate the manifest — a one-off manual backfill will be done after it completes.	2026-05-24 15:32:22 +00:00
root	9e2163040b	Woodpecker CI deploy [CI SKIP]	2026-05-24 14:23:44 +00:00
Viktor Barzin	d6590612b2	immich: bulk-import Anca's Elements photo archive into her account Grows pve/nfs-data 3T → 4T (online lvextend + resize2fs) to absorb ~340 GB of new originals landing under /srv/nfs/immich/upload during the import. Adds: - module "nfs_anca_elements_host" — RO PVC over /srv/nfs/anca-elements, consumed only by the import Job (not mounted in immich-server). - kubernetes_job_v1.anca_elements_import — immich-go v0.31.0 uploader posting to immich-server.immich.svc:2283 with Anca's API key (synced via the existing immich-secrets ExternalSecret from secret/immich.anca_api_key). Filters to image extensions, bans the non-photo top-level dirs (filme/, Music/, carti/, courses, installers, docs, etc.), puts every asset in the album "Poze (Elements)". Default `--pause-immich-jobs` is disabled — non-admin keys can't pause jobs. - docs/architecture/storage.md — note the new 4 TB size in 3 places. - docs/runbooks/grow-pve-nfs-lv.md — captures the one-shot lvextend procedure (no pve-host TF stack exists for this). Job is removed in the follow-up cleanup commit once the upload completes; the PVC stays for a videos batch later. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 14:12:30 +00:00
Viktor Barzin	4d756be4f5	backup: consolidate to one local-mirror script + invert offsite filter Before this commit, the in-flight design split anca-elements (its own mirror script + timer) from the rest of /srv/nfs (still going to Synology via inotify-tracked offsite-sync). It also meant Synology received some bytes via both paths (sda → Synology AND direct NFS → Synology), which doubled consumption. This commit collapses both into a clean 3-2-1: Copy 1 (sdc): live /srv/nfs/* + cluster block PVCs Copy 2 (sda): /mnt/backup/{pvc-data,sqlite-backup,pfsense, pve-config,<critical-nfs>/} ← daily-backup + nfs-mirror (one script each) Copy 3 (Synology): /Backup/Viki/{pve-backup,nfs,nfs-ssd} ← offsite-sync-backup Step 1 (sda → Synology) + Step 2 (sda-BYPASS paths only → Synology direct) scripts/nfs-mirror.{sh,service,timer}: New consolidated weekly mirror. Replaces anca-elements-mirror (to be removed in a follow-up after the current in-flight rsync completes, parity-verified, and Synology source-of-truth is deleted). Single rsync /srv/nfs/ → /mnt/backup/ with an explicit EXCLUDES list that drops paths not worth a local 2nd copy: immich (1.2T — too big), frigate (14d ring), prometheus/loki (rebuildable), ollama/llamacpp/ audiblez/ebook2audiobook (re-fetchable), -backup (already backups), temp/alertmanager (transient). Nice=10, IOSchedulingClass=idle. scripts/offsite-sync-backup.sh: Step 2 (NFS → Synology) filter inverted: instead of `--exclude= anca-elements/`, it now `--include`s only the sda-BYPASS paths (immich, frigate, prometheus, -backup, …). The bypass-include regex MUST stay in lockstep with nfs-mirror's EXCLUDES — they are complementary and any drift creates either gaps or duplication on Synology. Comment in the script flags this. 
monitoring alerts: renamed AncaElementsMirror{Stale,Failing} to NfsMirror{Stale,Failing} matching the new metric job name `nfs-mirror`. Thresholds unchanged. docs/architecture/backup-dr.md: rewritten Step 1/Step 2 sections and added the bypass-list rationale + cross-reference between scripts. NOT YET DEPLOYED — gated on the in-flight anca-elements-mirror rsync finishing + parity verification + Synology /volume1/Backup/Anca/ Elements deletion. The old scripts (anca-elements-{mirror,sync.sh}) remain on the PVE host until then, and will be removed in a cleanup commit.	2026-05-24 12:49:20 +00:00
Viktor Barzin	416c2a0468	monitoring: add AncaElementsMirror{Stale,Failing} alerts Layer 3a (anca-elements local mirror) now has the same alert coverage as offsite-sync-backup: - AncaElementsMirrorStale fires if last_run_timestamp > 16d (2 weekly cycles, matches the 8d → 9d slack used elsewhere) - AncaElementsMirrorFailing fires if last_status != 0 BackupDiskFull (existing) covers the sda fill-up risk at 85%. Not applied this commit — pick up on next monitoring stack apply.	2026-05-24 11:55:19 +00:00
Viktor Barzin	c624caf65a	nextcloud(external_storage): add per-mount enableSharing option Lets admin natively share folders from inside an external mount with internal users/groups or via public link. The two PVE pool browsers (visible to admin only) get enableSharing=true so they can act as a "share-from picker" over /srv/nfs and /srv/nfs-ssd; /anca-elements stays false so anca manages re-sharing inside her own view. - Manifest schema gains enableSharing on rootMounts + archiveMounts. - Bootstrap Job adds sync_option() and reconciles enable_sharing via occ files_external:option (idempotent — occ no-ops same-value set).	2026-05-24 11:39:16 +00:00
root	37e563d5a9	Woodpecker CI deploy [CI SKIP]	2026-05-24 11:31:53 +00:00
Viktor Barzin	cb1a34fd00	nextcloud: expose PVE NFS roots + /anca-elements via Files External Mounts the Proxmox host NFS exports (/srv/nfs and /srv/nfs-ssd) into the NC pod and surfaces them through occ files_external:create: - /PVE NFS Pool → /mnt/pve-nfs (admin group only) - /PVE NFS-SSD Pool → /mnt/pve-nfs-ssd (admin group only) - /anca-elements → /mnt/pve-nfs/anca-elements (admin, anca users) Mount visibility is controlled by occ files_external:applicable; no Files Access Control. ACL state is reconciled idempotently by a bootstrap Job that diffs desired vs current applicable_users / applicable_groups (via files_external:list --output=json). Bootstrap fixes vs initial design: - Sync loop used `[ -n "$U" ] && cmd` which returns 1 on empty input, triggering set -e on no-op re-runs. Switched to process substitution `< <(jq ...)` so empty diff -> loop body never runs -> 0 exit. - RBAC missed `watch` verb (kubectl wait spammed reflector errors). - Manifest used display-name "viktor" instead of NC username "admin" for the /anca-elements applicable list. Chart values: added two PV-backed volume mounts at /mnt/pve-nfs[+ssd] and pinned securityContext to fsGroup=33 with fsGroupChangePolicy: OnRootMismatch (chart default Always would recurse 600k+ files on every pod restart).	2026-05-24 11:27:26 +00:00
Viktor Barzin	7a649ce7eb	crowdsec: pin image to v1.7.8 + remove ENROLL_KEY, CAPI restored Root cause of today's CAPI 403 crashloop: chart 0.21.0 pins appVersion to v1.7.3, but Keel had auto-bumped the running pods to v1.7.8 on 2026-05-16 and they ran fine with CAPI for 8 days. Today's TF apply (`b59acbc1` agent memory bump) re-rendered the deployment from chart defaults, reverting the image to v1.7.3 — and v1.7.3 has a CAPI watcher-auth bug against the current api.crowdsec.net behaviour, so every fresh replica started 403'ing on startup. Fix: set `image.tag: "v1.7.8"` in values.yaml so the image survives future TF applies independently of the chart's appVersion. Verified CAPI auth succeeds on all 3 fresh pods with v1.7.8. Also dropped the ENROLL_KEY env block — the existing key `cmey5e636…` is single-shot and was already consumed by the first replica; subsequent pods hit 403 on `cscli console enroll`. CAPI works WITHOUT console enrollment (separate flows). Re-enable console reporting by generating a fresh enroll key at app.crowdsec.net (procedure documented in the values.yaml comment block). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 11:11:29 +00:00
Viktor Barzin	41786b0fca	crowdsec: DISABLE_ONLINE_API=true — break the recurring 403 crashloop CAPI auth at api.crowdsec.net is rejecting watcher logins from inside the cluster within ~1h of registration, even after rotating creds via `cscli capi register`. The same login successfully authenticates from devvm but fails from cluster pods → IP-throttle or account-state issue at the central API. Until that's resolved with CrowdSec support (or the throttle window resets), running with CAPI on is just chronic crashloops on every fresh replica. `DISABLE_ONLINE_API=true` makes the chart entrypoint `conf_set 'del(.api.server.online_client)'`, removing the online_client block entirely. Pods skip CAPI auth, no 403, no crashloop. Trade-off: no community blocklists. Local scenarios + bouncers continue unchanged. Side-effect of disabling CAPI in this chart (v0.21.0) — `role.yaml` is gated on `IsOnlineAPIDisabled=false` while `cscli-lapi-register-job` is gated on `StoreLAPICscliCredentialsInSecret=true` (orthogonal). So the hook runs without the Role it needs, and atomic apply rolls back. Mitigation: pre-created the `crowdsec-lapi-cscli-credentials` Secret manually (the hook short-circuits when the secret already exists) and re-applied the missing Role for future re-enablement. Re-enable path documented in the comment block. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 10:31:03 +00:00
Viktor Barzin	0b1282a13c	llama-cpp: ignore_changes for keel/k8s-managed annotations Every `tg apply` was reverting the annotations that keel patches when it detects an upstream digest change — `keel.sh/match-tag` (Kyverno-stamped), `keel.sh/update-time` (on the pod template; what actually triggers the rollout), plus the K8s-managed `kubernetes.io/change-cause` and `deployment.kubernetes.io/revision`. The revert forced a rollout, then the next keel poll re-stamped the annotations, forcing another. With llama-swap's ~10s cold-load on each pod recreate the user noticed. Upstream `ghcr.io/mostlygeek/llama-swap:cuda` is a moving nightly tag — keel still drives one legitimate rollout per day at ~07:25 UTC; this patch stops the apply-driven extra rollouts on top of that. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 09:01:17 +00:00
Viktor Barzin	67f8be4598	trading-bot: add kevin_signal_bridge container (kill-switch OFF for Phase 1) 5th worker container running in audit-only mode. Writes kevin_signal_bridge_state rows showing what it WOULD trade but never publishes to signals:generated. Kill-switch flipped in Phase 2.	2026-05-24 01:22:53 +00:00
Viktor Barzin	6218868ea5	xray: drop dead vless ingress + pin Service target_port The xray-vless ingress, Service port 6443, and container port 6443 had no backing listener — xray.config.json only binds 7443 (REALITY), 8443 (WS) and 9443 (XHTTP). The "xray-vless" hostname was returning 502 since the module was created. Side effect: removing the first Service port slot ("vless"/6443) caused the kubernetes provider to shift targetPort values on the remaining two ports (defaulting only worked at create time, not on port removal). Pinning target_port explicitly makes Service routing deterministic. End-to-end verified: REALITY via public IP:8080 (pfSense forward 8080 -> 10.0.20.200:7443), WS via Cloudflare, XHTTP via Cloudflare — all three transports proxied successfully through a test pod, egress IP correctly resolves to the home WAN.	2026-05-24 01:13:54 +00:00
Viktor Barzin	ae874e028d	postiz: bump memory request 512Mi → 2Gi, limit 4Gi → 3Gi (right-size for next deploy) krr 2026-05-22 flagged postiz-app as critically under-requested when it was running (gap 2.2 GiB above the 512Mi request). Postiz is currently uninstalled in the cluster — this change is only for when the stack is re-deployed later. No apply triggered now. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 01:11:25 +00:00
Viktor Barzin	b59acbc1db	crowdsec/agent: bump memory request 64Mi → 128Mi krr 2026-05-22 flagged crowdsec-agent DaemonSet (4 pods) as under-requested by ~588 MiB across the cluster. Live usage around the 80-128 MiB mark for active log parsing — 64 MiB request risked eviction ahead of more-needed pods. Limit stays at 512 MiB. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 01:11:16 +00:00
Viktor Barzin	7108843b38	nvidia/driver-daemonset: bump memory request 256Mi → 822Mi krr 2026-05-22 flagged nvidia-driver-daemonset as critically under-requested (~566 MiB gap). Live driver process holds ~600-800Mi once the kernel module is loaded. Limit stays at 2Gi so the DKMS build during a kernel upgrade still has headroom (documented in values.yaml to need ~1.4 GiB peak). May help unblock code-8vr0 (GPU driver crashloop on node1) if the crashloop was OOM-driven. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 01:11:06 +00:00
Viktor Barzin	2711d4af05	monitoring/loki: bump memory request 2Gi → 3Gi (close gap to 4Gi limit) krr 2026-05-22 flagged loki as under-requested by 1.9 GiB. Live working set is sitting at ~3 GiB during normal ingestion; the existing 2 GiB request meant scheduler didn't reserve enough room and the pod risked eviction. Limit stays at 4 GiB (documented ceiling in loki.yaml). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 01:10:55 +00:00
Viktor Barzin	c77984a713	proxmox-csi/node: bump memory request 64Mi → 1Gi (LUKS unlock reservation) The CSI node plugin's LUKS2 Argon2id key derivation peaks at ~1 GiB during unlock (memory id=712 + already-documented in the limits=1280Mi). Request was 64 MiB — meaning the unlock burst ran "best-effort", first in line for OOM under node pressure. krr 2026-05-22 flagged this as a top under-request. Bumping request matches the documented requirement. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 01:10:44 +00:00
Viktor Barzin	467460cccd	k8s-version-upgrade: ignore IngressTTFBCritical in halt-on-alert check The Synology DSM (port 5001) ingress chronically trips IngressTTFBCritical because of NAS-side latency that is unrelated to k8s upgrades. The chain was halting indefinitely waiting for it to clear. Add it alongside RecentNodeReboot to the per-call ignore regex so the chain can proceed autonomously without manual silences. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 01:10:44 +00:00
Viktor Barzin	447bfef507	blog: remove www.viktorbarzin.me ingress The www subdomain was internal-only (no Cloudflare DNS record) but the external uptime-kuma monitor still flagged it as down because public DNS resolution failed. Removing the ingress along with the Technitium CNAME makes the failure mode disappear and lets the cluster reach an autonomous-clean state. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 01:10:44 +00:00
Viktor Barzin	b4aa8eaf58	technitium: cut memory — primary 2Gi → 1Gi, secondary+tertiary 2Gi → 512Mi Right-sizing per krr report (2026-05-22). Zone data is ~43 MiB; the rest was cache headroom. Primary keeps more (1 GiB) since it owns authoritative zones; replicas get 512 MiB. DNS sanity-checked across CoreDNS and the MetalLB external IP (10.0.20.201) post-rollout. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 10:03:51 +00:00
Viktor Barzin	931d7b6c9d	claude-agent-service: cut memory request 2Gi → 1Gi (limit 4Gi → 2Gi) Right-sizing per krr report (2026-05-22). Kept Burstable QoS (limit > request) so an active agent run still has 2 GiB headroom — krr's 100 MiB recommendation was measured idle and is not safe for an active job. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 10:03:42 +00:00
Viktor Barzin	d76f4c4827	n8n: cut memory request 1Gi → 512Mi (+ image bump 1.80.0 → 1.80.5) Right-sizing per krr report (2026-05-22). Image bump syncs main.tf with the live Keel-managed version to avoid an inadvertent downgrade on apply. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 10:03:28 +00:00
Viktor Barzin	17c1ef73be	url/shlink: cut memory request 960Mi → 512Mi Right-sizing per krr report (2026-05-22, memory id=2431-2438). Live pod working set is ~80 MiB; 512Mi leaves comfortable headroom for the Symfony+RoadRunner footprint. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 10:02:45 +00:00
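
The manifest-append approach described in commit 9277d71d81 (record a timestamp, run the transfer, then list every regular file modified after the stamp and append the relative paths to a manifest) could be sketched roughly as below. This is a Python illustration only, not the script from the repository, which uses `find -newer`. The function names, the `exclude` parameter and the `.state` file name in the usage note are hypothetical.

```python
import os
from pathlib import Path


def changed_since(root: Path, stamp: float, exclude: frozenset = frozenset()) -> list:
    """List regular files under root whose mtime is strictly later than stamp.

    Paths are returned relative to root, sorted, with '/' separators.
    Symlinks are skipped, and any relative path found in `exclude` is dropped.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = Path(dirpath) / name
            if full.is_symlink() or not full.is_file():
                continue
            rel = full.relative_to(root).as_posix()
            if rel in exclude:
                continue
            if full.stat().st_mtime > stamp:
                changed.append(rel)
    return sorted(changed)


def append_manifest(manifest: Path, paths: list) -> None:
    """Append one relative path per line to the manifest file."""
    with manifest.open("a", encoding="utf-8") as fh:
        for rel in paths:
            fh.write(rel + "\n")
```

A caller would take `stamp = time.time()` before starting the transfer, then run `append_manifest(manifest, changed_since(root, stamp, frozenset({".state"})))` once it finishes. Keeping the paths relative to the mirror root mirrors the commit's note that the manifest must match what a `--files-from` consumer expects.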

1077 commits