infra/stacks/claude-memory
Viktor Barzin 8787d361dc
ci/woodpecker/push/default Pipeline was successful
claude-memory: HA (replicas 2 + PDB) to stop recurring MCP disconnects
The claude-memory MCP backend ran as a single replica with no PDB, so every
voluntary disruption took it to zero for ~30-90s — which surfaced as the
memory MCP "keeps getting disconnected" problem. Disruption sources hitting
the lone pod: the descheduler (every-5-min CronJob, LowNodeUtilization —
caught evicting it live), Keel image bumps, Reloader restarts on the 7-day
DB-password rotation, node drains, and CI deploys.

The local stdio MCP subprocess itself was proven healthy (fast non-blocking
startup, stderr suppressed, graceful degradation), so the fault was purely
backend availability, not the MCP plumbing.

Fix:
- Run 2 replicas. The backend is stateless FastAPI over shared CNPG
  Postgres and already has hostname anti-affinity.
- Restore the PDB at minAvailable=1. This is safe now: the drain deadlock
  that justified removing it only existed at 1 replica.
- Set descheduler evict=false to stop the needless 5-min churn.

All five disruption sources become zero-downtime rolling events.
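
For illustration, the restored PDB could look roughly like the Terraform
sketch below. It is not copied from main.tf: the resource name, namespace,
and the app label are assumptions, and the real selector must match whatever
labels the Deployment's pods actually carry.

```hcl
# Hypothetical sketch of the PDB described above, using the hashicorp/kubernetes
# provider. Name, namespace, and labels are assumed, not taken from main.tf.
resource "kubernetes_pod_disruption_budget_v1" "claude_memory" {
  metadata {
    name      = "claude-memory"
    namespace = "claude-memory"
  }

  spec {
    # Allowed voluntary disruptions = healthy pods - min_available.
    #   1 replica:  1 - 1 = 0  -> evictions are refused, so a node drain blocks
    #   2 replicas: 2 - 1 = 1  -> one pod may be evicted while the other serves
    min_available = "1"

    selector {
      match_labels = {
        app = "claude-memory" # assumed pod label
      }
    }
  }
}
```

The comment in the spec block is why the PDB is only safe alongside the
replica bump: at 1 replica the budget permits zero evictions, at 2 it
permits one at a time.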

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-18 09:13:36 +00:00
main.tf claude-memory: HA (replicas 2 + PDB) to stop recurring MCP disconnects 2026-06-18 09:13:36 +00:00
providers.tf fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
secrets fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00
terragrunt.hcl fix: restore tree dropped by 6d224861; land stem95su gdrive-sync (10m) [ci skip] 2026-06-09 08:45:33 +00:00