Under heavy concurrent memory_store traffic (many subagents/sessions writing
close together), the local SQLite layer contended for SQLite's single write
lock and surfaced sqlite3.OperationalError: database is locked. Memory tools
became slow and whole sessions were eventually dropped. Two root causes:
- The MCP server (mcp_server.py) and the background SyncEngine (sync.py) each
opened a SEPARATE connection to the same SQLite file. WAL still allows only
one writer at a time; when the sync writer held the write lock across a
resync, a concurrent store waited longer than busy_timeout and raised
"database is locked".
- mcp_server's connection was opened WITHOUT check_same_thread=False, so as
soon as requests were handled on more than one thread, every local
store/recall issued from a thread other than the one that created the
connection raised sqlite3.ProgrammingError "SQLite objects created in a
thread...".
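The second root cause can be reproduced in isolation. The sketch below is not
taken from mcp_server.py; it uses only the standard-library sqlite3 module, and
the helper name use_from_other_thread is made up for illustration.

```python
import sqlite3
import threading


def use_from_other_thread(conn: sqlite3.Connection) -> str:
    """Run a trivial query on a worker thread and report what happened."""
    outcome: list[str] = []

    def worker() -> None:
        try:
            conn.execute("SELECT 1").fetchone()
            outcome.append("ok")
        except sqlite3.ProgrammingError as exc:
            outcome.append(f"ProgrammingError: {exc}")

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return outcome[0]


# Both connections are created on the main thread.
default_conn = sqlite3.connect(":memory:")  # check_same_thread defaults to True
shared_conn = sqlite3.connect(":memory:", check_same_thread=False)

default_result = use_from_other_thread(default_conn)  # rejected by sqlite3
shared_result = use_from_other_thread(shared_conn)    # allowed

default_conn.close()
shared_conn.close()
```

Note that check_same_thread=False only disables the check; callers still have
to serialize access to the shared connection themselves.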
Fix: a single process-wide serialized writer.
- New LocalStore (local_store.py) owns ONE connection (check_same_thread=False)
guarded by ONE re-entrant lock, keeps WAL, and wraps writes in
transaction()/write() with bounded exponential-backoff retry on the rare
residual lock (e.g. another OS process) instead of failing the call.
- MemoryServer builds the LocalStore and SHARES it with the SyncEngine, so the
sync writer no longer opens a second connection, which eliminates the
two-connections race. All server reads/writes go through the shared lock;
stores stay fast because they only enqueue locally and sync asynchronously.
- Bound the one genuinely slow path (remote semantic memory_recall) with an
explicit RECALL_TIMEOUT and, on timeout/unreachable backend, return a clear
"working / retry" signal instead of hanging silently or crashing. When a
client supplies _meta.progressToken, emit one notifications/progress so the
call shows life.
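A minimal sketch of the single serialized writer described above, assuming a
shape for LocalStore that this message does not spell out. The constructor
arguments, retry limits, busy_timeout value, table name and the run_demo helper
are all hypothetical; only the overall design (one connection, one re-entrant
lock, WAL, bounded backoff) comes from the description.

```python
import os
import random
import sqlite3
import tempfile
import threading
import time
from contextlib import contextmanager
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


class LocalStore:
    """Hypothetical sketch: one shared connection guarded by one RLock."""

    def __init__(self, path: str, max_retries: int = 5, base_delay: float = 0.05) -> None:
        self._lock = threading.RLock()
        self._max_retries = max_retries
        self._base_delay = base_delay
        # isolation_level=None: no implicit BEGIN; transactions are explicit.
        self._conn = sqlite3.connect(path, check_same_thread=False, isolation_level=None)
        self._conn.execute("PRAGMA journal_mode=WAL")
        self._conn.execute("PRAGMA busy_timeout=5000")  # assumed value

    @contextmanager
    def transaction(self) -> Iterator[sqlite3.Connection]:
        """Serialize in-process callers; commit on success, roll back on error."""
        with self._lock:
            if self._conn.in_transaction:
                # Re-entrant call on the same thread: join the outer transaction.
                yield self._conn
            else:
                self._conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
                try:
                    yield self._conn
                except BaseException:
                    self._conn.rollback()
                    raise
                else:
                    self._conn.commit()

    def write(self, fn: Callable[[sqlite3.Connection], T]) -> T:
        """Run fn in a transaction, retrying a residual lock with bounded backoff."""
        for attempt in range(self._max_retries + 1):
            try:
                with self.transaction() as conn:
                    return fn(conn)
            except sqlite3.OperationalError as exc:
                if "locked" not in str(exc).lower() or attempt == self._max_retries:
                    raise
                delay = min(self._base_delay * 2 ** attempt, 1.0)
                time.sleep(delay + random.uniform(0, delay / 2))  # jittered backoff
        raise AssertionError("unreachable")

    def close(self) -> None:
        with self._lock:
            self._conn.close()


def run_demo(threads: int = 8, rows_each: int = 25) -> int:
    """Write from several threads through one store; return the final row count."""
    with tempfile.TemporaryDirectory() as tmp:
        store = LocalStore(os.path.join(tmp, "memory.db"))
        store.write(lambda c: c.execute("CREATE TABLE memories (body TEXT)"))

        def worker(n: int) -> None:
            for i in range(rows_each):
                store.write(lambda c: c.execute(
                    "INSERT INTO memories (body) VALUES (?)", (f"{n}-{i}",)))

        pool = [threading.Thread(target=worker, args=(n,)) for n in range(threads)]
        for t in pool:
            t.start()
        for t in pool:
            t.join()
        count = store.write(
            lambda c: c.execute("SELECT COUNT(*) FROM memories").fetchone()[0])
        store.close()
        return count
```

The backoff sleep happens outside transaction(), so the in-process lock is
released while a caller waits out a lock held by another OS process.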
Ships to users via the plugin (mcp/memory-mcp.json runs src/.../mcp_server.py);
no server-side/API change needed. TDD: added concurrency tests (many threads +
sync writer on one file) and recall progress/bounding tests; full gate green
(ruff + mypy strict + 185 pytest).
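The bounded recall path from the fix list might look roughly like the sketch
below. It is an illustration under assumptions, not the code in this change:
the RECALL_TIMEOUT value, the bounded_recall name, the send callback and the
returned dict shape are hypothetical. The notification uses the
notifications/progress method with a progressToken, as mentioned above.

```python
import concurrent.futures
import time
from typing import Callable, Optional

RECALL_TIMEOUT = 2.0  # seconds; illustrative value, the real setting is not shown

_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def bounded_recall(
    query: str,
    remote_recall: Callable[[str], list],
    timeout: float = RECALL_TIMEOUT,
    progress_token: Optional[str] = None,
    send: Callable[[dict], None] = lambda msg: None,
) -> dict:
    """Call the remote backend, but never wait longer than `timeout` seconds."""
    if progress_token is not None:
        # Emit exactly one progress notification so the client sees activity.
        send({
            "method": "notifications/progress",
            "params": {"progressToken": progress_token, "progress": 0,
                       "message": "recall in progress"},
        })
    future = _executor.submit(remote_recall, query)
    try:
        return {"status": "ok", "results": future.result(timeout=timeout)}
    except concurrent.futures.TimeoutError:
        # The worker thread cannot be killed; we only stop waiting for it.
        return {"status": "working", "retry": True}
    except OSError:
        # ConnectionError and friends: backend unreachable.
        return {"status": "unreachable", "retry": True}


def _down(query: str) -> list:
    raise ConnectionError("backend unreachable")


sent: list = []
fast = bounded_recall("q", lambda q: [q.upper()], progress_token="tok-1", send=sent.append)
slow = bounded_recall("q", lambda q: time.sleep(0.3) or [], timeout=0.05)
down = bounded_recall("q", _down)
```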
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Decouple push and pull in _sync_once() so pull always runs even if push fails
- Add startup full resync to catch drift from other agents and schema changes
- Add periodic full resync every ~10 minutes for continuous drift correction
- Add auth failure detection (401/403) with graceful SQLite-only degradation
- Add /api/auth-check endpoint for lightweight key validation
- Add retry cap (5 attempts) on pending ops to prevent infinite queue buildup
- Add orphan reconciliation: push local-only records with content dedup
- Add memory_count MCP tool for sync diagnostics
- Add version-based SQLite schema migration (PRAGMA user_version)
- Fix API key in ~/.claude.json to match server
- Update README with sync resilience docs, test structure, project layout
- Add 30 new tests covering all new behaviors (155 total, all passing)
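For reference, version-based migration with PRAGMA user_version can be
sketched as follows. The MIGRATIONS list, the table and column names and the
migrate function are hypothetical; the actual schema steps are not shown in
this message.

```python
import sqlite3

# Hypothetical steps: entry i upgrades the schema from version i to i + 1.
MIGRATIONS = [
    "CREATE TABLE memories (id INTEGER PRIMARY KEY, body TEXT NOT NULL)",
    "ALTER TABLE memories ADD COLUMN synced INTEGER NOT NULL DEFAULT 0",
]


def migrate(conn: sqlite3.Connection) -> int:
    """Apply every pending step and return the resulting schema version."""
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    for version in range(current, len(MIGRATIONS)):
        # Explicit BEGIN so the DDL and the version bump commit together.
        conn.execute("BEGIN")
        try:
            conn.execute(MIGRATIONS[version])
            # PRAGMA values cannot be bound as parameters; version is an int.
            conn.execute(f"PRAGMA user_version = {version + 1}")
            conn.commit()
        except BaseException:
            conn.rollback()
            raise
    return conn.execute("PRAGMA user_version").fetchone()[0]


conn = sqlite3.connect(":memory:")
first_run = migrate(conn)   # applies both steps on a fresh database
second_run = migrate(conn)  # nothing pending, so this is a no-op
columns = [row[1] for row in conn.execute("PRAGMA table_info(memories)")]
```

Running migrate() on every startup is safe because steps at or below the
stored version are skipped.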