Commit graph

7 commits

Author SHA1 Message Date
Viktor Barzin
5151bbe0d5 fix(mcp): serialize local SQLite writes to end "database is locked" under concurrent stores
Some checks failed
Build and Push / lint-and-test (push) Has been cancelled
Build and Push / build (push) Has been cancelled
Build and Push / deploy (push) Has been cancelled
Build and Push / notify-failure (push) Has been cancelled
Under heavy concurrent memory_store (many subagents/sessions writing close
together) the local SQLite layer raced on the single SQLite writer and surfaced
sqlite3.OperationalError: database is locked, which made memory tools slow and
eventually dropped whole sessions. Two root causes:

  - The MCP server (mcp_server.py) and the background SyncEngine (sync.py) each
    opened a SEPARATE connection to the same SQLite file. WAL allows one writer;
    when the sync writer held the lock across a resync, a concurrent store blew
    past busy_timeout and raised "database is locked".
  - mcp_server's connection was opened WITHOUT check_same_thread=False, so the
    moment two requests were handled on different threads every local store/recall
    raised ProgrammingError "SQLite objects created in a thread...".

Fix: a single process-wide serialized writer.

  - New LocalStore (local_store.py) owns ONE connection (check_same_thread=False)
    guarded by ONE re-entrant lock, keeps WAL, and wraps writes in
    transaction()/write() with bounded exponential-backoff retry on the rare
    residual lock (e.g. another OS process) instead of failing the call.
  - MemoryServer builds the LocalStore and SHARES it with the SyncEngine, so the
    sync writer no longer opens a second connection — the two-connections race is
    eliminated. All server reads/writes go through the shared lock; stores stay
    snappy (enqueue-local + async sync).
  - Bound the one genuinely slow path (remote semantic memory_recall) with an
    explicit RECALL_TIMEOUT and, on timeout/unreachable backend, return a clear
    "working / retry" signal instead of hanging silently or crashing. When a
    client supplies _meta.progressToken, emit one notifications/progress so the
    call shows life.

Ships to users via the plugin (mcp/memory-mcp.json runs src/.../mcp_server.py);
no server-side/API change needed. TDD: added concurrency tests (many threads +
sync writer on one file) and recall progress/bounding tests; full gate green
(ruff + mypy strict + 185 pytest).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 06:06:09 +00:00
Viktor Barzin
e47efee6b6
resilient memory sync: decouple push/pull, startup full resync, auth failure handling
- Decouple push and pull in _sync_once() so pull always runs even if push fails
- Add startup full resync to catch drift from other agents and schema changes
- Add periodic full resync every ~10 minutes for continuous drift correction
- Add auth failure detection (401/403) with graceful SQLite-only degradation
- Add /api/auth-check endpoint for lightweight key validation
- Add retry cap (5 attempts) on pending ops to prevent infinite queue buildup
- Add orphan reconciliation: push local-only records with content dedup
- Add memory_count MCP tool for sync diagnostics
- Add version-based SQLite schema migration (PRAGMA user_version)
- Fix API key in ~/.claude.json to match server
- Update README with sync resilience docs, test structure, project layout
- Add 30 new tests covering all new behaviors (155 total, all passing)
2026-03-16 18:37:59 +00:00
Viktor Barzin
d370855abf
fix mypy across all source files, remove || true from CI
- Add type annotations to all FastAPI endpoints in api/app.py
- Fix bare list/dict generics in sync.py and app.py
- Fix no-any-return in vault_client.py and sync.py
- Remove mypy || true from GitHub Actions CI — mypy is now clean
2026-03-15 23:36:34 +00:00
Viktor Barzin
e22a8f743a
fix: 404 retry loop on delete and URL-encode since param
Three fixes for high 4xx alert rate:
1. Catch urllib.error.HTTPError (not just RuntimeError) for 404 on
   DELETE in pending_ops — was causing infinite retry loop for
   already-deleted memories
2. try_sync_delete: treat 404 as success instead of enqueuing
   a retry that will also 404 forever
3. URL-encode the `since` query param to prevent `+` in timezone
   offset being decoded to a space (the asyncpg-string-timestamp
   pattern applied to the sync client)
2026-03-15 15:28:19 +00:00
Viktor Barzin
a52afe050d
Fix MCP server startup for Claude Code compatibility
- Suppress stderr output (Claude Code rejects servers with stderr)
- Make SyncEngine.start() non-blocking (blocking sync caused 15s+ timeout)
- Skip Content-Length header lines gracefully (NDJSON transport)
- Silently ignore malformed JSON lines instead of sending error responses
2026-03-14 16:05:35 +00:00
Viktor Barzin
0a4aed9e1b
fix: resolve ruff lint errors (unused imports and variables) 2026-03-14 13:04:40 +00:00
Viktor Barzin
cd80a67dfa
feat: add local SQLite cache with background sync and HA deployment
- Add SyncEngine for background sync between local SQLite cache and
  remote API with pending_ops queue for offline resilience
- Refactor MCP server to support three modes: SQLite-only, hybrid
  (local cache + sync, new default), and HTTP-only (legacy)
- Add GET /api/memories/sync endpoint for incremental sync
- Change DELETE to soft delete (set deleted_at) for sync support
- Add deleted_at IS NULL filters to all read queries
- Scale API deployment to 2 replicas with pod anti-affinity, PDB,
  and startup probe for high availability
- Add migration 003 for deleted_at column and updated_at index
- Add comprehensive tests for sync engine and API sync endpoint
2026-03-14 12:42:39 +00:00