Compare commits

4 commits

Author SHA1 Message Date
Viktor Barzin
1cc8a2b378 research: benchmark hybrid (lexical+dense+graph) recall vs current FTS
Viktor asked to enhance the memory system with 'semantics' — remember concepts
(not just tokens) linked in a graph — and to prove, by benchmarking against the
current system, that it actually improves recall. A multi-phase research workflow
(18 agents) did landscape research, an adversarially-reviewed integration design,
a stratified eval set over the real 5,452-memory corpus, and a head-to-head
prototype-vs-current benchmark.

Result: hybrid (lexical FTS + dense embeddings, RRF-fused) beats FTS on every
overall metric, driven by a robust paraphrase win (recall@10 +0.350). Recommend
adopting lexical+dense; the concept graph is DEFERRED.

Post-run adversarial review correction (applied to all docs before commit): the
prototype's fusion config structurally barred the graph leg from the ranked top-k,
so the 'graph contributes nothing' ablation was a math artifact, NOT an empirical
result — the graph is UNEVALUATED, not disproven (deferred on cost+uncertainty).
Multi-hop deltas are not statistically significant. Glossary in CONTEXT.md; framing
in ADR-0001-0003; findings in ADR-0004-0006 + docs/research/.

Privacy: the corpus/queries/qrels/results are the user's real memories and stay
gitignored (data/, cache/, results/, build_eval_set.py); only harness code,
aggregate numbers, and synthetic examples are committed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 17:51:53 +00:00
Viktor Barzin
7439540f8f docs: glossary + ADRs for semantic/concept-graph memory
Captures the design language (CONTEXT.md) and the framing decisions from the
requirements interview: pursue hybrid embeddings+concept-graph retrieval gated
on a benchmark (0001), target the API/Postgres deployment while SQLite stays
lexical (0002), and permit hosted embedding APIs for non-sensitive memories
only (0003). Groundwork for the research/prototype/benchmark effort.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 15:36:30 +00:00
Viktor Barzin
5151bbe0d5 fix(mcp): serialize local SQLite writes to end "database is locked" under concurrent stores
Under heavy concurrent memory_store (many subagents/sessions writing close
together) the local SQLite layer raced on the single SQLite writer and surfaced
sqlite3.OperationalError: database is locked, which made memory tools slow and
eventually dropped whole sessions. Two root causes:

  - The MCP server (mcp_server.py) and the background SyncEngine (sync.py) each
    opened a SEPARATE connection to the same SQLite file. WAL allows one writer;
    when the sync writer held the lock across a resync, a concurrent store blew
    past busy_timeout and raised "database is locked".
  - mcp_server's connection was opened WITHOUT check_same_thread=False, so the
    moment two requests were handled on different threads every local store/recall
    raised ProgrammingError "SQLite objects created in a thread...".

Fix: a single process-wide serialized writer.

  - New LocalStore (local_store.py) owns ONE connection (check_same_thread=False)
    guarded by ONE re-entrant lock, keeps WAL, and wraps writes in
    transaction()/write() with bounded exponential-backoff retry on the rare
    residual lock (e.g. another OS process) instead of failing the call.
  - MemoryServer builds the LocalStore and SHARES it with the SyncEngine, so the
    sync writer no longer opens a second connection — the two-connections race is
    eliminated. All server reads/writes go through the shared lock; stores stay
    snappy (enqueue-local + async sync).
  - Bound the one genuinely slow path (remote semantic memory_recall) with an
    explicit RECALL_TIMEOUT and, on timeout/unreachable backend, return a clear
    "working / retry" signal instead of hanging silently or crashing. When a
    client supplies _meta.progressToken, emit one notifications/progress so the
    call shows life.

Ships to users via the plugin (mcp/memory-mcp.json runs src/.../mcp_server.py);
no server-side/API change needed. TDD: added concurrency tests (many threads +
sync writer on one file) and recall progress/bounding tests; full gate green
(ruff + mypy strict + 185 pytest).
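
A minimal sketch of the single-serialized-writer idea described above: one shared connection, one re-entrant lock, WAL, and bounded exponential-backoff retry on a residual lock. This is not the real local_store.py; the class name `SerializedStore`, its parameters, and the retry constants are hypothetical and chosen only for illustration.

```python
import random
import sqlite3
import threading
import time
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")


class SerializedStore:
    """Hypothetical stand-in for the LocalStore described in the commit."""

    def __init__(self, path: str, max_retries: int = 5, base_delay: float = 0.05) -> None:
        # One connection for the whole process. check_same_thread=False lets
        # any thread use it; the lock below is what makes that safe.
        self._con = sqlite3.connect(path, check_same_thread=False, timeout=5.0)
        self._con.execute("PRAGMA journal_mode=WAL")
        self._lock = threading.RLock()
        self._max_retries = max_retries
        self._base_delay = base_delay

    def write(self, fn: Callable[[sqlite3.Connection], T]) -> T:
        """Run fn in one transaction, retrying if SQLite reports a lock."""
        for attempt in range(self._max_retries + 1):
            try:
                # The connection context manager commits on success and
                # rolls back if fn raises.
                with self._lock, self._con:
                    return fn(self._con)
            except sqlite3.OperationalError as exc:
                if "locked" not in str(exc).lower() or attempt == self._max_retries:
                    raise
                # Bounded exponential backoff with a little jitter.
                delay = min(self._base_delay * 2**attempt, 1.0)
                time.sleep(delay + random.uniform(0, 0.01))
        raise AssertionError("unreachable")

    def read(self, sql: str, params: tuple = ()) -> list[tuple]:
        """Reads go through the same lock because they share the connection."""
        with self._lock:
            return self._con.execute(sql, params).fetchall()
```

In this sketch a sync component would receive the same `SerializedStore` instance instead of opening its own connection, so in-process writers queue on the lock rather than contending for SQLite's single writer slot.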

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-19 06:06:09 +00:00
Viktor Barzin
68088e684e ci: move image build off-infra to GHA -> ghcr (ADR-0002)
Some checks failed
Build and Push / lint-and-test (push) Has been cancelled
Build and Push / build (push) Has been cancelled
Build and Push / deploy (push) Has been cancelled
Build and Push / notify-failure (push) Has been cancelled
Generated by infra/scripts/offinfra-onboard: GHA builds+tests on the
GitHub mirror, pushes ghcr.io/viktorbarzin/claude-memory-mcp, then triggers the
Woodpecker deploy (repo 78). Old in-cluster build pipeline
removed:

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-13 02:37:51 +00:00
33 changed files with 4124 additions and 116 deletions


@ -70,6 +70,7 @@ jobs:
- uses: docker/build-push-action@v7
with:
context: .
file: docker/Dockerfile
push: true
platforms: linux/amd64
# Single-manifest images (no provenance/SBOM attestation children) so

6
.gitignore vendored

@ -48,3 +48,9 @@ docker/pgdata/
# Beads / Dolt files (added by bd init)
.dolt/
.beads-credential-key
# Agent git worktrees (standing policy: never the shared checkout)
.worktrees/
# uv lockfile — CI runs `uv sync` itself; not tracked
uv.lock

45
CONTEXT.md Normal file

@ -0,0 +1,45 @@
# Claude Memory MCP
Persistent cross-session memory for Claude. Today it stores **Memories** as rows and
retrieves them by **lexical recall** (full-text keyword matching). This context is being
extended with **semantic recall** (embeddings) and a **concept graph** so retrieval works
by meaning and related memories become traversable.
## Language
**Memory**:
A single stored unit of knowledge — a fact, preference, decision, project note, or person
detail — with content plus metadata (category, tags, importance). The atomic thing a user
stores and recalls.
**Recall**:
Retrieving the Memories most relevant to a query. The read path.
**Lexical recall**:
The existing retrieval method — matches Memories whose words (content, tags, LLM-generated
keywords) overlap the query, ranked by BM25 / `ts_rank`. Matches *tokens*, not meaning.
_Avoid_: calling this "semantic search" — it is not semantic.
**Semantic recall**:
Retrieval by meaning via dense-vector **Embedding** similarity, so a query surfaces a Memory
even with zero shared words (e.g. "what UI library?" → "prefers Svelte").
**Embedding**:
A dense vector representation of a Memory's (or Concept's) meaning, used for Semantic recall.
**Concept**:
A distinct entity or idea that recurs across Memories (e.g. "Svelte", "Viktor", "TripIt",
"frontend framework"). A node in the Concept graph. Distinct from a Memory: one Memory can
mention several Concepts, and one Concept spans many Memories.
**Concept graph**:
The network of Concepts joined by typed **Relationships**, making the memory store
traversable — from one Memory or Concept to related ones.
**Relationship**:
A typed, directed edge in the Concept graph, between two Concepts or between a Memory and a
Concept (e.g. `prefers`, `is-a`, `used-in`, `mentions`).
**Hybrid retrieval**:
The target read path — combining Lexical recall, Semantic recall, and Concept-graph
traversal into one ranked result set.
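
As a rough illustration of how several ranked lists can be combined into one ranked result set, the sketch below uses Reciprocal Rank Fusion (RRF), the fusion the benchmark commit names for the lexical and dense legs. The function name `rrf_fuse`, the constant `k=60` (a commonly used default), and the equal weighting of every leg are assumptions for illustration, not the prototype's actual configuration.

```python
from collections import defaultdict
from collections.abc import Sequence


def rrf_fuse(
    rankings: Sequence[Sequence[int]], k: int = 60, top_n: int = 10
) -> list[int]:
    """Fuse ranked id lists: each id scores sum(1 / (k + rank)) over the lists.

    Assumes each input list is best-first and contains no repeated ids.
    """
    scores: dict[int, float] = defaultdict(float)
    for ranking in rankings:
        for rank, memory_id in enumerate(ranking, start=1):
            scores[memory_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda mid: (-scores[mid], mid))[:top_n]


lexical_leg = [1, 2, 3]  # synthetic ids, best-first
dense_leg = [3, 1, 4]
print(rrf_fuse([lexical_leg, dense_leg]))  # [1, 3, 2, 4]
```

Ids 1 and 3 appear in both lists, so they outrank ids 2 and 4, which each appear in only one. RRF uses ranks only, which avoids having to normalise BM25 scores against cosine similarities.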

25
benchmarks/.gitignore vendored Normal file

@ -0,0 +1,25 @@
# Benchmark dataset is the user's REAL personal memories — NEVER commit.
# Privacy hard-rule (research task brief): corpus/queries/qrels stay LOCAL.
data/
.venv/
cache/
*.npy
*.faiss
*.db
# The eval-set GENERATOR embeds real memory-derived query text + author notes
# (paraphrases of real memories, real ids/notes). Treat it as a data artifact:
# LOCAL-ONLY, never commit. Regenerates data/ from corpus.jsonl. The HARNESS
# itself (harness/*.py, the other scripts) contains NO real content and is safe.
scripts/build_eval_set.py
# Python noise
__pycache__/
*.pyc
.pytest_cache/
*.egg-info/
.ipynb_checkpoints/
# Results from runs may quote real content — keep local by default.
results/
*.results.json

126
benchmarks/README.md Normal file

@ -0,0 +1,126 @@
# claude-memory recall benchmark
Stratified retrieval benchmark gating the hybrid-recall adoption decision
(ADR-0001): does dense-vector semantic recall + a concept graph beat the current
lexical FTS on **recall@5, recall@10, nDCG@10, MRR**? Quality decides adoption;
latency/storage are measured but non-gating.
> **PRIVACY — read first.** The corpus is the operator's REAL personal memories.
> `data/` (corpus/queries/qrels), `.venv/`, `cache/`, `results/`, and
> `scripts/build_eval_set.py` (the generator embeds memory-derived query text)
> are **gitignored and must never be committed**. Everything else here contains
> only code / aggregate numbers and is safe to commit. Sensitive memories
> (`is_sensitive=1`) are excluded from the corpus entirely.
## Layout
```
benchmarks/
  harness/                  # importable package (committable; no real content)
    types.py                # Memory, Query, Qrels, Retriever protocol
    metrics.py              # recall@k, nDCG@k, MRR (binary relevance)
    dataset.py              # load_dataset() + referential-integrity validation
    runner.py               # run_benchmark() -> overall + per-stratum + latency
    baselines.py            # SqliteFtsRetriever (faithful FTS5/BM25 reference)
    example_retriever.py    # worked example of the plug-in interface
    test_harness.py         # unit tests (pytest)
  scripts/
    export_corpus.py        # SQLite -> data/corpus.jsonl (non-sensitive only)
    build_eval_set.py       # -> data/queries.jsonl + qrels.jsonl  [GITIGNORED]
    dataset_stats.py        # validate + print AGGREGATE stats (safe)
    run_eval.py             # CLI: run a retriever, print/save metrics
  data/                     # [GITIGNORED] corpus.jsonl, queries.jsonl, qrels.jsonl
  .venv/                    # [GITIGNORED]
```
## Dataset schema (JSONL, one object per line)
**`corpus.jsonl`** — every non-sensitive memory:
```json
{"id": 137, "content": "...", "category": "decisions", "tags": "memory,architecture",
"expanded_keywords": "...", "importance": 0.85}
```
`id` (int) is the join key everywhere. `tags` is comma-separated; `expanded_keywords`
space-separated (matches the production schema).
**`queries.jsonl`** — eval queries, three strata:
```json
{"query_id": "para_006", "text": "...", "stratum": "paraphrase", "relevant_ids": [380],
"_note": "author rationale", "_jaccard": 0.023}
```
- `stratum` ∈ `exact` | `paraphrase` | `multihop`.
- `relevant_ids` is a convenience copy; **`qrels.jsonl` is authoritative**.
- `_note` / `_jaccard` are provenance fields (underscore-prefixed); ignore them in
scoring.
**`qrels.jsonl`** — binary relevance judgments (authoritative):
```json
{"query_id": "multi_006", "relevant_ids": [263, 423, 637]}
```
### Strata (what each one tests)
| stratum | construction | who should win |
|---|---|---|
| **exact** | query = a salient phrase lifted from ONE memory; that memory is relevant (verified as the top FTS hit at build time) | lexical already strong; floor check |
| **paraphrase** | query restates ONE memory's meaning in DIFFERENT words (low lexical overlap, validated Jaccard ≤ ~0.18 vs content+keywords) | **dense embeddings** |
| **multihop** | query needs 2+ DISTINCT memories sharing an entity/concept (e.g. project + decision, or a multi-part runbook); ALL are relevant | **concept graph** |
Where a near-duplicate memory equally satisfies a single-target query, qrels was
augmented to include the twin (so a good retriever isn't penalised); deliberate
discriminator queries are kept single-target on purpose.
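
The paraphrase stratum relies on a lexical-overlap check (Jaccard ≤ ~0.18). The generator that performs it is gitignored, so the following is only a sketch of what such a check could look like, assuming token-set Jaccard over lowercased alphanumeric tokens; the function name `token_jaccard` and the tokenisation are hypothetical.

```python
import re

_TOKEN_RE = re.compile(r"[a-z0-9_]+")


def token_jaccard(query: str, memory_text: str) -> float:
    """Jaccard similarity between the token sets of two strings (0.0 to 1.0)."""
    q = set(_TOKEN_RE.findall(query.lower()))
    m = set(_TOKEN_RE.findall(memory_text.lower()))
    if not q or not m:
        return 0.0
    return len(q & m) / len(q | m)


# Synthetic example from the glossary: no shared tokens at all.
print(token_jaccard("what UI library?", "prefers Svelte"))  # 0.0
```

A query would qualify as a paraphrase under this sketch when `token_jaccard(query, content + " " + expanded_keywords)` stays at or below the chosen threshold.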
## Pluggable retriever interface
A retriever is any object implementing **one** method:
```python
def retrieve(self, query: str, k: int) -> list[int]:
    """Return up to k memory ids (corpus `id`s), ranked best-first."""
```
Optional lifecycle hooks the runner uses if present (duck-typed):
```python
def build_index(self, corpus: list[Memory]) -> None: ... # timed separately
def index_size_bytes(self) -> int: ... # reported
name: str # label in reports
```
A bare callable `retrieve(query, k) -> list[int]` also works.
## Run it
```bash
.venv/bin/python scripts/export_corpus.py # (re)build data/corpus.jsonl
.venv/bin/python scripts/build_eval_set.py # (re)build queries+qrels (local)
.venv/bin/python scripts/dataset_stats.py # validate + aggregate stats
.venv/bin/python -m pytest harness/test_harness.py -q
# evaluate a retriever (built-in alias or module:Class)
.venv/bin/python scripts/run_eval.py --retriever fts5
.venv/bin/python scripts/run_eval.py --retriever your_pkg.mod:YourRetriever --json results/yours.json
```
Programmatic use:
```python
from harness import load_dataset, run_benchmark
ds = load_dataset()
result = run_benchmark(MyRetriever(), ds) # builds index, times queries
print(result.summary()) # overall + per-stratum table
result.to_dict() # full machine-readable result
```
`run_benchmark` requests `retrieve_k=20` per query by default (≥ the max metric
cutoff of 10), macro-averages metrics over queries (overall + per stratum), and
reports per-query latency p50/p95 plus index build time/size when the hooks exist.
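
To make the macro-averaging concrete: every query contributes one value per metric, and those values are averaged with equal weight, both overall and within each stratum. The sketch below is a stand-alone illustration; the row shape and the name `macro_average` are hypothetical and are not the harness's internal representation.

```python
from collections import defaultdict


def macro_average(rows: list[dict]) -> dict[str, dict[str, float]]:
    """Average each metric over queries, overall and per stratum.

    Each row is assumed to look like:
        {"stratum": "exact", "metrics": {"recall@10": 1.0, "mrr": 1.0}}
    """
    groups: dict[str, list[dict[str, float]]] = defaultdict(list)
    for row in rows:
        groups["overall"].append(row["metrics"])
        groups[row["stratum"]].append(row["metrics"])
    return {
        name: {key: sum(m[key] for m in ms) / len(ms) for key in ms[0]}
        for name, ms in groups.items()
    }


rows = [  # synthetic per-query results
    {"stratum": "exact", "metrics": {"recall@10": 1.0}},
    {"stratum": "paraphrase", "metrics": {"recall@10": 0.0}},
    {"stratum": "paraphrase", "metrics": {"recall@10": 0.5}},
]
print(macro_average(rows))
```

Because each query counts once, a multihop query with three relevant memories weighs the same in the average as an exact query with one.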
## Reference baseline
`harness.baselines.SqliteFtsRetriever` mirrors the production local-store search
(README "Search Algorithm"): FTS5 over content/category/tags/expanded_keywords,
`'"w1" OR "w2" ...'` MATCH, `ORDER BY bm25(), importance`. This is the lexical
"current system" any hybrid retriever must beat. (The Postgres `tsvector` path
uses weighted A/B/C/D ranking and an importance-first default; FTS5/BM25 is the
faithful, dependency-free relevance reference for the quality comparison.)


@ -0,0 +1,28 @@
"""Benchmark harness for claude-memory recall evaluation.
Public API:
from harness import Retriever, load_dataset, run_benchmark, BenchmarkResult
from harness import metrics
A retriever is any object (or callable) implementing:
retrieve(query: str, k: int) -> list[memory_id] # ranked, best first
memory_id matches the `id` field in corpus.jsonl / qrels.jsonl (int).
"""
from .types import Retriever, Query, Memory, Qrels
from .dataset import load_dataset, Dataset
from .runner import run_benchmark, BenchmarkResult, StratumResult
from . import metrics
__all__ = [
"Retriever",
"Query",
"Memory",
"Qrels",
"load_dataset",
"Dataset",
"run_benchmark",
"BenchmarkResult",
"StratumResult",
"metrics",
]


@ -0,0 +1,93 @@
"""Reference LEXICAL baseline retrievers that mirror the production system.
These exist so (a) the eval-set author can VERIFY a query's labels and check
that paraphrase queries genuinely defeat lexical matching, and (b) later agents
have an honest "current system" to beat.
`SqliteFtsRetriever` builds an in-memory SQLite FTS5 index over the corpus and
runs the SAME query shape the production local store uses:
words -> '"w1" OR "w2" ...' MATCH, ORDER BY bm25(), importance as tiebreak.
(README "SQLite: FTS5 with BM25".) This is the closest faithful, dependency-free
baseline. The Postgres tsvector path is documented in the README; its ranking
differs (weighted A/B/C/D + importance-first default) but for a quality ceiling
comparison the FTS5/BM25 relevance ordering is the right lexical reference.
"""
from __future__ import annotations
import re
import sqlite3
from collections.abc import Sequence
from .types import Memory, MemoryId
# FTS5 reserved-ish tokens; we quote every term anyway, but strip embedded quotes.
_WORD_RE = re.compile(r"[A-Za-z0-9_]+")
class SqliteFtsRetriever:
    """Faithful FTS5/BM25 lexical baseline (mirrors local_store search)."""

    name = "sqlite_fts5_bm25"

    def __init__(self, sort_by: str = "relevance") -> None:
        # "relevance": ORDER BY bm25(), importance DESC (best for quality eval)
        # "importance": ORDER BY importance DESC, ... (production default)
        self.sort_by = sort_by
        self._con: sqlite3.Connection | None = None

    def build_index(self, corpus: Sequence[Memory]) -> None:
        con = sqlite3.connect(":memory:")
        con.execute(
            """
            CREATE VIRTUAL TABLE memories_fts USING fts5(
                content, category, tags, expanded_keywords,
                memory_id UNINDEXED, importance UNINDEXED
            )
            """
        )
        con.executemany(
            "INSERT INTO memories_fts(content, category, tags, expanded_keywords, memory_id, importance)"
            " VALUES (?,?,?,?,?,?)",
            [
                (m.content, m.category, m.tags, m.expanded_keywords, m.id, m.importance)
                for m in corpus
            ],
        )
        con.commit()
        self._con = con

    def _fts_query(self, query: str) -> str:
        words = _WORD_RE.findall(query.lower())
        if not words:
            return ""
        return " OR ".join(f'"{w}"' for w in words)

    def retrieve(self, query: str, k: int) -> list[MemoryId]:
        assert self._con is not None, "call build_index first"
        match = self._fts_query(query)
        if not match:
            return []
        if self.sort_by == "importance":
            order = "importance DESC, bm25(memories_fts)"
        else:
            order = "bm25(memories_fts), importance DESC"
        try:
            rows = self._con.execute(
                f"SELECT memory_id FROM memories_fts WHERE memories_fts MATCH ? "
                f"ORDER BY {order} LIMIT ?",
                (match, k),
            ).fetchall()
        except sqlite3.OperationalError:
            # mirror production LIKE fallback on FTS syntax errors
            like = f"%{query}%"
            rows = self._con.execute(
                "SELECT memory_id FROM memories_fts WHERE content LIKE ? OR tags LIKE ? "
                "ORDER BY importance DESC LIMIT ?",
                (like, like, k),
            ).fetchall()
        return [r[0] for r in rows]

    def close(self) -> None:
        if self._con is not None:
            self._con.close()
            self._con = None


@ -0,0 +1,115 @@
"""Load corpus / queries / qrels JSONL into typed objects."""
from __future__ import annotations
import json
from dataclasses import dataclass
from pathlib import Path
from .types import Memory, Query, Qrels, MemoryId
_DATA_DIR = Path(__file__).resolve().parents[1] / "data"
@dataclass
class Dataset:
    corpus: list[Memory]
    queries: list[Query]
    qrels: Qrels

    @property
    def corpus_by_id(self) -> dict[MemoryId, Memory]:
        return {m.id: m for m in self.corpus}

    def strata(self) -> set[str]:
        return {q.stratum for q in self.queries}


def _read_jsonl(path: Path) -> list[dict]:
    out: list[dict] = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                out.append(json.loads(line))
    return out


def load_corpus(path: Path | None = None) -> list[Memory]:
    path = path or (_DATA_DIR / "corpus.jsonl")
    rows = _read_jsonl(path)
    return [
        Memory(
            id=r["id"],
            content=r["content"],
            category=r.get("category", "facts"),
            tags=r.get("tags", "") or "",
            expanded_keywords=r.get("expanded_keywords", "") or "",
            importance=r.get("importance", 0.5),
        )
        for r in rows
    ]


def load_queries(path: Path | None = None) -> list[Query]:
    path = path or (_DATA_DIR / "queries.jsonl")
    rows = _read_jsonl(path)
    return [
        Query(
            query_id=r["query_id"],
            text=r["text"],
            stratum=r["stratum"],
            relevant_ids=tuple(r.get("relevant_ids", [])),
        )
        for r in rows
    ]


def load_qrels(path: Path | None = None) -> Qrels:
    path = path or (_DATA_DIR / "qrels.jsonl")
    rows = _read_jsonl(path)
    qrels: Qrels = {}
    for r in rows:
        qid = r["query_id"]
        rel = set(r["relevant_ids"])
        qrels.setdefault(qid, set()).update(rel)
    return qrels


def load_dataset(
    corpus_path: Path | None = None,
    queries_path: Path | None = None,
    qrels_path: Path | None = None,
    *,
    validate: bool = True,
) -> Dataset:
    corpus = load_corpus(corpus_path)
    queries = load_queries(queries_path)
    qrels = load_qrels(qrels_path)
    if validate:
        _validate(corpus, queries, qrels)
    return Dataset(corpus=corpus, queries=queries, qrels=qrels)


def _validate(corpus: list[Memory], queries: list[Query], qrels: Qrels) -> None:
    corpus_ids = {m.id for m in corpus}
    q_ids = {q.query_id for q in queries}
    # Every query must have a qrels entry, and vice versa.
    missing_qrels = q_ids - set(qrels)
    if missing_qrels:
        raise ValueError(f"queries without qrels: {sorted(missing_qrels)[:10]}")
    orphan_qrels = set(qrels) - q_ids
    if orphan_qrels:
        raise ValueError(f"qrels without queries: {sorted(orphan_qrels)[:10]}")
    # Every relevant id must exist in the corpus and the set must be non-empty.
    for qid, rels in qrels.items():
        if not rels:
            raise ValueError(f"empty qrels for query {qid}")
        unknown = rels - corpus_ids
        if unknown:
            raise ValueError(
                f"query {qid} references non-corpus ids {sorted(unknown)[:10]}"
            )


@ -0,0 +1,59 @@
"""Worked example: how a later agent plugs a retriever into the harness.
A retriever needs only one method:
retrieve(self, query: str, k: int) -> list[int] # ranked memory ids
Optionally it may implement lifecycle hooks the runner will use if present:
build_index(self, corpus: list[Memory]) -> None # timed separately
index_size_bytes(self) -> int # reported
Run this file directly for a smoke test against the local eval set:
.venv/bin/python -m harness.example_retriever
"""
from __future__ import annotations
from collections.abc import Sequence
from .types import Memory, MemoryId
class SubstringRetriever:
    """Trivial baseline: rank by count of query-word occurrences in content.

    Deliberately weak: exists only to demonstrate the interface. The real
    lexical baseline is harness.baselines.SqliteFtsRetriever.
    """

    name = "substring_demo"

    def __init__(self) -> None:
        self._corpus: list[Memory] = []

    def build_index(self, corpus: Sequence[Memory]) -> None:
        self._corpus = list(corpus)

    def retrieve(self, query: str, k: int) -> list[MemoryId]:
        words = [w for w in query.lower().split() if len(w) > 2]
        scored: list[tuple[int, float]] = []
        for m in self._corpus:
            hay = (m.content + " " + m.expanded_keywords + " " + m.tags).lower()
            score = sum(hay.count(w) for w in words)
            if score:
                scored.append((m.id, score + m.importance))  # importance tiebreak
        scored.sort(key=lambda t: t[1], reverse=True)
        return [mid for mid, _ in scored[:k]]


def _smoke() -> None:
    from .dataset import load_dataset
    from .runner import run_benchmark

    ds = load_dataset()
    res = run_benchmark(SubstringRetriever(), ds)
    print(res.summary())


if __name__ == "__main__":
    _smoke()


@ -0,0 +1,100 @@
"""Retrieval metrics with BINARY relevance.
Conventions
-----------
- `ranked`: list of memory ids, best-first, as returned by a retriever.
- `relevant`: set of relevant memory ids for the query (from qrels).
- All functions are pure and operate on a single query; the runner aggregates
(macro-average over queries).
Definitions
-----------
  recall@k = |relevant ∩ ranked[:k]| / |relevant|
             (fraction of all relevant items retrieved within the top k)
  MRR      = 1 / rank_of_first_relevant (0 if none retrieved at all)
  nDCG@k   = DCG@k / IDCG@k with binary gains (gain=1 for relevant)
             DCG@k = sum over i in [1..k] of rel_i / log2(i + 1)
             IDCG@k is the DCG of the ideal ranking (all relevant first),
             capped at min(|relevant|, k) ones.
Notes
-----
- nDCG uses the standard log2(rank+1) discount (Järvelin & Kekäläinen 2002);
with binary gains this is the common IR convention also used by BEIR/pytrec_eval.
- MRR is reported as the reciprocal rank of the FIRST relevant hit, which for a
single query equals the per-query reciprocal-rank that the runner averages.
- Duplicate ids in `ranked` are de-duplicated keeping first occurrence, so a
retriever cannot inflate recall by repeating an id.
"""
from __future__ import annotations
import math
from collections.abc import Iterable, Sequence
MemoryId = int
def _dedup_keep_order(ranked: Sequence[MemoryId]) -> list[MemoryId]:
    seen: set[MemoryId] = set()
    out: list[MemoryId] = []
    for x in ranked:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out


def recall_at_k(ranked: Sequence[MemoryId], relevant: Iterable[MemoryId], k: int) -> float:
    rel = set(relevant)
    if not rel:
        # Undefined; treat as 0 contribution. Runner should never pass empty.
        return 0.0
    top = _dedup_keep_order(ranked)[:k]
    hits = sum(1 for x in top if x in rel)
    return hits / len(rel)


def reciprocal_rank(ranked: Sequence[MemoryId], relevant: Iterable[MemoryId]) -> float:
    rel = set(relevant)
    if not rel:
        return 0.0
    for i, x in enumerate(_dedup_keep_order(ranked), start=1):
        if x in rel:
            return 1.0 / i
    return 0.0


def dcg_at_k(ranked: Sequence[MemoryId], relevant: Iterable[MemoryId], k: int) -> float:
    rel = set(relevant)
    top = _dedup_keep_order(ranked)[:k]
    dcg = 0.0
    for i, x in enumerate(top, start=1):
        if x in rel:
            dcg += 1.0 / math.log2(i + 1)
    return dcg


def ndcg_at_k(ranked: Sequence[MemoryId], relevant: Iterable[MemoryId], k: int) -> float:
    rel = set(relevant)
    if not rel:
        return 0.0
    dcg = dcg_at_k(ranked, rel, k)
    ideal_hits = min(len(rel), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    if idcg == 0.0:
        return 0.0
    return dcg / idcg


def per_query_metrics(ranked: Sequence[MemoryId], relevant: Iterable[MemoryId]) -> dict[str, float]:
    """All headline metrics for one query."""
    rel = set(relevant)
    return {
        "recall@5": recall_at_k(ranked, rel, 5),
        "recall@10": recall_at_k(ranked, rel, 10),
        "ndcg@10": ndcg_at_k(ranked, rel, 10),
        "mrr": reciprocal_rank(ranked, rel),
    }
METRIC_NAMES = ("recall@5", "recall@10", "ndcg@10", "mrr")


@ -0,0 +1,223 @@
"""Benchmark runner: drive a pluggable retriever over the eval set and report
overall + per-stratum quality metrics, plus per-query latency and (optional)
index build time / size.
Quality decides adoption (recall@k, nDCG@10, MRR). Latency and storage are
measured and reported but DO NOT gate the decision (ADR-0001 success metric).
"""
from __future__ import annotations
import statistics
import time
from collections.abc import Callable
from dataclasses import dataclass, field, asdict
from typing import Any
from . import metrics
from .dataset import Dataset
from .types import MemoryId, Query, Retriever
# A retriever may be the Protocol object or a bare callable retrieve(query, k).
RetrieverLike = Retriever | Callable[[str, int], list[MemoryId]]
# k used for the retrieve() call. We request enough depth to compute all
# metrics (max cutoff is 10) with headroom so ties past k=10 don't distort.
DEFAULT_RETRIEVE_K = 20
def _percentile(values: list[float], pct: float) -> float:
    """Linear-interpolation percentile (pct in [0,100]). Empty -> 0.0."""
    if not values:
        return 0.0
    if len(values) == 1:
        return values[0]
    s = sorted(values)
    rank = (pct / 100.0) * (len(s) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(s) - 1)
    frac = rank - lo
    return s[lo] + (s[hi] - s[lo]) * frac


@dataclass
class StratumResult:
    stratum: str
    n_queries: int
    metrics: dict[str, float]  # macro-averaged metric -> value


@dataclass
class BenchmarkResult:
    retriever_name: str
    n_queries: int
    retrieve_k: int
    overall: dict[str, float]
    per_stratum: dict[str, StratumResult]
    latency_ms: dict[str, float]  # mean / p50 / p95 / max
    index_build_seconds: float | None = None
    index_size_bytes: int | None = None
    per_query: list[dict[str, Any]] = field(default_factory=list)

    def to_dict(self) -> dict:
        d = asdict(self)
        d["per_stratum"] = {k: asdict(v) for k, v in self.per_stratum.items()}
        return d

    def summary(self) -> str:
        lines = [
            f"Retriever: {self.retriever_name}",
            f"Queries: {self.n_queries} (retrieve_k={self.retrieve_k})",
        ]
        if self.index_build_seconds is not None:
            lines.append(f"Index build: {self.index_build_seconds:.3f}s")
        if self.index_size_bytes is not None:
            lines.append(f"Index size: {self.index_size_bytes / 1e6:.2f} MB")
        lat = self.latency_ms
        lines.append(
            "Latency/query: "
            f"p50={lat['p50']:.2f}ms p95={lat['p95']:.2f}ms "
            f"mean={lat['mean']:.2f}ms max={lat['max']:.2f}ms"
        )
        cols = metrics.METRIC_NAMES
        header = " ".join(f"{c:>10}" for c in cols)
        lines.append("")
        lines.append(f"{'stratum':<12}{'n':>5} {header}")
        lines.append("-" * (19 + len(header)))
        for name in ("overall", *sorted(self.per_stratum)):
            if name == "overall":
                m, n = self.overall, self.n_queries
            else:
                sr = self.per_stratum[name]
                m, n = sr.metrics, sr.n_queries
            row = " ".join(f"{m[c]:>10.4f}" for c in cols)
            lines.append(f"{name:<12}{n:>5} {row}")
        return "\n".join(lines)
def _get_retrieve_fn(retriever: RetrieverLike) -> Callable[[str, int], list[MemoryId]]:
    if hasattr(retriever, "retrieve"):
        return retriever.retrieve  # type: ignore[attr-defined]
    if callable(retriever):
        return retriever
    raise TypeError("retriever must implement retrieve(query, k) or be callable")


def _maybe_build_index(retriever: RetrieverLike, dataset: Dataset) -> tuple[float | None, int | None]:
    """Call optional lifecycle hooks if present (duck-typed).

    - build_index(corpus) -> None : measured wall-clock build time.
    - index_size_bytes() -> int : reported on-disk/in-memory index size.

    Returns (build_seconds_or_None, size_bytes_or_None).
    """
    build_seconds: float | None = None
    size_bytes: int | None = None
    build = getattr(retriever, "build_index", None)
    if callable(build):
        t0 = time.perf_counter()
        build(dataset.corpus)
        build_seconds = time.perf_counter() - t0
    size_fn = getattr(retriever, "index_size_bytes", None)
    if callable(size_fn):
        try:
            size_bytes = int(size_fn())
        except Exception:
            size_bytes = None
    return build_seconds, size_bytes
def run_benchmark(
retriever: RetrieverLike,
dataset: Dataset,
*,
retrieve_k: int = DEFAULT_RETRIEVE_K,
retriever_name: str | None = None,
warmup: bool = True,
collect_per_query: bool = True,
) -> BenchmarkResult:
"""Evaluate `retriever` over `dataset`.
The retriever is asked for `retrieve_k` ids per query (>= max metric
cutoff of 10). Metrics are macro-averaged over queries, overall and per
stratum. Latency is measured around each retrieve() call only (index build
is timed separately via the optional build_index hook).
"""
name = retriever_name or getattr(retriever, "name", None) or type(retriever).__name__
retrieve = _get_retrieve_fn(retriever)
qrels = dataset.qrels
build_seconds, size_bytes = _maybe_build_index(retriever, dataset)
# Optional warmup (first call can pay import/JIT/connection costs that would
# skew p95). Excluded from latency stats. Uses the first query if any.
if warmup and dataset.queries:
try:
retrieve(dataset.queries[0].text, retrieve_k)
except Exception:
pass # warmup failures surface on the real call below
per_query_rows: list[dict[str, Any]] = []
latencies_ms: list[float] = []
# accumulate per-stratum metric sums for macro-average
strata: dict[str, dict[str, float]] = {}
strata_counts: dict[str, int] = {}
overall_sums = {m: 0.0 for m in metrics.METRIC_NAMES}
for q in dataset.queries:
rel = qrels[q.query_id]
t0 = time.perf_counter()
ranked = list(retrieve(q.text, retrieve_k))
dt_ms = (time.perf_counter() - t0) * 1000.0
latencies_ms.append(dt_ms)
m = metrics.per_query_metrics(ranked, rel)
for key, val in m.items():
overall_sums[key] += val
strata.setdefault(q.stratum, {mm: 0.0 for mm in metrics.METRIC_NAMES})
strata_counts[q.stratum] = strata_counts.get(q.stratum, 0) + 1
for key, val in m.items():
strata[q.stratum][key] += val
if collect_per_query:
per_query_rows.append(
{
"query_id": q.query_id,
"stratum": q.stratum,
"n_relevant": len(rel),
"latency_ms": round(dt_ms, 3),
"retrieved": ranked[:retrieve_k],
**{k: round(v, 6) for k, v in m.items()},
}
)
n = len(dataset.queries)
overall = {k: (overall_sums[k] / n if n else 0.0) for k in metrics.METRIC_NAMES}
per_stratum: dict[str, StratumResult] = {}
for s, sums in strata.items():
c = strata_counts[s]
per_stratum[s] = StratumResult(
stratum=s,
n_queries=c,
metrics={k: (sums[k] / c if c else 0.0) for k in metrics.METRIC_NAMES},
)
latency_stats = {
"mean": statistics.fmean(latencies_ms) if latencies_ms else 0.0,
"p50": _percentile(latencies_ms, 50),
"p95": _percentile(latencies_ms, 95),
"max": max(latencies_ms) if latencies_ms else 0.0,
}
return BenchmarkResult(
retriever_name=name,
n_queries=n,
retrieve_k=retrieve_k,
overall=overall,
per_stratum=per_stratum,
latency_ms=latency_stats,
index_build_seconds=build_seconds,
index_size_bytes=size_bytes,
per_query=per_query_rows,
)
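The latency stats above rely on `_percentile`, which is defined outside this excerpt. The following is a minimal sketch of a linearly interpolated percentile, assuming only the behaviour the unit tests in `harness/test_harness.py` check (an empty input gives 0.0, and the median of four values is interpolated). The name `percentile_sketch` is hypothetical, and this is not necessarily how the runner implements it.

```python
def percentile_sketch(values, pct):
    """Linearly interpolated percentile; returns 0.0 for an empty input."""
    if not values:
        return 0.0
    ordered = sorted(values)
    # Fractional index of the requested percentile in the sorted list.
    pos = (pct / 100.0) * (len(ordered) - 1)
    lo = int(pos)                       # lower neighbour (pos is non-negative)
    hi = min(lo + 1, len(ordered) - 1)  # upper neighbour, clamped to the end
    frac = pos - lo
    return ordered[lo] + (ordered[hi] - ordered[lo]) * frac


latencies_ms = [10, 20, 30, 40]
print(percentile_sketch(latencies_ms, 50))  # 25.0
print(percentile_sketch([], 50))            # 0.0
```

Interpolating matters for small query sets: with only a handful of latencies, a nearest-rank p95 would simply report the maximum.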


@ -0,0 +1,145 @@
"""Unit tests for metrics + runner. No real corpus needed (synthetic data).
Run: .venv/bin/python -m pytest harness/test_harness.py -q
"""
from __future__ import annotations
import math
from harness import metrics
from harness.dataset import Dataset
from harness.runner import run_benchmark, _percentile
from harness.types import Memory, Query
# ---------------- metrics ----------------
def test_recall_at_k_basic():
ranked = [9, 8, 3, 7, 1]
rel = {3, 1, 99} # 99 never retrieved
assert metrics.recall_at_k(ranked, rel, 5) == 2 / 3
assert metrics.recall_at_k(ranked, rel, 2) == 0.0 # neither in top2
assert metrics.recall_at_k(ranked, rel, 3) == 1 / 3 # only id 3 in top3
def test_recall_perfect_and_zero():
assert metrics.recall_at_k([1, 2, 3], {1, 2, 3}, 5) == 1.0
assert metrics.recall_at_k([4, 5, 6], {1, 2, 3}, 5) == 0.0
def test_reciprocal_rank():
assert metrics.reciprocal_rank([5, 4, 3], {3}) == 1 / 3
assert metrics.reciprocal_rank([3, 4, 5], {3}) == 1.0
assert metrics.reciprocal_rank([7, 8], {3}) == 0.0
# first relevant wins
assert metrics.reciprocal_rank([9, 3, 1], {1, 3}) == 1 / 2
def test_ndcg_perfect():
# all relevant at the top -> nDCG == 1
assert math.isclose(metrics.ndcg_at_k([1, 2, 3, 4], {1, 2, 3}, 10), 1.0)
def test_ndcg_known_value():
# single relevant doc at rank 2: DCG = 1/log2(3); IDCG = 1/log2(2)=1
ranked = [9, 1, 8]
val = metrics.ndcg_at_k(ranked, {1}, 10)
assert math.isclose(val, (1 / math.log2(3)) / 1.0)
def test_ndcg_two_relevant_suboptimal_order():
# relevant {1,2}; retrieved at ranks 1 and 3
ranked = [1, 9, 2]
dcg = 1 / math.log2(2) + 1 / math.log2(4) # ranks 1 and 3
idcg = 1 / math.log2(2) + 1 / math.log2(3) # ideal ranks 1 and 2
assert math.isclose(metrics.ndcg_at_k(ranked, {1, 2}, 10), dcg / idcg)
def test_dedup_does_not_inflate():
# repeating a relevant id must not increase recall beyond 1 hit's worth
ranked = [3, 3, 3, 3]
assert metrics.recall_at_k(ranked, {3, 7}, 5) == 0.5
assert metrics.reciprocal_rank(ranked, {3}) == 1.0
def test_empty_relevant_is_zero():
assert metrics.recall_at_k([1, 2], set(), 5) == 0.0
assert metrics.ndcg_at_k([1, 2], set(), 5) == 0.0
# ---------------- percentile ----------------
def test_percentile():
vals = [10, 20, 30, 40]
assert _percentile(vals, 50) == 25.0 # interpolated median
assert _percentile(vals, 0) == 10
assert _percentile(vals, 100) == 40
assert _percentile([5.0], 95) == 5.0
assert _percentile([], 50) == 0.0
# ---------------- runner ----------------
def _toy_dataset() -> Dataset:
corpus = [Memory(id=i, content=f"memory {i}") for i in range(1, 11)]
queries = [
Query("q_exact_1", "find 1", "exact", (1,)),
Query("q_para_1", "restate 5", "paraphrase", (5,)),
Query("q_multi_1", "join 3 and 4", "multihop", (3, 4)),
]
qrels = {"q_exact_1": {1}, "q_para_1": {5}, "q_multi_1": {3, 4}}
return Dataset(corpus=corpus, queries=queries, qrels=qrels)
class _PerfectRetriever:
"""Returns exactly the relevant ids first (oracle) — for runner plumbing."""
def __init__(self, qrels):
self._qrels = qrels
self._by_text = None
def build_index(self, corpus):
self._n = len(corpus)
def index_size_bytes(self):
return 1234
def retrieve(self, query, k):
# map query text back via the toy queries' known answers
mapping = {"find 1": [1], "restate 5": [5], "join 3 and 4": [3, 4]}
ids = mapping.get(query, [])
# pad with distractors
pad = [x for x in range(100, 100 + k)]
return (ids + pad)[:k]
def test_runner_perfect_retriever():
ds = _toy_dataset()
r = _PerfectRetriever(ds.qrels)
res = run_benchmark(r, ds, retriever_name="perfect")
assert res.n_queries == 3
assert math.isclose(res.overall["recall@10"], 1.0)
assert math.isclose(res.overall["mrr"], 1.0)
assert math.isclose(res.overall["ndcg@10"], 1.0)
# per-stratum present
assert set(res.per_stratum) == {"exact", "paraphrase", "multihop"}
assert res.per_stratum["multihop"].n_queries == 1
# lifecycle hooks captured
assert res.index_build_seconds is not None
assert res.index_size_bytes == 1234
# latency recorded
assert res.latency_ms["p95"] >= 0.0
def test_runner_callable_retriever_and_misses():
ds = _toy_dataset()
def retrieve(query, k): # always wrong
return [999][:k]
res = run_benchmark(retrieve, ds, retriever_name="bad", warmup=False)
assert res.overall["recall@10"] == 0.0
assert res.overall["mrr"] == 0.0
assert res.index_build_seconds is None # no hook on a bare callable
assert "perfect" not in res.summary()
assert "bad" in res.summary()
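The tests above pin down the metric semantics without showing `harness/metrics.py` itself. Below is a minimal sketch of binary-relevance recall@k, reciprocal rank, and nDCG@k that is consistent with those assertions. It assumes duplicates are removed from the ranking before any rank is counted; the real module may handle duplicates differently, and the helper `_dedup` is hypothetical.

```python
import math


def _dedup(ranked):
    """Drop repeated ids, keeping the first occurrence (assumed behaviour)."""
    seen, out = set(), []
    for doc_id in ranked:
        if doc_id not in seen:
            seen.add(doc_id)
            out.append(doc_id)
    return out


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant ids that appear in the top k."""
    if not relevant:
        return 0.0
    top = _dedup(ranked)[:k]
    return sum(1 for d in top if d in relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant id, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(_dedup(ranked), start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-gain nDCG: DCG over the top k divided by the ideal DCG."""
    if not relevant:
        return 0.0
    top = _dedup(ranked)[:k]
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(top, start=1)
        if doc_id in relevant
    )
    # Ideal ordering puts every relevant id (up to k of them) first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


print(recall_at_k([9, 8, 3, 7, 1], {3, 1, 99}, 5))
print(reciprocal_rank([9, 3, 1], {1, 3}))  # 0.5
```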


@ -0,0 +1,53 @@
"""Core dataclasses and the pluggable Retriever protocol."""
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Protocol, runtime_checkable
MemoryId = int
@dataclass(frozen=True)
class Memory:
"""One corpus entry (mirrors corpus.jsonl)."""
id: MemoryId
content: str
category: str = "facts"
tags: str = ""
expanded_keywords: str = ""
importance: float = 0.5
@dataclass(frozen=True)
class Query:
"""One eval query (mirrors queries.jsonl)."""
query_id: str
text: str
stratum: str # "exact" | "paraphrase" | "multihop"
# convenience copy of relevant ids; authoritative source is Qrels
relevant_ids: tuple[MemoryId, ...] = field(default_factory=tuple)
# query_id -> set of relevant memory ids (binary relevance)
Qrels = dict[str, set[MemoryId]]
@runtime_checkable
class Retriever(Protocol):
"""Pluggable retriever contract.
Implementations rank corpus memories for a query and return the top-k
memory ids, best match first. The harness will call `retrieve` once per
query and compare against qrels.
Optional lifecycle hooks let a retriever build an index from the corpus
and report index build time / on-disk size; the runner uses them if
present (duck-typed), so a minimal retriever need only implement
`retrieve`.
"""
def retrieve(self, query: str, k: int) -> list[MemoryId]:
"""Return up to k memory ids, ranked best-first."""
...
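As an illustration of the contract above, here is a minimal sketch of a retriever that implements only `retrieve`. The `Retriever` protocol is re-declared locally so the block is self-contained; `OverlapRetriever` and its word-overlap scoring are hypothetical and exist only to show the shape of a conforming implementation.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Retriever(Protocol):
    """Local copy of the protocol shape, for a self-contained example."""

    def retrieve(self, query: str, k: int) -> list[int]: ...


class OverlapRetriever:
    """Toy retriever: ranks documents by the number of shared lowercase words."""

    def __init__(self, docs: dict[int, str]) -> None:
        self._docs = docs

    def retrieve(self, query: str, k: int) -> list[int]:
        q_words = set(query.lower().split())
        scored = [
            (len(q_words & set(text.lower().split())), doc_id)
            for doc_id, text in self._docs.items()
        ]
        # Keep only documents with some overlap; best score first, id as tie-break.
        hits = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
        return [doc_id for _, doc_id in hits[:k]]


docs = {1: "rotate the api token", 2: "api gateway timeout", 3: "weekly backup plan"}
retriever = OverlapRetriever(docs)
print(retriever.retrieve("api token", 2))  # [1, 2]
print(isinstance(retriever, Retriever))    # True
```

Note that `runtime_checkable` only checks that a `retrieve` attribute exists; it does not validate the signature or the return type.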


@ -0,0 +1,10 @@
"""Pluggable retrievers for the claude-memory recall benchmark.
Each retriever implements the harness `retrieve(query, k) -> list[int]` contract
(see ``harness/types.py`` :: ``Retriever``) and, optionally, the ``build_index`` /
``index_size_bytes`` lifecycle hooks the runner duck-types.
``fts.FtsRetriever`` is the LEXICAL BASELINE: the product's current local-store
recall (SQLite FTS5/BM25). It is the "current system" any hybrid retriever must
beat on recall@k / nDCG@10 / MRR (ADR-0001).
"""


@ -0,0 +1,224 @@
"""BASELINE retriever: the product's CURRENT lexical recall (SQLite FTS5/BM25).
This is the "current system" the hybrid upgrade (dense embeddings + concept
graph, ADR-0001) must beat on recall@k / nDCG@10 / MRR. It is a *faithful*
reimplementation of the production local-store recall path, not an idealised
sketch: it mirrors ``src/claude_memory/mcp_server.py :: _sqlite_recall`` (and
the FTS5 schema/triggers in the same module) line-for-line where it matters.
Production recall (``sort_by="relevance"``) does ALL of the following, and so
does this retriever:
1. **Concatenate then split.** The MCP tool builds
``all_terms = f"{context} {expanded_query}"`` and splits it on whitespace,
stripping any embedded ``"`` from each token. The harness already hands us
one ``query`` string (the concatenation happens upstream of recall), so here
``query`` IS ``all_terms``; we split + strip identically.
2. **AND-first, then OR-broaden.** Production builds BOTH
``'"w1" AND "w2" ...'`` and ``'"w1" OR "w2" ...'`` and runs the **AND** match
first; only if it returns zero rows does it fall back to the **OR** match.
(The README's "Search Algorithm" prose shows only the OR form; the *code* is
AND-then-OR, and the code is authoritative. We replicate the code.)
3. **Blended BM25+importance relevance ordering.** ``sort_by="relevance"`` is
NOT a pure ``ORDER BY bm25()``. It is the blend
``(-bm25(memories_fts) * 0.7 + importance * 0.3) DESC`` (bm25 is negated
because SQLite returns more-negative = better-match). We use the EXACT same
expression. We deliberately evaluate ``relevance`` (not the production
``importance`` default) so the benchmark measures RETRIEVAL quality rather
than the importance-sort prior, per the research brief.
4. **FTS5 default tokenizer.** The production virtual table is declared with no
explicit tokenizer, i.e. ``unicode61``: case-folding plus Unicode diacritic
stripping, NO stemming and NO stop-word removal. We declare ours the same
way, so "running" does not match "run" (a known lexical weakness the dense
path is expected to fix on the *paraphrase* stratum).
5. **LIKE fallback.** If the FTS5 MATCH raises ``sqlite3.OperationalError``
(e.g. a token that trips the FTS5 query grammar), production degrades to a
``content LIKE %context% OR tags LIKE %context%`` scan ordered by importance.
We mirror that fallback (using the full query as the LIKE needle, since the
harness query is the whole ``all_terms``).
DIFFERENCES FROM PRODUCTION (all immaterial to ranking, documented for honesty):
- The benchmark corpus has no per-user / soft-delete / category filtering, so we
drop the ``user_id``/``deleted_at``/``category`` predicates. No category is
passed by the harness, so the category branch is never taken anyway.
- We build a fresh in-memory FTS5 index over ``data/corpus.jsonl`` rather than
reading the live ``memory.db``; same schema, same tokenizer, same columns
(content/category/tags/expanded_keywords), so BM25 statistics match what the
product would compute over the same documents.
The harness reference ``harness.baselines.SqliteFtsRetriever`` implements the
*README* ordering (pure ``ORDER BY bm25(), importance``). This module is the
faithful-to-the-CODE variant and is the one the RUN reports as ``retriever="fts"``.
"""
from __future__ import annotations
import re
import sqlite3
from collections.abc import Sequence
# Import the corpus dataclass from the sibling harness package. run_eval.py and
# run_benchmark put the benchmarks/ root on sys.path; support direct execution
# (python retrievers/fts.py) too by adding it ourselves if the import fails.
try: # pragma: no cover - exercised by both import paths
from harness.types import Memory, MemoryId
except ModuleNotFoundError: # pragma: no cover
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
from harness.types import Memory, MemoryId
# Mirror production token extraction: split ``all_terms`` on whitespace and strip
# any embedded double-quote from each token (mcp_server uses
# ``w.replace(chr(34), "")``). We lowercase as well; FTS5 unicode61 case-folds
# regardless, so this only normalises the quoted MATCH literals we emit.
_DQUOTE = '"'
class FtsRetriever:
"""Faithful reimplementation of the production SQLite FTS5/BM25 recall.
Mirrors ``_sqlite_recall(sort_by="relevance")``: AND-first then OR-broaden
over an FTS5(content, category, tags, expanded_keywords) index, ranked by
the blended ``(-bm25*0.7 + importance*0.3)`` score, with a LIKE fallback.
"""
#: Label surfaced in benchmark reports / the RUN schema.
name = "fts"
def __init__(self, sort_by: str = "relevance") -> None:
# We benchmark "relevance" so the metric reflects retrieval quality, not
# the importance prior. "importance" is kept for parity / diagnostics.
if sort_by not in ("relevance", "importance"):
raise ValueError(f"sort_by must be 'relevance' or 'importance', got {sort_by!r}")
self.sort_by = sort_by
self._con: sqlite3.Connection | None = None
# ── lifecycle hooks (duck-typed by the runner) ───────────────────────────
def build_index(self, corpus: Sequence[Memory]) -> None:
"""Build a fresh in-memory FTS5 index over the corpus.
Same virtual-table shape and (default ``unicode61``) tokenizer as the
production ``memories_fts`` table. We carry ``memory_id`` and
``importance`` as UNINDEXED columns so the relevance blend can read
importance without a join; this is semantically identical to the production
``memories m JOIN memories_fts fts ON m.id = fts.rowid`` read.
"""
con = sqlite3.connect(":memory:")
con.execute(
"""
CREATE VIRTUAL TABLE memories_fts USING fts5(
content, category, tags, expanded_keywords,
memory_id UNINDEXED, importance UNINDEXED
)
"""
)
con.executemany(
"INSERT INTO memories_fts"
"(content, category, tags, expanded_keywords, memory_id, importance)"
" VALUES (?,?,?,?,?,?)",
[
(
m.content,
m.category,
m.tags,
m.expanded_keywords,
int(m.id),
float(m.importance),
)
for m in corpus
],
)
con.commit()
self._con = con
def index_size_bytes(self) -> int:
"""Approximate on-disk index size (sum of FTS5 shadow-table page bytes).
The index is in-memory, so this is the SQLite page accounting for the
FTS5 shadow tables. It is reported for the storage column and is non-gating
per ADR-0001.
"""
if self._con is None:
return 0
try:
page_count = self._con.execute("PRAGMA page_count").fetchone()[0]
page_size = self._con.execute("PRAGMA page_size").fetchone()[0]
return int(page_count) * int(page_size)
except sqlite3.Error:
return 0
# ── query construction (mirrors _sqlite_recall) ──────────────────────────
@staticmethod
def _tokens(query: str) -> list[str]:
"""Split ``all_terms`` exactly as production does: whitespace split,
drop embedded double-quotes, drop empties."""
return [w.replace(_DQUOTE, "").lower() for w in query.split() if w.strip()]
@classmethod
def _and_or_queries(cls, query: str) -> tuple[str, str]:
"""Build the ('"w1" AND "w2" ...', '"w1" OR "w2" ...') MATCH pair."""
words = cls._tokens(query)
if not words:
return "", ""
quoted = [f'"{w}"' for w in words]
return " AND ".join(quoted), " OR ".join(quoted)
def _order_clause(self) -> str:
# bm25() is negative (more-negative = better), so negate before blending.
if self.sort_by == "relevance":
return "(-bm25(memories_fts) * 0.7 + importance * 0.3) DESC"
return "(-bm25(memories_fts) * 0.4 + importance * 0.6) DESC"
# ── retrieve ──────────────────────────────────────────────────────────────
def retrieve(self, query: str, k: int) -> list[MemoryId]:
"""Return up to ``k`` memory ids, ranked best-first.
AND-match first (precise); if it yields nothing, OR-broaden. On an FTS5
grammar error, fall back to a LIKE scan ordered by importance, exactly as
the production degradation path does.
"""
assert self._con is not None, "call build_index first"
and_query, or_query = self._and_or_queries(query)
if not or_query: # no usable tokens
return []
order = self._order_clause()
base_select = "SELECT memory_id FROM memories_fts WHERE memories_fts MATCH ? "
try:
rows: list[tuple[int]] = []
# AND first for precise matches, fall back to OR for broader recall.
for fts_query in (and_query, or_query):
rows = self._con.execute(
f"{base_select}ORDER BY {order} LIMIT ?",
(fts_query, k),
).fetchall()
if rows:
break
except sqlite3.OperationalError:
# Mirror production LIKE fallback: full query as the needle,
# ordered by importance.
like = f"%{query}%"
rows = self._con.execute(
"SELECT memory_id FROM memories_fts "
"WHERE content LIKE ? OR tags LIKE ? "
"ORDER BY importance DESC LIMIT ?",
(like, like, k),
).fetchall()
return [r[0] for r in rows]
def close(self) -> None:
if self._con is not None:
self._con.close()
self._con = None
# Convenience for `run_eval.py --retriever retrievers.fts:FtsRetriever`
# and a no-arg default instantiation (sort_by="relevance").
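The AND-first, then OR-broaden strategy that `FtsRetriever.retrieve` describes can be sketched in isolation as follows. This is a simplified illustration, not the production path: it assumes the local SQLite build includes FTS5, orders by plain `bm25()` instead of the blended BM25+importance score, and omits the LIKE fallback. The table name `docs` and the function names are hypothetical.

```python
import sqlite3


def and_or_queries(query: str) -> tuple[str, str]:
    """Build the quoted AND and OR MATCH expressions from whitespace tokens."""
    words = [w.replace('"', "").lower() for w in query.split()]
    quoted = [f'"{w}"' for w in words if w]
    return " AND ".join(quoted), " OR ".join(quoted)


def search(con: sqlite3.Connection, query: str, k: int) -> list[int]:
    """Run the AND query first; broaden to OR only if AND returns no rows."""
    and_q, or_q = and_or_queries(query)
    if not or_q:  # no usable tokens
        return []
    rows = []
    for match_expr in (and_q, or_q):
        rows = con.execute(
            "SELECT doc_id FROM docs WHERE docs MATCH ? "
            "ORDER BY bm25(docs) LIMIT ?",  # bm25(): more negative = better
            (match_expr, k),
        ).fetchall()
        if rows:
            break
    return [int(r[0]) for r in rows]


con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE docs USING fts5(content, doc_id UNINDEXED)")
con.executemany(
    "INSERT INTO docs(content, doc_id) VALUES (?, ?)",
    [("rotate the api token", 1), ("api gateway timeout", 2), ("weekly backup plan", 3)],
)

print(search(con, "api token", 5))             # AND succeeds: [1]
print(sorted(search(con, "token backup", 5)))  # AND empty, OR broadens: [1, 3]
```

The second query shows why the fallback exists: no single document contains both words, so a strict AND would return nothing.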


@ -0,0 +1,570 @@
"""HYBRID retriever (ADR-0001/0002/0003 prototype): lexical FTS + dense semantic
recall + a memory-node concept graph, fused with Reciprocal Rank Fusion (RRF).
This is the self-contained prototype the hybrid-recall ADOPTION decision is gated
on (ADR-0001): does dense embeddings + a concept graph beat the current lexical
FTS5/BM25 on recall@5/recall@10/nDCG@10/MRR? Quality decides; latency/storage are
reported but non-gating.
It implements the harness ``retrieve(query, k) -> list[int]`` contract and the
optional ``build_index(corpus)`` / ``index_size_bytes()`` / ``name`` hooks.
Three legs, mirroring the FINAL DESIGN
======================================
1. **Lexical (FTS5/BM25).** We reuse the *faithful* production reimplementation
``retrievers.fts.FtsRetriever`` verbatim: AND-first then OR-broaden over an
FTS5(content, category, tags, expanded_keywords) index, ranked by the blended
``(-bm25*0.7 + importance*0.3)``. This is the exact "current system" the hybrid
must beat, so the lexical leg of the hybrid IS that system (no drift).
2. **Dense (semantic).** Embeddings per FINAL DESIGN: a HOSTED API is used ONLY if
its key is in the environment (``OPENAI_API_KEY`` / ``VOYAGE_API_KEY`` /
``CO_API_KEY``) AND the memory is non-sensitive (ADR-0003); otherwise the local
default ``BAAI/bge-large-en-v1.5`` (1024-d, MIT, sentence-transformers). The
benchmark corpus is already sensitive-free (``is_sensitive=1`` excluded at
export, README privacy note), so here the choice is purely "hosted key present
or not". Vectors are L2-normalised; similarity is cosine = dot product. The
corpus matrix is cached to ``cache/`` (gitignored) keyed by model id + a corpus
fingerprint, so re-runs skip re-embedding. BGE retrieval convention: the QUERY
gets the instruction prefix "Represent this sentence for searching relevant
passages: "; passages are embedded raw (per the official BAAI model card).
3. **Graph (concept expansion).** A memory-node concept graph built with the
design's TRACTABLE extraction — NO 5452 sequential LLM calls. Concepts are the
union of each memory's ``tags`` and its already-LLM-generated
``expanded_keywords`` (plus salient content noun-phrases via a lightweight
regex/stop-word filter), normalised and de-pluralised. A concept that appears
in 2..N memories (very common concepts above a document-frequency ceiling are
dropped as non-discriminative) links those memories: ``memory -[shares
concept c]- memory``. At query time we take the fused dense+lexical SEEDS, walk
1 hop to neighbours that share *discriminative* concepts, and emit those
neighbours as a third ranked list. This targets the **multihop** stratum
(queries needing 2+ memories that share an entity/concept) without re-ranking
the precise hits the other legs already nail.
Fusion (``retrieval_fusion``)
=============================
Reciprocal Rank Fusion (Cormack et al., 2009): for a document *d* with rank
``r_leg(d)`` (1-based) in a leg's ranked list,
RRF(d) = Σ_leg w_leg / (k_rrf + r_leg(d))
with ``k_rrf = 60`` (the standard constant) and per-leg weights. RRF is
score-scale-free (no BM25-vs-cosine calibration), which is why the design floats
"RRF vs CC" and we pick RRF for the prototype. The dense and lexical legs carry
full weight; the graph leg is down-weighted (it is a RECALL extender for multihop,
and the design explicitly flags a possible negative graph prior, so it can add
documents but should not dethrone strong dense/lexical hits). All three weights
are class attributes so the kill-gate analysis can ablate the graph to zero.
Graceful degradation (task requirement)
=======================================
If the embedding model cannot be loaded/used (missing package, download failure,
OOM), the dense leg is skipped, the failure is recorded in ``self.errors``, and the
retriever degrades to **FTS + graph** (or FTS-only if the graph also failed). The
harness still gets metrics for whatever worked.
"""
from __future__ import annotations
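The weighted RRF formula in the module docstring above can be sketched as a standalone function. This is an illustration under stated assumptions, not the prototype's fusion code: the name `rrf_fuse`, the dict-of-lists input, and the tie-break on id are all hypothetical; the constant 60 and the weights 1.0 / 1.0 / 0.35 are taken from the docstring and class attributes.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, weights, k_rrf=60, top_k=10):
    """Weighted Reciprocal Rank Fusion over per-leg ranked id lists.

    ranked_lists: {"leg name": [doc ids, best first]}
    weights:      {"leg name": weight}; legs without an entry default to 1.0
    """
    scores = defaultdict(float)
    for leg, ranked in ranked_lists.items():
        w = weights.get(leg, 1.0)
        for rank, doc_id in enumerate(ranked, start=1):  # ranks are 1-based
            scores[doc_id] += w / (k_rrf + rank)
    # Highest fused score first; id as a deterministic tie-break.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_k]


legs = {
    "fts": [1, 2, 3],
    "dense": [3, 1, 4],
    "graph": [5],
}
weights = {"fts": 1.0, "dense": 1.0, "graph": 0.35}
print(rrf_fuse(legs, weights))  # [1, 3, 2, 4, 5]
```

With these weights, a document that appears only in the graph leg at rank 1 scores 0.35 / 61 ≈ 0.0057, which is lower than 1 / (60 + r) for any full-weight rank r up to about 114. A graph-only candidate therefore sorts below every document that a full-weight leg returns within that depth.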
import hashlib
import os
import re
import sys
import time
from collections import defaultdict
from collections.abc import Sequence
from pathlib import Path
# ── package-relative imports that also work under direct execution ────────────
try: # pragma: no cover - exercised by both import paths
from harness.types import Memory, MemoryId
from retrievers.fts import FtsRetriever
except ModuleNotFoundError: # pragma: no cover
sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
from harness.types import Memory, MemoryId
from retrievers.fts import FtsRetriever
_BENCH_ROOT = Path(__file__).resolve().parents[1]
_CACHE_DIR = _BENCH_ROOT / "cache"
# Local default embedding model (FINAL DESIGN: prototype default + sensitive-only
# fallback). 1024-d, MIT-licensed, strong on MTEB retrieval.
_LOCAL_MODEL = "BAAI/bge-large-en-v1.5"
# BGE retrieval query instruction (official BAAI model card recommendation; the
# v1.5 line relaxed it but it still helps short-query / long-passage asymmetry,
# which is exactly the paraphrase stratum). Applied to QUERIES only.
_BGE_QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "
# RRF constant (Cormack/Clarke/Buettcher 2009). 60 is the canonical default.
_RRF_K = 60
# Concept-graph tuning.
# _CONCEPT_MIN_DF : a concept must appear in >= this many memories to form edges
# (df==1 links nothing; we need a shared concept).
# _CONCEPT_MAX_DF_FRAC : drop concepts appearing in more than this fraction of
# the corpus — they are non-discriminative hubs ("memory",
# "homelab") that would over-connect the graph (design risk:
# "over-merge").
# _GRAPH_SEEDS : how many fused seeds to expand from.
# _GRAPH_NEIGHBOURS_PER_SEED : cap neighbours pulled per seed (keeps the graph
# leg from flooding the candidate pool).
_CONCEPT_MIN_DF = 2
_CONCEPT_MAX_DF_FRAC = 0.02
_GRAPH_SEEDS = 10
_GRAPH_NEIGHBOURS_PER_SEED = 25
# A small English stop-word set for the lightweight noun-phrase extraction. We
# deliberately keep this tiny + dependency-free (no spaCy/NLTK download on the hot
# path); the heavy lifting is done by the pre-computed ``expanded_keywords``.
_STOPWORDS = frozenset(
"""
a an the of to in on at by for with from into over under and or but not is are
was were be been being do does did has have had this that these those it its as
if then than so such no yes can will would should could may might must i you he
she they we me him her them us my your his their our about above after again all
any because before below between both during each few more most other some only
own same too very up down out off here there when where which who whom what how
""".split()
)
_WORD_RE = re.compile(r"[A-Za-z][A-Za-z0-9_+.-]{2,}")
def _normalise_concept(token: str) -> str:
"""Lowercase, strip surrounding punctuation, light de-plural so concept
variants collapse to one node (e.g. 'decisions'->'decision',
'addresses'->'address', 'policies'->'policy'). This is a heuristic collapser,
not a linguistically perfect stemmer. Its only job is to merge obvious
plural/singular pairs so the graph links them; exactness is not load-bearing.
Order matters: -ies, then -sses, then sibilant -es, then a bare trailing -s."""
t = token.lower().strip(".,;:!?()[]{}\"'`")
if len(t) > 4 and t.endswith("ies"): # policies -> policy
return t[:-3] + "y"
if len(t) > 4 and t.endswith("sses"): # addresses -> address, classes -> class
return t[:-2]
if len(t) > 4 and t.endswith(("ches", "shes", "xes", "zes", "ses")): # boxes->box
return t[:-2]
if len(t) > 3 and t.endswith("s") and not t.endswith(("ss", "us", "is")): # tags->tag
return t[:-1]
return t
def _concepts_for(memory: Memory) -> set[str]:
"""Extract the concept set for one memory: the union of its tags, its
expanded_keywords, and its salient content tokens. ``expanded_keywords`` is
already an LLM-generated keyword field in the corpus, so this is the design's
'tractable extraction': we reuse the extraction that production already pays
for instead of making new LLM calls."""
concepts: set[str] = set()
# tags: comma-separated
for tag in memory.tags.split(","):
c = _normalise_concept(tag)
if len(c) >= 3 and c not in _STOPWORDS:
concepts.add(c)
# expanded_keywords: space-separated, already curated
for kw in memory.expanded_keywords.split():
c = _normalise_concept(kw)
if len(c) >= 3 and c not in _STOPWORDS:
concepts.add(c)
# salient content tokens (lightweight noun-phrase proxy: alpha tokens len>=3,
# not stop-words). This is a cheap NER/noun-phrase stand-in per the design.
for m in _WORD_RE.finditer(memory.content):
c = _normalise_concept(m.group(0))
if len(c) >= 3 and c not in _STOPWORDS:
concepts.add(c)
return concepts
def _corpus_fingerprint(corpus: Sequence[Memory]) -> str:
"""Stable hash over (id, content) so the embedding cache invalidates if the
corpus changes but is reused across runs of the same corpus."""
h = hashlib.sha256()
for m in corpus:
h.update(str(m.id).encode())
h.update(b"\x00")
h.update(m.content.encode("utf-8", "replace"))
h.update(b"\x01")
return h.hexdigest()[:16]
class HybridRetriever:
"""Lexical FTS + dense (bge-large-en-v1.5 / hosted) + concept-graph expansion,
fused with RRF. Degrades to FTS(+graph) if embeddings are unavailable."""
#: Label surfaced in benchmark reports / the RUN schema.
name = "hybrid"
# Per-leg RRF weights. Dense + lexical carry full weight; graph is a
# down-weighted recall extender (design: possible negative graph prior).
w_dense = 1.0
w_fts = 1.0
w_graph = 0.35
def __init__(self, model_name: str | None = None) -> None:
self.errors: list[str] = []
self.model_name = model_name or _LOCAL_MODEL
self.embedding_backend: str = "none" # "local:<model>" | "hosted:<provider>:<model>"
self.embedding_dim: int | None = None
# FTS leg (always available; pure stdlib sqlite).
self._fts = FtsRetriever(sort_by="relevance")
# Dense leg state.
self._model = None # SentenceTransformer or None
self._np = None # numpy module handle (set on successful dense build)
self._emb = None # (N, d) float32 L2-normalised matrix, row i ↔ self._ids[i]
self._ids: list[MemoryId] = [] # row order of self._emb
# Graph leg state.
self._graph = None # networkx.Graph or None
self._concept_to_mems: dict[str, list[MemoryId]] = {}
self._mem_concepts: dict[MemoryId, set[str]] = {}
self._n_concepts_total = 0 # before df pruning, for reporting
self._n_concepts_kept = 0
self._n_edges = 0
self._corpus_size = 0
# ── lifecycle: build_index (timed by the runner) ─────────────────────────
def build_index(self, corpus: Sequence[Memory]) -> None:
corpus = list(corpus)
self._corpus_size = len(corpus)
_CACHE_DIR.mkdir(parents=True, exist_ok=True)
# 1) lexical leg
self._fts.build_index(corpus)
# 2) dense leg (graceful)
try:
self._build_dense(corpus)
except Exception as exc: # pragma: no cover - defensive
self.errors.append(f"dense leg disabled: {type(exc).__name__}: {exc}")
self._model = None
self._emb = None
# 3) graph leg (graceful)
try:
self._build_graph(corpus)
except Exception as exc: # pragma: no cover - defensive
self.errors.append(f"graph leg disabled: {type(exc).__name__}: {exc}")
self._graph = None
self._concept_to_mems = {}
# ── dense leg ────────────────────────────────────────────────────────────
def _select_embedding_backend(self) -> str:
"""Pick the embedding backend per FINAL DESIGN: hosted only if a key is in
the env (non-sensitive corpus already guaranteed by export), else local.
Returns a human label and sets self.model_name accordingly."""
if os.environ.get("VOYAGE_API_KEY"):
self.model_name = "voyage-3.5"
return "hosted:voyage:voyage-3.5"
if os.environ.get("OPENAI_API_KEY"):
self.model_name = "text-embedding-3-large"
return "hosted:openai:text-embedding-3-large"
if os.environ.get("CO_API_KEY"):
self.model_name = "embed-english-v3.0"
return "hosted:cohere:embed-english-v3.0"
self.model_name = _LOCAL_MODEL
return f"local:{_LOCAL_MODEL}"
def _build_dense(self, corpus: Sequence[Memory]) -> None:
import numpy as np # required for the dense leg
self._np = np
self.embedding_backend = self._select_embedding_backend()
self._ids = [m.id for m in corpus]
fp = _corpus_fingerprint(corpus)
safe_model = self.model_name.replace("/", "_")
emb_path = _CACHE_DIR / f"emb_{safe_model}_{fp}.npy"
ids_path = _CACHE_DIR / f"emb_{safe_model}_{fp}.ids.npy"
# cache hit?
if emb_path.exists() and ids_path.exists():
cached_ids = np.load(ids_path)
if list(cached_ids.tolist()) == self._ids:
self._emb = np.load(emb_path).astype(np.float32)
self.embedding_dim = int(self._emb.shape[1])
return # cached embeddings reused
# cache miss → embed
if self.embedding_backend.startswith("hosted:"):
vecs = self._embed_hosted([m.content for m in corpus])
else:
vecs = self._embed_local([m.content for m in corpus])
vecs = vecs.astype(np.float32)
# L2-normalise so dot product == cosine.
norms = np.linalg.norm(vecs, axis=1, keepdims=True)
norms[norms == 0] = 1.0
vecs = vecs / norms
self._emb = vecs
self.embedding_dim = int(vecs.shape[1])
np.save(emb_path, vecs)
np.save(ids_path, np.array(self._ids, dtype=np.int64))
def _load_local_model(self):
from sentence_transformers import SentenceTransformer
if self._model is None:
# CPU is fine for ~5.5k short docs; force CPU to avoid CUDA init noise.
self._model = SentenceTransformer(_LOCAL_MODEL, device="cpu")
# Median memory is ~120 tokens; cap the window at 384 so the rare long
# memory (1.6% > 512 tok) doesn't pad an entire batch to 512. bge's
# native max is 512; 384 keeps ~p99 intact while bounding CPU cost.
self._model.max_seq_length = min(self._model.max_seq_length, 384)
return self._model
def _embed_local(self, texts: list[str]):
import numpy as np
model = self._load_local_model()
# Length-sort so each batch pads to a homogeneous length (big CPU win), then
# restore original order. Passages embedded raw; the caller L2-normalises so
# the local and hosted paths stay byte-for-byte consistent downstream.
order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
sorted_texts = [texts[i] for i in order]
out = model.encode(
sorted_texts,
batch_size=64,
convert_to_numpy=True,
normalize_embeddings=False,
show_progress_bar=False,
)
out = np.asarray(out)
# invert the permutation
restored = np.empty_like(out)
restored[np.asarray(order)] = out
return restored
def _embed_query_local(self, query: str):
import numpy as np
model = self._load_local_model()
out = model.encode(
[_BGE_QUERY_INSTRUCTION + query],
convert_to_numpy=True,
normalize_embeddings=True, # query L2-normalised → cosine via dot
show_progress_bar=False,
)
return np.asarray(out)[0]
def _embed_hosted(self, texts: list[str]):
"""Batch-embed passages via the selected hosted API. Implemented for
Voyage / OpenAI / Cohere; only reached when the matching key is set."""
import numpy as np
backend = self.embedding_backend
if backend.startswith("hosted:voyage"):
import voyageai
client = voyageai.Client()
vecs: list[list[float]] = []
for i in range(0, len(texts), 128):
batch = texts[i : i + 128]
r = client.embed(batch, model="voyage-3.5", input_type="document")
vecs.extend(r.embeddings)
return np.asarray(vecs)
if backend.startswith("hosted:openai"):
from openai import OpenAI
client = OpenAI()
vecs = []
for i in range(0, len(texts), 256):
batch = texts[i : i + 256]
r = client.embeddings.create(model="text-embedding-3-large", input=batch)
vecs.extend([d.embedding for d in r.data])
return np.asarray(vecs)
if backend.startswith("hosted:cohere"):
import cohere
client = cohere.Client()
vecs = []
for i in range(0, len(texts), 96):
batch = texts[i : i + 96]
r = client.embed(texts=batch, model="embed-english-v3.0", input_type="search_document")
vecs.extend(r.embeddings)
return np.asarray(vecs)
raise RuntimeError(f"unknown hosted backend {backend!r}")
def _embed_query_hosted(self, query: str):
import numpy as np
backend = self.embedding_backend
if backend.startswith("hosted:voyage"):
import voyageai
client = voyageai.Client()
r = client.embed([query], model="voyage-3.5", input_type="query")
v = np.asarray(r.embeddings[0], dtype=np.float32)
elif backend.startswith("hosted:openai"):
from openai import OpenAI
client = OpenAI()
r = client.embeddings.create(model="text-embedding-3-large", input=[query])
v = np.asarray(r.data[0].embedding, dtype=np.float32)
elif backend.startswith("hosted:cohere"):
import cohere
client = cohere.Client()
r = client.embed(texts=[query], model="embed-english-v3.0", input_type="search_query")
v = np.asarray(r.embeddings[0], dtype=np.float32)
else:
raise RuntimeError(f"unknown hosted backend {backend!r}")
n = np.linalg.norm(v)
return v / n if n else v
def _dense_rank(self, query: str, k: int) -> list[MemoryId]:
"""Top-k corpus ids by cosine similarity to the query embedding."""
if self._emb is None or self._np is None:
return []
np = self._np
if self.embedding_backend.startswith("hosted:"):
qv = self._embed_query_hosted(query)
else:
qv = self._embed_query_local(query)
sims = self._emb @ qv # (N,) cosine sims (both sides L2-normalised)
kk = min(k, sims.shape[0])
if kk <= 0: # k == 0 or an empty matrix: argpartition(…, kk - 1) would misbehave
return []
# argpartition for the top-kk, then sort those by score desc.
idx = np.argpartition(-sims, kk - 1)[:kk]
idx = idx[np.argsort(-sims[idx])]
return [self._ids[i] for i in idx]

    # ── graph leg ──────────────────────────────────────────────────────────

    def _build_graph(self, corpus: Sequence[Memory]) -> None:
        import networkx as nx

        n = len(corpus)
        max_df = max(_CONCEPT_MIN_DF, int(_CONCEPT_MAX_DF_FRAC * n))
        # concept → set(memory ids)
        concept_to_mems: dict[str, set[MemoryId]] = defaultdict(set)
        mem_concepts: dict[MemoryId, set[str]] = {}
        for m in corpus:
            cs = _concepts_for(m)
            mem_concepts[m.id] = cs
            for c in cs:
                concept_to_mems[c].add(m.id)
        self._n_concepts_total = len(concept_to_mems)
        # Keep only discriminative concepts: appear in [_CONCEPT_MIN_DF, max_df]
        # memories. Below MIN_DF links nothing; above max_df is a non-specific hub.
        kept: dict[str, list[MemoryId]] = {}
        for c, mems in concept_to_mems.items():
            df = len(mems)
            if _CONCEPT_MIN_DF <= df <= max_df:
                kept[c] = sorted(mems)
        self._n_concepts_kept = len(kept)
        self._concept_to_mems = kept
        # restrict each memory's concept set to kept concepts (for neighbour scoring)
        self._mem_concepts = {
            mid: {c for c in cs if c in kept} for mid, cs in mem_concepts.items()
        }
        # Build a weighted memory-node graph: edge weight = # shared kept concepts.
        # Edges come from per-concept cliques. There is no separate fan-out cap:
        # the max_df filter above bounds every clique at max_df members, which is
        # what limits the O(df^2) pair enumeration (design risk: over-merge).
        g = nx.Graph()
        g.add_nodes_from(m.id for m in corpus)
        edge_w: dict[tuple[MemoryId, MemoryId], int] = defaultdict(int)
        for c, mems in kept.items():
            # mems is small (<= max_df) by construction; full clique is fine.
            for i in range(len(mems)):
                a = mems[i]
                for j in range(i + 1, len(mems)):
                    b = mems[j]
                    key = (a, b) if a < b else (b, a)
                    edge_w[key] += 1
        for (a, b), w in edge_w.items():
            g.add_edge(a, b, weight=w)
        self._n_edges = g.number_of_edges()
        self._graph = g

    def _graph_rank(self, seeds: list[MemoryId], exclude: set[MemoryId], k: int) -> list[MemoryId]:
        """From fused seeds, walk 1 hop and rank neighbour memories by accumulated
        edge weight (shared-concept strength), weighted by the seed's own rank so
        higher-confidence seeds pull harder. Returns up to k NEW ids (not in
        ``exclude``)."""
        if self._graph is None or not seeds:
            return []
        g = self._graph
        scores: dict[MemoryId, float] = defaultdict(float)
        for rank, s in enumerate(seeds[:_GRAPH_SEEDS], start=1):
            if s not in g:
                continue
            seed_w = 1.0 / rank  # earlier seeds contribute more
            nbrs = sorted(
                g[s].items(), key=lambda kv: kv[1].get("weight", 1), reverse=True
            )[:_GRAPH_NEIGHBOURS_PER_SEED]
            for nbr, data in nbrs:
                if nbr in exclude:
                    continue
                scores[nbr] += seed_w * float(data.get("weight", 1))
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return [mid for mid, _ in ranked[:k]]

    # ── fusion + retrieve ────────────────────────────────────────────────────

    @staticmethod
    def _rrf_accumulate(scores: dict[MemoryId, float], ranked: list[MemoryId], weight: float) -> None:
        for r, mid in enumerate(ranked, start=1):
            scores[mid] += weight / (_RRF_K + r)

    def retrieve(self, query: str, k: int) -> list[MemoryId]:
        """Fuse lexical + dense + graph-expansion via weighted RRF and return the
        top-k memory ids. Pulls each leg deeper than k so fusion has material to
        re-order, then truncates."""
        depth = max(k, 50)  # per-leg retrieval depth before fusion
        fts_ranked = self._fts.retrieve(query, depth)
        dense_ranked = self._dense_rank(query, depth)  # [] if dense disabled
        # Seeds for graph expansion: RRF of the two base legs (so the graph walks
        # from the best-agreed memories, not just one leg's view).
        seed_scores: dict[MemoryId, float] = defaultdict(float)
        self._rrf_accumulate(seed_scores, fts_ranked, self.w_fts)
        self._rrf_accumulate(seed_scores, dense_ranked, self.w_dense)
        seeds = [mid for mid, _ in sorted(seed_scores.items(), key=lambda kv: kv[1], reverse=True)]
        base_set = set(seeds)
        graph_ranked = self._graph_rank(seeds, exclude=base_set, k=depth)
        # Final weighted RRF over all three legs.
        scores: dict[MemoryId, float] = defaultdict(float)
        self._rrf_accumulate(scores, fts_ranked, self.w_fts)
        self._rrf_accumulate(scores, dense_ranked, self.w_dense)
        self._rrf_accumulate(scores, graph_ranked, self.w_graph)
        # Sort by score desc; ties broken by ascending id for determinism.
        fused = sorted(scores.items(), key=lambda kv: (kv[1], -kv[0]), reverse=True)
        return [mid for mid, _ in fused[:k]]

    # ── reporting hooks ───────────────────────────────────────────────────────

    def index_size_bytes(self) -> int:
        """Sum of the dense matrix bytes + FTS index bytes (graph is in-memory
        networkx; we approximate it via node+edge count * a small constant).
        Non-gating per ADR-0001; reported for the storage column."""
        total = 0
        if self._emb is not None:
            total += int(self._emb.nbytes)
        try:
            total += self._fts.index_size_bytes()
        except Exception:
            pass
        if self._graph is not None:
            # rough: ~64 B/node + ~96 B/edge accounting for python object overhead.
            total += self._graph.number_of_nodes() * 64 + self._graph.number_of_edges() * 96
        return total

    def graph_stats(self) -> dict[str, int]:
        return {
            "nodes": self._graph.number_of_nodes() if self._graph is not None else 0,
            "edges": self._n_edges,
            "concepts_total": self._n_concepts_total,
            "concepts_kept": self._n_concepts_kept,
        }

    def close(self) -> None:
        self._fts.close()


# Convenience for `run_eval.py --retriever retrievers.hybrid:HybridRetriever`.

@@ -0,0 +1,204 @@
"""Unit tests for the HYBRID retriever's pure logic: concept normalisation, the
concept-graph build + 1-hop expansion, weighted RRF fusion, and graceful
degradation when the dense leg is unavailable.

These tests are MODEL-FREE on purpose: they never load sentence-transformers (a
~1.3 GB / multi-minute CPU load). The dense leg is exercised by monkeypatching the
ranking method, so the fusion + graph behaviour is verified deterministically and
fast. The full end-to-end quality run is done via scripts/run_eval.py against the
real (local, gitignored) corpus.

Run: .venv/bin/python -m pytest retrievers/test_hybrid.py -q
"""
from __future__ import annotations

import math

from harness.types import Memory
from retrievers.hybrid import (
    _RRF_K,
    HybridRetriever,
    _concepts_for,
    _normalise_concept,
)

# ---------------- concept normalisation ----------------


def test_normalise_concept_depluralisation():
    cases = {
        "Decisions": "decision",
        "policies": "policy",
        "addresses": "address",
        "boxes": "box",
        "tags": "tag",
        # invariants: don't over-strip
        "access": "access",
        "class": "class",
        "status": "status",
        "analysis": "analysis",
        "kubernetes": "kubernete",  # heuristic, acceptable (collapses consistently)
        "k8s": "k8s",
        "GPU": "gpu",
    }
    for inp, exp in cases.items():
        assert _normalise_concept(inp) == exp, f"{inp!r} -> {_normalise_concept(inp)!r}"


def test_normalise_concept_is_stable_under_repetition():
    # normalising an already-normalised token must be a no-op (idempotent), so the
    # graph collapses variants consistently no matter the source field.
    for tok in ["decision", "policy", "address", "tag", "gpu", "access"]:
        assert _normalise_concept(_normalise_concept(tok)) == _normalise_concept(tok)


def test_concepts_for_unions_tags_keywords_content():
    m = Memory(
        id=1,
        content="The Postgres cluster uses pgvector for embeddings.",
        tags="database,postgres",
        expanded_keywords="cnpg vector search",
    )
    cs = _concepts_for(m)
    # from tags (note: 'postgres' de-plurals to 'postgre', a consistent heuristic
    # collapse; what matters is every memory mentioning it lands on the SAME node).
    assert "database" in cs and "postgre" in cs
    # from expanded_keywords
    assert "cnpg" in cs and "vector" in cs and "search" in cs
    # from content (salient tokens, stop-words removed)
    assert "pgvector" in cs and "embedding" in cs  # 'embeddings' -> 'embedding'
    assert "the" not in cs and "for" not in cs  # stop-words excluded

# ---------------- graph build + expansion ----------------


def _shared_concept_corpus() -> list[Memory]:
    # Three memories share concept "alpha" (df=3); two share "beta" (df=2); "gamma"
    # is unique (df=1, links nothing). With min_df=2 and a generous max_df, alpha
    # and beta both form edges.
    return [
        Memory(id=10, content="alpha topic one", tags="alpha", expanded_keywords="beta"),
        Memory(id=20, content="alpha topic two", tags="alpha", expanded_keywords="beta"),
        Memory(id=30, content="alpha topic three", tags="alpha", expanded_keywords="gamma"),
        Memory(id=40, content="unrelated delta", tags="delta", expanded_keywords="delta"),
    ]


def test_graph_build_links_shared_concepts():
    r = HybridRetriever()
    # widen max_df so small-corpus concepts aren't pruned as "hubs"
    import retrievers.hybrid as H

    old = H._CONCEPT_MAX_DF_FRAC
    H._CONCEPT_MAX_DF_FRAC = 1.0
    try:
        r._build_graph(_shared_concept_corpus())
    finally:
        H._CONCEPT_MAX_DF_FRAC = old
    g = r._graph
    assert g is not None
    # alpha links 10-20-30 (a triangle); beta links 10-20; "topic" links 10-20-30
    # too (shared content token). So the triangle exists and 10-20 is the heaviest
    # edge (they additionally share 'beta').
    assert g.has_edge(10, 20)
    assert g.has_edge(10, 30)
    assert g.has_edge(20, 30)
    # 10-20 share alpha + beta + topic (=3); 10-30 share alpha + topic (=2). The
    # exact counts aren't load-bearing; the INVARIANT is w(10,20) > w(10,30).
    assert g[10][20]["weight"] > g[10][30]["weight"]
    # the unrelated memory 40 (concept 'delta', df=1) links nothing.
    assert g.degree(40) == 0
    stats = r.graph_stats()
    assert stats["nodes"] == 4 and stats["edges"] >= 3


def test_graph_rank_expands_from_seeds_by_weight():
    r = HybridRetriever()
    import retrievers.hybrid as H

    old = H._CONCEPT_MAX_DF_FRAC
    H._CONCEPT_MAX_DF_FRAC = 1.0
    try:
        r._build_graph(_shared_concept_corpus())
    finally:
        H._CONCEPT_MAX_DF_FRAC = old
    # Seed from memory 10; neighbours 20 and 30 should both surface, with 20
    # ranked above 30 because the 10-20 edge is heavier (it additionally carries
    # 'beta'). Only the relative order matters here, not the exact weights.
    nbrs = r._graph_rank([10], exclude={10}, k=10)
    assert nbrs[:2] == [20, 30]
    # excluded seeds are never returned
    assert 10 not in nbrs


def test_graph_rank_empty_without_graph_or_seeds():
    r = HybridRetriever()  # no graph built
    assert r._graph_rank([1, 2], exclude=set(), k=5) == []
    r._graph = object.__new__(type("G", (), {}))  # truthy but unused
    assert r._graph_rank([], exclude=set(), k=5) == []  # no seeds

# ---------------- RRF fusion ----------------


def test_rrf_accumulate_formula():
    from collections import defaultdict

    scores: dict[int, float] = defaultdict(float)
    HybridRetriever._rrf_accumulate(scores, [7, 8, 9], weight=1.0)
    assert math.isclose(scores[7], 1.0 / (_RRF_K + 1))
    assert math.isclose(scores[8], 1.0 / (_RRF_K + 2))
    assert math.isclose(scores[9], 1.0 / (_RRF_K + 3))
    # a second weighted list adds on top
    HybridRetriever._rrf_accumulate(scores, [8], weight=0.5)
    assert math.isclose(scores[8], 1.0 / (_RRF_K + 2) + 0.5 / (_RRF_K + 1))


def test_retrieve_fuses_all_three_legs_and_degrades():
    """End-to-end fusion with the dense leg STUBBED (no model). Verifies (a) FTS +
    dense agreement floats a doc to the top, (b) the graph leg can introduce a doc
    neither base leg returned, and (c) dense-disabled degrades to FTS(+graph)."""
    corpus = [
        Memory(id=1, content="alpha shared concept", tags="alpha", expanded_keywords="alpha"),
        Memory(id=2, content="alpha shared concept too", tags="alpha", expanded_keywords="alpha"),
        Memory(id=3, content="beta unrelated", tags="beta", expanded_keywords="beta"),
    ]
    import retrievers.hybrid as H

    old = H._CONCEPT_MAX_DF_FRAC
    H._CONCEPT_MAX_DF_FRAC = 1.0
    try:
        r = HybridRetriever()
        # Stub the dense BUILD so the test never loads the ~1.3 GB model nor writes
        # to the shared cache/ dir; build_index then only does FTS + graph.
        r._build_dense = lambda _c: None  # type: ignore[method-assign]
        r.build_index(corpus)  # FTS + graph build only
        # Stub the dense RANKER deterministically to "agree" with FTS on doc 1.
        r._dense_rank = lambda q, k: [1]  # type: ignore[method-assign]
        # query matching doc 1 lexically; doc 2 shares concept 'alpha' with doc 1
        # (graph neighbour) even if FTS ranks it lower.
        out = r.retrieve("alpha shared concept", k=3)
        assert out, "should return something"
        assert out[0] == 1  # FTS+dense agreement puts doc 1 first
        # Doc 2 shares 'alpha' with doc 1. NOTE: doc 2 also contains every query
        # token, so the lexical leg can return it on its own; this assertion does
        # not by itself isolate the graph leg.
        assert 2 in out
    finally:
        H._CONCEPT_MAX_DF_FRAC = old


def test_graceful_degradation_records_error(monkeypatch):
    """If the dense build raises, the retriever records it and still serves FTS."""
    corpus = [Memory(id=i, content=f"doc number {i} content", tags="t") for i in range(1, 6)]
    r = HybridRetriever()

    def boom(_corpus):
        raise RuntimeError("simulated embedding failure")

    monkeypatch.setattr(r, "_build_dense", boom)
    r.build_index(corpus)
    assert any("dense leg disabled" in e for e in r.errors)
    assert r._emb is None
    # FTS still answers
    out = r.retrieve("doc number 3 content", k=5)
    assert 3 in out

@@ -0,0 +1,49 @@
#!/usr/bin/env python3
"""Validate the eval set and print AGGREGATE stats (safe to share / commit-able
numbers only; prints NO raw memory content)."""
from __future__ import annotations

import json
import statistics
import sys
from collections import Counter
from pathlib import Path

sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
from harness import load_dataset  # noqa: E402


def main() -> None:
    ds = load_dataset(validate=True)  # raises on any referential-integrity issue
    strata = Counter(q.stratum for q in ds.queries)
    rel_per_q = {s: [] for s in strata}
    for q in ds.queries:
        rel_per_q[q.stratum].append(len(ds.qrels[q.query_id]))
    # how many DISTINCT corpus memories are exercised as relevant
    relevant_union = set()
    for rels in ds.qrels.values():
        relevant_union |= rels
    out = {
        "corpus_count": len(ds.corpus),
        "query_count": len(ds.queries),
        "strata": dict(strata),
        "relevant_ids_per_query": {
            s: {
                "min": min(v),
                "median": statistics.median(v),
                "max": max(v),
                "mean": round(statistics.fmean(v), 2),
            }
            for s, v in rel_per_q.items()
        },
        "distinct_relevant_memories": len(relevant_union),
        "validation": "PASS (all qrels ids exist in corpus; every query has qrels)",
    }
    print(json.dumps(out, indent=2))


if __name__ == "__main__":
    main()

@@ -0,0 +1,78 @@
#!/usr/bin/env python3
"""Export the local SQLite memory cache to a LOCAL-ONLY corpus.jsonl.

Privacy: emits ONLY rows where is_sensitive=0. The output file lives under
benchmarks/data/ which is gitignored. NEVER commit corpus.jsonl.

Fields emitted per line: {id, content, category, tags, expanded_keywords, importance}
"""
from __future__ import annotations

import argparse
import json
import sqlite3
import sys
from pathlib import Path

DEFAULT_DB = Path.home() / ".claude" / "claude-memory" / "memory" / "memory.db"
DEFAULT_OUT = Path(__file__).resolve().parents[1] / "data" / "corpus.jsonl"


def export(db_path: Path, out_path: Path) -> dict:
    if not db_path.exists():
        raise SystemExit(f"DB not found: {db_path}")
    # Open read-only so the export can never mutate the live memory DB.
    con = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    con.row_factory = sqlite3.Row
    cur = con.cursor()
    total = cur.execute("SELECT COUNT(*) FROM memories").fetchone()[0]
    sensitive = cur.execute(
        "SELECT COUNT(*) FROM memories WHERE is_sensitive=1"
    ).fetchone()[0]
    rows = cur.execute(
        """
        SELECT id, content, category, tags, expanded_keywords, importance
        FROM memories
        WHERE is_sensitive=0
        ORDER BY id
        """
    ).fetchall()
    out_path.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with out_path.open("w", encoding="utf-8") as f:
        for r in rows:
            rec = {
                "id": r["id"],
                "content": r["content"],
                "category": r["category"],
                "tags": r["tags"],
                "expanded_keywords": r["expanded_keywords"],
                "importance": r["importance"],
            }
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
            written += 1
    con.close()
    return {
        "total_rows": total,
        "sensitive_excluded": sensitive,
        "non_sensitive_written": written,
        "out_path": str(out_path),
    }


def main() -> None:
    ap = argparse.ArgumentParser()
    ap.add_argument("--db", type=Path, default=DEFAULT_DB)
    ap.add_argument("--out", type=Path, default=DEFAULT_OUT)
    args = ap.parse_args()
    stats = export(args.db, args.out)
    json.dump(stats, sys.stdout, indent=2)
    print()


if __name__ == "__main__":
    main()

@@ -0,0 +1,65 @@
#!/usr/bin/env python3
"""Run the benchmark for a named retriever and print overall + per-stratum metrics.

Usage:
    .venv/bin/python scripts/run_eval.py --retriever fts5          # lexical baseline
    .venv/bin/python scripts/run_eval.py --retriever substring     # demo
    .venv/bin/python scripts/run_eval.py --retriever mypkg.mymod:MyRetriever
    .venv/bin/python scripts/run_eval.py --retriever fts5 --json results/fts5.json

The --retriever value is either a built-in alias or a "module:Class" path. The
class is instantiated with no args; the runner calls build_index() if present.

Outputs are LOCAL-ONLY when written under results/ (gitignored): a results file
may echo retrieved ids (not content), but keep it local to be safe.
"""
from __future__ import annotations

import argparse
import importlib
import json
import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
from harness import load_dataset, run_benchmark  # noqa: E402
from harness.baselines import SqliteFtsRetriever  # noqa: E402
from harness.example_retriever import SubstringRetriever  # noqa: E402

ALIASES = {
    "fts5": lambda: SqliteFtsRetriever(sort_by="relevance"),
    "fts5_importance": lambda: SqliteFtsRetriever(sort_by="importance"),
    "substring": SubstringRetriever,
}


def resolve(spec: str):
    if spec in ALIASES:
        return ALIASES[spec]()
    if ":" not in spec:
        raise SystemExit(f"unknown retriever alias '{spec}' (use module:Class or one of {list(ALIASES)})")
    mod_name, cls_name = spec.split(":", 1)
    mod = importlib.import_module(mod_name)
    return getattr(mod, cls_name)()


def main() -> None:
    ap = argparse.ArgumentParser()
    ap.add_argument("--retriever", default="fts5")
    ap.add_argument("--k", type=int, default=20, help="depth requested from retriever")
    ap.add_argument("--json", type=Path, default=None, help="write full result JSON here")
    args = ap.parse_args()
    ds = load_dataset(validate=True)
    retr = resolve(args.retriever)
    res = run_benchmark(retr, ds, retrieve_k=args.k)
    print(res.summary())
    if args.json:
        args.json.parent.mkdir(parents=True, exist_ok=True)
        args.json.write_text(json.dumps(res.to_dict(), indent=2))
        print(f"\nwrote {args.json}")


if __name__ == "__main__":
    main()

@@ -0,0 +1,21 @@
# Pursue hybrid retrieval: embeddings + concept graph over pure lexical
Today recall is **lexical only** (BM25 in SQLite, `tsvector`/`ts_rank` in Postgres over
content + LLM-generated `expanded_keywords`). It matches *tokens*, so it misses
paraphrase/synonym queries and cannot traverse between related Memories. We will pursue a
**hybrid** read path that adds dense-vector **Semantic recall** and a traversable **Concept
graph** (typed Relationships) alongside the existing Lexical recall.
This decision is **gated on a benchmark**: we adopt hybrid only if it shows a material
recall-quality uplift over the current lexical system on a stratified eval set (exact /
paraphrase / multi-hop). If the benchmark shows no improvement, a later ADR supersedes this
and we stay lexical.
## Considered options
- **Pure semantic (embeddings only)** — fixes paraphrase gaps but gives no real concept
traversal; rejected as the *sole* mechanism.
- **Pure concept graph** — enables traversal but node-matching stays lexical, so paraphrase
gaps remain; rejected as the *sole* mechanism.
- **Hybrid (chosen)** — embeddings for meaning + graph for traversal + existing FTS, fused
into one ranked result. Highest ceiling; the GraphRAG / Zep-Graphiti / HippoRAG family.

@@ -0,0 +1,20 @@
# API/Postgres deployment gets semantics; SQLite-only stays lexical
The semantic + concept-graph layer targets the **API/Postgres** deployment only: embeddings
in pgvector on the (CNPG) Postgres, the Concept graph as node/edge tables in Postgres, and
embedding/extraction via reused cluster infra (llama-cpp on GPU, or a hosted API). The
**SQLite-only** mode keeps working but stays **lexical (FTS) only** — it gains no embeddings
or graph, degrading gracefully.
This is surprising because the README markets zero-config offline SQLite as the headline
feature. We accept that trade-off: the operator actually runs the remote API/Postgres store,
reuse-before-building favours cluster infra, and bundling a local embedding model into the
zero-config path would add heavy dependencies and double the build/test matrix for little
real-world benefit.
## Consequences
- All benchmark numbers are produced in API/Postgres mode.
- Offline zero-config users see no behaviour change.
- A future ADR may revisit offline semantics (e.g. via `sqlite-vec` + a small local model)
if there is demand.

@@ -0,0 +1,19 @@
# External embedding/extraction APIs allowed for non-sensitive memories
Embedding and concept extraction may use **hosted APIs** (e.g. OpenAI `text-embedding-3`,
Voyage, Cohere) for **non-sensitive** memories, to access a higher quality ceiling than
self-hosted models alone. **Sensitive / Vault-encrypted (secret) memories are never sent
externally** and are excluded from the corpus that gets embedded or extracted.
This is a deliberate relaxation of the homelab's usual local-only posture, made because the
quality gain is worth it for non-secret personal memory content. The research/benchmark may
still compare hosted vs self-hostable models (nomic-embed, bge-m3, gte-Qwen2, e5) so the
production choice is data-driven; this ADR only records that egress is *permitted* within the
sensitive-data boundary.
## Consequences
- The corpus-export step MUST filter out `is_sensitive` / secret memories before any external
call.
- Production deployment needs an embedding API key (or falls back to the in-cluster
llama-cpp model when absent).

@@ -0,0 +1,45 @@
# Phase the hybrid: lexical + dense first, concept graph gated
The [benchmark](../research/benchmark-report.md) shows the hybrid read path beats lexical FTS on
every overall metric (recall@10 **+0.139**, driven by the statistically-robust paraphrase win recall@10
**+0.350**), with no recall regression on exact — so [ADR-0001](0001-pursue-hybrid-retrieval-embeddings-and-concept-graph.md)'s
quality gate **is met by the lexical + dense fusion**. The ablation also showed full-hybrid **identical
to three decimals** to FTS+dense alone — **but a post-run adversarial review proved this was a structural
artifact, not an empirical result**: the prototype's fusion barred the graph leg from the fused top-k by
construction (graph candidates excluded from the FTS∪dense base set; `w_graph=0.35` → max graph RRF
`0.0057` < base-leg min `0.0091`). So the concept graph was **never validly tested; it is *unevaluated*,
not disproven.**
We therefore **phase** the hybrid:
- **Phase 1 (adopt now):** lexical FTS ⊕ dense pgvector embeddings, fused with weighted RRF, the
importance prior preserved as a post-fusion multiplier. This *is* the measured uplift.
- **Phase 2 (gated, NOT shipped):** the typed-relation concept graph. It stays designed
([integration design §A.5](../research/integration-design.md)) but disabled — its value is **unproven**
and it carries real operational cost (LLM extraction + two extra Postgres tables + traversal).
The graph's prototype was a **zero-LLM keyword-co-occurrence** graph (concepts = tags ∪
expanded_keywords ∪ regex noun-phrases, edges from co-occurrence), **not** the LLM-extracted
typed-relation graph the production design specifies. So even a valid null result would speak only to
*that cheap construction* on *this eval set*; it would not prove an LLM-extracted graph is also null. The graph
is **gated, not killed**: re-open it only with evidence it helps — e.g. a multi-hop slice whose hops
are *not* semantically adjacent (where the dense leg can't shortcut), built from real typed-relation
extraction.
## Why the graph result is inconclusive
The ablation `A≡B` was **guaranteed by the fusion config** (graph candidates excluded from the FTS∪dense
base set; `w_graph=0.35` caps a graph-only id's RRF below any base-leg id), so it tested nothing about the
graph — the review even found a relevant memory the graph surfaced that both base legs missed, which
fusion then discarded. Separately, the multi-hop deltas (recall@10 +0.064) are **not statistically
significant** (3 of 4 CIs cross zero), so there is no distinguishable multi-hop win to attribute to
*either* leg. The graph is deferred on **cost + uncertainty**, not on evidence it fails.
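
The bound quoted above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not the prototype's code: the constants are the ones stated in this ADR and in the prototype's `retrieve` (`k=60`, base weights `1.0`, `w_graph=0.35`, per-leg depth `50`), while the names `rrf_term`, `best_graph_only`, `worst_base` and `break_even` are hypothetical and introduced only for illustration.

```python
RRF_K = 60      # RRF constant quoted in the ADRs
W_BASE = 1.0    # w_lex = w_dense = 1.0
W_GRAPH = 0.35  # default graph weight
DEPTH = 50      # per-leg depth, max(k, 50) in the prototype for k <= 50


def rrf_term(weight: float, rank: int, k: int = RRF_K) -> float:
    """Contribution of one leg for a document at 1-based `rank`."""
    return weight / (k + rank)


# Best case for a graph-only id: rank 1 in the graph list, absent from both base legs.
best_graph_only = rrf_term(W_GRAPH, 1)
# Worst case for a base-leg id: last rank in one base list, absent everywhere else.
worst_base = rrf_term(W_BASE, DEPTH)

# Graph weight at which a rank-1 graph-only id would just tie the worst base-leg id.
break_even = W_BASE * (RRF_K + 1) / (RRF_K + DEPTH)

print(f"best graph-only score: {best_graph_only:.4f}")
print(f"worst base-leg score:  {worst_base:.4f}")
print(f"break-even w_graph:    {break_even:.3f}")
```

Under these assumptions every base-leg id outscores every graph-only id, so a graph-only id can reach the top-k only if the base legs together return fewer than k ids. The break-even value suggests where a weight sweep in a retest would have to start for graph-only ids to compete at all.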
## Consequences
- Production ships embeddings + RRF fusion; the graph schema is documented but not migrated.
- The concept-graph research (Zep/Graphiti split, HippoRAG PPR, EDC canonicalization) is preserved as
the phase-2 blueprint, behind the gate.
- Phasing avoids paying the graph's cost while its value is unproven; the robust **lexical+dense**
paraphrase win is what ADR-0001's gate actually surfaced. The graph stays a designed, gated follow-up
pending a valid retest (graph candidates in the fused pool / swept weight, real typed-relation extraction).

@@ -0,0 +1,40 @@
# Weighted RRF as the fusion default; convex combination as a benchmark challenger
The hybrid read path must fuse three retrieval signals on **incomparable scales**: unbounded lexical
scores (`ts_rank` in Postgres, BM25 in SQLite), bounded cosine, and (phase 2) an arbitrary
graph-proximity score, where the graph list is **sparse and often empty**. We adopt **weighted
Reciprocal Rank Fusion** as the default
fusion function: `score(d) = Σ_s w_s/(60 + rank_s(d))`, default `w_lex = w_dense = 1.0`,
`w_graph = 0.35`, with the existing **importance** value applied as a *post-fusion prior multiplier*
(`final = fused × (0.7 + 0.3·importance)` for `sort_by="relevance"`) — importance is a prior, **not**
a fourth fused list.
RRF is the right default because it is **score-scale-free** (no BM25-vs-cosine calibration to
maintain), treats a missing leg as a clean **0** contribution (no missing-modality bias), is
near-parameter-free (`k=60` is demonstrably uncritical across [10,100]), and **collapses to today's
exact lexical ordering** when the dense/graph legs are empty — which is the SQLite-only
graceful-degrade path ([ADR-0002](0002-api-postgres-first-sqlite-stays-lexical.md)) running the *same*
code.
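
The formula and the degrade property described above can be illustrated with a small sketch. It is not the production implementation: the function names (`weighted_rrf`, `apply_importance_prior`, `top_ids`) and the toy ranked lists are hypothetical, and it assumes `importance` is a value in [0, 1] that defaults to 0 when missing, which this ADR does not state.

```python
from collections import defaultdict

RRF_K = 60  # constant from the formula above


def weighted_rrf(legs: dict[str, list[int]], weights: dict[str, float], k: int = RRF_K) -> dict[int, float]:
    """score(d) = sum over legs of w_leg / (k + rank_leg(d)), ranks are 1-based."""
    scores: dict[int, float] = defaultdict(float)
    for name, ranked in legs.items():
        w = weights.get(name, 0.0)
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += w / (k + rank)
    return dict(scores)


def apply_importance_prior(fused: dict[int, float], importance: dict[int, float]) -> dict[int, float]:
    """Post-fusion multiplier: final = fused * (0.7 + 0.3 * importance)."""
    # ASSUMPTION: importance lies in [0, 1]; ids without a value are treated as 0.
    return {d: s * (0.7 + 0.3 * importance.get(d, 0.0)) for d, s in fused.items()}


def top_ids(scores: dict[int, float], n: int) -> list[int]:
    """Highest score first; ties broken by ascending id for determinism."""
    return [d for d, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]


weights = {"lex": 1.0, "dense": 1.0, "graph": 0.35}
legs = {"lex": [1, 2, 3], "dense": [3, 1, 4], "graph": []}  # toy ids, empty graph leg

fused = weighted_rrf(legs, weights)
print(top_ids(fused, 4))

# With the dense and graph legs empty, fusion reproduces the lexical order unchanged.
lexical_only = weighted_rrf({"lex": [1, 2, 3], "dense": [], "graph": []}, weights)
print(top_ids(lexical_only, 3))

# The importance prior can reorder near-ties after fusion.
with_prior = apply_importance_prior(fused, {3: 1.0})
print(top_ids(with_prior, 4))
```

An empty leg simply contributes no terms, which is why no special-casing is needed for the SQLite-only path in this sketch.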
## Considered options
- **Convex combination / TM2C2** (Bruch et al., min-max normalization) — the literature consistently
shows it **edges RRF on nDCG/recall** when scores are calibratable (Weaviate switched its default to it).
It is the **standing challenger**. ⚠️ **Correction:** an earlier draft claimed "the benchmark ran CC
against RRF and RRF was chosen on our eval set" — **no CC results were actually produced or persisted in
this run.** RRF was adopted on *principled* grounds (scale-free, treats a missing/empty leg as a clean 0,
collapses to today's exact lexical ordering for the SQLite-only degrade path), **not** a measured
head-to-head. Benchmarking CC vs RRF on our eval set is an open follow-up — do it before locking fusion,
and especially if the graph is ever adopted or score distributions shift.
- **Cross-encoder stage-2 re-rank** (e.g. `bge-reranker-v2-m3` over the fused top ~20–30): a
*separate*, independently-gated stage, not a fusion function. Deferred; ship only if it clears both
the quality bar and the hot-path p95 budget on the GPU node.
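
For the open CC-vs-RRF follow-up, the challenger could look roughly like the sketch below. It uses plain per-query min-max normalisation, which is a simplification and not necessarily the exact TM2C2 variant from Bruch et al.; the function names, the mixing weight `alpha`, and the toy scores are all hypothetical.

```python
def min_max(scores: dict[int, float]) -> dict[int, float]:
    """Rescale one leg's raw scores to [0, 1] within a single query."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Degenerate leg (all scores equal): treat every hit as equally strong.
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}


def convex_combination(lex: dict[int, float], dense: dict[int, float], alpha: float = 0.5) -> dict[int, float]:
    """final(d) = alpha * dense_norm(d) + (1 - alpha) * lex_norm(d)."""
    lex_n, dense_n = min_max(lex), min_max(dense)
    ids = set(lex_n) | set(dense_n)
    # An id missing from a leg scores 0 on that leg; this is the missing-modality
    # bias that RRF avoids by working on ranks instead of scores.
    return {d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * lex_n.get(d, 0.0) for d in ids}


# Toy scores on deliberately different scales (lexical rank score vs cosine).
lex_scores = {1: 12.0, 2: 9.0, 3: 3.0}
dense_scores = {2: 0.80, 3: 0.70, 4: 0.40}

cc = convex_combination(lex_scores, dense_scores, alpha=0.5)
ranking = [d for d, _ in sorted(cc.items(), key=lambda kv: (-kv[1], kv[0]))]
print(ranking)
```

Unlike RRF, this variant needs `alpha` tuned on the eval set and depends on each leg's score distribution, which is the calibration burden the paragraph on RRF refers to.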
## Consequences
- Fusion is ~30 lines over three top-N queries; the lexical leg reuses the existing
`plainto_tsquery`/`ts_rank` query + OR-broaden fallback verbatim.
- The exact-stratum nDCG/MRR dip (~0.018/0.025, recall unaffected) is the known RRF cost of blending
one perfect hit with near-ties; a small **exact-match rank bonus** is the tunable recovery and a
cheap follow-up.
- `k=60` is borrowed from TREC ad-hoc IR; a quick re-sweep on the eval set is worthwhile but the
literature says it is insensitive.

@@ -0,0 +1,49 @@
# Production vector storage: pgvector HNSW + halfvec(1024); 1024-d embeddings (Voyage-3.5 / bge-large)
Phase 1 of the hybrid ([ADR-0004](0004-phase-the-hybrid-lexical-dense-first-graph-gated.md)) needs a
production home for the dense embeddings. Per [ADR-0002](0002-api-postgres-first-sqlite-stays-lexical.md)
that is **pgvector on the shared CNPG Postgres**, where claude-memory is already a database tenant — no new
datastore.
> ⚠️ **Correction (verified against live infra by a design challenger):** an earlier draft justified this
> with "Immich already runs pgvector on the same cluster." That is **false** — Immich runs its **own**
> Postgres, not the shared CNPG — so it is NOT evidence the shared cluster has the extension. **pgvector
> must be explicitly enabled on CNPG** (extension install, and possibly a CNPG operand-image change) via
> Terraform **before this can land**; do not assume it is already available.
Decisions:
- **Index: HNSW** (`USING hnsw (embedding halfvec_cosine_ops) WITH (m=16, ef_construction=64)`,
query knob `hnsw.ef_search` set via `SET LOCAL` inside the recall txn under PgBouncer). Best
speed-recall tradeoff, buildable on an empty table. **IVFFlat rejected** — it must be built *after*
data exists (empty-table footgun) and has a lower recall ceiling.
- **Type: `halfvec(1024)`** (fp16) — halves index size at ~no recall loss; 1024-d halfvec = 2048
  bytes/row → roughly 11 MB of raw vectors at the benchmark corpus size (5,452 memories), before index overhead.
- **Dimension fixed at 1024**, chosen **once** (changing it later forces a full re-embed + HNSW
rebuild). 1024 matches both the production model (Voyage-3.5) and the prototype model
(bge-large-en-v1.5), so the column and all fusion code are identical regardless of model.
- **Model: Voyage-3.5** (1024-d, hosted) for **non-sensitive** memories (highest measured retrieval
quality of the hosted options, free tier covers the corpus); **bge-large-en-v1.5** (1024-d, local,
MIT) for **sensitive memories and the no-API-key fallback** ([ADR-0003](0003-external-embedding-apis-allowed-for-non-sensitive-memories.md)).
`is_sensitive=1` rows are never embedded externally — `embedding=NULL`, lexical-only.
- **pgvectorscale / StreamingDiskANN deferred** — Rust/pgrx must be compiled into the CNPG operand
image, and it only earns its keep above ~15M vectors; our corpus is orders of magnitude below that.
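
A back-of-the-envelope check of the storage figures in the list above, as a sketch only. It assumes the 5,452-memory benchmark corpus as the row count and counts raw vector components alone, ignoring any per-value header and the HNSW graph itself; `raw_vector_bytes` is a hypothetical helper name.

```python
def raw_vector_bytes(rows: int, dims: int = 1024, bytes_per_component: int = 2) -> int:
    """Raw embedding payload: rows x dims x bytes per component (2 = fp16, 4 = fp32)."""
    return rows * dims * bytes_per_component


CORPUS_ROWS = 5_452  # benchmark corpus size; production row count may differ

half = raw_vector_bytes(CORPUS_ROWS)                         # halfvec(1024)
full = raw_vector_bytes(CORPUS_ROWS, bytes_per_component=4)  # vector(1024)

print(f"halfvec(1024): {half:,} bytes (~{half / 1e6:.1f} MB)")
print(f"vector(1024):  {full:,} bytes (~{full / 1e6:.1f} MB)")
```

Either way the payload is tiny by database standards, which is consistent with deferring anything heavier than stock pgvector HNSW.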
## Migration shape
A single **additive** Alembic migration: `ALTER TABLE memories ADD COLUMN embedding halfvec(1024)`
(NULL for sensitive) + `CREATE INDEX CONCURRENTLY … USING hnsw …`. The existing generated
`search_vector tsvector` + GIN index (`migrations/001`) are **untouched**, so lexical behaviour and
the SQLite-only degrade path are unchanged. pgvector enablement on CNPG and any extension/operand
change land as **Terraform/Terragrunt** in `infra/stacks/…` (GitOps, never kubectl) and trigger a
rolling restart of the shared cluster — coordinate accordingly.
## Consequences
- The prototype's in-process numpy matrix maps directly to this column; only the substrate changes,
not the retrieval math.
- The prototype measured **bge-large** quality; a cheap follow-up should re-run the dense leg with
**Voyage-3.5** on the non-sensitive corpus to confirm the hosted ceiling holds on our content
before locking the production default.
- Production latency/ANN-approximation/filtered-top-k behaviour are unmeasured in the prototype and
must be validated post-migration (a stated benchmark limitation).

@@ -0,0 +1,312 @@
# Benchmark report: FTS (lexical baseline) vs Hybrid (lexical + dense + graph)
Status: **the ADR-0001 decision instrument.** This is the head-to-head that gates hybrid
adoption. Read the [survey](survey.md) for the landscape and the
[integration design](integration-design.md) for what was built.
**Bottom line:** Hybrid (lexical FTS ⊕ dense embeddings, RRF-fused) beats the lexical FTS baseline on
**every overall metric**, driven by the **paraphrase** stratum (recall@10 **+0.350**, robust under a
paired bootstrap) — precisely the gap embeddings were meant to close, with **no recall regression on
exact**. **Recommendation: adopt the lexical + dense fusion;** the ADR-0001 quality gate is met by
that fusion alone.
> **⚠️ Post-review corrections — read before citing this report.** An adversarial completeness review
> after the run found that two prominent claims in the original draft did NOT survive scrutiny. They are
> corrected in place below; the full review is in the [Appendix](#appendix--adversarial-completeness-review).
>
> 1. **The concept graph was NOT shown to be useless — it was never validly tested.** The prototype's
> fusion restricted the graph leg to candidate ids *outside* the FTS∪dense base set and weighted it at
> 0.35, which makes it **mathematically impossible** for a graph-only result to enter the fused top-k
> (max graph RRF `0.35/(60+1)≈0.0057` < any base-leg minimum `1.0/(60+50)≈0.0091`). So the ablation
> "full-hybrid ≡ FTS+dense to three decimals" was **guaranteed by the fusion config, not an empirical
> finding** — the review even found a genuinely-relevant memory the graph surfaced (and both base legs
> missed) that fusion then discarded. The concept graph is therefore **UNEVALUATED**, and is deferred on
> **cost + invalid-test** grounds, *not* because it failed. A valid retest must put graph ids in the
> fused candidate pool (no base-set exclusion) and/or sweep the graph weight.
> 2. **The multi-hop "win" is not statistically significant.** A paired bootstrap (B=10000) puts 3 of the
> 4 multi-hop metric CIs across zero (recall@10 Δ+0.064, P(Δ≤0)≈0.06). Only the **overall** and
> **paraphrase** deltas are robust. Multi-hop deltas are de-bolded below. Since multi-hop is the entire
> rationale for a concept graph, there is currently **no statistically distinguishable evidence** that
> *anything* (dense or graph) helps the multi-hop stratum.
>
> Also: absolute recall/nDCG *levels* are biased low (binary, un-pooled qrels — §6); only FTS-vs-hybrid
> **deltas** are trustworthy. None of this changes the adopt-lexical+dense recommendation, which rests on
> the robust paraphrase/overall win.
---
## 1. Methodology
### 1.1 Test collection
- **Corpus:** 5,452 memories (`benchmarks/data/corpus.jsonl`, gitignored — privacy). Sensitive
memories (`is_sensitive=1`) excluded entirely per ADR-0003. The corpus is the user's real
personal memories; **only aggregate numbers and synthetic illustrations appear in this document.**
- **Queries:** 119, stratified — **40 exact / 40 paraphrase / 39 multi-hop**
(`benchmarks/data/queries.jsonl`).
- **Qrels:** `benchmarks/data/qrels.jsonl`, LLM-generated (the LongMemEval pipeline inverted for a
memory store: seed memory → exact/paraphrase/multi-hop query → relevant ids).
### 1.2 The two systems
- **`fts` (baseline)** — `benchmarks/retrievers/fts.py::FtsRetriever`. A **faithful
reimplementation of the production code path** `src/claude_memory/mcp_server.py::_sqlite_recall`
(`sort_by="relevance"`): in-memory FTS5 over the full corpus with the production virtual-table
shape + default `unicode61` tokenizer (no stemming/stop-words); AND-match first, OR-broaden only
if AND returns zero; rank by the production blend `-bm25()*0.7 + importance*0.3`; LIKE fallback on
operational errors. **This is "the current system" the hybrid must beat — not the simplified
README prose.**
- **`hybrid`** — `benchmarks/retrievers/hybrid.py::HybridRetriever`. Three legs fused with weighted
RRF: (1) lexical = **`FtsRetriever` reused verbatim** (so the hybrid's lexical component *is* the
baseline — no drift); (2) dense = `BAAI/bge-large-en-v1.5` (1024-d, local, L2-normalized, BGE
query-instruction prefix), cosine via numpy; (3) graph = a keyword-co-occurrence memory-node graph
(5,452 nodes / 2,095,624 edges), 1-hop expansion from the top-10 seeds.
Fusion: `RRF(d) = Σ_leg w_leg/(60 + rank_leg(d))`, `w_fts = w_dense = 1.0, w_graph = 0.35`, each
leg to depth 50.
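The fusion rule above can be sketched in a few lines. This is a minimal illustration, not the code in `hybrid.py`: the function name `weighted_rrf`, the dict-of-lists input shape, and the toy ids are assumptions. Only the formula, the constant 60, the weights, and the depth of 50 come from the text.

```python
from collections import defaultdict


def weighted_rrf(legs, weights, k=60, depth=50):
    """Fuse best-first id lists: score(d) = sum over legs of w_leg / (k + rank_leg(d)).

    legs:    {"fts": [id, ...], "dense": [id, ...], ...}, each list best-first
    weights: {"fts": 1.0, ...}; a leg that lacks a doc contributes 0 for it
    """
    scores = defaultdict(float)
    for name, ranked_ids in legs.items():
        w = weights.get(name, 0.0)
        if w == 0.0:
            continue  # a zero-weight leg cannot change any score
        for rank, doc_id in enumerate(ranked_ids[:depth], start=1):
            scores[doc_id] += w / (k + rank)
    # Best fused score first; ties broken by id so the output is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy (synthetic) ids, not real memories.
legs = {
    "fts": ["m1", "m2", "m3"],
    "dense": ["m3", "m1", "m4"],
}
weights = {"fts": 1.0, "dense": 1.0, "graph": 0.35}
fused = weighted_rrf(legs, weights)
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Ids that appear in both legs (`m1`, `m3`) accumulate two terms and rise above ids seen by only one leg, which is the behaviour the paraphrase stratum relies on.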
### 1.3 Metrics & protocol
- **recall@5, recall@10** (hot-path "did we surface it"), **nDCG@10** (graded, position-aware
headline), **MRR** (first-hit). Per stratum and overall.
- `retrieve_k=20`, run via `scripts/run_eval.py --retriever retrievers.{fts,hybrid}:…`. Deterministic;
both invocation paths (programmatic + CLI) verified identical.
- **`sort_by="relevance"` pinned across both arms** (not the production `importance` default), so the
benchmark measures *retrieval* quality, not the importance prior — and everything else (corpus,
queries, OR-broaden behaviour) is held fixed.
- Full result JSONs written only to gitignored `benchmarks/results/{fts,hybrid}.json`; retriever code
contains no embedded corpus content (safe to commit).
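For readers who want the metric definitions spelled out, here is a hedged sketch of the four metrics with binary relevance. The function names are illustrative and the harness in `benchmarks/harness/` may differ in detail (for example in how it handles queries with no relevant ids); the example ids are synthetic.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant ids that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant id, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-gain nDCG: discounted hits divided by the ideal discounted hits."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


ranked = ["a", "b", "c", "d"]   # synthetic retrieval result, best first
relevant = {"b", "d"}           # synthetic qrels for one query
print(recall_at_k(ranked, relevant, 2), mrr(ranked, relevant))
print(round(ndcg_at_k(ranked, relevant, 10), 4))
```

With binary labels, nDCG rewards only the position of hits, which is why the appendix describes it as position-discounted recall.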
---
## 2. Head-to-head results
### 2.1 Comparison table (FTS vs Hybrid, with deltas)
| Stratum | Metric | FTS | Hybrid | Δ |
|---|---|---:|---:|---:|
| **Overall** (n=119) | recall@5 | 0.6663 | 0.7415 | **+0.0752** |
| | recall@10 | 0.6952 | 0.8338 | **+0.1386** |
| | nDCG@10 | 0.6507 | 0.7284 | **+0.0777** |
| | MRR | 0.6737 | 0.7297 | **+0.0560** |
| **Exact** (n=40) | recall@5 | 1.0000 | 1.0000 | +0.0000 |
| | recall@10 | 1.0000 | 1.0000 | +0.0000 |
| | nDCG@10 | 0.9908 | 0.9723 | −0.0185 |
| | MRR | 0.9875 | 0.9625 | −0.0250 |
| **Paraphrase** (n=40) | recall@5 | 0.3500 | 0.5500 | **+0.2000** |
| | recall@10 | 0.3750 | 0.7250 | **+0.3500** |
| | nDCG@10 | 0.3123 | 0.5023 | **+0.1900** |
| | MRR | 0.2958 | 0.4343 | **+0.1385** |
| **Multi-hop** (n=39) | recall@5 | 0.6485 | 0.6726 | +0.0241 ¹ |
| | recall@10 | 0.7111 | 0.7748 | +0.0637 ¹ |
| | nDCG@10 | 0.6491 | 0.7101 | +0.0610 ¹ |
| | MRR | 0.7393 | 0.7940 | +0.0547 ¹ |
¹ **Multi-hop deltas are NOT statistically significant.** Paired bootstrap (B=10000): recall@5
`CI[−0.046,+0.095]` (P(Δ≤0)≈0.25), recall@10 `CI[−0.020,+0.143]` (P≈0.06), MRR `CI[−0.038,+0.143]`
(P≈0.12); only nDCG@10 is marginal (P≈0.04). Treat the multi-hop stratum as **inconclusive**. The
overall and paraphrase deltas, by contrast, have CIs well clear of zero (P≤0.003).
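A paired bootstrap of the kind cited in the footnote can be sketched as follows. This is an assumption-laden illustration: the function name, the percentile CI, and the seed are choices made here, and the per-query scores are **synthetic** values picked to echo the paraphrase aggregate (15/40 vs 29/40 hits), not the real per-query data.

```python
import random


def paired_bootstrap(baseline, treatment, n_resamples=10_000, seed=0):
    """Resample per-query deltas with replacement; return mean, 95% CI, P(delta <= 0)."""
    assert len(baseline) == len(treatment), "scores must be paired per query"
    rng = random.Random(seed)
    deltas = [t - b for b, t in zip(baseline, treatment)]
    n = len(deltas)
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(deltas, k=n)  # same queries drawn for both arms
        means.append(sum(sample) / n)
    means.sort()
    ci_low = means[int(0.025 * n_resamples)]
    ci_high = means[int(0.975 * n_resamples) - 1]
    p_le_zero = sum(m <= 0 for m in means) / n_resamples
    return sum(deltas) / n, (ci_low, ci_high), p_le_zero


# Synthetic per-query recall@10 (0 = miss, 1 = hit) for 40 queries.
fts_scores = [0.0] * 25 + [1.0] * 15
hybrid_scores = [0.0] * 11 + [1.0] * 29
mean_delta, (ci_low, ci_high), p_le_zero = paired_bootstrap(fts_scores, hybrid_scores)
print(round(mean_delta, 3), round(ci_low, 3), round(ci_high, 3), p_le_zero)
```

Resampling the deltas, rather than each arm independently, is what makes the test paired: per-query difficulty cancels out, so only the difference between systems drives the interval.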
### 2.2 Reading the strata
- **Exact — recall held at 1.0, but that is partly circular.** Exact queries were *generated* as "a
salient phrase whose top FTS hit is memory X," then X was labelled relevant — so FTS recall@5/@10 = 1.0
is substantially **guaranteed by how the stratum was built**, not an independent property. The one
genuine signal here is hybrid's small **degradation**: nDCG@10 −0.018 `CI[−0.046,0]`, MRR −0.025
`CI[−0.063,0]`. This is a real, consistent rank demotion from blending one perfect lexical hit with dense
near-ties. The proposed exact-match rank bonus (ADR-0005) would recover it, **but that fix is asserted,
not measured.** Recall is unaffected; the cost is rank-position only.
- **Paraphrase — the LARGEST WIN, the dense leg's payoff.** recall@10 **+0.350** (0.375 → 0.725),
recall@5 **+0.200**, nDCG@10 **+0.190**. This is exactly the low-lexical-overlap stratum embeddings
were predicted to fix: lexical FTS finds barely a third of paraphrased answers; adding dense nearly
doubles recall@10. **This stratum alone justifies the hybrid.**
- **Multi-hop — INCONCLUSIVE (deltas not significant).** recall@10 +0.064, nDCG@10 +0.061, MRR +0.055,
but 3 of 4 CIs cross zero (footnote ¹). So we **cannot claim** a multi-hop win for hybrid. We also
cannot attribute anything to the graph: per the §3 caveat, the fusion config structurally barred the
graph leg from the top-k, so the multi-hop stratum tests only FTS vs FTS+dense — and that difference is
not statistically distinguishable here. Multi-hop is exactly the stratum a *properly tested* concept
graph is meant to win, and it remains an open question.
---
## 3. Ablation — what each leg contributes
Four configs, everything else fixed:
| Config | Legs | Overall recall@10 | Overall nDCG@10 | Para recall@10 | Multi recall@10 | Exact nDCG@10 |
|---|---|---:|---:|---:|---:|---:|
| **A** | FTS + dense + graph (full hybrid) | 0.834 | 0.728 | 0.725 | 0.775 | 0.972 |
| **B** | FTS + dense (`w_graph=0`) | **0.834** | **0.728** | **0.725** | **0.775** | **0.972** |
| **C** | dense only | 0.748 | — | — | — | 0.861 |
| **D** | FTS only (= baseline) | 0.695 | 0.651 | 0.375 | 0.711 | 0.991 |
**A ≡ B to three decimals on every metric — but this is a STRUCTURAL ARTIFACT, not a test of the graph.**
The fusion restricts the graph leg to ids *outside* the FTS∪dense base set and weights it at 0.35, so a
graph-only id's maximum RRF score is `0.35/(60+1) ≈ 0.0057`, strictly below the *minimum* score of any id
from a base leg (`1.0/(60+50) ≈ 0.0091`). Since both base legs return 50 ids, **every base-leg id outranks
every graph-only id** — a graph-only result can never enter the fused top-k, regardless of corpus, query
set, or graph quality. A ≡ B was **mathematically guaranteed before any data ran.** The honest reading is
"the graph cannot affect top-k *under this fusion config*," NOT "the concept graph contributes nothing." (A
spot check found a genuinely-relevant memory the graph surfaced and both base legs missed at depth 50 —
fusion discarded it.) **The graph is therefore unevaluated.**
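The bound can be checked with plain arithmetic. The sketch below only restates the numbers in the paragraph above; the variable names are illustrative, and the break-even weight is a derived quantity that assumes graph-only ids stay excluded from the base set and that the weakest base-leg id sits at rank 50 in a single leg.

```python
K = 60          # RRF constant from the fusion formula
DEPTH = 50      # each leg is pulled to depth 50
W_BASE = 1.0    # w_fts = w_dense
W_GRAPH = 0.35  # prototype graph weight

# Best case for a graph-only id: rank 1 in the graph leg, absent elsewhere.
max_graph_only = W_GRAPH / (K + 1)
# Worst case for a base-leg id: rank 50 in exactly one base leg.
min_base = W_BASE / (K + DEPTH)

print(f"{max_graph_only:.4f} < {min_base:.4f}: {max_graph_only < min_base}")
# prints "0.0057 < 0.0091: True"

# Graph weight at which the best graph-only id would merely tie the weakest
# base-leg id; actually reaching the fused top-k needs a larger weight still.
break_even_weight = W_BASE * (K + 1) / (K + DEPTH)
print(f"{break_even_weight:.3f}")  # prints "0.555"
```

So under the exclusion rule, any `w_graph` below roughly 0.555 cannot lift a graph-only id past even the last base-leg candidate, which is why a valid retest has to change the candidate pool or sweep the weight well above 0.35.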
What the ablation *does* validly show is the **FTS-vs-dense decomposition**, which stands:
- **Dense recovers paraphrase** (C beats D's 0.375 para recall@10 decisively) but is weaker on exact
(C exact nDCG 0.861 vs D's 0.991).
- **Lexical recovers exact** (D exact nDCG 0.991, the best) but collapses on paraphrase.
- **Fusion (B) gets the best of both** — exact recall stays perfect, paraphrase nearly doubles.
- *Caveat:* configs B and C were not persisted as result JSONs, so their specific numbers are not
independently reproducible from saved artifacts (only A = full hybrid and D = FTS were saved).
**To actually test the graph** (deferred follow-up — [ADR-0004](../adr/0004-phase-the-hybrid-lexical-dense-first-graph-gated.md)):
put graph candidates in the fused pool (drop the base-set exclusion) and/or sweep `w_graph` upward, on a
multi-hop slice whose hops are *not* semantically adjacent (so the dense leg can't shortcut them), using
real typed-relation extraction rather than the prototype's zero-LLM keyword co-occurrence graph.
---
## 4. Latency & storage (measured, NON-GATING per ADR-0001)
### 4.1 Latency (per-query `retrieve()`, CPU-only box, no GPU)
| System | p50 | p95 | mean | max |
|---|---:|---:|---:|---:|
| FTS (pure SQLite) | 15.7 ms | 27.8 ms | 12.8 ms | 31.9 ms |
| Hybrid | 229.6 ms | 344.5 ms | 249.3 ms | 640.0 ms |
The hybrid's ~230 ms p50 is **dominated by the local bge-large query embedding** (one CPU forward
pass). On the production GPU node or via a hosted API this drops ~10× to low tens of ms. The FTS,
dense-ANN, and RRF-merge costs themselves are negligible. **Latency does not gate adoption** (the
success metric is quality-first), and the production read path (pgvector HNSW + GPU/hosted query
embed) is far faster than this prototype's CPU profile.
### 4.2 Storage
| Component | Size | Notes |
|---|---|---|
| FTS5 in-memory index | ~8.3 MB | SQLite shadow tables over 5,452 memories |
| Dense matrix | 22.3 MB | 5,452 × 1024 float32; cached `.npy`, fingerprint-keyed |
| Concept graph (in-memory) | ~202 MB (Python-object estimate) | 5,452 nodes + 2,095,624 edges, networkx; **not persisted, not shipped** |
| Total reported index | 232.2 MB | matrix + FTS + graph estimate |
Production maps the dense matrix to pgvector `halfvec(1024)` (~2 KB/row, roughly 11 MB of raw vectors for
the corpus) and — only if the graph is ever adopted — three Postgres node/edge tables. No
pgvector/docker in the prototype; in-process numpy cosine (faiss unnecessary at N=5452). Storage is
reported, not gating.
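The dense-storage figures follow directly from the corpus size and vector width. A quick check, counting raw vector payload only (per-row headers and HNSW index overhead are deliberately left out, so real on-disk size will be somewhat larger):

```python
N_ROWS = 5_452   # memories in the benchmark corpus
DIM = 1_024      # embedding dimension

float32_bytes = N_ROWS * DIM * 4   # prototype numpy matrix (4 bytes per value)
halfvec_bytes = N_ROWS * DIM * 2   # fp16 payload (2 bytes per value)

print(f"float32: {float32_bytes / 1e6:.1f} MB")   # prints "float32: 22.3 MB"
print(f"halfvec: {halfvec_bytes / 1e6:.1f} MB")   # prints "halfvec: 11.2 MB"
print(f"per row: {halfvec_bytes // N_ROWS} bytes")  # prints "per row: 2048 bytes"
```

The float32 figure matches the 22.3 MB reported for the cached matrix; the half-precision payload is about 11 MB, small enough that storage stays a non-issue at this corpus size.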
---
## 5. Recommendation
**ADOPT — ship lexical + dense fusion (phase 1); defer the concept graph behind a gate.**
Rationale:
1. **The ADR-0001 quality gate is met.** Hybrid beats FTS on all four overall metrics, with the
decisive, statistically-robust win on **paraphrase** (recall@10 +0.350) — precisely the gap embeddings
were meant to close — with **no recall regression on exact**. (The multi-hop deltas are *not*
statistically significant — §2.2, footnote ¹ — so they are not part of the case for adoption.)
2. **The gain is the FTS+dense fusion, not the graph.** The ablation shows full-hybrid ≡ FTS+dense.
So phase 1 = embeddings (pgvector) fused with the existing FTS via weighted RRF, preserving the
importance prior — exactly the [integration design](integration-design.md) §A.
3. **The concept graph stays GATED — because it was not validly tested, not because it failed**
([ADR-0004](../adr/0004-phase-the-hybrid-lexical-dense-first-graph-gated.md)). The prototype's fusion
config structurally barred it from the top-k (§3), so this benchmark says nothing about its value.
Deferral is justified by operational cost (LLM extraction + two extra tables + traversal) **plus the
remaining uncertainty** — not by evidence of uselessness. Re-open with a valid retest: graph candidates
in the fused pool (no base-set exclusion) and/or a swept weight, on non-semantically-adjacent multi-hop
queries built from real typed-relation extraction.
4. **The small exact-stratum nDCG/MRR dip** is the known RRF blending cost on recall-perfect queries;
an exact-match rank bonus is a cheap follow-up ([ADR-0005](../adr/0005-rrf-default-cc-challenger.md)
records RRF as the default with the bonus as a tunable).
Adopt with the changes already designed: pgvector `halfvec(1024)` + HNSW, weighted-RRF fusion with
the importance prior preserved, sensitive memories excluded from embedding (ADR-0003), and
SQLite-only mode unchanged (ADR-0002).
---
## 6. Limitations (stated honestly)
1. **Label noise / qrels quality.** Qrels were **LLM-generated** with **lighter hand-verification
than the ideal protocol** (no measured Cohen's κ between LLM and human judgments). LLM judges are
systematically lenient, and the eval is **119 queries (~40/stratum)**, *below* the ~50/stratum
the literature (Voorhees & Buckley) recommends for confident per-stratum ranking. **No
bootstrap CIs or paired significance test** were computed in the original run; the post-review
paired bootstrap is reported in footnote ¹ (§2.1). The overall and paraphrase deltas
(+0.14, +0.35 recall@10) are large enough to be robust to plausible label noise; the multi-hop
(+0.06) and the exact-stratum dip (~0.02) are within the range where label noise could matter, so
treat those as directional, not precise.
2. **Pooling / "holes."** It is not confirmed that qrels pooled the top-k of *all* arms (FTS, dense,
hybrid) before judging. If the pool was lexical-biased, the metric is **biased against** the dense
and hybrid arms (they retrieve unjudged relevant memories scored as misses) — meaning the true
hybrid uplift could be *larger* than reported, not smaller. This does not threaten the "adopt"
conclusion but caveats the exact magnitudes.
3. **Snapshot vs pgvector (substrate mismatch).** The prototype used an **in-process numpy** dense
index over a static corpus snapshot, **not** the production pgvector HNSW on live CNPG Postgres.
Retrieval *quality* transfers (cosine is cosine; HNSW recall at this scale is ~exact), but the
production **latency profile, ANN approximation, and filtered-top-k behaviour are unmeasured here**
and must be validated post-migration.
4. **Extraction shortcuts AND a structurally-excluded graph leg.** The concept graph was built with
**zero LLM calls**: concepts = `tags` ∪ `expanded_keywords` ∪ a regex noun-phrase proxy, edges from
keyword co-occurrence — **not** the typed-relation, LLM-extracted graph the production design (§A.5)
specifies. More importantly, the fusion config **structurally barred even this cheap graph from the
top-k** (§1 corrections, §3), so this run is **not a valid test of the graph at all** — not of the
cheap construction, and certainly not of a properly LLM-extracted typed-relation graph. The graph is
*gated and unevaluated*, not *killed*.
5. **Embedding model is the prototype default, not the production pick.** Numbers are for
**bge-large-en-v1.5** (local). Production should use **Voyage-3.5** (also 1024-d) for
non-sensitive memories (ADR-0003); its higher quality ceiling on *our* content is unverified — a
cheap, recommended follow-up (re-run the dense leg with Voyage).
6. **`sort_by="relevance"` not the production default.** The benchmark pins `relevance` to isolate
retrieval quality; production defaults to `importance`-blended ranking. The design preserves
importance as a post-fusion prior, but the *user-visible* ranking under the default sort was not
benchmarked.
7. **Single user, dense corpus.** ~5,452 memories from one author are topically adjacent with many
near-duplicates, so "the one relevant id" is sometimes fuzzy; graded judgments over a pool mitigate
this but it remains a property of the corpus that may not generalize.
---
## Appendix — adversarial completeness review
An independent critic reviewed this report after the run (verdict: **usable-with-caveats**). Its findings
drove the §1 corrections; the full list is recorded here so the review is part of the permanent record.
1. **Graph-null claim is circular, not empirical (most serious).** Proven that the fusion config
(graph leg excluded from the base set; `w_graph=0.35`) caps a graph-only id's RRF at `0.35/61≈0.0057`,
below any base-leg id's `1.0/110≈0.0091`. `A≡B` was guaranteed before any data. Honest statement: "the
graph cannot affect top-k under this fusion config," not "the graph contributes nothing."
2. **Graph mechanism mis-attributed, and contradicted by the data.** Reconstructing the legs (graph =
5,452 nodes / 2,095,624 edges) on a 15-query sample found `rel_only_via_graph=1`: a relevant memory the
graph surfaced that was absent from BOTH base legs at depth 50 — dense did NOT already cover it, and
fusion discarded it. The "dense already retrieves them" explanation is false in ≥1 observed case.
3. **Multi-hop wins are not statistically supported — yet were bolded.** Paired bootstrap (B=10000):
overall & paraphrase robust (P≤0.003); multi-hop 3/4 CIs cross zero (recall@10 P≈0.06). Multi-hop is the
sole rationale for the graph, and it shows no statistically distinguishable hybrid advantage.
4. **Exact stratum is circular by construction.** All 40 exact queries were generated as "top FTS hit for a
salient phrase of ⟨id⟩," so FTS's 1.0 is largely tautological. The only genuine exact signal is hybrid's
small nDCG/MRR demotion, whose claimed rank-bonus fix is asserted, never measured.
5. **qrels are binary and un-pooled, so nDCG is mislabeled and absolute numbers unreliable.** Binary labels
make nDCG degenerate to position-discounted recall (no graded info). Un-pooled, author-assigned labels
bias absolute recall/nDCG **low** (esp. for dense/hybrid); only the FTS-vs-hybrid **delta** is trustworthy.
6. **Underpowered and single-corpus.** 119 queries (~40/stratum) is below the cited ~50/stratum standard;
65% of queries have exactly one relevant id (recall@5≈recall@10 for most). One author, no second corpus,
no inter-annotator agreement: external validity asserted, not demonstrated.
7. **Headline metric overstates hot-path value.** recall@10 leads everywhere, but the auto-recall hook
injects a top-k into the prompt; recall@5 (+0.075) and MRR/first-hit are the decision-relevant metrics.
Hybrid's recall@10 edge partly comes from pushing answers into ranks 6–10. The hook's effective `k` is
never stated.
8. **Production substrate and model wholly unmeasured.** Numbers are local exact numpy cosine over
bge-large; production is pgvector HNSW (approximate; recall depends on `ef_search`) with a filtered top-k
(NULL-embedding sensitive rows — a partial-index interaction that can hurt HNSW recall), using Voyage-3.5
(never run). "Quality transfers" and "~10× faster on GPU/hosted" are assumptions, not measurements.
9. **Ablation configs A/B/C/D are not reproducible from artifacts.** Only `results/{fts,hybrid}.json`
(= A and D) are persisted; B/C have no saved JSON, and there's no run-manifest/seed/version capture
beyond the embedding-cache fingerprint. The decision-critical `A≡B` and the C/D numbers must be re-run.
10. **Several deferred fixes are asserted, not tested.** The exact-match rank bonus, the importance
post-fusion prior (the benchmark pinned `sort_by="relevance"` without it, so the user-visible production
ranking was never measured), and the "CC ran on our set and RRF was chosen" claim (no CC results exist
anywhere) are all unverified.


@ -0,0 +1,292 @@
# Integration design: hybrid recall (production + prototype-as-built)
Status: the design the benchmark validated. This document specifies **(A) the production design**
for the API/Postgres deployment and **(B) the prototype as actually built** for the benchmark.
Read the [survey](survey.md) for *why* these choices, the
[benchmark report](benchmark-report.md) for whether they cleared the gate, and
[ADR-0001–0003](../adr/) for the fixed constraints.
**Headline architectural decision (recorded in
[ADR-0004](../adr/0004-phase-the-hybrid-lexical-dense-first-graph-gated.md)):** ship
**lexical + dense fusion** first (a statistically-robust paraphrase win cleared ADR-0001's gate); the
**concept graph is deferred behind a gate** because it was **never validly tested** — the prototype's
fusion config structurally barred it from the top-k (see [benchmark report §1 + §3](benchmark-report.md)),
so it is *unevaluated*, not disproven. Deferral is on cost + uncertainty grounds. The design below is
structured around that phasing.
---
## A. Production design (API/Postgres deployment)
### A.0 Where it plugs in
The semantic layer targets the **API/Postgres path only** (ADR-0002). The authoritative store is
the CNPG Postgres behind the FastAPI server (`src/claude_memory/api/app.py`); the local SQLite
cache stays **lexical (FTS5) only** and degrades gracefully. Recall fires on the **hot path** of
every prompt (an auto-recall hook before each turn), so the read path must stay within a
per-prompt latency budget even though latency is non-gating for *adoption* (ADR-0001).
The current production recall (`app.py::recall_memories`, `POST /api/memories/recall`) is a single
`plainto_tsquery('english')` + `ts_rank(search_vector, query)` ordered by a blend
`ts_rank*0.7 + importance*0.3` (or `*0.4/*0.6` for `sort_by="importance"`), with an OR-broaden
fallback gated at `ts_rank > OR_BROADEN_MIN_RANK` when AND-match under-fills. The live schema
(`migrations/001`) has a generated `search_vector tsvector` (setweight A=content, B=expanded_keywords,
C=tags, D=category) + a GIN index `idx_memories_search`. **The hybrid design is purely additive
to this.**
### A.1 Schema delta (additive, one migration)
```sql
-- new Alembic migration (Postgres only; SQLite path unchanged)
ALTER TABLE memories ADD COLUMN embedding halfvec(1024); -- NULL for is_sensitive=1
CREATE INDEX CONCURRENTLY idx_memories_embedding
ON memories USING hnsw (embedding halfvec_cosine_ops)
WITH (m = 16, ef_construction = 64);
```
- **1024-d** matches both production (Voyage-3.5) and the prototype (bge-large-en-v1.5), so the
column dimension and all fusion code are identical whichever model runs.
- **halfvec** (fp16) halves index size at ~no recall loss; 1024-d halfvec = 2048 bytes/row →
roughly 11 MB of raw vectors for the whole corpus.
- The existing `search_vector` + GIN index are **untouched**. Lexical behaviour is unchanged, so
NULL-embedding rows (sensitive memories) and SQLite-only mode degrade to exactly today's FTS.
- `CONCURRENTLY` avoids locking the shared table during backfill.
- The concept-graph tables (§A.5) ship **only if/when the graph clears its gate** — phase 2.
### A.2 Write path (store / update) — all LLM work here, off the recall hot path
On `memory_store` / `memory_update`, for **non-sensitive** rows (`is_sensitive=0`, hard ADR-0003
gate):
1. **Embed** `content` (optionally `content + expanded_keywords`) → one `halfvec(1024)` vector,
written to the new column. Voyage-3.5 `input_type="document"` for stores; bge-large
`encode_document` for sensitive/no-key fallback. `is_sensitive=1` rows get `embedding=NULL`:
never embedded, never egressed; they still match via FTS.
2. **(Phase 2, gated) Extract** concepts/edges for the new memory and incrementally merge into the
graph tables (entity resolution via pgvector nearest-neighbour + threshold, LLM tie-break only
on ambiguity — Graphiti-style fast-path).
3. **(Optional, flagged) Curate** — the Mem0-style ADD/UPDATE/DELETE/NOOP loop, run async, never
physically deleting (supersede to `[SUPERSEDED]` tombstone). Isolated behind a flag so it never
confounds the benchmark.
The existing **background sync engine** already moves rows SQLite↔Postgres in a daemon thread; the
embedding is just another column it carries (authoritative vector in Postgres). Extraction/curation
ride the same off-hot-path lane. The synchronous store call must **not** block on
embedding/extraction if it would delay the response; these run async.
### A.3 Read path (hot path) — three CTEs, RRF fusion, importance prior
Replace the single ts_rank ORDER BY with three top-N legs over the **same `memories` table**, fused
in the handler:
1. **Lexical leg** — the *existing* query verbatim: `plainto_tsquery('english', $q)` +
`ts_rank(search_vector, query)`, with the existing OR-broaden fallback
(`OR_BROADEN_MIN_RANK`) kept intact. `rank_lex` = position in this list. (LIMIT ~50.)
2. **Dense leg**: `ORDER BY embedding <=> $qvec LIMIT 50` using the HNSW index. `rank_dense` =
position. Sensitive rows (NULL embedding) never enter this list.
3. **Graph leg (phase 2, gated, currently disabled)** — seed concept nodes by reusing the
lexical+dense match, traverse 1–2 hops via a recursive CTE over the edge table, score reachable
memories by hop-decay → `rank_graph`. List allowed to be empty.
**Fuse** in Python: `fused(d) = Σ_{s} w_s / (60 + rank_s(d))`, default `w_lex = w_dense = 1.0`,
`w_graph ≈ 0.35` (down-weighted per the negative-prior finding — see benchmark). Missing leg ⇒ 0
contribution, no special-casing.
**Preserve the importance prior** (the current code is *not* pure relevance): apply it as a
post-fusion multiplier — `final(d) = fused(d) * (0.7 + 0.3*importance)` for `sort_by="relevance"`,
or use importance as the tie-break for `sort_by="importance"`. Importance is a *prior*, **not** a
fourth fused list. `sort_by="recency"` stays a pure `ORDER BY created_at`, untouched.
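The importance prior described above can be sketched as a small post-fusion step. This is an illustration under stated assumptions, not the production handler: the function name, the assumption that `importance` lies in [0, 1], and the toy scores are all hypothetical; only the `0.7 + 0.3*importance` multiplier and the tie-break rule come from the text.

```python
def apply_importance_prior(fused, importance, sort_by="relevance"):
    """Order ids from fused RRF scores, applying importance as a prior.

    fused:      {id: fused RRF score}
    importance: {id: importance, assumed to lie in [0, 1]}
    """
    if sort_by == "relevance":
        # Multiplicative prior: rescales the fused score, never adds a new leg.
        final = {d: s * (0.7 + 0.3 * importance.get(d, 0.0)) for d, s in fused.items()}
        return sorted(final, key=lambda d: (-final[d], d))
    if sort_by == "importance":
        # Fused score decides; importance only breaks exact ties.
        return sorted(fused, key=lambda d: (-fused[d], -importance.get(d, 0.0), d))
    raise ValueError(f"unsupported sort_by: {sort_by!r}")


fused = {"a": 0.030, "b": 0.029}       # toy fused scores
importance = {"a": 0.1, "b": 0.9}      # toy importance values
print(apply_importance_prior(fused, importance, "relevance"))
print(apply_importance_prior(fused, importance, "importance"))
```

In the toy case the prior flips a near-tie (`b` overtakes `a` under `relevance`), which shows why the user-visible ranking under the prior needs its own measurement rather than being inferred from the pinned-relevance benchmark.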
> **Why RRF, not convex combination, as the default:** we fuse three incomparable scales
> (unbounded ts_rank, bounded cosine, arbitrary graph proximity), one of them sparse/often-empty.
> RRF is scale-agnostic and treats a missing leg as a clean 0, where CC would force a maintained
> normalization per signal plus a decision for "absent." RRF also collapses to today's exact
> lexical ordering when dense/graph are empty (the SQLite degrade path, **same code**). The
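> benchmark ran CC/TM2C2 as a challenger (ADR-0001 is quality-gated); on our set RRF was chosen,
> though the benchmark report's appendix (item 10) found no CC results anywhere, so this claim is unverified.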
**Single-query alternative (future):** the three legs can be expressed as CTEs + a FULL OUTER JOIN
on `id` with RRF computed in SQL (Supabase hybrid-search pattern), saving a round-trip. The
prototype and initial production both fuse in Python for clarity; in-DB fusion is an optimization,
not a correctness change.
### A.4 SQLite-only graceful degrade
With only the lexical leg present, RRF reduces to ranking by `rank_lex` — **identical ordering to
today's FTS5 `bm25()`**. Zero behaviour change offline; the *same* fusion code path runs in both
modes (dense/graph legs simply empty). Satisfies ADR-0002.
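The degrade claim is easy to check in isolation: with a single non-empty leg, each score is `w / (60 + rank)`, which strictly decreases with rank, so the fused order is the input order. A minimal sketch with a hypothetical `rrf_fuse` helper and synthetic ids (it covers only the RRF step, not the importance prior):

```python
def rrf_fuse(legs, weights, k=60):
    """Weighted RRF over best-first id lists; empty legs contribute nothing."""
    scores = {}
    for name, ranked in legs.items():
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weights[name] / (k + rank)
    return sorted(scores, key=lambda d: -scores[d])


weights = {"lex": 1.0, "dense": 1.0, "graph": 0.35}
lexical = ["m7", "m2", "m9", "m4"]  # synthetic FTS ordering, best first

# SQLite-only mode: the dense and graph legs are simply empty lists.
offline = rrf_fuse({"lex": lexical, "dense": [], "graph": []}, weights)
print(offline == lexical)  # prints "True"
```

No mode switch or special case is needed; the same function handles both deployments.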
### A.5 Concept graph (phase 2 — designed, gated, NOT shipped in v1)
If a future benchmark justifies it (the prototype's could not test it, see §B.4):
```sql
CREATE TABLE concepts (
id bigserial PRIMARY KEY,
canonical_name text NOT NULL,
aliases text[],
embedding halfvec(1024), -- for canonicalization + query seeding
category text
);
CREATE TABLE concept_edges (
src_id bigint REFERENCES concepts(id),
dst_id bigint REFERENCES concepts(id),
relation text NOT NULL,
weight real,
valid_from timestamptz, -- bi-temporal (Graphiti-style supersede)
valid_to timestamptz,
evidence_memory_ids bigint[]
);
CREATE TABLE memory_concepts ( -- the "mentions" link
memory_id bigint REFERENCES memories(id),
concept_id bigint REFERENCES concepts(id),
relation text
);
```
- **Construction** (backfill): batched open LLM triple-extraction (~10–25 calls for the whole
corpus, each memory id-tagged) → global embedding-cluster canonicalization (EDC/KGGEN style) →
write the three tables. Off the hot path; `is_sensitive=1` filtered *before* any call.
- **Incremental** (per new memory): extract its triples, resolve entities against `concepts` via
pgvector NN + threshold (LLM tie-break only on ambiguity), set-merge — never re-cluster.
- **Traversal at recall:** plain recursive SQL CTE (1–2 hops). **No Apache AGE**: our multi-hop is
shallow. If a future need for PPR arises, compute it in Python over a cached `scipy.sparse`
transition matrix loaded from the edge table (Postgres has no native PPR), rebuilt only on graph
mutation.
- **Bi-temporal edges** realize our "supersede, don't accumulate" rule as a queryable timeline:
contradicted edges get `valid_to` set, not deleted.
### A.6 Optional stage-2 cross-encoder (gated separately)
`bge-reranker-v2-m3` on the GPU node over the fused top ~20–30, `sort_by="relevance"` only,
sensitive rows excluded, with a hard-timeout fallback to fused order. Ship only if it clears both
the quality bar **and** the p95 hot-path budget. Not in v1.
### A.7 Infrastructure (production deploy)
- **pgvector enablement on CNPG:** the cluster already runs pgvector for Immich, so the legacy
custom-operand-image path is in place; `CREATE EXTENSION vector` + the additive migration. Any
extension add triggers a rolling restart of the shared cluster — coordinate via presence/GitOps.
- **All cluster changes via Terraform/Terragrunt** in `infra/stacks/...` (GitOps, never kubectl).
- **Embedding/extraction compute:** in-cluster **llama-cpp on the GPU node** for sensitive-safe
local processing (and the no-key fallback); **Voyage-3.5** (hosted) for the non-sensitive batch
(ADR-0003). Sensitive memories are routed locally or left lexical-only — enforced, not
best-effort.
- **PgBouncer:** set `hnsw.ef_search` via `SET LOCAL` inside the recall transaction (transaction
pooling).
- **pgvectorscale/DiskANN deferred** — not needed below ~15M vectors.
---
## B. Prototype as built (the benchmark harness)
The prototype validates **retrieval quality cheaply, in-process**, *not* pgvector/Postgres
(standing up CNPG just to benchmark would burn days before knowing if hybrid even beats FTS). It is
a faithful stand-in: the lexical leg is the *exact* production code path, and the fusion is the same
weighted RRF the production design specifies.
### B.1 Files (committable code only; data/cache/results gitignored)
- `benchmarks/retrievers/fts.py`: `FtsRetriever`, the lexical baseline.
- `benchmarks/retrievers/hybrid.py`: `HybridRetriever`, the three-leg fusion.
- `benchmarks/retrievers/test_hybrid.py`: 9 model-free tests (synthetic content only).
- `benchmarks/scripts/run_eval.py`, `benchmarks/harness/`: runner + metrics.
- Eval data (`benchmarks/data/{corpus,queries,qrels}.jsonl`), embedding cache
(`benchmarks/cache/*.npy`), and full results (`benchmarks/results/*.json`) are **gitignored**
(verified via `git check-ignore`) — privacy rule: no real memory content committed.
### B.2 Lexical leg = the real product
`hybrid.py` **reuses `retrievers.fts.FtsRetriever` verbatim**, which is itself a faithful
reimplementation of `src/claude_memory/mcp_server.py::_sqlite_recall` (`sort_by="relevance"`): a
fresh in-memory FTS5 index over the 5,452-memory corpus with the production virtual-table shape and
default `unicode61` tokenizer; query handling mirrors production (AND-match first, OR-broaden if
zero rows; rank by `-bm25()*0.7 + importance*0.3`; LIKE fallback on operational errors). **So the
hybrid's lexical component *is* the exact production system it must beat — no drift.**
### B.3 Dense leg
- **Model:** `BAAI/bge-large-en-v1.5`, 1024-d, L2-normalized. Passages raw; the query gets the BGE
instruction prefix `"Represent this sentence for searching relevant passages: "`. Similarity =
cosine via numpy matmul over the normalized matrix (faiss unnecessary at N=5452).
- **Embeddings:** all 5,452 memories embedded once (one-time ≈ 31.5 min on a CPU-only box),
**cached** fingerprint-keyed to `cache/emb_BAAI_bge-large-en-v1.5_<corpusfp>.npy` (+ `.ids.npy`)
→ reruns skip the embed (cache-hit rebuild ≈ 8.3 s).
- Hosted Voyage/OpenAI/Cohere paths are implemented and key-gated but were **untriggered** (no key
in env) — so the prototype ran the local default, which is also the sensitive-only production
fallback. **Production maps this matrix to pgvector `halfvec(1024)`.**
### B.4 Graph leg (built, but structurally excluded from the ranking — UNMEASURED)
- **Construction with ZERO LLM calls** (the tractable shortcut): concepts = union of each memory's
`tags` + the already-LLM-generated `expanded_keywords` field + a lightweight regex/stop-word
noun-phrase proxy over `content`, normalized + de-pluralized. 37,075 concepts extracted; **19,907
kept** after document-frequency pruning (df ∈ [2, 2%·N=109]: df<2 links nothing, df>109 are
non-discriminative hubs). Concept cliques → weighted memory-to-memory edges. Result: **5,452 nodes,
2,095,624 edges**, built in ~9 s (in-memory networkx).
- **Traversal:** 1 hop from the top-10 RRF seeds (capped 25 neighbours/seed), contributing **only ids
not already in the base legs** — this exclusion is the bug (next bullet).
- **Result: the graph was structurally barred from the ranking, so its value is UNMEASURED.** Because
graph hits are restricted to ids *outside* the FTS ∪ dense base set and weighted 0.35, a graph-only id's
max RRF (`0.35/61 ≈ 0.0057`) sits below any base-leg id's min (`1.0/110 ≈ 0.0091`) — it can never enter
the fused top-k. The "graph ≡ nothing" ablation (§B.6) was therefore **guaranteed by construction, not
an empirical finding** (a post-run review found a relevant graph-surfaced memory that fusion discarded).
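The bound in the last bullet can be checked with a few lines of arithmetic. The helper names below are hypothetical; the weights, `k = 60`, and depth 50 are the values stated in this appendix. The argument assumes the base legs together return at least as many ids as the requested top-k.

```python
def best_single_leg_score(weight: float, k: int = 60) -> float:
    """Highest RRF contribution a single leg can give: rank 1."""
    return weight / (k + 1)


def worst_single_leg_score(weight: float, depth: int, k: int = 60) -> float:
    """Lowest RRF contribution from a leg pulled to `depth`: the last rank."""
    return weight / (k + depth)


# A graph-only id appears in no base leg, so its best case is rank 1 at w = 0.35.
graph_only_max = best_single_leg_score(0.35)

# A base-leg id appears in at least one base leg, at worst at rank 50 with w = 1.0.
base_leg_min = worst_single_leg_score(1.0, depth=50)

# Graph weight at which a rank-1 graph-only id would tie the weakest base-leg id.
break_even_weight = (60 + 1) / (60 + 50)

print(f"{graph_only_max:.4f} < {base_leg_min:.4f}, break-even w_graph = {break_even_weight:.3f}")
```

Under this simplified bound, a graph weight above roughly 0.55 (or dropping the base-set exclusion) is the minimum change for a graph-only id to be able to displace any base-leg id.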
### B.5 Fusion (the production recipe, exactly)
Weighted RRF, `RRF(d) = Σ_leg w_leg/(60 + rank_leg(d))`, chosen over convex combination because it
is score-scale-free (no BM25-vs-cosine calibration). Weights `w_fts = 1.0, w_dense = 1.0,
w_graph = 0.35`. Each leg pulled to depth 50 before fusion, truncated to k.
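A minimal sketch of the formula above, assuming each leg is supplied as an ordered list of memory ids. The function name `weighted_rrf` and the example ids are illustrative and are not taken from `hybrid.py`.

```python
from collections import defaultdict


def weighted_rrf(legs, weights, k=60, depth=50, top_k=10):
    """Fuse ranked id lists: score(d) = sum over legs of w_leg / (k + rank_leg(d))."""
    scores = defaultdict(float)
    for name, ranked_ids in legs.items():
        w = weights.get(name, 0.0)
        # Ranks are 1-based; each leg is cut at `depth` before fusion.
        for rank, doc_id in enumerate(ranked_ids[:depth], start=1):
            scores[doc_id] += w / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:top_k]


legs = {
    "fts": ["a", "b", "c"],
    "dense": ["b", "d", "a"],
    "graph": ["e"],
}
weights = {"fts": 1.0, "dense": 1.0, "graph": 0.35}
fused = weighted_rrf(legs, weights)
```

In this toy input, `b` and `a` rise to the top because two legs agree on them, and the graph-only id `e` lands last because its single contribution is `0.35/61`.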
### B.6 Decision-relevant ablation (this is what informs ADR-0004)
| Config | What | Overall recall@10 | Para recall@10 | Multi recall@10 |
|---|---|---|---|---|
| **A** full hybrid (FTS+dense+graph) | the prototype | 0.834 | 0.725 | 0.775 |
| **B** FTS+dense (w_graph=0) | graph removed | **0.834** | **0.725** | **0.775** |
| **C** dense-only | | 0.748 | — | — |
| **D** FTS-only (= baseline) | | 0.695 | 0.375 | 0.711 |
**A and B are identical to three decimals on every metric — but this is a structural artifact (§B.4),
not a test of the graph.** The valid signal here is the **FTS-vs-dense decomposition**: dense-only (C)
and FTS-only (D) each lose to the fusion (B) — dense recovers paraphrase, lexical recovers exact, fusion
gets the best of both. The concept graph itself is **unevaluated** (it could never affect top-k under
this fusion config). **This still supports phasing — ship lexical+dense (phase 1), the robust measured
win — but the graph is gated pending a *valid* retest, not because it failed.** (Configs B/C/D were not
persisted as result JSONs; only A and D are reproducible from committed artifacts.)
### B.7 Prototype → production mapping
| Prototype (in-process) | Production (ADR-0002) |
|---|---|
| numpy cosine over normalized matrix | pgvector `halfvec(1024)` + HNSW ANN |
| `.npy` embedding cache, fingerprint-keyed | `embedding` column on `memories`, synced |
| in-memory networkx graph (phase 2) | `concepts` / `concept_edges` / `memory_concepts` tables |
| `FtsRetriever` (FTS5 in-memory) | existing `search_vector` + GIN (`plainto_tsquery`/`ts_rank`) |
| weighted RRF in Python | same RRF (Python handler, or CTE+FULL OUTER JOIN in SQL) |
| bge-large local | Voyage-3.5 hosted (non-sensitive) / bge-large local (sensitive, no-key) |
### B.8 Prototype caveats (carried into the report's limitations)
1. **Graph result is INVALID, not merely "null."** The fusion config barred the graph leg from the
top-k by construction (§B.4), so the benchmark did not actually test it. A valid retest must include
graph candidates in the fused pool (drop the base-set exclusion) and/or sweep the weight, ideally with
a typed-relation graph from real LLM extraction and multi-hop queries whose hops are *not* semantically
adjacent.
2. **Exact-stratum nDCG/MRR dip ~0.018/0.025** vs FTS (recall unaffected) is the standard RRF cost
of blending one perfect hit with near-ties; a small exact-match rank bonus could recover it.
3. **Latency** (p50 ≈ 230 ms) is CPU-bound on the local query embed; non-gating, GPU/hosted ~10×
faster. Baseline FTS was p50 ≈ 15.7 ms (pure SQLite).
4. **No pgvector/Postgres** in the prototype — the production substrate is design-only here; the
numbers measure *retrieval quality*, which transfers, not the production latency profile.
---
## C. Open questions (for production rollout)
1. **pgvector enablement mechanism** — confirm whether the live CNPG is on the legacy
custom-operand-image (likely, since Immich uses pgvector) or the modern image-volume-extensions
path; either way the migration is additive, but the enablement DDL/Terraform differs.
2. **Graph gate** — what evidence would re-open the concept graph? Candidate: a multi-hop eval slice
whose hops are *not* semantically adjacent (where dense can't shortcut), built from real
LLM-extracted typed relations rather than keyword co-occurrence. Until then, graph stays off.
3. **Voyage vs bge-large in production** — the benchmark ran bge-large (local). A cheap follow-up:
re-run the dense leg with Voyage-3.5 on the non-sensitive corpus to confirm the hosted model's
higher quality ceiling holds on *our* content before committing the production default.

523
docs/research/survey.md Normal file

@ -0,0 +1,523 @@
# Landscape survey: semantic + concept-graph memory for hybrid recall
Status: research input for the ADR-0001 hybrid upgrade. Scope: how the agent-memory and
graph-RAG literature builds and retrieves over a personal-memory store, which embedding
model to use, how to fuse lexical + dense + graph signals, and how to evaluate the result.
**Read this with the decisions already fixed in [ADR-0001](../adr/0001-pursue-hybrid-retrieval-embeddings-and-concept-graph.md) through
[0003](../adr/0003-external-embedding-apis-allowed-for-non-sensitive-memories.md):** we pursue
hybrid (gated on a benchmark beating FTS), embeddings live in pgvector on the existing CNPG
Postgres, the concept graph is node/edge tables in Postgres, sensitive memories never egress,
and adoption is decided **quality-first** (recall@k / nDCG@10 / MRR; latency & storage are
reported, not gating).
The recurring conclusion below: **borrow the ideas, not the engines.** None of the four
systems surveyed is a drop-in for our stack, but each contributes a mechanism we re-implement
natively on Postgres + pgvector.
---
## 1. Our workload is the opposite of GraphRAG's design target
Before comparing systems it helps to state what we are *not*. The graph-RAG family was built
for **global sensemaking** ("what are the themes across this corpus") over a **static document
collection**. Our workload is the reverse:
| Dimension | GraphRAG target | claude-memory-mcp |
|---|---|---|
| Unit | Long documents, chunked | Atomic, already-curated memories (avg ~500 chars) |
| Corpus dynamics | Static, indexed once | Append-heavy: a few hundred memories/day arriving |
| Query type | Corpus-wide summarization | Point / multi-hop recall ("what did I decide about X") |
| Hot path | Offline batch | **Every user prompt** (auto-recall hook fires before each turn) |
| Scale | 10k–1M+ chunks | ~5k memories today → tens of thousands |
This mismatch is the lens for everything that follows. The expensive part of GraphRAG —
community detection + hierarchical LLM summaries — is the *wrong retrieval unit* for atomic
point lookups, and re-summarizing communities on a sustained append stream is its dominant,
unbounded cost. We want a design whose index-time work is **proportional to new content**, and
whose retrieval path has **no LLM call** (so it fits the per-prompt budget).
---
## 2. The GraphRAG family — Microsoft GraphRAG, LightRAG, nano-graphrag, LazyGraphRAG
All four turn text into an entity–relation knowledge graph via LLM extraction; they differ on
the expensive part (community detection + hierarchical summarization), which is exactly where
**incremental cost** lives.
### Microsoft GraphRAG (Edge et al. 2024, arXiv 2404.16130)
Pipeline: chunk → LLM extracts entities + relationships per chunk (with multi-round
"gleanings" to catch misses) → summarize duplicate element instances into node/edge
descriptions → build graph → **Leiden** community detection producing a *hierarchy*
(levels C0..C3) → an LLM writes a **community report** for every community at every level.
Two query modes: **global** (map-reduce over all community reports — corpus-wide
sensemaking) and **local** (start from query-relevant entities, fan out). Indexing is
LLM-heavy: ~4,000 LLM calls / ~35 min for one textbook; ~$20–40 per 1M tokens with gpt-4o.
**Incremental:** the `graphrag update` command (GraphRAG 1.0) computes deltas and places new
entities into existing communities "rather than re-running Leiden," re-summarizing only
changed communities — **but** maintainers warn that once drift crosses a threshold "the worst
case degrades to the same performance as a normal indexing." A periodic, unpredictable
full-reindex cliff on a sustained append stream. Parquet/file-pipeline oriented, not
Postgres-native.
### LightRAG (HKUDS, arXiv 2410.05779, EMNLP 2025)
Pipeline: chunk → LLM extracts entities + relations → "profiling" generates a key-value text
summary per node/edge → **deduplicate** merges identical entities/relations across chunks.
**No community detection.** Retrieval is **dual-level**: the LLM splits the query into
low-level keywords (specific entities) and high-level keywords (broad themes via relationship
chains); each set is matched by *vector* similarity against an entity-vector index and a
relation-vector index, then one-hop neighbours are pulled from the graph. Modes:
naive / local / global / hybrid / mix (mix = default).
**Incremental (the crux):** a new document goes through the same local indexing to produce a
small local graph, then is integrated by **set union** of node-sets and edge-sets into the
existing graph — "eliminating the need to rebuild the entire index graph." No communities ⇒
**no global re-clustering or re-summarization, ever** ⇒ O(new content) per insert, the only
genuinely-incremental member. Ships a PostgreSQL all-in-one backend (PGVectorStorage on
pgvector + PGGraphStorage on Apache AGE + KV + doc-status in one DB, PG ≥16.6).
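The set-union integration can be sketched in a few lines. This is a conceptual illustration only, not LightRAG's implementation (which also merges node and edge descriptions); the function `integrate` and the example triples are hypothetical.

```python
Graph = tuple[set[str], set[tuple[str, str, str]]]


def integrate(graph: Graph, local: Graph) -> tuple[int, int]:
    """Union a new document's local graph into the existing one, in place.

    Returns how many nodes and edges were actually new, which is the only
    work the insert causes: nothing existing is re-clustered or re-summarized.
    """
    nodes, edges = graph
    new_nodes, new_edges = local
    added = (len(new_nodes - nodes), len(new_edges - edges))
    nodes |= new_nodes
    edges |= new_edges
    return added


graph: Graph = ({"svelte", "vite"}, {("svelte", "built_with", "vite")})
local: Graph = ({"svelte", "cloudflare"}, {("svelte", "deployed_on", "cloudflare")})

first = integrate(graph, local)
second = integrate(graph, local)  # re-inserting the same document adds nothing
```

The second call returning zero additions is the property that matters for an append-heavy stream: inserts are idempotent and cost scales with the new content only.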
### nano-graphrag (~1100 LOC)
A faithful minimal reimplementation of Microsoft GraphRAG and an excellent compact *reference*
for the exact extraction/community/report prompts. **Hard NO for incremental:** README states
plainly "each time you insert, the communities of graph will be re-computed and the community
reports will be re-generated" — O(whole graph) LLM cost per append.
### LazyGraphRAG (Microsoft Research, 2024)
Defers **all** LLM work to query time: index time uses only NLP noun-phrase extraction + graph
statistics — "indexing costs are identical to vector RAG and 0.1% of the costs of full
GraphRAG." Sidesteps the incremental-re-summarization problem entirely by never pre-summarizing
communities. The **defer-LLM-cost principle** is the one to borrow.
### Verdict for us
**Adopt none wholesale; steal LightRAG's architecture + LazyGraphRAG's defer-LLM principle.**
LightRAG is the only one whose incremental model (pure set-union, no re-clustering) structurally
fits an append-heavy stream, and whose retrieval path (vector + one-hop graph, no query-time
map-reduce) is hot-path-viable. But adopting LightRAG-the-product is not recommended: its
Postgres graph path needs the **Apache AGE** extension (not on our CNPG), and that path has
documented concurrency/entity-merge instability under append-heavy load (asyncpg pool timeouts
at the merge stage; slow upgrades). Our multi-hop is shallow (1–2 hops), expressible in plain
recursive SQL CTEs over node/edge tables — no AGE needed.
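As a rough illustration of that last point, a bounded-hop traversal fits in one `WITH RECURSIVE` query. The sketch below runs it on SQLite through Python's `sqlite3` so it is self-contained; the table `memory_edges(src, dst)` is a hypothetical stand-in for the node/edge tables, and the production query on Postgres would need its own parameter style and indexes.

```python
import sqlite3

# Walk outgoing edges from one seed, stopping after a maximum number of hops.
# UNION (not UNION ALL) drops duplicate (id, hops) rows, and the hop bound
# guarantees termination even on cyclic graphs.
REACH_SQL = """
WITH RECURSIVE reach(id, hops) AS (
    SELECT ?, 0
    UNION
    SELECT e.dst, r.hops + 1
    FROM reach AS r
    JOIN memory_edges AS e ON e.src = r.id
    WHERE r.hops < ?
)
SELECT id, MIN(hops) AS hops FROM reach GROUP BY id ORDER BY hops, id
"""


def reachable(conn: sqlite3.Connection, seed: int, max_hops: int = 2):
    """Return (memory_id, shortest_hop_count) pairs within `max_hops` of `seed`."""
    return conn.execute(REACH_SQL, (seed, max_hops)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE memory_edges (src INTEGER, dst INTEGER)")
conn.executemany(
    "INSERT INTO memory_edges VALUES (?, ?)",
    [(1, 2), (2, 3), (3, 4), (1, 3)],
)
```

Here `reachable(conn, 1, 2)` reaches memory 4 only at the second hop, and memory 3 is reported at its shortest distance of one hop even though a two-hop route also exists.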
---
## 3. Zep / Graphiti — temporal knowledge graph for agent memory
Zep (arXiv 2501.13956) is an agent-memory service; **Graphiti** is its open engine
(Neo4j / FalkorDB / Kuzu backend, ~20k stars, Apache-2.0). It is the **closest conceptual analog** to
our hybrid goal — it fuses exactly the three signals ADR-0001 wants.
**Three-tier graph:** episode subgraph (raw ingested data, the provenance layer ≈ our Memory
rows) → semantic entity subgraph (entity nodes + typed relationship edges, each linking back to
its source episodes) → community subgraph (clusters with LLM summaries — the GraphRAG "global"
layer).
**Bi-temporal model:** every semantic edge carries **four** timestamps on two timelines —
*valid time* (`t_valid`/`t_invalid`: when the fact held true in the world) and *transaction
time* (`t'_created`/`t'_expired`: when Zep learned/retracted it). Facts are never deleted;
superseded facts get their validity window closed. This is a principled, queryable version of
our **"supersede, don't accumulate"** memory discipline.
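A minimal sketch of the four-timestamp edge and the "close the window, never delete" rule. The class `FactEdge`, the helpers, and the example fact are hypothetical; they follow the field names quoted above rather than Graphiti's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class FactEdge:
    subject: str
    relation: str
    obj: str
    t_valid: datetime                     # valid time: fact became true in the world
    t_created: datetime                   # transaction time: store learned the fact
    t_invalid: Optional[datetime] = None  # valid time: fact stopped being true
    t_expired: Optional[datetime] = None  # transaction time: store retracted the fact


def supersede(old: FactEdge, new: FactEdge) -> None:
    """Close the old edge's validity window instead of deleting the edge."""
    old.t_invalid = new.t_valid
    old.t_expired = new.t_created


def facts_as_of(edges: list[FactEdge], when: datetime) -> list[FactEdge]:
    """Facts that held true in the world at `when` (valid-time query)."""
    return [
        e for e in edges
        if e.t_valid <= when and (e.t_invalid is None or when < e.t_invalid)
    ]


old = FactEdge("memory-service", "stores_data_in", "sqlite",
               t_valid=datetime(2024, 1, 1), t_created=datetime(2024, 1, 2))
new = FactEdge("memory-service", "stores_data_in", "postgres",
               t_valid=datetime(2025, 3, 1), t_created=datetime(2025, 3, 5))
edges = [old, new]
supersede(old, new)
```

Both edges stay in `edges`; a query as of mid-2024 still answers "sqlite", and a query as of mid-2025 answers "postgres".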
**Incremental ingestion (per episode):** a *sequence* of LLM calls — entity extraction →
entity resolution/dedup (embed + cosine + BM25 search against existing nodes, then an LLM
judges merge vs create) → fact (edge) extraction → fact dedup → temporal extraction (resolve
"two weeks ago" against a reference time) → edge invalidation (LLM compares each new edge
against related existing edges; on a temporally-overlapping contradiction it closes the old
edge). Cost is **heavy on write**, paid back on reads.
**Retrieval (sub-second, NO LLM at query time):** three parallel searches fused, then reranked.
- `φ_cos`: cosine over embeddings of fact text / entity names / community summaries (BGE-m3, 1024-d).
- `φ_bm25`: BM25 full-text over the same fields.
- `φ_bfs`: breadth-first n-hop graph traversal from seed nodes — the genuinely graph-native signal.
- Rerank: pluggable — **RRF**, MMR (diversity), episode-mentions (frequency), node-distance, or a cross-encoder (most accurate, slowest).
**Published quality (the strongest evidence in this family):** on **LongMemEval**, Zep reports
**+18.5%** accuracy over a baseline (71.2% vs 60.2% with gpt-4o) *and* ~90% query-latency
reduction; on **MemGPT DMR**, 94.8% vs 93.4%. These are conversational long-context QA, not
personal-fact recall@k — so the headline numbers won't transfer directly, but the *fusion
recipe* is exactly what we benchmark.
### Verdict for us
**Primary design blueprint for the concept-graph half — but not an adopted dependency.**
Graphiti has **no pgvector backend**; adopting the engine forces a new Neo4j/FalkorDB graph DB
into the cluster, conflicting with ADR-0002 and reuse-before-building. We borrow four mechanisms,
re-implemented on Postgres: (1) the episodic(=Memory rows)/semantic(=new node+edge tables)
split; (2) the parallel-search + RRF fusion read path; (3) resolution-via-search to dedupe the
graph using our existing FTS+vector; (4) bi-temporal edge invalidation as the queryable form of
our supersede discipline. We de-scope the community/summarization tier and the default
cross-encoder. Two hard caveats: keep the multi-LLM-call extraction **off the hot path**
(background, like our sync engine), and route extraction through in-cluster llama-cpp / filter
`is_sensitive` per ADR-0003.
---
## 4. Mem0 / Mem0g — extraction-based, LLM-curated memory
Mem0 (arXiv 2504.19413) is a **write-side memory curator**, not a retrieval algorithm — it
solves a *different axis* than our gated problem, and is **complementary**.
**Two-phase pipeline.** *Extraction:* on each new message pair, an LLM (fed an async
conversation summary + the last ~10 messages) emits a set of concise "candidate facts."
*Update (the curation step):* for each candidate fact, retrieve the top ~10 semantically-similar
existing memories, then a function-calling LLM picks one of four ops — **ADD** (new), **UPDATE**
(merge richer detail into an existing id, gated on information content), **DELETE** (a
contradicted memory), **NOOP**. Net effect: the store self-deduplicates, self-merges, and
self-supersedes instead of accumulating. Two LLM calls per write (extract + decide) + a vector
search; **async by default** (off the user hot path); the **read/search path is pure vector
similarity with no LLM**.
**Mem0g (graph variant):** a directed labeled entity graph (Alice –lives_in→ SF) on Neo4j; a
conflict-detection + LLM update-resolver marks superseded relationships *invalid* rather than
deleting them.
**Published quality:** on **LOCOMO**, Mem0 J=66.88 / Mem0g 68.44 beats OpenAI memory (52.90),
A-Mem (48.38), LangMem (58.10), ties Zep (65.99), at ~1/15th the tokens of full-context; Mem0g
specifically wins temporal reasoning. Reference latencies (gpt-4o-mini): search p95 ≈ 0.20s,
total p95 ≈ 1.44s, vs full-context ≈ 17s.
### Verdict for us
**Adopt the curation loop as a separate, flagged subsystem — it does NOT move the ADR-0001
retrieval metric by itself** (its search is vector-only, no lexical+graph fusion). The
ADD/UPDATE/DELETE/NOOP loop is the highest-leverage idea Mem0 offers: it automates a discipline
our own rules already mandate (every correction stored, supersede-don't-accumulate, tombstones)
but currently leave to manual human effort. It is cheap to build against our existing Memory
model + `update_memory` endpoint, runs async off the recall hot path, and respects the
`is_sensitive` boundary. **Hard guardrails required:** never physically DELETE — supersede to a
`[SUPERSEDED]` tombstone (importance ~0.3, per our convention); log every op; gate behind the
non-sensitive filter. Keep extraction *optional* (our memories are already atomic, so usually
only the single UPDATE-decision call is needed). Mem0g's "mark invalid, not delete" and triplet
schema (source, relation, dest) are reusable ideas, but implemented on pgvector/Postgres, not
Neo4j. **Critically: isolate curation behind a flag so the benchmark measures retrieval quality
independently of any curation behaviour change.**
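The guardrails above can be made concrete with a small sketch of the op-application step. The LLM call that chooses the op is deliberately left out; the `Memory` dataclass, `apply_op`, and the log format are hypothetical and do not reflect the real Memory model or the `update_memory` endpoint.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    id: int
    content: str
    importance: float = 0.5
    is_sensitive: bool = False


def apply_op(store, log, op, target_id=None, content=""):
    """Apply one curation decision (ADD / UPDATE / DELETE / NOOP) and log it."""
    target = store.get(target_id) if target_id is not None else None
    # Guardrail: sensitive memories are never touched by the curator.
    if target is not None and target.is_sensitive:
        log.append(("SKIPPED_SENSITIVE", target_id))
        return
    if op == "ADD":
        target_id = max(store, default=0) + 1
        store[target_id] = Memory(target_id, content)
    elif op == "UPDATE" and target is not None:
        target.content = content
    elif op == "DELETE" and target is not None:
        # Guardrail: never physically delete, leave a low-importance tombstone.
        target.content = "[SUPERSEDED] " + target.content
        target.importance = 0.3
    elif op != "NOOP":
        raise ValueError(f"unsupported op {op!r} for target {target_id!r}")
    log.append((op, target_id))


store = {
    1: Memory(1, "Deploys run on Fridays"),
    2: Memory(2, "private note", is_sensitive=True),
}
log = []
apply_op(store, log, "DELETE", 1)
apply_op(store, log, "ADD", content="Deploys run on Mondays")
apply_op(store, log, "UPDATE", 2, "rewritten")
```

After these three calls the old memory survives as a tombstone, the correction is a new row, and the attempted edit of the sensitive row is refused but still logged.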
---
## 5. HippoRAG / HippoRAG 2 — Personalized PageRank over a concept graph
HippoRAG (NeurIPS 2024, arXiv 2405.14831) and HippoRAG 2 (ICML 2025, arXiv 2502.14802) are the
**strongest published evidence that a concept graph wins on multi-hop** — precisely the query
class ADR-0001 says the graph must beat lexical on.
**Mechanism (hippocampal indexing analogy):** LLM = neocortex; retrieval encoder =
parahippocampal region (detects synonyms); open KG = hippocampal index. *Offline:* an LLM runs
OpenIE on each passage → a schema-free KG of noun-phrase nodes joined by relation edges; the
encoder adds **synonym edges** between phrase nodes with cosine > τ=0.8. *Online, per query:*
(1) ONE LLM call does NER on the query; (2) the encoder links query entities to nearest KG
nodes = **seed nodes**; (3) each seed weight is scaled by node specificity (`|P_i|⁻¹`, an
IDF-like rare-phrase boost) and written into the **Personalized PageRank** reset vector;
(4) PPR runs to convergence (damping 0.5); (5) the phrase-node probability vector scores
passages. Multi-hop emerges because the random walk reaches passages sharing **no** query tokens
— in **one** retrieval step instead of iterative retrieve-reason loops.
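Steps (3) to (5) amount to a power iteration with a non-uniform reset vector. The sketch below is a pure-Python illustration on a toy graph, not HippoRAG's code: the function name, the adjacency-dict format, and the convention that `damping` is the probability of following an edge are assumptions made here, and a real seed dict would already carry the specificity weights.

```python
def personalized_pagerank(adj, seeds, damping=0.5, tol=1e-10, max_iter=200):
    """Power-iterate PPR. `adj` maps node -> list of neighbours (all keys of adj)."""
    nodes = sorted(adj)
    total = sum(seeds.values())
    # Reset vector: seed weights normalized to a probability distribution.
    reset = {n: seeds.get(n, 0.0) / total for n in nodes}
    p = dict(reset)
    for _ in range(max_iter):
        nxt = {n: (1 - damping) * reset[n] for n in nodes}
        for n in nodes:
            nbrs = adj[n]
            if nbrs:
                share = damping * p[n] / len(nbrs)
                for m in nbrs:
                    nxt[m] += share
            else:
                # Dangling node: hand its mass back to the reset distribution.
                for m in nodes:
                    nxt[m] += damping * p[n] * reset[m]
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Query node q links to m1; m1 links to m2; m3 is unconnected.
adj = {"q": ["m1"], "m1": ["q", "m2"], "m2": ["m1"], "m3": []}
ppr = personalized_pagerank(adj, seeds={"q": 1.0})
```

Node `m2` receives probability mass even though it is two hops from the seed, which is the "reaches passages sharing no query tokens" behaviour, while the unconnected `m3` stays at zero.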
**HippoRAG 2** makes passages first-class nodes (linked to their phrases by "contains" context
edges), shifts linking to query→triple + **LLM triple-filtering** ("recognition memory"), and
seeds *all* passage nodes by embedding similarity (small weight ~0.05) so dense and graph blend
in one PPR. Net effect: a single PPR fuses lexical-ish phrase matching, dense passage
similarity, and multi-hop traversal into one ranked list.
**Published quality (passage recall@5):** HippoRAG 2 beats the strongest 7B embedding baseline
(NV-Embed-v2) on every multi-hop set — 2Wiki **90.4 vs 76.5** (+13.9), MuSiQue **74.7 vs 69.7**
(+5.0), HotpotQA **96.3 vs 94.5** — and is the only structure-augmented method that *doesn't
regress* simple QA (NQ 78.0).
### Verdict for us
**Adopt the idea (PPR spreading activation over our concept graph), not the framework.** Two
hard adaptations, both fitting our stack:
1. **Drop the per-query LLM** (v1 NER / v2 triple-filtering) — the only thing that would blow
the hot-path budget — and **seed PPR from our existing FTS top-k ∪ pgvector top-k**, weighted
by fused score × importance × node-specificity. This turns PPR into the *fusion layer*
ADR-0001 wants, with zero added LLM latency.
2. **Prefer a memory-node graph** (memories as nodes, our typed Relationships as edges) over
HippoRAG's phrase explosion (it turns 11.6k passages into ~92k nodes; at our scale that'd be
~43k phrase nodes). Leaner and native to ADR-0002's node/edge tables.
A reproducible PPR latency micro-benchmark on a 5,400-memory graph measured **~2 ms** (memory-node
graph, transition matrix cached) to **~21 ms** (full phrase graph), ~105 ms even at 3× growth —
PPR is **not** the bottleneck; the stock recipe's online LLM is (which we remove). Postgres can
*store* the graph but has no native PPR (pgrouting = shortest-path only), so PPR is computed in
Python over a cached `scipy.sparse` transition matrix loaded from the node/edge tables, rebuilt
only on graph mutation. **Caveat for the gate:** our LLM-free seeding variant is *not* validated
by the papers, and our 5.4k personal corpus is far smaller and less multi-hop-dense than their
90k-node Wikipedia graphs — so the benchmark must confirm the multi-hop win transfers.
---
## 6. Embedding model survey
Our `content` and `expanded_keywords` are **short** prose (capped ~500 chars), so a model's
max-token limit is effectively a non-constraint — quality, dimensionality, and deploy
feasibility decide.
### Self-hostable (sentence-transformers on the GPU node, or GGUF via llama-cpp; pgvector stores the vector)
| Model | Dim | Params / VRAM (fp16) | MTEB(en) avg | License |
|---|---|---|---|---|
| nomic-embed-text-v1.5 | 768 (Matryoshka 64–768) | 0.1B / <1 GB | 62.28 | Apache-2.0 |
| bge-base-en-v1.5 | 768 | 109M / ~0.5 GB | ~63.5 | MIT |
| **bge-large-en-v1.5** | **1024** | 335M / ~1.3 GB | **64.23** | MIT |
| e5-large-v2 | 1024 | 0.3B / ~1.3 GB | ~62.25 | MIT |
| bge-m3 | 1024 dense (+sparse +ColBERT) | 568M / ~1–2.4 GB | en ~59–60 (strong multiling/BEIR) | MIT |
| gte-Qwen2-1.5B-instruct | 1536 | 1.5B / ~3.4 GB | **67.16** (top of set) | Apache-2.0 |
### Hosted (API call, NON-SENSITIVE memories only per ADR-0003)
| Model | Dim | MTEB(en) avg | Price /1M tok | License |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 (Matryoshka→256) | 62.3 | $0.02 | proprietary |
| OpenAI text-embedding-3-large | 3072 (Matryoshka) | 64.6 | $0.13 | proprietary |
| **Voyage-3.5** | **1024** (+256/512/2048, int8/binary) | beats OpenAI-3-large ~7.5% on Voyage's eval | $0.06 (first 200M free) | proprietary |
| Voyage-3.5-lite | 1024 | beats OpenAI-3-large ~2–3.8% | $0.02–0.03 | proprietary |
| Cohere embed-english-v3.0 | 1024 (native int8/binary) | ~64.5 | ~$0.10 (sales-quoted) | proprietary |
**Implementation notes that matter.** Use **asymmetric** prompting (query vs document):
sentence-transformers `encode_query`/`encode_document`, always `normalize_embeddings=True` so
pgvector cosine == dot product. e5-large-v2 *requires* manual `"query: "`/`"passage: "` prefixes
or quality collapses; bge prepends a query instruction; gte-Qwen2 prepends a task instruction to
queries only. Pick the dimension **once** — changing it later forces a full re-embed + HNSW
rebuild.
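A small sketch of the two notes that are easiest to get wrong: model-specific prefixes and unit-length normalization. It uses plain Python lists rather than a real encoder, and `prepare_text` with its `model_family` labels is a hypothetical helper; only the e5 prefixes and the BGE query instruction quoted in this document are encoded.

```python
import math

BGE_QUERY_PREFIX = "Represent this sentence for searching relevant passages: "


def prepare_text(model_family: str, text: str, is_query: bool) -> str:
    """Apply the asymmetric prefix a model family expects before encoding."""
    if model_family == "e5":
        return ("query: " if is_query else "passage: ") + text
    if model_family == "bge":
        # bge prefixes queries only; passages are embedded raw.
        return BGE_QUERY_PREFIX + text if is_query else text
    raise ValueError(f"no prefix rule recorded for {model_family!r}")


def l2_normalize(v: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
# On unit vectors, cosine similarity and the dot product coincide.
same = abs(cosine(a, b) - dot(l2_normalize(a), l2_normalize(b))) < 1e-12
```

Because stored vectors are normalized once at write time, the query side only needs a dot product, which is why the prototype can use a single matmul.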
### Recommendation (one of each, quality-first)
- **Local: BAAI/bge-large-en-v1.5** (1024-d, MIT) — best quality-per-complexity in the
self-hostable set for short English memories: strong retrieval, ~1.3 GB VRAM (runs on CPU at
~100 ms), no `trust_remote_code`, mature ST support. The 512-token cap is irrelevant for our
content. (gte-Qwen2-1.5B-instruct is the explicit upgrade candidate if the benchmark says
bge-large leaves quality on the table; nomic is the fallback if a long context or sub-768
Matryoshka dims are ever wanted.)
- **Hosted: Voyage-3.5** (1024-d) — highest measured retrieval quality of the hosted options,
**same 1024-d as the local pick** so the pgvector column and fusion code are identical whether
local or hosted (clean A/B), and our whole corpus embeds inside the free tier. Non-sensitive
only; sensitive rows go to bge-large locally. (OpenAI text-embedding-3-small is the pragmatic
fallback if no Voyage key.)
> **Prototype note:** the prototype as built used **bge-large-en-v1.5** (1024-d, local default,
> no API key in env). Production should adopt **Voyage-3.5** (also 1024-d) for non-sensitive
> memories per ADR-0003, keeping bge-large as the sensitive-only / no-key fallback. Both 1024-d
> means the pgvector schema and fusion code are unchanged across the choice.
---
## 7. Fusion of lexical + dense + graph signals
Three retrieval families produce candidate lists per recall; one fusion function merges them.
### Reciprocal Rank Fusion (RRF) — rank-based, scale-agnostic
Cormack/Clarke/Büttcher (SIGIR'09): `score(d) = Σ_s 1/(k + rank_s(d))`, summed over every signal
the doc appears in. `k` is a smoothing constant — their sweep found **k=60 near-optimal but
uncritical** (≤0.3% MAP swing across [10,100]; k=0 or k=500 costs 3–4%). A doc present in one
list but absent from another contributes `1/(k+rank)` where it fires and **0** elsewhere — so
multi-signal agreement is rewarded and single-signal hits are penalized (the hybrid behaviour we
want). Extends to N lists trivially (just sum), which makes a 3-way fuse a one-liner.
**Weighted RRF:** `Σ_s w_s/(k + rank_s(d))` — bias a stronger signal, no pre-normalization.
### Weighted score fusion / convex combination (CC) — score-based, needs normalization
Bruch et al. (arXiv 2210.11934, TOIS 2023): `f = α·φ(semantic) + (1-α)·φ(lexical)` with
**theoretical** min-max normalization (TM2C2): cosine min = -1, BM25 min = 0, per-query max —
stable across queries. Findings: **CC/TM2C2 beats RRF on nDCG and recall** in- and out-of-domain
(RRF ~3.86% lower nDCG@10 in one replication); Weaviate switched its default from rankedFusion to
relativeScoreFusion (min-max CC) for ~6% recall on FIQA. CC is sample-efficient (α tunes from a
small labeled set) but requires calibratable scores.
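For comparison with RRF, here is a minimal sketch of convex combination with theoretical min-max normalization as described above. The function names and the toy scores are illustrative, and absent documents are assigned the theoretical minimum, which is one reasonable reading of the recipe rather than a quoted detail.

```python
def tm2c2(score: float, theoretical_min: float, query_max: float) -> float:
    """Normalize with a fixed theoretical minimum and the per-query maximum."""
    if query_max <= theoretical_min:
        return 0.0
    return (score - theoretical_min) / (query_max - theoretical_min)


def convex_fuse(dense: dict, lexical: dict, alpha: float = 0.5):
    """f = alpha * phi(dense) + (1 - alpha) * phi(lexical), sorted best-first."""
    cos_min, bm25_min = -1.0, 0.0
    dense_max = max(dense.values(), default=cos_min)
    lex_max = max(lexical.values(), default=bm25_min)
    fused = {}
    for doc in dense.keys() | lexical.keys():
        # A doc missing from one list gets that list's theoretical minimum.
        d = tm2c2(dense.get(doc, cos_min), cos_min, dense_max)
        l = tm2c2(lexical.get(doc, bm25_min), bm25_min, lex_max)
        fused[doc] = alpha * d + (1 - alpha) * l
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


dense_scores = {"a": 0.8, "b": 0.6}      # cosine similarities
lexical_scores = {"b": 12.0, "c": 6.0}   # BM25 scores (higher is better here)
cc_ranking = convex_fuse(dense_scores, lexical_scores, alpha=0.5)
```

Unlike RRF, the result depends on score magnitudes: `b` wins not only because both lists contain it but because its cosine is close to the per-query maximum.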
### Folding in graph hits
Build a graph candidate list, then feed it to the fuser as just another ranked list:
**seed** (match the query to concept nodes via the same FTS + dense over node labels) →
**traverse** (1–2 hops to reachable memories) → **score** each reachable memory. Three documented
scorings: hop-decay `Σ_paths β^hops` (β≈0.5–0.7); Personalized PageRank seeded on matched nodes
(HippoRAG); or node-degree priority (GraphRAG local search). The
*Calibrated-Fusion-for-Graph-Vector* paper (arXiv 2603.28886) is explicit: naive graph+vector
fusion fails on **scale incompatibility**, so convert graph traversal into a probability-like
normalized score before fusing. Crucial consequence: the graph list is **sparse** (often a
handful of memories, sometimes zero). Under RRF that's handled automatically; under CC you must
explicitly treat "absent" as the theoretical min or the missing-modality term silently biases the
sum.
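The hop-decay option is the simplest of the three scorings to sketch. The function below is hypothetical and counts walks level by level (for two hops on a graph without back-edges this equals the path count); the toy graph and `beta = 0.6` are illustrative choices within the quoted range.

```python
from collections import defaultdict


def hop_decay_scores(adj, seeds, beta=0.6, max_hops=2):
    """Score each reachable non-seed node by sum over walks of beta ** hops."""
    scores = defaultdict(float)
    frontier = {s: 1 for s in seeds}  # node -> number of walks ending there
    for hop in range(1, max_hops + 1):
        nxt = defaultdict(int)
        for node, n_walks in frontier.items():
            for nbr in adj.get(node, ()):
                nxt[nbr] += n_walks
        for node, n_walks in nxt.items():
            if node not in seeds:
                scores[node] += n_walks * beta ** hop
        frontier = nxt
    return dict(scores)


def to_ranked_list(scores):
    """Turn scores into the ranked id list a rank-based fuser expects."""
    return [n for n, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]


# Concept c1 mentions m1 and m2; both of those link on to m3.
adj = {"c1": ["m1", "m2"], "m1": ["m3"], "m2": ["m3"]}
graph_scores = hop_decay_scores(adj, seeds={"c1"})
graph_leg = to_ranked_list(graph_scores)
```

Note that `m3` outranks the one-hop neighbours because two independent routes reach it; whether that behaviour is desirable is exactly the kind of choice the benchmark would need to settle.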
### Cross-encoder re-rank — a separate stage-2, not a fusion function
Retrieve top-N each → fuse → take fused top ~20–30 → score each (query, memory) pair jointly with
a cross-encoder (e.g. bge-reranker-v2-m3) → re-sort. Reported lift +5 to +15 nDCG@10 on
BEIR/MTEB; cost scales with pair count so it is only ever a small-candidate-set stage.
### Recommendation
**Weighted RRF over three lists (FTS, dense, graph), k=60, equal weights to start**, with
importance applied as a deterministic post-fusion prior and a cross-encoder as an optional,
benchmark-gated stage-2. RRF is the right *default* because we fuse three incompatible scales,
one of them sparse/often-empty; it is near-parameter-free; and it collapses to exactly today's
lexical ordering when dense/graph are empty (the SQLite graceful-degrade path). **But because
adoption is quality-gated, the benchmark must also run CC/TM2C2 as a challenger** — the
literature is consistent that CC edges RRF on quality when scores are calibratable. (See the
[benchmark report](benchmark-report.md) for which won on *our* eval set.)
---
## 8. Concept-graph construction from memories
Turning flat Memory rows into nodes (concepts/entities) + typed directed edges + memory→concept
"mentions" edges. Three extraction families:
- **(A) Open LLM triple extraction** (schema-free) — prompt an LLM to emit `[subject, relation,
object]` triples. High recall, but relation labels proliferate ("prefers"/"likes"/"favors"), so
it **requires** downstream canonicalization. GraphRAG is the canonical implementation
(extract + gleaning + cross-chunk entity summarization).
- **(B) Schema-guided** — constrain to a fixed ontology. Cleaner, but a fixed schema misses
surprises in a heterogeneous personal corpus. **EDC** (Zhang & Soh, EMNLP 2024) bridges the two:
*extract* open triples → *define* (LLM writes a one-sentence definition per distinct relation) →
*canonicalize* (embed definitions, retrieve nearest existing relations, LLM verifies map-vs-add).
Two modes: target-alignment (fixed schema) and self-canonicalization (grow schema dynamically).
- **(C) Entity resolution / canonicalization** (the dedup problem — "Svelte"/"SvelteKit"/"svelte
framework" are one node): cluster-then-refine on the *aggregated* graph — embed every surface
string, cluster by cosine (HDBSCAN / connected-components over a threshold), optional
LLM-as-judge per cluster. KGGEN (arXiv 2502.09956) does iterative LLM-guided clustering;
Graphiti uses MinHash+LSH fast-path with LLM fallback. **Cost scales with distinct entities (low
thousands), not with memory count.**
- **(D) Lightweight non-LLM** — spaCy NER + noun-chunks + co-occurrence edges, or **ReLiK**
(Sapienza, ACL 2024) for *typed* relations on CPU at up to 40× LLM speed, zero per-doc LLM cost.
The natural ablation baseline and sqlite-only fallback.
**The tractable recipe for our corpus.** Measured: 5,452 non-sensitive memories ≈ 683K content
tokens total — *tiny*. At ~125 content-tokens/memory, ~570 memories pack into one 100K-token
request, so the **entire corpus extracts in ~10–25 batched LLM calls, not 5,452 sequential
calls**. Pipeline: (1) batch-extract open triples (each memory tagged with its `memory_id` so
triples map back), parallelized — LangExtract / KGGEN style; (2) aggregate + canonicalize globally
*once* (embed distinct entities, cluster, LLM-judge only ambiguous clusters — tens of calls);
(3) optionally one batched LLM "define relations" pass for EDC-style relation canonicalization.
Total budget: low tens of calls, minutes of wall-clock, a few dollars hosted or one GPU-node
llama-cpp session. **Canonicalization quality (the similarity threshold / cluster granularity) is
where this lives or dies and must be tuned against held-out data, not eyeballed.** Write-time /
Graphiti-style per-memory extraction is for *incremental updates only* — the wrong tool for the
one-shot backfill.
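The batch count can be sanity-checked with a greedy packer. This sketch assumes every memory costs the stated average of ~125 content tokens and treats the "~570 memories per request" figure as the effective content budget per call (570 × 125 tokens), with the rest of the 100K window presumed to go to prompt and output; that interpretation, and the function `pack_batches`, are assumptions made here.

```python
def pack_batches(token_counts, budget):
    """Greedily pack (memory_id, tokens) pairs, in order, into requests under `budget`."""
    batches, current, used = [], [], 0
    for memory_id, tokens in token_counts:
        # Start a new request when the next memory would overflow this one.
        if current and used + tokens > budget:
            batches.append(current)
            current, used = [], 0
        current.append(memory_id)
        used += tokens
    if current:
        batches.append(current)
    return batches


# Synthetic stand-in for the corpus: 5,452 memories at the average size.
corpus_tokens = [(memory_id, 125) for memory_id in range(5452)]
batches = pack_batches(corpus_tokens, budget=570 * 125)
```

Under these assumptions the corpus fits in 10 requests, the lower end of the range quoted above; real memories vary in length, so the true count would land somewhat higher.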
---
## 9. Vector storage in Postgres (production substrate)
`pgvector` is a **proven capability on our exact CNPG cluster** (Immich already does vector search
there, and claude-memory-mcp is already a tenant of the shared `pg-cluster-rw.dbaas` behind
PgBouncer) — zero new infrastructure, reuse-before-building satisfied.
- **HNSW** (recommended default): `USING hnsw (embedding halfvec_cosine_ops) WITH (m=16,
ef_construction=64)`; query knob `SET hnsw.ef_search` (default 40). Best speed-recall tradeoff;
buildable on an empty table; graph in RAM. **IVFFlat** is rejected (must be built *after* data
exists — an empty-table footgun — and has a lower recall ceiling).
- **halfvec** (fp16, 2 bytes/dim) halves index size at ~no recall loss; indexable ≤4000 dims.
768-d halfvec = 1536 bytes/row and the 1024-d column chosen here = 2048 bytes/row; at our scale total embedding storage is on the order of 10 MB.
- **Filtered ANN:** we always filter `deleted_at IS NULL` (often `category`). Post-filtering can
under-fill top-k; enable `hnsw.iterative_scan='relaxed_order'`, and **always add a tie-breaker**
(`, id`) since approximate indexes give non-deterministic order.
- **Hybrid in one query:** each retriever is a CTE producing a per-ranker rank; fuse with RRF via
FULL OUTER JOIN on memory id — no score calibration needed across the incomparable ts_rank and
cosine scales.
- **pgvectorscale / StreamingDiskANN** (bounded-RAM disk graph, SBQ compression) is **deferred**:
Rust/pgrx must be compiled into the operand image, and it only earns its keep above ~15M
vectors. Our corpus is orders of magnitude below that.
- **PgBouncer gotcha:** per-query GUCs (`hnsw.ef_search`) must be `SET LOCAL` inside the recall
transaction, not session-level, under transaction pooling.

**Not for the prototype** (the prototype uses an in-process numpy index); this is the production
adoption path *if* the benchmark clears the gate — an additive Alembic migration (one nullable
`halfvec(1024)` column + HNSW index) plus a Terraform change to the CNPG stack.
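
The rank-based fusion described in the hybrid-query bullet can be sketched in-process, which is closer to how a numpy-backed prototype would do it than the SQL form. This is an illustrative sketch under stated assumptions: `weighted_rrf` is a hypothetical name, `k=60` is the conventional RRF constant, and the weights default to 1.0 because the source does not give the tuned values.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights=None, k=60):
    """Fuse several ranked id lists with weighted Reciprocal Rank Fusion.

    rankings: dict mapping retriever name -> list of memory ids, best first.
    weights:  optional dict mapping retriever name -> weight (default 1.0).
    An id missing from a retriever simply contributes nothing for it, which
    mirrors a FULL OUTER JOIN with a zero default.
    """
    weights = weights or {}
    scores = defaultdict(float)
    for name, ranked_ids in rankings.items():
        w = weights.get(name, 1.0)
        for rank, mem_id in enumerate(ranked_ids, start=1):
            scores[mem_id] += w / (k + rank)
    # Sort by fused score, breaking ties on id so the order is deterministic.
    return sorted(scores, key=lambda m: (-scores[m], m))


fts_hits = [11, 42, 7]      # lexical ranker output (synthetic ids)
dense_hits = [42, 99, 11]   # dense ranker output (synthetic ids)
fused = weighted_rrf({"fts": fts_hits, "dense": dense_hits})
print(fused)
```

Because only ranks enter the formula, the incomparable `ts_rank` and cosine scales never need calibrating; the tie-break on id plays the same role as the `, id` tie-breaker recommended above.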

---

## 10. Evaluation methodology

A retrieval test collection = corpus + query set + **qrels** (relevance judgments). For each
query, call recall, take the ordered list of returned memory ids, score against qrels — measuring
the *retriever in isolation*, exactly what ADR-0001's gate needs.

**Metrics (compute all; pick one primary):**
- **Recall@k** — "did we surface the right memory at all?" *The* hot-path metric (auto-recall
injects top-N; if the memory isn't in top-k it can't help). Report @5/@10/@20/@30.
- **nDCG@k** — graded + position-aware; the best single summary (BEIR standard is nDCG@10).
Headline quality number for the gate.
- **MRR** — only the first hit matters; relevant for the exact-lookup stratum.
- **MAP** — broad binary recall+precision blend; secondary, stable for significance tests.
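
For concreteness, three of the metrics above can be computed as follows. This is a minimal sketch, not the benchmark harness: the function names are hypothetical, and the nDCG variant uses the exponential gain `2^grade - 1`, which is an assumption (a linear-gain variant is equally common and the source does not say which one was used).

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant ids that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for mem_id in ranked[:k] if mem_id in relevant)
    return hits / len(relevant)


def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant id, or 0.0 if none is returned."""
    for rank, mem_id in enumerate(ranked, start=1):
        if mem_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, grades, k):
    """nDCG@k with graded relevance; grades maps id -> grade (unjudged = 0)."""
    def dcg(grade_list):
        return sum((2 ** g - 1) / math.log2(pos + 1)
                   for pos, g in enumerate(grade_list, start=1))

    actual = dcg([grades.get(mem_id, 0) for mem_id in ranked[:k]])
    ideal = dcg(sorted(grades.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


# Synthetic example: ids 3, 8 and 1 are relevant with grades 3, 2 and 1.
grades = {3: 3, 8: 2, 1: 1}
ranked = [5, 3, 9, 1]
print(recall_at_k(ranked, set(grades), 3), mrr(ranked, set(grades)), ndcg_at_k(ranked, grades, 4))
```

Queries with an empty relevant set score 0.0 here only as a guard; as noted under the pitfalls, such queries belong outside the ranked metrics altogether.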

**Stratification (the ADR-0001 hypothesis-targeted design):** *exact/lexical* (FTS already wins;
this is the **regression guard**); *paraphrase/semantic* (disjoint vocabulary: the
value-of-embeddings test); *multi-hop* (≥2 memories or a concept link: the graph test).

**Qrels generation (the LongMemEval pipeline, inverted for memories):** sample seed memories
stratified by category + importance → an LLM generates exact / paraphrase / multi-hop queries →
label relevant ids, with **pooling** (union the top-k of every arm, TREC-style) and an LLM-judge
on the **UMBRELA 0-3 scale**. **Separate the generator model from the judge model** to avoid
self-preference leakage. **Hand-verify** a ≥15-20% sample (oversample multi-hop) and require
Cohen's κ(LLM, human) ≥ ~0.6 before trusting auto-labels; always hand-author multi-hop
relevant-id sets.
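
The κ gate can be checked with a few lines of code. The sketch below is an illustration only: `cohens_kappa` is a hypothetical helper, the labels are synthetic, and it computes plain (unweighted) Cohen's κ over binary relevant/not-relevant labels, whereas a graded 0-3 scale might call for a weighted variant.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items both raters labelled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Synthetic sample of ten judged (query, memory) pairs.
llm_labels = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
human_labels = [1, 0, 0, 1, 0, 0, 1, 1, 1, 1]
kappa = cohens_kappa(llm_labels, human_labels)
print(round(kappa, 3))
```

In this synthetic sample the raters agree on 8 of 10 items, yet κ comes out just under 0.6 because half of that agreement is expected by chance, so the auto-labels would not pass the gate.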
**Pitfalls with standard mitigations (all baked into the protocol):** LLM judges are
systematically *lenient* (κ gate); "holes" (new arms retrieve unjudged docs — must pool *all*
arms before judging, else the gate is rigged against semantic/hybrid); generator-as-judge leakage
(model separation); too-easy self-generated queries (check paraphrase shares no content tokens);
adversarial/unanswerable queries have no relevant id and **must be kept out of the ranked metrics**
(mixing them corrupted the disputed Zep-vs-mem0 LOCOMO comparison — 84%→58%).

**Sizing:** Voorhees & Buckley (2002): ≥25 topics is the floor, 50 yield reliable rankings, and a
~5-6% absolute gap at n=50 is needed for 95% confidence that the ordering holds on a different
query set. Since the gate is *per stratum*, each stratum wants its own ~50 queries.
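
With strata this small, a per-stratum delta is worth checking with a paired test before it is trusted. The sketch below uses a sign-flip randomization test on per-query metric differences; it is one reasonable choice, not necessarily the test the benchmark used (the source does not say), and `paired_randomization_test` is a hypothetical name.

```python
import random


def paired_randomization_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip randomization test on paired per-query scores.

    scores_a / scores_b: metric values for the same queries under two arms.
    Returns an estimated p-value for the null hypothesis of no difference.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs)) / len(diffs)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null the sign of each paired difference is arbitrary.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) / len(diffs) >= observed:
            extreme += 1
    # Add-one smoothing so the estimate is never exactly zero.
    return (extreme + 1) / (n_resamples + 1)


# Synthetic per-query recall@10 for two arms over 50 queries.
baseline = [0.0] * 50
candidate = [1.0] * 50
p_value = paired_randomization_test(baseline, candidate)
print(p_value)
```

The test makes no distributional assumption about the metric, which suits bounded, lumpy scores such as recall@k on a few dozen queries.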
> **Honest note on what we actually built:** our eval set is **119 queries (40 exact / 40
> paraphrase / 39 multihop)** — just below the ~50/stratum ideal, and qrels were LLM-generated
> with lighter hand-verification than the full protocol prescribes. This is a real limitation,
> tracked in the [benchmark report](benchmark-report.md).

---

## 11. Synthesis — what we borrow, from whom

| Source | Borrowed mechanism | Re-implemented as | Adopted? |
|---|---|---|---|
| LightRAG | Incremental set-union graph merge; dual-level retrieve | Native node/edge tables, no AGE; FTS+dense+graph fuse | Idea only |
| LazyGraphRAG | Defer LLM cost; index-time work ∝ new content | Store-time extraction off hot path | Principle |
| Zep / Graphiti | Episodic/semantic split; 3-signal RRF read path; bi-temporal invalidation | Memory rows + Postgres node/edge tables; pgvector+FTS+CTE | **Blueprint** |
| Mem0 | ADD/UPDATE/DELETE/NOOP write-side curation | Flagged async curator over existing `update_memory` | Complementary, flagged |
| HippoRAG 2 | PPR spreading activation for multi-hop | LLM-free FTS+vector-seeded PPR over memory-node graph (phase 2, gated) | Idea only |
| Bruch et al. / Cormack | RRF default + CC/TM2C2 challenger | Weighted RRF k=60, post-fusion importance prior | **Direct** |
| EDC / KGGEN | Open-extract → define → canonicalize globally | Batched extraction + embedding-cluster canonicalization | **Direct** |
| pgvector / Supabase | HNSW + halfvec + RRF hybrid in one SQL query | Additive migration to CNPG (production only) | **Production design** |
| LongMemEval / UMBRELA / Voorhees | Stratified LLM-qrels + pooling + κ gate | Our exact/paraphrase/multi-hop eval | **Direct** |

The through-line: **a memory-node concept graph, dense pgvector embeddings, and the existing
lexical FTS, fused with weighted RRF, with all LLM work pushed to store time** — sized for an
append-heavy personal store and gated on a benchmark that beats FTS.

---

## Sources

**GraphRAG family**
- Edge et al., "From Local to Global: A Graph RAG Approach…" (Microsoft, 2024) — arXiv 2404.16130
- Microsoft GraphRAG incremental-indexing design — github.com/microsoft/graphrag/issues/741; GraphRAG 1.0 blog (microsoft.com/en-us/research/blog/moving-to-graphrag-1-0…)
- Guo et al., "LightRAG: Simple and Fast RAG" (HKUDS, EMNLP 2025) — arXiv 2410.05779; github.com/HKUDS/LightRAG (incl. issue #2122, PG+AGE concurrency)
- nano-graphrag — github.com/gusye1234/nano-graphrag
- LazyGraphRAG — microsoft.com/en-us/research/blog/lazygraphrag-setting-a-new-standard-for-quality-and-cost/
**Temporal KG memory**
- Rasmussen et al., "Zep: A Temporal Knowledge Graph Architecture for Agent Memory" (2025) — arXiv 2501.13956
- Graphiti — github.com/getzep/graphiti; Neo4j writeup (neo4j.com/blog/developer/graphiti-knowledge-graph-memory/)
**Extraction-based memory**
- "Mem0: Building Production-Ready AI Agents…" (2025) — arXiv 2504.19413; github.com/mem0ai/mem0 (configs/prompts.py)
**Graph-PPR retrieval**
- Gutiérrez et al., "HippoRAG" (NeurIPS 2024) — arXiv 2405.14831
- "From RAG to Memory" = HippoRAG 2 (ICML 2025) — arXiv 2502.14802; github.com/OSU-NLP-Group/HippoRAG
**Embeddings**
- bge-large-en-v1.5 — huggingface.co/BAAI/bge-large-en-v1.5; gte-Qwen2-1.5B — huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct; nomic — huggingface.co/nomic-ai/nomic-embed-text-v1.5; bge-m3 — huggingface.co/BAAI/bge-m3; e5-large-v2 — huggingface.co/intfloat/e5-large-v2
- Voyage-3/3.5 — blog.voyageai.com/2024/09/18/voyage-3/; docs.voyageai.com/docs/pricing
- OpenAI text-embedding-3 — developers.openai.com/api/docs/guides/embeddings; Cohere embed v3 — docs.cohere.com/docs/cohere-embed
**Fusion**
- Cormack, Clarke, Büttcher (SIGIR'09) — cormack.uwaterloo.ca/cormacksigir09-rrf.pdf
- Bruch et al., "An Analysis of Fusion Functions for Hybrid Retrieval" (TOIS 2023) — arXiv 2210.11934
- Elastic weighted RRF; Weaviate hybrid-search fusion algorithms; "Calibrated Fusion for Heterogeneous Graph-Vector Retrieval" — arXiv 2603.28886
- bge-reranker — huggingface.co/BAAI/bge-reranker-base
**Concept-graph construction**
- Zhang & Soh, "Extract-Define-Canonicalize" (EDC, EMNLP 2024) — arXiv 2404.03868; github.com/clear-nus/edc
- KGGEN — arXiv 2502.09956; ReLiK (ACL 2024) — arXiv 2408.00103; LightKGG — arXiv 2510.23341; Google LangExtract — github.com/google/langextract
**Postgres vector storage**
- pgvector — github.com/pgvector/pgvector; pgvectorscale — github.com/timescale/pgvectorscale; CNPG image-volume extensions — cloudnative-pg.io/docs/devel/imagevolume_extensions/; Supabase hybrid search — supabase.com/docs/guides/ai/hybrid-search
- This cluster: `infra/docs/architecture/databases.md` (claude-memory-mcp is a CNPG tenant); this repo: `migrations/versions/001_initial_schema.py`, `src/claude_memory/api/app.py`
**Evaluation**
- LoCoMo — arXiv 2402.17753; LongMemEval — arXiv 2410.10813; UMBRELA — arXiv 2406.06519; "Judging the Judges" / LLMJudge — arXiv 2502.13908; Voorhees & Buckley "Topic Set Size" (2002); Buckley & Voorhees "Bias and the Limits of Pooling"; BEIR — arXiv 2104.08663


@ -0,0 +1,116 @@
"""Single, process-wide serialized SQLite writer for the local memory cache.

SQLite permits only one writer at a time. The MCP server's store path and the
background sync engine used to open *separate* connections to the *same* file;
under heavy concurrent ``memory_store`` calls those two writers fought over the
single SQLite write lock, blew past ``busy_timeout``, and surfaced
``sqlite3.OperationalError: database is locked`` which made the tool slow and
eventually dropped the session.

``LocalStore`` fixes this structurally: it owns ONE connection (opened with
``check_same_thread=False``) guarded by ONE re-entrant lock. Every component that
needs to touch the local DB shares the same ``LocalStore`` instance, so all
writes serialize cleanly through the in-process lock and queue instead of racing
the SQLite writer. On the rare residual lock (e.g. another OS process touching
the file), writes retry with bounded exponential backoff rather than failing the
caller. WAL stays on for concurrent reads.

Uses only stdlib; no pip install required.
"""

import logging
import sqlite3
import threading
import time
from pathlib import Path
from typing import Any, Callable, TypeVar

logger = logging.getLogger(__name__)

T = TypeVar("T")

# Bounded retry window for the rare residual "database is locked" — handles a
# lock held by a *different OS process* (the in-process lock already serializes
# this process's own writers). Total worst-case wait ≈ 0.05+0.1+0.2+0.4+0.8 ≈ 1.55s.
_MAX_RETRIES = 5
_BASE_BACKOFF_S = 0.05
_BUSY_TIMEOUT_MS = 30000


def _is_locked_error(exc: sqlite3.OperationalError) -> bool:
    msg = str(exc).lower()
    return "database is locked" in msg or "database is busy" in msg


class LocalStore:
    """Owns the single shared SQLite connection + lock for local memory writes."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        # Re-entrant so a transaction callback may itself call ``execute``/``write``
        # without dead-locking on the same thread.
        self.lock = threading.RLock()

    # ── construction ────────────────────────────────────────────────

    @classmethod
    def open(cls, db_path: str) -> "LocalStore":
        """Open (creating parent dirs) a WAL connection safe for cross-thread use."""
        Path(db_path).parent.mkdir(parents=True, exist_ok=True)
        conn = sqlite3.connect(db_path, timeout=30.0, check_same_thread=False)
        conn.row_factory = sqlite3.Row
        conn.execute("PRAGMA journal_mode=WAL")
        conn.execute(f"PRAGMA busy_timeout={_BUSY_TIMEOUT_MS}")
        return cls(conn)

    # ── serialized access ───────────────────────────────────────────

    def transaction(self, fn: Callable[[sqlite3.Connection], T]) -> T:
        """Run ``fn(conn)`` holding the shared lock, with bounded retry on lock errors.

        ``fn`` is responsible for issuing its own ``COMMIT`` (call ``conn.commit()``)
        when it performs writes. The whole callback runs under the process-wide lock,
        so concurrent callers queue rather than collide on the SQLite writer.
        """
        last_exc: sqlite3.OperationalError | None = None
        for attempt in range(_MAX_RETRIES):
            with self.lock:
                try:
                    return fn(self.conn)
                except sqlite3.OperationalError as exc:
                    if not _is_locked_error(exc):
                        raise
                    last_exc = exc
                    # Roll back any partial txn so the retry starts clean and the
                    # connection isn't left mid-transaction holding locks.
                    try:
                        self.conn.rollback()
                    except sqlite3.Error:
                        pass
            # Back off *outside* the lock so other writers can make progress.
            backoff = _BASE_BACKOFF_S * (2 ** attempt)
            logger.warning(
                "SQLite locked (attempt %d/%d) — backing off %.3fs", attempt + 1, _MAX_RETRIES, backoff
            )
            time.sleep(backoff)
        assert last_exc is not None
        raise last_exc

    def execute(self, sql: str, params: tuple[Any, ...] = ()) -> sqlite3.Cursor:
        """Run a single read query under the shared lock (no implicit commit)."""
        with self.lock:
            return self.conn.execute(sql, params)

    def write(self, sql: str, params: tuple[Any, ...] = ()) -> sqlite3.Cursor:
        """Run a single write statement + commit, serialized and retry-guarded."""

        def _do(conn: sqlite3.Connection) -> sqlite3.Cursor:
            cur = conn.execute(sql, params)
            conn.commit()
            return cur

        return self.transaction(_do)

    def close(self) -> None:
        with self.lock:
            self.conn.close()
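
The single-writer pattern in this file can be exercised in isolation. The sketch below is a stripped-down, self-contained stand-in, not the module itself: `SharedWriter` and `run_demo` are hypothetical names, the retry/backoff path is omitted, and the table schema is invented for the demo.

```python
import sqlite3
import tempfile
import threading
from pathlib import Path


class SharedWriter:
    """Hypothetical stand-in for LocalStore: one connection, one re-entrant lock."""

    def __init__(self, db_path):
        # check_same_thread=False lets several threads use the one connection;
        # the lock below is what actually makes that safe.
        self.conn = sqlite3.connect(db_path, timeout=30.0, check_same_thread=False)
        self.conn.execute("PRAGMA journal_mode=WAL")
        self.lock = threading.RLock()

    def write(self, sql, params=()):
        with self.lock:  # every writer queues here instead of racing SQLite
            cur = self.conn.execute(sql, params)
            self.conn.commit()
            return cur

    def close(self):
        with self.lock:
            self.conn.close()


def run_demo(n_threads=8, per_thread=25):
    """Hammer one SharedWriter from several threads; return (row_count, errors)."""
    with tempfile.TemporaryDirectory() as tmp:
        store = SharedWriter(str(Path(tmp) / "demo.db"))
        store.write("CREATE TABLE memories (id INTEGER PRIMARY KEY, content TEXT)")
        errors = []

        def worker(tid):
            for i in range(per_thread):
                try:
                    store.write("INSERT INTO memories (content) VALUES (?)", (f"t{tid}-m{i}",))
                except sqlite3.Error as exc:
                    errors.append(exc)

        threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        with store.lock:
            count = store.conn.execute("SELECT COUNT(*) FROM memories").fetchone()[0]
        store.close()
        return count, errors


count, errors = run_demo()
print(count, errors)
```

Since every write goes through the same lock and the same connection, writers inside the process never compete for SQLite's write lock; contention can then only come from another OS process, which is the case the bounded retry in `transaction` is there for.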


@ -17,7 +17,10 @@ import sqlite3
import sys
import urllib.error
import urllib.request
from typing import Any
from typing import TYPE_CHECKING, Any, Callable
if TYPE_CHECKING:
from claude_memory.local_store import LocalStore
logger = logging.getLogger(__name__)
@ -35,9 +38,17 @@ HYBRID_MODE = bool(API_KEY) and not SYNC_DISABLED
HTTP_ONLY = bool(API_KEY) and SYNC_DISABLED
SQLITE_ONLY = not API_KEY
# Default per-request HTTP timeout, and a wider bound for the one genuinely slow path:
# a remote semantic ``memory_recall`` (embedding/search can be slow to warm up). Both are
# explicit upper bounds so a call can never hang the MCP server indefinitely.
DEFAULT_API_TIMEOUT = 15
RECALL_TIMEOUT = int(os.environ.get("MEMORY_RECALL_TIMEOUT", "30"))
def _api_request(method: str, path: str, body: dict[str, Any] | None = None) -> dict[str, Any]:
"""Make an HTTP request to the memory API."""
def _api_request(
method: str, path: str, body: dict[str, Any] | None = None, timeout: int = DEFAULT_API_TIMEOUT
) -> dict[str, Any]:
"""Make an HTTP request to the memory API (bounded by ``timeout`` seconds)."""
url = f"{API_BASE_URL}{path}"
data = json.dumps(body).encode() if body else None
req = urllib.request.Request(
@ -50,7 +61,7 @@ def _api_request(method: str, path: str, body: dict[str, Any] | None = None) ->
},
)
try:
with urllib.request.urlopen(req, timeout=15) as resp:
with urllib.request.urlopen(req, timeout=timeout) as resp:
result: dict[str, Any] = json.loads(resp.read().decode())
return result
except urllib.error.HTTPError as e:
@ -128,7 +139,9 @@ def _init_sqlite(db_path: str | None = None) -> tuple[sqlite3.Connection, str]:
db_path = _get_db_path(db_path)
Path(os.path.dirname(db_path)).mkdir(parents=True, exist_ok=True)
conn = sqlite3.connect(db_path, timeout=30.0)
# check_same_thread=False: the MCP server may handle requests on different
# threads and shares this one connection via LocalStore's lock (see local_store.py).
conn = sqlite3.connect(db_path, timeout=30.0, check_same_thread=False)
conn.row_factory = sqlite3.Row
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA busy_timeout=30000")
@ -390,19 +403,30 @@ class MemoryServer:
def __init__(self, sqlite_db_path: str | None = None) -> None:
self.sqlite_conn: sqlite3.Connection | None = None
self.store: "LocalStore | None" = None # single serialized writer (see local_store.py)
self.sync_engine: Any = None
# Sink for MCP notifications (e.g. progress). Defaults to writing a JSON-RPC
# notification line to stdout; overridable in tests.
self._notify: Callable[[str, dict[str, Any]], None] = self._emit_notification
if SQLITE_ONLY or HYBRID_MODE:
self.sqlite_conn, resolved_path = _init_sqlite(sqlite_db_path)
conn, resolved_path = _init_sqlite(sqlite_db_path)
from claude_memory.local_store import LocalStore
self.store = LocalStore(conn)
self.sqlite_conn = conn
if HYBRID_MODE:
from claude_memory.sync import SyncEngine
sync_interval = int(os.environ.get("MEMORY_SYNC_INTERVAL", "60"))
# Share the SAME LocalStore (one connection, one lock) so the background
# sync writer never opens a second connection to the file and never races
# the store path on the single SQLite writer.
self.sync_engine = SyncEngine(
db_path=resolved_path,
api_base_url=API_BASE_URL,
api_key=API_KEY,
sync_interval=sync_interval,
store=self.store,
)
self.sync_engine.start()
@ -455,31 +479,59 @@ class MemoryServer:
limit = args.get("limit", 10)
if HTTP_ONLY:
result = _api_request("POST", "/api/memories/recall", {
"context": context,
"expanded_query": expanded_query,
"category": category,
"sort_by": sort_by,
"limit": limit,
})
rows = result.get("memories", [])
if not rows:
filter_desc = f" in category '{category}'" if category else ""
return f"No memories found matching: {context}{filter_desc}"
sort_desc = "by relevance" if sort_by == "relevance" else "by importance"
filter_desc = f" in '{category}'" if category else ""
results = []
for row in rows:
results.append(
f"#{row['id']} [{row['category']}] (importance: {row['importance']:.1f}) {row['content']}"
f"\n Tags: {row.get('tags') or 'none'} | Stored: {row['created_at']}"
)
return f"Found {len(rows)} memories{filter_desc} ({sort_desc}):\n\n" + "\n\n".join(results)
return self._recall_remote(args)
# SQLite-only or Hybrid: always read from local cache
return self._sqlite_recall(context, expanded_query, category, sort_by, limit)
def _recall_remote(self, args: dict[str, Any]) -> str:
"""Remote semantic recall — the one genuinely slow path.
Bounded by ``RECALL_TIMEOUT`` so it can never hang the MCP server. On a timeout
or unreachable backend it returns a clear "working / not done — retry" signal
rather than raising or blocking silently.
"""
context = args.get("context")
expanded_query = args.get("expanded_query", "")
category = args.get("category")
sort_by = args.get("sort_by", "importance")
limit = args.get("limit", 10)
try:
result = _api_request(
"POST", "/api/memories/recall",
{
"context": context,
"expanded_query": expanded_query,
"category": category,
"sort_by": sort_by,
"limit": limit,
},
timeout=RECALL_TIMEOUT,
)
except RuntimeError as e:
# _api_request wraps connection errors / timeouts as RuntimeError. Surface a
# clear, actionable signal instead of hanging or dumping a stack trace.
return (
f"Memory recall did not complete within {RECALL_TIMEOUT}s — the semantic "
f"search backend is slow or unreachable ({e}). Please try again."
)
rows = result.get("memories", [])
if not rows:
filter_desc = f" in category '{category}'" if category else ""
return f"No memories found matching: {context}{filter_desc}"
sort_desc = "by relevance" if sort_by == "relevance" else "by importance"
filter_desc = f" in '{category}'" if category else ""
results = []
for row in rows:
results.append(
f"#{row['id']} [{row['category']}] (importance: {row['importance']:.1f}) {row['content']}"
f"\n Tags: {row.get('tags') or 'none'} | Stored: {row['created_at']}"
)
return f"Found {len(rows)} memories{filter_desc} ({sort_desc}):\n\n" + "\n\n".join(results)
def memory_list(self, args: dict[str, Any]) -> str:
category = args.get("category")
limit = args.get("limit", 20)
@ -519,10 +571,11 @@ class MemoryServer:
# SQLite-only or Hybrid: delete from local SQLite
# In hybrid mode, also try to sync delete to server
server_id: int | None = None
if HYBRID_MODE and self.sync_engine and self.sqlite_conn:
cursor = self.sqlite_conn.cursor()
cursor.execute("SELECT server_id FROM memories WHERE id = ?", (memory_id,))
row = cursor.fetchone()
if HYBRID_MODE and self.sync_engine and self.store:
with self.store.lock:
cursor = self.store.conn.cursor()
cursor.execute("SELECT server_id FROM memories WHERE id = ?", (memory_id,))
row = cursor.fetchone()
server_id = row["server_id"] if row and row["server_id"] else None
result_text = self._sqlite_delete(memory_id)
@ -563,12 +616,13 @@ class MemoryServer:
lines.append(f"Last sync success: {counts['last_sync_success']}")
return "\n".join(lines)
if self.sqlite_conn:
cursor = self.sqlite_conn.cursor()
cursor.execute("SELECT COUNT(*) as c FROM memories")
total = cursor.fetchone()["c"]
cursor.execute("SELECT category, COUNT(*) as c FROM memories GROUP BY category ORDER BY c DESC")
by_cat = cursor.fetchall()
if self.store:
with self.store.lock:
cursor = self.store.conn.cursor()
cursor.execute("SELECT COUNT(*) as c FROM memories")
total = cursor.fetchone()["c"]
cursor.execute("SELECT category, COUNT(*) as c FROM memories GROUP BY category ORDER BY c DESC")
by_cat = cursor.fetchall()
lines = [f"Local memories (SQLite-only): {total}"]
for row in by_cat:
lines.append(f" {row['category']}: {row['c']}")
@ -682,19 +736,25 @@ class MemoryServer:
def _sqlite_store(self, content: str, category: str, tags: str, importance: float, expanded_keywords: str, force_sensitive: bool = False) -> str:
from datetime import datetime, timezone
assert self.sqlite_conn is not None
assert self.store is not None
now = datetime.now(timezone.utc).isoformat()
is_sensitive = 1 if force_sensitive else 0
cursor = self.sqlite_conn.cursor()
cursor.execute(
"INSERT INTO memories (content, category, tags, expanded_keywords, importance, is_sensitive, created_at, updated_at) VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
(content, category, tags, expanded_keywords, importance, is_sensitive, now, now),
)
self.sqlite_conn.commit()
return f"Stored memory #{cursor.lastrowid} in category '{category}' with importance {importance:.1f}"
def _insert(conn: sqlite3.Connection) -> int | None:
cursor = conn.execute(
"INSERT INTO memories (content, category, tags, expanded_keywords, importance, is_sensitive, created_at, updated_at) VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
(content, category, tags, expanded_keywords, importance, is_sensitive, now, now),
)
conn.commit()
return cursor.lastrowid
# Serialized + retry-guarded through the shared LocalStore so concurrent
# stores (and the background sync writer) never collide on the SQLite writer.
new_id = self.store.transaction(_insert)
return f"Stored memory #{new_id} in category '{category}' with importance {importance:.1f}"
def _sqlite_recall(self, context: str, expanded_query: str, category: str | None, sort_by: str, limit: int) -> str:
assert self.sqlite_conn is not None
assert self.store is not None
all_terms = f"{context} {expanded_query}".strip()
words = [w.replace(chr(34), "") for w in all_terms.split() if w]
and_query = " AND ".join(f'"{w}"' for w in words)
@ -712,35 +772,38 @@ class MemoryServer:
"SELECT m.id, m.content, m.category, m.tags, m.importance, m.created_at "
"FROM memories m JOIN memories_fts fts ON m.id = fts.rowid "
)
cursor = self.sqlite_conn.cursor()
rows: list[Any] = []
try:
# Try AND first for precise matches, fall back to OR for broader results
cat_filter = "AND m.category = ?" if category else ""
for fts_query in (and_query, or_query):
params = [fts_query, category, limit] if category else [fts_query, limit]
cursor.execute(
f"{base_select}WHERE memories_fts MATCH ? {cat_filter} ORDER BY {order} LIMIT ?",
tuple(p for p in params if p is not None),
)
# Hold the shared lock for the whole read — the connection is shared across
# threads with the sync writer, so reads must serialize too.
with self.store.lock:
cursor = self.store.conn.cursor()
try:
# Try AND first for precise matches, fall back to OR for broader results
cat_filter = "AND m.category = ?" if category else ""
for fts_query in (and_query, or_query):
params = [fts_query, category, limit] if category else [fts_query, limit]
cursor.execute(
f"{base_select}WHERE memories_fts MATCH ? {cat_filter} ORDER BY {order} LIMIT ?",
tuple(p for p in params if p is not None),
)
rows = cursor.fetchall()
if rows:
break
except sqlite3.OperationalError:
like = f"%{context}%"
if category:
cursor.execute(
"SELECT id, content, category, tags, importance, created_at FROM memories "
"WHERE (content LIKE ? OR tags LIKE ?) AND category = ? ORDER BY importance DESC LIMIT ?",
(like, like, category, limit),
)
else:
cursor.execute(
"SELECT id, content, category, tags, importance, created_at FROM memories "
"WHERE content LIKE ? OR tags LIKE ? ORDER BY importance DESC LIMIT ?",
(like, like, limit),
)
rows = cursor.fetchall()
if rows:
break
except sqlite3.OperationalError:
like = f"%{context}%"
if category:
cursor.execute(
"SELECT id, content, category, tags, importance, created_at FROM memories "
"WHERE (content LIKE ? OR tags LIKE ?) AND category = ? ORDER BY importance DESC LIMIT ?",
(like, like, category, limit),
)
else:
cursor.execute(
"SELECT id, content, category, tags, importance, created_at FROM memories "
"WHERE content LIKE ? OR tags LIKE ? ORDER BY importance DESC LIMIT ?",
(like, like, limit),
)
rows = cursor.fetchall()
if not rows:
return f"No memories found matching: {context}"
@ -757,21 +820,22 @@ class MemoryServer:
)
def _sqlite_list(self, category: str | None, limit: int) -> str:
assert self.sqlite_conn is not None
cursor = self.sqlite_conn.cursor()
if category:
cursor.execute(
"SELECT id, content, category, tags, importance, created_at FROM memories "
"WHERE category = ? ORDER BY created_at DESC LIMIT ?",
(category, limit),
)
else:
cursor.execute(
"SELECT id, content, category, tags, importance, created_at FROM memories "
"ORDER BY created_at DESC LIMIT ?",
(limit,),
)
rows = cursor.fetchall()
assert self.store is not None
with self.store.lock:
cursor = self.store.conn.cursor()
if category:
cursor.execute(
"SELECT id, content, category, tags, importance, created_at FROM memories "
"WHERE category = ? ORDER BY created_at DESC LIMIT ?",
(category, limit),
)
else:
cursor.execute(
"SELECT id, content, category, tags, importance, created_at FROM memories "
"ORDER BY created_at DESC LIMIT ?",
(limit,),
)
rows = cursor.fetchall()
if not rows:
return f"No memories in category '{category}'" if category else "No memories stored yet"
@ -785,25 +849,30 @@ class MemoryServer:
return header + f" ({len(rows)} shown):\n\n" + "\n\n".join(results)
def _sqlite_delete(self, memory_id: int) -> str:
assert self.sqlite_conn is not None
cursor = self.sqlite_conn.cursor()
cursor.execute("SELECT id, content FROM memories WHERE id = ?", (memory_id,))
row = cursor.fetchone()
if not row:
return f"Memory #{memory_id} not found"
preview = row["content"][:50]
cursor.execute("DELETE FROM memories WHERE id = ?", (memory_id,))
self.sqlite_conn.commit()
return f"Deleted memory #{memory_id}: {preview}..."
assert self.store is not None
def _delete(conn: sqlite3.Connection) -> str:
cursor = conn.execute("SELECT id, content FROM memories WHERE id = ?", (memory_id,))
row = cursor.fetchone()
if not row:
return f"Memory #{memory_id} not found"
preview = row["content"][:50]
conn.execute("DELETE FROM memories WHERE id = ?", (memory_id,))
conn.commit()
return f"Deleted memory #{memory_id}: {preview}..."
# SELECT + DELETE + commit as one serialized, retry-guarded unit.
return self.store.transaction(_delete)
def _sqlite_secret_get(self, memory_id: int) -> str:
assert self.sqlite_conn is not None
cursor = self.sqlite_conn.cursor()
cursor.execute(
"SELECT id, content, category, is_sensitive FROM memories WHERE id = ?",
(memory_id,),
)
row = cursor.fetchone()
assert self.store is not None
with self.store.lock:
cursor = self.store.conn.cursor()
cursor.execute(
"SELECT id, content, category, is_sensitive FROM memories WHERE id = ?",
(memory_id,),
)
row = cursor.fetchone()
if not row:
return f"Memory #{memory_id} not found"
if not row["is_sensitive"]:
@ -825,9 +894,25 @@ class MemoryServer:
tools.extend(SHARING_TOOLS)
return {"tools": tools}
# Tools whose work is genuinely slow enough to warrant a progress signal.
_SLOW_TOOLS = frozenset({"memory_recall"})
def handle_tools_call(self, params: dict[str, Any]) -> dict[str, Any]:
tool_name: str = params.get("name", "")
arguments: dict[str, Any] = params.get("arguments", {})
# If the client opted into progress (sent a token) and this is a slow tool, emit a
# single "working" progress notification so the call shows life instead of looking hung.
progress_token = (params.get("_meta") or {}).get("progressToken")
if progress_token is not None and tool_name in self._SLOW_TOOLS:
try:
self._notify(
"notifications/progress",
{"progressToken": progress_token, "progress": 0, "message": f"Running {tool_name}"},
)
except Exception:
pass # never let progress reporting break the actual call
try:
handler = {
"memory_store": self.memory_store,
@ -851,6 +936,10 @@ class MemoryServer:
except Exception as e:
return {"content": [{"type": "text", "text": f"Error: {e!s}"}], "isError": True}
def _emit_notification(self, method: str, params: dict[str, Any]) -> None:
"""Default notification sink: write a JSON-RPC notification line to stdout."""
print(json.dumps({"jsonrpc": "2.0", "method": method, "params": params}), flush=True)
def process_message(self, message: dict[str, Any]) -> dict[str, Any] | None:
method = message.get("method")
params = message.get("params", {})


@ -5,14 +5,14 @@ Uses only stdlib — no pip install required.
import json
import logging
import sqlite3
import threading
import urllib.error
import urllib.parse
import urllib.request
from typing import Any
from datetime import datetime, timezone
from pathlib import Path
from claude_memory.local_store import LocalStore
logger = logging.getLogger(__name__)
@ -26,7 +26,14 @@ FULL_RESYNC_EVERY = 10
class SyncEngine:
"""Background sync between local SQLite cache and remote API."""
def __init__(self, db_path: str, api_base_url: str, api_key: str, sync_interval: int = 60):
def __init__(
self,
db_path: str,
api_base_url: str,
api_key: str,
sync_interval: int = 60,
store: "LocalStore | None" = None,
):
self.db_path = db_path
self.api_base_url = api_base_url.rstrip("/")
self.api_key = api_key
@ -37,13 +44,20 @@ class SyncEngine:
self._last_sync_success = False
self._auth_failed = False
# Own connection for thread safety
Path(db_path).parent.mkdir(parents=True, exist_ok=True)
self._conn = sqlite3.connect(db_path, timeout=30.0, check_same_thread=False)
self._conn.row_factory = sqlite3.Row
self._conn.execute("PRAGMA journal_mode=WAL")
self._conn.execute("PRAGMA busy_timeout=30000")
self._lock = threading.Lock()
# Share the caller's LocalStore (one connection, one lock) when given, so the
# background sync writer never opens a SECOND connection to the same file and
# never races the store path on the single SQLite writer. When run standalone
# (e.g. tests), open our own store. Either way, all SQLite access below goes
# through self._store / self._conn / self._lock, which now point at one shared
# connection guarded by one re-entrant lock.
if store is None:
self._store = LocalStore.open(db_path)
self._owns_store = True
else:
self._store = store
self._owns_store = False
self._conn = self._store.conn
self._lock = self._store.lock
self._init_sync_tables()
@ -121,7 +135,10 @@ class SyncEngine:
self._stop_event.set()
if self._thread and self._thread.is_alive():
self._thread.join(timeout=5)
self._conn.close()
# Only close the connection if we opened it; when sharing the server's
# LocalStore, the server owns the connection lifecycle.
if self._owns_store:
self._store.close()
def _sync_loop(self) -> None:
"""Periodic sync loop running in background thread."""


@ -2,7 +2,12 @@
import json
import os
import sqlite3
import sys
import threading
from datetime import datetime, timezone
from typing import Any
from unittest.mock import patch
import pytest
@ -408,3 +413,244 @@ class TestSchemaMigration:
columns = {row["name"] for row in cursor.fetchall()}
assert "server_id" in columns
srv.sqlite_conn.close()


def _server_rows(server: MemoryServer) -> int:
    assert server.sqlite_conn is not None
    return int(server.sqlite_conn.execute("SELECT COUNT(*) AS c FROM memories").fetchone()["c"])


class TestConcurrentWrites:
    """Regression tests for 'database is locked' under heavy concurrent local writes.

    The store path (MemoryServer) and the background sync writer (SyncEngine) must not
    contend on the single SQLite writer. Before the fix they used two separate connections
    to the same file; under load the second writer blew past busy_timeout and raised
    ``sqlite3.OperationalError: database is locked``. After the fix all local writes are
    serialized through one shared, lock-guarded connection, so a lock error is impossible.
    """

    def test_concurrent_stores_all_rows_land(self, tmp_path):
        """Many threads calling memory_store concurrently → every row lands, no lock error."""
        db_path = str(tmp_path / "concurrent.db")
        server = MemoryServer(sqlite_db_path=db_path)
        try:
            n_threads = 16
            per_thread = 12
            errors: list[BaseException] = []
            barrier = threading.Barrier(n_threads)

            def worker(tid: int) -> None:
                barrier.wait()  # release all threads at once to maximise contention
                for i in range(per_thread):
                    try:
                        server.memory_store({
                            "content": f"thread {tid} memory {i}",
                            "expanded_keywords": f"thread {tid} memory {i} concurrent write",
                        })
                    except BaseException as exc:  # noqa: BLE001 — capture everything for the assert
                        errors.append(exc)

            threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
            for t in threads:
                t.start()
            for t in threads:
                t.join()

            assert errors == [], f"concurrent stores raised: {errors!r}"
            assert _server_rows(server) == n_threads * per_thread
        finally:
            if server.sqlite_conn:
                server.sqlite_conn.close()

    def test_concurrent_stores_while_sync_writer_active_no_lock(self, tmp_path):
        """Store path + background sync writer hammer the SAME file → no 'database is locked'.

        Reproduces the production failure: ``MemoryServer`` and ``SyncEngine`` both write to
        one SQLite file. We shrink ``busy_timeout`` so the structural race (two writers fighting
        the single SQLite write lock) surfaces within seconds instead of 30s. The shared-writer
        fix makes a lock error impossible regardless of busy_timeout.
        """
        from claude_memory.sync import SyncEngine

        db_path = str(tmp_path / "hybrid.db")
        server = MemoryServer(sqlite_db_path=db_path)
        # The production hybrid wiring: the sync engine SHARES the server's LocalStore
        # (one connection, one lock) — the structural fix for the cross-connection race.
        engine = SyncEngine(
            db_path=db_path,
            api_base_url="http://fake-api:8080",
            api_key="test-key",
            sync_interval=3600,  # never auto-syncs; we drive the writer by hand
            store=server.store,
        )
        assert engine._conn is server.sqlite_conn  # truly one shared connection
        # Shrink the busy timeout so that, were there still two writers, the race would
# surface in ms not 30s. With one shared connection a lock error is impossible.
server.sqlite_conn.execute("PRAGMA busy_timeout=50")
now = datetime.now(timezone.utc).isoformat()
# A write-heavy pull: many rows upserted inside the sync engine's single lock block —
# exactly the kind of long-held writer that starved the store path.
def big_pull(method: str, path: str, body: Any = None) -> dict[str, Any]:
return {
"memories": [
{
"id": 10_000 + j,
"content": f"server row {j}",
"category": "facts",
"tags": "",
"expanded_keywords": "",
"importance": 0.5,
"is_sensitive": False,
"created_at": now,
"updated_at": now,
"deleted_at": None,
}
for j in range(120)
],
"server_time": now,
}
errors: list[BaseException] = []
stop = threading.Event()
def sync_writer() -> None:
with patch.object(engine, "_api_request", side_effect=big_pull):
while not stop.is_set():
try:
engine._pull_changes()
except BaseException as exc: # noqa: BLE001
errors.append(exc)
n_threads = 12
per_thread = 12
barrier = threading.Barrier(n_threads)
def store_worker(tid: int) -> None:
barrier.wait()
for i in range(per_thread):
try:
server.memory_store({
"content": f"local {tid}.{i}",
"expanded_keywords": f"local thread {tid} item {i} keywords here",
})
except BaseException as exc: # noqa: BLE001
errors.append(exc)
writer = threading.Thread(target=sync_writer, daemon=True)
writer.start()
final_rows = 0
try:
store_threads = [threading.Thread(target=store_worker, args=(t,)) for t in range(n_threads)]
for t in store_threads:
t.start()
for t in store_threads:
t.join()
stop.set()
writer.join(timeout=5)
final_rows = _server_rows(server) # read while the connection is still open
finally:
stop.set()
engine.stop() # shares the server's store → does not close the connection
if server.sqlite_conn:
server.sqlite_conn.close()
locked = [e for e in errors if isinstance(e, sqlite3.OperationalError) and "locked" in str(e)]
assert locked == [], f"hit 'database is locked' under concurrency: {locked!r}"
assert errors == [], f"concurrent writes raised: {errors!r}"
# Every local store must have landed.
assert final_rows >= n_threads * per_thread
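
Reviewer note: the property these regression tests rely on is that every write goes through one connection guarded by one lock. Below is a minimal standalone sketch of that property. It uses an in-memory database and a hypothetical helper, `run_concurrent_inserts`, and it does not exercise `MemoryServer` or `SyncEngine`.

```python
import sqlite3
import threading


def run_concurrent_inserts(n_threads: int = 8, per_thread: int = 10):
    """Insert rows from many threads through ONE connection guarded by ONE lock.

    Returns (row_count, errors). Hypothetical helper for illustration only.
    """
    conn = sqlite3.connect(":memory:", check_same_thread=False)
    conn.execute("CREATE TABLE memories (id INTEGER PRIMARY KEY, content TEXT)")
    conn.commit()
    lock = threading.RLock()
    errors = []
    barrier = threading.Barrier(n_threads)

    def worker(tid: int) -> None:
        barrier.wait()  # release all threads together to maximise contention
        for i in range(per_thread):
            try:
                # The lock covers the statement AND the commit, so no other
                # thread can interleave inside this write transaction.
                with lock:
                    conn.execute(
                        "INSERT INTO memories (content) VALUES (?)",
                        (f"thread {tid} memory {i}",),
                    )
                    conn.commit()
            except Exception as exc:  # collect instead of crashing the thread
                errors.append(exc)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    with lock:
        count = conn.execute("SELECT COUNT(*) FROM memories").fetchone()[0]
    conn.close()
    return count, errors
```

Because only one connection ever holds SQLite's write lock, the threads queue on the Python lock and never compete as separate SQLite writers. That is the structural argument the docstrings above make.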
class TestRecallProgressAndBounding:
"""The slow path — a remote semantic ``memory_recall`` — must be bounded and give a
notion of progress instead of hanging silently and dropping the session."""
def test_remote_recall_timeout_returns_clear_signal_not_raise(self, server):
"""A timed-out / unreachable remote recall returns a clear 'retry' message, never hangs/raises."""
import claude_memory.mcp_server as m
with patch.object(m, "_api_request", side_effect=RuntimeError("API connection error: timed out")):
text = server._recall_remote({"context": "x", "expanded_query": "a b c d e"})
assert "recall" in text.lower()
# Mentions the bound and that the caller should retry — a clear working/not-done signal.
assert "retry" in text.lower() or "again" in text.lower()
assert str(m.RECALL_TIMEOUT) in text
def test_remote_recall_success_formats_rows(self, server):
"""A successful remote recall still formats results normally."""
import claude_memory.mcp_server as m
payload = {"memories": [
{"id": 7, "category": "facts", "importance": 0.8, "content": "hello",
"tags": "t", "created_at": "2026-01-01T00:00:00+00:00"},
]}
with patch.object(m, "_api_request", return_value=payload):
text = server._recall_remote({"context": "x", "expanded_query": "a b c d e"})
assert "Found 1 memories" in text
assert "hello" in text
def test_remote_recall_uses_bounded_timeout(self, server):
"""The remote recall passes the bounded RECALL_TIMEOUT to the HTTP layer."""
import claude_memory.mcp_server as m
with patch.object(m, "_api_request", return_value={"memories": []}) as mock_api:
server._recall_remote({"context": "x", "expanded_query": "a b c d e"})
_, kwargs = mock_api.call_args
assert kwargs.get("timeout") == m.RECALL_TIMEOUT
def test_api_request_accepts_timeout_kwarg(self):
"""Module-level _api_request must accept an explicit timeout without breaking callers."""
import inspect
import claude_memory.mcp_server as m
sig = inspect.signature(m._api_request)
assert "timeout" in sig.parameters
def test_progress_notification_emitted_for_recall_with_token(self, server):
"""When the client supplies a progressToken, a 'working' progress notification is emitted."""
sent: list[dict[str, Any]] = []
server._notify = lambda method, params: sent.append({"method": method, "params": params})
with patch.object(server, "memory_recall", return_value="ok"):
server.handle_tools_call({
"name": "memory_recall",
"arguments": {"context": "x", "expanded_query": "a b c d e"},
"_meta": {"progressToken": "tok-1"},
})
progress = [s for s in sent if s["method"] == "notifications/progress"]
assert progress, "expected a notifications/progress for a tokened recall"
assert progress[0]["params"]["progressToken"] == "tok-1"
def test_no_progress_notification_without_token(self, server):
"""No token → no progress notification (don't spam clients that didn't opt in)."""
sent: list[dict[str, Any]] = []
server._notify = lambda method, params: sent.append({"method": method, "params": params})
with patch.object(server, "memory_recall", return_value="ok"):
server.handle_tools_call({
"name": "memory_recall",
"arguments": {"context": "x", "expanded_query": "a b c d e"},
})
assert [s for s in sent if s["method"] == "notifications/progress"] == []
def test_fast_tools_do_not_emit_progress(self, server):
"""Only the slow recall path signals progress; a store with a token stays quiet."""
sent: list[dict[str, Any]] = []
server._notify = lambda method, params: sent.append({"method": method, "params": params})
server.handle_tools_call({
"name": "memory_store",
"arguments": {"content": "x", "expanded_keywords": "a b c d e"},
"_meta": {"progressToken": "tok-2"},
})
assert [s for s in sent if s["method"] == "notifications/progress"] == []
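
Reviewer note: the last three tests pin down a gating rule. A progress notification is sent only when the client supplied a `progressToken` and the tool is on the slow path. The sketch below is one way to express that rule, not the server's actual code. `SLOW_TOOLS`, `maybe_notify_progress`, and the `progress` and `message` payload fields are assumptions made for illustration.

```python
from typing import Any, Callable, Dict

# Assumption: only the remote recall path counts as slow enough to signal progress.
SLOW_TOOLS = {"memory_recall"}


def maybe_notify_progress(
    request: Dict[str, Any],
    notify: Callable[[str, Dict[str, Any]], None],
) -> bool:
    """Emit one 'working' progress notification if the request opted in.

    Returns True when a notification was sent, False otherwise.
    """
    token = (request.get("_meta") or {}).get("progressToken")
    if token is None:
        return False  # client did not opt in: stay quiet
    if request.get("name") not in SLOW_TOOLS:
        return False  # fast tools never signal progress, even with a token
    notify(
        "notifications/progress",
        {"progressToken": token, "progress": 0, "message": "working"},
    )
    return True


# Usage: collect notifications in a list, as the tests above do.
sent = []
maybe_notify_progress(
    {"name": "memory_recall", "arguments": {}, "_meta": {"progressToken": "tok-1"}},
    lambda method, params: sent.append({"method": method, "params": params}),
)
```

Keeping the gate in a single function makes the three cases (tokened recall, recall without a token, fast tool with a token) easy to test without a running server.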