claude-memory-mcp/benchmarks
Viktor Barzin 1cc8a2b378
Some checks are pending
Build and Push / lint-and-test (push) Waiting to run
Build and Push / build (push) Blocked by required conditions
Build and Push / deploy (push) Blocked by required conditions
Build and Push / notify-failure (push) Blocked by required conditions
research: benchmark hybrid (lexical+dense+graph) recall vs current FTS
Viktor asked to enhance the memory system with 'semantics' — remember concepts
(not just tokens) linked in a graph — and to prove, by benchmarking against the
current system, that it actually improves recall. A multi-phase research workflow
(18 agents) did landscape research, an adversarially-reviewed integration design,
a stratified eval set over the real 5,452-memory corpus, and a head-to-head
prototype-vs-current benchmark.

Result: hybrid (lexical FTS + dense embeddings, RRF-fused) beats FTS on every
overall metric, driven by a robust paraphrase win (recall@10 +0.350). Recommend
adopting lexical+dense; the concept graph is DEFERRED.

Post-run adversarial review correction (applied to all docs before commit): the
prototype's fusion config structurally barred the graph leg from the ranked top-k,
so the 'graph contributes nothing' ablation was a math artifact, NOT an empirical
result — the graph is UNEVALUATED, not disproven (deferred on cost+uncertainty).
Multi-hop deltas are not statistically significant. Glossary in CONTEXT.md; framing
in ADR-0001-0003; findings in ADR-0004-0006 + docs/research/.

Privacy: the corpus/queries/qrels/results are the user's real memories and stay
gitignored (data/, cache/, results/, build_eval_set.py); only harness code,
aggregate numbers, and synthetic examples are committed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 17:51:53 +00:00
..
harness research: benchmark hybrid (lexical+dense+graph) recall vs current FTS 2026-06-25 17:51:53 +00:00
retrievers research: benchmark hybrid (lexical+dense+graph) recall vs current FTS 2026-06-25 17:51:53 +00:00
scripts research: benchmark hybrid (lexical+dense+graph) recall vs current FTS 2026-06-25 17:51:53 +00:00
.gitignore research: benchmark hybrid (lexical+dense+graph) recall vs current FTS 2026-06-25 17:51:53 +00:00
README.md research: benchmark hybrid (lexical+dense+graph) recall vs current FTS 2026-06-25 17:51:53 +00:00

claude-memory recall benchmark

Stratified retrieval benchmark gating the hybrid-recall adoption decision (ADR-0001): does dense-vector semantic recall + a concept graph beat the current lexical FTS on recall@5, recall@10, nDCG@10, MRR? Quality decides adoption; latency/storage are measured but non-gating.

PRIVACY — read first. The corpus is the operator's REAL personal memories. data/ (corpus/queries/qrels), .venv/, cache/, results/, and scripts/build_eval_set.py (the generator embeds memory-derived query text) are gitignored and must never be committed. Everything else here contains only code / aggregate numbers and is safe to commit. Sensitive memories (is_sensitive=1) are excluded from the corpus entirely.

Layout

benchmarks/
  harness/                 # importable package (committable; no real content)
    types.py               # Memory, Query, Qrels, Retriever protocol
    metrics.py             # recall@k, nDCG@k, MRR (binary relevance)
    dataset.py             # load_dataset() + referential-integrity validation
    runner.py              # run_benchmark() -> overall + per-stratum + latency
    baselines.py           # SqliteFtsRetriever (faithful FTS5/BM25 reference)
    example_retriever.py   # worked example of the plug-in interface
    test_harness.py        # unit tests (pytest)
  scripts/
    export_corpus.py       # SQLite -> data/corpus.jsonl (non-sensitive only)
    build_eval_set.py      # -> data/queries.jsonl + qrels.jsonl  [GITIGNORED]
    dataset_stats.py       # validate + print AGGREGATE stats (safe)
    run_eval.py            # CLI: run a retriever, print/save metrics
  data/                    # [GITIGNORED] corpus.jsonl, queries.jsonl, qrels.jsonl
  .venv/                   # [GITIGNORED]

Dataset schema (JSONL, one object per line)

corpus.jsonl — every non-sensitive memory:

{"id": 137, "content": "...", "category": "decisions", "tags": "memory,architecture",
 "expanded_keywords": "...", "importance": 0.85}

id (int) is the join key everywhere. tags is comma-separated; expanded_keywords space-separated (matches the production schema).

queries.jsonl — eval queries, three strata:

{"query_id": "para_006", "text": "...", "stratum": "paraphrase", "relevant_ids": [380],
 "_note": "author rationale", "_jaccard": 0.023}
  • stratumexact | paraphrase | multihop.
  • relevant_ids is a convenience copy; qrels.jsonl is authoritative.
  • _note / _jaccard are provenance fields (underscore-prefixed); ignore them in scoring.

qrels.jsonl — binary relevance judgments (authoritative):

{"query_id": "multi_006", "relevant_ids": [263, 423, 637]}

Strata (what each one tests)

stratum construction who should win
exact query = a salient phrase lifted from ONE memory; that memory is relevant (verified as the top FTS hit at build time) lexical already strong; floor check
paraphrase query restates ONE memory's meaning in DIFFERENT words (low lexical overlap, validated Jaccard ≤ ~0.18 vs content+keywords) dense embeddings
multihop query needs 2+ DISTINCT memories sharing an entity/concept (e.g. project + decision, or a multi-part runbook); ALL are relevant concept graph

Where a near-duplicate memory equally satisfies a single-target query, qrels was augmented to include the twin (so a good retriever isn't penalised); deliberate discriminator queries are kept single-target on purpose.

Pluggable retriever interface

A retriever is any object implementing one method:

def retrieve(self, query: str, k: int) -> list[int]:
    """Return up to k memory ids (corpus `id`s), ranked best-first."""

Optional lifecycle hooks the runner uses if present (duck-typed):

def build_index(self, corpus: list[Memory]) -> None: ...   # timed separately
def index_size_bytes(self) -> int: ...                     # reported
name: str                                                  # label in reports

A bare callable retrieve(query, k) -> list[int] also works.

Run it

.venv/bin/python scripts/export_corpus.py        # (re)build data/corpus.jsonl
.venv/bin/python scripts/build_eval_set.py        # (re)build queries+qrels (local)
.venv/bin/python scripts/dataset_stats.py         # validate + aggregate stats
.venv/bin/python -m pytest harness/test_harness.py -q

# evaluate a retriever (built-in alias or module:Class)
.venv/bin/python scripts/run_eval.py --retriever fts5
.venv/bin/python scripts/run_eval.py --retriever your_pkg.mod:YourRetriever --json results/yours.json

Programmatic use:

from harness import load_dataset, run_benchmark
ds = load_dataset()
result = run_benchmark(MyRetriever(), ds)   # builds index, times queries
print(result.summary())                     # overall + per-stratum table
result.to_dict()                            # full machine-readable result

run_benchmark requests retrieve_k=20 per query by default (≥ the max metric cutoff of 10), macro-averages metrics over queries (overall + per stratum), and reports per-query latency p50/p95 plus index build time/size when the hooks exist.

Reference baseline

harness.baselines.SqliteFtsRetriever mirrors the production local-store search (README "Search Algorithm"): FTS5 over content/category/tags/expanded_keywords, '"w1" OR "w2" ...' MATCH, ORDER BY bm25(), importance. This is the lexical "current system" any hybrid retriever must beat. (The Postgres tsvector path uses weighted A/B/C/D ranking and an importance-first default; FTS5/BM25 is the faithful, dependency-free relevance reference for the quality comparison.)