viktor/claude-memory-mcp

Fork 0

Viktor Barzin 1cc8a2b378

Build and Push / lint-and-test (push) Waiting to run

Details

Build and Push / build (push) Blocked by required conditions

Details

Build and Push / deploy (push) Blocked by required conditions

Details

Build and Push / notify-failure (push) Blocked by required conditions

Details

research: benchmark hybrid (lexical+dense+graph) recall vs current FTS

Viktor asked to enhance the memory system with 'semantics' — remember concepts
(not just tokens) linked in a graph — and to prove, by benchmarking against the
current system, that it actually improves recall. A multi-phase research workflow
(18 agents) did landscape research, an adversarially-reviewed integration design,
a stratified eval set over the real 5,452-memory corpus, and a head-to-head
prototype-vs-current benchmark.

Result: hybrid (lexical FTS + dense embeddings, RRF-fused) beats FTS on every
overall metric, driven by a robust paraphrase win (recall@10 +0.350). Recommend
adopting lexical+dense; the concept graph is DEFERRED.

Post-run adversarial review correction (applied to all docs before commit): the
prototype's fusion config structurally barred the graph leg from the ranked top-k,
so the 'graph contributes nothing' ablation was a math artifact, NOT an empirical
result — the graph is UNEVALUATED, not disproven (deferred on cost+uncertainty).
Multi-hop deltas are not statistically significant. Glossary in CONTEXT.md; framing
in ADR-0001-0003; findings in ADR-0004-0006 + docs/research/.

Privacy: the corpus/queries/qrels/results are the user's real memories and stay
gitignored (data/, cache/, results/, build_eval_set.py); only harness code,
aggregate numbers, and synthetic examples are committed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-25 17:51:53 +00:00

21 KiB

Raw Blame History

Benchmark report: FTS (lexical baseline) vs Hybrid (lexical + dense + graph)

Status: the ADR-0001 decision instrument. This is the head-to-head that gates hybrid adoption. Read the survey for the landscape and the integration design for what was built.

Bottom line: Hybrid (lexical FTS ⊕ dense embeddings, RRF-fused) beats the lexical FTS baseline on every overall metric, driven by the paraphrase stratum (recall@10 +0.350, robust under a paired bootstrap) — precisely the gap embeddings were meant to close, with no recall regression on exact. Recommendation: adopt the lexical + dense fusion; the ADR-0001 quality gate is met by that fusion alone.

⚠️ Post-review corrections — read before citing this report. An adversarial completeness review after the run found that two prominent claims in the original draft did NOT survive scrutiny. They are corrected in place below; the full review is in the Appendix.

The concept graph was NOT shown to be useless — it was never validly tested. The prototype's fusion restricted the graph leg to candidate ids outside the FTS∪dense base set and weighted it at 0.35, which makes it mathematically impossible for a graph-only result to enter the fused top-k (max graph RRF 0.35/(60+1)≈0.0057 < any base-leg minimum 1.0/(60+50)≈0.0091). So the ablation "full-hybrid ≡ FTS+dense to three decimals" was guaranteed by the fusion config, not an empirical finding — the review even found a genuinely-relevant memory the graph surfaced (and both base legs missed) that fusion then discarded. The concept graph is therefore UNEVALUATED, and is deferred on cost + invalid-test grounds, not because it failed. A valid retest must put graph ids in the fused candidate pool (no base-set exclusion) and/or sweep the graph weight.

The multi-hop "win" is not statistically significant. A paired bootstrap (B=10000) puts 3 of the 4 multi-hop metric CIs across zero (recall@10 Δ+0.064, P(Δ≤0)≈0.06). Only the overall and paraphrase deltas are robust. Multi-hop deltas are de-bolded below. Since multi-hop is the entire rationale for a concept graph, there is currently no statistically distinguishable evidence that anything (dense or graph) helps the multi-hop stratum.

Also: absolute recall/nDCG levels are biased low (binary, un-pooled qrels — §6); only FTS-vs-hybrid deltas are trustworthy. None of this changes the adopt-lexical+dense recommendation, which rests on the robust paraphrase/overall win.

1. Methodology

1.1 Test collection

Corpus: 5,452 memories (benchmarks/data/corpus.jsonl, gitignored — privacy). Sensitive memories (is_sensitive=1) excluded entirely per ADR-0003. The corpus is the user's real personal memories; only aggregate numbers and synthetic illustrations appear in this document.
Queries: 119, stratified — 40 exact / 40 paraphrase / 39 multi-hop (benchmarks/data/queries.jsonl).
Qrels: benchmarks/data/qrels.jsonl, LLM-generated (the LongMemEval pipeline inverted for a memory store: seed memory → exact/paraphrase/multi-hop query → relevant ids).

1.2 The two systems

fts (baseline) — benchmarks/retrievers/fts.py::FtsRetriever. A faithful reimplementation of the production code path src/claude_memory/mcp_server.py::_sqlite_recall (sort_by="relevance"): in-memory FTS5 over the full corpus with the production virtual-table shape + default unicode61 tokenizer (no stemming/stop-words); AND-match first, OR-broaden only if AND returns zero; rank by the production blend -bm25()*0.7 + importance*0.3; LIKE fallback on operational errors. This is "the current system" the hybrid must beat — not the simplified README prose.
hybrid — benchmarks/retrievers/hybrid.py::HybridRetriever. Three legs fused with weighted RRF: (1) lexical = FtsRetriever reused verbatim (so the hybrid's lexical component is the baseline — no drift); (2) dense = BAAI/bge-large-en-v1.5 (1024-d, local, L2-normalized, BGE query-instruction prefix), cosine via numpy; (3) graph = a keyword-co-occurrence memory-node graph (5,452 nodes / 2,095,624 edges), 1-hop expansion from the top-10 seeds. Fusion: RRF(d) = Σ_leg w_leg/(60 + rank_leg(d)), w_fts = w_dense = 1.0, w_graph = 0.35, each leg to depth 50.

1.3 Metrics & protocol

recall@5, recall@10 (hot-path "did we surface it"), nDCG@10 (graded, position-aware headline), MRR (first-hit). Per stratum and overall.
retrieve_k=20, run via scripts/run_eval.py --retriever retrievers.{fts,hybrid}:…. Deterministic; both invocation paths (programmatic + CLI) verified identical.
sort_by="relevance" pinned across both arms (not the production importance default), so the benchmark measures retrieval quality, not the importance prior — and everything else (corpus, queries, OR-broaden behaviour) is held fixed.
Full result JSONs written only to gitignored benchmarks/results/{fts,hybrid}.json; retriever code contains no embedded corpus content (safe to commit).

2. Head-to-head results

2.1 Comparison table (FTS vs Hybrid, with deltas)

Stratum	Metric	FTS	Hybrid	Δ
Overall (n=119)	recall@5	0.6663	0.7415	+0.0752
	recall@10	0.6952	0.8338	+0.1386
	nDCG@10	0.6507	0.7284	+0.0777
	MRR	0.6737	0.7297	+0.0560
Exact (n=40)	recall@5	1.0000	1.0000	+0.0000
	recall@10	1.0000	1.0000	+0.0000
	nDCG@10	0.9908	0.9723	−0.0185
	MRR	0.9875	0.9625	−0.0250
Paraphrase (n=40)	recall@5	0.3500	0.5500	+0.2000
	recall@10	0.3750	0.7250	+0.3500
	nDCG@10	0.3123	0.5023	+0.1900
	MRR	0.2958	0.4343	+0.1385
Multi-hop (n=39)	recall@5	0.6485	0.6726	+0.0241 ¹
	recall@10	0.7111	0.7748	+0.0637 ¹
	nDCG@10	0.6491	0.7101	+0.0610 ¹
	MRR	0.7393	0.7940	+0.0547 ¹

¹ Multi-hop deltas are NOT statistically significant. Paired bootstrap (B=10000): recall@5 CI[−0.046,+0.095] (P(Δ≤0)≈0.25), recall@10 CI[−0.020,+0.143] (P≈0.06), MRR CI[−0.038,+0.143] (P≈0.12); only nDCG@10 is marginal (P≈0.04). Treat the multi-hop stratum as inconclusive. The overall and paraphrase deltas, by contrast, have CIs well clear of zero (P≤0.003).

2.2 Reading the strata

Exact — recall held at 1.0, but that is partly circular. Exact queries were generated as "a salient phrase whose top FTS hit is memory X," then X was labelled relevant — so FTS recall@5/@10 = 1.0 is substantially guaranteed by how the stratum was built, not an independent property. The one genuine signal here is hybrid's small degradation: nDCG@10 −0.018 CI[−0.046,0], MRR −0.025 CI[−0.063,0] — a real, consistent rank demotion from blending one perfect lexical hit with dense near-ties. The proposed exact-match rank bonus (ADR-0005) would recover it, but that fix is asserted, not measured. Recall is unaffected; the cost is rank-position only.
Paraphrase — the LARGEST WIN, the dense leg's payoff. recall@10 +0.350 (0.375 → 0.725), recall@5 +0.200, nDCG@10 +0.190. This is exactly the low-lexical-overlap stratum embeddings were predicted to fix: lexical FTS finds barely a third of paraphrased answers; adding dense nearly doubles recall@10. This stratum alone justifies the hybrid.
Multi-hop — INCONCLUSIVE (deltas not significant). recall@10 +0.064, nDCG@10 +0.061, MRR +0.055, but 3 of 4 CIs cross zero (footnote ¹). So we cannot claim a multi-hop win for hybrid. We also cannot attribute anything to the graph: per the §3 caveat, the fusion config structurally barred the graph leg from the top-k, so the multi-hop stratum tests only FTS vs FTS+dense — and that difference is not statistically distinguishable here. Multi-hop is exactly the stratum a properly tested concept graph is meant to win, and it remains an open question.

3. Ablation — what each leg contributes

Four configs, everything else fixed:

Config	Legs	Overall recall@10	Overall nDCG@10	Para recall@10	Multi recall@10	Exact nDCG@10
A	FTS + dense + graph (full hybrid)	0.834	0.728	0.725	0.775	0.972
B	FTS + dense (`w_graph=0`)	0.834	0.728	0.725	0.775	0.972
C	dense only	0.748	—	—	—	0.861
D	FTS only (= baseline)	0.695	0.651	0.375	0.711	0.991

A ≡ B to three decimals on every metric — but this is a STRUCTURAL ARTIFACT, not a test of the graph. The fusion restricts the graph leg to ids outside the FTS∪dense base set and weights it at 0.35, so a graph-only id's maximum RRF score is 0.35/(60+1) ≈ 0.0057, strictly below the minimum score of any id from a base leg (1.0/(60+50) ≈ 0.0091). Since both base legs return 50 ids, every base-leg id outranks every graph-only id — a graph-only result can never enter the fused top-k, regardless of corpus, query set, or graph quality. A ≡ B was mathematically guaranteed before any data ran. The honest reading is "the graph cannot affect top-k under this fusion config," NOT "the concept graph contributes nothing." (A spot check found a genuinely-relevant memory the graph surfaced and both base legs missed at depth 50 — fusion discarded it.) The graph is therefore unevaluated.

What the ablation does validly show is the FTS-vs-dense decomposition, which stands:

Dense recovers paraphrase (C beats D's 0.375 para recall@10 decisively) but is weaker on exact (C exact nDCG 0.861 vs D's 0.991).
Lexical recovers exact (D exact nDCG 0.991, the best) but collapses on paraphrase.
Fusion (B) gets the best of both — exact recall stays perfect, paraphrase nearly doubles.
Caveat: configs B/C/D were not persisted as result JSONs, so these specific numbers are not independently reproducible from committed artifacts (only A = full hybrid and D = FTS are).

To actually test the graph (deferred follow-up — ADR-0004): put graph candidates in the fused pool (drop the base-set exclusion) and/or sweep w_graph upward, on a multi-hop slice whose hops are not semantically adjacent (so the dense leg can't shortcut them), using real typed-relation extraction rather than the prototype's zero-LLM keyword co-occurrence graph.

4. Latency & storage (measured, NON-GATING per ADR-0001)

4.1 Latency (per-query `retrieve()`, CPU-only box, no GPU)

System	p50	p95	mean	max
FTS (pure SQLite)	15.7 ms	27.8 ms	12.8 ms	31.9 ms
Hybrid	229.6 ms	344.5 ms	249.3 ms	640.0 ms

The hybrid's ~230 ms p50 is dominated by the local bge-large query embedding (one CPU forward pass). On the production GPU node or via a hosted API this drops ~10× to low tens of ms. The FTS, dense-ANN, and RRF-merge costs themselves are negligible. Latency does not gate adoption (the success metric is quality-first), and the production read path (pgvector HNSW + GPU/hosted query embed) is far faster than this prototype's CPU profile.

4.2 Storage

Component	Size	Notes
FTS5 in-memory index	~8.3 MB	SQLite shadow tables over 5,452 memories
Dense matrix	22.3 MB	5,452 × 1024 float32; cached `.npy`, fingerprint-keyed
Concept graph (in-memory)	~202 MB (Python-object estimate)	5,452 nodes + 2,095,624 edges, networkx; not persisted, not shipped
Total reported index	232.2 MB	matrix + FTS + graph estimate

Production maps the dense matrix to pgvector halfvec(1024) (~2 KB/row, single-digit MB total for the corpus) and — only if the graph is ever adopted — three Postgres node/edge tables. No pgvector/docker in the prototype; in-process numpy cosine (faiss unnecessary at N=5452). Storage is reported, not gating.

5. Recommendation

ADOPT — ship lexical + dense fusion (phase 1); defer the concept graph behind a gate.

Rationale:

The ADR-0001 quality gate is met. Hybrid beats FTS on all four overall metrics, with the decisive, statistically-robust win on paraphrase (recall@10 +0.350) — precisely the gap embeddings were meant to close — with no recall regression on exact. (The multi-hop deltas are not statistically significant — §2.2, footnote ¹ — so they are not part of the case for adoption.)
The gain is the FTS+dense fusion, not the graph. The ablation shows full-hybrid ≡ FTS+dense. So phase 1 = embeddings (pgvector) fused with the existing FTS via weighted RRF, preserving the importance prior — exactly the integration design §A.
The concept graph stays GATED — because it was not validly tested, not because it failed (ADR-0004). The prototype's fusion config structurally barred it from the top-k (§3), so this benchmark says nothing about its value. Deferral is justified by operational cost (LLM extraction + two extra tables + traversal) plus the remaining uncertainty — not by evidence of uselessness. Re-open with a valid retest: graph candidates in the fused pool (no base-set exclusion) and/or a swept weight, on non-semantically-adjacent multi-hop queries built from real typed-relation extraction.
The small exact-stratum nDCG/MRR dip is the known RRF blending cost on recall-perfect queries; an exact-match rank bonus is a cheap follow-up (ADR-0005 records RRF as the default with the bonus as a tunable).

Adopt with the changes already designed: pgvector halfvec(1024) + HNSW, weighted-RRF fusion with the importance prior preserved, sensitive memories excluded from embedding (ADR-0003), and SQLite- only mode unchanged (ADR-0002).

6. Limitations (stated honestly)

Label noise / qrels quality. Qrels were LLM-generated with lighter hand-verification than the ideal protocol (no measured Cohen's κ between LLM and human judgments). LLM judges are systematically lenient, and the eval is 119 queries (~40/stratum) — below the ~50/stratum the literature (Voorhees & Buckley) recommends for confident per-stratum ranking, and no bootstrap CIs or paired significance test were computed. The overall and paraphrase deltas (+0.14, +0.35 recall@10) are large enough to be robust to plausible label noise; the multi-hop (+0.06) and the exact-stratum dip (~0.02) are within the range where label noise could matter, so treat those as directional, not precise.
Pooling / "holes." It is not confirmed that qrels pooled the top-k of all arms (FTS, dense, hybrid) before judging. If the pool was lexical-biased, the metric is biased against the dense and hybrid arms (they retrieve unjudged relevant memories scored as misses) — meaning the true hybrid uplift could be larger than reported, not smaller. This does not threaten the "adopt" conclusion but caveats the exact magnitudes.
Snapshot vs pgvector (substrate mismatch). The prototype used an in-process numpy dense index over a static corpus snapshot, not the production pgvector HNSW on live CNPG Postgres. Retrieval quality transfers (cosine is cosine; HNSW recall at this scale is ~exact), but the production latency profile, ANN approximation, and filtered-top-k behaviour are unmeasured here and must be validated post-migration.
Extraction shortcuts AND a structurally-excluded graph leg. The concept graph was built with zero LLM calls — concepts = tags ∪ expanded_keywords ∪ a regex noun-phrase proxy, edges from keyword co-occurrence — not the typed-relation, LLM-extracted graph the production design (§A.5) specifies. More importantly, the fusion config structurally barred even this cheap graph from the top-k (§1 corrections, §3), so this run is not a valid test of the graph at all — not of the cheap construction, and certainly not of a properly LLM-extracted typed-relation graph. The graph is gated and unevaluated, not killed.
Embedding model is the prototype default, not the production pick. Numbers are for bge-large-en-v1.5 (local). Production should use Voyage-3.5 (also 1024-d) for non-sensitive memories (ADR-0003); its higher quality ceiling on our content is unverified — a cheap, recommended follow-up (re-run the dense leg with Voyage).
sort_by="relevance" not the production default. The benchmark pins relevance to isolate retrieval quality; production defaults to importance-blended ranking. The design preserves importance as a post-fusion prior, but the user-visible ranking under the default sort was not benchmarked.
Single user, dense corpus. ~5,452 memories from one author are topically adjacent with many near-duplicates, so "the one relevant id" is sometimes fuzzy; graded judgments over a pool mitigate this but it remains a property of the corpus that may not generalize.

Appendix — adversarial completeness review

An independent critic reviewed this report after the run (verdict: usable-with-caveats). Its findings drove the §1 corrections; the full list is recorded here so the review is part of the permanent record.

Graph-null claim is circular, not empirical (most serious). Proven that the fusion config (graph leg excluded from the base set; w_graph=0.35) caps a graph-only id's RRF at 0.35/61≈0.0057, below any base-leg id's 1.0/110≈0.0091. A≡B was guaranteed before any data. Honest statement: "the graph cannot affect top-k under this fusion config," not "the graph contributes nothing."
Graph mechanism mis-attributed, and contradicted by the data. Reconstructing the legs (graph = 5,452 nodes / 2,095,624 edges) on a 15-query sample found rel_only_via_graph=1: a relevant memory the graph surfaced that was absent from BOTH base legs at depth 50 — dense did NOT already cover it, and fusion discarded it. The "dense already retrieves them" explanation is false in ≥1 observed case.
Multi-hop wins are not statistically supported — yet were bolded. Paired bootstrap (B=10000): overall & paraphrase robust (P≤0.003); multi-hop 3/4 CIs cross zero (recall@10 P≈0.06). Multi-hop is the sole rationale for the graph, and it shows no statistically distinguishable hybrid advantage.
Exact stratum is circular by construction. All 40 exact queries were generated as "top FTS hit for a salient phrase of ⟨id⟩," so FTS's 1.0 is largely tautological. The only genuine exact signal is hybrid's small nDCG/MRR demotion, whose claimed rank-bonus fix is asserted, never measured.
qrels are binary and un-pooled, so nDCG is mislabeled and absolute numbers unreliable. Binary labels make nDCG degenerate to position-discounted recall (no graded info). Un-pooled, author-assigned labels bias absolute recall/nDCG low (esp. for dense/hybrid); only the FTS-vs-hybrid delta is trustworthy.
Underpowered and single-corpus. 119 queries (~40/stratum) is below the cited ~50/stratum standard; 65% of queries have exactly one relevant id (recall@5≈recall@10 for most). One author, no second corpus, no inter-annotator agreement: external validity asserted, not demonstrated.
Headline metric overstates hot-path value. recall@10 leads everywhere, but the auto-recall hook injects a top-k into the prompt; recall@5 (+0.075) and MRR/first-hit are the decision-relevant metrics. Hybrid's recall@10 edge partly comes from pushing answers into ranks 6–10. The hook's effective k is never stated.
Production substrate and model wholly unmeasured. Numbers are local exact numpy cosine over bge-large; production is pgvector HNSW (approximate; recall depends on ef_search) with a filtered top-k (NULL-embedding sensitive rows — a partial-index interaction that can hurt HNSW recall), using Voyage-3.5 (never run). "Quality transfers" and "~10× faster on GPU/hosted" are assumptions, not measurements.
Ablation configs A/B/C/D are not reproducible from artifacts. Only results/{fts,hybrid}.json (= A and D) are persisted; B/C have no saved JSON, and there's no run-manifest/seed/version capture beyond the embedding-cache fingerprint. The decision-critical A≡B and the C/D numbers must be re-run.
Several deferred fixes are asserted, not tested. The exact-match rank bonus, the importance post-fusion prior (the benchmark pinned sort_by="relevance" without it, so the user-visible production ranking was never measured), and the "CC ran on our set and RRF was chosen" claim (no CC results exist anywhere) are all unverified.

21 KiB Raw Blame History Unescape Escape

Benchmark report: FTS (lexical baseline) vs Hybrid (lexical + dense + graph)

1. Methodology

1.1 Test collection

1.2 The two systems

1.3 Metrics & protocol

2. Head-to-head results

2.1 Comparison table (FTS vs Hybrid, with deltas)

2.2 Reading the strata

3. Ablation — what each leg contributes

4. Latency & storage (measured, NON-GATING per ADR-0001)

4.1 Latency (per-query retrieve(), CPU-only box, no GPU)

4.2 Storage

5. Recommendation

6. Limitations (stated honestly)

Appendix — adversarial completeness review

21 KiB

Raw Blame History

4.1 Latency (per-query `retrieve()`, CPU-only box, no GPU)