claude-memory-mcp/docs/research/benchmark-report.md
Viktor Barzin 1cc8a2b378
Some checks are pending
Build and Push / lint-and-test (push) Waiting to run
Build and Push / build (push) Blocked by required conditions
Build and Push / deploy (push) Blocked by required conditions
Build and Push / notify-failure (push) Blocked by required conditions
research: benchmark hybrid (lexical+dense+graph) recall vs current FTS
Viktor asked to enhance the memory system with 'semantics' — remember concepts
(not just tokens) linked in a graph — and to prove, by benchmarking against the
current system, that it actually improves recall. A multi-phase research workflow
(18 agents) did landscape research, an adversarially-reviewed integration design,
a stratified eval set over the real 5,452-memory corpus, and a head-to-head
prototype-vs-current benchmark.

Result: hybrid (lexical FTS + dense embeddings, RRF-fused) beats FTS on every
overall metric, driven by a robust paraphrase win (recall@10 +0.350). Recommend
adopting lexical+dense; the concept graph is DEFERRED.

Post-run adversarial review correction (applied to all docs before commit): the
prototype's fusion config structurally barred the graph leg from the ranked top-k,
so the 'graph contributes nothing' ablation was a math artifact, NOT an empirical
result — the graph is UNEVALUATED, not disproven (deferred on cost+uncertainty).
Multi-hop deltas are not statistically significant. Glossary in CONTEXT.md; framing
in ADR-0001-0003; findings in ADR-0004-0006 + docs/research/.

Privacy: the corpus/queries/qrels/results are the user's real memories and stay
gitignored (data/, cache/, results/, build_eval_set.py); only harness code,
aggregate numbers, and synthetic examples are committed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 17:51:53 +00:00

312 lines
21 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Benchmark report: FTS (lexical baseline) vs Hybrid (lexical + dense + graph)
Status: **the ADR-0001 decision instrument.** This is the head-to-head that gates hybrid
adoption. Read the [survey](survey.md) for the landscape and the
[integration design](integration-design.md) for what was built.
**Bottom line:** Hybrid (lexical FTS ⊕ dense embeddings, RRF-fused) beats the lexical FTS baseline on
**every overall metric**, driven by the **paraphrase** stratum (recall@10 **+0.350**, robust under a
paired bootstrap) — precisely the gap embeddings were meant to close, with **no recall regression on
exact**. **Recommendation: adopt the lexical + dense fusion;** the ADR-0001 quality gate is met by
that fusion alone.
> **⚠️ Post-review corrections — read before citing this report.** An adversarial completeness review
> after the run found that two prominent claims in the original draft did NOT survive scrutiny. They are
> corrected in place below; the full review is in the [Appendix](#appendix--adversarial-completeness-review).
>
> 1. **The concept graph was NOT shown to be useless — it was never validly tested.** The prototype's
> fusion restricted the graph leg to candidate ids *outside* the FTSdense base set and weighted it at
> 0.35, which makes it **mathematically impossible** for a graph-only result to enter the fused top-k
> (max graph RRF `0.35/(60+1)≈0.0057` < any base-leg minimum `1.0/(60+50)≈0.0091`). So the ablation
> "full-hybrid ≡ FTS+dense to three decimals" was **guaranteed by the fusion config, not an empirical
> finding** — the review even found a genuinely-relevant memory the graph surfaced (and both base legs
> missed) that fusion then discarded. The concept graph is therefore **UNEVALUATED**, and is deferred on
> **cost + invalid-test** grounds, *not* because it failed. A valid retest must put graph ids in the
> fused candidate pool (no base-set exclusion) and/or sweep the graph weight.
> 2. **The multi-hop "win" is not statistically significant.** A paired bootstrap (B=10000) puts 3 of the
> 4 multi-hop metric CIs across zero (recall@10 Δ+0.064, P(Δ≤0)≈0.06). Only the **overall** and
> **paraphrase** deltas are robust. Multi-hop deltas are de-bolded below. Since multi-hop is the entire
> rationale for a concept graph, there is currently **no statistically distinguishable evidence** that
> *anything* (dense or graph) helps the multi-hop stratum.
>
> Also: absolute recall/nDCG *levels* are biased low (binary, un-pooled qrels — §6); only FTS-vs-hybrid
> **deltas** are trustworthy. None of this changes the adopt-lexical+dense recommendation, which rests on
> the robust paraphrase/overall win.
---
## 1. Methodology
### 1.1 Test collection
- **Corpus:** 5,452 memories (`benchmarks/data/corpus.jsonl`, gitignored — privacy). Sensitive
memories (`is_sensitive=1`) excluded entirely per ADR-0003. The corpus is the user's real
personal memories; **only aggregate numbers and synthetic illustrations appear in this document.**
- **Queries:** 119, stratified — **40 exact / 40 paraphrase / 39 multi-hop**
(`benchmarks/data/queries.jsonl`).
- **Qrels:** `benchmarks/data/qrels.jsonl`, LLM-generated (the LongMemEval pipeline inverted for a
memory store: seed memory → exact/paraphrase/multi-hop query → relevant ids).
### 1.2 The two systems
- **`fts` (baseline)** — `benchmarks/retrievers/fts.py::FtsRetriever`. A **faithful
reimplementation of the production code path** `src/claude_memory/mcp_server.py::_sqlite_recall`
(`sort_by="relevance"`): in-memory FTS5 over the full corpus with the production virtual-table
shape + default `unicode61` tokenizer (no stemming/stop-words); AND-match first, OR-broaden only
if AND returns zero; rank by the production blend `-bm25()*0.7 + importance*0.3`; LIKE fallback on
operational errors. **This is "the current system" the hybrid must beat — not the simplified
README prose.**
- **`hybrid`** — `benchmarks/retrievers/hybrid.py::HybridRetriever`. Three legs fused with weighted
RRF: (1) lexical = **`FtsRetriever` reused verbatim** (so the hybrid's lexical component *is* the
baseline — no drift); (2) dense = `BAAI/bge-large-en-v1.5` (1024-d, local, L2-normalized, BGE
query-instruction prefix), cosine via numpy; (3) graph = a keyword-co-occurrence memory-node graph
(5,452 nodes / 2,095,624 edges), 1-hop expansion from the top-10 seeds.
Fusion: `RRF(d) = Σ_leg w_leg/(60 + rank_leg(d))`, `w_fts = w_dense = 1.0, w_graph = 0.35`, each
leg to depth 50.
### 1.3 Metrics & protocol
- **recall@5, recall@10** (hot-path "did we surface it"), **nDCG@10** (graded, position-aware
headline), **MRR** (first-hit). Per stratum and overall.
- `retrieve_k=20`, run via `scripts/run_eval.py --retriever retrievers.{fts,hybrid}:…`. Deterministic;
both invocation paths (programmatic + CLI) verified identical.
- **`sort_by="relevance"` pinned across both arms** (not the production `importance` default), so the
benchmark measures *retrieval* quality, not the importance prior — and everything else (corpus,
queries, OR-broaden behaviour) is held fixed.
- Full result JSONs written only to gitignored `benchmarks/results/{fts,hybrid}.json`; retriever code
contains no embedded corpus content (safe to commit).
---
## 2. Head-to-head results
### 2.1 Comparison table (FTS vs Hybrid, with deltas)
| Stratum | Metric | FTS | Hybrid | Δ |
|---|---|---:|---:|---:|
| **Overall** (n=119) | recall@5 | 0.6663 | 0.7415 | **+0.0752** |
| | recall@10 | 0.6952 | 0.8338 | **+0.1386** |
| | nDCG@10 | 0.6507 | 0.7284 | **+0.0777** |
| | MRR | 0.6737 | 0.7297 | **+0.0560** |
| **Exact** (n=40) | recall@5 | 1.0000 | 1.0000 | +0.0000 |
| | recall@10 | 1.0000 | 1.0000 | +0.0000 |
| | nDCG@10 | 0.9908 | 0.9723 | 0.0185 |
| | MRR | 0.9875 | 0.9625 | 0.0250 |
| **Paraphrase** (n=40) | recall@5 | 0.3500 | 0.5500 | **+0.2000** |
| | recall@10 | 0.3750 | 0.7250 | **+0.3500** |
| | nDCG@10 | 0.3123 | 0.5023 | **+0.1900** |
| | MRR | 0.2958 | 0.4343 | **+0.1385** |
| **Multi-hop** (n=39) | recall@5 | 0.6485 | 0.6726 | +0.0241 ¹ |
| | recall@10 | 0.7111 | 0.7748 | +0.0637 ¹ |
| | nDCG@10 | 0.6491 | 0.7101 | +0.0610 ¹ |
| | MRR | 0.7393 | 0.7940 | +0.0547 ¹ |
¹ **Multi-hop deltas are NOT statistically significant.** Paired bootstrap (B=10000): recall@5
`CI[0.046,+0.095]` (P(Δ≤0)≈0.25), recall@10 `CI[0.020,+0.143]` (P≈0.06), MRR `CI[0.038,+0.143]`
(P≈0.12); only nDCG@10 is marginal (P≈0.04). Treat the multi-hop stratum as **inconclusive**. The
overall and paraphrase deltas, by contrast, have CIs well clear of zero (P≤0.003).
### 2.2 Reading the strata
- **Exact — recall held at 1.0, but that is partly circular.** Exact queries were *generated* as "a
salient phrase whose top FTS hit is memory X," then X was labelled relevant — so FTS recall@5/@10 = 1.0
is substantially **guaranteed by how the stratum was built**, not an independent property. The one
genuine signal here is hybrid's small **degradation**: nDCG@10 0.018 `CI[0.046,0]`, MRR 0.025
`CI[0.063,0]` — a real, consistent rank demotion from blending one perfect lexical hit with dense
near-ties. The proposed exact-match rank bonus (ADR-0005) would recover it, **but that fix is asserted,
not measured.** Recall is unaffected; the cost is rank-position only.
- **Paraphrase — the LARGEST WIN, the dense leg's payoff.** recall@10 **+0.350** (0.375 → 0.725),
recall@5 **+0.200**, nDCG@10 **+0.190**. This is exactly the low-lexical-overlap stratum embeddings
were predicted to fix: lexical FTS finds barely a third of paraphrased answers; adding dense nearly
doubles recall@10. **This stratum alone justifies the hybrid.**
- **Multi-hop — INCONCLUSIVE (deltas not significant).** recall@10 +0.064, nDCG@10 +0.061, MRR +0.055,
but 3 of 4 CIs cross zero (footnote ¹). So we **cannot claim** a multi-hop win for hybrid. We also
cannot attribute anything to the graph: per the §3 caveat, the fusion config structurally barred the
graph leg from the top-k, so the multi-hop stratum tests only FTS vs FTS+dense — and that difference is
not statistically distinguishable here. Multi-hop is exactly the stratum a *properly tested* concept
graph is meant to win, and it remains an open question.
---
## 3. Ablation — what each leg contributes
Four configs, everything else fixed:
| Config | Legs | Overall recall@10 | Overall nDCG@10 | Para recall@10 | Multi recall@10 | Exact nDCG@10 |
|---|---|---:|---:|---:|---:|---:|
| **A** | FTS + dense + graph (full hybrid) | 0.834 | 0.728 | 0.725 | 0.775 | 0.972 |
| **B** | FTS + dense (`w_graph=0`) | **0.834** | **0.728** | **0.725** | **0.775** | **0.972** |
| **C** | dense only | 0.748 | — | — | — | 0.861 |
| **D** | FTS only (= baseline) | 0.695 | 0.651 | 0.375 | 0.711 | 0.991 |
**A ≡ B to three decimals on every metric — but this is a STRUCTURAL ARTIFACT, not a test of the graph.**
The fusion restricts the graph leg to ids *outside* the FTSdense base set and weights it at 0.35, so a
graph-only id's maximum RRF score is `0.35/(60+1) ≈ 0.0057`, strictly below the *minimum* score of any id
from a base leg (`1.0/(60+50) ≈ 0.0091`). Since both base legs return 50 ids, **every base-leg id outranks
every graph-only id** — a graph-only result can never enter the fused top-k, regardless of corpus, query
set, or graph quality. A ≡ B was **mathematically guaranteed before any data ran.** The honest reading is
"the graph cannot affect top-k *under this fusion config*," NOT "the concept graph contributes nothing." (A
spot check found a genuinely-relevant memory the graph surfaced and both base legs missed at depth 50 —
fusion discarded it.) **The graph is therefore unevaluated.**
What the ablation *does* validly show is the **FTS-vs-dense decomposition**, which stands:
- **Dense recovers paraphrase** (C beats D's 0.375 para recall@10 decisively) but is weaker on exact
(C exact nDCG 0.861 vs D's 0.991).
- **Lexical recovers exact** (D exact nDCG 0.991, the best) but collapses on paraphrase.
- **Fusion (B) gets the best of both** — exact recall stays perfect, paraphrase nearly doubles.
- *Caveat:* configs B/C/D were not persisted as result JSONs, so these specific numbers are not
independently reproducible from committed artifacts (only A = full hybrid and D = FTS are).
**To actually test the graph** (deferred follow-up — [ADR-0004](../adr/0004-phase-the-hybrid-lexical-dense-first-graph-gated.md)):
put graph candidates in the fused pool (drop the base-set exclusion) and/or sweep `w_graph` upward, on a
multi-hop slice whose hops are *not* semantically adjacent (so the dense leg can't shortcut them), using
real typed-relation extraction rather than the prototype's zero-LLM keyword co-occurrence graph.
---
## 4. Latency & storage (measured, NON-GATING per ADR-0001)
### 4.1 Latency (per-query `retrieve()`, CPU-only box, no GPU)
| System | p50 | p95 | mean | max |
|---|---:|---:|---:|---:|
| FTS (pure SQLite) | 15.7 ms | 27.8 ms | 12.8 ms | 31.9 ms |
| Hybrid | 229.6 ms | 344.5 ms | 249.3 ms | 640.0 ms |
The hybrid's ~230 ms p50 is **dominated by the local bge-large query embedding** (one CPU forward
pass). On the production GPU node or via a hosted API this drops ~10× to low tens of ms. The FTS,
dense-ANN, and RRF-merge costs themselves are negligible. **Latency does not gate adoption** (the
success metric is quality-first), and the production read path (pgvector HNSW + GPU/hosted query
embed) is far faster than this prototype's CPU profile.
### 4.2 Storage
| Component | Size | Notes |
|---|---|---|
| FTS5 in-memory index | ~8.3 MB | SQLite shadow tables over 5,452 memories |
| Dense matrix | 22.3 MB | 5,452 × 1024 float32; cached `.npy`, fingerprint-keyed |
| Concept graph (in-memory) | ~202 MB (Python-object estimate) | 5,452 nodes + 2,095,624 edges, networkx; **not persisted, not shipped** |
| Total reported index | 232.2 MB | matrix + FTS + graph estimate |
Production maps the dense matrix to pgvector `halfvec(1024)` (~2 KB/row, single-digit MB total for
the corpus) and — only if the graph is ever adopted — three Postgres node/edge tables. No
pgvector/docker in the prototype; in-process numpy cosine (faiss unnecessary at N=5452). Storage is
reported, not gating.
---
## 5. Recommendation
**ADOPT — ship lexical + dense fusion (phase 1); defer the concept graph behind a gate.**
Rationale:
1. **The ADR-0001 quality gate is met.** Hybrid beats FTS on all four overall metrics, with the
decisive, statistically-robust win on **paraphrase** (recall@10 +0.350) — precisely the gap embeddings
were meant to close — with **no recall regression on exact**. (The multi-hop deltas are *not*
statistically significant — §2.2, footnote ¹ — so they are not part of the case for adoption.)
2. **The gain is the FTS+dense fusion, not the graph.** The ablation shows full-hybrid ≡ FTS+dense.
So phase 1 = embeddings (pgvector) fused with the existing FTS via weighted RRF, preserving the
importance prior — exactly the [integration design](integration-design.md) §A.
3. **The concept graph stays GATED — because it was not validly tested, not because it failed**
([ADR-0004](../adr/0004-phase-the-hybrid-lexical-dense-first-graph-gated.md)). The prototype's fusion
config structurally barred it from the top-k (§3), so this benchmark says nothing about its value.
Deferral is justified by operational cost (LLM extraction + two extra tables + traversal) **plus the
remaining uncertainty** — not by evidence of uselessness. Re-open with a valid retest: graph candidates
in the fused pool (no base-set exclusion) and/or a swept weight, on non-semantically-adjacent multi-hop
queries built from real typed-relation extraction.
4. **The small exact-stratum nDCG/MRR dip** is the known RRF blending cost on recall-perfect queries;
an exact-match rank bonus is a cheap follow-up ([ADR-0005](../adr/0005-rrf-default-cc-challenger.md)
records RRF as the default with the bonus as a tunable).
Adopt with the changes already designed: pgvector `halfvec(1024)` + HNSW, weighted-RRF fusion with
the importance prior preserved, sensitive memories excluded from embedding (ADR-0003), and SQLite-
only mode unchanged (ADR-0002).
---
## 6. Limitations (stated honestly)
1. **Label noise / qrels quality.** Qrels were **LLM-generated** with **lighter hand-verification
than the ideal protocol** (no measured Cohen's κ between LLM and human judgments). LLM judges are
systematically lenient, and the eval is **119 queries (~40/stratum)***below* the ~50/stratum
the literature (Voorhees & Buckley) recommends for confident per-stratum ranking, and **no
bootstrap CIs or paired significance test** were computed. The overall and paraphrase deltas
(+0.14, +0.35 recall@10) are large enough to be robust to plausible label noise; the multi-hop
(+0.06) and the exact-stratum dip (~0.02) are within the range where label noise could matter, so
treat those as directional, not precise.
2. **Pooling / "holes."** It is not confirmed that qrels pooled the top-k of *all* arms (FTS, dense,
hybrid) before judging. If the pool was lexical-biased, the metric is **biased against** the dense
and hybrid arms (they retrieve unjudged relevant memories scored as misses) — meaning the true
hybrid uplift could be *larger* than reported, not smaller. This does not threaten the "adopt"
conclusion but caveats the exact magnitudes.
3. **Snapshot vs pgvector (substrate mismatch).** The prototype used an **in-process numpy** dense
index over a static corpus snapshot, **not** the production pgvector HNSW on live CNPG Postgres.
Retrieval *quality* transfers (cosine is cosine; HNSW recall at this scale is ~exact), but the
production **latency profile, ANN approximation, and filtered-top-k behaviour are unmeasured here**
and must be validated post-migration.
4. **Extraction shortcuts AND a structurally-excluded graph leg.** The concept graph was built with
**zero LLM calls** — concepts = `tags` `expanded_keywords` a regex noun-phrase proxy, edges from
keyword co-occurrence — **not** the typed-relation, LLM-extracted graph the production design (§A.5)
specifies. More importantly, the fusion config **structurally barred even this cheap graph from the
top-k** (§1 corrections, §3), so this run is **not a valid test of the graph at all** — not of the
cheap construction, and certainly not of a properly LLM-extracted typed-relation graph. The graph is
*gated and unevaluated*, not *killed*.
5. **Embedding model is the prototype default, not the production pick.** Numbers are for
**bge-large-en-v1.5** (local). Production should use **Voyage-3.5** (also 1024-d) for
non-sensitive memories (ADR-0003); its higher quality ceiling on *our* content is unverified — a
cheap, recommended follow-up (re-run the dense leg with Voyage).
6. **`sort_by="relevance"` not the production default.** The benchmark pins `relevance` to isolate
retrieval quality; production defaults to `importance`-blended ranking. The design preserves
importance as a post-fusion prior, but the *user-visible* ranking under the default sort was not
benchmarked.
7. **Single user, dense corpus.** ~5,452 memories from one author are topically adjacent with many
near-duplicates, so "the one relevant id" is sometimes fuzzy; graded judgments over a pool mitigate
this but it remains a property of the corpus that may not generalize.
---
## Appendix — adversarial completeness review
An independent critic reviewed this report after the run (verdict: **usable-with-caveats**). Its findings
drove the §1 corrections; the full list is recorded here so the review is part of the permanent record.
1. **Graph-null claim is circular, not empirical (most serious).** Proven that the fusion config
(graph leg excluded from the base set; `w_graph=0.35`) caps a graph-only id's RRF at `0.35/61≈0.0057`,
below any base-leg id's `1.0/110≈0.0091`. `A≡B` was guaranteed before any data. Honest statement: "the
graph cannot affect top-k under this fusion config," not "the graph contributes nothing."
2. **Graph mechanism mis-attributed, and contradicted by the data.** Reconstructing the legs (graph =
5,452 nodes / 2,095,624 edges) on a 15-query sample found `rel_only_via_graph=1`: a relevant memory the
graph surfaced that was absent from BOTH base legs at depth 50 — dense did NOT already cover it, and
fusion discarded it. The "dense already retrieves them" explanation is false in ≥1 observed case.
3. **Multi-hop wins are not statistically supported — yet were bolded.** Paired bootstrap (B=10000):
overall & paraphrase robust (P≤0.003); multi-hop 3/4 CIs cross zero (recall@10 P≈0.06). Multi-hop is the
sole rationale for the graph, and it shows no statistically distinguishable hybrid advantage.
4. **Exact stratum is circular by construction.** All 40 exact queries were generated as "top FTS hit for a
salient phrase of ⟨id⟩," so FTS's 1.0 is largely tautological. The only genuine exact signal is hybrid's
small nDCG/MRR demotion, whose claimed rank-bonus fix is asserted, never measured.
5. **qrels are binary and un-pooled, so nDCG is mislabeled and absolute numbers unreliable.** Binary labels
make nDCG degenerate to position-discounted recall (no graded info). Un-pooled, author-assigned labels
bias absolute recall/nDCG **low** (esp. for dense/hybrid); only the FTS-vs-hybrid **delta** is trustworthy.
6. **Underpowered and single-corpus.** 119 queries (~40/stratum) is below the cited ~50/stratum standard;
65% of queries have exactly one relevant id (recall@5≈recall@10 for most). One author, no second corpus,
no inter-annotator agreement: external validity asserted, not demonstrated.
7. **Headline metric overstates hot-path value.** recall@10 leads everywhere, but the auto-recall hook
injects a top-k into the prompt; recall@5 (+0.075) and MRR/first-hit are the decision-relevant metrics.
Hybrid's recall@10 edge partly comes from pushing answers into ranks 610. The hook's effective `k` is
never stated.
8. **Production substrate and model wholly unmeasured.** Numbers are local exact numpy cosine over
bge-large; production is pgvector HNSW (approximate; recall depends on `ef_search`) with a filtered top-k
(NULL-embedding sensitive rows — a partial-index interaction that can hurt HNSW recall), using Voyage-3.5
(never run). "Quality transfers" and "~10× faster on GPU/hosted" are assumptions, not measurements.
9. **Ablation configs A/B/C/D are not reproducible from artifacts.** Only `results/{fts,hybrid}.json`
(= A and D) are persisted; B/C have no saved JSON, and there's no run-manifest/seed/version capture
beyond the embedding-cache fingerprint. The decision-critical `A≡B` and the C/D numbers must be re-run.
10. **Several deferred fixes are asserted, not tested.** The exact-match rank bonus, the importance
post-fusion prior (the benchmark pinned `sort_by="relevance"` without it, so the user-visible production
ranking was never measured), and the "CC ran on our set and RRF was chosen" claim (no CC results exist
anywhere) are all unverified.