claude-memory-mcp/docs/research/benchmark-report.md

313 lines
21 KiB
Markdown
Raw Normal View History

# Benchmark report: FTS (lexical baseline) vs Hybrid (lexical + dense + graph)
Status: **the ADR-0001 decision instrument.** This is the head-to-head that gates hybrid
adoption. Read the [survey](survey.md) for the landscape and the
[integration design](integration-design.md) for what was built.
**Bottom line:** Hybrid (lexical FTS ⊕ dense embeddings, RRF-fused) beats the lexical FTS baseline on
**every overall metric**, driven by the **paraphrase** stratum (recall@10 **+0.350**, robust under a
paired bootstrap) — precisely the gap embeddings were meant to close, with **no recall regression on
exact**. **Recommendation: adopt the lexical + dense fusion;** the ADR-0001 quality gate is met by
that fusion alone.
> **⚠️ Post-review corrections — read before citing this report.** An adversarial completeness review
> after the run found that two prominent claims in the original draft did NOT survive scrutiny. They are
> corrected in place below; the full review is in the [Appendix](#appendix--adversarial-completeness-review).
>
> 1. **The concept graph was NOT shown to be useless — it was never validly tested.** The prototype's
> fusion restricted the graph leg to candidate ids *outside* the FTSdense base set and weighted it at
> 0.35, which makes it **mathematically impossible** for a graph-only result to enter the fused top-k
> (max graph RRF `0.35/(60+1)≈0.0057` < any base-leg minimum `1.0/(60+50)≈0.0091`). So the ablation
> "full-hybrid ≡ FTS+dense to three decimals" was **guaranteed by the fusion config, not an empirical
> finding** — the review even found a genuinely-relevant memory the graph surfaced (and both base legs
> missed) that fusion then discarded. The concept graph is therefore **UNEVALUATED**, and is deferred on
> **cost + invalid-test** grounds, *not* because it failed. A valid retest must put graph ids in the
> fused candidate pool (no base-set exclusion) and/or sweep the graph weight.
> 2. **The multi-hop "win" is not statistically significant.** A paired bootstrap (B=10000) puts 3 of the
> 4 multi-hop metric CIs across zero (recall@10 Δ+0.064, P(Δ≤0)≈0.06). Only the **overall** and
> **paraphrase** deltas are robust. Multi-hop deltas are de-bolded below. Since multi-hop is the entire
> rationale for a concept graph, there is currently **no statistically distinguishable evidence** that
> *anything* (dense or graph) helps the multi-hop stratum.
>
> Also: absolute recall/nDCG *levels* are biased low (binary, un-pooled qrels — §6); only FTS-vs-hybrid
> **deltas** are trustworthy. None of this changes the adopt-lexical+dense recommendation, which rests on
> the robust paraphrase/overall win.
---
## 1. Methodology
### 1.1 Test collection
- **Corpus:** 5,452 memories (`benchmarks/data/corpus.jsonl`, gitignored — privacy). Sensitive
memories (`is_sensitive=1`) excluded entirely per ADR-0003. The corpus is the user's real
personal memories; **only aggregate numbers and synthetic illustrations appear in this document.**
- **Queries:** 119, stratified — **40 exact / 40 paraphrase / 39 multi-hop**
(`benchmarks/data/queries.jsonl`).
- **Qrels:** `benchmarks/data/qrels.jsonl`, LLM-generated (the LongMemEval pipeline inverted for a
memory store: seed memory → exact/paraphrase/multi-hop query → relevant ids).
### 1.2 The two systems
- **`fts` (baseline)** — `benchmarks/retrievers/fts.py::FtsRetriever`. A **faithful
reimplementation of the production code path** `src/claude_memory/mcp_server.py::_sqlite_recall`
(`sort_by="relevance"`): in-memory FTS5 over the full corpus with the production virtual-table
shape + default `unicode61` tokenizer (no stemming/stop-words); AND-match first, OR-broaden only
if AND returns zero; rank by the production blend `-bm25()*0.7 + importance*0.3`; LIKE fallback on
operational errors. **This is "the current system" the hybrid must beat — not the simplified
README prose.**
- **`hybrid`** — `benchmarks/retrievers/hybrid.py::HybridRetriever`. Three legs fused with weighted
RRF: (1) lexical = **`FtsRetriever` reused verbatim** (so the hybrid's lexical component *is* the
baseline — no drift); (2) dense = `BAAI/bge-large-en-v1.5` (1024-d, local, L2-normalized, BGE
query-instruction prefix), cosine via numpy; (3) graph = a keyword-co-occurrence memory-node graph
(5,452 nodes / 2,095,624 edges), 1-hop expansion from the top-10 seeds.
Fusion: `RRF(d) = Σ_leg w_leg/(60 + rank_leg(d))`, `w_fts = w_dense = 1.0, w_graph = 0.35`, each
leg to depth 50.
### 1.3 Metrics & protocol
- **recall@5, recall@10** (hot-path "did we surface it"), **nDCG@10** (graded, position-aware
headline), **MRR** (first-hit). Per stratum and overall.
- `retrieve_k=20`, run via `scripts/run_eval.py --retriever retrievers.{fts,hybrid}:…`. Deterministic;
both invocation paths (programmatic + CLI) verified identical.
- **`sort_by="relevance"` pinned across both arms** (not the production `importance` default), so the
benchmark measures *retrieval* quality, not the importance prior — and everything else (corpus,
queries, OR-broaden behaviour) is held fixed.
- Full result JSONs written only to gitignored `benchmarks/results/{fts,hybrid}.json`; retriever code
contains no embedded corpus content (safe to commit).
---
## 2. Head-to-head results
### 2.1 Comparison table (FTS vs Hybrid, with deltas)
| Stratum | Metric | FTS | Hybrid | Δ |
|---|---|---:|---:|---:|
| **Overall** (n=119) | recall@5 | 0.6663 | 0.7415 | **+0.0752** |
| | recall@10 | 0.6952 | 0.8338 | **+0.1386** |
| | nDCG@10 | 0.6507 | 0.7284 | **+0.0777** |
| | MRR | 0.6737 | 0.7297 | **+0.0560** |
| **Exact** (n=40) | recall@5 | 1.0000 | 1.0000 | +0.0000 |
| | recall@10 | 1.0000 | 1.0000 | +0.0000 |
| | nDCG@10 | 0.9908 | 0.9723 | 0.0185 |
| | MRR | 0.9875 | 0.9625 | 0.0250 |
| **Paraphrase** (n=40) | recall@5 | 0.3500 | 0.5500 | **+0.2000** |
| | recall@10 | 0.3750 | 0.7250 | **+0.3500** |
| | nDCG@10 | 0.3123 | 0.5023 | **+0.1900** |
| | MRR | 0.2958 | 0.4343 | **+0.1385** |
| **Multi-hop** (n=39) | recall@5 | 0.6485 | 0.6726 | +0.0241 ¹ |
| | recall@10 | 0.7111 | 0.7748 | +0.0637 ¹ |
| | nDCG@10 | 0.6491 | 0.7101 | +0.0610 ¹ |
| | MRR | 0.7393 | 0.7940 | +0.0547 ¹ |
¹ **Multi-hop deltas are NOT statistically significant.** Paired bootstrap (B=10000): recall@5
`CI[0.046,+0.095]` (P(Δ≤0)≈0.25), recall@10 `CI[0.020,+0.143]` (P≈0.06), MRR `CI[0.038,+0.143]`
(P≈0.12); only nDCG@10 is marginal (P≈0.04). Treat the multi-hop stratum as **inconclusive**. The
overall and paraphrase deltas, by contrast, have CIs well clear of zero (P≤0.003).
### 2.2 Reading the strata
- **Exact — recall held at 1.0, but that is partly circular.** Exact queries were *generated* as "a
salient phrase whose top FTS hit is memory X," then X was labelled relevant — so FTS recall@5/@10 = 1.0
is substantially **guaranteed by how the stratum was built**, not an independent property. The one
genuine signal here is hybrid's small **degradation**: nDCG@10 0.018 `CI[0.046,0]`, MRR 0.025
`CI[0.063,0]` — a real, consistent rank demotion from blending one perfect lexical hit with dense
near-ties. The proposed exact-match rank bonus (ADR-0005) would recover it, **but that fix is asserted,
not measured.** Recall is unaffected; the cost is rank-position only.
- **Paraphrase — the LARGEST WIN, the dense leg's payoff.** recall@10 **+0.350** (0.375 → 0.725),
recall@5 **+0.200**, nDCG@10 **+0.190**. This is exactly the low-lexical-overlap stratum embeddings
were predicted to fix: lexical FTS finds barely a third of paraphrased answers; adding dense nearly
doubles recall@10. **This stratum alone justifies the hybrid.**
- **Multi-hop — INCONCLUSIVE (deltas not significant).** recall@10 +0.064, nDCG@10 +0.061, MRR +0.055,
but 3 of 4 CIs cross zero (footnote ¹). So we **cannot claim** a multi-hop win for hybrid. We also
cannot attribute anything to the graph: per the §3 caveat, the fusion config structurally barred the
graph leg from the top-k, so the multi-hop stratum tests only FTS vs FTS+dense — and that difference is
not statistically distinguishable here. Multi-hop is exactly the stratum a *properly tested* concept
graph is meant to win, and it remains an open question.
---
## 3. Ablation — what each leg contributes
Four configs, everything else fixed:
| Config | Legs | Overall recall@10 | Overall nDCG@10 | Para recall@10 | Multi recall@10 | Exact nDCG@10 |
|---|---|---:|---:|---:|---:|---:|
| **A** | FTS + dense + graph (full hybrid) | 0.834 | 0.728 | 0.725 | 0.775 | 0.972 |
| **B** | FTS + dense (`w_graph=0`) | **0.834** | **0.728** | **0.725** | **0.775** | **0.972** |
| **C** | dense only | 0.748 | — | — | — | 0.861 |
| **D** | FTS only (= baseline) | 0.695 | 0.651 | 0.375 | 0.711 | 0.991 |
**A ≡ B to three decimals on every metric — but this is a STRUCTURAL ARTIFACT, not a test of the graph.**
The fusion restricts the graph leg to ids *outside* the FTSdense base set and weights it at 0.35, so a
graph-only id's maximum RRF score is `0.35/(60+1) ≈ 0.0057`, strictly below the *minimum* score of any id
from a base leg (`1.0/(60+50) ≈ 0.0091`). Since both base legs return 50 ids, **every base-leg id outranks
every graph-only id** — a graph-only result can never enter the fused top-k, regardless of corpus, query
set, or graph quality. A ≡ B was **mathematically guaranteed before any data ran.** The honest reading is
"the graph cannot affect top-k *under this fusion config*," NOT "the concept graph contributes nothing." (A
spot check found a genuinely-relevant memory the graph surfaced and both base legs missed at depth 50 —
fusion discarded it.) **The graph is therefore unevaluated.**
What the ablation *does* validly show is the **FTS-vs-dense decomposition**, which stands:
- **Dense recovers paraphrase** (C beats D's 0.375 para recall@10 decisively) but is weaker on exact
(C exact nDCG 0.861 vs D's 0.991).
- **Lexical recovers exact** (D exact nDCG 0.991, the best) but collapses on paraphrase.
- **Fusion (B) gets the best of both** — exact recall stays perfect, paraphrase nearly doubles.
- *Caveat:* configs B/C/D were not persisted as result JSONs, so these specific numbers are not
independently reproducible from committed artifacts (only A = full hybrid and D = FTS are).
**To actually test the graph** (deferred follow-up — [ADR-0004](../adr/0004-phase-the-hybrid-lexical-dense-first-graph-gated.md)):
put graph candidates in the fused pool (drop the base-set exclusion) and/or sweep `w_graph` upward, on a
multi-hop slice whose hops are *not* semantically adjacent (so the dense leg can't shortcut them), using
real typed-relation extraction rather than the prototype's zero-LLM keyword co-occurrence graph.
---
## 4. Latency & storage (measured, NON-GATING per ADR-0001)
### 4.1 Latency (per-query `retrieve()`, CPU-only box, no GPU)
| System | p50 | p95 | mean | max |
|---|---:|---:|---:|---:|
| FTS (pure SQLite) | 15.7 ms | 27.8 ms | 12.8 ms | 31.9 ms |
| Hybrid | 229.6 ms | 344.5 ms | 249.3 ms | 640.0 ms |
The hybrid's ~230 ms p50 is **dominated by the local bge-large query embedding** (one CPU forward
pass). On the production GPU node or via a hosted API this drops ~10× to low tens of ms. The FTS,
dense-ANN, and RRF-merge costs themselves are negligible. **Latency does not gate adoption** (the
success metric is quality-first), and the production read path (pgvector HNSW + GPU/hosted query
embed) is far faster than this prototype's CPU profile.
### 4.2 Storage
| Component | Size | Notes |
|---|---|---|
| FTS5 in-memory index | ~8.3 MB | SQLite shadow tables over 5,452 memories |
| Dense matrix | 22.3 MB | 5,452 × 1024 float32; cached `.npy`, fingerprint-keyed |
| Concept graph (in-memory) | ~202 MB (Python-object estimate) | 5,452 nodes + 2,095,624 edges, networkx; **not persisted, not shipped** |
| Total reported index | 232.2 MB | matrix + FTS + graph estimate |
Production maps the dense matrix to pgvector `halfvec(1024)` (~2 KB/row, single-digit MB total for
the corpus) and — only if the graph is ever adopted — three Postgres node/edge tables. No
pgvector/docker in the prototype; in-process numpy cosine (faiss unnecessary at N=5452). Storage is
reported, not gating.
---
## 5. Recommendation
**ADOPT — ship lexical + dense fusion (phase 1); defer the concept graph behind a gate.**
Rationale:
1. **The ADR-0001 quality gate is met.** Hybrid beats FTS on all four overall metrics, with the
decisive, statistically-robust win on **paraphrase** (recall@10 +0.350) — precisely the gap embeddings
were meant to close — with **no recall regression on exact**. (The multi-hop deltas are *not*
statistically significant — §2.2, footnote ¹ — so they are not part of the case for adoption.)
2. **The gain is the FTS+dense fusion, not the graph.** The ablation shows full-hybrid ≡ FTS+dense.
So phase 1 = embeddings (pgvector) fused with the existing FTS via weighted RRF, preserving the
importance prior — exactly the [integration design](integration-design.md) §A.
3. **The concept graph stays GATED — because it was not validly tested, not because it failed**
([ADR-0004](../adr/0004-phase-the-hybrid-lexical-dense-first-graph-gated.md)). The prototype's fusion
config structurally barred it from the top-k (§3), so this benchmark says nothing about its value.
Deferral is justified by operational cost (LLM extraction + two extra tables + traversal) **plus the
remaining uncertainty** — not by evidence of uselessness. Re-open with a valid retest: graph candidates
in the fused pool (no base-set exclusion) and/or a swept weight, on non-semantically-adjacent multi-hop
queries built from real typed-relation extraction.
4. **The small exact-stratum nDCG/MRR dip** is the known RRF blending cost on recall-perfect queries;
an exact-match rank bonus is a cheap follow-up ([ADR-0005](../adr/0005-rrf-default-cc-challenger.md)
records RRF as the default with the bonus as a tunable).
Adopt with the changes already designed: pgvector `halfvec(1024)` + HNSW, weighted-RRF fusion with
the importance prior preserved, sensitive memories excluded from embedding (ADR-0003), and SQLite-
only mode unchanged (ADR-0002).
---
## 6. Limitations (stated honestly)
1. **Label noise / qrels quality.** Qrels were **LLM-generated** with **lighter hand-verification
than the ideal protocol** (no measured Cohen's κ between LLM and human judgments). LLM judges are
systematically lenient, and the eval is **119 queries (~40/stratum)***below* the ~50/stratum
the literature (Voorhees & Buckley) recommends for confident per-stratum ranking, and **no
bootstrap CIs or paired significance test** were computed. The overall and paraphrase deltas
(+0.14, +0.35 recall@10) are large enough to be robust to plausible label noise; the multi-hop
(+0.06) and the exact-stratum dip (~0.02) are within the range where label noise could matter, so
treat those as directional, not precise.
2. **Pooling / "holes."** It is not confirmed that qrels pooled the top-k of *all* arms (FTS, dense,
hybrid) before judging. If the pool was lexical-biased, the metric is **biased against** the dense
and hybrid arms (they retrieve unjudged relevant memories scored as misses) — meaning the true
hybrid uplift could be *larger* than reported, not smaller. This does not threaten the "adopt"
conclusion but caveats the exact magnitudes.
3. **Snapshot vs pgvector (substrate mismatch).** The prototype used an **in-process numpy** dense
index over a static corpus snapshot, **not** the production pgvector HNSW on live CNPG Postgres.
Retrieval *quality* transfers (cosine is cosine; HNSW recall at this scale is ~exact), but the
production **latency profile, ANN approximation, and filtered-top-k behaviour are unmeasured here**
and must be validated post-migration.
4. **Extraction shortcuts AND a structurally-excluded graph leg.** The concept graph was built with
**zero LLM calls** — concepts = `tags` `expanded_keywords` a regex noun-phrase proxy, edges from
keyword co-occurrence — **not** the typed-relation, LLM-extracted graph the production design (§A.5)
specifies. More importantly, the fusion config **structurally barred even this cheap graph from the
top-k** (§1 corrections, §3), so this run is **not a valid test of the graph at all** — not of the
cheap construction, and certainly not of a properly LLM-extracted typed-relation graph. The graph is
*gated and unevaluated*, not *killed*.
5. **Embedding model is the prototype default, not the production pick.** Numbers are for
**bge-large-en-v1.5** (local). Production should use **Voyage-3.5** (also 1024-d) for
non-sensitive memories (ADR-0003); its higher quality ceiling on *our* content is unverified — a
cheap, recommended follow-up (re-run the dense leg with Voyage).
6. **`sort_by="relevance"` not the production default.** The benchmark pins `relevance` to isolate
retrieval quality; production defaults to `importance`-blended ranking. The design preserves
importance as a post-fusion prior, but the *user-visible* ranking under the default sort was not
benchmarked.
7. **Single user, dense corpus.** ~5,452 memories from one author are topically adjacent with many
near-duplicates, so "the one relevant id" is sometimes fuzzy; graded judgments over a pool mitigate
this but it remains a property of the corpus that may not generalize.
---
## Appendix — adversarial completeness review
An independent critic reviewed this report after the run (verdict: **usable-with-caveats**). Its findings
drove the §1 corrections; the full list is recorded here so the review is part of the permanent record.
1. **Graph-null claim is circular, not empirical (most serious).** Proven that the fusion config
(graph leg excluded from the base set; `w_graph=0.35`) caps a graph-only id's RRF at `0.35/61≈0.0057`,
below any base-leg id's `1.0/110≈0.0091`. `A≡B` was guaranteed before any data. Honest statement: "the
graph cannot affect top-k under this fusion config," not "the graph contributes nothing."
2. **Graph mechanism mis-attributed, and contradicted by the data.** Reconstructing the legs (graph =
5,452 nodes / 2,095,624 edges) on a 15-query sample found `rel_only_via_graph=1`: a relevant memory the
graph surfaced that was absent from BOTH base legs at depth 50 — dense did NOT already cover it, and
fusion discarded it. The "dense already retrieves them" explanation is false in ≥1 observed case.
3. **Multi-hop wins are not statistically supported — yet were bolded.** Paired bootstrap (B=10000):
overall & paraphrase robust (P≤0.003); multi-hop 3/4 CIs cross zero (recall@10 P≈0.06). Multi-hop is the
sole rationale for the graph, and it shows no statistically distinguishable hybrid advantage.
4. **Exact stratum is circular by construction.** All 40 exact queries were generated as "top FTS hit for a
salient phrase of ⟨id⟩," so FTS's 1.0 is largely tautological. The only genuine exact signal is hybrid's
small nDCG/MRR demotion, whose claimed rank-bonus fix is asserted, never measured.
5. **qrels are binary and un-pooled, so nDCG is mislabeled and absolute numbers unreliable.** Binary labels
make nDCG degenerate to position-discounted recall (no graded info). Un-pooled, author-assigned labels
bias absolute recall/nDCG **low** (esp. for dense/hybrid); only the FTS-vs-hybrid **delta** is trustworthy.
6. **Underpowered and single-corpus.** 119 queries (~40/stratum) is below the cited ~50/stratum standard;
65% of queries have exactly one relevant id (recall@5≈recall@10 for most). One author, no second corpus,
no inter-annotator agreement: external validity asserted, not demonstrated.
7. **Headline metric overstates hot-path value.** recall@10 leads everywhere, but the auto-recall hook
injects a top-k into the prompt; recall@5 (+0.075) and MRR/first-hit are the decision-relevant metrics.
Hybrid's recall@10 edge partly comes from pushing answers into ranks 610. The hook's effective `k` is
never stated.
8. **Production substrate and model wholly unmeasured.** Numbers are local exact numpy cosine over
bge-large; production is pgvector HNSW (approximate; recall depends on `ef_search`) with a filtered top-k
(NULL-embedding sensitive rows — a partial-index interaction that can hurt HNSW recall), using Voyage-3.5
(never run). "Quality transfers" and "~10× faster on GPU/hosted" are assumptions, not measurements.
9. **Ablation configs A/B/C/D are not reproducible from artifacts.** Only `results/{fts,hybrid}.json`
(= A and D) are persisted; B/C have no saved JSON, and there's no run-manifest/seed/version capture
beyond the embedding-cache fingerprint. The decision-critical `A≡B` and the C/D numbers must be re-run.
10. **Several deferred fixes are asserted, not tested.** The exact-match rank bonus, the importance
post-fusion prior (the benchmark pinned `sort_by="relevance"` without it, so the user-visible production
ranking was never measured), and the "CC ran on our set and RRF was chosen" claim (no CC results exist
anywhere) are all unverified.