claude-memory-mcp/docs/adr/0005-rrf-default-cc-challenger.md
Viktor Barzin 1cc8a2b378
research: benchmark hybrid (lexical+dense+graph) recall vs current FTS
Viktor asked to enhance the memory system with 'semantics' — remember concepts
(not just tokens) linked in a graph — and to prove, by benchmarking against the
current system, that it actually improves recall. A multi-phase research workflow
(18 agents) did landscape research, an adversarially-reviewed integration design,
a stratified eval set over the real 5,452-memory corpus, and a head-to-head
prototype-vs-current benchmark.

Result: hybrid (lexical FTS + dense embeddings, RRF-fused) beats FTS on every
overall metric, driven by a robust paraphrase win (recall@10 +0.350). Recommend
adopting lexical+dense; the concept graph is DEFERRED.

Post-run adversarial review correction (applied to all docs before commit): the
prototype's fusion config structurally barred the graph leg from the ranked top-k,
so the 'graph contributes nothing' ablation was a math artifact, NOT an empirical
result — the graph is UNEVALUATED, not disproven (deferred on cost+uncertainty).
Multi-hop deltas are not statistically significant. Glossary in CONTEXT.md; framing
in ADR-0001-0003; findings in ADR-0004-0006 + docs/research/.

Privacy: the corpus/queries/qrels/results are the user's real memories and stay
gitignored (data/, cache/, results/, build_eval_set.py); only harness code,
aggregate numbers, and synthetic examples are committed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 17:51:53 +00:00


# Weighted RRF as the fusion default; convex combination as a benchmark challenger
The hybrid read path must fuse three retrieval signals on **incomparable scales**: unbounded lexical
`ts_rank` (PostgreSQL's built-in ranking, not true BM25), bounded cosine, and (phase 2) an arbitrary graph-proximity score; the graph
list is **sparse and often empty**. We adopt **weighted Reciprocal Rank Fusion** as the default
fusion function: `score(d) = Σ_s w_s/(60 + rank_s(d))`, default `w_lex = w_dense = 1.0`,
`w_graph = 0.35`, with the existing **importance** value applied as a *post-fusion prior multiplier*
(`final = fused × (0.7 + 0.3·importance)` for `sort_by="relevance"`) — importance is a prior, **not**
a fourth fused list.
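
The formula and the importance prior above can be sketched in a few lines. This is a minimal illustration, not the project's actual fusion code: the function names, the leg keys (`lex`, `dense`, `graph`), the 1-based ranks, the assumption that importance lies in [0, 1], and the default of 0.0 for a missing importance value are all hypothetical choices made for the example.

```python
from collections import defaultdict

RRF_K = 60
DEFAULT_WEIGHTS = {"lex": 1.0, "dense": 1.0, "graph": 0.35}


def weighted_rrf(legs, weights=DEFAULT_WEIGHTS, k=RRF_K):
    """Fuse ranked id lists: score(d) = sum_s w_s / (k + rank_s(d)), ranks 1-based."""
    fused = defaultdict(float)
    for leg, ranked_ids in legs.items():
        w = weights.get(leg, 0.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            fused[doc_id] += w / (k + rank)
    return dict(fused)


def apply_importance_prior(fused, importance):
    """final = fused * (0.7 + 0.3 * importance). Assumes importance in [0, 1]."""
    return {d: s * (0.7 + 0.3 * importance.get(d, 0.0)) for d, s in fused.items()}


def rank(scores):
    """Order ids by descending score; ties broken by id for determinism."""
    return sorted(scores, key=lambda d: (-scores[d], d))


# Made-up ids and importance values, for illustration only.
legs = {"lex": ["m1", "m2", "m3"], "dense": ["m3", "m1"], "graph": []}
fused = weighted_rrf(legs)
final = apply_importance_prior(fused, {"m1": 0.2, "m2": 0.9, "m3": 1.0})
print(rank(fused))  # ['m1', 'm3', 'm2']
print(rank(final))  # ['m3', 'm1', 'm2']
```

Because importance multiplies the fused score rather than entering as a ranked list, it can reorder near-ties (here `m3` overtakes `m1`) without being able to surface a memory that no leg retrieved.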
RRF is the right default because it is **score-scale-free** (no `ts_rank`-vs-cosine calibration to
maintain), treats a missing leg as a clean **0** contribution (no missing-modality bias), is
near-parameter-free (the literature reports `k=60` as insensitive across [10,100]), and **collapses to today's
exact lexical ordering** when the dense/graph legs are empty, which is the SQLite-only
graceful-degrade path ([ADR-0002](0002-api-postgres-first-sqlite-stays-lexical.md)) running the *same*
code.
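
The degrade property can be checked directly. The sketch below is an assumption-laden toy (hypothetical function names and made-up ids) and looks only at the fused score, before any importance multiplier: with a single non-empty leg, each score is `w_lex / (k + rank)`, which is strictly decreasing in rank, so the fused order is the lexical order.

```python
def weighted_rrf(legs, weights, k=60):
    """score(d) = sum over legs of w_leg / (k + rank), with 1-based ranks."""
    fused = {}
    for leg, ranked_ids in legs.items():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + weights[leg] / (k + rank)
    return fused


def ordering(fused):
    """Ids by descending fused score; ties broken by id."""
    return sorted(fused, key=lambda d: (-fused[d], d))


WEIGHTS = {"lex": 1.0, "dense": 1.0, "graph": 0.35}
lexical = ["m7", "m2", "m9", "m4"]  # made-up ids, already in lexical rank order

# Empty legs and absent legs behave identically: both contribute exactly 0.
degraded = weighted_rrf({"lex": lexical, "dense": [], "graph": []}, WEIGHTS)
lex_only = weighted_rrf({"lex": lexical}, WEIGHTS)

print(ordering(degraded) == lexical)  # True
print(degraded == lex_only)           # True
```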
## Considered options
- **Convex combination / TM2C2** (Bruch et al., min-max normalization) — the literature consistently
shows it **edges RRF on nDCG/recall** when scores are calibratable (Weaviate switched its default to it).
It is the **standing challenger**. ⚠️ **Correction:** an earlier draft claimed "the benchmark ran CC
against RRF and RRF was chosen on our eval set" — **no CC results were actually produced or persisted in
this run.** RRF was adopted on *principled* grounds (scale-free, treats a missing/empty leg as a clean 0,
collapses to today's exact lexical ordering for the SQLite-only degrade path), **not** a measured
head-to-head. Benchmarking CC vs RRF on our eval set is an open follow-up — do it before locking fusion,
and especially if the graph is ever adopted or score distributions shift.
- **Cross-encoder stage-2 re-rank** (e.g. `bge-reranker-v2-m3` over the fused top ~20-30): a
*separate*, independently-gated stage, not a fusion function. Deferred; ship only if it clears both
the quality bar and the hot-path p95 budget on the GPU node.
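
For the open CC-vs-RRF follow-up, a convex-combination challenger could look roughly like the sketch below. It uses plain per-query min-max normalization over observed scores and does not attempt to reproduce the TM2C2 variant exactly; `alpha`, the function names, and the sample scores are hypothetical, and nothing here reflects a benchmark that was actually run.

```python
def min_max(scores):
    """Rescale a {doc_id: raw_score} leg to [0, 1]; a constant leg maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}


def convex_combination(lex_scores, dense_scores, alpha=0.5):
    """score(d) = alpha * norm_dense(d) + (1 - alpha) * norm_lex(d).

    A doc absent from a leg gets 0 for that leg (an assumption of this sketch).
    """
    lex_n, dense_n = min_max(lex_scores), min_max(dense_scores)
    docs = set(lex_n) | set(dense_n)
    return {
        d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * lex_n.get(d, 0.0)
        for d in docs
    }


# Made-up raw scores: a lexical rank score and a cosine similarity.
lex = {"m1": 0.9, "m2": 0.3, "m3": 0.1}
dense = {"m3": 0.82, "m1": 0.80, "m4": 0.40}
cc = convex_combination(lex, dense)
print(sorted(cc, key=lambda d: (-cc[d], d)))  # ['m1', 'm3', 'm2', 'm4']
```

Unlike RRF, this consumes raw scores, so its output depends on each leg's score distribution. Note also that min-max maps the worst retrieved doc in a leg to 0, the same value an unretrieved doc receives, which is one concrete form of the calibration burden RRF avoids.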
## Consequences
- Fusion is ~30 lines over three top-N queries; the lexical leg reuses the existing
`plainto_tsquery`/`ts_rank` query + OR-broaden fallback verbatim.
- The exact-stratum nDCG/MRR dip (~0.018/0.025, recall unaffected) is the known RRF cost of blending
one perfect hit with near-ties; a small **exact-match rank bonus** is the tunable recovery and a
cheap follow-up.
- `k=60` is borrowed from TREC ad-hoc IR; a quick re-sweep on the eval set is worthwhile but the
literature says it is insensitive.
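
The `k` re-sweep mentioned above could be structured as follows. This is a hypothetical harness shape with toy queries and qrels, not the project's eval code; the flat result on this toy data says nothing about sensitivity on the real eval set.

```python
def weighted_rrf(legs, weights, k):
    """score(d) = sum over legs of w_leg / (k + rank), with 1-based ranks."""
    fused = {}
    for leg, ranked_ids in legs.items():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + weights[leg] / (k + rank)
    return fused


def recall_at(ranked, relevant, n):
    """Fraction of relevant ids found in the top n of the ranking."""
    return len(set(ranked[:n]) & relevant) / len(relevant) if relevant else 0.0


def sweep_k(queries, weights, ks, n=10):
    """Mean recall@n over all queries for each candidate k."""
    results = {}
    for k in ks:
        total = 0.0
        for q in queries:
            fused = weighted_rrf(q["legs"], weights, k)
            ranked = sorted(fused, key=lambda d: (-fused[d], d))
            total += recall_at(ranked, q["relevant"], n)
        results[k] = total / len(queries)
    return results


# Toy queries with made-up ids and relevance judgments.
TOY_QUERIES = [
    {"legs": {"lex": ["a", "b", "c"], "dense": ["c", "d", "a"]}, "relevant": {"a", "c"}},
    {"legs": {"lex": ["x", "y", "z"], "dense": ["x", "y"]}, "relevant": {"x", "z"}},
]
print(sweep_k(TOY_QUERIES, {"lex": 1.0, "dense": 1.0}, ks=[10, 30, 60, 100], n=2))
# {10: 0.75, 30: 0.75, 60: 0.75, 100: 0.75}
```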