claude-memory-mcp/benchmarks/retrievers/fts.py
Viktor Barzin 1cc8a2b378
Some checks are pending
Build and Push / lint-and-test (push) Waiting to run
Build and Push / build (push) Blocked by required conditions
Build and Push / deploy (push) Blocked by required conditions
Build and Push / notify-failure (push) Blocked by required conditions
research: benchmark hybrid (lexical+dense+graph) recall vs current FTS
Viktor asked to enhance the memory system with 'semantics' — remember concepts
(not just tokens) linked in a graph — and to prove, by benchmarking against the
current system, that it actually improves recall. A multi-phase research workflow
(18 agents) did landscape research, an adversarially-reviewed integration design,
a stratified eval set over the real 5,452-memory corpus, and a head-to-head
prototype-vs-current benchmark.

Result: hybrid (lexical FTS + dense embeddings, RRF-fused) beats FTS on every
overall metric, driven by a robust paraphrase win (recall@10 +0.350). Recommend
adopting lexical+dense; the concept graph is DEFERRED.

Post-run adversarial review correction (applied to all docs before commit): the
prototype's fusion config structurally barred the graph leg from the ranked top-k,
so the 'graph contributes nothing' ablation was a math artifact, NOT an empirical
result — the graph is UNEVALUATED, not disproven (deferred on cost+uncertainty).
Multi-hop deltas are not statistically significant. Glossary in CONTEXT.md; framing
in ADR-0001-0003; findings in ADR-0004-0006 + docs/research/.

Privacy: the corpus/queries/qrels/results are the user's real memories and stay
gitignored (data/, cache/, results/, build_eval_set.py); only harness code,
aggregate numbers, and synthetic examples are committed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 17:51:53 +00:00

224 lines
10 KiB
Python

"""BASELINE retriever: the product's CURRENT lexical recall (SQLite FTS5/BM25).
This is the "current system" the hybrid upgrade (dense embeddings + concept
graph, ADR-0001) must beat on recall@k / nDCG@10 / MRR. It is a *faithful*
reimplementation of the production local-store recall path, not an idealised
sketch — it mirrors ``src/claude_memory/mcp_server.py :: _sqlite_recall`` (and
the FTS5 schema/triggers in the same module) line-for-line where it matters:
Production recall (``sort_by="relevance"``) does ALL of the following, and so
does this retriever:
1. **Concatenate then split.** The MCP tool builds
``all_terms = f"{context} {expanded_query}"`` and splits it on whitespace,
stripping any embedded ``"`` from each token. The harness already hands us
one ``query`` string (the concatenation happens upstream of recall), so here
``query`` IS ``all_terms``; we split + strip identically.
2. **AND-first, then OR-broaden.** Production builds BOTH
``'"w1" AND "w2" ...'`` and ``'"w1" OR "w2" ...'`` and runs the **AND** match
first; only if it returns zero rows does it fall back to the **OR** match.
(The README's "Search Algorithm" prose shows only the OR form; the *code* is
AND→OR, and the code is authoritative. We replicate the code.)
3. **Blended BM25+importance relevance ordering.** ``sort_by="relevance"`` is
NOT a pure ``ORDER BY bm25()``. It is the blend
``(-bm25(memories_fts) * 0.7 + importance * 0.3) DESC`` (bm25 is negated
because SQLite returns more-negative = better-match). We use the EXACT same
expression. We deliberately evaluate ``relevance`` (not the production
``importance`` default) so the benchmark measures RETRIEVAL quality rather
than the importance-sort prior — per the research brief.
4. **FTS5 default tokenizer.** The production virtual table is declared with no
explicit tokenizer, i.e. ``unicode61`` — case-folding + unicode diacritic
stripping, NO stemming and NO stop-word removal. We declare ours the same
way, so "running" does not match "run" (a known lexical weakness the dense
path is expected to fix on the *paraphrase* stratum).
5. **LIKE fallback.** If the FTS5 MATCH raises ``sqlite3.OperationalError``
(e.g. a token that trips the FTS5 query grammar), production degrades to a
``content LIKE %context% OR tags LIKE %context%`` scan ordered by importance.
We mirror that fallback (using the full query as the LIKE needle, since the
harness query is the whole ``all_terms``).
DIFFERENCES FROM PRODUCTION (all immaterial to ranking, documented for honesty):
- The benchmark corpus has no per-user / soft-delete / category filtering, so we
drop the ``user_id``/``deleted_at``/``category`` predicates. No category is
passed by the harness, so the category branch is never taken anyway.
- We build a fresh in-memory FTS5 index over ``data/corpus.jsonl`` rather than
reading the live ``memory.db``; same schema, same tokenizer, same columns
(content/category/tags/expanded_keywords), so BM25 statistics match what the
product would compute over the same documents.
The harness reference ``harness.baselines.SqliteFtsRetriever`` implements the
*README* ordering (pure ``ORDER BY bm25(), importance``). This module is the
faithful-to-the-CODE variant and is the one the RUN reports as ``retriever="fts"``.
"""
from __future__ import annotations
import re
import sqlite3
from collections.abc import Sequence
# Import the corpus dataclass from the sibling harness package. run_eval.py and
# run_benchmark put the benchmarks/ root on sys.path; support direct execution
# (python retrievers/fts.py) too by adding it ourselves if the import fails.
try: # pragma: no cover - exercised by both import paths
from harness.types import Memory, MemoryId
except ModuleNotFoundError: # pragma: no cover
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
from harness.types import Memory, MemoryId
# Mirror production token extraction: split ``all_terms`` on whitespace and strip
# any embedded double-quote from each token (mcp_server uses
# ``w.replace(chr(34), "")``). We lowercase as well; FTS5 unicode61 case-folds
# regardless, so this only normalises the quoted MATCH literals we emit.
_DQUOTE = '"'
class FtsRetriever:
"""Faithful reimplementation of the production SQLite FTS5/BM25 recall.
Mirrors ``_sqlite_recall(sort_by="relevance")``: AND-first then OR-broaden
over an FTS5(content, category, tags, expanded_keywords) index, ranked by
the blended ``(-bm25*0.7 + importance*0.3)`` score, with a LIKE fallback.
"""
#: Label surfaced in benchmark reports / the RUN schema.
name = "fts"
def __init__(self, sort_by: str = "relevance") -> None:
# We benchmark "relevance" so the metric reflects retrieval quality, not
# the importance prior. "importance" is kept for parity / diagnostics.
if sort_by not in ("relevance", "importance"):
raise ValueError(f"sort_by must be 'relevance' or 'importance', got {sort_by!r}")
self.sort_by = sort_by
self._con: sqlite3.Connection | None = None
# ── lifecycle hooks (duck-typed by the runner) ───────────────────────────
def build_index(self, corpus: Sequence[Memory]) -> None:
"""Build a fresh in-memory FTS5 index over the corpus.
Same virtual-table shape and (default ``unicode61``) tokenizer as the
production ``memories_fts`` table. We carry ``memory_id`` and
``importance`` as UNINDEXED columns so the relevance blend can read
importance without a join — semantically identical to the production
``memories m JOIN memories_fts fts ON m.id = fts.rowid`` read.
"""
con = sqlite3.connect(":memory:")
con.execute(
"""
CREATE VIRTUAL TABLE memories_fts USING fts5(
content, category, tags, expanded_keywords,
memory_id UNINDEXED, importance UNINDEXED
)
"""
)
con.executemany(
"INSERT INTO memories_fts"
"(content, category, tags, expanded_keywords, memory_id, importance)"
" VALUES (?,?,?,?,?,?)",
[
(
m.content,
m.category,
m.tags,
m.expanded_keywords,
int(m.id),
float(m.importance),
)
for m in corpus
],
)
con.commit()
self._con = con
def index_size_bytes(self) -> int:
"""Approximate on-disk index size (sum of FTS5 shadow-table page bytes).
The index is in-memory, so this is the SQLite page accounting for the
FTS5 shadow tables — reported for the storage column, non-gating per
ADR-0001.
"""
if self._con is None:
return 0
try:
page_count = self._con.execute("PRAGMA page_count").fetchone()[0]
page_size = self._con.execute("PRAGMA page_size").fetchone()[0]
return int(page_count) * int(page_size)
except sqlite3.Error:
return 0
# ── query construction (mirrors _sqlite_recall) ──────────────────────────
@staticmethod
def _tokens(query: str) -> list[str]:
"""Split ``all_terms`` exactly as production does: whitespace split,
drop embedded double-quotes, drop empties."""
return [w.replace(_DQUOTE, "").lower() for w in query.split() if w.strip()]
@classmethod
def _and_or_queries(cls, query: str) -> tuple[str, str]:
"""Build the ('"w1" AND "w2" ...', '"w1" OR "w2" ...') MATCH pair."""
words = cls._tokens(query)
if not words:
return "", ""
quoted = [f'"{w}"' for w in words]
return " AND ".join(quoted), " OR ".join(quoted)
def _order_clause(self) -> str:
# bm25() is negative (more-negative = better), so negate before blending.
if self.sort_by == "relevance":
return "(-bm25(memories_fts) * 0.7 + importance * 0.3) DESC"
return "(-bm25(memories_fts) * 0.4 + importance * 0.6) DESC"
# ── retrieve ──────────────────────────────────────────────────────────────
def retrieve(self, query: str, k: int) -> list[MemoryId]:
"""Return up to ``k`` memory ids, ranked best-first.
AND-match first (precise); if it yields nothing, OR-broaden. On an FTS5
grammar error, fall back to a LIKE scan ordered by importance — exactly
the production degradation path.
"""
assert self._con is not None, "call build_index first"
and_query, or_query = self._and_or_queries(query)
if not or_query: # no usable tokens
return []
order = self._order_clause()
base_select = "SELECT memory_id FROM memories_fts WHERE memories_fts MATCH ? "
try:
rows: list[tuple[int]] = []
# AND first for precise matches, fall back to OR for broader recall.
for fts_query in (and_query, or_query):
rows = self._con.execute(
f"{base_select}ORDER BY {order} LIMIT ?",
(fts_query, k),
).fetchall()
if rows:
break
except sqlite3.OperationalError:
# Mirror production LIKE fallback: full query as the needle,
# ordered by importance.
like = f"%{query}%"
rows = self._con.execute(
"SELECT memory_id FROM memories_fts "
"WHERE content LIKE ? OR tags LIKE ? "
"ORDER BY importance DESC LIMIT ?",
(like, like, k),
).fetchall()
return [r[0] for r in rows]
def close(self) -> None:
if self._con is not None:
self._con.close()
self._con = None
# Convenience for `run_eval.py --retriever retrievers.fts:FtsRetriever`
# and a no-arg default instantiation (sort_by="relevance").