research: benchmark hybrid (lexical+dense+graph) recall vs current FTS

Viktor asked to enhance the memory system with 'semantics' (remember concepts, not just tokens, linked in a graph) and to prove, by benchmarking against the current system, that it actually improves recall. A multi-phase research workflow (18 agents) did landscape research, an adversarially-reviewed integration design, a stratified eval set over the real 5,452-memory corpus, and a head-to-head prototype-vs-current benchmark.

Result: hybrid (lexical FTS + dense embeddings, RRF-fused) beats FTS on every overall metric, driven by a robust paraphrase win (recall@10 +0.350). Recommend adopting lexical+dense; the concept graph is DEFERRED.

Post-run adversarial review correction (applied to all docs before commit): the prototype's fusion config structurally barred the graph leg from the ranked top-k, so the 'graph contributes nothing' ablation was a math artifact, NOT an empirical result. The graph is UNEVALUATED, not disproven (deferred on cost+uncertainty). Multi-hop deltas are not statistically significant.

Glossary in CONTEXT.md; framing in ADR-0001-0003; findings in ADR-0004-0006 + docs/research/.

Privacy: the corpus/queries/qrels/results are the user's real memories and stay gitignored (data/, cache/, results/, build_eval_set.py); only harness code, aggregate numbers, and synthetic examples are committed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 17:51:53 +00:00
commit 1cc8a2b378
parent 7439540f8f
23 changed files with 3428 additions and 0 deletions
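For readers unfamiliar with the terms in the commit message, here is a minimal sketch of Reciprocal Rank Fusion (RRF) and recall@k. It is a generic illustration, not the prototype's fusion code: the names `rrf_fuse` and `recall_at_k`, the constant `k=60` (a commonly used RRF default), and the toy id lists are all assumptions made for this example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_k=10):
    """Fuse several best-first id lists with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) for every id it contains, with
    rank starting at 1. k=60 is an assumed default, not the prototype's.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; break ties by id so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_k]


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant ids that appear in the top-k of ``ranked``."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


# Hypothetical toy rankings from a lexical leg and a dense leg.
lexical = [3, 1, 7]
dense = [7, 3, 9]
fused = rrf_fuse([lexical, dense])
print(fused)
print(recall_at_k(fused, {7, 9}, k=2))
```

An id that ranks well in both legs (3 and 7 here) rises above ids seen by only one leg, which is the effect a hybrid retriever relies on.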
--- a/benchmarks/retrievers/fts.py
+++ b/benchmarks/retrievers/fts.py
@@ -0,0 +1,224 @@
+"""BASELINE retriever: the product's CURRENT lexical recall (SQLite FTS5/BM25).
+
+This is the "current system" the hybrid upgrade (dense embeddings + concept
+graph, ADR-0001) must beat on recall@k / nDCG@10 / MRR. It is a *faithful*
+reimplementation of the production local-store recall path, not an idealised
+sketch: it mirrors ``src/claude_memory/mcp_server.py :: _sqlite_recall`` (and
+the FTS5 schema/triggers in the same module) line-for-line where it matters.
+
+Production recall (``sort_by="relevance"``) does ALL of the following, and so
+does this retriever:
+
+1. **Concatenate then split.** The MCP tool builds
+   ``all_terms = f"{context} {expanded_query}"`` and splits it on whitespace,
+   stripping any embedded ``"`` from each token. The harness already hands us
+   one ``query`` string (the concatenation happens upstream of recall), so here
+   ``query`` IS ``all_terms``; we split + strip identically.
+
+2. **AND-first, then OR-broaden.** Production builds BOTH
+   ``'"w1" AND "w2" ...'`` and ``'"w1" OR "w2" ...'`` and runs the **AND** match
+   first; only if it returns zero rows does it fall back to the **OR** match.
+   (The README's "Search Algorithm" prose shows only the OR form; the *code* is
+   AND→OR, and the code is authoritative. We replicate the code.)
+
+3. **Blended BM25+importance relevance ordering.** ``sort_by="relevance"`` is
+   NOT a pure ``ORDER BY bm25()``. It is the blend
+   ``(-bm25(memories_fts) * 0.7 + importance * 0.3) DESC`` (bm25 is negated
+   because SQLite returns more-negative = better-match). We use the EXACT same
+   expression. We deliberately evaluate ``relevance`` (not the production
+   ``importance`` default) so the benchmark measures RETRIEVAL quality rather
+   than the importance-sort prior — per the research brief.
+
+4. **FTS5 default tokenizer.** The production virtual table is declared with no
+   explicit tokenizer, i.e. ``unicode61`` — case-folding + unicode diacritic
+   stripping, NO stemming and NO stop-word removal. We declare ours the same
+   way, so "running" does not match "run" (a known lexical weakness the dense
+   path is expected to fix on the *paraphrase* stratum).
+
+5. **LIKE fallback.** If the FTS5 MATCH raises ``sqlite3.OperationalError``
+   (e.g. a token that trips the FTS5 query grammar), production degrades to a
+   ``content LIKE %context% OR tags LIKE %context%`` scan ordered by importance.
+   We mirror that fallback (using the full query as the LIKE needle, since the
+   harness query is the whole ``all_terms``).
+
+DIFFERENCES FROM PRODUCTION (all immaterial to ranking, documented for honesty):
+- The benchmark corpus has no per-user / soft-delete / category filtering, so we
+  drop the ``user_id``/``deleted_at``/``category`` predicates. No category is
+  passed by the harness, so the category branch is never taken anyway.
+- We build a fresh in-memory FTS5 index over ``data/corpus.jsonl`` rather than
+  reading the live ``memory.db``; same schema, same tokenizer, same columns
+  (content/category/tags/expanded_keywords), so BM25 statistics match what the
+  product would compute over the same documents.
+
+The harness reference ``harness.baselines.SqliteFtsRetriever`` implements the
+*README* ordering (pure ``ORDER BY bm25(), importance``). This module is the
+faithful-to-the-CODE variant and is the one the RUN reports as ``retriever="fts"``.
+"""
+from __future__ import annotations
+
+import re
+import sqlite3
+from collections.abc import Sequence
+
+# Import the corpus dataclass from the sibling harness package. run_eval.py and
+# run_benchmark put the benchmarks/ root on sys.path; support direct execution
+# (python retrievers/fts.py) too by adding it ourselves if the import fails.
+try:  # pragma: no cover - exercised by both import paths
+    from harness.types import Memory, MemoryId
+except ModuleNotFoundError:  # pragma: no cover
+    import sys
+    from pathlib import Path
+
+    sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
+    from harness.types import Memory, MemoryId
+
+# Mirror production token extraction: split ``all_terms`` on whitespace and strip
+# any embedded double-quote from each token (mcp_server uses
+# ``w.replace(chr(34), "")``). We lowercase as well; FTS5 unicode61 case-folds
+# regardless, so this only normalises the quoted MATCH literals we emit.
+_DQUOTE = '"'
+
+
+class FtsRetriever:
+    """Faithful reimplementation of the production SQLite FTS5/BM25 recall.
+
+    Mirrors ``_sqlite_recall(sort_by="relevance")``: AND-first then OR-broaden
+    over an FTS5(content, category, tags, expanded_keywords) index, ranked by
+    the blended ``(-bm25*0.7 + importance*0.3)`` score, with a LIKE fallback.
+    """
+
+    #: Label surfaced in benchmark reports / the RUN schema.
+    name = "fts"
+
+    def __init__(self, sort_by: str = "relevance") -> None:
+        # We benchmark "relevance" so the metric reflects retrieval quality, not
+        # the importance prior. "importance" is kept for parity / diagnostics.
+        if sort_by not in ("relevance", "importance"):
+            raise ValueError(f"sort_by must be 'relevance' or 'importance', got {sort_by!r}")
+        self.sort_by = sort_by
+        self._con: sqlite3.Connection | None = None
+
+    # ── lifecycle hooks (duck-typed by the runner) ───────────────────────────
+
+    def build_index(self, corpus: Sequence[Memory]) -> None:
+        """Build a fresh in-memory FTS5 index over the corpus.
+
+        Same virtual-table shape and (default ``unicode61``) tokenizer as the
+        production ``memories_fts`` table. We carry ``memory_id`` and
+        ``importance`` as UNINDEXED columns so the relevance blend can read
+        importance without a join — semantically identical to the production
+        ``memories m JOIN memories_fts fts ON m.id = fts.rowid`` read.
+        """
+        con = sqlite3.connect(":memory:")
+        con.execute(
+            """
+            CREATE VIRTUAL TABLE memories_fts USING fts5(
+                content, category, tags, expanded_keywords,
+                memory_id UNINDEXED, importance UNINDEXED
+            )
+            """
+        )
+        con.executemany(
+            "INSERT INTO memories_fts"
+            "(content, category, tags, expanded_keywords, memory_id, importance)"
+            " VALUES (?,?,?,?,?,?)",
+            [
+                (
+                    m.content,
+                    m.category,
+                    m.tags,
+                    m.expanded_keywords,
+                    int(m.id),
+                    float(m.importance),
+                )
+                for m in corpus
+            ],
+        )
+        con.commit()
+        self._con = con
+
+    def index_size_bytes(self) -> int:
+        """Approximate index size: SQLite page count times page size.
+
+        The index is in-memory, so this is page accounting for the whole
+        database, which holds only the FTS5 table and its shadow tables.
+        Reported for the storage column, non-gating per ADR-0001.
+        """
+        if self._con is None:
+            return 0
+        try:
+            page_count = self._con.execute("PRAGMA page_count").fetchone()[0]
+            page_size = self._con.execute("PRAGMA page_size").fetchone()[0]
+            return int(page_count) * int(page_size)
+        except sqlite3.Error:
+            return 0
+
+    # ── query construction (mirrors _sqlite_recall) ──────────────────────────
+
+    @staticmethod
+    def _tokens(query: str) -> list[str]:
+        """Split ``all_terms`` as production does (whitespace split, embedded
+        double-quotes stripped), then lowercase each token."""
+        return [w.replace(_DQUOTE, "").lower() for w in query.split() if w.strip()]
+
+    @classmethod
+    def _and_or_queries(cls, query: str) -> tuple[str, str]:
+        """Build the ('"w1" AND "w2" ...', '"w1" OR "w2" ...') MATCH pair."""
+        words = cls._tokens(query)
+        if not words:
+            return "", ""
+        quoted = [f'"{w}"' for w in words]
+        return " AND ".join(quoted), " OR ".join(quoted)
+
+    def _order_clause(self) -> str:
+        # bm25() is negative (more-negative = better), so negate before blending.
+        if self.sort_by == "relevance":
+            return "(-bm25(memories_fts) * 0.7 + importance * 0.3) DESC"
+        return "(-bm25(memories_fts) * 0.4 + importance * 0.6) DESC"
+
+    # ── retrieve ──────────────────────────────────────────────────────────────
+
+    def retrieve(self, query: str, k: int) -> list[MemoryId]:
+        """Return up to ``k`` memory ids, ranked best-first.
+
+        AND-match first (precise); if it yields nothing, OR-broaden. On an FTS5
+        grammar error, fall back to a LIKE scan ordered by importance — exactly
+        the production degradation path.
+        """
+        assert self._con is not None, "call build_index first"
+        and_query, or_query = self._and_or_queries(query)
+        if not or_query:  # no usable tokens
+            return []
+
+        order = self._order_clause()
+        base_select = "SELECT memory_id FROM memories_fts WHERE memories_fts MATCH ? "
+        try:
+            rows: list[tuple[int]] = []
+            # AND first for precise matches, fall back to OR for broader recall.
+            for fts_query in (and_query, or_query):
+                rows = self._con.execute(
+                    f"{base_select}ORDER BY {order} LIMIT ?",
+                    (fts_query, k),
+                ).fetchall()
+                if rows:
+                    break
+        except sqlite3.OperationalError:
+            # Mirror production LIKE fallback: full query as the needle,
+            # ordered by importance.
+            like = f"%{query}%"
+            rows = self._con.execute(
+                "SELECT memory_id FROM memories_fts "
+                "WHERE content LIKE ? OR tags LIKE ? "
+                "ORDER BY importance DESC LIMIT ?",
+                (like, like, k),
+            ).fetchall()
+        return [r[0] for r in rows]
+
+    def close(self) -> None:
+        if self._con is not None:
+            self._con.close()
+            self._con = None
+
+
+# Usable as `run_eval.py --retriever retrievers.fts:FtsRetriever`; the no-arg
+# constructor defaults to sort_by="relevance".
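
The AND-first / OR-broaden behaviour and the no-stemming property of the default tokenizer described in the module docstring can be tried in isolation. The sketch below is standalone and is not part of the commit: the table name `docs`, the helper `search`, and the three sample rows are made up for illustration, and it assumes the local Python `sqlite3` build has FTS5 enabled. Unlike the module above, it also skips tokens left empty after quote stripping.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Default tokenizer (unicode61): case-folding, no stemming, no stop words.
con.execute("CREATE VIRTUAL TABLE docs USING fts5(content, importance UNINDEXED)")
con.executemany(
    "INSERT INTO docs(content, importance) VALUES (?, ?)",
    [
        ("running the nightly backup job", 0.5),
        ("backup restore procedure", 0.9),
        ("run the linter before commit", 0.2),
    ],
)


def search(query, k=5):
    """AND-match first; if that returns no rows, broaden to OR."""
    words = [w.replace('"', "").lower() for w in query.split()]
    # Sketch-only choice: skip tokens that are empty after stripping quotes.
    quoted = [f'"{w}"' for w in words if w]
    if not quoted:
        return []
    sql = (
        "SELECT content FROM docs WHERE docs MATCH ? "
        # bm25() is more negative for better matches, so negate it.
        "ORDER BY (-bm25(docs) * 0.7 + importance * 0.3) DESC LIMIT ?"
    )
    rows = []
    for match in (" AND ".join(quoted), " OR ".join(quoted)):
        rows = con.execute(sql, (match, k)).fetchall()
        if rows:
            break
    return [r[0] for r in rows]


print(search("run"))             # "running" is not matched: no stemming
print(search("nightly backup"))  # AND succeeds, so OR is never tried
print(search("backup linter"))   # AND finds nothing, OR broadens the match
```

The last query shows why the fallback matters for recall: no single row contains both terms, so the AND form returns nothing and the OR form returns every row containing either term.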