research: benchmark hybrid (lexical+dense+graph) recall vs current FTS
Viktor asked to enhance the memory system with 'semantics' (remember concepts, not just tokens, linked in a graph) and to prove, by benchmarking against the current system, that it actually improves recall.

A multi-phase research workflow (18 agents) did landscape research, an adversarially-reviewed integration design, a stratified eval set over the real 5,452-memory corpus, and a head-to-head prototype-vs-current benchmark.

Result: hybrid (lexical FTS + dense embeddings, RRF-fused) beats FTS on every overall metric, driven by a robust paraphrase win (recall@10 +0.350). Recommend adopting lexical+dense; the concept graph is DEFERRED.

Post-run adversarial review correction (applied to all docs before commit): the prototype's fusion config structurally barred the graph leg from the ranked top-k, so the 'graph contributes nothing' ablation was a math artifact, NOT an empirical result; the graph is UNEVALUATED, not disproven (deferred on cost+uncertainty). Multi-hop deltas are not statistically significant.

Glossary in CONTEXT.md; framing in ADR-0001-0003; findings in ADR-0004-0006 + docs/research/.

Privacy: the corpus/queries/qrels/results are the user's real memories and stay gitignored (data/, cache/, results/, build_eval_set.py); only harness code, aggregate numbers, and synthetic examples are committed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
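The "RRF-fused" step named in the message can be illustrated with a minimal sketch of standard reciprocal rank fusion. This is not the prototype's code: the function name `rrf_fuse`, the constant `k=60` (a commonly used default), the id-based tie-break, and the sample ids are all assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=10):
    """Fuse several best-first rankings with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the rankings that contain
    it (rank is 1-based). k=60 is an assumed default, not the prototype's.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


# Hypothetical ids: one ranking per retrieval leg.
lexical = [101, 102, 103]
dense = [102, 104, 101]
fused = rrf_fuse([lexical, dense])
```

A document found by both legs (102, 101) outranks one found by a single leg, which is the property that lets a dense leg rescue paraphrase queries the lexical leg ranks poorly.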
commit 1cc8a2b378 (parent 7439540f8f)
23 changed files with 3428 additions and 0 deletions
224  benchmarks/retrievers/fts.py  Normal file
@@ -0,0 +1,224 @@
"""BASELINE retriever: the product's CURRENT lexical recall (SQLite FTS5/BM25).

This is the "current system" the hybrid upgrade (dense embeddings + concept
graph, ADR-0001) must beat on recall@k / nDCG@10 / MRR. It is a *faithful*
reimplementation of the production local-store recall path, not an idealised
sketch; it mirrors ``src/claude_memory/mcp_server.py :: _sqlite_recall`` (and
the FTS5 schema/triggers in the same module) line-for-line where it matters.

Production recall (``sort_by="relevance"``) does ALL of the following, and so
does this retriever:

1. **Concatenate then split.** The MCP tool builds
   ``all_terms = f"{context} {expanded_query}"`` and splits it on whitespace,
   stripping any embedded ``"`` from each token. The harness already hands us
   one ``query`` string (the concatenation happens upstream of recall), so here
   ``query`` IS ``all_terms``; we split + strip identically.

2. **AND-first, then OR-broaden.** Production builds BOTH
   ``'"w1" AND "w2" ...'`` and ``'"w1" OR "w2" ...'`` and runs the **AND** match
   first; only if it returns zero rows does it fall back to the **OR** match.
   (The README's "Search Algorithm" prose shows only the OR form; the *code* is
   AND→OR, and the code is authoritative. We replicate the code.)

3. **Blended BM25+importance relevance ordering.** ``sort_by="relevance"`` is
   NOT a pure ``ORDER BY bm25()``. It is the blend
   ``(-bm25(memories_fts) * 0.7 + importance * 0.3) DESC`` (bm25 is negated
   because SQLite returns more-negative = better-match). We use the EXACT same
   expression. We deliberately evaluate ``relevance`` (not the production
   ``importance`` default) so the benchmark measures RETRIEVAL quality rather
   than the importance-sort prior, per the research brief.

4. **FTS5 default tokenizer.** The production virtual table is declared with no
   explicit tokenizer, i.e. ``unicode61``: case-folding + unicode diacritic
   stripping, NO stemming and NO stop-word removal. We declare ours the same
   way, so "running" does not match "run" (a known lexical weakness the dense
   path is expected to fix on the *paraphrase* stratum).

5. **LIKE fallback.** If the FTS5 MATCH raises ``sqlite3.OperationalError``
   (e.g. a token that trips the FTS5 query grammar), production degrades to a
   ``content LIKE %context% OR tags LIKE %context%`` scan ordered by importance.
   We mirror that fallback (using the full query as the LIKE needle, since the
   harness query is the whole ``all_terms``).

DIFFERENCES FROM PRODUCTION (all immaterial to ranking, documented for honesty):
- The benchmark corpus has no per-user / soft-delete / category filtering, so we
  drop the ``user_id``/``deleted_at``/``category`` predicates. No category is
  passed by the harness, so the category branch is never taken anyway.
- We build a fresh in-memory FTS5 index over ``data/corpus.jsonl`` rather than
  reading the live ``memory.db``; same schema, same tokenizer, same columns
  (content/category/tags/expanded_keywords), so BM25 statistics match what the
  product would compute over the same documents.

The harness reference ``harness.baselines.SqliteFtsRetriever`` implements the
*README* ordering (pure ``ORDER BY bm25(), importance``). This module is the
faithful-to-the-CODE variant and is the one the RUN reports as ``retriever="fts"``.
"""
from __future__ import annotations

import re
import sqlite3
from collections.abc import Sequence

# Import the corpus dataclass from the sibling harness package. run_eval.py and
# run_benchmark put the benchmarks/ root on sys.path; support direct execution
# (python retrievers/fts.py) too by adding it ourselves if the import fails.
try:  # pragma: no cover - exercised by both import paths
    from harness.types import Memory, MemoryId
except ModuleNotFoundError:  # pragma: no cover
    import sys
    from pathlib import Path

    sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
    from harness.types import Memory, MemoryId

# Mirror production token extraction: split ``all_terms`` on whitespace and strip
# any embedded double-quote from each token (mcp_server uses
# ``w.replace(chr(34), "")``). We lowercase as well; FTS5 unicode61 case-folds
# regardless, so this only normalises the quoted MATCH literals we emit.
_DQUOTE = '"'


class FtsRetriever:
    """Faithful reimplementation of the production SQLite FTS5/BM25 recall.

    Mirrors ``_sqlite_recall(sort_by="relevance")``: AND-first then OR-broaden
    over an FTS5(content, category, tags, expanded_keywords) index, ranked by
    the blended ``(-bm25*0.7 + importance*0.3)`` score, with a LIKE fallback.
    """

    #: Label surfaced in benchmark reports / the RUN schema.
    name = "fts"

    def __init__(self, sort_by: str = "relevance") -> None:
        # We benchmark "relevance" so the metric reflects retrieval quality, not
        # the importance prior. "importance" is kept for parity / diagnostics.
        if sort_by not in ("relevance", "importance"):
            raise ValueError(f"sort_by must be 'relevance' or 'importance', got {sort_by!r}")
        self.sort_by = sort_by
        self._con: sqlite3.Connection | None = None

    # ── lifecycle hooks (duck-typed by the runner) ───────────────────────────

    def build_index(self, corpus: Sequence[Memory]) -> None:
        """Build a fresh in-memory FTS5 index over the corpus.

        Same virtual-table shape and (default ``unicode61``) tokenizer as the
        production ``memories_fts`` table. We carry ``memory_id`` and
        ``importance`` as UNINDEXED columns so the relevance blend can read
        importance without a join; this is semantically identical to the
        production ``memories m JOIN memories_fts fts ON m.id = fts.rowid`` read.
        """
        con = sqlite3.connect(":memory:")
        con.execute(
            """
            CREATE VIRTUAL TABLE memories_fts USING fts5(
                content, category, tags, expanded_keywords,
                memory_id UNINDEXED, importance UNINDEXED
            )
            """
        )
        con.executemany(
            "INSERT INTO memories_fts"
            "(content, category, tags, expanded_keywords, memory_id, importance)"
            " VALUES (?,?,?,?,?,?)",
            [
                (
                    m.content,
                    m.category,
                    m.tags,
                    m.expanded_keywords,
                    int(m.id),
                    float(m.importance),
                )
                for m in corpus
            ],
        )
        con.commit()
        self._con = con

    def index_size_bytes(self) -> int:
        """Approximate on-disk index size (sum of FTS5 shadow-table page bytes).

        The index is in-memory, so this is the SQLite page accounting for the
        FTS5 shadow tables. It is reported for the storage column and is
        non-gating per ADR-0001.
        """
        if self._con is None:
            return 0
        try:
            page_count = self._con.execute("PRAGMA page_count").fetchone()[0]
            page_size = self._con.execute("PRAGMA page_size").fetchone()[0]
            return int(page_count) * int(page_size)
        except sqlite3.Error:
            return 0

    # ── query construction (mirrors _sqlite_recall) ──────────────────────────

    @staticmethod
    def _tokens(query: str) -> list[str]:
        """Split ``all_terms`` exactly as production does: whitespace split,
        drop embedded double-quotes, drop empties."""
        return [w.replace(_DQUOTE, "").lower() for w in query.split() if w.strip()]

    @classmethod
    def _and_or_queries(cls, query: str) -> tuple[str, str]:
        """Build the ('"w1" AND "w2" ...', '"w1" OR "w2" ...') MATCH pair."""
        words = cls._tokens(query)
        if not words:
            return "", ""
        quoted = [f'"{w}"' for w in words]
        return " AND ".join(quoted), " OR ".join(quoted)

    def _order_clause(self) -> str:
        # bm25() is negative (more-negative = better), so negate before blending.
        if self.sort_by == "relevance":
            return "(-bm25(memories_fts) * 0.7 + importance * 0.3) DESC"
        return "(-bm25(memories_fts) * 0.4 + importance * 0.6) DESC"

    # ── retrieve ──────────────────────────────────────────────────────────────

    def retrieve(self, query: str, k: int) -> list[MemoryId]:
        """Return up to ``k`` memory ids, ranked best-first.

        AND-match first (precise); if it yields nothing, OR-broaden. On an FTS5
        grammar error, fall back to a LIKE scan ordered by importance, exactly
        the production degradation path.
        """
        assert self._con is not None, "call build_index first"
        and_query, or_query = self._and_or_queries(query)
        if not or_query:  # no usable tokens
            return []

        order = self._order_clause()
        base_select = "SELECT memory_id FROM memories_fts WHERE memories_fts MATCH ? "
        try:
            rows: list[tuple[int]] = []
            # AND first for precise matches, fall back to OR for broader recall.
            for fts_query in (and_query, or_query):
                rows = self._con.execute(
                    f"{base_select}ORDER BY {order} LIMIT ?",
                    (fts_query, k),
                ).fetchall()
                if rows:
                    break
        except sqlite3.OperationalError:
            # Mirror production LIKE fallback: full query as the needle,
            # ordered by importance.
            like = f"%{query}%"
            rows = self._con.execute(
                "SELECT memory_id FROM memories_fts "
                "WHERE content LIKE ? OR tags LIKE ? "
                "ORDER BY importance DESC LIMIT ?",
                (like, like, k),
            ).fetchall()
        return [r[0] for r in rows]

    def close(self) -> None:
        if self._con is not None:
            self._con.close()
            self._con = None


# Convenience for `run_eval.py --retriever retrievers.fts:FtsRetriever`
# and a no-arg default instantiation (sort_by="relevance").
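
The recall path documented in this file (quoted tokens, AND-first then OR-broaden, blended BM25+importance ordering, default `unicode61` tokenizer) can be tried in isolation with the self-contained sketch below. It is an illustration under assumptions, not part of the committed module: the table name `notes_fts`, the helpers `build` and `recall`, and the three sample rows are hypothetical, and it needs a Python `sqlite3` build with FTS5 enabled.

```python
import sqlite3


def build(con, docs):
    """Create a tiny FTS5 table (default unicode61 tokenizer) and load docs."""
    con.execute(
        "CREATE VIRTUAL TABLE notes_fts USING fts5("
        "content, note_id UNINDEXED, importance UNINDEXED)"
    )
    con.executemany(
        "INSERT INTO notes_fts(content, note_id, importance) VALUES (?,?,?)",
        docs,
    )


def recall(con, query, k):
    """AND-first, then OR-broaden, ranked by the blended relevance score."""
    words = [w.replace('"', "") for w in query.split()]
    quoted = [f'"{w}"' for w in words if w]
    if not quoted:
        return []
    sql = (
        "SELECT note_id FROM notes_fts WHERE notes_fts MATCH ? "
        "ORDER BY (-bm25(notes_fts) * 0.7 + importance * 0.3) DESC LIMIT ?"
    )
    for match in (" AND ".join(quoted), " OR ".join(quoted)):
        rows = con.execute(sql, (match, k)).fetchall()
        if rows:  # stop at the first form that returns anything
            return [r[0] for r in rows]
    return []


con = sqlite3.connect(":memory:")
build(con, [
    ("sqlite backup schedule", 1, 5),
    ("backup rotation policy", 2, 5),
    ("running the nightly job", 3, 5),
])

and_hit = recall(con, "sqlite backup", 5)    # both words occur only in note 1
or_hits = recall(con, "sqlite rotation", 5)  # AND is empty, so OR broadens
no_stem = recall(con, "run", 5)              # no stemming: "run" != "running"
```

The last call shows the lexical weakness the module docstring mentions: without stemming, a query for "run" finds nothing even though a note contains "running".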