Viktor asked to enhance the memory system with 'semantics' (remember concepts, not just tokens, linked in a graph) and to prove, by benchmarking against the current system, that it actually improves recall. A multi-phase research workflow (18 agents) did landscape research, an adversarially-reviewed integration design, a stratified eval set over the real 5,452-memory corpus, and a head-to-head prototype-vs-current benchmark.

Result: hybrid (lexical FTS + dense embeddings, RRF-fused) beats FTS on every overall metric, driven by a robust paraphrase win (recall@10 +0.350). Recommend adopting lexical+dense; the concept graph is DEFERRED.

Post-run adversarial review correction (applied to all docs before commit): the prototype's fusion config structurally barred the graph leg from the ranked top-k, so the 'graph contributes nothing' ablation was a math artifact, NOT an empirical result. The graph is UNEVALUATED, not disproven (deferred on cost+uncertainty). Multi-hop deltas are not statistically significant.

Glossary in CONTEXT.md; framing in ADR-0001-0003; findings in ADR-0004-0006 + docs/research/.

Privacy: the corpus/queries/qrels/results are the user's real memories and stay gitignored (data/, cache/, results/, build_eval_set.py); only harness code, aggregate numbers, and synthetic examples are committed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
224 lines
10 KiB
Python
"""BASELINE retriever: the product's CURRENT lexical recall (SQLite FTS5/BM25).

This is the "current system" the hybrid upgrade (dense embeddings + concept
graph, ADR-0001) must beat on recall@k / nDCG@10 / MRR. It is a *faithful*
reimplementation of the production local-store recall path, not an idealised
sketch: it mirrors ``src/claude_memory/mcp_server.py :: _sqlite_recall`` (and
the FTS5 schema/triggers in the same module) line-for-line where it matters.

Production recall (``sort_by="relevance"``) does ALL of the following, and so
does this retriever:

1. **Concatenate then split.** The MCP tool builds
   ``all_terms = f"{context} {expanded_query}"`` and splits it on whitespace,
   stripping any embedded ``"`` from each token. The harness already hands us
   one ``query`` string (the concatenation happens upstream of recall), so here
   ``query`` IS ``all_terms``; we split + strip identically.

2. **AND-first, then OR-broaden.** Production builds BOTH
   ``'"w1" AND "w2" ...'`` and ``'"w1" OR "w2" ...'`` and runs the **AND** match
   first; only if it returns zero rows does it fall back to the **OR** match.
   (The README's "Search Algorithm" prose shows only the OR form; the *code* is
   AND→OR, and the code is authoritative. We replicate the code.)

3. **Blended BM25+importance relevance ordering.** ``sort_by="relevance"`` is
   NOT a pure ``ORDER BY bm25()``. It is the blend
   ``(-bm25(memories_fts) * 0.7 + importance * 0.3) DESC`` (bm25 is negated
   because SQLite returns more-negative = better-match). We use the EXACT same
   expression. We deliberately evaluate ``relevance`` (not the production
   ``importance`` default) so the benchmark measures RETRIEVAL quality rather
   than the importance-sort prior, per the research brief.

4. **FTS5 default tokenizer.** The production virtual table is declared with no
   explicit tokenizer, i.e. ``unicode61``: case-folding + unicode diacritic
   stripping, NO stemming and NO stop-word removal. We declare ours the same
   way, so "running" does not match "run" (a known lexical weakness the dense
   path is expected to fix on the *paraphrase* stratum).

5. **LIKE fallback.** If the FTS5 MATCH raises ``sqlite3.OperationalError``
   (e.g. a token that trips the FTS5 query grammar), production degrades to a
   ``content LIKE %context% OR tags LIKE %context%`` scan ordered by importance.
   We mirror that fallback (using the full query as the LIKE needle, since the
   harness query is the whole ``all_terms``).

DIFFERENCES FROM PRODUCTION (all immaterial to ranking, documented for honesty):
- The benchmark corpus has no per-user / soft-delete / category filtering, so we
  drop the ``user_id``/``deleted_at``/``category`` predicates. No category is
  passed by the harness, so the category branch is never taken anyway.
- We build a fresh in-memory FTS5 index over ``data/corpus.jsonl`` rather than
  reading the live ``memory.db``; same schema, same tokenizer, same columns
  (content/category/tags/expanded_keywords), so BM25 statistics match what the
  product would compute over the same documents.

The harness reference ``harness.baselines.SqliteFtsRetriever`` implements the
*README* ordering (pure ``ORDER BY bm25(), importance``). This module is the
faithful-to-the-CODE variant and is the one the RUN reports as ``retriever="fts"``.
"""
from __future__ import annotations

import re
import sqlite3
from collections.abc import Sequence

# Import the corpus dataclass from the sibling harness package. run_eval.py and
# run_benchmark put the benchmarks/ root on sys.path; support direct execution
# (python retrievers/fts.py) too by adding it ourselves if the import fails.
try:  # pragma: no cover - exercised by both import paths
    from harness.types import Memory, MemoryId
except ModuleNotFoundError:  # pragma: no cover
    import sys
    from pathlib import Path

    sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
    from harness.types import Memory, MemoryId

# Mirror production token extraction: split ``all_terms`` on whitespace and strip
# any embedded double-quote from each token (mcp_server uses
# ``w.replace(chr(34), "")``). We lowercase as well; FTS5 unicode61 case-folds
# regardless, so this only normalises the quoted MATCH literals we emit.
_DQUOTE = '"'


class FtsRetriever:
    """Faithful reimplementation of the production SQLite FTS5/BM25 recall.

    Mirrors ``_sqlite_recall(sort_by="relevance")``: AND-first then OR-broaden
    over an FTS5(content, category, tags, expanded_keywords) index, ranked by
    the blended ``(-bm25*0.7 + importance*0.3)`` score, with a LIKE fallback.
    """

    #: Label surfaced in benchmark reports / the RUN schema.
    name = "fts"

    def __init__(self, sort_by: str = "relevance") -> None:
        # We benchmark "relevance" so the metric reflects retrieval quality, not
        # the importance prior. "importance" is kept for parity / diagnostics.
        if sort_by not in ("relevance", "importance"):
            raise ValueError(f"sort_by must be 'relevance' or 'importance', got {sort_by!r}")
        self.sort_by = sort_by
        self._con: sqlite3.Connection | None = None

    # ── lifecycle hooks (duck-typed by the runner) ───────────────────────────

    def build_index(self, corpus: Sequence[Memory]) -> None:
        """Build a fresh in-memory FTS5 index over the corpus.

        Same virtual-table shape and (default ``unicode61``) tokenizer as the
        production ``memories_fts`` table. We carry ``memory_id`` and
        ``importance`` as UNINDEXED columns so the relevance blend can read
        importance without a join, which is semantically identical to the
        production ``memories m JOIN memories_fts fts ON m.id = fts.rowid`` read.
        """
        con = sqlite3.connect(":memory:")
        con.execute(
            """
            CREATE VIRTUAL TABLE memories_fts USING fts5(
                content, category, tags, expanded_keywords,
                memory_id UNINDEXED, importance UNINDEXED
            )
            """
        )
        con.executemany(
            "INSERT INTO memories_fts"
            "(content, category, tags, expanded_keywords, memory_id, importance)"
            " VALUES (?,?,?,?,?,?)",
            [
                (
                    m.content,
                    m.category,
                    m.tags,
                    m.expanded_keywords,
                    int(m.id),
                    float(m.importance),
                )
                for m in corpus
            ],
        )
        con.commit()
        self._con = con

    def index_size_bytes(self) -> int:
        """Approximate on-disk index size (sum of FTS5 shadow-table page bytes).

        The index is in-memory, so this is the SQLite page accounting for the
        FTS5 shadow tables; it is reported for the storage column and is
        non-gating per ADR-0001.
        """
        if self._con is None:
            return 0
        try:
            page_count = self._con.execute("PRAGMA page_count").fetchone()[0]
            page_size = self._con.execute("PRAGMA page_size").fetchone()[0]
            return int(page_count) * int(page_size)
        except sqlite3.Error:
            return 0

    # ── query construction (mirrors _sqlite_recall) ──────────────────────────

    @staticmethod
    def _tokens(query: str) -> list[str]:
        """Split ``all_terms`` exactly as production does: whitespace split,
        drop embedded double-quotes, drop empties."""
        return [w.replace(_DQUOTE, "").lower() for w in query.split() if w.strip()]

    @classmethod
    def _and_or_queries(cls, query: str) -> tuple[str, str]:
        """Build the ('"w1" AND "w2" ...', '"w1" OR "w2" ...') MATCH pair."""
        words = cls._tokens(query)
        if not words:
            return "", ""
        quoted = [f'"{w}"' for w in words]
        return " AND ".join(quoted), " OR ".join(quoted)

    def _order_clause(self) -> str:
        # bm25() is negative (more-negative = better), so negate before blending.
        if self.sort_by == "relevance":
            return "(-bm25(memories_fts) * 0.7 + importance * 0.3) DESC"
        return "(-bm25(memories_fts) * 0.4 + importance * 0.6) DESC"

    # ── retrieve ──────────────────────────────────────────────────────────────

    def retrieve(self, query: str, k: int) -> list[MemoryId]:
        """Return up to ``k`` memory ids, ranked best-first.

        AND-match first (precise); if it yields nothing, OR-broaden. On an FTS5
        grammar error, fall back to a LIKE scan ordered by importance, exactly
        the production degradation path.
        """
        assert self._con is not None, "call build_index first"
        and_query, or_query = self._and_or_queries(query)
        if not or_query:  # no usable tokens
            return []

        order = self._order_clause()
        base_select = "SELECT memory_id FROM memories_fts WHERE memories_fts MATCH ? "
        try:
            rows: list[tuple[int]] = []
            # AND first for precise matches, fall back to OR for broader recall.
            for fts_query in (and_query, or_query):
                rows = self._con.execute(
                    f"{base_select}ORDER BY {order} LIMIT ?",
                    (fts_query, k),
                ).fetchall()
                if rows:
                    break
        except sqlite3.OperationalError:
            # Mirror production LIKE fallback: full query as the needle,
            # ordered by importance.
            like = f"%{query}%"
            rows = self._con.execute(
                "SELECT memory_id FROM memories_fts "
                "WHERE content LIKE ? OR tags LIKE ? "
                "ORDER BY importance DESC LIMIT ?",
                (like, like, k),
            ).fetchall()
        return [r[0] for r in rows]

    def close(self) -> None:
        if self._con is not None:
            self._con.close()
            self._con = None


# Convenience for `run_eval.py --retriever retrievers.fts:FtsRetriever`
# and a no-arg default instantiation (sort_by="relevance").
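The AND-first, OR-broaden lookup and the blended ordering described in the module docstring can be tried in isolation. The following is a minimal standalone sketch, not the module's own code: the table name `docs`, the three-row `DOCS` corpus, and the `search` helper are all hypothetical, and it assumes the local SQLite build ships the FTS5 extension. Unlike `FtsRetriever._tokens`, it filters empty tokens after stripping quotes, which is a simplification for the sketch.

```python
import sqlite3

# Hypothetical toy corpus: (content, importance). Not taken from the benchmark.
DOCS = [
    ("sqlite fts5 ranks rows with bm25", 0.2),
    ("dense embeddings help paraphrase recall", 0.9),
    ("bm25 is a lexical ranking function", 0.5),
]

con = sqlite3.connect(":memory:")
# importance is UNINDEXED: stored alongside the text but not searchable.
con.execute("CREATE VIRTUAL TABLE docs USING fts5(content, importance UNINDEXED)")
con.executemany("INSERT INTO docs(content, importance) VALUES (?, ?)", DOCS)


def search(query: str, k: int = 5) -> list[str]:
    """Run the AND match first; only if it finds nothing, retry with OR."""
    words = [w.replace('"', "").lower() for w in query.split()]
    quoted = [f'"{w}"' for w in words if w]
    if not quoted:
        return []
    # bm25() is more negative for better matches, so negate before blending.
    sql = (
        "SELECT content FROM docs WHERE docs MATCH ? "
        "ORDER BY (-bm25(docs) * 0.7 + importance * 0.3) DESC LIMIT ?"
    )
    rows: list[tuple[str]] = []
    for match in (" AND ".join(quoted), " OR ".join(quoted)):
        rows = con.execute(sql, (match, k)).fetchall()
        if rows:
            break
    return [r[0] for r in rows]
```

With this toy corpus, `search("bm25 sqlite")` is satisfied by the AND pass (one row contains both terms), while `search("sqlite embeddings")` finds nothing under AND and falls through to OR, returning the two rows that each contain one of the terms.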