From 7439540f8f2552824884ef695f8b3b9878007e07 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Thu, 25 Jun 2026 15:36:30 +0000
Subject: [PATCH 1/2] docs: glossary + ADRs for semantic/concept-graph memory

Captures the design language (CONTEXT.md) and the framing decisions from the
requirements interview: pursue hybrid embeddings+concept-graph retrieval gated
on a benchmark (0001), target the API/Postgres deployment while SQLite stays
lexical (0002), and permit hosted embedding APIs for non-sensitive memories
only (0003). Groundwork for the research/prototype/benchmark effort.

Co-Authored-By: Claude Opus 4.8
---
 CONTEXT.md                                    | 45 +++++++++++++++++++
 ...-retrieval-embeddings-and-concept-graph.md | 21 +++++++++
 ...api-postgres-first-sqlite-stays-lexical.md | 20 +++++++++
 ...apis-allowed-for-non-sensitive-memories.md | 19 ++++++++
 4 files changed, 105 insertions(+)
 create mode 100644 CONTEXT.md
 create mode 100644 docs/adr/0001-pursue-hybrid-retrieval-embeddings-and-concept-graph.md
 create mode 100644 docs/adr/0002-api-postgres-first-sqlite-stays-lexical.md
 create mode 100644 docs/adr/0003-external-embedding-apis-allowed-for-non-sensitive-memories.md

diff --git a/CONTEXT.md b/CONTEXT.md
new file mode 100644
index 0000000..6782e70
--- /dev/null
+++ b/CONTEXT.md
@@ -0,0 +1,45 @@
+# Claude Memory MCP
+
+Persistent cross-session memory for Claude. Today it stores **Memories** as rows and
+retrieves them by **lexical recall** (full-text keyword matching). This context is being
+extended with **semantic recall** (embeddings) and a **concept graph** so retrieval works
+by meaning and related memories become traversable.
+
+## Language
+
+**Memory**:
+A single stored unit of knowledge — a fact, preference, decision, project note, or person
+detail — with content plus metadata (category, tags, importance). The atomic thing a user
+stores and recalls.
+
+**Recall**:
+Retrieving the Memories most relevant to a query. The read path.
+
+**Lexical recall**:
+The existing retrieval method — matches Memories whose words (content, tags, LLM-generated
+keywords) overlap the query, ranked by BM25 / `ts_rank`. Matches *tokens*, not meaning.
+_Avoid_: calling this "semantic search" — it is not semantic.
+
+**Semantic recall**:
+Retrieval by meaning via dense-vector **Embedding** similarity, so a query surfaces a Memory
+even with zero shared words (e.g. "what UI library?" → "prefers Svelte").
+
+**Embedding**:
+A dense vector representation of a Memory's (or Concept's) meaning, used for Semantic recall.
+
+**Concept**:
+A distinct entity or idea that recurs across Memories (e.g. "Svelte", "Viktor", "TripIt",
+"frontend framework"). A node in the Concept graph. Distinct from a Memory: one Memory can
+mention several Concepts, and one Concept spans many Memories.
+
+**Concept graph**:
+The network of Concepts joined by typed **Relationships**, making the memory store
+traversable — from one Memory or Concept to related ones.
+
+**Relationship**:
+A typed, directed edge in the Concept graph, between two Concepts or between a Memory and a
+Concept (e.g. `prefers`, `is-a`, `used-in`, `mentions`).
+
+**Hybrid retrieval**:
+The target read path — combining Lexical recall, Semantic recall, and Concept-graph
+traversal into one ranked result set.
diff --git a/docs/adr/0001-pursue-hybrid-retrieval-embeddings-and-concept-graph.md b/docs/adr/0001-pursue-hybrid-retrieval-embeddings-and-concept-graph.md
new file mode 100644
index 0000000..e9f0cf9
--- /dev/null
+++ b/docs/adr/0001-pursue-hybrid-retrieval-embeddings-and-concept-graph.md
@@ -0,0 +1,21 @@
+# Pursue hybrid retrieval: embeddings + concept graph over pure lexical
+
+Today recall is **lexical only** (BM25 in SQLite, `tsvector`/`ts_rank` in Postgres over
+content + LLM-generated `expanded_keywords`). It matches *tokens*, so it misses
+paraphrase/synonym queries and cannot traverse between related Memories. We will pursue a
+**hybrid** read path that adds dense-vector **Semantic recall** and a traversable **Concept
+graph** (typed Relationships) alongside the existing Lexical recall.
+
+This decision is **gated on a benchmark**: we adopt hybrid only if it shows a material
+recall-quality uplift over the current lexical system on a stratified eval set (exact /
+paraphrase / multi-hop). If the benchmark shows no improvement, a later ADR supersedes this
+and we stay lexical.
+
+## Considered options
+
+- **Pure semantic (embeddings only)** — fixes paraphrase gaps but gives no real concept
+  traversal; rejected as the *sole* mechanism.
+- **Pure concept graph** — enables traversal but node-matching stays lexical, so paraphrase
+  gaps remain; rejected as the *sole* mechanism.
+- **Hybrid (chosen)** — embeddings for meaning + graph for traversal + existing FTS, fused
+  into one ranked result. Highest ceiling; the GraphRAG / Zep-Graphiti / HippoRAG family.
diff --git a/docs/adr/0002-api-postgres-first-sqlite-stays-lexical.md b/docs/adr/0002-api-postgres-first-sqlite-stays-lexical.md
new file mode 100644
index 0000000..9310b72
--- /dev/null
+++ b/docs/adr/0002-api-postgres-first-sqlite-stays-lexical.md
@@ -0,0 +1,20 @@
+# API/Postgres deployment gets semantics; SQLite-only stays lexical
+
+The semantic + concept-graph layer targets the **API/Postgres** deployment only: embeddings
+in pgvector on the (CNPG) Postgres, the Concept graph as node/edge tables in Postgres, and
+embedding/extraction via reused cluster infra (llama-cpp on GPU, or a hosted API). The
+**SQLite-only** mode keeps working but stays **lexical (FTS) only** — it gains no embeddings
+or graph, degrading gracefully.
+
+This is surprising because the README markets zero-config offline SQLite as the headline
+feature. We accept that trade-off: the operator actually runs the remote API/Postgres store,
+reuse-before-building favours cluster infra, and bundling a local embedding model into the
+zero-config path would add heavy dependencies and double the build/test matrix for little
+real-world benefit.
+
+## Consequences
+
+- All benchmark numbers are produced in API/Postgres mode.
+- Offline zero-config users see no behaviour change.
+- A future ADR may revisit offline semantics (e.g. via `sqlite-vec` + a small local model)
+  if there is demand.
diff --git a/docs/adr/0003-external-embedding-apis-allowed-for-non-sensitive-memories.md b/docs/adr/0003-external-embedding-apis-allowed-for-non-sensitive-memories.md
new file mode 100644
index 0000000..e0ec9f4
--- /dev/null
+++ b/docs/adr/0003-external-embedding-apis-allowed-for-non-sensitive-memories.md
@@ -0,0 +1,19 @@
+# External embedding/extraction APIs allowed for non-sensitive memories
+
+Embedding and concept extraction may use **hosted APIs** (e.g. OpenAI `text-embedding-3`,
+Voyage, Cohere) for **non-sensitive** memories, to access a higher quality ceiling than
+self-hosted models alone. **Sensitive / Vault-encrypted (secret) memories are never sent
+externally** and are excluded from the corpus that gets embedded or extracted.
+
+This is a deliberate relaxation of the homelab's usual local-only posture, made because the
+quality gain is worth it for non-secret personal memory content. The research/benchmark may
+still compare hosted vs self-hostable models (nomic-embed, bge-m3, gte-Qwen2, e5) so the
+production choice is data-driven; this ADR only records that egress is *permitted* within the
+sensitive-data boundary.
+
+## Consequences
+
+- The corpus-export step MUST filter out `is_sensitive` / secret memories before any external
+  call.
+- Production deployment needs an embedding API key (or falls back to the in-cluster
+  llama-cpp model when absent).
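
Reviewer note on ADR-0003: the "MUST filter out `is_sensitive`" consequence can be illustrated with a minimal sketch. The table name `memories`, its columns other than `is_sensitive`, and the helper names are assumptions made for this illustration; they are not taken from the patch, and the real `export_corpus.py` may work differently. The sketch fails closed: a row whose flag is NULL is treated as unknown and is not exported.

```python
import sqlite3


def export_non_sensitive(con: sqlite3.Connection) -> list[dict]:
    """Return only rows that are explicitly marked non-sensitive.

    `is_sensitive = 0` is false for NULL in SQL, so rows with an unknown
    flag are excluded as well (fail closed).
    """
    rows = con.execute(
        "SELECT id, content FROM memories WHERE is_sensitive = 0 ORDER BY id"
    ).fetchall()
    return [{"id": mid, "content": content} for mid, content in rows]


def _demo_db() -> sqlite3.Connection:
    # Hypothetical schema and synthetic rows, invented for this sketch.
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE memories (id INTEGER PRIMARY KEY, content TEXT, is_sensitive INTEGER)"
    )
    con.executemany(
        "INSERT INTO memories VALUES (?, ?, ?)",
        [(1, "prefers Svelte", 0), (2, "vault token", 1), (3, "flag never set", None)],
    )
    con.commit()
    return con


EXPORTED = export_non_sensitive(_demo_db())
```

Running the filter inside the export query, rather than after loading rows into memory, keeps sensitive content out of every later step that might call a hosted API.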
From 1cc8a2b378f5812a2db6238703dda17fbcd5f050 Mon Sep 17 00:00:00 2001
From: Viktor Barzin
Date: Thu, 25 Jun 2026 17:51:53 +0000
Subject: [PATCH 2/2] research: benchmark hybrid (lexical+dense+graph) recall vs current FTS
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Viktor asked to enhance the memory system with 'semantics' — remember concepts
(not just tokens) linked in a graph — and to prove, by benchmarking against the
current system, that it actually improves recall.

A multi-phase research workflow (18 agents) did landscape research, an
adversarially-reviewed integration design, a stratified eval set over the real
5,452-memory corpus, and a head-to-head prototype-vs-current benchmark.

Result: hybrid (lexical FTS + dense embeddings, RRF-fused) beats FTS on every
overall metric, driven by a robust paraphrase win (recall@10 +0.350). Recommend
adopting lexical+dense; the concept graph is DEFERRED.

Post-run adversarial review correction (applied to all docs before commit): the
prototype's fusion config structurally barred the graph leg from the ranked
top-k, so the 'graph contributes nothing' ablation was a math artifact, NOT an
empirical result — the graph is UNEVALUATED, not disproven (deferred on
cost+uncertainty). Multi-hop deltas are not statistically significant.

Glossary in CONTEXT.md; framing in ADR-0001-0003; findings in ADR-0004-0006 +
docs/research/.

Privacy: the corpus/queries/qrels/results are the user's real memories and stay
gitignored (data/, cache/, results/, build_eval_set.py); only harness code,
aggregate numbers, and synthetic examples are committed.

Co-Authored-By: Claude Opus 4.8
---
 benchmarks/.gitignore                         |  25 +
 benchmarks/README.md                          | 126 ++++
 benchmarks/harness/__init__.py                |  28 +
 benchmarks/harness/baselines.py               |  93 +++
 benchmarks/harness/dataset.py                 | 115 ++++
 benchmarks/harness/example_retriever.py       |  59 ++
 benchmarks/harness/metrics.py                 | 100 +++
 benchmarks/harness/runner.py                  | 223 +++++++
 benchmarks/harness/test_harness.py            | 145 +++++
 benchmarks/harness/types.py                   |  53 ++
 benchmarks/retrievers/__init__.py             |  10 +
 benchmarks/retrievers/fts.py                  | 224 +++++++
 benchmarks/retrievers/hybrid.py               | 570 ++++++++++++++++++
 benchmarks/retrievers/test_hybrid.py          | 204 +++++++
 benchmarks/scripts/dataset_stats.py           |  49 ++
 benchmarks/scripts/export_corpus.py           |  78 +++
 benchmarks/scripts/run_eval.py                |  65 ++
 ...-hybrid-lexical-dense-first-graph-gated.md |  45 ++
 docs/adr/0005-rrf-default-cc-challenger.md    |  40 ++
 ...-pgvector-hnsw-halfvec-1024d-embeddings.md |  49 ++
 docs/research/benchmark-report.md             | 312 ++++++++++
 docs/research/integration-design.md           | 292 +++++++++
 docs/research/survey.md                       | 523 ++++++++++++++++
 23 files changed, 3428 insertions(+)
 create mode 100644 benchmarks/.gitignore
 create mode 100644 benchmarks/README.md
 create mode 100644 benchmarks/harness/__init__.py
 create mode 100644 benchmarks/harness/baselines.py
 create mode 100644 benchmarks/harness/dataset.py
 create mode 100644 benchmarks/harness/example_retriever.py
 create mode 100644 benchmarks/harness/metrics.py
 create mode 100644 benchmarks/harness/runner.py
 create mode 100644 benchmarks/harness/test_harness.py
 create mode 100644 benchmarks/harness/types.py
 create mode 100644 benchmarks/retrievers/__init__.py
 create mode 100644 benchmarks/retrievers/fts.py
 create mode 100644 benchmarks/retrievers/hybrid.py
 create mode 100644 benchmarks/retrievers/test_hybrid.py
 create mode 100644 benchmarks/scripts/dataset_stats.py
 create mode 100644 benchmarks/scripts/export_corpus.py
 create mode 100644 benchmarks/scripts/run_eval.py
 create mode 100644 docs/adr/0004-phase-the-hybrid-lexical-dense-first-graph-gated.md
 create mode 100644 docs/adr/0005-rrf-default-cc-challenger.md
 create mode 100644 docs/adr/0006-pgvector-hnsw-halfvec-1024d-embeddings.md
 create mode 100644 docs/research/benchmark-report.md
 create mode 100644 docs/research/integration-design.md
 create mode 100644 docs/research/survey.md

diff --git a/benchmarks/.gitignore b/benchmarks/.gitignore
new file mode 100644
index 0000000..9761d5d
--- /dev/null
+++ b/benchmarks/.gitignore
@@ -0,0 +1,25 @@
+# Benchmark dataset is the user's REAL personal memories — NEVER commit.
+# Privacy hard-rule (research task brief): corpus/queries/qrels stay LOCAL.
+data/
+.venv/
+cache/
+*.npy
+*.faiss
+*.db
+
+# The eval-set GENERATOR embeds real memory-derived query text + author notes
+# (paraphrases of real memories, real ids/notes). Treat it as a data artifact:
+# LOCAL-ONLY, never commit. Regenerates data/ from corpus.jsonl. The HARNESS
+# itself (harness/*.py, the other scripts) contains NO real content and is safe.
+scripts/build_eval_set.py
+
+# Python noise
+__pycache__/
+*.pyc
+.pytest_cache/
+*.egg-info/
+.ipynb_checkpoints/
+
+# Results from runs may quote real content — keep local by default.
+results/
+*.results.json
diff --git a/benchmarks/README.md b/benchmarks/README.md
new file mode 100644
index 0000000..a2859ab
--- /dev/null
+++ b/benchmarks/README.md
@@ -0,0 +1,126 @@
+# claude-memory recall benchmark
+
+Stratified retrieval benchmark gating the hybrid-recall adoption decision
+(ADR-0001): does dense-vector semantic recall + a concept graph beat the current
+lexical FTS on **recall@5, recall@10, nDCG@10, MRR**? Quality decides adoption;
+latency/storage are measured but non-gating.
+
+> **PRIVACY — read first.** The corpus is the operator's REAL personal memories.
+> `data/` (corpus/queries/qrels), `.venv/`, `cache/`, `results/`, and
+> `scripts/build_eval_set.py` (the generator embeds memory-derived query text)
+> are **gitignored and must never be committed**. Everything else here contains
+> only code / aggregate numbers and is safe to commit. Sensitive memories
+> (`is_sensitive=1`) are excluded from the corpus entirely.
+
+## Layout
+
+```
+benchmarks/
+  harness/                 # importable package (committable; no real content)
+    types.py               # Memory, Query, Qrels, Retriever protocol
+    metrics.py             # recall@k, nDCG@k, MRR (binary relevance)
+    dataset.py             # load_dataset() + referential-integrity validation
+    runner.py              # run_benchmark() -> overall + per-stratum + latency
+    baselines.py           # SqliteFtsRetriever (faithful FTS5/BM25 reference)
+    example_retriever.py   # worked example of the plug-in interface
+    test_harness.py        # unit tests (pytest)
+  scripts/
+    export_corpus.py       # SQLite -> data/corpus.jsonl (non-sensitive only)
+    build_eval_set.py      # -> data/queries.jsonl + qrels.jsonl [GITIGNORED]
+    dataset_stats.py       # validate + print AGGREGATE stats (safe)
+    run_eval.py            # CLI: run a retriever, print/save metrics
+  data/                    # [GITIGNORED] corpus.jsonl, queries.jsonl, qrels.jsonl
+  .venv/                   # [GITIGNORED]
+```
+
+## Dataset schema (JSONL, one object per line)
+
+**`corpus.jsonl`** — every non-sensitive memory:
+```json
+{"id": 137, "content": "...", "category": "decisions", "tags": "memory,architecture",
+ "expanded_keywords": "...", "importance": 0.85}
+```
+`id` (int) is the join key everywhere. `tags` is comma-separated; `expanded_keywords`
+space-separated (matches the production schema).
+
+**`queries.jsonl`** — eval queries, three strata:
+```json
+{"query_id": "para_006", "text": "...", "stratum": "paraphrase", "relevant_ids": [380],
+ "_note": "author rationale", "_jaccard": 0.023}
+```
+- `stratum` ∈ `exact` | `paraphrase` | `multihop`.
+- `relevant_ids` is a convenience copy; **`qrels.jsonl` is authoritative**.
+- `_note` / `_jaccard` are provenance fields (underscore-prefixed); ignore them in
+  scoring.
+
+**`qrels.jsonl`** — binary relevance judgments (authoritative):
+```json
+{"query_id": "multi_006", "relevant_ids": [263, 423, 637]}
+```
+
+### Strata (what each one tests)
+
+| stratum | construction | who should win |
+|---|---|---|
+| **exact** | query = a salient phrase lifted from ONE memory; that memory is relevant (verified as the top FTS hit at build time) | lexical already strong; floor check |
+| **paraphrase** | query restates ONE memory's meaning in DIFFERENT words (low lexical overlap, validated Jaccard ≤ ~0.18 vs content+keywords) | **dense embeddings** |
+| **multihop** | query needs 2+ DISTINCT memories sharing an entity/concept (e.g. project + decision, or a multi-part runbook); ALL are relevant | **concept graph** |
+
+Where a near-duplicate memory equally satisfies a single-target query, qrels was
+augmented to include the twin (so a good retriever isn't penalised); deliberate
+discriminator queries are kept single-target on purpose.
+
+## Pluggable retriever interface
+
+A retriever is any object implementing **one** method:
+
+```python
+def retrieve(self, query: str, k: int) -> list[int]:
+    """Return up to k memory ids (corpus `id`s), ranked best-first."""
+```
+
+Optional lifecycle hooks the runner uses if present (duck-typed):
+
+```python
+def build_index(self, corpus: list[Memory]) -> None: ...   # timed separately
+def index_size_bytes(self) -> int: ...                     # reported
+name: str                                                  # label in reports
+```
+
+A bare callable `retrieve(query, k) -> list[int]` also works.
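
Reviewer note: the two accepted retriever shapes (an object with a `retrieve` method, or a bare callable) can be tried without the harness installed. The sketch below is illustrative only. `TOY_CORPUS`, `KeywordRetriever`, `first_k`, and `resolve` are hypothetical names invented here; `resolve` is modelled on the duck-typed dispatch that `runner.py` performs in `_get_retrieve_fn`, and it is not a copy of the harness API.

```python
from typing import Callable

# Hypothetical three-item corpus, invented for this sketch: (id, content).
TOY_CORPUS = [
    (1, "prefers svelte for ui work"),
    (2, "postgres backup runbook"),
    (3, "svelte store patterns"),
]


class KeywordRetriever:
    """Object form: exposes retrieve(query, k) plus the optional `name` label."""

    name = "keyword_demo"

    def retrieve(self, query: str, k: int) -> list[int]:
        words = query.lower().split()
        # Score = total occurrences of the query words in the content.
        scored = [(sum(text.count(w) for w in words), mid) for mid, text in TOY_CORPUS]
        # Keep matches only; best score first, lower id breaks ties.
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda t: (-t[0], t[1]))
        return [mid for _, mid in ranked[:k]]


def first_k(query: str, k: int) -> list[int]:
    """Bare-callable form: ignores the query and returns ids in corpus order."""
    return [mid for mid, _ in TOY_CORPUS[:k]]


def resolve(retriever) -> Callable[[str, int], list[int]]:
    """Duck-typed dispatch: prefer a .retrieve method, else call the object itself."""
    if hasattr(retriever, "retrieve"):
        return retriever.retrieve
    if callable(retriever):
        return retriever
    raise TypeError("retriever must implement retrieve(query, k) or be callable")
```

Both `resolve(KeywordRetriever())` and `resolve(first_k)` yield a function with the same `(query, k)` signature, which is why a runner written against that signature needs no special casing for either shape.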
+
+## Run it
+
+```bash
+.venv/bin/python scripts/export_corpus.py      # (re)build data/corpus.jsonl
+.venv/bin/python scripts/build_eval_set.py     # (re)build queries+qrels (local)
+.venv/bin/python scripts/dataset_stats.py      # validate + aggregate stats
+.venv/bin/python -m pytest harness/test_harness.py -q
+
+# evaluate a retriever (built-in alias or module:Class)
+.venv/bin/python scripts/run_eval.py --retriever fts5
+.venv/bin/python scripts/run_eval.py --retriever your_pkg.mod:YourRetriever --json results/yours.json
+```
+
+Programmatic use:
+
+```python
+from harness import load_dataset, run_benchmark
+ds = load_dataset()
+result = run_benchmark(MyRetriever(), ds)   # builds index, times queries
+print(result.summary())                     # overall + per-stratum table
+result.to_dict()                            # full machine-readable result
+```
+
+`run_benchmark` requests `retrieve_k=20` per query by default (≥ the max metric
+cutoff of 10), macro-averages metrics over queries (overall + per stratum), and
+reports per-query latency p50/p95 plus index build time/size when the hooks exist.
+
+## Reference baseline
+
+`harness.baselines.SqliteFtsRetriever` mirrors the production local-store search
+(README "Search Algorithm"): FTS5 over content/category/tags/expanded_keywords,
+`'"w1" OR "w2" ...'` MATCH, `ORDER BY bm25(), importance`. This is the lexical
+"current system" any hybrid retriever must beat. (The Postgres `tsvector` path
+uses weighted A/B/C/D ranking and an importance-first default; FTS5/BM25 is the
+faithful, dependency-free relevance reference for the quality comparison.)
diff --git a/benchmarks/harness/__init__.py b/benchmarks/harness/__init__.py
new file mode 100644
index 0000000..eb2b856
--- /dev/null
+++ b/benchmarks/harness/__init__.py
@@ -0,0 +1,28 @@
+"""Benchmark harness for claude-memory recall evaluation.
+
+Public API:
+    from harness import Retriever, load_dataset, run_benchmark, BenchmarkResult
+    from harness import metrics
+
+A retriever is any object (or callable) implementing:
+    retrieve(query: str, k: int) -> list[memory_id]   # ranked, best first
+
+memory_id matches the `id` field in corpus.jsonl / qrels.jsonl (int).
+"""
+from .types import Retriever, Query, Memory, Qrels
+from .dataset import load_dataset, Dataset
+from .runner import run_benchmark, BenchmarkResult, StratumResult
+from . import metrics
+
+__all__ = [
+    "Retriever",
+    "Query",
+    "Memory",
+    "Qrels",
+    "load_dataset",
+    "Dataset",
+    "run_benchmark",
+    "BenchmarkResult",
+    "StratumResult",
+    "metrics",
+]
diff --git a/benchmarks/harness/baselines.py b/benchmarks/harness/baselines.py
new file mode 100644
index 0000000..b5fd678
--- /dev/null
+++ b/benchmarks/harness/baselines.py
@@ -0,0 +1,93 @@
+"""Reference LEXICAL baseline retrievers that mirror the production system.
+
+These exist so (a) the eval-set author can VERIFY a query's labels and check
+that paraphrase queries genuinely defeat lexical matching, and (b) later agents
+have an honest "current system" to beat.
+
+`SqliteFtsRetriever` builds an in-memory SQLite FTS5 index over the corpus and
+runs the SAME query shape the production local store uses:
+    words -> '"w1" OR "w2" ...' MATCH, ORDER BY bm25(), importance as tiebreak.
+(README "SQLite: FTS5 with BM25".) This is the closest faithful, dependency-free
+baseline. The Postgres tsvector path is documented in the README; its ranking
+differs (weighted A/B/C/D + importance-first default) but for a quality ceiling
+comparison the FTS5/BM25 relevance ordering is the right lexical reference.
+"""
+from __future__ import annotations
+
+import re
+import sqlite3
+from collections.abc import Sequence
+
+from .types import Memory, MemoryId
+
+# FTS5 reserved-ish tokens; we quote every term anyway, but strip embedded quotes.
+_WORD_RE = re.compile(r"[A-Za-z0-9_]+")
+
+
+class SqliteFtsRetriever:
+    """Faithful FTS5/BM25 lexical baseline (mirrors local_store search)."""
+
+    name = "sqlite_fts5_bm25"
+
+    def __init__(self, sort_by: str = "relevance") -> None:
+        # "relevance": ORDER BY bm25(), importance DESC (best for quality eval)
+        # "importance": ORDER BY importance DESC, ... (production default)
+        self.sort_by = sort_by
+        self._con: sqlite3.Connection | None = None
+
+    def build_index(self, corpus: Sequence[Memory]) -> None:
+        con = sqlite3.connect(":memory:")
+        con.execute(
+            """
+            CREATE VIRTUAL TABLE memories_fts USING fts5(
+                content, category, tags, expanded_keywords,
+                memory_id UNINDEXED, importance UNINDEXED
+            )
+            """
+        )
+        con.executemany(
+            "INSERT INTO memories_fts(content, category, tags, expanded_keywords, memory_id, importance)"
+            " VALUES (?,?,?,?,?,?)",
+            [
+                (m.content, m.category, m.tags, m.expanded_keywords, m.id, m.importance)
+                for m in corpus
+            ],
+        )
+        con.commit()
+        self._con = con
+
+    def _fts_query(self, query: str) -> str:
+        words = _WORD_RE.findall(query.lower())
+        if not words:
+            return ""
+        return " OR ".join(f'"{w}"' for w in words)
+
+    def retrieve(self, query: str, k: int) -> list[MemoryId]:
+        assert self._con is not None, "call build_index first"
+        match = self._fts_query(query)
+        if not match:
+            return []
+        if self.sort_by == "importance":
+            order = "importance DESC, bm25(memories_fts)"
+        else:
+            order = "bm25(memories_fts), importance DESC"
+        try:
+            rows = self._con.execute(
+                f"SELECT memory_id FROM memories_fts WHERE memories_fts MATCH ? "
+                f"ORDER BY {order} LIMIT ?",
+                (match, k),
+            ).fetchall()
+        except sqlite3.OperationalError:
+            # mirror production LIKE fallback on FTS syntax errors
+            like = f"%{query}%"
+            rows = self._con.execute(
+                "SELECT memory_id FROM memories_fts WHERE content LIKE ? OR tags LIKE ? "
+                "ORDER BY importance DESC LIMIT ?",
+                (like, like, k),
+            ).fetchall()
+        return [r[0] for r in rows]
+
+    def close(self) -> None:
+        if self._con is not None:
+            self._con.close()
+            self._con = None
diff --git a/benchmarks/harness/dataset.py b/benchmarks/harness/dataset.py
new file mode 100644
index 0000000..ad5607a
--- /dev/null
+++ b/benchmarks/harness/dataset.py
@@ -0,0 +1,115 @@
+"""Load corpus / queries / qrels JSONL into typed objects."""
+from __future__ import annotations
+
+import json
+from dataclasses import dataclass
+from pathlib import Path
+
+from .types import Memory, Query, Qrels, MemoryId
+
+_DATA_DIR = Path(__file__).resolve().parents[1] / "data"
+
+
+@dataclass
+class Dataset:
+    corpus: list[Memory]
+    queries: list[Query]
+    qrels: Qrels
+
+    @property
+    def corpus_by_id(self) -> dict[MemoryId, Memory]:
+        return {m.id: m for m in self.corpus}
+
+    def strata(self) -> set[str]:
+        return {q.stratum for q in self.queries}
+
+
+def _read_jsonl(path: Path) -> list[dict]:
+    out: list[dict] = []
+    with path.open(encoding="utf-8") as f:
+        for line in f:
+            line = line.strip()
+            if line:
+                out.append(json.loads(line))
+    return out
+
+
+def load_corpus(path: Path | None = None) -> list[Memory]:
+    path = path or (_DATA_DIR / "corpus.jsonl")
+    rows = _read_jsonl(path)
+    return [
+        Memory(
+            id=r["id"],
+            content=r["content"],
+            category=r.get("category", "facts"),
+            tags=r.get("tags", "") or "",
+            expanded_keywords=r.get("expanded_keywords", "") or "",
+            importance=r.get("importance", 0.5),
+        )
+        for r in rows
+    ]
+
+
+def load_queries(path: Path | None = None) -> list[Query]:
+    path = path or (_DATA_DIR / "queries.jsonl")
+    rows = _read_jsonl(path)
+    return [
+        Query(
+            query_id=r["query_id"],
+            text=r["text"],
+            stratum=r["stratum"],
+            relevant_ids=tuple(r.get("relevant_ids", [])),
+        )
+        for r in rows
+    ]
+
+
+def load_qrels(path: Path | None = None) -> Qrels:
+    path = path or (_DATA_DIR / "qrels.jsonl")
+    rows = _read_jsonl(path)
+    qrels: Qrels = {}
+    for r in rows:
+        qid = r["query_id"]
+        rel = set(r["relevant_ids"])
+        qrels.setdefault(qid, set()).update(rel)
+    return qrels
+
+
+def load_dataset(
+    corpus_path: Path | None = None,
+    queries_path: Path | None = None,
+    qrels_path: Path | None = None,
+    *,
+    validate: bool = True,
+) -> Dataset:
+    corpus = load_corpus(corpus_path)
+    queries = load_queries(queries_path)
+    qrels = load_qrels(qrels_path)
+
+    if validate:
+        _validate(corpus, queries, qrels)
+
+    return Dataset(corpus=corpus, queries=queries, qrels=qrels)
+
+
+def _validate(corpus: list[Memory], queries: list[Query], qrels: Qrels) -> None:
+    corpus_ids = {m.id for m in corpus}
+    q_ids = {q.query_id for q in queries}
+
+    # Every query must have a qrels entry, and vice versa.
+    missing_qrels = q_ids - set(qrels)
+    if missing_qrels:
+        raise ValueError(f"queries without qrels: {sorted(missing_qrels)[:10]}")
+    orphan_qrels = set(qrels) - q_ids
+    if orphan_qrels:
+        raise ValueError(f"qrels without queries: {sorted(orphan_qrels)[:10]}")
+
+    # Every relevant id must exist in the corpus and the set must be non-empty.
+    for qid, rels in qrels.items():
+        if not rels:
+            raise ValueError(f"empty qrels for query {qid}")
+        unknown = rels - corpus_ids
+        if unknown:
+            raise ValueError(
+                f"query {qid} references non-corpus ids {sorted(unknown)[:10]}"
+            )
diff --git a/benchmarks/harness/example_retriever.py b/benchmarks/harness/example_retriever.py
new file mode 100644
index 0000000..29ed748
--- /dev/null
+++ b/benchmarks/harness/example_retriever.py
@@ -0,0 +1,59 @@
+"""Worked example: how a later agent plugs a retriever into the harness.
+
+A retriever needs only one method:
+
+    retrieve(self, query: str, k: int) -> list[int]   # ranked memory ids
+
+Optionally it may implement lifecycle hooks the runner will use if present:
+
+    build_index(self, corpus: list[Memory]) -> None   # timed separately
+    index_size_bytes(self) -> int                     # reported
+
+Run this file directly for a smoke test against the local eval set:
+    .venv/bin/python -m harness.example_retriever
+"""
+from __future__ import annotations
+
+from collections.abc import Sequence
+
+from .types import Memory, MemoryId
+
+
+class SubstringRetriever:
+    """Trivial baseline: rank by count of query-word occurrences in content.
+
+    Deliberately weak — exists only to demonstrate the interface. The real
+    lexical baseline is harness.baselines.SqliteFtsRetriever.
+    """
+
+    name = "substring_demo"
+
+    def __init__(self) -> None:
+        self._corpus: list[Memory] = []
+
+    def build_index(self, corpus: Sequence[Memory]) -> None:
+        self._corpus = list(corpus)
+
+    def retrieve(self, query: str, k: int) -> list[MemoryId]:
+        words = [w for w in query.lower().split() if len(w) > 2]
+        scored: list[tuple[int, float]] = []
+        for m in self._corpus:
+            hay = (m.content + " " + m.expanded_keywords + " " + m.tags).lower()
+            score = sum(hay.count(w) for w in words)
+            if score:
+                scored.append((m.id, score + m.importance))  # importance tiebreak
+        scored.sort(key=lambda t: t[1], reverse=True)
+        return [mid for mid, _ in scored[:k]]
+
+
+def _smoke() -> None:
+    from .dataset import load_dataset
+    from .runner import run_benchmark
+
+    ds = load_dataset()
+    res = run_benchmark(SubstringRetriever(), ds)
+    print(res.summary())
+
+
+if __name__ == "__main__":
+    _smoke()
diff --git a/benchmarks/harness/metrics.py b/benchmarks/harness/metrics.py
new file mode 100644
index 0000000..4edd789
--- /dev/null
+++ b/benchmarks/harness/metrics.py
@@ -0,0 +1,100 @@
+"""Retrieval metrics with BINARY relevance.
+
+Conventions
+-----------
+- `ranked`: list of memory ids, best-first, as returned by a retriever.
+- `relevant`: set of relevant memory ids for the query (from qrels).
+- All functions are pure and operate on a single query; the runner aggregates
+  (macro-average over queries).
+
+Definitions
+-----------
+recall@k = |relevant ∩ ranked[:k]| / |relevant|
+           (fraction of all relevant items retrieved within the top k)
+MRR      = 1 / rank_of_first_relevant (0 if none retrieved at all)
+nDCG@k   = DCG@k / IDCG@k with binary gains (gain=1 for relevant)
+           DCG@k = sum over i in [1..k] of rel_i / log2(i + 1)
+           IDCG@k is the DCG of the ideal ranking (all relevant first),
+           capped at min(|relevant|, k) ones.
+
+Notes
+-----
+- nDCG uses the standard log2(rank+1) discount (Järvelin & Kekäläinen 2002);
+  with binary gains this is the common IR convention also used by BEIR/pytrec_eval.
+- MRR is reported as the reciprocal rank of the FIRST relevant hit, which for a
+  single query equals the per-query reciprocal-rank that the runner averages.
+- Duplicate ids in `ranked` are de-duplicated keeping first occurrence, so a
+  retriever cannot inflate recall by repeating an id.
+"""
+from __future__ import annotations
+
+import math
+from collections.abc import Iterable, Sequence
+
+MemoryId = int
+
+
+def _dedup_keep_order(ranked: Sequence[MemoryId]) -> list[MemoryId]:
+    seen: set[MemoryId] = set()
+    out: list[MemoryId] = []
+    for x in ranked:
+        if x not in seen:
+            seen.add(x)
+            out.append(x)
+    return out
+
+
+def recall_at_k(ranked: Sequence[MemoryId], relevant: Iterable[MemoryId], k: int) -> float:
+    rel = set(relevant)
+    if not rel:
+        # Undefined; treat as 0 contribution. Runner should never pass empty.
+        return 0.0
+    top = _dedup_keep_order(ranked)[:k]
+    hits = sum(1 for x in top if x in rel)
+    return hits / len(rel)
+
+
+def reciprocal_rank(ranked: Sequence[MemoryId], relevant: Iterable[MemoryId]) -> float:
+    rel = set(relevant)
+    if not rel:
+        return 0.0
+    for i, x in enumerate(_dedup_keep_order(ranked), start=1):
+        if x in rel:
+            return 1.0 / i
+    return 0.0
+
+
+def dcg_at_k(ranked: Sequence[MemoryId], relevant: Iterable[MemoryId], k: int) -> float:
+    rel = set(relevant)
+    top = _dedup_keep_order(ranked)[:k]
+    dcg = 0.0
+    for i, x in enumerate(top, start=1):
+        if x in rel:
+            dcg += 1.0 / math.log2(i + 1)
+    return dcg
+
+
+def ndcg_at_k(ranked: Sequence[MemoryId], relevant: Iterable[MemoryId], k: int) -> float:
+    rel = set(relevant)
+    if not rel:
+        return 0.0
+    dcg = dcg_at_k(ranked, rel, k)
+    ideal_hits = min(len(rel), k)
+    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
+    if idcg == 0.0:
+        return 0.0
+    return dcg / idcg
+
+
+def per_query_metrics(ranked: Sequence[MemoryId], relevant: Iterable[MemoryId]) -> dict[str, float]:
+    """All headline metrics for one query."""
+    rel = set(relevant)
+    return {
+        "recall@5": recall_at_k(ranked, rel, 5),
+        "recall@10": recall_at_k(ranked, rel, 10),
+        "ndcg@10": ndcg_at_k(ranked, rel, 10),
+        "mrr": reciprocal_rank(ranked, rel),
+    }
+
+
+METRIC_NAMES = ("recall@5", "recall@10", "ndcg@10", "mrr")
diff --git a/benchmarks/harness/runner.py b/benchmarks/harness/runner.py
new file mode 100644
index 0000000..bb693c4
--- /dev/null
+++ b/benchmarks/harness/runner.py
@@ -0,0 +1,223 @@
+"""Benchmark runner: drive a pluggable retriever over the eval set and report
+overall + per-stratum quality metrics, plus per-query latency and (optional)
+index build time / size.
+
+Quality decides adoption (recall@k, nDCG@10, MRR). Latency and storage are
+measured and reported but DO NOT gate the decision (ADR-0001 success metric).
+"""
+from __future__ import annotations
+
+import statistics
+import time
+from collections.abc import Callable
+from dataclasses import dataclass, field, asdict
+from typing import Any
+
+from . import metrics
+from .dataset import Dataset
+from .types import MemoryId, Query, Retriever
+
+# A retriever may be the Protocol object or a bare callable retrieve(query, k).
+RetrieverLike = Retriever | Callable[[str, int], list[MemoryId]]
+
+# k used for the retrieve() call. We request enough depth to compute all
+# metrics (max cutoff is 10) with headroom so ties past k=10 don't distort.
+DEFAULT_RETRIEVE_K = 20
+
+
+def _percentile(values: list[float], pct: float) -> float:
+    """Linear-interpolation percentile (pct in [0,100]). Empty -> 0.0."""
+    if not values:
+        return 0.0
+    if len(values) == 1:
+        return values[0]
+    s = sorted(values)
+    rank = (pct / 100.0) * (len(s) - 1)
+    lo = int(rank)
+    hi = min(lo + 1, len(s) - 1)
+    frac = rank - lo
+    return s[lo] + (s[hi] - s[lo]) * frac
+
+
+@dataclass
+class StratumResult:
+    stratum: str
+    n_queries: int
+    metrics: dict[str, float]  # macro-averaged metric -> value
+
+
+@dataclass
+class BenchmarkResult:
+    retriever_name: str
+    n_queries: int
+    retrieve_k: int
+    overall: dict[str, float]
+    per_stratum: dict[str, StratumResult]
+    latency_ms: dict[str, float]  # mean / p50 / p95 / max
+    index_build_seconds: float | None = None
+    index_size_bytes: int | None = None
+    per_query: list[dict[str, Any]] = field(default_factory=list)
+
+    def to_dict(self) -> dict:
+        d = asdict(self)
+        d["per_stratum"] = {k: asdict(v) for k, v in self.per_stratum.items()}
+        return d
+
+    def summary(self) -> str:
+        lines = [
+            f"Retriever: {self.retriever_name}",
+            f"Queries: {self.n_queries} (retrieve_k={self.retrieve_k})",
+        ]
+        if self.index_build_seconds is not None:
+            lines.append(f"Index build: {self.index_build_seconds:.3f}s")
+        if self.index_size_bytes is not None:
+            lines.append(f"Index size: {self.index_size_bytes / 1e6:.2f} MB")
+        lat = self.latency_ms
+        lines.append(
+            "Latency/query: "
+            f"p50={lat['p50']:.2f}ms p95={lat['p95']:.2f}ms "
+            f"mean={lat['mean']:.2f}ms max={lat['max']:.2f}ms"
+        )
+        cols = metrics.METRIC_NAMES
+        header = " ".join(f"{c:>10}" for c in cols)
+        lines.append("")
+        lines.append(f"{'stratum':<12}{'n':>5} {header}")
+        lines.append("-" * (19 + len(header)))
+        for name in ("overall", *sorted(self.per_stratum)):
+            if name == "overall":
+                m, n = self.overall, self.n_queries
+            else:
+                sr = self.per_stratum[name]
+                m, n = sr.metrics, sr.n_queries
+            row = " ".join(f"{m[c]:>10.4f}" for c in cols)
+            lines.append(f"{name:<12}{n:>5} {row}")
+        return "\n".join(lines)
+
+
+def _get_retrieve_fn(retriever: RetrieverLike) -> Callable[[str, int], list[MemoryId]]:
+    if hasattr(retriever, "retrieve"):
+        return retriever.retrieve  # type: ignore[attr-defined]
+    if callable(retriever):
+        return retriever
+    raise TypeError("retriever must implement retrieve(query, k) or be callable")
+
+
+def _maybe_build_index(retriever: RetrieverLike, dataset: Dataset) -> tuple[float | None, int | None]:
+    """Call optional lifecycle hooks if present (duck-typed).
+
+    - build_index(corpus) -> None : measured wall-clock build time.
+    - index_size_bytes() -> int : reported on-disk/in-memory index size.
+    Returns (build_seconds_or_None, size_bytes_or_None).
+ """ + build_seconds: float | None = None + size_bytes: int | None = None + + build = getattr(retriever, "build_index", None) + if callable(build): + t0 = time.perf_counter() + build(dataset.corpus) + build_seconds = time.perf_counter() - t0 + + size_fn = getattr(retriever, "index_size_bytes", None) + if callable(size_fn): + try: + size_bytes = int(size_fn()) + except Exception: + size_bytes = None + + return build_seconds, size_bytes + + +def run_benchmark( + retriever: RetrieverLike, + dataset: Dataset, + *, + retrieve_k: int = DEFAULT_RETRIEVE_K, + retriever_name: str | None = None, + warmup: bool = True, + collect_per_query: bool = True, +) -> BenchmarkResult: + """Evaluate `retriever` over `dataset`. + + The retriever is asked for `retrieve_k` ids per query (>= max metric + cutoff of 10). Metrics are macro-averaged over queries, overall and per + stratum. Latency is measured around each retrieve() call only (index build + is timed separately via the optional build_index hook). + """ + name = retriever_name or getattr(retriever, "name", None) or type(retriever).__name__ + retrieve = _get_retrieve_fn(retriever) + qrels = dataset.qrels + + build_seconds, size_bytes = _maybe_build_index(retriever, dataset) + + # Optional warmup (first call can pay import/JIT/connection costs that would + # skew p95). Excluded from latency stats. Uses the first query if any. 
+ if warmup and dataset.queries: + try: + retrieve(dataset.queries[0].text, retrieve_k) + except Exception: + pass # warmup failures surface on the real call below + + per_query_rows: list[dict[str, Any]] = [] + latencies_ms: list[float] = [] + # accumulate per-stratum metric sums for macro-average + strata: dict[str, dict[str, float]] = {} + strata_counts: dict[str, int] = {} + overall_sums = {m: 0.0 for m in metrics.METRIC_NAMES} + + for q in dataset.queries: + rel = qrels[q.query_id] + t0 = time.perf_counter() + ranked = list(retrieve(q.text, retrieve_k)) + dt_ms = (time.perf_counter() - t0) * 1000.0 + latencies_ms.append(dt_ms) + + m = metrics.per_query_metrics(ranked, rel) + for key, val in m.items(): + overall_sums[key] += val + strata.setdefault(q.stratum, {mm: 0.0 for mm in metrics.METRIC_NAMES}) + strata_counts[q.stratum] = strata_counts.get(q.stratum, 0) + 1 + for key, val in m.items(): + strata[q.stratum][key] += val + + if collect_per_query: + per_query_rows.append( + { + "query_id": q.query_id, + "stratum": q.stratum, + "n_relevant": len(rel), + "latency_ms": round(dt_ms, 3), + "retrieved": ranked[:retrieve_k], + **{k: round(v, 6) for k, v in m.items()}, + } + ) + + n = len(dataset.queries) + overall = {k: (overall_sums[k] / n if n else 0.0) for k in metrics.METRIC_NAMES} + per_stratum: dict[str, StratumResult] = {} + for s, sums in strata.items(): + c = strata_counts[s] + per_stratum[s] = StratumResult( + stratum=s, + n_queries=c, + metrics={k: (sums[k] / c if c else 0.0) for k in metrics.METRIC_NAMES}, + ) + + latency_stats = { + "mean": statistics.fmean(latencies_ms) if latencies_ms else 0.0, + "p50": _percentile(latencies_ms, 50), + "p95": _percentile(latencies_ms, 95), + "max": max(latencies_ms) if latencies_ms else 0.0, + } + + return BenchmarkResult( + retriever_name=name, + n_queries=n, + retrieve_k=retrieve_k, + overall=overall, + per_stratum=per_stratum, + latency_ms=latency_stats, + index_build_seconds=build_seconds, + 
index_size_bytes=size_bytes, + per_query=per_query_rows, + ) diff --git a/benchmarks/harness/test_harness.py b/benchmarks/harness/test_harness.py new file mode 100644 index 0000000..9375374 --- /dev/null +++ b/benchmarks/harness/test_harness.py @@ -0,0 +1,145 @@ +"""Unit tests for metrics + runner. No real corpus needed (synthetic data). + +Run: .venv/bin/python -m pytest harness/test_harness.py -q +""" +from __future__ import annotations + +import math + +from harness import metrics +from harness.dataset import Dataset +from harness.runner import run_benchmark, _percentile +from harness.types import Memory, Query + + +# ---------------- metrics ---------------- + +def test_recall_at_k_basic(): + ranked = [9, 8, 3, 7, 1] + rel = {3, 1, 99} # 99 never retrieved + assert metrics.recall_at_k(ranked, rel, 5) == 2 / 3 + assert metrics.recall_at_k(ranked, rel, 2) == 0.0 # neither in top2 + assert metrics.recall_at_k(ranked, rel, 3) == 1 / 3 # only id 3 in top3 + + +def test_recall_perfect_and_zero(): + assert metrics.recall_at_k([1, 2, 3], {1, 2, 3}, 5) == 1.0 + assert metrics.recall_at_k([4, 5, 6], {1, 2, 3}, 5) == 0.0 + + +def test_reciprocal_rank(): + assert metrics.reciprocal_rank([5, 4, 3], {3}) == 1 / 3 + assert metrics.reciprocal_rank([3, 4, 5], {3}) == 1.0 + assert metrics.reciprocal_rank([7, 8], {3}) == 0.0 + # first relevant wins + assert metrics.reciprocal_rank([9, 3, 1], {1, 3}) == 1 / 2 + + +def test_ndcg_perfect(): + # all relevant at the top -> nDCG == 1 + assert math.isclose(metrics.ndcg_at_k([1, 2, 3, 4], {1, 2, 3}, 10), 1.0) + + +def test_ndcg_known_value(): + # single relevant doc at rank 2: DCG = 1/log2(3); IDCG = 1/log2(2)=1 + ranked = [9, 1, 8] + val = metrics.ndcg_at_k(ranked, {1}, 10) + assert math.isclose(val, (1 / math.log2(3)) / 1.0) + + +def test_ndcg_two_relevant_suboptimal_order(): + # relevant {1,2}; retrieved at ranks 1 and 3 + ranked = [1, 9, 2] + dcg = 1 / math.log2(2) + 1 / math.log2(4) # ranks 1 and 3 + idcg = 1 / math.log2(2) + 1 / 
math.log2(3) # ideal ranks 1 and 2 + assert math.isclose(metrics.ndcg_at_k(ranked, {1, 2}, 10), dcg / idcg) + + +def test_dedup_does_not_inflate(): + # repeating a relevant id must not increase recall beyond 1 hit's worth + ranked = [3, 3, 3, 3] + assert metrics.recall_at_k(ranked, {3, 7}, 5) == 0.5 + assert metrics.reciprocal_rank(ranked, {3}) == 1.0 + + +def test_empty_relevant_is_zero(): + assert metrics.recall_at_k([1, 2], set(), 5) == 0.0 + assert metrics.ndcg_at_k([1, 2], set(), 5) == 0.0 + + +# ---------------- percentile ---------------- + +def test_percentile(): + vals = [10, 20, 30, 40] + assert _percentile(vals, 50) == 25.0 # interpolated median + assert _percentile(vals, 0) == 10 + assert _percentile(vals, 100) == 40 + assert _percentile([5.0], 95) == 5.0 + assert _percentile([], 50) == 0.0 + + +# ---------------- runner ---------------- + +def _toy_dataset() -> Dataset: + corpus = [Memory(id=i, content=f"memory {i}") for i in range(1, 11)] + queries = [ + Query("q_exact_1", "find 1", "exact", (1,)), + Query("q_para_1", "restate 5", "paraphrase", (5,)), + Query("q_multi_1", "join 3 and 4", "multihop", (3, 4)), + ] + qrels = {"q_exact_1": {1}, "q_para_1": {5}, "q_multi_1": {3, 4}} + return Dataset(corpus=corpus, queries=queries, qrels=qrels) + + +class _PerfectRetriever: + """Returns exactly the relevant ids first (oracle) — for runner plumbing.""" + + def __init__(self, qrels): + self._qrels = qrels + self._by_text = None + + def build_index(self, corpus): + self._n = len(corpus) + + def index_size_bytes(self): + return 1234 + + def retrieve(self, query, k): + # map query text back via the toy queries' known answers + mapping = {"find 1": [1], "restate 5": [5], "join 3 and 4": [3, 4]} + ids = mapping.get(query, []) + # pad with distractors + pad = [x for x in range(100, 100 + k)] + return (ids + pad)[:k] + + +def test_runner_perfect_retriever(): + ds = _toy_dataset() + r = _PerfectRetriever(ds.qrels) + res = run_benchmark(r, ds, 
retriever_name="perfect") + assert res.n_queries == 3 + assert math.isclose(res.overall["recall@10"], 1.0) + assert math.isclose(res.overall["mrr"], 1.0) + assert math.isclose(res.overall["ndcg@10"], 1.0) + # per-stratum present + assert set(res.per_stratum) == {"exact", "paraphrase", "multihop"} + assert res.per_stratum["multihop"].n_queries == 1 + # lifecycle hooks captured + assert res.index_build_seconds is not None + assert res.index_size_bytes == 1234 + # latency recorded + assert res.latency_ms["p95"] >= 0.0 + + +def test_runner_callable_retriever_and_misses(): + ds = _toy_dataset() + + def retrieve(query, k): # always wrong + return [999][:k] + + res = run_benchmark(retrieve, ds, retriever_name="bad", warmup=False) + assert res.overall["recall@10"] == 0.0 + assert res.overall["mrr"] == 0.0 + assert res.index_build_seconds is None # no hook on a bare callable + assert "perfect" not in res.summary() + assert "bad" in res.summary() diff --git a/benchmarks/harness/types.py b/benchmarks/harness/types.py new file mode 100644 index 0000000..0665f33 --- /dev/null +++ b/benchmarks/harness/types.py @@ -0,0 +1,53 @@ +"""Core dataclasses and the pluggable Retriever protocol.""" +from __future__ import annotations + +from dataclasses import dataclass, field +from typing import Protocol, runtime_checkable + +MemoryId = int + + +@dataclass(frozen=True) +class Memory: + """One corpus entry (mirrors corpus.jsonl).""" + + id: MemoryId + content: str + category: str = "facts" + tags: str = "" + expanded_keywords: str = "" + importance: float = 0.5 + + +@dataclass(frozen=True) +class Query: + """One eval query (mirrors queries.jsonl).""" + + query_id: str + text: str + stratum: str # "exact" | "paraphrase" | "multihop" + # convenience copy of relevant ids; authoritative source is Qrels + relevant_ids: tuple[MemoryId, ...] 
= field(default_factory=tuple) + + +# query_id -> set of relevant memory ids (binary relevance) +Qrels = dict[str, set[MemoryId]] + + +@runtime_checkable +class Retriever(Protocol): + """Pluggable retriever contract. + + Implementations rank corpus memories for a query and return the top-k + memory ids, best match first. The harness will call `retrieve` once per + query and compare against qrels. + + Optional lifecycle hooks let a retriever build an index from the corpus + and report index build time / on-disk size; the runner uses them if + present (duck-typed), so a minimal retriever need only implement + `retrieve`. + """ + + def retrieve(self, query: str, k: int) -> list[MemoryId]: + """Return up to k memory ids, ranked best-first.""" + ... diff --git a/benchmarks/retrievers/__init__.py b/benchmarks/retrievers/__init__.py new file mode 100644 index 0000000..893917d --- /dev/null +++ b/benchmarks/retrievers/__init__.py @@ -0,0 +1,10 @@ +"""Pluggable retrievers for the claude-memory recall benchmark. + +Each retriever implements the harness `retrieve(query, k) -> list[int]` contract +(see ``harness/types.py`` :: ``Retriever``) and, optionally, the ``build_index`` / +``index_size_bytes`` lifecycle hooks the runner duck-types. + +``fts.FtsRetriever`` is the LEXICAL BASELINE — the product's current local-store +recall (SQLite FTS5/BM25). It is the "current system" any hybrid retriever must +beat on recall@k / nDCG@10 / MRR (ADR-0001). +""" diff --git a/benchmarks/retrievers/fts.py b/benchmarks/retrievers/fts.py new file mode 100644 index 0000000..6a5f516 --- /dev/null +++ b/benchmarks/retrievers/fts.py @@ -0,0 +1,224 @@ +"""BASELINE retriever: the product's CURRENT lexical recall (SQLite FTS5/BM25). + +This is the "current system" the hybrid upgrade (dense embeddings + concept +graph, ADR-0001) must beat on recall@k / nDCG@10 / MRR. 
It is a *faithful* +reimplementation of the production local-store recall path, not an idealised +sketch — it mirrors ``src/claude_memory/mcp_server.py :: _sqlite_recall`` (and +the FTS5 schema/triggers in the same module) line-for-line where it matters: + +Production recall (``sort_by="relevance"``) does ALL of the following, and so +does this retriever: + +1. **Concatenate then split.** The MCP tool builds + ``all_terms = f"{context} {expanded_query}"`` and splits it on whitespace, + stripping any embedded ``"`` from each token. The harness already hands us + one ``query`` string (the concatenation happens upstream of recall), so here + ``query`` IS ``all_terms``; we split + strip identically. + +2. **AND-first, then OR-broaden.** Production builds BOTH + ``'"w1" AND "w2" ...'`` and ``'"w1" OR "w2" ...'`` and runs the **AND** match + first; only if it returns zero rows does it fall back to the **OR** match. + (The README's "Search Algorithm" prose shows only the OR form; the *code* is + AND→OR, and the code is authoritative. We replicate the code.) + +3. **Blended BM25+importance relevance ordering.** ``sort_by="relevance"`` is + NOT a pure ``ORDER BY bm25()``. It is the blend + ``(-bm25(memories_fts) * 0.7 + importance * 0.3) DESC`` (bm25 is negated + because SQLite returns more-negative = better-match). We use the EXACT same + expression. We deliberately evaluate ``relevance`` (not the production + ``importance`` default) so the benchmark measures RETRIEVAL quality rather + than the importance-sort prior — per the research brief. + +4. **FTS5 default tokenizer.** The production virtual table is declared with no + explicit tokenizer, i.e. ``unicode61`` — case-folding + unicode diacritic + stripping, NO stemming and NO stop-word removal. We declare ours the same + way, so "running" does not match "run" (a known lexical weakness the dense + path is expected to fix on the *paraphrase* stratum). + +5. 
**LIKE fallback.** If the FTS5 MATCH raises ``sqlite3.OperationalError`` + (e.g. a token that trips the FTS5 query grammar), production degrades to a + ``content LIKE %context% OR tags LIKE %context%`` scan ordered by importance. + We mirror that fallback (using the full query as the LIKE needle, since the + harness query is the whole ``all_terms``). + +DIFFERENCES FROM PRODUCTION (all immaterial to ranking, documented for honesty): +- The benchmark corpus has no per-user / soft-delete / category filtering, so we + drop the ``user_id``/``deleted_at``/``category`` predicates. No category is + passed by the harness, so the category branch is never taken anyway. +- We build a fresh in-memory FTS5 index over ``data/corpus.jsonl`` rather than + reading the live ``memory.db``; same schema, same tokenizer, same columns + (content/category/tags/expanded_keywords), so BM25 statistics match what the + product would compute over the same documents. + +The harness reference ``harness.baselines.SqliteFtsRetriever`` implements the +*README* ordering (pure ``ORDER BY bm25(), importance``). This module is the +faithful-to-the-CODE variant and is the one the RUN reports as ``retriever="fts"``. +""" +from __future__ import annotations + +import re +import sqlite3 +from collections.abc import Sequence + +# Import the corpus dataclass from the sibling harness package. run_eval.py and +# run_benchmark put the benchmarks/ root on sys.path; support direct execution +# (python retrievers/fts.py) too by adding it ourselves if the import fails. 
+try: # pragma: no cover - exercised by both import paths + from harness.types import Memory, MemoryId +except ModuleNotFoundError: # pragma: no cover + import sys + from pathlib import Path + + sys.path.insert(0, str(Path(__file__).resolve().parents[1])) + from harness.types import Memory, MemoryId + +# Mirror production token extraction: split ``all_terms`` on whitespace and strip +# any embedded double-quote from each token (mcp_server uses +# ``w.replace(chr(34), "")``). We lowercase as well; FTS5 unicode61 case-folds +# regardless, so this only normalises the quoted MATCH literals we emit. +_DQUOTE = '"' + + +class FtsRetriever: + """Faithful reimplementation of the production SQLite FTS5/BM25 recall. + + Mirrors ``_sqlite_recall(sort_by="relevance")``: AND-first then OR-broaden + over an FTS5(content, category, tags, expanded_keywords) index, ranked by + the blended ``(-bm25*0.7 + importance*0.3)`` score, with a LIKE fallback. + """ + + #: Label surfaced in benchmark reports / the RUN schema. + name = "fts" + + def __init__(self, sort_by: str = "relevance") -> None: + # We benchmark "relevance" so the metric reflects retrieval quality, not + # the importance prior. "importance" is kept for parity / diagnostics. + if sort_by not in ("relevance", "importance"): + raise ValueError(f"sort_by must be 'relevance' or 'importance', got {sort_by!r}") + self.sort_by = sort_by + self._con: sqlite3.Connection | None = None + + # ── lifecycle hooks (duck-typed by the runner) ─────────────────────────── + + def build_index(self, corpus: Sequence[Memory]) -> None: + """Build a fresh in-memory FTS5 index over the corpus. + + Same virtual-table shape and (default ``unicode61``) tokenizer as the + production ``memories_fts`` table. We carry ``memory_id`` and + ``importance`` as UNINDEXED columns so the relevance blend can read + importance without a join — semantically identical to the production + ``memories m JOIN memories_fts fts ON m.id = fts.rowid`` read. 
+ """ + con = sqlite3.connect(":memory:") + con.execute( + """ + CREATE VIRTUAL TABLE memories_fts USING fts5( + content, category, tags, expanded_keywords, + memory_id UNINDEXED, importance UNINDEXED + ) + """ + ) + con.executemany( + "INSERT INTO memories_fts" + "(content, category, tags, expanded_keywords, memory_id, importance)" + " VALUES (?,?,?,?,?,?)", + [ + ( + m.content, + m.category, + m.tags, + m.expanded_keywords, + int(m.id), + float(m.importance), + ) + for m in corpus + ], + ) + con.commit() + self._con = con + + def index_size_bytes(self) -> int: + """Approximate on-disk index size (sum of FTS5 shadow-table page bytes). + + The index is in-memory, so this is the SQLite page accounting for the + FTS5 shadow tables — reported for the storage column, non-gating per + ADR-0001. + """ + if self._con is None: + return 0 + try: + page_count = self._con.execute("PRAGMA page_count").fetchone()[0] + page_size = self._con.execute("PRAGMA page_size").fetchone()[0] + return int(page_count) * int(page_size) + except sqlite3.Error: + return 0 + + # ── query construction (mirrors _sqlite_recall) ────────────────────────── + + @staticmethod + def _tokens(query: str) -> list[str]: + """Split ``all_terms`` exactly as production does: whitespace split, + drop embedded double-quotes, drop empties.""" + return [w.replace(_DQUOTE, "").lower() for w in query.split() if w.strip()] + + @classmethod + def _and_or_queries(cls, query: str) -> tuple[str, str]: + """Build the ('"w1" AND "w2" ...', '"w1" OR "w2" ...') MATCH pair.""" + words = cls._tokens(query) + if not words: + return "", "" + quoted = [f'"{w}"' for w in words] + return " AND ".join(quoted), " OR ".join(quoted) + + def _order_clause(self) -> str: + # bm25() is negative (more-negative = better), so negate before blending. 
+ if self.sort_by == "relevance": + return "(-bm25(memories_fts) * 0.7 + importance * 0.3) DESC" + return "(-bm25(memories_fts) * 0.4 + importance * 0.6) DESC" + + # ── retrieve ────────────────────────────────────────────────────────────── + + def retrieve(self, query: str, k: int) -> list[MemoryId]: + """Return up to ``k`` memory ids, ranked best-first. + + AND-match first (precise); if it yields nothing, OR-broaden. On an FTS5 + grammar error, fall back to a LIKE scan ordered by importance — exactly + the production degradation path. + """ + assert self._con is not None, "call build_index first" + and_query, or_query = self._and_or_queries(query) + if not or_query: # no usable tokens + return [] + + order = self._order_clause() + base_select = "SELECT memory_id FROM memories_fts WHERE memories_fts MATCH ? " + try: + rows: list[tuple[int]] = [] + # AND first for precise matches, fall back to OR for broader recall. + for fts_query in (and_query, or_query): + rows = self._con.execute( + f"{base_select}ORDER BY {order} LIMIT ?", + (fts_query, k), + ).fetchall() + if rows: + break + except sqlite3.OperationalError: + # Mirror production LIKE fallback: full query as the needle, + # ordered by importance. + like = f"%{query}%" + rows = self._con.execute( + "SELECT memory_id FROM memories_fts " + "WHERE content LIKE ? OR tags LIKE ? " + "ORDER BY importance DESC LIMIT ?", + (like, like, k), + ).fetchall() + return [r[0] for r in rows] + + def close(self) -> None: + if self._con is not None: + self._con.close() + self._con = None + + +# Convenience for `run_eval.py --retriever retrievers.fts:FtsRetriever` +# and a no-arg default instantiation (sort_by="relevance"). 
diff --git a/benchmarks/retrievers/hybrid.py b/benchmarks/retrievers/hybrid.py new file mode 100644 index 0000000..6039b21 --- /dev/null +++ b/benchmarks/retrievers/hybrid.py @@ -0,0 +1,570 @@ +"""HYBRID retriever (ADR-0001/0002/0003 prototype): lexical FTS + dense semantic +recall + a memory-node concept graph, fused with Reciprocal Rank Fusion (RRF). + +This is the self-contained prototype the hybrid-recall ADOPTION decision is gated +on (ADR-0001): does dense embeddings + a concept graph beat the current lexical +FTS5/BM25 on recall@5/recall@10/nDCG@10/MRR? Quality decides; latency/storage are +reported but non-gating. + +It implements the harness ``retrieve(query, k) -> list[int]`` contract and the +optional ``build_index(corpus)`` / ``index_size_bytes()`` / ``name`` hooks. + +Three legs, mirroring the FINAL DESIGN +====================================== + +1. **Lexical (FTS5/BM25).** We reuse the *faithful* production reimplementation + ``retrievers.fts.FtsRetriever`` verbatim — AND-first then OR-broaden over an + FTS5(content, category, tags, expanded_keywords) index, ranked by the blended + ``(-bm25*0.7 + importance*0.3)``. This is the exact "current system" the hybrid + must beat, so the lexical leg of the hybrid IS that system (no drift). + +2. **Dense (semantic).** Embeddings per FINAL DESIGN: a HOSTED API is used ONLY if + its key is in the environment (``OPENAI_API_KEY`` / ``VOYAGE_API_KEY`` / + ``CO_API_KEY``) AND the memory is non-sensitive (ADR-0003); otherwise the local + default ``BAAI/bge-large-en-v1.5`` (1024-d, MIT, sentence-transformers). The + benchmark corpus is already sensitive-free (``is_sensitive=1`` excluded at + export, README privacy note), so here the choice is purely "hosted key present + or not". Vectors are L2-normalised; similarity is cosine = dot product. The + corpus matrix is cached to ``cache/`` (gitignored) keyed by model id + a corpus + fingerprint, so re-runs skip re-embedding. 
BGE retrieval convention: the QUERY + gets the instruction prefix "Represent this sentence for searching relevant + passages: "; passages are embedded raw (per the official BAAI model card). + +3. **Graph (concept expansion).** A memory-node concept graph built with the + design's TRACTABLE extraction — NO 5452 sequential LLM calls. Concepts are the + union of each memory's ``tags`` and its already-LLM-generated + ``expanded_keywords`` (plus salient content noun-phrases via a lightweight + regex/stop-word filter), normalised and de-pluralised. A concept that appears + in 2..N memories (very common concepts above a document-frequency ceiling are + dropped as non-discriminative) links those memories: ``memory -[shares + concept c]- memory``. At query time we take the fused dense+lexical SEEDS, walk + 1 hop to neighbours that share *discriminative* concepts, and emit those + neighbours as a third ranked list. This targets the **multihop** stratum + (queries needing 2+ memories that share an entity/concept) without re-ranking + the precise hits the other legs already nail. + +Fusion (``retrieval_fusion``) +============================= +Reciprocal Rank Fusion (Cormack et al., 2009): for a document *d* with rank +``r_leg(d)`` (1-based) in a leg's ranked list, + + RRF(d) = Σ_leg w_leg / (k_rrf + r_leg(d)) + +with ``k_rrf = 60`` (the standard constant) and per-leg weights. RRF is +score-scale-free (no BM25-vs-cosine calibration), which is why the design floats +"RRF vs CC" and we pick RRF for the prototype. The dense and lexical legs carry +full weight; the graph leg is down-weighted (it is a RECALL extender for multihop, +and the design explicitly flags a possible negative graph prior — so it can add +documents but should not dethrone strong dense/lexical hits). All three weights +are class attributes so the kill-gate analysis can ablate the graph to zero. 
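A minimal sketch of the weighted RRF formula above, assuming each leg is a plain best-first list of memory ids. The names `rrf_fuse`, `legs` and `weights` are hypothetical and chosen for illustration only; this is not the fusion code of this module. The example weights mirror the `w_dense` / `w_fts` / `w_graph` class attributes (1.0 / 1.0 / 0.35), and skipping a leg whose weight is zero is an assumption about how an ablation would behave.

```python
from collections import defaultdict


def rrf_fuse(legs, weights, k_rrf=60, top_k=10):
    # Hypothetical helper for illustration; not this module's implementation.
    # legs:    leg name -> ids ranked best-first (e.g. "dense", "fts", "graph").
    # weights: leg name -> RRF weight; a leg missing from weights counts as 1.0.
    scores = defaultdict(float)
    for leg, ranked in legs.items():
        w = weights.get(leg, 1.0)
        if w <= 0:
            continue  # assumption: an ablated leg contributes no candidates
        # dict.fromkeys drops repeated ids while keeping first-seen order,
        # so ranks are 1-based positions in the de-duplicated list.
        for rank, doc_id in enumerate(dict.fromkeys(ranked), start=1):
            scores[doc_id] += w / (k_rrf + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_k]


legs = {"dense": [5, 3, 9], "fts": [3, 7], "graph": [4, 5]}
weights = {"dense": 1.0, "fts": 1.0, "graph": 0.35}
fused = rrf_fuse(legs, weights)
no_graph = rrf_fuse(legs, {**weights, "graph": 0.0})
```

In this toy input, id 3 ranks first because both full-weight legs return it, while id 4, which only the down-weighted graph leg returns, lands last and disappears once the graph weight is set to zero.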
+ +Graceful degradation (task requirement) +======================================= +If the embedding model cannot be loaded/used (missing package, download failure, +OOM), the dense leg is skipped, the failure is recorded in ``self.errors``, and the +retriever degrades to **FTS + graph** (or FTS-only if the graph also failed). The +harness still gets metrics for whatever worked. +""" +from __future__ import annotations + +import hashlib +import os +import re +import sys +import time +from collections import defaultdict +from collections.abc import Sequence +from pathlib import Path + +# ── package-relative imports that also work under direct execution ──────────── +try: # pragma: no cover - exercised by both import paths + from harness.types import Memory, MemoryId + from retrievers.fts import FtsRetriever +except ModuleNotFoundError: # pragma: no cover + sys.path.insert(0, str(Path(__file__).resolve().parents[1])) + from harness.types import Memory, MemoryId + from retrievers.fts import FtsRetriever + +_BENCH_ROOT = Path(__file__).resolve().parents[1] +_CACHE_DIR = _BENCH_ROOT / "cache" + +# Local default embedding model (FINAL DESIGN: prototype default + sensitive-only +# fallback). 1024-d, MIT-licensed, strong on MTEB retrieval. +_LOCAL_MODEL = "BAAI/bge-large-en-v1.5" +# BGE retrieval query instruction (official BAAI model card recommendation; the +# v1.5 line relaxed it but it still helps short-query / long-passage asymmetry, +# which is exactly the paraphrase stratum). Applied to QUERIES only. +_BGE_QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: " + +# RRF constant (Cormack/Clarke/Buettcher 2009). 60 is the canonical default. +_RRF_K = 60 + +# Concept-graph tuning. +# _CONCEPT_MIN_DF : a concept must appear in >= this many memories to form edges +# (df==1 links nothing; we need a shared concept). 
+# _CONCEPT_MAX_DF_FRAC : drop concepts appearing in more than this fraction of +# the corpus — they are non-discriminative hubs ("memory", +# "homelab") that would over-connect the graph (design risk: +# "over-merge"). +# _GRAPH_SEEDS : how many fused seeds to expand from. +# _GRAPH_NEIGHBOURS_PER_SEED : cap neighbours pulled per seed (keeps the graph +# leg from flooding the candidate pool). +_CONCEPT_MIN_DF = 2 +_CONCEPT_MAX_DF_FRAC = 0.02 +_GRAPH_SEEDS = 10 +_GRAPH_NEIGHBOURS_PER_SEED = 25 + +# A small English stop-word set for the lightweight noun-phrase extraction. We +# deliberately keep this tiny + dependency-free (no spaCy/NLTK download on the hot +# path); the heavy lifting is done by the pre-computed ``expanded_keywords``. +_STOPWORDS = frozenset( + """ + a an the of to in on at by for with from into over under and or but not is are + was were be been being do does did has have had this that these those it its as + if then than so such no yes can will would should could may might must i you he + she they we me him her them us my your his their our about above after again all + any because before below between both during each few more most other some only + own same too very up down out off here there when where which who whom what how + """.split() +) +_WORD_RE = re.compile(r"[A-Za-z][A-Za-z0-9_+.-]{2,}") + + +def _normalise_concept(token: str) -> str: + """Lowercase, strip surrounding punctuation, light de-plural so concept + variants collapse to one node (e.g. 'decisions'->'decision', + 'addresses'->'address', 'policies'->'policy'). This is a heuristic collapser, + not a linguistically perfect stemmer — its only job is to merge obvious + plural/singular pairs so the graph links them; exactness is not load-bearing. 
+ Order matters: -ies, then -sses, then sibilant -es, then a bare trailing -s.""" + t = token.lower().strip(".,;:!?()[]{}\"'`") + if len(t) > 4 and t.endswith("ies"): # policies -> policy + return t[:-3] + "y" + if len(t) > 4 and t.endswith("sses"): # addresses -> address, classes -> class + return t[:-2] + if len(t) > 4 and t.endswith(("ches", "shes", "xes", "zes", "ses")): # boxes->box + return t[:-2] + if len(t) > 3 and t.endswith("s") and not t.endswith(("ss", "us", "is")): # tags->tag + return t[:-1] + return t + + +def _concepts_for(memory: Memory) -> set[str]: + """Extract the concept set for one memory: tags ∪ expanded_keywords ∪ salient + content tokens. ``expanded_keywords`` is already an LLM-generated keyword field + in the corpus, so this is the design's 'tractable extraction' — we reuse the + extraction that production already pays for instead of new LLM calls.""" + concepts: set[str] = set() + # tags: comma-separated + for tag in memory.tags.split(","): + c = _normalise_concept(tag) + if len(c) >= 3 and c not in _STOPWORDS: + concepts.add(c) + # expanded_keywords: space-separated, already curated + for kw in memory.expanded_keywords.split(): + c = _normalise_concept(kw) + if len(c) >= 3 and c not in _STOPWORDS: + concepts.add(c) + # salient content tokens (lightweight noun-phrase proxy: alpha tokens len>=3, + # not stop-words). This is a cheap NER/noun-phrase stand-in per the design. 
+ for m in _WORD_RE.finditer(memory.content): + c = _normalise_concept(m.group(0)) + if len(c) >= 3 and c not in _STOPWORDS: + concepts.add(c) + return concepts + + +def _corpus_fingerprint(corpus: Sequence[Memory]) -> str: + """Stable hash over (id, content) so the embedding cache invalidates if the + corpus changes but is reused across runs of the same corpus.""" + h = hashlib.sha256() + for m in corpus: + h.update(str(m.id).encode()) + h.update(b"\x00") + h.update(m.content.encode("utf-8", "replace")) + h.update(b"\x01") + return h.hexdigest()[:16] + + +class HybridRetriever: + """Lexical FTS + dense (bge-large-en-v1.5 / hosted) + concept-graph expansion, + fused with RRF. Degrades to FTS(+graph) if embeddings are unavailable.""" + + #: Label surfaced in benchmark reports / the RUN schema. + name = "hybrid" + + # Per-leg RRF weights. Dense + lexical carry full weight; graph is a + # down-weighted recall extender (design: possible negative graph prior). + w_dense = 1.0 + w_fts = 1.0 + w_graph = 0.35 + + def __init__(self, model_name: str | None = None) -> None: + self.errors: list[str] = [] + self.model_name = model_name or _LOCAL_MODEL + self.embedding_backend: str = "none" # "none" | "local:{model}" | "hosted:{provider}:{model}" + self.embedding_dim: int | None = None + + # FTS leg (always available; pure stdlib sqlite). + self._fts = FtsRetriever(sort_by="relevance") + + # Dense leg state. + self._model = None # SentenceTransformer or None + self._np = None # numpy module handle (set on successful dense build) + self._emb = None # (N, d) float32 L2-normalised matrix, row i ↔ self._ids[i] + self._ids: list[MemoryId] = [] # row order of self._emb + + # Graph leg state. 
+ self._graph = None # networkx.Graph or None + self._concept_to_mems: dict[str, list[MemoryId]] = {} + self._mem_concepts: dict[MemoryId, set[str]] = {} + self._n_concepts_total = 0 # before df pruning, for reporting + self._n_concepts_kept = 0 + self._n_edges = 0 + + self._corpus_size = 0 + + # ── lifecycle: build_index (timed by the runner) ───────────────────────── + + def build_index(self, corpus: Sequence[Memory]) -> None: + corpus = list(corpus) + self._corpus_size = len(corpus) + _CACHE_DIR.mkdir(parents=True, exist_ok=True) + + # 1) lexical leg + self._fts.build_index(corpus) + + # 2) dense leg (graceful) + try: + self._build_dense(corpus) + except Exception as exc: # pragma: no cover - defensive + self.errors.append(f"dense leg disabled: {type(exc).__name__}: {exc}") + self._model = None + self._emb = None + + # 3) graph leg (graceful) + try: + self._build_graph(corpus) + except Exception as exc: # pragma: no cover - defensive + self.errors.append(f"graph leg disabled: {type(exc).__name__}: {exc}") + self._graph = None + self._concept_to_mems = {} + + # ── dense leg ──────────────────────────────────────────────────────────── + + def _select_embedding_backend(self) -> str: + """Pick the embedding backend per FINAL DESIGN: hosted only if a key is in + the env (non-sensitive corpus already guaranteed by export), else local. 
+ Returns a human label and sets self.model_name accordingly.""" + if os.environ.get("VOYAGE_API_KEY"): + self.model_name = "voyage-3.5" + return "hosted:voyage:voyage-3.5" + if os.environ.get("OPENAI_API_KEY"): + self.model_name = "text-embedding-3-large" + return "hosted:openai:text-embedding-3-large" + if os.environ.get("CO_API_KEY"): + self.model_name = "embed-english-v3.0" + return "hosted:cohere:embed-english-v3.0" + self.model_name = _LOCAL_MODEL + return f"local:{_LOCAL_MODEL}" + + def _build_dense(self, corpus: Sequence[Memory]) -> None: + import numpy as np # required for the dense leg + + self._np = np + self.embedding_backend = self._select_embedding_backend() + self._ids = [m.id for m in corpus] + fp = _corpus_fingerprint(corpus) + safe_model = self.model_name.replace("/", "_") + emb_path = _CACHE_DIR / f"emb_{safe_model}_{fp}.npy" + ids_path = _CACHE_DIR / f"emb_{safe_model}_{fp}.ids.npy" + + # cache hit? + if emb_path.exists() and ids_path.exists(): + cached_ids = np.load(ids_path) + if list(cached_ids.tolist()) == self._ids: + self._emb = np.load(emb_path).astype(np.float32) + self.embedding_dim = int(self._emb.shape[1]) + return # cached embeddings reused + + # cache miss → embed + if self.embedding_backend.startswith("hosted:"): + vecs = self._embed_hosted([m.content for m in corpus]) + else: + vecs = self._embed_local([m.content for m in corpus]) + vecs = vecs.astype(np.float32) + # L2-normalise so dot product == cosine. + norms = np.linalg.norm(vecs, axis=1, keepdims=True) + norms[norms == 0] = 1.0 + vecs = vecs / norms + self._emb = vecs + self.embedding_dim = int(vecs.shape[1]) + np.save(emb_path, vecs) + np.save(ids_path, np.array(self._ids, dtype=np.int64)) + + def _load_local_model(self): + from sentence_transformers import SentenceTransformer + + if self._model is None: + # CPU is fine for ~5.5k short docs; force CPU to avoid CUDA init noise. 
+ self._model = SentenceTransformer(_LOCAL_MODEL, device="cpu") + # Median memory is ~120 tokens; cap the window at 384 so the rare long + # memory (1.6% > 512 tok) doesn't pad an entire batch to 512. bge's + # native max is 512; 384 keeps ~p99 intact while bounding CPU cost. + self._model.max_seq_length = min(self._model.max_seq_length, 384) + return self._model + + def _embed_local(self, texts: list[str]): + import numpy as np + + model = self._load_local_model() + # Length-sort so each batch pads to a homogeneous length (big CPU win), then + # restore original order. Passages embedded raw; the caller L2-normalises so + # the local and hosted paths stay byte-for-byte consistent downstream. + order = sorted(range(len(texts)), key=lambda i: len(texts[i])) + sorted_texts = [texts[i] for i in order] + out = model.encode( + sorted_texts, + batch_size=64, + convert_to_numpy=True, + normalize_embeddings=False, + show_progress_bar=False, + ) + out = np.asarray(out) + # invert the permutation + restored = np.empty_like(out) + restored[np.asarray(order)] = out + return restored + + def _embed_query_local(self, query: str): + import numpy as np + + model = self._load_local_model() + out = model.encode( + [_BGE_QUERY_INSTRUCTION + query], + convert_to_numpy=True, + normalize_embeddings=True, # query L2-normalised → cosine via dot + show_progress_bar=False, + ) + return np.asarray(out)[0] + + def _embed_hosted(self, texts: list[str]): + """Batch-embed passages via the selected hosted API. 
Implemented for + Voyage / OpenAI / Cohere; only reached when the matching key is set.""" + import numpy as np + + backend = self.embedding_backend + if backend.startswith("hosted:voyage"): + import voyageai + + client = voyageai.Client() + vecs: list[list[float]] = [] + for i in range(0, len(texts), 128): + batch = texts[i : i + 128] + r = client.embed(batch, model="voyage-3.5", input_type="document") + vecs.extend(r.embeddings) + return np.asarray(vecs) + if backend.startswith("hosted:openai"): + from openai import OpenAI + + client = OpenAI() + vecs = [] + for i in range(0, len(texts), 256): + batch = texts[i : i + 256] + r = client.embeddings.create(model="text-embedding-3-large", input=batch) + vecs.extend([d.embedding for d in r.data]) + return np.asarray(vecs) + if backend.startswith("hosted:cohere"): + import cohere + + client = cohere.Client() + vecs = [] + for i in range(0, len(texts), 96): + batch = texts[i : i + 96] + r = client.embed(texts=batch, model="embed-english-v3.0", input_type="search_document") + vecs.extend(r.embeddings) + return np.asarray(vecs) + raise RuntimeError(f"unknown hosted backend {backend!r}") + + def _embed_query_hosted(self, query: str): + import numpy as np + + backend = self.embedding_backend + if backend.startswith("hosted:voyage"): + import voyageai + + client = voyageai.Client() + r = client.embed([query], model="voyage-3.5", input_type="query") + v = np.asarray(r.embeddings[0], dtype=np.float32) + elif backend.startswith("hosted:openai"): + from openai import OpenAI + + client = OpenAI() + r = client.embeddings.create(model="text-embedding-3-large", input=[query]) + v = np.asarray(r.data[0].embedding, dtype=np.float32) + elif backend.startswith("hosted:cohere"): + import cohere + + client = cohere.Client() + r = client.embed(texts=[query], model="embed-english-v3.0", input_type="search_query") + v = np.asarray(r.embeddings[0], dtype=np.float32) + else: + raise RuntimeError(f"unknown hosted backend {backend!r}") + n = 
np.linalg.norm(v) + return v / n if n else v + + def _dense_rank(self, query: str, k: int) -> list[MemoryId]: + """Top-k corpus ids by cosine similarity to the query embedding.""" + if self._emb is None or self._np is None: + return [] + np = self._np + if self.embedding_backend.startswith("hosted:"): + qv = self._embed_query_hosted(query) + else: + qv = self._embed_query_local(query) + sims = self._emb @ qv # (N,) cosine sims (both sides L2-normalised) + kk = min(k, sims.shape[0]) + # argpartition for the top-kk, then sort those by score desc. + idx = np.argpartition(-sims, kk - 1)[:kk] + idx = idx[np.argsort(-sims[idx])] + return [self._ids[i] for i in idx] + + # ── graph leg ────────────────────────────────────────────────────────── + + def _build_graph(self, corpus: Sequence[Memory]) -> None: + import networkx as nx + + n = len(corpus) + max_df = max(_CONCEPT_MIN_DF, int(_CONCEPT_MAX_DF_FRAC * n)) + + # concept → set(memory ids) + concept_to_mems: dict[str, set[MemoryId]] = defaultdict(set) + mem_concepts: dict[MemoryId, set[str]] = {} + for m in corpus: + cs = _concepts_for(m) + mem_concepts[m.id] = cs + for c in cs: + concept_to_mems[c].add(m.id) + self._n_concepts_total = len(concept_to_mems) + + # Keep only discriminative concepts: appear in [_CONCEPT_MIN_DF, max_df] + # memories. Below MIN_DF links nothing; above max_df is a non-specific hub. + kept: dict[str, list[MemoryId]] = {} + for c, mems in concept_to_mems.items(): + df = len(mems) + if _CONCEPT_MIN_DF <= df <= max_df: + kept[c] = sorted(mems) + self._n_concepts_kept = len(kept) + self._concept_to_mems = kept + # restrict each memory's concept set to kept concepts (for neighbour scoring) + self._mem_concepts = { + mid: {c for c in cs if c in kept} for mid, cs in mem_concepts.items() + } + + # Build a weighted memory-node graph: edge weight = # shared kept concepts. 
+ # We add edges via concept cliques but CAP per-concept fan-out to avoid an + # O(df^2) blow-up on the densest kept concepts (design risk: over-merge). + g = nx.Graph() + g.add_nodes_from(m.id for m in corpus) + edge_w: dict[tuple[MemoryId, MemoryId], int] = defaultdict(int) + for c, mems in kept.items(): + # mems is small (<= max_df) by construction; full clique is fine. + for i in range(len(mems)): + a = mems[i] + for j in range(i + 1, len(mems)): + b = mems[j] + key = (a, b) if a < b else (b, a) + edge_w[key] += 1 + for (a, b), w in edge_w.items(): + g.add_edge(a, b, weight=w) + self._n_edges = g.number_of_edges() + self._graph = g + + def _graph_rank(self, seeds: list[MemoryId], exclude: set[MemoryId], k: int) -> list[MemoryId]: + """From fused seeds, walk 1 hop and rank neighbour memories by accumulated + edge weight (shared-concept strength), weighted by the seed's own rank so + higher-confidence seeds pull harder. Returns up to k NEW ids (not in + ``exclude``).""" + if self._graph is None or not seeds: + return [] + g = self._graph + scores: dict[MemoryId, float] = defaultdict(float) + for rank, s in enumerate(seeds[:_GRAPH_SEEDS], start=1): + if s not in g: + continue + seed_w = 1.0 / rank # earlier seeds contribute more + nbrs = sorted( + g[s].items(), key=lambda kv: kv[1].get("weight", 1), reverse=True + )[:_GRAPH_NEIGHBOURS_PER_SEED] + for nbr, data in nbrs: + if nbr in exclude: + continue + scores[nbr] += seed_w * float(data.get("weight", 1)) + ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True) + return [mid for mid, _ in ranked[:k]] + + # ── fusion + retrieve ──────────────────────────────────────────────────── + + @staticmethod + def _rrf_accumulate(scores: dict[MemoryId, float], ranked: list[MemoryId], weight: float) -> None: + for r, mid in enumerate(ranked, start=1): + scores[mid] += weight / (_RRF_K + r) + + def retrieve(self, query: str, k: int) -> list[MemoryId]: + """Fuse lexical + dense + graph-expansion via weighted RRF and 
return the + top-k memory ids. Pulls each leg deeper than k so fusion has material to + re-order, then truncates.""" + depth = max(k, 50) # per-leg retrieval depth before fusion + + fts_ranked = self._fts.retrieve(query, depth) + dense_ranked = self._dense_rank(query, depth) # [] if dense disabled + + # Seeds for graph expansion: RRF of the two base legs (so the graph walks + # from the best-agreed memories, not just one leg's view). + seed_scores: dict[MemoryId, float] = defaultdict(float) + self._rrf_accumulate(seed_scores, fts_ranked, self.w_fts) + self._rrf_accumulate(seed_scores, dense_ranked, self.w_dense) + seeds = [mid for mid, _ in sorted(seed_scores.items(), key=lambda kv: kv[1], reverse=True)] + base_set = set(seeds) + graph_ranked = self._graph_rank(seeds, exclude=base_set, k=depth) + + # Final weighted RRF over all three legs. + scores: dict[MemoryId, float] = defaultdict(float) + self._rrf_accumulate(scores, fts_ranked, self.w_fts) + self._rrf_accumulate(scores, dense_ranked, self.w_dense) + self._rrf_accumulate(scores, graph_ranked, self.w_graph) + + fused = sorted(scores.items(), key=lambda kv: (kv[1], -kv[0]), reverse=True) + return [mid for mid, _ in fused[:k]] + + # ── reporting hooks ─────────────────────────────────────────────────────── + + def index_size_bytes(self) -> int: + """Sum of the dense matrix bytes + FTS index bytes (graph is in-memory + networkx; we approximate it via node+edge count * a small constant). Non- + gating per ADR-0001; reported for the storage column.""" + total = 0 + if self._emb is not None: + total += int(self._emb.nbytes) + try: + total += self._fts.index_size_bytes() + except Exception: + pass + if self._graph is not None: + # rough: ~64 B/node + ~96 B/edge accounting for python object overhead. 
+ total += self._graph.number_of_nodes() * 64 + self._graph.number_of_edges() * 96 + return total + + def graph_stats(self) -> dict[str, int]: + return { + "nodes": self._graph.number_of_nodes() if self._graph is not None else 0, + "edges": self._n_edges, + "concepts_total": self._n_concepts_total, + "concepts_kept": self._n_concepts_kept, + } + + def close(self) -> None: + self._fts.close() + + +# Convenience for `run_eval.py --retriever retrievers.hybrid:HybridRetriever`. diff --git a/benchmarks/retrievers/test_hybrid.py b/benchmarks/retrievers/test_hybrid.py new file mode 100644 index 0000000..09612a6 --- /dev/null +++ b/benchmarks/retrievers/test_hybrid.py @@ -0,0 +1,204 @@ +"""Unit tests for the HYBRID retriever's pure logic: concept normalisation, the +concept-graph build + 1-hop expansion, weighted RRF fusion, and graceful +degradation when the dense leg is unavailable. + +These tests are MODEL-FREE on purpose — they never load sentence-transformers (a +~1.3 GB / multi-minute CPU load). The dense leg is exercised by monkeypatching the +ranking method, so the fusion + graph behaviour is verified deterministically and +fast. The full end-to-end quality run is done via scripts/run_eval.py against the +real (local, gitignored) corpus. 
+ +Run: .venv/bin/python -m pytest retrievers/test_hybrid.py -q +""" +from __future__ import annotations + +import math + +from harness.types import Memory +from retrievers.hybrid import ( + _RRF_K, + HybridRetriever, + _concepts_for, + _normalise_concept, +) + + +# ---------------- concept normalisation ---------------- + +def test_normalise_concept_depluralisation(): + cases = { + "Decisions": "decision", + "policies": "policy", + "addresses": "address", + "boxes": "box", + "tags": "tag", + # invariants: don't over-strip + "access": "access", + "class": "class", + "status": "status", + "analysis": "analysis", + "kubernetes": "kubernete", # heuristic, acceptable (collapses consistently) + "k8s": "k8s", + "GPU": "gpu", + } + for inp, exp in cases.items(): + assert _normalise_concept(inp) == exp, f"{inp!r} -> {_normalise_concept(inp)!r}" + + +def test_normalise_concept_is_stable_under_repetition(): + # normalising an already-normalised token must be a no-op (idempotent), so the + # graph collapses variants consistently no matter the source field. + for tok in ["decision", "policy", "address", "tag", "gpu", "access"]: + assert _normalise_concept(_normalise_concept(tok)) == _normalise_concept(tok) + + +def test_concepts_for_unions_tags_keywords_content(): + m = Memory( + id=1, + content="The Postgres cluster uses pgvector for embeddings.", + tags="database,postgres", + expanded_keywords="cnpg vector search", + ) + cs = _concepts_for(m) + # from tags (note: 'postgres' de-plurals to 'postgre' — a consistent heuristic + # collapse; what matters is every memory mentioning it lands on the SAME node). 
+ assert "database" in cs and "postgre" in cs + # from expanded_keywords + assert "cnpg" in cs and "vector" in cs and "search" in cs + # from content (salient tokens, stop-words removed) + assert "pgvector" in cs and "embedding" in cs # 'embeddings' -> 'embedding' + assert "the" not in cs and "for" not in cs # stop-words excluded + + +# ---------------- graph build + expansion ---------------- + +def _shared_concept_corpus() -> list[Memory]: + # Three memories share concept "alpha" (df=3); two share "beta" (df=2); "gamma" + # is unique (df=1, links nothing). With min_df=2 and a generous max_df, alpha + # and beta both form edges. + return [ + Memory(id=10, content="alpha topic one", tags="alpha", expanded_keywords="beta"), + Memory(id=20, content="alpha topic two", tags="alpha", expanded_keywords="beta"), + Memory(id=30, content="alpha topic three", tags="alpha", expanded_keywords="gamma"), + Memory(id=40, content="unrelated delta", tags="delta", expanded_keywords="delta"), + ] + + +def test_graph_build_links_shared_concepts(): + r = HybridRetriever() + # widen max_df so small-corpus concepts aren't pruned as "hubs" + import retrievers.hybrid as H + + old = H._CONCEPT_MAX_DF_FRAC + H._CONCEPT_MAX_DF_FRAC = 1.0 + try: + r._build_graph(_shared_concept_corpus()) + finally: + H._CONCEPT_MAX_DF_FRAC = old + + g = r._graph + assert g is not None + # alpha links 10-20-30 (a triangle); beta links 10-20; "topic" links 10-20-30 + # too (shared content token). So the triangle exists and 10-20 is the heaviest + # edge (they additionally share 'beta'). + assert g.has_edge(10, 20) + assert g.has_edge(10, 30) + assert g.has_edge(20, 30) + # 10-20 share alpha + beta + topic (=3); 10-30 share alpha + topic (=2). The + # exact counts aren't load-bearing — the INVARIANT is w(10,20) > w(10,30). + assert g[10][20]["weight"] > g[10][30]["weight"] + # the unrelated memory 40 (concept 'delta', df=1) links nothing. 
+ assert g.degree(40) == 0 + stats = r.graph_stats() + assert stats["nodes"] == 4 and stats["edges"] >= 3 + + +def test_graph_rank_expands_from_seeds_by_weight(): + r = HybridRetriever() + import retrievers.hybrid as H + + old = H._CONCEPT_MAX_DF_FRAC + H._CONCEPT_MAX_DF_FRAC = 1.0 + try: + r._build_graph(_shared_concept_corpus()) + finally: + H._CONCEPT_MAX_DF_FRAC = old + + # Seed from memory 10; neighbours 20 and 30 should both surface, with 20 + # ranked above 30 (heavier shared-concept edge: 10-20 also share 'beta'). + nbrs = r._graph_rank([10], exclude={10}, k=10) + assert nbrs[:2] == [20, 30] + # excluded seeds are never returned + assert 10 not in nbrs + + +def test_graph_rank_empty_without_graph_or_seeds(): + r = HybridRetriever() # no graph built + assert r._graph_rank([1, 2], exclude=set(), k=5) == [] + r._graph = object.__new__(type("G", (), {})) # truthy but unused + assert r._graph_rank([], exclude=set(), k=5) == [] # no seeds + + +# ---------------- RRF fusion ---------------- + +def test_rrf_accumulate_formula(): + scores: dict[int, float] = {} + from collections import defaultdict + + scores = defaultdict(float) + HybridRetriever._rrf_accumulate(scores, [7, 8, 9], weight=1.0) + assert math.isclose(scores[7], 1.0 / (_RRF_K + 1)) + assert math.isclose(scores[8], 1.0 / (_RRF_K + 2)) + assert math.isclose(scores[9], 1.0 / (_RRF_K + 3)) + # a second weighted list adds on top + HybridRetriever._rrf_accumulate(scores, [8], weight=0.5) + assert math.isclose(scores[8], 1.0 / (_RRF_K + 2) + 0.5 / (_RRF_K + 1)) + + +def test_retrieve_fuses_all_three_legs_and_degrades(): + """End-to-end fusion with the dense leg STUBBED (no model).
Verifies (a) FTS + + dense agreement floats a doc to the top, (b) the graph leg can introduce a doc + neither base leg returned, and (c) dense-disabled degrades to FTS(+graph).""" + corpus = [ + Memory(id=1, content="alpha shared concept", tags="alpha", expanded_keywords="alpha"), + Memory(id=2, content="alpha shared concept too", tags="alpha", expanded_keywords="alpha"), + Memory(id=3, content="beta unrelated", tags="beta", expanded_keywords="beta"), + ] + import retrievers.hybrid as H + + old = H._CONCEPT_MAX_DF_FRAC + H._CONCEPT_MAX_DF_FRAC = 1.0 + try: + r = HybridRetriever() + # Stub the dense BUILD so the test never loads the ~1.3 GB model nor writes + # to the shared cache/ dir; build_index then only does FTS + graph. + r._build_dense = lambda _c: None # type: ignore[method-assign] + r.build_index(corpus) # FTS + graph build only + # Stub the dense RANKER deterministically to "agree" with FTS on doc 1. + r._dense_rank = lambda q, k: [1] # type: ignore[method-assign] + + # query matching doc 1 lexically; doc 2 shares concept 'alpha' with doc 1 + # (graph neighbour) even if FTS ranks it lower. 
+ out = r.retrieve("alpha shared concept", k=3) + assert out, "should return something" + assert out[0] == 1 # FTS+dense agreement puts doc 1 first + assert 2 in out # graph expansion (shares 'alpha') pulls doc 2 in + finally: + H._CONCEPT_MAX_DF_FRAC = old + + +def test_graceful_degradation_records_error(monkeypatch): + """If the dense build raises, the retriever records it and still serves FTS.""" + corpus = [Memory(id=i, content=f"doc number {i} content", tags="t") for i in range(1, 6)] + r = HybridRetriever() + + def boom(_corpus): + raise RuntimeError("simulated embedding failure") + + monkeypatch.setattr(r, "_build_dense", boom) + r.build_index(corpus) + assert any("dense leg disabled" in e for e in r.errors) + assert r._emb is None + # FTS still answers + out = r.retrieve("doc number 3 content", k=5) + assert 3 in out diff --git a/benchmarks/scripts/dataset_stats.py b/benchmarks/scripts/dataset_stats.py new file mode 100644 index 0000000..87c79b8 --- /dev/null +++ b/benchmarks/scripts/dataset_stats.py @@ -0,0 +1,49 @@ +#!/usr/bin/env python3 +"""Validate the eval set and print AGGREGATE stats (safe to share / commit-able +numbers only — prints NO raw memory content).""" +from __future__ import annotations + +import json +import statistics +import sys +from collections import Counter +from pathlib import Path + +sys.path.insert(0, str(Path(__file__).resolve().parents[1])) +from harness import load_dataset # noqa: E402 + + +def main() -> None: + ds = load_dataset(validate=True) # raises on any referential-integrity issue + + strata = Counter(q.stratum for q in ds.queries) + rel_per_q = {s: [] for s in strata} + for q in ds.queries: + rel_per_q[q.stratum].append(len(ds.qrels[q.query_id])) + + # how many DISTINCT corpus memories are exercised as relevant + relevant_union = set() + for rels in ds.qrels.values(): + relevant_union |= rels + + out = { + "corpus_count": len(ds.corpus), + "query_count": len(ds.queries), + "strata": dict(strata), + 
"relevant_ids_per_query": { + s: { + "min": min(v), + "median": statistics.median(v), + "max": max(v), + "mean": round(statistics.fmean(v), 2), + } + for s, v in rel_per_q.items() + }, + "distinct_relevant_memories": len(relevant_union), + "validation": "PASS (all qrels ids exist in corpus; every query has qrels)", + } + print(json.dumps(out, indent=2)) + + +if __name__ == "__main__": + main() diff --git a/benchmarks/scripts/export_corpus.py b/benchmarks/scripts/export_corpus.py new file mode 100644 index 0000000..a66a774 --- /dev/null +++ b/benchmarks/scripts/export_corpus.py @@ -0,0 +1,78 @@ +#!/usr/bin/env python3 +"""Export the local SQLite memory cache to a LOCAL-ONLY corpus.jsonl. + +Privacy: emits ONLY rows where is_sensitive=0. The output file lives under +benchmarks/data/ which is gitignored. NEVER commit corpus.jsonl. + +Fields emitted per line: {id, content, category, tags, expanded_keywords, importance} +""" +from __future__ import annotations + +import argparse +import json +import sqlite3 +import sys +from pathlib import Path + +DEFAULT_DB = Path.home() / ".claude" / "claude-memory" / "memory" / "memory.db" +DEFAULT_OUT = Path(__file__).resolve().parents[1] / "data" / "corpus.jsonl" + + +def export(db_path: Path, out_path: Path) -> dict: + if not db_path.exists(): + raise SystemExit(f"DB not found: {db_path}") + + con = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True) + con.row_factory = sqlite3.Row + cur = con.cursor() + + total = cur.execute("SELECT COUNT(*) FROM memories").fetchone()[0] + sensitive = cur.execute( + "SELECT COUNT(*) FROM memories WHERE is_sensitive=1" + ).fetchone()[0] + + rows = cur.execute( + """ + SELECT id, content, category, tags, expanded_keywords, importance + FROM memories + WHERE is_sensitive=0 + ORDER BY id + """ + ).fetchall() + + out_path.parent.mkdir(parents=True, exist_ok=True) + written = 0 + with out_path.open("w", encoding="utf-8") as f: + for r in rows: + rec = { + "id": r["id"], + "content": r["content"], + 
"category": r["category"], + "tags": r["tags"], + "expanded_keywords": r["expanded_keywords"], + "importance": r["importance"], + } + f.write(json.dumps(rec, ensure_ascii=False) + "\n") + written += 1 + con.close() + + return { + "total_rows": total, + "sensitive_excluded": sensitive, + "non_sensitive_written": written, + "out_path": str(out_path), + } + + +def main() -> None: + ap = argparse.ArgumentParser() + ap.add_argument("--db", type=Path, default=DEFAULT_DB) + ap.add_argument("--out", type=Path, default=DEFAULT_OUT) + args = ap.parse_args() + stats = export(args.db, args.out) + json.dump(stats, sys.stdout, indent=2) + print() + + +if __name__ == "__main__": + main() diff --git a/benchmarks/scripts/run_eval.py b/benchmarks/scripts/run_eval.py new file mode 100644 index 0000000..0ad0dc9 --- /dev/null +++ b/benchmarks/scripts/run_eval.py @@ -0,0 +1,65 @@ +#!/usr/bin/env python3 +"""Run the benchmark for a named retriever and print overall + per-stratum metrics. + +Usage: + .venv/bin/python scripts/run_eval.py --retriever fts5 # lexical baseline + .venv/bin/python scripts/run_eval.py --retriever substring # demo + .venv/bin/python scripts/run_eval.py --retriever mypkg.mymod:MyRetriever + .venv/bin/python scripts/run_eval.py --retriever fts5 --json results/fts5.json + +The --retriever value is either a built-in alias or a "module:Class" path. The +class is instantiated with no args; the runner calls build_index() if present. + +Outputs are LOCAL-ONLY when written under results/ (gitignored): a results file +may echo retrieved ids (not content), but keep it local to be safe. 
+""" +from __future__ import annotations + +import argparse +import importlib +import json +import sys +from pathlib import Path + +sys.path.insert(0, str(Path(__file__).resolve().parents[1])) +from harness import load_dataset, run_benchmark # noqa: E402 +from harness.baselines import SqliteFtsRetriever # noqa: E402 +from harness.example_retriever import SubstringRetriever # noqa: E402 + +ALIASES = { + "fts5": lambda: SqliteFtsRetriever(sort_by="relevance"), + "fts5_importance": lambda: SqliteFtsRetriever(sort_by="importance"), + "substring": SubstringRetriever, +} + + +def resolve(spec: str): + if spec in ALIASES: + return ALIASES[spec]() + if ":" not in spec: + raise SystemExit(f"unknown retriever alias '{spec}' (use module:Class or one of {list(ALIASES)})") + mod_name, cls_name = spec.split(":", 1) + mod = importlib.import_module(mod_name) + return getattr(mod, cls_name)() + + +def main() -> None: + ap = argparse.ArgumentParser() + ap.add_argument("--retriever", default="fts5") + ap.add_argument("--k", type=int, default=20, help="depth requested from retriever") + ap.add_argument("--json", type=Path, default=None, help="write full result JSON here") + args = ap.parse_args() + + ds = load_dataset(validate=True) + retr = resolve(args.retriever) + res = run_benchmark(retr, ds, retrieve_k=args.k) + print(res.summary()) + + if args.json: + args.json.parent.mkdir(parents=True, exist_ok=True) + args.json.write_text(json.dumps(res.to_dict(), indent=2)) + print(f"\nwrote {args.json}") + + +if __name__ == "__main__": + main() diff --git a/docs/adr/0004-phase-the-hybrid-lexical-dense-first-graph-gated.md b/docs/adr/0004-phase-the-hybrid-lexical-dense-first-graph-gated.md new file mode 100644 index 0000000..6bdbae1 --- /dev/null +++ b/docs/adr/0004-phase-the-hybrid-lexical-dense-first-graph-gated.md @@ -0,0 +1,45 @@ +# Phase the hybrid: lexical + dense first, concept graph gated + +The [benchmark](../research/benchmark-report.md) shows the hybrid read path beats lexical FTS 
on +every overall metric (recall@10 **+0.139**, driven by the statistically-robust paraphrase win recall@10 +**+0.350**), with no recall regression on exact — so [ADR-0001](0001-pursue-hybrid-retrieval-embeddings-and-concept-graph.md)'s +quality gate **is met by the lexical + dense fusion**. The ablation also showed full-hybrid **identical +to three decimals** to FTS+dense alone — **but a post-run adversarial review proved this was a structural +artifact, not an empirical result**: the prototype's fusion barred the graph leg from the fused top-k by +construction (graph candidates excluded from the FTS∪dense base set; `w_graph=0.35` → max graph RRF +`0.0057` < base-leg min `0.0091`). So the concept graph was **never validly tested — it is *unevaluated*, +not disproven.** + +We therefore **phase** the hybrid: + +- **Phase 1 (adopt now):** lexical FTS ⊕ dense pgvector embeddings, fused with weighted RRF, the + importance prior preserved as a post-fusion multiplier. This *is* the measured uplift. +- **Phase 2 (gated, NOT shipped):** the typed-relation concept graph. It stays designed + ([integration design §A.5](../research/integration-design.md)) but disabled — its value is **unproven** + and it carries real operational cost (LLM extraction + two extra Postgres tables + traversal). + +The graph's prototype was a **zero-LLM keyword-co-occurrence** graph (concepts = tags ∪ +expanded_keywords ∪ regex noun-phrases, edges from co-occurrence), **not** the LLM-extracted +typed-relation graph the production design specifies. So the null result kills *that cheap +construction* on *this eval set* — it does not prove an LLM-extracted graph is also null. The graph +is **gated, not killed**: re-open it only with evidence it helps — e.g. a multi-hop slice whose hops +are *not* semantically adjacent (where the dense leg can't shortcut), built from real typed-relation +extraction. 
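
The structural bound quoted in this ADR (max graph RRF `0.0057` < base-leg min `0.0091`) can be re-derived by hand. The following is a minimal sketch, assuming the prototype's values (RRF constant 60, per-leg depth 50, base-leg weight 1.0, `w_graph = 0.35`); the helper names are hypothetical and do not exist in the prototype.

```python
# Illustrative arithmetic only; the helper names are hypothetical.
# The constants are the prototype defaults quoted in this ADR and in the code.
RRF_K = 60      # RRF smoothing constant
DEPTH = 50      # per-leg retrieval depth before fusion
W_BASE = 1.0    # weight of the lexical and dense legs
W_GRAPH = 0.35  # weight of the graph leg


def best_graph_only_score(w_graph: float = W_GRAPH, k: int = RRF_K) -> float:
    """Highest RRF score a graph-only id can reach: rank 1 in the graph list."""
    return w_graph / (k + 1)


def worst_base_score(w_base: float = W_BASE, k: int = RRF_K, depth: int = DEPTH) -> float:
    """Lowest RRF score of an id found by one base leg: last rank in that list."""
    return w_base / (k + depth)


def graph_weight_to_tie(w_base: float = W_BASE, k: int = RRF_K, depth: int = DEPTH) -> float:
    """Graph weight at which the top graph-only id ties the weakest base id."""
    return w_base * (k + 1) / (k + depth)


print(round(best_graph_only_score(), 4))  # 0.0057
print(round(worst_base_score(), 4))       # 0.0091
print(round(graph_weight_to_tie(), 3))    # 0.555
```

Under these assumptions, any `w_graph` below roughly 0.555 leaves every graph-only id beneath every base-leg id whenever a base leg fills its depth, which is why a valid retest would need a swept weight or graph candidates admitted to the fused pool.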
+ +## Why the graph result is inconclusive + +The ablation `A≡B` was **guaranteed by the fusion config** (graph candidates excluded from the FTS∪dense +base set; `w_graph=0.35` caps a graph-only id's RRF below any base-leg id), so it tested nothing about the +graph — the review even found a relevant memory the graph surfaced that both base legs missed, which +fusion then discarded. Separately, the multi-hop deltas (recall@10 +0.064) are **not statistically +significant** (3 of 4 CIs cross zero), so there is no distinguishable multi-hop win to attribute to +*either* leg. The graph is deferred on **cost + uncertainty**, not on evidence it fails. + +## Consequences + +- Production ships embeddings + RRF fusion; the graph schema is documented but not migrated. +- The concept-graph research (Zep/Graphiti split, HippoRAG PPR, EDC canonicalization) is preserved as + the phase-2 blueprint, behind the gate. +- Phasing avoids paying the graph's cost while its value is unproven; the robust **lexical+dense** + paraphrase win is what ADR-0001's gate actually surfaced. The graph stays a designed, gated follow-up + pending a valid retest (graph candidates in the fused pool / swept weight, real typed-relation extraction). diff --git a/docs/adr/0005-rrf-default-cc-challenger.md b/docs/adr/0005-rrf-default-cc-challenger.md new file mode 100644 index 0000000..b05f5fa --- /dev/null +++ b/docs/adr/0005-rrf-default-cc-challenger.md @@ -0,0 +1,40 @@ +# Weighted RRF as the fusion default; convex combination as a benchmark challenger + +The hybrid read path must fuse three retrieval signals on **incomparable scales** — unbounded +`ts_rank` (BM25), bounded cosine, and (phase 2) an arbitrary graph-proximity score — where the graph +list is **sparse and often empty**. 
We adopt **weighted Reciprocal Rank Fusion** as the default +fusion function: `score(d) = Σ_s w_s/(60 + rank_s(d))`, default `w_lex = w_dense = 1.0`, +`w_graph = 0.35`, with the existing **importance** value applied as a *post-fusion prior multiplier* +(`final = fused × (0.7 + 0.3·importance)` for `sort_by="relevance"`) — importance is a prior, **not** +a fourth fused list. + +RRF is the right default because it is **score-scale-free** (no BM25-vs-cosine calibration to +maintain), treats a missing leg as a clean **0** contribution (no missing-modality bias), is +near-parameter-free (`k=60` is demonstrably uncritical across [10,100]), and **collapses to today's +exact lexical ordering** when the dense/graph legs are empty — which is the SQLite-only +graceful-degrade path ([ADR-0002](0002-api-postgres-first-sqlite-stays-lexical.md)) running the *same* +code. + +## Considered options + +- **Convex combination / TM2C2** (Bruch et al., min-max normalization) — the literature consistently + shows it **edges RRF on nDCG/recall** when scores are calibratable (Weaviate switched its default to it). + It is the **standing challenger**. ⚠️ **Correction:** an earlier draft claimed "the benchmark ran CC + against RRF and RRF was chosen on our eval set" — **no CC results were actually produced or persisted in + this run.** RRF was adopted on *principled* grounds (scale-free, treats a missing/empty leg as a clean 0, + collapses to today's exact lexical ordering for the SQLite-only degrade path), **not** a measured + head-to-head. Benchmarking CC vs RRF on our eval set is an open follow-up — do it before locking fusion, + and especially if the graph is ever adopted or score distributions shift. +- **Cross-encoder stage-2 re-rank** (e.g. `bge-reranker-v2-m3` over the fused top ~20–30) — a + *separate*, independently-gated stage, not a fusion function. Deferred; ship only if it clears both + the quality bar and the hot-path p95 budget on the GPU node. 
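
The default fusion adopted in this ADR can be sketched in a few lines. This is an illustration under stated assumptions, not the production code: the function names are hypothetical, importance is assumed to lie in `[0, 1]`, and a missing importance value is assumed to be `0.0`.

```python
from collections import defaultdict

RRF_K = 60  # constant from score(d) = sum_s w_s / (60 + rank_s(d))


def weighted_rrf(legs: dict[str, list[int]], weights: dict[str, float], k: int = RRF_K) -> dict[int, float]:
    """Weighted Reciprocal Rank Fusion; ranks start at 1, an empty leg adds 0."""
    scores: dict[int, float] = defaultdict(float)
    for name, ranked in legs.items():
        w = weights.get(name, 0.0)
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += w / (k + rank)
    return dict(scores)


def apply_importance_prior(fused: dict[int, float], importance: dict[int, float]) -> dict[int, float]:
    """final = fused * (0.7 + 0.3 * importance); missing importance assumed 0.0."""
    return {d: s * (0.7 + 0.3 * importance.get(d, 0.0)) for d, s in fused.items()}


def top_k(scores: dict[int, float], k: int) -> list[int]:
    # Sort by score descending; break ties by lower id so output is deterministic.
    return [d for d, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]


weights = {"lex": 1.0, "dense": 1.0, "graph": 0.35}

# Doc 2 appears in both base legs, so it outranks doc 1 (lexical rank 1 only).
fused = weighted_rrf({"lex": [1, 2, 3], "dense": [2, 4], "graph": []}, weights)
print(top_k(fused, 4))  # [2, 1, 4, 3]

# With the dense and graph legs empty, fusion keeps the lexical order.
lexical_only = weighted_rrf({"lex": [1, 2, 3], "dense": [], "graph": []}, weights)
print(top_k(lexical_only, 3))  # [1, 2, 3]

# The importance prior is a post-fusion multiplier, not a fourth fused list.
final = apply_importance_prior(fused, {4: 1.0})
print(top_k(final, 4))  # [2, 4, 1, 3]
```

The second call illustrates the degrade property the ADR relies on: when only the lexical leg returns results, the fused ranking is the lexical ranking before any importance prior is applied.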
+ +## Consequences + +- Fusion is ~30 lines over three top-N queries; the lexical leg reuses the existing + `plainto_tsquery`/`ts_rank` query + OR-broaden fallback verbatim. +- The exact-stratum nDCG/MRR dip (~0.018/0.025, recall unaffected) is the known RRF cost of blending + one perfect hit with near-ties; a small **exact-match rank bonus** is the tunable recovery and a + cheap follow-up. +- `k=60` is borrowed from TREC ad-hoc IR; a quick re-sweep on the eval set is worthwhile but the + literature says it is insensitive. diff --git a/docs/adr/0006-pgvector-hnsw-halfvec-1024d-embeddings.md b/docs/adr/0006-pgvector-hnsw-halfvec-1024d-embeddings.md new file mode 100644 index 0000000..158a52b --- /dev/null +++ b/docs/adr/0006-pgvector-hnsw-halfvec-1024d-embeddings.md @@ -0,0 +1,49 @@ +# Production vector storage: pgvector HNSW + halfvec(1024); 1024-d embeddings (Voyage-3.5 / bge-large) + +Phase 1 of the hybrid ([ADR-0004](0004-phase-the-hybrid-lexical-dense-first-graph-gated.md)) needs a +production home for the dense embeddings. Per [ADR-0002](0002-api-postgres-first-sqlite-stays-lexical.md) +that is **pgvector on the shared CNPG Postgres**, where claude-memory is already a database tenant — no new +datastore. + +> ⚠️ **Correction (verified against live infra by a design challenger):** an earlier draft justified this +> with "Immich already runs pgvector on the same cluster." That is **false** — Immich runs its **own** +> Postgres, not the shared CNPG — so it is NOT evidence the shared cluster has the extension. **pgvector +> must be explicitly enabled on CNPG** (extension install, and possibly a CNPG operand-image change) via +> Terraform **before this can land**; do not assume it is already available. + +Decisions: + +- **Index: HNSW** (`USING hnsw (embedding halfvec_cosine_ops) WITH (m=16, ef_construction=64)`, + query knob `hnsw.ef_search` set via `SET LOCAL` inside the recall txn under PgBouncer). Best + speed-recall tradeoff, buildable on an empty table. 
**IVFFlat rejected** — it must be built *after* + data exists (empty-table footgun) and has a lower recall ceiling. +- **Type: `halfvec(1024)`** (fp16) — halves index size at ~no recall loss; 1024-d halfvec = 2048 + bytes/row → ~11 MB of raw vectors at the benchmark's 5,452-memory corpus size. +- **Dimension fixed at 1024**, chosen **once** (changing it later forces a full re-embed + HNSW + rebuild). 1024 matches both the production model (Voyage-3.5) and the prototype model + (bge-large-en-v1.5), so the column and all fusion code are identical regardless of model. +- **Model: Voyage-3.5** (1024-d, hosted) for **non-sensitive** memories (highest measured retrieval + quality of the hosted options, free tier covers the corpus); **bge-large-en-v1.5** (1024-d, local, + MIT) for **sensitive memories and the no-API-key fallback** ([ADR-0003](0003-external-embedding-apis-allowed-for-non-sensitive-memories.md)). + `is_sensitive=1` rows are never embedded externally — `embedding=NULL`, lexical-only. +- **pgvectorscale / StreamingDiskANN deferred** — Rust/pgrx must be compiled into the CNPG operand + image, and it only earns its keep above ~1–5M vectors; our corpus is orders of magnitude below that. + +## Migration shape + +A single **additive** Alembic migration: `ALTER TABLE memories ADD COLUMN embedding halfvec(1024)` +(NULL for sensitive) + `CREATE INDEX CONCURRENTLY … USING hnsw …`. The existing generated +`search_vector tsvector` + GIN index (`migrations/001`) are **untouched**, so lexical behaviour and +the SQLite-only degrade path are unchanged. pgvector enablement on CNPG and any extension/operand +change land as **Terraform/Terragrunt** in `infra/stacks/…` (GitOps, never kubectl) and trigger a +rolling restart of the shared cluster — coordinate accordingly. + +## Consequences + +- The prototype's in-process numpy matrix maps directly to this column; only the substrate changes, + not the retrieval math. 
+- The prototype measured **bge-large** quality; a cheap follow-up should re-run the dense leg with + **Voyage-3.5** on the non-sensitive corpus to confirm the hosted ceiling holds on our content + before locking the production default. +- Production latency/ANN-approximation/filtered-top-k behaviour are unmeasured in the prototype and + must be validated post-migration (a stated benchmark limitation). diff --git a/docs/research/benchmark-report.md b/docs/research/benchmark-report.md new file mode 100644 index 0000000..2a48d7b --- /dev/null +++ b/docs/research/benchmark-report.md @@ -0,0 +1,312 @@ +# Benchmark report: FTS (lexical baseline) vs Hybrid (lexical + dense + graph) + +Status: **the ADR-0001 decision instrument.** This is the head-to-head that gates hybrid +adoption. Read the [survey](survey.md) for the landscape and the +[integration design](integration-design.md) for what was built. + +**Bottom line:** Hybrid (lexical FTS ⊕ dense embeddings, RRF-fused) beats the lexical FTS baseline on +**every overall metric**, driven by the **paraphrase** stratum (recall@10 **+0.350**, robust under a +paired bootstrap) — precisely the gap embeddings were meant to close, with **no recall regression on +exact**. **Recommendation: adopt the lexical + dense fusion;** the ADR-0001 quality gate is met by +that fusion alone. + +> **⚠️ Post-review corrections — read before citing this report.** An adversarial completeness review +> after the run found that two prominent claims in the original draft did NOT survive scrutiny. They are +> corrected in place below; the full review is in the [Appendix](#appendix--adversarial-completeness-review). +> +> 1. 
**The concept graph was NOT shown to be useless — it was never validly tested.** The prototype's +> fusion restricted the graph leg to candidate ids *outside* the FTS∪dense base set and weighted it at +> 0.35, which makes it **mathematically impossible** for a graph-only result to enter the fused top-k +> (max graph RRF `0.35/(60+1)≈0.0057` < any base-leg minimum `1.0/(60+50)≈0.0091`). So the ablation +> "full-hybrid ≡ FTS+dense to three decimals" was **guaranteed by the fusion config, not an empirical +> finding** — the review even found a genuinely-relevant memory the graph surfaced (and both base legs +> missed) that fusion then discarded. The concept graph is therefore **UNEVALUATED**, and is deferred on +> **cost + invalid-test** grounds, *not* because it failed. A valid retest must put graph ids in the +> fused candidate pool (no base-set exclusion) and/or sweep the graph weight. +> 2. **The multi-hop "win" is not statistically significant.** A paired bootstrap (B=10000) puts 3 of the +> 4 multi-hop metric CIs across zero (recall@10 Δ+0.064, P(Δ≤0)≈0.06). Only the **overall** and +> **paraphrase** deltas are robust. Multi-hop deltas are de-bolded below. Since multi-hop is the entire +> rationale for a concept graph, there is currently **no statistically distinguishable evidence** that +> *anything* (dense or graph) helps the multi-hop stratum. +> +> Also: absolute recall/nDCG *levels* are biased low (binary, un-pooled qrels — §6); only FTS-vs-hybrid +> **deltas** are trustworthy. None of this changes the adopt-lexical+dense recommendation, which rests on +> the robust paraphrase/overall win. + +--- + +## 1. Methodology + +### 1.1 Test collection +- **Corpus:** 5,452 memories (`benchmarks/data/corpus.jsonl`, gitignored — privacy). Sensitive + memories (`is_sensitive=1`) excluded entirely per ADR-0003. 
The corpus is the user's real + personal memories; **only aggregate numbers and synthetic illustrations appear in this document.** +- **Queries:** 119, stratified — **40 exact / 40 paraphrase / 39 multi-hop** + (`benchmarks/data/queries.jsonl`). +- **Qrels:** `benchmarks/data/qrels.jsonl`, LLM-generated (the LongMemEval pipeline inverted for a + memory store: seed memory → exact/paraphrase/multi-hop query → relevant ids). + +### 1.2 The two systems +- **`fts` (baseline)** — `benchmarks/retrievers/fts.py::FtsRetriever`. A **faithful + reimplementation of the production code path** `src/claude_memory/mcp_server.py::_sqlite_recall` + (`sort_by="relevance"`): in-memory FTS5 over the full corpus with the production virtual-table + shape + default `unicode61` tokenizer (no stemming/stop-words); AND-match first, OR-broaden only + if AND returns zero; rank by the production blend `-bm25()*0.7 + importance*0.3`; LIKE fallback on + operational errors. **This is "the current system" the hybrid must beat — not the simplified + README prose.** +- **`hybrid`** — `benchmarks/retrievers/hybrid.py::HybridRetriever`. Three legs fused with weighted + RRF: (1) lexical = **`FtsRetriever` reused verbatim** (so the hybrid's lexical component *is* the + baseline — no drift); (2) dense = `BAAI/bge-large-en-v1.5` (1024-d, local, L2-normalized, BGE + query-instruction prefix), cosine via numpy; (3) graph = a keyword-co-occurrence memory-node graph + (5,452 nodes / 2,095,624 edges), 1-hop expansion from the top-10 seeds. + Fusion: `RRF(d) = Σ_leg w_leg/(60 + rank_leg(d))`, `w_fts = w_dense = 1.0, w_graph = 0.35`, each + leg to depth 50. + +### 1.3 Metrics & protocol +- **recall@5, recall@10** (hot-path "did we surface it"), **nDCG@10** (graded, position-aware + headline), **MRR** (first-hit). Per stratum and overall. +- `retrieve_k=20`, run via `scripts/run_eval.py --retriever retrievers.{fts,hybrid}:…`. Deterministic; + both invocation paths (programmatic + CLI) verified identical. 
+- **`sort_by="relevance"` pinned across both arms** (not the production `importance` default), so the + benchmark measures *retrieval* quality, not the importance prior — and everything else (corpus, + queries, OR-broaden behaviour) is held fixed. +- Full result JSONs written only to gitignored `benchmarks/results/{fts,hybrid}.json`; retriever code + contains no embedded corpus content (safe to commit). + +--- + +## 2. Head-to-head results + +### 2.1 Comparison table (FTS vs Hybrid, with deltas) + +| Stratum | Metric | FTS | Hybrid | Δ | +|---|---|---:|---:|---:| +| **Overall** (n=119) | recall@5 | 0.6663 | 0.7415 | **+0.0752** | +| | recall@10 | 0.6952 | 0.8338 | **+0.1386** | +| | nDCG@10 | 0.6507 | 0.7284 | **+0.0777** | +| | MRR | 0.6737 | 0.7297 | **+0.0560** | +| **Exact** (n=40) | recall@5 | 1.0000 | 1.0000 | +0.0000 | +| | recall@10 | 1.0000 | 1.0000 | +0.0000 | +| | nDCG@10 | 0.9908 | 0.9723 | −0.0185 | +| | MRR | 0.9875 | 0.9625 | −0.0250 | +| **Paraphrase** (n=40) | recall@5 | 0.3500 | 0.5500 | **+0.2000** | +| | recall@10 | 0.3750 | 0.7250 | **+0.3500** | +| | nDCG@10 | 0.3123 | 0.5023 | **+0.1900** | +| | MRR | 0.2958 | 0.4343 | **+0.1385** | +| **Multi-hop** (n=39) | recall@5 | 0.6485 | 0.6726 | +0.0241 ¹ | +| | recall@10 | 0.7111 | 0.7748 | +0.0637 ¹ | +| | nDCG@10 | 0.6491 | 0.7101 | +0.0610 ¹ | +| | MRR | 0.7393 | 0.7940 | +0.0547 ¹ | + +¹ **Multi-hop deltas are NOT statistically significant.** Paired bootstrap (B=10000): recall@5 +`CI[−0.046,+0.095]` (P(Δ≤0)≈0.25), recall@10 `CI[−0.020,+0.143]` (P≈0.06), MRR `CI[−0.038,+0.143]` +(P≈0.12); only nDCG@10 is marginal (P≈0.04). Treat the multi-hop stratum as **inconclusive**. The +overall and paraphrase deltas, by contrast, have CIs well clear of zero (P≤0.003). 
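The paired bootstrap behind footnote ¹ can be sketched as follows. This is an illustrative reconstruction, not the script used for the report: it assumes per-query metric values for both arms on the same queries, a 95% percentile interval (the report does not state the level), and a hypothetical function name.

```python
import random


def paired_bootstrap(base, treat, n_resamples=10_000, seed=0):
    """Return (mean_delta, ci_low, ci_high, p_delta_le_0) for paired per-query scores.

    base[i] and treat[i] must be the same metric on the same query i.
    """
    if len(base) != len(treat) or not base:
        raise ValueError("need two non-empty, equal-length score lists")
    deltas = [t - b for b, t in zip(base, treat)]
    n = len(deltas)
    rng = random.Random(seed)  # fixed seed so reruns are reproducible
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(deltas, k=n)  # resample queries with replacement
        means.append(sum(sample) / n)
    means.sort()
    ci_low = means[int(0.025 * n_resamples)]
    ci_high = means[min(n_resamples - 1, int(0.975 * n_resamples))]
    p_le_0 = sum(m <= 0 for m in means) / n_resamples
    return sum(deltas) / n, ci_low, ci_high, p_le_0


if __name__ == "__main__":
    # Synthetic per-query recall values, purely for illustration.
    fts = [0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
    hybrid = [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
    print(paired_bootstrap(fts, hybrid, n_resamples=2_000))
```

Resampling the per-query deltas, rather than each arm independently, is what makes the test paired: query difficulty cancels out, so a stratum is called inconclusive only when the interval on the delta itself crosses zero.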
+ +### 2.2 Reading the strata + +- **Exact — recall held at 1.0, but that is partly circular.** Exact queries were *generated* as "a + salient phrase whose top FTS hit is memory X," then X was labelled relevant — so FTS recall@5/@10 = 1.0 + is substantially **guaranteed by how the stratum was built**, not an independent property. The one + genuine signal here is hybrid's small **degradation**: nDCG@10 −0.018 `CI[−0.046,0]`, MRR −0.025 + `CI[−0.063,0]` — a real, consistent rank demotion from blending one perfect lexical hit with dense + near-ties. The proposed exact-match rank bonus (ADR-0005) would recover it, **but that fix is asserted, + not measured.** Recall is unaffected; the cost is rank-position only. + +- **Paraphrase — the LARGEST WIN, the dense leg's payoff.** recall@10 **+0.350** (0.375 → 0.725), + recall@5 **+0.200**, nDCG@10 **+0.190**. This is exactly the low-lexical-overlap stratum embeddings + were predicted to fix: lexical FTS finds barely a third of paraphrased answers; adding dense nearly + doubles recall@10. **This stratum alone justifies the hybrid.** + +- **Multi-hop — INCONCLUSIVE (deltas not significant).** recall@10 +0.064, nDCG@10 +0.061, MRR +0.055, + but 3 of 4 CIs cross zero (footnote ¹). So we **cannot claim** a multi-hop win for hybrid. We also + cannot attribute anything to the graph: per the §3 caveat, the fusion config structurally barred the + graph leg from the top-k, so the multi-hop stratum tests only FTS vs FTS+dense — and that difference is + not statistically distinguishable here. Multi-hop is exactly the stratum a *properly tested* concept + graph is meant to win, and it remains an open question. + +--- + +## 3. 
Ablation — what each leg contributes + +Four configs, everything else fixed: + +| Config | Legs | Overall recall@10 | Overall nDCG@10 | Para recall@10 | Multi recall@10 | Exact nDCG@10 | +|---|---|---:|---:|---:|---:|---:| +| **A** | FTS + dense + graph (full hybrid) | 0.834 | 0.728 | 0.725 | 0.775 | 0.972 | +| **B** | FTS + dense (`w_graph=0`) | **0.834** | **0.728** | **0.725** | **0.775** | **0.972** | +| **C** | dense only | 0.748 | — | — | — | 0.861 | +| **D** | FTS only (= baseline) | 0.695 | 0.651 | 0.375 | 0.711 | 0.991 | + +**A ≡ B to three decimals on every metric — but this is a STRUCTURAL ARTIFACT, not a test of the graph.** +The fusion restricts the graph leg to ids *outside* the FTS∪dense base set and weights it at 0.35, so a +graph-only id's maximum RRF score is `0.35/(60+1) ≈ 0.0057`, strictly below the *minimum* score of any id +from a base leg (`1.0/(60+50) ≈ 0.0091`). Since both base legs return 50 ids, **every base-leg id outranks +every graph-only id** — a graph-only result can never enter the fused top-k, regardless of corpus, query +set, or graph quality. A ≡ B was **mathematically guaranteed before any data ran.** The honest reading is +"the graph cannot affect top-k *under this fusion config*," NOT "the concept graph contributes nothing." (A +spot check found a genuinely-relevant memory the graph surfaced and both base legs missed at depth 50 — +fusion discarded it.) **The graph is therefore unevaluated.** + +What the ablation *does* validly show is the **FTS-vs-dense decomposition**, which stands: +- **Dense recovers paraphrase** (C beats D's 0.375 para recall@10 decisively) but is weaker on exact + (C exact nDCG 0.861 vs D's 0.991). +- **Lexical recovers exact** (D exact nDCG 0.991, the best) but collapses on paraphrase. +- **Fusion (B) gets the best of both** — exact recall stays perfect, paraphrase nearly doubles. 
+- *Caveat:* configs B/C/D were not persisted as result JSONs, so these specific numbers are not + independently reproducible from committed artifacts (only A = full hybrid and D = FTS are). + +**To actually test the graph** (deferred follow-up — [ADR-0004](../adr/0004-phase-the-hybrid-lexical-dense-first-graph-gated.md)): +put graph candidates in the fused pool (drop the base-set exclusion) and/or sweep `w_graph` upward, on a +multi-hop slice whose hops are *not* semantically adjacent (so the dense leg can't shortcut them), using +real typed-relation extraction rather than the prototype's zero-LLM keyword co-occurrence graph. + +--- + +## 4. Latency & storage (measured, NON-GATING per ADR-0001) + +### 4.1 Latency (per-query `retrieve()`, CPU-only box, no GPU) + +| System | p50 | p95 | mean | max | +|---|---:|---:|---:|---:| +| FTS (pure SQLite) | 15.7 ms | 27.8 ms | 12.8 ms | 31.9 ms | +| Hybrid | 229.6 ms | 344.5 ms | 249.3 ms | 640.0 ms | + +The hybrid's ~230 ms p50 is **dominated by the local bge-large query embedding** (one CPU forward +pass). On the production GPU node or via a hosted API this drops ~10× to low tens of ms. The FTS, +dense-ANN, and RRF-merge costs themselves are negligible. **Latency does not gate adoption** (the +success metric is quality-first), and the production read path (pgvector HNSW + GPU/hosted query +embed) is far faster than this prototype's CPU profile. 
+ +### 4.2 Storage + +| Component | Size | Notes | +|---|---|---| +| FTS5 in-memory index | ~8.3 MB | SQLite shadow tables over 5,452 memories | +| Dense matrix | 22.3 MB | 5,452 × 1024 float32; cached `.npy`, fingerprint-keyed | +| Concept graph (in-memory) | ~202 MB (Python-object estimate) | 5,452 nodes + 2,095,624 edges, networkx; **not persisted, not shipped** | +| Total reported index | 232.2 MB | matrix + FTS + graph estimate | + +Production maps the dense matrix to pgvector `halfvec(1024)` (~2 KB/row, ~11 MB of raw vectors for +the corpus, half the float32 matrix) and — only if the graph is ever adopted — three Postgres node/edge tables. No +pgvector/docker in the prototype; in-process numpy cosine (faiss unnecessary at N=5452). Storage is +reported, not gating. + +--- + +## 5. Recommendation + +**ADOPT — ship lexical + dense fusion (phase 1); defer the concept graph behind a gate.** + +Rationale: +1. **The ADR-0001 quality gate is met.** Hybrid beats FTS on all four overall metrics, with the + decisive, statistically-robust win on **paraphrase** (recall@10 +0.350) — precisely the gap embeddings + were meant to close — with **no recall regression on exact**. (The multi-hop deltas are *not* + statistically significant — §2.2, footnote ¹ — so they are not part of the case for adoption.) +2. **The gain is the FTS+dense fusion, not the graph.** The ablation shows full-hybrid ≡ FTS+dense. + So phase 1 = embeddings (pgvector) fused with the existing FTS via weighted RRF, preserving the + importance prior — exactly the [integration design](integration-design.md) §A. +3. **The concept graph stays GATED — because it was not validly tested, not because it failed** + ([ADR-0004](../adr/0004-phase-the-hybrid-lexical-dense-first-graph-gated.md)). The prototype's fusion + config structurally barred it from the top-k (§3), so this benchmark says nothing about its value. 
+ Deferral is justified by operational cost (LLM extraction + two extra tables + traversal) **plus the + remaining uncertainty** — not by evidence of uselessness. Re-open with a valid retest: graph candidates + in the fused pool (no base-set exclusion) and/or a swept weight, on non-semantically-adjacent multi-hop + queries built from real typed-relation extraction. +4. **The small exact-stratum nDCG/MRR dip** is the known RRF blending cost on recall-perfect queries; + an exact-match rank bonus is a cheap follow-up ([ADR-0005](../adr/0005-rrf-default-cc-challenger.md) + records RRF as the default with the bonus as a tunable). + +Adopt with the changes already designed: pgvector `halfvec(1024)` + HNSW, weighted-RRF fusion with +the importance prior preserved, sensitive memories excluded from embedding (ADR-0003), and SQLite- +only mode unchanged (ADR-0002). + +--- + +## 6. Limitations (stated honestly) + +1. **Label noise / qrels quality.** Qrels were **LLM-generated** with **lighter hand-verification + than the ideal protocol** (no measured Cohen's κ between LLM and human judgments). LLM judges are + systematically lenient, and the eval is **119 queries (~40/stratum)** — *below* the ~50/stratum + the literature (Voorhees & Buckley) recommends for confident per-stratum ranking. The original run + computed **no bootstrap CIs or paired significance test**; the post-review paired bootstrap + (B=10000) is reported in §2.1, footnote ¹. The overall and paraphrase deltas + (+0.14, +0.35 recall@10) are large enough to be robust to plausible label noise; the multi-hop + (+0.06) and the exact-stratum dip (~0.02) are within the range where label noise could matter, so + treat those as directional, not precise. + +2. **Pooling / "holes."** It is not confirmed that qrels pooled the top-k of *all* arms (FTS, dense, + hybrid) before judging. 
If the pool was lexical-biased, the metric is **biased against** the dense + and hybrid arms (they retrieve unjudged relevant memories scored as misses) — meaning the true + hybrid uplift could be *larger* than reported, not smaller. This does not threaten the "adopt" + conclusion but caveats the exact magnitudes. + +3. **Snapshot vs pgvector (substrate mismatch).** The prototype used an **in-process numpy** dense + index over a static corpus snapshot, **not** the production pgvector HNSW on live CNPG Postgres. + Retrieval *quality* transfers (cosine is cosine; HNSW recall at this scale is ~exact), but the + production **latency profile, ANN approximation, and filtered-top-k behaviour are unmeasured here** + and must be validated post-migration. + +4. **Extraction shortcuts AND a structurally-excluded graph leg.** The concept graph was built with + **zero LLM calls** — concepts = `tags` ∪ `expanded_keywords` ∪ a regex noun-phrase proxy, edges from + keyword co-occurrence — **not** the typed-relation, LLM-extracted graph the production design (§A.5) + specifies. More importantly, the fusion config **structurally barred even this cheap graph from the + top-k** (§1 corrections, §3), so this run is **not a valid test of the graph at all** — not of the + cheap construction, and certainly not of a properly LLM-extracted typed-relation graph. The graph is + *gated and unevaluated*, not *killed*. + +5. **Embedding model is the prototype default, not the production pick.** Numbers are for + **bge-large-en-v1.5** (local). Production should use **Voyage-3.5** (also 1024-d) for + non-sensitive memories (ADR-0003); its higher quality ceiling on *our* content is unverified — a + cheap, recommended follow-up (re-run the dense leg with Voyage). + +6. **`sort_by="relevance"` not the production default.** The benchmark pins `relevance` to isolate + retrieval quality; production defaults to `importance`-blended ranking. 
The design preserves + importance as a post-fusion prior, but the *user-visible* ranking under the default sort was not + benchmarked. + +7. **Single user, dense corpus.** ~5,452 memories from one author are topically adjacent with many + near-duplicates, so "the one relevant id" is sometimes fuzzy; graded judgments over a pool mitigate + this but it remains a property of the corpus that may not generalize. + +--- + +## Appendix — adversarial completeness review + +An independent critic reviewed this report after the run (verdict: **usable-with-caveats**). Its findings +drove the §1 corrections; the full list is recorded here so the review is part of the permanent record. + +1. **Graph-null claim is circular, not empirical (most serious).** Proven that the fusion config + (graph leg excluded from the base set; `w_graph=0.35`) caps a graph-only id's RRF at `0.35/61≈0.0057`, + below any base-leg id's `1.0/110≈0.0091`. `A≡B` was guaranteed before any data. Honest statement: "the + graph cannot affect top-k under this fusion config," not "the graph contributes nothing." +2. **Graph mechanism mis-attributed, and contradicted by the data.** Reconstructing the legs (graph = + 5,452 nodes / 2,095,624 edges) on a 15-query sample found `rel_only_via_graph=1`: a relevant memory the + graph surfaced that was absent from BOTH base legs at depth 50 — dense did NOT already cover it, and + fusion discarded it. The "dense already retrieves them" explanation is false in ≥1 observed case. +3. **Multi-hop wins are not statistically supported — yet were bolded.** Paired bootstrap (B=10000): + overall & paraphrase robust (P≤0.003); multi-hop 3/4 CIs cross zero (recall@10 P≈0.06). Multi-hop is the + sole rationale for the graph, and it shows no statistically distinguishable hybrid advantage. +4. **Exact stratum is circular by construction.** All 40 exact queries were generated as "top FTS hit for a + salient phrase of ⟨id⟩," so FTS's 1.0 is largely tautological. 
The only genuine exact signal is hybrid's + small nDCG/MRR demotion, whose claimed rank-bonus fix is asserted, never measured. +5. **qrels are binary and un-pooled, so nDCG is mislabeled and absolute numbers unreliable.** Binary labels + make nDCG degenerate to position-discounted recall (no graded info). Un-pooled, author-assigned labels + bias absolute recall/nDCG **low** (esp. for dense/hybrid); only the FTS-vs-hybrid **delta** is trustworthy. +6. **Underpowered and single-corpus.** 119 queries (~40/stratum) is below the cited ~50/stratum standard; + 65% of queries have exactly one relevant id (recall@5≈recall@10 for most). One author, no second corpus, + no inter-annotator agreement: external validity asserted, not demonstrated. +7. **Headline metric overstates hot-path value.** recall@10 leads everywhere, but the auto-recall hook + injects a top-k into the prompt; recall@5 (+0.075) and MRR/first-hit are the decision-relevant metrics. + Hybrid's recall@10 edge partly comes from pushing answers into ranks 6–10. The hook's effective `k` is + never stated. +8. **Production substrate and model wholly unmeasured.** Numbers are local exact numpy cosine over + bge-large; production is pgvector HNSW (approximate; recall depends on `ef_search`) with a filtered top-k + (NULL-embedding sensitive rows — a partial-index interaction that can hurt HNSW recall), using Voyage-3.5 + (never run). "Quality transfers" and "~10× faster on GPU/hosted" are assumptions, not measurements. +9. **Ablation configs A/B/C/D are not reproducible from artifacts.** Only `results/{fts,hybrid}.json` + (= A and D) are persisted; B/C have no saved JSON, and there's no run-manifest/seed/version capture + beyond the embedding-cache fingerprint. The decision-critical `A≡B` and the C/D numbers must be re-run. +10. 
**Several deferred fixes are asserted, not tested.** The exact-match rank bonus, the importance + post-fusion prior (the benchmark pinned `sort_by="relevance"` without it, so the user-visible production + ranking was never measured), and the "CC ran on our set and RRF was chosen" claim (no CC results exist + anywhere) are all unverified. diff --git a/docs/research/integration-design.md b/docs/research/integration-design.md new file mode 100644 index 0000000..edf89b0 --- /dev/null +++ b/docs/research/integration-design.md @@ -0,0 +1,292 @@ +# Integration design: hybrid recall (production + prototype-as-built) + +Status: the design the benchmark validated. This document specifies **(A) the production design** +for the API/Postgres deployment and **(B) the prototype as actually built** for the benchmark. +Read the [survey](survey.md) for *why* these choices, the +[benchmark report](benchmark-report.md) for whether they cleared the gate, and +[ADR-0001–0003](../adr/) for the fixed constraints. + +**Headline architectural decision (recorded in +[ADR-0004](../adr/0004-phase-the-hybrid-lexical-dense-first-graph-gated.md)):** ship +**lexical + dense fusion** first (a statistically-robust paraphrase win cleared ADR-0001's gate); the +**concept graph is deferred behind a gate** because it was **never validly tested** — the prototype's +fusion config structurally barred it from the top-k (see [benchmark report §1 + §3](benchmark-report.md)), +so it is *unevaluated*, not disproven. Deferral is on cost + uncertainty grounds. The design below is +structured around that phasing. + +--- + +## A. Production design (API/Postgres deployment) + +### A.0 Where it plugs in + +The semantic layer targets the **API/Postgres path only** (ADR-0002). The authoritative store is +the CNPG Postgres behind the FastAPI server (`src/claude_memory/api/app.py`); the local SQLite +cache stays **lexical (FTS5) only** and degrades gracefully. 
Recall fires on the **hot path** of +every prompt (an auto-recall hook before each turn), so the read path must stay within a +per-prompt latency budget even though latency is non-gating for *adoption* (ADR-0001). + +The current production recall (`app.py::recall_memories`, `POST /api/memories/recall`) is a single +`plainto_tsquery('english')` + `ts_rank(search_vector, query)` ordered by a blend +`ts_rank*0.7 + importance*0.3` (or `*0.4/*0.6` for `sort_by="importance"`), with an OR-broaden +fallback gated at `ts_rank > OR_BROADEN_MIN_RANK` when AND-match under-fills. The live schema +(`migrations/001`) has a generated `search_vector tsvector` (setweight A=content, B=expanded_keywords, +C=tags, D=category) + a GIN index `idx_memories_search`. **The hybrid design is purely additive +to this.** + +### A.1 Schema delta (additive, one migration) + +```sql +-- new Alembic migration (Postgres only; SQLite path unchanged) +ALTER TABLE memories ADD COLUMN embedding halfvec(1024); -- NULL for is_sensitive=1 +CREATE INDEX CONCURRENTLY idx_memories_embedding + ON memories USING hnsw (embedding halfvec_cosine_ops) + WITH (m = 16, ef_construction = 64); +``` + +- **1024-d** matches both production (Voyage-3.5) and the prototype (bge-large-en-v1.5), so the + column dimension and all fusion code are identical whichever model runs. +- **halfvec** (fp16) halves index size at ~no recall loss; 1024-d halfvec = 2048 bytes/row → + ~11 MB of raw vectors for the 5,452-memory benchmark corpus. +- The existing `search_vector` + GIN index are **untouched**. Lexical behaviour is unchanged, so + NULL-embedding rows (sensitive memories) and SQLite-only mode degrade to exactly today's FTS. +- `CONCURRENTLY` avoids locking the shared table during backfill. +- The concept-graph tables (§A.5) ship **only if/when the graph clears its gate** — phase 2. 
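A back-of-envelope sizing of the new column, as a sketch. It assumes the benchmark's 5,452-row corpus, counts only the raw fp16 components (2 bytes each), and ignores per-value headers, row overhead, and the HNSW index itself; the helper name is hypothetical.

```python
def vector_payload_bytes(dims: int, rows: int, bytes_per_component: int = 2) -> int:
    """Raw storage for `rows` vectors of `dims` components (fp16 by default)."""
    return dims * bytes_per_component * rows


DIMS = 1024   # fixed embedding dimension from this design
ROWS = 5_452  # benchmark corpus size; an assumption for production

per_row = vector_payload_bytes(DIMS, 1)
halfvec_total = vector_payload_bytes(DIMS, ROWS)
float32_total = vector_payload_bytes(DIMS, ROWS, bytes_per_component=4)

if __name__ == "__main__":
    print(f"per row:       {per_row} bytes")
    print(f"halfvec total: {halfvec_total / 1e6:.1f} MB")
    print(f"float32 total: {float32_total / 1e6:.1f} MB")
```

For that corpus size the raw payload works out to roughly 11 MB, half of the float32 matrix the prototype cached, before any index overhead.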
+ +### A.2 Write path (store / update) — all LLM work here, off the recall hot path + +On `memory_store` / `memory_update`, for **non-sensitive** rows (`is_sensitive=0`, hard ADR-0003 +gate): + +1. **Embed** `content` (optionally `content + expanded_keywords`) → one `halfvec(1024)` vector, + written to the new column. Voyage-3.5 `input_type="document"` for stores; bge-large + `encode_document` for sensitive/no-key fallback. `is_sensitive=1` rows get `embedding=NULL` — + never embedded, never egressed; they still match via FTS. +2. **(Phase 2, gated) Extract** concepts/edges for the new memory and incrementally merge into the + graph tables (entity resolution via pgvector nearest-neighbour + threshold, LLM tie-break only + on ambiguity — Graphiti-style fast-path). +3. **(Optional, flagged) Curate** — the Mem0-style ADD/UPDATE/DELETE/NOOP loop, run async, never + physically deleting (supersede to `[SUPERSEDED]` tombstone). Isolated behind a flag so it never + confounds the benchmark. + +The existing **background sync engine** already moves rows SQLite↔Postgres in a daemon thread; the +embedding is just another column it carries (authoritative vector in Postgres). Extraction/curation +ride the same off-hot-path lane. The synchronous store call must **not** block on embedding/ +extraction if it would delay the response — these run async. + +### A.3 Read path (hot path) — three CTEs, RRF fusion, importance prior + +Replace the single ts_rank ORDER BY with three top-N legs over the **same `memories` table**, fused +in the handler: + +1. **Lexical leg** — the *existing* query verbatim: `plainto_tsquery('english', $q)` + + `ts_rank(search_vector, query)`, with the existing OR-broaden fallback + (`OR_BROADEN_MIN_RANK`) kept intact. `rank_lex` = position in this list. (LIMIT ~50.) +2. **Dense leg** — `ORDER BY embedding <=> $qvec LIMIT 50` using the HNSW index. `rank_dense` = + position. Sensitive rows (NULL embedding) never enter this list. +3. 
**Graph leg (phase 2, gated, currently disabled)** — seed concept nodes by reusing the + lexical+dense match, traverse 1–2 hops via a recursive CTE over the edge table, score reachable + memories by hop-decay → `rank_graph`. List allowed to be empty. + +**Fuse** in Python: `fused(d) = Σ_{s} w_s / (60 + rank_s(d))`, default `w_lex = w_dense = 1.0`, +`w_graph ≈ 0.35` (the prototype's down-weighting; the graph leg was never validly tested, so this +weight is provisional, see benchmark report §3). Missing leg ⇒ 0 +contribution, no special-casing. + +**Preserve the importance prior** (the current code is *not* pure relevance): apply it as a +post-fusion multiplier — `final(d) = fused(d) * (0.7 + 0.3*importance)` for `sort_by="relevance"`, +or use importance as the tie-break for `sort_by="importance"`. Importance is a *prior*, **not** a +fourth fused list. `sort_by="recency"` stays a pure `ORDER BY created_at`, untouched. + +> **Why RRF, not convex combination, as the default:** we fuse three incomparable scales +> (unbounded ts_rank, bounded cosine, arbitrary graph proximity), one of them sparse/often-empty. +> RRF is scale-agnostic and treats a missing leg as a clean 0, where CC would force a maintained +> normalization per signal plus a decision for "absent." RRF also collapses to today's exact +> lexical ordering when dense/graph are empty (the SQLite degrade path, **same code**). No +> CC/TM2C2 results were produced in the benchmark run, so RRF is adopted on these principled grounds, +> not a measured head-to-head; CC remains the standing challenger (ADR-0005). + +**Single-query alternative (future):** the three legs can be expressed as CTEs + a FULL OUTER JOIN +on `id` with RRF computed in SQL (Supabase hybrid-search pattern), saving a round-trip. The +prototype and initial production both fuse in Python for clarity; in-DB fusion is an optimization, +not a correctness change. 
+ +### A.4 SQLite-only graceful degrade + +With only the lexical leg present, RRF reduces to ranking by `rank_lex` — **identical ordering to +today's FTS5 `bm25()`**. 
Zero behaviour change offline; the *same* fusion code path runs in both +modes (dense/graph legs simply empty). Satisfies ADR-0002. + +### A.5 Concept graph (phase 2 — designed, gated, NOT shipped in v1) + +If a future benchmark justifies it (the prototype's did not — §B.4): + +```sql +CREATE TABLE concepts ( + id bigserial PRIMARY KEY, + canonical_name text NOT NULL, + aliases text[], + embedding halfvec(1024), -- for canonicalization + query seeding + category text +); +CREATE TABLE concept_edges ( + src_id bigint REFERENCES concepts(id), + dst_id bigint REFERENCES concepts(id), + relation text NOT NULL, + weight real, + valid_from timestamptz, -- bi-temporal (Graphiti-style supersede) + valid_to timestamptz, + evidence_memory_ids bigint[] +); +CREATE TABLE memory_concepts ( -- the "mentions" link + memory_id bigint REFERENCES memories(id), + concept_id bigint REFERENCES concepts(id), + relation text +); +``` + +- **Construction** (backfill): batched open LLM triple-extraction (~10–25 calls for the whole + corpus, each memory id-tagged) → global embedding-cluster canonicalization (EDC/KGGEN style) → + write the three tables. Off the hot path; `is_sensitive=1` filtered *before* any call. +- **Incremental** (per new memory): extract its triples, resolve entities against `concepts` via + pgvector NN + threshold (LLM tie-break only on ambiguity), set-merge — never re-cluster. +- **Traversal at recall:** plain recursive SQL CTE (1–2 hops). **No Apache AGE** — our multi-hop is + shallow. If a future need for PPR arises, compute it in Python over a cached `scipy.sparse` + transition matrix loaded from the edge table (Postgres has no native PPR), rebuilt only on graph + mutation. +- **Bi-temporal edges** realize our "supersede, don't accumulate" rule as a queryable timeline: + contradicted edges get `valid_to` set, not deleted. 
+ +### A.6 Optional stage-2 cross-encoder (gated separately) + +`bge-reranker-v2-m3` on the GPU node over the fused top ~20–30, `sort_by="relevance"` only, +sensitive rows excluded, with a hard-timeout fallback to fused order. Ship only if it clears both +the quality bar **and** the p95 hot-path budget. Not in v1. + +### A.7 Infrastructure (production deploy) + +- **pgvector enablement on CNPG:** the cluster already runs pgvector for Immich, so the legacy + custom-operand-image path is in place; `CREATE EXTENSION vector` + the additive migration. Any + extension add triggers a rolling restart of the shared cluster — coordinate via presence/GitOps. +- **All cluster changes via Terraform/Terragrunt** in `infra/stacks/...` (GitOps, never kubectl). +- **Embedding/extraction compute:** in-cluster **llama-cpp on the GPU node** for sensitive-safe + local processing (and the no-key fallback); **Voyage-3.5** (hosted) for the non-sensitive batch + (ADR-0003). Sensitive memories are routed locally or left lexical-only — enforced, not + best-effort. +- **PgBouncer:** set `hnsw.ef_search` via `SET LOCAL` inside the recall transaction (transaction + pooling). +- **pgvectorscale/DiskANN deferred** — not needed below ~1–5M vectors. + +--- + +## B. Prototype as built (the benchmark harness) + +The prototype validates **retrieval quality cheaply, in-process** — *not* pgvector/Postgres +(standing up CNPG just to benchmark would burn days before knowing if hybrid even beats FTS). It is +a faithful stand-in: the lexical leg is the *exact* production code path, and the fusion is the same +weighted RRF the production design specifies. + +### B.1 Files (committable code only; data/cache/results gitignored) +- `benchmarks/retrievers/fts.py` — `FtsRetriever`, the lexical baseline. +- `benchmarks/retrievers/hybrid.py` — `HybridRetriever`, the three-leg fusion. +- `benchmarks/retrievers/test_hybrid.py` — 9 model-free tests (synthetic content only). 
+- `benchmarks/scripts/run_eval.py`, `benchmarks/harness/` — runner + metrics. +- Eval data (`benchmarks/data/{corpus,queries,qrels}.jsonl`), embedding cache + (`benchmarks/cache/*.npy`), and full results (`benchmarks/results/*.json`) are **gitignored** + (verified via `git check-ignore`) — privacy rule: no real memory content committed. + +### B.2 Lexical leg = the real product + +`hybrid.py` **reuses `retrievers.fts.FtsRetriever` verbatim**, which is itself a faithful +reimplementation of `src/claude_memory/mcp_server.py::_sqlite_recall` (`sort_by="relevance"`): a +fresh in-memory FTS5 index over the 5,452-memory corpus with the production virtual-table shape and +default `unicode61` tokenizer; query handling mirrors production (AND-match first, OR-broaden if +zero rows; rank by `-bm25()*0.7 + importance*0.3`; LIKE fallback on operational errors). **So the +hybrid's lexical component *is* the exact production system it must beat — no drift.** + +### B.3 Dense leg + +- **Model:** `BAAI/bge-large-en-v1.5`, 1024-d, L2-normalized. Passages raw; the query gets the BGE + instruction prefix `"Represent this sentence for searching relevant passages: "`. Similarity = + cosine via numpy matmul over the normalized matrix (faiss unnecessary at N=5452). +- **Embeddings:** all 5,452 memories embedded once (one-time ≈ 31.5 min on a CPU-only box), + **cached** fingerprint-keyed to `cache/emb_BAAI_bge-large-en-v1.5_.npy` (+ `.ids.npy`) + → reruns skip the embed (cache-hit rebuild ≈ 8.3 s). +- Hosted Voyage/OpenAI/Cohere paths are implemented and key-gated but were **untriggered** (no key + in env) — so the prototype ran the local default, which is also the sensitive-only production + fallback. 
**Production maps this matrix to pgvector `halfvec(1024)`.** + +### B.4 Graph leg (built, but structurally excluded from the ranking — UNMEASURED) + +- **Construction with ZERO LLM calls** (the tractable shortcut): concepts = union of each memory's + `tags` + the already-LLM-generated `expanded_keywords` field + a lightweight regex/stop-word + noun-phrase proxy over `content`, normalized + de-pluralized. 37,075 concepts extracted; **19,907 + kept** after document-frequency pruning (df ∈ [2, 2%·N=109]: df<2 links nothing, df>109 are + non-discriminative hubs). Concept cliques → weighted memory–memory edges. Result: **5,452 nodes, + 2,095,624 edges**, built in ~9 s (in-memory networkx). +- **Traversal:** 1 hop from the top-10 RRF seeds (capped 25 neighbours/seed), contributing **only ids + not already in the base legs** — this exclusion is the bug (next bullet). +- **Result: the graph was structurally barred from the ranking, so its value is UNMEASURED.** Because + graph hits are restricted to ids *outside* the FTS∪dense base set and weighted 0.35, a graph-only id's + max RRF (`0.35/61 ≈ 0.0057`) sits below any base-leg id's min (`1.0/110 ≈ 0.0091`) — it can never enter + the fused top-k. The "graph ≡ nothing" ablation (§B.6) was therefore **guaranteed by construction, not + an empirical finding** (a post-run review found a relevant graph-surfaced memory that fusion discarded). + +### B.5 Fusion (the production recipe, exactly) + +Weighted RRF, `RRF(d) = Σ_leg w_leg/(60 + rank_leg(d))`, chosen over convex combination because it +is score-scale-free (no BM25-vs-cosine calibration). Weights `w_fts = 1.0, w_dense = 1.0, +w_graph = 0.35`. Each leg pulled to depth 50 before fusion, truncated to k. 
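
As an illustration of the recipe above (a sketch, not the prototype's actual `hybrid.py`), weighted RRF fits in a few lines. The function name `weighted_rrf` and the toy ids are hypothetical; the weights, the constant 60, and the depth of 50 come from the text. The last two lines reproduce the bound from §B.4.

```python
from collections import defaultdict

RRF_K = 60        # smoothing constant in the formula above
LEG_DEPTH = 50    # each leg is pulled to this depth before fusion
WEIGHTS = {"fts": 1.0, "dense": 1.0, "graph": 0.35}


def weighted_rrf(legs, weights, k=10, rrf_k=RRF_K):
    """Fuse ranked id lists: score(d) = sum over legs of w_leg / (rrf_k + rank_leg(d)).

    legs    maps leg name -> list of ids, best first (rank 1 is index 0)
    weights maps leg name -> weight; an empty leg simply contributes nothing
    """
    scores = defaultdict(float)
    for name, ranked_ids in legs.items():
        w = weights.get(name, 0.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += w / (rrf_k + rank)
    # Sort by fused score, then by id so ties are deterministic.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:k]


# Toy ranked lists (ids are made up).
legs = {"fts": [1, 2, 3], "dense": [3, 1, 4], "graph": [5]}
fused = weighted_rrf(legs, WEIGHTS, k=5)
fused_ids = [doc_id for doc_id, _ in fused]

# Lexical-only degrade: fused order equals the lexical order.
lex_only = weighted_rrf({"fts": [7, 8, 9], "dense": [], "graph": []}, WEIGHTS, k=3)
lex_only_ids = [doc_id for doc_id, _ in lex_only]

# The B.4 bound: best possible graph-only score vs worst possible base-leg score.
best_graph_only = WEIGHTS["graph"] / (RRF_K + 1)   # 0.35 / 61
worst_base = 1.0 / (RRF_K + LEG_DEPTH)             # 1.0 / 110
print(fused_ids, lex_only_ids, best_graph_only < worst_base)
```

Ids 1 and 3 lead because two legs agree on them, while the graph-only id 5 lands last even at rank 1 of its own leg. That is the structural exclusion described in §B.4.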
+ +### B.6 Decision-relevant ablation (this is what informs ADR-0004) + +| Config | What | Overall recall@10 | Para recall@10 | Multi recall@10 | +|---|---|---|---|---| +| **A** full hybrid (FTS+dense+graph) | the prototype | 0.834 | 0.725 | 0.775 | +| **B** FTS+dense (w_graph=0) | graph removed | **0.834** | **0.725** | **0.775** | +| **C** dense-only | | 0.748 | — | — | +| **D** FTS-only (= baseline) | | 0.695 | 0.375 | 0.711 | + +**A and B are identical to three decimals on every metric — but this is a structural artifact (§B.4), +not a test of the graph.** The valid signal here is the **FTS-vs-dense decomposition**: dense-only (C) +and FTS-only (D) each lose to the fusion (B) — dense recovers paraphrase, lexical recovers exact, fusion +gets the best of both. The concept graph itself is **unevaluated** (it could never affect top-k under +this fusion config). **This still supports phasing — ship lexical+dense (phase 1), the robust measured +win — but the graph is gated pending a *valid* retest, not because it failed.** (Configs B/C/D were not +persisted as result JSONs; only A and D are reproducible from committed artifacts.) + +### B.7 Prototype → production mapping + +| Prototype (in-process) | Production (ADR-0002) | +|---|---| +| numpy cosine over normalized matrix | pgvector `halfvec(1024)` + HNSW ANN | +| `.npy` embedding cache, fingerprint-keyed | `embedding` column on `memories`, synced | +| in-memory networkx graph (phase 2) | `concepts` / `concept_edges` / `memory_concepts` tables | +| `FtsRetriever` (FTS5 in-memory) | existing `search_vector` + GIN (`plainto_tsquery`/`ts_rank`) | +| weighted RRF in Python | same RRF (Python handler, or CTE+FULL OUTER JOIN in SQL) | +| bge-large local | Voyage-3.5 hosted (non-sensitive) / bge-large local (sensitive, no-key) | + +### B.8 Prototype caveats (carried into the report's limitations) +1. 
**Graph result is INVALID, not merely "null."** The fusion config barred the graph leg from the + top-k by construction (§B.4), so the benchmark did not actually test it. A valid retest must include + graph candidates in the fused pool (drop the base-set exclusion) and/or sweep the weight, ideally with + a typed-relation graph from real LLM extraction and multi-hop queries whose hops are *not* semantically + adjacent. +2. **Exact-stratum nDCG/MRR dip ~0.018/0.025** vs FTS (recall unaffected) is the standard RRF cost + of blending one perfect hit with near-ties; a small exact-match rank bonus could recover it. +3. **Latency** (p50 ≈ 230 ms) is CPU-bound on the local query embed; non-gating, GPU/hosted ~10× + faster. Baseline FTS was p50 ≈ 15.7 ms (pure SQLite). +4. **No pgvector/Postgres** in the prototype — the production substrate is design-only here; the + numbers measure *retrieval quality*, which transfers, not the production latency profile. + +--- + +## C. Open questions (for production rollout) + +1. **pgvector enablement mechanism** — confirm whether the live CNPG is on the legacy + custom-operand-image (likely, since Immich uses pgvector) or the modern image-volume-extensions + path; either way the migration is additive, but the enablement DDL/Terraform differs. +2. **Graph gate** — what evidence would re-open the concept graph? Candidate: a multi-hop eval slice + whose hops are *not* semantically adjacent (where dense can't shortcut), built from real + LLM-extracted typed relations rather than keyword co-occurrence. Until then, graph stays off. +3. **Voyage vs bge-large in production** — the benchmark ran bge-large (local). A cheap follow-up: + re-run the dense leg with Voyage-3.5 on the non-sensitive corpus to confirm the hosted model's + higher quality ceiling holds on *our* content before committing the production default. 
diff --git a/docs/research/survey.md b/docs/research/survey.md new file mode 100644 index 0000000..b6cfa95 --- /dev/null +++ b/docs/research/survey.md @@ -0,0 +1,523 @@ +# Landscape survey: semantic + concept-graph memory for hybrid recall + +Status: research input for the ADR-0001 hybrid upgrade. Scope: how the agent-memory and +graph-RAG literature builds and retrieves over a personal-memory store, which embedding +model to use, how to fuse lexical + dense + graph signals, and how to evaluate the result. + +**Read this with the decisions already fixed in [ADR-0001](../adr/0001-pursue-hybrid-retrieval-embeddings-and-concept-graph.md) +–[0003](../adr/0003-external-embedding-apis-allowed-for-non-sensitive-memories.md):** we pursue +hybrid (gated on a benchmark beating FTS), embeddings live in pgvector on the existing CNPG +Postgres, the concept graph is node/edge tables in Postgres, sensitive memories never egress, +and adoption is decided **quality-first** (recall@k / nDCG@10 / MRR; latency & storage are +reported, not gating). + +The recurring conclusion below: **borrow the ideas, not the engines.** None of the four +systems surveyed is a drop-in for our stack, but each contributes a mechanism we re-implement +natively on Postgres + pgvector. + +--- + +## 1. Our workload is the opposite of GraphRAG's design target + +Before comparing systems it helps to state what we are *not*. The graph-RAG family was built +for **global sensemaking** ("what are the themes across this corpus") over a **static document +collection**. 
Our workload is the reverse: + +| Dimension | GraphRAG target | claude-memory-mcp | +|---|---|---| +| Unit | Long documents, chunked | Atomic, already-curated memories (avg ~500 chars) | +| Corpus dynamics | Static, indexed once | Append-heavy: a few hundred memories/day arriving | +| Query type | Corpus-wide summarization | Point / multi-hop recall ("what did I decide about X") | +| Hot path | Offline batch | **Every user prompt** (auto-recall hook fires before each turn) | +| Scale | 10k–1M+ chunks | ~5k memories today → tens of thousands | + +This mismatch is the lens for everything that follows. The expensive part of GraphRAG — +community detection + hierarchical LLM summaries — is the *wrong retrieval unit* for atomic +point lookups, and re-summarizing communities on a sustained append stream is its dominant, +unbounded cost. We want a design whose index-time work is **proportional to new content**, and +whose retrieval path has **no LLM call** (so it fits the per-prompt budget). + +--- + +## 2. The GraphRAG family — Microsoft GraphRAG, LightRAG, nano-graphrag, LazyGraphRAG + +All four turn text into an entity–relation knowledge graph via LLM extraction; they differ on +the expensive part (community detection + hierarchical summarization), which is exactly where +**incremental cost** lives. + +### Microsoft GraphRAG (Edge et al. 2024, arXiv 2404.16130) +Pipeline: chunk → LLM extracts entities + relationships per chunk (with multi-round +"gleanings" to catch misses) → summarize duplicate element instances into node/edge +descriptions → build graph → **Leiden** community detection producing a *hierarchy* +(levels C0..C3) → an LLM writes a **community report** for every community at every level. +Two query modes: **global** (map-reduce over all community reports — corpus-wide +sensemaking) and **local** (start from query-relevant entities, fan out). Indexing is +LLM-heavy: ~4,000 LLM calls / ~35 min for one textbook; ~$20–40 per 1M tokens with gpt-4o. 
+ +**Incremental:** the `graphrag update` command (GraphRAG 1.0) computes deltas and places new +entities into existing communities "rather than re-running Leiden," re-summarizing only +changed communities — **but** maintainers warn that once drift crosses a threshold "the worst +case degrades to the same performance as a normal indexing." A periodic, unpredictable +full-reindex cliff on a sustained append stream. Parquet/file-pipeline oriented, not +Postgres-native. + +### LightRAG (HKUDS, arXiv 2410.05779, EMNLP 2025) +Pipeline: chunk → LLM extracts entities + relations → "profiling" generates a key-value text +summary per node/edge → **deduplicate** merges identical entities/relations across chunks. +**No community detection.** Retrieval is **dual-level**: the LLM splits the query into +low-level keywords (specific entities) and high-level keywords (broad themes via relationship +chains); each set is matched by *vector* similarity against an entity-vector index and a +relation-vector index, then one-hop neighbours are pulled from the graph. Modes: +naive / local / global / hybrid / mix (mix = default). + +**Incremental (the crux):** a new document goes through the same local indexing to produce a +small local graph, then is integrated by **set union** of node-sets and edge-sets into the +existing graph — "eliminating the need to rebuild the entire index graph." No communities ⇒ +**no global re-clustering or re-summarization, ever** ⇒ O(new content) per insert, the only +genuinely-incremental member. Ships a PostgreSQL all-in-one backend (PGVectorStorage on +pgvector + PGGraphStorage on Apache AGE + KV + doc-status in one DB, PG ≥16.6). + +### nano-graphrag (~1100 LOC) +A faithful minimal reimplementation of Microsoft GraphRAG and an excellent compact *reference* +for the exact extraction/community/report prompts. 
**Hard NO for incremental:** README states +plainly "each time you insert, the communities of graph will be re-computed and the community +reports will be re-generated" — O(whole graph) LLM cost per append. + +### LazyGraphRAG (Microsoft Research, 2024) +Defers **all** LLM work to query time: index time uses only NLP noun-phrase extraction + graph +statistics — "indexing costs are identical to vector RAG and 0.1% of the costs of full +GraphRAG." Sidesteps the incremental-re-summarization problem entirely by never pre-summarizing +communities. The **defer-LLM-cost principle** is the one to borrow. + +### Verdict for us +**Adopt none wholesale; steal LightRAG's architecture + LazyGraphRAG's defer-LLM principle.** +LightRAG is the only one whose incremental model (pure set-union, no re-clustering) structurally +fits an append-heavy stream, and whose retrieval path (vector + one-hop graph, no query-time +map-reduce) is hot-path-viable. But adopting LightRAG-the-product is not recommended: its +Postgres graph path needs the **Apache AGE** extension (not on our CNPG), and that path has +documented concurrency/entity-merge instability under append-heavy load (asyncpg pool timeouts +at the merge stage; slow upgrades). Our multi-hop is shallow (1–2 hops), expressible in plain +recursive SQL CTEs over node/edge tables — no AGE needed. + +--- + +## 3. Zep / Graphiti — temporal knowledge graph for agent memory + +Zep (arXiv 2501.13956) is an agent-memory service; **Graphiti** is its open engine +(Neo4j / FalkorDB / Kuzu backend, ~20k stars, MIT). It is the **closest conceptual analog** to +our hybrid goal — it fuses exactly the three signals ADR-0001 wants. + +**Three-tier graph:** episode subgraph (raw ingested data, the provenance layer ≈ our Memory +rows) → semantic entity subgraph (entity nodes + typed relationship edges, each linking back to +its source episodes) → community subgraph (clusters with LLM summaries — the GraphRAG "global" +layer). 
+ +**Bi-temporal model:** every semantic edge carries **four** timestamps on two timelines — +*valid time* (`t_valid`/`t_invalid`: when the fact held true in the world) and *transaction +time* (`t'_created`/`t'_expired`: when Zep learned/retracted it). Facts are never deleted; +superseded facts get their validity window closed. This is a principled, queryable version of +our **"supersede, don't accumulate"** memory discipline. + +**Incremental ingestion (per episode):** a *sequence* of LLM calls — entity extraction → +entity resolution/dedup (embed + cosine + BM25 search against existing nodes, then an LLM +judges merge vs create) → fact (edge) extraction → fact dedup → temporal extraction (resolve +"two weeks ago" against a reference time) → edge invalidation (LLM compares each new edge +against related existing edges; on a temporally-overlapping contradiction it closes the old +edge). Cost is **heavy on write**, paid back on reads. + +**Retrieval (sub-second, NO LLM at query time):** three parallel searches fused, then reranked. +- `φ_cos`: cosine over embeddings of fact text / entity names / community summaries (BGE-m3, 1024-d). +- `φ_bm25`: BM25 full-text over the same fields. +- `φ_bfs`: breadth-first n-hop graph traversal from seed nodes — the genuinely graph-native signal. +- Rerank: pluggable — **RRF**, MMR (diversity), episode-mentions (frequency), node-distance, or a cross-encoder (most accurate, slowest). + +**Published quality (the strongest evidence in this family):** on **LongMemEval**, Zep reports +**+18.5%** accuracy over a baseline (71.2% vs 60.2% with gpt-4o) *and* ~90% query-latency +reduction; on **MemGPT DMR**, 94.8% vs 93.4%. These are conversational long-context QA, not +personal-fact recall@k — so the headline numbers won't transfer directly, but the *fusion +recipe* is exactly what we benchmark. 
+ +### Verdict for us +**Primary design blueprint for the concept-graph half — but not an adopted dependency.** +Graphiti has **no pgvector backend**; adopting the engine forces a new Neo4j/FalkorDB graph DB +into the cluster, conflicting with ADR-0002 and reuse-before-building. We borrow four mechanisms, +re-implemented on Postgres: (1) the episodic(=Memory rows)/semantic(=new node+edge tables) +split; (2) the parallel-search + RRF fusion read path; (3) resolution-via-search to dedupe the +graph using our existing FTS+vector; (4) bi-temporal edge invalidation as the queryable form of +our supersede discipline. We de-scope the community/summarization tier and the default +cross-encoder. Two hard caveats: keep the multi-LLM-call extraction **off the hot path** +(background, like our sync engine), and route extraction through in-cluster llama-cpp / filter +`is_sensitive` per ADR-0003. + +--- + +## 4. Mem0 / Mem0g — extraction-based, LLM-curated memory + +Mem0 (arXiv 2504.19413) is a **write-side memory curator**, not a retrieval algorithm — it +solves a *different axis* than our gated problem, and is **complementary**. + +**Two-phase pipeline.** *Extraction:* on each new message pair, an LLM (fed an async +conversation summary + the last ~10 messages) emits a set of concise "candidate facts." +*Update (the curation step):* for each candidate fact, retrieve the top ~10 semantically-similar +existing memories, then a function-calling LLM picks one of four ops — **ADD** (new), **UPDATE** +(merge richer detail into an existing id, gated on information content), **DELETE** (a +contradicted memory), **NOOP**. Net effect: the store self-deduplicates, self-merges, and +self-supersedes instead of accumulating. Two LLM calls per write (extract + decide) + a vector +search; **async by default** (off the user hot path); the **read/search path is pure vector +similarity with no LLM**. 
+ +**Mem0g (graph variant):** a directed labeled entity graph (Alice –lives_in→ SF) on Neo4j; a +conflict-detection + LLM update-resolver marks superseded relationships *invalid* rather than +deleting them. + +**Published quality:** on **LOCOMO**, Mem0 J=66.88 / Mem0g 68.44 beats OpenAI memory (52.90), +A-Mem (48.38), LangMem (58.10), ties Zep (65.99), at ~1/15th the tokens of full-context; Mem0g +specifically wins temporal reasoning. Reference latencies (gpt-4o-mini): search p95 ≈ 0.20s, +total p95 ≈ 1.44s, vs full-context ≈ 17s. + +### Verdict for us +**Adopt the curation loop as a separate, flagged subsystem — it does NOT move the ADR-0001 +retrieval metric by itself** (its search is vector-only, no lexical+graph fusion). The +ADD/UPDATE/DELETE/NOOP loop is the highest-leverage idea Mem0 offers: it automates a discipline +our own rules already mandate (every correction stored, supersede-don't-accumulate, tombstones) +but currently leave to manual human effort. It is cheap to build against our existing Memory +model + `update_memory` endpoint, runs async off the recall hot path, and respects the +`is_sensitive` boundary. **Hard guardrails required:** never physically DELETE — supersede to a +`[SUPERSEDED]` tombstone (importance ~0.3, per our convention); log every op; gate behind the +non-sensitive filter. Keep extraction *optional* (our memories are already atomic, so usually +only the single UPDATE-decision call is needed). Mem0g's "mark invalid, not delete" and triplet +schema (source, relation, dest) are reusable ideas, but implemented on pgvector/Postgres, not +Neo4j. **Critically: isolate curation behind a flag so the benchmark measures retrieval quality +independently of any curation behaviour change.** + +--- + +## 5. 
HippoRAG / HippoRAG 2 — Personalized PageRank over a concept graph + +HippoRAG (NeurIPS 2024, arXiv 2405.14831) and HippoRAG 2 (ICML 2025, arXiv 2502.14802) are the +**strongest published evidence that a concept graph wins on multi-hop** — precisely the query +class ADR-0001 says the graph must beat lexical on. + +**Mechanism (hippocampal indexing analogy):** LLM = neocortex; retrieval encoder = +parahippocampal region (detects synonyms); open KG = hippocampal index. *Offline:* an LLM runs +OpenIE on each passage → a schema-free KG of noun-phrase nodes joined by relation edges; the +encoder adds **synonym edges** between phrase nodes with cosine > τ=0.8. *Online, per query:* +(1) ONE LLM call does NER on the query; (2) the encoder links query entities to nearest KG +nodes = **seed nodes**; (3) each seed weight is scaled by node specificity (`|P_i|⁻¹`, an +IDF-like rare-phrase boost) and written into the **Personalized PageRank** reset vector; +(4) PPR runs to convergence (damping 0.5); (5) the phrase-node probability vector scores +passages. Multi-hop emerges because the random walk reaches passages sharing **no** query tokens +— in **one** retrieval step instead of iterative retrieve-reason loops. + +**HippoRAG 2** makes passages first-class nodes (linked to their phrases by "contains" context +edges), shifts linking to query→triple + **LLM triple-filtering** ("recognition memory"), and +seeds *all* passage nodes by embedding similarity (small weight ~0.05) so dense and graph blend +in one PPR. Net effect: a single PPR fuses lexical-ish phrase matching, dense passage +similarity, and multi-hop traversal into one ranked list. + +**Published quality (passage recall@5):** HippoRAG 2 beats the strongest 7B embedding baseline +(NV-Embed-v2) on every multi-hop set — 2Wiki **90.4 vs 76.5** (+13.9), MuSiQue **74.7 vs 69.7** +(+5.0), HotpotQA **96.3 vs 94.5** — and is the only structure-augmented method that *doesn't +regress* simple QA (NQ 78.0). 
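
To make the mechanism concrete, here is a small pure-Python sketch of Personalized PageRank by power iteration. It is not HippoRAG's code: the function name, the dict-based graph, and the toy nodes are hypothetical, and a real deployment would more likely use a sparse matrix. The damping value 0.5 follows the description above; node-specificity weighting would enter through the `seeds` weights.

```python
def personalized_pagerank(edges, seeds, damping=0.5, tol=1e-10, max_iter=200):
    """Power-iteration PPR.

    edges maps node -> list of neighbours (list both directions for undirected graphs)
    seeds maps node -> non-negative weight; normalized here into the reset vector
    """
    nodes = set(edges) | {n for nbrs in edges.values() for n in nbrs} | set(seeds)
    total = sum(seeds.values())
    reset = {n: seeds.get(n, 0.0) / total for n in nodes}
    p = dict(reset)
    for _ in range(max_iter):
        # With probability (1 - damping) the walk jumps back to a seed.
        nxt = {n: (1.0 - damping) * reset[n] for n in nodes}
        for n in nodes:
            nbrs = edges.get(n, [])
            if nbrs:
                share = damping * p[n] / len(nbrs)
                for m in nbrs:
                    nxt[m] += share
            else:
                # Dangling node: return its mass to the reset distribution.
                for m in nodes:
                    nxt[m] += damping * p[n] * reset[m]
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Chain a - b - c plus an unrelated node z; only "a" matches the query.
GRAPH = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "z": []}
scores = personalized_pagerank(GRAPH, {"a": 1.0})
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Node `c` shares nothing with the seed directly, yet it receives probability mass through `b`. This is the multi-hop effect the papers rely on, and the unconnected node `z` stays at zero.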
+ +### Verdict for us +**Adopt the idea (PPR spreading activation over our concept graph), not the framework.** Two +hard adaptations, both fitting our stack: +1. **Drop the per-query LLM** (v1 NER / v2 triple-filtering) — the only thing that would blow + the hot-path budget — and **seed PPR from our existing FTS top-k ∪ pgvector top-k**, weighted + by fused score × importance × node-specificity. This turns PPR into the *fusion layer* + ADR-0001 wants, with zero added LLM latency. +2. **Prefer a memory-node graph** (memories as nodes, our typed Relationships as edges) over + HippoRAG's phrase explosion (it turns 11.6k passages into ~92k nodes; at our scale that'd be + ~43k phrase nodes). Leaner and native to ADR-0002's node/edge tables. + +A reproducible PPR latency micro-benchmark on a 5,400-memory graph measured **~2 ms** (memory-node +graph, transition matrix cached) to **~21 ms** (full phrase graph), ~105 ms even at 3× growth — +PPR is **not** the bottleneck; the stock recipe's online LLM is (which we remove). Postgres can +*store* the graph but has no native PPR (pgrouting = shortest-path only), so PPR is computed in +Python over a cached `scipy.sparse` transition matrix loaded from the node/edge tables, rebuilt +only on graph mutation. **Caveat for the gate:** our LLM-free seeding variant is *not* validated +by the papers, and our 5.4k personal corpus is far smaller and less multi-hop-dense than their +90k-node Wikipedia graphs — so the benchmark must confirm the multi-hop win transfers. + +--- + +## 6. Embedding model survey + +Our `content` and `expanded_keywords` are **short** prose (capped ~500 chars), so a model's +max-token limit is effectively a non-constraint — quality, dimensionality, and deploy +feasibility decide. 
+ +### Self-hostable (sentence-transformers on the GPU node, or GGUF via llama-cpp; pgvector stores the vector) + +| Model | Dim | Params / VRAM (fp16) | MTEB(en) avg | License | +|---|---|---|---|---| +| nomic-embed-text-v1.5 | 768 (Matryoshka 64–768) | 0.1B / <1 GB | 62.28 | Apache-2.0 | +| bge-base-en-v1.5 | 768 | 109M / ~0.5 GB | ~63.5 | MIT | +| **bge-large-en-v1.5** | **1024** | 335M / ~1.3 GB | **64.23** | MIT | +| e5-large-v2 | 1024 | 0.3B / ~1.3 GB | ~62.25 | MIT | +| bge-m3 | 1024 dense (+sparse +ColBERT) | 568M / ~1–2.4 GB | en ~59–60 (strong multiling/BEIR) | MIT | +| gte-Qwen2-1.5B-instruct | 1536 | 1.5B / ~3.4 GB | **67.16** (top of set) | Apache-2.0 | + +### Hosted (API call, NON-SENSITIVE memories only per ADR-0003) + +| Model | Dim | MTEB(en) avg | Price /1M tok | License | +|---|---|---|---|---| +| OpenAI text-embedding-3-small | 1536 (Matryoshka→256) | 62.3 | $0.02 | proprietary | +| OpenAI text-embedding-3-large | 3072 (Matryoshka) | 64.6 | $0.13 | proprietary | +| **Voyage-3.5** | **1024** (+256/512/2048, int8/binary) | beats OpenAI-3-large ~7.5% on Voyage's eval | $0.06 (first 200M free) | proprietary | +| Voyage-3.5-lite | 1024 | beats OpenAI-3-large ~2–3.8% | $0.02–0.03 | proprietary | +| Cohere embed-english-v3.0 | 1024 (native int8/binary) | ~64.5 | ~$0.10 (sales-quoted) | proprietary | + +**Implementation notes that matter.** Use **asymmetric** prompting (query vs document): +sentence-transformers `encode_query`/`encode_document`, always `normalize_embeddings=True` so +pgvector cosine == dot product. e5-large-v2 *requires* manual `"query: "`/`"passage: "` prefixes +or quality collapses; bge prepends a query instruction; gte-Qwen2 prepends a task instruction to +queries only. Pick the dimension **once** — changing it later forces a full re-embed + HNSW +rebuild. 
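
The two implementation notes above can be sketched without any embedding library. The helper names (`prepare_text`, `l2_normalize`) and the toy vectors are hypothetical; the prefixes are the ones quoted in this document for e5 and bge. gte-Qwen2's task instruction is deliberately left out because its wording is not given here.

```python
import math

BGE_QUERY_PREFIX = "Represent this sentence for searching relevant passages: "


def prepare_text(text, model_family, is_query):
    """Apply the asymmetric query/document prefix a model family expects."""
    if model_family == "e5":
        return ("query: " if is_query else "passage: ") + text
    if model_family == "bge":
        return BGE_QUERY_PREFIX + text if is_query else text
    return text  # other families: assumed to need no prefix in this sketch


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy 3-d "embeddings"; real ones would be 1024-d.
q = [3.0, 4.0, 0.0]
m = [1.0, 2.0, 2.0]
cos_raw = cosine(q, m)
dot_normalized = dot(l2_normalize(q), l2_normalize(m))
print(cos_raw, dot_normalized)
print(prepare_text("what UI library?", "bge", is_query=True))
```

Once both sides are unit length, the inner product and the cosine agree, so either pgvector operator yields the same ordering.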
+ +### Recommendation (one of each, quality-first) +- **Local: BAAI/bge-large-en-v1.5** (1024-d, MIT) — best quality-per-complexity in the + self-hostable set for short English memories: strong retrieval, ~1.3 GB VRAM (runs on CPU at + ~100 ms), no `trust_remote_code`, mature ST support. The 512-token cap is irrelevant for our + content. (gte-Qwen2-1.5B-instruct is the explicit upgrade candidate if the benchmark says + bge-large leaves quality on the table; nomic is the fallback if a long context or sub-768 + Matryoshka dims are ever wanted.) +- **Hosted: Voyage-3.5** (1024-d) — highest measured retrieval quality of the hosted options, + **same 1024-d as the local pick** so the pgvector column and fusion code are identical whether + local or hosted (clean A/B), and our whole corpus embeds inside the free tier. Non-sensitive + only; sensitive rows go to bge-large locally. (OpenAI text-embedding-3-small is the pragmatic + fallback if no Voyage key.) + +> **Prototype note:** the prototype as built used **bge-large-en-v1.5** (1024-d, local default, +> no API key in env). Production should adopt **Voyage-3.5** (also 1024-d) for non-sensitive +> memories per ADR-0003, keeping bge-large as the sensitive-only / no-key fallback. Both 1024-d +> means the pgvector schema and fusion code are unchanged across the choice. + +--- + +## 7. Fusion of lexical + dense + graph signals + +Three retrieval families produce candidate lists per recall; one fusion function merges them. + +### Reciprocal Rank Fusion (RRF) — rank-based, scale-agnostic +Cormack/Clarke/Büttcher (SIGIR'09): `score(d) = Σ_s 1/(k + rank_s(d))`, summed over every signal +the doc appears in. `k` is a smoothing constant — their sweep found **k=60 near-optimal but +uncritical** (≤0.3% MAP swing across [10,100]; k=0 or k=500 costs 3–4%). 
A doc present in one +list but absent from another contributes `1/(k+rank)` where it fires and **0** elsewhere — so +multi-signal agreement is rewarded and single-signal hits are penalized (the hybrid behaviour we +want). Extends to N lists trivially (just sum), which makes a 3-way fuse a one-liner. +**Weighted RRF:** `Σ_s w_s/(k + rank_s(d))` — bias a stronger signal, no pre-normalization. + +### Weighted score fusion / convex combination (CC) — score-based, needs normalization +Bruch et al. (arXiv 2210.11934, TOIS 2023): `f = α·φ(semantic) + (1-α)·φ(lexical)` with +**theoretical** min-max normalization (TM2C2): cosine min = -1, BM25 min = 0, per-query max — +stable across queries. Findings: **CC/TM2C2 beats RRF on nDCG and recall** in- and out-of-domain +(RRF ~3.86% lower nDCG@10 in one replication); Weaviate switched its default from rankedFusion to +relativeScoreFusion (min-max CC) for ~6% recall on FIQA. CC is sample-efficient (α tunes from a +small labeled set) but requires calibratable scores. + +### Folding in graph hits +Build a graph candidate list, then feed it to the fuser as just another ranked list: +**seed** (match the query to concept nodes via the same FTS + dense over node labels) → +**traverse** (1–2 hops to reachable memories) → **score** each reachable memory. Three documented +scorings: hop-decay `Σ_paths β^hops` (β≈0.5–0.7); Personalized PageRank seeded on matched nodes +(HippoRAG); or node-degree priority (GraphRAG local search). The +*Calibrated-Fusion-for-Graph-Vector* paper (arXiv 2603.28886) is explicit: naive graph+vector +fusion fails on **scale incompatibility**, so convert graph traversal into a probability-like +normalized score before fusing. Crucial consequence: the graph list is **sparse** (often a +handful of memories, sometimes zero). Under RRF that's handled automatically; under CC you must +explicitly treat "absent" as the theoretical min or the missing-modality term silently biases the +sum. 
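
For comparison with RRF, a hedged sketch of the convex-combination challenger with theoretical min-max normalization as described above: the minimum is fixed by theory (0 for BM25, -1 for cosine) and the maximum is taken per query. Function names and toy scores are hypothetical. Lexical scores are assumed to be "higher is better" and non-negative (for example `-bm25()` from FTS5), and a document absent from a leg is explicitly given the normalized theoretical minimum, 0.

```python
def tm2c2(scores, theoretical_min):
    """Normalize as (s - m_t) / (M_q - m_t), with M_q the max score for this query."""
    if not scores:
        return {}
    span = max(scores.values()) - theoretical_min
    if span <= 0:
        return {d: 0.0 for d in scores}
    return {d: (s - theoretical_min) / span for d, s in scores.items()}


def convex_combination(lexical, semantic, alpha=0.5):
    """f = alpha * phi(semantic) + (1 - alpha) * phi(lexical)."""
    lex = tm2c2(lexical, theoretical_min=0.0)    # BM25-style score >= 0
    sem = tm2c2(semantic, theoretical_min=-1.0)  # cosine >= -1
    docs = set(lex) | set(sem)
    # Absent from a leg means the theoretical minimum, which normalizes to 0.0.
    fused = {d: alpha * sem.get(d, 0.0) + (1 - alpha) * lex.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy per-query scores (made up).
lexical = {"m1": 12.0, "m2": 6.0}
semantic = {"m2": 0.8, "m3": 0.6}
ranked = convex_combination(lexical, semantic, alpha=0.5)
ranked_ids = [d for d, _ in ranked]
print(ranked)
```

Unlike RRF, the result depends on score magnitudes, so each signal needs a defensible theoretical minimum. For a graph-proximity score that bound would have to be defined before a third leg could join this sum.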
+ +### Cross-encoder re-rank — a separate stage-2, not a fusion function +Retrieve top-N each → fuse → take fused top ~20–30 → score each (query, memory) pair jointly with +a cross-encoder (e.g. bge-reranker-v2-m3) → re-sort. Reported lift +5 to +15 nDCG@10 on +BEIR/MTEB; cost scales with pair count so it is only ever a small-candidate-set stage. + +### Recommendation +**Weighted RRF over three lists (FTS, dense, graph), k=60, equal weights to start**, with +importance applied as a deterministic post-fusion prior and a cross-encoder as an optional, +benchmark-gated stage-2. RRF is the right *default* because we fuse three incompatible scales, +one of them sparse/often-empty; it is near-parameter-free; and it collapses to exactly today's +lexical ordering when dense/graph are empty (the SQLite graceful-degrade path). **But because +adoption is quality-gated, the benchmark must also run CC/TM2C2 as a challenger** — the +literature is consistent that CC edges RRF on quality when scores are calibratable. (See the +[benchmark report](benchmark-report.md) for which won on *our* eval set.) + +--- + +## 8. Concept-graph construction from memories + +Turning flat Memory rows into nodes (concepts/entities) + typed directed edges + memory→concept +"mentions" edges. Three extraction families (A, B, D), plus the entity-resolution step (C) that open extraction depends on: + +- **(A) Open LLM triple extraction** (schema-free) — prompt an LLM to emit `[subject, relation, + object]` triples. High recall, but relation labels proliferate ("prefers"/"likes"/"favors"), so + it **requires** downstream canonicalization. GraphRAG is the canonical implementation + (extract + gleaning + cross-chunk entity summarization). +- **(B) Schema-guided** — constrain to a fixed ontology. Cleaner, but a fixed schema misses + surprises in a heterogeneous personal corpus.
**EDC** (Zhang & Soh, EMNLP 2024) bridges the two: + *extract* open triples → *define* (LLM writes a one-sentence definition per distinct relation) → + *canonicalize* (embed definitions, retrieve nearest existing relations, LLM verifies map-vs-add). + Two modes: target-alignment (fixed schema) and self-canonicalization (grow schema dynamically). +- **(C) Entity resolution / canonicalization** (the dedup problem — "Svelte"/"SvelteKit"/"svelte + framework" are one node): cluster-then-refine on the *aggregated* graph — embed every surface + string, cluster by cosine (HDBSCAN / connected-components over a threshold), optional + LLM-as-judge per cluster. KGGEN (arXiv 2502.09956) does iterative LLM-guided clustering; + Graphiti uses MinHash+LSH fast-path with LLM fallback. **Cost scales with distinct entities (low + thousands), not with memory count.** +- **(D) Lightweight non-LLM** — spaCy NER + noun-chunks + co-occurrence edges, or **ReLiK** + (Sapienza, ACL 2024) for *typed* relations on CPU at up to 40× LLM speed, zero per-doc LLM cost. + The natural ablation baseline and sqlite-only fallback. + +**The tractable recipe for our corpus.** Measured: 5,452 non-sensitive memories ≈ 683K content +tokens total — *tiny*. At ~125 content-tokens/memory, ~570 memories pack into one 100K-token +request, so the **entire corpus extracts in ~10–25 batched LLM calls, not 5,452 sequential +calls**. Pipeline: (1) batch-extract open triples (each memory tagged with its `memory_id` so +triples map back), parallelized — LangExtract / KGGEN style; (2) aggregate + canonicalize globally +*once* (embed distinct entities, cluster, LLM-judge only ambiguous clusters — tens of calls); +(3) optionally one batched LLM "define relations" pass for EDC-style relation canonicalization. +Total budget: low tens of calls, minutes of wall-clock, a few dollars hosted or one GPU-node +llama-cpp session. 
**Canonicalization quality (the similarity threshold / cluster granularity) is +where this lives or dies and must be tuned against held-out data, not eyeballed.** Write-time / +Graphiti-style per-memory extraction is for *incremental updates only* — the wrong tool for the +one-shot backfill. + +--- + +## 9. Vector storage in Postgres (production substrate) + +`pgvector` is a **proven capability on our exact CNPG cluster** (Immich already does vector search +there, and claude-memory-mcp is already a tenant of the shared `pg-cluster-rw.dbaas` behind +PgBouncer) — zero new infrastructure, reuse-before-building satisfied. + +- **HNSW** (recommended default): `USING hnsw (embedding halfvec_cosine_ops) WITH (m=16, + ef_construction=64)`; query knob `SET hnsw.ef_search` (default 40). Best speed-recall tradeoff; + buildable on an empty table; graph in RAM. **IVFFlat** is rejected (must be built *after* data + exists — an empty-table footgun — and has a lower recall ceiling). +- **halfvec** (fp16, 2 bytes/dim) halves index size at ~no recall loss; indexable ≤4000 dims. + 768-d halfvec = 1536 bytes/row; at our scale total embedding storage is single-digit MB. +- **Filtered ANN:** we always filter `deleted_at IS NULL` (often `category`). Post-filtering can + under-fill top-k; enable `hnsw.iterative_scan='relaxed_order'`, and **always add a tie-breaker** + (`, id`) since approximate indexes give non-deterministic order. +- **Hybrid in one query:** each retriever is a CTE producing a per-ranker rank; fuse with RRF via + FULL OUTER JOIN on memory id — no score calibration needed across the incomparable ts_rank and + cosine scales. +- **pgvectorscale / StreamingDiskANN** (bounded-RAM disk graph, SBQ compression) is **deferred** — + Rust/pgrx must be compiled into the operand image, and it only earns its keep above ~1–5M + vectors. Our corpus is orders of magnitude below that. 
+- **PgBouncer gotcha:** per-query GUCs (`hnsw.ef_search`) must be `SET LOCAL` inside the recall + transaction, not session-level, under transaction pooling. + +**Not for the prototype** (the prototype uses an in-process numpy index); this is the production +adoption path *if* the benchmark clears the gate — an additive Alembic migration (one nullable +`halfvec(1024)` column + HNSW index) plus a Terraform change to the CNPG stack. + +--- + +## 10. Evaluation methodology + +A retrieval test collection = corpus + query set + **qrels** (relevance judgments). For each +query, call recall, take the ordered list of returned memory ids, score against qrels — measuring +the *retriever in isolation*, exactly what ADR-0001's gate needs. + +**Metrics (compute all; pick one primary):** +- **Recall@k** — "did we surface the right memory at all?" *The* hot-path metric (auto-recall + injects top-N; if the memory isn't in top-k it can't help). Report @5/@10/@20/@30. +- **nDCG@k** — graded + position-aware; the best single summary (BEIR standard is nDCG@10). + Headline quality number for the gate. +- **MRR** — only the first hit matters; relevant for the exact-lookup stratum. +- **MAP** — broad binary recall+precision blend; secondary, stable for significance tests. + +**Stratification (the ADR-0001 hypothesis-targeted design):** *exact/lexical* (FTS already wins — +the **regression guard**); *paraphrase/semantic* (disjoint vocabulary — the value-of-embeddings +test); *multi-hop* (≥2 memories or a concept link — the graph test). + +**Qrels generation (the LongMemEval pipeline, inverted for memories):** sample seed memories +stratified by category + importance → an LLM generates exact / paraphrase / multi-hop queries → +label relevant ids, with **pooling** (union the top-k of every arm, TREC-style) and an LLM-judge +on the **UMBRELA 0–3 scale**. **Separate the generator model from the judge model** to avoid +self-preference leakage. 
**Hand-verify** a ≥15–20% sample (oversample multi-hop) and require +Cohen's κ(LLM, human) ≥ ~0.6 before trusting auto-labels; always hand-author multi-hop +relevant-id sets. + +**Pitfalls with standard mitigations (all baked into the protocol):** LLM judges are +systematically *lenient* (κ gate); "holes" (new arms retrieve unjudged docs — must pool *all* +arms before judging, else the gate is rigged against semantic/hybrid); generator-as-judge leakage +(model separation); too-easy self-generated queries (check paraphrase shares no content tokens); +adversarial/unanswerable queries have no relevant id and **must be kept out of the ranked metrics** +(mixing them corrupted the disputed Zep-vs-mem0 LOCOMO comparison — 84%→58%). + +**Sizing:** Voorhees & Buckley (2002) — ≥25 topics is the floor, 50 yield reliable rankings, and a +~5–6% absolute gap at n=50 is needed for 95% confidence the ordering holds on a different query +set. Since the gate is *per stratum*, each stratum wants its own ~50 queries. + +> **Honest note on what we actually built:** our eval set is **119 queries (40 exact / 40 +> paraphrase / 39 multihop)** — just below the ~50/stratum ideal, and qrels were LLM-generated +> with lighter hand-verification than the full protocol prescribes. This is a real limitation, +> tracked in the [benchmark report](benchmark-report.md). + +--- + +## 11. Synthesis — what we borrow, from whom + +| Source | Borrowed mechanism | Re-implemented as | Adopted? 
| +|---|---|---|---| +| LightRAG | Incremental set-union graph merge; dual-level retrieve | Native node/edge tables, no AGE; FTS+dense+graph fuse | Idea only | +| LazyGraphRAG | Defer LLM cost; index-time work ∝ new content | Store-time extraction off hot path | Principle | +| Zep / Graphiti | Episodic/semantic split; 3-signal RRF read path; bi-temporal invalidation | Memory rows + Postgres node/edge tables; pgvector+FTS+CTE | **Blueprint** | +| Mem0 | ADD/UPDATE/DELETE/NOOP write-side curation | Flagged async curator over existing `update_memory` | Complementary, flagged | +| HippoRAG 2 | PPR spreading activation for multi-hop | LLM-free FTS+vector-seeded PPR over memory-node graph (phase 2, gated) | Idea only | +| Bruch et al. / Cormack | RRF default + CC/TM2C2 challenger | Weighted RRF k=60, post-fusion importance prior | **Direct** | +| EDC / KGGEN | Open-extract → define → canonicalize globally | Batched extraction + embedding-cluster canonicalization | **Direct** | +| pgvector / Supabase | HNSW + halfvec + RRF hybrid in one SQL query | Additive migration to CNPG (production only) | **Production design** | +| LongMemEval / UMBRELA / Voorhees | Stratified LLM-qrels + pooling + κ gate | Our exact/paraphrase/multi-hop eval | **Direct** | + +The through-line: **a memory-node concept graph, dense pgvector embeddings, and the existing +lexical FTS, fused with weighted RRF, with all LLM work pushed to store time** — sized for an +append-heavy personal store and gated on a benchmark that beats FTS. + +--- + +## Sources + +**GraphRAG family** +- Edge et al., "From Local to Global: A Graph RAG Approach…" (Microsoft, 2024) — arXiv 2404.16130 +- Microsoft GraphRAG incremental-indexing design — github.com/microsoft/graphrag/issues/741; GraphRAG 1.0 blog (microsoft.com/en-us/research/blog/moving-to-graphrag-1-0…) +- Guo et al., "LightRAG: Simple and Fast RAG" (HKUDS, EMNLP 2025) — arXiv 2410.05779; github.com/HKUDS/LightRAG (incl. 
issue #2122, PG+AGE concurrency) +- nano-graphrag — github.com/gusye1234/nano-graphrag +- LazyGraphRAG — microsoft.com/en-us/research/blog/lazygraphrag-setting-a-new-standard-for-quality-and-cost/ + +**Temporal KG memory** +- Rasmussen et al., "Zep: A Temporal Knowledge Graph Architecture for Agent Memory" (2025) — arXiv 2501.13956 +- Graphiti — github.com/getzep/graphiti; Neo4j writeup (neo4j.com/blog/developer/graphiti-knowledge-graph-memory/) + +**Extraction-based memory** +- "Mem0: Building Production-Ready AI Agents…" (2025) — arXiv 2504.19413; github.com/mem0ai/mem0 (configs/prompts.py) + +**Graph-PPR retrieval** +- Gutiérrez et al., "HippoRAG" (NeurIPS 2024) — arXiv 2405.14831 +- "From RAG to Memory" = HippoRAG 2 (ICML 2025) — arXiv 2502.14802; github.com/OSU-NLP-Group/HippoRAG + +**Embeddings** +- bge-large-en-v1.5 — huggingface.co/BAAI/bge-large-en-v1.5; gte-Qwen2-1.5B — huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct; nomic — huggingface.co/nomic-ai/nomic-embed-text-v1.5; bge-m3 — huggingface.co/BAAI/bge-m3; e5-large-v2 — huggingface.co/intfloat/e5-large-v2 +- Voyage-3/3.5 — blog.voyageai.com/2024/09/18/voyage-3/; docs.voyageai.com/docs/pricing +- OpenAI text-embedding-3 — developers.openai.com/api/docs/guides/embeddings; Cohere embed v3 — docs.cohere.com/docs/cohere-embed + +**Fusion** +- Cormack, Clarke, Büttcher (SIGIR'09) — cormack.uwaterloo.ca/cormacksigir09-rrf.pdf +- Bruch et al., "An Analysis of Fusion Functions for Hybrid Retrieval" (TOIS 2023) — arXiv 2210.11934 +- Elastic weighted RRF; Weaviate hybrid-search fusion algorithms; "Calibrated Fusion for Heterogeneous Graph-Vector Retrieval" — arXiv 2603.28886 +- bge-reranker — huggingface.co/BAAI/bge-reranker-base + +**Concept-graph construction** +- Zhang & Soh, "Extract-Define-Canonicalize" (EDC, EMNLP 2024) — arXiv 2404.03868; github.com/clear-nus/edc +- KGGEN — arXiv 2502.09956; ReLiK (ACL 2024) — arXiv 2408.00103; LightKGG — arXiv 2510.23341; Google LangExtract — 
github.com/google/langextract + +**Postgres vector storage** +- pgvector — github.com/pgvector/pgvector; pgvectorscale — github.com/timescale/pgvectorscale; CNPG image-volume extensions — cloudnative-pg.io/docs/devel/imagevolume_extensions/; Supabase hybrid search — supabase.com/docs/guides/ai/hybrid-search +- This cluster: `infra/docs/architecture/databases.md` (claude-memory-mcp is a CNPG tenant); this repo: `migrations/versions/001_initial_schema.py`, `src/claude_memory/api/app.py` + +**Evaluation** +- LoCoMo — arXiv 2402.17753; LongMemEval — arXiv 2410.10813; UMBRELA — arXiv 2406.06519; "Judging the Judges" / LLMJudge — arXiv 2502.13908; Voorhees & Buckley "Topic Set Size" (2002); Buckley & Voorhees "Bias and the Limits of Pooling"; BEIR — arXiv 2104.08663