claude-memory-mcp/benchmarks/README.md

# claude-memory recall benchmark

Stratified retrieval benchmark gating the hybrid-recall adoption decision
(ADR-0001): does dense-vector semantic recall + a concept graph beat the current
lexical FTS on **recall@5, recall@10, nDCG@10, MRR**? Quality decides adoption;
latency/storage are measured but non-gating.

> **PRIVACY — read first.** The corpus is the operator's REAL personal memories.
> `data/` (corpus/queries/qrels), `.venv/`, `cache/`, `results/`, and
> `scripts/build_eval_set.py` (the generator embeds memory-derived query text)
> are **gitignored and must never be committed**. Everything else here contains
> only code / aggregate numbers and is safe to commit. Sensitive memories
> (`is_sensitive=1`) are excluded from the corpus entirely.

## Layout

```
benchmarks/
  harness/                 # importable package (committable; no real content)
    types.py               # Memory, Query, Qrels, Retriever protocol
    metrics.py             # recall@k, nDCG@k, MRR (binary relevance)
    dataset.py             # load_dataset() + referential-integrity validation
    runner.py              # run_benchmark() -> overall + per-stratum + latency
    baselines.py           # SqliteFtsRetriever (faithful FTS5/BM25 reference)
    example_retriever.py   # worked example of the plug-in interface
    test_harness.py        # unit tests (pytest)
  scripts/
    export_corpus.py       # SQLite -> data/corpus.jsonl (non-sensitive only)
    build_eval_set.py      # -> data/queries.jsonl + qrels.jsonl  [GITIGNORED]
    dataset_stats.py       # validate + print AGGREGATE stats (safe)
    run_eval.py            # CLI: run a retriever, print/save metrics
  data/                    # [GITIGNORED] corpus.jsonl, queries.jsonl, qrels.jsonl
  .venv/                   # [GITIGNORED]
```

## Dataset schema (JSONL, one object per line)

**`corpus.jsonl`** — every non-sensitive memory:
```json
{"id": 137, "content": "...", "category": "decisions", "tags": "memory,architecture",
 "expanded_keywords": "...", "importance": 0.85}
```
`id` (int) is the join key everywhere. `tags` is comma-separated; `expanded_keywords`
space-separated (matches the production schema).

**`queries.jsonl`** — eval queries, three strata:
```json
{"query_id": "para_006", "text": "...", "stratum": "paraphrase", "relevant_ids": [380],
 "_note": "author rationale", "_jaccard": 0.023}
```
- `stratum` ∈ `exact` | `paraphrase` | `multihop`.
- `relevant_ids` is a convenience copy; **`qrels.jsonl` is authoritative**.
- `_note` / `_jaccard` are provenance fields (underscore-prefixed); ignore them in
  scoring.

**`qrels.jsonl`** — binary relevance judgments (authoritative):
```json
{"query_id": "multi_006", "relevant_ids": [263, 423, 637]}
```

### Strata (what each one tests)

| stratum | construction | who should win |
|---|---|---|
| **exact** | query = a salient phrase lifted from ONE memory; that memory is relevant (verified as the top FTS hit at build time) | lexical already strong; floor check |
| **paraphrase** | query restates ONE memory's meaning in DIFFERENT words (low lexical overlap, validated Jaccard ≤ ~0.18 vs content+keywords) | **dense embeddings** |
| **multihop** | query needs 2+ DISTINCT memories sharing an entity/concept (e.g. project + decision, or a multi-part runbook); ALL are relevant | **concept graph** |

Where a near-duplicate memory equally satisfies a single-target query, qrels was
augmented to include the twin (so a good retriever isn't penalised); deliberate
discriminator queries are kept single-target on purpose.

## Pluggable retriever interface

A retriever is any object implementing **one** method:

```python
def retrieve(self, query: str, k: int) -> list[int]:
    """Return up to k memory ids (corpus `id`s), ranked best-first."""
```

Optional lifecycle hooks the runner uses if present (duck-typed):

```python
def build_index(self, corpus: list[Memory]) -> None: ...   # timed separately
def index_size_bytes(self) -> int: ...                     # reported
name: str                                                  # label in reports
```

A bare callable `retrieve(query, k) -> list[int]` also works.

## Run it

```bash
.venv/bin/python scripts/export_corpus.py        # (re)build data/corpus.jsonl
.venv/bin/python scripts/build_eval_set.py        # (re)build queries+qrels (local)
.venv/bin/python scripts/dataset_stats.py         # validate + aggregate stats
.venv/bin/python -m pytest harness/test_harness.py -q

# evaluate a retriever (built-in alias or module:Class)
.venv/bin/python scripts/run_eval.py --retriever fts5
.venv/bin/python scripts/run_eval.py --retriever your_pkg.mod:YourRetriever --json results/yours.json
```

Programmatic use:

```python
from harness import load_dataset, run_benchmark
ds = load_dataset()
result = run_benchmark(MyRetriever(), ds)   # builds index, times queries
print(result.summary())                     # overall + per-stratum table
result.to_dict()                            # full machine-readable result
```

`run_benchmark` requests `retrieve_k=20` per query by default (≥ the max metric
cutoff of 10), macro-averages metrics over queries (overall + per stratum), and
reports per-query latency p50/p95 plus index build time/size when the hooks exist.

## Reference baseline

`harness.baselines.SqliteFtsRetriever` mirrors the production local-store search
(README "Search Algorithm"): FTS5 over content/category/tags/expanded_keywords,
`'"w1" OR "w2" ...'` MATCH, `ORDER BY bm25(), importance`. This is the lexical
"current system" any hybrid retriever must beat. (The Postgres `tsvector` path
uses weighted A/B/C/D ranking and an importance-first default; FTS5/BM25 is the
faithful, dependency-free relevance reference for the quality comparison.)
research: benchmark hybrid (lexical+dense+graph) recall vs current FTS Viktor asked to enhance the memory system with 'semantics' — remember concepts (not just tokens) linked in a graph — and to prove, by benchmarking against the current system, that it actually improves recall. A multi-phase research workflow (18 agents) did landscape research, an adversarially-reviewed integration design, a stratified eval set over the real 5,452-memory corpus, and a head-to-head prototype-vs-current benchmark. Result: hybrid (lexical FTS + dense embeddings, RRF-fused) beats FTS on every overall metric, driven by a robust paraphrase win (recall@10 +0.350). Recommend adopting lexical+dense; the concept graph is DEFERRED. Post-run adversarial review correction (applied to all docs before commit): the prototype's fusion config structurally barred the graph leg from the ranked top-k, so the 'graph contributes nothing' ablation was a math artifact, NOT an empirical result — the graph is UNEVALUATED, not disproven (deferred on cost+uncertainty). Multi-hop deltas are not statistically significant. Glossary in CONTEXT.md; framing in ADR-0001-0003; findings in ADR-0004-0006 + docs/research/. Privacy: the corpus/queries/qrels/results are the user's real memories and stay gitignored (data/, cache/, results/, build_eval_set.py); only harness code, aggregate numbers, and synthetic examples are committed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> 2026-06-25 17:51:53 +00:00			`# claude-memory recall benchmark`

			`Stratified retrieval benchmark gating the hybrid-recall adoption decision`
			`(ADR-0001): does dense-vector semantic recall + a concept graph beat the current`
			`lexical FTS on recall@5, recall@10, nDCG@10, MRR? Quality decides adoption;`
			`latency/storage are measured but non-gating.`

			`> PRIVACY — read first. The corpus is the operator's REAL personal memories.`
			> `data/` (corpus/queries/qrels), `.venv/`, `cache/`, `results/`, and
			> `scripts/build_eval_set.py` (the generator embeds memory-derived query text)
			`> are gitignored and must never be committed. Everything else here contains`
			`> only code / aggregate numbers and is safe to commit. Sensitive memories`
			> (`is_sensitive=1`) are excluded from the corpus entirely.

			`## Layout`

			```
			`benchmarks/`
			`harness/ # importable package (committable; no real content)`
			`types.py # Memory, Query, Qrels, Retriever protocol`
			`metrics.py # recall@k, nDCG@k, MRR (binary relevance)`
			`dataset.py # load_dataset() + referential-integrity validation`
			`runner.py # run_benchmark() -> overall + per-stratum + latency`
			`baselines.py # SqliteFtsRetriever (faithful FTS5/BM25 reference)`
			`example_retriever.py # worked example of the plug-in interface`
			`test_harness.py # unit tests (pytest)`
			`scripts/`
			`export_corpus.py # SQLite -> data/corpus.jsonl (non-sensitive only)`
			`build_eval_set.py # -> data/queries.jsonl + qrels.jsonl [GITIGNORED]`
			`dataset_stats.py # validate + print AGGREGATE stats (safe)`
			`run_eval.py # CLI: run a retriever, print/save metrics`
			`data/ # [GITIGNORED] corpus.jsonl, queries.jsonl, qrels.jsonl`
			`.venv/ # [GITIGNORED]`
			```

			`## Dataset schema (JSONL, one object per line)`

			`corpus.jsonl` — every non-sensitive memory:
			```json
			`{"id": 137, "content": "...", "category": "decisions", "tags": "memory,architecture",`
			`"expanded_keywords": "...", "importance": 0.85}`
			```
			`id` (int) is the join key everywhere. `tags` is comma-separated; `expanded_keywords`
			`space-separated (matches the production schema).`

			`queries.jsonl` — eval queries, three strata:
			```json
			`{"query_id": "para_006", "text": "...", "stratum": "paraphrase", "relevant_ids": [380],`
			`"_note": "author rationale", "_jaccard": 0.023}`
			```
			- `stratum` ∈ `exact` \| `paraphrase` \| `multihop`.
			- `relevant_ids` is a convenience copy; `qrels.jsonl` is authoritative.
			- `_note` / `_jaccard` are provenance fields (underscore-prefixed); ignore them in
			`scoring.`

			`qrels.jsonl` — binary relevance judgments (authoritative):
			```json
			`{"query_id": "multi_006", "relevant_ids": [263, 423, 637]}`
			```

			`### Strata (what each one tests)`

			`\| stratum \| construction \| who should win \|`
			`\|---\|---\|---\|`
			`\| exact \| query = a salient phrase lifted from ONE memory; that memory is relevant (verified as the top FTS hit at build time) \| lexical already strong; floor check \|`
			`\| paraphrase \| query restates ONE memory's meaning in DIFFERENT words (low lexical overlap, validated Jaccard ≤ ~0.18 vs content+keywords) \| dense embeddings \|`
			`\| multihop \| query needs 2+ DISTINCT memories sharing an entity/concept (e.g. project + decision, or a multi-part runbook); ALL are relevant \| concept graph \|`

			`Where a near-duplicate memory equally satisfies a single-target query, qrels was`
			`augmented to include the twin (so a good retriever isn't penalised); deliberate`
			`discriminator queries are kept single-target on purpose.`

			`## Pluggable retriever interface`

			`A retriever is any object implementing one method:`

			```python
			`def retrieve(self, query: str, k: int) -> list[int]:`
			"""Return up to k memory ids (corpus `id`s), ranked best-first."""
			```

			`Optional lifecycle hooks the runner uses if present (duck-typed):`

			```python
			`def build_index(self, corpus: list[Memory]) -> None: ... # timed separately`
			`def index_size_bytes(self) -> int: ... # reported`
			`name: str # label in reports`
			```

			A bare callable `retrieve(query, k) -> list[int]` also works.

			`## Run it`

			```bash
			`.venv/bin/python scripts/export_corpus.py # (re)build data/corpus.jsonl`
			`.venv/bin/python scripts/build_eval_set.py # (re)build queries+qrels (local)`
			`.venv/bin/python scripts/dataset_stats.py # validate + aggregate stats`
			`.venv/bin/python -m pytest harness/test_harness.py -q`

			`# evaluate a retriever (built-in alias or module:Class)`
			`.venv/bin/python scripts/run_eval.py --retriever fts5`
			`.venv/bin/python scripts/run_eval.py --retriever your_pkg.mod:YourRetriever --json results/yours.json`
			```

			`Programmatic use:`

			```python
			`from harness import load_dataset, run_benchmark`
			`ds = load_dataset()`
			`result = run_benchmark(MyRetriever(), ds) # builds index, times queries`
			`print(result.summary()) # overall + per-stratum table`
			`result.to_dict() # full machine-readable result`
			```

			`run_benchmark` requests `retrieve_k=20` per query by default (≥ the max metric
			`cutoff of 10), macro-averages metrics over queries (overall + per stratum), and`
			`reports per-query latency p50/p95 plus index build time/size when the hooks exist.`

			`## Reference baseline`

			`harness.baselines.SqliteFtsRetriever` mirrors the production local-store search
			`(README "Search Algorithm"): FTS5 over content/category/tags/expanded_keywords,`
			`'"w1" OR "w2" ...'` MATCH, `ORDER BY bm25(), importance`. This is the lexical
			"current system" any hybrid retriever must beat. (The Postgres `tsvector` path
			`uses weighted A/B/C/D ranking and an importance-first default; FTS5/BM25 is the`
			`faithful, dependency-free relevance reference for the quality comparison.)`