127 lines
5.6 KiB
Markdown
127 lines
5.6 KiB
Markdown
|
|
# claude-memory recall benchmark
|
||
|
|
|
||
|
|
Stratified retrieval benchmark gating the hybrid-recall adoption decision
|
||
|
|
(ADR-0001): does dense-vector semantic recall + a concept graph beat the current
|
||
|
|
lexical FTS on **recall@5, recall@10, nDCG@10, MRR**? Quality decides adoption;
|
||
|
|
latency/storage are measured but non-gating.
|
||
|
|
|
||
|
|
> **PRIVACY — read first.** The corpus is the operator's REAL personal memories.
|
||
|
|
> `data/` (corpus/queries/qrels), `.venv/`, `cache/`, `results/`, and
|
||
|
|
> `scripts/build_eval_set.py` (the generator embeds memory-derived query text)
|
||
|
|
> are **gitignored and must never be committed**. Everything else here contains
|
||
|
|
> only code / aggregate numbers and is safe to commit. Sensitive memories
|
||
|
|
> (`is_sensitive=1`) are excluded from the corpus entirely.
|
||
|
|
|
||
|
|
## Layout
|
||
|
|
|
||
|
|
```
|
||
|
|
benchmarks/
|
||
|
|
harness/ # importable package (committable; no real content)
|
||
|
|
types.py # Memory, Query, Qrels, Retriever protocol
|
||
|
|
metrics.py # recall@k, nDCG@k, MRR (binary relevance)
|
||
|
|
dataset.py # load_dataset() + referential-integrity validation
|
||
|
|
runner.py # run_benchmark() -> overall + per-stratum + latency
|
||
|
|
baselines.py # SqliteFtsRetriever (faithful FTS5/BM25 reference)
|
||
|
|
example_retriever.py # worked example of the plug-in interface
|
||
|
|
test_harness.py # unit tests (pytest)
|
||
|
|
scripts/
|
||
|
|
export_corpus.py # SQLite -> data/corpus.jsonl (non-sensitive only)
|
||
|
|
build_eval_set.py # -> data/queries.jsonl + qrels.jsonl [GITIGNORED]
|
||
|
|
dataset_stats.py # validate + print AGGREGATE stats (safe)
|
||
|
|
run_eval.py # CLI: run a retriever, print/save metrics
|
||
|
|
data/ # [GITIGNORED] corpus.jsonl, queries.jsonl, qrels.jsonl
|
||
|
|
.venv/ # [GITIGNORED]
|
||
|
|
```
|
||
|
|
|
||
|
|
## Dataset schema (JSONL, one object per line)
|
||
|
|
|
||
|
|
**`corpus.jsonl`** — every non-sensitive memory:
|
||
|
|
```json
|
||
|
|
{"id": 137, "content": "...", "category": "decisions", "tags": "memory,architecture",
|
||
|
|
"expanded_keywords": "...", "importance": 0.85}
|
||
|
|
```
|
||
|
|
`id` (int) is the join key everywhere. `tags` is comma-separated; `expanded_keywords`
|
||
|
|
space-separated (matches the production schema).
|
||
|
|
|
||
|
|
**`queries.jsonl`** — eval queries, three strata:
|
||
|
|
```json
|
||
|
|
{"query_id": "para_006", "text": "...", "stratum": "paraphrase", "relevant_ids": [380],
|
||
|
|
"_note": "author rationale", "_jaccard": 0.023}
|
||
|
|
```
|
||
|
|
- `stratum` ∈ `exact` | `paraphrase` | `multihop`.
|
||
|
|
- `relevant_ids` is a convenience copy; **`qrels.jsonl` is authoritative**.
|
||
|
|
- `_note` / `_jaccard` are provenance fields (underscore-prefixed); ignore them in
|
||
|
|
scoring.
|
||
|
|
|
||
|
|
**`qrels.jsonl`** — binary relevance judgments (authoritative):
|
||
|
|
```json
|
||
|
|
{"query_id": "multi_006", "relevant_ids": [263, 423, 637]}
|
||
|
|
```
|
||
|
|
|
||
|
|
### Strata (what each one tests)
|
||
|
|
|
||
|
|
| stratum | construction | who should win |
|
||
|
|
|---|---|---|
|
||
|
|
| **exact** | query = a salient phrase lifted from ONE memory; that memory is relevant (verified as the top FTS hit at build time) | lexical already strong; floor check |
|
||
|
|
| **paraphrase** | query restates ONE memory's meaning in DIFFERENT words (low lexical overlap, validated Jaccard ≤ ~0.18 vs content+keywords) | **dense embeddings** |
|
||
|
|
| **multihop** | query needs 2+ DISTINCT memories sharing an entity/concept (e.g. project + decision, or a multi-part runbook); ALL are relevant | **concept graph** |
|
||
|
|
|
||
|
|
Where a near-duplicate memory equally satisfies a single-target query, qrels was
|
||
|
|
augmented to include the twin (so a good retriever isn't penalised); deliberate
|
||
|
|
discriminator queries are kept single-target on purpose.
|
||
|
|
|
||
|
|
## Pluggable retriever interface
|
||
|
|
|
||
|
|
A retriever is any object implementing **one** method:
|
||
|
|
|
||
|
|
```python
|
||
|
|
def retrieve(self, query: str, k: int) -> list[int]:
|
||
|
|
"""Return up to k memory ids (corpus `id`s), ranked best-first."""
|
||
|
|
```
|
||
|
|
|
||
|
|
Optional lifecycle hooks the runner uses if present (duck-typed):
|
||
|
|
|
||
|
|
```python
|
||
|
|
def build_index(self, corpus: list[Memory]) -> None: ... # timed separately
|
||
|
|
def index_size_bytes(self) -> int: ... # reported
|
||
|
|
name: str # label in reports
|
||
|
|
```
|
||
|
|
|
||
|
|
A bare callable `retrieve(query, k) -> list[int]` also works.
|
||
|
|
|
||
|
|
## Run it
|
||
|
|
|
||
|
|
```bash
|
||
|
|
.venv/bin/python scripts/export_corpus.py # (re)build data/corpus.jsonl
|
||
|
|
.venv/bin/python scripts/build_eval_set.py # (re)build queries+qrels (local)
|
||
|
|
.venv/bin/python scripts/dataset_stats.py # validate + aggregate stats
|
||
|
|
.venv/bin/python -m pytest harness/test_harness.py -q
|
||
|
|
|
||
|
|
# evaluate a retriever (built-in alias or module:Class)
|
||
|
|
.venv/bin/python scripts/run_eval.py --retriever fts5
|
||
|
|
.venv/bin/python scripts/run_eval.py --retriever your_pkg.mod:YourRetriever --json results/yours.json
|
||
|
|
```
|
||
|
|
|
||
|
|
Programmatic use:
|
||
|
|
|
||
|
|
```python
|
||
|
|
from harness import load_dataset, run_benchmark
|
||
|
|
ds = load_dataset()
|
||
|
|
result = run_benchmark(MyRetriever(), ds) # builds index, times queries
|
||
|
|
print(result.summary()) # overall + per-stratum table
|
||
|
|
result.to_dict() # full machine-readable result
|
||
|
|
```
|
||
|
|
|
||
|
|
`run_benchmark` requests `retrieve_k=20` per query by default (≥ the max metric
|
||
|
|
cutoff of 10), macro-averages metrics over queries (overall + per stratum), and
|
||
|
|
reports per-query latency p50/p95 plus index build time/size when the hooks exist.
|
||
|
|
|
||
|
|
## Reference baseline
|
||
|
|
|
||
|
|
`harness.baselines.SqliteFtsRetriever` mirrors the production local-store search
|
||
|
|
(README "Search Algorithm"): FTS5 over content/category/tags/expanded_keywords,
|
||
|
|
`'"w1" OR "w2" ...'` MATCH, `ORDER BY bm25(), importance`. This is the lexical
|
||
|
|
"current system" any hybrid retriever must beat. (The Postgres `tsvector` path
|
||
|
|
uses weighted A/B/C/D ranking and an importance-first default; FTS5/BM25 is the
|
||
|
|
faithful, dependency-free relevance reference for the quality comparison.)
|