diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
index 463cf3a..5b4b20c 100644
--- a/.github/workflows/build.yml
+++ b/.github/workflows/build.yml
@@ -70,7 +70,6 @@ jobs:
       - uses: docker/build-push-action@v7
         with:
           context: .
-          file: docker/Dockerfile
           push: true
           platforms: linux/amd64
           # Single-manifest images (no provenance/SBOM attestation children) so
diff --git a/.gitignore b/.gitignore
index 4a896e7..fb63210 100644
--- a/.gitignore
+++ b/.gitignore
@@ -48,9 +48,3 @@ docker/pgdata/
 # Beads / Dolt files (added by bd init)
 .dolt/
 .beads-credential-key
-
-# Agent git worktrees (standing policy: never the shared checkout)
-.worktrees/
-
-# uv lockfile — CI runs `uv sync` itself; not tracked
-uv.lock
diff --git a/CONTEXT.md b/CONTEXT.md
deleted file mode 100644
index 6782e70..0000000
--- a/CONTEXT.md
+++ /dev/null
@@ -1,45 +0,0 @@
-# Claude Memory MCP
-
-Persistent cross-session memory for Claude. Today it stores **Memories** as rows and
-retrieves them by **lexical recall** (full-text keyword matching). This context is being
-extended with **semantic recall** (embeddings) and a **concept graph** so retrieval works
-by meaning and related memories become traversable.
-
-## Language
-
-**Memory**:
-A single stored unit of knowledge — a fact, preference, decision, project note, or person
-detail — with content plus metadata (category, tags, importance). The atomic thing a user
-stores and recalls.
-
-**Recall**:
-Retrieving the Memories most relevant to a query. The read path.
-
-**Lexical recall**:
-The existing retrieval method — matches Memories whose words (content, tags, LLM-generated
-keywords) overlap the query, ranked by BM25 / `ts_rank`. Matches *tokens*, not meaning.
-_Avoid_: calling this "semantic search" — it is not semantic.
-
-**Semantic recall**:
-Retrieval by meaning via dense-vector **Embedding** similarity, so a query surfaces a Memory
-even with zero shared words (e.g. "what UI library?" → "prefers Svelte").
-
-**Embedding**:
-A dense vector representation of a Memory's (or Concept's) meaning, used for Semantic recall.
-
-**Concept**:
-A distinct entity or idea that recurs across Memories (e.g. "Svelte", "Viktor", "TripIt",
-"frontend framework"). A node in the Concept graph. Distinct from a Memory: one Memory can
-mention several Concepts, and one Concept spans many Memories.
-
-**Concept graph**:
-The network of Concepts joined by typed **Relationships**, making the memory store
-traversable — from one Memory or Concept to related ones.
-
-**Relationship**:
-A typed, directed edge in the Concept graph, between two Concepts or between a Memory and a
-Concept (e.g. `prefers`, `is-a`, `used-in`, `mentions`).
-
-**Hybrid retrieval**:
-The target read path — combining Lexical recall, Semantic recall, and Concept-graph
-traversal into one ranked result set.
diff --git a/benchmarks/.gitignore b/benchmarks/.gitignore
deleted file mode 100644
index 9761d5d..0000000
--- a/benchmarks/.gitignore
+++ /dev/null
@@ -1,25 +0,0 @@
-# Benchmark dataset is the user's REAL personal memories — NEVER commit.
-# Privacy hard-rule (research task brief): corpus/queries/qrels stay LOCAL.
-data/
-.venv/
-cache/
-*.npy
-*.faiss
-*.db
-
-# The eval-set GENERATOR embeds real memory-derived query text + author notes
-# (paraphrases of real memories, real ids/notes). Treat it as a data artifact:
-# LOCAL-ONLY, never commit. Regenerates data/ from corpus.jsonl. The HARNESS
-# itself (harness/*.py, the other scripts) contains NO real content and is safe.
-scripts/build_eval_set.py
-
-# Python noise
-__pycache__/
-*.pyc
-.pytest_cache/
-*.egg-info/
-.ipynb_checkpoints/
-
-# Results from runs may quote real content — keep local by default.
-results/
-*.results.json
diff --git a/benchmarks/README.md b/benchmarks/README.md
deleted file mode 100644
index a2859ab..0000000
--- a/benchmarks/README.md
+++ /dev/null
@@ -1,126 +0,0 @@
-# claude-memory recall benchmark
-
-Stratified retrieval benchmark gating the hybrid-recall adoption decision
-(ADR-0001): does dense-vector semantic recall + a concept graph beat the current
-lexical FTS on **recall@5, recall@10, nDCG@10, MRR**? Quality decides adoption;
-latency/storage are measured but non-gating.
-
-> **PRIVACY — read first.** The corpus is the operator's REAL personal memories.
-> `data/` (corpus/queries/qrels), `.venv/`, `cache/`, `results/`, and
-> `scripts/build_eval_set.py` (the generator embeds memory-derived query text)
-> are **gitignored and must never be committed**. Everything else here contains
-> only code / aggregate numbers and is safe to commit. Sensitive memories
-> (`is_sensitive=1`) are excluded from the corpus entirely.
-
-## Layout
-
-```
-benchmarks/
-  harness/                 # importable package (committable; no real content)
-    types.py               # Memory, Query, Qrels, Retriever protocol
-    metrics.py             # recall@k, nDCG@k, MRR (binary relevance)
-    dataset.py             # load_dataset() + referential-integrity validation
-    runner.py              # run_benchmark() -> overall + per-stratum + latency
-    baselines.py           # SqliteFtsRetriever (faithful FTS5/BM25 reference)
-    example_retriever.py   # worked example of the plug-in interface
-    test_harness.py        # unit tests (pytest)
-  scripts/
-    export_corpus.py       # SQLite -> data/corpus.jsonl (non-sensitive only)
-    build_eval_set.py      # -> data/queries.jsonl + qrels.jsonl  [GITIGNORED]
-    dataset_stats.py       # validate + print AGGREGATE stats (safe)
-    run_eval.py            # CLI: run a retriever, print/save metrics
-  data/                    # [GITIGNORED] corpus.jsonl, queries.jsonl, qrels.jsonl
-  .venv/                   # [GITIGNORED]
-```
-
-## Dataset schema (JSONL, one object per line)
-
-**`corpus.jsonl`** — every non-sensitive memory:
-```json
-{"id": 137, "content": "...", "category": "decisions", "tags": "memory,architecture",
- "expanded_keywords": "...", "importance": 0.85}
-```
-`id` (int) is the join key everywhere. `tags` is comma-separated; `expanded_keywords`
-space-separated (matches the production schema).
-
-**`queries.jsonl`** — eval queries, three strata:
-```json
-{"query_id": "para_006", "text": "...", "stratum": "paraphrase", "relevant_ids": [380],
- "_note": "author rationale", "_jaccard": 0.023}
-```
-- `stratum` ∈ `exact` | `paraphrase` | `multihop`.
-- `relevant_ids` is a convenience copy; **`qrels.jsonl` is authoritative**.
-- `_note` / `_jaccard` are provenance fields (underscore-prefixed); ignore them in
-  scoring.
-
-**`qrels.jsonl`** — binary relevance judgments (authoritative):
-```json
-{"query_id": "multi_006", "relevant_ids": [263, 423, 637]}
-```
-
-### Strata (what each one tests)
-
-| stratum | construction | who should win |
-|---|---|---|
-| **exact** | query = a salient phrase lifted from ONE memory; that memory is relevant (verified as the top FTS hit at build time) | lexical already strong; floor check |
-| **paraphrase** | query restates ONE memory's meaning in DIFFERENT words (low lexical overlap, validated Jaccard ≤ ~0.18 vs content+keywords) | **dense embeddings** |
-| **multihop** | query needs 2+ DISTINCT memories sharing an entity/concept (e.g. project + decision, or a multi-part runbook); ALL are relevant | **concept graph** |
-
-Where a near-duplicate memory equally satisfies a single-target query, qrels was
-augmented to include the twin (so a good retriever isn't penalised); deliberate
-discriminator queries are kept single-target on purpose.
-
-## Pluggable retriever interface
-
-A retriever is any object implementing **one** method:
-
-```python
-def retrieve(self, query: str, k: int) -> list[int]:
-    """Return up to k memory ids (corpus `id`s), ranked best-first."""
-```
-
-Optional lifecycle hooks the runner uses if present (duck-typed):
-
-```python
-def build_index(self, corpus: list[Memory]) -> None: ...   # timed separately
-def index_size_bytes(self) -> int: ...                     # reported
-name: str                                                  # label in reports
-```
-
-A bare callable `retrieve(query, k) -> list[int]` also works.
-
-## Run it
-
-```bash
-.venv/bin/python scripts/export_corpus.py        # (re)build data/corpus.jsonl
-.venv/bin/python scripts/build_eval_set.py        # (re)build queries+qrels (local)
-.venv/bin/python scripts/dataset_stats.py         # validate + aggregate stats
-.venv/bin/python -m pytest harness/test_harness.py -q
-
-# evaluate a retriever (built-in alias or module:Class)
-.venv/bin/python scripts/run_eval.py --retriever fts5
-.venv/bin/python scripts/run_eval.py --retriever your_pkg.mod:YourRetriever --json results/yours.json
-```
-
-Programmatic use:
-
-```python
-from harness import load_dataset, run_benchmark
-ds = load_dataset()
-result = run_benchmark(MyRetriever(), ds)   # builds index, times queries
-print(result.summary())                     # overall + per-stratum table
-result.to_dict()                            # full machine-readable result
-```
-
-`run_benchmark` requests `retrieve_k=20` per query by default (≥ the max metric
-cutoff of 10), macro-averages metrics over queries (overall + per stratum), and
-reports per-query latency p50/p95 plus index build time/size when the hooks exist.
-
-## Reference baseline
-
-`harness.baselines.SqliteFtsRetriever` mirrors the production local-store search
-(README "Search Algorithm"): FTS5 over content/category/tags/expanded_keywords,
-`'"w1" OR "w2" ...'` MATCH, `ORDER BY bm25(), importance`. This is the lexical
-"current system" any hybrid retriever must beat. (The Postgres `tsvector` path
-uses weighted A/B/C/D ranking and an importance-first default; FTS5/BM25 is the
-faithful, dependency-free relevance reference for the quality comparison.)
diff --git a/benchmarks/harness/__init__.py b/benchmarks/harness/__init__.py
deleted file mode 100644
index eb2b856..0000000
--- a/benchmarks/harness/__init__.py
+++ /dev/null
@@ -1,28 +0,0 @@
-"""Benchmark harness for claude-memory recall evaluation.
-
-Public API:
-    from harness import Retriever, load_dataset, run_benchmark, BenchmarkResult
-    from harness import metrics
-
-A retriever is any object (or callable) implementing:
-    retrieve(query: str, k: int) -> list[memory_id]   # ranked, best first
-
-memory_id matches the `id` field in corpus.jsonl / qrels.jsonl (int).
-"""
-from .types import Retriever, Query, Memory, Qrels
-from .dataset import load_dataset, Dataset
-from .runner import run_benchmark, BenchmarkResult, StratumResult
-from . import metrics
-
-__all__ = [
-    "Retriever",
-    "Query",
-    "Memory",
-    "Qrels",
-    "load_dataset",
-    "Dataset",
-    "run_benchmark",
-    "BenchmarkResult",
-    "StratumResult",
-    "metrics",
-]
diff --git a/benchmarks/harness/baselines.py b/benchmarks/harness/baselines.py
deleted file mode 100644
index b5fd678..0000000
--- a/benchmarks/harness/baselines.py
+++ /dev/null
@@ -1,93 +0,0 @@
-"""Reference LEXICAL baseline retrievers that mirror the production system.
-
-These exist so (a) the eval-set author can VERIFY a query's labels and check
-that paraphrase queries genuinely defeat lexical matching, and (b) later agents
-have an honest "current system" to beat.
-
-`SqliteFtsRetriever` builds an in-memory SQLite FTS5 index over the corpus and
-runs the SAME query shape the production local store uses:
-    words -> '"w1" OR "w2" ...' MATCH, ORDER BY bm25(), importance as tiebreak.
-(README "SQLite: FTS5 with BM25".) This is the closest faithful, dependency-free
-baseline. The Postgres tsvector path is documented in the README; its ranking
-differs (weighted A/B/C/D + importance-first default) but for a quality ceiling
-comparison the FTS5/BM25 relevance ordering is the right lexical reference.
-"""
-from __future__ import annotations
-
-import re
-import sqlite3
-from collections.abc import Sequence
-
-from .types import Memory, MemoryId
-
-# FTS5 reserved-ish tokens; we quote every term anyway, but strip embedded quotes.
-_WORD_RE = re.compile(r"[A-Za-z0-9_]+")
-
-
-class SqliteFtsRetriever:
-    """Faithful FTS5/BM25 lexical baseline (mirrors local_store search)."""
-
-    name = "sqlite_fts5_bm25"
-
-    def __init__(self, sort_by: str = "relevance") -> None:
-        # "relevance": ORDER BY bm25(), importance DESC  (best for quality eval)
-        # "importance": ORDER BY importance DESC, ... (production default)
-        self.sort_by = sort_by
-        self._con: sqlite3.Connection | None = None
-
-    def build_index(self, corpus: Sequence[Memory]) -> None:
-        con = sqlite3.connect(":memory:")
-        con.execute(
-            """
-            CREATE VIRTUAL TABLE memories_fts USING fts5(
-                content, category, tags, expanded_keywords,
-                memory_id UNINDEXED, importance UNINDEXED
-            )
-            """
-        )
-        con.executemany(
-            "INSERT INTO memories_fts(content, category, tags, expanded_keywords, memory_id, importance)"
-            " VALUES (?,?,?,?,?,?)",
-            [
-                (m.content, m.category, m.tags, m.expanded_keywords, m.id, m.importance)
-                for m in corpus
-            ],
-        )
-        con.commit()
-        self._con = con
-
-    def _fts_query(self, query: str) -> str:
-        words = _WORD_RE.findall(query.lower())
-        if not words:
-            return ""
-        return " OR ".join(f'"{w}"' for w in words)
-
-    def retrieve(self, query: str, k: int) -> list[MemoryId]:
-        assert self._con is not None, "call build_index first"
-        match = self._fts_query(query)
-        if not match:
-            return []
-        if self.sort_by == "importance":
-            order = "importance DESC, bm25(memories_fts)"
-        else:
-            order = "bm25(memories_fts), importance DESC"
-        try:
-            rows = self._con.execute(
-                f"SELECT memory_id FROM memories_fts WHERE memories_fts MATCH ? "
-                f"ORDER BY {order} LIMIT ?",
-                (match, k),
-            ).fetchall()
-        except sqlite3.OperationalError:
-            # mirror production LIKE fallback on FTS syntax errors
-            like = f"%{query}%"
-            rows = self._con.execute(
-                "SELECT memory_id FROM memories_fts WHERE content LIKE ? OR tags LIKE ? "
-                "ORDER BY importance DESC LIMIT ?",
-                (like, like, k),
-            ).fetchall()
-        return [r[0] for r in rows]
-
-    def close(self) -> None:
-        if self._con is not None:
-            self._con.close()
-            self._con = None
diff --git a/benchmarks/harness/dataset.py b/benchmarks/harness/dataset.py
deleted file mode 100644
index ad5607a..0000000
--- a/benchmarks/harness/dataset.py
+++ /dev/null
@@ -1,115 +0,0 @@
-"""Load corpus / queries / qrels JSONL into typed objects."""
-from __future__ import annotations
-
-import json
-from dataclasses import dataclass
-from pathlib import Path
-
-from .types import Memory, Query, Qrels, MemoryId
-
-_DATA_DIR = Path(__file__).resolve().parents[1] / "data"
-
-
-@dataclass
-class Dataset:
-    corpus: list[Memory]
-    queries: list[Query]
-    qrels: Qrels
-
-    @property
-    def corpus_by_id(self) -> dict[MemoryId, Memory]:
-        return {m.id: m for m in self.corpus}
-
-    def strata(self) -> set[str]:
-        return {q.stratum for q in self.queries}
-
-
-def _read_jsonl(path: Path) -> list[dict]:
-    out: list[dict] = []
-    with path.open(encoding="utf-8") as f:
-        for line in f:
-            line = line.strip()
-            if line:
-                out.append(json.loads(line))
-    return out
-
-
-def load_corpus(path: Path | None = None) -> list[Memory]:
-    path = path or (_DATA_DIR / "corpus.jsonl")
-    rows = _read_jsonl(path)
-    return [
-        Memory(
-            id=r["id"],
-            content=r["content"],
-            category=r.get("category", "facts"),
-            tags=r.get("tags", "") or "",
-            expanded_keywords=r.get("expanded_keywords", "") or "",
-            importance=r.get("importance", 0.5),
-        )
-        for r in rows
-    ]
-
-
-def load_queries(path: Path | None = None) -> list[Query]:
-    path = path or (_DATA_DIR / "queries.jsonl")
-    rows = _read_jsonl(path)
-    return [
-        Query(
-            query_id=r["query_id"],
-            text=r["text"],
-            stratum=r["stratum"],
-            relevant_ids=tuple(r.get("relevant_ids", [])),
-        )
-        for r in rows
-    ]
-
-
-def load_qrels(path: Path | None = None) -> Qrels:
-    path = path or (_DATA_DIR / "qrels.jsonl")
-    rows = _read_jsonl(path)
-    qrels: Qrels = {}
-    for r in rows:
-        qid = r["query_id"]
-        rel = set(r["relevant_ids"])
-        qrels.setdefault(qid, set()).update(rel)
-    return qrels
-
-
-def load_dataset(
-    corpus_path: Path | None = None,
-    queries_path: Path | None = None,
-    qrels_path: Path | None = None,
-    *,
-    validate: bool = True,
-) -> Dataset:
-    corpus = load_corpus(corpus_path)
-    queries = load_queries(queries_path)
-    qrels = load_qrels(qrels_path)
-
-    if validate:
-        _validate(corpus, queries, qrels)
-
-    return Dataset(corpus=corpus, queries=queries, qrels=qrels)
-
-
-def _validate(corpus: list[Memory], queries: list[Query], qrels: Qrels) -> None:
-    corpus_ids = {m.id for m in corpus}
-    q_ids = {q.query_id for q in queries}
-
-    # Every query must have a qrels entry, and vice versa.
-    missing_qrels = q_ids - set(qrels)
-    if missing_qrels:
-        raise ValueError(f"queries without qrels: {sorted(missing_qrels)[:10]}")
-    orphan_qrels = set(qrels) - q_ids
-    if orphan_qrels:
-        raise ValueError(f"qrels without queries: {sorted(orphan_qrels)[:10]}")
-
-    # Every relevant id must exist in the corpus and the set must be non-empty.
-    for qid, rels in qrels.items():
-        if not rels:
-            raise ValueError(f"empty qrels for query {qid}")
-        unknown = rels - corpus_ids
-        if unknown:
-            raise ValueError(
-                f"query {qid} references non-corpus ids {sorted(unknown)[:10]}"
-            )
diff --git a/benchmarks/harness/example_retriever.py b/benchmarks/harness/example_retriever.py
deleted file mode 100644
index 29ed748..0000000
--- a/benchmarks/harness/example_retriever.py
+++ /dev/null
@@ -1,59 +0,0 @@
-"""Worked example: how a later agent plugs a retriever into the harness.
-
-A retriever needs only one method:
-
-    retrieve(self, query: str, k: int) -> list[int]   # ranked memory ids
-
-Optionally it may implement lifecycle hooks the runner will use if present:
-
-    build_index(self, corpus: list[Memory]) -> None    # timed separately
-    index_size_bytes(self) -> int                      # reported
-
-Run this file directly for a smoke test against the local eval set:
-    .venv/bin/python -m harness.example_retriever
-"""
-from __future__ import annotations
-
-from collections.abc import Sequence
-
-from .types import Memory, MemoryId
-
-
-class SubstringRetriever:
-    """Trivial baseline: rank by count of query-word occurrences in content.
-
-    Deliberately weak — exists only to demonstrate the interface. The real
-    lexical baseline is harness.baselines.SqliteFtsRetriever.
-    """
-
-    name = "substring_demo"
-
-    def __init__(self) -> None:
-        self._corpus: list[Memory] = []
-
-    def build_index(self, corpus: Sequence[Memory]) -> None:
-        self._corpus = list(corpus)
-
-    def retrieve(self, query: str, k: int) -> list[MemoryId]:
-        words = [w for w in query.lower().split() if len(w) > 2]
-        scored: list[tuple[int, float]] = []
-        for m in self._corpus:
-            hay = (m.content + " " + m.expanded_keywords + " " + m.tags).lower()
-            score = sum(hay.count(w) for w in words)
-            if score:
-                scored.append((m.id, score + m.importance))  # importance tiebreak
-        scored.sort(key=lambda t: t[1], reverse=True)
-        return [mid for mid, _ in scored[:k]]
-
-
-def _smoke() -> None:
-    from .dataset import load_dataset
-    from .runner import run_benchmark
-
-    ds = load_dataset()
-    res = run_benchmark(SubstringRetriever(), ds)
-    print(res.summary())
-
-
-if __name__ == "__main__":
-    _smoke()
diff --git a/benchmarks/harness/metrics.py b/benchmarks/harness/metrics.py
deleted file mode 100644
index 4edd789..0000000
--- a/benchmarks/harness/metrics.py
+++ /dev/null
@@ -1,100 +0,0 @@
-"""Retrieval metrics with BINARY relevance.
-
-Conventions
------------
-- `ranked`: list of memory ids, best-first, as returned by a retriever.
-- `relevant`: set of relevant memory ids for the query (from qrels).
-- All functions are pure and operate on a single query; the runner aggregates
-  (macro-average over queries).
-
-Definitions
------------
-recall@k   = |relevant ∩ ranked[:k]| / |relevant|
-             (fraction of all relevant items retrieved within the top k)
-MRR        = 1 / rank_of_first_relevant  (0 if none retrieved at all)
-nDCG@k     = DCG@k / IDCG@k  with binary gains (gain=1 for relevant)
-             DCG@k = sum over i in [1..k] of rel_i / log2(i + 1)
-             IDCG@k is the DCG of the ideal ranking (all relevant first),
-             capped at min(|relevant|, k) ones.
-
-Notes
------
-- nDCG uses the standard log2(rank+1) discount (Järvelin & Kekäläinen 2002);
-  with binary gains this is the common IR convention also used by BEIR/pytrec_eval.
-- MRR is reported as the reciprocal rank of the FIRST relevant hit, which for a
-  single query equals the per-query reciprocal-rank that the runner averages.
-- Duplicate ids in `ranked` are de-duplicated keeping first occurrence, so a
-  retriever cannot inflate recall by repeating an id.
-"""
-from __future__ import annotations
-
-import math
-from collections.abc import Iterable, Sequence
-
-MemoryId = int
-
-
-def _dedup_keep_order(ranked: Sequence[MemoryId]) -> list[MemoryId]:
-    seen: set[MemoryId] = set()
-    out: list[MemoryId] = []
-    for x in ranked:
-        if x not in seen:
-            seen.add(x)
-            out.append(x)
-    return out
-
-
-def recall_at_k(ranked: Sequence[MemoryId], relevant: Iterable[MemoryId], k: int) -> float:
-    rel = set(relevant)
-    if not rel:
-        # Undefined; treat as 0 contribution. Runner should never pass empty.
-        return 0.0
-    top = _dedup_keep_order(ranked)[:k]
-    hits = sum(1 for x in top if x in rel)
-    return hits / len(rel)
-
-
-def reciprocal_rank(ranked: Sequence[MemoryId], relevant: Iterable[MemoryId]) -> float:
-    rel = set(relevant)
-    if not rel:
-        return 0.0
-    for i, x in enumerate(_dedup_keep_order(ranked), start=1):
-        if x in rel:
-            return 1.0 / i
-    return 0.0
-
-
-def dcg_at_k(ranked: Sequence[MemoryId], relevant: Iterable[MemoryId], k: int) -> float:
-    rel = set(relevant)
-    top = _dedup_keep_order(ranked)[:k]
-    dcg = 0.0
-    for i, x in enumerate(top, start=1):
-        if x in rel:
-            dcg += 1.0 / math.log2(i + 1)
-    return dcg
-
-
-def ndcg_at_k(ranked: Sequence[MemoryId], relevant: Iterable[MemoryId], k: int) -> float:
-    rel = set(relevant)
-    if not rel:
-        return 0.0
-    dcg = dcg_at_k(ranked, rel, k)
-    ideal_hits = min(len(rel), k)
-    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
-    if idcg == 0.0:
-        return 0.0
-    return dcg / idcg
-
-
-def per_query_metrics(ranked: Sequence[MemoryId], relevant: Iterable[MemoryId]) -> dict[str, float]:
-    """All headline metrics for one query."""
-    rel = set(relevant)
-    return {
-        "recall@5": recall_at_k(ranked, rel, 5),
-        "recall@10": recall_at_k(ranked, rel, 10),
-        "ndcg@10": ndcg_at_k(ranked, rel, 10),
-        "mrr": reciprocal_rank(ranked, rel),
-    }
-
-
-METRIC_NAMES = ("recall@5", "recall@10", "ndcg@10", "mrr")
diff --git a/benchmarks/harness/runner.py b/benchmarks/harness/runner.py
deleted file mode 100644
index bb693c4..0000000
--- a/benchmarks/harness/runner.py
+++ /dev/null
@@ -1,223 +0,0 @@
-"""Benchmark runner: drive a pluggable retriever over the eval set and report
-overall + per-stratum quality metrics, plus per-query latency and (optional)
-index build time / size.
-
-Quality decides adoption (recall@k, nDCG@10, MRR). Latency and storage are
-measured and reported but DO NOT gate the decision (ADR-0001 success metric).
-"""
-from __future__ import annotations
-
-import statistics
-import time
-from collections.abc import Callable
-from dataclasses import dataclass, field, asdict
-from typing import Any
-
-from . import metrics
-from .dataset import Dataset
-from .types import MemoryId, Query, Retriever
-
-# A retriever may be the Protocol object or a bare callable retrieve(query, k).
-RetrieverLike = Retriever | Callable[[str, int], list[MemoryId]]
-
-# k used for the retrieve() call. We request enough depth to compute all
-# metrics (max cutoff is 10) with headroom so ties past k=10 don't distort.
-DEFAULT_RETRIEVE_K = 20
-
-
-def _percentile(values: list[float], pct: float) -> float:
-    """Linear-interpolation percentile (pct in [0,100]). Empty -> 0.0."""
-    if not values:
-        return 0.0
-    if len(values) == 1:
-        return values[0]
-    s = sorted(values)
-    rank = (pct / 100.0) * (len(s) - 1)
-    lo = int(rank)
-    hi = min(lo + 1, len(s) - 1)
-    frac = rank - lo
-    return s[lo] + (s[hi] - s[lo]) * frac
-
-
-@dataclass
-class StratumResult:
-    stratum: str
-    n_queries: int
-    metrics: dict[str, float]  # macro-averaged metric -> value
-
-
-@dataclass
-class BenchmarkResult:
-    retriever_name: str
-    n_queries: int
-    retrieve_k: int
-    overall: dict[str, float]
-    per_stratum: dict[str, StratumResult]
-    latency_ms: dict[str, float]  # mean / p50 / p95 / max
-    index_build_seconds: float | None = None
-    index_size_bytes: int | None = None
-    per_query: list[dict[str, Any]] = field(default_factory=list)
-
-    def to_dict(self) -> dict:
-        d = asdict(self)
-        d["per_stratum"] = {k: asdict(v) for k, v in self.per_stratum.items()}
-        return d
-
-    def summary(self) -> str:
-        lines = [
-            f"Retriever: {self.retriever_name}",
-            f"Queries: {self.n_queries}  (retrieve_k={self.retrieve_k})",
-        ]
-        if self.index_build_seconds is not None:
-            lines.append(f"Index build: {self.index_build_seconds:.3f}s")
-        if self.index_size_bytes is not None:
-            lines.append(f"Index size: {self.index_size_bytes / 1e6:.2f} MB")
-        lat = self.latency_ms
-        lines.append(
-            "Latency/query: "
-            f"p50={lat['p50']:.2f}ms  p95={lat['p95']:.2f}ms  "
-            f"mean={lat['mean']:.2f}ms  max={lat['max']:.2f}ms"
-        )
-        cols = metrics.METRIC_NAMES
-        header = "  ".join(f"{c:>10}" for c in cols)
-        lines.append("")
-        lines.append(f"{'stratum':<12}{'n':>5}  {header}")
-        lines.append("-" * (19 + len(header)))
-        for name in ("overall", *sorted(self.per_stratum)):
-            if name == "overall":
-                m, n = self.overall, self.n_queries
-            else:
-                sr = self.per_stratum[name]
-                m, n = sr.metrics, sr.n_queries
-            row = "  ".join(f"{m[c]:>10.4f}" for c in cols)
-            lines.append(f"{name:<12}{n:>5}  {row}")
-        return "\n".join(lines)
-
-
-def _get_retrieve_fn(retriever: RetrieverLike) -> Callable[[str, int], list[MemoryId]]:
-    if hasattr(retriever, "retrieve"):
-        return retriever.retrieve  # type: ignore[attr-defined]
-    if callable(retriever):
-        return retriever
-    raise TypeError("retriever must implement retrieve(query, k) or be callable")
-
-
-def _maybe_build_index(retriever: RetrieverLike, dataset: Dataset) -> tuple[float | None, int | None]:
-    """Call optional lifecycle hooks if present (duck-typed).
-
-    - build_index(corpus) -> None : measured wall-clock build time.
-    - index_size_bytes() -> int   : reported on-disk/in-memory index size.
-    Returns (build_seconds_or_None, size_bytes_or_None).
-    """
-    build_seconds: float | None = None
-    size_bytes: int | None = None
-
-    build = getattr(retriever, "build_index", None)
-    if callable(build):
-        t0 = time.perf_counter()
-        build(dataset.corpus)
-        build_seconds = time.perf_counter() - t0
-
-    size_fn = getattr(retriever, "index_size_bytes", None)
-    if callable(size_fn):
-        try:
-            size_bytes = int(size_fn())
-        except Exception:
-            size_bytes = None
-
-    return build_seconds, size_bytes
-
-
-def run_benchmark(
-    retriever: RetrieverLike,
-    dataset: Dataset,
-    *,
-    retrieve_k: int = DEFAULT_RETRIEVE_K,
-    retriever_name: str | None = None,
-    warmup: bool = True,
-    collect_per_query: bool = True,
-) -> BenchmarkResult:
-    """Evaluate `retriever` over `dataset`.
-
-    The retriever is asked for `retrieve_k` ids per query (>= max metric
-    cutoff of 10). Metrics are macro-averaged over queries, overall and per
-    stratum. Latency is measured around each retrieve() call only (index build
-    is timed separately via the optional build_index hook).
-    """
-    name = retriever_name or getattr(retriever, "name", None) or type(retriever).__name__
-    retrieve = _get_retrieve_fn(retriever)
-    qrels = dataset.qrels
-
-    build_seconds, size_bytes = _maybe_build_index(retriever, dataset)
-
-    # Optional warmup (first call can pay import/JIT/connection costs that would
-    # skew p95). Excluded from latency stats. Uses the first query if any.
-    if warmup and dataset.queries:
-        try:
-            retrieve(dataset.queries[0].text, retrieve_k)
-        except Exception:
-            pass  # warmup failures surface on the real call below
-
-    per_query_rows: list[dict[str, Any]] = []
-    latencies_ms: list[float] = []
-    # accumulate per-stratum metric sums for macro-average
-    strata: dict[str, dict[str, float]] = {}
-    strata_counts: dict[str, int] = {}
-    overall_sums = {m: 0.0 for m in metrics.METRIC_NAMES}
-
-    for q in dataset.queries:
-        rel = qrels[q.query_id]
-        t0 = time.perf_counter()
-        ranked = list(retrieve(q.text, retrieve_k))
-        dt_ms = (time.perf_counter() - t0) * 1000.0
-        latencies_ms.append(dt_ms)
-
-        m = metrics.per_query_metrics(ranked, rel)
-        for key, val in m.items():
-            overall_sums[key] += val
-        strata.setdefault(q.stratum, {mm: 0.0 for mm in metrics.METRIC_NAMES})
-        strata_counts[q.stratum] = strata_counts.get(q.stratum, 0) + 1
-        for key, val in m.items():
-            strata[q.stratum][key] += val
-
-        if collect_per_query:
-            per_query_rows.append(
-                {
-                    "query_id": q.query_id,
-                    "stratum": q.stratum,
-                    "n_relevant": len(rel),
-                    "latency_ms": round(dt_ms, 3),
-                    "retrieved": ranked[:retrieve_k],
-                    **{k: round(v, 6) for k, v in m.items()},
-                }
-            )
-
-    n = len(dataset.queries)
-    overall = {k: (overall_sums[k] / n if n else 0.0) for k in metrics.METRIC_NAMES}
-    per_stratum: dict[str, StratumResult] = {}
-    for s, sums in strata.items():
-        c = strata_counts[s]
-        per_stratum[s] = StratumResult(
-            stratum=s,
-            n_queries=c,
-            metrics={k: (sums[k] / c if c else 0.0) for k in metrics.METRIC_NAMES},
-        )
-
-    latency_stats = {
-        "mean": statistics.fmean(latencies_ms) if latencies_ms else 0.0,
-        "p50": _percentile(latencies_ms, 50),
-        "p95": _percentile(latencies_ms, 95),
-        "max": max(latencies_ms) if latencies_ms else 0.0,
-    }
-
-    return BenchmarkResult(
-        retriever_name=name,
-        n_queries=n,
-        retrieve_k=retrieve_k,
-        overall=overall,
-        per_stratum=per_stratum,
-        latency_ms=latency_stats,
-        index_build_seconds=build_seconds,
-        index_size_bytes=size_bytes,
-        per_query=per_query_rows,
-    )
diff --git a/benchmarks/harness/test_harness.py b/benchmarks/harness/test_harness.py
deleted file mode 100644
index 9375374..0000000
--- a/benchmarks/harness/test_harness.py
+++ /dev/null
@@ -1,145 +0,0 @@
-"""Unit tests for metrics + runner. No real corpus needed (synthetic data).
-
-Run:  .venv/bin/python -m pytest harness/test_harness.py -q
-"""
-from __future__ import annotations
-
-import math
-
-from harness import metrics
-from harness.dataset import Dataset
-from harness.runner import run_benchmark, _percentile
-from harness.types import Memory, Query
-
-
-# ---------------- metrics ----------------
-
-def test_recall_at_k_basic():
-    ranked = [9, 8, 3, 7, 1]
-    rel = {3, 1, 99}  # 99 never retrieved
-    assert metrics.recall_at_k(ranked, rel, 5) == 2 / 3
-    assert metrics.recall_at_k(ranked, rel, 2) == 0.0  # neither in top2
-    assert metrics.recall_at_k(ranked, rel, 3) == 1 / 3  # only id 3 in top3
-
-
-def test_recall_perfect_and_zero():
-    assert metrics.recall_at_k([1, 2, 3], {1, 2, 3}, 5) == 1.0
-    assert metrics.recall_at_k([4, 5, 6], {1, 2, 3}, 5) == 0.0
-
-
-def test_reciprocal_rank():
-    assert metrics.reciprocal_rank([5, 4, 3], {3}) == 1 / 3
-    assert metrics.reciprocal_rank([3, 4, 5], {3}) == 1.0
-    assert metrics.reciprocal_rank([7, 8], {3}) == 0.0
-    # first relevant wins
-    assert metrics.reciprocal_rank([9, 3, 1], {1, 3}) == 1 / 2
-
-
-def test_ndcg_perfect():
-    # all relevant at the top -> nDCG == 1
-    assert math.isclose(metrics.ndcg_at_k([1, 2, 3, 4], {1, 2, 3}, 10), 1.0)
-
-
-def test_ndcg_known_value():
-    # single relevant doc at rank 2: DCG = 1/log2(3); IDCG = 1/log2(2)=1
-    ranked = [9, 1, 8]
-    val = metrics.ndcg_at_k(ranked, {1}, 10)
-    assert math.isclose(val, (1 / math.log2(3)) / 1.0)
-
-
-def test_ndcg_two_relevant_suboptimal_order():
-    # relevant {1,2}; retrieved at ranks 1 and 3
-    ranked = [1, 9, 2]
-    dcg = 1 / math.log2(2) + 1 / math.log2(4)  # ranks 1 and 3
-    idcg = 1 / math.log2(2) + 1 / math.log2(3)  # ideal ranks 1 and 2
-    assert math.isclose(metrics.ndcg_at_k(ranked, {1, 2}, 10), dcg / idcg)
-
-
-def test_dedup_does_not_inflate():
-    # repeating a relevant id must not increase recall beyond 1 hit's worth
-    ranked = [3, 3, 3, 3]
-    assert metrics.recall_at_k(ranked, {3, 7}, 5) == 0.5
-    assert metrics.reciprocal_rank(ranked, {3}) == 1.0
-
-
-def test_empty_relevant_is_zero():
-    assert metrics.recall_at_k([1, 2], set(), 5) == 0.0
-    assert metrics.ndcg_at_k([1, 2], set(), 5) == 0.0
-
-
-# ---------------- percentile ----------------
-
-def test_percentile():
-    vals = [10, 20, 30, 40]
-    assert _percentile(vals, 50) == 25.0  # interpolated median
-    assert _percentile(vals, 0) == 10
-    assert _percentile(vals, 100) == 40
-    assert _percentile([5.0], 95) == 5.0
-    assert _percentile([], 50) == 0.0
-
-
-# ---------------- runner ----------------
-
-def _toy_dataset() -> Dataset:
-    corpus = [Memory(id=i, content=f"memory {i}") for i in range(1, 11)]
-    queries = [
-        Query("q_exact_1", "find 1", "exact", (1,)),
-        Query("q_para_1", "restate 5", "paraphrase", (5,)),
-        Query("q_multi_1", "join 3 and 4", "multihop", (3, 4)),
-    ]
-    qrels = {"q_exact_1": {1}, "q_para_1": {5}, "q_multi_1": {3, 4}}
-    return Dataset(corpus=corpus, queries=queries, qrels=qrels)
-
-
-class _PerfectRetriever:
-    """Returns exactly the relevant ids first (oracle) — for runner plumbing."""
-
-    def __init__(self, qrels):
-        self._qrels = qrels
-        self._by_text = None
-
-    def build_index(self, corpus):
-        self._n = len(corpus)
-
-    def index_size_bytes(self):
-        return 1234
-
-    def retrieve(self, query, k):
-        # map query text back via the toy queries' known answers
-        mapping = {"find 1": [1], "restate 5": [5], "join 3 and 4": [3, 4]}
-        ids = mapping.get(query, [])
-        # pad with distractors
-        pad = [x for x in range(100, 100 + k)]
-        return (ids + pad)[:k]
-
-
-def test_runner_perfect_retriever():
-    ds = _toy_dataset()
-    r = _PerfectRetriever(ds.qrels)
-    res = run_benchmark(r, ds, retriever_name="perfect")
-    assert res.n_queries == 3
-    assert math.isclose(res.overall["recall@10"], 1.0)
-    assert math.isclose(res.overall["mrr"], 1.0)
-    assert math.isclose(res.overall["ndcg@10"], 1.0)
-    # per-stratum present
-    assert set(res.per_stratum) == {"exact", "paraphrase", "multihop"}
-    assert res.per_stratum["multihop"].n_queries == 1
-    # lifecycle hooks captured
-    assert res.index_build_seconds is not None
-    assert res.index_size_bytes == 1234
-    # latency recorded
-    assert res.latency_ms["p95"] >= 0.0
-
-
-def test_runner_callable_retriever_and_misses():
-    ds = _toy_dataset()
-
-    def retrieve(query, k):  # always wrong
-        return [999][:k]
-
-    res = run_benchmark(retrieve, ds, retriever_name="bad", warmup=False)
-    assert res.overall["recall@10"] == 0.0
-    assert res.overall["mrr"] == 0.0
-    assert res.index_build_seconds is None  # no hook on a bare callable
-    assert "perfect" not in res.summary()
-    assert "bad" in res.summary()
diff --git a/benchmarks/harness/types.py b/benchmarks/harness/types.py
deleted file mode 100644
index 0665f33..0000000
--- a/benchmarks/harness/types.py
+++ /dev/null
@@ -1,53 +0,0 @@
-"""Core dataclasses and the pluggable Retriever protocol."""
-from __future__ import annotations
-
-from dataclasses import dataclass, field
-from typing import Protocol, runtime_checkable
-
-MemoryId = int
-
-
-@dataclass(frozen=True)
-class Memory:
-    """One corpus entry (mirrors corpus.jsonl)."""
-
-    id: MemoryId
-    content: str
-    category: str = "facts"
-    tags: str = ""
-    expanded_keywords: str = ""
-    importance: float = 0.5
-
-
-@dataclass(frozen=True)
-class Query:
-    """One eval query (mirrors queries.jsonl)."""
-
-    query_id: str
-    text: str
-    stratum: str  # "exact" | "paraphrase" | "multihop"
-    # convenience copy of relevant ids; authoritative source is Qrels
-    relevant_ids: tuple[MemoryId, ...] = field(default_factory=tuple)
-
-
-# query_id -> set of relevant memory ids (binary relevance)
-Qrels = dict[str, set[MemoryId]]
-
-
-@runtime_checkable
-class Retriever(Protocol):
-    """Pluggable retriever contract.
-
-    Implementations rank corpus memories for a query and return the top-k
-    memory ids, best match first. The harness will call `retrieve` once per
-    query and compare against qrels.
-
-    Optional lifecycle hooks let a retriever build an index from the corpus
-    and report index build time / on-disk size; the runner uses them if
-    present (duck-typed), so a minimal retriever need only implement
-    `retrieve`.
-    """
-
-    def retrieve(self, query: str, k: int) -> list[MemoryId]:
-        """Return up to k memory ids, ranked best-first."""
-        ...
diff --git a/benchmarks/retrievers/__init__.py b/benchmarks/retrievers/__init__.py
deleted file mode 100644
index 893917d..0000000
--- a/benchmarks/retrievers/__init__.py
+++ /dev/null
@@ -1,10 +0,0 @@
-"""Pluggable retrievers for the claude-memory recall benchmark.
-
-Each retriever implements the harness `retrieve(query, k) -> list[int]` contract
-(see ``harness/types.py`` :: ``Retriever``) and, optionally, the ``build_index`` /
-``index_size_bytes`` lifecycle hooks the runner duck-types.
-
-``fts.FtsRetriever`` is the LEXICAL BASELINE — the product's current local-store
-recall (SQLite FTS5/BM25). It is the "current system" any hybrid retriever must
-beat on recall@k / nDCG@10 / MRR (ADR-0001).
-"""
diff --git a/benchmarks/retrievers/fts.py b/benchmarks/retrievers/fts.py
deleted file mode 100644
index 6a5f516..0000000
--- a/benchmarks/retrievers/fts.py
+++ /dev/null
@@ -1,224 +0,0 @@
-"""BASELINE retriever: the product's CURRENT lexical recall (SQLite FTS5/BM25).
-
-This is the "current system" the hybrid upgrade (dense embeddings + concept
-graph, ADR-0001) must beat on recall@k / nDCG@10 / MRR. It is a *faithful*
-reimplementation of the production local-store recall path, not an idealised
-sketch — it mirrors ``src/claude_memory/mcp_server.py :: _sqlite_recall`` (and
-the FTS5 schema/triggers in the same module) line-for-line where it matters:
-
-Production recall (``sort_by="relevance"``) does ALL of the following, and so
-does this retriever:
-
-1. **Concatenate then split.** The MCP tool builds
-   ``all_terms = f"{context} {expanded_query}"`` and splits it on whitespace,
-   stripping any embedded ``"`` from each token. The harness already hands us
-   one ``query`` string (the concatenation happens upstream of recall), so here
-   ``query`` IS ``all_terms``; we split + strip identically.
-
-2. **AND-first, then OR-broaden.** Production builds BOTH
-   ``'"w1" AND "w2" ...'`` and ``'"w1" OR "w2" ...'`` and runs the **AND** match
-   first; only if it returns zero rows does it fall back to the **OR** match.
-   (The README's "Search Algorithm" prose shows only the OR form; the *code* is
-   AND→OR, and the code is authoritative. We replicate the code.)
-
-3. **Blended BM25+importance relevance ordering.** ``sort_by="relevance"`` is
-   NOT a pure ``ORDER BY bm25()``. It is the blend
-   ``(-bm25(memories_fts) * 0.7 + importance * 0.3) DESC`` (bm25 is negated
-   because SQLite returns more-negative = better-match). We use the EXACT same
-   expression. We deliberately evaluate ``relevance`` (not the production
-   ``importance`` default) so the benchmark measures RETRIEVAL quality rather
-   than the importance-sort prior — per the research brief.
-
-4. **FTS5 default tokenizer.** The production virtual table is declared with no
-   explicit tokenizer, i.e. ``unicode61`` — case-folding + unicode diacritic
-   stripping, NO stemming and NO stop-word removal. We declare ours the same
-   way, so "running" does not match "run" (a known lexical weakness the dense
-   path is expected to fix on the *paraphrase* stratum).
-
-5. **LIKE fallback.** If the FTS5 MATCH raises ``sqlite3.OperationalError``
-   (e.g. a token that trips the FTS5 query grammar), production degrades to a
-   ``content LIKE %context% OR tags LIKE %context%`` scan ordered by importance.
-   We mirror that fallback (using the full query as the LIKE needle, since the
-   harness query is the whole ``all_terms``).
-
-DIFFERENCES FROM PRODUCTION (all immaterial to ranking, documented for honesty):
-- The benchmark corpus has no per-user / soft-delete / category filtering, so we
-  drop the ``user_id``/``deleted_at``/``category`` predicates. No category is
-  passed by the harness, so the category branch is never taken anyway.
-- We build a fresh in-memory FTS5 index over ``data/corpus.jsonl`` rather than
-  reading the live ``memory.db``; same schema, same tokenizer, same columns
-  (content/category/tags/expanded_keywords), so BM25 statistics match what the
-  product would compute over the same documents.
-
-The harness reference ``harness.baselines.SqliteFtsRetriever`` implements the
-*README* ordering (pure ``ORDER BY bm25(), importance``). This module is the
-faithful-to-the-CODE variant and is the one the RUN reports as ``retriever="fts"``.
-"""
-from __future__ import annotations
-
-import re
-import sqlite3
-from collections.abc import Sequence
-
-# Import the corpus dataclass from the sibling harness package. run_eval.py and
-# run_benchmark put the benchmarks/ root on sys.path; support direct execution
-# (python retrievers/fts.py) too by adding it ourselves if the import fails.
-try:  # pragma: no cover - exercised by both import paths
-    from harness.types import Memory, MemoryId
-except ModuleNotFoundError:  # pragma: no cover
-    import sys
-    from pathlib import Path
-
-    sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
-    from harness.types import Memory, MemoryId
-
-# Mirror production token extraction: split ``all_terms`` on whitespace and strip
-# any embedded double-quote from each token (mcp_server uses
-# ``w.replace(chr(34), "")``). We lowercase as well; FTS5 unicode61 case-folds
-# regardless, so this only normalises the quoted MATCH literals we emit.
-_DQUOTE = '"'
-
-
-class FtsRetriever:
-    """Faithful reimplementation of the production SQLite FTS5/BM25 recall.
-
-    Mirrors ``_sqlite_recall(sort_by="relevance")``: AND-first then OR-broaden
-    over an FTS5(content, category, tags, expanded_keywords) index, ranked by
-    the blended ``(-bm25*0.7 + importance*0.3)`` score, with a LIKE fallback.
-    """
-
-    #: Label surfaced in benchmark reports / the RUN schema.
-    name = "fts"
-
-    def __init__(self, sort_by: str = "relevance") -> None:
-        # We benchmark "relevance" so the metric reflects retrieval quality, not
-        # the importance prior. "importance" is kept for parity / diagnostics.
-        if sort_by not in ("relevance", "importance"):
-            raise ValueError(f"sort_by must be 'relevance' or 'importance', got {sort_by!r}")
-        self.sort_by = sort_by
-        self._con: sqlite3.Connection | None = None
-
-    # ── lifecycle hooks (duck-typed by the runner) ───────────────────────────
-
-    def build_index(self, corpus: Sequence[Memory]) -> None:
-        """Build a fresh in-memory FTS5 index over the corpus.
-
-        Same virtual-table shape and (default ``unicode61``) tokenizer as the
-        production ``memories_fts`` table. We carry ``memory_id`` and
-        ``importance`` as UNINDEXED columns so the relevance blend can read
-        importance without a join — semantically identical to the production
-        ``memories m JOIN memories_fts fts ON m.id = fts.rowid`` read.
-        """
-        con = sqlite3.connect(":memory:")
-        con.execute(
-            """
-            CREATE VIRTUAL TABLE memories_fts USING fts5(
-                content, category, tags, expanded_keywords,
-                memory_id UNINDEXED, importance UNINDEXED
-            )
-            """
-        )
-        con.executemany(
-            "INSERT INTO memories_fts"
-            "(content, category, tags, expanded_keywords, memory_id, importance)"
-            " VALUES (?,?,?,?,?,?)",
-            [
-                (
-                    m.content,
-                    m.category,
-                    m.tags,
-                    m.expanded_keywords,
-                    int(m.id),
-                    float(m.importance),
-                )
-                for m in corpus
-            ],
-        )
-        con.commit()
-        self._con = con
-
-    def index_size_bytes(self) -> int:
-        """Approximate on-disk index size (sum of FTS5 shadow-table page bytes).
-
-        The index is in-memory, so this is the SQLite page accounting for the
-        FTS5 shadow tables — reported for the storage column, non-gating per
-        ADR-0001.
-        """
-        if self._con is None:
-            return 0
-        try:
-            page_count = self._con.execute("PRAGMA page_count").fetchone()[0]
-            page_size = self._con.execute("PRAGMA page_size").fetchone()[0]
-            return int(page_count) * int(page_size)
-        except sqlite3.Error:
-            return 0
-
-    # ── query construction (mirrors _sqlite_recall) ──────────────────────────
-
-    @staticmethod
-    def _tokens(query: str) -> list[str]:
-        """Split ``all_terms`` exactly as production does: whitespace split,
-        drop embedded double-quotes, drop empties."""
-        return [w.replace(_DQUOTE, "").lower() for w in query.split() if w.strip()]
-
-    @classmethod
-    def _and_or_queries(cls, query: str) -> tuple[str, str]:
-        """Build the ('"w1" AND "w2" ...', '"w1" OR "w2" ...') MATCH pair."""
-        words = cls._tokens(query)
-        if not words:
-            return "", ""
-        quoted = [f'"{w}"' for w in words]
-        return " AND ".join(quoted), " OR ".join(quoted)
-
-    def _order_clause(self) -> str:
-        # bm25() is negative (more-negative = better), so negate before blending.
-        if self.sort_by == "relevance":
-            return "(-bm25(memories_fts) * 0.7 + importance * 0.3) DESC"
-        return "(-bm25(memories_fts) * 0.4 + importance * 0.6) DESC"
-
-    # ── retrieve ──────────────────────────────────────────────────────────────
-
-    def retrieve(self, query: str, k: int) -> list[MemoryId]:
-        """Return up to ``k`` memory ids, ranked best-first.
-
-        AND-match first (precise); if it yields nothing, OR-broaden. On an FTS5
-        grammar error, fall back to a LIKE scan ordered by importance — exactly
-        the production degradation path.
-        """
-        assert self._con is not None, "call build_index first"
-        and_query, or_query = self._and_or_queries(query)
-        if not or_query:  # no usable tokens
-            return []
-
-        order = self._order_clause()
-        base_select = "SELECT memory_id FROM memories_fts WHERE memories_fts MATCH ? "
-        try:
-            rows: list[tuple[int]] = []
-            # AND first for precise matches, fall back to OR for broader recall.
-            for fts_query in (and_query, or_query):
-                rows = self._con.execute(
-                    f"{base_select}ORDER BY {order} LIMIT ?",
-                    (fts_query, k),
-                ).fetchall()
-                if rows:
-                    break
-        except sqlite3.OperationalError:
-            # Mirror production LIKE fallback: full query as the needle,
-            # ordered by importance.
-            like = f"%{query}%"
-            rows = self._con.execute(
-                "SELECT memory_id FROM memories_fts "
-                "WHERE content LIKE ? OR tags LIKE ? "
-                "ORDER BY importance DESC LIMIT ?",
-                (like, like, k),
-            ).fetchall()
-        return [r[0] for r in rows]
-
-    def close(self) -> None:
-        if self._con is not None:
-            self._con.close()
-            self._con = None
-
-
-# Convenience for `run_eval.py --retriever retrievers.fts:FtsRetriever`
-# and a no-arg default instantiation (sort_by="relevance").
diff --git a/benchmarks/retrievers/hybrid.py b/benchmarks/retrievers/hybrid.py
deleted file mode 100644
index 6039b21..0000000
--- a/benchmarks/retrievers/hybrid.py
+++ /dev/null
@@ -1,570 +0,0 @@
-"""HYBRID retriever (ADR-0001/0002/0003 prototype): lexical FTS + dense semantic
-recall + a memory-node concept graph, fused with Reciprocal Rank Fusion (RRF).
-
-This is the self-contained prototype the hybrid-recall ADOPTION decision is gated
-on (ADR-0001): does dense embeddings + a concept graph beat the current lexical
-FTS5/BM25 on recall@5/recall@10/nDCG@10/MRR? Quality decides; latency/storage are
-reported but non-gating.
-
-It implements the harness ``retrieve(query, k) -> list[int]`` contract and the
-optional ``build_index(corpus)`` / ``index_size_bytes()`` / ``name`` hooks.
-
-Three legs, mirroring the FINAL DESIGN
-======================================
-
-1. **Lexical (FTS5/BM25).** We reuse the *faithful* production reimplementation
-   ``retrievers.fts.FtsRetriever`` verbatim — AND-first then OR-broaden over an
-   FTS5(content, category, tags, expanded_keywords) index, ranked by the blended
-   ``(-bm25*0.7 + importance*0.3)``. This is the exact "current system" the hybrid
-   must beat, so the lexical leg of the hybrid IS that system (no drift).
-
-2. **Dense (semantic).** Embeddings per FINAL DESIGN: a HOSTED API is used ONLY if
-   its key is in the environment (``OPENAI_API_KEY`` / ``VOYAGE_API_KEY`` /
-   ``CO_API_KEY``) AND the memory is non-sensitive (ADR-0003); otherwise the local
-   default ``BAAI/bge-large-en-v1.5`` (1024-d, MIT, sentence-transformers). The
-   benchmark corpus is already sensitive-free (``is_sensitive=1`` excluded at
-   export, README privacy note), so here the choice is purely "hosted key present
-   or not". Vectors are L2-normalised; similarity is cosine = dot product. The
-   corpus matrix is cached to ``cache/`` (gitignored) keyed by model id + a corpus
-   fingerprint, so re-runs skip re-embedding. BGE retrieval convention: the QUERY
-   gets the instruction prefix "Represent this sentence for searching relevant
-   passages: "; passages are embedded raw (per the official BAAI model card).
-
-3. **Graph (concept expansion).** A memory-node concept graph built with the
-   design's TRACTABLE extraction — NO 5452 sequential LLM calls. Concepts are the
-   union of each memory's ``tags`` and its already-LLM-generated
-   ``expanded_keywords`` (plus salient content noun-phrases via a lightweight
-   regex/stop-word filter), normalised and de-pluralised. A concept that appears
-   in 2..N memories (very common concepts above a document-frequency ceiling are
-   dropped as non-discriminative) links those memories: ``memory -[shares
-   concept c]- memory``. At query time we take the fused dense+lexical SEEDS, walk
-   1 hop to neighbours that share *discriminative* concepts, and emit those
-   neighbours as a third ranked list. This targets the **multihop** stratum
-   (queries needing 2+ memories that share an entity/concept) without re-ranking
-   the precise hits the other legs already nail.
-
-Fusion (``retrieval_fusion``)
-=============================
-Reciprocal Rank Fusion (Cormack et al., 2009): for a document *d* with rank
-``r_leg(d)`` (1-based) in a leg's ranked list,
-
-    RRF(d) = Σ_leg  w_leg / (k_rrf + r_leg(d))
-
-with ``k_rrf = 60`` (the standard constant) and per-leg weights. RRF is
-score-scale-free (no BM25-vs-cosine calibration), which is why the design floats
-"RRF vs CC" and we pick RRF for the prototype. The dense and lexical legs carry
-full weight; the graph leg is down-weighted (it is a RECALL extender for multihop,
-and the design explicitly flags a possible negative graph prior — so it can add
-documents but should not dethrone strong dense/lexical hits). All three weights
-are class attributes so the kill-gate analysis can ablate the graph to zero.
-
-Graceful degradation (task requirement)
-=======================================
-If the embedding model cannot be loaded/used (missing package, download failure,
-OOM), the dense leg is skipped, the failure is recorded in ``self.errors``, and the
-retriever degrades to **FTS + graph** (or FTS-only if the graph also failed). The
-harness still gets metrics for whatever worked.
-"""
-from __future__ import annotations
-
-import hashlib
-import os
-import re
-import sys
-import time
-from collections import defaultdict
-from collections.abc import Sequence
-from pathlib import Path
-
-# ── package-relative imports that also work under direct execution ────────────
-try:  # pragma: no cover - exercised by both import paths
-    from harness.types import Memory, MemoryId
-    from retrievers.fts import FtsRetriever
-except ModuleNotFoundError:  # pragma: no cover
-    sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
-    from harness.types import Memory, MemoryId
-    from retrievers.fts import FtsRetriever
-
-_BENCH_ROOT = Path(__file__).resolve().parents[1]
-_CACHE_DIR = _BENCH_ROOT / "cache"
-
-# Local default embedding model (FINAL DESIGN: prototype default + sensitive-only
-# fallback). 1024-d, MIT-licensed, strong on MTEB retrieval.
-_LOCAL_MODEL = "BAAI/bge-large-en-v1.5"
-# BGE retrieval query instruction (official BAAI model card recommendation; the
-# v1.5 line relaxed it but it still helps short-query / long-passage asymmetry,
-# which is exactly the paraphrase stratum). Applied to QUERIES only.
-_BGE_QUERY_INSTRUCTION = "Represent this sentence for searching relevant passages: "
-
-# RRF constant (Cormack/Clarke/Buettcher 2009). 60 is the canonical default.
-_RRF_K = 60
-
-# Concept-graph tuning.
-#   _CONCEPT_MIN_DF : a concept must appear in >= this many memories to form edges
-#                     (df==1 links nothing; we need a shared concept).
-#   _CONCEPT_MAX_DF_FRAC : drop concepts appearing in more than this fraction of
-#                     the corpus — they are non-discriminative hubs ("memory",
-#                     "homelab") that would over-connect the graph (design risk:
-#                     "over-merge").
-#   _GRAPH_SEEDS    : how many fused seeds to expand from.
-#   _GRAPH_NEIGHBOURS_PER_SEED : cap neighbours pulled per seed (keeps the graph
-#                     leg from flooding the candidate pool).
-_CONCEPT_MIN_DF = 2
-_CONCEPT_MAX_DF_FRAC = 0.02
-_GRAPH_SEEDS = 10
-_GRAPH_NEIGHBOURS_PER_SEED = 25
-
-# A small English stop-word set for the lightweight noun-phrase extraction. We
-# deliberately keep this tiny + dependency-free (no spaCy/NLTK download on the hot
-# path); the heavy lifting is done by the pre-computed ``expanded_keywords``.
-_STOPWORDS = frozenset(
-    """
-    a an the of to in on at by for with from into over under and or but not is are
-    was were be been being do does did has have had this that these those it its as
-    if then than so such no yes can will would should could may might must i you he
-    she they we me him her them us my your his their our about above after again all
-    any because before below between both during each few more most other some only
-    own same too very up down out off here there when where which who whom what how
-    """.split()
-)
-_WORD_RE = re.compile(r"[A-Za-z][A-Za-z0-9_+.-]{2,}")
-
-
-def _normalise_concept(token: str) -> str:
-    """Lowercase, strip surrounding punctuation, light de-plural so concept
-    variants collapse to one node (e.g. 'decisions'->'decision',
-    'addresses'->'address', 'policies'->'policy'). This is a heuristic collapser,
-    not a linguistically perfect stemmer — its only job is to merge obvious
-    plural/singular pairs so the graph links them; exactness is not load-bearing.
-    Order matters: -ies, then -sses, then sibilant -es, then a bare trailing -s."""
-    t = token.lower().strip(".,;:!?()[]{}\"'`")
-    if len(t) > 4 and t.endswith("ies"):  # policies -> policy
-        return t[:-3] + "y"
-    if len(t) > 4 and t.endswith("sses"):  # addresses -> address, classes -> class
-        return t[:-2]
-    if len(t) > 4 and t.endswith(("ches", "shes", "xes", "zes", "ses")):  # boxes->box
-        return t[:-2]
-    if len(t) > 3 and t.endswith("s") and not t.endswith(("ss", "us", "is")):  # tags->tag
-        return t[:-1]
-    return t
-
-
-def _concepts_for(memory: Memory) -> set[str]:
-    """Extract the concept set for one memory: tags ∪ expanded_keywords ∪ salient
-    content tokens. ``expanded_keywords`` is already an LLM-generated keyword field
-    in the corpus, so this is the design's 'tractable extraction' — we reuse the
-    extraction that production already pays for instead of new LLM calls."""
-    concepts: set[str] = set()
-    # tags: comma-separated
-    for tag in memory.tags.split(","):
-        c = _normalise_concept(tag)
-        if len(c) >= 3 and c not in _STOPWORDS:
-            concepts.add(c)
-    # expanded_keywords: space-separated, already curated
-    for kw in memory.expanded_keywords.split():
-        c = _normalise_concept(kw)
-        if len(c) >= 3 and c not in _STOPWORDS:
-            concepts.add(c)
-    # salient content tokens (lightweight noun-phrase proxy: alpha tokens len>=3,
-    # not stop-words). This is a cheap NER/noun-phrase stand-in per the design.
-    for m in _WORD_RE.finditer(memory.content):
-        c = _normalise_concept(m.group(0))
-        if len(c) >= 3 and c not in _STOPWORDS:
-            concepts.add(c)
-    return concepts
-
-
-def _corpus_fingerprint(corpus: Sequence[Memory]) -> str:
-    """Stable hash over (id, content) so the embedding cache invalidates if the
-    corpus changes but is reused across runs of the same corpus."""
-    h = hashlib.sha256()
-    for m in corpus:
-        h.update(str(m.id).encode())
-        h.update(b"\x00")
-        h.update(m.content.encode("utf-8", "replace"))
-        h.update(b"\x01")
-    return h.hexdigest()[:16]
-
-
-class HybridRetriever:
-    """Lexical FTS + dense (bge-large-en-v1.5 / hosted) + concept-graph expansion,
-    fused with RRF. Degrades to FTS(+graph) if embeddings are unavailable."""
-
-    #: Label surfaced in benchmark reports / the RUN schema.
-    name = "hybrid"
-
-    # Per-leg RRF weights. Dense + lexical carry full weight; graph is a
-    # down-weighted recall extender (design: possible negative graph prior).
-    w_dense = 1.0
-    w_fts = 1.0
-    w_graph = 0.35
-
-    def __init__(self, model_name: str | None = None) -> None:
-        self.errors: list[str] = []
-        self.model_name = model_name or _LOCAL_MODEL
-        self.embedding_backend: str = "none"  # "local:<model>" | "hosted:<provider>:<model>"
-        self.embedding_dim: int | None = None
-
-        # FTS leg (always available; pure stdlib sqlite).
-        self._fts = FtsRetriever(sort_by="relevance")
-
-        # Dense leg state.
-        self._model = None  # SentenceTransformer or None
-        self._np = None  # numpy module handle (set on successful dense build)
-        self._emb = None  # (N, d) float32 L2-normalised matrix, row i ↔ self._ids[i]
-        self._ids: list[MemoryId] = []  # row order of self._emb
-
-        # Graph leg state.
-        self._graph = None  # networkx.Graph or None
-        self._concept_to_mems: dict[str, list[MemoryId]] = {}
-        self._mem_concepts: dict[MemoryId, set[str]] = {}
-        self._n_concepts_total = 0  # before df pruning, for reporting
-        self._n_concepts_kept = 0
-        self._n_edges = 0
-
-        self._corpus_size = 0
-
-    # ── lifecycle: build_index (timed by the runner) ─────────────────────────
-
-    def build_index(self, corpus: Sequence[Memory]) -> None:
-        corpus = list(corpus)
-        self._corpus_size = len(corpus)
-        _CACHE_DIR.mkdir(parents=True, exist_ok=True)
-
-        # 1) lexical leg
-        self._fts.build_index(corpus)
-
-        # 2) dense leg (graceful)
-        try:
-            self._build_dense(corpus)
-        except Exception as exc:  # pragma: no cover - defensive
-            self.errors.append(f"dense leg disabled: {type(exc).__name__}: {exc}")
-            self._model = None
-            self._emb = None
-
-        # 3) graph leg (graceful)
-        try:
-            self._build_graph(corpus)
-        except Exception as exc:  # pragma: no cover - defensive
-            self.errors.append(f"graph leg disabled: {type(exc).__name__}: {exc}")
-            self._graph = None
-            self._concept_to_mems = {}
-
-    # ── dense leg ────────────────────────────────────────────────────────────
-
-    def _select_embedding_backend(self) -> str:
-        """Pick the embedding backend per FINAL DESIGN: hosted only if a key is in
-        the env (non-sensitive corpus already guaranteed by export), else local.
-        Returns a human label and sets self.model_name accordingly."""
-        if os.environ.get("VOYAGE_API_KEY"):
-            self.model_name = "voyage-3.5"
-            return "hosted:voyage:voyage-3.5"
-        if os.environ.get("OPENAI_API_KEY"):
-            self.model_name = "text-embedding-3-large"
-            return "hosted:openai:text-embedding-3-large"
-        if os.environ.get("CO_API_KEY"):
-            self.model_name = "embed-english-v3.0"
-            return "hosted:cohere:embed-english-v3.0"
-        self.model_name = _LOCAL_MODEL
-        return f"local:{_LOCAL_MODEL}"
-
-    def _build_dense(self, corpus: Sequence[Memory]) -> None:
-        import numpy as np  # required for the dense leg
-
-        self._np = np
-        self.embedding_backend = self._select_embedding_backend()
-        self._ids = [m.id for m in corpus]
-        fp = _corpus_fingerprint(corpus)
-        safe_model = self.model_name.replace("/", "_")
-        emb_path = _CACHE_DIR / f"emb_{safe_model}_{fp}.npy"
-        ids_path = _CACHE_DIR / f"emb_{safe_model}_{fp}.ids.npy"
-
-        # cache hit?
-        if emb_path.exists() and ids_path.exists():
-            cached_ids = np.load(ids_path)
-            if list(cached_ids.tolist()) == self._ids:
-                self._emb = np.load(emb_path).astype(np.float32)
-                self.embedding_dim = int(self._emb.shape[1])
-                return  # cached embeddings reused
-
-        # cache miss → embed
-        if self.embedding_backend.startswith("hosted:"):
-            vecs = self._embed_hosted([m.content for m in corpus])
-        else:
-            vecs = self._embed_local([m.content for m in corpus])
-        vecs = vecs.astype(np.float32)
-        # L2-normalise so dot product == cosine.
-        norms = np.linalg.norm(vecs, axis=1, keepdims=True)
-        norms[norms == 0] = 1.0
-        vecs = vecs / norms
-        self._emb = vecs
-        self.embedding_dim = int(vecs.shape[1])
-        np.save(emb_path, vecs)
-        np.save(ids_path, np.array(self._ids, dtype=np.int64))
-
-    def _load_local_model(self):
-        from sentence_transformers import SentenceTransformer
-
-        if self._model is None:
-            # CPU is fine for ~5.5k short docs; force CPU to avoid CUDA init noise.
-            self._model = SentenceTransformer(_LOCAL_MODEL, device="cpu")
-            # Median memory is ~120 tokens; cap the window at 384 so the rare long
-            # memory (1.6% > 512 tok) doesn't pad an entire batch to 512. bge's
-            # native max is 512; 384 keeps ~p99 intact while bounding CPU cost.
-            self._model.max_seq_length = min(self._model.max_seq_length, 384)
-        return self._model
-
-    def _embed_local(self, texts: list[str]):
-        import numpy as np
-
-        model = self._load_local_model()
-        # Length-sort so each batch pads to a homogeneous length (big CPU win), then
-        # restore original order. Passages embedded raw; the caller L2-normalises so
-        # the local and hosted paths stay byte-for-byte consistent downstream.
-        order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
-        sorted_texts = [texts[i] for i in order]
-        out = model.encode(
-            sorted_texts,
-            batch_size=64,
-            convert_to_numpy=True,
-            normalize_embeddings=False,
-            show_progress_bar=False,
-        )
-        out = np.asarray(out)
-        # invert the permutation
-        restored = np.empty_like(out)
-        restored[np.asarray(order)] = out
-        return restored
-
-    def _embed_query_local(self, query: str):
-        import numpy as np
-
-        model = self._load_local_model()
-        out = model.encode(
-            [_BGE_QUERY_INSTRUCTION + query],
-            convert_to_numpy=True,
-            normalize_embeddings=True,  # query L2-normalised → cosine via dot
-            show_progress_bar=False,
-        )
-        return np.asarray(out)[0]
-
-    def _embed_hosted(self, texts: list[str]):
-        """Batch-embed passages via the selected hosted API. Implemented for
-        Voyage / OpenAI / Cohere; only reached when the matching key is set."""
-        import numpy as np
-
-        backend = self.embedding_backend
-        if backend.startswith("hosted:voyage"):
-            import voyageai
-
-            client = voyageai.Client()
-            vecs: list[list[float]] = []
-            for i in range(0, len(texts), 128):
-                batch = texts[i : i + 128]
-                r = client.embed(batch, model="voyage-3.5", input_type="document")
-                vecs.extend(r.embeddings)
-            return np.asarray(vecs)
-        if backend.startswith("hosted:openai"):
-            from openai import OpenAI
-
-            client = OpenAI()
-            vecs = []
-            for i in range(0, len(texts), 256):
-                batch = texts[i : i + 256]
-                r = client.embeddings.create(model="text-embedding-3-large", input=batch)
-                vecs.extend([d.embedding for d in r.data])
-            return np.asarray(vecs)
-        if backend.startswith("hosted:cohere"):
-            import cohere
-
-            client = cohere.Client()
-            vecs = []
-            for i in range(0, len(texts), 96):
-                batch = texts[i : i + 96]
-                r = client.embed(texts=batch, model="embed-english-v3.0", input_type="search_document")
-                vecs.extend(r.embeddings)
-            return np.asarray(vecs)
-        raise RuntimeError(f"unknown hosted backend {backend!r}")
-
-    def _embed_query_hosted(self, query: str):
-        import numpy as np
-
-        backend = self.embedding_backend
-        if backend.startswith("hosted:voyage"):
-            import voyageai
-
-            client = voyageai.Client()
-            r = client.embed([query], model="voyage-3.5", input_type="query")
-            v = np.asarray(r.embeddings[0], dtype=np.float32)
-        elif backend.startswith("hosted:openai"):
-            from openai import OpenAI
-
-            client = OpenAI()
-            r = client.embeddings.create(model="text-embedding-3-large", input=[query])
-            v = np.asarray(r.data[0].embedding, dtype=np.float32)
-        elif backend.startswith("hosted:cohere"):
-            import cohere
-
-            client = cohere.Client()
-            r = client.embed(texts=[query], model="embed-english-v3.0", input_type="search_query")
-            v = np.asarray(r.embeddings[0], dtype=np.float32)
-        else:
-            raise RuntimeError(f"unknown hosted backend {backend!r}")
-        n = np.linalg.norm(v)
-        return v / n if n else v
-
-    def _dense_rank(self, query: str, k: int) -> list[MemoryId]:
-        """Top-k corpus ids by cosine similarity to the query embedding."""
-        if self._emb is None or self._np is None:
-            return []
-        np = self._np
-        if self.embedding_backend.startswith("hosted:"):
-            qv = self._embed_query_hosted(query)
-        else:
-            qv = self._embed_query_local(query)
-        sims = self._emb @ qv  # (N,) cosine sims (both sides L2-normalised)
-        kk = min(k, sims.shape[0])
-        # argpartition for the top-kk, then sort those by score desc.
-        idx = np.argpartition(-sims, kk - 1)[:kk]
-        idx = idx[np.argsort(-sims[idx])]
-        return [self._ids[i] for i in idx]
-
-    # ── graph leg ──────────────────────────────────────────────────────────
-
-    def _build_graph(self, corpus: Sequence[Memory]) -> None:
-        import networkx as nx
-
-        n = len(corpus)
-        max_df = max(_CONCEPT_MIN_DF, int(_CONCEPT_MAX_DF_FRAC * n))
-
-        # concept → set(memory ids)
-        concept_to_mems: dict[str, set[MemoryId]] = defaultdict(set)
-        mem_concepts: dict[MemoryId, set[str]] = {}
-        for m in corpus:
-            cs = _concepts_for(m)
-            mem_concepts[m.id] = cs
-            for c in cs:
-                concept_to_mems[c].add(m.id)
-        self._n_concepts_total = len(concept_to_mems)
-
-        # Keep only discriminative concepts: appear in [_CONCEPT_MIN_DF, max_df]
-        # memories. Below MIN_DF links nothing; above max_df is a non-specific hub.
-        kept: dict[str, list[MemoryId]] = {}
-        for c, mems in concept_to_mems.items():
-            df = len(mems)
-            if _CONCEPT_MIN_DF <= df <= max_df:
-                kept[c] = sorted(mems)
-        self._n_concepts_kept = len(kept)
-        self._concept_to_mems = kept
-        # restrict each memory's concept set to kept concepts (for neighbour scoring)
-        self._mem_concepts = {
-            mid: {c for c in cs if c in kept} for mid, cs in mem_concepts.items()
-        }
-
-        # Build a weighted memory-node graph: edge weight = # shared kept concepts.
-        # We add edges via concept cliques but CAP per-concept fan-out to avoid an
-        # O(df^2) blow-up on the densest kept concepts (design risk: over-merge).
-        g = nx.Graph()
-        g.add_nodes_from(m.id for m in corpus)
-        edge_w: dict[tuple[MemoryId, MemoryId], int] = defaultdict(int)
-        for c, mems in kept.items():
-            # mems is small (<= max_df) by construction; full clique is fine.
-            for i in range(len(mems)):
-                a = mems[i]
-                for j in range(i + 1, len(mems)):
-                    b = mems[j]
-                    key = (a, b) if a < b else (b, a)
-                    edge_w[key] += 1
-        for (a, b), w in edge_w.items():
-            g.add_edge(a, b, weight=w)
-        self._n_edges = g.number_of_edges()
-        self._graph = g
-
-    def _graph_rank(self, seeds: list[MemoryId], exclude: set[MemoryId], k: int) -> list[MemoryId]:
-        """From fused seeds, walk 1 hop and rank neighbour memories by accumulated
-        edge weight (shared-concept strength), weighted by the seed's own rank so
-        higher-confidence seeds pull harder. Returns up to k NEW ids (not in
-        ``exclude``)."""
-        if self._graph is None or not seeds:
-            return []
-        g = self._graph
-        scores: dict[MemoryId, float] = defaultdict(float)
-        for rank, s in enumerate(seeds[:_GRAPH_SEEDS], start=1):
-            if s not in g:
-                continue
-            seed_w = 1.0 / rank  # earlier seeds contribute more
-            nbrs = sorted(
-                g[s].items(), key=lambda kv: kv[1].get("weight", 1), reverse=True
-            )[:_GRAPH_NEIGHBOURS_PER_SEED]
-            for nbr, data in nbrs:
-                if nbr in exclude:
-                    continue
-                scores[nbr] += seed_w * float(data.get("weight", 1))
-        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
-        return [mid for mid, _ in ranked[:k]]
-
-    # ── fusion + retrieve ────────────────────────────────────────────────────
-
-    @staticmethod
-    def _rrf_accumulate(scores: dict[MemoryId, float], ranked: list[MemoryId], weight: float) -> None:
-        for r, mid in enumerate(ranked, start=1):
-            scores[mid] += weight / (_RRF_K + r)
-
-    def retrieve(self, query: str, k: int) -> list[MemoryId]:
-        """Fuse lexical + dense + graph-expansion via weighted RRF and return the
-        top-k memory ids. Pulls each leg deeper than k so fusion has material to
-        re-order, then truncates."""
-        depth = max(k, 50)  # per-leg retrieval depth before fusion
-
-        fts_ranked = self._fts.retrieve(query, depth)
-        dense_ranked = self._dense_rank(query, depth)  # [] if dense disabled
-
-        # Seeds for graph expansion: RRF of the two base legs (so the graph walks
-        # from the best-agreed memories, not just one leg's view).
-        seed_scores: dict[MemoryId, float] = defaultdict(float)
-        self._rrf_accumulate(seed_scores, fts_ranked, self.w_fts)
-        self._rrf_accumulate(seed_scores, dense_ranked, self.w_dense)
-        seeds = [mid for mid, _ in sorted(seed_scores.items(), key=lambda kv: kv[1], reverse=True)]
-        base_set = set(seeds)
-        graph_ranked = self._graph_rank(seeds, exclude=base_set, k=depth)
-
-        # Final weighted RRF over all three legs.
-        scores: dict[MemoryId, float] = defaultdict(float)
-        self._rrf_accumulate(scores, fts_ranked, self.w_fts)
-        self._rrf_accumulate(scores, dense_ranked, self.w_dense)
-        self._rrf_accumulate(scores, graph_ranked, self.w_graph)
-
-        fused = sorted(scores.items(), key=lambda kv: (kv[1], -kv[0]), reverse=True)
-        return [mid for mid, _ in fused[:k]]
-
-    # ── reporting hooks ───────────────────────────────────────────────────────
-
-    def index_size_bytes(self) -> int:
-        """Sum of the dense matrix bytes + FTS index bytes (graph is in-memory
-        networkx; we approximate it via node+edge count * a small constant). Non-
-        gating per ADR-0001; reported for the storage column."""
-        total = 0
-        if self._emb is not None:
-            total += int(self._emb.nbytes)
-        try:
-            total += self._fts.index_size_bytes()
-        except Exception:
-            pass
-        if self._graph is not None:
-            # rough: ~64 B/node + ~96 B/edge accounting for python object overhead.
-            total += self._graph.number_of_nodes() * 64 + self._graph.number_of_edges() * 96
-        return total
-
-    def graph_stats(self) -> dict[str, int]:
-        return {
-            "nodes": self._graph.number_of_nodes() if self._graph is not None else 0,
-            "edges": self._n_edges,
-            "concepts_total": self._n_concepts_total,
-            "concepts_kept": self._n_concepts_kept,
-        }
-
-    def close(self) -> None:
-        self._fts.close()
-
-
-# Convenience for `run_eval.py --retriever retrievers.hybrid:HybridRetriever`.
diff --git a/benchmarks/retrievers/test_hybrid.py b/benchmarks/retrievers/test_hybrid.py
deleted file mode 100644
index 09612a6..0000000
--- a/benchmarks/retrievers/test_hybrid.py
+++ /dev/null
@@ -1,204 +0,0 @@
-"""Unit tests for the HYBRID retriever's pure logic: concept normalisation, the
-concept-graph build + 1-hop expansion, weighted RRF fusion, and graceful
-degradation when the dense leg is unavailable.
-
-These tests are MODEL-FREE on purpose — they never load sentence-transformers (a
-~1.3 GB / multi-minute CPU load). The dense leg is exercised by monkeypatching the
-ranking method, so the fusion + graph behaviour is verified deterministically and
-fast. The full end-to-end quality run is done via scripts/run_eval.py against the
-real (local, gitignored) corpus.
-
-Run:  .venv/bin/python -m pytest retrievers/test_hybrid.py -q
-"""
-from __future__ import annotations
-
-import math
-
-from harness.types import Memory
-from retrievers.hybrid import (
-    _RRF_K,
-    HybridRetriever,
-    _concepts_for,
-    _normalise_concept,
-)
-
-
-# ---------------- concept normalisation ----------------
-
-def test_normalise_concept_depluralisation():
-    cases = {
-        "Decisions": "decision",
-        "policies": "policy",
-        "addresses": "address",
-        "boxes": "box",
-        "tags": "tag",
-        # invariants: don't over-strip
-        "access": "access",
-        "class": "class",
-        "status": "status",
-        "analysis": "analysis",
-        "kubernetes": "kubernete",  # heuristic, acceptable (collapses consistently)
-        "k8s": "k8s",
-        "GPU": "gpu",
-    }
-    for inp, exp in cases.items():
-        assert _normalise_concept(inp) == exp, f"{inp!r} -> {_normalise_concept(inp)!r}"
-
-
-def test_normalise_concept_is_stable_under_repetition():
-    # normalising an already-normalised token must be a no-op (idempotent), so the
-    # graph collapses variants consistently no matter the source field.
-    for tok in ["decision", "policy", "address", "tag", "gpu", "access"]:
-        assert _normalise_concept(_normalise_concept(tok)) == _normalise_concept(tok)
-
-
-def test_concepts_for_unions_tags_keywords_content():
-    m = Memory(
-        id=1,
-        content="The Postgres cluster uses pgvector for embeddings.",
-        tags="database,postgres",
-        expanded_keywords="cnpg vector search",
-    )
-    cs = _concepts_for(m)
-    # from tags (note: 'postgres' de-plurals to 'postgre' — a consistent heuristic
-    # collapse; what matters is every memory mentioning it lands on the SAME node).
-    assert "database" in cs and "postgre" in cs
-    # from expanded_keywords
-    assert "cnpg" in cs and "vector" in cs and "search" in cs
-    # from content (salient tokens, stop-words removed)
-    assert "pgvector" in cs and "embedding" in cs  # 'embeddings' -> 'embedding'
-    assert "the" not in cs and "for" not in cs  # stop-words excluded
-
-
-# ---------------- graph build + expansion ----------------
-
-def _shared_concept_corpus() -> list[Memory]:
-    # Three memories share concept "alpha" (df=3); two share "beta" (df=2); "gamma"
-    # is unique (df=1, links nothing). With min_df=2 and a generous max_df, alpha
-    # and beta both form edges.
-    return [
-        Memory(id=10, content="alpha topic one", tags="alpha", expanded_keywords="beta"),
-        Memory(id=20, content="alpha topic two", tags="alpha", expanded_keywords="beta"),
-        Memory(id=30, content="alpha topic three", tags="alpha", expanded_keywords="gamma"),
-        Memory(id=40, content="unrelated delta", tags="delta", expanded_keywords="delta"),
-    ]
-
-
-def test_graph_build_links_shared_concepts():
-    r = HybridRetriever()
-    # widen max_df so small-corpus concepts aren't pruned as "hubs"
-    import retrievers.hybrid as H
-
-    old = H._CONCEPT_MAX_DF_FRAC
-    H._CONCEPT_MAX_DF_FRAC = 1.0
-    try:
-        r._build_graph(_shared_concept_corpus())
-    finally:
-        H._CONCEPT_MAX_DF_FRAC = old
-
-    g = r._graph
-    assert g is not None
-    # alpha links 10-20-30 (a triangle); beta links 10-20; "topic" links 10-20-30
-    # too (shared content token). So the triangle exists and 10-20 is the heaviest
-    # edge (they additionally share 'beta').
-    assert g.has_edge(10, 20)
-    assert g.has_edge(10, 30)
-    assert g.has_edge(20, 30)
-    # 10-20 share alpha + beta + topic (=3); 10-30 share alpha + topic (=2). The
-    # exact counts aren't load-bearing — the INVARIANT is w(10,20) > w(10,30).
-    assert g[10][20]["weight"] > g[10][30]["weight"]
-    # the unrelated memory 40 (concept 'delta', df=1) links nothing.
-    assert g.degree(40) == 0
-    stats = r.graph_stats()
-    assert stats["nodes"] == 4 and stats["edges"] >= 3
-
-
-def test_graph_rank_expands_from_seeds_by_weight():
-    r = HybridRetriever()
-    import retrievers.hybrid as H
-
-    old = H._CONCEPT_MAX_DF_FRAC
-    H._CONCEPT_MAX_DF_FRAC = 1.0
-    try:
-        r._build_graph(_shared_concept_corpus())
-    finally:
-        H._CONCEPT_MAX_DF_FRAC = old
-
-    # Seed from memory 10; neighbours 20 (w=2) and 30 (w=1) should both surface,
-    # with 20 ranked above 30 (heavier shared-concept edge).
-    nbrs = r._graph_rank([10], exclude={10}, k=10)
-    assert nbrs[:2] == [20, 30]
-    # excluded seeds are never returned
-    assert 10 not in nbrs
-
-
-def test_graph_rank_empty_without_graph_or_seeds():
-    r = HybridRetriever()  # no graph built
-    assert r._graph_rank([1, 2], exclude=set(), k=5) == []
-    r._graph = object.__new__(type("G", (), {}))  # truthy but unused
-    assert r._graph_rank([], exclude=set(), k=5) == []  # no seeds
-
-
-# ---------------- RRF fusion ----------------
-
-def test_rrf_accumulate_formula():
-    scores: dict[int, float] = {}
-    from collections import defaultdict
-
-    scores = defaultdict(float)
-    HybridRetriever._rrf_accumulate(scores, [7, 8, 9], weight=1.0)
-    assert math.isclose(scores[7], 1.0 / (_RRF_K + 1))
-    assert math.isclose(scores[8], 1.0 / (_RRF_K + 2))
-    assert math.isclose(scores[9], 1.0 / (_RRF_K + 3))
-    # a second weighted list adds on top
-    HybridRetriever._rrf_accumulate(scores, [8], weight=0.5)
-    assert math.isclose(scores[8], 1.0 / (_RRF_K + 2) + 0.5 / (_RRF_K + 1))
-
-
-def test_retrieve_fuses_all_three_legs_and_degrades():
-    """End-to-end fusion with the dense leg STUBBED (no model). Verifies (a) FTS +
-    dense agreement floats a doc to the top, (b) the graph leg can introduce a doc
-    neither base leg returned, and (c) dense-disabled degrades to FTS(+graph)."""
-    corpus = [
-        Memory(id=1, content="alpha shared concept", tags="alpha", expanded_keywords="alpha"),
-        Memory(id=2, content="alpha shared concept too", tags="alpha", expanded_keywords="alpha"),
-        Memory(id=3, content="beta unrelated", tags="beta", expanded_keywords="beta"),
-    ]
-    import retrievers.hybrid as H
-
-    old = H._CONCEPT_MAX_DF_FRAC
-    H._CONCEPT_MAX_DF_FRAC = 1.0
-    try:
-        r = HybridRetriever()
-        # Stub the dense BUILD so the test never loads the ~1.3 GB model nor writes
-        # to the shared cache/ dir; build_index then only does FTS + graph.
-        r._build_dense = lambda _c: None  # type: ignore[method-assign]
-        r.build_index(corpus)  # FTS + graph build only
-        # Stub the dense RANKER deterministically to "agree" with FTS on doc 1.
-        r._dense_rank = lambda q, k: [1]  # type: ignore[method-assign]
-
-        # query matching doc 1 lexically; doc 2 shares concept 'alpha' with doc 1
-        # (graph neighbour) even if FTS ranks it lower.
-        out = r.retrieve("alpha shared concept", k=3)
-        assert out, "should return something"
-        assert out[0] == 1  # FTS+dense agreement puts doc 1 first
-        assert 2 in out  # graph expansion (shares 'alpha') pulls doc 2 in
-    finally:
-        H._CONCEPT_MAX_DF_FRAC = old
-
-
-def test_graceful_degradation_records_error(monkeypatch):
-    """If the dense build raises, the retriever records it and still serves FTS."""
-    corpus = [Memory(id=i, content=f"doc number {i} content", tags="t") for i in range(1, 6)]
-    r = HybridRetriever()
-
-    def boom(_corpus):
-        raise RuntimeError("simulated embedding failure")
-
-    monkeypatch.setattr(r, "_build_dense", boom)
-    r.build_index(corpus)
-    assert any("dense leg disabled" in e for e in r.errors)
-    assert r._emb is None
-    # FTS still answers
-    out = r.retrieve("doc number 3 content", k=5)
-    assert 3 in out
diff --git a/benchmarks/scripts/dataset_stats.py b/benchmarks/scripts/dataset_stats.py
deleted file mode 100644
index 87c79b8..0000000
--- a/benchmarks/scripts/dataset_stats.py
+++ /dev/null
@@ -1,49 +0,0 @@
-#!/usr/bin/env python3
-"""Validate the eval set and print AGGREGATE stats (safe to share / commit-able
-numbers only — prints NO raw memory content)."""
-from __future__ import annotations
-
-import json
-import statistics
-import sys
-from collections import Counter
-from pathlib import Path
-
-sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
-from harness import load_dataset  # noqa: E402
-
-
-def main() -> None:
-    ds = load_dataset(validate=True)  # raises on any referential-integrity issue
-
-    strata = Counter(q.stratum for q in ds.queries)
-    rel_per_q = {s: [] for s in strata}
-    for q in ds.queries:
-        rel_per_q[q.stratum].append(len(ds.qrels[q.query_id]))
-
-    # how many DISTINCT corpus memories are exercised as relevant
-    relevant_union = set()
-    for rels in ds.qrels.values():
-        relevant_union |= rels
-
-    out = {
-        "corpus_count": len(ds.corpus),
-        "query_count": len(ds.queries),
-        "strata": dict(strata),
-        "relevant_ids_per_query": {
-            s: {
-                "min": min(v),
-                "median": statistics.median(v),
-                "max": max(v),
-                "mean": round(statistics.fmean(v), 2),
-            }
-            for s, v in rel_per_q.items()
-        },
-        "distinct_relevant_memories": len(relevant_union),
-        "validation": "PASS (all qrels ids exist in corpus; every query has qrels)",
-    }
-    print(json.dumps(out, indent=2))
-
-
-if __name__ == "__main__":
-    main()
diff --git a/benchmarks/scripts/export_corpus.py b/benchmarks/scripts/export_corpus.py
deleted file mode 100644
index a66a774..0000000
--- a/benchmarks/scripts/export_corpus.py
+++ /dev/null
@@ -1,78 +0,0 @@
-#!/usr/bin/env python3
-"""Export the local SQLite memory cache to a LOCAL-ONLY corpus.jsonl.
-
-Privacy: emits ONLY rows where is_sensitive=0. The output file lives under
-benchmarks/data/ which is gitignored. NEVER commit corpus.jsonl.
-
-Fields emitted per line: {id, content, category, tags, expanded_keywords, importance}
-"""
-from __future__ import annotations
-
-import argparse
-import json
-import sqlite3
-import sys
-from pathlib import Path
-
-DEFAULT_DB = Path.home() / ".claude" / "claude-memory" / "memory" / "memory.db"
-DEFAULT_OUT = Path(__file__).resolve().parents[1] / "data" / "corpus.jsonl"
-
-
-def export(db_path: Path, out_path: Path) -> dict:
-    if not db_path.exists():
-        raise SystemExit(f"DB not found: {db_path}")
-
-    con = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
-    con.row_factory = sqlite3.Row
-    cur = con.cursor()
-
-    total = cur.execute("SELECT COUNT(*) FROM memories").fetchone()[0]
-    sensitive = cur.execute(
-        "SELECT COUNT(*) FROM memories WHERE is_sensitive=1"
-    ).fetchone()[0]
-
-    rows = cur.execute(
-        """
-        SELECT id, content, category, tags, expanded_keywords, importance
-        FROM memories
-        WHERE is_sensitive=0
-        ORDER BY id
-        """
-    ).fetchall()
-
-    out_path.parent.mkdir(parents=True, exist_ok=True)
-    written = 0
-    with out_path.open("w", encoding="utf-8") as f:
-        for r in rows:
-            rec = {
-                "id": r["id"],
-                "content": r["content"],
-                "category": r["category"],
-                "tags": r["tags"],
-                "expanded_keywords": r["expanded_keywords"],
-                "importance": r["importance"],
-            }
-            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
-            written += 1
-    con.close()
-
-    return {
-        "total_rows": total,
-        "sensitive_excluded": sensitive,
-        "non_sensitive_written": written,
-        "out_path": str(out_path),
-    }
-
-
-def main() -> None:
-    ap = argparse.ArgumentParser()
-    ap.add_argument("--db", type=Path, default=DEFAULT_DB)
-    ap.add_argument("--out", type=Path, default=DEFAULT_OUT)
-    args = ap.parse_args()
-    stats = export(args.db, args.out)
-    json.dump(stats, sys.stdout, indent=2)
-    print()
-
-
-if __name__ == "__main__":
-    main()
diff --git a/benchmarks/scripts/run_eval.py b/benchmarks/scripts/run_eval.py
deleted file mode 100644
index 0ad0dc9..0000000
--- a/benchmarks/scripts/run_eval.py
+++ /dev/null
@@ -1,65 +0,0 @@
-#!/usr/bin/env python3
-"""Run the benchmark for a named retriever and print overall + per-stratum metrics.
-
-Usage:
-    .venv/bin/python scripts/run_eval.py --retriever fts5      # lexical baseline
-    .venv/bin/python scripts/run_eval.py --retriever substring # demo
-    .venv/bin/python scripts/run_eval.py --retriever mypkg.mymod:MyRetriever
-    .venv/bin/python scripts/run_eval.py --retriever fts5 --json results/fts5.json
-
-The --retriever value is either a built-in alias or a "module:Class" path. The
-class is instantiated with no args; the runner calls build_index() if present.
-
-Outputs are LOCAL-ONLY when written under results/ (gitignored): a results file
-may echo retrieved ids (not content), but keep it local to be safe.
-"""
-from __future__ import annotations
-
-import argparse
-import importlib
-import json
-import sys
-from pathlib import Path
-
-sys.path.insert(0, str(Path(__file__).resolve().parents[1]))
-from harness import load_dataset, run_benchmark  # noqa: E402
-from harness.baselines import SqliteFtsRetriever  # noqa: E402
-from harness.example_retriever import SubstringRetriever  # noqa: E402
-
-ALIASES = {
-    "fts5": lambda: SqliteFtsRetriever(sort_by="relevance"),
-    "fts5_importance": lambda: SqliteFtsRetriever(sort_by="importance"),
-    "substring": SubstringRetriever,
-}
-
-
-def resolve(spec: str):
-    if spec in ALIASES:
-        return ALIASES[spec]()
-    if ":" not in spec:
-        raise SystemExit(f"unknown retriever alias '{spec}' (use module:Class or one of {list(ALIASES)})")
-    mod_name, cls_name = spec.split(":", 1)
-    mod = importlib.import_module(mod_name)
-    return getattr(mod, cls_name)()
-
-
-def main() -> None:
-    ap = argparse.ArgumentParser()
-    ap.add_argument("--retriever", default="fts5")
-    ap.add_argument("--k", type=int, default=20, help="depth requested from retriever")
-    ap.add_argument("--json", type=Path, default=None, help="write full result JSON here")
-    args = ap.parse_args()
-
-    ds = load_dataset(validate=True)
-    retr = resolve(args.retriever)
-    res = run_benchmark(retr, ds, retrieve_k=args.k)
-    print(res.summary())
-
-    if args.json:
-        args.json.parent.mkdir(parents=True, exist_ok=True)
-        args.json.write_text(json.dumps(res.to_dict(), indent=2))
-        print(f"\nwrote {args.json}")
-
-
-if __name__ == "__main__":
-    main()
diff --git a/docs/adr/0001-pursue-hybrid-retrieval-embeddings-and-concept-graph.md b/docs/adr/0001-pursue-hybrid-retrieval-embeddings-and-concept-graph.md
deleted file mode 100644
index e9f0cf9..0000000
--- a/docs/adr/0001-pursue-hybrid-retrieval-embeddings-and-concept-graph.md
+++ /dev/null
@@ -1,21 +0,0 @@
-# Pursue hybrid retrieval: embeddings + concept graph over pure lexical
-
-Today recall is **lexical only** (BM25 in SQLite, `tsvector`/`ts_rank` in Postgres over
-content + LLM-generated `expanded_keywords`). It matches *tokens*, so it misses
-paraphrase/synonym queries and cannot traverse between related Memories. We will pursue a
-**hybrid** read path that adds dense-vector **Semantic recall** and a traversable **Concept
-graph** (typed Relationships) alongside the existing Lexical recall.
-
-This decision is **gated on a benchmark**: we adopt hybrid only if it shows a material
-recall-quality uplift over the current lexical system on a stratified eval set (exact /
-paraphrase / multi-hop). If the benchmark shows no improvement, a later ADR supersedes this
-and we stay lexical.
-
-## Considered options
-
-- **Pure semantic (embeddings only)** — fixes paraphrase gaps but gives no real concept
-  traversal; rejected as the *sole* mechanism.
-- **Pure concept graph** — enables traversal but node-matching stays lexical, so paraphrase
-  gaps remain; rejected as the *sole* mechanism.
-- **Hybrid (chosen)** — embeddings for meaning + graph for traversal + existing FTS, fused
-  into one ranked result. Highest ceiling; the GraphRAG / Zep-Graphiti / HippoRAG family.
diff --git a/docs/adr/0002-api-postgres-first-sqlite-stays-lexical.md b/docs/adr/0002-api-postgres-first-sqlite-stays-lexical.md
deleted file mode 100644
index 9310b72..0000000
--- a/docs/adr/0002-api-postgres-first-sqlite-stays-lexical.md
+++ /dev/null
@@ -1,20 +0,0 @@
-# API/Postgres deployment gets semantics; SQLite-only stays lexical
-
-The semantic + concept-graph layer targets the **API/Postgres** deployment only: embeddings
-in pgvector on the (CNPG) Postgres, the Concept graph as node/edge tables in Postgres, and
-embedding/extraction via reused cluster infra (llama-cpp on GPU, or a hosted API). The
-**SQLite-only** mode keeps working but stays **lexical (FTS) only** — it gains no embeddings
-or graph, degrading gracefully.
-
-This is surprising because the README markets zero-config offline SQLite as the headline
-feature. We accept that trade-off: the operator actually runs the remote API/Postgres store,
-reuse-before-building favours cluster infra, and bundling a local embedding model into the
-zero-config path would add heavy dependencies and double the build/test matrix for little
-real-world benefit.
-
-## Consequences
-
-- All benchmark numbers are produced in API/Postgres mode.
-- Offline zero-config users see no behaviour change.
-- A future ADR may revisit offline semantics (e.g. via `sqlite-vec` + a small local model)
-  if there is demand.
diff --git a/docs/adr/0003-external-embedding-apis-allowed-for-non-sensitive-memories.md b/docs/adr/0003-external-embedding-apis-allowed-for-non-sensitive-memories.md
deleted file mode 100644
index e0ec9f4..0000000
--- a/docs/adr/0003-external-embedding-apis-allowed-for-non-sensitive-memories.md
+++ /dev/null
@@ -1,19 +0,0 @@
-# External embedding/extraction APIs allowed for non-sensitive memories
-
-Embedding and concept extraction may use **hosted APIs** (e.g. OpenAI `text-embedding-3`,
-Voyage, Cohere) for **non-sensitive** memories, to access a higher quality ceiling than
-self-hosted models alone. **Sensitive / Vault-encrypted (secret) memories are never sent
-externally** and are excluded from the corpus that gets embedded or extracted.
-
-This is a deliberate relaxation of the homelab's usual local-only posture, made because the
-quality gain is worth it for non-secret personal memory content. The research/benchmark may
-still compare hosted vs self-hostable models (nomic-embed, bge-m3, gte-Qwen2, e5) so the
-production choice is data-driven; this ADR only records that egress is *permitted* within the
-sensitive-data boundary.
-
-## Consequences
-
-- The corpus-export step MUST filter out `is_sensitive` / secret memories before any external
-  call.
-- Production deployment needs an embedding API key (or falls back to the in-cluster
-  llama-cpp model when absent).
diff --git a/docs/adr/0004-phase-the-hybrid-lexical-dense-first-graph-gated.md b/docs/adr/0004-phase-the-hybrid-lexical-dense-first-graph-gated.md
deleted file mode 100644
index 6bdbae1..0000000
--- a/docs/adr/0004-phase-the-hybrid-lexical-dense-first-graph-gated.md
+++ /dev/null
@@ -1,45 +0,0 @@
-# Phase the hybrid: lexical + dense first, concept graph gated
-
-The [benchmark](../research/benchmark-report.md) shows the hybrid read path beats lexical FTS on
-every overall metric (recall@10 **+0.139**, driven by the statistically-robust paraphrase win recall@10
-**+0.350**), with no recall regression on exact — so [ADR-0001](0001-pursue-hybrid-retrieval-embeddings-and-concept-graph.md)'s
-quality gate **is met by the lexical + dense fusion**. The ablation also showed full-hybrid **identical
-to three decimals** to FTS+dense alone — **but a post-run adversarial review proved this was a structural
-artifact, not an empirical result**: the prototype's fusion barred the graph leg from the fused top-k by
-construction (graph candidates excluded from the FTS∪dense base set; `w_graph=0.35` → max graph RRF
-`0.0057` < base-leg min `0.0091`). So the concept graph was **never validly tested — it is *unevaluated*,
-not disproven.**
-
-We therefore **phase** the hybrid:
-
-- **Phase 1 (adopt now):** lexical FTS ⊕ dense pgvector embeddings, fused with weighted RRF, the
-  importance prior preserved as a post-fusion multiplier. This *is* the measured uplift.
-- **Phase 2 (gated, NOT shipped):** the typed-relation concept graph. It stays designed
-  ([integration design §A.5](../research/integration-design.md)) but disabled — its value is **unproven**
-  and it carries real operational cost (LLM extraction + two extra Postgres tables + traversal).
-
-The graph's prototype was a **zero-LLM keyword-co-occurrence** graph (concepts = tags ∪
-expanded_keywords ∪ regex noun-phrases, edges from co-occurrence), **not** the LLM-extracted
-typed-relation graph the production design specifies. So the null result kills *that cheap
-construction* on *this eval set* — it does not prove an LLM-extracted graph is also null. The graph
-is **gated, not killed**: re-open it only with evidence it helps — e.g. a multi-hop slice whose hops
-are *not* semantically adjacent (where the dense leg can't shortcut), built from real typed-relation
-extraction.
-
-## Why the graph result is inconclusive
-
-The ablation `A≡B` was **guaranteed by the fusion config** (graph candidates excluded from the FTS∪dense
-base set; `w_graph=0.35` caps a graph-only id's RRF below any base-leg id), so it tested nothing about the
-graph — the review even found a relevant memory the graph surfaced that both base legs missed, which
-fusion then discarded. Separately, the multi-hop deltas (recall@10 +0.064) are **not statistically
-significant** (3 of 4 CIs cross zero), so there is no distinguishable multi-hop win to attribute to
-*either* leg. The graph is deferred on **cost + uncertainty**, not on evidence it fails.
-
-## Consequences
-
-- Production ships embeddings + RRF fusion; the graph schema is documented but not migrated.
-- The concept-graph research (Zep/Graphiti split, HippoRAG PPR, EDC canonicalization) is preserved as
-  the phase-2 blueprint, behind the gate.
-- Phasing avoids paying the graph's cost while its value is unproven; the robust **lexical+dense**
-  paraphrase win is what ADR-0001's gate actually surfaced. The graph stays a designed, gated follow-up
-  pending a valid retest (graph candidates in the fused pool / swept weight, real typed-relation extraction).
diff --git a/docs/adr/0005-rrf-default-cc-challenger.md b/docs/adr/0005-rrf-default-cc-challenger.md
deleted file mode 100644
index b05f5fa..0000000
--- a/docs/adr/0005-rrf-default-cc-challenger.md
+++ /dev/null
@@ -1,40 +0,0 @@
-# Weighted RRF as the fusion default; convex combination as a benchmark challenger
-
-The hybrid read path must fuse three retrieval signals on **incomparable scales** — unbounded
-`ts_rank` (BM25), bounded cosine, and (phase 2) an arbitrary graph-proximity score — where the graph
-list is **sparse and often empty**. We adopt **weighted Reciprocal Rank Fusion** as the default
-fusion function: `score(d) = Σ_s w_s/(60 + rank_s(d))`, default `w_lex = w_dense = 1.0`,
-`w_graph = 0.35`, with the existing **importance** value applied as a *post-fusion prior multiplier*
-(`final = fused × (0.7 + 0.3·importance)` for `sort_by="relevance"`) — importance is a prior, **not**
-a fourth fused list.
-
-RRF is the right default because it is **score-scale-free** (no BM25-vs-cosine calibration to
-maintain), treats a missing leg as a clean **0** contribution (no missing-modality bias), is
-near-parameter-free (`k=60` is demonstrably uncritical across [10,100]), and **collapses to today's
-exact lexical ordering** when the dense/graph legs are empty — which is the SQLite-only
-graceful-degrade path ([ADR-0002](0002-api-postgres-first-sqlite-stays-lexical.md)) running the *same*
-code.
-
-## Considered options
-
-- **Convex combination / TM2C2** (Bruch et al., min-max normalization) — the literature consistently
-  shows it **edges RRF on nDCG/recall** when scores are calibratable (Weaviate switched its default to it).
-  It is the **standing challenger**. ⚠️ **Correction:** an earlier draft claimed "the benchmark ran CC
-  against RRF and RRF was chosen on our eval set" — **no CC results were actually produced or persisted in
-  this run.** RRF was adopted on *principled* grounds (scale-free, treats a missing/empty leg as a clean 0,
-  collapses to today's exact lexical ordering for the SQLite-only degrade path), **not** a measured
-  head-to-head. Benchmarking CC vs RRF on our eval set is an open follow-up — do it before locking fusion,
-  and especially if the graph is ever adopted or score distributions shift.
-- **Cross-encoder stage-2 re-rank** (e.g. `bge-reranker-v2-m3` over the fused top ~20–30) — a
-  *separate*, independently-gated stage, not a fusion function. Deferred; ship only if it clears both
-  the quality bar and the hot-path p95 budget on the GPU node.
-
-## Consequences
-
-- Fusion is ~30 lines over three top-N queries; the lexical leg reuses the existing
-  `plainto_tsquery`/`ts_rank` query + OR-broaden fallback verbatim.
-- The exact-stratum nDCG/MRR dip (~0.018/0.025, recall unaffected) is the known RRF cost of blending
-  one perfect hit with near-ties; a small **exact-match rank bonus** is the tunable recovery and a
-  cheap follow-up.
-- `k=60` is borrowed from TREC ad-hoc IR; a quick re-sweep on the eval set is worthwhile but the
-  literature says it is insensitive.
diff --git a/docs/adr/0006-pgvector-hnsw-halfvec-1024d-embeddings.md b/docs/adr/0006-pgvector-hnsw-halfvec-1024d-embeddings.md
deleted file mode 100644
index 158a52b..0000000
--- a/docs/adr/0006-pgvector-hnsw-halfvec-1024d-embeddings.md
+++ /dev/null
@@ -1,49 +0,0 @@
-# Production vector storage: pgvector HNSW + halfvec(1024); 1024-d embeddings (Voyage-3.5 / bge-large)
-
-Phase 1 of the hybrid ([ADR-0004](0004-phase-the-hybrid-lexical-dense-first-graph-gated.md)) needs a
-production home for the dense embeddings. Per [ADR-0002](0002-api-postgres-first-sqlite-stays-lexical.md)
-that is **pgvector on the shared CNPG Postgres**, where claude-memory is already a database tenant — no new
-datastore.
-
-> ⚠️ **Correction (verified against live infra by a design challenger):** an earlier draft justified this
-> with "Immich already runs pgvector on the same cluster." That is **false** — Immich runs its **own**
-> Postgres, not the shared CNPG — so it is NOT evidence the shared cluster has the extension. **pgvector
-> must be explicitly enabled on CNPG** (extension install, and possibly a CNPG operand-image change) via
-> Terraform **before this can land**; do not assume it is already available.
-
-Decisions:
-
-- **Index: HNSW** (`USING hnsw (embedding halfvec_cosine_ops) WITH (m=16, ef_construction=64)`,
-  query knob `hnsw.ef_search` set via `SET LOCAL` inside the recall txn under PgBouncer). Best
-  speed-recall tradeoff, buildable on an empty table. **IVFFlat rejected** — it must be built *after*
-  data exists (empty-table footgun) and has a lower recall ceiling.
-- **Type: `halfvec(1024)`** (fp16) — halves index size at ~no recall loss; 1024-d halfvec = 2048
-  bytes/row → single-digit MB for the whole corpus.
-- **Dimension fixed at 1024**, chosen **once** (changing it later forces a full re-embed + HNSW
-  rebuild). 1024 matches both the production model (Voyage-3.5) and the prototype model
-  (bge-large-en-v1.5), so the column and all fusion code are identical regardless of model.
-- **Model: Voyage-3.5** (1024-d, hosted) for **non-sensitive** memories (highest measured retrieval
-  quality of the hosted options, free tier covers the corpus); **bge-large-en-v1.5** (1024-d, local,
-  MIT) for **sensitive memories and the no-API-key fallback** ([ADR-0003](0003-external-embedding-apis-allowed-for-non-sensitive-memories.md)).
-  `is_sensitive=1` rows are never embedded externally — `embedding=NULL`, lexical-only.
-- **pgvectorscale / StreamingDiskANN deferred** — Rust/pgrx must be compiled into the CNPG operand
-  image, and it only earns its keep above ~1–5M vectors; our corpus is orders of magnitude below that.
-
-## Migration shape
-
-A single **additive** Alembic migration: `ALTER TABLE memories ADD COLUMN embedding halfvec(1024)`
-(NULL for sensitive) + `CREATE INDEX CONCURRENTLY … USING hnsw …`. The existing generated
-`search_vector tsvector` + GIN index (`migrations/001`) are **untouched**, so lexical behaviour and
-the SQLite-only degrade path are unchanged. pgvector enablement on CNPG and any extension/operand
-change land as **Terraform/Terragrunt** in `infra/stacks/…` (GitOps, never kubectl) and trigger a
-rolling restart of the shared cluster — coordinate accordingly.
-
-## Consequences
-
-- The prototype's in-process numpy matrix maps directly to this column; only the substrate changes,
-  not the retrieval math.
-- The prototype measured **bge-large** quality; a cheap follow-up should re-run the dense leg with
-  **Voyage-3.5** on the non-sensitive corpus to confirm the hosted ceiling holds on our content
-  before locking the production default.
-- Production latency/ANN-approximation/filtered-top-k behaviour are unmeasured in the prototype and
-  must be validated post-migration (a stated benchmark limitation).
diff --git a/docs/research/benchmark-report.md b/docs/research/benchmark-report.md
deleted file mode 100644
index 2a48d7b..0000000
--- a/docs/research/benchmark-report.md
+++ /dev/null
@@ -1,312 +0,0 @@
-# Benchmark report: FTS (lexical baseline) vs Hybrid (lexical + dense + graph)
-
-Status: **the ADR-0001 decision instrument.** This is the head-to-head that gates hybrid
-adoption. Read the [survey](survey.md) for the landscape and the
-[integration design](integration-design.md) for what was built.
-
-**Bottom line:** Hybrid (lexical FTS ⊕ dense embeddings, RRF-fused) beats the lexical FTS baseline on
-**every overall metric**, driven by the **paraphrase** stratum (recall@10 **+0.350**, robust under a
-paired bootstrap) — precisely the gap embeddings were meant to close, with **no recall regression on
-exact**. **Recommendation: adopt the lexical + dense fusion;** the ADR-0001 quality gate is met by
-that fusion alone.
-
-> **⚠️ Post-review corrections — read before citing this report.** An adversarial completeness review
-> after the run found that two prominent claims in the original draft did NOT survive scrutiny. They are
-> corrected in place below; the full review is in the [Appendix](#appendix--adversarial-completeness-review).
->
-> 1. **The concept graph was NOT shown to be useless — it was never validly tested.** The prototype's
->    fusion restricted the graph leg to candidate ids *outside* the FTS∪dense base set and weighted it at
->    0.35, which makes it **mathematically impossible** for a graph-only result to enter the fused top-k
->    (max graph RRF `0.35/(60+1)≈0.0057` < any base-leg minimum `1.0/(60+50)≈0.0091`). So the ablation
->    "full-hybrid ≡ FTS+dense to three decimals" was **guaranteed by the fusion config, not an empirical
->    finding** — the review even found a genuinely-relevant memory the graph surfaced (and both base legs
->    missed) that fusion then discarded. The concept graph is therefore **UNEVALUATED**, and is deferred on
->    **cost + invalid-test** grounds, *not* because it failed. A valid retest must put graph ids in the
->    fused candidate pool (no base-set exclusion) and/or sweep the graph weight.
-> 2. **The multi-hop "win" is not statistically significant.** A paired bootstrap (B=10000) puts 3 of the
->    4 multi-hop metric CIs across zero (recall@10 Δ+0.064, P(Δ≤0)≈0.06). Only the **overall** and
->    **paraphrase** deltas are robust. Multi-hop deltas are de-bolded below. Since multi-hop is the entire
->    rationale for a concept graph, there is currently **no statistically distinguishable evidence** that
->    *anything* (dense or graph) helps the multi-hop stratum.
->
-> Also: absolute recall/nDCG *levels* are biased low (binary, un-pooled qrels — §6); only FTS-vs-hybrid
-> **deltas** are trustworthy. None of this changes the adopt-lexical+dense recommendation, which rests on
-> the robust paraphrase/overall win.
-
----
-
-## 1. Methodology
-
-### 1.1 Test collection
-- **Corpus:** 5,452 memories (`benchmarks/data/corpus.jsonl`, gitignored — privacy). Sensitive
-  memories (`is_sensitive=1`) excluded entirely per ADR-0003. The corpus is the user's real
-  personal memories; **only aggregate numbers and synthetic illustrations appear in this document.**
-- **Queries:** 119, stratified — **40 exact / 40 paraphrase / 39 multi-hop**
-  (`benchmarks/data/queries.jsonl`).
-- **Qrels:** `benchmarks/data/qrels.jsonl`, LLM-generated (the LongMemEval pipeline inverted for a
-  memory store: seed memory → exact/paraphrase/multi-hop query → relevant ids).
-
-### 1.2 The two systems
-- **`fts` (baseline)** — `benchmarks/retrievers/fts.py::FtsRetriever`. A **faithful
-  reimplementation of the production code path** `src/claude_memory/mcp_server.py::_sqlite_recall`
-  (`sort_by="relevance"`): in-memory FTS5 over the full corpus with the production virtual-table
-  shape + default `unicode61` tokenizer (no stemming/stop-words); AND-match first, OR-broaden only
-  if AND returns zero; rank by the production blend `-bm25()*0.7 + importance*0.3`; LIKE fallback on
-  operational errors. **This is "the current system" the hybrid must beat — not the simplified
-  README prose.**
-- **`hybrid`** — `benchmarks/retrievers/hybrid.py::HybridRetriever`. Three legs fused with weighted
-  RRF: (1) lexical = **`FtsRetriever` reused verbatim** (so the hybrid's lexical component *is* the
-  baseline — no drift); (2) dense = `BAAI/bge-large-en-v1.5` (1024-d, local, L2-normalized, BGE
-  query-instruction prefix), cosine via numpy; (3) graph = a keyword-co-occurrence memory-node graph
-  (5,452 nodes / 2,095,624 edges), 1-hop expansion from the top-10 seeds.
-  Fusion: `RRF(d) = Σ_leg w_leg/(60 + rank_leg(d))`, `w_fts = w_dense = 1.0, w_graph = 0.35`, each
-  leg to depth 50.
-
-### 1.3 Metrics & protocol
-- **recall@5, recall@10** (hot-path "did we surface it"), **nDCG@10** (graded, position-aware
-  headline), **MRR** (first-hit). Per stratum and overall.
-- `retrieve_k=20`, run via `scripts/run_eval.py --retriever retrievers.{fts,hybrid}:…`. Deterministic;
-  both invocation paths (programmatic + CLI) verified identical.
-- **`sort_by="relevance"` pinned across both arms** (not the production `importance` default), so the
-  benchmark measures *retrieval* quality, not the importance prior — and everything else (corpus,
-  queries, OR-broaden behaviour) is held fixed.
-- Full result JSONs written only to gitignored `benchmarks/results/{fts,hybrid}.json`; retriever code
-  contains no embedded corpus content (safe to commit).
-
----
-
-## 2. Head-to-head results
-
-### 2.1 Comparison table (FTS vs Hybrid, with deltas)
-
-| Stratum | Metric | FTS | Hybrid | Δ |
-|---|---|---:|---:|---:|
-| **Overall** (n=119) | recall@5 | 0.6663 | 0.7415 | **+0.0752** |
-| | recall@10 | 0.6952 | 0.8338 | **+0.1386** |
-| | nDCG@10 | 0.6507 | 0.7284 | **+0.0777** |
-| | MRR | 0.6737 | 0.7297 | **+0.0560** |
-| **Exact** (n=40) | recall@5 | 1.0000 | 1.0000 | +0.0000 |
-| | recall@10 | 1.0000 | 1.0000 | +0.0000 |
-| | nDCG@10 | 0.9908 | 0.9723 | −0.0185 |
-| | MRR | 0.9875 | 0.9625 | −0.0250 |
-| **Paraphrase** (n=40) | recall@5 | 0.3500 | 0.5500 | **+0.2000** |
-| | recall@10 | 0.3750 | 0.7250 | **+0.3500** |
-| | nDCG@10 | 0.3123 | 0.5023 | **+0.1900** |
-| | MRR | 0.2958 | 0.4343 | **+0.1385** |
-| **Multi-hop** (n=39) | recall@5 | 0.6485 | 0.6726 | +0.0241 ¹ |
-| | recall@10 | 0.7111 | 0.7748 | +0.0637 ¹ |
-| | nDCG@10 | 0.6491 | 0.7101 | +0.0610 ¹ |
-| | MRR | 0.7393 | 0.7940 | +0.0547 ¹ |
-
-¹ **Multi-hop deltas are NOT statistically significant.** Paired bootstrap (B=10000): recall@5
-`CI[−0.046,+0.095]` (P(Δ≤0)≈0.25), recall@10 `CI[−0.020,+0.143]` (P≈0.06), MRR `CI[−0.038,+0.143]`
-(P≈0.12); only nDCG@10 is marginal (P≈0.04). Treat the multi-hop stratum as **inconclusive**. The
-overall and paraphrase deltas, by contrast, have CIs well clear of zero (P≤0.003).
-
-### 2.2 Reading the strata
-
-- **Exact — recall held at 1.0, but that is partly circular.** Exact queries were *generated* as "a
-  salient phrase whose top FTS hit is memory X," then X was labelled relevant — so FTS recall@5/@10 = 1.0
-  is substantially **guaranteed by how the stratum was built**, not an independent property. The one
-  genuine signal here is hybrid's small **degradation**: nDCG@10 −0.018 `CI[−0.046,0]`, MRR −0.025
-  `CI[−0.063,0]` — a real, consistent rank demotion from blending one perfect lexical hit with dense
-  near-ties. The proposed exact-match rank bonus (ADR-0005) would recover it, **but that fix is asserted,
-  not measured.** Recall is unaffected; the cost is rank-position only.
-
-- **Paraphrase — the LARGEST WIN, the dense leg's payoff.** recall@10 **+0.350** (0.375 → 0.725),
-  recall@5 **+0.200**, nDCG@10 **+0.190**. This is exactly the low-lexical-overlap stratum embeddings
-  were predicted to fix: lexical FTS finds barely a third of paraphrased answers; adding dense nearly
-  doubles recall@10. **This stratum alone justifies the hybrid.**
-
-- **Multi-hop — INCONCLUSIVE (deltas not significant).** recall@10 +0.064, nDCG@10 +0.061, MRR +0.055,
-  but 3 of 4 CIs cross zero (footnote ¹). So we **cannot claim** a multi-hop win for hybrid. We also
-  cannot attribute anything to the graph: per the §3 caveat, the fusion config structurally barred the
-  graph leg from the top-k, so the multi-hop stratum tests only FTS vs FTS+dense — and that difference is
-  not statistically distinguishable here. Multi-hop is exactly the stratum a *properly tested* concept
-  graph is meant to win, and it remains an open question.
-
----
-
-## 3. Ablation — what each leg contributes
-
-Four configs, everything else fixed:
-
-| Config | Legs | Overall recall@10 | Overall nDCG@10 | Para recall@10 | Multi recall@10 | Exact nDCG@10 |
-|---|---|---:|---:|---:|---:|---:|
-| **A** | FTS + dense + graph (full hybrid) | 0.834 | 0.728 | 0.725 | 0.775 | 0.972 |
-| **B** | FTS + dense (`w_graph=0`) | **0.834** | **0.728** | **0.725** | **0.775** | **0.972** |
-| **C** | dense only | 0.748 | — | — | — | 0.861 |
-| **D** | FTS only (= baseline) | 0.695 | 0.651 | 0.375 | 0.711 | 0.991 |
-
-**A ≡ B to three decimals on every metric — but this is a STRUCTURAL ARTIFACT, not a test of the graph.**
-The fusion restricts the graph leg to ids *outside* the FTS∪dense base set and weights it at 0.35, so a
-graph-only id's maximum RRF score is `0.35/(60+1) ≈ 0.0057`, strictly below the *minimum* score of any id
-from a base leg (`1.0/(60+50) ≈ 0.0091`). Since both base legs return 50 ids, **every base-leg id outranks
-every graph-only id** — a graph-only result can never enter the fused top-k, regardless of corpus, query
-set, or graph quality. A ≡ B was **mathematically guaranteed before any data ran.** The honest reading is
-"the graph cannot affect top-k *under this fusion config*," NOT "the concept graph contributes nothing." (A
-spot check found a genuinely-relevant memory the graph surfaced and both base legs missed at depth 50 —
-fusion discarded it.) **The graph is therefore unevaluated.**
-
-What the ablation *does* validly show is the **FTS-vs-dense decomposition**, which stands:
-- **Dense recovers paraphrase** (C beats D's 0.375 para recall@10 decisively) but is weaker on exact
-  (C exact nDCG 0.861 vs D's 0.991).
-- **Lexical recovers exact** (D exact nDCG 0.991, the best) but collapses on paraphrase.
-- **Fusion (B) gets the best of both** — exact recall stays perfect, paraphrase nearly doubles.
-- *Caveat:* configs B/C/D were not persisted as result JSONs, so these specific numbers are not
-  independently reproducible from committed artifacts (only A = full hybrid and D = FTS are).
-
-**To actually test the graph** (deferred follow-up — [ADR-0004](../adr/0004-phase-the-hybrid-lexical-dense-first-graph-gated.md)):
-put graph candidates in the fused pool (drop the base-set exclusion) and/or sweep `w_graph` upward, on a
-multi-hop slice whose hops are *not* semantically adjacent (so the dense leg can't shortcut them), using
-real typed-relation extraction rather than the prototype's zero-LLM keyword co-occurrence graph.
-
----
-
-## 4. Latency & storage (measured, NON-GATING per ADR-0001)
-
-### 4.1 Latency (per-query `retrieve()`, CPU-only box, no GPU)
-
-| System | p50 | p95 | mean | max |
-|---|---:|---:|---:|---:|
-| FTS (pure SQLite) | 15.7 ms | 27.8 ms | 12.8 ms | 31.9 ms |
-| Hybrid | 229.6 ms | 344.5 ms | 249.3 ms | 640.0 ms |
-
-The hybrid's ~230 ms p50 is **dominated by the local bge-large query embedding** (one CPU forward
-pass). On the production GPU node or via a hosted API this drops ~10× to low tens of ms. The FTS,
-dense-ANN, and RRF-merge costs themselves are negligible. **Latency does not gate adoption** (the
-success metric is quality-first), and the production read path (pgvector HNSW + GPU/hosted query
-embed) is far faster than this prototype's CPU profile.
-
-### 4.2 Storage
-
-| Component | Size | Notes |
-|---|---|---|
-| FTS5 in-memory index | ~8.3 MB | SQLite shadow tables over 5,452 memories |
-| Dense matrix | 22.3 MB | 5,452 × 1024 float32; cached `.npy`, fingerprint-keyed |
-| Concept graph (in-memory) | ~202 MB (Python-object estimate) | 5,452 nodes + 2,095,624 edges, networkx; **not persisted, not shipped** |
-| Total reported index | 232.2 MB | matrix + FTS + graph estimate |
-
-Production maps the dense matrix to pgvector `halfvec(1024)` (~2 KB/row, single-digit MB total for
-the corpus) and — only if the graph is ever adopted — three Postgres node/edge tables. No
-pgvector/docker in the prototype; in-process numpy cosine (faiss unnecessary at N=5452). Storage is
-reported, not gating.
-
----
-
-## 5. Recommendation
-
-**ADOPT — ship lexical + dense fusion (phase 1); defer the concept graph behind a gate.**
-
-Rationale:
-1. **The ADR-0001 quality gate is met.** Hybrid beats FTS on all four overall metrics, with the
-   decisive, statistically-robust win on **paraphrase** (recall@10 +0.350) — precisely the gap embeddings
-   were meant to close — with **no recall regression on exact**. (The multi-hop deltas are *not*
-   statistically significant — §2.2, footnote ¹ — so they are not part of the case for adoption.)
-2. **The gain is the FTS+dense fusion, not the graph.** The ablation shows full-hybrid ≡ FTS+dense.
-   So phase 1 = embeddings (pgvector) fused with the existing FTS via weighted RRF, preserving the
-   importance prior — exactly the [integration design](integration-design.md) §A.
-3. **The concept graph stays GATED — because it was not validly tested, not because it failed**
-   ([ADR-0004](../adr/0004-phase-the-hybrid-lexical-dense-first-graph-gated.md)). The prototype's fusion
-   config structurally barred it from the top-k (§3), so this benchmark says nothing about its value.
-   Deferral is justified by operational cost (LLM extraction + two extra tables + traversal) **plus the
-   remaining uncertainty** — not by evidence of uselessness. Re-open with a valid retest: graph candidates
-   in the fused pool (no base-set exclusion) and/or a swept weight, on non-semantically-adjacent multi-hop
-   queries built from real typed-relation extraction.
-4. **The small exact-stratum nDCG/MRR dip** is the known RRF blending cost on recall-perfect queries;
-   an exact-match rank bonus is a cheap follow-up ([ADR-0005](../adr/0005-rrf-default-cc-challenger.md)
-   records RRF as the default with the bonus as a tunable).
-
-Adopt with the changes already designed: pgvector `halfvec(1024)` + HNSW, weighted-RRF fusion with
-the importance prior preserved, sensitive memories excluded from embedding (ADR-0003), and SQLite-
-only mode unchanged (ADR-0002).
-
----
-
-## 6. Limitations (stated honestly)
-
-1. **Label noise / qrels quality.** Qrels were **LLM-generated** with **lighter hand-verification
-   than the ideal protocol** (no measured Cohen's κ between LLM and human judgments). LLM judges are
-   systematically lenient, and the eval is **119 queries (~40/stratum)** — *below* the ~50/stratum
-   the literature (Voorhees & Buckley) recommends for confident per-stratum ranking, and **no
-   bootstrap CIs or paired significance test** were computed. The overall and paraphrase deltas
-   (+0.14, +0.35 recall@10) are large enough to be robust to plausible label noise; the multi-hop
-   (+0.06) and the exact-stratum dip (~0.02) are within the range where label noise could matter, so
-   treat those as directional, not precise.
-
-2. **Pooling / "holes."** It is not confirmed that qrels pooled the top-k of *all* arms (FTS, dense,
-   hybrid) before judging. If the pool was lexical-biased, the metric is **biased against** the dense
-   and hybrid arms (they retrieve unjudged relevant memories scored as misses) — meaning the true
-   hybrid uplift could be *larger* than reported, not smaller. This does not threaten the "adopt"
-   conclusion but caveats the exact magnitudes.
-
-3. **Snapshot vs pgvector (substrate mismatch).** The prototype used an **in-process numpy** dense
-   index over a static corpus snapshot, **not** the production pgvector HNSW on live CNPG Postgres.
-   Retrieval *quality* transfers (cosine is cosine; HNSW recall at this scale is ~exact), but the
-   production **latency profile, ANN approximation, and filtered-top-k behaviour are unmeasured here**
-   and must be validated post-migration.
-
-4. **Extraction shortcuts AND a structurally-excluded graph leg.** The concept graph was built with
-   **zero LLM calls** — concepts = `tags` ∪ `expanded_keywords` ∪ a regex noun-phrase proxy, edges from
-   keyword co-occurrence — **not** the typed-relation, LLM-extracted graph the production design (§A.5)
-   specifies. More importantly, the fusion config **structurally barred even this cheap graph from the
-   top-k** (§1 corrections, §3), so this run is **not a valid test of the graph at all** — not of the
-   cheap construction, and certainly not of a properly LLM-extracted typed-relation graph. The graph is
-   *gated and unevaluated*, not *killed*.
-
-5. **Embedding model is the prototype default, not the production pick.** Numbers are for
-   **bge-large-en-v1.5** (local). Production should use **Voyage-3.5** (also 1024-d) for
-   non-sensitive memories (ADR-0003); its higher quality ceiling on *our* content is unverified — a
-   cheap, recommended follow-up (re-run the dense leg with Voyage).
-
-6. **`sort_by="relevance"` not the production default.** The benchmark pins `relevance` to isolate
-   retrieval quality; production defaults to `importance`-blended ranking. The design preserves
-   importance as a post-fusion prior, but the *user-visible* ranking under the default sort was not
-   benchmarked.
-
-7. **Single user, dense corpus.** ~5,452 memories from one author are topically adjacent with many
-   near-duplicates, so "the one relevant id" is sometimes fuzzy; graded judgments over a pool mitigate
-   this but it remains a property of the corpus that may not generalize.
-
----
-
-## Appendix — adversarial completeness review
-
-An independent critic reviewed this report after the run (verdict: **usable-with-caveats**). Its findings
-drove the §1 corrections; the full list is recorded here so the review is part of the permanent record.
-
-1. **Graph-null claim is circular, not empirical (most serious).** Proven that the fusion config
-   (graph leg excluded from the base set; `w_graph=0.35`) caps a graph-only id's RRF at `0.35/61≈0.0057`,
-   below any base-leg id's `1.0/110≈0.0091`. `A≡B` was guaranteed before any data. Honest statement: "the
-   graph cannot affect top-k under this fusion config," not "the graph contributes nothing."
-2. **Graph mechanism mis-attributed, and contradicted by the data.** Reconstructing the legs (graph =
-   5,452 nodes / 2,095,624 edges) on a 15-query sample found `rel_only_via_graph=1`: a relevant memory the
-   graph surfaced that was absent from BOTH base legs at depth 50 — dense did NOT already cover it, and
-   fusion discarded it. The "dense already retrieves them" explanation is false in ≥1 observed case.
-3. **Multi-hop wins are not statistically supported — yet were bolded.** Paired bootstrap (B=10000):
-   overall & paraphrase robust (P≤0.003); multi-hop 3/4 CIs cross zero (recall@10 P≈0.06). Multi-hop is the
-   sole rationale for the graph, and it shows no statistically distinguishable hybrid advantage.
-4. **Exact stratum is circular by construction.** All 40 exact queries were generated as "top FTS hit for a
-   salient phrase of ⟨id⟩," so FTS's 1.0 is largely tautological. The only genuine exact signal is hybrid's
-   small nDCG/MRR demotion, whose claimed rank-bonus fix is asserted, never measured.
-5. **qrels are binary and un-pooled, so nDCG is mislabeled and absolute numbers unreliable.** Binary labels
-   make nDCG degenerate to position-discounted recall (no graded info). Un-pooled, author-assigned labels
-   bias absolute recall/nDCG **low** (esp. for dense/hybrid); only the FTS-vs-hybrid **delta** is trustworthy.
-6. **Underpowered and single-corpus.** 119 queries (~40/stratum) is below the cited ~50/stratum standard;
-   65% of queries have exactly one relevant id (recall@5≈recall@10 for most). One author, no second corpus,
-   no inter-annotator agreement: external validity asserted, not demonstrated.
-7. **Headline metric overstates hot-path value.** recall@10 leads everywhere, but the auto-recall hook
-   injects a top-k into the prompt; recall@5 (+0.075) and MRR/first-hit are the decision-relevant metrics.
-   Hybrid's recall@10 edge partly comes from pushing answers into ranks 6–10. The hook's effective `k` is
-   never stated.
-8. **Production substrate and model wholly unmeasured.** Numbers are local exact numpy cosine over
-   bge-large; production is pgvector HNSW (approximate; recall depends on `ef_search`) with a filtered top-k
-   (NULL-embedding sensitive rows — a partial-index interaction that can hurt HNSW recall), using Voyage-3.5
-   (never run). "Quality transfers" and "~10× faster on GPU/hosted" are assumptions, not measurements.
-9. **Ablation configs A/B/C/D are not reproducible from artifacts.** Only `results/{fts,hybrid}.json`
-   (= A and D) are persisted; B/C have no saved JSON, and there's no run-manifest/seed/version capture
-   beyond the embedding-cache fingerprint. The decision-critical `A≡B` and the C/D numbers must be re-run.
-10. **Several deferred fixes are asserted, not tested.** The exact-match rank bonus, the importance
-    post-fusion prior (the benchmark pinned `sort_by="relevance"` without it, so the user-visible production
-    ranking was never measured), and the "CC ran on our set and RRF was chosen" claim (no CC results exist
-    anywhere) are all unverified.
diff --git a/docs/research/integration-design.md b/docs/research/integration-design.md
deleted file mode 100644
index edf89b0..0000000
--- a/docs/research/integration-design.md
+++ /dev/null
@@ -1,292 +0,0 @@
-# Integration design: hybrid recall (production + prototype-as-built)
-
-Status: the design the benchmark validated. This document specifies **(A) the production design**
-for the API/Postgres deployment and **(B) the prototype as actually built** for the benchmark.
-Read the [survey](survey.md) for *why* these choices, the
-[benchmark report](benchmark-report.md) for whether they cleared the gate, and
-[ADR-0001–0003](../adr/) for the fixed constraints.
-
-**Headline architectural decision (recorded in
-[ADR-0004](../adr/0004-phase-the-hybrid-lexical-dense-first-graph-gated.md)):** ship
-**lexical + dense fusion** first (a statistically-robust paraphrase win cleared ADR-0001's gate); the
-**concept graph is deferred behind a gate** because it was **never validly tested** — the prototype's
-fusion config structurally barred it from the top-k (see [benchmark report §1 + §3](benchmark-report.md)),
-so it is *unevaluated*, not disproven. Deferral is on cost + uncertainty grounds. The design below is
-structured around that phasing.
-
----
-
-## A. Production design (API/Postgres deployment)
-
-### A.0 Where it plugs in
-
-The semantic layer targets the **API/Postgres path only** (ADR-0002). The authoritative store is
-the CNPG Postgres behind the FastAPI server (`src/claude_memory/api/app.py`); the local SQLite
-cache stays **lexical (FTS5) only** and degrades gracefully. Recall fires on the **hot path** of
-every prompt (an auto-recall hook before each turn), so the read path must stay within a
-per-prompt latency budget even though latency is non-gating for *adoption* (ADR-0001).
-
-The current production recall (`app.py::recall_memories`, `POST /api/memories/recall`) is a single
-`plainto_tsquery('english')` + `ts_rank(search_vector, query)` ordered by a blend
-`ts_rank*0.7 + importance*0.3` (or `*0.4/*0.6` for `sort_by="importance"`), with an OR-broaden
-fallback gated at `ts_rank > OR_BROADEN_MIN_RANK` when AND-match under-fills. The live schema
-(`migrations/001`) has a generated `search_vector tsvector` (setweight A=content, B=expanded_keywords,
-C=tags, D=category) + a GIN index `idx_memories_search`. **The hybrid design is purely additive
-to this.**
-
-### A.1 Schema delta (additive, one migration)
-
-```sql
--- new Alembic migration (Postgres only; SQLite path unchanged)
-ALTER TABLE memories ADD COLUMN embedding halfvec(1024);   -- NULL for is_sensitive=1
-CREATE INDEX CONCURRENTLY idx_memories_embedding
-  ON memories USING hnsw (embedding halfvec_cosine_ops)
-  WITH (m = 16, ef_construction = 64);
-```
-
-- **1024-d** matches both production (Voyage-3.5) and the prototype (bge-large-en-v1.5), so the
-  column dimension and all fusion code are identical whichever model runs.
-- **halfvec** (fp16) halves index size at ~no recall loss; 1024-d halfvec = 2048 bytes/row →
-  single-digit MB for the whole corpus.
-- The existing `search_vector` + GIN index are **untouched**. Lexical behaviour is unchanged, so
-  NULL-embedding rows (sensitive memories) and SQLite-only mode degrade to exactly today's FTS.
-- `CONCURRENTLY` avoids locking the shared table during backfill.
-- The concept-graph tables (§A.5) ship **only if/when the graph clears its gate** — phase 2.
-
-### A.2 Write path (store / update) — all LLM work here, off the recall hot path
-
-On `memory_store` / `memory_update`, for **non-sensitive** rows (`is_sensitive=0`, hard ADR-0003
-gate):
-
-1. **Embed** `content` (optionally `content + expanded_keywords`) → one `halfvec(1024)` vector,
-   written to the new column. Voyage-3.5 `input_type="document"` for stores; bge-large
-   `encode_document` for sensitive/no-key fallback. `is_sensitive=1` rows get `embedding=NULL` —
-   never embedded, never egressed; they still match via FTS.
-2. **(Phase 2, gated) Extract** concepts/edges for the new memory and incrementally merge into the
-   graph tables (entity resolution via pgvector nearest-neighbour + threshold, LLM tie-break only
-   on ambiguity — Graphiti-style fast-path).
-3. **(Optional, flagged) Curate** — the Mem0-style ADD/UPDATE/DELETE/NOOP loop, run async, never
-   physically deleting (supersede to `[SUPERSEDED]` tombstone). Isolated behind a flag so it never
-   confounds the benchmark.
-
-The existing **background sync engine** already moves rows SQLite↔Postgres in a daemon thread; the
-embedding is just another column it carries (authoritative vector in Postgres). Extraction/curation
-ride the same off-hot-path lane. The synchronous store call must **not** block on embedding/
-extraction if it would delay the response — these run async.
-
-### A.3 Read path (hot path) — three CTEs, RRF fusion, importance prior
-
-Replace the single ts_rank ORDER BY with three top-N legs over the **same `memories` table**, fused
-in the handler:
-
-1. **Lexical leg** — the *existing* query verbatim: `plainto_tsquery('english', $q)` +
-   `ts_rank(search_vector, query)`, with the existing OR-broaden fallback
-   (`OR_BROADEN_MIN_RANK`) kept intact. `rank_lex` = position in this list. (LIMIT ~50.)
-2. **Dense leg** — `ORDER BY embedding <=> $qvec LIMIT 50` using the HNSW index. `rank_dense` =
-   position. Sensitive rows (NULL embedding) never enter this list.
-3. **Graph leg (phase 2, gated, currently disabled)** — seed concept nodes by reusing the
-   lexical+dense match, traverse 1–2 hops via a recursive CTE over the edge table, score reachable
-   memories by hop-decay → `rank_graph`. List allowed to be empty.
-
-**Fuse** in Python: `fused(d) = Σ_{s} w_s / (60 + rank_s(d))`, default `w_lex = w_dense = 1.0`,
-`w_graph ≈ 0.35` (down-weighted per the negative-prior finding — see benchmark). Missing leg ⇒ 0
-contribution, no special-casing.
-
-**Preserve the importance prior** (the current code is *not* pure relevance): apply it as a
-post-fusion multiplier — `final(d) = fused(d) * (0.7 + 0.3*importance)` for `sort_by="relevance"`,
-or use importance as the tie-break for `sort_by="importance"`. Importance is a *prior*, **not** a
-fourth fused list. `sort_by="recency"` stays a pure `ORDER BY created_at`, untouched.
-
-> **Why RRF, not convex combination, as the default:** we fuse three incomparable scales
-> (unbounded ts_rank, bounded cosine, arbitrary graph proximity), one of them sparse/often-empty.
-> RRF is scale-agnostic and treats a missing leg as a clean 0, where CC would force a maintained
-> normalization per signal plus a decision for "absent." RRF also collapses to today's exact
-> lexical ordering when dense/graph are empty (the SQLite degrade path, **same code**). The
-> benchmark ran CC/TM2C2 as a challenger (ADR-0001 is quality-gated); on our set RRF was chosen.
-
-**Single-query alternative (future):** the three legs can be expressed as CTEs + a FULL OUTER JOIN
-on `id` with RRF computed in SQL (Supabase hybrid-search pattern), saving a round-trip. The
-prototype and initial production both fuse in Python for clarity; in-DB fusion is an optimization,
-not a correctness change.
-
-### A.4 SQLite-only graceful degrade
-
-With only the lexical leg present, RRF reduces to ranking by `rank_lex` — **identical ordering to
-today's FTS5 `bm25()`**. Zero behaviour change offline; the *same* fusion code path runs in both
-modes (dense/graph legs simply empty). Satisfies ADR-0002.
-
-### A.5 Concept graph (phase 2 — designed, gated, NOT shipped in v1)
-
-If a future benchmark justifies it (the prototype's did not — §B.4):
-
-```sql
-CREATE TABLE concepts (
-  id            bigserial PRIMARY KEY,
-  canonical_name text NOT NULL,
-  aliases       text[],
-  embedding     halfvec(1024),   -- for canonicalization + query seeding
-  category      text
-);
-CREATE TABLE concept_edges (
-  src_id        bigint REFERENCES concepts(id),
-  dst_id        bigint REFERENCES concepts(id),
-  relation      text NOT NULL,
-  weight        real,
-  valid_from    timestamptz,     -- bi-temporal (Graphiti-style supersede)
-  valid_to      timestamptz,
-  evidence_memory_ids bigint[]
-);
-CREATE TABLE memory_concepts (   -- the "mentions" link
-  memory_id     bigint REFERENCES memories(id),
-  concept_id    bigint REFERENCES concepts(id),
-  relation      text
-);
-```
-
-- **Construction** (backfill): batched open LLM triple-extraction (~10–25 calls for the whole
-  corpus, each memory id-tagged) → global embedding-cluster canonicalization (EDC/KGGEN style) →
-  write the three tables. Off the hot path; `is_sensitive=1` filtered *before* any call.
-- **Incremental** (per new memory): extract its triples, resolve entities against `concepts` via
-  pgvector NN + threshold (LLM tie-break only on ambiguity), set-merge — never re-cluster.
-- **Traversal at recall:** plain recursive SQL CTE (1–2 hops). **No Apache AGE** — our multi-hop is
-  shallow. If a future need for PPR arises, compute it in Python over a cached `scipy.sparse`
-  transition matrix loaded from the edge table (Postgres has no native PPR), rebuilt only on graph
-  mutation.
-- **Bi-temporal edges** realize our "supersede, don't accumulate" rule as a queryable timeline:
-  contradicted edges get `valid_to` set, not deleted.
-
-### A.6 Optional stage-2 cross-encoder (gated separately)
-
-`bge-reranker-v2-m3` on the GPU node over the fused top ~20–30, `sort_by="relevance"` only,
-sensitive rows excluded, with a hard-timeout fallback to fused order. Ship only if it clears both
-the quality bar **and** the p95 hot-path budget. Not in v1.
-
-### A.7 Infrastructure (production deploy)
-
-- **pgvector enablement on CNPG:** the cluster already runs pgvector for Immich, so the legacy
-  custom-operand-image path is in place; `CREATE EXTENSION vector` + the additive migration. Any
-  extension add triggers a rolling restart of the shared cluster — coordinate via presence/GitOps.
-- **All cluster changes via Terraform/Terragrunt** in `infra/stacks/...` (GitOps, never kubectl).
-- **Embedding/extraction compute:** in-cluster **llama-cpp on the GPU node** for sensitive-safe
-  local processing (and the no-key fallback); **Voyage-3.5** (hosted) for the non-sensitive batch
-  (ADR-0003). Sensitive memories are routed locally or left lexical-only — enforced, not
-  best-effort.
-- **PgBouncer:** set `hnsw.ef_search` via `SET LOCAL` inside the recall transaction (transaction
-  pooling).
-- **pgvectorscale/DiskANN deferred** — not needed below ~1–5M vectors.
-
----
-
-## B. Prototype as built (the benchmark harness)
-
-The prototype validates **retrieval quality cheaply, in-process** — *not* pgvector/Postgres
-(standing up CNPG just to benchmark would burn days before knowing if hybrid even beats FTS). It is
-a faithful stand-in: the lexical leg is the *exact* production code path, and the fusion is the same
-weighted RRF the production design specifies.
-
-### B.1 Files (committable code only; data/cache/results gitignored)
-- `benchmarks/retrievers/fts.py` — `FtsRetriever`, the lexical baseline.
-- `benchmarks/retrievers/hybrid.py` — `HybridRetriever`, the three-leg fusion.
-- `benchmarks/retrievers/test_hybrid.py` — 9 model-free tests (synthetic content only).
-- `benchmarks/scripts/run_eval.py`, `benchmarks/harness/` — runner + metrics.
-- Eval data (`benchmarks/data/{corpus,queries,qrels}.jsonl`), embedding cache
-  (`benchmarks/cache/*.npy`), and full results (`benchmarks/results/*.json`) are **gitignored**
-  (verified via `git check-ignore`) — privacy rule: no real memory content committed.
-
-### B.2 Lexical leg = the real product
-
-`hybrid.py` **reuses `retrievers.fts.FtsRetriever` verbatim**, which is itself a faithful
-reimplementation of `src/claude_memory/mcp_server.py::_sqlite_recall` (`sort_by="relevance"`): a
-fresh in-memory FTS5 index over the 5,452-memory corpus with the production virtual-table shape and
-default `unicode61` tokenizer; query handling mirrors production (AND-match first, OR-broaden if
-zero rows; rank by `-bm25()*0.7 + importance*0.3`; LIKE fallback on operational errors). **So the
-hybrid's lexical component *is* the exact production system it must beat — no drift.**
-
-### B.3 Dense leg
-
-- **Model:** `BAAI/bge-large-en-v1.5`, 1024-d, L2-normalized. Passages raw; the query gets the BGE
-  instruction prefix `"Represent this sentence for searching relevant passages: "`. Similarity =
-  cosine via numpy matmul over the normalized matrix (faiss unnecessary at N=5452).
-- **Embeddings:** all 5,452 memories embedded once (one-time ≈ 31.5 min on a CPU-only box),
-  **cached** fingerprint-keyed to `cache/emb_BAAI_bge-large-en-v1.5_<corpusfp>.npy` (+ `.ids.npy`)
-  → reruns skip the embed (cache-hit rebuild ≈ 8.3 s).
-- Hosted Voyage/OpenAI/Cohere paths are implemented and key-gated but were **untriggered** (no key
-  in env) — so the prototype ran the local default, which is also the sensitive-only production
-  fallback. **Production maps this matrix to pgvector `halfvec(1024)`.**
-
-### B.4 Graph leg (built, but structurally excluded from the ranking — UNMEASURED)
-
-- **Construction with ZERO LLM calls** (the tractable shortcut): concepts = union of each memory's
-  `tags` + the already-LLM-generated `expanded_keywords` field + a lightweight regex/stop-word
-  noun-phrase proxy over `content`, normalized + de-pluralized. 37,075 concepts extracted; **19,907
-  kept** after document-frequency pruning (df ∈ [2, 2%·N=109]: df<2 links nothing, df>109 are
-  non-discriminative hubs). Concept cliques → weighted memory–memory edges. Result: **5,452 nodes,
-  2,095,624 edges**, built in ~9 s (in-memory networkx).
-- **Traversal:** 1 hop from the top-10 RRF seeds (capped 25 neighbours/seed), contributing **only ids
-  not already in the base legs** — this exclusion is the bug (next bullet).
-- **Result: the graph was structurally barred from the ranking, so its value is UNMEASURED.** Because
-  graph hits are restricted to ids *outside* the FTS∪dense base set and weighted 0.35, a graph-only id's
-  max RRF (`0.35/61 ≈ 0.0057`) sits below any base-leg id's min (`1.0/110 ≈ 0.0091`) — it can never enter
-  the fused top-k. The "graph ≡ nothing" ablation (§B.6) was therefore **guaranteed by construction, not
-  an empirical finding** (a post-run review found a relevant graph-surfaced memory that fusion discarded).
-
-### B.5 Fusion (the production recipe, exactly)
-
-Weighted RRF, `RRF(d) = Σ_leg w_leg/(60 + rank_leg(d))`, chosen over convex combination because it
-is score-scale-free (no BM25-vs-cosine calibration). Weights `w_fts = 1.0, w_dense = 1.0,
-w_graph = 0.35`. Each leg pulled to depth 50 before fusion, truncated to k.
-
-### B.6 Decision-relevant ablation (this is what informs ADR-0004)
-
-| Config | What | Overall recall@10 | Para recall@10 | Multi recall@10 |
-|---|---|---|---|---|
-| **A** full hybrid (FTS+dense+graph) | the prototype | 0.834 | 0.725 | 0.775 |
-| **B** FTS+dense (w_graph=0) | graph removed | **0.834** | **0.725** | **0.775** |
-| **C** dense-only | | 0.748 | — | — |
-| **D** FTS-only (= baseline) | | 0.695 | 0.375 | 0.711 |
-
-**A and B are identical to three decimals on every metric — but this is a structural artifact (§B.4),
-not a test of the graph.** The valid signal here is the **FTS-vs-dense decomposition**: dense-only (C)
-and FTS-only (D) each lose to the fusion (B) — dense recovers paraphrase, lexical recovers exact, fusion
-gets the best of both. The concept graph itself is **unevaluated** (it could never affect top-k under
-this fusion config). **This still supports phasing — ship lexical+dense (phase 1), the robust measured
-win — but the graph is gated pending a *valid* retest, not because it failed.** (Configs B/C/D were not
-persisted as result JSONs; only A and D are reproducible from committed artifacts.)
-
-### B.7 Prototype → production mapping
-
-| Prototype (in-process) | Production (ADR-0002) |
-|---|---|
-| numpy cosine over normalized matrix | pgvector `halfvec(1024)` + HNSW ANN |
-| `.npy` embedding cache, fingerprint-keyed | `embedding` column on `memories`, synced |
-| in-memory networkx graph (phase 2) | `concepts` / `concept_edges` / `memory_concepts` tables |
-| `FtsRetriever` (FTS5 in-memory) | existing `search_vector` + GIN (`plainto_tsquery`/`ts_rank`) |
-| weighted RRF in Python | same RRF (Python handler, or CTE+FULL OUTER JOIN in SQL) |
-| bge-large local | Voyage-3.5 hosted (non-sensitive) / bge-large local (sensitive, no-key) |
-
-### B.8 Prototype caveats (carried into the report's limitations)
-1. **Graph result is INVALID, not merely "null."** The fusion config barred the graph leg from the
-   top-k by construction (§B.4), so the benchmark did not actually test it. A valid retest must include
-   graph candidates in the fused pool (drop the base-set exclusion) and/or sweep the weight, ideally with
-   a typed-relation graph from real LLM extraction and multi-hop queries whose hops are *not* semantically
-   adjacent.
-2. **Exact-stratum nDCG/MRR dip ~0.018/0.025** vs FTS (recall unaffected) is the standard RRF cost
-   of blending one perfect hit with near-ties; a small exact-match rank bonus could recover it.
-3. **Latency** (p50 ≈ 230 ms) is CPU-bound on the local query embed; non-gating, GPU/hosted ~10×
-   faster. Baseline FTS was p50 ≈ 15.7 ms (pure SQLite).
-4. **No pgvector/Postgres** in the prototype — the production substrate is design-only here; the
-   numbers measure *retrieval quality*, which transfers, not the production latency profile.
-
----
-
-## C. Open questions (for production rollout)
-
-1. **pgvector enablement mechanism** — confirm whether the live CNPG is on the legacy
-   custom-operand-image (likely, since Immich uses pgvector) or the modern image-volume-extensions
-   path; either way the migration is additive, but the enablement DDL/Terraform differs.
-2. **Graph gate** — what evidence would re-open the concept graph? Candidate: a multi-hop eval slice
-   whose hops are *not* semantically adjacent (where dense can't shortcut), built from real
-   LLM-extracted typed relations rather than keyword co-occurrence. Until then, graph stays off.
-3. **Voyage vs bge-large in production** — the benchmark ran bge-large (local). A cheap follow-up:
-   re-run the dense leg with Voyage-3.5 on the non-sensitive corpus to confirm the hosted model's
-   higher quality ceiling holds on *our* content before committing the production default.
diff --git a/docs/research/survey.md b/docs/research/survey.md
deleted file mode 100644
index b6cfa95..0000000
--- a/docs/research/survey.md
+++ /dev/null
@@ -1,523 +0,0 @@
-# Landscape survey: semantic + concept-graph memory for hybrid recall
-
-Status: research input for the ADR-0001 hybrid upgrade. Scope: how the agent-memory and
-graph-RAG literature builds and retrieves over a personal-memory store, which embedding
-model to use, how to fuse lexical + dense + graph signals, and how to evaluate the result.
-
-**Read this with the decisions already fixed in [ADR-0001](../adr/0001-pursue-hybrid-retrieval-embeddings-and-concept-graph.md)
-–[0003](../adr/0003-external-embedding-apis-allowed-for-non-sensitive-memories.md):** we pursue
-hybrid (gated on a benchmark beating FTS), embeddings live in pgvector on the existing CNPG
-Postgres, the concept graph is node/edge tables in Postgres, sensitive memories never egress,
-and adoption is decided **quality-first** (recall@k / nDCG@10 / MRR; latency & storage are
-reported, not gating).
-
-The recurring conclusion below: **borrow the ideas, not the engines.** None of the four
-systems surveyed is a drop-in for our stack, but each contributes a mechanism we re-implement
-natively on Postgres + pgvector.
-
----
-
-## 1. Our workload is the opposite of GraphRAG's design target
-
-Before comparing systems it helps to state what we are *not*. The graph-RAG family was built
-for **global sensemaking** ("what are the themes across this corpus") over a **static document
-collection**. Our workload is the reverse:
-
-| Dimension | GraphRAG target | claude-memory-mcp |
-|---|---|---|
-| Unit | Long documents, chunked | Atomic, already-curated memories (avg ~500 chars) |
-| Corpus dynamics | Static, indexed once | Append-heavy: a few hundred memories/day arriving |
-| Query type | Corpus-wide summarization | Point / multi-hop recall ("what did I decide about X") |
-| Hot path | Offline batch | **Every user prompt** (auto-recall hook fires before each turn) |
-| Scale | 10k–1M+ chunks | ~5k memories today → tens of thousands |
-
-This mismatch is the lens for everything that follows. The expensive part of GraphRAG —
-community detection + hierarchical LLM summaries — is the *wrong retrieval unit* for atomic
-point lookups, and re-summarizing communities on a sustained append stream is its dominant,
-unbounded cost. We want a design whose index-time work is **proportional to new content**, and
-whose retrieval path has **no LLM call** (so it fits the per-prompt budget).
-
----
-
-## 2. The GraphRAG family — Microsoft GraphRAG, LightRAG, nano-graphrag, LazyGraphRAG
-
-All four turn text into an entity–relation knowledge graph via LLM extraction; they differ on
-the expensive part (community detection + hierarchical summarization), which is exactly where
-**incremental cost** lives.
-
-### Microsoft GraphRAG (Edge et al. 2024, arXiv 2404.16130)
-Pipeline: chunk → LLM extracts entities + relationships per chunk (with multi-round
-"gleanings" to catch misses) → summarize duplicate element instances into node/edge
-descriptions → build graph → **Leiden** community detection producing a *hierarchy*
-(levels C0..C3) → an LLM writes a **community report** for every community at every level.
-Two query modes: **global** (map-reduce over all community reports — corpus-wide
-sensemaking) and **local** (start from query-relevant entities, fan out). Indexing is
-LLM-heavy: ~4,000 LLM calls / ~35 min for one textbook; ~$20–40 per 1M tokens with gpt-4o.
-
-**Incremental:** the `graphrag update` command (GraphRAG 1.0) computes deltas and places new
-entities into existing communities "rather than re-running Leiden," re-summarizing only
-changed communities — **but** maintainers warn that once drift crosses a threshold "the worst
-case degrades to the same performance as a normal indexing." A periodic, unpredictable
-full-reindex cliff on a sustained append stream. Parquet/file-pipeline oriented, not
-Postgres-native.
-
-### LightRAG (HKUDS, arXiv 2410.05779, EMNLP 2025)
-Pipeline: chunk → LLM extracts entities + relations → "profiling" generates a key-value text
-summary per node/edge → **deduplicate** merges identical entities/relations across chunks.
-**No community detection.** Retrieval is **dual-level**: the LLM splits the query into
-low-level keywords (specific entities) and high-level keywords (broad themes via relationship
-chains); each set is matched by *vector* similarity against an entity-vector index and a
-relation-vector index, then one-hop neighbours are pulled from the graph. Modes:
-naive / local / global / hybrid / mix (mix = default).
-
-**Incremental (the crux):** a new document goes through the same local indexing to produce a
-small local graph, then is integrated by **set union** of node-sets and edge-sets into the
-existing graph — "eliminating the need to rebuild the entire index graph." No communities ⇒
-**no global re-clustering or re-summarization, ever** ⇒ O(new content) per insert, the only
-genuinely-incremental member. Ships a PostgreSQL all-in-one backend (PGVectorStorage on
-pgvector + PGGraphStorage on Apache AGE + KV + doc-status in one DB, PG ≥16.6).
-
-### nano-graphrag (~1100 LOC)
-A faithful minimal reimplementation of Microsoft GraphRAG and an excellent compact *reference*
-for the exact extraction/community/report prompts. **Hard NO for incremental:** README states
-plainly "each time you insert, the communities of graph will be re-computed and the community
-reports will be re-generated" — O(whole graph) LLM cost per append.
-
-### LazyGraphRAG (Microsoft Research, 2024)
-Defers **all** LLM work to query time: index time uses only NLP noun-phrase extraction + graph
-statistics — "indexing costs are identical to vector RAG and 0.1% of the costs of full
-GraphRAG." Sidesteps the incremental-re-summarization problem entirely by never pre-summarizing
-communities. The **defer-LLM-cost principle** is the one to borrow.
-
-### Verdict for us
-**Adopt none wholesale; steal LightRAG's architecture + LazyGraphRAG's defer-LLM principle.**
-LightRAG is the only one whose incremental model (pure set-union, no re-clustering) structurally
-fits an append-heavy stream, and whose retrieval path (vector + one-hop graph, no query-time
-map-reduce) is hot-path-viable. But adopting LightRAG-the-product is not recommended: its
-Postgres graph path needs the **Apache AGE** extension (not on our CNPG), and that path has
-documented concurrency/entity-merge instability under append-heavy load (asyncpg pool timeouts
-at the merge stage; slow upgrades). Our multi-hop is shallow (1–2 hops), expressible in plain
-recursive SQL CTEs over node/edge tables — no AGE needed.
-
----
-
-## 3. Zep / Graphiti — temporal knowledge graph for agent memory
-
-Zep (arXiv 2501.13956) is an agent-memory service; **Graphiti** is its open engine
-(Neo4j / FalkorDB / Kuzu backend, ~20k stars, MIT). It is the **closest conceptual analog** to
-our hybrid goal — it fuses exactly the three signals ADR-0001 wants.
-
-**Three-tier graph:** episode subgraph (raw ingested data, the provenance layer ≈ our Memory
-rows) → semantic entity subgraph (entity nodes + typed relationship edges, each linking back to
-its source episodes) → community subgraph (clusters with LLM summaries — the GraphRAG "global"
-layer).
-
-**Bi-temporal model:** every semantic edge carries **four** timestamps on two timelines —
-*valid time* (`t_valid`/`t_invalid`: when the fact held true in the world) and *transaction
-time* (`t'_created`/`t'_expired`: when Zep learned/retracted it). Facts are never deleted;
-superseded facts get their validity window closed. This is a principled, queryable version of
-our **"supersede, don't accumulate"** memory discipline.
-
-**Incremental ingestion (per episode):** a *sequence* of LLM calls — entity extraction →
-entity resolution/dedup (embed + cosine + BM25 search against existing nodes, then an LLM
-judges merge vs create) → fact (edge) extraction → fact dedup → temporal extraction (resolve
-"two weeks ago" against a reference time) → edge invalidation (LLM compares each new edge
-against related existing edges; on a temporally-overlapping contradiction it closes the old
-edge). Cost is **heavy on write**, paid back on reads.
-
-**Retrieval (sub-second, NO LLM at query time):** three parallel searches fused, then reranked.
-- `φ_cos`: cosine over embeddings of fact text / entity names / community summaries (BGE-m3, 1024-d).
-- `φ_bm25`: BM25 full-text over the same fields.
-- `φ_bfs`: breadth-first n-hop graph traversal from seed nodes — the genuinely graph-native signal.
-- Rerank: pluggable — **RRF**, MMR (diversity), episode-mentions (frequency), node-distance, or a cross-encoder (most accurate, slowest).
-
-**Published quality (the strongest evidence in this family):** on **LongMemEval**, Zep reports
-**+18.5%** accuracy over a baseline (71.2% vs 60.2% with gpt-4o) *and* ~90% query-latency
-reduction; on **MemGPT DMR**, 94.8% vs 93.4%. These are conversational long-context QA, not
-personal-fact recall@k — so the headline numbers won't transfer directly, but the *fusion
-recipe* is exactly what we benchmark.
-
-### Verdict for us
-**Primary design blueprint for the concept-graph half — but not an adopted dependency.**
-Graphiti has **no pgvector backend**; adopting the engine forces a new Neo4j/FalkorDB graph DB
-into the cluster, conflicting with ADR-0002 and reuse-before-building. We borrow four mechanisms,
-re-implemented on Postgres: (1) the episodic(=Memory rows)/semantic(=new node+edge tables)
-split; (2) the parallel-search + RRF fusion read path; (3) resolution-via-search to dedupe the
-graph using our existing FTS+vector; (4) bi-temporal edge invalidation as the queryable form of
-our supersede discipline. We de-scope the community/summarization tier and the default
-cross-encoder. Two hard caveats: keep the multi-LLM-call extraction **off the hot path**
-(background, like our sync engine), and route extraction through in-cluster llama-cpp / filter
-`is_sensitive` per ADR-0003.
-
----
-
-## 4. Mem0 / Mem0g — extraction-based, LLM-curated memory
-
-Mem0 (arXiv 2504.19413) is a **write-side memory curator**, not a retrieval algorithm — it
-solves a *different axis* than our gated problem, and is **complementary**.
-
-**Two-phase pipeline.** *Extraction:* on each new message pair, an LLM (fed an async
-conversation summary + the last ~10 messages) emits a set of concise "candidate facts."
-*Update (the curation step):* for each candidate fact, retrieve the top ~10 semantically-similar
-existing memories, then a function-calling LLM picks one of four ops — **ADD** (new), **UPDATE**
-(merge richer detail into an existing id, gated on information content), **DELETE** (a
-contradicted memory), **NOOP**. Net effect: the store self-deduplicates, self-merges, and
-self-supersedes instead of accumulating. Two LLM calls per write (extract + decide) + a vector
-search; **async by default** (off the user hot path); the **read/search path is pure vector
-similarity with no LLM**.
-
-**Mem0g (graph variant):** a directed labeled entity graph (Alice –lives_in→ SF) on Neo4j; a
-conflict-detection + LLM update-resolver marks superseded relationships *invalid* rather than
-deleting them.
-
-**Published quality:** on **LOCOMO**, Mem0 J=66.88 / Mem0g 68.44 beats OpenAI memory (52.90),
-A-Mem (48.38), LangMem (58.10), ties Zep (65.99), at ~1/15th the tokens of full-context; Mem0g
-specifically wins temporal reasoning. Reference latencies (gpt-4o-mini): search p95 ≈ 0.20s,
-total p95 ≈ 1.44s, vs full-context ≈ 17s.
-
-### Verdict for us
-**Adopt the curation loop as a separate, flagged subsystem — it does NOT move the ADR-0001
-retrieval metric by itself** (its search is vector-only, no lexical+graph fusion). The
-ADD/UPDATE/DELETE/NOOP loop is the highest-leverage idea Mem0 offers: it automates a discipline
-our own rules already mandate (every correction stored, supersede-don't-accumulate, tombstones)
-but currently leave to manual human effort. It is cheap to build against our existing Memory
-model + `update_memory` endpoint, runs async off the recall hot path, and respects the
-`is_sensitive` boundary. **Hard guardrails required:** never physically DELETE — supersede to a
-`[SUPERSEDED]` tombstone (importance ~0.3, per our convention); log every op; gate behind the
-non-sensitive filter. Keep extraction *optional* (our memories are already atomic, so usually
-only the single UPDATE-decision call is needed). Mem0g's "mark invalid, not delete" and triplet
-schema (source, relation, dest) are reusable ideas, but implemented on pgvector/Postgres, not
-Neo4j. **Critically: isolate curation behind a flag so the benchmark measures retrieval quality
-independently of any curation behaviour change.**
-
----
-
-## 5. HippoRAG / HippoRAG 2 — Personalized PageRank over a concept graph
-
-HippoRAG (NeurIPS 2024, arXiv 2405.14831) and HippoRAG 2 (ICML 2025, arXiv 2502.14802) are the
-**strongest published evidence that a concept graph wins on multi-hop** — precisely the query
-class ADR-0001 says the graph must beat lexical on.
-
-**Mechanism (hippocampal indexing analogy):** LLM = neocortex; retrieval encoder =
-parahippocampal region (detects synonyms); open KG = hippocampal index. *Offline:* an LLM runs
-OpenIE on each passage → a schema-free KG of noun-phrase nodes joined by relation edges; the
-encoder adds **synonym edges** between phrase nodes with cosine > τ=0.8. *Online, per query:*
-(1) ONE LLM call does NER on the query; (2) the encoder links query entities to nearest KG
-nodes = **seed nodes**; (3) each seed weight is scaled by node specificity (`|P_i|⁻¹`, an
-IDF-like rare-phrase boost) and written into the **Personalized PageRank** reset vector;
-(4) PPR runs to convergence (damping 0.5); (5) the phrase-node probability vector scores
-passages. Multi-hop emerges because the random walk reaches passages sharing **no** query tokens
-— in **one** retrieval step instead of iterative retrieve-reason loops.
-
-**HippoRAG 2** makes passages first-class nodes (linked to their phrases by "contains" context
-edges), shifts linking to query→triple + **LLM triple-filtering** ("recognition memory"), and
-seeds *all* passage nodes by embedding similarity (small weight ~0.05) so dense and graph blend
-in one PPR. Net effect: a single PPR fuses lexical-ish phrase matching, dense passage
-similarity, and multi-hop traversal into one ranked list.
-
-**Published quality (passage recall@5):** HippoRAG 2 beats the strongest 7B embedding baseline
-(NV-Embed-v2) on every multi-hop set — 2Wiki **90.4 vs 76.5** (+13.9), MuSiQue **74.7 vs 69.7**
-(+5.0), HotpotQA **96.3 vs 94.5** — and is the only structure-augmented method that *doesn't
-regress* simple QA (NQ 78.0).
-
-### Verdict for us
-**Adopt the idea (PPR spreading activation over our concept graph), not the framework.** Two
-hard adaptations, both fitting our stack:
-1. **Drop the per-query LLM** (v1 NER / v2 triple-filtering) — the only thing that would blow
-   the hot-path budget — and **seed PPR from our existing FTS top-k ∪ pgvector top-k**, weighted
-   by fused score × importance × node-specificity. This turns PPR into the *fusion layer*
-   ADR-0001 wants, with zero added LLM latency.
-2. **Prefer a memory-node graph** (memories as nodes, our typed Relationships as edges) over
-   HippoRAG's phrase explosion (it turns 11.6k passages into ~92k nodes; at our scale that'd be
-   ~43k phrase nodes). Leaner and native to ADR-0002's node/edge tables.
-
-A reproducible PPR latency micro-benchmark on a 5,400-memory graph measured **~2 ms** (memory-node
-graph, transition matrix cached) to **~21 ms** (full phrase graph), ~105 ms even at 3× growth —
-PPR is **not** the bottleneck; the stock recipe's online LLM is (which we remove). Postgres can
-*store* the graph but has no native PPR (pgrouting = shortest-path only), so PPR is computed in
-Python over a cached `scipy.sparse` transition matrix loaded from the node/edge tables, rebuilt
-only on graph mutation. **Caveat for the gate:** our LLM-free seeding variant is *not* validated
-by the papers, and our 5.4k personal corpus is far smaller and less multi-hop-dense than their
-90k-node Wikipedia graphs — so the benchmark must confirm the multi-hop win transfers.
-
----
-
-## 6. Embedding model survey
-
-Our `content` and `expanded_keywords` are **short** prose (capped ~500 chars), so a model's
-max-token limit is effectively a non-constraint — quality, dimensionality, and deploy
-feasibility decide.
-
-### Self-hostable (sentence-transformers on the GPU node, or GGUF via llama-cpp; pgvector stores the vector)
-
-| Model | Dim | Params / VRAM (fp16) | MTEB(en) avg | License |
-|---|---|---|---|---|
-| nomic-embed-text-v1.5 | 768 (Matryoshka 64–768) | 0.1B / <1 GB | 62.28 | Apache-2.0 |
-| bge-base-en-v1.5 | 768 | 109M / ~0.5 GB | ~63.5 | MIT |
-| **bge-large-en-v1.5** | **1024** | 335M / ~1.3 GB | **64.23** | MIT |
-| e5-large-v2 | 1024 | 0.3B / ~1.3 GB | ~62.25 | MIT |
-| bge-m3 | 1024 dense (+sparse +ColBERT) | 568M / ~1–2.4 GB | en ~59–60 (strong multiling/BEIR) | MIT |
-| gte-Qwen2-1.5B-instruct | 1536 | 1.5B / ~3.4 GB | **67.16** (top of set) | Apache-2.0 |
-
-### Hosted (API call, NON-SENSITIVE memories only per ADR-0003)
-
-| Model | Dim | MTEB(en) avg | Price /1M tok | License |
-|---|---|---|---|---|
-| OpenAI text-embedding-3-small | 1536 (Matryoshka→256) | 62.3 | $0.02 | proprietary |
-| OpenAI text-embedding-3-large | 3072 (Matryoshka) | 64.6 | $0.13 | proprietary |
-| **Voyage-3.5** | **1024** (+256/512/2048, int8/binary) | beats OpenAI-3-large ~7.5% on Voyage's eval | $0.06 (first 200M free) | proprietary |
-| Voyage-3.5-lite | 1024 | beats OpenAI-3-large ~2–3.8% | $0.02–0.03 | proprietary |
-| Cohere embed-english-v3.0 | 1024 (native int8/binary) | ~64.5 | ~$0.10 (sales-quoted) | proprietary |
-
-**Implementation notes that matter.** Use **asymmetric** prompting (query vs document):
-sentence-transformers `encode_query`/`encode_document`, always `normalize_embeddings=True` so
-pgvector cosine == dot product. e5-large-v2 *requires* manual `"query: "`/`"passage: "` prefixes
-or quality collapses; bge prepends a query instruction; gte-Qwen2 prepends a task instruction to
-queries only. Pick the dimension **once** — changing it later forces a full re-embed + HNSW
-rebuild.
-
-### Recommendation (one of each, quality-first)
-- **Local: BAAI/bge-large-en-v1.5** (1024-d, MIT) — best quality-per-complexity in the
-  self-hostable set for short English memories: strong retrieval, ~1.3 GB VRAM (runs on CPU at
-  ~100 ms), no `trust_remote_code`, mature ST support. The 512-token cap is irrelevant for our
-  content. (gte-Qwen2-1.5B-instruct is the explicit upgrade candidate if the benchmark says
-  bge-large leaves quality on the table; nomic is the fallback if a long context or sub-768
-  Matryoshka dims are ever wanted.)
-- **Hosted: Voyage-3.5** (1024-d) — highest measured retrieval quality of the hosted options,
-  **same 1024-d as the local pick** so the pgvector column and fusion code are identical whether
-  local or hosted (clean A/B), and our whole corpus embeds inside the free tier. Non-sensitive
-  only; sensitive rows go to bge-large locally. (OpenAI text-embedding-3-small is the pragmatic
-  fallback if no Voyage key.)
-
-> **Prototype note:** the prototype as built used **bge-large-en-v1.5** (1024-d, local default,
-> no API key in env). Production should adopt **Voyage-3.5** (also 1024-d) for non-sensitive
-> memories per ADR-0003, keeping bge-large as the sensitive-only / no-key fallback. Both 1024-d
-> means the pgvector schema and fusion code are unchanged across the choice.
-
----
-
-## 7. Fusion of lexical + dense + graph signals
-
-Three retrieval families produce candidate lists per recall; one fusion function merges them.
-
-### Reciprocal Rank Fusion (RRF) — rank-based, scale-agnostic
-Cormack/Clarke/Büttcher (SIGIR'09): `score(d) = Σ_s 1/(k + rank_s(d))`, summed over every signal
-the doc appears in. `k` is a smoothing constant — their sweep found **k=60 near-optimal but
-uncritical** (≤0.3% MAP swing across [10,100]; k=0 or k=500 costs 3–4%). A doc present in one
-list but absent from another contributes `1/(k+rank)` where it fires and **0** elsewhere — so
-multi-signal agreement is rewarded and single-signal hits are penalized (the hybrid behaviour we
-want). Extends to N lists trivially (just sum), which makes a 3-way fuse a one-liner.
-**Weighted RRF:** `Σ_s w_s/(k + rank_s(d))` — bias a stronger signal, no pre-normalization.
-
-### Weighted score fusion / convex combination (CC) — score-based, needs normalization
-Bruch et al. (arXiv 2210.11934, TOIS 2023): `f = α·φ(semantic) + (1-α)·φ(lexical)` with
-**theoretical** min-max normalization (TM2C2): cosine min = -1, BM25 min = 0, per-query max —
-stable across queries. Findings: **CC/TM2C2 beats RRF on nDCG and recall** in- and out-of-domain
-(RRF ~3.86% lower nDCG@10 in one replication); Weaviate switched its default from rankedFusion to
-relativeScoreFusion (min-max CC) for ~6% recall on FIQA. CC is sample-efficient (α tunes from a
-small labeled set) but requires calibratable scores.
-
-### Folding in graph hits
-Build a graph candidate list, then feed it to the fuser as just another ranked list:
-**seed** (match the query to concept nodes via the same FTS + dense over node labels) →
-**traverse** (1–2 hops to reachable memories) → **score** each reachable memory. Three documented
-scorings: hop-decay `Σ_paths β^hops` (β≈0.5–0.7); Personalized PageRank seeded on matched nodes
-(HippoRAG); or node-degree priority (GraphRAG local search). The
-*Calibrated-Fusion-for-Graph-Vector* paper (arXiv 2603.28886) is explicit: naive graph+vector
-fusion fails on **scale incompatibility**, so convert graph traversal into a probability-like
-normalized score before fusing. Crucial consequence: the graph list is **sparse** (often a
-handful of memories, sometimes zero). Under RRF that's handled automatically; under CC you must
-explicitly treat "absent" as the theoretical min or the missing-modality term silently biases the
-sum.
-
-### Cross-encoder re-rank — a separate stage-2, not a fusion function
-Retrieve top-N each → fuse → take fused top ~20–30 → score each (query, memory) pair jointly with
-a cross-encoder (e.g. bge-reranker-v2-m3) → re-sort. Reported lift +5 to +15 nDCG@10 on
-BEIR/MTEB; cost scales with pair count so it is only ever a small-candidate-set stage.
-
-### Recommendation
-**Weighted RRF over three lists (FTS, dense, graph), k=60, equal weights to start**, with
-importance applied as a deterministic post-fusion prior and a cross-encoder as an optional,
-benchmark-gated stage-2. RRF is the right *default* because we fuse three incompatible scales,
-one of them sparse/often-empty; it is near-parameter-free; and it collapses to exactly today's
-lexical ordering when dense/graph are empty (the SQLite graceful-degrade path). **But because
-adoption is quality-gated, the benchmark must also run CC/TM2C2 as a challenger** — the
-literature is consistent that CC edges RRF on quality when scores are calibratable. (See the
-[benchmark report](benchmark-report.md) for which won on *our* eval set.)
-
----
-
-## 8. Concept-graph construction from memories
-
-Turning flat Memory rows into nodes (concepts/entities) + typed directed edges + memory→concept
-"mentions" edges. Three extraction families:
-
-- **(A) Open LLM triple extraction** (schema-free) — prompt an LLM to emit `[subject, relation,
-  object]` triples. High recall, but relation labels proliferate ("prefers"/"likes"/"favors"), so
-  it **requires** downstream canonicalization. GraphRAG is the canonical implementation
-  (extract + gleaning + cross-chunk entity summarization).
-- **(B) Schema-guided** — constrain to a fixed ontology. Cleaner, but a fixed schema misses
-  surprises in a heterogeneous personal corpus. **EDC** (Zhang & Soh, EMNLP 2024) bridges the two:
-  *extract* open triples → *define* (LLM writes a one-sentence definition per distinct relation) →
-  *canonicalize* (embed definitions, retrieve nearest existing relations, LLM verifies map-vs-add).
-  Two modes: target-alignment (fixed schema) and self-canonicalization (grow schema dynamically).
-- **(C) Entity resolution / canonicalization** (the dedup problem — "Svelte"/"SvelteKit"/"svelte
-  framework" are one node): cluster-then-refine on the *aggregated* graph — embed every surface
-  string, cluster by cosine (HDBSCAN / connected-components over a threshold), optional
-  LLM-as-judge per cluster. KGGEN (arXiv 2502.09956) does iterative LLM-guided clustering;
-  Graphiti uses MinHash+LSH fast-path with LLM fallback. **Cost scales with distinct entities (low
-  thousands), not with memory count.**
-- **(D) Lightweight non-LLM** — spaCy NER + noun-chunks + co-occurrence edges, or **ReLiK**
-  (Sapienza, ACL 2024) for *typed* relations on CPU at up to 40× LLM speed, zero per-doc LLM cost.
-  The natural ablation baseline and sqlite-only fallback.
-
-**The tractable recipe for our corpus.** Measured: 5,452 non-sensitive memories ≈ 683K content
-tokens total — *tiny*. At ~125 content-tokens/memory, ~570 memories pack into one 100K-token
-request, so the **entire corpus extracts in ~10–25 batched LLM calls, not 5,452 sequential
-calls**. Pipeline: (1) batch-extract open triples (each memory tagged with its `memory_id` so
-triples map back), parallelized — LangExtract / KGGEN style; (2) aggregate + canonicalize globally
-*once* (embed distinct entities, cluster, LLM-judge only ambiguous clusters — tens of calls);
-(3) optionally one batched LLM "define relations" pass for EDC-style relation canonicalization.
-Total budget: low tens of calls, minutes of wall-clock, a few dollars hosted or one GPU-node
-llama-cpp session. **Canonicalization quality (the similarity threshold / cluster granularity) is
-where this lives or dies and must be tuned against held-out data, not eyeballed.** Write-time /
-Graphiti-style per-memory extraction is for *incremental updates only* — the wrong tool for the
-one-shot backfill.
-
----
-
-## 9. Vector storage in Postgres (production substrate)
-
-`pgvector` is a **proven capability on our exact CNPG cluster** (Immich already does vector search
-there, and claude-memory-mcp is already a tenant of the shared `pg-cluster-rw.dbaas` behind
-PgBouncer) — zero new infrastructure, reuse-before-building satisfied.
-
-- **HNSW** (recommended default): `USING hnsw (embedding halfvec_cosine_ops) WITH (m=16,
-  ef_construction=64)`; query knob `SET hnsw.ef_search` (default 40). Best speed-recall tradeoff;
-  buildable on an empty table; graph in RAM. **IVFFlat** is rejected (must be built *after* data
-  exists — an empty-table footgun — and has a lower recall ceiling).
-- **halfvec** (fp16, 2 bytes/dim) halves index size at ~no recall loss; indexable ≤4000 dims.
-  768-d halfvec = 1536 bytes/row; at our scale total embedding storage is single-digit MB.
-- **Filtered ANN:** we always filter `deleted_at IS NULL` (often `category`). Post-filtering can
-  under-fill top-k; enable `hnsw.iterative_scan='relaxed_order'`, and **always add a tie-breaker**
-  (`, id`) since approximate indexes give non-deterministic order.
-- **Hybrid in one query:** each retriever is a CTE producing a per-ranker rank; fuse with RRF via
-  FULL OUTER JOIN on memory id — no score calibration needed across the incomparable ts_rank and
-  cosine scales.
-- **pgvectorscale / StreamingDiskANN** (bounded-RAM disk graph, SBQ compression) is **deferred** —
-  Rust/pgrx must be compiled into the operand image, and it only earns its keep above ~1–5M
-  vectors. Our corpus is orders of magnitude below that.
-- **PgBouncer gotcha:** per-query GUCs (`hnsw.ef_search`) must be `SET LOCAL` inside the recall
-  transaction, not session-level, under transaction pooling.
-
-**Not for the prototype** (the prototype uses an in-process numpy index); this is the production
-adoption path *if* the benchmark clears the gate — an additive Alembic migration (one nullable
-`halfvec(1024)` column + HNSW index) plus a Terraform change to the CNPG stack.
-
----
-
-## 10. Evaluation methodology
-
-A retrieval test collection = corpus + query set + **qrels** (relevance judgments). For each
-query, call recall, take the ordered list of returned memory ids, score against qrels — measuring
-the *retriever in isolation*, exactly what ADR-0001's gate needs.
-
-**Metrics (compute all; pick one primary):**
-- **Recall@k** — "did we surface the right memory at all?" *The* hot-path metric (auto-recall
-  injects top-N; if the memory isn't in top-k it can't help). Report @5/@10/@20/@30.
-- **nDCG@k** — graded + position-aware; the best single summary (BEIR standard is nDCG@10).
-  Headline quality number for the gate.
-- **MRR** — only the first hit matters; relevant for the exact-lookup stratum.
-- **MAP** — broad binary recall+precision blend; secondary, stable for significance tests.
-
-**Stratification (the ADR-0001 hypothesis-targeted design):** *exact/lexical* (FTS already wins —
-the **regression guard**); *paraphrase/semantic* (disjoint vocabulary — the value-of-embeddings
-test); *multi-hop* (≥2 memories or a concept link — the graph test).
-
-**Qrels generation (the LongMemEval pipeline, inverted for memories):** sample seed memories
-stratified by category + importance → an LLM generates exact / paraphrase / multi-hop queries →
-label relevant ids, with **pooling** (union the top-k of every arm, TREC-style) and an LLM-judge
-on the **UMBRELA 0–3 scale**. **Separate the generator model from the judge model** to avoid
-self-preference leakage. **Hand-verify** a ≥15–20% sample (oversample multi-hop) and require
-Cohen's κ(LLM, human) ≥ ~0.6 before trusting auto-labels; always hand-author multi-hop
-relevant-id sets.
-
-**Pitfalls with standard mitigations (all baked into the protocol):** LLM judges are
-systematically *lenient* (κ gate); "holes" (new arms retrieve unjudged docs — must pool *all*
-arms before judging, else the gate is rigged against semantic/hybrid); generator-as-judge leakage
-(model separation); too-easy self-generated queries (check paraphrase shares no content tokens);
-adversarial/unanswerable queries have no relevant id and **must be kept out of the ranked metrics**
-(mixing them corrupted the disputed Zep-vs-mem0 LOCOMO comparison — 84%→58%).
-
-**Sizing:** Voorhees & Buckley (2002) — ≥25 topics is the floor, 50 yield reliable rankings, and a
-~5–6% absolute gap at n=50 is needed for 95% confidence the ordering holds on a different query
-set. Since the gate is *per stratum*, each stratum wants its own ~50 queries.
-
-> **Honest note on what we actually built:** our eval set is **119 queries (40 exact / 40
-> paraphrase / 39 multihop)** — just below the ~50/stratum ideal, and qrels were LLM-generated
-> with lighter hand-verification than the full protocol prescribes. This is a real limitation,
-> tracked in the [benchmark report](benchmark-report.md).
-
----
-
-## 11. Synthesis — what we borrow, from whom
-
-| Source | Borrowed mechanism | Re-implemented as | Adopted? |
-|---|---|---|---|
-| LightRAG | Incremental set-union graph merge; dual-level retrieve | Native node/edge tables, no AGE; FTS+dense+graph fuse | Idea only |
-| LazyGraphRAG | Defer LLM cost; index-time work ∝ new content | Store-time extraction off hot path | Principle |
-| Zep / Graphiti | Episodic/semantic split; 3-signal RRF read path; bi-temporal invalidation | Memory rows + Postgres node/edge tables; pgvector+FTS+CTE | **Blueprint** |
-| Mem0 | ADD/UPDATE/DELETE/NOOP write-side curation | Flagged async curator over existing `update_memory` | Complementary, flagged |
-| HippoRAG 2 | PPR spreading activation for multi-hop | LLM-free FTS+vector-seeded PPR over memory-node graph (phase 2, gated) | Idea only |
-| Bruch et al. / Cormack | RRF default + CC/TM2C2 challenger | Weighted RRF k=60, post-fusion importance prior | **Direct** |
-| EDC / KGGEN | Open-extract → define → canonicalize globally | Batched extraction + embedding-cluster canonicalization | **Direct** |
-| pgvector / Supabase | HNSW + halfvec + RRF hybrid in one SQL query | Additive migration to CNPG (production only) | **Production design** |
-| LongMemEval / UMBRELA / Voorhees | Stratified LLM-qrels + pooling + κ gate | Our exact/paraphrase/multi-hop eval | **Direct** |
-
-The through-line: **a memory-node concept graph, dense pgvector embeddings, and the existing
-lexical FTS, fused with weighted RRF, with all LLM work pushed to store time** — sized for an
-append-heavy personal store and gated on a benchmark that beats FTS.
-
----
-
-## Sources
-
-**GraphRAG family**
-- Edge et al., "From Local to Global: A Graph RAG Approach…" (Microsoft, 2024) — arXiv 2404.16130
-- Microsoft GraphRAG incremental-indexing design — github.com/microsoft/graphrag/issues/741; GraphRAG 1.0 blog (microsoft.com/en-us/research/blog/moving-to-graphrag-1-0…)
-- Guo et al., "LightRAG: Simple and Fast RAG" (HKUDS, EMNLP 2025) — arXiv 2410.05779; github.com/HKUDS/LightRAG (incl. issue #2122, PG+AGE concurrency)
-- nano-graphrag — github.com/gusye1234/nano-graphrag
-- LazyGraphRAG — microsoft.com/en-us/research/blog/lazygraphrag-setting-a-new-standard-for-quality-and-cost/
-
-**Temporal KG memory**
-- Rasmussen et al., "Zep: A Temporal Knowledge Graph Architecture for Agent Memory" (2025) — arXiv 2501.13956
-- Graphiti — github.com/getzep/graphiti; Neo4j writeup (neo4j.com/blog/developer/graphiti-knowledge-graph-memory/)
-
-**Extraction-based memory**
-- "Mem0: Building Production-Ready AI Agents…" (2025) — arXiv 2504.19413; github.com/mem0ai/mem0 (configs/prompts.py)
-
-**Graph-PPR retrieval**
-- Gutiérrez et al., "HippoRAG" (NeurIPS 2024) — arXiv 2405.14831
-- "From RAG to Memory" = HippoRAG 2 (ICML 2025) — arXiv 2502.14802; github.com/OSU-NLP-Group/HippoRAG
-
-**Embeddings**
-- bge-large-en-v1.5 — huggingface.co/BAAI/bge-large-en-v1.5; gte-Qwen2-1.5B — huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct; nomic — huggingface.co/nomic-ai/nomic-embed-text-v1.5; bge-m3 — huggingface.co/BAAI/bge-m3; e5-large-v2 — huggingface.co/intfloat/e5-large-v2
-- Voyage-3/3.5 — blog.voyageai.com/2024/09/18/voyage-3/; docs.voyageai.com/docs/pricing
-- OpenAI text-embedding-3 — developers.openai.com/api/docs/guides/embeddings; Cohere embed v3 — docs.cohere.com/docs/cohere-embed
-
-**Fusion**
-- Cormack, Clarke, Büttcher (SIGIR'09) — cormack.uwaterloo.ca/cormacksigir09-rrf.pdf
-- Bruch et al., "An Analysis of Fusion Functions for Hybrid Retrieval" (TOIS 2023) — arXiv 2210.11934
-- Elastic weighted RRF; Weaviate hybrid-search fusion algorithms; "Calibrated Fusion for Heterogeneous Graph-Vector Retrieval" — arXiv 2603.28886
-- bge-reranker — huggingface.co/BAAI/bge-reranker-base
-
-**Concept-graph construction**
-- Zhang & Soh, "Extract-Define-Canonicalize" (EDC, EMNLP 2024) — arXiv 2404.03868; github.com/clear-nus/edc
-- KGGEN — arXiv 2502.09956; ReLiK (ACL 2024) — arXiv 2408.00103; LightKGG — arXiv 2510.23341; Google LangExtract — github.com/google/langextract
-
-**Postgres vector storage**
-- pgvector — github.com/pgvector/pgvector; pgvectorscale — github.com/timescale/pgvectorscale; CNPG image-volume extensions — cloudnative-pg.io/docs/devel/imagevolume_extensions/; Supabase hybrid search — supabase.com/docs/guides/ai/hybrid-search
-- This cluster: `infra/docs/architecture/databases.md` (claude-memory-mcp is a CNPG tenant); this repo: `migrations/versions/001_initial_schema.py`, `src/claude_memory/api/app.py`
-
-**Evaluation**
-- LoCoMo — arXiv 2402.17753; LongMemEval — arXiv 2410.10813; UMBRELA — arXiv 2406.06519; "Judging the Judges" / LLMJudge — arXiv 2502.13908; Voorhees & Buckley "Topic Set Size" (2002); Buckley & Voorhees "Bias and the Limits of Pooling"; BEIR — arXiv 2104.08663
diff --git a/src/claude_memory/local_store.py b/src/claude_memory/local_store.py
deleted file mode 100644
index 09cd3d9..0000000
--- a/src/claude_memory/local_store.py
+++ /dev/null
@@ -1,116 +0,0 @@
-"""Single, process-wide serialized SQLite writer for the local memory cache.
-
-SQLite permits only one writer at a time. The MCP server's store path and the
-background sync engine used to open *separate* connections to the *same* file;
-under heavy concurrent ``memory_store`` calls those two writers fought over the
-single SQLite write lock, blew past ``busy_timeout``, and surfaced
-``sqlite3.OperationalError: database is locked`` — which made the tool slow and
-eventually dropped the session.
-
-``LocalStore`` fixes this structurally: it owns ONE connection (opened with
-``check_same_thread=False``) guarded by ONE re-entrant lock. Every component that
-needs to touch the local DB shares the same ``LocalStore`` instance, so all
-writes serialize cleanly through the in-process lock and queue instead of racing
-the SQLite writer. On the rare residual lock (e.g. another OS process touching
-the file), writes retry with bounded exponential backoff rather than failing the
-caller. WAL stays on for concurrent reads.
-
-Uses only stdlib — no pip install required.
-"""
-
-import logging
-import sqlite3
-import threading
-import time
-from pathlib import Path
-from typing import Any, Callable, TypeVar
-
-logger = logging.getLogger(__name__)
-
-T = TypeVar("T")
-
-# Bounded retry window for the rare residual "database is locked" — handles a
-# lock held by a *different OS process* (the in-process lock already serializes
-# this process's own writers). Total worst-case wait ≈ 0.05+0.1+0.2+0.4+0.8 ≈ 1.55s.
-_MAX_RETRIES = 5
-_BASE_BACKOFF_S = 0.05
-_BUSY_TIMEOUT_MS = 30000
-
-
-def _is_locked_error(exc: sqlite3.OperationalError) -> bool:
-    msg = str(exc).lower()
-    return "database is locked" in msg or "database is busy" in msg
-
-
-class LocalStore:
-    """Owns the single shared SQLite connection + lock for local memory writes."""
-
-    def __init__(self, conn: sqlite3.Connection) -> None:
-        self.conn = conn
-        # Re-entrant so a transaction callback may itself call ``execute``/``write``
-        # without dead-locking on the same thread.
-        self.lock = threading.RLock()
-
-    # ── construction ────────────────────────────────────────────────
-
-    @classmethod
-    def open(cls, db_path: str) -> "LocalStore":
-        """Open (creating parent dirs) a WAL connection safe for cross-thread use."""
-        Path(db_path).parent.mkdir(parents=True, exist_ok=True)
-        conn = sqlite3.connect(db_path, timeout=30.0, check_same_thread=False)
-        conn.row_factory = sqlite3.Row
-        conn.execute("PRAGMA journal_mode=WAL")
-        conn.execute(f"PRAGMA busy_timeout={_BUSY_TIMEOUT_MS}")
-        return cls(conn)
-
-    # ── serialized access ───────────────────────────────────────────
-
-    def transaction(self, fn: Callable[[sqlite3.Connection], T]) -> T:
-        """Run ``fn(conn)`` holding the shared lock, with bounded retry on lock errors.
-
-        ``fn`` is responsible for issuing its own ``COMMIT`` (call ``conn.commit()``)
-        when it performs writes. The whole callback runs under the process-wide lock,
-        so concurrent callers queue rather than collide on the SQLite writer.
-        """
-        last_exc: sqlite3.OperationalError | None = None
-        for attempt in range(_MAX_RETRIES):
-            with self.lock:
-                try:
-                    return fn(self.conn)
-                except sqlite3.OperationalError as exc:
-                    if not _is_locked_error(exc):
-                        raise
-                    last_exc = exc
-                    # Roll back any partial txn so the retry starts clean and the
-                    # connection isn't left mid-transaction holding locks.
-                    try:
-                        self.conn.rollback()
-                    except sqlite3.Error:
-                        pass
-            # Back off *outside* the lock so other writers can make progress.
-            backoff = _BASE_BACKOFF_S * (2 ** attempt)
-            logger.warning(
-                "SQLite locked (attempt %d/%d) — backing off %.3fs", attempt + 1, _MAX_RETRIES, backoff
-            )
-            time.sleep(backoff)
-        assert last_exc is not None
-        raise last_exc
-
-    def execute(self, sql: str, params: tuple[Any, ...] = ()) -> sqlite3.Cursor:
-        """Run a single read query under the shared lock (no implicit commit)."""
-        with self.lock:
-            return self.conn.execute(sql, params)
-
-    def write(self, sql: str, params: tuple[Any, ...] = ()) -> sqlite3.Cursor:
-        """Run a single write statement + commit, serialized and retry-guarded."""
-
-        def _do(conn: sqlite3.Connection) -> sqlite3.Cursor:
-            cur = conn.execute(sql, params)
-            conn.commit()
-            return cur
-
-        return self.transaction(_do)
-
-    def close(self) -> None:
-        with self.lock:
-            self.conn.close()
diff --git a/src/claude_memory/mcp_server.py b/src/claude_memory/mcp_server.py
index 367ed71..a3fa210 100644
--- a/src/claude_memory/mcp_server.py
+++ b/src/claude_memory/mcp_server.py
@@ -17,10 +17,7 @@ import sqlite3
 import sys
 import urllib.error
 import urllib.request
-from typing import TYPE_CHECKING, Any, Callable
-
-if TYPE_CHECKING:
-    from claude_memory.local_store import LocalStore
+from typing import Any
 
 logger = logging.getLogger(__name__)
 
@@ -38,17 +35,9 @@ HYBRID_MODE = bool(API_KEY) and not SYNC_DISABLED
 HTTP_ONLY = bool(API_KEY) and SYNC_DISABLED
 SQLITE_ONLY = not API_KEY
 
-# Default per-request HTTP timeout, and a wider bound for the one genuinely slow path:
-# a remote semantic ``memory_recall`` (embedding/search can be slow to warm up). Both are
-# explicit upper bounds so a call can never hang the MCP server indefinitely.
-DEFAULT_API_TIMEOUT = 15
-RECALL_TIMEOUT = int(os.environ.get("MEMORY_RECALL_TIMEOUT", "30"))
 
-
-def _api_request(
-    method: str, path: str, body: dict[str, Any] | None = None, timeout: int = DEFAULT_API_TIMEOUT
-) -> dict[str, Any]:
-    """Make an HTTP request to the memory API (bounded by ``timeout`` seconds)."""
+def _api_request(method: str, path: str, body: dict[str, Any] | None = None) -> dict[str, Any]:
+    """Make an HTTP request to the memory API."""
     url = f"{API_BASE_URL}{path}"
     data = json.dumps(body).encode() if body else None
     req = urllib.request.Request(
@@ -61,7 +50,7 @@ def _api_request(
         },
     )
     try:
-        with urllib.request.urlopen(req, timeout=timeout) as resp:
+        with urllib.request.urlopen(req, timeout=15) as resp:
             result: dict[str, Any] = json.loads(resp.read().decode())
             return result
     except urllib.error.HTTPError as e:
@@ -139,9 +128,7 @@ def _init_sqlite(db_path: str | None = None) -> tuple[sqlite3.Connection, str]:
     db_path = _get_db_path(db_path)
     Path(os.path.dirname(db_path)).mkdir(parents=True, exist_ok=True)
 
-    # check_same_thread=False: the MCP server may handle requests on different
-    # threads and shares this one connection via LocalStore's lock (see local_store.py).
-    conn = sqlite3.connect(db_path, timeout=30.0, check_same_thread=False)
+    conn = sqlite3.connect(db_path, timeout=30.0)
     conn.row_factory = sqlite3.Row
     conn.execute("PRAGMA journal_mode=WAL")
     conn.execute("PRAGMA busy_timeout=30000")
@@ -403,30 +390,19 @@ class MemoryServer:
 
     def __init__(self, sqlite_db_path: str | None = None) -> None:
         self.sqlite_conn: sqlite3.Connection | None = None
-        self.store: "LocalStore | None" = None  # single serialized writer (see local_store.py)
         self.sync_engine: Any = None
-        # Sink for MCP notifications (e.g. progress). Defaults to writing a JSON-RPC
-        # notification line to stdout; overridable in tests.
-        self._notify: Callable[[str, dict[str, Any]], None] = self._emit_notification
 
         if SQLITE_ONLY or HYBRID_MODE:
-            conn, resolved_path = _init_sqlite(sqlite_db_path)
-            from claude_memory.local_store import LocalStore
-            self.store = LocalStore(conn)
-            self.sqlite_conn = conn
+            self.sqlite_conn, resolved_path = _init_sqlite(sqlite_db_path)
 
             if HYBRID_MODE:
                 from claude_memory.sync import SyncEngine
                 sync_interval = int(os.environ.get("MEMORY_SYNC_INTERVAL", "60"))
-                # Share the SAME LocalStore (one connection, one lock) so the background
-                # sync writer never opens a second connection to the file and never races
-                # the store path on the single SQLite writer.
                 self.sync_engine = SyncEngine(
                     db_path=resolved_path,
                     api_base_url=API_BASE_URL,
                     api_key=API_KEY,
                     sync_interval=sync_interval,
-                    store=self.store,
                 )
                 self.sync_engine.start()
 
@@ -479,59 +455,31 @@ class MemoryServer:
         limit = args.get("limit", 10)
 
         if HTTP_ONLY:
-            return self._recall_remote(args)
+            result = _api_request("POST", "/api/memories/recall", {
+                "context": context,
+                "expanded_query": expanded_query,
+                "category": category,
+                "sort_by": sort_by,
+                "limit": limit,
+            })
+            rows = result.get("memories", [])
+            if not rows:
+                filter_desc = f" in category '{category}'" if category else ""
+                return f"No memories found matching: {context}{filter_desc}"
+
+            sort_desc = "by relevance" if sort_by == "relevance" else "by importance"
+            filter_desc = f" in '{category}'" if category else ""
+            results = []
+            for row in rows:
+                results.append(
+                    f"#{row['id']} [{row['category']}] (importance: {row['importance']:.1f}) {row['content']}"
+                    f"\n  Tags: {row.get('tags') or 'none'} | Stored: {row['created_at']}"
+                )
+            return f"Found {len(rows)} memories{filter_desc} ({sort_desc}):\n\n" + "\n\n".join(results)
 
         # SQLite-only or Hybrid: always read from local cache
         return self._sqlite_recall(context, expanded_query, category, sort_by, limit)
 
-    def _recall_remote(self, args: dict[str, Any]) -> str:
-        """Remote semantic recall — the one genuinely slow path.
-
-        Bounded by ``RECALL_TIMEOUT`` so it can never hang the MCP server. On a timeout
-        or unreachable backend it returns a clear "working / not done — retry" signal
-        rather than raising or blocking silently.
-        """
-        context = args.get("context")
-        expanded_query = args.get("expanded_query", "")
-        category = args.get("category")
-        sort_by = args.get("sort_by", "importance")
-        limit = args.get("limit", 10)
-
-        try:
-            result = _api_request(
-                "POST", "/api/memories/recall",
-                {
-                    "context": context,
-                    "expanded_query": expanded_query,
-                    "category": category,
-                    "sort_by": sort_by,
-                    "limit": limit,
-                },
-                timeout=RECALL_TIMEOUT,
-            )
-        except RuntimeError as e:
-            # _api_request wraps connection errors / timeouts as RuntimeError. Surface a
-            # clear, actionable signal instead of hanging or dumping a stack trace.
-            return (
-                f"Memory recall did not complete within {RECALL_TIMEOUT}s — the semantic "
-                f"search backend is slow or unreachable ({e}). Please try again."
-            )
-
-        rows = result.get("memories", [])
-        if not rows:
-            filter_desc = f" in category '{category}'" if category else ""
-            return f"No memories found matching: {context}{filter_desc}"
-
-        sort_desc = "by relevance" if sort_by == "relevance" else "by importance"
-        filter_desc = f" in '{category}'" if category else ""
-        results = []
-        for row in rows:
-            results.append(
-                f"#{row['id']} [{row['category']}] (importance: {row['importance']:.1f}) {row['content']}"
-                f"\n  Tags: {row.get('tags') or 'none'} | Stored: {row['created_at']}"
-            )
-        return f"Found {len(rows)} memories{filter_desc} ({sort_desc}):\n\n" + "\n\n".join(results)
-
     def memory_list(self, args: dict[str, Any]) -> str:
         category = args.get("category")
         limit = args.get("limit", 20)
@@ -571,11 +519,10 @@ class MemoryServer:
         # SQLite-only or Hybrid: delete from local SQLite
         # In hybrid mode, also try to sync delete to server
         server_id: int | None = None
-        if HYBRID_MODE and self.sync_engine and self.store:
-            with self.store.lock:
-                cursor = self.store.conn.cursor()
-                cursor.execute("SELECT server_id FROM memories WHERE id = ?", (memory_id,))
-                row = cursor.fetchone()
+        if HYBRID_MODE and self.sync_engine and self.sqlite_conn:
+            cursor = self.sqlite_conn.cursor()
+            cursor.execute("SELECT server_id FROM memories WHERE id = ?", (memory_id,))
+            row = cursor.fetchone()
             server_id = row["server_id"] if row and row["server_id"] else None
 
         result_text = self._sqlite_delete(memory_id)
@@ -616,13 +563,12 @@ class MemoryServer:
             lines.append(f"Last sync success: {counts['last_sync_success']}")
             return "\n".join(lines)
 
-        if self.store:
-            with self.store.lock:
-                cursor = self.store.conn.cursor()
-                cursor.execute("SELECT COUNT(*) as c FROM memories")
-                total = cursor.fetchone()["c"]
-                cursor.execute("SELECT category, COUNT(*) as c FROM memories GROUP BY category ORDER BY c DESC")
-                by_cat = cursor.fetchall()
+        if self.sqlite_conn:
+            cursor = self.sqlite_conn.cursor()
+            cursor.execute("SELECT COUNT(*) as c FROM memories")
+            total = cursor.fetchone()["c"]
+            cursor.execute("SELECT category, COUNT(*) as c FROM memories GROUP BY category ORDER BY c DESC")
+            by_cat = cursor.fetchall()
             lines = [f"Local memories (SQLite-only): {total}"]
             for row in by_cat:
                 lines.append(f"  {row['category']}: {row['c']}")
@@ -736,25 +682,19 @@ class MemoryServer:
     def _sqlite_store(self, content: str, category: str, tags: str, importance: float, expanded_keywords: str, force_sensitive: bool = False) -> str:
         from datetime import datetime, timezone
 
-        assert self.store is not None
+        assert self.sqlite_conn is not None
         now = datetime.now(timezone.utc).isoformat()
         is_sensitive = 1 if force_sensitive else 0
-
-        def _insert(conn: sqlite3.Connection) -> int | None:
-            cursor = conn.execute(
-                "INSERT INTO memories (content, category, tags, expanded_keywords, importance, is_sensitive, created_at, updated_at) VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
-                (content, category, tags, expanded_keywords, importance, is_sensitive, now, now),
-            )
-            conn.commit()
-            return cursor.lastrowid
-
-        # Serialized + retry-guarded through the shared LocalStore so concurrent
-        # stores (and the background sync writer) never collide on the SQLite writer.
-        new_id = self.store.transaction(_insert)
-        return f"Stored memory #{new_id} in category '{category}' with importance {importance:.1f}"
+        cursor = self.sqlite_conn.cursor()
+        cursor.execute(
+            "INSERT INTO memories (content, category, tags, expanded_keywords, importance, is_sensitive, created_at, updated_at) VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
+            (content, category, tags, expanded_keywords, importance, is_sensitive, now, now),
+        )
+        self.sqlite_conn.commit()
+        return f"Stored memory #{cursor.lastrowid} in category '{category}' with importance {importance:.1f}"
 
     def _sqlite_recall(self, context: str, expanded_query: str, category: str | None, sort_by: str, limit: int) -> str:
-        assert self.store is not None
+        assert self.sqlite_conn is not None
         all_terms = f"{context} {expanded_query}".strip()
         words = [w.replace(chr(34), "") for w in all_terms.split() if w]
         and_query = " AND ".join(f'"{w}"' for w in words)
@@ -772,38 +712,35 @@ class MemoryServer:
             "SELECT m.id, m.content, m.category, m.tags, m.importance, m.created_at "
             "FROM memories m JOIN memories_fts fts ON m.id = fts.rowid "
         )
+        cursor = self.sqlite_conn.cursor()
         rows: list[Any] = []
-        # Hold the shared lock for the whole read — the connection is shared across
-        # threads with the sync writer, so reads must serialize too.
-        with self.store.lock:
-            cursor = self.store.conn.cursor()
-            try:
-                # Try AND first for precise matches, fall back to OR for broader results
-                cat_filter = "AND m.category = ?" if category else ""
-                for fts_query in (and_query, or_query):
-                    params = [fts_query, category, limit] if category else [fts_query, limit]
-                    cursor.execute(
-                        f"{base_select}WHERE memories_fts MATCH ? {cat_filter} ORDER BY {order} LIMIT ?",
-                        tuple(p for p in params if p is not None),
-                    )
-                    rows = cursor.fetchall()
-                    if rows:
-                        break
-            except sqlite3.OperationalError:
-                like = f"%{context}%"
-                if category:
-                    cursor.execute(
-                        "SELECT id, content, category, tags, importance, created_at FROM memories "
-                        "WHERE (content LIKE ? OR tags LIKE ?) AND category = ? ORDER BY importance DESC LIMIT ?",
-                        (like, like, category, limit),
-                    )
-                else:
-                    cursor.execute(
-                        "SELECT id, content, category, tags, importance, created_at FROM memories "
-                        "WHERE content LIKE ? OR tags LIKE ? ORDER BY importance DESC LIMIT ?",
-                        (like, like, limit),
-                    )
+        try:
+            # Try AND first for precise matches, fall back to OR for broader results
+            cat_filter = "AND m.category = ?" if category else ""
+            for fts_query in (and_query, or_query):
+                params = [fts_query, category, limit] if category else [fts_query, limit]
+                cursor.execute(
+                    f"{base_select}WHERE memories_fts MATCH ? {cat_filter} ORDER BY {order} LIMIT ?",
+                    tuple(p for p in params if p is not None),
+                )
                 rows = cursor.fetchall()
+                if rows:
+                    break
+        except sqlite3.OperationalError:
+            like = f"%{context}%"
+            if category:
+                cursor.execute(
+                    "SELECT id, content, category, tags, importance, created_at FROM memories "
+                    "WHERE (content LIKE ? OR tags LIKE ?) AND category = ? ORDER BY importance DESC LIMIT ?",
+                    (like, like, category, limit),
+                )
+            else:
+                cursor.execute(
+                    "SELECT id, content, category, tags, importance, created_at FROM memories "
+                    "WHERE content LIKE ? OR tags LIKE ? ORDER BY importance DESC LIMIT ?",
+                    (like, like, limit),
+                )
+            rows = cursor.fetchall()
 
         if not rows:
             return f"No memories found matching: {context}"
@@ -820,22 +757,21 @@ class MemoryServer:
         )
 
     def _sqlite_list(self, category: str | None, limit: int) -> str:
-        assert self.store is not None
-        with self.store.lock:
-            cursor = self.store.conn.cursor()
-            if category:
-                cursor.execute(
-                    "SELECT id, content, category, tags, importance, created_at FROM memories "
-                    "WHERE category = ? ORDER BY created_at DESC LIMIT ?",
-                    (category, limit),
-                )
-            else:
-                cursor.execute(
-                    "SELECT id, content, category, tags, importance, created_at FROM memories "
-                    "ORDER BY created_at DESC LIMIT ?",
-                    (limit,),
-                )
-            rows = cursor.fetchall()
+        assert self.sqlite_conn is not None
+        cursor = self.sqlite_conn.cursor()
+        if category:
+            cursor.execute(
+                "SELECT id, content, category, tags, importance, created_at FROM memories "
+                "WHERE category = ? ORDER BY created_at DESC LIMIT ?",
+                (category, limit),
+            )
+        else:
+            cursor.execute(
+                "SELECT id, content, category, tags, importance, created_at FROM memories "
+                "ORDER BY created_at DESC LIMIT ?",
+                (limit,),
+            )
+        rows = cursor.fetchall()
         if not rows:
             return f"No memories in category '{category}'" if category else "No memories stored yet"
 
@@ -849,30 +785,25 @@ class MemoryServer:
         return header + f" ({len(rows)} shown):\n\n" + "\n\n".join(results)
 
     def _sqlite_delete(self, memory_id: int) -> str:
-        assert self.store is not None
-
-        def _delete(conn: sqlite3.Connection) -> str:
-            cursor = conn.execute("SELECT id, content FROM memories WHERE id = ?", (memory_id,))
-            row = cursor.fetchone()
-            if not row:
-                return f"Memory #{memory_id} not found"
-            preview = row["content"][:50]
-            conn.execute("DELETE FROM memories WHERE id = ?", (memory_id,))
-            conn.commit()
-            return f"Deleted memory #{memory_id}: {preview}..."
-
-        # SELECT + DELETE + commit as one serialized, retry-guarded unit.
-        return self.store.transaction(_delete)
+        assert self.sqlite_conn is not None
+        cursor = self.sqlite_conn.cursor()
+        cursor.execute("SELECT id, content FROM memories WHERE id = ?", (memory_id,))
+        row = cursor.fetchone()
+        if not row:
+            return f"Memory #{memory_id} not found"
+        preview = row["content"][:50]
+        cursor.execute("DELETE FROM memories WHERE id = ?", (memory_id,))
+        self.sqlite_conn.commit()
+        return f"Deleted memory #{memory_id}: {preview}..."
 
     def _sqlite_secret_get(self, memory_id: int) -> str:
-        assert self.store is not None
-        with self.store.lock:
-            cursor = self.store.conn.cursor()
-            cursor.execute(
-                "SELECT id, content, category, is_sensitive FROM memories WHERE id = ?",
-                (memory_id,),
-            )
-            row = cursor.fetchone()
+        assert self.sqlite_conn is not None
+        cursor = self.sqlite_conn.cursor()
+        cursor.execute(
+            "SELECT id, content, category, is_sensitive FROM memories WHERE id = ?",
+            (memory_id,),
+        )
+        row = cursor.fetchone()
         if not row:
             return f"Memory #{memory_id} not found"
         if not row["is_sensitive"]:
@@ -894,25 +825,9 @@ class MemoryServer:
             tools.extend(SHARING_TOOLS)
         return {"tools": tools}
 
-    # Tools whose work is genuinely slow enough to warrant a progress signal.
-    _SLOW_TOOLS = frozenset({"memory_recall"})
-
     def handle_tools_call(self, params: dict[str, Any]) -> dict[str, Any]:
         tool_name: str = params.get("name", "")
         arguments: dict[str, Any] = params.get("arguments", {})
-
-        # If the client opted into progress (sent a token) and this is a slow tool, emit a
-        # single "working" progress notification so the call shows life instead of looking hung.
-        progress_token = (params.get("_meta") or {}).get("progressToken")
-        if progress_token is not None and tool_name in self._SLOW_TOOLS:
-            try:
-                self._notify(
-                    "notifications/progress",
-                    {"progressToken": progress_token, "progress": 0, "message": f"Running {tool_name}…"},
-                )
-            except Exception:
-                pass  # never let progress reporting break the actual call
-
         try:
             handler = {
                 "memory_store": self.memory_store,
@@ -936,10 +851,6 @@ class MemoryServer:
         except Exception as e:
             return {"content": [{"type": "text", "text": f"Error: {e!s}"}], "isError": True}
 
-    def _emit_notification(self, method: str, params: dict[str, Any]) -> None:
-        """Default notification sink: write a JSON-RPC notification line to stdout."""
-        print(json.dumps({"jsonrpc": "2.0", "method": method, "params": params}), flush=True)
-
     def process_message(self, message: dict[str, Any]) -> dict[str, Any] | None:
         method = message.get("method")
         params = message.get("params", {})
diff --git a/src/claude_memory/sync.py b/src/claude_memory/sync.py
index 86acda4..b4d1f99 100644
--- a/src/claude_memory/sync.py
+++ b/src/claude_memory/sync.py
@@ -5,14 +5,14 @@ Uses only stdlib — no pip install required.
 
 import json
 import logging
+import sqlite3
 import threading
 import urllib.error
 import urllib.parse
 import urllib.request
 from typing import Any
 from datetime import datetime, timezone
-
-from claude_memory.local_store import LocalStore
+from pathlib import Path
 
 logger = logging.getLogger(__name__)
 
@@ -26,14 +26,7 @@ FULL_RESYNC_EVERY = 10
 class SyncEngine:
     """Background sync between local SQLite cache and remote API."""
 
-    def __init__(
-        self,
-        db_path: str,
-        api_base_url: str,
-        api_key: str,
-        sync_interval: int = 60,
-        store: "LocalStore | None" = None,
-    ):
+    def __init__(self, db_path: str, api_base_url: str, api_key: str, sync_interval: int = 60):
         self.db_path = db_path
         self.api_base_url = api_base_url.rstrip("/")
         self.api_key = api_key
@@ -44,20 +37,13 @@ class SyncEngine:
         self._last_sync_success = False
         self._auth_failed = False
 
-        # Share the caller's LocalStore (one connection, one lock) when given, so the
-        # background sync writer never opens a SECOND connection to the same file and
-        # never races the store path on the single SQLite writer. When run standalone
-        # (e.g. tests), open our own store. Either way, all SQLite access below goes
-        # through self._store / self._conn / self._lock, which now point at one shared
-        # connection guarded by one re-entrant lock.
-        if store is None:
-            self._store = LocalStore.open(db_path)
-            self._owns_store = True
-        else:
-            self._store = store
-            self._owns_store = False
-        self._conn = self._store.conn
-        self._lock = self._store.lock
+        # Own connection for thread safety
+        Path(db_path).parent.mkdir(parents=True, exist_ok=True)
+        self._conn = sqlite3.connect(db_path, timeout=30.0, check_same_thread=False)
+        self._conn.row_factory = sqlite3.Row
+        self._conn.execute("PRAGMA journal_mode=WAL")
+        self._conn.execute("PRAGMA busy_timeout=30000")
+        self._lock = threading.Lock()
 
         self._init_sync_tables()
 
@@ -135,10 +121,7 @@ class SyncEngine:
         self._stop_event.set()
         if self._thread and self._thread.is_alive():
             self._thread.join(timeout=5)
-        # Only close the connection if we opened it; when sharing the server's
-        # LocalStore, the server owns the connection lifecycle.
-        if self._owns_store:
-            self._store.close()
+        self._conn.close()
 
     def _sync_loop(self) -> None:
         """Periodic sync loop running in background thread."""
diff --git a/tests/test_mcp_server.py b/tests/test_mcp_server.py
index e470f5e..584fd69 100644
--- a/tests/test_mcp_server.py
+++ b/tests/test_mcp_server.py
@@ -2,12 +2,7 @@
 
 import json
 import os
-import sqlite3
 import sys
-import threading
-from datetime import datetime, timezone
-from typing import Any
-from unittest.mock import patch
 
 import pytest
 
@@ -413,244 +408,3 @@ class TestSchemaMigration:
         columns = {row["name"] for row in cursor.fetchall()}
         assert "server_id" in columns
         srv.sqlite_conn.close()
-
-
-def _server_rows(server: MemoryServer) -> int:
-    assert server.sqlite_conn is not None
-    return int(server.sqlite_conn.execute("SELECT COUNT(*) AS c FROM memories").fetchone()["c"])
-
-
-class TestConcurrentWrites:
-    """Regression tests for 'database is locked' under heavy concurrent local writes.
-
-    The store path (MemoryServer) and the background sync writer (SyncEngine) must not
-    contend on the single SQLite writer. Before the fix they used two separate connections
-    to the same file; under load the second writer blew past busy_timeout and raised
-    ``sqlite3.OperationalError: database is locked``. After the fix all local writes are
-    serialized through one shared, lock-guarded connection, so a lock error is impossible.
-    """
-
-    def test_concurrent_stores_all_rows_land(self, tmp_path):
-        """Many threads calling memory_store concurrently → every row lands, no lock error."""
-        db_path = str(tmp_path / "concurrent.db")
-        server = MemoryServer(sqlite_db_path=db_path)
-        try:
-            n_threads = 16
-            per_thread = 12
-            errors: list[BaseException] = []
-            barrier = threading.Barrier(n_threads)
-
-            def worker(tid: int) -> None:
-                barrier.wait()  # release all threads at once to maximise contention
-                for i in range(per_thread):
-                    try:
-                        server.memory_store({
-                            "content": f"thread {tid} memory {i}",
-                            "expanded_keywords": f"thread {tid} memory {i} concurrent write",
-                        })
-                    except BaseException as exc:  # noqa: BLE001 — capture everything for the assert
-                        errors.append(exc)
-
-            threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
-            for t in threads:
-                t.start()
-            for t in threads:
-                t.join()
-
-            assert errors == [], f"concurrent stores raised: {errors!r}"
-            assert _server_rows(server) == n_threads * per_thread
-        finally:
-            if server.sqlite_conn:
-                server.sqlite_conn.close()
-
-    def test_concurrent_stores_while_sync_writer_active_no_lock(self, tmp_path):
-        """Store path + background sync writer hammer the SAME file → no 'database is locked'.
-
-        Reproduces the production failure: ``MemoryServer`` and ``SyncEngine`` both write to
-        one SQLite file. We shrink ``busy_timeout`` so the structural race (two writers fighting
-        the single SQLite write lock) surfaces within seconds instead of 30s. The shared-writer
-        fix makes a lock error impossible regardless of busy_timeout.
-        """
-        from claude_memory.sync import SyncEngine
-
-        db_path = str(tmp_path / "hybrid.db")
-        server = MemoryServer(sqlite_db_path=db_path)
-        # The production hybrid wiring: the sync engine SHARES the server's LocalStore
-        # (one connection, one lock) — the structural fix for the cross-connection race.
-        engine = SyncEngine(
-            db_path=db_path,
-            api_base_url="http://fake-api:8080",
-            api_key="test-key",
-            sync_interval=3600,  # never auto-syncs; we drive the writer by hand
-            store=server.store,
-        )
-        assert engine._conn is server.sqlite_conn  # truly one shared connection
-
-        # Shrink the busy timeout so that, were there still two writers, the race would
-        # surface in ms not 30s. With one shared connection a lock error is impossible.
-        server.sqlite_conn.execute("PRAGMA busy_timeout=50")
-
-        now = datetime.now(timezone.utc).isoformat()
-
-        # A write-heavy pull: many rows upserted inside the sync engine's single lock block —
-        # exactly the kind of long-held writer that starved the store path.
-        def big_pull(method: str, path: str, body: Any = None) -> dict[str, Any]:
-            return {
-                "memories": [
-                    {
-                        "id": 10_000 + j,
-                        "content": f"server row {j}",
-                        "category": "facts",
-                        "tags": "",
-                        "expanded_keywords": "",
-                        "importance": 0.5,
-                        "is_sensitive": False,
-                        "created_at": now,
-                        "updated_at": now,
-                        "deleted_at": None,
-                    }
-                    for j in range(120)
-                ],
-                "server_time": now,
-            }
-
-        errors: list[BaseException] = []
-        stop = threading.Event()
-
-        def sync_writer() -> None:
-            with patch.object(engine, "_api_request", side_effect=big_pull):
-                while not stop.is_set():
-                    try:
-                        engine._pull_changes()
-                    except BaseException as exc:  # noqa: BLE001
-                        errors.append(exc)
-
-        n_threads = 12
-        per_thread = 12
-        barrier = threading.Barrier(n_threads)
-
-        def store_worker(tid: int) -> None:
-            barrier.wait()
-            for i in range(per_thread):
-                try:
-                    server.memory_store({
-                        "content": f"local {tid}.{i}",
-                        "expanded_keywords": f"local thread {tid} item {i} keywords here",
-                    })
-                except BaseException as exc:  # noqa: BLE001
-                    errors.append(exc)
-
-        writer = threading.Thread(target=sync_writer, daemon=True)
-        writer.start()
-        final_rows = 0
-        try:
-            store_threads = [threading.Thread(target=store_worker, args=(t,)) for t in range(n_threads)]
-            for t in store_threads:
-                t.start()
-            for t in store_threads:
-                t.join()
-            stop.set()
-            writer.join(timeout=5)
-            final_rows = _server_rows(server)  # read while the connection is still open
-        finally:
-            stop.set()
-            engine.stop()  # shares the server's store → does not close the connection
-            if server.sqlite_conn:
-                server.sqlite_conn.close()
-
-        locked = [e for e in errors if isinstance(e, sqlite3.OperationalError) and "locked" in str(e)]
-        assert locked == [], f"hit 'database is locked' under concurrency: {locked!r}"
-        assert errors == [], f"concurrent writes raised: {errors!r}"
-        # Every local store must have landed.
-        assert final_rows >= n_threads * per_thread
-
-
-class TestRecallProgressAndBounding:
-    """The slow path — a remote semantic ``memory_recall`` — must be bounded and give a
-    notion of progress instead of hanging silently and dropping the session."""
-
-    def test_remote_recall_timeout_returns_clear_signal_not_raise(self, server):
-        """A timed-out / unreachable remote recall returns a clear 'retry' message, never hangs/raises."""
-        import claude_memory.mcp_server as m
-
-        with patch.object(m, "_api_request", side_effect=RuntimeError("API connection error: timed out")):
-            text = server._recall_remote({"context": "x", "expanded_query": "a b c d e"})
-
-        assert "recall" in text.lower()
-        # Mentions the bound and that the caller should retry — a clear working/not-done signal.
-        assert "retry" in text.lower() or "again" in text.lower()
-        assert str(m.RECALL_TIMEOUT) in text
-
-    def test_remote_recall_success_formats_rows(self, server):
-        """A successful remote recall still formats results normally."""
-        import claude_memory.mcp_server as m
-
-        payload = {"memories": [
-            {"id": 7, "category": "facts", "importance": 0.8, "content": "hello",
-             "tags": "t", "created_at": "2026-01-01T00:00:00+00:00"},
-        ]}
-        with patch.object(m, "_api_request", return_value=payload):
-            text = server._recall_remote({"context": "x", "expanded_query": "a b c d e"})
-
-        assert "Found 1 memories" in text
-        assert "hello" in text
-
-    def test_remote_recall_uses_bounded_timeout(self, server):
-        """The remote recall passes the bounded RECALL_TIMEOUT to the HTTP layer."""
-        import claude_memory.mcp_server as m
-
-        with patch.object(m, "_api_request", return_value={"memories": []}) as mock_api:
-            server._recall_remote({"context": "x", "expanded_query": "a b c d e"})
-
-        _, kwargs = mock_api.call_args
-        assert kwargs.get("timeout") == m.RECALL_TIMEOUT
-
-    def test_api_request_accepts_timeout_kwarg(self):
-        """Module-level _api_request must accept an explicit timeout without breaking callers."""
-        import inspect
-        import claude_memory.mcp_server as m
-
-        sig = inspect.signature(m._api_request)
-        assert "timeout" in sig.parameters
-
-    def test_progress_notification_emitted_for_recall_with_token(self, server):
-        """When the client supplies a progressToken, a 'working' progress notification is emitted."""
-        sent: list[dict[str, Any]] = []
-        server._notify = lambda method, params: sent.append({"method": method, "params": params})
-
-        with patch.object(server, "memory_recall", return_value="ok"):
-            server.handle_tools_call({
-                "name": "memory_recall",
-                "arguments": {"context": "x", "expanded_query": "a b c d e"},
-                "_meta": {"progressToken": "tok-1"},
-            })
-
-        progress = [s for s in sent if s["method"] == "notifications/progress"]
-        assert progress, "expected a notifications/progress for a tokened recall"
-        assert progress[0]["params"]["progressToken"] == "tok-1"
-
-    def test_no_progress_notification_without_token(self, server):
-        """No token → no progress notification (don't spam clients that didn't opt in)."""
-        sent: list[dict[str, Any]] = []
-        server._notify = lambda method, params: sent.append({"method": method, "params": params})
-
-        with patch.object(server, "memory_recall", return_value="ok"):
-            server.handle_tools_call({
-                "name": "memory_recall",
-                "arguments": {"context": "x", "expanded_query": "a b c d e"},
-            })
-
-        assert [s for s in sent if s["method"] == "notifications/progress"] == []
-
-    def test_fast_tools_do_not_emit_progress(self, server):
-        """Only the slow recall path signals progress; a store with a token stays quiet."""
-        sent: list[dict[str, Any]] = []
-        server._notify = lambda method, params: sent.append({"method": method, "params": params})
-
-        server.handle_tools_call({
-            "name": "memory_store",
-            "arguments": {"content": "x", "expanded_keywords": "a b c d e"},
-            "_meta": {"progressToken": "tok-2"},
-        })
-
-        assert [s for s in sent if s["method"] == "notifications/progress"] == []