research: benchmark hybrid (lexical+dense+graph) recall vs current FTS
Viktor asked to enhance the memory system with 'semantics' — remember concepts (not just tokens) linked in a graph — and to prove, by benchmarking against the current system, that it actually improves recall. A multi-phase research workflow (18 agents) did landscape research, an adversarially-reviewed integration design, a stratified eval set over the real 5,452-memory corpus, and a head-to-head prototype-vs-current benchmark. Result: hybrid (lexical FTS + dense embeddings, RRF-fused) beats FTS on every overall metric, driven by a robust paraphrase win (recall@10 +0.350). Recommend adopting lexical+dense; the concept graph is DEFERRED. Post-run adversarial review correction (applied to all docs before commit): the prototype's fusion config structurally barred the graph leg from the ranked top-k, so the 'graph contributes nothing' ablation was a math artifact, NOT an empirical result — the graph is UNEVALUATED, not disproven (deferred on cost+uncertainty). Multi-hop deltas are not statistically significant. Glossary in CONTEXT.md; framing in ADR-0001-0003; findings in ADR-0004-0006 + docs/research/. Privacy: the corpus/queries/qrels/results are the user's real memories and stay gitignored (data/, cache/, results/, build_eval_set.py); only harness code, aggregate numbers, and synthetic examples are committed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
parent
7439540f8f
commit
1cc8a2b378
23 changed files with 3428 additions and 0 deletions
292
docs/research/integration-design.md
Normal file
292
docs/research/integration-design.md
Normal file
|
|
@ -0,0 +1,292 @@
|
|||
# Integration design: hybrid recall (production + prototype-as-built)
|
||||
|
||||
Status: the design the benchmark validated. This document specifies **(A) the production design**
|
||||
for the API/Postgres deployment and **(B) the prototype as actually built** for the benchmark.
|
||||
Read the [survey](survey.md) for *why* these choices, the
|
||||
[benchmark report](benchmark-report.md) for whether they cleared the gate, and
|
||||
[ADR-0001–0003](../adr/) for the fixed constraints.
|
||||
|
||||
**Headline architectural decision (recorded in
|
||||
[ADR-0004](../adr/0004-phase-the-hybrid-lexical-dense-first-graph-gated.md)):** ship
|
||||
**lexical + dense fusion** first (a statistically-robust paraphrase win cleared ADR-0001's gate); the
|
||||
**concept graph is deferred behind a gate** because it was **never validly tested** — the prototype's
|
||||
fusion config structurally barred it from the top-k (see [benchmark report §1 + §3](benchmark-report.md)),
|
||||
so it is *unevaluated*, not disproven. Deferral is on cost + uncertainty grounds. The design below is
|
||||
structured around that phasing.
|
||||
|
||||
---
|
||||
|
||||
## A. Production design (API/Postgres deployment)
|
||||
|
||||
### A.0 Where it plugs in
|
||||
|
||||
The semantic layer targets the **API/Postgres path only** (ADR-0002). The authoritative store is
|
||||
the CNPG Postgres behind the FastAPI server (`src/claude_memory/api/app.py`); the local SQLite
|
||||
cache stays **lexical (FTS5) only** and degrades gracefully. Recall fires on the **hot path** of
|
||||
every prompt (an auto-recall hook before each turn), so the read path must stay within a
|
||||
per-prompt latency budget even though latency is non-gating for *adoption* (ADR-0001).
|
||||
|
||||
The current production recall (`app.py::recall_memories`, `POST /api/memories/recall`) is a single
|
||||
`plainto_tsquery('english')` + `ts_rank(search_vector, query)` ordered by a blend
|
||||
`ts_rank*0.7 + importance*0.3` (or `*0.4/*0.6` for `sort_by="importance"`), with an OR-broaden
|
||||
fallback gated at `ts_rank > OR_BROADEN_MIN_RANK` when AND-match under-fills. The live schema
|
||||
(`migrations/001`) has a generated `search_vector tsvector` (setweight A=content, B=expanded_keywords,
|
||||
C=tags, D=category) + a GIN index `idx_memories_search`. **The hybrid design is purely additive
|
||||
to this.**
|
||||
|
||||
### A.1 Schema delta (additive, one migration)
|
||||
|
||||
```sql
|
||||
-- new Alembic migration (Postgres only; SQLite path unchanged)
|
||||
ALTER TABLE memories ADD COLUMN embedding halfvec(1024); -- NULL for is_sensitive=1
|
||||
CREATE INDEX CONCURRENTLY idx_memories_embedding
|
||||
ON memories USING hnsw (embedding halfvec_cosine_ops)
|
||||
WITH (m = 16, ef_construction = 64);
|
||||
```
|
||||
|
||||
- **1024-d** matches both production (Voyage-3.5) and the prototype (bge-large-en-v1.5), so the
|
||||
column dimension and all fusion code are identical whichever model runs.
|
||||
- **halfvec** (fp16) halves index size at ~no recall loss; 1024-d halfvec = 2048 bytes/row →
|
||||
single-digit MB for the whole corpus.
|
||||
- The existing `search_vector` + GIN index are **untouched**. Lexical behaviour is unchanged, so
|
||||
NULL-embedding rows (sensitive memories) and SQLite-only mode degrade to exactly today's FTS.
|
||||
- `CONCURRENTLY` avoids locking the shared table during backfill.
|
||||
- The concept-graph tables (§A.5) ship **only if/when the graph clears its gate** — phase 2.
|
||||
|
||||
### A.2 Write path (store / update) — all LLM work here, off the recall hot path
|
||||
|
||||
On `memory_store` / `memory_update`, for **non-sensitive** rows (`is_sensitive=0`, hard ADR-0003
|
||||
gate):
|
||||
|
||||
1. **Embed** `content` (optionally `content + expanded_keywords`) → one `halfvec(1024)` vector,
|
||||
written to the new column. Voyage-3.5 `input_type="document"` for stores; bge-large
|
||||
`encode_document` for sensitive/no-key fallback. `is_sensitive=1` rows get `embedding=NULL` —
|
||||
never embedded, never egressed; they still match via FTS.
|
||||
2. **(Phase 2, gated) Extract** concepts/edges for the new memory and incrementally merge into the
|
||||
graph tables (entity resolution via pgvector nearest-neighbour + threshold, LLM tie-break only
|
||||
on ambiguity — Graphiti-style fast-path).
|
||||
3. **(Optional, flagged) Curate** — the Mem0-style ADD/UPDATE/DELETE/NOOP loop, run async, never
|
||||
physically deleting (supersede to `[SUPERSEDED]` tombstone). Isolated behind a flag so it never
|
||||
confounds the benchmark.
|
||||
|
||||
The existing **background sync engine** already moves rows SQLite↔Postgres in a daemon thread; the
|
||||
embedding is just another column it carries (authoritative vector in Postgres). Extraction/curation
|
||||
ride the same off-hot-path lane. The synchronous store call must **not** block on embedding/
|
||||
extraction if it would delay the response — these run async.
|
||||
|
||||
### A.3 Read path (hot path) — three CTEs, RRF fusion, importance prior
|
||||
|
||||
Replace the single ts_rank ORDER BY with three top-N legs over the **same `memories` table**, fused
|
||||
in the handler:
|
||||
|
||||
1. **Lexical leg** — the *existing* query verbatim: `plainto_tsquery('english', $q)` +
|
||||
`ts_rank(search_vector, query)`, with the existing OR-broaden fallback
|
||||
(`OR_BROADEN_MIN_RANK`) kept intact. `rank_lex` = position in this list. (LIMIT ~50.)
|
||||
2. **Dense leg** — `ORDER BY embedding <=> $qvec LIMIT 50` using the HNSW index. `rank_dense` =
|
||||
position. Sensitive rows (NULL embedding) never enter this list.
|
||||
3. **Graph leg (phase 2, gated, currently disabled)** — seed concept nodes by reusing the
|
||||
lexical+dense match, traverse 1–2 hops via a recursive CTE over the edge table, score reachable
|
||||
memories by hop-decay → `rank_graph`. List allowed to be empty.
|
||||
|
||||
**Fuse** in Python: `fused(d) = Σ_{s} w_s / (60 + rank_s(d))`, default `w_lex = w_dense = 1.0`,
|
||||
`w_graph ≈ 0.35` (down-weighted per the negative-prior finding — see benchmark). Missing leg ⇒ 0
|
||||
contribution, no special-casing.
|
||||
|
||||
**Preserve the importance prior** (the current code is *not* pure relevance): apply it as a
|
||||
post-fusion multiplier — `final(d) = fused(d) * (0.7 + 0.3*importance)` for `sort_by="relevance"`,
|
||||
or use importance as the tie-break for `sort_by="importance"`. Importance is a *prior*, **not** a
|
||||
fourth fused list. `sort_by="recency"` stays a pure `ORDER BY created_at`, untouched.
|
||||
|
||||
> **Why RRF, not convex combination, as the default:** we fuse three incomparable scales
|
||||
> (unbounded ts_rank, bounded cosine, arbitrary graph proximity), one of them sparse/often-empty.
|
||||
> RRF is scale-agnostic and treats a missing leg as a clean 0, where CC would force a maintained
|
||||
> normalization per signal plus a decision for "absent." RRF also collapses to today's exact
|
||||
> lexical ordering when dense/graph are empty (the SQLite degrade path, **same code**). The
|
||||
> benchmark ran CC/TM2C2 as a challenger (ADR-0001 is quality-gated); on our set RRF was chosen.
|
||||
|
||||
**Single-query alternative (future):** the three legs can be expressed as CTEs + a FULL OUTER JOIN
|
||||
on `id` with RRF computed in SQL (Supabase hybrid-search pattern), saving a round-trip. The
|
||||
prototype and initial production both fuse in Python for clarity; in-DB fusion is an optimization,
|
||||
not a correctness change.
|
||||
|
||||
### A.4 SQLite-only graceful degrade
|
||||
|
||||
With only the lexical leg present, RRF reduces to ranking by `rank_lex` — **identical ordering to
|
||||
today's FTS5 `bm25()`**. Zero behaviour change offline; the *same* fusion code path runs in both
|
||||
modes (dense/graph legs simply empty). Satisfies ADR-0002.
|
||||
|
||||
### A.5 Concept graph (phase 2 — designed, gated, NOT shipped in v1)
|
||||
|
||||
If a future benchmark justifies it (the prototype's did not — §B.4):
|
||||
|
||||
```sql
|
||||
CREATE TABLE concepts (
|
||||
id bigserial PRIMARY KEY,
|
||||
canonical_name text NOT NULL,
|
||||
aliases text[],
|
||||
embedding halfvec(1024), -- for canonicalization + query seeding
|
||||
category text
|
||||
);
|
||||
CREATE TABLE concept_edges (
|
||||
src_id bigint REFERENCES concepts(id),
|
||||
dst_id bigint REFERENCES concepts(id),
|
||||
relation text NOT NULL,
|
||||
weight real,
|
||||
valid_from timestamptz, -- bi-temporal (Graphiti-style supersede)
|
||||
valid_to timestamptz,
|
||||
evidence_memory_ids bigint[]
|
||||
);
|
||||
CREATE TABLE memory_concepts ( -- the "mentions" link
|
||||
memory_id bigint REFERENCES memories(id),
|
||||
concept_id bigint REFERENCES concepts(id),
|
||||
relation text
|
||||
);
|
||||
```
|
||||
|
||||
- **Construction** (backfill): batched open LLM triple-extraction (~10–25 calls for the whole
|
||||
corpus, each memory id-tagged) → global embedding-cluster canonicalization (EDC/KGGEN style) →
|
||||
write the three tables. Off the hot path; `is_sensitive=1` filtered *before* any call.
|
||||
- **Incremental** (per new memory): extract its triples, resolve entities against `concepts` via
|
||||
pgvector NN + threshold (LLM tie-break only on ambiguity), set-merge — never re-cluster.
|
||||
- **Traversal at recall:** plain recursive SQL CTE (1–2 hops). **No Apache AGE** — our multi-hop is
|
||||
shallow. If a future need for PPR arises, compute it in Python over a cached `scipy.sparse`
|
||||
transition matrix loaded from the edge table (Postgres has no native PPR), rebuilt only on graph
|
||||
mutation.
|
||||
- **Bi-temporal edges** realize our "supersede, don't accumulate" rule as a queryable timeline:
|
||||
contradicted edges get `valid_to` set, not deleted.
|
||||
|
||||
### A.6 Optional stage-2 cross-encoder (gated separately)
|
||||
|
||||
`bge-reranker-v2-m3` on the GPU node over the fused top ~20–30, `sort_by="relevance"` only,
|
||||
sensitive rows excluded, with a hard-timeout fallback to fused order. Ship only if it clears both
|
||||
the quality bar **and** the p95 hot-path budget. Not in v1.
|
||||
|
||||
### A.7 Infrastructure (production deploy)
|
||||
|
||||
- **pgvector enablement on CNPG:** the cluster already runs pgvector for Immich, so the legacy
|
||||
custom-operand-image path is in place; `CREATE EXTENSION vector` + the additive migration. Any
|
||||
extension add triggers a rolling restart of the shared cluster — coordinate via presence/GitOps.
|
||||
- **All cluster changes via Terraform/Terragrunt** in `infra/stacks/...` (GitOps, never kubectl).
|
||||
- **Embedding/extraction compute:** in-cluster **llama-cpp on the GPU node** for sensitive-safe
|
||||
local processing (and the no-key fallback); **Voyage-3.5** (hosted) for the non-sensitive batch
|
||||
(ADR-0003). Sensitive memories are routed locally or left lexical-only — enforced, not
|
||||
best-effort.
|
||||
- **PgBouncer:** set `hnsw.ef_search` via `SET LOCAL` inside the recall transaction (transaction
|
||||
pooling).
|
||||
- **pgvectorscale/DiskANN deferred** — not needed below ~1–5M vectors.
|
||||
|
||||
---
|
||||
|
||||
## B. Prototype as built (the benchmark harness)
|
||||
|
||||
The prototype validates **retrieval quality cheaply, in-process** — *not* pgvector/Postgres
|
||||
(standing up CNPG just to benchmark would burn days before knowing if hybrid even beats FTS). It is
|
||||
a faithful stand-in: the lexical leg is the *exact* production code path, and the fusion is the same
|
||||
weighted RRF the production design specifies.
|
||||
|
||||
### B.1 Files (committable code only; data/cache/results gitignored)
|
||||
- `benchmarks/retrievers/fts.py` — `FtsRetriever`, the lexical baseline.
|
||||
- `benchmarks/retrievers/hybrid.py` — `HybridRetriever`, the three-leg fusion.
|
||||
- `benchmarks/retrievers/test_hybrid.py` — 9 model-free tests (synthetic content only).
|
||||
- `benchmarks/scripts/run_eval.py`, `benchmarks/harness/` — runner + metrics.
|
||||
- Eval data (`benchmarks/data/{corpus,queries,qrels}.jsonl`), embedding cache
|
||||
(`benchmarks/cache/*.npy`), and full results (`benchmarks/results/*.json`) are **gitignored**
|
||||
(verified via `git check-ignore`) — privacy rule: no real memory content committed.
|
||||
|
||||
### B.2 Lexical leg = the real product
|
||||
|
||||
`hybrid.py` **reuses `retrievers.fts.FtsRetriever` verbatim**, which is itself a faithful
|
||||
reimplementation of `src/claude_memory/mcp_server.py::_sqlite_recall` (`sort_by="relevance"`): a
|
||||
fresh in-memory FTS5 index over the 5,452-memory corpus with the production virtual-table shape and
|
||||
default `unicode61` tokenizer; query handling mirrors production (AND-match first, OR-broaden if
|
||||
zero rows; rank by `-bm25()*0.7 + importance*0.3`; LIKE fallback on operational errors). **So the
|
||||
hybrid's lexical component *is* the exact production system it must beat — no drift.**
|
||||
|
||||
### B.3 Dense leg
|
||||
|
||||
- **Model:** `BAAI/bge-large-en-v1.5`, 1024-d, L2-normalized. Passages raw; the query gets the BGE
|
||||
instruction prefix `"Represent this sentence for searching relevant passages: "`. Similarity =
|
||||
cosine via numpy matmul over the normalized matrix (faiss unnecessary at N=5452).
|
||||
- **Embeddings:** all 5,452 memories embedded once (one-time ≈ 31.5 min on a CPU-only box),
|
||||
**cached** fingerprint-keyed to `cache/emb_BAAI_bge-large-en-v1.5_<corpusfp>.npy` (+ `.ids.npy`)
|
||||
→ reruns skip the embed (cache-hit rebuild ≈ 8.3 s).
|
||||
- Hosted Voyage/OpenAI/Cohere paths are implemented and key-gated but were **untriggered** (no key
|
||||
in env) — so the prototype ran the local default, which is also the sensitive-only production
|
||||
fallback. **Production maps this matrix to pgvector `halfvec(1024)`.**
|
||||
|
||||
### B.4 Graph leg (built, but structurally excluded from the ranking — UNMEASURED)
|
||||
|
||||
- **Construction with ZERO LLM calls** (the tractable shortcut): concepts = union of each memory's
|
||||
`tags` + the already-LLM-generated `expanded_keywords` field + a lightweight regex/stop-word
|
||||
noun-phrase proxy over `content`, normalized + de-pluralized. 37,075 concepts extracted; **19,907
|
||||
kept** after document-frequency pruning (df ∈ [2, 2%·N=109]: df<2 links nothing, df>109 are
|
||||
non-discriminative hubs). Concept cliques → weighted memory–memory edges. Result: **5,452 nodes,
|
||||
2,095,624 edges**, built in ~9 s (in-memory networkx).
|
||||
- **Traversal:** 1 hop from the top-10 RRF seeds (capped 25 neighbours/seed), contributing **only ids
|
||||
not already in the base legs** — this exclusion is the bug (next bullet).
|
||||
- **Result: the graph was structurally barred from the ranking, so its value is UNMEASURED.** Because
|
||||
graph hits are restricted to ids *outside* the FTS∪dense base set and weighted 0.35, a graph-only id's
|
||||
max RRF (`0.35/61 ≈ 0.0057`) sits below any base-leg id's min (`1.0/110 ≈ 0.0091`) — it can never enter
|
||||
the fused top-k. The "graph ≡ nothing" ablation (§B.6) was therefore **guaranteed by construction, not
|
||||
an empirical finding** (a post-run review found a relevant graph-surfaced memory that fusion discarded).
|
||||
|
||||
### B.5 Fusion (the production recipe, exactly)
|
||||
|
||||
Weighted RRF, `RRF(d) = Σ_leg w_leg/(60 + rank_leg(d))`, chosen over convex combination because it
|
||||
is score-scale-free (no BM25-vs-cosine calibration). Weights `w_fts = 1.0, w_dense = 1.0,
|
||||
w_graph = 0.35`. Each leg pulled to depth 50 before fusion, truncated to k.
|
||||
|
||||
### B.6 Decision-relevant ablation (this is what informs ADR-0004)
|
||||
|
||||
| Config | What | Overall recall@10 | Para recall@10 | Multi recall@10 |
|
||||
|---|---|---|---|---|
|
||||
| **A** full hybrid (FTS+dense+graph) | the prototype | 0.834 | 0.725 | 0.775 |
|
||||
| **B** FTS+dense (w_graph=0) | graph removed | **0.834** | **0.725** | **0.775** |
|
||||
| **C** dense-only | | 0.748 | — | — |
|
||||
| **D** FTS-only (= baseline) | | 0.695 | 0.375 | 0.711 |
|
||||
|
||||
**A and B are identical to three decimals on every metric — but this is a structural artifact (§B.4),
|
||||
not a test of the graph.** The valid signal here is the **FTS-vs-dense decomposition**: dense-only (C)
|
||||
and FTS-only (D) each lose to the fusion (B) — dense recovers paraphrase, lexical recovers exact, fusion
|
||||
gets the best of both. The concept graph itself is **unevaluated** (it could never affect top-k under
|
||||
this fusion config). **This still supports phasing — ship lexical+dense (phase 1), the robust measured
|
||||
win — but the graph is gated pending a *valid* retest, not because it failed.** (Configs B/C/D were not
|
||||
persisted as result JSONs; only A and D are reproducible from committed artifacts.)
|
||||
|
||||
### B.7 Prototype → production mapping
|
||||
|
||||
| Prototype (in-process) | Production (ADR-0002) |
|
||||
|---|---|
|
||||
| numpy cosine over normalized matrix | pgvector `halfvec(1024)` + HNSW ANN |
|
||||
| `.npy` embedding cache, fingerprint-keyed | `embedding` column on `memories`, synced |
|
||||
| in-memory networkx graph (phase 2) | `concepts` / `concept_edges` / `memory_concepts` tables |
|
||||
| `FtsRetriever` (FTS5 in-memory) | existing `search_vector` + GIN (`plainto_tsquery`/`ts_rank`) |
|
||||
| weighted RRF in Python | same RRF (Python handler, or CTE+FULL OUTER JOIN in SQL) |
|
||||
| bge-large local | Voyage-3.5 hosted (non-sensitive) / bge-large local (sensitive, no-key) |
|
||||
|
||||
### B.8 Prototype caveats (carried into the report's limitations)
|
||||
1. **Graph result is INVALID, not merely "null."** The fusion config barred the graph leg from the
|
||||
top-k by construction (§B.4), so the benchmark did not actually test it. A valid retest must include
|
||||
graph candidates in the fused pool (drop the base-set exclusion) and/or sweep the weight, ideally with
|
||||
a typed-relation graph from real LLM extraction and multi-hop queries whose hops are *not* semantically
|
||||
adjacent.
|
||||
2. **Exact-stratum nDCG/MRR dip ~0.018/0.025** vs FTS (recall unaffected) is the standard RRF cost
|
||||
of blending one perfect hit with near-ties; a small exact-match rank bonus could recover it.
|
||||
3. **Latency** (p50 ≈ 230 ms) is CPU-bound on the local query embed; non-gating, GPU/hosted ~10×
|
||||
faster. Baseline FTS was p50 ≈ 15.7 ms (pure SQLite).
|
||||
4. **No pgvector/Postgres** in the prototype — the production substrate is design-only here; the
|
||||
numbers measure *retrieval quality*, which transfers, not the production latency profile.
|
||||
|
||||
---
|
||||
|
||||
## C. Open questions (for production rollout)
|
||||
|
||||
1. **pgvector enablement mechanism** — confirm whether the live CNPG is on the legacy
|
||||
custom-operand-image (likely, since Immich uses pgvector) or the modern image-volume-extensions
|
||||
path; either way the migration is additive, but the enablement DDL/Terraform differs.
|
||||
2. **Graph gate** — what evidence would re-open the concept graph? Candidate: a multi-hop eval slice
|
||||
whose hops are *not* semantically adjacent (where dense can't shortcut), built from real
|
||||
LLM-extracted typed relations rather than keyword co-occurrence. Until then, graph stays off.
|
||||
3. **Voyage vs bge-large in production** — the benchmark ran bge-large (local). A cheap follow-up:
|
||||
re-run the dense leg with Voyage-3.5 on the non-sensitive corpus to confirm the hosted model's
|
||||
higher quality ceiling holds on *our* content before committing the production default.
|
||||
Loading…
Add table
Add a link
Reference in a new issue