claude-memory-mcp/docs/research/survey.md
Viktor Barzin 1cc8a2b378
Some checks failed
Build and Push / lint-and-test (push) Has been cancelled
Build and Push / build (push) Has been cancelled
Build and Push / deploy (push) Has been cancelled
Build and Push / notify-failure (push) Has been cancelled
research: benchmark hybrid (lexical+dense+graph) recall vs current FTS
Viktor asked to enhance the memory system with 'semantics' — remember concepts
(not just tokens) linked in a graph — and to prove, by benchmarking against the
current system, that it actually improves recall. A multi-phase research workflow
(18 agents) did landscape research, an adversarially-reviewed integration design,
a stratified eval set over the real 5,452-memory corpus, and a head-to-head
prototype-vs-current benchmark.

Result: hybrid (lexical FTS + dense embeddings, RRF-fused) beats FTS on every
overall metric, driven by a robust paraphrase win (recall@10 +0.350). Recommend
adopting lexical+dense; the concept graph is DEFERRED.

Post-run adversarial review correction (applied to all docs before commit): the
prototype's fusion config structurally barred the graph leg from the ranked top-k,
so the 'graph contributes nothing' ablation was a math artifact, NOT an empirical
result — the graph is UNEVALUATED, not disproven (deferred on cost+uncertainty).
Multi-hop deltas are not statistically significant. Glossary in CONTEXT.md; framing
in ADR-0001-0003; findings in ADR-0004-0006 + docs/research/.

Privacy: the corpus/queries/qrels/results are the user's real memories and stay
gitignored (data/, cache/, results/, build_eval_set.py); only harness code,
aggregate numbers, and synthetic examples are committed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-25 17:51:53 +00:00

523 lines
35 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Landscape survey: semantic + concept-graph memory for hybrid recall
Status: research input for the ADR-0001 hybrid upgrade. Scope: how the agent-memory and
graph-RAG literature builds and retrieves over a personal-memory store, which embedding
model to use, how to fuse lexical + dense + graph signals, and how to evaluate the result.
**Read this with the decisions already fixed in [ADR-0001](../adr/0001-pursue-hybrid-retrieval-embeddings-and-concept-graph.md)
[0003](../adr/0003-external-embedding-apis-allowed-for-non-sensitive-memories.md):** we pursue
hybrid (gated on a benchmark beating FTS), embeddings live in pgvector on the existing CNPG
Postgres, the concept graph is node/edge tables in Postgres, sensitive memories never egress,
and adoption is decided **quality-first** (recall@k / nDCG@10 / MRR; latency & storage are
reported, not gating).
The recurring conclusion below: **borrow the ideas, not the engines.** None of the four
systems surveyed is a drop-in for our stack, but each contributes a mechanism we re-implement
natively on Postgres + pgvector.
---
## 1. Our workload is the opposite of GraphRAG's design target
Before comparing systems it helps to state what we are *not*. The graph-RAG family was built
for **global sensemaking** ("what are the themes across this corpus") over a **static document
collection**. Our workload is the reverse:
| Dimension | GraphRAG target | claude-memory-mcp |
|---|---|---|
| Unit | Long documents, chunked | Atomic, already-curated memories (avg ~500 chars) |
| Corpus dynamics | Static, indexed once | Append-heavy: a few hundred memories/day arriving |
| Query type | Corpus-wide summarization | Point / multi-hop recall ("what did I decide about X") |
| Hot path | Offline batch | **Every user prompt** (auto-recall hook fires before each turn) |
| Scale | 10k1M+ chunks | ~5k memories today → tens of thousands |
This mismatch is the lens for everything that follows. The expensive part of GraphRAG —
community detection + hierarchical LLM summaries — is the *wrong retrieval unit* for atomic
point lookups, and re-summarizing communities on a sustained append stream is its dominant,
unbounded cost. We want a design whose index-time work is **proportional to new content**, and
whose retrieval path has **no LLM call** (so it fits the per-prompt budget).
---
## 2. The GraphRAG family — Microsoft GraphRAG, LightRAG, nano-graphrag, LazyGraphRAG
All four turn text into an entityrelation knowledge graph via LLM extraction; they differ on
the expensive part (community detection + hierarchical summarization), which is exactly where
**incremental cost** lives.
### Microsoft GraphRAG (Edge et al. 2024, arXiv 2404.16130)
Pipeline: chunk → LLM extracts entities + relationships per chunk (with multi-round
"gleanings" to catch misses) → summarize duplicate element instances into node/edge
descriptions → build graph → **Leiden** community detection producing a *hierarchy*
(levels C0..C3) → an LLM writes a **community report** for every community at every level.
Two query modes: **global** (map-reduce over all community reports — corpus-wide
sensemaking) and **local** (start from query-relevant entities, fan out). Indexing is
LLM-heavy: ~4,000 LLM calls / ~35 min for one textbook; ~$2040 per 1M tokens with gpt-4o.
**Incremental:** the `graphrag update` command (GraphRAG 1.0) computes deltas and places new
entities into existing communities "rather than re-running Leiden," re-summarizing only
changed communities — **but** maintainers warn that once drift crosses a threshold "the worst
case degrades to the same performance as a normal indexing." A periodic, unpredictable
full-reindex cliff on a sustained append stream. Parquet/file-pipeline oriented, not
Postgres-native.
### LightRAG (HKUDS, arXiv 2410.05779, EMNLP 2025)
Pipeline: chunk → LLM extracts entities + relations → "profiling" generates a key-value text
summary per node/edge → **deduplicate** merges identical entities/relations across chunks.
**No community detection.** Retrieval is **dual-level**: the LLM splits the query into
low-level keywords (specific entities) and high-level keywords (broad themes via relationship
chains); each set is matched by *vector* similarity against an entity-vector index and a
relation-vector index, then one-hop neighbours are pulled from the graph. Modes:
naive / local / global / hybrid / mix (mix = default).
**Incremental (the crux):** a new document goes through the same local indexing to produce a
small local graph, then is integrated by **set union** of node-sets and edge-sets into the
existing graph — "eliminating the need to rebuild the entire index graph." No communities ⇒
**no global re-clustering or re-summarization, ever** ⇒ O(new content) per insert, the only
genuinely-incremental member. Ships a PostgreSQL all-in-one backend (PGVectorStorage on
pgvector + PGGraphStorage on Apache AGE + KV + doc-status in one DB, PG ≥16.6).
### nano-graphrag (~1100 LOC)
A faithful minimal reimplementation of Microsoft GraphRAG and an excellent compact *reference*
for the exact extraction/community/report prompts. **Hard NO for incremental:** README states
plainly "each time you insert, the communities of graph will be re-computed and the community
reports will be re-generated" — O(whole graph) LLM cost per append.
### LazyGraphRAG (Microsoft Research, 2024)
Defers **all** LLM work to query time: index time uses only NLP noun-phrase extraction + graph
statistics — "indexing costs are identical to vector RAG and 0.1% of the costs of full
GraphRAG." Sidesteps the incremental-re-summarization problem entirely by never pre-summarizing
communities. The **defer-LLM-cost principle** is the one to borrow.
### Verdict for us
**Adopt none wholesale; steal LightRAG's architecture + LazyGraphRAG's defer-LLM principle.**
LightRAG is the only one whose incremental model (pure set-union, no re-clustering) structurally
fits an append-heavy stream, and whose retrieval path (vector + one-hop graph, no query-time
map-reduce) is hot-path-viable. But adopting LightRAG-the-product is not recommended: its
Postgres graph path needs the **Apache AGE** extension (not on our CNPG), and that path has
documented concurrency/entity-merge instability under append-heavy load (asyncpg pool timeouts
at the merge stage; slow upgrades). Our multi-hop is shallow (12 hops), expressible in plain
recursive SQL CTEs over node/edge tables — no AGE needed.
---
## 3. Zep / Graphiti — temporal knowledge graph for agent memory
Zep (arXiv 2501.13956) is an agent-memory service; **Graphiti** is its open engine
(Neo4j / FalkorDB / Kuzu backend, ~20k stars, MIT). It is the **closest conceptual analog** to
our hybrid goal — it fuses exactly the three signals ADR-0001 wants.
**Three-tier graph:** episode subgraph (raw ingested data, the provenance layer ≈ our Memory
rows) → semantic entity subgraph (entity nodes + typed relationship edges, each linking back to
its source episodes) → community subgraph (clusters with LLM summaries — the GraphRAG "global"
layer).
**Bi-temporal model:** every semantic edge carries **four** timestamps on two timelines —
*valid time* (`t_valid`/`t_invalid`: when the fact held true in the world) and *transaction
time* (`t'_created`/`t'_expired`: when Zep learned/retracted it). Facts are never deleted;
superseded facts get their validity window closed. This is a principled, queryable version of
our **"supersede, don't accumulate"** memory discipline.
**Incremental ingestion (per episode):** a *sequence* of LLM calls — entity extraction →
entity resolution/dedup (embed + cosine + BM25 search against existing nodes, then an LLM
judges merge vs create) → fact (edge) extraction → fact dedup → temporal extraction (resolve
"two weeks ago" against a reference time) → edge invalidation (LLM compares each new edge
against related existing edges; on a temporally-overlapping contradiction it closes the old
edge). Cost is **heavy on write**, paid back on reads.
**Retrieval (sub-second, NO LLM at query time):** three parallel searches fused, then reranked.
- `φ_cos`: cosine over embeddings of fact text / entity names / community summaries (BGE-m3, 1024-d).
- `φ_bm25`: BM25 full-text over the same fields.
- `φ_bfs`: breadth-first n-hop graph traversal from seed nodes — the genuinely graph-native signal.
- Rerank: pluggable — **RRF**, MMR (diversity), episode-mentions (frequency), node-distance, or a cross-encoder (most accurate, slowest).
**Published quality (the strongest evidence in this family):** on **LongMemEval**, Zep reports
**+18.5%** accuracy over a baseline (71.2% vs 60.2% with gpt-4o) *and* ~90% query-latency
reduction; on **MemGPT DMR**, 94.8% vs 93.4%. These are conversational long-context QA, not
personal-fact recall@k — so the headline numbers won't transfer directly, but the *fusion
recipe* is exactly what we benchmark.
### Verdict for us
**Primary design blueprint for the concept-graph half — but not an adopted dependency.**
Graphiti has **no pgvector backend**; adopting the engine forces a new Neo4j/FalkorDB graph DB
into the cluster, conflicting with ADR-0002 and reuse-before-building. We borrow four mechanisms,
re-implemented on Postgres: (1) the episodic(=Memory rows)/semantic(=new node+edge tables)
split; (2) the parallel-search + RRF fusion read path; (3) resolution-via-search to dedupe the
graph using our existing FTS+vector; (4) bi-temporal edge invalidation as the queryable form of
our supersede discipline. We de-scope the community/summarization tier and the default
cross-encoder. Two hard caveats: keep the multi-LLM-call extraction **off the hot path**
(background, like our sync engine), and route extraction through in-cluster llama-cpp / filter
`is_sensitive` per ADR-0003.
---
## 4. Mem0 / Mem0g — extraction-based, LLM-curated memory
Mem0 (arXiv 2504.19413) is a **write-side memory curator**, not a retrieval algorithm — it
solves a *different axis* than our gated problem, and is **complementary**.
**Two-phase pipeline.** *Extraction:* on each new message pair, an LLM (fed an async
conversation summary + the last ~10 messages) emits a set of concise "candidate facts."
*Update (the curation step):* for each candidate fact, retrieve the top ~10 semantically-similar
existing memories, then a function-calling LLM picks one of four ops — **ADD** (new), **UPDATE**
(merge richer detail into an existing id, gated on information content), **DELETE** (a
contradicted memory), **NOOP**. Net effect: the store self-deduplicates, self-merges, and
self-supersedes instead of accumulating. Two LLM calls per write (extract + decide) + a vector
search; **async by default** (off the user hot path); the **read/search path is pure vector
similarity with no LLM**.
**Mem0g (graph variant):** a directed labeled entity graph (Alice lives_in→ SF) on Neo4j; a
conflict-detection + LLM update-resolver marks superseded relationships *invalid* rather than
deleting them.
**Published quality:** on **LOCOMO**, Mem0 J=66.88 / Mem0g 68.44 beats OpenAI memory (52.90),
A-Mem (48.38), LangMem (58.10), ties Zep (65.99), at ~1/15th the tokens of full-context; Mem0g
specifically wins temporal reasoning. Reference latencies (gpt-4o-mini): search p95 ≈ 0.20s,
total p95 ≈ 1.44s, vs full-context ≈ 17s.
### Verdict for us
**Adopt the curation loop as a separate, flagged subsystem — it does NOT move the ADR-0001
retrieval metric by itself** (its search is vector-only, no lexical+graph fusion). The
ADD/UPDATE/DELETE/NOOP loop is the highest-leverage idea Mem0 offers: it automates a discipline
our own rules already mandate (every correction stored, supersede-don't-accumulate, tombstones)
but currently leave to manual human effort. It is cheap to build against our existing Memory
model + `update_memory` endpoint, runs async off the recall hot path, and respects the
`is_sensitive` boundary. **Hard guardrails required:** never physically DELETE — supersede to a
`[SUPERSEDED]` tombstone (importance ~0.3, per our convention); log every op; gate behind the
non-sensitive filter. Keep extraction *optional* (our memories are already atomic, so usually
only the single UPDATE-decision call is needed). Mem0g's "mark invalid, not delete" and triplet
schema (source, relation, dest) are reusable ideas, but implemented on pgvector/Postgres, not
Neo4j. **Critically: isolate curation behind a flag so the benchmark measures retrieval quality
independently of any curation behaviour change.**
---
## 5. HippoRAG / HippoRAG 2 — Personalized PageRank over a concept graph
HippoRAG (NeurIPS 2024, arXiv 2405.14831) and HippoRAG 2 (ICML 2025, arXiv 2502.14802) are the
**strongest published evidence that a concept graph wins on multi-hop** — precisely the query
class ADR-0001 says the graph must beat lexical on.
**Mechanism (hippocampal indexing analogy):** LLM = neocortex; retrieval encoder =
parahippocampal region (detects synonyms); open KG = hippocampal index. *Offline:* an LLM runs
OpenIE on each passage → a schema-free KG of noun-phrase nodes joined by relation edges; the
encoder adds **synonym edges** between phrase nodes with cosine > τ=0.8. *Online, per query:*
(1) ONE LLM call does NER on the query; (2) the encoder links query entities to nearest KG
nodes = **seed nodes**; (3) each seed weight is scaled by node specificity (`|P_i|⁻¹`, an
IDF-like rare-phrase boost) and written into the **Personalized PageRank** reset vector;
(4) PPR runs to convergence (damping 0.5); (5) the phrase-node probability vector scores
passages. Multi-hop emerges because the random walk reaches passages sharing **no** query tokens
— in **one** retrieval step instead of iterative retrieve-reason loops.
**HippoRAG 2** makes passages first-class nodes (linked to their phrases by "contains" context
edges), shifts linking to query→triple + **LLM triple-filtering** ("recognition memory"), and
seeds *all* passage nodes by embedding similarity (small weight ~0.05) so dense and graph blend
in one PPR. Net effect: a single PPR fuses lexical-ish phrase matching, dense passage
similarity, and multi-hop traversal into one ranked list.
**Published quality (passage recall@5):** HippoRAG 2 beats the strongest 7B embedding baseline
(NV-Embed-v2) on every multi-hop set — 2Wiki **90.4 vs 76.5** (+13.9), MuSiQue **74.7 vs 69.7**
(+5.0), HotpotQA **96.3 vs 94.5** — and is the only structure-augmented method that *doesn't
regress* simple QA (NQ 78.0).
### Verdict for us
**Adopt the idea (PPR spreading activation over our concept graph), not the framework.** Two
hard adaptations, both fitting our stack:
1. **Drop the per-query LLM** (v1 NER / v2 triple-filtering) — the only thing that would blow
the hot-path budget — and **seed PPR from our existing FTS top-k pgvector top-k**, weighted
by fused score × importance × node-specificity. This turns PPR into the *fusion layer*
ADR-0001 wants, with zero added LLM latency.
2. **Prefer a memory-node graph** (memories as nodes, our typed Relationships as edges) over
HippoRAG's phrase explosion (it turns 11.6k passages into ~92k nodes; at our scale that'd be
~43k phrase nodes). Leaner and native to ADR-0002's node/edge tables.
A reproducible PPR latency micro-benchmark on a 5,400-memory graph measured **~2 ms** (memory-node
graph, transition matrix cached) to **~21 ms** (full phrase graph), ~105 ms even at 3× growth —
PPR is **not** the bottleneck; the stock recipe's online LLM is (which we remove). Postgres can
*store* the graph but has no native PPR (pgrouting = shortest-path only), so PPR is computed in
Python over a cached `scipy.sparse` transition matrix loaded from the node/edge tables, rebuilt
only on graph mutation. **Caveat for the gate:** our LLM-free seeding variant is *not* validated
by the papers, and our 5.4k personal corpus is far smaller and less multi-hop-dense than their
90k-node Wikipedia graphs — so the benchmark must confirm the multi-hop win transfers.
---
## 6. Embedding model survey
Our `content` and `expanded_keywords` are **short** prose (capped ~500 chars), so a model's
max-token limit is effectively a non-constraint — quality, dimensionality, and deploy
feasibility decide.
### Self-hostable (sentence-transformers on the GPU node, or GGUF via llama-cpp; pgvector stores the vector)
| Model | Dim | Params / VRAM (fp16) | MTEB(en) avg | License |
|---|---|---|---|---|
| nomic-embed-text-v1.5 | 768 (Matryoshka 64768) | 0.1B / <1 GB | 62.28 | Apache-2.0 |
| bge-base-en-v1.5 | 768 | 109M / ~0.5 GB | ~63.5 | MIT |
| **bge-large-en-v1.5** | **1024** | 335M / ~1.3 GB | **64.23** | MIT |
| e5-large-v2 | 1024 | 0.3B / ~1.3 GB | ~62.25 | MIT |
| bge-m3 | 1024 dense (+sparse +ColBERT) | 568M / ~12.4 GB | en ~5960 (strong multiling/BEIR) | MIT |
| gte-Qwen2-1.5B-instruct | 1536 | 1.5B / ~3.4 GB | **67.16** (top of set) | Apache-2.0 |
### Hosted (API call, NON-SENSITIVE memories only per ADR-0003)
| Model | Dim | MTEB(en) avg | Price /1M tok | License |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 (Matryoshka256) | 62.3 | $0.02 | proprietary |
| OpenAI text-embedding-3-large | 3072 (Matryoshka) | 64.6 | $0.13 | proprietary |
| **Voyage-3.5** | **1024** (+256/512/2048, int8/binary) | beats OpenAI-3-large ~7.5% on Voyage's eval | $0.06 (first 200M free) | proprietary |
| Voyage-3.5-lite | 1024 | beats OpenAI-3-large ~23.8% | $0.020.03 | proprietary |
| Cohere embed-english-v3.0 | 1024 (native int8/binary) | ~64.5 | ~$0.10 (sales-quoted) | proprietary |
**Implementation notes that matter.** Use **asymmetric** prompting (query vs document):
sentence-transformers `encode_query`/`encode_document`, always `normalize_embeddings=True` so
pgvector cosine == dot product. e5-large-v2 *requires* manual `"query: "`/`"passage: "` prefixes
or quality collapses; bge prepends a query instruction; gte-Qwen2 prepends a task instruction to
queries only. Pick the dimension **once** changing it later forces a full re-embed + HNSW
rebuild.
### Recommendation (one of each, quality-first)
- **Local: BAAI/bge-large-en-v1.5** (1024-d, MIT) best quality-per-complexity in the
self-hostable set for short English memories: strong retrieval, ~1.3 GB VRAM (runs on CPU at
~100 ms), no `trust_remote_code`, mature ST support. The 512-token cap is irrelevant for our
content. (gte-Qwen2-1.5B-instruct is the explicit upgrade candidate if the benchmark says
bge-large leaves quality on the table; nomic is the fallback if a long context or sub-768
Matryoshka dims are ever wanted.)
- **Hosted: Voyage-3.5** (1024-d) highest measured retrieval quality of the hosted options,
**same 1024-d as the local pick** so the pgvector column and fusion code are identical whether
local or hosted (clean A/B), and our whole corpus embeds inside the free tier. Non-sensitive
only; sensitive rows go to bge-large locally. (OpenAI text-embedding-3-small is the pragmatic
fallback if no Voyage key.)
> **Prototype note:** the prototype as built used **bge-large-en-v1.5** (1024-d, local default,
> no API key in env). Production should adopt **Voyage-3.5** (also 1024-d) for non-sensitive
> memories per ADR-0003, keeping bge-large as the sensitive-only / no-key fallback. Both 1024-d
> means the pgvector schema and fusion code are unchanged across the choice.
---
## 7. Fusion of lexical + dense + graph signals
Three retrieval families produce candidate lists per recall; one fusion function merges them.
### Reciprocal Rank Fusion (RRF) — rank-based, scale-agnostic
Cormack/Clarke/Büttcher (SIGIR'09): `score(d) = Σ_s 1/(k + rank_s(d))`, summed over every signal
the doc appears in. `k` is a smoothing constant their sweep found **k=60 near-optimal but
uncritical** (≤0.3% MAP swing across [10,100]; k=0 or k=500 costs 34%). A doc present in one
list but absent from another contributes `1/(k+rank)` where it fires and **0** elsewhere so
multi-signal agreement is rewarded and single-signal hits are penalized (the hybrid behaviour we
want). Extends to N lists trivially (just sum), which makes a 3-way fuse a one-liner.
**Weighted RRF:** `Σ_s w_s/(k + rank_s(d))` bias a stronger signal, no pre-normalization.
### Weighted score fusion / convex combination (CC) — score-based, needs normalization
Bruch et al. (arXiv 2210.11934, TOIS 2023): `f = α·φ(semantic) + (1-α)·φ(lexical)` with
**theoretical** min-max normalization (TM2C2): cosine min = -1, BM25 min = 0, per-query max
stable across queries. Findings: **CC/TM2C2 beats RRF on nDCG and recall** in- and out-of-domain
(RRF ~3.86% lower nDCG@10 in one replication); Weaviate switched its default from rankedFusion to
relativeScoreFusion (min-max CC) for ~6% recall on FIQA. CC is sample-efficient (α tunes from a
small labeled set) but requires calibratable scores.
### Folding in graph hits
Build a graph candidate list, then feed it to the fuser as just another ranked list:
**seed** (match the query to concept nodes via the same FTS + dense over node labels)
**traverse** (12 hops to reachable memories) **score** each reachable memory. Three documented
scorings: hop-decay `Σ_paths β^hops` (β0.50.7); Personalized PageRank seeded on matched nodes
(HippoRAG); or node-degree priority (GraphRAG local search). The
*Calibrated-Fusion-for-Graph-Vector* paper (arXiv 2603.28886) is explicit: naive graph+vector
fusion fails on **scale incompatibility**, so convert graph traversal into a probability-like
normalized score before fusing. Crucial consequence: the graph list is **sparse** (often a
handful of memories, sometimes zero). Under RRF that's handled automatically; under CC you must
explicitly treat "absent" as the theoretical min or the missing-modality term silently biases the
sum.
### Cross-encoder re-rank — a separate stage-2, not a fusion function
Retrieve top-N each fuse take fused top ~2030 score each (query, memory) pair jointly with
a cross-encoder (e.g. bge-reranker-v2-m3) re-sort. Reported lift +5 to +15 nDCG@10 on
BEIR/MTEB; cost scales with pair count so it is only ever a small-candidate-set stage.
### Recommendation
**Weighted RRF over three lists (FTS, dense, graph), k=60, equal weights to start**, with
importance applied as a deterministic post-fusion prior and a cross-encoder as an optional,
benchmark-gated stage-2. RRF is the right *default* because we fuse three incompatible scales,
one of them sparse/often-empty; it is near-parameter-free; and it collapses to exactly today's
lexical ordering when dense/graph are empty (the SQLite graceful-degrade path). **But because
adoption is quality-gated, the benchmark must also run CC/TM2C2 as a challenger** the
literature is consistent that CC edges RRF on quality when scores are calibratable. (See the
[benchmark report](benchmark-report.md) for which won on *our* eval set.)
---
## 8. Concept-graph construction from memories
Turning flat Memory rows into nodes (concepts/entities) + typed directed edges + memoryconcept
"mentions" edges. Three extraction families:
- **(A) Open LLM triple extraction** (schema-free) prompt an LLM to emit `[subject, relation,
object]` triples. High recall, but relation labels proliferate ("prefers"/"likes"/"favors"), so
it **requires** downstream canonicalization. GraphRAG is the canonical implementation
(extract + gleaning + cross-chunk entity summarization).
- **(B) Schema-guided** — constrain to a fixed ontology. Cleaner, but a fixed schema misses
surprises in a heterogeneous personal corpus. **EDC** (Zhang & Soh, EMNLP 2024) bridges the two:
*extract* open triples → *define* (LLM writes a one-sentence definition per distinct relation) →
*canonicalize* (embed definitions, retrieve nearest existing relations, LLM verifies map-vs-add).
Two modes: target-alignment (fixed schema) and self-canonicalization (grow schema dynamically).
- **(C) Entity resolution / canonicalization** (the dedup problem — "Svelte"/"SvelteKit"/"svelte
framework" are one node): cluster-then-refine on the *aggregated* graph — embed every surface
string, cluster by cosine (HDBSCAN / connected-components over a threshold), optional
LLM-as-judge per cluster. KGGEN (arXiv 2502.09956) does iterative LLM-guided clustering;
Graphiti uses MinHash+LSH fast-path with LLM fallback. **Cost scales with distinct entities (low
thousands), not with memory count.**
- **(D) Lightweight non-LLM** — spaCy NER + noun-chunks + co-occurrence edges, or **ReLiK**
(Sapienza, ACL 2024) for *typed* relations on CPU at up to 40× LLM speed, zero per-doc LLM cost.
The natural ablation baseline and sqlite-only fallback.
**The tractable recipe for our corpus.** Measured: 5,452 non-sensitive memories ≈ 683K content
tokens total — *tiny*. At ~125 content-tokens/memory, ~570 memories pack into one 100K-token
request, so the **entire corpus extracts in ~1025 batched LLM calls, not 5,452 sequential
calls**. Pipeline: (1) batch-extract open triples (each memory tagged with its `memory_id` so
triples map back), parallelized — LangExtract / KGGEN style; (2) aggregate + canonicalize globally
*once* (embed distinct entities, cluster, LLM-judge only ambiguous clusters — tens of calls);
(3) optionally one batched LLM "define relations" pass for EDC-style relation canonicalization.
Total budget: low tens of calls, minutes of wall-clock, a few dollars hosted or one GPU-node
llama-cpp session. **Canonicalization quality (the similarity threshold / cluster granularity) is
where this lives or dies and must be tuned against held-out data, not eyeballed.** Write-time /
Graphiti-style per-memory extraction is for *incremental updates only* — the wrong tool for the
one-shot backfill.
---
## 9. Vector storage in Postgres (production substrate)
`pgvector` is a **proven capability on our exact CNPG cluster** (Immich already does vector search
there, and claude-memory-mcp is already a tenant of the shared `pg-cluster-rw.dbaas` behind
PgBouncer) — zero new infrastructure, reuse-before-building satisfied.
- **HNSW** (recommended default): `USING hnsw (embedding halfvec_cosine_ops) WITH (m=16,
ef_construction=64)`; query knob `SET hnsw.ef_search` (default 40). Best speed-recall tradeoff;
buildable on an empty table; graph in RAM. **IVFFlat** is rejected (must be built *after* data
exists — an empty-table footgun — and has a lower recall ceiling).
- **halfvec** (fp16, 2 bytes/dim) halves index size at ~no recall loss; indexable ≤4000 dims.
768-d halfvec = 1536 bytes/row; at our scale total embedding storage is single-digit MB.
- **Filtered ANN:** we always filter `deleted_at IS NULL` (often `category`). Post-filtering can
under-fill top-k; enable `hnsw.iterative_scan='relaxed_order'`, and **always add a tie-breaker**
(`, id`) since approximate indexes give non-deterministic order.
- **Hybrid in one query:** each retriever is a CTE producing a per-ranker rank; fuse with RRF via
FULL OUTER JOIN on memory id — no score calibration needed across the incomparable ts_rank and
cosine scales.
- **pgvectorscale / StreamingDiskANN** (bounded-RAM disk graph, SBQ compression) is **deferred** —
Rust/pgrx must be compiled into the operand image, and it only earns its keep above ~15M
vectors. Our corpus is orders of magnitude below that.
- **PgBouncer gotcha:** per-query GUCs (`hnsw.ef_search`) must be `SET LOCAL` inside the recall
transaction, not session-level, under transaction pooling.
**Not for the prototype** (the prototype uses an in-process numpy index); this is the production
adoption path *if* the benchmark clears the gate — an additive Alembic migration (one nullable
`halfvec(1024)` column + HNSW index) plus a Terraform change to the CNPG stack.
---
## 10. Evaluation methodology
A retrieval test collection = corpus + query set + **qrels** (relevance judgments). For each
query, call recall, take the ordered list of returned memory ids, score against qrels — measuring
the *retriever in isolation*, exactly what ADR-0001's gate needs.
**Metrics (compute all; pick one primary):**
- **Recall@k** — "did we surface the right memory at all?" *The* hot-path metric (auto-recall
injects top-N; if the memory isn't in top-k it can't help). Report @5/@10/@20/@30.
- **nDCG@k** — graded + position-aware; the best single summary (BEIR standard is nDCG@10).
Headline quality number for the gate.
- **MRR** — only the first hit matters; relevant for the exact-lookup stratum.
- **MAP** — broad binary recall+precision blend; secondary, stable for significance tests.
**Stratification (the ADR-0001 hypothesis-targeted design):** *exact/lexical* (FTS already wins —
the **regression guard**); *paraphrase/semantic* (disjoint vocabulary — the value-of-embeddings
test); *multi-hop* (≥2 memories or a concept link — the graph test).
**Qrels generation (the LongMemEval pipeline, inverted for memories):** sample seed memories
stratified by category + importance → an LLM generates exact / paraphrase / multi-hop queries →
label relevant ids, with **pooling** (union the top-k of every arm, TREC-style) and an LLM-judge
on the **UMBRELA 03 scale**. **Separate the generator model from the judge model** to avoid
self-preference leakage. **Hand-verify** a ≥1520% sample (oversample multi-hop) and require
Cohen's κ(LLM, human) ≥ ~0.6 before trusting auto-labels; always hand-author multi-hop
relevant-id sets.
**Pitfalls with standard mitigations (all baked into the protocol):** LLM judges are
systematically *lenient* (κ gate); "holes" (new arms retrieve unjudged docs — must pool *all*
arms before judging, else the gate is rigged against semantic/hybrid); generator-as-judge leakage
(model separation); too-easy self-generated queries (check paraphrase shares no content tokens);
adversarial/unanswerable queries have no relevant id and **must be kept out of the ranked metrics**
(mixing them corrupted the disputed Zep-vs-mem0 LOCOMO comparison — 84%→58%).
**Sizing:** Voorhees & Buckley (2002) — ≥25 topics is the floor, 50 yield reliable rankings, and a
~56% absolute gap at n=50 is needed for 95% confidence the ordering holds on a different query
set. Since the gate is *per stratum*, each stratum wants its own ~50 queries.
> **Honest note on what we actually built:** our eval set is **119 queries (40 exact / 40
> paraphrase / 39 multihop)** — just below the ~50/stratum ideal, and qrels were LLM-generated
> with lighter hand-verification than the full protocol prescribes. This is a real limitation,
> tracked in the [benchmark report](benchmark-report.md).
---
## 11. Synthesis — what we borrow, from whom
| Source | Borrowed mechanism | Re-implemented as | Adopted? |
|---|---|---|---|
| LightRAG | Incremental set-union graph merge; dual-level retrieve | Native node/edge tables, no AGE; FTS+dense+graph fuse | Idea only |
| LazyGraphRAG | Defer LLM cost; index-time work ∝ new content | Store-time extraction off hot path | Principle |
| Zep / Graphiti | Episodic/semantic split; 3-signal RRF read path; bi-temporal invalidation | Memory rows + Postgres node/edge tables; pgvector+FTS+CTE | **Blueprint** |
| Mem0 | ADD/UPDATE/DELETE/NOOP write-side curation | Flagged async curator over existing `update_memory` | Complementary, flagged |
| HippoRAG 2 | PPR spreading activation for multi-hop | LLM-free FTS+vector-seeded PPR over memory-node graph (phase 2, gated) | Idea only |
| Bruch et al. / Cormack | RRF default + CC/TM2C2 challenger | Weighted RRF k=60, post-fusion importance prior | **Direct** |
| EDC / KGGEN | Open-extract → define → canonicalize globally | Batched extraction + embedding-cluster canonicalization | **Direct** |
| pgvector / Supabase | HNSW + halfvec + RRF hybrid in one SQL query | Additive migration to CNPG (production only) | **Production design** |
| LongMemEval / UMBRELA / Voorhees | Stratified LLM-qrels + pooling + κ gate | Our exact/paraphrase/multi-hop eval | **Direct** |
The through-line: **a memory-node concept graph, dense pgvector embeddings, and the existing
lexical FTS, fused with weighted RRF, with all LLM work pushed to store time** — sized for an
append-heavy personal store and gated on a benchmark that beats FTS.
---
## Sources
**GraphRAG family**
- Edge et al., "From Local to Global: A Graph RAG Approach…" (Microsoft, 2024) — arXiv 2404.16130
- Microsoft GraphRAG incremental-indexing design — github.com/microsoft/graphrag/issues/741; GraphRAG 1.0 blog (microsoft.com/en-us/research/blog/moving-to-graphrag-1-0…)
- Guo et al., "LightRAG: Simple and Fast RAG" (HKUDS, EMNLP 2025) — arXiv 2410.05779; github.com/HKUDS/LightRAG (incl. issue #2122, PG+AGE concurrency)
- nano-graphrag — github.com/gusye1234/nano-graphrag
- LazyGraphRAG — microsoft.com/en-us/research/blog/lazygraphrag-setting-a-new-standard-for-quality-and-cost/
**Temporal KG memory**
- Rasmussen et al., "Zep: A Temporal Knowledge Graph Architecture for Agent Memory" (2025) — arXiv 2501.13956
- Graphiti — github.com/getzep/graphiti; Neo4j writeup (neo4j.com/blog/developer/graphiti-knowledge-graph-memory/)
**Extraction-based memory**
- "Mem0: Building Production-Ready AI Agents…" (2025) — arXiv 2504.19413; github.com/mem0ai/mem0 (configs/prompts.py)
**Graph-PPR retrieval**
- Gutiérrez et al., "HippoRAG" (NeurIPS 2024) — arXiv 2405.14831
- "From RAG to Memory" = HippoRAG 2 (ICML 2025) — arXiv 2502.14802; github.com/OSU-NLP-Group/HippoRAG
**Embeddings**
- bge-large-en-v1.5 — huggingface.co/BAAI/bge-large-en-v1.5; gte-Qwen2-1.5B — huggingface.co/Alibaba-NLP/gte-Qwen2-1.5B-instruct; nomic — huggingface.co/nomic-ai/nomic-embed-text-v1.5; bge-m3 — huggingface.co/BAAI/bge-m3; e5-large-v2 — huggingface.co/intfloat/e5-large-v2
- Voyage-3/3.5 — blog.voyageai.com/2024/09/18/voyage-3/; docs.voyageai.com/docs/pricing
- OpenAI text-embedding-3 — developers.openai.com/api/docs/guides/embeddings; Cohere embed v3 — docs.cohere.com/docs/cohere-embed
**Fusion**
- Cormack, Clarke, Büttcher (SIGIR'09) — cormack.uwaterloo.ca/cormacksigir09-rrf.pdf
- Bruch et al., "An Analysis of Fusion Functions for Hybrid Retrieval" (TOIS 2023) — arXiv 2210.11934
- Elastic weighted RRF; Weaviate hybrid-search fusion algorithms; "Calibrated Fusion for Heterogeneous Graph-Vector Retrieval" — arXiv 2603.28886
- bge-reranker — huggingface.co/BAAI/bge-reranker-base
**Concept-graph construction**
- Zhang & Soh, "Extract-Define-Canonicalize" (EDC, EMNLP 2024) — arXiv 2404.03868; github.com/clear-nus/edc
- KGGEN — arXiv 2502.09956; ReLiK (ACL 2024) — arXiv 2408.00103; LightKGG — arXiv 2510.23341; Google LangExtract — github.com/google/langextract
**Postgres vector storage**
- pgvector — github.com/pgvector/pgvector; pgvectorscale — github.com/timescale/pgvectorscale; CNPG image-volume extensions — cloudnative-pg.io/docs/devel/imagevolume_extensions/; Supabase hybrid search — supabase.com/docs/guides/ai/hybrid-search
- This cluster: `infra/docs/architecture/databases.md` (claude-memory-mcp is a CNPG tenant); this repo: `migrations/versions/001_initial_schema.py`, `src/claude_memory/api/app.py`
**Evaluation**
- LoCoMo arXiv 2402.17753; LongMemEval arXiv 2410.10813; UMBRELA arXiv 2406.06519; "Judging the Judges" / LLMJudge arXiv 2502.13908; Voorhees & Buckley "Topic Set Size" (2002); Buckley & Voorhees "Bias and the Limits of Pooling"; BEIR arXiv 2104.08663