openalicelabs / academy
Course SYS-03 · Lesson 03·02 · Topic: MemPalace · Est. read ~13 min
OPENALICE LABORATORIES · EDUCATION PATH · SYSTEMS 03 · 02

A memory palace for agents.

An agent that forgets every session starts from zero each morning. MemPalace gives it long-term memory with one contrarian rule: store everything verbatim, summarize nothing at write time. No LLM on the write path. Recall is just embedding search, yet it scores 96.6% Recall@5 on the LongMemEval benchmark.

FIG.00 — WAKE-UP
FIG.0A — THE WRITE PATH · raw text → verbatim drawer + embedding · NO LLM

A message arrives. It is written verbatim into a drawer — the lossless source of truth — and a small local model turns it into an embedding vector. That is the entire write. No model summarizes it, no API is called, nothing leaves the machine.
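A minimal sketch of that write path follows. It assumes a toy hashed bag-of-words vector as a stand-in for the small local embedding model. MemPalace's actual model and storage code are not shown here, and `Palace`, `write`, and `embed` are hypothetical names. The sketch shows that a write keeps the raw text untouched, adds one vector, and makes no model or network call.

```python
import hashlib
import math
import re
from dataclasses import dataclass, field

DIM = 256  # toy vector size; a real embedding model fixes its own dimension


def embed(text: str) -> list[float]:
    """Toy stand-in for a small local embedding model (hashed bag of words).

    Deterministic and offline; a real model captures meaning, not just word overlap.
    """
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # sha256 instead of hash(): Python's built-in str hash is salted per process.
        bucket = int(hashlib.sha256(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


@dataclass
class Palace:
    drawers: dict[str, str] = field(default_factory=dict)          # id -> verbatim text
    vectors: dict[str, list[float]] = field(default_factory=dict)  # id -> embedding

    def write(self, drawer_id: str, text: str) -> None:
        # The entire write path: keep the text untouched and store one vector.
        # No summarization, no LLM, no network call.
        self.drawers[drawer_id] = text
        self.vectors[drawer_id] = embed(text)


palace = Palace()
note = "Standup: the beta ships Friday if the load test passes."  # hypothetical message
palace.write("d1", note)
```

Because the drawer holds the original string, nothing is lost at write time. Every trade-off moves to the read side, where the quality of the embedding decides what can be found.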

WRITE PATH: verbatim drawer + local embedding; zero LLM calls
READ PATH: semantic similarity search, optional LLM rerank
WAKE-UP COST: ~170 tokens (identity + critical facts)
BENCHMARK: 96.6% Recall@5 on LongMemEval, raw mode
DEPENDENCIES: ChromaDB + PyYAML; local-first
01 / 07
The central thesis · verbatim vs extract

Don't summarize. Just keep it.

Most agent-memory systems run a model at write time to distill chats into "facts." MemPalace bets the opposite: store the raw text, embed it, and let search do the work. That one bet decides everything downstream, and it comes with a clear trade-off.

EXTRACT-FIRST

distill on write

An LLM compresses each conversation into facts. Dense and pre-resolved — but it costs money, adds latency, is non-deterministic, and silently loses what you didn't know you'd need.

VERBATIM ★

keep, then retrieve

Write the raw text losslessly; embed it locally. Writes are free, offline, deterministic. Nothing is lost — recall leans entirely on embedding quality, which is now good enough.

THE WAGER

retrieval > distillation

If embeddings can find the right verbatim snippet on demand, you never needed to summarize. The benchmark says the bet largely pays: 96.6% R@5 with zero write-time LLM.
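Every headline number in this lesson is a Recall@k. Below is a minimal sketch of one common definition, in which a query counts as a hit if any relevant memory lands in the top k results. LongMemEval's own scoring may differ in detail. `recall_at_k` and the toy ids are hypothetical.

```python
def recall_at_k(ranked: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Share of queries whose top-k results contain at least one relevant id.

    ranked[i] is the retriever's ordered result list for query i;
    relevant[i] is the set of ids that would answer query i.
    """
    if not ranked:
        return 0.0
    hits = sum(1 for results, gold in zip(ranked, relevant) if gold & set(results[:k]))
    return hits / len(ranked)


# Two toy queries: the first answer sits at rank 3, the second is never retrieved.
ranked = [["d4", "d9", "d2", "d7"], ["d1", "d3", "d5", "d8"]]
relevant = [{"d2"}, {"d6"}]
print(recall_at_k(ranked, relevant, k=2))  # 0.0
print(recall_at_k(ranked, relevant, k=5))  # 0.5
```

Under this definition, 96.6% R@5 means the right memory shows up in the top five for roughly 97 of every 100 questions.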

FIG.01 — TWO WRITE PATHS · COST + RISK FLIP
Interactive slider from VERBATIM (no write-time LLM) to EXTRACT, comparing write cost per year, write latency, determinism, information lost, and bytes stored.

Toward extract, writes get denser, but you pay for an LLM call on every turn, give up determinism, and quietly drop detail you can't get back.

Toward verbatim, writes become free and offline. You store more bytes (disk is cheap) and shift all the cleverness to read time, where embeddings now shine.

02 / 07
The data model · method of loci

Wings, rooms, halls, drawers.

The "palace" is the ancient method-of-loci mnemonic laid over a vector store. Underneath, every label is just metadata on a document.

FIG.02 — THE PALACE · CLICK A STRUCTURE

It's all metadata

The whole hierarchy is tags on documents in one vector store, so "search within a wing" is a metadata-filtered similarity query.

A drawer holds the verbatim original, which is the source of truth. A closet is a compressed pointer to it. Halls cut across all wings by memory type (facts, events, preferences). The layout is a clean ergonomic for humans and, as §07 argues, that is all it is.
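The sketch below assumes a plain list of records instead of a real vector store. Each drawer is verbatim text plus a metadata dict, a closet is a short stand-in that points back to a drawer, and "search within a wing" is only a filter. The field names (`wing`, `room`, `hall`) follow the lesson's vocabulary. The record layout, the `in_scope` helper, and the drawer contents are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Drawer:
    id: str
    text: str                                 # verbatim original: the source of truth
    meta: dict = field(default_factory=dict)  # palace labels are only tags


@dataclass
class Closet:
    summary: str     # compressed stand-in, cheap to load
    drawer_id: str   # pointer back to the verbatim drawer


def in_scope(drawer: Drawer, **where: str) -> bool:
    """True if every requested tag matches, e.g. in_scope(d, wing="orion")."""
    return all(drawer.meta.get(key) == value for key, value in where.items())


# Hypothetical contents; only the tag structure matters here.
drawers = [
    Drawer("d1", "Notes from the Orion auth review.", {"wing": "orion", "room": "auth", "hall": "events"}),
    Drawer("d2", "Orion invoices go out monthly.", {"wing": "orion", "room": "billing", "hall": "facts"}),
    Drawer("d3", "Prefers window seats on long flights.", {"wing": "personal", "room": "travel", "hall": "preferences"}),
]
closet = Closet("orion auth review", drawer_id="d1")

orion_wing = [d.id for d in drawers if in_scope(d, wing="orion")]         # ['d1', 'd2']
preferences = [d.id for d in drawers if in_scope(d, hall="preferences")]  # ['d3']: halls cut across wings
```

A wing, a room, and a hall are all the same kind of object here: one more key in `meta`. That is the whole of the "palace" underneath the metaphor.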

03 / 07
Progressive loading · the 170-token wake-up

Pay context only for what the turn needs.

Memory loads in four layers. At session start an agent pulls only L0+L1, a tiny identity + critical-facts preamble. Deeper memory waits until a topic surfaces, so the context bill grows only when a turn actually needs more.

FIG.03 — LAYER BUDGET · CLICK TO TOGGLE

What wake-up emits (context loaded, out of a 200K-token window):
L0 · identity / system · ~50 tok · always
L1 · critical facts (AAAK) · ~120 tok · always
L2 · room recall · on demand · when a topic surfaces
L3 · deep semantic search · on query · on explicit ask

The ~170-token headline is just L0+L1. Compare that to pasting an entire history (millions of tokens, impossible) or re-summarizing with an LLM every session (dollars and latency). Here, steady-state memory is nearly free.

Order-of-magnitude figures from the project's own docs: paste-everything ≈ 19.5M tokens · LLM-summaries ≈ $507/yr · wake-up ≈ $0.70/yr · wake-up + 5 searches/day ≈ $10/yr.
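Here is a minimal sketch of the layering. It assumes the layer contents arrive as plain strings and that L2 and L3 are retrieval callbacks. `build_context`, `room_recall`, `deep_search`, and the word-count tokenizer are hypothetical, not MemPalace's API. The wake-up cost is whatever L0+L1 weigh, and the deeper callbacks never run until a turn supplies a topic or a query.

```python
from typing import Callable, Optional


def approx_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: count whitespace-separated words.
    return len(text.split())


def build_context(
    l0_identity: str,
    l1_facts: str,
    topic: Optional[str] = None,
    room_recall: Optional[Callable[[str], str]] = None,
    query: Optional[str] = None,
    deep_search: Optional[Callable[[str], str]] = None,
) -> str:
    parts = [l0_identity, l1_facts]          # wake-up: L0 + L1, always loaded
    if topic is not None and room_recall is not None:
        parts.append(room_recall(topic))     # L2: only once a topic surfaces
    if query is not None and deep_search is not None:
        parts.append(deep_search(query))     # L3: only on an explicit ask
    return "\n".join(parts)


def not_needed(_: str) -> str:
    raise RuntimeError("deeper layers should not run at wake-up")


wake = build_context("identity line", "critical facts line",
                     room_recall=not_needed, deep_search=not_needed)
print(approx_tokens(wake))  # 5
```

Passing callbacks instead of preloaded text is the design choice that keeps steady-state cost flat. L2 and L3 have no context cost until something asks for them.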

04 / 07
Read path · constrain, then rank

Narrow the search, raise the recall.

The free win: don't search the whole palace. Scope the candidate set by wing → hall → room before ranking, and recall climbs measurably (the project reports +34%). The same query sharpens as the scope narrows.


Same query, same store, same embeddings — only the candidate set changes. Tighter scope means fewer distractors compete with the right memory. This is exactly the lesson graph-scoped retrieval teaches from the other side: constrain before you rank.
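The sketch below shows constrain-then-rank. It uses a toy bag-of-words cosine in place of real embeddings and dict records with a `meta` field, and both choices are hypothetical. The scenario is made up. In the flat search, a look-alike drawer from another wing outranks the right one. Filtering to the `orion` wing removes it from the candidate set before any scoring happens.

```python
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    # Toy bag-of-words vector; a real system would use the embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(drawers: list[dict], query: str, k: int = 1, **where: str) -> tuple[list[str], int]:
    # 1. Constrain: keep drawers whose metadata matches every filter (none = flat search).
    pool = [d for d in drawers if all(d["meta"].get(key) == val for key, val in where.items())]
    # 2. Rank: similarity is computed only for the survivors.
    q = vectorize(query)
    pool.sort(key=lambda d: cosine(q, vectorize(d["text"])), reverse=True)
    return [d["id"] for d in pool[:k]], len(pool)


drawers = [
    {"id": "right", "text": "Kai chose the auth provider for Orion after the security review",
     "meta": {"wing": "orion", "room": "auth"}},
    {"id": "distractor", "text": "Kai asked which auth provider to pick for the side project",
     "meta": {"wing": "side-project", "room": "auth"}},
    {"id": "other", "text": "Orion invoices go out monthly",
     "meta": {"wing": "orion", "room": "billing"}},
]
query = "which auth provider did Kai pick?"
print(search(drawers, query))               # (['distractor'], 3)
print(search(drawers, query, wing="orion")) # (['right'], 2)
```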

FIG.04 — CANDIDATES FOR ONE QUERY · scope shrinks the field
query · "which auth provider did Kai pick?"
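Candidate drawers for this query (the right one highlighted), with the number of candidates searched and Recall@10 (LongMemEval) as the scope narrows from flat (all closets) to wing+room.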
RAW

96.6% R@5

Pure embedding similarity over verbatim drawers. No heuristics, no LLM. This is the headline number — and most of it is the embedding model plus verbatim storage.

HYBRID v4

~98.4%

Raw plus keyword boosting, temporal-proximity boosting, and preference-pattern extraction. Cheap deterministic signals layered on top.

LLM RERANK ★

≥99%

Take the top-20 candidates and re-rank with a model. The expensive tier runs only on a tiny short-list — recall first, reasoning last.
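All three tiers share one shape: a cheap score over the whole candidate set, then an expensive one over a short list. The sketch below assumes both scorers are plain callables. In the lesson the expensive scorer is an LLM reranker, and the hybrid tier's keyword and temporal boosts would fold into `cheap_score`. `two_stage` and the toy scorers are hypothetical.

```python
from typing import Callable, Sequence


def two_stage(
    candidates: Sequence[str],
    cheap_score: Callable[[str], float],
    expensive_score: Callable[[str], float],
    shortlist: int = 20,
    k: int = 5,
) -> list[str]:
    # Stage 1 (recall): cheap score on everything, keep the best `shortlist`.
    first = sorted(candidates, key=cheap_score, reverse=True)[:shortlist]
    # Stage 2 (reasoning): expensive score only on the short list.
    return sorted(first, key=expensive_score, reverse=True)[:k]


calls: list[str] = []


def expensive(doc: str) -> float:
    # Stand-in for an LLM reranker; records every invocation.
    calls.append(doc)
    return -int(doc[1:])  # toy preference: lower-numbered docs


docs = [f"c{i}" for i in range(100)]
top = two_stage(docs, cheap_score=lambda d: int(d[1:]), expensive_score=expensive)
print(top, len(calls))  # ['c80', 'c81', 'c82', 'c83', 'c84'] 20
```

The expensive scorer runs 20 times, not 100, whichever way the store grows. That is the "recall first, reasoning last" budget.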

05 / 07
Compression · the AAAK preamble

Shorthand an LLM reads natively.

L1 facts are stored as AAAK, a deterministic, structured shorthand any model parses without a decoder. That is how a thousand-token team note becomes a ~120-token preamble, roughly an 8× reduction.

FIG.05 — LIVE AAAK COMPRESSION · DETERMINISTIC, NO LLM
Sample inputs (a team note, preferences, a decision) are compressed into AAAK output, the L1 preamble an agent loads. The figure compares prose and AAAK token counts and shows the compression ratio. Lossless? No, index only.
THE HONEST NUMBER

Marketing said "30×, zero loss." The ablation says realistic compression is ~8–10× and it is lossy — retrieving from AAAK instead of the verbatim drawer drops Recall@5 by 12.4 points. So AAAK is a great preamble, never the retrieval substrate.

That's why the drawer exists: AAAK is the fast index you load every turn; the verbatim drawer is the ground truth you fall back to when search needs the real thing.
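The sketch below shows that split, using a keyword match over the shorthand as the locating step. The shorthand string is an invented placeholder, not real AAAK syntax. `IndexEntry`, `preamble`, `recall`, and the drawer text are hypothetical. The agent carries the shorthand, but the text it answers with always comes from the drawer.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    shorthand: str   # compact, lossy line loaded every turn (placeholder, not real AAAK syntax)
    drawer_id: str   # pointer to the verbatim original


def preamble(index: list[IndexEntry]) -> str:
    # What the agent loads each turn: only the shorthand lines.
    return "\n".join(entry.shorthand for entry in index)


def recall(index: list[IndexEntry], drawers: dict[str, str], term: str) -> list[str]:
    # The index only locates; the answer comes from the verbatim drawer.
    return [drawers[e.drawer_id] for e in index if term.lower() in e.shorthand.lower()]


drawers = {"d1": "We agreed on Tuesday that the beta ships Friday, pending the load test."}
index = [IndexEntry("BETA|ship:fri|dep:loadtest", "d1")]
print(recall(index, drawers, "beta"))
```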

06 / 07
Validity windows · "what was true then"

Facts that expire, not vanish.

Beside the vector store sits a temporal triple graph. Every fact carries a validity window, and invalidation sets an end date rather than deleting the fact. So "as-of" queries return what was true at a given point in time.

FIG.06 — TEMPORAL TRIPLES · DRAG "AS OF"
kg.add_triple("Kai", "works_on", "Orion", valid_from="2025-06-01")
kg.invalidate("Kai", "works_on", "Orion", ended="2026-03-01")
kg.query_entity("Kai", as_of="2026-01-20")

For an as-of date between June 2025 and March 2026, "Kai works_on Orion" is active. After 2026-03-01 the fact is retired, not erased. Stale facts can't quietly corrupt a present-tense decision, and history stays answerable.

The honest limit: this is a flat triple store, single-hop, with no entity resolution. Useful, but not the multi-hop associative graph the palace metaphor hints at — build deeper if you copy it.
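The toy reimplementation below makes the validity-window idea concrete. It reuses the three call names shown in this section, but it is not MemPalace's code. The class, the `Triple` layout, the ISO-date strings, and the exclusive end date are all assumptions. Like the store described here, it is flat and single-hop, with no entity resolution.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Triple:
    subject: str
    predicate: str
    obj: str
    valid_from: date
    ended: Optional[date] = None  # None: still valid


class TemporalKG:
    """Toy flat, single-hop triple store; method names mirror the calls above."""

    def __init__(self) -> None:
        self.triples: list[Triple] = []

    def add_triple(self, s: str, p: str, o: str, valid_from: str) -> None:
        self.triples.append(Triple(s, p, o, date.fromisoformat(valid_from)))

    def invalidate(self, s: str, p: str, o: str, ended: str) -> None:
        # Close the validity window instead of deleting, so history stays answerable.
        for t in self.triples:
            if (t.subject, t.predicate, t.obj) == (s, p, o) and t.ended is None:
                t.ended = date.fromisoformat(ended)

    def query_entity(self, s: str, as_of: str) -> list[tuple[str, str, str]]:
        when = date.fromisoformat(as_of)
        # A triple counts if valid_from <= when < ended (end date treated as exclusive).
        return [
            (t.subject, t.predicate, t.obj)
            for t in self.triples
            if t.subject == s and t.valid_from <= when and (t.ended is None or when < t.ended)
        ]


kg = TemporalKG()
kg.add_triple("Kai", "works_on", "Orion", valid_from="2025-06-01")
kg.invalidate("Kai", "works_on", "Orion", ended="2026-03-01")
print(kg.query_entity("Kai", as_of="2026-01-20"))  # [('Kai', 'works_on', 'Orion')]
print(kg.query_entity("Kai", as_of="2026-04-01"))  # []
```

The retired triple is still in `kg.triples`, only with an end date. Going deeper would mean multi-hop traversal over these triples and resolving "Kai" and its aliases to one entity.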

07 / 07
Insight vs branding · what to actually steal

The insight is real. The palace is paint.

A critical-analysis paper went through the claims. The verdict is worth taking at face value: a significant architectural insight wrapped in overstated branding.

The claim | The reality | Steal it?
Verbatim + zero-LLM writes | Genuinely distinctive. Free, offline, deterministic write path. | YES (the core idea)
Progressive L0–L3 loading | Pay context only for what the turn needs. Clean and sound. | YES
Scoped retrieval +34% | Real win, but it's metadata filtering, covered in every vector-DB tutorial. | YES (it's free)
The spatial "palace" metaphor | Buys nothing over flat path/metadata scoping. A human ergonomic only. | no, skip
AAAK "30×, zero loss" | Really ~8–10× and lossy (−12.4 pts R@5 vs drawers). | as index only
"100% on the benchmark" | Required iterative LLM rerank + test-fixing; later walked back. | distrust
Temporal knowledge graph | Real + useful, but flat & single-hop, no entity resolution. | steal, go deeper
WHY THE NUMBERS HOLD UP

Most of the 96.6% is the embedding model + verbatim storage, not the palace. Put the same embeddings and the same "never summarize" policy on a bare ChromaDB and you reproduce most of it. That's not a knock — it's the proof the insight is what matters, not the branding.

THE OPPOSITE BET · A-MEM

A-MEM takes the contrary stance: a Zettelkasten-style network where notes dynamically link and evolve — new observations trigger synthesis, detail compresses upward into principles. MemPalace says "archive raw, let embeddings retrieve"; A-MEM says "memory must be structured and continuously refined." The design to watch if you ever want associative recall, not just similarity.

03 · 02 — you made it

You toured the palace.

Verbatim over summary. The four-layer wake-up. Scoped retrieval, deterministic AAAK, the temporal graph — and a clear eye on what's insight versus paint. You now know how to give an agent memory that never forgets and barely costs.

03·01 Agent memory systems · the design space: tiering, linking, distillation
03·02 MemPalace · verbatim storage, zero-LLM writes, L0–L3, temporal KG
03·03 GraphRAG · constrain the candidate set with a graph before you rank
03·04 Model routing & councils · cheap recall first, the expensive model on the short-list
Next · related

Mixture-of-Experts →

The same "route to the few that matter" instinct, but inside the network: a learned gate sends each token to a small set of experts.
