openalicelabs / academy
COURSE ARCH-03 · LESSON 03 · 04 · TOPIC AGENT MEMORY · EST. READ ~13 MIN
OPENALICE LABORATORIES · EDUCATION PATH · AGENTS & SYSTEMS 03 · 04

Teaching a stateless mind to remember.

An LLM is stateless. Between two API calls it forgets everything that isn't in the prompt. "Agent memory" is the engineering that makes a next-token predictor behave as if it had a durable, growing mind — a storage layer outside the window plus a policy for what to write, what to compress, and what to read back at the right moment.

FIG.00 — THE THREE VERBS
FIG.0A — THE WHOLE IDEA · the LLM is stateless; the SYSTEM around it remembers

The context window is RAM: small, costly, lossy in the middle. Long-term memory is a disk outside it. Every named system — MemGPT, Generative Agents, plain vector memory — is just a different answer to the same three verbs.

THE PROBLEM: the model forgets between calls; it is stateless
IN-CONTEXT TIER: RAM (small, always seen, paid for every turn)
EXTERNAL TIER: disk (large, searchable, never directly in-prompt)
THE THREE VERBS: write · compress · retrieve
MEMGPT'S LIFT: GPT-4 fixed window 32.1% → MemGPT 92.5% (same weights)
01 / 07
The one-sentence idea · where memory lives

The model has no memory. The system does.

Long-term memory is a storage layer outside the context window plus a policy that decides three things. Every architecture below is just a different answer to these three verbs.

WRITE

what to keep

After each turn, which facts are worth saving outside the prompt — and how (raw log? a labelled fact? a rated observation?).

COMPRESS

what to forget

Storage and the window are finite. Old detail gets summarized or consolidated — capacity bought by losing specifics. The hard part.

RETRIEVE

what to read back

At the right moment, pull the few relevant memories back into the prompt. Relevance, recency and importance all matter — not just cosine similarity.

write(turn)     → external store    // what to keep
compress(store) → smaller store     // what to forget
retrieve(query) → into the prompt   // what to read back

That's the entire field on three lines. The only interesting question is who runs these verbs — a hand-written controller, the LLM itself in-band, or a separate background agent — and how retrieval scores what to pull.
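As a rough illustration of the three verbs, here is a minimal sketch of a hand-written controller. The class name `ToyMemory`, the capacity of 4, the string-joining "summary" and the word-overlap ranking are all assumptions made for this example; a real system would use an LLM summarizer and embeddings instead.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMemory:
    """Hypothetical external store; not taken from any library."""
    max_items: int = 4
    items: list[str] = field(default_factory=list)

    def write(self, turn: str) -> None:
        # WRITE: keep the turn outside the prompt.
        self.items.append(turn)

    def compress(self) -> None:
        # COMPRESS: when over capacity, fold the oldest items into one line.
        # Joining strings is a stand-in for an LLM summarizer.
        if len(self.items) <= self.max_items:
            return
        overflow = len(self.items) - self.max_items + 1
        old, rest = self.items[:overflow], self.items[overflow:]
        self.items = ["summary: " + " | ".join(old)] + rest

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        # RETRIEVE: rank by naive word overlap with the query
        # (a stand-in for embedding similarity).
        words = set(query.lower().split())
        ranked = sorted(
            self.items,
            key=lambda m: len(words & set(m.lower().split())),
            reverse=True,
        )
        return ranked[:k]
```

Calling `write` after each turn, `compress` when the store grows, and `retrieve` before building the next prompt is the whole control loop in this toy version.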

Why it's worth caring: this is measurable. On MemGPT's Deep Memory Retrieval consistency task, GPT-4 with a fixed window scored 32.1%; the same model wrapped in MemGPT scored 92.5%. Same weights — the entire lift is memory engineering.

And it's still unsolved at the hard part: on long conversational benchmarks (35 sessions, 300+ turns), even RAG-augmented LLMs lag far behind humans on temporal/causal reasoning — "what changed since last week, and why."

02 / 07
The window is RAM, not disk · MemGPT's FIFO + recursive summary

Watch the window fill, warn, and evict.

A bigger window doesn't solve memory: it's finite, costs per token every turn, and is lossy in the middle. MemGPT treats the prompt like physical RAM. Press "+ turn" and watch the FIFO queue overflow into a recursive summary.

FIG.02 — MAIN CONTEXT · SYSTEM + WORKING + FIFO QUEUE
 

Main context has three parts: system instructions (read-only rules + memory-function schemas), a small read/write working context scratchpad for stable facts that's always in-prompt, and the rolling FIFO message queue.

At ~70% a memory-pressure warning fires so the agent can flush facts to disk before data is lost. At 100% the queue manager evicts ~50% of messages and folds them into a recursive summary sitting at the head:

new_summary = summarize(old_summary + evicted_msgs)

A multi-generational summary that degrades gracefully instead of dropping data on the floor. The raw messages still live in recall storage on disk.
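The fill, warn and evict cycle described above can be sketched as follows. This is an illustration under stated assumptions, not MemGPT's implementation: `FifoContext`, the whitespace token counter and the truncating `summarize` are hypothetical stand-ins, and the warning is only recorded as an event rather than sent to an agent. The 70% and 50% figures are the ones quoted in the text.

```python
from dataclasses import dataclass, field


def count_tokens(text: str) -> int:
    # Assumption: whitespace word count stands in for a real tokenizer.
    return len(text.split())


def summarize(old_summary: str, evicted: list[str], max_words: int = 12) -> str:
    # Stand-in for an LLM call. Keeps the first two words of each evicted
    # message, then only the newest max_words words overall (lossy on purpose).
    gist = [" ".join(m.split()[:2]) for m in evicted]
    words = " ".join([old_summary, *gist]).split()
    return " ".join(words[-max_words:])


@dataclass
class FifoContext:
    window: int = 40              # token budget (illustrative)
    warn_at: float = 0.70         # memory-pressure warning threshold
    evict_fraction: float = 0.50  # share of queued messages evicted on overflow
    summary: str = ""
    queue: list[str] = field(default_factory=list)
    recall: list[str] = field(default_factory=list)  # raw log, never evicted
    events: list[str] = field(default_factory=list)

    def used(self) -> int:
        return count_tokens(self.summary) + sum(count_tokens(m) for m in self.queue)

    def add_turn(self, msg: str) -> None:
        self.recall.append(msg)   # the raw message always lands on "disk"
        self.queue.append(msg)
        if self.used() >= self.window:
            n = max(1, int(len(self.queue) * self.evict_fraction))
            evicted, self.queue = self.queue[:n], self.queue[n:]
            # new_summary = summarize(old_summary + evicted_msgs)
            self.summary = summarize(self.summary, evicted)
            self.events.append("evict")
        elif self.used() >= self.warn_at * self.window:
            self.events.append("warn")
```

Because every message is also appended to `recall`, eviction only removes text from the prompt; the raw record survives on disk.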

03 / 07
What most apps actually ship · and where it breaks

Two baselines everyone reaches for first.

Before the clever architectures, two plain designs. They work — until they don't. The real systems are built to fix exactly their failure modes.

BASELINE 0

buffer + summary

Keep the last N turns verbatim; on overflow, summarize the oldest into a paragraph and prepend. Cheap, no infra. Failure: summarizing a summary, turn after turn, erodes names, numbers, who-said-what until it confidently misremembers.

BASELINE 1

vector memory (RAG)

Embed every message, store it, pull top-k nearest neighbours back each turn. Unbounded storage, nothing forgotten. Failure: relevance ≠ usefulness — cosine surfaces topically-similar text, not the causally needed fact; no recency; returns disconnected fragments.

THE FIX ★

salience beats similarity

The single biggest idea both baselines miss is importance — an explicit "how much does this matter" score. Recency + importance + relevance together retrieve far better than relevance alone. That's §05.

Compression is lossy and compounds. The engineering trick that runs through every real system: keep the raw observations underneath, and treat the summary only as an index — never the sole copy.
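One way to read "summary as an index" is that each summary carries pointers back to the raw observations it was built from. The sketch below assumes a simple in-process dictionary as the raw log; `Summary`, `record` and `expand` are names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    text: str
    source_ids: list[int]   # pointers back to the raw observations


# Hypothetical raw log: observation id -> verbatim text.
raw_log: dict[int, str] = {}


def record(observation: str) -> int:
    # Store the verbatim observation and return its id.
    obs_id = len(raw_log)
    raw_log[obs_id] = observation
    return obs_id


def expand(summary: Summary) -> list[str]:
    # The summary is only an index: exact names and numbers
    # are read from the raw rows, not from the summary text.
    return [raw_log[i] for i in summary.source_ids]
```

Retrieval can match on the short summary text and then hand the expanded raw rows to the model, so repeated summarization never becomes the sole copy of a fact.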

04 / 07
Architecture A · the LLM as its own memory manager

MemGPT pages through memory by calling functions.

MemGPT (now the Letta framework) borrows the OS virtual-memory metaphor. The defining move: the LLM edits its own memory by calling functions, and the results feed back as new messages. Step through a real paging loop.

FIG.04 — IN-BAND PAGING · search → heartbeat → read → answer
USER ASKS A FACT BURIED IN OLD MESSAGES

Two tiers. Main context = the prompt (system + working + FIFO queue). External context = disk, reached only via function call: recall storage (all past messages) and archival storage (arbitrary long text the agent chooses to keep).

working_context.append(...)
archival_memory.insert(text)
archival_memory.search(query)
conversation_search(query)

Each call carries a request_heartbeat flag. If it is true, MemGPT immediately runs another inference step, so the model can chain search → read page 2 → answer. If it is false or absent, the agent yields and waits for the next event. In effect it is an OS-style interrupt/yield loop with the LLM in the driver's seat.
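The heartbeat loop can be sketched roughly as below. Nothing here is MemGPT's real API: `run_until_yield`, the dictionary shape of a step, and the `max_steps` guard are assumptions for illustration, and the "model" is any callable that returns the next step.

```python
from typing import Callable, Optional

# Hypothetical step shape:
#   {"tool": str, "args": dict, "request_heartbeat": bool}  or  {"reply": str}
Step = dict


def run_until_yield(
    model: Callable[[list[str]], Step],
    tools: dict[str, Callable[..., str]],
    messages: list[str],
    max_steps: int = 8,
) -> Optional[str]:
    for _ in range(max_steps):
        step = model(messages)
        if "reply" in step:
            return step["reply"]          # final answer for the user
        result = tools[step["tool"]](**step["args"])
        # Tool output feeds back into the context as a new message.
        messages.append(f"[{step['tool']}] {result}")
        if not step.get("request_heartbeat", False):
            return None                   # yield: wait for the next event
    return None                           # safety stop (an assumption, not from the source)
```

With a scripted model that searches page 0, then page 1, then replies, the loop runs three inference steps for one user message, which is the search → read page 2 → answer chain from the text.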

Letta productionizes this as memory blocks — labelled, always-visible, character-capped key/values, editable via tools, and even shareable across agents (edit one, both see it). Plus sleep-time agents that consolidate between turns, off the hot path.
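To make the memory-block idea concrete, here is a small sketch of a labelled, character-capped block that two agents can share by reference. It is not Letta's API: `MemoryBlock`, `render` and the 200-character default are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class MemoryBlock:
    label: str
    value: str = ""
    limit: int = 200  # character cap; the number is illustrative

    def append(self, text: str) -> None:
        # Refuse writes that would push the block past its cap.
        candidate = f"{self.value}\n{text}".strip()
        if len(candidate) > self.limit:
            raise ValueError(f"block '{self.label}' would exceed {self.limit} chars")
        self.value = candidate


def render(blocks: list[MemoryBlock]) -> str:
    # Every block is rendered into the prompt on every turn.
    return "\n".join(f"[{b.label}] {b.value}" for b in blocks)
```

If two agents hold the same `MemoryBlock` object in their block lists, an `append` made through one shows up in the other's rendered prompt on its next turn.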

05 / 07
Architecture B · Generative Agents · the memory stream

Score every memory: recency + importance + relevance.

Park et al.'s "Smallville" agents use an append-only stream of timestamped observations — everything is retrieved, nothing is always-in-context. The intelligence is in the score. Drag the weights and watch which memory wins for the query "is Klaus dating anyone?"

score = α_recency·recency + α_importance·importance + α_relevance·relevance
// Park et al.: all three α = 1, each component min-max normalized to [0,1]
FIG.05 — MEMORY STREAM · LIVE RE-RANK · winner highlighted

Top 3 memories are pulled into the prompt. Bar = total score. The winner is accent-filled.

RETRIEVAL WEIGHTS · α: recency 1.0 · importance 1.0 · relevance 1.0 · clock 2 hrs
Presets: Park (1·1·1) · relevance only · recency only · importance only

Recency decays exponentially since last access (factor 0.995/hr) — so frequently-used memories stay warm. Importance is the LLM rating the memory 1–10 at write time ("1 = brushing teeth, 10 = a breakup"). Relevance is cosine similarity to the query — the only RAG part.

Crank relevance only and a stale, low-importance match can win — exactly the vector-baseline trap. Park's reflection goes further: when summed importance > 150, the agent synthesizes high-level insights and writes them back into the stream, manufacturing new knowledge by compression.
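The scoring rule can be sketched in a few lines. The decay factor, the 1 to 10 importance scale, the min-max normalization and the default weights of 1 come from the text; the `Memory` fields, the precomputed `relevance` values, and the choice to score a flat component as 0 are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    hours_since_access: float
    importance: int      # 1-10, rated by the LLM at write time
    relevance: float     # cosine similarity to the query, assumed precomputed


def min_max(values: list[float]) -> list[float]:
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumption: a flat component adds nothing
    return [(v - lo) / (hi - lo) for v in values]


def rank(memories: list[Memory], a_rec: float = 1.0, a_imp: float = 1.0,
         a_rel: float = 1.0, decay: float = 0.995, k: int = 3) -> list[Memory]:
    # Normalize each component to [0, 1] across the candidate set.
    rec = min_max([decay ** m.hours_since_access for m in memories])
    imp = min_max([float(m.importance) for m in memories])
    rel = min_max([m.relevance for m in memories])
    scored = [
        (a_rec * r + a_imp * i + a_rel * s, m)
        for r, i, s, m in zip(rec, imp, rel, memories)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:k]]
```

With three made-up memories (a stale, low-importance but highly similar one; a recent, important, fairly similar one; and a recent trivial one), the default weights pick the recent important memory, while setting `a_rec=0` and `a_imp=0` lets the stale match win.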

06 / 07
The subtle good idea · decay on last access, not creation

Why used memories stay warm.

Recency isn't decay since a memory was created — it's decay since it was last accessed. Touch a memory and its clock resets. Drag the slider to access the memory mid-decay and watch the curve jump back to full.

FIG.06 — RECENCY = 0.995 ^ (hours since last access)

At factor 0.995 per sandbox-hour, recency halves about every 138 hours (ln 0.5 / ln 0.995 ≈ 138.3). A memory you never revisit fades smoothly toward zero. But the moment retrieval touches it, last-access resets to now, so it springs back to 1.0 and starts decaying afresh.

That single choice — last access, not creation — gives the stream a sense of habit. Things you keep returning to stay vivid; things you mention once and never again quietly sink.
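A short sketch of decay on last access, using the 0.995 factor from the text. The `Trace` class and its method names are invented for this example; the half-life line just solves 0.995^t = 0.5 for t.

```python
import math

DECAY = 0.995  # per sandbox-hour, as quoted in the text


class Trace:
    def __init__(self, created_at: float) -> None:
        # A new memory counts as accessed at creation time.
        self.last_access = created_at

    def recency(self, now: float) -> float:
        # Exponential decay in hours since the last access.
        return DECAY ** (now - self.last_access)

    def touch(self, now: float) -> None:
        # Retrieval resets the clock, so recency returns to 1.0.
        self.last_access = now


# Hours until recency falls to one half: ln(0.5) / ln(0.995), about 138.3.
half_life = math.log(0.5) / math.log(DECAY)
```

Retrieval code would call `touch` on every memory it returns, which is what keeps frequently used memories warm.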

Defaults to tune, not laws: 0.995 and the 150-importance reflection threshold are tuned to a sandbox whose "hours" are simulated game time, not wall-clock time. Lift them into a real product unchanged and they'll be wrong.

07 / 07
The two big families · honest limits · how OpenAlice maps on

Imperative vs declarative — and the poisoning problem.

The two architectures differ in who runs the verbs. Most production systems in 2025–26 are hybrids of the two.

System | Who manages memory | Write | Retrieve | Consolidate
MemGPT / Letta | the LLM, in-band (calls functions) | agent decides: archival.insert | agent calls search + paging | recursive FIFO summary; sleep-time agent
Generative Agents | a scoring function, declarative | append observation + LLM rates 1–10 | rec + imp + rel score, top-k | reflection (insights written back)
Vector / buffer baseline | a hand-written controller | embed + store every message | top-k cosine | rolling summary (lossy)
SELF-EDITING MEMORY CAN CORRUPT ITSELF

When the LLM writes its own memory, a hallucinated core_memory_replace can overwrite a true fact with a false one — and that error becomes the new ground truth, compounding. An adversarial input that gets stored persists across sessions: memory poisoning. Treat agent-written memory as untrusted — version and audit it.
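"Version and audit it" could look something like the sketch below: an append-only block where a replace never destroys the previous value and every write records who made it. `VersionedBlock` and its methods are hypothetical names for this example, not part of MemGPT or Letta.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedBlock:
    """Hypothetical append-only memory block: writes are logged, never overwritten."""
    history: list[tuple[str, str]] = field(default_factory=list)  # (author, value)

    def replace(self, value: str, author: str) -> None:
        # A "replace" appends a new version instead of mutating the old one.
        self.history.append((author, value))

    @property
    def current(self) -> str:
        return self.history[-1][1] if self.history else ""

    def revert(self, version: int, author: str) -> None:
        # Reverting is itself a logged write, so the audit trail stays complete.
        self.replace(self.history[version][1], author)
```

If the agent overwrites a true fact with a hallucinated one, an auditor can see which author wrote each version and restore the earlier value without losing the record of the bad write.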

And it's still unsolved at the hard part: temporal/causal reasoning over long histories ("what did the user believe last month vs. now, and what changed it") stays far below human level even with RAG. Don't promise that a memory system "remembers everything correctly over months"; it doesn't. Retrieval quality dominates everything: bad embeddings make the most elegant architecture useless.

HOW OPENALICE MAPS ON

Alice runs a file-first memory that lines up cleanly with both tiers. sessions.jsonl per chat is MemGPT's recall storage — the durable, replayable, auditable log on disk. Scoped charter_memory_key facts are OpenAlice's Letta-style memory blocks — small, labelled, always-in-prompt — but edited by explicit tooling, dodging the poisoning risk.

Consolidation = the circadian goal. Alice's q9-circadian-memory is Park's reflection + Letta's sleep-time agent, with a safety upgrade: the pass is Critic-gated, so a hallucinated insight is vetoed before it's written back — a critic in the write path as the answer to poisoning. The honest gap, same as everyone's: scored retrieval over the flat log.
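A critic in the write path might be sketched as follows. This is not OpenAlice's implementation, which the text does not show: `gated_write`, the evidence-id scheme and the toy `keyword_critic` are all assumptions, with the keyword check standing in for an LLM critic.

```python
from typing import Callable


def gated_write(
    insight: str,
    evidence_ids: list[int],
    log: dict[int, str],
    critic: Callable[[str, list[str]], bool],
    store: list[str],
) -> bool:
    # An insight must cite existing log rows and pass the critic to be stored.
    evidence = [log[i] for i in evidence_ids if i in log]
    if not evidence or not critic(insight, evidence):
        return False          # vetoed: nothing is written back
    store.append(insight)
    return True


def keyword_critic(insight: str, evidence: list[str]) -> bool:
    # Toy stand-in for an LLM critic: every capitalized word in the
    # insight must appear somewhere in the cited evidence.
    names = [w.strip(".,") for w in insight.split() if w[:1].isupper()]
    return all(any(n in e for e in evidence) for n in names)
```

The design point is that consolidation proposes and a separate check disposes: an insight naming someone who never appears in the cited observations is dropped before it can become stored ground truth.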

03 · 04 — you made it

You gave a mind a memory.

The three verbs. The window as RAM, FIFO eviction, the recursive summary. MemGPT paging by function call, the Generative-Agents stream scored by recency + importance + relevance, decay on last access, and the poisoning problem a critic guards against. A stateless predictor now behaves like it remembers.

Next · 03 · 05

MemPalace →

You scored a flat stream. Now give long-term memory a structure you can navigate — a palace the agent walks, not a pile it searches.
