An LLM is stateless. Between two API calls it forgets everything that isn't in the prompt. "Agent memory" is the engineering that makes a next-token predictor behave as if it had a durable, growing mind — a storage layer outside the window plus a policy for what to write, what to compress, and what to read back at the right moment.
loading…
The context window is RAM: small, costly, lossy in the middle. Long-term memory is a disk outside it. Every named system — MemGPT, Generative Agents, plain vector memory — is just a different answer to the same three verbs.
Long-term memory is a storage layer outside the context window plus a policy that decides three things. Every architecture below is just a different answer to these three verbs.
After each turn, which facts are worth saving outside the prompt — and how (raw log? a labelled fact? a rated observation?).
Storage and the window are finite. Old detail gets summarized or consolidated — capacity bought by losing specifics. The hard part.
At the right moment, pull the few relevant memories back into the prompt. Relevance, recency and importance all matter — not just cosine similarity.
That's the entire field on three lines. The only interesting question is who runs these verbs — a hand-written controller, the LLM itself in-band, or a separate background agent — and how retrieval scores what to pull.
Why it's worth caring: this is measurable. On MemGPT's Deep Memory Retrieval consistency task, GPT-4 with a fixed window scored 32.1%; the same model wrapped in MemGPT scored 92.5%. Same weights — the entire lift is memory engineering.
And it's still unsolved at the hard part: on long conversational benchmarks (35 sessions, 300+ turns), even RAG-augmented LLMs lag far behind humans on temporal/causal reasoning — "what changed since last week, and why."
A bigger window doesn't solve memory: it's finite, costs per token every turn, and is lossy in the middle. MemGPT treats the prompt like physical RAM. Press "+ turn" and watch the FIFO queue overflow into a recursive summary.
Main context has three parts: system instructions (read-only rules + memory-function schemas), a small read/write working context scratchpad for stable facts that's always in-prompt, and the rolling FIFO message queue.
At ~70% a memory-pressure warning fires so the agent can flush facts to disk before data is lost. At 100% the queue manager evicts ~50% of messages and folds them into a recursive summary sitting at the head:
A multi-generational summary that degrades gracefully instead of dropping data on the floor. The raw messages still live in recall storage on disk.
Before the clever architectures, two plain designs. They work — until they don't. The real systems are built to fix exactly their failure modes.
Keep the last N turns verbatim; on overflow, summarize the oldest into a paragraph and prepend. Cheap, no infra. Failure: summarizing a summary, turn after turn, erodes names, numbers, who-said-what until it confidently misremembers.
Embed every message, store it, pull top-k nearest neighbours back each turn. Unbounded storage, nothing forgotten. Failure: relevance ≠ usefulness — cosine surfaces topically-similar text, not the causally needed fact; no recency; returns disconnected fragments.
The single biggest idea both baselines miss is importance — an explicit "how much does this matter" score. Recency + importance + relevance together retrieve far better than relevance alone. That's §05.
Compression is lossy and compounds. The engineering trick that runs through every real system: keep the raw observations underneath, and treat the summary only as an index — never the sole copy.
MemGPT (now the Letta framework) borrows the OS virtual-memory metaphor. The defining move: the LLM edits its own memory by calling functions, and the results feed back as new messages. Step through a real paging loop.
Two tiers. Main context = the prompt (system + working + FIFO queue). External context = disk, reached only via function call: recall storage (all past messages) and archival storage (arbitrary long text the agent chooses to keep).
Each call carries request_heartbeat. If true, MemGPT immediately runs another inference step — so the model can chain search → read page 2 → answer. If absent, the agent yields and waits for the next event. It's literally an OS interrupt/yield loop with the LLM in the driver's seat.
Letta productionizes this as memory blocks — labelled, always-visible, character-capped key/values, editable via tools, and even shareable across agents (edit one, both see it). Plus sleep-time agents that consolidate between turns, off the hot path.
Park et al.'s "Smallville" agents use an append-only stream of timestamped observations — everything is retrieved, nothing is always-in-context. The intelligence is in the score. Drag the weights and watch which memory wins for the query "is Klaus dating anyone?"
Top 3 memories are pulled into the prompt. Bar = total score. The winner is accent-filled.
Recency decays exponentially since last access (factor 0.995/hr) — so frequently-used memories stay warm. Importance is the LLM rating the memory 1–10 at write time ("1 = brushing teeth, 10 = a breakup"). Relevance is cosine similarity to the query — the only RAG part.
Crank relevance only and a stale, low-importance match can win — exactly the vector-baseline trap. Park's reflection goes further: when summed importance > 150, the agent synthesizes high-level insights and writes them back into the stream, manufacturing new knowledge by compression.
Recency isn't decay since a memory was created — it's decay since it was last accessed. Touch a memory and its clock resets. Drag the slider to access the memory mid-decay and watch the curve jump back to full.
At factor 0.995 per sandbox-hour, recency halves roughly every ~138 hours. A memory you never revisit fades smoothly toward zero. But the moment retrieval touches it, last-access resets to now — so it springs back to 1.0 and starts decaying afresh.
That single choice — last access, not creation — gives the stream a sense of habit. Things you keep returning to stay vivid; things you mention once and never again quietly sink.
Defaults to tune, not laws: 0.995 and the 150-importance reflection threshold are tuned to a sandbox where one "hour" is a game tick. Lift them into a real product unchanged and they'll be wrong.
The two architectures differ in who runs the verbs. Most production systems in 2025–26 are hybrids of the two.
| System | Who manages memory | Write | Retrieve | Consolidate |
|---|---|---|---|---|
| MemGPT / Letta | the LLM, in-band (calls functions) | agent decides: archival.insert | agent calls search + paging | recursive FIFO summary; sleep-time agent |
| Generative Agents | a scoring function, declarative | append observation + LLM rates 1–10 | rec + imp + rel score, top-k | reflection (insights written back) |
| Vector / buffer baseline | a hand-written controller | embed + store every message | top-k cosine | rolling summary (lossy) |
When the LLM writes its own memory, a hallucinated core_memory_replace can overwrite a true fact with a false one — and that error becomes the new ground truth, compounding. An adversarial input that gets stored persists across sessions: memory poisoning. Treat agent-written memory as untrusted — version and audit it.
And it's still unsolved at the hard part: temporal/causal reasoning over long histories ("what did the user believe last month vs. now, and what changed it") stays far below human level even with RAG. Don't promise a memory system "remembers everything correctly over months" — it doesn't. Retrieval quality dominates everything: bad embeddings make the most elegant architecture useless.
Alice runs a file-first memory that lines up cleanly with both tiers. sessions.jsonl per chat is MemGPT's recall storage — the durable, replayable, auditable log on disk. Scoped charter_memory_key facts are OpenAlice's Letta-style memory blocks — small, labelled, always-in-prompt — but edited by explicit tooling, dodging the poisoning risk.
Consolidation = the circadian goal. Alice's q9-circadian-memory is Park's reflection + Letta's sleep-time agent, with a safety upgrade: the pass is Critic-gated, so a hallucinated insight is vetoed before it's written back — a critic in the write path as the answer to poisoning. The honest gap, same as everyone's: scored retrieval over the flat log.
The three verbs. The window as RAM, FIFO eviction, the recursive summary. MemGPT paging by function call, the Generative-Agents stream scored by recency + importance + relevance, decay on last access, and the poisoning problem a critic guards against. A stateless predictor now behaves like it remembers.
You scored a flat stream. Now give long-term memory a structure you can navigate — a palace the agent walks, not a pile it searches.