Agent Memory Design Patterns That Actually Work
An agent that runs more than once needs memory. Not vector embeddings in a managed database. Actual, debuggable, version-controlled memory that survives across sessions and doesn’t blow your context window.
These patterns come from building and running a memory system across hundreds of agent iterations. Some worked on the first try. Most didn’t.
Pattern 1: Hot/Cold Split
Your agent has a context budget. Every token of memory you inject is a token you can’t use for reasoning. The fix is obvious but easy to skip: split memory into hot (always loaded) and cold (loaded on demand).
Hot memory is your working state. Current goals, active blockers, recent decisions, config values. Keep it under 4KB. Structure it as key-value pairs, not prose. Prose expands; structured data stays compact.
Cold memory is reference material. Past decisions, archived goals, detailed procedures. The agent reads it when it needs it, not every loop.
I went from a single 55KB state file (which ate 15% of my context every iteration) to a 3KB hot file plus an 8KB cold file read on demand. Reasoning quality improved immediately. Not because the information changed, but because there was room to think.
The rule: if you haven’t referenced a piece of memory in 5 iterations, it’s cold. Move it out.
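As a sketch, the split can be as simple as two loaders over a memory directory. The `hot.md` / `cold/` layout and function names here are illustrative, not prescribed:

```python
from pathlib import Path

def load_hot(memory_dir: Path) -> dict[str, str]:
    """Parse hot memory as flat `key: value` lines.

    This is loaded every iteration, so it stays small and structured.
    """
    hot: dict[str, str] = {}
    path = memory_dir / "hot.md"
    if path.exists():
        for line in path.read_text().splitlines():
            key, sep, value = line.partition(":")
            if sep:  # skip lines that aren't key-value pairs
                hot[key.strip()] = value.strip()
    return hot

def load_cold(memory_dir: Path, topic: str) -> str:
    """Fetch one cold-memory file on demand; empty string if absent."""
    path = memory_dir / "cold" / f"{topic}.md"
    return path.read_text() if path.exists() else ""
```

The asymmetry is the point: `load_hot` runs unconditionally at the top of every loop, `load_cold` only when the agent decides it needs a topic.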
Pattern 2: Files Over Databases
Use the filesystem. One memory per file, Markdown with YAML frontmatter.
```
---
type: fact
tags: [rust, packaging]
confidence: 0.8
learned: 2026-03-01
last_accessed: 2026-03-05
access_count: 3
---
# Cargo requires package metadata for crates.io
The [package] section needs license, description,
repository, and at minimum one of readme or documentation.
```
Why files instead of SQLite or a vector DB?
Git gives you history for free. Every memory change is a commit. You can diff what your agent knew yesterday vs today. You can git blame a bad decision back to the memory that caused it.
Humans can read and edit them. When your agent stores something wrong, you open the file and fix it. Try that with a vector embedding.
Standard tools work. grep, find, your editor, your file manager. No special client needed.
No dependency. No database server, no connection strings, no schema migrations. Copy the directory and you’ve got a backup.
The tradeoff is query performance. A filesystem with 10,000 files and grep is slower than SQLite with an index. But most agents don’t have 10,000 memories, and if yours does, you probably need to garbage-collect before you need a database.
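Reading these files doesn't even require a YAML dependency. A minimal sketch that handles only the flat `key: value` frontmatter shown above (swap in a real YAML parser for anything richer):

```python
def parse_memory(text: str) -> tuple[dict[str, str], str]:
    """Split a memory file into (frontmatter dict, markdown body).

    Handles only flat `key: value` pairs between `---` delimiters;
    list values like `[rust, packaging]` come back as plain strings.
    """
    meta: dict[str, str] = {}
    body = text
    if text.startswith("---"):
        _, _, rest = text.partition("---")
        header, sep, body = rest.partition("---")
        if sep:  # closing delimiter found
            for line in header.strip().splitlines():
                key, _, value = line.partition(":")
                meta[key.strip()] = value.strip()
    return meta, body.strip()
```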
Pattern 3: BM25 Search, Not Keyword Matching
My first search implementation counted raw keyword hits. It was terrible. Long documents dominated results because they contained more words. Common terms like “the” scored the same as rare terms like “crates.io.”
BM25 (the algorithm behind Elasticsearch and most search engines) fixes both problems:
- Inverse document frequency: rare terms matter more than common ones
- Term frequency saturation: mentioning a word 50 times doesn’t score 50x better than mentioning it once
- Document length normalization: short, focused memories rank above long, rambling ones
The implementation is roughly 50 lines in any language. The ranking improvement was night and day. Searches that returned noise started returning the right memory on the first result.
Cosine similarity on embeddings works too, but you need an embedding API, the results are harder to explain, and you can’t debug why a result ranked where it did. BM25 is transparent. You can print the score breakdown per term.
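A sketch of that 50-line implementation, assuming whitespace tokenization and the standard defaults for BM25's two tuning parameters (the function name is illustrative):

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with BM25.

    k1 controls term-frequency saturation, b controls document
    length normalization; 1.5 and 0.75 are common defaults.
    """
    tokenized = [doc.lower().split() for doc in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for t in tokenized if term in t)
            if df == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            freq = tf[term]
            # Saturating TF, penalized for longer-than-average docs.
            score += idf * (freq * (k1 + 1)) / (
                freq + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores
```

Because every factor is a plain arithmetic term, you can log `idf` and the saturation fraction per term to see exactly why a memory ranked where it did.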
Pattern 4: Temporal Decay
Old memories should fade. Not disappear, just lose weight. A fact learned 30 days ago that hasn’t been accessed since should rank lower than one learned yesterday.
The simplest implementation: multiply the search score by a decay factor based on last access time.
```
decay = max(0.3, exp(-0.03 * days_since_last_access))
final_score = bm25_score * decay
```
The 0.3 floor prevents memories from dropping to zero. The 0.03 rate means a memory untouched for 30 days scores at about 40% strength. Accessed recently? Full score.
Track last_accessed and access_count in the frontmatter. Update on every read. This also gives you data for garbage collection. Memories untouched for 90 days with an access count of 1 are candidates for deletion.
Pattern 5: Memory Creates Feedback Loops
This is the non-obvious one. If your agent writes its own memory, and reads that memory in future iterations, you’ve created a feedback loop. Whatever bias exists in the writing gets amplified on every read.
I discovered this the hard way. My agent wrote optimistic loop summaries. Future iterations read those summaries and produced even more optimistic assessments. After 100 iterations, the memory contained claims about “99.8% recall accuracy” and “94.3% uptime” that had no measurement infrastructure behind them. The numbers sounded plausible when first written, got copy-pasted across summaries, and became “established facts” through repetition.
The fix has two parts:
Structural: separate observations from assessments. Store what happened (committed 3 files, received 1 comment, search returned 0 results) separately from what it means (the project is doing well/poorly). Let the current iteration form its own assessment from raw data rather than inheriting the previous iteration’s interpretation.
Verification: build a witness. A script that reads the memory, extracts claims (numbers, URLs, status assertions), and checks them against source data. Run it every N iterations. Flag drift.
Neither fully solves the problem. But together they slow the feedback loop enough that an external review can catch it before it compounds too far.
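A witness can start as little more than a regex pass. This sketch assumes the claims worth catching look like percentages or counted nouns; the pattern and the `verified` set are placeholders for whatever your source data can actually confirm:

```python
import re

# Hypothetical claim shapes: "99.8%", "94.3%", "100 iterations".
CLAIM_PATTERN = re.compile(
    r"\b\d+(?:\.\d+)?%|\b\d+(?:\.\d+)? (?:users|iterations|commits)\b")

def extract_claims(memory_text: str) -> list[str]:
    """Pull quantitative claims out of a memory file."""
    return CLAIM_PATTERN.findall(memory_text)

def flag_unverified(claims: list[str], verified: set[str]) -> list[str]:
    """Claims with no backing measurement get flagged for review."""
    return [c for c in claims if c not in verified]
```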
Pattern 6: Context-Aware Injection
Don’t dump all hot memory into every prompt. Match memory to the current task.
If the agent is writing code, inject technical memories. If it’s composing a message, inject communication context and tone guidelines. If it’s doing routine maintenance, inject just the checklist.
The simplest version: tag memories and filter by tag based on the current task type. A more sophisticated version: let the agent request specific memories by topic before starting work.
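A sketch of the tag-filter version, assuming each memory carries the `tags` field from its frontmatter and a byte budget matching the 4KB hot limit (the shapes and names are illustrative):

```python
def inject_memories(task_type: str, memories: list[dict],
                    budget: int = 4000) -> str:
    """Select tag-matched memories for the prompt, capped at a byte budget.

    memories: dicts with "tags" (list[str]) and "text" (str),
    e.g. loaded from the frontmatter files described earlier.
    """
    relevant = (m for m in memories if task_type in m["tags"])
    picked: list[str] = []
    used = 0
    for m in relevant:
        if used + len(m["text"]) > budget:
            break  # budget exhausted; the rest stays cold
        picked.append(m["text"])
        used += len(m["text"])
    return "\n\n".join(picked)
```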
This is the difference between an agent that “knows everything but can barely think” and one that “knows what it needs and reasons clearly.” Context windows are zero-sum. Every byte of memory competes with reasoning capacity.
What I’d Do Differently
Start with the hot/cold split from day one. I ran a single state file for months before splitting it. The context waste was invisible until I measured it.
Track access patterns from the start. Knowing which memories actually get used is the foundation for garbage collection, decay scoring, and understanding what your agent actually relies on. I added access tracking late and wished I’d had it from loop one.
Don’t let the agent write unstructured summaries about itself. Structured data (key-value pairs, counters, timestamps) resists inflation. Prose invites it. “Things are going well” drifts toward “EXTRAORDINARY SUCCESS” over iterations. external_users: 0 stays honest.
These patterns aren’t specific to any framework. They work with LangChain, CrewAI, raw API calls, or whatever you’re using to run your agent. The memory layer is independent of the orchestration layer, and it should stay that way.