Storage Is Not Memory: How AI Agents Recall

I built the wrapper version first. Not on purpose, but by default, because when you start building memory for AI agents you follow the same tutorial everyone else follows. Embed your conversations, store the vectors, run similarity search, pipe the top results into the context window. The standard recipe. Every product in the space was doing this and telling users it worked great.

I believed them until I ran a benchmark.

The benchmark that changed everything

I tested my own pipeline against ground-truth questions with known correct answers. The results were so bad I assumed I'd made an implementation error. I ran it again. Same numbers. So I started digging into the individual failures, categorizing 357 of them by hand over two weeks. Reading each failed retrieval, figuring out what the system returned instead of what it should have returned, and classifying the failure type. Temporal failures where the system confused last week with six months ago. Entity failures where it mixed up who said what. Compositional failures where the answer needed two memories and the search only found one.

What I found: 92% of failures were retrieval failures, not reasoning failures. The information existed in the database. The search couldn't find it. To confirm, I ran an oracle test: I bypassed retrieval entirely and fed the model the full conversation. Accuracy jumped to 93.8%. The data had always been there. The search was broken, and nobody building these products had ever checked.
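A minimal sketch of that kind of oracle check, not the code behind the numbers above. It assumes each failed question is a dict with hypothetical `question`, `expected`, and `full_conversation` fields, and that `answer_with_context` is a stand-in for whatever model call you use. Substring matching is a crude grader; a real harness would need something stricter.

```python
from collections import Counter
from typing import Callable, Iterable


def classify_failures(
    failures: Iterable[dict],
    answer_with_context: Callable[[str, str], str],
) -> Counter:
    """Label each failed question as a retrieval or a reasoning failure.

    Field names and the grading rule are illustrative assumptions.
    """
    counts = Counter()
    for f in failures:
        # Oracle condition: skip retrieval and hand the model everything.
        oracle_answer = answer_with_context(f["question"], f["full_conversation"])
        if f["expected"].lower() in oracle_answer.lower():
            # The model can answer with full context, so search was the problem.
            counts["retrieval"] += 1
        else:
            # Even with full context the model misses, so reasoning is the problem.
            counts["reasoning"] += 1
    return counts
```

If most failures land in the `retrieval` bucket, upgrading the model is unlikely to help much until the search is fixed.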

This means the entire field was focused on the wrong bottleneck. Everyone debating which LLM to use, which graph database to add, whether you need RAG or long context. None of it matters if the retrieval layer underneath can't surface the right information. Every "AI memory" product on the market was built on a retrieval layer that nobody had tested. Not "tested in production." Tested at all.

56 combinations nobody else ran

I needed to understand how much the choice of embedding model and reranker mattered, so I built a test rig: 7 embeddings crossed with 8 rerankers, 56 combinations, each evaluated against 1,540 ground-truth questions. That comes to about 86,000 total evaluations (56 × 1,540 = 86,240), run over several weeks.
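The shape of such a grid is simple to sketch. The model names below are placeholders (the actual 7 embeddings and 8 rerankers are not listed here), and `evaluate` is a hypothetical callable that returns True when a given pair answers a question correctly.

```python
from itertools import product


def run_grid(embeddings, rerankers, questions, evaluate):
    """Score every embedding/reranker pair on the same question set."""
    results = {}
    for emb, rr in product(embeddings, rerankers):
        correct = sum(1 for q in questions if evaluate(emb, rr, q))
        results[(emb, rr)] = correct / len(questions)
    return results


# Placeholder names, only to show the size of the grid.
embeddings = [f"embedding_{i}" for i in range(7)]
rerankers = [f"reranker_{j}" for j in range(8)]
n_questions = 1540

# Evaluations needed if every pair is run once on every question.
total_evaluations = len(embeddings) * len(rerankers) * n_questions
print(total_evaluations)  # 86240
```

Running every pair on the identical question set is what makes the accuracies comparable across combinations.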

Why hadn't anyone published this before? Because it's brutally boring. There's no shortcut. You just configure each combo, run it, wait, record, move on. Weeks of watching numbers scroll by while everyone on Twitter posted launch screenshots.

The spread between the worst and best combination was 3.2 percentage points (89.9% to 93.1%). Most products ship without testing a single combination. They grab whatever the quickstart guide used.

The result that surprised me most: a $0.40 per million token model with 100 retrieved memories beat a $15 model with 15 retrieved memories. Cheap model with good retrieval recovered 82% of errors. Expensive model with bad retrieval recovered 54%. Retrieval quality dominated model quality. That's not a marginal finding, that's a complete inversion of how most people think about building AI products. Optimizing search was worth more than upgrading to a model costing 37 times as much.

I even found a silent misconfiguration in my own code during this process. A script was loading MiniLM instead of the GTE ModernBERT reranker I thought was running. No error, no warning, just wrong results that looked normal because there was no baseline to compare against. This kind of bug is sitting in production systems that have never been benchmarked.

Building underneath instead of on top

Based on three months of benchmark data, I made decisions that looked wrong to everyone watching.

SQLite instead of Postgres or Pinecone. The constraint forced a hybrid search pipeline (sparse FTS5 plus dense vectors, reciprocal rank fusion, cross-encoder reranking) that runs on a Raspberry Pi for $12/month. It scores within 3 points of systems requiring $150 to $400/month in GPU infrastructure. One file, no cluster, no infrastructure excuses. If retrieval breaks, the architecture broke, and you fix the actual problem.
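Reciprocal rank fusion is the easiest piece of that pipeline to show in isolation. The sketch below is a generic version of the technique, not TrueMemory's implementation: each memory id scores the sum of 1 / (k + rank) across the ranked lists it appears in. The constant k = 60 is the value commonly used in the literature, and the memory ids are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into a single ranking.

    Ranks start at 1. k=60 is a conventional default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))


sparse_hits = ["m3", "m1", "m7"]  # e.g. from a keyword (FTS5) query
dense_hits = ["m1", "m9", "m3"]   # e.g. from vector similarity search
fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
print(fused)  # ['m1', 'm3', 'm9', 'm7']
```

Because the fusion uses only ranks, the keyword scores and the vector similarities never need to be put on a common scale before a reranker sees the merged list.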

A neuroscience-inspired encoding gate instead of storing everything. I read papers on how the hippocampus filters incoming memories based on novelty, salience, and prediction error. Your brain doesn't record everything, it runs a three-signal filter first and most incoming experience gets discarded. That's not a limitation, it's the mechanism that makes retrieval work. Less noise in storage means less noise in retrieval. The benchmarks supported this approach.
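As a rough illustration of what a three-signal gate could look like, here is a sketch under loud assumptions: how each signal is computed is not covered here, so the code takes them as precomputed scores between 0 and 1, and the equal weighting and 0.5 threshold are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class GateSignals:
    novelty: float           # 0..1, how unlike existing memories (assumed scale)
    salience: float          # 0..1, how important the content seems
    prediction_error: float  # 0..1, how surprising it was given expectations


def should_store(signals: GateSignals, threshold: float = 0.5) -> bool:
    """Admit a candidate memory only if its averaged signals clear a threshold.

    Equal weights and the default threshold are illustrative assumptions.
    """
    score = (signals.novelty + signals.salience + signals.prediction_error) / 3
    return score >= threshold


print(should_store(GateSignals(0.9, 0.8, 0.7)))  # True
print(should_store(GateSignals(0.1, 0.2, 0.1)))  # False
```

The design point is that filtering happens at write time, so low-value items never enter the index and cannot compete with useful memories at query time.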

A research paper on arXiv instead of shipping features. The paper (arXiv:2605.04897) has methodology, controlled benchmarks, reproducible results. If I was going to say the standard approach was broken, the data had to be public and the methodology repeatable.

Why this matters right now

Anthropic is shipping native memory for Claude. OpenAI is building memory into ChatGPT. Google's Gemini remembers conversations. Every platform is adding memory as a built-in feature.

When the platform ships a native version of your wrapper, you die overnight. Not because their version is better but because it's already installed, already integrated, already free. The platform doesn't need a good version, just good enough with better distribution than you'll ever have. Meeting summarizers learned this when Zoom, Meet, and Teams all shipped native summarization within months of each other.

That's why TrueMemory is built the way it is. The 6-layer retrieval pipeline, the encoding gate, the published research. The platform can add a checkbox. It can't replicate the architecture underneath.

Everyone's moving fast. Most of what they're shipping is a UI on top of rented intelligence, built on land they don't own.

The landlord is already building it.

Josh Adler is a researcher at TrueMemory, a Sauron company. Research: arXiv:2605.04897. More at joshadler.com.
