Yes. It's _that_ Milla Jovovich (actress known for Resident Evil). This definitely wasn't on my 2026 Bingo Card.
Confirmed in this post: https://www.instagram.com/reel/DWzNnqwD2Lu
This really shows how ideas are worth more than the code itself nowadays. I haven't tried the project myself yet, but if the benchmarks are correct, this looks like a major breakthrough. Even more so coming from someone who (AFAIK) is not technical.
This is amazing. Well done Milla & team!
Btw, I already love the memes around this: "Missed the chance to call this Resident Eval"
Turns out that when something is too good to be true... it usually is a scam. As in this case.
One more meme: Missed the chance to call it Multi-Pass
The 'store everything vs. synthesize' axis is the interesting design question here. MemPalace bets that retrieval quality is the problem — keep everything verbatim, make it findable. That's a legitimate approach.
We went the other direction with Lodestar (launching tomorrow). The bet is that you don't need the transcript but you do need the decision layer. What was decided, why, what was rejected. Synthesized by an LLM at session end into a plain markdown file that commits to the repo.
The other structural difference: the file lives in the codebase, not on the machine. git clone on a new laptop and the context is already there. Different tradeoff than a local vector DB.
Curious whether MemPalace users find the verbatim storage approach holds up as session count grows — the retrieval quality question gets harder with more history.
Would love to learn more about Lodestar
Interesting to see this gaining traction. I've been building something similar, in a sense, called XI (eXtended Intelligence), which takes a different angle on the integrity problem. MemPalace solves retrieval really well; XI is more focused on what happens when memories are wrong or conflict over time: signed journals, amendment overlays (amending, not replacing, memories that need updating for whatever reason), and auditing. Happy to share the project if anyone is interested.
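A minimal sketch of the signed-journal-plus-overlay idea, since it's easier to show than describe. This is my toy illustration, not XI's actual code: the record fields, the HMAC signing, and the replay logic are all assumptions, and a real system would use proper key management rather than a hardcoded secret.

```python
import hashlib
import hmac
import json

SECRET = b"demo-key"  # stand-in only; real systems need real key management

def sign(record: dict) -> dict:
    """Attach an HMAC signature over a canonical JSON encoding of the record."""
    payload = json.dumps(record, sort_keys=True).encode()
    return {**record, "sig": hmac.new(SECRET, payload, hashlib.sha256).hexdigest()}

def verify(entry: dict) -> bool:
    """Recompute the signature from the entry body and compare in constant time."""
    body = {k: v for k, v in entry.items() if k != "sig"}
    payload = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(entry["sig"], expected)

def current_view(journal: list[dict]) -> dict:
    """Replay the append-only journal: amendments overlay earlier memories
    instead of replacing them, so the full audit trail stays intact."""
    view = {}
    for entry in journal:
        assert verify(entry), "tampered entry"
        view[entry["memory_id"]] = entry["text"]  # later amendments win
    return view

journal = [
    sign({"memory_id": "m1", "kind": "original", "text": "deploy is on Fridays"}),
    sign({"memory_id": "m1", "kind": "amendment", "text": "deploy moved to Tuesdays"}),
]
```

The amendment never deletes the original entry, which is what makes after-the-fact auditing possible.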
Many commenters are, mostly fairly, criticizing the repo over the issues raised with its benchmarks. That's reasonable; however, many go further and bash the repo itself, likely because of who the authors are.
For me, there's a silver lining: I'll be implementing MemPalace for a few small agents to get locally managed memory portability.
I think the benchmarker who ran independent tests in GitHub issue #39 summed it up best:
To be clear about what this all means for our own use case: we still think there's a real product here, just not the one the README is selling. The combination is genuinely useful: a one-command ChromaDB ingest pipeline for Claude Code, ChatGPT, and Slack exports; a working semantic search index over months or years of conversation history; fully local, MIT-licensed, no API key required; and a standalone temporal knowledge graph module (knowledge_graph.py) that could be used independently of the rest of the palace machinery. We're planning to integrate it into our Sandcastle orchestrator as a claude_history_search MCP tool exactly along those lines.
https://github.com/milla-jovovich/mempalace/issues/39
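For anyone curious what that ingest-then-search shape looks like, here's a rough sketch. The JSONL field names are guesses (check your actual export), and I'm substituting a toy lexical scorer where the real pipeline would use a ChromaDB collection's embedding search (`collection.query(query_texts=[...], n_results=k)`).

```python
import json
from collections import Counter

def ingest(jsonl_lines):
    """Parse a Claude Code-style JSONL export into (id, text) documents.
    Field names here are assumptions, not the export's documented schema."""
    docs = []
    for i, line in enumerate(jsonl_lines):
        rec = json.loads(line)
        text = rec.get("text") or rec.get("content", "")
        if text:
            docs.append((f"msg-{i}", text))
    return docs

def search(docs, query, k=3):
    """Toy token-overlap scorer standing in for embedding similarity;
    a real deployment would add+query a ChromaDB collection instead."""
    q = Counter(query.lower().split())
    scored = []
    for doc_id, text in docs:
        overlap = sum((q & Counter(text.lower().split())).values())
        scored.append((overlap, doc_id, text))
    scored.sort(reverse=True)
    return [(doc_id, text) for _, doc_id, text in scored[:k]]
```

The value proposition in the quote is exactly this loop run locally over your own history, with no API key in the path.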
so the 100% LongMemEval score is a bit misleading if you actually look at what's going on. they took the 3 questions that were failing and applied targeted fixes for those specifically, plus LLM reranking on top. if you hold out those fixes the actual score is 98.4%. still good, but not "100%" good, y'know
same story with LoCoMo: the 100% score uses top-k=50, which literally exceeds the session count lol, with reranking on top. honest top-10, no rerank, gets you 88.9%
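to see why k=50 over fewer than 50 sessions is meaningless, here's a quick recall@k sketch (my toy numbers, not the actual LoCoMo setup): once k exceeds the corpus size, even a ranking that ignores the query scores perfectly.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top-k retrieved list."""
    hits = sum(1 for rid in relevant_ids if rid in ranked_ids[:k])
    return hits / len(relevant_ids)

# toy corpus: 20 "sessions". with k=50 every session is "retrieved",
# so retrieval quality stops mattering entirely
corpus = [f"s{i}" for i in range(20)]
relevant = {"s7", "s13"}

bad_ranking = sorted(corpus)  # ignores the query completely
assert recall_at_k(bad_ranking, relevant, k=50) == 1.0
```

that's why the honest number is the top-10, no-rerank one — at k=10 the ranking actually has to find the right sessions.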
this is giving openclaw energy where you engineer your benchmark results to look perfect and then market it as some breakthrough. the underlying tech might be interesting but leading with "highest score ever published" when the methodology has these kinds of asterisks is not great
cool that milla jovovich is vibe coding tho i guess
Certainly not the best result compared to MemPalace, but I was going for something similar with my project: https://github.com/softcane/contextgc
If you're really looking for a memory solution for your agents: https://github.com/rohitg00/aegntmemory
Does your agent remember how to spell your repository?
This is definitely out of the blue in terms of the person behind it. It's a cool concept as well (basically compressing meaning using language only).
That being said, I can't help but wonder whether stuff like this is better done with auto-encoders. The implementation in dialect.py seems very "narrative" oriented, probably not that good for things like coding.