This time we are gonna be going into a paper called SWE-Exp, which is "software engineering experience," which is basically a system for autonomous debugging over many runs, using memory as a way to get some recurrence across tasks. And so that's the paper that we're going to look at.
We're going to look at the precursor paper to it as well, which is called SWE-Search, where that one is kind of doing the same thing, but it's using a Monte Carlo tree search. We've talked about Monte Carlo tree search a few times before, back when we did the RL discussion.
We also talked about it a good bit during prompt engineering. So this is going to be another case where we can see how this technique gets used.
But it's just kind of an interesting tidbit, because what we really care about tonight is memory.
The concept of agentic memory, which is really one of the major pillars of what it means to be an agent. The major components of an agent are, of course, the ability to call tools, some sort of context engineering layer, but then also statefulness.
The way that we get statefulness out of these agents is through this thing that we call memory.
These are generally referred to as stateful agents: giving them some way so that when a session ends, what they learned doesn't just go up in smoke. So that's what we're going to talk through tonight. We're going to go through the paper in detail. I have lots of little tidbits and visualizations of the different concepts that we're going to go through.
And then I'm gonna talk a little bit at the very end about some off-the-shelf libraries you can use; you can see them on my bar here, Mem0 and Letta. We'll talk a little bit about those because those are two big ones in the space, and we'll sit with those a bit.
Like I was telling Jay, we're not super crammed for time, though there is a lot of content. So this could go long, but if we need time to sit and talk, we can do that for sure, because I think it'd be good to talk through issues we've had with memory, and experiences we've had dealing with these capabilities in the products that do exist. Right, with that, I used Bill Murray in Groundhog Day here as the example, because I think this captures the essence of what this paper is. If you somehow do not know what Groundhog Day is, this is a movie where the main character gets stuck in a time loop and has to live the same day over and over and over again.
Nothing persists except his memory of the day before.
And this is fantastically similar to what this paper is really doing.
Obviously, pared down quite a bit, but I thought that was a good call out here. So this is the system that we’re going to be looking at.
And the main idea here is that there’s some sort of issues that exist in a repository somewhere.
And I want to be able to use all of these issue-solving paths, which they call trajectories.
I want to be able to take those good or bad episodes and harvest the goodies out of them and then be able to store them away in a nice, useful way.
And then retrieve them in a nice, useful way so that I can not mess up so much. And that’s kind of the memory system that they are proposing. And so there are three major elements here that we’re going to talk about.
And I think each of these three elements could be harvested out.
And you could just use that one element and in your whatever your memory system is, and that’d be perfectly fine.
So, you know, this is really kind of a snapshot of where this sort of memory research is right now.
I would say that the right memory system for you is going to be some amalgamation of all the different things that we’re going to talk about and things that we’re not going to talk about.
So the first of the three major components here that we're talking through is this experience bank, where it's looking for experiences that represent comprehension, the more abstract knowledge about a thing that isn't tied to one very specific detail.
So it could be memories like: you have race condition issues in some part of the code, and generally things that deal with mutexes give you an ulcer.
You know, that might be a comprehension style of memory. Then there are memories that have to do with modifications, which are very specific changes that you did and didn't make in the past, and it logs those as well: things like overall failed attempts, and then also a way of matching some sort of solution from the past to your current problem. So you need some way of doing that that goes beyond just querying a vector database and hoping.
And so we’re gonna go into kind of how they do that.
The other one that they have here is the MCTS search, this Monte Carlo tree search.
We'll do a small recap of that whenever we go over SWE-Search.
But this is basically just a way for us to do smart exploration.
It does do some interesting things with how it does some back propagation for scoring different values.
So that’s kind of cool.
But not really what we're going to talk about here. The big idea is just finding an efficient way of trying things, basically throwing paint at the wall in a way that's a little bit smarter than brute forcing. And then the other idea that they have here is the dual-agent concept at runtime. This is pretty normal for us now: if you go use Claude Code, they have the planning mode and the acting mode, and it's the same thing with Cline. This is a fairly normal pattern, but they use it here with what they call the instructor and the assistant.
And these are the two major patterns here.
This is a way that they use to get around the loops.
And it’s definitely effective.
I implemented this very late when I was doing my prototype, and it instantly solved all of my issues where it was getting stuck in loops, which is exactly what they say it's for. So this is a very useful pattern to consider, at least the dual-agent approach.
And we’ll talk a little bit about that here.
They also have three other agents that exist here, one of which is the issue resolution agent, which is basically crawling GitHub to harvest trajectories. Then there's the experiencer agent, which is kind of an interesting one: say you have a GitHub issue thread and its change set; it tries to put itself in the shoes of whoever made that change set and harvest trajectories in order to bootstrap the system.
That's what the experiencer agent is doing. It also serves as your query agent for those experiences.
And there’s also a re-ranker agent, which we’ll talk about.
We talked a lot about re-ranking last go-around, and this is in that vein too. So there's lots of little agent-y stuff, which of course increases complexity, but depending on what you're doing it can be the right thing to do. All right, so that's the ten-mile view. Let's talk a little bit about the problem that we're trying to solve with these agents.
And the main issue, and I think we all know this frustration: if you're working with Claude Code or you're working with ChatGPT and the conversation is starting to get a little long in the tooth, it starts forgetting stuff.
And the only way to get it back to normal, once you get evicted by Anthropic or OpenAI, is to start a new chat window. It really stinks, because then you've suddenly lost all of your context and you don't have a good way of getting it back. And so this is a fairly common problem with those long-running chat instances, but even with those, we're generally staying in that chat, and you have the option to pick the chat back up in a different session. When you're dealing with an AI agent that doesn't have memory set up, it's even less like that, because you're basically just spinning it up to do something; it goes off and does the task and then it disappears.
And so it’s more like a workflow than what we would consider generally as an agent.
This has lots of issues.
It means that you can generally waste a lot of compute. It will go down the wrong paths multiple times.
If it gets past a hurdle finally on one run, it won’t be able to harvest those gains for another run.
And so say for here, we’ve got bug two where it’s able to actually solve a bug that has some sort of similarity to the other ones.
That doesn’t persist unless you have some sort of advanced retrieval engine.
It's not going anywhere. The idea with the memory agent is that we want to work around this fixed context length, since it's not possible to feed the entirety of human knowledge into the thing. We've got to deal with those constraints: how do we inject little high-value snippets into it?
And so that’s what these memory systems are trying to do.
You know, there’s obviously eventually the dream is that these models will be able to continually learn and continually compile these things.
That is not something that’s going to be happening anytime soon.
And at least not to great effectiveness. There’s some people that are trying that sort of actual continual learning stuff. I think it would be good to find a way of doing it that does not require continual training. And this sort of memory system is that way of doing it.
So how can we make these agents, not just good searchers for that one time, but to start remembering stuff over long periods of time?
And that’s the goal that we’re gonna kind of try and talk about.
Imagine if you had an intern and you had to teach them like it was their first day, every day that you had them. Sometimes it can feel like that, but in truth, they are learning, and we want our agents to learn as well. All right. Any questions?
I posted one in the chat. I'm not sure what you meant by stuck in a loop, but something I'm running into is where Cline is making the same edits to the same — it may be one file — just the same thing. I have to reject it to kick it out of there, or cancel and try again. Or I got into something the other day that was harder to detect because it was editing a bunch of stuff for me. And all of a sudden I think, I've seen this file before, and I scroll up into history, and it's like, no, it made the same edit like four files ago.
And I watch it, and I bet the next file is going to be the same one again. So it had like four or five files in a loop, just continually spinning.
And luckily it’s not something I’m spending money on.
But, you know, is that what you mean?
Or is this something else?
Exactly what I mean. Yes.
Okay.
We’ll even have a fancy visualization later, showing that exact scenario.
Okay.
Definitely search can be one where it happens and edit can be one where it happens a whole bunch.
They can get stuck in those loops.
That’s actually one of the big issues with the current GPT OSS model is it’ll get stuck in those loops really easily too.
Yeah. All right. Any other questions and comments?
Has anybody here kind of played around with memory sort of systems on their own? Is something that’s familiar to folks?
This is brand new to me, other than super, super old expert-type — well, I think they were called expert systems at the time — where you had a set of rules, and you had to keep up with what was true, update that, and then see what rules fell out. That's way different than this, I believe.
Yes, yes. And you might get to that sort of thing. We'll talk about that.
There's this concept of sleep-time compute. We have test-time compute and training-time compute,
and now people are talking about sleep-time compute, where it's compiling the stuff in the background. That might end up looking like your expert rules, just AI-generated in an interesting way.
But we won't talk too much about that today, though it is also a very interesting topic. Let's see — something, Chris?
Playing around with a RAG database.
Yes, yes, RAG is a good first step into memory sort of stuff, and we'll talk about that. All right, so we're gonna talk about four different systems of memory.
This is not exhaustive, but I think these cover probably about 90% of ways that people are handling memory systems right now.
I kind of got four different analogies for this, one of which is the whiteboard.
So this is your working memory; it's also what we think of as the context window. It's the active thinking space of right now.
It forces very strict token budgets.
And as soon as you're out of budget, you've got to find a way to kick something out, or compress something in a way that loses fidelity.
And so we’ll talk about that one.
Then we’ll talk about the researcher’s desk. Think about this as your categorical filing cabinet. Choose your analogy here, but some sort of hierarchical memory that has hot to cold storage, that sort of concept. Then there is something called the playbook, which is experiential memory. This is where we’re really looking at something called episodic memory.
This is very big in RL with robotics.
It's also very much how our primary human memory system works. And so it's also one of the main things that we'll be looking at inside of
this system, this paper that we’re talking about.
And there’s also the concept of like a subway map, which is more of your semantic memory.
It’s where we’re really looking more towards the connections between things, nodes and edges.
It’s very graph database feeling, although it doesn’t have to be technically a graph database, but generally that is kind of how people deal with it.
So we’ll start off with the easiest one.
So this is the working memory.
The common patterns for this are ways of dealing with just my chat context.
I've got my chat context; I'm not going to use some sort of external system for this. I'm going to try and manipulate my context window from inside of here. This is where we're doing stuff like using scratch pads and chain-of-thought reasoning. And we're really thinking about the scratch pad here as more of a persistent scratch pad.
So think of a CLAUDE.md, when you're using it to store lists of changes. I think when we were going through Jack's presentation, he was showing the way Replit does it, where it has its changelog of stuff.
That would be an example of using the context to serve as a cheap memory system.
And that's a perfectly valid way to do it.
Obviously it works for Replit, and they're a very large company with employees, so it's perfectly reasonable to do it this way.
But the problem with the conversation-context method is that there is a fixed size, even if that size is getting bigger.
You know, Claude 4, I think they just announced they're up to one million tokens, but they started at 200K. GPT-5, the new one, is at 400K. There's a fixed amount here.
And not all tokens are necessarily treated the same within this context, unfortunately. [Inaudible comment about the attention curve.] Yes, that is very unfortunate. But yeah, so generally how this goes is that we have our conversation and we have one of two options.
We can keep going, and as we run out of context window, things get evicted and just fall out the other end. If you don't do anything, generally it'll just fall out the other end and the model will just forget the stuff. One thing that we've seen now, though, is that they'll start summarizing things.
So say I've kicked multiple messages out of here.
Maybe I'm kicking these things out, but I'll add a summary of them and try to keep things going. But obviously the longer that goes, the more context that summary takes up, so the less real context you have, and eventually you can't compress it any more in a meaningful way. And so you hear about these people who have been having these ChatGPT-4o conversations for like two years, and they flipped out when the model got deprecated. Those long, three-billion-token conversations they've been having: that's compressing conversation context, along with some other tools.
And so yeah, the idea here is it’s first in, first out eviction, but they can do some smart compaction.
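To make that concrete, here's a minimal sketch of that evict-and-summarize behavior, under assumed numbers; the token counter and summarize function are crude placeholders for a real tokenizer and an LLM summarization call.

```python
# A minimal sketch of working memory: FIFO eviction with summary
# compaction. MAX_TOKENS, count_tokens, and summarize are illustrative
# stand-ins, not any particular vendor's implementation.
from collections import deque

MAX_TOKENS = 8000  # assumed budget; real windows run 200K-1M tokens

def count_tokens(text: str) -> int:
    # Rough proxy: ~4 characters per token for English text.
    return len(text) // 4

def summarize(messages: list[str]) -> str:
    # Placeholder for an LLM call that compresses evicted turns.
    return "SUMMARY: " + " | ".join(m[:40] for m in messages)

class WorkingMemory:
    def __init__(self, budget: int = MAX_TOKENS):
        self.budget = budget
        self.summary = ""              # compacted history
        self.turns: deque[str] = deque()

    def add(self, message: str) -> None:
        self.turns.append(message)
        evicted = []
        # Evict oldest turns first until the context fits the budget.
        while self._used() > self.budget and len(self.turns) > 1:
            evicted.append(self.turns.popleft())
        if evicted:
            # Fold evicted turns into the running summary. Each pass
            # loses fidelity, which is the failure mode described above.
            self.summary = summarize([self.summary] + evicted)

    def _used(self) -> int:
        return count_tokens(self.summary) + sum(map(count_tokens, self.turns))

    def context(self) -> str:
        return "\n".join([self.summary, *self.turns])
```

The design tradeoff is visible right in the loop: every compaction buys room for new turns by making the old ones blurrier.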
And I think just about everybody probably has dealt with this form of memory. All right, so let’s pop over to the next one, which is this hierarchical memory system.
This is something that's probably a little more similar to what you're talking about, Chris, where you're using some sort of RAG system to create a soft memory layer: it's storing different thoughts, and you're doing some sort of similarity search or keyword search to grab things from different levels of storage. And generally, this will work alongside that context layer, so you'll have something doing that summarization, those rules, and the reminders.
So, a CLAUDE.md that's hooked up to something like the Neo4j memory MCP, which is one of the ones they try to get you on right out of the gate.
It works pretty well.
So I'm not saying that in a derogatory way; that's a perfectly good thing to do. There's some sort of graph database that has semantic memory in it. This would be an example of a hierarchical memory system.
So I’m trying to do multiple levels in order to get a more layered memory, which is really ideal.
You want to be working with hierarchical memory. So you generally have that in-context RAM store, and then you have something in your hot storage, which is your vector DB. But, you know, computing vectors is hard.
Computing semantic relationships is harder, like with a graph database.
And so sometimes you just have a blob of everything.
A good example of this would be ChatGPT, where they have your memories that you can go look up, and it says: here are the things about Joe, with a bunch of facts about you. That's stuff it can go get quickly without much issue. But it can also just search all of your conversations, if you have certain things set up. Compared to that hot storage, doing a larger, broader search takes longer, but if you go to the cold storage, you can possibly mine up some additional context. And so that's the idea here. I've got a little visualization. So I think we're gonna give them a query of: how do I fix authentication errors?
That’s something that’s gonna be very common.
It’d probably be something that was worthwhile for us to index. And so the user asks the query, goes to the AI assistant, and it says, hey, I can’t find it in my context window. I’m going to go search my next one down, which is the vector database.
I can find it.
Cool. I'm going to go get that documentation and add it into the context. So it's promoting it into the context, and it goes back to the user. But maybe I have a more niche question, like: how do I migrate from SOAP to REST? That might not be something you have indexed out of the gate.
And so I go to my context window, can't find it; go to my vector database, can't find it; but I can go into my full corpus, go to my SharePoint site or whatever it is, and I'm able to find the thing. At that point I put it into my context: break it up, run whatever pipeline I need to get it freshened up for my context, and then I'm able to go and do it. Another way to think about this would be a web search; you could think of that as a search out into the full corpus of the internet. That's of course just general retrieval, but we can also do this for memory. So I think the example of searching past chat conversations would be a good proxy here.
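Here's a minimal sketch of that tiered lookup with promotion into the context window; the tier contents are hypothetical stand-ins for a real context window, vector DB, and cold corpus.

```python
# A minimal sketch of hierarchical (hot/warm/cold) retrieval with
# promotion. The tiers here are toy dictionaries and string matching,
# standing in for a context window, a vector store, and a full corpus.
def search_context(query: str, context: list[str]) -> str | None:
    # Hot tier: what is already sitting in the context window.
    return next((c for c in context if query.lower() in c.lower()), None)

def search_vector_db(query: str) -> str | None:
    # Warm tier: stand-in for a vector-similarity search (e.g. Chroma).
    warm = {"authentication errors": "Auth errors: check token expiry."}
    return warm.get(query.lower())

def search_cold_storage(query: str) -> str | None:
    # Cold tier: stand-in for a slow full-corpus search (SharePoint, web).
    return f"Cold-storage hit for '{query}' (slow path)."

def retrieve(query: str, context: list[str]) -> str:
    # Try each tier in order; on a hit, promote it into the context.
    for tier in (lambda q: search_context(q, context),
                 search_vector_db,
                 search_cold_storage):
        hit = tier(query)
        if hit:
            context.append(hit)
            return hit
    return "not found"

ctx: list[str] = []
print(retrieve("authentication errors", ctx))       # warm hit, promoted
print(retrieve("migrate from SOAP to REST", ctx))   # falls through to cold
```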
This is the first of the ones that are a little bit more complex.
Any questions about the hierarchical memory?
In what you've seen so far, is it all based on a single user and what they've been working on?
Or have you seen this — I think it would be pretty interesting, like for the team that I'm working with — where somebody was doing something and found it to be a very useful thing? Like you were saying earlier: hey, how do I switch over to, I don't know if it was SOAP to REST or whatever, something like that. And it actually worked well, you know, and they got good results from it.
Could they actually throw that into like a communal experience bucket somewhere? Yes, absolutely.
There are many things that do that.
I think there’s one product called Dot where they’ve got a branch off of that called HiveMind or something like that.
OK.
And it’s you pretty much have all of the same things.
but you have the additional aspect of doing network analysis, where you're looking at influencers, who the subject matter experts are for areas, what the things are that people know about or have helped on, and reputation scores; stuff like that gets added in. So it's more complex, but everything that has to do with memory in general is straight-up copy-pasted from this for sure.
Last call.
Yeah, that’s the one. I’ve seen several people talking about that sort of thing.
The one that’s coming to mind though is the dot application and their hive mind thing. So I think they talked about that recently. All right, any other questions on hierarchical memory? All right, we’ll go into episodic memory next, which this is probably of these I think is the most exciting.
It has the most value.
The common patterns for this are stuff like RL experience replay buffers, used heavily in robotics. The big thing with episodic memory is that there is a strong temporal element.
It really cares about the sequencing of when things happened, and some sort of time window. I mean, it's an episode.
We know what episodes are.
It's like an episode on TV.
It's some sort of time window that is meaningful for memory.
And so this can be captured through things like trajectory logs — those can be raw trajectory logs, or they can be condensed, which is what we'll actually see in this system — but also things like reasoning. These can be stored by some sort of temporal boundary, but you can do other sorts of partitioning as well. One example of that partitioning is what they do here: they partition on issues, where they treat a GitHub pull request / issue as an episode.
Even though that technically doesn't have a time window — though you could finagle your way back into saying it does — there are multiple ways you can partition episodes, but generally it follows some sort of timeline.
And yeah, the retrieval for these can be done in different ways.
We'll go more into that when we get there.
I have a little visual here showing this.
So we have a left-to-right view of different bugs that we've fixed.
So we have like one here that’s the updated API endpoint.
And this thing has a little trajectory that goes along with it.
Of course, where are all my stuff?
Oh, okay.
I got to do the thing. There we go. And so each one of these have like a little trajectory.
So this one that happened three weeks ago: let's say we found the endpoint, updated the route, and did a whole bunch of other stuff.
And so that’s what’s kind of hooked inside of this thing.
And some of these were successful and some of them were failed. So I had one here that was like resolving a race condition where I tried to do some stuff. I found some code and I decided to add a mutex and that apparently didn’t go well.
And so it's just different stuff like that, where we're bounding it. Right now I have an issue — not exactly the one inside the paper — where we clicked a checkbox and suddenly, instead of just the option I picked being selected, all of the options ended up selected. And it finds — and this is what they found in the paper — that the most likely related thing was a very deep state-mutation bug that was fixed three days ago.
And so it’s going to go and find and pull those things in along with however many other things I decided to pull, what my K was.
And this is just the idea: I'm trying to look for different episodes, using the trajectory in general as a pathway to find similar clusters of fairly messy data.
That would be the way I'd put it, because now I can go look at this episode and go find all the code and related stuff.
There’s a different way of querying and indexing on things where you’re looking for similarity of trajectory.
And we’ll talk about how we optimize this sort of thing.
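As a rough illustration, here's a minimal sketch of an issue-bounded episodic store; the similarity function is a toy word-overlap stand-in for an embedding search, and the bank contents just mirror the examples above.

```python
# A minimal sketch of episodic memory keyed on issue-level episodes,
# retrieved by naive trajectory/issue similarity. Field names are my
# own mapping of the talk's description, not the paper's schema.
from dataclasses import dataclass

@dataclass
class Episode:
    issue: str                 # the episode boundary (one issue/PR)
    trajectory: list[str]      # ordered actions, e.g. tool calls
    success: bool              # failed episodes are kept too

def similarity(a: str, b: str) -> float:
    # Toy token-overlap score; a real system would embed and use cosine.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def retrieve_episodes(query: str, bank: list[Episode], k: int = 3):
    # Rank whole episodes against the new issue and return the top k.
    ranked = sorted(bank, key=lambda e: similarity(query, e.issue),
                    reverse=True)
    return ranked[:k]

bank = [
    Episode("update API endpoint route",
            ["find_endpoint", "update_route"], True),
    Episode("race condition in worker pool",
            ["find_code", "add_mutex"], False),
    Episode("checkbox selects all options (shared state mutation)",
            ["find_state_module", "patch_shared_state"], True),
]
for ep in retrieve_episodes("clicking one checkbox selects all options", bank):
    print(ep.issue, ep.success, ep.trajectory)
```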
Does that kind of make sense what we’re doing with this trajectory search here?
We’ll go through this a few times, just because it is kind of weird and complex.
That’s really what the paper is about.
Yeah, I mean, the abstract part of it is like: yeah, here's a thing. And I'm like, yeah, okay, I trust you. Yeah, we'll dive in.
Okay, got it.
We will dive into this.
All right, so there's one last pattern we're gonna talk about, which is semantic memory. This is when you start really looking at knowledge graphs, graph databases, things like that.
The same way the other one was tied to a specific time, place, and condition context,
this one is not tied to that. We are still looking at things like relationships, but they're going to be more abstract sorts of connections. So things like knowledge graphs, ontologies and taxonomies; even parametric memory, like the LLM weights themselves, could technically fall into this category,
although those are kind of cooked in a little bit deeper.
But the big idea is that there's some sort of entity, and it has some sort of relationship to some other entity. Where the other one was very loosey-goosey, this one is more structured: I might say I've got a mutation bug, and I want to find relationships — the mutation bug was identified by a certain error type, or it occurred in a certain checkbox, and I resolved this error with a copy pattern.
These facts would have come out of a trajectory.
But instead of looking at that trajectory as a whole episode, I'm trying to deduce things about the world, or my environment, and abstract them away from where I learned that information.
And so you’re trying to kind of get like a bigger picture sort of thing.
So if you kept this and forgot all of the trajectory stuff, you'd still kind of be able to approach problems with first-principles thinking. I think that's the term people use for this, where you're trying to instill principles of the world into the agents. And so it's a different way of traversing the graph.
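A minimal sketch of that entity-relation-entity idea, using the mutation-bug example from above; a real system would back this with a graph database like Neo4j rather than a Python list.

```python
# A minimal sketch of semantic memory as (subject, relation, object)
# triples. The triples encode the example relationships described in
# the talk; traversal is just filtering outgoing edges.
triples = [
    ("mutation bug", "identified_by", "TypeError on shared state"),
    ("mutation bug", "occurred_in", "checkbox component"),
    ("mutation bug", "resolved_with", "copy pattern"),
]

def neighbors(entity: str, relation: str | None = None):
    # Traverse outgoing edges, optionally filtered by relation type.
    return [(r, o) for s, r, o in triples
            if s == entity and (relation is None or r == relation)]

print(neighbors("mutation bug"))
print(neighbors("mutation bug", relation="resolved_with"))
```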
Has anybody here worked with these sort of graph structures before?
This is new to me as well.
I think I’ve worked with some. Maybe not for this reason. Well, these are definitely very cool.
Also very popular. So if you look at things like GraphRAG, or the Neo4j stuff I mentioned, it's not hard to find people using network-style data for agents specifically.
It works really, really well for agents.
Where graph starts breaking down is when you scale it; that can get very difficult.
But for most cases, the stuff we'd be using it for, it's not going to be an issue. And there are ways to fix some of the scaling problems too.
Yeah, all right. So that is the end of the intro.
I’ve not put any breaker slides in here.
Any sort of conversations, general first thoughts before we get into the actual papers?
I think I can follow the need for it especially, you know — and the other thing is, a lot of times on the AI side we try to build AI that mimics what people do, or how our memory works or whatnot. But because it's digital, I can actually use your past experience like it was one of my own. That's something that's throwing me a little bit, but it's one of the really cool parts of it.
Yeah. I think the ability to build the tribal-knowledge sort of stuff, or at least a layer of it, is very useful here for sure. All right, so we'll talk now about SWE-Search. This is the paper that preceded this one, where they're basically trying to use Monte Carlo tree search as a way of performing code repair. And that's the whole idea behind this one: code repair.
You can definitely generalize just about every single thing that we talk about in this paper to non-code repair problems.
But obviously it’s very useful for autonomous debugging.
But you could generally use it for, you know, story writing, conversational agents, for, you know, design, architecture style tasks, analyses, anything else.
I think it would also be useful anywhere you would be upset if a coworker doing that task forgot everything that you taught them yesterday.
If you’d be mad at that, you should probably look at this sort of thing.
So this one here is not doing the memory part yet; this is basically a way of doing the search problem where it starts at a root point and says: I'm gonna get agent A and agent B, give them the same task, and see which one gets closer to solving the problem.
So we've talked about this; it's kind of what AlphaGo is doing to brute-force its whole thing. We're just gonna go through that tree, spin off a whole bunch of branches, find some way to get to the correct solution, and effectively prune the rest. And what this paper did was find a way to do that for tool-calling, code-editing agents.
And so this is where the selection process comes in.
I have a bug, a bunch of different files I can look at, and different tools.
And so I need to spawn out a bunch of different nodes and decide which one of those to continue providing resources to.
And then I need to figure out which one I need to expand.
And to do that, I have a bunch of different tools that the thing has access to.
And they have different ways that they can explore.
They explore things, they find solutions, and they get closer to the solution or further away from it, and all that sort of stuff. And the problem here is that as I'm doing this, I've just got to keep running more nodes until it works. And so if I have bad nodes, there's nothing to gain from them. I can get some feedback signal within this individual loop.
So it has a little bit of backpropagation: if I find something here, I can send it back up, and it might influence which nodes I want to expand in this area.
That’s one of the things that they’re trying to do.
That’s very, very messy and it doesn’t work across runs. So it only works for one run of this tree.
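As a rough illustration of the loop being described — selection, expansion, evaluation, backpropagation — here's a minimal sketch. The reward here is a random placeholder; SWE-Search scores nodes with model-generated value estimates, which this toy does not implement.

```python
# A minimal sketch of MCTS with UCT selection and score backpropagation.
import math, random

class Node:
    def __init__(self, state: str, parent=None):
        self.state, self.parent = state, parent
        self.children: list["Node"] = []
        self.visits, self.value = 0, 0.0

    def uct(self, c: float = 1.4) -> float:
        # Upper confidence bound: balance exploitation vs. exploration.
        if self.visits == 0:
            return float("inf")  # always try unexplored actions once
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts(root: Node, actions: list[str], iters: int = 50) -> Node:
    for _ in range(iters):
        node = root
        while node.children:                       # selection
            node = max(node.children, key=Node.uct)
        for a in actions:                          # expansion
            node.children.append(Node(node.state + "->" + a, parent=node))
        leaf = random.choice(node.children)
        reward = random.random()                   # placeholder value model
        while leaf:                                # backpropagation
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    return max(root.children, key=lambda n: n.visits)

best = mcts(Node("bug"), ["search", "view_code", "edit"])
print(best.state, best.visits)
```

The key limitation the talk points out lives in that backpropagation loop: the scores only flow up within one tree, so nothing survives to the next run.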
And so the big thing that we want to do is that we want to take this sort of capability and see how do we turn this into Groundhog Day.
So in this case, we want to stick our agents in Groundhog Day. They can learn how to do this. This kind of gets us to the main problem with this sort of way. And this is what you were talking about, Jay, is we’re getting stuck. And so one of the ways that this paper decides to get us unstuck is by using a dual agent solution.
So here, in this case where I have only a single agent, it has all of the tools available,
and its job is to both plan and execute actions.
The issue with that is that whenever it's stuck in those sorts of situations, it is biased to do a certain action, and if that action doesn't work, it's not necessarily going to effectively change its plan. That's why these things get stuck in these loops: whatever choice they're going to make, they're going to keep making that choice. You'll get situations where it tries to find a certain class, then tries to view the code,
then find a function.
It can't find the function,
so it decides to view the code again.
It can't do that,
so it tries to find the class again, and then it's back to viewing the code.
So this happens all the time, especially with like the dumber models, the older models, where they’ll just kind of loop and loop and loop and loop and they have no way of getting themselves unstuck.
And they’ll do this forever until they crash out essentially.
And so the trick here with the dual-agent system is that it partitions these things out, and it has one agent whose entire job is just to make the plan.
It doesn’t have the ability to edit code.
It can go search for code and ask other agents for things — the experience agent can go request things — and it can write prompts, but it can only tell something else what to do, essentially. And you've got another one, the assistant, that's actually doing those execution tasks. This makes for a cleaner break, and it's easier to get yourself out of loops in this sort of system.
And we can see here, and it’ll pop between the two agents with blue and purple.
So I can plan the search and then it does the execution.
It finds the class and that goes up to the instructor again.
It analyzes it, executes the view code.
It then does the plan to modify.
and the assistant does the execution to modify. And that's because the instructor up here is tuned for reading and understanding code and instructions; you're not doubling up on these roles. That helps with being good at a task — optimizing system prompts, optimizing the context that goes in — but it also helps with efficiency. Claude Code actually just rolled out this sort of thing where, in planning mode, all of your planning tasks naturally go to Opus, which is the more expensive but smarter model,
and all of your actual coding tasks go to Sonnet, which is the cheaper model that's better at software engineering and editing and that sort of stuff.
Or I guess maybe software development, not engineering, if you think there’s any distinction between those two things. And so that’s another sort of thing that opens up with this sort of pattern.
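Here's a minimal sketch of that instructor/assistant split; the two functions are stand-ins for two separate LLM calls with their own system prompts (and possibly their own models), and the state keys (class_location, patched) are made up for illustration.

```python
# A minimal sketch of the dual-agent loop: a planner that never edits,
# and an executor that carries out exactly one instruction at a time.
def instructor(state: dict) -> dict:
    # Plans only: reads the state, decides the next step, never edits.
    if "class_location" not in state:
        return {"action": "find_class", "target": "SharedState"}
    if not state.get("patched"):
        return {"action": "modify", "target": state["class_location"]}
    return {"action": "done"}

def assistant(instruction: dict, state: dict) -> dict:
    # Executes only: performs the instructed action, reports back.
    if instruction["action"] == "find_class":
        state["class_location"] = "django/state.py:42"  # pretend search
    elif instruction["action"] == "modify":
        state["patched"] = True                          # pretend edit
    return state

state: dict = {}
while True:
    plan = instructor(state)
    if plan["action"] == "done":
        break
    state = assistant(plan, state)
print(state)
```

Because the planner sees the updated state on every turn, a failed edit comes back to it as new information instead of the single agent silently retrying the same action.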
If there’s one thing that you take away from all this tonight, I would say the most valuable thing is do this.
Do this dual pattern where you have two agents: one is your planner and one is your doer. If you don't do any of the other memory stuff, do this, because it will jump your performance up on whatever it is you're doing; it doesn't matter what. All right, so we're gonna talk through the example that they have. I already talked through it a little bit, but they use this example throughout their entire paper. There's a problem inside of Django where somebody would click on the newsletter checkbox, and instead of just the newsletter box toggling, all of the checkboxes would check because of how the state was shared. And when they ran it without this experience system, the model would just be like, oh, well, dummy, just click the one checkbox and it won't check all three of them.
That’s the sort of problem that you had here.
And with the experience system, they had compiled something noting that there was an issue way down deep in Django's state management stuff, some sort of issue that was recently solved.
And it was able to pull that.
And even though it didn’t have this stuff in this specific code base, it knew that it needed to go and search in the shared state module for some sort of bug.
And that was kind of their test case of this thing.
And so the idea is that if your problems are all super easy, if you don't have anything that requires context inside your code base,
if you're just vibe-coding something up,
and you're doing it on the most vanilla of stacks,
you might not need a memory system until your project gets larger.
But as soon as you kind of have a situation where you need that cross session sort of knowledge, this sort of thing becomes very useful.
All right, so we’re going to now go into their four-stage pipeline, which is the trajectory collection.
where we are running the agents on diverse problems.
What they did for this was basically crawl GitHub and places like that. I think they had 27,000 issues or something like that. Not huge, but enough to bootstrap it with some common experiences.
And they put all those inside of its memory database and used that to bootstrap the thing. But then obviously if you're running this on your own system — you have your own private GitLab instance or whatever it is — you could run it on that.
You know, in GitHub, whatever it is that you care about. So that would be the sort of thing that you would run this thing on once you got going.
The next thing that we do is the experience extraction. So that’s, I’ve got the trajectories, I’ve got them kind of moved in.
I want to have something pretend to have experienced that thing and draw conclusions that can be retrieved at a later point in time. This is kind of your — I don't know a good analogy — your data pipeline, your sleep-time-compute sort of activities for this thing, where you're cleaning up the data a good bit.
And then we want a system that is able to retrieve those memories.
And so then we’ve got some stuff like the re-ranker here.
And then we're actually looking at the actual reuse. We're almost just doing our normal SWE-Search, that original paper we talked about, but adding to it: one of the tools is now memory, plus the dual-agent thing. But I digress. All right, so we're going to go through the major components here. This is talking about what an experience is. We've said we're storing these experiences, and to them, an experience is some combination of: a directive, which is what the agent is trying to do;
then what action the agent took —
something like a find-class or a view-code, or it could be a series of actions;
the resulting state after it took that action; and whether that was a good action or a bad action.
That is what they call an experience in this paper, and a trajectory is a chain of experiences.
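As a data structure, here's a minimal sketch of that schema; the field names are my own mapping of the four elements just described, not the paper's exact identifiers.

```python
# A minimal sketch of the experience record: directive, action,
# resulting state, outcome — with a trajectory as an ordered chain.
from dataclasses import dataclass

@dataclass
class Experience:
    directive: str     # what the agent was trying to do
    action: str        # e.g. "find_class", "view_code", or a sequence
    result_state: str  # observation after acting
    success: bool      # was this a good action or a bad one?

Trajectory = list[Experience]  # one episode = one chain of experiences

traj: Trajectory = [
    Experience("locate enum definition", "find_class Enum",
               "found enum class definition", True),
    Experience("fix serialization", "edit __str__",
               "replacement failed on newlines", False),
]
print(traj)
```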
And so they did this across all the 2700 repositories, lots of different types of bugs and issues at varying levels of complexities. And what is very notable is that they do this for both things that went well, but also things that went, did not go well. They want both positive and negative trajectories. And so what this looked like, I can take a trajectory here where I’m successful.
So something that was a positive thing where I’m viewing the code and I found the enum class definition, found the serialization logic, found the string method.
And then I realized that that is where I need to add the string representation.
The experiencer agent then takes that trajectory of stuff and extracts two different types of structured outputs. One is a perspective, which is the more contextual, semantic understanding — you know, that serialization errors usually stem from missing or incorrect string method implementations.
Now, whether that's the right thing to take out of that, that's not what we're talking about here. The big thing is that it's trying to deduce things about the world semantically, and this is really used by your instructor, your planner agent. And then there's also some sort of specific modification — specific code-change elements inside of that — which it also extracts, and those are used by your coder agent, essentially.
And so that’s the idea of what the experience agent is doing. It goes and takes those things, stores them in the experience bank.
And yippee, we've done our step. We do this also for our failed trajectory runs. Here I found the enum class definition,
I tried to do a replacement, and it didn't work, because my coding agent is dumb and it doesn't know how to deal with newlines or something like that.
It tries to do some stuff, makes some random changes, does a bunch of things. And so that is now available
for us to know what not to do.
And so here it's saying: I was unable to understand why this sort of thing happened. And this extraction is being done by the experiencer agent on behalf of the other two agents. This could be a smarter model.
This could be like a domain specific model. This could be a model that has additional tool calls and things like that.
So you can do lots of different things with this experiencer agent too.
The idea is that you have some sort of structured analysis agent that lives outside of your normal runtime.
This guy wouldn’t be running at runtime.
All right.
That was a huge chunk of stuff. I guess, any comments or questions here?
One thought: you're mentioning splitting your planning agent and your — call it an acting agent or whatever. Is it possible, or even useful, to use the same model for both, just having them be in separate, like, memory spaces?
Yeah, absolutely.
They can even be in the same context; that's what Cline will do off the bat.
OK, yeah, absolutely.
Let's see — what LLM engine does this use, or is it flexible to multiple?
Yeah, you can use this with anything.
Anything that’s available to do tool calling.
So you couldn't do this with, like, plain Llama, I wouldn't think, but you could do this with all of the big cloud models.
So, you know, Claude, OpenAI, and also a lot of open-source models are now able to do this.
Devstral is a really good tool-calling model that I would trust to be able to do this.
You know, GLM, Qwen3, those sorts of guys can probably do this too.
Sounds like, similar to training, this takes a large amount of data.
I mean, it doesn't need a lot of data to start working. I think you could bootstrap this pretty well. You would have to, I would say, bootstrap the thing with a hundred examples or so, and they can be synthetic examples. So if you're working on something, go and have it read the documentation.
This would be a good place to use stuff like DSPy, things like that, concepts like that, generate some synthetic data. You’re off to the races, I think.
I don’t think that you have to start with the golden data set.
All right.
Oh, okay.
I got ahead of myself. What’s new?
All right. So here's another example of what an experience is — yeah, these two different elements here, a perspective and a modification. Let's see if there's anything that I forgot.
I don’t think we did. Yes.
Yeah.
And the big idea here is that we want these two different types of experiences because we have two different types of agents. I don't want to feed code-modification stuff to my planner; it's like feeding a code review to my project manager.
Maybe a funny thing to do, but not particularly useful.
All right. So next up, we're going to talk about the actual experience pipeline, where we go get those experiences out of my bank. I've made all this effort to collect the stuff, so how do I go and get it? They talk about three different levels of storage here: hot, warm, and cold. Your hot storage is things that are more recent. This could be where you're storing trajectories that are happening in my same run.
You remember we’re doing this tree sort of thing.
So you might have a jsonl sort of file that’s not indexed yet, that’s storing memories from other runs that are currently happening.
So that would be your hot storage and memory.
Your warm storage is your vector database: your Neo4j, Postgres, LanceDB, all that sort of stuff. That would be where you put this material.
Chroma, I think, and FAISS, which we've talked about here, would also be perfectly fine. And cold storage: this could be storing those trajectories in parquet files. You take those JSONL files, convert them to parquet, stick them up in blob storage somewhere. If I really need to dig deep, and I'm trying to find some sort of Hail Mary to get past a thing, this would be a good place to use that. And for our retrieval strategy here, we do our normal vector similarity search.
That’s what they do.
I would also say you could consider BM25 keyword search; some of that old-faithful stuff still works very well.
It's a good place to use hybrid search.
They generally retrieve up to 10.
So you want to cast a broad initial net, and then we're going to feed that into a rerank agent that tries to narrow it down to the one best memory to give to the agents.
They put a lot of focus on this; it's best to give it only one, which we'll talk about later.
It’s best to give that only one that we’ll talk about later. And just to what you remember, re-ranking is basically I get a whole bunch of stuff and I try and sort it based off of some sort of semantic criteria.
So it’s the effort to take messy non-linear data and force it into a linear order.
We talked a lot about this with the multimodal embeddings last time.
If you'll remember, we were trying to get an axis of young to old from image data.
That’s a very non-linear dataset.
With the transformer models, they do have the capability of forcing them into at least a semi-linear pattern.
That’s what a re-ranker is doing here.
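To make the retrieve-then-rerank shape concrete, here's a minimal sketch: embed the query, pull top candidates by cosine similarity, then collapse to a single best pick. The embedding and final scoring functions here are toy stand-ins; the real system uses E5-large embeddings and an LLM rerank agent with agent-specific criteria.

```python
# A minimal sketch of retrieve-then-rerank down to k = 1.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder embedding: a pseudo-random unit vector seeded by the
    # text, stable within one process run. A real system uses a model.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b)  # vectors are already unit-normalized

def retrieve_top_k(query: str, bank: list[str], k: int = 10) -> list[str]:
    # Broad initial net: fast vector similarity over the whole bank.
    q = embed(query)
    return sorted(bank, key=lambda m: cosine(q, embed(m)), reverse=True)[:k]

def rerank_to_one(query: str, candidates: list[str]) -> str:
    # Stand-in for the LLM rerank agent: crude word-overlap scoring.
    qwords = set(query.lower().split())
    return max(candidates,
               key=lambda c: len(qwords & set(c.lower().split())))

bank = ["enum __str__ missing caused serialization bug",
        "race condition fixed with lock ordering",
        "shared state mutation checked all checkboxes"]
top = retrieve_top_k("enum values not converting to string", bank, k=2)
print(rerank_to_one("enum values not converting to string", top))
```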
All right, so the re-rank agent here, they used the E5 large embeddings.
You could use any sort of embeddings you want for this sort of thing. I would not use E5 personally, but that is what they used.
They take top 10 candidates, use cosine similarity.
It’s fast, fast, fast.
And then they are looking to get down to that k = 1,
and they're going to do this based off of which agent they're talking to: I want to get those experiences for either one of my agents. By the way, all of this is open source, this entire repo.
If you look for SWE-Exp, you can go and find their prompts.
Generally, what the rerank agent's prompt for the instructor is looking for is: am I looking at the real bug here, or just a symptom?
So it's looking for root-cause sort of stuff. Is it similar to issues that we've already fixed before? What area of the code am I in, and does that apply to what I'm doing now? And should we look locally or refactor — some of the big-picture questions? That's really what it's looking for whenever it's ranking onto that linear axis for the instructor.
And for the assistant, it's saying: we're going to make an edit. What am I going to mess up? What tests need to be passing? It's very much more grounded.
That's the idea here.
For their tests, they did do same repo exclusion because they’re doing benchmarks and stuff like that so that you’re not overfitting on your repo. This would be interesting. I think that, you know, obviously sometimes you’re going to want to be able to get experiences about your same repo.
And so you don’t necessarily have to do that, but they did do that here for all of their testing.
And I think that there are times where it could be useful to also apply that in your production cases too. So it’d be interesting to see, obviously you don’t want toggles everywhere.
You probably want to find a good default for this.
I’d be interested to see which was better and on what sort of problem domain.
So I'm guessing the more niche your domain, the better it'd be to have this on. Yeah, that is the rerank agent. All right, and just to play it all through: the idea here is that I have an input query — my enum values are not converting to string properly — which is the problem that was happening with the checkbox.
And so I take the embedding, I look for that vector.
I’m searching through all of the experiences that they did their initial pipeline on, and I’m retrieving the top 10 candidates from that scenario.
I've got five here that meet my threshold —
so you're gonna have some sort of threshold —
five of those possible ten, which are the five we saw in that first example with the graph way earlier.
And then I’m gonna do whatever filtering I need to do on this thing for them.
They’re doing that same repo exclusion. You can have your own filters here.
This could be something like classification — say, that's code I can't use for some reason.
This would be a good place to put that sort of filter.
Then you do the re-ranking, and you're off to the races. You've got your thing back, and now my job is just to use it. And then we're back to a normal loop:
you're injecting it into the context, and it comes from the ether instead of from inside your CLAUDE.md or your replit.md. All right, so the data here then: okay, is this worth it?
Is it worth it to do this sort of thing? And yes, probably.
You know, I think, you know, obviously you’re going to have to look at your own use cases for all these sorts of things.
But it does get improvements over your CodeAct models — your normal ReAct-agent sort of thing — with GPT-4o.
You know, they tested a whole bunch of different types of models here.
I do have some quibbles with how they did some of their evaluation here, but they did do one comparison of SWE-Exp against SWE-Search with the same model, DeepSeek.
And they got a 6% improvement or so over that, which makes a lot of sense. And then, looking for an equivalent baseline — I really wish they had an evaluation with DeepSeek on this CodeAct, which is basically just a ReAct agent, because that'd be your real baseline.
Do they have an equivalent?
Yeah, they don't. So it does seem like there are some improvements from this sort of thing.
And I think that especially holds if you're dealing with a problem that requires more domain awareness, because they're just doing SWE-bench-style evaluation tasks.
If you're dealing with a large code base where there are lots of intricacies over time, I think this would probably outperform by even more.
Yeah, oh wait, it’s a 0.62% increase.
They did have a nice ablation study in there as well, where they're talking about: okay, we added three different things; of those three, which ones matter the most? What they split up was basically the two different types of experience, and then the dual-agent thing. And in their testing, they found that the comprehension — the instructor agent's experience — had the most impact of all the things. So this is the planner: it's more important to give your planner good experiential context, that semantic context. I think that might be part of it too: that first-principles context is more transportable, and that might really be the nugget here. It's more important to give it that. It's also quite important to have the actual code-level knowledge. And the dual-agent stuff prevents those infinite loops. So that's the takeaway that they had. I thought this was nice; it was a good ablation study. The other thing they did that was nice was they played around with different values for K, where K is the number of experiences retrieved. So: I've got my 10 experiences and I want to whittle them down to the best one, is what they said. What if I gave it two experiences instead?
And they actually found that their performance decreased with this as they added more experiences.
And that is because you get conflicting guidance. So I might have two experiences.
If they’re in a similar space, they might be incompatible.
One might tell you to go one way, one might tell you to go the other way.
And if the agent tries to do both of them, it’s gonna get no solution, where if it had just done one or the other, it might have gotten to the right solution. I think this is interesting.
I'd be interested in pushing this area of the research further; there are probably additional systems you could build to allow this to work. They found that once you get up to three or more, you're adding additional cognitive burden and these things just get kind of confused.
So that was interesting. And so yeah, this is implementation settings.
I won’t go super into this.
I do want to point out, we're at the end, by the way,
so I don't have a ton of time here. I did look at a few of the memory implementations that are out there right now.
Two of which that I’m going to look at myself.
I've heard of Mem0 a whole bunch. This is the one most people talk about when they talk about memory systems.
I think they've got a really good marketing team, is what it seems like, but it seems like the product is good too.
I'm poo-pooing them too much — you just see them everywhere. And they've basically got vector, graph, and KV-store storage solutions.
They've got multi-layered stuff.
It seems very easy to pick it up and just kind of go and just kind of get an idea of what this sort of system provides. You know, it has the ability to store different memories and have them be available to your users.
It seems to have lots of good models, if you're making a product, for having it be attached to users and be secure.
It’s open source and all that sort of jazz. So this is an option out there if you want to get into playing with memory without having to do too much.
This might be a good topic for one of the Wednesday sessions, Jay,
because you can just pop it in a notebook and be off to the races. So this is just a nice, easy one. I'd say this one looks the easiest.
There are some other ones. Zep AI —
I've seen this one a lot in, like, your n8n sort of areas.
So what you see is what you get, workflow builders.
It seems like people are using that in that area a lot.
I didn’t find this super compelling.
It seems like it's oriented more towards the cloud.
So if you're doing all that stuff in the cloud anyway — if I'm in n8n, I'm in the cloud anyway — that was an option there that seemed to be popular in that space.
The other one that was very interesting here was Letta. MemGPT is actually the paper that kicked this off — obviously the memory idea has been around; it's an intuitive idea — but that was the first one that was very successful. The MemGPT paper is very likely what OpenAI's initial memory module was based on. They spun out of Berkeley; I think that whole group of PhDs spun out and made this company. And I'm really interested, and I need to go look up some more about them, because they're kind of building the operating system for agents, giving them an open layer of memory. It seems like they've got some very cool stuff out there. And so Letta is the company — I don't want to over-plug them, but we might talk about them again at some point if they have something interesting. There are some other ones too.
They were out there; I didn't get to look deep into them, but I do have a little blurb of who's at the top of the stars ranking, for whatever that's worth. Yeah, there's "memory" spelled wrong. Yes. So that is there.
That’s fine.
Why not?
I know.
How long do you think it'll be before someone —
I mean, has this made its way into some of the tools that we use right now, like Cursor and others?
Or is this something that's kind of on its way? Humorously, Claude actually just pushed out their memory feature. Obviously, it's part of ChatGPT.
Gemini added theirs in about February.
So lots of memory stuff is actually happening right now. So it’s a moving space. It is very much not solved.
So this is one where I think there's going to be a lot of movement still, but what's around right now works well enough that it's not a bad time to think about using it. Okay. And, you know, obviously you look at the contextual memory with the CLAUDE.mds and the Replit setups; that stuff's kind of sitting out there.
And I think people are still finding the right way to do the vector store stuff. It's a little bit less accessible because you've got to think about how you do the data model at that point.
But people are starting to, just because there’s so much value in doing it and the need is so high, people are starting to implement that sort of thing. And so if you look at all the Pullabaloo, I don’t know if you guys saw Obviously GPT-5 exists, so we can talk a little bit about that too if we want to, if we have a time at the end.
But there’s a big hull below about them taking offline 4.0, like all of the old models, which a lot of people were very, very angry about that.
And Sam Altman had to walk it back, because these people had gotten weirdly parasocial with this 4.0 model, which is surprising to me because I find the thing obnoxious.
But there’s this memory system.
It’s remembering things about them.
And they got used to how it was responding. And this sort of system can be very powerful for better or for worse. So that was kind of another interesting tidbit of how these memory systems can make these systems feel very personal, which is good from a chat sort of viewpoint.
But obviously, I want my assistant when I’m working to be personal as well. I want it to know the libraries I use, the patterns I use.
So it’s an interesting problem.
It’s one, we talk about tools and context engineering, all that sort of stuff a lot.
And I would say that memory is equally as important, if not more important than those elements in the long term in order to be able to have continual learning agents.
Yeah, that's my answer — that was more than you asked for.
That's fine. So a little is better than none, so we're moving in the right direction. Moving forward, we're likely to see different approaches until it, I guess, consolidates — and I don't know if it will; there could be multiple approaches that are all valid. So it may not wind up being everybody on the same thing. It's probably going to be more like RAG, where we all get there, and then it's incrementally improved as you add new little tweaks here and there. Yes.
I think so anyway.
There’s not one method to store data.
You know, I’ve got databases, I’ve got file system, I’ve got, you know, all those things exist in a working system.
All right, we had one question from Daniel, if you wanna give a quick synopsis. But for Daniel: we do record these, and we'll post it as soon as I get around to getting it packaged on Vimeo, and I'll get a link out if you wanna go back and watch the first part. Yeah, and some of it is the vector store, but there's a little bit more to this than that; the vector store is obviously a very big element of it, for sure.
All right.
Any other questions, comments, open topics?
Kind of an open floor here. Got about 15 minutes, so — I won't belabor the point if we have no discussion. I don't have anything else on this one. Again, a lot of these topics and papers I wind up going back to. I really like the broad overview, and then it gives me some pointers into where I need to go spend some time to get deeper. I think your whole approach probably saves me a good two hours or so just getting into it, you know what I mean? I don't know, but we appreciate it. No problem, appreciate it. All right. I'm going to stop the recording then.

