Structured Agentic Software Engineering


Transcription provided by Huntsville AI Transcribe

Josh Phillips: Thanks for the data point. That's your fault. So yeah, I think we covered most of what this slide was about while we were waiting there. But the main idea here is the task-agentic level. We were talking about levels three to four, but level three is where you're still giving it something broad, so I might say things like "add a caching layer," whereas for level two I'm looking at something much more granular, like "reduce the code complexity of this by extracting this and this into these functions." And that's what you have to do right now with the local coding models. If I'm using something like Devstral, I can use Devstral in a client, but I really have to talk to it at a task-agentic level, whereas with Claude Sonnet 4.5 or Codex I can let it start doing level-three sorts of things pretty easily.
 
Josh Phillips: So yeah, the main thing we're really talking about here is: we know it's a problem, but how do we talk about it and triangulate what the problem actually is? And the problem is the delta between speed and trust. These things can kick out a whole bunch of stuff, but in order for me, as the human who has the identity that's able to approve it, I'd have to go through all that code manually if I keep things set up like we currently have them. Or I just say YOLO — it's fine if it breaks, it's not my fault, I'm not going to look at it, and if I don't look at it, it's not there. So what we're looking at with these workflows is that we want to focus around the specification, where I'm describing not just the end state I want, but how I want to validate it.
 
Josh Phillips: We're going to use those validation techniques and these incremental artifacts to build trust in the process without having to open up and look at all the code. That's the big thing we want to get out of this. But we also want to start thinking about code differently. Right now we treat code like it's a genuine work product that has value, and it's really going to end up being more of a commodity — more like, you know, I don't open up my binaries and inspect them. The thought is that code is going to become that sort of thing, where we're not looking at it in depth outside of the most critical use cases, the same way some people do actually work down at the CUDA kernel level right now but most AI engineers don't. For the most part, most people might not be looking at code so much in 40 to 50 years.
 
Josh Phillips: The other thought is: how do we generate a whole bunch of these things at once, so I can get, say, five merge requests and choose the ones I like, or maybe merge them together? And then focus on the human going from coder into more of a specifier/orchestrator. They call this the coach role, which gives me a little bit of the willies just hearing it, but I don't think it's incorrect.
J. Langley: That's the SAFe stuff again, being an agile coach.
Josh Phillips: Yeah. Yeah. But I think it is probably correct. So now we're going to talk about what they're actually proposing here, which I think is the most interesting part of the paper. They have these four pillars. They said they're the four pillars of software engineering — I don't know if that's true or if they're just calling it that — but basically: there are actors, there are processes, there are tools, and there are artifacts.
 
Josh Phillips: And we want to think about these as the buckets that the system organizes itself by. So for the first pillar, the actors inside of the system, they're splitting things into this two-lane mode. There are always these two lanes, corresponding to the human workbench and the agent workbench we talked about before — the VS Code versus Lovable-ish sort of thing — where we want to optimize the human side for setting the goals, orchestrating, mentoring, and doing all the things we talked about before, and optimize the agents for actually executing those tasks, generating the actual code, and being able to split up work and parallelize. And then, pretty importantly, whenever an agent needs guidance or has conflicting information, it shouldn't just make stuff up. It needs to be able to raise some sort of flag that it needs a decision or a clarification. So there has to be something inside the system that can trigger like that.
 
Josh Phillips: The second pillar is the processes we use to manage our workspaces over time, and they've got big fancy names for these things. I'm including them here — I had a section for this — but it's really just how you work with the artifacts. There are six artifacts and six processes, which are basically how you manage the artifacts. So there's something around briefing, which is how I craft clear, actionable mission plans. There's mentorship with these agents, which is basically giving them institutional knowledge — things like how you define your code standards and your style guides; that's what they're calling mentorship. And then guidance engineering, which is how we handle these interactive rerouting activities — how we do that exchange when the agent needs a decision.
 
Josh Phillips: But then also, how do we codify those exchanges so they can just be looked up later? Which folds back into the mentorship engineering. The stuff on the left here — briefing, mentorship, and guidance — could easily map to other words: spec for briefing, memory for mentorship, and guidance is basically telling it how to fix itself. Then for the agents, the processes they care about are this concept of the agentic loop: what is the SOP for how it executes? Sometimes you can just let it figure it out, but sometimes you want it to go through a very specific process to do its work, and you can define that process at different levels of granularity. That's what agentic loop engineering is about, and it correlates to the loop script we'll talk about. Then there's life cycle engineering, which is the counterpart to mentorship engineering: how do I store the memory?
 
Josh Phillips: How does the agent retrieve memory, or tribal knowledge? This could be something like RAG, it could be an episodic memory system — there are lots of ways you could do that. And the infrastructure is the tools and resources that you provide to it. This could be your MCP servers, or custom CLI libraries that you write for it, things like that. So yeah, that's pillar two. Before we go much further, any questions about these two? They go together, I think, pretty clearly.
J. Langley: This is very similar to me trying to help run a software development organization. You know, it's like — I mean, I know, I know.
Josh Phillips: Well, that’s what you’re trying to do here.
David Showalter: Yeah.
J. Langley: It's like the agents — maybe they're interns, or you know — it's hitting the nail on the head. I'm like, well, is my job any different now?
 
J. Langley: I just have agents.
Josh Phillips: Right. Right. Could you have whole programs, or just one person, essentially managing a host of nonhumans?
J. Langley: I mean, that's kind of where it's — yeah. I could yell at them more.
Josh Phillips: Claude might actually file a complaint on you though.
J. Langley: That would be kind of interesting. Oh man.
David Showalter: I've found it'll just get snippy with you in return.
Josh Phillips: All right. So, next up is the tools. We have the two big ideas here that we talked about: the agent command environment and the agent execution environment. These are the two workspaces — they have five names for the same thing; this is one version of that concept — where we have the human IDE, which to me feels like it's trending toward more of a project management style tool. Plus, you know, it's not Jira, so I want to be able to see the actual code and so on.
 
Josh Phillips: But there's more to managing things inside the development environment. I'm looking at things like cost, because these things cost money per token. Looking at warnings and telemetry. If I'm spinning up five agents to go after a problem and then merging their work together, there are things related to that. And then lots of things to store and manage that memory over time, not just manage your codebase right now. Like, I do this sort of stuff and I just have this docs folder or specs folder that is kind of special, and I just have to know it's actually a database more than it is code. And the same goes for the agent execution environment. Some things here that are new relative to past stuff we've talked about: a lot more focus on these structural editors. By structural editors, we mean things where the agent is interacting with the code through the abstract syntax tree.
 
Josh Phillips: So instead of editing the code files, it might be interacting with things through some sort of programmatic interface to edit the quote-unquote code. That might still be displayed as code, but it's thinking about it in a completely different manner. We'll have a slide later that talks about this in depth, but I'll just mention it's one thing they discuss here, because there are some interesting things when you pull that thread. Things like hyperdebuggers, where we're debugging multiple agent runs — not just program state but agent state across multiple quote-unquote threads. And then things that can do self-monitoring: the agent understands it's in a loop, it understands that it has conflicting information, and it is able to escalate something up the chain and then poll for a response.
 
Josh Phillips: So that's the kind of tooling they're talking about here. The last bit is the artifacts, and I'm not going to go through all of these right now because I'm about to have a whole section on them and I don't want to cover it twice. All right, so we'll start going into the artifacts. There are six artifacts that we care about here, and these are pretty good — I think you could pick these up and go do anything with them. They're split into different areas, and three of them are on the human side, where they cover spec development, planning, and providing the tribal knowledge. One of those is the briefing script. This is where you're basically defining out your requirements for something you want it to do. It relates to what people have called the product requirements prompt, or the sort of thing Spec Kit has.
 
Josh Phillips: People are coming up with a whole bunch of fancy names for requirements, but it's essentially requirements — that's what the briefing script is. The loop script is basically defining out your flow, and we talked about that before. The mentorship script is something we covered in the prior slides, so I won't go over it here until we get detailed. The two that are kind of new here are the consultation request pack, which is the structured way the agent raises concerns and provides information to you. The idea is that it provides it in a way you can quickly look at and decide on, and then it goes off and does its thing — because at this point the bottleneck is human review time, the amount of time these very expensive developers spend reviewing stuff to push things along. If you can optimize that, that's the thing that speeds your agentic software factory up.
 
Josh Phillips: So that's the one where it has had to stop work and wants to get unblocked. The other place where it raises things is the merge readiness pack, which is basically it saying: I'm done, and here's how I'm proving that I'm done. So you have all the different validations that you require, and it presents them the same way, in a form optimized for quick human review. The other important aspect here is that anything we're having these exchanges about — I don't want it to just live in some Claude Code window that goes away forever. We want to persist these things so I can call back to them and they can be searched by the agents. Remember, we're optimizing for as little human review time as possible while still being trustworthy, and being able to query old conversations where these questions were answered is one way of doing that.
 
Josh Phillips: Of course, you have to manage that, because sometimes old decisions are no longer valid. All right, so we'll start with the briefing script. Here I have some examples of different briefing scripts; we'll just talk through this — I'd like this one to be more conversational for each of these — and see if you understand what they are. The briefing script is basically the mission plan. The idea is to define out what it is that you want and provide the context for it. So don't just say "add OAuth" — explain why you're adding OAuth. If you do not tell it why, it might invent a reason, and that's where it starts hallucinating things like GDPR and all these other requirements it thinks it has to satisfy, because it will just try to fill in the gaps. So you want to define something that looks like a goal.
 
Josh Phillips: You want to add a why for that goal, define out your success criteria, and add any relevant context. Then give it some guardrails for where you want it to work, and call out any gotchas. So here's one that's for caching. Then think about it in a different area, like a password reset: instead of just saying "do forgot-password," you want to make it clear that this is for things like locked accounts, so it knows to implement only toward that. It obviously tells the agent what to do, but it also tells it what not to do, if you specify things to a sufficient level. And you're generating different criteria here, which become more relevant when we get to the end and think about the validation aspect. With the task-focused approach, the usual pattern is that you tell it the thing and argue with it for a while, and eventually it does things right.
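As a rough sketch of the kind of briefing script being described here — the file name and section headings are illustrative assumptions, not taken from the paper — it might look something like this in a repository:

```markdown
<!-- specs/briefing-user-cache.md (illustrative path) -->
# Briefing: add a caching layer to the user profile API

## Goal
Reduce p95 latency on GET /users/{id} by at least 50%.

## Why
Profile reads dominate traffic and the database is the current bottleneck.

## Success criteria
- p95 latency under 200 ms in the load-test report
- Cache hit rate above 80% after warm-up
- No stale reads more than 60 seconds after a profile update

## Relevant context
- Redis is already deployed; prefer it over adding new infrastructure.

## Guardrails / gotchas
- Do not cache authentication or credential data.
- Invalidate on profile update, not just on TTL expiry.
```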
 
Josh Phillips: And this requires us to think about things a little bit more on the front end, instead of kind of YOLOing it as we go. So, yeah, any thoughts on the briefing script?
J. Langley: Yeah, on my program they're called features or enablers, and we've got it down into something very close to this template, except success criteria is called acceptance criteria.
David Showalter: No, it seems like standard best practice for a prompt, you know — be clear, lay out every subject area.
J. Langley: Context — I don't know what we call that one. We've got an in-scope section and an out-of-scope section, if you've got a "hey, don't go there, go here." And I like what you've got with — if you go back to caching — the blueprint. I think we call that architecture guidance or something, because that's usually where it's coming from.
Josh Phillips: Mhm.
J. Langley: Not necessarily, you know — it's kind of like, hey, this is the general place we're going to work and here's kind of how I want it done.
 
Josh Phillips: Mhm.
J. Langley: And then Charlie's message in the chat — it looks like a GitLab issue template. I'm like, yeah, yes, this is how we already tell developers, or teams of developers in my case, what the thing is we're building, where to go, and what the ground rules are.
Josh Phillips: Absolutely. Yes. And if this was all this framework was doing, I'd say it was essentially doing nothing — it's this plus a whole bunch of other stuff. But to me this one seems the most, you know, it's very much like a user story sort of thing. The interesting thing is that this lives in your codebase now, so you store it forever and always in version control. All right, so that's the briefing script. Now we're going to go into the loop script, which is some sort of SOP.
 
Josh Phillips: These are generally reusable. Um, so you kind of store these as almost like your slash commands. I think of slash commands as very very uh close to this. Are I know I think David you said that you don’t use clawed code. Are you familiar with what the slash commands are that be like an imagine in midjourney you know you can do that spawns off a very specific job inside of midjourney.
David Showalter: Uh I am not. Yeah. Yeah. Yeah.
Josh Phillips: You can do /describe. It does a very specific job.
David Showalter: Okay.
Josh Phillips: So it’s kind of like that.
David Showalter: Okay. Perfect.
Josh Phillips: So it's some sort of thing where you give it a specification and it'll go do it. Generally there are arguments you can throw into it, and I have some examples from my own code that'll make that clearer. Yeah.
David Showalter: So similar to some of the customization you can do in Cursor, with custom commands and all.
 
Josh Phillips: Yeah. Absolutely. And the thing is that you can define it through some loose markdown-style stuff. If you're using Claude or Codex, you can give it tools, so you can say: for this loop script, I'm only going to give you access to this one MCP inside the context of this slash command. Things like that let you quickly shift it into a different mode. One of the ones I do this for is a planning loop. I'll have one where I give it access to Playwright, so it can open up a browser — and it doesn't just have access to Playwright, it has to use it and provide me certain validations. So you're defining it out as an SOP sort of thing. Here's an example: fix a typo, do a lint check, do a quick visual test (which is taking a screenshot in Playwright), and then it can submit an MRP.
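A minimal sketch of what a "simple fix" loop script like that could look like as a stored slash command — the path, tool allowances, and wording are assumptions for illustration:

```markdown
<!-- .claude/commands/simple-fix.md (illustrative path) -->
# Loop script: simple fix

Allowed tools: file edits, the lint runner, Playwright MCP (screenshots only).

1. Read the briefing script passed in as $ARGUMENTS.
2. Make the smallest code change that satisfies it.
3. Run the lint check and fix any new findings.
4. Do a quick visual test: open the affected page with Playwright and
   capture a screenshot as evidence.
5. Assemble a merge readiness pack (MRP) linking the lint log and the
   screenshot, then stop and wait for human review.
```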
 
Josh Phillips: So basically run a a gentic CI sort of pipeline. And so we have that here with this simple fix. Um you could also do it for uh this you know refactor sort of thing. So if you wanted to do uh something that that was related to authentication or that needed a security scan. So you know maybe there’s like a certain decorator that you want to something you could feed it in that way. Um, you can also chain these loop scripts, which I actually have an example of this chained loop loop loop script uh in my uh real world stuff where I I use another script to generate out a spec and then another one gets that almost like a higher order function uh receives that as input and continues to go um and so on. So yeah, any thoughts on this one?
J. Langley: I'm trying to — well, I've been using Cline so much instead of Claude, and I'm trying to figure out where this would apply there.
 
J. Langley: I think there's some stuff where I probably need to put a little more effort into the underlying setup I have. But anyway, okay.
Josh Phillips: Yeah. And generally these will be called commands — I think that's the .agents version; there's this new standard, .agents, the general coding agent stuff, and there they're generally called commands. But these guys are writing a paper, so they have to make up words for everything, so it's called a loop script.
J. Langley: It’s great.
Josh Phillips: All right. The next one is the mentor script — the codified rulebook. I think this one's really cool; it's the one I think more people should do. It's loading up that institutional knowledge, quote-unquote the tribal knowledge, and storing it inside of your Git repository. And it ends up looking like this giant blob of messy markdown files.
 
Josh Phillips: But institutional knowledge is messy. Uh and the way that these things interact with code is different from how we do. Uh and so you kind of store this stuff along with your scripts. Uh and so it can be things like you know uh different rule sort of stuff. So if you know we’re using lib time utils uh you want to use uh you know our custom wrapper instead of doing the python date stuff. Uh so being able to quickly sort of draw those things in and the other thing with these is that uh you know this could be things like code standards you know I have I’ve written I always write special uh linting libraries now for my code to take things along and I say hey go run this and tell me what the feedback is from my special linting library. Um, and the idea with the these sorts of things is uh you want to be able to codify them in a way that agents can understand and quickly discover. Um, which can be a little bit different than how humans might do it.
 
Josh Phillips: So, so you need to have think a lot about how things kind of link together. Uh, thinking about how you name things so they’re easy to search and think about how these mentor scripts kind of integrate with your loop scripts. So, a lot of the times, you know, I will write these quote unquote commands or loop scripts. And what they really do, and you’ll see this is that they really just load up, you know, five or so mentor scripts and say, “Hey, load these five things into context and run this tool, and then you can start work.” And so, you’re kind of preloading a certain memory state for these things. It’s like what if I could could you know do a slash command and get a new intern on my uh my infrastructure team for a coding project like right now and that’s all you had to do to do it. So these kind of like these these these these suits that you can put these agents in that loads them up into the right environment. And so the mentor script kind of sort for forms a a a fundamental thing here.
 
Josh Phillips: And so here’s the explicit rule sort of thing. We also have the concept of like these inferred rules. Uh where you know the stuff that’s very specific. Don’t use this thing, use that thing. Uh but we can also do preferences and things that are a little bit more semantic. So prever prefer concise error handling. Uh you know don’t have code complexity over 10. Uh don’t use this library or you know we hate camel case whatever that might be.
David Showalter: Yeah.
Josh Phillips: Yeah. Absolutely. You can also do it for things like security. There are lots of different ways you could use this sort of stuff. Any thoughts on this one? I think this one's really cool.
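A small sketch of a mentor script along those lines — the file path, the lib-time-utils rule, and the thresholds are illustrative, not prescribed by the paper:

```markdown
<!-- docs/mentorship/datetime-handling.md (illustrative path) -->
# Mentor script: date and time handling

## Explicit rules
- Use our wrapper in lib-time-utils instead of calling Python's datetime
  module directly.
- Store all timestamps in UTC; convert only at the display layer.

## Inferred preferences
- Prefer concise error handling over broad defensive fallbacks.
- Keep per-function code complexity under 10.
- No camelCase in Python modules.

## Validation
- Run the project's custom lint script and report its findings before
  claiming the work is done.
```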
David Showalter: I keep thinking about — did you look at the effective context engineering for AI agents post from Anthropic last week?
Josh Phillips: Yeah. Yes. Yes, I did.
David Showalter: Yeah. All I keep thinking about is going back and forth between that and this and just trying to fit the two together.
 
Josh Phillips: Right.
David Showalter: See?
Josh Phillips: Right. Yeah, they work together pretty well. I think this is a pretty good, different lens on a lot of the current best practices, and definitely stuff that works really well with Claude specifically — Claude's really good with this. All right, those are the three that were human-centered; now we get into the ones that are more robot-centered. So, the consultation request pack — this is the one that requests help. We've talked about it, but what does it really mean? An example: we have something that says we're optimizing for performance on this application, but then we have a rule somewhere that says above all else security is important, nothing is more important than security — because if you write security rules you always use that language no matter what, even though obviously people make trade-offs on security all the time, right?
 
Josh Phillips: So these sorts of things confuse the agents and they’re they’re going to be stuff like this inside of your codebase. Uh and so the consolation request pack is supposed to be a way for it to raise that sort of stuff uh to brief you. Uh and so some examples of how we might do this is that you know there is a briefing script. So this is where we’re starting to get the cascading importance. So we’re we’re using a trail. The stuff’s inside of our version control. And so it’s going to call back out to this briefing script user cache which in our example here is this one right here. This little guy add caching layer API latency is reduced by 50%. So this is this is a performance uh critical uh uh briefing. That’s the idea is that we want performance. That’s what we’re focusing on. Uh but something is considered to be uh a security hit because obviously we’re caching something related to authentication and credentials. So there’s going to be stuff that that the human needs to sign off on right here.
 
Josh Phillips: And so it’s going to talk about tell it exactly what the conflicting rules are. I don’t have to go look through some slop. It it should tell me exactly what the problem is, what the options are. So if you know there is something that’s obvious, I can just say do two instead of saying you know I I don’t have to think about parse bash request to invalidate affected keys and think about how to communicate that to claude. It’d be much nicer if cla tried something and I could say yeah that and then uh exactly what its problem is. So so as far as as it you know you know knows what does it think the issue is. So that way you could do a complete reroute if you have to. And so as a cache strategy and the big thing here is that this will be as a markdown file just like everything else. Uh so I’ve written this blah to you and uh you could then make notes inside the markdown file uh or respond to it directly you know kind of whatever you want to see so you got speed and security and yeah some more examples here.
Josh Phillips: So yeah, I guess, any thoughts on this one?
J. Langley: I'll probably ask this again later, but I'm looking for some kind of reference implementation. You've got a lot of examples here — are there any repositories where you can get a head start, you know?
Josh Phillips: There will be in January, at my talk for the symposium. Absolutely, yes. Part of what I'm providing is going to be exactly this.
David Showalter: Nice.
J. Langley: Okay, good deal.
Josh Phillips: So this is actually a semi-dry run for my symposium talk. Yeah. And I think — I don't know about this specific system, but there are tons of folks trying to do this sort of thing. Okay, so there's that. The next one is the merge readiness pack. Think of it alongside your briefing script.
 
Josh Phillips: I think the way to do this is like to make a little folder and you know you’d have all these things kind of living together and they get archived together when it’s done. And so the last thing that you’d expect to show up in there is your merge readiness pack which is basically it’s evidence bundle uh where it is saying you know I did all of these things. Uh claude is probably going to tell you that it’s production ready. Uh but the important thing here is that uh because of some of the different things that you’ve you’ve established along the way uh things like validations, things like you know special linting rules, uh things that are like checking for mocks and stuff like that, you check for some antiatterns. Um if you’re forcing it to return those things to you with, you know, links to those output artifacts, um they can then be easily referenced inside this merge readiness package. So you still don’t have to go in and look at code too deeply. uh unless you really really have to.
 
Josh Phillips: And so that’s kind of the idea here. Um I I really uh it’s it’s hard to do it because you know we’re looking at a screen right now, you know, with some text, but I always have this uh connected with uh validation tools that generate output files uh where I have written the scripts myself for those validation tools and it’s not allowed to edit unless I give it explicit permission. the things that are really hard for it to game. Yeah. So, here’s our caching feature. So, you can see the evidence attached. It’s got its completeness uh verification where it’s writing the tests. Uh there’s some sort of gates that we have access. Uh we’ve have a rule inside of one of our mentor script sort of things. Uh it’s it’s talked about it and there’s some some logs that’s attached to it. And you just basically want to have it do it this for everything. And it’s actually pretty I think it in my mind as I’ve been working through this sort of stuff for the past you know four or five months uh these sorts of things it speeds stuff up because there’s longer periods of time where you can go hands off and so there there’s some initial uh sort of structured thinking but you’re arguing with it way less.
 
Josh Phillips: It’s doing way less stupid stuff. So yeah, that’s the main idea with this one. Merge readiness pack. Uh, anything else here?
David Showalter: No, I’m tracking. Looks good.
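A sketch of a merge readiness pack like the caching one just described — the artifact paths, results, and file name are placeholders for illustration:

```markdown
<!-- specs/user-cache/mrp.md (illustrative path) -->
# Merge readiness pack: user profile caching

Briefing: specs/briefing-user-cache.md
Resolutions applied: crp-001 (option 2)

## Evidence
- Test suite: all tests passing — link: artifacts/pytest-report.txt
- Load test: p95 latency target met — link: artifacts/load-test.json
- Custom lints (mocks / anti-pattern checks): clean — link: artifacts/lint.log
- Security gate: no credential fields cached — link: artifacts/security-scan.log

## Notes for the reviewer
The validation scripts were run as-is; the agent did not modify them.
```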
Josh Phillips: All right, the last one is the version-controlled resolution. The main thing here is that we want to document any decisions we made and store them inside the codebase. It lives in sort of a packaged artifact that goes into version control, so we can link all of these things together. That way, for something like a consultation request pack, I don't want to answer the same question 20 times. If there's an answer I'm happy with and I want to say "always do this," I want to save that. So you start getting into a loop, which is their life cycle pillar.
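And a version-controlled resolution for that same consultation might be as small as this — the path and fields are illustrative placeholders:

```markdown
<!-- docs/resolutions/user-cache-crp-001.md (illustrative path) -->
# Resolution: caching vs. security conflict (crp-001)

Decision: option 2 — cache only non-credential fields, invalidate on update.
Decided by: <reviewer>, <date>
Scope: applies to future caching work on user-facing read endpoints.
Revisit if: the security mentor script changes.
```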
 
Josh Phillips: So yeah um and the general loop here is that the author does the brief the the human does the brief uh we execute the loop. If there is a need for consultation it goes it assembles the evidence and records resolution. Sometimes all you’ll do is brief and it assembles the merge packet. Everything looks gravy. Neat. need a burrito. We’re good to go. Um, and then if something was wrong with that, uh, I can then reference back to it in two weeks or so whenever, you know, I want to edit it. And a lot of times, half the time it’ll go like that. That’s what I found. Um, there’s also the the cons that they have of their workflow that I did want to attach here. I, you know, I don’t really do this a lot where I’m spinning out multiple agents on the same problem. It’s nothing I’ve had to do. Uh, that’s probably bad. I probably should do this. Uh but you know following a single train is still my comfort uh of you know watching what an agent is doing and reviewing that thing thing deeply.
 
Josh Phillips: Uh so I haven’t done like the five agent sort of thing but they they kind of specify out here a way to specify that out and get a whole bunch of PRs and then pass on certain things and merge different ones. Um I don’t do that but they define that here. So mention that. I don’t know. Is that something that you guys have played with? It makes me very very uncomfortable to have too many kind of going off and doing stuff. I’m pretty comfortable now with letting them do stuff though, right?
J. Langley: If I got better stuff from them, maybe. I'm not quite there yet — it's probably on me for not being able to explain things. But if you start nailing it, and I like the way this is laid down from the start, then moving forward I would probably do one, and as you get further in and get better at setting up the loop scripts and all the other pieces...
 
J. Langley: I think you may get to where you trust one and you go, okay, this one's okay, let me add another one.
Josh Phillips: right?
J. Langley: And then see, when can I trust two of them? Maybe.
David Showalter: Yeah, like you guys know, I’m uh I’m more of a use the tools than build them.
Josh Phillips: Yeah.
David Showalter: So I've just gotten into agent use in the last couple of months, and I keep it to where I can wrap my head around what everything's doing. If I start to get confused about what's doing what, then I cut back.
Josh Phillips: Yeah, the one use case where I'll do more — I'll sometimes do two. I'd say about half the time I'm running two at this point, where one is doing documentation and one is doing code, so it's really only one thing that's writing code. But the one place where I will actually have three agents editing at once is linting. I'll set up this giant, authoritarian-level set of linting and code rules, have it spawn out different files for each finding, and give each agent one file, where they're all editing different packages so I know they're not going to converge.
 
Josh Phillips: I’ll do that because it’s linting. It’s most of the time it’s pretty easy, but uh that’s about it. I think I think three is my my my uh my biggest one now. But I know it’s not going to cut it. You know, you you just know that as the further it goes along, your efficiency with this is going to scale with how effectively you can run more agents. I think uh sometimes it’ll be one model, sometimes I’ll use different models, sometimes I’ll use a local model and a cloud model depending on what I’m working on.
John Roadman: And you're running against one model, right? So your inferencing is just one local model, with all the — right?
Josh Phillips: Like this presentation — if I'm working on this, yeah, I'll use the cloud. I don't care, it's going up on YouTube anyway. Sure.
John Roadman: Right. No, I just meant with the agents. The agents can all be hitting the same model, whether it's public or private — doesn't matter.
 
John Roadman: The model doesn't really know the difference between the agents except for the questions being asked.
Josh Phillips: Yeah, absolutely. It basically only knows it at the time it's doing that exact inference.
David Showalter: Most of the experiments I'm doing are combining agents using SLMs with one using a large language model. I've just been playing around a lot with: okay, which simple tasks can I do quickly with the SLM, and then get the LLM to be the review phase.
Josh Phillips: right? Yeah.
John Roadman: Mhm.
Josh Phillips: One example of that, which is just a codified pattern now, is the planner/doer sort of thing, where you have the really smart, expensive model plan out the work and then
John Roadman: Mhm.
Josh Phillips: you have the cheap model, you know, review it. Sometimes I'll make one request and it'll use three models in the course of responding, without me doing anything. That sort of stuff is very powerful, for sure.
 
John Roadman: And that's coded into what you're doing — picking the less expensive models? Or... yeah.
Josh Phillips: Yeah, absolutely.
John Roadman: Okay.
Josh Phillips: Coded, or configured, different stuff like that. That could be what your loop script is, the one we talked about. It could say: for planning use this agent, for writing the code use this agent, and for doing the lints use your cheap model, because it doesn't take anything.
John Roadman: Right.
Josh Phillips: You can do stuff like that.
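One sketch of how that model routing could be written into a loop script — the phases, model tiers, and path here are assumptions, not a fixed scheme:

```markdown
<!-- .claude/commands/plan-and-build.md (illustrative path) -->
# Loop script: planner / doer / reviewer routing

1. Planning phase — use the large, expensive model to turn the briefing
   script into a step-by-step plan.
2. Implementation phase — hand the plan to the mid-tier coding model.
3. Lint / cleanup phase — route to the cheap or local model; it only runs
   the lint loop script and reports findings.

One human request can therefore touch three models before it comes back.
```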
John Roadman: Yeah, some of the models are a system of experts, right? So I thought maybe the front-end router was doing that for you.
Josh Phillips: It depends. I think OpenAI has something that kind of does that. I would call it a mixture of models — mixture of experts is something very different. But yeah.
John Roadman: Oh, okay.
David Showalter: It would be interesting if on the consult phase you worked in an expensive model first.
 
David Showalter: So if needed, the agent generates a CRP to request an expensive model's decision, runs it through, and then goes back and sees if it needs to talk to a — yeah.
Josh Phillips: Just like a real consultant. All right.
John Roadman: They're just as expensive.
Josh Phillips: Yeah. Right. Okay. So here is my command structure for my generic stuff. This is the stuff I'll take anywhere, and I've mapped these things to their terminology, though mine predate the paper by a good bit. But this is what it looks like in practice. I have a few things which I call briefing script generators, essentially, where I give it some basic thing I want done, or I describe it out — you can see here I've got a briefing script to generate briefing scripts, and they differ depending on what I want it to do. So if I want to write a new feature, I give it an idea like: here's what you're doing, here's where you're putting it — you're going to put this feature inside my specs markdown file, in a plan format sort
 
Josh Phillips: of thing. Uh and that I I’m making, you know, sure that it’s just writing the plan. It’s not going to go do the thing. That way I can talk to it like I want it to just go do the thing. I don’t have to like use weird language to tell it to write a plan. I can just say, “Go write me a cache that integrates with this webpage spec.” and I can send it to webpage spec and it can go fetch that thing and start defining out the requirements and stuff like that. Um, and so I have this sort of thing and it’ll generate that stuff out and then I’ll go and I’ll edit that thing more specifically. Uh, and I give it, you know, this sort of instructions sort of its rules of road. It’s your normal prompt engineering sort of stuff. Another thing that I will do though is that I’ll give it you know the readme file. I’ll tell where the scripts are for different things. you know what my monor repo application sort of thing is and this generally because all my code bases kind of are laid out the same generally this file I can just pick it up and take it anywhere um and so that’s kind of nice and then for my briefing spec that it generates out I give it a sort of little thing here and it generates it out and it’s real real spiffy and it does it all
 
Josh Phillips: the time and has no issue with this sort of thing, and it'll generate out this sort of briefing script. It's pretty fun. So I have one of these that's for a feature — it's focused on developing new stuff. Another one is for bugs, where I want it to think a little more about validation, about proving out the initial case, and it has a slightly different plan format. And then there's also this concept of a chore, which can be like a task. Sometimes I'll have a feature that generates out a whole bunch of briefing scripts, and then I'll find something really wrong with, like, 17 of them because it's doing some mass migration thing, and I'll generate a chore to go and edit all of those. So it gets real meta real fast: I have a briefing script for the briefing script for fixing the briefing script for another briefing script.
 
Josh Phillips: U, but it can get a lot of work done uh once you’re able to kind of uh churn things like that whenever it becomes relevant. Um, and so the thing is that these things, they’ll generate out these spec files, uh, which is what they’re calling the breathing script. And then I need to go have something to run those spec files. Um, which is my implement here. Uh, so I it is much more simple. So this other one was, you know, very complex. I’m giving it a whole bunch of information. This kind of lines up with what they’re calling this briefing script thing. A loop script is just like, hey, go do this. uh implement whatever was in that thing. Uh and then whenever it’s done uh summarize what you did and report the files and line change with git diff uh and then sometimes I’ll have another one that’s a loop to you know prime and go read certain things. So I just want to get it ready and have it go read stuff about my code phase generally.
 
 


 
Josh Phillips: And sometimes I'll make versions of this like "prime this package" or "prime that package," to have it go read the documentation on, say, Pydantic AI or something like that. You can do that. But then I also have some loop scripts that are kind of a combination between a loop script and what the MRP is. I sometimes make these custom linters for the things AI agents do that bother me — like using emojis, or doing what I call hedge code, where it adds a whole bunch of defensive fallbacks so things don't go wrong. So I have things I want it to run for that, and those spit out giant log files that I don't want to read if I don't have to. And so I have a command that says: hey, go run all of my special lints, then analyze all the logs and tell me what they say.
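A sketch of that lint-and-summarize loop script, meant to run on a fresh agent — the directory layout and check names are illustrative:

```markdown
<!-- .claude/commands/diagnostics.md (illustrative path) -->
# Loop script: run custom lints and summarize

Run this on a fresh agent, separate from the one that wrote the code.

1. Run every script under tools/custom-lints/ (emoji check, hedge-code
   check, mock detection, complexity limits).
2. Collect their log files; do not edit the lint scripts themselves.
3. Read the logs and report each finding with file, line, and severity.
4. Say which findings should block the merge readiness pack, and why.
```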
 
 
Josh Phillips: And I will generally do this on a different agent that is fresh. If I ran it on the agent that wrote the code — just like a real engineer — it might flub it: ah, it's fine, it's a minor lint issue, it's really not that important. If I spawn a new agent and give it this task, it's really good at finding the problems and deciding that they're not okay. So that's another benefit of these slash commands: you can quickly recycle context whenever the agents get fussy. And the fun thing is that you can chain these. I could do something where I'm defining a feature — and I'm fairly confident it's going to spit out something fine because it's more like a chore (I'd do this with a chore, maybe not a feature, because a feature I need to review) — like telling it to go fix all the mypy issues.
 
 
Josh Phillips: I can tell it to go generate out the spec so it's permanent, and that way if I need to restart the agent I don't have to start over — I can checkpoint it. But I also just want it to go do the thing; I don't want to have to hit yes again somewhere in the middle. So you chain these things too. And here's another one, where it's doing a loop script into the MRP — I don't know why this one's green and that one's red, but this one's green — where one is doing just lint sort of stuff, and another one is just running the tests: run the test suite, analyze the test results, and tell me what came out of that. And there's another one here I listed as a hedge-resolve, where I have some special linter sort of thing I made, and it tells the agent to go read this document.
 
 
Josh Phillips: And basically that mentor script is just giving it context from out in the codebase, and I can run it after the diagnostics thing to get something specialized. Yeah, I don't know if that's helpful, but this is non-abstract — this is stuff I use every day and have been using for three or four months now, since before I read this paper. That's why I said it matched pretty well with things I'd already been doing naturally that just work. So I like that.
David Showalter: Well, this is cool. I know you mentioned at the very beginning the tiebacks to agile and the different reminders you've gotten — like we were saying, working with the agents is like working with a team.
Josh Phillips: Yeah.
David Showalter: And all I could keep thinking about with this whole workflow is that it's basically like an ISO workflow for AI agents.
Josh Phillips: Yeah, absolutely.
J. Langley: Yeah, there are several thoughts.
 
 
J. Langley: One of them is a question: how many different competing papers or approaches are there that do things very similar to this, do you think?
Josh Phillips: I think the actual number is infinite. The number that really have a following — there are probably about five or so. Spec Kit, BMAD — and I don't necessarily like these.
J. Langley: Okay.
Josh Phillips: I'm just saying these are the ones that are out there. In general it's the same question as how many project management variants there are — you've got XP, you've
J. Langley: Right.
Josh Phillips: got agile, you've got Scrum, you've got Scaled Agile, you know — so it's kind of in that world. But I think the right solution is going to be whatever makes sense for the project you're building; you've got to customize these. Right.
J. Langley: Okay. Yeah. I'm trying to think — so a lot of that, you know, if I'm bringing somebody onto my project and I can explain, well, we're sort of agile, or we start
 
 
Josh Phillips: Yeah. Right.
J. Langley: here and then we modify, you know — if I could figure out how some of those lay out and then anchor off of something like that, that'd be kind of neat.
Josh Phillips: Yeah. Yeah.
J. Langley: And then the other thing is, I could really see using this as team lead training.
Josh Phillips: Right.
J. Langley: I mean, how to set expectations, how to be clear in your communication, how to verify results — that's kind of where I'm headed.
David Showalter: Yes.
Josh Phillips: Well, if somebody can’t manage a group of agents, can they manage a group of humans?
David Showalter: Yeah, 100%.
John Roadman: Exactly.
J. Langley: I mean, it's interesting — when you put it in an automated fashion, you can't get away with crappy leadership. It's kind of — well, I know they don't get tired.
Josh Phillips: Like, it's just you. It's your problem. It's not that they're lazy. No, that's you.
 
 
J. Langley: They don't, you know — well, they will talk back to you sometimes. But the thought I had was similar: it's when you start tracking and
Josh Phillips: Yeah. Right.
J. Langley: tracing and adding cost into it and things like that — if you had an agent that was costing a lot of money, you'd think about it. But if you've got five people
John Roadman: Nice.
J. Langley: on a team and one of them is expensive and you're not sure what they do, well, that just gets swept under the rug a lot of times.
Josh Phillips: Yeah.
J. Langley: I mean, kind of back to — we don't just hand out checkbooks to all the people on the team, but we will let nearly anybody schedule a meeting with 30 people in it, without an agenda and without a goal. Anyway, I've just got a lot of thoughts on this. I do think this has some serious legs as far as reaching a bunch of different types of people who may not be developers, on how agentic stuff is about to change things, or is already changing a lot of things, you know.
 
 
Josh Phillips: Right. Oh, yeah. You could totally take this and completely strip out the software engineering aspect. It's just agents. So I'm just using agents that are writing marketing copy.
John Roadman: Mhm.
Josh Phillips: And it’s you can use the same stuff, you know.
J. Langley: It still applies.
Josh Phillips: Yeah.
J. Langley: Yeah.
David Showalter: Yeah. Yeah. Anything.
John Roadman: But it's kind of like you hired a hundred people for that one person you had before, and they're running so fast that if you go to sleep, right, you just spent a ton of money
Josh Phillips: Yeah.
David Showalter: Um
Josh Phillips: Right.
John Roadman: and you're in trouble real fast, instead of having four people and you come to work the next day and nothing got done.
Josh Phillips: That is it. And that's why I started with this slide: this is fast. So fast. And it's good — that's the problem. It's fast and it looks right for the most part.
 
 
John Roadman: But as time goes on, right, the expectation is going to get higher and higher that I need a hundred people's worth — or a hundred agents' worth — of productivity, right?
Josh Phillips: Oh yeah.
John Roadman: And I just looked at your stats for the last five days, and you messed up — you got sick, everything went bad, and we lost a year's worth of work.
David Showalter: I always think of — I always say his last name wrong — but a few months ago he was talking about his current work with AI and just made a throwaway comment that he's spending 97% of his time on validation
J. Langley: Oh s***. Okay.
David Showalter: now.
Josh Phillips: Oh yeah.
David Showalter: He's like, "The writing is so fast. It's just the testing and validation."
Josh Phillips: Yeah. I would say that, outside of some specific use cases where it's not true, for the most part humans really shouldn't be coding that much at this point. There's definitely still 10 to 20% of the code that does have to have a human touch, but that's going to get smaller and smaller.
 
 
David Showalter: Yes.
John Roadman: I would think the validation would start getting smaller too, because you're going to start using agents for validation, right?
Josh Phillips: I would say — I think the quantity, the water pressure of how much validation you're doing, is going to increase so much. We
John Roadman: It's just the validating of the validation that's going to get more and more human-based.
Josh Phillips: might use agents more, but there's just going to be so much to validate, you know.
John Roadman: Yeah.
David Showalter: Yeah, and to get back to what we were saying earlier: hey, if you're not able to manage agents well, you're probably not good at managing humans.
John Roadman: Yeah, I was thinking ahead.
David Showalter: I heard somebody allude to this for training people about the potential of AI. I just wanted to get back to that point: if you have somebody who is good at leading a team of people, this is something you can show them to get across the potential of AI.
 
 
David Showalter: So, I just liked that point you guys made earlier.
John Roadman: Yeah. On the flip side, though, they may be good now, but they're good at a certain size — a small company — or they have a bunch of directors underneath them so that it seems smaller, right?
Josh Phillips: Hm. They might be rolling the bowling ball down the alley with
John Roadman: It may not scale. It may be that these things are so fast and so powerful that what's been done in the past is going to stay in the past, and the validation's going to become the hard part
Josh Phillips: two hands when they really need to just put thumb in. Yeah.
David Showalter: Hey, and that’s why we’re all doing this now, right?
John Roadman: and the ship
Josh Phillips: Yeah. I am fairly confident I still have bumpers on.
David Showalter: Dude, I'm still trying to get my shoes on.
John Roadman: When you get past the bumpers, though, it may be screaming so fast that you've learned all this stuff, but nobody else can jump in without wiping out severely when they try to. Uh-huh.
 
 
Josh Phillips: Right. Cool. Well, it's very interesting because there's dramatic power there. If you have one person now who can do the whole work of a program — right now, if one person leaves a program, that's not a problem. But what happens if one day the whole program just disappears, because nobody else can work with those agents or knows how to? That's — it's
David Showalter: Yeah.
John Roadman: You mean the resources? Oh yeah.
David Showalter: Yeah, there's a lot of talk popping around social media of using that intentionally for job security.
Josh Phillips: To me that seems really destabilizing. I don't know what to do about it.
David Showalter: Oh yeah, companies need to be aware of that.
Josh Phillips: Yeah.
John Roadman: So that's just going to force them to standardize how they do it, which means your MD files are all going to be the same, because somebody else has got to jump in and learn where
 
 
Josh Phillips: Right. Yes. Scaled Agile for AI is coming, I'm sure.
John Roadman: you left off, you know, if you died.
J. Langley: Oh my gosh.
Josh Phillips: Yeah.
J. Langley: Let's make a certification course and charge money, and then change it every year — change some words — and make people go back, take the training again, and get recertified to learn the new words.
John Roadman: Yeah. And then I will
J. Langley: Um and
David Showalter: Hey, I've got how to beat this. You know, all these organizations that do the certifications basically came up with them.
Todd Page: Sounds like
David Showalter: So, let's get ahead of the curve. I want to propose us making AISO.
John Roadman: Or you could start with an ISO 9000 kind of idea, where every company defines their own “this is how we make the same product at the end,” but they all define it differently. And then you have to go inspect based on how they define it.
J. Langley: No, see, you have to... So, after that, what you do is you do the search.
 
 
J. Langley: I follow what David’s saying. The next thing is you have to figure out how to let people tailor things. And of course, to tailor it, you’re going to need a consultant to help you tailor. And that consultant needs to be a certified consultant, of course.
John Roadman: Mhm.
David Showalter: which we handle, for a small fee. Yeah.
J. Langley: Right.
Josh Phillips: Oh no, no, no.
J. Langley: For a small fee.
Josh Phillips: A consultant model. You have a consultant model that you sell, on special inference that only you can host.
J. Langley: Right. Of course.
David Showalter: Yeah.
Josh Phillips: We have we have a you know here
J. Langley: Yes. Oh my gosh. Oh, this is
David Showalter: Yeah.
John Roadman: So with the ISO 9000 theory, it’s that you put out the same level of quality product every time. It could be terrible, it could be great; that’s not defined. What is defined is: what is it you put out every time, and how do you guarantee it’s that quality or better?
 
 
John Roadman: And that’s why they define what your process is, and then they inspect based on that process when you get, you know, yearly or quarterly or whatever, inspected to see if you’re following that process.
J. Langley: Yeah. No, this has been really cool stuff. I’m glad we jumped through this. Um, I would like to do, I don’t want to say a repeat of it, but I could walk through all of this again and not miss a beat, you know? If we want to do this, you know what I mean? I think we were planning on our social next week, but if we wanted to do this live instead, or, you know, I’ve got November right around the corner.
Josh Phillips: Yeah, I think we can do something in November.
J. Langley: Um cool.
Josh Phillips: I mean, it’ll be good for me to repeat this until January anyway. So, yeah. Uh, there are some interesting subtopics.
 
 
John Roadman: So,
Josh Phillips: We won’t go into them tonight, but maybe this will be something for later. I can just hit them real quick and then we can sign off.
J. Langley: Yeah.
Josh Phillips: Uh, but I think it is interesting to talk about the bitter lesson in the context of all this, which is that simple methods that scale beat complex handcrafted knowledge, and here you’ve got all this scaffolding sort of stuff. Um, and I think the big thing that really concerns me about this is that you still have a lot of human-in-the-loop sort of stuff, where you’re going to have to have the humans sitting down and reviewing it, and how do you effectively manage that time? Um, I think the nice thing about this is that it is all at a fairly high level. I think in general the bitter lesson is talking about, you know, compute architectures and AI transformer sort of stuff. Um, but I think that’s relevant to a certain extent here. So that might be an interesting thing to talk about, complexity with this stuff.
 
Josh Phillips: Um, the other one that’s really interesting with these is the concept of don’t repeat yourself. Are you guys familiar with that? The DRY principle.
John Roadman: Not really.
Josh Phillips: Uh, it’s more like, it’s really bad practice, if I’m running a code project, to have five different files that each create a calendar date table.
John Roadman: Oh yeah. So you write the same kind of function in many different places, instead of having one function handle it.
Josh Phillips: Yeah. And don’t-repeat-yourself basically says, hey, you should really just have one datetime module and import that everywhere, so that if I want to edit it, the change propagates everywhere. Uh, and this is one of the things that this paper talks a little bit about, and I thought it was interesting; it’s the first time I’ve heard people talk about it. And to me, like, I hate violating DRY. I hate having multiple things that do, you know, something that should be localized, but... Exactly.
John Roadman: That’s because you fix it in one place and you missed it in the other three.
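A minimal sketch of the DRY idea being described here, in Python; the build_date_table helper and the example usage are hypothetical illustrations, not code from the paper or anyone’s project:

```python
# dates.py — the single shared implementation that DRY asks for.
from datetime import date, timedelta

def build_date_table(start: date, end: date) -> list[date]:
    """Return every calendar date from start through end, inclusive."""
    days = (end - start).days
    return [start + timedelta(days=i) for i in range(days + 1)]

# Every other module imports this instead of rewriting it, so a fix here
# propagates everywhere:
#
#     from dates import build_date_table
#     q4 = build_date_table(date(2025, 10, 1), date(2025, 12, 31))
#
# The anti-pattern being described is five slightly different copies of
# this loop scattered across five files: fix the bug in one copy and you
# miss it in the other three.
```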
 
 
Josh Phillips: Yeah. It really pisses me off, and agents are really bad about this. Uh, and so I kind of go on a pretty authoritarian bent on having that get wiped out. And this made a good argument that made me second-guess that, where, you know, it’s talking about how humans are good at abstraction, but we have a limited context window, and so we optimize for things like DRY to work around that: I know to go to one place to edit the thing, because I can’t remember all the places I’ve written it. Agents have the opposite problem. So it’s pissing me off, but maybe the fact that it’s pissing me off is actually good for performance; that’s a possibility. You know, they can have all of their datetime functions, and it’s functionally the same every time. Um, and if it edits it one way, it’s in a place local to the file, it has that locality, and it can just read the whole file in, you know; maybe that’s better. So, that was one of the things it was talking about.
 
 
Josh Phillips: Um, I’m not sold on it completely yet, but it does make sense.
J. Langley: It’s similar to another thing I’ve run into, where we’re using some agentic stuff to go build out, you know, a full set of unit tests for basically all of our packages in the codebase. A lot of times what we’d see in the past is that some of these functions are really, really hard to get into this one state where these couple of methods get called, you know, so you just wave your hands: oh yeah, we’re at 98% code coverage, blah, blah, blah. Um, this thing can generate thousands of lines of code to get into every little nook and cranny. You know what I mean? And it’s like, yeah, it’s a boatload of code; I would never want to maintain that crap. Um, and it’s most likely a bunch of repeated stuff, you know.
Josh Phillips: Right.
J. Langley: Um, it’s almost like I need to treat it slightly differently, because an agent can deal with that stuff at scale, whereas I don’t have the memory for it, you know.
 
 
Josh Phillips: Yeah, it’s very weird. It’s almost like a compiled binary that we can read; that’s the way I’m starting to think about it, where it’s just all of these sorts of functions that have kind of leaked out, and it just happens to be in a form that we can read, but you really don’t want to be interacting with it directly.
David Showalter: Well, and I know this is counting on advances, but there’s the whole idea that, you know, the models are the worst they’ll ever be today. That gets talked about a lot. So, I keep thinking, okay, let’s say I end up with a very clunky, horrible, unoptimized codebase to run something. What does it look like if you just count on, okay, in a couple of months, let me have the models clean it up for me? You know, can we count on them iteratively improving over time?
Josh Phillips: Yeah, I think at least at another scale there’s no reason to believe they wouldn’t, if for nothing else than just things like Claude Code getting better. Not Claude or Claude Opus, but Claude Code, you know, Codex,
 
 
John Roadman: Mhm.
David Showalter: Yes.
Josh Phillips: Cline, those tools getting better at inferencing with the existing models is by itself going to add improvement.
David Showalter: Yeah. Like we’ve mentioned, the models are improving fast enough that we still haven’t hit the limit of any model along the way before we get an advance.
Josh Phillips: Oh yeah.
David Showalter: So even if it stopped today, we would have a few years of improvements left off of what’s already out there.
Josh Phillips: Yeah, absolutely.
John Roadman: So when do we get to a point where refactoring really, really large software projects is not hard, or are we already there?
Josh Phillips: Uh, how do you want to refactor it? I think for some things we are there. You know, what’s it written in, what’s your validation, do you have tests? It depends, right?
John Roadman: Yeah. Oh, so we could be already there. I just can’t get my mind around it.
J. Langley: Yeah, I’m kind of working backwards, using the model to build the tests first, and then we can go refactor some things, and I’ve got tests that confirm it works the same as it did before.
 
 
Josh Phillips: Yeah, that’s a really good use.
J. Langley: You know, and then eventually we’ll get to an integration-level test.
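A rough sketch of that tests-first, refactor-second workflow as characterization tests in pytest; legacy.normalize_name is a made-up stand-in for whatever function is being locked down before the agent refactors it:

```python
# test_normalize_name.py — characterization tests: pin down what the code
# does *today*, so an agent-driven refactor has to keep behavior identical.
# `legacy` and `normalize_name` are hypothetical names for illustration.
import pytest
from legacy import normalize_name

@pytest.mark.parametrize("raw, expected", [
    ("  Ada Lovelace ", "ada lovelace"),   # trims and lowercases
    ("GRACE\tHOPPER", "grace hopper"),     # collapses internal whitespace
    ("", ""),                              # empty input passes through
])
def test_current_behavior_is_preserved(raw, expected):
    assert normalize_name(raw) == expected
```

If the refactored version passes the same suite, it behaves the same as before, at least on the cases the suite covers.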
John Roadman: Mhm.
J. Langley: And, you know, I mean, heck, I had a thought the other day: there was something I needed an MCP for, and I’m like, well, crap, let me just find the spec, and I’ll have a model go write its own MCP for it.
John Roadman: Mhm.
Josh Phillips: Yeah.
John Roadman: Mhm.
J. Langley: Um, and soon you won’t have MCPs that you have to build.
John Roadman: Mhm.
J. Langley: The model will just go figure it out how to do it, you know?
Josh Phillips: Right.
J. Langley: Um, right.
John Roadman: Yeah, that’ll be the next thing.
Josh Phillips: And store it, put it in version control, and then other models can use it. I mean... Right.
J. Langley: Then you turn around and it’s like your 5-year-old learned how to dial the phone and they’re calling China, you know, and it’s like, whoa, hold on. That’s not what I I know, right?
Josh Phillips: Your model, depending on your model, it might do that, too.
 
John Roadman: That’s 6.0, right?
Josh Phillips: That’s Qwen for you.
John Roadman: Interesting. Because the refactoring idea, I mean, I’m just playing around with software, and I’m not a software engineer, right? I would write it once with Claude Code, and then I’d go back and run it again with Replit or something like that, and it’s like, oh, this is totally different and much more efficient and much more detailed. And, you know, at some point it’s like, well, it’s way past my understanding, and I’m not going to go read all that information to figure out: is it that much better, or does it just seem better?
Josh Phillips: Yeah.
John Roadman: So that’s kind of... it’s almost like, how are you going to get people to get that deep? It’s like going back to assembly language, right? How are you going to get people to go back that deep to say this is actually a whole lot better, after we’ve been living with AIs for five years from now?
 
Josh Phillips: It’s going to be interesting, that is for sure.
David Showalter: I mean, most software development today at large firms is not optimized. It takes advantage of increasing hardware.
John Roadman: So in all those MD files, is one of the MD files telling Claude: don’t do this again; when you run into this, do this? You know, because I found a lot of times in
Josh Phillips: Yeah, absolutely.
John Roadman: vibe coding, right, that it’ll mess something up, and then it’ll go back and start over again, and it’ll do the same screw-up again. It’s like, “Oh man, I forgot about that. I already fixed that once.” And it’s like, “Okay, well, here’s a memory so that you can remember not to do that again.” Um, and I was thinking MD files are the way we’re remembering right now. So, maybe one of those files is “don’t do this again.”
Josh Phillips: Yeah, that would be like this mentor-script thing.
John Roadman: Do it.
 
Josh Phillips: I think you would either do this mentor script with the markdown, or, if it became something that was really, really consistent, you know, have it write some sort of check that looks for it and tells it how to solve it, too. You could do that.
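A minimal sketch of what that might look like, assuming a hypothetical LESSONS.md memory file and a made-up recurring mistake (a hardcoded database URL) that proved consistent enough to automate:

```python
# mentor_check.py — sketch of promoting a repeated "don't do this again"
# note from a markdown memory file into an automated pre-session check.
# LESSONS.md and the hardcoded-URL rule are illustrative assumptions only.
from pathlib import Path
import re
import sys

LESSONS_FILE = Path("LESSONS.md")       # the human/agent-maintained memory notes
RULE = re.compile(r"postgres://\S+")    # the mistake that kept coming back

def check_repo(root: Path) -> list[str]:
    """Scan Python files for the known mistake and describe each hit."""
    findings = []
    for path in root.rglob("*.py"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if RULE.search(line):
                findings.append(
                    f"{path}:{lineno}: hardcoded database URL; "
                    f"load it from config instead (see {LESSONS_FILE})"
                )
    return findings

if __name__ == "__main__":
    problems = check_repo(Path("."))
    print("\n".join(problems))
    sys.exit(1 if problems else 0)  # non-zero exit tells the agent/CI it slipped again
```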
John Roadman: Mhm. Well, our context windows keep getting bigger. Do we expect that to continue to grow? At the same time, the models get smaller.
Josh Phillips: Um, I think the context windows will grow. It’s still going to be expensive, and the models still have the concept of context rot, where even though it can technically do, you know, a million tokens, all of those tokens aren’t really the same:
John Roadman: Mhm.
Josh Phillips: only the first 200K or so are any good for 90% of the tasks, so you technically can fill it up to 500K, but its intellect starts dropping pretty heavily.
John Roadman: Oh.
Josh Phillips: Um, so right, you’re still going to be incentivized to do things like RAG and things like that, or, I think the best thing is, to be able to always just clear the session and start
  
John Roadman: Mhm. So it almost seems like even though it’s large, it doesn’t necessarily mean it’s good.
Josh Phillips: a new session. And that’s why it’s good to have the markdown artifacts that are kind of gradually writing stuff down as you go, so that you can just clear it and get back to context zero.
John Roadman: Mhm.
Josh Phillips: Um, not necessarily; big prompts don’t mean good prompts.
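A minimal sketch of that clear-the-session, keep-the-artifacts loop; run_agent is just a stand-in for whichever coding agent you actually call, and PROGRESS.md is a hypothetical artifact name:

```python
# context_reset.py — every session starts from context zero plus the
# durable markdown notes, never from a long accumulated chat history.
from pathlib import Path

PROGRESS = Path("PROGRESS.md")

def run_agent(prompt: str) -> str:
    """Stand-in for a call to your coding agent; returns a summary of what it did."""
    raise NotImplementedError("wire this up to your agent of choice")

def one_session(task: str) -> None:
    notes = PROGRESS.read_text() if PROGRESS.exists() else "(none yet)"
    # Fresh context every time: the task plus the written-down progress.
    prompt = f"Task:\n{task}\n\nProgress so far:\n{notes}\n\nContinue from here."
    summary = run_agent(prompt)
    # Append what happened so the next fresh session can pick it up.
    with PROGRESS.open("a") as f:
        f.write(f"\n- {summary}")
```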
John Roadman: At the same time, the better you get at the context description, the bigger the prompt becomes, right?
David Showalter: Yeah.
John Roadman: Well, what I was thinking is, say you start the context over again, right?
Josh Phillips: Mhm.
John Roadman: Somehow you have to tell the model, here’s all your context, right?
Josh Phillips: Right.
John Roadman: And then after you’ve told it that, now you can keep going with the regular questions. But if it gets bad enough where I can only get three or four questions in before it starts losing its mind, then I’ve got to start over even faster. And that means that however large that context window is, that’s the maximum I can give it: context plus one question.
 
 


 
Josh Phillips: I think, generally, that happens all the time. I think the solution there is that you kind of stop and you step back. Sometimes you prune the tree, and then you split the problem up, or, like, give it additional tools, things that don’t require as much, because, I mean, sometimes they’ll just hit a wall where it’s too much, you know, and
John Roadman: Yeah.
Josh Phillips: uh, yeah, I might delete stuff, you know. It’s like, I can go in and I can edit its memory, you know; my agent got traumatized.
John Roadman: So you end up, like, taking all your MD files and just pruning those MD files down to what you’re working on, instead of giving it everything every time.
Josh Phillips: So, I’m just going to delete that one, which, uh, you know, there are probably problems with that if these things end up being conscious, but right now they’re not.
J. Langley: Oh my gosh.
Josh Phillips: So, yeah, I I yelled at my agent.
J. Langley: We’re gonna have HR, right?
John Roadman: in. That’s
 Josh Phillips: I’m going to edit that memory.
J. Langley: Yep.
David Showalter: Oh, I already got some looks one day when I talked about firing my AI employees. I’m like, well, I had to cut Gemini the other day; it just wasn’t keeping up.
Josh Phillips: That’s
J. Langley: That’s funny. Oh,
John Roadman: That’s back to that same cartoon where he said, you know, “Wasn’t that on my left side, doc?” And the doc is a robot: “You’re absolutely right. Because of your keen eye, we’ll go back in there and operate again.” And you’re on the left. It’s like, oh, great.
Josh Phillips: I just saw that. Yes, I know what you’re talking about.
John Roadman: It’s like, yeah, that’s what we want.
Josh Phillips: Yeah.
David Showalter: This is great. Thanks for doing this, Josh.
John Roadman: Yes, Josh.
Josh Phillips: Yeah.
John Roadman: Thank you.
J. Langley: Yeah. And y’all let me know how you liked the Teams, the... sorry, Google Meet versus the Zoom, if there’s anything that you liked better or worse or whatnot.