Coding Agent

OpenHands Code Agent

Transcription provided by Huntsville AI Transcribe

So we don’t have anybody new that hasn’t heard my normal spiel about Huntsville AI, so we will skip that part and jump right to it. A few things to hit before we really get going.

There is an AI session next Thursday from the AI Huntsville Task Force. It’s going to be a presentation by Deloitte — basically AI 101, focused on end users and the general public. Here’s what it is, here’s what you can do with it, that kind of thing. So if you’re interested — I’ll see if I can get a link to it sent out.

There are, I think, over 100 people registered for it so far, which is good. I haven’t even sent it out on our list yet, and there are only a few overlaps so far. But this will be really good for widening the — I’m not sure of the right way to say it.

Folks that are interested in AI, it kind of helps give them a place to plug in.

or a way to ask questions. Because one of the things I’m noticing is that with the increase in popularity of AI, you have an increase in the number of talking heads spouting off whatever it is they want. So you need more AI evangelism from trusted sources, I guess, in some way. So there’s that. Hopefully it goes well. Other things — I may have this up. Yeah, there we go.

Also, in this room next Tuesday, Alabama Launchpad is going to be doing a regional meetup.

So if you’re working on something and trying to figure out how to start a startup, or looking for resources or funding, this is kind of the area for that. It’s not normally a thing we do at Huntsville AI, because most of us have day jobs.

And, you know, if you want to start something, that’s great.

We can funnel you to the groups that do that.

This is one of the groups that does that.

So if you’re interested — the worst thing about it is that most AI ideas have already been implemented, or will be by next week.

It’s just a hard thing to have an original idea and go build a product right now. The other thing I wanted to do was give a shout-out to a podcast called Practical AI. If this one isn’t on your list, it’s actually the one I listen to. Trying to remember what I was doing.

This is actually where I got into the whole — hey, there’s this All Hands thing, OpenHands. And this was last week, I think.

But anyway — Chris Benson actually works at Lockheed. One of the other guys works at — I’m not sure if it’s an ethical AI startup kind of a thing — but they’re usually coming at it from an “I want to integrate and use this” angle, not necessarily the research side. Here’s a new thing; I’m trying to figure out what to do with it.

So there’s that. So, dropping into OpenHands. Let me find it. Oh, you mentioned Devin — there was a thing called OpenDevin or something; this is what that is. It used to be OpenDevin, according to their statement here. So anyway, I was looking through: okay, what is this? What do you do with it? Of course, there’s a cloud version you can pay for, blah, blah, blah. Or see the guide for system requirements. And then I came across this guy.

I’m like, I’ve got Docker all over the place.

I use it all the time. Let me try this. And I did. And it came up.

I’m actually not running it anymore. Hold on. I forgot I had to reboot in between.

Live demos. Fun stuff. Glad I caught that, because the thing I was about to show was not going to work. Sure, paste anyway. Don’t y’all always do this, where you copy something in and it’s an executable thing in your bash, and you say, sure, why not?

As long as there are no credit cards involved.

Yes, that should get going. We’ll let that roll for a minute. I was just following their instructions and all that kind of stuff, and now I’m running OpenHands at whatever this is.
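For reference, the launch is basically one docker run command from their install guide. This is a sketch from memory, so treat the image tags as placeholders and check the current OpenHands docs before pasting it:

```bash
# Rough shape of the OpenHands launch command -- tags here are placeholders.
# The docker.sock mount is what lets it spawn its sandbox container, and the
# state volume is what persists your settings between runs.
docker run -it --rm --pull=always \
  -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:0.38-nikolaik \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v ~/.openhands-state:/.openhands-state \
  -p 3000:3000 \
  --add-host host.docker.internal:host-gateway \
  --name openhands-app \
  docker.all-hands.dev/all-hands-ai/openhands:0.38

# then open http://localhost:3000 for the UI
```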

We’ll see how well it actually spins up again, and if it remembers anything I put in before, because there was a little bit of setup.

Have you played with memory limits in Docker?

Not much.

So maybe this thing will still be up. Nope. Stay away. We’ll talk a little bit more about some of the other ways to run it and then we’ll flip back.

So to talk through it a little bit, I was able to get that up and running.

It needs a couple of pieces of information. If you’re working with GitHub, it needs at least some kind of a token it can use to interact with GitHub.

Initially, I went and made my own personal token and handed it that so it can do things. That was good enough, but then I actually went and created a separate GitHub account called Huntsville AI Agent, made a token for it, and added it as a separate agent user on GitHub. Now I can actually track which parts were done through an agent versus which parts were done by me. One of the reasons I did that — well, we’ll get there in a minute, but OpenHands itself has an OpenHands agent account associated with their GitHub, and you can actually go see all of its contributions to their own repos, you know what I mean?

It’s kind of neat to see them using their own thing to do stuff.

So I started off and plugged that in. It also needs to know which model you want it to use. Sure, that works.

Oh, you grabbed that door.

There we go.

So it wants to know which model to use.

It can use OpenAI models.

It can use Claude models. It can use pretty much whatever you want. I did find — and if we get far enough in the weeds, I’ll show you where it is — it does seem like they had to do some things differently based on which model was selected. I found a part in one of their code bases that said, hey, if it’s Gemini, I’ve got to do it this way, versus the rest of them I can keep going — which was kind of neat. We’ll jump into that as well. Let me see if this is up yet.

It looks like it.

Let’s see.

Ah, there we go.

So now it’s waiting on a client.

Anyway, this is what their thing looks like after you bring it up, if you run it on your own machine.

So the first thing they’re going to want, if I go to settings, you’ll see where it wants to know which provider.

You’ve got, sure — Anthropic, DeepSeek, OpenAI.

I believe you can put your own entry in there if you have a self-hosted model somewhere.

And it wants to know which one to pick.

Right now I’m running o4-mini for that.

And then it wants to know what your API key is.

So the thing I will tell you — the first thing, of course: I went and created my own project on OpenAI for this and immediately assigned it a budget.

Right now I’m at a $15-per-month budget, because this thing can get quite chatty.

There are some other settings — I don’t know if I’ve got the advanced ones on. Some places I’ve found — I think it’s in the other part I’ll get into — but in some of the other integrations, you can actually tell it the max number of iterations to take when it’s going through and trying things.

So initially there was that, and then I connected to a repository for presentations, which is what we’re playing with now, and I hit launch.

So, some of the things they built in out of the box — which is something we miss a lot as developers; we like to put every little button and switch in there. Over time, they found that there are like four different things that people always tend to do with this.

You know, hey, I need unit testing.

I need better test coverage. Or, hey, can you fix the merge conflicts on this?

You know, or, hey, I need to, well, apparently fix Ruby.

Dependencies are another big one, especially if you’re operating in something like Node or, you know, pip and everything — you keep getting a “hey, this doesn’t work,” or “I’ll update this, but it doesn’t work with that other dependency of this thing.” And they’ve got stuff like that basically out of the box.

You click the button it goes and tries it.

I will let this roll for a minute. What I did actually got captured in a pull request up here.

So, this is exactly what I typed in.

I said, I need a presentation in a markdown file in a 25-whatever directory.

Use the same format as the other ones.

Here’s the subject. Here’s a reference. After it’s completed, create a pull request.

Name of the file should be blah, blah, blah, open answer.

And the first thing it did, created a commit.

Which one was the first one?

This guy. It named the file the right thing.

It used the same logo, the same everything, based on everything else. It came up with an overview, along with how to install, how to set up your keys, what to run — basic general information. And it created the pull request, and did this whole “hey, this pull request...” — that was all generated by the agent. And this was my first thing.

It’s like, wait a minute — but it’s got my name on it. That’s weird. I’m going to confuse myself if I keep doing this, because I can’t remember if I made a commit or not, you know. So the next thing was… I needed an example of… Sorry, go ahead. I was looking at the task force framework.

Oh, okay.

The frameworks you’re covering include the AI Huntsville Task Force.

Oh, that’s fun.

High hallucination, I hear.

No, that’s a direct copy from my email system. So that was me. That was not a hallucination — or it was my own hallucination. So that’s funny. So, the next thing they do have — and we’ll get into that.

I’m jumping all over the place because my brain’s a little frazzled.

There are other ways to run this OpenHands agent. The way we’re doing it right now, you spin up your own little Docker thing and it acts like its own standalone app. The second way is you go create an account on their cloud offering, which is fine — they even have their own hosted models, and they give you $50 of credits, because we all love our credits. And then the third way is to add it as an action in your GitHub Actions or GitLab setup. So I said, hey, give me an example of how you do it with GitHub. And it went and did that.

It added it to pull requests, added it to the presentation. And then — this was just me cheating, because I needed information to put in a newsletter to send out to everybody — I said, hey, here’s a copy of my previous newsletter. Use the same outline format and voice from the example text.

And so this is what I gave it. If you look at the newsletter that got generated, there are two links in there that are wrong.

One of them is wrong. The other will be correct as soon as I merge the pull request. It was smart enough to update the date to today, even though I didn’t tell it it was May the 7th — it picked that up out of the name of the file I had asked it to create, which was a little surprising. It kept the same time while updating the date. And then — I don’t think I kept the version that had its attempt at making up a Zoom link; it was close, but not a valid URL. And I asked it, hey, add that as a comment to the pull request. Oh, actually, I do have it, okay — so it went and created this, which is pretty much what I copied and pasted into the newsletter. The presentation markdown link will be correct after we merge the pull request. This example is not true: it got the name right, but the pwd-equals-dot-dot-dot part isn’t really a thing.

Other than that, everything in here was right, which was pretty interesting. So I guess we can go ahead and merge this thing and see what happens.

Okay, so there’s that.

So now this link should actually be active.

So here is the OpenHands code agent presentation that it made for you.

So they’ve got it split into a bunch of different microservices.

If you go look at their code repo, there is a pretty good diagram, which I tried to get it to pull, but apparently it didn’t. So let’s jump over to that real quick.

OpenHands documentation.

Star history somewhere. I walked through this a decent amount trying to figure stuff out. Let’s look in.

That wasn’t it. There we go. So they’ve got a pretty modular setup.

You’ll see some MCP in here, you’ll see some agents — similar to some of the coursework or some of the stuff we’ve done before — where they’ve got an agent, a thing that reasons, a thing that has tools available for use, and a thing that actually activates all of that and orchestrates it.

It did seem to be a little simpler, actually, than some of the other stuff. They’re using LiteLLM, which is a fairly popular library. It’s definitely different from the Hugging Face approach to agents, where they put a lot of effort into the whole “oh, just make a tool — but it’s not really a tool, it’s Python code, but we’re not really going to execute your Python code; we’re going to take it and use it to explain what to do” — that was kind of odd. But if I go back, the main part to look at, if I can find their code agent… code agent.

Actually, the other thing is that it’s a really well-laid-out repo in general.

Most of their top-level folders have READMEs about, hey, here’s the contents of this module and here’s what to do with it.

Things like that.

So here’s their basic abstraction for an agent.

That’s the main part.

So, thinking about what they call an agent — that’s not really what I was looking for.

But they, of course, created their own class called Agent.

The other thing I would mention is that some of this stuff actually predates a lot of the — like, this would be called the year of agents, as far as AI goes — a lot of this stuff was from fairly early last year. So they’ve been doing this for a while. And it is kind of interesting to see which parts of what they were doing wind up in something like OpenAI’s agent framework or Hugging Face’s agent framework.

But their main pieces are: you’ve got some kind of a way to get the system message, and a function to tell it it’s complete.

But the main piece is something called step — you’ll get called and told, take the next step.

And the result of a step is a list of actions, or basically an “I’m complete” event.

And the interesting thing here, which I really didn’t see in some of the other places: one of the actions an agent can spit out is actually a request for more information. I don’t think I’ve got it on here, but I’ve actually asked this thing to do stuff — I guess we could do that. Okay, I don’t know if I want to do that live. There’s a transcription repo that I’ve got on Huntsville AI, and I was trying to expand the unit test coverage of that. I clicked the main button for, hey, increase test coverage. And it came back and said, well, hey, do you want to use PyTest? Or — I can’t remember the other Python test frameworks.

It’s like, which one of these do you want?

I’m like, well, PyTest, that sounds like a PyTest. So yes.

So it is interesting that you can actually have your agent go back and ask for more info.
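To pin down the shape of that abstraction — this is an illustrative sketch of the idea, with made-up names rather than OpenHands’ actual classes:

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    """A tool call the orchestrator should execute."""
    tool: str
    args: dict = field(default_factory=dict)

@dataclass
class AskUserAction:
    """The interesting one: instead of charging ahead, the agent can
    return a question, e.g. 'pytest or unittest?'."""
    question: str

@dataclass
class FinishAction:
    """The 'I'm complete' event that ends the loop."""
    summary: str

class Agent:
    """Rough base-agent shape: a system message plus a step() that the
    orchestrator calls repeatedly until it gets a FinishAction."""

    def system_message(self) -> str:
        raise NotImplementedError  # rendered from prompt templates

    def step(self, state) -> list:
        """Given the event history so far, return the next actions:
        tool calls, a question back to the user, or a finish event."""
        raise NotImplementedError
```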

Yeah, anyway, that was neat. Getting into the code agent itself, one of the things I thought was interesting initially is the way they do it: they’ve got all of their prompts thrown into Jinja templates, and then they basically build these things up — I’m not sure the right way to say that — they generate all the rest of it based on that.

But it is pretty interesting to see their approach at a system prompt.

So you can kind of see how you could actually take this, and if you needed your agent to be slightly different, here’s a really, really good starting point.

Maybe your system prompt needs to be something different because you’re trying to solve a different problem than a code agent, you know, things like that.
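As a rough illustration of that template setup — the file names here are hypothetical, not their actual repo layout:

```python
from jinja2 import Environment, FileSystemLoader

# Hypothetical layout: a prompts/ directory holding Jinja templates.
env = Environment(loader=FileSystemLoader("prompts"))
template = env.get_template("system_prompt.j2")

# Swapping the variables (or the template) is how you'd repurpose the
# agent for a different problem without touching the machinery around it.
system_message = template.render(
    role="You are a careful coding agent.",
    extra_rules="Follow the repo's existing naming conventions.",
)
print(system_message)
```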

But basically, there’s a lot of stuff in here — the kinds of things you would expect to see for working with code.

If I jump back into the main area — prompts. So, for like a system prompt… that’s not very useful.

Or a user prompt.

I guess they don’t have that.

Not sure if they’ve got… That’s interesting, but not super useful.

Let’s see if we can look at the actual code for the code agent.

Oh.

They’ve got several kinds of tools they use.

They’ve got a create-command-run tool, a browser tool, a finish tool, an IPython tool, an LLM-based file edit tool — whatever that means — and a string replace editor, which I’ve seen get called over and over again in the output of their tool set.

I don’t know what a memory condenser is.

Apparently it’s a thing.

Say you go to a new chat — it’s summarizing the last chat you had so it can go back to it, almost like a RAG sort of thing. Okay. I think it really helps with context length as well.

If I have tons and tons of steps, I don’t need all of that. I can start condensing it down to the important things, so I keep on the track that I’m on. Okay. They’ll burn through tokens a lot too, I’ve noticed.

Yeah. That makes sense. So they do have a good bit of overview, as far as the description of it goes.
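The condenser idea seems to be something like this — a toy sketch, not their implementation; the `llm.complete` handle is a stand-in for whatever completion call you use:

```python
def condense(events, llm, keep_recent=10):
    """Toy memory condenser: summarize older steps so the prompt stays
    inside the context window while recent detail is kept verbatim."""
    if len(events) <= keep_recent:
        return events
    old, recent = events[:-keep_recent], events[-keep_recent:]
    summary = llm.complete(
        "Summarize these agent steps, keeping key decisions and open questions:\n"
        + "\n".join(str(e) for e in old)
    )
    return [f"[condensed history] {summary}"] + recent
```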

But I think there’s a little different flavor to this one.

I think there’s a piece in here that will get the set of tools available. There is, I believe, the agent itself.

There’s something — I don’t know if it comes from OpenAI — a chat completion agent, or a chat completion something, that I’ve seen them subclass.

Completions is just the general generation endpoint name for OpenAI.

Okay, so they’re just doing that — so they’re not pulling any agent-specific thing from OpenAI.

A prompt manager, Git tools — which do a lot of things. And then again, you get the reset, and the thing for step.

So it returns a bunch of different kinds of things when it steps.

Again, a lot of this is based on the tools.

The agent finish action is basically what it sends when it thinks it’s done, done.

Back on the Hugging Face side, they had their own kind of “okay, I’m done” thing that they used more as a token. They’ve also got a thing where, if you’ve still got actions that are pending, it can take that into account. Yeah, this is where it’s condensing events, things like that. We’re not going to spend too much time in here, but here’s the place where it does something specific if the model is this one very specific Gemini 2.5 — which sounds a little odd to me, hard-coded in the middle of an agent, but…

That’s very funny because they actually probably just fixed that.

That was the 05-03 model that they just let out. Yeah. They said like, we fixed function calling.

It’s like one of the big things that they said.

So now they’ve got this piece of debt sitting in here. LiteLLM pretty much assumes that everything is using the OpenAI API spec.

They have some special handling for other providers, but they try to use the OpenAI format for everything.

Anything that deviates from that gets a little wonky with them. OpenRouter too — anything in the “call one library, inference any model” business is doing that sort of thing.

It reminds me of early Java where most of the Java packages were fine and then you’d pull in this one other package from somewhere else and all of a sudden you’re tied in with this thing and it doesn’t work with this other thing.

I can’t remember what the — anyway, the good news is there’s likely about to be a code agent fix that rips this out and puts it back like it was, maybe.

Anyway, that’s their general approach.

It was kind of neat to see they’ve got one that’s a dummy agent that they use for testing. It basically always returns the same thing instead of going and calling an LLM — a harness so they can test some of their other stuff. And it’s, I guess, deterministic, which kind of helps in your testing if you’re trying to figure something out.

They’ve got a browsing agent, and a read-only agent, which is kind of like a code agent except it actually doesn’t make any changes.

It just tells you what changes it would want to make, which I guess is kind of useful. So anyway — the one they’ve got that runs from GitHub Actions, let’s go look at that real quick.

I went and added that to our hackathon starter kit. This is something we’ve had. We walked through this a couple of times.

We did it last year, in year four. Hudson Alpha Tech Challenge put this out as part of their initial material to help folks — students especially — that are coming to a challenge and just need a place to start. It’s got some sample notebooks with some scikit-learn stuff, so you can start from somewhere.

So anyway, it’s fairly simple.

I use that word.

So I went and added this whole OpenHands issue responder, which is a cut-and-paste from what OpenHands provides as a hook. It triggers on issues that are labeled, on pull requests, on issue comments that are created, on review comments — things like that.

It can write content to pull requests and issues. And basically it’s calling this thing with a handful of parameters — you can give it your own macro for comments, or, by default, it tries to grab @openhands-agent mentions.

There’s a couple of parameters you can put in for max iterations.

It uses its own container image.

It’ll pick up your model, but it really likes the Anthropic ones.

It took me a while to get it off of Anthropic and onto the one I was trying to use, because I don’t actually have a Huntsville AI account with Anthropic.

Target branch — it tries to go to main. In other words, if it generates a pull request, it’ll try to route it to main unless you have your own target branch defined.

And then, of course, it wants some tokens.
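Put together, the hook looks roughly like this — a sketch based on the example OpenHands publishes; the exact input and secret names may have drifted since:

```yaml
# .github/workflows/openhands-resolver.yml (sketch -- check the OpenHands
# docs for the current reusable-workflow inputs)
name: Resolve Issue with OpenHands
on:
  issues:
    types: [labeled]
  issue_comment:
    types: [created]
  pull_request_review_comment:
    types: [created]

permissions:
  contents: write
  issues: write
  pull-requests: write

jobs:
  call-openhands-resolver:
    uses: All-Hands-AI/OpenHands/.github/workflows/openhands-resolver.yml@main
    with:
      macro: "@openhands-agent"   # the comment trigger
      max_iterations: 50          # cap how long it can churn
      target_branch: "main"       # where generated PRs get routed
    secrets:
      PAT_TOKEN: ${{ secrets.PAT_TOKEN }}
      PAT_USERNAME: ${{ secrets.PAT_USERNAME }}
      LLM_MODEL: ${{ secrets.LLM_MODEL }}
      LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
```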

And so I went and added some tokens and did some stuff. And then later on, I realized that there are others in the Huntsville AI group that I have on this GitHub account. So if you happened to get kicked earlier this week, I removed you from internal and made you an external collaborator, because I don’t want tokens flying around. Anyway, that’s fine. There’s another thing, and I will kill this before, you know, before tomorrow comes along. Now that this is in here, if you create a new issue — which we can go do real quick.

So I’ve got some issues for adding an iris notebook.

This was just testing some workflow stuff. So what is a thing that you would want to see? Oh, I know one — how about Gradio? I’m not sure of the right pronunciation. So, what do I want to do?

Create, if I type an example.

What would you do as your first off-the-ground example if you were trying to show a college student how to do some kind of — I’m not sure — linear something… maybe MNIST?

I’m trying to think of what would be something that you could throw into a Gradio kind of a thing.

MNIST?

Just go MNIST.

Any specific classifier you want to try?

Any regression, logistic regression?

I’m trying to… I can let it just try to figure it out.

XGBoost? A decision tree? What do you think?

Patterns.

Top number patterns.

Yeah, I’ll just tell it to use some kind of classifier.

So it operates two different ways.

If you want it to look at the entire piece of an issue or a pull request or something, you use a label that says fix-me.

If you want it to just address a single comment in your issue, then you do the @openhands-agent mention.

And it’ll just do what you said in the comment and not try to do the whole thing. Let’s do that. Let’s create this thing.

And let’s see what happens.

I don’t know if I need to refresh or if it just automatically pops up or what.

If it’s an action, it’ll pop up. Okay. Let’s see, well, it’s already, okay, so it’s running in the action.

There we go — OpenHands started, and here’s my Huntsville AI agent account that I created, along with a nifty icon I tried to do. So let’s track the progress. You can actually see this guy running. What it’s actually doing is starting up the same container that we’re running locally; it basically took what I had in the issue and added it to the container as a task to execute. So instead of running on my box, it’s running in a GitHub action. It’s still using the same tokens and all the stuff that I’ve got associated over at OpenAI.

It’ll take it a little bit, but while that’s going, let me see if I can find my usage page somewhere.

Usage. So I’ve already done a couple of things today, but so far I’m at 28 cents for the day. So none of this is free, of course.

The other thing that occurred to me earlier today is that the starter kit is a public repo. It’s been distributed through emails to a pretty wide audience, you know what I mean. And now anybody that happens to come create an issue and label it fix-me is going to cost me money. So next up, I’m looking at how to modify that action in GitHub to only apply to my username, or some other way to constrain it, so that I’m the only one that can trigger it.
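One possible guard — untested here — is a job-level condition on who triggered the event:

```yaml
jobs:
  call-openhands-resolver:
    # Only let one account trigger the resolver, so drive-by issues on
    # the public repo can't spend the API budget.
    # 'your-github-username' is a placeholder.
    if: github.event.sender.login == 'your-github-username'
    uses: All-Hands-AI/OpenHands/.github/workflows/openhands-resolver.yml@main
    # ...rest of the job as in the workflow above
```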

Of course, if too many people hop on, they’ll run through my $15 pretty quick. So let’s see if we can go back to the actual fun stuff.

So it’s cranking along. It’ll eventually… It’s going through my repo and looking at a bunch of stuff that it’s already found.

The one that I had created locally, to go add the test instrumentation to the transcription engine — that one took nearly 30 minutes to run. But it came back, changed a whole bunch of files, and got it done. I guess we can jump over and look at that in a second as well.

It cost me, I think it was like a dollar or something.

That was whatever day this was — so $1.60 for like 30 minutes’ worth of AI time, for whatever it was doing. But the neat thing about how it works from the GitHub action side: I do a decent amount of GitHub stuff just on my phone. I create merge requests; if I have an idea about a presentation I want to give, I’ll just go ahead and create a little issue in GitHub and type in some stuff or link some things I’ve come across. I’m at a point now — well, who knows what this thing’s doing, but anyway, we’ll keep letting it roll.

I’m almost at a point where some of the smaller things that I would normally go log into a machine and go do, I could just create an issue and say, hey, go do this and assign it and just submit it.

And by the time I’ve made it home from work, it’s probably done.

So I’m hoping this thing actually works out because it would be a pretty bad demo if it didn’t. We can go look at some of the other ones it did.

So I did ask it — one of the issues I did was, hey, I need to add an iris notebook. And basically I got the same kind of thing: it used a KNN classifier and pretty much made me a pull request.

If I go look at the files changed, it did add.

an iris classification notebook. If you look at it in notebook form, it looks like what you would expect. And it went and updated my README to say, hey, by the way, you now have an iris classifier. I guess I can go ahead and merge this and see what happens. Oh, that’s the one that’s still a work in progress. What do I need to do? The merge button was on the last — on the actual page, or the review.

Add your review.

Like, where?

Up the top.

Press the green button. Okay. Then there’ll be an approve.

Review changes. Green button again. There you go.

Submit. Anyone can approve? Yeah, got it. That’s very LGTM. Say sure.

Okay, so am I done?

So can I go look at dev and see this thing now?

Review successfully submitted. I’m trying to figure out — click the “ready for review.” Looks like it put draft on it too. I think it’s because it’s a draft, yeah. There you go. Yes, do the thing.

Nobody’s going to get hurt. All right. So now if I actually go back, I’m looking at dev. I can see where it’s at in the iris data set.

If I go look in notebooks, hopefully I see… It didn’t put the 06 in front, which I’ll give it a demerit for. Uh-oh.

Yeah.

I don’t think that’s a thing. It’s a new one to me. Is that good? No. So this is what I kind of saw when I was looking, going through the code.

It doesn’t look like it has any validators.

Right.

Which seems to me to be the main issue with this: it’s not going in a loop.

You know, it’s not doing some of the things that things like Cline or Roo do.

Right.

Where it’s actually getting the active feedback.

It does something where it actually wants.

I guess we’ll go back and scroll up some.

I have seen some things where it will actually go try to compile it and run it in some kind of an IPython thing, and then see if it actually works or not.

I do have it clamped to 10 iterations.

I don’t know if that is killing me or what.

Does it do the same thing in here that it did in the UI thing that you had, the VS Code looking thing?

Do you have like that UI?

I can see it doing more there.

Like back over here?

Yeah, yeah. I think it’s the same image with the same set of things. Let’s go see what the other piece has worked out so far.

So one thing, interestingly — I’ve not seen KNeighborsClassifier before, but that actually is a valid scikit-learn class.

Which is bizarre.

I’ve not seen it before.

I always thought it was just kNN, but KNeighborsClassifier is part of the latest stable sklearn version.
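For what it’s worth, it is real — a minimal check:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# KNeighborsClassifier is scikit-learn's kNN implementation.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on held-out iris samples
```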

kNN’s expensive, so they all used… I don’t know.

That’s a little okay. I would be curious, I mean obviously you’d have to go through and like uncomment a lot of that code, but I do wonder if it actually would run if you went and formatted it.

I’ve always had a problem with doing the notebooks too, because it’s actually this sort of weird JSON crap underneath.

That would, you know, whenever I saw it was doing a notebook, I was like, oh okay.

Yeah.

Nice.

Yeah.

I didn’t think you could really do that. Not nice.

Probably that.

Okay.

Oh, is it still going?

I’m just watching the — this looks like one of the terminal interfaces launching.

Oh, maybe it was in draft because it wasn’t done. What was that about doing it live? Pretty much. It’s like racing — if it’s going to blow up, it’s going to do it in front of everybody. Might have done it before the check resolution result. Yeah, yeah. Whoops. Oh, no, they’re just telling you to do it now. Interesting. Change to make sure that… Huh. Interesting. Awesome.

Hopefully this thing actually finishes in enough time for us to look and see what it’s done.

It’s looked mostly correct so far.

Yeah.

Some of the stuff it did on the “hey, I need better test coverage for my transcription thing” — most of what it added was React, because my front end is React, and the reason it didn’t have any tests before was that I don’t know how to do testing with React; I hadn’t learned that yet. So unfortunately, the stuff it created, I’m not good enough to evaluate and say whether it’s good or not. The piece it did with Python was actually correct, because I do know how to do that. It’s still rocking and rolling and doing something — I may have to check back later and see if it’s actually done done. So, some of the things that were pretty interesting: it seems like a very useful thing, especially if you could take their framework and extend it to do something a little different for your specific program. If you had certain ways your program operates, or different coding standards you have to apply that others don’t, or specific security stuff you have to deal with, that would be an interesting follow-on. The other thing is, this is probably the first time I’ve come across the agent stuff we’ve been talking about that actually has an example that is usable and useful within 15 minutes.

You know, hey, read this, run this doc. Okay, I did that. It gives me a UI.

Okay, what do you want to do?

I click the improve test coverage and it’s off to the races.

I think a lot of the stuff we do from, this may just be the developer side of me, but we like to build things.

And we like to do a lot of things and sometimes we don’t actually think like a user and make something useful to one person, you know.

So anyway, that was interesting. The other thing this brings to mind — I want to cover it quickly, just to get some opinions from folks in the room. It’s off the rails, but did anybody see this today? This was CrowdStrike — cybersecurity, things like that — and apparently they’re looking at cutting 5% of their jobs, like 500 people. If you get down to what their CEO actually talked about, the key statement was right here. I could see somebody picking up the recording that we’ll post about this session in a week and going, oh, I can just pull the OpenHands AI agent and drop that in, and now I’ve got 10x agents or whatever. The main reason for this site was to grab that quote — I typically automatically discount websites with a bajillion advertisements and pop-up videos and crap on them. The question really is: where is this useful? How useful is it?

What does it actually buy you? For me, especially for Huntsville AI, there are a lot of things I would use this for. Hey, I’ve got an idea for a presentation or whatever — drop it in an issue and have it at least go get something started. It was good enough to write a newsletter. It was good enough to do some things. But actually making code changes to something in production? Not happening. You know?

Where do you think the line is?

I don’t think that any of the agents are at a point where you can just kind of let them go do something, which makes this not super useful to me. Because, you know, when you do something like your Cursor, your Cline, your Roo Code sort of things, it’s there in my environment with me.

You know, I can tell it to shut up whenever it’s wrong.

I can give it a whole bunch of tools.

I can not give it those tools. I can define custom instructions.

And more importantly, I can give it access to my static analysis.

You know, it can see my linter that has my ESLint stuff.

Right. And, you know, this thing doesn’t have any of that.

It has, you know, its environment inside of its little Docker container that has to spin up and run its FastAPI server.

And so I think, you know, this seems very useful in maybe five years, whenever we can fully trust that I can give a super, super cheap model access to my GitHub and it can spit a bunch of stuff out,

and I’m not going to get that GitHub issue DDoSed, or have it DDoS me through its crappy code that I have to review.

I like the idea. The value prop — your 15-minute thing here — makes it way easier to see, like, oh, this could be useful. But whenever you actually get into it, it might be hard.

I’ll keep using it on some of the presentation stuff and see what happens.

I will definitely keep it rolling on this hackathon thing, because it’s throwaway stuff anyway, you know. The other interesting thing — let me get this off; it’s bugging me having all this stuff up. One other thing I will say on that point: I feel like places where you already have well-constrained workflows, but there’s variability — that seems to be where agents are very valuable today. Things like some finance application or some payroll thing, where it’s generally very constrained but there are small deviations, or places where I would need it to be a little bit smarter.

There was a conference in DC — like an AI summit, mainly focused on Department of Defense stuff. Everybody’s talking about agents — AI agents, where can we use them, and so forth. But the big thing is, there’s not been that critical mission application yet. And I think a lot of that is because when people see this, they immediately think of all the cool things it can do, but if you don’t put it into a constrained environment — to your point — you’re at such a risk of hallucination that it’s not going to be worth it. So I think there’s been a lot on the back-office side of things where it’s been very helpful from an agent perspective. But I think it will be a while before you see it as more of a product people are buying. It’s going to be able to keep doing more — the high bar of the coolest thing it can do is going to get better and better and better. But the reliability — until it’s at five nines, we can’t let it loose.

So it’s going to be a very weird thing where it’s like we’ve got a super intelligent master PhD, but he’s drunk all the time.

So that’s one thing. I feel like it’s even worse.

It’s like he’s not drunk all the time — he’s drunk like 5% of the time. But the only way to know is to go review 100% of what he generates. And that’s where it becomes difficult in a lot of places to even use LLMs unless they’re very heavily constrained.

Because you don’t know when it’s going to hallucinate.

I’m the bottleneck. It’s like you spend more time debugging AI than you do working on your own code.

So you move the work.

Yes.

Which for some applications, that’s fine because you end up, that is a time saving or cost saving to do it that way.

Right. But not all of them. If you had something where — let’s say I’ve got a starter kit for an AI whatever, and I’ve got this as my base, and I need to adapt it for this competition or that competition — I could see that going from one template to another, being a transform type thing, might be pretty easy.

It almost reminds me of one of the concepts or philosophies on AGI is basically, yeah, AI can do a lot of things that are hard for humans to do, but it still sucks at a lot of the things that humans find easy.

And until it’s at a point where the things that are easy for us are easy for it, we’re not there. It’s similar here — you can give it some very specific, constrained things, and it can go do them faster than a person could.

But the little things — again, it’s a similar kind of thing. I think it’s okay for it to be good at the small stuff. I mean, it does a thing; you’re going to use it to do the thing, so do the thing. I would not vary too far from where its core capability is. But I will find it interesting in the next couple of months to see what other companies might do.

We’ve seen things like hiring freezes over the last year, especially coming off of COVID and things like that. And it’ll be interesting to see if that flows back into any additional kind of layoff notices as certain people.

take a big swig of Kool-Aid and then go make crazy decisions based on it. Somebody was already rehiring the staff that they laid off because AI was “too productive.” That’d be fun. It was like one quarter — I don’t know if they wanted to see how it would work without the staff, or if they really actually thought they could get away with getting rid of people. Normally you don’t make those kinds of decisions unless you are fairly certain, because after the rehire, normally there’s one other firing.

It makes me wonder if it would be good as, like — I’m on a brand-new contract, right? I have my constraints, I have my restrictions and requirements.

Start me off. Start me off with a design. So I can kind of see the full picture.

I don’t think it’s probably ready for deployment and releases and products, right?

But just like, hey, build me a brand new newsletter.

Build me a brand new PowerPoint. Build me kind of a code. Right. It’s pretty good if you have a template already.

Yeah.

Kind of like I gave you the other email template.

Use the same voice.

Just like the presentation. I just said, look at the other stuff in that folder.

Make me another one of those. Name it this.

Here’s what’s changed.

I can see, let’s say you’ve got maybe a user interface thing and you need to add another.

Hey, we’ve got another kind of a thing and here’s how it’s laid out.

But by the way, follow the pattern that’s already here as far as how we, where we put buttons, what we color things, what we, you know, that could be something it would be good at.

I’m just wondering how long this thing is going to keep rolling on doing its thing.

Let me go see if it’s actually going. We are. Actually, before I do that, let me check and see how much money it’s spent.

$15. I’m at $1.10 so far.

If you are using the credits they give you on this guy, you can actually right-click down here — and I don’t think I’ve done anything…

Well, no, $0.22 so far. It’s basically counting some tokens and stuff like that, but it counts.

Which is pretty odd considering I just pointed it at my repo and I haven’t told it what to do yet.

It’s just pulled my repo and done some initial analysis to see what is in this thing, which is kind of interesting. There is a lot of material in there.

It goes back, you know, we’re closing in on seven, eight years now worth of stuff. So it’s, I can see that being the thing.

A general question for the question you’re asking tonight: how much similar info do you have in the repo you pointed it to? Zero.

There is no Gradio stuff in there. There is no MNIST stuff in there. It will find a bunch of classifiers, because we did iris.

There’s a Musk dataset we used for Hudson Alpha once, because it’s molecular stuff — scientific things.

That’d be interesting. But yeah, for this, it’s still rocking and rolling — I’ll have to let you know what happened. I would expect GPT-4o to be able to one-shot that right there.

Yeah, I ran it through o3, Claude 3.7 max, and Grok, and all three of them did it at a quick level.

Here it is for you.

Yeah, well, that’s the interesting thing that I’m not quite sure of.

I mean, if I actually look at this… It looks like it has generated it successfully multiple times.

It’s generating the test of the thing.

I mean, this is my test.

This is the only test part I know.

This whole thing.

It’s not just building a page. It’s actually building like a full-up thing along with unit tests and other kinds of things that are… Okay. Apparently it created it as a… It’s stuck in like a mock loop.

Yeah. Oh, it’s… Oh, God.

The other thing you’re wanting it to do for testing its output?

Yeah — it’s not testing its output; it’s building a test to test the outputs. So we are like three levels up the meta ladder. Will it choke on the size of the library you put it in or ask it to use?

Not really.

I mean, the biggest downside I see to it right now is it needs a fairly big model to do things.

So if you’re going to try to self-host this, you may need something decent or big or possibly tuned more.

I could see using something smaller that’s actually more suited, if you constrain this to match what you’re doing, or what model you have available.

I think the smallest one that can consistently do something is like Qwen 32B Coder.

Yeah. But anything past — I mean, even then it’s not very good. You really need something that’s quite big right now. The other thing, since most of my world is disconnected from the internet: the fun part is, I believe the power for a lot of the models comes through the whole “hey, I can search the web now; I don’t have this in my training data; this wasn’t a thing when I was built; let me go reach out to the internet, find this stuff, and pull it in” — which is not a thing I can do where I normally work. Which is a whole other issue. Let me see — do we have any questions online?

Let me go reach out to the internet and find this stuff and pull it in, which is not a thing I can do where I normally work, which is a whole other thing. Let me see. Do we have any questions online?

From the chat, folks.

Hey, we approved it, yeah. See if any of that comes through. See if anybody is still paying attention. Whoa, whoa. Hey.

What?

What?

Hey!

Create a new branch, pushing changes, blah, blah, blah. See how much test code you have. Okay.

Here’s the mocking framework for my Python file.

Wrong one. Let me go to the right one.

Let’s see.

Pull request.

Three change files.

Oh, it was trying to train it.

Okay. Was it running that on a GitLab?

I have no idea.

I’m still not quite used to the whole request-review flow, all this stuff. I’m gonna have to get a little smarter on how to use GitHub. Yeah, so here’s its actual code — got NumPy and the dataset imports.

Logistic regression is what it wound up going with.

Predict returns an integer.

Okay, here’s the part that launches Gradio.

Here are the updates to my requirements that it went and added, and it also built a test fixture around the whole thing.

It’s pretty cool.
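Put together, the generated app is roughly this shape — a from-memory sketch, not the actual PR contents, using scikit-learn’s bundled 8x8 digits set as a tiny MNIST stand-in:

```python
import gradio as gr
import numpy as np
from PIL import Image
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

# Train a logistic regression on the small built-in digits dataset.
X, y = load_digits(return_X_y=True)
model = LogisticRegression(max_iter=2000).fit(X, y)

def predict(img: Image.Image) -> str:
    if img is None:
        return "upload a digit image"
    # Downscale to 8x8 grayscale and rescale to the dataset's 0-16 range.
    # Real preprocessing (inversion, centering) needs more care than this.
    small = np.asarray(img.convert("L").resize((8, 8)), dtype=float)
    features = (small / 255.0) * 16.0
    return str(model.predict(features.reshape(1, -1))[0])

demo = gr.Interface(fn=predict, inputs=gr.Image(type="pil"), outputs="text")
demo.launch()
```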

It’s interesting that o3 chose logistic regression as well.

Claude and Grok chose a random forest classifier instead.

If you actually go on OpenHands — their main website — they’re actually tracking which recent models people have been finding to work well.

So you can actually see that. And if you want to go run your own set of tests with whatever model you’re looking at, you can actually submit those as well.

That’s not a bad place to end the night.

So, closing thoughts? Any cries of heresy?

Will you merge it in so I can pull it down and test it? All right. Let me make sure I know how to do this. Review changes, whatever, submit review. Oh, I forgot to approve. I guess review changes, approve. See if that actually answers my question. Yeah. All right. Oh, what did I do?

Go back here.

All right, ready for review, and then merge. All right, it is merged. So for those online that can’t see, Jack is over here furiously pulling this code repo to see what happens when he tries to run it.

So while we’re at it, I will go ahead and stop recording.