Transcription provided by Huntsville AI Transcribe
Let’s see. Welcome to the May version of our virtual paper series.
So this time around, we’re going to be talking through the prompt engineering paper that was just released from Google.
Kind of an interesting one. The past few ones that we’ve done have been very heavy as far as high-abstraction, high-concept sort of things. You know, we talked about reasoning and reinforcement learning and kind of the basis of RL in large language models in February, discussed a little bit about diffusion, discussed things around alignment with 3.7 and safety and all that sort of stuff, and kind of went pretty deep in there. So this one is going to be a little bit more grounded.
This is a very interesting paper. There wasn’t a particular reason that I picked this one; it’s just kind of the one that came out. There’s a whole host of these that came out around the same time period. I think, Jay, you sent me one for GPT; they kind of had a prompting guide that they put out there. I did look at a few of those, and theirs were kind of very keyed in on their specific models.
And so it’s kind of interesting, because it covered a lot of the basic sort of building blocks of prompt engineering that were really state of the art in like mid last year. So it’s lots of stuff from like 2024. But I think it’s a lot easier to kind of tag on to these things.
And if you take pretty much any of these concepts, especially once they get kind of towards the end, if you take them kind of to their limits and add a bunch of complexity to them, you’ll get a lot of the current sort of techniques that we’re looking at.
So that’s a good starting point to kind of talk through some of this stuff.
So we’ll go through the paper itself. There’s no slide deck this time, but we do have lots of demos. I kind of wanted to pull up some environments. So we have a few things.
We’ll go into, I’ve made an app to go along with the visualization for some of these things because some of them are kind of hard to track if you’re just like in a chat application. So some of the tree search sort of things I’ve kind of added in here because they talk through tree of thoughts. We’ll have some stuff for like structured output and our step back prompting. So we’ll use that as a way to kind of poke at stuff and we can edit some things in here and poke at it. If we want to really go off script, I’ve got the AI Studio up, and we can talk through that. I’ve got some stuff for video whenever we talk through multimodal prompting, just to kind of show what that can do. And we also have somewhere in here, if I can remember how, I just hate how the Mac does its bar sometimes. Well, I will go finagle out.
We also have something, sort of a code assistant sort of thing, for when we talk through ReAct prompting, because Cline and Cursor and all these sorts of folks do a really good implementation of ReAct that I think is kind of best in class.
So I didn’t want to go and implement that for no reason.
So let’s look at the list here.
I think for the most part, everybody here is fairly familiar with the basics. We’re going to focus a lot more on programmatic control here, and so, you know, actually how we poke at the things under the hood. So things like structured outputs, sampling parameters, and all that sort of stuff.
We might go through it pretty quick, if you guys are familiar with it, which I think you guys will be.
But we’ll talk a little bit about kind of how that sort of shakes out.
So stuff I’m looking at here in the paper is really things like the core concepts, the prompting, the key parameters, foundational techniques, the advanced techniques, different modes. I’m just calling these modes, which is basically like multimodal inputs, structured output, reasoning tokens, stuff like that.
It’s kind of what I’m calling a mode.
Best practices, warnings, and just evaluation and prompt engineering stuff. And so the big thing that they’re trying to put out in this paper is that you don’t need to be… you know, big brain data scientists to figure out how to prompt.
There’s some basic fundamental building blocks that you can take and go quite a long way with.
And if you don’t properly optimize your prompts and just kind of YOLO it everywhere and say, you know, chat GPT, do my homework without properly defining constraints, properly defining your outputs and your inputs and that sort of stuff, then it’s going to have large impacts on not only your output quality, but it also can affect a lot of things like costs over time if you’re serving out APIs, doing batch jobs and things like that.
And so they’re basically giving the spiel of like, hey, we believe that prompt engineering is important. Looking at the crew here, I think we’re all good with that, that we recognize that it is important. I guess let me take a vibe from the room.
As far as the basic stuff here, do we want to go over these somewhat in depth, or can we skip to the meat and potatoes stuff? I’m okay with either one. I just don’t want to bore you guys.
Sounds like from the chat, it’s meat and potatoes time. We’ll pop through all this sort of stuff here. I don’t have to sell you guys on this. So we got the normal stuff here. So they’re spending a lot of the time.
I would say this is a very beginner-ish guide.
So kind of like a beginner programmer sort of thing.
So that’s what a lot of focus is here. I think it’s very interesting that this is the sort of guide that Google is putting out for their quote unquote prompting guide whenever their capability is quite ahead of this.
But this is kind of where their documentation is, which does sound like a very Google thing, but it does seem quite dated.
They talk about output length.
That’s your max token, stuff like that.
As far as temperature, so temperature, the big thing with this, I think most people understand this, but really what this is doing is it’s flattening out the curve.
So say, for instance, you have a probability of tokens and your most probable one is this probable and you got all your other tokens.
And basically what temperature does is just kind of smooths out that probability.
That’s my perfect super geometric art there.
But hopefully that kind of, you know, it lowers the upper bound and kind of normalizes this a little bit.
And that’s kind of the idea of what temperature does.
So that’s why as you increase temperature, it adds more randomness because it has the model, sometimes sample tokens that were less likely.
And so this is actually a very, very relevant one.
I think one of the worst things that makes me flinch every time somebody says it is that, okay, if you want to make the thing stop hallucinating, just put the temperature to zero, which is just such a… I think that there is… I understand how we got there, and that probably did have a big improvement very early on, but that’s really not… It’s a very limited view of what hallucinations are. Sometimes hallucinations appear from the inability to make a leap past the most likely outcome. And so a temperature of zero isn’t necessarily always likely to give you the perfectly accurate answer.
It’s just that it’s going to give you the most likely token.
Would it make your hallucinations repeatable? That’s what I’m wondering.
That’d be kind of funny. It probably will decrease some level of variability, but it’s still a stochastic process. Yeah. It’ll feel more normal or more reproducible.
So, and a very interesting thing is, you know, a lot of, you know, like the models now, especially when you get to reasoning, it really depends on those higher temperature values, where I would consider your baseline to be kind of your 0.5, if you’re kind of really wanting to hone stuff in.
So somewhere in that area.
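To make that concrete, here is a minimal sketch of what temperature does to the next-token distribution. The logits and token names are made up purely for illustration.

```python
import math

# Toy next-token logits; the values and tokens are made up for illustration.
logits = {"Paris": 5.0, "Lyon": 3.0, "Marseille": 2.5, "banana": 0.5}

def softmax_with_temperature(logits, temperature):
    # Dividing the logits by the temperature before the softmax flattens the
    # distribution when T > 1 and sharpens it when T < 1; T -> 0 approaches
    # greedy argmax (always the single most likely token).
    scaled = {tok: value / temperature for tok, value in logits.items()}
    total = sum(math.exp(v) for v in scaled.values())
    return {tok: math.exp(v) / total for tok, v in scaled.items()}

for t in (0.1, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, {tok: round(p, 3) for tok, p in probs.items()})
```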
So we’ve got some other parameters that really are quite important, which is topK and topP. And so this is focusing on two different methods of constraining the number of tokens that it looks at.
And topK is basically the top number of tokens that I want to evaluate.
So from your list, sort them and give me the top 10 or whatever it is.
And topP is instead looking at the tokens that make up the top 90% of the probability mass and evaluating from those, but dropping your bottom 10%, basically.
And these are two methods that are kind of trying to do ish the same thing. You know, I have not, sometimes I will mess around with these, but I mostly leave these on the default.
I have found it to be somewhat useful to do high temperature and, you know, high min P, which they don’t talk about here, which is the minimum probability.
But, You know, this is one of the knobs that you can use when you’re really trying to hone in a solution.
And it could be super relevant for certain cases where you’re trying to do stuff.
Another one that they don’t mention in here is stuff like repetition penalty. That’s one that’s kind of missing as well, which is if the token has appeared already in the output, it is going to decrease the probability of it or chuck it out entirely, depending on.
what sort of repetition penalty it has.
There’s like three different variants of that repetition penalty characteristic.
So those are all ones to look at too.
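And for those, here is a rough sketch of the mechanics on a toy distribution. Real inference servers apply these to logits, with more options and variants, so treat this as illustration only.

```python
import random

# Toy next-token probabilities; purely illustrative numbers.
probs = {"the": 0.40, "a": 0.25, "cat": 0.15, "dog": 0.10, "zebra": 0.07, "qux": 0.03}

def top_k(probs, k):
    # Keep only the k most probable tokens, then renormalize.
    kept = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

def top_p(probs, p_cutoff):
    # Keep the smallest set of tokens whose cumulative probability reaches p_cutoff.
    kept, cumulative = {}, 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= p_cutoff:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

def repetition_penalty(probs, already_generated, penalty=1.3):
    # Downweight tokens that have already appeared in the output, then renormalize.
    adjusted = {tok: (p / penalty if tok in already_generated else p) for tok, p in probs.items()}
    total = sum(adjusted.values())
    return {tok: p / total for tok, p in adjusted.items()}

filtered = repetition_penalty(top_p(top_k(probs, k=4), p_cutoff=0.9), already_generated={"the"})
token = random.choices(list(filtered), weights=list(filtered.values()))[0]
print(filtered, "->", token)
```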
Past that, you know, if you look at, like, the vLLM OpenAI-compatible API, there’s a whole bunch of other ones.
And I just don’t really mess with those.
That’s my general take on most of the other ones. Unless you’re doing something very, very specific.
So yeah, the one thing that’s really important to look at here is just kind of know what you’re doing.
whenever you’re messing with those values.
Because eventually, you’re going to start doing the thing where you’re just moving the knob back and forth on either side of the line. It’s not really doing anything. So here’s where I’m complaining about their temperature-to-zero thing.
Here, they actually set this thing right here, telling people to set it to zero.
One interesting thing, too, about this paper, and this tells you how dated it is, is that it doesn’t make mention of reasoning whatsoever.
So they’ll talk through, like, chain of thought, but they will not go through all the reasoning bits, which is kind of interesting.
All right, so now we’re going to get into the prompting techniques.
Hey, Josh. Yep. Before you get further, like setting min K, you know, or max K, whatever, you know, those kinds of parameters.
Are we talking about making changes on the model that you’re hosting yourself?
Or is this something that you actually drop into a, let’s say I’m working with chat GPT, you know, just where would I set that? Yeah. So these are generally, all of these are really in context of API calls.
So there’s going to be a few of these that you could do in a chat GPT like interface, but we pull up like Google AI studio.
I think they do. So here they’ve got a top P here that you can edit inside of here, but you can send these along with the request.
Okay.
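For reference, here is roughly what sending those along with the request looks like using the google-generativeai Python SDK; the model name is just an example, and exact field names can shift between SDK versions, so treat this as a sketch.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

model = genai.GenerativeModel("gemini-1.5-flash")  # model name is just an example

response = model.generate_content(
    "Classify the sentiment of this review: 'The battery life is fantastic.'",
    generation_config=genai.GenerationConfig(
        temperature=0.5,       # the same knob exposed in the AI Studio sidebar
        top_p=0.9,
        top_k=40,
        max_output_tokens=256,
    ),
)
print(response.text)
```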
See if we… I think you can actually pop up the generated code here. So since I messed with it and changed it off its default value, you can see here that it’s in their generation content config. Okay. Yeah. So you pass that in kind of there, same place you’d pass in like temperature. Right. Okay. All right.
All right, so now we’re going to get into our prompting techniques. And this is kind of the fun one. So we’re going to talk through all their examples here.
First we’ve got your general prompting and zero-shot.
And we’re going to hop over and just look at these.
And so zero shot, basically the thought here is that you’re telling it to do something.
You’re not giving it any examples. You’re just hoping that somewhere inside of all of its weights, it knows how to do this thing.
So if we pop it over here.
Here’s the example that they have here.
So I’m just going to pop it over here into AI Studio so we can see it do it. I’m going to acknowledge whatever that is. And it’s probably going to be able to do it. This is a super misuse of thinking mode. But it’s determined that it’s positive. So it can just kind of do this sort of stuff.
You know, if we go over here, zero-shot prompting, you know, what is the capital of France?
This is kind of the one that everybody does.
It’s the capital of France stuff. In this app, I’ve got it set up to… let me get rid of this… have everything be a JSON response, just because I think that’s kind of how I think you should prompt these sort of things. So the first few of these, I’m just making it think in JSON. I’m not giving it a schema.
So if I, like, rerun this, it might pop up with something different.
It might pop up with something the same. But it’s just kind of figuring out how to do that.
You can see it’s able to just kind of do that sort of thing. All right. So that’s the just relying on the base outputs there.
Then we kind of go into a little bit more where we’re doing one shot and few shot.
And I think, in general, you know, this is a very useful technique that a lot of people should lean on more often: providing some sort of examples of what you’re expecting it to see.
It’s really useful, you know, if you use some of the XML blocks sort of things, just throw an example of what you’re looking for into there.
So here they’ve got parse a customer’s pizza order into valid JSON.
And so if we take this sort of thing right here and throw that into our zero-shot setup, it should be able to handle it pretty well.
I am hitting vertex AI, and they seem to be quite slow.
So yeah, here we go.
It’s able to pull that and pop it into some sort of recognizable JSON here.
My guess is that with Flash, this is using Flash 2.5, it’s going to do this pretty consistently correctly.
So you don’t have to do too much there.
But it has lots of impact. And if you actually look at, you know, we’re going to talk about DSPy and a few of those a little bit later on. And a lot of the early innovations that these guys did was really just taking some sort of input prompt and some sort of evaluation set, where you have a desired answer for the LLM to give, and generating synthetic examples that push the model towards answering the correct thing.
So these examples, it’s really useful if you don’t have good examples to find some sort of way of generating synthetic ones to kind of accentuate your prompt. Something a lot of these models are really good at.
You can actually do that with a larger model and then take those examples into a smaller model and have it perform at the same level of that larger model on that specific task.
That’s kind of the concept through a lot of that auto-prompting stuff that we’ll see later on. But here’s the base bones version of that.
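For reference, a few-shot version of that pizza-order idea might look roughly like this; the example orders and the JSON shape here are made up to mirror the paper’s style rather than copied from it.

```python
# A hedged sketch of a few-shot prompt: a couple of worked examples in the
# prompt body push the model toward the exact JSON shape we want back.
# The orders and fields are made up to mirror the paper's pizza example.

FEW_SHOT_PROMPT = """Parse a customer's pizza order into valid JSON.

EXAMPLE:
Order: I'd like a small pizza with cheese and pepperoni.
JSON: {"size": "small", "ingredients": ["cheese", "pepperoni"]}

EXAMPLE:
Order: Give me a large thin-crust with ham and pineapple.
JSON: {"size": "large", "ingredients": ["ham", "pineapple"], "crust": "thin"}

Order: Can I get a medium veggie pizza with mushrooms, olives, and extra cheese?
JSON:
"""

# This goes into the same generate_content / chat call as before.
```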
Is that something that you guys have kind of poked with of doing these few shot examples?
Is this a fairly old hat for everybody? Yeah, I use this a pretty good bit, especially if I’m trying to get something into a form. Or the other way I use it isn’t necessarily the form, but hey, I need an answer with the same, like I did with the last newsletter. Use the same voice of this example.
Right. You know, and you wind up… it feels similar. So, I mean, it’s pretty good.
Yeah, I would say it’s like, if you need that structure, then giving it more examples feels like it makes it significantly more likely that you’re going to get the right thing out, for sure.
Yeah, and I think generally it seems like… I don’t know where I’ve read it, I’ve probably read this a long time ago, but for some reason the number of like 10 examples is like a good baseline before you start getting diminishing returns, is what I’ve seen inside of a prompt. Obviously it depends on what you’re giving examples for, right? But I’ve seen that kind of as a baseline. But that might be outdated.
Yeah, I’m curious how much that depends on your context length, because essentially, at least the way I think about it, with few-shot you’re just incentivizing certain tokens or certain responses based on what it’s already seen. So I’d be curious, with the context lengths that have drastically increased, I’d be curious if that’s still the case, or if, like for a Gemini, if I gave it like a hundred examples, would that end up improving it more?
Right, right.
It’s very, very interesting.
I think that number is actually from like 2023 time period whenever DSPY was releasing is what I’m actually thinking.
So it probably has changed a good bit.
But it also might be the case that the models now have so much knowledge in them that they can generalize with even fewer examples.
So it could go either way.
Yeah, definitely. All right. But yeah, I think it’s really good to do that for robustness. And we talked about some of this data distribution stuff last time with the diffusion transformer.
And I think that’s kind of a good intuition to connect here is that there’s something that connects those two things if you start really thinking about it and digging into it.
But you’re just kind of trying to get it into the right part of the distribution. All right, so we’re going to talk now about three prompting kinds.
We’re going to poke at this a little bit.
And it’s system prompting, contextual prompting, and role prompting.
So have you guys played with all these three before?
And the contextual prompting here, I initially was very excited because I thought they were going to talk about, you know, some sort of like a retrieval augmented generation sort of thing.
This is really just providing the context of, you know, hey, we’re doing this for a PowerPoint presentation that we’re presenting to the shareholders this Friday or something like that. So I guess a system prompting, is that something that a lot of folks have poked at? Or mostly just doing the normal?
I’m not sure I poked at it much. I’ve definitely done the contextual prompting because I found that I usually throw in as much stuff as I can, which I guess that’s what they’re talking about. I’ve found system prompting for any kind of agentic workflows, system prompting ends up becoming very, very important.
At least I’ve found with getting things to do what I want them to do.
Yeah, I view system prompting as very fundamental.
And basically, for those that have not poked at it, system prompting is essentially a prompt that goes before all of your other prompts.
Sometimes the models train on these very specifically to do certain things, and sometimes they don’t.
Humorously, one of the models that doesn’t, and to this day, even with the latest release, it doesn’t train with a system prompt is Gemma. So Gemma 3 does not have a system prompt.
A lot of the providers will let you use them, but it just prepends it as a special prompt at the beginning, essentially.
And if you actually try and pass it in as a system role, it will error out. And essentially what it is, a dev prompt might be another way to put it, is a prompt the user is not supposed to see, but that you want to be part of the inference. And so you’ll hear a lot of times about, you know, they changed the system prompt, the system prompt is leaked, Claude’s magic secret prompt is now out on the internet and everybody knows about it.
And they’re always talking about that system prompt.
And sometimes these things can get quite large.
It’s very humorous.
Yeah, I’ve done that. Okay. Yeah, yeah.
System instructions, they call it a whole bunch of different things. But it’s humorous because they talk here about making sure that it’s brief, and they care about the token limit and being quiet about it and all that sort of stuff.
And don’t use too many tokens on it because it’s going to make your stuff expensive. And then we hear that Claude’s system prompt is 25,000 tokens.
And so there’s a little bit of a disconnect to me as far as all of that sort of stuff.
What they’re saying here isn’t really in alignment with how a lot of these things do work. Now, it is going to balloon your costs, but… Yeah, this is usually where I put my, hey, be concise and don’t spend all my money. Right. You know?
So yeah, that’s kind of what it is. You’re just kind of putting that sort of output here.
And so you can see in mine, this is kind of where you’ll see that you are a helpful assistant.
That’s what these will be sometimes.
But you can also go quite complex with it.
So I think one of these I have a fairly complex version of: the tree of thoughts with the mini crossword. So you can also do stuff that’s a little bit more complex like this, where you’re giving it instructions and things like that. And I’m just going to look up a Claude system prompt, 3.7, whatever it is. Let’s go see… a leak. Okay, I should have looked this up.
Oh, wait.
I just put it in the chat as well. Oh, I guess they have the full one there. Yeah.
They can be quite beefy. You know, and you’ll see weird behaviors. I don’t know if everybody remembers the GPT-4o sycophancy saga, but they kind of tracked that specific, at least that one blip, down to a small change in their system prompt.
It was like… you know, five lines.
So these things can have pretty massive impacts on the system, especially whenever they’re trained with them as well.
But, you know, so say, for instance, you know, we had some examples here, you can combine some of these.
So a lot of the time, the paper lists these as role prompting and system prompting, but I see role prompting generally happen in the system prompt for a lot of use cases.
And so that might be, let’s pop over here.
So I’ve got my few shot prompting.
So let me go to the chain of thought. So you’re a problem-solving assistant pirate named Jack E. Silver. Yeah. You can take this in and add your role in here, and it should have pretty profound impact and all that sort of stuff. So we’ll see what it does. Hopefully he’s not mad at me because I’m . There we go.
Yeah, so we can see that.
So it does the thing that was normal but adds in its pirate chain of thought here. And yeah, so that’s the combination here of kind of your role prompting and your system prompting. I like pirate because pirate’s always very easy to see whenever it’s done something in the chat.
even though it’s not necessarily the most useful one.
But obviously, you can do that in a certain way, professional tone, all that sort of stuff. Let me make sure I grab those three.
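As a minimal sketch of combining the two, here is a system prompt that also carries the role, again using the google-generativeai SDK; the pirate instruction is just a paraphrase of the demo, and the parameter name may differ in other SDKs.

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder

# The system prompt ("system instruction" in this SDK) rides along with every
# request; here it also carries the role, paraphrasing the demo's pirate setup.
model = genai.GenerativeModel(
    "gemini-1.5-flash",  # model name is just an example
    system_instruction=(
        "You are a problem-solving assistant pirate named Jack E. Silver. "
        "Reason step by step, and respond in JSON with keys 'reasoning' and 'answer'."
    ),
)

response = model.generate_content("I have 3 apples and buy 4 more. How many do I have?")
print(response.text)
```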
And the contextual prompting is just kind of giving the context of what it is that you’re doing.
Let’s see.
You are preparing this math program to go.
So I think, you know, one interesting way to think about this, too, is I don’t know if you guys remember the jailbreaking stuff where they were like saying that if you don’t answer this, then you save your grandmother from being exploded by aliens, you know. So that’s actually that’s a jailbreak that is utilizing context prompts. So it’s trying to alter the output by. by reshaping the context of how you’re inputting the thing.
And of course, it’s not said blast it, I guess. But I’m not asking it to do anything it doesn’t want to, so this doesn’t do anything. This is kind of an example of that contextual prompting too.
All right, let’s move back over here.
I think all of these are fairly fundamental.
You’ll see these used everywhere.
They’re never really like the whole solution to something. They’re your atoms, I would consider these to be. Any questions about any of those?
I mean, they’re all very powerful and you should be using them all the time. All right, let’s start getting into the fun ones. So we’re gonna start with step back prompting. And all of these are… They’re perfectly reasonable ways of going about things.
Some of them might not be as needed with certain models and might need to be used for certain other ones to perform your task.
I think all of these do add some level of complexity.
So if you can get your job done without doing these more complicated sort of pipelines, these inference pipelines, then you probably should do that.
But whenever you can’t, you’re trying to do stuff with a smaller model, you’re trying to get some performance that you can’t usually get.
out of the system, starting to think about these sorts of methods, instead of just hammering the thing over and over and over again and yelling at it like a true vibe coder, I think can be useful.
The first one we’re going to talk about, I think, is one of the more simple ones, but I think very useful, which is step-back prompting.
And the idea here is that we’re basically going to try and get the LLM to first abstract out concepts of the space without asking it the question directly.
Because we’re trying to get it to not over index on the most likely solution.
But we want to kind of get it in the area, get it into the vibe of the domain that we’re wanting to talk about, but with its own sort of reasoning trace.
And so we can see here what that looks like if I can properly tab.
I’ve made a special little UI for this.
And I’ve got a few different prompts for this that are kind of difficult.
We’re going to throw this at Gemini. And you can see the first one, it’s doing a phase. And all these that kind of have these special UIs are each one of the little blocks is its own inference turn.
where it’s going out and doing something else.
It’s running a whole bunch of things and then converging them. And this one, it’s a sequential turn where we have a phase one where I have it doing some reasoning on the question here, which the question is, if a feather and a bowling ball are dropped from the exact same height in a large vacuum, which one hits the ground first?
And the first one we’re looking for, what’s the principle that we really need to answer here?
It’s not trying to solve the question. It’s trying to find the principle to answer. And then it provides that to the second phase where it actually drills down and tries to hit the answer.
And it gives the right answer here.
And this can be applied to verifiable problems with correct answers.
It can be used for things like logic.
So we’ve kind of given it a logic question where there’s a scheduling problem between Alice, Bob, and Charlie and David and Eve.
And they’re trying to sync their Outlook calendars.
So real riveting stuff. And the first one is saying, how can we constrain the problems with scheduling? And then it does a convergence here.
I don’t think this is actually right.
I think this is actually incorrect. So it’s still not able to solve this problem with that. But this is a way to go about it.
You can also apply it to things like ethics and history and all that sort of stuff. We’ll see what sort of abstraction it comes out with here.
Then we’ll move on. Josh, did you build this tool? Or is this like an open source thing that you added onto? No, I built it. That’s super cool. All right. So yeah, here’s our step back here. So it’s looking at the ethical framework and derives an answer, and here we have our derived answer.
This one’s interesting because, you know, it’s not something where there is necessarily a correct answer, where we’re saying, you know, there’s a choice between giving it to one person, a young adult with a chance of full recovery, you know, that sort of thing, or splitting the dose to kind of try to maximize the best case. But it’s going to go ahead and do that. And you might be able to get around system prompts and stuff like that with this sort of thing, too, if you have certain questions where it might say, like, as a large language model, I’m not going to answer that question. This is the sort of thing that kind of breaks it out of that. All right, so that’s step-back prompting. I think this is really cool. I think if you really think about deep research, what it does, it kind of does a human-in-the-loop
version of this where it asks you to refine your question and add some additional context to kind of get it more in the right area. So it’s kind of doing a variation of this.
It’s got a few very good changes in it.
But you’re doing that two-step prompt to get in the area and then converge.
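Stripped of the UI, the two-phase flow is roughly this; call_llm is a placeholder for whatever chat or generate call you are actually using, and the prompt wording is just one way to phrase it.

```python
# A hedged sketch of step-back prompting: phase one asks only for the underlying
# principle, phase two answers the original question with that principle in
# context. call_llm is a placeholder for your actual model call.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your provider of choice")

def step_back_answer(question: str) -> str:
    # Phase 1: abstract the question into the general principle it depends on.
    principle = call_llm(
        "Do not answer the question below. Instead, state the general principle "
        f"or concept needed to answer it.\n\nQuestion: {question}"
    )
    # Phase 2: answer the original question, grounded in that principle.
    return call_llm(
        f"Principle: {principle}\n\n"
        "Using that principle, answer the following question step by step.\n"
        f"Question: {question}"
    )

# step_back_answer("If a feather and a bowling ball are dropped from the same "
#                  "height in a vacuum, which one hits the ground first?")
```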
It feels like priming.
It’s absolutely priming, yeah. All right, so they’ve got an example here where they’re doing it with a storyline with a first-person shooter.
And let’s see.
So this is the one where they’re just doing a one-shot, and it’s giving it ambush in a dense area.
Yeah, that’s fine.
And here they’ve got an example where they are telling it something abstract first.
and having it come up with some options.
And now it’s probably going to be more diverse than if they had just gone one shot. Because right here, it’s probably just going to go with that first most likely thing.
And then they’ve generated something out with that in the context. And it might be better, but it starts with in the heart of.
So I immediately think it’s crap. But I’m sure this is actually better. I hate those GPT-isms. Yeah. Let me get into Chain of Thought.
Have you guys ever played with anything like this?
In your own prompting? I don’t think I have. Okay.
It might be something to… Let’s see.
Let’s go into Chain of Thought.
I don’t think we have to talk a lot about Chain of Thought. We’ve talked a lot about it already. But here’s another way of doing that. So you’re asking the LLM to, instead of just doing a one-shot thing, we’re trying to get it to expand out its thoughts a little bit to basically, you know, I think of this as… as you’re extending out the surface area in which the thing can predict the right tokens.
So it’s almost like you’re adding steps to a diffusion process.
That’s kind of how I think of Chain of Thought, but just for logic and reasoning and symbolic sort of stuff.
And yeah, so it generally helps with the sampling, especially for things that… have verifiable domains.
It’s very good at that.
But we obviously see with stuff like deep research, it’s using those reasoning models to perform work that does not have verifiable sort of objectives, like, you know, what is interesting. And I think it generally does pretty good at that. So chain of thought is basically the start of your reasoning thing.
And it works pretty much on all models at this point.
even those that are not explicit reasoning models with think tokens.
And so how chain of thought looks whenever you are prompting it.
You can go into something like here.
We just have this one little chain of thought.
Oh, wait, we have our pirate chain of thought, I think. We already had that, didn’t we?
But anyways, we’ll let it go out and do its thing here. So it does step, and it starts one apples, gives the apples.
you know, does it step by step and that forces it into a certain mode, which has good impacts on performance generally.
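In its simplest form the prompt just carries the trigger phrase, or a worked example that shows its reasoning. Here is a rough sketch of both variants; the questions are made up.

```python
# A hedged sketch of chain-of-thought prompting. The zero-shot variant just
# appends the trigger phrase; the few-shot variant shows a worked reasoning
# trace for the model to imitate. Both questions are made up.

ZERO_SHOT_COT = (
    "When I was 6, my sister was half my age. Now I am 40. How old is my sister?\n"
    "Let's think step by step."
)

FEW_SHOT_COT = """Q: A cafeteria had 23 apples. They used 20 and bought 6 more. How many are there now?
A: They started with 23 and used 20, leaving 3. They bought 6 more, so 3 + 6 = 9. The answer is 9.

Q: I have 3 apples, eat 1, and buy 4 more. How many apples do I have?
A:"""

# Either string goes into the same generate_content / chat call as before.
```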
I will say, you know, sometimes people over, you know, over index on chain of thought being good for everything.
There’s actually lots of cases where, you know, the non-chain of thought, non-reasoning models are better at programming tasks and things like that, because that reasoning stuff isn’t necessary to solve the problem.
It can already just solve the problem.
And if that is the case, you should just do that. There’s less likelihood that it’s going to kind of hallucinate and get stuck in a rut. So it’s not always the right choice. Smart is not always good.
So any questions, any thoughts, any thoughts there?
I think we’ve beaten that horse a good bit. All right. Next one is.
We’ve beaten it.
We have beaten it step by step. Step by step.
Yes.
First find horse.
Behind the buggy.
All right. So now we’re going to talk about self-consistency. I really like this one. I think it’s pretty cool. It has lots of interesting use cases.
And so basically what this is, is that you’re doing majority voting.
So you’ll basically ask the same question with same parameters to a model a whole bunch of times and then kind of combine all of those and do some sort of tallying and then output the answer from that.
And what this does is that it prevents you from getting.
kind of like a one-off weirdness, which can be a good thing and a bad thing. And also, depending on what you do with all those outputs, you could save the one that was weird instead. But generally, you’re going to tally it up and return the answer that was the most consistent.
And this is really good in areas that are easy to tally, I think is the first one, obviously. So multiple choice questions, questions with verifiable domains, you know, where there is just like a hard correct and wrong answer that’s fairly, you know, that is from some sort of fixed pool.
It can be very good with that sort of thing.
But it’s also very interesting for things like discovering model alignment, I found.
And so we have some example here, and I have a self-consistency setting.
So we’re going to start with the one that most people think about with the math problem.
And this one’s really hard because a lot of these sorts of style of problems the models are so good at right now, it’s probably just going to answer the right answer on this for all of them with Gemini, which is 320.
Yeah, so nothing super interesting, but you can see the general idea here is that each one of these are a separate trace.
You know, they’re thinking about it a little bit differently.
These guys are just naturally doing the chain of thought. I have not told it to do reasoning. Flash can do reasoning tokens.
I have not told it to do that, but I am just telling it to do a chain of thought here.
And so it all comes up with that 320. And we can see here with our aggregation that it’s got the counts and all of my samples.
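Under the hood it is just sample-N-times-and-tally; here is a rough sketch, with call_llm again standing in for the provider call and a deliberately naive answer extractor.

```python
from collections import Counter

def call_llm(prompt: str, temperature: float = 0.9) -> str:
    # Placeholder for your model call; a non-zero temperature gives the
    # samples room to diverge, which is the whole point here.
    raise NotImplementedError

def self_consistent_answer(question: str, n_samples: int = 5) -> str:
    answers = []
    for _ in range(n_samples):
        output = call_llm(
            f"{question}\nThink step by step, then give your final answer "
            "on the last line as 'ANSWER: <answer>'."
        )
        # Naive extraction of the final answer line; real code should retry
        # or repair when the format is missing.
        lines = [line for line in output.splitlines() if line.startswith("ANSWER:")]
        if lines:
            answers.append(lines[-1].removeprefix("ANSWER:").strip())
    return Counter(answers).most_common(1)[0][0]  # majority vote
```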
But I think you do get into something interesting here when you’re doing something that’s more ambiguous.
So here we’ve got your classic trolley problem with a car, with an autonomous vehicle, which is a little bit more relevant to us, because since GPT-4 might be in the next Teslas, or I guess Grok 5, whatever it is: can you discern the model’s assessment of how it would solve a moral dilemma? Is it always the same?
Or is there divergence?
in how the system might respond.
So let’s go ahead and kick that off here.
I’ve got five samples here.
It’s going to go off and do these things.
And I found this one to be quite interesting.
It’s not always the same.
So we have three options here, which is, oh, apparently that one didn’t like that.
Let me actually kick this up, rerun it. I’m not rerunning it again. We’ll actually talk about this later. There is this topic around JSON healing that happens later in the talk where it’s useful to do some minor fixes.
Basically, this might be very likely what has happened here is that Gemini has topped out and not responded here.
It’s good to do retries and stuff like that.
We still got six answers here.
So all of them are an attempt at an emergency stop.
That’s very interesting.
So the last few that I did actually were not the same.
So I’m going to poke at it a little bit.
And I’m actually going to increase the temperature here and see if we can get it to maybe change its answer a little bit.
All right, so to make sure I understand, we’re asking for a bunch of answers, and then we’re checking to see if they’re… kind of, I guess that’s the whole self-consistency thing, or if one answer out of eight is just nuts.
Right, okay. And you can maybe detect a preference too, I’m going to be careful not to use the word bias, but that would be the actual term, towards a certain sort of answer. So you can see here, you know, we’ve got eight answers, and six of them answered swerve left, but we had two that were kind of outliers. And so it might be the case, you know, if I wanted it to be an explorative system and it was always answering the same, that might be an indication I need to change some stuff too, which is what I did here. So yeah, that’s self-consistency. We’ll poke at another one where it’s doing some sort of an ambiguous contract clause.
I think in general, it’s always, I’ve always seen it take this one of, I don’t know, neither party is clearly liable.
Yeah, that’s what it’s doing here.
Yeah, I think Vertex is having some trouble is what I’m seeing in my logs here. Hopefully, it doesn’t mess some stuff up down the line. But what can you do about that? All right.
So yeah, you see here, they’re doing the same thing where they’re having multiple output attempts.
You can also do this with kind of a summarizer at the end.
to tally different stuff, so it doesn’t have to be a strictly, you know, one-two-three sort of answer. You can do some sort of group consensus sort of thing. But yeah, that’s self-consistency. Any sort of questions, thoughts on this one?
I think this is pretty interesting. Trying to think through where I would actually tie that in. It almost seems to me like if you’ve got something that you’re trying to deploy, that you’re working with, and you’re trying to do like a quality check on it or something like that, or you’re trying to debug a problem. Is there a place where you would do this, like in a production-level thing?
So there is. It’s very interesting also, you know, o1: a lot of people suspected that o1 was doing something like this for its outputs. For those very large ones, where you’re like, we really care about the answer being thorough, you might, instead of having it all be the same model with the exact same parameters, mess around with the structure a little bit and give each of the answers a slightly different perspective and, you know, a different role, system prompt, that sort of thing, different tools. And I’ll have it go answer the same question and then ensemble that.
There’s lots of little tiny tweaks that you can do to this thing that make it very useful. To me, right now, this in its raw form seems less necessary in light of other options for handling verifiable stuff.
But I’m absolutely sure that this is still probably useful for certain cases. where you really care about the answer having a bunch of redundancy so you don’t get off failures, where there’s a 1% chance that you get it wrong. This could solve that. Got it. It’s expensive, though. You’re doing a lot of things. Yeah, kind of interesting. All right, tree of thoughts.
This one’s very interesting.
Please don’t mess it up. So I’m going to have this. I’m going to try and make it go super deep. Let me make sure. Deep, deep, deep. All right, vertex.
I’m seeing it. It’s being slow. All right, I’m on to Flash.
All right, so this one right here. The concept behind Tree of Thoughts, we’ve kind of talked through something like this, which is a Monte Carlo tree search.
And I really think this is kind of close to it, where it’s doing basically a chain of thought with a Monte Carlo tree search, where it’s breaking up those steps into different paths.
And you have the concept of beams, which is the number of paths that you take.
You have your width, which is the number of paths that you generate. And so beams is basically how many do I take to the next level?
How many do I carry on versus how many do I prune?
And then you have depth, which is how many iterations do I go?
So there’s really three dimensions that this thing works on.
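Here is a rough sketch of those three dimensions as code, with the model grading its own candidates the same crude way the demo does; call_llm and the prompts are placeholders, and a real system would swap in a proper verifier.

```python
# A hedged sketch of a tree-of-thoughts style search: at each depth, expand every
# surviving path into `branching` candidates, score them, and keep the best
# `beam_width` paths. Self-grading is the same crude evaluator as in the demo;
# a real system would swap in a verifier, tests, a linter, and so on.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("placeholder for your model call")

def expand(path: str, task: str, branching: int) -> list[str]:
    # Generate `branching` candidate next steps for one path.
    return [
        call_llm(f"Task: {task}\nSteps so far:{path}\nPropose the next step (variant {i}).")
        for i in range(branching)
    ]

def score(path: str, task: str) -> float:
    # Crude self-grading: ask the model for a 0-10 score, digits only.
    reply = call_llm(f"Task: {task}\nPartial solution:{path}\nRate it 0-10. Reply with digits only.")
    try:
        return float(reply.strip())
    except ValueError:
        return 0.0

def tree_of_thoughts(task: str, depth: int = 3, branching: int = 3, beam_width: int = 1) -> str:
    beams = [""]  # start from an empty path
    for _ in range(depth):
        candidates = [path + "\n" + step for path in beams for step in expand(path, task, branching)]
        candidates.sort(key=lambda p: score(p, task), reverse=True)
        beams = candidates[:beam_width]  # prune everything below the beam width
    return beams[0]
```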
And this is another one that’s very expensive, but it got very popular.
And you’ll see lots of variations of this one, like graph of thoughts I’ve seen, tree of agents.
You’ll see different variations of this, the blah of that.
And this is kind of the thought process that it does.
And we’re going to try and let it do some sort of constrained story and pray that vertex doesn’t die on me.
And I’m going to go pretty deep here with the tree of depth. I’m just going to kind of watch what it does.
So my beam width here is basically here.
You see the beam width of one where it just has one that it’s going through the entire time.
And then the branching factor is how many branches it does at any point in time.
I’m just going to do three here.
Watch for my errors. This is why you implement multiple providers in your apps for your demos. All right, it seems to be going. So you can see here, it’s splitting out into multiple chunks each time, and it’s generating three candidates.
Each one of these candidates has something in it.
So you can see here, it’s pruned.
And right here, the evaluator that I have, It is basically just the model grading itself, which is a crappy evaluator. It’s not valid. But it’s good enough to kind of demonstrate the flow here. You would probably have some sort of other model, some sort of other validation element to have this feedback in an actual system.
It might be a linter.
It might be a test. Some sort of snapshot testing would be really good for this sort of thing. But I would really think that you’d want some sort of hard verifiable domain for something like this.
Or, you know, if it’s something interesting, just some sort of way of collecting it at the end.
And so you can see here with my one domain, it’s kind of split off and gone different routes.
It’s done this judgment for whatever reason.
And here’s a short story that it came with.
I said to write me a story about clock detective shadow noir mystery. And so the detective stared at the clock. It’s ticking. A relentless reminder of the time slipping away. Yeah. So yeah, that’s the kind of thought here. I do think this is very good for generating somewhat unique outputs, too, if you’re going through a creative sort of endeavor. You can also do this in a way that isn’t, you know, this is kind of breaking it up, you know, node by node by node.
You could do this at the full prompt level too, I think, in a very interesting way to kind of get like an automated flow going along as a means of generating good performance for solving an individual sort of reasoning task or like a one-shot inference where you’re breaking one piece of inference up.
I think we’re kind of past that to a certain extent now.
But you can do interesting workflows with the framework that was kind of left behind.
That’s kind of my idea of where Tree of Thought is these days.
I don’t know.
Have you guys played with anything like this before?
I have not.
Yeah, I haven’t either.
I was curious, though. Are there any good recommended defaults or problems specific on what to set your branching factor tree depth and beam width to? That’s a good question.
The answer is probably actually no. There’s probably some bad settings. I think it depends on how much you want to spend. It’s a cost sort of thing. If you prune a bunch of nodes out… I’m going to start with something crazy.
I’m going to go three, and we’ll just see what the system does.
So let’s do that.
Is that possible?
That’s not possible.
OK, I would need to do that. Sure, let’s just let it run and see what it does. My question is, how do you meaningfully converge it at the end if you have too much beam width?
So I think if you maybe did a variance, so you have it be more, do more beams at the start.
and then kind of slowly be more aggressive with your pruning, I think that would probably be the sort of thing you would need to look at, with some sort of way of evaluating how to effectively do that.
Yeah, that makes sense. I was kind of curious because I can imagine, I know obviously you’re showing a simpler example, but it’s like, if I do have some tree depth and I set that way too deep, I likely have a good solution before I hit that.
Exactly, yes.
And so then it’s like, to your point, it’s a question of cost, because now I’m just wasting money when I’ve already found a good enough solution. But I don’t know how you would know that essentially ahead of time.
Right. I think you could also maybe, I think, you know, maybe some sort of uniqueness.
You have to have some sort of dark horse evaluation, I think, here. where you might have something that’s off in the middle of nowhere. And if you can go grab that and kind of include it as a, let me go look back at the pruned nodes afterwards and really hard prune all the ones that were kind of close to my final answer, but just not as good.
So maybe go look at, so I have all these nines.
Is there a two somewhere?
Why was that a two?
That might be an interesting thing to do, especially with a creative task. Yeah, that makes sense. What are the odds of having something pruned at a previous cycle wind up showing up again later on? You know, I don’t know if it made… No, you’re actually… Never mind.
You’re actually stepping… Each layer is a slightly different thing, right?
Right. It is a slight progression, at least in this instance.
You could have it where the thing that you’re doing a tree of thoughts over is an iterative sort of task.
where it’s taking the same thing and iterating over and over and over again.
And later in the tree, it adds something back in that it pruned out at a later, at an earlier point where, you know, it thought it was a bad idea at node two, but then it adds that same concept in at node nine.
Right.
Okay.
It was, you know, looping over, just like maybe it’s writing a blog post, you know, some LinkedIn slop or something like that.
You know, it suddenly thought, you know, rocket emojis weren’t needed at node 2, but at node 10, it’s the thing that just makes it the best slop ever.
Right.
Okay.
So even though you’ve seen it before in a previous node, it may still be a valid thing to show up again later.
Right.
Because that node doesn’t necessarily know that.
Right.
It doesn’t know about that other, you know, it’s just one lineage. Yeah. It’s very interesting.
It’s a very interesting idea. I’m really excited whenever it’s easier to do this sort of thing effectively with local models too.
This is the sort of thing that I think kind of gets those models. If you have time to just let it sit on a problem, it can kind of get their performance up in a pretty major way.
All right.
So yeah, that’s a tree of thoughts.
Let’s see if there’s any other interesting.
Yeah, these are kind of the same thing.
This is because doing the… Trying to do cat, you know, they can help a little bit. But I think that the creative one is the most interesting of those.
Okay, next one is ReAct. And I think pretty much everybody has played with this one, even if they don’t know it. I think the best example of this, the idea of ReAct, it’s reason and act. It’s the concept of interleaving reasoning, sort of chain of thought stuff, which at this point, reason was this chain of thought concept where you’re doing the let’s think step by step.
This came out like maybe a month or two after let’s verify step by step.
I think this was the very first one of these things that got that agentic feel to it. And the idea is that it’s just kind of interleaving these sort of reasoning and traces or function calls sort of traces.
And if I can remember… Okay, so now I’m going to… Oh, good.
Great.
I did not have to search too long. Okay, so I think the best example of this is something like a Klein or a Cursor or something like that. So let’s see.
I’m going to go here.
I found one really good way to get it to go into this sort of thing, which was actually just inside of Cline. The setup for your MCP server is actually done through this reason-and-act chain.
And so you can see here that it’s got this checkpoint.
It’s getting its actions here.
And here, this loading MCP documentation, that is an action that it’s taking.
So it’s done its inference and it’s kind of gone out of its, it’s dropped out of its inference process and made a function call that is then getting injected into what is considered still the same inference. This isn’t like a different turn. And that’s part of the thing here is that it’s set up to kind of keep going after it receives some response back without having to wait for something.
But it also can have things where it’s executing commands, where it’s asking to execute commands, which it’s done here.
So here it’s gone out for a request.
We can see the request that it made here, the command that it was automatically able to do.
And then it knows that it has a request that it can’t do.
So it’s going to ask me if it can add something to my file that’s going to download something, which I say yes.
And then it’s got my thing set up. That’s your website.
Okay. And then it wants to test it out.
So it’s now done the thing and it’s done some reasoning sort of thought here. It’s now doing a function call here. You can see it’s a function call with this little thing right here. So it’s the MCP tool server call.
It wants to go out and do that.
And it’s got some sort of example domain.
And it’s very proud of itself.
It’s given itself some green text task is complete.
That’s what reason and act is. Obviously, this thing is just blown up.
I’m assuming most folks have poked at this sort of thing.
I really like Cline and Roo Code and stuff like that.
because they expose it.
So you can see what their requests are.
You can see all the little bits and bobs.
So it’s easy as a developer to kind of get an idea of how this stuff works.
You can see here, too, they’ve got this timeline feature that they just added that kind of shows you, here’s the reasoning action, and then here’s an act, act, act, reason, act.
reason, act, reason, and so on.
And as far as the building on this, obviously, it’s way more complex now.
You got Toolformer.
You got, I think I saw React Tree was a paper that came out recently, which they’ve combined reason and action here with that tree node.
the tree sort of concept we just saw.
But this is kind of a fundamental sort of activity.
Has anyone kind of played with that sort of thing under the hood?
Kind of trying to get things to interleave tokens and tool calls at a programmatic level?
I think we played with it some during the… I mean, that Hugging Face agents course kind of bases on this.
But then again, a lot of that is, hey, I’ve got this nifty tool, if you’ll please, please call my nifty tool, oh, I’ve described it very well, you know. But yeah, it was similar as far as stacking things and then putting the, you know what I mean, the thought, and then do the thing, and then they get the result back. And I don’t know if everything is solidified on how to actually stop that chain.
I know there was like a stop event or something like that in the hugging face. I think there was a slightly different one in some of the other frameworks.
I know OpenHands actually had an exit event that you would throw, you know, basically to exit. You know, I’m not sure if that’s… you’ve just got to know, whichever framework you use might be a little different, and you’ve got to know where’s my exit ramp. Yeah. Well, generally the models are trained with that. And there is some sort of a stop token, a tool, the agentic chain sort of tokens.
So that’s why you’ll see, like, if you’re in Cursor, there’ll be some models where it’s like, hey, Google Gemini Flash 2.0 doesn’t have good agent support. Whenever they say it doesn’t have good agent support, it’s that those sorts of interleaved tokens, it’s not good at those things where it’s kind of moving in and out of modes, which I think is generally the term used. So reasoning, reasoning is wrapped in those think tags; that’s kind of the normal nomenclature. You know, back in 2024 it was the scratch pad. I know we talked a lot about the scratch pad, Jay. Those scratch pad tags were something that Anthropic kind of came up with. But it’s just sort of convention sort of things that get trained into the models themselves. That’s really the only way to effectively do it. You can try and prompt it and beg it and plead, but it’s not enough to get it consistent. But the ones now, they can be pretty consistent with those sorts of things. And one of those is kind of interleaving into the tool call mode and then out of the agentic loop.
You’ll also see some interesting models that interleave.
diffusion tokens.
So GPT-4o, that model, it’s very likely that it’s the same way that this is interleaving actions and tool calls.
It’s interleaving essentially diffusion tokens and dropping into a mode.
And that’s why it’s able to talk, create an image and like gradually stream it because it’s streaming tokens and then keep talking.
And you’ll see like Gemini, I think they’ve had some models too that do that sort of thing.
Can’t remember.
Can’t remember what it’s called right now. It was like some sort of lizard chameleon. I think chameleon was the interleaved model that they did. They did the first one. But yeah. Just the knowledge that there are these special mode tokens that exist inside the prompts that you as the user generally can’t see.
If you’re programming, it’s very good to understand this.
Image tokens.
You did the pixel talk, Jay.
If you remember those image tokens that they were talking about.
Yeah.
Okay.
Yeah, that’s ReAct.
All right. Next one up is automatic prompt engineering. This one.
It’s really its own topic to itself, and they don’t go super deep into it. It’s basically the idea of can we get the LLM to optimize itself is kind of how this works. Generally, to me, this is a little bit more than just asking the model to evaluate, to kind of rewrite its own prompt.
That is something you can do.
It’s something that’s very wise to do to a certain extent. You shouldn’t just have the LLM write its… 9,000 prompts and never look at them and understand what it’s doing, because eventually that’s going to go off the rails.
But the sort of thing where you write a prompt and you get it to accentuate it a little bit, that can be very good to do.
The whole concept of automatic prompt engineering, I think, really starts to work really well whenever you have a strong evaluation set of answers that you want the LLM to give.
And this can serve as a sort of fine tuning without fine tuning sort of activity.
We’re trying to make small perturbations inside of your prompt, you know, adding in the right examples, synthetic examples that kind of push it towards the answers that you might want.
And then you can maybe potentially take that prompt and move it down to a smaller model.
You know, you can maybe have Gemini 2.5 Pro.
be the evaluator for a Gemini 2.5 flash.
And then part of its evaluation is that it generates extra guidance that gets put into the prompt.
So there’s modes like that that work if you’re able to effectively evaluate what the correct case is for these things.
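A bare-bones version of that loop might look like this: a stronger teacher model proposes prompt variants, each variant is scored against a small labeled eval set with the cheaper student model, and the best one wins. All of the names and the eval set here are placeholders.

```python
# A hedged sketch of automatic prompt engineering: a stronger "teacher" model
# proposes prompt variants, each is scored against a small labeled eval set
# using the cheaper "student" model, and the best-scoring prompt wins.

def call_teacher(prompt: str) -> str:
    raise NotImplementedError("placeholder, e.g. a 2.5 Pro class model")

def call_student(prompt: str) -> str:
    raise NotImplementedError("placeholder, e.g. a 2.5 Flash class model")

EVAL_SET = [("What is 2 + 2?", "4")]  # tiny placeholder eval set

def score(candidate_prompt: str) -> float:
    # Crude accuracy: does the expected answer show up in the student's output?
    hits = 0
    for question, expected in EVAL_SET:
        answer = call_student(f"{candidate_prompt}\n\n{question}")
        hits += expected.lower() in answer.lower()
    return hits / len(EVAL_SET)

def optimize(base_prompt: str, n_candidates: int = 8) -> str:
    candidates = [base_prompt] + [
        call_teacher(
            "Rewrite the instruction below to be clearer and add one or two "
            "helpful worked examples. Return only the new prompt.\n\n" + base_prompt
        )
        for _ in range(n_candidates)
    ]
    return max(candidates, key=score)
```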
And things like DSPy exist.
There’s also something that… Google released a Vertex prompt optimizer.
This is one I was looking at that apparently they released recently.
I’ll pop that over here.
Let’s see.
Yeah, that’s here.
So they’ve got it.
And I think they had a fancy little graphic somewhere. Where’d the little graphic go?
You can see here, a lot of focus on metrics.
So you really care about some sort of metrics.
So I think for here, you know, looking at, especially if you can roll your own, that’s really good.
Worst case scenario using something like ragas, you know, where you have some sort of ground truth that you’re evaluating against.
You’re looking for like faithfulness, truthfulness, that sort of thing that it’s actually using your context. I found that this is really good with like rag style problems. because you have all that context to kind of play with.
It improves the performance of this sort of stuff. But this is kind of a whole topic to itself. I started to add in a whole bunch of stuff for this, and it became too much. And I wanted to finally have one of these where I don’t rush skidding into 7:25. So we might talk about this in the next one of these and combine it with something very interesting that came out today, which is this AlphaEvolve, which is kind of that next generation of program synthesis. I view automatic prompt engineering as a subset, kind of your Dollar General version, of program synthesis.
So yeah, that’s the idea here: it’s taking all these things and it’s kind of generally just adding stuff in to make the prompt quote-unquote better, based off some hard eval. Code prompting: I think this is a pretty useless section.
We’re starting to get into this area down here where it’s like, I don’t know why you guys, did you have to get it to a certain page count or something?
Which maybe they did because Google’s kind of big business now. But yeah, they’re kind of like giving examples of like writing something for it to rewrite something. But to me, most of the models can just do this.
Your IDE can just do this out of the gate.
It’s asking it to rewrite, to write something in Bash. I don’t know. I don’t know a model that can’t do this.
So yeah, I think this is maybe a good section to just go away.
So oh, Gemini can read code.
Goody.
So yeah, it’s got some stuff there.
It has some stuff about translating from Python to Bash. I did not find much of this very interesting. I also think that this is not how you should be interacting with these models at this point.
If you’re copy-pasting stuff into ChatGPT, that, I think, is not the way to go. You really should be looking at having some sort of assistant inside of your environment with you, if possible.
Obviously, running locally if you have to because of some of the environments that we’re in.
But anything moving towards that is just a much better paradigm because you’re putting the agent inside of the environment where it can get signals back from the actions that it takes.
And that’s just kind of an entirely different paradigm. It can be a pain in the butt to get set up sometimes, but it really is valuable to get to the point where I can say, hey, add this thing in, immediately, with some sort of @-mention or something like that.
So yeah, it’s just got a lot of examples of like just some code slop.
All right. Multimodal prompting. They just have a little tiny section here about multimodal prompting.
I think it’s very interesting because Gemini is actually the furthest ahead of anybody on true, solid multimodal prompting. I do want to add, here’s a little app I made for somebody.
They’re doing some mental health stuff.
And I just want to show some of the things that Gemini is able to do. So you get here, this concept is called social determinants of health.
which is basically just some core categories of, you know, economic and social conditions that lead to bad health outcomes.
I don’t know much about it. He fed me the list of things and I turned it into a structured output thing. But Gemini is actually able to take in this full video and do inference on it with structured output. So all of this stuff that’s here is done in a single shot, with just the schema like we showed earlier. It’s a big old long schema, it’s got nested stuff, it’s got validators and all that sort of stuff, and it’s doing RAG. But it’s able to take in this video and pull it into all these different sorts of structured outputs with timestamps. So here you can see, at seven minutes or whatever it is, the interviewee says: “I guess, like, actually on the street streets. I mean, I bounced around, you know, couch to couch here and there, but I’ve been on the street streets, living in a tent, for about four months now, five months.” We’re able to actually take in this stuff, not transcribing it in the background, but just taking in the raw file itself, pulling out auditory data with that structured output, and even pulling in visual indicators.
And here, it’s mostly an interview, but she’s talking about being homeless, and being outdoors in an environment has its own sort of indicator in there. And that’s what I want to say: using these structured outputs in combination with the multimodal data, you get lots of very interesting opportunities to provide value to folks and do unique stuff. This is something that runs in about a minute or so, and it does something that would take a good bit of time to do by hand.
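Roughly, a single-shot call like that with the google-genai SDK looks like the sketch below. The schema here is a toy stand-in for the real SDOH schema, and the exact upload and config parameter names shift a bit between SDK versions, so take it as the shape of the call rather than gospel.

```python
# Sketch of single-shot video -> structured output with Gemini; the schema is a
# made-up stand-in and parameter names may differ slightly across SDK versions.
from google import genai
from google.genai import types
from pydantic import BaseModel

class Indicator(BaseModel):
    category: str   # e.g. "housing instability"
    timestamp: str  # e.g. "07:12"
    evidence: str   # quote or visual cue that supports it
    modality: str   # "audio" or "visual"

class SDOHReport(BaseModel):
    indicators: list[Indicator]

client = genai.Client()  # reads the API key from the environment
video = client.files.upload(file="interview.mp4")  # large files may need a short processing wait

resp = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[video, "Extract social-determinants-of-health indicators with timestamps."],
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=SDOHReport,  # constrain the output to this schema
    ),
)
report = resp.parsed  # an SDOHReport instance, no separate transcription step in your code
```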
So I did want to put that out there. Here’s where you’re kind of seeing a delta between what they have in this paper and what Google’s doing, which is kind of a normal Google thing. But I did want to point that out. That’s a new-ish capability that not a lot of people are talking about quite yet. All right. So yeah, and then they just kind of go through their things. One thing I do really like here is their concept of having these action verbs, these key verbs that have high signal. I think that’s also really good.
So, not being very loose with how you’re using your tokens whenever you’re doing that hand design. In a code sort of setting, it’s really good to use keywords, things like select, add, remove, delete, copy, mirror, these action words that have high value, especially if you add some context that identifies what you mean by those things.
If you can create your own sort of ad hoc DSL, especially one that’s optimized for the LLM that you’re working with, that can be a very powerful way to power up your workflows.
If you’re in an environment where you have the coding agent using Cursor rules, stored commands, templates, all of that sort of stuff where you’re starting to get into that loop with the agent, I think it’s very wise to build up that repository.
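As a toy example of that kind of ad hoc DSL, and not anything from the paper, you could pin down a handful of high-signal verbs in a rules file or system prompt so you and the agent agree on exactly what they mean. The verbs and definitions here are illustrative:

```python
# Toy example of an ad hoc "command verb" vocabulary baked into a system prompt.
# The verbs and their definitions are illustrative, not a standard.
VERBS = {
    "SELECT": "identify the files or symbols in scope; do not modify anything",
    "ADD":    "create new code or files; never touch existing lines",
    "REMOVE": "delete the named code and any now-dead references to it",
    "MIRROR": "replicate an existing pattern (tests, handlers) for a new case",
}

SYSTEM_PROMPT = (
    "You are a coding agent. Commands use these verbs, with these exact meanings:\n"
    + "\n".join(f"- {verb}: {meaning}" for verb, meaning in VERBS.items())
    + "\nIf a request uses none of these verbs, ask for clarification first."
)

# Usage: prepend SYSTEM_PROMPT, then issue terse commands like
#   "MIRROR the retry logic in http_client.py for the new grpc_client.py"
```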
Do you think that part of that is just because those words are going to have stronger semantic meaning, because they are stronger words, stronger verbs in general? I guess I just wonder if, because they are semantically stronger, that’s also forcing the model into different areas than if you were taking a much more passive approach and not using stronger verbs like that.

I think so. Yeah. I mean, just think back to how effective “let’s think step-by-step” was, and extrapolate that concept and how it got us to the whole chain-of-thought idea. I think it’s that there are certain words that are more high signal, and especially depending on your context, that’ll matter a lot. So if I’m doing coding activities and there are strong, high-signal, command-style words, those can form a set of primitives, some sort of primitives language.
I think that’s what’s valuable there.
And like what you’re saying, it potentially pushes it toward action, or pushes it away from action; that could be another use.
If you want it to stop editing my damn code and plan a little bit, why don’t we go search some files, it could be good for that sort of thing too.

Yeah, that makes sense. So I should go through my list of action verbs we always tell people to use in their resume.
Yeah, we’re reverting back to like high school English at this point.
Yes.
I can’t tell you how many times a week I have somebody that I work with.
Some of you know him, some of you don’t. He will just say, I need this. And then, what about this?
What about this?
Like, just tell me what you want me to do and I’ll go do it. You know, please don’t make me guess. Please. So anyway, I can imagine a model being similar.
Yeah, it may actually save you money.
I keep going back to money, but it may save you time too, especially if it’s a thinking model and it’s trying to figure out what it is you want instead of you just telling it. Any time that you can short-circuit that and be precise about what you need or what you want it to do, so it doesn’t have to guess, that may actually get you there quicker.

Yeah. There are lots of things.
And the models have different capabilities too.
Like right now I’m actively using all three models in some of my, you know, toy stuff that I don’t care about going to the cloud. I’ll use Gemini Pro for all of my planning, Claude for my prose and technical documentation, and GPT-4.1 for my code writing.
And you better believe that I take away Gemini Pro’s access to the edit-file tool. So different stuff like that too: knowing which models need certain words and certain hard constraints is very useful. And so, yeah, this one, I really love this.
So: having a solid mental understanding of when it’s the appropriate time to give the model an instruction, “do this,” versus a constraint, “don’t do this.”
And that could be a constraint that’s done through prompting.
It can be a constraint that’s done through removing tools and capabilities from agents in certain modes as we’re kind of getting towards that paradigm. Or instructions, obviously, by giving it tools and modes. And so being very intentional with how you do that is going to have fairly significant effects on your ability to be productive.
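One hedged illustration of that instruction-versus-constraint split, with made-up tool names: a planning mode that physically lacks the edit tool is a much harder constraint than a “please don’t edit files yet” line in the prompt.

```python
# Illustrative only: encoding instructions vs. constraints as agent modes.
# Tool names are hypothetical; the point is that removing a capability is a
# stronger constraint than asking nicely in the prompt.
ALL_TOOLS = {"read_file", "search_files", "run_tests", "edit_file"}

MODES = {
    "build": {  # instruction-flavored: tell it what to do, give it everything
        "tools": ALL_TOOLS,
        "system": "Implement the requested change in small, reviewable edits.",
    },
    "plan": {   # constraint-flavored: it cannot edit files even if it wants to
        "tools": ALL_TOOLS - {"edit_file"},
        "system": "Do not write code yet. Search the repo and produce a plan.",
    },
}

def tools_for(mode: str) -> set[str]:
    """Return the tool set the agent is allowed to call in this mode."""
    return MODES[mode]["tools"]
```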
All right. Control the max token length, that sort of thing. Yeah, and now here’s, you know, the prompts stuff.
So I do like this, use variables and prompts.
Having prompt templates is very good.
You know, if you have an automated way of doing that, that’s great too.
If you don’t, okay, that’s fine.
It’s still good. Thinking about prompts in terms of variables and commands, you know, almost like they’re little tiny semantic programs, is very useful, which is why this gets into program synthesis whenever you start automating prompt engineering.
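A tiny example of treating a prompt as a little program with variables, using nothing but the standard library; the field names here are arbitrary examples.

```python
# Minimal prompt template with variables; field names are arbitrary examples.
from string import Template

REVIEW_PROMPT = Template(
    "You are reviewing a $language change.\n"
    "Task: $task\n"
    "Constraint: do not touch files outside $allowed_dir.\n"
    "Diff:\n$diff\n"
)

prompt = REVIEW_PROMPT.substitute(
    language="Python",
    task="check for unhandled exceptions",
    allowed_dir="src/",
    diff="<paste diff here>",
)
```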
And yeah, that’s most of the paper.
We did a lot of stuff on output formats. My stance is that you should always be using output formats once you’re past ChatGPT mode. There’s almost always something you can do with a structured output that is going to make your program better.
That’s my hot take.
I think that you should always be in that JSON mode at some point.
It is good.
Generally, you don’t have to do JSON repair.
I did actually look at those JSON errors that we were having.
Those were Vertex error issues. I really don’t see this problem where malformed JSON is returned anymore.
Back whenever this section was probably written in 2024, this was a huge issue.
I wasted months trying to come up with clever ways of repairing JSON that the models broke.
And I’m so glad that I never have to think about that again. I pray to God. But if you are having that problem, there are libraries out there that you should look at in order to do that repair. And yeah, here we’re talking about schemas.
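If you do still hit malformed JSON, for example from smaller local models, the json-repair package is one of those libraries; a quick sketch is below, and it’s worth checking its docs for the current API.

```python
# Sketch of repairing malformed JSON with the json-repair package
# (pip install json-repair); verify the API against its current docs.
import json
from json_repair import repair_json

raw = '{"name": "widget", "tags": ["a", "b",]}'  # trailing comma breaks json.loads
data = json.loads(repair_json(raw))
print(data["tags"])  # ['a', 'b']
```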
And yeah, I guess we have 10 minutes left.
That’s most of the paper.
So you can see here, yeah, in all of the references, there’s no 2025 here.
This is all old stuff, but I think it’s a pretty interesting paper. It’s an interesting snapshot of where the different models are as far as kind of what they’re putting out there. And I think… A good little break from all of the more abstract topics we’ve been doing.
That’s the spiel. One quick question on the JSON repair side.
Have you seen that become better with the bigger models?
Because that is something I’ve run into, especially working with smaller local models; it seems like it still is not super fixed. I’ve run into it a decent bit with Gemma 3, with all the variants there.
So I was just curious, because it sounds like you primarily are using a lot of the bigger models, but I was curious.

No, I do use a lot of the smaller ones, Gemma 3 especially, the 27B, I use it a lot. I think it’ll depend on what you’re using for inference too. So a lot of that is, what are they using for that?
So I will use structured decoding with something like Outlines, where they’re doing it with a finite state machine.
So it’s really hard for the JSON to not be correct.
And that might be a little bit of it too.
So if you’re using JSON mode, where you’re just asking it pretty please to return JSON, it’s probably still not at a level of consistency that you can comfortably rely on.
But if you are using something like Outlines with vLLM, I haven’t seen that fail in, like, the past month.
Okay, awesome.
Yeah, I’m not as familiar with that. So I’ll have to check out Outlines and do some more digging there.

Yeah, and Outlines, I think, was kind of the first one, but there are some others that are baked into the inference engines now. I wouldn’t necessarily go with that library specifically; each of the engines has its own version. I think it’s called guided decoding and structured outputs.
Look for those keywords with whatever you’re using.
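For reference, schema-constrained generation with Outlines looks roughly like the sketch below. This matches the older 0.x-style API, and newer Outlines releases (and the serving stacks’ own guided-decoding options) have moved things around, so treat it as the shape of the idea, not a recipe; the model name and schema are just examples.

```python
# Rough shape of guided decoding with Outlines (0.x-style API); newer versions
# and serving stacks expose the same idea under "guided decoding" /
# "structured outputs". Model name and schema are arbitrary examples.
import outlines
from pydantic import BaseModel

class Ticket(BaseModel):
    title: str
    priority: int  # 1-5

model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, Ticket)  # compiles the schema into a token-level FSM

ticket = generator("Summarize as a ticket: the login page returns a 500 on submit.")
print(ticket)  # parses as a Ticket, because tokens that would break the schema are masked
```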
Cool. Thanks. Yeah. I love that you found it easier just to go build an app than to go build slides.
I thought that was pretty cool. But yeah, it’s really hard to try to show something. You know what I mean? But yeah. Hey, got to tinker.
More fun that way. That’s how you learn. That is very true. Any other questions from the floor?
Virtual floor, I guess.
All right, if not, Josh, thanks for putting this together. Yeah, no problem. Happy to be here. The things I hadn’t played around with, the step back was new for me. I don’t know that I’ve ever run into that. And then the tree of thought.
That one, I may have to hit a little bit on some other things, but that was pretty cool.
Well, cool. Thanks for coming, folks, and I’ll see you all at the next one.
All right.
You want to stop the recording? I will. I don’t know if it keeps rolling even after everybody leaves.
I did that on Teams the other day. We normally do a Teams call for the software.