Links dropped in the chat:
- Here is the video of Andrej: https://www.youtube.com/watch?v=7xTGNNLPyMI
- The MCTS Tic-Tac-Toe game: https://vgarciasc.github.io/mcts-viz/
- CoT paper: https://arxiv.org/abs/2201.11903
- STaR paper: https://arxiv.org/abs/2203.14465
- Let’s Verify paper: https://arxiv.org/abs/2305.20050
- The Bitter Lesson: https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf
- https://epoch.ai/frontiermath
- New paper just published on the coding side a couple days ago: https://arxiv.org/abs/2502.06807
- DeepSeekMath paper: https://arxiv.org/abs/2402.03300
- Deepseek V3: https://arxiv.org/abs/2412.19437
- Deepseek R1: https://arxiv.org/abs/2501.12948
- s1 paper: https://arxiv.org/abs/2501.19393
Transcription provided by Huntsville AI Transcribe
All right, so starting over again, we’re going to go through key concepts in RL.
These are kind of the high-level concepts of an agent, environment, states, and all that sort of jazz.
There’s some other post-training concepts that we’re going to go through as well, typically supervised fine-tuning and RLHF. But then we’ll also talk a little bit about GANs, just because of how similar they are to the value network models, and then talk about model distillation here as well, since we’re talking about some relevant things. And this becomes relevant when we get to the R1 models, the distilled models that we’ll talk about at the end.
As far as papers and history, it seems like everything and the kitchen sink, but this is a very specific line of papers that started in 2017 with AlphaGo and really laid the groundwork for all of the current things happening with reasoning models. There are lots of other papers that I hate to not have in here, but to me this is kind of the core path of what got us here: AlphaGo, Chain of Thought, Self-Taught Reasoner, Let's Verify Step by Step. All the DeepSeek papers are in here mostly because that's what people are talking about now.
But we'll also talk about o1 and a recent paper that came out in early February on test-time scaling. All right, and feel free to pop things into chat, raise your hand, interrupt, all that's totally OK. But otherwise, I will plow along. We're going to start off with some key concepts in reinforcement learning that are important for what we're doing.
And the main intuition here is this concept of an agent, which is really just a model acting inside of an environment. It's trying to decide how to take an action based off of some policy, where it's evaluating the current state of the environment, predicting what will happen whenever it takes certain actions, and figuring out which actions will give it the most reward.
And "agent" here is not necessarily the same thing we mean when we talk about LLM agents, but it's close.
This is just the normal term for the thing acting inside of the system, where in the case of an LLM, a lot of times that thing that the agent is doing is generating the next token.
But in an RL framework, that might be like move the car to the left, go to this place, play this chess piece, and all that sort of stuff.
And so we're going to dig into each one of those elements, because basically all of the innovations are people doing weird things with each one of these elements. And so the first and most important piece here is the concept of the agent itself, which is composed of some sort of a policy and a value function.
And so when you think about the policy, this is really the strategy that the agent uses to decide which action it needs to take next in any sort of given state.
This is generally what we're optimizing when we talk about RL.
So you’ll hear people talk about PPO, DPO, ORPO.
There’s always a PO in there.
And then GRPO, which is the most recent one, from DeepSeek, from around April of last year.
And there are two main kinds of policies: ones that are deterministic, which means the policy will always take the same action in a given state.
You’ll see this a lot in expert systems.
For deterministic, and this is not a perfect example but I think it illustrates it: if I see a green light, I move forward.
You’re generally going to do that.
Whereas a stochastic policy is one where the action is sampled.
So it’s not always going to be the same.
So maybe something like if I see a yellow light, I stop.
Sometimes you don’t do that. You push forward instead. So the decision to stop or not is a stochastic policy.
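To make the deterministic-versus-stochastic distinction concrete, here's a tiny sketch in Python; the traffic-light states and probabilities are made up for illustration.

```python
import random

# Hypothetical illustration: a deterministic policy always maps a state to the
# same action; a stochastic policy samples an action from a distribution.

def deterministic_policy(state: str) -> str:
    # Same input state -> same action, every time.
    return {"green_light": "go", "red_light": "stop"}.get(state, "wait")

def stochastic_policy(state: str) -> str:
    # Same input state -> action sampled from a probability distribution.
    if state == "yellow_light":
        return random.choices(["stop", "go"], weights=[0.7, 0.3])[0]
    return deterministic_policy(state)

print(deterministic_policy("green_light"))  # always "go"
print(stochastic_policy("yellow_light"))    # usually "stop", sometimes "go"
```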
And the value function here is the other thing that we're optimizing, up to a point, which is basically an estimate of how much reward the actions my policy might take will give me over the long term.
So it’s kind of looking forward into the future and trying to determine what set of actions is going to give it the best reward.
All right, so any questions at this point? All right, so the next major thing here is the concept of the environment.
This is really the other huge element.
So the agent has to be reacting to some sort of environment.
This is the world or system in which the agent operates.
And this has things like states and the observations from the agent about the states of the environment.
So in this case, our agent here is this little mouse.
And his goal is to get the cheese and the cheese gives him reward. But then you also have this concept of negative reward. And so the cat is a huge negative reward.
The mouse is going to try and optimize for things that get him the cheese but don't get him eaten. But then you also have the concept that different things have different levels of reward. And the observation is the information that he can see, so he has visibility over these three states.
And so his policy should be acting only in the context of these states and not necessarily acting with knowledge that this cat is here. That’s the general idea here is that you’re controlling how the agent works with imperfect information.
You’ll hear people talk about like imperfect information games.
That’s sort of what we’re talking about here.
I think I already talked about this mostly.
Yep, I got ahead of myself. That’ll happen. So yeah, the actions here, we’re looking at how we do those in the given state. And the big thing here is that there’s this concept of a cumulative reward, which is the total reward over time. That’s what our value fact function should be helping us build up.
So it should be building up something that gets him to kind of move along, avoid this and get this.
And so the general cycle of RL here is that the agent is plopped into a situation.
It observes what’s around it.
It decides an action, the environment updates.
It gets some sort of a reward, where you can see here that there's a passive minus one for every step.
That's to push the agent into action; otherwise it would just sit there and do nothing, which is a problem, so you wanna push it to act. Then the agent's gonna refine its policy and value estimates and keep going. All right, so that's the general idea of what RL is doing, and when we're doing RL training, we're updating the policy to optimize for this. And so there are some other post-training methods that are also fairly key for all these reasoning discussions.
And some of them are kind of red herrings whenever we're talking about reinforcement learning.
We’ll talk about these, which is supervised fine tuning, model distillation, RLHF, and the GAN.
So the first concept that is critical here is supervised fine tuning.
And this has really been the most common form of fine tuning for these language models for a long time.
That’s still done very heavily.
And the general concept of supervised fine tuning is that we get some sort of labeled data that’s labeled by some sort of human experts.
And then we train the model to try and move towards those answers.
And so you'll see this a lot with GPT and PaLM and models like that, but then we have these fine-tunes.
This is super old here, talking about Alpaca, Dolly, and Vicuna. But the general idea is that we’re giving it some sort of a query, and then we had a human create this response, and then we’re moving the model towards that.
And down here, we can see that’s done here for like a stylistic sort of training.
And you can see here, we’re doing kind of a safety supervised fine tuning sort of thing.
So this is where you’ll get things like the, you know, I’m just a large language model and I can’t possibly say that, you know.
Some of that sort of stuff where it’s just, you’re always kind of getting the same answers where they’ve beaten this thing with some sort of supervised fine tuning or the RLHF, which we’ll talk about later.
But the main idea is that we’re trying to take some sort of pre-trained model where we’ve just kind of, you know, jammed a bunch of data into this, and then we’re moving it towards that labeled data.
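To ground what "moving the model towards the labeled data" means in practice, here's a minimal sketch of the supervised fine-tuning loss: next-token cross-entropy on a human-written response, with the prompt tokens masked out. The model name and the example pair are placeholders, not the actual setup from the slides.

```python
# A minimal SFT sketch: cross-entropy on the labeled response, prompt masked out.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # placeholder base model
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Q: Why is the sky blue?\nA:"
response = " Because molecules in the air scatter blue light more than red light."

prompt_ids = tok(prompt, return_tensors="pt").input_ids
full_ids = tok(prompt + response, return_tensors="pt").input_ids

labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100   # -100 = ignored by the loss; only the response is supervised

out = model(input_ids=full_ids, labels=labels)
out.loss.backward()                       # gradients move the model toward the labeled answer
print(float(out.loss))
```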
A big thing with supervised, actually, let me hold that for the RLHF section.
All right, so another key element that we need to talk about with a lot of these training methodologies is the concept of cross entropy loss.
And really what we’re looking at here, I have a KL divergence up here just because this gives a good visual of sort of these ideas.
Your cross entropy loss is basically, you’re looking at two distributions of numbers and you’re wanting to determine how far away those distributions are.
This has a lot of use cases for machine learning models, especially whenever you’re trying to push things towards a certain direction, doing distillation where you’re trying to alter a distribution in a certain direction.
But the main idea is that the lower the loss is, the closer the models are to whatever we consider the true answers or the desired answers for the model.
And there’s this other concept of KL divergence which is basically the measure of how different they are.
And this can be used in a few different ways, one of which is a normalization sort of method.
And so it might be that, say for instance, this green model here is my good model.
This is what I want to move to.
And my blue model is what I’m actually generating.
And so you can do a few different things.
If you have an especially chaotic training process and you’re expecting this to be very large, you might use the KL divergence as a normalization parameter to basically tell the model to cool it.
Don’t update quite so much.
I want you to do small steps towards this in the event that the distance is large.
And we see this in GRPO, which uses this sort of "cool it" method a good bit.
But then you could also use KL divergence in a positive way of distilling a model, knowledge from another model where say this is a teacher model, this is a student model and they’re different models fundamentally.
And say, I want to just inch it towards that.
You can do that proactively where the KL divergence is actually what you’re using to push the gradients.
And there’s lots of different ways of using this, but this is just kind of the base concept you’ll see a few times where you’ll either see the D parameter or KL in a few places.
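As a tiny numeric illustration of those two quantities (the distributions here are made up):

```python
# Cross-entropy between a target distribution p and a model distribution q,
# and the KL divergence, which is the extra "distance" q has from p.
import math

p = [0.7, 0.2, 0.1]   # target distribution (e.g., the desired answers)
q = [0.5, 0.3, 0.2]   # the model's current distribution

cross_entropy = -sum(pi * math.log(qi) for pi, qi in zip(p, q))
entropy       = -sum(pi * math.log(pi) for pi in p)
kl_divergence = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# cross_entropy = entropy + kl_divergence, so lower cross-entropy (and lower KL)
# means q is closer to p.
print(cross_entropy, entropy + kl_divergence, kl_divergence)
```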
And yeah, so here's one big item: the KL divergence is a huge thing for distillation, which we talked about last Wednesday, and this is one of the big methods that people use, where you have the data and you have the teacher model.
So think of this as your R1 671B massive MOE model.
You know, it does its prediction out and you have little tiny Qwen here.
And it’s basically using that distance to get your distillation loss, which is used to train the model.
We’ll talk about this a lot later, but kind of keep this in mind as we’re talking about all different ways that KL divergence can be used.
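And as a rough sketch of that distillation use of KL divergence; the model logits and the temperature here are illustrative stand-ins, not DeepSeek's actual recipe.

```python
# The student is pushed toward the teacher's (temperature-softened) distribution.
import torch
import torch.nn.functional as F

teacher_logits = torch.tensor([[4.0, 1.0, 0.5]])                       # stand-in for the big teacher
student_logits = torch.tensor([[2.0, 1.5, 1.0]], requires_grad=True)   # stand-in for the small student
T = 2.0  # temperature softens both distributions

teacher_probs = F.softmax(teacher_logits / T, dim=-1)
student_log_probs = F.log_softmax(student_logits / T, dim=-1)

# KL(teacher || student): how far the student's distribution is from the teacher's.
distill_loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (T * T)
distill_loss.backward()   # these gradients inch the student toward the teacher
print(float(distill_loss))
```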
All right, next up, we’re gonna talk about GANs. This one’s a little bit weird in the context of RL, but I did want to introduce it.
This was a huge thing in kind of the 2016 to 2018-19 period, where everyone was talking about GANs, GANs, GANs, this is gonna be the big thing. It's kind of fallen out of style now; you know, it pops up every once in a while, but it lives on in spirit. And we'll see that as we go through these papers with the concept of verifiers and the value networks.
So the main idea with the GANs is that you have a two-part system, one of which is a generator, something that generates out data, and the other one, which is a discriminator, which evaluates the samples to try and distinguish between real and fake data.
And so you basically have a creator and a judge, or a creator and a pruner, whatever you want to call it. And so you’re trying to do this sort of adversarial training where they’re fighting with one another and getting closer and closer to some sort of form of reality. I guess it’s down here, sorry. Pointing at the wrong box.
And that’s the general idea of a GAN.
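Here's a minimal sketch of that generator/discriminator loop on a toy 1-D dataset, just to show the adversarial structure; the sizes, learning rates, and data are arbitrary.

```python
# Minimal GAN sketch: generator learns to produce samples that look like N(3, 0.5).
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))                 # generator: noise -> fake sample
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())   # discriminator: sample -> P(real)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(200):
    real = torch.randn(32, 1) * 0.5 + 3.0          # "real" data
    fake = G(torch.randn(32, 8))

    # Train the discriminator (the judge) to tell real from fake.
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Train the generator (the creator) to fool the discriminator.
    g_loss = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print(float(fake.mean()))  # should drift toward ~3.0 as the generator improves
```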
The problem is they’re very fussy because you’re not just training one model, you’re training two models. You’re training this generator model to update correctly and the discriminator.
It’s very easy for one of them to get out of whack.
So it’s kind of a complicated training process.
It’s very fussy.
And a lot of times with GANs, you’re using them at inference time too. So it’s not just something that you have to deal with the complexity during training. You have to deal with it at runtime as well. That’s the concept of the GAN. All right, next up is RLHF.
This is one lots of people talk about. RLHF stands for reinforcement learning from human feedback.
This is a big thing that came out of a lot of the people who are now at Anthropic; in fact, Dario Amodei is actually one of the major guys who did this.
I think, I can’t remember. They just lost their safety guy who is also one of the kings of this sort of methodology. So when I think of RLHF, I think of sort of like the early stages of Anthropic, but this is also the major thing that was really behind ChatGPT. Cause this is really how they kind of got that chatbot vibe into the original GPT-3 was through this reinforcement learning with human feedback.
So the question is, what is that?
And it’s really kind of a misnomer. We’ll get into that in a second.
But the main thing here is that we’re getting preference data from humans where humans are basically looking at two answers from the models and deciding which one they like more than the other.
And then we use that preference data to train a reward model.
And basically what this model does is it tries to imitate the human because what we’re trying to get past here is having to rely so heavily on human annotated examples like we have to do with supervised fine-tuning.
And so they’re kind of trying to push it towards a reinforcement learning paradigm where the human preference is the reward.
And they’re kind of treating that human preference proxy model as quote unquote the environment and objective that this thing is living in. And then it’s optimizing itself for generating that.
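A small sketch of what training that reward model on preference data typically looks like, using the standard Bradley-Terry-style pairwise loss; the reward model here is a stand-in linear layer over made-up features.

```python
# Train a reward model so the human-preferred answer scores higher than the rejected one.
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Linear(16, 1)              # stand-in: embedding of (prompt, answer) -> scalar score
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

chosen_emb = torch.randn(4, 16)              # features of the answers humans preferred
rejected_emb = torch.randn(4, 16)            # features of the answers humans rejected

r_chosen = reward_model(chosen_emb)
r_rejected = reward_model(rejected_emb)

# Push the margin between chosen and rejected up: -log sigmoid(r_chosen - r_rejected)
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```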
So then we’re back to the thing where it’s doing the loop.
It's going through and deciding what path it's going to take: what are the tokens I can generate that are going to get me the most reward and are most likely to get me a positive response. I think this is also kind of what leads to the pattern where, a lot of times, the first time people pick up ChatGPT they think it's amazing, because it gives them a bunch of interesting things on first glance. But then over time it really starts to fall off.
And I’m talking here but really specifically about ChatGPT 3.5.
They’re not so heavily dependent on RLHF now.
But a lot of the things that were bad about that 3.5 system and good were from RLHF. And so, oh yeah, here, I should have just gone to this one.
So when you're thinking about this in those terms, the model is living inside of an environment.
So this is ChatGPT and it’s looking for good vibes.
And that’s really the general idea behind RLHF is we’re looking for good vibes and not bad vibes.
This is why you get things like Llama 2 refusing to kill a Python process and all that sort of stuff.
And here’s kind of what the RLHF process looks like.
So we go in and feed in the text data.
We have a frozen LM.
So this would be your base 3.5.
And then you have your trained model, which is whatever this RLHF tuned 3.5 is.
So I guess this frozen LM would be three, probably ChatGPT or a GPT-3 or some variant of that.
And then we’re going to get the KL divergence.
So here’s our KL loss.
And the reason here that we’re doing this is because we’re training it on essentially nonsense data as far as actually completing out its tasks and stuff like that.
And so we wanna constrain the loss so it doesn't output gibberish that fools the reward model, because now we're starting to train the model to feel the vibes. And so we're feeding the model's output to this reward model.
And that’s kind of generating out the loss that we want, putting it into some sort of an RL update, which here a lot of the time it was a PPO, which we’ll talk about this a lot in the GRPO section.
So I’m not gonna go into it here.
But basically this is just a way of utilizing this reward model to generate out the loss to push into the network. And then they do this until they feel like not doing it anymore.
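As a rough sketch of the signal that feeds that RL update, assuming the usual formulation of reward-model score minus a KL penalty against the frozen LM; all the numbers below are made up.

```python
# Reward signal sketch for RLHF: reward model score minus a KL penalty that keeps
# the tuned model from drifting into gibberish that games the reward.
import torch

logp_tuned = torch.tensor([-1.2, -0.8, -2.1, -0.5])    # log-probs from the RLHF-tuned model
logp_frozen = torch.tensor([-1.0, -0.9, -1.8, -0.7])   # log-probs from the frozen reference LM
reward_model_score = torch.tensor(1.7)                  # "good vibes" score for the whole answer
beta = 0.1                                              # how hard to pull back toward the frozen LM

per_token_kl = logp_tuned - logp_frozen                 # sample-based KL estimate, per token
total_reward = reward_model_score - beta * per_token_kl.sum()

# total_reward is what the PPO-style update tries to maximize.
print(float(total_reward))
```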
So that’s the general idea behind RLHF.
But wait, bad vibe alert. So if you don't know Andrej Karpathy, this guy is one of the kings of the AI space. He's been behind a lot of the key innovations here. And I find him to give the most based takes on stuff. So whenever something happens, I always go to see what Andrej has to say. And if you don't follow him, I would suggest doing so. And if you have not seen his recent three-or-so-hour dive into GPT and transformers, I would suggest going and watching that.
It’s really good.
But this is Andrej's take on RLHF.
He’s not a big fan.
Says RLHF is just barely RL.
The reason he says this is, you know, this thing happens after the SFT stage, and RL is powerful while RLHF is not.
And we’re actually gonna, I’m gonna hold off too much going into this because we’ll go into it later on.
This is a huge post that he has. And it’s really talking about AlphaGo.
And so we’ll go into what he has, what problems he has with RLHF there.
Thank you, Charlie. All right, so now we’re gonna go into the building blocks of reasoning. So this is gonna go over a long period of time.
So from 2017 to 2023, there’s lots of things that happened in between here that we’re not gonna go into, but you’ll see it’s very, very heavy towards the 22, 23 time period where 2017 is really where we got the first big glint of this.
And it took some time to kind of get the transformer, the text generation models to catch up.
But we’re gonna start with AlphaGo and the Monte Carlo Tree Search. This is a huge paper from 2017, Mastering the Game of Go Without Human Knowledge.
If you’ve not heard about AlphaGo, this was basically the first system that was able to defeat professional Go players.
And it defeated Lee Sedol here.
It was this huge televised event with lots of people, and it was able to beat the grandmaster. And this is one of those sort of games, those, I can't think of the right term, bellwether might be the correct term.
But basically, lots of people are saying like, okay, it solved chess, that’s fine.
It solved Jeopardy, that’s fine.
But whenever it solves Go, that’s gonna be a problem. And the reason for that is because look at how many squares there are on this board, how many possible moves you could possibly take.
And compare that to something like chess and it’s not even close.
So just the amount of variations that are here make it significantly harder to brute force the game.
Although that is kind of what they ended up doing regardless with the Monte Carlo tree search, but in a kind of smart way. The idea here was to use some sort of a reinforcement learning loop to allow the system to come up with moves that did not rely on labeled examples, so the supervised fine-tuning sort of stuff.
And so what we'll look at here is this concept of self-play, which is basically that the model would play out a round, then it would update its network, and it would play against different versions of itself to generate out-of-distribution games.
So it was not reliant on human knowledge because even with all the games that exist, there is a finite limit that can be played and computers can play much faster.
So they can see more games than humans could ever possibly play. And so one of the major elements here that was used is this Monte Carlo Tree Search. And I think this little free app is great that kind of visualizes this thing out.
So here I’ve just gone here and I’m going to have the system do a Monte Carlo Tree Search run. And what it’s kind of doing is it’s rolling out all the possible actions that can happen. So I’m just gonna hit on this like a crazy monkey for a little while. And so we can see kind of how it’s doing stuff.
And basically you can see here is that it’s generating out possible moves and how it’s deciding which of these moves have the most values is somewhat dependent on that policy that we’re talking about.
So it’s not just going at complete random.
There is some element of a policy that’s deciding these things, but then it’s kind of doing that sampling in a more stochastic way.
And so you see here, it’s generating out a whole bunch of options.
And then I did something there, but then it makes a play based off of all of those options: it samples which one of those is going to have the most reward and then prunes the tree down to that one.
And so this concept of expanding out and pruning is probably the most key thing here.
That’s why this thing is on the list, even though it’s so far away from whenever we started reasoning, reasoning doesn’t exist without this sort of tree search expanding out the network and then pruning element.
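Here's a compact, heavily simplified sketch of that expand/simulate/prune loop, on a toy "mouse on a number line" game rather than Go; it's meant to show the four MCTS phases, not AlphaGo's actual implementation.

```python
# Simplified MCTS on a toy game: reach +5 for the cheese, fall to -5 and the cat gets you.
import math, random

ACTIONS = [-1, +1]

def step(pos, a):                          # environment dynamics
    pos += a
    if pos >= 5:  return pos, +1, True     # cheese
    if pos <= -5: return pos, -1, True     # cat
    return pos, 0, False

def rollout(pos, depth=20):                # random playout to estimate value
    for _ in range(depth):
        pos, r, done = step(pos, random.choice(ACTIONS))
        if done: return r
    return 0

class Node:
    def __init__(self, pos, parent=None):
        self.pos, self.parent = pos, parent
        self.children, self.visits, self.value = {}, 0, 0.0

def uct(parent, child, c=1.4):             # selection score: exploit + explore
    return child.value / child.visits + c * math.sqrt(math.log(parent.visits) / child.visits)

def mcts(root_pos, n_sims=500):
    root = Node(root_pos)
    for _ in range(n_sims):
        node = root
        # 1. select down the tree while the node is fully expanded
        while len(node.children) == len(ACTIONS):
            node = max(node.children.values(), key=lambda ch: uct(node, ch))
        # 2. expand one untried action
        a = random.choice([a for a in ACTIONS if a not in node.children])
        new_pos, r, done = step(node.pos, a)
        child = Node(new_pos, parent=node)
        node.children[a] = child
        # 3. simulate: terminal reward if the game just ended, otherwise a random rollout
        value = r if done else rollout(new_pos)
        # 4. backpropagate the result up the tree
        while child is not None:
            child.visits += 1
            child.value += value
            child = child.parent
    # act: pick the most-visited child, effectively pruning everything else
    return max(root.children, key=lambda a: root.children[a].visits)

print(mcts(0))   # should usually print +1 (head toward the cheese)
```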
Let me find where I am.
Let’s see.
All right, yes.
So here's another visualization of Monte Carlo tree search, where we go out and it's planning out all these different nodes.
And then we have this concept of a rollout where it’s really trying to plan out all the ways to get to the area that it wants to go, which is the area of the proper reward.
And in this case, we have this box here of what is the correct reward.
So there’s some sort of combination of where your starting place is and what the valid answer is here is what we’re seeing.
This is really relevant because this is what we’re looking for when we’re talking about reasoning.
So Monte Carlo tree search and concept of this is that it’s trying to get into the right area and generate rollouts of how to get there. There’s one problem, however, with Monte Carlo tree search and that is that it is expensive.
So each one of these nodes is another inference. And so you’re increasing the cost of what it takes to run these models significantly.
The counterpoint is that it has a significant impact on performance.
And so a lot of the innovations as we go along, and especially the innovations once we get to deep seek have to do with this concept of tree search and search is really, really good.
It’s also really, really expensive. So you want to optimize for that reality. That being said, there’s a reason to optimize. And I think this is one of the biggest, if you can get a intuition around this chart, I think it’s going to remove a lot of the concerns and sort of, I guess BS out there about the limits of data.
And the real intuition here is that with supervised learning, we have here the dotted line right here is your grand master where we’re looking at ELO rating of the grand master.
And so with supervised learning, we get rapid increases as we're moving the model towards the performance of all of these sampled games that we have labeled and are training the model on: massive performance jump, then it trails off and gets up here, but then we hit this limit where, at this point, it can only approach the dotted line, because it's only ever moving towards that line.
And if we go another route with self-play and reinforcement learning, we start much lower, and there are lots of problems at the very beginning, but it gets rapidly better and eventually surpasses human performance, because it's going outside the distribution of the labeled human data, to the point where it's effectively limitless in how far up it can grow.
So this is whenever anybody says, we’re running out of data and all those sorts of things, that’s really not a concern.
The concern is if we are running out of verifiable tasks.
And what makes reinforcement learning work is that there’s some sort of, yes, this was correct or no, it wasn’t.
And so if you do not have something that’s verifiable, it’s very, very difficult to do reinforcement learning.
And that is what Andrej does not like.
So RLHF is just barely RL because it’s got a crap objective, essentially.
So when we’re looking back here at RLHF, what we’re looking for is this general concept of good vibes.
And so the Go game, I know I succeeded with Go if I won or not.
But how do you do a strong verifiable reward for something that is subjective, like good vibes and bad vibes?
And the answer is essentially that you really can't. And he illustrates this by asking what it would look like if we tried to do AlphaGo with RLHF. In that case, we would have a whole bunch of labels of different board states, and we'd be asking the human labelers which one they like better: do you like this board or that board?
And this seems kind of silly, but it could probably work to a certain point.
If you got Lee Sedol right here and all of his compatriots to judge these boards, you'd probably get a reasonable jump in performance on a base model, to a certain extent. But the problem is there's only a certain number of games that they have considered.
And the big thing here in this game is this move called Move 37.
And I think it was this move right here where he played, oh, I don’t have my little pointer, do I?
Did I have it?
Oh, there it is.
Okay, so he played this guy right here, I think.
It doesn’t matter. He played something over here. I think it was right here. And then the AlphaGo played this guy right here. This was a move that no human would have done. They actually looked inside of the probabilities and the probability that a human would have made this move was one in 10,000. They got from its internal network. Everyone was kind of losing their crap. This guy actually went outside and smoked a cigarette after it played it. So he could have time to think. So it was kind of this big moment. That’s why everyone knows what Move 37 is.
Move 37 doesn’t happen here because it’s only going to be going on policy of what humans have seen.
And there’s also some other elements where you can have adversarial examples that basically trick this reward model.
So yeah, you’re getting all of these comparisons and getting the reward model to try and imitate the vibe check of the board state.
And we’re getting good vibes for our boards, but there are two major problems with this, which is the vibes can be misleading because this is a proxy objective. We’re not actually having it. We don’t have a win the board state.
We're just kind of having a good-vibes objective. And we also have the concept of things that are out of distribution. So how do you give an objective reward for summarizing an article, answering an ambiguous question, telling a joke, or translating Java code to Python?
You can’t really.
So that’s the rant on RLHF. So it is not quite RL.
All right, so now we’re going to do a big jump to, I guess, do we have any questions?
Anybody want to pipe in? Give it a second. I have a tendency to kind of just zoom. All right. Okay, let’s pop forward to chain of thought, prompting elicits reasoning in large language models.
I think this came out of Google Brain.
I don’t think I put it on here, but yeah, this is a Google paper. And so this is the first concept of chain of thought is in 2022. And the main idea here is that we’re trying to get large language models to solve more complex problems that require multi-step reasoning, especially things like mathematical problems, things where there’s a little bit more verifiable stuff. And this introduced the first major sort of reasoning prompts that existed. This concept of a structured output actually started here.
We’re really looking to get some of the latent ability of the network to focus in some sort of a structured reasoned way.
And so what this looked like before at this point, we’d ask a question, get an answer.
And even with the prior example here, if it was just trying to zero-shot into "the answer is 27," it would get it wrong.
If you know how tokenizers work, this makes a lot of sense because you’re basically trying to get it to do a whole bunch of calculations, a whole bunch of prediction stuff in one or two tokens here. And so you’re trying to fit all this stuff and there’s just not enough space inside of here for especially the models at this time to do this. Now you might be able to get GPT-4 to do this now, perfectly fine, but at that time, you were certainly not able to do this. But the trick was is that we’re basically elongating out the input samples into the next tokens before we have it do the answer.
And this concept of elongating out the sampling is very important because we give the network more time to settle into the right prediction because it has more things that it’s feeding off against in the past.
And so yeah, here, and you can see here, they’re doing some sort of an in-context learning here.
So they’re giving it an example of, here’s the question, here’s an answer.
It’s showing it what reasoning looks like because this thing doesn’t know how to do this, but then it’s able to imitate that because it’s doing next token prediction and successfully get the answer right.
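A rough reconstruction of what that kind of few-shot chain-of-thought prompt looks like; the wording is illustrative, not copied verbatim from the paper's figures.

```python
# Show one worked example with its reasoning, then ask the new question.
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more,
how many apples do they have?
A:"""
# Without the worked example, the model tends to blurt out a single (often wrong) number;
# with it, it imitates the step-by-step pattern before committing to an answer.
print(cot_prompt)
```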
And so this is the first version of Chain of Thought.
And that’s really all I’m gonna talk about there.
I think this is probably one of the lesser known papers that is out there, which is a shame because it’s pretty impactful. So this is self-taught reasoner.
This is the STaR paper: bootstrapping reasoning with reasoning. This is the start of the flywheel.
All right, so the idea here is that we want to train the large language models to generate reasoning traces without having to rely heavily on human annotated data.
And what's done here is this: a human put this example here. This is in 2022, before we had lots of synthetic data happening.
At least in this space. And so a human put this in here and basically showed what the proper reasoning trace was.
And what they wanted to do is they wanted to have the model generate out those traces themselves.
There’s lots of really good reasons to do this, one of which is cost.
You don’t have to have humans to annotate the data. There’s also the intuition of how does a human know at what point they need to generate out a reasoning trace? You know, we don’t think the same way that these models do.
So it is much more useful to have the model generate out the reasoning traces where we’re pruning it down based off of some sort of verifiable reward at the end, which math is a good proxy for that.
And so the idea here is that there’s a self-generated chain of thought, and we train it on that generated chain of thought whenever it ends in a good solution.
And we want to bootstrap the capability.
And what they did here is really clever. It's really weird. So they had the language model generate: they gave it that example from before, you know, the question, the math problem, whatever it is, and it generates out a rationale and an answer.
And a lot of the times at the very beginning, it gave out the wrong answer.
But what they would then do is that they would basically give it the answer and say, okay, the answer was actually nine, or the answer was actually B, tell me why B was right with the chain of thought. And so it would generate out the chain of thought.
Then we would go in and then fine tune it as if it had given that answer.
So we’re using its own chain of thought here that’s coming, you know, based off of its internal neural network.
And then we fine tune it, and it is more likely in the next run to use some of those patterns that it learned from this bootstrapped chain of thought.
And we are officially in the flywheel.
So now we gotta let this cook for a little while, and that is what they do.
So here we’re in the self-reinforcing cycle.
It generates its own traces and learns from them.
It gets outputs that are scored.
The best outputs guide the updates to the network.
And then we get this sort of emergent reflection and backtracking capability, and we're off to the races.
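A condensed sketch of that flywheel; `generate` and `fine_tune` are placeholders for whatever sampling and training machinery you have, and the point is just the loop structure.

```python
# STaR-style loop: try, keep what worked, rationalize what didn't, train on your own traces, repeat.

def star_iteration(model, problems, generate, fine_tune):
    training_traces = []
    for question, gold_answer in problems:
        rationale, answer = generate(model, question)                 # model writes its own chain of thought
        if answer == gold_answer:
            training_traces.append((question, rationale, answer))     # keep traces that reached the right answer
        else:
            # "rationalization": give it the correct answer as a hint and ask it to
            # produce a chain of thought that justifies that answer
            hint_rationale, _ = generate(model, question, hint=gold_answer)
            training_traces.append((question, hint_rationale, gold_answer))
    # fine-tune on its own bootstrapped reasoning, as if it had produced it unprompted
    return fine_tune(model, training_traces)

# Repeat: each pass makes the patterns that led to correct answers more likely next time.
# for _ in range(n_rounds):
#     model = star_iteration(model, problems, generate, fine_tune)
```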
Step one, question mark, question mark, profit. Here’s my own personal theory.
This is a kind of tinfoil hat theory that I’ve had since around summer of last year or so.
So self-taught reasoner.
I think, so o1, let me start up here actually.
The o1 series of models were originally kind of under wraps and went under the name of Strawberry.
Everybody thought this was because of the famous problem of counting the R's in "strawberry" (two R's or three R's?), which is very difficult for these things to get right. And I think this is probably correct, but I think the more interesting thing here is that the o1 series of models is probably an homage to Self-Taught Reasoner, STaR, which is exactly the sort of paradigm that we think these things were trained on.
So it has its own flywheel, and this was the first of the reasoning models.
And I think that this is really kind of the seminal paper in the current day reasoning traces.
And you can see the influence of this flywheel all the way up to what we'll talk about with R1-Zero.
All right, let’s get into let’s verify step-by-step.
Okay, so I love this paper. This is kind of whenever, I remember reading this paper the day it came out.
This is around just when I was getting into all of the reasoning stuff.
So this one came out of OpenAI.
It was one of the last really good papers that they put out into the open. And the idea here is this concept of process supervision.
Before, we were really thinking of this whole thing as one chain of thought, and we were generating it out as one reasoning trace.
And the main thing with let’s verify step-by-step is that we’re going to start splitting out this reasoning trace into multiple steps.
And we wanna verify each one of these steps individually.
So we’re starting to look at this process reward model.
So before we have a reward model that’s looking at the entire answer.
Now we have an additional one on top of that that's looking at all the intermediate steps here. So, let's go back to our good friend chain of thought.
Here we go.
So in this case here, we have "the cafeteria had 23 apples originally."
That might be a segment.
They used 20 to make lunch. That might be a segment. So they had 23 minus 20 equals three.
Let's say instead it said, "so they had 23 plus 20 equals 5." That would be a place where the process reward model marks this step as essentially red.
So you have these traces here.
We’re doing a beam search.
Each one of these is an individual node.
It goes out to red and then we decide to prune that node.
And it goes to green here.
Okay, that gets to keep living. Green, green, ends up red, no good.
And so we’re doing this sort of beam search thing here. And what does this look like?
It’s a tree search.
So this is kind of when we’re starting to lean back into these Monte Carlo elements here.
So we have this concept of beam search.
This is a very useful one.
I think in general, beam search is not used so much these days.
This is a big thing in like Whisper.
I think it’s the last big thing that I know uses beam search a whole bunch. A lot of the providers deprecate this just because it’s very expensive. And so people just don’t use it because of how expensive it is. But getting us started out, this is a huge element that was used a lot. We also have this concept of a look-ahead search where we’re kind of looking at rollouts and we’re trying to predict.
So we’re not just running the games out.
It’s actually trying to do some sort of a, have an intuition about how my current state is going to impact future states by looking at kind of a trajectory.
And then we have just a simple best-of-N sampling, where we're doing a whole bunch of samples and keeping the best ones.
Usually there was kind of a mix of these things being used. Best-of-N sampling is still very popular. But that's really what we're looking at here: we're looking at each of the individual intermediate steps.
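A small sketch contrasting the two flavors of selection described above; `sample_solution`, `propose_steps`, and the two reward models are placeholders.

```python
# Best-of-N scores whole answers with an outcome reward model (ORM), while a
# process reward model (PRM) scores each intermediate step and lets you prune
# bad branches, beam-search style, before they finish.

def best_of_n(question, sample_solution, outcome_rm, n=8):
    candidates = [sample_solution(question) for _ in range(n)]
    return max(candidates, key=outcome_rm)            # keep the answer the ORM likes best

def prm_beam_search(question, propose_steps, process_rm, beam_width=4, max_steps=6):
    beams = [[]]                                      # each beam is a list of reasoning steps
    for _ in range(max_steps):
        expansions = []
        for partial in beams:
            for step in propose_steps(question, partial):   # expand each partial trace
                expansions.append(partial + [step])
        # score every intermediate step with the PRM and prune to the best few ("green" branches)
        beams = sorted(expansions, key=process_rm, reverse=True)[:beam_width]
    return beams[0]
```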
Something very notable about let’s verify step-by-step, you’ll see our homeboy Ilya here as the senior sort of advisor on this one. This one definitely smells a lot like him.
Oh, actually he's not senior, it's Karl Cobbe.
Anyways, Ilya is on this paper.
I always am happy to see his name somewhere.
So why we care about this, as you can see here, we have this concept of an outcome supervised RM. That’s really what we were doing beforehand, which is this outcome. The simple majority voting sort of thing and then a process supervised RM. This thing’s obviously very complicated.
We’re doing expensive things. This is not cheap. Beam search is expensive. Look ahead, search is expensive. You have another value model that you’re having to train. So you’re having to run that at least during training time, probably during inference time as well, at least during training time.
But it improves the performance. So we do it.
Another interesting thing around "let's think step by step" is that we started kind of getting this concept of the zero-shot CoT, where before, if we wanted to do this, we had the concept of few-shot.
Few shot is basically where I’m feeding it in examples.
So I'm showing it a chain of thought and then giving it a question, and then trying to get it to do that chain of thought itself. Zero-shot means I'm not gonna give it an example, and one of the things we started seeing at the time, and this was really, really effective, was to just add in "let's think step by step." I think we actually saw Jay doing it as an example last week.
He had to add "let's think step by step" to get the model to do the thinking and reasoning, so it was even in there at that point. The thing is that you shouldn't really have to do that with the recent models. But what it would do is that this token sequence would essentially put the model into a mode that was the same as giving it a few-shot example.
So this token sequence contained the latent concept of a chain of thought.
That’s kind of the important intuition there.
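And the zero-shot version is literally just the suffix; the question wording here is illustrative.

```python
# Zero-shot CoT: no worked example at all, just the nudge into "reasoning mode".
question = ("A juggler has 16 balls. Half are golf balls, and half of the golf balls are blue. "
            "How many blue golf balls are there?")

zero_shot = f"Q: {question}\nA:"                                # tends to jump straight to a (often wrong) number
zero_shot_cot = f"Q: {question}\nA: Let's think step by step."  # nudges the model into the chain-of-thought mode
print(zero_shot_cot)
```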
All right, so in between here and these next papers, there’s lots of really cool things that I’m going to skip over just because we can and time is moving on. So we’re going to skip pretty far ahead now.
So this was, I think, 2023.
There’s lots of stuff that happened in the second half of 2024. And what it really was was people catching up. So you started seeing Mistral be able to do this. You started to see the stuff popping up in the Gemma models, started seeing it in some of those. I think Mistral was the one who got it really, really early.
Cohere with the Command-R, different stuff like that.
So a lot of people were catching up, and something was brewing, which was o1. You heard a bunch of hubbub with Sam Altman getting fired and all that sort of stuff that happened; it was very likely around the time that this was getting demonstrated. If you look back in hindsight at the statements they were making, I think one of the big things was Ilya starting to burn effigies to AGI, and Sam Altman, I think, had the quote of "we peeled back the veil and saw true understanding." And essentially what that was, was seeing all the concepts behind test-time scaling, which is the ability to scale not with pre-training but along this additional axis of test-time scaling, meaning I can spend tokens at test time. And test time just means at runtime, essentially. So whenever you interact with ChatGPT, that is technically test-time compute.
That’s just the nomenclature that people use, mostly born out of the training paradigm.
But what we see here is up to this point, the only way to scale was to really train bigger models for longer, which was feasible at that point, but we’re starting to get pretty expensive models.
And there's only so big we can build out our data centers until we have magical robot gods. So for now, we're kind of getting to that upper limit; with Stargate, I think they've pledged $500 billion to build out all these data centers.
You’ve got Musk with his data center up in Memphis.
So we’re getting up there, they’re going for these big ones, but it would be really nice if that was not the only way that we could scale.
And what we found with these 01 models is that we can.
And as with all things in AI, this o1 stuff is really an extension of Self-Taught Reasoner, which is actually a Google thing, but Google failed to deliver on it.
And so ChatGPT got it instead. But what we see here is the train-time compute scaling right here, and test-time compute also has its own pass@1 accuracy sort of enhancements. And this all tracks with, if you have not heard of The Bitter Lesson, it's a really, really short little blog post from, oh God, I can't remember the guy's name right now. Let's see, Rich Sutton. Rich Sutton's one of the granddaddies of AI.
He's been doing this for a long, long time. Lots of really good interviews of him out there; I think they just did one on the Machine Learning Street Talk podcast, which is really interesting, so I would definitely suggest listening to him talk. His big thing, this Bitter Lesson, is the general concept that, as engineers, we want to do clever things and have these giant complex pipelines with all these bits and bobs, little tiny optimizations here and there. And these things always just don't matter in the end. The Bitter Lesson is that, as clever as we can possibly be, as far as it goes with AI models, you've got two things that really matter, and that is search and learning. And what he means by learning, this is obviously the big thing, is training bigger models: you give it more data, you give it more parameters, you train it for longer, or you make the data better, all that sort of stuff. And then there's this concept of search. This concept of search is not Google search. It's not going on Perplexity. It's searching for solutions. It's this test-time compute. It is this Monte Carlo tree search.
So I am searching the solution space for all of the possible actions that I can possibly do that might lead me to a positive reward, where that action might be something that a robot is doing or what the next token is, or as we all see in DeepSeek, what the several next tokens are.
But that’s the general concept here is that search and learning is all that we need to care about. And hey, it looks like he was right. This is Noam Brown, I think. So this guy is really famous for essentially solving poker, which is really difficult if you think about it. So we have Go.
Go is difficult because the scale of possible moves.
Poker is difficult for a lot of other reasons, because it's an imperfect information game that has elements of diplomacy and all that sort of stuff. But he basically solved it. And this is also, I think, the lead guy behind the o1 series of models, or at least a very senior engineer on it. So he was a big hire for OpenAI.
Right before they basically started all the Strawberry models. And what he says is that he wishes he had done search earlier; he didn't do it for poker at first, and he should have. And so that was o1 here, and we see this. This is the first sort of thing that came out; I think September is when they posted it, but they also did this big announcement for o3. Obviously, we have o3-mini now. But, and this is speculation.
There's lots of speculation here. We don't really know what's going on with o3.
We don't really know what's going on with o1 either. With o1 we have a better idea, because it's the closest to the stuff that we do know about out there. So it's probably some of that flywheel stuff with some extra goodies.
But o3, we really don't know what they've done here.
But we do know that it has lots of impacts on these verifiable domains of competition math.
This AIME is really the big benchmark that a lot of people are talking about right now.
There’s still GPQA diamond as well. But these are the ones that we’re really looking at.
You can see the big jump here is that we’re continuing to scale up to the point where we’re really saturated on this benchmark if things go much further.
We have this new benchmark of frontier math.
I don't know, Charlie, since you're posting links, you might go look for FrontierMath and post some of those problems. Actually, I'm gonna pull it up right now, just to kind of give you an idea. So these are a little bit more simple.
The frontier math stuff looks like… All right, I’m done. Oh, did you find it?
All right, I’ll post the link into Discord later.
But it’s these really, really complex problems out there that are really looking at, it’s frontier sort of math.
Things that you would give PhDs a bunch of coffee and several months to go figure out. Maybe that's a little bit of hyperbole, but very difficult stuff. So the previous state of the art here on these was 2.0, and it jumped up here to 25.2.
So we're starting to eke into interesting areas of capability.
And o3, I think they just announced that they're not going to release the full o3 model.
There’s good reasons.
I didn’t include it in here, but I think that the cost for each one of these problems was something like in the tens of thousands of dollars.
It was a lot. They did this on the ARC-AGI task, and I think it cost well over a million dollars to complete the full evaluation. And so this problem right here: lots of really good performance, 96, 25, yippee, very good. Money-wise, it costs too much.
It’s not feasible. But things get cheaper. Oh, sweet.
Thank you, David. I’ll have to check that out.
All right. So how do we deal with this money problem?
Luckily, we have an answer coming to the rescue, or at least a few answers of kind of ways that we chip away at this sort of thing. So the first one we’re gonna talk about is DeepSeek Math.
It was actually April of last year when they first introduced DeepSeekMath, which introduced one of the first of their major optimizations for simplifying this sort of stuff.
And this is actually where they introduced the GRPO stuff. So one of the things you'll notice with these DeepSeek papers is that a lot of the things being claimed as major innovations in the current papers are really recycled from these earlier ones.
So a lot of the success of R1 is actually not from R1 itself.
There’s something really cool in there, but a lot of things that people are going crazy about are not related to this.
One of these is the GRPO, and that’s in this paper right here. So we’re gonna get kind of deep in this one. This one, we’re actually gonna go into a little bit of what they did here, because what they’re really replacing here is this concept of the proximal policy optimization, which is that major, I think this is the second major RL algorithm that got used.
It started with TRPO, which is trust region policy optimization.
And then PPO kind of took over, and this has kind of been the major one for a long time.
And so what PPO is doing is you have the policy model here, you’re taking the question, it has the output.
You have your reference model, which is the base model that’s being trained.
Your reward model, which is the thing that’s generating out the actual reward back to the model.
But then you also have the value model.
So if we link back to way, way at the beginning with our agent here, so you have the thing that is generating out sort of your long-term path of how much is this entire path going to give me reward?
So you can almost think of it as your future reward, your future planning model.
And so PPO actually has an additional model, a whole other model to do that.
So that’s this value model right here.
And so with the reference and reward model, we generate out the KL divergence and use that in our GAE, the generalized advantage estimation, I forget exactly what that's called now, but it's basically the advantage term.
And we use that to generate up the updates for PPO.
So GRPO is group relative policy optimization, and it just kills the value model.
It does not have this model whatsoever.
And instead what it does is it generates out a group of outputs, scores them all, and uses the group statistics (basically each reward relative to the group average) to do the update, which is way simpler.
It’s one of those, you know, it’s so stupid it works sort of things, but it seems to work.
Oh, I’ve messed up. Here we go.
Okay, found myself.
All right, so the problem with PPO, and this is really what a lot of people were doing, especially the big labs, because it was very effective, but you got too many plates spinning. So here you’ve got one, two, okay, you got at least one, two, three models that are being trained all at once. And you’re having to keep these, okay, you have two models here that you’re getting trained at once, and you’re having to keep these in sync. So what if your value model starts going off into a weird direction or isn’t updating properly? Well, then you’ve got a failed run, and you’ve got to start over.
So GRPO is really nice because you only have this one model that you’re optimizing, and so it’s easy to, you know, back up, restart, do all those sorts of different things, because you’re not having to train this value model as well.
The other big element here is that, okay, that’s right, you’re not having to run the value model.
So the cost to train these things, the GPU hardware that's necessary to run the training runs, isn't as high anymore.
And this was something that DeepSeek was very interested in as they are GPU constrained.
So they really need to find ways to do this stuff without, you know, massive $100 billion data centers. The other context here, and we’ll get into this a little bit, is this concept of how they deal with advantage.
That’s also very key.
So we’re talking about these two.
I do want to mention that there are other policy optimization methods.
This GRPO thing, it’s really cool for sure, but it’s not the right choice for every case.
It’s not even necessarily something that’s going to always win out over PPO.
PPO might still be the right choice for certain problems, but it is an interesting new way of doing things.
There are some other ones that are also very interesting that might work for different areas.
I'm still very fond of ORPO; that's the one I usually use right now. But these are basically different ways of doing the preference tuning: with PPO you're using the value model; with DPO you're looking at preference pairs directly; with ORPO, odds ratio preference optimization, you're looking at the odds ratio with your pairs; and with KTO, Kahneman-Tversky optimization, you're breaking up the pairwise comparison so you're not doing binary pairs.
So you can have, say for instance, a ten-to-one ratio of negative to positive examples for your preferences, which is really good because it's easier to give negative examples.
So I really think KTO and ORPO are still very big here, but GRPO, very cool too.
I do think DPO kind of loses now that ORPO is out there, but DPO is still very popular. All right, we're gonna get into the math a little bit, not super deep, partially because I'm not brilliant at the math, but I know it well enough to plug in the numbers, and we're gonna talk a little bit about the key elements here to get an intuition of what's happening with GRPO.
So the first thing here, we’re gonna look at these greens.
This is kind of equivalent here of what we’re doing with our optimization.
So the equivalent for GRPO here is this section in the PPO objective, and even in supervised fine-tuning you have the equivalent here with the query and output.
And so the major thing that's different between these is this i index right here, where we're iterating over some group of outputs, and they used eight a lot, so generally they were going over eight different outputs here.
And so we have the new policy probability over the old policy probability, which is basically "how confident am I in the change that's happening," and we're weighting it by this A-hat i,t term, which is the concept of an advantage, or more precisely, how good this next action that I might take is for me over the long term.
And so in PPO the advantage is where your value network that we've been talking about before gets used: that's what calculates, for this step that I'm going to take, how much value it's gonna give me over the long term. And then we have a bunch of terms here that are basically executing a form of control. It's doing that by clipping this same confidence-style ratio and adding a KL divergence term, so it's telling the model basically to cool it. We said there's two ways of using KL divergence.
This is definitely a "cool it" way, because we're doing something wacky with our update, which is computing the advantage from the group of sampled rewards rather than from a learned value model, and then using that to push back into the network.
And so we have this hyperparameter here, a beta, which is basically how much we want this KL term to chill out the policy update.
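To make that concrete, here's roughly how the objective reads when written out, reconstructed from the DeepSeekMath paper's formulation, so treat the exact notation as approximate: G is the group size (the eight outputs), r is the new-over-old policy ratio, Â the advantage, ε the clip range, and β the weight on the KL term against the reference policy.

```latex
J_{\mathrm{GRPO}}(\theta) =
\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
\Big(\min\big(r_{i,t}(\theta)\,\hat{A}_{i,t},\;
\operatorname{clip}\big(r_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_{i,t}\big)
- \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\Vert\, \pi_{\mathrm{ref}}\big)\Big)\right],
\qquad
r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t}\mid q,\, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,\, o_{i,<t})}
```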
That’s what all of this stuff is here is doing. And so that’s the math, let’s get some pictures.
All right, I think we can move past here.
So the big thing to see here is what's happening: we put in the prompt and generate out eight different completions, and all of those completions get their different rewards, so we still have the reward model here that's frozen. Then we calculate the advantage for each completion relative to the rest of the group. We look at this updated policy against our reference policy and generate out the divergence, basically putting these up against one another, take the mean, and that gives us the objective.
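And as a stripped-down sketch of that update in code, with toy stand-in tensors instead of real model log-probs; the advantage here is each completion's reward relative to its group, and the KL term uses a common low-variance estimator rather than the exact divergence.

```python
# GRPO-style update sketch: group-relative advantages instead of a value model,
# clipped policy ratio, and a KL penalty against the frozen reference policy.
import torch

group_rewards = torch.tensor([0.2, 1.0, 0.0, 0.9, 0.1, 1.0, 0.3, 0.0])   # 8 completions scored by the reward model
advantages = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)

logp_new = torch.randn(8, requires_grad=True)             # stand-in: log-probs under the policy being updated
logp_old = logp_new.detach() + 0.05 * torch.randn(8)      # stand-in: log-probs under the old policy
logp_ref = logp_new.detach() + 0.10 * torch.randn(8)      # stand-in: log-probs under the frozen reference model

ratio = torch.exp(logp_new - logp_old)
clipped = torch.clamp(ratio, 1 - 0.2, 1 + 0.2)
policy_term = torch.min(ratio * advantages, clipped * advantages)

# KL estimate: pi_ref/pi_new - log(pi_ref/pi_new) - 1, always >= 0
kl_term = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1
beta = 0.04

loss = -(policy_term - beta * kl_term).mean()   # maximize the objective => minimize its negative
loss.backward()
print(float(loss))
```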
All right, where are we on time?
I’m gonna move forward a little bit.
All right, let’s move into DeepSeek V3.
And so I think this is a sleeper favorite.
Everyone kind of freaked out about R1 because it had the web app with it, but it’s not a super hot take that this is really the big optimization.
This is the one that’s really hard too. So DeepSeek R1, you’ll see all these people talking about, like, we’ve replicated DeepSeek R1 for $50 in a cave, and that’s true.
They did do that.
What they didn’t replicate was DeepSeek V3, which is stupid complicated. I’m still trying to wrap my head around some elements of this, especially the auxiliary loss.
I don’t have this one totally in my mind yet, but we can talk about at least a few of these.
They’re really relevant.
And the overlying context here with DeepSeek is obviously they’re in China.
So there’s lots of export restrictions on these guys. Now, there’s lots of people saying that DeepSeek have done all this training for $5.5 million, and it costs SAML and $500 billion and stuff like that, and that’s all nonsense. There’s lots of creative narrative accounting occurring where it might be the case that the actual run took 5.5 million.
It probably does not include all of the experiments that happened.
It probably does not include all the capex that happened on the GPUs that the parent quant company was donating time on to DeepSeek. And it also probably doesn’t include all the off-the-books GPUs that were probably sold to that quant company as well. There’s lots of things of like, yeah, what they did here was very cool, and they still were constrained quite a bit and did some pretty crazy optimizations.
But there’s lots of hyperbole going around about how efficient all this stuff was. That being said, let’s talk about the cool stuff that is quite efficient. So the first one, the very big one is this MOE design.
Now, it’s very likely that a lot of the major models are doing this mixture of experts design.
We won't go super deep into mixture of experts here; we did do a talk on it, I think in October 2023, which was around when mixture of experts was just getting hot in the LLM space. The talk at the time was that this was probably what GPT-4 was using, an MOE model.
But that was prior to Mixtral, I think, so we were a bit early on it.
But this is basically a way of splitting up all of the parameters of the model and only activating certain ones.
And each of these experts, they’re randomly initialized.
So it’s not like this is the science expert, this is the math expert, this is the blah expert.
They’re just random networks that we’ve kind of split up and we have a router to try and route things to the right networks that is also trained with it. And so it almost serves as this weird kind of turnkey decoder ring to get you into certain circuits of neural sort of patterns inside of the networks. So it’s very clever and it lets you be able to run these massive models at significantly reduced cost at runtime.
And you still have to have all the VRAM and all that sort of stuff, you still have to have them loaded up.
But it allows you to not have to do the multiplications across the entire network.
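For a sense of the mechanics, here's a toy top-k MoE layer with a learned router (sizes and structure invented for illustration; real MoE layers, DeepSeek's included, add shared experts, load-balancing tricks, and heavy parallelism on top of this):

```python
# Toy mixture-of-experts layer: a router picks top_k experts per token, and only
# those experts' parameters are used for that token's forward pass.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)   # trained jointly with the experts
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                              # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)     # (tokens, n_experts)
        topv, topi = scores.topk(self.top_k, dim=-1)   # only a few experts fire per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e
                if mask.any():
                    out[mask] += topv[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```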
The problem is that these things are stupid hard to train, because of that problem we were talking about before of trying to keep everything in sync: imagine trying to split that across 32 different experts, plus the router between them, as well.
You have lots of random little problems that can happen with this.
I mean, there’s so much that can go wrong with a mixture of extras models.
That by itself is very difficult to get past.
And so on top of that, they've got this bonkers multi-head latent attention, which I think is the coolest thing in here. Maybe not the most impactful, but it's really cool. We'll actually get to a better picture of this.
But the general idea is that, actually, I’m just gonna go to that other picture right now.
So multi-head attention, this is the base transformers thing where it was a big deal: you're basically splitting the attention out into multiple heads so that you can parallelize it. You'll notice there are usually eight, which correlates with the normal number of GPUs you have inside a node. So you have one of those big nodes of eight GPUs, and each GPU would get one of the heads, essentially.
It’s generally how this works.
That’s really cool. There’s other attention methods that have come out that’s been pretty neat. GQA is one that you’ll see a lot in the open source space. I don’t know if it’s in the closed source space, but I know it’s in the open source space.
Or this concept is that you’re trying to kind of cluster the attention a little bit, so you’re not having to do so many computations. And just finding different little ways of kind of nipping away at GQA. There’s lots of things now where people will only do global attention every eight layers, and they’ll do local attention for 4,000 tokens on all the intermediate layers between there without getting a performance loss, little things like that to reduce the amount of calculations that we have to do.
And then multi-head latent attention is just bonkers to me.
Because essentially what they’re doing is that they’re not, this concept of the KV cache, and we’ll go super deep into that, but essentially think of the KV cache as storing all of the context up to the point of the token that you’re currently generating.
So it’s basically your context length.
You can think of it as a proxy for that.
And they’re basically compressing the KV cache into latent space.
So they do not collapse it into token space.
And so they’re able to drastically reduce how much they’re storing here and just project it at inference time into the generation process.
And what this led to, I think, let me pop down to this, I'm gonna cheat a little bit. So they're compressing the keys and values into that latent and getting roughly 93% compression on the KV cache, which is nuts. It's still quadratic, you still have quadratic attention, but you're drastically shrinking the constant that it's being multiplied by.
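Roughly, the shape of the trick is one small down-projection whose output is all you cache, plus up-projections that recover K and V when attention needs them. A toy sketch (sizes invented, and it skips the decoupled rotary-embedding details the real MLA has):

```python
# Toy latent-KV compression: cache one small latent per token instead of full K/V.
import torch
import torch.nn as nn

class ToyLatentKV(nn.Module):
    def __init__(self, d_model=4096, d_latent=512, n_heads=32, head_dim=128):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)             # compress into latent space
        self.up_k = nn.Linear(d_latent, n_heads * head_dim)  # re-project at attention time
        self.up_v = nn.Linear(d_latent, n_heads * head_dim)

    def forward(self, h):                  # h: (batch, seq, d_model)
        latent = self.down(h)              # this (batch, seq, d_latent) is all you cache
        k = self.up_k(latent)
        v = self.up_v(latent)
        return latent, k, v
```

With these made-up sizes you'd be caching 512 numbers per token instead of 2 x 32 x 128 = 8,192, which is the same flavor of saving as the compression number above.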
So that’s super cool.
I don’t know, I haven’t implemented this, but theoretically that seems really neat. And it seems to have gotten them really good results. The other one, I’m not gonna talk about this, I don’t know much about the MOE with auxiliary loss.
I know it’s there, I have not dug deep into this one specifically, but that is one of the other things that people are talking about. I will say though, multi-token prediction, this is an interesting one.
So they’ve actually found that instead of predicting one token at a time, which is like the normal decoder sort of thing, that they can actually predict multiple tokens at a time and it does not decrease loss and sometimes even improves the performance.
So instead of generating strictly one token at a time, I might be able to predict better if I'm predicting a chunk: if I'm going token by token by token, the single next most likely token can lead me somewhere more incorrect, whereas if I try to predict larger chunks at a time, I might land closer to the right answer. That's what they're finding, and there's some other backing research that makes that make sense.
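The gist, as a toy sketch (DeepSeek V3's actual multi-token prediction uses small sequential modules rather than plain extra linear heads, and the sizes here are invented, so treat this as the idea only):

```python
# Toy multi-token prediction: extra heads trained to predict 2, 3, ... steps ahead
# on top of the usual next-token head.
import torch.nn as nn

class ToyMultiTokenHeads(nn.Module):
    def __init__(self, d_model=4096, vocab_size=32000, n_future=2):
        super().__init__()
        # heads[0] is trained against token t+1, heads[1] against t+2, and so on.
        self.heads = nn.ModuleList([nn.Linear(d_model, vocab_size) for _ in range(n_future)])

    def forward(self, hidden):             # hidden: (batch, seq, d_model)
        return [head(hidden) for head in self.heads]
```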
And here’s the really big one that has probably the most performance increase is just being able to do raw FP8 mixed precision training.
Generally these models are trained at FP16 or BF16 now, sometimes with FP32 master weights; I don't know exactly what the frontier labs train at. But FP8 mixed precision is really something that only came along with the latest batches of GPUs out there.
I think native support for it really starts with the Hopper H series, the H100s and friends, plus the newest consumer generation; Ampere is dragging behind on that one.
And so they’re able to train at that position, which has not really been seen at this scale, especially not with something like an MOE.
And the general idea here is that there’s less fidelity at the end of these floating points.
And the fact that you’re able to do that without reduction in performance is pretty neat.
But you’re not going to be doing on your 4090, unfortunately. So I think you do need at least Ampere, I think 8.6 compute capability is the best you can get.
There is another one here.
Okay, we’ll talk about that later.
All right, so let’s talk about DeepSeek R1. All right, so this is the one that everybody flipped out about.
It was just kind of weird, but everybody did flip out and they probably should have flipped out. I’m not saying the flip out was unnecessary. I think where it came was kind of weird. That being said, there is something extremely cool in here.
Oh, have I lost, well, I’ve lost my non-meme slide. All right, we’re going to ignore these guys for now. Just gonna put them up here.
Not to presuppose what my opinions on these are. Okay, so DeepSeek R1-Zero. There are two models that they talked about here.
So the first one was DeepSeek R1-Zero.
This one was trained with a pure reinforcement learning approach.
So it did not have any supervised fine tuning in it from the get-go.
And so it just kind of naturally emerged with the chain of thought sort of concept. You know, before we were showing that we kind of bootstrapped it with some examples. You kind of poked it and prodded it and did all that sort of stuff with it.
And this was the one where they were able to get the chain-of-thought performance, that new paradigm, without any sort of non-RL methods. And then they've got the other model, which is DeepSeek R1.
And basically the problem was that they found this model, it was really cool, it was able to do this stuff, but it was giving crap answers. It didn't have the RLHF, it didn't have the SFT. So it had trouble dealing with multiple languages, and it had trouble giving answers that humans liked.
You know, it wasn’t a good chat model, but it was trained with pure RL.
That’s pretty neat.
And so they did DeepSeek R1, which had a better cold start, a lot more training stages, and human-friendly output. This is the thing that everybody was using, and even more to the point, it's what everyone was freaking out about. The thing is, that's not the interesting part; that one is basically just V3 with the reasoning recipe on top.
They didn’t do anything for real here.
That’s super novel. They gave it the chain of thought stuff, but we know that’s not hard.
We’re able to still DeepSeek R1, you know, or DeepSeek, yeah, DeepSeek R1 into Quinn and give it that capability. Yeah, you can do that. But this concept of doing it from a fresh start, that’s new. So DeepSeek R1, boring.
I’m sure it’s a fine model, that’s cool.
But not interesting.
And when I say that DeepSeek R1-Zero is interesting, this is why.
So this is the training pipeline for the DeepSeek R1 series.
And so you got two major ideas.
So you got DeepSeek R1 here; this is the one that everyone's using. You got your Qwen distills here; this is pretty cool, not gonna knock it. And you got DeepSeek R1-Zero.
Okay, so we see DeepSeek V3 base.
All of them are from the V3 base.
We do the SFT.
And so we’re gonna go down the V3 line.
It does SFT, gets cold start data, which is basically your early examples.
It does all of the GRPO stuff, chain of thought, language reward. Then we go into some SFT here, some rejection sampling with reasoning prompts.
We get some additional reasoning data and some non-reasoning data, with some fine-tuning, chain-of-thought prompting, and an additional model helping to reason over that and generate those examples.
Then we get some combined data here and we take that over and do some more supervised fine tuning, some more RL, and then we get DeepSeek R1.
All right, let’s look at R1-0.
Okay, we got the base.
And you’ll notice here, the dotted line here, it does not get the SFT.
It goes into this block.
So it gets reasoning.
It gets chain of thought.
And it’s done. This makes me think of the better lesson. R1-0 is very, very cool. So we see here the trajectory that it used over time.
So we can see it bootstrapping its chain of thought process.
And actually, I think I missed one here.
So this is kind of an example of what it’s doing when we were talking about the reasoning chains.
It’s going through the question and it’s going through stuff, but it has this backtracking capability, this wait, wait, wait.
There’s an aha moment I can flag here. I’m not done thinking yet.
So it’s self-taught, this behavior. And so with that, we see it starts here with relatively small chains of thoughts.
The steps here is how many training steps.
And the average length per response is an indication of how long the chain of thought was for the questions it was answering. And you can see here that as training went along, it was not only getting more correct answers, but it was naturally, and on a pretty steady curve, increasing the length of its responses.
So it was teaching itself to more effectively backtrack over time.
And this doesn’t seem like it’s flattening out. Very interesting.
So there’s some other things that were very interesting about all the DeepSeek stuff. I think that really the big one here is a lot of the efficiencies they use to get outside of, I’m looking at time here, outside of the constraints they had on GPU interconnect.
So we can see here, and it’s kind of a good reason for people to freak out, is the rapid increase up to what the best model was, which I think is 03 here, of capability.
And how they really were able to do this was some clever stuff they did with the CUDA libraries.
So they actually went a layer under CUDA, down to PTX, the parallel thread execution layer, to get past this. Let me see, I think I've, oh, here we go. Yeah, here we go, this is perfect. So they have these H800s.
So before October 7th, the H800s were not in the embargo; the H100s were in the embargo, the A100s were in the embargo. And the major thing that got nerfed on the H800 was the interconnect bandwidth, which is basically the ability of the GPUs to talk to one another. What they essentially did is go below CUDA and dedicate a slice of each GPU's 132 streaming multiprocessors to handling that communication, to fix that gap, so they could get the interconnect performance back up to what they needed and essentially make them behave like H100s again. Which, I mean, is a crazy bit of engineering, and it's going to be something that's very useful.

There's also some CapEx stuff, but I'm looking at the time, so I'm just going to move forward here. The general idea is that there's lots of really clever stuff, lots of it is stuff we can use ourselves, and lots of it is open-sourced. So that's why there's a generally positive reaction, I think, from the community as a whole.

Some other thoughts that are out there. We had this discussion about the different sizes: if you're saying that you're running DeepSeek on your own machine, you're not. You're running one of these much smaller distilled networks. And so, essentially, this right here is the concept of the lottery ticket hypothesis, which is that inside every neural network there is a smaller neural network that is really, really, really good.
And if we could somehow magically know which weights those are, we'd get the performance of something like GPT-4 on a one-billion-parameter model, or GPT-5, or whatever it is; there's some small area of the network that is doing better than the rest.
And so there’s this concept of pruning where you’re trying to prune it beforehand, and that’s a valid one.
But what we’re really looking for, what people really seem to do in practice is this knowledge distillation where we have a larger model that exists and we’re trying to distill it down into a smaller model.
And so that’s what’s happening with these deep seeks that you can run on your computer.
You've got big-brain DeepSeek right here, R1, 671 billion parameters with 37 billion activated per token.
It’s the teacher.
We use our favorite friend, KL divergence, and here we're using it in the positive sense: we're measuring the distance and pulling the student toward the teacher's distribution, teaching that to the student model, which right here is our comparatively little dumb-dumb Qwen.
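A sketch of the textbook version of that student/teacher setup, classic logit distillation with a temperature-scaled KL term (worth noting the released R1 distills were reportedly produced mostly by fine-tuning on R1-generated outputs, so take this as the general idea rather than their exact recipe):

```python
# Classic knowledge-distillation loss: KL(teacher || student) on softened logits.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)    # big-brain target distribution
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    # F.kl_div expects log-probs for the input and probs for the target;
    # the T^2 factor keeps gradients comparable across temperatures.
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * t * t
```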
But it works. That’s the big thing. It sounds dumb, but it works. So we can go from our 685 billion parameters.
I think this is like a terabyte or over a terabyte.
We’re able to get it down to the seven or one or below.
And really the seven Bs seem to be, the one Bs aren’t really taking it that well.
The seven Bs seem to be where they start to do pretty well. I think quin 32 is really good on performance.
All right.
One last thing I did want to talk about was this paper that came out on the third: s1, Simple test-time scaling.
And they were able to get this same behavior, this "wait, wait" aha moment, and that backtracking behavior is huge for performance.
So it’s really quite a critical thing. And they were able to do this for 50 bucks. So they were fine tuning a high quality data set with only a hundred K tokens.
So they just kind of crawled the internet, created a new data set for this with a hundred K reasoning examples. And they were able to generate out this wait performance.
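The related test-time trick in the paper, budget forcing, is delightfully simple as I understand it: if the model tries to close its thinking block before it has spent the token budget you want, you suppress the end-of-thinking marker and append "Wait" so it keeps reasoning. A rough sketch (generate_until is a stand-in for whatever generation API you're using, not a real library call, and the word count is a crude proxy for tokens):

```python
# Hypothetical sketch of s1-style budget forcing around a generic generate helper.
def think_with_budget(model, prompt, min_think_tokens=2000):
    text, used = prompt + "<think>", 0
    while used < min_think_tokens:
        chunk = generate_until(model, text, stop="</think>")  # hypothetical helper
        used += len(chunk.split())                            # crude token estimate
        text += chunk
        if used < min_think_tokens:
            text += " Wait"          # swallow the stop and force more reasoning
    return generate_until(model, text + "</think>", stop=None)  # then produce the answer
```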
I didn’t post it here just because we’re, you know, getting to be quite long. But this is definitely a paper to check out. I think this is a hint of kind of what the R10 might’ve opened up. And Charlie has helped me out and actually linked all the papers, but I will send out a few of those links to discord or maybe convince Charlie to be nice and do it for me.
Yeah. That’s what I got.
I guess any time for discussion, questions, all that sort of jazz. That was a great breakdown. Thank you. Excellent.
I had a random question.
Have you seen the, I wasn’t able to fully grab my head around the GRPO adjustments to it. Have you seen that break its way into any more of the reinforcement realm or is that just more of a particular only useful for LLM type stuff? I have not looked into anything like that. I’ve just, I have used it. So I will say Unsloth has a version of this.
And if you don’t know about Unsloth, if you’re interested in training, it’s super cool.
I think I sent it out, but they have a new GRPO trainer that's out there. So I've played around with it on a 1B Qwen model, done some training with it there, and I'm kind of trying to figure out how to shift to it versus ORPO. All of my current stuff uses ORPO, so I'm kind of trying to shift over right now. I don't see why it wouldn't work for other cases. Yeah, I was curious about that, because I was looking at some of the libraries I've used for just vanilla reinforcement learning, and none of them have any mention of GRPO.
So I think that’s something I definitely need to dive more into myself.
Reinforcement learning has always had a soft spot in my heart. Well, that’s good. It’s gonna be around. Right, any other thoughts, questions, musings? Yeah, I will just jump on and say, I am getting these links in the discord.
Thank you. Yeah, thank you for doing that, Charlie. And sorry for not giving more feedback on this; it was a great refresher. Some of this stuff, even though it's only a few years old, I'm sure you know how it feels: there's so much going on that I'm like, oh yeah, I forgot about that.

Yeah, some of it was starting to make me feel old. I remember when the Bitter Lesson was first posted and everybody was like, oh my God, this is so amazing, and seeing how long ago that was hurts a bit.

Yeah, I'll admit a couple of times I got caught up. I did a little browsing for 2014 articles about how AlphaGo would never succeed and AI would take decades to beat Go. I'd forgotten about the STaR paper until you linked it; I was texting some friends like, didn't we talk about this?

Yeah, I think STaR is probably the biggest sleeper paper out there, as far as how impactful it was behind the scenes. Yeah, I definitely agree on the strawberry connection. And Q*. Yeah, Q* was the other one. Yeah, the one I linked for coding was basically the same results as you're showing on the math side.
OpenAI just put out results from their o1-ioi model, which was specifically trained for competitive programming, versus the generalized o3.
And there’s some really cool findings about how the generalized model was outperforming the model that was custom trained for the task. Was this the one, I saw something about, they got a gold medal or something like that? Yeah, their score was a gold medal for it. And I think some of the other coding challenges that they gave results on the paper were even more impressive. Like there are a couple of different points where they’re hitting like 99% plus accuracy. And it was within the bounds of the exam. Like no custom parameters are set up for it. Just the straight up model.
But as I said, it’s just been out for a few days and need to do some more digging to make sure everything’s legit and what it means. But yeah, there’s some interesting stuff on generalized model performance versus specialized and how the larger generalized models are kind of performing better than expected. So I’m really interested in where that’s going. That’s probably good. I mean, bodes well for AGI, I guess. Yeah. But no, I definitely agree.
I think V3 was the big turning point, and all the R1 attention was interesting. Yeah, well, it was definitely a shock to people that somebody could come out of quote-unquote nowhere and rise to the top of the app store, so there was a lot of social hysteria around it. A little bit justified, but how it came about kind of hinted that it wasn't; it could have been justified if it had come from a different place, but so it goes.

And take this with a grain of salt, but you know how there are always certain people who seem to have a general idea of when updates are coming out and might be off by a few days; I'm hearing musings about something from DeepSeek hitting tomorrow. So you'll see. There's lots of murmuring around, lots of video things as well, pretty crazy video stuff coming out pretty soon. I think we're about to be in for a very wild few months. Yeah, at Midjourney's office hours today, David Holz.
Yeah, did you catch what he was saying about being back on board with video? I think he's partnered up with Alibaba or something like that. Yeah, that's kind of what it sounded like. Yes, those are always very interesting for sure. Yeah.
Well cool, we are at time. You guys wanna chat about this stuff, the Discord’s open, post things about this. This is definitely something that we’ll keep watching. We’ll host another one of these next month. So if you have some papers out there, maybe we’ll be looking at whatever’s gonna be popping out in the next few days.
And maybe that will be what paper we do.
But let’s kind of keep up with whatever’s going on and it’s hot out there so we can do a dive like this a little bit deeper into stuff. Cool. Sounds good, thank you all. Thank you.