Transcription provided by Huntsville AI Transcribe
All right. Okay, so I'm talking today about world models, and this is a fun one.
So there's a lot of things in here that harken back to some of the other stuff that we've talked through in the paper series, especially our review of Wan, where we went through that model and diffusion transformers.
So lots of relations to that in here.
There’s an entire segment of world models that are just like that model.
So we'll talk a little bit about some of the stuff that we hit there just to freshen it up, but we'll breeze right through it and focus on the world-model aspects of those models rather than the video-generation aspects. There's also a lot of relation to where we were talking about reasoning, which was our very first paper review series.
There are lots of elements related to reasoning, latent reasoning, so reasoning without words.
We’ve kind of talked through that.
How do you do it without chain of thought?
And so what does that look like?
And we’ll talk about that here.
And that’s actually what VJEPA is.
And then we'll also talk a good bit about encoders, which ties to our cross-modal encoder stuff.
So lots of stuff in here.
World models are a big topic. I think it’s very likely, you know, everyone’s talking about this year was the year of agents or whatever. I think at some point we’re going to hit the year of world models.
It’s kind of one of those sorts of things where lots of stuff is going to be tied to it when it opens up.
And hopefully it'll make sense why, specifically, after this talk. We're gonna talk about what a world model is. We'll go through some of the stuff that we talked about before on latent space, but now expand it a little bit, since we're talking about a new part of that equation. And then we'll look at a lot of stuff in the current world model landscape.
So there’s really three different kinds of world models that exist right now.
We’ll go into each one of them and kind of map them and also map where it gets blurry.
That’s one of the big things I hope you’ll find is that when these things kind of get to their final state, they’ll probably be pretty blurry between these three things. These are the three specializations right now.
Those are the predictive and latent planner models, the generative world models, and then the interactive world models.
Then we'll go into the paper review with VJEPA 2.
We’ll go deep on that one.
The first half is a little bit lighter, and then we'll go heavier on some of the math and architecture stuff later. We're going to start off with: what is a world model? And while I was going through, I found this video from Yann LeCun.
Yann LeCun is currently, for the next few weeks or so, the chief scientist over at Meta AI. He's about to go leave and do his own thing, which is part of the reason we're doing this talk now, because there's lots of stuff related to that. But he actually invented convolutional neural networks back in the 80s. So he's like one of the OGs.
And a lot of the stuff we're talking about is going to be dealing with convolutional neural networks, so that's important. He's been working on this for a long time; he kind of knows what the deal is. So I'm just going to let him take it. AI is a technological challenge as well as a scientific challenge, because we don't know how to build truly intelligent systems yet. It's one of the big scientific questions of our time: what is the universe made of, what is life all about, how does the brain work, what is intelligence really? Hi everyone, my name is Yann LeCun. I'm the chief AI scientist at Meta. As humans, we think that language is very important for intelligence. But in fact, that's not the case. Imagine a cube floating in the air in front of you, and imagine rotating that cube by 90 degrees.
You can sort of picture this in your mind, and this has nothing to do with language. Humans and animals navigate the world by building mental models of reality. What if AI could develop this kind of common sense? an ability to make predictions of what’s going to happen in some sort of abstract representation space.
We call this concept a world model.
Allowing machines to understand the physical world is very different from allowing them to understand language.
A world model is like an abstract digital twin of reality that an AI can reference to understand the world and predict consequences of its action and therefore it would be able to plan a course of action to accomplish a given task.
It does not need millions of trials to learn something new because the world model provides a fundamental understanding of how the world works.
The impact of AI that can reason and plan using world models would be vast. Imagine assistive technology that helps people with visual impairment. AI agents in mixed reality could provide guidance through complex tasks making education more personalized. Imagine an AI coding agent that can actually understand how a new line of code will change the state of the variable of the program, the effect on the external world in the context of existing code.
And of course, world models are essential for autonomous systems like self-driving cars and robots.
In fact, we believe world models will usher a new era for robotics, enabling real-world AI agents to help with chores and physical tasks without needing astronomical amounts of robotic training data.
This is a very exciting time for AI research, with a captivating set of scientific questions in front of us. We want to understand intelligence itself, as well as learning, reasoning, and understanding the physical world, so we can build systems that can help billions in their daily lives. We're excited to announce the release of V-JEPA version 2, the next step in this journey. Stay with us as we continue exploring the possibilities of world models and push the boundaries of AI research. Yeah. So that is Yann LeCun's talk on this.
We're going to talk a little bit about the backstory of world models. It feels old now, but back in 2017, eons ago, this paper came out from Jürgen Schmidhuber. He's one of the other sort of old heads in the world of AI, with lots of stuff on LSTMs and RNNs; I think he was really tied to that area. And he came out with this concept of the world model, where they were basically determining if you could properly classify panels in comics. With disparate panels in comics, as humans, we can read them and understand what's going on. But if I see a baseball zooming, and in the prior frame there was a pitcher, can it tell that that pitcher threw the baseball, and stuff like that?
So that’s what they were kind of looking at here.
And he's really looking at this idea of wanting to, in an unsupervised manner, have the model learn its own internal representation of the world.
And that's the important thing that makes it very, very different from the current generative models, where we're basically trying to have the model reproduce things that we feed into it, in different ways.
We’re trying to get it to generate stuff out.
The world model is trying to get it to understand things implicitly. And so that goes back to 2017, but it actually goes all the way back to, I believe, 1991 with Richard Sutton. Now, we've talked about Richard Sutton a few times; this is the guy who did the bitter lesson, and he's still out and about and talking through things. But he came up with this concept of Dyna, which is, you can see here, he has this world model where there is an agent, it performs some sort of an action, gets a reward from that, and some sort of situation or state comes out of it.
And what we're talking about here is that the environment isn't just the place the agent is in. The world model here is an actual model that is somehow feeding into the thing that does the learning.
And that’s kind of what we’re talking about through here.
One of the places where it can get really confusing is people who are kind of doing world generation.
They’re saying that these world generation models are world models.
And that can be true.
But it also might not be, so that's one of the confusing things that we'll try to dispel here. The big thing here is: we all have world models. We all have an understanding of what's going to happen with this cup, that the water is going to fall out, that the water is not going to shoot up into the air and explode, like one might want it to, because we have a world model, and sometimes these things don't. So yeah, that's going to lead into, I think, one of the easiest ways to understand what a world model is, which is to look and see the different failure modes that AI gets into that are just 100% because they lack a good world model.
Things that would be obvious to anybody else, but they just don't get it, because they lack anything past language intelligence or basic vision intelligence in the form of patches,
where it's doing the sort of OCR stuff that they do, but it doesn't understand what it's looking at.
And so these are kind of fun. So we’ll go through these now. So one of these is the concept of the blueprint bench where they’re giving a task and trying to have the model take some pictures of a layout and then put dots on each room and kind of label those dots and do proper things with the entryways and follow like all these code guidelines and stuff like that.
And we find that the models fail utterly at this sort of thing.
because this requires an understanding of some deep semantics about how these things are supposed to happen.
And as you try to translate that, number one, from a picture that's in 2D or 3D space onto a top-down view: most of us could do something like this, but the models just absolutely fail. We get things in the simple case like this living room right here that goes into the kitchen and nowhere else, and the bedroom that opens up to the outside.
This is probably not a perfectly functional house, but then you also get these monstrosities, which I think also show some of the problems: they can make very, very pretty things.
This looks like, if you just saw it and flipped through it, it looked fine, but upon further inspection you'll notice a lot of very odd things in this picture. Like, for instance, the fact that we have a mudroom and laundry room with two tubs in it. We have a bedroom that is all cabinets. We have a hallway into a Jack and Jill with three sinks, and a bedroom with a sink. And yeah, the master bedroom is outside on the porch. So it's the sort of thing where it's done something here that's, you know, your next HGTV show.
Yeah, designed by AI. Yep. And different things, like these beautiful AI search overviews: how heavy is the average man? We all know that we exponentially grow larger as we grow older, so at 20 to 29, 188 pounds.
When we get to our 30s, we go up to 300, 40s, 700.
And then of course, 5,000 at 50, and at 60 we become one year.
So it's really lost the plot here.
Obviously, I don't know who Lea Vericdar is, but the thing has confidently said that he's 217 feet, 16 feet; that's how tall that guy is, which is very impressive.
He must drink his milk. We also see this in the generative world in sort of video generation.
So here’s an example.
This is something called first frame, last frame, where we have a starting frame and an ending frame and get the thing to interpolate reasonably between them.
And in here, they asked for it to be a cut: time passed and the plant grew, essentially, is what they were trying to get. But we see the world model has found a better solution: just unfold the plant. Why didn't I think of that? It's a great idea. Yeah, it doesn't know how these things work. It just knows how to reconstruct the pixels here.
We have another example here.
I think this is a robotics one.
What are you doing? That’s my mirror.
What are you doing?
One of your units just ran into my mirror and is causing damage. I'm a little worried. That's not gonna work. Oh my god. Yeah. So there are probably issues with not figuring that sort of stuff out. We can also see this in other areas. So this one might be a little bit closer to things that we care about.
Well, world-model failures can show up with, you know, not understanding private information.
This came from a report late last month where they were holding these sort of investor meetings and having these discussions.
And they had lots of information about all the other deals they were doing that the thing had to negotiate. And it started blurting out information about these other deals while talking to the person they were negotiating with, just kind of out of nowhere. It has no concept of, oops, that was private, all that sort of stuff. That's also just a failure of the world model. It's just doing this sort of next-thing prediction without any understanding of the reason why those things are there. And it gets really bad, especially when we start talking about these autoregressive models, these next-token, next-frame sorts of things, because the errors will compound.
As it goes wrong, it’s going to go wronger and wronger and wronger because they’re doing these probability sort of distributions.
And it just kind of will go off the rails.
See ya.
So we're kind of in this weird area where we have these things that are super, super smart, that can do Math Olympiad questions, that can solve these crazy proofs that are verifiable. And sometimes the things that they generate with these videos are just stunning; it is completely unclear how they can do the things that they're doing. But then sometimes they can't know that 4.11 is less than 4.9. And so stuff like that is what we'll be talking about today. There's also a funny little area where a lot of people will say that these language models have a world model.
They must have a world model if they’re able to determine all these things.
But I think they're underestimating how many things can be learned through an algorithm,
how many things these networks can approximate inside of a neural net. And a lot of the time what they're doing is they're not actually recovering that world model from the language. They're just figuring out some sort of fancy way of solving the task one or two steps ahead. And it seems like they understand, but then it will fall apart. So there have been a few studies on this.
It's called world model recovery; that's the name of the concept.
But in general, it comes out that these things don't have one, at least the pure language models.
So yeah, that’s the first blurb.
Any thoughts on all that?
Some of the stuff we've done, this was a while back, we were looking at some of the initial pieces that were learning how to play video games, just recurrent neural nets, and how many examples they had to see. Millions and millions of hours, condensed, of course, of playing video games, because they didn't start with human-level knowledge that you can't walk through a wall, or that if I open my hand, something I'm holding is going to fall, or which way up is. They basically had to learn that through gameplay. But I think some of that tracks with what you're talking about.
Yeah. So we’ll actually talk about that very specifically. That’s a lot of what world models are trying to tackle. And transformers in general are very interesting because they’re super powerful because they scale very well with data. In the sense that you can just keep feeding it more data and it will keep getting better. But the manner in which it scales is super inefficient.
So it’s very effective but very greedy in how it scales, which is how we got to where we are right now.
But yeah, we'll talk a little bit about that when we get further into the LeCun stuff and self-supervised learning. Okay, we'll talk a little bit now, and I'll go through it again since some of you guys might not have been here for the diffusion transformers review, but a lot of what we'll be talking about today is this concept of the latent space. Is that something that you guys are familiar with, kind of what this is? Okay, yeah, a little bit, I've got things in the numbers, yeah, slightly, not really. Yeah, yeah. It's basically this internal representation space that sits inside of any sort of model, that is represented by vectors, but we don't really understand what it is, essentially.
And so there's also this concept of an autoencoder, where we take some sort of input on the left-hand side, and that might be text, it might be video, it might be audio, it might be another model's outputs.
We're taking that, bringing it down into some sort of latent space, doing a bunch of stuff with it, and then getting it back out into some space that is the actual output of the model. For something like Wan, it might be taking in text and putting out video, so the VAE is going to reconstruct that on either side. This comes from the world of the convolutional neural net. A lot of what we'll be talking about today is the concept of going from pixel space to latent space.
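Just to make the autoencoder idea concrete, here's a tiny sketch in PyTorch. The layer sizes and the MSE reconstruction loss are just illustrative defaults, not taken from any of the models in this talk.

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """Minimal autoencoder: input -> latent -> reconstruction."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder squeezes the input down into the latent space.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder tries to reconstruct the original input from the latent.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)        # latent representation
        x_hat = self.decoder(z)    # reconstruction
        return x_hat, z

model = TinyAutoencoder()
x = torch.randn(8, 784)                       # e.g. a batch of flattened 28x28 images
x_hat, z = model(x)
loss = nn.functional.mse_loss(x_hat, x)       # reconstruction loss drives the training
```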
Now, there are lots of other places where that exists, but pixel space is what people usually talk about here, which is that we’re looking at sort of RGB data, video images.
You think about things that a robot might have access to, and it has to be able to react to the physical world but then also do its calculations.
Generally, what it’s doing whenever we are looking at the encoding of pixel space into the latent is this convolutional network.
As I said, Yann LeCun is the one who created this. It will basically take your picture and turn it into patches of some square of pixels; 16 by 16 pixels is very normal now.
And it’ll just kind of take all of those things and it’ll feed it into the model.
And depending on what kind of model it is, it might also give things like positional embeddings.
So a convolutional neural network, it doesn’t need to do that.
But if it's a transformer, like Pixtral or something like that, it'll also feed in where those patches sit, which we'll also get into.
Because what do you do whenever you have three dimensions?
How does that work?
But that's the basic idea: it's taking this thing, extracting features out of it, and transferring them into the latent.
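Here's a rough sketch of that patchify-and-embed step for a ViT-style model. The 16x16 patch size and the 1024-dimensional embedding are just common choices for illustration, not the settings of any specific model discussed here.

```python
import torch

def patchify(images, patch=16):
    """Split a batch of images into flattened non-overlapping patches.

    images: (B, C, H, W) -> patches: (B, num_patches, C*patch*patch)
    """
    B, C, H, W = images.shape
    assert H % patch == 0 and W % patch == 0
    x = images.unfold(2, patch, patch).unfold(3, patch, patch)   # (B, C, H/p, W/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)
    return x

imgs = torch.randn(2, 3, 224, 224)
patches = patchify(imgs)                          # (2, 196, 768)
# A transformer then projects each patch to the model dimension and adds
# positional embeddings so it knows where each patch came from.
proj = torch.nn.Linear(768, 1024)
pos = torch.nn.Parameter(torch.zeros(1, patches.shape[1], 1024))
tokens = proj(patches) + pos
```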
What that ends up looking like is it usually is not so simple, but whenever we look at things like the MNIST dataset, which is where a lot of this sort of initial learning came from, it’s pretty easy to see that it’s like catching little loops and things like that.
And these different sorts of features will be picked up as some sort of a pattern and then stored together inside the network.
And that is the concept of latent space.
And so what we’re talking about today, there’s gonna be two sorts of models that we’re talking about, one of which is a generative model.
That's what we've mostly been talking about for the past two or three years, however long it's been that we've been in this. But there are also non-generative models.
And that's what VJEPA is.
JEPA is not a generative model.
When we're talking about generative models, take Wan, for instance: it got trained on all these different sorts of videos and images that went in, and basically the idea is that, based off of whatever dataset it had, it should be able to reconstruct that dataset through this variational autoencoder and latent transformation process.
And sort of basically find this thing over here.
But the way that it found it should be so robust that it can also find things that it wasn’t trained on.
That’s the idea here is that we’re trying to reconstruct, regenerate that input data into the output data that matches it in this sort of Gaussian space that exists.
And so that's the idea with Wan.
Wan is trying to do that with pixels; ChatGPT is trying to do it with text and images.
What does it look like when we’re trying to do that with a world model?
How do you pack the state of a world inside of a latent space?
And that is what we'll be talking about. Hey Josh, we do have a question in the chat if you get a second.
Yeah, I need to figure out how to... oh, there we go. "I don't see how the world model would be different from the limitations of true novel generation that LLMs fail with, since it relies more or less on the same probabilistic mechanism." So this is actually one of the things that we'll talk about. It does have a different mechanism. VJEPA has a different mechanism; it is actually not a probabilistic model, it's an energy-based model. So that's one of the ways that it might be a little bit different. But yeah, let's probably circle back to this. I think this could be a good discussion topic to have afterwards, because we can talk through it.
I think some of that might get answered as we go through and some of it won't, but we can kind of talk through what we think about that, if that's okay. Yeah, that's a big one. All right. Let's go into the different kinds of world models real quick.
And so we're going to start with the big divide, which is the distinction between the predictor and the generator.
The generative models are where we're basically trying to simulate and visualize the world, and the predictive models are where we're trying to focus on that implicit model of the world.
So one’s focused on the output and one is focused on the internal representation.
Generally, the predictive models are focused on planning.
A lot of how AI worked for a long time was this predictive model.
This is really the older mode of things up until somewhat recently.
There’s always outliers, but a lot of stuff was focused on this.
This is an older way of doing things, but obviously still very relevant to what we want. We’ll talk about that in context of model-based reinforcement learning and then self-supervised learning, which is JEPA. And with that, we’ll talk a lot about this dreamer paper, which I found as we went through here, which is really kind of the counterpoint to JEPA in this area. Then obviously, generative models, this is making worlds and doing stuff. I feel like I don’t have to speak too much about this.
We will go through some of the cool sort of uses of this that we haven’t talked about and that will likely be the subject of a future paper review in very short order, since we haven’t gotten into those interactive ones especially. Let’s start off with the predictive world models.
With this one, we're going to talk mostly about that Dreamer model, because the other model we're covering at the end of this presentation.
I'm not going to cover it twice.
The idea here, obviously, we’re learning that implicit understanding. With the model-based reinforcement learning, we’re really looking at a normal reinforcement loop.
The agent is taking an action in the environment and he gets some feedback back.
But the big thing here is that there is some sort of world model, an implicit understanding, that is actually being trained in the state.
And so what it's getting trained on is not necessarily the actions that it produces; it is getting trained on what it predicted would happen based on the actions. And it seems to work, actually.
So we’ll talk a little bit about that.
And then self-supervised learning is where we are basically having an input encoder going into some sort of predictor and then comparing the outputs of that against an unmasked version of the same encoder.
And that’s the joint embedding architecture here.
And the big thing with that, I will talk about later.
So the big difference I want to pull out here is that JEPA is really about representation learning. It is trying to focus on the internal representations that the encoder is learning, where this Dreamer model that we'll talk about is looking to see how its plans align with the actual environment.
So there's an action component, where VJEPA does not have an action component so much.
It does in this paper, but the actual VJEPA itself does not.
The concept of self-supervised learning is focused very much on masked prediction, where it's comparing something that has absolute knowledge of the world against something where you're masking out up to even 90%. And that's really the focus: just filling out those encodings.
How good is it at predicting what’s there?
Where model-based RL is, once again, looking inside of its environment and looking to generate out a reasonable sequence of actions.
There's a line of papers called Dreamer.
I think it started in 2019. This comes from Google. You'll notice a lot of these things in this area come from Google; they're also the ones who did AlphaGo and AlphaStar and all that sort of stuff, so it's no surprise that they're going to be good at this. And this is the first time they started really playing around with this idea of trying to get actions out of these agents by manipulating the latent space. And so with this one, they started off with these, have you ever seen these videos of them trying to train models to walk in 3D space? It used to be a big thing, you know, like five years ago or something like that. This looks familiar. Yeah.
I've actually used this in explaining some things. Yeah. And it's funny, the things they can come out with. Right.
Right.
And so this is dreamer.
This is what their very first test was: can it predict out the video based off its latent representation of what's going to happen? And so you have here the actual video, where it's trying to do something right here. Based off of its knowledge, it's able to predict out what it's going to do.
And they have an actual reconstructed version of that video that can be inspected. And the difference here is that they’re actually generating out that video into pixel space at some point.
So that’s their end goal.
And they’re basically saying, hey, can this work?
And turns out it did.
So it was just kind of a crazy idea. If it hadn't worked, it would have just been another paper that somebody did. But it worked. And so, yippee, we have Dreamer.
So that was the main thing here. The focus of this very first one was basically: I'm going to learn to encode what I'm looking at, and the actions, into these latent states, and then predict what the environment is going to give me, and I'm going to try and make my predictions better. That's all it was doing. It's not acting. It's purely an observer watching something and trying to encode stuff. That is most of what Dreamer is. That was your baby steps. Let me go on to Dreamer V2.
I think this is around the 2020, 2021 time period.
Okay, so now we've got this thing encoding.
Let's try and get it to play Atari. Those are pretty easy games; there's usually just one plane of movement in these things, very few variables.
There’s not stuff everywhere flying.
So they’re going to try and figure out how to do this sort of thing. And so they have it going over a series about 15 games that they’re studying this thing on and its capability. And they’re testing to see if they can have it perform out the actions for these things and play the games. They’re actually playing the games live and survive and win and do the things you’re supposed to do. But also to be able to do it with a reasonable compute, reasonable size model, so you’re able to quickly do it. And one of the things they played with here is also having it be done with discrete variables instead of continuous variables.
See if they can do that sort of thing.
And so we have here an add-on to Dreamer 1, which is that we're now adding in the hidden state, where it is holding its latent.
It's a deep network now, and it is trying to spit out the actions that are actually controlling the thing.
And we have the difference between the predictor of what's my current state, to what's my next state, to what's my last action. So it's tying all those things together into a more complex, temporally aware model.
And hey, it worked again.
So one of the things that is interesting here is that they do have these discrete variables.
So instead of it being a continuous space, it might just say, who's the last guy who shot, or some other thing where you can only store one value. Because of the limitations of Atari, this seemed to have worked. It is probably not something that would work in a modern game, just because of how dynamic those are, but they were able to get away with that hack in this case, and so, hey, bully to them, that was their goal. The other thing they added in here was the concept of an actor-critic, and this will come back many times. So this is some sort of adversarial setup; if we talk about GANs, it's kind of in that area, where there's some sort of teacher model that is looking at the things coming out of the model and pushing back against them.
It's in their loss somehow.
And so it's having this thing that's kind of doing this post-planning activity for training. So yeah, that is Dreamer 2.
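Just to make the shape of this concrete, here's a very rough sketch, loosely in the spirit of the Dreamer line: encode the observation into a latent, roll a learned dynamics model forward in latent space while an actor picks actions, and predict rewards along the way. All the module names and sizes here are mine, and the real models (the RSSM, discrete latents, the actual actor-critic objectives) are much more involved.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, latent_dim = 64, 4, 32

encoder = nn.Linear(obs_dim, latent_dim)                 # observation -> latent state
dynamics = nn.Linear(latent_dim + act_dim, latent_dim)   # (latent, action) -> next latent
reward_head = nn.Linear(latent_dim, 1)                   # latent -> predicted reward
actor = nn.Linear(latent_dim, act_dim)                   # latent -> action (continuous here)
critic = nn.Linear(latent_dim, 1)                        # latent -> value estimate

def imagine(z, horizon=5):
    """Roll the learned dynamics forward in latent space ("dreaming"),
    never touching the real environment."""
    states, rewards = [], []
    for _ in range(horizon):
        a = torch.tanh(actor(z))                          # pick an action from the latent
        z = torch.tanh(dynamics(torch.cat([z, a], -1)))   # predict the next latent state
        states.append(z)
        rewards.append(reward_head(z))
    return torch.stack(states), torch.stack(rewards)

# One "dream": encode a (fake) observation, then imagine 5 steps ahead.
obs = torch.randn(1, obs_dim)
z0 = torch.tanh(encoder(obs))
states, rewards = imagine(z0)
values = critic(states)   # value estimates for the imagined states
# The encoder/dynamics/reward pieces get trained against real experience,
# while the actor and critic get trained on these imagined trajectories.
```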
And so now we're going to get into more stuff, and I promise there's a reason I'm going through all this. It is very important just to see the trajectory of how things are moving in this space with these world models. But now we're going to go into 3D space.
This is Dreamer V3.
This, I believe, came out in 2023.
And so we’re doing lots of additional stuff. Everything that it did before is still in its model. So it’s not losing capability. We’re building on the same basis and it’s kind of growing this thing like a mold, but we’re just gonna start moving into 3D space now.
And so here, we're still encoding stuff into these discrete representations, but we are kind of exploding them out into something more complex, because it's having to deal with 3D things and starting to deal with long-horizon tasks that go for longer than a single frame. So we're going to start looking at multi-step predictions. They've also expanded out their visualizer here. I believe this is when they started actually adding a diffusion transformer into this model,
as part of what it's doing.
It's a very light one, and it probably has lots of distillation, but they do have some element where it goes into pixel space now for part of its planning process. As part of its training process, it receives the first five frames and then has to predict 45 frames into the future. And you have a model that can see exactly what happened and encodes it, and this one looks at that, and we try to get the loss to line up between those things.
And what they're finding with the architecture that they're doing, this sort of actor-teacher thing, is that they're able to have lots of success very consistently. And I'm not assuming everybody knows what Minecraft is, but if you do, there are these different materials that exist, and the endgame is this diamond material.
And this is the first one who was able to get that diamond material, which takes a significant amount of planning because it’s got to get through all of the prior stages and tech trees and then also find that thing in a very hard to reach place.
And it was able to do that through this architecture.
And a lot of this was because Minecraft is a randomized world.
It’s not the same every time.
You can’t learn just one way of doing things. It has to have an actual understanding of these things. And they do some stuff in here, which we’ll go into when we probably talk about this paper, where they’re actually probing in to see what it’s thinking to understand that it does have a world model.
Yeah, I don’t think that’s necessary. That’s a cool thing though. So I’m gonna go now actually to look at the one that just released. This is what we’re waiting on.
They just released the Dreamer V4, I think just in the past few months or so.
The nice thing about Dreamer is that all their code is on GitHub.
And so they haven’t released this one yet, but they are going to.
So that’s probably going to coincide with what we talked about this one. But just to kind of see some of the stuff that it’s doing, it has to do things like gathering wood. You can see here that’s providing reward. It’s basically looking around for things of interest. Oh, I have to share this tab because Google Meet.
Let me see.
Can you see this now?
Much better much better.
Okay, so yeah, it's doing things like gathering wood and mining stone. And I saw that this video here has some interesting stuff, just to see the speed that it's able to perform at. So you have to think about it: this is a model doing inference for all of this. How efficient is it? I think it's interesting.
So it’s got to build things and know that that has to go there.
That’s pretty easy. But I think getting later into the video, I’ll just sneak ahead, where it’s kind of doing some crazy stuff. Yeah, here we go. Here’s it messing around with diamond.
And the interesting thing here is that there's some lava that pops up.
And this does get dangerous.
If you touch it, you die. And so the model is going around, it's trying to cut around, it's noticing it's got lava here, okay, lava there. It puts a block in very quickly. It's shifting around its inventory. And then, of course, it has flown too close to the sun.
Very interesting that we’re kind of at that point.
So hopefully, it’ll be quite fun to play with.
Have you guys seen any of this sort of player agent sort of stuff?
I think this one might be.
To me, it seems like it's a little bit more hidden compared to some of the other stuff we see. But there's lots of stuff happening. The fun parts were some of the ones that found loopholes in games, things where it could find some tricks to get through, stuff like that.
But yeah. So Josh, I was wondering, and I was Googling it to double-check myself, but I know that Arc Raiders, the new video game, is using some machine learning for locomotion. But I was curious if you knew, it doesn't really spell it out as I read through, if it's using world models for the reinforcement learning part.
I’m not sure. I’ve not looked into that one, but I’ll be happy to look at it on my holiday breaks. Yeah, my husband and I have been playing it a bunch and it’s creepy, right? The drones really are following you quite a bit. Ooh, that’d be interesting.
Yeah. Yeah. All right. So yeah, that is the dreamers. That’s the planning models. And we’re going to talk about the generative models.
These are the ones everybody is talking about now as world models.
It's Sora and your Veos and Gemini.
And to an extent, maybe they might be world models. But these are the things that exist out there. They're the least world-model-y of the world models.
with an asterisk of how they’ve used Gemini in one area that we’ll talk about later.
Yeah, the big idea here is these are using diffusion transformers.
So these have the ability to do all that data scaling. And what they've done is, because they're Google and they have YouTube, they've fed all of YouTube into those diffusion transformers. And yeah, it's data hungry, but who cares, they're Google. Or they're OpenAI and they don't care, so they just fed everybody's data into that. Or they're Wan and they're from China and they don't care. So it's this sort of thing where they just fed these things, and they're able to get a very effective reproduction of pixel space, is generally what it is.
These things generally don’t actually have a world model.
They just have so much data fed into them that they're able to very effectively reconstruct pixel space. They don't really understand the things that are underneath, which is why you still get lots of really weird stuff, but they're able to reconstruct Pikachu, like, whoa.
And that’s kind of why.
These are very interesting.
So they have a concept of a VAE that includes a spatial and a temporal element.
A lot of these things that you deal with video, they’re always going to have this, but the idea is that we’re now adding a third dimension or fourth dimension that includes time. So these videos, how they work is that they do the variational autoencoder, but they don’t add depth.
This kind of tells you what they’re doing, right?
There’s no depth here.
They’re adding time.
It’s not a four dimensional autoencoder.
It’s reconstructing your XY pixels and then adding a temporal element. And so it’s impossible for them to have an effective world model because that’s not how the world works.
But they’re still very cool.
So let’s talk about.
So one of the ones that are very big in this area that does very, very explicitly say I am a world model is Cosmos.
This is an NVIDIA’s world model.
And its entire idea is that it is going to generate out a whole bunch of videos, and the model is going to then use some sort of a vision transformer to look at those videos and decide between them what is the best one.
Does anybody see a problem with this?
How long does it take to generate a video?
It depends on if you’re on the free tier.
It’s only a problem for people who don’t own NVIDIA stock.
Which is a lot faster if you have a data center full of B200s.
So this is their kind of concept.
This is actually one they ablated against VJEPA. We probably won't go into that too much, but they tried to train one of these
to do some sort of comparison, because this is what a lot of people are using, because NVIDIA tells them to use it, and we use what NVIDIA tells us to use, because, you know, executives.
And what they found was that to do the same sort of tasking, it took four minutes on Cosmos and 16 seconds on JEPA, just to give you a scale of your cost and time. So imagine if you have to wait four minutes for your robot to decide how to open the washing machine. And it's not going to be that much better for some of the stuff they want to do. Here's the idea behind something like Cosmos: it takes the input frames.
Before, with JEPA, you were planning out in latent space; this has to reconstruct a video every single time. Along with that, you get world model failures that are hard to train away, because you're dealing in pixel space.
There’s so much information here.
You’ll notice that here’s a ground truth.
The arm reaches down, it grabs the thing, and it moves over and puts it on this thing over here. Everything’s fine and the donut doesn’t disappear. However, our donut disappears here. Sure, they’ve shown here that they’ve been successful and the donut has not disappeared, but wouldn’t it be great if that wasn’t a problem because you’re not trying to reconstruct pixels on things that are noisy?
And so that’s a little bit of the issue here with these sorts of models.
There probably is a space for them.
I think that this probably is something you want to do, but having it as your only driver is kind of insane. It’s very likely this will be a component of some sort of stuff we do. The issue here is very easy. There’s just so much detail in these sorts of things in order to get effective. If you’re going to go that way, you have to go all the way. Since you don’t have that latent representation, you have to be specific because sometimes that little detail is very critical for your application. So let me think about if you’re doing underwater pipe things and you need to detect fractures and stuff like pixel matters, if you don’t have other heuristics. So that’s the sort of things we’d be having these things do. That’s one aspect of generative world models.
We’re going to look now at some of the other ones.
So I’m going to go first into Genie. Have you guys heard of Genie? This one’s very cool.
I don’t think I have.
Oh, this is neat.
What you’re seeing are not games or videos.
They’re worlds. Each one of these is an interactive environment generated by Genie 3, a new frontier for world models.
With Genie 3, you can use natural language to generate a variety of worlds and explore them interactively, all with a single text prompt. Let’s see what it’s like to spend some time in a world.
Genie 3 has real-time interactivity, meaning that the environment reacts to your movements and actions.
You’re not walking through a pre-built simulation. Everything you see here is being generated live, as you explore it. And Genie 3 has world memory.
That’s why environments like this one stay consistent.
World memory even carries over into your actions. For example, when I’m painting on this wall, my actions persist. I can look away and generate other parts of the world.
But when I look back, the actions I took are still there. And Genie 3 enables promptable events,
so you can add new events into your world on the fly, something like another person, or transportation, or even something totally unexpected.
You can use Genie to explore real-world physics and movement and all kinds of unique environments. You can generate worlds with distinct geographies, historical settings, fictional environments, and even other characters.
We’re excited to see how Genie 3 can be used for next-generation gaming and entertainment.
And that’s just the beginning.
Worlds can help with embodied research, training robotic agents before working in the real world, or simulating dangerous scenarios for disaster preparedness and emergency training. World models can open new pathways for learning, agriculture, manufacturing, and more. We're excited to see how Genie 3's world simulation can benefit all of these areas.
So yeah, that’s pretty cool.
So there’s so many cool things in here. There are things that are a lot of fun. But to me, this is insane. This is a very complex pattern.
And the fact that it’s getting the physics of this roller and then is able to look away and then look back, and it’s there.
It’s exactly the same. That is insane.
And right now they have this up to a minute, I think. And yeah, I mean, this is the world model.
Very clearly.
And so this is where it’s going to kind of get blurry.
And it’s going to get even blurrier with the next video that I’m going to show you. And then we’re mostly done with the videos. That one with the paint roller, I watched it cut away and cut back probably like 20 times. That’s crazy. Really?
Really? Now we’re going to talk about another interesting thing here.
We talked about Dreamer. My guess is that this SIMA project probably came out of Dreamer.
It's kind of Dreamer leveled up, with lots and lots of compute, because they had Genie.
This isn't a person playing a video game. It's SIMA 2, our most capable AI agent for virtual worlds, worlds that are complex, responsive, and ever-changing, just like ours. Unlike earlier models, SIMA 2 goes beyond simple actions to navigate and complete difficult, multi-step tasks. It understands multimodal prompts. And if you ask, SIMA 2 will explain what it can see and what it plans to do next.
SIMA 2 can learn, reason, and improve by playing on its own, developing new skills and abilities without any human input. And the more SIMA 2 plays, the better it becomes, taking what it learns in one virtual world and applying it to the next, and the next, and the next.
Even if it's never seen them before. SIMA 2 is not just a milestone for training agents in virtual worlds, but a step towards creating AI that can help with any task, anywhere, including, one day, in the real world. So yeah, the concept here: my guess is this is some combination of Gemini and Veo going into this combined world model, where it has something that's feeding both of them.
So it can generate stuff out, but then also play and exist inside of it and kind of explore itself.
You notice that Genie is generating something and it is playing inside that area.
And so how that works, you know, how does that, you know, what can you do with that? Probably lots of very interesting things.
Another very interesting thing here that I did mention with Dreamer is that one of the things that it got in before was the ability to learn from offline activity. So it didn’t have to be interacting with its environment. Its dreams could be videos that it’s reviewing in between sessions. And so it can do offline learning, which is huge for cost. All right, that’s kind of all of that.
So then we’re going to get into the nerd stuff.
Any thoughts, I guess, before we go into the math?
That wasn't nerd stuff? Oh, I'm sorry. All right, we'll go ahead and get into it so we can go looser afterwards. All right. So this is a little bit lighter. This is just showing what VJEPA is doing specifically. We're going to talk about two things here, one of which is VJEPA itself, the first model, VJEPA 1. Its idea is that it takes a start and it generates out a whole bunch of different latent trajectories.
And then from there, it plans out the action that it wants to come out.
This is just some sort of basic robotic action of A0, you know, goes to A1 or whatever it is.
A lot of the things that they're doing right now is stuff like pick-and-place and your basic robotic actions, but trying to get it to do that from very, very few examples, so that you can learn very quickly in a very light form factor. And a lot of that is scaling without any, I guess, annotated examples. So there are no negative examples.
That’s the big thing with this.
It never sees us saying, hey, don’t do this, which allows it to generalize in a huge way.
And so how it works, and this is VJEPA 2 here, is that it takes in internet video, and there are 1 million images to basically seed it off.
We talked about this with Wan, where they did these very, very tiny images at first to just kick it off. And so they do that with this one as well. And then it goes into its video pre-training, where it gets fed these larger datasets. I think this is something like 22 million videos, or 22 million hours of video, or something like that; some huge amount, but not super huge. And then some VQA and this attentive probe training, which we'll talk about, because it's kind of interesting.
So since this thing is not… It doesn’t generate out an image at the end.
There’s no video to look at.
How do you know that it’s working? That’s an interesting problem that arises from this sort of model.
And then it needs to train out this thing for VJEPA 2, which actually wants to not just be the eyes this time.
It wants to be the brain that says, hey, move here.
So there’s some stuff here. But of note here is that it’s only 62 hours of data that they use for this training.
That is not a lot.
especially if you have a magical machine that can generate infinite data. That’d be pretty interesting.
The basic idea here of what they're doing is that, where these other models are doing this actual pixel-level representation, this one just has an internal model that says, I have an arm, and this thing's over here, and I want to move from here to here.
How do I mask out as much as possible of this video
and still have this sort of loosey-goosey, rough model of the world be accurate?
That's the concept behind VJEPA 2 and VJEPA in general.
And so there’s two major parts of this.
The first part that we’ll talk about is this.
This is just VJEPA 1, basically scaled up.
We’ll talk about that.
And then this action model on the right.
And so this VJEPA 1 element has an encoder element that I have labeled as the student.
It is basically viewing the masked-out video clip, where they start by masking 10% and scale it all the way up to 90%, without having any degradation in quality at the end. It actually improves quality, to a certain extent, to have more of the pixel space masked out, which is interesting.
I think there's probably something to look into or to understand there, probably because it has less noise; it actually does better with less information somehow.
We have something mask it out, and we have that encode what it thinks those latents are.
Then we have a teacher model.
This teacher model is interesting because it is the same model, but its weights are updated as an exponential moving average of the student's.
This kind of smooths it out, makes it move slower, and means the student is never going to have the exact same outputs.
So it's not going to get stuck in a rut, basically.
It also acts as a form of regularization,
almost like, you know, we've talked about some of these KL divergence terms and things like that, where we're trying to use these as a slowing term to smooth out the loss.
It's being used for that. Unfortunately, for some reasons we'll talk about later, they've got to do a lot of other kind of crazy stuff because of the way they're training this model, where it has no labeled examples and there's not necessarily a clear reward. So it does things like the stop-gradients to prevent it from collapsing into putting out these gray blobs, where it's always "predicting the right thing" because every dog and cat and wall is all the same gray blob. We'll talk a little bit about what that looks like, but a bunch of these little pieces feel over-engineered, to the point that they've made follow-up papers on it.
So that happens here. Basically, what they’re doing here with this, we’ve already talked about it, is that it’s calculating the distance between the teacher and the student.
That’s it.
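Here's roughly what that training step looks like in simplified PyTorch. This is my own sketch of the JEPA recipe (mask most of the tokens, predict the masked content in latent space, compare against an EMA teacher with a stop-gradient), not the actual Meta code; the pooled prediction and the specific shapes are stand-ins.

```python
import copy
import torch
import torch.nn as nn

dim, num_tokens = 256, 128
student = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
predictor = nn.Linear(dim, dim)            # stand-in for the real predictor network
teacher = copy.deepcopy(student)           # same architecture, EMA-updated weights
for p in teacher.parameters():
    p.requires_grad_(False)

def jepa_step(video_tokens, mask, ema=0.999):
    """video_tokens: (B, N, dim) patch/tubelet embeddings; mask: (N,) bool, True = hidden."""
    # Student only sees the visible tokens.
    z_student = student(video_tokens[:, ~mask])
    pred = predictor(z_student).mean(dim=1)           # crude pooled guess at the masked content

    # Teacher sees everything, but we never backprop through it (stop-gradient).
    with torch.no_grad():
        target = teacher(video_tokens)[:, mask].mean(dim=1)

    loss = (pred - target).abs().mean()               # L1 distance in latent space

    # EMA update: teacher weights drift slowly toward the student's.
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(ema).add_(ps, alpha=1 - ema)
    return loss

tokens = torch.randn(2, num_tokens, dim)
mask = torch.rand(num_tokens) < 0.9                   # mask out roughly 90% of the video
loss = jepa_step(tokens, mask)
loss.backward()
```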
And so VJEPA 2 is JEPA 1, but just scaled up. That's really all it is. So they've done a lot of data scaling.
They increased the dataset size from 2 million to 22 million. They scaled up the model from ViT-L to ViT-g. This is just a ViT. It's almost like a leveled-up CLIP, I think, would be the best way to think of what JEPA is. So JEPA is a CLIP-style model, SigLIP, all those sorts of things; it's that
sort of thing inside of an architecture,
but it's doing other stuff with what it's encoding.
They also trained this thing for a lot longer. And this is actually a pretty interesting one: for about 90% of the time they ran a normal training schedule, then they did a heavy sort of drop-off and trained it on higher-resolution images at the very end. They've got lots of information on that, but it's very clever. It's very similar to some of the stuff that Wan did as well, where just at the very end they kicked up the resolution on a whole bunch of stuff, just to top it off. So you can get very quick runs in the beginning while it's learning the latent stuff, but then it's also able to project to a higher pixel space.
So yeah, VJEPA 2 is just VJEPA but big.
And so they went on a few different data sets here.
So this is the Something-Something dataset, and that's the actual name.
I think it's basically a hard video classification
dataset, is really what it is.
So that's what it is, Something-Something. Then something with diving, and ImageNet.
So lots of things.
There are a lot of things in here.
I'm going to breeze by some of this just because I'm not a robotics guy at all, so I'm learning a lot of these things for the first time. I'm not going to speak too much to any of that, but I'll at least talk about some of the diffusion stuff and all that. The interesting thing here is that to do this sort of testing, they had to build a task-specific probe in order to even test these things. They have a classifier that they've trained to go inside certain layers of the JEPA model and detect what class of object the latent is, and they're using that to determine, on the other end, whether it's passing or failing on these things. There's no reason to believe this isn't an accurate classifier; it's just kind of crazy that they have to do this sort of thing. Here's roughly what that kind of probe looks like.
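The probe idea, roughly: freeze the encoder, train a small classifier on top of its features, and use that classifier's held-out accuracy as your signal of representation quality. Here's a minimal sketch with a plain linear probe (the paper uses a fancier attentive probe); the encoder here is a placeholder and every name and shape is made up.

```python
import torch
import torch.nn as nn

feature_dim, num_classes = 1024, 174      # e.g. Something-Something-style action classes

# Pretend this is the frozen VJEPA encoder output for a batch of clips.
def frozen_encoder(clips):
    return torch.randn(clips.shape[0], feature_dim)    # placeholder features

probe = nn.Linear(feature_dim, num_classes)            # only the probe gets trained
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)

clips = torch.randn(8, 3, 16, 224, 224)                # (B, C, T, H, W) dummy video batch
labels = torch.randint(0, num_classes, (8,))

with torch.no_grad():                                  # encoder stays frozen
    feats = frozen_encoder(clips)
logits = probe(feats)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
# Probe accuracy on held-out clips is what tells you whether the latents are any good;
# the pretraining loss itself isn't a reliable signal of downstream quality.
```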
It also makes their training procedure very difficult, because one of the things that's problematic about this sort of architecture, with the stop-grad and the fact that they're doing teacher forcing and all that, is that their loss curve doesn't really mean anything to them. In the current setup that they have, just because their loss is going down doesn't mean the model performance is getting better, which is not what you want. That makes training very, very difficult and very annoying.
So that is the situation that they're in.
That's why LeJEPA, which we'll talk about later, exists.
So they found a bunch of hacks to make it be able to train. So this works, asterisk.
All right, that’s VJEPA.
Now we're gonna go into VJEPA 2. This is the one that's actually trying to do the actual robot stuff; it's very similar to JEPA, but there is a little bit of difference. What they do is they freeze that encoder, and that encoder is what's coming out of VJEPA 1. So they're keeping that the same, but now they're adding in a different predictor.
So now it's not trying to predict out the latent on its own; it's trying to use the outputs of VJEPA 1
to predict the actions that it has to take.
So it feeds in the 62,000 data points on robot actions and the available poses, or whatever it is, and then it feeds in: here are the available actions that you have with this robot.
And then it does the exact same thing, where it has the thing that is completely unmasked and has ground truth, and it has a thing that is completely masked, and it does the L1 loss again.
That’s pretty easy.
There are fewer finicky things that we have to do here, because it's the robot doing something.
I'm not having to guess what its magic numbers are;
I can see that it knocks the thing over.
So it's a little bit simpler of an architecture here.
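As a rough sketch of that action-conditioned piece: a frozen encoder produces latents, a new trainable predictor takes the current latent plus the action and guesses the next latent, and the loss is the L1 distance to the encoded next frame. The shapes, the 7-dimensional action, and all the names are just assumptions for illustration.

```python
import torch
import torch.nn as nn

latent_dim, action_dim = 1024, 7          # e.g. a 7-DoF arm action, purely as an example

frozen_encoder = nn.Linear(512, latent_dim)            # stand-in for the frozen VJEPA encoder
for p in frozen_encoder.parameters():
    p.requires_grad_(False)

action_predictor = nn.Sequential(                      # the new, trainable piece
    nn.Linear(latent_dim + action_dim, 1024), nn.ReLU(),
    nn.Linear(1024, latent_dim),
)

obs_t = torch.randn(8, 512)                            # current observation features
obs_next = torch.randn(8, 512)                         # next observation features
action = torch.randn(8, action_dim)                    # action the robot took

with torch.no_grad():
    z_t = frozen_encoder(obs_t)
    z_next = frozen_encoder(obs_next)                  # ground-truth next latent

z_pred = action_predictor(torch.cat([z_t, action], dim=-1))
loss = (z_pred - z_next).abs().mean()                  # L1 loss in latent space, like before
loss.backward()
```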
Is that all that makes sense?
I’ll track.
Yeah, I'm guessing if you didn't have unlimited compute, you may not have... you know, how do you know when to stop?
How do you ever?
You run out of money. Not a problem for Mark Zuckerberg.
Yeah, we're done, let's write a paper. All right, so yeah, some other interesting things here. They do something that I have not necessarily seen before, which doesn't mean anything, I just haven't seen it: they use these tubelets, where for their spatiotemporal patch embedding, basically for their stride, instead of going around in pixels, they're doing these tubelets that go across multiple frames, which I'm sure the other models did too, but I thought that was interesting. So they have a 3D convolutional patch embedding. So kind of think about how that works when you have these videos; here's a rough sketch of the shape of it.
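This is a generic ViT-style tubelet embedding, not necessarily the exact VJEPA 2 configuration; the 2-frame by 16x16 tubelet and the 1024-dim output are just common choices.

```python
import torch
import torch.nn as nn

# A "tubelet" patch embedding: instead of slicing each frame into 2D patches,
# a 3D convolution slices the video into little space-time bricks.
embed = nn.Conv3d(
    in_channels=3, out_channels=1024,
    kernel_size=(2, 16, 16),   # 2 frames deep, 16x16 pixels
    stride=(2, 16, 16),        # non-overlapping tubelets
)

video = torch.randn(1, 3, 16, 224, 224)    # (B, C, T, H, W)
tokens = embed(video)                      # (1, 1024, 8, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2) # (1, 8*14*14, 1024) -> one token per tubelet
print(tokens.shape)                        # torch.Size([1, 1568, 1024])
```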
They also talk about 3D RoPE. Are you guys familiar with rotary positional embeddings? They're a kind of positional embedding,
and that's where we kind of go to 2D and 3D. The thing with a transformer is that
it doesn't actually work on sequences.
It's a set-based model.
So if you don't feed positions into a transformer, it's just going to treat everything as being in whatever position it comes in.
It's just not going to care about order.
So you have to give in specific positional embeddings to these things. And of course, with a video, I don’t have, you know, one-dimensional with the normal transformer models.
I don’t have two-dimensional, which is like your image models.
I have three-dimensional positional embeddings.
And so what they do is some of the trig stuff: they apply a rotation to the embeddings based on the position,
and they use that to manage this in 3D space, since they're using a transformer with the ViT. That was an interesting thing that adds a little bit of complexity. And one of the interesting things with 3D RoPE, at least as they do it here, is that you actually have to train how the rotation is done. So they learn how to rotate
in three dimensions. You can do it in two dimensions without another model, but in three dimensions you have a second component learn the rotation for your space, because you're not going to know how it's going to need to rotate.
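For reference, here's the standard, non-learned way 3D RoPE is usually set up: split the embedding channels into three groups and rotate each group by an angle driven by the token's time, height, or width index. The exact VJEPA 2 recipe may add learned pieces on top of this; this sketch is just to show the shape of the idea, and the sizes are made up.

```python
import torch

def rope_1d(x, positions, base=10000.0):
    """Standard rotary embedding applied along one axis.
    x: (..., N, D) with D even; positions: (N,) integer positions."""
    D = x.shape[-1]
    freqs = 1.0 / (base ** (torch.arange(0, D, 2).float() / D))      # (D/2,)
    angles = positions.float()[:, None] * freqs[None, :]             # (N, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)

def rope_3d(x, t, h, w):
    """Split channels into three groups and rotate each by its own axis position."""
    D = x.shape[-1]
    d = D // 3 // 2 * 2                      # make each chunk even-sized
    parts = [x[..., :d], x[..., d:2 * d], x[..., 2 * d:3 * d]]
    out = [rope_1d(parts[0], t), rope_1d(parts[1], h), rope_1d(parts[2], w)]
    rest = x[..., 3 * d:]                    # any leftover channels stay unrotated
    return torch.cat(out + [rest], dim=-1)

tokens = torch.randn(2, 1568, 1026)          # (B, N, D), D divisible by 6 here
# time/height/width index for every token (an 8 x 14 x 14 tubelet grid, as an example)
grid = torch.stack(torch.meshgrid(
    torch.arange(8), torch.arange(14), torch.arange(14), indexing="ij"), dim=-1).reshape(-1, 3)
rotated = rope_3d(tokens, grid[:, 0], grid[:, 1], grid[:, 2])
```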
So this, you know, it’s not always a bad thing.
It does sound more complex, but on the other hand, you don’t have a human trying to engineer it either. So it might come out in the wash. The other element that’s new with VJEPA 2 is obviously that they’re doing some additional stuff with the robot arms, so they’re having to have it plan out multiple actions into the future.
And so they do this in two ways, one of which is very normal.
That’s just this teacher forcing loss. This is how this is done with language models.
This is exactly the same as that.
But they also do this rollout loss, where it has the model try and predict actions several states ahead, trying to match up the latents for that, and then kind of teaching it all at once based off of several rollouts into the future. This is very interesting. I’ve not seen this one before; maybe it’s something normal in robotics. But the teacher forcing loss is just the normal sort of thing.
And so yeah, the idea here is that at each time step the predictor receives the real inputs and predicts the next state, and then it has something with information about what the real inputs are that it’s learning against.
And then that gets fed back in sort of that L1 aspect.
And then the big thing with the rollout is that it’s kind of gathering these inputs up and feeding it forward.
And then feeding it back at the very end. I think they said they did a two-step rollout and then a four-step rollout for their training. That’s all I’m going to say about this because I don’t know what I’m talking about for this aspect of it. Go read the paper if you are so inclined.
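Roughly, the two objectives look like this; the function and variable names are mine, and the paper’s exact loss details differ, but it shows the teacher-forcing versus rollout distinction:

```python
import torch.nn.functional as F

def teacher_forcing_loss(predictor, encoder, frames, actions):
    """At every step, feed the *real* encoded observation and predict the next one.
    frames: list of video clips, actions: list of action tensors; encoder is frozen in practice."""
    loss = 0.0
    for t in range(len(frames) - 1):
        z_t    = encoder(frames[t])
        z_next = encoder(frames[t + 1])
        loss  += F.l1_loss(predictor(z_t, actions[t]), z_next)
    return loss / (len(frames) - 1)

def rollout_loss(predictor, encoder, frames, actions, horizon=2):
    """Feed the model's *own* predictions back in for `horizon` steps (e.g. 2, then 4),
    then compare the final predicted latent against the real one."""
    z = encoder(frames[0])
    for t in range(horizon):
        z = predictor(z, actions[t])          # autoregressive rollout in latent space
    return F.l1_loss(z, encoder(frames[horizon]))
```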
The next thing is the energy model. These are interesting.
This is the first time I’ve really looked into energy-based models.
These are something that’s coming up a lot. There are a lot of things about energy-based transformers, where they work in a different way than the probabilistic models and have a different way of performing regularization.
And the general idea is that instead of trying to do a probability, where everything has to add up to one under some sort of probability function, it instead works by manipulating the data planes of different parts of the vector space, the subspace where the model works.
And it pushes down on things that are closer to ground truth.
And it pushes up on things that are further away from the ground truth. And so it’s trying to find a plane in which gradient descent can do its magic thing.
And it’s not caring about adding up to one.
And so this gives an interesting way where you can see fewer examples and potentially still get good outputs.
Because if I have good examples, I don’t have to see a whole bunch of stuff in order to properly get that probability mishmash, if that makes sense. And you might get some information that carries over as the rest of your model training goes well. This is basically how I understand what this is.
It also has the benefit of having cooked-in regularization, because if I have two things that are pushing the same amount into the same part of the graph, and it’s already that far down, it’s not going to do much, because it’s a displacement effect on the data plane.
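Here’s a toy sketch of that push-down, push-up idea using a simple margin-based energy loss. This is a generic energy-based formulation with made-up dimensions, not the specific objective in the paper:

```python
import torch
import torch.nn as nn

# assumed dims: state is (B, 96), action is (B, 32), so the concatenation is 128-wide
energy = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 1))

def energy_contrastive_loss(state, good_action, bad_action, margin=1.0):
    """Push the energy of (state, ground-truth action) down and the energy of
    (state, off-distribution action) up. Nothing here is normalized to sum to one."""
    e_good = energy(torch.cat([state, good_action], dim=-1))
    e_bad  = energy(torch.cat([state, bad_action],  dim=-1))
    return torch.relu(margin + e_good - e_bad).mean()   # hinge: good should sit below bad
```

At planning time you would then search, or gradient-descend, over candidate actions to find the lowest-energy one, which is roughly where the P(action) planning piece comes in.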
This is very interesting, very heady stuff.
It took me about two days and two hours of watching videos and just reading. I’d take a break. I was like, okay, I have a vague idea of what the heck they’re talking about here.
And now I have a vague plus one idea of what they’re talking about here. It seems interesting.
I think this is kind of in the area where they’re exploring new architectures, like genetic algorithms and evolutionary algorithms, different ways to kind of get past bottlenecks. It seems like a lot of people are very excited about this. Maybe one day I’ll know more, but this is what I know for now. They’re using it here, and it seems to be very important to Yann LeCun. So he got to put the math somewhere. Yeah. Yeah, it is. These are all really cool hints and things. Thank you, Josh. Yeah, are you sharing this presentation? Oh yeah, yeah, I can share. Awesome, I’d appreciate it. Thank you. Yeah, so the energy-based model, you can see here where they’re kind of using that to form the planning aspect, which is this P(action) here. They’ve got the little bumps, and it’s doing stuff. And those stuffs are not adding up to one. All right, so yeah, that’s really VJEPA 2. There’s lots of stuff as far as where they’re wanting to go.
And obviously there’s a lot of issues here with this model still.
It only takes image inputs, so you cannot condition it on text.
That’s what they’re doing next. But they’ve got to do it, and it has to work. So hopefully that does. It seems to have lots of issues with changing camera angles and stuff like that. So it’s anybody’s guess as to whether this is going to be the thing that wins out. I think that there’s probably things that will be taken from this.
I see people using this sort of thing and experimenting with it.
But, you know, is Yann LeCun the one who’s going to crack the thing that JEPA wanted to be?
Who knows? But it’s very good that we’re not stuck with Cosmos, that we’re not just going to do the thing that costs the most GPUs and brute force it, because what they’re trying to do here is something that’s smarter, that is going to be more efficient, that’s going to be able to scale and have more robots for more people, instead of just robots for like five people. So I think it is important that we do look at this sort of stuff, even if this might not be the exact one that gets there.
All that being said, looking at sort of what they’re thinking for the hierarchical aspect.
So right now they’re just doing pick and place, moving an arm, that sort of stuff. How do we roll that into higher-level actions? The robot decides it has to open the door and go into the kitchen, then open the dishwasher, pick the cup up, and put it in the dishwasher. It’s that sort of chain of things. So think about how those things feed in, where we have the encoding feeding from one into the other. You’re wanting to have that across tasks.
How do you appropriately handle that memory but also allow it to decay appropriately? It’s a problem we hit in agent space right now.
you know, with LLMs all the time.
That’s one of the things that we’re thinking about a lot. How does that translate to working in the world? That’s hard.
Here’s one of the examples they give: you know, they’re in Paris, they need to get to the JFK airport, hail a taxi, but then at some point the robot just has to move its arm.
Yeah.
All right, we’re gonna talk real fast. Oh, good, I think we’re good on time. And talk about LeJEPA.
This is one of the last papers, I think, that’s going to come out with Yann at Meta.
I think this is a very interesting one.
This is an interesting paper. It’s called LeJEPA, but it really should be called SIGReg.
And we’ll talk about what that is, but I think it gives an idea of kind of the alienness of this world that they’re in, and some of the problems that happen whenever you’re going into a different architecture. This paper entirely has to do with a problem that is unique to them, and they’ve given it this very political title, self-supervised learning without heuristics, and I think self-supervised learning without crappy hacks would probably be the better way of saying how they feel about this pain. Heuristics, sure. And so this is how they got past one of the issues that they had before, and it’s with this concept of sketched isotropic Gaussian regularization. That’s SIGReg, not sigmoid-reg, since it’s sketched isotropic.
This is very interesting actually.
At first it was very dry, but if you know a lot about diffusion transformers and how they do noising, and you care about the Gaussian noise sort of algorithms and that space, this is a very interesting sort of paper.
And the problem it’s trying to solve is that whole stop-gradient thing that they had in the other paper; it’s trying to get rid of that. Because the problem with the training objective of this model is that it has one model that has absolute knowledge of what is in the frame, and it has a model that doesn’t. And the goal is to get them to match up as perfectly as possible. And what’s the easiest way to get everything to match up? Everything’s a gray blob. That’s going to train well: yeah, this thing thought it was a gray blob, and this thing said it was a gray blob. And so in order to get past that, they have to do some bull crap to make that sort of thing not happen.
And they’re able to see this. They’re able to do these sort of things where they do reconstruct the visual sort of image of these things. And that’s what you’re seeing here.
So they can see that it doesn’t know what the thing is, but it’s scoring out well.
And here they’re kind of slowly removing that and replacing it with this LeJEPA thing.
And the concept of what this is: instead of trying to reconstruct it, which is very expensive and takes engineer time, they’ve actually identified the optimal distance, the gap, between data points in their vector space.
And so they’ve determined what’s the proper distance away that all of my data points should be that implies that I’ve actually learned the world model.
And so it’s looking for clusters of things, these sort of jagged distributions, that indicate the thing has bucketed a whole bunch of stuff together.
If there’s nothing that’s different about it, it shouldn’t be learning it.
And if it’s bucketed a whole bunch of stuff together, it’s kind of being clever. And so it’s adding this in as a loss parameter to say, okay, this thing is learning unique and independent things about the world and properly distancing them in a way that forms this isotropic Gaussian, which is a perfectly round distribution of data.
And are they ever gonna get there?
No, but if they treat that as a loss to go after, you have something that you’re training toward that shows up directly inside of the model that they’re training. And so they found this is a way to get their loss curve back, and they’re doing it in a very interesting way. The other way you could do this is, okay, now I know that the isotropic Gaussian is the perfect thing for my model training, so if I go by that method, I have to do some sorting and very expensive things in order to calculate that all of my data points are correct. That wasn’t going to work. So instead they do this additional thing where they slice one-dimensional slices through their noise landscape and detect whether that slice is a bell curve, because if you’re looking at the noise, a perfectly round circle is going to look like a bell curve. If it’s not, then you have something to learn.
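As a rough sketch of that sliced idea, here is a moment-matching version: project the embeddings onto random 1D directions and penalize slices that don’t look like a standard bell curve. The actual SIGReg uses a proper statistical test on each slice; this simplification and its names are my own:

```python
import torch

def sketched_isotropic_gaussian_reg(z, n_slices=64):
    """z: (B, D) batch of embeddings. Project onto random 1D directions and penalize
    slices whose low-order moments drift from a standard normal's
    (mean 0, var 1, skew 0, excess kurtosis 0). A collapsed gray-blob batch fails immediately."""
    directions = torch.randn(z.shape[1], n_slices, device=z.device)
    directions = directions / directions.norm(dim=0, keepdim=True)   # unit-norm slice directions
    proj = z @ directions                                            # (B, n_slices) 1D slices
    mean = proj.mean(dim=0)
    std  = proj.std(dim=0) + 1e-6
    centered = (proj - mean) / std
    skew = centered.pow(3).mean(dim=0)                               # should be ~0 for a bell curve
    kurt = centered.pow(4).mean(dim=0) - 3.0                         # excess kurtosis ~0 for a bell curve
    return (mean.pow(2) + (std - 1.0).pow(2) + skew.pow(2) + kurt.pow(2)).mean()
```

You would add this term to the usual prediction loss, so the embeddings get pulled toward that perfectly round distribution while still having to be predictive.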
Very clever, very fast. Yeah, that’s what LeJEPA is, essentially. So that’s the sort of thing that they have to deal with in their specific area. The other thing I found interesting, this is a paper that was out today and popped up in front of me. I always go through the Hugging Face papers; they kind of have their selection of things that they like, and I look at the beginning of every day and find the things that look interesting to me.
And one of the things that popped out was this hierarchical video generation for long-horizon robotic manipulation. Oh, that’s interesting. I pretty much read any video generation paper. But I noticed in here they’re doing some robot stuff, and they’re actually combining a whole bunch of different things.
They have the SAM model, Segment Anything.
They’re using a VLM, and they’re doing some stuff with DiT. This is an everything-and-the-kitchen-sink paper. They’ve got all the buzzwords.
They’re good to go with whatever they do after their postdoc. But the interesting thing in here is that they’re using VJEPA as their encoder for the motor phase. And, you know, this is kind of where you see people say, I’m using ViT, I’m using CLIP.
This is where VJEPA will show up, and people are playing with it. You know, when I was looking around to see if other people are talking about this, you look at the paper groups for the robotics universities, these people host these things, post on YouTube, and people are talking about it. So I think we will see VJEPA around, because it serves as this little alternate way of, I can explore whether this is going to make my use case better. And I didn’t do too much on the actual performance metrics because I don’t understand them. I’m not a robotics guy, so I’m not going to try and talk to them, but it seems like they’re pretty good from what I’m hearing people say. That’s all I can say about it.
Let’s see.
I think they have them here.
Yeah. Yeah. So they have, what is it?
You know, those numbers are good.
100 is the same as 100.
That’s good.
It’s not smaller.
I don’t know what Octo is.
Yeah.
Obviously, this is their marketing copy, but it seems pretty good. Yeah, that is JEPA. Hopefully that was at least somewhat interesting. Let me see if I can paraphrase, or at least say back in different words, what you were saying earlier on the new learning model with the new loss stuff. The sketched one.
It seems like instead of, so as you’re learning things, it seems like it’s looking at internal weights of, or an internal kind of a representation and then seeing how that changes rather than actually having to go all the way through the model itself to get like an output result.
Right, close.
Yeah, it’s looking at the weights and looking at that latent representation.
And all it’s looking at is what is the distance between my weights?
So, in the latent space, you know, how is it distributing those?
So it’s a very weird thing to do.
But if you’re doing a latent reasoning model, it makes sense.
I’m wondering if you could apply it to other types of training to make it cheaper, or find places you may be overfitting and do something different. I think, especially for diffusion models, anything related to diffusion... I am dead set on my assumption that latent reasoning is going to come to LLMs. We are not going to do this chain of thought thing forever.
If this works here, this is how we get world models for robots, it’s going to be how we get world models for LLMs.
Because in the same way latent space is more efficient than pixel space, having Gemini have a mental breakdown in its chain of thought is not an efficient use of time.
It should be staying in latent space.
Yeah, it’ll be interesting.
I do like how things come full circle, or not full circle, but there’s a tendency to think that, hey, I’ve got this new model doing this thing and we’ve arrived, and now we’re going to be doing this.
Yeah. You know, you’ve been doing it long enough.
You realize, well, this might stick for a month or two before somebody else blows it out of the water. You know, one of the initial diagrams you showed, with the agent, the world, the action: when I started learning AI, it was in the realm of intelligent agents, and that is what was being taught in school as the latest and greatest, we have arrived. It’s just funny. There have been a few, and I think a lot of people are kind of in this area, where somewhere between these world models and the state space models, which I don’t think we’ve talked about, is probably what’s next. I think Ilya Sutskever did an interview just a week or two ago where he’s basically like, yeah, no, transformers are a dead end.
That’s Ilya Sutskever. So you’ve got him, you’ve got Geoffrey Hinton, you’ve got all these people saying, like, this isn’t going to do it, or it’s going to do it, or it’s not gonna do it but it’s still gonna ruin us. You know, yeah, they’re useful tools, but to do that thing, this ain’t it. Yeah, a lot of these people are driving toward trying to hit AGI though. My intent normally is not that exactly, you know, but if they come up with useful approaches and useful things, we can use them on the way. Sure. You know, thank you for that. That’s really cool stuff. If we look at the robots that we have out there now, are some of them based on the LLM ideas and some based on world model ideas? So some are gonna be a lot better than others already? Yeah, it’s hard for me to say. I do know that there are robotics teams that are using things similar to VJEPA. I know that there are some that use Cosmos for things. I know there are some that use a guy in a VR headset in the other room. That’s about all I know.
I was just thinking of all this stuff coming out of China and so forth, wondering, do they actually have a real world model running there? Or is it just madness and they’re just letting it go wild? Especially Hunyuan, and, you know, Alibaba, they’ve got some really good world model sort of stuff out there.
Uh, it’s open source too.
I mean, some of the best papers when I was learning about this were these Chinese papers. Yeah, Hunyuan, especially Tencent, they’re absolutely doing this sort of thing.
The thing that gets me excited is I would love to be able to do this with robots.
You know, I’ve always been more of a physical person; I want to build a thing that I can hold that does something, you know. Making videos is great, writing code, yay, that’s kind of cool stuff. But right now you see the robots learning from each other and things like that, and you just can’t see that in your computer models as easily.
You have to think hard. Okay, so I did say we’d come back to Andrew’s comment. I think it’s easier to talk about this one now that we’re at the other end of it. So let me see if this specific answer makes more sense. We were asking how the world model would be different from the limitations around true novel generation that LLMs fail with.
I think the big thing is that it is generally trained on some sort of unsupervised way where it’s just learning basic aspects from the data.
Those aspects can translate to unseen events.
It also is the case that the world model is not necessarily generating either.
It is just a form of input that might go on to another generative model.
And so that’s one aspect.
The other aspect, a fact that might be relevant to this, is that with the Dreamer studies and the SIMA studies, where they had those agents acting in the world based on a world model, they did a lot of ablations; I didn’t include all of that here. They did a lot of testing of performance on games, where one agent had been hyper-trained on a single game, one agent had been trained equally across all games, and one agent had been trained on all games except the game being tested. And they found that the best performance was from the agent trained on all games, obviously; it had seen everything, so that was very good. But the agent trained on just that one game and that game alone performed worse than the agent trained on all games but that game. And thinking through that shows that by learning the world model, it is able to effectively extrapolate out of distribution better than training on that distribution.
So that’s kind of my response to that: I think, yeah, this is a pretty good, I won’t say a solve, because you’ve got other things that you’re working with, but it’s at least pushing in the direction of it not just being a parrot, you know? It has its own thing that is learned, that’s true to itself, however that might have been randomly initialized.
I’m almost thinking, like, if you trained on a lot of videos, things like that, and it had seen somewhere, you know, you toss a ball up in the air, you hit it with a baseball bat, and the ball goes flying. I could imagine, okay, now let me toss a tennis ball up in the air and hit it, and it’s got to fly, right, even though it’s never seen that. Versus, I wonder if it would be good enough that if I toss an egg up in the air and hit it with a baseball bat, the egg is gonna break, you know. Or if I hit a tree with a baseball bat, I’m gonna get hurt. Because it’s storing it in a high-dimensional space and it’s staying there; that high-dimensional space isn’t getting tainted by the output generation objective of I’ve got to make these pixels right. I think it’d also be really interesting to see the ability for transfer learning at that point. Oh, yeah. That’s what I was going to ask: how does one model teach another model?
Like the robots are doing now? Or is it really not two different models, it’s all one model in all of them? Yeah, I think that generally it’s probably a base model, is my guess; generally these are going to be base models, the world model part. It’d be a base model that is then further trained. You could still do distillation too, and there are lots of different ways to do that. People say this is a distilled model, and that’s kind of crazy because there are like a billion different ways you can distill stuff. You can distill knowledge from a higher parameter model into a smaller one.
You can distill steps to make things go faster, if it’s like an ODE-style sort of thing.
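For the knowledge-distillation flavor, here’s the usual minimal setup, assuming both teacher and student output classification logits; step or ODE distillation for diffusion-style models looks quite different:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend the normal hard-label loss with a soft loss that pulls the student's
    (temperature-smoothed) distribution toward the teacher's."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale to keep gradient magnitude comparable
    return alpha * hard + (1 - alpha) * soft
```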
So there’s lots of ways that they can train.
It could be that if they’re learning from experience, they could be doing it the old-fashioned way: one robot talking to another robot, for God’s sake. And that would be valid somehow here. I don’t think they’ll do that because that doesn’t seem like a good use of GPU time, but hey, it’s possible. This is just cool stuff. One thought I had as we wrap up: I know we’re gonna get together next week and do basically a year in review. What I’m thinking is we could come up with a list of maybe 10 papers or something to read over the holiday, you know what I mean?
Something for your reading list, curated by Josh. You know, if that’s something you’d be willing to throw out there, it could be pretty cool, because I will have some downtime. And here are the 10 papers I wanted to present but didn’t. Yeah, I mean, that would work. And similar to you, some of these I hit once and then I have to put it down and walk away, and then I’ll, you know, think for a minute and go look at it again. And sometimes, I mean, there are a couple of papers where there are sections I still don’t get. But, you know, it’s like, okay, I’m going to move on. But yeah, I mean, this is really cool stuff.
I really appreciate you putting this together. Oh, yeah, this is a fun one. Learn lots of new stuff. All right. Any final thoughts from folks? Wise.
No, I appreciate that as always. Thank you very much, Josh. That’s great. Now I have 12 different more views of what a world model is. That’s great.

