Mixture of Experts with GPT4

Mixture of Experts: Harnessing the Hidden Architecture of GPT4

Recording of Huntsville AI Meetup

Transcription provided by Huntsville AI Transcribe

Yeah, sure. That’s fine.

Yep. No worries. Yeah, actually I’m going to copy before I post new things. That’s like, you know, okay. All right. Thank you. Thank you for your recording. So let’s go back over here. All right. So while we’re around, I kind of got a gauge on the folks in the room.

So it seems like we’re pretty wide as far as our experience level.

So we’ll be going through kind of from the ground up.

What mixtures of experts are, as well as some basic primers on the core concepts around neural networks and feed-forward layers, which are the central things that are important for the mixture of experts architecture.

And so yeah, this is going to kind of be all over the place. So feel free to raise your hand, ask questions as we go through. And the idea here is basically you don’t have to understand every single thing inside of this talk. But this is mostly a here’s all the different things for you to go look at. This is a giant emergent area right now.

This is a newish architecture that’s also extremely old, but it’s now coming into vogue again and everybody’s actually trying it out, and it’s changing in new ways, and it’s changing almost every week at this point.

So this is called “Mixture of Experts: Harnessing the Hidden Architecture of GPT-4.” The reason behind that title is that GPT-4 is supposed to be using this architecture, which has kind of led to the flurry of activity around this thing since that got released in March. And a lot has happened since then with this architectural pattern.

So can we go to the next slide?


All right, so what we’re going to hit is what is mixture of experts first, then why do I care?

How do they work?

How don’t they work? And how can I learn more? Then go to the next one. I mostly hit that.

So the first thing to understand with the mixture of experts is that it’s very deeply tied to the scaling laws of neural networks.

Scaling laws, which basically means: how many of you guys are familiar with the bitter lesson? It’s a short essay, or a dirge, basically saying that we spent all this time working on machine learning algorithms and fine-tuning things and doing clever things, and all of that doesn’t matter.

You just need to make the model bigger, get more compute, and basically increase your budget. That’s it. All the things that you care about, that people spend years and years of their lives on, the bitter lesson is that it doesn’t matter. You just throw more compute at it. And this is kind of us trying to be clever and working around it to sneak a few cycles ahead. That’s really what mixture of experts is working at. So you can see here, and we’ll talk about this a bunch along the way, we care about three things: compute, which is correlated with FLOPs and things of that nature; dataset size, which has a lot going on right now; and parameters. Whenever we talk about parameters, it’s when we say something like this is Llama 7B, or GPT is a 250 billion parameter model.

That sort of thing is what we talk about as far as model size, which is a big deal in mixture of experts.

So we can go to the next slide here.

The next major concept that’s tied to mixture of experts is the concept of a feed-forward neural network.

This is basically the bread and butter of what makes a lot of these neural network architectures work. It is basically a layer of artificial neurons in which things come into it and things go out of it in a different, transformed state.

And then there’s billions and billions and trillions of dollars changing what is going in and what is going out essentially. That’s the AI boom that’s currently happening right now. And so something like a transformer, we talk about the transformer architecture.

That’s basically a special series of neural network layers.

So this is the feed-forward neural network, of which the mixture of experts is really just a very specific kind of feed-forward layer. We go to the next one. And so how do feed-forward networks work?

If you guys haven’t seen this before and you’re interested in transformers and large language models, this will become a very, very familiar graphic right here.

This is from the transformer paper, “Attention Is All You Need,” and it’s defining the core transformer network. So you have inputs going in.

It goes into what is called attention, which is just a kind of layer. And then they do some normalization, which basically evens things out among all of your values. It goes into some sort of feed-forward layer. It evens it out and does it again.

It does it again. Does it again. As many times as you have layers. And so this is what it is.

So it receives that data in, each hidden layer performs computation and passes its outputs to the next layer.

Hidden layer is essentially just some layer beyond two.

In a neural network is a very easy way of saying it is something where it’s a little bit deep.

And it’s not directly on the output.

The output layer produces final result.

Within every layer, each neuron is connected to every neuron in the next layer. So they’re all it’s very synchronous for the layers and they have numeric weights that are tuned as the network is trained. So things inside of the layers are what gets trained. We’re talking about training models and every neuron receives the inputs and basically multiplies them forward shifts the values around.

And that’s basically how you get your final outputs and tensors. So that’s what a feed-forward network is.
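
As a minimal sketch of the "stuff goes in, stuff comes out" idea, here is a tiny fully connected network in plain Python. The layer sizes and weight values are made up for illustration; in a real network they would be learned during training.

```python
# Toy fully connected (feed-forward) network: every input feeds every neuron.
# Weights and sizes here are invented for illustration only.

def relu(x):
    return max(0.0, x)

def dense_layer(inputs, weights, biases, activation=relu):
    """Each output neuron is a weighted sum of all inputs plus a bias."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(total))
    return outputs

def feed_forward(inputs, layers):
    """Pass the data through each layer in order."""
    for weights, biases in layers:
        inputs = dense_layer(inputs, weights, biases)
    return inputs

hidden = ([[0.5, -1.0], [1.0, 1.0], [-0.5, 0.25]], [0.0, 0.1, 0.0])  # 2 -> 3
output = ([[1.0, 1.0, 1.0]], [0.0])                                   # 3 -> 1
result = feed_forward([2.0, 1.0], [hidden, output])
print(result)
```

Training is the process of nudging the numbers in `hidden` and `output` until `result` is what you want.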

Any questions at this point?

Essentially stuff goes in stuff comes out robots. Because it’s a yeah, this Alice yeah, this out as magic. All right. So when you say a mathematical function and you’ve got words, what does it do to the word to make a mathematical function apply to it?


So basically, there is an encoder and a decoder on either side.

And it depends on what sort of model you’re using.

But basically there is a text embedding where it takes whatever your input words are.

If you’re doing an input query, it turns those into vectors of numbers.

What those vectors look like will depend on what sort of encoding strategy you’re using.

Then it has a decoder on the other end that will pull it out.

And basically depending on your model, there’s a whole bunch of different ways of how it comes out with that final output.
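
A rough sketch of what "turning words into numbers" can look like, assuming a simple lookup-table embedding. The vocabulary and vector values below are hypothetical; real models learn the table and use much longer vectors.

```python
# Toy embedding table: each token id maps to one fixed vector.
# The vocabulary and vector values are hypothetical.
vocab = {"the": 0, "dog": 1, "sidewalk": 2}
embedding_table = [
    [0.1, 0.3],   # "the"
    [0.9, -0.2],  # "dog"
    [-0.4, 0.7],  # "sidewalk"
]

def embed(words):
    """Look up the vector for each word; same word, same vector."""
    return [embedding_table[vocab[w]] for w in words]

first = embed(["the", "dog"])
second = embed(["dog", "sidewalk"])
print(first[1] == second[0])
```

With a lookup table like this, a given token always enters the network as the same vector. It is the later layers, attention in particular, that mix in the surrounding context.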

But the same the word layer going in is always represented by the same set of inputs to the system.

If you have if you have the same layer, if you have an atmospheric layer, a skin layer, a clothing layer, that word always has the same representation going in.

So the layers are not concepts within? Well, I’m using layer as a word. Okay. Okay.

Just as I should.

Oh, right.

So it depends.

So that’s why I say it really does matter on what’s called an embedding, which is a very special set of vectors.

Sometimes you can have stock embeddings.

Sometimes you can train your own embeddings, and that’s a lot. Basically, say I’m going to embed a PDF. Whatever my strategy is for doing that, it’s going to detect the possible relationships between these things, the semantic meaning, based on my batches. So there’s not a hard answer. It really depends on what you’re doing. Okay. All right. Thanks. Is there a chance to actually use like dagger or Kubernetes to have a neural thing on one layer, a deep learning thing on another layer?

Okay. So like a distributed sort of, yeah. Yeah. Is this from that scene in architecture or is this like another layer up from that?

It’s all custom software layer.

Are you talking about mixture of experts or the neural network? So there are some people doing distributed computing with mixture of experts, including somebody who’s doing something really interesting, which is basically leveraging the new WebGPU, whatever that thing is. They’re basically hosting each expert on one Chrome laptop sort of thing and distributing it out that way. I think to me it’s a novelty thing, because you’re really adding a lot of network traffic, and everything that’s going on right now is basically how can you cut down the distance to your routing function?

And so you can do that if you’re constrained and I’m sure there is some sort of scale in which it makes sense. And if you have enough idle compute, but you know, I think plan nine is working around somewhere with this.

I guess for some, yeah, yeah.

There are use cases, but they’re really, you gotta really, that’s a complex. Yeah, you gotta really need it. But yeah, there are there are some people doing that sort of thing. So it’s worth looking for.

All right, why don’t we move forward?

So next we’ll go into what mixture of experts is specifically.

It is a neural architecture that consists of an ensemble of specialized sub models called experts that work together to optimize the overall performance.

And basically what it is, is that this is a recursive, it’s also a recursive layer.

And it is a layer that contains a mixture of a whole bunch of different layers, which are basically smaller models that have their own training set that gets switched between depending on whatever your input vectors are. And so you see here, this is just a normal linear layer.

So this would be like where you’re doing a ReLU, any of those normal sorts of things.

A feed-forward.

And then you have basically each one of these tiny models. So for a seven billion model, we’ll see these sometimes be, you know, maybe 225 megabytes each, instead of another seven billion.

And so what you’re really doing is sharding your model size among a large number of different experts, and then activating them on demand. So it gives you a larger overall model weight while at runtime and during training, you’re only activating certain subsets of it, allowing you to kind of do things like put it. And so that’s that’s the idea behind this sort of thing.
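
To make the "larger overall model, smaller active subset" idea concrete, here is a back-of-the-envelope sketch. All of the numbers are hypothetical and chosen only to show the arithmetic; they are not figures for any real model.

```python
# Rough parameter accounting for a sparse MoE model.
# All numbers are hypothetical, chosen only to show the arithmetic.

def moe_param_counts(shared_params, params_per_expert, num_experts, experts_per_token):
    """Return (total parameters stored, parameters used per token)."""
    total = shared_params + num_experts * params_per_expert
    active = shared_params + experts_per_token * params_per_expert
    return total, active

total, active = moe_param_counts(
    shared_params=1_000_000_000,      # attention, embeddings, router, etc.
    params_per_expert=6_000_000_000,  # one expert's feed-forward weights
    num_experts=8,
    experts_per_token=2,
)
print(f"total: {total / 1e9:.0f}B, active per token: {active / 1e9:.0f}B")
```

The stored model grows with the number of experts, while the compute spent on each token only grows with the number of experts that token is routed to.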

If we go to the next slide, I think we’ll talk about the gating network. The very critical thing about this is that the gating network is where, for the most part, all of the differences in mixture of experts exist at an architecture level.

It’s also called a routing function.

And it’s the major difference between Switch Transformers, sparse models, mixture of LoRAs, all those sorts of things.

And this is the way it is another layer, essentially, it is a trainable thing.

So it learns along with the rest of your model, how to choose which experts from your batch of experts, which one is going to be the most appropriate to activate based off the input query.

And so we’ll see in a few of your further slides that you might have, for instance, say expert one is trained on accounting, expert two is trained on HR, expert three is trained on, you know, operations, logistics, that sort of thing. Another very interesting thing is that you’ll have like expert for the trained on safety, trained on alignment trained on, you know, these different security things that we might care about. And so you have those sorts of things that exist. We had a pop up from somebody that says, zoom chat open, I can’t open long show full screen. We had a comment from Matthew just a second ago asking what language mixture of experts are being driven by. So it’s, I mean, Python.

So Python is where a lot of the implementation of this is, but there’s always going to be implementations in C and things related to that as well.

So, yeah, I’m assuming that’s what the question is.

Yeah. Any other questions on this slide, we’ll hit all of these sub components for gone.

So is it truly burrowing there’s no image.

Oh yeah, oh this image. Yeah, it was completely multimodal. There’s actually one of the architectures we’ll go through that does not work on decoder-only models like GPT; it only works on the other kind.

I was thinking with mixture of experts you’d have the experts trained separately on all of their tasks, and then combine them together. But is there one that’s like that? Okay, yes, that’s actually the one that I work on. So mixture of LoRAs, mixture of adapters. But with this one in particular, the general approach is to train the model as is, all together.

Yeah, right.

And that’s how you get the gating network trained in the same kind of way. I can imagine if you have your router trained differently than your experts, you’d route the wrong thing to the wrong expert. You’re going to be asking your HR person how to do software design. Yeah, the ones that are out there in production use right now are what’s called the sparse model, which is all trained together.

All right. All right, so the motivations and goals of MoE. This is a bunch of complex stuff, so where’s the rub? And there’s a very big rub, and that’s why we care about it despite all the complexity, which is that we can increase our model capacity without a proportional increase in costs.

Going back to that very first slide on scaling laws, out of all the other things, the only thing that actually matters is increasing those three parameters.

This is finding a way to decouple the one we like the least, cost, which is FLOPs, you know, whatever that computational budget is, from increasing our model capacity, as well as being able to increase even the distribution of our dataset by being able to split it across the models as well.

So that’s really the big rub.

So we decouple our parameters.

And the other big thing is that we reduce inference time while scaling our model size, if we fix compute, because at this point we’re not going to all of the different parts of our model.

So if we take 10 experts and take a 70B model down to 7B.

We’re going to be inferencing that 7B.

So it’s going to be a faster inference speed.

And there are other ways of getting this. I think this one is becoming less interesting over time as a benefit, because there are other things like quantization, vLLM, some of that PagedAttention sort of stuff that’s going on. But it’s a nice benefit that comes along with this. The model capacity, though, that’s the rub.

So this is the benefit of decoupling parameters, decoupling money.


That’s the thing. You ever try and run Falcon 180B?

You can’t.

But you could potentially run something equivalent to that on a 4090.

If it was implemented as a mixture of experts.

So here’s kind of the big fun thing around a mixture of experts.

Everybody knows GPT-4 is using this architecture.

They won’t say it.

They won’t confirm it. It’s what they’re using. Everybody knows it. Nobody knows any of the actual details. It’s one of those things you can get Luigi and Mario’s before one of the things it’s just, you know. So the model is very likely to boast about 1.8 trillion parameters.

We thought GPT-3.5 was somewhere near 250 billion.

And we think this thing is essentially just eight GPT-3.5s stacked in a box together, of which one is the safety function and there are seven other things in it.

So if you look at the 7 or 15 or whatever it is, it’s very likely that GPT-4V is just one of these with a vision encoder expert sitting on top with a high weight.

It’s basically, if you look at what all the people at OpenAI have been doing in the run-up to this, it’s been stuff around this.

And I’ll talk about this. I think a little bit later on.

Have you guys heard about this Arrakis build of GPT? I’ll talk a little bit about it once we’ve seen one of the architectures. Basically they were putting out all this stuff this month about this new model they were releasing that was going to be cheaper and everything else. And then they couldn’t make it work.

And so they very quietly pulled it all back. And there’s no longer an Arrakis being released. The reason seemed to be something that we lay out later in here with the Soft MoE architecture, where everybody thought this thing was going to work. It didn’t work. And so they pulled their launch back. We’ll go into that a little bit further on. Just know that’s the big thing there. You can make big stuff with it. So, structure: the MoE layer contains the expert networks.

Each expert is like a fully connected neural network. So you basically have a neural network of neural networks. There may be up to thousands of experts. I’ve seen one paper where they have over 1000 networks. That’s not real.

That’s good. Do you have to have all these experts loaded up at once? There are lots of people trying to get it dynamic right now. It’s actually something I was working on today, doing that through a FastAPI application.

So yes, it would be very nice if you didn’t have to do it at startup. Most things do it at startup right now.

So they load all of your experts together.

And it’s slow.

It is very slow.

But you don’t have to do it on every inference. You just have to do it at load. So the gating network selects a, so yeah, and here, 1000 is one of the numbers.

We generally see 8 to 32.

I’ve seen four in some examples.

That’s really around where we start getting a benefit, around four.

There are some projects I’ve seen successfully do up to like 256.

But there are diminishing returns as you go up in size.

So the gating network selects a sparse combination of experts per input.

We’ll go into what that means later.

It also learns a specialized routing function during training. All right, let’s go forward.

All right, so whenever we’re talking about the gating network, this is what determines which experts to use for each input.

And the input here we’re talking about is token by token in general.

It will vary depending on what gating network you’re using.

But mostly we use the token choice sort of thing where we’re going by token.

The outputs are a set of routing probabilities, one per expert.

You see that here when we’re talking about the inputs and outputs, these are the scores that come out.

And some of them will get normalized and how they get normalized differs on the architecture.

So everybody uses softmax to determine the probabilities for each expert. Additional logic is usually needed to reduce the dense complexity, and how they do that is different between each model too. I actually have basically the big architecture diagrams, so we can see how they’re using that final softmax to get the weights.

Basically, it’s a function to pull big, big, big numbers down to small numbers so that we can reason about them effectively.

It’s almost like a cosine.

It’s pretty simple.

You just think about it as taking a string of numbers between negative and positive infinity and squeezing them all so they sum to one.
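
For reference, a minimal softmax in plain Python. The input scores are hypothetical router outputs, one per expert.

```python
import math

def softmax(scores):
    """Squash arbitrary real numbers into positive values that sum to 1."""
    biggest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

router_scores = [4.0, 1.0, -2.0, 0.0]  # one raw score per expert (made up)
probs = softmax(router_scores)
print([round(p, 3) for p in probs])
```

The largest score still gets the largest share, but every value now sits between zero and one and the whole list adds up to one, so it can be read as a probability per expert.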

This is the first one.

This is the OG, at least of the current batch.

This is sparse activation. Sparse mixture of experts is very likely to be at least close to what OpenAI is using; they’re obviously doing some crazy special stuff themselves.

This might have been the first paper I came across. This is the keystone paper, the one that started this; there’s a whole bunch of other papers. It’s the most popular. I think it’s 2017, which is a little bit funny because this is like brand new stuff, but it’s based on a 2017 paper that somebody pulled out and went, oh, yeah, useful now. Well, if you look at 2017 to 2021, there’s a whole bunch of small, specific OpenAI papers, all in the run-up, like here’s this little sub component, here’s this little sub component.

There’s a good geography.

Huh. Yeah, I wish I could get send in with you on the email.

Yes. That’ll, if you go through that and then go through all of their abilities, you’ll have most of everything.

I do include a few of them here, but it’s a big list. Wake up a lot later. Yeah, that’s right. There was a healthy list on Facebook. Yeah, that’s your primer. That’s the intro basically is what it is right now. So, sparse activation.

The big thing with this one is, it’s a top-k. So it takes the k experts that have the highest probability.

Usually k is two to four, is kind of what we generally see.

The big thing here is that you’re basically taking a set of them, then summing those values together and using that to get your final output.

So you’re using a ensemble of the experts to come up with your, your final output. That’s really the big thing that’s in this one you’ll see here. That’s the thing that we’ve been looking at already.
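
Here is a rough sketch of top-k routing for a single token, under simplifying assumptions: the "experts" are stand-in functions rather than trained networks, the router is a plain dot product, and the noise and load-balancing terms that real implementations add are left out. All names and values are hypothetical.

```python
import math

def softmax(scores):
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_moe(token, router_weights, experts, k=2):
    """Route one token vector to its k highest-scoring experts and blend them."""
    # Router: one raw score per expert (a dot product with the token).
    scores = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    # Renormalize over only the chosen experts so their weights sum to 1.
    gate = softmax([scores[i] for i in chosen])
    output = [0.0] * len(token)
    for weight, i in zip(gate, chosen):
        expert_out = experts[i](token)  # only the chosen experts run
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, chosen

# Hypothetical experts: simple functions standing in for small networks.
experts = [
    lambda v: [2 * x for x in v],
    lambda v: [x + 1 for x in v],
    lambda v: [-x for x in v],
    lambda v: [0.0 for _ in v],
]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
output, chosen = top_k_moe([3.0, 1.0], router_weights, experts, k=2)
print(chosen, output)
```

Only two of the four experts do any work for this token, and the final output is a weighted blend of just those two.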

We can go to the next one here.

I think each one of these, yeah. So the issues with this one. This is the one that we have the most information on. The main reasons why we aren’t just like, cool, sparse activation’s good, let’s all end it and go home, is that there are problems.

Number one is the problem of token dropping. Basically, since we’re going token first, you can have, say for instance, this one expert that is super awesome. Everybody likes this expert. The router discovers that if I send things to this expert, the users like it.

So it gets all the tokens, and it might get more than it’s allowed to have, since we kind of constrain the amount of tokens that can go to something.

And how does this thing deal with it?

It just drops some on the floor. That’s all sparse mixture of experts does. Not great. So that’s kind of a problem.
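
A small sketch of how token dropping can fall out of a per-expert capacity limit. The first-come, first-served policy and the routing choices below are assumptions for illustration; real implementations differ in how they order and handle overflow.

```python
def route_with_capacity(token_choices, num_experts, capacity):
    """Assign tokens to their chosen expert until that expert is full.

    token_choices[i] is the expert index the router picked for token i.
    Returns (assignments per expert, list of dropped token indices).
    """
    assigned = [[] for _ in range(num_experts)]
    dropped = []
    for token_index, expert in enumerate(token_choices):
        if len(assigned[expert]) < capacity:
            assigned[expert].append(token_index)
        else:
            dropped.append(token_index)  # over capacity: token is skipped
    return assigned, dropped

# Hypothetical routing where expert 0 is the popular one.
choices = [0, 0, 1, 0, 0, 2, 0]
assigned, dropped = route_with_capacity(choices, num_experts=3, capacity=3)
print(assigned, dropped)
```

Here tokens 4 and 6 wanted the popular expert but arrived after it was full, so no expert processes them in this layer.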

It also happens during training, which is more often where you’re going to hit those full limits on experts, because you might have something in every slot of your data. For those that do not train: whenever you’re training, you basically batch things up to your token limit. So say I have a 4096 context size, and I have a whole bunch of data examples that are 200 tokens long. I might pack a whole bunch of those together and train it to essentially be able to respond at that length. It reduces the training time, and it’s something we do for a lot of different reasons.
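
The packing step described here can be sketched as a simple greedy grouping. This is one possible strategy, assumed for illustration; the example lengths are made up to match the 200-token, 4096-context scenario.

```python
def pack_examples(lengths, max_tokens):
    """Greedily group example lengths into batches of at most max_tokens."""
    packs, current, used = [], [], 0
    for length in lengths:
        if length > max_tokens:
            raise ValueError("example is longer than the context length")
        if used + length > max_tokens:
            packs.append(current)
            current, used = [], 0
        current.append(length)
        used += length
    if current:
        packs.append(current)
    return packs

# 45 hypothetical examples of 200 tokens each, packed into a 4096 context.
packs = pack_examples([200] * 45, max_tokens=4096)
print([len(p) for p in packs])
```

With these numbers, twenty 200-token examples fit in one 4096-token pack, so every pack is nearly full of tokens competing for expert capacity.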

And it’s a problem that sparse mixture of experts has if you don’t do some other things to counteract it, but those add their own complications. I think it’s worth more context on what a token is. A token is basically not like a character, but it’s a meaningful subset that the thing has been trained on, that means a concept. There’s a neuron that’s going to be tied to it through its history of training. So “hope” might be a token and “in” might be a token. See, it’s subsets of words.

I don’t know if you have a better.

I think it’s sort of serving. You might break up a certain thing. Yeah, that’s true. So sort of the base code and the work you have like a past tense word and I break it up to like the base word and pass this part of it. But could be something like that.

It’s just like a unit of input.
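
To illustrate the idea of a word splitting into sub-word pieces, here is a toy greedy longest-match tokenizer. Both the matching rule and the vocabulary are hypothetical stand-ins; production tokenizers such as BPE learn their vocabulary and merge rules from data.

```python
def tokenize(word, vocab):
    """Greedy longest-match split of a word into known subword tokens."""
    tokens, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                tokens.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here; fall back to a single character.
            tokens.append(word[start])
            start += 1
    return tokens

vocab = {"walk", "ed", "hope", "ful", "ing"}  # hypothetical vocabulary
print(tokenize("walked", vocab))
print(tokenize("hopeful", vocab))
```

A single word can come out as two or more tokens, which is why a token count is usually not the same as a word count.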

So it’s not inconceivable that it would send two parts of a word to different expert very unlikely just because of how we do something.

But, you know, also if you’re using like the open AI service, you’re probably paying per token for some things.

So sometimes you might send it a word thinking that’s a token that might not be a token.

It might be two, you know. So that’s kind of what we were looking at before on some lava pieces and some of these other networks, and how we get some of these scoped down; that’s part of the reason for doing some of that. I mean, even when we’re looking at what we did last time, where we set up the system prompt and said you are this person with this persona answering this way, all of that counts toward your token length.

So if you ship that across every time, you’re paying for the same thing that you’ve already said, every time. There’s some cost there. But it is the atom of large language models.

That is a better way to put it. Well, it’s worth mentioning when I was playing with these different models trying to put it together, you know, see which channels would cross talk.

When I lost tokens when age differently, like it was different levels of like corruption and model just with the token expires.

We had to go read, read the token at the cloud, we’re going to pull everything through. Interesting.

Yeah, yeah, I’m not sure what that was. This is definitely not that. So when you hear people talk about the context length, when they talk about tokens, and talk about memory or feeding stuff in, it always really comes back to this. It’s a little bit of a limit on token status groups. Compute, parameter size. So the more tokens that you have, the more memory you have to have to hold them, and the longer it takes to process, so it’s all that sort of stuff.

So the big changes that are happening right now: originally, a context length, the token length, of 512 was the big thing for a long time.

And everybody was really happy because it went up to a thousand and then 2048.

Now it’s mostly around 4096. And like, as you get it larger, you can do more tasks and more things that can be done with it.

But it’s all tied to scaling.

Alright, so moving on here, routing algorithms are not GPU optimized for a lot of the sparse activation patterns.

I put it third here, but this is really the big one: it is deterministic based on batch and not on sequence, which is really weird. So take the sequence I described before, where I had a whole bunch of examples of 200 length and I’m batching those all together.

What this thing will do is be deterministic based on that collection, the batch. Say I take one out and put another one in, because I’m trying to move around the training data to keep it fresh, to keep it from overfitting, any sort of thing like that. Then it’s completely different for every single one of those examples, all 20 of them, not just the one that was added in. Being deterministic on sequence would mean that it would do those 19 the same.

It would do this one different because it’s new. Sparse activation does not do that.
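
One way this batch dependence can arise is through shared expert capacity: sequences in the same batch compete for the same expert slots. The sketch below assumes a simple first-come, first-served capacity rule and made-up routing choices, just to show the effect.

```python
def kept_tokens(batch, capacity, num_experts):
    """batch is a list of sequences; each sequence is a list of expert choices.
    Returns, per sequence, which token positions found room at their expert."""
    load = [0] * num_experts
    kept = []
    for sequence in batch:
        positions = []
        for position, expert in enumerate(sequence):
            if load[expert] < capacity:
                load[expert] += 1
                positions.append(position)
        kept.append(positions)
    return kept

mine = [0, 1, 0]                       # hypothetical routing for my sequence
quiet_batch = [[1, 2, 2], mine]        # neighbors that leave expert 0 alone
busy_batch = [[0, 0, 2], mine]         # neighbors that fill expert 0 first
print(kept_tokens(quiet_batch, capacity=2, num_experts=3)[1])
print(kept_tokens(busy_batch, capacity=2, num_experts=3)[1])
```

The sequence `mine` is identical in both batches, yet which of its tokens get processed depends entirely on what it was batched with.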

So it is unstable, and not fully differentiable, which means that we can’t cleanly backpropagate, which is basically how we train the neural network.

There’s discontinuity in outputs. It often diverges for little to no reason. Suddenly your loss will just go, and you don’t know what’s going on.

It’s a black box.

It’s black magic.

Nobody knows how it works. Some wizard came and gave it to us essentially what this feels like whenever you get a good sparse model.

Oh, I can remind me to push the button about five minutes with the lights turn off again.

I’m training mechanism work through it. Oh yeah. All right, so gating strategy.

So the next one is Switch Transformers.

This is kind of the new one.

Or I guess it’s a little bit past hot.

There’s a lot of successful implementations of this one right now. It’s not emerging, it’s emerged, and people are kind of using this a bit now. So Switch Transformers changes it. Before it was top-k. Now it just picks one guy.

Who’s the best guy.

I’m going to use that simplifies a whole bunch of stuff.

It has selective precision. There’s kind of a slower learning rate warm-up, which basically decreases the amount of time, the cost, it takes to train these things.

It has a higher expert regularization, and this basically means that your training and your inference are more even across all of your experts. We’ll go over what happens whenever you don’t do that a little bit later on. It’s a really big problem in sparse models, where basically one expert gets really, really well trained and everybody else is on the bench. And it also has the ability to do parallel experts.
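
One common way to encourage that evenness is an auxiliary load-balancing loss. The sketch below follows the form given in the Switch Transformer paper, alpha * N * sum(f_i * P_i); the router probabilities are made up, and `alpha` is a tunable hyperparameter rather than a fixed constant.

```python
def load_balance_loss(router_probs, alpha=0.01):
    """Switch-style auxiliary loss: alpha * N * sum_i(f_i * P_i).

    router_probs[t][i] is the router probability of expert i for token t.
    f_i = fraction of tokens whose top choice is expert i.
    P_i = average router probability given to expert i.
    """
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    f = [0.0] * num_experts
    p = [0.0] * num_experts
    for probs in router_probs:
        top = max(range(num_experts), key=lambda i: probs[i])
        f[top] += 1.0 / num_tokens
        for i in range(num_experts):
            p[i] += probs[i] / num_tokens
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))

balanced = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]]
lopsided = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
print(load_balance_loss(balanced), load_balance_loss(lopsided))
```

The lopsided router, which sends everything to one expert, pays a higher penalty than the balanced one, so training is nudged away from leaving experts on the bench.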

I haven’t played a lot with that, so I can’t speak a ton about it, but it’s one of the things they claim.

And so the big thing here is that it outperforms sparse MoE at lower capacity factors, capacity factor basically meaning the average amount of experts each token is going to see during training or inference. And so as you decrease capacity down, it outperforms, even below one, which is kind of interesting.

I need to get below one expert.

Yes. It’s a half of expert.

Normal person. Okay. About to 90.
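
For a more concrete handle on capacity factor: the Switch Transformer paper defines expert capacity as (tokens per batch / number of experts) times the capacity factor. A small sketch of that arithmetic, with a hypothetical batch size and expert count, and rounding up as an assumed implementation detail:

```python
import math

def expert_capacity(tokens_per_batch, num_experts, capacity_factor):
    """Token slots per expert: an even share of the batch, scaled by the factor."""
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)

for factor in (2.0, 1.0, 0.5):
    slots = expert_capacity(tokens_per_batch=4096, num_experts=8, capacity_factor=factor)
    print(factor, slots)
```

Under this definition, a factor below one does not mean half an expert. It means each expert has room for fewer tokens than an even split would need, so some tokens are guaranteed to overflow.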

All right, so you can see here, basically the big thing is the router. It chooses one, only one goes in, and it normalizes. Conceptually it’s very easy.

This is a constrained version of sparse where it only picks one.

Top one.
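
A rough sketch of top-one routing for a single token, using stand-in functions for the experts and a dot-product router. Scaling the chosen expert’s output by its router probability follows the Switch Transformer paper; everything else (names, weights, experts) is hypothetical.

```python
import math

def softmax(scores):
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def switch_layer(token, router_weights, experts):
    """Top-1 routing: run only the single best expert, scaled by its gate value."""
    scores = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(scores)  # softmax over all experts
    best = max(range(len(experts)), key=lambda i: probs[i])
    expert_out = experts[best](token)
    return [probs[best] * value for value in expert_out], best

# Hypothetical experts standing in for small feed-forward networks.
experts = [
    lambda v: [2 * x for x in v],
    lambda v: [x + 1 for x in v],
    lambda v: [-x for x in v],
]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
output, best = switch_layer([3.0, 1.0], router_weights, experts)
print(best, output)
```

Because the output is multiplied by the router probability, the router still gets a training signal even though only one expert runs.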

So I have a serious question.

Okay, if you select the expert. And it goes through and you get a result.

Is there any effort to update all the experts on a regular basis as to how who does what is better during training.

Only during training, not during inference. So if you send it to ChatGPT, it’s not. Unless they’re doing a training run.

So the big thing is that basically all of that learning stuff. It shuts off as soon as you’re like using it in production, because it is expensive. Slow. All right. Right.

I think that’s about here any other questions on switch transformers I think we have a little bit additional information on the next slide.

Okay, Switch Transformers really is quite easy. Seems to work. People like it. There are a bunch of people that have successful implementations of that one.

This one’s kind of the new hotness.

So this is soft activation.

You might have heard this called soft MOE.

So in.

Yeah, I’m going to go ahead and get out of the way. So each expert gets basically s slots, where the slots are a linear combination of as many tokens as it wants. And what a linear combination means is that it basically takes every token, weights it, and adds them together. And that’s basically done with an einsum.

So einsum is a weird notation that exists because Einstein got tired of writing summations out by hand, back in 1916.

And so he basically said, I’ll just put them together and squash them, is roughly what it is.

So now we use that for this particular model and a lot of different things.

It’s merging all of the different things together and saying, okay, that’s what it is. Everything that went to this expert, that’s what your weight is. And then it basically takes your original weighting. So say we have a router with top-five routing, and it picks the top five activations and basically whatever each one got.

So say expert one got 0.7, expert two got 0.2, expert three got 0.1.

And so you then multiply those weights on the other end by whatever that activation was.

And so it’s kind of doing a weird, you’re getting a funky vectorized blend inside of your latent space.

It’s the button that’s unmarked, that you just have to know is a button.

Yes. And so that’s soft MOE.
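
A rough sketch of the dispatch and combine steps as described in the Soft MoE paper, written with plain lists instead of einsum so it stays dependency-free. It assumes one slot per expert for brevity, and the slot parameters and experts are hypothetical stand-ins for learned ones.

```python
import math

def softmax(values):
    biggest = max(values)
    exps = [math.exp(v - biggest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def soft_moe(tokens, slot_params, experts):
    """tokens: n vectors of size d. slot_params: one size-d vector per slot.
    experts[j] processes slot j (one slot per expert here, for brevity)."""
    n, d, s = len(tokens), len(tokens[0]), len(slot_params)
    # logits[i][j]: affinity between token i and slot j.
    logits = [[sum(t * p for t, p in zip(tok, sp)) for sp in slot_params]
              for tok in tokens]
    # Dispatch: for each slot, softmax over tokens, then blend all tokens.
    slot_inputs = []
    for j in range(s):
        weights = softmax([logits[i][j] for i in range(n)])
        slot_inputs.append(
            [sum(weights[i] * tokens[i][k] for i in range(n)) for k in range(d)]
        )
    slot_outputs = [experts[j](slot_inputs[j]) for j in range(s)]
    # Combine: for each token, softmax over slots, then blend all slot outputs.
    outputs = []
    for i in range(n):
        weights = softmax(logits[i])
        outputs.append(
            [sum(weights[j] * slot_outputs[j][k] for j in range(s)) for k in range(d)]
        )
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
slot_params = [[1.0, 0.0], [0.0, 1.0]]
experts = [lambda v: [2 * x for x in v], lambda v: [x + 1 for x in v]]
outputs = soft_moe(tokens, slot_params, experts)
print(outputs)
```

Every token contributes a little to every slot and every slot contributes a little to every token, so nothing is selected or dropped. The nested sums here are the contractions that einsum writes in one line.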

Yeah, it works.

If you want to figure out why it works, it’s kind of funky, but it does work for the models to which it’s pertinent, which is the encoder models and not the decoder models.

So the soft activation reduces the sparse.

I don’t have that here.

I don’t.

I think I have that on the next slide.

So I’ll cover that that second one a little bit later.

And so the big thing here, though, is that there are a lot of benefits to doing this from a performance standpoint. This is almost as much of a jump in performance as we got going from a dense model. A dense model is where basically all your weights are there.

That’s your Llama 70B, all those sorts of things. That’s the dense model; there’s nothing weird happening to it. A sparse model is basically just a generalization of a dense model, one case of it where you’re using a very narrow subset of it in order to get performance benefits.

This is that same level of jump up, where you’re basically taking your temperature for your experts down to zero.

And so it’s another level of generalization above the sparse MoE.

And the other big thing here is that it is deterministic based on sequences. You put one set of tokens in and it’s going to return the same thing no matter what you batch it with.

All right, we can go to the next one.

Is there any risk of dropping slots, like a sparse model dropping tokens?

No, not really dropping tokens, because they’re just kind of stacking all the tokens together.

Not as far as I know. I’ve not heard that.

However, there is a reason why that’s less of a problem, which we’ll get to on the next slide. But no, it doesn’t really drop tokens.

And it’s not for the reason that you’d want it to be.

And so here we see the benefits of the gating strategy, soft activation.

It’s doing this on ImageNet, which is an image data set.

So this is basically a vision transformer, which is the sort of thing used in CLIP and BLIP and all those, saying I can look at an image, segment it, and say that’s a dog.

A dog on the sidewalk, that sort of thing. So it’s that sort of task. Soft MOE is in the blue, dense is in the red.

It just outperforms.

We do see a trail off up here towards the edge. The intuition here is this is likely because of labeling issues in ImageNet.

Obviously it also could genuinely be trailing off, but we’re expecting pretty much everything to trail off at that point because things are just mislabeled in there. The data set itself is bad.

But it’s the test everybody uses. ImageNet-1K, I think, as you can see here.

Over here you can see the same thing: soft MOE just kind of outperforming everything.

We have experts choice, which we haven’t talked about yet, but it’s kind of more niche. Tokens choice is basically everything we’ve been talking about up to this point, the sparse MOE, like Switch. And then dense is like LLaVA.

I’m curious about your test.


It was just something that I was playing with already.

I couldn’t tell you anything about that. I invented this.

The set of specs I was looking for was for an image generation.

It gave me those specs back. The first time, it gave me tokens to just play with, and I used all of them on this particular set of words and labels.

And a little while later, it gave me like another hundred, hundred and fifty tokens that I could use. I used the same set of words, put it in there, and the image I got back was much more realistic.

Okay. Now, getting more specific and getting better results after more training seems to fly in the face of what you’ve got.

You’re still getting better results. So what you’re saying is that the results trail off: diminishing returns, not a reduction in returns.

So the more you throw at it, the less you get back for it.

But you don’t get less. Anything else on this one? S, B, L, H?


Where is that?


So that is actually, I believe, the size of the CLIP model, which is contrastive learning.

So it’s the text and the images.

There’s a data set out there

that is basically a mixture of text and images, where it treats those things as similar vectors.

So it’s basically different sizes of that, and they go up in size as you go.

So S/16, B/16, L/16.

I think it’s small, base, large, and huge.

And then the biggest one is a small g. That’s giant or gigantic or whatever it is, but it’s a small g. So you’ll see small g out there.

That’s actually the largest set.


It might be. I don’t know. Somebody thought they were clever. That’s basically what it was. All right. Small g. So that’s soft activation.

Who knows what a decoder-only model is versus an encoder-only model?

Are those terms that you guys are familiar with?

We’ve covered it once or twice.

Okay. We really jump all the way in.


So the biggest thing to know is that a decoder-only model basically doesn’t know anything about what comes next.

During training, it always masks everything ahead of the next token.

It only knows about the current token being generated and everything before it.

Things that use decoder-only: GPT, all of your language models, pretty much anything like that. Things that have an encoder are things like Stable Diffusion.

And then there are the audio transformers.

Things of that nature, where basically it knows about everything that exists in the space.

Things like Copilot have an encoder-decoder. T5, if anybody remembers T5.
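To make the masking concrete, here is a small sketch of the two attention patterns. The helper `attention_mask` is a hypothetical name for this example; real models build the same pattern as a tensor.

```python
def attention_mask(seq_len, causal):
    """mask[i][j] == 1 means position i is allowed to look at position j."""
    return [[1 if (not causal or j <= i) else 0 for j in range(seq_len)]
            for i in range(seq_len)]


# Decoder-only: each position sees itself and everything before it.
decoder_mask = attention_mask(4, causal=True)
# Encoder: every position sees the whole sequence.
encoder_mask = attention_mask(4, causal=False)

for row in decoder_mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```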

So the problem right now is that soft MOE does not work with decoder models.

It doesn’t work with GPT.
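One way to see the conflict, as a hedged sketch rather than a statement of what any particular team ran into: a soft MOE slot is a weighted average over every token in the sequence, so whatever is read back from that slot depends on tokens that a decoder is not supposed to have seen yet. The scalar "tokens" and the single parameter `slot_param` below are simplifications made up for the example.

```python
import math


def slot_value(tokens, slot_param):
    """One soft-MOE-style slot: a softmax-weighted average over ALL tokens."""
    logits = [t * slot_param for t in tokens]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    return sum(w / total * t for w, t in zip(weights, tokens))


prefix = [0.5, -1.0, 2.0]
# Same first three tokens, different final token.
slot_a = slot_value(prefix + [3.0], slot_param=1.0)
slot_b = slot_value(prefix + [-3.0], slot_param=1.0)

# The slot changes, so any output for position 0 that reads from this slot
# would change too, even though position 0 comes before the edited token.
print(slot_a, slot_b)
```

In an encoder that is fine, since every position is allowed to see the whole input. In a decoder it leaks the future into the past.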

So the question that comes up for every single paper anybody releases on soft MOE architectures is: does it work with a decoder-only model? Decoder-only? Decoder-only?

They say no, and everybody turns around and walks away.

That’s pretty much what’s happening. The thing is, for the vision transformer side, like GPT-4V, there’s this new LLaVA model.

There’s a whole bunch of stuff happening in vision transformers right now.

It’s great for that. It’s being used for that. And it’s probably part of the reason why there’s a big jump in performance there.

So rather than considering a single token, is there any effort to consider a multi-dimensional token?

Instead of a single.

So you’re saying like, have the model know about the dog, and the concept of it is both a linguistic thing, but also a visual thing.

That would be an example.

Yeah, yeah, that is what Clip is.

Sorry. It’s really hard to say. CLIP and vision transformers are weird, because they essentially use the same architecture. They’re essentially teaching us that there’s not that much difference between language and visual information, which is something for all the philosophers to chew on, and it gets into a bunch of weird stuff if you want to get into it. But basically it doesn’t matter: if you train it in a certain way, you can really stretch what that stuff means to a good extent, especially if you have a good training set. So, yes. And soft MOE is a great thing for that.

So here’s my hot take.

I have one hot take for this talk. I think with soft MOE, they were trying to get it to work with decoding.

And they couldn’t do it. That’s basically me looking at all the people who are on that project and all the hype that they were building. They were saying, we’re going to have just huge cost savings for everybody, inference is going to be so much cheaper, we’re going to be able to roll out all of this stuff. You look at the papers that they’re writing, and you look at the big failures that happened, everything that got closed around the September time period. It’s just a lot of things that point to: they tried this, they shot for the moon, and they couldn’t do it. That’s my hot take, completely unfounded other than intuition and being on Discord way too much. Well, they’re also fighting against the quantization type stuff we showed last time, where we were running a 13B model on this laptop with no GPU. You know, so people have been moving over and over and over trying to get models on small hardware.

And this is a different approach to that.


So, yeah, they had to go this way. And the big thing too is, if you take the delta between when soft MOE was released, which is about May,

and when they announced that it failed, it’s about the time period of a big training run, six months. They kind of went to implement it. All the timetables kind of match up, like: oh, they saw this thing that the open source community did and said, I’m going to take it and reap all the benefits. Couldn’t do it. Whoops.

But they’re still doing pretty good. All right, so that’s soft. What’s next? I think we’re almost at the end here.

The next one, I don’t know a damn thing about, other than a lot of people seem to be talking about it. I always hear expert choice. But the concept is simple.

Instead of basically choosing the expert based off the token like everything else, it has each expert choose whatever tokens it wants.

And so you activate all of your experts. And then you have them basically choose the tokens that they want. And then you have some sort of top-k. Yeah, you can set it; top-k is a tunable parameter.
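A minimal sketch of that idea, assuming each expert simply takes its own top-k tokens by score. The function `expert_choice`, the `capacity` parameter, and the affinity numbers are illustrative assumptions, not the routing code of any specific model.

```python
def expert_choice(scores, capacity):
    """scores[t][e] is the affinity of token t for expert e.

    Each expert picks its `capacity` highest-scoring tokens, instead of
    each token picking its favourite experts.
    """
    n_tokens = len(scores)
    n_experts = len(scores[0])
    assignment = {}
    for e in range(n_experts):
        ranked = sorted(range(n_tokens),
                        key=lambda t: scores[t][e], reverse=True)
        assignment[e] = sorted(ranked[:capacity])
    return assignment


scores = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.30, 0.70],
    [0.40, 0.60],
    [0.35, 0.35],
]
picked = expert_choice(scores, capacity=2)
routed = {t for chosen in picked.values() for t in chosen}
unrouted = [t for t in range(len(scores)) if t not in routed]
print(picked)    # {0: [0, 1], 1: [2, 3]}
print(unrouted)  # [4]
```

Every expert ends up with exactly `capacity` tokens, so the load is balanced by construction. The trade-off visible in the toy data is that a token nobody ranks highly (token 4 here) may not be picked by any expert.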

And yeah, I’m assuming this is very expensive, but it probably gives really good results.

Since you’re only taking a certain number of the tokens you probably get less, and this seems to be a really good approach, but I don’t know how this works. And this may be a good option from a trade-off standpoint, but maybe in a year or so. It’s like, instead of having 10 experts and only activating two, this seems more like you’re activating all of them, kind of.

And hoping that, I guess, you’d use it if you really trusted your set of experts for a specific problem, maybe. This is a fun one that’s out there. It’s emerging, with a lot of stuff happening in the open source community right now. How many of you guys are familiar with LoRA, QLoRA, PEFT, all that sort of stuff? So let’s tie that in. Basically the idea here is this thing called mixture of adapters.

LoRA is a low-rank adapter.

So it’s a mixture of those, essentially.

The big thing with LoRAs and QLoRAs is that you can train them on consumer GPUs. You can use the PEFT library, parameter-efficient fine-tuning, and get that model size down so it fits on a 4090, or fits on a lot of the things that you can rent from, you know, RunPod, Lambda, all those sorts of places: accessible hardware that normal businesses and even individuals can train models on.
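A quick back-of-the-envelope sketch of why a low-rank adapter is so much cheaper to train than the full weight matrix. The layer size of 4096 and the rank of 8 are illustrative assumptions; this only counts trainable parameters for one layer and says nothing about the extra savings QLoRA gets from quantizing the frozen base.

```python
def full_finetune_params(d_in, d_out):
    # Training the whole weight matrix touches every entry.
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    # LoRA trains two thin matrices: A (rank x d_in) and B (d_out x rank).
    return rank * (d_in + d_out)


d = 4096
full = full_finetune_params(d, d)
lora = lora_params(d, d, rank=8)
print(full, lora, full // lora)  # 16777216 65536 256
```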

And the idea here is that instead of training all your things up front, you take those adapters and you swap them out. And here’s really the big rub, because it adds a massive amount of complexity: this is where you get into the whole thing about how you train your router, and how you deal with the fact that if your router isn’t trained, it’s just going to do random stuff. So that’s really the big problem here. The upside is that you can update your experts after your big training run by training them on your own personal data.

You could train it on characters in a movie, you could train it to do anything. And the thing about a QLoRA or a LoRA is that instead of these giant massive data sets, you can do it with like 1,000 examples instead of a million examples, because it’s working in a smaller space. I really think that this, if it works, is the future of the consumer-grade sort of space. Now there might be a whole bunch of abstractions on top of it.

But what this is is insane.

There’s been a lot of people doing stuff on this right now. And since it is accessible on consumer hardware, a lot of the open source people are on it right now. There are like three or four major projects out there. I will plug Hydra MOE.

That’s one of them that’s out there right now. So that’s a bunch of folks that’s getting some traction. So then go next here. The basic thought here is that you’re basically swapping these things in. The feed-forward function has a lot of the same things here.

But the difference is that it just adds the adapter at the end, instead of it being this giant thing in the middle.

There is an adapter that basically gets merged into your weights and passes through that at the very end, and it goes out.

That’s kind of how LoRA works. Very, very simple. And yeah, that’s the general idea around that.
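A minimal sketch of merging adapters into a frozen weight, assuming the standard LoRA update of the base weight plus `(alpha / rank) * B @ A`, and assuming a router that hands back one gate weight per adapter. The names `lora_delta` and `merge_adapters` and all the toy matrices are hypothetical; real projects apply this per layer with a tensor library.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def lora_delta(lora_a, lora_b, alpha):
    """Low-rank update (alpha / rank) * B @ A, with A: rank x d_in, B: d_out x rank."""
    rank = len(lora_a)
    scale = alpha / rank
    return [[scale * v for v in row] for row in matmul(lora_b, lora_a)]


def merge_adapters(weight, adapters, gate_weights):
    """Add a gate-weighted mix of adapter updates onto a copy of the base weight."""
    merged = [row[:] for row in weight]  # the frozen base is left untouched
    for (lora_a, lora_b, alpha), gate in zip(adapters, gate_weights):
        delta = lora_delta(lora_a, lora_b, alpha)
        for i, row in enumerate(delta):
            for j, v in enumerate(row):
                merged[i][j] += gate * v
    return merged


base = [[1.0, 0.0], [0.0, 1.0]]
adapter_one = ([[1.0, 2.0]], [[0.5], [0.0]], 2.0)  # (A, B, alpha), rank 1
adapter_two = ([[0.0, 1.0]], [[0.0], [1.0]], 1.0)

only_one = merge_adapters(base, [adapter_one], [1.0])
blended = merge_adapters(base, [adapter_one, adapter_two], [0.5, 0.5])
print(only_one)  # [[2.0, 2.0], [0.0, 1.0]]
print(blended)   # [[1.5, 1.0], [0.0, 1.5]]
```

Because the base matrix is never modified, swapping an adapter in or out is just a matter of calling the merge again with a different list.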

We can probably go over the next one.

One thing that I thought was very interesting.

So there’s a whole bunch of papers about the people using this sort of thing.

And so this paper laid out, basically, that they’re training it on different tasks. They have basically a LoRA for each one of their different processes, the different things that they want to do. Okay, so for this task’s workflow,

I’m going to load up these experts together, and off it goes. You have a super highly trained model, but you can keep the same core base.

And this is the concept of transfer learning, which is basically that there are certain sorts of training data that have broad applicability across all tasks.

That’s things like, you’ll hear people talking about training on code, and suddenly the model has the ability to reason more effectively. Why? Because it turns out programming systems are some of the best structured logic-style data, the sort of pathway data, that we have available to us, despite what happens during code review. So it’s stuff like that, where it can help you with things like philosophy.

So this is kind of a concept around benefiting from a big, bulky transfer model. You see the little freezy pop here, which is basically saying we’re not training these things.

And the fire is saying that we’re training those things.

And so here we’re training the gate, the task adapters, and the domain adapter, which is trained on domain data. There’s a lot of strategies now for basically taking a giant corpus of data and turning it into a domain data set.

That’s one of the most common pipelines, actually: how do I take a corpus and turn it into text that is suitable for training. So lots of pipelines here.

And yeah. All right, go next. So I can stop talking. All right, so another big thing that kind of pops out in all of these is expert load balancing. It’s a big, big problem in a lot of these things, where you have one expert that is really nice.

It gets super overtrained. So, I had DALL-E generate me a little meme for the talk here.

It’s kind of crazy, right?

I think somebody had a weird eye over here.

That’s not overfitting or underfitting. That is just that you end up with one expert that’s trained, and the others are just kind of happy sitting at their initialized weights.

And it’s just luck of the draw.

Not necessarily anything wrong with those models.

It just happens that it went one, two, three, expert one. All right. It diverges. That’s just like when you say the sparse model diverges for some random reason, stuff like that.

Okay. Yeah.

Similar to work: if you’re really good at something, you just end up doing more of the exact thing that you’re really good at, because people keep giving it to you. You very much know how to do it, but you get overworked. And so you can’t keep up, right? And so what we’re talking about here is the tunable Gaussian noise.

There’s an additional noise parameter that gets added in.

It’s an additional tunable parameter that has to be trained, so it adds to size, complexity, all that stuff.
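A hedged sketch of what noisy gating can look like, assuming the commonly used form where each expert’s score is its clean logit plus standard normal noise scaled by the softplus of a learned noise logit. Here the noise logits are just fixed numbers; the function names and the toy logits are assumptions for the example.

```python
import math
import random


def softplus(x):
    return math.log1p(math.exp(x))


def noisy_top_k(clean_logits, noise_logits, k, rng):
    """Add scaled Gaussian noise to the router logits, keep the top k, softmax them."""
    noisy = [c + rng.gauss(0.0, 1.0) * softplus(n)
             for c, n in zip(clean_logits, noise_logits)]
    top = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:k]
    peak = max(noisy[i] for i in top)
    exps = {i: math.exp(noisy[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def count_winners(noise_logits, trials=1000, seed=0):
    """Count how often each expert wins top-1 routing for the same input."""
    rng = random.Random(seed)
    clean = [2.0, 1.9, 1.8, 0.0]  # expert 0 is only slightly preferred
    counts = [0, 0, 0, 0]
    for _ in range(trials):
        gates = noisy_top_k(clean, noise_logits, k=1, rng=rng)
        counts[next(iter(gates))] += 1
    return counts


quiet = count_winners([-50.0] * 4)  # noise scale is effectively zero
noisy = count_winners([0.0] * 4)    # noise scale is softplus(0), about 0.69
print(quiet)
print(noisy)
```

With the noise switched off, the slightly preferred expert wins every time, which is the "one expert gets all the work" failure described above. With noise on, the close runners-up get picked some of the time and so get a chance to train.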

All right, I’ve got to ask. Okay. If you’re going to reject an expert’s training,

and you know what that

probability is,

why even do the work on it?

If you’re going to reject it, when do you decide? When do you add this noise?

How do you do it?

Yeah, it’s just another random number generator process.

Kind of.

Yeah. So it’s basically, if you come up with a zero, why do all the work to get there?


So it’s not quite like that.

It’s not somebody deciding on the thing. There’s an essence of randomness at the start of a sequence.

You see the values and so it’s kind of based on this.

I have to believe the human brain, when it starts searching for something... There’s no human in this process. I know, but I have to believe that the human brain stops searching through memory for things when it knows it’s a dead end.

Cause otherwise I would bring her over here or something.

So this reminds me of something in reinforcement learning, where if you only go for the goal you wind up always going straight on, straight on, but you may not actually cover all the space that you want to cover. So you add some curiosity to it. That’s exactly it. Yeah, that’s that.

So this is almost like, yeah, you want to go for the goal, but sometimes you want to force it off the path. For instance, the gentleman with the bad eye, why is he there?

No, he’s overtrained. Right.


But yeah, you can’t take them out during training. All that is to say, it’s done as part of those forward passes through the neural network layer.

It’s just basically done as part of that forward pass.

I think I will not pretend to know the actual intricacies of the Gaussian noise.

I just know it’s there.

Part of what you just said answers my question. Okay. It’s a feed forward system.

There’s no feedback.


It is always feed-forward, and the only feedback is back propagation.

That’s it.

Yeah, because if we knew what the weight was going to be at the end, we would feed it back and just truncate that effort. But that’s not the way this works. Exactly. I’m going to hit my miniature. Okay. All right. So that’s the bulk of it. I will plug a few open source communities that are out there.

So Hydra MOE, Skunkworks.

They also do some stuff with vision transformers. They have BakLLaVA, which is LLaVA trained on Mistral, which greatly improves that sort of thing. They also have a bunch of things related to ablation studies. So Ablated is kind of one of their studies they have out there. Another one is this Xue. I can’t spell his name or say his name. This guy is huge in the MOE community. He has these giant, giant lists on GitHub of, like, here’s all the papers on MOE to go look at, like Awesome MOE or something like that.

So this is definitely someone I suggest checking out. He also has OpenMoE, though he’s kind of moved off to something else right now.

So this has died a little bit, but he’s still posting really good MOE content. And the one I really would suggest is EleutherAI.

These guys do a massive amount of stuff in the open AI community, the actually open AI community. They have a bunch of evaluation frameworks and things like that. They also have the MOE reading group. They meet every Saturday and usually have some sort of paper, whatever’s coming out. They’ll read through it. Someone will give the talk. There’s lots of good stuff on YouTube that goes over a lot of the things that I talked about today, and they have a talk on almost every one of these. So I would definitely check out Eleuther, and there’s a whole bunch of things kind of in that area. If you kind of branch out from Eleuther to everybody around it, you get to Nous Research and all those sorts of people, where there’s lots of fun things happening right now. And it’s all on Discord. So if you don’t have Discord, get Discord.

You don’t have to, but that’s where everything’s happening. So, I think that’s most of it.

And here, if I had to take six papers to a desert island and learn about MOE,

I don’t know why I would do that, but I would take these ones.

So: Sparse Expert Models in Deep Learning. Switch Transformers,

this is the key Switch Transformer paper. Scaling large models with mixture of experts. Scaling Vision with Sparse Mixture of Experts. ST-MoE, which is basically where they figure out how to fix the stability issues with sparse MOE, that’s kind of laid out in this paper, so that might go into some of your Gaussian noise sort of questions. Expert Choice Routing. Brainformers, which is a fun one. And then From Sparse to Soft, which is the soft MOE paper.

Yeah, any final questions. Thank you. You guys. Nice to meet you.

The next thing that is likely to come,

I think, is the adapters.

I’m pretty confident that that’s gonna be out there. I also think soft MOE is kind of beefing up vision transformers, so like LLaVA. One of those two is probably next. I’d say the vision transformers one, just because there’s a lot happening there right now.

One of the things that came to my head when you were going through adapters was Docker, where Docker kind of builds up the operating system in layers.

You put stuff in and it’s like, oh, here’s the communications layer.

I can see something like one of those models potentially becoming one of those layers.

So there’s a lot that’s very, very interesting.

This will probably come from some of the big guys. So Microsoft, OpenAI, something like that: an LLM as an OS. Probably not literally, you’d still have an OS, but something in that space.

He has been posting coyly on Twitter, all of this like, oh, I’m looking at operating systems, memory, that sort of stuff. So you’re like, okay, what are you guys doing? So they’re going to release something, probably with this Copilot. I don’t know. That’ll probably come out soon. Another interesting thing that’s out there, I think more at the intuition and conceptual level, is this concept of MemGPT, which is something that got released sometime last week or before. The actual concept itself is just somebody being clever and doing some prompt engineering, but it introduces the concept of treating LLM memory the same way as memory registers inside of a computer, and the intuition is kind of interesting. How they’re doing it, they’re just trying to put some names around prompt engineering and make a paper, because they’re like some kid out of college.

This is just one thing around that right now. I’m going to go ahead and shut down the recording part of this and we’ll end the Zoom part of this call. Thanks everybody for joining in on Zoom. Again, we’ll be back at it in two weeks.