Meeting notes provided by Gemini
Apr 15, 2026
Gemma 4 & TurboQuant: Fitting New Models on Your Hardware – Transcript
Josh Phillips: All right. So, tonight we're talking about Gemma 4 and TurboQuant.
J. Langley: man.
Josh Phillips: These are two different papers or one of them is actually a paper for our virtual paper review series.
Josh Phillips: These are two different papers, or one of them is actually a paper, for our virtual paper review series. But I also wanted to talk about the Gemma 4 models that came out and kind of pair these together, to give an idea of: a new model family came out, what are my options for trying to run this thing on my hardware and do something with it? Obviously, this TurboQuant paper came out that was very big for some reason, so I wanted to talk a little bit about quantization as well, because that's another thing that really goes into a lot of the things that Gemma 4 innovated on in an interesting way. So we're going to be talking a lot about the embeddings, the KV cache, and some of the different aspects of the models that tie these things together, and how you might use that for optimizing your own use of them. The big thing with the Gemma 4 family, one thing I didn't note on here, is that this is now an Apache-licensed model.
Josh Phillips: Before now they had some weird Google license of, you know, you have to do exactly what we want you to do with this model. It was pretty permissive, but it had some clauses in there that made it pretty much not usable for commercial purposes at scale. They've gotten rid of that with this one; it's completely Apache 2, which is great, because the Gemma 3 models were really solid, but that caveat made it really hard to use them for actual purposes. And so this one has four different models that they've released, and two of these models are kind of a new thing completely. We've not seen anything like them, which is these E2B and E4B models. These are mostly designed for edge and low-resource computing. And the big thing with them is that they're text in and image in, but they're also audio in, which is really nice because there's not really a model like that available at the level of performance they're giving out.
Josh Phillips: And so they’re kind of first in class. And they have this, uh, E element that we’ll talk about later as far as what’s that’s doing and and why that makes it really good for edge. Um, and so those are kind of your small models. Uh then they have this uh 26B activated 4 billion parameter model. Uh this one’s really good for if you have lots of memory. Uh but you’re kind of computebound. So a good example of this would be things like your integrated memory things. Uh so your NGX Spark, your Mac M5, M3, M4 chips. Uh those are compute bound. So they have lots of memory, but um it’s not super efficient. And so this only activates part of the network which makes it much easier to to um get good speeds on it while still having kind of the intelligence of the larger model. CSS 26 uh B A4B model. Uh and then they have this 31B model which is really the most interesting to me of these.
Josh Phillips: Outside of maybe the edge ones, which are pretty interesting too, this 31B model is really nice. It's got really solid levels of intelligence, and it is a dense model, which is nice for other reasons, because sometimes you want an MoE model and sometimes a dense model is really just the best thing you can do, especially when you have one-shot tasks. So yeah, has anybody here worked with the Gemma 4 models so far, played around with them? They've been out, I think, about 12 or 13 days now. I kind of want to get a read on the room.
J. Langley: Yeah, I've hit it a few times. Mostly at work, things like that. I stopped very quickly after trying to load the 26B in ways that it did not want to be loaded. And quickly, you know, I think I said something across the Discord channel going, "Is this possible?" And you're like, "Nope." I'm like,
Josh Phillips: Great.
J. Langley: “Okay,
Josh Phillips: Yeah.
J. Langley: we’ll stop.
Josh Phillips: Yeah.
J. Langley: I’m not going to blow another hour on this
Josh Phillips: All right. So Jay’s done.
David Showalter: Yeah,
Josh Phillips: Oh,
David Showalter: I I was just going to say I have not yet,
J. Langley: one.
Josh Phillips: so go ahead there.
David Showalter: but uh grabbing it now.
Josh Phillips: Okay, cool. And I think they have it in AI Studio and all those sorts of things. So even if you don't have local inference, you can still play around with them online. It's a pretty good model. So yeah, let's talk a little bit about what each of these does differently. The big story with these is this hybrid local and global attention, and this thing called PLE, which is the E that's going into the small model names. So we're going to go ahead and start talking about that PLE now, if I can remember how to scroll.
Josh Phillips: There we go. All right. So we're going to start with this Gemma 4 E2B. This thing is super tiny. It's really designed for things like phones, working inside of the browser. So this might be something that's powering something inside Google Chrome on your computer. Things like a Raspberry Pi could potentially run this on CPU very well, or things that are doing an offline local assistant on something like these new NPU add-on chips that are in devices. And this E2B is very interesting because, in truth, this is actually a five billion parameter model. But they store three billion of those parameters as basically a lookup table that goes along with the model and does not get computed in the forward passes. We'll talk a little bit about how that works later, but essentially I would almost call this E2B PLE thing a competitor to MoE. With an MoE you're saying: I want access to this much, much larger level of intelligence, but I don't want to compute it all.
Josh Phillips: An MoE does that by slicing up the intelligence and only giving you some of it during a pass. The nice thing about this E2B approach is that it gives you all of it during the pass. So there are some things it's going to be better at than an MoE, and they're just kind of scaling it out now. Hopefully this continues to be a trend, because I think this is probably a better solution than MoE, and we'll talk a bit about how all that works. It's got 128k context, which is really nice for a model this size. And it's pretty fast: this can get about 52 tokens per second on a Galaxy S26, which is pretty good for a phone. And it can get eight tokens per second on a Raspberry Pi, which is not a lot, but that's a Raspberry Pi, man. So that's pretty good. You could probably have this be your home assistant, like a simple classifier.
Josh Phillips: It has audio in, so maybe you could do some simple wake-word sort of things, and this would be pretty good for that, I think. And obviously, if you have it on your M4 Max, it's going to fly, 160 tokens a second or so. So yeah, this is a really cool model. The next one up is this E4B, and this is a pretty interesting model. The nice thing about these two models is that they're very intelligence-dense, I think would be the best word for it, so they're punching way above their weight class. This one is an 8 billion parameter model that computes like a 4 billion parameter model but really performs at the level above it, that 14-billion-parameter-ish class. So if you remember something like Mistral Nemo, it's that class of performance, but at not even just the 8 billion parameter size, but the 4 billion parameter size below it as far as your actual computation.
Josh Phillips: So it's all pretty interesting. And past that, it's kind of the same thing as the other one. It has a few more layers than the first one, but it's still very fast. You can see it's reporting, you know, 22 tokens per second for the S26, so it's basically almost on par with that smaller model. So yeah, those are the two E models. The next one up is this Gemma 4 26B A4B. This one has 128 experts, which basically means they're slicing the model up into little expert chunks and routing to different ones at a time, eight at a time. So each expert is roughly whatever 25.2B divided by 128 works out to. This one jumps up to 256K context, which is quite a bit. But the big thing with these larger ones is they do drop the audio; they do not have the audio encoder on these.
Josh Phillips: This one can fit pretty easily into something that has 24 gigabytes of VRAM if you take it down to Q4. And if you don't know what Q4 is, don't worry, we'll talk about it later, but basically you quantize it down to almost an FP4 level. It's really good when you get up to your 24 to 48 gigabytes for local GPUs, and it's pretty comfortable on anything above that. You can constrain any of these; if you're willing to negotiate on context length, you can get it a little bit smaller, or you do some weird stuff that we'll talk about later with offloading. But luckily these models are very hungry for overall memory but very nice for compute, since you're only activating some of the model. All right. And there's one last model in this family, which is this Gemma 4 31B. It has no special tricks. It's just the big old model.
Josh Phillips: It has 60 layers. It does have the sliding window. The big thing we'll talk about later, I just want to mention now: that sliding window has double the window of the ones we talked about before. So this is 1024; those are 512. So it has a little bit better memory for things that have happened recently. But for this one, you do need some pretty beefy GPUs if you want to run it at a good level of precision. If you go to Q4, you can fit it into a 5090. Obviously, if you have one of these Macs or DGX Sparks, it'll fit in that. But the thing with this one is that it is a dense model, so you have to use all of its parameters, and your prefill, your prompt processing, on those machines is going to be super slow with this guy.
Josh Phillips: So it's probably not a friendly model for your MacBook. But if you have, you know, an RTX Pro 6000, you can run this very easily at full precision, or if you're willing to take it down to Q8, Q6, Q4, you can run it on some other GPUs as well. Yeah, that is Gemma 4 31B. All right, any questions on any of those before we go into what's under the hood? Any questions about the zoo of
J. Langley: uh the sliding uh window is new to me.
Josh Phillips: models?
J. Langley: I’m guessing we’re going to cover that later.
Josh Phillips: Yeah. Yeah. Yeah.
J. Langley: Okay.
Josh Phillips: All right. Okay, I'm going to go into a few things, just metrics and evals. I don't usually go into this for models, but I thought it was kind of useful. There's a good Artificial Analysis post out there, and I've also got a few grab bags of interesting little tidbits in here.
Josh Phillips: So I think one of the big things that's very interesting is that, based off this Artificial Analysis evaluation, and if you've not been to some of our past talks, this is a very well-respected third-party evaluator. They're not benchmaxing things for the different vendors; they're pretty good as that kind of independent third party, a SemiAnalysis sort of analysis outfit. And they are finding that the intelligence here is kind of equivalent to Qwen 3.5 27B, which is fair, that's its peer in this realm, but it's also as good as some of these larger models, like this GLM 4.7, which is a 100-billion-parameter model from essentially the last generation; MiniMax M2.5, which is another large, 100-billion-parameter-size model; but also DeepSeek, that big DeepSeek 3 that came out. That's a huge model; I think that was kind of the early side of that last generation, but I'm fairly sure that's like a 323-billion-parameter model, and it's up there with that level.
Josh Phillips: And it's now a nice Apache-licensed, US-sourced model that we can work with.
David Showalter: Yeah, DeepSeek 3.2. That was right about a year ago, wasn't it?
Josh Phillips: Yeah, it was a little bit over. So, one thing that's very interesting: in general, I've played a lot with Qwen 3.5 27B. I've always been a big fan of their VL series. I like working with Gemma 3 a lot more; it's just easier to talk to, it doesn't ramble so much. And I think this is the stat that stands out the most to me, which is how many output tokens it needed in order to do its tasks at around the same level. Qwen put out 98 million tokens for all of the Artificial Analysis tasks, and Gemma put out 39 million for the same tasks. So you're not having to read and parse through the yapping quite as much.
Josh Phillips: It just kind of talks in a way that's easier to understand, where it doesn't feel like I'm reading, you know, GPT slop. It's a hard one to quantify, but that's what a lot of people are saying when they're working with
J. Langley: Well,
Josh Phillips: it. You can quantify it that way too, yes, but the
J. Langley: you can quantify it on cost.
Josh Phillips: the taste thing is harder to quantify. I'd compare it to dealing with
J. Langley: That’s
Josh Phillips: like, the Mistral Nemo models; those were kind of easier to talk to versus some of the other ones that were around that time. It kind of feels like that. The other one that's very interesting is that this E2B model is designed so it can fit in under 3 GB of RAM, with all of those extra weights that are not actively being used able to be stored in flash memory, which is much cheaper for the phone. So that's kind of some of the stuff they're doing there.
Josh Phillips: Okay. Yeah. All right. So, just some ideas at a high level of where these guys sit. Gemma 4 31B is right here, and this is on a chart with, you know, Claude 4.6 Opus Max and Pro Preview. It's sitting right here, above Haiku 4.5. So I think that's pretty interesting, just to be able to self-host a Haiku-level model; there's a lot of things you can do with that. Then, an idea of active parameters versus how smart it is: I think here you can see the value of these more efficient models, where especially this E4B is really doing quite a lot of performance for what it costs. So it's in a nice little area on this ideal line, and I think this is a very interesting one as well. But I just like working with this 31B, because we've been kind of stuck with Nemotron and GPT-OSS.
Josh Phillips: So it's nice to have a good US-sourced dense model here to work with. Here's an idea of how we jumped from the last series to this series. Here's the E2B-level model; the last series was the first time they played with this. So jumping from the Gemma 3 version of that to the Gemma 4 E2B, it goes from 5 to 15 points on this index. And I think the other interesting one is the Gemma 3 27B, which was their big dense model last time. That's the one most people used whenever they were working with Gemma in an enterprise setting. It jumped from 10 to 40, essentially 39. So this 31B is pretty good. Can't tell I'm a fan. So, this GDPval, this is one that you've kind of talked through a few times, Jay. You can see here how Gemma 3 performed on all of those GDPval tasks: 0%. But the two largest in this series are actually able to start finishing stuff off on here.
Josh Phillips: I think this is kind of an interesting chart. Tom, I think you talked about tau-bench. So here's tau-bench. In general, the Gemma models, especially once the tool-calling fix got in, are much better at these agentic tool-calling tasks. And you can see here the 4 31B is really getting to about the right level of par for this sort of stuff.
Thomas Plunkett: Yeah,
Josh Phillips: I would say,
Thomas Plunkett: that’s cool.
Josh Phillips: yeah, I would say that the Qwen, this is the one area where I think Qwen 27B is just a little bit higher than Gemma, on this pure tool-calling capability. That is the one place where it edges it out. But 60%, I'm pretty happy. Some more stuff: terminal-bench here. You can kind of see where the deltas are. Why we like the dense model is because MoEs really need to be guided; you really have to be interactive with them, which doesn't bode well for long-running agentic sessions.
Josh Phillips: So they tend to kind of mess up and are unable to one-shot things, and that makes things like terminal-bench, where you have to keep this long chain going and not get caught in a rut, more difficult for these models. Even though this one is almost the same size, that A4B really hurts. And this is kind of an emergent property that we're starting to notice; there have been some papers on it, and that might be the next paper we do. The research is still out, but that's one of the weird things: bad at one-shot. Yeah, that's all the metrics I have. I think that was kind of interesting and kind of shows where this thing is.
J. Langley: Do you know, on the GDPval, have you seen any breakdown of what parts of GDPval it was good at and what parts it was not good at?
Josh Phillips: Um, I have not, but I do have the uh,
J. Langley: because I don't think the Artificial Analysis index covers it as far as breaking it out into pieces.
Josh Phillips: it’s a good question. Um, I don’t, but Charlie, could you maybe look that up for us? I’d be interested in that too.
J. Langley: Yeah,
Josh Phillips: I know
J. Langley: I’ve I’ve run into something.
Josh Phillips: Charlie.
J. Langley: Uh, I've run into stuff that scored really well on one part of GDPval and then sucked at a whole other thing. And I'm like, well, as long as you don't want to do this one thing, it's great.
Josh Phillips: Yeah. Yeah. Yeah.
J. Langley: But if you do,
Josh Phillips: Yeah.
J. Langley: you’re going to be you’re going to be sitting there trying to figure out what
Josh Phillips: Right.
J. Langley: happened.
Josh Phillips: All right. There we go. Now I'm back. All right. Any other questions on the metrics stuff? I'll try and answer them. I don't have a ton; I don't usually spend a lot of time thinking about metrics, but those did stand out to me.
Charlie Rogers: Which one was I uh searching for again?
Josh Phillips: uh, the GDPval, gross domestic product eval, for the Gemma 4 31B.
Charlie Rogers: Right.
Josh Phillips: It was good at some of them; we're kind of curious which ones they were. All right. So, we're going to talk about one of the two interesting bits in the Gemma architecture. The first one here is this per-layer embeddings. This is something they had in the prior series; this is the first time that it's really coming out on a frontier release. It was originally released as kind of an afterthought: everyone thought Gemma or Gemini was releasing something big, and it was just this Gemma 3n thing, and everyone was like, ah, come on. So it did release earlier, but nobody really cared about it. Now we do, because it's on kind of a cool new model. And the big thing with these per-layer embeddings is that they're splitting the embeddings and the knowledge out of the core forward loop, but still trying to provide that information into it, and it goes through five different steps here.
Josh Phillips: The first of which is going into a lookup that is attached to each layer. So each one of these layers is basically trained with these things, and then they freeze those weights and make it so that instead of doing a forward pass over them, they're kind of layering them in. And this is used for storing attention for certain concepts. A way I've heard it described that makes sense to me is that it basically stores the concept that Princess Diana is the Princess of Wales and connects it to all those different things. So it's storing these longer-term groups of tokens as precomputed connections, so they don't have to be connected again. It's not necessarily going to be as good as if those were always computed, but they've got them stored off in a way that's efficient, so you're getting most of the capability with a huge savings in compute.
Josh Phillips: And how they do that, I mean, we don't know; that's their secret sauce, right? But that's kind of what they've described: they've had some way to distill some sort of meaningful things off, and instead of computing those, they're now just doing a lookup. So they're looking up those tokens inside of this forward pass, which is just an O(1) lookup. It costs a linear amount, how much you're looking up is how much you pay, instead of the quadratic amount as the context windows get larger. And so they take those, they do the lookup, and then they project that into the current context that's there. They combine it in, find a way to get it back into the forward pass, and then do some residual stuff to inject it without collapsing the whole thing. And that kind of looks like this. If you are interested in all of this, there are lots of really good YouTube videos out there where they go into the math of it.
Josh Phillips: I'm not going to do that. I just need to understand it enough to know what I'm optimizing for when I'm choosing the model. The big thing here is that we're doing this lookup at the attention block, it goes into the feed-forward, goes off into this gate, we do something to gate it into the stream, and then we also add it a little bit after it has gone through the RMS norm, to ensure that it really gets in there and doesn't get collapsed. All right, so that's that thing. Now, this next part is kind of weird: they haven't perfected it. They've got this working really well for the textual tokens, but there's not really a good way for them to save off these concepts when they're visual. They haven't figured that out, because the image tokens are strange; they're usually just, you know, these little image, image, image placeholder things if you actually look inside the tokenizer. I think we did that a while ago when we were looking at Pixtral. It's mostly the same; there are lots of different things people do now with resolution, but it's still mostly that. And so for this per-layer embeddings thing,
Josh Phillips: they just zero it out, which means that if you're putting multimodal inputs into the thing, those multimodal inputs are not necessarily going to get any benefit from those extra parameters. So it's almost like, for text, you're getting that full five billion parameters out of the E2B, but for any images it suddenly becomes a lot dumber. That's a negative of these things; it's not totally efficient. But there's always weird stuff with images, so take what you can get. I guess that's just something to be aware of. As far as the actual performance cost of those embeddings: generally, whenever you're going through a forward pass, it's only interacting with about 1,000 tokens out of the total vocabulary of 262,000 that are in these things. So the idea here is that now you're only having to calculate over those tokens that are actually pulled from that, essentially, database of embeddings. So way, way more efficient.
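For reference, here is a rough sketch of the per-layer-embeddings flow Josh describes: a frozen per-layer lookup table, an O(1) gather for the tokens in the batch, a projection into the model width, and a gated, residual-style injection back into the stream. The module names, dimensions, and gating recipe are illustrative assumptions, not Gemma's actual implementation.

```python
import torch
import torch.nn as nn

class PerLayerEmbedding(nn.Module):
    """Toy sketch of a PLE block; sizes and gating are made up for illustration."""
    def __init__(self, vocab_size=262_144, ple_dim=256, hidden_dim=2048):
        super().__init__()
        # Frozen per-layer table: lives off the hot path (it could even sit in
        # flash/host memory) and is only gathered for the tokens actually present.
        self.table = nn.Embedding(vocab_size, ple_dim)
        self.table.weight.requires_grad_(False)
        self.proj = nn.Linear(ple_dim, hidden_dim, bias=False)     # project into model width
        self.gate = nn.Linear(hidden_dim, hidden_dim, bias=False)  # learn how much to let in

    def forward(self, token_ids, hidden_states):
        # O(1) gather per token: cost scales with the tokens you look up,
        # not with context length squared.
        ple = self.proj(self.table(token_ids))           # [batch, seq, hidden]
        gate = torch.sigmoid(self.gate(hidden_states))   # per-feature gate
        # Residual-style injection so the extra signal can't swamp the stream.
        return hidden_states + gate * ple
```

Each transformer layer would own one of these tables and fold its output into the residual stream around its norm, which is roughly the flow in the diagram being described.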
Josh Phillips: So yeah, any questions about the per-layer embeddings thing? I'm trying to keep it high level enough so we can get to the TurboQuant part, but it is very cool and there's a lot of extra information out there on the
J. Langley: How much of this is kind of settled, as far as they've figured it out, and how much of it is
Josh Phillips: internet.
J. Langley: probably going to be different? You know what I mean? I I hate spending time learning something and just to figure out they threw it away and did something else.
Josh Phillips: So I would say that if you have to look at anything here, I would focus on the concept of this O(1) lookup table and how they are projecting it back into the forward pass. And the reason I say this is because this is essentially what Engram is. We've talked about Engram; that's the other, DeepSeek, sort of thing where they're using this same concept, but as memory.
Josh Phillips: So you could use this exact same concept, but instead of it being something that's shipped with the model, this could be the memory of the agent, or some sort of long-running training process that you're gradually building up.
J. Langley: Oh. H.
Josh Phillips: So why can't I just make this stored part three billion parameters, five billion, six billion, eight billion, a
J. Langley: and then and then attach it to my next instance.
Josh Phillips: trillion? Yeah, it's just a lookup
J. Langley: Uh
Josh Phillips: table. And so there is something that has to be done there: you have to have the
J. Langley: right.
Josh Phillips: right way to project it back into the stream and so they get through this in a very easy way by training
J. Langley: Mhm.
Josh Phillips: it with those specific things. Engram is continual. So this is very interesting. I think, um, maybe, yeah, I'm not sure. Does that answer your question?
J. Langley: It gives me a
Josh Phillips: Yeah. Yeah.
J. Langley: clue.
Josh Phillips: All right. So the next bit is this hybrid sliding-window attention. This one's actually pretty standard; it's just that they're using it in a fairly big way. The idea with hybrid window attention, and I think the RoPE-and-NoPE paper was one of my favorite examples of this, is that you basically just have a few layers that are looking at all of the context. So I'm looking at the entire thing for, say, one out of every four layers. But for the other layers, I'm just going to do a sliding window: I'm going to look at the most recent 500 tokens, or a thousand tokens, or whatever it is. And that makes it so I don't have to compute that full context, and I can do inference much faster. This is the kind of stuff that makes it so you can have two 31-billion-parameter models, but one is much faster than the other because of stuff like this.
Josh Phillips: Even if they're both dense. And so yeah, in that setup the NoPE layers are getting the full attention and the RoPE layers here are getting the sliding window, and all the Gemma models employ sliding windows. Let's see, I think we can go back to this family slide now; it has the context here. So, 35 layers: I think the Gemma E2B models are global attention about every three layers, and the last layer is always global attention. So it's not completely even, but I think this one's every three layers, and one of them goes up to maybe four, or even every five layers being global. And so it's an interesting trade-off where they're just trying to find a way to ease off the growth of the KV cache, because the KV cache is what makes it slow the longer it goes, right? Hmm, I didn't say that quite right, I had that backwards, but yes.
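As a toy illustration of the hybrid pattern described above, here is one way to build the masks: most layers attend only to the last W tokens, and every Nth layer attends globally. The window size and layer period below are illustrative numbers, not the exact Gemma configuration.

```python
import torch

def attention_mask(seq_len: int, layer_idx: int, window: int = 512, global_every: int = 4):
    """Boolean mask [seq_len, seq_len]: True where query i may attend to key j."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    causal = j <= i
    if (layer_idx + 1) % global_every == 0:
        return causal                        # global layer: full causal attention
    return causal & (i - j < window)         # local layer: only the last `window` tokens
```

The payoff shows up in the KV cache: a local layer only ever needs to keep the last `window` keys and values around, so its cache stops growing once the context passes the window, while the occasional global layers still see everything.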
Josh Phillips: Anyways, let's see. So yeah, that's the sliding window. Any questions there?
J. Langley: Yeah, it makes sense.
Josh Phillips: Yeah,
J. Langley: I don’t think I’ve seen that before,
Josh Phillips: very easy.
J. Langley: but I probably wasn’t paying
Josh Phillips: Uh, it it’s not super new.
J. Langley: attention.
Josh Phillips: You might have seen it happening. I can't remember what other models do it, but it's not uncommon for that to happen.
J. Langley: Okay.
Josh Phillips: I just can't remember which ones do it. So, one thing that is interesting, and this took me from super excited about the E4B models to moderately excited, is that they have audio understanding and video understanding, but they weren't really trained on audio generally; they're really trained on speech. So I can't feed it a song and have it describe the song, essentially. Or if I'm going through a video, it's really kind of tracking the images and the audio, but not fully together.
Josh Phillips: So it's not going to be able to understand, like, somebody talking off screen, they walk on screen, and then you hear a plate break in the background. It's not going to be able to put together weird things like that, like you might be able to get from a Gemini Pro or something like that. So it doesn't have all of that. But if you are just doing speech-style audio tasks, it is able to do that, which makes sense; they trained it to essentially be a phone assistant or an edge browser assistant for Chrome. And as we're talking about edge, it is an edge browser model; no, not Bing's Edge, edge as in edge computing, the Chrome browser thing. Let's see, anything else here? Yeah, it's not really good with music tagging. The other big thing is that if you want to do audio stuff, the training support for audio models really sucks.
Josh Phillips: We'll talk about that; I think this is my next slide. So this thing is really good for training, it's a great size for training, by the way, especially those E4B models. But if you want to do audio training, there's just not a lot of good support for it. It exists, but it sucks, to put it plainly. So hopefully this being here might change that, just because it is kind of a frontier release; this is when those things tend to get added. So, here's hoping. All right. As far as training, we have, I think, four major libraries that can be used for Gemma training. Unsloth is always the big one in the low-compute regime. It lets us do quantized training and LoRA training very easily. I think it's the easiest for anybody to get started with, and they've now got this cool fancy UI that makes it super easy to get started.
Josh Phillips: I'm actually going to pull it up right here. I don't think I've done one of these since that got released. So we've got the Unsloth UI, here we go, this Unsloth Studio. I would highly suggest to anybody who has any interest in training: please, if you're scared of training and don't want to get into it, please go download this. It makes it so easy. It helps you put the datasets together and loads everything onto your machine. You can get up and going and get your feet wet, because this stuff's really fun if you get into it. And even if you don't use it as a major thing, training these models makes it a lot easier to understand what's under the hood. So that's my pitch for Unsloth, out of the options that exist out there. So that's one. Once you get into a more serious regime, they've got Transformers and TRL.
Josh Phillips: TRL is really good for reinforcement learning sorts of capabilities, and it works well with things like Ray if you're doing distributed training. So that's good. MLX-LM makes it very easy to train these things on Metal, which is the kernel backend for the M-series chips on Mac, so you can train this on a Mac, which is pretty cool. Axolotl is really good; this is kind of your black-magic library that's really in the weeds, but it's really good if you want to train the MoE. None of the other ones really have as effective ways of training the MoE. And in general, I would suggest: do not try to train the MoE if you don't know what you're doing. They're very, very difficult to properly train. But if you do have to go do that for some reason because of your use case, go check out Axolotl. That will probably be the most successful of those.
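For anyone who wants a concrete starting point, here is a minimal LoRA fine-tuning sketch using TRL and PEFT, which is roughly the machinery that Unsloth and the Google notebooks wrap for you. The model id, dataset, and hyperparameters are placeholders for illustration, not a recommended recipe; swap in whichever Gemma checkpoint and data you actually mean to use.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Placeholder chat-formatted dataset; use your own.
dataset = load_dataset("trl-lib/Capybara", split="train")

peft_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections only
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="google/gemma-3-4b-it",   # placeholder checkpoint, not a Gemma 4 id
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(
        output_dir="gemma-lora",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        bf16=True,
    ),
)
trainer.train()
```

This is the plain LoRA path; QLoRA would additionally load the base model in 4-bit before attaching the adapters.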
Josh Phillips: There's also a bunch of weird JAX stuff, because it's Google, and I don't know what any of those are, but I'm going to mention that they exist. So if you're broken and you use JAX, then there you go; you probably already know what they are. And yeah, I did post, for some reason I'm now on this machine and I can't click through my links here, but there are lots of really good resources on training Gemma from Google. Oh goodness. Okay. Oh, wait, this isn't it. "Google Gemma train." There we go. So yeah, Google AI for Developers: they've got lots of good guides here, and I looked through these and they're pretty solid. They've got the Keras and JAX stuff, but they also have, you know, their own little library apparently that helps train for it. I don't know much about that, but I thought they had some notebooks.
Josh Phillips: Well, just trust me, I had it linked in there. There are, here we go, Google Colab notebooks. So there are Colab notebooks just for getting this up and going. The E4B is a very small model; you can train those in a notebook. I would also like to note that, here we go, on Kaggle you actually get 30 GPU hours for free. Everybody does, from Kaggle, every month. So you could totally train a model and just play around on the weekends, using free hours from Kaggle. I think they give out A100 hours, so it's a decent GPU.
J. Langley: Wait, what?
Josh Phillips: Yeah, they might do they might do be doing uh,
J. Langley: That’s Dang.
Josh Phillips: RTX Pros now.
J. Langley: that uh we need to file that away next time we host some or help with some kind of hackathon or something like that where you just got a weekend and nobody’s got any hardware.
Josh Phillips: Yeah. Yeah.
Josh Phillips: Yeah,
J. Langley: Uh that’s that’s really
Josh Phillips: Kaggle’s great. Yeah, I think they’ll
Thomas Plunkett: It was kind of It was kind of slow when I tried using it a year ago.
Josh Phillips: probably
J. Langley: cool.
Thomas Plunkett: Maybe it's better now. Or maybe I was asking for too much.
Josh Phillips: maybe, I don't know, but I do know that they offer it. Yeah. Let's see. All right. So, as far as the sizes for training Gemma: generally if you're on 8 to 12 gigabytes, you can do the E2B. I'm assuming LoRA for all of these except the one that I call out as QLoRA. Generally QLoRA is just a lot more difficult; that's basically where you quantize the model down, train on top of it, and then merge it back out, so it has some loss, but it does work. It's just not ideal. So, 8 to 12 gigabytes you can do the E2B; 16 to 24 you can do the E4B.
Josh Phillips: 32 to 48 gets you up into the area where you can do the MoE, but I still say you probably shouldn't. You can, though, do some really comfortable high-rank training with the E4B at that point. And then once you get up into that A100, RTX Pro 6000 area, you can do the 31B with LoRA. Yeah, pretty accessible. And this is my last little thing on Gemma: joys of the frontier, some initial reports. A lot of people were saying at the very beginning that this model is very lazy, doesn't like to call tools, doesn't call tools correctly, you get a lot of errors with it. So if you are hitting those issues, number one, I would have your people go and pull the new model, because what this ended up being was, oops, that's the wrong one, here we go, there were actually several fixes that came out, but it ended up being this issue with the chat template, where they left off the correct stuff to call tools.
Josh Phillips: And so all of its cool tool calling, if it seemed like it was just trying to make JSON at you, it's because it didn't have it in the template, and it was a miracle that it was working at all for any tool calls. They just published the wrong thing. So it's the same old story of machine learning engineers not knowing how to deploy stuff whatsoever, because they just didn't include the template you're supposed to use. And you can see this is not a small delta here. The issue is that they basically had some stuff in there that was doing some of the interleaved tool calling wrong, and that was causing it to be very bumpy. But I will say this is super normal, and they caught theirs very quickly. This happens on pretty much all of these releases, and it's probably a good idea to give it a week or two, whenever you're evaluating a new model, for things to stabilize, because this always happens. I think GPT-OSS didn't get fixed for like months because of the new Harmony format it had. So, just something to be aware of: if you do have issues with tool calling wherever you are, it might be that they loaded Gemma on the first day and never refreshed it, so tell them
Josh Phillips: to refresh it. All right. So, that's Gemma 3. Anything else on that one before we move on to the next topic?
J. Langley: Uh, Gemma 4,
Josh Phillips: Gemma 4. Yeah,
J. Langley: it’s okay. Uh,
Josh Phillips: sorry.
J. Langley: do you know if there is just a quick and dirty single prompt that would tell me whether or not they're using the right tool-calling setup, or is it just super obvious? You know, "what's in this directory," or is that
Josh Phillips: Number one, it depends on what tools you have available to it. So if you're in OpenCode, I would have it try to do a bunch of tool calls.
J. Langley: Okay.
Josh Phillips: So give it hey go find me this and this and this and this. And if it gives you a bunch of like bad tool calls that’s it. It shouldn’t be giving you a bunch.
J. Langley: Okay.
Josh Phillips: Maybe every once in a while. I mean it’s still not perfect but it should be able to kind of do some stuff.
J. Langley: Okay.
Josh Phillips: There probably are some other tricks; I just don't know them offhand. All right. Any other things on Gemma 4? All right. Then I'll go into this next bit. This is going to be about quantization and TurboQuant. So how many people here are familiar with the KV cache? Know what a KV cache is? Is that my answer?
J. Langley: That is your
David Showalter: That is that is your answer.
Josh Phillips: Very good. All right. So,
J. Langley: answer.
Josh Phillips: the KV cache is super cool. This is basically a way for us to not die of boredom when we're waiting for an answer from an LLM. And so I'm going to show three different visualizations here to kind of layer it in. But the big idea we're watching for with the KV cache: we look at this without-cache area, and we're looking at how much has to be computed on each forward pass, each time we're generating a new token or whatever we're doing.
Josh Phillips: And so without the cache, it's having to recompute everything for each token: now it's 2 by 2, then 3 by 3, then 4 by 4, calculating all of that. And with the KV cache, you can see that it's just one by two, one by three, a single new row each time. So it's a much smaller amount of work, because you're caching the values that happened before. All right, so that's one example of it. We're going to look at it a different way. So here's another example without caching. Is this not going to play now? Okay. Please, please, please, please, please. Oh, it's very sad. Let's see, where did I get this from? All right, it's good enough that I think I'm going to go try and find it again. Here we go. Okay, so I think this is the Hugging Face KV cache page. All right, I'm going to look for five seconds.
Josh Phillips: Oh, beautiful. Here it is. Okay. All right. So this is without the KV cache. You see this comes in, and I'm calculating that, that, that for the next token, and then I recalculate all of it every single time, and I'm dying of boredom just for that exclamation mark. With the cache, it saves those values off. So yeah, I think even if you don't get it totally, you get the idea of what it is, right? It's so that you don't have to redo stuff you already did before; it's saving your work. Does that make sense? Yeah, it's good enough to understand what this is.
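A tiny numeric version of what those visualizations show: with a cache, a decode step only scores the new token's query against the stored keys and values, instead of rebuilding the whole attention matrix. This is a bare-bones sketch with made-up shapes, not any library's actual cache implementation.

```python
import torch

def decode_step(q_new, k_new, v_new, K_cache, V_cache):
    """One decode step with a KV cache. q_new/k_new/v_new: [1, d]; caches: [t, d]."""
    K = torch.cat([K_cache, k_new], dim=0)        # append this step's key
    V = torch.cat([V_cache, v_new], dim=0)        # append this step's value
    scores = q_new @ K.T / K.shape[-1] ** 0.5     # 1 x (t+1), not (t+1) x (t+1)
    weights = torch.softmax(scores, dim=-1)
    return weights @ V, K, V                      # output plus the updated cache
```

Without the cache you would recompute keys and values for all t+1 tokens and redo the full (t+1) by (t+1) score matrix at every step, which is the quadratic blow-up the animation is showing.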
J. Langley: That makes sense.
Josh Phillips: So this is very important, because this is the core of what TurboQuant is. We're just going to keep that in mind when we talk about quantization. So we're going to talk about quantization in two terms; there are two kinds of quantization that we care a lot about.
Josh Phillips: One of those quantization types is your weight and attention quantization. That's the sort of quantization people normally think of when they think of quants, in terms of AI at least. And then there's the KV cache, or context, quantization, which is the stuff we'll be talking about with TurboQuant. And to go through this, I'm going to tell a story from the last time I did my most arcane sort of quantization optimization, where I hit a lot of the knobs you have to hit in order to fit something very, very large into something it doesn't want to fit into. This story is about how I got Qwen, 235 billion parameters, onto my computer. It's a super nice vision-language model that does some agentic stuff, and I wanted it to do some work for me, because I wanted to see how something worked at this level, and I figured out I was just barely able to fit it.
Josh Phillips: Which was super awesome. So I went here to look and see what it is, and looking at the model weights in here, it's like, oh, okay, each file is only 4 GB. And there are 118 of them. Now, I'm not super good at mental math, but I know that's a lot. So I had to figure out what I could do to get this monster inside the cards I had available. To give you an idea of what I'm working with, and I'm not unhappy with this loadout: I have an RTX Pro 6000 that has 96 gigabytes of VRAM. It's Blackwell, super fast; I want to do as much as I possibly can on this Blackwell. And I have a secondary GPU as well, which is Ampere. That's the RTX 3000 family; the workstation card from that family, from the 2020-ish time period, I think.
Josh Phillips: So it has 48 gigabytes. That's good, I like that, but it's slow as heck. I would say it's almost comparable to, like, your M5 chip sort of thing, but as an Nvidia GPU. So I really don't want to do a lot of work on this, but it does exist as something I can still run on GPU instead of CPU. On top of this, I have 192 gigabytes of system RAM or something like that, but it's still a very hard fit for that big model, which is over 400 to 500 gigabytes all included with the cache. So I had to come up with a voodoo spell, and I'm going to use it to talk about some of the knobs that you might want to go poke at whenever you are trying to do something like this. So this is just an idea of the options you have available to you.
Josh Phillips: So we're going to go through the voodoo spell and talk about the most interesting bits. The first thing here is weight quantization, and I have two things here I'm using. Number one, if you're really trying to be clever with the GPU resources you have available, use llama.cpp. It is the one that has the most flexibility. There might be some really arcane libraries out there on GitHub that have a little bit more flexibility, but if you're the target audience for those, you already know about them; if you're not, use llama.cpp. And so the first thing I did here, and probably the biggest thing, is that I got this Q4_K_M quantization. This is using something called GGUF, or that's how I always pronounce it; I'm sure there's another way. But this is the llama.cpp native format for quantization, and it's very good at retaining accuracy.
Josh Phillips: It's really, I would say, the most common version. And I'm doing it at Q4_K_M, so this is kind of your medium equivalent to an FP4 sort of thing. So I'm getting this down from brain-float 16, and we'll talk about what that means, down to just a little bit above FP4. That's the first thing I do. The second thing here is that I'm loading the actual vision encoder at full precision, at F16, because it's a very small thing and quantizing it has lots of negative effects. So you can choose selectively the things that you quantize; keep that in mind as you're fitting these things. And my suggestion to you, if you're using vision: never, ever quantize the vision parts. Do not do it. It's not worth it. So the next thing here is that we can do things like split the tensors.
Josh Phillips: I don't want to split the tensors half and half here; I'm telling it very specifically: do 96 on this GPU and 48 on this GPU. And I'm telling it to split by layer. We talked about the layers that each model has; it is more efficient with my current setup to split by those layers, so I'm not doing half of a layer on one GPU and half of it on another. That's an option you have available, and it ends up being a change of, you know, five or six tokens a second. You really will need to mess with your context size; I was able to get this working at 64k. And here I have these two things that are very relevant for us tonight, which is this K and V cache, so this is that KV cache here. What I'm doing there is that it's usually at this brain-float 16, FP16, precision, and I'm quantizing it to Q8.
Josh Phillips: Basically, what that means is that as it's computing and caching these things, I am slicing off some of the precision after it's done, and seeing what happens. Generally, and this was true at the time, it's a good bit of savings, and if I could cut this a lot more, I could fit even more. But at this point, if you quantize this thing anywhere below eight bits, the model becomes useless, because those relationships are very, very finicky and they do not handle quantization well. Pretty much anything I tried below this basically trashed the model, so this is as low as I could possibly go. And you'll see, if it was an option, I would have gone for it, because I'm going to show you some other very arcane things I had to do. Some other things here: I didn't want it to automatically offload the projector.
Josh Phillips: So I said do not offload that. I enabled flash attention, and then I manually specified that we have all those layers, plus the multimodal projector and the vision tower. I took individual pieces of the model and said, hey, I want you to load those on the Ampere device, because they don't take a lot of computation, but they have a lot of weights, so I'm fine with having them over there. And it was worth it to do all this just to get this thing going. I got it up to 40 to 45 tokens per second, which to me is in the acceptable range. So yeah, hopefully that was at least somewhat useful. This is one of those things where I never see people go into how they actually do this sort of thing, so hopefully it's at least interesting to see some of the knobs that somebody might use.
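Roughly what that incantation looks like as a llama-server launch, reconstructed from the description above and assembled in Python so each knob is labeled. The file names and numbers are placeholders, and flag spellings drift a bit between llama.cpp builds, so check `llama-server --help` against your version.

```python
import subprocess

cmd = [
    "llama-server",
    "-m", "Qwen-235B-VL-Q4_K_M.gguf",   # weights at Q4_K_M (placeholder filename)
    "--mmproj", "mmproj-F16.gguf",      # vision projector kept at F16, never quantized
    "-ngl", "999",                      # offload every layer to GPU
    "--split-mode", "layer",            # split whole layers across GPUs, not halves of layers
    "--tensor-split", "96,48",          # ~96 GB worth on the Blackwell card, ~48 GB on the Ampere
    "-c", "65536",                      # 64k context; negotiate this down if memory is tight
    "-fa",                              # flash attention (newer builds may want an on/off/auto value)
    "--cache-type-k", "q8_0",           # quantize the K cache to 8-bit...
    "--cache-type-v", "q8_0",           # ...and the V cache; below 8-bit, quality collapsed
    # Plus a couple of per-tensor overrides (-ot) to pin the vision tower and
    # multimodal projector on the slower Ampere card: weight-heavy, compute-light.
]
subprocess.run(cmd, check=True)
```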
Josh Phillips: All right. Uh any questions there?
J. Langley: No, definitely appreciate the hey, if you try this, good luck. You know what I mean?
Josh Phillips: Yeah. Yeah. Yeah.
J. Langley: Um,
Josh Phillips: It's born through pain, I assure you. All right. So we're going to talk a little bit now about the different quantizations, just to give you a rough idea of what these things are and which ones you have to care about. We'll go through the high-level ones, but there are some GGUF variants I do want to go into. So we have our normal ones. What we care about, really, is F32; that's full precision, and nobody really uses F32 anymore. Everybody trains at BF16 or F16 at the highest. Generally, if you want the model at its best self, you know, like it had a full night of sleep and woke up and did its exercise, run it in BF16.
Josh Phillips: It does matter. But you can go down to Q8 and FP8 pretty well. For these Q8s, just kind of think of the Q as an F and that's fine.
Charlie Rogers: two.
Josh Phillips: It basically tells you that it's a GGUF. Q6_K is really good. You really don't want to go lower than the Q4s on any of these; anything under Q4 really starts struggling. You can, but just know that it's your problem. I've heard a lot about these IQ quants being very good. I never mess with them, but if you are trying to go for the really, really low quants, I've found that people seem to be having a lot of success with these IQ2s. So if you're doing something really weird but you really need to squeeze a model in, do look at those. MXFP4 is relevant because that is what GPT-OSS kind of pioneered, and it is a legitimately good format.
Josh Phillips: So that exists too. All right, those are all the ones that I care about in GGUF. So yeah, we'll talk a little bit about what we mean by these bits, just to give you a high-level understanding; I promise all this stuff is useful for the TurboQuant paper. So the 32-bit... oh, were you saying something, Charlie?
Charlie Rogers: Oh, no. No, I didn’t say anything.
Josh Phillips: Oh,
Charlie Rogers: Sorry.
Josh Phillips: he said, "Oh, I forgot to mute." Okay. So, the first bit is always the sign. There's an 8-bit exponent for the 32-bit format, and then we've got these 23 bits of mantissa, which are basically your additional precision at the more granular level. And so we're always talking about this trade-off between how many exponent bits you have and how many mantissa bits; that's where all of these different variants come from, different recipes for that. Generally, nobody's using 32-bit.
Josh Phillips: Most people are using a variant of either FP16 or BF16. The big change between FP16 and BF16 is that BF gets eight bits of exponent and seven bits of mantissa. And the reason that everything is in BF16 is that it is more stable for training. Now, technically there's nothing on the inference side that should be better between BF16 and FP16, other than the fact that models usually happen to be trained in BF16; all it does is make a smoother curve, which makes it easier for the training loss to converge, and that's really the biggest thing with it. So yeah, that's why that exists. I think BF16 is mostly an Nvidia-majority thing; if you don't have Nvidia, I don't know how that works, but I'm assuming there are problems there. FP8 is kind of the big one now; FP8 is the new FP16, I think, is the way to put it. I think Devstral now trains in FP8.
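For reference, the standard bit splits behind the names being thrown around here, printed as a quick table (these widths are the commonly published definitions):

```python
# (sign, exponent, mantissa) bits for the formats discussed above.
FORMATS = {
    "FP32":     (1, 8, 23),
    "FP16":     (1, 5, 10),
    "BF16":     (1, 8, 7),   # same exponent range as FP32, which is why training likes it
    "FP8 E4M3": (1, 4, 3),
    "FP8 E5M2": (1, 5, 2),
}
for name, (s, e, m) in FORMATS.items():
    print(f"{name:8s} = {s}+{e}+{m} = {s + e + m} bits")
```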
Josh Phillips: That was the big thing with DeepSeek: they trained a model in FP8 and it worked. So it's pretty good, obviously. Another thing to remember, and this is really useful if you're looking at how much it takes to load the weights of a thing: if I have a 30 billion parameter model and it's in FP16, it's going to be roughly 60 gigabytes of VRAM. If I'm at FP8, it's going to be 30 gigabytes of VRAM. If I'm at FP4, it's going to be 15 gigabytes of VRAM. It's not exact, there are a lot of things that can change that up and down, but that's your rough order of magnitude for understanding these things. And I won't tell you how long it took me to figure that out after being deep in the weeds of this; it only just recently dawned on me that that was the translation.
Josh Phillips: It just never connected, so hopefully that's useful for you. Another thing, and I don't think it's going to be relevant for anybody: if you see this E5M2 sort of stuff, or E4M3, just know that if you have a newer GPU, go for E4M3 instead of E5M2; that's just old legacy stuff. The new kid on the block is FP4. We really like FP4 because it's nice and tiny, and there's been some clever stuff that Nvidia has done to help save quality whenever you're quantizing things down to this level, with some mixed-precision stuff called NVFP4, which lets us have almost like an FP6-ish, a Q6-level performance down at that level, by having these scaling factors on the outside. That's available on Blackwell. If you're on Hopper, you still get, I think, FP4 capabilities somewhat, so you can use something like MXFP4 or FlashInfer, but those are available. But generally, if you're on an Ada GPU, or if you're on Metal, FP4 does not have native kernels yet.
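And the rule of thumb from a moment ago, written out: weight memory is roughly parameter count times bytes per parameter. Real numbers land a bit higher once you count embeddings, norms, and runtime overhead, so treat this as the order-of-magnitude check it is.

```python
def approx_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough VRAM for the weights alone; overhead and activations not included."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for bits, name in [(16, "FP16/BF16"), (8, "FP8/Q8"), (4, "FP4/Q4")]:
    print(f"30B @ {name:9s} ~ {approx_weight_gb(30, bits):.0f} GB")
# 30B @ FP16/BF16 ~ 60 GB, FP8 ~ 30 GB, FP4 ~ 15 GB, matching the numbers above.
```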
Josh Phillips: If you're working with MoE models and you're trying to quantize, I would look at this REAP stuff, because those are usually very high quality. These are strange because they're almost like a distill: they have a bunch of experts and they say, hey, these guys aren't pulling their weight, nobody ever routes into these experts, and so they just kill them. They literally just cut them out of the network. Let me go look at their mascot here, which is a little reaper killing all the little AI expert layers. So yeah, that's how they make those smaller. Very fun. And those actually do pretty well on performance. Okay, so that is weight quantization. I just knocked something. Here we go. And so now we're going to talk about the KV cache quantization. Because I've got it all nice, I've got my thing.
Josh Phillips: It fits in my VRAM, but then I start talking to the thing and I'm broken again. So we're going to talk now about TurboQuant. Before I go into TurboQuant, are there any questions on the general weight quantization stuff?
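Since this is exactly where the KV cache bites, here is the usual back-of-the-envelope for its size: keys plus values, per layer, per KV head, per position. The layer and head counts below are hypothetical, picked only to show the shape of the calculation, not any particular Gemma or Qwen config.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """Approximate KV cache size in GB: 2 (K and V) x layers x heads x dim x positions."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# Hypothetical 60-layer dense model, 8 KV heads of dim 128, 64k context:
print(kv_cache_gb(60, 8, 128, 65536))      # ~16 GB at BF16
print(kv_cache_gb(60, 8, 128, 65536, 1))   # ~8 GB with a q8_0-style cache
```

Sliding-window layers cap `ctx_len` at the window size for their share of the layers, which is the other lever Gemma pulls on the same number.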
J. Langley: Now, I dropped a link to a notebook we did. I think this was September of 2024, when we initially pulled llama.cpp, and at the time it was a Llama 2 model that we were trying to shove into a
Josh Phillips: Mhm.
J. Langley: notebook on a Google Colab.
Josh Phillips: All
J. Langley: It has some links to some other references that walk through, you know, kind of when GGML, or whatever, became GGUF, and it was a mess for a minute. Um, but yeah, in case anybody wants to go play on a notebook kind of thing with something similar
Josh Phillips: right. So, we're going to talk about TurboQuant. How many people here have heard of TurboQuant?
Josh Phillips: Or at least have gotten the hype thrown in their face.
J. Langley: I heard about it when you sent me the description of what you’re talking about was the first time I hit
Josh Phillips: Yeah, this is one that I got.
Charlie Rogers: Yeah.
Josh Phillips: Go ahead.
Charlie Rogers: Yeah.
J. Langley: it.
Charlie Rogers: I I think I had come across it in uh on Discord like a couple of weeks
Josh Phillips: Sorry.
Charlie Rogers: ago and I was just is this going to be something that is going to help us with the RAM crisis that’s going on?
Josh Phillips: Yeah, I got a lot of people talking about that. This is one that just got blown up at me, a lot of people messaging me and mentioning it: "Hey, have you heard about this TurboQuant thing?" I was like, "Uh, yeah, I guess." So we're going to go into what this paper was and a little bit of the hype around it. It is still cool. But just to tell you, we had to talk about all those things above to get to why it is.
Josh Phillips: So it's pretty in the weeds and very specific, but we're going to talk a little bit about it. So, TurboQuant. The hype around this was, it looks like end of March here, Google put out this blog and then suddenly the memory crisis was over and all of the memory companies' stocks started to fall, and DDR5 is going to go down, and we're going to be able to do GPUs again, because Google solved memory. You can see here that Micron, which was one of these big memory manufacturers whose price had gone up a whole bunch because OpenAI bought all the memory, dropped, it looks like, 20% in the five days after this. So clearly something important has happened, because the market is reacting, and so we must be having a major breakthrough. Let's go into that breakthrough. Well, I hate to be the bearer of bad news, but no, it's not really that significant.
Josh Phillips: It is cool, I guess. But this isn't even a new thing. This TurboQuant paper that caused this giant market correction has been around at least since the 28th of April last year. It's a fairly old thing, and it's not even unique among that crowd; there are actually several papers with a concept like this. So this was released on the 28th of April, 2025, and the blog post is because they're going to, I think, ICLR or one of the conferences, and they're going to actually go present it now. They had the preprint, they're going to present, so they did the blog post. I think there's nothing wrong with what they did necessarily; it's just people being dumb. So they're talking about this TurboQuant sort of thing, which does help in certain cases, and it does have some pretty significant savings in those special cases.
Josh Phillips: And what it's basically using is this concept of rotation-based quantization, which is as fun as it sounds: before we quantize that KV cache and remove the precision from those already computed values, we try to put them into a better shape so that we're able to quantize them more effectively while losing as little information as possible. This actually has a lot of connection to, I don't know how many people were here for it, the world model paper that we did. We talked about JEPA for a lot of it, but at the very end we talked about this thing called LeJEPA, which was the concept of shaking the latent space around to get all of the nodes perfectly spaced out, so they're able to pack as much as possible into the latent space of that model. Do you guys remember that part? It's at the very tail end.
J. Langley: Oh
Josh Phillips: Yeah. So this is this is kind of in that realm of where we’re trying to to to kind of manipulate the space
J. Langley: yeah.
Josh Phillips: to get a better form to do further actions on. And so what it's usually doing is: I've got this multi-dimensional thing, and I'm going to flip it around and store a map of what the original position is to the flipped position. I flip it there, quantize, and then do my stuff; and then I dequantize and flip it back, every single time. Which is really good for reducing the amount of memory you need. However, there is a problem with that.
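A minimal sketch of that flip-quantize-flip-back loop, using a random orthogonal matrix in NumPy (real implementations typically use Hadamard-style transforms and fused kernels): rotate so the energy spreads out, quantize in the rotated space, then dequantize and rotate back on every read:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
R, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal "rotation"

def quantize_int4(x):
    scale = np.abs(x).max() / 7.0
    return np.clip(np.round(x / scale), -7, 7), scale

def store_rotated(k):
    return quantize_int4(k @ R)          # rotate, then quantize to ~4 bits

def load_rotated(q, scale):
    return (q * scale) @ R.T             # dequantize, then rotate back

k = rng.standard_normal(d)
k[3] = 25.0                              # one spiky outlier dimension
q, s = store_rotated(k)
print("mean error with rotation:   ", np.abs(load_rotated(q, s) - k).mean())
q2, s2 = quantize_int4(k)
print("mean error without rotation:", np.abs(q2 * s2 - k).mean())
```

Without the rotation, the single spiky dimension forces a huge quantization step and the small values get wiped out; with it, the same information is spread evenly, so the 4-bit grid is used efficiently. That's the spiky-versus-uniform picture Josh describes next.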
Josh Phillips: So here you can see this was something people were trying to get into vLLM back in April of 2025, which I will note is before this paper we're talking about, by the way. This online rotation concept, which is what it's called, of doing the rotation while we're running in order to quantize more effectively: they closed it because it couldn't get enough support in vLLM, and it finally got merged now, probably just because they wanted people to stop vibe-coding TurboQuant patches into vLLM. If you go look at the actual GitHub issues, there are like 30 vibe-coded implementations of this thing, and they probably merged it not because anybody's going to use it, but just to stop that. So yeah, it is useful, but the problem is that we have to do that computation every single time, so you have to really need it for some specific reason. Another thing it uses that's kind of useful is this quantized Johnson-Lindenstrauss residual, where it's basically using... let's see, do I have... I thought I had a... I might have misplaced... oh, did I not add it in? Dang it. Okay, I missed a... All right. So, this is very cool.
Josh Phillips: And if I have any extra time, which I won't, I'll go see it. But basically, it's a very light signal where they're storing basically just a plus or minus: is the current cache plus or minus towards the thing that I'm currently calculating? And that's it. It's just one-bit quantization, and they're using that to even things out. The problem they're trying to solve is that normally these vectors are very spiky, where I'll have something big and then not a whole bunch, and not a whole bunch. And when we're quantizing, I'm really only capturing the top couple of levels, so I'm going to basically lose all the information at the lower levels if I'm using certain quantization schemes. That gives me really jagged, heavy data loss. So I want to use up, as much as possible, the space that is available, and that's why they're rotating the stuff.
Josh Phillips: So another concept here is: I've got this large bucket here and this large bucket here, and I'm trying to split things out so I'm storing the same information, but split evenly, kind of even-ish, across all of my buckets. Then, if I put the same process here, let's say this is the quantization process, it never changes, but I put in one shape that's very spiky and one shape that's more uniform: with my spiky one, I lose a lot of data. With my uniform one, it's still quantized, I'm still losing data, but I still keep the overall shape, which helps enough whenever I'm doing this KV cache quantization stuff. And here, this is talking about the QJL, where it basically takes that last step, where I've come out the other end, and then there's an additional residual added, basically saying this thing is closer to that thing and this thing is further away, and over thousands of dimensions this actually averages out to be a nice precision check to get it closer to the line. And that's what TurboQuant is.
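A rough, illustrative take on that sign-only residual, in the spirit of the Quantized Johnson-Lindenstrauss idea rather than the paper's exact scheme: project the key with a random matrix, keep only the signs plus the key's norm, and estimate inner products from how often the signs agree:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 64, 512
S = rng.standard_normal((m, d))           # random JL-style projection

def encode_key(k):
    """Store only 1 bit per projected dimension, plus the key's norm."""
    return np.sign(S @ k), np.linalg.norm(k)

def approx_dot(q, key_signs, key_norm):
    """Estimate <q, k> from the fraction of agreeing signs (SimHash-style)."""
    agree = np.mean(np.sign(S @ q) == key_signs)
    angle = np.pi * (1.0 - agree)          # estimated angle between q and k
    return np.linalg.norm(q) * key_norm * np.cos(angle)

k = rng.standard_normal(d)
q = k + 0.5 * rng.standard_normal(d)       # a query that mostly points k's way
signs, norm = encode_key(k)
print("true dot:  ", float(q @ k))
print("approx dot:", float(approx_dot(q, signs, norm)))
```

Each individual sign is a terrible estimate on its own, but averaged over hundreds of projected dimensions it lands close to the true value, which is the "averages out over thousands of dimensions" point.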
Josh Phillips: So it's a way of doing this flip-and-quantize so that you're able to go to lower precisions and have it still be effective. I was talking before about being able to go from Q8 down to Q4; with this, you can go to Q4 and you don't lose that precision, it doesn't become useless. That is good. But the problem is that this whole thing, where it's taking all these values, moving them around, quantizing down, and then doing the QJL, has to be computed on every forward pass, and that can be super expensive. So, you know, it's kind of cool, I guess; you're weighing how long you're willing to wait against how much you want that model in your VRAM. I think for a super, super large model with very, very low activation, this could be pretty good, because those have a large memory footprint but low runtime computation. An example of this, and why I think it's a good pair with the Gemma models, is these E models: right now we've just got Gemma E2B and Gemma E4B, but say there ends up being a Gemma E300B, I could see something like TurboQuant being much more interesting at that point.
Josh Phillips: Because it's activating so very little, the computation you have to do every time is a different trade-off. But, you know, it's like comparing having a 32-megabyte memory leak in 1995 versus 2020. It's a little bit of a different story. All right. I do want to leave a little bit of time. Let's see. Any bits and bobs? Yeah, I don't think we need to go into that. Does anybody care about MLX stuff? I added some stuff in because sometimes there are people who care about MLX stuff. I don't think so. All right, that's what I got. I kind of like leaving a little bit of time for talking or chatting through thoughts.
J. Langley: And I do like the thought of trying to spread things out a bit so that you can keep more information. I mean, it seems a little obvious
Josh Phillips: Yeah.
J. Langley: but you know
Josh Phillips: And it is used.
J. Langley: it’s
Josh Phillips: So if you ever see RoPE, the rotary positional embeddings, that's a very common one.
Josh Phillips: Everyone uses it.
J. Langley: Right.
Josh Phillips: So that uses it. It's definitely valuable. It's just, how much of the computation cost are you willing to pay?
J. Langley: Right.
Josh Phillips: Yeah.
J. Langley: And I remember there was one quantization approach at some point. So I think all of the stuff we're talking about right now is basically quantizing an existing model into, you know, a smaller size, something like that. There was another approach at some point about doing it dynamically, where you're quantizing it by watching it run against a sample data set, and you're picking which weights to quantize based on activation rates or something, you
Josh Phillips: Yeah.
J. Langley: know.
Josh Phillips: Yeah, that's REAP.
J. Langley: Um oh okay got it.
Josh Phillips: So REAP would be a way of doing that. That's why I mentioned it.
J. Langley: Okay. Yeah,
Josh Phillips: See
J. Langley: that one seemed cool if you had the resources to do it and you had a very kind of specific way the model was going to be used
Josh Phillips: Yeah, here we go. So this one might be useful, and there are ways of doing it. I think REAP with mixture of experts makes that paradigm way
J. Langley: later.
Josh Phillips: more useful. And so, you know, you could do that with a dense model, and I think we've talked about that at some point. It is doable, it's just a pain in the butt. But if you have to fit it, like if you have to fit Pixtral on a 4090 for some reason,
J. Langley: Right.
Josh Phillips: and it has to fit in there, that could be a way you do it. But yeah, REAP is the best version of that, I think.
J. Langley: Right. I think Tom had a hand
Josh Phillips: Oh, what’s up, Tom?
Thomas Plunkett: Yep, I do. I got a question.
J. Langley: up.
Thomas Plunkett: Uh, and this kind of goes back to the Gemma 4 E models. So, um, when they’re doing those 01 lookups, you know, and so you’ve got part of the model parameters, you’ve kind of moved off into this other area.
Thomas Plunkett: Um, are those is the other area and the main model like separately updatable?
Josh Phillips: So I don't know the answer to that for the E4Bs. My guess is that yes, you can probably freeze the main model and train the PLE. I just don't know, because I haven't tried it yet, how it's going to take to that. It should be fine, but I know that, like, Ingram is designed with that from the ground up. So it should be the case that even if this one isn't, other people are going to try this, and that should be something to experiment with.
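The mechanical part of what Tom is asking is easy to sketch in PyTorch: freeze everything except the embedding-table side and hand only those parameters to the optimizer. The module below is a made-up stand-in, not Gemma's actual layout, and whether the real PLE trains well in isolation is exactly the open question:

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    """Stand-in for a model with a main backbone plus a per-layer embedding table."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(32, 32)               # pretend: the main model
        self.per_layer_embed = nn.Embedding(1000, 32)   # pretend: the PLE

model = TinyModel()
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("per_layer_embed")   # freeze the backbone

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)

print([n for n, p in model.named_parameters() if p.requires_grad])
# -> only per_layer_embed.weight would get gradient updates
```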
Thomas Plunkett: Yeah, because I I can imagine a lot of possibilities if you can like have one frozen and you’re updating the other, even if they’re both being updated, but they’re on different update cycles.
Josh Phillips: Yeah. Well, I think, I mean, the opportunity for continual learning, this is it. That's the real deal.
Josh Phillips: If you can update the weights on that thing if if it’s basically just a lookup into you know a memory sort of uh thing. Ah that’s pretty cool.
Thomas Plunkett: Yeah,
Josh Phillips: Yeah.
Thomas Plunkett: I mean because I mean this is Google we’re talking about. So imagine you’ve got the Google search engine updating the weights in the PLE while you’re separately training your intelligent model.
Josh Phillips: Mhm. Yes, that would be very cool. I would not be surprised if there's a whole corner of our industry that ends up working on what comes out of that sort of thing. Maybe not PLE exactly, but this, to me, smells like RAG grew up and is now part of the model, basically.
J. Langley: How long do you think they were working on this before? You know, part of me is like, well, RAM just went nuts, kind of like the TurboQuant thing, and then, was this pre... Well, and again,
Josh Phillips: Mhm.
J. Langley: of course, some of it you reached back into, you know, fairly... I don't want to say dated, but in AI-world terms, maybe. I was trying to figure out if current pressure from, you know, pricing of
Josh Phillips: Oh, no.
J. Langley: things was driving it, you know, but this seems like it was on a roll a while
Josh Phillips: No, no,
J. Langley: back.
Josh Phillips: this is this is Android and Chrome drove this. Uh the the hype happened to coincide.
J. Langley: Okay.
Josh Phillips: The hype is what took it from just a normal... and it's still cool, don't get me wrong, but the hype didn't match what it actually was, because people are so looking for an optimistic story of
J. Langley: Right.
Josh Phillips: why RAM is going to come down. It's not coming down, boys. Sorry. We're stuck with this till '27, I think.
J. Langley: Yeah. But finding something small that you can run somewhere, you know, especially if you lean back into kind of what we're seeing with Open Claw and other stuff like that.
J. Langley: You know, no, you're not rewriting a codebase maybe, but doing useful things is pretty cool with
Josh Phillips: Yeah.
J. Langley: it.
Josh Phillips: All right. I guess, any other thoughts? Any musings? It's kind of open,
J. Langley: I will think of something in about an hour that I should have
David Showalter: I know. Being honest,
J. Langley: asked.
David Showalter: this has motivated me to finally start messing with Gemma 4. So that's what I'm taking out of it.
Josh Phillips: Dude, it's so cool. I think this is a good model.
David Showalter: Yes.
Josh Phillips: Every once in a while something comes along where I'm like, I'm gonna be using this for a while. I think I'm gonna be using Gemma for a while, to be honest. I've already swapped over a few workloads to it that I think are just going to stick there.
David Showalter: Nice. And I’ll be good. I’m just going to try the 4B on my little 30 test laptop.
David Showalter: But, you know, I'm going to grab the 31B and just see what it actually does for
Josh Phillips: Yeah.
Charlie Rogers: Yeah, I've got a little robot here that could really use some of those really small
David Showalter: me.
Charlie Rogers: models.
Josh Phillips: Yeah. And the small models, I think they take up to 30 seconds of audio; that's what they're trained on. So that's good for lots of little tiny things you can do.
David Showalter: Yeah. I mean, I’m excited about the 2B as well.
Josh Phillips: Yeah.
David Showalter: Like there’s so much potential for fun
Josh Phillips: I mean, the 2B, it runs like an 8B,
David Showalter: toys.
Josh Phillips: you know? It's really, really good.
David Showalter: This was good. Thank you,
Josh Phillips: Yeah,
David Showalter: Josh.
Josh Phillips: no problem. All right, let me check chat. I know where that is.
J. Langley: Yeah, I’ve been kind of watching.
Josh Phillips: All right.
J. Langley: Um, I don't think... I know Charlie, you'd posted some things from the Gemma 4.
Josh Phillips: Okay.
J. Langley: I think those all go back to Artificial Analysis. You know, just basically, where does it compare with other models? But I tried to search and I also came up empty, because I don't remember how many tasks are in GDPval. I mean, it's a bunch, but I couldn't find anywhere that actually specified which tasks, or, you know what I mean.
Josh Phillips: Yeah.
J. Langley: The initial GDPval would actually break down some things, like, for this model it did well in the financial segment but poorly in the coding segment, or, you know what I mean? Um,
Josh Phillips: Yeah.
J. Langley: I just couldn’t find that
Josh Phillips: I wonder where where’s Gemma? I didn’t look at this ranking.
J. Langley: either.
Josh Phillips: See, where are we at? Nano. It’s got to be above nano. Okay, we’re in the right area. Gemini
J. Langley: Uh,
Josh Phillips: flash.
J. Langley: sometimes on the Artificial Analysis index they'll hide the open source ones or put them on, like, a different chart.
J. Langley: That throws me sometimes.
Josh Phillips: That’s rude.
J. Langley: I know. Especially when it's good enough to be on the chart.
Josh Phillips: Right.
J. Langley: One finally made it. I mean, come
Josh Phillips: Dude, that's crazy. It's right under 4 Sonnet on
J. Langley: on.
Josh Phillips: it.
David Showalter: Oh wow.
Josh Phillips: It's nuts. I mean, it's a really good model. I mean,
David Showalter: Right. Anything anything around 3.7 sonnet level is very
Josh Phillips: it’s not Yeah.
David Showalter: usable.
Josh Phillips: So I think, generally, I'd say this is better for, like, general knowledge work. I would say maybe Qwen is still a little bit better for coding, like just raw coding output, if I want a code agent. But if you're doing, like, analysis of requirements, making stories, stuff like that, reading documentation, I would take the Gemma 31B over that.
David Showalter: Oh, well
Josh Phillips: So
David Showalter: Josh, slightly off topic, but I'll throw a little plug in for Midjourney 8.1.
Josh Phillips: Oh,
David Showalter: Um, and I can't say I was paying attention,
Josh Phillips: yeah.
David Showalter: but relax mode is so blazing fast at the moment, I had to keep tapping over and copying a dozen comps at a time. So I don't know what they did on their end. They said it was going to run faster, but this is
Josh Phillips: Yeah.
David Showalter: crazy.
Josh Phillips: Well, cool. I will stop presenting. Thank you all for letting me uh blab again. Hopefully, it was uh interesting, useful, or humorous, one of the four.
J. Langley: Yeah,
David Showalter: That was
J. Langley: I appreciate it. Um, it’s a lot of information as usual and I will uh circle back
David Showalter: great.
J. Langley: uh probably on the Discord or something. Um, I’m probably going to try to see what I can get running. Uh, playing around on a laptop doing some things.
Josh Phillips: for sure.
J. Langley: We’ll see.
Josh Phillips: It’s always uh always fun to
J. Langley: Mhm. So,
Josh Phillips: play.
J. Langley: if you wonder why I'm not getting any work done, text me and break me out of whatever loop I might be in.
Josh Phillips: Got it.
J. Langley: All right, I'm out. I will talk to y'all next time.
Josh Phillips: All right.
David Showalter: Thank you all.