Pixtral

Pixtral from Mistral AI

Here’s a link to the notebook we used in the meetup, in case you want to run this yourself – https://github.com/HSV-AI/presentations/blob/master/2024/240925_Pixtral.ipynb

Transcription provided by Huntsville AI Transcribe

So welcome everybody. We are Huntsville AI. Everybody here has been here a few times, so I’ll skip the how to get connected and sign up and all. What we’re talking about tonight is a model that Mistral AI dropped two weeks ago.

It’s been a little bit, but not very long.

And so we’re going to talk about Pixtral, which is fun to say in a Southern accent. Unlike Mistral, which has an S, Pixtral has an X, and it’s a little harder for me to say. A lot of this comes from the blog post, which we’ll hit at that link. Well, actually, I’ll go ahead and get it opened in a minute. Most of this they’ve already talked about there; I’m not aware of an actual paper that has any of this stuff in it. It seems like Mistral will typically just put stuff in a blog post on their company page, and you just have to try to figure it out from there. It’s not like some of these others that will drop a paper, then drop a dataset, then do this thing and this thing and this thing, and then formally roll it out. Mistral is more along the lines of, hey, we did this cool thing, do you want to try it out? Have at it. They actually just dropped a magnet link in Discord, and then they didn’t say anything about it for a day. It was just a link, no inference code, no notes, nothing. That’s what they do every single time. And it’s somewhat annoying because I’m trying to talk about it and I don’t know a lot about it.

I did go through it, and there are some interesting things in here that I haven’t done in other places.

It’s been a learning experience in a few different places.

The main thing about the model is you give it a picture or an image, and you can give it text and all that kind of stuff, and you can treat it like either an Instruct model or a Chat model from an LLM and interact with it the same way. You can actually ask it about things that are in the images or do all kinds of interesting things. We’ll walk through some of that towards the end. It’s almost like they’ve got the image embedded in the same space that all the text is embedded into.

That’s what it feels like to me. That’s probably not exactly correct.

That’s exactly correct. Okay, so that’s exactly correct. It’s just weird how, what it does or how well it does it.

So it’s natively multimodal. I actually got an image in here from their post about how they do the interleaving of the text and images. You ask it something, the image is there, and you can do text, image, text, image. You can do text and a bunch of images. I haven’t tried all images just to see what happens; it may just describe them to me.

Of course it’s got strong performance on tasks on their site, which we’ll hit in a minute.

They actually go through a bunch of comparisons between Pixtral and some of the other major or much larger models that are commercially available.

One of the main differences is that this model is open.

I guess you could call it open.

It’s Apache licensed without a whole lot of extra rules around it.

For state-of-the-art performance, it has a 400 million parameter vision encoder that they trained from scratch. They have a decoder based on NeMo, which is another Mistral model that has been around for, I mean, it’s not new. It’s been around for a minute.

Yeah. Then the other thing, I highlighted this part, or bolded it: one thing I really like about it is the variable image sizes and aspect ratios.

If you remember some of the stuff we were doing with convolutional nets and things like that, those are great as long as the image is the same size as everything else we trained on. If it’s not that size and you can’t squish it into that size or do something, then sorry. With this one, you can pretty much give it any image size you want, any aspect ratio you want. You can give it a bunch of images if you want to.

Then there’s the context window of 128K tokens.

I haven’t seen anything yet like some of the things we’ve seen in the past for large language models with large contexts, where there was actual testing done on the front of the context versus the back of the context versus the middle.

Is there a hotspot in this context somewhere?

I haven’t seen anything like that.

That’d be interesting to see in this kind of a model.

For image encoding, what they do is basically break the image down into patches, which are 16 by 16 pixels each.

It’s almost row by row, where they take patch, patch, patch, patch across the first row.

They drop in a tag, I think it’s an image break, in between each row, then do the next row, until they’re done.

Then they have the image end.

That’s how they actually, it’s like they unfold each image into a single string of tokens.

That’s why it doesn’t matter what your aspect ratio or image size really is.
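To make that concrete, here is a little sketch of the unrolling as described above. This is not Mistral’s actual code, and the token names are only illustrative placeholders; it just shows how any width, height, or aspect ratio flattens into one sequence of patch tokens, row breaks, and an end marker.

```python
# Rough sketch of the unrolling described above: one token per 16x16 patch,
# a break marker after each patch row, and an end marker last.
# Token names here are illustrative, not Mistral's actual vocabulary.
import math

def image_token_layout(width: int, height: int, patch: int = 16):
    cols = math.ceil(width / patch)   # patches across one row
    rows = math.ceil(height / patch)  # number of patch rows
    tokens = []
    for r in range(rows):
        tokens.extend(f"IMG_PATCH(r={r},c={c})" for c in range(cols))
        tokens.append("[IMG_BREAK]")
    tokens.append("[IMG_END]")
    return tokens

# Any width, height, or aspect ratio works -- no resizing to a fixed square.
print(len(image_token_layout(640, 480)))  # 40*30 patches + 30 breaks + 1 end = 1231
```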

This is actually the full model.

I’m not sure if I can… So it comes in with the text encoding, and then the vision transformer basically takes the images and shoves them into the same stream along with the text, things like that.

Then it decodes into text coming out. So that’s the other thing.

This is image and text input, and text output. It can describe how to make an image if you wanted it to make one. I haven’t even tried to see if you could build an SVG with it or something like that; that’s a text-based kind of image. That might be interesting. We’ll actually try that later.

I’ve got something queued up for that. Let’s go ahead and we’ll hop over to see if this will let me… Yes, please.

Let’s go here.

This is what I stole all my info from.

Of course, it’s a French company.

So you’ve got La Plateforme and Le Chat, which is French for “the cat.” That’s kind of fun. So we’ll hop back to that in a second. So Pixtral is trained to understand both images and text.

These were some of the actual comparisons they did using some standard datasets.

It says in this post that they are going to open source the prompts they used to do the whole comparison. I haven’t seen those yet, because I don’t know where that open-source release would be.

So that looks good.

Claude 3 is a pretty famous one.

Gemini, I’m not quite sure.

And then of course they compared, again, these are, I’m guessing, similar size or relatively in the same ballpark, whereas these others are all fairly giant-sized models. So you would expect, yeah, the big, big models to be a little better. But some of these are relatively close on some of those.

And the other piece: they did some other instruction-following type things and measurements. Again, they’re saying they benchmarked them with the same evaluation harness, same prompts. And of course, there are always weird things you can do where some models are prompted a little bit better in some ways than others.

And you can’t actually cause things to look better than they are.

But I’m sure we’ll see when they finish open sourcing whatever it is to the community. Of course, all of this is in future tense. So we’ll see. But going back to here.

So Le Chat was the first thing I hit, because that’s what it said to do. Actually, the first thing I did was look to see, hey, is this something I can run on my own laptop like I do some of the other stuff? The answer is no.

So that didn’t take long. The next thing I did was hop over to Lambda Labs, which is where I normally run a GPU.

We can play with that for a second.

You can see what that looks like.

Lambda Cloud.

Boom.

Let me see.

Where did I sign in?

This is, this might surprise you.

Hey, it’s actually going to let me pick some. I tried two days ago on a weekend, and the statement I got was “no instances available for you.” For you?

None.

Unless you go to the full-up enterprise plan pricing, which is yearly, and I don’t even know if they give you a real price or if you have to call somebody. But anyway, one of these probably would definitely run it. But I am not going there. Anyways, I went back and went, okay, let’s just go to the chat like they say. And yeah, let me go there, and it’s “ask us anything.” So of course, as I always do when I wind up at a chat thing.

Let me see if I probably move my cursor.

I don’t know.

I’m supposed to ask about the book.

I have no idea. So what I wound up doing was going back to some of their stuff and grabbing like a copy image address, go back to here. Can you tell me what is in this image?

Maybe.

Oh, yeah, I get this.

I did that.

Upload images. Let me go ahead and do that; I’d grabbed some of these. Yeah, this was one I used off of their site already. This was the first thing that got my attention. So I had it explain this image as if I were a fifth grader. So imagine you have a super smart robot friend named Pixtral.

It’s really good at understanding pictures and words.

And it shows how it works. You can take both words and pictures as inputs.

Kind of like how you read a book where you see both at the same time.

It gets a picture, uses something called a vision transformer encoder to understand it.

This special tool helps the robot see and interpret the image. After it understands it, it uses another tool called a transformer decoder to put the information together and make sense of both words and pictures. Because of this, it can answer questions about pictures and recognize things in pictures really well. It’s like having a robot that can read and understand books and pictures at the same time. Yeah, I mean, this was pretty interesting, all based on just that image. But their chat interface is a little basic, I mean, yeah, it kind of helps. Of course, I’m not paying anything for this. This is all, you know, free.

I think they’re just trying to figure out how people are going to use this. I think that’s what ChatGPT did a little as well, when people first got hold of it and started doing all these kinds of things. All of a sudden they saw it happen.

And they’re like, huh, maybe we should allow people to see their conversation history.

Maybe we should let people have different conversations and pick up where they left off and do all of that. So it could be something they’re still working out while they get into it. The other thing was La Plateforme. So somewhere up here, well, maybe not. Wasn’t there developer tools? Oh, La Plateforme.

I went and created an account for free on La Plateforme.

It did ask for my phone number to verify I was a real person.

So I gave them my phone number.

It’s not like everybody else doesn’t have it already. And I haven’t been, I mean, yeah, anyway. I don’t know if everybody else got caught in that hack where they dumped all your info and your social and a bunch of stuff like a month ago, so I actually have a freeze on my credit right now just to make sure nothing weird goes on. Anyway. I went over here and created an API key, which, yay, doesn’t expire.

I grabbed it and went over to Colab and stuck it in here as a Mistral API key secret, and then made it available to this notebook. And then I stole some things from somewhere in here.

API keys, docs maybe or API.

Let’s try docs and then somewhere in here is something about vision.

So I copied this, and instead of the environment, I have to get the key from Colab. So basically I signed up for free. I grabbed this image. I guess I need to run some of this stuff because this is a brand new instance.

So I’m grabbing the key that I’ve got coming out of here, doing an import from userdata to grab it, and telling it the model that I got out of there.

I pass that to the client, and basically I’m saying, okay, the role is user, and here’s the content.

The text I’m asking is what is in this image? The image URL is this one.

Actually, I already commented that out.

This is the image it’s going to look at. And so then I run it. And it goes and does some stuff and then comes back and says the image depicts a snow-covered scene featuring the Eiffel Tower in the background. Trees and surroundings are blanketed in snow, giving a serene and wintry atmosphere. There’s a pathway, possibly a park, leading towards the Eiffel Tower, and a lamp post is visible in the foreground. I tried to play around with it a little bit: tell me more about the monument. We’ll see what it comes back with. The image features the tower… iconic landmark in Paris, France, built… it’s telling me all about the Eiffel Tower. So there’s that.
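For reference, the cell being described looks roughly like this. It follows the pattern in Mistral’s vision docs, but treat it as a sketch: method names can shift between versions of the mistralai package, the Colab secret name is whatever you chose, and the image URL below is a placeholder for the snow-covered Eiffel Tower example image.

```python
# Sketch of the notebook cell described above, based on Mistral's vision docs.
# The secret name and the image URL are placeholders.
from google.colab import userdata   # Colab's secret store
from mistralai import Mistral

api_key = userdata.get("MISTRAL_API_KEY")
model = "pixtral-12b-2409"
client = Mistral(api_key=api_key)

response = client.chat.complete(
    model=model,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            # Placeholder for the eiffel-tower-with-snow example image
            {"type": "image_url", "image_url": "https://example.com/eiffel-tower-with-snow.jpeg"},
        ],
    }],
)
print(response.choices[0].message.content)
```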

That was pretty interesting. I may still have something on that later on. I found a, what’s the concept?

Lorem ipsum.

Have you seen the image version of that, Lorem Picsum or whatever, where you basically just give it a URL and tell it what size?

They’ll just go find a random image of that size and give it back to you. So the problem I had with this one is it has “Eiffel Tower with snow” in the name of the URL. It doesn’t.

I know. But instead, I have one further down where I just give it that; it gets a random image every time and it can tell me what’s in it. But my problem is, if I try to go look at the image myself, it gives me a different random image, because it’s random every time. So it was interesting.

It’s just kind of, what do you see today, sir? This one was a really interesting one. And this also is from their blog post. Analyzing this image, what are the top five economies in Europe? And so to do that, it’s got to know Europe is over here and it’s green.

And these over here are the… it’s got to look at the green ones and find the ones that are biggest.

So at least read the numbers.

Yeah, at least read the numbers.

List the top countries in Europe with the highest GDP. Okay. Let’s run that.

It’s also pretty quick, then.

So Germany, UK, France, Italy, Spain in descending order.

It also says the green on the diagram.

That’s how it knows the European region and their respective GDP values and percentages. The numbers are the same.

As far as I spot checked, let’s see, which France, 278, 324?

France, no.

Yeah. Yes, 278.

I thought it said 378.

Well, shoot.

Okay. I thought I’d found something.

I was going to write an inflammatory blog post: I found something wrong with a model that I could never build myself. This one was really interesting: create a website based on the sketch.

So: pick an ice cream flavor, flavors, vanilla, then a next button, and the copyright, Mistral AI, of course. So I’m going to take this part out because I added that. It actually did it, which we’ll cover in a minute; it just didn’t work as well as this one. So same kind of a thing. You give it the image of the sketch, and in this case we’re saying I want an HTML page or site that looks like that, that does that.

And so it goes off and does stuff, and then it gives me an explanation of what it’s doing with the structure, you know, what all of it is. It’s actually using inline CSS, or I guess it’s inline, I don’t know what the right word there is. A kind of dropdown menu for picking a flavor.

Next button, style there. Simple.

You can enhance it by adding this.
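The step being described is roughly this, as a sketch: send the mock-up image with a “turn this into a single HTML page” prompt, pull the HTML out of the reply, and render it right in the notebook. It assumes the client and model from the earlier cell, and the sketch image URL is a placeholder.

```python
# Sketch of the "drawing to website" step: ask for one self-contained HTML
# page (styles kept in the same file), strip any code fence, and render it inline.
import re
from IPython.display import HTML, display

resp = client.chat.complete(
    model=model,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Create a single HTML page, with the CSS in the same file, "
                     "that implements this sketch. Reply with only the HTML."},
            # Placeholder URL for the hand-drawn ice cream mock-up
            {"type": "image_url", "image_url": "https://example.com/ice-cream-sketch.png"},
        ],
    }],
)

reply = resp.choices[0].message.content
# The model usually wraps the page in a fenced code block; pull it out if so.
match = re.search(r"`{3}(?:html)?\s*(.*?)`{3}", reply, re.DOTALL)
display(HTML(match.group(1) if match else reply))
```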

So I actually went and grabbed the doctype. It’s always hard without a mouse.

Copy down here.

Yeah, what do you call it where you have the style, but it’s in the same file?

Okay. Okay.

Come on.

Okay, there’s body. Paste my new thing in there.

Run it. Pick an ice cream flavor. Vanilla. Same thing. In the dropdown, it went and found other ice cream flavors that weren’t in my image and added them as selections to fill it out. I mean, that thing kind of caught me off guard, in that it kind of gets the intent. It gets the constraints of what you told it, and then it goes and adds a little extra based on that. Does the drawing have two other options?

Is it?

I’ve got that out.

A few drop down. Because that’s interesting when it was two.

Yeah, it does have two. Let me see if I can try it.

Hold on. At least five more flavors.

See what it does.

I’m just going to look at the code and see.

Okay. Vanilla, chocolate, strawberry, mint, coffee, and cookie dough. It gave you exactly five more. I mean, it’s pretty interesting.

Let’s see.

Add it in a two-moist extension.

Yeah.

Let me see if this would work. Because it doesn’t do imagery. Oops. Let’s see.

Also provide an SVG for the next button.

I’m wondering, since it doesn’t give you imagery back out, can it give me something that describes the image that I could put into another tool to gen an SVG or an image or something? That’d be kind of fun. I don’t see it. Where’s the next button?

Are you kidding me?

Okay, we got to run this and see what it does. I can’t help it. I can’t pull you, but it’s going to be fun. I’m going to add a trick. Nothing like a live demo, folks. I’m going to do a copy. All right. Let’s do it. Oh, no. Broke it somewhere.

Let’s try to do JavaScript, too.

Okay.

I absolutely don’t trust it at this point.

Here’s where I gave it a Picsum image.

You can actually give it a direct ID, which will give you the same image every time.
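The URL forms come from the picsum.photos docs: the bare size URL is a different random photo on every request, while the /id/ and /seed/ forms pin it to the same photo, which is what you want if you need to check the model’s answer against the actual image. The seed value below is just an example.

```python
# Lorem Picsum URL forms (per picsum.photos docs). Size is width/height in pixels.
random_each_time = "https://picsum.photos/800/600"             # new random photo per request
fixed_by_id      = "https://picsum.photos/id/237/800/600"      # always the same photo (ID 237)
fixed_by_seed    = "https://picsum.photos/seed/hsv-ai/800/600" # same photo for a given seed
```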

So I gave it this image, which is a… whatever. And I said, what’s in the image? And it says a distinct design, vintage-looking stamp or label, image of a person, possibly a historical figure.

They actually told me who it was the last time.

We’ll run it again to see if we get a different answer.

The text reads, I can’t pronounce that. You can see the post image of tobacco leaf illustration. Interesting.

It’s on surface, background, and soft neutral tone.

Let me try it again and see if it gives me a different answer. Likely it’s still okay. The text changed. Yeah, it did. Yeah, the area it names is in Cuba. I definitely think the last one was better, but… I found that the raw OCR capability is just not the best. It’s good at reasoning about things. It’s 12 billion parameters, but some of the smaller vision models, 5 billion parameters or less, are actually better at OCR, but you can’t count on it at all. Yeah, you gotta push the magic button. That looks like a model light switch. So that was interesting. I tried to look into quantization just to see if I could possibly find a version that I could run.

I gave up.

24 gigabytes is as small as you can get it right now. Is that unquantized, or is that quantized?

That’s running at fp8.

You can get it to 23 if you run it at fp8 with a 16K context length.

That gives you an answer.

Yeah, it’s pretty good if you do more. Can you get that up to 30 more? Yes.
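Putting those numbers together, here is a rough sketch of what that configuration would look like with vLLM, which does support Pixtral: fp8 weights plus a 16K context instead of the full 128K so the KV cache fits on a single 24 GB card. The flag names are the vLLM options as I understand them, not something verified in this session, so check the current docs before relying on it.

```python
# Hedged sketch: Pixtral on one 24 GB card via vLLM, per the numbers above.
# Flags reflect my understanding of vLLM's options; not verified here.
from vllm import LLM

llm = LLM(
    model="mistralai/Pixtral-12B-2409",
    tokenizer_mode="mistral",   # Pixtral ships a Mistral-format tokenizer
    quantization="fp8",         # fp8 weights instead of bf16
    max_model_len=16384,        # 16K context instead of 128K to shrink the KV cache
)
```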

If you can get flash attention, that helps too. Yeah, so I don’t have one of those, so I stopped playing around with that one. This went a lot faster than I would have expected it to, given that they don’t really tell you a whole lot about the internals other than their blog post. Wherever it went, I probably missed it somewhere. No, I’ve got it linked up at the top, though. Yes, please. So other than this piece, I did see somewhere that they have drastically dropped their price for using their API.

I’m not quite sure where it falls in relation to, like, ChatGPT, you know, OpenAI, or Anthropic, or anything like that.

That might be something to check out.

Again, their documentation was pretty much spot on as far as how to work with their models. They’ve actually got some actual pieces dropped in for, hey, let’s see if we can transcribe this receipt. We could give that a shot. Let’s see what this image address is.

Go back down to some cell that we don’t need to reuse again. Let me just play with this one. If you have Jupyter notebooks, you can just screw it up, it doesn’t really matter. I’m going to try “transcribe this receipt into a JSON format.”

Paneata beef, cheese, chicken. I don’t know what that says. Subtotal taxes. This is fun.

Okay.

Apparently it came from Clover with the actual interesting.

That’s fun.

Where did this come from over here?

Paneata, 3, 3, 3, 5, 5.5.

Along with that Clover print.

Yeah.
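The receipt cell is the same vision call as before, just with a prompt asking for JSON instead of prose. Here is a sketch with a placeholder receipt URL, assuming the client and model from earlier; the model is usually well-behaved about structure, but it is still worth guarding the parse.

```python
# Sketch of the receipt experiment: same call, structured-output prompt.
import json, re

resp = client.chat.complete(
    model=model,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Transcribe this receipt into JSON with fields: "
                     "items (name, quantity, price), subtotal, tax, total."},
            # Placeholder for the receipt image used in the docs example
            {"type": "image_url", "image_url": "https://example.com/receipt.jpg"},
        ],
    }],
)

reply = resp.choices[0].message.content
match = re.search(r"\{.*\}", reply, re.DOTALL)   # strip any prose or fences around the JSON
receipt = None
if match:
    try:
        receipt = json.loads(match.group(0))
    except json.JSONDecodeError:
        pass
print(receipt)
```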

I mean, you can imagine. Now, compare images.

I haven’t even, oh, wow.

I’m just going to take their word for it, because everything that I’ve tried from their models and their documentation has basically given me exactly what I was expecting to start with. I was curious. I bet it’s tasty. All right. So this is the Eiffel Tower and the Olympics.

Let’s see.

Difference between the images.

No second image background. Apple stadium.

It doesn’t know it’s the Olympics, but that’s fine. Yeah. Interesting.
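The comparison is just two image_url entries and one question in the same user message, since the content list can interleave text and several images. A sketch, with placeholder URLs for the two Eiffel Tower shots:

```python
# Sketch of the two-image comparison: text and multiple images interleave
# in one content list. URLs are placeholders.
resp = client.chat.complete(
    model=model,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is different between these two images?"},
            {"type": "image_url", "image_url": "https://example.com/eiffel-tower-snow.jpg"},
            {"type": "image_url", "image_url": "https://example.com/eiffel-tower-olympics.jpg"},
        ],
    }],
)
print(resp.choices[0].message.content)
```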

I could seriously see this.

This might be the same one. No, “France’s social divide.” I think the most surprising thing to me was that, the weirdness chart.

Yeah.

I would have lost it on that. It would be interesting to see this thing take on some of the reading comprehension, or, there’s a section of the ACT which is all about science reasoning or something, where you’re looking at charts and doing this kind of thing. And I scored the lowest on that section of anything, just because what they were asking just didn’t make sense to me.

I’m just like anyway. The reason it’s so good at the charts is because of that encoder.

They trained that encoder by itself on their data, because every 16 pixels is its own vector.

So if you look at the other vision models, they basically take the whole image, or maybe split it in half if it’s too big, and treat that as one vector.

It’s always kind of swirling together.

But since they’re doing it almost like a typewriter, and each one of the patches has information about where it is, when you have bar graphs or something, it knows.

It is different from that of Eric.

Right.

Surprising it’s not better at OCR.

I feel like that would also be a good result.

Right.

Yeah, it is weird. I think it’s just because they trained it so much. You’ll notice it’s really good at structured outputs. They spent a lot of time training it for something specifically.

And the OCR just seems to be a little off. It’d be interesting to see if the OCR is different on documents versus, like, on the side of a cup or on the side of a, you know what I mean?

If it knows it’s on a document, my guess is that it’s good.

I’ve never had any issues with documents. Scene text, it has had trouble with that; its captioning is not as good. Yeah. Old documents in French, so I can’t imagine. So it obviously knows it’s English. But that’s interesting.

Not just a, I can’t tell you if that’s correct or not.

No, that is in English. My bad. It’s just, yes, really old.

I wonder if it can translate middle English. Chaucer or anything. That’d be fun. Yeah. Guess check.

I’ve never seen it with handwriting. I have not, fair. Oh, so they do that. Doctors.

Right.

What prescription is this?

Good luck.

So I got it. Oh, right. Let’s go look at those real quick.

See if you have any idea what’s on this, what it means. The costs, or the limits. Yeah, the limits are in there somewhere. I should stop looking at that screen because it’s throwing me off.

That’s basically what I’ve done. This is today. It shows consumption in USD. I would get it somewhere. You had 30,000 tokens.

That would be a good price. Yeah, it would be. API keys, datasets, billing. I haven’t set up anything for billing; that’s why it hasn’t charged me anything. I’m on this experiment limit.

Of course, there was a thing that popped up when I first started that said, hey, we’re basically looking at anything you send and trying to make our tool better, whatever. You know, okay, I’m doing this as a demo anyway.

But of course, if you want them to keep your data private, that’s the other kind of stuff you run into. I don’t know what their setup is for showing how they keep your data segmented or private or anything like that.

The next thing I was going to look into, because I thought maybe I could run this on something I’ve got access to, and I may still try that: Ben had mentioned that I could try to use his machine for something on this.

Ben, do you have a… I think you’ve got something that’s 24 gig. Yeah, there’s at least one 24-gig card in that machine. Okay, I might try that out.

And then the other thing: I’ve come across a lot of things that are tied in a good bit with vLLM, which is something we probably need to take a couple, or at least one, sessions covering. This is more like an inference server, if you will, that’s kind of tuned for, if you wanted to host a model, working through things like high availability, making sure you can dispatch a lot of requests. I think I’m saying that right, but working through performance, that kind of direction.

I don’t think it’s got a way to do the thing that’s missing here for me, which is the ability to split the model and have some layers run on CPU and some layers run on GPU. That’s what I’ve been doing with llama.cpp, because that lets me run like a 13B model with just enough of it on the GPU to… I wouldn’t put that in production or anything, but it lets me iterate quite quickly. But it also does other things; some of the models you can get to run with vLLM, I don’t believe are available for llama.cpp. So that’s fun. So we may take a look at this coming up at some point.
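For reference, the CPU/GPU split mentioned there is the llama.cpp one, which looks roughly like this with llama-cpp-python. The model path and layer count are placeholders, and this is the 13B-class text-model workflow being described, not Pixtral itself, which isn’t a llama.cpp model.

```python
# The partial-offload trick described above, via llama-cpp-python:
# n_gpu_layers puts only some transformer layers on the GPU, the rest on CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="models/some-13b-model.Q4_K_M.gguf",  # placeholder GGUF file
    n_gpu_layers=20,   # enough layers on the GPU to speed things up; rest stays in system RAM
    n_ctx=4096,
)
out = llm("Q: Why offload only some layers to the GPU? A:", max_tokens=64)
print(out["choices"][0]["text"])
```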

And then that is all I’ve got on Pixtral.

Any comments, cries of heresy? Other than, well, it’s the craziest thing since… what was the last one we were looking at that made a video that was super realistic?

Yeah, Sora was the last “oh my gosh, how is this possible” thing that we looked at. But, oh, cool.

So let me stop video at that point.

If I can remember how to get back up to the top, of course.