llama-cpp-python

Transcription provided by Huntsville AI Transcribe

What we have been doing, again, last year, before the large language model explosion: usually when we were running Huntsville AI, we would do demos or tutorials at schools or things like that. We would normally use Google Colab, so that ideally anybody could log in to Colab and follow along.

You don't have to have special hardware, and it's super great for education-type stuff.

After we got into these large language models, you just didn't have the hardware to do it that way. So we actually did a talk last year, a presentation, and this is a piece of it, about how in the world you run these kinds of models yourself.

That’s good.

We got four chairs open.

I don’t know if there’s a personality test included in your choice of chair. This sounds like one of those puzzles that you’ve run along with somebody else. I was looking at, like, there are four in the list.

It’s about the same as the symmetry of four.

I know what’s going on three years ahead. Did you watch?

I know.

So last year we did a talk about, well, we were at a point where at Huntsville AI we could get GPUs, either through AWS or anything else.

You can always get the small ones.

They'll let you have those. But as soon as you step up and say, hey, I need something bigger, it'd be, okay, here's where you request one.

They give you a price, but then you request it and it might be months, it might be, you know, something like that. So start off with that. And then at CohesionForce, we were also trying to do a hardware acquisition piece.

So we put in as CohesionForce, you know, we need GPUs. We were on the Microsoft side. We still do not have them. Not because they don't want to give them to us; they don't have any left. Everything is spoken for these days, because you've got all the big companies doing all the training and all that stuff, just churning away. And we actually did, it was really interesting, I put together this talk for the SMD symposium. The two things we were going on were: one, if you want something like ChatGPT, good luck, because they're not going to give it to you; you've got to go through their API to use it. There are other things you can run locally, but you've got to jump through the hoops, look at licenses and stuff. As soon as I had finished that part of the presentation, Llama 2 dropped.

And I was like, well, so much for that part. Now you can just use Llama 2.

But then we went after, okay, so here are the other options for running these things.

You can take a large model, or you can get a smaller one that fits on, you know, a GPU you can actually get, and you can try to tune it based on your data, which is good.

You know, you can come up with really nice stuff there.

And then maybe a couple of months after that, we ran into llama.cpp, which basically reimplements the whole model inference directly in C++.

You don't even have to use a GPU; it runs directly on your CPU hardware.

Along with that, they came up with a way to quantize the weights, meaning instead of a full double- or single-precision float, you can now use a 4-bit integer or an 8-bit integer. And from the perspective of how these transformer models work, they found that, depending on the layer, if you quantize certain parts of the transformer with five bits and other parts with four bits, you get better performance. Some things are just more important, of course.

There was all kinds of stuff going on at the time. The quantization is basically done without running the model itself: you can take a large model, quantize it, and get a smaller model out that will actually fit in RAM.

So that’s kind of what we’re doing.

There were some other really interesting approaches at the time that would actually track what the model was doing as you ran it, dynamically figuring out which parts were important and which weren't for your exact use case. The weights that got hit a lot, they wouldn't quantize; the weights that didn't matter as much, kind of out on the edges, they'd run those at 2-bit, 4-bit, whatever. But that actually takes a fairly significant amount of time to do. After it's done, though, you've got a much smaller model.

So we went through that.

I'm trying to remember what I was talking about here, what I've got in here.

It was the one where we had it set up: act as Donald Trump, tell me about your wall, but in the form of Shakespeare.

And it wrote a sonnet. That rhymed.

And it freaked everybody out. It sounded like Trump. Right. But all of that was running off of this laptop, no connection out or anything like that, using a 4-bit quantized model.

And it was, I mean, really, really good. Then we said, give us Dr. Seuss. That was funny; it went all cats for that one. So anyway, we've been playing around with that.

So that was last year.

Going into this year, and we're working through this a little backwards, so first some background. Back in 2022, we normally helped out with the NASA Space Apps Challenge, either as mentors or so on, for teams of folks who want to go do the challenges and stuff. We showed up and they needed more teams than mentors; they were full on mentors. So, okay, let's make a team. The challenge we went after was one that says, basically, NASA has a lot of documentation, as you can imagine: technical documents, scientific documentation, all the way back to the '50s, and there is really, really interesting stuff in there. They were trying to find a way to allow better exploration of that documentation. What we built was mostly a semantic search on top of the documentation, where you didn't have to ask with the exact word; you could just be close and it would find the context and give it to you. So as we started off this year, now that large language models have taken over the world and that's all anybody cares about at the moment (well, not everybody), we decided to go change that submission into a retrieval-augmented generation approach: building a RAG on top of around 10,000 NASA documents and walking through it session by session.

What does it take to do that? You know, so we've been through how you do the embedding.

Actually, how do you chunk these documents into pieces, and does the size of the chunks matter?

Is it better to do big ones?

You know, all of that kind of stuff. We went through a tool for that.

It's a pretty neat piece that actually wraps an Apache library that can do PDFs and Word docs and Excel and all that kind of stuff.

But through that, we've also been through a session on embeddings and how to choose the right model to embed your chunks with for the search part. We've been through vector stores, where we went through Chroma.

That was probably another, I don't know, eight or so sessions back from today. We had actually built one ourselves; it wasn't a lot of code, but it was basically an in-memory vector store, before Chroma even came out. We didn't know that would be a thing.

We could have gotten rich, I guess. Again, underneath, Chroma is basically, I mean, cosine similarity and a couple of other algorithms.

That's all a vector store is.

They do some interesting things though, behind the scenes.

So I give them that.

So now we're at the point of: I've got the documents chunked.

I’ve got them embedded. I’ve got them in my vector store.

The next thing is going back and using, at least at this point, llama-cpp-python, because the original llama.cpp is all C code.

I mean, if you are not good with makefiles, don't go there.

I built it from scratch a couple of different ways and it’s well-documented.

You can do some other things with it, but it is hardcore C++. So this is actually a Python wrapper around that.

It makes it much easier to prototype with.

And the kind of approach we take, from an AI standpoint, is much more on the applied side rather than a lot of theory.

We’ll jump into theory occasionally when I find somebody that knows it enough to come talk to us.

For me, normally it’s more of the applied.

So we can normally work through something and figure out how do we iterate.

How do I get this stood up?

How do I put it in front of some users to see if it works? Because you don't want to spend a lot of time and a lot of money on server time if, the first time you show it, it's spewing out a bunch of garbage, you know what I'm saying? So anyway, that's why we're looking at llama.cpp: there are small models that are still good enough to be useful.

You're not going to hit ninety-something percent accuracy or anything like that, but it's good enough.

And we found that, starting from that approach, if you want to go up to a larger model later, it typically translates over pretty well. So that's what we're talking about now.

And again, as I was talking about before, I actually used a tool called perplexity.ai to generate this outline, which followed pretty much what I would have done anyway.

So it’s pretty cool.

On top of that, being able to run locally really keeps your costs down.

So when you're iterating quickly through something, that matters. Now, to do the installation:

It says it needs Python 3.8 and a C compiler.

They do have prebuilt wheels that you can download directly.

They've got, I think, three different versions of CUDA they work with, where you can say I need CUDA 12.1 and pull a wheel back that's already built for you.

So a C or C++ compiler is not technically required anymore.

It also has bindings for a lot of things.

CUDA, OpenCL, and if you've got some platform that has a hardware acceleration interface of some kind, there's probably a backend it works with.

It was tricky to get this thing up and going with CUDA, because I do have a GPU on this laptop; it's got four gigs of VRAM, nothing special.

But I would like to use it. Before, I had followed the instructions and could not get the thing to use my GPU, even when I told it to. So I found, somewhere in a blog or something, that you actually have to set CMAKE_ARGS and then run the install. That was the only way I was able to get it to actually use my GPU when I ran it. And once you've got that, what you want to do (we'll look at the code in a second) is, when you kick this thing off, you tell it how many of the layers you want to run on the GPU, and then you tell it which GPU, which defaults to index zero.
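
Roughly, the install I'm describing looks like this. Treat it as a sketch: the exact CMake flag has changed across llama-cpp-python versions (older builds used LLAMA_CUBLAS, newer ones use GGML_CUDA), and the cu121 wheel index is the one I believe they publish for CUDA 12.1.

    # build from source with CUDA support enabled
    CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

    # or pull a prebuilt CUDA wheel instead of compiling
    pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121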

So I got a little more excited about that.

In this case, you can see that it's actually using the GPU, and this number is actually not correct, because I don't have nearly that much memory on this machine, but it'll show you kind of what it's doing as it loads. If you don't see this particular item, it's not running on the GPU. The other thing that we'll check out in a minute is one of the other things they've done with this particular project, since they've already wrapped it with Python.

They also threw an OpenAI-compatible API in front of it. So if you wanted to use this instead of using OpenAI, you could do that.

So a lot of people are actually spinning this up locally to use for debugging and troubleshooting and iterating quickly, for free.

And then to go to production, all you do is change the URL and now you're going directly to OpenAI with the same program. I mean, that's pretty cool.
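
As a minimal sketch of what that swap looks like, assuming the server extra is installed (pip install "llama-cpp-python[server]"), the newer OpenAI Python client, and a made-up model filename:

    # In one terminal, start the local OpenAI-compatible server, something like:
    #   python -m llama_cpp.server --model ./models/llama-2-7b-chat.Q4_K_M.gguf

    from openai import OpenAI

    # Point the standard OpenAI client at the local server; swap base_url (and use a
    # real API key) to go straight to OpenAI with the exact same code.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

    response = client.chat.completions.create(
        model="local-model",  # the local server doesn't care much about this name
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is llama.cpp?"},
        ],
    )
    print(response.choices[0].message.content)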

There's also, and I don't mention it here because I haven't done it personally yet, a Docker container that has the same basic thing in it. Docker is going to get a little more interesting, because with Docker on a VM, the thing you run into a lot is making sure you've got your GPU passed all the way through into your container. That can get a little tricky sometimes. So, loading the model. What we're using right now: most of these models are hosted on Hugging Face.

Most of what you’re going to see.

Do you remember the name of the guy, what's his name?

Georgi something, do you remember?

I do not.

Sorry to put you on the spot there. Anyway, his initials are GG, so that's where the GGML format comes from. Basically, he's the one that came up with the format for how to store these models in the quantized fashion.

So after he got that going, there were a lot of additional models that all came out in the same format.

In this case, you drop down to the bottom where he's got all these files for download, and you can see 2-bit quantization, 3-bit quantization, and so on.

Hang on one second.

Sorry, are you going to be here with Shack?

This going on at 6.30?

Yeah, I know that.

I thought you were, were you looking for a film group or just a different event?

No, it’s a new section which I’ll be covering with that. Okay. I don’t know about an event.

Sorry about that. So anyway, there are all kinds of different models available and you can run any of these.

And you can see they've kind of got cheat sheets for the size of each file and how much RAM is required if you're running it on CPU, and which ones are recommended. It also basically says, hey, this one's not recommended; you can play with it, but... So we wound up using the one that's typically recommended, which is the 4-bit quantization.

And I've got around 20 gig free to fill up, you know; I've got 32 on these machines.

I'm good there. There are a lot, a lot of models available. I mean, if I just search for GGUF...

And then maybe Llama, or some Llama variant. Wait a minute.

Let me go back. I saw that.

You see.

Oh my gosh.

I don't see a GGUF one on there.

It won’t take long.

Oh, there we go.

We may actually have to check that out.

So as part of the llama.cpp library, it also includes the application you need to actually do the quantization of a model.

So if you find a Llama-architecture model that isn't quantized, you just run that tool on it and it'll spit out all of these.
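
For reference, the flow is roughly like this. This is a sketch only: the script and binary names have shifted across llama.cpp versions, and the paths here are made up.

    # convert the original Hugging Face weights into a GGUF file
    # (the conversion script has been convert.py or convert_hf_to_gguf.py depending on version)
    python convert.py ./models/llama-2-7b-chat --outfile ./models/llama-2-7b-chat-f16.gguf

    # then quantize it down, e.g. to the 4-bit Q4_K_M variant
    ./quantize ./models/llama-2-7b-chat-f16.gguf ./models/llama-2-7b-chat.Q4_K_M.gguf Q4_K_M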

I don’t see where the actual.

I’m getting a little off track here, but we may have to go look at that next time.

Oops.

I also have a bad habit of jumping out of the presentation I was showing.

So there’s that. We’ll show the code for how to initialize it.

You can set the context length.

Currently the default for llama-cpp-python is 512 tokens, which is not a lot.

The base model itself goes up to 4096.

So you can play around with a lot there.

I just haven’t gotten into that yet.
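
To make that concrete, bumping the context window happens when you construct the model. A small sketch, with a placeholder model path:

    from llama_cpp import Llama

    # n_ctx defaults to 512 in llama-cpp-python; Llama 2 models support up to 4096,
    # so you can raise it at load time (a bigger context uses more memory).
    llm = Llama(
        model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # placeholder path
        n_ctx=4096,
    )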

When you're running it, we'll actually look at the high-level API, the OpenAI API.

Of course there has to be a project called OpenAPI in this space, because I cannot say the words "OpenAI API."

It just doesn’t seem to work.

I don't know if I've still got this running; let me check. I think I can shut it down. So, there is an OpenAPI.

I swear I'll get that right at some point in my career.

Anyway, if you run this as a server, it spins up, and you can actually go to port 8000 at /docs and it provides you the documentation for the REST interface, in compliance with OpenAPI.

I haven’t played around yet with actually connecting code I’ve got with this API.

That's actually going to be a whole other session, going through all the different OpenAI kinds of connections.

I may find somebody that's done a lot more of that kind of work than me to run that one. Let's see.

Things I still haven’t looked at yet.

Apparently there's a... well, I've done a little bit of CPU-versus-GPU performance testing, because you can say how many layers you want to push to the GPU and how many you want to keep on the CPU. OpenAPI is the new name of Swagger. Oh, that's it.

Swagger. I don't know if you've ever used Swagger, but if you have, this looks a whole lot like it. Makes me wonder if they're using Swagger underneath to make the API docs for it.

Swagger lets you read a REST-style interface. You can interrogate the server to get the API from it: hey, I'm trying to log in,

do you happen to have a REST route that does login, that kind of thing. I haven't done anything with the right number of threads yet.

So that’s probably a little later.

I'll send something out if I actually wind up getting into it. What we will likely wind up doing is play around with this kind of thing just locally, and then at some point I'll wind up standing this stuff up as a full stack.

Currently the embedding model runs as a service.

The vector store runs as a service.

The paragraph chunker thing runs as a service.

This will run as a service.

So basically I'm a Docker Compose away from just throwing it up in AWS Fargate somewhere.

Keep everything private on the network except where I want to expose a port for the UI side. Which is basically the same approach I'm using for the transcription thing now. The other thing I need to do, following up on that, is, after we figure out what kinds of things we need to run, I can actually go back and do a cost model for how much this is going to cost to host per month, just as an overall estimate.

That's one of those things a lot of people will kick around, and both Amazon and Azure and some of these other providers are great at throwing credits your way while you're playing around with stuff at a low level. Then, as soon as you want to go to production, you need to run this thing 24/7 over time, and the cost goes up pretty quickly. But for now we'll jump over to some code. I actually stopped the one that was running.

Let me see if I can blow that up some. Alright, so I also have to remember to look at my laptop and not that screen, because it's about two seconds of delay and it really throws you off. I downloaded three different models locally. I've got one that's a code model, just because I wanted to play around and ask it coding kinds of questions.

Not really using that tonight.

And then I've got a 7 billion parameter Llama 2 chat model and a 13 billion parameter Llama 2 chat model; both of them are the 4-bit quantization.

So, as you saw, you basically just import the library and create a model.

Give it the model path to whichever one you want.

And then how many GPU layers; in this case I'm asking for 20.

I'm going to play around with that a little bit after we get it running.

You can actually go back and see in the output how much RAM it's using on your machine and how much memory it's using out of the GPU.
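
Here's a minimal sketch of that initialization. The filename is a placeholder, and verbose=True is what gets you the load-time printout of memory and offload info:

    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # whichever GGUF you downloaded
        n_gpu_layers=20,   # how many transformer layers to offload to the GPU
        main_gpu=0,        # which GPU to use; defaults to index 0
        n_ctx=2048,
        verbose=True,      # prints the offload and memory info as the model loads
    )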

So pretty much I started with 10 layers and worked my way up to using as much GPU RAM as I can. We'll also have to have a follow-on discussion about prompts, because, dang, this is complicated.

There are actually some folks using an LLM to dynamically help them build the prompt to send to the other LLM.

So it figures out, you know, based on context, what I'm asking for and what I need back.

In this case.

I actually had to shake one thing.

No, I left this one in. Okay.

So in this case, there’s a system part of the prompt where you’re telling the system what it’s supposed to be.

So in this case, "a helpful, respectful and honest assistant." I stole that from some website somewhere that was doing similar things with RAG.

I need a helpful answer, so don't Stack Overflow me and ask why I'm doing it that way.

You know, that's the normal Stack Overflow experience, and a lot of what's on there is ten years old anyway, so the top answer is not always the right option.

A lot of people would think of these as guardrails.

So there’s that. Here’s another fun thing.

In other words, this is basically the "don't hallucinate" section. If the question doesn't make sense, don't make up an answer; explain why you can't answer instead. Things like that. Answer the question as simply as possible.

That’s a fun one. Some of these models, if you don’t tell them to be succinct, they will just roll and roll and roll.

And if you're paying a service like OpenAI, they charge you per token.

So the more you use, the more you pay, and it's fairly cheap per token.

But still, if you've got a boatload of users, it all adds up.

If you're getting, you know, soliloquies when you should be getting a sentence, that's going to cost a lot. Then: if the answer comes from different documents, mention all the possibilities and use the titles to separate the topics; only answer using the given documents. And in this case, I only gave it basically one blog post that I'd captured from somewhere, just about how to use a transformer.

It's probably something from, you know, Medium or somewhere, on how to use a Hugging Face pipeline transformer or something.

So I give it this one blog post as context, and I ask the question: what type of model do we pass the rules to?

Because that's right here in the text: pass the rules to the transformer model.

You can see the answer is actually in there.

We can try again if I ask it something totally different that’s not in here to see if it actually follows the rules.
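
For reference, here's roughly what that prompt setup looks like using the high-level chat API. The system rules and the blog-post context below are stand-ins for what's on screen, and the Llama object is recreated so the snippet stands on its own:

    from llama_cpp import Llama

    llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", n_gpu_layers=20, n_ctx=2048)

    system_prompt = (
        "You are a helpful, respectful and honest assistant. "
        "If a question does not make sense, explain why instead of making up an answer. "
        "Answer as simply as possible, only using the given documents, "
        "and use the document titles to separate topics."
    )
    context = "Blog post, page 1: ... how to use a Hugging Face pipeline transformer ..."
    question = "What type of model do we pass the rules to?"

    output = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Documents:\n{context}\n\nQuestion: {question}"},
        ],
        max_tokens=256,
    )
    print(output["choices"][0]["message"]["content"])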

The other thing I found useful in llama-cpp-python: you can actually have it tokenize your prompt, so you can figure out what the length of your prompt is. If I've got a model with a 512-token context, I've got to fit everything I'm going to ask into roughly 500 tokens.

And a token doesn't map exactly to a character or a word; some words, if they're compound or multi-syllable, might wind up being three tokens.

Other words might be one token.

You know, so it just gets kind of interesting that way. A lot of what I see here: if you put this on the front end of a RAG, what you would want to do is first get the question, then query your vector store to see what chunks you might have related to that question.

And then basically you want to keep adding chunks to your context until you get close to that input length.

So that’s kind of a way to build that dynamically.
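
A rough sketch of that idea, assuming the retrieved chunks are already ranked best-first; the chunk list, headroom number, and model path are made up:

    from llama_cpp import Llama

    llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)

    question = "What type of model do we pass the rules to?"
    retrieved_chunks = ["chunk one ...", "chunk two ...", "chunk three ..."]  # from the vector store

    def count_tokens(text: str) -> int:
        # llama-cpp-python tokenizes bytes, so encode first
        return len(llm.tokenize(text.encode("utf-8")))

    budget = llm.n_ctx() - 256            # leave headroom for the generated answer
    used = count_tokens(question)
    context_chunks = []
    for chunk in retrieved_chunks:        # assumed sorted by relevance
        cost = count_tokens(chunk)
        if used + cost > budget:
            break
        context_chunks.append(chunk)
        used += cost

    prompt = "\n\n".join(context_chunks) + "\n\nQuestion: " + question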

So I print that out and then actually I just say, hey, here’s the prompt.

Basically I tell it to generate tokens until it runs out of context.

So you can actually give it a backstop to make sure it doesn't run long.

And then I was just checking, printing out the actual structure it gives you back from this.

The structure is kind of odd.

I’m not quite sure when it gives you more than one choice.

I haven't played with it enough to actually get it to give me more than one answer.
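
For what it's worth, the completion call and the structure it hands back look roughly like this; the stop strings and token limit are just example values:

    from llama_cpp import Llama

    llm = Llama(model_path="./models/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)

    output = llm(
        "Q: What type of model do we pass the rules to?\nA:",
        max_tokens=256,        # the backstop so it doesn't run long
        stop=["Q:", "\n\n"],   # example stop strings
        echo=False,            # don't repeat the prompt in the output
    )

    # The result is an OpenAI-style completion dict; normally there is a single choice.
    print(output["choices"][0]["text"])
    print(output["usage"])     # prompt and completion token counts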

So here goes nothing.

If I try to run it, we’ll see what happens.

So, start from the top. You've got to keep scrolling to find this thing, somewhere in here in this big mess. All right, so now it's telling me what it was using... All right: offloaded 20 layers to GPU.

So it's using 2.3 GB of GPU memory.

And the context length of my prompt is 422 tokens.

Actually, here's Task Manager.

Go look at it, and over here on my GPU you can see it's actually starting to pick that up. If I go back over to where I was running it... well, it just finished, I think.

Anyway, I’ll be back over.

Here's the answer I got: "According to the documents provided, we can pass the rules to the transformer model or use a Hugging Face-based pipeline to wrap it."

"A LangChain prompt template can be used to wrap the prompt."

Sometimes, and again, I’m not seeding this with a particular random seed.

So if I ask it again, it will give me a slightly different answer usually.

So anyway, there's some randomness involved.

In the group, I’ll notice the next one that’s coming up.

We should actually initiate folks by having them go find the light switch.

It doesn't look like a light switch. Yeah, we could. It's just that button. I'll run it again while we're talking and just let this thing go and see what it gives us. Usually, and again, some of this depends on the prompt you provide:

this one just says "according to the documents provided"; occasionally what I'll get is "according to the blog post, page 1," blah, blah, blah. It depends on whether you formed your prompt correctly, which is why it takes a good bit of iteration to dial this thing in, based on what kinds of questions and what kind of documentation you've got.

That way, you can actually, after you’ve asked it a question, you can have it cite the references that it had in its answer.

A moment.

Which is super useful.

"Therefore, the answer to the question is: we pass the rules to the transformer model."

This is almost like a scientific kind of conclusion.

Now I'm waiting for a QED or something. Oh, a little square.

Yeah, right.

So the other thing we can do is switch over from the Llama 2 7B to the 13B. And again, the general rule of thumb without quantizing is this:

If I have, in this case, a 7 billion parameter model, I would need about 14 gig of memory, RAM or GPU memory, to run it.

If I have a 13 billion parameter model, I need 26 gig or more to run it. But since it's quantized, you know, it's much, much smaller.

I can actually run a 13 billion parameter model.
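
Rough back-of-the-envelope, just to show where those numbers come from:

    7B parameters  x 2 bytes (fp16)            = about 14 GB
    13B parameters x 2 bytes (fp16)            = about 26 GB
    13B parameters x ~0.5 bytes (4-bit quant)  = roughly 7-8 GB plus some overhead

which is why the quantized 13B fits on a 32 GB laptop with a few layers pushed to a 4 GB GPU.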

I have noticed that when I run it cold, the first time it takes a little bit, I’m guessing it’s a hardware kind of thing.

The first time I run something like this is where I've almost got a full 4 gig used on the GPU.

It’s like the first time it’s a little slow.

And the next time I try to run the prompt, it’s a good bit faster.

I think it’s already got certain things loaded up or queued up properly.

So, while that's going, I will also do a... I think I can just do it here.

So, there are a lot of parameters that you can send this thing.

You can tell it how to split tensors across GPUs, you can tell it the context size, you can tell it how many threads to use.

I guess that was one of the important things: there's also a random number seed you can set. And you can actually have it set up to where, instead of sending it the text, you go ahead and turn the text into an embedding and send it the embedding to use, assuming everything is using the same embedding model.
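
A sketch of a few of those knobs in one place; the values are arbitrary, the path is a placeholder, and the embedding piece assumes you load with embedding=True:

    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",  # placeholder
        n_ctx=2048,
        n_threads=8,          # CPU threads for the layers that stay on the CPU
        n_gpu_layers=20,
        seed=1234,            # fixed seed for repeatable sampling
        embedding=True,       # also enable embeddings from this model
    )

    # With embedding=True you can pull vectors straight from the same model:
    vec = llm.create_embedding("How do we pass the rules to a transformer model?")
    print(len(vec["data"][0]["embedding"]))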

Again, let me, while this is going, I’ll drop back over to Task Manager.

Because I have the laptop in my lap, I can actually feel that this is working hard. So, yeah, CPU is pretty far up there. Memory is right up there too.

And even on the GPU, it's sitting at nearly the max there.

And with a 13B running on a laptop, it struggles a bit; sorry, it'll eventually complete. But again, what I did initially was run the biggest model I could fit here, check what I was getting out of it, and then back off to something I could iterate and spin on. As long as I was getting similar answers, I knew I could turn things around faster with the small model.

And then later I can step back up, either by pushing this file somewhere or finding a friend with a GPU that lets me run the bigger case.

We may spin this up again somewhere else.

This might take a while to complete. All right. Now, this time you actually have to go find the light switch, which is a pain; it's not obvious, and none of the other options work, it's all those little buttons.

Yeah. The first time that happened, how many of us were walking around here trying to figure out how to turn the lights on? How many AI experts does it take to turn on the lights, and the answer was all of us. So, any questions while we wait on this thing to come back? One thing I will note, and we'll have to look at this pretty hard when we apply it to NASA documentation: most of the models you'll see, especially the chat models, have guardrails actually put into the model where it will not tell you how to do certain things.

There are actually competitions on how to get around that, because they really, really want to make sure that I can't just go ask this model:

hey, how do I make a bomb that will fit in the back of a U-Haul; you know, it's on purpose.

Interestingly enough, some of the things NASA does are not "safe," you know; some of them actually are explosions in a particular direction.

You know, so if I have a model that has those guardrails and I try to use it for that.

I'm talking about propulsion and chemicals and reactions and stuff like that. It's going to be really, really interesting to see what happens. So, that's when I got the answer. This was the one with "based on the information provided in the blog post, page 1."

"We pass the rules to a transformer," specifically, you know, so this is basically showing that to get a more precise answer, we had to step up our model size.

But the smaller models are still perfectly okay to, you know, go bang on and do things.

In other words, if it takes me five minutes every time I want to test something, I’m going to be sitting here for a long time.

So that's what we've currently got. Let me go back to the presentation, because Perplexity even gave me a recap. We're supposed to talk about future developments and roadmaps, which I'll skip. And similar to what we're building, it also provides references back to where it got its data from.

Which it also cites throughout.

So we’ll wind up with something like this for the NASA documentation.

And then our hope is, if we get that spun up and show them, go do the demo,

hopefully they'd be happy to take over ownership of it and host it themselves.

With some care and feeding, of course. So that's all I've got for tonight. Next up in this series: we definitely have to do something with prompts.

Because there’s like eight million ways to build this particular kind of prompt, you know.

And it actually has a pretty big impact on what you get back.

Like I said, I mean, you can say act as a helpful assistant, or, depending on your use, you could say act as a scientific researcher.

I want this, this or this.

Based on your use case, you can do all kinds of things with that.

So prompt engineering, and the OpenAI interface, are probably good talks.

And then when we get towards the end of it, at least this current series, Josh has been doing a lot of work on evaluation of RAGs.

How do I know if this thing’s giving me good answers or not?

Especially in domains like, what, physics and space travel.

Not my background.

So how do you know? If you're building it for a domain that you're not super familiar with, how do you measure it?

So there’s some, there’s some standardized ways to do that.

That's what we'll cover, and hopefully finish up.

I would love to finish standing this thing up right around the time of the Space Apps Challenge and roll it right in. When is that? Isn't it towards the end of October, I think? Yeah, because usually we'd do that in October, followed by the next one here. We've done two different remote hostings, where we actually hosted it as a remote event.

That started in COVID, when they couldn't get everybody there in person.

We were one of the few in the US doing a remote one here to back them up. So there's that.

And then it was always Space Apps, then the AWS one, re:Invent, I think, whatever the Amazon event is called. So usually by the time December got here I was pretty fried. All of that stuff gets posted on YouTube and whatnot after it's done, and it is massive amounts of information. I mean, even when we were doing the re:Invent recaps, it was hard; we were basically doing like a three-hour session or a two-hour session, and for the three-hour session we picked from like 300 sessions, whatever it was. It was a lot. All right, let me stop recording.