Private RAG Hosting and Cost

Transcription provided by Huntsville AI Transcribe

It’s got some pretty heavy-hitting companies associated with it. Locally we’ve got Deloitte and Booz Allen, UAH, and I think Proud and Tears is local. Technically some of the others like AWS and Microsoft have local offices and stuff, but I don’t know if the people talking would be local. That’d be fun to do.

So basically, looking at AI and cyber — we’ve had a couple of talks here from an AI and cyber perspective, but didn’t get too deep into it. There’s a group called the AI Research Collaborative out of the ITC over at UAH that is trying to put this together, along with Amplified Security, which I hadn’t heard of before. So basically what they’re talking about — let me just jump down here — it starts at noon with some lunch, then a keynote.

They’ve got a couple of panels loaded up.

I wanted to talk about cyber, cloud and AI.

Some of the companies providing speakers look to be like they’re pretty significant. I’m pretty interested in the AI and threat analysis to see what that’s about. And then also, since we’ve been on this RAG series for a while, I’m kind of interested to see what Dr. Davis has about how to safeguard RAG systems.

That’d be interesting to see. And then at 4:15 there’s a Women in AI panel.

So hopefully I’ll meet some of them, and maybe some of them will end up joining us occasionally.

And then of course, a networking and happy hour. So looks like fun. So the topic tonight we were talking about, I think too. What’s up, buddy? Thanks for watching. Get rid of that little box. Yeah, get rid of the box. That’s about it. Okay. Oh, it’s Jeremy Davis.

He spoke at the thing that we wanted to say. Okay.

He’s good.

Okay. He might be going for a check on.

He wasn’t the same one that didn’t quite understand few shot learning. No, he did his stuff. Okay. He was like, did he hear a slingshirt? Yeah. Slingshirt’s the whole thing.

Okay. Yeah.

So if you’re watching this later, sorry — that was a phone call in the middle of the thing. So: private RAG hosting and cost. We’ve been talking about retrieval augmented generation for several sessions now.

So we’re now at a point, with the piece we’ve got with NASA, where we did the document collection first, then the chunking, the embedding, then storing it in a vector store, and then working on the prompt that you would use to put those together. And then the actual easy part is shipping that over to a large language model and getting an answer.

I thought that would be harder than it is. It’s easy to mess up, but getting it off the ground is not hard. We even covered llama.cpp and then llama-cpp-python to actually, you know, run through some of that locally. And so now we’re at the point of: okay, you can build one, but how do you deploy it and let a user or a customer or a client or somebody actually use the thing?

Because I don’t want them all running off of my laptop.

That would not be fun. We would have gotten caught up in the CrowdStrike thing — because somehow I didn’t actually get hit by that. So anyway, what we’re doing now: we’re going to start off with a basic case. Because when we talk about private hosting, I don’t necessarily mean hosting it on your own machine. We’re going to look at cloud providers, and at different services that provide some of the same pieces, where you can either go through their API and pay them, or you can pay somebody to host it for you. Either way, there’s some money that has to change hands to do this. So, reference points. In order to make this meaningful for really anybody, I needed to actually draw a line in the sand as far as what it is we’re talking about. You get into how many tokens per transaction, or how much space do I need in a database — there are so many “it depends.”

I was going to use this as a reference point.

So I have 490 NASA documents in PDF form.

We have chunked them into paragraphs.

I did eyeball a few of them and check the chunking a bit, and there was a piece that was incorrect where it was taking each bullet point of a list and turning each one of those into its own paragraph or chunk, which isn’t really useful. So I made a minimum paragraph size: if a chunk isn’t over a certain size, go ahead and wrap in the next one.
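
A rough sketch of that minimum-size merge, with a made-up threshold and function name (not the exact code from the project):

```python
# Merge undersized chunks forward so stray bullet points don't become their own chunks.
MIN_WORDS = 40  # hypothetical threshold; tune against your own documents

def merge_small_chunks(paragraphs: list[str], min_words: int = MIN_WORDS) -> list[str]:
    merged: list[str] = []
    buffer = ""
    for para in paragraphs:
        buffer = f"{buffer} {para}".strip() if buffer else para
        # Only emit the buffer once it is over the minimum size.
        if len(buffer.split()) >= min_words:
            merged.append(buffer)
            buffer = ""
    if buffer:  # whatever is left over at the end of the document
        merged.append(buffer)
    return merged

chunks = merge_small_chunks(["* item one", "* item two", "A full paragraph of real prose " * 20])
```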

Seemed to work pretty well after checking it later. So, 490 documents.

I now have 16,452 paragraphs, right at just under five million words.

That comes out to about 300 words per paragraph or chunk. Right now I’m using an embedding size of 384, which I think is the BGE-small size.

That will affect some of the calculations when we get to vector storage that we’ll play around with a little bit.

Currently — and these are somewhat made-up numbers — I’ve got a system prompt that we covered last week.

The short, concise one was around 100 words.

Just to show what I was looking at initially — I’m not sure if there’s guidance anywhere for how much context you should send right now.

I’m looking at the top 10 chunks, combining them, shoving that over as the context, and using that to get my answer.

I don’t know if that’s normal; I don’t know how much you usually use. So anyway. And then I’m also assuming the output is a similar size to the prompt.

So in other words, if I give it a prompt that’s 3000 words, I don’t know if that’s right or not, but I had to pick a number.

So currently I’m looking at 3,600 total words in the prompt, as well as in the output, because that gets a little interesting on the OpenAI side, which we’ll get to in a minute.

Current back of the envelope, I’m using 1.3 tokens per word.

I think that’s about right for English text; you can’t get a solid answer, but 1.3 seems to be the rule of thumb everybody uses. And then I just picked a fairly arbitrary number: we’re going to look at maybe 100 queries a month.

So if you’ve got five users who did it a few times a day, maybe that’s, you know, but again, had to pick some number.

So if we’re using a pure web host, in other words, I’m hosting the UI for this thing on a virtual machine somewhere.

And then I’m using OpenAI as a service, where I’m using their API, and I’m using Weaviate as a service, where they’re hosting the database and all. So Weaviate is basically $0.095 —

So nine and a half cents per million dimensions per month.

So for me, it’s this many paragraphs times my embedding size. That’s why the embedding size matters — as the embedding size goes up, so does the cost. And this number can change quite a bit because it’s a multiplier: total dimensions divided by a million, then multiplied by the rate.

So I’m paying roughly 60 cents a month for that.

But Weaviate actually has a minimum of $25 a month, and you don’t break even until, in this case, you hit about 685,000 objects.
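
Back-of-the-envelope version of that math, using the numbers above (the $0.095 rate and $25 minimum are as quoted here — check Weaviate’s current pricing page):

```python
paragraphs = 16_452
dims = 384                     # BGE-small embedding size
rate_per_million = 0.095       # dollars per 1M stored dimensions per month (as quoted)
minimum_monthly = 25.00        # monthly minimum (as quoted)

total_dims = paragraphs * dims                        # ~6.3M dimensions
raw_cost = total_dims / 1_000_000 * rate_per_million  # metered cost
billed = max(raw_cost, minimum_monthly)

# Objects needed before the metered cost exceeds the $25 minimum:
break_even_objects = minimum_monthly / rate_per_million * 1_000_000 / dims

print(f"raw ${raw_cost:.2f}/mo, billed ${billed:.2f}/mo, break-even ~{break_even_objects:,.0f} objects")
# -> raw $0.60/mo, billed $25.00/mo, break-even ~685,307 objects
```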

I went through some of the other database providers — some for Weaviate, some for Postgres, some for others — and nearly all of them have a minimum line where you’re going to pay at least that much.

And then you get some kind of grace amount until you grow into it. It reminds me of early cell phone plans: you got 30 minutes for free, and then it started charging you after that. So there’s that. Let’s see.

OpenAI pricing: just looking initially at GPT-4o, it’s $5 per one million input tokens and $15 per one million output tokens.

And that was a little curious.

I don’t remember having a different price for output than input.

Okay.

Outputs are always more expensive.

Okay.

Because that’s where it’s actually generating something. And they have a mini tier, too. So again, this is an assumption that the output is a similar size to the total prompt that we put in.

Again, I had to pick a number.

So with that, you’re roughly just under 10 cents per query for those numbers.
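
Roughly how that “just under 10 cents” falls out of the assumptions above (3,600 words each way, 1.3 tokens per word, GPT-4o at $5 in / $15 out per million tokens):

```python
words_in = 3_600        # system prompt + question + top-10 chunks
words_out = 3_600       # assumed output is about the same size as the prompt
tokens_per_word = 1.3

in_tokens = words_in * tokens_per_word    # ~4,680 tokens
out_tokens = words_out * tokens_per_word  # ~4,680 tokens

cost = in_tokens / 1e6 * 5.00 + out_tokens / 1e6 * 15.00
print(f"~${cost:.3f} per query, ~${cost * 100:.2f} per 100 queries")
# -> ~$0.094 per query, ~$9.36 per 100 queries
```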

So Web Host.

Several different places.

I looked at basically just enough CPU and memory to host a website doing minimal work.

This is kind of where some of your connecting logic is.

It sends the query over to one service, gets results back, sends it over to the other service, gets results back, and sends that to the user. One thing I don’t have in here, I just realized: this is the part where you get to this point and people are like, well, why isn’t this like ChatGPT?

Why not? Well, it’s not remembering any of my other stuff. Crap. I need another database.

Or another storage thing.

It’s not something the vector store handles — which could be a reason to go Postgres instead of Weaviate, because you can do both in the same database.

Some of these providers — I think it’s Heroku, maybe DigitalOcean — actually give you a small built-in database on the VM, on the little container thing itself, up to like 500 meg or whatever. It’s small.

But if you’re just trying to get something spot up in front of a user, it’s probably worth looking at there.

So basic for this bare bones, you’re looking at $40 a month at least just to get off the ground where I have something I can put in front of a customer and show it.

So next up.

Self-hosting.

The UI, Weaviate, and the LLM. In this case, I’m not using any of the services; I’m rolling my own stuff. And the good news is that at this point I’m not paying on a per-query basis — I’m just paying for the hardware this is running on.

The downside is that even if nobody’s querying anything, I’m still paying for the hardware.

I didn’t quite do the whole “how many queries would it take to break even” part here — well, we may do that in a minute. So for this one, the Weaviate documentation recommends — the best I could get out of them was four CPUs, or four cores, and 16GB of RAM — to have a database that is not giant but is performant.

And pretty good latency on queries, things like that.

The other thing I realized, I learned about this when I was going through it.

I didn’t realize until I was going through it that Weaviate actually has a pretty good way to back up its data to either a local drive or an S3 bucket or whatever. And there’s an API you can use. It’ll actually back up while it’s live, and then tell you when it’s done. You can restore while it’s live too — it’ll pull the data back in, all those things. Hello there.
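
For reference, the backup call looks roughly like this with the Weaviate Python client — sketched from memory against the v3-style client, so check the backup docs for the exact parameters:

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

# Kick off a live backup to S3 (the "s3" backend module has to be enabled on the
# Weaviate server; "filesystem" works for a local drive instead).
client.backup.create(
    backup_id="nasa-docs-backup-1",
    backend="s3",
    wait_for_completion=True,
)

# Restoring later is the mirror image, pointed at the same backup_id.
# client.backup.restore(backup_id="nasa-docs-backup-1", backend="s3")
```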

There.

Starting with 8.

That’s okay.

We are recording by the way. Just so I’ll be on my. Yeah, everybody. Everybody be a third of your teaser. Let’s see.

So that’s Weaviate.

What I’m looking at for the LLM is basically something running on a GPU with 24GB of VRAM. I think it was about this time last year that we actually did a session on what GPUs are available on virtualized machines.

Because some of them will show you a GPU and a price.

As soon as you go to buy it, they say, okay, cool, we will put you in the list. We will let you know when this is available for you.

Which, I think in our office, we’re still waiting. We’re still waiting — a year and a half now — for Microsoft to say, yeah, you’ve got it. That’s a big one. So anyway, back to this one.

I picked 24 because you can get GPUs in the 24GB range at the moment, and we’re not talking about high-end stuff. We’re talking about the leftovers from what people moved off of onto, you know, A100s and that kind of thing. So currently I’m looking at like a 13B-size LLM.

There are some things that have dropped recently and others that are in the — what was the Gemma one you said? 27B. So a 27B quantized to 4 bits would fit.

And then looking through that, the easiest approach to get this thing off the ground.

I was initially looking at running one virtual machine for Weaviate, and another virtual machine with the GPU on it to do the inference on the LLM.

And also using that same LLM machine to run something that does the embedding for you as well.

You may not be able to — I don’t know if you can run multiple models inside the server part of llama-cpp-python. I know you give it a model when you start it up. But maybe you can; I don’t know, I’d have to check.

So the easiest approach: I noticed that most of the virtual machines I was looking at on the GPU side had tons of RAM and plenty of CPU headroom, because most of the workload I’m going to put on them is nearly all GPU-based for the LLM. So they probably have enough CPU left to host both Weaviate and the LLM on the same machine. I haven’t tried it yet, but on paper it works out.
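
A sketch of what that one-box setup could look like, sidestepping the multiple-models question by doing the embeddings in-process with the BGE-small model and letting the llama-cpp-python server handle just the chat model. The class name, port numbers, and model path here are assumptions, not the project’s actual code:

```python
# One GPU VM: a local Weaviate container, embeddings in-process, and a
# llama-cpp-python server started separately, e.g.:
#   python -m llama_cpp.server --model ./models/chat-model-q4.gguf
import weaviate
from sentence_transformers import SentenceTransformer
from openai import OpenAI

embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")   # 384-dim embeddings
db = weaviate.Client("http://localhost:8080")              # local Weaviate container
llm = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

def answer(question: str) -> str:
    vector = embedder.encode(question).tolist()
    hits = (
        db.query.get("Paragraph", ["text"])        # hypothetical class/property names
        .with_near_vector({"vector": vector})
        .with_limit(10)                            # top-10 chunks, as above
        .do()
    )
    context = "\n\n".join(h["text"] for h in hits["data"]["Get"]["Paragraph"])
    reply = llm.chat.completions.create(
        model="local",  # llama-cpp-python serves whatever model it was started with
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
        ],
    )
    return reply.choices[0].message.content
```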

Plus, you’re removing the latency of that work having to jump from one machine over to another, get the answer, and come back. You could actually run your UI on this thing too if you really wanted to.

But that might be a bit much. So looking through this, I went through system specs from Lambda Labs, which is one we’ve used before.

I didn’t hit Paperspace, because they upset me last time — they kept charging me $8 a month even though I wasn’t running anything.

And I realized, oh, they have a base cost as well, even if you shut everything down. So that was fun. Currently Lambda Labs has a one-GPU instance.

I believe this is a, I don’t know if it’s a, anyway, I think I may still have it up.

Yep, there we go. So the Lambda Labs one was a Quadro RTX 6000 with 24 gig of VRAM, 14 virtual CPUs, 46 gig of system RAM, and enough storage to be slightly okay.

At 50 cents an hour — that was their minimum listing. So with this, I’ve got enough CPUs to run Weaviate, I’ve got enough RAM to run Weaviate, and I’ve got enough GPU to actually do something with the LLM.

And at 50 cents an hour, over 24 hours times 365 days divided by 12 months, got me to my cost of $360 a month.

For the bare bones, I’d have one VM that’s got all my stuff running. The other approach I looked at was going back to Amazon; they’ve got two different instances at their kind of low tier. So there’s a p2.xlarge — it’s got, I don’t even know where you find a 12-gig GPU anymore.

But that’s there.

It’s a thing. You could run a 13B model at 4-bit, or a 7B at 4-bit or 8-bit, something like that.

And that’s 90 cents an hour.

So you’re at $657 a month.
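
The hourly-to-monthly conversion behind both of those figures, for reference:

```python
def monthly(hourly_rate: float) -> float:
    # 24 hours a day, 365 days a year, averaged over 12 months (~730 hours/month)
    return hourly_rate * 24 * 365 / 12

print(monthly(0.50))   # Lambda Labs Quadro RTX 6000 -> ~$365/month (the roughly $360 above)
print(monthly(0.90))   # AWS p2.xlarge               -> ~$657/month
```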

Stepping that up to something a little closer on the GPU RAM side is a g5.xlarge.

And when you get it — I probably have this wrong — it’s at 16 gig of system RAM.

That doesn’t make any sense. Actually, if I still have it up, let me look. So compute.

Oh, here it was.

Seriously, I don’t know what they’re doing with that, but I could barely run it on there. Maybe.

So anyway, there’s that.

And that is also super duper expensive like $720 a month.

So then that’s the virtual approaches.

So is there a reason you wouldn’t do like a normal VM for Weaviate, your UI, or whatever it is, and something like Bedrock?

I don’t know what the value is if you’re not doing some very specific thing or have your own trained model.

Right. Because you can get a lot of 7B models on Bedrock, or Groq, or some of those sorts of things.

And then you’d still have to pay for the VM, but you wouldn’t have to pay for the GPU, so it would probably be cheaper. You’ve still got a lot of the same pieces; you’re just getting access to the GPU through the service.

Yeah, you can go to Groq or some of those sorts of things, and it’s cheap. Right. But what I was getting at on this one is: if you have something where you can use another service, great — that’s where the constraints come in. Here I’m assuming I can’t; I need to be able to either ensure that this is only accessible along with the data that’s here, or possibly run the software myself.

In a self-contained environment, something like that. Right. Something like that.

Or the case of S3 buckets: if I need to make sure that none of this data leaves the US, something like that. Well, I can go pin it to a particular instance or a particular region or whatever, but you’re going to pay through the nose for it.

I mean, that’s the main thing that we’ve learned so far.

I mean, the difference between $40 a month and even the best one here, which was Lambda Labs at 360 — that’s nine times more expensive. And that’s at a hundred queries. Yeah, a hundred queries.

I didn’t do that math. All right. So let’s — we’ve got some folks here who are currently okay at math or might have a phone.

So if my base on this piece is, let’s see, $25 a month for the database minimum, and then the web host on top of that, maybe — okay, so that’s about $30. And $10 gets me 100 queries. How many queries do I have to have before I break even against $360 a month? The difference would be $330, at about 10 cents a query — so that’s 3,300 queries before breaking even. Not 330 queries; it’s got to be 3,300.

Anyway, in other words, as soon as you pass that amount of queries coming in per month, it actually makes a lot more sense to flip over.
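
The same break-even math written down, with the rough numbers from above:

```python
flat_self_hosted = 360.00   # Lambda Labs box, per month
base_services = 30.00       # ~$25 Weaviate minimum plus a small web host
cost_per_query = 0.10       # just under 10 cents per query on the OpenAI side

break_even_queries = (flat_self_hosted - base_services) / cost_per_query
print(f"~{break_even_queries:,.0f} queries/month")   # -> ~3,300
```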

So that’s one of the good things about what we were looking at with llama-cpp-python using the same API as OpenAI.

I can write the code one time and deploy it this way, knowing that as soon as my number of queries per month gets over that line, I can actually flip it over without changing code. I just change where it’s deployed, or how it’s deployed.

And now my cap is $360 a month, whereas this one starts at $40 a month total but climbs with usage.
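
That flip is basically just a configuration change, since llama-cpp-python exposes an OpenAI-compatible endpoint. A sketch — the environment variable name and the local port are just examples:

```python
import os
from openai import OpenAI

# Point the exact same client at OpenAI or at a self-hosted llama-cpp-python
# server, depending on where this is deployed.
if os.getenv("USE_LOCAL_LLM") == "1":
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
    model = "local"     # llama-cpp-python serves the model it was started with
else:
    client = OpenAI()   # reads OPENAI_API_KEY from the environment
    model = "gpt-4o"

resp = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "What do these documents say about risk management?"}],
)
print(resp.choices[0].message.content)
```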

So, again, all depends on what you’re looking at.

Anyway, also from Lambda Labs.

Assuming you’re running it at your house, you’ve got to have a good internet connection, and power that doesn’t go out when the lightning flashes. Do I have that tab? Let me go look at this.

We’ll skip past that — the power has gone out here before, yeah. Sorry — it’s just a dual-GPU box, you know, a couple of 4090s. You know, “starting at.”

It’s great.

“Starting at” — of course. That “starting at” is what gets you as soon as you go try to build one of these. Where it’s like, oh, did you want that part too? As you put it together.

Yeah, I remember looking over this one when I was shopping. Sure, it has that capacity.

You click on that.

Like, okay, so you get one.

And you get one RTX at that. Oh, do you want to actually go up on your cores or your system memory?

Yeah, it’s either that or you don’t configure anything — it is what it is. Yes, I think this one is what it is, except for some warranty stuff.

But some of these, if you look at them, it looks good, and then you figure out you’re at the IKEA of computer shops.

And it comes in six boxes.

Some people probably like it.

But again, at that point, you’ve also got to run this thing at your place. A lot of internet providers don’t like you doing that, and you’ll see your IP address change over time — you’ve got to figure out how to deal with that, do the reverse proxy, know what a firewall means. And then what happens if something you do is popular? Not just popular with users, but popular with somebody trying to crack your system. So there’s that. Isn’t it like 200 a month for Google Fiber business?

Yeah, I think the high end right now is 130 for consumer. When you switch over to business and get that dedicated IP, I think it’s like either 200 or 225 or something like that. I thought that was too much.

Because I was looking to do that — I was looking to get fiber. You just can’t get it out here. Yeah, business. Yeah, they will not give you a static IP unless you go business. Yeah.

And then let’s see.

Absolutely cheapest.

This is where I try to figure out what is the barest thing I could do using free services, free tiers, whatever I can think of.

Let’s say I had a demo and I wanted to go show somebody, and I don’t really care that, for example, Weaviate only gives you a free database for 14 days.

And I verified yes it does disappear.

Yes, they’ll send you a note first, at which time you can use their backup API to back up your data and re-import it later — but it’s enough friction that at that point you’re probably just going to pay for it. Yeah. So: a web host, with either Heroku, a DigitalOcean droplet, or maybe a Streamlit thing like we’ve done before, where it runs off of your GitHub branch and they host it for free. And then you’ve got just enough pipeline pieces behind it, because you’re not high-performance, you’re not doing a lot of work there. But then you farm that out over to Weaviate, and then OpenAI, where you just pay the per-token cost. You’re probably around five to six dollars a month. The other interesting thing that I thought of.

And one of the reasons, up here, that I was looking at this self-hosting — putting everything on the same machine — is this.

If I’m hosting a virtual machine somewhere, and I’ve got Weaviate on that one and my LLM running on this other one, now I kind of have to care about authentication and encryption of the data between the two — all the kind of stuff that you have to care about.

Whereas if I’m running all of this internal to one system, I can run things in a containerized fashion on a container network where nothing leaves except for the port going out to the UI.

So, you know, I’m sure it could still be cracked, but not as easily.

So, things like that. But yeah, the other thing I didn’t put in here was hosting it at home ourselves. Which, going back to the NASA Space Apps challenge — correct me if I’m wrong, I think that was actually running on your machine; I’m not exactly sure of all the technical stuff you did to make that available. Yeah. It’s still up, I think — I have that machine off right now, but if I turn it on it should be right back up at the same URL. It’s just a Docker container on a network, with an nginx reverse proxy sitting behind another Cloudflare reverse proxy to mask my IP address. Cloudflare — that’s what I couldn’t remember. I knew it was something.

Yeah, so you’re using Cloudflare, so the URL goes to them. They figure out whether it’s a real request that they need to forward on to your IP address, or whether it’s some other, you know, thing going on. Okay. Are you on a static IP, or do you have to keep updating it when your IP changes?

I’m not on a static IP. Static IPs are really overrated.

Most networking gear itself even has the ability built in but you can also just run a Docker container to do it. Basically I just have a service running on my network.

It’s literally built into my router: hey, did this IP address change? If it did, use this key, go to Cloudflare, and tell it the IP address updated.

Okay.

There you go.
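
If your router doesn’t have that built in, the same thing is a few lines against the Cloudflare DNS API. A rough sketch — the zone and record IDs, token, and hostname are placeholders, and the endpoint shape is from memory, so verify it against Cloudflare’s API docs:

```python
import requests

CF_TOKEN = "YOUR_API_TOKEN"      # placeholder: token scoped to DNS edits
ZONE_ID = "YOUR_ZONE_ID"         # placeholder
RECORD_ID = "YOUR_A_RECORD_ID"   # placeholder

def update_dns() -> None:
    # Look up the current public IP, then push it to the existing A record.
    current_ip = requests.get("https://api.ipify.org", timeout=10).text.strip()
    resp = requests.put(
        f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/dns_records/{RECORD_ID}",
        headers={"Authorization": f"Bearer {CF_TOKEN}"},
        json={"type": "A", "name": "rag.example.com", "content": current_ip, "proxied": True},
        timeout=10,
    )
    resp.raise_for_status()

update_dns()   # run from cron or a small container on the home network
```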

Plus, you’ve got a little bit of a watchdog in there that way.

So, but that’s, that’s what I had for tonight.

Maybe a little under an hour tonight, but the hardest part of the whole thing was dealing with the different services and the different ways they try to get you to pay — and none of them are the same. The things to take away: the database services, most of the ones I found, have some minimum dollar figure. You’ve also got to think about how the system is going to be used — you don’t want to pay a lot per month only to find out it’s only going to be used a little bit. But you also need to make sure that if you get something out there and it starts to scale up, you have a plan: if instead of 100 queries a month you suddenly add zeros to the end of that number, this $40 turns into $440 and beyond; it goes up pretty quick as soon as you start hitting that. The other thing to look at, with Weaviate specifically: the small embedding model provided by OpenAI is around a 1,500-dimension vector, not 384.

They do provide a way you can use to reduce that vector to a smaller size and normalize it.

So you could still use their embeddings and do the reduction and normalization to get it down.
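
That reduction is exposed directly as a parameter on the newer OpenAI embedding models; text-embedding-3-small natively returns 1,536 dimensions, and asking for fewer has the API do the shortening and, as I understand it, the re-normalizing for you. A quick sketch:

```python
from openai import OpenAI

client = OpenAI()

# Full-size vector: 1,536 dimensions.
full = client.embeddings.create(
    model="text-embedding-3-small",
    input="A chunk of one of the NASA documents...",
)
print(len(full.data[0].embedding))    # 1536

# Same model, shortened to match the 384-dim BGE-small storage math.
small = client.embeddings.create(
    model="text-embedding-3-small",
    input="A chunk of one of the NASA documents...",
    dimensions=384,
)
print(len(small.data[0].embedding))   # 384
```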

But if you keep the full size, you basically multiply that storage math from up here by four. So this number, instead of 384 — the small size there would be, you know, around 1,500. Sorry, it’s 1,536. So what does that do? That turns the roughly 60 cents into a couple of dollars a month. You’d still fit within the $25 minimum, right? What is that, about $2.40?

Yeah, don’t hurt yourself with the math.

And they all — it reminds me of when you try to go buy a car and they don’t want to talk about the price of the car, they want to talk about the monthly payment. Here they don’t want to talk about the cost of the whole thing; they want to talk about the price of a token. When a token isn’t even a word.

It’s, you know, some 1.3 of them per word.

So there’s that. I guess with that will wrap it let me go ahead and stop the recording.