Transcription provided by Huntsville AI Transcribe
NOTE – we cover a lot of preliminary material about embeddings before we get to Weaviate. You may want to skip to the halfway point.
TRANSCRIPT:
It’s hard enough hearing myself talk on a recording.
Much less also having to see myself while I talk. Yes, so with that said, what we're going to talk about tonight is already queued up. So.
To get started: Alex, I think you brought this up before we actually got into this, as far as why do I need a vector database versus some other kind of database. So, some background info. When we're trying to do a search of text, we may not know the exact words we're looking for; I may just have a concept.
So what this allows us to do is take the phrase that I'm going to search for and convert it into an array of numbers.
The way that works is there’s a language model that has been trained over a lot of text.
And so when it sees certain words, it knows how that word is normally used in a sentence.
So for instance, let's say I have the word table.
And I use the phrase put that on the table.
Yeah, you know in your head what I'm talking about. If I use the phrase, well, I hear your argument, but let's table that for a later conversation, that's a different thing. Or if I were to say:
Periodic table.
Different thing again. I'm not going to put something on the periodic table.
You know, so.
Of course, we're at a biotech place here.
Who knows.
So when they're training these models to do the embedding, think of it as taking in a broader context.
So if I'm encoding a word, it not only encodes that word; the context of how it's used also feeds into it.
Which kind of lets you do a lot of fun stuff. The term for what we're doing tonight is semantic search. I'm not doing a Control-F, word-for-word search of these words. I'm taking the thing I'm searching for,
turning it into the meaning of what it means, and looking for any other document in my database that has a similar meaning.
And so when we embed something from the query side, it basically puts it into an array of floating point values.
And then for each thing that we're searching, we have also taken that piece of information and used the same embedding to turn it into an array of floating point values.
Then the search part just turns into cosine similarity, from a math perspective.
All the way across, which is super fast from a math or computer standpoint. And what I get back is something I can actually check to see what value was produced when it matched.
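NOTE – a minimal sketch of the math being described here, using NumPy; the two example vectors are made up:

    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        # cosine similarity: dot product of the vectors divided by their lengths
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    v1 = np.array([0.1, 0.8, -0.3])  # embedding of the query
    v2 = np.array([0.2, 0.7, -0.1])  # embedding of a stored paragraph
    print(cosine_similarity(v1, v2))  # closer to 1.0 means closer in meaning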
And that's the thing we actually used way back on the initial challenge we did for this, which was a Space Apps challenge in 2022.
We put something out there. The actual topic that we applied for:
NASA has 10,000 plus technical documents of all sorts.
Some of these go back to the 50s. Some of them are extremely interesting.
Some of these, as you read them, you wonder whether they should be classified or not.
Some of them are completely wrong.
I mean, you can imagine we’ve learned a little bit about space travel since 1958.
So it’s kind of fun.
So the topic that they had put out there for the challenge was that they were trying to find a better way, not just to search for something, but to explore the documentation they've got. You can imagine: I don't know exactly what I'm looking for, but I've got a thought or an idea or something. We were using semantic search to go through and find stuff that didn't exactly match word for word, but followed the concepts. So one of the things we did, once we had all of their sets of documents: Ben, I think it was Zeus that we searched for, something we knew wasn't in there.
Yeah, it was just any Roman or Greek gods, and it pulled up the Roman planet names for them instead, because it knew they were associated with each other. So it was able to make the conceptual jump: I'm talking about Zeus, Zeus is a Greek god, and Zeus is something you see referenced in the NASA material. You will find Apollo.
Yeah, you will find Jupiter.
If you search Zeus, you’ll find Jupiter because it’s another name for Zeus and you’ll find papers on Jupiter.
Yes, I mean, you find all this stuff, but that was just the proof that you don't actually have to know exactly what you're searching for. It's good enough to get close, which gives you a little bit of good and bad. Sometimes it gives you things that really don't match at all for what you're trying to do. But there was enough goodness in there that you could start walking through it: okay, now what about this; give me more information so I can be more precise in my question.
Yeah, so that was kind of what we were trying to do. Which brings up, and I don't want to go all the way back and explain exactly what RAG is, but it's part of this concept, not necessarily new, but it's really taken off in the last year: a way to use a large language model, but restrict the answers you get to be based on some data set that you actually need.
So in other words, you do a search first based on the query, get a set of results back, and then I tell the large language model: answer the question, but here's the material that's available to use.
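NOTE – a hedged sketch of that RAG flow; search and llm here are hypothetical stand-ins for whatever vector search and language model you wire in:

    def answer_with_rag(question: str, search, llm, top_k: int = 5) -> str:
        # search(question, k) -> list of passage strings; llm(prompt) -> str
        passages = search(question, top_k)
        context = "\n\n".join(passages)
        prompt = (
            "Answer the question using only the material below.\n\n"
            f"Material:\n{context}\n\nQuestion: {question}"
        )
        return llm(prompt)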
And there are some other approaches, like query rewriting and other types of things. After you get the results back, you can do some things; you may actually do another query based on other things that are happening.
I think some of that newer RAG stuff we've been looking at lately is pretty useful, but truthfully I don't know quite enough to even talk about it yet, because I haven't put my own hands on it.
So we started working through this. Initially, the very first thing, we wrote our own vector search, or vector database; it was all in memory.
We just basically put an array of embeddings in memory and then did a cosine similarity.
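NOTE – roughly what that first in-memory version looks like, assuming a NumPy array of precomputed embeddings:

    import numpy as np

    def top_k(query_vec: np.ndarray, corpus_vecs: np.ndarray, k: int = 5):
        # normalize rows so a plain dot product becomes cosine similarity
        q = query_vec / np.linalg.norm(query_vec)
        c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
        scores = c @ q
        best = np.argsort(-scores)[:k]  # indices of the k most similar paragraphs
        return best, scores[best]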
We didn't realize that was a viable business model; some other company did the same thing and got millions and millions of dollars. Well, I don't know if I want to spend my life doing vector searches.
Again, but no, I mean, it was kind of interesting. Then again, it was not necessarily a popular thing at the time. We were still talking about LLMs and how they hallucinate: if you give them interesting questions, they'll try to give you an answer that they think you want, or that you would find acceptable. So anyway. Can I ask a couple of beginner kind of questions? Did you just give it a word, or can you give it a small phrase, if you don't know the exact word but you want to get into the subject? You can give it a phrase, or you can give it a picture. I haven't gotten to the point where we're doing audio yet; that's something I'd love to get to. Right now it's text in and text out. The big thing with GPT-4o is that it's true audio and image in and out, so it's not decoding things into a string first, like I'm doing; it's reading the image directly.
GPT-4o is in and out, end to end.
You know, it’s not calling out to anything else to make an image.
It's not calling out to DALL-E; it can just diffuse that itself. Yeah. Now, you need a huge database.
Where is that database?
What permission do you have to access it?
Well, speaking of a huge database: these vectors that we have, this is the weirdest part about it all, the part that doesn't make sense to me logically.
What we've got now is paragraphs and paragraphs. Right now, the way we've done it, we took all of those NASA documents, broke them down roughly a paragraph at a time, and we're handling each one of those paragraphs separately. Each paragraph gets embedded into this list of floating point numbers; currently our array size is 384 numbers for a full paragraph.
It’s amazing what you can condense into one vector.
And then have that vector be useful.
I mean, it’s, it’s pretty, you can go smaller than that if you want to.
Of course, you get into the whole, the smaller you get, the more information loss you have.
And so you find out there's always a balance of how much you need. Most of what we found is that the toy problems, the smaller problems we solve, are not the size where you would need something like GPT-4o. The way I explained this last year was that these large language models have to be giant because they have to know everything, from part numbers to the name of Shakespeare's cat, which is a fun hallucination test if you can actually get a model to tell you the name of Shakespeare's cat.
Because there's no record of Shakespeare having a cat, but some models will actually try to give you a name, because they think that's an acceptable answer, and they've been trained to give people acceptable answers. In our case, what most people are looking for is a much, much smaller domain.
They may be looking for something specific, like some of the stuff we're doing at work. We have company policies that mean certain things and whatnot. And there's not thousands of documents.
We don't have time to write thousands of policy documents. In reality, for most companies or most programs, if your documentation is over 10,000 pages, you're probably past the amount of information a person can work with anyway. But this can easily scale much larger than that.
But on the ordinary kind of project that you're going to be looking at, the amount of documentation is not huge. Even with NASA, there's like 10,000 documents in this data set going back over the years.
Is there anything there from arXiv? Did you pull arXiv? I'm up to 14 terabytes of papers, back to '17.
So it's a bit of data.
How many papers?
I haven’t counted it.
Okay. Oh no, I have 2 million.
2 million? It's a little bit over 2 million. Yeah. So what you wind up with is needing giant data sets in order to train these things.
Yeah.
Because it needs to know all the different ways and words that we use, whatever the context. But when you actually turn around and use it for real, you don't have to train it again. It's already got all the stuff built in. I only need to use the way it knows that things are different to put it to use. Well, one of the things a lot of people do now, to get the really small models, is to cut out all the general knowledge. They won't teach it about the Simpsons.
They won't teach it about, you know, the latest thing on American Idol, but they will teach it how to reason. How do you deduce things? What's your problem-solving approach?
How do you go from chain of thought step by step?
That's all it knows about, and you feed everything else in through context.
We use a chain of thought.
Yeah.
Oh no.
Oh, yeah.
Yeah, but some of the philosophy behind a lot of it: when large language models were exploding, people were trying to figure out, how in the world does this thing know this? I mean, you can ask GPT-4 to do math problems for you.
Nobody taught it how to do math.
You know, even back to GPT-2, it could do math problems.
Even though it was trained on things like Wikipedia. And then you find out the weird things, like: well, it can do math problems as long as you put a comma at the thousands mark.
And you know it's because, when people were writing things down, that's how we write when we're documenting things.
It wasn't like tabular data and stuff. But when we think about it as humans, the way we communicate is what works. So for me to take knowledge that I know and transfer it to you, I actually talk.
I actually use words, whether on paper or with images or with sound or whatever. That is how knowledge is transferred. And so if you take a lot of words that have been written down, it shouldn't be too surprising that you could build things that automatically figure out how they are connected.
And then all of a sudden I know that, well, hey, this number plus this number, I’ve seen these two things show up a million times and the answer has always been this.
Hey, the answer is probably this. And again, the problem with that is that it will give you that answer with the same amount of certainty that it gives any other answer.
So that's definitely one of the interesting things, going back to, and I know it's a side topic right now, the whole trustworthy AI area: how do you trust it, how do you verify that you can trust it, how do you know that it's not out of bounds, things like that. To get back to the subject at the moment: we're trying to take the initial NASA piece we had built, which was just a semantic search, and actually do something similar to what ChatGPT does, but do it just for the NASA material, and then possibly throw that back over the fence to them so they can host it or whatever; it's not super duper expensive. Imagine a way to ask, almost like: hey, remember that time in 1958 when y'all did that experiment? What were you thinking about? I mean, it's some seriously interesting stuff. One of the things I haven't figured out how to do yet, and you mentioned the whole slang thing: the way that we use words today is different than the way we used words in the 60s, even the 80s. Even in technical documentation you find references to things that we haven't used in years.
I mean, back then it was vacuum tubes.
Well, maybe even the technology they're talking about is different.
Sometimes the whole meaning of a word changes.
Yeah, or the language itself changes quite a bit. You go back and, depending on the era, there's cursive writing and different types of things. And some of the documentation we've got from NASA is actually image captures of rough text that have then been converted. Some of the stuff is pretty rough as far as what you get. And it's a good way to test. Yeah, I'm sure some of you are doctors.
Yeah, well, not a medical doctor, but the PhD kind of doctor. So that's kind of where we started.
We've done a lot of interesting things. We've gone through some sessions: okay, we've got 10,000 documents, what's the best way to turn them into reasonable chunks?
What applications are available? We found some good stuff from that. We've also gone through what embedding models are available, like we were talking about earlier, to turn this into a vector.
And there are a lot of different ones, and a lot of tradeoffs.
How big you go, how small you go. Because again, we’re also going to have to store all of these.
Yeah, so you've got to think about that side of it. Can you clarify what you mean by vector? I'm thinking points in three-dimensional space. I'm referring to an array of numbers; this is a vector.
What’s the other word for it?
Tensors.
Tensor is another word for this. It's a point in an n-dimensional space, where n is the size of the embedding model.
So we say it’s a 1024 size.
It’s a 1024 dimensional space.
It’s hard to think about it.
Yeah, that's correct. And then you're trying to figure out: okay, I've got a point in the n-dimensional space, I have another point in the n-dimensional space,
how far apart are they?
That is exactly what we're doing with semantic search. So we started working through how to do this. We needed a way to store all these vectors that we have. Initially we went with a project called ChromaDB. It was really quick to get off the ground, and it wound up being fairly slow. I don't know if you noticed how slow some of it was. The good thing about ChromaDB was that you can host it yourself. I was able to run it on my laptop.
It was not that hard. And then I started learning how slow some of it was, I think, when we actually did our session here. I had been running the thing at home fine, but at home I had probably launched it on my laptop and then gone out to get a glass of water while everything spun up. Then, when you do a demonstration at a live event: okay, let me launch it, and then we all sit here for four minutes and watch it.
Yeah, and that was the professional version, even.
Oh, yes, that was it. And it's like: huh, this might not be great. Anyway, it was good enough for what we were doing. We also moved over and did a session on llama.cpp and then llama-cpp-python, which is a really good way to host a large language model locally; I was actually running it on my laptop. It's similar in capability, just not at the level of ChatGPT. So if you're trying to build something, it's really good for prototyping a quick thing to see, hey, is this working or not working, before you go spend a lot of money and push something out to a larger host.
So we did some of that.
And then the next thing we started working through: I think we had Weaviate on the list of things to look at. I hadn't really played with it much, because ChromaDB was doing what I needed it to. And then I saw some stuff Josh was working on, because we work together on another project, and it's like: huh, that looks a lot better than what I've got. Let me go look. And yeah, it's super easy. Really, it's a whole lot better experience than what I had with ChromaDB. What made you go with this over, like... a lot of the clients that I see use Pinecone, which you mentioned earlier; what's the difference between Weaviate and Pinecone? Technically, as far as how they work, not much. Pinecone is pretty good; I haven't put my hands on it. There are several podcasts with the founder of Pinecone, and I like his approach of making something useful and making sure it's optimized enough without putting in a ton of extra complexity. There are others that are faster. Yeah, they'd be faster. But I mean, if we're under two milliseconds on the search, is that good enough?
Typically, yeah. You know, the reason I went with Weaviate at first,
I think, was because you can actually create an account and create a cluster, and it's hosted by them for free for up to 14 days.
And I can load this thing up.
I've got the same amount of stuff that we had in ChromaDB shoved up here. And it provides you some things to go explore, to look at the different collections that you've got.
We’ll get into some of this more specifically later.
I always have to remember not to look at that screen because it’s like a second behind what my laptop is.
In this case, in this instance, I have 52,000 chunks, or paragraphs.
It has created 20 million intermediate vectors, if you will. It does some things as you're doing inserts, storing these things along with some intermediate steps, so that the queries are a lot faster.
So it can actually figure out: hey, not only do I have this... because it's multidimensional.
I can't really explain it from a math perspective; others could probably do a better job of that.
Well, each vector I've got has 384 items in it.
And to search an n-dimensional space, it's still doing cosine similarity, but it's got to get close enough to know which pieces to do cosine similarity against.
So when I put in a vector and say, search for this, it's not going and comparing it with all 52,000 items.
It has already taken those and indexed them into some other space.
So I’m not sure how it works in the middle.
I don’t know if it’s doing anything like a binary search.
Is that a correct representation?
Inside, they do some kind of approximate nearest neighbor search.
I don't know exactly what sort.
So it gets into the right region of the space and then does the search there. Right, it gets close enough, and then actually compares against everything within that realm; you can go linearly from there instead. So it seems super duper fast.
The other thing I really like about it is that the exact same database you're using is open source. Let me actually go look for it. That's where I'm at.
So here’s the code for everything they’re running.
And not only can you get the code and walk through all of it; if you've got questions about how something works, well, here's the actual source code.
The other thing that I really like: the exact same thing they host and provide hosting for and all that kind of stuff,
they provide Docker images for, for everything that you would want to run yourself. They make their money by hosting things for people who don't want to set them up themselves; not everybody has the capacity to just spin up a container on AWS and do all the stuff that we might know how to do. And I actually probably will use their hosted piece, even for the NASA stuff. It's such a low level of interaction, it might cost me a dollar or two a month, maybe. So that's where their market, their business model, is. And the exact same thing that I'm using right now, their web instance, I can run locally on my laptop.
And I only have to show you the one line of code I have to change to switch between the two.
Everything else is exactly the same: the same query, the same insert for data, all that type of stuff.
How do you know which of those things you want to use in your particular search?
Oh, hold on.
I'm not sure that's the right thing to look at.
Let me see if I can get back over to it; I've got it on one of my monitors somewhere. Yeah, this is actually just the source code they compile to build their database.
So hang on one second. Let me also hop over and show, if I can find it, the documentation. I actually have a link to their quick start, which is, of course, the quickest way in.
I think you mentioned earlier ChromaDB and its documentation issues; the documentation here is super top notch.
Everything for how to configure things. And it seems familiar to me, like some of the things I was doing maybe with JavaScript or some other types of projects, where there's an opinionated version that's good enough to get you something worthwhile. But then there are like eight million things that you can tweak, this module or that module.
I kind of like that.
So it gives me something working quick. But then if I need it: oh, there's a module I can actually put in here that lets me do image search.
I don't need that right now; it's not part of what most people start off with, but it's available when you want it.
There are some other things this will do, which we'll get to in a second. Let me get back over to where I was. So that's some of the why. The other thing is, I don't really need an extremely large data set at the moment. But there have been instances of this being used where it's actually running across multiple database instances that are replicating things, doing shards, doing other kinds of things, handling an extremely large amount of data. And that's the thing: a lot of people try to jump to that level to start with, and they spend a lot of time building something out, and then they don't ask users whether they think it's useful. Rather than the opposite approach: get something useful for one person, then see if it's useful for 10 people; kind of the iterative approach. So, creating an account was easy.
It needed an email address and a password, which I gave it.
I don't think it asked me for a credit card; I'll have to go check. I mean... never mind, live recording. I'll check after I stop recording. This is also something that's open source, that you could run yourself if you had your own hardware. Yes, right.
We’ll show that in a minute.
So, what I liked about it: I was able to create an account, create a cluster, a sandbox, get the API key and URL for it, download the Python library, and jump right into their quick start, starting from the top.
And by the time I got towards the bottom, I had something that worked, in about 15 minutes. I think I did this last Friday, right after I got back from vacation. So let's hop over and walk through what happened here.
What is the sandbox?
The sandbox is their...
what they're calling these sort of free clusters. If you go create one, this is where I got the term from: a free sandbox.
It's available in one place, and you can call it anything you want.
And this one’s free. It expires in 14 days.
Oh, that's just your entry into the portal. Yes. And then, say, create.
And so it will go do some churning and instantiate everything it needs.
And then after it does, I don't know if I want to wait for that; I don't remember how long it took. So after it finishes, you're going to need this piece of information, the URL, so that the code you're about to write knows how to get to it. And the other thing you're going to need is an API key.
I don't mind sharing this publicly, because this whole thing is going to go away in about 12 more days.
So there's an API key that you can grab. You will need those two pieces of information to validate that you are who you are, so you can access that data.
They have different ways of managing it if you were to, let’s say, create a cluster and have them manage it.
And so there are minimums; I guess I couldn't do it for like a couple of dollars a month.
There's a minimum charge of $25 a month if you want them to host it for you. The other thing that I always have to watch out for is that a lot of the things I work with are in labs that are closed, not connected to the internet. So that's not an option.
You know, I can't get there from here.
If you were to talk about... I know there was stuff when we had the attorneys here last time, from the law office: they're not dealing with national security, but they have clients who have rules that their data can't be commingled with other clients' data. It's a normal thing, especially around here. Health data, there are so many things. So if I were running something on public data, there's not really much of an issue with having it hosted somewhere else; the NASA data is already hosted on their site.
You know, they wouldn’t have any issues with us dropping it here.
You could strip all identification off your data, couldn't you? So you could then... Right. Yeah. Oh, but apparently you can only have two sandboxes at a time.
Oh, I can’t create one.
No more than two.
Got it. So apparently I’ve created two. Let me go back and see.
Yep.
So now I have two.
So I can drop in and see: here are the endpoints, and then the API key it creates for you.
And then playing around with it, I can jump into some code in a minute.
First we're going to show the initial part of importing the weaviate classes.
This is where we’ll get into a little bit of code.
So if you don’t know, if you’re not a software developer, you can ignore some of this.
But I just wanted to show how easy some of these pieces are.
So initially you just import the weaviate library and create the connection, which we're calling client at this point.
This URL is the same URL that we grabbed from... where was that?
Back over here.
It's this REST endpoint.
And then the API key is the one that we have from here.
What we're doing in this example is following a common approach for API keys and other kinds of credentials: we never check those into a repository, any repository for that matter. These are always something that we keep separately and then pop into the environment. A lot of times this is done with a Docker container, so everything is passed through the container and never leaves that boundary.
So we grab that from the environment and set it as the authentication key.
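NOTE – a minimal connection sketch, assuming the v4 weaviate Python client; the environment variable names are our own choice:

    import os
    import weaviate
    from weaviate.classes.init import Auth

    client = weaviate.connect_to_weaviate_cloud(
        cluster_url=os.environ["WEAVIATE_URL"],  # the REST endpoint from the console
        auth_credentials=Auth.api_key(os.environ["WEAVIATE_API_KEY"]),
    )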
This took me a little bit to figure out the configuration for timeouts.
When I was initially loading all this data, I basically had all of my chunks, 52,000 of them.
And I said: here. And it started working through them and then timed out after 60 seconds, because I had a lot of data, and I was also on wifi, which did not help. I had to figure out why, because it was a timeout. But then again, a good web search turned up a page of their documentation that said, this is how you do it, which I've dropped a link to. And then the other thing that was different from ChromaDB: ChromaDB gives you a really easy get-or-create collection. In other words, if it doesn't find a collection with that name, it creates it. Weaviate makes you do that yourself.
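NOTE – the timeout configuration described above might look like this with the v4 client; the exact numbers here are arbitrary:

    import os
    import weaviate
    from weaviate.classes.init import AdditionalConfig, Auth, Timeout

    client = weaviate.connect_to_weaviate_cloud(
        cluster_url=os.environ["WEAVIATE_URL"],
        auth_credentials=Auth.api_key(os.environ["WEAVIATE_API_KEY"]),
        # stretch the insert timeout so a large upload doesn't die at 60 seconds
        additional_config=AdditionalConfig(timeout=Timeout(init=30, query=60, insert=300)),
    )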
So if you try to create one that already exists, you get an error.
If you try to get one that doesn't exist, you get an error.
So in this case, we have to call and ask: hey, does this exist?
If it does, get it.
If it doesn't, go create one.
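NOTE – that get-or-create dance, sketched with the v4 client; "NasaChunks" is a hypothetical collection name:

    if client.collections.exists("NasaChunks"):
        chunks = client.collections.get("NasaChunks")
    else:
        chunks = client.collections.create("NasaChunks")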
In this case, we already have vectors.
We’ve already done the embedding for all of these paragraphs.
So I already have all of this data.
I'm just trying to take the data I have and push it up to the database so it can be used later for a query. There are also ways to set this up with your API key to OpenAI, or an API key to a few other options, and it will take what you give it in raw text, then take your key, go over to OpenAI, and have their model do the embedding.
ChatGPT or whatever it is.
So which model are they using?
Here, I’m actually using one.
I think I’ve got it further down.
Oh, hey, by the way: you get the bonus later of finding the light switch that does not look like a light switch on the wall. So the one we're using right now is bge-small-en-v1.5. There's probably a better name for that. If you go back to some of the videos we posted, we went through, Josh actually did, a lot of the different models that are available to use. There are, I would say, thousands of models that you can pick from. There are tradeoffs; some of them are better than others for certain things.
Some of it based on what kinds of data was used to train the model.
Some are better at technical documentation. Some are better at trivia. Some are better at 1920s movies. I was just wondering how it... like, this is running on their servers, right? Right now, the database itself is running on their server. Everything else I have is running locally.
So the model, like the embedding that happens before the insert into your database...
Right now, all of the embedding is running locally with my model.
And I'm just sending the resulting vectors up to them.
There is a way that you can, if you give them the keys, have it turn around and call OpenAI,
and have OpenAI do the embedding, because this one doesn't do any of that stuff in the database itself.
I’m not sure I like it that much.
I like to keep components separate and have them each do their thing. Then I'll take something from this one, move it to the next one, and have it do its thing.
That's kind of like LangChain.
It starts doing a lot of things, and then you turn around and you're not really sure how it got from point A to point B. Suddenly the library stack is deep.
You realize this was calling this, which was calling this. I'm just wondering what sentence-transformers is doing there. Is that pulling from something local on disk, some weights that you downloaded? Yes, it can do it both ways.
You can have it reach out to Hugging Face, which is probably the biggest collection of all the options.
Or you can download the model and have it run with what you have locally. Imagine that the place where this is intended to run at some point may not even be connected to the internet.
So I keep a version that you could run on a machine that stays separate from the internet.
Because when you do all the local stuff locally, it has the models already there.
It can load the documents that are already there, embed them, and store them in the database that's on the machine.
You don’t have to touch any of the cloud with any of this.
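NOTE – the local embedding step described here, as a sketch with the sentence-transformers library and the bge-small-en-v1.5 model named earlier; the model is cached on disk after the first download, so this can run offline:

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("BAAI/bge-small-en-v1.5")
    vector = model.encode("Put that on the table.")
    print(vector.shape)  # (384,), matching the array size mentioned earlier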
Right now, this is all running, in my case, on this laptop. For the actual embeddings that we've done so far, the chunks:
for all of this NASA documentation, we'd already done a piece of work that chunks it into paragraphs, does the embeddings, and then stores those locally, I think as a parquet file; just something simple to get it local on disk. What this code is doing, we'll show the actual code in a second.
I've got to keep moving because we're running low on time and battery.
That's my actual timer. We're iterating through each one of those chunks and passing them up at this point.
We're creating a list of chunks with properties.
You can put whatever metadata you want on this thing.
For me, I have a key which goes back to the document ID.
I’ve got the page number and I’ve got the actual text.
Something that caught me initially: this embedding was already in a NumPy array, and it wanted it sent as a list.
It didn't throw an error; it just didn't take it. It created an empty object, which you just find out about later. So thanks for that.
Two hours later, I finally figured it out. Another thing: when you do the query, by default it gives you the values that are in the metadata.
It doesn't give you the actual embedded vector stored with it, as most people don't care.
I had to go figure that out.
I have one list that’s got everything in it.
The line to actually insert all of that into my database is data dot insert_many.
It's smart enough on its own to know how to batch things up; it actually loads it all up for us. I didn't have to care about any of the network logic.
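NOTE – a sketch of that properties-plus-vector upload with the v4 client; chunk_records is a hypothetical iterable of (doc_id, page, text, vec) tuples read back from the parquet file:

    from weaviate.classes.data import DataObject

    objects = [
        DataObject(
            properties={"key": doc_id, "page": page, "text": text},
            vector=vec.tolist(),  # convert the NumPy array to a plain list (see above)
        )
        for (doc_id, page, text, vec) in chunk_records
    ]
    chunks.data.insert_many(objects)  # one call; the client handles the network logic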
Now, turning around to the other side: when we want to do a query, I create the client the same way.
So in this case, I have my text for the query, and what's funny is I misspelled the word ratio to start with.
But then I decided to keep it because that’s a normal thing people do. I don’t know how often I put stuff into Google just to have it ask me, is this really what you were looking for? It fixes my spelling errors for me. So I figured why not just leave it that way. So we take the actual text query that we’re asking for.
It could be a sentence, it could be a paragraph; it could be fairly large.
We call encode to get the encoding, the embedding.
The other thing, and I'm not sure if I put it in here somewhere:
The embedding that you use for your query has to match, has to be the same embedding that you use on all of your other data.
You can imagine, if I had something that embedded things one way and my query embedded things a different way, I would not expect the numbers to mean anything; the embeddings wouldn't line up.
So I get the collection the same way I was getting it before, and then I’m asking it, give me vectors near.
This is the actual text embedding.
I’m telling it to limit to five.
And I'm also telling it: when you return all these values, include the distance metric; there are a couple of other things you can ask for in the metadata.
And then this piece of code is just looping through what I got back and printing some values out.
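NOTE – the query side, sketched with the v4 client, reusing the same embedding model as the inserts:

    from weaviate.classes.query import MetadataQuery

    query_vec = model.encode("What is the ration of fuel to weight necessary for lunar landing?")
    response = chunks.query.near_vector(
        near_vector=query_vec.tolist(),
        limit=5,
        return_metadata=MetadataQuery(distance=True),  # ask for the distance explicitly
    )
    for obj in response.objects:
        print(obj.metadata.distance, obj.properties["key"], obj.properties["page"])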
The top three I got back were... so the query was, where is it?
"What is the ration of fuel to weight necessary for lunar landing?" I'm not a space traveler or an astronaut or a physicist, so I don't know the exact real value.
I just know there is a tradeoff, because we've had some NASA folks here talking to us about machine learning at some point, years ago. You want to leave fuel off, but the more fuel you add, the heavier you are, and the heavier you are, the more fuel you need. So that's why the numbers are as big as they are.
It’s fine.
That's why that big rocket out there off the highway is as big as it is.
With almost like a little teacup for three dudes on top of it. Yes. And if you've ever climbed into that thing, it is about teacup sized.
So anyway, the stuff I got back was talking about, hey, in this particular paragraph... And again, this is where it's tricky: the paragraphs you're looking at have to be useful, and you don't want your chunks to be too big, because then you've got more information than you can use.
You don’t want them to be too small or else you don’t get good information out of them. So in this case the first paragraph is actually talking about the weight of fuel and the weight of spacecraft. And I have information in the metadata that tells me this chunk was in a document with this key.
The NASA documentation puts the four-digit year that the document came from at the first part of the key.
That’s a document from 2013 which we can actually go look at the PDF file of that.
The second field is the cosine similarity result, the distance, which was 0.21. The lower that number, the more closely related the concepts are.
So the second one was about remaining fuel used to lower perigee.
I didn't even know what that word means. Perigee. Is that the top of the trajectory?
Perigee; it's actually the low point of the orbit. Yeah, that's it.
Perigee. I guess that's how you say it.
We get people from all walks of life here, and it's always fun to see.
Yes, you know. And then the second result was page 3, about required input power. And again, this one has a bunch of interesting characters in it, tabs and other stuff. It's probably not extremely useful.
So the other thing you can do, and this one is actually filtered: I want to do the same vector search, but filter on the metadata by certain things.
In this case, I’m telling it just to look in this one document that was the first one we found.
It is like 200 something pages.
So: restrict what you're looking for to that document.
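NOTE – the filtered version might look like this with the v4 client; the document key is a made-up placeholder:

    from weaviate.classes.query import Filter

    response = chunks.query.near_vector(
        near_vector=query_vec.tolist(),
        limit=5,
        filters=Filter.by_property("key").equal("20130011111"),  # hypothetical doc ID
    )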
And apparently it gives me the same first result as before.
The second one is talking about excavated water and ice. This is pretty interesting; I need to read this paper now, because it is pretty interesting.
They're talking about ice and water extraction on the moon and how much we'd need; what would it take? You know, it'd be really nice to not have to carry water with you. So this is like an intelligent search engine for a database that you can upload to. Yeah, if you want. And then the other side of that: at first we were running against the cluster that we created at Weaviate Cloud.
In order to run this in a Docker image, it's docker run, and you drop this line in.
And that spins up a local instance, registered at localhost, port 8080 as far as the endpoint goes, and you can parameterize the heck out of that:
where you want it to come up, and everything.
And then I had to change my client connection code to say connect to localhost. Everything else I had in my code, to do the query, to do the upload, everything is exactly the same from one to the other. So I really like that I can create a trial sandbox or a trial cluster to prototype something and see: hey, is this going to work for me? How is it to use?
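NOTE – the one-line switch between their cloud and a local Docker instance; the image name below follows Weaviate's docs, and the ports are the defaults:

    import weaviate

    # after starting a local instance, e.g.:
    #   docker run -p 8080:8080 -p 50051:50051 cr.weaviate.io/semitechnologies/weaviate
    client = weaviate.connect_to_local()  # defaults to http://localhost:8080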
And then you could pay them for hosting. Or, I'm pretty sure I could find a place to host what I need myself for much less than $25 a month, using the Docker container they provide or some other way. They've also got several Docker Compose files that have all kinds of different modules loaded and other types of stuff.
If you’re interested in the actual code, let me actually go back and get it.
Made the mistake again of looking up there instead of at the bottom. I think I have this pulled up. Okay.
So this whole piece is in the Space Apps 2022 repo.
So all of the code that we did for that Space Apps challenge, along with all the code that we've been playing around with so far, is public on GitHub. So if you want, you can drop in and see. I've got one piece where we were doing ChromaDB; you can actually see how we loaded it there.
Oh, maybe I didn't upload that. Whoops. Hold on.
I have that.
See if it’s there now.
It is not.
That’s fine.
There we are.
This actually shows you the code that runs, which is where I got the source material for that post.
You know, same kind of thing. Here's where we're actually loading from a parquet file locally, the one we put all the chunks into the first time.
Now we’re pulling them back out looping over those.
The actual loop is down here.
So you're mainly dealing with text, right?
You mentioned some of it had to go through OCR. Yeah, right now the result that we have is text; the OCR has already been done.
And then you decide how to chunk it, how to break it up into the pieces that will be embedded.
Yeah. And we haven't done a lot of tweaking there right now. But again, as far as what's the right size of a chunk...
You know, some places do it sentence by sentence.
Oh yeah.
Some people go page by page.
Maybe the whole page is a chunk; maybe a paragraph. You know, I have another question about that. Maybe nobody knows about this, I don't know.
But I've encountered this before, with the embedding dimensions you mentioned.
And you have to make sure those match up.
Is there any sensible way to translate,
like a least squares, like a linear transformation that you can find,
between spaces of different dimensions? Are you aware of any results on how to sensibly take a lower dimensional embedding and map it in a useful way into a higher dimensional space?
Like if you wanted to use a bigger model than the model that you used for the embeddings.
I think you have to train that in.
So you have to like have that be part of the actual model itself.
What do you mean?
So I mean, there are definitely things you can do, like projections.
Sure.
But yeah, that would be going the other direction. I don't think you could. Also, if you want to use a more powerful model... but that would be, like... you're saying there's a projection if you want to go to a smaller dimensional embedding space.
Yeah, but like I said, that happens during training, for some architectural reason, like a U-Net or something like that.
So I'm asking: is there a way to do it after the fact, without training?
Is there like a sensible way?
I have never seen anybody do it.
Everybody always says to just bite the bullet and re-embed.
If you're doing very many documents, maybe there's a purpose to training a model that can translate your embeddings.
Yeah, but it’s like this thing with the spaceship. You know, what’s the ratio?
Yeah, so if you've got a graphics card, embedding all those documents, you're looking at 15, 20 minutes.
I mean, it depends on the size of your card.
I mean, that's probably why nobody's really put a lot of effort into it: because re-embedding is cheap, comparatively. You know, I know you've got that problem, right? That's a really hard one to verify.
You don't find out you messed up something fundamental until it's too late. Let me stop the recording so we at least get that part wrapped up before my computer dies. We can keep going after that. Do you always use "chunks" and "vectors" and whatnot?
Are those terms common in the industry, or is it just kind of unique to what you're building? I don't know what standard terminology most people use offhand.
Say you want to talk to someone about these things, or you go ask another entity to host it for you, or to embed it, or whatever.
Are your terms so unique that you'd have to explain them?