Chunking with LLM Sherpa


Transcription provided by Huntsville AI Transcribe

So what we're talking about tonight, we're gonna do that part. We've been going through this retrieval-augmented generation (RAG) kind of approach for the NASA Space Apps piece we did a couple of years ago. And as part of that, we had initially loaded data from the NASA server, and it's got one big file that's got all of the IDs and abstracts of all the papers. And so getting the abstracts was easy. Getting the data out of the actual papers is quite interesting.

And we can, we get towards the end, we’ll go back and look at actually what some of these documents look like.

But initially we were using this package called PyPDF.

And so you load it up, you point it at a PDF document, and ask it to extract the text. And what it does is it gives you sections of text, but they're not great. I mean, this was an actual pull from one of the things it gives you.

And so it’s fairly, fairly bad, depending on what kind of document you’ve got.

If you have something typical, like a Word document converted to PDF, with just a single column and normal paragraphs, it does okay.

If you have anything that's like a two-column piece like you'd normally see in an AI paper, or things with graphs or bulleted lists or tables, it really, really sucks. And the thing that we're looking at actually needs chunks of paragraphs, because of the way the retrieval works: first you put in a prompt, second you do a semantic search against your data set to find things that are close to what your prompt was, and then you use that to build the context.
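Those three steps can be sketched end to end. This is purely a toy — the `embed` function here is a bag-of-words stand-in for a real embedding model, and the chunk texts are made up — but it shows the prompt → semantic search → context shape:

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts (a real system uses a model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(prompt, chunks, k=1):
    """Step 2: semantic search -- rank stored chunks against the prompt."""
    q = embed(prompt)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_context(prompt, chunks):
    """Step 3: stitch the best chunks plus the question into a context."""
    hits = retrieve(prompt, chunks, k=2)
    return "\n\n".join(hits) + "\n\nQuestion: " + prompt

chunks = [
    "The abstract describes microgravity effects on plant growth.",
    "Fuel mass ratios dominate launch vehicle design.",
]
print(build_context("What fuel ratio is needed for liftoff?", chunks))
```

The whole rest of the evening is about making step 2 work well, which starts with having good chunks in the database.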

So I need something to actually take my prompt, turn it around, and do a query, but I have to have good data in the actual database to make that useful. So looking through that, I basically did a quick search and ran into this project called LLM Sherpa, which gives you ways to split PDFs into paragraphs it can manage; it can handle tables, it can handle graphs, things like that. And depending on how deep you get into it, it can even help you with things like linkages between different paragraphs, to say that these two things are related. So that was neat.

And then the other thing that blew me away when I dove into this LLM Sherpa. At first, I downloaded it. Let me switch over and I'll show you what this actually looks like for a PDF.

So I think this might be big enough.

It's actually big on my screen, that is. So the interesting part was you have this LayoutPDFReader that they want you to use, and they give you a URL which initially points to this endpoint hosted by LLM Sherpa. And I started playing around with it and I'm like, well, I don't want my data going out to some other site or something. So I'm like, well, let me just see what happens if I get rid of it. It just totally doesn't work.
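For reference, the wrapper call looks roughly like this. The endpoint URL and port are assumptions based on the llmsherpa README — adjust them to wherever your backend actually listens:

```python
# Local endpoint shape from the llmsherpa README -- adjust host/port to
# wherever your backend service is actually running.
LLMSHERPA_API_URL = "http://localhost:5010/api/parseDocument?renderFormat=all"

def parse_pdf_chunks(pdf_path, api_url=LLMSHERPA_API_URL):
    """Send a PDF to the backend service and return its chunks as text."""
    from llmsherpa.readers import LayoutPDFReader  # pip install llmsherpa
    reader = LayoutPDFReader(api_url)
    doc = reader.read_pdf(pdf_path)
    return [chunk.to_context_text() for chunk in doc.chunks()]
```

If you drop the URL, nothing works — which is the clue that the Python side is only a client.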

That’s where I figured out that the LLM-Sherpa service on the back side is actually doing work.

And the front end is just a Python wrapper that you call; it just packages things up and sends them to the server.

So then the question is, well, where's the server?

Where do I find it?

And of course, they have the code for the server and all this kind of stuff.

So I started digging down into the server and I realized they basically did the exact same thing I had done about 10 years ago. Josh, you may remember some of this. There's actually a project called Apache Tika, a Java library that Apache built that can parse PDFs, and you can run a bunch of Word documents through it, all that kind of stuff, and it gives the text back to you. I'd done that a good while back when I had something where I was working in a different language, but the powerful part for the parsing and stuff was in Java. So I wound up wrapping it with — I don't remember which — some Java WebSocket thing that you can use to host a REST endpoint really quickly.

I don’t remember what it was.

Maybe I don't remember the actual library. But anyway, it was a thin wrapper around it and I was able to spin it up. It looks like they've done something very similar. And the neat thing is, not only does it parse things in a way that's much better, it's really fast. So it's definitely much better than what we were doing with PyPDF, which was all Python. So that was pretty interesting. So now I'm actually running LLM Sherpa locally; they provide their application in a Docker container along with a command line.

You just pull the container, it loads it up, and you can connect to it that way. So that's where this local URL came from. So when I pull it up and I run it — I think I've got that in the README as far as the actual Docker command. Yeah, I'm just using docker run, mapping the ports, because for me one of them was already in use, so I had to map it to another one on my host. Run that and it spins it up, and now you can connect to it. I think I'm running that somewhere in here. Not there; one of those is actually running the backend. So back to the main part.
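The Docker incantation is roughly this. The image name is from the nlm-ingestor project; the exact tag and the container's internal port are assumptions from its README, and the host port remapping is just what I did because the default was already taken on my machine:

```shell
# Pull the backend image (check the nlm-ingestor README for the current tag).
docker pull ghcr.io/nlmatics/nlm-ingestor:latest

# The service listens inside the container; map it to a free host port.
# 5011 here is my remapping because the default host port was already in use.
docker run -d -p 5011:5001 ghcr.io/nlmatics/nlm-ingestor:latest
```

Whatever host port you pick is the one that goes into the LayoutPDFReader URL.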

So all of that we can drop into the actual.

So this is the actual library for LLM Sherpa.

It's from a company called nlmatics, which is kind of interesting.

I have seen a couple of different things. This says it's MIT licensed, but if you scroll down, you see it's Apache 2 licensed. As long as it's one or the other, I don't care. So this nlm-ingestor is actually the service that runs the backing part of it.

I haven’t verified the OCR part, but that is also another thing that’s pretty hefty.

If that actually works — some of the older PDFs that you find

have been turned to text through OCR as it existed, you know, 10 years ago, which is pretty rough. So you may want to redo those, especially if we could get back to the original ones; that would be good. Yeah. It would be interesting to compare the OCR between this and what you get with PyPDF.

So not only did they wrap Apache Tika, they also added some modifications to it to help it. This is similar to something we had done in kind of a research thing a while back. Because from a PDF standpoint, the way that it works, it's from an old typesetting kind of thought process where it puts all the letters that you see on the page at that exact spot. And it doesn't care about the text itself, doesn't care anything about paragraphs or sentences or anything. It is: draw this letter at this location.

Now draw this letter at this location and they just all happen to line up.

So you wind up with things occasionally, especially with two-column text, where you ask for paragraphs and it actually reads the first line from the first column and then continues straight on into the second column.

So you can see the text gets mixed up, and it gets really fun if the typesetting is off just a little bit: it winds up doing more of a zigzag through there, because this line is just slightly lower than the other.

So they have some things in there similar to what we had added: something to detect that you see the same information at the bottom of each page and figure out, okay, this is a footer.

Please don’t insert this in the middle of my paragraph.

You wind up with a number in every paragraph and realize, wait, that's incrementing. Well, that's a page number. Oh, crap.

So that’s one of the things they were talking about here. It can do things like sections and subsections and then paragraphs within sections.

It can do things like lists and nested lists, which right now we’ve got that information coming in from the documents. The next question is kind of the next kind of thing we’ll get into is assuming I’ve got paragraphs of data.

What's the right way to chunk? How big do they need to be?

What do you do if they’re too big?

I've seen some approaches where each chunk of a paragraph will include the last bits of the one before and the first bits of the one after, to try to help it get a little bit more context.
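That overlap idea is simple to sketch — this is my own illustration of the approach, not anyone's library code:

```python
def overlap_chunks(paragraphs, n_words=10):
    """Prepend the tail of the previous paragraph and append the head of
    the next one, so each chunk carries a little surrounding context."""
    out = []
    for i, para in enumerate(paragraphs):
        prev_tail = " ".join(paragraphs[i - 1].split()[-n_words:]) if i > 0 else ""
        next_head = " ".join(paragraphs[i + 1].split()[:n_words]) if i + 1 < len(paragraphs) else ""
        out.append(" ".join(filter(None, [prev_tail, para, next_head])))
    return out
```

The tradeoff is that the overlapped words get embedded twice, so you pay a little storage for the extra context.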

The other thing, so we’ll talk a little bit about that coming up in one of our later sessions.

I haven't actually checked this yet. Let's jump into the write-up first, and then maybe come back and see if their Colab is actually active and maybe kick the tires on it. So this is kind of what they were talking about: naive chunking.

I guess their words were naive chunking. I haven’t really seen that elsewhere, but it might be coming from some other paper somewhere. But they get in it. Oh, come on.

Don’t make me sign up. There we go. See if I can zoom that a little bit.

So they're thinking about, instead of each paragraph being its own chunk, actually chunking across certain things.

Because one thing is, you may wind up with something like a heading being in a chunk by itself, and that might not make sense. You may wind up never retrieving it because it doesn't have enough — I don't know how to say it — it doesn't have enough meat to it to really anchor to anything. And they've got things like, okay, now take a list: how does that work with pulling that in?
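You could imagine a cheap version of that fix yourself — this is purely my sketch, not their code — where a bare heading gets folded into the chunk that follows it instead of floating on its own:

```python
def merge_headings(blocks):
    """blocks: list of (kind, text) pairs, kind being 'heading' or 'para'.
    Fold each heading into the chunk that follows it, so no heading ever
    becomes a chunk by itself."""
    chunks, pending = [], []
    for kind, text in blocks:
        if kind == "heading":
            pending.append(text)
        else:
            chunks.append("\n".join(pending + [text]))
            pending = []
    if pending:  # trailing heading with no body text after it
        chunks.append("\n".join(pending))
    return chunks
```

The heading then contributes its words to the embedding of the paragraph it labels, which gives the chunk something to anchor on.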

It shows you kind of how they’re actually doing some things.

Actually it just shows you, okay, smart chunking.

Oh, yeah.

The first part, they’ll be showing you some of the problems you run into.

Where you get — especially with tables — you might have stuff right next to it that actually should be included with the table itself, like a caption, maybe. And then sometimes you wind up with things where, if you go down too far, you don't have the header information that says what a chunk was actually talking about.

So they have this concept, they call smart chunking, which I guess it’s smart.

So here they’re actually showing how they’re, how they instead of going the line by line, they’re actually going, you know, with sections and subsections, making sure that they’ve got as much in the same piece as they can.

And then for lists, it looks like they dropped some of the stuff at the bottom, but kept some pieces at the top.

I’m not quite sure what that’s going to do for you, but.

All the list items go in a single chunk with the lead-in sentence.

Yeah, so it looks like they're chunking tables all together instead of splitting them somewhere, and I'm not quite sure what happens if you get one that's big.

Because, you know, the other thing — and again, this is another one of those tradeoffs — it depends.

So we’re going to cut these things into chunks and we’re actually going to take each chunk and use an embedding model to turn it into a list of numbers.

And you got some size on that list of numbers.

And if you wind up with a short list of numbers and a lot of text, you're going to be dropping information somewhere as you try to squeeze into a smaller embedding space.

And it might be interesting to compare chunk size versus embedding space size and see how well some of this works.

They’re also, you know, talking about context windows, things like that.

So as you get into this, this is leaning more toward how this stuff is used. With a large language model, we're going to be putting in prompts and then asking for responses and stuff.

There’s only so much room you have to put your query and your context and all your stuff together to hand it to the model.

That’s what they call context size.

And up until — I'm not sure where — that might be a good talk as well.

Or if anybody wants to drop a link somewhere: the increase in context size over the last year or so.

It's grown pretty phenomenally. Before, even when ChatGPT first came out, you could ask it a certain thing, and the way that worked is: each time you ask a question, you get the response, and then the next time you ask, it would actually take the response you got before, tag it onto your prompt, and send all of it back over again. It doesn't remember what you asked; it's actually sending it over again and again and again. Each time you ask, you can imagine that gets bigger and bigger and bigger, which some places don't mind at all because they're charging you by the token. If you were to build your own application on top of it, you pay based on how much stuff you send across. So the more context they give you, the more context you use, the more money you're paying them to send the information across to them.
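That resend-everything loop is easy to sketch. Here `count_tokens` is a crude one-word-per-token stand-in for a real tokenizer, and the model's reply is faked — the point is just that the payload grows every turn:

```python
def count_tokens(text):
    """Crude stand-in for a real tokenizer: one token per word."""
    return len(text.split())

class Chat:
    """Each turn resends the entire history, so tokens sent keeps growing."""
    def __init__(self):
        self.history = []

    def ask(self, prompt):
        self.history.append("User: " + prompt)
        payload = "\n".join(self.history)   # everything goes over the wire again
        reply = "Answer to: " + prompt      # pretend model response
        self.history.append("Assistant: " + reply)
        return count_tokens(payload)        # tokens you just paid for

chat = Chat()
sent = [chat.ask("q one"), chat.ask("q two"), chat.ask("q three")]
```

If you're billed per token, `sent` growing every turn is exactly the cost curve being described.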

So that's another place. So — trying to think of the name of the session — it might be "what knobs can you turn."

You can turn how big your paragraphs are, you can turn what your embedding space is and how big that is, you can turn which embedding model you use.

You can turn what your context size is going into the LLM. I mean, it's pretty interesting.

And then the other thing that I know we're going to get into in one of our next talks is basically looking at how to measure whether this is good or not. A lot of it is subjective, especially what we're doing — I'll show you in a second. I'm trying to put myself in a researcher kind of hat, putting prompts into a query, looking at what I get back, and asking: does this make sense?

You know, there are actual baselines you can use and measurement approaches, things like precision and recall — there are actual standard things you can use. That's pretty interesting.
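For retrieval, precision and recall are straightforward to compute once a human has marked which chunks are actually relevant to a query — a minimal sketch:

```python
def precision_recall(retrieved, relevant):
    """retrieved: ids the search returned; relevant: ids a human judged correct.
    Precision = fraction of returned results that were relevant.
    Recall    = fraction of relevant results that were returned."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

The hard (and expensive) part is building the judged query/chunk pairs, not the arithmetic.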

In this case, this is their introduction to their LayoutPDFReader — layout PDF reader, I said that backwards. So we'll actually give this a shot and see if this code actually opens up and does anything useful.

It's probably first going to ask me if I want to trust it.

If you're not up to speed with Google Colab, it's actually something we've used here a lot. It's a notebook — think of it as a Python notebook service provided by Google. It can run some pretty interesting level of things. It's not going to run — actually, we got it to run Llama 2 at one point. Or GPT-2, yeah, not Llama 2.

I can't remember. It's GPT-2, I think.

It'll run on the CPU.

It should be okay. The other thing we’ll talk about real quick.

Actually, let me insert a cell below this because I think that oops, where’s my insert button plus code.

There we go.

Ooh, I can generate the AI that needs to be there.

I have no idea if this is going to work.

Probably not. Method iterable chunks.

Okay. So it actually shows me that in this document it’s got the table of things like that.

And so somewhere in here, I don’t remember what the code was.

I’m kind of cheating here. I’m going back to see what my actual… I’m going to go back to the same page. That was our piece that did PDF to Parquet. Somewhere in here.

I have a two text.

So apparently this is… Not bad.

That’s a lot. Let me drop a new line. There we go.

That way you can actually see what it is calling chunks of this document.

So is it just parsing it to markdown?


It may further take it because this would be a… Well, even if that were to say, really important… Is that a table or is that a… That’s kind of interesting.

Looking for something with… Looks like it’s towards the end of that kind of thing, baby.

It’s backwards because it doesn’t have to be references. And references would be at the bottom. Yeah.

Could be. Table one.

Hey, it says table one.

Comparison of pre-training objectives.

So it’s… Yeah. I mean, there’s a lot in this game.

So it’s pretty easy to use to tell you the truth.

As far as that goes — there were some I've used in the past from a PDF perspective, it might have been PDFBox I was working with, where you actually have to get into the document, and it tells you how many pages you have, then you have to iterate page by page, and on each page you have a section, and in each section you have a paragraph. It's just got a lot going on there. So that's where we are so far. So which chunking strategy are you using for this one?

Are you doing the 100 breaks without that section?

I’m seeing Mike. Yeah. Next place.


I used exactly what their defaults were.

So I have no idea.

So they do have… Before I did the chunk to text, there’s actually a… Just to correct that out.

It was showing me I’ve got a lot of paragraph objects. Somewhere in here it tells me I have tables.

That’s a list item.

So that’s another thing I haven’t taken advantage of is that there’s a way to check for each object, what kind of object it is.

And you may actually want to handle paragraphs differently than you handle lists.

And you may want to handle those differently than you handle tables. So there was that.
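Dispatching on block type could look something like this. I'm guessing at the exact tag strings — llmsherpa reports a type on each block, something along the lines of 'para', 'list_item', and 'table' — so treat this as a sketch rather than their API verbatim:

```python
def render_chunk(chunk):
    """Handle paragraphs, list items, and tables differently.
    Assumes each block carries a 'tag' string and a to_text() method,
    roughly in the shape llmsherpa blocks expose."""
    tag = getattr(chunk, "tag", "para")
    if tag == "table":
        return chunk.to_text()           # keep all table rows together
    if tag == "list_item":
        return "- " + chunk.to_text()    # preserve the bullet structure
    return chunk.to_text()               # plain paragraph
```

Tables in particular probably deserve their own treatment (maybe a summary, or a markdown rendering) before they go into the embedding model.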

And so what I wound up doing… I’m trying to walk through this code if my cursor will catch up with me. So what this code does, I took a piece of it out so that I could run it by hand, I believe. See what I had done here. There we go.

Yes, I had actually, in order to get this thing up and working and where I can actually play around with it.

We were running this in a pipeline, basically going file by file: go chunk this one, go chunk this one, go chunk this one.

But to play around with it, I actually kind of broke it a little bit just to see… …about that.

This is actually looking on my local machine.


I’m not sure I can find that one easily.

I don't know a way to open these nicely from VS Code, unless somebody else knows an "open with" function.

Right, clear.

I’ll go find it first and I’ll try it. So, ah, file path.

2020, 590, zero order.

I’ll go back from the 70s.

So 2020, 590, so there it is. See if I can open with maybe.

There’s a PDF here.

I can try it. I don’t want to do that yet.

Let me see if there’s a… Oh, it is skipping for now.

Alright, so that and then the actually part that does the read.

Okay, that’s exception. Okay, not load.

That’s fine.

Always fun with live demos. What am I giving it for a case?

It expects to be run.


There we go.

Okay. That’s what it is.

I can’t remember what I called it.

Space apps, I think.

Yeah, definitely doesn’t like me. Well, we’re not going to do a live demo on that one. So what we did here, I’m going to just kind of jump to the point.

We went through all of the PDFs that are in this big list of PDFs. We grabbed all the chunks out of them and then converted that to one giant Parquet file, which is a way to store — I think that's over here.

It's a way to store data, and it gives you a way to compress it in a pretty good way.

So if I go to data.

We've got 12 megs worth of chunks in there, and out of about 500 documents there are over 100,000 chunks worth of paragraphs.

Some of these documents are 100 pages long, some of them are five, you know, so it gives you things like that. And then ran through the same operation that we had done for the, for the abstracts.

So where was that?

Chucks to come.

Oh, working in batches. This is the one where we're actually doing the encoding of the text paragraph into the embedding space.

So when you get out of that, you get a bunch of numbers.

And then we have another piece where we’re taking that set of numbers and I’m putting them into this.

ChromaDB vector store.

And so then what that lets me do, I’ve actually got a quick test up here to query, query Chroma itself.
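That query test is basically this shape. The collection name "nasa_chunks" and the host/port are made-up placeholders — use whatever you loaded the embeddings under — while the client calls themselves are from Chroma's Python client API:

```python
def query_chunks(question, n_results=5, host="localhost", port=8000):
    """Ask a running Chroma server for the chunks nearest the question.
    'nasa_chunks' is a hypothetical collection name; host/port must match
    wherever your Chroma container is listening."""
    import chromadb  # pip install chromadb
    client = chromadb.HttpClient(host=host, port=port)
    collection = client.get_collection("nasa_chunks")
    return collection.query(query_texts=[question], n_results=n_results)
```

The result comes back with matched documents, ids, and distances, which is what the GUI front end renders later.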

You see, this will actually run. Is the number of dimensions for the embeddings variable?

Or is it fixed?

It's fixed. Well — it's fixed based on which embedding model you use.

And there are a lot of embedding models to choose from.

Try to see why this is.

You know how many of them, that’s one. Not, I was hoping to show you in a second, but it doesn’t seem like it wants to work very well.

Actually, I can scroll up and show you what it looked like before.

So this actual query: what is the ratio of fuel weight necessary for lunar liftoff?

I just asked that kind of question to see what it would connect with. And it turned that query into an embedding. What it's doing at that point is a comparison in embedding space, and the concept there is more of a semantic search.

So instead of searching word for word,

I'm searching for concepts, if you will. And one of the things — I mean, this is fairly dated as far as the way to discuss it — it's kind of a word-vector type of concept. I guess the old example is you have one statement that says "the president speaks to the press in Chicago," and then another statement that says "Obama addresses the media in Illinois." Those two things meant the same thing at the time, and the only word they shared was "the," but they say the same thing. So this is a way to turn those statements into embeddings that capture not only the words themselves but how they're used in a sentence, how they're used in context. And you can actually compare them with each other.

I think somebody already answered my question. If I could flip back over. Where was that?

Okay, so yes, the one we're using, BGE-small, gives us 384.

The base one is 768.

But it shows basically what it’s returning for documents will show that more of a GUI approach later.

I’m hoping I didn’t break something because I had this working before. That’s not that.

We’ll see.

So last time we ran this — which I guess we'll do live and see if it works or not.

What we worked through — the problem we ran into with ChromaDB was the spin-up time.

So I found they also have a really easy way to run the database in its own container, which also means I can put it wherever I want. I can run it on a server if I want.

It doesn't have to be here. So I did that, and I updated our front end to actually point to that.

Now the question will be if it’s still running or not. I’ve actually got it set up so that this may still be up and going. So in this case, what the query was and this is actually searching across all the different chunks that came out of that 500 documents. And the reason we picked some of the words we picked.

All right, here we go.


I think "god of war" is not something you're going to find in a lot of NASA documents. You know, you might see Mars, which was a god of war. You may see Ares, which was a god of war. I mean, so it's kind of like taking the concept and applying it across something that may or may not have anything to connect with it.

I've got Phil trying to call.

So I’m going to have a little things outside.

I don’t know if you see outside. I can’t tell if he’s out there. Anyway, maybe we’ll call back.

Maybe we’ll answer live and talk to Phil while we’re in the middle. So let’s see if it’ll actually.

It may be down.

So I don’t know if this likes the return.

No, it didn’t.

It didn't really give us a whole lot out of that as far as the moisture content question. But that's some of the things we'll have to actually look at. But again, we're at the point now that we've got ways to get the chunks, get them embedded, and get them in the database.

Now the next step is to figure out: what is the right size of chunks? What is the right embedding model to use?

Are we getting back the right thing that we’re asking for?

Because this is just the first step of actually building out the whole application we're trying to build. So the first part is we've got to get this part right, because for the rest of it from then on, what we'd like to be doing is taking the items we get back — we may want one result back, we may want 100 results back — maybe ranking them in a certain order, and then building a context and handing that over to a model that we're then going to pose a question to. So that's where we're at.

Let me take this back a few and I'll show you something else I was working on. I have to go way far back to this slide.

So would you want it to put together the margins of the board or is that something that you don’t want it to know?

I probably want it to, because — you can imagine, one of the superpowers that I have, and I've actually bragged about this on LinkedIn before, is I have an uncanny way of finding the right words to put into Google to get answers to things.

I think it happened when I was working with the Eclipse Foundation, and one of their releases was called Juno. I don't know if you remember Juno — you remember Juno at least. If you turn Eclipse Juno backwards, Juno Eclipse was actually a character in Star Wars. You couldn't find anything about Eclipse, because you kept finding Star Wars memorabilia stuff on Google, and nobody could — you know, it's just like, could you name it something different?

So what I'm getting at is, sometimes you don't even know the right words to ask for.

You know, you've kind of got an idea of what you want, and that's one of the keys that's unlocked a lot of the chat, GPT-type things. You don't have to be exact. You get close and it's good enough. I mean, even Google does it now: you type in something kind of close, it'll come back and ask you, did you mean to search for this instead? And usually it's right. Unless you're just looking for that one word it assumed was a mistake.

The other thing I wanted to show: I started working on this since we've got this chunking piece that's a service you can run on a separate machine, and I've got a vector database that you can run on a separate machine.

When we built this thing initially, we just had all of it in one Python thing in a Gradio file, and it's great for one user, one time, things like that. There is no user login, there are no credentials you have to supply, there's no protection of anything, you know, all of that stuff. So I started working through, from kind of an operational analysis standpoint, what kind of blocks do I need and how would you break this thing up? And we've already got the vector database; we've already got most language models.

It's pretty easy to wrap them in either a connector themselves, or they can go with the OpenAI API.

You have the button, right?

And that type of thing.

So you may see more of this coming up, because this kind of triggered something else I've been working with for a while, which is: I'm used to using model-based tools to do a lot of things in my other day job, and I haven't really seen them applied at all to AI — in how we think about things. Because when I started walking through this — there's actually a methodology called Arcadia that I was going through — I didn't even realize I needed this whole coordinator block until I was going through the operational analysis. I suddenly had functions that didn't belong anywhere. It's like, oh, these are similar.

I probably ought to put them in something to contain these to do the same thing. If I would come at it from my own perspective of, oh, I already know I’ve got this and I got this. So you can go ahead and put the things you already have and it helps you figure out what you need. But anyway, I might start, I might walk through some of that. That might be a talk much, much later down the road. But that’s what I’ve currently got.