Structured Output and Speech2Speech


Transcription provided by Huntsville AI Transcribe

There you go. All right. Folks that are online, I guess, Drew C., can you hear us? Mic check for online? I guess I should have done that before we turned on the recording. We get to hear that on the transcript. Let’s see. Not hearing anything. Okay. Well, I’m going to assume that this is fine.

All right. Welcome, everybody, to Huntsville AI Meetup. If you’ve not been here before, we meet about every other week here at HudsonAlpha. There’s a newsletter that goes out to show the different topics. And we’re just interested in all sorts of different… Your microphone’s muted. My microphone’s muted on Zoom. It’s muted here, but this thing… Oh, yeah. Okay. I see. All right. Thank you. That would be… Sounds good. All right. Everything’s working out great.

Okay. Lost my train of thought. No, no, you’re totally okay. So I’m not usually who runs these things, but Jay is out this week. He’s off looking at elephants at the zoo. Okay. So let’s go ahead and hear some stuff from the audience and hear some comments from you.

All right. So I’m going to go ahead and keep going just because we’ve kind of gotten on a roll here. So we have two topics that we’re going to talk about tonight. I think somebody is here from the Space & Rocket Center, or are they here yet? Somebody might show up sometime to tell us about the AI symposium, and if they don’t, then I will make up what I can from what I know and try to present it myself.

So we’re going to talk about two topics tonight, one of which is speech-to-speech, an open source speech-to-speech project from Hugging Face. And I’ll be going over something a little bit later, something with structured generation later on. But Charlie, I guess do you want to kick us off and we’ll just go into what this thing is.

Sure, right. So speech-to-speech. What do we mean when we talk about speech-to-speech itself? It is a way for humans and computer systems to interact without any physical input, like typing or anything like that. Probably one of the more common ones that you may know is interacting with Siri or any of your Alexa devices or whatnot; it is similar to that. Hugging Face has come out with an open source one. And they go through this pipeline of just exactly how it works. Can you make the text bigger on the screen? Yep. Thank you. Maybe. Okay.

So there are four specific aspects of this pipeline that you more or less see in just about any other kind of paid speech-to-speech product. You have the voice activity detection model, your speech-to-text model, the text LLM that’s doing all of the query and the response, and then the text-to-speech model bringing that text back into speech. We’re going to go into what each of these are. Assuming that’s my cue, okay.
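To make that flow concrete, here is a minimal sketch of the four stages wired together. The function names are placeholders chosen for illustration, not Hugging Face’s actual API; each stage would be backed by one of the models discussed below.

```python
# Sketch of the four-stage speech-to-speech loop. The functions are stubs
# standing in for real models (names are illustrative, not the repo's API).

def detect_utterance(audio_stream):
    """Voice activity detection: wait until the user stops talking, return the clip."""
    ...

def transcribe(audio_clip):
    """Speech-to-text, e.g. a Whisper-family model."""
    ...

def respond(transcript, history):
    """Instruction-following text LLM: turn the transcript into a reply."""
    ...

def synthesize(text):
    """Text-to-speech: turn the reply back into audio."""
    ...

def speech_to_speech_loop(audio_stream):
    history = []
    while True:
        clip = detect_utterance(audio_stream)   # 1. VAD waits for silence
        query = transcribe(clip)                # 2. STT
        reply = respond(query, history)         # 3. text LLM
        history.append((query, reply))          #    keep the conversation context
        yield synthesize(reply)                 # 4. TTS back to the user
```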

So voice activity detection. This was actually one of those things where you know it, but you don’t actually know what it is. This is a way for a system to tell not only when a person is speaking, but when they stop speaking. A lot of times you find this kind of detection whenever, say, you are asking Siri a question or OpenAI’s model a question. This detection system waits for the end of your sentence, recognizes that there is silence, and then says, okay, I’ve captured my query and now I can send that on to the next portion of the pipeline. Like your hey Alexa sort of thing.
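A toy way to see the idea is a simple energy threshold over short frames: anything above the threshold counts as speech, and a long enough run of quiet frames means the utterance is over. This is only a sketch of the concept; the Hugging Face pipeline uses a trained VAD model rather than anything this crude.

```python
import numpy as np

def split_utterance(samples: np.ndarray, sample_rate: int,
                    frame_ms: int = 30, energy_thresh: float = 1e-3,
                    silence_frames_to_stop: int = 20) -> np.ndarray:
    """Toy energy-based VAD: return the slice of `samples` from the first
    speech frame until enough consecutive quiet frames have passed."""
    frame_len = int(sample_rate * frame_ms / 1000)
    start, end, quiet = None, None, 0
    for i in range(0, len(samples) - frame_len, frame_len):
        frame = samples[i:i + frame_len]
        energy = float(np.mean(frame ** 2))      # mean power of this frame
        if energy > energy_thresh:
            if start is None:
                start = i                        # first frame with speech
            quiet = 0
            end = i + frame_len
        elif start is not None:
            quiet += 1
            if quiet >= silence_frames_to_stop:  # long enough pause: utterance over
                break
    return samples[start:end] if start is not None else samples[:0]
```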

Yeah, yeah. It waits for… or, I don’t want to say that. I’m not going to use that one. I threw up a picture there just showing a waveform of some speech and what the actual waveform looks like. And so this detection system captures where the speech ends and sends it to the next portion, which is speech-to-text. It is what it says on the tin: it takes human speech, transcribes it into text, and just sends it on its way. Probably the most popular one that was open sourced was OpenAI’s speech-to-text model, Whisper. What they did, they trained on 680,000 hours of data. I think they grabbed audiobooks or something similar to that.

It’s great that that was open sourced, but to use a model like that takes a lot of compute. Think something like the A6000s or similar whenever you’re trying to use a large model like OpenAI’s Whisper. This is all with the English language? I’m sorry? This is all with the English language? Yes. Okay. And one of the things they did with the latest one was add a whole bunch of extra support for other languages. Okay. Yeah. So, thank you actually for bringing that up, because Hugging Face talks about how, whenever you’re using this portion of their pipeline, you can use any Whisper-based model that’s available on Hugging Face, especially the Distil-Whisper version.
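As a concrete example of that slot, a Whisper-family checkpoint can be loaded through the transformers speech-recognition pipeline; the checkpoint name below is one of the published Distil-Whisper models, but any Whisper-based model on the Hub can be swapped in.

```python
from transformers import pipeline

# One of the published Distil-Whisper checkpoints; any Whisper-family model
# on the Hugging Face Hub can be substituted here.
asr = pipeline("automatic-speech-recognition",
               model="distil-whisper/distil-large-v2")

# "utterance.wav" stands in for the audio clip the VAD step captured.
print(asr("utterance.wav")["text"])
```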

We’ve talked briefly… didn’t we talk at one point a few weeks ago about distillation of models? I don’t think we have. We haven’t? Okay, fantastic. I get to do a little bit of that then, because that was another thing that’s relatively new to me, but there have been some papers about it. What do we mean when we talk about something being distilled? In this case, for Whisper, they take the first and the last layer of a Whisper model, get rid of everything in between, put those two layers together, and then they take all of the knowledge from the larger model and use it as part of the training data to influence the smaller model. I didn’t put any metrics on there, but Distil-Whisper is, I think, about 45 to 48 percent the size of Whisper, the larger model, and I think 5.8 times faster.
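A rough sketch of both halves of that idea in plain PyTorch: build the student from the teacher’s first and last blocks, then train it to mimic the teacher’s output distribution with the usual KL-divergence distillation loss. This is only an illustration of the concept, with toy linear blocks and random data, not the actual Distil-Whisper recipe (which also keeps the ordinary transcription loss).

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy "teacher": a stack of identical blocks standing in for transformer layers.
teacher = nn.Sequential(*[nn.Sequential(nn.Linear(64, 64), nn.ReLU())
                          for _ in range(8)])

# Student keeps only the first and last block, initialized from the teacher's weights.
student = nn.Sequential(copy.deepcopy(teacher[0]), copy.deepcopy(teacher[-1]))

def distillation_loss(x, temperature=2.0):
    """Train the student to match the teacher's softened output distribution."""
    with torch.no_grad():
        teacher_logits = teacher(x)
    student_logits = student(x)
    return F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(100):                 # stand-in for a real training loop over data
    batch = torch.randn(32, 64)
    loss = distillation_loss(batch)
    opt.zero_grad()
    loss.backward()
    opt.step()
```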

So it’s great. Using distillation is a great way to get the benefits of a large model without having it take up as much space or as much compute. You can use this with a much lower-end video card, I would say like maybe what, 4090s possibly? It depends. It depends. Okay. So we’ve got the speech-to-text part.

So the text that’s generated from that ends up being the query that gets sent to the text LLM. And there are a lot of large language models that deal specifically in text. I think the picture I put up there shows something like 146,000 or 148,000 of them. Yeah. And these are just, if anyone has used ChatGPT or Claude, you type in a query and it gives you a response; that is what they mean by text. But Hugging Face goes on to say that for this portion of their pipeline you want to use an instruction… what was that? It scrolled a little bit. I said instruct.

Instruction-following large language model. What they mean by that is that the large language model is expected to get an instruction from the user and to follow that instruction as safely and as accurately as possible. And how long will it stay in that context? Will it, after it gives you an answer, drop out of that context, and then you have to tell it to tell me the story again? For the purposes of the pipeline, it is in one context. Okay. So that one user session. Yeah. How long your context is going to be is something you choose depending on what your resources are.
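In practice “one context” just means the running conversation keeps getting appended to the prompt on every turn. A minimal sketch with the transformers chat-template API follows; the model name is only an example of an instruct-tuned checkpoint (it is gated, and any similar chat model could be substituted).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"   # example instruct-tuned model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# The conversation history lives in this list; every turn is appended to it.
messages = [{"role": "system", "content": "You are a concise voice assistant."}]

def chat(user_text: str) -> str:
    """Append the user turn, generate a reply, and keep it in the history."""
    messages.append({"role": "user", "content": user_text})
    inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                     return_tensors="pt")
    output = model.generate(inputs, max_new_tokens=128)
    reply = tok.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True)
    messages.append({"role": "assistant", "content": reply})  # context carries over
    return reply

chat("Tell me a story about elephants.")
chat("Now summarize it in one sentence.")  # the model still "remembers" the story
```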

So can you say, this is the end of telling me a story, I want to do something different? As long as that connection stays active, then you may end up losing some of the earlier conversation if your context length gets very large, but for the most part, depending upon your model, it should be able to follow that instruction pretty well. So now that waveform you were showing there, is that a word or a sentence? That was really just, I think, “hi there.” Something like that. Okay.

Now, when you talk about taking the front end and the back end… That’s smaller. Yeah. And then you say you put the information in. I don’t understand that. To me, those vertical lines are the information. So what good does it do you to cut it out of the middle, then you go put it back in again? Oh, are you talking about…? I honestly don’t understand the concept. Are you talking about the, what were we talking about with the models, the distilling the models? Yeah, we’re talking about the small versus the large. Yeah, distilling. That’s what you’re talking about. Yeah.

That’s just a way of making what a large, expensive model does achievable with a small model. What it’s really doing is this: you have these large models, and a lot of the layers that are inside of these things don’t really do much. They’re not that important to your actual task. So you can actually lobotomize them, essentially. That’s really what you’re doing. You do width-based or layer-based pruning. And then you have the larger model give answers, and you teach the smaller one to mimic those answers while it essentially, you know, has half of its brain out.

Which means you have to do less multiplication. You have fewer things that it has to pass through, which makes it smaller, faster, yada yada. So when you want to use large models, you use those more for training data, whereas the smaller models you would want to use for tasks, am I right about that? I think you’re going to have trouble generalizing that. It’s going to depend on all of these things.

All right. So you’re saying that normally you’re chopping out the middle layers. Yeah, you’re taking out those layers, yeah. And sometimes you do it this way and sometimes you do it that way. Yeah. Sorry. Hi. I have a couple of questions. So you said that you’re chopping out the middle layers and just using the initial layer and the output layer. But how is that different from just initializing a network with only two layers to start with, something like that?

It really just depends on the strategy that’s used. The reason why I talked about that one is because it comes from the Distil-Whisper paper. But one particular strategy is usually to take layers that are as far apart as possible, which tends to give the most accurate results, whether that means the first and the last or some other combination. So there is another element where, with the smaller model, you’re starting with the weights from the larger model, so you’re not having to go as far from a random starting point.

So it still has to update the weights, because it has to learn how to do the task with fewer layers, but you have a kind of a starting point, so you’re not having to go so far, and that helps a bit too. Well, if you think about the typical diagram, when we think of the layers, they’re usually represented by the nodes, but the weights are really the numbers on the connections, so there’s a relationship between two adjacent layers. So if you take out the middle layers, how do those weights really work out?

It’s because a lot of the middle layers, especially once you get deeper or higher in the network—whatever you want to call it—they don’t do much. There’s lots of zeros; there’s a lot of things that don’t actually change. That’s what it is, and so it’ll depend on all the models, but the deeper and the more abstract your layers are, it can be the case that you can cut them out without a degradation in your task. It’s not always the case.

Where the instruction following is the key in the pipeline you mentioned, just to clarify, is that a combination of the voice instructions that the user is giving and also a pre-prompt, you know, that kind of instruction that you can give, like attend to what I’m saying, or other instructions… It’s more whatever query is coming out from the speech-to-text portion. As far as what you’re saying, as long as the model that they’re using for speech-to-text is accurate enough, then you can do that association. But really, it does depend on the query that’s coming out of that into the LLM.

Right, so in our setups, you know, we would very likely modify it. So you have this speech somebody’s giving, that’s translated to text to be fed to the LLM, and it’s likely we would have a system prompt in there as well to maybe modify that. Right, yeah, and in some of my tests, I did find there’s even a prompt that sets the default voice for the text-to-speech portion. Awesome, I just want to make sure that’s all in that instruction.

Is there any way for you to tell it to watch what it’s doing, so that on occasion you can say, “You got that all messed up, and you didn’t get it right?” The instruction to enable… I mean, so the instruction, that’s basically… There are two major kinds of variants for all the models that you’ll see out there, and this is just a pattern in the community rather than an official term: there’s the base model, which is basically what comes fresh out of pretraining, and these instruct models, which have some additional fine-tuning where you give it a task and it accomplishes that task, so it learns how to talk like a chatbot.

So that’s where your sort of ChatGPT stuff comes from. It’s been instructed during the instruction tuning; it learned those specific patterns. And any of your models in a pipeline can get that instruction tuning; it just happens that the text models get it the most. To go on to your question, whenever you’re saying, “Oh, what I got back is not correct,” you can, as long as it’s still in that same user session, in that context… From what I have tested on this, it does go back, and it’s, “Oh, I’m sorry about that, let me rephrase,” but again, that also depends on the LLM that you’re using.

And since it’s a pipeline, you can also, you know, have your voice activity detection running on an additional thread. And so you can wait for interrupts. It’s like, “Hey, hey, hey, stop.” And you can do that sort of thing too. So it’s not like Claude where Claude’s currently sending, “You know, I got an answer, I don’t quite like that answer. Send it back and give me another answer,” and then it goes, “I think I like this one better than this,” and then gets it to the user, right? That’s how they improved it, kind of. That’s a front end to a normal model, right?

Usually, they’re turn-based, yeah. But you can obviously, if you want to, do some sort of interrupt if you just program it. That’s not a language model thing. That’s just a real old software dev, you know, sort of thing. So now we get to the text-to-speech portion. I put this up as an illustration if anyone can recognize where this comes from. Thank you. One of the first times that at least I saw text-to-speech in media.

But it is just, it’s just another way of… Oh wait, go back a little bit. No, it’s fine. It’s fine. Yeah, I didn’t really have that much more to say about it, because these models really just do what it says on the tin.
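For that slot, newer versions of transformers expose a text-to-speech pipeline; the Bark checkpoint below is just one example of a model that could sit there (the repo’s own recommended models come up in a moment), and the exact output format may vary by library version.

```python
import scipy.io.wavfile
from transformers import pipeline

# Example text-to-speech model; any TTS checkpoint on the Hub could fill this slot.
tts = pipeline("text-to-speech", model="suno/bark-small")

speech = tts("Hi there, how can I help you today?")

# The pipeline returns the waveform and its sampling rate; write it to a file.
scipy.io.wavfile.write("reply.wav",
                       rate=speech["sampling_rate"],
                       data=speech["audio"].squeeze())
```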

So we’ve been talking about this pipeline, but where can we go to get it? I put the link in this presentation directly to Hugging Face’s repository, and I’m giving you a quick preview of speech-to-speech. And if you’ve got a high-end enough machine, like a Mac or whatnot, all you have to do is clone the repository and run that one line, and it will take care of everything on your machine.

And if you go on Hugging Face too, that’s the example for the Mac; there are also things you can do for NVIDIA and ROCm and all that sort of stuff. They run on CPU as well. It’ll be slow, but you can do it. They also have a client-server setup as well, which I like, because that means I can take something like RunPod, throw all the server stuff on that end, have it do all the heavy lifting, and then I can use my Framework laptop, or I even got a Raspberry Pi to be the client with a little mic. I can just take a microphone and a couple of speakers, and as long as that connection is good with your server up at RunPod, it works pretty well. Yeah, any questions about this?
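The client half of that split is not much more than shipping audio bytes over a socket and playing back whatever comes back. Here is a bare-bones sketch of that shape using only the standard library; the host name, port, and framing are made up for illustration and are not the repo’s actual protocol.

```python
import socket

# Hypothetical server address: wherever the GPU-backed pipeline is running.
SERVER = ("my-runpod-instance.example.com", 12345)

def send_utterance(audio_bytes: bytes) -> bytes:
    """Send one recorded utterance to the remote pipeline, return its audio reply."""
    with socket.create_connection(SERVER) as sock:
        sock.sendall(len(audio_bytes).to_bytes(4, "big"))   # length-prefix framing
        sock.sendall(audio_bytes)                           # raw PCM from the mic
        reply_len = int.from_bytes(sock.recv(4), "big")
        reply = b""
        while len(reply) < reply_len:                       # read the synthesized reply
            chunk = sock.recv(4096)
            if not chunk:
                break
            reply += chunk
        return reply
```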

Eventually, we’ll talk. So this is the pipeline version, where you kind of have to split out the different parts: it’s speech-to-text, then you do the LLM, then you put it back out as speech. There are also models, a lot of models, that do that all in one go. There’s no pipeline magic; it interleaves the audio tokens and the text tokens and the output tokens, one after the other in a single sequence. There’s a thing called Moshi, or something like that, that does that sort of interleaving.

So there are some of those models that are starting to pop out. A lot of them are very basic, not super great yet, but I’m assuming in a year or so we’ll be talking about that sort of thing. Mm-hmm. If you guys have tried it, I’m sure you’ll like it. All right. Just practical experience. Any surprises in using them or hiccups or…? Just give me a little…

So there are some models that HuggingFace recommends on their repo that you can use in parts of this pipeline. Distilled Whisper I very much like. There are so many LLMs to put in the middle. It really just depends on what your use case is. As far as the text-to-speech, they started off recommending Parler TTS. I’m not 100% sold on that one yet. I’d like to do some more testing with a few others, but…

Parler’s good. Bark’s good. I think those are the two big ones. Solerno, I think I’ve heard Solerno. I’ve heard some of those too. Yeah. Tagging you. There you go. All right. Any other questions for Charlie before we head on to the next one? I’m going to stand up because I don’t like craning my neck. That’s fair.

All right. So now we’re going to talk about a mouthful here: structured outputs with finite state machines. So we will know what all of that means by the end of this. That’s the main focus here. The main focus. Oh, sorry. Go. I’ll let you know whenever I need to switch. It was switching. There we go. All right. I need to remember what my… Hang on. There it goes. All right.

Yeah. So the main focus of what we’re talking about here is controlling the outputs of LLMs to actually have reliable applications, making AI engineering engineering again. That’s the main focus here. So let’s go to the next one now.

So the problem we’re talking about here is trying to get structured output out of things like ChatGPT, Llama, Mistral, yada, yada, yada. These things are great at generating this kind of human-like text, passing the Turing test, giving us sort of conversational stuff. But sometimes we don’t want that. We just want to utilize these things as some sort of cog in a pipeline or a system. We want some sort of normal machine-readable syntax, something like JSON or XML, so we can pass what this natural language understanding component produced on to the next part of the pipeline. But that is not always an easy thing to get.

Because if you go to the next slide, this is what I generally find whenever I’m dealing with LLMs, just cold turkey. So I want it to… Say for instance, I’m doing some sort of phone number-related thing. I want to generate test cases. I need a bunch of phone numbers. I want to say, “Hey, just give me a phone number for Washington State. 106-555-0134. Thank you.” Yes. That’s great. Easy to parse. But here’s what you get.

And you see, here’s what you get: “Here’s an example of a phone number I’ve formatted for Washington State: 205-555…” and it’s super great because it’s 555, so it’s fake. It’s like, oh, that’s very helpful. But nobody asked. Yeah, super helpful robot down here. He just wants to help. He’s been instruct-tuned. That’s where it comes from, all that instruct-tuning that we talked about before. And it’s just yapping, yapping away when I’m trying to do some work. So you get that up here, but we want this down here. So how do we get there?
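The answer the rest of the talk builds toward is to constrain decoding with a finite state machine: at every generation step, only tokens that keep the output inside the target pattern are allowed. Here is a toy character-level version of that idea for the phone-number pattern \d{3}-\d{3}-\d{4}; libraries like Outlines do the real version, compiling an arbitrary regex into an automaton over the model’s actual token vocabulary.

```python
import random
import string

# Finite state machine for \d{3}-\d{3}-\d{4}: state i is the set of characters
# allowed at position i. (Real libraries compile any regex into an automaton
# over the model's token vocabulary rather than single characters.)
DIGITS = set(string.digits)
PHONE_FSM = [DIGITS, DIGITS, DIGITS, {"-"},
             DIGITS, DIGITS, DIGITS, {"-"},
             DIGITS, DIGITS, DIGITS, DIGITS]

def constrained_generate(pick_char) -> str:
    """Generate a string that is guaranteed to match the pattern.

    `pick_char(prefix, allowed)` stands in for the language model: it chooses
    one character from `allowed` given what has been generated so far. In a
    real system this is a mask applied to the logits before sampling."""
    out = ""
    for allowed in PHONE_FSM:
        out += pick_char(out, allowed)   # the model only ever sees legal choices
    return out

# Stand-in "model" that samples uniformly from whatever is allowed.
print(constrained_generate(lambda prefix, allowed: random.choice(sorted(allowed))))
```

Because illegal characters are never even offered, the output always parses, no matter how chatty the underlying model would otherwise be.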