Text to Speech

Transcription provided by Huntsville AI Transcribe

This is, let me find the presentation.

This is pretty much how we ran things from, what was it, March 2020 all the way through 2021.

We were, you know, because of COVID, we went pure virtual for about a year and a half, it seemed, before we got back live.

So, welcome.

If you haven’t joined us before or after we did that, that’s kind of what was going on. So, what we’re talking about tonight is text to speech. We’ve done a lot of things in the past as far as whisper and faster whisper, and even before that, back to deep speech and other things, going, you know, speech to text to get transcripts from audio, things like that. And then last week, not last week, two weeks ago, Charlie had walked through, you know, a pipeline kind of from a conversational AI type approach. And I know he mentioned some things for speech to text.

I don’t, sorry, text to speech. I keep going backwards. And I think we kind of left it, some of that kind of, well, hey, we can hook in a lot of different ways to do that.

And so, I just kind of picked up from there and went back and looked to see, what’s available right now?

And what kind of methods are there? Things like that.

Some of it extends on a conversation I'd had last year. There's a guy who used to be really involved in Coworking Night named Ben East. I don't know if you all remember him or know him. He's now a pretty good game developer, excellent actually, and he has some pretty significant games out there on Roblox.

And some of the stuff he was working on, he was, you know, at the point of having to actually do full game production, like going and finding the videos and finding the graphics and finding the models and then going and getting voice actors to talk through the lines of whatever scripting stuff he had. And so, I had a kind of conversation with him about how could you pull some AI into your workflow where you can actually, you know, kind of work through what the conversation needs to be and get something a little realistic where you can work through timing, where you can work through, you know, how this feels from a user perspective before you go pay a lot of money to actually have actors go voice through all of these and get audio clips and all of that. So, the best thing that I found at the time was a model called Bark from Suno that we’ll walk through. He is currently heavy into 11 Labs is what he’s using right now. I may see if I can get him on early next year just to walk us through what kind of AI tools he uses in his workflow. So, that might be interesting.

And then, so that kind of, and of course, the first thing I do is I go to Perplexity and then I go to ChatGPT and I say, hey, what’s the state of the art, you know, voice to text?

Wow, wrong way. Text to speech. Text to speech models and things.

Is Suno, Bark still, you know, still state of the art?

And what I get back from them is, well, no, it's not. You got these others. And I actually go look at it and I think Perplexity and some of the others were probably a little not exactly correct, which, guess what, it's AI. So, I started off with, let me see if I've still got that Hugging Face tab. I don't think I've still got it open. Where was that? I started off, of course, with the Hugging Face text-to-speech page. Started walking through this, which is kind of neat. Talks about how some of these work. There's a video you can watch, but the accent is so thick I can barely understand what they're saying. But some of these kind of make sense and flowed into the idea that I wound up breaking things down into. So, what they were working through is like voice assistants, where you've got, this is more of the conversational type thing that we were talking about last time. You've got some that are more announcement systems, which has been, I think, your traditional voiceover work, where you have somebody go read some kind of a thing in the right kind of tone, the right pace, and then you wind up playing those back. But these can actually now be used in place of voice actors for that.

So I'm sure, you know, for voice actors there's probably a strike somewhere upcoming. And then you've got the other kind of thing, where Hugging Face and others that do this now let you call their API as part of your product, you know, if you had a product and you needed to get this in there. You could do it that way.

Or you can actually plug them directly into your product.

And so that's kind of the way I wound up breaking it down. And the interesting thing I've been playing with: apparently they've got 1,500, you know, text-to-speech models that you can just plug in right away.

Except each one tends to have its own kind of little thing that you got to do different.

So they got close, but it’s not quite there. And that’s kind of where I started.

So I started off, of course, with 2001: A Space Odyssey.

And this is the kind of thing I’m trying to figure out if I can. If I can repeat this and let me, if somebody can just let me know if you hear this, because I don’t know if it plays through the, through the actual connection. So is that, did that audio come through? Okay. No, that didn’t come through. Did not. Okay. Let me see how.

Let me see how this, I remember having to do something. I think it’s in the share or something like that.

Okay. There we go.

Got it. Oh.

Cause this is a very boring movie without the audio.

Did that come through? I did not see the video or hear the audio this time. Neither. Okay.

Hi.

We’re going to get there.

Hold on. This button, that button, share.

Do you see the screen?

It will hold on. Let me, I see you already know the answer to that. Let me go over to chat. All right. Maybe I go to window.

Share tab or screen instead.

Okay. That's weird.

It's like it doesn't want to share. Still working on it. I could just be Dave. I could be Dave and you could be Hal. I'm pretty sure we'd... Sorry. Yeah.

So open the pod bay doors Hal.

Sorry. I can’t do that. There you go.

So that’s the whole deal. Let me see if I can.

What about just going to play a game?

Right.

And for all we know, you guys may be AI agents hosting this.

We could be.

Certainly.

Is this actually sharing?

I can see your screen.

Okay. Let’s see if it actually.

If it’s true to form, it’ll make me play like a commercial first.

Oh, there we go. I’m sorry, Dave.

I’m afraid I can’t do that. Yeah. So that whole conversation. What I was trying to figure out is it’s fairly straightforward. It’s fairly famous.

I've got the script for the whole thing with the actors and Hal and all of that.

So I started working through. I didn’t get very far.

But what would it take to redo that with the current speech to text, text to speech? So that’s kind of where I started.

Home Assistant could probably do it. We do automation to grab a voice where it goes in, "Open the pod bay door," and then it runs an automation to open a door or unlock it. It's kind of like when I wake up or say good morning.

Right. Good morning. It turns on all my lights and tells me good morning and have a nice day. All right. So moving back to my screen and then back over to. Shut up now. Not the presentation piece. We’re getting there.

So that’s just kind of thing, though.

So, the categories: I've got some below, but first, some of the unexpected things that I found on some of these models.

Some of them, especially one called Coqui that we'll see if I can get to work.

You can actually clone a voice with less than a minute’s worth of audio from the source that you want to use.

I mean, it's like a voice clone, you know. Say, like, if I wanted to make my own commercial or something, I could record a little bit of what my voice sounds like, then go write the transcript and have it go, you know, use that voice.

And it's, it's close.

It's not great.

Some are better than others. The other thing we’ll talk about a little bit.

I know 11 labs can do that as well, but I don’t actually have an account with them.
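To make that cloning workflow concrete, here's a rough sketch using the Coqui TTS Python package; the model name, reference clip, and text below are placeholder assumptions, and the exact arguments can vary by version.

```python
# Hedged sketch: clone a voice from a short reference clip with Coqui TTS (pip install TTS).
# Model name, file paths, and text are placeholders.
from TTS.api import TTS

# XTTS v2 is Coqui's multilingual model that supports cloning from short reference audio.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

tts.tts_to_file(
    text="Welcome to Huntsville AI. Tonight we're talking about text to speech.",
    speaker_wav="my_voice_sample.wav",   # under a minute of the voice you want to clone
    language="en",
    file_path="cloned_commercial.wav",
)
```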

Background additions.

Some of them, this was weird the first time, had background music behind the thing that I was trying to get speech out of. And I had to go figure out how do I turn it off or how do I make it do other things. So that’s kind of interesting. So you can add tokens for laughing or coughing or pausing or yeah, there’s all kinds of interesting things on some of these models that you can slide in, normally in square brackets, that cause it to do additional things.

Some of them have ways that you can pick what kind of voice that you want.

This was pretty interesting. I’ll see if I can figure out how to do it with the Bark model. It’s what I was talking to Ben about back, you know, last year. They had things like, hey, I want a voice clip of this. But do it like it was a 1950s sci-fi announcer, you know, for a movie or something like that. That would be, and it did. It was pretty interesting.
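For Bark specifically, here's a minimal sketch of how the square-bracket tokens and the pre-built voice presets look with the transformers port; the model size and preset name are just examples, and the presets are named speakers rather than free-text descriptions like "1950s sci-fi announcer."

```python
# Hedged sketch: Bark through transformers, showing a bracket token plus a named voice preset.
import scipy.io.wavfile
from transformers import AutoProcessor, BarkModel

processor = AutoProcessor.from_pretrained("suno/bark-small")
model = BarkModel.from_pretrained("suno/bark-small")  # add .to("cuda") and move inputs for GPU use

# Square-bracket tokens like [laughs] or [music] nudge Bark to add those sounds;
# voice_preset selects one of the pre-built speakers that ship with the model.
inputs = processor("Welcome to Huntsville AI! [laughs]", voice_preset="v2/en_speaker_6")

audio = model.generate(**inputs).cpu().numpy().squeeze()
scipy.io.wavfile.write("bark_preset.wav",
                       rate=model.generation_config.sample_rate,
                       data=audio)
```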

And some of the other latest ones, this is where you get some of the side effects that you got to really, really look out for.

You tell it the prompt that you want it to, you know, or the text that you want it to say, or the sounds that you want.

And then you provide a second prompt, for context and what kind of voice, what kind of tone. Do you want it upbeat or do you want it more, you know, more depressed? Do you want somebody that sounds anxious? Do you want this to happen at a train station with, you know what I mean? It’s, some of the stuff gets super duper realistic.

And it’s, it almost reminds me of some of the multimodal stuff that’s going on in kind of the opposite side where you can point it at a video clip.

It can, it can hit the audio of what things are saying, but if it sees things in there that it knows are making sounds or birds in the background or something like that, you know, it can also add that to the transcript. This is kind of going backwards where I’ve got the context and everything in it and then produce the sound from there. So the three categories I wound up with, one of them is just, you know, products on the side, but standalone that you give it, it gives you sound files. I don’t know that play AI is like that.

I got that pulled in.

I haven’t put too much effort into it, but we can hang on one second. I got to answer a message, make sure somebody knows that they’re not supposed to be at Hudson Alpha. Oh, if I can remember how to do that.

Yeah.

Okay.

So is Robert on?

I don’t think so.

Yeah, but I’m here. Okay.

I might get you to kind of talk us through, give me one second to go through 11 labs.

And then if you can just kind of briefly walk us through what you're doing with Play AI, I'll bring up at least the website and some of the stuff it's got. Or I could throw the screen over to you if you want. I mean, we're just kind of impromptu tonight.

So initially, let me hop over to 11 Labs, and I need to remember to hit back later. This one was pretty interesting; it's got a lot of good stuff. The only problem I really see is I don't know what it's doing behind the scenes. But if you were wanting to do a case study of how to actually look more like a product than some development model kind of thing, this would be a pretty good one. Super customer focused, user focused. It doesn't assume that your users know anything about AI or what's happening behind the scenes. It really just asks, what is it you want to do? Oh, okay, I want to introduce a podcast, or I want to do, you know... That's pretty much what it's geared for.

Again, with this one you can create a voice clone; you can do a lot of things here. And I'm not going to go too much into this because this is more along the lines of... well, actually, maybe they've got an API. They do. So I guess this one actually has two different kinds of segments; I guess I should put it in both categories.

Because the other thing that you need to look at: some of the super hi-fi models that I've found take a minute to produce audio files, and if you're trying to do a conversational AI, you're going to wind up in the Zoom phone-tag thing where we all wait and then all talk over each other because of the latency. In this case, you've got, you know, some of its models; I'm not quite sure what types, how fast each one is, or what they sound like, but there's a lot of that.

I thought initially I would run into some issues where everything is super heavy on English, but a lot of these start off with multilingual models from the ground up. And then you find that some of them also might have an English model associated with them, but for the most part, a lot of these are multilingual out of the box.

Since you're already on the website, 11 Labs offers a ton for the free account, and the voice changer especially is impressive, if you're willing to try that out for a couple of seconds. Oh yeah, what do I do, try for free? Yeah, let's click through where you can just create an account and you'll automatically have the free account. Let me try sign in with Google. Let's go with this account. Sure.

Another user? I don't know if that's in here; we'll go with Other. All right. And on the left there's Voice Changer, right under Text to Speech, and you can directly record audio from your computer into there. Sometimes it'll give you a prompt, or you can just say whatever you want. I guess I have to click the button. Let's see if I can do that while on the same call; it'll be interesting. It wants to use my microphone. Hi, this is Jay Langley with Huntsville AI. Hi, this is Jay Langley with Huntsville AI. Now, hitting the play button will just let you hear a sample of what that voice sounds like, or you can just select it and hit Generate Speech to hear it. All right, so let me... we'll go with Alice.

Alright, y’all probably couldn’t hear that because I’m not sharing the right thing. Let me flip this over real quick. This might be one of the more interesting videos to watch later. Let’s see.

Share screen.

What worked was Chrome tab. And then 11 labs.

Share audio.

Alright. So what I wound up with with Alice.

Hi, this is Jay Langley.

We’re Functional AI. That’s kind of neat.

A little garbled. Try out Bill or George. Let me try Bill.

Hi, this is Jay Langley.

We're Functional AI. It could be my southern accent sometimes. Let's try it direct here. Hi, this is Jay Langley with Huntsville AI.

Hi, this is Jay Langley with Huntsville AI. That's pretty good. Yeah, I found the most success with custom-made voices. You get a lot of little variables that you can adjust.

One of the big shortcomings I found with 11 labs is sometimes if you go over maybe 500 characters, it’s more likely to have some errors.

Okay.

I try to keep it shorter, but there’s probably ways to get around that as well.

Let’s see.

I don’t know if I can add to my voices.

We’ll find out.

So voice, stability, similarity, speaker boost. I wonder what happens now. I wonder if it would be the same. Hi.

This is Jay Langley with Huntsville AI.

Okay.

Yes, I mean, you can do… I may play with your free tier for a bit.

I’m sure it’ll tell me when I run out of… I’m at 3%, apparently.

So let’s hop from there over to play.ai.

Because I know… Robert, you had posted something on the Discord.

I took a look and I couldn't quite figure out what it was that you were actually trying to do. I was just trying to send post-call data, but I figured it out. Okay. And I'm guessing you're... What kind of... If you don't mind sharing, what kind of app or workflow are you working on?

So I’m building everything inside. So I do automotive industry, and I’m finding the areas where AI needs to plug in. Service overflow is a huge one.

And we launched at Singing River Toyota in Muscle Shoals. And on day one, we did 18 appointments from just service overflow. And that’s where the phone rings five, six times and then no service advisor picks up and then an AI picks up.

And starts booking appointments.

Oh, cool. So you got voice.

We're doing something coming up on... I don't even know how to say agentic AI, but AI agent stuff. Do you route that over to an agent and then have the agent go back with a voice, or is there more of a short circuit? No, everything is built within this tool.

And then we just send post-call data to either CRM or any type of email address.

So we haven’t really built integrations yet with their scheduling system, which most dealerships don’t even want.

If they get a phone call in, the post-call data sends name, phone number, conversation summary, and transcript.

Okay.

That’s pretty cool.

Yeah, we've done that, and then we've built something for after-hours sales. So let's say you're closed on a Sunday and a man calls the dealership; they're looking on AutoTrader, and they see a Tundra, and they call, and nobody picks up. So we don't really know how many inbound phone calls come to a dealership after hours.

So half of the inbound leads come to car dealerships after closing hours.

We don’t know how many phone calls come in, but I’m sure some do. So we’ve built that, and eventually we’ll start connecting to different data pieces and maybe taking all inbound calls for car dealerships.

I don’t think the tech is there quite yet.

It’s still a little bit noticeable that it’s not a human at some times.

But that’s kind of the goal and the direction that we’re moving in this upcoming year.

Well, I mean, that’s a question. Would I rather talk to a robot or a car salesman? This one’s worse than a car salesman.

It's a lot worse. We've taught it on the worst of the worst, so it overcomes objections and crushes them. All right. The other thing, I don't know if you're, is it only in English, or have you looked at figuring out what language and then routing off of that? In the API, there's ways to do it through Spanish and different languages, but I haven't really messed with that yet. Okay. I mean, that's something I've thought of, where, you know, if I'm trying to take calls or emails or something, I already know I'm lost on anything but English. You know, at least if I routed it some other automated way, I'd have a chance.

You know?

It’s above nothing.

But that’s pretty cool. So you’ve got this.

I started looking into it as far as what kind of APIs they've got, and it does seem to be much more of a REST API type thing.

So I'd probably put it in the same list as 11 Labs, where you've got some kind of a call that you can make from, you know, a web service type thing.

And this one, you can you can bring your own.

The other thing I kind of I got distracted a little bit from the kind of you know, workflow that you’re using because the other thing that Play AI can do is if you’ve got, it’s almost like a I don’t know what you’d want to call it, a kind of like a marketplace almost where if I had a model that did something specific, I could get it added to Play AI. And then I don’t know how the pricing or anything like that works, where if I have a model and other people start using my model to do their stuff, do I get a, you know, some kind of percentage or something like that? I’m not quite sure how that works. Well, from what I’ve seen, I signed up on a package that’s different from the pricing that exists now.

I'm paying $99 for 2,000 minutes.

So, yeah, that’s not bad. I think they changed the pricing to $299 for 2,500 minutes at this point. That’s pretty interesting.

Yeah, so that’s that part. Let me, again, sorry for jumping all over the place. Let me actually maybe the easiest way to do it. Oh, my run time. My run time disconnected, so that’s going to take a little longer anyway. Oh, where was that?

I’m now going back looking for my initial window that I had.

Maybe this is actually it.

Yep, there we go.

Look there. So, I guess both of these, 11 Labs and Play.ai, have, you know, mechanisms that you can use to hook in through like an API call.
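As a rough idea of what that hosted-API hook-in looks like, this is the general shape of the 11 Labs text-to-speech REST endpoint; the voice ID, model ID, and API key are placeholders, so check their docs for the current parameters.

```python
# Hedged sketch of a hosted TTS call over REST (ElevenLabs-style endpoint shape).
# voice_id, model_id, and the API key are placeholders; consult the provider's docs.
import requests

voice_id = "YOUR_VOICE_ID"
resp = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
    headers={"xi-api-key": "YOUR_API_KEY"},
    json={
        "text": "Hi, this is Jay Langley with Huntsville AI.",
        "model_id": "eleven_multilingual_v2",
    },
)
resp.raise_for_status()

with open("elevenlabs_out.mp3", "wb") as f:
    f.write(resp.content)  # the endpoint returns audio bytes
```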

I guess I’ll just keep the window going, because I’m about to do this other one, too.

And, of course, OpenAI isn’t going to be left out.

So, in their API, they've got a speech endpoint that gives you, you know, six pre-built voices.

You can do a lot of interesting things there.

So, the you know, the Sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years. So, they’ve got a lot of different voices.
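Roughly what that looks like with the OpenAI Python SDK; this assumes an OPENAI_API_KEY in the environment, and the model and voice names are taken from their docs.

```python
# Hedged sketch: OpenAI's speech endpoint via the official Python SDK (pip install openai).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.audio.speech.create(
    model="tts-1",        # there's also a higher-quality "tts-1-hd"
    voice="fable",        # one of the half-dozen pre-built voices
    input="The sun rises in the east and sets in the west. "
          "This simple fact has been observed by humans for thousands of years.",
)

response.stream_to_file("openai_tts.mp3")
```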

I played around with it a little bit.

It’s, I mean, they’re basically copy and paste, you know, from their code.

It’s not, I haven’t played with a lot of their different models, things like that.

I have no idea why they name things the way they do. It's like, Fable: "The library is a quiet and peaceful place, where people go to read, study, and learn." So, there's that. "In the heart of the city, there is a large park where people go to relax and enjoy nature." So, you know, things like that. Real-time audio, it's kind of interesting.

I’m guessing it’s a, that’s I guess something else we’ll talk about. Context window.

Similar to what I've seen on, you know, some large language models, kind of like back in the day when we had small context windows, where what you put in there and how much of it you used would matter.

I've seen some pretty interesting things sending longer segments of text across.

And, it seems like it’ll pick up the front part and then it’ll, it might mush some stuff in the middle together, but then it’ll nail the ending on some of these models. It’s been kind of interesting.

I'm not, you know, so I'm not sure what's going on with some of that. I'm sure it's in the model card somewhere. But that's something that I'd be interested to know more about at some point. But that's, that's the OpenAI stuff. And then the next part was the stuff that you can pull in directly, let's say if you had an app and it's running on its own server, especially if you've got some kind of a, you know, a GPU available. So I started playing around with some of these. Let me go ahead and stop sharing and kick it over to, where did it go? There we go. Okay. And we will flip over to a different tab, Chrome tab.

Let’s see.

There we go.

Turn on audio.

Let me reconnect and we’ll get that going. Let me check the comments real quick. Using the crappy interface that Zoom gives me. Oh, okay. Yeah.

Yeah. All right. Let me initially start off with some installation stuff.

This is running on a T4 with Colab, so nothing super duper.

Super special.

I’m thinking it’s 12 gig, 12, 24, I’m not sure how many gig of RAM is on a T4 instance.

Do some blanking installs.

Actually, we'll go to the T4,

and we’ll see what happens.

And of course, running in Colab, I pause enough to go to other stuff, and I come back and my runtime is disconnected.

So, always interesting.

The other thing I will bring up: if you go look at Hugging Face, it'll say pipeline, but they forget in their example to tell you that you need to actually specify what device to use. Because even if you run their stuff without that on there, even if you're running on a GPU, it'll say, well, I found a GPU, but I'm not going to use it because you didn't tell me to. Luckily, it'll actually put that in the output, and that'll give you a clue to remember to go, oh, don't forget to add device.

And this was one of their basics.

Um, you know, this is the Suno bark model that I was playing around with like a year ago.
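The basic cell is roughly the Hugging Face example with the device argument added so it actually uses the T4; the model name comes from their card.

```python
# Hedged sketch: the basic Hugging Face text-to-speech pipeline for Bark.
# Note device=0; without it, transformers warns that it found a GPU but won't use it.
import scipy.io.wavfile
from transformers import pipeline

synthesiser = pipeline("text-to-speech", model="suno/bark", device=0)

speech = synthesiser("Hello, my dog is cooler than you!",
                     forward_params={"do_sample": True})

scipy.io.wavfile.write("bark_out.wav",
                       rate=speech["sampling_rate"],
                       data=speech["audio"].squeeze())
```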

And we'll see if we can find its model card in a minute, because you can do other things; this is the one where you can have it cough or laugh or provide background music, you know, things like that. Let's see how long this takes to actually run. I was thinking it was about a minute, and currently I'm at a minute and three seconds from a cold start, so we'll see. In the meantime, let me see if I can find the Suno Bark model card. Yeah, it does a lot of languages, I know. I'm not sharing that tab, though. Okay, so that finished, so now we can load that up and see what it says. "I can't do that." That actually was not bad. So, yeah. Let me see if there is a... oh, where was that? I've got too many tabs open. All right, okay. So, according to their website, the Bark model works well with about 13 seconds' worth of text, so not great for long form unless you're going to do a lot of snippets, things like that. I'm trying to find the... they have some things in here where you can actually... let's see if this actually works, or maybe not. It gets interesting trying to figure out; it seems like the Hugging Face thing with their pipelines and tasks, they try to do a lot of commonizing. But then you go to a specific one, and this might actually error out because I don't know if it meets their actual interface. No. All right, we'll leave that at that.

Oh, I bet it’s forward params.

Okay.

So there was another one called Microsoft SpeechT5.

And so for this one, it’s basically the same except there are several different speaker embeddings.

So there’s a – and I really wish this thing would share sound on a window so I could quit trying to figure out what to do with tabs.

But if you happen to go look at Hugging Face and then this path – and this is also already posted on GitHub.

So you can go back afterward and play on your own if you want.

So you can actually pull in a lot of different – think of it as pre-built encoding.

They’re embeddings, I guess is the right word, for what kind of voice to use on these.

So this was, I guess, this particular embedding.

What we’re saying is, hello, my dog is cooler than you. And – Hello, my dog is cooler than you. So that’s fun.
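For reference, the SpeechT5 version looks roughly like this, following the Hugging Face example; the CMU Arctic x-vector dataset and the particular index are just one of the pre-built embeddings.

```python
# Hedged sketch: Microsoft SpeechT5 through the pipeline, with a pre-built speaker embedding.
import torch
import soundfile as sf
from datasets import load_dataset
from transformers import pipeline

synthesiser = pipeline("text-to-speech", model="microsoft/speecht5_tts", device=0)

# Each row of this dataset is a different pre-built voice (x-vector) embedding.
embeddings = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings[7306]["xvector"]).unsqueeze(0)

speech = synthesiser("Hello, my dog is cooler than you.",
                     forward_params={"speaker_embeddings": speaker_embedding})

sf.write("speecht5_out.wav", speech["audio"], samplerate=speech["sampling_rate"])
```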

Another thing I want to do – let me do another code cell. Anybody remember the magic thing to time something?

I already lost my place.

Hold on. Time it?

I think that’s – hold on. I may have to go grab – there’s something in Colab. I think it’s kind of like time it. Hold on.

Yep. I think it’s – yeah. I think that’s exactly right. Trying to see just the synthesis itself.

So what?

Plus or minus per loop.

Not terrible.
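The cell magic in question, timing just the synthesis call, is roughly this; it assumes the synthesiser and speaker embedding from the cells above.

```python
%%timeit
# IPython/Colab cell magic: runs the cell repeatedly and reports mean +/- std per loop.
speech = synthesiser("Hello, my dog is cooler than you.",
                     forward_params={"speaker_embeddings": speaker_embedding})
```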

So I have a question.

Yeah. So you said that it’s storing the – you’re doing the conditioning and passing it in. Does that mean – is this like a prompt caching thing where it’s doing in-context learning instead of training? I don’t think it’s doing any training at all.

That’s interesting.

So it’s – you know what I mean?

I don’t know a ton about the underlying architecture, how it works.

Some of them actually go into a bit of detail.

There’s actually one we’ll cover at the very end.

There’s one that actually has a paper. Hello.

My dog is cooler than you.

Well, that didn't work. "I'm sorry, Dave. I can't do that." So that's – yeah, we'll cover that in just a second. There's also – I don't know if this is going to work or not. But Parler – some of the models apply pretty easily.

Easy with this whole pipeline concept from Hugging Face.

And there’s a lot of others that have a whole separate stack, you know, where basically you’re bypassing, you know, any kind of – basically you’re going straight torch, you know, with tokenizers and generating and, you know, all of that kind of stuff and getting things back.

Well, this one I wasn’t quite able to figure out.

I don’t know. I know I’ve got the audio array.

I've got the sampling rate.

But when I hit the bottom part of trying to actually put it in audio, I wasn’t able to get there.

Oh, okay. But with that, I want to jump into – let me check something else real quick and see if there’s a – I think I had a link to it.

If I can remember where – okay, there’s presentations.

I think it was this one.

Where this was one that could basically – it has a workspace for it for this particular model.

This one’s been used a pretty good bit based on what I could tell.

The most interesting part about it, I mean, it’s got a lot of stars on GitHub.

You got all that kind of stuff, you know. So, I mean, it's fairly interesting. But if I go to coqui.ai, it says, sorry, we're shutting down, thanks. So, I'm not quite sure. I wasn't able to do a deep dive of kind of what was going on, because they've definitely built something cool. But maybe they ran out of money, or something like 11 Labs is, you know, good enough that you don't have to worry about it. And I just realized I was not sharing the thing I thought I was sharing. Yay.

There’s going to be a good retro after this for things to not do again.

All right.

What I think I was showing, if I can find it again.

I might have to go all the way back because I think I closed the wrong window.

So, we are here.

So, I think it was the... I'm not sure how to say this, Coqui AI TTS.

That has a lot of interesting things. A lot of contributors.

You know, things like that. Pretty good, you know, metrics.

All right.

But if I go back over… All right.

You know, that’s not good.

So, there’s that.

The other thing I wanted to hit… Oh, I actually had that saved off.

All right.

Was this guy. So, there’s actually… This… As far as I can tell, from a state-of-the-art perspective, this is basically kind of the best of the best so far.

So, I mean, you can read the paper and all this kind of stuff about it from Amazon. And then I was like, cool, let me go on Hugging Face and see if I can grab it or whatnot. But they've actually decided not to release it. I mean, they're basically pulling the OpenAI thing of, hey, we built something, but it's too powerful and we don't trust society. And I'm like, well, that's what causes you to not trust society, really? There's a lot of better reasons. So, they didn't put this out in the wild. They just published some actual... where did it go? I think I was on the right one.

Kind of examples.

Let me go back.

And again, bouncing all over the place.

Let’s now go back to tabs.

So, some of these… At the conference, the professor, Mark Curtis, who researched the phenomena that the student who presented earlier had focused on, made a surprising revelation that shocked the audience. At the conference, the professor… So, some of the nuance is really… Really interesting. I’m trying to figure out… Let’s try this one. Overwhelmed with confusion and despair, David Darling cried out, What do you want from me? Why can’t you just tell me what’s wrong? So, a lot closer to getting the kind of emotional content that is typically missing. You know… A lot… A lot of, you know, realistic kind of, you know, audio. A lot of times you can listen to it and you can tell if somebody is angry or excited or sad or bored or… You know, it’s just interesting to see how far it’s gone.

Of course, the other thing that you’ll note, all of these clips are less than 15 seconds.

Which is kind of… I’m not quite sure… Where that… What’s driving that.

But that… That is as far as I was able to get… In the time allotted for… For going from zero to find out what I can do with text-to-speech models in a day or so.

Let me stop sharing and we can get back to a normal… More normal conversation.

All right. So, the Colab T4 is… I think it’s 16… It’s a 16 gig.

16 gig. And then I was… I was trying to find one that had the… Neural… One of the models was called Mel. That I’d found. Let me see if I can get back to that.

Is that Melo?

Maybe?

MeloTTS English.

Melo... different accents. Yes. I don't know that it's the Melo one.

Oh, the other thing… To keep in mind… Is the licenses that you will find various ones under. So, some of them… Like the OpenAI… Well, I don’t know if it’s just OpenAI.

A lot of these were… Apache or MIT. And then others were… You know, provided under non-commercial licenses. So, that kind of makes things a little more interesting. Let’s try… I don’t know if it was MetaVoice.

Yeah, I think this is it. So, if I can… Just let me share. Josh, this might be more along the lines of some of the questions that I’m pretty sure you were looking into. If I can go find the right tab. Again, I lost it.

Yeah, so this one... Um... so it's got zero-shot for British, blah blah. That's where I was thinking that they're not really doing much other than, you know, cranking it off as far as, you know, training or anything. A lot of them have instructions on how to fine-tune it based on your own, you know, your own copies of stuff. I was looking at a paper for the Tortoise one, which has this sort of cross-attention thing where you can do some sort of input text. It looks like they have some sort of autoregressive model that's sitting there along with the phonetic models, and they're just kind of doing cross-attention, which is pretty nifty. I didn't know that's what they were doing. That's true. Okay. If I can find the one... so this one was, uh, what was this, an Apache 2? I think there was one that was actually Facebook, it was Facebook or Meta or something; I can't seem to find it. This one, of course, this is the one I found that has the non-commercial license. I'm looking for the one that had a way to provide a separate prompt that actually described... let me close out some of these tabs so I quit getting confused. No, this is the one I was just showing. I think it's just the speaker; this one just has the speaker reference. So somewhere I had come across a model where you actually tell it... I don't know, maybe it's Parler. Might as well; I haven't clicked this one yet. Oh, yep, okay, there we go. I don't know if this will work or not, but I haven't seen this one yet. How do I get this one in here? Ah, the problem is I've been looking for it. I'm going to go ahead and see if it'll even run. And I'll click over to large. Look at this.

So this one, let’s see, I’m not sure the license.

They called it a prompt, and this is basically the text it’s going to speak. And then what they’re calling the description is actually what I’d probably call a prompt. This is where you describe what you want, kind of like what you want to have in the output while this is being said, you know, kind of things like that.
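Here's a rough sketch of that prompt-plus-description pattern with the Parler-TTS package, following their published example; the model name, text, and description are stand-ins.

```python
# Hedged sketch: Parler-TTS, where "prompt" is the text to speak and "description"
# is a natural-language description of the voice/recording you want.
import torch
import soundfile as sf
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-v1")

prompt = ("All of the data, pre-processing, training code and weights are released "
          "publicly under a permissive license.")
description = ("Tom speaks in a monotone, slightly fast voice, in a very close "
               "recording with almost no background noise.")

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio = generation.cpu().numpy().squeeze()
sf.write("parler_out.wav", audio, model.config.sampling_rate)
```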

This one was the one that I… I thought was one of the more interesting ones. I played around with this a little bit.

Let me actually stop sharing and flip back over to where I can do sound. As soon as I can figure out which one it is. Yeah, let’s go with that.

All right.

So the input text is basically this: "All of the data, preprocessing, training code, weights," that kind of thing. And the description: monotone, slightly fast, very close recording that almost has no background noise.

And this is what it came up with.

All of the data, preprocessing, training code, and weights are released publicly under a permissive license, enabling the community to build on our work and develop their own powerful models.

So what if all I tell it is a guy’s name, Tom?

I don’t know. I’m making stuff up. Bouncy. I have no idea what this is going to do. But playing around with different types of ways to get additional either background or be able to… Oh, okay.

Hold on.

Apparently… I’m hoping Tom is one of their speakers.

Because according to this… By default, 34 speakers, you can actually reference them by name.

Let’s see what we got.

All of the data, preprocessing, training code, and weights are released publicly under a permissive license, enabling the community to build on our work and develop their own powerful models.

That's... So basically, you're combining, like, the text you want it to say along with more of a... I mean, it feels more like an LLM prompt.

It’s probably not quite that… Not quite that expressive. But… I mean, it just is pretty interesting. But with that, I will stop sharing and get back over.