
8 minutes ago, firepunch1 said:

Maybe you're already doing it, but spelling the words how you want them to sound instead of how they're actually spelled properly is also effective. 

Yea. Look at the spelling of some of the wavs. I've had to improvise. (wOOmb & assS)

Link to comment
16 minutes ago, Corsayr said:

I feel like this would be a better tool for making the misogynistic comments.

Might be right. But I think there's more factors involved. Like the context of the comment and the type of voice used. 

Some of the comments above I think do fit. Some of them don't. But they're not bad to be fair when compared to STA butchered dialogue. 

 

Here's male guard. Look at what I had to do with the spelling/punctuation to get it to sound even somewhat decent.

woah. wood you look at the! ttits, on that!.wav

 

He rolls his wOULd way too much to sound natural and doesn't pronounce 'T's very well.

Link to comment
16 minutes ago, Monoman1 said:

Might be right. But I think there's more factors involved. Like the context of the comment and the type of voice used. 

Some of the comments above I think do fit. Some of them don't. But they're not bad to be fair when compared to STA butchered dialogue. 

 

Here's male guard. Look at what I had to do with the spelling/punctuation to get it to sound even somewhat decent.

 


woah. wood you look at the! ttits, on that!.wav 175.06 kB · 33 downloads

 

 

He rolls his wOULd way too much to sound natural and doesn't pronounce 'T's very well.

 

In fairness, guys usually have trouble pronouncing T's with their tongues hanging out.

Link to comment
2 hours ago, Monoman1 said:

That's very interesting. I've been waiting for something like that since someone put me onto an AI voice synthesizer before. But that one was pay-walled/proprietary. 

 

I love taking big cocks deep in my asss.wav 198.06 kB · 217 downloads

 

 

Cum all over my tits!.wav 128.06 kB · 236 downloads

 

 

Pretty neat. It's not perfect though. There's still no way to instill inflection or emotion. It uses whatever it's got. So those lines above sound a bit on the subdued/robotic side. But if you put in 'Hey there' she sounds really shrill and aggressive. She also can't pronounce 'huge'. I guess some messing about with the spelling might fix it. But it's definitely useful. No doubt about it. 

 

Nope.

 

Ok. I use eff as well. 

Yeah, it still has a way to go, but it's still better than text in some cases, and I am betting it is just going to improve.

Link to comment
17 hours ago, sattyre said:

it is just going to improve.

It is, and ... it already has ... and the high quality deepfakes are already demonstrating it.

You can also try Descript's feature for filling in lines you never said as a demo of where this is headed.

 

Descript deliberately tries to prevent you using this on an arbitrary training set, but it's only a matter of time before an open source, GPU-based solution is available that works reasonably well for a determined user, given a good training set.

 

This existing solution is based partially on traditional voice-synth mechanics, which are decades old; hence the robotic sounds. It's a gross simplification to say they are the same in any way, but there are characteristics in common. The basic source of defects in the audible results is the same limitation.

 

The robotic tones come from naïve extension of a tone when no data is available.

i.e. it doesn't know what frequency changes to apply in that spot, so it simply retains the same frequency (or frequency delta rate) as the preceding data until it has some concrete data to work with.
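
A minimal sketch of that gap-filling behaviour, assuming the pitch is stored as one value in Hz per frame and that frames with no data are marked `None`. The function name and the representation are made up for illustration; this is not taken from the tool's source.

```python
from typing import List, Optional


def fill_gaps(contour: List[Optional[float]], hold_slope: bool = False) -> List[float]:
    """Fill missing frames (None) in a pitch contour.

    hold_slope=False repeats the last value (fixed frequency).
    hold_slope=True keeps applying the last frame-to-frame change
    (constant frequency delta).
    """
    filled: List[float] = []
    delta = 0.0  # most recent change between two consecutive frames
    for value in contour:
        if value is None:
            if not filled:
                raise ValueError("contour must start with a known value")
            # No data here: extend what came before.
            value = filled[-1] + (delta if hold_slope else 0.0)
        elif filled:
            delta = value - filled[-1]
        filled.append(value)
    return filled


contour = [200.0, 210.0, None, None, 190.0]
print(fill_gaps(contour))                   # flat stretch over the gap
print(fill_gaps(contour, hold_slope=True))  # straight ramp over the gap
```

Either way the gap ends up as a perfectly flat or perfectly straight stretch, which is exactly the kind of region real speech does not contain.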

 

Real speech is largely devoid of those areas of (a) fixed frequency, and (b) constant frequency delta, so it's something of a defect in the underlying model; it's an issue we see in audio compression too, especially compression for streaming, where data can be lost en route and the decompressor has to fill in the gaps. Same problem, different cause.

 

In the case of generated speech data (which we have here) rather than simply lacking any data, it retains the base model unmodified by additional voice augmentation data.

 

In the case of network trained models, these odd areas expose areas in the model where there is no training consequence; the value function is not measuring that difference (from an idealized result) as meaningful, either because the function doesn't measure it at all, or gives it a low weighting. The model "doesn't care" about those differences; it simply cannot see them as meaningful. It might think "hey, I dropped some low frequency noise here, doesn't matter" but it does matter.
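
To make the "doesn't care" point concrete, here is a toy error function over a handful of frequency bands. The band values and weights are invented for illustration, and real training losses are far more involved; the only point is that a zero weight makes a difference invisible to training.

```python
from typing import Sequence


def weighted_error(predicted: Sequence[float],
                   target: Sequence[float],
                   weights: Sequence[float]) -> float:
    """Weighted squared error across frequency bands (low to high)."""
    if not (len(predicted) == len(target) == len(weights)):
        raise ValueError("all inputs must have the same number of bands")
    return sum(w * (p - t) ** 2 for p, t, w in zip(predicted, target, weights))


target = [1.0, 0.8, 0.5, 0.2]     # band energies of the real recording
predicted = [0.0, 0.8, 0.5, 0.2]  # the model dropped the lowest band entirely

blind_weights = [0.0, 1.0, 1.0, 1.0]  # the error function ignores the lowest band
even_weights = [1.0, 1.0, 1.0, 1.0]

print(weighted_error(predicted, target, blind_weights))  # 0.0, looks perfect
print(weighted_error(predicted, target, even_weights))   # 1.0, the defect shows
```

With the blind weighting, nothing ever pushes the network to fix the missing band, however audible it is to a listener.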

 

It was always possible to improve the classical phonetic models through improved timing information, and some early synths exposed this capability in their input parser. As SKVA theoretically lets you mess around with both speed and pitch at a detailed level, time and effort can probably yield some improvements.

 

A directed/assisted trainer, where a user carefully pre-edits some training inputs and creates edited samples for a wide range of phoneme+context might also yield some improvements, but it would be labor intensive and dependent on the quality of the user's editing and understanding of the intent.

 

 

The SKVA poster says they were unable to train Tacotron due to lack of VRAM, but didn't indicate how much is required.

I had similar issues training ESRGAN and ARCNN on my old 1080Ti, which had 11GB of VRAM, though it's possible to mod some constants in the PyTorch-dependent source to work around it as long as you choose appropriate training images.

 

If this only requires 24GB, it will be well within reach of many users, but if it's more like 48GB, then it might be limited to CPU only, which is fine, in a way, but will take a long time.

 

Alternatively, you can get access to GPUs via AWS or similar services, and the RAM available there should be substantial, providing you're willing to pay for it.

 

 

The specific problem here (I suspect), is that the SKVA author has trained everything based on the Tacotron2 training results that he has access to, based on the large LJSpeech data set. This sounds nothing - absolutely nothing - like most Skyrim dialog, female or male.

(Actually not true; as Monoman observed, it sounds quite a bit like a snooty high-elf delivering a long-winded announcement or put-down, but it is definitely bad for male nords, and even female nords, or anyone saying something with emotional emphasis.)

 

This is a poor data set for the Skyrim voices. LJSpeech is full of slow, carefully enunciated non-fiction text being read at an even pace. It's aimed at text-to-speech for e-readers and that sort of application. It's not great, even for fictional writing, and dialog is an entirely different kind of speech altogether.

 

Training Tacotron2 on the Skyrim audio files would likely result in some quite different results. Using FastPitch trained on a specific Skyrim voice, with a Tacotron2 data set trained on that same voice will likely yield superior results. The data will match up better and we won't see the "dead spots" that lead to robot voice. Accents and other quirks of delivery should come across more accurately.

 

That's an idealized hope anyway. My experience with these systems is that you need to massage the input data sets very carefully to get good results. Too much bias towards one kind of input, or training sets that are "too difficult", too small, or even too large, can result in worse results in the final network. Just like training a person, the training set needs to include all the problem cases, in the correct percentage of occurrence, or it won't give good results.

 

The error measures used for these kinds of training are critical, and that's been an issue in automated search since forever. Training is highly sensitive to choices in the error function. Whenever somebody writes a paper where they show an awesome result, you can be sure it's either just against the error function they chose, or with an error function that works great for the somewhat narrowed range of data they selected.

 

Whether it's visual or audio "errors" that we're looking for, the human brain is incredibly sensitive to certain differences and cares little about others. So far, we have failed to replicate its particular sensitivities in a general and across-the-board way. For some images we are good at spotting differences that matter, but for others we aren't. It's the same with audio.

 

In an image, if we just look at each pixel, we will fail to pick up the kind of problems that the eye easily perceives: structural differences. Any difference measure needs to be able to measure those large scale structural differences and weight them accordingly. This requires convolutions across the image space, but using a complete set of those is computationally impossible right now - for any image of meaningful size. For it to be complete, we'd need to consider every possible combination of pixels that exist in the image. Instead, tricks from video compression are used to speed up the process. It's much faster, but it has blind spots and introduces limitations that have nothing to do with network-based transforms in general.
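
A tiny illustration of the per-pixel blind spot, using a hand-made 2x4 "image". The edge measure here is only a crude stand-in for a structural comparison (a single horizontal difference filter), chosen to keep the example short; it is not what any particular trainer uses.

```python
from typing import List

Image = List[List[float]]


def mse(a: Image, b: Image) -> float:
    """Plain per-pixel mean squared error."""
    pairs = [(x, y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum((x - y) ** 2 for x, y in pairs) / len(pairs)


def horizontal_edges(img: Image) -> Image:
    """Difference between each pixel and its right-hand neighbour."""
    return [[row[i + 1] - row[i] for i in range(len(row) - 1)] for row in img]


def edge_error(a: Image, b: Image) -> float:
    """Compare the edge maps instead of the raw pixels."""
    return mse(horizontal_edges(a), horizontal_edges(b))


original = [[0.0, 0.0, 1.0, 1.0]] * 2   # dark left half, bright right half
brighter = [[0.5, 0.5, 1.5, 1.5]] * 2   # same picture, uniformly brighter
flattened = [[0.5, 0.5, 0.5, 0.5]] * 2  # the edge is gone, just grey

print(mse(original, brighter), mse(original, flattened))
print(edge_error(original, brighter), edge_error(original, flattened))
```

Per-pixel error rates the two variants as equally wrong (0.25 each), while the edge comparison gives zero for the brighter copy and a clear penalty for the flattened one, which matches what a viewer would say.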

Link to comment

I've got to say. It seems like the more articulate the voice type the better it works out. Mumbly nord + weird accent has issues.

 

Surprisingly long, snooty high elf comments work out pretty well IMO. Conversely, probably wouldn't be very good at expressing lust/anguish etc. 

Get away from me. You stink of sperm.wav

 

My dog will be happy to see you!. Why don't you get down on all fours and I'll go get him..wav

 

Oh Look! It's the girl that likes to suck big horse cocks. The bigger the better. Right, sweetie.wav

 

Well well. If it isn't the town whhore!., Shouldn't you be on your back somewhere.wav

 

Still takes a pretty long time fiddling around with wording and punctuation but definitely nowhere near the old fashioned way. 

Edit: LL must be doing some compression. Sounds a lot more crackly when played back here than locally. 

 

3 hours ago, Lupine00 said:

It is, and ... it already has ... and the high quality deepfakes are already demonstrating it.

I'd say there's about 3 people on this forum that understood this post. 

Sadly I'm not one of them. But if it helps bring more accurate voice lines I'm all for it. 

 

Link to comment
18 minutes ago, Monoman1 said:

I'd say there's about 3 people on this forum that understood this post. 

Sadly I'm not one of them. But if it helps bring more accurate voice lines I'm all for it. 

 

LOL.  yup, and I'm not one either, but it certainly makes Lupine sound like a smarty pants, considering she's one of those three.

 

Good on ya, it goes to show we come from all walks of life, all devious quirks aside.

Link to comment
7 hours ago, Monoman1 said:

I've got to say. It seems like the more articulate the voice type the better it works out. Mumbly nord + weird accent has issues.

That's probably because the articulate speech is a closer match to the LJSpeech dataset used to train the Tacotron2 model that feeds into the FastPitch model generation.

Also male speech is just harder because it's made of mumbles, contractions, dropped syllables, and grunts.

Spoiler

It's well known that at puberty, the testosterone overdose paralyses the speech centers and renders the male child unable to voice words clearly, and instead they are reduced to grunts and the occasional "ah'yup". Some relearn to speak after this incident, others do not.

 

The sound on Monoman's latest samples is not robotic, but the pacing is odd, and they're still a bit like the cut+paste lines he's made simply by editing.

Maybe, it's possible to overcome things like the fast falling tone on "god damned" so it has the appropriate emphasis on the curse, but ... takes a lot of experimenting and pain I imagine?

 

The high-elf ones are very good; though emphasis goes on the wrong word now and again, the words themselves sound correct, even down to the accent.

 

 

On the topic of the lines we have in SLS right now...

 

It seems like we have almost everyone saying everything, and though the lines are great in some contexts, they land badly in others.

For example, just about everyone is telling me to get lost! Well... That's fine for ... almost no NPCs ... as most NPCs have some kind of interaction they will eagerly perform and it transitions badly.

 

e.g.

 

Belethor: "Get lost."

Player invokes dialogs.

Belethor: "How can I help you?"

Player: "Wow! That's some attitude change!"

 

Or it comes off even stranger with things like the golden claw quest.

 

Vendors and unique NPCs should probably not be so eager to dismiss the PC, insult her, or otherwise emit random abuse, as it's likely that their main dialogs will then say something completely contradictory. Nazeem is an obvious exception of course, though the ability to insult him back is sorely lacking.

 

Sometimes less is more, and all that.

 

 

In terms of informational messages, basic stuff like tiredness, hunger and thirst would be useful too, even though EatSleepDrink has a tremendous tummy rumble.

Link to comment
19 hours ago, Monoman1 said:

Can not for the life of me get serana to say 'tits'. Keeps saying 'its'. Which is odd. I would have thought serana would have a mountain of dialogue to sample from. 

https://vocaroo.com/14iaxsy7TzBT 

It would seem that the words you put around a given word have a pretty large effect on how it'll be pronounced as well. I'd say this sounds like "tits", and the only thing he changed was the words leading up to it (i.e. no spelling magic or punctuation).

 

Input he used:

 


[Attached image: 1.PNG]
 

Link to comment
1 hour ago, firepunch1 said:

Input he used:

Haha. That's brilliant. But how the hell did they come up with that :D

2 hours ago, Lupine00 said:

they land badly in others.

In fairness, LL mods have always had that problem. Skyrim itself has it as well. It'd be nice if say the companions were all snotty in the beginning to a new whelp and later eased back but achieving that sort of granularity is difficult and time consuming. And just probably not really worth the effort. 

Link to comment
1 hour ago, Monoman1 said:

And just probably not really worth the effort. 

 

In an ideal world, vendors would have their own sleazy dialogs that were vendor appropriate.

There are opportunities for bonus comedy.

In contrast, the "Get lost" dialog works in almost no situation at all.

 

It would work nicely for a dedicated prostitution mod, in the event of a failed solicitation attempt, but ironically, no such mod has any audio lines at all.

 

Spoiler

"Get lost" has some very specific issues.

 

Most of the dialog in SLS is background incidental, e.g. "I bet you have a nice body under those clothes."

Well... maybe I do, assuming I'm wearing clothes, which ... I'm not ... but it doesn't matter because there's no implied consequence.

The only implied consequence for such a line is further dialog with Hulda (or whoever), which we know doesn't exist.

 

But "Get lost" is doing something different. It implies an escalation if you don't comply.

But there is no escalation, nor even an expectation of compliance.

Neither dialog nor behavior will fit, except in that narrow case I noted above, where you're soliciting, and fail. If, by chance you were to get the line then, it would make sense, but the odds of that are slim to none.

 

 

What's worth the effort is, of course, a matter of personal perspective. But the impact of a dialog - any dialog - is proportional to its appropriateness.

 

Sure, there are many weak spots in Skyrim's vanilla dialogs, but that's what dialog mods were invented to address, not perpetuate.

Link to comment

I didn't realize this but you can sort of manipulate the inflection by grabbing the blue bars and raise/lower them. The effect is pretty extreme so small adjustments are best but it's useful. I'm also having better luck breaking sentences up rather than using longer ones. Long sentences can become weirdly garbled (sometimes).

 

Yes! That's it. Pound my ass!.wav

 

Just look at that lovely fat cock. I'd love to suck it dry. But I'd better not.wav

 

Hold it there pumpkin.wav

 

Thats it baybee! Shake that ass!.wav

Link to comment
50 minutes ago, Monoman1 said:

you can sort of manipulate the inflection by grabbing the blue bars and raise/lower them

The up-down arrow is a clue?

 

You can also adjust the length of each letter, though the interface for it is slightly less obvious; it works on the last letter you edited.

 

I found that just putting in a comma makes a big difference.

You can put commas around a word to split it out.

 

Also, you can replicate a letter and then adjust the length and tone of the different copies, to create a rise or fall.

Putting an apostrophe can also have an effect. I think it's like a space but shorter.

 

Try "Yewer ful of sh'it" vs "Yewer full of, shit", for example.

 

Sometimes it seems better not to double letters you would spell doubled.

Sometimes it seems better to roll words together phonetically.

The Serana model is just bad. I suspect there was a problem with a lack of sufficient training data for the FastPitch there.

Might be able to get some good results by making models for the voice replacers that exist.

Exclamation mark works too, but it's a bit iffy in some models.

e.g. "You, arr, full, of. Shit!" in FemaleCondescending; the full stop allows the "shit" to begin with the correct emphasis.

Sometimes swapping a d for a t (soft for hard) is effective.

e.g. "You, ar, fullov. Shid."

 

Realistically, making voice lines with this tool is easier than recording them (which is actually hard work and requires a good recording set-up and a lot of edits).

 

I think we're looking at a new year full of old mods revoiced with actual voices, and new mods that are all about voice.

So exciting!!!

Link to comment
7 minutes ago, Lupine00 said:

The up-down arrow is a clue?

:P

7 minutes ago, Lupine00 said:

You can also adjust the length of each letter, though the interface for it is slightly less obvious; it works on the last letter you edited.

Yep. Just figured that out too. Pretty good program I have to say. I'm impressed. Though there are some bugs and the UI could be a little better. You know what mod could really use this? SLSF. 

Link to comment
46 minutes ago, Monoman1 said:

You know what mod could really use this? SLSF. 

Or sexist guards?

 

"...to suck it dry, but I better not..."

That came out perfect. Just perfect. So wistful.

But why not? Why not???

 

CUDA device not working for me in this though ... perhaps a path or version issue?

Will investigate another day.

 

Oh, just noticed it's all service driven. You can write your own client to post direct to the service. The input is very simple.

Using that approach you can automate re-processing all your lines whenever models are updated or whatever, or just try a dozen combinations and run them all at once for review later. Look in server.log to see what I mean.
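
A rough sketch of what such a batch client could look like. Everything specific here is an assumption: the JSON field names (`model`, `text`, `output`), the idea that the service takes a JSON POST, and the URL you would pass in are all hypothetical. The real request format has to be read out of server.log.

```python
import json
import urllib.request
from typing import Callable, Dict, Iterable, List


def build_jobs(model: str, variants: Iterable[str], out_dir: str) -> List[Dict[str, str]]:
    """One job per spelling/punctuation variant (field names are hypothetical)."""
    return [
        {"model": model, "text": text, "output": f"{out_dir}/variant_{index:02d}.wav"}
        for index, text in enumerate(variants)
    ]


def run_batch(jobs: List[Dict[str, str]], send: Callable[[str], None]) -> int:
    """Send every job as a JSON string and return how many were sent."""
    for job in jobs:
        send(json.dumps(job))
    return len(jobs)


def http_sender(url: str) -> Callable[[str], None]:
    """Build a sender that POSTs JSON to the given (assumed) service URL."""
    def send(body: str) -> None:
        request = urllib.request.Request(
            url,
            data=body.encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as response:
            response.read()
    return send


# Dry run: print the requests instead of sending them anywhere.
variants = ["Hold it there pumpkin", "Hold it there, pumpkin", "Hold it, there pumpkin!"]
run_batch(build_jobs("FemaleCondescending", variants, "out"), print)
```

Keeping the sender as a plain callable means you can check the generated requests with `print` first, then swap in `http_sender(...)` once the real endpoint and fields are known.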

Link to comment
36 minutes ago, Lupine00 said:

Or sexist guards?

True. Loads of possibilities. It's still tons of work though. 

1 hour ago, Lupine00 said:

I think we're looking at a new year full of old mods revoiced with actual voices, and new mods that are all about voice.

So exciting!!!

I think you're right. But you know, having worked fairly extensively on voice mods, I see (potentially) fairly major issues ahead. 

 

One is this from 'Say()' in the ck wiki: 

  • If used on an actor and that actor attempts to initiate normal dialogue (for example a random greeting) while saying something through Say(), the game will crash to desktop.

 

So I think maybe what the modding community could do with now is a dialogue framework. Something that would accept topics and prioritize/play them back without everything conflicting and becoming a super garbage fire when every mod under the sun wants to push dialogue. SLS and STA sort of have this kind of system but I was never 100% happy with it. The problem is that pushing a topic from one mod to another without dependency is tough because the conditions used on the topic to limit which npcs say the lines are.... difficult. Only some conditions seem to work. I spent ages messing around with various conditions and never really got anywhere. A magic effect seems to be the best but of course you can't add a condition for that without dependency.

 

To get around this, when SLS wants STA to say a topic, it starts the quest, sends the line and stops the quest again. This is not great. There are timing issues involved.

 

The only other thing to try maybe is an injected faction form. Not sure you can inject faction forms. I might have even tried it. Can't remember. 

 

I have to get out of these bindings, sooon.... or Ill endup as someones, ffuck toy..wav

 

Hold it there girl. You know the drill. Let me see your papers..wav

Link to comment
22 minutes ago, Monoman1 said:

If used on an actor and that actor attempts to initiate normal dialogue (for example a random greeting) while saying something through Say(), the game will crash to desktop.

The majority of existing mods aren't exercising that bug though.

They rely on the dialog system "as intended" and hellos.

They just need Fuz'd text upgraded with voice.

 

It's mainly STA that gets excessively chatty.

 

And yes, a dialog framework that runs all the speech through a thread-safe queue and a single thread is probably a requirement if we want to share "Say()" without crashes.
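
The shape of that idea, sketched in Python rather than Papyrus purely to show the structure: any mod can submit a line from any thread, but only one worker ever calls the speak function, so two lines can never overlap. The class, its methods and the priority scheme are all invented for illustration; a real framework would have to be built on the game's own scripting.

```python
import itertools
import queue
import threading
from typing import Callable, Optional, Tuple

Item = Tuple[float, int, Optional[str], Optional[str]]


class SpeechQueue:
    """Serialise speech requests from many mods through one worker thread."""

    def __init__(self, speak: Callable[[str, str], None]) -> None:
        self._speak = speak
        self._pending: "queue.PriorityQueue[Item]" = queue.PriorityQueue()
        self._order = itertools.count(1)  # tie-breaker keeps submission order
        self._lock = threading.Lock()
        self._worker = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._worker.start()

    def submit(self, priority: int, mod: str, line: str) -> None:
        """Lower priority numbers are spoken first."""
        with self._lock:
            order = next(self._order)
        self._pending.put((priority, order, mod, line))

    def close(self) -> None:
        """Speak everything still pending, then stop the worker."""
        self._pending.put((float("inf"), 0, None, None))  # sentinel sorts last
        self._worker.join()

    def _run(self) -> None:
        while True:
            _, _, mod, line = self._pending.get()
            if mod is None or line is None:
                return
            self._speak(mod, line)  # the only place speech is ever triggered


spoken = []
speech = SpeechQueue(lambda mod, line: spoken.append((mod, line)))
speech.submit(5, "STA", "idle comment")
speech.submit(1, "SLS", "guard challenge")
speech.start()
speech.close()
print(spoken)
```

The priority lets an important line (a guard challenge, say) jump ahead of background chatter that is already waiting, which is the prioritization part of the framework idea.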

Just imagine the situation in Slaverun?!

But it's already resolved for now.

 

 

Repeating letters sometimes makes robot voice. I tried it with "sune", which also worked, but maybe not what you were after.

"I have to get out of these bindings sune. Or I'll end up as someone's fuck toy."

Then I bumped up the pitch of the 'u' in sune just a little. Sounds nervous.

I have to get out of these bindings sune. Or I'll end up as someone's fuck toy..wav

 

Trying to create anything with more than the slightest bit of emotion is near impossible though.

The data just isn't made for it.

Link to comment
