
xVASynth / xVATrainer / xVADict



 

Said content is that of Dan Ruta, the tool's creator...

Research Update #5 - 4 more new languages supported!

(Not the original... I had to play, record, convert...)

 

Post contents within the below spoiler:

Spoiler

 

Another quick update: a few more languages are now supported by xVASynth. The new languages are:

  • Spanish
  • Japanese
  • French
  • German

As before, the preview audio gives an example of each of these languages (in this order), generating some text within xVASynth ("This is a sentence in <language>, generated by xVASynth").

 

It's a bit difficult for me to tell exactly how well these turned out, not being a speaker of any of these languages. I will rely on the community to give me feedback on various languages, and any tips for improvement, etc!

 

These lines were generated without HiFi-GAN again (to save development time - these aren't production-ready voices, rather proofs of concept). The Spanish, French, and German speakers are Fallout4:Piper again, and the Japanese voice is from a research dataset.

 

Japanese is another language which requires a bit more custom work, but thankfully not as much as Chinese (so far). I have tested hiragana, katakana, and kanji in the app, and it seems to work fine for all of them. Katakana was a bit noisier, but that may or may not be down to the data I used. Things may iron out when I do some large-scale multi-speaker multi-lingual pre-training.
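To make the "custom work" concrete, here's a minimal sketch of why kana is comparatively easy while kanji needs an extra step: kana characters map fairly directly onto sounds via a lookup table, whereas kanji would first need a reading lookup before any phoneme mapping can happen. This is purely illustrative - the tiny table below is not the actual xVASynth Japanese front-end.

```python
# Purely illustrative: a toy kana-to-romaji lookup, NOT xVASynth's actual Japanese pipeline.
KANA_TO_ROMAJI = {
    "こ": "ko", "ん": "n", "に": "ni", "ち": "chi", "は": "ha",   # hiragana
    "コ": "ko", "ン": "n", "ニ": "ni", "チ": "chi", "ハ": "ha",   # katakana
}

def kana_to_romaji(text: str) -> str:
    # Kanji (and context-dependent readings, e.g. the particle は read as "wa")
    # would need dictionary/context handling first; this sketch just passes
    # through anything it doesn't recognise.
    return "".join(KANA_TO_ROMAJI.get(ch, ch) for ch in text)

print(kana_to_romaji("こんにちは"))  # -> "konnichiha" (the real reading is "konnichiwa")
```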

 

Speaking of which, I'm still collecting data! So I'm still thankful to receive any labelled speech data, from basically any/every language!

 

 


Said content is that of Dan Ruta, the tool's creator...

Research Update #6 - Another 4 languages supported


(Not the original... I had to play, record, convert...)

 

Post contents within the below spoiler:

Spoiler

 

Going through the list of languages, there are now a few further languages supported in xVASynth. The new ones are:

  • Greek
  • Hindi
  • Ukrainian
  • Dutch

The preview clip plays out a line generated with each language, in this order. I used research datasets for all of these, as I don't have any data from games in these languages. As such, the quality isn't as high (especially for the Dutch dataset) - but that's fine; these are just test models (Stage 4), to figure out the language support.

 

I've been tweaking a few things here and there to help improve quality - I may write up a dedicated post for this soon, depending on how it turns out.

 

 

 

Said content is that of Dan Ruta, the tool's creator...

Research Update #8 - 5 languages supported; 100 community voices

(Not the original... I had to play, record, convert...)

 

Post contents within the below spoiler:

Spoiler

 

An additional 5 languages are now supported in xVASynth for TTS. They are the following, with samples played in this order in the preview audio clip linked above:

- Polish
- Russian
- Arabic
- Finnish
- Portuguese

Polish and Russian came out especially well (I think), but I did also have good, big datasets for these (Polish was Skyrim:FemaleNord). Arabic came out the least well (I think), but I think there was a slight problem with the transcript, which I've since fixed. Finnish and Portuguese came out ok (I think), at least as much as the dataset quality permitted - I didn't have game datasets for these last 3.

 

For all these, the text prompts were fully native, as with previous and future languages. For example, these were the prompts, for the samples in this post:

 

- Polish: To zdanie w języku polskim, wygenerowane przez xVASynth
- Russian: Это предложение на русском языке, сгенерированное xVASynth
- Arabic: هذه جملة باللغة العربية تم إنشاؤها بواسطة xVASynth
- Finnish: Tämä on suomenkielinen lause, jonka on luonut xVASynth
- Portuguese: Esta é uma frase em português, gerada pelo xVASynth

---

 

Unrelated, but we also just recently crossed the milestone of 100 community voices for xVASynth trained with xVATrainer! This is an amazing feat - well done to everyone! I can't wait to get this new model out, for even better quality over v2. Recent changes have improved quality and expressiveness (check the last post, and its samples).

 

 


Said content is that of Dan Ruta, the tool's creator...

Research Update #9 - 5 more languages supported; App updates

(Not the original... I had to play, record, convert...)

 

Post contents within the below spoiler:

Spoiler

 

Some additional languages are now supported in xVASynth! The new ones in this update are the following:

 

- Hungarian
- Korean
- Latin
- Turkish
- Swedish

 

Again, these are played in this order, in the preview clip. In addition, I also included another Russian sample, from a different dataset, with better quality. The Swedish one needed a re-attempt: if you use xVATrainer, make sure your audio clips don't have long stretches of silence at the start/end - use the "Cut padding" tool to clean those up, otherwise the voice will fail.
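For anyone preparing data outside the app, here's a hedged, illustrative equivalent of that clean-up step using librosa - this is not what the "Cut padding" tool runs internally, just a stand-in for the same idea:

```python
import librosa
import soundfile as sf

# Trim long stretches of near-silence from the start/end of a clip before training.
# Illustrative only - inside xVATrainer, use the "Cut padding" tool instead.
y, sr = librosa.load("clip.wav", sr=None)          # keep the file's original sample rate
y_trimmed, _ = librosa.effects.trim(y, top_db=35)  # treat anything 35 dB below peak as silence
sf.write("clip_trimmed.wav", y_trimmed, sr)
```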

 

I made some improvements to the g2p process since the previous update which, as mentioned in the initial post about this, will improve model quality across the board, even for existing, already-trained (v3) models.

 

I also just released a couple of updates to both xVASynth (v2.3.0), and xVATrainer (v1.1.0 and v1.1.1), so check those out for some bug fixes and quality-of-life tweaks/features. 

 

 

Said content is that of Dan Ruta, the tool's creator...

 

Research Update #10 - +9 languages; Initial language support finished, xVASpeech; Language ASR Models

(Not the original... I had to play, record, convert...)

 

Post contents within the below spoiler:

Spoiler

 

We've made it! With this update, there's now language support added for every single language in the initial list of planned languages. There are also a few additional ones that I've since added. I noticed that I hadn't initially included any African languages, which wasn't very inclusive of me, so I've added some to the list, as well as a few others, to further round out support. The newly included languages are the following:

 

- Vietnamese
- Yoruba
- Amharic
- Wolof
- Mongolian
- Thai
- Danish
- Hausa
- Swahili 

 

Previews for these are played in this order, in the preview audio. Now... for this batch of languages, it really was the bottom of the barrel in terms of data quantity and/or quality. Some turned out ok, some had a few issues, and some are actually quite bad still. However, most of the issues are temporary. For example, Amharic turned out quite bad, but that's only because I had just about 200 lines to train on (to even learn the language, not just the voice). But I actually have almost 20 hours of good, clean data for Amharic - it's just that it's split across many speakers. When I switch over to multi-speaker multi-lingual training, Amharic will have plenty more data, and will turn out better.

 

Thai has some issues also, but I have a bespoke code pipeline for Thai ready, which will hopefully noticeably improve the quality - however, it's currently incompatible with the alignment mechanism. So when I switch away from this alignment mechanism, I'll be able to turn it on, and the quality will improve for this one.

 

Mongolian was kinda bad, mostly because of bad data. It's a low resource language, so until I find more clean data, there's not much more I can do for it.

 

Danish is kinda unstable also, but I'm hoping that, like for the other languages, the multi-speaker/language training will help. It's a fairly low resource language, from what data I could find.

 

So all in all, the full list of currently supported languages in xVASynth is the following:

 

- Amharic
- Arabic
- Chinese (Mandarin)
- Danish
- Dutch
- English
- Finnish
- French
- German
- Greek
- Hausa
- Hindi
- Hungarian
- Italian
- Japanese
- Korean
- Latin
- Mongolian
- Polish
- Portuguese
- Romanian
- Russian
- Spanish
- Swahili
- Swedish
- Thai
- Turkish
- Ukrainian
- Vietnamese
- Wolof
- Yoruba

 

If I haven't forgotten any from this list, that's 31 languages, including English. These are all integrated via a grapheme-to-phoneme backend. Interesting to note: Wolof doesn't actually have any g2p program/software out there, so I ended up hand-writing a g2p script myself for this language, with the help of some Wolof YouTube tutorials. I'm also pretty sure that xVASynth will be one of the first apps to support TTS for this language (not even Google Translate has Wolof, or some of the other languages in this update) - at least among public ones. There seem to be some chats online about it, and a PowerPoint from someone researching it, but no public, working options (from a quick glance).
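As an illustration of what a hand-written, rule-based g2p can look like, the core is usually just a longest-match lookup over grapheme clusters. A minimal sketch - the rule table below is hypothetical and not the actual Wolof rules used here:

```python
# Hypothetical rule table - NOT the actual Wolof g2p rules used in xVASynth.
RULES = {
    "mb": "mb", "nd": "nd", "ng": "NG", "x": "X",
    "é": "e", "ë": "@", "a": "a", "m": "m", "b": "b", "n": "n",
}

def rule_based_g2p(word: str) -> list:
    phonemes, i = [], 0
    while i < len(word):
        # Try the longest grapheme cluster first (2 characters here), then fall back to 1.
        for size in (2, 1):
            chunk = word[i:i + size].lower()
            if chunk in RULES:
                phonemes.append(RULES[chunk])
                i += size
                break
        else:
            i += 1  # skip characters not covered by the rules
    return phonemes

print(rule_based_g2p("mban"))  # -> ['mb', 'a', 'n']
```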

 

---------------

 

I did mention that one of the things I wanted for language support was to also have the languages as options in xVATrainer's Auto-Transcribe tool. This is mostly done - I couldn't get models for Yoruba or Amharic unfortunately, but maybe at some point in the future. Following the latest xVATrainer update (v1.1.0+), you can now install these ASR language models in a modular way, based on which ones you'd like.

 

[Image: transcribe.jpg]

 

I haven't included them in the base app, as these models are actually quite big, and altogether they add up to quite a few GB - 32GB for all languages. Instead, these will be individual ~1GB downloads, in the optional section on the Nexus xVATrainer page: https://www.nexusmods.com/skyrimspecialedition/mods/65022

 

Do note - xVATrainer voice training is still limited to just English for now, as it's still just the v2 model in there. It's just the Auto-Transcribe tool that has the new languages right now.

 

---------------

 

The last item in this update is the introduction of the xVASpeech dataset! I've been working on putting this together for over 10 months (and thank you to all those who have helped me with voice data from games! Game data is usually the best quality). It contains data both from TTS datasets found online and from game data, where available - but mostly TTS datasets.

 

And it's not finished - I will be adding/removing data, as I find more. For example, I have no female data for Danish or Hausa yet - having some will help with future female voices trained for those languages. And some languages like Greek and Mongolian have very little data (5h, 8h), which is not enough to teach a language.

 

[Image: xvaspeech.jpg]

 

This 2.6k hour dataset will be what I use near the end of the v3 development, to do a super large-scale pre-training run of all languages, and all 1.3k+ speakers. Once the model capacity is tuned, the hope is that this will provide the model with ample speech data to learn from. And future fine-tuning runs will in theory be not only quicker, but more a matter of learning a speaker's identity, rather than having to learn how to speak in even roughly that style.

 

To teach a TTS model, it has to learn how to speak, how to speak a specific language, how to speak in varied styles (eg male/female), and how to speak in a specific identity's style (and then emotions, if you have enough data, but let's ignore that for now). When training a model from scratch, all of these have to happen within the budget of your dataset, which might not always be big enough. Current fine-tuning training runs in xVATrainer mostly deal with the last 2 steps (though I've included different male/female checkpoints to help a little with that second-to-last step). With the xVASpeech training run, the hope is not only to reduce the strain on a dataset to just that last item (identity), but even then, to have already trained the model on so many speakers that a new speaker will already be quite close in speaking style to one the model has already seen.
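As a rough sketch of what "only learning the identity" can look like in practice: fine-tuning can freeze everything the pre-training already learned and update only the speaker-identity parameters. Module names here are hypothetical placeholders, not xVASynth's internals.

```python
import torch

# Sketch only: assumes a checkpoint that loads as an nn.Module, and a hypothetical
# "speaker_embedding" module holding the identity - names are placeholders.
model = torch.load("pretrained_multispeaker.pt")

for name, param in model.named_parameters():
    # "How to speak" and "how to speak this language" stay frozen;
    # only the speaker-identity parameters are updated during fine-tuning.
    param.requires_grad = name.startswith("speaker_embedding")

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```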

 

---------------

 

So, referring back to that post from April with the plans for v3, the following items are now done:

 

- Multi-lingual support
- Extended ARPAbet phoneme dictionary: xVA-ARPAbet

 

as well as some new items added since then:

 

- Switch to a phoneme-only backend, with manual text pre-processing (eg Heteronyms)
- Better quality - achieved so far through the loss-sorted Gaussian data sampling (see previous post) and the phoneme backend, but this is an ongoing point, of course.

 

The language support was the super time-consuming one, and that one is now done and out of the way. The next item is the nitty-gritty model architecture re-write. This is the heavy stuff, where most of the complexity will happen. It hopefully (!) won't take as long as the language support, especially as I've broken it down into a step-by-step plan already, but this one might be a little harder to provide updates for - it's now time to melt my computer (and myself too, in these heatwaves) with lots of experiments and architecture explorations. This step is what will replace the alignment method in the models, and what will re-frame the models into a multi-speaker/language mode, enabling the rest of the points on that April list.

 

I will try to post updates on anything I can, on this, but it'll mostly be an "it works" or "it doesn't work yet" situation! :joy:

 

 

 

 

 


Said content is that of Dan Ruta, the tool's creator...

Research update #12; v3 progress

 

Post contents within the below spoiler:

Spoiler

Things have been moving along! The research has been underway, and I'm happy to report that as far as architectural design choices for the models are concerned, we're actually nearly there for the v3 model!

 

I've just finished a week of full-time work on xVA, which I took as holiday from the PhD, so I had a good amount of time to focus on powering through a good number of things.

 

Some things since the last update:

 

The v3 model is now fully backed by the xVA-ARPAbet phoneme pipeline, complete with all the grapheme-to-phoneme work from earlier in the summer. This includes ionite's heteronyms work also, and everything previously mentioned. The two work streams have fully merged into one - this took quite a bit of fiddling around to get working.

 

The next most critical component was implementing pitch (and energy) conditioning in the model, to re-enable pitch (and energy) control on a per-symbol basis again, just like the FastPitch models have had for v1 and v2. This is actually the main area of focus at the moment - training the models is quite slow, as the pitch prediction/conditioning modules need to be trained from scratch. I've implemented 3 different methods for adding the pitch conditioning (and I have an idea for a fourth), so it's now a matter of testing each one of them out to see which works best, and jiggling things about until the quality is maximized. Both the pitch and energy conditioning work the same way, so the work is halved there.
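For readers unfamiliar with FastPitch-style conditioning, the sketch below shows the general shape of the idea - a generic illustration, not the actual v3 module: a small predictor estimates one pitch value per symbol, the UI sliders can override those values, and the (possibly overridden) values are projected back into the symbol encodings so the decoder is conditioned on them.

```python
import torch
import torch.nn as nn
from typing import Optional

class PitchConditioning(nn.Module):
    """Generic FastPitch-style sketch - not the actual xVASynth v3 module."""
    def __init__(self, d_model: int):
        super().__init__()
        self.pitch_predictor = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1)
        )
        self.pitch_proj = nn.Linear(1, d_model)

    def forward(self, symbol_enc: torch.Tensor,
                pitch_override: Optional[torch.Tensor] = None):
        pitch = self.pitch_predictor(symbol_enc).squeeze(-1)   # (batch, n_symbols)
        if pitch_override is not None:                         # per-symbol slider values
            pitch = pitch_override
        conditioned = symbol_enc + self.pitch_proj(pitch.unsqueeze(-1))
        return conditioned, pitch

# Energy conditioning would follow exactly the same pattern - hence "the work is halved".
```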

 

Thankfully, I found a way to avoid re-training the entire model from scratch with every experiment, and instead re-train only certain components - this definitely helps with experimentation speed going forward.

 

---

 

There are also a couple of new "abilities" of the model that I've implemented/discovered! The first is that the model, due to its relative-based design, can generate really long audio clips without any quality degradation. The previous models have had trouble past 15 to 20 seconds, as there was no training data that long to tell the model what to do at that point. But with the v3 model, you can generate audio that's even several minutes long - limited only by your hardware (and I haven't reached/tested the limits yet).
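A hedged illustration of why a "relative-based" design extrapolates to long clips - this is a generic relative-position attention bias, not necessarily the exact mechanism in v3: the bias only depends on the clipped distance between two positions, so nothing about it changes when the sequence gets much longer than anything seen in training.

```python
import torch
import torch.nn as nn

class RelativeAttentionBias(nn.Module):
    """Generic relative-position bias sketch - not necessarily v3's exact mechanism."""
    def __init__(self, n_heads: int, max_dist: int = 64):
        super().__init__()
        self.max_dist = max_dist
        self.bias = nn.Embedding(2 * max_dist + 1, n_heads)

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len)
        # Distances are clipped, so position 5000 vs 5010 looks the same as 0 vs 10.
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_dist, self.max_dist) + self.max_dist
        return self.bias(rel).permute(2, 0, 1)  # (heads, seq, seq), added to attention logits
```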

 

The second - if you've checked out the previous post where I showed how you can control the style of each line generated through text-to-speech or voice conversion - is that this style can actually be controlled on a per-symbol basis too! So this can in theory mean yet another set of sliders, somehow. The UI will need some serious work to accommodate the new ways to use v3 models.

 

Other than that, I've made some speed improvements, to both voice conversion inference time (also improved the quality!), and overall training time - though there is still room for optimization improvements, once the architecture is finalized.

 

And I've started labelling non-verbal sounds, but this experiment will only be possible once more data is labelled like this. If you have datasets of your own with sounds labelled like this, do let me know if you'd be willing to share them, as it will expedite this branch of experiments (and pre-train the base model on your data). The non-verbal sounds will work just like ARPAbet phonemes, and will look like this in the text prompt (a small parsing sketch follows the list):

 

{@BREATHE_IN}

{@BREATHE_OUT}

{@LAUGH}  - more AAH based laughing, may merge into one if too ambiguous

{@GIGGLE} - more IY0 based laughing, may merge into one if too ambiguous

{@SIGH} - like BREATHE_OUT, but with voice in it

{@COUGH}

{@AHEM}

{@SNEEZE}

{@WHISTLE}

{@UGH}

{@HMM}

{@GASP}  - like BREATHE_IN, but with voice in it

{@AAH}

{@GRUNT}

{@YAWN} 

{@SNIFF}
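As a small illustration of how tags like these can be pulled out of a prompt - a generic sketch, not xVASynth's actual text front-end:

```python
import re

# Illustrative tokenizer for prompts mixing text and non-verbal tags like {@LAUGH}.
TAG_RE = re.compile(r"\{@([A-Z_]+)\}")

def split_prompt(prompt: str):
    parts, last = [], 0
    for match in TAG_RE.finditer(prompt):
        if match.start() > last:
            parts.append(("text", prompt[last:match.start()]))
        parts.append(("nonverbal", match.group(1)))
        last = match.end()
    if last < len(prompt):
        parts.append(("text", prompt[last:]))
    return parts

print(split_prompt("Well... {@SIGH} I suppose so. {@LAUGH}"))
# [('text', 'Well... '), ('nonverbal', 'SIGH'), ('text', ' I suppose so. '), ('nonverbal', 'LAUGH')]
```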

 

Also, I'm working on another really cool and exciting way to use xVASynth (different from everything so far), but I still need to tweak that to get it into a usable state, and it still may not work well enough, so I won't talk about that just now.

 

 

 

 


Said content is that of Dan Ruta, the tool's creator...

Research update #13; v3 sliders, emotion control

 

 

Post contents within the below spoiler:

Spoiler

 

Time for an update on v3!

 

Things have been hectic. Experiment after experiment, but we're now very close to the end of the experimental part of the v3 development.

 

Pitch sliders are now fully in (yay!). However, energy sliders have not been so successful (hence the longer break between updates). I've tried my hardest to get these to work, but the closest I could get was very slight changes in the energy, with changes to the identity more than anything else - likely just too much going on. And seeing as energy sliders mostly affect the per-symbol volume of a voice line, I figured it'd be best to give up on trying to incorporate this straight into the model weights, and instead just handle it externally, in post. So, energy sliders are most likely not going away, but they're going away from the model at least.
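Since per-symbol energy mostly amounts to per-symbol volume, handling it "in post" can be as simple as scaling each symbol's slice of the waveform by its own gain. A rough sketch, assuming per-symbol durations are available from the model - this is not necessarily how xVASynth will implement it:

```python
import numpy as np

def apply_symbol_gains(audio: np.ndarray, durations_s, gains_db, sample_rate: int) -> np.ndarray:
    """Scale each symbol's slice of the waveform by its own gain (dB).
    Sketch only - not necessarily how xVASynth will handle energy-in-post."""
    out = audio.astype(np.float32).copy()
    cursor = 0
    for dur, gain_db in zip(durations_s, gains_db):
        n = int(round(dur * sample_rate))
        out[cursor:cursor + n] *= 10.0 ** (gain_db / 20.0)  # dB -> linear amplitude
        cursor += n
    return out
```

In practice you would also smooth the gain at symbol boundaries to avoid audible clicks.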

 

Another exciting new way to control the v3 models is the emotion! This is something I was playing around with a bit while the experiments were running, and I got it working for "happy", "angry", "surprise", and "sad". The way this will work is just like the pitch (and old energy) sliders - every symbol (so every phoneme) will have its emotion controllable via emotion sliders. It's quite possible that I'll be able to get more emotions working, but I just tried out these 4 for now. I will get some more generated examples of this out soon, but so far I'm actually quite happy with how these turned out!

 

Finally, I have now added v3 model support into the actual xVASynth app! That means the app can now run the new model architecture, though it's just the pitch and duration controls for now:

 

[Image]

 

 

Visually, there aren't that many changes just yet compared to v2, other than the text in the sequence editor now only containing phonemes, now that the models are purely g2p phoneme based.

 

--

 

As for the next steps, we are actually nearly there! I am currently playing around with trying to improve the language accent control. I got it kinda working on the current test checkpoint I have, but it was working much much better in earlier experiments (albeit on a model with a different architecture). So I'm trying a few things in the training, like new/re-weighted losses, small further tweaks to the architecture, and so on. Once this is working, this will be the final piece needed, and it'll be time for the large-scale pre-training. I am also considering increasing the model capacity slightly (more weights per module), to accommodate all the data I'll be throwing at it, and this will mean starting the training from scratch, but that should be it.

 

While the large-scale training on xVASpeech is running (it will take quite some time to train, maybe a few weeks), I will be switching focus to the last stage of this v3 mega-project, which is xVASynth UI integration for all the new features that the v3 model is capable of.

 

 

 

 

Said content is that of Dan Ruta, the tool's creator...

Research update #14; full steam ahead!

 

Post contents within the below spoiler:

Spoiler

 

All aboard the v3 base model train(ing)! All hands and legs inside at all times, or they will get regularized! 

 

Experiments are now mostly done, and the mass pre-processing of the huge xVASpeech dataset is also done - and we're now training on it! I've been away/insanely busy with presenting papers at the ECCV conference, and submitting to the CVPR conference, as part of my PhD, but meanwhile, I've been nursing along the big pre-training run of the "final" v3 model design, on all the datasets I've collected and been graciously given.

 

Unfortunately, though it KINDA works, varying the control on the accent of a line doesn't work as well as it did in initial experiments. So, say, an English line spoken with a French accent, or a Japanese line spoken with a Hindi accent (but French lines still sound French, and Japanese lines still sound Japanese, etc). There's a chance that I could still make this work, with some experiments on an already xVASpeech-trained checkpoint, rather than trying to make it work on a checkpoint not already trained on all the multi-lingual data. But if not, I may have to just leave that for a v4, or a v3.1 or something. There are already plenty of changes with v3, and I think it's time to just shift to a release now, rather than try to do everything in one big update!

 

So the step I've been working on recently is, like I mentioned, the big, long pre-training run on xVASpeech, which will bake all the data from hundreds/thousands of different kinds of speakers, from 30+ languages, into the model. I increased the model capacity (size) to accommodate the added data, so hopefully that won't be an issue (an easy fix otherwise - only a few certain sub-modules would need re-training after expanding, if more capacity is needed).

 

Initially the quality actually degraded a fair bit (compared to my debugging checkpoint trained on like 3 languages), but with more training time, the quality seems to be coming back, and with the added training data diversity, I am expecting the robustness and overall quality to go beyond what was there before. One caveat is that I did have to remove the Mongolian datasets from xVASpeech. If you recall, this was the only language that didn't turn out too well in single-speaker mode, without any sign that multi-speaker mode would change this. The data sampling for the training is language-balanced, meaning that with 31 total languages, 1/31 of the training data could potentially have been broken Mongolian - so I just removed it from this mass pre-training. The number 30 is nicer and more round, anyway. I'm not removing support for the language from the app, however! So if anyone wants Mongolian models, that can still be done later.
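For reference, language-balanced sampling of this kind is typically just inverse-frequency weighting. A minimal sketch, assuming a `languages` list describing each clip (placeholder metadata, not the real dataset):

```python
from collections import Counter
from torch.utils.data import WeightedRandomSampler

# Placeholder metadata: the language label of every clip in the dataset.
languages = ["en", "en", "en", "pl", "pl", "ru"]

counts = Counter(languages)
# Each language contributes equally in expectation:
# weight = 1 / (num_languages * clips_in_that_language)
weights = [1.0 / (len(counts) * counts[lang]) for lang in languages]
sampler = WeightedRandomSampler(weights, num_samples=len(languages), replacement=True)
# Pass `sampler=sampler` to the DataLoader so each language supplies ~1/len(counts) of the batches.
```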

 

The following is one of the loss graphs from training the model (the mel loss) - ignore the sudden jumps, that's me messing with the data/hyper-parameters mid-training. It's going down nicely, but this is across quite a big time-span (>2 weeks).

 

[Image: mel loss training curve]

 

Given the sheer amount of data to train on, this training run will require a fair bit more time, but that's fine, as I've still got a good amount of work to do on the UI front, as soon as this CVPR business is done and out of the way. I've got some basic designs going for how it could work, and I've already mostly finished replacing the microphone input with a much more stable and reliable method, for the upcoming voice conversion mode that will replace current speech-to-speech! 

 

v3 will also be the first time since v0.1 that I've completely re-built the backend dependencies and environment. I've finally updated the Python version used from 3.6 to a more recent 3.9, to make some new dependencies compatible. Other such big updates might happen also (eg the ancient v2 Electron front-end version), though some heavy testing will need to happen first, to make sure nothing is broken due to this.

 

--

 

I'll keep posting updates on how the large-scale pre-training is going, but we'll have to strap in for a couple/few more weeks of leaving this running. I'm checking in with it regularly, to verify progress and that nothing is broken.

 

 

 

NEW CONTENT BY OTHERS
 

 

Only recently did I discover that there are others making xVASynth resources. Personally, I've been waiting for some Oblivion resources... but that's me as an Oblivion modder. But along with an Oblivion resource, there are Skyrim resources, Fallout 4 resources, and plenty of Fallout 76 resources.

 

The links are now at the bottom of the first post. They note the author and the game engine for convenience.

 

 


Said content is that of Dan Ruta, the tool's creator...

 

Research update #17 - Pre-training stage finished - preparation for pre-alpha builds

 

Post contents within the below spoiler:

Spoiler

 

I've been working more on fine-tuning with the remaining languages, and on individual fine-tuning, and I think we're nearly ready for an early pre-alpha build with some v3 models. I have some example audio files to show, but Patreon still doesn't allow me to inline audio clips, so I've included them in a zip attached to this post, if you want to follow along with the filenames I'll be referencing.

 

In terms of what's new since the last update: I've finished fine-tuning training on the remaining languages (up to 30), from the checkpoint trained on the clean game-data languages, but with the vocoder frozen. This worked pretty well, and it now means there's a base, "genesis" checkpoint that further models can be fine-tuned from, with the ability to speak each of the 30 languages. Similar to how there was a Nate/Nora checkpoint for v2, this will be what is used for fine-tuning v3 models.

 

The next step was fine-tuning individual voices. This was ok when just switching completely to a new voice's dataset, but as previously mentioned, doing so would severely harm the model's abilities, in terms of the variety of voices it can handle (e.g. for voice conversion), and also accent retention.

 

For example, this is a file showing Joker from Mass Effect trying to speak Romanian, from a model fine-tuned with just Joker's data, without grounding the model with other multi-lingual data: joker no dreambooth romanian.wav

 

For reference, this is what the accent should sound like: ro_00000005_20_tts.wav

 

So the accent is completely butchered, and sounds very english-y. With the dreambooth-style fine-tuning (looping back some multilingual data, optimizing the non-vocoder modules for it), the accent stays good, while gaining the new voice: joker dreambooth romanian.wav

 

Further reference with German and English: joker dreambooth german.wav, joker dreambooth english.wav

 

This process isn't free, as there seems to be a bit of a trade-off between how well the new voice is learned and how good the accent preservation is. This has been the main area of experimentation recently, as I've been trying several different ratios of new/multilingual training data. I'm not yet 100% done with this - I'll likely play around with it more - but so far a ratio of 3:1 (new:old) seems to work mostly ok. One idea I had was to only sample multilingual data which is similar to the average speaker style of the newly fine-tuned dataset - to preserve accent while not pulling away too far from the newly trained speaker style. I'll probably pursue this in the near future.
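A rough sketch of what that 3:1 schedule can look like - the loader names are placeholders, not xVATrainer internals: for every three batches of the new voice's data, one batch of the original multilingual data is looped back in to keep the accents grounded.

```python
from itertools import cycle

def mixed_batches(new_loader, multilingual_loader, ratio=(3, 1)):
    """Rough sketch of a 3:1 new:multilingual schedule - loader names are placeholders.
    Yields (batch, is_multilingual) pairs until the new voice's data runs out."""
    new_iter = iter(new_loader)
    old_iter = cycle(multilingual_loader)       # loop the grounding data indefinitely
    while True:
        try:
            for _ in range(ratio[0]):
                yield next(new_iter), False     # fine-tuning data for the new voice
            for _ in range(ratio[1]):
                yield next(old_iter), True      # multilingual data to preserve accents
        except StopIteration:                   # new-voice data exhausted -> epoch done
            return
```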

 

--

 

But speaking of the near future, I've decided to take another week of annual leave from my PhD (which has been getting out of hand busy, as I approach its end), to focus on some dev time for xVASynth. My aims are to clean up a few things, and post here a super early pre-alpha build, with support for running v3 models. The things implemented so far are being able to do text-to-speech using v3 models (with the new g2p backend), and doing voice conversion with the new system that replaces the broken old v2 speech-to-speech. And some random bug fixes. I'll aim to post a couple of test models to go along with it, so there's something to play with. Then, over the next little while, while improving fine-tuning quality, I'll iterate on that build with the surrounding new features and fixes, and whatever else, on the lead-up to the main, public v3 release.

 

 

(( LINKS TO SAMPLES ))

 


Said content is that of Dan Ruta, the tool's creator...

 

v3 Pre-alpha build #1; Speaker styles; Technical post

 

Post contents within the below spoiler:

Spoiler

The first v3 pre-release is here!

 

So as previously mentioned, the plan going forward is to start with some pre-release builds, mainly with just the support for v3 models. There are several features planned/in the works that will hopefully make it into the final public release. But for now, the headline feature, which is the v3 models, is in a mostly-done state, and is ready for playing around with and experimenting/stress-testing.

 

Having said that, I have also been working on some other surrounding things, and on some of those other planned features, due to the delay in getting some test models ready (I kept finding ways of improving the training process). So let's first get the changelog out of the way, for the pre-release vs the latest public version:

  • Added support for v3 models
  • Purged old speech-to-speech system. Replaced with new v3 voice conversion system, with new python-based mic recording
  • Added backwards-compatible audio 48khz super-resolution post-processing via nuwave2 diffusion model
  • Added speaker style embedding dropdown and management system
  • Added dropdown setting for the base language to process v3 TTS input as
  • Added ffmpeg settings for noise reduction, with configurable parameters
  • Added de-essing post-processing filter
  • Re-built backend environments with Python 3.9
  • Stopped start-up error messages being very quickly dismissed by the start-up process
  • Added error modals for problems with Nexus API use, printing out the full response
  • Auto-insert spaces at start/end of text prompts
  • Ensured temporary files are cleaned out after every synthesis
  • Fixed editor issue where disabled sliders could still be moved if clicked in the right place
  • Fixed dashes being ignored in ARPAbet dictionary replacements
  • Added ability to load multiple model instances for programmatic use of the app via local http
  • Fixed batch mode bug where pitch_amp only affects first batch line
  • Fixed batch mode output_path not being shown properly
  • Improved batch mode window design
  • Improved/fixed error display/logging for batch mode
  • Changed default ffmpeg Hz to 44100
  • Fixed not all settings being reset  
  • Misc smaller bug fixes and tweaks
  • . . .

 

(the rest required that I join the Patreon...)

 

 
 

Said content is that of Dan Ruta, the tool's creator...

 

v3 now in Alpha; Fine-tuning quality improved, scripts finished, New changes

 

Post contents within the below spoiler:

Spoiler

 

Things are moving along!

 

We're now on the seventh patch since the first pre-alpha build was released in the previous post, and we're moving from "pre-alpha" into "Alpha" state. Several new things have been added/fixed since then. The pre-alpha to alpha changelog is as follows:

 

  • Fixed energy sliders
  • Added deepfilternet2 "Clean-up" post-processing model alongside SR
  • Added SR and clean-up post-processing support to batch mode
  • Added de-clipping and de-clicking post-processing ffmpeg filters
  • Made the automatic space padding configurable via user setting, and not affect the prompt box or output filename
  • Fixed style dropdown not refreshing after most changes in the management panel
  • Updated 3d emb visualizer to use better v3 embedding space model
  • Added right click menu on output editor to copy ARPAbet sequence to clipboard
  • More reactive Voice Conversion UI
  • Added explicit error message for unsupported ARPAbet symbol
  • Filtered out * asterisks from prompts
  • Fixed Jitter button not doing anything for zero values
  • Fixed app trying to use waveglow models for v3 models, if app settings initialized afresh
  • Fixed microphone recording not working for mono-only microphones
  • Stopped using cached g2p words if prediction failed (blank prediction)
  • Added more explicit and hopefully more robust device switching for synthesis (incl for super-resolution)
  • Added fallback model installation debugging mode for when models are loaded with missing asset files
  • Add pfft entries into the cmudict
  • Cleared old sliders editor TTS data when voice conversion is used
  • Disabled useless json output for voice conversion samples
  • Added more obvious styling for disabled inputs
  • Removed batch fast mode
  • Removed now useless "Automatically generate voice" Voice Conversion setting
  • Removed ffmpeg pre-apply setting, use as enabled going forward

Other than the tweaks and fixes, one additional cool thing is some extra post-processing to make audio quality better across the board. There are a couple of new ffmpeg filters now being auto-applied, but the main one is an additional post-processing ML model which can (and should) be used together with the super-resolution ML model to further clean up audio. Like with the super-resolution model, this works not only for the new v3 models, but also for the older v1 and v2 models, where it will be especially useful. In initial testing, it seems to be able to clean up the voice in the audio, and even remove artefacts such as the tinny ringing you sometimes hear in less-than-perfect models.

 

[Image]

 

The runtime for this new post-processing step is near-instant, and now that the de-clipping and de-clicking ffmpeg filters are also in and automatically applied, there is usually no downside to using it. So it is enabled by default, but as with the super-resolution, you can tick/untick the checkbox to toggle its use. One thing to note in using this "Clean-up" post-processing model is that it behaves very similarly to the source separation tool in xVATrainer, if you're familiar with it - except higher quality. So it has some other side-effects, such as removing echo. Most voices won't be affected by this, but if you somehow were able to train a model with some echo in it, you'll have to either not use this step, or add the echo back in externally. Reminder that to hear a benefit from the SR tool, make sure your ffmpeg Hz setting is higher than 22050Hz (SR outputs 48000Hz).
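For anyone post-processing outside the app, roughly equivalent filters can be chained by hand with ffmpeg's standard adeclick, adeclip, and deesser filters. Treat this as a hedged example - the exact filter chain and parameters xVASynth applies may differ:

```python
import subprocess

# Hedged example only - the filter chain/parameters inside xVASynth may differ.
subprocess.run([
    "ffmpeg", "-y", "-i", "line.wav",
    "-af", "adeclick,adeclip,deesser",  # de-click, de-clip, de-ess (standard ffmpeg filters)
    "-ar", "44100",                     # matches the app's new default output Hz
    "line_cleaned.wav",
], check=True)
```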

 

I've included a zip file of some before/after audio files from a v1 model and a v2 model with these SR and Clean-up post-processing steps applied. I used a v1 dragon voice to show the echo-removal, and a v2 Skyrim:FemaleKhajiit voice to show more general cleaning.

 

--

 

On the v3 training front, I've been running experiment after experiment after experiment. And I'm happy to report that I've made a great deal of progress on improving the quality of the fine-tuning process, squeezing as much as I can out of it. I've made several changes, including hyper-parameter sweeps, dual-stage training (like how there are 5 stages in xVATrainer), and a modified version of the loss-sorted Gaussian data sampling technique I introduced last summer. I'm quite confident that this is likely it for quality changes - at least for now. I'm happy to lock things in after some clean-up and small tweaks, for what will be the fine-tuning script for v3 models. Though I'll likely keep working on it to improve the speed, ahead of xVATrainer integration. I am currently running one last experiment, to test out the combination of two different ideas I was testing separately, but that should be it.

 

I've also found one metric which is a good contender to track for automatic detection of a good place to auto-terminate the training process. I've mainly been working with Nate's voice since adding that in, but I'm now tracking its average % delta curve. I'll next be exploring several other datasets, trying to keep their sizes and complexities varied. I'll manually examine checkpoints' output to determine a good stopping point for each dataset, and I'll try to find a pattern on the curve to determine a good stopping point at both fine-tuning stages.
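As an illustration of the kind of rule this could end up being - the window and threshold here are made up, not the formula being tuned: stop once the average percentage improvement of the tracked metric over the last few checkpoints drops below some small value.

```python
def should_stop(metric_history, window=5, min_delta_pct=0.5):
    """Illustrative stopping rule - not the actual formula being worked out here.
    Stop once the average % improvement of the tracked metric over the last
    `window` checkpoints falls below `min_delta_pct`."""
    if len(metric_history) <= window:
        return False
    recent = metric_history[-(window + 1):]
    deltas = [
        100.0 * (prev - cur) / abs(prev)   # % improvement between consecutive checkpoints
        for prev, cur in zip(recent, recent[1:])
        if prev != 0
    ]
    return bool(deltas) and (sum(deltas) / len(deltas)) < min_delta_pct
```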

 

--

 

The new patch is here, containing all the changes since the initial pre-alpha build: https://drive.google.com/file/d/11bNUiL3vTgpu7zzcn-pDUV4Eu3UA-58u/view?usp=share_link . The patch is cumulative, meaning it can be applied either on a patched pre-alpha, or on the original. (Edit, I fixed a bug with packaging the clean-up library)

 

Finally, here is a new v3 Nate model, following recent changes to the fine-tuning script: https://drive.google.com/file/d/1gzdw3vijSMGrYD8qZtUolhfzDqM8bATD/view?usp=share_link
This is likely the best version I have currently, though there is another version trained differently which I think maybe has better audio quality but less good pacing. If anyone's curious to try it and send some feedback, let me know: https://drive.google.com/file/d/18yWpltfOF0SNlqwfBCPirQB0wPbml4Xn/view?usp=share_link . Depending on the feedback and further testing, I'll know which version to lock in the fine-tune code/parameters for.

 

 

 

samples.zip   (available ONLY if you are on his patreon)

 


Said content is that of Dan Ruta, the tool's creator...

New v3 features!

 

 



 

A lot has happened since the last update. Please check the last post if you haven't already, for recent changes, to avoid repetition. The changelog for new things since that post is as follows:
 
  • Added emotion sliders for: Angry, Happy, Sad, Surprised
  • Added backslash lang control in sub-prompt components, allowing multiple languages per prompt
  • Added rich text prompt editor with code-editor-like autocompletes for languages and ARPAbet
  • Added voice crafting system, and support for json-only voice models
  • Added style sliders
  • Added Tab autocompletion for selected text to ARPAbet via the g2p backend
  • Made sliders and checkboxes theme coloured
  • Added pagination to the output records, with user setting for pagination size
  • Filtered out non-audio files from the output records (Keeping only wav, mp3, ogg, opus, wma, xwm)
  • Upgrade from Electron v2 to v19
  • Removed game-specific app title changes. Only xVASynth going forward
  • Changed model details (i) info window opening to click, instead of hover to enable interactivity
  • Added the clean-up model as a pre-processing step for voice conversion
  • Added v2 and v3 UI switching for ARPAbet symbol list displays
  • Lots of misc tweaks, fixes, and polishing

 

Emotion sliders

 

So one of the new things I'm quite excited about is the introduction of emotion sliders! Just like the pitch and energy sliders, these additional new sliders allow per-symbol control over the emotion of a line. The available emotions so far are: Angry, Happy, Sad, Surprised

 

And the great thing about these is that they will work for any voice, even if the voice doesn't have any such emotions in the training data - one of the benefits of building over a model pre-trained on around 2.5k hours of data (more on this later). The implementation is very general, but this does also mean that I don't know how uniformly well it will work across any and all voices. I'll need to play around with this once there are more models out, but it seems to work alright so far!

 

[Image]

 

 

Patreon still doesn't seem to allow inlining audio files, annoyingly, so I've again attached a zip file with sample audio files to this post, so you can listen along with some examples. Check the samples in the "emotions" folder. (Said zip file requires being a Patreon supporter - LDD)

 

 

Style sliders

 

Staying with sliders for a moment, there are now also sliders for voice styles! This is another thing I'm personally quite excited about. This builds upon the voice style system described two posts ago (check that one out). This new addition simply allows per-symbol control, with dynamically added sliders for every new style you add. So you can have your base voice style and gradually fade into a different voice style (check the example audio files in the zip, "style_sliders" folder), or fade between two new styles, or whatever. Also, I've had some pretty cool results mixing this with the emotion sliders. The "play area" for voice delivery customization is now quite a bit bigger, if you enjoy spending time messing with this.

 

[Image]

Note, "Style: shouty" is selected in the dropdown next to "View:"

 

 

New prompt box, autocompletions, inline language backend switching

 

That was the sliders. Turning our attention slightly higher to the text prompt box, v3 now has a complete re-implementation of this, akin to the v1->v2 upgrade of the sliders. There are a few new functions built into it, as shown in this .gif which I hope plays, otherwise check the zip attachment:

 

[GIF]

 

The first new item is support for backslash escaping of languages. The v3 model supports many languages, but each one has a separate text pre-processing pipeline, complete with separate pronunciation dictionaries, different grapheme-to-phoneme backends, number/abbreviation replacements, etc. The "Base Language" dropdown in the app's toolbars specifies which one to use for processing the prompt text. But with this backslash escaping, you can now use multiple language backends, mixing and matching throughout the prompt. The syntax is as follows: \lang[lang_code][your_text], for example: \lang[en][Some text]. "en" here is the language code for English, but you can enter the code for other languages.

The new editor will do most of the work here for you. When you press the backslash key, an autocomplete option will pop up for the available options (so far just the languages), and you press Tab (or Enter, or click it) to apply it. The required syntax is automatically entered into the prompt box, and your cursor is moved to the first pair of square brackets, ready for typing in the language code. There will also be a secondary pop-up, immediately, with all the available options. You can cycle through them with the arrow keys, and filter them by starting to type. Again, Tab to automatically fill in, which will automatically move you to the second set of square brackets so you can write your text (you can also then Tab to move to the other side of the ] bracket, like in a code editor). You can listen to an audio file in the zip ("lang_inline" folder) to hear an example of using this to speak two different languages with the same voice, with mid-prompt switching.
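To make the syntax concrete, here's a small illustrative parser for prompts using these escapes - the app's own editor and backend handle this internally, so this is just to show how the segments split:

```python
import re

# Illustrative parser for \lang[code][text] escapes - not the app's internal implementation.
LANG_RE = re.compile(r"\\lang\[([a-z_]+)\]\[([^\]]*)\]")

def split_by_language(prompt: str, base_lang: str = "en"):
    segments, last = [], 0
    for m in LANG_RE.finditer(prompt):
        if m.start() > last:
            segments.append((base_lang, prompt[last:m.start()]))
        segments.append((m.group(1), m.group(2)))
        last = m.end()
    if last < len(prompt):
        segments.append((base_lang, prompt[last:]))
    return segments

print(split_by_language(r"Hello there. \lang[de][Guten Tag.] Nice to meet you."))
# [('en', 'Hello there. '), ('de', 'Guten Tag.'), ('en', ' Nice to meet you.')]
```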

 

The next thing is syntax highlighting and auto-completion of ARPAbet symbols. If you can never remember what the symbols are, and are tired of opening the ARPAbet menu to check the reference, this is for you. When you type a "{" opening squiggly bracket to start writing ARPAbet symbols, the closing "}" bracket is inserted, and one of those menus is shown, where you can again cycle between and filter down the list of available symbols (different for v2 and v3, as v3 has more), and Tab as required to auto-insert. While the cursor is still within ARPAbet { } brackets, you can press Ctrl+<space> to bring this menu up again, to repeat the process. You will know when you're inside ARPAbet {} brackets, as they will be red. Even otherwise, all ARPAbet symbols are now stylized with italics and bold, to make them clearer to see.

 

[GIF]

 

Finally, when selecting some text in the prompt box, a pop-up will give you the option to automatically convert that text into ARPAbet, by sending it to the appropriate, language-aware text pre-processing pipeline. You can press Escape to close any autocomplete tooltip.

 

 

Voice Workbench

 

Another new thing is the introduction of the "Voice Workbench" menu. This is a new menu accessible from the new "workbench" icon at the top right. This one will be a bit more complex to explain. The tl;dr is, this menu allows you to invent brand new voices, of people who don't exist. But this is not a replacement for xVATrainer.

 

You start by selecting a base model to use for this. You can select the "Base" v3 model, which is what was trained on just the big xVASpeech dataset without any individual fine-tuning, and is what will serve as the seed/genesis/foundation/starting model for v3 fine-tuned models in xVATrainer. Or, you can select one of the other v3 models, one fine-tuned on an actual individual voice.

 

[Image]

 

On the next menu, you can start messing with it. The system works through embedding space navigation, where you can provide some audio files to change the identity of the model towards that example. Your starting point is either an audio file you provide if using the Base model, or the identity of the fine-tuned model, if using that.

 

[Image]

 

Next, you drag+drop a second audio file over the "Reference Audio File A" field to get a second embedding. This second audio file is of a speaking style that you want your new voice to sound more like. You will have a "Current Delta" embedding, which is the direction from the existing style/embedding towards this new style/embedding. You can regulate how strongly to move towards this new style via the "Strength" slider (or number input, to go higher than 1). You can write a text prompt to generate some example audio with and without the new change applied, to see what difference your proposed change would make to the speaking style.

 

You can optionally also use another reference file, "B", for more advanced edits. When you ONLY use reference file A, the delta will be an interpolation between the current embedding and the new one. But when you ALSO use reference file "B", the delta will be the direction from ref file A to ref file B. So, for example, if you were working with a female voice but wanted the difference between a normal male voice and a shouty male voice, this should in theory extract the shoutiness change between the two male voices, rather than also the male-yness. How well this will work will come down to practice, the types of voices used, and the availability of files to play around with.
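The underlying arithmetic is simple embedding-space vector math. A minimal sketch - the function names and toy vectors are illustrative, not the app's internals:

```python
import numpy as np

def workbench_delta(current_emb, ref_a_emb, ref_b_emb=None):
    """Illustrative only - not the app's internal code."""
    if ref_b_emb is None:
        return ref_a_emb - current_emb    # move from the current identity towards A
    return ref_b_emb - ref_a_emb          # extract only the A -> B difference (e.g. "shoutiness")

def apply_delta(current_emb, delta, strength=1.0):
    return current_emb + strength * delta  # strength > 1 overshoots past the reference

current = np.array([0.2, -0.5, 1.0])       # toy 3-dimensional identity embedding
shouty  = np.array([0.3, -0.1, 0.7])
print(apply_delta(current, workbench_delta(current, shouty), strength=0.5))
# -> approximately [0.25, -0.3, 0.85]
```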

 

[Image]

 

Finally, you press the [Apply Delta] button to save your delta onto the voice embedding. Keep doing this until you're happy with the changes, then press the [Save] button to save it as a json-only voice model. These json-only voice models are usable the same way as any normal model, though they will show up in italics, on the left side. So long as you still have their reference model installed, they will work and use the new speaking identity you've just created.

 

 

Training and more data

 

On the model training side of things, I've spent some time recently just training up a few fine-tuned models. I got about 14 new models done, and collected various stats on the training, and listened through samples output by the training from several points throughout each of the models' training runs to find the sweet spot for training time. There doesn't seem to be a SUPER clean formula to get the right amount of training time, but I did scrape together a very rough formula which will err on the side of training for just a bit too long. I think to get the absolute best quality from training a model, you'll need to do what I did, which is to actually listen through some samples it outputs, every now and then. The metrics alone won't be enough. In the xVATrainer update, this rough formula should be able to train most voices alright even without supervision, but I'll add periodic synthesis of lines into a visualization folder, for a few lines, spanning a few languages. And some (small) checkpoints too. 

 

The training can always be continued, like before, but if a model has finished training, you can either pick the latest checkpoint and hope for the best, or listen through the last few checkpoints' samples and select your sweet spot from there (likely near the end, if the rough formula works well). I'm sure that over time, as I and other people train models, we can adjust that formula. 14 models is an ok sample size, but bigger would be better, of course

 

Also, I've been adding and removing data to/from the big xVASpeech dataset over the last few weeks/months. We have RAYANE#9271 on discord to thank for recently providing a whole bunch of Portuguese (pt-br) data, greatly expanding what we had available before. Now that the fine-tuning formula (for stage 2 of 2, stage 1 is TBD but simpler) is done, I'm going to quickly do another small run of pre-training for the base model to incorporate some of the new data - and without some of the bad data I've since filtered out. I'm also making one quick last-minute change to the model architecture, to future-proof the model for addition of new languages in the future, as it's currently hard-coded to the 31 supported languages. I think 50 slots should probably do. The up-to-date xVASpeech dataset stats are as follows, but keep in mind I'm still currently leaving out Amharic, Mongolian, and Thai training, due to bad/insufficient data. 

 

[Image: xVASpeech dataset statistics]

 

If you have any non-English data you'd like to contribute to the cause, let me know! It'll go a long way and be appreciated by many!

 

--------------

 

One other thing: I've finally upgraded the Electron front-end from v2 to v19, so the app should hopefully feel "nicer" to use - no longer a 5-year-old Chrome browser. Looking through my todo list, there's actually not a huge amount left to do for v3! There are a couple of extra things I'd like to see if I can add in, and of course some further polishing, but the end is in sight! I'll prepare another pre-release build very soon, which will likely be the beta.

 

 

 

 

files.zip   (available ONLY if you are on his patreon)

Said content is that of Dan Ruta, the tool's creator...
 
 
Spoiler

 

 

 

 

 

 

I've had a lot of time to occupy myself with this past week. Having just finished my last bit of technical work for the PhD, it was time to finish up the releases for v3 xVASynth and the xVATrainer updates. Check the YouTube link to watch the v3 showcase video, for the main changes/new additions.

The full changelog for v3 xVASynth is as follows:

 

Headline changes:

 

- Added support for v3 models
- Added multi-lingual support, with dropdown setting for the base language to process v3 TTS input as
- Purged old speech-to-speech system. Replaced with new v3 voice conversion system, with new python-based mic recording
- Added emotion sliders for: Angry, Happy, Sad, Surprised
- Added rich text prompt editor with code-editor-like autocompletes for languages and ARPAbet
- Added backslash lang control in sub-prompt components, allowing multiple languages per prompt
- Added voice crafting system, and support for json-only voice models
- Added style sliders, and management system
- Added backwards-compatible audio 48khz super-resolution post-processing via nuwave2 diffusion model
- Added deepfilternet2 'Clean-up' post-processing model alongside super-resolution

Post-processing:

- Added ffmpeg settings for noise reduction, with configurable parameters
- Added SR and clean-up post-processing support to batch mode
- Added de-clipping and de-clicking post-processing ffmpeg filters
- Added de-essing post-processing filter

 

UI Changes:

 

- Upgrade from Electron v2 to v19
- Added right click menu on output editor to copy ARPAbet sequence to clipboard
- Added error modals for problems with Nexus API use, printing out the full response
- More reactive Voice Conversion UI
- Auto-insert spaces at start/end of text prompts
- Made the automatic space padding configurable via user setting, and not affect the prompt box or output filename
- Ensured temporary files are cleaned out after every synthesis
- Added Tab autocompletion for selected text to ARPAbet via the g2p backend
- Added explicit error message for unsupported ARPAbet symbol
- Improved batch mode window design
- Changed default ffmpeg Hz to 44100
- Filtered out * asterisks from prompts
- Cleared old sliders editor TTS data when voice conversion is used
- Disabled useless json output for voice conversion samples
- Added more obvious styling for disabled inputs
- Removed batch fast mode
- Removed now useless 'Automatically generate voice' Voice Conversion setting
- Removed ffmpeg pre-apply setting, use as enabled going forward
- Made sliders and checkboxes theme coloured
- Added pagination to the output records, with user setting for pagination size
- Filtered out non-audio files from the output records (Keeping only wav, mp3, ogg, opus, wma, xwm)
- Removed game-specific app title changes. Only xVASynth going forward
- Changed model details (i) info window opening to click, instead of hover to enable interactivity
- Added v2 and v3 UI switching for ARPAbet symbol list displays
- Added CC-BY-4.0 category
- Added splash page with YouTube links, before the EULA
- Hidden the Endorsements/Downloads counters from the Nexus Manage Repos menu, as they do not update

Under-the-hood:

 

- Re-built backend environments with Python 3.9
- Add 'pfft' entries into the cmudict
- Updated 3d emb visualizer to use better v3 embedding space model
- Added ability to load multiple model instances for programmatic use of the app via local http
- Added fallback model installation debugging mode for when models are loaded with missing asset files
- Added the clean-up model as a pre-processing step for voice conversion

 

Fixes:

 

- Stopped start-up error messages being very quickly dismissed by the start-up process
- Fixed Jitter button not doing anything for zero values
- Fixed microphone recording not working for mono-only microphones
- Stopped using cached g2p words if prediction failed (blank prediction)
- Added more explicit and hopefully more robust device switching for synthesis (incl for super-resolution)
- Improved/fixed error display/logging for batch mode
- Fixed batch mode bug where pitch_amp only affects first batch line
- Fixed batch mode output_path not being shown properly
- Fixed editor issue where disabled sliders could still be moved if clicked in the right place
- Fixed dashes being ignored in ARPAbet dictionary replacements
- Fixed not all settings being reset
- [Bunglepaws] Fixed manual ARPAbet input being broken in the text processing pipeline
- Fixed 3D voice embeddings visualizer not displaying gender information correctly
- Lots and lots of other misc tweaks, fixes, and polishing


As for the v1.2.0 xVATrainer update, the full changelog is:

- Added training support for the v3 models
- Added support for whisper models for automatic speech-to-text transcription
- Removed Wav2Vec2 ASR models
- Added automatic audio formatting and audio normalization tools' effects to dataset pre-processing
- Skip audio pre-processing if the files are all already there
- Changed audio normalization tool to also convert stereo to mono
- Removed main screen audio pre-processing button
- Fixed voice exporting
- Fixed end-of-training breaking the UI
- Misc bug fixes

 


--------   --------   --------   --------

 

 

There are a few v3 voices already done, to go with the v3 release. Aside from gaming voices, I got together some models trained on the NVIDIA HiFi TTS dataset, which is permissively licensed as CC BY 4.0. This means you can probably use these models commercially, if you need to. Do note that all v3 models are fine-tuned from the base model, which was pre-trained with non-permissively licensed data. That's probably not an issue, but be aware. We're still in the wild-west days of AI in terms of licensing and permissions, so I don't think there are any rules yet against the training data used by models from BEFORE fine-tuning with permissive data, but I am not a lawyer, this is not legal advice, etc etc. The voices are uploaded to the xVATrainer nexusmods page, under the Miscellaneous files section. The voices are:

 

- NVIDIA HIFI 11614 F
- NVIDIA HIFI 12787 F
- NVIDIA HIFI 6097 M
- NVIDIA HIFI 6670 M
- NVIDIA HIFI 6671 M
- NVIDIA HIFI 8051 F
- NVIDIA HIFI 9017 M
- NVIDIA HIFI 9136 F
- NVIDIA HIFI 92 F

 

These all go into a new category in xVASynth, "CC BY 4.0", for permissively licensed voice models.

 

As for the gaming voices, these are mostly the voices used in the showcase video, plus a few more. They are:

 

- Fallout 4: Cait
- Fallout 4: Curie
- Fallout 4: Danse
- Fallout 4: Deacon
- Fallout 4: Gen1Synth01 - credits: radbeetle
- Fallout 4: Hancock
- Fallout 4: Nick - credits: radbeetle
- Fallout 4: Piper - credits: HappyPenguin
- Fallout 4: Preston
- Fallout 4: Supermutant
- Fallout 4: Nora
- Fallout 4: Nate
- Skyrim: Delphine - credits: HappyPenguin
- Skyrim: FemaleCommoner
- Skyrim: FemaleKhajiit
- Skyrim: FemaleNord
- Skyrim: MaleArgonian
- Skyrim: MaleEvenTonedAccented
- Skyrim: MaleNord
- Skyrim: Serana
- Skyrim: Ulfric
- Mass Effect: EDI
- Mass Effect: Garrus
- Mass Effect: Joker
- Mass Effect: Miranda - credits: Negomi
- Mass Effect: Shepard (Female)
- Fallout 3: Amata
- Fallout 3: Butch
- Oblivion: Female Nords
- The Witcher: Cirilla
- The Witcher: Geralt
- The Witcher: Yennefer
- Command and Conquer: CABAL - credits: Pendrokar
- Cyberpunk 2077: V (Male)
- Cyberpunk 2077: V (Female)

 

These should be enough for people to have some voices to play with at launch, while others are still being worked on. I've uploaded these to their respective nexusmods pages. I've also updated the Steam builds, which I'd recommend for the base apps. Do note that Steam wasn't letting me upload enough GB to also store the priors data required for training, so unfortunately this needs to be downloaded from the slow Nexus servers, from the xVATrainer page.

 

xVASynth and xVATrainer both went through a fair bit more pre-release testing this time around, so hopefully there won't be any major issues to fix right away. But this week is still quite clear, so I'll be around to patch things up if anything comes up.

 

--------   --------   --------   --------

 

The v3 update (with everything it entailed) has taken a huge amount of time and effort to develop, and I couldn't have done it without the support of everyone here, and on the Discord server. I've also had direct help from people both on Discord, as well as through pull requests. To everyone, thank you so much for all the help and support - it means a lot.

 

Let's see how the release goes!

 

 

 

Edited by LongDukDong
Link to comment
 
Said content is that of Dan Ruta, the tool's creator...
 

 

Spoiler

 

 

Well, the v3 release has (mostly) gone ok, aside from some bugs that have now been fixed, and a few that are still outstanding for upcoming patches. I've mainly been spending my time working through some patches, and training up more voices to v3 (other than some real life issues I've had to deal with recently, including writing my thesis).

 

All patches have been pushed through to both Nexus and Steam. The compound changes across the xVASynth v3.0.0 -> v3.0.2 patches contain:

- Fixed batch mode (tensor dimension error)
- Added _ as an allowed symbol in custom ARPAbet dictionaries
- Made auto-play enabled by default
- Added versioning info into the saved json files. (App version, model version, model type)
- Made "Clean-up" enabled by default
- [Bunglepaws] Fixed custom dictionaries use in the text pre-processing pipeline
- [Bunglepaws] Fixed linux use of Electron's "show in folder"
- Fixed pitch sliders being outside the range for some models
- Fixed the editor being broken sometimes on some older voices
- Fixed batch mode output folder creation errors
- Fixed nexus downloads not auto-creating missing models folders, and failing installations
- Fixed inverted Raise/Lower editor buttons on v3 models
- Fixed batch mode json outputting for v3 models
- Fixed occasional broken sliders, especially for older voices
- Added filtering to the languages dropdown by the languages listed in the voice json, if any
- Fixed not being able to overwrite files
- Added batch mode resilience to csv files with extra spaces in the header values (see the csv sketch after this list)
- Fixed batch "Open Output" folder button
- Fixed energy sliders broken on older voices
- Fixed colons in words preventing dict phoneme replacements and falling back to g2p
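
The csv resilience fix amounts to trimming whitespace from the header row (and cell values) before looking columns up. The small sketch below shows that idea; the column names are hypothetical examples, not the batch mode's actual header names.

import csv
import io

# Sketch of tolerating extra spaces around csv header values.
# The column names here are hypothetical, not the real batch csv headers.
sample = "game_id , voice_id , text\nfallout4, f4_piper, Hello there.\n"

reader = csv.DictReader(io.StringIO(sample))
reader.fieldnames = [name.strip() for name in reader.fieldnames]  # strip header padding
for row in reader:
    cleaned = {key: value.strip() for key, value in row.items()}
    print(cleaned["voice_id"], "->", cleaned["text"])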

 

xVATrainer is on v1.2.1, and the changes include:

 

- Reduced maximum system RAM consumption during training
- Fixed UI broken after clearing training queue
- Added languages trained on in priors into the voice json
- Stronger manual VRAM management for lower consumption during training
- Added extra error message for missing PRIORS data
- Added better handling of embedding clustering on lower spec systems for big datasets (see the clustering sketch after this list)
- Fixed missing dependencies that were not bundled into the compilation from the new python environment
- Fixed clustering tool
- Temporarily removed speaker diarization tool
- Fixed regex replace errors
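
For big datasets on lower spec machines, the usual trick is to cluster embeddings in small batches rather than all at once. The scikit-learn sketch below shows that general approach; it is not xVATrainer's actual implementation, and the random embeddings are stand-ins for real speaker embeddings.

import numpy as np
from sklearn.cluster import MiniBatchKMeans

# General mini-batch clustering sketch (not xVATrainer's implementation):
# MiniBatchKMeans fits on small chunks, keeping memory use low for big datasets.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((50_000, 256)).astype(np.float32)  # stand-in embeddings

kmeans = MiniBatchKMeans(n_clusters=8, batch_size=1024, random_state=0)
labels = kmeans.fit_predict(embeddings)
print(np.bincount(labels))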

 

Unfortunately, I've been having some issues with the diarization tool in xVATrainer: the new backend dependency bundling doesn't like one of its dependencies, so it doesn't work as expected once shipped. This is likely the biggest priority to fix for now, on the xVATrainer side of things.

As for the new voices added since the release, they are:

 

- Skyrim: MaleOrc
- Skyrim: MaleKhajiit
- Skyrim: FemaleDarkElf
- Skyrim: MaleDunmer
- Skyrim: FemaleYoungEager
- Skyrim: MaleYoungEager
- Skyrim: FemaleEvenToned
- Skyrim: MaleEvenToned
- Fallout New Vegas: MaleAdult02
- Skyrim: Maven
- Skyrim: Brynjolf
- Skyrim: FemaleSultry
- Skyrim: FemaleArgonian
- Skyrim: FemaleElfHaughty
- Skyrim: MaleCoward
- Skyrim: FemaleCoward
- Skyrim: MaleCommander
- Skyrim: FemaleCommander
- Skyrim: FemaleOrc
- Skyrim: MaleDarkElfCynical
- Skyrim: MaleOldGrumpy
- Skyrim: Karliah
- Skyrim: KodlakWhitemane

 

Interestingly, the Skyrim:FemaleCondescending and Skyrim:MaleElfHaughty just completely failed to train, so I haven't posted those. But I'll try them again, with some different settings.

 

Getting through voices seems to be a bit quicker than it was for v2 (for me anyway), so at least while I'm still working through real life things, I'll likely mostly focus on getting more voices re-trained. Though there are some things on the todo list for patches still, of course.

 

I'm so far just aimlessly going through the Skyrim list (as this is where most interest is), but I'm happy to take suggestions for where to aim my computer, from the existing datasets/voices on the list.

 

 

Edited by LongDukDong
Link to comment
But more links are available
 

DanRuta made multiple pages within NexusMods for xVASynth, one for each game he considered it for. And within each page, he placed links to his Patreon page. However, the Patreon page no longer offers any information.

 

I do have new information nonetheless.

 

Sadly, DanRuta has no plans to work on a vocal pack for Sheogorath.  :dissapointed:  Alas, that was one I was waiting for.  On the other hand, there are OTHERS who are still making various vocal packs.  Hopefully, one such modder will craft such a resource.

 

As to other vocal pack artists, the bottom of the main page now has more Nexus Mods links.  Besides the Male Imperial voice for Oblivion, which was not made by DanRuta, there are now voices for Skyrim Children in English, German, and French. And you may find Protectron's voice for fans of Fallout 3. And on an odd note, there is a link for Pelican xVASynth for Stardew Valley players.

Of special note, there is now a link to a GoogleDocs spreadsheet where you may find a wide variety of packs.  Voices such as the cast of Star Trek: The Next Generation, Vin Diesel, Max Caulfield from Life is Strange, Ellie from The Last of Us, the Harley Quinn of the 90s animated Batman series, and even Optimus Prime and Yoda can be found among the dozens and dozens within, with working Mediafire and MegaNZ hotlinks.  Admittedly, not all of them are great. But there is a wide variety.

Edited by LongDukDong
Link to comment
xVASynth Content Search
Begins in 2024
 
The new year begins, and a pair of vocal packs have appeared within Nexus Mods:
 
 
 
Okay, I was completely confused as to why this vocal pack appeared within the Skyrim Special Edition board at Nexus Mods, as I had never seen or heard of the character in Skyrim. It appears, however, that Onean is a fully voiced follower mod that has been available at Nexus since 2020, and even has cosmetic packs.  So... someone decided to make an xVASynth vocal pack of the actress's anime-like voice.
 
Soundcloud if you want to hear:  (Click)
 
And the page includes a link to the original mod.
 
I guess it's understandable that Dan Ruta chose not to release child voices, all things considered, and the creator of this vocal pack even noted it within the description. But one may want to craft a mod that includes extra dialogue for child-race NPCs that already exist (Babette in Skyrim, for example).  And so, this vocal pack is for Shaun, the son of the main character in Fallout 4.
 
Youtube example:
Spoiler

 

The vocals appear rushed, but if you have patience, you could make it work better than what HE put up.

 

While links for both xVASynth vocals are in this post, I did include them in the main page as well.

Link to comment
