
Adds support for precomputing conditioning latents to xtts for repeated inference on the same reference wavs for significant performance gains. #2956

Closed
wants to merge 1 commit into from

Conversation


@Iamgoofball Iamgoofball commented Sep 16, 2023

This accidentally includes #2951 because I was lazy.

To use:

xtts = TTS("tts_models/multilingual/multi-dataset/xtts_v1", gpu=True)
xtts.synthesizer.tts_model.precompute_conditioning_latents("./path_to_wavs")

...

xtts.tts_to_file(
	text,
	file_path="./path_to_output.wav",
	speaker_wav="./xtts_ref_wavs/ref_speaker.wav",
	language="en",
	precomputed_latents=True)

While benchmarking the inference stack, I found that roughly two-thirds of the inference time was spent precomputing the GPT conditioning latents. That is wasted work when you're running repeated inference against fixed reference audio.

Without precomputed latents: [benchmark screenshot]
With precomputed latents: [benchmark screenshot]
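The pattern this PR implements can be sketched in plain Python: compute the conditioning latents once per reference wav, store them keyed by path, then reuse them on every synthesis call. The names below (`compute_latents`, `LatentCache`) are illustrative stand-ins, not the PR's actual API, and the expensive step is simulated with a sleep:

```python
import time

def compute_latents(wav_path):
    # Stand-in for GPT conditioning-latent extraction (illustrative only).
    time.sleep(0.05)  # simulate the expensive step
    return hash(wav_path)  # pretend this is the latent tensor

class LatentCache:
    """Precompute latents once per reference wav, reuse on every call."""
    def __init__(self):
        self._cache = {}

    def precompute(self, wav_paths):
        for p in wav_paths:
            self._cache[p] = compute_latents(p)

    def get(self, wav_path):
        # Fall back to on-the-fly computation for uncached wavs.
        if wav_path not in self._cache:
            self._cache[wav_path] = compute_latents(wav_path)
        return self._cache[wav_path]

cache = LatentCache()
cache.precompute(["ref_speaker.wav"])  # pay the cost once, up front

start = time.perf_counter()
latent = cache.get("ref_speaker.wav")  # cache hit: no recomputation
elapsed = time.perf_counter() - start
print(elapsed < 0.05)  # True: the hit skips the expensive step
```

The same idea applies per speaker: as long as the reference wavs don't change, the latents are pure functions of those files and can be computed once and reused indefinitely.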

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@skshadan

Please provide the full code.

@Iamgoofball
Author

This is the full code for this feature.

@skshadan

I used it, but my inference time is still the same. :(

@Iamgoofball
Author

I recommend doing some local benchmarking.

@skshadan

Locally it is taking too much time. :((

@skshadan

[screenshot]

@PranjalyaDS

I have noticed that latent computation takes the longest on the first run; subsequent latent computations take only a minuscule amount of time, even when different reference audio files are used. I am not sure of the reason; the most obvious explanation would be that the GPT and diffusion latent generators are lazily loaded on first use?
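If lazy initialization is indeed the cause, the slow-first-run behavior described above falls out naturally: the heavy model is loaded on the first call, so only that call pays the load cost, regardless of which reference wav is passed. A minimal illustration of the mechanism (all names hypothetical, with the model load simulated):

```python
class LatentGenerator:
    """Loads a heavy model lazily, on the first compute() call."""
    def __init__(self):
        self._model = None  # nothing loaded yet

    def _load_model(self):
        # Stand-in for loading GPT/diffusion weights onto the GPU.
        return object()

    def compute(self, wav_path):
        if self._model is None:      # only the first call pays this cost
            self._model = self._load_model()
        return (wav_path, id(self._model))

gen = LatentGenerator()
first = gen.compute("a.wav")   # triggers the model load
second = gen.compute("b.wav")  # reuses the already-loaded model
print(first[1] == second[1])   # True: same model instance both times
```

This would explain why the second run is fast even with a different reference file: it is the model load, not the per-file latent computation, that dominates the first run.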

@stale

stale bot commented Oct 28, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also check our discussion channels.

@stale stale bot added the wontfix This will not be worked on but feel free to help. label Oct 28, 2023
@stale stale bot closed this Nov 5, 2023
Labels
wontfix This will not be worked on but feel free to help.

4 participants