Add Delightful-TTS model #2095
Merged
Changes from all commits (67 commits)
dab098c
add configs
loganhart02 4d2bcc8
Update config file
loganhart02 c757d97
Add model configs
loganhart02 944216e
Add model layers
loganhart02 99d71ac
Add layer files
loganhart02 c4c4e3f
Add layer modules
loganhart02 169ca74
change config names
loganhart02 49b43dd
Add emotion manager
loganhart02 023f33d
fIX missing ap bug
loganhart02 0e13f89
Fix missing ap bug
loganhart02 e291e15
Add base TTS e2e class
loganhart02 05794cd
Fix wrong variable name in load_tts_samples
loganhart02 2ce4f6d
Add training script
loganhart02 7bf7047
Remove range predictor and gaussian upsampling
loganhart02 29a1b67
Add helper function
loganhart02 c1c7701
Add vctk recipe
loganhart02 a0c03ed
Add conformer docs
loganhart02 78bbdac
Fix linting in conformer.py
loganhart02 d61e953
Add Docs
loganhart02 a896ac2
remove duplicate import
loganhart02 8cae5bf
refactor args
loganhart02 0436b4c
Fix bugs
loganhart02 11fe6b0
Removew emotion embedding
loganhart02 93b8ccb
remove unused arg
loganhart02 c106d89
Remove emotion embedding arg
loganhart02 4d46434
Remove emotion embedding arg
loganhart02 ad64a53
fix style issues
loganhart02 6cdfab4
Fix bugs
loganhart02 cb5e24f
Fix bugs
loganhart02 eb9be14
Add unittests
loganhart02 34b8bf8
make style
loganhart02 5b30274
fix formatter bug
loganhart02 c426f49
fix test
loganhart02 340349c
Add pyworld compute pitch func
loganhart02 299c2da
Update requirments.txt
loganhart02 11c6b80
Fix dataset Bug
loganhart02 b1b5633
Chnge layer norm to instance norm
loganhart02 0248b7f
Add missing import
loganhart02 92f2464
Remove emotions.py
loganhart02 7f0d890
remove ssim loss
loganhart02 8cffece
Add init layers func to aligner
loganhart02 f9c80a6
refactor model layers
loganhart02 658bd79
remove audio_config arg
loganhart02 759df28
Rename loss func
loganhart02 cd03d67
Rename to delightful-tts
loganhart02 0dd3aef
Rename loss func
loganhart02 7b934e4
Remove unused modules
loganhart02 6160cd2
refactor imports
loganhart02 378370a
replace audio config with audio processor
loganhart02 ced8f34
Add change sample rate option
loganhart02 7a8b825
remove broken resample func
loganhart02 156557c
update recipe
loganhart02 cfece08
fix style, add config docs
21dad7a
fix tests and multispeaker embd dim
03007a5
remove pyworld
a026cfc
Make style and fix inference
erogol c49a418
Split tts tests
erogol 96841c6
Fixup
erogol 09a2424
Fixup
erogol 4a287c1
Fixup
erogol 2abb754
Add argument names
erogol 1362cb1
Set "random" speaker in the model Tortoise/Bark
erogol 1fe6a53
Use a diff f0_cache path for delightfull tts
erogol 6349950
Fix delightful speaker handling
erogol dd093e0
Fix lint
erogol 3fde149
Make style
erogol b5bf9e6
Merge branch 'dev' into delightful-tts
erogol
@@ -0,0 +1,53 @@
name: tts-tests2

on:
  push:
    branches:
      - main
  pull_request:
    types: [opened, synchronize, reopened]
jobs:
  check_skip:
    runs-on: ubuntu-latest
    if: "! contains(github.event.head_commit.message, '[ci skip]')"
    steps:
      - run: echo "${{ github.event.head_commit.message }}"

  test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: [3.9, "3.10", "3.11"]
        experimental: [false]
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
          architecture: x64
          cache: 'pip'
          cache-dependency-path: 'requirements*'
      - name: check OS
        run: cat /etc/os-release
      - name: set ENV
        run: export TRAINER_TELEMETRY=0
      - name: Install dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y --no-install-recommends git make gcc
          sudo apt-get install espeak
          sudo apt-get install espeak-ng
          make system-deps
      - name: Install/upgrade Python setup deps
        run: python3 -m pip install --upgrade pip setuptools wheel
      - name: Replace scarf urls
        run: |
          sed -i 's/https:\/\/coqui.gateway.scarf.sh\//https:\/\/github.com\/coqui-ai\/TTS\/releases\/download\//g' TTS/.models.json
      - name: Install TTS
        run: |
          python3 -m pip install .[all]
          python3 setup.py egg_info
      - name: Unit tests
        run: make test_tts2
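The "Replace scarf urls" step rewrites model download URLs in `TTS/.models.json` so CI fetches releases directly from GitHub instead of routing through the Scarf gateway. The same substitution the `sed` command performs can be sketched in Python (the example URL below is illustrative, not taken from the models file):

```python
# Rewrite Coqui's Scarf gateway URLs to direct GitHub release URLs,
# mirroring the sed command in the workflow above.
SCARF_PREFIX = "https://coqui.gateway.scarf.sh/"
GITHUB_PREFIX = "https://github.com/coqui-ai/TTS/releases/download/"


def replace_scarf_url(url: str) -> str:
    """Return the URL with the Scarf gateway prefix swapped for GitHub's."""
    return url.replace(SCARF_PREFIX, GITHUB_PREFIX)


print(replace_scarf_url("https://coqui.gateway.scarf.sh/v1/model.zip"))
# -> https://github.com/coqui-ai/TTS/releases/download/v1/model.zip
```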
@@ -0,0 +1,170 @@
from dataclasses import dataclass, field
from typing import List

from TTS.tts.configs.shared_configs import BaseTTSConfig
from TTS.tts.models.delightful_tts import DelightfulTtsArgs, DelightfulTtsAudioConfig, VocoderConfig


@dataclass
class DelightfulTTSConfig(BaseTTSConfig):
    """
    Configuration class for the DelightfulTTS model.

    Attributes:
        model (str): Name of the model ("delightful_tts").
        audio (DelightfulTtsAudioConfig): Configuration for audio settings.
        model_args (DelightfulTtsArgs): Configuration for model arguments.
        use_attn_priors (bool): Whether to use attention priors.
        vocoder (VocoderConfig): Configuration for the vocoder.
        init_discriminator (bool): Whether to initialize the discriminator.
        steps_to_start_discriminator (int): Number of steps to start the discriminator.
        grad_clip (List[float]): Gradient clipping values.
        lr_gen (float): Learning rate for the GAN generator.
        lr_disc (float): Learning rate for the GAN discriminator.
        lr_scheduler_gen (str): Name of the learning rate scheduler for the generator.
        lr_scheduler_gen_params (dict): Parameters for the learning rate scheduler for the generator.
        lr_scheduler_disc (str): Name of the learning rate scheduler for the discriminator.
        lr_scheduler_disc_params (dict): Parameters for the learning rate scheduler for the discriminator.
        scheduler_after_epoch (bool): Whether to schedule after each epoch.
        optimizer (str): Name of the optimizer.
        optimizer_params (dict): Parameters for the optimizer.
        ssim_loss_alpha (float): Alpha value for the SSIM loss.
        mel_loss_alpha (float): Alpha value for the mel loss.
        aligner_loss_alpha (float): Alpha value for the aligner loss.
        pitch_loss_alpha (float): Alpha value for the pitch loss.
        energy_loss_alpha (float): Alpha value for the energy loss.
        u_prosody_loss_alpha (float): Alpha value for the utterance prosody loss.
        p_prosody_loss_alpha (float): Alpha value for the phoneme prosody loss.
        dur_loss_alpha (float): Alpha value for the duration loss.
        char_dur_loss_alpha (float): Alpha value for the character duration loss.
        binary_align_loss_alpha (float): Alpha value for the binary alignment loss.
        binary_loss_warmup_epochs (int): Number of warm-up epochs for the binary loss.
        disc_loss_alpha (float): Alpha value for the discriminator loss.
        gen_loss_alpha (float): Alpha value for the generator loss.
        feat_loss_alpha (float): Alpha value for the feature loss.
        vocoder_mel_loss_alpha (float): Alpha value for the vocoder mel loss.
        multi_scale_stft_loss_alpha (float): Alpha value for the multi-scale STFT loss.
        multi_scale_stft_loss_params (dict): Parameters for the multi-scale STFT loss.
        return_wav (bool): Whether to return audio waveforms.
        use_weighted_sampler (bool): Whether to use a weighted sampler.
        weighted_sampler_attrs (dict): Attributes for the weighted sampler.
        weighted_sampler_multipliers (dict): Multipliers for the weighted sampler.
        r (int): Value for the `r` override.
        compute_f0 (bool): Whether to compute F0 values.
        f0_cache_path (str): Path to the F0 cache.
        attn_prior_cache_path (str): Path to the attention prior cache.
        num_speakers (int): Number of speakers.
        use_speaker_embedding (bool): Whether to use speaker embedding.
        speakers_file (str): Path to the speaker file.
        speaker_embedding_channels (int): Number of channels for the speaker embedding.
        language_ids_file (str): Path to the language IDs file.
    """

    model: str = "delightful_tts"

    # model specific params
    audio: DelightfulTtsAudioConfig = field(default_factory=DelightfulTtsAudioConfig)
    model_args: DelightfulTtsArgs = field(default_factory=DelightfulTtsArgs)
    use_attn_priors: bool = True

    # vocoder
    vocoder: VocoderConfig = field(default_factory=VocoderConfig)
    init_discriminator: bool = True

    # optimizer
    steps_to_start_discriminator: int = 200000
    grad_clip: List[float] = field(default_factory=lambda: [1000, 1000])
    lr_gen: float = 0.0002
    lr_disc: float = 0.0002
    lr_scheduler_gen: str = "ExponentialLR"
    lr_scheduler_gen_params: dict = field(default_factory=lambda: {"gamma": 0.999875, "last_epoch": -1})
    lr_scheduler_disc: str = "ExponentialLR"
    lr_scheduler_disc_params: dict = field(default_factory=lambda: {"gamma": 0.999875, "last_epoch": -1})
    scheduler_after_epoch: bool = True
    optimizer: str = "AdamW"
    optimizer_params: dict = field(default_factory=lambda: {"betas": [0.8, 0.99], "eps": 1e-9, "weight_decay": 0.01})

    # acoustic model loss params
    ssim_loss_alpha: float = 1.0
    mel_loss_alpha: float = 1.0
    aligner_loss_alpha: float = 1.0
    pitch_loss_alpha: float = 1.0
    energy_loss_alpha: float = 1.0
    u_prosody_loss_alpha: float = 0.5
    p_prosody_loss_alpha: float = 0.5
    dur_loss_alpha: float = 1.0
    char_dur_loss_alpha: float = 0.01
    binary_align_loss_alpha: float = 0.1
    binary_loss_warmup_epochs: int = 10

    # vocoder loss params
    disc_loss_alpha: float = 1.0
    gen_loss_alpha: float = 1.0
    feat_loss_alpha: float = 1.0
    vocoder_mel_loss_alpha: float = 10.0
    multi_scale_stft_loss_alpha: float = 2.5
    multi_scale_stft_loss_params: dict = field(
        default_factory=lambda: {
            "n_ffts": [1024, 2048, 512],
            "hop_lengths": [120, 240, 50],
            "win_lengths": [600, 1200, 240],
        }
    )

    # data loader params
    return_wav: bool = True
    use_weighted_sampler: bool = False
    weighted_sampler_attrs: dict = field(default_factory=lambda: {})
    weighted_sampler_multipliers: dict = field(default_factory=lambda: {})

    # overrides
    r: int = 1

    # dataset configs
    compute_f0: bool = True
    f0_cache_path: str = None
    attn_prior_cache_path: str = None

    # multi-speaker settings
    # use speaker embedding layer
    num_speakers: int = 0
    use_speaker_embedding: bool = False
    speakers_file: str = None
    speaker_embedding_channels: int = 256
    language_ids_file: str = None
    use_language_embedding: bool = False

    # use d-vectors
    use_d_vector_file: bool = False
    d_vector_file: str = None
    d_vector_dim: int = None

    # testing
    test_sentences: List[str] = field(
        default_factory=lambda: [
            "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
            "Be a voice, not an echo.",
            "I'm sorry Dave. I'm afraid I can't do that.",
            "This cake is great. It's so delicious and moist.",
            "Prior to November 22, 1963.",
        ]
    )

    def __post_init__(self):
        # Pass multi-speaker parameters to the model args as `model.init_multispeaker()` looks for it there.
        if self.num_speakers > 0:
            self.model_args.num_speakers = self.num_speakers

        # speaker embedding settings
        if self.use_speaker_embedding:
            self.model_args.use_speaker_embedding = True
        if self.speakers_file:
            self.model_args.speakers_file = self.speakers_file

        # d-vector settings
        if self.use_d_vector_file:
            self.model_args.use_d_vector_file = True
        if self.d_vector_dim is not None and self.d_vector_dim > 0:
            self.model_args.d_vector_dim = self.d_vector_dim
        if self.d_vector_file:
            self.model_args.d_vector_file = self.d_vector_file
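The `__post_init__` hook above mirrors the top-level multi-speaker settings into `model_args`, because `model.init_multispeaker()` reads them from the nested args rather than the config itself. The pattern can be shown with a minimal, self-contained sketch (the classes below are simplified stand-ins, not the real `DelightfulTTSConfig` and `DelightfulTtsArgs`):

```python
from dataclasses import dataclass, field


@dataclass
class ModelArgs:  # stand-in for DelightfulTtsArgs
    num_speakers: int = 0
    use_speaker_embedding: bool = False


@dataclass
class Config:  # stand-in for DelightfulTTSConfig
    num_speakers: int = 0
    use_speaker_embedding: bool = False
    model_args: ModelArgs = field(default_factory=ModelArgs)

    def __post_init__(self):
        # Mirror top-level speaker settings into model_args, as the real
        # config does so model.init_multispeaker() finds them there.
        if self.num_speakers > 0:
            self.model_args.num_speakers = self.num_speakers
        if self.use_speaker_embedding:
            self.model_args.use_speaker_embedding = True


cfg = Config(num_speakers=4, use_speaker_embedding=True)
print(cfg.model_args.num_speakers)  # nested args now reflect the top-level values
```

Because `dataclasses` calls `__post_init__` automatically after `__init__`, the caller only sets the top-level fields and the nested args stay consistent.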
Empty file.
You can consider typing docstrings for the config arguments. It'd help you understand the architecture better.