Add Delightful-TTS model #2095

Merged 67 commits on Jul 24, 2023
dab098c
add configs
loganhart02 Oct 25, 2022
4d2bcc8
Update config file
loganhart02 Oct 25, 2022
c757d97
Add model configs
loganhart02 Oct 25, 2022
944216e
Add model layers
loganhart02 Oct 28, 2022
99d71ac
Add layer files
loganhart02 Oct 31, 2022
c4c4e3f
Add layer modules
loganhart02 Nov 1, 2022
169ca74
change config names
loganhart02 Nov 1, 2022
49b43dd
Add emotion manager
loganhart02 Nov 1, 2022
023f33d
Fix missing ap bug
loganhart02 Nov 1, 2022
0e13f89
Fix missing ap bug
loganhart02 Nov 3, 2022
e291e15
Add base TTS e2e class
loganhart02 Nov 3, 2022
05794cd
Fix wrong variable name in load_tts_samples
loganhart02 Nov 3, 2022
2ce4f6d
Add training script
loganhart02 Nov 4, 2022
7bf7047
Remove range predictor and gaussian upsampling
loganhart02 Nov 4, 2022
29a1b67
Add helper function
loganhart02 Nov 7, 2022
c1c7701
Add vctk recipe
loganhart02 Nov 7, 2022
a0c03ed
Add conformer docs
loganhart02 Nov 16, 2022
78bbdac
Fix linting in conformer.py
loganhart02 Nov 16, 2022
d61e953
Add Docs
loganhart02 Nov 23, 2022
a896ac2
remove duplicate import
loganhart02 Nov 23, 2022
8cae5bf
refactor args
loganhart02 Nov 23, 2022
0436b4c
Fix bugs
loganhart02 Nov 25, 2022
11fe6b0
Remove emotion embedding
loganhart02 Nov 25, 2022
93b8ccb
remove unused arg
loganhart02 Nov 25, 2022
c106d89
Remove emotion embedding arg
loganhart02 Nov 25, 2022
4d46434
Remove emotion embedding arg
loganhart02 Nov 25, 2022
ad64a53
fix style issues
loganhart02 Nov 28, 2022
6cdfab4
Fix bugs
loganhart02 Nov 29, 2022
cb5e24f
Fix bugs
loganhart02 Nov 29, 2022
eb9be14
Add unittests
loganhart02 Nov 29, 2022
34b8bf8
make style
loganhart02 Nov 29, 2022
5b30274
fix formatter bug
loganhart02 Nov 29, 2022
c426f49
fix test
loganhart02 Nov 30, 2022
340349c
Add pyworld compute pitch func
loganhart02 Dec 1, 2022
299c2da
Update requirements.txt
loganhart02 Dec 1, 2022
11c6b80
Fix dataset bug
loganhart02 Dec 1, 2022
b1b5633
Change layer norm to instance norm
loganhart02 Dec 1, 2022
0248b7f
Add missing import
loganhart02 Dec 1, 2022
92f2464
Remove emotions.py
loganhart02 Dec 21, 2022
7f0d890
remove ssim loss
loganhart02 Dec 21, 2022
8cffece
Add init layers func to aligner
loganhart02 Dec 21, 2022
f9c80a6
refactor model layers
loganhart02 Dec 21, 2022
658bd79
remove audio_config arg
loganhart02 Dec 21, 2022
759df28
Rename loss func
loganhart02 Dec 21, 2022
cd03d67
Rename to delightful-tts
loganhart02 Dec 21, 2022
0dd3aef
Rename loss func
loganhart02 Dec 21, 2022
7b934e4
Remove unused modules
loganhart02 Dec 21, 2022
6160cd2
refactor imports
loganhart02 Dec 21, 2022
378370a
replace audio config with audio processor
loganhart02 Dec 21, 2022
ced8f34
Add change sample rate option
loganhart02 Feb 7, 2023
7a8b825
remove broken resample func
loganhart02 Feb 14, 2023
156557c
update recipe
loganhart02 Feb 15, 2023
cfece08
fix style, add config docs
May 15, 2023
21dad7a
fix tests and multispeaker embd dim
May 18, 2023
03007a5
remove pyworld
Jun 12, 2023
a026cfc
Make style and fix inference
erogol Jul 6, 2023
c49a418
Split tts tests
erogol Jul 7, 2023
96841c6
Fixup
erogol Jul 7, 2023
09a2424
Fixup
erogol Jul 17, 2023
4a287c1
Fixup
erogol Jul 17, 2023
2abb754
Add argument names
erogol Jul 24, 2023
1362cb1
Set "random" speaker in the model Tortoise/Bark
erogol Jul 24, 2023
1fe6a53
Use a diff f0_cache path for delightful tts
erogol Jul 24, 2023
6349950
Fix delightful speaker handling
erogol Jul 24, 2023
dd093e0
Fix lint
erogol Jul 24, 2023
3fde149
Make style
erogol Jul 24, 2023
b5bf9e6
Merge branch 'dev' into delightful-tts
erogol Jul 24, 2023
53 changes: 53 additions & 0 deletions .github/workflows/tts_tests2.yml
@@ -0,0 +1,53 @@
name: tts-tests2

on:
push:
branches:
- main
pull_request:
types: [opened, synchronize, reopened]
jobs:
check_skip:
runs-on: ubuntu-latest
if: "! contains(github.event.head_commit.message, '[ci skip]')"
steps:
- run: echo "${{ github.event.head_commit.message }}"

test:
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
python-version: [3.9, "3.10", "3.11"]
experimental: [false]
steps:
- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
architecture: x64
cache: 'pip'
cache-dependency-path: 'requirements*'
- name: check OS
run: cat /etc/os-release
- name: set ENV
run: export TRAINER_TELEMETRY=0
- name: Install dependencies
run: |
sudo apt-get update
sudo apt-get install -y --no-install-recommends git make gcc
sudo apt-get install espeak
sudo apt-get install espeak-ng
make system-deps
- name: Install/upgrade Python setup deps
run: python3 -m pip install --upgrade pip setuptools wheel
- name: Replace scarf urls
run: |
sed -i 's/https:\/\/coqui.gateway.scarf.sh\//https:\/\/github.com\/coqui-ai\/TTS\/releases\/download\//g' TTS/.models.json
- name: Install TTS
run: |
python3 -m pip install .[all]
python3 setup.py egg_info
- name: Unit tests
run: make test_tts2
3 changes: 3 additions & 0 deletions Makefile
@@ -19,6 +19,9 @@ test_vocoder: ## run vocoder tests.
test_tts: ## run tts tests.
nose2 -F -v -B --with-coverage --coverage TTS tests.tts_tests

test_tts2: ## run tts tests.
nose2 -F -v -B --with-coverage --coverage TTS tests.tts_tests2

test_aux: ## run aux tests.
nose2 -F -v -B --with-coverage --coverage TTS tests.aux_tests
./run_bash_tests.sh
6 changes: 3 additions & 3 deletions TTS/bin/synthesize.py
@@ -430,9 +430,9 @@ def main():
if tts_path is not None:
wav = synthesizer.tts(
args.text,
args.speaker_idx,
args.language_idx,
args.speaker_wav,
speaker_name=args.speaker_idx,
language_name=args.language_idx,
speaker_wav=args.speaker_wav,
reference_wav=args.reference_wav,
style_wav=args.capacitron_style_wav,
style_text=args.capacitron_style_text,
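The `synthesize.py` change above switches the `synthesizer.tts(...)` call from positional to keyword arguments. A minimal illustrative sketch (not the actual TTS API; function names here are made up) of why that matters when a signature evolves:

```python
# Hypothetical v1 signature a caller was written against.
def tts_v1(text, speaker, language):
    return f"{text}|{speaker}|{language}"

# A refactor inserts a new parameter before `speaker`.
def tts_v2(text, emotion=None, speaker=None, language=None):
    return f"{text}|{speaker}|{language}"

# The old positional call now silently misbinds: "alice" lands in `emotion`.
positional = tts_v2("hello", "alice", "en")   # -> "hello|en|None"

# A keyword call keeps its meaning regardless of parameter order.
keyword = tts_v2("hello", speaker="alice", language="en")  # -> "hello|alice|en"
```

Keyword arguments turn what would be a silent misbinding into either correct behavior or an immediate `TypeError`, which is presumably why the call site was updated here.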
170 changes: 170 additions & 0 deletions TTS/tts/configs/delightful_tts_config.py
@@ -0,0 +1,170 @@
from dataclasses import dataclass, field
from typing import List

from TTS.tts.configs.shared_configs import BaseTTSConfig
from TTS.tts.models.delightful_tts import DelightfulTtsArgs, DelightfulTtsAudioConfig, VocoderConfig


@dataclass
class DelightfulTTSConfig(BaseTTSConfig):
"""
Configuration class for the DelightfulTTS model.

Attributes:
model (str): Name of the model ("delightful_tts").
audio (DelightfulTtsAudioConfig): Configuration for audio settings.
model_args (DelightfulTtsArgs): Configuration for model arguments.
use_attn_priors (bool): Whether to use attention priors.
vocoder (VocoderConfig): Configuration for the vocoder.
init_discriminator (bool): Whether to initialize the discriminator.
steps_to_start_discriminator (int): Number of steps to start the discriminator.
grad_clip (List[float]): Gradient clipping values.
lr_gen (float): Learning rate for the GAN generator.
lr_disc (float): Learning rate for the GAN discriminator.
lr_scheduler_gen (str): Name of the learning rate scheduler for the generator.
lr_scheduler_gen_params (dict): Parameters for the learning rate scheduler for the generator.
lr_scheduler_disc (str): Name of the learning rate scheduler for the discriminator.
lr_scheduler_disc_params (dict): Parameters for the learning rate scheduler for the discriminator.
scheduler_after_epoch (bool): Whether to schedule after each epoch.
optimizer (str): Name of the optimizer.
optimizer_params (dict): Parameters for the optimizer.
ssim_loss_alpha (float): Alpha value for the SSIM loss.
mel_loss_alpha (float): Alpha value for the mel loss.
aligner_loss_alpha (float): Alpha value for the aligner loss.
pitch_loss_alpha (float): Alpha value for the pitch loss.
energy_loss_alpha (float): Alpha value for the energy loss.
u_prosody_loss_alpha (float): Alpha value for the utterance prosody loss.
p_prosody_loss_alpha (float): Alpha value for the phoneme prosody loss.
dur_loss_alpha (float): Alpha value for the duration loss.
char_dur_loss_alpha (float): Alpha value for the character duration loss.
binary_align_loss_alpha (float): Alpha value for the binary alignment loss.
binary_loss_warmup_epochs (int): Number of warm-up epochs for the binary loss.
disc_loss_alpha (float): Alpha value for the discriminator loss.
gen_loss_alpha (float): Alpha value for the generator loss.
feat_loss_alpha (float): Alpha value for the feature loss.
vocoder_mel_loss_alpha (float): Alpha value for the vocoder mel loss.
multi_scale_stft_loss_alpha (float): Alpha value for the multi-scale STFT loss.
multi_scale_stft_loss_params (dict): Parameters for the multi-scale STFT loss.
return_wav (bool): Whether to return audio waveforms.
use_weighted_sampler (bool): Whether to use a weighted sampler.
weighted_sampler_attrs (dict): Attributes for the weighted sampler.
weighted_sampler_multipliers (dict): Multipliers for the weighted sampler.
r (int): Value for the `r` override.
compute_f0 (bool): Whether to compute F0 values.
f0_cache_path (str): Path to the F0 cache.
attn_prior_cache_path (str): Path to the attention prior cache.
num_speakers (int): Number of speakers.
use_speaker_embedding (bool): Whether to use speaker embedding.
speakers_file (str): Path to the speaker file.
speaker_embedding_channels (int): Number of channels for the speaker embedding.
language_ids_file (str): Path to the language IDs file.
"""

Review comment (Member): You can consider typing docstrings for the config arguments. It'd help you understand the architecture better.

model: str = "delightful_tts"

# model specific params
audio: DelightfulTtsAudioConfig = field(default_factory=DelightfulTtsAudioConfig)
model_args: DelightfulTtsArgs = field(default_factory=DelightfulTtsArgs)
use_attn_priors: bool = True

# vocoder
vocoder: VocoderConfig = field(default_factory=VocoderConfig)
init_discriminator: bool = True

# optimizer
steps_to_start_discriminator: int = 200000
grad_clip: List[float] = field(default_factory=lambda: [1000, 1000])
lr_gen: float = 0.0002
lr_disc: float = 0.0002
lr_scheduler_gen: str = "ExponentialLR"
lr_scheduler_gen_params: dict = field(default_factory=lambda: {"gamma": 0.999875, "last_epoch": -1})
lr_scheduler_disc: str = "ExponentialLR"
lr_scheduler_disc_params: dict = field(default_factory=lambda: {"gamma": 0.999875, "last_epoch": -1})
scheduler_after_epoch: bool = True
optimizer: str = "AdamW"
optimizer_params: dict = field(default_factory=lambda: {"betas": [0.8, 0.99], "eps": 1e-9, "weight_decay": 0.01})

# acoustic model loss params
ssim_loss_alpha: float = 1.0
mel_loss_alpha: float = 1.0
aligner_loss_alpha: float = 1.0
pitch_loss_alpha: float = 1.0
energy_loss_alpha: float = 1.0
u_prosody_loss_alpha: float = 0.5
p_prosody_loss_alpha: float = 0.5
dur_loss_alpha: float = 1.0
char_dur_loss_alpha: float = 0.01
binary_align_loss_alpha: float = 0.1
binary_loss_warmup_epochs: int = 10

# vocoder loss params
disc_loss_alpha: float = 1.0
gen_loss_alpha: float = 1.0
feat_loss_alpha: float = 1.0
vocoder_mel_loss_alpha: float = 10.0
multi_scale_stft_loss_alpha: float = 2.5
multi_scale_stft_loss_params: dict = field(
default_factory=lambda: {
"n_ffts": [1024, 2048, 512],
"hop_lengths": [120, 240, 50],
"win_lengths": [600, 1200, 240],
}
)

# data loader params
return_wav: bool = True
use_weighted_sampler: bool = False
weighted_sampler_attrs: dict = field(default_factory=lambda: {})
weighted_sampler_multipliers: dict = field(default_factory=lambda: {})

# overrides
r: int = 1

# dataset configs
compute_f0: bool = True
f0_cache_path: str = None
attn_prior_cache_path: str = None

# multi-speaker settings
# use speaker embedding layer
num_speakers: int = 0
use_speaker_embedding: bool = False
speakers_file: str = None
speaker_embedding_channels: int = 256
language_ids_file: str = None
use_language_embedding: bool = False

# use d-vectors
use_d_vector_file: bool = False
d_vector_file: str = None
d_vector_dim: int = None

# testing
test_sentences: List[str] = field(
default_factory=lambda: [
"It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
"Be a voice, not an echo.",
"I'm sorry Dave. I'm afraid I can't do that.",
"This cake is great. It's so delicious and moist.",
"Prior to November 22, 1963.",
]
)

def __post_init__(self):
# Pass multi-speaker parameters to the model args as `model.init_multispeaker()` looks for it there.
if self.num_speakers > 0:
self.model_args.num_speakers = self.num_speakers

# speaker embedding settings
if self.use_speaker_embedding:
self.model_args.use_speaker_embedding = True
if self.speakers_file:
self.model_args.speakers_file = self.speakers_file

# d-vector settings
if self.use_d_vector_file:
self.model_args.use_d_vector_file = True
if self.d_vector_dim is not None and self.d_vector_dim > 0:
self.model_args.d_vector_dim = self.d_vector_dim
if self.d_vector_file:
self.model_args.d_vector_file = self.d_vector_file
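The `__post_init__` above copies top-level multi-speaker settings into `model_args`, because `model.init_multispeaker()` reads them from there. A minimal stand-in sketch of that propagation pattern — the toy dataclasses below only mirror the relevant fields and are not the real `DelightfulTTSConfig`/`DelightfulTtsArgs`:

```python
from dataclasses import dataclass, field


@dataclass
class ToyModelArgs:
    num_speakers: int = 0
    use_speaker_embedding: bool = False
    speakers_file: str = None


@dataclass
class ToyConfig:
    model_args: ToyModelArgs = field(default_factory=ToyModelArgs)
    num_speakers: int = 0
    use_speaker_embedding: bool = False
    speakers_file: str = None

    def __post_init__(self):
        # Mirror top-level multi-speaker settings into model_args, since
        # the model's init code looks for them there, not on the config.
        if self.num_speakers > 0:
            self.model_args.num_speakers = self.num_speakers
        if self.use_speaker_embedding:
            self.model_args.use_speaker_embedding = True
            if self.speakers_file:
                self.model_args.speakers_file = self.speakers_file


cfg = ToyConfig(num_speakers=4, use_speaker_embedding=True, speakers_file="speakers.json")
# cfg.model_args.num_speakers -> 4; cfg.model_args.speakers_file -> "speakers.json"
```

Note that `field(default_factory=...)` is required for the mutable `model_args` default; a bare `ToyModelArgs()` default would be shared across instances and is rejected by `dataclasses`.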
1 change: 1 addition & 0 deletions TTS/tts/datasets/dataset.py
@@ -686,6 +686,7 @@ def __init__(
self,
samples: Union[List[List], List[Dict]],
ap: "AudioProcessor",
audio_config=None, # pylint: disable=unused-argument
verbose=False,
cache_path: str = None,
precompute_num_workers=0,
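The one-line dataset change above adds an `audio_config` parameter that is accepted but deliberately unused (hence the `pylint` suppression), so call sites that still pass it keep working. A sketch of that backward-compatibility pattern, with illustrative names:

```python
class ToyDataset:
    def __init__(self, samples, ap, audio_config=None, verbose=False):  # pylint: disable=unused-argument
        # `audio_config` is accepted only so older call sites that pass it
        # don't raise a TypeError; the audio processor `ap` is used instead.
        self.samples = samples
        self.ap = ap
        self.verbose = verbose


# Both call styles construct successfully:
d1 = ToyDataset([1, 2], ap="processor")
d2 = ToyDataset([1, 2], ap="processor", audio_config={"sample_rate": 22050})
```

The alternative of removing the parameter outright would break every caller that still passes `audio_config`; keeping it as an ignored keyword defers that cleanup.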
Empty file.