This repository has been archived by the owner on Feb 12, 2022. It is now read-only.

Is the training data available? #6

Open
dirk61 opened this issue May 10, 2020 · 5 comments

Comments


dirk61 commented May 10, 2020

Hey! I really love your work, and I'm wondering whether you could provide the training data you synthesized from the LibriSpeech clean-360 dataset? That would help a lot!

Owner

faroit commented May 11, 2020

Hi @dirk61, thanks for your interest. Unfortunately, the training data is not available, but I am happy to provide further information if needed.

Author

dirk61 commented May 16, 2020

Thanks, @faroit. For the data accessed from the clean-360 dataset, did you simply add up these audios together according to the value of k? Before the transformation to the time-frequency matrix, what else did you do to form the wav files for training?
I noticed you mentioned peak normalization in the article. I'm new to the audio-processing field and I wonder how that works. What value is set to be the maximum value so that peaks above it are normalized?

Owner

faroit commented May 18, 2020

> For the data accessed from the clean-360 dataset, did you simply add up these audios together according to the value of k?

No, for the time-domain signals there were a few important steps involved in sampling the data:

1. We randomly chose k speakers from the dataset. Each speaker has 5 utterances.
2. These were trimmed using a voice activity detection method (I used this one) so that silence at the beginning and the end of the utterances is removed.
3. The utterances were appended in random order.
4. The concatenated utterances were padded with zeros at the end so that all speakers have the same recording length.
5. Finally, the resulting tracks were mixed to get the final output.

Example:
    A..C = Speaker Id
    1..3 = Utterance Id

    Before padding:
        track1: |---A3---||--A2--||-----A1-----|
        track2: |---B2---||-B1-||--B3--|
        track3: |-------C1------||-C3-||C2|

    After padding:
        track1: |---A3---||--A2--||--A1|
        track2: |---B2---||-B1-||--B3--|
        track3: |-------C1------||-C3-||

  frame count:  |333333333333333333333333
        k: 3
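The steps listed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used for the paper: the utterances are assumed to be already trimmed by some voice activity detector (trimming is not shown), the helper name `build_tracks` and the noise "utterances" are hypothetical stand-ins, and the final mix here is a plain sum without any level normalization.

```python
import numpy as np


def build_tracks(speaker_utterances, rng):
    """Concatenate each speaker's utterances in random order, then zero-pad.

    speaker_utterances: one list of 1-D arrays per speaker (assumed to be
    already trimmed of leading/trailing silence).
    Returns an array of shape (n_speakers, n_samples).
    """
    tracks = []
    for utterances in speaker_utterances:
        order = rng.permutation(len(utterances))  # random utterance order
        tracks.append(np.concatenate([utterances[i] for i in order]))
    longest = max(len(track) for track in tracks)
    # Pad with zeros at the end so every track has the same length.
    return np.stack([np.pad(track, (0, longest - len(track))) for track in tracks])


rng = np.random.default_rng(0)

# Hypothetical pool of 4 speakers with 3 short noise "utterances" each.
pool_lengths = [(9, 7, 13), (9, 6, 8), (16, 5, 4), (3, 3, 3)]
pool = [[rng.standard_normal(n) for n in lengths] for lengths in pool_lengths]

k = 3
chosen = rng.choice(len(pool), size=k, replace=False)  # k distinct speakers
tracks = build_tracks([pool[i] for i in chosen], rng)
mix = tracks.sum(axis=0)  # unnormalized mixture of the k tracks
```

Each row of `tracks` corresponds to one line of the diagram, and the zeros at the end of the shorter rows are the padding.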

> Before the transformation to the time-frequency matrix, what else did you do to form the wav files for training?

The mixing was applied by normalizing each track to have the same SNR relative to each other, and the final mix was peak normalized.
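One way to read "the same SNR to each other" is that every track is scaled to equal power before summing, so that any pair of tracks sits at 0 dB relative to each other. The sketch below follows that reading; it is an assumption, not a confirmed description of the original pipeline, and `mix_equal_power` is a hypothetical helper name. Power is measured here over the whole track, padding included, which is also an assumption.

```python
import numpy as np


def mix_equal_power(tracks, eps=1e-12):
    """Scale every track to unit RMS, then sum the scaled tracks.

    tracks: array of shape (n_speakers, n_samples).
    Returns (mix, scaled_tracks).
    """
    tracks = np.asarray(tracks, dtype=np.float64)
    # Root-mean-square level of each track, kept as a column for broadcasting.
    rms = np.sqrt(np.mean(tracks ** 2, axis=1, keepdims=True))
    scaled = tracks / np.maximum(rms, eps)  # eps guards against silent tracks
    return scaled.sum(axis=0), scaled


example_tracks = np.array([
    [1.0, -1.0, 1.0, -1.0],  # loud track
    [0.1, 0.1, 0.0, 0.0],    # quiet track
])
equal_mix, scaled_tracks = mix_equal_power(example_tracks)
```

After scaling, both rows of `scaled_tracks` have the same RMS level, so neither speaker dominates the mixture purely because of recording level.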

> I noticed you mentioned peak normalization in the article. I'm new to the audio-processing field and I wonder how that works. What value is set to be the maximum value so that peaks above it are normalized?

yes out /= np.max(out, axis=0) does it ;-)
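To spell out the question about a "maximum value": peak normalization does not clip anything above a threshold. The whole signal is divided by its own largest sample, so the peak lands at 1 and everything else scales with it. The sketch below is a common variant that uses the largest absolute sample, which also covers a signal whose biggest excursion is negative; the one-liner above divides by the largest positive sample instead. The helper name `peak_normalize` is hypothetical.

```python
import numpy as np


def peak_normalize(signal, eps=1e-12):
    """Scale a signal so that its largest absolute sample becomes 1."""
    signal = np.asarray(signal, dtype=np.float64)
    peak = np.max(np.abs(signal), axis=0)
    return signal / np.maximum(peak, eps)  # eps avoids dividing by zero on silence


example = np.array([0.2, -0.8, 0.4])
normalized = peak_normalize(example)
# The largest excursion (-0.8) is mapped to -1.0; the rest scale with it.
```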

Author

dirk61 commented May 22, 2020

Thanks for the detailed explanation! Awesome :)

> Each speaker has 5 utterances.

Are these utterances randomly chosen from the clean-360 flac audio files? I noticed in the article you mentioned the mixtures for training all last 10 seconds. The concatenated utterances after padding may last longer than 10s, so just chop it to 10s?

There's another thing in the article I can't quite understand, which is:

> In fact, our method to generate synthetic samples results in an average overlap for k = 2 of 85% and for k = 10 of 55% (based on 5s segments).

If randomly chosen, why do these possibilities occur?

Owner

faroit commented Jan 8, 2021

@dirk61 sorry for the late reply (feel free to close):

> Are these utterances randomly chosen from the clean-360 flac audio files? I noticed in the article you mentioned the mixtures for training all last 10 seconds. The concatenated utterances after padding may last longer than 10s, so just chop it to 10s?

Yes, they were chopped.
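Chopping to a fixed duration could look like the sketch below. LibriSpeech audio is sampled at 16 kHz, so 10 seconds correspond to 160000 samples. The thread does not say whether the excerpt was taken from the start or from a random offset; this sketch keeps the first 10 seconds, which is an assumption, and `crop_to_duration` is a hypothetical helper name.

```python
import numpy as np

SAMPLE_RATE = 16000  # LibriSpeech sampling rate in Hz


def crop_to_duration(mix, seconds=10.0, sample_rate=SAMPLE_RATE):
    """Keep at most the first `seconds` of a 1-D signal."""
    n_samples = int(seconds * sample_rate)
    return mix[:n_samples]


long_mix = np.zeros(200000)   # 12.5 s of placeholder audio
short_mix = np.zeros(80000)   # 5 s of placeholder audio
cropped_long = crop_to_duration(long_mix)
cropped_short = crop_to_duration(short_mix)  # shorter input is returned unchanged
```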

> If randomly chosen, why do these possibilities occur?

Not sure if I understand correctly: ideally the overlap should be 100% for all k, but since speakers still make pauses between words, the actual overlap is less than that.
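The effect of pauses on overlap can be illustrated with a small sketch. It marks a frame as active for a speaker when its mean energy exceeds a threshold, counts the active speakers per frame, and reports the fraction of frames with at least two active speakers. Both the energy-based activity detector and this particular definition of overlap are assumptions made for illustration; the definition used in the article may differ.

```python
import numpy as np


def frame_activity(track, frame_len, threshold=1e-4):
    """Boolean mask per frame: True where the frame's mean energy exceeds threshold."""
    n_frames = len(track) // frame_len
    frames = np.reshape(track[: n_frames * frame_len], (n_frames, frame_len))
    return np.mean(frames ** 2, axis=1) > threshold


def overlap_ratio(tracks, frame_len, threshold=1e-4):
    """Return (fraction of frames with >= 2 active speakers, active count per frame)."""
    activity = np.stack([frame_activity(t, frame_len, threshold) for t in tracks])
    active_count = activity.sum(axis=0)
    return float(np.mean(active_count >= 2)), active_count


# Two toy tracks; the zeros play the role of pauses between words.
speaker_a = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
speaker_b = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0])
ratio, counts = overlap_ratio([speaker_a, speaker_b], frame_len=2)
# counts is [2, 1, 1, 2]: both speakers overlap in only half of the frames.
```

Even though both toy tracks span the full duration, the pauses pull the measured overlap below 100%, which is the effect described in the reply above.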
