diff --git a/docs/source/_static/sherpa-onnx-vits-zh-ll/0-value-2x.wav b/docs/source/_static/sherpa-onnx-vits-zh-ll/0-value-2x.wav
new file mode 100644
index 000000000..0e17d2946
Binary files /dev/null and b/docs/source/_static/sherpa-onnx-vits-zh-ll/0-value-2x.wav differ
diff --git a/docs/source/_static/sherpa-onnx-vits-zh-ll/1-numbers.wav b/docs/source/_static/sherpa-onnx-vits-zh-ll/1-numbers.wav
new file mode 100644
index 000000000..fe8807949
Binary files /dev/null and b/docs/source/_static/sherpa-onnx-vits-zh-ll/1-numbers.wav differ
diff --git a/docs/source/_static/sherpa-onnx-vits-zh-ll/2-numbers.wav b/docs/source/_static/sherpa-onnx-vits-zh-ll/2-numbers.wav
new file mode 100644
index 000000000..a21e85494
Binary files /dev/null and b/docs/source/_static/sherpa-onnx-vits-zh-ll/2-numbers.wav differ
diff --git a/docs/source/_static/sherpa-onnx-vits-zh-ll/3-wo-mi.wav b/docs/source/_static/sherpa-onnx-vits-zh-ll/3-wo-mi.wav
new file mode 100644
index 000000000..cb8147897
Binary files /dev/null and b/docs/source/_static/sherpa-onnx-vits-zh-ll/3-wo-mi.wav differ
diff --git a/docs/source/_static/sherpa-onnx-vits-zh-ll/4-heteronym.wav b/docs/source/_static/sherpa-onnx-vits-zh-ll/4-heteronym.wav
new file mode 100644
index 000000000..542d33eaa
Binary files /dev/null and b/docs/source/_static/sherpa-onnx-vits-zh-ll/4-heteronym.wav differ
diff --git a/docs/source/_static/vits-piper-glados/glados-bug.wav b/docs/source/_static/vits-piper-glados/glados-bug.wav
new file mode 100644
index 000000000..6c7453b32
Binary files /dev/null and b/docs/source/_static/vits-piper-glados/glados-bug.wav differ
diff --git a/docs/source/_static/vits-piper-glados/glados-code.wav b/docs/source/_static/vits-piper-glados/glados-code.wav
new file mode 100644
index 000000000..c667542b9
Binary files /dev/null and b/docs/source/_static/vits-piper-glados/glados-code.wav differ
diff --git a/docs/source/_static/vits-piper-glados/glados-liliana.wav b/docs/source/_static/vits-piper-glados/glados-liliana.wav
new file mode 100644
index 000000000..d3308c8af
Binary files /dev/null and b/docs/source/_static/vits-piper-glados/glados-liliana.wav differ
diff --git a/docs/source/_static/vits-piper-glados/glados-men.wav b/docs/source/_static/vits-piper-glados/glados-men.wav
new file mode 100644
index 000000000..11c89b203
Binary files /dev/null and b/docs/source/_static/vits-piper-glados/glados-men.wav differ
diff --git a/docs/source/_static/vits-piper-glados/glados-ship.wav b/docs/source/_static/vits-piper-glados/glados-ship.wav
new file mode 100644
index 000000000..d70c94075
Binary files /dev/null and b/docs/source/_static/vits-piper-glados/glados-ship.wav differ
diff --git a/docs/source/onnx/tts/pretrained_models/vits.rst b/docs/source/onnx/tts/pretrained_models/vits.rst
index 30afe5864..00d4435d9 100644
--- a/docs/source/onnx/tts/pretrained_models/vits.rst
+++ b/docs/source/onnx/tts/pretrained_models/vits.rst
@@ -44,6 +44,18 @@ The following table summarizes the information of all models in this page.
      - `LibriTTS-R`_
      - 75
      - 22050
+   * - :ref:`vits-piper-en_US-glados`
+     - English
+     - 1
+     - N/A
+     - 61
+     - 22050
+   * - :ref:`sherpa-onnx-vits-zh-ll`
+     - Chinese
+     - 5
+     - N/A
+     - 115
+     - 16000
    * - :ref:`vits-zh-hf-fanchen-C`
      - Chinese
      - 187
@@ -102,7 +114,9 @@ This model is converted from ``_
+for how to add new words.

 .. hint::

@@ -171,7 +185,7 @@ Generate speech with executable compiled from C++
      --vits-model=./vits-melo-tts-zh_en/model.onnx \
      --vits-lexicon=./vits-melo-tts-zh_en/lexicon.txt \
      --vits-tokens=./vits-melo-tts-zh_en/tokens.txt \
-     --tts-rule-fsts='./vits-melo-tts-zh_en/date.fst,./vits-melo-tts-zh_en/number.fst' \
+     --tts-rule-fsts="./vits-melo-tts-zh_en/date.fst,./vits-melo-tts-zh_en/number.fst" \
      --vits-dict-dir=./vits-melo-tts-zh_en/dict \
      --output-filename=./zh-en-2.wav \
      "Are you ok 是雷军2015年4月小米在印度举行新品发布会时说的。他还说过 I am very happy to be in China.雷军事后在微博上表示「万万没想到,视频火速传到国内,全国人民都笑了」、「现在国际米粉越来越多,我的确应该把英文学好,不让大家失望!加油!」"
@@ -314,6 +328,243 @@ After running, it will generate a file ``zh-en-3.wav`` in the current directory.
+.. _vits-piper-en_US-glados:
+
+vits-piper-en_US-glados (English, 1 speaker)
+--------------------------------------------
+
+This model is converted from ``_
+and it supports only English.
+
+See also ``_ .
+
+If you are interested in how the model is converted to `sherpa-onnx`_, please see
+the following colab notebook:
+
+  ``_
+
+In the following, we describe how to download it and use it with `sherpa-onnx`_.
+
+Download the model
+~~~~~~~~~~~~~~~~~~
+
+Please use the following commands to download it.
+
+.. code-block:: bash
+
+   cd /path/to/sherpa-onnx
+
+   wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/vits-piper-en_US-glados.tar.bz2
+   tar xvf vits-piper-en_US-glados.tar.bz2
+   rm vits-piper-en_US-glados.tar.bz2
+
+Please check that the file sizes of the pre-trained models are correct. See
+the file sizes of ``*.onnx`` files below.
+
+.. code-block:: bash
+
+   ls -lh vits-piper-en_US-glados/
+
+   -rw-r--r-- 1 fangjun staff 242B Dec 13 2023 README.md
+   -rw-r--r-- 1 fangjun staff 61M Dec 13 2023 en_US-glados.onnx
+   drwxr-xr-x 122 fangjun staff 3.8K Dec 13 2023 espeak-ng-data
+   -rw-r--r-- 1 fangjun staff 940B Dec 13 2023 tokens.txt
+
+Generate speech with executable compiled from C++
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: bash
+
+   cd /path/to/sherpa-onnx
+
+   ./build/bin/sherpa-onnx-offline-tts \
+     --vits-model=./vits-piper-en_US-glados/en_US-glados.onnx\
+     --vits-tokens=./vits-piper-en_US-glados/tokens.txt \
+     --vits-data-dir=./vits-piper-en_US-glados/espeak-ng-data \
+     --output-filename=./glados-liliana.wav \
+     "liliana, the most beautiful and lovely assistant of our team!"
+
+   ./build/bin/sherpa-onnx-offline-tts \
+     --vits-model=./vits-piper-en_US-glados/en_US-glados.onnx\
+     --vits-tokens=./vits-piper-en_US-glados/tokens.txt \
+     --vits-data-dir=./vits-piper-en_US-glados/espeak-ng-data \
+     --output-filename=./glados-code.wav \
+     "Talk is cheap. Show me the code."
+
+   ./build/bin/sherpa-onnx-offline-tts \
+     --vits-model=./vits-piper-en_US-glados/en_US-glados.onnx\
+     --vits-tokens=./vits-piper-en_US-glados/tokens.txt \
+     --vits-data-dir=./vits-piper-en_US-glados/espeak-ng-data \
+     --output-filename=./glados-men.wav \
+     "Today as always, men fall into two groups: slaves and free men. Whoever does not have two-thirds of his day for himself, is a slave, whatever he may be: a statesman, a businessman, an official, or a scholar."
+
+After running, it will generate 3 files ``glados-liliana.wav``,
+``glados-code.wav``, and ``glados-men.wav`` in the current directory.
+
+.. code-block:: bash
+
+   soxi glados*.wav
+
+   Input File     : 'glados-code.wav'
+   Channels       : 1
+   Sample Rate    : 22050
+   Precision      : 16-bit
+   Duration       : 00:00:02.18 = 48128 samples ~ 163.701 CDDA sectors
+   File Size      : 96.3k
+   Bit Rate       : 353k
+   Sample Encoding: 16-bit Signed Integer PCM
+
+
+   Input File     : 'glados-liliana.wav'
+   Channels       : 1
+   Sample Rate    : 22050
+   Precision      : 16-bit
+   Duration       : 00:00:03.97 = 87552 samples ~ 297.796 CDDA sectors
+   File Size      : 175k
+   Bit Rate       : 353k
+   Sample Encoding: 16-bit Signed Integer PCM
+
+
+   Input File     : 'glados-men.wav'
+   Channels       : 1
+   Sample Rate    : 22050
+   Precision      : 16-bit
+   Duration       : 00:00:15.31 = 337664 samples ~ 1148.52 CDDA sectors
+   File Size      : 675k
+   Bit Rate       : 353k
+   Sample Encoding: 16-bit Signed Integer PCM
+
+   Total Duration of 3 files: 00:00:21.47
+
+.. raw:: html
+
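+  <table>
+    <tr>
+      <th>Wave filename</th>
+      <th>Content</th>
+      <th>Text</th>
+    </tr>
+    <tr>
+      <td>glados-liliana.wav</td>
+      <td>(audio player)</td>
+      <td>liliana, the most beautiful and lovely assistant of our team!</td>
+    </tr>
+    <tr>
+      <td>glados-code.wav</td>
+      <td>(audio player)</td>
+      <td>Talk is cheap. Show me the code.</td>
+    </tr>
+    <tr>
+      <td>glados-men.wav</td>
+      <td>(audio player)</td>
+      <td>Today as always, men fall into two groups: slaves and free men. Whoever does not have two-thirds of his day for himself, is a slave, whatever he may be: a statesman, a businessman, an official, or a scholar.</td>
+    </tr>
+  </table>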
+
+Generate speech with Python script
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: bash
+
+   cd /path/to/sherpa-onnx
+
+   python3 ./python-api-examples/offline-tts.py \
+     --vits-model=./vits-piper-en_US-glados/en_US-glados.onnx\
+     --vits-tokens=./vits-piper-en_US-glados/tokens.txt \
+     --vits-data-dir=./vits-piper-en_US-glados/espeak-ng-data \
+     --output-filename=./glados-ship.wav \
+     "A ship in port is safe, but that's not what ships are built for."
+
+   python3 ./python-api-examples/offline-tts.py \
+     --vits-model=./vits-piper-en_US-glados/en_US-glados.onnx\
+     --vits-tokens=./vits-piper-en_US-glados/tokens.txt \
+     --vits-data-dir=./vits-piper-en_US-glados/espeak-ng-data \
+     --output-filename=./glados-bug.wav \
+     "Given enough eyeballs, all bugs are shallow."
+
+After running, it will generate two files ``glados-ship.wav``
+and ``glados-bug.wav`` in the current directory.
+
+.. code-block:: bash
+
+   soxi ./glados-{ship,bug}.wav
+
+   Input File     : './glados-ship.wav'
+   Channels       : 1
+   Sample Rate    : 22050
+   Precision      : 16-bit
+   Duration       : 00:00:03.74 = 82432 samples ~ 280.381 CDDA sectors
+   File Size      : 165k
+   Bit Rate       : 353k
+   Sample Encoding: 16-bit Signed Integer PCM
+
+
+   Input File     : './glados-bug.wav'
+   Channels       : 1
+   Sample Rate    : 22050
+   Precision      : 16-bit
+   Duration       : 00:00:02.67 = 58880 samples ~ 200.272 CDDA sectors
+   File Size      : 118k
+   Bit Rate       : 353k
+   Sample Encoding: 16-bit Signed Integer PCM
+
+   Total Duration of 2 files: 00:00:06.41
+
+.. raw:: html
+
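+  <table>
+    <tr>
+      <th>Wave filename</th>
+      <th>Content</th>
+      <th>Text</th>
+    </tr>
+    <tr>
+      <td>glados-ship.wav</td>
+      <td>(audio player)</td>
+      <td>A ship in port is safe, but that's not what ships are built for.</td>
+    </tr>
+    <tr>
+      <td>glados-bug.wav</td>
+      <td>(audio player)</td>
+      <td>Given enough eyeballs, all bugs are shallow.</td>
+    </tr>
+  </table>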
+
 .. _vits-piper-en_US-libritts_r-medium:

 vits-piper-en_US-libritts_r-medium (English, 904 speakers)
@@ -365,7 +616,7 @@ Generate speech with executable compiled from C++
      --vits-data-dir=./vits-piper-en_US-libritts_r-medium/espeak-ng-data \
      --output-filename=./libritts-liliana-109.wav \
      --sid=109 \
-     'liliana, the most beautiful and lovely assistant of our team!'
+     "liliana, the most beautiful and lovely assistant of our team!"

    ./build/bin/sherpa-onnx-offline-tts \
      --vits-model=./vits-piper-en_US-libritts_r-medium/en_US-libritts_r-medium.onnx \
@@ -373,7 +624,7 @@ Generate speech with executable compiled from C++
      --vits-data-dir=./vits-piper-en_US-libritts_r-medium/espeak-ng-data \
      --output-filename=./libritts-liliana-900.wav \
      --sid=900 \
-     'liliana, the most beautiful and lovely assistant of our team!'
+     "liliana, the most beautiful and lovely assistant of our team!"

 After running, it will generate two files ``libritts-liliana-109.wav`` and ``libritts-liliana-900.wav`` in the current directory.
@@ -571,7 +822,7 @@ Generate speech with executable compiled from C++
      --vits-lexicon=./vits-ljs/lexicon.txt \
      --vits-tokens=./vits-ljs/tokens.txt \
      --output-filename=./liliana.wav \
-     'liliana, the most beautiful and lovely assistant of our team!'
+     "liliana, the most beautiful and lovely assistant of our team!"

 After running, it will generate a file ``liliana.wav`` in the current directory.

@@ -718,7 +969,7 @@ We use speaker ID 0, 10, and 108 below to generate audio for the same text.
      --vits-tokens=./vits-vctk/tokens.txt \
      --sid=0 \
      --output-filename=./kennedy-0.wav \
-     'Ask not what your country can do for you; ask what you can do for your country.'
+     "Ask not what your country can do for you; ask what you can do for your country."

    ./build/bin/sherpa-onnx-offline-tts \
      --vits-model=./vits-vctk/vits-vctk.onnx \
@@ -726,7 +977,7 @@ We use speaker ID 0, 10, and 108 below to generate audio for the same text.
      --vits-tokens=./vits-vctk/tokens.txt \
      --sid=10 \
      --output-filename=./kennedy-10.wav \
-     'Ask not what your country can do for you; ask what you can do for your country.'
+     "Ask not what your country can do for you; ask what you can do for your country."

    ./build/bin/sherpa-onnx-offline-tts \
      --vits-model=./vits-vctk/vits-vctk.onnx \
@@ -734,7 +985,7 @@ We use speaker ID 0, 10, and 108 below to generate audio for the same text.
      --vits-tokens=./vits-vctk/tokens.txt \
      --sid=108 \
      --output-filename=./kennedy-108.wav \
-     'Ask not what your country can do for you; ask what you can do for your country.'
+     "Ask not what your country can do for you; ask what you can do for your country."

 It will generate 3 files: ``kennedy-0.wav``, ``kennedy-10.wav``, and ``kennedy-108.wav``.

@@ -865,7 +1116,164 @@ It will generate 3 files: ``einstein-30.wav``, ``franklin-66.wav``, and ``martin
+.. _sherpa-onnx-vits-zh-ll:
+
+csukuangfj/sherpa-onnx-vits-zh-ll (Chinese, 5 speakers)
+-------------------------------------------------------
+
+You can download the model using the following commands::
+
+   wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/sherpa-onnx-vits-zh-ll.tar.bz2
+   tar xvf sherpa-onnx-vits-zh-ll.tar.bz2
+   rm sherpa-onnx-vits-zh-ll.tar.bz2
+
+.. hint::
+
+   This model is trained with the following framework
+
+   ``_
+
+Please check the file sizes of the downloaded model:
+
+.. code-block:: bash
+
+   ls -lh sherpa-onnx-vits-zh-ll/
+
+   -rw-r--r-- 1 fangjun staff 2.3K Apr 25 17:58 G_multisperaker_latest.json
+   -rw-r-----@ 1 fangjun staff 2.2K Apr 25 17:22 G_multisperaker_latest_low.json
+   -rw-r--r-- 1 fangjun staff 127B Apr 25 17:58 README.md
+   -rw-r--r-- 1 fangjun staff 58K Apr 25 17:58 date.fst
+   drwxr-xr-x 9 fangjun staff 288B Jun 21 16:32 dict
+   -rw-r--r-- 1 fangjun staff 368K Apr 25 17:58 lexicon.txt
+   -rw-r--r-- 1 fangjun staff 115M Apr 25 17:58 model.onnx
+   -rw-r--r-- 1 fangjun staff 21K Apr 25 17:58 new_heteronym.fst
+   -rw-r--r-- 1 fangjun staff 63K Apr 25 17:58 number.fst
+   -rw-r--r-- 1 fangjun staff 87K Apr 25 17:58 phone.fst
+   -rw-r--r-- 1 fangjun staff 331B Apr 25 17:58 tokens.txt
+
+**usage**:
+
+.. code-block:: bash
+
+   sherpa-onnx-offline-tts \
+     --vits-model=./sherpa-onnx-vits-zh-ll/model.onnx \
+     --vits-dict-dir=./sherpa-onnx-vits-zh-ll/dict \
+     --vits-lexicon=./sherpa-onnx-vits-zh-ll/lexicon.txt \
+     --vits-tokens=./sherpa-onnx-vits-zh-ll/tokens.txt \
+     --vits-length-scale=0.5 \
+     --sid=0 \
+     --output-filename="./0-value-2x.wav" \
+     "小米的核心价值观是什么?答案是真诚热爱!"
+
+
+   sherpa-onnx-offline-tts \
+     --vits-model=./sherpa-onnx-vits-zh-ll/model.onnx \
+     --vits-dict-dir=./sherpa-onnx-vits-zh-ll/dict \
+     --vits-lexicon=./sherpa-onnx-vits-zh-ll/lexicon.txt \
+     --vits-tokens=./sherpa-onnx-vits-zh-ll/tokens.txt \
+     --sid=1 \
+     --tts-rule-fsts=./sherpa-onnx-vits-zh-ll/number.fst \
+     --output-filename="./1-numbers.wav" \
+     "小米有14岁了"
+
+   sherpa-onnx-offline-tts \
+     --vits-model=./sherpa-onnx-vits-zh-ll/model.onnx \
+     --vits-dict-dir=./sherpa-onnx-vits-zh-ll/dict \
+     --vits-lexicon=./sherpa-onnx-vits-zh-ll/lexicon.txt \
+     --vits-tokens=./sherpa-onnx-vits-zh-ll/tokens.txt \
+     --tts-rule-fsts=./sherpa-onnx-vits-zh-ll/phone.fst,./sherpa-onnx-vits-zh-ll/number.fst \
+     --sid=2 \
+     --output-filename="./2-numbers.wav" \
+     "有困难,请拨打110 或者18601200909"
+
+   sherpa-onnx-offline-tts \
+     --vits-model=./sherpa-onnx-vits-zh-ll/model.onnx \
+     --vits-dict-dir=./sherpa-onnx-vits-zh-ll/dict \
+     --vits-lexicon=./sherpa-onnx-vits-zh-ll/lexicon.txt \
+     --vits-tokens=./sherpa-onnx-vits-zh-ll/tokens.txt \
+     --sid=3 \
+     --output-filename="./3-wo-mi.wav" \
+     "小米的使命是,始终坚持做感动人心、价格厚道的好产品,让全球每个人都能享受科技带来的美好生活。"
+
+   sherpa-onnx-offline-tts \
+     --vits-model=./sherpa-onnx-vits-zh-ll/model.onnx \
+     --vits-dict-dir=./sherpa-onnx-vits-zh-ll/dict \
+     --vits-lexicon=./sherpa-onnx-vits-zh-ll/lexicon.txt \
+     --vits-tokens=./sherpa-onnx-vits-zh-ll/tokens.txt \
+     --tts-rule-fsts=./sherpa-onnx-vits-zh-ll/number.fst \
+     --sid=4 \
+     --output-filename="./4-heteronym.wav" \
+     "35年前,他于长沙出生, 在长白山长大。9年前他当上了银行的领导,主管行政。"
+
+.. raw:: html
+
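+  <table>
+    <tr>
+      <th>Wave filename</th>
+      <th>Content</th>
+      <th>Text</th>
+    </tr>
+    <tr>
+      <td>0-value-2x.wav</td>
+      <td>(audio player)</td>
+      <td>小米的核心价值观是什么?答案是真诚热爱!</td>
+    </tr>
+    <tr>
+      <td>1-numbers.wav</td>
+      <td>(audio player)</td>
+      <td>小米有14岁了</td>
+    </tr>
+    <tr>
+      <td>2-numbers.wav</td>
+      <td>(audio player)</td>
+      <td>有困难,请拨打110 或者18601200909</td>
+    </tr>
+    <tr>
+      <td>3-wo-mi.wav</td>
+      <td>(audio player)</td>
+      <td>小米的使命是,始终坚持做感动人心、价格厚道的好产品,让全球每个人都能享受科技带来的美好生活。</td>
+    </tr>
+    <tr>
+      <td>4-heteronym.wav</td>
+      <td>(audio player)</td>
+      <td>35年前,他于长沙出生, 在长白山长大。9年前他当上了银行的领导,主管行政。</td>
+    </tr>
+  </table>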
 .. _vits-zh-hf-fanchen-C:

@@ -1622,7 +2030,7 @@ Generate speech with executable compiled from C++
      --vits-data-dir=./vits-piper-en_US-lessac-medium/espeak-ng-data \
      --vits-tokens=./vits-piper-en_US-lessac-medium/tokens.txt \
      --output-filename=./liliana-piper-en_US-lessac-medium.wav \
-     'liliana, the most beautiful and lovely assistant of our team!'
+     "liliana, the most beautiful and lovely assistant of our team!"

 .. hint::

@@ -1637,7 +2045,7 @@ Generate speech with executable compiled from C++
      --vits-data-dir=./vits-piper-en_US-lessac-medium/espeak-ng-data \
      --vits-tokens=./vits-piper-en_US-lessac-medium/tokens.txt \
      --output-filename=./liliana-piper-en_US-lessac-medium.wav \
-     'liliana, the most beautiful and lovely assistant of our team!'
+     "liliana, the most beautiful and lovely assistant of our team!"

 which will play the audio as it is generating.
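
Reviewer note: the ``soxi`` listings added for the glados files report 1 channel, 16-bit PCM, and a sample rate of 22050, which matches the new row in the summary table. Where SoX is not installed, the same fields can be read with Python's standard ``wave`` module. The following is only a minimal sketch, not part of sherpa-onnx: the helper name ``describe_wav`` is hypothetical, and it assumes the generated files are plain PCM WAV files, as the ``soxi`` output suggests.

```python
import wave


def describe_wav(path):
    """Return basic header fields of a PCM WAV file (hypothetical helper).

    The fields mirror a subset of what ``soxi`` prints: channels,
    sample rate, precision, sample count, and duration.
    """
    with wave.open(path, "rb") as f:
        num_samples = f.getnframes()
        sample_rate = f.getframerate()
        return {
            "channels": f.getnchannels(),
            "sample_rate": sample_rate,
            "bits_per_sample": 8 * f.getsampwidth(),
            "num_samples": num_samples,
            "duration_seconds": num_samples / sample_rate,
        }


if __name__ == "__main__":
    import sys

    # Print one summary per file given on the command line.
    for wav_path in sys.argv[1:]:
        print(wav_path, describe_wav(wav_path))
```

Saved under a name of your choice and run with the generated files as arguments (for example ``glados-code.wav``), it prints one dictionary per file; the ``sample_rate`` entry can then be compared against the summary table (22050 for ``vits-piper-en_US-glados``, 16000 for ``sherpa-onnx-vits-zh-ll``).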