Add doc for tts with sherpa-onnx (#487)
csukuangfj authored Oct 16, 2023
1 parent a5bb56b commit 683a9b7
Showing 12 changed files with 383 additions and 0 deletions.
Binary file added docs/source/_static/vits-vctk/einstein-30.wav
Binary file not shown.
Binary file added docs/source/_static/vits-vctk/franklin-66.wav
Binary file not shown.
Binary file added docs/source/_static/vits-vctk/kennedy-0.wav
Binary file not shown.
Binary file added docs/source/_static/vits-vctk/kennedy-10.wav
Binary file not shown.
Binary file added docs/source/_static/vits-vctk/kennedy-108.wav
Binary file not shown.
Binary file added docs/source/_static/vits-vctk/martin-99.wav
Binary file not shown.
3 changes: 3 additions & 0 deletions docs/source/conf.py
@@ -153,4 +153,7 @@ def get_version():
.. _Go: https://en.wikipedia.org/wiki/Go_(programming_language)
.. _sherpa-onnx-go: https://github.com/k2-fsa/sherpa-onnx-go
.. _yesno: https://www.openslr.org/1/
.. _vits: https://github.com/jaywalnut310/vits
.. _ljspeech: https://keithito.com/LJ-Speech-Dataset/
.. _VCTK: https://datashare.ed.ac.uk/handle/10283/2950
"""
6 changes: 6 additions & 0 deletions docs/source/onnx/index.rst
@@ -33,3 +33,9 @@ Also, we show how to use it for speech recognition with pre-trained models.
   ./websocket/index
   ./hotwords/index
   ./pretrained_models/index

.. toctree::
   :maxdepth: 5
   :caption: tts

   ./tts/index
2 changes: 2 additions & 0 deletions docs/source/onnx/python/install.rst
@@ -1,3 +1,5 @@
.. _install_sherpa_onnx_python:

Install the Python Package
==========================

13 changes: 13 additions & 0 deletions docs/source/onnx/tts/index.rst
@@ -0,0 +1,13 @@
Text-to-speech (TTS)
====================

This page describes how to use `sherpa-onnx`_ for text-to-speech (TTS).


Please first follow :ref:`install_sherpa_onnx` and/or :ref:`install_sherpa_onnx_python`
to install `sherpa-onnx`_ before you continue.

.. toctree::
   :maxdepth: 5

   ./pretrained_models/index
15 changes: 15 additions & 0 deletions docs/source/onnx/tts/pretrained_models/index.rst
@@ -0,0 +1,15 @@
Pre-trained models
==================

This page lists pre-trained models for text-to-speech.

.. hint::

   Please install `git-lfs <https://git-lfs.com/>`_ before you continue.

   Otherwise, you will be ``SAD`` later.

.. toctree::
   :maxdepth: 5

   ./vits
344 changes: 344 additions & 0 deletions docs/source/onnx/tts/pretrained_models/vits.rst
@@ -0,0 +1,344 @@
vits
====

This page lists pre-trained `vits`_ models.

ljspeech (English, single-speaker)
----------------------------------

This model is converted from `pretrained_ljspeech.pth <https://drive.google.com/file/d/1q86w74Ygw2hNzYP9cWkeClGT5X25PvBT/view?usp=drive_link>`_,
which is trained by the `vits`_ author `Jaehyeon Kim <https://github.com/jaywalnut310>`_ on
the `ljspeech`_ dataset. It supports only English and is a single-speaker model.

.. note::

   If you are interested in how the model is converted, please see
   `<https://github.com/k2-fsa/sherpa-onnx/blob/master/scripts/vits/export-onnx-ljs.py>`_

In the following, we describe how to download it and use it with `sherpa-onnx`_.

Download the model
~~~~~~~~~~~~~~~~~~

Please use the following commands to download it.

.. code-block:: bash

   cd /path/to/sherpa-onnx
   GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/csukuangfj/vits-ljs
   cd vits-ljs
   git lfs pull --include "*.onnx"

Please check that the file sizes of the pre-trained models are correct. See
the file sizes of ``*.onnx`` files below.

.. code-block:: bash

   vits-ljs fangjun$ ls -lh *.onnx
   -rw-r--r-- 1 fangjun staff 36M Oct 16 15:16 vits-ljs.int8.onnx
   -rw-r--r-- 1 fangjun staff 109M Oct 16 15:16 vits-ljs.onnx

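
If ``git lfs pull`` was skipped or failed, the ``*.onnx`` files are tiny text
pointers instead of the real models. The following is a minimal sketch of a
size check. The helper name ``check_onnx_sizes`` and the 1,000,000-byte
threshold are assumptions made for illustration; they are not part of
`sherpa-onnx`_.

.. code-block:: bash

   # check_onnx_sizes <dir> [min_bytes]
   # Hypothetical helper (not part of sherpa-onnx): reports every *.onnx file
   # in <dir> and fails if any of them is smaller than min_bytes
   # (assumed default: 1000000 bytes).
   check_onnx_sizes() {
     dir="$1"
     min_bytes="${2:-1000000}"
     status=0
     for f in "$dir"/*.onnx; do
       if [ ! -e "$f" ]; then
         echo "no .onnx files found in $dir"
         return 1
       fi
       # The arithmetic expansion strips any padding that wc adds.
       size=$(( $(wc -c < "$f") ))
       if [ "$size" -lt "$min_bytes" ]; then
         echo "TOO SMALL: $f ($size bytes); try 'git lfs pull' again"
         status=1
       else
         echo "ok: $f ($size bytes)"
       fi
     done
     return "$status"
   }

For example, ``check_onnx_sizes ./vits-ljs`` returns a non-zero exit status if
either model file is still a pointer.
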
Generate speech with executable compiled from C++
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   cd /path/to/sherpa-onnx

   ./build/bin/sherpa-onnx-offline-tts \
     --vits-model=./vits-ljs/vits-ljs.onnx \
     --vits-lexicon=./vits-ljs/lexicon.txt \
     --vits-tokens=./vits-ljs/tokens.txt \
     --output-filename=./liliana.wav \
     'liliana, the most beautiful and lovely assistant of our team!'

After running, it will generate a file ``liliana.wav`` in the current directory.

.. code-block:: bash

   soxi ./liliana.wav

   Input File : './liliana.wav'
   Channels : 1
   Sample Rate : 22050
   Precision : 16-bit
   Duration : 00:00:04.39 = 96768 samples ~ 329.143 CDDA sectors
   File Size : 194k
   Bit Rate : 353k
   Sample Encoding: 16-bit Signed Integer PCM

.. raw:: html

   <table>
     <tr>
       <th>Wave filename</th>
       <th>Content</th>
       <th>Text</th>
     </tr>
     <tr>
       <td>liliana.wav</td>
       <td>
         <audio title="Generated ./liliana.wav" controls="controls">
           <source src="/_static/vits-ljs/liliana.wav" type="audio/wav">
           Your browser does not support the <code>audio</code> element.
         </audio>
       </td>
       <td>
         liliana, the most beautiful and lovely assistant of our team!
       </td>
     </tr>
   </table>
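
The duration reported by ``soxi`` is the sample count divided by the sample
rate, and the raw audio size is samples × bytes per sample × channels. The
sketch below reproduces that arithmetic for ``liliana.wav``. The helper names
``wav_duration`` and ``pcm_bytes`` are made up for illustration, and the
numbers are copied from the ``soxi`` output above.

.. code-block:: bash

   # wav_duration <num_samples> <sample_rate>
   # Prints the duration in seconds, rounded to two decimals.
   wav_duration() {
     awk -v n="$1" -v sr="$2" 'BEGIN { printf "%.2f\n", n / sr }'
   }

   # pcm_bytes <num_samples> <bytes_per_sample> <channels>
   # Prints the size of the raw audio data, excluding the WAV header.
   pcm_bytes() {
     echo $(( $1 * $2 * $3 ))
   }

   wav_duration 96768 22050   # prints 4.39
   pcm_bytes 96768 2 1        # prints 193536

The second value is slightly below the ``194k`` file size because the file
also contains a WAV header.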

Generate speech with Python script
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   cd /path/to/sherpa-onnx

   python3 ./python-api-examples/offline-tts.py \
     --vits-model=./vits-ljs/vits-ljs.onnx \
     --vits-lexicon=./vits-ljs/lexicon.txt \
     --vits-tokens=./vits-ljs/tokens.txt \
     --output-filename=./armstrong.wav \
     "That's one small step for a man, a giant leap for mankind."

After running, it will generate a file ``armstrong.wav`` in the current directory.

.. code-block:: bash

   soxi ./armstrong.wav

   Input File : './armstrong.wav'
   Channels : 1
   Sample Rate : 22050
   Precision : 16-bit
   Duration : 00:00:04.81 = 105984 samples ~ 360.49 CDDA sectors
   File Size : 212k
   Bit Rate : 353k
   Sample Encoding: 16-bit Signed Integer PCM

.. raw:: html

   <table>
     <tr>
       <th>Wave filename</th>
       <th>Content</th>
       <th>Text</th>
     </tr>
     <tr>
       <td>armstrong.wav</td>
       <td>
         <audio title="Generated ./armstrong.wav" controls="controls">
           <source src="/_static/vits-ljs/armstrong.wav" type="audio/wav">
           Your browser does not support the <code>audio</code> element.
         </audio>
       </td>
       <td>
         That's one small step for a man, a giant leap for mankind.
       </td>
     </tr>
   </table>

VCTK (English, multi-speaker, 109 speakers)
-------------------------------------------

This model is converted from `pretrained_vctk.pth <https://drive.google.com/file/d/11aHOlhnxzjpdWDpsz1vFDCzbeEfoIxru/view?usp=drive_link>`_,
which is trained by the `vits`_ author `Jaehyeon Kim <https://github.com/jaywalnut310>`_ on
the `VCTK`_ dataset. It supports only English and is a multi-speaker model. It contains
109 speakers.

.. note::

   If you are interested in how the model is converted, please see
   `<https://github.com/k2-fsa/sherpa-onnx/blob/master/scripts/vits/export-onnx-vctk.py>`_

In the following, we describe how to download it and use it with `sherpa-onnx`_.

Download the model
~~~~~~~~~~~~~~~~~~

Please use the following commands to download it.

.. code-block:: bash

   cd /path/to/sherpa-onnx
   GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/csukuangfj/vits-vctk
   cd vits-vctk
   git lfs pull --include "*.onnx"

Please check that the file sizes of the pre-trained models are correct. See
the file sizes of ``*.onnx`` files below.

.. code-block:: bash

   vits-vctk fangjun$ ls -lh *.onnx
   -rw-r--r-- 1 fangjun staff 37M Oct 16 10:57 vits-vctk.int8.onnx
   -rw-r--r-- 1 fangjun staff 116M Oct 16 10:57 vits-vctk.onnx

Generate speech with executable compiled from C++
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since there are 109 speakers available, we can choose a speaker ID from 0 to 108.
The default speaker ID is 0.

We use speaker ID 0, 10, and 108 below to generate audio for the same text.

.. code-block:: bash

   cd /path/to/sherpa-onnx

   ./build/bin/sherpa-onnx-offline-tts \
     --vits-model=./vits-vctk/vits-vctk.onnx \
     --vits-lexicon=./vits-vctk/lexicon.txt \
     --vits-tokens=./vits-vctk/tokens.txt \
     --sid=0 \
     --output-filename=./kennedy-0.wav \
     'Ask not what your country can do for you; ask what you can do for your country.'

   ./build/bin/sherpa-onnx-offline-tts \
     --vits-model=./vits-vctk/vits-vctk.onnx \
     --vits-lexicon=./vits-vctk/lexicon.txt \
     --vits-tokens=./vits-vctk/tokens.txt \
     --sid=10 \
     --output-filename=./kennedy-10.wav \
     'Ask not what your country can do for you; ask what you can do for your country.'

   ./build/bin/sherpa-onnx-offline-tts \
     --vits-model=./vits-vctk/vits-vctk.onnx \
     --vits-lexicon=./vits-vctk/lexicon.txt \
     --vits-tokens=./vits-vctk/tokens.txt \
     --sid=108 \
     --output-filename=./kennedy-108.wav \
     'Ask not what your country can do for you; ask what you can do for your country.'

It will generate 3 files: ``kennedy-0.wav``, ``kennedy-10.wav``, and ``kennedy-108.wav``.

.. raw:: html

   <table>
     <tr>
       <th>Wave filename</th>
       <th>Content</th>
       <th>Text</th>
     </tr>
     <tr>
       <td>kennedy-0.wav</td>
       <td>
         <audio title="Generated ./kennedy-0.wav" controls="controls">
           <source src="/_static/vits-vctk/kennedy-0.wav" type="audio/wav">
           Your browser does not support the <code>audio</code> element.
         </audio>
       </td>
       <td>
         Ask not what your country can do for you; ask what you can do for your country.
       </td>
     </tr>
     <tr>
       <td>kennedy-10.wav</td>
       <td>
         <audio title="Generated ./kennedy-10.wav" controls="controls">
           <source src="/_static/vits-vctk/kennedy-10.wav" type="audio/wav">
           Your browser does not support the <code>audio</code> element.
         </audio>
       </td>
       <td>
         Ask not what your country can do for you; ask what you can do for your country.
       </td>
     </tr>
     <tr>
       <td>kennedy-108.wav</td>
       <td>
         <audio title="Generated ./kennedy-108.wav" controls="controls">
           <source src="/_static/vits-vctk/kennedy-108.wav" type="audio/wav">
           Your browser does not support the <code>audio</code> element.
         </audio>
       </td>
       <td>
         Ask not what your country can do for you; ask what you can do for your country.
       </td>
     </tr>
   </table>
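
The invocations for the VCTK model differ only in ``--sid`` and the output
filename, so they can be produced in a loop. The sketch below is a dry run: it
prints one command per speaker ID instead of running it, and it rejects IDs
outside ``0`` to ``108``. The helper names ``valid_sid`` and ``tts_command``
are made up for illustration; the flags and paths are the ones used on this
page.

.. code-block:: bash

   # valid_sid <sid> [num_speakers]
   # Succeeds only for an integer in [0, num_speakers); 109 is the speaker
   # count stated on this page.
   valid_sid() {
     case "$1" in
       ''|*[!0-9]*) return 1 ;;
     esac
     [ "$1" -lt "${2:-109}" ]
   }

   # tts_command <sid> <text>
   # Prints (does not run) the command line for one speaker.
   # Assumes <text> contains no single quotes.
   tts_command() {
     printf "%s --vits-model=%s/vits-vctk.onnx --vits-lexicon=%s/lexicon.txt --vits-tokens=%s/tokens.txt --sid=%s --output-filename=./kennedy-%s.wav '%s'\n" \
       ./build/bin/sherpa-onnx-offline-tts ./vits-vctk ./vits-vctk ./vits-vctk "$1" "$1" "$2"
   }

   text='Ask not what your country can do for you; ask what you can do for your country.'
   for sid in 0 10 108; do
     if valid_sid "$sid"; then
       tts_command "$sid" "$text"
     else
       echo "skipping invalid speaker ID: $sid" >&2
     fi
   done

Printing the commands first makes it easy to review them before piping the
output to ``sh`` from the ``sherpa-onnx`` directory.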

Generate speech with Python script
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We use speaker ID 30, 66, and 99 below to generate audio for different transcripts.

.. code-block:: bash

   cd /path/to/sherpa-onnx

   python3 ./python-api-examples/offline-tts.py \
     --vits-model=./vits-vctk/vits-vctk.onnx \
     --vits-lexicon=./vits-vctk/lexicon.txt \
     --vits-tokens=./vits-vctk/tokens.txt \
     --sid=30 \
     --output-filename=./einstein-30.wav \
     "Life is like riding a bicycle. To keep your balance, you must keep moving."

   python3 ./python-api-examples/offline-tts.py \
     --vits-model=./vits-vctk/vits-vctk.onnx \
     --vits-lexicon=./vits-vctk/lexicon.txt \
     --vits-tokens=./vits-vctk/tokens.txt \
     --sid=66 \
     --output-filename=./franklin-66.wav \
     "Three can keep a secret, if two of them are dead."

   python3 ./python-api-examples/offline-tts.py \
     --vits-model=./vits-vctk/vits-vctk.onnx \
     --vits-lexicon=./vits-vctk/lexicon.txt \
     --vits-tokens=./vits-vctk/tokens.txt \
     --sid=99 \
     --output-filename=./martin-99.wav \
     "Darkness cannot drive out darkness: only light can do that. Hate cannot drive out hate: only love can do that"

It will generate 3 files: ``einstein-30.wav``, ``franklin-66.wav``, and ``martin-99.wav``.

.. raw:: html

   <table>
     <tr>
       <th>Wave filename</th>
       <th>Content</th>
       <th>Text</th>
     </tr>
     <tr>
       <td>einstein-30.wav</td>
       <td>
         <audio title="Generated ./einstein-30.wav" controls="controls">
           <source src="/_static/vits-vctk/einstein-30.wav" type="audio/wav">
           Your browser does not support the <code>audio</code> element.
         </audio>
       </td>
       <td>
         Life is like riding a bicycle. To keep your balance, you must keep moving.
       </td>
     </tr>
     <tr>
       <td>franklin-66.wav</td>
       <td>
         <audio title="Generated ./franklin-66.wav" controls="controls">
           <source src="/_static/vits-vctk/franklin-66.wav" type="audio/wav">
           Your browser does not support the <code>audio</code> element.
         </audio>
       </td>
       <td>
         Three can keep a secret, if two of them are dead.
       </td>
     </tr>
     <tr>
       <td>martin-99.wav</td>
       <td>
         <audio title="Generated ./martin-99.wav" controls="controls">
           <source src="/_static/vits-vctk/martin-99.wav" type="audio/wav">
           Your browser does not support the <code>audio</code> element.
         </audio>
       </td>
       <td>
         Darkness cannot drive out darkness: only light can do that. Hate cannot drive out hate: only love can do that
       </td>
     </tr>
   </table>
