A Python sample showcasing a complete server-side integration w/ the ElevenLabs WebSockets API.
git clone https://github.com/bephrem1/elevenlabs-websockets.git
cd elevenlabs-websockets
pip install -r requirements.txt
Create a file called .env at the root of the project w/ 1 key:
ELEVENLABS_API_KEY = api_key_here
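For reference, here is a rough stdlib-only sketch of how a key in that `KEY = value` shape could be read at startup. `load_env_file` and `get_api_key` are hypothetical helpers written for this illustration, not functions from the sample (which may well use a loader library from requirements.txt instead).

```python
import os
from pathlib import Path


def load_env_file(path: str) -> dict[str, str]:
    """Parse simple `KEY = value` lines; blank lines and # comments are skipped."""
    values: dict[str, str] = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Tolerate spaces around "=" and optional surrounding quotes.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def get_api_key(path: str = ".env") -> str:
    """Prefer a real environment variable, then fall back to the .env file."""
    key = os.environ.get("ELEVENLABS_API_KEY")
    if not key and Path(path).exists():
        key = load_env_file(path).get("ELEVENLABS_API_KEY")
    if not key:
        raise RuntimeError("ELEVENLABS_API_KEY is not set")
    return key
```

Failing loudly when the key is missing is a design choice here: it surfaces a bad setup before any socket connection is attempted.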
python3 src/testing/voicebox.py
Clocking Times: elapsed time is clocked for a few critical events
- initial socket connection: websocket connection to ElevenLabs (usually takes 150-250ms). This overhead exists on every TTS generation since connections have to be reestablished every generation (& the websocket handshake has to be redone).
- fgl: stands for "first generation latency", this is the time between when the first speech text chunk is sent → & the first base64 speech chunk is received back from ElevenLabs.
- totelap: this is the total elapsed time between when prepare() was called → & the relevant log being recorded. This is an important metric for tracking the time between when LLM inference may have been fired off & when the first speech chunk is received back.
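To make the three timings concrete, here is a small sketch of how they relate to each other. The function names and the choice of timestamps are assumptions for illustration; the sample's own logging may record them differently.

```python
import time
from typing import Optional


def elapsed_ms(start: float, end: Optional[float] = None) -> float:
    """Milliseconds between two clock readings (defaults to 'now')."""
    if end is None:
        end = time.perf_counter()
    return (end - start) * 1000.0


def clock_report(
    prepare_called: float,
    socket_opened: float,
    first_text_sent: float,
    first_audio_received: float,
) -> dict[str, float]:
    """Derive the three timings from four timestamps taken on one clock."""
    return {
        # Assumption: the connection attempt starts when prepare() is called.
        "initial socket connection": elapsed_ms(prepare_called, socket_opened),
        # First text chunk sent -> first audio chunk received.
        "fgl": elapsed_ms(first_text_sent, first_audio_received),
        # prepare() called -> the moment being logged (here: first audio).
        "totelap": elapsed_ms(prepare_called, first_audio_received),
    }
```

All timestamps need to come from the same clock (e.g. all from `time.perf_counter()`), otherwise the differences are meaningless.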
src/voicebox/Voicebox.py → has all socket-related logic
src/testing/voicebox.py → driver file
The sample showcases a Voicebox class wrapping around the WebSocket functionality with a simple API:
actions:
- prepare(speech_generation_start_time: float) (non-blocking): This will prepare & initialize the socket connection (async).
  - If you are streaming text from an LLM you'd fire prepare() before your LLM request so both TTS prep + LLM inference proceed concurrently.
  - The voicebox should be ready to ingest speech within 200-300ms (before your LLM would get its first token back to you).
- async feed_speech(text: str): Once a connection is open, you can feed speech over the socket. Feed a string of any size (from a single character to a full sentence).
- async feeding_finished(): Signal to the voicebox that it has received all speech & transmission is finished.
  - This is a mandatory step: although speech generations may continue to come back (even when you have no more to send), ElevenLabs requires that you let it know that further speech will not be sent & that the transmission is complete.
- async reset(): This will close the socket connection & reset the voicebox for the next speech generation to run.
  - This is a required step: socket connections cannot be reused (at the time of this sample's writing) or kept alive (default timeout is 20s).
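Under the hood, feeding speech and signalling the end of input come down to small JSON frames over the socket. The sketch below reflects my understanding of the ElevenLabs stream-input message shapes at the time of writing (a single-space opening frame carrying the key, text frames, an empty-string frame to end input, and base64 audio coming back). Treat the field names as assumptions to verify against the current API docs; the helper names are made up for this example and are not the Voicebox internals.

```python
import base64
import json
from typing import Optional


def opening_message(api_key: str) -> str:
    # First frame: a single space as text, plus auth.
    # (Optional fields such as voice settings are left out of this sketch.)
    return json.dumps({"text": " ", "xi_api_key": api_key})


def text_message(text: str) -> str:
    # One chunk of speech text, any size.
    return json.dumps({"text": text})


def end_of_input_message() -> str:
    # An empty string signals that no further text will be sent.
    return json.dumps({"text": ""})


def decode_audio(raw_frame: str) -> Optional[bytes]:
    """Return decoded audio bytes, or None for frames that carry no audio."""
    audio = json.loads(raw_frame).get("audio")
    return base64.b64decode(audio) if audio else None


def is_final(raw_frame: str) -> bool:
    """True once the server marks the generation as finished."""
    return bool(json.loads(raw_frame).get("isFinal"))
```

Building and parsing frames in pure functions like this keeps them testable without opening a real connection.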
state:
- is_ready(): Check if the voicebox is ready for speech transmission.
  - You would check this in a loop after firing off prepare() (since prepare() will not block).
- generation_complete(): Returns True when all speech has been received back from ElevenLabs.
  - You would check this in a loop after calling feeding_finished() to actually know when all speech has been generated & sent back to you.
  - You can then safely call reset() to prepare for the next generation.
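Putting the actions & state checks together, one generation could look roughly like the sketch below. `FakeVoicebox` is an offline stand-in that only mirrors the method names described above (it opens no socket and is not the sample's class), and `fake_llm` is a made-up token source, so the whole thing runs without an API key.

```python
import asyncio
import time


class FakeVoicebox:
    """Offline stand-in mirroring the described API; not the real Voicebox."""

    def __init__(self) -> None:
        self._ready = False
        self._complete = False
        self.fed: list[str] = []  # log of everything fed, kept for inspection

    def prepare(self, speech_generation_start_time: float) -> None:
        # Non-blocking: schedule the "connection" and return immediately.
        self._start = speech_generation_start_time
        self._connect_task = asyncio.create_task(self._connect())

    async def _connect(self) -> None:
        await asyncio.sleep(0.05)  # pretend websocket handshake
        self._ready = True

    def is_ready(self) -> bool:
        return self._ready

    async def feed_speech(self, text: str) -> None:
        self.fed.append(text)

    async def feeding_finished(self) -> None:
        # Audio may keep arriving for a while after the last text is sent.
        self._finish_task = asyncio.create_task(self._finish())

    async def _finish(self) -> None:
        await asyncio.sleep(0.02)
        self._complete = True

    def generation_complete(self) -> bool:
        return self._complete

    async def reset(self) -> None:
        self._ready = False
        self._complete = False


async def fake_llm(queue: asyncio.Queue) -> None:
    """Hypothetical LLM stream: pushes tokens, then None as an end marker."""
    for token in ["Hello", ", ", "world", "."]:
        await asyncio.sleep(0.01)
        await queue.put(token)
    await queue.put(None)


async def speak(voicebox) -> None:
    # Assumption: a wall-clock timestamp; use whatever clock your logs use.
    voicebox.prepare(time.time())
    queue: asyncio.Queue = asyncio.Queue()
    llm_task = asyncio.create_task(fake_llm(queue))  # runs while TTS connects
    while not voicebox.is_ready():
        await asyncio.sleep(0.01)
    while (token := await queue.get()) is not None:
        await voicebox.feed_speech(token)
    await voicebox.feeding_finished()
    while not voicebox.generation_complete():
        await asyncio.sleep(0.01)
    await voicebox.reset()  # required before the next generation
    await llm_task
```

To try it, run `asyncio.run(speak(FakeVoicebox()))`. Swapping in the real Voicebox should keep the same call order, though its constructor arguments are not covered here.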
Just wanted to publish this as a quick sample — forgive any rough edges on code formatting, etc.