ElevenLabs WebSocket API Sample

A Python sample showcasing a complete server-side integration w/ the ElevenLabs WebSockets API.

Running

Setup

0) clone + cd:

git clone https://github.com/bephrem1/elevenlabs-websockets.git
cd elevenlabs-websockets

1) install dependencies:

pip install -r requirements.txt

2) create .env file:

create a file called .env at the root of the project w/ 1 key:

ELEVENLABS_API_KEY=api_key_here

Run

3) run in terminal:

python3 src/testing/voicebox.py

you should see something like this:

(screenshot: terminal logs)

Clocking Times: elapsed time is clocked for a few critical events

  • initial socket connection: websocket connection to ElevenLabs (usually takes 150-250ms) — this overhead exists on every TTS generation, since the connection (& its websocket handshake) has to be reestablished each time.
  • fgl: stands for "first generation latency"; this is the time between when the first speech text chunk is sent → & the first base64 speech chunk is received back from ElevenLabs
  • totelap: this is the total elapsed time between when prepare() was called → & the relevant log being recorded. This is an important metric for tracking the time from when LLM inference may have been fired off to when the first speech chunk is received back.
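The timing labels above boil down to simple wall-clock deltas. A hypothetical illustration (variable names and sleeps are stand-ins, not the repo's code):

```python
import time

# Hypothetical illustration of the timing labels above; these names
# and the sleep() placeholders are not from the repo's code.
prepare_called = time.monotonic()     # prepare() fired (LLM inference may start now)
time.sleep(0.01)                      # ...socket connects (the ~150-250ms overhead)...
first_text_sent = time.monotonic()    # first speech text chunk goes out
time.sleep(0.01)                      # ...ElevenLabs generates speech...
first_audio_back = time.monotonic()   # first base64 speech chunk comes back

fgl = first_audio_back - first_text_sent      # "first generation latency"
totelap = first_audio_back - prepare_called   # total elapsed since prepare()
```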

Inspecting

4) inspect files

src/voicebox/Voicebox.py → has all socket-related logic
src/testing/voicebox.py → driver file

Overview

The sample showcases a Voicebox class that wraps the WebSocket functionality in a simple API:

actions:

  • prepare(speech_generation_start_time: float) (non-blocking): This will prepare & initialize the socket connection (async).
    • If you are streaming text from an LLM you'd fire prepare() before your LLM request so both TTS prep + LLM inference proceed concurrently.
    • The voicebox should be ready to ingest speech within 200-300ms (before your LLM would get its first token back to you).
  • async feed_speech(text: str): Once a connection is open, you can feed speech over the socket. Feed a string of any size (from a single character to a full sentence).
  • async feeding_finished(): Signal to the voicebox that it has received all speech & transmission is finished.
    • This is a mandatory step — although speech generations may continue to come back (even when you have no more to send), ElevenLabs requires that you let it know that further speech will not be sent & that the transmission is complete.
  • async reset(): This will close the socket connection & reset the voicebox for the next speech generation to run.
    • This is a required step: socket connections cannot be reused (at the time of this sample's writing) or kept alive (the default timeout is 20s).
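The expected call order of the four actions can be sketched with a mock. This is a hypothetical stand-in only to illustrate sequencing — the real class lives in src/voicebox/Voicebox.py and actually talks to ElevenLabs:

```python
import asyncio
import time

# Hypothetical mock of the four actions above, purely to show call order;
# not the real implementation.
class MockVoicebox:
    def __init__(self):
        self.log = []

    def prepare(self, speech_generation_start_time: float):
        self.log.append("prepare")       # real class opens the socket async

    async def feed_speech(self, text: str):
        self.log.append(f"feed:{text}")  # real class sends text over the socket

    async def feeding_finished(self):
        self.log.append("finished")      # mandatory end-of-text signal

    async def reset(self):
        self.log.append("reset")         # required: sockets are single-use

async def main():
    vb = MockVoicebox()
    vb.prepare(time.time())              # fire before LLM inference starts
    await vb.feed_speech("Hello, ")      # feed chunks as they stream in
    await vb.feed_speech("world.")
    await vb.feeding_finished()
    await vb.reset()
    return vb.log

log = asyncio.run(main())
```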

state:

  • is_ready(): Check if the voicebox is ready for speech transmission.
    • You would check this in a loop after firing off prepare() (since prepare() will not block).
  • generation_complete(): Returns true when all speech has been received back from ElevenLabs.
    • You would check this in a loop after calling feeding_finished() to actually know when all speech has been generated & sent back to you.
    • You can then safely call reset() to prepare for the next generation.
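Since both state methods are meant to be checked in a loop, they pair naturally with a small polling helper. A generic sketch (wait_for is hypothetical, not part of the repo):

```python
import asyncio

# Hypothetical polling helper for is_ready()/generation_complete();
# not part of the repo, just a sketch of the loops described above.
async def wait_for(predicate, timeout_s: float = 5.0, interval_s: float = 0.01) -> bool:
    """Poll predicate() until it returns True or timeout_s elapses."""
    waited = 0.0
    while not predicate():
        if waited >= timeout_s:
            return False                 # e.g. the socket never became ready
        await asyncio.sleep(interval_s)
        waited += interval_s
    return True

# Usage against the Voicebox API (method names from this README):
#   await wait_for(voicebox.is_ready)             # after prepare()
#   await wait_for(voicebox.generation_complete)  # after feeding_finished()
#   await voicebox.reset()
```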


Just wanted to publish this as a quick sample — forgive any rough edges on code formatting, etc.

Buy Me A Coffee
