
WesleyFister/llm-voice-assistant


Carry a spoken conversation with large language models! This project uses Whisper speech to text to transcribe the user's voice, sends the transcript to the LLM, and pipes the result to Piper text to speech. A socket server receives audio from a client device, performs all of the processing on the server, and sends the final TTS output back to the client.
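The wire protocol between client and server isn't documented here, so purely as an illustration, a minimal sketch of one common way to frame audio over a socket: length-prefix each payload so the receiver knows how many bytes to read. All function names below are hypothetical, not the project's actual API.

```python
import socket
import struct

def send_audio(sock, audio_bytes):
    # Prefix the payload with its length (4-byte big-endian) so the
    # receiver knows exactly how much to read.
    sock.sendall(struct.pack(">I", len(audio_bytes)) + audio_bytes)

def recv_exact(sock, n):
    # TCP recv() may return fewer bytes than asked; loop until n arrive.
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        buf += chunk
    return buf

def recv_audio(sock):
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return recv_exact(sock, length)
```

The same framing works in both directions: the client sends recorded microphone audio, and the server sends back the synthesized TTS audio.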

WIP Warning

This is very much a work in progress. Many basic features have yet to be implemented.

Install

This program only works on Linux (Ubuntu/Debian and Arch-based systems).

Run the 'setup.sh' script on both the client and the server. Then run 'start.sh' on the server first, followed by the client.

Features

  • 100% offline, open source and private
  • Wake word detection: 'Hey Jarvis'
  • Hands-free interaction
  • Client-server model
  • Fully multilingual pipeline
  • Streamed responses

Configuring

On both the client and the server, configuration is done by opening 'start.sh' and passing the corresponding flags to 'main.py', like so: python3 main.py --ip-address 192.168.1.123 --port 5432
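The '--ip-address' and '--port' flags come from the example above; a sketch of how 'main.py' might parse them with argparse (the defaults and parser structure are assumptions, not the project's actual code):

```python
import argparse

def build_parser():
    # Flag names match the README example; the defaults are assumptions.
    parser = argparse.ArgumentParser(description="llm-voice-assistant")
    parser.add_argument("--ip-address", default="127.0.0.1",
                        help="address of the server to connect to")
    parser.add_argument("--port", type=int, default=5432,
                        help="port the socket server listens on")
    return parser
```

Note that argparse exposes a hyphenated flag like '--ip-address' as the attribute 'args.ip_address'.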

Multilinguality

The language has to be supported by the STT (Whisper languages), the LLM (Llama3.1 languages) and the TTS (Piper languages).

This means that by default the application supports the following languages: English, German, French, Italian, Portuguese, and Spanish. However, more languages can easily be added by changing the large language model used, for example to Mistral Small.

The output text generated by the LLM is chunked into sentences and run against Lingua to detect the language. A TTS model for that language is then downloaded and cached for future use. This unfortunately means that the first time a language is used, generating a response back to the user will be extremely slow. This is done to save memory, as loading all of the TTS models at once can take ~2.5 GB.

Todo

  1. Cancel audio playback when saying the wake word 'hey Jarvis'.
  2. Properly close the server with CTRL + C.
  3. Make a proper setup.sh for both the client and server.
  4. Clear LLM chat history by saying some variation of 'hey Jarvis clear chat history'.
  5. Allow multiple clients to connect to the server.
  6. Clean up code.
