Bug Description
When using text_stream_sample with the zh-CN-YunxiaNeural voice, the synthesized speech contains unintended pauses between words, which disrupts its natural flow. For instance, when synthesizing the sentence "今天的天气真好", there is a noticeable, unnatural pause between "今" and "天", even though the two characters form the single word "今天". This behavior might be related to how the OpenAI-generated text is processed in chunks during real-time synthesis. I am looking for ways to prevent this from happening.
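If the chunking of the streamed text really is the cause, one thing that could be tried is regrouping the incoming chunks so that text is handed to the synthesizer only at punctuation boundaries. The sketch below is a hypothetical helper, not part of text_stream_sample or the Speech SDK: the name regroup_chunks, the boundary set, and the max_chars fallback are all assumptions made for illustration, and whether this actually removes the pauses is untested.

```python
from typing import Iterable, Iterator

# Assumed boundary set: common Chinese and ASCII clause/sentence punctuation.
BOUNDARIES = set("，。！？；：、,.!?;:\n")


def regroup_chunks(chunks: Iterable[str], max_chars: int = 40) -> Iterator[str]:
    """Merge small streamed text chunks and emit them at punctuation boundaries.

    max_chars is a hypothetical safety limit so that text without any
    punctuation is still flushed eventually; a forced flush can still
    split a word.
    """
    buffer = ""
    for chunk in chunks:
        for ch in chunk:
            buffer += ch
            if ch in BOUNDARIES or len(buffer) >= max_chars:
                yield buffer
                buffer = ""
    # Flush whatever is left when the text stream ends.
    if buffer:
        yield buffer


if __name__ == "__main__":
    # Simulated model output that splits the word "今天" across two chunks.
    simulated = ["今", "天的天", "气真好", "。明", "天见"]
    print(list(regroup_chunks(simulated)))
    # prints ['今天的天气真好。', '明天见']
```

Each yielded segment would then be written to the synthesizer's text input in place of the raw chunk. The trade-off is added latency, since audio for a clause cannot start until its closing punctuation arrives.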
Steps to Reproduce
1. Use the framework of the text_stream_sample in the repo.
2. Set the speech_synthesis_voice_name to zh-CN-YunxiaNeural.
3. Send the synthesized audio chunks (using audio_buffer.tobytes()) through a WebSocket and play the PCM audio data on the client side.
4. Notice that the longer the response, the more frequently unnatural pauses between words occur.
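To narrow down whether the pauses are in the synthesized audio itself or are introduced by the WebSocket transport and client-side playback, the same PCM chunks could also be written to a WAV file on the server and listened to directly. This is a minimal diagnostic sketch using only the standard library. The function name save_pcm_chunks is hypothetical, and the default format (16 kHz, 16-bit, mono) is an assumption that must be changed to match the output format actually configured on the synthesizer.

```python
import wave
from typing import Iterable


def save_pcm_chunks(
    chunks: Iterable[bytes],
    path: str,
    sample_rate: int = 16000,  # assumption: match the configured output format
    sample_width: int = 2,     # bytes per sample (16-bit PCM)
    channels: int = 1,
) -> float:
    """Write raw PCM chunks to a WAV file and return the duration in seconds."""
    total_bytes = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        for chunk in chunks:
            wav.writeframes(chunk)
            total_bytes += len(chunk)
    return total_bytes / (sample_rate * sample_width * channels)
```

If the saved file plays back without pauses, the gaps are more likely to come from how the chunks are delivered or scheduled on the client than from the voice model.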
Expected Behavior
The synthesized speech should flow smoothly, with no unintended pauses between words unless indicated by appropriate punctuation. Each sentence should be delivered naturally and continuously.
Additional Context
The issue is most prominent with Chinese voice models. I have tested both zh-CN-YunxiaNeural and zh-CN-YunxiNeural, and the issue is consistent across them.
The issue does not seem to occur with English voice models, such as en-US-BrianMultilingualNeural.
dmingke changed the title from "Unintended Pauses Between Words using text_stream_sample.py for Chinese TTS" to "Unintended Pauses Between Words When Using text_stream_sample.py for Chinese TTS" on Sep 19, 2024
dmingke changed the title from "Unintended Pauses Between Words When Using text_stream_sample.py for Chinese TTS" to "Unintended Pauses Between Words When Using text_stream_sample for Chinese TTS" on Sep 19, 2024
Version of the Cognitive Services Speech SDK
azure-cognitiveservices-speech==1.40.0