This repository has been archived by the owner on Mar 5, 2020. It is now read-only.

Study recognizability of speech over different methods #161

Open
5 of 10 tasks
neumantm opened this issue Jul 9, 2018 · 3 comments

Comments

@neumantm
Member

neumantm commented Jul 9, 2018

Do: with Spotify running, say "Amy wake up".

Do this with multiple microphones and/or devices/recognizers.
Smartphone vs. laptop.

Also try streaming audio from the smartphone to CMU Sphinx.

Find out whether the recognition problem is more of a software or a hardware (microphone) problem.

Also test with DeepSpeech from Mozilla.

x SP

Tasks:

  • Define the testing procedure (what to test, how often)
  • Test with CMU Sphinx and laptop mic
  • Test with CMU Sphinx and better mic (Samson Go Mic (~30€) in this case)
  • Test with CMU Sphinx and PS3 EyeToy
  • Test with Google and laptop mic
  • Test with Google and better mic
  • Test with Google and PS3 EyeToy
  • Test with DeepSpeech and laptop mic
  • Test with DeepSpeech and better mic
  • Test with DeepSpeech and PS3 EyeToy
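The first task, defining the testing procedure, can be sketched as a small tally over a test log: each attempt records the recognizer, the microphone, and whether the wake-up phrase was recognized, and the script reports a success rate per combination. A minimal Python sketch; the log entries and names below are illustrative, not results from this issue.

```python
from collections import defaultdict

# Hypothetical test log: (recognizer, microphone, recognized?) per attempt.
log = [
    ("cmu-sphinx", "laptop-mic", True),
    ("cmu-sphinx", "laptop-mic", False),
    ("cmu-sphinx", "samson-go", True),
    ("google", "laptop-mic", True),
    ("google", "samson-go", True),
]

def success_rates(entries):
    """Tally recognition successes per (recognizer, microphone) pair."""
    hits = defaultdict(int)
    total = defaultdict(int)
    for recognizer, mic, ok in entries:
        total[(recognizer, mic)] += 1
        hits[(recognizer, mic)] += ok
    return {key: hits[key] / total[key] for key in total}

for (rec, mic), rate in sorted(success_rates(log).items()):
    print(f"{rec:10s} + {mic:10s}: {rate:.0%}")
```

Running each combination a fixed number of times and feeding the outcomes into a tally like this would make the per-microphone and per-recognizer comparisons in the later comments directly comparable.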
@neumantm neumantm changed the title Study recongizability of speech over different methods Study recognizability of speech over different methods Jul 9, 2018
@neumantm neumantm added this to the Sprint 3 milestone Jul 9, 2018
@MakinZeel
Contributor

MakinZeel commented Jul 16, 2018

Test wake-up call with and without music (CMU & DeepSpeech & Google Cloud Speech)
Text: "Amy wake up"
Microphone: headset microphone (SADES SA903)

Info:

CMU uses an input stream from the microphone
DeepSpeech uses .wav files (16 kHz, mono)
Google uses .wav files (16 kHz, mono)

The whole Amy system uses ~1.5 GB RAM
DeepSpeech recognition uses ~1.0 GB RAM
Google runs online (https://cloud.google.com/speech-to-text/)

Amy runs on Windows
DeepSpeech runs on an Ubuntu subsystem on Windows
Google runs online (https://cloud.google.com/speech-to-text/)

CMU uses a grammar
DeepSpeech uses a pretrained free model
Google uses the cloud and a free model

Time Consumed:

CMU: ~1-2 s after finishing talking
DeepSpeech: ~5× the .wav file length
Google: <1/5 of the .wav file length
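The timing figures above are essentially real-time factors (processing time divided by audio length): DeepSpeech at roughly 5, Google below 0.2. A small sketch of how such a factor could be computed from a .wav clip, using Python's standard wave module; the 2-second silent clip is synthetic, for illustration only.

```python
import io
import wave

def wav_duration_seconds(wav_bytes):
    """Duration of a WAV clip = frame count / sample rate."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getnframes() / w.getframerate()

def real_time_factor(processing_seconds, wav_bytes):
    """RTF > 1 means slower than real time (DeepSpeech-like ~5);
    RTF < 1 means faster than real time (Google-like < 0.2)."""
    return processing_seconds / wav_duration_seconds(wav_bytes)

# Build a synthetic 2-second silent clip: 16 kHz, mono, 16-bit,
# matching the format used for DeepSpeech and Google above.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)       # mono
    w.setsampwidth(2)       # 16-bit samples
    w.setframerate(16000)   # 16 kHz
    w.writeframes(b"\x00\x00" * 32000)  # 2 s of silence
clip = buf.getvalue()

print(real_time_factor(10.0, clip))  # DeepSpeech-like: 5.0
print(real_time_factor(0.3, clip))   # Google-like: 0.15
```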

Test1 - Without Music:

cmu:
    Recognized: True
ds:
    Recognized: True (amy wake up)
Google:
    Recognized: True (Amy wake up)

Test2 - With loud Music:

cmu:
    Recognized: False (amy does not wake up)
ds:
    Recognized: False (Random words, sometimes 'wake' or 'up')
Google:
    Recognized: True (Amy wake up)

Test3 - With medium loud Music (voice louder than music by large margin):

cmu:
    Recognized: True (sometimes you have to repeat the command)
ds:
    Recognized: False (Random words, sometimes 'wake' or 'up')
Google:
    Recognized: True (Amy wake up)

@Hobbitsloth
Member

Hobbitsloth commented Jul 16, 2018

Test wake-up call with and without music (CMU)
Text: "Amy wake up"
Microphone: Samson Go Mic, PS3 EyeToy
Setup: Distance to mic ~75 cm, Sound from Monitor directly behind the mic
Test: Say the text 50 times back to back in every case. (I say "Amy wake up" without sending Amy back to sleep in between.)

Time Consumed:
cmu: <1s after finishing to talk

Test 1 - Without Music:
No background noise.

  • cmu:
    • Samson Go Mic: good 7/10
    • PS3 EyeToy: good 8/10

Test 2 - With medium loud Music:
I can speak at a normal volume. Sound volume is at 30% in my case.

  • cmu:
    • Samson Go Mic: good 8/10
    • PS3 EyeToy: moderate 6/10

Test 3 - With loud Music:
I have to scream in the hope that Amy can understand me. Sound volume is at 100% in my case.

  • cmu:
    • Samson Go Mic: bad 1/10
    • PS3 EyeToy: not at all 0/10
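With only ten scored attempts per condition, the ratios above carry a lot of sampling uncertainty. A Wilson score interval makes that explicit; the sketch below applies it to the reported scores and is purely illustrative.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """95% Wilson score interval for a success proportion."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

# Scores reported above (successes out of 10 scored attempts).
for label, k in [("Samson, no music", 7), ("EyeToy, no music", 8),
                 ("Samson, loud music", 1), ("EyeToy, loud music", 0)]:
    lo, hi = wilson_interval(k, 10)
    print(f"{label}: {k}/10 -> 95% CI [{lo:.2f}, {hi:.2f}]")
```

For example, 7/10 yields an interval of roughly [0.40, 0.89], so a 7/10 and an 8/10 condition are statistically indistinguishable at this sample size, while the no-music vs. loud-music gap clearly is not.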

@buddy200
Contributor

buddy200 commented Jul 23, 2018

The tests were made with the Seacue USB microphone (10€) and the built-in microphone of an HP ProBook 440 G4.
Comparative tests have shown that the notebook's built-in microphone performs on par with the USB microphone.

All tests were run with several commands from different plugins (the same commands for each speech recognizer).

Normal commands:
CMU recognizes commands of 4 words best (8/10).
Commands with fewer than 4 words were recognized as well as 4-word commands, but were often recognized at the wrong time, e.g. when nobody was talking.
Commands with more than 4 words were recognized clearly worse (6/10).
Commands with 10 or more words: only 2/10.

Google Speech recognizes commands of all lengths very well (9/10).

Numbers:
CMU can't recognize numbers at the moment. Out of 100 tests with 5 different numbers, the correct number was recognized only 2 times.
Google recognizes numbers very well, 9/10 on average.

Times:
CMU: not possible, see numbers above.
Google can recognize times in different formats most of the time (8/10).
But single times are formatted strangely, e.g. "8 o'clock pm" becomes "2000" (every time for this time). The same behavior can be seen with long commands.
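A right/wrong tally hides partial matches such as "8 o'clock pm" coming back as "2000". Word error rate (word-level edit distance divided by reference length) would score those cases more finely. A minimal sketch; the example phrases are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("amy wake up", "amy wake up"))  # 0.0
print(word_error_rate("amy wake up", "wake up"))      # one deletion -> ~0.33
print(word_error_rate("set a timer for eight pm", "set a timer for 2000"))
```

Averaging WER per command length would turn observations like "commands with 10 or more words only 2/10" into a single curve per recognizer.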

@Legion2 Legion2 modified the milestones: Sprint 3, Sprint 4 Jul 31, 2018
@Legion2 Legion2 pinned this issue Jan 4, 2019