# ASR Papers on TIMIT

Mostly adapted from this excellent repo: https://github.com/syhw/wer_are_we, but a bit more descriptive for my own research. (NOTE: PER refers to the Phone Error Rate, reported in %.) This file is best viewed with a Markdown preview, e.g. in VSCode.
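PER is the phone-level analogue of word error rate: the Levenshtein distance (substitutions + deletions + insertions) between the reference and hypothesis phone sequences, divided by the reference length. A minimal sketch (the phone labels here are illustrative, not a real decoder output):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phone sequences."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def per(ref, hyp):
    """Phone Error Rate in % = (S + D + I) / N * 100."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)

ref = ["sil", "k", "ae", "t", "sil"]
hyp = ["sil", "k", "aa", "t"]  # one substitution, one deletion
print(per(ref, hyp))           # 2 errors over 5 reference phones -> 40.0
```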

## List of Results

| S.No. | PER | Paper name | Published | Publisher | Type of training | Features used | Training set | Test set | Validation set |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 13.8 | The PyTorch-Kaldi Speech Recognition Toolkit | Feb. 2019 | ICASSP 2019 | Hybrid (discriminative training using alignments from a generative GMM-HMM) | Combination of MFCC + FBANK + fMLLR features | Standard Kaldi s5 training set | Core Test set | As per Kaldi s5 |
| 2 | 14.9 | Light Gated Recurrent Units for Speech Recognition | Apr. 2018 | IEEE Trans. 2018 | Results for both Hybrid and Discriminative (end-to-end) training | Mel filterbank coefficients with energy, plus their velocity and acceleration coefficients | Standard Kaldi s5 training set | Core Test set | Standard s5 dev set |
| 3 | 16.5 | Phone Recognition with Hierarchical Convolutional Deep Maxout Networks | Sep. 2015 | EURASIP Journal 2015 | Hybrid (discriminative training using alignments from a generative GMM-HMM) | Mel filterbank features with context frames, spatially reshaped | Standard Kaldi s5 training set (3696 sentences) | Core Test set | 10% of training set (random) |
| 4 | 16.5 | A Regularization Post Layer: An Additional Way How to Make Deep Neural Networks Robust | Oct. 2017 | ICSLSP 2017 | Hybrid (discriminative training using alignments from a generative GMM-HMM) | MFCC + fMLLR features with deltas and double-deltas | Standard Kaldi s5 training set (3696 sentences) | Core Test set | s5 dev set |
| 5 | 16.7 | Combining time- and frequency-domain convolution in convolutional neural network based phone recognition | May 2014 | ICASSP 2014 | Hybrid (discriminative training using alignments from a generative GMM-HMM) | Mel filterbank features with context frames, spatially reshaped | Standard Kaldi s5 training set (3696 sentences) | Core Test set | 10% of training set (random) |
| 6 | 16.8 | An investigation into instantaneous frequency estimation methods for improved speech recognition features | Nov. 2017 | GlobalSIP 2017 | Hybrid (discriminative training using alignments from a generative GMM-HMM) | MFCC features with IFCC features | Standard Kaldi s5 training set (3696 sentences, assumed) | Core Test set | s5 dev set (assumed) |
| 7 | 17.3 | Segmental Recurrent Neural Networks for End-to-End Speech Recognition | Mar. 2016 | Interspeech 2016 | Discriminative (end-to-end training) | 24-dim log mel filterbank features with context information | Standard Kaldi s5 training set (3696 sentences) | Core Test set | s5 dev set |
| 8 | 17.6 | Attention-Based Models for Speech Recognition | June 2015 | NIPS 2015 | Discriminative (end-to-end training) | 40-dim mel filterbank features with energy, deltas and double-deltas | Standard Kaldi s5 training set (3696 sentences) | Core Test set | s5 dev set |
| 9 | 17.7 | Speech Recognition with Deep Recurrent Neural Networks | Mar. 2013 | ICASSP 2013 | Discriminative (end-to-end training) | 40-dim mel filterbank features with energy, deltas and double-deltas | Standard Kaldi s5 training set (3696 sentences) | Core Test set | s5 dev set |
| 10 | 18.0 | Learning Filterbanks from Raw Speech for Phone Recognition | Oct. 2017 | ICASSP 2018 | Discriminative (end-to-end training) | Mel filterbanks approximated from raw speech with time-domain filterbanks, fine-tuned with CNNs | Standard Kaldi s5 training set (3696 sentences) | Core Test set | s5 dev set |
| 11 | 23.0 | Deep Belief Networks for Phone Recognition | Dec. 2009 | NIPS 2009 | Hybrid (discriminative training using alignments from a generative GMM-HMM) | 13-dim MFCC features with deltas and double-deltas | Standard Kaldi s5 training set (3696 sentences) | Core Test set | s5 dev set |

## Notes on Models Used

Detailed notes about the model used in each of the above papers.

| S.No. | PER | Paper name | Details of model and training |
| --- | --- | --- | --- |
| 1 | 13.8 | The PyTorch-Kaldi Speech Recognition Toolkit | DNN-HMM hybrid architecture, but with the regular DNN replaced by a combination of three networks (MLP, Li-GRU and MLP) estimating the posterior probabilities of phones given the utterance features; the DNNs were trained discriminatively, as per standard practice for DNN-HMM |
| 2 | 14.9 | Light Gated Recurrent Units for Speech Recognition | End-to-end CTC training; removed the reset gate of the GRU, used ReLU activations instead of tanh, and added batch normalization |
| 3 | 16.5 | Phone Recognition with Hierarchical Convolutional Deep Maxout Networks | DNN-HMM hybrid architecture, with a hierarchical maxout CNN + dropout estimating the posterior probabilities of phones (61 phone labels for training, 39 for testing) |
| 4 | 16.5 | A Regularization Post Layer: An Additional Way How to Make Deep Neural Networks Robust | DBN-DNN with last-layer regularization (the regularization post layer is the notable contribution) (48 phone labels for training, 39 for testing) |
| 5 | 16.7 | Combining time- and frequency-domain convolution in convolutional neural network based phone recognition | DNN-HMM hybrid architecture, with a CNN over time and frequency + dropout estimating the posterior probabilities of phones (61 phone labels for training, 39 for testing) |
| 6 | 16.8 | An investigation into instantaneous frequency estimation methods for improved speech recognition features | DNN-HMM hybrid architecture estimating the posterior probabilities of phones (61 phone labels for training, 39 for testing) |
| 7 | 17.3 | Segmental Recurrent Neural Networks for End-to-End Speech Recognition | Recurrent neural network combined with conditional random fields for end-to-end discriminative training (48 phone labels for training, 39 for testing) |
| 8 | 17.6 | Attention-Based Models for Speech Recognition | Bidirectional recurrent neural network combined with an attention mechanism for end-to-end discriminative training (full 61 phone labels for training, 39 for testing) |
| 9 | 17.7 | Speech Recognition with Deep Recurrent Neural Networks | Bidirectional LSTM with skip connections, using the RNN Transducer architecture, which also provides conditioning information based on previous labels; trained end-to-end with CTC (61 phone labels for training, 39 for testing) |
| 10 | 18.0 | Learning Filterbanks from Raw Speech for Phone Recognition | Starting from a bank of time-domain Gabor filters that approximates a mel filterbank output, learns the entire feature pipeline, from pre-emphasis to averaging, with CNNs (39 phone labels for both training and testing) |
| 11 | 23.0 | Deep Belief Networks for Phone Recognition | Deep belief network combined with an HMM to estimate the posterior probabilities of phones given the utterance features; the DBNs were trained discriminatively (61 phone labels for training, 39 for testing) |
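As a rough illustration of the Li-GRU cell from paper 2 above (reset gate removed, ReLU candidate activation instead of tanh), one recurrent step might look like the sketch below. Batch normalization and all dimensions/weights are omitted or made up here for brevity; this is not the authors' implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ligru_step(x_t, h_prev, Wz, Uz, Wh, Uh):
    """One Li-GRU step: update gate only (no reset gate), ReLU candidate."""
    z = sigmoid(Wz @ x_t + Uz @ h_prev)               # update gate
    h_cand = np.maximum(0.0, Wh @ x_t + Uh @ h_prev)  # ReLU candidate state
    return z * h_prev + (1.0 - z) * h_cand            # interpolate old/new

# Toy dimensions and random weights, just to run the cell over a few frames.
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
Wz, Wh = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_in))
Uz, Uh = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_h))

h = np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):  # five fake feature frames
    h = ligru_step(x, h, Wz, Uz, Wh, Uh)
```

Because the gate `z` interpolates between the previous state and the candidate, the state stays bounded as long as the candidate is; dropping the reset gate roughly halves the recurrent parameter count relative to a standard GRU.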

## Description of Feature Abbreviations

- MFCC: Mel Frequency Cepstral Coefficients
- FBANK: Mel filterbank coefficients (like MFCC, but without applying the DCT). Also referred to as Mel Frequency Spectral Coefficients (MFSC) in some papers.
- fMLLR: Feature-space Maximum Likelihood Linear Regression
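The FBANK/MFCC relationship above can be sketched in a few lines: MFCCs are (roughly) a DCT-II applied to the log mel filterbank energies, keeping the first few coefficients. The energy values below are made up, and this assumes scipy is available:

```python
import numpy as np
from scipy.fftpack import dct

# Hypothetical mel filterbank energies for a single frame (8 filters).
mel_energies = np.array([2.1, 1.8, 3.5, 2.9, 1.2, 0.7, 1.1, 0.9])

# FBANK features: just the log of the mel filterbank energies.
fbank = np.log(mel_energies)

# MFCC features: DCT-II of the log energies to decorrelate them,
# truncated to the first few coefficients (here 4).
mfcc = dct(fbank, type=2, norm="ortho")[:4]
```

In practice toolkits like Kaldi also apply liftering and optionally append deltas, but the DCT truncation is the essential difference between the two feature types.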

## Description of the Standard Kaldi s5 Recipe

NOTE: s5 means version 5, currently the latest version of the Kaldi example recipes.

This is taken from the RESULTS file in the official Kaldi repository: https://github.com/kaldi-asr/kaldi/blob/master/egs/timit/s5/RESULTS.md

- Training set: 4620 sentences spoken by 462 speakers; excluding the dialect (SA) sentences leaves 3696 sentences. Many papers use only the 3696 utterances even when they mention the "full" training set (this is not always made clear).
- Validation / dev set: 400 sentences
- Test set: the Core Test set consists of 192 sentences spoken by 24 speakers (8 sentences per speaker); the complete test set contains 1680 sentences.
- Phone mapping: training uses 48 phonemes (TIMIT originally distinguishes 61 phonemes, which are mapped down to 48 as described in [1][2]).
- Language model: a bigram phoneme language model estimated from the training set.

### Additional Notes

In many cases, "tested on 39 phones" means that decoding was carried out with the same label set used in training (i.e. 61 or 48), but for scoring the results were mapped down to a 39-label set as described in [2], and only then was the PER computed.
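That scoring step amounts to a simple label folding applied to both the reference and the hypothesis before computing the PER. The mapping below is only an illustrative excerpt, not the complete Lee & Hon [2] table:

```python
# Illustrative excerpt of a 48 -> 39 scoring fold (see Lee & Hon [2] for
# the full table); unlisted phones map to themselves.
fold_to_39 = {
    "ao": "aa", "ax": "ah", "el": "l", "en": "n",
    "zh": "sh", "cl": "sil", "vcl": "sil", "epi": "sil",
}

def to_39(phones):
    """Map a decoded phone sequence onto the 39-label scoring set."""
    return [fold_to_39.get(p, p) for p in phones]

ref = ["sil", "ao", "l", "en"]   # reference, 48-label set
hyp = ["cl", "aa", "el", "n"]    # hypothesis, 48-label set
# Before folding these differ in every position; after folding both
# become ["sil", "aa", "l", "n"], so the scored PER is 0%.
```

This is why the training label set (61 or 48) can differ across papers while the reported PERs remain comparable: the scoring always happens in the folded 39-label space.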

## Paper References

[1] Lopes, Carla, and Fernando Perdigao. "Phone recognition on the TIMIT database." Speech Technologies/Book 1 (2011): 285-302.

[2] Lee, K-F., and H-W. Hon. "Speaker-independent phone recognition using hidden Markov models." IEEE Transactions on Acoustics, Speech, and Signal Processing 37.11 (1989): 1641-1648.