Mostly adapted from this excellent repo: https://github.com/syhw/wer_are_we, but a bit more descriptive, for my own research (NOTE: PER refers to Phone Error Rate, reported in %). This file is best viewed in the VSCode Markdown preview.
S.No. | PER | Paper name | Published date | Publisher | Type of Training | Features Used | Training set | Testing set | Validation set |
---|---|---|---|---|---|---|---|---|---|
1 | 13.8 | The PyTorch-Kaldi Speech Recognition Toolkit | Feb. 2019 | ICASSP 2019 | Hybrid (Discriminative training using alignments from a generative GMM-HMM) | Combination of MFCC features + FBANK features + fMLLR features | Standard Kaldi s5 training set | Core Test set | As per Kaldi s5 |
2 | 14.9 | Light Gated Recurrent Units for Speech Recognition | Apr. 2018 | IEEE Trans. 2018 | Results present for Hybrid and Discriminative (End-to-End training) | Combination of Mel filterbank coefficients with energy + their velocity (delta) + acceleration (double-delta) coefficients | Standard Kaldi s5 training set | Core Test set | Standard s5 dev set |
3 | 16.5 | Phone Recognition with Hierarchical Convolutional Deep Maxout Networks | Sep. 2015 | EURASIP Journal 2015 | Hybrid (Discriminative training using alignments from a generative GMM-HMM) | Mel-Filterbank features with context frames, spatially reshaped | Standard Kaldi s5 training set (3696 sentences) | Core Test set | 10 % of training set (random) |
4 | 16.5 | A Regularization Post Layer: An Additional Way: how to Make Deep Neural Networks Robust | Oct. 2017 | ICSLSP 2017 | Hybrid (Discriminative training using alignments from a generative GMM-HMM) | MFCC+fMLLR features with deltas, double-deltas | Standard Kaldi s5 training set (3696 sentences) | Core Test set | s5 dev set |
5 | 16.7 | Combining time-and frequency domain convolution in convolutional neural network based phone recognition | May 2014 | ICASSP 2014 | Hybrid (Discriminative training using alignments from a generative GMM-HMM) | Mel-Filterbank features with context frames, spatially reshaped | Standard Kaldi s5 training set (3696 sentences) | Core Test set | 10 % of training set (random) |
6 | 16.8 | An investigation into instantaneous frequency estimation methods for improved speech recognition features | Nov. 2017 | GlobalSIP 2017 | Hybrid (Discriminative training using alignments from a generative GMM-HMM) | MFCC features with IFCC features | Standard Kaldi s5 training set (3696 sentences, assumed) | Core Test set | s5 dev set (assumed) |
7 | 17.3 | Segmental Recurrent Neural Networks for End-To-End Speech Recognition | Mar. 2016 | Interspeech 2016 | Discriminative (End-to-End training) | 24-dim log-mel filterbank features with context information | Standard Kaldi s5 training set (3696 sentences) | Core Test set | s5 dev set |
8 | 17.6 | Attention-Based models for Speech Recognition | June 2015 | NIPS 2015 | Discriminative (End-to-End training) | 40-dim mel filterbank features with energy along with deltas and double-deltas | Standard Kaldi s5 training set (3696 sentences) | Core Test set | s5 dev set |
9 | 17.7 | Speech Recognition with Deep Recurrent Neural Networks | Mar. 2013 | ICASSP 2013 | Discriminative (End-to-End training) | 40-dim mel filterbank features with energy along with deltas and double-deltas | Standard Kaldi s5 training set (3696 sentences) | Core Test set | s5 dev set |
10 | 18.0 | Learning Filterbanks from Raw speech for Phone recognition | Oct. 2017 | ICASSP 2018 | Discriminative (End-to-End training) | Approximating mel-filterbanks from raw-speech using Time-Domain filterbanks and fine tuned with CNNs | Standard Kaldi s5 training set (3696 sentences) | Core Test set | s5 dev set |
11 | 23.0 | Deep Belief Networks for Phone Recognition | Dec. 2009 | NIPS 2009 | Hybrid (Discriminative training using alignments from a generative GMM-HMM) | 13-dim MFCC features with deltas and double-deltas | Standard Kaldi s5 training set (3696 sentences) | Core Test set | s5 dev set |
This contains some detailed notes about the model used in each of the above papers.
S.No. | PER | Paper name | Details of model used + Training details |
---|---|---|---|
1 | 13.8 | The PyTorch-Kaldi Speech Recognition Toolkit | Used DNN-HMM hybrid architecture, but instead of a regular DNN used a combination of 3 networks: MLP, Li-GRU and MLP to estimate posterior probabilities of phones, given the utterance features (DNNs were trained discriminatively as per standard practices for DNN-HMM) |
2 | 14.9 | Light Gated Recurrent Units for Speech Recognition | End-to-End CTC training, removed reset gate of GRU, ReLU activations instead of tanh + Batch Normalization |
3 | 16.5 | Phone Recognition with Hierarchical Convolutional Deep Maxout Networks | Used DNN-HMM hybrid architecture, but instead of a regular DNN used Hierarchical maxout CNN + Dropout to estimate posterior probabilities of phones (used 61 phone labels while training, 39 while testing) |
4 | 16.5 | A Regularization Post Layer: An Additional Way: how to Make Deep Neural Networks Robust | Used DBN-DNN with last layer regularization (Regularization layer was a notable contribution) (used 48 phone labels while training, 39 while testing) |
5 | 16.7 | Combining time-and frequency domain convolution in convolutional neural network based phone recognition | Used DNN-HMM hybrid architecture, but instead of a regular DNN used CNN in time and frequency + dropout to estimate posterior probabilities of phones (used 61 phone labels while training, 39 while testing) |
6 | 16.8 | An investigation into instantaneous frequency estimation methods for improved speech recognition features | Used DNN-HMM hybrid architecture to estimate posterior probabilities of phones (used 61 phone labels while training, 39 while testing) |
7 | 17.3 | Segmental Recurrent Neural Networks for End-To-End Speech Recognition | Used a Recurrent Neural Network combined with Conditional Random Fields for end-to-end discriminative training (used 48 phone labels while training, 39 while testing) |
8 | 17.6 | Attention-Based models for Speech Recognition | Used a Bi-directional Recurrent Neural Network combined with Attention mechanism for end-to-end discriminative training (used full 61 phone labels while training, 39 while testing) |
9 | 17.7 | Speech Recognition with Deep Recurrent Neural Networks | Used a Bi-directional LSTM combined with skip connections and a special type of architecture called RNN Transducer that also provides some conditioning information based on labels. Uses end-to-end CTC (used 61 phone labels while training, 39 while testing) |
10 | 18.0 | Learning Filterbanks from Raw speech for Phone recognition | Using a combination of time-domain Gabor filters to provide an approximate mel-filterbank output at the start, this method tries to learn the entire feature starting from pre-emphasis to averaging using CNNs. Used 39 phone labels while training, 39 while testing |
11 | 23.0 | Deep Belief Networks for Phone Recognition | Used a Deep Belief Network combined with HMM for estimating posterior probabilities of phones, given the utterance features (DBNs were trained discriminatively) (used 61 phone labels while training, 39 while testing) |
- MFCC : Mel Frequency Cepstral Coefficients
- FBANK: Mel Filterbank Coefficients (computed like MFCCs, but without applying the final DCT). Also referred to as Mel Frequency Spectral Coefficients (MFSC) in some papers.
- fMLLR: Feature space Maximum Likelihood Linear Regression
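The relation between FBANK and MFCC noted above can be sketched in a few lines: MFCC-style coefficients are obtained by applying a DCT to the log mel filterbank energies. This is only a minimal illustration using an unnormalized DCT-II in pure Python; the energies are made-up values, the helper name `dct_ii` is hypothetical, and real front ends (Kaldi included) typically add steps such as normalization and liftering that are not shown here.

```python
import math


def dct_ii(values, num_coeffs):
    """Unnormalized DCT-II of `values`, keeping the first `num_coeffs` outputs."""
    n = len(values)
    return [
        sum(values[k] * math.cos(math.pi * i * (k + 0.5) / n) for k in range(n))
        for i in range(num_coeffs)
    ]


# Hypothetical mel filterbank energies for a single frame (4 filters only).
fbank_energies = [1.0, 2.0, 4.0, 8.0]

# FBANK features: log of the filterbank energies.
log_fbank = [math.log(e) for e in fbank_energies]

# MFCC-like features: DCT of the log filterbank energies, truncated to 3 coefficients.
mfcc_like = dct_ii(log_fbank, 3)
print(mfcc_like)
```

The zeroth coefficient is just the sum of the log energies, which is why it is often treated as an energy term.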
NOTE: s5 basically means Version 5, which is the latest version now for Kaldi examples.
The setup below is taken from the RESULTS file in the official Kaldi repository at https://github.com/kaldi-asr/kaldi/blob/master/egs/timit/s5/RESULTS.md
- Training set: 4620 sentences, or 3696 sentences if the two dialect (SA) sentences spoken by each of the 462 training speakers are excluded (4620 - 2 × 462 = 3696). Many papers use only these 3696 utterances when they mention the "full" training set, although in some cases this is not stated clearly.
- Validation / Dev set: 400 sentences
- Test set: 192 sentences spoken by 24 speakers (8 sentences spoken per speaker) constitute the Core Test set. The total test set is composed of 1680 sentences
- Phone mapping: Training with 48 phonemes (originally TIMIT considers 61 phonemes, but these are mapped down to 48 phonemes as cited in [1][2])
- Language Model: Bigram phoneme language model which is extracted from the training set
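As a rough illustration of the last item, a bigram phone language model can be estimated from training transcripts by counting adjacent phone pairs. The sketch below uses plain maximum-likelihood estimates with no smoothing, on made-up toy transcripts; the function name `train_bigram` and the `<s>` / `</s>` boundary symbols are assumptions for illustration and do not reflect how the Kaldi recipe actually builds its model.

```python
from collections import Counter


def train_bigram(transcripts):
    """Return P(next | prev) for every phone pair seen in `transcripts`."""
    history_counts = Counter()
    pair_counts = Counter()
    for phones in transcripts:
        # Add sentence boundary symbols (hypothetical names).
        seq = ["<s>"] + list(phones) + ["</s>"]
        history_counts.update(seq[:-1])
        pair_counts.update(zip(seq[:-1], seq[1:]))
    # Unsmoothed maximum-likelihood estimate: count(prev, next) / count(prev).
    return {
        (prev, nxt): count / history_counts[prev]
        for (prev, nxt), count in pair_counts.items()
    }


# Toy phone transcripts, not taken from TIMIT.
toy_transcripts = [
    ["sil", "sh", "iy", "sil"],
    ["sil", "sh", "ae", "sil"],
]
lm = train_bigram(toy_transcripts)
print(lm[("sil", "sh")])
```

A practical model would also need smoothing or back-off so that unseen phone pairs do not receive zero probability.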
In many cases, when a paper says it was "tested on 39 phones", what is generally meant is that decoding was carried out with the same number of labels as in training (i.e. 61 or 48), but for scoring purposes the results were mapped to the 39-label set described in [2] before the PER was computed.
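The scoring procedure described above can be sketched as follows: fold both the reference and the hypothesis into the reduced label set, then compute PER as the edit distance (substitutions + deletions + insertions) divided by the reference length, in %. The `FOLD` table here is only a small illustrative subset of the folding, so consult [2] for the complete mapping; the function names and example sequences are hypothetical.

```python
# Illustrative subset of the label folding used for scoring (see [2] for the full table).
FOLD = {"ao": "aa", "ax": "ah", "ix": "ih", "zh": "sh", "el": "l", "en": "n"}


def fold(phones):
    """Map each phone to its folded label, leaving unlisted phones unchanged."""
    return [FOLD.get(p, p) for p in phones]


def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def phone_error_rate(ref, hyp):
    """PER in %, computed after folding both sequences."""
    ref_folded, hyp_folded = fold(ref), fold(hyp)
    return 100.0 * edit_distance(ref_folded, hyp_folded) / len(ref_folded)


# Toy example: "zh" and "ax" in the hypothesis are folded before scoring.
reference = ["sh", "iy", "hh", "ae", "d"]
hypothesis = ["zh", "iy", "ae", "d", "ax"]
print(phone_error_rate(reference, hypothesis))
```

In this toy case the folded hypothesis differs from the reference by one deletion and one insertion, whereas scoring on the unfolded labels would also count "zh" versus "sh" as a substitution.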
[1] Lopes, Carla, and Fernando Perdigao. "Phone recognition on the TIMIT database." Speech Technologies/Book 1 (2011): 285-302.
[2] Lee, K-F., and H-W. Hon. "Speaker-independent phone recognition using hidden Markov models." IEEE Transactions on Acoustics, Speech, and Signal Processing 37.11 (1989): 1641-1648.