Predicting responses in a dialogue: A dual encoder replication in Keras

This is a Keras implementation of the dual encoder architecture used in "The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems".

The paper details the construction of the Ubuntu Dialogue Corpus, an open source dataset of dialogues extracted from Ubuntu-related chat rooms. The authors also use the dataset to benchmark the task of predicting the next utterance (the reply) in a dialogue. For this task they link two recurrent neural networks together, as shown below, and feed them text sequences embedded with Global Vectors for Word Representation (GloVe).

The dual encoder architecture
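
For reference, the paper cited above scores a context/response pair with the formula below, where c and r are the final hidden states of the context and response encoders, M is a learned matrix, b is a learned bias, and σ is the logistic sigmoid. This describes the paper's model; whether this replication uses exactly the same parametrisation is not stated here, so treat it as background rather than a description of the code.

```latex
p(\text{response is correct} \mid c, r) = \sigma\left(c^{\top} M r + b\right)
```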

The data is available in raw form, together with a script that stems and lemmatizes it and inserts some special tokens, e.g. __eou__ marks the end of an utterance in the dialogue. You can download the stemmed and lemmatized data here.
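
To illustrate what the special token looks like in practice, here is a minimal sketch that splits a context string on `__eou__`. The function name `split_utterances` and the sample string are hypothetical and not part of this repository.

```python
from typing import List

EOU = "__eou__"  # special token inserted by the corpus preprocessing script


def split_utterances(context: str) -> List[str]:
    """Split a preprocessed context string into its individual utterances."""
    parts = (part.strip() for part in context.split(EOU))
    # Drop empty pieces, e.g. the one left after a trailing __eou__.
    return [part for part in parts if part]


# Hypothetical example string in the style of the preprocessed corpus.
example = "hi there __eou__ anyone know how to fix apt ? __eou__"
utterances = split_utterances(example)
```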

I apply further preprocessing by running python ./utilities/prepare_data.py to make training more manageable on my machine. I limit every text sequence to a length of 100 and tokenize such that a word must appear at least 6 times to be included in the vocabulary.
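
The two restrictions above can be sketched in plain Python as follows. This is not the code in `prepare_data.py`: the helper names, the reserved padding and unknown-word indices, whitespace tokenization, and the choice to keep the first 100 tokens and pad at the end are all assumptions made for illustration.

```python
from collections import Counter

MAX_LEN = 100    # maximum sequence length, as stated above (assumed to be in tokens)
MIN_COUNT = 6    # minimum number of occurrences for a word to enter the vocabulary
PAD, UNK = 0, 1  # assumed reserved indices for padding and out-of-vocabulary words


def build_vocab(texts, min_count=MIN_COUNT):
    """Map every word seen at least min_count times to an integer index."""
    counts = Counter(tok for text in texts for tok in text.split())
    kept = sorted(word for word, n in counts.items() if n >= min_count)
    return {word: idx for idx, word in enumerate(kept, start=2)}


def encode(text, vocab, max_len=MAX_LEN):
    """Convert text to exactly max_len indices: truncate, then pad at the end."""
    ids = [vocab.get(tok, UNK) for tok in text.split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))


# Toy corpus: "apt" occurs 6 times, "rare" only once.
corpus = ["apt apt apt", "apt apt apt", "rare"]
vocab = build_vocab(corpus)
encoded = encode("apt rare", vocab, max_len=4)
```

Raising `MIN_COUNT` shrinks the vocabulary and the embedding matrix, and lowering `MAX_LEN` shortens the recurrent unrolling, which is why both make training lighter at some cost in accuracy.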

The Results

The test set pairs each context with the correct response and 9 false responses. The metric the paper uses is recall at k: the proportion of test examples for which the true response is among the k candidates with the highest predicted probability.
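
Recall at k as defined above can be computed with a few lines of Python. This is a minimal sketch, assuming the scores for each test example are stored as a list with the true response at index 0; the function name `recall_at_k` and that layout are assumptions, not taken from this repository.

```python
def recall_at_k(scores, k, true_index=0):
    """Fraction of examples whose true response is among the k highest scores.

    scores: list of lists, one inner list of predicted probabilities per
            test example (true response assumed to sit at true_index).
    """
    hits = 0
    for example in scores:
        # Candidate indices ordered from highest to lowest predicted score.
        ranked = sorted(range(len(example)), key=lambda i: example[i], reverse=True)
        if true_index in ranked[:k]:
            hits += 1
    return hits / len(scores)


# Toy predictions for three contexts with three candidates each.
toy_scores = [
    [0.9, 0.1, 0.3],  # true response ranked 1st
    [0.2, 0.8, 0.5],  # true response ranked 3rd
    [0.4, 0.6, 0.1],  # true response ranked 2nd
]
r1 = recall_at_k(toy_scores, k=1)
r2 = recall_at_k(toy_scores, k=2)
```

Under the same assumption, a "1 in 2" figure would come from passing only the true response and a single false response per example.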

| Metric | Paper | Replication |
| --- | --- | --- |
| 1 in 2 Recall @ 1 | 87.8% | 87.3% |
| 1 in 10 Recall @ 1 | 60.4% | 55.1% |
| 1 in 10 Recall @ 2 | 74.5% | 73.2% |
| 1 in 10 Recall @ 5 | 92.6% | 93.5% |

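Given the restrictions I imposed during preprocessing (sequence length capped at 100, minimum word count of 6), the replication comes close to the paper's results. Three of the four metrics are within 1.5 percentage points; the largest gap is 5.3 points on 1 in 10 Recall @ 1.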

References

  1. R. Lowe, N. Pow, I. Serban, and J. Pineau. The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems. Proceedings of the Meeting of the Special Interest Group on Dialogue and Discourse, 2015.
  2. R. Kadlec, M. Schmid, and J. Kleindienst. Improved Deep Learning Baselines for Ubuntu Corpus Dialogs. arXiv preprint arXiv:1510.03753, 2015.
  3. http://www.wildml.com/2016/07/deep-learning-for-chatbots-2-retrieval-based-model-tensorflow/
  4. https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html
  5. J. Pennington, R. Socher, and C.D. Manning. GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014.