
GPT-2 encoder breaks in new version of PyTorch/huggingface #13

Open

yaroslavvb opened this issue May 28, 2019 · 2 comments

Comments

@yaroslavvb
After switching to the pytorch_april_patched environment and installing dependencies with pip install -r requirements.txt, encoding the wiki dataset fails:

```
Producing dataset wiki...
encoding file testdata/wikiextracted/AA/wiki_01.txt ...
Traceback (most recent call last):
  File "train.py", line 1036, in <module>
    eval(f'test_{g.args.test}()')
  File "<string>", line 1, in <module>
  File "train.py", line 940, in test_checkpoint_wiki
    data_setup()
  File "train.py", line 333, in data_setup
    g.corpus = get_lm_corpus(g.args.data, g.args.dataset, use_bpe=g.args.bpe)
  File "/home/ubuntu/data_utils.py", line 381, in get_lm_corpus
    corpus = Corpus(datadir, dataset, use_bpe, **kwargs)
  File "/home/ubuntu/data_utils.py", line 309, in __init__
    self.valid = self.vocab.encode_file(valid_path, ordered=True)
  File "/home/ubuntu/utils/vocabulary.py", line 204, in encode_file
    tokens: List[int] = self.tokenizer.encode(text) + [self.EOT]
  File "/home/ubuntu/anaconda3/envs/pytorch_april_patched/lib/python3.6/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 261, in encode
    return self.convert_tokens_to_ids(self.tokenize(text))
  File "/home/ubuntu/anaconda3/envs/pytorch_april_patched/lib/python3.6/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 224, in tokenize
    token = ''.join(self.byte_encoder[ord(b)] for b in token)
  File "/home/ubuntu/anaconda3/envs/pytorch_april_patched/lib/python3.6/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 224, in <genexpr>
    token = ''.join(self.byte_encoder[ord(b)] for b in token)
KeyError: 8212
```
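
Code point 8212 is U+2014 (em dash). GPT-2's byte-level BPE is meant to map raw UTF-8 byte values (0–255) through `byte_encoder`, but the `tokenize` code above calls `ord(b)` on the characters of an already-decoded `str`, so any character outside Latin-1 yields a code point the 256-entry table doesn't contain. Below is a minimal sketch of the failure and the byte-level fix; `bytes_to_unicode` follows OpenAI's published GPT-2 encoder table, and the sample token is just an illustration:

```python
# Sketch of the failure at tokenization_gpt2.py line 224, plus the fix.
# bytes_to_unicode reproduces the GPT-2 byte->unicode table; the sample
# token is hypothetical.

def bytes_to_unicode():
    """Map every byte value 0-255 to a printable unicode character."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\xa1"), ord("\xac") + 1))
          + list(range(ord("\xae"), ord("\xff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:          # shift the remaining bytes into an
            bs.append(b)         # unused unicode range starting at 256
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte_encoder = bytes_to_unicode()
token = "foo\u2014bar"  # contains U+2014 EM DASH

# Buggy: iterating over a str yields unicode code points, and
# ord("\u2014") == 8212 is not a key of the 256-entry table.
try:
    "".join(byte_encoder[ord(b)] for b in token)
except KeyError as e:
    print("KeyError:", e)  # KeyError: 8212

# Fix: encode to UTF-8 first; iterating over bytes yields ints 0-255,
# so every lookup hits the table (U+2014 becomes bytes e2 80 94).
print("".join(byte_encoder[b] for b in token.encode("utf-8")))
```

This mirrors the approach later taken upstream: encode each token to UTF-8 bytes before the `byte_encoder` lookup, so the input character set can no longer overflow the table.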
@yaroslavvb (Author)

huggingface/transformers#537

@thomwolf commented Jun 8, 2019

Ok, I'll work on this next week for the next (long awaited) release.
