
GPT-2 encoder breaks in new version of PyTorch/huggingface #13

Open

yaroslavvb opened this issue May 28, 2019 · 2 comments

Comments

@yaroslavvb
After switching to the pytorch_april_patched environment and installing dependencies with pip install -r requirements.txt, encoding the wiki dataset fails:

```
Producing dataset wiki...
encoding file testdata/wikiextracted/AA/wiki_01.txt ...
Traceback (most recent call last):
  File "train.py", line 1036, in <module>
    eval(f'test_{g.args.test}()')
  File "<string>", line 1, in <module>
  File "train.py", line 940, in test_checkpoint_wiki
    data_setup()
  File "train.py", line 333, in data_setup
    g.corpus = get_lm_corpus(g.args.data, g.args.dataset, use_bpe=g.args.bpe)
  File "/home/ubuntu/data_utils.py", line 381, in get_lm_corpus
    corpus = Corpus(datadir, dataset, use_bpe, **kwargs)
  File "/home/ubuntu/data_utils.py", line 309, in __init__
    self.valid = self.vocab.encode_file(valid_path, ordered=True)
  File "/home/ubuntu/utils/vocabulary.py", line 204, in encode_file
    tokens: List[int] = self.tokenizer.encode(text) + [self.EOT]
  File "/home/ubuntu/anaconda3/envs/pytorch_april_patched/lib/python3.6/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 261, in encode
    return self.convert_tokens_to_ids(self.tokenize(text))
  File "/home/ubuntu/anaconda3/envs/pytorch_april_patched/lib/python3.6/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 224, in tokenize
    token = ''.join(self.byte_encoder[ord(b)] for b in token)
  File "/home/ubuntu/anaconda3/envs/pytorch_april_patched/lib/python3.6/site-packages/pytorch_pretrained_bert/tokenization_gpt2.py", line 224, in <genexpr>
    token = ''.join(self.byte_encoder[ord(b)] for b in token)
KeyError: 8212
```
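
Code point 8212 is U+2014 (em dash). GPT-2's byte-level BPE is meant to map raw UTF-8 byte values (0–255) through `byte_encoder`, but the `tokenize` code above calls `ord(b)` on the characters of an already-decoded `str`, so any character outside Latin-1 yields a code point the 256-entry table doesn't contain. Below is a minimal sketch of the failure and the byte-level fix; `bytes_to_unicode` follows OpenAI's published GPT-2 encoder table, and the sample token is just an illustration:

```python
# Sketch of the failure at tokenization_gpt2.py line 224, plus the fix.
# bytes_to_unicode reproduces the GPT-2 byte->unicode table; the sample
# token is hypothetical.

def bytes_to_unicode():
    """Map every byte value 0-255 to a printable unicode character."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\xa1"), ord("\xac") + 1))
          + list(range(ord("\xae"), ord("\xff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:          # shift the remaining bytes into an
            bs.append(b)         # unused unicode range starting at 256
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte_encoder = bytes_to_unicode()
token = "foo\u2014bar"  # contains U+2014 EM DASH

# Buggy: iterating over a str yields unicode code points, and
# ord("\u2014") == 8212 is not a key of the 256-entry table.
try:
    "".join(byte_encoder[ord(b)] for b in token)
except KeyError as e:
    print("KeyError:", e)  # KeyError: 8212

# Fix: encode to UTF-8 first; iterating over bytes yields ints 0-255,
# so every lookup hits the table (U+2014 becomes bytes e2 80 94).
print("".join(byte_encoder[b] for b in token.encode("utf-8")))
```

This mirrors the approach later taken upstream: encode each token to UTF-8 bytes before the `byte_encoder` lookup, so the input character set can no longer overflow the table.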
@yaroslavvb (Author)

huggingface/transformers#537

@thomwolf commented Jun 8, 2019

Ok, I'll work on this next week for the next (long awaited) release.
