Commit ecb6a76

Merge pull request #16 from helpmefindaname/fix-special-token-handling
Fix handling of special tokens for tokenizers that have strange buildups.
helpmefindaname authored Jul 8, 2024
2 parents 8eadc47 + 967fed0 · commit ecb6a76
Showing 2 changed files with 2 additions and 2 deletions.
tests/test_set_tokenizer_vocab.py (1 addition, 0 deletions)
@@ -20,6 +20,7 @@
     ("microsoft/layoutlm-large-uncased", "WordPiece"),
     ("microsoft/layoutlm-base-cased", "BPE"),
     ("xlm-roberta-large", "Unigram"),
+    ("sentence-transformers/all-mpnet-base-v2", "WordPiece"),
 ]
 unsupported_tokenizers = ["google/electra-small-discriminator"]

transformer_smaller_training_vocab/token_stats.py (1 addition, 2 deletions)
@@ -8,8 +8,7 @@ def get_token_stats(
     tokenizer: PreTrainedTokenizer,
     texts: Sequence[Union[TextInput, PreTokenizedInput, TextInputPair, PreTokenizedInputPair]],
 ) -> List[int]:
-    used = set()
-    used.update(tokenizer.all_special_ids)
+    used = {token_id for token_id, token in tokenizer.added_tokens_decoder.items() if token.special}
     for text in texts:
         if isinstance(text, tuple):
             encoding = tokenizer(text[0], text[1], is_split_into_words=isinstance(text[0], list))
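To illustrate what this change does in practice, here is a minimal diagnostic sketch (an illustration, not part of the repository). It assumes a transformers release recent enough that added_tokens_decoder maps token ids to AddedToken objects carrying a special flag. For tokenizers with unusual vocabulary buildups, such as the newly tested sentence-transformers/all-mpnet-base-v2, the set derived from added_tokens_decoder can differ from all_special_ids, which is the situation this commit guards against.

# Hypothetical diagnostic snippet, not part of the repository.
# Assumes a recent transformers release where `added_tokens_decoder`
# returns a dict mapping token id -> AddedToken with a `.special` flag.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")

# Old approach (removed in this commit): the ids the tokenizer advertises as special.
old_special_ids = set(tokenizer.all_special_ids)

# New approach (added in this commit): every added token explicitly flagged as special.
new_special_ids = {
    token_id
    for token_id, token in tokenizer.added_tokens_decoder.items()
    if token.special
}

# For most tokenizers both sets agree; when they diverge, the new approach
# keeps every token flagged as special in the used-token statistics.
print("only in all_special_ids:      ", old_special_ids - new_special_ids)
print("only in added_tokens_decoder: ", new_special_ids - old_special_ids)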
