
how to train the sentencepiece tokenizer #47

Open
world2vec opened this issue Jan 15, 2021 · 1 comment


@world2vec

Hi,
Thanks for sharing your good work.
Could you explain how to train the mT5 SentencePiece tokenizer?
Thanks.

@prestonfrasch

Hi world2vec,

I found the SentencePiece documentation helpful, and I generally use the commands below to encode/decode a corpus (from lopuhin/transformer-lm).

Prepare data for training
Corpus format: a directory with top-level train, valid, and test folders. Each top-level folder may contain sub-folders. Inside them, there must be UTF-8 encoded text files with a .txt extension.
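For instance, a corpus directory might look like this (all names are illustrative):

data/corpora-example/
    train/
        part-0.txt
        part-1.txt
    valid/
        part-0.txt
    test/
        part-0.txt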

The commands to train the sentencepiece model and encode the corpus support multiple corpora; in the examples below we assume they can be listed as data/corpora-*.

Train the sentencepiece model (sp-text.txt can be removed after running). This can consume a large amount of memory; if needed, adjust the sentencepiece arguments as advised in the SentencePiece docs (the sp-train command does not expose them directly):

sp-train data/corpora-* sp-text.txt sp-model
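If you do need to tune trainer arguments (for example, to bound memory), here is a minimal sketch that calls the SentencePiece Python API directly instead of sp-train. The file names, vocab size, and coverage value are illustrative assumptions on my part; mT5's released tokenizer is a unigram SentencePiece model with a roughly 250k vocabulary, hence model_type="unigram" below.

import sentencepiece as spm

# Minimal sketch, not the repo's sp-train: all values below are assumptions.
spm.SentencePieceTrainer.train(
    input="sp-text.txt",             # one sentence per line, UTF-8
    model_prefix="sp-model",         # writes sp-model.model and sp-model.vocab
    vocab_size=32000,                # hypothetical; mT5 itself uses ~250k
    model_type="unigram",            # the model family mT5's tokenizer uses
    character_coverage=0.9995,       # common choice for multilingual corpora
    input_sentence_size=10_000_000,  # cap sentences loaded, to bound memory
    shuffle_input_sentence=True,     # sample sentences uniformly when capping
)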
Encode corpora, producing numpy files:

sp-encode data/corpora-* sp-model.model data/encoded
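For reference, here is roughly what an encode step could look like with the plain SentencePiece Python API plus numpy. The paths and the flat uint32 layout are assumptions for illustration; sp-encode's actual on-disk format may differ.

import os

import numpy as np
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="sp-model.model")

# Hypothetical input file; encode() returns a list of token ids per line.
with open("data/corpora-example/train/part-0.txt", encoding="utf-8") as f:
    ids = [sp.encode(line) for line in f]

# Flatten into one uint32 array -- a simple possible layout, not necessarily
# what sp-encode writes.
os.makedirs("data/encoded", exist_ok=True)
flat = np.fromiter((t for seq in ids for t in seq), dtype=np.uint32)
np.save("data/encoded/train.npy", flat)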

Hope that's helpful!

Cheers,
Preston
