
how to train the sentencepiece tokenizer #47

Open
world2vec opened this issue Jan 15, 2021 · 1 comment


@world2vec

Hi,
Thanks for sharing your good work.
Could you explain how to train the mT5 SentencePiece tokenizer?
Thanks.

@prestonfrasch

Hi world2vec,

I found the SentencePiece documentation helpful, and I generally use the commands below to encode/decode a corpus (from lopuhin/transformer-lm).

Prepare data for training
Corpus format: a directory with top-level train, valid, and test folders. Each top-level folder may contain sub-folders. Inside them, there must be UTF-8 encoded text files with a .txt extension.
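For instance, a corpus directory might look like this (all names are illustrative):

data/corpora-example/
    train/
        part-0.txt
        part-1.txt
    valid/
        part-0.txt
    test/
        part-0.txt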

The commands to train the sentencepiece model and encode the corpus support multiple corpora; in the examples below we assume they can be listed as data/corpora-*.

Train the sentencepiece model (sp-text.txt can be removed after running). This can consume a large amount of memory; if needed, adjust the sentencepiece arguments as advised in the SentencePiece docs (the sp-train command does not expose them directly):

sp-train data/corpora-* sp-text.txt sp-model
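If you do need to tune trainer arguments (for example, to bound memory), here is a minimal sketch that calls the SentencePiece Python API directly instead of sp-train. The file names, vocab size, and coverage value are illustrative assumptions on my part; mT5's released tokenizer is a unigram SentencePiece model with a roughly 250k vocabulary, hence model_type="unigram" below.

import sentencepiece as spm

# Minimal sketch, not the repo's sp-train: all values below are assumptions.
spm.SentencePieceTrainer.train(
    input="sp-text.txt",             # one sentence per line, UTF-8
    model_prefix="sp-model",         # writes sp-model.model and sp-model.vocab
    vocab_size=32000,                # hypothetical; mT5 itself uses ~250k
    model_type="unigram",            # the model family mT5's tokenizer uses
    character_coverage=0.9995,       # common choice for multilingual corpora
    input_sentence_size=10_000_000,  # cap sentences loaded, to bound memory
    shuffle_input_sentence=True,     # sample sentences uniformly when capping
)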
Encode corpora, producing numpy files:

sp-encode data/corpora-* sp-model.model data/encoded
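For reference, here is roughly what an encode step could look like with the plain SentencePiece Python API plus numpy. The paths and the flat uint32 layout are assumptions for illustration; sp-encode's actual on-disk format may differ.

import os

import numpy as np
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="sp-model.model")

# Hypothetical input file; encode() returns a list of token ids per line.
with open("data/corpora-example/train/part-0.txt", encoding="utf-8") as f:
    ids = [sp.encode(line) for line in f]

# Flatten into one uint32 array -- a simple possible layout, not necessarily
# what sp-encode writes.
os.makedirs("data/encoded", exist_ok=True)
flat = np.fromiter((t for seq in ids for t in seq), dtype=np.uint32)
np.save("data/encoded/train.npy", flat)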

Hope that's helpful!

Cheers,
Preston
