This repository has been archived by the owner on Feb 17, 2024. It is now read-only.
I found the SentencePiece documentation helpful, and I generally use this bash script to encode/decode a corpus (from lopuhin/transformer-lm).
Prepare data for training
Corpus format: a directory with top-level train, valid and test folders. Each top-level folder may contain sub-folders. Inside them, there must be UTF-8 encoded text files with a .txt extension.
The commands to train the sentencepiece model and encode the corpus support multiple corpora; in the examples below we assume they can be listed as data/corpora-*.
Train the sentencepiece model (sp-text.txt can be removed after running). This can consume a large amount of memory; adjust the sentencepiece arguments as advised if needed (this is not supported directly in the sp-train command):
Hi,
Thanks for sharing your good work.
Could you detail how to train the mT5 sentencepiece tokenizer?
Thanks.