Entangled in the Scripts

The code for the paper "Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages" by Tomasz Limisiewicz, Jiří Balhar, and David Mareček (the first two authors developed the code).

In the paper, we discuss the importance of multilingual tokenizer properties (vocabulary allocation and cross-lingual overlap) for LM performance in masked language modeling and downstream tasks. We release our code for evaluating these key properties of multilingual tokenizers.

Furthermore, we provide a code walkthrough that includes the steps to reproduce the experiments from the paper. Specifically, we show how to:

  • Download the data (CC100 sample)
  • Train tokenizers (BPE, Unigram, SentencePiece BPE, SentencePiece Unigram)
  • Train multilingual LMs and evaluate them on the masked language modeling task
  • Fine-tune and evaluate the models on downstream tasks (NER, POS, Dependency labeling, XNLI)

Preliminaries

Install the requirements:

pip install -r entangled_in_scripts/requirements.txt

Open the working directory with the scripts:

cd entangled_in_scripts/scripts

Evaluate tokenizers

The script evaluate_tokenizer.py loads a tokenizer (pre-saved or from Hugging Face) and evaluates it on the metrics described in the paper (a minimal sketch of these definitions follows the list):

  • Vocabulary Overlap, measured by the Jensen-Shannon divergence between in-language token distributions;
  • Vocabulary Allocation, measured by the average rank of a token in the in-language distribution;
  • Vocabulary Allocation, measured by the average number of characters per token in a specific language;
  • Coverage, i.e. 1 minus the share of unknown tokens in the tokenized text.
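
The snippet below is an illustrative Python sketch of these definitions computed over already-tokenized texts; it is not the implementation used in evaluate_tokenizer.py.

# Illustrative sketch of the tokenizer metrics above (not the repo's implementation).
from collections import Counter
import math

def distribution(counts):
    """Normalize a token -> count mapping into a probability distribution."""
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence between two in-language token distributions (overlap)."""
    vocab = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}
    def kl(a, b):
        return sum(a[t] * math.log2(a[t] / b[t]) for t in a if a[t] > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def average_rank(counts):
    """Frequency-weighted average rank of tokens in the in-language distribution."""
    ranked = [c for _, c in counts.most_common()]
    return sum(rank * c for rank, c in enumerate(ranked, start=1)) / sum(ranked)

def average_chars_per_token(tokens):
    """Average number of characters per token in the tokenized text."""
    return sum(len(t) for t in tokens) / len(tokens)

def coverage(tokens, unk_token="<unk>"):
    """1 minus the share of unknown tokens in the tokenized text."""
    return 1 - tokens.count(unk_token) / len(tokens)

# toy usage with two already-tokenized texts
en_tokens = ["▁the", "▁cat", "▁sat", "▁on", "<unk>"]
es_tokens = ["▁el", "▁gato", "▁se", "▁sent", "ó"]
en_counts, es_counts = Counter(en_tokens), Counter(es_tokens)
print(js_divergence(distribution(en_counts), distribution(es_counts)))
print(average_rank(en_counts), average_chars_per_token(en_tokens), coverage(en_tokens))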

The results are saved to a JSON file, tokenizer_properties.json. To run the evaluation, use the following command (with example parameters):

python evaluate_tokenizer.py \
    --data_list data_en.txt data_en2.txt data_es.txt data_pl.txt \
    --languages en en es pl \
    --tokenizer_name xlm-roberta-base \
    --output_dir /home/tokenizers_evaluation \
    [--unk_token <unk>] \
    [--pretokenized]

The explanation of the parameters:

  • data_list: listed paths to the data
  • languages: listed languages of the data, one per data path
  • tokenizer_name: HF tokenizer name or the name of a tokenizer pre-saved in the output directory
  • output_dir: path to the output directory where the results are saved
  • unk_token: optional, the unknown token in the vocabulary (default: <unk>)
  • pretokenized: optional (default: False); use if the data was previously tokenized by a specific subword tokenizer and saved with spaces between subwords. tokenizer_name must match the tokenizer that was used.
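
Once the evaluation finishes, the saved results can be inspected directly with a few lines of Python. The path below assumes the output_dir from the example command; the exact fields in the JSON depend on the script's output.

# Inspect the evaluation output; adjust the path to your own output_dir.
import json

with open("/home/tokenizers_evaluation/tokenizer_properties.json") as f:
    properties = json.load(f)

print(json.dumps(properties, indent=2, ensure_ascii=False))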

Reproducing the experiments

Downloading data

For the experiments we use a portion of the CC100 dataset and the UD treebanks, in both cases for six languages: Arabic, Turkish, Chinese, Greek, Spanish, and English. To download the data run:

source prepare_data_cc100.sh <path_to_data_dir>

The script will download the data and create the following directory structure:

<path_to_data_dir>
├── lang
│   ├── alpha0.0
│   ├── alpha0.25
│   ├── alpha0.5
│   ├── alpha0.75
│   ├── alpha1.0
│   ├── dev
│   ├── test

alpha[0.0, 0.25, 0.5, 0.75, 1.0] are text files with accumulated data from the CC100 dataset (for each language separately). The size of the text per file is defined by the equation:

$c_L = c_{min} \cdot (\frac{|C_L|}{c_{min}})^\alpha$

where $c_{min}$ is the minimal size of the text file and $|C_L|$ is the maximal size for language $L$. $\alpha$ is the parameter that controls the size of the corpus and the balance between languages (e.g. for $\alpha=0.0$ the size of the data is equal for all languages).

dev, test are development and test sets for each language.
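
As a quick sanity check of the formula, the snippet below evaluates $c_L$ for a few made-up corpus sizes (the values and units are arbitrary and only illustrate how $\alpha$ interpolates between equal and proportional sampling):

# Worked example of the corpus-size formula; the corpus sizes are made up.
c_min = 1.0                                           # minimal per-language size
corpus_sizes = {"en": 300.0, "es": 50.0, "el": 5.0}   # |C_L| per language

for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    c = {lang: c_min * (C_L / c_min) ** alpha for lang, C_L in corpus_sizes.items()}
    print(alpha, {lang: round(size, 2) for lang, size in c.items()})
# alpha = 0.0 gives the same size (c_min) for every language;
# alpha = 1.0 reproduces the full per-language sizes |C_L|.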

Training tokenizers

To train a tokenizer run:

source train_tokenizer.sh <vocab_size> <alpha_idx> <tokenizer_type> <path_to_data_dir> <output_path> <language_list>

<vocab_size> is the size of the vocabulary for the tokenizer.

<alpha_idx> is the index of the alpha parameter in the list [0.0, 0.25, 0.5, 0.75, 1.0] defining how many files should be accumulated per language.

<tokenizer_type> currently supported types of the tokenizer are bpe, unigram, sp-bpe, sp-unigram (where sp stands for the SentencePiece method).

<path_to_data_dir> is the path to the directory with data, same as above.

<output_path> is the path to the directory where the tokenizer will be saved.

<language_list> is the list of the languages' ISO codes (e.g. en es pl).
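
After training, the tokenizer can be loaded back for a quick sanity check. The sketch below assumes the tokenizer was saved in the Hugging Face tokenizers JSON format; the actual file name and location under <output_path> depend on train_tokenizer.sh.

# Load a trained tokenizer and inspect it; the tokenizer.json path is an
# assumption and must be adjusted to wherever train_tokenizer.sh saved it.
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(tokenizer_file="<output_path>/tokenizer.json")
print(tokenizer.vocab_size)
print(tokenizer.tokenize("Entangled in the scripts"))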

Training LMs

Before training the LM, it's necessary to prepare training configuration files:

source prepare_mlm_training_config.sh <vocab_size> <alpha_idx> <alpha_idx_train> <path_to_data_dir> <output_path> <language_list>

<alpha_idx_train> is the choice of the alpha parameter for training the LM (note that it can be different from the <alpha_idx> used for tokenizer training).

<output_path> is the path to the directory with the subdirectories /tokenizers and /models. All experimental outputs will be saved in those subdirectories.

Other parameters are the same as above.

Now, you can train the LM with the following command, with the same parameters:

source pretrtain_mlm.sh <vocab_size> <alpha_idx> <alpha_idx_train> <tokenizer_type> <path_to_data_dir> <output_path> <language_list>
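
Once pretraining finishes, a quick way to verify the checkpoint is to load it with Hugging Face Transformers. The paths below are placeholders; the actual checkpoint and tokenizer locations under <output_path> depend on the training scripts.

# Sanity-check a pretrained checkpoint; both paths are placeholders.
from transformers import AutoModelForMaskedLM, PreTrainedTokenizerFast

model = AutoModelForMaskedLM.from_pretrained("<output_path>/models/<model_name>")
tokenizer = PreTrainedTokenizerFast(tokenizer_file="<output_path>/tokenizers/<tokenizer_name>/tokenizer.json")
print(model.config.vocab_size, tokenizer.vocab_size)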

Fine-tuning models on downstream tasks

To fine-tune the models on a downstream task run (for example, for the NER task):

source fine_tune_ner.sh <vocab_size> <alpha_> <alpha_train> <language_code> <tokenizer_type> <r_seed> <do_probe> <output_path>

New parameters are:

<alpha_> and <alpha_train> are now numerical values instead of indices.

<language_code> is the ISO code of the downstream task language (e.g. en, es, pl).

<r_seed> is the seed for the random initialization of the probe.

<do_probe> 0 or 1: whether to use the probing setting (i.e. freezing the underlying LM's parameters).

Subsequently, you can evaluate the fine-tuned model on test data in the same or a different language:

source eval_ner.sh <vocab_size> <alpha> <alpha_train> <language_src> <language_tgt> <tokenizer_type> <r_seed> <do_probe> <output_path>

where <language_src> is the language of the fine-tuned model and <language_tgt> is the language of the test data.

BibTeX

If you use the code, please cite the paper:

@inproceedings{limisiewicz-etal-2023-tokenization,
    title = "Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages",
    author = "Limisiewicz, Tomasz  and
      Balhar, Ji{\v{r}}{\'\i}  and
      Mare{\v{c}}ek, David",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.350",
    doi = "10.18653/v1/2023.findings-acl.350",
    pages = "5661--5681"}
