[EMNLP'24] Autoregressive Pre-Training on Pixels and Texts

ernie-research/pixelgpt

This is the official repository containing the code and model checkpoints for our paper Autoregressive Pre-Training on Pixels and Texts (EMNLP 2024).

🔥 News

  • 21 September, 2024: 🎉 Our work has been accepted to EMNLP 2024! 🎉
  • 1 May, 2024: 🎉 We release the official codebase and model weights of PixelGPT, MonoGPT, and DualGPT. Stay tuned! 🔥

Harnessing visual texts represents a burgeoning frontier in the evolution of language modeling. In this paper, we introduce a novel pre-training framework for a suite of pixel-based autoregressive language models, pre-training on a corpus of over 400 million documents rendered as RGB images. Our approach is characterized by a dual-modality training regimen, engaging both visual data through next patch prediction with a regression head and textual data via next token prediction with a classification head. This study is particularly focused on investigating the synergistic interplay between visual and textual modalities of language. Our comprehensive evaluation across a diverse array of benchmarks reveals that the confluence of visual and textual data substantially augments the efficacy of pixel-based language models. Notably, our findings show that a unidirectional pixel-based model, devoid of textual data during training, can match the performance levels of advanced bidirectional pixel-based models on various language understanding benchmarks. This work highlights the considerable untapped potential of integrating visual and textual information for language modeling purposes. We will release our code, data, and checkpoints to inspire further research advancement.

📕 Requirements

To set up the environment and install dependencies, run:

bash run_requirements.sh

📚 Fine-tuning Data

We fine-tune PixelGPT on the rendered GLUE and XNLI datasets. These rendered versions are publicly available at baidu/rendered_GLUE and baidu/rendered_xnli. After downloading the datasets from HuggingFace, extract them locally:

# Extract rendered GLUE
tar -xvf rendered_glue.tar

# Extract rendered XNLI
tar -xvf rendered_xnli.tar

For the rendered GLUE dataset, the extracted files contain multiple tasks. Each task has a corresponding training set, validation set, and test set. Note that for the MNLI task, both the validation and test sets contain matched and mismatched versions. You will need to assign the local paths of these task datasets to the --train_file, --validation_file, and --test_file parameters in the fine-tuning script. For the rendered XNLI dataset, assign the local dataset path to the --data_file_dir parameter in the corresponding fine-tuning script.
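
Because the values for these parameters depend on how the archives unpack, it can help to list an archive's top-level entries before setting the paths. The following is a minimal sketch using only standard `tar` options; the helper name `list_top_level` is ours and not part of this repository, and the archive file names are the ones used in the extraction commands above.

```shell
# Hypothetical helper (not part of the repository): print the distinct
# top-level entries of a tar archive without extracting it.
list_top_level() {
  tar -tf "$1" | sed -e 's|^\./||' -e 's|/.*||' -e '/^$/d' | sort -u
}

# Only inspect archives that are actually present in the current directory.
for archive in rendered_glue.tar rendered_xnli.tar; do
  if [ -f "$archive" ]; then
    echo "== $archive =="
    list_top_level "$archive"
  fi
done
```

The printed names can then be used to locate the task directories whose paths are passed to the fine-tuning scripts.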

📌 Pre-trained Models

We pre-trained PixelGPT along with two other models, MonoGPT and DualGPT. We release the checkpoints used in our experiments, which can be downloaded at baidu/PixelGPT, baidu/MonoGPT, and baidu/DualGPT. Before running the fine-tuning scripts below, download the corresponding pre-trained model from the model repositories above and place the files in the pre-trained model directory, e.g. pretrained_models/PixelGPT.
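
One possible way to fetch the checkpoints is the `huggingface-cli download` command from the `huggingface_hub` package. The sketch below assumes that the three repositories are hosted on the Hugging Face Hub under the IDs listed above and that the CLI is installed. It only prints the commands so they can be reviewed first. The helper name `print_download_cmds` is ours, and the `pretrained_models/<name>` layout mirrors the paths passed to the fine-tuning scripts in this README.

```shell
# Hypothetical helper (not part of the repository): print one download
# command per released checkpoint.
# Assumes: pip install -U "huggingface_hub[cli]"
print_download_cmds() {
  dest_root="${1:-pretrained_models}"
  for name in PixelGPT MonoGPT DualGPT; do
    # Repo IDs baidu/<name> are the ones listed in this README.
    echo "huggingface-cli download baidu/${name} --local-dir ${dest_root}/${name}"
  done
}

print_download_cmds
```

To run the downloads rather than print them, pipe the output to a shell, for example `print_download_cmds | sh`.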

🚀 Fine-tuning

Our main fine-tuning experiments were performed on rendered GLUE and XNLI. The scripts to run the experiments are given below.

GLUE

For example, to fine-tune on the MNLI task:

PixelGPT

bash run/pixel_gpt/ft_pixel_gpt_mnli.sh pretrained_models/PixelGPT

MonoGPT

# Text-only Fine-tuning
bash run/mono_gpt/ft_mono_gpt_mnli_text.sh pretrained_models/MonoGPT

# Pixel-only Fine-tuning
bash run/mono_gpt/ft_mono_gpt_mnli_pixel.sh pretrained_models/MonoGPT

# Pair-modality Fine-tuning
bash run/mono_gpt/ft_mono_gpt_mnli_pair.sh pretrained_models/MonoGPT

DualGPT

# Text-only Fine-tuning
bash run/dual_gpt/ft_dual_gpt_mnli_text.sh pretrained_models/DualGPT

# Pixel-only Fine-tuning
bash run/dual_gpt/ft_dual_gpt_mnli_pixel.sh pretrained_models/DualGPT

# Pair-modality Fine-tuning
bash run/dual_gpt/ft_dual_gpt_mnli_pair.sh pretrained_models/DualGPT
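
The MonoGPT and DualGPT commands above share one naming pattern, so the three modality runs for a model can be generated in a loop. The sketch below only prints the commands. The helper `mnli_cmd` is hypothetical and simply reproduces the script paths shown above; it does not cover PixelGPT, whose script name has no modality suffix.

```shell
# Hypothetical helper (not part of the repository): build the MNLI
# fine-tuning command for a model family and modality.
#   $1: mono_gpt or dual_gpt
#   $2: text, pixel, or pair
#   $3: path to the pre-trained checkpoint
mnli_cmd() {
  model_dir="$1"
  modality="$2"
  ckpt="$3"
  echo "bash run/${model_dir}/ft_${model_dir}_mnli_${modality}.sh ${ckpt}"
}

# Print the three MonoGPT runs; pipe the output to sh to execute them.
for modality in text pixel pair; do
  mnli_cmd mono_gpt "$modality" pretrained_models/MonoGPT
done
```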

XNLI Training

We evaluated on XNLI in two settings: (1) Translate-train-all, where the model is fine-tuned on a combination of English data and machine-translated data in 14 other languages; (2) Cross-lingual Transfer, where the model is fine-tuned only on English data and tested on multiple languages.

1. Translate-train-all

PixelGPT

bash run/cross_lingual/xnli/train_all/pixel_gpt/ft_pixel_gpt_xnli.sh pretrained_models/PixelGPT

MonoGPT

# Text-only Fine-tuning
bash run/cross_lingual/xnli/train_all/mono_gpt/ft_mono_gpt_xnli_text.sh pretrained_models/MonoGPT

# Pixel-only Fine-tuning
bash run/cross_lingual/xnli/train_all/mono_gpt/ft_mono_gpt_xnli_image.sh pretrained_models/MonoGPT

# Pair-modality Fine-tuning
bash run/cross_lingual/xnli/train_all/mono_gpt/ft_mono_gpt_xnli_pair.sh pretrained_models/MonoGPT

DualGPT

# Text-only Fine-tuning
bash run/cross_lingual/xnli/train_all/dual_gpt/ft_dual_gpt_xnli_text.sh pretrained_models/DualGPT

# Pixel-only Fine-tuning
bash run/cross_lingual/xnli/train_all/dual_gpt/ft_dual_gpt_xnli_image.sh pretrained_models/DualGPT

# Pair-modality Fine-tuning
bash run/cross_lingual/xnli/train_all/dual_gpt/ft_dual_gpt_xnli_pair.sh pretrained_models/DualGPT

2. Cross-lingual Transfer

PixelGPT

bash run/cross_lingual/xnli/train_en/pixel_gpt/ft_pixel_gpt_xnli.sh pretrained_models/PixelGPT

MonoGPT

# Text-only Fine-tuning
bash run/cross_lingual/xnli/train_en/mono_gpt/ft_mono_gpt_xnli_text.sh pretrained_models/MonoGPT

# Pixel-only Fine-tuning
bash run/cross_lingual/xnli/train_en/mono_gpt/ft_mono_gpt_xnli_image.sh pretrained_models/MonoGPT

# Pair-modality Fine-tuning
bash run/cross_lingual/xnli/train_en/mono_gpt/ft_mono_gpt_xnli_pair.sh pretrained_models/MonoGPT

DualGPT

# Text-only Fine-tuning
bash run/cross_lingual/xnli/train_en/dual_gpt/ft_dual_gpt_xnli_text.sh pretrained_models/DualGPT

# Pixel-only Fine-tuning
bash run/cross_lingual/xnli/train_en/dual_gpt/ft_dual_gpt_xnli_image.sh pretrained_models/DualGPT

# Pair-modality Fine-tuning
bash run/cross_lingual/xnli/train_en/dual_gpt/ft_dual_gpt_xnli_pair.sh pretrained_models/DualGPT
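
The two XNLI settings differ only in the `train_all` or `train_en` path component, and the pixel-only scripts here use the suffix `image` rather than `pixel`. As a rough sketch, the MonoGPT and DualGPT commands above can be generated as follows. The helper `xnli_cmd` is hypothetical and only prints commands; PixelGPT is not covered because its script name has no modality suffix.

```shell
# Hypothetical helper (not part of the repository): build an XNLI
# fine-tuning command.
#   $1: train_all (Translate-train-all) or train_en (Cross-lingual Transfer)
#   $2: mono_gpt or dual_gpt
#   $3: text, image, or pair
#   $4: path to the pre-trained checkpoint
xnli_cmd() {
  setting="$1"
  model_dir="$2"
  modality="$3"
  ckpt="$4"
  case "$setting" in
    train_all|train_en) ;;
    *) echo "unknown setting: $setting" >&2; return 1 ;;
  esac
  echo "bash run/cross_lingual/xnli/${setting}/${model_dir}/ft_${model_dir}_xnli_${modality}.sh ${ckpt}"
}

# Print all six DualGPT runs; pipe the output to sh to execute them.
for setting in train_all train_en; do
  for modality in text image pair; do
    xnli_cmd "$setting" dual_gpt "$modality" pretrained_models/DualGPT
  done
done
```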

Citation

For attribution in academic contexts, please cite this work as:

@misc{chai2024autoregressivepretrainingpixelstexts,
  title = {Autoregressive Pre-Training on Pixels and Texts},
  author = {Chai, Yekun and Liu, Qingyi and Xiao, Jingwu and Wang, Shuohuan and Sun, Yu and Wu, Hua},
  year = {2024},
  eprint = {2404.10710},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
  url = {https://arxiv.org/abs/2404.10710},
}
