
Prototype Dataset Processor #1646

Draft · wants to merge 17 commits into main

Conversation

vwxyzjn
Contributor

@vwxyzjn vwxyzjn commented May 16, 2024

This PR attempts to refactor and pull all tokenization logic out of the Trainer class. Having a separate tokenization step gives us higher visibility into what is actually used in training, clarifies the logic, and reduces bugs. It attempts to do the following:

# 1. PPO (prompt)
# 2. SFT (prompt + demonstration), there is also packing.
# 3. ✅ RM / DPO (chosen and rejected)
# 4. ✅ Visualization of length distributions?
# 5. ✅ Filter?
#   * Smart truncation?
# 6. ✅ dataset_num_proc
# 7. check EOS token
# 8. dataset mixer?
# 9. ✅ pretty print that shows tokenization?
# 10. hashable tokenization?
# 11. inputs / labels / attention_mask
# 12. always set a `tokenizer.pad_token_id`?

Why?

Currently, the Trainer is also responsible for tokenization. This causes several issues:

  1. Duplicate tokenization steps. For example, alignment-handbook calls apply_chat_template(tokenize=False) on the dataset, and then the SFT/DPO trainer tokenizes it again. To remove this duplication, we only need to go through the dataset once by calling apply_chat_template(tokenize=True).

  2. Truncation logic happens in various places and is hard to predict. SFTTrainer calls it max_seq_length, the reward modeling trainer calls it max_length, and the DPO/KTO trainers use max_length, max_prompt_length, and max_target_length. There are also different truncation behaviors, e.g. truncating the prompt if prompt + chosen is too long:

     # if combined sequence is too long, truncate the prompt
     for answer_tokens in [chosen_tokens, rejected_tokens, prompt_tokens]:
         if len(answer_tokens["prompt_input_ids"]) + longer_response_length > self.max_length:
             ...  # (prompt truncation elided)

     This causes issues like https://huggingface.slack.com/archives/C04EX6W3QSY/p1715255460198239, as raised by @abhishekkrthakur.

    • The hard-truncation logic seems debatable: if a sequence is too long, shouldn't we filter it out instead of training on a truncated response? A truncated response could be an incomplete code snippet or summary (basically bad data). If truncation is really desired, we should do some kind of smart truncation, e.g. truncate at the last paragraph boundary so the sentences are still complete (see the sketch after this list).
  3. Learning to generate EOS tokens. The issue Learning to generate EOS tokens #1623 (comment) suggested that EOS tokens 1) always correspond to -100 in the labels, and 2) if the dataset contains the EOS token before collating, the attention mask of the EOS token is also 1. It's possible that the model never learns to generate EOS tokens.

    • What's a bit unclear to me is how zephyr learns to output EOS tokens despite all EOS-token labels being set to -100 and masked out. My suspicion is that attention_mask=1 plays some role in it.
  4. dataset_num_proc is not uniformly applied; as a result, [ORPO] Enable batched tokenization & multiprocessing to process large datasets #1624 is needed. There is also the question of hashable tokenization.

  5. Dataset mixer (e.g., in our h4 codebase): this should be more widely available in TRL and could be combined with this class.
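As a concrete illustration of the filtering / smart-truncation idea in point 2 above, here is a minimal sketch (not part of this PR; the smart_truncate helper and its signature are made up for illustration) that drops whole paragraphs from the end until the example fits, and signals that the example should be filtered out if nothing fits:

def smart_truncate(text, tokenizer, max_tokens):
    # Drop trailing paragraphs until the text fits in max_tokens; return None if nothing fits.
    paragraphs = text.split("\n\n")
    while paragraphs:
        candidate = "\n\n".join(paragraphs)
        if len(tokenizer(candidate, add_special_tokens=False)["input_ids"]) <= max_tokens:
            return candidate
        paragraphs.pop()  # remove the last paragraph and retry
    return None  # even the first paragraph is too long: filter this example out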

The current design

The current design roughly looks like this. Note that we can still put it in Trainer.__init__ so users don't have to configure it directly.

dataset_config = DatasetConfig(max_token_length=1024, max_prompt_token_lenth=128)
dataset_processor = PreferenceDatasetProcessor(tokenizer=tok, config=dataset_config)

# tokenize the train split and inspect token-length statistics
train_dataset = dataset_processor.tokenize(preference_datasets["train"])
stats = dataset_processor.get_token_length_stats(train_dataset)
pprint.pp(stats)

# filter the dataset (e.g., dropping over-length examples) and re-check the stats
train_dataset = dataset_processor.filter(train_dataset)
stats = dataset_processor.get_token_length_stats(train_dataset)
pprint.pp(stats)

# visualize the token-length distribution and pretty-print one tokenized example
dataset_processor.get_token_length_visualization(train_dataset)
print(tok.decode(train_dataset[0]["chosen"]))
visualize_token(train_dataset[0]["chosen"], tok)
[screenshots: token-length statistics and token visualization output]
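To illustrate the note above that the processor can still live in Trainer.__init__, here is a hypothetical sketch (the MyTrainer wrapper and its arguments are assumptions, not this PR's API; only DatasetConfig, PreferenceDatasetProcessor, tokenize, and filter come from the snippet above):

class MyTrainer:
    def __init__(self, tokenizer, train_dataset, dataset_config=None):
        # build a default config and processor internally so users don't have to call them directly
        dataset_config = dataset_config or DatasetConfig(max_token_length=1024, max_prompt_token_lenth=128)
        processor = PreferenceDatasetProcessor(tokenizer=tokenizer, config=dataset_config)
        self.train_dataset = processor.filter(processor.tokenize(train_dataset))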


@kashif
Collaborator

kashif commented May 16, 2024

very cool! thanks! checking

@edbeeching
Collaborator

> What's a bit unclear to me is how zephyr learns to output EOS tokens despite all EOS-token labels being set to -100 and masked out. My suspicion is that attention_mask=1 plays some role in it.

I think for zephyr we used packing and there is a concat token=eos that is not masked / ignored.

@vwxyzjn
Contributor Author

vwxyzjn commented May 18, 2024

@edbeeching you are right, because the data collator is not called when using the packed dataset! The output below confirms it.

[screenshot of output]
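For context, a minimal sketch (not from this PR) of the masking behavior discussed above: with transformers' DataCollatorForLanguageModeling and pad_token set to eos_token, the collator sets the label of every pad/EOS position to -100, so a genuine trailing EOS token is never learned; with a packed dataset this collator is bypassed, which is why the EOS separator keeps its label.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-1b-deduped")  # tokenizer without a dedicated pad token
tok.pad_token = tok.eos_token  # common default when the model has no pad token

# a short example that genuinely ends with EOS
ids = tok("hello world", add_special_tokens=False)["input_ids"] + [tok.eos_token_id]
collator = DataCollatorForLanguageModeling(tok, mlm=False)
batch = collator([{"input_ids": ids}])
print(batch["labels"][0])  # the final (EOS) label is -100, so the model never learns to emit EOS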

@vwxyzjn vwxyzjn marked this pull request as ready for review June 21, 2024 15:21
@@ -872,7 +872,9 @@ def print_rich_table(df: pd.DataFrame) -> Table:
SIMPLE_SFT_CHAT_TEMPLATE = "{% for message in messages %}{{' ' + message['content']}}{% endfor %}{{ eos_token }}"
# SIMPLE_SFT_CHAT_TEMPLATE simply ends things with an EOS token, this helps the SFT model learn to end the completions with EOS tokens

SIMPLE_QUERY_CHAT_TEMPLATE = "{% for message in messages %}{{' ' + message['content']}}{% endfor %}"
SIMPLE_QUERY_CHAT_TEMPLATE = (
vwxyzjn (Contributor Author):

Will refactor this later
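As a quick illustration of the SIMPLE_SFT_CHAT_TEMPLATE shown in the diff above, here is a minimal sketch (not from this PR; the example messages are made up) of how it renders:

from transformers import AutoTokenizer

SIMPLE_SFT_CHAT_TEMPLATE = "{% for message in messages %}{{' ' + message['content']}}{% endfor %}{{ eos_token }}"

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-1b-deduped")
tok.chat_template = SIMPLE_SFT_CHAT_TEMPLATE

messages = [
    {"role": "user", "content": "What is TRL?"},
    {"role": "assistant", "content": "A library for post-training transformers."},
]
print(tok.apply_chat_template(messages, tokenize=False))
# " What is TRL? A library for post-training transformers.<|endoftext|>"

Ending every example with a single EOS token is what lets the SFT model learn to terminate its completions, as the comment in the diff notes.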

Comment on lines +322 to +343
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml \
examples/scripts/rm/rm.py \
--dataset_name trl-internal-testing/tldr-preference-trl-style \
--dataset_train_split train \
--dataset_eval_split validation \
--model_name_or_path EleutherAI/pythia-1b-deduped \
--chat_template simple_concat_with_space \
--learning_rate 3e-6 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--gradient_accumulation_steps 4 \
--logging_steps 1 \
--eval_strategy steps \
--max_token_length 1280 \
--max_prompt_token_lenth 1024 \
--remove_unused_columns False \
--num_train_epochs 1 \
--eval_steps=300 \
--bf16 \
--output_dir models/rm/rm_tldr_1b \
--push_to_hub \
--hub_model_id trl-internal-testing/rm_tldr_1b

github-actions bot commented Aug 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@github-actions github-actions bot closed this Aug 16, 2024
@kashif kashif reopened this Aug 16, 2024
@github-actions github-actions bot closed this Aug 25, 2024
@lewtun
Copy link
Member

lewtun commented Aug 26, 2024

Bot begone!

@lewtun lewtun reopened this Aug 26, 2024
@lewtun lewtun mentioned this pull request Sep 12, 2024
@qgallouedec qgallouedec marked this pull request as draft September 23, 2024 21:56