diff --git a/docs/source/dataset_formats.mdx b/docs/source/dataset_formats.mdx index bad094d277..03444d7cf1 100644 --- a/docs/source/dataset_formats.mdx +++ b/docs/source/dataset_formats.mdx @@ -180,6 +180,8 @@ preference_example = {"prompt": "The sky is", "chosen": " blue.", "rejected": " preference_example = {"chosen": "The sky is blue.", "rejected": "The sky is green."} ``` +Some preference datasets can be found with [the tag `dpo` on Hugging Face Hub](https://huggingface.co/datasets?other=dpo). You can also explore the [librarian-bots' DPO Collections](https://huggingface.co/collections/librarian-bots/direct-preference-optimization-datasets-66964b12835f46289b6ef2fc) to identify preference datasets. + ### Unpaired preference An unpaired preference dataset is similar to a preference dataset but instead of having `"chosen"` and `"rejected"` completions for the same prompt, it includes a single `"completion"` and a `"label"` indicating whether the completion is preferred or not. @@ -710,3 +712,35 @@ dataset = dataset.remove_columns(["completion", "label"]) >>> dataset[0] {'prompt': 'The sky is'} ``` + +## Vision datasets + +Some trainers also support fine-tuning vision-language models (VLMs) using image-text pairs. In this scenario, it's recommended to use a conversational format, as each model handles image placeholders in text differently. + +A conversational vision dataset differs from a standard conversational dataset in two key ways: + +1. The dataset must contain the key `images` with the image data. +2. The `"content"` field in messages must be a list of dictionaries, where each dictionary specifies the type of data: `"image"` or `"text"`. + +Example: + +```python +# Textual dataset format: +"content": "What color is the sky?" + +# Vision dataset format: +"content": [ + {"type": "image"}, + {"type": "text", "text": "What color is the sky in the image?"} +] +``` + +An example of a conversational vision dataset is the [openbmb/RLAIF-V-Dataset](https://huggingface.co/datasets/openbmb/RLAIF-V-Dataset). Below is an embedded view of the dataset's training data, allowing you to explore it directly: + + + diff --git a/docs/source/dpo_trainer.mdx b/docs/source/dpo_trainer.mdx index 2f86c851c0..22daae9df2 100644 --- a/docs/source/dpo_trainer.mdx +++ b/docs/source/dpo_trainer.mdx @@ -1,156 +1,149 @@ # DPO Trainer -TRL supports the DPO Trainer for training language models from preference data, as described in the paper [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://huggingface.co/papers/2305.18290) by Rafailov et al., 2023. For a full example have a look at [`examples/scripts/dpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/dpo.py). +## Overview -The first step as always is to train your SFT model, to ensure the data we train on is in-distribution for the DPO algorithm. +TRL supports the DPO Trainer for training language models from preference data, as described in the paper [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://huggingface.co/papers/2305.18290) by [Rafael Rafailov](https://huggingface.co/rmrafailov), Archit Sharma, Eric Mitchell, [Stefano Ermon](https://huggingface.co/ermonste), [Christopher D. Manning](https://huggingface.co/manning), [Chelsea Finn](https://huggingface.co/cbfinn). 
-## How DPO works +The abstract from the paper is the following: -Fine-tuning a language model via DPO consists of two steps and is easier than PPO: +> While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train. -1. **Data collection**: Gather a preference dataset with positive and negative selected pairs of generation, given a prompt. -2. **Optimization**: Maximize the log-likelihood of the DPO loss directly. +The first step is to train an SFT model, to ensure the data we train on is in-distribution for the DPO algorithm. + +Then, fine-tuning a language model via DPO consists of two steps and is easier than [PPO](ppov2_trainer): -DPO-compatible datasets can be found with [the tag `dpo` on Hugging Face Hub](https://huggingface.co/datasets?other=dpo). You can also explore the [librarian-bots/direct-preference-optimization-datasets](https://huggingface.co/collections/librarian-bots/direct-preference-optimization-datasets-66964b12835f46289b6ef2fc) Collection to identify datasets that are likely to support DPO training. +1. **Data collection**: Gather a [preference dataset](dataset_formats#preference) with positive and negative selected pairs of generation, given a prompt. +2. **Optimization**: Maximize the log-likelihood of the DPO loss directly. -This process is illustrated in the sketch below (from [figure 1 of the original paper](https://huggingface.co/papers/2305.18290)): +This process is illustrated in the sketch below (from [Figure 1 of the DPO paper](https://huggingface.co/papers/2305.18290)): -Screenshot 2024-03-19 at 12 39 41 +![](https://github.com/huggingface/trl/assets/49240599/9150fac6-3d88-4ca2-8ec6-2a6f3473216d) Read more about DPO algorithm in the [original paper](https://huggingface.co/papers/2305.18290). +## Quick start -## Expected dataset format +This example demonstrates how to train a model using the DPO method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model. 
We use the preference data from the [Capybara dataset](https://huggingface.co/datasets/trl-lib/Capybara-Preferences). You can view the data in the dataset here: + + + +Below is the script to train the model: + +```python +# train_dpo.py +from datasets import load_dataset +from trl import DPOConfig, DPOTrainer +from transformers import AutoModelForCausalLM, AutoTokenizer + +model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct") +tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct") +train_dataset = load_dataset("trl-lib/Capybara-Preferences", split="train") -The DPO trainer expects a very specific format for the dataset. Since the model will be trained to directly optimize the preference of which sentence is the most relevant, given two sentences. We provide an example from the [`Anthropic/hh-rlhf`](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset below: - -
- -
- -Therefore the final dataset object should contain these 3 entries if you use the default [`DPODataCollatorWithPadding`] data collator. The entries should be named: - -- `prompt` -- `chosen` -- `rejected` - -for example: - -```py -dpo_dataset_dict = { - "prompt": [ - "hello", - "how are you", - "What is your name?", - "What is your name?", - "Which is the best programming language?", - "Which is the best programming language?", - "Which is the best programming language?", - ], - "chosen": [ - "hi nice to meet you", - "I am fine", - "My name is Mary", - "My name is Mary", - "Python", - "Python", - "Java", - ], - "rejected": [ - "leave me alone", - "I am not fine", - "Whats it to you?", - "I dont have a name", - "Javascript", - "C++", - "C++", - ], -} +training_args = DPOConfig(output_dir="dpo-qwen2", logging_steps=10) +trainer = DPOTrainer(model=model, args=training_args, tokenizer=tokenizer, train_dataset=train_dataset) +trainer.train() ``` -where the `prompt` contains the context inputs, `chosen` contains the corresponding chosen responses and `rejected` contains the corresponding negative (rejected) responses. As can be seen a prompt can have multiple responses and this is reflected in the entries being repeated in the dictionary's value arrays. - -[`DPOTrainer`] can be used to fine-tune visual language models (VLMs). In this case, the dataset must also contain the key `images`, and the trainer's `tokenizer` is the VLM's `processor`. For example, for Idefics2, the processor expects the dataset to have the following format: - -Note: Currently, VLM support is exclusive to Idefics2 and does not extend to other VLMs. - -```py -dpo_dataset_dict = { - 'images': [ - [Image.open('beach.jpg')], - [Image.open('street.jpg')], - ], - 'prompt': [ - 'The image shows', - ' The image depicts', - ], - 'chosen': [ - 'a sunny beach with palm trees.', - 'a busy street with several cars and buildings.', - ], - 'rejected': [ - 'a snowy mountain with skiers.', - 'a calm countryside with green fields.', - ], -} +Execute the script using the following command: + +```bash +accelerate launch train_dpo.py ``` -## Expected model format +Distributed across 8 GPUs, the training takes approximately 3 minutes. You can verify the training progress by checking the reward graph. An increasing trend in the reward margin indicates that the model is improving and generating better responses over time. -The DPO trainer expects a model of `AutoModelForCausalLM` or `AutoModelForVision2Seq`, compared to PPO that expects `AutoModelForCausalLMWithValueHead` for the value function. +![](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/dpo-qwen2-reward-margin.png) -## Using the `DPOTrainer` +To see how the [trained model](https://huggingface.co/trl-lib/dpo-qwen2) performs, use the following code to generate completions: -For a detailed example have a look at the `examples/scripts/dpo.py` script. At a high level we need to initialize the [`DPOTrainer`] with a `model` we wish to train, a reference `ref_model` which we will use to calculate the implicit rewards of the preferred and rejected response, the `beta` refers to the hyperparameter of the implicit reward, and the dataset contains the 3 entries listed above. Note that the `model` and `ref_model` need to have the same architecture (ie decoder only or encoder-decoder). 
+```python +>>> from transformers import pipeline +>>> generator = pipeline("text-generation", model="trl-lib/dpo-qwen2", device="cuda") +>>> question = "What is the best programming language?" +>>> output = generator([{"role": "user", "content": question}], max_new_tokens=200, return_full_text=False)[0] +>>> print(output["generated_text"]) +The choice of the best programming language depends on several factors, including your level of experience, skills, and interests, as well as your personal preferences and needs. -```py -training_args = DPOConfig( - beta=0.1, -) -dpo_trainer = DPOTrainer( - model, - ref_model, - args=training_args, - train_dataset=train_dataset, - tokenizer=tokenizer, # for visual language models, use tokenizer=processor instead -) -``` +Some popular choices for beginners include Python, Java, JavaScript, and C++. These languages are widely used in various fields such as web development, data analysis, artificial intelligence, and game development. -After this one can then call: +For professionals or those with more advanced knowledge, languages like C++, Ruby, and Swift can be very useful. These languages have a strong community and offer better support for complex tasks. -```py -dpo_trainer.train() +Ultimately, the best programming language will depend on your specific needs and goals. It's recommended to try out different languages and see which one you find most suitable for your work. ``` -Note that the `beta` is the temperature parameter for the DPO loss, typically something in the range of `0.1` to `0.5`. We ignore the reference model as `beta` -> 0. +## Expected dataset format -## Loss functions +DPO requires a [preference dataset](dataset_formats#preference). The [`DPOTrainer`] supports both [conversational](dataset_formats#conversational-dataset-format) and [standard](dataset_formats#standard-dataset-format) dataset format. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset. -Given the preference data, we can fit a binary classifier according to the Bradley-Terry model and in fact the [DPO](https://huggingface.co/papers/2305.18290) authors propose the sigmoid loss on the normalized likelihood via the `logsigmoid` to fit a logistic regression. To use this loss, set the `loss_type="sigmoid"` (default) in the [`DPOConfig`]. +### Special considerations for vision-language models -The [RSO](https://huggingface.co/papers/2309.06657) authors propose to use a hinge loss on the normalized likelihood from the [SLiC](https://huggingface.co/papers/2305.10425) paper. To use this loss, set the `loss_type="hinge"` in the [`DPOConfig`]. In this case, the `beta` is the reciprocal of the margin. +The [`DPOTrainer`] supports fine-tuning vision-language models (VLMs). For these models, a vision dataset is required. To learn more about the specific format for vision datasets, refer to the [Vision dataset format](dataset_formats#vision-datasets) section. -The [IPO](https://huggingface.co/papers/2310.12036) authors provide a deeper theoretical understanding of the DPO algorithms and identify an issue with overfitting and propose an alternative loss. To use the loss set the `loss_type="ipo"` in the [`DPOConfig`]. In this case, the `beta` is the reciprocal of the gap between the log-likelihood ratios of the chosen vs the rejected completion pair and thus the smaller the `beta` the larger this gaps is. As per the paper the loss is averaged over log-likelihoods of the completion (unlike DPO which is summed only). 
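To make the vision requirement concrete, here is a hand-written sketch of a single conversational vision preference example (the image path and texts are placeholders; the field layout follows the vision dataset format referenced above and the conversion used by the RLAIF-V dataset script later in this diff):

```python
from PIL import Image

# Hand-written illustration: "sky.jpg" is a placeholder path, not a real asset.
example = {
    "images": [Image.open("sky.jpg")],
    "prompt": [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": "What color is the sky in the image?"},
            ],
        }
    ],
    "chosen": [{"role": "assistant", "content": [{"type": "text", "text": "It is blue."}]}],
    "rejected": [{"role": "assistant", "content": [{"type": "text", "text": "It is green."}]}],
}
```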
+Additionally, unlike standard text-based models where a `tokenizer` is used, for VLMs, you should replace the `tokenizer` with a `processor`. -The [cDPO](https://ericmitchell.ai/cdpo.pdf) is a tweak on the DPO loss where we assume that the preference labels are noisy with some probability. In this approach, the `label_smoothing` parameter in the [`DPOConfig`] is used to model the probability of existing label noise. To apply this conservative loss, set `label_smoothing` to a value greater than 0.0 (between 0.0 and 0.5; the default is 0.0). +```diff +- model = AutoModelForCausalLM.from_pretrained(model_id) ++ model = AutoModelForVision2Seq.from_pretrained(model_id) -The [EXO](https://huggingface.co/papers/2402.00856) authors propose to minimize the reverse KL instead of the negative log-sigmoid loss of DPO which corresponds to forward KL. To use the loss set the `loss_type="exo_pair"` in the [`DPOConfig`]. Setting non-zero `label_smoothing` (default `1e-3`) leads to a simplified version of EXO on pair-wise preferences (see Eqn. (16) of the [EXO paper](https://huggingface.co/papers/2402.00856)). The full version of EXO uses `K>2` completions generated by the SFT policy, which becomes an unbiased estimator of the PPO objective (up to a constant) when `K` is sufficiently large. +- tokenizer = AutoTokenizer.from_pretrained(model_id) ++ processor = AutoProcessor.from_pretrained(model_id) -The [NCA](https://huggingface.co/papers/2402.05369) authors shows that NCA optimizes the absolute likelihood for each response rather than the relative likelihood. To use the loss set the `loss_type="nca_pair"` in the [`DPOConfig`]. + trainer = DPOTrainer( + model, + args=training_args, + train_dataset=train_dataset, +- tokenizer=tokenizer, ++ tokenizer=processor, +) +``` -The [Robust DPO](https://huggingface.co/papers/2403.00409) authors propose an unbiased estimate of the DPO loss that is robust to preference noise in the data. Like in cDPO, it assumes that the preference labels are noisy with some probability. In this approach, the `label_smoothing` parameter in the [`DPOConfig`] is used to model the probability of existing label noise. To apply this conservative loss, set `label_smoothing` to a value greater than 0.0 (between 0.0 and 0.5; the default is 0.0) and set the `loss_type="robust"` in the [`DPOConfig`]. +For a complete example of fine-tuning a vision-language model, refer to the script in [`examples/scripts/dpo_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/dpo_vlm.py). -The [BCO](https://huggingface.co/papers/2404.04656) authors train a binary classifier whose logit serves as a reward so that the classifier maps {prompt, chosen completion} pairs to 1 and {prompt, rejected completion} pairs to 0. To use this loss, set the `loss_type="bco_pair"` in the [`DPOConfig`]. -The [TR-DPO](https://huggingface.co/papers/2404.09656) paper suggests syncing the reference model weights after every `ref_model_sync_steps` steps of SGD with weight `ref_model_mixup_alpha` during DPO training. To toggle this callback use the `sync_ref_model=True` in the [`DPOConfig`]. +## Example script -The [RPO](https://huggingface.co/papers/2404.19733) paper implements an iterative preference tuning algorithm using a loss related to the RPO loss in this [paper](https://huggingface.co/papers/2405.16436) that essentially consists of a weighted SFT loss on the chosen preferences together with the DPO loss. To use this loss, set the `rpo_alpha` in the [`DPOConfig`] to an appropriate value. 
The paper suggests setting this weight to 1.0. +We provide an example script to train a model using the DPO method. The script is available in [`examples/scripts/dpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/dpo.py) -The [SPPO](https://huggingface.co/papers/2405.00675) authors claim that SPPO is capable of solving the Nash equilibrium iteratively by pushing the chosen rewards to be as large as 1/2 and the rejected rewards to be as small as -1/2 and can alleviate data sparsity issues. The implementation approximates this algorithm by employing hard label probabilities, assigning 1 to the winner and 0 to the loser. To use this loss, set the `loss_type="sppo_hard"` in the [`DPOConfig`]. +To test the DPO script with the [Qwen2 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) on the [UltraFeedback dataset](https://huggingface.co/datasets/trl-lib/ultrafeedback_binarized), run the following command: + +```bash +accelerate launch examples/scripts/dpo.py \ + --model_name_or_path Qwen/Qwen2-0.5B-Instruct \ + --dataset_name trl-lib/ultrafeedback_binarized \ + --num_train_epochs 1 \ + --logging_steps 25 \ + --output_dir Qwen2-0.5B-DPO +``` -The [AOT](https://huggingface.co/papers/2406.05882) authors propose to use Distributional Preference Alignment Via Optimal Transport. Traditionally, the alignment algorithms use paired preferences at a sample level, which does not ensure alignment on the distributional level. AOT, on the other hand, can align LLMs on paired or unpaired preference data by making the reward distribution of the positive samples stochastically dominant in the first order on the distribution of negative samples. Specifically, `loss_type="aot"` is appropriate for paired datasets, where each prompt has both chosen and rejected responses; `loss_type="aot_pair"` is for unpaired datasets. In a nutshell, `loss_type="aot"` ensures that the log-likelihood ratio of chosen to rejected of the aligned model has higher quantiles than that ratio for the reference model. `loss_type="aot_pair"` ensures that the chosen reward is higher on all quantiles than the rejected reward. Note that in both cases quantiles are obtained via sorting. To fully leverage the advantages of the AOT algorithm, it is important to maximize the per-GPU batch size. +## Loss functions -The [APO](https://huggingface.co/papers/2408.06266) method introduces an "anchored" version of the alignment objective. There are two variants: `apo_zero` and `apo_down`. The `apo_zero` loss increases the likelihood of winning outputs while decreasing the likelihood of losing outputs, making it suitable when the model is less performant than the winning outputs. On the other hand, `apo_down` decreases the likelihood of both winning and losing outputs, but with a stronger emphasis on reducing the likelihood of losing outputs. This variant is more effective when the model is better than the winning outputs. To use these losses, set `loss_type="apo_zero"` or `loss_type="apo_down"` in the [`DPOConfig`]. +The DPO algorithm supports several loss functions. The loss function can be set using the `loss_type` parameter in the [`DPOConfig`]. 
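For example, a minimal sketch of selecting a non-default loss, reusing the quick start setup from above (the `"ipo"` value and the output directory name are just illustrative choices):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
train_dataset = load_dataset("trl-lib/Capybara-Preferences", split="train")

# Select the IPO loss instead of the default "sigmoid"; for IPO, `beta` acts as the
# regularization parameter described in the table below.
training_args = DPOConfig(output_dir="dpo-qwen2-ipo", loss_type="ipo", beta=0.1, logging_steps=10)
trainer = DPOTrainer(model=model, args=training_args, tokenizer=tokenizer, train_dataset=train_dataset)
trainer.train()
```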
The following loss functions are supported: + +| `loss_type=` | Description | +| ------------------------ | ----------- | +| `"sigmoid"` (default) | Given the preference data, we can fit a binary classifier according to the Bradley-Terry model, and in fact the [DPO](https://huggingface.co/papers/2305.18290) authors propose the sigmoid loss on the normalized likelihood via the `logsigmoid` to fit a logistic regression. | +| `"hinge"` | The [RSO](https://huggingface.co/papers/2309.06657) authors propose to use a hinge loss on the normalized likelihood from the [SLiC](https://huggingface.co/papers/2305.10425) paper. In this case, the `beta` is the reciprocal of the margin. | +| `"ipo"` | The [IPO](https://huggingface.co/papers/2310.12036) authors provide a deeper theoretical understanding of the DPO algorithm, identify an issue with overfitting, and propose an alternative loss. In this case, the `beta` is the reciprocal of the gap between the log-likelihood ratios of the chosen vs the rejected completion pair, and thus the smaller the `beta`, the larger this gap is. As per the paper, the loss is averaged over the log-likelihoods of the completion (unlike DPO, which only sums them). | +| `"exo_pair"` | The [EXO](https://huggingface.co/papers/2402.00856) authors propose to minimize the reverse KL instead of the negative log-sigmoid loss of DPO, which corresponds to the forward KL. Setting non-zero `label_smoothing` (default `1e-3`) leads to a simplified version of EXO on pair-wise preferences (see Eqn. (16) of the [EXO paper](https://huggingface.co/papers/2402.00856)). The full version of EXO uses `K>2` completions generated by the SFT policy, which becomes an unbiased estimator of the PPO objective (up to a constant) when `K` is sufficiently large. | +| `"nca_pair"` | The [NCA](https://huggingface.co/papers/2402.05369) authors show that NCA optimizes the absolute likelihood for each response rather than the relative likelihood. | +| `"robust"` | The [Robust DPO](https://huggingface.co/papers/2403.00409) authors propose an unbiased estimate of the DPO loss that is robust to preference noise in the data. Like in cDPO, it assumes that the preference labels are noisy with some probability. In this approach, the `label_smoothing` parameter in the [`DPOConfig`] is used to model the probability of existing label noise. 
To apply this conservative loss, set `label_smoothing` to a value greater than 0.0 (between 0.0 and 0.5; the default is 0.0) | +| `"bco_pair"` | The [BCO](https://huggingface.co/papers/2404.04656) authors train a binary classifier whose logit serves as a reward so that the classifier maps {prompt, chosen completion} pairs to 1 and {prompt, rejected completion} pairs to 0. | +| `"sppo_hard"` | The [SPPO](https://huggingface.co/papers/2405.00675) authors claim that SPPO is capable of solving the Nash equilibrium iteratively by pushing the chosen rewards to be as large as 1/2 and the rejected rewards to be as small as -1/2 and can alleviate data sparsity issues. The implementation approximates this algorithm by employing hard label probabilities, assigning 1 to the winner and 0 to the loser. | +| `"aot"` or `"aot_pair"` | The [AOT](https://huggingface.co/papers/2406.05882) authors propose to use Distributional Preference Alignment Via Optimal Transport. Traditionally, the alignment algorithms use paired preferences at a sample level, which does not ensure alignment on the distributional level. AOT, on the other hand, can align LLMs on paired or unpaired preference data by making the reward distribution of the positive samples stochastically dominant in the first order on the distribution of negative samples. Specifically, `loss_type="aot"` is appropriate for paired datasets, where each prompt has both chosen and rejected responses; `loss_type="aot_pair"` is for unpaired datasets. In a nutshell, `loss_type="aot"` ensures that the log-likelihood ratio of chosen to rejected of the aligned model has higher quantiles than that ratio for the reference model. `loss_type="aot_pair"` ensures that the chosen reward is higher on all quantiles than the rejected reward. Note that in both cases quantiles are obtained via sorting. To fully leverage the advantages of the AOT algorithm, it is important to maximize the per-GPU batch size. | +| `"apo_zero"` or `"apo_down"` | The [APO](https://huggingface.co/papers/2408.06266) method introduces an "anchored" version of the alignment objective. There are two variants: `apo_zero` and `apo_down`. The `apo_zero` loss increases the likelihood of winning outputs while decreasing the likelihood of losing outputs, making it suitable when the model is less performant than the winning outputs. On the other hand, `apo_down` decreases the likelihood of both winning and losing outputs, but with a stronger emphasis on reducing the likelihood of losing outputs. This variant is more effective when the model is better than the winning outputs. | + +## To categorize + +| Parameter | Description | +| --------------------- | ----------- | +| `label_smoothing=0.1` | The [cDPO](https://ericmitchell.ai/cdpo.pdf) is a tweak on the DPO loss where we assume that the preference labels are noisy with some probability. In this approach, the `label_smoothing` parameter in the [`DPOConfig`] is used to model the probability of existing label noise. 
To apply this conservative loss, set `label_smoothing` to a value greater than 0.0 (between 0.0 and 0.5; the default is 0.0). | +| `sync_ref_model=True` | The [TR-DPO](https://huggingface.co/papers/2404.09656) paper suggests syncing the reference model weights after every `ref_model_sync_steps` steps of SGD with weight `ref_model_mixup_alpha` during DPO training. To toggle this callback use the `sync_ref_model=True` in the [`DPOConfig`]. | +| `rpo_alpha=1.0` | The [RPO](https://huggingface.co/papers/2404.19733) paper implements an iterative preference tuning algorithm using a loss related to the RPO loss in this [paper](https://huggingface.co/papers/2405.16436) that essentially consists of a weighted SFT loss on the chosen preferences together with the DPO loss. To use this loss, set the `rpo_alpha` in the [`DPOConfig`] to an appropriate value. The paper suggests setting this weight to `1.0`. | ### For Mixture of Experts Models: Enabling the auxiliary loss @@ -173,55 +166,30 @@ While training and evaluating we record the following reward metrics: You can further accelerate QLoRA / LoRA (2x faster, 60% less memory) using the [`unsloth`](https://github.com/unslothai/unsloth) library that is fully compatible with `SFTTrainer`. Currently `unsloth` supports only Llama (Yi, TinyLlama, Qwen, Deepseek etc) and Mistral architectures. Some benchmarks for DPO listed below: -| GPU | Model | Dataset | 🤗 | 🤗 + Flash Attention 2 | 🦥 Unsloth | 🦥 VRAM saved | -| -------- | --------- | ---------- | --- | ---------------------- | ---------- | ------------- | -| A100 40G | Zephyr 7b | Ultra Chat | 1x | 1.24x | **1.88x** | -11.6% | -| Tesla T4 | Zephyr 7b | Ultra Chat | 1x | 1.09x | **1.55x** | -18.6% | +| GPU | Model | Dataset | 🤗 | 🤗 + Flash Attention 2 | 🦥 Unsloth | 🦥 VRAM saved | +| -------- | --------- | ---------- | --- | --------------------- | --------- | ------------ | +| A100 40G | Zephyr 7b | Ultra Chat | 1x | 1.24x | **1.88x** | -11.6% | +| Tesla T4 | Zephyr 7b | Ultra Chat | 1x | 1.09x | **1.55x** | -18.6% | First install `unsloth` according to the [official documentation](https://github.com/unslothai/unsloth). Once installed, you can incorporate unsloth into your workflow in a very simple manner; instead of loading `AutoModelForCausalLM`, you just need to load a `FastLanguageModel` as follows: -```python -import torch -from trl import DPOConfig, DPOTrainer -from unsloth import FastLanguageModel +```diff + from datasets import load_dataset + from trl import DPOConfig, DPOTrainer +- from transformers import AutoModelForCausalLM, AutoTokenizer ++ from unsloth import FastLanguageModel -max_seq_length = 2048 # Supports automatic RoPE Scaling, so choose any number. +- model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct") +- tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct") ++ model, tokenizer = FastLanguageModel.from_pretrained("Qwen/Qwen2-0.5B-Instruct") ++ model = FastLanguageModel.get_peft_model(model) + train_dataset = load_dataset("trl-lib/Capybara-Preferences", split="train") -# Load model -model, tokenizer = FastLanguageModel.from_pretrained( - model_name = "unsloth/zephyr-sft", - max_seq_length = max_seq_length, - dtype = None, # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+ - load_in_4bit = True, # Use 4bit quantization to reduce memory usage. Can be False. 
- # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf -) +- training_args = DPOConfig(output_dir="dpo-qwen2", logging_steps=10) ++ training_args = DPOConfig(output_dir="dpo-qwen2", logging_steps=10, bf16=True) + trainer = DPOTrainer(model=model, args=training_args, tokenizer=tokenizer, train_dataset=train_dataset) + trainer.train() -# Do model patching and add fast LoRA weights -model = FastLanguageModel.get_peft_model( - model, - r = 16, - target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", - "gate_proj", "up_proj", "down_proj",], - lora_alpha = 16, - lora_dropout = 0, # Dropout = 0 is currently optimized - bias = "none", # Bias = "none" is currently optimized - use_gradient_checkpointing = True, - random_state = 3407, -) - -training_args = DPOConfig( - output_dir="./output", - beta=0.1, -) - -dpo_trainer = DPOTrainer( - model, - ref_model=None, - args=training_args, - train_dataset=train_dataset, - tokenizer=tokenizer, -) -dpo_trainer.train() ``` The saved model is fully compatible with Hugging Face's transformers library. Learn more about unsloth in their [official repository](https://github.com/unslothai/unsloth). diff --git a/docs/source/online_dpo_trainer.md b/docs/source/online_dpo_trainer.md index 40f26214a1..d52b2c954e 100644 --- a/docs/source/online_dpo_trainer.md +++ b/docs/source/online_dpo_trainer.md @@ -38,11 +38,7 @@ train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train") training_args = OnlineDPOConfig(output_dir="online-dpo-qwen2", logging_steps=10) trainer = OnlineDPOTrainer( - model=model, - reward_model=reward_model, - args=training_args, - tokenizer=tokenizer, - train_dataset=train_dataset, + model=model, reward_model=reward_model, args=training_args, tokenizer=tokenizer, train_dataset=train_dataset ) trainer.train() ``` diff --git a/examples/datasets/rlaif-v.py b/examples/datasets/rlaif-v.py new file mode 100644 index 0000000000..ec2501d4c7 --- /dev/null +++ b/examples/datasets/rlaif-v.py @@ -0,0 +1,73 @@ +# Copyright 2024 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from dataclasses import dataclass +from typing import Optional + +from datasets import features, load_dataset +from transformers import HfArgumentParser + + +@dataclass +class ScriptArguments: + r""" + Arguments for the script. + + Args: + push_to_hub (`bool`, *optional*, defaults to `False`): + Whether to push the dataset to the Hugging Face Hub. + repo_id (`str`, *optional*, defaults to `"trl-lib/rlaif-v"`): + Hugging Face repository ID to push the dataset to. + dataset_num_proc (`Optional[int]`, *optional*, defaults to `None`): + Number of workers to use for dataset processing. 
+ """ + + push_to_hub: bool = False + repo_id: str = "trl-lib/rlaif-v" + dataset_num_proc: Optional[int] = None + + +def to_conversational(example): + """ + Convert prompt from "xxx" to [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "xxx"}]}] + and chosen and rejected from "xxx" to [{"role": "assistant", "content": [{"type": "text", "text": "xxx"}]}]. + Images are wrapped into a list. + """ + prompt = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": example["question"]}]}] + chosen = [{"role": "assistant", "content": [{"type": "text", "text": example["chosen"]}]}] + rejected = [{"role": "assistant", "content": [{"type": "text", "text": example["rejected"]}]}] + return {"prompt": prompt, "images": [example["image"]], "chosen": chosen, "rejected": rejected} + + +if __name__ == "__main__": + parser = HfArgumentParser(ScriptArguments) + script_args = parser.parse_args_into_dataclasses()[0] + + dataset = load_dataset("openbmb/RLAIF-V-Dataset", split="train") + dataset = dataset.map( + to_conversational, + num_proc=script_args.dataset_num_proc, + remove_columns=dataset.column_names, + writer_batch_size=128, + ) + + # Cast the images to Sequence[Image] to avoid bytes format + f = dataset.features + f["images"] = features.Sequence(features.Image(decode=True)) + dataset = dataset.cast(f) + + dataset = dataset.train_test_split(test_size=0.01, writer_batch_size=128) + + if script_args.push_to_hub: + dataset.push_to_hub(script_args.repo_id) diff --git a/examples/scripts/dpo.py b/examples/scripts/dpo.py index 5fe7ddf1ca..0669bcc093 100644 --- a/examples/scripts/dpo.py +++ b/examples/scripts/dpo.py @@ -47,7 +47,6 @@ """ import torch -from accelerate import PartialState from datasets import load_dataset from transformers import AutoModelForCausalLM, AutoTokenizer @@ -60,8 +59,6 @@ get_kbit_device_map, get_peft_config, get_quantization_config, - maybe_apply_chat_template, - maybe_extract_prompt, ) from trl.trainer.utils import SIMPLE_CHAT_TEMPLATE @@ -115,12 +112,6 @@ ################ dataset = load_dataset(script_args.dataset_name) - with PartialState().local_main_process_first(): - dataset = dataset.map(maybe_extract_prompt, num_proc=training_args.dataset_num_proc) - dataset = dataset.map( - maybe_apply_chat_template, num_proc=training_args.dataset_num_proc, fn_kwargs={"tokenizer": tokenizer} - ) - ########## # Training ################ diff --git a/tests/test_data_utils.py b/tests/test_data_utils.py index d4fbf406be..6f894729c0 100644 --- a/tests/test_data_utils.py +++ b/tests/test_data_utils.py @@ -268,7 +268,7 @@ def test_maybe_unpair_preference_dataset_dict_already_paired(self): class ExtractPromptTester(unittest.TestCase): - example_implicit_prompt = { + example_implicit_prompt_conversational = { "chosen": [ {"role": "user", "content": "What color is the sky?"}, {"role": "assistant", "content": "It is blue."}, @@ -279,7 +279,7 @@ class ExtractPromptTester(unittest.TestCase): ], } - example_explicit_prompt = { + example_explicit_prompt_conversational = { "prompt": [ {"role": "user", "content": "What color is the sky?"}, ], @@ -291,30 +291,68 @@ class ExtractPromptTester(unittest.TestCase): ], } - def test_extract_prompt(self): + example_implicit_prompt_standard = { + "chosen": "The sky is blue.", + "rejected": "The sky is green.", + } + + example_explicit_prompt_standard = { + "prompt": "The sky is", + "chosen": " blue.", + "rejected": " green.", + } + + def test_extract_prompt_conversational(self): + # Test that the prompt is correctly extracted 
from the dataset + example_extracted_prompt = extract_prompt(self.example_implicit_prompt_conversational) + self.assertEqual( + example_extracted_prompt, + self.example_explicit_prompt_conversational, + "The prompt is not correctly extracted from the dataset.", + ) + + def test_maybe_extract_prompt_conversational(self): + # Test that the prompt is correctly extracted from the dataset with maybe_extract_prompt + example_extracted_prompt = maybe_extract_prompt(self.example_implicit_prompt_conversational) + self.assertEqual( + example_extracted_prompt, + self.example_explicit_prompt_conversational, + "The prompt is not correctly extracted from the dataset.", + ) + + def test_maybe_extract_prompt_conversational_already_explicit(self): + # Test that the prompt remains unchanged with maybe_extract_prompt + example_extracted_prompt = maybe_extract_prompt(self.example_explicit_prompt_conversational) + self.assertEqual( + example_extracted_prompt, + self.example_explicit_prompt_conversational, + "The prompt should remain unchanged.", + ) + + def test_extract_prompt_standard(self): # Test that the prompt is correctly extracted from the dataset - example_extracted_prompt = extract_prompt(self.example_implicit_prompt) + example_extracted_prompt = extract_prompt(self.example_implicit_prompt_standard) self.assertEqual( example_extracted_prompt, - self.example_explicit_prompt, + self.example_explicit_prompt_standard, "The prompt is not correctly extracted from the dataset.", ) - def test_maybe_extract_prompt(self): + def test_maybe_extract_prompt_standard(self): # Test that the prompt is correctly extracted from the dataset with maybe_extract_prompt - example_extracted_prompt = maybe_extract_prompt(self.example_implicit_prompt) + example_extracted_prompt = maybe_extract_prompt(self.example_implicit_prompt_standard) self.assertEqual( example_extracted_prompt, - self.example_explicit_prompt, + self.example_explicit_prompt_standard, "The prompt is not correctly extracted from the dataset.", ) - def test_maybe_extract_prompt_already_explicit(self): + def test_maybe_extract_prompt_standard_already_explicit(self): # Test that the prompt remains unchanged with maybe_extract_prompt - example_extracted_prompt = maybe_extract_prompt(self.example_explicit_prompt) + example_extracted_prompt = maybe_extract_prompt(self.example_explicit_prompt_standard) self.assertEqual( example_extracted_prompt, - self.example_explicit_prompt, + self.example_explicit_prompt_standard, "The prompt should remain unchanged.", ) diff --git a/trl/data_utils.py b/trl/data_utils.py index 59c59202e4..266dceaad7 100644 --- a/trl/data_utils.py +++ b/trl/data_utils.py @@ -11,7 +11,7 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -from typing import Any, Dict, List, Optional, TypeVar +from typing import Any, Dict, List, Optional, Sequence, TypeVar from datasets import Dataset, DatasetDict from transformers import PreTrainedTokenizer @@ -280,7 +280,7 @@ def maybe_unpair_preference_dataset(dataset: DatasetType, num_proc: Optional[int return dataset -def extract_prompt(example: Dict[str, List]) -> Dict[str, List]: +def extract_prompt(example: Dict[str, Sequence]) -> Dict[str, Sequence]: r""" Extracts the shared prompt from a preference data example, where the prompt is implicit within both the chosen and rejected completions. 
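For reference, a small usage sketch of how `extract_prompt` is expected to behave on both standard and conversational examples after this change (the import path is the module edited here; the expected outputs mirror the test fixtures above):

```python
from trl.data_utils import extract_prompt

# Standard (string) pair: the shared prefix "The sky is" is split out as the prompt.
standard = {"chosen": "The sky is blue.", "rejected": "The sky is green."}
print(extract_prompt(standard))
# {'prompt': 'The sky is', 'chosen': ' blue.', 'rejected': ' green.'}

# Conversational pair: the shared leading turns become the prompt, and only the
# diverging assistant turns remain in "chosen" and "rejected".
conversational = {
    "chosen": [
        {"role": "user", "content": "What color is the sky?"},
        {"role": "assistant", "content": "It is blue."},
    ],
    "rejected": [
        {"role": "user", "content": "What color is the sky?"},
        {"role": "assistant", "content": "It is green."},
    ],
}
print(extract_prompt(conversational))
# {'prompt': [{'role': 'user', 'content': 'What color is the sky?'}],
#  'chosen': [{'role': 'assistant', 'content': 'It is blue.'}],
#  'rejected': [{'role': 'assistant', 'content': 'It is green.'}]}
```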
@@ -288,7 +288,9 @@ def extract_prompt(example: Dict[str, List]) -> Dict[str, List]: For more details, see [`maybe_extract_prompt`]. """ for idx in range(min(len(example["chosen"]), len(example["rejected"]))): - if example["chosen"][idx]["content"] != example["rejected"][idx]["content"]: + if example["chosen"][idx] != example["rejected"][idx]: + if example["chosen"][idx - 1] == " ": # remove space before the prompt + idx -= 1 break return { "prompt": example["chosen"][:idx], @@ -303,7 +305,6 @@ def maybe_extract_prompt(example: Dict[str, List]) -> Dict[str, List]: the chosen and rejected completions. If the example already contains a `"prompt"` key, the function returns the example as is. Else, the function - identifies the longest common sequence (prefix) of conversation turns between the "chosen" and "rejected" completions and extracts this as the prompt. It then removes this prompt from the respective "chosen" and "rejected" completions. @@ -311,7 +312,7 @@ def maybe_extract_prompt(example: Dict[str, List]) -> Dict[str, List]: Args: example (`Dict[str, List]`): A dictionary representing a single data entry in the preference dataset. It must contain the keys - `"chosen"` and `"rejected"`, where each value is a list. + `"chosen"` and `"rejected"`, where each value is either conversational or standard (`str`). Returns: `Dict[str, List]`: A dictionary containing: @@ -379,7 +380,10 @@ def maybe_extract_prompt(example: Dict[str, List]) -> Dict[str, List]: # "chosen": [{"role": "user", "content": "What color is the sky?"}, {"role": "assistant", "content": "It is blue."}], # "rejected": [{"role": "user", "content": "What color is the sky?"}, {"role": "assistant", "content": "It is green."}]} # That's why we check if the prompt is also conversational before deciding not to extract it. - if "prompt" in example and is_conversational({"prompt": example["prompt"]}): - return example - else: - return extract_prompt({"chosen": example["chosen"], "rejected": example["rejected"]}) + if "prompt" in example: + # Both conversational or both non-conversational + chosen_conv = is_conversational({"chosen": example["chosen"]}) + prompt_conv = is_conversational({"prompt": example["prompt"]}) + if (chosen_conv and prompt_conv) or (not chosen_conv and not prompt_conv): + return example + return extract_prompt({"chosen": example["chosen"], "rejected": example["rejected"]}) diff --git a/trl/trainer/dpo_config.py b/trl/trainer/dpo_config.py index a517c3cbfe..cac196638e 100644 --- a/trl/trainer/dpo_config.py +++ b/trl/trainer/dpo_config.py @@ -40,6 +40,9 @@ class DPOConfig(TrainingArguments): command line. Parameters: + learning_rate (`float`, *optional*, defaults to `1e-6`): + Initial learning rate for [`AdamW`] optimizer. The default value replaces that of + [`~transformers.TrainingArguments`]. beta (`float`, *optional*, defaults to `0.1`): Parameter controlling the deviation from the reference model. Higher β means less deviation from the reference model. For the IPO loss (`loss_type="ipo"`), β is the regularization parameter denoted by τ in @@ -130,6 +133,7 @@ class DPOConfig(TrainingArguments): DPO loss. The paper recommends `rpo_alpha=1.0`. 
""" + learning_rate: float = 1e-6 beta: float = 0.1 label_smoothing: float = 0.0 loss_type: Literal[ diff --git a/trl/trainer/dpo_trainer.py b/trl/trainer/dpo_trainer.py index e04e4693d8..bc17ffdf39 100644 --- a/trl/trainer/dpo_trainer.py +++ b/trl/trainer/dpo_trainer.py @@ -44,6 +44,7 @@ from transformers.trainer_utils import EvalLoopOutput from transformers.utils import is_peft_available +from ..data_utils import maybe_apply_chat_template, maybe_extract_prompt from ..models import PreTrainedModelWrapper, create_reference_model from .callbacks import SyncRefModelCallback from .dpo_config import DPOConfig, FDivergenceConstants, FDivergenceType @@ -815,6 +816,17 @@ def make_inputs_require_grad(module, input, output): # Compute that only on the main process for faster data processing. # see: https://github.com/huggingface/trl/pull/1255 with PartialState().local_main_process_first(): + # Extract the prompt if needed, and apply the chat template if needed + train_dataset = train_dataset.map(maybe_extract_prompt, num_proc=args.dataset_num_proc) + train_dataset = train_dataset.map( + maybe_apply_chat_template, fn_kwargs={"tokenizer": tokenizer}, num_proc=args.dataset_num_proc + ) + if eval_dataset is not None: + eval_dataset = eval_dataset.map(maybe_extract_prompt, num_proc=args.dataset_num_proc) + eval_dataset = eval_dataset.map( + maybe_apply_chat_template, fn_kwargs={"tokenizer": tokenizer}, num_proc=args.dataset_num_proc + ) + # tokenize the dataset, lower writer batch size to avoid OOM (frequent in vision models) fn_kwargs = { "tokenizer": self.tokenizer, diff --git a/trl/trainer/kto_config.py b/trl/trainer/kto_config.py index 85e8271331..4cac67e915 100644 --- a/trl/trainer/kto_config.py +++ b/trl/trainer/kto_config.py @@ -28,7 +28,8 @@ class KTOConfig(TrainingArguments): Parameters: learning_rate (`float`, *optional*, defaults to `5e-7`): - Initial learning rate for [`AdamW`] optimizer. The default value replaces that of [`~transformers.TrainingArguments`]. + Initial learning rate for [`AdamW`] optimizer. The default value replaces that of + [`~transformers.TrainingArguments`]. max_length (`Optional[int]`, *optional*, defaults to `None`): Maximum length of the sequences (prompt + completion) in the batch. This argument is required if you want to use the default data collator.