This repository has been archived by the owner on Dec 14, 2023. It is now read-only.

Merge pull request #26 from ExponentialML/version/version2
Text To Video Finetuning Version 2

## Changes and Updates
- [x] High quality VRAM config.
- [x] Add text encoder training.
- [x] Allow training on low VRAM systems.
- [x] Allow single image training.
- [x] Train with image captions.
- [x] Train with video captions in folder.
- [x] Gradient checkpointing support.
- [x] Time agnostic training.
- [x] Add aspect ratio bucketing.
- [x] Verify installation.
- [x] Add hybrid LoRA for training.
- [x] Add latent caching.
- [x] Add optimizer agnostic settings in config.
- [x] Soup up UNet finetuner for readability and efficiency.
- [x] Update README to reflect training.
ExponentialML authored Apr 9, 2023
2 parents 25697f9 + 4b0be8a commit 9c85d2d
Showing 15 changed files with 1,442 additions and 599 deletions.
206 changes: 64 additions & 142 deletions README.md
@@ -1,199 +1,121 @@
# Text-To-Video-Finetuning
## Finetune ModelScope's Text To Video model using Diffusers 🧨
***(This is a WIP)***

[output.webm](https://user-images.githubusercontent.com/59846140/230748413-fe91e90b-94b9-49ea-97ec-250469ee9472.webm)

### Updates
- **2023-4-8**: Version 2 is released!
- **2023-3-29**: Added gradient checkpointing support.
- **2023-3-27**: Support for using Scaled Dot Product Attention for Torch 2.0 users.

## Getting Started

### Requirements & Installation

#### Repository Requirements
```bash
git clone https://github.com/ExponentialML/Text-To-Video-Finetuning.git
cd Text-To-Video-Finetuning
git lfs install
git clone https://huggingface.co/damo-vilab/text-to-video-ms-1.7b ./models/model_scope_diffusers/
```

### Create Conda Environment (Optional)
It is recommended to install Anaconda.

**Windows Installation:** https://docs.anaconda.com/anaconda/install/windows/

**Linux Installation:** https://docs.anaconda.com/anaconda/install/linux/

```bash
conda create -n text2video-finetune python=3.10
conda activate text2video-finetune
```

### Python Requirements
```bash
pip install -r requirements.txt
```

## Hardware

All code was tested on Python 3.10.9 & Torch versions 1.13.1 & 2.0.

It is **highly recommended** to install Torch >= 2.0. This way, you don't have to install Xformers *or* worry about memory performance.

If you aren't on Torch 2.0, you can potentially save memory by installing Xformers and enabling it in your config. You can follow the installation instructions here: https://github.com/facebookresearch/xformers

An RTX 3090 is recommended, but you should be able to train on GPUs with 16GB of VRAM or less with:
- Validation turned off.
- Xformers or Torch 2.0 Scaled Dot-Product Attention enabled.
- Gradient checkpointing enabled.
- Resolution of 256.
- All LoRA options enabled.
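
As a quick way to confirm which attention path you'll get, here is a small sanity-check snippet (illustrative only, not part of this repository):

```python
import torch

# Torch >= 2.0 ships fused scaled dot-product attention natively,
# so Xformers is not required for memory-efficient attention.
if hasattr(torch.nn.functional, "scaled_dot_product_attention"):
    print(f"Torch {torch.__version__}: native SDPA available.")
else:
    try:
        import xformers  # noqa: F401
        print(f"Torch {torch.__version__}: no native SDPA, falling back to Xformers.")
    except ImportError:
        print("Neither Torch 2.0 SDPA nor Xformers found; expect higher VRAM usage.")
```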

## Preprocessing your data

### Using Captions

You can use caption files when training on images or video. Simply place them into a folder like so:

**Images**: `/images/img.png | /images/img.txt`
**Videos**: `/videos/vid.mp4 | /videos/vid.txt`
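
As a rough sketch of what this layout implies (this is *not* the repository's actual dataloader, just an illustration of the naming convention), each media file is paired with a same-named `.txt` caption:

```python
from pathlib import Path

def collect_caption_pairs(root: str, media_ext: str = ".mp4"):
    """Pair every media file in `root` with its same-named .txt caption, if one exists."""
    pairs = []
    for media in sorted(Path(root).glob(f"*{media_ext}")):
        caption_file = media.with_suffix(".txt")
        caption = caption_file.read_text().strip() if caption_file.exists() else ""
        pairs.append((str(media), caption))
    return pairs

# e.g. collect_caption_pairs("/videos") -> [("/videos/vid.mp4", "a caption for vid"), ...]
```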

Then in your config, make sure to have `-folder` enabled, along with the root directory containing the files.

### Process Automatically

You can automatically caption your videos using the [Video-BLIP2-Preprocessor Script](https://github.com/ExponentialML/Video-BLIP2-Preprocessor). Please follow the instructions there.

### Custom Dataloaders

If you wish to use a custom dataloader (for instance, a folder of mp4s and captions), you're free to update the dataloader [here](https://github.com/ExponentialML/Text-To-Video-Finetuning/blob/d72e34cfbd91d2a62c07172f9ef079ca5cd651b2/utils/dataset.py#L83). Feel free to share your dataloaders for others to use! It would be much appreciated.

## Configuration

The configuration uses a YAML config borrowed from the [Tune-A-Video](https://github.com/showlab/Tune-A-Video) repository.

All configuration details are placed in `configs/v2/train_config.yaml`. Each parameter has a definition for what it does.

### How would you recommend I proceed with making a config with my data?

I highly recommend (I did this myself) going to `configs/v2/train_config.yaml`. Then make a copy of it and name it whatever you wish (e.g. `my_train.yaml`).

Then, follow each line and configure it for your specific use case.

The instructions should be clear enough to get you up and running with your dataset, but feel free to ask any questions in the discussion board.
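
Since the config follows the Tune-A-Video YAML style, a quick way to sanity-check your copy before training is to load it and print it back. This assumes OmegaConf (which Tune-A-Video-style configs typically use); the repo's own loading code may differ:

```python
from omegaconf import OmegaConf

# Load the copied config and dump it back out to confirm every field parses as expected.
config = OmegaConf.load("./configs/v2/my_train.yaml")
print(OmegaConf.to_yaml(config))
```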

## Finetune
```bash
python train.py --config ./configs/v2/train_config.yaml
```
---

## Training Results

With a lot of data, you can expect training results to show at roughly 2500 steps at a constant learning rate of 5e-6. Play around with learning rates to see what works best for you (5e-6, 3e-5, 1e-4).

When finetuning on a single video, you should see results in half as many steps.

After training, you should see your results in your output directory. By default, it should be placed at the script root under `./outputs/train_<date>`.

## Example Configuration
Here's the gist of how the YAML config works.

<details>

```yaml

# The path to your diffusers folder. The structure should look exactly like the huggingface one with folders and json configs
pretrained_model_path: "diffusers_path"

# The directory where your training runs (and samples) will be saved.
output_dir: "./outputs"

# Enable training the text encoder or not.
train_text_encoder: False

# The base directory of where your training data is stored.
train_data:

  # The path to your JSON file using the steps above.
  json_path: "json/train.json"

  # Leave this as true for now. Custom configurations are currently not supported.
  preprocessed: True

  # Number of frames to sample from the videos. The higher this number, the more VRAM is required (usage is similar to batch size).
  n_sample_frames: 4

  # Choose whether or not to ignore the frame data from the preprocessing step, and shuffle the frames.
  shuffle_frames: False

  # The height and width of training data.
  width: 256
  height: 256

  # At what frame to start the video sampling. Ignores preprocessing frames.
  sample_start_idx: 0

  # The rate of sampling frames. This effectively "skips" frames, making the video appear faster or slower.
  sample_frame_rate: 1

  # The key of the video data name. This is to align with any preprocess script changes.
  vid_data_key: "video_path"

  # The video path and prompt for that video for single video training.
  # If enabled, the JSON path is ignored.
  single_video_path: ""
  single_video_prompt: ""

# This is the data for validation during training. The prompt will override training data prompts.
validation_data:
  sample_preview: True
  prompt: ""
  num_frames: 16
  width: 256
  height: 256
  num_inference_steps: 50
  guidance_scale: 9

# Training parameters
learning_rate: 5e-6
adam_weight_decay: 0
train_batch_size: 1
max_train_steps: 50000

# Allow checkpointing during training (save once every X amount of steps)
checkpointing_steps: 10000

# How many steps during training before we create a sample
validation_steps: 100

# The parameters to unfreeze. As it is now, all attention layers are unfrozen.
# Unfreezing resnet layers would lead to better quality, but consumes a very large amount of VRAM.
trainable_modules:
  - "attn1"
  - "attn2"

# Seed for sampling validation
seed: 64

# Use mixed precision for better memory allocation
mixed_precision: "fp16"

# This seems to be incompatible at the moment in my testing.
use_8bit_adam: False

# Currently has no effect.
enable_xformers_memory_efficient_attention: True

```
</details>
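
To make the frame-sampling parameters above concrete, here is a tiny illustration (not the repository's dataset code) of which frame indices a given `sample_start_idx`, `sample_frame_rate`, and `n_sample_frames` would select from a clip:

```python
def sampled_frame_indices(sample_start_idx: int, sample_frame_rate: int, n_sample_frames: int):
    """Frame indices pulled from a clip for one training sample."""
    return [sample_start_idx + i * sample_frame_rate for i in range(n_sample_frames)]

print(sampled_frame_indices(0, 1, 4))  # [0, 1, 2, 3]  -> consecutive frames
print(sampled_frame_indices(0, 2, 4))  # [0, 2, 4, 6]  -> skips frames, so motion appears faster
```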

## Trainable modules (Advanced Usage)
The `trainable_modules` parameter is a list, set by the user, that tells the model which layers to unfreeze (see the sketch at the end of this section).

Typically you want to train the cross attention layers. The more layers you unfreeze, the higher the VRAM usage. In my testing, here is what I see:

`"attentions"`: Uses a lot of VRAM, but has a high probability of quality results.

`"attn1", "attn2"`: Uses a good amount of VRAM, but allows for processing more frames. Good quality finetunes can happen with these settings.

`"attn1.to_out", "attn2.to_out"`: This only trains the linear layers on the cross attention blocks. This seems to be a good tradeoff for VRAM, with great results at a learning rate of 1e-4.

From my testing, I also recommend:
- Keep the number of sample frames between 4-16. Use long frame generation for inference, *not* training.
- If you have a low VRAM system, you can try single frame training or just use `n_sample_frames: 2`.
- Using a learning rate of about `5e-6` seems to work well in all cases.
- The best quality will always come from training the text encoder. If you're limited on VRAM, disabling it can help.
- Leave some memory free to avoid OOM errors when saving models during training.
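
Conceptually, `trainable_modules` works by matching parameter names and unfreezing whatever matches. The sketch below is a simplified illustration of that idea, not the repository's exact implementation:

```python
import torch

def set_trainable(unet: torch.nn.Module, trainable_modules=("attn1", "attn2")):
    """Freeze everything, then unfreeze parameters whose names contain a trainable module name."""
    unet.requires_grad_(False)
    for name, param in unet.named_parameters():
        if any(module_name in name for module_name in trainable_modules):
            param.requires_grad_(True)

# e.g. set_trainable(unet, ("attn1.to_out", "attn2.to_out")) to only train the output linears.
```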

## Running
After training, you can easily run your model by doing the following.

```python
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

my_trained_model_path = "./trained_model_path/"
pipe = DiffusionPipeline.from_pretrained(my_trained_model_path, torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

prompt = "Your prompt based on train data"
video_frames = pipe(prompt, num_inference_steps=25).frames

out_file = "./my_video.mp4"
video_path = export_to_video(video_frames, out_file)
```

## Developing

Please feel free to open a pull request if you have a feature implementation or suggestion! I welcome all contributions.

I've tried to make the code fairly modular so you can hack away, see how the code works, and what the implementations do.

## Deprecation
If you want to use the V1 repository, you can use the branch [here](https://github.com/ExponentialML/Text-To-Video-Finetuning/tree/version/first-release).

## Shoutouts

- [Showlab](https://github.com/showlab/Tune-A-Video) and [bryandlee](https://github.com/bryandlee/Tune-A-Video) for their Tune-A-Video contribution that made this much easier.
- [lucidrains](https://github.com/lucidrains) for their implementations around video diffusion.
- [cloneofsimo](https://github.com/cloneofsimo) for their diffusers implementation of LoRA.
44 changes: 0 additions & 44 deletions configs/my_config.yaml

This file was deleted.

42 changes: 0 additions & 42 deletions configs/my_config_hq.yaml

This file was deleted.

