Fix some errors; now runs on Mac M2 with 64GB memory #174

Open · wants to merge 2 commits into main
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@
.DS_Store
__pycache__/
model/
66 changes: 27 additions & 39 deletions README.md
@@ -5,17 +5,14 @@

<a href='https://minigpt-4.github.io'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/abs/2304.10592'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://huggingface.co/spaces/Vision-CAIR/minigpt4'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue'></a> <a href='https://huggingface.co/Vision-CAIR/MiniGPT-4'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue'></a> [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1OK4kYsZphwt5DXchKkzMBjYF6jnkqh4R?usp=sharing) [![YouTube](https://badges.aleen42.com/src/youtube.svg)](https://www.youtube.com/watch?v=__tftoxpBAw&feature=youtu.be)


## News
We now provide a pretrained MiniGPT-4 aligned with Vicuna-7B! The demo GPU memory consumption can now be as low as 12GB.


## Online Demo

Click the image to chat with MiniGPT-4 about your images
[![demo](figs/online_demo.png)](https://minigpt-4.github.io)


## Examples
| | |
:-------------------------:|:-------------------------:
@@ -24,19 +21,15 @@ Click the image to chat with MiniGPT-4 around your images

More examples can be found in the [project page](https://minigpt-4.github.io).



## Introduction
- MiniGPT-4 aligns a frozen visual encoder from BLIP-2 with a frozen LLM, Vicuna, using just one projection layer.
- We train MiniGPT-4 in two stages. The first, traditional pretraining stage uses roughly 5 million aligned image-text pairs and takes about 10 hours on 4 A100s. After this stage, Vicuna is able to understand the image, but its generation ability is heavily impacted.
- To address this issue and improve usability, we propose a novel way to create high-quality image-text pairs with the model itself and ChatGPT together. Based on this, we then create a small (3,500 pairs in total) yet high-quality dataset.
- The second finetuning stage is trained on this dataset in a conversation template to significantly improve its generation reliability and overall usability. To our surprise, this stage is computationally efficient and takes only around 7 minutes on a single A100.
- MiniGPT-4 yields many emerging vision-language capabilities similar to those demonstrated in GPT-4.

![overview](figs/overview.png)


## Getting Started
### Installation

Expand All @@ -51,11 +44,10 @@ conda env create -f environment.yml
conda activate minigpt4
```


**2. Prepare the pretrained Vicuna weights**

The current version of MiniGPT-4 is built on the v0 version of Vicuna-13B.
Please refer to our instructions [here](PrepareVicuna.md)
to prepare the Vicuna weights.
The final weights should be in a single folder with a structure similar to the following:

@@ -65,10 +57,10 @@ vicuna_weights
├── generation_config.json
├── pytorch_model.bin.index.json
├── pytorch_model-00001-of-00003.bin
...
```

Then, set the path to the Vicuna weights in the model config file
[here](minigpt4/configs/models/minigpt4.yaml#L16) at Line 16.
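
For reference, the line looks like the sketch below (the `./model/vicuna-13b` path is this PR's local layout; substitute wherever you stored the weights):

```yaml
# minigpt4/configs/models/minigpt4.yaml, Line 16
llama_model: './model/vicuna-13b'  # or /path/to/vicuna/weights/
```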

**3. Prepare the pretrained MiniGPT-4 checkpoint**
@@ -77,13 +69,10 @@ Download the pretrained checkpoints according to the Vicuna model you prepare.

| Checkpoint Aligned with Vicuna 13B | Checkpoint Aligned with Vicuna 7B |
:------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------:
[Download](https://drive.google.com/file/d/1a4zLvaiDBr-36pasffmgpvH5P7CKmpze/view?usp=share_link) | [Download](https://drive.google.com/file/d/1RY9jV0dyqLX-o38LrumkKRh6Jtaop58R/view?usp=sharing)

Then, set the path to the pretrained checkpoint in the evaluation config file
in [eval_configs/minigpt4_eval.yaml](eval_configs/minigpt4_eval.yaml#L10) at Line 11.
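
A sketch of the relevant entry (this PR points it at a local `model/` folder; adjust to wherever you saved the checkpoint):

```yaml
# eval_configs/minigpt4_eval.yaml, Line 11
ckpt: './model/pretrained_minigpt4.pth'  # or /path/to/pretrained/ckpt/
```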

### Launching Demo Locally

@@ -93,78 +82,77 @@ Try out our demo [demo.py](demo.py) on your local machine by running
python demo.py --cfg-path eval_configs/minigpt4_eval.yaml --gpu-id 0
```

To save GPU memory, Vicuna loads as 8 bit by default, with a beam search width of 1.
This configuration requires about 23G GPU memory for Vicuna 13B and 11.5G GPU memory for Vicuna 7B.
For more powerful GPUs, you can run the model
in 16 bit by setting low_resource to False in the config file
[minigpt4_eval.yaml](eval_configs/minigpt4_eval.yaml) and use a larger beam search width.
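
For example, the one-line change in the config (a sketch; the rest of the file is unchanged):

```yaml
# eval_configs/minigpt4_eval.yaml
model:
  low_resource: False  # True (the default) loads Vicuna in 8 bit to save GPU memory
```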

Thanks to [@WangRongsheng](https://github.com/WangRongsheng), you can also run our code on [Colab](https://colab.research.google.com/drive/1OK4kYsZphwt5DXchKkzMBjYF6jnkqh4R?usp=sharing).


### Training
The training of MiniGPT-4 contains two alignment stages.

**1. First pretraining stage**

In the first pretraining stage, the model is trained using image-text pairs from the Laion and CC datasets
to align the vision and language models. To download and prepare the datasets, please check
our [first stage dataset preparation instruction](dataset/README_1_STAGE.md).
After the first stage, the visual features are mapped and can be understood by the language
model.
To launch the first stage training, run the following command. In our experiments, we use 4 A100s.
You can change the save path in the config file
[train_configs/minigpt4_stage1_pretrain.yaml](train_configs/minigpt4_stage1_pretrain.yaml)

```bash
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage1_pretrain.yaml
```

A MiniGPT-4 checkpoint with only stage one training can be downloaded
[here (13B)](https://drive.google.com/file/d/1u9FRRBB3VovP1HxCAlpD9Lw4t4P6-Yq8/view?usp=share_link) or [here (7B)](https://drive.google.com/file/d/1HihQtCEXUyBM1i9DQbaK934wW3TZi-h5/view?usp=share_link).
Compared to the model after stage two, this checkpoint frequently generates incomplete and repeated sentences.


**2. Second finetuning stage**

In the second stage, we use a small, high-quality image-text pair dataset created by ourselves
and convert it to a conversation format to further align MiniGPT-4.
To download and prepare our second stage dataset, please check our
[second stage dataset preparation instruction](dataset/README_2_STAGE.md).
To launch the second stage alignment,
first specify the path to the checkpoint file trained in stage 1 in
[train_configs/minigpt4_stage2_finetune.yaml](train_configs/minigpt4_stage2_finetune.yaml).
You can also specify the output path there.
Then, run the following command. In our experiments, we use 1 A100.

```bash
torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage2_finetune.yaml
```

After the second stage alignment, MiniGPT-4 is able to talk about the image coherently and in a user-friendly manner.

## Run on Mac

```bash
pip install -r requirements.txt
```
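
With the requirements installed, the demo launches the same way as on Linux; this PR also adds a `demo.sh` helper that runs the command below from the repository root. Without CUDA, the model loads on the CPU in 32-bit precision with disk offload (see `minigpt4/models/mini_gpt4.py`), so expect high memory use; this is why the PR title mentions a 64GB M2.

```bash
python demo.py --cfg-path eval_configs/minigpt4_eval.yaml
```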

## Acknowledgement

+ [BLIP2](https://huggingface.co/docs/transformers/main/model_doc/blip-2) The model architecture of MiniGPT-4 follows BLIP-2. Don't forget to check out this great open-source work if you haven't seen it before!
+ [Lavis](https://github.com/salesforce/LAVIS) This repository is built upon Lavis!
+ [Vicuna](https://github.com/lm-sys/FastChat) The fantastic language ability of Vicuna, with only 13B parameters, is amazing. And it is open-source!


If you're using MiniGPT-4 in your research or applications, please cite using this BibTeX:
```bibtex
@misc{zhu2022minigpt4,
title={MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models},
author={Deyao Zhu and Jun Chen and Xiaoqian Shen and Xiang Li and Mohamed Elhoseiny},
journal={arXiv preprint arXiv:2304.10592},
year={2023},
}
```


## License
This repository is under [BSD 3-Clause License](LICENSE.md).
Much of the code is based on [Lavis](https://github.com/salesforce/LAVIS) with
BSD 3-Clause License [here](LICENSE_Lavis.md).
15 changes: 9 additions & 6 deletions demo.py
100644 → 100755
@@ -19,6 +19,7 @@
from minigpt4.runners import *
from minigpt4.tasks import *

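# Detect CUDA once at startup; machines without an NVIDIA GPU (e.g., Apple Silicon) fall back to CPU below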
CUDA = torch.cuda.is_available()

def parse_args():
parser = argparse.ArgumentParser(description="Demo")
@@ -57,11 +58,13 @@ def setup_seeds(config)
model_config = cfg.model_cfg
model_config.device_8bit = args.gpu_id
model_cls = registry.get_model_class(model_config.arch)
GPU = 'cuda:{}'.format(args.gpu_id) if CUDA else None  # None when no CUDA: .to(None) is a no-op, so the model stays on CPU
model = model_cls.from_config(model_config).to(GPU)
model = torch.compile(model)

vis_processor_cfg = cfg.datasets_cfg.cc_sbu_align.vis_processor.train
vis_processor = registry.get_processor_class(vis_processor_cfg.name).from_config(vis_processor_cfg)
chat = Chat(model, vis_processor, device=GPU)
print('Initialization Finished')

# ========================================
@@ -118,7 +121,7 @@ def gradio_answer(chatbot, chat_state, img_list, num_beams, temperature):
image = gr.Image(type="pil")
upload_button = gr.Button(value="Upload & Start Chat", interactive=True, variant="primary")
clear = gr.Button("Restart")

num_beams = gr.Slider(
minimum=1,
maximum=10,
@@ -127,7 +130,7 @@ def gradio_answer(chatbot, chat_state, img_list, num_beams, temperature):
interactive=True,
label="beam search numbers)",
)

temperature = gr.Slider(
minimum=0.1,
maximum=2.0,
@@ -142,9 +145,9 @@ def gradio_answer(chatbot, chat_state, img_list, num_beams, temperature):
img_list = gr.State()
chatbot = gr.Chatbot(label='MiniGPT-4')
text_input = gr.Textbox(label='User', placeholder='Please upload your image first', interactive=False)

upload_button.click(upload_img, [image, text_input, chat_state], [image, text_input, upload_button, chat_state, img_list])

text_input.submit(gradio_ask, [text_input, chatbot, chat_state], [text_input, chatbot, chat_state]).then(
gradio_answer, [chatbot, chat_state, img_list, num_beams, temperature], [chatbot, chat_state, img_list]
)
7 changes: 7 additions & 0 deletions demo.sh
@@ -0,0 +1,7 @@
#!/usr/bin/env bash

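# Resolve the directory containing this script so the demo can be launched from any working directory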
DIR=$(realpath "$0") && DIR=${DIR%/*}
cd "$DIR"
set -ex

python demo.py --cfg-path $DIR/eval_configs/minigpt4_eval.yaml
6 changes: 3 additions & 3 deletions environment.yml
@@ -4,13 +4,13 @@ channels:
- defaults
- anaconda
dependencies:
- python=3.10
- cudatoolkit
- pip
- pytorch=2.0.0
- pytorch-mutex=1.0=cuda
- torchaudio=0.12.1
- torchvision=0.15.1
- pip:
- accelerate==0.16.0
- aiohttp==3.8.4
2 changes: 1 addition & 1 deletion eval_configs/minigpt4_eval.yaml
@@ -8,7 +8,7 @@ model:
low_resource: True
prompt_path: "prompts/alignment.txt"
prompt_template: '###Human: {} ###Assistant: '
ckpt: ./model/pretrained_minigpt4.pth # /path/to/pretrained/ckpt/


datasets:
2 changes: 1 addition & 1 deletion minigpt4/configs/models/minigpt4.yaml
@@ -13,7 +13,7 @@ model:
num_query_token: 32

# Vicuna
llama_model: "/path/to/vicuna/weights/"
llama_model: './model/vicuna-13b' # /path/to/vicuna/weights/

# generation configs
prompt: ""
11 changes: 7 additions & 4 deletions minigpt4/models/mini_gpt4.py
@@ -10,6 +10,7 @@
from minigpt4.models.modeling_llama import LlamaForCausalLM
from transformers import LlamaTokenizer

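# CUDA availability determines the dtype and whether 8-bit loading (bitsandbytes) is used below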
CUDA = torch.cuda.is_available()

@registry.register_model("mini_gpt4")
class MiniGPT4(Blip2Base):
@@ -86,17 +87,19 @@ def __init__(
self.llama_tokenizer = LlamaTokenizer.from_pretrained(llama_model, use_fast=False)
self.llama_tokenizer.pad_token = self.llama_tokenizer.eos_token

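# Half precision requires CUDA kernels; fall back to full fp32 on CPU (e.g., Apple Silicon)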
torch_dtype = torch.float16 if CUDA else torch.float32
if self.low_resource:
self.llama_model = LlamaForCausalLM.from_pretrained(
llama_model,
torch_dtype=torch_dtype,
load_in_8bit=CUDA,
offload_folder="model/offload",
device_map={'': device_8bit} if CUDA else 'auto'
)
else:
self.llama_model = LlamaForCausalLM.from_pretrained(
llama_model,
torch_dtype=torch_dtype,
)

for name, param in self.llama_model.named_parameters():
85 changes: 85 additions & 0 deletions requirements.txt
@@ -0,0 +1,85 @@
accelerate==0.18.0
aiofiles==23.1.0
aiohttp==3.8.4
aiosignal==1.3.1
altair==4.2.2
antlr4-python3-runtime==4.9.3
anyio==3.6.2
async-timeout==4.0.2
attrs==23.1.0
bitsandbytes==0.38.1
braceexpand==0.1.7
certifi==2022.12.7
charset-normalizer==3.1.0
click==8.1.3
contourpy==1.0.7
cycler==0.11.0
entrypoints==0.4
eva-decord==0.6.1
fastapi==0.95.1
ffmpy==0.3.0
filelock==3.12.0
fonttools==4.39.3
frozenlist==1.3.3
fsspec==2023.4.0
gradio==3.28.1
gradio_client==0.1.4
h11==0.14.0
httpcore==0.17.0
httpx==0.24.0
huggingface-hub==0.14.1
idna==3.4
iopath==0.1.10
Jinja2==3.1.2
jsonschema==4.17.3
kiwisolver==1.4.4
linkify-it-py==2.0.0
markdown-it-py==2.2.0
MarkupSafe==2.1.2
matplotlib==3.7.1
mdit-py-plugins==0.3.3
mdurl==0.1.2
mpmath==1.3.0
multidict==6.0.4
networkx==3.1
numpy==1.24.3
omegaconf==2.3.0
opencv-python==4.7.0.72
orjson==3.8.11
packaging==23.1
pandas==2.0.1
Pillow==9.5.0
portalocker==2.7.0
psutil==5.9.5
pydantic==1.10.7
pydub==0.25.1
pyparsing==3.0.9
pyrsistent==0.19.3
python-dateutil==2.8.2
python-multipart==0.0.6
pytz==2023.3
PyYAML==6.0
regex==2023.3.23
requests==2.29.0
semantic-version==2.10.0
sentencepiece==0.1.98
six==1.16.0
sniffio==1.3.0
socksio==1.0.0
starlette==0.26.1
sympy==1.11.1
timm==0.6.13
tokenizers==0.13.3
toolz==0.12.0
torch==2.0.0
torchvision==0.15.1
tqdm==4.65.0
transformers==4.28.1
typing_extensions==4.5.0
tzdata==2023.3
uc-micro-py==1.0.1
urllib3==1.26.15
uvicorn==0.22.0
webdataset==0.2.48
websockets==11.0.2
yarl==1.9.2