
About the LLaVA-OneVision 0.5B Visual tokens #271

Open
dragonlzm opened this issue Sep 27, 2024 · 0 comments
dragonlzm commented Sep 27, 2024

I am re-evaluating LLaVA-OneVision 0.5B on ActivityNet-QA and trying to reproduce the reported 50.5%. I load the model checkpoint with the following code:

import warnings
from llava.model.builder import load_pretrained_model

warnings.filterwarnings("ignore")
pretrained = "lmms-lab/llava-onevision-qwen2-0.5b-ov"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)  # add any other llava_model_args here

This loads google/siglip-so400m-patch14-384 as the vision backbone. I evaluate the model with the following hyper-parameters:

--for_get_frames_num 32 \
--mm_spatial_pool_stride 2 \
--mm_spatial_pool_mode average \
--mm_newline_position no_token \
--overwrite True \

The output of the visual backbone has shape torch.Size([32, 729, 896]), but I notice that each video frame is encoded into 169 tokens after self.get_2dPool, instead of the 196 tokens mentioned here. Could you please confirm which hyper-parameters were used to obtain the 50.5% on ActivityNet-QA? Thanks!
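For reference, the 169-vs-196 gap looks like a floor-vs-ceil effect in the stride-2 spatial pooling: the 729 backbone tokens form a 27x27 grid, and pooling an odd-sized grid with stride 2 gives 13x13 = 169 tokens if the edge is floored, versus 14x14 = 196 if it is rounded up. The sketch below is only my assumption of the arithmetic, not the actual get_2dPool implementation:

import torch
import torch.nn.functional as F

# One frame from the SigLIP backbone: 729 tokens = 27 x 27 grid, hidden size 896
frame_tokens = torch.randn(1, 729, 896)
side = 27
x = frame_tokens.transpose(1, 2).reshape(1, 896, side, side)

# Stride-2 average pooling (cf. --mm_spatial_pool_mode average, --mm_spatial_pool_stride 2)
floor_pool = F.avg_pool2d(x, kernel_size=2, stride=2)                  # 13 x 13 grid
ceil_pool = F.avg_pool2d(x, kernel_size=2, stride=2, ceil_mode=True)   # 14 x 14 grid

print(floor_pool.shape[-2] * floor_pool.shape[-1])  # 169
print(ceil_pool.shape[-2] * ceil_pool.shape[-1])    # 196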
