
About the LLaVA-OneVision 0.5B Visual tokens #271

Open
dragonlzm opened this issue Sep 27, 2024 · 0 comments
dragonlzm commented Sep 27, 2024

I am re-evaluating LLaVA-OneVision 0.5B on ActivityNet-QA and trying to reproduce the reported 50.5%. I load the model checkpoint with the following code:

import warnings
from llava.model.builder import load_pretrained_model

warnings.filterwarnings("ignore")
pretrained = "lmms-lab/llava-onevision-qwen2-0.5b-ov"
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)  # add any other llava_model_args here

This loads google/siglip-so400m-patch14-384 as the vision backbone. I evaluate the model with the following hyper-parameters:

--for_get_frames_num 32 \
--mm_spatial_pool_stride 2 \
--mm_spatial_pool_mode average \
--mm_newline_position no_token \
--overwrite True \

The output of the visual backbone has shape torch.Size([32, 729, 896]), but I notice that each video frame is encoded into 169 tokens after self.get_2dPool, instead of the 196 tokens mentioned here. Could you please confirm which hyper-parameters were used to obtain the 50.5% on ActivityNet-QA? Thanks!
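For reference, the 169-vs-196 gap looks like a floor-vs-ceil effect in the stride-2 spatial pooling: the 729 backbone tokens form a 27x27 grid, and pooling an odd-sized grid with stride 2 gives 13x13 = 169 tokens if the edge is floored, versus 14x14 = 196 if it is rounded up. The sketch below is only my assumption of the arithmetic, not the actual get_2dPool implementation:

import torch
import torch.nn.functional as F

# One frame from the SigLIP backbone: 729 tokens = 27 x 27 grid, hidden size 896
frame_tokens = torch.randn(1, 729, 896)
side = 27
x = frame_tokens.transpose(1, 2).reshape(1, 896, side, side)

# Stride-2 average pooling (cf. --mm_spatial_pool_mode average, --mm_spatial_pool_stride 2)
floor_pool = F.avg_pool2d(x, kernel_size=2, stride=2)                  # 13 x 13 grid
ceil_pool = F.avg_pool2d(x, kernel_size=2, stride=2, ceil_mode=True)   # 14 x 14 grid

print(floor_pool.shape[-2] * floor_pool.shape[-1])  # 169
print(ceil_pool.shape[-2] * ceil_pool.shape[-1])    # 196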
