I am re-evaluating LLaVA-OneVision 0.5B on ActivityNet-QA and trying to reproduce the reported 50.5%. I downloaded the model checkpoint using the following commands:
It uses google--siglip-so400m-patch14-384 as the visual backbone. I evaluate the model with the following hyper-parameters:
The output of the visual backbone is torch.Size([32, 729, 896]), but I notice that each video frame is encoded into 169 tokens after `self.get_2dPool`, instead of the 196 tokens mentioned here. Could you please confirm which hyper-parameters were used to get 50.5% on ActivityNet-QA? Thanks!
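For reference, a small sketch of where the 169 vs. 196 discrepancy could come from. The backbone output [32, 729, 896] means 729 tokens per frame, i.e. a 27×27 patch grid (729 = 27²). A stride-2 spatial pool halves each side, and whether the pooled side becomes 13 or 14 depends on rounding; attributing the discrepancy to floor-vs-ceil rounding is my assumption, not confirmed from the repo:

```python
import math

# Backbone output [32, 729, 896]: 32 frames, 729 tokens each.
# 729 = 27 * 27, so each frame is a 27x27 patch grid.
side = math.isqrt(729)  # 27

# A stride-2 spatial pool halves each grid side. The resulting token count
# depends on how the odd side length 27 is rounded (assumption: this rounding
# choice explains 169 vs 196; the repo's get_2dPool would need to be checked).
floor_tokens = (side // 2) ** 2        # floor(27/2) = 13 -> 169 tokens
ceil_tokens = math.ceil(side / 2) ** 2  # ceil(27/2) = 14 -> 196 tokens

print(floor_tokens)  # 169
print(ceil_tokens)   # 196
```

So a pooling path that floors the grid side yields the 169 tokens observed here, while one that ceils (or pads) it yields the 196 tokens the documentation mentions.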