Refactor image features selection in LlaVa #33696
base: main
Conversation
Hey @kenza-bouzid !
Thanks for opening a PR! Yes, it would actually be nice to have as a rule that VLMs expose a `get_image_features` method where all the logic happens. My previous idea was to make the code less cluttered when a VLM has many special crops/pads/concats, but the use case of overriding the method for custom models is very cool.
We already have a similar pattern in video LLMs, for the same reason of uncluttering the code. WDYT about moving the whole image feature preparation into a separate method, including the multimodal projection, so that the method takes in pixel values and returns features ready to be merged with the text embeddings?
And, also, we might need to change more models to follow the same pattern to make CI happy. If you have bandwidth, would be nice to propagate change to all VLMs. But don't stress if you can't, I'll make an overall standardization soon on that :)
Hi @zucchini-nlp, thanks for your reply.
Good idea; however, I would suggest having a separate method for the projection as well, to make it easy to customize. I can think of many use cases where you may want to change the projection. We can have
I'm afraid I won't have time to propagate the change to all VLMs. Is that a blocker for this PR to pass the CI? You probably have more context for an overall standardization!
Hmm, imo that is a bit redundant; since "select + project" is mostly interlinked and isn't a huge piece of code, we can ask users to override that one method, independently of whether they tweak the selection stage or the projection stage. That way we don't end up with separate methods that each perform a one-liner, which imo is bad practice. Lmk if you have any other ideas or objections.
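A minimal sketch of the single-method shape discussed above, where selection and projection live together so users override just one method. This is a simplified stand-in, not the actual `LlavaForConditionalGeneration`; the names (`multi_modal_projector`, `vision_feature_layer`, `vision_feature_select_strategy`) mirror the LLaVa modeling code, but the fake vision tower and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyLlavaLike(nn.Module):
    """Toy stand-in for a LLaVa-style model; not the real class."""

    def __init__(self, vision_dim=8, text_dim=16, vision_feature_layer=-2,
                 vision_feature_select_strategy="default"):
        super().__init__()
        self.vision_feature_layer = vision_feature_layer
        self.vision_feature_select_strategy = vision_feature_select_strategy
        self.multi_modal_projector = nn.Linear(vision_dim, text_dim)

    def get_image_features(self, pixel_values):
        # Stand-in for the vision tower: pretend it returns hidden states for
        # every layer; the real code calls self.vision_tower(pixel_values, ...).
        hidden_states = [torch.randn(pixel_values.shape[0], 5, 8) for _ in range(4)]
        # 1) selection: pick one layer, optionally drop the CLS token
        selected = hidden_states[self.vision_feature_layer]
        if self.vision_feature_select_strategy == "default":
            selected = selected[:, 1:]
        # 2) projection: map vision hidden size to text hidden size,
        #    ready to be merged with the text embeddings
        return self.multi_modal_projector(selected)

model = TinyLlavaLike()
feats = model.get_image_features(torch.zeros(2, 3, 32, 32))
print(feats.shape)  # (batch, num_patches_after_CLS_drop, text_dim)
```

Overriding at either the selection or the projection stage then only requires subclassing and redefining `get_image_features`.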
No worries, I just realized it is the very first llava being modified here, so our CI should be happy after running
Sounds great! Let me make the changes you suggested. Thank you!
```diff
@@ -282,6 +282,21 @@ def resize_token_embeddings(self, new_num_tokens: Optional[int] = None, pad_to_m
         self.vocab_size = model_embeds.num_embeddings
         return model_embeds

     def get_image_features(
```
@zucchini-nlp `make fix-copies` copied this over, but looking at the forward pass of vipllava, the image features selection is slightly different, since it selects features from the layers -2, -5, -8, -11 and 6.
I don't think it's breaking anything, since this function is not called. Note that we'll need `vision_feature_layers: list[int]` and not only a single layer!
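To illustrate the point about VipLlava: it aggregates hidden states from several vision layers rather than a single one, which is why the selection helper would need to accept `vision_feature_layers: list[int]`. A hedged sketch (the concatenation on the hidden dimension follows VipLlava's pattern; the helper name and tensor shapes are illustrative assumptions):

```python
import torch

def select_image_features(hidden_states, vision_feature_layers):
    """hidden_states: list of per-layer tensors of shape (batch, seq, dim)."""
    if isinstance(vision_feature_layers, int):
        # single-layer case (LLaVa-style)
        return hidden_states[vision_feature_layers]
    # multi-layer case (VipLlava-style): concatenate the chosen layers
    # along the hidden dimension
    return torch.cat([hidden_states[i] for i in vision_feature_layers], dim=-1)

layers = [torch.full((1, 3, 4), float(i)) for i in range(12)]
single = select_image_features(layers, -2)                  # (1, 3, 4)
multi = select_image_features(layers, [-2, -5, -8, -11])    # (1, 3, 16)
print(single.shape, multi.shape)
```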
What does this PR do?
Wraps image features selection in LLaVa in a separate function, making it easier to override for custom use cases (e.g. applying a layer norm to the image features before projection, etc.).
Fixes #33695
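The motivating use case above can be sketched as a subclass that overrides the extracted method. The base class here is a toy stand-in rather than the actual LLaVa model, and the method name `get_image_features` is the one proposed in this PR; everything else is assumed for illustration.

```python
import torch
import torch.nn as nn

class BaseModel(nn.Module):
    """Toy stand-in for the refactored model exposing get_image_features."""

    def __init__(self, dim=8):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def get_image_features(self, selected_features):
        return self.proj(selected_features)

class NormalizedModel(BaseModel):
    """Custom variant: layer-normalize image features before projection."""

    def __init__(self, dim=8):
        super().__init__(dim)
        self.norm = nn.LayerNorm(dim)

    def get_image_features(self, selected_features):
        # the whole point of the refactor: only this method needs overriding
        return self.proj(self.norm(selected_features))

model = NormalizedModel()
out = model.get_image_features(torch.randn(2, 5, 8))
print(out.shape)
```

Without the extracted method, the same customization would require copying the entire forward pass.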
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@ArthurZucker, @younesbelkada