Refactor image features selection in LlaVa #33696

Open · wants to merge 5 commits into base: main
Conversation

kenza-bouzid

What does this PR do?

Wrap image features selection in LLaVa in a separate function to make it easier to override for custom use cases (e.g. applying a layer norm on the image features before projection, etc).

Fixes #33695
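The override pattern this PR enables could look like the following minimal sketch. The class and method names here are illustrative, not the actual transformers API, numpy arrays stand in for real tensors, and a plain normalization stands in for a learned LayerNorm:

```python
import numpy as np

class TinyLlavaLike:
    """Toy stand-in for a LLaVA-style model (names are illustrative)."""

    def __init__(self, vision_feature_select_strategy="default"):
        self.vision_feature_select_strategy = vision_feature_select_strategy

    def select_image_features(self, image_features):
        # Mirrors LLaVA's selection logic: "default" drops the leading
        # CLS token, "full" keeps every token.
        if self.vision_feature_select_strategy == "default":
            return image_features[:, 1:, :]
        if self.vision_feature_select_strategy == "full":
            return image_features
        raise ValueError(
            f"Unexpected strategy: {self.vision_feature_select_strategy}"
        )

class NormalizedLlava(TinyLlavaLike):
    """The PR's motivating use case: tweak features before projection."""

    def select_image_features(self, image_features):
        feats = super().select_image_features(image_features)
        # Stand-in for a learned LayerNorm over the hidden dimension.
        mu = feats.mean(axis=-1, keepdims=True)
        sd = feats.std(axis=-1, keepdims=True) + 1e-5
        return (feats - mu) / sd
```

With selection factored out like this, the subclass only overrides one small method instead of copying the whole forward pass.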

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker, @younesbelkada

@zucchini-nlp (Member) left a comment

Hey @kenza-bouzid!

Thanks for opening a PR! Yes, it would actually be nice to have, as a rule, that VLMs expose a get_image_features method where all of this logic happens. My previous idea was to make the code less cluttered when a VLM has many special crops/pads/concats, but the use case of overriding the method for custom models is very cool.

We already have a similar pattern in video LLMs, for the same reason of uncluttering the code. WDYT about moving the whole image feature preparation to a separate method, including the MM-proj, so the method takes in pixels and returns features ready to be merged with text embeddings?

Also, we might need to change more models to follow the same pattern to make CI happy. If you have bandwidth, it would be nice to propagate the change to all VLMs. But don't stress if you can't; I'll do an overall standardization on that soon :)
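A minimal sketch of that suggested shape, with all names hypothetical and toy arrays standing in for the vision tower and the multimodal projector: one method goes from pixel values to text-space features in a single step:

```python
import numpy as np

class VlmSketch:
    """Illustrative only: pixels in, text-space features out, one method."""

    def __init__(self, vision_dim=8, text_dim=6, seed=0):
        rng = np.random.default_rng(seed)
        # Stand-in for the multimodal projector (usually a small MLP).
        self.mm_projector = rng.standard_normal((vision_dim, text_dim))
        self.vision_dim = vision_dim

    def vision_tower(self, pixel_values):
        # Stand-in vision encoder: (batch, 1 CLS + 4 patches, vision_dim).
        return np.ones((pixel_values.shape[0], 5, self.vision_dim))

    def get_image_features(self, pixel_values,
                           vision_feature_select_strategy="default"):
        hidden = self.vision_tower(pixel_values)
        if vision_feature_select_strategy == "default":
            hidden = hidden[:, 1:, :]  # drop the CLS token
        # Project into the text-embedding space, ready to be merged
        # with the text embeddings downstream.
        return hidden @ self.mm_projector
```

The caller never touches intermediate vision-tower states; everything between pixels and mergeable features lives in one overridable place.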

kenza-bouzid (Author) commented Sep 26, 2024

Hi @zucchini-nlp,

Thanks for your reply.

WDYT about moving the whole image feature preparation to a separate method, including the MM-proj, so the method takes in pixels and returns features ready to be merged with text embeddings?

Good idea! However, I would suggest having a separate method for the projection as well, to make it easy to customize; I can think of many use cases where you may want to change it.

We could have get_image_features and project_image_features, which are then called sequentially in a get_image_embeddings that returns features ready to be merged with text embeddings, as you suggested.
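The split being proposed could be sketched like this (hypothetical names; toy arrays stand in for the real vision tower and projector):

```python
import numpy as np

class SplitVlm:
    """Illustrative: selection and projection each overridable on their own."""

    def __init__(self, vision_dim=8, text_dim=6):
        # Toy mm-projector mapping vision_dim -> text_dim.
        self.mm_projector = np.eye(vision_dim)[:, :text_dim]
        self.vision_dim = vision_dim

    def get_image_features(self, pixel_values):
        # Stand-in vision-tower output: (batch, 1 CLS + 4 patches, vision_dim).
        hidden = np.ones((pixel_values.shape[0], 5, self.vision_dim))
        return hidden[:, 1:, :]  # selection: drop the CLS token

    def project_image_features(self, image_features):
        return image_features @ self.mm_projector

    def get_image_embeddings(self, pixel_values):
        # Composition point: output is ready to merge with text embeddings.
        return self.project_image_features(self.get_image_features(pixel_values))
```

A subclass can then override either stage independently, at the cost of two extra method boundaries.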

If you have bandwidth, it would be nice to propagate the change to all VLMs. But don't stress if you can't; I'll do an overall standardization on that soon :)

I'm afraid I won't have time to propagate the change to all VLMs. Is that a blocker for this PR to pass CI? You probably have more context for an overall standardization!

@zucchini-nlp (Member)

I would suggest having a separate method for that as well to make it easy to customize the projection. I can think of many use cases where you may want to change the projection.

Hmm, IMO that is a bit redundant. Since "select + proj" are mostly interlinked and aren't a huge piece of code, we can ask users to override that one method, independently of whether they tweak the selection stage or the projection stage. That way we don't end up with separate methods that each perform a one-liner, which IMO is bad practice. Lmk if you have any other ideas or objections.

I'm afraid I won't have time to propagate the change to all VLMs. Is that a blocker for this PR to pass CI? You probably have more context for an overall standardization!

No worries! I just realized it is the very first llava being modified here, so our CI should be happy after running `make fix-copies` and `make style`. For other VLMs that are trickier, I'll do one round of standardization later. Not a blocker at all :)

kenza-bouzid (Author)

Sounds great! Let me make the changes you suggested. Thank you!

@@ -282,6 +282,21 @@ def resize_token_embeddings(self, new_num_tokens: Optional[int] = None, pad_to_m
self.vocab_size = model_embeds.num_embeddings
return model_embeds

def get_image_features(
Review comment by kenza-bouzid (Author):

@zucchini-nlp `make fix-copies` copied this over, but looking at the forward pass of vipllava, the image feature selection is slightly different: it selects features from layers -2, -5, -8, -11 and 6.
I don't think it's breaking anything, since this function is not called there. Note that we'll need vision_feature_layers: list[int] and not only a single one!
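A sketch of what a list-based selection could look like. This is illustrative only, not the vipllava implementation; the layer indices match the defaults mentioned in this review comment, and the concatenation axis is an assumption:

```python
import numpy as np

def select_multi_layer(hidden_states, vision_feature_layers=(-2, -5, -8, -11, 6)):
    """VipLlava-style selection (sketch).

    hidden_states: sequence of per-layer arrays, each (batch, seq, hidden).
    Picks several layers, drops the CLS token from each, and concatenates
    the results along the hidden dimension.
    """
    picked = [hidden_states[layer][:, 1:, :] for layer in vision_feature_layers]
    return np.concatenate(picked, axis=-1)
```

This is why a shared helper would need vision_feature_layers: list[int] rather than a single int: the output hidden size grows with the number of selected layers.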
