Refactor image features selection in LlaVa #33696

Open · wants to merge 5 commits into base: main
Conversation

kenza-bouzid

What does this PR do?

Wrap image features selection in LLaVa in a separate function to make it easier to override for custom use cases (e.g. applying a layer norm on the image features before projection, etc).

Fixes #33695
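The override pattern this PR enables could look like the following minimal sketch. The class and method names here are illustrative, not the actual transformers API, numpy arrays stand in for real tensors, and a plain normalization stands in for a learned LayerNorm:

```python
import numpy as np

class TinyLlavaLike:
    """Toy stand-in for a LLaVA-style model (names are illustrative)."""

    def __init__(self, vision_feature_select_strategy="default"):
        self.vision_feature_select_strategy = vision_feature_select_strategy

    def select_image_features(self, image_features):
        # Mirrors LLaVA's selection logic: "default" drops the leading
        # CLS token, "full" keeps every token.
        if self.vision_feature_select_strategy == "default":
            return image_features[:, 1:, :]
        if self.vision_feature_select_strategy == "full":
            return image_features
        raise ValueError(
            f"Unexpected strategy: {self.vision_feature_select_strategy}"
        )

class NormalizedLlava(TinyLlavaLike):
    """The PR's motivating use case: tweak features before projection."""

    def select_image_features(self, image_features):
        feats = super().select_image_features(image_features)
        # Stand-in for a learned LayerNorm over the hidden dimension.
        mu = feats.mean(axis=-1, keepdims=True)
        sd = feats.std(axis=-1, keepdims=True) + 1e-5
        return (feats - mu) / sd
```

With selection factored out like this, the subclass only overrides one small method instead of copying the whole forward pass.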

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker, @younesbelkada

@zucchini-nlp (Member) left a comment

Hey @kenza-bouzid!

Thanks for opening a PR! Yes, it would actually be nice to have, as a rule, that VLMs expose a get_image_features method where all of this logic happens. My previous idea was to make the code less cluttered when a VLM has many special crops/pads/concats, but the use case of overriding the method for custom models is very cool.

We already have a similar pattern in video LLMs, for the same reason of uncluttering the code. WDYT about moving the whole image feature preparation to a separate method, including the MM-proj, so the method takes in pixels and returns features ready to be merged with text embeddings?

Also, we might need to change more models to follow the same pattern to make CI happy. If you have bandwidth, it would be nice to propagate the change to all VLMs. But don't stress if you can't; I'll do an overall standardization on that soon :)
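A minimal sketch of that suggested shape, with all names hypothetical and toy arrays standing in for the vision tower and the multimodal projector: one method goes from pixel values to text-space features in a single step:

```python
import numpy as np

class VlmSketch:
    """Illustrative only: pixels in, text-space features out, one method."""

    def __init__(self, vision_dim=8, text_dim=6, seed=0):
        rng = np.random.default_rng(seed)
        # Stand-in for the multimodal projector (usually a small MLP).
        self.mm_projector = rng.standard_normal((vision_dim, text_dim))
        self.vision_dim = vision_dim

    def vision_tower(self, pixel_values):
        # Stand-in vision encoder: (batch, 1 CLS + 4 patches, vision_dim).
        return np.ones((pixel_values.shape[0], 5, self.vision_dim))

    def get_image_features(self, pixel_values,
                           vision_feature_select_strategy="default"):
        hidden = self.vision_tower(pixel_values)
        if vision_feature_select_strategy == "default":
            hidden = hidden[:, 1:, :]  # drop the CLS token
        # Project into the text-embedding space, ready to be merged
        # with the text embeddings downstream.
        return hidden @ self.mm_projector
```

The caller never touches intermediate vision-tower states; everything between pixels and mergeable features lives in one overridable place.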

kenza-bouzid (Author) commented Sep 26, 2024

Hi @zucchini-nlp,

Thanks for your reply.

WDYT about moving the whole image feature preparation to a separate method, including the MM-proj, so the method takes in pixels and returns features ready to be merged with text embeddings?

Good idea! However, I would suggest having a separate method for the projection as well, to make it easy to customize; I can think of many use cases where you may want to change it.

We could have get_image_features and project_image_features, which are then called sequentially in a get_image_embeddings that returns features ready to be merged with text embeddings, as you suggested.
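The split being proposed could be sketched like this (hypothetical names; toy arrays stand in for the real vision tower and projector):

```python
import numpy as np

class SplitVlm:
    """Illustrative: selection and projection each overridable on their own."""

    def __init__(self, vision_dim=8, text_dim=6):
        # Toy mm-projector mapping vision_dim -> text_dim.
        self.mm_projector = np.eye(vision_dim)[:, :text_dim]
        self.vision_dim = vision_dim

    def get_image_features(self, pixel_values):
        # Stand-in vision-tower output: (batch, 1 CLS + 4 patches, vision_dim).
        hidden = np.ones((pixel_values.shape[0], 5, self.vision_dim))
        return hidden[:, 1:, :]  # selection: drop the CLS token

    def project_image_features(self, image_features):
        return image_features @ self.mm_projector

    def get_image_embeddings(self, pixel_values):
        # Composition point: output is ready to merge with text embeddings.
        return self.project_image_features(self.get_image_features(pixel_values))
```

A subclass can then override either stage independently, at the cost of two extra method boundaries.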

If you have bandwidth, it would be nice to propagate the change to all VLMs. But don't stress if you can't; I'll do an overall standardization on that soon :)

I'm afraid I won't have time to propagate the change to all VLMs. Is that a blocker for this PR to pass CI? You probably have more context for an overall standardization!

@zucchini-nlp (Member)

I would suggest having a separate method for that as well to make it easy to customize the projection. I can think of many use cases where you may want to change the projection.

Hmm, IMO that is a bit redundant. Since "select + proj" are mostly interlinked and aren't a huge piece of code, we can ask users to override that one method, independently of whether they tweak the selection stage or the projection stage. That way we don't end up with separate methods that each perform a one-liner, which IMO is bad practice. Lmk if you have any other ideas or objections.

I'm afraid I won't have time to propagate the change to all VLMs. Is that a blocker for this PR to pass CI? You probably have more context for an overall standardization!

No worries! I just realized it is the very first llava being modified here, so our CI should be happy after running `make fix-copies` and `make style`. For other VLMs that are trickier, I'll do one round of standardization later. Not a blocker at all :)

kenza-bouzid (Author)

Sounds great! Let me make the changes you suggested. Thank you!

@@ -282,6 +282,21 @@ def resize_token_embeddings(self, new_num_tokens: Optional[int] = None, pad_to_m
self.vocab_size = model_embeds.num_embeddings
return model_embeds

def get_image_features(
Review comment by kenza-bouzid (Author):

@zucchini-nlp `make fix-copies` copied this over, but looking at the forward pass of vipllava, the image feature selection is slightly different: it selects features from layers -2, -5, -8, -11 and 6.
I don't think it's breaking anything, since this function is not called there. Note that we'll need vision_feature_layers: list[int] and not only a single one!
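A sketch of what a list-based selection could look like. This is illustrative only, not the vipllava implementation; the layer indices match the defaults mentioned in this review comment, and the concatenation axis is an assumption:

```python
import numpy as np

def select_multi_layer(hidden_states, vision_feature_layers=(-2, -5, -8, -11, 6)):
    """VipLlava-style selection (sketch).

    hidden_states: sequence of per-layer arrays, each (batch, seq, hidden).
    Picks several layers, drops the CLS token from each, and concatenates
    the results along the hidden dimension.
    """
    picked = [hidden_states[layer][:, 1:, :] for layer in vision_feature_layers]
    return np.concatenate(picked, axis=-1)
```

This is why a shared helper would need vision_feature_layers: list[int] rather than a single int: the output hidden size grows with the number of selected layers.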
