Add support for nested images to LLava and VipLLava #35558

yonigozlan · 2025-01-07T23:37:58Z

What does this PR do?

This PR adds the functions make_flat_list_of_images, make_nested_list_of_images and make_batched_videos to image_utils, removing some unnecessarily duplicated code.
make_flat_list_of_images also replaces make_list_of_images in the clip, blip, and siglip image processors. This lets image-text-to-text models that use these image processors accept nested image inputs while preserving backward compatibility (BC).
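
To make the flat versus nested distinction concrete, here is a minimal sketch of how a single image, a flat list, or a nested list could be normalized to a flat list. It uses 3D NumPy arrays as stand-in images, and the names `_is_image` and `flatten_images_sketch` are hypothetical; this is not the code added to image_utils, which may differ (for example, it also has to handle PIL images).

```python
import numpy as np


def _is_image(x):
    # Assumption for this sketch: an "image" is a 3D NumPy array (H, W, C).
    return isinstance(x, np.ndarray) and x.ndim == 3


def flatten_images_sketch(images):
    """Return a flat list of images from a single image, a list, or a nested list."""
    # A single image becomes a one-element list.
    if _is_image(images):
        return [images]
    if isinstance(images, (list, tuple)):
        # Already a flat list of images: return a copy as a list.
        if all(_is_image(img) for img in images):
            return list(images)
        # A nested list (one inner list per sample): flatten one level.
        if all(
            isinstance(sub, (list, tuple)) and all(_is_image(img) for img in sub)
            for sub in images
        ):
            return [img for sub in images for img in sub]
    raise ValueError("Unsupported image input structure")


a = np.zeros((2, 2, 3))
b = np.ones((2, 2, 3))
flat = flatten_images_sketch([[a], [b, a]])  # two samples, three images in total
```

Because flat lists pass through unchanged, callers that already supply a single image or a flat list would see no difference, which is the backward-compatibility property the description mentions.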

Partially addresses #34545

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@zucchini-nlp

HuggingFaceDocBuilderDev · 2025-01-08T00:16:20Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Rocketknight1 · 2025-01-08T14:03:13Z

Flagging this PR too - I made some changes to the Llava/Pixtral processing for nested images here, so there might be some conflicts! #34801

zucchini-nlp

Super cool, thanks for cleaning this up. Looks much better now

zucchini-nlp · 2025-01-08T18:21:51Z

src/transformers/image_utils.py

+    if is_valid_image(images):
+        output_images = [[images]]


Could it be that we get a 4D tensor as a batch of images?

yonigozlan (Jan 9, 2025): Yes, you are right! I made some changes that should account for that.
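
One possible way to account for the 4D case raised above is sketched below: a 3D array is treated as a single image, while a 4D array is treated as a batch and split along its first axis. The name `nest_images_sketch` and the NumPy-only image check are assumptions for illustration; this is not necessarily the change that was made in the PR.

```python
import numpy as np


def nest_images_sketch(images):
    """Return a nested list of images: one inner list per sample."""
    if isinstance(images, np.ndarray):
        # A single (H, W, C) image: one sample containing one image.
        if images.ndim == 3:
            return [[images]]
        # A (N, H, W, C) batch: split along the first axis into N images.
        if images.ndim == 4:
            return [list(images)]
    if isinstance(images, (list, tuple)) and len(images) > 0:
        # A flat list of images: treat it as one sample.
        if all(isinstance(img, np.ndarray) and img.ndim == 3 for img in images):
            return [list(images)]
        # Already nested: keep the per-sample grouping.
        if all(isinstance(sub, (list, tuple)) for sub in images):
            return [list(sub) for sub in images]
    raise ValueError("Unsupported image input structure")


single = np.zeros((4, 4, 3))
batch = np.zeros((2, 4, 4, 3))
nested_single = nest_images_sketch(single)
nested_batch = nest_images_sketch(batch)
```

Without the `ndim == 4` branch, a batched array would be wrapped as if it were one image, which is the problem the review comment points at.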

yonigozlan requested a review from zucchini-nlp January 8, 2025 15:03

zucchini-nlp approved these changes Jan 9, 2025

yonigozlan requested review from qubvel, molbap, Rocketknight1 and ArthurZucker as code owners January 9, 2025 17:05

yonigozlan added 7 commits January 9, 2025 17:05

move make_flat_list_of_images and make_batched_videos to image_utils

5c95234

remove unnecessary is_vision_available

4ecd5d6

move make_nested_list_of_images to image_utils

50e0920

fix fast pixtral image processor

595f5e1

fix import mllama

162790b

fix make_nested_list_of_images

b0568b4

add tests

6f595da

yonigozlan force-pushed the uniformize-image-text-to-text-inputs-processing branch from 0319100 to 6f595da Compare January 9, 2025 17:10
