Image storage format #436

nikonikolov opened this issue Sep 12, 2024 · 1 comment

nikonikolov commented Sep 12, 2024

I am quite interested in using LeRobotDataset for large-scale training. I would like more context on the options for storing images so I am aware of the implications this might have:

  • Did you by chance study whether the mp4 video compression has any negative effects on image quality in terms of model performance (or are there any studies you based your decision on)?
  • I see that lerobot currently supports storing images either in .mp4 or .pt, but not in arrow or parquet format as many other HF datasets do. Is there any specific reason you didn't add support for arrow / parquet, which also provide memory mapping? Any ideas how pytorch would compare to arrow / parquet when using datasets with 100s of millions of examples?
nikonikolov changed the title from "mp4 format for images" to "Image storage format" on Sep 12, 2024

Cadene (Collaborator) commented Sep 12, 2024

We compared png frames versus mp4-compressed video on the PushT and Aloha environments in simulation. We didn't notice a lower success rate. You can reproduce this result, as we currently support both image and video datasets.
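A minimal sketch of loading both variants, assuming the LeRobotDataset class lives in lerobot.common.datasets.lerobot_dataset and that lerobot/pusht is the video-backed counterpart of lerobot/pusht_image:

```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Video-backed dataset: frames are decoded from the mp4 files on access
video_ds = LeRobotDataset("lerobot/pusht")

# Image-backed dataset: frames are stored as individual images
image_ds = LeRobotDataset("lerobot/pusht_image")

# Both expose the same item interface, so the same training code runs on either
print(video_ds[0].keys())
print(image_ds[0].keys())
```

Training the same policy on each and comparing success rates reproduces the comparison above.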

"I see atm lerobot supports storing images either in .mp4 or .pt, but not in arrow or parquet format as many other HF datasets do."

As of now, we use parquet to store images: https://huggingface.co/datasets/lerobot/pusht_image/tree/main/data
We use parquet to store all other data as well (except videos, which stay in mp4 so they can be downloaded and streamed easily).
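If you want to inspect that storage directly, here is a rough sketch using huggingface_hub and pyarrow (the shard filenames vary per dataset, so we just list whatever parquet files the repo contains):

```python
import pyarrow.parquet as pq
from huggingface_hub import hf_hub_download, list_repo_files

repo_id = "lerobot/pusht_image"

# Find the parquet shards without hard-coding filenames
parquet_files = [
    f for f in list_repo_files(repo_id, repo_type="dataset") if f.endswith(".parquet")
]

# Download one shard and print its schema (column names and types, including the image column)
local_path = hf_hub_download(repo_id, parquet_files[0], repo_type="dataset")
print(pq.read_schema(local_path))
```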

> Any ideas how pytorch would compare to arrow / parquet when using datasets with 100s of millions of examples?

Our current data format uses parquet to store the data on the hub, then arrow once it is downloaded to the cache (through HF datasets), and HF datasets loads the arrow data as pytorch tensors. It's fast enough for now. We are still iterating on the format to make it simpler, faster, and more scalable.
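Roughly, that pipeline looks like this if you go through the datasets library directly (a sketch; LeRobotDataset wraps these steps, and the exact column names depend on the dataset):

```python
from datasets import load_dataset

# Downloads the parquet shards from the hub and caches them as arrow files
ds = load_dataset("lerobot/pusht_image", split="train")

# Memory-mapped arrow rows are returned as pytorch tensors on access
ds = ds.with_format("torch")

sample = ds[0]
print({k: (v.shape if hasattr(v, "shape") else v) for k, v in sample.items()})
```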

Cadene self-assigned this on Sep 12, 2024