Image storage format #436

nikonikolov opened this issue Sep 12, 2024 · 1 comment

nikonikolov commented Sep 12, 2024

I am quite interested in using LeRobotDataset for large-scale training. I would like more context on the options for storing images so I am aware of the implications this might have:

  • Did you by chance study whether the mp4 video compression has any negative effects on image quality in terms of model performance (or are there any studies you based your decision on)?
  • I see that lerobot currently supports storing images either in .mp4 or .pt, but not in arrow or parquet format as many other HF datasets do. Is there any specific reason you didn't add support for arrow / parquet, which also provide memory mapping? Any ideas how pytorch would compare to arrow / parquet when using datasets with 100s of millions of examples?
nikonikolov changed the title from "mp4 format for images" to "Image storage format" on Sep 12, 2024

Cadene (Collaborator) commented Sep 12, 2024

We compared png frames versus mp4-compressed video on the PushT and Aloha environments in simulation. We didn't notice a lower success rate. You can reproduce this result, as we currently support both image and video datasets.
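A minimal sketch of loading both variants, assuming the LeRobotDataset class lives in lerobot.common.datasets.lerobot_dataset and that lerobot/pusht is the video-backed counterpart of lerobot/pusht_image:

```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Video-backed dataset: frames are decoded from the mp4 files on access
video_ds = LeRobotDataset("lerobot/pusht")

# Image-backed dataset: frames are stored as individual images
image_ds = LeRobotDataset("lerobot/pusht_image")

# Both expose the same item interface, so the same training code runs on either
print(video_ds[0].keys())
print(image_ds[0].keys())
```

Training the same policy on each and comparing success rates reproduces the comparison above.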

"I see atm lerobot supports storing images either in .mp4 or .pt, but not in arrow or parquet format as many other HF datasets do."

As of now, we use parquet to store images: https://huggingface.co/datasets/lerobot/pusht_image/tree/main/data
We use parquet to store all other data as well (except videos, which stay in mp4 so they can be downloaded and streamed easily).
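If you want to inspect that storage directly, here is a rough sketch using huggingface_hub and pyarrow (the shard filenames vary per dataset, so we just list whatever parquet files the repo contains):

```python
import pyarrow.parquet as pq
from huggingface_hub import hf_hub_download, list_repo_files

repo_id = "lerobot/pusht_image"

# Find the parquet shards without hard-coding filenames
parquet_files = [
    f for f in list_repo_files(repo_id, repo_type="dataset") if f.endswith(".parquet")
]

# Download one shard and print its schema (column names and types, including the image column)
local_path = hf_hub_download(repo_id, parquet_files[0], repo_type="dataset")
print(pq.read_schema(local_path))
```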

> Any ideas how pytorch would compare to arrow / parquet when using datasets with 100s of millions of examples?

Our current data format uses parquet to store the data on the hub, then arrow once it is downloaded to the cache (through HF datasets), and HF datasets loads the arrow data as pytorch tensors. It's fast enough for now. We are still iterating on the format to make it simpler, faster, and more scalable.
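Roughly, that pipeline looks like this if you go through the datasets library directly (a sketch; LeRobotDataset wraps these steps, and the exact column names depend on the dataset):

```python
from datasets import load_dataset

# Downloads the parquet shards from the hub and caches them as arrow files
ds = load_dataset("lerobot/pusht_image", split="train")

# Memory-mapped arrow rows are returned as pytorch tensors on access
ds = ds.with_format("torch")

sample = ds[0]
print({k: (v.shape if hasattr(v, "shape") else v) for k, v in sample.items()})
```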

Cadene self-assigned this on Sep 12, 2024