Upgrade OnlineBuffer to DataBuffer and use it in the train script. #445

Draft
wants to merge 30 commits into base: main
Conversation


@alexander-soare alexander-soare commented Sep 18, 2024

What this does

This PR promotes OnlineBuffer to the more widely scoped DataBuffer. In other words, the class is generalized so that it can be used for all purposes, rather than just as a data buffer for the online training loop. In a series of follow-up PRs, LeRobotDataset will gradually be superseded by DataBuffer. Changes to look out for:

  • online_buffer.py::OnlineBuffer -> data_buffer.py::DataBuffer.
  • A generalized, more intuitive interface in DataBuffer.
  • A new DataBuffer.from_huggingface_hub method. This downloads an HF Hub dataset and then siphons it into a DataBuffer (see the usage sketch after this list; in a follow-up we might try streaming from the hub to avoid doubling up on disk space requirements).
  • Changes to the train and eval scripts to construct and use the DataBuffer. Of particular interest: the training script gains two new options that configure the usage of the DataBuffer.
  • Minor changes to some utilities, mostly around the idea that DataBuffer works primarily with numpy arrays, not torch tensors.
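
For orientation, here is a rough usage sketch pieced together from the scripts later in this thread (the repo ID, fps, and delta_timestamps values are illustrative, and the exact signatures may still shift while this PR is in draft):

from torch.utils.data import DataLoader

from lerobot.common.datasets.online_buffer import DataBuffer

# Pull a hub dataset into a memmap-backed DataBuffer. `decode_images`, `fps` and
# `delta_timestamps` mirror the arguments used in the scripts further down this page.
buffer = DataBuffer.from_huggingface_hub(
    "lerobot/pusht",
    decode_images=False,
    fps=10,  # illustrative value
    delta_timestamps={"action": [-0.1, 0.0, 0.1]},  # illustrative 3-step window
)

# The buffer is a torch.utils.data.Dataset, so it drops straight into a DataLoader.
dataloader = DataLoader(buffer, batch_size=16, num_workers=8, shuffle=True)
batch = next(iter(dataloader))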

Note: For the LeRobot library user, this change will be silent. The options to use the DataBuffer are not added to the yaml configs and need to be explicitly specified. The plan is for us to road test it internally, or allow users to try it out if they really want, while follow-up PRs proceed.

TODO before merging

  • Address the points in the testing section. In particular, run the backwards-compatibility script as a last step before merging.

Why? (high-level context)

These are the main reasons we are switching to DataBuffer (backed by numpy.memmaps) from datasets.Dataset (backed by PyArrow):

  1. For the online training loop, we frequently update the data in the buffer, including overwriting old data. numpy.memmaps are much more amenable to this (it's no contest: datasets.Datasets were not designed for in-place updates).
  2. In general, for iterating through the dataset, numpy.memmaps are faster than datasets.Datasets in the way we currently use them (on the order of 10x faster). Some work can be done both on our end and on the datasets end to speed datasets.Datasets up, but even at a fundamental level PyArrow can't be made faster than numpy.memmaps (although it can get close).
  3. The interface to numpy.memmaps is dead simple for anyone who knows how to use numpy (everyone...). They can be sliced and mutated the same way as regular arrays. In comparison, the interface for datasets.Datasets takes some getting used to (certainly worth it if the use case is right, but not in LeRobot).

It should be noted that points 1 and 2 above are particularly salient in LeRobot, where each data item in a training loop typically involves taking slices over the temporal dimension of multiple data arrays. For example, the default settings for TD-MPC require slices of 5 time-steps over actions and observations with a batch size of 256. With datasets.Datasets, the data loading bottleneck slows training to a crawl.
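
To make points 1-3 concrete, here is a minimal plain-numpy sketch (illustrative file path, capacity, and shapes) of the in-place updates and temporal slicing that memmaps make trivial:

import numpy as np

# Reserve disk space for 1000 frames of 2-D actions (illustrative capacity and shape).
actions = np.memmap("/tmp/actions.dat", dtype=np.float32, mode="w+", shape=(1000, 2))

# In-place update: overwrite an old episode's frames exactly like a regular array.
actions[100:150] = np.random.randn(50, 2).astype(np.float32)

# Temporal slicing: take a 5-step window ending at frame `idx`, as a TD-MPC-style item would.
idx = 120
window = actions[idx - 4 : idx + 1]  # shape (5, 2); a view, with no copy until one is requested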

Follow-ups

  • Refactor online training sampler to use the DataBuffer for the offline dataset.
  • Generate video paths on the fly rather than storing them explicitly in the data.
  • Incorporate multi-dataset support
  • In from_huggingface_hub("lerobot/image_dataset", decode_images=True) avoid decoding and re-encoding the PNG files.
  • Don't make it compulsory to provide the buffer capacity. Just double the size of the memmaps as and when needed (a sketch of this follows the list).
  • In from_huggingface_hub, stream the HF Dataset and siphon it into the LeRobot DataBuffer one episode at a time.
  • Completely remove LeRobotDataset or replace it with DataBuffer (this has lots of sub-points)
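
Regarding the capacity follow-up above, here is a sketch of the doubling idea (a hypothetical helper, not part of this PR): grow the backing file, then re-map it with the larger shape. Existing frames keep their offsets, and the new second half reads back as zeros until written.

import numpy as np


def grow_memmap(path: str, arr: np.memmap) -> np.memmap:
    """Double the capacity of a memmap along its first axis, preserving existing data."""
    new_shape = (arr.shape[0] * 2, *arr.shape[1:])
    arr.flush()  # make sure pending writes hit the backing file
    # Extend the backing file with zeros, then re-map it at the new size.
    new_nbytes = int(np.prod(new_shape)) * arr.dtype.itemsize
    with open(path, "r+b") as f:
        f.truncate(new_nbytes)
    return np.memmap(path, dtype=arr.dtype, mode="r+", shape=new_shape)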

How it was tested

  • Added tests for CI, covering both pytest and the end-to-end tests.

WIP

  • Check that iterating through the DataBuffer gives the same items as iterating through the LeRobotDataset for a representative selection of datasets. I won't add B/C tests for this as LeRobotDataset will be completely phased out. Instead, I'll leave a script here. I'll run the script again right before merging.
  • Train a model with the offline training loop with the DataBuffer (shouldn't be a big deal if the point above is done properly)
  • Train TD-MPC pusht with online training loop.

How to check out & try? (for the reviewer)

Run this script with and without +use_lerobot_data_buffer=true, and when it is set, also try it with and without +lerobot_data_buffer_decode_video=true. Observe the data_s log item in the terminal.

JOB_NAME=vqbet_dubug

# python -m debugpy --listen localhost:51355 --wait-for-client lerobot/scripts/train.py \
python lerobot/scripts/train.py \
    hydra.job.name=$JOB_NAME \
    hydra.run.dir=outputs/train/$(date +'%Y-%m-%d/%H-%M-%S')_${JOB_NAME} \
    env=pusht \
    dataset_repo_id=lerobot/pusht_image \
    policy=vqbet \
    training.num_workers=0 \
    training.log_freq=1 \
    training.offline_steps=400000 \
    training.save_checkpoint=true \
    training.save_freq=50000 \
    wandb.enable=false \
    wandb.disable_artifact=true \
    device=cuda \
    use_amp=true \
    +use_lerobot_data_buffer=true \
    +lerobot_data_buffer_decode_video=true


@alexander-soare alexander-soare marked this pull request as draft September 18, 2024 17:39
return Path(f"/tmp/{repo_id}_{fingerprint}{'_decoded' if decode_video else ''}")


def compute_sampler_weights(
Collaborator Author:

FYI: This will look very different once we drop LeRobotDataset. It has decent test coverage for now and was already here before the PR. Feel free to skim or skip.

pass


class DataBuffer(torch.utils.data.Dataset):
Collaborator Author:

FYI: Totally happy for other name suggestions.



# Arbitrarily set small dataset sizes, making sure to have uneven sizes.
@pytest.mark.parametrize("offline_dataset_size", [0, 6])
Collaborator Author:

FYI: all tests from here and below were here before the PR. Feel free to skim / skip.

@alexander-soare alexander-soare left a comment

Reviewable status: 0 of 10 files reviewed, 1 unresolved discussion (waiting on @aliberts)


tests/test_data_buffer.py line 16 at r1 (raw file):

# See the License for the specific language governing permissions and
# limitations under the License.d
from copy import deepcopy

Test comment

@alexander-soare alexander-soare marked this pull request as ready for review September 19, 2024 16:04

alexander-soare commented Sep 19, 2024

Backwards compat check script:

"""
(this will take < 20 mins, covers FULL datasets)
python scripts/data_buffer_bc_checks.py

OR

(this will take hours, covers lots of datasets, probs overkill)
DATA_DIR=tests/data python scripts/data_buffer_bc_checks.py
"""

import os
from itertools import product

import torch
import torch.utils
from tqdm import trange

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
from lerobot.common.datasets.online_buffer import DataBuffer

if os.environ.get("DATA_DIR", "") == "tests/data":
    datasets_to_test = [
        "lerobot/pusht",
        "lerobot/pusht_image",
        "lerobot/pusht_keypoints",
        "lerobot/aloha_mobile_cabinet",
        "lerobot/aloha_mobile_chair",
        "lerobot/aloha_mobile_elevator",
        "lerobot/aloha_mobile_shrimp",
        "lerobot/aloha_mobile_wash_pan",
        "lerobot/aloha_mobile_wipe_wine",
        "lerobot/aloha_sim_insertion_human",
        "lerobot/aloha_sim_insertion_human_image",
        "lerobot/aloha_sim_insertion_scripted",
        "lerobot/aloha_sim_insertion_scripted_image",
        "lerobot/aloha_sim_transfer_cube_human",
        "lerobot/aloha_sim_transfer_cube_human_image",
        "lerobot/aloha_sim_transfer_cube_scripted",
        "lerobot/aloha_sim_transfer_cube_scripted_image",
        "lerobot/aloha_static_battery",
        "lerobot/aloha_static_candy",
        "lerobot/aloha_static_coffee",
        "lerobot/aloha_static_coffee_new",
        "lerobot/aloha_static_cups_open",
        "lerobot/aloha_static_fork_pick_up",
        "lerobot/aloha_static_pingpong_test",
        "lerobot/aloha_static_pro_pencil",
        "lerobot/aloha_static_screw_driver",
        "lerobot/aloha_static_tape",
        "lerobot/aloha_static_thread_velcro",
        "lerobot/aloha_static_towel",
        "lerobot/aloha_static_vinh_cup",
        "lerobot/aloha_static_vinh_cup_left",
        "lerobot/aloha_static_ziploc_slide",
        "lerobot/umi_cup_in_the_wild",
        "lerobot/unitreeh1_fold_clothes",
        "lerobot/unitreeh1_rearrange_objects",
        "lerobot/unitreeh1_two_robot_greeting",
        "lerobot/unitreeh1_warehouse",
        "lerobot/xarm_lift_medium",
        "lerobot/xarm_lift_medium_image",
        "lerobot/xarm_lift_medium_replay",
        "lerobot/xarm_lift_medium_replay_image",
        "lerobot/xarm_push_medium",
        "lerobot/xarm_push_medium_image",
        "lerobot/xarm_push_medium_replay",
        "lerobot/xarm_push_medium_replay_image",
    ]
else:
    # Reduced test set
    datasets_to_test = [
        "lerobot/pusht",
        "lerobot/pusht_image",
        "lerobot/pusht_keypoints",
        "lerobot/unitreeh1_two_robot_greeting",  # chosen because it contains multiple image keys
    ]

for dataset_repo_id in datasets_to_test:
    lerobot_dataset = LeRobotDataset(dataset_repo_id)
    fps = lerobot_dataset.fps

    assert "observation.state" in lerobot_dataset[0]

    delta_timestamps = {
        "action": [-1 / fps, 0, 1 / fps],
        "observation.state": [-1 / fps, 0, 1 / fps],
        **{k: [-1 / fps, 0, 1 / fps] for k in lerobot_dataset.camera_keys},
    }
    delta_timestamps_options = [None, delta_timestamps]

    decode_images_options = [False]
    if os.environ.get("DATA_DIR", "") == "tests/data":
        decode_images_options.append(True)

    for decode_images, delta_timestamps in product(decode_images_options, delta_timestamps_options):
        lerobot_dataset = LeRobotDataset(dataset_repo_id, delta_timestamps=delta_timestamps)
        buffer = DataBuffer.from_huggingface_hub(
            dataset_repo_id, decode_images=decode_images, fps=fps, delta_timestamps=delta_timestamps
        )
        assert len(lerobot_dataset) == len(buffer)

        lerobot_dataset_dataloader = iter(
            torch.utils.data.DataLoader(
                lerobot_dataset, batch_size=16, num_workers=8, shuffle=False, drop_last=False
            )
        )
        buffer_dataloader = iter(
            torch.utils.data.DataLoader(buffer, batch_size=16, num_workers=8, shuffle=False, drop_last=False)
        )

        for _ in trange(
            len(lerobot_dataset_dataloader),
            desc=f"{dataset_repo_id}{'_dt' if delta_timestamps is not None else ''}{'_decoded' if decode_images else ''}",
        ):
            buffer_item = next(buffer_dataloader)
            lerobot_dataset_item = next(lerobot_dataset_dataloader)
            assert set(buffer_item) == set(lerobot_dataset_item)
            for k in buffer_item:
                assert buffer_item[k].dtype == lerobot_dataset_item[k].dtype
                assert torch.equal(buffer_item[k], lerobot_dataset_item[k])

@aliberts aliberts left a comment

First review, where I focused mainly on the from_huggingface_hub method. I don't think I should review further until we discuss my comment about the video buffering, as I think there was a misunderstanding on the feature set/perimeter of this dataset reshape. I simply don't think we should buffer any videos at all; this defeats a lot of the purpose of video encoding/decoding (except for new online episodes, where it makes sense). Why do you think it is necessary?

Reviewable status: 0 of 10 files reviewed, 5 unresolved discussions (waiting on @alexander-soare)


lerobot/common/datasets/online_buffer.py line 69 at r2 (raw file):

    Data is considered to come in the form of "episodes" (an instance of a robot performing a task). Episodes
    are made up of "frames", which are chronoligically ordered and contain timestamp aligned data, potentially

Suggestion:

chronologically

lerobot/common/datasets/online_buffer.py line 91 at r2 (raw file):

    The `add_episodes` method can be used to insert more data in the form of integral episodes (starting from
    frame 0 and with the frames ordered). The buffer has a compulsory size limit, which must be provided when

Is there a limit to how big this can be?
Is there an observable decline in performance past a certain threshold?

Code quote:

The buffer has a compulsory size limit

lerobot/common/datasets/online_buffer.py line 93 at r2 (raw file):

    frame 0 and with the frames ordered). The buffer has a compulsory size limit, which must be provided when
    creating a new one. Data is inserted in a circular fashion, inserting after the most recently added frame,
    and wrapping around to the start when necessary (in which case older episodes are overwritten).

Does that mean that if the buffer is full and I add_episodes with, say, 5 frames, this will scrap the first 5 frames of another episode? Why leave it rather than remove it entirely?

EDIT: never mind, I just read your docstring in that method. It feels weird to leave half-episodes, but I guess that makes sense as long as this is just a buffer and not a drop-in replacement for LeRobotDataset.

Code quote:

in which case older episodes are overwritten
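
For readers following the thread, a minimal sketch of the circular insertion described in that docstring (illustrative only; the real add_episodes also handles metadata, multiple data keys, and episode indices):

import numpy as np

capacity = 10
buffer = np.zeros((capacity, 2), dtype=np.float32)
next_index = 8  # one past the most recently added frame

new_frames = np.ones((5, 2), dtype=np.float32)  # an incoming 5-frame episode
n_wrap = max(0, next_index + len(new_frames) - capacity)
n_tail = len(new_frames) - n_wrap
buffer[next_index : next_index + n_tail] = new_frames[:n_tail]
buffer[:n_wrap] = new_frames[n_tail:]  # wrap around, overwriting the oldest frames
next_index = (next_index + len(new_frames)) % capacity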

lerobot/common/datasets/online_buffer.py line 545 at r2 (raw file):

            repo_id: The dataset repository ID.
            decode_video: If repo_id refers to a video dataset (the image observations are encoded as videos),
                decode the videos and store the frames in a numpy memmap.

I did not have that understanding about this new format. Sorry if I've not been very clear about that but what I had in mind was that it should only take care of the non-video data (and store that as np.memmap).

There is the question of how to handle/add new episodes in the online dataset where I think it does make sense to store frames temporarily as memmaps, but I don't think existing videos should be buffered.

Code quote:

            decode_video: If repo_id refers to a video dataset (the image observations are encoded as videos),
                decode the videos and store the frames in a numpy memmap.

lerobot/common/datasets/online_buffer.py line 558 at r2 (raw file):

        hf_dataset = load_hf_dataset(repo_id, version=CODEBASE_VERSION, root=root, split="train")
        hf_dataset.set_transform(lambda x: x)

Why does this need to exist?

Code quote:

        hf_dataset.set_transform(lambda x: x)

@alexander-soare alexander-soare left a comment

Reviewable status: 0 of 10 files reviewed, 5 unresolved discussions (waiting on @aliberts)


lerobot/common/datasets/online_buffer.py line 69 at r2 (raw file):

    Data is considered to come in the form of "episodes" (an instance of a robot performing a task). Episodes
    are made up of "frames", which are chronoligically ordered and contain timestamp aligned data, potentially

Done.


lerobot/common/datasets/online_buffer.py line 91 at r2 (raw file):

Previously, aliberts (Simon Alibert) wrote…

Is there a limit to how big this can be?
Is there an observable decline in performance past a certain threshold?

If I explain that this is because "you need to specify in advance how much disk space to reserve", is that clear enough?
Then I hope it would be implicit that it has nothing to do with other things like performance.

Maybe I should have worded it like that instead of "The buffer has a compulsory size limit", which I see now can be subtly misleading.


lerobot/common/datasets/online_buffer.py line 93 at r2 (raw file):

Why leave it rather than remove it entirely?

I actually had some code to do this, but it doesn't play nicely at all with the data loader when num_workers > 0, and an online training loop. This is because it ends up reducing the dataset length, and the dataloader has already decided which indices to fetch in advance (meaning you could get an IndexError).
Does that explanation satisfy you, and if so, do you think I should explain it somehow here?


lerobot/common/datasets/online_buffer.py line 545 at r2 (raw file):

Previously, aliberts (Simon Alibert) wrote…

I did not have that understanding about this new format. Sorry if I've not been very clear about that but what I had in mind was that it should only take care of the non-video data (and store that as np.memmap).

There is the question of how to handle/add new episodes in the online dataset where I think it does make sense to store frames temporarily as memmaps, but I don't think existing videos should be buffered.

Think of this feature as a nice-to-have extra. It gives you a massive speedup over video decoding (10x +), and if your dataset is small, it's definitely worth it. For example, try training vqbet as I shared in the main PR comment with and without it.

If we get video decoding faster, I'd remove this feature.


lerobot/common/datasets/online_buffer.py line 558 at r2 (raw file):

Previously, aliberts (Simon Alibert) wrote…

Why does this need to exist?

There is some default transform in place. I know this because when I comment this out, my tests fail due to dtype issues. I could figure out exactly what that is, but no transform is better than a transform, both for speed and for the degree of control we have on our end.

Thoughts?

@aliberts aliberts left a comment

Reviewable status: 0 of 10 files reviewed, 9 unresolved discussions (waiting on @alexander-soare)


Makefile line 71 at r2 (raw file):

		policy.chunk_size=20 \
		training.batch_size=2 \
		training.image_transforms.enable=true \

If I read your code correctly, I don't think this is actually enabled.


lerobot/common/datasets/online_buffer.py line 91 at r2 (raw file):

"you need to specify in advance how much disk space to reserve", is that clear enough?

I understood that, so I'll take it that the limit is simply your disk space.
Additionally, I was wondering if performance can vary with the size of the buffer (?)


lerobot/common/datasets/online_buffer.py line 93 at r2 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

Why leave it rather than remove it entirely?

I actually had some code to do this, but it doesn't play nicely at all with the data loader when num_workers > 0, and an online training loop. This is because it ends up reducing the dataset length, and the dataloader has already decided which indices to fetch in advance (meaning you could get an IndexError).
Does that explanation satisfy you, and if so, do you think I should explain it somehow here?

Makes sense, I think the docstring in add_episodes is enough, thanks.


lerobot/common/datasets/online_buffer.py line 134 at r2 (raw file):

                created with the first call to `add_episodes`.
            buffer_capacity: How many frames should be stored in the buffer as a maximum. Be aware of your
                system's available disk space when choosing this. Note that if `storage_dir` references an

From a practical standpoint, should we rather define that as a disk-space amount?
I'm wondering how, as a user, I should determine this size as a number of frames given my disk space.

Code quote:

            buffer_capacity: How many frames should be stored in the buffer as a maximum. Be aware of your
                system's available disk space when choosing this. Note that if `storage_dir` references an

lerobot/common/datasets/online_buffer.py line 369 at r2 (raw file):

                    )
            else:
                _, h, w, c = data[k].shape

I think we should also retrieve that from the metadata in the future as well (ok for now).

Code quote:

_, h, w, c = data[k].shape

lerobot/common/datasets/online_buffer.py line 545 at r2 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

Think of this feature as a nice-to-have extra. It gives you a massive speedup over video decoding (10x +), and if your dataset is small, it's definitely worth it. For example, try training vqbet as I shared in the main PR comment with and without it.

If we get video decoding faster, I'd remove this feature.

I think this is a lot of complexity for a nice-to-have, but setting that aside, the issue is that speed is not the only factor here. What if we have hundreds of GB of video (as we do on some open X datasets)? We can't really scale with that.

Does this 10x factor take into account initial decoding? And will that work with streaming?

Video is the biggest chunk of the data by far (>99% in most of our repos), it's expected to take the largest amount of time during data loading. Even if we're able to further improve decoding time in the future by fiddling with encoding, it won't be by a factor of 10x because of the decompression — which you took out of the equation iiuc.
I do see a way that we could significantly improve it in the future by training sequentially on each frame in a per-episode loop. That way, we could decode and buffer the video to memory at the start of an episode and use that throughout that episode until the next one. I believe this is one of the features Torchcodec is working on (don't quote me on that though).

I'm also curious about the absolute gains (so not relative improvement). How much time does this shave off a 5h training run, for example?

I get your point that it can be useful in certain scenarios, but that has to be weighed against those questions as well. Wdyt?


lerobot/common/datasets/online_buffer.py line 558 at r2 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

There is some default transform in place. I know this because when I comment this out, my tests fail due to dtype issues. I could figure out exactly what that is, but no transform is better than a transform, both for speed and for the degree of control we have on our end.

Thoughts?

I see, could you add a comment about that?
We should probably fix that later on during the dataset rework. Parquet does support types, so once written properly we shouldn't need that.


lerobot/common/datasets/online_buffer.py line 620 at r2 (raw file):

                else:
                    data_dict[k] = np.stack(
                        [np.array(dct["path"], dtype=f"S{MAX_VIDEO_PATH_LENGTH}") for dct in hf_dataset[k]]

I think we should avoid using storage space for that and have a single template path (which will take episode number and camera key) in the metadata instead. That way we also avoid the annoying max length issue.

Code quote:

                    data_dict[k] = np.stack(
                        [np.array(dct["path"], dtype=f"S{MAX_VIDEO_PATH_LENGTH}") for dct in hf_dataset[k]]

@alexander-soare alexander-soare left a comment

Reviewable status: 0 of 10 files reviewed, 9 unresolved discussions (waiting on @aliberts)


Makefile line 71 at r2 (raw file):

Previously, aliberts (Simon Alibert) wrote…

If I read your code correctly, I don't think this is actually enabled.

Oops. Okay, I worked it in with r3 (revision 3).


lerobot/common/datasets/online_buffer.py line 91 at r2 (raw file):

Previously, aliberts (Simon Alibert) wrote…

"you need to specify in advance how much disk space to reserve", is that clear enough?

I understood that, so I'll take it that the limit is simply your disk space.
Additionally, I was wondering if performance can vary with the size of the buffer (?)

In theory, it just depends on whether the whole buffer can fit in RAM or not. If it can fit, it's constant-time random access. Try this script to prove that:

"""
Run with/without DATA_DIR=tests/data to try with different dataset sizes.
"""

from itertools import cycle

import numpy as np
from tqdm import trange

from lerobot.common.datasets.online_buffer import DataBuffer

buffer = DataBuffer.from_huggingface_hub("lerobot/pusht", decode_video=True)

iter_buffer = cycle(buffer)
np.random.seed(0)
indices = [int(i) for i in np.random.permutation(len(buffer))]
cycle_indices = cycle(indices)


N = 100000

for _ in trange(N):
    _ = buffer[next(cycle_indices)]

If it can't fit in RAM, it needs to be read from disk into RAM. And there's lots of tricky stuff that numpy does under the hood to optimize this. It's the same deal with HF datasets though. I haven't fully characterized this behaviour, but I have verified that it definitely runs slower when it needs to access parts of the buffer that haven't been paged into memory.

What, if any of this do you think I should include in the docs here now?


lerobot/common/datasets/online_buffer.py line 93 at r2 (raw file):

Previously, aliberts (Simon Alibert) wrote…

Makes sense, I think the docstring in add_episodes is enough, thanks.

Done.


lerobot/common/datasets/online_buffer.py line 134 at r2 (raw file):

Previously, aliberts (Simon Alibert) wrote…

From a practical standpoint, should we rather define that as a disk-space amount?
I'm wondering how, as a user, I should determine this size as a number of frames given my disk space.

I think if we define it as a disk space amount, one can ask the opposite question: from a practical standpoint, how does a user know how big their buffer will be in terms of episode frames? And I think this is the more important question for the ML side.

So, I wonder what the right approach is here.

One idea could be to have a calculator fn that you can pass an episode into and it will tell you the storage space required per frame.

Another idea is to simply add documentation here: "don't worry, if you try to reserve 80% or more of your disk space, the program will exit with an exception and not create the buffer at all" (see _make_memmap_safe).

Got any other ideas?
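
For reference, a minimal sketch of the kind of guard being described (a hypothetical helper; the real _make_memmap_safe may look different), refusing to allocate a memmap that would take more than a given fraction of the free disk space:

import shutil
from pathlib import Path

import numpy as np


def make_memmap_safe(path: Path, dtype: np.dtype, shape: tuple[int, ...], max_fraction: float = 0.8) -> np.memmap:
    """Create a memmap only if it uses less than `max_fraction` of the free disk space."""
    required_bytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    free_bytes = shutil.disk_usage(path.parent).free
    if required_bytes >= max_fraction * free_bytes:
        raise RuntimeError(
            f"Refusing to reserve {required_bytes / 1e9:.1f} GB: more than "
            f"{max_fraction:.0%} of the {free_bytes / 1e9:.1f} GB free on disk."
        )
    return np.memmap(path, dtype=dtype, mode="w+", shape=shape)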


lerobot/common/datasets/online_buffer.py line 369 at r2 (raw file):

Previously, aliberts (Simon Alibert) wrote…

I think we should also retrieve that from the metadata in the future as well (ok for now).

This information is saved in the metadata. But at this point we are checking the data passed into add_episodes by the API user. They can pass anything in, regardless of the metadata. Does this address your comment?


lerobot/common/datasets/online_buffer.py line 545 at r2 (raw file):
The 0th-order answer to all this is that the current use cases of real users are hugely helped by this feature. For example:

  1. I've been working a lot with TD-MPC. Training TD-MPC with PushT, it takes about 9-10 seconds per 100 training iters with decode_video=True. On the other hand, it takes about 26~27 seconds with decode_video=False (hopefully this also helps answer your absolute gains question). (and worth noting, this is with 12 workers for data loading).
  2. 2~3 users have reported very slow training with VQ-BET PushT. With decode_video=True, 100 iters takes ~3s. With decode_video=False, 100 iters takes ~19s.

Tabulating the results:

|                     | TD-MPC PushT   | VQ-BET PushT   |
|---------------------|----------------|----------------|
| pusht, decode=True  | 9s / 100 iter  | 2s / 100 iter  |
| pusht, decode=False | 26s / 100 iter | 19s / 100 iter |

So yeah it's more like very-nice-to-have.

Now, it is true that if I use pusht_image, it becomes equivalent to pusht with decode=True in terms of dataloading time. So, the other answer could be: "if you want fast data loading at the expense of disk space, reupload your dataset in image format". Do we have a script for this? What do you think of this approach, and then I drop this decode logic? (Really, I'd be moving the logic to a script instead of this class.)


But to answer the rest of your questions:

Does this 10x factor take into account initial decoding?
The 10x was just the data_s in the logging, so no, initial decoding is not included. In saying that, it shouldn't matter, because it's a one-off cost.

And will that work with streaming?
I haven't even thought of how we'd stream videos for training. Nor do I know if such a thing is even practically feasible. Would love to get your thoughts f2f :)


lerobot/common/datasets/online_buffer.py line 558 at r2 (raw file):

Previously, aliberts (Simon Alibert) wrote…

I see, could you add a comment about that?
We should probably fix that later on during the dataset rework. Parquet does support types, so once written properly we shouldn't need that.

Added a comment. I agree that we should rework it eventually.


lerobot/common/datasets/online_buffer.py line 620 at r2 (raw file):

Previously, aliberts (Simon Alibert) wrote…

I think we should avoid using storage space for that and have a single template path (which will take episode number and camera key) in the metadata instead. That way we also avoid the annoying max length issue.

Agreed. Yeah, I was thinking of doing this as a close follow-up PR for a couple of reasons: (a) I wanted to keep a cap on this PR; (b) it would be easier for me to make decisions once I go deeper into deprecating our usage of HF Datasets. Is that okay with you?

@alexander-soare alexander-soare left a comment

Reviewable status: 0 of 10 files reviewed, 9 unresolved discussions (waiting on @aliberts)


lerobot/common/datasets/online_buffer.py line 545 at r2 (raw file):

Previously, alexander-soare (Alexander Soare) wrote… [the full comment quoted above, whose table originally listed VQ-BET PushT with decode=False at 6s / 100 iter]

Oh, and the bottom right cell on the table is incorrect. Should be 19s/ 100 iter.

@alexander-soare alexander-soare left a comment

Reviewable status: 0 of 10 files reviewed, 9 unresolved discussions (waiting on @aliberts)


lerobot/common/datasets/online_buffer.py line 545 at r2 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

Oh, and the bottom right cell on the table is incorrect. Should be 19s/ 100 iter.

|                     | TD-MPC PushT   | VQ-BET PushT   |
|---------------------|----------------|----------------|
| pusht, decode=True  | 9s / 100 iter  | 2s / 100 iter  |
| pusht, decode=False | 26s / 100 iter | 19s / 100 iter |

@aliberts aliberts left a comment

Reviewed 1 of 3 files at r3.
Reviewable status: 1 of 10 files reviewed, 5 unresolved discussions (waiting on @alexander-soare)


lerobot/common/datasets/online_buffer.py line 91 at r2 (raw file):

Previously, alexander-soare (Alexander Soare) wrote… [the full comment with the random-access benchmark script, quoted above]

I see, thank you for the insights!
I think we should add just that (keeping it high level, just so that people are aware of this) and then point to memmap's docs if the user wants to know more. How does that sound?


lerobot/common/datasets/online_buffer.py line 134 at r2 (raw file):

I think if we define it as a disk space amount, one can ask the opposite question [...]
So, I wonder what the right approach is here.

Yes I agree, I don't have a strong opinion either way.

I think the dataset refactor will probably help us when we get there. For instance we could imagine 2 components for this buffer: one for states/action for which we can easily calculate its size in advance, and one more dynamic for images/video (thinking out loud here)


lerobot/common/datasets/online_buffer.py line 369 at r2 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

This information is saved in the metadata. But at this point we are checking the data passed into add_episodes by the API user. They can pass anything in, regardless of the metadata. Does this address your comment?

I understand, thanks!


lerobot/common/datasets/online_buffer.py line 545 at r2 (raw file):

I've been working a lot with TD-MPC. Training TD-MPC with PushT, it takes about 9-10 seconds per 100 training iters with decode_video=True. On the other hand, it takes about 26~27 seconds with decode_video=False (hopefully this also helps answer your absolute gains question). (and worth noting, this is with 12 workers for data loading).

I'd take these numbers with a bucket of salt: pusht is the worst dataset to evaluate video decoding against and is very much not representative of our datasets (in fact it's an outlier: 96x96 pixels). Decoding time scales much better with resolution; what about a 1280x720 or a 1920x1080 image? Thinking about the future, I think the direction we're going towards is larger image sizes, not smaller.

Now, it is true that if I use pusht_image, it becomes equivalent to pusht with decode=True in terms of dataloading time. So, the other answer could be: "if you want fast data loading at the expense of disk space, reupload your dataset in image format". Do we have a script for this?

I'd really argue against it. I think for sharing especially (so everything hub-related) there should be only one way to do it. This impacts storage, visualization, download times, etc. Plus the obligation to maintain both options. In fact I'm all for deprecating image datasets altogether. I'd be more lenient about what happens locally. If we can easily cache locally to make training go faster, I'm all for it.

The 10x was just the data_s in the logging, so no, initial decoding is not included. In saying that, it shouldn't matter, because it's a one-off cost.

I'm still asking because if I'm to be sold on faster loading times, I want to know as a user how much I'll save on my training run — including that part.


lerobot/common/datasets/online_buffer.py line 620 at r2 (raw file):

Previously, alexander-soare (Alexander Soare) wrote…

I'm agreed. Yeah I was thinking of doing this as a close follow up PR for a couple of reasons. (a) I wanted to keep a cap on this PR. (b) It would be easier for me to make decisions once I go deeper into deprecating our usage of HF Datasets. Is that okay with you?

Good to me, thanks.

@alexander-soare alexander-soare left a comment

Reviewable status: 1 of 10 files reviewed, 3 unresolved discussions (waiting on @aliberts)


lerobot/common/datasets/online_buffer.py line 134 at r2 (raw file):

Previously, aliberts (Simon Alibert) wrote…

I think if we define it as a disk space amount, one can ask the opposite question [...]
So, I wonder what the right approach is here.

Yes I agree, I don't have a strong opinion either way.

I think the dataset refactor will probably help us when we get there. For instance we could imagine 2 components for this buffer: one for states/action for which we can easily calculate its size in advance, and one more dynamic for images/video (thinking out loud here)

Note to self: Let's leave this open for the next revision. I think I want to do something more about precalculating the required memory. The current issue is that you can end up making some of the memmaps and only get the 80% exception when creating the last one, which leaves the whole data buffer in a corrupt state (it has some of the data keys, but not all of them).


lerobot/common/datasets/online_buffer.py line 545 at r2 (raw file):

Previously, aliberts (Simon Alibert) wrote… [the full comment quoted above]

Our offline discussion clarified a few things here. The takeaway is: let's leave the video decoding option, as there's such a strong need for it in the short/mid term. We'll try to be mindful of keeping it under the hood and making sure it doesn't scope-creep.

@alexander-soare alexander-soare left a comment

Reviewable status: 1 of 10 files reviewed, 4 unresolved discussions (waiting on @aliberts)


lerobot/common/datasets/online_buffer.py line 597 at r3 (raw file):

        data_dict = {}
        for k, feature in hf_dataset.features.items():
            if isinstance(feature, datasets.features.Image):

Okay, after our conversation I'm realizing this branch should have an equivalent to if decode_video in the branch below. I am turning these PIL images into memmaps by default, when I perhaps shouldn't be doing that. Once you've got the headspace to tackle this, I'd like to have another conversation to get your thoughts on some design questions.

In the meantime, I am prototyping in a new branch.


alexander-soare commented Sep 25, 2024

I made another branch c86c755 where I tried saving the png images in the storage directory and decoding them on the fly. I compared that against other methods. Results here: https://docs.google.com/spreadsheets/d/1SSGUF2ka8L8d3uhwsI33CCyh3bydMCDG95jHn3QBJhA/edit?gid=1553838219#gid=1553838219

Some takeaways:

  • DataBuffer > LeRobotDataset, pretty much unanimously.
  • You already knew that video data-loading gets more efficient with bigger chunk size. In fact, at resolutions of 100s x 100s of pixels, it's not too much slower than memmaps! At resolutions around 100x100 pixels, memmaps win (which is good, because that's the regime in which we can afford the disk capacity).
  • In regimes where memmaps win, PNG images are not far behind! Based on this, I'm considering dropping the "decode" option.

Script here:

import subprocess
import time
from pathlib import Path

import numpy as np
from torch.utils.data import DataLoader
from tqdm import trange

from lerobot.common.datasets.lerobot_dataset import CODEBASE_VERSION, DATA_DIR, LeRobotDataset
from lerobot.common.datasets.online_buffer import DataBuffer
from lerobot.common.datasets.utils import cycle, load_info
from lerobot.common.utils.utils import seeded_context


def du(path):
    """disk usage in human readable format"""
    return subprocess.check_output(["du", "-sh", path]).split()[0].decode("utf-8")


def run_test(
    repo_id: str,
    dataset_type: str,
    n_iters: int,
    chunk_size: int,
    batch_size: int,
    n_workers: int,
    decode: bool,
) -> tuple[str | None, float]:
    lerobot_dataset_info = load_info(repo_id, version=CODEBASE_VERSION, root=DATA_DIR)
    fps = lerobot_dataset_info["fps"]

    lerobot_dataset = LeRobotDataset(repo_id)

    delta_timestamps = np.arange(chunk_size, dtype=np.float32) / fps
    delta_timestamps = {k: delta_timestamps for k in lerobot_dataset.camera_keys}

    if dataset_type == "lerobot_dataset":
        dataset = LeRobotDataset(repo_id, delta_timestamps=delta_timestamps)
        storage = None
    elif dataset_type == "data_buffer":
        dataset = DataBuffer.from_huggingface_hub(
            repo_id, fps=fps, delta_timestamps=delta_timestamps, decode_images=decode
        )
        storage = du(dataset.storage_dir)

    with seeded_context(0):
        dataloader = DataLoader(dataset, batch_size=batch_size, num_workers=n_workers)
        iter_dataloader = cycle(dataloader)
        start = time.perf_counter()
        for _ in trange(n_iters):
            _ = next(iter_dataloader)
        total_time = time.perf_counter() - start

    return storage, total_time


if __name__ == "__main__":
    import pandas as pd

    RESULTS_PATH = Path("results.csv")

    sweep_params = [
        # for example
        {
            "repo_id": "lerobot/pusht",
            "dataset_type": "data_buffer",
            "decode": False,
            "n_iters": 100,
            "chunk_size": 5,
            "batch_size": 64,
            "n_workers": 8,
        },
    ]

    cols = list(sweep_params[0].keys()) + ["storage", "time_s"]
    results = pd.read_csv(RESULTS_PATH, index_col=0) if RESULTS_PATH.exists() else pd.DataFrame(columns=cols)

    for params in sweep_params:
        if (results.iloc[:, :-2] == pd.Series(params)).all(axis=1).any():
            print("Skipping test as it is already in the results")
            print(params)
            continue
        storage, total_time = run_test(**params)
        params["storage"] = storage or ""
        params["time_s"] = total_time

        results = pd.concat([results, pd.DataFrame([params])], ignore_index=True)
        results.to_csv(RESULTS_PATH)

@alexander-soare alexander-soare marked this pull request as draft September 25, 2024 10:49