Upgrade OnlineBuffer to DataBuffer and use it in the train script. #445
base: main
Conversation
    return Path(f"/tmp/{repo_id}_{fingerprint}{'_decoded' if decode_video else ''}")
    ...
def compute_sampler_weights(
FYI: This will look very different once we drop `LeRobotDataset`. It has decent test coverage for now, and was already here before the PR. Feel free to skim or skip.
    pass
    ...
class DataBuffer(torch.utils.data.Dataset):
FYI: Totally happy for other name suggestions.
tests/test_data_buffer.py (Outdated)

# Arbitrarily set small dataset sizes, making sure to have uneven sizes.
@pytest.mark.parametrize("offline_dataset_size", [0, 6])
FYI: all tests from here and below were here before the PR. Feel free to skim / skip.
Reviewable status: 0 of 10 files reviewed, 1 unresolved discussion (waiting on @aliberts)
tests/test_data_buffer.py
line 16 at r1 (raw file):
# See the License for the specific language governing permissions and # limitations under the License.d from copy import deepcopy
Test comment
Backwards compat check script:

```python
"""
(this will take < 20 mins, covers FULL datasets)
python scripts/data_buffer_bc_checks.py

OR

(this will take hours, covers lots of datasets, probs overkill)
DATA_DIR=tests/data python scripts/data_buffer_bc_checks.py
"""
import os
from itertools import product

import torch
import torch.utils
from tqdm import trange

from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
from lerobot.common.datasets.online_buffer import DataBuffer

if os.environ.get("DATA_DIR", "") == "tests/data":
    datasets_to_test = [
        "lerobot/pusht",
        "lerobot/pusht_image",
        "lerobot/pusht_keypoints",
        "lerobot/aloha_mobile_cabinet",
        "lerobot/aloha_mobile_chair",
        "lerobot/aloha_mobile_elevator",
        "lerobot/aloha_mobile_shrimp",
        "lerobot/aloha_mobile_wash_pan",
        "lerobot/aloha_mobile_wipe_wine",
        "lerobot/aloha_sim_insertion_human",
        "lerobot/aloha_sim_insertion_human_image",
        "lerobot/aloha_sim_insertion_scripted",
        "lerobot/aloha_sim_insertion_scripted_image",
        "lerobot/aloha_sim_transfer_cube_human",
        "lerobot/aloha_sim_transfer_cube_human_image",
        "lerobot/aloha_sim_transfer_cube_scripted",
        "lerobot/aloha_sim_transfer_cube_scripted_image",
        "lerobot/aloha_static_battery",
        "lerobot/aloha_static_candy",
        "lerobot/aloha_static_coffee",
        "lerobot/aloha_static_coffee_new",
        "lerobot/aloha_static_cups_open",
        "lerobot/aloha_static_fork_pick_up",
        "lerobot/aloha_static_pingpong_test",
        "lerobot/aloha_static_pro_pencil",
        "lerobot/aloha_static_screw_driver",
        "lerobot/aloha_static_tape",
        "lerobot/aloha_static_thread_velcro",
        "lerobot/aloha_static_towel",
        "lerobot/aloha_static_vinh_cup",
        "lerobot/aloha_static_vinh_cup_left",
        "lerobot/aloha_static_ziploc_slide",
        "lerobot/umi_cup_in_the_wild",
        "lerobot/unitreeh1_fold_clothes",
        "lerobot/unitreeh1_rearrange_objects",
        "lerobot/unitreeh1_two_robot_greeting",
        "lerobot/unitreeh1_warehouse",
        "lerobot/xarm_lift_medium",
        "lerobot/xarm_lift_medium_image",
        "lerobot/xarm_lift_medium_replay",
        "lerobot/xarm_lift_medium_replay_image",
        "lerobot/xarm_push_medium",
        "lerobot/xarm_push_medium_image",
        "lerobot/xarm_push_medium_replay",
        "lerobot/xarm_push_medium_replay_image",
    ]
else:
    # Reduced test set
    datasets_to_test = [
        "lerobot/pusht",
        "lerobot/pusht_image",
        "lerobot/pusht_keypoints",
        "lerobot/unitreeh1_two_robot_greeting",  # chosen because it contains multiple image keys
    ]

for dataset_repo_id in datasets_to_test:
    lerobot_dataset = LeRobotDataset(dataset_repo_id)
    fps = lerobot_dataset.fps
    assert "observation.state" in lerobot_dataset[0]
    delta_timestamps = {
        "action": [-1 / fps, 0, 1 / fps],
        "observation.state": [-1 / fps, 0, 1 / fps],
        **{k: [-1 / fps, 0, 1 / fps] for k in lerobot_dataset.camera_keys},
    }
    delta_timestamps_options = [None, delta_timestamps]
    decode_images_options = [False]
    if os.environ.get("DATA_DIR", "") == "tests/data":
        decode_images_options.append(True)
    for decode_images, delta_timestamps in product(decode_images_options, delta_timestamps_options):
        lerobot_dataset = LeRobotDataset(dataset_repo_id, delta_timestamps=delta_timestamps)
        buffer = DataBuffer.from_huggingface_hub(
            dataset_repo_id, decode_images=decode_images, fps=fps, delta_timestamps=delta_timestamps
        )
        assert len(lerobot_dataset) == len(buffer)
        lerobot_dataset_dataloader = iter(
            torch.utils.data.DataLoader(
                lerobot_dataset, batch_size=16, num_workers=8, shuffle=False, drop_last=False
            )
        )
        buffer_dataloader = iter(
            torch.utils.data.DataLoader(buffer, batch_size=16, num_workers=8, shuffle=False, drop_last=False)
        )
        for _ in trange(
            len(lerobot_dataset_dataloader),
            desc=f"{dataset_repo_id}{'_dt' if delta_timestamps is not None else ''}{'_decoded' if decode_images else ''}",
        ):
            buffer_item = next(buffer_dataloader)
            lerobot_dataset_item = next(lerobot_dataset_dataloader)
            assert set(buffer_item) == set(lerobot_dataset_item)
            for k in buffer_item:
                assert buffer_item[k].dtype == lerobot_dataset_item[k].dtype
                assert torch.equal(buffer_item[k], lerobot_dataset_item[k])
```
First review, where I focused mainly on the `from_huggingface_hub` method. I don't think I should review further until we discuss my comment about the video buffering, as I think there was a misunderstanding on the feature-set/perimeter of this dataset reshape. I simply don't think we should buffer any videos at all; this defeats a lot of the purpose of video encoding/decoding (except for new online episodes, where it makes sense). Why do you think it is necessary?
Reviewable status: 0 of 10 files reviewed, 5 unresolved discussions (waiting on @alexander-soare)
lerobot/common/datasets/online_buffer.py
line 69 at r2 (raw file):
Data is considered to come in the form of "episodes" (an instance of a robot performing a task). Episodes are made up of "frames", which are chronoligically ordered and contain timestamp aligned data, potentially
Suggestion:
chronologically
lerobot/common/datasets/online_buffer.py
line 91 at r2 (raw file):
The `add_episodes` method can be used to insert more data in the form of integral episodes (starting from frame 0 and with the frames ordered). The buffer has a compulsory size limit, which must be provided when
Is there a limit to how big this can be?
Is there an observable decline in performance past a certain threshold?
Code quote:
The buffer has a compulsory size limit
lerobot/common/datasets/online_buffer.py
line 93 at r2 (raw file):
frame 0 and with the frames ordered). The buffer has a compulsory size limit, which must be provided when creating a new one. Data is inserted in a circular fashion, inserting after the most recently added frame, and wrapping around to the start when necessary (in which case older episodes are overwritten).
Does that mean that if the buffer is full and I `add_episodes` with, say, 5 frames, this will scrap the first 5 frames from another episode? Why leave it rather than remove it entirely?
EDIT: nevermind I just read your docstring in that method. Feels weird to leave half episodes but I guess that makes sense as long as this is just a buffer and not a drop-in replacement for LeRobotDataset.
Code quote:
in which case older episodes are overwritten
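The wrap-around behaviour under discussion can be sketched in a few lines (a toy illustration with hypothetical names, not the actual `DataBuffer` internals):

```python
import numpy as np

# Toy fixed-capacity frame store with circular insertion.
capacity = 8
frames = np.full(capacity, -1)  # -1 marks unused slots
next_index = 0  # where the next frame will be written

def add_episode(episode_frames):
    """Insert frames after the most recent one, wrapping and overwriting the oldest."""
    global next_index
    for frame in episode_frames:
        frames[next_index % capacity] = frame
        next_index += 1

add_episode([0, 1, 2, 3, 4])  # fills slots 0-4
add_episode([5, 6, 7, 8, 9])  # wraps around: frames 8 and 9 overwrite the oldest frames 0 and 1
print(frames.tolist())  # [8, 9, 2, 3, 4, 5, 6, 7]
```

Note how only the first two frames of the first episode get overwritten, which matches the "half episodes" behaviour questioned here.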
lerobot/common/datasets/online_buffer.py
line 545 at r2 (raw file):
repo_id: The dataset repository ID.
decode_video: If repo_id refers to a video dataset (the image observations are encoded as videos),
    decode the videos and store the frames in a numpy memmap.
I did not have that understanding about this new format. Sorry if I've not been very clear about that but what I had in mind was that it should only take care of the non-video data (and store that as np.memmap).
There is the question of how to handle/add new episodes in the online dataset where I think it does make sense to store frames temporarily as memmaps, but I don't think existing videos should be buffered.
Code quote:
decode_video: If repo_id refers to a video dataset (the image observations are encoded as videos),
decode the videos and store the frames in a numpy memmap.
lerobot/common/datasets/online_buffer.py
line 558 at r2 (raw file):
hf_dataset = load_hf_dataset(repo_id, version=CODEBASE_VERSION, root=root, split="train")
hf_dataset.set_transform(lambda x: x)
Why does this need to exist?
Code quote:
hf_dataset.set_transform(lambda x: x)
Reviewable status: 0 of 10 files reviewed, 5 unresolved discussions (waiting on @aliberts)
lerobot/common/datasets/online_buffer.py
line 69 at r2 (raw file):
Data is considered to come in the form of "episodes" (an instance of a robot performing a task). Episodes are made up of "frames", which are chronoligically ordered and contain timestamp aligned data, potentially
Done.
lerobot/common/datasets/online_buffer.py
line 91 at r2 (raw file):
Previously, aliberts (Simon Alibert) wrote…
Is there a limit to how big this can be?
Is there an observable decline in performance passed a certain threshold?
If I explain that this is because "you need to specify in advance how much disk space to reserve", is that clear enough?
Then I'd hope it would be implicit that it has nothing to do with other things like performance.
Maybe I should have worded it like that instead of "The buffer has a compulsory size limit", which I see now can be subtly misleading.
lerobot/common/datasets/online_buffer.py
line 93 at r2 (raw file):
Why leave it rather than remove it entirely?
I actually had some code to do this, but it doesn't play nicely at all with the data loader when num_workers > 0, and an online training loop. This is because it ends up reducing the dataset length, and the dataloader has already decided which indices to fetch in advance (meaning you could get an IndexError).
Does that explanation satisfy you, and if so, do you think I should explain it somehow here?
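The hazard described here can be reproduced with a toy shrinking dataset (illustrative only; `ShrinkableDataset` is hypothetical, and even a single-process loader shows the stale-index problem):

```python
import torch

class ShrinkableDataset(torch.utils.data.Dataset):
    """Toy dataset whose length can shrink mid-epoch."""
    def __init__(self, n):
        self.data = list(range(n))
    def __len__(self):
        return len(self.data)
    def __getitem__(self, i):
        return self.data[i]  # raises IndexError once the data shrinks

dataset = ShrinkableDataset(10)
loader = iter(torch.utils.data.DataLoader(dataset, batch_size=2, shuffle=False))
next(loader)  # the sampler has already planned indices based on the old length
del dataset.data[4:]  # an "episode removal" shrinks the dataset to 4 items
got_index_error = False
try:
    for _ in range(4):
        next(loader)
except IndexError:
    got_index_error = True
print(got_index_error)  # True: stale planned indices outlive the shrink
```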
lerobot/common/datasets/online_buffer.py
line 545 at r2 (raw file):
Previously, aliberts (Simon Alibert) wrote…
I did not have that understanding about this new format. Sorry if I've not been very clear about that but what I had in mind was that it should only take care of the non-video data (and store that as np.memmap).
There is the question of how to handle/add new episodes in the online dataset where I think it does make sense to store frames temporarily as memmaps, but I don't think existing videos should be buffered.
Think of this feature as a nice-to-have extra. It gives you a massive speedup over video decoding (10x +), and if your dataset is small, it's definitely worth it. For example, try training vqbet as I shared in the main PR comment with and without it.
If we get video decoding faster, I'd remove this feature.
lerobot/common/datasets/online_buffer.py
line 558 at r2 (raw file):
Previously, aliberts (Simon Alibert) wrote…
Why does this need to exist?
There is some default transform in place. I know this because when I comment this out, my tests fail due to dtype issues. I could figure out exactly what that transform is, but no transform at all is better than an unknown one, both for speed and for the degree of control we have on our end.
Thoughts?
Reviewable status: 0 of 10 files reviewed, 9 unresolved discussions (waiting on @alexander-soare)
Makefile
line 71 at r2 (raw file):
policy.chunk_size=20 \
training.batch_size=2 \
training.image_transforms.enable=true \
If I read your code correctly, I don't think this is actually enabled.
lerobot/common/datasets/online_buffer.py
line 91 at r2 (raw file):
"you need to specify in advance how much disk space to reserve", is that clear enough?
I understood that, so I'll take it that the limit is simply your disk space.
Additionally, I was wondering if performance can vary with the size of the buffer (?)
lerobot/common/datasets/online_buffer.py
line 93 at r2 (raw file):
Previously, alexander-soare (Alexander Soare) wrote…
Why leave it rather than remove it entirely?
I actually had some code to do this, but it doesn't play nicely at all with the data loader when num_workers > 0, and an online training loop. This is because it ends up reducing the dataset length, and the dataloader has already decided which indices to fetch in advance (meaning you could get an IndexError).
Does that explanation satisfy you, and if so, do you think I should explain it somehow here?
Makes sense, I think the docstring in `add_episodes` is enough, thanks.
lerobot/common/datasets/online_buffer.py
line 134 at r2 (raw file):
created with the first call to `add_episodes`.
buffer_capacity: How many frames should be stored in the buffer as a maximum. Be aware of your
    system's available disk space when choosing this. Note that if `storage_dir` references an
From a practical standpoint, should we rather define that as a disk-space amount?
I'm wondering how, as a user, I should determine this size in a number of frames given my disk space.
Code quote:
buffer_capacity: How many frames should be stored in the buffer as a maximum. Be aware of your
system's available disk space when choosing this. Note that if `storage_dir` references an
lerobot/common/datasets/online_buffer.py
line 369 at r2 (raw file):
)
else:
    _, h, w, c = data[k].shape
I think we should retrieve that from the metadata in the future as well (ok for now).
Code quote:
_, h, w, c = data[k].shape
lerobot/common/datasets/online_buffer.py
line 545 at r2 (raw file):
Previously, alexander-soare (Alexander Soare) wrote…
Think of this feature as a nice-to-have extra. It gives you a massive speedup over video decoding (10x +), and if your dataset is small, it's definitely worth it. For example, try training vqbet as I shared in the main PR comment with and without it.
If we get video decoding faster, I'd remove this feature.
I think this is a lot of complexity for a nice-to-have, but setting that aside, the issue is that speed is not the only factor here. What if we have hundreds of GB of video (as we do on some open X datasets)? We can't really scale with that.
Does this 10x factor take into account initial decoding? And will that work with streaming?
Video is the biggest chunk of the data by far (>99% in most of our repos), it's expected to take the largest amount of time during data loading. Even if we're able to further improve decoding time in the future by fiddling with encoding, it won't be by a factor of 10x because of the decompression — which you took out of the equation iiuc.
I do see a way that we could significantly improve it in the future by training sequentially on each frame in a per-episode loop. That way, we could decode and buffer the video to memory at the start of an episode and use that throughout that episode until the next one. I believe this is one of the features Torchcodec is working on (don't quote me on that though).
I'm also curious about the absolute gains (so not relative improvement). How much time does this shave off a 5h training run for example ?
I get your point that it can be useful in certain scenarios, but that has to be weighed against those questions as well. Wdyt?
lerobot/common/datasets/online_buffer.py
line 558 at r2 (raw file):
Previously, alexander-soare (Alexander Soare) wrote…
There is some default transform in place. I know this because when I comment this out, my tests fail due to dtype issues. I could figure out exactly what that is, but no transform is better than transform for speed and the degree of control we have on our end.
Thoughts?
I see, could you add a comment about that?
We should probably fix that later on during the dataset rework. Parquet does support types so once written properly we shouldn't have the need for that.
lerobot/common/datasets/online_buffer.py
line 620 at r2 (raw file):
else:
    data_dict[k] = np.stack(
        [np.array(dct["path"], dtype=f"S{MAX_VIDEO_PATH_LENGTH}") for dct in hf_dataset[k]]
    )
I think we should avoid using storage space for that and have a single template path (which will take episode number and camera key) in the metadata instead. That way we also avoid the annoying max length issue.
Code quote:
data_dict[k] = np.stack(
    [np.array(dct["path"], dtype=f"S{MAX_VIDEO_PATH_LENGTH}") for dct in hf_dataset[k]]
)
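The template suggestion could look something like this (the template string and helper are hypothetical, loosely modeled on LeRobot's video file naming):

```python
# Hypothetical metadata entry: one format template instead of one
# fixed-length byte string stored per frame.
video_path_template = "videos/{video_key}_episode_{episode_index:06d}.mp4"

def video_path(video_key: str, episode_index: int) -> str:
    """Reconstruct a video path on the fly from the template."""
    return video_path_template.format(video_key=video_key, episode_index=episode_index)

print(video_path("observation.images.top", 42))
# videos/observation.images.top_episode_000042.mp4
```

This removes both the per-frame storage cost and the `MAX_VIDEO_PATH_LENGTH` truncation concern.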
Reviewable status: 0 of 10 files reviewed, 9 unresolved discussions (waiting on @aliberts)
Makefile
line 71 at r2 (raw file):
Previously, aliberts (Simon Alibert) wrote…
If I read your code correctly, I don't think this is actually enabled.
Oops. Okay, I worked it in with r3 (revision 3).
lerobot/common/datasets/online_buffer.py
line 91 at r2 (raw file):
Previously, aliberts (Simon Alibert) wrote…
"you need to specify in advance how much disk space to reserve", is that clear enough?
I understood that, so I'll take that the limit is simply your disk space.
Additionally, I was wondering if performance can vary with the size of the buffer (?)
In theory, it just depends on whether the whole buffer can fit in RAM or not. If it can fit, it's constant-time random access. Try this script to prove that:

```python
"""
Run with/without DATA_DIR=tests/data to try with different dataset sizes.
"""
from itertools import cycle

import numpy as np
from tqdm import trange

from lerobot.common.datasets.online_buffer import DataBuffer

buffer = DataBuffer.from_huggingface_hub("lerobot/pusht", decode_video=True)
iter_buffer = cycle(buffer)
np.random.seed(0)
indices = [int(i) for i in np.random.permutation(len(buffer))]
cycle_indices = cycle(indices)
N = 100000
for _ in trange(N):
    _ = buffer[next(cycle_indices)]
```

If it can't fit in RAM, it needs to be read from disk into RAM, and there's lots of tricky stuff that numpy does under the hood to optimize this. It's the same deal with HF datasets though. I haven't fully characterized this behaviour, but I have verified that it definitely runs slower when it needs to access parts of the buffer that haven't been paged into memory.

What, if any, of this do you think I should include in the docs here now?
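For readers unfamiliar with the mechanics being discussed, here is the memmap behaviour in miniature (paths and shapes are arbitrary):

```python
import os
import tempfile

import numpy as np

# Create a small disk-backed array. Reads go through the OS page cache,
# so random access is effectively constant-time once pages are resident in RAM.
path = os.path.join(tempfile.mkdtemp(), "buffer.dat")
writer = np.memmap(path, dtype=np.float32, mode="w+", shape=(1000, 8))
writer[:] = np.arange(8000, dtype=np.float32).reshape(1000, 8)
writer.flush()

# Re-open read-only, as a dataloader worker would.
reader = np.memmap(path, dtype=np.float32, mode="r", shape=(1000, 8))
print(float(reader[123, 0]))  # 984.0
```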
lerobot/common/datasets/online_buffer.py
line 93 at r2 (raw file):
Previously, aliberts (Simon Alibert) wrote…
Makes sense, I think the docstring in `add_episodes` is enough, thanks.
Done.
lerobot/common/datasets/online_buffer.py
line 134 at r2 (raw file):
Previously, aliberts (Simon Alibert) wrote…
From a practical standpoint, should we rather define that as a disk-space amount?
I'm wondering how as, a user, I should determine this size in a number of frames given my disk space.
I think if we define it as a disk space amount, one can ask the opposite question: from a practical standpoint, how does a user know how big their buffer will be in terms of episode frames. And I think this is the more important question for the ML side.
So, I wonder what the right approach is here.
One idea could be to have a calculator fn that you can pass an episode into, and it will tell you the storage space required per frame.
Another idea is to simply add documentation here: "don't worry, if you try to reserve 80% or more of your disk space, the program will exit with an exception and not create the buffer at all" (see `_make_memmap_safe`).
Got any other ideas?
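The calculator idea could be as simple as summing per-frame byte counts over the arrays in an example episode (a sketch; `bytes_per_frame` is a hypothetical helper, not part of the PR):

```python
import numpy as np

def bytes_per_frame(episode: dict) -> int:
    """Given one episode as {key: array with a leading frame dimension},
    return the storage needed per frame across all keys."""
    total = 0
    for arr in episode.values():
        arr = np.asarray(arr)
        total += arr.itemsize * int(np.prod(arr.shape[1:], dtype=np.int64))
    return total

episode = {
    "observation.state": np.zeros((50, 7), dtype=np.float32),  # 28 bytes/frame
    "action": np.zeros((50, 7), dtype=np.float32),  # 28 bytes/frame
    "observation.images.top": np.zeros((50, 96, 96, 3), dtype=np.uint8),  # 27648 bytes/frame
}
print(bytes_per_frame(episode))  # 27704
```

A user could then divide their disk budget by this number to pick `buffer_capacity`.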
lerobot/common/datasets/online_buffer.py
line 369 at r2 (raw file):
Previously, aliberts (Simon Alibert) wrote…
I think we should also retrieve that from the metadata in the future as well (ok for now).
This information is saved in the metadata. But at this point we are checking the data passed into `add_episodes` by the API user. They can pass anything in, regardless of the metadata. Does this address your comment?
lerobot/common/datasets/online_buffer.py
line 545 at r2 (raw file):
The 0th order answer to all this, is that current use cases of real users are hugely helped by this feature. For example:
- I've been working a lot with TD-MPC. Training TD-MPC with PushT, it takes about 9-10 seconds per 100 training iters with decode_video=True. On the other hand, it takes about 26~27 seconds with decode_video=False (hopefully this also helps answer your absolute gains question). (and worth noting, this is with 12 workers for data loading).
- 2~3 users have reported very slow training with VQ-BET PushT. With decode_video=True, 100 iters takes ~3s. With decode_video=False, 100 iters takes ~19s.
Tabulating the results:

|  | TD-MPC PushT | VQ-BET PushT |
|---|---|---|
| pusht, decode=True | 9s / 100 iter | 2s / 100 iter |
| pusht, decode=False | 26s / 100 iter | 19s / 100 iter |
So yeah it's more like very-nice-to-have.
Now, it is true that if I use `pusht_image`, it becomes equivalent to `pusht` with decode=True in terms of dataloading time. So, the other answer could be: "if you want fast data loading at the expense of disk space, reupload your dataset in image format". Do we have a script for this? What do you think of this approach, and then I drop this decode logic? (really I'd be moving the logic to be in a script instead of this class)
But to answer the rest of your questions:
Does this 10x factor take into account initial decoding?
The 10x was just the `data_s` in the logging, so no, initial decoding is not included. In saying that, it shouldn't matter, because it's a one-off cost.
And will that work with streaming?
I haven't even thought of how we'd stream videos for training. Nor do I know if such a thing is even practically feasible. Would love to get your thoughts f2f :)
lerobot/common/datasets/online_buffer.py
line 558 at r2 (raw file):
Previously, aliberts (Simon Alibert) wrote…
I see, could you add a comment about that?
We should probably fix that later on during the dataset rework. Parquet does support types so once written properly we shouldn't have the need for that.
Added a comment. I agree that we should rework it eventually.
lerobot/common/datasets/online_buffer.py
line 620 at r2 (raw file):
Previously, aliberts (Simon Alibert) wrote…
I think we should avoid using storage space for that and have a single template path (which will take episode number and camera key) in the metadata instead. That way we also avoid the annoying max length issue.
Agreed. Yeah, I was thinking of doing this as a close follow-up PR for a couple of reasons: (a) I wanted to keep a cap on this PR; (b) it would be easier for me to make decisions once I go deeper into deprecating our usage of HF Datasets. Is that okay with you?
Reviewable status: 0 of 10 files reviewed, 9 unresolved discussions (waiting on @aliberts)
lerobot/common/datasets/online_buffer.py
line 545 at r2 (raw file):
Previously, alexander-soare (Alexander Soare) wrote…
The 0th order answer to all this, is that current use cases of real users are hugely helped by this feature. For example:
- I've been working a lot with TD-MPC. Training TD-MPC with PushT, it takes about 9-10 seconds per 100 training iters with decode_video=True. On the other hand, it takes about 26~27 seconds with decode_video=False (hopefully this also helps answer your absolute gains question). (and worth noting, this is with 12 workers for data loading).
- 2~3 users have reported very slow training with VQ-BET PushT. With decode_video=True, 100 iters takes ~3s. With decode_video=False, 100 iters takes ~19s.
Tabulating the results:

|  | TD-MPC PushT | VQ-BET PushT |
|---|---|---|
| pusht, decode=True | 9s / 100 iter | 2s / 100 iter |
| pusht, decode=False | 26s / 100 iter | 6s / 100 iter |

So yeah it's more like very-nice-to-have.
Now, it is true that if I use `pusht_image`, it becomes equivalent to `pusht` with decode=True in terms of dataloading time. So, the other answer could be: "if you want fast data loading at the expense of disk space, reupload your dataset in image format". Do we have a script for this? What do you think of this approach, and then I drop this decode logic? (really I'd be moving the logic to be in a script instead of this class)
But to answer the rest of your questions:
Does this 10x factor take into account initial decoding?
The 10x was just the `data_s` in the logging, so no, initial decoding is not included. In saying that, it shouldn't matter, because it's a one-off cost.

And will that work with streaming?

I haven't even thought of how we'd stream videos for training. Nor do I know if such a thing is even practically feasible. Would love to get your thoughts f2f :)
Oh, and the bottom-right cell of the table is incorrect. It should be 19s / 100 iter.
Reviewable status: 0 of 10 files reviewed, 9 unresolved discussions (waiting on @aliberts)
lerobot/common/datasets/online_buffer.py
line 545 at r2 (raw file):
Previously, alexander-soare (Alexander Soare) wrote…
Oh, and the bottom right cell on the table is incorrect. Should be 19s/ 100 iter.
|  | TD-MPC PushT | VQ-BET PushT |
|---|---|---|
| pusht, decode=True | 9s / 100 iter | 2s / 100 iter |
| pusht, decode=False | 26s / 100 iter | 19s / 100 iter |
Reviewed 1 of 3 files at r3.
Reviewable status: 1 of 10 files reviewed, 5 unresolved discussions (waiting on @alexander-soare)
lerobot/common/datasets/online_buffer.py
line 91 at r2 (raw file):
Previously, alexander-soare (Alexander Soare) wrote…
In theory, it just depends on whether the whole buffer can fit in RAM or not. If it can fit, it's constant-time random access. Try this script to prove that:

```python
"""
Run with/without DATA_DIR=tests/data to try with different dataset sizes.
"""
from itertools import cycle

import numpy as np
from tqdm import trange

from lerobot.common.datasets.online_buffer import DataBuffer

buffer = DataBuffer.from_huggingface_hub("lerobot/pusht", decode_video=True)
iter_buffer = cycle(buffer)
np.random.seed(0)
indices = [int(i) for i in np.random.permutation(len(buffer))]
cycle_indices = cycle(indices)
N = 100000
for _ in trange(N):
    _ = buffer[next(cycle_indices)]
```

If it can't fit in RAM, it needs to be read from disk into RAM, and there's lots of tricky stuff that numpy does under the hood to optimize this. It's the same deal with HF datasets though. I haven't fully characterized this behaviour, but I have verified that it definitely runs slower when it needs to access parts of the buffer that haven't been paged into memory.

What, if any, of this do you think I should include in the docs here now?
I see, thank you for the insights!
I think we should add just that (keeping it high level so that people are aware of this) and then point to memmap's docs if the user wants to know more. How does that sound?
lerobot/common/datasets/online_buffer.py
line 134 at r2 (raw file):
I think if we define it as a disk space amount, one can ask the opposite question [...]
So, I wonder what the right approach is here.
Yes I agree, I don't have a strong opinion either way.
I think the dataset refactor will probably help us when we get there. For instance, we could imagine 2 components for this buffer: one for states/actions, for which we can easily calculate the size in advance, and one more dynamic for images/video (thinking out loud here).
lerobot/common/datasets/online_buffer.py
line 369 at r2 (raw file):
Previously, alexander-soare (Alexander Soare) wrote…
This information is saved in the metadata. But at this point we are checking the data passed into `add_episodes` by the API user. They can pass anything in, regardless of the metadata. Does this address your comment?
I understand, thanks!
lerobot/common/datasets/online_buffer.py
line 545 at r2 (raw file):
I've been working a lot with TD-MPC. Training TD-MPC with PushT, it takes about 9-10 seconds per 100 training iters with decode_video=True. On the other hand, it takes about 26~27 seconds with decode_video=False (hopefully this also helps answer your absolute gains question). (and worth noting, this is with 12 workers for data loading).
I'd take these numbers with a bucket of salt: pusht is the worst dataset to evaluate video decoding against and is very much not representative of our datasets (in fact it's an outlier: 96x96 pixels). Decoding time scales much better with resolution; what about a 1280x720 or a 1920x1080 image? Thinking about the future, I think the direction we're going towards is larger image sizes, not smaller.
Now, it is true that if I use `pusht_image`, it becomes equivalent to `pusht` with decode=True in terms of dataloading time. So, the other answer could be: "if you want fast data loading at the expense of disk space, reupload your dataset in image format". Do we have a script for this?
I'd really argue against it. I think for sharing especially (so everything hub-related) there should be only one way to do it. This impacts storage, visualization, download times, etc. Plus the obligation to maintain both options. In fact I'm all for deprecating image datasets altogether. I'd be more lenient about what happens locally. If we can easily cache locally to make training go faster, I'm all for it.
The 10x was just the `data_s` in the logging, so no, initial decoding is not included. In saying that, it shouldn't matter, because it's a one-off cost.
I'm still asking because if I'm to be sold on faster loading times, I want to know as a user how much I'll save on my training run — including that part.
lerobot/common/datasets/online_buffer.py
line 620 at r2 (raw file):
Previously, alexander-soare (Alexander Soare) wrote…
I'm agreed. Yeah I was thinking of doing this as a close follow up PR for a couple of reasons. (a) I wanted to keep a cap on this PR. (b) It would be easier for me to make decisions once I go deeper into deprecating our usage of HF Datasets. Is that okay with you?
Good to me, thanks.
Reviewable status: 1 of 10 files reviewed, 3 unresolved discussions (waiting on @aliberts)
lerobot/common/datasets/online_buffer.py
line 134 at r2 (raw file):
Previously, aliberts (Simon Alibert) wrote…
I think if we define it as a disk space amount, one can ask the opposite question [...] So, I wonder what the right approach is here.

Yes I agree, I don't have a strong opinion either way.
I think the dataset refactor will probably help us when we get there. For instance, we could imagine 2 components for this buffer: one for states/actions, for which we can easily calculate the size in advance, and one more dynamic for images/video (thinking out loud here).
Note to self: Let's leave this open for the next revision. I think I want to do something more about precalculating the required memory. The current issue is that you can end up making some of the memmaps and only get the 80% exception when creating the last one, which leaves the whole data buffer in a corrupt state (it has some of the data keys, but not all of them).
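For reference, one way to avoid that corrupt half-created state is to sum the sizes of all planned memmaps up front and raise before creating any of them. A minimal sketch, where the spec format, the helper name, and the 80% threshold are all assumptions rather than the actual `DataBuffer` API:

```python
import math
import shutil

# bytes per element for the dtypes we'd expect to store (illustrative subset)
DTYPE_SIZES = {"uint8": 1, "float32": 4, "float64": 8, "int64": 8}


def check_required_storage(specs, path="/tmp", max_frac=0.8):
    """Raise before any memmap file is created if the full set won't fit on disk.

    `specs` maps data key -> (shape, dtype name) for every planned memmap.
    Returns the total number of bytes required.
    """
    required = sum(math.prod(shape) * DTYPE_SIZES[dtype] for shape, dtype in specs.values())
    available = shutil.disk_usage(path).free
    if required > max_frac * available:
        raise RuntimeError(f"need {required} bytes, but only {available} free at {path}")
    return required


# e.g. 1000 frames of 96x96 RGB images plus 2D actions:
specs = {
    "observation.image": ((1000, 96, 96, 3), "uint8"),
    "action": ((1000, 2), "float32"),
}
print(check_required_storage(specs))  # 27656000
```

Failing fast like this means either all memmaps get created or none do, so the buffer never ends up with a partial set of data keys.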
lerobot/common/datasets/online_buffer.py
line 545 at r2 (raw file):
Previously, aliberts (Simon Alibert) wrote…
I've been working a lot with TD-MPC. Training TD-MPC with PushT, it takes about 9-10 seconds per 100 training iters with decode_video=True. On the other hand, it takes about 26~27 seconds with decode_video=False (hopefully this also helps answer your absolute gains question). (and worth noting, this is with 12 workers for data loading).
I'd take these numbers with a bucket of salt: pusht is the worst dataset to evaluate video decoding against and is very much not representative of our datasets (in fact it's an outlier: 96x96 pixels). Decoding time scales much better with resolution: what about a 1280x720 or a 1920x1080 image? Thinking about the future, I think the direction we're going towards is larger image sizes, not smaller.
Now, it is true that if I use `pusht_image`, it becomes equivalent to `pusht` with decode=True in terms of dataloading time. So, the other answer could be: "if you want fast data loading at the expense of disk space, reupload your dataset in image format". Do we have a script for this?

I'd really argue against it. I think for sharing especially (so everything hub-related) there should be only one way to do it. This impacts storage, visualization, download times, etc. Plus the obligation to maintain both options. In fact I'm all for deprecating image datasets altogether. I'd be more lenient about what happens locally. If we can easily cache locally to make training go faster, I'm all for it.

The 10x was just the `data_s` in the logging, so no, initial decoding is not included. In saying that, it shouldn't matter, because it's a one-off cost.

I'm still asking because if I'm to be sold on faster loading times, I want to know as a user how much I'll save on my training run — including that part.
Our offline discussion clarified a few things here. The takeaway is let's leave the video decoding option as there's such a strong need for it in the short/mid term. We'll try to be mindful of keeping it under the hood and making sure it doesn't scope creep.
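Keeping the option "under the hood" could be as simple as a constructor flag that selects between two read paths: decode everything once at construction, or decode per access. A rough sketch with made-up names (`decode_fn` stands in for whatever video reader is actually used; this is not the `DataBuffer` API):

```python
import numpy as np


class FrameSource:
    """Either pre-decodes all frames into an array, or decodes on every access.

    `decode_fn(index) -> frame` stands in for a real video decoder.
    """

    def __init__(self, decode_fn, n_frames, decode_upfront=True):
        self._decode_fn = decode_fn
        self._n_frames = n_frames
        if decode_upfront:
            # one-off cost at construction, fast array reads afterwards
            self._frames = np.stack([decode_fn(i) for i in range(n_frames)])
        else:
            self._frames = None

    def __getitem__(self, idx):
        if self._frames is not None:
            return self._frames[idx]
        return self._decode_fn(idx)  # pay the decode cost on every access


# toy decoder: frame i is a 2x2 image filled with the value i
src = FrameSource(lambda i: np.full((2, 2), i, dtype=np.uint8), n_frames=4)
print(src[3][0, 0])  # 3
```

The point of the shape is that callers index frames the same way regardless of which path is active, so the tradeoff stays an internal detail.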
Reviewable status: 1 of 10 files reviewed, 4 unresolved discussions (waiting on @aliberts)
lerobot/common/datasets/online_buffer.py
line 597 at r3 (raw file):
```python
data_dict = {}
for k, feature in hf_dataset.features.items():
    if isinstance(feature, datasets.features.Image):
```

Okay, after our conversation I'm realizing this branch should have an equivalent to `if decode_video` in the branch below. I am turning these PIL images into memmaps by default, when I perhaps shouldn't be doing that. Once you've got the headspace to tackle this, I'd like to have another conversation to get your thoughts on some design questions.
In the meantime, I am prototyping in a new branch.
I made another branch c86c755 where I tried saving the png images in the storage directory and decoding them on the fly. I compared that against other methods. Results here: https://docs.google.com/spreadsheets/d/1SSGUF2ka8L8d3uhwsI33CCyh3bydMCDG95jHn3QBJhA/edit?gid=1553838219#gid=1553838219 Some takeaways:
Script here:

```python
import subprocess
import time
from pathlib import Path

import numpy as np
from torch.utils.data import DataLoader
from tqdm import trange

from lerobot.common.datasets.lerobot_dataset import CODEBASE_VERSION, DATA_DIR, LeRobotDataset
from lerobot.common.datasets.online_buffer import DataBuffer
from lerobot.common.datasets.utils import cycle, load_info
from lerobot.common.utils.utils import seeded_context


def du(path):
    """disk usage in human readable format"""
    return subprocess.check_output(["du", "-sh", path]).split()[0].decode("utf-8")


def run_test(
    repo_id: str,
    dataset_type: str,
    n_iters: int,
    chunk_size: int,
    batch_size: int,
    n_workers: int,
    decode: bool,
) -> tuple[str | None, float]:
    lerobot_dataset_info = load_info(repo_id, version=CODEBASE_VERSION, root=DATA_DIR)
    fps = lerobot_dataset_info["fps"]
    lerobot_dataset = LeRobotDataset(repo_id)
    delta_timestamps = np.arange(chunk_size, dtype=np.float32) / fps
    delta_timestamps = {k: delta_timestamps for k in lerobot_dataset.camera_keys}
    if dataset_type == "lerobot_dataset":
        dataset = LeRobotDataset(repo_id, delta_timestamps=delta_timestamps)
        storage = None
    elif dataset_type == "data_buffer":
        dataset = DataBuffer.from_huggingface_hub(
            repo_id, fps=fps, delta_timestamps=delta_timestamps, decode_images=decode
        )
        storage = du(dataset.storage_dir)
    with seeded_context(0):
        dataloader = DataLoader(dataset, batch_size=batch_size, num_workers=n_workers)
        iter_dataloader = cycle(dataloader)
    start = time.perf_counter()
    for _ in trange(n_iters):
        _ = next(iter_dataloader)
    total_time = time.perf_counter() - start
    return storage, total_time


if __name__ == "__main__":
    import pandas as pd

    RESULTS_PATH = Path("results.csv")
    sweep_params = [
        # for example
        {
            "repo_id": "lerobot/pusht",
            "dataset_type": "data_buffer",
            "decode": False,
            "n_iters": 100,
            "chunk_size": 5,
            "batch_size": 64,
            "n_workers": 8,
        },
    ]
    cols = list(sweep_params[0].keys()) + ["storage", "time_s"]
    results = pd.read_csv(RESULTS_PATH, index_col=0) if RESULTS_PATH.exists() else pd.DataFrame(columns=cols)
    for params in sweep_params:
        if (results.iloc[:, :-2] == pd.Series(params)).all(axis=1).any():
            print("Skipping test as it is already in the results")
            print(params)
            continue
        storage, total_time = run_test(**params)
        params["storage"] = storage or ""
        params["time_s"] = total_time
        results = pd.concat([results, pd.DataFrame([params])], ignore_index=True)
        results.to_csv(RESULTS_PATH)
```
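Note the dataloader above is wrapped with `cycle` so that iteration can run past the end of the dataset. If you don't have the lerobot utility handy, its behaviour can be assumed to be roughly:

```python
def cycle(iterable):
    """Endlessly re-iterate over an iterable.

    Unlike itertools.cycle, this calls iter() on the iterable again for each
    pass (via `yield from`), so a shuffling DataLoader reshuffles every epoch.
    """
    while True:
        yield from iterable


counter = cycle([1, 2, 3])
print([next(counter) for _ in range(7)])  # [1, 2, 3, 1, 2, 3, 1]
```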
What this does
This PR promotes `OnlineBuffer` to the more widely scoped `DataBuffer`. In other words, the class is generalized to be used for all purposes, rather than just as a data buffer for the online training loop. In a series of follow-up PRs, `LeRobotDataset` will gradually be superseded by `DataBuffer`. Changes to look out for:

- `online_buffer.py::OnlineBuffer` -> `data_buffer.py::DataBuffer`.
- `DataBuffer`.
- `DataBuffer.from_huggingface_hub` method. This downloads a HF hub dataset then siphons it into a `DataBuffer` (in a follow-up we might try streaming from the hub to avoid doubling up on disk space requirements).
- `train` and `eval` scripts updated to make the `DataBuffer`. Of particular interest is that the training script has two new options added which can be used to configure the usage of the `DataBuffer`.
- `DataBuffer` works primarily with numpy arrays, not torch tensors.

Note: For the LeRobot library user, this change will be silent. The options to use the `DataBuffer` are not added to the yaml configs and need to be explicitly specified. The plan is for us to road test it internally, or allow users to try it out if they really want, while follow-up PRs proceed.

TODO before merging
Why? (high-level context)
These are the main reasons we are switching to `DataBuffer` (backed by `numpy.memmap`s) from `datasets.Dataset` (backed by PyArrow):

1. `numpy.memmap`s are much more amenable to in-place updates (it's no contest, `datasets.Dataset`s were not designed for in-place updates).
2. `numpy.memmap`s are faster than `datasets.Dataset`s in the way we currently use them (on the order of 10x faster). Some work can be done both on our end, and on the `datasets` end, to speed `datasets.Dataset`s up, but even at a fundamental level PyArrow can't be made to be faster than `numpy.memmap`s (although it can get close).
3. Working with `numpy.memmap`s is dead simple for anyone who knows how to use `numpy` (everyone...). They can be sliced and mutated the same way as regular arrays. In comparison, the interface for `datasets.Dataset`s requires some getting used to (certainly worth it if the use case is right, but not in LeRobot).

It should be noted that points 1 and 2 above are particularly salient in LeRobot, where each data item in a training loop typically involves taking slices over the temporal dimension of multiple data arrays. For example, the default settings for TD-MPC require slices of 5 time-steps over actions and observations with a batch size of 256. With `datasets.Dataset`s, the data loading bottleneck slows training to a crawl.

Follow-ups
- `DataBuffer` for the offline dataset.
- `from_huggingface_hub("lerobot/image_dataset", decode_images=True)`: avoid decoding and re-encoding the PNG files.
- In `from_huggingface_hub`, stream the HF Dataset and siphon it into the LeRobot `DataBuffer` one episode at a time.

How it was tested
WIP
How to checkout & try? (for the reviewer)
Run this script with/without the `+use_lerobot_data_buffer=true`, and when `+use_lerobot_data_buffer=true`, try it with/without `+lerobot_data_buffer_decode_video=true`. Observe the `data_s` log item in the terminal.