Results before and after fixing shard shuffling bug #354
Replies: 4 comments
-
@DonkeyShot21 the paper results were with the fix. The 400m B/16 model was re-run (separately from the paper) w/ some varying hparams and also w/ 'resampling' enabled; the variations were not that significant re the bug or resampling at the lower batch sizes. However, using a higher LR and larger batch size had a bit more impact. Subsequent runs have generally used both a larger batch size and a larger initial LR. EDIT: the graph below includes the set of comparison LAION-400m runs. The far-left column is the original run w/ the shuffle bug. Then there is a run w/ shard resampling (with replacement) enabled at 32k batch size, a run with the shuffle fixed (no resampling), two 64k batch size runs (one with the same LR as the 32k run, one with a higher initial LR), and a ConvNeXt base. The other ViT runs w/o a specified LR were all 5e-4.
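The batch-size/LR interplay above (5e-4 at 32k, a higher initial LR at 64k) follows the usual heuristic of scaling the LR with global batch size. A minimal sketch of that heuristic, assuming the common linear and square-root scaling rules; the exact values used in these runs are not implied:

```python
def scaled_lr(base_lr: float, base_batch: int, batch: int, rule: str = "linear") -> float:
    """Scale a learning rate when changing global batch size.

    'linear' multiplies by the batch-size ratio; 'sqrt' by its square root.
    Both are common heuristics, not a guarantee of matched training dynamics.
    """
    ratio = batch / base_batch
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * ratio ** 0.5
    raise ValueError(f"unknown rule: {rule}")

# e.g. starting from 5e-4 at a 32k global batch, doubling to 64k
print(scaled_lr(5e-4, 32_768, 65_536, "linear"))  # → 0.001
```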
-
Thank you for the quick reply! Nice to see that with a bit of tuning LAION-400M basically matches the results obtained by OpenAI, and thanks for the clarifications! So, in general, do you recommend using resampling or not?
-
Most of the people associated with this project doing at-scale runs have been using resampling. The graph above might suggest it's a bit worse, but there is run-to-run variation, and in runs w/ larger global batch sizes (80-160k) and larger samples-seen we don't see much difference. It is, however, quite a bit more convenient, esp. on larger runs where we enable resampling and set the '# samples per epoch': w/ resampling, for many LAION-2B runs we use 64-256 'epochs' (calling them checkpoint intervals now) and set the
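The distinction being discussed is shard resampling with replacement vs. a strict per-epoch shuffle. A minimal stdlib sketch of the two selection strategies, purely illustrative (this is not OpenCLIP's or webdataset's actual dataloader code, and the shard names are made up):

```python
import random
from collections import Counter

def epoch_shuffle(shards: list[str], seed: int) -> list[str]:
    """Strict epoch: every shard visited exactly once, in random order."""
    order = shards.copy()
    random.Random(seed).shuffle(order)
    return order

def resample(shards: list[str], n: int, seed: int) -> list[str]:
    """Sampling with replacement: draw n shards for one 'checkpoint
    interval'; some shards may repeat and others may be skipped."""
    return random.Random(seed).choices(shards, k=n)

shards = [f"shard-{i:05d}.tar" for i in range(8)]
print(epoch_shuffle(shards, 0))          # a permutation of all 8 shards
print(Counter(resample(shards, 8, 0)))   # duplicates/omissions possible
```

With resampling there is no hard epoch boundary, which is why the '# samples per epoch' becomes a free parameter (effectively a checkpoint interval) rather than something dictated by the dataset size.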
-
Hi, thanks for the awesome repo.
I found this sentence in the readme:
I have a few questions: