Reproducing training of MegaPose fails #97

Open
ponimatkin opened this issue Oct 28, 2023 · 1 comment

@ponimatkin

Hi,

I tried to reproduce the MegaPose training on Jean Zay, but it failed with this error:

RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1646756402876/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1169, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

I made some fixes to the codebase to get the code running on JZ here: https://github.com/ponimatkin/happypose/commit/44aacdb79e0557ae50ea84716338e322c6ebe239

Do you happen to know what causes this NCCL error?
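
For anyone hitting the same message: `ncclInvalidUsage` is generic, so a first step is usually to turn on verbose logging and check that no two ranks on a node share a GPU (one commonly reported trigger, not confirmed as the cause here). Below is a minimal sketch under stated assumptions: the job is launched through SLURM or `torchrun`, so `SLURM_LOCALID` or `LOCAL_RANK` is set, and the helper names `nccl_debug_env`, `local_rank_from_env`, and `find_duplicate_devices` are hypothetical, not part of happypose.

```python
import os
from collections import Counter


def nccl_debug_env(base=None):
    """Return a copy of the environment with verbose distributed logging on.

    NCCL_DEBUG is read by NCCL itself; TORCH_DISTRIBUTED_DEBUG is read by
    recent PyTorch versions (assumed to be available in the one used here).
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"
    env["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    return env


def local_rank_from_env(environ):
    """Node-local rank: SLURM_LOCALID (srun) or LOCAL_RANK (torchrun), else 0."""
    return int(environ.get("SLURM_LOCALID", environ.get("LOCAL_RANK", 0)))


def find_duplicate_devices(assignments):
    """Given one (hostname, device_index) pair per rank, return the pairs
    that more than one rank would use."""
    counts = Counter(assignments)
    return sorted(key for key, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Hypothetical 3-rank layout in which two ranks land on the same GPU.
    layout = [("node1", 0), ("node1", 0), ("node1", 1)]
    print("local rank:", local_rank_from_env(os.environ))
    print("shared devices:", find_duplicate_devices(layout))
```

If the duplicate list is non-empty, each process is probably not selecting its own device from its node-local rank before the process group is initialised.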

@ponimatkin
Author

OK, the fixes in 6b3862f and 55fc99c seem to resolve the issue, and I'm now running coarse model training.
