
About the speed of multi-gpu training #9

Closed
LiewFeng opened this issue Dec 21, 2022 · 10 comments

@LiewFeng

Hi, @Cc-Hy. When I train the model on the KITTI train split, training with 2 GPUs takes more time than with 1 GPU, which is really strange. Have you encountered this phenomenon?

@sunnyHelen

Hi, can I ask how much GPU memory is needed to train this model? I need to check whether my GPU memory is enough to try it.

@sunnyHelen

@LiewFeng

@LiewFeng
Author

@sunnyHelen ~18G.

@sunnyHelen

Ok. Thanks a lot~

@Cc-Hy
Owner

Cc-Hy commented Dec 21, 2022

@LiewFeng
That's very strange. Can you provide more details of your training?
For example, which command you run, the batch size, and how much time is spent in each case.

@LiewFeng
Author

Hi, @Cc-Hy. Sorry for the late reply. The command is the same as the one provided in GETTING_STARTED.md, and I didn't modify the batch size.
With the 1-GPU setting, the first epoch takes about 10 minutes, so 60 epochs should take about 10 hours. However, the full training only takes 5 hours, which is really strange.
With the 2-GPU setting, the first epoch takes about 6 minutes, so 60 epochs should take about 6 hours, and it does take 6 hours, which is normal.
Another phenomenon is that CPU utilization is high in the 1-GPU setting but really low in the 2-GPU setting.
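
For reference, a minimal sketch of what such a launch typically looks like with OpenPCDet-style scripts (the config path below is a placeholder, not taken from this repository):

```bash
# Hypothetical sketch, assuming OpenPCDet-style training scripts;
# the cfg_file path is a placeholder.

# single-GPU training
python train.py --cfg_file cfgs/kitti_models/<model>.yaml

# 2-GPU distributed training via the dist_train.sh wrapper
bash scripts/dist_train.sh 2 --cfg_file cfgs/kitti_models/<model>.yaml
```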

@LiewFeng
Author

Experiments are conducted on the KITTI train split.

@Cc-Hy
Owner

Cc-Hy commented Dec 24, 2022

@LiewFeng
Hi, it seems your 2-GPU training time is close to mine. Each epoch takes ~6 minutes when I use 2 NVIDIA GeForce RTX 3090 GPUs, and ~12 minutes when I use one GPU.

So I think your 2-GPU training time is normal. But if your GPUs are really running at very low utilization, you may want to check your CPU status. I once ran into a situation where the CPU was the bottleneck and the GPUs could not be fully utilized.
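
One simple way to check this (generic tools, not specific to this repo) is to watch GPU and CPU utilization side by side while training runs; a minimal sketch:

```bash
# If nvidia-smi shows the GPUs frequently dropping to ~0% utilization while
# the dataloader worker processes in htop sit near 100% CPU, the CPU-side
# data pipeline is likely the bottleneck.
watch -n 1 nvidia-smi
htop
```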

@LiewFeng
Author

Hi, @Cc-Hy. I figured it out. The reason is the PyTorch version. When I run the experiment with 1 GPU, the PyTorch version is 1.10. When I try to run with 2 GPUs, training gets stuck. Then I switched to PyTorch 1.8 and it works, but it is 2x slower. I am using an A100, which is about 2x faster than a 3090. I still get stuck with 2 GPUs. It seems this is solved in OpenPCDet, but sadly that doesn't work for me.
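
For anyone hitting a similar hang, a generic debugging sketch (standard PyTorch/NCCL environment variables, not a fix confirmed in this issue; the launch command uses the same placeholder config as above):

```bash
# Print NCCL initialization details to see where the 2-GPU run stalls.
export NCCL_DEBUG=INFO
# On some machines, disabling peer-to-peer transfers avoids hangs at startup.
export NCCL_P2P_DISABLE=1
bash scripts/dist_train.sh 2 --cfg_file cfgs/kitti_models/<model>.yaml
```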

@LiewFeng
Author

The problem of getting stuck is fixed here, and it works for me.
