
Abnormal Performance in Multi-GPU Testing caused by All2All #199

Open
chen-yy20 opened this issue Sep 9, 2024 · 4 comments

@chen-yy20

Hi, thanks for your great work!

I ran the DSP and PAB examples from examples/latte on A800 GPUs. The results I obtained are as follows:

[screenshot: timing results]

Compared to single-GPU performance, the multi-GPU setup did not achieve the expected results.

Furthermore, I examined the trace profile of the inference process and found that the all-to-all communication in Dynamic Switch consumed more time than computation:

Communication overhead using 4x A800s ⬇️
[trace screenshot]

Communication overhead using 8x A800s ⬇️ It's completely communication-bound!
[trace screenshot]

Could you please explain these results, particularly the poor multi-GPU scaling and the disproportionate time spent on communication versus computation?
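To help isolate whether this is an interconnect problem or a framework problem, here is a minimal standalone all_to_all microbenchmark I put together (not from the VideoSys code; the payload size is an assumption meant to roughly match the activation traffic in the trace):

```python
# Standalone all_to_all microbenchmark (hypothetical payload size).
# Launch with: torchrun --nproc_per_node=4 bench_all2all.py
import os
import time

import torch
import torch.distributed as dist


def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # ~128 MB of fp16 per rank; adjust to match the tensor sizes
    # observed in the Dynamic Switch trace.
    numel = 64 * 1024 * 1024
    x = torch.randn(numel, dtype=torch.float16, device="cuda")
    y = torch.empty_like(x)

    for _ in range(5):  # warmup
        dist.all_to_all_single(y, x)
    torch.cuda.synchronize()

    iters = 20
    t0 = time.perf_counter()
    for _ in range(iters):
        dist.all_to_all_single(y, x)
    torch.cuda.synchronize()
    dt = (time.perf_counter() - t0) / iters

    if dist.get_rank() == 0:
        gb = x.element_size() * numel / 1e9
        # Approximate: each rank keeps 1/world_size of its shard locally.
        print(f"all_to_all_single: {dt * 1e3:.2f} ms/iter, "
              f"~{gb / dt:.1f} GB/s per rank")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Comparing this number between A800 and H100 nodes should show whether the gap is explained by raw interconnect bandwidth alone.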

Additional Consideration:
Given that a single GPU can complete the inference process for a single Latte video sample, it's understandable that the entire inference process becomes communication-bound in a multi-GPU setup. I suspect that increasing the computational load might be necessary to achieve the expected performance on multiple GPUs. However, Latte doesn't support modifying video duration or resolution.
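As a back-of-envelope check (every number below is an assumption, not a measurement from the repo), comparing per-layer attention FLOPs against the bytes one sequence-parallel all_to_all must move suggests the two are the same order of magnitude at Latte-scale sequence lengths, which would match the traces above:

```python
# Back-of-envelope estimate with assumed shapes (not from the repo).
seq_len = 16 * 32 * 32   # frames x latent H x latent W (assumed Latte-like)
hidden  = 1152           # assumed model width
world   = 8              # GPUs
bytes_per_elem = 2       # fp16

# Attention FLOPs per layer (QK^T + AV), split across GPUs.
flops = 4 * seq_len**2 * hidden / world
# Each sequence-parallel switch moves roughly the full activation once.
comm_bytes = seq_len * hidden * bytes_per_elem

tflops = 150e12          # assumed sustained A800 fp16 throughput
bw     = 50e9            # assumed effective all2all bandwidth, bytes/s

print(f"compute per layer: {flops / tflops * 1e3:.2f} ms")
print(f"all2all per switch: {comm_bytes / bw * 1e3:.2f} ms")
```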

Questions:

1. Do you have any suggestions for modifying the computational load in your code?
2. Are there any other parameters or settings we can adjust to better balance computation and communication in multi-GPU scenarios?
3. Is this behavior expected for small video samples? If so, what would be the recommended minimum sample size or computational load to effectively utilize multiple GPUs?

Any insights or guidance on optimizing performance for multi-GPU setups with Latte would be greatly appreciated.

@gttiankai
Contributor

I ran into the same problem testing the THUDM/CogVideoX-5b model; my tests were on A100 GPUs.
[screenshot]

@oahzxl
Collaborator

oahzxl commented Sep 12, 2024

Thanks for your feedback. The problem is caused by cp parallelism; it is now disabled in #205!

For small videos, the speedup will be lower. For longer videos (as in the PAB paper), the speedup will be near-linear. I have tested open_sora (run_base) on A100, and the speedup is 3.5x for 4 GPUs.

As for the all_to_all problem you mentioned: when I tested on H100 previously, the total time spent in all2all was about 5-8%, even with 8 GPUs. If you still see this problem, I can test on A800 later.
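A rough way to measure this fraction yourself (a sketch; `run_pipeline` is a placeholder for your actual inference call, and NCCL kernel names vary across versions, so the string match may need adjusting):

```python
# Sketch: measure the share of GPU time spent in all2all for one run.
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    run_pipeline()  # placeholder: your VideoSys/Latte inference call

events = prof.key_averages()
total = sum(e.self_cuda_time_total for e in events)
# all_to_all usually surfaces as NCCL send/recv kernels; adjust the
# match for your NCCL version if needed.
a2a = sum(e.self_cuda_time_total for e in events
          if "all_to_all" in e.key.lower() or "sendrecv" in e.key.lower())
print(f"all2all share of GPU time: {a2a / max(total, 1):.1%}")
```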

@oahzxl
Collaborator

oahzxl commented Sep 25, 2024

Closed due to inactivity.

oahzxl closed this as completed Sep 25, 2024
oahzxl reopened this Oct 6, 2024
@oahzxl
Collaborator

oahzxl commented Oct 6, 2024

We have found that all2all is much slower on H800 and A800 due to their lower NVLink bandwidth. We may optimize this in the future.
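For context, using published nominal NVLink aggregates (the payload below is an assumed figure, and real all2all efficiency is lower than peak), a bandwidth-bound all_to_all can be expected to run roughly 1.5x slower than on A100 and over 2x slower than on H100:

```python
# Nominal per-GPU NVLink aggregate bandwidth (published spec figures);
# treat the resulting times as lower bounds per all_to_all.
nvlink_gbps = {"H100": 900, "A100": 600, "A800/H800": 400}
payload_gb = 0.25  # assumed per-rank all2all traffic per switch
for gpu, bw in nvlink_gbps.items():
    print(f"{gpu}: >= {payload_gb / bw * 1e3:.2f} ms per all_to_all")
```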
