
Abnormal Performance in Multi-GPU Testing caused by All2All #199

Open
chen-yy20 opened this issue Sep 9, 2024 · 4 comments

@chen-yy20

Hi, thanks for your great work!

I ran the DSP and PAB examples from examples/latte on A800 GPUs. The results I obtained are as follows:

[screenshot: timing results]

Compared to single-GPU performance, the multi-GPU setup did not achieve the expected results.

Furthermore, I examined the trace profile of the inference process and found that the all-to-all communication in Dynamic Switch consumed more time than computation:

Communication overhead using 4x A800s ⬇️
[trace screenshot]

Communication overhead using 8x A800s ⬇️ It's completely communication-bound!
[trace screenshot]

Could you please explain these results, particularly the poor multi-GPU scaling and the disproportionate time spent on communication versus computation?
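To help isolate whether this is an interconnect problem or a framework problem, here is a minimal standalone all_to_all microbenchmark I put together (not from the VideoSys code; the payload size is an assumption meant to roughly match the activation traffic in the trace):

```python
# Standalone all_to_all microbenchmark (hypothetical payload size).
# Launch with: torchrun --nproc_per_node=4 bench_all2all.py
import os
import time

import torch
import torch.distributed as dist


def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # ~128 MB of fp16 per rank; adjust to match the tensor sizes
    # observed in the Dynamic Switch trace.
    numel = 64 * 1024 * 1024
    x = torch.randn(numel, dtype=torch.float16, device="cuda")
    y = torch.empty_like(x)

    for _ in range(5):  # warmup
        dist.all_to_all_single(y, x)
    torch.cuda.synchronize()

    iters = 20
    t0 = time.perf_counter()
    for _ in range(iters):
        dist.all_to_all_single(y, x)
    torch.cuda.synchronize()
    dt = (time.perf_counter() - t0) / iters

    if dist.get_rank() == 0:
        gb = x.element_size() * numel / 1e9
        # Approximate: each rank keeps 1/world_size of its shard locally.
        print(f"all_to_all_single: {dt * 1e3:.2f} ms/iter, "
              f"~{gb / dt:.1f} GB/s per rank")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Comparing this number between A800 and H100 nodes should show whether the gap is explained by raw interconnect bandwidth alone.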

Additional Consideration:
Given that a single GPU can complete the inference process for a single Latte video sample, it's understandable that the entire inference process becomes communication-bound in a multi-GPU setup. I suspect that increasing the computational load might be necessary to achieve the expected performance on multiple GPUs. However, Latte doesn't support modifying video duration or resolution.
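As a back-of-envelope check (every number below is an assumption, not a measurement from the repo), comparing per-layer attention FLOPs against the bytes one sequence-parallel all_to_all must move suggests the two are the same order of magnitude at Latte-scale sequence lengths, which would match the traces above:

```python
# Back-of-envelope estimate with assumed shapes (not from the repo).
seq_len = 16 * 32 * 32   # frames x latent H x latent W (assumed Latte-like)
hidden  = 1152           # assumed model width
world   = 8              # GPUs
bytes_per_elem = 2       # fp16

# Attention FLOPs per layer (QK^T + AV), split across GPUs.
flops = 4 * seq_len**2 * hidden / world
# Each sequence-parallel switch moves roughly the full activation once.
comm_bytes = seq_len * hidden * bytes_per_elem

tflops = 150e12          # assumed sustained A800 fp16 throughput
bw     = 50e9            # assumed effective all2all bandwidth, bytes/s

print(f"compute per layer: {flops / tflops * 1e3:.2f} ms")
print(f"all2all per switch: {comm_bytes / bw * 1e3:.2f} ms")
```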

Questions:

1. Do you have any suggestions for modifying the computational load in your code?
2. Are there any other parameters or settings we can adjust to better balance computation and communication in multi-GPU scenarios?
3. Is this behavior expected for small video samples? If so, what would be the recommended minimum sample size or computational load to effectively utilize multiple GPUs?

Any insights or guidance on optimizing performance for multi-GPU setups with Latte would be greatly appreciated.

@gttiankai
Contributor

I ran into the same problem testing the THUDM/CogVideoX-5b model; my tests were on A100 GPUs.
[screenshot]

@oahzxl
Collaborator

oahzxl commented Sep 12, 2024

Thanks for your feedback. The problem is caused by cp parallelism; it is now disabled in #205!

For small videos, the speedup will be lower. For longer videos (as in the PAB paper), the speedup will be near-linear. I have tested open_sora (run_base) on A100, and the speedup is 3.5x for 4 GPUs.

As for the all_to_all problem you mentioned: when I tested on H100 previously, the total time spent in all2all was about 5-8%, even with 8 GPUs. If you still see this problem, I can test on A800 later.
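A rough way to measure this fraction yourself (a sketch; `run_pipeline` is a placeholder for your actual inference call, and NCCL kernel names vary across versions, so the string match may need adjusting):

```python
# Sketch: measure the share of GPU time spent in all2all for one run.
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    run_pipeline()  # placeholder: your VideoSys/Latte inference call

events = prof.key_averages()
total = sum(e.self_cuda_time_total for e in events)
# all_to_all usually surfaces as NCCL send/recv kernels; adjust the
# match for your NCCL version if needed.
a2a = sum(e.self_cuda_time_total for e in events
          if "all_to_all" in e.key.lower() or "sendrecv" in e.key.lower())
print(f"all2all share of GPU time: {a2a / max(total, 1):.1%}")
```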

@oahzxl
Collaborator

oahzxl commented Sep 25, 2024

Closed due to inactivity.

oahzxl closed this as completed Sep 25, 2024
oahzxl reopened this Oct 6, 2024
@oahzxl
Collaborator

oahzxl commented Oct 6, 2024

We have found that all2all is much slower on H800 and A800 due to their lower NVLink bandwidth. We may optimize this in the future.
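For context, using published nominal NVLink aggregates (the payload below is an assumed figure, and real all2all efficiency is lower than peak), a bandwidth-bound all_to_all can be expected to run roughly 1.5x slower than on A100 and over 2x slower than on H100:

```python
# Nominal per-GPU NVLink aggregate bandwidth (published spec figures);
# treat the resulting times as lower bounds per all_to_all.
nvlink_gbps = {"H100": 900, "A100": 600, "A800/H800": 400}
payload_gb = 0.25  # assumed per-rank all2all traffic per switch
for gpu, bw in nvlink_gbps.items():
    print(f"{gpu}: >= {payload_gb / bw * 1e3:.2f} ms per all_to_all")
```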
