
Unexpectedly poor performance of Dace generated cuda code through DaceML framework #1663

Open
vinaysaxena93 opened this issue Sep 20, 2024 · 0 comments

Comments

@vinaysaxena93

Configuration:
OS: Linux (Ubuntu 20.04)
CPU: Intel Core i5 12600K, 64 GB RAM
GPU: Nvidia RTX A2000 12 GB
Dace version: 0.13.3
DaceML version: 0.2
Onnx Runtime: 1.7.0
CUDA: 11.4, CuDNN: 8.4
Python: 3.8.1, Pytorch: 1.8.1

I am trying to reproduce the results obtained in this 2021 paper from SPCL: [https://arxiv.org/abs/2110.10802]
I built Dace and ONNX Runtime (the modified branch for Dace, as instructed) with CUDA enabled, and then DaceML. I installed the same PyTorch and CUDA versions that were used in the paper.

My main expectation is to see some speedup in the forward and backward passes of Dace-compiled PyTorch neural network layers compared to the native PyTorch implementation (as demonstrated in the paper's results).
But when I run the test scripts from the tests/torch directory of the official DaceML repo and time them with CUDA events, I get a large (~20-30x) slowdown. This is consistent across both test_bert_encoder and test_efficientnet_block.
Here is an extract of the test code from the BERT layer benchmark with the corresponding output: [https://gist.github.com/vinaysaxena93/78e07fd687eace24b43831989bb3b283]

As you can see, the DaceModule takes far longer in the forward pass (roughly 20-30x, as noted above). The only modification I made to the code is adding CUDA events for timing. The result is consistent across all tests and in my own PyTorch scripts as well.
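For reference, here is a minimal sketch of a timing harness that reports warm-up calls separately from steady-state calls. It is not the code from the gist. It assumes that the first call to a compiled module may include one-time costs (such as code generation or compilation) that should not be averaged into the per-iteration time. It uses a host-side timer with an explicit synchronization hook instead of CUDA events so that it also runs without a GPU. The names `benchmark`, `fn`, and `sync` are hypothetical helpers introduced for illustration.

```python
import statistics
import time
from typing import Callable, Optional


def benchmark(fn: Callable[[], object],
              warmup: int = 3,
              repeats: int = 10,
              sync: Optional[Callable[[], None]] = None) -> dict:
    """Time fn(), reporting warm-up calls separately from steady-state calls."""

    def timed_call() -> float:
        if sync is not None:
            sync()  # drain any previously queued asynchronous work
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the work launched by fn() to finish
        return (time.perf_counter() - start) * 1e3  # milliseconds

    # Warm-up calls absorb one-time costs (assumed: compilation, caching).
    warmup_ms = [timed_call() for _ in range(warmup)]
    # Steady-state calls are the ones worth comparing across frameworks.
    steady_ms = [timed_call() for _ in range(repeats)]

    return {
        "first_call_ms": warmup_ms[0] if warmup_ms else None,
        "warmup_ms": warmup_ms,
        "steady_ms": steady_ms,
        "steady_median_ms": statistics.median(steady_ms),
    }
```

With PyTorch on a GPU, this could be called as `benchmark(lambda: dace_model(x), sync=torch.cuda.synchronize)`, where `dace_model` and `x` stand in for the wrapped module and its input. If `first_call_ms` is much larger than `steady_median_ms`, the slowdown is dominated by one-time setup rather than by the generated kernels.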

Could you please tell me if there is anything I'm missing that could cause such slowdowns? I have not created a custom .dace.conf file. If that is the reason, could you please share some hints on how to improve performance, or the configuration that was used for the evaluation presented in the paper? Any pointers would be very helpful.
Please let me know if you need any more information. If required, I can also create a Dockerfile to reproduce my benchmark environment.
