Slow memory management on Nvidia GPUs #841

Open · fstein93 opened this issue Sep 10, 2024 · 7 comments

Comments

@fstein93
Contributor

If a DBCSR-heavy calculation in CP2K (LS_SCF) is profiled on NVIDIA GPUs, it turns out that DBCSR spends a lot of (in fact, most of) its time allocating/freeing memory on the GPU (tested on H100). PM for additional data. Potentially, this may also be the case on AMD hardware.

@alazzaro
Member

alazzaro commented Sep 10, 2024

We do test LS, namely H2O-DFT-LS. I don't see any connection with the GPU type; the data movement is GPU-agnostic.
Could you post the DBCSR statistics and the CP2K timers?

Specifically, for GPU data allocation we use memory pools, so I would not expect any big impact from that. I assume these are allocations of the indices, which are asynchronous, so the effect should be minimal.

@hfp
Member

hfp commented Sep 10, 2024

I have recently run tests on a GH200 system with the OpenCL backend. The OpenCL backend supports having profiling results appear in DBCSR's/CP2K's regular profile (printed at the end of execution). The allocations were visible for both host- and GPU-backed memory, though this can also depend on the node's configuration, e.g. the amount of page-lockable memory. Still, the time spent was relatively negligible compared to the total time to solution (wall time).

@hfp
Member

hfp commented Sep 10, 2024

These are the prototypes that allow calling CP2K/DBCSR's timer facility, for instance from the cuda_hip sources:
https://github.com/cp2k/dbcsr/blob/develop/src/acc/acc.h#L67-L68
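
For illustration, here is a minimal sketch of how such hooks could wrap a device allocation in the cuda_hip backend so that it shows up in the regular profile. The prototypes and the region name below are assumptions based on the lines linked above; check acc.h for the exact signatures.

#include <cuda_runtime.h>
#include <string.h>

/* Assumed prototypes, paraphrased from the timer facility linked above
 * (src/acc/acc.h); the actual names/signatures may differ. */
void c_dbcsr_timeset(const char** routine_name, const int* name_len, int* handle);
void c_dbcsr_timestop(const int* handle);

/* Wrap a device allocation in a timer region so that it appears in the
 * CP2K/DBCSR profile printed at the end of the run. */
static void* timed_dev_alloc(size_t nbytes) {
  static const char* routine = "acc_dev_alloc"; /* hypothetical region name */
  const int len = (int)strlen(routine);
  int handle = 0;
  void* ptr = NULL;
  c_dbcsr_timeset(&routine, &len, &handle); /* start timer region */
  cudaMalloc(&ptr, nbytes);                 /* the allocation being measured */
  c_dbcsr_timestop(&handle);                /* stop timer region */
  return ptr;
}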

@fstein93
Contributor, Author

@alazzaro I have sent you an email.

@alazzaro
Member

alazzaro commented Sep 10, 2024

> @alazzaro I have sent you an email.

OK, so I've checked the slides, and my understanding is that the problem appears in the first multiplications, which is expected.
We use memory pools with a resize factor of 1 (if I recall correctly, we also use memory pools on the CPU, with a resize factor of 1.2). The main function that sets up the memory (pools and allocations) is

SUBROUTINE dbcsr_memtype_setup(memtype, acc_hostalloc, acc_devalloc, mpi, &

Then there is a function to ensure that the size of the buffers is sufficient.

The place where this function is called for the C matrix is:

CALL dbcsr_data_ensure_size(product_matrix%wms(i)%data_area, &

where we also try to make an educated guess of the final size (per thread).

Now, the occupancies of the matrices increase with the multiplications, up to a given plateau. So in the first multiplications there is a reallocation of the memory, but afterwards we reuse the memory pool and do not reallocate. The benchmark itself can therefore have a bit of overhead, but in real production runs (with many more multiplications) the effect is negligible.
So the question is: do we see the memory allocations for all multiplications, i.e. is the memory pool never used?

I can imagine making the resize_factor an external parameter so that we can avoid reallocations (at the cost of a larger memory footprint).
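
To illustrate what an externally tunable resize factor would do (a toy sketch, not DBCSR's actual pool code; all names below are made up): with a factor of 1.0 the buffer is reallocated whenever a request exceeds the current capacity, while a factor greater than 1.0 over-allocates so that subsequent, slightly larger requests can reuse the buffer.

#include <cuda_runtime.h>
#include <stddef.h>

/* Toy device-memory pool with a configurable resize factor; illustrative
 * only, not DBCSR's implementation. */
typedef struct {
  void* ptr;
  size_t capacity;
  double grow_factor; /* e.g. 1.0 (tight fit) or 1.2 (20% headroom) */
} dev_pool_t;

static void* pool_ensure_size(dev_pool_t* pool, size_t nbytes) {
  if (nbytes > pool->capacity) {                /* reallocate only when too small */
    const size_t newcap = (size_t)(nbytes * pool->grow_factor);
    if (pool->ptr) cudaFree(pool->ptr);         /* release the old, smaller buffer */
    cudaMalloc(&pool->ptr, newcap);             /* allocate with optional headroom */
    pool->capacity = newcap;
  }
  return pool->ptr;                             /* otherwise reuse the existing buffer */
}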

cudaMallocAsync will require some refactoring, but I don't think it is worth the pain.
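
For reference, a minimal sketch of what stream-ordered allocation with cudaMallocAsync/cudaFreeAsync (CUDA 11.2+) looks like; the refactoring mentioned above comes from having to pass the relevant stream to every allocation and free:

#include <cuda_runtime.h>

/* Sketch of stream-ordered allocation; not DBCSR code. Every allocation
 * and free is ordered within a stream, so the allocator would need access
 * to the stream used by the surrounding DBCSR operations. */
void stream_ordered_example(cudaStream_t stream, size_t nbytes) {
  void* buf = NULL;
  cudaMallocAsync(&buf, nbytes, stream); /* allocation enqueued on 'stream' */
  /* ... launch kernels that use 'buf' on the same stream ... */
  cudaFreeAsync(buf, stream);            /* free ordered after that work */
  cudaStreamSynchronize(stream);         /* for the sketch: wait for completion */
}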

@hfp
Member

hfp commented Sep 13, 2024

@fstein93 was the issue discovered on a GH200 system like Alps?

@fstein93
Contributor, Author

It was 8xH100 with 2 ranks per GPU. I did not run the tests.
