Slow memory management on Nvidia GPUs #841

Open · fstein93 opened this issue Sep 10, 2024 · 7 comments

Comments

@fstein93
Contributor

If a DBCSR-heavy calculation in CP2K (LS_SCF) is profiled on NVIDIA GPUs, it turns out that DBCSR spends a lot of (in fact, most of) its time allocating/freeing memory on the GPU (tested on H100). PM for additional data. Potentially, this may also be the case on AMD hardware.

@alazzaro
Member

alazzaro commented Sep 10, 2024

We do test LS, namely H2O-DFT-LS. I don't see any connection with the GPU type; the data movement is GPU-agnostic.
Could you post the DBCSR statistics and the CP2K timers?

Specifically, for GPU data allocation we use memory pools, so I would not expect any big impact from that. I assume these are allocations of the indices, which are asynchronous, so the effect should be minimal.

@hfp
Member

hfp commented Sep 10, 2024

I have recently run tests on a GH200 system with the OpenCL backend. The OpenCL backend supports having profiling results appear in DBCSR's/CP2K's regular profile (printed at the end of execution). The allocations were visible for both host- and GPU-backed memory, though this can also depend on the node's configuration, e.g. the amount of page-lockable memory. Still, the time spent was relatively negligible compared to the total time to solution (wall time).

@hfp
Member

hfp commented Sep 10, 2024

These are the prototypes that allow calling CP2K/DBCSR's timer facility, for instance from the cuda_hip sources:
https://github.com/cp2k/dbcsr/blob/develop/src/acc/acc.h#L67-L68
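
For illustration, here is a minimal sketch of how such hooks could wrap a device allocation in the cuda_hip backend so that it shows up in the regular profile. The prototypes and the region name below are assumptions based on the lines linked above; check acc.h for the exact signatures.

#include <cuda_runtime.h>
#include <string.h>

/* Assumed prototypes, paraphrased from the timer facility linked above
 * (src/acc/acc.h); the actual names/signatures may differ. */
void c_dbcsr_timeset(const char** routine_name, const int* name_len, int* handle);
void c_dbcsr_timestop(const int* handle);

/* Wrap a device allocation in a timer region so that it appears in the
 * CP2K/DBCSR profile printed at the end of the run. */
static void* timed_dev_alloc(size_t nbytes) {
  static const char* routine = "acc_dev_alloc"; /* hypothetical region name */
  const int len = (int)strlen(routine);
  int handle = 0;
  void* ptr = NULL;
  c_dbcsr_timeset(&routine, &len, &handle); /* start timer region */
  cudaMalloc(&ptr, nbytes);                 /* the allocation being measured */
  c_dbcsr_timestop(&handle);                /* stop timer region */
  return ptr;
}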

@fstein93
Contributor, Author

@alazzaro I have sent you an email.

@alazzaro
Member

alazzaro commented Sep 10, 2024

> @alazzaro I have sent you an email.

OK, so I've checked the slides, and my understanding is that the problem appears in the first multiplications, which is expected.
We use memory pools with a resize factor of 1 (if I recall correctly, we also use memory pools on the CPU, with a resize factor of 1.2). The main function that sets up the memory (pools and allocations) is

SUBROUTINE dbcsr_memtype_setup(memtype, acc_hostalloc, acc_devalloc, mpi, &

Then there is a function to ensure that the size of the buffers is sufficient.

The place where this function is called for the C matrix is:

CALL dbcsr_data_ensure_size(product_matrix%wms(i)%data_area, &

where we also try to make an educated guess of the final size (per thread).

Now, the occupancies of the matrices increase with the multiplications, up to a given plateau. So in the first multiplications there is a reallocation of the memory, but afterwards we reuse the memory pool and do not reallocate. The benchmark itself can therefore have a bit of overhead, but in real production runs (with many more multiplications) the effect is negligible.
So the question is: do we see the memory allocations for all multiplications, i.e. is the memory pool never used?

I can imagine making the resize_factor an external parameter so that we can avoid reallocations (at the cost of a larger memory footprint).
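
To illustrate what an externally tunable resize factor would do (a toy sketch, not DBCSR's actual pool code; all names below are made up): with a factor of 1.0 the buffer is reallocated whenever a request exceeds the current capacity, while a factor greater than 1.0 over-allocates so that subsequent, slightly larger requests can reuse the buffer.

#include <cuda_runtime.h>
#include <stddef.h>

/* Toy device-memory pool with a configurable resize factor; illustrative
 * only, not DBCSR's implementation. */
typedef struct {
  void* ptr;
  size_t capacity;
  double grow_factor; /* e.g. 1.0 (tight fit) or 1.2 (20% headroom) */
} dev_pool_t;

static void* pool_ensure_size(dev_pool_t* pool, size_t nbytes) {
  if (nbytes > pool->capacity) {                /* reallocate only when too small */
    const size_t newcap = (size_t)(nbytes * pool->grow_factor);
    if (pool->ptr) cudaFree(pool->ptr);         /* release the old, smaller buffer */
    cudaMalloc(&pool->ptr, newcap);             /* allocate with optional headroom */
    pool->capacity = newcap;
  }
  return pool->ptr;                             /* otherwise reuse the existing buffer */
}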

cudaMallocAsync will require some refactoring, but I don't think it is worth the pain.
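
For reference, a minimal sketch of what stream-ordered allocation with cudaMallocAsync/cudaFreeAsync (CUDA 11.2+) looks like; the refactoring mentioned above comes from having to pass the relevant stream to every allocation and free:

#include <cuda_runtime.h>

/* Sketch of stream-ordered allocation; not DBCSR code. Every allocation
 * and free is ordered within a stream, so the allocator would need access
 * to the stream used by the surrounding DBCSR operations. */
void stream_ordered_example(cudaStream_t stream, size_t nbytes) {
  void* buf = NULL;
  cudaMallocAsync(&buf, nbytes, stream); /* allocation enqueued on 'stream' */
  /* ... launch kernels that use 'buf' on the same stream ... */
  cudaFreeAsync(buf, stream);            /* free ordered after that work */
  cudaStreamSynchronize(stream);         /* for the sketch: wait for completion */
}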

@hfp
Member

hfp commented Sep 13, 2024

@fstein93 was the issue discovered on a GH200 system like Alps?

@fstein93
Contributor, Author

It was 8xH100 with 2 ranks per GPU. I did not run the tests.
