Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

acc: Always set active device before accessing a GPU #684

Merged
merged 1 commit into from
Jul 4, 2023

Conversation

oschuett
Copy link
Member

@oschuett oschuett commented Jun 29, 2023

I've had for a while the problem that some CP2K regtests would fail with cudaErrorInvalidResourceHandle when I ran on two GPUs. I believe this is because the active device gets changed between calls to DBCSR.

Presumably, most users are not affect by this because Slurm et al. set CUDA_VISIBLE_DEVICES to pin GPUs to CPU sockets.

@dev-zero
Copy link
Contributor

Presumably, most users are not affect by this because Slurm et al. set CUDA_VISIBLE_DEVICES to pin GPUs to CPU sockets.

You would certainly wish for that, but no. The default slurm setting with gres is rather primitive and the more advanced one via nvml, rsmi or openapi which would do this affinity pinning is only available if slurm is built against the respective gpu runtime.

@hfp
Copy link
Member

hfp commented Jun 30, 2023

I think this does not happen anymore. Rank-device combination was stable in recent versions. However and as Ole said, multi-GPU did not even work in the past (let alone d2d comm), and I used this script in the past to prefix the CP2K-executable (it uses environment variables exposed by OpenMPI or MPICH/2 to round-robin assign devices). On the other hand, a change in DBCSR can be useful to affinitize devices if the job manager lacks configuration (or is not used). It can be intriguing without tools to find which socket a rank belong to, and which device belongs to a socket, etc.

@oschuett
Copy link
Member Author

The default slurm setting with gres is rather primitive...

Interesting! Then it's indeed strange that not more users have run into this.

What do you think about this PR? Instead of setting the active device before every call to acc_interface_* we could just do it once at the beginning of dbcsr_multiply.

@alazzaro
Copy link
Member

@oschuett I will take a look at the PR next week and hopefully close everything to make a new DBCSR release...

@alazzaro
Copy link
Member

alazzaro commented Jul 4, 2023

OK, I think this is a reasonable change and it is going in the direction of the issue #205

@alazzaro alazzaro merged commit 397bf0f into cp2k:develop Jul 4, 2023
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants