multigpu returns warnings #4

Open
MattBixley opened this issue Sep 17, 2024 · 0 comments
@MattBixley

I only set A100:2 in Slurm and did not change the functions.
Still waiting for a job with A100:4 to start.

1) XALT/minimal 2) slurm 3) NeSI
Starting A100 GPU test on 4 GPUs...
Process Process-3:
Process Process-4:
Traceback (most recent call last):
  File "/opt/nesi/CS400_centos7_bdw/Python/3.10.5-gimkl-2022a/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/nesi/CS400_centos7_bdw/Python/3.10.5-gimkl-2022a/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/scale_wlg_nobackup/filesets/nobackup/nesi99999/MattB/gpu-tests/torch-largetensor-matrix/multiple-gpu.py", line 49, in gpu_worker
    large_tensor = create_large_tensor(gpu_memory_usage_gb, device)
  File "/scale_wlg_nobackup/filesets/nobackup/nesi99999/MattB/gpu-tests/torch-largetensor-matrix/multiple-gpu.py", line 11, in create_large_tensor
    return torch.rand(num_elements, device=device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Starting work on GPU 2
Traceback (most recent call last):
  File "/opt/nesi/CS400_centos7_bdw/Python/3.10.5-gimkl-2022a/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/nesi/CS400_centos7_bdw/Python/3.10.5-gimkl-2022a/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/scale_wlg_nobackup/filesets/nobackup/nesi99999/MattB/gpu-tests/torch-largetensor-matrix/multiple-gpu.py", line 49, in gpu_worker
    large_tensor = create_large_tensor(gpu_memory_usage_gb, device)
  File "/scale_wlg_nobackup/filesets/nobackup/nesi99999/MattB/gpu-tests/torch-largetensor-matrix/multiple-gpu.py", line 11, in create_large_tensor
    return torch.rand(num_elements, device=device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Starting work on GPU 3
Starting work on GPU 1
GPU 1 completed 5 iterations in 304.03 seconds
Starting work on GPU 0
GPU 0 completed 5 iterations in 304.64 seconds
All GPU tests completed
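
One plausible reading of the log: the script starts 4 workers, but with A100:2 only devices 0 and 1 are visible, so the workers for GPU 2 and GPU 3 fail with "invalid device ordinal" while GPU 0 and GPU 1 finish. The source of multiple-gpu.py is not shown here, so the following is only a sketch of capping the worker count at the number of visible devices. The helper names `count_visible_gpus` and `usable_gpu_ids` are hypothetical and not taken from the script, and the `CUDA_VISIBLE_DEVICES` fallback assumes Slurm exports that variable for the allocation.

```python
import os


def count_visible_gpus():
    """Return how many GPUs this process can see.

    Uses torch.cuda.device_count() when PyTorch is installed. Otherwise
    falls back to counting entries in CUDA_VISIBLE_DEVICES (assumption:
    the scheduler sets this variable for the job).
    """
    try:
        import torch  # imported lazily so the helper also works without PyTorch
        return torch.cuda.device_count()
    except ImportError:
        visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
        return len([d for d in visible.split(",") if d.strip()])


def usable_gpu_ids(requested, available):
    """Return the device ordinals that are safe to start workers on.

    `requested` is how many GPUs the script wants to use (hypothetical
    parameter); `available` is how many the job can actually see.
    """
    if requested > available:
        print(
            f"Requested {requested} GPUs but only {available} are visible; "
            f"starting {available} worker(s) instead."
        )
    return list(range(min(requested, available)))
```

With a 2-GPU allocation, `usable_gpu_ids(4, count_visible_gpus())` would give `[0, 1]`, so no worker is started for ordinals 2 and 3.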

@DininduSenanayake DininduSenanayake self-assigned this Sep 17, 2024