multigpu returns warnings #4

MattBixley · 2024-09-17T07:05:48Z

I only set A100:2 in Slurm and did not change the functions, so the script still tried to start 4 GPU workers.
A run with A100:4 is waiting to start.

  1) XALT/minimal   2) slurm   3) NeSI
Starting A100 GPU test on 4 GPUs...
Process Process-3:
Process Process-4:
Traceback (most recent call last):
  File "/opt/nesi/CS400_centos7_bdw/Python/3.10.5-gimkl-2022a/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/nesi/CS400_centos7_bdw/Python/3.10.5-gimkl-2022a/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/scale_wlg_nobackup/filesets/nobackup/nesi99999/MattB/gpu-tests/torch-largetensor-matrix/multiple-gpu.py", line 49, in gpu_worker
    large_tensor = create_large_tensor(gpu_memory_usage_gb, device)
  File "/scale_wlg_nobackup/filesets/nobackup/nesi99999/MattB/gpu-tests/torch-largetensor-matrix/multiple-gpu.py", line 11, in create_large_tensor
    return torch.rand(num_elements, device=device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Starting work on GPU 2
Traceback (most recent call last):
  File "/opt/nesi/CS400_centos7_bdw/Python/3.10.5-gimkl-2022a/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/nesi/CS400_centos7_bdw/Python/3.10.5-gimkl-2022a/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/scale_wlg_nobackup/filesets/nobackup/nesi99999/MattB/gpu-tests/torch-largetensor-matrix/multiple-gpu.py", line 49, in gpu_worker
    large_tensor = create_large_tensor(gpu_memory_usage_gb, device)
  File "/scale_wlg_nobackup/filesets/nobackup/nesi99999/MattB/gpu-tests/torch-largetensor-matrix/multiple-gpu.py", line 11, in create_large_tensor
    return torch.rand(num_elements, device=device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Starting work on GPU 3
Starting work on GPU 1
GPU 1 completed 5 iterations in 304.03 seconds
Starting work on GPU 0
GPU 0 completed 5 iterations in 304.64 seconds
All GPU tests completed
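
The two `invalid device ordinal` errors for GPUs 2 and 3 are consistent with four workers being started while only two devices were allocated to the job. One possible way to avoid this, sketched below, is to size the worker list from the devices the process can actually see instead of a fixed count. The helper names `visible_gpu_count` and `usable_gpu_ids` are hypothetical and are not part of `multiple-gpu.py`; the only library call assumed is `torch.cuda.device_count()`.

```python
def visible_gpu_count():
    """Return how many CUDA devices this process can see (0 if PyTorch is missing)."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


def usable_gpu_ids(requested, visible):
    """Clamp the requested worker count to the number of visible devices.

    Hypothetical helper: returns the device ordinals that are safe to use.
    """
    if requested < 0 or visible < 0:
        raise ValueError("counts must be non-negative")
    return list(range(min(requested, visible)))


if __name__ == "__main__":
    # Ask for 4 workers, but only launch as many as the job was allocated.
    ids = usable_gpu_ids(requested=4, visible=visible_gpu_count())
    print(f"Launching workers on GPUs: {ids}")
```

With two GPUs allocated, `usable_gpu_ids(4, 2)` yields `[0, 1]`, so no worker would be started for ordinals 2 and 3.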


DininduSenanayake self-assigned this Sep 17, 2024
