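I only set A100:2 in Slurm and did not change the functions, so the script still starts 4 workers: GPUs 2 and 3 fail with "invalid device ordinal", while GPUs 0 and 1 complete.
Waiting for the job with A100:4 to start.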
  1) XALT/minimal   2) slurm   3) NeSI
Starting A100 GPU test on 4 GPUs...
Process Process-3:
Process Process-4:
Traceback (most recent call last):
  File "/opt/nesi/CS400_centos7_bdw/Python/3.10.5-gimkl-2022a/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/nesi/CS400_centos7_bdw/Python/3.10.5-gimkl-2022a/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/scale_wlg_nobackup/filesets/nobackup/nesi99999/MattB/gpu-tests/torch-largetensor-matrix/multiple-gpu.py", line 49, in gpu_worker
    large_tensor = create_large_tensor(gpu_memory_usage_gb, device)
  File "/scale_wlg_nobackup/filesets/nobackup/nesi99999/MattB/gpu-tests/torch-largetensor-matrix/multiple-gpu.py", line 11, in create_large_tensor
    return torch.rand(num_elements, device=device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Starting work on GPU 2
Traceback (most recent call last):
  File "/opt/nesi/CS400_centos7_bdw/Python/3.10.5-gimkl-2022a/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/nesi/CS400_centos7_bdw/Python/3.10.5-gimkl-2022a/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/scale_wlg_nobackup/filesets/nobackup/nesi99999/MattB/gpu-tests/torch-largetensor-matrix/multiple-gpu.py", line 49, in gpu_worker
    large_tensor = create_large_tensor(gpu_memory_usage_gb, device)
  File "/scale_wlg_nobackup/filesets/nobackup/nesi99999/MattB/gpu-tests/torch-largetensor-matrix/multiple-gpu.py", line 11, in create_large_tensor
    return torch.rand(num_elements, device=device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Starting work on GPU 3
Starting work on GPU 1
GPU 1 completed 5 iterations in 304.03 seconds
Starting work on GPU 0
GPU 0 completed 5 iterations in 304.64 seconds
All GPU tests completed
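One way to avoid the "invalid device ordinal" failures when the Slurm request is smaller than the hard-coded GPU count is to cap the number of workers at what the job was actually allocated. The sketch below is not the code from multiple-gpu.py; the helper names `visible_gpu_count` and `usable_gpu_ids` are hypothetical, and it assumes Slurm exports `CUDA_VISIBLE_DEVICES` for the allocation, which is common but depends on the site configuration. Inside the real script, `torch.cuda.device_count()` would probably be the more direct check.

```python
import os


def visible_gpu_count(environ=None):
    """Count the entries in CUDA_VISIBLE_DEVICES; return None if it is unset.

    Assumption: the scheduler sets this variable for GPU jobs. An empty
    string means no devices are visible.
    """
    environ = os.environ if environ is None else environ
    value = environ.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None  # not set: the environment alone cannot tell us
    return len([entry for entry in value.split(",") if entry.strip()])


def usable_gpu_ids(requested, environ=None):
    """Return device ordinals to use, capped at the number of visible GPUs."""
    visible = visible_gpu_count(environ)
    count = requested if visible is None else min(requested, visible)
    return list(range(count))


if __name__ == "__main__":
    # Hypothetical usage: the script wants 4 GPUs, the job may have fewer.
    ids = usable_gpu_ids(requested=4)
    print(f"Starting workers on GPU ordinals: {ids}")
```

With an A100:2 allocation this would start workers only for ordinals 0 and 1, rather than spawning processes for GPUs 2 and 3 that then fail.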