local 8-GPU run hangs in .item after 3 days #10

Open

yaroslavvb opened this issue May 14, 2019 · 2 comments

@yaroslavvb

The long local run is hanging, with all 8 processes showing an identical stack trace and 100% GPU utilization in nvidia-smi:

#8  0x00007f5178571f92 in cuMemcpyDtoHAsync_v2 () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#9  0x00007f517d0984bf in ?? ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcudart-f7fdd8d7.so.9.0
#10 0x00007f517d075573 in ?? ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcudart-f7fdd8d7.so.9.0
#11 0x00007f517d0aed86 in cudaMemcpyAsync ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcudart-f7fdd8d7.so.9.0
#12 0x00007f518ba39566 in at::native::_local_scalar_dense_cuda(at::Tensor const&)::{lambda()#1}::operator()() const ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#13 0x00007f518ba3bbb7 in at::native::_local_scalar_dense_cuda(at::Tensor const&) ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#14 0x00007f518aa70902 in at::CUDAType::_local_scalar_dense(at::Tensor const&) const ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#15 0x00007f517d8e5685 in torch::autograd::VariableType::_local_scalar_dense(at::Tensor const&) const ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so.1
#16 0x00007f517fb0392a in at::native::item(at::Tensor const&) ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2.so
#17 0x00007f517fe0de15 in at::TypeDefault::item(at::Tensor const&) const ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2.so
#18 0x00007f517dadf418 in torch::autograd::VariableType::item(at::Tensor const&) const ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so.1
#19 0x00007f51be448756 in torch::autograd::dispatch_to_CLong(at::Tensor const&) ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#20 0x00007f51be4499f0 in torch::autograd::THPVariable_item(_object*, _object*) ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#21 0x000055dd363e1bda in _PyCFunction_FastCallDict ()
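
A small sketch (not part of this repo) for capturing the Python-level stack alongside the native gdb backtrace above: register a faulthandler signal handler near worker startup, and a later kill -USR1 <pid> makes a hung process dump every thread's Python frames to stderr without stopping it.

import faulthandler
import signal

# Register once near process startup; sending SIGUSR1 to a hung worker then
# dumps all Python thread stacks to stderr without terminating the process.
faulthandler.register(signal.SIGUSR1, all_threads=True)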

There's only one place in training that uses .item():

train_loss += loss.float().item()

Figure out whether that's connected, and maybe change the code to not use item() here.
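
A minimal sketch (assuming a generic training loop, not this repo's actual code) of avoiding the per-step .item(): accumulate the loss as a detached GPU tensor and transfer it to the host only once per logging interval, so the CPU does not block on the GPU every iteration.

import torch

def train_epoch(model, loader, optimizer, criterion, device, log_interval=100):
    # On-device accumulator; updating it never forces a host sync.
    running_loss = torch.zeros(1, device=device)
    for step, (x, y) in enumerate(loader):
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        # detach() keeps the accumulation out of the autograd graph.
        running_loss += loss.detach().float()
        if (step + 1) % log_interval == 0:
            # The only device-to-host copy, and the only point that blocks.
            print(f"step {step + 1}: avg loss {(running_loss / log_interval).item():.4f}")
            running_loss.zero_()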

@yaroslavvb (Author)

This is using PyTorch 1.1 on a DLAMI 22-derived image, which has only been tested with PyTorch 1.0.

@yaroslavvb (Author)

From Natalia Gimelshein [10:41 PM]
.item() is probably a red herring; it likely hangs in the NCCL call happening in backward right before that, and cannot complete the synchronization required for the device-to-host (d2h) transfer in item().
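
A minimal sketch (my own diagnostic, function name and signature illustrative) of one way to test that hypothesis: an explicit torch.cuda.synchronize() right after backward() should move the hang to the synchronize call if the gradient all-reduce launched in backward is what never completes, leaving item() to wait only on its own copy.

import torch

def debug_train_step(model, x, y, criterion, optimizer):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()              # DistributedDataParallel launches the gradient all-reduce here
    torch.cuda.synchronize()     # if the collective is stuck, the hang surfaces here instead
    optimizer.step()
    return loss.float().item()   # item() now only waits on its own device-to-host copy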
