local 8-GPU run hangs in .item after 3 days #10

Open

yaroslavvb opened this issue May 14, 2019 · 2 comments

@yaroslavvb

The long local run is hanging, with all 8 processes showing an identical stack trace and 100% GPU utilization in nvidia-smi:

#8  0x00007f5178571f92 in cuMemcpyDtoHAsync_v2 () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#9  0x00007f517d0984bf in ?? ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcudart-f7fdd8d7.so.9.0
#10 0x00007f517d075573 in ?? ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcudart-f7fdd8d7.so.9.0
#11 0x00007f517d0aed86 in cudaMemcpyAsync ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcudart-f7fdd8d7.so.9.0
#12 0x00007f518ba39566 in at::native::_local_scalar_dense_cuda(at::Tensor const&)::{lambda()#1}::operator()() const ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#13 0x00007f518ba3bbb7 in at::native::_local_scalar_dense_cuda(at::Tensor const&) ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#14 0x00007f518aa70902 in at::CUDAType::_local_scalar_dense(at::Tensor const&) const ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so
#15 0x00007f517d8e5685 in torch::autograd::VariableType::_local_scalar_dense(at::Tensor const&) const ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so.1
#16 0x00007f517fb0392a in at::native::item(at::Tensor const&) ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2.so
#17 0x00007f517fe0de15 in at::TypeDefault::item(at::Tensor const&) const ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libcaffe2.so
#18 0x00007f517dadf418 in torch::autograd::VariableType::item(at::Tensor const&) const ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so.1
#19 0x00007f51be448756 in torch::autograd::dispatch_to_CLong(at::Tensor const&) ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#20 0x00007f51be4499f0 in torch::autograd::THPVariable_item(_object*, _object*) ()
   from /home/ubuntu/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#21 0x000055dd363e1bda in _PyCFunction_FastCallDict ()
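
A small sketch (not part of this repo) for capturing the Python-level stack alongside the native gdb backtrace above: register a faulthandler signal handler near worker startup, and a later kill -USR1 <pid> makes a hung process dump every thread's Python frames to stderr without stopping it.

import faulthandler
import signal

# Register once near process startup; sending SIGUSR1 to a hung worker then
# dumps all Python thread stacks to stderr without terminating the process.
faulthandler.register(signal.SIGUSR1, all_threads=True)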

There's only one place in training that uses .item():

train_loss += loss.float().item()

Figure out whether that's connected, and maybe change the code to not use item() here.
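
A minimal sketch (assuming a generic training loop, not this repo's actual code) of avoiding the per-step .item(): accumulate the loss as a detached GPU tensor and transfer it to the host only once per logging interval, so the CPU does not block on the GPU every iteration.

import torch

def train_epoch(model, loader, optimizer, criterion, device, log_interval=100):
    # On-device accumulator; updating it never forces a host sync.
    running_loss = torch.zeros(1, device=device)
    for step, (x, y) in enumerate(loader):
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        # detach() keeps the accumulation out of the autograd graph.
        running_loss += loss.detach().float()
        if (step + 1) % log_interval == 0:
            # The only device-to-host copy, and the only point that blocks.
            print(f"step {step + 1}: avg loss {(running_loss / log_interval).item():.4f}")
            running_loss.zero_()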

@yaroslavvb (Author)

This is using PyTorch 1.1 on a DLAMI 22-derived image, which has only been tested with PyTorch 1.0.

@yaroslavvb (Author)

From Natalia Gimelshein [10:41 PM]
.item() is probably a red herring; it likely hangs in the NCCL call happening in backward right before that, and cannot complete the synchronization required for the device-to-host (d2h) transfer in item().
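
A minimal sketch (my own diagnostic, function name and signature illustrative) of one way to test that hypothesis: an explicit torch.cuda.synchronize() right after backward() should move the hang to the synchronize call if the gradient all-reduce launched in backward is what never completes, leaving item() to wait only on its own copy.

import torch

def debug_train_step(model, x, y, criterion, optimizer):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()              # DistributedDataParallel launches the gradient all-reduce here
    torch.cuda.synchronize()     # if the collective is stuck, the hang surfaces here instead
    optimizer.step()
    return loss.float().item()   # item() now only waits on its own device-to-host copy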
