Thanks for your project.
My environment is Ubuntu 16.04 + Python 3.6 + PyTorch 1.1 + CUDA 10.0.
I tried to launch distributed training with:

```
python -m torch.distributed.launch --nproc_per_node=2 --master_port=4321 train_niqe.py -opt options/train/train_AdaGrowingNet.yml --launcher pytorch
```
First, for VGGFeatureExtractor, I got this error:
```
RuntimeError: replicas_[0].size() >= 1 ASSERT FAILED at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:53, please report a bug to PyTorch. Expected at least one parameter. (Reducer at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:53)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f27c47be441 in /home/wangzhan/anaconda3/envs/py36_pt10_tf14/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f27c47bdd7a in /home/wangzhan/anaconda3/envs/py36_pt10_tf14/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: c10d::Reducer::Reducer(std::vector<std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> >, std::allocator<std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > > >, std::vector<std::vector<unsigned long, std::allocator<unsigned long> >, std::allocator<std::vector<unsigned long, std::allocator<unsigned long> > > >, std::shared_ptr<c10d::ProcessGroup>) + 0x199c (0x7f280405fc1c in /home/wangzhan/anaconda3/envs/py36_pt10_tf14/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
```
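My guess (an assumption on my part, not something I have confirmed in the PyTorch source) is that the Reducer asserts because every parameter of netF already has requires_grad = False when DistributedDataParallel wraps it, so it finds nothing to synchronize. A minimal standalone sketch that should trigger the same assert on PyTorch 1.1:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

# Assumes this script is started via torch.distributed.launch, which sets
# the environment variables that env:// initialization reads.
dist.init_process_group(backend='nccl')

# Hypothetical stand-in for VGGFeatureExtractor: all parameters are frozen
# before the DDP wrapper is constructed.
frozen = nn.Linear(4, 4).cuda()
for p in frozen.parameters():
    p.requires_grad = False

# With no trainable parameters, the Reducer should hit
# "Expected at least one parameter".
net = DistributedDataParallel(frozen, device_ids=[torch.cuda.current_device()])
```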
Then I changed it so that netF's parameters are frozen (v.requires_grad = False) only after the line

```python
self.netF = DistributedDataParallel(self.netF, device_ids=[torch.cuda.current_device()])
```

whereas originally the freezing happened inside the definition of VGGFeatureExtractor. With that change, the error disappeared.
Then I ran the training again, but it raised another RuntimeError:
```
Traceback (most recent call last):
  File "train_niqe.py", line 260, in <module>
    main()
  File "train_niqe.py", line 172, in main
    model.optimize_parameters(current_step)
  File "/home/wangzhan/SRtask/data_augment/RankSRGAN-master/codes/models/RankSRGAN_model.py", line 215, in optimize_parameters
    l_d_total.backward()
  File "/home/wangzhan/anaconda3/envs/py36_pt10_tf14/lib/python3.6/site-packages/torch/tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/wangzhan/anaconda3/envs/py36_pt10_tf14/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512]] is at version 4; expected version 3 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
```
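Following the hint in the message, I can enable anomaly detection before training to locate the offending in-place operation (a generic sketch, not code from this repo):

```python
import torch

# Turn this on once, before the training loop. The next backward() that
# fails will print a second traceback pointing at the forward-pass
# operation whose output was later modified in place. It slows training
# down noticeably, so it is for debugging only.
torch.autograd.set_detect_anomaly(True)

# ... then run the unchanged training step, e.g. l_d_total.backward()
```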
Have you encountered this problem? How can it be fixed?