
How to train this model in one node multi-gpus mode? #14

Open
trillionpowers opened this issue Mar 3, 2020 · 1 comment

Comments


trillionpowers commented Mar 3, 2020

Thanks for your project.

My environment is Ubuntu 16.04 + Python 3.6 + PyTorch 1.1 + CUDA 10.0.

I tried to launch distributed training with this command:

```
python -m torch.distributed.launch --nproc_per_node=2 --master_port=4321 train_niqe.py -opt options/train/train_AdaGrowingNet.yml --launcher pytorch
```

First, for VGGFeatureExtractor, I got this error:

```
RuntimeError: replicas_[0].size() >= 1 ASSERT FAILED at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:53, please report a bug to PyTorch. Expected at least one parameter. (Reducer at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:53)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f27c47be441 in /home/wangzhan/anaconda3/envs/py36_pt10_tf14/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f27c47bdd7a in /home/wangzhan/anaconda3/envs/py36_pt10_tf14/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: c10d::Reducer::Reducer(std::vector<std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> >, std::allocator<std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > > >, std::vector<std::vector<unsigned long, std::allocator<unsigned long> >, std::allocator<std::vector<unsigned long, std::allocator<unsigned long> > > >, std::shared_ptr<c10d::ProcessGroup>) + 0x199c (0x7f280405fc1c in /home/wangzhan/anaconda3/envs/py36_pt10_tf14/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
```

Then I moved the freezing of netF's parameters (v.requires_grad = False) to after the line self.netF = DistributedDataParallel(self.netF, device_ids=[torch.cuda.current_device()]). In the original code the parameters are frozen earlier, at the definition of VGGFeatureExtractor, i.e. before the DDP wrap.
With that change, this error disappeared.
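In case it helps others, the workaround above can be sketched as follows. This is a minimal, illustrative version: a small stand-in module replaces VGGFeatureExtractor, and the actual DistributedDataParallel wrap is left commented out because it needs an initialized process group.

```python
import torch
import torch.nn as nn

# Stand-in for VGGFeatureExtractor (the real one is a pretrained VGG).
netF = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU())

# DDP in PyTorch 1.x asserts at construction time that the wrapped module
# has at least one parameter with requires_grad=True ("Expected at least
# one parameter"), so the parameters must still be trainable here.
assert any(p.requires_grad for p in netF.parameters())

# netF = nn.parallel.DistributedDataParallel(
#     netF, device_ids=[torch.cuda.current_device()])

# Freeze only *after* the wrap, as described above:
for v in netF.parameters():
    v.requires_grad = False

assert not any(p.requires_grad for p in netF.parameters())
```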

Then I ran the training again, but it failed with another RuntimeError:

```
Traceback (most recent call last):
  File "train_niqe.py", line 260, in <module>
    main()
  File "train_niqe.py", line 172, in main
    model.optimize_parameters(current_step)
  File "/home/wangzhan/SRtask/data_augment/RankSRGAN-master/codes/models/RankSRGAN_model.py", line 215, in optimize_parameters
    l_d_total.backward()
  File "/home/wangzhan/anaconda3/envs/py36_pt10_tf14/lib/python3.6/site-packages/torch/tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/wangzhan/anaconda3/envs/py36_pt10_tf14/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512]] is at version 4; expected version 3 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
```
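Following the hint at the end of the traceback, anomaly detection can be enabled to pinpoint the offending in-place op. Here is a minimal, self-contained repro of this class of error (illustrative only, not the project's code):

```python
import torch

# Make backward report the forward op that produced the tensor
# which was later modified in place.
torch.autograd.set_detect_anomaly(True)

w = torch.randn(3, requires_grad=True)
out = w.sigmoid()  # sigmoid saves its output for the backward pass...
out.mul_(2)        # ...so an in-place edit bumps its version counter

caught = False
try:
    out.sum().backward()  # raises "... modified by an inplace operation"
except RuntimeError:
    caught = True
assert caught
```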

Did you encounter this problem? How can it be fixed?


Som5ra commented Nov 30, 2021

Hi mate, how did you fix this? I just hit the same problem in multi-GPU mode; the code runs fine on a single GPU.
