
[Bug] single_train.sh would generate a runtime-error. #65

Open · Lxml-q opened this issue Jul 17, 2024 · 0 comments
Labels: bug (Something isn't working)

Lxml-q commented Jul 17, 2024
Describe the bug

Training with the script tools/single_train.sh raises a runtime error:
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

This also makes it impossible to debug the program in PyCharm with the parameters: configs/classification/cifar10/mixups/basic/r18_mixups_CE_none.py --work_dir work_dirs/classification/cifar10/mixups/basic/r18_mixups_CE_none/

To Reproduce

The command I executed:

bash tools/single_train.sh configs/classification/cifar10/mixups/basic/r18_mixups_CE_none.py

Alternatively, set the PyCharm run parameters to configs/classification/cifar10/mixups/basic/r18_mixups_CE_none.py --work_dir work_dirs/classification/cifar10/mixups/basic/r18_mixups_CE_none/ and run it from PyCharm.
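One debugging workaround (a sketch of my own, not something from the repository) is to initialize a one-process group before training starts, so the `torch.distributed` collectives in the validation hook have a default group even when running outside a launcher. This assumes the CPU-capable `gloo` backend is available; the address/port values are arbitrary local settings:

```python
import os
import torch.distributed as dist

# Hypothetical workaround for single-GPU / PyCharm debugging: create a
# one-process "gloo" group so dist.broadcast() calls do not fail.
# MASTER_ADDR / MASTER_PORT are arbitrary local values, not repo settings.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
if not dist.is_initialized():
    dist.init_process_group(backend="gloo", rank=0, world_size=1)
```

With rank 0 in a world of size 1, every `broadcast` becomes a local no-op, so training should proceed as in a normal single-GPU run.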

Post related information

1. Your train log file if you meet the problem during training:
2024-07-17 14:15:48,333 - openmixup - INFO - workflow: [('train', 1)], max: 20 epochs
2024-07-17 14:15:48,333 - openmixup - INFO - Checkpoints will be saved to /home/hang/research/repository/openmixup/work_dirs/classification/cifar10/mixups/basic/r18_mixups_CE_none by HardDiskBackend.
2024-07-17 14:15:58,315 - openmixup - INFO - Epoch [1][50/500]  lr: 1.000e-01, eta: 0:32:57, time: 0.199, data_time: 0.049, memory: 2011, loss: 3.1864, acc: 11.2400, acc_mix: 11.1770
2024-07-17 14:16:02,621 - openmixup - INFO - Epoch [1][100/500] lr: 1.000e-01, eta: 0:23:31, time: 0.086, data_time: 0.002, memory: 2011, loss: 2.2559, acc: 13.2800, acc_mix: 14.8798
2024-07-17 14:16:06,981 - openmixup - INFO - Epoch [1][150/500] lr: 1.000e-01, eta: 0:20:21, time: 0.087, data_time: 0.001, memory: 2011, loss: 2.1951, acc: 15.7400, acc_mix: 17.7451
2024-07-17 14:16:11,302 - openmixup - INFO - Epoch [1][200/500] lr: 1.000e-01, eta: 0:18:43, time: 0.087, data_time: 0.002, memory: 2011, loss: 2.1531, acc: 17.2800, acc_mix: 20.0116
2024-07-17 14:16:15,607 - openmixup - INFO - Epoch [1][250/500] lr: 1.000e-01, eta: 0:17:41, time: 0.086, data_time: 0.001, memory: 2011, loss: 2.1376, acc: 18.2000, acc_mix: 20.3139
2024-07-17 14:16:19,971 - openmixup - INFO - Epoch [1][300/500] lr: 1.000e-01, eta: 0:17:01, time: 0.087, data_time: 0.002, memory: 2011, loss: 2.1047, acc: 19.3600, acc_mix: 22.5280
2024-07-17 14:16:24,319 - openmixup - INFO - Epoch [1][350/500] lr: 1.000e-01, eta: 0:16:30, time: 0.087, data_time: 0.002, memory: 2011, loss: 2.0813, acc: 20.6200, acc_mix: 23.2146
2024-07-17 14:16:28,675 - openmixup - INFO - Epoch [1][400/500] lr: 1.000e-01, eta: 0:16:07, time: 0.087, data_time: 0.002, memory: 2011, loss: 2.0527, acc: 19.5200, acc_mix: 24.3855
2024-07-17 14:16:33,065 - openmixup - INFO - Epoch [1][450/500] lr: 1.000e-01, eta: 0:15:48, time: 0.088, data_time: 0.001, memory: 2011, loss: 2.0525, acc: 19.1800, acc_mix: 25.3555
2024-07-17 14:16:37,458 - openmixup - INFO - Exp name: r18_mixups_CE_none.py
2024-07-17 14:16:37,458 - openmixup - INFO - Epoch [1][500/500] lr: 1.000e-01, eta: 0:15:32, time: 0.088, data_time: 0.002, memory: 2011, loss: 2.0137, acc: 23.0800, acc_mix: 26.3356
Traceback (most recent call last):
File "/home/hang/research/repository/openmixup/tools/train.py", line 208, in <module>
  main()
File "/home/hang/research/repository/openmixup/tools/train.py", line 198, in main
  train_model(
File "/home/hang/research/repository/openmixup/openmixup/apis/train.py", line 225, in train_model
  runner.run(data_loaders, cfg.workflow)
File "/home/hang/anaconda3/envs/openmixup/lib/python3.9/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
  epoch_runner(data_loaders[i], **kwargs)
File "/home/hang/anaconda3/envs/openmixup/lib/python3.9/site-packages/mmcv/runner/epoch_based_runner.py", line 58, in train
  self.call_hook('after_train_epoch')
File "/home/hang/anaconda3/envs/openmixup/lib/python3.9/site-packages/mmcv/runner/base_runner.py", line 317, in call_hook
  getattr(hook, fn_name)(self)
File "/home/hang/research/repository/openmixup/openmixup/core/hooks/validate_hook.py", line 281, in after_train_epoch
  self._run_validate(runner)
File "/home/hang/research/repository/openmixup/openmixup/core/hooks/validate_hook.py", line 371, in _run_validate
  dist.broadcast(module.running_var, 0)
File "/home/hang/anaconda3/envs/openmixup/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1192, in broadcast
  default_pg = _get_default_group()
File "/home/hang/anaconda3/envs/openmixup/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 429, in _get_default_group
  raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Additional context

[screenshot attached]

@Lxml-q Lxml-q added the bug Something isn't working label Jul 17, 2024