Prerequisite
Environment
System environment:
sys.platform: linux
Python: 3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 473525473
GPU 0,1: GeForce RTX 3090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.2, V11.2.152
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.9.1+cu111
PyTorch compiling details: PyTorch built with:
GCC 7.3
C++ Version: 201402
Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
OpenMP 201511 (a.k.a. OpenMP 4.5)
NNPACK is enabled
CPU capability usage: AVX2
CUDA Runtime 11.1
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
CuDNN 8.0.5
Magma 2.5.2
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
TorchVision: 0.10.1+cu111
OpenCV: 4.9.0
MMEngine: 0.10.4
Runtime environment:
cudnn_benchmark: False
mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 4}
dist_cfg: {'backend': 'nccl'}
seed: 473525473
Distributed launcher: none
Distributed training: False
GPU number: 1
Reproduces the problem - code sample
04/30 07:45:22 - mmengine - INFO - Config:
custom_hooks = [
dict(interval=1, type='BasicVisualizationHook'),
]
dataset_type = 'BasicImageDataset'
default_hooks = dict(
checkpoint=dict(
by_epoch=False,
interval=5000,
max_keep_ckpts=10,
out_dir='./work_dirs',
rule=[
'less',
'greater',
'greater',
],
save_best=[
'MAE',
'PSNR',
'SSIM',
],
save_optimizer=True,
type='CheckpointHook'),
logger=dict(interval=100, type='LoggerHook'),
param_scheduler=dict(type='ParamSchedulerHook'),
sampler_seed=dict(type='DistSamplerSeedHook'),
timer=dict(type='IterTimerHook'))
default_scope = 'mmagic'
env_cfg = dict(
cudnn_benchmark=False,
dist_cfg=dict(backend='nccl'),
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=4))
experiment_name = 'glean_x8_2xb8_cat'
inference_pipeline = [
dict(
channel_order='rgb',
color_type='color',
key='img',
type='LoadImageFromFile'),
dict(
backend='pillow',
interpolation='bicubic',
keys=[
'img',
],
scale=(
32,
32,
),
type='Resize'),
dict(type='PackInputs'),
]
launcher = 'none'
load_from = None
log_level = 'INFO'
log_processor = dict(by_epoch=False, type='LogProcessor', window_size=100)
model = dict(
data_preprocessor=dict(
mean=[
127.5,
127.5,
127.5,
],
std=[
127.5,
127.5,
127.5,
],
type='DataPreprocessor'),
discriminator=dict(
in_size=256,
init_cfg=dict(
checkpoint=
'http://download.openmmlab.com/mmediting/stylegan2/official_weights/stylegan2-cat-config-f-official_20210327_172444-15bc485b.pth',
prefix='discriminator',
type='Pretrained'),
type='StyleGANv2Discriminator'),
gan_loss=dict(
fake_label_val=0,
gan_type='vanilla',
loss_weight=0.01,
real_label_val=1.0,
type='GANLoss'),
generator=dict(
in_size=32,
init_cfg=dict(
checkpoint=
'http://download.openmmlab.com/mmediting/stylegan2/official_weights/stylegan2-cat-config-f-official_20210327_172444-15bc485b.pth',
prefix='generator_ema',
type='Pretrained'),
out_size=256,
style_channels=512,
type='GLEANStyleGANv2'),
perceptual_loss=dict(
criterion='mse',
layer_weights=dict({'21': 1.0}),
norm_img=False,
perceptual_weight=0.01,
pretrained='torchvision://vgg16',
style_weight=0,
type='PerceptualLoss',
vgg_type='vgg16'),
pixel_loss=dict(loss_weight=1.0, reduction='mean', type='MSELoss'),
test_cfg=dict(),
train_cfg=dict(),
type='SRGAN')
model_wrapper_cfg = dict(
find_unused_parameters=True, type='MMSeparateDistributedDataParallel')
optim_wrapper = dict(
constructor='MultiOptimWrapperConstructor',
discriminator=dict(
optimizer=dict(betas=(
0.9,
0.99,
), lr=0.0001, type='Adam'),
type='OptimWrapper'),
generator=dict(
optimizer=dict(betas=(
0.9,
0.99,
), lr=0.0001, type='Adam'),
type='OptimWrapper'))
param_scheduler = dict(
T_max=600000, by_epoch=False, eta_min=1e-07, type='CosineAnnealingLR')
resume = True
save_dir = './work_dirs'
scale = 8
test_cfg = dict(type='MultiTestLoop')
test_dataloader = dict(
dataset=dict(
ann_file='meta_info_Cat100_GT.txt',
data_prefix=dict(gt='GT', img='BIx8_down'),
data_root='data/cat_test',
metainfo=dict(dataset_type='cat', task_name='sisr'),
pipeline=[
dict(
channel_order='rgb',
color_type='color',
key='img',
type='LoadImageFromFile'),
dict(
channel_order='rgb',
color_type='color',
key='gt',
type='LoadImageFromFile'),
dict(type='PackInputs'),
],
type='BasicImageDataset'),
drop_last=False,
num_workers=8,
persistent_workers=False,
pin_memory=True,
sampler=dict(shuffle=False, type='DefaultSampler'))
test_evaluator = [
dict(type='MAE'),
dict(type='PSNR'),
dict(type='SSIM'),
]
test_pipeline = [
dict(
channel_order='rgb',
color_type='color',
key='img',
type='LoadImageFromFile'),
dict(
channel_order='rgb',
color_type='color',
key='gt',
type='LoadImageFromFile'),
dict(type='PackInputs'),
]
train_cfg = dict(
max_iters=300000, type='IterBasedTrainLoop', val_interval=5000)
train_dataloader = dict(
batch_size=8,
dataset=dict(
ann_file='meta_info_LSUNcat_GT.txt',
data_prefix=dict(gt='GT', img='BIx8_down'),
data_root='data/cat_train',
metainfo=dict(dataset_type='cat', task_name='sisr'),
pipeline=[
dict(
channel_order='rgb',
color_type='color',
key='img',
type='LoadImageFromFile'),
dict(
channel_order='rgb',
color_type='color',
key='gt',
type='LoadImageFromFile'),
dict(
direction='horizontal',
flip_ratio=0.5,
keys=[
'img',
'gt',
],
type='Flip'),
dict(type='PackInputs'),
],
type='BasicImageDataset'),
num_workers=8,
persistent_workers=False,
pin_memory=True,
sampler=dict(shuffle=True, type='InfiniteSampler'))
train_pipeline = [
dict(
channel_order='rgb',
color_type='color',
key='img',
type='LoadImageFromFile'),
dict(
channel_order='rgb',
color_type='color',
key='gt',
type='LoadImageFromFile'),
dict(
direction='horizontal',
flip_ratio=0.5,
keys=[
'img',
'gt',
],
type='Flip'),
dict(type='PackInputs'),
]
val_cfg = dict(type='MultiValLoop')
val_dataloader = dict(
dataset=dict(
ann_file='meta_info_Cat100_GT.txt',
data_prefix=dict(gt='GT', img='BIx8_down'),
data_root='data/cat_test',
metainfo=dict(dataset_type='cat', task_name='sisr'),
pipeline=[
dict(
channel_order='rgb',
color_type='color',
key='img',
type='LoadImageFromFile'),
dict(
channel_order='rgb',
color_type='color',
key='gt',
type='LoadImageFromFile'),
dict(type='PackInputs'),
],
type='BasicImageDataset'),
drop_last=False,
num_workers=8,
persistent_workers=False,
pin_memory=True,
sampler=dict(shuffle=False, type='DefaultSampler'))
val_evaluator = [
dict(type='MAE'),
dict(type='PSNR'),
dict(type='SSIM'),
]
vis_backends = [
dict(type='LocalVisBackend'),
]
visualizer = dict(
bgr2rgb=True,
fn_key='gt_path',
img_keys=[
'gt_img',
'input',
'pred_img',
],
type='ConcatImageVisualizer',
vis_backends=[
dict(type='LocalVisBackend'),
])
work_dir = './work_dirs/glean_x8_2xb8_cat'
04/30 07:45:32 - mmengine - INFO - Distributed training is not used, all SyncBatchNorm (SyncBN) layers in the model will be automatically reverted to BatchNormXd layers if they are used.
04/30 07:45:32 - mmengine - INFO - Hooks will be executed in the following order:
before_run:
(VERY_HIGH ) RuntimeInfoHook
(BELOW_NORMAL) LoggerHook
before_train:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(VERY_LOW ) CheckpointHook
before_train_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(NORMAL ) DistSamplerSeedHook
before_train_iter:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
after_train_iter:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(NORMAL ) BasicVisualizationHook
(BELOW_NORMAL) LoggerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook
after_train_epoch:
(NORMAL ) IterTimerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook
before_val:
(VERY_HIGH ) RuntimeInfoHook
before_val_epoch:
(NORMAL ) IterTimerHook
before_val_iter:
(NORMAL ) IterTimerHook
after_val_iter:
(NORMAL ) IterTimerHook
(NORMAL ) BasicVisualizationHook
(BELOW_NORMAL) LoggerHook
after_val_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
(LOW ) ParamSchedulerHook
(VERY_LOW ) CheckpointHook
after_val:
(VERY_HIGH ) RuntimeInfoHook
after_train:
(VERY_HIGH ) RuntimeInfoHook
(VERY_LOW ) CheckpointHook
before_test:
(VERY_HIGH ) RuntimeInfoHook
before_test_epoch:
(NORMAL ) IterTimerHook
before_test_iter:
(NORMAL ) IterTimerHook
after_test_iter:
(NORMAL ) IterTimerHook
(NORMAL ) BasicVisualizationHook
(BELOW_NORMAL) LoggerHook
after_test_epoch:
(VERY_HIGH ) RuntimeInfoHook
(NORMAL ) IterTimerHook
(BELOW_NORMAL) LoggerHook
after_test:
(VERY_HIGH ) RuntimeInfoHook
after_run:
(BELOW_NORMAL) LoggerHook
04/30 07:45:33 - mmengine - INFO - Working directory: ./work_dirs/glean_x8_2xb8_cat
04/30 07:45:33 - mmengine - INFO - Log directory: /root/glean/work_dirs/glean_x8_2xb8_cat/20240430_074521
04/30 07:45:33 - mmengine - WARNING - cat is not a meta file, simply parsed as meta information
04/30 07:45:33 - mmengine - WARNING - sisr is not a meta file, simply parsed as meta information
04/30 07:45:35 - mmengine - INFO - Add to optimizer 'generator' ({'type': 'Adam', 'lr': 0.0001, 'betas': (0.9, 0.99)}): 'generator'.
04/30 07:45:35 - mmengine - INFO - Add to optimizer 'discriminator' ({'type': 'Adam', 'lr': 0.0001, 'betas': (0.9, 0.99)}): 'discriminator'.
04/30 07:45:35 - mmengine - WARNING - The prefix is not set in metric class MAE.
04/30 07:45:35 - mmengine - WARNING - The prefix is not set in metric class PSNR.
04/30 07:45:35 - mmengine - WARNING - The prefix is not set in metric class SSIM.
04/30 07:45:36 - mmengine - INFO - load generator_ema in model from: http://download.openmmlab.com/mmediting/stylegan2/official_weights/stylegan2-cat-config-f-official_20210327_172444-15bc485b.pth
Loads checkpoint by http backend from path: http://download.openmmlab.com/mmediting/stylegan2/official_weights/stylegan2-cat-config-f-official_20210327_172444-15bc485b.pth
04/30 07:45:36 - mmengine - WARNING - The model and loaded state dict do not match exactly
Reproduces the problem - command or script
python tools/train.py configs/glean/glean_x8_2xb8_cat.py --resume
Reproduces the problem - error message
04/30 06:43:01 - mmengine - INFO - Saving checkpoint at 275000 iterations
Switch to evaluation style mode: single
04/30 06:43:25 - mmengine - INFO - Iter(val) [100/100] eta: 0:00:00 time: 0.1712 data_time: 0.0235 memory: 3032
04/30 06:43:26 - mmengine - INFO - Iter(val) [100/100] MAE: 0.0457 PSNR: 23.7792 SSIM: 0.5953 data_time: 0.0234 time: 0.1709
Traceback (most recent call last):
  File "tools/train.py", line 114, in <module>
    main()
  File "tools/train.py", line 107, in main
    runner.train()
  File "/root/miniconda3/envs/mmagic/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1778, in train
    model = self.train_loop.run()  # type: ignore
  File "/root/miniconda3/envs/mmagic/lib/python3.8/site-packages/mmengine/runner/loops.py", line 294, in run
    self.runner.val_loop.run()
  File "/root/glean/mmagic/engine/runner/multi_loops.py", line 247, in run
    self._runner.call_hook('after_val_epoch', metrics=multi_metric)
  File "/root/miniconda3/envs/mmagic/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1841, in call_hook
    getattr(hook, fn_name)(self, **kwargs)
  File "/root/miniconda3/envs/mmagic/lib/python3.8/site-packages/mmengine/hooks/checkpoint_hook.py", line 361, in after_val_epoch
    self._save_best_checkpoint(runner, metrics)
  File "/root/miniconda3/envs/mmagic/lib/python3.8/site-packages/mmengine/hooks/checkpoint_hook.py", line 521, in _save_best_checkpoint
    if key_score is None or not self.is_better_than[key_indicator](
  File "/root/miniconda3/envs/mmagic/lib/python3.8/site-packages/mmengine/hooks/checkpoint_hook.py", line 123, in <lambda>
    rule_map = {'greater': lambda x, y: x > y, 'less': lambda x, y: x < y}
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
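The last two frames suggest how the hook decides whether a checkpoint is the new best: each `save_best` key appears to be paired with a comparison from `rule_map`, and the failing frame is the comparison itself. Below is a minimal plain-Python sketch of that pairing, using the `save_best` and `rule` lists from the config above. The `previous_best` values and the `improved_keys` helper are hypothetical and are not part of MMEngine.

```python
# Comparison functions, copied from the traceback line in checkpoint_hook.py.
rule_map = {'greater': lambda x, y: x > y, 'less': lambda x, y: x < y}

# Values taken from the checkpoint config in this report.
save_best = ['MAE', 'PSNR', 'SSIM']
rule = ['less', 'greater', 'greater']

# One comparison per metric key (assumed pairing: same position in both lists).
is_better_than = {key: rule_map[r] for key, r in zip(save_best, rule)}


def improved_keys(metrics, best_scores):
    """Return the metric keys whose new score beats the stored best score."""
    better = []
    for key in save_best:
        best = best_scores.get(key)
        if best is None or is_better_than[key](metrics[key], best):
            better.append(key)
    return better


# Scores from the validation log line above.
metrics = {'MAE': 0.0457, 'PSNR': 23.7792, 'SSIM': 0.5953}
# Hypothetical stored best scores, for illustration only.
previous_best = {'MAE': 0.0500, 'PSNR': 24.0, 'SSIM': None}

print(improved_keys(metrics, previous_best))  # ['MAE', 'SSIM']
```

With plain floats this comparison cannot raise a device error. The `RuntimeError` therefore suggests that both operands were tensors when `x > y` or `x < y` ran, one on `cuda:0` and one on the CPU.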
Additional information
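This happens after I add the --resume flag to the command line. After resuming, training runs for 5000 iterations, the model automatically saves its weights and runs validation, and then, when it is about to enter the next 5000-iteration cycle, it raises the error shown above and training cannot continue automatically.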
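Since the failure shows up only with `--resume`, one plausible explanation (an assumption, not confirmed in this report) is that the best score restored from the checkpoint and the newly computed score live on different devices. A possible workaround is to convert metric values to plain Python floats before they are compared. The sketch below is duck-typed so that it runs without PyTorch. `to_float`, `metrics_to_floats`, and `FakeTensor` are hypothetical names, and where to hook this into MMagic or MMEngine is left open.

```python
def to_float(value):
    """Convert a tensor-like scalar to a Python float.

    Anything exposing ``.item()`` (for example a one-element ``torch.Tensor``
    or a NumPy scalar) is unwrapped; other values go through ``float()``.
    """
    item = getattr(value, 'item', None)
    if callable(item):
        return float(item())
    return float(value)


def metrics_to_floats(metrics):
    """Return a copy of ``metrics`` with every value as a Python float."""
    return {name: to_float(value) for name, value in metrics.items()}


class FakeTensor:
    """Stand-in for a one-element tensor, used only for this demo."""

    def __init__(self, value, device):
        self.value = value
        self.device = device

    def item(self):
        return self.value


# A mix of a tensor-like score and a plain float, as might reach the hook.
new_scores = {'PSNR': FakeTensor(23.7792, 'cuda:0'), 'MAE': 0.0457}
print(metrics_to_floats(new_scores))
```

Once every score is a Python float, the `greater` and `less` comparisons no longer depend on where a tensor was stored.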