
GPU configuration #267

Open
wangfeiyu-zerobug opened this issue Sep 23, 2022 · 14 comments

@wangfeiyu-zerobug

How do I configure the device for multiple GPUs on a single machine? I haven't seen any documentation on this.

@zhangjiajin
Member

@wangfeiyu-zerobug

Please refer to the following configuration:
https://github.com/huawei-noah/vega/blob/master/docs/cn/user/config_reference.md#2-%E5%85%AC%E5%85%B1%E9%85%8D%E7%BD%AE%E9%A1%B9

After setting the following option to True, the model search uses all GPUs on the local machine:

parallel_search: True

If you do not want to use all GPUs, then after setting parallel_search: True you also need to set the environment variable CUDA_VISIBLE_DEVICES, e.g. export CUDA_VISIBLE_DEVICES=0,1,2 to use three GPUs.
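
For context, a minimal sketch of how this could look end to end (the config file name and GPU indices are placeholders, vega.run is the entry point used in the repo examples, and parallel_search is assumed to sit under the general section of the YAML):

    import os

    # Expose only GPUs 0, 1 and 2 to the process; must be set before CUDA is initialized.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"

    import vega

    # The YAML referenced here is assumed to contain "parallel_search: True" under "general".
    vega.run("./my_nas_config.yml")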

@wangfeiyu-zerobug
Author

It looks like the parallel computing library cannot start?
INFO:root:------------------------------------------------
INFO:root: Step: serial
INFO:root:------------------------------------------------
INFO:root:master ip and port: 127.0.0.1:28703
INFO:root:Initializing cluster. Please wait.
INFO:root:Dask-scheduler not start. Start dask-scheduler in master 127.0.0.1
ERROR:vega.core.pipeline.pipeline:Failed to run pipeline, message: [Errno 2] No such file or directory: 'dask-scheduler': 'dask-scheduler'
ERROR:vega.core.pipeline.pipeline:Traceback (most recent call last):
File "/root/.local/lib/python3.7/site-packages/vega/core/pipeline/pipeline.py", line 84, in run
pipestep = PipeStep(name=step_name)
File "/root/.local/lib/python3.7/site-packages/vega/core/pipeline/search_pipe_step.py", line 45, in init
self.master = create_master(update_func=self.generator.update)
File "/root/.local/lib/python3.7/site-packages/vega/core/scheduler/master_ops.py", line 44, in create_master
master_instance = Master(**kwargs)
File "/root/.local/lib/python3.7/site-packages/vega/core/scheduler/master.py", line 65, in init
status = self.dask_env.start()
File "/root/.local/lib/python3.7/site-packages/vega/core/scheduler/dask_env.py", line 119, in start
self._start_dask()
File "/root/.local/lib/python3.7/site-packages/vega/core/scheduler/dask_env.py", line 155, in _start_dask
scheduler_p = run_scheduler(ip=master_ip, port=master_port, tmp_file=scheduler_file)
File "/root/.local/lib/python3.7/site-packages/vega/core/scheduler/run_dask.py", line 56, in run_scheduler
env=os.environ
File "/opt/conda/lib/python3.7/subprocess.py", line 800, in init
restore_signals, start_new_session)
File "/opt/conda/lib/python3.7/subprocess.py", line 1551, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'dask-scheduler': 'dask-scheduler'
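
A quick way to check whether the executable is visible to the Python process (a small sketch, not part of the Vega logs above; Vega spawns dask-scheduler as a subprocess with env=os.environ, so the binary must be reachable through PATH):

    import shutil

    # Prints the full path if the binary is on PATH, otherwise None.
    print(shutil.which("dask-scheduler"))
    print(shutil.which("dask-worker"))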

@wangfeiyu-zerobug
Author

pip install dask
Requirement already satisfied: dask in /root/.local/lib/python3.7/site-packages (2022.2.0)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.7/site-packages (from dask) (21.3)
Requirement already satisfied: partd>=0.3.10 in /root/.local/lib/python3.7/site-packages (from dask) (1.3.0)
Requirement already satisfied: toolz>=0.8.2 in /root/.local/lib/python3.7/site-packages (from dask) (0.12.0)
Requirement already satisfied: fsspec>=0.6.0 in /root/.local/lib/python3.7/site-packages (from dask) (2022.8.2)
Requirement already satisfied: cloudpickle>=1.1.1 in /opt/conda/lib/python3.7/site-packages (from dask) (2.1.0)
Requirement already satisfied: pyyaml>=5.3.1 in /opt/conda/lib/python3.7/site-packages (from dask) (5.4.1)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/lib/python3.7/site-packages (from packaging>=20.0->dask) (2.4.7)
Requirement already satisfied: locket in /root/.local/lib/python3.7/site-packages (from partd>=0.3.10->dask) (1.0.0)

@wangfeiyu-zerobug
Author

Solved! One suggestion: this path is picked up through the os.environ interface, and since I run inside a Docker container the fix may be a bit different there, so it might be worth adding a note about that in the docs. Also, in https://github.com/huawei-noah/vega/blob/master/docs/cn/user/config_reference.md#2-%E5%85%AC%E5%85%B1%E9%85%8D%E7%BD%AE%E9%A1%B9, sections 2/2.1, "pytorch" under general is misspelled.

@zhangjiajin
Member

@wangfeiyu-zerobug

Thank you for the suggestions.
When running in a container, what was your fix?

@wangfeiyu-zerobug
Author

In the Jupyter command line: %env PATH=/root/.local/bin:
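
For a plain Python script rather than Jupyter, the equivalent fix would look roughly like this (the ~/.local/bin location is an assumption based on the pip --user install shown above):

    import os

    # Prepend the directory containing dask-scheduler to PATH before running vega,
    # so the subprocess started with env=os.environ can find the executable.
    os.environ["PATH"] = os.path.expanduser("~/.local/bin") + os.pathsep + os.environ.get("PATH", "")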

@zhangjiajin
Member

@wangfeiyu-zerobug

Thanks, we will update the docs promptly.

@wangfeiyu-zerobug
Author

If I want to test another batch of data with a network that has already been searched, how do I do that? Do I still need to run fullytrain through the pipeline?

@zhangjiajin
Member

Yes, fullytrain is needed, to check the accuracy.

@wangfeiyu-zerobug
Author

So the whole searched network has to be retrained? Is there currently no YAML configuration option to load the model parameters and run only on the test data?

@zhangjiajin
Member

If the pipeline used during the search already included fullytrain, there is no need to retrain.
For the test code, you can refer to https://github.com/huawei-noah/vega/blob/39741b5ddd9623f0984599d7f52ea38ef6f253c1/vega/tools/inference.py
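
The overall pattern in that tool is to rebuild the network from the searched description and then load the trained weights. A rough sketch, assuming the ModelZoo.get_model helper exposed by Vega and using placeholder paths:

    from vega.model_zoo import ModelZoo

    # desc_*.json and model_*.pth come from the fullytrain output directory of the task.
    model = ModelZoo.get_model(
        model_desc="./tasks/<task_id>/output/fullytrain/desc_0.json",
        pretrained_model_file="./tasks/<task_id>/output/fullytrain/model_0.pth")
    model.eval()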

@wangfeiyu-zerobug
Author

File "testcode.py", line 147, in
main()
File "testcode.py", line 141, in main
result = _infer(args, loader, model)
File "testcode.py", line 50, in _infer
return _infer_pytorch(args, model, loader)
File "testcode.py", line 70, in _infer_pytorch
infer_result = model(**batch)
TypeError: FasterRCNN object argument after ** must be a mapping, not list
I previously searched the network with SP-NAS.
!python testcode.py -c '/workspace/wfyexp/vega/vega-master/examples/nas/sp_nas/tasks/0928.183347.496/output/fullytrain/desc_4.json' -m '/workspace/wfyexp/vega/vega-master/examples/nas/sp_nas/tasks/0928.183347.496/output/fullytrain/model_4.pth' -df "COCO" -dp '/workspace/data/upper/added_dataset_COCO_format'
The dataset only has its test split replaced; the error above occurs when the dataloader output is fed into the model.
Printed batch contents:
[[tensor([[[0.5255, 0.5255, 0.5255, ..., 0.3216, 0.3216, 0.3216],
[0.5255, 0.5255, 0.5255, ..., 0.3216, 0.3216, 0.3216],
[0.5255, 0.5255, 0.5255, ..., 0.3216, 0.3216, 0.3216],
...,
[0.8078, 0.8078, 0.8078, ..., 0.2824, 0.2824, 0.2824],
[0.8078, 0.8078, 0.8078, ..., 0.2824, 0.2824, 0.2824],
[0.8078, 0.8078, 0.8078, ..., 0.2824, 0.2824, 0.2824]],

    [[0.5647, 0.5647, 0.5647,  ..., 0.3373, 0.3373, 0.3373],
     [0.5647, 0.5647, 0.5647,  ..., 0.3373, 0.3373, 0.3373],
     [0.5647, 0.5647, 0.5647,  ..., 0.3373, 0.3373, 0.3373],
     ...,
     [0.8588, 0.8588, 0.8588,  ..., 0.2902, 0.2902, 0.2902],
     [0.8588, 0.8588, 0.8588,  ..., 0.2902, 0.2902, 0.2902],
     [0.8588, 0.8588, 0.8588,  ..., 0.2902, 0.2902, 0.2902]],

    [[0.6000, 0.6000, 0.6000,  ..., 0.3333, 0.3333, 0.3333],
     [0.6000, 0.6000, 0.6000,  ..., 0.3333, 0.3333, 0.3333],
     [0.6000, 0.6000, 0.6000,  ..., 0.3333, 0.3333, 0.3333],
     ...,
     [0.9216, 0.9216, 0.9216,  ..., 0.2784, 0.2784, 0.2784],
     [0.9216, 0.9216, 0.9216,  ..., 0.2784, 0.2784, 0.2784],
     [0.9216, 0.9216, 0.9216,  ..., 0.2784, 0.2784, 0.2784]]])], [{'boxes': tensor([[190.9997, 410.9996, 250.9997, 470.9997],
    [412.9995, 312.0001, 477.9995, 365.0001]]), 'labels': tensor([1, 1]), 'masks': tensor([[[0, 0, 0,  ..., 0, 0, 0],
     [0, 0, 0,  ..., 0, 0, 0],
     [0, 0, 0,  ..., 0, 0, 0],
     ...,
     [0, 0, 0,  ..., 0, 0, 0],
     [0, 0, 0,  ..., 0, 0, 0],
     [0, 0, 0,  ..., 0, 0, 0]],

    [[0, 0, 0,  ..., 0, 0, 0],
     [0, 0, 0,  ..., 0, 0, 0],
     [0, 0, 0,  ..., 0, 0, 0],
     ...,
     [0, 0, 0,  ..., 0, 0, 0],
     [0, 0, 0,  ..., 0, 0, 0],
     [0, 0, 0,  ..., 0, 0, 0]]], dtype=torch.uint8), 'image_id': tensor([1]), 'area': tensor([3600.0029, 3445.0051]), 'iscrowd': tensor([0, 0])}]]

Why is the dataloader output expected to be a mapping?

@zhangjiajin
Member

Is the data in COCO format?

Also, for detection you need to refer to this code: https://github.com/huawei-noah/vega/blob/39741b5ddd9623f0984599d7f52ea38ef6f253c1/vega/tools/detection_inference.py
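
For what it's worth, the TypeError itself comes from the batch layout: the detection dataloader yields [images, targets] as two lists, so model(**batch) tries to unpack a list as keyword arguments. A minimal sketch of the expected call pattern, using torchvision's stock FasterRCNN as a stand-in for the searched SP-NAS model (assumed to follow the same detection interface):

    import torch
    from torchvision.models.detection import fasterrcnn_resnet50_fpn

    # Stand-in model; the real one would be rebuilt from desc_4.json / model_4.pth.
    # Note: constructing it with default settings may download ImageNet backbone weights.
    model = fasterrcnn_resnet50_fpn(num_classes=2)
    model.eval()

    # One fake batch in the same [images, targets] layout printed above.
    images = [torch.rand(3, 600, 800)]
    with torch.no_grad():
        outputs = model(images)   # pass the image list positionally, not **batch
    print(outputs[0].keys())      # dict_keys(['boxes', 'labels', 'scores'])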
