
GPU configuration #267

Open
wangfeiyu-zerobug opened this issue Sep 23, 2022 · 14 comments

@wangfeiyu-zerobug

How do I configure the device for multiple GPUs on a single machine? I haven't seen any documentation on this.

@zhangjiajin
Member

@wangfeiyu-zerobug

Please refer to the following configuration:
https://github.com/huawei-noah/vega/blob/master/docs/cn/user/config_reference.md#2-%E5%85%AC%E5%85%B1%E9%85%8D%E7%BD%AE%E9%A1%B9

After setting the following option to True, the model search uses all GPUs on the local machine:

parallel_search: True

If you do not want to use all GPUs, then after setting parallel_search: True you also need to set the environment variable CUDA_VISIBLE_DEVICES, e.g. export CUDA_VISIBLE_DEVICES=0,1,2 to use three GPUs.
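
For context, a minimal sketch of how this could look end to end (the config file name and GPU indices are placeholders, vega.run is the entry point used in the repo examples, and parallel_search is assumed to sit under the general section of the YAML):

    import os

    # Expose only GPUs 0, 1 and 2 to the process; must be set before CUDA is initialized.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"

    import vega

    # The YAML referenced here is assumed to contain "parallel_search: True" under "general".
    vega.run("./my_nas_config.yml")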

@wangfeiyu-zerobug
Author

It looks like the parallel computing library cannot start?
INFO:root:------------------------------------------------
INFO:root: Step: serial
INFO:root:------------------------------------------------
INFO:root:master ip and port: 127.0.0.1:28703
INFO:root:Initializing cluster. Please wait.
INFO:root:Dask-scheduler not start. Start dask-scheduler in master 127.0.0.1
ERROR:vega.core.pipeline.pipeline:Failed to run pipeline, message: [Errno 2] No such file or directory: 'dask-scheduler': 'dask-scheduler'
ERROR:vega.core.pipeline.pipeline:Traceback (most recent call last):
File "/root/.local/lib/python3.7/site-packages/vega/core/pipeline/pipeline.py", line 84, in run
pipestep = PipeStep(name=step_name)
File "/root/.local/lib/python3.7/site-packages/vega/core/pipeline/search_pipe_step.py", line 45, in init
self.master = create_master(update_func=self.generator.update)
File "/root/.local/lib/python3.7/site-packages/vega/core/scheduler/master_ops.py", line 44, in create_master
master_instance = Master(**kwargs)
File "/root/.local/lib/python3.7/site-packages/vega/core/scheduler/master.py", line 65, in init
status = self.dask_env.start()
File "/root/.local/lib/python3.7/site-packages/vega/core/scheduler/dask_env.py", line 119, in start
self._start_dask()
File "/root/.local/lib/python3.7/site-packages/vega/core/scheduler/dask_env.py", line 155, in _start_dask
scheduler_p = run_scheduler(ip=master_ip, port=master_port, tmp_file=scheduler_file)
File "/root/.local/lib/python3.7/site-packages/vega/core/scheduler/run_dask.py", line 56, in run_scheduler
env=os.environ
File "/opt/conda/lib/python3.7/subprocess.py", line 800, in init
restore_signals, start_new_session)
File "/opt/conda/lib/python3.7/subprocess.py", line 1551, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'dask-scheduler': 'dask-scheduler'
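
A quick way to check whether the executable is visible to the Python process (a small sketch, not part of the Vega logs above; Vega spawns dask-scheduler as a subprocess with env=os.environ, so the binary must be reachable through PATH):

    import shutil

    # Prints the full path if the binary is on PATH, otherwise None.
    print(shutil.which("dask-scheduler"))
    print(shutil.which("dask-worker"))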

@wangfeiyu-zerobug
Author

pip install dask
Requirement already satisfied: dask in /root/.local/lib/python3.7/site-packages (2022.2.0)
Requirement already satisfied: packaging>=20.0 in /opt/conda/lib/python3.7/site-packages (from dask) (21.3)
Requirement already satisfied: partd>=0.3.10 in /root/.local/lib/python3.7/site-packages (from dask) (1.3.0)
Requirement already satisfied: toolz>=0.8.2 in /root/.local/lib/python3.7/site-packages (from dask) (0.12.0)
Requirement already satisfied: fsspec>=0.6.0 in /root/.local/lib/python3.7/site-packages (from dask) (2022.8.2)
Requirement already satisfied: cloudpickle>=1.1.1 in /opt/conda/lib/python3.7/site-packages (from dask) (2.1.0)
Requirement already satisfied: pyyaml>=5.3.1 in /opt/conda/lib/python3.7/site-packages (from dask) (5.4.1)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /opt/conda/lib/python3.7/site-packages (from packaging>=20.0->dask) (2.4.7)
Requirement already satisfied: locket in /root/.local/lib/python3.7/site-packages (from partd>=0.3.10->dask) (1.0.0)

@wangfeiyu-zerobug
Author

Solved! One suggestion: this path is picked up through the os.environ interface, and since I run inside a Docker container the fix may be a bit different there, so it might be worth adding a note about that in the docs. Also, in https://github.com/huawei-noah/vega/blob/master/docs/cn/user/config_reference.md#2-%E5%85%AC%E5%85%B1%E9%85%8D%E7%BD%AE%E9%A1%B9, sections 2/2.1, "pytorch" under general is misspelled.

@zhangjiajin
Member

@wangfeiyu-zerobug

Thank you for the suggestions.
When running in a container, what was your fix?

@wangfeiyu-zerobug
Author

In the Jupyter command line: %env PATH=/root/.local/bin:
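
For a plain Python script rather than Jupyter, the equivalent fix would look roughly like this (the ~/.local/bin location is an assumption based on the pip --user install shown above):

    import os

    # Prepend the directory containing dask-scheduler to PATH before running vega,
    # so the subprocess started with env=os.environ can find the executable.
    os.environ["PATH"] = os.path.expanduser("~/.local/bin") + os.pathsep + os.environ.get("PATH", "")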

@zhangjiajin
Member

@wangfeiyu-zerobug

Thanks, we will update the docs promptly.

@wangfeiyu-zerobug
Author

If I want to test another batch of data with a network that has already been searched, how do I do that? Do I still need to run fullytrain through the pipeline?

@zhangjiajin
Member

Yes, fullytrain is needed, to check the accuracy.

@wangfeiyu-zerobug
Author

So the whole searched network has to be retrained? Is there currently no YAML configuration option to load the model parameters and run only on the test data?

@zhangjiajin
Member

If the pipeline used during the search already included fullytrain, there is no need to retrain.
For the test code, you can refer to https://github.com/huawei-noah/vega/blob/39741b5ddd9623f0984599d7f52ea38ef6f253c1/vega/tools/inference.py
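
The overall pattern in that tool is to rebuild the network from the searched description and then load the trained weights. A rough sketch, assuming the ModelZoo.get_model helper exposed by Vega and using placeholder paths:

    from vega.model_zoo import ModelZoo

    # desc_*.json and model_*.pth come from the fullytrain output directory of the task.
    model = ModelZoo.get_model(
        model_desc="./tasks/<task_id>/output/fullytrain/desc_0.json",
        pretrained_model_file="./tasks/<task_id>/output/fullytrain/model_0.pth")
    model.eval()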

@wangfeiyu-zerobug
Author

File "testcode.py", line 147, in
main()
File "testcode.py", line 141, in main
result = _infer(args, loader, model)
File "testcode.py", line 50, in _infer
return _infer_pytorch(args, model, loader)
File "testcode.py", line 70, in _infer_pytorch
infer_result = model(**batch)
TypeError: FasterRCNN object argument after ** must be a mapping, not list
I previously searched the network with SP-NAS.
!python testcode.py -c '/workspace/wfyexp/vega/vega-master/examples/nas/sp_nas/tasks/0928.183347.496/output/fullytrain/desc_4.json' -m '/workspace/wfyexp/vega/vega-master/examples/nas/sp_nas/tasks/0928.183347.496/output/fullytrain/model_4.pth' -df "COCO" -dp '/workspace/data/upper/added_dataset_COCO_format'
The dataset only has its test split replaced; the error above occurs when the dataloader output is fed into the model.
Printed batch contents:
[[tensor([[[0.5255, 0.5255, 0.5255, ..., 0.3216, 0.3216, 0.3216],
[0.5255, 0.5255, 0.5255, ..., 0.3216, 0.3216, 0.3216],
[0.5255, 0.5255, 0.5255, ..., 0.3216, 0.3216, 0.3216],
...,
[0.8078, 0.8078, 0.8078, ..., 0.2824, 0.2824, 0.2824],
[0.8078, 0.8078, 0.8078, ..., 0.2824, 0.2824, 0.2824],
[0.8078, 0.8078, 0.8078, ..., 0.2824, 0.2824, 0.2824]],

    [[0.5647, 0.5647, 0.5647,  ..., 0.3373, 0.3373, 0.3373],
     [0.5647, 0.5647, 0.5647,  ..., 0.3373, 0.3373, 0.3373],
     [0.5647, 0.5647, 0.5647,  ..., 0.3373, 0.3373, 0.3373],
     ...,
     [0.8588, 0.8588, 0.8588,  ..., 0.2902, 0.2902, 0.2902],
     [0.8588, 0.8588, 0.8588,  ..., 0.2902, 0.2902, 0.2902],
     [0.8588, 0.8588, 0.8588,  ..., 0.2902, 0.2902, 0.2902]],

    [[0.6000, 0.6000, 0.6000,  ..., 0.3333, 0.3333, 0.3333],
     [0.6000, 0.6000, 0.6000,  ..., 0.3333, 0.3333, 0.3333],
     [0.6000, 0.6000, 0.6000,  ..., 0.3333, 0.3333, 0.3333],
     ...,
     [0.9216, 0.9216, 0.9216,  ..., 0.2784, 0.2784, 0.2784],
     [0.9216, 0.9216, 0.9216,  ..., 0.2784, 0.2784, 0.2784],
     [0.9216, 0.9216, 0.9216,  ..., 0.2784, 0.2784, 0.2784]]])], [{'boxes': tensor([[190.9997, 410.9996, 250.9997, 470.9997],
    [412.9995, 312.0001, 477.9995, 365.0001]]), 'labels': tensor([1, 1]), 'masks': tensor([[[0, 0, 0,  ..., 0, 0, 0],
     [0, 0, 0,  ..., 0, 0, 0],
     [0, 0, 0,  ..., 0, 0, 0],
     ...,
     [0, 0, 0,  ..., 0, 0, 0],
     [0, 0, 0,  ..., 0, 0, 0],
     [0, 0, 0,  ..., 0, 0, 0]],

    [[0, 0, 0,  ..., 0, 0, 0],
     [0, 0, 0,  ..., 0, 0, 0],
     [0, 0, 0,  ..., 0, 0, 0],
     ...,
     [0, 0, 0,  ..., 0, 0, 0],
     [0, 0, 0,  ..., 0, 0, 0],
     [0, 0, 0,  ..., 0, 0, 0]]], dtype=torch.uint8), 'image_id': tensor([1]), 'area': tensor([3600.0029, 3445.0051]), 'iscrowd': tensor([0, 0])}]]

Why is the dataloader output expected to be a mapping?

@zhangjiajin
Member

Is the data in COCO format?

Also, for detection you need to refer to this code: https://github.com/huawei-noah/vega/blob/39741b5ddd9623f0984599d7f52ea38ef6f253c1/vega/tools/detection_inference.py
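
For what it's worth, the TypeError itself comes from the batch layout: the detection dataloader yields [images, targets] as two lists, so model(**batch) tries to unpack a list as keyword arguments. A minimal sketch of the expected call pattern, using torchvision's stock FasterRCNN as a stand-in for the searched SP-NAS model (assumed to follow the same detection interface):

    import torch
    from torchvision.models.detection import fasterrcnn_resnet50_fpn

    # Stand-in model; the real one would be rebuilt from desc_4.json / model_4.pth.
    # Note: constructing it with default settings may download ImageNet backbone weights.
    model = fasterrcnn_resnet50_fpn(num_classes=2)
    model.eval()

    # One fake batch in the same [images, targets] layout printed above.
    images = [torch.rand(3, 600, 800)]
    with torch.no_grad():
        outputs = model(images)   # pass the image list positionally, not **batch
    print(outputs[0].keys())      # dict_keys(['boxes', 'labels', 'scores'])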
