Model Wont Train on Server #18

Dickoabc123 · 2023-11-10T13:39:33Z

Hi there, I am trying to train a yolov3_tiny model on a server and I get a few errors. Here are the steps I took and the resulting messages:

module load nvidia/sdk/21.3

cd /users/XXX/Hank_Darknet/darknet/build

cmake -DCMAKE_BUILD_TYPE=Release \
-DCUDAToolkit_CUPTI_INCLUDE_DIR=/opt/software/nvidia/sdk/Linux_x86_64/21.3/cuda/11.2/extras/CUPTI/include \
-DCMAKE_CXX_FLAGS="-I/opt/software/nvidia/sdk/Linux_x86_64/21.3/cuda/11.2/include" ..

make -j$(nproc)

This all seemed to build OK.

The darknet binary was not where I expected it to be: it ended up in the src folder inside the build folder.

I then tried to run the training command from the src folder:

./darknet detector train /users/XXX/Hank_Darknet/darknet/mydata/coco.data /users/XXX/Hank_Darknet/darknet/mydata/yolov3-tiny.cfg /users/XXX/Hank_Darknet/darknet/mydata/yolov3-tiny.conv.15 -map

It didn’t train as it aborted with the following:

CUDA runtime version 11020 (v11.2), driver version 11060 (v11.6)
cuDNN is DISABLED
=> NVIDIA A100-PCIE-40GB [#8.0], 39.4 GiB
OpenCV version: 4.5.5
Prepare additional network for mAP calculation...
0 : compute_capability = 800, cudnn_half = 0, GPU: NVIDIA A100-PCIE-40GB
net.optimized_memory = 0
mini_batch = 1, batch = 64, time_steps = 1, train = 0
layer filters size/strd(dil) input output
0 Create CUDA-stream - 0
conv 16 3 x 3/ 1 640 x 640 x 3 -> 640 x 640 x 16 0.354 BF
1 max 2x 2/ 2 640 x 640 x 16 -> 320 x 320 x 16 0.007 BF
2 conv 32 3 x 3/ 1 320 x 320 x 16 -> 320 x 320 x 32 0.944 BF
3 max 2x 2/ 2 320 x 320 x 32 -> 160 x 160 x 32 0.003 BF
4 conv 64 3 x 3/ 1 160 x 160 x 32 -> 160 x 160 x 64 0.944 BF
5 max 2x 2/ 2 160 x 160 x 64 -> 80 x 80 x 64 0.002 BF
6 conv 128 3 x 3/ 1 80 x 80 x 64 -> 80 x 80 x 128 0.944 BF
7 max 2x 2/ 2 80 x 80 x 128 -> 40 x 40 x 128 0.001 BF
8 conv 256 3 x 3/ 1 40 x 40 x 128 -> 40 x 40 x 256 0.944 BF
9 max 2x 2/ 2 40 x 40 x 256 -> 20 x 20 x 256 0.000 BF
10 conv 512 3 x 3/ 1 20 x 20 x 256 -> 20 x 20 x 512 0.944 BF
11 max 2x 2/ 1 20 x 20 x 512 -> 20 x 20 x 512 0.001 BF
12 conv 1024 3 x 3/ 1 20 x 20 x 512 -> 20 x 20 x1024 3.775 BF
13 conv 256 1 x 1/ 1 20 x 20 x1024 -> 20 x 20 x 256 0.210 BF
14 conv 512 3 x 3/ 1 20 x 20 x 256 -> 20 x 20 x 512 0.944 BF
15 conv 18 1 x 1/ 1 20 x 20 x 512 -> 20 x 20 x 18 0.007 BF
16 yolo
[yolo] params: iou loss: mse (2), iou_norm: 0.75, obj_norm: 1.00, cls_norm: 1.00, delta_norm: 1.00, scale_x_y: 1.00
17 route 13 -> 20 x 20 x 256
18 conv 128 1 x 1/ 1 20 x 20 x 256 -> 20 x 20 x 128 0.026 BF
19 upsample 2x 20 x 20 x 128 -> 40 x 40 x 128
20 route 19 8 -> 40 x 40 x 384
21 conv 256 3 x 3/ 1 40 x 40 x 384 -> 40 x 40 x 256 2.831 BF
22 conv 18 1 x 1/ 1 40 x 40 x 256 -> 40 x 40 x 18 0.015 BF
23 yolo
[yolo] params: iou loss: mse (2), iou_norm: 0.75, obj_norm: 1.00, cls_norm: 1.00, delta_norm: 1.00, scale_x_y: 1.00
Total BFLOPS 12.894
avg_outputs = 768866
Allocating workspace to transfer between CPU and GPU: 56.2 MiB
Remembering 1 class:
-> class #0 (Pole) will use colour #FF00FF
0 : compute_capability = 800, cudnn_half = 0, GPU: NVIDIA A100-PCIE-40GB
net.optimized_memory = 0
mini_batch = 1, batch = 64, time_steps = 1, train = 1
layer filters size/strd(dil) input output
0 conv 16 3 x 3/ 1 640 x 640 x 3 -> 640 x 640 x 16 0.354 BF
1 max 2x 2/ 2 640 x 640 x 16 -> 320 x 320 x 16 0.007 BF
2 conv 32 3 x 3/ 1 320 x 320 x 16 -> 320 x 320 x 32 0.944 BF
3 max 2x 2/ 2 320 x 320 x 32 -> 160 x 160 x 32 0.003 BF
4 conv 64 3 x 3/ 1 160 x 160 x 32 -> 160 x 160 x 64 0.944 BF
5 max 2x 2/ 2 160 x 160 x 64 -> 80 x 80 x 64 0.002 BF
6 conv 128 3 x 3/ 1 80 x 80 x 64 -> 80 x 80 x 128 0.944 BF
7 max 2x 2/ 2 80 x 80 x 128 -> 40 x 40 x 128 0.001 BF
8 conv 256 3 x 3/ 1 40 x 40 x 128 -> 40 x 40 x 256 0.944 BF
9 max 2x 2/ 2 40 x 40 x 256 -> 20 x 20 x 256 0.000 BF
10 conv 512 3 x 3/ 1 20 x 20 x 256 -> 20 x 20 x 512 0.944 BF
11 max 2x 2/ 1 20 x 20 x 512 -> 20 x 20 x 512 0.001 BF
12 conv 1024 3 x 3/ 1 20 x 20 x 512 -> 20 x 20 x1024 3.775 BF
13 conv 256 1 x 1/ 1 20 x 20 x1024 -> 20 x 20 x 256 0.210 BF
14 conv 512 3 x 3/ 1 20 x 20 x 256 -> 20 x 20 x 512 0.944 BF
15 conv 18 1 x 1/ 1 20 x 20 x 512 -> 20 x 20 x 18 0.007 BF
16 yolo
[yolo] params: iou loss: mse (2), iou_norm: 0.75, obj_norm: 1.00, cls_norm: 1.00, delta_norm: 1.00, scale_x_y: 1.00
17 route 13 -> 20 x 20 x 256
18 conv 128 1 x 1/ 1 20 x 20 x 256 -> 20 x 20 x 128 0.026 BF
19 upsample 2x 20 x 20 x 128 -> 40 x 40 x 128
20 route 19 8 -> 40 x 40 x 384
21 conv 256 3 x 3/ 1 40 x 40 x 384 -> 40 x 40 x 256 2.831 BF
22 conv 18 1 x 1/ 1 40 x 40 x 256 -> 40 x 40 x 18 0.015 BF
23 yolo
[yolo] params: iou loss: mse (2), iou_norm: 0.75, obj_norm: 1.00, cls_norm: 1.00, delta_norm: 1.00, scale_x_y: 1.00
Total BFLOPS 12.894
avg_outputs = 768866
Allocating workspace to transfer between CPU and GPU: 56.2 MiB
Loading weights from /users/XXX/Hank_Darknet/darknet/mydata/yolov3-tiny.conv.15...
seen 64, trained: 0 K-images (0 Kilo-batches_64)
Done! Loaded 15 layers from weights-file
Learning Rate: 0.001, Momentum: 0.9, Decay: 0.0005
Detection layer #16 is type 28 (yolo)
Detection layer #23 is type 28 (yolo)
mAP calculations will be every 100 iterations
weights will be saved every 1000 iterations
Resizing, random_coef = 1.40
928 x 928
Create 6 permanent cpu-threads
Allocating workspace to transfer between CPU and GPU: 118.3 MiB
Workspace begins at 0x14acee000000
loaded 64 images in 357.070 milliseconds
v3 (mse loss, Normalizer: (iou: 0.75, obj: 1.00, cls: 1.00) Region 16 Avg (IOU: 0.000000), count: 1, class_loss = 529.588867, iou_loss = 0.000000, total_loss = 529.588867
v3 (mse loss, Normalizer: (iou: 0.75, obj: 1.00, cls: 1.00) Region 23 Avg (IOU: 0.212021), count: 2, class_loss = 2539.934326, iou_loss = 22.478271, total_loss = 2562.412598
total_bbox=2, rewritten_bbox=0.000000%
terminate called after throwing an instance of 'cv::Exception'
what(): OpenCV(4.5.5) /users/acs03114/software/opencv/opencv-4.5.5_build/modules/highgui/src/window.cpp:1334: error: (-2:Unspecified error) The function is not implemented. Rebuild the library with Windows, GTK+ 2.x or Cocoa support. If you are on Ubuntu or Debian, install libgtk2.0-dev and pkg-config, then re-run cmake or configure script in function 'cvWaitKey'
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* A fatal error has been detected. Darknet will now exit.
* Errno 2: No such file or directory
* Error location: /users/XXX/Hank_Darknet/darknet/src/darknet.cpp, darknet_signal_handler(), line #443
* Error message: signal handler invoked for signal #6 (Aborted)
* Version v2.0-11-gd01e285a-dirty built on Nov 10 2023 12:45:29
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
backtrace (20 entries):
1/20: ./darknet(_Z13log_backtracev+0x21) [0x5329b1]
2/20: ./darknet(darknet_fatal_error+0x181) [0x532bc1]
3/20: /lib64/libc.so.6(+0x37400) [0x14add1562400]
4/20: /lib64/libc.so.6(gsignal+0x10f) [0x14add156237f]
5/20: /lib64/libc.so.6(abort+0x127) [0x14add154cdb5]
6/20: /opt/software/anaconda/python-3.9.7/2021.11/lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0xbc) [0x14ade6b91872]
7/20: /opt/software/anaconda/python-3.9.7/2021.11/lib/libstdc++.so.6(+0xacf6f) [0x14ade6b8ff6f]
8/20: /opt/software/anaconda/python-3.9.7/2021.11/lib/libstdc++.so.6(+0xacfb1) [0x14ade6b8ffb1]
9/20: /opt/software/anaconda/python-3.9.7/2021.11/lib/libstdc++.so.6(__cxa_rethrow+0) [0x14ade6b9019a]
10/20: /opt/software/anaconda/python-3.9.7/2021.11/opencv-4.5.5.20220304/lib64/libopencv_core.so.405(+0x96468) [0x14add2d7e468]
11/20: /opt/software/anaconda/python-3.9.7/2021.11/opencv-4.5.5.20220304/lib64/libopencv_core.so.405(_ZN2cv5errorEiRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPKcS9_i+0x5f) [0x14add2fb7f6f]
12/20: /opt/software/anaconda/python-3.9.7/2021.11/opencv-4.5.5.20220304/lib64/libopencv_highgui.so.405(cvWaitKey+0x122) [0x14add8085832]
13/20: /opt/software/anaconda/python-3.9.7/2021.11/opencv-4.5.5.20220304/lib64/libopencv_highgui.so.405(_ZN2cv9waitKeyExEi+0x135) [0x14add8086815]
14/20: /opt/software/anaconda/python-3.9.7/2021.11/opencv-4.5.5.20220304/lib64/libopencv_highgui.so.405(_ZN2cv7waitKeyEi+0x21) [0x14add8086911]
15/20: ./darknet(train_network_waitkey+0x431) [0x50f861]
16/20: ./darknet(train_detector+0x1c86) [0x4a0866]
17/20: ./darknet(_Z12run_detectoriPPc+0x875) [0x4a3f55]
18/20: ./darknet(main+0x4c3) [0x437283]
19/20: /lib64/libc.so.6(__libc_start_main+0xf3) [0x14add154e493]
20/20: ./darknet(_start+0x2e) [0x439c1e]

I then tried rebuilding with OpenCV disabled:

cmake -DCMAKE_BUILD_TYPE=Release \
-DCUDAToolkit_CUPTI_INCLUDE_DIR=/opt/software/nvidia/sdk/Linux_x86_64/21.3/cuda/11.2/extras/CUPTI/include \
-DCMAKE_CXX_FLAGS="-I/opt/software/nvidia/sdk/Linux_x86_64/21.3/cuda/11.2/include" \
-DENABLE_OPENCV=OFF ..

But that didn’t work either.

stephanecharette · 2023-11-11T16:56:23Z

Your training command is wrong. Please see the FAQ (https://www.ccoderun.ca/programming/yolo_faq/#training_command) or the readme (https://github.com/hank-ai/darknet#training).
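
For anyone landing here with the same backtrace, this is a minimal sketch of the command form shown in the linked FAQ and readme, filled in with the paths from the original post. The relevant part is most likely `-dont_show`, which tells Darknet not to open an OpenCV window; the abort above comes from `cvWaitKey` in an OpenCV build without GUI support. The variable names (`DARKNET_DIR`, `DATA`, `CFG`, `WEIGHTS`, `TRAIN_ARGS`) are placeholders for illustration only, and whether the flag placement matters is an assumption, so check the FAQ for the exact form.

```shell
# Paths copied from the original post; adjust them to your own setup.
DARKNET_DIR=/users/XXX/Hank_Darknet/darknet
DATA="$DARKNET_DIR/mydata/coco.data"
CFG="$DARKNET_DIR/mydata/yolov3-tiny.cfg"
WEIGHTS="$DARKNET_DIR/mydata/yolov3-tiny.conv.15"

# -map keeps the periodic mAP calculation.
# -dont_show is assumed to be the missing piece: it skips the display window.
TRAIN_ARGS="detector -map -dont_show train $DATA $CFG $WEIGHTS"

# Print the command so it can be checked before running.
echo "./darknet $TRAIN_ARGS"

# Run only if the darknet binary is in the current directory.
# TRAIN_ARGS is left unquoted on purpose so the shell splits it into
# separate arguments (this relies on the paths containing no spaces).
if [ -x ./darknet ]; then
    ./darknet $TRAIN_ARGS
fi
```

Run it from the folder that contains the darknet binary (build/src in the original post).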

Dickoabc123 added the question (Further information is requested) label Nov 10, 2023

Dickoabc123 assigned stephanecharette Nov 10, 2023

stephanecharette closed this as completed Nov 11, 2023
