Model Won't Train on Server #18

Closed
Dickoabc123 opened this issue Nov 10, 2023 · 1 comment
Labels: question (Further information is requested)


Hi there, I am trying to train a yolov3_tiny model on a server and I get a few errors. Here are the steps I took and the resulting messages:

module load nvidia/sdk/21.3

cd /users/XXX/Hank_Darknet/darknet/build

cmake -DCMAKE_BUILD_TYPE=Release \
  -DCUDAToolkit_CUPTI_INCLUDE_DIR=/opt/software/nvidia/sdk/Linux_x86_64/21.3/cuda/11.2/extras/CUPTI/include \
  -DCMAKE_CXX_FLAGS="-I/opt/software/nvidia/sdk/Linux_x86_64/21.3/cuda/11.2/include" ..

make -j$(nproc)

This all seemed to build OK.

The darknet executable was not where I expected it to be; it was in the src folder inside the build folder.

I then tried to run the training command from the src folder:

./darknet detector train /users/XXX/Hank_Darknet/darknet/mydata/coco.data /users/XXX/Hank_Darknet/darknet/mydata/yolov3-tiny.cfg /users/XXX/Hank_Darknet/darknet/mydata/yolov3-tiny.conv.15 -map

It didn’t train as it aborted with the following:

CUDA runtime version 11020 (v11.2), driver version 11060 (v11.6)
cuDNN is DISABLED
=> NVIDIA A100-PCIE-40GB [#8.0], 39.4 GiB
OpenCV version: 4.5.5
Prepare additional network for mAP calculation...
0 : compute_capability = 800, cudnn_half = 0, GPU: NVIDIA A100-PCIE-40GB
net.optimized_memory = 0
mini_batch = 1, batch = 64, time_steps = 1, train = 0
layer filters size/strd(dil) input output
0 Create CUDA-stream - 0 conv 16 3 x 3/ 1 640 x 640 x 3 -> 640 x 640 x 16 0.354 BF
1 max 2x 2/ 2 640 x 640 x 16 -> 320 x 320 x 16 0.007 BF
2 conv 32 3 x 3/ 1 320 x 320 x 16 -> 320 x 320 x 32 0.944 BF
3 max 2x 2/ 2 320 x 320 x 32 -> 160 x 160 x 32 0.003 BF
4 conv 64 3 x 3/ 1 160 x 160 x 32 -> 160 x 160 x 64 0.944 BF
5 max 2x 2/ 2 160 x 160 x 64 -> 80 x 80 x 64 0.002 BF
6 conv 128 3 x 3/ 1 80 x 80 x 64 -> 80 x 80 x 128 0.944 BF
7 max 2x 2/ 2 80 x 80 x 128 -> 40 x 40 x 128 0.001 BF
8 conv 256 3 x 3/ 1 40 x 40 x 128 -> 40 x 40 x 256 0.944 BF
9 max 2x 2/ 2 40 x 40 x 256 -> 20 x 20 x 256 0.000 BF
10 conv 512 3 x 3/ 1 20 x 20 x 256 -> 20 x 20 x 512 0.944 BF
11 max 2x 2/ 1 20 x 20 x 512 -> 20 x 20 x 512 0.001 BF
12 conv 1024 3 x 3/ 1 20 x 20 x 512 -> 20 x 20 x1024 3.775 BF
13 conv 256 1 x 1/ 1 20 x 20 x1024 -> 20 x 20 x 256 0.210 BF
14 conv 512 3 x 3/ 1 20 x 20 x 256 -> 20 x 20 x 512 0.944 BF
15 conv 18 1 x 1/ 1 20 x 20 x 512 -> 20 x 20 x 18 0.007 BF
16 yolo
[yolo] params: iou loss: mse (2), iou_norm: 0.75, obj_norm: 1.00, cls_norm: 1.00, delta_norm: 1.00, scale_x_y: 1.00
17 route 13 -> 20 x 20 x 256
18 conv 128 1 x 1/ 1 20 x 20 x 256 -> 20 x 20 x 128 0.026 BF
19 upsample 2x 20 x 20 x 128 -> 40 x 40 x 128
20 route 19 8 -> 40 x 40 x 384
21 conv 256 3 x 3/ 1 40 x 40 x 384 -> 40 x 40 x 256 2.831 BF
22 conv 18 1 x 1/ 1 40 x 40 x 256 -> 40 x 40 x 18 0.015 BF
23 yolo
[yolo] params: iou loss: mse (2), iou_norm: 0.75, obj_norm: 1.00, cls_norm: 1.00, delta_norm: 1.00, scale_x_y: 1.00
Total BFLOPS 12.894
avg_outputs = 768866
Allocating workspace to transfer between CPU and GPU: 56.2 MiB
Remembering 1 class:
-> class #0 (Pole) will use colour #FF00FF
0 : compute_capability = 800, cudnn_half = 0, GPU: NVIDIA A100-PCIE-40GB
net.optimized_memory = 0
mini_batch = 1, batch = 64, time_steps = 1, train = 1
layer filters size/strd(dil) input output
0 conv 16 3 x 3/ 1 640 x 640 x 3 -> 640 x 640 x 16 0.354 BF
1 max 2x 2/ 2 640 x 640 x 16 -> 320 x 320 x 16 0.007 BF
2 conv 32 3 x 3/ 1 320 x 320 x 16 -> 320 x 320 x 32 0.944 BF
3 max 2x 2/ 2 320 x 320 x 32 -> 160 x 160 x 32 0.003 BF
4 conv 64 3 x 3/ 1 160 x 160 x 32 -> 160 x 160 x 64 0.944 BF
5 max 2x 2/ 2 160 x 160 x 64 -> 80 x 80 x 64 0.002 BF
6 conv 128 3 x 3/ 1 80 x 80 x 64 -> 80 x 80 x 128 0.944 BF
7 max 2x 2/ 2 80 x 80 x 128 -> 40 x 40 x 128 0.001 BF
8 conv 256 3 x 3/ 1 40 x 40 x 128 -> 40 x 40 x 256 0.944 BF
9 max 2x 2/ 2 40 x 40 x 256 -> 20 x 20 x 256 0.000 BF
10 conv 512 3 x 3/ 1 20 x 20 x 256 -> 20 x 20 x 512 0.944 BF
11 max 2x 2/ 1 20 x 20 x 512 -> 20 x 20 x 512 0.001 BF
12 conv 1024 3 x 3/ 1 20 x 20 x 512 -> 20 x 20 x1024 3.775 BF
13 conv 256 1 x 1/ 1 20 x 20 x1024 -> 20 x 20 x 256 0.210 BF
14 conv 512 3 x 3/ 1 20 x 20 x 256 -> 20 x 20 x 512 0.944 BF
15 conv 18 1 x 1/ 1 20 x 20 x 512 -> 20 x 20 x 18 0.007 BF
16 yolo
[yolo] params: iou loss: mse (2), iou_norm: 0.75, obj_norm: 1.00, cls_norm: 1.00, delta_norm: 1.00, scale_x_y: 1.00
17 route 13 -> 20 x 20 x 256
18 conv 128 1 x 1/ 1 20 x 20 x 256 -> 20 x 20 x 128 0.026 BF
19 upsample 2x 20 x 20 x 128 -> 40 x 40 x 128
20 route 19 8 -> 40 x 40 x 384
21 conv 256 3 x 3/ 1 40 x 40 x 384 -> 40 x 40 x 256 2.831 BF
22 conv 18 1 x 1/ 1 40 x 40 x 256 -> 40 x 40 x 18 0.015 BF
23 yolo
[yolo] params: iou loss: mse (2), iou_norm: 0.75, obj_norm: 1.00, cls_norm: 1.00, delta_norm: 1.00, scale_x_y: 1.00
Total BFLOPS 12.894
avg_outputs = 768866
Allocating workspace to transfer between CPU and GPU: 56.2 MiB
Loading weights from /users/XXX/Hank_Darknet/darknet/mydata/yolov3-tiny.conv.15...
seen 64, trained: 0 K-images (0 Kilo-batches_64)
Done! Loaded 15 layers from weights-file
Learning Rate: 0.001, Momentum: 0.9, Decay: 0.0005
Detection layer #16 is type 28 (yolo)
Detection layer #23 is type 28 (yolo)
mAP calculations will be every 100 iterations
weights will be saved every 1000 iterations
Resizing, random_coef = 1.40
928 x 928
Create 6 permanent cpu-threads
Allocating workspace to transfer between CPU and GPU: 118.3 MiB
Workspace begins at 0x14acee000000
loaded 64 images in 357.070 milliseconds
v3 (mse loss, Normalizer: (iou: 0.75, obj: 1.00, cls: 1.00) Region 16 Avg (IOU: 0.000000), count: 1, class_loss = 529.588867, iou_loss = 0.000000, total_loss = 529.588867
v3 (mse loss, Normalizer: (iou: 0.75, obj: 1.00, cls: 1.00) Region 23 Avg (IOU: 0.212021), count: 2, class_loss = 2539.934326, iou_loss = 22.478271, total_loss = 2562.412598
total_bbox=2, rewritten_bbox=0.000000%
terminate called after throwing an instance of 'cv::Exception'
what(): OpenCV(4.5.5) /users/acs03114/software/opencv/opencv-4.5.5_build/modules/highgui/src/window.cpp:1334: error: (-2:Unspecified error) The function is not implemented. Rebuild the library with Windows, GTK+ 2.x or Cocoa support. If you are on Ubuntu or Debian, install libgtk2.0-dev and pkg-config, then re-run cmake or configure script in function 'cvWaitKey'
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* A fatal error has been detected. Darknet will now exit.
* Errno 2: No such file or directory
* Error location: /users/XXX/Hank_Darknet/darknet/src/darknet.cpp, darknet_signal_handler(), line #443
* Error message: signal handler invoked for signal #6 (Aborted)
* Version v2.0-11-gd01e285a-dirty built on Nov 10 2023 12:45:29
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
backtrace (20 entries):
1/20: ./darknet(_Z13log_backtracev+0x21) [0x5329b1]
2/20: ./darknet(darknet_fatal_error+0x181) [0x532bc1]
3/20: /lib64/libc.so.6(+0x37400) [0x14add1562400]
4/20: /lib64/libc.so.6(gsignal+0x10f) [0x14add156237f]
5/20: /lib64/libc.so.6(abort+0x127) [0x14add154cdb5]
6/20: /opt/software/anaconda/python-3.9.7/2021.11/lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0xbc) [0x14ade6b91872]
7/20: /opt/software/anaconda/python-3.9.7/2021.11/lib/libstdc++.so.6(+0xacf6f) [0x14ade6b8ff6f]
8/20: /opt/software/anaconda/python-3.9.7/2021.11/lib/libstdc++.so.6(+0xacfb1) [0x14ade6b8ffb1]
9/20: /opt/software/anaconda/python-3.9.7/2021.11/lib/libstdc++.so.6(__cxa_rethrow+0) [0x14ade6b9019a]
10/20: /opt/software/anaconda/python-3.9.7/2021.11/opencv-4.5.5.20220304/lib64/libopencv_core.so.405(+0x96468) [0x14add2d7e468]
11/20: /opt/software/anaconda/python-3.9.7/2021.11/opencv-4.5.5.20220304/lib64/libopencv_core.so.405(_ZN2cv5errorEiRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPKcS9_i+0x5f) [0x14add2fb7f6f]
12/20: /opt/software/anaconda/python-3.9.7/2021.11/opencv-4.5.5.20220304/lib64/libopencv_highgui.so.405(cvWaitKey+0x122) [0x14add8085832]
13/20: /opt/software/anaconda/python-3.9.7/2021.11/opencv-4.5.5.20220304/lib64/libopencv_highgui.so.405(_ZN2cv9waitKeyExEi+0x135) [0x14add8086815]
14/20: /opt/software/anaconda/python-3.9.7/2021.11/opencv-4.5.5.20220304/lib64/libopencv_highgui.so.405(_ZN2cv7waitKeyEi+0x21) [0x14add8086911]
15/20: ./darknet(train_network_waitkey+0x431) [0x50f861]
16/20: ./darknet(train_detector+0x1c86) [0x4a0866]
17/20: ./darknet(_Z12run_detectoriPPc+0x875) [0x4a3f55]
18/20: ./darknet(main+0x4c3) [0x437283]
19/20: /lib64/libc.so.6(__libc_start_main+0xf3) [0x14add154e493]
20/20: ./darknet(_start+0x2e) [0x439c1e]

I then tried rebuilding with OpenCV disabled:

cmake -DCMAKE_BUILD_TYPE=Release \
  -DCUDAToolkit_CUPTI_INCLUDE_DIR=/opt/software/nvidia/sdk/Linux_x86_64/21.3/cuda/11.2/extras/CUPTI/include \
  -DCMAKE_CXX_FLAGS="-I/opt/software/nvidia/sdk/Linux_x86_64/21.3/cuda/11.2/include" \
  -DENABLE_OPENCV=OFF ..

But that didn’t work either.

Dickoabc123 added the question (Further information is requested) label Nov 10, 2023
stephanecharette (Collaborator) commented Nov 11, 2023

Your training command is wrong. Please see the FAQ (https://www.ccoderun.ca/programming/yolo_faq/#training_command) or the readme (https://github.com/hank-ai/darknet#training).
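The backtrace in the report ends in cvWaitKey, which this OpenCV build rejects because it has no GUI backend. The training command in the FAQ and readme appears to pass -dont_show, which asks Darknet not to open a display window. Below is a minimal sketch of what the corrected command might look like, assuming -dont_show is the missing piece. It reuses the paths from the report, where XXX is the elided user name, and it only prints the command rather than running it:

```shell
# Sketch only: builds and prints the training command, does not start training.
# Paths are copied from the report above; XXX is the elided user name.
data_dir=/users/XXX/Hank_Darknet/darknet/mydata

# Assumption: -dont_show is the flag the FAQ link refers to for headless servers.
train_cmd="./darknet detector -map -dont_show train $data_dir/coco.data $data_dir/yolov3-tiny.cfg $data_dir/yolov3-tiny.conv.15"

# Dry run: show the command. To train, run the printed line from the folder
# that contains the darknet executable.
echo "$train_cmd"
```

Check the exact form of the command against the linked FAQ before relying on it.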
