segfault when training a network #48

Closed
stephanecharette opened this issue Feb 18, 2024 · 3 comments
Labels: bug (Something isn't working)

@stephanecharette (Collaborator)

Following the recent cleanup in free_layer_custom(), we're now seeing a segfault while training a network. Reported by CrazyBoris on Discord:

darknet detector train ships.data yolov4-custom.cfg -map
Darknet v2.0-108-g4eaec0f7
...
[yolo] params: iou loss: ciou (4), iou_norm: 0.07, obj_norm: 1.00, cls_norm: 1.00, delta_norm: 1.00, scale_x_y: 1.05
nms_kind: greedynms (1), beta = 0.600000 
Total BFLOPS 59.563 
avg_outputs = 489778 
Allocating workspace to transfer between CPU and GPU:  50.0 MiB

* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
* Error location: /home/borys/Tools/darknet/src-cli/darknet.cpp, darknet_signal_handler(), line #431
* Error message:  signal handler invoked for signal #11 (Segmentation fault)
* Version v2.0-108-g4eaec0f7 built on Feb 17 2024 00:31:24
* * * * * * * * * * * * * * * * * * * * * * * * * * * * *
backtrace (11 entries):
1/11: darknet(_Z13log_backtracev+0x38) [0x5f0b4d1b6568]
2/11: darknet(darknet_fatal_error+0x208) [0x5f0b4d1b6818]
3/11: /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x74ef5ce42520]
4/11: /lib/x86_64-linux-gnu/libc.so.6(free+0x1e) [0x74ef5cea53fe]
5/11: darknet(free_layer_custom+0x8ef) [0x5f0b4d17827f]
6/11: darknet(train_detector+0x3220) [0x5f0b4d1150c0]
7/11: darknet(_Z12run_detectoriPPc+0xb3d) [0x5f0b4d11829d]
8/11: darknet(main+0x44f) [0x5f0b4d09790f]
9/11: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x74ef5ce29d90]
10/11: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x74ef5ce29e40]
11/11: darknet(_start+0x25) [0x5f0b4d09a8a5]
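
Frames 1 and 7 of the backtrace above are printed with their mangled C++ names. As a reading aid, here is a minimal sketch of how such names can be demangled, assuming a GCC or Clang toolchain that provides `abi::__cxa_demangle` in `<cxxabi.h>`. The helper name `demangle` is hypothetical and is not part of Darknet.

```cpp
#include <cxxabi.h>   // abi::__cxa_demangle (GCC/Clang, Itanium C++ ABI)
#include <cstdlib>    // std::free
#include <memory>
#include <string>

// Hypothetical helper (not part of Darknet): return the demangled form of an
// Itanium-ABI symbol name, or the input unchanged if it does not look mangled.
std::string demangle(const std::string & mangled)
{
    // Itanium-ABI mangled names start with "_Z"; leave everything else alone.
    if (mangled.rfind("_Z", 0) != 0)
    {
        return mangled;
    }

    int status = 0;
    // __cxa_demangle returns a malloc()'d buffer that the caller must free.
    std::unique_ptr<char, decltype(&std::free)> buffer(
        abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status),
        &std::free);

    // status == 0 means success; on any failure fall back to the raw name.
    return (status == 0 && buffer) ? std::string(buffer.get()) : mangled;
}
```

With this helper, `_Z13log_backtracev` reads as `log_backtrace()` and `_Z12run_detectoriPPc` reads as `run_detector(int, char**)`. Names such as `free_layer_custom` and `train_detector` are already plain and pass through unchanged.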
stephanecharette added the bug label Feb 18, 2024
stephanecharette self-assigned this Feb 18, 2024

bortyr commented Feb 18, 2024

Hi. I am the one who reported this on Discord. Here is some more general information about my setup:

  • Pop!_OS 22.04 LTS x86_64; Kernel: 6.6.10-76060610-generic
  • GPU: RTX 4080 Super; CPU: AMD Ryzen 5 5600X (12) @ 4.200GHz; RAM: 32 GB

Darknet v2.0-108-g4eaec0f7
CUDA runtime version 12030 (v12.3), driver version 12030 (v12.3)
cuDNN version 12020 (v8.9.7), use of half-size floats is ENABLED
=> 0: NVIDIA Graphics Device [#8.9], 15.7 GiB
OpenCV v4.9.0

So running:

darknet detector train ships.data yolov4-custom.cfg

works, but adding -map at the end causes a segmentation fault. However, this does not happen with:

  • yolov4 tiny
  • yolov7x

Both work with and without -map.

I am using the cfg from the repo, adjusted to my dataset according to the guidelines. My dataset has roughly 500 items in the training set and 100 in the validation set. Later, for inference, yolov4-tiny and yolov7x work fine. Only yolov4 causes issues.

If I can help with anything here, let me know. Thanks for looking into it.

@stephanecharette (Collaborator, Author)

Reproduced the problem locally. I believe this is now fixed with commit hash 6ccabd7.
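
The contents of that commit are not shown in this thread. As a generic illustration only: a crash inside free() called from a cleanup routine, as in frames 4 and 5 of the backtrace, is a common symptom of releasing the same pointer twice or releasing a pointer that was never allocated. The sketch below shows one defensive pattern for that class of bug. `DemoLayer`, `free_and_reset`, and `free_demo_layer` are hypothetical names, not Darknet's real types or functions, and this is not claimed to be the actual fix.

```cpp
#include <cstdlib>   // std::malloc, std::free

// Hypothetical stand-in for a network layer; NOT Darknet's real layer struct.
struct DemoLayer
{
    float * output  = nullptr;
    float * weights = nullptr;
};

// Free a buffer and reset the owner's pointer, so that a second cleanup pass
// sees nullptr (free(nullptr) is a no-op) instead of a dangling address.
template <typename T>
void free_and_reset(T *& ptr)
{
    std::free(ptr);
    ptr = nullptr;
}

// Hypothetical cleanup routine that is safe to call more than once
// on the same layer object.
void free_demo_layer(DemoLayer & l)
{
    free_and_reset(l.output);
    free_and_reset(l.weights);
}
```

Note that resetting pointers only protects repeated cleanup of the same object. If two separate layer copies hold the same address, freeing both still fails, and the ownership of the buffer has to be decided explicitly.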

bortyr commented Feb 18, 2024

Yes, I confirm that it has been fixed. Thanks!
