UnboundLocalError: local variable 'l1_loss' referenced before assignment #1738

Closed
johnlockejrr opened this issue Sep 30, 2024 · 41 comments
Labels
type: bug Something isn't working

Comments

@johnlockejrr

Bug description

While training a Doctr model with my own dataset, I encountered an UnboundLocalError in the compute_loss function of the differentiable_binarization module.

Code snippet to reproduce the bug

python references/detection/train_pytorch.py datasets/sam/train_out datasets/sam/val_out db_resnet50 --epochs 5 --device 0

Error traceback

Namespace(train_path='datasets/sam/train_out', val_path='datasets/sam/val_out', arch='db_resnet50', name=None, epochs=5, batch_size=2, device=0, save_interval_epoch=False, input_size=1024, lr=0.001, weight_decay=0, workers=None, resume=None, test_only=False, freeze_backbone=False, show_samples=False, wb=False, push_to_hub=False, pretrained=False, rotation=False, eval_straight=False, sched='poly', amp=False, find_lr=False, early_stop=False, early_stop_epochs=5, early_stop_delta=0.01)
Validation set loaded in 0.1627s (67 samples in 34 batches)
Train set loaded in 0.1016s (540 samples in 270 batches)
  0%|          | 0/270 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "/home/incognito/doctr/references/detection/train_pytorch.py", line 481, in <module>
    main(args)
  File "/home/incognito/doctr/references/detection/train_pytorch.py", line 388, in main
    fit_one_epoch(model, train_loader, batch_transforms, optimizer, scheduler, amp=args.amp)
  File "/home/incognito/doctr/references/detection/train_pytorch.py", line 126, in fit_one_epoch
    train_loss = model(images, targets)["loss"]
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/incognito/doctr/doctr/models/detection/differentiable_binarization/pytorch.py", line 216, in forward
    loss = self.compute_loss(logits, thresh_map, target)
  File "/home/incognito/doctr/doctr/models/detection/differentiable_binarization/pytorch.py", line 286, in compute_loss
    return l1_loss + focal_scale * focal_loss + dice_loss
UnboundLocalError: local variable 'l1_loss' referenced before assignment
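
For context, this is the classic Python failure mode of a name assigned only inside a guarded branch. A minimal, hypothetical sketch (not doctr's actual compute_loss) that reproduces the same message when a batch yields no valid target pixels:

import torch

def compute_loss(seg_mask: torch.Tensor, focal_scale: float = 10.0) -> torch.Tensor:
    if torch.any(seg_mask):  # skipped entirely when no target pixels survive preprocessing
        focal_loss = torch.tensor(0.5)
        dice_loss = torch.tensor(0.3)
        l1_loss = torch.tensor(0.1)
    # if the guard above never ran, none of the three names is bound:
    return l1_loss + focal_scale * focal_loss + dice_loss

compute_loss(torch.zeros(1, dtype=torch.bool))  # UnboundLocalError: l1_loss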

Environment

DocTR version: 0.9.1a0
TensorFlow version: N/A
PyTorch version: 2.4.1+cu121 (torchvision 0.19.1+cu121)
OpenCV version: 4.10.0
OS: Ubuntu 22.04.5 LTS
Python version: 3.10.12
Is CUDA available (TensorFlow): N/A
Is CUDA available (PyTorch): Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3060
Nvidia driver version: 561.09
cuDNN version: Could not collect

Deep Learning backend

>>> from doctr.file_utils import is_tf_available, is_torch_available
>>> print(f"is_tf_available: {is_tf_available()}")
is_tf_available: False
>>> print(f"is_torch_available: {is_torch_available()}")
is_torch_available: True

@johnlockejrr johnlockejrr added the type: bug Something isn't working label Sep 30, 2024
@johnlockejrr
Author

If needed, I can upload my dataset.

@johnlockejrr
Author

I tested this on a different environment and I get the same error:

DocTR version: 0.9.1a0
TensorFlow version: N/A
PyTorch version: 2.4.1+cu121 (torchvision 0.19.1+cu121)
OpenCV version: 4.10.0
OS: Ubuntu 22.04.5 LTS
Python version: 3.10.12
Is CUDA available (TensorFlow): N/A
Is CUDA available (PyTorch): Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4070
Nvidia driver version: 560.94
cuDNN version: Could not collect

Python 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from doctr.file_utils import is_tf_available, is_torch_available
>>> print(f"is_tf_available: {is_tf_available()}")
is_tf_available: False
>>> print(f"is_torch_available: {is_torch_available()}")
is_torch_available: True
>>>

@johnlockejrr
Author

I think I made a mistake. I just realized I used polygons from the original images, while the images in the dataset were mogrified... checking

@johnlockejrr
Author

johnlockejrr commented Sep 30, 2024

Working now, though much slower with big images. What height or width would be recommended for training?

(env-py3.10) incognito@DESKTOP-NHKR7QL:~/doctr$ python references/detection/train_pytorch.py datasets/sam/train_out datasets/sam/val_out db_resnet50 --epochs 5 --device 0
Namespace(train_path='datasets/sam/train_out', val_path='datasets/sam/val_out', arch='db_resnet50', name=None, epochs=5, batch_size=2, device=0, save_interval_epoch=False, input_size=1024, lr=0.001, weight_decay=0, workers=None, resume=None, test_only=False, freeze_backbone=False, show_samples=False, wb=False, push_to_hub=False, pretrained=False, rotation=False, eval_straight=False, sched='poly', amp=False, find_lr=False, early_stop=False, early_stop_epochs=5, early_stop_delta=0.01)
Validation set loaded in 0.4528s (67 samples in 34 batches)
Train set loaded in 0.07876s (540 samples in 270 batches)

Training loss: 0.658518:  78%|████████▉ | 211/270 [03:23<00:47,  1.24it/s]

EDIT: worked until killed:

(env-py3.10) incognito@DESKTOP-NHKR7QL:~/doctr$ python references/detection/train_pytorch.py datasets/sam/train_out datasets/sam/val_out db_resnet50 --epochs 5 --device 0
Namespace(train_path='datasets/sam/train_out', val_path='datasets/sam/val_out', arch='db_resnet50', name=None, epochs=5, batch_size=2, device=0, save_interval_epoch=False, input_size=1024, lr=0.001, weight_decay=0, workers=None, resume=None, test_only=False, freeze_backbone=False, show_samples=False, wb=False, push_to_hub=False, pretrained=False, rotation=False, eval_straight=False, sched='poly', amp=False, find_lr=False, early_stop=False, early_stop_epochs=5, early_stop_delta=0.01)
Validation set loaded in 0.4528s (67 samples in 34 batches)
Train set loaded in 0.07876s (540 samples in 270 batches)
Training loss: 0.643471: 100%|██████████| 270/270 [03:52<00:00,  1.16it/s]
100%|██████████| 34/34 [00:33<00:00,  1.00it/s]
Validation loss decreased inf --> 2.29916: saving state...
Epoch 1/5 - Validation loss: 2.29916 (Recall: 1.67% | Precision: 5.47% | Mean IoU: 9.00%)
  0%|          | 0/270 [00:20<?, ?it/s]
Traceback (most recent call last):
  File "/home/incognito/doctr/references/detection/train_pytorch.py", line 481, in <module>
    main(args)
  File "/home/incognito/doctr/references/detection/train_pytorch.py", line 388, in main
    fit_one_epoch(model, train_loader, batch_transforms, optimizer, scheduler, amp=args.amp)
  File "/home/incognito/doctr/references/detection/train_pytorch.py", line 109, in fit_one_epoch
    for images, targets in pbar:
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1327, in _next_data
    idx, data = self._get_data()
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1283, in _get_data
    success, data = self._try_get_data()
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1131, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.10/queue.py", line 180, in get
    self.not_empty.wait(remaining)
  File "/usr/lib/python3.10/threading.py", line 324, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 67, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 44881) is killed by signal: Killed.

@felixdittrich92
Contributor

Images are resized internally :)

Try to reduce/set the workers with --workers=<INT_DEPENDING_ON_YOUR_MACHINE>
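
For example (a value that worked later in this thread; pick one that fits your machine):

python references/detection/train_pytorch.py datasets/sam/train_out datasets/sam/val_out db_resnet50 --epochs 5 --device 0 --workers 2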

@johnlockejrr
Author

I just resized the images to x960 and recalculated the polygons, and everything goes smoothly. Anyway, my dataset is at line level; I'll give it a try :)

(env-py3.10) incognito@DESKTOP-NHKR7QL:~/doctr$ python references/detection/train_pytorch.py datasets/sam/train_out datasets/sam/val_out db_resnet50 --epochs 5 --device 0
Namespace(train_path='datasets/sam/train_out', val_path='datasets/sam/val_out', arch='db_resnet50', name=None, epochs=5, batch_size=2, device=0, save_interval_epoch=False, input_size=1024, lr=0.001, weight_decay=0, workers=None, resume=None, test_only=False, freeze_backbone=False, show_samples=False, wb=False, push_to_hub=False, pretrained=False, rotation=False, eval_straight=False, sched='poly', amp=False, find_lr=False, early_stop=False, early_stop_epochs=5, early_stop_delta=0.01)
Validation set loaded in 0.1393s (67 samples in 34 batches)
Train set loaded in 0.07748s (540 samples in 270 batches)
Training loss: 1.3698: 100%|██████████| 270/270 [01:34<00:00,  2.86it/s]
100%|██████████| 34/34 [00:13<00:00,  2.55it/s]
Validation loss decreased inf --> 0.674124: saving state...
Epoch 1/5 - Validation loss: 0.674124 (Recall: 4.78% | Precision: 3.38% | Mean IoU: 5.00%)
Training loss: 0.711258: 100%|██████████| 270/270 [01:28<00:00,  3.05it/s]
100%|██████████| 34/34 [00:09<00:00,  3.48it/s]
Epoch 2/5 - Validation loss: 0.817873 (Recall: 5.47% | Precision: 2.30% | Mean IoU: 3.00%)
Training loss: 0.563128: 100%|██████████| 270/270 [01:26<00:00,  3.11it/s]
100%|██████████| 34/34 [00:09<00:00,  3.52it/s]
Validation loss decreased 0.674124 --> 0.632917: saving state...
Epoch 3/5 - Validation loss: 0.632917 (Recall: 16.05% | Precision: 32.59% | Mean IoU: 29.00%)
Training loss: 0.610216: 100%|██████████| 270/270 [01:27<00:00,  3.07it/s]
100%|██████████| 34/34 [00:09<00:00,  3.50it/s]
Epoch 4/5 - Validation loss: 0.642417 (Recall: 21.75% | Precision: 11.35% | Mean IoU: 9.00%)
Training loss: 0.604278: 100%|██████████| 270/270 [01:27<00:00,  3.09it/s]
100%|██████████| 34/34 [00:09<00:00,  3.49it/s]
Validation loss decreased 0.632917 --> 0.565686: saving state...
Epoch 5/5 - Validation loss: 0.565686 (Recall: 43.27% | Precision: 46.25% | Mean IoU: 36.00%)

@felixdittrich92
Contributor

You should train longer :D But for only 5 epochs the metrics don't look wrong 👍

@johnlockejrr
Author

johnlockejrr commented Sep 30, 2024

Yes! I just wanted to be sure it runs; it was a first test. I'm happy with it anyway.

Just figuring out how to add my newly trained (*ish) model to the streamlit demo app :-|

EDIT: besides my datasets being line-level, I have another problem: they are mostly RTL. Should I do anything for it to work (like python-bidi etc.)? Does, say, Arabic or Hebrew require other features?

@felixdittrich92
Contributor

> Yes! I just wanted to be sure it runs; it was a first test. I'm happy with it anyway.
>
> Just figuring out how to add my newly trained (*ish) model to the streamlit demo app :-|

Curious to see how well this can work ^^

Currently we use anyascii (https://github.com/anyascii/anyascii), I think this should work!? :)
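
For reference, a tiny hedged sketch of what anyascii does (assuming the anyascii package is installed; the exact transliteration is up to the library):

from anyascii import anyascii

# transliterates any Unicode text to an ASCII approximation
print(anyascii("שלום"))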

@johnlockejrr
Author

> > Yes! I just wanted to be sure it runs; it was a first test. I'm happy with it anyway.
> > Just figuring out how to add my newly trained (*ish) model to the streamlit demo app :-|
>
> Curious to see how well this can work ^^
>
> Currently we use anyascii (https://github.com/anyascii/anyascii), I think this should work!? :)

Never used it; yes, I think it should.

@johnlockejrr
Author

Seems I can't load it as per https://mindee.github.io/doctr/using_doctr/custom_models_training.html :)

[screenshot]

@felixdittrich92
Contributor

felixdittrich92 commented Sep 30, 2024

You can :) You have to change the vocab with --vocab=..
See here for the predefined vocabs we have: https://github.com/mindee/doctr/blob/main/doctr/datasets/vocabs.py

The vocab should contain all the chars you have in your dataset (or more)
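
A small hedged sketch to check that coverage (the recognition labels.json layout, {"img.jpg": "text", ...}, and the file name are assumptions):

import json

from doctr.datasets import VOCABS

with open("labels.json", encoding="utf-8") as f:
    labels = json.load(f)

# every character that occurs in the dataset vs. the chosen vocab
dataset_chars = set("".join(labels.values()))
missing = dataset_chars - set(VOCABS["hebrew"])
print(f"characters missing from the vocab: {sorted(missing)}")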

@johnlockejrr
Author

johnlockejrr commented Sep 30, 2024

Oh, sorry, I'm new to this. I've mostly trained kraken, YOLOv8 and DocUFCN models.

But does it need a vocab for a detection model? I haven't trained a recognition model yet.

@felixdittrich92
Contributor

If none of the predefined vocabs fits, you can simply change:

vocab = VOCABS[args.vocab]

to vocab="abc" for example, but to load the model later you need the same string that defines your model's vocab :)

@felixdittrich92
Contributor

@johnlockejrr No, only for recognition model training

@johnlockejrr
Author

johnlockejrr commented Sep 30, 2024

Couldn't I load only the detection model to see how it performs on a new test image?

@johnlockejrr
Author

johnlockejrr commented Sep 30, 2024

> If none of the predefined vocabs fits, you can simply change:
>
> vocab = VOCABS[args.vocab]
>
> to vocab="abc" for example, but to load the model later you need the same string that defines your model's vocab :)

I just took a look at vocabs.py, and Hebrew has more characters than VOCABS["hebrew"] contains; the file should be amended sometime in the future.

@felixdittrich92
Contributor

> > If none of the predefined vocabs fits, you can simply change:
> >
> > vocab = VOCABS[args.vocab]
> >
> > to vocab="abc" for example, but to load the model later you need the same string that defines your model's vocab :)
>
> I just took a look at vocabs.py, and Hebrew has more characters than VOCABS["hebrew"] contains; the file should be amended sometime in the future.

Feel free to open a PR to add the missing chars 👍

@felixdittrich92
Contributor

felixdittrich92 commented Sep 30, 2024

> Couldn't I load only the detection model to see how it performs on a new test image?

Sure :)

Load your custom trained model (in combination with the ocr_predictor):

import torch

from doctr.models import db_resnet50, ocr_predictor

# Load custom detection model
det_model = db_resnet50(pretrained=False, pretrained_backbone=False)
det_params = torch.load('<path_to_pt>', map_location="cpu")
det_model.load_state_dict(det_params)
predictor = ocr_predictor(det_arch=det_model, reco_arch="vitstr_small", pretrained=True)

or only with the detection_predictor:

import requests
import cv2
import numpy as np
import torch

from doctr.io import DocumentFile
from doctr.models import detection_predictor, db_resnet50
from doctr.utils.geometry import detach_scores


# Convert relative coordinates to absolute pixel values
def _to_absolute(geom, img_shape: tuple[int, int]) -> list[list[int]]:
    h, w = img_shape
    if len(geom) == 2:  # Assume straight pages = True -> [[xmin, ymin], [xmax, ymax]]
        (xmin, ymin), (xmax, ymax) = geom
        xmin, xmax = int(round(w * xmin)), int(round(w * xmax))
        ymin, ymax = int(round(h * ymin)), int(round(h * ymax))
        return [[xmin, ymin], [xmax, ymin], [xmax, ymax], [xmin, ymax]]
    else:  # For polygons, convert each point to absolute coordinates
        return [[int(point[0] * w), int(point[1] * h)] for point in geom]


url = "https://www.francetvinfo.fr/pictures/uGwaNE-aJq7zHLhZJdzdCd9nyjE/1200x900/2021/03/16/phpCDwGn0.jpg"

# Load custom detection model
det_model = db_resnet50(pretrained=False, pretrained_backbone=False)
det_params = torch.load('<path_to_pt>', map_location="cpu")
det_model.load_state_dict(det_params)

det_predictor = detection_predictor(
    arch=det_model,
    pretrained=False,
    assume_straight_pages=True,
    symmetric_pad=True,
    preserve_aspect_ratio=True,
) #.cuda().half()  # Uncomment this line if you have a GPU

det_predictor.model.postprocessor.bin_thresh = 0.3
det_predictor.model.postprocessor.box_thresh = 0.65

img_bytes = requests.get(url).content  # download once and reuse

docs = DocumentFile.from_images([img_bytes])
results = det_predictor(docs)

image = cv2.imdecode(np.frombuffer(img_bytes, np.uint8), cv2.IMREAD_COLOR)

for doc, res in zip(docs, results):
    img_shape = (doc.shape[0], doc.shape[1])
    # Detach the probability scores from the results
    detached_coords, prob_scores = detach_scores([res.get("words")])

    for i, coords in enumerate(detached_coords[0]):
        coords = coords.reshape(2, 2).tolist() if coords.shape == (4, ) else coords.tolist()

        # Convert relative to absolute pixel coordinates
        points = np.array(_to_absolute(coords, img_shape), dtype=np.int32).reshape((-1, 1, 2))

        # Draw the bounding box on the image
        cv2.polylines(image, [points], isClosed=True, color=(255, 0, 0), thickness=2)

    # Save the modified image with bounding boxes
    cv2.imwrite("output.jpg", image)

@johnlockejrr
Author

Perfect! Thank you for all your help! I'll open a PR later today for a new language and amend the Hebrew one.

@felixdittrich92
Contributor

> Perfect! Thank you for all your help! I'll open a PR later today for a new language and amend the Hebrew one.

Reference PR to show what's required to update or add a vocab: https://github.com/mindee/doctr/pull/1700/files

@johnlockejrr
Author

johnlockejrr commented Sep 30, 2024

Very strange behavior with my model. Executing your script above:

(env-py3.10) incognito@DESKTOP-NHKR7QL:~/doctr$ python load_det_model.py
/home/incognito/doctr/load_det_model.py:27: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  det_params = torch.load('db_resnet50_20240930-142637.pt', map_location="cpu")
Traceback (most recent call last):
  File "/home/incognito/doctr/load_det_model.py", line 28, in <module>
    det_model.load_state_dict(det_params)
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2215, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DBNet:
        size mismatch for prob_head.6.weight: copying a param with shape torch.Size([64, 2, 2, 2]) from checkpoint, the shape in current model is torch.Size([64, 1, 2, 2]).
        size mismatch for prob_head.6.bias: copying a param with shape torch.Size([2]) from checkpoint, the shape in current model is torch.Size([1]).
        size mismatch for thresh_head.6.weight: copying a param with shape torch.Size([64, 2, 2, 2]) from checkpoint, the shape in current model is torch.Size([64, 1, 2, 2]).
        size mismatch for thresh_head.6.bias: copying a param with shape torch.Size([2]) from checkpoint, the shape in current model is torch.Size([1]).

Could this happen because I trained it on a line-level dataset?

@felixdittrich92
Contributor

Can you share one entry from the labels.json you used for training?

@johnlockejrr
Author

> Can you share one entry from the labels.json you used for training?

Sure:

{"81_dc946_default.jpg": {"img_dimensions": [720, 960], "img_hash": "f04698acbbc7246475a8401dc031facf1d152c156cb1363217270cd7591e94d3", "polygons": {"textzone": [[[66, 153], [527, 153], [527, 709], [66, 709]]], "textline": [[[78, 161], [515, 161], [515, 188], [78, 188]], [[76, 180], [515, 180], [515, 207], [76, 207]], [[79, 201], [515, 201], [515, 229], [79, 229]], [[77, 221], [514, 221], [514, 250], [77, 250]], [[78, 242], [516, 242], [516, 273], [78, 273]], [[73, 264], [516, 264], [516, 292], [73, 292]], [[75, 287], [517, 287], [517, 313], [75, 313]], [[76, 307], [517, 307], [517, 335], [76, 335]], [[73, 327], [518, 327], [518, 356], [73, 356]], [[75, 350], [516, 350], [516, 377], [75, 377]], [[76, 388], [518, 388], [518, 417], [76, 417]], [[77, 412], [519, 412], [519, 437], [77, 437]], [[74, 434], [518, 434], [518, 457], [74, 457]], [[75, 452], [518, 452], [518, 478], [75, 478]], [[78, 472], [518, 472], [518, 499], [78, 499]], [[81, 493], [519, 493], [519, 519], [81, 519]], [[81, 514], [518, 514], [518, 540], [81, 540]], [[73, 535], [519, 535], [519, 560], [73, 560]], [[74, 556], [519, 556], [519, 581], [74, 581]], [[72, 576], [519, 576], [519, 602], [72, 602]], [[74, 596], [519, 596], [519, 624], [74, 624]], [[75, 618], [517, 618], [517, 647], [75, 647]], [[73, 637], [521, 637], [521, 666], [73, 666]], [[79, 658], [520, 658], [520, 686], [79, 686]], [[75, 680], [520, 680], [520, 714], [75, 714]]]}}, "136_7aab7_default.jpg": {"img_dimensions": [720, 960], "img_hash": "eac91c1193e188f4dd089705086e3e3dfd6bc5233d5ceb714c6082684a64ab06", "polygons": {"textzone": [[[183, 174], [621, 174], [621, 722], [183, 722]]], "textline": [[[188, 181], [615, 181], [615, 211], [188, 211]], [[187, 206], [614, 206], [614, 231], [187, 231]], [[184, 226], [613, 226], [613, 252], [184, 252]], [[188, 246], [614, 246], [614, 274], [188, 274]], [[188, 268], [615, 268], [615, 291], [188, 291]], [[189, 287], [615, 287], [615, 315], [189, 315]], [[188, 308], [614, 308], [614, 335], [188, 335]], [[188, 329], [616, 329], [616, 355], [188, 355]], [[187, 349], [616, 349], [616, 375], [187, 375]], [[186, 372], [616, 372], [616, 397], [186, 397]], [[186, 390], [616, 390], [616, 417], [186, 417]], [[188, 429], [618, 429], [618, 455], [188, 455]], [[189, 450], [619, 450], [619, 477], [189, 477]], [[189, 471], [619, 471], [619, 498], [189, 498]], [[189, 491], [619, 491], [619, 517], [189, 517]], [[190, 512], [618, 512], [618, 538], [190, 538]], [[190, 533], [620, 533], [620, 558], [190, 558]], [[189, 553], [619, 553], [619, 577], [189, 577]], [[192, 574], [616, 574], [616, 599], [192, 599]], [[191, 594], [620, 594], [620, 620], [191, 620]], [[191, 613], [619, 613], [619, 638], [191, 638]], [[193, 633], [619, 633], [619, 660], [193, 660]], [[190, 655], [620, 655], [620, 680], [190, 680]], [[189, 673], [619, 673], [619, 700], [189, 700]], [[186, 694], [618, 694], [618, 729], [186, 729]]]}},
...

@johnlockejrr
Author

Better: I can upload the labels.json of the val set, since it's smaller than the train one.

labels.json

@felixdittrich92
Contributor

Ah I see, you trained a KIE model 😅

To train only a detection model, "polygons" shouldn't be a dict; just the polygons as the value, like:

"polygons": [[[66, 153], [527, 153], [527, 709], [66, 709]], .....]

@johnlockejrr
Author

OMG! :)

@felixdittrich92
Contributor

> OMG! :)

I think this wasn't planned, right? ^^

@johnlockejrr
Author

johnlockejrr commented Sep 30, 2024

For a detection model, can't I specify more class names? As I have textzones and textlines.
Or better, should I just remove the textzone class and keep the textlines?

@felixdittrich92
Contributor

> For a detection model, can't I specify more class names? As I have textzones and textlines.

You can also load this model with:

det_model = db_resnet50(pretrained=False, pretrained_backbone=False, class_names=['textzone', 'textline'])
det_params = torch.load('<path_to_pt>', map_location="cpu")
det_model.load_state_dict(det_params)
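
One hedged note (inferred from how the multi-class results are read later in this thread, not from the docs): with class_names set, the result dict is keyed by your class names rather than "words", so reading the detections would look like:

detached_coords, prob_scores = detach_scores([res.get("textline")])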

@johnlockejrr
Author

> > For a detection model, can't I specify more class names? As I have textzones and textlines.
>
> You can also load this model with:
>
> det_model = db_resnet50(pretrained=False, pretrained_backbone=False, class_names=['textzone', 'textline'])
> det_params = torch.load('<path_to_pt>', map_location="cpu")
> det_model.load_state_dict(det_params)

Bad day :)

/home/incognito/doctr/load_det_model-kie.py:28: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  det_params = torch.load('db_resnet50_20240930-142637.pt', map_location="cpu")
Traceback (most recent call last):
  File "/home/incognito/doctr/load_det_model-kie.py", line 50, in <module>
    detached_coords, prob_scores = detach_scores([res.get("words")])
  File "/home/incognito/doctr/doctr/utils/geometry.py", line 79, in detach_scores
    loc_preds, obj_scores = zip(*(_detach(box) for box in boxes))
  File "/home/incognito/doctr/doctr/utils/geometry.py", line 79, in <genexpr>
    loc_preds, obj_scores = zip(*(_detach(box) for box in boxes))
  File "/home/incognito/doctr/doctr/utils/geometry.py", line 75, in _detach
    if boxes.ndim == 2:
AttributeError: 'NoneType' object has no attribute 'ndim'

@johnlockejrr
Author

johnlockejrr commented Sep 30, 2024

I think I should re-train it :)

The error is on the line detached_coords, prob_scores = detach_scores([res.get("words")])

If it's a KIE model, shouldn't I from doctr.models import kie_predictor?

I changed the line to detached_coords, prob_scores = detach_scores([res.get("textline")])

But I get nothing; the script runs but there are no detections.

detached_coords -> [array([], shape=(0, 4), dtype=float32)]

@johnlockejrr
Author

I reconverted my data to:

{"215_67426_default.jpg": {"img_dimensions": [720, 960], "img_hash": "f4da2a0dcdcd28dbc08609bac090f465ee5d7b471fa42024da0a11e79acade60", "polygons": [[[72, 162], [514, 162], [514, 194], [72, 194]], [[69, 188], [514, 188], [514, 216], [69, 216]], [[69, 209], [514, 209], [514, 238], [69, 238]], [[69, 231], [514, 231], [514, 259], [69, 259]], [[69, 251], [514, 251], [514, 283], [69, 283]], [[70, 274], [515, 274], [515, 299], [70, 299]], [[70, 293], [515, 293], [515, 322], [70, 322]], [[69, 314], [516, 314], [516, 340], [69, 340]], [[69, 335], [516, 335], [516, 364], [69, 364]], [[67, 355], [516, 355], [516, 386], [67, 386]], [[69, 392], [517, 392], [517, 427], [69, 427]], [[70, 420], [514, 420], [514, 447], [70, 447]], [[70, 441], [517, 441], [517, 468], [70, 468]], [[70, 462], [517, 462], [517, 493], [70, 493]], [[70, 483], [518, 483], [518, 511], [70, 511]], [[77, 504], [519, 504], [519, 534], [77, 534]], [[65, 526], [520, 526], [520, 555], [65, 555]], [[69, 547], [519, 547], [519, 578], [69, 578]], [[69, 570], [521, 570], [521, 598], [69, 598]], [[71, 590], [520, 590], [520, 619], [71, 619]], [[65, 612], [521, 612], [521, 642], [65, 642]], [[70, 635], [521, 635], [521, 663], [70, 663]], [[70, 660], [522, 660], [522, 684], [70, 684]], [[66, 677], [522, 677], [522, 703], [66, 703]], [[70, 698], [522, 698], [522, 727], [70, 727]], [[67, 716], [199, 716], [199, 741], [67, 741]]]}, "545_4408b_default.jpg": {"img_dimensions": [720, 960], "img_hash": "21c0f7326a7821b77b2a5e49e76017e60555dd40670005863a20a13d2803748d", "polygons": [[[107, 179], [507, 179], [507, 207], [107, 207]], [[107, 200], [510, 200], [510, 226], [107, 226]], [[105, 220], [509, 220], [509, 245], [105, 245]], [[109, 243], [510, 243], [510, 262], [109, 262]], [[106, 259], [510, 259], [510, 282], [106, 282]], [[106, 277], [510, 277], [510, 301], [106, 301]], [[106, 299], [510, 299], [510, 319], [106, 319]], [[103, 315], [510, 315], [510, 338], [103, 338]], [[103, 333], [510, 333], [510, 358], [103, 358]], [[101, 354], [510, 354], [510, 379], [101, 379]], [[104, 373], [509, 373], [509, 398], [104, 398]], [[101, 390], [510, 390], [510, 416], [101, 416]], [[103, 412], [511, 412], [511, 431], [103, 431]], [[104, 430], [511, 430], [511, 455], [104, 455]], [[101, 450], [510, 450], [510, 475], [101, 475]], [[104, 469], [510, 469], [510, 495], [104, 495]], [[104, 489], [509, 489], [509, 514], [104, 514]], [[104, 507], [510, 507], [510, 533], [104, 533]], [[104, 528], [510, 528], [510, 553], [104, 553]], [[103, 549], [511, 549], [511, 572], [103, 572]], [[103, 565], [509, 565], [509, 591], [103, 591]], [[103, 584], [511, 584], [511, 611], [103, 611]], [[101, 602], [511, 602], [511, 629], [101, 629]], [[99, 622], [512, 622], [512, 650], [99, 650]], [[105, 660], [512, 660], [512, 693], [105, 693]], [[103, 684], [202, 684], [202, 710], [103, 710]]]},

I'll retrain :)

@johnlockejrr
Author

johnlockejrr commented Sep 30, 2024

(env-py3.10) incognito@DESKTOP-NHKR7QL:~/doctr$ python references/detection/train_pytorch.py datasets/sam/train_out datasets/sam/val_out db_resnet50 --epochs 10 --device 0
Namespace(train_path='datasets/sam/train_out', val_path='datasets/sam/val_out', arch='db_resnet50', name=None, epochs=10, batch_size=2, device=0, save_interval_epoch=False, input_size=1024, lr=0.001, weight_decay=0, workers=None, resume=None, test_only=False, freeze_backbone=False, show_samples=False, wb=False, push_to_hub=False, pretrained=False, rotation=False, eval_straight=False, sched='poly', amp=False, find_lr=False, early_stop=False, early_stop_epochs=5, early_stop_delta=0.01)
Validation set loaded in 0.1427s (67 samples in 34 batches)
Train set loaded in 0.0208s (540 samples in 270 batches)
Training loss: 0.29681: 100%|██████████| 270/270 [01:06<00:00,  4.07it/s]
100%|██████████| 34/34 [00:09<00:00,  3.59it/s]
Validation loss decreased inf --> 0.362736: saving state...
Epoch 1/10 - Validation loss: 0.362736 (Recall: 98.02% | Precision: 85.08% | Mean IoU: 65.00%)
Training loss: 0.321628: 100%|██████████| 270/270 [01:02<00:00,  4.29it/s]
100%|██████████| 34/34 [00:06<00:00,  5.49it/s]
Epoch 2/10 - Validation loss: 0.372804 (Recall: 95.15% | Precision: 84.16% | Mean IoU: 63.00%)
Training loss: 0.406969: 100%|██████████| 270/270 [01:03<00:00,  4.24it/s]
100%|██████████| 34/34 [00:06<00:00,  5.43it/s]
Validation loss decreased 0.362736 --> 0.33441: saving state...
Epoch 3/10 - Validation loss: 0.33441 (Recall: 92.34% | Precision: 75.74% | Mean IoU: 52.00%)
Training loss: 0.508775: 100%|██████████| 270/270 [01:02<00:00,  4.29it/s]
100%|██████████| 34/34 [00:06<00:00,  5.54it/s]
Epoch 4/10 - Validation loss: 0.354248 (Recall: 98.68% | Precision: 80.43% | Mean IoU: 64.00%)
Training loss: 0.389871: 100%|██████████| 270/270 [01:03<00:00,  4.28it/s]
100%|██████████| 34/34 [00:06<00:00,  5.54it/s]
Validation loss decreased 0.33441 --> 0.316777: saving state...
Epoch 5/10 - Validation loss: 0.316777 (Recall: 98.68% | Precision: 89.18% | Mean IoU: 70.00%)
Training loss: 0.36966: 100%|██████████| 270/270 [01:02<00:00,  4.30it/s]
100%|██████████| 34/34 [00:06<00:00,  5.60it/s]
Validation loss decreased 0.316777 --> 0.308347: saving state...
Epoch 6/10 - Validation loss: 0.308347 (Recall: 97.19% | Precision: 81.19% | Mean IoU: 59.00%)
Training loss: 0.31847: 100%|██████████| 270/270 [01:03<00:00,  4.25it/s]
100%|██████████| 34/34 [00:06<00:00,  5.49it/s]
Validation loss decreased 0.308347 --> 0.285198: saving state...
Epoch 7/10 - Validation loss: 0.285198 (Recall: 98.08% | Precision: 87.41% | Mean IoU: 67.00%)
Training loss: 0.202373:  11%|██▎       | 31/270 [00:08<01:05,  3.67it/s]
Traceback (most recent call last):
  File "/home/incognito/doctr/references/detection/train_pytorch.py", line 481, in <module>
    main(args)
  File "/home/incognito/doctr/references/detection/train_pytorch.py", line 388, in main
    fit_one_epoch(model, train_loader, batch_transforms, optimizer, scheduler, amp=args.amp)
  File "/home/incognito/doctr/references/detection/train_pytorch.py", line 109, in fit_one_epoch
    for images, targets in pbar:
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1324, in _next_data
    return self._process_data(data)
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
    data.reraise()
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/_utils.py", line 706, in reraise
    raise exception
UnboundLocalError: Caught UnboundLocalError in DataLoader worker process 15.
Original Traceback (most recent call last):
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/incognito/doctr/doctr/datasets/datasets/base.py", line 67, in __getitem__
    img_transformed, target[class_name] = self.sample_transforms(img, bboxes)
  File "/home/incognito/doctr/doctr/transforms/modules/base.py", line 56, in __call__
    x, target = t(x, target)
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/incognito/doctr/doctr/transforms/modules/pytorch.py", line 87, in forward
    target[:, [0, 2]] = offset[0] + target[:, [0, 2]] * raw_shape[-1] / img.shape[-1]
UnboundLocalError: local variable 'offset' referenced before assignment

I resumed it and it finished:

(env-py3.10) incognito@DESKTOP-NHKR7QL:~/doctr$ python references/detection/train_pytorch.py datasets/sam/train_out datasets/sam/val_out db_resnet50 --epochs 5 --device 0 --resume ./db_resnet50_20240930-162432.pt --workers 2
Namespace(train_path='datasets/sam/train_out', val_path='datasets/sam/val_out', arch='db_resnet50', name=None, epochs=5, batch_size=2, device=0, save_interval_epoch=False, input_size=1024, lr=0.001, weight_decay=0, workers=2, resume='./db_resnet50_20240930-162432.pt', test_only=False, freeze_backbone=False, show_samples=False, wb=False, push_to_hub=False, pretrained=False, rotation=False, eval_straight=False, sched='poly', amp=False, find_lr=False, early_stop=False, early_stop_epochs=5, early_stop_delta=0.01)
Validation set loaded in 0.1605s (67 samples in 34 batches)
Resuming ./db_resnet50_20240930-162432.pt
/home/incognito/doctr/references/detection/train_pytorch.py:228: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint = torch.load(args.resume, map_location="cpu")
Train set loaded in 0.07673s (540 samples in 270 batches)
Training loss: 0.342384: 100%|██████████| 270/270 [01:04<00:00,  4.20it/s]
100%|██████████| 34/34 [00:08<00:00,  4.06it/s]
Validation loss decreased inf --> 0.333333: saving state...
Epoch 1/5 - Validation loss: 0.333333 (Recall: 98.32% | Precision: 84.99% | Mean IoU: 64.00%)
Training loss: 0.285108: 100%|██████████| 270/270 [01:02<00:00,  4.35it/s]
100%|██████████| 34/34 [00:05<00:00,  6.64it/s]
Validation loss decreased 0.333333 --> 0.298129: saving state...
Epoch 2/5 - Validation loss: 0.298129 (Recall: 97.84% | Precision: 90.08% | Mean IoU: 67.00%)
Training loss: 0.241384: 100%|██████████| 270/270 [01:01<00:00,  4.40it/s]
100%|██████████| 34/34 [00:05<00:00,  6.66it/s]
Validation loss decreased 0.298129 --> 0.234458: saving state...
Epoch 3/5 - Validation loss: 0.234458 (Recall: 98.80% | Precision: 81.85% | Mean IoU: 71.00%)
Training loss: 0.238148: 100%|██████████| 270/270 [01:01<00:00,  4.37it/s]
100%|██████████| 34/34 [00:05<00:00,  6.72it/s]
Epoch 4/5 - Validation loss: 0.238532 (Recall: 98.50% | Precision: 86.95% | Mean IoU: 75.00%)
Training loss: 0.237705: 100%|██████████| 270/270 [01:02<00:00,  4.34it/s]
100%|██████████| 34/34 [00:05<00:00,  6.62it/s]
Validation loss decreased 0.234458 --> 0.20468: saving state...
Epoch 5/5 - Validation loss: 0.20468 (Recall: 98.98% | Precision: 89.64% | Mean IoU: 80.00%)

@felixdittrich92
Contributor

> (quoted: the training log and UnboundLocalError traceback from the previous comment)

That's a known issue; a PR to fix it is on the way :)
#1715
CC @odulcy-mindee

@johnlockejrr
Author

It performs well(*ish) with your script above, but any idea why it identifies only one line?

[output image]

@felixdittrich92
Contributor

> It performs well(*ish) with your script above, but any idea why it identifies only one line?
>
> [output image]

What's the shape of the model output?

@felixdittrich92
Contributor

Btw, in my provided script, lower bin_thresh and box_thresh to 0.1.
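
That is, in the detection script shared earlier in this thread:

det_predictor.model.postprocessor.bin_thresh = 0.1
det_predictor.model.postprocessor.box_thresh = 0.1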

@johnlockejrr
Author

I trained the model on x960 images; when detecting, should I use the same resolution?

@felixdittrich92
Contributor

> I trained the model on x960 images; when detecting, should I use the same resolution?

If you resized them yourself beforehand, that would make sense, yep.

@johnlockejrr
Author

johnlockejrr commented Sep 30, 2024

I resized the image to x960. I think it needs more training.

[output image]

@mindee mindee locked and limited conversation to collaborators Oct 1, 2024
@felixdittrich92 felixdittrich92 converted this issue into discussion #1739 Oct 1, 2024
