UnboundLocalError: local variable 'l1_loss' referenced before assignment #1738

Closed
johnlockejrr opened this issue Sep 30, 2024 · 41 comments
Labels
type: bug Something isn't working

Comments

@johnlockejrr

Bug description

While training a Doctr model with my own dataset, I encountered an UnboundLocalError in the compute_loss function of the differentiable_binarization module.

Code snippet to reproduce the bug

python references/detection/train_pytorch.py datasets/sam/train_out datasets/sam/val_out db_resnet50 --epochs 5 --device 0

Error traceback

Namespace(train_path='datasets/sam/train_out', val_path='datasets/sam/val_out', arch='db_resnet50', name=None, epochs=5, batch_size=2, device=0, save_interval_epoch=False, input_size=1024, lr=0.001, weight_decay=0, workers=None, resume=None, test_only=False, freeze_backbone=False, show_samples=False, wb=False, push_to_hub=False, pretrained=False, rotation=False, eval_straight=False, sched='poly', amp=False, find_lr=False, early_stop=False, early_stop_epochs=5, early_stop_delta=0.01)
Validation set loaded in 0.1627s (67 samples in 34 batches)
Train set loaded in 0.1016s (540 samples in 270 batches)
  0%|          | 0/270 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "/home/incognito/doctr/references/detection/train_pytorch.py", line 481, in <module>
    main(args)
  File "/home/incognito/doctr/references/detection/train_pytorch.py", line 388, in main
    fit_one_epoch(model, train_loader, batch_transforms, optimizer, scheduler, amp=args.amp)
  File "/home/incognito/doctr/references/detection/train_pytorch.py", line 126, in fit_one_epoch
    train_loss = model(images, targets)["loss"]
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/incognito/doctr/doctr/models/detection/differentiable_binarization/pytorch.py", line 216, in forward
    loss = self.compute_loss(logits, thresh_map, target)
  File "/home/incognito/doctr/doctr/models/detection/differentiable_binarization/pytorch.py", line 286, in compute_loss
    return l1_loss + focal_scale * focal_loss + dice_loss
UnboundLocalError: local variable 'l1_loss' referenced before assignment
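
For context, this is the classic Python failure mode of a name assigned only inside a guarded branch. A minimal, hypothetical sketch (not doctr's actual compute_loss) that reproduces the same message when a batch yields no valid target pixels:

import torch

def compute_loss(seg_mask: torch.Tensor, focal_scale: float = 10.0) -> torch.Tensor:
    if torch.any(seg_mask):  # skipped entirely when no target pixels survive preprocessing
        focal_loss = torch.tensor(0.5)
        dice_loss = torch.tensor(0.3)
        l1_loss = torch.tensor(0.1)
    # if the guard above never ran, none of the three names is bound:
    return l1_loss + focal_scale * focal_loss + dice_loss

compute_loss(torch.zeros(1, dtype=torch.bool))  # UnboundLocalError: l1_loss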

Environment

DocTR version: 0.9.1a0
TensorFlow version: N/A
PyTorch version: 2.4.1+cu121 (torchvision 0.19.1+cu121)
OpenCV version: 4.10.0
OS: Ubuntu 22.04.5 LTS
Python version: 3.10.12
Is CUDA available (TensorFlow): N/A
Is CUDA available (PyTorch): Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3060
Nvidia driver version: 561.09
cuDNN version: Could not collect

Deep Learning backend

>>> from doctr.file_utils import is_tf_available, is_torch_available
>>> print(f"is_tf_available: {is_tf_available()}")
is_tf_available: False
>>> print(f"is_torch_available: {is_torch_available()}")
is_torch_available: True

@johnlockejrr johnlockejrr added the type: bug Something isn't working label Sep 30, 2024
@johnlockejrr
Author

If needed, I can upload my dataset.

@johnlockejrr
Author

I tested this on a different environment and I get the same error:

DocTR version: 0.9.1a0
TensorFlow version: N/A
PyTorch version: 2.4.1+cu121 (torchvision 0.19.1+cu121)
OpenCV version: 4.10.0
OS: Ubuntu 22.04.5 LTS
Python version: 3.10.12
Is CUDA available (TensorFlow): N/A
Is CUDA available (PyTorch): Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4070
Nvidia driver version: 560.94
cuDNN version: Could not collect

Python 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from doctr.file_utils import is_tf_available, is_torch_available
>>> print(f"is_tf_available: {is_tf_available()}")
is_tf_available: False
>>> print(f"is_torch_available: {is_torch_available()}")
is_torch_available: True
>>>

@johnlockejrr
Author

I think I made a mistake. I just realized I used polygons from the original images, while the images in the dataset were mogrified... checking

@johnlockejrr
Author

johnlockejrr commented Sep 30, 2024

Working now, though much slower with big images. What height or width would be recommended for training?

(env-py3.10) incognito@DESKTOP-NHKR7QL:~/doctr$ python references/detection/train_pytorch.py datasets/sam/train_out datasets/sam/val_out db_resnet50 --epochs 5 --device 0
Namespace(train_path='datasets/sam/train_out', val_path='datasets/sam/val_out', arch='db_resnet50', name=None, epochs=5, batch_size=2, device=0, save_interval_epoch=False, input_size=1024, lr=0.001, weight_decay=0, workers=None, resume=None, test_only=False, freeze_backbone=False, show_samples=False, wb=False, push_to_hub=False, pretrained=False, rotation=False, eval_straight=False, sched='poly', amp=False, find_lr=False, early_stop=False, early_stop_epochs=5, early_stop_delta=0.01)
Validation set loaded in 0.4528s (67 samples in 34 batches)
Train set loaded in 0.07876s (540 samples in 270 batches)

Training loss: 0.658518:  78%|████████▉ | 211/270 [03:23<00:47,  1.24it/s]

EDIT: worked until killed:

(env-py3.10) incognito@DESKTOP-NHKR7QL:~/doctr$ python references/detection/train_pytorch.py datasets/sam/train_out datasets/sam/val_out db_resnet50 --epochs 5 --device 0
Namespace(train_path='datasets/sam/train_out', val_path='datasets/sam/val_out', arch='db_resnet50', name=None, epochs=5, batch_size=2, device=0, save_interval_epoch=False, input_size=1024, lr=0.001, weight_decay=0, workers=None, resume=None, test_only=False, freeze_backbone=False, show_samples=False, wb=False, push_to_hub=False, pretrained=False, rotation=False, eval_straight=False, sched='poly', amp=False, find_lr=False, early_stop=False, early_stop_epochs=5, early_stop_delta=0.01)
Validation set loaded in 0.4528s (67 samples in 34 batches)
Train set loaded in 0.07876s (540 samples in 270 batches)
Training loss: 0.643471: 100%|██████████| 270/270 [03:52<00:00,  1.16it/s]
100%|██████████| 34/34 [00:33<00:00,  1.00it/s]
Validation loss decreased inf --> 2.29916: saving state...
Epoch 1/5 - Validation loss: 2.29916 (Recall: 1.67% | Precision: 5.47% | Mean IoU: 9.00%)
  0%|          | 0/270 [00:20<?, ?it/s]
Traceback (most recent call last):
  File "/home/incognito/doctr/references/detection/train_pytorch.py", line 481, in <module>
    main(args)
  File "/home/incognito/doctr/references/detection/train_pytorch.py", line 388, in main
    fit_one_epoch(model, train_loader, batch_transforms, optimizer, scheduler, amp=args.amp)
  File "/home/incognito/doctr/references/detection/train_pytorch.py", line 109, in fit_one_epoch
    for images, targets in pbar:
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1327, in _next_data
    idx, data = self._get_data()
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1283, in _get_data
    success, data = self._try_get_data()
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1131, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.10/queue.py", line 180, in get
    self.not_empty.wait(remaining)
  File "/usr/lib/python3.10/threading.py", line 324, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/utils/data/_utils/signal_handling.py", line 67, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 44881) is killed by signal: Killed.

@felixdittrich92
Contributor

Images are resized internally :)

Try to reduce/set the workers with --workers=<INT_DEPENDING_ON_YOUR_MACHINE>
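
For example (a value that worked later in this thread; pick one that fits your machine):

python references/detection/train_pytorch.py datasets/sam/train_out datasets/sam/val_out db_resnet50 --epochs 5 --device 0 --workers 2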

@johnlockejrr
Author

I just resized the images to x960 and recalculated the polygons, and everything goes smoothly. Anyway, my dataset is at line level; I'll give it a try :)

(env-py3.10) incognito@DESKTOP-NHKR7QL:~/doctr$ python references/detection/train_pytorch.py datasets/sam/train_out datasets/sam/val_out db_resnet50 --epochs 5 --device 0
Namespace(train_path='datasets/sam/train_out', val_path='datasets/sam/val_out', arch='db_resnet50', name=None, epochs=5, batch_size=2, device=0, save_interval_epoch=False, input_size=1024, lr=0.001, weight_decay=0, workers=None, resume=None, test_only=False, freeze_backbone=False, show_samples=False, wb=False, push_to_hub=False, pretrained=False, rotation=False, eval_straight=False, sched='poly', amp=False, find_lr=False, early_stop=False, early_stop_epochs=5, early_stop_delta=0.01)
Validation set loaded in 0.1393s (67 samples in 34 batches)
Train set loaded in 0.07748s (540 samples in 270 batches)
Training loss: 1.3698: 100%|██████████| 270/270 [01:34<00:00,  2.86it/s]
100%|██████████| 34/34 [00:13<00:00,  2.55it/s]
Validation loss decreased inf --> 0.674124: saving state...
Epoch 1/5 - Validation loss: 0.674124 (Recall: 4.78% | Precision: 3.38% | Mean IoU: 5.00%)
Training loss: 0.711258: 100%|██████████| 270/270 [01:28<00:00,  3.05it/s]
100%|██████████| 34/34 [00:09<00:00,  3.48it/s]
Epoch 2/5 - Validation loss: 0.817873 (Recall: 5.47% | Precision: 2.30% | Mean IoU: 3.00%)
Training loss: 0.563128: 100%|██████████| 270/270 [01:26<00:00,  3.11it/s]
100%|██████████| 34/34 [00:09<00:00,  3.52it/s]
Validation loss decreased 0.674124 --> 0.632917: saving state...
Epoch 3/5 - Validation loss: 0.632917 (Recall: 16.05% | Precision: 32.59% | Mean IoU: 29.00%)
Training loss: 0.610216: 100%|██████████| 270/270 [01:27<00:00,  3.07it/s]
100%|██████████| 34/34 [00:09<00:00,  3.50it/s]
Epoch 4/5 - Validation loss: 0.642417 (Recall: 21.75% | Precision: 11.35% | Mean IoU: 9.00%)
Training loss: 0.604278: 100%|██████████| 270/270 [01:27<00:00,  3.09it/s]
100%|██████████| 34/34 [00:09<00:00,  3.49it/s]
Validation loss decreased 0.632917 --> 0.565686: saving state...
Epoch 5/5 - Validation loss: 0.565686 (Recall: 43.27% | Precision: 46.25% | Mean IoU: 36.00%)

@felixdittrich92
Contributor

You should train longer :D But for only 5 epochs the metrics don't look wrong 👍

@johnlockejrr
Author

johnlockejrr commented Sep 30, 2024

Yes! I just wanted to be sure it runs; it was a first test. I'm happy with it anyway.

Just figuring out how to add my newly trained (*ish) model to the streamlit demo app :-|

EDIT: besides my datasets being line-level, I have another problem: they are mostly RTL. Should I do anything for it to work (like python-bidi etc.)? Does, say, Arabic or Hebrew require other features?

@felixdittrich92
Contributor

> Yes! I just wanted to be sure it runs; it was a first test. I'm happy with it anyway.
>
> Just figuring out how to add my newly trained (*ish) model to the streamlit demo app :-|

Curious to see how well this can work ^^

Currently we use anyascii (https://github.com/anyascii/anyascii), I think this should work!? :)
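
For reference, a tiny hedged sketch of what anyascii does (assuming the anyascii package is installed; the exact transliteration is up to the library):

from anyascii import anyascii

# transliterates any Unicode text to an ASCII approximation
print(anyascii("שלום"))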

@johnlockejrr
Author

> > Yes! I just wanted to be sure it runs; it was a first test. I'm happy with it anyway.
> > Just figuring out how to add my newly trained (*ish) model to the streamlit demo app :-|
>
> Curious to see how well this can work ^^
>
> Currently we use anyascii (https://github.com/anyascii/anyascii), I think this should work!? :)

Never used it; yes, I think it should.

@johnlockejrr
Author

Seems I can't load it as per https://mindee.github.io/doctr/using_doctr/custom_models_training.html :)

[screenshot]

@felixdittrich92
Contributor

felixdittrich92 commented Sep 30, 2024

You can :) You have to change the vocab with --vocab=..
See here for the predefined vocabs we have: https://github.com/mindee/doctr/blob/main/doctr/datasets/vocabs.py

The vocab should contain all the chars you have in your dataset (or more)
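
A small hedged sketch to check that coverage (the recognition labels.json layout, {"img.jpg": "text", ...}, and the file name are assumptions):

import json

from doctr.datasets import VOCABS

with open("labels.json", encoding="utf-8") as f:
    labels = json.load(f)

# every character that occurs in the dataset vs. the chosen vocab
dataset_chars = set("".join(labels.values()))
missing = dataset_chars - set(VOCABS["hebrew"])
print(f"characters missing from the vocab: {sorted(missing)}")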

@johnlockejrr
Author

johnlockejrr commented Sep 30, 2024

Oh, sorry, I'm new to this. I've mostly trained kraken, YOLOv8 and DocUFCN models.

But does it need a vocab for a detection model? I haven't trained a recognition model yet.

@felixdittrich92
Contributor

If none of the predefined vocabs fits, you can simply change:

vocab = VOCABS[args.vocab]

to vocab="abc" for example, but to load the model later you need the same string that defines your model's vocab :)

@felixdittrich92
Contributor

@johnlockejrr No, only for recognition model training

@johnlockejrr
Author

johnlockejrr commented Sep 30, 2024

Couldn't I load only the detection model to see how it performs on a new test image?

@johnlockejrr
Author

johnlockejrr commented Sep 30, 2024

> If none of the predefined vocabs fits, you can simply change:
>
> vocab = VOCABS[args.vocab]
>
> to vocab="abc" for example, but to load the model later you need the same string that defines your model's vocab :)

I just took a look at vocabs.py, and Hebrew has more characters than VOCABS["hebrew"] contains; the file should be amended sometime in the future.

@felixdittrich92
Contributor

> > If none of the predefined vocabs fits, you can simply change:
> >
> > vocab = VOCABS[args.vocab]
> >
> > to vocab="abc" for example, but to load the model later you need the same string that defines your model's vocab :)
>
> I just took a look at vocabs.py, and Hebrew has more characters than VOCABS["hebrew"] contains; the file should be amended sometime in the future.

Feel free to open a PR to add the missing chars 👍

@felixdittrich92
Contributor

felixdittrich92 commented Sep 30, 2024

> Couldn't I load only the detection model to see how it performs on a new test image?

Sure :)

Load your custom trained model (in combination with the ocr_predictor):

import torch

from doctr.models import db_resnet50, ocr_predictor

# Load custom detection model
det_model = db_resnet50(pretrained=False, pretrained_backbone=False)
det_params = torch.load('<path_to_pt>', map_location="cpu")
det_model.load_state_dict(det_params)
predictor = ocr_predictor(det_arch=det_model, reco_arch="vitstr_small", pretrained=True)

or only with the detection_predictor:

import requests
import cv2
import numpy as np
import torch

from doctr.io import DocumentFile
from doctr.models import detection_predictor, db_resnet50
from doctr.utils.geometry import detach_scores


# Convert relative coordinates to absolute pixel values
def _to_absolute(geom, img_shape: tuple[int, int]) -> list[list[int]]:
    h, w = img_shape
    if len(geom) == 2:  # Assume straight pages = True -> [[xmin, ymin], [xmax, ymax]]
        (xmin, ymin), (xmax, ymax) = geom
        xmin, xmax = int(round(w * xmin)), int(round(w * xmax))
        ymin, ymax = int(round(h * ymin)), int(round(h * ymax))
        return [[xmin, ymin], [xmax, ymin], [xmax, ymax], [xmin, ymax]]
    else:  # For polygons, convert each point to absolute coordinates
        return [[int(point[0] * w), int(point[1] * h)] for point in geom]


url = "https://www.francetvinfo.fr/pictures/uGwaNE-aJq7zHLhZJdzdCd9nyjE/1200x900/2021/03/16/phpCDwGn0.jpg"

# Load custom detection model
det_model = db_resnet50(pretrained=False, pretrained_backbone=False)
det_params = torch.load('<path_to_pt>', map_location="cpu")
det_model.load_state_dict(det_params)

det_predictor = detection_predictor(
    arch=det_model,
    pretrained=False,
    assume_straight_pages=True,
    symmetric_pad=True,
    preserve_aspect_ratio=True,
) #.cuda().half()  # Uncomment this line if you have a GPU

det_predictor.model.postprocessor.bin_thresh = 0.3
det_predictor.model.postprocessor.box_thresh = 0.65

img_bytes = requests.get(url).content  # download once and reuse

docs = DocumentFile.from_images([img_bytes])
results = det_predictor(docs)

image = cv2.imdecode(np.frombuffer(img_bytes, np.uint8), cv2.IMREAD_COLOR)

for doc, res in zip(docs, results):
    img_shape = (doc.shape[0], doc.shape[1])
    # Detach the probability scores from the results
    detached_coords, prob_scores = detach_scores([res.get("words")])

    for i, coords in enumerate(detached_coords[0]):
        coords = coords.reshape(2, 2).tolist() if coords.shape == (4, ) else coords.tolist()

        # Convert relative to absolute pixel coordinates
        points = np.array(_to_absolute(coords, img_shape), dtype=np.int32).reshape((-1, 1, 2))

        # Draw the bounding box on the image
        cv2.polylines(image, [points], isClosed=True, color=(255, 0, 0), thickness=2)

    # Save the modified image with bounding boxes
    cv2.imwrite("output.jpg", image)

@johnlockejrr
Author

Perfect! Thank you for all your help! I'll open a PR later today for a new language and amend the Hebrew one.

@felixdittrich92
Contributor

> Perfect! Thank you for all your help! I'll open a PR later today for a new language and amend the Hebrew one.

Reference PR to show what's required to update or add a vocab: https://github.com/mindee/doctr/pull/1700/files

@johnlockejrr
Author

johnlockejrr commented Sep 30, 2024

Very strange behavior with my model. Executing your script above:

(env-py3.10) incognito@DESKTOP-NHKR7QL:~/doctr$ python load_det_model.py
/home/incognito/doctr/load_det_model.py:27: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  det_params = torch.load('db_resnet50_20240930-142637.pt', map_location="cpu")
Traceback (most recent call last):
  File "/home/incognito/doctr/load_det_model.py", line 28, in <module>
    det_model.load_state_dict(det_params)
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2215, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DBNet:
        size mismatch for prob_head.6.weight: copying a param with shape torch.Size([64, 2, 2, 2]) from checkpoint, the shape in current model is torch.Size([64, 1, 2, 2]).
        size mismatch for prob_head.6.bias: copying a param with shape torch.Size([2]) from checkpoint, the shape in current model is torch.Size([1]).
        size mismatch for thresh_head.6.weight: copying a param with shape torch.Size([64, 2, 2, 2]) from checkpoint, the shape in current model is torch.Size([64, 1, 2, 2]).
        size mismatch for thresh_head.6.bias: copying a param with shape torch.Size([2]) from checkpoint, the shape in current model is torch.Size([1]).

Could this happen because I trained it on a line-level dataset?

@felixdittrich92
Contributor

Can you share one entry from the labels.json you used for training?

@johnlockejrr
Author

> Can you share one entry from the labels.json you used for training?

Sure:

{"81_dc946_default.jpg": {"img_dimensions": [720, 960], "img_hash": "f04698acbbc7246475a8401dc031facf1d152c156cb1363217270cd7591e94d3", "polygons": {"textzone": [[[66, 153], [527, 153], [527, 709], [66, 709]]], "textline": [[[78, 161], [515, 161], [515, 188], [78, 188]], [[76, 180], [515, 180], [515, 207], [76, 207]], [[79, 201], [515, 201], [515, 229], [79, 229]], [[77, 221], [514, 221], [514, 250], [77, 250]], [[78, 242], [516, 242], [516, 273], [78, 273]], [[73, 264], [516, 264], [516, 292], [73, 292]], [[75, 287], [517, 287], [517, 313], [75, 313]], [[76, 307], [517, 307], [517, 335], [76, 335]], [[73, 327], [518, 327], [518, 356], [73, 356]], [[75, 350], [516, 350], [516, 377], [75, 377]], [[76, 388], [518, 388], [518, 417], [76, 417]], [[77, 412], [519, 412], [519, 437], [77, 437]], [[74, 434], [518, 434], [518, 457], [74, 457]], [[75, 452], [518, 452], [518, 478], [75, 478]], [[78, 472], [518, 472], [518, 499], [78, 499]], [[81, 493], [519, 493], [519, 519], [81, 519]], [[81, 514], [518, 514], [518, 540], [81, 540]], [[73, 535], [519, 535], [519, 560], [73, 560]], [[74, 556], [519, 556], [519, 581], [74, 581]], [[72, 576], [519, 576], [519, 602], [72, 602]], [[74, 596], [519, 596], [519, 624], [74, 624]], [[75, 618], [517, 618], [517, 647], [75, 647]], [[73, 637], [521, 637], [521, 666], [73, 666]], [[79, 658], [520, 658], [520, 686], [79, 686]], [[75, 680], [520, 680], [520, 714], [75, 714]]]}}, "136_7aab7_default.jpg": {"img_dimensions": [720, 960], "img_hash": "eac91c1193e188f4dd089705086e3e3dfd6bc5233d5ceb714c6082684a64ab06", "polygons": {"textzone": [[[183, 174], [621, 174], [621, 722], [183, 722]]], "textline": [[[188, 181], [615, 181], [615, 211], [188, 211]], [[187, 206], [614, 206], [614, 231], [187, 231]], [[184, 226], [613, 226], [613, 252], [184, 252]], [[188, 246], [614, 246], [614, 274], [188, 274]], [[188, 268], [615, 268], [615, 291], [188, 291]], [[189, 287], [615, 287], [615, 315], [189, 315]], [[188, 308], [614, 308], [614, 335], [188, 335]], [[188, 329], [616, 329], [616, 355], [188, 355]], [[187, 349], [616, 349], [616, 375], [187, 375]], [[186, 372], [616, 372], [616, 397], [186, 397]], [[186, 390], [616, 390], [616, 417], [186, 417]], [[188, 429], [618, 429], [618, 455], [188, 455]], [[189, 450], [619, 450], [619, 477], [189, 477]], [[189, 471], [619, 471], [619, 498], [189, 498]], [[189, 491], [619, 491], [619, 517], [189, 517]], [[190, 512], [618, 512], [618, 538], [190, 538]], [[190, 533], [620, 533], [620, 558], [190, 558]], [[189, 553], [619, 553], [619, 577], [189, 577]], [[192, 574], [616, 574], [616, 599], [192, 599]], [[191, 594], [620, 594], [620, 620], [191, 620]], [[191, 613], [619, 613], [619, 638], [191, 638]], [[193, 633], [619, 633], [619, 660], [193, 660]], [[190, 655], [620, 655], [620, 680], [190, 680]], [[189, 673], [619, 673], [619, 700], [189, 700]], [[186, 694], [618, 694], [618, 729], [186, 729]]]}},
...

@johnlockejrr
Author

Better: I can upload the labels.json of the val set, since it's smaller than the train one.

labels.json

@felixdittrich92
Contributor

Ah I see, you trained a KIE model 😅

To train only a detection model, "polygons" shouldn't be a dict; just the polygons as the value, like:

"polygons": [[[66, 153], [527, 153], [527, 709], [66, 709]], .....]

@johnlockejrr
Author

OMG! :)

@felixdittrich92
Contributor

> OMG! :)

I think this wasn't planned, right? ^^

@johnlockejrr
Author

johnlockejrr commented Sep 30, 2024

For a detection model, can't I specify more class names? As I have textzones and textlines.
Or better, should I just remove the textzone class and keep the textlines?

@felixdittrich92
Contributor

> For a detection model, can't I specify more class names? As I have textzones and textlines.

You can also load this model with:

det_model = db_resnet50(pretrained=False, pretrained_backbone=False, class_names=['textzone', 'textline'])
det_params = torch.load('<path_to_pt>', map_location="cpu")
det_model.load_state_dict(det_params)
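
One hedged note (inferred from how the multi-class results are read later in this thread, not from the docs): with class_names set, the result dict is keyed by your class names rather than "words", so reading the detections would look like:

detached_coords, prob_scores = detach_scores([res.get("textline")])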

@johnlockejrr
Author

> > For a detection model, can't I specify more class names? As I have textzones and textlines.
>
> You can also load this model with:
>
> det_model = db_resnet50(pretrained=False, pretrained_backbone=False, class_names=['textzone', 'textline'])
> det_params = torch.load('<path_to_pt>', map_location="cpu")
> det_model.load_state_dict(det_params)

Bad day :)

/home/incognito/doctr/load_det_model-kie.py:28: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  det_params = torch.load('db_resnet50_20240930-142637.pt', map_location="cpu")
Traceback (most recent call last):
  File "/home/incognito/doctr/load_det_model-kie.py", line 50, in <module>
    detached_coords, prob_scores = detach_scores([res.get("words")])
  File "/home/incognito/doctr/doctr/utils/geometry.py", line 79, in detach_scores
    loc_preds, obj_scores = zip(*(_detach(box) for box in boxes))
  File "/home/incognito/doctr/doctr/utils/geometry.py", line 79, in <genexpr>
    loc_preds, obj_scores = zip(*(_detach(box) for box in boxes))
  File "/home/incognito/doctr/doctr/utils/geometry.py", line 75, in _detach
    if boxes.ndim == 2:
AttributeError: 'NoneType' object has no attribute 'ndim'

@johnlockejrr
Author

johnlockejrr commented Sep 30, 2024

I think I should re-train it :)

The error is on the line detached_coords, prob_scores = detach_scores([res.get("words")])

If it's a KIE model, shouldn't I from doctr.models import kie_predictor?

I changed the line to detached_coords, prob_scores = detach_scores([res.get("textline")])

But I get nothing; the script runs but there are no detections.

detached_coords -> [array([], shape=(0, 4), dtype=float32)]

@johnlockejrr
Author

I reconverted my data to:

{"215_67426_default.jpg": {"img_dimensions": [720, 960], "img_hash": "f4da2a0dcdcd28dbc08609bac090f465ee5d7b471fa42024da0a11e79acade60", "polygons": [[[72, 162], [514, 162], [514, 194], [72, 194]], [[69, 188], [514, 188], [514, 216], [69, 216]], [[69, 209], [514, 209], [514, 238], [69, 238]], [[69, 231], [514, 231], [514, 259], [69, 259]], [[69, 251], [514, 251], [514, 283], [69, 283]], [[70, 274], [515, 274], [515, 299], [70, 299]], [[70, 293], [515, 293], [515, 322], [70, 322]], [[69, 314], [516, 314], [516, 340], [69, 340]], [[69, 335], [516, 335], [516, 364], [69, 364]], [[67, 355], [516, 355], [516, 386], [67, 386]], [[69, 392], [517, 392], [517, 427], [69, 427]], [[70, 420], [514, 420], [514, 447], [70, 447]], [[70, 441], [517, 441], [517, 468], [70, 468]], [[70, 462], [517, 462], [517, 493], [70, 493]], [[70, 483], [518, 483], [518, 511], [70, 511]], [[77, 504], [519, 504], [519, 534], [77, 534]], [[65, 526], [520, 526], [520, 555], [65, 555]], [[69, 547], [519, 547], [519, 578], [69, 578]], [[69, 570], [521, 570], [521, 598], [69, 598]], [[71, 590], [520, 590], [520, 619], [71, 619]], [[65, 612], [521, 612], [521, 642], [65, 642]], [[70, 635], [521, 635], [521, 663], [70, 663]], [[70, 660], [522, 660], [522, 684], [70, 684]], [[66, 677], [522, 677], [522, 703], [66, 703]], [[70, 698], [522, 698], [522, 727], [70, 727]], [[67, 716], [199, 716], [199, 741], [67, 741]]]}, "545_4408b_default.jpg": {"img_dimensions": [720, 960], "img_hash": "21c0f7326a7821b77b2a5e49e76017e60555dd40670005863a20a13d2803748d", "polygons": [[[107, 179], [507, 179], [507, 207], [107, 207]], [[107, 200], [510, 200], [510, 226], [107, 226]], [[105, 220], [509, 220], [509, 245], [105, 245]], [[109, 243], [510, 243], [510, 262], [109, 262]], [[106, 259], [510, 259], [510, 282], [106, 282]], [[106, 277], [510, 277], [510, 301], [106, 301]], [[106, 299], [510, 299], [510, 319], [106, 319]], [[103, 315], [510, 315], [510, 338], [103, 338]], [[103, 333], [510, 333], [510, 358], [103, 358]], [[101, 354], [510, 354], [510, 379], [101, 379]], [[104, 373], [509, 373], [509, 398], [104, 398]], [[101, 390], [510, 390], [510, 416], [101, 416]], [[103, 412], [511, 412], [511, 431], [103, 431]], [[104, 430], [511, 430], [511, 455], [104, 455]], [[101, 450], [510, 450], [510, 475], [101, 475]], [[104, 469], [510, 469], [510, 495], [104, 495]], [[104, 489], [509, 489], [509, 514], [104, 514]], [[104, 507], [510, 507], [510, 533], [104, 533]], [[104, 528], [510, 528], [510, 553], [104, 553]], [[103, 549], [511, 549], [511, 572], [103, 572]], [[103, 565], [509, 565], [509, 591], [103, 591]], [[103, 584], [511, 584], [511, 611], [103, 611]], [[101, 602], [511, 602], [511, 629], [101, 629]], [[99, 622], [512, 622], [512, 650], [99, 650]], [[105, 660], [512, 660], [512, 693], [105, 693]], [[103, 684], [202, 684], [202, 710], [103, 710]]]},

I'll retrain :)

@johnlockejrr
Author

johnlockejrr commented Sep 30, 2024

(env-py3.10) incognito@DESKTOP-NHKR7QL:~/doctr$ python references/detection/train_pytorch.py datasets/sam/train_out datasets/sam/val_out db_resnet50 --epochs 10 --device 0
Namespace(train_path='datasets/sam/train_out', val_path='datasets/sam/val_out', arch='db_resnet50', name=None, epochs=10, batch_size=2, device=0, save_interval_epoch=False, input_size=1024, lr=0.001, weight_decay=0, workers=None, resume=None, test_only=False, freeze_backbone=False, show_samples=False, wb=False, push_to_hub=False, pretrained=False, rotation=False, eval_straight=False, sched='poly', amp=False, find_lr=False, early_stop=False, early_stop_epochs=5, early_stop_delta=0.01)
Validation set loaded in 0.1427s (67 samples in 34 batches)
Train set loaded in 0.0208s (540 samples in 270 batches)
Training loss: 0.29681: 100%|██████████| 270/270 [01:06<00:00,  4.07it/s]
100%|██████████| 34/34 [00:09<00:00,  3.59it/s]
Validation loss decreased inf --> 0.362736: saving state...
Epoch 1/10 - Validation loss: 0.362736 (Recall: 98.02% | Precision: 85.08% | Mean IoU: 65.00%)
Training loss: 0.321628: 100%|██████████| 270/270 [01:02<00:00,  4.29it/s]
100%|██████████| 34/34 [00:06<00:00,  5.49it/s]
Epoch 2/10 - Validation loss: 0.372804 (Recall: 95.15% | Precision: 84.16% | Mean IoU: 63.00%)
Training loss: 0.406969: 100%|██████████| 270/270 [01:03<00:00,  4.24it/s]
100%|██████████| 34/34 [00:06<00:00,  5.43it/s]
Validation loss decreased 0.362736 --> 0.33441: saving state...
Epoch 3/10 - Validation loss: 0.33441 (Recall: 92.34% | Precision: 75.74% | Mean IoU: 52.00%)
Training loss: 0.508775: 100%|██████████| 270/270 [01:02<00:00,  4.29it/s]
100%|██████████| 34/34 [00:06<00:00,  5.54it/s]
Epoch 4/10 - Validation loss: 0.354248 (Recall: 98.68% | Precision: 80.43% | Mean IoU: 64.00%)
Training loss: 0.389871: 100%|██████████| 270/270 [01:03<00:00,  4.28it/s]
100%|██████████| 34/34 [00:06<00:00,  5.54it/s]
Validation loss decreased 0.33441 --> 0.316777: saving state...
Epoch 5/10 - Validation loss: 0.316777 (Recall: 98.68% | Precision: 89.18% | Mean IoU: 70.00%)
Training loss: 0.36966: 100%|██████████| 270/270 [01:02<00:00,  4.30it/s]
100%|██████████| 34/34 [00:06<00:00,  5.60it/s]
Validation loss decreased 0.316777 --> 0.308347: saving state...
Epoch 6/10 - Validation loss: 0.308347 (Recall: 97.19% | Precision: 81.19% | Mean IoU: 59.00%)
Training loss: 0.31847: 100%|██████████| 270/270 [01:03<00:00,  4.25it/s]
100%|██████████| 34/34 [00:06<00:00,  5.49it/s]
Validation loss decreased 0.308347 --> 0.285198: saving state...
Epoch 7/10 - Validation loss: 0.285198 (Recall: 98.08% | Precision: 87.41% | Mean IoU: 67.00%)
Training loss: 0.202373:  11%|██▎       | 31/270 [00:08<01:05,  3.67it/s]
Traceback (most recent call last):
  File "/home/incognito/doctr/references/detection/train_pytorch.py", line 481, in <module>
    main(args)
  File "/home/incognito/doctr/references/detection/train_pytorch.py", line 388, in main
    fit_one_epoch(model, train_loader, batch_transforms, optimizer, scheduler, amp=args.amp)
  File "/home/incognito/doctr/references/detection/train_pytorch.py", line 109, in fit_one_epoch
    for images, targets in pbar:
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1324, in _next_data
    return self._process_data(data)
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
    data.reraise()
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/_utils.py", line 706, in reraise
    raise exception
UnboundLocalError: Caught UnboundLocalError in DataLoader worker process 15.
Original Traceback (most recent call last):
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/incognito/doctr/doctr/datasets/datasets/base.py", line 67, in __getitem__
    img_transformed, target[class_name] = self.sample_transforms(img, bboxes)
  File "/home/incognito/doctr/doctr/transforms/modules/base.py", line 56, in __call__
    x, target = t(x, target)
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/incognito/doctr/env-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/incognito/doctr/doctr/transforms/modules/pytorch.py", line 87, in forward
    target[:, [0, 2]] = offset[0] + target[:, [0, 2]] * raw_shape[-1] / img.shape[-1]
UnboundLocalError: local variable 'offset' referenced before assignment

I resumed it and it finished:

(env-py3.10) incognito@DESKTOP-NHKR7QL:~/doctr$ python references/detection/train_pytorch.py datasets/sam/train_out datasets/sam/val_out db_resnet50 --epochs 5 --device 0 --resume ./db_resnet50_20240930-162432.pt --workers 2
Namespace(train_path='datasets/sam/train_out', val_path='datasets/sam/val_out', arch='db_resnet50', name=None, epochs=5, batch_size=2, device=0, save_interval_epoch=False, input_size=1024, lr=0.001, weight_decay=0, workers=2, resume='./db_resnet50_20240930-162432.pt', test_only=False, freeze_backbone=False, show_samples=False, wb=False, push_to_hub=False, pretrained=False, rotation=False, eval_straight=False, sched='poly', amp=False, find_lr=False, early_stop=False, early_stop_epochs=5, early_stop_delta=0.01)
Validation set loaded in 0.1605s (67 samples in 34 batches)
Resuming ./db_resnet50_20240930-162432.pt
/home/incognito/doctr/references/detection/train_pytorch.py:228: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint = torch.load(args.resume, map_location="cpu")
Train set loaded in 0.07673s (540 samples in 270 batches)
Training loss: 0.342384: 100%|██████████| 270/270 [01:04<00:00,  4.20it/s]
100%|██████████| 34/34 [00:08<00:00,  4.06it/s]
Validation loss decreased inf --> 0.333333: saving state...
Epoch 1/5 - Validation loss: 0.333333 (Recall: 98.32% | Precision: 84.99% | Mean IoU: 64.00%)
Training loss: 0.285108: 100%|██████████| 270/270 [01:02<00:00,  4.35it/s]
100%|██████████| 34/34 [00:05<00:00,  6.64it/s]
Validation loss decreased 0.333333 --> 0.298129: saving state...
Epoch 2/5 - Validation loss: 0.298129 (Recall: 97.84% | Precision: 90.08% | Mean IoU: 67.00%)
Training loss: 0.241384: 100%|██████████| 270/270 [01:01<00:00,  4.40it/s]
100%|██████████| 34/34 [00:05<00:00,  6.66it/s]
Validation loss decreased 0.298129 --> 0.234458: saving state...
Epoch 3/5 - Validation loss: 0.234458 (Recall: 98.80% | Precision: 81.85% | Mean IoU: 71.00%)
Training loss: 0.238148: 100%|██████████| 270/270 [01:01<00:00,  4.37it/s]
100%|██████████| 34/34 [00:05<00:00,  6.72it/s]
Epoch 4/5 - Validation loss: 0.238532 (Recall: 98.50% | Precision: 86.95% | Mean IoU: 75.00%)
Training loss: 0.237705: 100%|██████████| 270/270 [01:02<00:00,  4.34it/s]
100%|██████████| 34/34 [00:05<00:00,  6.62it/s]
Validation loss decreased 0.234458 --> 0.20468: saving state...
Epoch 5/5 - Validation loss: 0.20468 (Recall: 98.98% | Precision: 89.64% | Mean IoU: 80.00%)

@felixdittrich92
Contributor

> (quoted: the training log and UnboundLocalError traceback from the previous comment)

That's a known issue; a PR to fix it is on the way :)
#1715
CC @odulcy-mindee

@johnlockejrr
Author

It performs well(*ish) with your script above, but any idea why it identifies only one line?

[output image]

@felixdittrich92
Contributor

> It performs well(*ish) with your script above, but any idea why it identifies only one line?
>
> [output image]

What's the shape of the model output?

@felixdittrich92
Contributor

Btw, in my provided script, lower bin_thresh and box_thresh to 0.1.
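
That is, in the detection script shared earlier in this thread:

det_predictor.model.postprocessor.bin_thresh = 0.1
det_predictor.model.postprocessor.box_thresh = 0.1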

@johnlockejrr
Author

I trained the model on x960 images; when detecting, should I use the same resolution?

@felixdittrich92
Contributor

> I trained the model on x960 images; when detecting, should I use the same resolution?

If you resized them yourself beforehand, that would make sense, yep.

@johnlockejrr
Author

johnlockejrr commented Sep 30, 2024

I resized the image to x960. I think it needs more training.

[output image]

@mindee mindee locked and limited conversation to collaborators Oct 1, 2024
@felixdittrich92 felixdittrich92 converted this issue into discussion #1739 Oct 1, 2024
