
Duplicate words in OCR result #330

Closed
jonathanMindee opened this issue Jun 26, 2021 · 8 comments · Fixed by #1279
Labels: help wanted (Extra attention is needed) · type: bug (Something isn't working)
jonathanMindee (Contributor) commented Jun 26, 2021

🐛 Bug

Running the sample code:

from doctr.documents import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True)
# Read the image
doc = DocumentFile.from_images(["table.png"])
# Analyze
result = model(doc)

result.show(doc)

I get this result:

[screenshot: OCR result with overlapping word boxes]

Everything looks fine, but there is some overlap between different words. The mouse is pointing to the word "Header4", and there is a second, overlapping word whose content is just "4". Because of that extra "4", I'm not able to properly reconstruct the table header.

To Reproduce

Steps to reproduce the behavior:

  1. Download this image:

[attached image: table.png]

  2. Run the following code:
from doctr.documents import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True)
# Read the image
doc = DocumentFile.from_images(["table.png"])
# Analyze
result = model(doc)

result.show(doc)
jonathanMindee added the "type: bug" label Jun 26, 2021
jonathanMindee (Contributor, Author) commented

I think some overlap-detection post-processing could filter out those duplicates.

fg-mindee (Contributor) commented

Thanks for reporting this!

I'm not sure which way would be the best, but here are some ideas to handle this:

  • Batch post-processing: perform NMS with a looser threshold.
  • Manual post-processing: estimate candidate overlaps with a box IoU. For pairs that also overlap in text, perform a manual NMS (keeping the box with the longest string whose confidence is above a given threshold). The likely issue is that the resulting string would wrongly omit the blank space.
  • Training-based: add the blank space to the recognition vocab and use NMS.

Since the first option is natively implemented in most modern DL frameworks, it might be a suitable one to try first.
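For reference, the batch post-processing idea can be sketched as a plain IoU-based NMS over word boxes. This is an illustrative sketch only, not doctr's internal implementation; the helper names (`box_iou`, `nms`) and the `(xmin, ymin, xmax, ymax)` box convention are assumptions.

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box among any cluster of overlapping boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Suppress the candidate if it overlaps an already-kept box too much
        if all(box_iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep
```

With the duplicate "Header4" / "4" case, the smaller "4" box would only be suppressed if its IoU with the full word box exceeds the threshold, which is why the threshold choice matters here.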

charlesmindee self-assigned this Jun 27, 2021
charlesmindee (Collaborator) commented Jun 28, 2021

I don't think we should only perform NMS, because in cases like this one we want to keep both boxes when there is an overlap. I see 2 solutions:

  • Merging the 2 boxes into 1 box: quick and easy, but it can include undesirable spaces.
  • Arbitrarily shortening one of the 2 boxes to eliminate the overlap.

It is, however, an uncommon edge case; I think it only happens with underscores:

[example image]

charlesmindee (Collaborator) commented Jun 28, 2021

As a matter of fact, we do want to suppress very small boxes included in other ones, so I suggest the following:

  • performing NMS with a very high threshold (say > 80%) to filter out boxes covered by other ones (avoiding repetitions without losing information);
  • merging boxes with a consistent overlap but a lower IoU (for instance, between 20% and 80%), to keep all the information we need.

This overlap seems to occur mostly with underscores, so I think merging the boxes is a good approximation in that case (technically, it is the same word). What do you think @fg-mindee?

[example image: ex1 png]
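The two-threshold scheme described above (suppress near-duplicates, merge moderate overlaps) could be sketched roughly as follows. This is a hedged illustration, not doctr's actual code: the function name `resolve_overlaps`, the threshold defaults, and the greedy pairwise loop are all assumptions.

```python
def iou(a, b):
    """IoU of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def resolve_overlaps(boxes, suppress_at=0.8, merge_at=0.2):
    """Suppress near-duplicate boxes (IoU > suppress_at) and merge
    moderately overlapping ones (merge_at < IoU <= suppress_at)."""
    boxes = list(boxes)
    i = 0
    while i < len(boxes):
        j = i + 1
        while j < len(boxes):
            v = iou(boxes[i], boxes[j])
            if v > suppress_at:
                boxes.pop(j)          # near-duplicate: keep one box
            elif v > merge_at:
                a, b = boxes[i], boxes.pop(j)
                # Replace with the union bounding box
                boxes[i] = (min(a[0], b[0]), min(a[1], b[1]),
                            max(a[2], b[2]), max(a[3], b[3]))
                j = i + 1             # re-check pairs against the merged box
            else:
                j += 1
        i += 1
    return boxes
```

The merge branch fits the underscore case: the two fragments of the same word become one bounding box rather than one being discarded.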

fg-mindee (Contributor) commented

@charlesmindee Thanks for the suggestion!
However, when I suggested an NMS, I was thinking about the iterative merging implementation of it.
So I fully agree that pure filtering won't be enough. As you mentioned, we might need a metric other than IoU 👍

charlesmindee added the "help wanted" label Jul 2, 2021
fg-mindee (Contributor) commented

Coming back to this issue, I suggest the following:

  • Investigate the heatmap of the text detection module to assess whether this comes from the segmentation or the box-conversion part (I'm especially interested in the overlapping localization candidates shown in the issue description image);
  • discuss options to handle the situation depending on our findings;
  • as shown earlier, NMS isn't really the best option here, since we're talking about small IoU overlaps; if we tweak the NMS threshold, it will start merging words that are correctly separated by a blank space.

But let's not leave this issue unaddressed 😃

felixT2K (Contributor) commented

@frgfm @charlesmindee @odulcy-mindee

This seems to be solved with preserve_aspect_ratio=True (both TF and PT behave identically). I have tested some personal documents, and keeping the aspect ratio was always the better choice. Should we use it by default, wdyt?

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True, preserve_aspect_ratio=True)
# Read the image
doc = DocumentFile.from_images(["/home/felix/Desktop/table.png"])
# Analyze
result = model(doc)

result.show(doc)

[screenshot from 2023-07-25 08-15-10]

charlesmindee (Collaborator) commented

Hi @felixdittrich92, thanks for the suggestion! I think we can change the default behaviour, since it is quite natural to preserve the aspect ratio by default. Moreover, it will make the predictions more robust to cropping.
