fix MRR
Nikita Shevtsov committed Aug 8, 2024
1 parent 3334124 commit 70c78fd
Showing 4 changed files with 12 additions and 4 deletions.
@@ -31,7 +31,8 @@ def detect_txtlayer(self, path: str, parameters: dict) -> PdfTxtlayerParameters:
         lines = self.__get_lines_for_predict(path=path, parameters=parameters)
         if str(parameters.get("fast_textual_layer_detection", "false")).lower() == "true":
             is_correct = any(line.line.strip() for line in lines)
-            first_page_correct = True if len([line for line in lines if line.metadata.page_id == 0]) > 0 else False
+            first_page_lines = [line for line in lines if line.metadata.page_id == 0]
+            first_page_correct = first_page_lines and any(line.line.strip() for line in first_page_lines)
         else:
             is_correct = self.txtlayer_classifier.predict(lines)
             first_page_correct = self.__is_first_page_correct(lines=lines, is_txt_layer_correct=is_correct)
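
The fast branch in this hunk can be illustrated with a minimal standalone sketch. `Line` and `Metadata` are hypothetical stand-ins for dedoc's line objects (only the `line` and `metadata.page_id` attributes used in the hunk are modeled), and `fast_detect` is an illustrative name, not a dedoc function.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Metadata:
    # Hypothetical stand-in: only the page index is modeled.
    page_id: int


@dataclass
class Line:
    # Hypothetical stand-in for an extracted text line.
    line: str
    metadata: Metadata


def fast_detect(lines: List[Line]) -> Tuple[bool, bool]:
    """Return (is_correct, first_page_correct) without running a classifier."""
    # The document is assumed to have a textual layer if any line has visible text.
    is_correct = any(item.line.strip() for item in lines)
    # The first page counts only if it has at least one non-blank line.
    first_page_lines = [item for item in lines if item.metadata.page_id == 0]
    first_page_correct = any(item.line.strip() for item in first_page_lines)
    return is_correct, first_page_correct


if __name__ == "__main__":
    mixed = [Line("   ", Metadata(0)), Line("This paper", Metadata(1))]
    print(fast_detect(mixed))
```

One difference from the committed expression is worth noting: `first_page_lines and any(...)` evaluates to an empty list when the first page has no lines, which is falsy but not a `bool`. The sketch relies on `any()` returning `False` for an empty iterable, so both results are plain booleans.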
2 changes: 1 addition & 1 deletion docs/source/dedoc_api_usage/api.rst
@@ -210,7 +210,7 @@ Api parameters description
        If the document doesn't have a textual layer (it is an image, scanned document), PDF document parsing works like with ``need_pdf_table_analysis=false``.
        It is highly recommended to use this option value for any PDF document parsing.

-    * - fast_auto
+    * - fast_textual_layer_detection
       - true, false
       - false
       - Enable fast textual layer detection. Works only when **auto** or **auto_tabby** is selected at **pdf_with_text_layer**.
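
As a rough illustration of the rule stated in this row, the sketch below checks whether the flag would take effect for a given parameter dictionary. `fast_detection_applies` and `AUTO_MODES` are hypothetical names, not part of dedoc. The string comparison mirrors the `str(...).lower() == "true"` check in the reader code from this commit. A missing `pdf_with_text_layer` is treated here as "not selected", which is an assumption and may differ from the library's actual default.

```python
# Modes in which, per the documentation row, fast detection is honored.
AUTO_MODES = {"auto", "auto_tabby"}


def fast_detection_applies(parameters: dict) -> bool:
    """Return True if fast textual layer detection would be used (illustrative only)."""
    # Accept both booleans and strings, as the reader code does.
    flag = str(parameters.get("fast_textual_layer_detection", "false")).lower() == "true"
    # Assumption: an absent mode is treated as not selected.
    mode = str(parameters.get("pdf_with_text_layer", "")).lower()
    return flag and mode in AUTO_MODES


if __name__ == "__main__":
    params = dict(pdf_with_text_layer="auto_tabby", fast_textual_layer_detection=True)
    print(fast_detection_applies(params))
```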
2 changes: 1 addition & 1 deletion docs/source/parameters/pdf_handling.rst
@@ -47,7 +47,7 @@ PDF and images handling
        If the document doesn't have a textual layer (it is an image, scanned document), :class:`dedoc.readers.PdfImageReader` will be used.
        It is highly recommended to use this option value for any PDF document parsing.

-    * - fast_auto
+    * - fast_textual_layer_detection
       - true, false
       - false
       - * :meth:`dedoc.readers.PdfAutoReader.read`
9 changes: 8 additions & 1 deletion tests/api_tests/test_api_format_pdf_auto_text_layer.py
@@ -86,8 +86,15 @@ def test_fast_textual_layer_detection(self) -> None:
self.assertIn("Assume document has a correct textual layer", warnings)
self.assertEqual(result["content"]["structure"]["subparagraphs"][5]["text"][:10], "This paper")

file_name = "tz_scan_1page.pdf"
parameters = dict(pdf_with_text_layer="auto_tabby", fast_textual_layer_detection=True)
result = self._send_request(file_name, parameters)
warnings = result["warnings"]
self.assertIn("Assume document has incorrect textual layer", result["warnings"])

file_name = "mixed_pdf.pdf"
parameters = dict(pdf_with_text_layer="auto", fast_textual_layer_detection=True)
result = self._send_request(file_name, parameters)
warnings = result["warnings"]
self.assertIn("Assume document has a correct textual layer", warnings)
self.assertEqual(result["content"]["structure"]["subparagraphs"][5]["text"][:10], "This paper")
self.assertIn("Assume the first page hasn't a textual layer", result["warnings"])
