TLDR-386 pdf auto reader bug #298

Merged
oksidgy merged 4 commits into develop from TLDR-386_pdf_auto_reader_bug on Jul 26, 2023

NastyBoget
Collaborator

  • refactored pdf_txtlayer_classifier, its training script, and its benchmarking;
  • added a script for training data generation;
  • generated more training data;
  • changed the classifier from xgboost to catboost and added feature importances;
  • uploaded the data and the new classifier to the cloud.

@NastyBoget NastyBoget force-pushed the TLDR-386_pdf_auto_reader_bug branch from bdb0d4a to 8e260a3 Compare July 21, 2023 13:24
@NastyBoget NastyBoget requested a review from oksidgy July 21, 2023 13:25
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

from config import get_config
Collaborator

from dedoc.config

Collaborator Author

done

if not subdir.is_dir():
    continue
for file_path in subdir.iterdir():
    if str(file_path).endswith("txt"):
Collaborator

endswith(".txt")

Collaborator Author

done
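
For readers following the endswith("txt") versus endswith(".txt") thread, here is a minimal sketch of why the dot matters. It is not the code from this PR: the sample file names and the is_txt_file helper are hypothetical, and the pathlib suffix check is just one possible alternative to the string comparison the reviewer suggested.

```python
from pathlib import Path


def is_txt_file(file_path: Path) -> bool:
    # Hypothetical helper: compare the real extension, not the raw string tail.
    return file_path.suffix == ".txt"


# Hypothetical directory listing used only for illustration.
names = ["page.txt", "report_txt", "notes.TXT", "scan.pdf"]

# Without the dot, any name that merely ends in the letters "txt" is accepted.
loose = [name for name in names if name.endswith("txt")]

# With the dot, only names carrying the ".txt" extension are accepted.
strict = [name for name in names if name.endswith(".txt")]

# Path.suffix gives the same result as the strict check for these names.
by_suffix = [name for name in names if is_txt_file(Path(name))]

print(loose)
print(strict)
print(by_suffix)
```

Both the strict string check and the suffix check are case-sensitive, so a file such as notes.TXT is skipped by either one; whether that matters depends on how the training data is named.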

@oksidgy oksidgy merged commit 0886076 into develop Jul 26, 2023
2 checks passed
@oksidgy oksidgy deleted the TLDR-386_pdf_auto_reader_bug branch July 26, 2023 15:39
dronperminov added a commit that referenced this pull request Aug 1, 2023
* TLDR-386 pdf auto reader bug (#298)

* TLDR-386 Added features importances

* TLDR-386 added script for txtlayer dataset generation

* TLDR-386 move all data to the cloud

* Review fixes

* exclude version and changelog files (#299)

* TLDR-419 add confidence annotation (#301)

* add new annotation

* add confidence extracting

* add test for confidence annotation

* add confidence annotation to documentation

* fix flake

* add mergeable field for annotation

* review fixes

* TLDR-369 class for full dedoc pipeline running (#300)

* DedocPipeline added (work in progress)

* TLDR-369_dedoc_manager

* TLDR-369 fix documentation and add test for attachments recursion

* TLDR-369 change version saving

* TLDR-369 review fixes

* TLDR-369 added temporary file name

* new version 0.10.0 (#302)

---------

Co-authored-by: Bogatenkova Anastasiya <[email protected]>