TLDR-386 pdf auto reader bug #298
NastyBoget commented on Jul 21, 2023
- refactored pdf_txtlayer_classifier, its training script, and its benchmarking;
- added a script for training data generation;
- generated more training data;
- changed the classifier from xgboost to catboost and added feature importances;
- uploaded the data and the new classifier to the cloud.
bdb0d4a to 8e260a3
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

from config import get_config
from dedoc.config
done
if not subdir.is_dir():
    continue
for file_path in subdir.iterdir():
    if str(file_path).endswith("txt"):
endswith(".txt")
done
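For context on the suggestion above: `endswith("txt")` also matches names that merely end in those three letters, such as an extensionless `report_txt`. The sketch below is not code from this PR. The helper names `list_txt_files` and `demo` and the sample file names are hypothetical. It contrasts the loose check with a comparison on `Path.suffix`, which is one alternative to `endswith(".txt")`.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def list_txt_files(root: Path) -> List[Path]:
    """Collect files with a .txt extension from each immediate subdirectory of root.

    Hypothetical helper for illustration only.
    """
    found = []
    for subdir in sorted(root.iterdir()):
        if not subdir.is_dir():
            continue
        for file_path in sorted(subdir.iterdir()):
            # Path.suffix is the final extension including the dot,
            # so an extensionless file named "report_txt" is skipped.
            if file_path.is_file() and file_path.suffix == ".txt":
                found.append(file_path)
    return found


def demo() -> Tuple[List[str], List[str]]:
    """Build a throwaway directory and compare the loose and strict checks."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        docs = root / "docs"
        docs.mkdir()
        for name in ("page.txt", "report_txt", "notes.md"):
            (docs / name).write_text("sample")
        # Loose check from the original diff: matches any name ending in "txt".
        loose = sorted(p.name for p in docs.iterdir() if str(p).endswith("txt"))
        # Strict check: only real .txt extensions.
        strict = [p.name for p in list_txt_files(root)]
        return loose, strict
```

Calling `demo()` returns the names accepted by each check, so the difference on `report_txt` can be inspected directly.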
* TLDR-386 pdf auto reader bug (#298)
* TLDR-386 Added features importances
* TLDR-386 added script for txtlayer dataset generation
* TLDR-386 move all data to the cloud
* Review fixes
* exclude version and changelog files (#299)
* TLDR-419 add confidence annotation (#301)
* add new annotation
* add confidence extracting
* add test for confidence annotation
* add confidence annotation to documentation
* fix flake
* add mergeable field for annotation
* review fixes
* TLDR-369 class for full dedoc pipeline running (#300)
* DedocPipeline added (work in progress)
* TLDR-369_dedoc_manager
* TLDR-369 fix documentation and add test for attachments recursion
* TLDR-369 change version saving
* TLDR-369 review fixes
* TLDR-369 added temporary file name
* new version 0.10.0 (#302)
---------
Co-authored-by: Bogatenkova Anastasiya <[email protected]>