Text quality heuristics #65

Open
jpcompartir opened this issue Jun 20, 2024 · 0 comments

Implement the filters/heuristics from:
https://arxiv.org/pdf/2405.01582

| filter_name | heuristic | description |
|---|---|---|
| has_first_letter_caps | First character capitalized | Check if the first character of each line is capitalized. |
| no_all_caps | All characters capitalized | Check if all the characters in the line are capitalized. |
| word_repetetion_ratio_ge_0_2 | Word repetition ratio | Check if the ratio of repeated words in the line is > 0.2. |
| digit_punctuation_ratio_0_25 | Digit/punctuation to word ratio | Identify lines where the ratio of digits/punctuation to words is > 0.25. |
| no_special_characters | Has { character | Flower brackets (curly braces) are common in code. As we are curating for text-only content, this filter identifies text that might contain code. |
| terminal_punctuation | Has terminal punctuation | Check if the line ends with one of these punctuation marks: '.', '!', '?', '"'. |
| stop_word_match_2 | Has 2 stop words | Check if the line contains at least 2 stop words among 'the', 'be', 'to', 'of', 'and', 'that', 'have', 'with'. |
| javascript_flag | Contains special phrases | Check if the text contains the phrases 'javascript' or 'lorem ipsum' to identify docs with code. |
| token_count_ge_3 | Token count | Check if the token count is > 3. |
| word_count_3_256 | Word count range | Check if the line word count is > 3 and < 256. |
| has_object | Has object | Check if there is an object identified by the parser. |
| has_noun | Has noun | Check if there is at least one noun in the line. |
| has_determiner | Has determiner | Check if the line contains a determiner, based on results from the text parser. |
| text_complexity_c1 | Text complexity | For this we use a setup similar to the CAT filter (Radenovic et al., 2023), where lines with at least one edge from an object are flagged as positive. |
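
As a rough starting point, the surface-level rows of the table (the ones that need no parser) might be sketched in Python as below. The function bodies, the whitespace/regex tokenization, the choice to count repeated stop-word occurrences, and the convention that `True` means "the line looks like clean text" are all assumptions made for illustration; none of them come from the paper or from this repository.

```python
import re

# Stop words and terminal punctuation marks as listed in the table above.
STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}
TERMINAL_PUNCTUATION = (".", "!", "?", '"')


def has_first_letter_caps(line: str) -> bool:
    """True if the first non-whitespace character is an uppercase letter."""
    stripped = line.lstrip()
    return bool(stripped) and stripped[0].isupper()


def no_all_caps(line: str) -> bool:
    """True unless every cased character in the line is uppercase."""
    return not line.isupper()


def no_special_characters(line: str) -> bool:
    """True if the line contains no '{' character (a hint of code)."""
    return "{" not in line


def terminal_punctuation(line: str) -> bool:
    """True if the line, ignoring trailing whitespace, ends with . ! ? or a double quote."""
    return line.rstrip().endswith(TERMINAL_PUNCTUATION)


def stop_word_match_2(line: str) -> bool:
    """True if at least 2 stop-word occurrences are found (assumption: repeats count)."""
    words = re.findall(r"[a-z']+", line.lower())
    return sum(word in STOP_WORDS for word in words) >= 2


def word_count_3_256(line: str) -> bool:
    """True if the whitespace-delimited word count is > 3 and < 256."""
    return 3 < len(line.split()) < 256
```

The parser-based rows (`has_object`, `has_noun`, `has_determiner`, `text_complexity_c1`) are left out here because they depend on which parser is chosen.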

Combine into the scores given:
$$\text{score}_\text{line} = \frac{\sum_{i=1}^{F} w_i I_i(\text{line})}{\sum_{i=1}^{F} w_i}$$

$$\text{score}_\text{doc} = \frac{\sum_{\text{line}=1}^{n} tc_\text{line}\,\text{score}_\text{line}}{\sum_{\text{line}=1}^{n} tc_\text{line}}$$
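
The two scores could be combined along the following lines. This is a minimal sketch under stated assumptions: each filter is a function returning `True`/`False` for a line (the indicator I_i), `tc_line` is read as the line's token count (the issue does not define it), and whitespace splitting stands in for a real tokenizer. The names `score_line`, `score_doc`, and the two toy filters are hypothetical.

```python
from typing import Callable, Sequence

LineFilter = Callable[[str], bool]


def score_line(line: str, filters: Sequence[LineFilter], weights: Sequence[float]) -> float:
    """Weighted fraction of filters the line passes: sum(w_i * I_i) / sum(w_i)."""
    if len(filters) != len(weights):
        raise ValueError("filters and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    passed = sum(w * float(f(line)) for f, w in zip(filters, weights))
    return passed / total_weight


def score_doc(lines: Sequence[str], filters: Sequence[LineFilter], weights: Sequence[float]) -> float:
    """Average of line scores, weighted by each line's token count."""
    # Assumption: token count is approximated by whitespace-delimited words.
    token_counts = [len(line.split()) for line in lines]
    total_tokens = sum(token_counts)
    if total_tokens == 0:
        return 0.0  # empty document: nothing to score
    weighted = sum(
        tc * score_line(line, filters, weights)
        for line, tc in zip(lines, token_counts)
    )
    return weighted / total_tokens


# Toy filters and weights, for illustration only.
def ends_with_period(line: str) -> bool:
    return line.rstrip().endswith(".")


def starts_uppercase(line: str) -> bool:
    return line[:1].isupper()


example_filters = [ends_with_period, starts_uppercase]
example_weights = [2.0, 1.0]
example_doc = ["Hello world.", "hello there friend."]
doc_score = score_doc(example_doc, example_filters, example_weights)
```

Here the first line passes both filters (score 1) and the second passes only the heavier one (score 2/3), so with token counts 2 and 3 the document score is (2·1 + 3·2/3) / 5 = 0.8.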
