Text quality heuristics #65

jpcompartir · 2024-06-20T10:52:43Z

Implement the filters/heuristics from:
https://arxiv.org/pdf/2405.01582

| filter_name | heuristic | description |
| --- | --- | --- |
| has_first_letter_caps | First character capitalized | Check if the first character of each line is capitalized. |
| no_all_caps | All characters capitalized | Check if all the characters in the line are capitalized. |
| word_repetetion_ratio_ge_0_2 | Word repetition ratio | Check if the ratio of repeated words in the line is > 0.2. |
| digit_punctuation_ratio_0_25 | Digit/punctuation to word ratio | Identify lines where the ratio of digits/punctuation to words is > 0.25. |
| no_special_characters | Has `{` character | Curly braces ("flower brackets") are common in code. Since we are curating text-only content, this filter identifies text that might contain code. |
| terminal_punctuation | Has terminal punctuation | Check if the line ends with one of these punctuation marks: `.`, `!`, `?`, `"`. |
| stop_word_match_2 | Has 2 stop words | Check if the line contains at least 2 stop words among `the`, `be`, `to`, `of`, `and`, `that`, `have`, `with`. |
| javascript_flag | Contains special phrases | Check if the text contains the phrases `javascript` or `lorem ipsum`, to identify docs with code. |
| token_count_ge_3 | Token count | Check if the token count is > 3. |
| word_count_3_256 | Word count range | Check if the line word count is > 3 and < 256. |
| has_object | Has object | Check if there is an object identified by the parser. |
| has_noun | Has noun | Check if there is at least one noun in the line. |
| has_determiner | Has determiner | Check if the line contains a determiner, based on results from the text parser. |
| text_complexity_c1 | Text complexity | Uses a setup similar to the CAT filter (Radenovic et al., 2023): lines with at least one edge from an object are flagged as positive. |
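
As a starting point, the surface-level filters (the ones that need no tokenizer or parser) could be sketched roughly as below. This is only a sketch under assumptions, not the paper's code: the function names, the definition of the repetition ratio (1 minus unique words over total words), whitespace word splitting, and counting stop-word occurrences rather than distinct stop words are all my guesses. Each function simply returns whether the condition in the table holds; whether a given flag counts for or against quality is left to the weights. `token_count_ge_3` and the parser-based filters (`has_object`, `has_noun`, `has_determiner`, `text_complexity_c1`) are left out because they depend on a tokenizer/parser choice.

```python
import re
import string

# Stop words and terminal punctuation as listed in the table above.
STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}
TERMINAL_PUNCTUATION = (".", "!", "?", '"')


def has_first_letter_caps(line: str) -> bool:
    stripped = line.lstrip()
    return bool(stripped) and stripped[0].isupper()


def no_all_caps(line: str) -> bool:
    # str.isupper() is True only when every cased character is uppercase.
    return not line.isupper()


def word_repetition_ratio_gt_0_2(line: str) -> bool:
    # Assumption: ratio = 1 - unique words / total words (case-insensitive).
    words = line.lower().split()
    if not words:
        return False
    return 1 - len(set(words)) / len(words) > 0.2


def digit_punctuation_ratio_gt_0_25(line: str) -> bool:
    # Assumption: count digit/punctuation characters, divide by word count.
    words = line.split()
    if not words:
        return False
    count = sum(ch.isdigit() or ch in string.punctuation for ch in line)
    return count / len(words) > 0.25


def no_special_characters(line: str) -> bool:
    return "{" not in line


def terminal_punctuation(line: str) -> bool:
    return line.rstrip().endswith(TERMINAL_PUNCTUATION)


def stop_word_match_2(line: str) -> bool:
    # Assumption: repeated occurrences of the same stop word each count.
    tokens = re.findall(r"[a-z']+", line.lower())
    return sum(tok in STOP_WORDS for tok in tokens) >= 2


def javascript_flag(text: str) -> bool:
    lowered = text.lower()
    return "javascript" in lowered or "lorem ipsum" in lowered


def word_count_3_256(line: str) -> bool:
    return 3 < len(line.split()) < 256


def apply_heuristics(line: str) -> dict[str, bool]:
    """Return one boolean per filter, keyed by the filter_name from the table."""
    return {
        "has_first_letter_caps": has_first_letter_caps(line),
        "no_all_caps": no_all_caps(line),
        "word_repetetion_ratio_ge_0_2": word_repetition_ratio_gt_0_2(line),
        "digit_punctuation_ratio_0_25": digit_punctuation_ratio_gt_0_25(line),
        "no_special_characters": no_special_characters(line),
        "terminal_punctuation": terminal_punctuation(line),
        "stop_word_match_2": stop_word_match_2(line),
        "javascript_flag": javascript_flag(line),
        "word_count_3_256": word_count_3_256(line),
    }
```

Usage would be something like `apply_heuristics("The cat sat on the mat with the dog.")`, which returns a dict of flags that can then be fed into the scoring step.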

Combine the filter results into the line-level and document-level scores given in the paper:
$$\text{score}_\text{line} = \frac{\sum_{i=1}^{F} w_i I_i(\text{line})}{\sum_{i=1}^{F} w_i}$$

$$\text{score}_\text{doc} = \frac{\sum_{\text{line}=1}^{n} tc_\text{line} \, \text{score}_\text{line}}{\sum_{\text{line}=1}^{n} tc_\text{line}}$$
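
A rough sketch of how the two scores might be computed, reading $w_i$ as the weight of filter $i$, $I_i$ as its 0/1 indicator on a line, and $tc_\text{line}$ as the line's token count (that is my reading of the paper; the actual weight values are not given in this issue). The function names and the dict-based interface are hypothetical, chosen only for illustration.

```python
from typing import Mapping, Sequence


def line_score(flags: Mapping[str, bool], weights: Mapping[str, float]) -> float:
    """Weighted mean of the 0/1 filter indicators for one line."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    # A filter missing from `flags` is treated as not firing (assumption).
    weighted = sum(w * float(flags.get(name, False)) for name, w in weights.items())
    return weighted / total_weight


def doc_score(line_scores: Sequence[float], token_counts: Sequence[int]) -> float:
    """Token-count-weighted mean of the line scores for one document."""
    if len(line_scores) != len(token_counts):
        raise ValueError("line_scores and token_counts must have the same length")
    total_tokens = sum(token_counts)
    if total_tokens == 0:
        return 0.0  # Assumption: an empty document scores 0.
    weighted = sum(tc * s for tc, s in zip(token_counts, line_scores))
    return weighted / total_tokens
```

For example, with made-up weights `{"a": 1.0, "b": 3.0}` and flags `{"a": True, "b": False}`, `line_score` gives 1 / 4 = 0.25. Weighting by token count means long lines dominate the document score.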
