Commit

Merge branch 'main' into change-dependencies
nonprofittechy authored Jul 11, 2024
2 parents 5f78625 + c50a8a0 commit 2617c72
Showing 8 changed files with 151 additions and 15 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -11,3 +11,4 @@ dev-testing/.DS_Store
.env
.venv
venv
formfyxer/keys/**
16 changes: 15 additions & 1 deletion CHANGELOG.md
@@ -1,5 +1,19 @@
# CHANGELOG


## Version v0.3.0

### Added
* Add warning when sensitive fields are detected by @codestronger in https://github.com/SuffolkLITLab/RateMyPDF/issues/25

### Changed
N/A

### Fixed
N/A

**Full Changelog**: https://github.com/SuffolkLITLab/FormFyxer/compare/v0.2.0...v0.3.0

## Version v0.2.0

### Added
@@ -22,7 +36,7 @@

### Fixed

* If GPT-3 says the readability is too high (i.e. high likelyhood we have garabage), we will use ocrmypydf to re-evaluate the text in a PDF (https://github.com/SuffolkLITLab/FormFyxer/commit/a6dcd9872d2d0a6542f687aa46b1b9b00f16d3e5)
* If GPT-3 says the readability is too high (i.e. high likelihood we have garbage), we will use ocrmypdf to re-evaluate the text in a PDF (https://github.com/SuffolkLITLab/FormFyxer/commit/a6dcd9872d2d0a6542f687aa46b1b9b00f16d3e5)
* Adds more actionable information to the stats returned from `parse_form` (https://github.com/SuffolkLITLab/FormFyxer/pull/83):
* Gives more context for citations found in the text: https://github.com/SuffolkLITLab/FormFyxer/pull/83/commits/b62bd41958fc1bd0373b7698adde1a234779f77a

34 changes: 29 additions & 5 deletions README.md
@@ -80,9 +80,12 @@ Functions from `pdf_wrangling` are found on [our documentation site](https://suf
- [Parameters:](#parameters-10)
- [Returns:](#returns-10)
- [Example:](#example-10)
- [formfyxer.get\_sensitive\_data\_types(fields, fields\_old)](#formfyxerget_sensitive_data_typesfields-fields_old)
- [Parameters:](#parameters-11)
- [Returns:](#returns-11)
- [Example:](#example-11)
- [License](#license)


### formfyxer.re_case(text)
Reformats snake_case, camelCase, and similarly-formatted text into individual words.
#### Parameters:
@@ -99,9 +102,9 @@ A string where words combined by cases like snake_case are split back into indiv


### formfyxer.regex_norm_field(text)
Given an auto-generated field name (e.g., those applied by a PDF editor's find form feilds function), this function uses regular expressions to replace common auto-generated field names for those found in our [standard field names](https://suffolklitlab.org/docassemble-AssemblyLine-documentation/docs/label_variables/).
Given an auto-generated field name (e.g., those applied by a PDF editor's find form fields function), this function uses regular expressions to replace common auto-generated field names for those found in our [standard field names](https://suffolklitlab.org/docassemble-AssemblyLine-documentation/docs/label_variables/).
#### Parameters:
* **text : str** A string of words, such as that found in an auto-generated field name (e.g., those applied by a PDF editor's find form feilds function).
* **text : str** A string of words, such as that found in an auto-generated field name (e.g., those applied by a PDF editor's find form fields function).
#### Returns:
Either the original string/field name, or if a standard field name is found, the standard field name.
#### Example:
@@ -124,7 +127,7 @@ A snake_case string summarizing the input sentence.
#### Example:
```python
>>> import formfyxer
>>> reformat_field("this is a variable where you fill out your name")
>>> formfyxer.reformat_field("this is a variable where you fill out your name")
'variable_fill_name'
```
[back to top](#formfyxer)
@@ -345,7 +348,7 @@ A string with a proposed plain language rewrite.
### formfyxer.describe_form(text)
An OpenAI-enabled tool that will write a draft plain language description for a form. In order to use this feature **you must edit the `openai_org.txt` and `openai_key.txt` files found in this package to contain your OpenAI credentials**. You can sign up for an account and get your token on the [OpenAI signup](https://beta.openai.com/signup).

Given a string conataining the full text of a court form, this function will return its a draft description of the form written in plain language.
Given a string containing the full text of a court form, this function will return a draft description of the form written in plain language.

#### Parameters:
* **text : str** text.
@@ -444,6 +447,27 @@ An object grouping together similar field names.
[back to top](#formfyxer)



### formfyxer.get_sensitive_data_types(fields, fields_old)
Given a list of fields, identify those related to sensitive information and return a dictionary with the sensitive fields grouped by type. A list of the old field names can also be provided. These fields should be in the same order. Passing the old field names allows the sensitive field algorithm to match more accurately. The return value will not contain the old field name, only the corresponding field name from the first parameter.

The sensitive field types are: Bank Account Number, Credit Card Number, Driver's License Number, and Social Security Number.
#### Parameters:
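* **fields : List[str]** List of field names.
* **fields_old : Optional[List[str]]** Optional list of the original field names, in the same order and of the same length as `fields`. A `ValueError` is raised if the lengths differ.
#### Returns:
A dictionary mapping each sensitive data type to the list of field names from `fields` that matched it.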
#### Example:
```python
>>> import formfyxer
>>> formfyxer.get_sensitive_data_types(["users1_name", "users1_address", "users1_ssn", "users1_routing_number"])
{'Social Security Number': ['users1_ssn'], 'Bank Account Number': ['users1_routing_number']}
>>> formfyxer.get_sensitive_data_types(["user_ban1", "user_credit_card_number", "user_cvc", "user_cdl", "user_social_security"], ["old_bank_account_number", "old_credit_card_number", "old_cvc", "old_drivers_license", "old_ssn"])
{'Bank Account Number': ['user_ban1'], 'Credit Card Number': ['user_credit_card_number', 'user_cvc'], "Driver's License Number": ['user_cdl'], 'Social Security Number': ['user_social_security']}
```
[back to top](#formfyxer)
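
The matching described above can be sketched without installing FormFyxer. This is a minimal illustration under stated assumptions, not the library's implementation: the patterns are copied from `FIELD_PATTERNS` in `formfyxer/lit_explorer.py` as shown in this commit, and `sketch_sensitive_types` is a hypothetical stand-in name. It shows how a field is reported under a data type when either its current name or its old name (at the same index) matches.

```python
import re
from typing import Dict, List, Optional

# Patterns copied from FIELD_PATTERNS in this commit's lit_explorer.py.
FIELD_PATTERNS = {
    "Bank Account Number": [
        r"account[\W_]*number",
        r"ABA$",
        r"routing[\W_]*number",
        r"checking",
    ],
    "Credit Card Number": [r"credit[\W_]*card", r"(CV[CDV]2?|CCV|CSC)"],
    "Driver's License Number": [r"drivers[\W_]*license", r".?DL$"],
    "Social Security Number": [r"social[\W_]*security[\W_]*number", r"SSN", r"TIN$"],
}
FIELD_REGEXES = {
    name: re.compile("|".join(patterns), re.IGNORECASE | re.MULTILINE)
    for name, patterns in FIELD_PATTERNS.items()
}


def sketch_sensitive_types(
    fields: List[str], fields_old: Optional[List[str]] = None
) -> Dict[str, List[str]]:
    """Hypothetical stand-in that groups sensitive field names by data type."""
    if fields_old is not None and len(fields) != len(fields_old):
        raise ValueError("fields_old must have the same number of items as fields.")
    found: Dict[str, List[str]] = {}
    for i, field in enumerate(fields):
        for name, regex in FIELD_REGEXES.items():
            # A field counts as sensitive if its new name OR its old name matches.
            old_match = fields_old is not None and regex.search(fields_old[i])
            if regex.search(field) or old_match:
                # Only the name from `fields` is ever reported.
                found.setdefault(name, []).append(field)
    return found


print(sketch_sensitive_types(["users1_name", "users1_ssn"]))
# {'Social Security Number': ['users1_ssn']}
print(
    sketch_sensitive_types(
        ["user_ban1", "user_social_security"],
        ["old_bank_account_number", "old_ssn"],
    )
)
# {'Bank Account Number': ['user_ban1'], 'Social Security Number': ['user_social_security']}
```

In the second call neither new name matches on its own, so both results come from the old names. This is why passing `fields_old` can improve detection.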



## License
[MIT](https://github.com/SuffolkLITLab/FormFyxer/blob/main/LICENSE)

10 changes: 5 additions & 5 deletions formfyxer/docx_wrangling.py
@@ -52,8 +52,8 @@ def add_run_after(run, text):


def update_docx(
document: Union[docx.Document, str], modified_runs: List[Tuple[int, int, str, int]]
) -> docx.Document:
document: Union[docx.document.Document, str], modified_runs: List[Tuple[int, int, str, int]]
) -> docx.document.Document:
"""Update the document with the modified runs.
Note: OpenAI is probabilistic, so the modified run indices may not be correct.
@@ -423,7 +423,7 @@ def get_modified_docx_runs(
return guesses


def make_docx_plain_language(docx_path: str) -> docx.Document:
def make_docx_plain_language(docx_path: str) -> docx.document.Document:
"""
Convert a DOCX file to plain language with the help of OpenAI.
"""
@@ -449,8 +449,8 @@ def make_docx_plain_language(docx_path: str) -> docx.Document:
)
return update_docx(docx.Document(docx_path), guesses)


def modify_docx_with_openai_guesses(docx_path: str) -> docx.Document:
def modify_docx_with_openai_guesses(docx_path: str) -> docx.document.Document:
"""Uses OpenAI to guess the variable names for a document and then modifies the document with the guesses.
Args:
49 changes: 49 additions & 0 deletions formfyxer/lit_explorer.py
@@ -974,6 +974,53 @@ def get_citations(text: str, tokenized_sentences: List[str]) -> List[str]:
return citations_with_context


# NOTE: omitting "CID" for Credit Card IDs since it has a lot of false positives.
FIELD_PATTERNS = {
    "Bank Account Number": [
        r"account[\W_]*number",
        r"ABA$",
        r"routing[\W_]*number",
        r"checking",
    ],
    "Credit Card Number": [r"credit[\W_]*card", r"(CV[CDV]2?|CCV|CSC)"],
    "Driver's License Number": [r"drivers[\W_]*license", r".?DL$"],
    "Social Security Number": [r"social[\W_]*security[\W_]*number", r"SSN", r"TIN$"],
}
FIELD_REGEXES = {
    name: re.compile("|".join(patterns), re.IGNORECASE | re.MULTILINE)
    for name, patterns in FIELD_PATTERNS.items()
}


def get_sensitive_data_types(
    fields: List[str], fields_old: Optional[List[str]] = None
) -> Dict[str, List[str]]:
    """
    Given a list of fields, identify those related to sensitive information and return a dictionary with the sensitive
    fields grouped by type. A list of the old field names can also be provided. These fields should be in the same
    order. Passing the old field names allows the sensitive field algorithm to match more accurately. The return value
    will not contain the old field name, only the corresponding field name from the first parameter.
    The sensitive data types are: Bank Account Number, Credit Card Number, Driver's License Number, and Social Security
    Number.
    """

    if fields_old is not None and len(fields) != len(fields_old):
        raise ValueError(
            "If provided, fields_old must have the same number of items as fields."
        )

    sensitive_data_types: Dict[str, List[str]] = {}
    num_fields = len(fields)
    for i, field in enumerate(fields):
        for name, regex in FIELD_REGEXES.items():
            if re.search(regex, field):
                sensitive_data_types.setdefault(name, []).append(field)
            elif fields_old is not None and re.search(regex, fields_old[i]):
                sensitive_data_types.setdefault(name, []).append(field)
    return sensitive_data_types


def substitute_phrases(
input_string: str, substitution_phrases: Dict[str, str]
) -> Tuple[str, List[Tuple[int, int]]]:
@@ -1206,6 +1253,7 @@ def parse_form(
classify_field(field, new_names[index])
for index, field in enumerate(field_types)
]
sensitive_data_types = get_sensitive_data_types(new_names, field_names)

slotin_count = sum(1 for c in classified if c == AnswerType.SLOT_IN)
gathered_count = sum(1 for c in classified if c == AnswerType.GATHERED)
@@ -1234,6 +1282,7 @@ def parse_form(
"fields": new_names,
"fields_conf": new_names_conf,
"fields_old": field_names,
"sensitive data types": sensitive_data_types,
"text": text,
"original_text": original_text,
"number of sentences": sentence_count,
4 changes: 2 additions & 2 deletions formfyxer/pdf_wrangling.py
@@ -1152,8 +1152,8 @@ def get_possible_radios(img: Union[str, BinaryIO, cv2.Mat]):
maxRadius=50,
)
if circles is not None:
circles = np.uint16(np.around(circles))
for i in circles[0, :]:
rounded_circles = np.around(circles)
for i in rounded_circles[0, :]:
center = (i[0], i[1])
# circle center
cv2.circle(img_mat, center, 1, (0, 100, 100), 3)
50 changes: 49 additions & 1 deletion formfyxer/tests/test_lit_explorer.py
@@ -7,7 +7,7 @@

sys.path.insert(0, str(Path(__file__).resolve().parent.parent))

from formfyxer.lit_explorer import spot, substitute_phrases
from formfyxer.lit_explorer import spot, substitute_phrases, get_sensitive_data_types


class TestSubstitutePhrases(unittest.TestCase):
@@ -154,5 +154,53 @@ def test_calls_spot_with_reduced_character_count(self, mock_post):
)


class TestGetSensitiveDataTypes(unittest.TestCase):
    def test_without_fields_old(self):
        actual_output = get_sensitive_data_types(
            ["users1_name", "users1_address", "users1_ssn", "users1_routing_number"],
            None,
        )
        self.assertEqual(
            actual_output,
            {
                "Social Security Number": ["users1_ssn"],
                "Bank Account Number": ["users1_routing_number"],
            },
        )

    def test_merge_of_fields_from_both(self):
        actual_output = get_sensitive_data_types(
            [
                "user_ban1",
                "user_credit_card_number",
                "user_cvc",
                "user_cdl",
                "user_social_security",
            ],
            [
                "old_bank_account_number",
                "old_credit_card_number",
                "old_cvc",
                "old_drivers_license",
                "old_ssn",
            ],
        )
        self.assertEqual(
            actual_output,
            {
                "Bank Account Number": ["user_ban1"],
                "Credit Card Number": ["user_credit_card_number", "user_cvc"],
                "Driver's License Number": ["user_cdl"],
                "Social Security Number": ["user_social_security"],
            },
        )

    def test_no_sensitive_data_types(self):
        actual_output = get_sensitive_data_types(
            ["name", "address", "zip", "signature"]
        )
        self.assertEqual(actual_output, {})


if __name__ == "__main__":
unittest.main()
2 changes: 1 addition & 1 deletion setup.py
@@ -33,7 +33,7 @@ def run(self):
'nltk', 'boxdetect', 'pdf2image', 'reportlab>=3.6.13', 'pdfminer.six',
'opencv-python', 'ocrmypdf', 'eyecite', 'passivepy>=0.2.16', 'sigfig',
'typer>=0.4.1,<0.5.0', # typer pre 0.4.1 was broken by click 8.1.0: https://github.com/explosion/spaCy/issues/10564
'openai', 'transformers'
'openai', 'python-docx', 'tiktoken', 'transformers'
],
cmdclass={
'install': InstallSpacyModelCommand,
