Merge branch 'main' into change-dependencies

SuffolkLITLab · Jul 11, 2024 · 2617c72
2 parents 5f78625 + c50a8a0
Showing 8 changed files with 151 additions and 15 deletions.
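
The main functional change in this commit is `get_sensitive_data_types` in `formfyxer/lit_explorer.py`. The sketch below is a standalone approximation of that matching logic, with the patterns copied from the diff so it runs without installing FormFyxer. The name `sketch_sensitive_data_types` is hypothetical and used only for illustration; it is not part of the package.

```python
import re
from typing import Dict, List, Optional

# Patterns copied from FIELD_PATTERNS in the lit_explorer.py diff.
FIELD_PATTERNS = {
    "Bank Account Number": [
        r"account[\W_]*number",
        r"ABA$",
        r"routing[\W_]*number",
        r"checking",
    ],
    "Credit Card Number": [r"credit[\W_]*card", r"(CV[CDV]2?|CCV|CSC)"],
    "Driver's License Number": [r"drivers[\W_]*license", r".?DL$"],
    "Social Security Number": [r"social[\W_]*security[\W_]*number", r"SSN", r"TIN$"],
}

# One combined, case-insensitive regex per sensitive data type.
FIELD_REGEXES = {
    name: re.compile("|".join(patterns), re.IGNORECASE | re.MULTILINE)
    for name, patterns in FIELD_PATTERNS.items()
}


def sketch_sensitive_data_types(
    fields: List[str], fields_old: Optional[List[str]] = None
) -> Dict[str, List[str]]:
    """Hypothetical stand-in for get_sensitive_data_types (illustration only)."""
    if fields_old is not None and len(fields) != len(fields_old):
        raise ValueError(
            "If provided, fields_old must have the same number of items as fields."
        )

    found: Dict[str, List[str]] = {}
    for i, field in enumerate(fields):
        for name, regex in FIELD_REGEXES.items():
            # A field is flagged if either its new name or its old name matches.
            old_matches = fields_old is not None and regex.search(fields_old[i])
            if regex.search(field) or old_matches:
                # Only the new field name is ever reported.
                found.setdefault(name, []).append(field)
    return found


print(
    sketch_sensitive_data_types(
        ["users1_name", "users1_address", "users1_ssn", "users1_routing_number"]
    )
)
# {'Social Security Number': ['users1_ssn'], 'Bank Account Number': ['users1_routing_number']}
```

Because the result is keyed by data type, a field whose names match more than one pattern family appears under each matching type.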
diff --git a/.gitignore b/.gitignore
@@ -11,3 +11,4 @@ dev-testing/.DS_Store
 .env
 .venv
 venv
+formfyxer/keys/**
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,19 @@
 # CHANGELOG
 
+
+## Version v0.3.0
+
+### Added
+* Add warning when sensitive fields are detected by @codestronger in https://github.com/SuffolkLITLab/RateMyPDF/issues/25
+
+### Changed
+N/A
+
+### Fixed
+N/A
+
+**Full Changelog**: https://github.com/SuffolkLITLab/FormFyxer/compare/v0.2.0...v0.3.0
+
 ## Version v0.2.0
 
 ### Added
@@ -22,7 +36,7 @@
 
 ### Fixed
 
-* If GPT-3 says the readability is too high (i.e. high likelyhood we have garabage), we will use ocrmypydf to re-evaluate the text in a PDF (https://github.com/SuffolkLITLab/FormFyxer/commit/a6dcd9872d2d0a6542f687aa46b1b9b00f16d3e5)
+* If GPT-3 says the readability is too high (i.e. high likelihood we have garbage), we will use ocrmypdf to re-evaluate the text in a PDF (https://github.com/SuffolkLITLab/FormFyxer/commit/a6dcd9872d2d0a6542f687aa46b1b9b00f16d3e5)
 * Adds more actionable information to the stats returned from `parse_form` (https://github.com/SuffolkLITLab/FormFyxer/pull/83):
     * Gives more context for citations found in the text: https://github.com/SuffolkLITLab/FormFyxer/pull/83/commits/b62bd41958fc1bd0373b7698adde1a234779f77a
 

diff --git a/README.md b/README.md
@@ -80,9 +80,12 @@ Functions from `pdf_wrangling` are found on [our documentation site](https://suf
       - [Parameters:](#parameters-10)
       - [Returns:](#returns-10)
       - [Example:](#example-10)
+    - [formfyxer.get\_sensitive\_data\_types(fields, fields\_old)](#formfyxerget_sensitive_data_typesfields-fields_old)
+      - [Parameters:](#parameters-11)
+      - [Returns:](#returns-11)
+      - [Example:](#example-11)
   - [License](#license)
 
-
 ### formfyxer.re_case(text)
 Reformats snake_case, camelCase, and similarly-formatted text into individual words.
 #### Parameters:
@@ -99,9 +102,9 @@ A string where words combined by cases like snake_case are split back into indiv
 
 
 ### formfyxer.regex_norm_field(text)
-Given an auto-generated field name (e.g., those applied by a PDF editor's find form feilds function), this function uses regular expressions to replace common auto-generated field names for those found in our [standard field names](https://suffolklitlab.org/docassemble-AssemblyLine-documentation/docs/label_variables/). 
+Given an auto-generated field name (e.g., those applied by a PDF editor's find form fields function), this function uses regular expressions to replace common auto-generated field names for those found in our [standard field names](https://suffolklitlab.org/docassemble-AssemblyLine-documentation/docs/label_variables/). 
 #### Parameters:
-* **text : str** A string of words, such as that found in an auto-generated field name (e.g., those applied by a PDF editor's find form feilds function).
+* **text : str** A string of words, such as that found in an auto-generated field name (e.g., those applied by a PDF editor's find form fields function).
 #### Returns: 
 Either the original string/field name, or if a standard field name is found, the standard field name.
 #### Example:
@@ -124,7 +127,7 @@ A snake_case string summarizing the input sentence.
 #### Example:
 ```python
 >>> import formfyxer
->>> reformat_field("this is a variable where you fill out your name")
+>>> formfyxer.reformat_field("this is a variable where you fill out your name")
 'variable_fill_name'
 ```
 [back to top](#formfyxer)
@@ -345,7 +348,7 @@ A string with a proposed plain language rewrite.
 ### formfyxer.describe_form(text)
 An OpenAI-enabled tool that will write a draft plain language description for a form. In order to use this feature **you must edit the `openai_org.txt` and `openai_key.txt` files found in this package to contain your OpenAI credentials**. You can sign up for an account and get your token on the [OpenAI signup](https://beta.openai.com/signup).
 
-Given a string conataining the full text of a court form, this function will return its a draft description of the form written in plain language. 
+Given a string containing the full text of a court form, this function will return a draft description of the form written in plain language.
 
 #### Parameters:
 * **text : str** text.
@@ -444,6 +447,27 @@ An object grouping together similar field names.
 [back to top](#formfyxer)
 
 
+
+### formfyxer.get_sensitive_data_types(fields, fields_old)
+Given a list of fields, identify those related to sensitive information and return a dictionary with the sensitive fields grouped by type. A list of the old field names can also be provided. These fields should be in the same order. Passing the old field names allows the sensitive field algorithm to match more accurately. The return value will not contain the old field name, only the corresponding field name from the first parameter.
+
+The sensitive field types are: Bank Account Number, Credit Card Number, Driver's License Number, and Social Security Number.
+#### Parameters:
+* **fields : List[str]** List of field names.
+#### Returns: 
+A dictionary mapping each sensitive data type found to the list of matching field names from `fields`.
+#### Example:
+```python
+>>> import formfyxer
+>>> formfyxer.get_sensitive_data_types(["users1_name", "users1_address", "users1_ssn", "users1_routing_number"])
+{'Social Security Number': ['users1_ssn'], 'Bank Account Number': ['users1_routing_number']}
+>>> formfyxer.get_sensitive_data_types(["user_ban1", "user_credit_card_number", "user_cvc", "user_cdl", "user_social_security"], ["old_bank_account_number", "old_credit_card_number", "old_cvc", "old_drivers_license", "old_ssn"])
+{'Bank Account Number': ['user_ban1'], 'Credit Card Number': ['user_credit_card_number', 'user_cvc'], "Driver's License Number": ['user_cdl'], 'Social Security Number': ['user_social_security']}
+```
+[back to top](#formfyxer)
+
+
+
 ## License
 [MIT](https://github.com/SuffolkLITLab/FormFyxer/blob/main/LICENSE)
 

diff --git a/formfyxer/docx_wrangling.py b/formfyxer/docx_wrangling.py
@@ -52,8 +52,8 @@ def add_run_after(run, text):
 
 
 def update_docx(
-    document: Union[docx.Document, str], modified_runs: List[Tuple[int, int, str, int]]
-) -> docx.Document:
+    document: Union[docx.document.Document, str], modified_runs: List[Tuple[int, int, str, int]]
+) -> docx.document.Document:
     """Update the document with the modified runs.
 
     Note: OpenAI is probabilistic, so the modified run indices may not be correct.
@@ -423,7 +423,7 @@ def get_modified_docx_runs(
     return guesses
 
 
-def make_docx_plain_language(docx_path: str) -> docx.Document:
+def make_docx_plain_language(docx_path: str) -> docx.document.Document:
     """
     Convert a DOCX file to plain language with the help of OpenAI.
     """
@@ -449,8 +449,8 @@ def make_docx_plain_language(docx_path: str) -> docx.Document:
     )
     return update_docx(docx.Document(docx_path), guesses)
 
-
-def modify_docx_with_openai_guesses(docx_path: str) -> docx.Document:
+  
+def modify_docx_with_openai_guesses(docx_path: str) -> docx.document.Document:
     """Uses OpenAI to guess the variable names for a document and then modifies the document with the guesses.
 
     Args:

diff --git a/formfyxer/lit_explorer.py b/formfyxer/lit_explorer.py
@@ -974,6 +974,53 @@ def get_citations(text: str, tokenized_sentences: List[str]) -> List[str]:
     return citations_with_context
 
 
+# NOTE: omitting "CID" for Credit Card IDs since it has a lot of false positives.
+FIELD_PATTERNS = {
+    "Bank Account Number": [
+        r"account[\W_]*number",
+        r"ABA$",
+        r"routing[\W_]*number",
+        r"checking",
+    ],
+    "Credit Card Number": [r"credit[\W_]*card", r"(CV[CDV]2?|CCV|CSC)"],
+    "Driver's License Number": [r"drivers[\W_]*license", r".?DL$"],
+    "Social Security Number": [r"social[\W_]*security[\W_]*number", r"SSN", r"TIN$"],
+}
+FIELD_REGEXES = {
+    name: re.compile("|".join(patterns), re.IGNORECASE | re.MULTILINE)
+    for name, patterns in FIELD_PATTERNS.items()
+}
+
+
+def get_sensitive_data_types(
+    fields: List[str], fields_old: Optional[List[str]] = None
+) -> Dict[str, List[str]]:
+    """
+    Given a list of fields, identify those related to sensitive information and return a dictionary with the sensitive
+    fields grouped by type. A list of the old field names can also be provided. These fields should be in the same
+    order. Passing the old field names allows the sensitive field algorithm to match more accurately. The return value
+    will not contain the old field name, only the corresponding field name from the first parameter.
+
+    The sensitive data types are: Bank Account Number, Credit Card Number, Driver's License Number, and Social Security
+    Number.
+    """
+
+    if fields_old is not None and len(fields) != len(fields_old):
+        raise ValueError(
+            "If provided, fields_old must have the same number of items as fields."
+        )
+
+    sensitive_data_types: Dict[str, List[str]] = {}
+    num_fields = len(fields)
+    for i, field in enumerate(fields):
+        for name, regex in FIELD_REGEXES.items():
+            if re.search(regex, field):
+                sensitive_data_types.setdefault(name, []).append(field)
+            elif fields_old is not None and re.search(regex, fields_old[i]):
+                sensitive_data_types.setdefault(name, []).append(field)
+    return sensitive_data_types
+
+
 def substitute_phrases(
     input_string: str, substitution_phrases: Dict[str, str]
 ) -> Tuple[str, List[Tuple[int, int]]]:
@@ -1206,6 +1253,7 @@ def parse_form(
         classify_field(field, new_names[index])
         for index, field in enumerate(field_types)
     ]
+    sensitive_data_types = get_sensitive_data_types(new_names, field_names)
 
     slotin_count = sum(1 for c in classified if c == AnswerType.SLOT_IN)
     gathered_count = sum(1 for c in classified if c == AnswerType.GATHERED)
@@ -1234,6 +1282,7 @@ def parse_form(
         "fields": new_names,
         "fields_conf": new_names_conf,
         "fields_old": field_names,
+        "sensitive data types": sensitive_data_types,
         "text": text,
         "original_text": original_text,
         "number of sentences": sentence_count,

diff --git a/formfyxer/pdf_wrangling.py b/formfyxer/pdf_wrangling.py
@@ -1152,8 +1152,8 @@ def get_possible_radios(img: Union[str, BinaryIO, cv2.Mat]):
         maxRadius=50,
     )
     if circles is not None:
-        circles = np.uint16(np.around(circles))
-        for i in circles[0, :]:
+        rounded_circles = np.around(circles)
+        for i in rounded_circles[0, :]:
             center = (i[0], i[1])
             # circle center
             cv2.circle(img_mat, center, 1, (0, 100, 100), 3)

diff --git a/formfyxer/tests/test_lit_explorer.py b/formfyxer/tests/test_lit_explorer.py
@@ -7,7 +7,7 @@
 
 sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
 
-from formfyxer.lit_explorer import spot, substitute_phrases
+from formfyxer.lit_explorer import spot, substitute_phrases, get_sensitive_data_types
 
 
 class TestSubstitutePhrases(unittest.TestCase):
@@ -154,5 +154,53 @@ def test_calls_spot_with_reduced_character_count(self, mock_post):
         )
 
 
+class TestGetSensitiveDataTypes(unittest.TestCase):
+    def test_without_fields_old(self):
+        actual_output = get_sensitive_data_types(
+            ["users1_name", "users1_address", "users1_ssn", "users1_routing_number"],
+            None,
+        )
+        self.assertEqual(
+            actual_output,
+            {
+                "Social Security Number": ["users1_ssn"],
+                "Bank Account Number": ["users1_routing_number"],
+            },
+        )
+
+    def test_merge_of_fields_from_both(self):
+        actual_output = get_sensitive_data_types(
+            [
+                "user_ban1",
+                "user_credit_card_number",
+                "user_cvc",
+                "user_cdl",
+                "user_social_security",
+            ],
+            [
+                "old_bank_account_number",
+                "old_credit_card_number",
+                "old_cvc",
+                "old_drivers_license",
+                "old_ssn",
+            ],
+        )
+        self.assertEqual(
+            actual_output,
+            {
+                "Bank Account Number": ["user_ban1"],
+                "Credit Card Number": ["user_credit_card_number", "user_cvc"],
+                "Driver's License Number": ["user_cdl"],
+                "Social Security Number": ["user_social_security"],
+            },
+        )
+
+    def test_no_sensitive_data_types(self):
+        actual_output = get_sensitive_data_types(
+            ["name", "address", "zip", "signature"]
+        )
+        self.assertEqual(actual_output, {})
+
+
 if __name__ == "__main__":
     unittest.main()
diff --git a/setup.py b/setup.py
@@ -33,7 +33,7 @@ def run(self):
         'nltk', 'boxdetect', 'pdf2image', 'reportlab>=3.6.13', 'pdfminer.six',
         'opencv-python', 'ocrmypdf', 'eyecite', 'passivepy>=0.2.16', 'sigfig',
         'typer>=0.4.1,<0.5.0', # typer pre 0.4.1 was broken by click 8.1.0: https://github.com/explosion/spaCy/issues/10564
-        'openai', 'transformers' 
+        'openai', 'python-docx', 'tiktoken', 'transformers' 
     ],
     cmdclass={
       'install': InstallSpacyModelCommand,