F23 Final Merge #55

Merged
merged 333 commits into from
Jan 11, 2024
3e07fb3
Merge branch 'dev' into michigan-campaign-eda
averyschoen Oct 23, 2023
fb2eb2c
Merge pull request #31 from dsi-clinic/michigan-campaign-eda
averyschoen Oct 23, 2023
0ca6182
added skeleton for state cleaning base class
trevorspreadbury Oct 23, 2023
1d354f7
Update README.md
nrposner Oct 23, 2023
1b62f09
added proof of concept for requests based AZ scraper
trevorspreadbury Oct 24, 2023
8c2d443
Merge pull request #39 from dsi-clinic/nrposner-patch-1
averyschoen Oct 24, 2023
73d422b
Delete data/Contributions/Test/1998_mi_cfr_contributions.txt
averyschoen Oct 24, 2023
83fe536
Delete notebooks/mi_campaign_eda.ipynb
necabotheking Oct 24, 2023
257cf62
Delete notebooks/__init__.py
necabotheking Oct 24, 2023
b55a380
Delete data/MI_campaign_data.ipynb
necabotheking Oct 24, 2023
a6d3c84
update max line lengths
trevorspreadbury Oct 24, 2023
bab4e54
updated util and notebook README, revised EDA code based on TA input
yuzhouw313 Oct 24, 2023
16f8789
set isort and flake8 line lengths to 88 to deal with black 10% rule
trevorspreadbury Oct 25, 2023
c11ecfd
add scaffold of state cleaner class
trevorspreadbury Oct 25, 2023
6a5df86
remove erroneous file
trevorspreadbury Oct 25, 2023
3719523
Merge pull request #40 from dsi-clinic/cleaning/base_class
trevorspreadbury Oct 25, 2023
6b2acfa
Merge remote-tracking branch 'origin/dev' into michigan-expenditure-eda
necabotheking Oct 25, 2023
24413d5
updated docstrings for individual funcitons
nrposner Oct 25, 2023
62ac675
added typing for args and kwargs in individual scrapers
nrposner Oct 25, 2023
11b4db2
Create expenditure Constants
necabotheking Oct 25, 2023
66b3c11
update
necabotheking Oct 25, 2023
331c66b
Update README.md
necabotheking Oct 25, 2023
e7dd476
update MI expenditure EDA & constants
necabotheking Oct 25, 2023
1405d9a
Update constants.py
necabotheking Oct 25, 2023
4df1a92
Update mi_campaign_expenditure.ipynb
necabotheking Oct 25, 2023
348fdd2
modifications to files after Avery's feedback
alankagiri Oct 27, 2023
a9cff26
resolving pull issue
alankagiri Oct 27, 2023
1027778
Merge branch 'dev' into PA_EDA_and_Schema
alankagiri Oct 27, 2023
354f549
update webscraper to include expenditure data and contribution data …
necabotheking Oct 27, 2023
74f8f12
Update based on prior comments
necabotheking Oct 27, 2023
207a58f
Update mi_campaign_webscraper.py
necabotheking Oct 27, 2023
2a2185b
Update mi_campaign_webscraper.py
necabotheking Oct 27, 2023
aa03b0e
Update mi_campaign_webscraper.py
necabotheking Oct 27, 2023
caeb051
Update mi_campaign_webscraper.py
necabotheking Oct 27, 2023
2c7e963
PA util functions
alankagiri Oct 28, 2023
0e4d73f
created new crawler based on curl, base functionality established
nrposner Oct 28, 2023
e9ca764
some changes based on Trevor's input on util file
yuzhouw313 Oct 28, 2023
3ee5a0d
making progress on EDA since connected to cluster?
Oct 29, 2023
8974e71
Delete utils/az_web_crawler.py
nrposner Oct 30, 2023
71dd5c3
notebook used to experiment with curl crawling
nrposner Oct 30, 2023
657efed
Merge branch 'az_webcrawler_2' of https://github.com/dsi-clinic/2023-…
nrposner Oct 30, 2023
6f9a72e
updated notebook readme
nrposner Oct 30, 2023
3855804
adding state cleaner draft, utils file with cleaner functions, and up…
nrposner Oct 30, 2023
522e0fe
Merge branch 'dev' into az_webcrawler_2
nrposner Oct 30, 2023
725bebb
Seeing if linter test fails/passes
Oct 30, 2023
6e36229
one more check on linter test
Oct 30, 2023
eea6ae1
all linter tests passed
Oct 30, 2023
3d7790b
Delete DATA_271_Data_Clinic_I/Pennsylvania_Contributions.ipynb
alankagiri Oct 30, 2023
5fa4ee2
Delete DATA_271_Data_Clinic_I directory
alankagiri Oct 30, 2023
d01ab3d
part 1 of EDA (not including expenditures) is done
Oct 31, 2023
a2b6556
Merge branch 'dev' into w4_MN_CompleteData_EDA
averyschoen Oct 31, 2023
08378c5
Delete notebooks/az_webcrawler_3.ipynb
averyschoen Oct 31, 2023
96a0fb1
Update clean.py
necabotheking Nov 1, 2023
7ddc9cc
fixing in response to PR comments
nrposner Nov 1, 2023
23336a9
Update constants.py
necabotheking Nov 2, 2023
9586908
Removed VALUES_TO_CHECK
necabotheking Nov 3, 2023
f28a0d7
updated contribution notebook and util
yuzhouw313 Nov 4, 2023
afc3acf
previous commit has outdated EDA notebook
yuzhouw313 Nov 4, 2023
54bd8bc
Merge expenditure and contribution EDA
necabotheking Nov 5, 2023
e920267
fix linter
necabotheking Nov 5, 2023
3a2abb5
small changes to constants
nrposner Nov 6, 2023
b5ed824
tried to add iteration
nrposner Nov 6, 2023
9670d3d
experimented in notebook
nrposner Nov 6, 2023
4ac13aa
adding notebook back in to fix merge
nrposner Nov 6, 2023
bbc6fab
uploading notebook readme
Nov 6, 2023
943fb35
Merge pull request #35 from dsi-clinic/michigan-web-scraper
averyschoen Nov 6, 2023
f5050f2
Merge branch 'dev' into michigan-expenditure-eda
necabotheking Nov 6, 2023
70c87a8
Added pipeline implementation and edits to preprocess from team meeting
trevorspreadbury Nov 6, 2023
8a9c726
Merge pull request #43 from dsi-clinic/create_statecleaner
trevorspreadbury Nov 6, 2023
3f7514e
Implementing Wk6 feedback from Avery
Nov 7, 2023
c43d601
forgot to check linter tests. should work now
Nov 7, 2023
caef882
addressed Mon Avery's feedback and combined con and exp
yuzhouw313 Nov 7, 2023
27a69b9
updated basic curl crawler
nrposner Nov 8, 2023
11561f1
Merge branch 'dev' into PA_EDA_and_Schema
alankagiri Nov 8, 2023
2ab79e6
streamlined info table
nrposner Nov 8, 2023
9b8c740
resolved all comments from Avery apart from EDA on expenditures
Nov 8, 2023
15f381b
Merge branch 'PA_EDA_and_Schema' of github.com:dsi-clinic/2023-fall-c…
Nov 8, 2023
6b01e8a
git push after resolving merge conflicts and linter test
Nov 8, 2023
c7aeb13
passed black test
yuzhouw313 Nov 8, 2023
034d462
need to access dev file
yuzhouw313 Nov 8, 2023
b7b5dbc
minor changes for the sake of merging
Nov 8, 2023
f145750
Delete notebooks/arizona_scraper_proof_of_concept.ipynb
averyschoen Nov 9, 2023
56bb652
Delete notebooks/az_webcrawler_3.ipynb
averyschoen Nov 9, 2023
789712e
saving work, no need for review
Nov 9, 2023
7e113e9
fixing averys requests
nrposner Nov 10, 2023
2dc6fe5
bring branches up to date
nrposner Nov 10, 2023
65c27d2
bring branches up to date
nrposner Nov 10, 2023
cf9b993
fixed rest of issues
nrposner Nov 10, 2023
ccf0d23
added dtype to cleaner
nrposner Nov 10, 2023
d01b863
Delete notebooks/az_webcrawler_3.ipynb
averyschoen Nov 10, 2023
4b0d8d2
Delete utils/state_cleaner_draft.py
averyschoen Nov 10, 2023
4c1cc16
Merge branch 'dev' into az_basic_crawler
averyschoen Nov 10, 2023
00fec2d
Update clean.py
averyschoen Nov 10, 2023
60a7498
Update clean.py
averyschoen Nov 10, 2023
ebb59a7
Merge pull request #44 from dsi-clinic/az_basic_crawler
averyschoen Nov 10, 2023
e2f0a57
Remove dropdown and simplify graphs
necabotheking Nov 10, 2023
7f2277f
Merge branch 'dev' into michigan-expenditure-eda
necabotheking Nov 10, 2023
fac3335
Update constants.py
necabotheking Nov 10, 2023
e41525d
Merge pull request #41 from dsi-clinic/michigan-expenditure-eda
averyschoen Nov 10, 2023
276a7e0
Merge branch 'dev' into PA_EDA_and_Schema
alankagiri Nov 10, 2023
5e32d7e
major revelations about EDA... no need to look through yet
Nov 10, 2023
1fba923
major EDA revelations...no need to check yet
Nov 10, 2023
4565890
EDA with expenditure data done
Nov 11, 2023
ce30f96
first steps to cleaning
nrposner Nov 11, 2023
5c37e58
adding state cleaner functionality and updating curl crawler
nrposner Nov 13, 2023
87413d2
added state extraction functionality
nrposner Nov 13, 2023
115ddc9
added state validation
nrposner Nov 13, 2023
a69b8bb
fixed case sensitivity for states
nrposner Nov 13, 2023
404025a
first draft of MN abstract class, entity map not done
yuzhouw313 Nov 14, 2023
d45ab06
removed unnecessary crawler element, made more efficient
nrposner Nov 15, 2023
3c88de0
just saving my work, no need for review
Nov 15, 2023
daeaee3
update method descriptions
Nov 15, 2023
521636a
Merge pull request #47 from dsi-clinic/abstractclassdescriptions
averyschoen Nov 15, 2023
362c222
Merge branch 'dev' into MN_abstract_class
yuzhouw313 Nov 15, 2023
74b6e70
no need to check this commit, doing this before merging with dev
Nov 15, 2023
1f8db78
revised notebook changes
Nov 16, 2023
4e8c853
revised notebook changes
Nov 16, 2023
c9a0c4a
revised EDA after Avery's feedback
Nov 16, 2023
e7c38b3
linter tests passed after Avery's feedback
Nov 16, 2023
02e3a7b
final Eda
Nov 16, 2023
01da03a
Merge pull request #33 from dsi-clinic/PA_EDA_and_Schema
averyschoen Nov 16, 2023
27cc164
minor update, commit to merge dev
yuzhouw313 Nov 16, 2023
b8a7756
second draft of MN abstract implement, added entity map
yuzhouw313 Nov 20, 2023
38503ee
restoring older commit to solve commit problems
Nov 21, 2023
be53e64
gitignore
Nov 21, 2023
5f30046
updated crawler, clean_utils, and clean to run smoothly from end to end
nrposner Nov 24, 2023
70383a4
Update
necabotheking Nov 26, 2023
40785c7
updated crawler and cleaner, almost complete end to end
nrposner Nov 27, 2023
6d18723
Merge branch 'dev' into az_state_cleaner
nrposner Nov 27, 2023
013c20f
changed class name to ArizonaCleaner
nrposner Nov 27, 2023
be9921c
Merge branch 'az_state_cleaner' of https://github.com/dsi-clinic/2023…
nrposner Nov 27, 2023
828d76b
Merge branch 'dev' into w4_MN_CompleteData_EDA
yuzhouw313 Nov 27, 2023
0e916b4
Merge branch 'dev' into MN_abstract_class
yuzhouw313 Nov 27, 2023
09fe283
fixed linter issues and merging conflict
yuzhouw313 Nov 27, 2023
096d1ec
Merge remote-tracking branch 'refs/remotes/origin/MN_abstract_class' …
yuzhouw313 Nov 27, 2023
a5f23f6
Update constants.py
necabotheking Nov 27, 2023
fd2721f
fixed constant.py linter test
yuzhouw313 Nov 27, 2023
c5d375d
update raw data google drive link
yuzhouw313 Nov 27, 2023
9e4d021
updated some docstrings and info, addressing comments still in progress
nrposner Nov 27, 2023
ff4c83f
Delete utils/PA_constants.py
averyschoen Nov 27, 2023
704e468
Merge pull request #36 from dsi-clinic/w4_MN_CompleteData_EDA
averyschoen Nov 27, 2023
526b3a2
Implemented UUID mapping
necabotheking Nov 27, 2023
ca486e4
Delete notebooks/PennsylvaniaCleaner.py
averyschoen Nov 28, 2023
83f29e1
commiting changes before merging
Nov 29, 2023
ccd06f8
Finished create_organizations and create_individuals()
necabotheking Nov 30, 2023
fdcbc47
finish MichiganCleaner() and rename EDA notebook
necabotheking Nov 30, 2023
d627c23
finished minnesota.py and tested in jupyter notebook, updated dev, ut…
yuzhouw313 Nov 30, 2023
a1d1e66
updated notebook descriptions
nrposner Dec 1, 2023
1cf748f
updated AZ_EDA notebook to access needed data
nrposner Dec 1, 2023
2d2cbc1
update on PACleaner thus far. Still working on create_Tables
Dec 1, 2023
b2c473e
Delete utils/pennsylvania_helper_functions.py
alankagiri Dec 1, 2023
88d5ea8
made many changes for functionality and according to comments, employ…
nrposner Dec 1, 2023
a64d2a8
Delete utils/mn_state_cleaner.py
averyschoen Dec 1, 2023
17633f4
Merge branch 'dev' of github.com:dsi-clinic/2023-fall-clinic-climate-…
Dec 1, 2023
fc74fe2
Merge branch 'Pennsylvania_State_Cleaner' of github.com:dsi-clinic/20…
Dec 1, 2023
73e7578
rework michigan cleaner and add ID_MAP output
necabotheking Dec 3, 2023
fc4de31
fix transactions bug and linter error
necabotheking Dec 3, 2023
819fddc
preprocess done, create_tables almost done
Dec 3, 2023
54e5413
-linter check passed for pennsylvania.py
Dec 3, 2023
2ad79f5
-had to git rm PennsylvaniaCleaner.py to pass linter tests
Dec 3, 2023
6c09a75
moved the cleaner to its own file, updated crawler, cleaner, and add-ons
nrposner Dec 3, 2023
c28f285
updated filepaths and cleaner to run demo files
nrposner Dec 3, 2023
9317cea
updated some docstrings, fixed some bugs, moved towards schema
nrposner Dec 3, 2023
7c86a8c
changed name from arizona_cleaner to arizona
nrposner Dec 3, 2023
475b096
added note about readme
nrposner Dec 3, 2023
af8605e
added utils readme
nrposner Dec 3, 2023
1019f1e
updated readme
nrposner Dec 3, 2023
af60a51
remove functions and uncomment commented filepaths
necabotheking Dec 4, 2023
6a1438c
improved code quality based on Nico's input and updated dev README
yuzhouw313 Dec 4, 2023
7b19a74
fixed minor issue in creating mapping table csv
yuzhouw313 Dec 4, 2023
aaad2c6
Merge MN_abstract_class into dev-f23
trevorspreadbury Dec 4, 2023
01c104e
Merge remote-tracking branch 'origin/michigan-statecleaner' into dev-f23
trevorspreadbury Dec 4, 2023
500e67a
updated filepaths and setup
nrposner Dec 4, 2023
602862c
updated readme
nrposner Dec 4, 2023
b02ae29
Merge remote-tracking branch 'origin/Pennsylvania_State_Cleaner' into…
trevorspreadbury Dec 4, 2023
ecf212f
Merge remote-tracking branch 'origin/az_state_cleaner' into dev-f23
trevorspreadbury Dec 4, 2023
ac48cb2
ran minnesota on ipython with the whole dataset and produced right ou…
yuzhouw313 Dec 4, 2023
33a677a
updated readme
nrposner Dec 5, 2023
9ee45a7
progress on pennsylvania_Cleaner
Dec 5, 2023
7b2f04f
progress on pennsylvania_Cleaner
Dec 5, 2023
e909b85
Delete utils/arizona_cleaner.py
averyschoen Dec 5, 2023
28af64a
Delete utils/README.md
averyschoen Dec 5, 2023
5c79dd6
Update README.md
averyschoen Dec 5, 2023
51b490d
uncommented arizonacleaner in pipeline.py and imported
nrposner Dec 5, 2023
a832404
Update pipeline.py
averyschoen Dec 5, 2023
dcbf6ea
Update description in create_tables()
averyschoen Dec 5, 2023
930b2b8
Update clean.py
averyschoen Dec 5, 2023
261fa60
update for linter tests
Dec 5, 2023
40f92a1
Merge branch 'dev-f23' into MN_abstract_class
averyschoen Dec 5, 2023
57390b3
Merge pull request #51 from dsi-clinic/MN_abstract_class
averyschoen Dec 5, 2023
330ae38
Merge branch 'dev-f23' into az_state_cleaner
averyschoen Dec 5, 2023
74bc643
Merge pull request #52 from dsi-clinic/az_state_cleaner
averyschoen Dec 5, 2023
5f1a6fa
Delete notebooks/PA_EDA.ipynb
averyschoen Dec 5, 2023
f73bc10
update for linter
Dec 5, 2023
77b0d9f
addressed Avery's input and create 4 transaction table
yuzhouw313 Dec 5, 2023
05d41d1
fixed linter issue
yuzhouw313 Dec 5, 2023
70e241b
removed unused import
trevorspreadbury Dec 5, 2023
0b64184
uploading statecleaner to dev-f23
Dec 5, 2023
257314c
uploading statecleaner to dev-f23
Dec 5, 2023
96ac8c8
uploading statecleaner to dev-f23
Dec 5, 2023
c862708
making sure latest updates show in dev-f23
Dec 5, 2023
d9299f8
Add 12/4 minnesota cleaner
yuzhouw313 Dec 5, 2023
4bb53ed
Merge pull request #50 from dsi-clinic/dev-f23
nrposner Dec 5, 2023
f6a6717
Merge branch 'dev' into MN_abstract_class
averyschoen Dec 5, 2023
530a7d5
Merge pull request #46 from dsi-clinic/MN_abstract_class
averyschoen Dec 5, 2023
173c16d
pushing modified clean.py before git checkout dev-f23
Dec 5, 2023
b4b25a1
completing merge to dev-f23
Dec 5, 2023
2563e27
uncommented other state cleaners
trevorspreadbury Dec 5, 2023
d970c92
Merge pull request #54 from dsi-clinic/dev-f23
trevorspreadbury Dec 5, 2023
5ef6778
updated filepaths
nrposner Dec 6, 2023
f8e3dda
removed if main
nrposner Dec 6, 2023
e734cca
should pass linter now
nrposner Dec 6, 2023
a5b9e3a
Merge pull request #56 from dsi-clinic/arizona_corrections
averyschoen Dec 6, 2023
3b250e6
Saving changes to old EDA before switching branches
Dec 6, 2023
2e54a77
forgot to check linter tests...these should be ok now
Dec 6, 2023
1ebe37f
Delete the duplicated minnesota.py
yuzhouw313 Dec 6, 2023
33ef2c6
Merge pull request #58 from dsi-clinic/yuzhouw313-patch-1
averyschoen Dec 6, 2023
2835a15
remove pandas requirement and complete pipeline.py
necabotheking Dec 6, 2023
7773d8c
Update requirements.txt with pipreqs
necabotheking Dec 6, 2023
84bb956
Update README
necabotheking Dec 6, 2023
a35fba9
Update README.md
necabotheking Dec 6, 2023
8650b2e
fix linter error
necabotheking Dec 6, 2023
9c0df10
final revisions to pennsylvania_state_cleaner
Dec 7, 2023
75071e4
Merge branch 'dev' of github.com:dsi-clinic/2023-fall-clinic-climate-…
Dec 7, 2023
b5e6622
trying to solve linter test failure
Dec 7, 2023
dd44abe
updated docstrings and added transactions splitter
nrposner Dec 7, 2023
1e1516f
Merge pull request #59 from dsi-clinic/README-edits
trevorspreadbury Dec 7, 2023
f806264
Merge pull request #61 from dsi-clinic/arizona_corrections
trevorspreadbury Dec 7, 2023
21bb3f2
Merge branch 'dev' into new_pennsylvania_state_cleaner
trevorspreadbury Dec 7, 2023
715887a
Merge pull request #60 from dsi-clinic/new_pennsylvania_state_cleaner
trevorspreadbury Dec 7, 2023
897c21c
fix minnesota bugs that prevented pipeline from running
trevorspreadbury Dec 18, 2023
0b08647
Update PA webscraper to save each year to separate directories
trevorspreadbury Dec 18, 2023
b90a96e
update PA data readme
trevorspreadbury Dec 18, 2023
0860346
clean PA and eliminate bugs preventing pipeline from running
trevorspreadbury Dec 19, 2023
9bc8d11
fix docstrings in StateCleaner
trevorspreadbury Dec 19, 2023
4c25054
clean pennsylvania, re-order methods
trevorspreadbury Jan 2, 2024
71a3e39
updated michigan code
trevorspreadbury Jan 9, 2024
fc28b3c
Merge remote-tracking branch 'origin/dev' into PA_EDA_and_Schema
trevorspreadbury Jan 9, 2024
0780643
Merge pull request #57 from uchicago-dsi/PA_EDA_and_Schema
trevorspreadbury Jan 9, 2024
035ac6b
move deprecated helper function for az into notebook
trevorspreadbury Jan 9, 2024
2018dc6
moved scrapers into new scraper module
trevorspreadbury Jan 9, 2024
414e32b
update scrapers, AZ is WIP
trevorspreadbury Jan 9, 2024
b1e483a
move az helper functions to arizoner cleaner
trevorspreadbury Jan 9, 2024
34e439e
refactor functions to be more general for 'detailed' and other endpoints
trevorspreadbury Jan 10, 2024
c1867c4
fixed az scraper headers
trevorspreadbury Jan 10, 2024
068e1d5
working arizona scraper
trevorspreadbury Jan 11, 2024
fc21f1b
update arizona cleaner to work on subset
trevorspreadbury Jan 11, 2024
aab3204
Merge branch 'dev' of github.com:dsi-clinic/2023-fall-clinic-climate-…
trevorspreadbury Jan 11, 2024
6608c25
update states to return a single transactions table
trevorspreadbury Jan 11, 2024
a49c00b
fix pre-commit errors
trevorspreadbury Jan 11, 2024
6 changes: 3 additions & 3 deletions .devcontainer/devcontainer.json
@@ -23,15 +23,15 @@
},
"isort.args": [
"--profile", "black",
"--line_length", "80",
"--line_length", "88",
"--multi_line_output", "3",
"--include_trailing_comma", "true",
"force_grid_wrap", "0",
"--use_parentheses", "true"
],
"flake8.args": [
"--max-line-length", "80",
"--extend-ignore", "E203"
"--max-line-length", "88",
"--extend-ignore", "E203",
]
},
"extensions": [
1 change: 1 addition & 0 deletions .gitignore
@@ -138,3 +138,4 @@ venv.bak/

# data files
*.avro
data/*.txt
1 change: 0 additions & 1 deletion 2023-fall-clinic-climate-cabinet
Submodule 2023-fall-clinic-climate-cabinet deleted from 9b0d34
45 changes: 18 additions & 27 deletions README.md
@@ -1,12 +1,15 @@
# 2023-fall-clinic-climate-cabinet

## Project Background
## Data Science Clinic Project Goals

[Please add project background]
1. Collect each state's political campaign finance report data, which should include
recipient information, donor information, and transaction information.
2. Preprocess, clean, and standardize the collected raw data across 4 states
by implementing a state cleaner abstract class.
3. Conduct exploratory data analysis to facilitate examination of
the contributions made by green energy companies versus those made by fossil
fuel companies in state political campaign activity.

## Project Goals

[Please add project background]

## Usage

@@ -31,29 +34,17 @@ If you prefer to develop inside a container with VS Code then do the following s
3. Click the blue or green rectangle in the bottom left of VS code (should say something like `><` or `>< WSL`). Options should appear in the top center of your screen. Select `Reopen in Container`.


### Project Pipeline
1. Collect the data through **<span style="color: red;">one</span>** of the steps below:
a. Collect each state's campaign finance data either by web scraping (AZ, MI, PA) or by direct download (MN), OR
b. Go to the [Project's Google Drive](https://drive.google.com/drive/u/2/folders/1HUbOU0KRZy85mep2SHMU48qUQ1ZOSNce) and download each state's data to your local repo following this format: repo_root / "data" / "raw" / <State Initial> / "file"
2. Open the repo in the development container, which installs all necessary packages.
3. Run ```python utils/pipeline.py``` or ```python3 utils/pipeline.py``` to start the processing pipeline, which cleans and standardizes the data and concatenates the individuals, organizations, and transactions tables into one comprehensive database.
4. Running ```pipeline.py``` writes csv files to the output folder containing the complete individuals, organizations, and transactions DataFrames combining the AZ, MI, MN, and PA datasets.
5. For future reference, the pipeline also stores the mapping from each given id to our database id (generated via uuid) in a csv file named (state)IDMap.csv (example: ArizonaIDMap.csv) in the output folder.
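
The ID mapping step can be sketched roughly as follows. This is a minimal illustration, not the code in `utils/pipeline.py`: the function names `build_id_map` and `write_id_map` and the column names `original_id` and `database_id` are hypothetical, and the use of `uuid.uuid4` is an assumption (the README only says the ids are generated via uuid).

```python
import csv
import uuid
from typing import Dict, Iterable, TextIO


def build_id_map(original_ids: Iterable[str]) -> Dict[str, str]:
    """Assign one database id per distinct state-provided id.

    Assumption: random version-4 UUIDs are acceptable as database ids.
    """
    id_map: Dict[str, str] = {}
    for original in original_ids:
        # Repeated state ids reuse the database id assigned the first time.
        if original not in id_map:
            id_map[original] = str(uuid.uuid4())
    return id_map


def write_id_map(id_map: Dict[str, str], out: TextIO) -> None:
    """Write the mapping as csv with a header row (hypothetical column names)."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["original_id", "database_id"])
    writer.writerows(id_map.items())
```

A caller would open a file such as `output/ArizonaIDMap.csv` for writing and pass the handle to `write_id_map`.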

## Team Members

## Repository Structure

### utils
Project python code

### notebooks
Contains short, clean notebooks to demonstrate analysis.

### data

Contains details of acquiring all raw data used in repository. If data is small (<50MB) then it is okay to save it to the repo, making sure to clearly document how the data is obtained.

If the data is larger than 50MB then you should not add it to the repo and instead document how to get the data in the README.md file in the data directory.

This [README.md file](/data/README.md) should be kept up to date.

### output
Should contain work product generated by the analysis. Keep in mind that results should (generally) be excluded from the git repository.


## Team Member
Student Name: April Wang
Student Email: [email protected]

@@ -64,4 +55,4 @@ Student Name: Aïcha Camara
Student Email: [email protected]

Student Name: Alan Kagiri
Student Email: [email protected].
Student Email: [email protected].
161 changes: 159 additions & 2 deletions data/README.md
@@ -1,5 +1,162 @@
### Data
# Data

This directory contains information for use in this project.

Please make sure to document each source file here.
## Arizona Campaign Finance Data

### Summary
- The Arizona Campaign Finance Data are publicly available at (https://seethemoney.az.gov/Reporting/) as a collection of tables which can be downloaded as aggregated CSVs. It has no overt webscraping defenses, but scraping the bulk transactions is non-trivial. This site is supported by the Arizona Secretary of State’s office.


- The dataset comprises records of contributions, expenditures, vendor payments, and operating expenses for political entities within the State of Arizona. Individual contributions under $100 from within the state need not be identified by name, employer, or other identifying data, and are collected under pseudonyms such as 'Multiple Donors.' More specifically, the dataset includes:
- Individual donations, with those over $100 or from out of state always accompanied by identifying information
- Independent expenditures and ballot measure expenditures by PACs, political parties, and other organizations
- Income and expenses for candidate campaigns, political parties, PACs, and other organizations
- Payments to and refunds from vendors with regard to the aforementioned entities

- This project will focus on individual, corporate, and PAC spending.

### Features
- This dataset includes comprehensive records on races for the office of Governor, Attorney General, Corporation Commissioner, Secretary of State, State Senator, State Mine Inspector, State Representative, State Treasurer, and Superintendent of Public Education. Other races, such as mayoral or federal Congressional races, are covered only partially in the dataset and will not be studied.

- The dataset covers the years 2002 - present

- Data is divided into 8 sections: Candidates, PACs, Political Parties, Organizations, Independent Expenditures, Ballot Measures, Individual Contributions, and Vendors.

- Transactions required to report and itemize:
1. Contributions in excess of $100 or from out of state
2. Expenses in excess of $250
3. Independent Expenditures and Ballot Measure Expenditures

- Limitation:
1. Small-money contributions under $100 from within the state need not be itemized (though they frequently still are).
2. The easily available CSVs are heavily aggregated, and do not list both payer and payee. Detailed transaction data with dates, precise amounts, and both payer and payee are available, but they must be accessed individually, and over a million such records exist.
3. Lobbyist spending is not distinguished, and spending by political organizations is often obfuscated by listing them as vendors.

- Additional information:
1. Negative expenditures in the dataset are enclosed within parentheses. These expenditures indicate refunds to donors, loans paid, and similar transactions.
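
The parenthesized-negative convention noted above can be handled with a small parser. This is a sketch under stated assumptions: the function name `parse_amount` is hypothetical, and the handling of dollar signs and thousands separators is assumed rather than taken from the Arizona files.

```python
from decimal import Decimal


def parse_amount(raw: str) -> Decimal:
    """Convert an amount string to a signed Decimal.

    Values wrapped in parentheses, e.g. "($1,250.00)", are read as negative.
    """
    text = raw.strip()
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    # Assumption: amounts may carry a dollar sign and comma separators.
    text = text.replace("$", "").replace(",", "").strip()
    value = Decimal(text)
    return -value if negative else value
```

`Decimal` is used instead of `float` so that currency values keep their exact cents when summed.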

## Michigan Campaign Finance Data

### Summary
The Michigan Campaign Finance data are publicly available on the
[Michigan Department of State Website](https://miboecfr.nictusa.com/cfr/dumpall/cfrdetail/)
in txt format. The site has a captcha button that functions as an anti-web-scraping defense; however, the yearly campaign finance records are directly available at the link above.

The developers of this project have stored the 1995-2023 Michigan campaign finance
contribution data and READMEs in a Google Drive for the duration of this project.

### Features
- This dataset covers 1998 to 2023
- The contributions data are stored in tab-separated text files following the naming convention {year}_mi_cfr_contribution_00.txt, in which only the 00 file contains a header. The header includes RUNTIME, which indicates the time these transactions were exported from the Bureau of Elections database. RUNTIME is only included in the header.

- Transactions Required to Report
1. Beginning in January 2014, committees spending or receiving $5,000 or more in a calendar year were required to file electronically.
2. The Michigan Secretary of State website notes that, "when filing electronically, all of the information submitted by the committee is made available and information is not changed or manipulated by the Bureau of Elections".
3. Independent Expenditures from an individual of over $100.01 in a calendar year must be reported with the individual's employer information and occupation. The committee receiving this information must report it on campaign statements. More information on individuals and the Michigan Campaign Finance Act (MCFA) is available [here](https://mertsplus.com/mertsuserguide/index.php?n=MANUALS.AppendixQ).

- Additional information:
1. The MI Campaign Finance Data includes contributions/expenditures for federal and state legislators and representatives, as well as local elections.
2. Contribution Type Acronyms
- DIS: District Party
- STA: State Party
- BAL: Ballot Question
- COU: County Party
- POL: Political Party
- GUB: Gubernatorial
- CAN: Candidate
- IND: Independent PAC.
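
Because only the 00 file of each year carries a header, later parts have to borrow it. The sketch below shows one way to do that with the standard library. The function name `read_split_tsv` is hypothetical, and it assumes that every part uses the same column order as the 00 file and that the header is the first non-empty line of that file.

```python
import csv
from typing import Dict, Iterable, Iterator, TextIO


def read_split_tsv(parts: Iterable[TextIO]) -> Iterator[Dict[str, str]]:
    """Yield rows as dicts from several tab-separated parts.

    The first part is assumed to be the 00 file with the header;
    the remaining parts are assumed to contain data rows only.
    """
    header = None
    for part in parts:
        for row in csv.reader(part, delimiter="\t"):
            if not row:
                continue  # skip blank lines
            if header is None:
                header = row  # first non-empty row of the first part
                continue
            yield dict(zip(header, row))
```

The parts must be passed in order, starting with the 00 file, for example by sorting the file names for a given year before opening them.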


## Minnesota Campaign Finance Data

### Summary
- The Minnesota Campaign Finance data are available in this shared
[Google Drive](https://drive.google.com/drive/u/2/folders/1uA70woWDhTf3_0F8AbadDa_XIKraCeoc) in zip format and have no anti-webscraping defenses. Please first unzip the archive and store the 12 csv files (10 candidate-recipient contribution datasets, 1 noncandidate-recipient dataset, and 1 expenditure dataset) in your local repo in this format: repo root / "data" / "file name"

- The above dataset is provided by the Minnesota Campaign Finance website developer. This dataset includes 10 separate CSV files, each documenting contributions made to a specific recipient type from 1998 to 2023. This dataset also includes a non-candidate contribution dataset dating back to 1998 and an independent expenditure dataset dating back to 2015.

- The MN dataset comprises itemized records of contributions and expenditures exceeding $200, which matches the $200 reporting threshold set by Minnesota campaign finance regulations.

- For the purpose of our project I will focus on contributions and independent expenditures from 2018 to 2023.

### Features
- Races / Office Sought:
- AG: Attorney General
- AP: State Appeals Court Judge
- DC: State District Court Judge
- GC: Governor
- House: State Representative
- SA: State Auditor
- SC: State Supreme Court Justice
- Senate: State Senator
- SS: Secretary of State
- ST: State Treasurer (this office was abolished in 2003 and no longer exists)

- Donor Types:
- I: Individual
- L: Lobbyist
- C: Candidate Committee
- F: Political Committee/Fund
- S: Supporting Association
- P: Party Unit
- B: Business
- H: Hennepin County Local Candidate Committee
- U: Association Not Registered in Board
- O: Other
- PTU: Political Party Unit
- PCF: Political Committee and Fund

- Transactions required to report and itemize: Contributions received from any particular source in excess of $200 within a calendar year

- Limitation: Only covers contributions over $200, per MN campaign finance regulations


- Additional information:
1. in-kind: Donations of things other than money are in-kind contributions to the receiving entity
2. For the purpose of our project, I created a separate column of total donation by summing both monetary donation and in-kind donation
3. Contributors whose total contributions exceed $200 are individually itemized in separate rows. Contributions from donors who each give $200 or less are reported as aggregate totals and are not included in this dataset by definition.
4. The dataset has 467 missing rows, which belong to "Registration fee for Netroots event" and have no recipient, donor, or total donation amount.
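
The total donation column described in item 2 is a plain sum of the monetary and in-kind amounts. A minimal sketch follows; the function name `total_donation` is hypothetical, and treating a blank amount as zero is an assumption rather than something stated in the Minnesota documentation.

```python
from decimal import Decimal


def total_donation(monetary: str, in_kind: str) -> Decimal:
    """Sum the monetary and in-kind amounts of one contribution row."""

    def to_decimal(value: str) -> Decimal:
        # Assumption: an empty or missing field means no donation of that kind.
        value = (value or "").strip()
        return Decimal(value) if value else Decimal("0")

    return to_decimal(monetary) + to_decimal(in_kind)
```

Rows such as the "Registration fee for Netroots event" entries, which have no amounts at all, would come out as zero under this assumption, so they may be better filtered out before summing.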


## Pennsylvania Campaign Finance Data
### Summary
- The Pennsylvania Campaign Finance Data comes from the [Pennsylvania Department of State Website’s Full Campaign Finance Export section](https://www.dos.pa.gov/VotingElections/CandidatesCommittees/CampaignFinance/Resources/Pages/FullCampaignFinanceExport.aspx). Although only some years are visible on this page, the url for each year follows the same format. The actual forms and reports can be found here: https://www.dos.pa.gov/VotingElections/CandidatesCommittees/FormsReports/Pages/default.aspx. No anti-scraping defenses or captcha mechanisms monitor access to the data. The data is stored as csvs, but the files carry a .txt or .zip extension depending on the year. This is because, while the data spans 1987-2022, the formatting is inconsistent: the pre-2000 data list their sub-categories as separate .txt links, while each post-2000 year is a nested .zip file that contains the sub-categories. Additionally, 2022 has 2 additional fields (Timestamp & Reported ID) in the filer report, giving it more columns than previous years.

### Format
- The data consists of csv files organized according to year, with each year having 5 files representing 5 categories. Although the exact filenames are not consistent, they always start with the same string (at least since 2010): `contrib` (contributions), `debt` (debt), `expense` (expenditures), `filer` (basic filer info), and `receipt` (other receipts). While there is a readme file, it is of limited use since it merely describes some of the data types. A more useful description can be found on the https://www.dos.pa.gov/VotingElections/CandidatesCommittees/CampaignFinance/Resources/Pages/Technical-Specifications.aspx page, which redirects to a page that better details what the data is about. Even then, one must consult the filing documentation, such as the report cover sheet and its various Schedules, to better understand how some of the values (like filerType) can be interpreted.
- As mentioned, the format of the data changed in 2022: the name fields were concatenated (no more separate fields for first, middle, and last name), the treasurer’s address fields were removed from the filer document, party codes were updated to be more intuitive, and some more minor adjustments were made.
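
Since the filenames vary but their prefixes do not, files can be sorted into the five categories by prefix. The sketch below is illustrative only: the function name `classify_pa_file` is hypothetical, and matching case-insensitively is an assumption about how the files are named.

```python
from typing import Optional

# Prefixes and categories as listed in the Format section above.
PA_PREFIXES = {
    "contrib": "contributions",
    "debt": "debt",
    "expense": "expenditures",
    "filer": "basic filer info",
    "receipt": "other receipts",
}


def classify_pa_file(filename: str) -> Optional[str]:
    """Return the category for a file name, or None if no prefix matches."""
    name = filename.lower()  # assumption: prefix case may vary between years
    for prefix, category in PA_PREFIXES.items():
        if name.startswith(prefix):
            return category
    return None
```

Files that return `None`, such as the readme, can then be skipped when loading a year's data.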

### Features
- The following statewide offices, which can be grouped into administrative, judicial, and legislative appointments, are required to fill out the finance reports:

- Administrative:
1. GOV: Governor
2. LTG: Lieutenant Governor
3. ATT: Attorney General
4. AUD: Auditor General
5. TRE: State Treasurer

- Judicial:
6. SPM: Justice of the Supreme Court
7. SPR: Judge of the Superior Court
8. CCJ: Judge of the Commonwealth Court
9. CPJ: Judge of the Court of Common Pleas
10. MCJ: Judge of the Municipal Court
11. TCJ: Judge of the Traffic Court

- Legislative:
12. STS: Senator (General Assembly)
13. STH: Representative (General Assembly)

- Other:
14. OTH: Other candidates for local offices

### Other Relevant Information
1. Candidates, political committees, and contributing lobbyists are required to disclose expenditures and contributions through the Campaign Finance Report document. However, they can fill out a statement in lieu of a full report when the amount of contributions (including in-kind contributions) received, the amount of money expended, and the liabilities incurred each did not exceed \$250.00 during the reporting period. This means that the cumulative contributions received, the money expended, and liabilities incurred in the entire reporting period (campaign cycle), each did not exceed \$250.00.

2. Debts that are forgiven are considered contributions, but double counting is prevented as the data is reviewed and updated months after the last filing period, allowing for data that was classified as debt to be itemized as a contribution. Corporations or unincorporated associations are prohibited from forgiving debts and thus contributing in this manner.

3. The Finance Report states that a record must be kept for any contribution over \$10.00, but “Contributions and receipts of \$50.00 or less per contributor, during the reporting period, need not be itemized on the report”. This might mean that if, for instance, 1,000 people each donate \$50 or less, thousands or tens of thousands of dollars could be missing from the data, even though this information is recorded. The total contributions that filers itemize therefore do not necessarily reflect the total contributions they received.

4. Transparency USA has aggregated data on the contributions of individuals and committees. This could be a helpful source to cross-check the data and potentially help alleviate the debt-contribution issue. Pennsylvania's Department of State also offers a detailed website that shows all the aggregated contributions made and received, expenditures made, debts, and receipts. The catch is that one must know which candidate they are looking for, as it is a searchable database, but it can be very helpful for cross-matching and verification. Here's the link: https://www.campaignfinanceonline.pa.gov/Pages/CFReportSearch.aspx