diff --git a/data/README.md b/data/README.md index 0d33649..84b2890 100644 --- a/data/README.md +++ b/data/README.md @@ -2,6 +2,7 @@ This directory contains information for use in this project. +Please make sure to document each source file here. #### Arizona Campaign Finance Data ##### Summary @@ -67,7 +68,6 @@ contribution data and READMEs in a Google Drive for the duration of this project - CAN: Candidate - IND: Independent PAC. -======= #### Minnesota Campaign Finance Data @@ -138,35 +138,50 @@ contribution data and READMEs in a Google Drive for the duration of this project - O: Other 7. The new dataset has 467 missing rows, of which belong to "Registration fee for Netroots event" and have no recipient, donor, or total donation amount. -#### Pennsylvania Campaign Finance Data -##### Accessibility -- The data comes from the Pennsylvania Government Website’s Full Finance Campaign Report section. To see the actual forms and reports those can be found here: https://www.dos.pa.gov/VotingElections/CandidatesCommittees/FormsReports/Pages/default.aspx +#### Pennsylvania Campaign Finance Data +##### Summary +- The Pennsylvania Campaign Finance Data comes from the Pennsylvania Government Website’s Full Finance Campaign Report section. To see the actual forms and reports +those can be found here: https://www.dos.pa.gov/VotingElections/CandidatesCommittees/FormsReports/Pages/default.aspx. No defenses or anti-captcha mechanisms exist +to monitor access to the data. The data is stored in the form of csvs, but is named with a .txt and .zip tag depending on the year. This is because while the data +spans from 1987-2022, there are incongruences in the formatting. The pre-2000 data have their sub-categories listed as separate links in the form of .txt, while the +post-2000 years have each year as a .zip nested file that contains the sub-categories. 
Additionally, 2022 has 2 additional fields (Timestamp & Reported ID) in the +filer report, giving it more columns than previous years. ##### Format -- The data consists of csv files organized according to year, with each annual file having 5 files/categories: contributions, debt, expenditures, basic filer info, and other receipts. While there is a readme file, it is largely ineffectual since it merely describes some of the data types. A more useful description can be found on the https://www.dos.pa.gov/VotingElections/CandidatesCommittees/CampaignFinance/Resources/Pages/Technical-Specifications.aspx page, which redirects to a page that better details what the data is about. Even then one must consult the filing documentations like the report cover sheet, its various Schedules. -- In 2022 there were changes to the format of the data, including concatenating the name fields (no more separate fields for first, middle and last name), the treasurer’s address fields were removed in the filer document, party codes were updated to be more intuitive, and some more minor adjustments. - -##### Defenses -None +- The data consists of csv files organized according to year, with each annual file having 5 files/categories: contributions, debt, expenditures, basic filer info, and other receipts. While there is a readme file, it is largely ineffectual since it merely describes some of the data types. A more useful description can be found on the https://www.dos.pa.gov/VotingElections/CandidatesCommittees/CampaignFinance/Resources/Pages/Technical-Specifications.aspx page, which redirects to a page that better details what the data is about. Even then, one must consult the filing documentation, like the report cover sheet and its various Schedules, to better understand how some of the values (like filerType) can be interpreted. 
+- As mentioned, in 2022 there were changes to the format of the data: the name fields were concatenated (no more separate fields for first, middle and last name), the treasurer’s address fields were removed from the filer document, party codes were updated to be more intuitive, and some other minor adjustments were made. -##### Races -Races included (i.e state legislator, mayoral, city council, etc) -- Seems to be filed by any candidates running for any level of government, notably statewide, legislative, and judicial offices (general and municipal elections) - -##### Years that the Dataset covers: 1987-2023 +##### Features +- Candidates for the following statewide offices, which can be grouped into administrative, judicial, and legislative categories, are required to fill out the finance reports: + +- Administrative: +1. GOV: Governor +2. LTG: Lieutenant Governor +3. ATT: Attorney General +4. AUD: Auditor General +5. TRE: State Treasurer + +- Judicial: +6. SPM: Justice of the Supreme Court +7. SPR: Judge of the Superior Court +8. CCJ: Judge of the Commonwealth Court +9. CPJ: Judge of the Court of Common Pleas +10. MCJ: Judge of the Municipal Court +11. TCJ: Judge of the Traffic Court + +- Legislative: +12. STS: Senator (General Assembly) +13. STH: Representative (General Assembly) + +- Other: +14. OTH: Other candidates for local offices -##### Reporting: -Who is required to report their contributions: -- Candidates, political committees, and contributing lobbyists are required to disclose expenditures and contributions through the Campaign Finance Report document. However they can fill out a statement in lieu of a full report when the amount of contributions (including in-kind contributions) received, the amount of money expended, and the liabilities incurred each did not exceed $250.00 during the reporting period. 
This means that the cumulative contributions received, the money expended, and liabilities incurred in the entire reporting period (campaign cycle), each did not exceed $250.00 -- Unless the incurred contributions and expenditures do not individually exceed $250. If they don’t exceed then they can file a statement instead of a report. I am unsure where this leaves individuals, but I am still scouring for this information. +##### Other Relevant Information +1. Candidates, political committees, and contributing lobbyists are required to disclose expenditures and contributions through the Campaign Finance Report document. However they can fill out a statement in lieu of a full report when the amount of contributions (including in-kind contributions) received, the amount of money expended, and the liabilities incurred each did not exceed $250.00 during the reporting period. This means that the cumulative contributions received, the money expended, and liabilities incurred in the entire reporting period (campaign cycle), each did not exceed \$250.00. -##### Limits to the dataset -- Although there is a README file detailing what the dataset columns are, the column don’t match up with the format of some of the documents, so one has to cross match manually which can be tricky, especially with columns that have many NaN values. -- The Finance Report states that a record must be kept for any contribution over $10, but “Contributions and receipts of $50.00 or less per contributor, during the reporting period, need not be itemized on the report” … this might mean that if 1,000 people for instance donate $50 or less, there could be potentially thousands/tens of thousands of $$ not shown on the data. +2. Debts that are forgiven are considered contributions, but double counting is prevented as the data is reviewed and updated months after the last filing period, allowing for data that was classified as debt to be itemized as a contribution. 
Corporations or unincorporated associations are prohibited from forgiving debts and thus contributing in this manner. -##### Other Relevant Information -1. Filer Document: Contains info about each filer. Filers range from interest groups, individuals, committees (including PACs like VisionPAC, Build PA PAC...etc), and private organizations. The filer document is needed to trace the contributions dataset to the individuals who filed them. -2. Debts that are forgiven are considered contributions, but double counting is prevented as the data is reviewed and updated months after the last filing period, allowing for data that was classified as debt to be itemized as a contribution. Corporations or unincorporated associations are prohibited from forgiving debts and thus contributing in this manner. -3. Transparency USA has aggregated data on the contributions of individuals and committees. This could be a helpful source to cross-check the data and potentially help alleviate the debt-contribution issue. +3. The Finance Report states that a record must be kept for any contribution over \$10.00, but “Contributions and receipts of \$50.00 or less per contributor, during the reporting period, need not be itemized on the report”. This might mean that if, for instance, 1,000 people each donate \$50 or less, tens of thousands of dollars could be missing from the data, even though this information is recorded. As a result, the total contributions that filers itemize do not necessarily reflect the total contributions they received. +4. Transparency USA has aggregated data on the contributions of individuals and committees. This could be a helpful source to cross-check the data and potentially help alleviate the debt-contribution issue. Pennsylvania's Dept. of State also offers a detailed website that shows all the aggregated contributions made and received, expenditures made, debts, and receipts. 
The catch is that one must know which candidate to look for, since it is a searchable database, but it can be very helpful for cross-matching and verification. Here's the link: https://www.campaignfinanceonline.pa.gov/Pages/CFReportSearch.aspx \ No newline at end of file diff --git a/notebooks/PA_EDA.ipynb b/notebooks/PA_EDA.ipynb new file mode 100644 index 0000000..18f0d8d --- /dev/null +++ b/notebooks/PA_EDA.ipynb @@ -0,0 +1,3662 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "id": "a55e9cbe", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import numpy as np\n", + "import plotly.express as px\n", + "import warnings\n", + "warnings.filterwarnings('ignore')\n", + "import sys\n", + "sys.path.append('/home/alankagiri/2023-fall-clinic-climate-cabinet')\n", + "from utils import PA_EDA_Functions as eda\n", + "from utils import PA_Data_Web_Scraper as scraper\n", + "from utils import constants as const" + ] + }, + { + "cell_type": "markdown", + "id": "49be4b35", + "metadata": {}, + "source": [ + "##### This notebook examines Pennsylvania's campaign data specifically from 2018-2023, although earlier years can also be loaded into the analysis. The dataset is relational, with the five documents per annum (contributions, debt, expenditures, other receipts, and filer) linked through a unique filer ID." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "19c96a4a", + "metadata": {}, + "outputs": [], + "source": [ + "# download the data\n", + "scraper.download_PA_data(2018,2023)" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "7010f66e", + "metadata": {}, + "outputs": [], + "source": [ + "#initialize the datasets:\n", + "contrib_paths = [[\"../data/contrib_2018_03042019.txt\", 2018],\n", + " [\"../data/contrib.txt\", 2019],\n", + " [\"../data/contrib_2020.txt\",2020],\n", + " [\"../data/contrib_2021.txt\",2021],\n", + " [\"../data/contrib_2022.txt\",2022],\n", + " [\"../data/2023/contrib_2023.txt\",2023]]\n", + "\n", + "filer_paths = [[\"../data/filer_2018_03042019.txt\", 2018],\n", + " [\"../data/filer.txt\",2019],\n", + " [\"../data/filer_2020.txt\",2020],\n", + " [\"../data/filer_2021.txt\",2021],\n", + " [\"../data/filer_2022.txt\",2022],\n", + " [\"../data/2023/filer_2023.txt\",2023]]\n", + "\n", + "expense_paths = [[\"../data/expense_2018_03042019.txt\",2018],\n", + " [\"../data/expense.txt\", 2019],\n", + " [\"../data/expense_2020.txt\", 2020],\n", + " [\"../data/expense_2021.txt\",2021],\n", + " [\"../data/expense_2022.txt\",2022],\n", + " [\"../data/2023/expense_2023.txt\",2023]]" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Skipping line 209819: expected 24 fields, saw 29\n", + "\n", + "Skipping line 465906: expected 24 fields, saw 25\n", + "\n", + "Skipping line 1334: expected 12 fields, saw 15\n", + "Skipping line 62726: expected 12 fields, saw 13\n", + "\n", + "Skipping line 108099: expected 12 fields, saw 13\n", + "\n", + "Skipping line 251552: expected 24 fields, saw 27\n", + "\n", + "Skipping line 60173: expected 12 fields, saw 13\n", + "\n", + "Skipping line 523329: expected 24 fields, saw 47\n", + "Skipping line 523330: expected 24 fields, saw 47\n", + "\n", + "Skipping line 14404: expected 12 
fields, saw 17\n", + "\n", + "Skipping line 66486: expected 12 fields, saw 13\n", + "Skipping line 66487: expected 12 fields, saw 13\n", + "Skipping line 66488: expected 12 fields, saw 13\n", + "Skipping line 66489: expected 12 fields, saw 13\n", + "Skipping line 66490: expected 12 fields, saw 13\n", + "Skipping line 66491: expected 12 fields, saw 13\n", + "Skipping line 66492: expected 12 fields, saw 13\n", + "\n", + "Skipping line 109048: expected 26 fields, saw 27\n", + "\n", + "Skipping line 12499: expected 14 fields, saw 15\n", + "Skipping line 29777: expected 14 fields, saw 15\n", + "Skipping line 29778: expected 14 fields, saw 15\n", + "\n", + "Skipping line 227923: expected 26 fields, saw 27\n", + "\n" + ] + } + ], + "source": [ + "merged_datasets_per_year = []\n", + "merged_expense_dataset = []\n", + "for i in range(len(contrib_paths)):\n", + " contrib_df = eda.initialize_PA_dataset(contrib_paths[i][0],contrib_paths[i][1])\n", + " filer_df = eda.initialize_PA_dataset(filer_paths[i][0],filer_paths[i][1])\n", + " expense_df = eda.initialize_PA_dataset(expense_paths[i][0],expense_paths[i][1])\n", + " merged = eda.merge_same_year_datasets(contrib_df,filer_df)\n", + " merged_datasets_per_year.append(merged)\n", + " merged_expense_dataset.append(expense_df)" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "63f2c05a", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
FILER_IDYEARCONTRIBUTORCONT_AMT_1CONT_AMT_2CONT_AMT_3CONT_DESCRIPTOTAL_CONT_AMTCONTRIBUTOR_TYPEFILER_TYPEFILER_NAMEOFFICEPARTY
020000812018Joseph A Ribas25.0000NaN25.00INDIVIDUALCommitteeFIRSTENERGY CORP. POLITICAL ACTION COMMITTEENaNNaN
120000812018Paul J Kashella40.0000NaN40.00INDIVIDUALCommitteeFIRSTENERGY CORP. POLITICAL ACTION COMMITTEENaNNaN
220000812018Vicky C Thiel25.0000NaN25.00INDIVIDUALCommitteeFIRSTENERGY CORP. POLITICAL ACTION COMMITTEENaNNaN
320000812018Joseph B Hildebrandt20.0000NaN20.00INDIVIDUALCommitteeFIRSTENERGY CORP. POLITICAL ACTION COMMITTEENaNNaN
420000812018Jacqueline A Espinoza50.0000NaN50.00INDIVIDUALCommitteeFIRSTENERGY CORP. POLITICAL ACTION COMMITTEENaNNaN
..........................................
3874593936712023ERIC J YARNELL38.4700NaN38.47INDIVIDUALCommitteeHIGHMARK PAC OF HIGHMARK INC.NaNNaN
3874603936712023PATRICIA LAUGHLIN116.0000NaN116.00INDIVIDUALCommitteeHIGHMARK PAC OF HIGHMARK INC.NaNNaN
3874613936712023MATTHEW J RHENISH130.0000NaN130.00INDIVIDUALCommitteeHIGHMARK PAC OF HIGHMARK INC.NaNNaN
3874623936712023JAMES J BENEDICT192.3000NaN192.30INDIVIDUALCommitteeHIGHMARK PAC OF HIGHMARK INC.NaNNaN
3874633936712023KRISTY A YOHEY50.0000NaN50.00INDIVIDUALCommitteeHIGHMARK PAC OF HIGHMARK INC.NaNNaN
\n", + "

6507958 rows × 13 columns

\n", + "
" + ], + "text/plain": [ + " FILER_ID YEAR CONTRIBUTOR CONT_AMT_1 CONT_AMT_2 \\\n", + "0 2000081 2018 Joseph A Ribas 25.00 0 \n", + "1 2000081 2018 Paul J Kashella 40.00 0 \n", + "2 2000081 2018 Vicky C Thiel 25.00 0 \n", + "3 2000081 2018 Joseph B Hildebrandt 20.00 0 \n", + "4 2000081 2018 Jacqueline A Espinoza 50.00 0 \n", + "... ... ... ... ... ... \n", + "387459 393671 2023 ERIC J YARNELL 38.47 0 \n", + "387460 393671 2023 PATRICIA LAUGHLIN 116.00 0 \n", + "387461 393671 2023 MATTHEW J RHENISH 130.00 0 \n", + "387462 393671 2023 JAMES J BENEDICT 192.30 0 \n", + "387463 393671 2023 KRISTY A YOHEY 50.00 0 \n", + "\n", + " CONT_AMT_3 CONT_DESCRIP TOTAL_CONT_AMT CONTRIBUTOR_TYPE FILER_TYPE \\\n", + "0 0 NaN 25.00 INDIVIDUAL Committee \n", + "1 0 NaN 40.00 INDIVIDUAL Committee \n", + "2 0 NaN 25.00 INDIVIDUAL Committee \n", + "3 0 NaN 20.00 INDIVIDUAL Committee \n", + "4 0 NaN 50.00 INDIVIDUAL Committee \n", + "... ... ... ... ... ... \n", + "387459 0 NaN 38.47 INDIVIDUAL Committee \n", + "387460 0 NaN 116.00 INDIVIDUAL Committee \n", + "387461 0 NaN 130.00 INDIVIDUAL Committee \n", + "387462 0 NaN 192.30 INDIVIDUAL Committee \n", + "387463 0 NaN 50.00 INDIVIDUAL Committee \n", + "\n", + " FILER_NAME OFFICE PARTY \n", + "0 FIRSTENERGY CORP. POLITICAL ACTION COMMITTEE NaN NaN \n", + "1 FIRSTENERGY CORP. POLITICAL ACTION COMMITTEE NaN NaN \n", + "2 FIRSTENERGY CORP. POLITICAL ACTION COMMITTEE NaN NaN \n", + "3 FIRSTENERGY CORP. POLITICAL ACTION COMMITTEE NaN NaN \n", + "4 FIRSTENERGY CORP. POLITICAL ACTION COMMITTEE NaN NaN \n", + "... ... ... ... \n", + "387459 HIGHMARK PAC OF HIGHMARK INC. NaN NaN \n", + "387460 HIGHMARK PAC OF HIGHMARK INC. NaN NaN \n", + "387461 HIGHMARK PAC OF HIGHMARK INC. NaN NaN \n", + "387462 HIGHMARK PAC OF HIGHMARK INC. NaN NaN \n", + "387463 HIGHMARK PAC OF HIGHMARK INC. 
NaN NaN \n", + "\n", + "[6507958 rows x 13 columns]" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "contrib_filer_info_2018_2023 = eda.merge_all_datasets(merged_datasets_per_year)\n", + "contrib_filer_info_2018_2023" + ] + }, + { + "cell_type": "markdown", + "id": "23fc2d42", + "metadata": {}, + "source": [ + "##### 1.1 For each column, what are its contents? How many blanks or nulls are there? What is the format? If it is one of several types, what are those types?" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "1488530d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
columnNamecolTypenumNullsnull_percent
0FILER_IDobject00.00
1YEARint6400.00
2CONTRIBUTORobject00.00
3CONT_AMT_1float6400.00
4CONT_AMT_2int6400.00
5CONT_AMT_3int6400.00
6CONT_DESCRIPobject633878297.40
7TOTAL_CONT_AMTfloat6400.00
8CONTRIBUTOR_TYPEobject00.00
9FILER_TYPEobject55730.09
10FILER_NAMEobject00.00
11OFFICEobject623682495.83
12PARTYobject428937565.91
\n", + "
" + ], + "text/plain": [ + " columnName colType numNulls null_percent\n", + "0 FILER_ID object 0 0.00\n", + "1 YEAR int64 0 0.00\n", + "2 CONTRIBUTOR object 0 0.00\n", + "3 CONT_AMT_1 float64 0 0.00\n", + "4 CONT_AMT_2 int64 0 0.00\n", + "5 CONT_AMT_3 int64 0 0.00\n", + "6 CONT_DESCRIP object 6338782 97.40\n", + "7 TOTAL_CONT_AMT float64 0 0.00\n", + "8 CONTRIBUTOR_TYPE object 0 0.00\n", + "9 FILER_TYPE object 5573 0.09\n", + "10 FILER_NAME object 0 0.00\n", + "11 OFFICE object 6236824 95.83\n", + "12 PARTY object 4289375 65.91" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cols, type, nulls, null_percent = [],[],[],[]\n", + "for column in contrib_filer_info_2018_2023.columns:\n", + " cols.append(column)\n", + " type.append(contrib_filer_info_2018_2023.dtypes[column]) \n", + " nulls.append(contrib_filer_info_2018_2023[column].isna().sum(),)\n", + " null_percent.append(round(contrib_filer_info_2018_2023[column].isna().sum()/len(contrib_filer_info_2018_2023)*100,2))\n", + "\n", + "summary_df = {'columnName':cols, 'colType':type,'numNulls':nulls,'null_percent':null_percent}\n", + "summary_df = pd.DataFrame(summary_df)#, columns==['columnName','colType','numNulls','nullPercent'])\n", + "summary_df\n" + ] + }, + { + "cell_type": "markdown", + "id": "e65803c2", + "metadata": {}, + "source": [ + " Having reduced the contributor and filer datasets to the relevant datasets, it is evident that with the exception of {CONT_DESCRIP, OFFICE, PARTY} columns, most of the values are reported and available. With regards to the type of data stored in the datasets, most are considered objects (which are mainly strings), in part due to the presence of dirty/inconsistent data inputs." + ] + }, + { + "cell_type": "markdown", + "id": "7de3ef9a", + "metadata": {}, + "source": [ + "##### 2.1 Who are the top 10 contributors in your data? The top 10 recipients?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "89830e0b", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
TOTAL_CONT_AMT
CONTRIBUTOR
CHARLOTTE SWENSON114202495.38
Jeffrey Yass57205000.00
Total Other Contributions37769756.19
COMMONWEALTH CHILDREN'S CHOICE FUND29672522.70
STUDENTS FIRST PAC28641424.71
STUDENT'S FIRST PAC18500000.00
COMMONWEALTH LEADERS FUND17637897.21
Contributions from FEC Report16911418.85
House Democratic Campaign Committee14139567.49
CONTRIBUTIONS FROM NON-PA SOURCES13101403.54
\n", + "
" + ], + "text/plain": [ + " TOTAL_CONT_AMT\n", + "CONTRIBUTOR \n", + "CHARLOTTE SWENSON 114202495.38\n", + "Jeffrey Yass 57205000.00\n", + "Total Other Contributions 37769756.19\n", + "COMMONWEALTH CHILDREN'S CHOICE FUND 29672522.70\n", + "STUDENTS FIRST PAC 28641424.71\n", + "STUDENT'S FIRST PAC 18500000.00\n", + "COMMONWEALTH LEADERS FUND 17637897.21\n", + "Contributions from FEC Report 16911418.85\n", + "House Democratic Campaign Committee 14139567.49\n", + "CONTRIBUTIONS FROM NON-PA SOURCES 13101403.54" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "eda.top_n_contributors(contrib_filer_info_2018_2023,10)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "575200fc", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
TOTAL_CONT_AMT
FILER_NAME
FRIENDS OF JENNIFER O'MARA115522369.09
Shapiro for Pennsylvania76829505.52
Students First PAC58525000.00
COMMONWEALTH LEADERS FUND39822241.33
COMMONWEALTH CHILDREN'S CHOICE FUND36168500.00
PA Democratic Party34933417.16
Pennsylvania House Democratic Campaign Committee28272822.94
International Brotherhood of Electrical Workers Local 98 Committee on Political Education26470647.64
HOUSE REPUBLICAN CAMPAIGN COMMITTEE24083705.10
AMERICAN FEDERATION OF TEACHERS, AFL-CIO COPE (AFT/COPE)23304568.79
\n", + "
" + ], + "text/plain": [ + " TOTAL_CONT_AMT\n", + "FILER_NAME \n", + "FRIENDS OF JENNIFER O'MARA 115522369.09\n", + "Shapiro for Pennsylvania 76829505.52\n", + "Students First PAC 58525000.00\n", + "COMMONWEALTH LEADERS FUND 39822241.33\n", + "COMMONWEALTH CHILDREN'S CHOICE FUND 36168500.00\n", + "PA Democratic Party 34933417.16\n", + "Pennsylvania House Democratic Campaign Committee 28272822.94\n", + "International Brotherhood of Electrical Workers... 26470647.64\n", + "HOUSE REPUBLICAN CAMPAIGN COMMITTEE 24083705.10\n", + "AMERICAN FEDERATION OF TEACHERS, AFL-CIO COPE (... 23304568.79" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "eda.top_n_recipients(contrib_filer_info_2018_2023,10)" + ] + }, + { + "cell_type": "markdown", + "id": "bf489e51", + "metadata": {}, + "source": [ + "##### 3.1 Make a bar chart with plotly comparing contributions by donor type (PAC, individual, etc) and one comparing recipients by the office type they are running for" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "359b2fd0", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + " \n", + " " + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.plotly.v1+json": { + "config": { + "plotlyServerURL": "https://plot.ly" + }, + "data": [ + { + "alignmentgroup": "True", + "hovertemplate": "Type of Filer=Candidate
YEAR=%{x}
Total Contribution Amount=%{y}", + "legendgroup": "Candidate", + "marker": { + "color": "#636efa", + "pattern": { + "shape": "" + } + }, + "name": "Candidate", + "offsetgroup": "Candidate", + "orientation": "v", + "showlegend": true, + "textposition": "auto", + "type": "bar", + "x": [ + 2018, + 2019, + 2020, + 2021, + 2022, + 2023 + ], + "xaxis": "x", + "y": [ + 1568627.22, + 517345.52, + 590490.29, + 900134.08, + 2476099.3, + 358696.73 + ], + "yaxis": "y" + }, + { + "alignmentgroup": "True", + "hovertemplate": "Type of Filer=Committee
YEAR=%{x}
Total Contribution Amount=%{y}", + "legendgroup": "Committee", + "marker": { + "color": "#EF553B", + "pattern": { + "shape": "" + } + }, + "name": "Committee", + "offsetgroup": "Committee", + "orientation": "v", + "showlegend": true, + "textposition": "auto", + "type": "bar", + "x": [ + 2018, + 2019, + 2020, + 2021, + 2022, + 2023 + ], + "xaxis": "x", + "y": [ + 315345576.08, + 328678633.98, + 354754859.47, + 248565501.96, + 462917633.5, + 142906754.72 + ], + "yaxis": "y" + }, + { + "alignmentgroup": "True", + "hovertemplate": "Type of Filer=Lobbyist
YEAR=%{x}
Total Contribution Amount=%{y}", + "legendgroup": "Lobbyist", + "marker": { + "color": "#00cc96", + "pattern": { + "shape": "" + } + }, + "name": "Lobbyist", + "offsetgroup": "Lobbyist", + "orientation": "v", + "showlegend": true, + "textposition": "auto", + "type": "bar", + "x": [ + 2018, + 2019, + 2020, + 2021, + 2022, + 2023 + ], + "xaxis": "x", + "y": [ + 21065, + 45923, + 18888.44, + 50000, + 172329.76, + 6595 + ], + "yaxis": "y" + } + ], + "layout": { + "barmode": "relative", + "legend": { + "title": { + "text": "Type of Filer" + }, + "tracegroupgap": 0 + }, + "template": { + "data": { + "bar": [ + { + "error_x": { + "color": "#2a3f5f" + }, + "error_y": { + "color": "#2a3f5f" + }, + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "bar" + } + ], + "barpolar": [ + { + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "barpolar" + } + ], + "carpet": [ + { + "aaxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "baxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "type": "carpet" + } + ], + "choropleth": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "choropleth" + } + ], + "contour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ 
+ 1, + "#f0f921" + ] + ], + "type": "contour" + } + ], + "contourcarpet": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "contourcarpet" + } + ], + "heatmap": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmap" + } + ], + "heatmapgl": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmapgl" + } + ], + "histogram": [ + { + "marker": { + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "histogram" + } + ], + "histogram2d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2d" + } + ], + "histogram2dcontour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ 
+ [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2dcontour" + } + ], + "mesh3d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "mesh3d" + } + ], + "parcoords": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "parcoords" + } + ], + "pie": [ + { + "automargin": true, + "type": "pie" + } + ], + "scatter": [ + { + "fillpattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + }, + "type": "scatter" + } + ], + "scatter3d": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter3d" + } + ], + "scattercarpet": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattercarpet" + } + ], + "scattergeo": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergeo" + } + ], + "scattergl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergl" + } + ], + "scattermapbox": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattermapbox" + } + ], + "scatterpolar": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolar" + } + ], + "scatterpolargl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolargl" + } + ], + "scatterternary": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterternary" + } + ], + "surface": [ + { 
+ "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "surface" + } + ], + "table": [ + { + "cells": { + "fill": { + "color": "#EBF0F8" + }, + "line": { + "color": "white" + } + }, + "header": { + "fill": { + "color": "#C8D4E3" + }, + "line": { + "color": "white" + } + }, + "type": "table" + } + ] + }, + "layout": { + "annotationdefaults": { + "arrowcolor": "#2a3f5f", + "arrowhead": 0, + "arrowwidth": 1 + }, + "autotypenumbers": "strict", + "coloraxis": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "colorscale": { + "diverging": [ + [ + 0, + "#8e0152" + ], + [ + 0.1, + "#c51b7d" + ], + [ + 0.2, + "#de77ae" + ], + [ + 0.3, + "#f1b6da" + ], + [ + 0.4, + "#fde0ef" + ], + [ + 0.5, + "#f7f7f7" + ], + [ + 0.6, + "#e6f5d0" + ], + [ + 0.7, + "#b8e186" + ], + [ + 0.8, + "#7fbc41" + ], + [ + 0.9, + "#4d9221" + ], + [ + 1, + "#276419" + ] + ], + "sequential": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "sequentialminus": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 
0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ] + }, + "colorway": [ + "#636efa", + "#EF553B", + "#00cc96", + "#ab63fa", + "#FFA15A", + "#19d3f3", + "#FF6692", + "#B6E880", + "#FF97FF", + "#FECB52" + ], + "font": { + "color": "#2a3f5f" + }, + "geo": { + "bgcolor": "white", + "lakecolor": "white", + "landcolor": "#E5ECF6", + "showlakes": true, + "showland": true, + "subunitcolor": "white" + }, + "hoverlabel": { + "align": "left" + }, + "hovermode": "closest", + "mapbox": { + "style": "light" + }, + "paper_bgcolor": "white", + "plot_bgcolor": "#E5ECF6", + "polar": { + "angularaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "radialaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "scene": { + "xaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "yaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "zaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + } + }, + "shapedefaults": { + "line": { + "color": "#2a3f5f" + } + }, + "ternary": { + "aaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "baxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "caxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "title": { + "x": 0.05 + }, + "xaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 
2 + }, + "yaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + } + } + }, + "title": { + "text": "PA Recipients of Annual Contributions (2018 - 2023)" + }, + "xaxis": { + "anchor": "y", + "domain": [ + 0, + 1 + ], + "title": { + "text": "YEAR" + } + }, + "yaxis": { + "anchor": "x", + "domain": [ + 0, + 1 + ], + "title": { + "text": "Total Contribution Amount" + } + } + } + }, + "text/html": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
YEARFILER_TYPETOTAL_CONT_AMT
02018Candidate1568627.22
12018Committee315345576.08
22018Lobbyist21065.00
32019Candidate517345.52
42019Committee328678633.98
52019Lobbyist45923.00
62020Candidate590490.29
72020Committee354754859.47
82020Lobbyist18888.44
92021Candidate900134.08
102021Committee248565501.96
112021Lobbyist50000.00
122022Candidate2476099.30
132022Committee462917633.50
142022Lobbyist172329.76
152023Candidate358696.73
162023Committee142906754.72
172023Lobbyist6595.00
\n", + "
" + ], + "text/plain": [ + " YEAR FILER_TYPE TOTAL_CONT_AMT\n", + "0 2018 Candidate 1568627.22\n", + "1 2018 Committee 315345576.08\n", + "2 2018 Lobbyist 21065.00\n", + "3 2019 Candidate 517345.52\n", + "4 2019 Committee 328678633.98\n", + "5 2019 Lobbyist 45923.00\n", + "6 2020 Candidate 590490.29\n", + "7 2020 Committee 354754859.47\n", + "8 2020 Lobbyist 18888.44\n", + "9 2021 Candidate 900134.08\n", + "10 2021 Committee 248565501.96\n", + "11 2021 Lobbyist 50000.00\n", + "12 2022 Candidate 2476099.30\n", + "13 2022 Committee 462917633.50\n", + "14 2022 Lobbyist 172329.76\n", + "15 2023 Candidate 358696.73\n", + "16 2023 Committee 142906754.72\n", + "17 2023 Lobbyist 6595.00" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "eda.compare_cont_by_donorType(contrib_filer_info_2018_2023)" + ] + }, + { + "cell_type": "markdown", + "id": "8c96cd77", + "metadata": {}, + "source": [ + " The dataset is organized from the perspective of the entity filing the finance reports, which in this case is either a political committee, a lobbyist, or a candidate. As such, it is somewhat difficult to ascertain the classification of the contributors (were they a PAC, an individual, a corporation...) as there is no linearity in their names. However, the overwhelming majority of contribution recipients were committees, indicating that most entities donated to PACs or SuperPACS." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "cffc88f2", + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.plotly.v1+json": { + "config": { + "plotlyServerURL": "https://plot.ly" + }, + "data": [ + { + "alignmentgroup": "True", + "hovertemplate": "OFFICE=%{x}
Total Contribution Amount=%{y}", + "legendgroup": "", + "marker": { + "color": "#636efa", + "pattern": { + "shape": "" + } + }, + "name": "", + "offsetgroup": "", + "orientation": "v", + "showlegend": false, + "textposition": "auto", + "type": "bar", + "x": [ + "Attorney General", + "Auditor General", + "Governor", + "Judge of the CommonWealth Crt", + "Judge of the Crt of Common Pleas", + "Judge of the Municipal Crt", + "Judge of the Superior Crt", + "Justice of the Supreme Crt", + "Liutenant Gov", + "Member of Dem State Committee", + "Member of Rep State Committee", + "Rep (General Assembly)", + "Senator (General Assembly)", + "State Treasurer", + "USP", + "United States Congress" + ], + "xaxis": "x", + "y": [ + 8573124.57, + 3963032.23, + 48193607.75, + 3285090.5, + 19324260.25, + 736581.25, + 9700580.78, + 15532691.39, + 4422494.4, + 37681.58, + 26445.42, + 230652186.96, + 57801291.46, + 2001511.27, + 2823.37, + 1416781.5899999999 + ], + "yaxis": "y" + } + ], + "layout": { + "barmode": "relative", + "legend": { + "tracegroupgap": 0 + }, + "template": { + "data": { + "bar": [ + { + "error_x": { + "color": "#2a3f5f" + }, + "error_y": { + "color": "#2a3f5f" + }, + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "bar" + } + ], + "barpolar": [ + { + "marker": { + "line": { + "color": "#E5ECF6", + "width": 0.5 + }, + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "barpolar" + } + ], + "carpet": [ + { + "aaxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "baxis": { + "endlinecolor": "#2a3f5f", + "gridcolor": "white", + "linecolor": "white", + "minorgridcolor": "white", + "startlinecolor": "#2a3f5f" + }, + "type": "carpet" + } + ], + "choropleth": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": 
"choropleth" + } + ], + "contour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "contour" + } + ], + "contourcarpet": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "contourcarpet" + } + ], + "heatmap": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmap" + } + ], + "heatmapgl": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "heatmapgl" + } + ], + "histogram": [ + { + "marker": { + "pattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + } + }, + "type": "histogram" + } + ], + "histogram2d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 
0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2d" + } + ], + "histogram2dcontour": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "histogram2dcontour" + } + ], + "mesh3d": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "type": "mesh3d" + } + ], + "parcoords": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "parcoords" + } + ], + "pie": [ + { + "automargin": true, + "type": "pie" + } + ], + "scatter": [ + { + "fillpattern": { + "fillmode": "overlay", + "size": 10, + "solidity": 0.2 + }, + "type": "scatter" + } + ], + "scatter3d": [ + { + "line": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatter3d" + } + ], + "scattercarpet": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattercarpet" + } + ], + "scattergeo": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergeo" + } + ], + "scattergl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattergl" + } + ], + "scattermapbox": [ + { + "marker": { + 
"colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scattermapbox" + } + ], + "scatterpolar": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolar" + } + ], + "scatterpolargl": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterpolargl" + } + ], + "scatterternary": [ + { + "marker": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "type": "scatterternary" + } + ], + "surface": [ + { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + }, + "colorscale": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "type": "surface" + } + ], + "table": [ + { + "cells": { + "fill": { + "color": "#EBF0F8" + }, + "line": { + "color": "white" + } + }, + "header": { + "fill": { + "color": "#C8D4E3" + }, + "line": { + "color": "white" + } + }, + "type": "table" + } + ] + }, + "layout": { + "annotationdefaults": { + "arrowcolor": "#2a3f5f", + "arrowhead": 0, + "arrowwidth": 1 + }, + "autotypenumbers": "strict", + "coloraxis": { + "colorbar": { + "outlinewidth": 0, + "ticks": "" + } + }, + "colorscale": { + "diverging": [ + [ + 0, + "#8e0152" + ], + [ + 0.1, + "#c51b7d" + ], + [ + 0.2, + "#de77ae" + ], + [ + 0.3, + "#f1b6da" + ], + [ + 0.4, + "#fde0ef" + ], + [ + 0.5, + "#f7f7f7" + ], + [ + 0.6, + "#e6f5d0" + ], + [ + 0.7, + "#b8e186" + ], + [ + 0.8, + "#7fbc41" + ], + [ + 0.9, + "#4d9221" + ], + [ + 1, + "#276419" + ] + ], + "sequential": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 
0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ], + "sequentialminus": [ + [ + 0, + "#0d0887" + ], + [ + 0.1111111111111111, + "#46039f" + ], + [ + 0.2222222222222222, + "#7201a8" + ], + [ + 0.3333333333333333, + "#9c179e" + ], + [ + 0.4444444444444444, + "#bd3786" + ], + [ + 0.5555555555555556, + "#d8576b" + ], + [ + 0.6666666666666666, + "#ed7953" + ], + [ + 0.7777777777777778, + "#fb9f3a" + ], + [ + 0.8888888888888888, + "#fdca26" + ], + [ + 1, + "#f0f921" + ] + ] + }, + "colorway": [ + "#636efa", + "#EF553B", + "#00cc96", + "#ab63fa", + "#FFA15A", + "#19d3f3", + "#FF6692", + "#B6E880", + "#FF97FF", + "#FECB52" + ], + "font": { + "color": "#2a3f5f" + }, + "geo": { + "bgcolor": "white", + "lakecolor": "white", + "landcolor": "#E5ECF6", + "showlakes": true, + "showland": true, + "subunitcolor": "white" + }, + "hoverlabel": { + "align": "left" + }, + "hovermode": "closest", + "mapbox": { + "style": "light" + }, + "paper_bgcolor": "white", + "plot_bgcolor": "#E5ECF6", + "polar": { + "angularaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "radialaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "scene": { + "xaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "yaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + }, + "zaxis": { + "backgroundcolor": "#E5ECF6", + "gridcolor": "white", + "gridwidth": 2, + "linecolor": "white", + "showbackground": true, + "ticks": "", + "zerolinecolor": "white" + } + }, + "shapedefaults": { + "line": { + "color": 
"#2a3f5f" + } + }, + "ternary": { + "aaxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "baxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + }, + "bgcolor": "#E5ECF6", + "caxis": { + "gridcolor": "white", + "linecolor": "white", + "ticks": "" + } + }, + "title": { + "x": 0.05 + }, + "xaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + }, + "yaxis": { + "automargin": true, + "gridcolor": "white", + "linecolor": "white", + "ticks": "", + "title": { + "standoff": 15 + }, + "zerolinecolor": "white", + "zerolinewidth": 2 + } + } + }, + "title": { + "text": "PA Contributions Received by Office-Type From 2018-2023" + }, + "xaxis": { + "anchor": "y", + "domain": [ + 0, + 1 + ], + "title": { + "text": "OFFICE" + } + }, + "yaxis": { + "anchor": "x", + "domain": [ + 0, + 1 + ], + "title": { + "text": "Total Contribution Amount" + } + } + } + }, + "text/html": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
OFFICETOTAL_CONT_AMT
0Attorney General8573124.57
1Auditor General3963032.23
2Governor48193607.75
3Judge of the CommonWealth Crt3285090.50
4Judge of the Crt of Common Pleas19324260.25
5Judge of the Municipal Crt736581.25
6Judge of the Superior Crt9700580.78
7Justice of the Supreme Crt15532691.39
8Liutenant Gov4422494.40
9Member of Dem State Committee37681.58
10Member of Rep State Committee26445.42
11Rep (General Assembly)230652186.96
12Senator (General Assembly)57801291.46
13State Treasurer2001511.27
14USP2823.37
15United States Congress1416781.59
\n", + "
" + ], + "text/plain": [ + " OFFICE TOTAL_CONT_AMT\n", + "0 Attorney General 8573124.57\n", + "1 Auditor General 3963032.23\n", + "2 Governor 48193607.75\n", + "3 Judge of the CommonWealth Crt 3285090.50\n", + "4 Judge of the Crt of Common Pleas 19324260.25\n", + "5 Judge of the Municipal Crt 736581.25\n", + "6 Judge of the Superior Crt 9700580.78\n", + "7 Justice of the Supreme Crt 15532691.39\n", + "8 Liutenant Gov 4422494.40\n", + "9 Member of Dem State Committee 37681.58\n", + "10 Member of Rep State Committee 26445.42\n", + "11 Rep (General Assembly) 230652186.96\n", + "12 Senator (General Assembly) 57801291.46\n", + "13 State Treasurer 2001511.27\n", + "14 USP 2823.37\n", + "15 United States Congress 1416781.59" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "eda.plot_recipients_by_office(contrib_filer_info_2018_2023)" + ] + }, + { + "cell_type": "markdown", + "id": "941c9a6f", + "metadata": {}, + "source": [ + " Not suprisingly, legislative races received the most contributions from 2018-2023, with a significant portion going to House races. This makes sense since House election cycles are more frequent that Senate. It is worth noting that although the PA campaign website offers an Office Code Table that indicates what the abbreviated races represent (link attached at end for reference), there are some abbreviations which do not match up with any in the Table on the PA website. These included {CPJA,CPJP,DSC,RSC,USC,USP,USS}. Reaching out to the PA Election official led to some answers for {CPJA, CPJP, USC, USS, DSC, RSC}, and the peculiar feature was that some of these codes apply to out-of-state races, namely races to the U.S Senate and House Chambers, as well as the nation presidency. 
The official attributed these codes to filing errors made by the filing entities.\n", + "###### https://www.dos.pa.gov/VotingElections/CandidatesCommittees/CampaignFinance/Resources/Pages/Technical-Specifications.aspx " + ] + }, + { + "cell_type": "markdown", + "id": "3b2d4661", + "metadata": {}, + "source": [ + "##### 4.1: If you have multiple years, are they all similar? If not, is the difference explicable (maybe by election schedules)?" + ] + }, + { + "cell_type": "markdown", + "id": "f3f038ea", + "metadata": {}, + "source": [ + "Thankfully, the years are largely similar. In 2022, additional columns were appended to the filer and contributor datasets, but these columns are not relevant to our analysis." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### This next portion repeats the EDA done on the contribution and filer data, this time on the expenditure datasets spanning 2018-2023. The expense dataset stores information from Schedule III of the campaign finance report, which details the services rendered to the filer by the recipient, as well as the nature of the expenditure (contribution, service, phone-banking, etc.)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
FILER_IDYEAREXPENSE_NAMEEXPENSE_AMTEXPENSE_DESC
020011442018MICHAEL TURZAI931.92REIMBURSEMENT
120011442018ARMSTRONG25.04INTERNET
220011442018COMCAST421.50INTERNET
320011442018NAYLAX250.00AD
420022992018FRIENDS OF TOM TOSTI500.00POLITICAL CONTRIBUTION
\n", + "
" + ], + "text/plain": [ + " FILER_ID YEAR EXPENSE_NAME EXPENSE_AMT EXPENSE_DESC\n", + "0 2001144 2018 MICHAEL TURZAI 931.92 REIMBURSEMENT\n", + "1 2001144 2018 ARMSTRONG 25.04 INTERNET\n", + "2 2001144 2018 COMCAST 421.50 INTERNET\n", + "3 2001144 2018 NAYLAX 250.00 AD\n", + "4 2002299 2018 FRIENDS OF TOM TOSTI 500.00 POLITICAL CONTRIBUTION" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "expense_info_2018_2023 = eda.merge_all_datasets(merged_expense_dataset)\n", + "expense_info_2018_2023.head(5)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### 1.2 For each column, what are the contents of it? How many blanks or nulls are there? What is the format? If there it is one of several types, what are those types?" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
columnNamecolTypenumNullsnull_percent
0FILER_IDobject00.0
1YEARint6400.0
2EXPENSE_NAMEobject00.0
3EXPENSE_AMTfloat6400.0
4EXPENSE_DESCobject00.0
\n", + "
" + ], + "text/plain": [ + " columnName colType numNulls null_percent\n", + "0 FILER_ID object 0 0.0\n", + "1 YEAR int64 0 0.0\n", + "2 EXPENSE_NAME object 0 0.0\n", + "3 EXPENSE_AMT float64 0 0.0\n", + "4 EXPENSE_DESC object 0 0.0" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cols, type, nulls, null_percent = [],[],[],[]\n", + "for column in expense_info_2018_2023.columns:\n", + " cols.append(column)\n", + " type.append(expense_info_2018_2023.dtypes[column]) \n", + " nulls.append(expense_info_2018_2023[column].isna().sum(),)\n", + " null_percent.append(round((expense_info_2018_2023[column].isna().sum()/len(expense_info_2018_2023))*100,2))\n", + "\n", + "summary_df = {'columnName':cols, 'colType':type,'numNulls':nulls,'null_percent':null_percent}\n", + "summary_df = pd.DataFrame(summary_df)\n", + "summary_df" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### 2.2 What are the top 10 expenditure reasons in your data? The top 10 recipients?" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
EXPENSE_AMT
EXPENSE_DESC
NAN702230230.20
POSTAGE532741231.64
CONTRIBUTION441939525.25
UNITEMIZED EXPENDITURES212332926.31
DONATION138815526.73
NON-PENNSYLVANIA EXPENDITURES118542071.03
SEE FEC REPORT AT HTTPS://WWW.FEC.GOV/DATA/COMMITTEE/C00042366/? TAB=FILINGS114236098.71
NON PA DISBURSEMENTS103186134.60
SEE FEC REPORT AT HTTPS://WWW.FEC.GOV/DATA/COMMITTEE/C00042366/?TAB=FILINGS57998961.58
ADVERTISING51071086.44
\n", + "
" + ], + "text/plain": [ + " EXPENSE_AMT\n", + "EXPENSE_DESC \n", + "NAN 702230230.20\n", + "POSTAGE 532741231.64\n", + "CONTRIBUTION 441939525.25\n", + "UNITEMIZED EXPENDITURES 212332926.31\n", + "DONATION 138815526.73\n", + "NON-PENNSYLVANIA EXPENDITURES 118542071.03\n", + "SEE FEC REPORT AT HTTPS://WWW.FEC.GOV/DATA/COMM... 114236098.71\n", + "NON PA DISBURSEMENTS 103186134.60\n", + "SEE FEC REPORT AT HTTPS://WWW.FEC.GOV/DATA/COMM... 57998961.58\n", + "ADVERTISING 51071086.44" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.set_option(\"display.float_format\", \"{:.2f}\".format)\n", + "expenditure_reasons = (expense_info_2018_2023.groupby([\"EXPENSE_DESC\"])\n", + " .agg({\"EXPENSE_AMT\": sum})\n", + " .sort_values(by=\"EXPENSE_AMT\", ascending=False)\n", + " )\n", + "expenditure_reasons.head(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It's a bit difficult to ascertain the description column, mainly because there is no standardized reporting format. Filers are free to describe the expenditure as they see fit, which makes grouping them into categories uncertain. Some seem to link the Federal Election Committee's website url. The combined cost of expenditures lacking descriptions is the highest" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
EXPENSE_AMT
EXPENSE_NAME
ACME MARKETS517209160.05
NON-PENNSYLVANIA EXPENDITURES352505553.60
PNC304225694.63
DNC SERVICES/DEMOCRATIC NATIONAL COMMITTEE210584210.93
DSCC172269060.29
NON PA TRANSACTIONS114481897.04
CONTRIBUTIONS TO FEDERAL AND NON-PA STATE AND LOCAL CANDIDATES AND COMMITTEES106419119.55
THE BUSINESS CENTER FOR ENTREPRENEURSHIP & SOCIAL ENTERPRISE103202985.45
COMMONWEALTH CHILDREN'S CHOICE FUND44325110.15
GRASSROOTS MEDIA LLC40759192.03
\n", + "
" + ], + "text/plain": [ + " EXPENSE_AMT\n", + "EXPENSE_NAME \n", + "ACME MARKETS 517209160.05\n", + "NON-PENNSYLVANIA EXPENDITURES 352505553.60\n", + "PNC 304225694.63\n", + "DNC SERVICES/DEMOCRATIC NATIONAL COMMITTEE 210584210.93\n", + "DSCC 172269060.29\n", + "NON PA TRANSACTIONS 114481897.04\n", + "CONTRIBUTIONS TO FEDERAL AND NON-PA STATE AND L... 106419119.55\n", + "THE BUSINESS CENTER FOR ENTREPRENEURSHIP & ... 103202985.45\n", + "COMMONWEALTH CHILDREN'S CHOICE FUND 44325110.15\n", + "GRASSROOTS MEDIA LLC 40759192.03" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pd.set_option(\"display.float_format\", \"{:.2f}\".format)\n", + "expenditure_recipients = (expense_info_2018_2023.groupby([\"EXPENSE_NAME\"])\n", + " .agg({\"EXPENSE_AMT\": sum})\n", + " .sort_values(by=\"EXPENSE_AMT\", ascending=False)\n", + " )\n", + "expenditure_recipients.head(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It's very interesting that highest recipient of expenditures is ACME Markets, a supermarket chain. More interesting is that a PAC seems to be the recipient, which reveals an interesting reality. How legally clear is it when a PAC receives money in the form of contributions, vs when it does and this amount is considered an expenditure by the filer? If an organization seeks the \"services\" of a PAC and lists them as an expenditure, it wouldn't seem obvious if that PAC would then list its payment as a contribution. In the case it doesn't, this raises an interesting potential outcome of PACs ostensibly receiving funds to \"help\" campaigns they are already ideologically aligned with without counting such \"collaborations\" as donations" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### 3.2: If you have multiple years, are they all similar? 
If not, is the difference explicable (maybe by election schedules)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The years are all similar" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.5" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/notebooks/README.md b/notebooks/README.md index 86dbfdd..d4caf2c 100644 --- a/notebooks/README.md +++ b/notebooks/README.md @@ -14,3 +14,4 @@ This should contain information about what is done in each notebook * `AZ_EDA` : A notebook containing the EDA and plots for Arizona. +* `PA_EDA.ipynb` : This notebook contains the EDA for Pennsylvania datasets on contributions, filer information, and expenditure data from 2018-2023. 
diff --git a/requirements.txt b/requirements.txt index 850ac08..866f821 100644 --- a/requirements.txt +++ b/requirements.txt @@ -5,4 +5,7 @@ pre-commit~=2.20 ipykernel~=6.16 # project packages -pandas~=1.4 +pandas~=2.0.3 +plotly~=5.18.0 +bs4~=0.0.1 +nbformat~=5.9.2 \ No newline at end of file diff --git a/utils/PA_Data_Web_Scraper.py b/utils/PA_Data_Web_Scraper.py new file mode 100644 index 0000000..0db8adf --- /dev/null +++ b/utils/PA_Data_Web_Scraper.py @@ -0,0 +1,41 @@ +import zipfile +from io import BytesIO + +import numpy as np +import requests +from bs4 import BeautifulSoup as BS + +from utils import constants as const + + +def make_request(website_url: str) -> object: + """makes an HTTP request to the specified url and parses the response into + a BeautifulSoup document + + Args: + website_url: the url link to the campaign finance reports on PA's + government website + + Returns: A parsed BeautifulSoup document + """ + return BS(requests.get(website_url).text, "html.parser") + + +def download_PA_data(start_year: int, end_year: int): + """downloads PA datasets from specified years to a local directory + Args: + start_year: The first year in the range of desired years to extract data + + end_year: The last year in the range of desired years to extract data. 
+ Returns: + unzipped .txt files (that are really csvs) stored in the 'data' + directory + """ + + years = np.arange(start_year, end_year + 1) + for year in years: + link = const.PA_MAIN_URL + const.PA_ZIPPED_URL + str(year) + ".zip" + req = requests.get(link) + + zippedfiles = zipfile.ZipFile(BytesIO(req.content)) + zippedfiles.extractall("../data") diff --git a/utils/PA_EDA_Functions.py b/utils/PA_EDA_Functions.py new file mode 100644 index 0000000..9820912 --- /dev/null +++ b/utils/PA_EDA_Functions.py @@ -0,0 +1,367 @@ +# import sys + +import pandas as pd +import plotly.express as px + +# sys.path.append("/home/alankagiri/2023-fall-clinic-climate-cabinet") +from utils import constants as const + + +def assign_col_names(filepath: str, year: int) -> list: + """Assigns the right column names to the right datasets. + + Args: + filepath: the path in which the data is stored/located. + + year: to make parsing through the data more manageable, the year from + which the data originates is also taken. 
+ + Returns: + a list of the appropriate column names for the dataset + """ + dir = filepath.split("/") + file_type = dir[len(dir) - 1] + + if "contrib" in file_type: + if year < 2022: + return const.PA_CONT_COLS_NAMES_PRE2022 + else: + return const.PA_CONT_COLS_NAMES_POST2022 + elif "filer" in file_type: + if year < 2022: + return const.PA_FILER_COLS_NAMES_PRE2022 + else: + return const.PA_FILER_COLS_NAMES_POST2022 + elif "expense" in file_type: + if year < 2022: + return const.PA_EXPENSE_COLS_NAMES_PRE2022 + else: + return const.PA_EXPENSE_COLS_NAMES_POST2022 + + +def classify_contributor(contributor: str) -> str: + """Takes a string input and compares it against a list of identifiers most + commonly associated with organizations/corporations/PACs, and classifies the + string input as belonging to an individual or an organization + + Args: + contributor: a string + Returns: + string "ORGANIZATION" or "INDIVIDUAL" depending on the classification of + the parameter + """ + split = contributor.split() + loc = 0 + while loc < len(split): + if split[loc].upper() in const.PA_ORGANIZATION_IDENTIFIERS: + return "ORGANIZATION" + loc += 1 + return "INDIVIDUAL" + + +def pre_process_contributor_dataset(df: pd.DataFrame): + """pre-processes a contributor dataset by sifting through the columns and + keeping the relevant columns for EDA and AbstractStateCleaner purposes + + Args: + df: the contributor dataset + + Returns: + a pandas dataframe whose columns are appropriately formatted. 
+ """ + df["TOTAL_CONT_AMT"] = df["CONT_AMT_1"] + df["CONT_AMT_2"] + df["CONT_AMT_3"] + df["CONTRIBUTOR"] = df["CONTRIBUTOR"].astype("str") + df["CONTRIBUTOR_TYPE"] = df["CONTRIBUTOR"].apply(classify_contributor) + df.drop( + columns={ + "ADDRESS_1", + "ADDRESS_2", + "CITY", + "STATE", + "ZIPCODE", + "OCCUPATION", + "E_NAME", + "E_ADDRESS_1", + "E_ADDRESS_2", + "E_CITY", + "E_STATE", + "E_ZIPCODE", + "SECTION", + "CYCLE", + "CONT_DATE_1", + "CONT_DATE_2", + "CONT_DATE_3", + }, + inplace=True, + ) + + if "TIMESTAMP" in df.columns: + df.drop(columns={"TIMESTAMP", "REPORTER_ID"}, inplace=True) + df["CONTRIBUTOR"] = df["CONTRIBUTOR"].apply(lambda x: str(x).upper()) + + return df + + +def pre_process_filer_dataset(df: pd.DataFrame): + """pre-processes a filer dataset by sifting through the columns and + keeping the relevant columns for EDA and AbstractStateCleaner purposes + + Args: + df: the filer dataset + + Returns: + a pandas dataframe whose columns are appropriately formatted. + """ + df.drop( + columns={ + "YEAR", + "CYCLE", + "AMEND", + "TERMINATE", + "DISTRICT", + "ADDRESS_1", + "ADDRESS_2", + "CITY", + "STATE", + "ZIPCODE", + "COUNTY", + "PHONE", + "BEGINNING", + "MONETARY", + "INKIND", + }, + inplace=True, + ) + if "TIMESTAMP" in df.columns: + df.drop(columns={"TIMESTAMP", "REPORTER_ID"}, inplace=True) + + df.drop_duplicates(subset=["FILER_ID"], inplace=True) + df["FILER_TYPE"] = df.FILER_TYPE.map(const.PA_FILER_ABBREV_DICT) + df["FILER_NAME"] = df["FILER_NAME"].apply(lambda x: str(x).upper()) + return df + + +def pre_process_expense_dataset(df: pd.DataFrame): + """pre-processes an expenditure dataset by sifting through the columns and + keeping the relevant columns for EDA and AbstractStateCleaner purposes + + Args: + df: the expenditure dataset + + Returns: + a pandas dataframe whose columns are appropriately formatted. 
+ """ + df.drop( + columns={ + "EXPENSE_CYCLE", + "EXPENSE_ADDRESS_1", + "EXPENSE_ADDRESS_2", + "EXPENSE_CITY", + "EXPENSE_STATE", + "EXPENSE_ZIPCODE", + "EXPENSE_DATE", + }, + inplace=True, + ) + if "EXPENSE_REPORTER_ID" in df.columns: + df.drop(columns={"EXPENSE_TIMESTAMP", "EXPENSE_REPORTER_ID"}, inplace=True) + df["EXPENSE_DESC"] = df["EXPENSE_DESC"].apply(lambda x: str(x).upper()) + df["EXPENSE_NAME"] = df["EXPENSE_NAME"].apply(lambda x: str(x).upper()) + + return df + + +def initialize_PA_dataset(data_filepath: str, year: int) -> pd.DataFrame: + """initializes the PA data appropriately based on whether the data contains + filer, contributor, or expense information + + Args: + data_filepath: the path in which the data is stored/located. + + year: the year from which the data originates + + Returns: + a pandas dataframe whose columns are appropriately formatted, and + any dirty rows with inconsistent columns names dropped. + """ + + df = pd.read_csv( + data_filepath, + names=assign_col_names(data_filepath, year), + sep=",", + encoding="latin-1", + on_bad_lines="warn", + ) + + df["YEAR"] = year + df["FILER_ID"] = df["FILER_ID"].astype("str") + dir = data_filepath.split("/") + file_type = dir[len(dir) - 1] + + if "contrib" in file_type: + return pre_process_contributor_dataset(df) + + elif "filer" in file_type: + return pre_process_filer_dataset(df) + + elif "expense" in file_type: + return pre_process_expense_dataset(df) + + else: + raise ValueError( + "This function is currently formatted for filer, \ + expense, and contributor datasets. Make sure your data \ + is from these sources." + ) + + +def top_n_recipients(df: pd.DataFrame, num_recipients: int) -> object: + """given a dataframe, retrieves the top n recipients of that year based on + contributions and returns a table + Args: + df: a pandas DataFrame with a contributions column + + num_recipients: an integer specifying how many recipients are desired. 
+ If this value is larger than the number of recipients available, then all + recipients are returned instead. + Returns: + A pandas table (object)""" + recipients = ( + df.groupby(["FILER_NAME"]) + .agg({"TOTAL_CONT_AMT": sum}) + .sort_values(by="TOTAL_CONT_AMT", ascending=False) + ) + pd.set_option("display.float_format", "{:.2f}".format) + + if num_recipients > len(recipients): + return recipients + else: + return recipients.head(num_recipients) + + +def top_n_contributors(df: pd.DataFrame, num_contributors: int) -> object: + """given a dataframe, retrieves the top n contributors of that year based on + contributions and returns a table + + Args: + df: a pandas DataFrame with a contributions column + + num_contributors: an integer specifying how many contributors are + desired. If this value is larger than the number of available + contributors, then all contributors are returned instead. + Returns: + a pandas table (object)""" + + contributors = ( + df.groupby(["CONTRIBUTOR"]) + .agg({"TOTAL_CONT_AMT": sum}) + .sort_values(by="TOTAL_CONT_AMT", ascending=False) + ) + pd.set_option("display.float_format", "{:.2f}".format) + if num_contributors > len(contributors): + return contributors + else: + return contributors.head(num_contributors) + + +def merge_same_year_datasets( + cont_file: pd.DataFrame, filer_file: pd.DataFrame +) -> pd.DataFrame: + """merges the contributor and filer datasets from the same year using the + unique filerID + Args: + cont_file: The contributor dataset + + filer_file: the filer dataset from the same year as the cont_file. 
+ Returns: + The merged pandas dataframe + """ + merged_df = pd.merge(cont_file, filer_file, how="left", on="FILER_ID") + return merged_df + + +def merge_all_datasets(datasets: list) -> pd.DataFrame: + """concatenates datasets from different years into one super dataset + Args: + datasets: a list of datasets + + Returns: + The merged pandas dataframe + """ + return pd.concat(datasets) + + +def group_filerType_Party(dataset: pd.DataFrame) -> object: + """takes a dataset and returns a grouped table highlighting the kinds + of people who file the campaign reports (FilerType Key -> 1:Candidate, + 2:Committee, 3:Lobbyist.) and their political party affiliation + + Args: + dataset: a pandas DataFrame containing columns and values from the filer + dataset. + + Returns: + A table object""" + return dataset.groupby(["FILER_TYPE", "PARTY"]).agg({"TOTAL_CONT_AMT": sum}) + + +def plot_recipients_by_office(merged_dataset: pd.DataFrame) -> object: + """returns a table and plots a bar graph of data highlighting the amount of + contributions each statewide race received over the years + + Args: + merged_dataset: A (merged) pandas DataFrame containing columns and + values from the contributor and filer datasets. 
+ + Returns: + A table object""" + + recep_per_office = merged_dataset.replace({"OFFICE": const.PA_OFFICE_ABBREV_DICT}) + + recep_per_office = ( + recep_per_office.groupby(["OFFICE"]).agg({"TOTAL_CONT_AMT": sum}).reset_index() + ) + + fig = px.bar( + data_frame=recep_per_office, + x="OFFICE", + y="TOTAL_CONT_AMT", + title="PA Contributions Received by Office-Type From 2018-2023", + labels={"TOTAL_CONT_AMT": "Total Contribution Amount"}, + ) + fig.show() + + return recep_per_office + + +def compare_cont_by_donorType(merged_dataset: pd.DataFrame) -> object: + """returns a table and plots a barplot highlighting the annual contributions + campaign finance report-filers received based on whether they are candidates, + committees, or lobbyists. 
+ + Args: + merged_dataset: A (merged) pandas DataFrame containing columns from both + the filer and contributor datasets. + Return: + A pandas DataFrame + """ + pd.set_option("display.float_format", "{:.2f}".format) + cont_by_donor = ( + merged_dataset.groupby(["YEAR", "FILER_TYPE"]) + .agg({"TOTAL_CONT_AMT": sum}) + .reset_index() + ) + + fig = px.bar( + data_frame=cont_by_donor, + x="YEAR", + y="TOTAL_CONT_AMT", + color="FILER_TYPE", + title="PA Recipients of Annual Contributions (2018 - 2023)", + labels={ + "TOTAL_CONT_AMT": "Total Contribution Amount", + "FILER_TYPE": "Type of Filer", + }, + ) + fig.show() + return cont_by_donor diff --git a/utils/PA_constants.py b/utils/PA_constants.py new file mode 100644 index 0000000..6c10644 --- /dev/null +++ b/utils/PA_constants.py @@ -0,0 +1,136 @@ +""" +This document lists the constants used in web scraping and Exploratory +Data Analysis + +""" +# Web Scraping Constants: + +main_url = "https://www.dos.pa.gov" +zipped_url = ( + "/VotingElections/CandidatesCommittees/CampaignFinance/Resources/Documents/" +) + +# EDA constants: + +cont_cols_names_pre2022: list = [ + "FilerID", + "EYear", + "Cycle", + "Section", + "Contributor", + "Address1", + "Address2", + "City", + "State", + "Zipcode", + "occupation", + "Ename", + "EAddress1", + "EAddress2", + "ECity", + "EState", + "EZipcode", + "ContDate1", + "ContAmt1", + "ContDate2", + "ContAmt2", + "ContDate3", + "ContAmt3", + "ContDesc", +] + +cont_cols_names_post22: list = [ + "FilerID", + "ReporterID", + "Timestamp", + "EYear", + "Cycle", + "Section", + "Contributor", + "Address1", + "Address2", + "City", + "State", + "Zipcode", + "occupation", + "Ename", + "EAddress1", + "EAddress2", + "ECity", + "EState", + "EZipcode", + "ContDate1", + "ContAmt1", + "ContDate2", + "ContAmt2", + "ContDate3", + "ContAmt3", + "ContDesc", +] + +filer_cols_names_pre2022: list = [ + "FilerID", + "EYear", + "Cycle", + "Amend", + "Terminate", + "FilerType", + "FilerName", + "Office", + "District", + 
"Party", + "Address1", + "Address2", + "City", + "State", + "Zipcode", + "County", + "PHONE", + "BEGINNING", + "MONETARY", + "INKIND", +] + +filer_cols_names_post2022: list = [ + "FilerID", + "ReporterID", + "Timestamp", + "EYear", + "Cycle", + "Amend", + "Terminate", + "FilerType", + "FilerName", + "Office", + "District", + "Party", + "Address1", + "Address2", + "City", + "State", + "Zipcode", + "County", + "PHONE", + "BEGINNING", + "MONETARY", + "INKIND", +] + +office_abb_dict: dict = { + "GOV": "Governor", + "LTG": "Liutenant Gov", + "ATT": "Attorney General", + "AUD": "Auditor General", + "TRE": "State Treasurer", + "SPM": "Justice of the Supreme Crt", + "SPR": "Judge of the Superior Crt", + "CCJ": "Judge of the CommonWealth Crt", + "CPJ": "Judge of the Crt of Common Pleas", + "MCJ": "Judge of the Municipal Crt", + "TCJ": "Judge of the Traffic Crt", + "STS": "Senator (General Assembly)", + "STH": "Rep (General Assembly)", + "OTH": "Other(local offices)", + "MISC": "Unknown", +} +filer_abb_dict: dict = {1.0: "Candidate", 2.0: "Committee", 3.0: "Lobbyist"} diff --git a/utils/constants.py b/utils/constants.py index 417eb0c..111be5e 100644 --- a/utils/constants.py +++ b/utils/constants.py @@ -3,6 +3,10 @@ """ from pathlib import Path +MI_FILEPATH = "../data/Contributions/" + +MI_VALUES_TO_CHECK = ["1998", "1999", "2000", "2001", "2002", "2003"] + BASE_FILEPATH = Path(__file__).resolve().parent.parent USER_AGENT = """Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 @@ -44,6 +48,207 @@ "extra_desc", ] +PA_MAIN_URL = "https://www.dos.pa.gov" +PA_ZIPPED_URL = ( + "/VotingElections/CandidatesCommittees/CampaignFinance/Resources/Documents/" +) + +# PA EDA constants: + +PA_CONT_COLS_NAMES_PRE2022: list = [ + "FILER_ID", + "YEAR", + "CYCLE", + "SECTION", + "CONTRIBUTOR", + "ADDRESS_1", + "ADDRESS_2", + "CITY", + "STATE", + "ZIPCODE", + "OCCUPATION", + "E_NAME", + "E_ADDRESS_1", + "E_ADDRESS_2", + "E_CITY", + "E_STATE", + "E_ZIPCODE", + "CONT_DATE_1", + 
"CONT_AMT_1", + "CONT_DATE_2", + "CONT_AMT_2", + "CONT_DATE_3", + "CONT_AMT_3", + "CONT_DESCRIP", +] + +PA_CONT_COLS_NAMES_POST2022: list = [ + "FILER_ID", + "REPORTER_ID", + "TIMESTAMP", + "YEAR", + "CYCLE", + "SECTION", + "CONTRIBUTOR", + "ADDRESS_1", + "ADDRESS_2", + "CITY", + "STATE", + "ZIPCODE", + "OCCUPATION", + "E_NAME", + "E_ADDRESS_1", + "E_ADDRESS_2", + "E_CITY", + "E_STATE", + "E_ZIPCODE", + "CONT_DATE_1", + "CONT_AMT_1", + "CONT_DATE_2", + "CONT_AMT_2", + "CONT_DATE_3", + "CONT_AMT_3", + "CONT_DESCRIP", +] + +PA_FILER_COLS_NAMES_PRE2022: list = [ + "FILER_ID", + "YEAR", + "CYCLE", + "AMEND", + "TERMINATE", + "FILER_TYPE", + "FILER_NAME", + "OFFICE", + "DISTRICT", + "PARTY", + "ADDRESS_1", + "ADDRESS_2", + "CITY", + "STATE", + "ZIPCODE", + "COUNTY", + "PHONE", + "BEGINNING", + "MONETARY", + "INKIND", +] + +PA_FILER_COLS_NAMES_POST2022: list = [ + "FILER_ID", + "REPORTER_ID", + "TIMESTAMP", + "YEAR", + "CYCLE", + "AMEND", + "TERMINATE", + "FILER_TYPE", + "FILER_NAME", + "OFFICE", + "DISTRICT", + "PARTY", + "ADDRESS_1", + "ADDRESS_2", + "CITY", + "STATE", + "ZIPCODE", + "COUNTY", + "PHONE", + "BEGINNING", + "MONETARY", + "INKIND", +] + +PA_EXPENSE_COLS_NAMES_PRE2022: list = [ + "FILER_ID", + "YEAR", + "EXPENSE_CYCLE", + "EXPENSE_NAME", + "EXPENSE_ADDRESS_1", + "EXPENSE_ADDRESS_2", + "EXPENSE_CITY", + "EXPENSE_STATE", + "EXPENSE_ZIPCODE", + "EXPENSE_DATE", + "EXPENSE_AMT", + "EXPENSE_DESC", +] + +PA_EXPENSE_COLS_NAMES_POST2022: list = [ + "FILER_ID", + "EXPENSE_REPORTER_ID", + "EXPENSE_TIMESTAMP", + "YEAR", + "EXPENSE_CYCLE", + "EXPENSE_NAME", + "EXPENSE_ADDRESS_1", + "EXPENSE_ADDRESS_2", + "EXPENSE_CITY", + "EXPENSE_STATE", + "EXPENSE_ZIPCODE", + "EXPENSE_DATE", + "EXPENSE_AMT", + "EXPENSE_DESC", +] + +PA_OFFICE_ABBREV_DICT: dict = { + "GOV": "Governor", + "LTG": "Liutenant Gov", + "ATT": "Attorney General", + "AUD": "Auditor General", + "TRE": "State Treasurer", + "SPM": "Justice of the Supreme Crt", + "SPR": "Judge of the Superior Crt", + "CCJ": "Judge 
of the Commonwealth Crt", + "CPJ": "Judge of the Crt of Common Pleas", + "CPJA": "Judge of the Crt of Common Pleas", + "CPJP": "Judge of the Crt of Common Pleas", + "MCJ": "Judge of the Municipal Crt", + "TCJ": "Judge of the Traffic Crt", + "STS": "Senator (General Assembly)", + "STH": "Rep (General Assembly)", + "USC": "United States Congress", + "USS": "United States Senate", + "DSC": "Member of Dem State Committee", + "RSC": "Member of Rep State Committee", + "OTH": "Other (local offices)", +} +PA_FILER_ABBREV_DICT: dict = {1.0: "Candidate", 2.0: "Committee", 3.0: "Lobbyist"} +PA_ORGANIZATION_IDENTIFIERS: list = [ + "FRIENDS", + "CITIZENS", + "UNION", + "STATE", + "TEAM", + "PAC", + "PA", + "GOVT", + "WARD", + "DEM", + "COM", + "COMMITTEE", + "CORP", + "ASSOCIATIONS", + "FOR", + "FOR THE", + "SENATE", + "COMMONWEALTH", + "ELECT", + "POLITICAL ACTION COMMITTEE", + "REPUBLICANS", + "REPUBLICAN", + "DEMOCRAT", + "DEMOCRATS", + "CORPORATION", + "CORP", + "COMPANY", + "CO", + "LIMITED", + "LTD", + "INC", + "INCORPORATED", + "LLC", +] MI_EXPENDITURE_COLUMNS = [ "doc_seq_no",