-
Notifications
You must be signed in to change notification settings - Fork 1
/
02-part-2.qmd
166 lines (116 loc) · 5.65 KB
/
02-part-2.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
---
title: "Part 2 -- Data Import And CSV To Parquet Conversion"
execute:
eval: false
jupyter: python3
---
::: {.callout-tip}
## Learning objectives
* Read in data using Apache Arrow
* Convert data from CSV to Parquet
* Compare CSV file size to Parquet file size
:::
## Accessing the Data
### The PUMS Dataset
The U.S. Census provides the PUMS data in several ways. You can browse the data
right on their
[website](https://www.census.gov/programs-surveys/acs/microdata.html). But after
selecting 10 variables (out of 522) we are met with a throttle limit on how much
data we can select at a time.
![https://www.census.gov/programs-surveys/acs/microdata.html](./assets/img/pums_website.png)
Luckily for us, census.gov also provides data via their FTP server. We'll be
using the five-year data from 2021, found
[here](https://www2.census.gov/programs-surveys/acs/data/pums/2021/5-Year/).
For the J on the Beach workshop this data is already downloaded for you! If you'd like to try the workshop out locally follow the download the data section. For now
skip to [this section](#examining-the-data).
### Downloading the Data
You can download the data in CSV format using this FTP
URL: `ftp://ftp.census.gov/programs-surveys/acs/data/pums/2021/5-Year/`.
To do this use your preferred tool, such as `wget` or `aria2`.
As an example, here's how to download the data using `wget` on a Mac. Run the
following lines in your terminal, to retrieve and then unzip the data:
```bash
# use wget or your ftp retrieval method of choice
wget --recursive -w 2 -nc --level=1 --no-parent --no-directories \
--accept 'csv*' --reject '*pus*' --reject '*hus*' \
--directory-prefix=. \
ftp://ftp.census.gov/programs-surveys/acs/data/pums/2021/5-Year/
# unizp the files, feel free to use your method of choice
unzip *.zip
```
The PUMS dataset is divided into data about people and data about households.
After unzipping the files, we’ll divide the five-year set into person-focused
and housing-focused data.
Person data can be identified by the naming pattern `psam_p*.csv`, while housing
data has the pattern `psam_h*.csv`. From there, you can just use globbing to move
them where you want them — the example below will work on Mac and Linux.
```bash
mkdir housing
mkdir people
mv psam_h*.csv ./household/
mv psam_p*.csv ./people/
```
### Examining the Data
Once this is done, you should now have two directories full of CSV data that make up approximately 14.13 GB. The PUMS dataset is divided into data about individual people and data representing households.
::: {.callout-note}
## Activity
- Check the size of the two directories on your machine, are they also 14GB?
:::
14 GB is far too big for a desktop environment, you will run out of RAM working with this data.
Additionally, there are too many files in each directory to wrangle efficiently.
In our workshop today we'll convert the CSV files into two Parquet files: one
for the household data, and one for person data. This will compress the data
down to 4 GB! The Python Library `PyArrow` is well suited for this task.
### Introduction to Apache Arrow
[In this section we need more info about Arrow and Parquet]
Before we move further here are a few things to know about Arrow:
- It is a columnar data-format and a standard.
- It can be used to convert one format to another
- Many programming languages have their own library to work with Arrow. We are
using `PyArrow` today, but others exist for C++, Go, R, Rust, etc.
Let's import PyArrow and the specific methods we need:
```{python}
import pyarrow as pa
from pyarrow import parquet
import pyarrow.dataset as ds
```
Now, we need a view of the data — we’ll use PyArrow’s dataset functionality to do this.
A view does not read any data into memory.
```{python}
# This line takes all the csvs files in the folder and adds them to the
# same dataset.
people_dataset = ds.dataset("./people", format="csv")
house_dataset = ds.dataset("./household", format="csv")
```
Contrary to CSV files, arrow datasets are typed. When using the default options,
the data types are inferred by reading a subset of the data to guess the types.
However, with large datasets, it can happen that a variable is detected as being
numeric when it contains strings outside the range used to infer the data type.
To handle this, we will redefine the schema and make a new view. This is a cheap
operation because we haven’t read any data yet, only set up a view that’s aware
of the files.
```{python}
# Adjust the person data schema
people_schema = people_dataset.schema.set(1, pa.field("SERIALNO", pa.string()))
people_schema = people_schema.set(75, pa.field("WKWN", pa.string()))
# Adjust the household data schema
house_schema = house_dataset.schema.set(1, pa.field("SERIALNO", pa.string()))
#Update views with new schemas
people_dataset = ds.dataset('./people', format="csv", schema=people_schema)
house_dataset = ds.dataset('./household', format="csv", schema=house_schema)
```
Now, we write to Parquet. This can take a while, as it’s only now that PyArrow fully reads the data.
```{python}
ds.write_dataset(house_dataset, "pums_household", format="parquet")
ds.write_dataset(people_dataset, "pums_people", format="parquet")
```
### The Results
What gains do we get from this?
CSV to Parquet file size changes:
- Person Data:
10 GB --> 3.3GB
- Household Data:
4 GB --> 1.3 GB
That's a 3X reduction for both of these in a few lines of code!
Parquet compresses data, but unlike zip files, it's format allows data to be read directly in that compressed form.
In the next part of the tutorial we'll be using Ibis to read in our newly made parquet files and analyze the data to draw insights from it!