Transforming Tabular Data in Python 🛠️📈

Comparing Pandas v. Polars v. PyArrow v. DuckDB 🐼🐻‍❄️🏹🦆

This repository contains the benchmarking code behind the blog post of the same name, Transforming Tabular Data in Python 🛠️📈. The post compares four frameworks on performance and ease of use:

Pandas - the (as of recently) de-facto standard for dataframes in Python
Polars - a challenger to Pandas, backed by Rust and Apache Arrow
PyArrow - the direct Python bindings for Apache Arrow
DuckDB - an in-process analytical SQL database with a Python API

Benchmarking is performed using pytest-benchmark, which extends pytest with a benchmark fixture. Each framework's test uses this fixture to measure the execution time of its transformation. The benchmarking code is located in the test/ directory, and the datasets used for benchmarking can be downloaded to the datasets/ directory using the download-datasets.sh script (see Setup below).

For each of the four frameworks, two transformations are benchmarked. The first is a simpler one that loads, groups, and orders data from a single CSV file. The second is a more advanced one that joins three CSV files, filters on multiple conditions, and then groups and orders the result.
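The shape of the two transformations can be sketched in SQL. The example below is an illustration only: it runs on the standard library's sqlite3 module as a stand-in for the benchmarked frameworks, and the tables, columns, and filter conditions (events, occurrences, taxa, year, n, family) are hypothetical, not the actual Watervogels schema or the queries used in the benchmarks.

```python
import sqlite3

# In-memory database with three tiny hypothetical tables standing in for
# the three CSV files that the advanced transformation joins.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (event_id INTEGER, site TEXT, year INTEGER);
CREATE TABLE occurrences (event_id INTEGER, taxon_id INTEGER, n INTEGER);
CREATE TABLE taxa (taxon_id INTEGER, family TEXT);
INSERT INTO events VALUES (1, 'A', 2010), (2, 'B', 2015), (3, 'A', 1995);
INSERT INTO occurrences VALUES
    (1, 10, 4), (1, 11, 0), (2, 10, 6), (2, 12, 3), (3, 10, 9);
INSERT INTO taxa VALUES (10, 'Anatidae'), (11, 'Rallidae'), (12, 'Rallidae');
""")

# Simple: one table, group, order.
SIMPLE = """
SELECT taxon_id, SUM(n) AS total
FROM occurrences
GROUP BY taxon_id
ORDER BY total DESC
"""

# Advanced: join three tables, filter on multiple conditions, group, order.
ADVANCED = """
SELECT t.family, SUM(o.n) AS total
FROM occurrences AS o
JOIN events AS e ON e.event_id = o.event_id
JOIN taxa AS t ON t.taxon_id = o.taxon_id
WHERE e.year >= 2000 AND o.n > 0
GROUP BY t.family
ORDER BY total DESC
"""

simple_rows = conn.execute(SIMPLE).fetchall()
advanced_rows = conn.execute(ADVANCED).fetchall()
print(simple_rows)    # [(10, 19), (12, 3), (11, 0)]
print(advanced_rows)  # [('Anatidae', 10), ('Rallidae', 3)]
```

The advanced query adds two joins and a WHERE clause on top of the same group-and-order pattern, which is the difference the two benchmarks are meant to expose.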

Setup

poetry install
sh download-datasets.sh

Running the Benchmarks

# All benchmarks
pytest test/python-transformation-libraries-benchmark --benchmark-autosave --benchmark-min-rounds=8 --benchmark-min-time=0

# Only simple
pytest test/python-transformation-libraries-benchmark/test_simple.py --benchmark-autosave --benchmark-min-rounds=8 --benchmark-min-time=0

# Only advanced
pytest test/python-transformation-libraries-benchmark/test_advanced.py --benchmark-autosave --benchmark-min-rounds=8 --benchmark-min-time=0
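With --benchmark-autosave, pytest-benchmark writes each run to a JSON file (by default under a .benchmarks/ directory). The following sketch shows one way such a file could be summarized. It assumes the usual layout of a top-level "benchmarks" list whose entries carry a "name" and a "stats" object with "mean" and "rounds"; the function names and the example path are hypothetical, and this is not the code used in results.ipynb.

```python
import json
from pathlib import Path


def summarize(data):
    """Return (name, mean seconds, rounds) per benchmark, fastest first.

    `data` is the parsed content of a pytest-benchmark JSON file; the keys
    used here are an assumption about that file's layout.
    """
    rows = [
        (entry["name"], entry["stats"]["mean"], entry["stats"]["rounds"])
        for entry in data["benchmarks"]
    ]
    return sorted(rows, key=lambda row: row[1])


def summarize_file(path):
    """Load one saved run from disk and summarize it."""
    return summarize(json.loads(Path(path).read_text()))


# Hypothetical usage; the actual file name depends on the machine and run:
# for name, mean, rounds in summarize_file(".benchmarks/<machine>/<run>.json"):
#     print(f"{name}: {mean:.3f}s over {rounds} rounds")
```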

Datasets

The main dataset used for benchmarking is the ~260MB "Watervogels" dataset from the Flemish Institute for Nature and Forest (INBO). This dataset...

contains information on more than 94,000 sampling events (bird counts) with over 720,000 observations (and zero counts when there is no associated occurrence) for the period 1991-2016, covering 167 species in over 1,100 wetland sites.

from the dataset description

Additionally, the ~5.73GB Backbone Taxonomy dataset by the Global Biodiversity Information Facility (GBIF) is used to enrich the Watervogels dataset with taxonomic information.

Related work

Database-like ops benchmark

License

The source code in this repository is licensed under the MIT License.
