datarootsio/transforming-tabular-data

Transforming Tabular Data in Python 🛠️📈

Comparing Pandas v. Polars v. PyArrow v. DuckDB 🐼🐻‍❄️🏹🦆


This repository contains the benchmarking code backing the identically titled blog post: Transforming Tabular Data in Python 🛠️📈. This blog post compares four different frameworks based on performance and ease of use:

  • Pandas - the (as of recently) de facto standard for dataframes in Python
  • Polars - a challenger to Pandas, backed by Rust and Apache Arrow
  • PyArrow - the direct Python bindings for Apache Arrow
  • DuckDB - an in-process analytical SQL database with a Python API

Benchmarking is performed using pytest-benchmark, which extends pytest with a benchmark fixture that is used in each framework's respective test to measure the execution time of the transformation. The benchmarking code is located in the test/ directory, and the datasets used for benchmarking can be downloaded to the datasets/ directory using the download-datasets.sh script (see Setup below).

For each of the four frameworks, two transformations are benchmarked. The first is a simpler one that loads, groups, and orders data from a single CSV file. The second is a more advanced one that joins three CSV files, filters on multiple conditions, and then groups and orders the data.
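To make the shape of the advanced transformation concrete, the sketch below expresses a join of three tables, a multi-condition filter, a group-by, and an ordering as a single SQL query. Everything in it is an assumption for illustration: the table names, column names, filter conditions, and rows are hypothetical, and the standard library's sqlite3 is only a stand-in so the example runs without extra dependencies. It is not one of the benchmarked frameworks and not the query used in the repository.

```python
import sqlite3

# Hypothetical miniature versions of the three inputs (names are made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_id INTEGER, year INTEGER);
    CREATE TABLE occurrences (event_id INTEGER, taxon_id INTEGER, individuals INTEGER);
    CREATE TABLE taxa (taxon_id INTEGER, species TEXT);
    INSERT INTO events VALUES (1, 1995), (2, 2010);
    INSERT INTO occurrences VALUES (1, 10, 4), (2, 10, 6), (2, 11, 0), (2, 11, 3);
    INSERT INTO taxa VALUES (10, 'mallard'), (11, 'coot');
""")

# Join three tables, filter on two conditions, then group and order.
ADVANCED_QUERY = """
    SELECT t.species, SUM(o.individuals) AS total
    FROM occurrences AS o
    JOIN events AS e ON e.event_id = o.event_id
    JOIN taxa AS t ON t.taxon_id = o.taxon_id
    WHERE e.year >= 2000 AND o.individuals > 0
    GROUP BY t.species
    ORDER BY total DESC, t.species
"""

rows = conn.execute(ADVANCED_QUERY).fetchall()
print(rows)
```

The simple transformation corresponds to the same query without the two JOIN clauses and the WHERE clause, reading from a single table.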

Setup

poetry install
sh download-datasets.sh

Running the Benchmarks

# All benchmarks
pytest test/python-transformation-libraries-benchmark --benchmark-autosave --benchmark-min-rounds=8 --benchmark-min-time=0

# Only simple
pytest test/python-transformation-libraries-benchmark/test_simple.py --benchmark-autosave --benchmark-min-rounds=8 --benchmark-min-time=0

# Only advanced
pytest test/python-transformation-libraries-benchmark/test_advanced.py --benchmark-autosave --benchmark-min-rounds=8 --benchmark-min-time=0
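
With --benchmark-autosave, pytest-benchmark writes each run to a JSON file (by default under a .benchmarks/ directory, as far as its documentation describes). The sketch below shows one way such a file could be summarized. The helper names are hypothetical, and the JSON layout it relies on (a top-level "benchmarks" list whose entries carry "name" and "stats"/"mean") is an assumption to check against a real saved file. The sample report is hand-written with placeholder names and values, not measured results.

```python
import json
from pathlib import Path


def load_report(path):
    """Parse one saved benchmark run from disk."""
    return json.loads(Path(path).read_text())


def fastest_first(report):
    """Return (benchmark name, mean seconds) pairs, fastest first."""
    pairs = [(b["name"], b["stats"]["mean"]) for b in report["benchmarks"]]
    return sorted(pairs, key=lambda pair: pair[1])


# Hypothetical report in the assumed layout; names and timings are placeholders.
SAMPLE_REPORT = {
    "benchmarks": [
        {"name": "test_a", "stats": {"mean": 2.0}},
        {"name": "test_b", "stats": {"mean": 0.5}},
    ]
}

for name, mean_seconds in fastest_first(SAMPLE_REPORT):
    print(f"{name}: {mean_seconds:.3f} s")
```

For a real run, fastest_first(load_report(path)) would be called with the path of one of the saved JSON files.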

Datasets

The main dataset used for benchmarking is the ~260MB "Watervogels" dataset from the Flemish Institute for Nature and Forest (INBO). In the words of the dataset description, this dataset "contains information on more than 94,000 sampling events (bird counts) with over 720,000 observations (and zero counts when there is no associated occurrence) for the period 1991-2016, covering 167 species in over 1,100 wetland sites."

Additionally, the ~5.73GB Backbone Taxonomy dataset by the Global Biodiversity Information Facility (GBIF) is used to enrich the Watervogels dataset with taxonomic information.

Related work

Database-like ops benchmark

License

The source code in this repository is licensed under the MIT License.