
[WIP] Add experimental BalancedParquetEngine implementation #80

Draft

rjzamora wants to merge 8 commits into main

Conversation

@rjzamora (Contributor) commented May 3, 2022

WARNING: This PR is still very rough, but I am sharing it now in case others wish to experiment and/or share feedback.

May address parts of NVIDIA-Merlin/NVTabular#1340

This PR currently adds a balance_partitions= option to the Dataset API, which ultimately uses a new BalancedParquetEngine implementation for engine="parquet". The primary goal of this engine is to generate an underlying Dask DataFrame collection with the same row count in every partition. The user may specify the desired row count directly (with rows_per_partition), or pass in the usual part_size or part_mem_fraction.

There is also an option to align with a desired batch_size: if the batch_size argument is specified, the output partition sizes (row counts) will always be divisible by that number.
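
For anyone who wants to experiment, here is a rough usage sketch of the proposed option. This is not code from the PR's diff: the import path and the literal values are assumptions/placeholders, and only the keyword names come from the description above.

```python
# Rough usage sketch of the proposed balance_partitions= option.
# Assumes Dataset is importable from merlin.io; literal values are placeholders.
from merlin.io import Dataset

# Ask for balanced partitions with an explicit target row count per partition.
ds = Dataset(
    "/path/to/dataset/*.parquet",
    engine="parquet",
    balance_partitions=True,       # opt in to the BalancedParquetEngine
    rows_per_partition=1_000_000,  # target row count for every partition
)

# Or derive the row count from the usual byte-based sizing, while keeping
# every partition's row count divisible by a training batch size.
ds = Dataset(
    "/path/to/dataset/*.parquet",
    engine="parquet",
    balance_partitions=True,
    part_size="256MB",   # usual byte-budget hint (or part_mem_fraction)
    batch_size=65_536,   # output partition sizes are divisible by this
)
```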

TODO:

  • Handle hive/directory-partitioned datasets
  • Add an option to specify/guide the total number of desired partitions (for multi-GPU workflows)
  • More testing, iteration, and general cleanup
  • Decide whether a standalone engine makes sense for this

github-actions bot commented May 3, 2022

Documentation preview

https://nvidia-merlin.github.io/core/review/pr-80
