Add a benchmark for large file scans versus smaller files #41

Open
jonkeane opened this issue Sep 29, 2021 · 1 comment

Comments

@jonkeane
Contributor

When scanning and doing operations on files, it would be nice to know whether it's more efficient to have one large file (e.g. Parquet) per partition, or many smaller files.
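
A minimal sketch of what that comparison could look like with pyarrow, assuming a local filesystem and made-up row/file counts (a real benchmark would want tuned sizes, repeated runs, and a cloud-filesystem case as well):

```python
import os
import time

import numpy as np
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Hypothetical sizes; real benchmark parameters would need tuning.
n_rows = 1_000_000
n_chunks = 10

table = pa.table({
    "x": np.random.rand(n_rows),
    "y": np.random.randint(0, 1_000, n_rows),
})

os.makedirs("one_file", exist_ok=True)
os.makedirs("many_files", exist_ok=True)

# Layout A: a single large Parquet file.
pq.write_table(table, "one_file/data.parquet")

# Layout B: the same rows split across many smaller files.
for i, batch in enumerate(table.to_batches(max_chunksize=n_rows // n_chunks)):
    pq.write_table(pa.Table.from_batches([batch]), f"many_files/part-{i}.parquet")

# Time a full scan of each layout.
for path in ("one_file", "many_files"):
    start = time.perf_counter()
    ds.dataset(path, format="parquet").to_table()
    print(path, time.perf_counter() - start)
```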

@lidavidm
Contributor

This would be good to have. For Parquet specifically, looking at row group sizes may also be interesting: smaller row groups can potentially give us more parallelism, but if you're reading only a few sparse columns out of many, and you're on something like S3, small row groups also mean making lots of small reads, which is not an ideal I/O pattern.
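
As a rough illustration (not a finished benchmark), the row-group dimension could be probed like this, assuming a local file, hypothetical table shapes, and hypothetical row-group sizes; the sparse-column read mimics the S3 scenario above:

```python
import time

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

n_rows = 1_000_000
# A wide table, so reading a few columns out of many is meaningful.
table = pa.table({f"c{i}": np.random.rand(n_rows) for i in range(20)})

for row_group_size in (10_000, 100_000, 1_000_000):
    path = f"rg_{row_group_size}.parquet"
    pq.write_table(table, path, row_group_size=row_group_size)

    start = time.perf_counter()
    # Read only a sparse subset of the columns.
    pq.read_table(path, columns=["c0", "c10"])
    print(row_group_size, time.perf_counter() - start)
```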
