RFC: Re-work some DataFrame APIs #875

Open
9 of 18 tasks
ion-elgreco opened this issue Sep 20, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@ion-elgreco
Contributor

ion-elgreco commented Sep 20, 2024

Some APIs feel a bit unintuitive, and I think Polars has really excelled in this area. My suggestion is that we reuse some of those APIs or take inspiration from them. These are the changes I am proposing (I am happy to work on these areas, especially with datafusion-ray becoming a thing):

  • DataFrame.cache() -> DataFrame ===> DataFrame.collect() -> DataFrame
  • DataFrame.collect() -> list[pyarrow.RecordBatch] ===> DataFrame.to_batches() -> list[pyarrow.RecordBatch]
  • DataFrame.join ===> DataFrame.join(right: DataFrame, on: str | sequence[str] | None, left_on: str | sequence[str] | None, right_on: str | sequence[str] | None)
  • DataFrame.schema -> pyarrow.Schema ===> DataFrame.schema -> datafusion.Schema: map Rust Arrow types to datafusion-py types
  • DataFrame.with_column ===> DataFrame.with_columns: allow multiple inputs, as exprs or key-value pairs
  • DataFrame.with_column_renamed ===> DataFrame.rename(): a simple rename is clear enough and should allow a dict as input
  • DataFrame.aggregate ===> DataFrame.group_by().agg(): this feels more natural coming from PySpark/Polars/pandas
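
To make the proposed join signature concrete, here is a minimal sketch of how `on`, `left_on`, and `right_on` might be normalized into two key lists. The helper name `resolve_join_keys` and the error rules are assumptions for illustration only; nothing here is existing datafusion-python code.

```python
from typing import List, Optional, Sequence, Tuple, Union

# Hypothetical alias for the `str | sequence[str] | None` type in the proposal.
Keys = Optional[Union[str, Sequence[str]]]


def resolve_join_keys(
    on: Keys = None,
    left_on: Keys = None,
    right_on: Keys = None,
) -> Tuple[List[str], List[str]]:
    """Return (left_keys, right_keys) for a Polars-style join signature."""

    def as_list(value: Union[str, Sequence[str]]) -> List[str]:
        # A bare string is one column name, not a sequence of characters.
        return [value] if isinstance(value, str) else list(value)

    if on is not None:
        # Assumed rule: `on` and `left_on`/`right_on` are mutually exclusive.
        if left_on is not None or right_on is not None:
            raise ValueError("pass either `on` or `left_on`/`right_on`, not both")
        keys = as_list(on)
        return keys, list(keys)

    if left_on is None or right_on is None:
        raise ValueError("`left_on` and `right_on` must be given together")

    left_keys, right_keys = as_list(left_on), as_list(right_on)
    if len(left_keys) != len(right_keys):
        raise ValueError("`left_on` and `right_on` must have the same length")
    return left_keys, right_keys
```

With this shape, `on="id"` covers the common case where both sides share a column name, while `left_on`/`right_on` cover differently named keys.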

Can remove these:

  • DataFrame.select_columns: already covered by DataFrame.select

Missing APIs:

  • DataFrame.cast: cast one or more columns at the top level
  • DataFrame.drop: drop columns, instead of writing a very verbose select
  • DataFrame.fill_null/fill_nan: fill null or NaN values
  • DataFrame.interpolate: interpolate values per column
  • Asof join: missing in the DataFrame API?
  • Join on a predicate (inequality join)
  • DataFrame.head/tail
  • DataFrame.pivot
  • DataFrame.unpivot
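
As a rough illustration of why `DataFrame.drop` could be cheap to add, it can be expressed as a select over the remaining column names. The helper below is hypothetical and works on plain lists of names; raising on unknown columns is an assumed design choice, not something decided in this issue.

```python
from typing import List, Sequence


def columns_after_drop(schema_names: Sequence[str], *to_drop: str) -> List[str]:
    """Return the column names a `drop` would keep, in schema order."""
    missing = [name for name in to_drop if name not in schema_names]
    if missing:
        # Assumption: dropping a column that does not exist is an error.
        raise KeyError(f"columns not found: {missing}")
    dropped = set(to_drop)
    return [name for name in schema_names if name not in dropped]


# The result is what a verbose select would list by hand today.
remaining = columns_after_drop(["id", "name", "tmp_a", "tmp_b"], "tmp_a", "tmp_b")
```

A `drop` method could then pass the remaining names to the existing select machinery.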

Optional but useful:

  • DataFrame.with_row_idx
ion-elgreco added the enhancement (New feature or request) label on Sep 20, 2024
@emgeee
Contributor

emgeee commented Sep 23, 2024

These proposals generally sound good to me. I do think care should be taken with the first two points: the DataFrame cache() and collect() methods mirror the underlying Rust library, and renaming them at the Python level would be immensely confusing for those coming from the Rust library or seeking to better understand the Python layer.

The other suggestion I would add is to keep DataFrame.with_column() but make it a simple wrapper around DataFrame.with_columns().
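
A minimal sketch of that wrapper idea, using a toy stand-in class rather than the real DataFrame. `ToyFrame`, its string "expressions", and the `(name, expr)` tuple form for positional inputs are all assumptions made so the example runs on its own.

```python
from typing import Dict, Optional, Tuple


class ToyFrame:
    """Toy stand-in for DataFrame: maps column names to expression strings."""

    def __init__(self, columns: Optional[Dict[str, str]] = None) -> None:
        self.columns = dict(columns or {})

    def with_columns(self, *exprs: Tuple[str, str], **named_exprs: str) -> "ToyFrame":
        # Positional inputs are (name, expr) pairs; keyword inputs are name=expr.
        new_columns = dict(self.columns)
        for name, expr in exprs:
            new_columns[name] = expr
        new_columns.update(named_exprs)
        # Return a new frame so the original is left unchanged.
        return ToyFrame(new_columns)

    def with_column(self, name: str, expr: str) -> "ToyFrame":
        # Kept for compatibility: delegates to the plural method.
        return self.with_columns((name, expr))


df = ToyFrame({"a": "col(a)"})
single = df.with_column("b", "a + 1")
several = df.with_columns(("b", "a + 1"), c="a * 2")
```

Keeping the singular method as a one-line delegate means existing callers keep working while all the logic lives in one place.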

@ion-elgreco
Contributor Author

Asof joins are pending: apache/datafusion#318

2 participants