RFC: Refactor functions into Expressions namespaces and functions on the Expr directly #876

Open
ion-elgreco opened this issue Sep 20, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@ion-elgreco
Contributor

One thing I disliked a lot about PySpark was its functions module. Since most, if not all, of the functions in datafusion.functions take an expression and return an expression, we could move them into namespaces on Expr. That should make the right expression much easier to find. Some examples:

  • col().dt.current_date()
  • col().list.ndims()

I am happy to start working on this and to draft a full list of what Expr and its namespaces could look like.
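
For readers unfamiliar with the accessor style, the `col().list.ndims()` example could be sketched roughly as below. This is only an illustration of the pattern, not DataFusion's API: `Expr`, `ListNamespace`, and `col` here are toy stand-ins that record a string, and the `array_ndims(...)` text is just a label for the demo.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Expr:
    """Toy stand-in for an expression; it only records a textual form."""

    text: str

    @property
    def list(self) -> ListNamespace:
        # Accessor that groups list-related functions under `expr.list`.
        return ListNamespace(self)


class ListNamespace:
    """Hypothetical namespace holding list functions for one expression."""

    def __init__(self, expr: Expr) -> None:
        self._expr = expr

    def ndims(self) -> Expr:
        # Expression in, expression out, same as the free-function form.
        return Expr(f"array_ndims({self._expr.text})")


def col(name: str) -> Expr:
    """Toy column constructor (assumption: mirrors the `col()` in the examples)."""
    return Expr(name)


print(col("tags").list.ndims().text)
```

The namespace object holds a reference to the expression it was reached from, so every method on it can return a new expression and chaining keeps working.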

@ion-elgreco ion-elgreco added the enhancement New feature or request label Sep 20, 2024
@timsaucer
Contributor

My primary concerns about doing this are:

  • API churn rate. If we keep changing our interface too rapidly between versions, it will turn people off the project. So if we did this, I'd probably support defining them on Expr while keeping the functions available, with one calling the other, almost as an alias.
  • Familiarity. Like it or not, many, if not most, of the people coming to the project will come from a PySpark/pandas/Polars background. The layout we currently have will be more familiar in some places, which I think will lead to greater adoption.
  • Diverging too much from the DataFusion core approach. I think some level of difference is expected, since Rust and Python do operate very differently. However, if we diverge too much, maintenance gets harder and it gets harder to onboard new contributors. This is a smaller concern.

That being said, I have found that some of the functions I expected to be on Expr were in functions, and I do find the interface somewhat non-intuitive. My proposal would be to decide case by case what should live where.
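
The "one calls the other, almost as an alias" idea could look something like the sketch below. It uses a toy `Expr` that only records a string, and the `upper` method and function are picked purely for illustration; none of this is taken from the actual code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expr:
    """Toy stand-in for an expression; it only records a textual form."""

    text: str

    def upper(self) -> "Expr":
        # Method form: the single place where the logic lives.
        return Expr(f"upper({self.text})")


def upper(arg: Expr) -> Expr:
    """Free-function form kept for familiarity; it only delegates."""
    return arg.upper()


# Both spellings build the same expression.
print(upper(Expr("name")) == Expr("name").upper())
```

Keeping the implementation in one place means the two spellings cannot drift apart, which limits the maintenance cost of offering both.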

@ion-elgreco
Contributor Author

> My primary concerns about doing this are:
>
>   • API churn rate. If we keep changing our interface too rapidly between versions, it will turn people off the project. So if we did this, I'd probably support defining them on Expr while keeping the functions available, with one calling the other, almost as an alias.

I think this can work, but with the goal of deprecating the functions after X releases/months. One problem with PySpark codebases, for example, is that the aliases lead to inconsistent code. I don't think constant change is an issue, as long as you document well how to migrate.
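
A deprecation path along those lines could be sketched with the standard `warnings` module. As before, `Expr` and `upper` are toy stand-ins chosen for illustration, and the warning text is an assumption about what a migration hint might say.

```python
import warnings
from dataclasses import dataclass


@dataclass(frozen=True)
class Expr:
    """Toy stand-in for an expression; it only records a textual form."""

    text: str

    def upper(self) -> "Expr":
        return Expr(f"upper({self.text})")


def upper(arg: Expr) -> Expr:
    """Deprecated free-function form; it warns and then delegates."""
    warnings.warn(
        "upper(expr) is deprecated; use expr.upper() instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    return arg.upper()


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = upper(Expr("name"))

print(result.text, len(caught))
```

The warning message doubles as migration documentation, since it tells the caller exactly which spelling to switch to.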

>   • Familiarity. Like it or not, many, if not most, of the people coming to the project will come from a PySpark/pandas/Polars background. The layout we currently have will be more familiar in some places, which I think will lead to greater adoption.

The current API might speak more to PySpark users, but arguably, and I think many will agree, if you come from Polars or pandas the current API isn't close to what you're familiar with.

>   • Diverging too much from the DataFusion core approach. I think some level of difference is expected, since Rust and Python do operate very differently. However, if we diverge too much, maintenance gets harder and it gets harder to onboard new contributors. This is a smaller concern.
>
> That being said, I have found that some of the functions I expected to be on Expr were in functions, and I do find the interface somewhat non-intuitive. My proposal would be to decide case by case what should live where.

I can draft a mapping from functions to Expr, so we can get a full picture of what this will look like.
