Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

df.groupby().agg() feature #16084

Conversation

NeilGeorge1
Copy link

@NeilGeorge1 NeilGeorge1 commented Jun 25, 2024

Description

Added **kwargs Support in df.groupby().agg()

Implemented support for **kwargs in df.groupby().agg() to provide greater flexibility and customization options during aggregation operations.

This issue closes #15967

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@NeilGeorge1 NeilGeorge1 requested a review from a team as a code owner June 25, 2024 18:03
Copy link

copy-pr-bot bot commented Jun 25, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the Python Affects Python cuDF API. label Jun 25, 2024
@vyasr
Copy link
Contributor

vyasr commented Jun 25, 2024

Thanks for the contribution @NeilGeorge1! Could you add some tests of some different cases? I know you won't be able to run them on your machine without a GPU, but you should be able to write them using pandas code locally and then simply switch to using cudf at the end before commiting. Then we can use your tests to validate the behavior here. I would recommend trying a few different more complex cases such as including multiple output columns.

Also, right now you've made the change to the dask object, but we want the functionality in cudf itself. Have a look at this file, which is where I think you'd want to add the new code.

@vyasr vyasr added feature request New feature or request non-breaking Non-breaking change labels Jun 25, 2024
@NeilGeorge1
Copy link
Author

Thanks for the contribution @NeilGeorge1! Could you add some tests of some different cases? I know you won't be able to run them on your machine without a GPU, but you should be able to write them using pandas code locally and then simply switch to using cudf at the end before commiting. Then we can use your tests to validate the behavior here. I would recommend trying a few different more complex cases such as including multiple output columns.

Also, right now you've made the change to the dask object, but we want the functionality in cudf itself. Have a look at this file, which is where I think you'd want to add the new code.

Yeah sure @vyasr. I don't have any prior experience writing test cases before but I shall try my best.
Also should I just leave the code in dask as it is or should i remove it?

@vyasr
Copy link
Contributor

vyasr commented Jun 25, 2024

I think @galipremsagar or @wence- can probably give you a better answer about how best we want this to look in dask.

@wence-
Copy link
Contributor

wence- commented Jun 26, 2024

I think @galipremsagar or @wence- can probably give you a better answer about how best we want this to look in dask.

Let's first get the cudf implementation right first, it might then be we don't have to do anything dask-dataframe.

To do this, I think we want to mimic the way the pandas API behaves (https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.agg.html)

I don't think we support arguments and engine to agg in cudf, so we want to support this signature:

def agg(self, func, **kwargs):

The behaviour of pandas is:

  • If func is not None, then kwargs are ignored (silently)
  • If func is None, then kwargs are used with the key defining the output column name, and the value defining the (column, aggregation) pair.

This preprocessing should happen in cudf/python/cudf/cudf/core/groupby/groupby.py in the agg method (around line 600).

I think the way to attack this is to let _normalize_aggs take both func and kwargs, and produce appropriately preprocessed results. There's a little bit of a dance we need to do to get the right names out, but it's certainly possible.

@NeilGeorge1
Copy link
Author

NeilGeorge1 commented Jun 26, 2024

Understood @wence-

I will start by reading the documentation to grasp the required changes and proceed to implement them.

Following this, I will add few more test cases taking reference from the examples provided in the groupby.py file into python/cudf/cudf/tests/groupby/test_agg.py as requested by @vyasr .

Given that my PC lacks a GPU, I will then run these tests using only pandas and then when all test cases pass, switch to cudf before committing and making a pull request.

Is there anything else I need to consider?

@wence-
Copy link
Contributor

wence- commented Jun 26, 2024

Understood @wence-

I will start by reading the documentation to grasp the required changes and proceed to implement them.

Following this, I will add few more test cases taking reference from the examples provided in the groupby.py file into python/cudf/cudf/tests/groupby/test_agg.py as requested by @vyasr .

Given that my PC lacks a GPU, I will then run these tests using only pandas and then when all test cases pass, switch to cudf before committing and making a pull request.

Is there anything else I need to consider?

That sounds about right. If you have queries about the (admittedly underdocumented) internals of the groupby datastructures, please ask here.

@NeilGeorge1
Copy link
Author

Understood @wence-
I will start by reading the documentation to grasp the required changes and proceed to implement them.
Following this, I will add few more test cases taking reference from the examples provided in the groupby.py file into python/cudf/cudf/tests/groupby/test_agg.py as requested by @vyasr .
Given that my PC lacks a GPU, I will then run these tests using only pandas and then when all test cases pass, switch to cudf before committing and making a pull request.
Is there anything else I need to consider?

That sounds about right. If you have queries about the (admittedly underdocumented) internals of the groupby datastructures, please ask here.

Sure thing! Will do thanks!

Comment on lines -199 to +203
self, arg, split_every=None, split_out=1, shuffle_method=None
self, arg, split_every=None, split_out=1, shuffle_method=None, **kwargs
):
if kwargs:
arg = {col_name: agg_func for col_name, (col, agg_func) in kwargs.items()}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you are already revising this PR, but just a quick note: These changes are being made to the legacy dask-cudf API (which will soon be deprecated and then completely removed). The new (default) expression-based API is defined in expr/_groupby.py.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok Sure working on cudf now

@vyasr
Copy link
Contributor

vyasr commented Jul 19, 2024

@NeilGeorge1 have you had any luck here? Do you need any pointers?

@Matt711
Copy link
Contributor

Matt711 commented Aug 10, 2024

I'll open a new PR for this.
xref #16528

@Matt711 Matt711 closed this Aug 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[FEA] Support named aggregations in df.groupby().agg()
5 participants