df.groupby().agg() feature #16084

NeilGeorge1 · 2024-06-25T18:03:34Z

Description

Added **kwargs Support in df.groupby().agg()

Implemented support for **kwargs in df.groupby().agg() to provide greater flexibility and customization options during aggregation operations.

This PR closes #15967

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2024-06-25T18:03:37Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

vyasr · 2024-06-25T18:09:14Z

Thanks for the contribution @NeilGeorge1! Could you add tests covering a few different cases? I know you won't be able to run them on your machine without a GPU, but you should be able to write them using pandas code locally and then simply switch to using cudf at the end before committing. Then we can use your tests to validate the behavior here. I would recommend trying a few more complex cases, such as including multiple output columns.
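As a rough illustration of the workflow suggested above, a test can be written against pandas first and pointed at cudf later. This is only a sketch: the alias `xdf`, the test name, and the sample data are illustrative choices, not part of the cudf test suite.

```python
import pandas as pd

# Develop against pandas first. Once the test passes, swap this line for
# `import cudf as xdf` (the alias name `xdf` is an illustrative choice).
import pandas as xdf


def test_named_agg_multiple_output_columns():
    data = {"key": ["a", "a", "b"], "x": [1, 2, 3], "y": [4.0, 5.0, 6.0]}

    # Expected result written out by hand so it does not depend on the
    # library under test.
    expected = pd.DataFrame(
        {"x_max": [2, 3], "y_mean": [4.5, 6.0]},
        index=pd.Index(["a", "b"], name="key"),
    )

    got = (
        xdf.DataFrame(data)
        .groupby("key")
        .agg(x_max=("x", "max"), y_mean=("y", "mean"))
    )
    # When xdf is cudf, convert before comparing: got = got.to_pandas()
    pd.testing.assert_frame_equal(got, expected)


test_named_agg_multiple_output_columns()
```

The case above covers two output columns drawn from two different source columns, which is the kind of "multiple output columns" case mentioned.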

Also, right now you've made the change to the dask object, but we want the functionality in cudf itself. Have a look at this file, which is where I think you'd want to add the new code.

NeilGeorge1 · 2024-06-25T19:26:03Z

Yeah, sure @vyasr. I haven't written test cases before, but I'll try my best.
Also, should I leave the dask code as it is, or should I remove it?

vyasr · 2024-06-25T20:23:18Z

I think @galipremsagar or @wence- can probably give you a better answer about how best we want this to look in dask.

wence- · 2024-06-26T09:01:13Z

Let's get the cudf implementation right first; it may then turn out that we don't have to do anything in dask-dataframe.

To do this, I think we want to mimic the way the pandas API behaves (https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.agg.html)

I don't think we support the positional arguments or the engine argument to agg in cudf, so we want to support this signature:

def agg(self, func, **kwargs):

The behaviour of pandas is:

If func is not None, then kwargs are ignored (silently)
If func is None, then kwargs are used with the key defining the output column name, and the value defining the (column, aggregation) pair.
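For reference, the second case (named aggregation, with func omitted) looks like this in pandas. The column names and data below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "key": ["a", "a", "b", "b"],
        "x": [1, 2, 3, 4],
        "y": [10.0, 20.0, 30.0, 40.0],
    }
)

# func is omitted (None), so each keyword names an output column and its
# value is a (column, aggregation) pair.
result = df.groupby("key").agg(
    x_min=("x", "min"),
    y_total=("y", "sum"),
)
print(result)
```

The output columns are named after the keywords (`x_min`, `y_total`) rather than after the source columns.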

This preprocessing should happen in cudf/python/cudf/cudf/core/groupby/groupby.py in the agg method (around line 600).

I think the way to attack this is to let _normalize_aggs take both func and kwargs, and produce appropriately preprocessed results. There's a little bit of a dance we need to do to get the right names out, but it's certainly possible.
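One possible shape for that preprocessing, as a minimal pure-Python sketch. The helper name `normalize_named_aggs` and its return format are hypothetical; this is not cudf's actual `_normalize_aggs`, only an illustration of splitting the keywords into per-column aggregations while remembering the requested output names.

```python
def normalize_named_aggs(func=None, **kwargs):
    """Hypothetical helper; not the real cudf ``_normalize_aggs``.

    Returns ``(column_aggs, output_names)``. ``column_aggs`` maps each source
    column to the list of aggregations requested on it. ``output_names`` lists
    ``(output_name, column, aggregation)`` triples in request order, which is
    what is needed to rename the result columns afterwards.
    """
    if func is not None:
        # Following the behaviour described in this thread: when func is
        # given, kwargs are not interpreted as named aggregations.
        return func, None

    if not kwargs:
        raise TypeError("Must provide 'func' or named aggregations")

    column_aggs = {}
    output_names = []
    for out_name, spec in kwargs.items():
        if not (isinstance(spec, tuple) and len(spec) == 2):
            raise TypeError(
                f"{out_name!r} must map to a (column, aggregation) pair"
            )
        column, agg = spec
        column_aggs.setdefault(column, []).append(agg)
        output_names.append((out_name, column, agg))
    return column_aggs, output_names


aggs, names = normalize_named_aggs(
    x_min=("x", "min"), x_max=("x", "max"), y_total=("y", "sum")
)
print(aggs)
print(names)
```

Keeping the `(output_name, column, aggregation)` triples separate from the per-column mapping is one way to handle the renaming step: the aggregation can run on the per-column mapping, and the triples then tell you which result column gets which name.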

NeilGeorge1 · 2024-06-26T11:42:01Z

Understood @wence-

I will start by reading the documentation to grasp the required changes and proceed to implement them.

Following this, I will add a few more test cases to python/cudf/cudf/tests/groupby/test_agg.py, based on the examples provided in the groupby.py file, as requested by @vyasr.

Given that my PC lacks a GPU, I will run these tests using only pandas and, once all test cases pass, switch to cudf before committing and making a pull request.

Is there anything else I need to consider?

wence- · 2024-06-26T13:04:47Z

That sounds about right. If you have questions about the (admittedly underdocumented) internals of the groupby data structures, please ask here.

NeilGeorge1 · 2024-06-26T14:41:40Z

Sure thing! Will do, thanks!

rjzamora · 2024-06-27T15:06:26Z

python/dask_cudf/dask_cudf/groupby.py

-        self, arg, split_every=None, split_out=1, shuffle_method=None
+        self, arg, split_every=None, split_out=1, shuffle_method=None, **kwargs
    ):
+        if kwargs:
+            arg = {col_name: agg_func for col_name, (col, agg_func) in kwargs.items()}
+


I know you are already revising this PR, but just a quick note: These changes are being made to the legacy dask-cudf API (which will soon be deprecated and then completely removed). The new (default) expression-based API is defined in expr/_groupby.py.

NeilGeorge1 · 2024-06-28

OK, sure. Working on cudf now.

vyasr · 2024-07-19T17:09:20Z

@NeilGeorge1 have you had any luck here? Do you need any pointers?

Matt711 · 2024-08-10T01:25:09Z

I'll open a new PR for this.
xref #16528

NeilGeorge1 and others added 2 commits June 25, 2024 23:04

Support named aggregations in df.groupby().agg() has been implemented

ed5a5b4

Merge branch 'rapidsai:branch-24.08' into df.groupby().agg()-feature

3970182

NeilGeorge1 requested a review from a team as a code owner June 25, 2024 18:03

github-actions bot added the Python (Affects Python cuDF API) label Jun 25, 2024

vyasr assigned NeilGeorge1 Jun 25, 2024

vyasr added the feature request (New feature or request) and non-breaking (Non-breaking change) labels Jun 25, 2024

rjzamora reviewed Jun 27, 2024

Matt711 closed this Aug 10, 2024
