[RELEASE] cudf v23.10 #14224

raydouglass · 2023-09-28T14:57:01Z

❄️ Code freeze for `branch-23.10` and v23.10 release

What does this mean?

Only critical/hotfix level issues should be merged into branch-23.10 until release (merging of this PR).

What is the purpose of this PR?

- Update documentation
- Allow testing for the new release
- Enable a means to merge branch-23.10 into main for the release

Forward-merge branch-23.08 to branch-23.10

This PR enforces, in `23.10`, the deprecations introduced through `23.08`. This PR removes `strings_to_categorical` parameter support in `read_parquet`. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Matthew Roeschke (https://github.com/mroeschke) - Richard (Rick) Zamora (https://github.com/rjzamora) - Bradley Dice (https://github.com/bdice) URL: #13732

…3729) draft This PR adds additional numeric dtypes to `GroupBy.apply` with `engine='jit'`. Authors: - https://github.com/brandon-b-miller Approvers: - Bradley Dice (https://github.com/bdice) URL: #13729

Branch 23.10 merge 23.08

Forward-merge branch-23.08 to branch-23.10

`make_unique` in Cython's libcpp headers is not annotated with `except +`. As a consequence, if the constructor throws, we do not catch it in Python. To work around this (see cython/cython#5560 for details), provide our own implementation. Due to the way assignments occur to temporaries, we need to now explicitly wrap all calls to `make_unique` in `move`, but that is arguably preferable to not being able to catch exceptions, and will not be necessary once we move to Cython 3. - Closes #13743 Authors: - Lawrence Mitchell (https://github.com/wence-) Approvers: - Ashwin Srinath (https://github.com/shwina) URL: #13746

Branch 23.10 merge 23.08

Checking for boolean values in a range results in incorrect behavior:

```python
In [1]: True in range(0, 0)
Out[1]: False

In [3]: True in range(0, 2)
Out[3]: True
```

This results in the following bug:

```python
In [23]: s = cudf.Series([True, False])

In [24]: s[0]
Out[24]: True

In [25]: type(s[0])
Out[25]: numpy.bool_

In [26]: True in s
Out[26]: True

In [26]: True in s.to_pandas()
Out[26]: False
```

This PR fixes this issue by properly checking if an integer is passed to `RangeIndex.__contains__`.

Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Bradley Dice (https://github.com/bdice) - Matthew Roeschke (https://github.com/mroeschke) URL: #13779
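The root cause can be reproduced without cudf, because Python's `bool` is a subclass of `int`. The following is a minimal pure-Python sketch of the kind of type check described above; `range_contains` is a hypothetical helper written for illustration and is not the actual `RangeIndex.__contains__` implementation.

```python
import numbers


def range_contains(rng: range, item) -> bool:
    """Membership test that rejects booleans (hypothetical helper)."""
    # bool is a subclass of int, so it has to be excluded explicitly.
    if isinstance(item, bool) or not isinstance(item, numbers.Integral):
        return False
    return item in rng


# Plain `in` treats True as the integer 1.
print(True in range(0, 2))                # True
print(range_contains(range(0, 2), True))  # False
print(range_contains(range(0, 2), 1))     # True
```

Checking for `bool` before `numbers.Integral` matters, since `isinstance(True, numbers.Integral)` is also true.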

We currently allow construction of mixed-dtype series by type-casting the values into a common type, as below:

```python
In [1]: import cudf

In [2]: import pandas as pd

In [3]: s = pd.Series([1, 2, 3], dtype='datetime64[ns]')

In [5]: p = pd.Series([10, 11])

In [6]: new_s = pd.concat([s, p])

In [7]: new_s
Out[7]:
0    1970-01-01 00:00:00.000000001
1    1970-01-01 00:00:00.000000002
2    1970-01-01 00:00:00.000000003
0                               10
1                               11
dtype: object

In [8]: cudf.Series(new_s)
Out[8]:
0   1970-01-01 00:00:00.000000
1   1970-01-01 00:00:00.000000
2   1970-01-01 00:00:00.000000
0   1970-01-01 00:00:00.000010
1   1970-01-01 00:00:00.000011
dtype: datetime64[us]
```

This behavior is incorrect and comes from the `pa.array` constructor. This PR ensures we handle such cases properly and raise an error.

Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Matthew Roeschke (https://github.com/mroeschke) - Bradley Dice (https://github.com/bdice) URL: #13768

The negative unary op on a boolean series results in conversion to an `int` type:

```python
In [1]: import cudf

In [2]: s = cudf.Series([True, False])

In [3]: s
Out[3]:
0     True
1    False
dtype: bool

In [4]: -s
Out[4]:
0   -1
1    0
dtype: int8

In [5]: -s.to_pandas()
Out[5]:
0    False
1     True
dtype: bool
```

This PR fixes the above issue by returning the inversion of the boolean column instead of multiplying by `-1`.

Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Bradley Dice (https://github.com/bdice) - Lawrence Mitchell (https://github.com/wence-) URL: #13780
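The dispatch described above can be sketched in plain Python. This is only an illustration over lists; `negate` is a hypothetical helper and does not reflect how cudf columns are actually implemented.

```python
def negate(values):
    """Unary minus over a list-backed column (hypothetical helper)."""
    if all(isinstance(v, bool) for v in values):
        # Boolean column: logical inversion keeps the values boolean.
        return [not v for v in values]
    # Numeric column: ordinary arithmetic negation.
    return [-v for v in values]


print(negate([True, False]))  # [False, True]
print(negate([1, -2]))        # [-1, 2]
```

Multiplying by `-1` would instead turn `True` into the integer `-1`, which is the dtype change the PR removes.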

This PR preserves column names in various APIs by retaining `self._data._level_names` and also calculating when to preserve the column names. Fixes: #13741, #13740 Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Bradley Dice (https://github.com/bdice) - Ashwin Srinath (https://github.com/shwina) - Lawrence Mitchell (https://github.com/wence-) URL: #13772

…13766) If a columns argument is provided to the dataframe constructor, this should be used to select columns from the provided data dictionary. The previous logic did do this correctly, but didn't preserve the appropriate order of the resulting columns (which should come out in the order that the column selection is in). - Closes #13738 Authors: - Lawrence Mitchell (https://github.com/wence-) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #13766
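The ordering rule can be sketched with plain dictionaries. `select_columns` is a hypothetical helper for illustration, not the cudf constructor; it assumes that names missing from the data are simply skipped, which is a simplification.

```python
def select_columns(data: dict, columns=None) -> dict:
    """Pick columns from `data` in the order given by `columns`."""
    if columns is None:
        return dict(data)
    # Iterate over `columns`, not over `data`, so the selection order wins.
    # Assumption of this sketch: unknown names are skipped.
    return {name: data[name] for name in columns if name in data}


data = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}
print(list(select_columns(data, columns=["c", "a"])))  # ['c', 'a']
```

Iterating over `data` and filtering by `columns` would select the same columns but return them as `['a', 'c']`, which is the ordering problem described above.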

This PR fixes various cases in binary operations where columns are of certain dtypes and the binary operations on those dataframes and series don't yield correct results, correct resulting column types, or have missing columns altogether. This PR also ensures that column ordering matches that of pandas binary ops when pandas compatibility mode is enabled. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Lawrence Mitchell (https://github.com/wence-) - Vyas Ramasubramani (https://github.com/vyasr) - Bradley Dice (https://github.com/bdice) URL: #13778

Forward-merge branch-23.08 to branch-23.10

This PR upgrades arrow version in `cudf` to `12.0.1` Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Bradley Dice (https://github.com/bdice) - Ray Douglass (https://github.com/raydouglass) URL: #13728

… other types (#13786) This PR fixes an issue when trying to merge a `datetime`|`timedelta` type column with another type:

```python
In [1]: import cudf

In [2]: import pandas as pd

In [3]: df = cudf.DataFrame({'a': cudf.Series([10, 20, 30], dtype='datetime64[ns]')})

In [4]: df2 = df.astype('int')

In [5]: df.merge(df2)
Out[5]:
      a
0  10.0
1  20.0
2  30.0

In [6]: df2.merge(df)
Out[6]:
      a
0  10.0
1  20.0
2  30.0

In [7]: df
Out[7]:
                              a
0 1970-01-01 00:00:00.000000010
1 1970-01-01 00:00:00.000000020
2 1970-01-01 00:00:00.000000030

In [8]: df2
Out[8]:
    a
0  10
1  20
2  30

In [9]: df.to_pandas().merge(df2.to_pandas())
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[9], line 1
----> 1 df.to_pandas().merge(df2.to_pandas())

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pandas/core/frame.py:10092, in DataFrame.merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
  10073 @substitution("")
  10074 @appender(_merge_doc, indents=2)
  10075 def merge(
   (...)
  10088     validate: str | None = None,
  10089 ) -> DataFrame:
  10090     from pandas.core.reshape.merge import merge
> 10092     return merge(
  10093         self,
  10094         right,
  10095         how=how,
  10096         on=on,
  10097         left_on=left_on,
  10098         right_on=right_on,
  10099         left_index=left_index,
  10100         right_index=right_index,
  10101         sort=sort,
  10102         suffixes=suffixes,
  10103         copy=copy,
  10104         indicator=indicator,
  10105         validate=validate,
  10106     )

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pandas/core/reshape/merge.py:110, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     93 @substitution("\nleft : DataFrame or named Series")
     94 @appender(_merge_doc, indents=0)
     95 def merge(
   (...)
    108     validate: str | None = None,
    109 ) -> DataFrame:
--> 110     op = _MergeOperation(
    111         left,
    112         right,
    113         how=how,
    114         on=on,
    115         left_on=left_on,
    116         right_on=right_on,
    117         left_index=left_index,
    118         right_index=right_index,
    119         sort=sort,
    120         suffixes=suffixes,
    121         indicator=indicator,
    122         validate=validate,
    123     )
    124     return op.get_result(copy=copy)

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pandas/core/reshape/merge.py:707, in _MergeOperation.__init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, indicator, validate)
    699 (
    700     self.left_join_keys,
    701     self.right_join_keys,
    702     self.join_names,
    703 ) = self._get_merge_keys()
    705 # validate the merge keys dtypes. We may need to coerce
    706 # to avoid incompatible dtypes
--> 707 self._maybe_coerce_merge_keys()
    709 # If argument passed to validate,
    710 # check if columns specified as unique
    711 # are in fact unique.
    712 if validate is not None:

File /nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/pandas/core/reshape/merge.py:1344, in _MergeOperation._maybe_coerce_merge_keys(self)
   1342 # datetimelikes must match exactly
   1343 elif needs_i8_conversion(lk.dtype) and not needs_i8_conversion(rk.dtype):
-> 1344     raise ValueError(msg)
   1345 elif not needs_i8_conversion(lk.dtype) and needs_i8_conversion(rk.dtype):
   1346     raise ValueError(msg)

ValueError: You are trying to merge on datetime64[ns] and int64 columns. If you wish to proceed you should use pd.concat
```

Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Bradley Dice (https://github.com/bdice) URL: #13786

…ter (#13791) Adds JSON reader and writer to the list of components that support GDS. Updates the supported data types in JSON reader and writer. Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Bradley Dice (https://github.com/bdice) - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #13791

Arrow 12.0 uses the vendored CMake target `arrow::xsimd` instead of the global target name of `xsimd`. We need to use the new name so that libcudf can be used from the build directory by other projects. Found by issue: NVIDIA/spark-rapids-jni#1306 Authors: - Robert Maynard (https://github.com/robertmaynard) Approvers: - Bradley Dice (https://github.com/bdice) URL: #13790

spark-rapids has [code for debugging JNI Tables/Columns][1] that is useful for debugging during dev work in cudf. This PR proposes to start moving it to cudf/java. spark-rapids will be updated to call into cudf in a follow-up PR.

[1]: https://github.com/NVIDIA/spark-rapids/blob/b5cf25eef347d845bd77077d5cb9035262281f98/sql-plugin/src/main/java/com/nvidia/spark/rapids/GpuColumnVector.java

## Sample Usage with JShell

```Bash
(rapids) rapids@compose:~/cudf/java$ mvn dependency:build-classpath -Dmdep.outputFile=cudf-java-cp.txt
(rapids) rapids@compose:~/cudf/java$ jshell --class-path target/cudf-23.10.0-SNAPSHOT-cuda12.jar:$(< cudf-java-cp.txt) \
    -R -Dai.rapids.cudf.debug.output=log_error
```

```Java
|  Welcome to JShell -- Version 11.0.20
|  For an introduction type: /help intro

jshell> import ai.rapids.cudf.*;

jshell> Table tbl = new Table.TestBuilder().column(1,2,3,4,5,6).build()
tbl ==> Table{columns=[ColumnVector{rows=6, type=INT32, n ... e=140381937458144, rows=6}

jshell> TableDebug.get().debug("gera", tbl)
[main] ERROR ai.rapids.cudf.TableDebug - DEBUG gera Table{columns=[ColumnVector{rows=6, type=INT32, nullCount=Optional[0], offHeap=(ID: 4 7fad371d1a30)}], cudfTable=140381937458144, rows=6}
[main] ERROR ai.rapids.cudf.TableDebug - GPU COLUMN 0 - NC: 0 DATA: DeviceMemoryBufferView{address=0x7fad3be00000, length=24, id=-1} VAL: null
[main] ERROR ai.rapids.cudf.TableDebug - COLUMN 0 - INT32
[main] ERROR ai.rapids.cudf.TableDebug - 0 1
[main] ERROR ai.rapids.cudf.TableDebug - 1 2
[main] ERROR ai.rapids.cudf.TableDebug - 2 3
[main] ERROR ai.rapids.cudf.TableDebug - 3 4
[main] ERROR ai.rapids.cudf.TableDebug - 4 5
[main] ERROR ai.rapids.cudf.TableDebug - 5 6
```

Authors: - Gera Shegalov (https://github.com/gerashegalov) Approvers: - Robert (Bobby) Evans (https://github.com/revans2) - Nghia Truong (https://github.com/ttnghia) URL: #13783

This PR removes some extra stores and loads that don't appear to be necessary in our groupby apply lowering which are possibly slowing things down. This came up during #13767. Authors: - https://github.com/brandon-b-miller Approvers: - Bradley Dice (https://github.com/bdice) URL: #13792

This PR enables computing the Pearson correlation between two columns of a group within a UDF. Concretely, syntax such as the following will be allowed and produce the same result as pandas:

```python
ans = df.groupby('key').apply(lambda group_df: group_df['x'].corr(group_df['y']))
```

Authors: - Ashwin Srinath (https://github.com/shwina) - https://github.com/brandon-b-miller Approvers: - Bradley Dice (https://github.com/bdice) URL: #13767
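For reference, the quantity computed per group can be written out in plain Python. This is a sketch of the math only, assuming rows arrive as `(key, x, y)` tuples; `pearson` and `groupby_corr` are hypothetical helpers and are unrelated to the JIT-compiled implementation in the PR.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    # Raises ZeroDivisionError if either column is constant.
    return cov / (sx * sy)


def groupby_corr(rows):
    """Group (key, x, y) rows by key and correlate x with y per group."""
    groups = defaultdict(lambda: ([], []))
    for key, x, y in rows:
        groups[key][0].append(x)
        groups[key][1].append(y)
    return {key: pearson(xs, ys) for key, (xs, ys) in groups.items()}


rows = [("a", 1, 2), ("a", 2, 4), ("a", 3, 6),
        ("b", 1, 3), ("b", 2, 2), ("b", 3, 1)]
result = groupby_corr(rows)
print(result)  # group "a" is close to 1.0, group "b" is close to -1.0
```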

Fixes: #13049 This PR allows errors from pyarrow to be propagated when an un-bounded sequence is passed to `pa.array` constructor. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Bradley Dice (https://github.com/bdice) URL: #13799

… with mask (#14201) Workaround for an illegal instruction error on sm90 for warp intrinsics with a non-`0xffffffff` mask. Removed the mask and used ~0u (`0xffffffff`) as MASK because - all threads in the warp have the correct data on error, since the threads with is_within_bounds==true update error. - init_state is not required at the last iteration, the only place where MASK is not ~0u. Fixes #14183 Authors: - Karthikeyan (https://github.com/karthikeyann) Approvers: - Divye Gala (https://github.com/divyegala) - Elias Stehle (https://github.com/elstehle) - Mark Harris (https://github.com/harrism) URL: #14201

This adds two more aggregations for groupby and reduction: * `HISTOGRAM`: Count the number of occurrences (aka frequency) for each element, and * `MERGE_HISTOGRAM`: Merge different outputs generated by `HISTOGRAM` aggregations This is the prerequisite for implementing the exact distributed percentile aggregation (#13885). However, these two new aggregations may be useful in other use-cases that need to do frequency counting. Closes #13885. Merging checklist: * [X] Working prototypes. * [X] Cleanup and docs. * [X] Unit test. * [ ] Test with spark-rapids integration tests. Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Robert Maynard (https://github.com/robertmaynard) - Yunsong Wang (https://github.com/PointKernel) - Vukasin Milovanovic (https://github.com/vuule) URL: #14045
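The idea behind the two aggregations can be sketched with `collections.Counter`. This is a conceptual illustration, not the libcudf API: the function names are hypothetical, and the nearest-rank percentile is chosen here only for simplicity (the definition used for #13885 may differ).

```python
import math
from collections import Counter


def histogram(values):
    """Count occurrences of each element (a frequency table)."""
    return Counter(values)


def merge_histograms(histograms):
    """Combine partial histograms by summing per-element counts."""
    merged = Counter()
    for h in histograms:
        merged.update(h)
    return merged


def percentile_from_histogram(hist, q):
    """Exact nearest-rank percentile for q in (0, 100]."""
    total = sum(hist.values())
    rank = max(1, math.ceil(q / 100 * total))
    seen = 0
    for value in sorted(hist):
        seen += hist[value]
        if seen >= rank:
            return value


# Two partitions aggregated independently, then merged.
part1, part2 = [1, 2, 2, 3], [2, 3, 3, 5]
merged = merge_histograms([histogram(part1), histogram(part2)])
print(percentile_from_histogram(merged, 50))  # 2
```

Because the merged histogram equals the histogram of the combined data, the percentile is exact even though no partition ever sees all the rows.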

…12.2 (#14108) Compile issues found by compiling libcudf with the `rapidsai/devcontainers:23.10-cpp-gcc9-cuda12.2-ubuntu20.04` docker container. Authors: - Robert Maynard (https://github.com/robertmaynard) Approvers: - Mark Harris (https://github.com/harrism) - David Wendt (https://github.com/davidwendt) - Bradley Dice (https://github.com/bdice) - Vyas Ramasubramani (https://github.com/vyasr) - Mike Wilson (https://github.com/hyperbolic2346) URL: #14108

This PR adds a method to the `ColumnView` class to allow for conversion from integers to hex. Closes #14081. Authors: - Raza Jafri (https://github.com/razajafri) Approvers: - Kuhu Shukla (https://github.com/kuhushukla) - Robert (Bobby) Evans (https://github.com/revans2) URL: #14205

This implements JNI for `HISTOGRAM` and `MERGE_HISTOGRAM` aggregations in both groupby and reduction. Depends on: * #14045 Contributes to: * #13885. Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Jason Lowe (https://github.com/jlowe) URL: #14154

Fixes: #14088 This PR preserves `names` of `column` object while constructing a `DataFrame` through various constructor flows. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Bradley Dice (https://github.com/bdice) - Ashwin Srinath (https://github.com/shwina) URL: #14110

Pass the error code to the host when a kernel detects invalid input. If multiple error types are detected, they are combined using a bitwise OR so that the caller gets the aggregate error code that includes all types of errors that occurred. Does not change the kernel side checks. Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - https://github.com/nvdbaranec - Divye Gala (https://github.com/divyegala) - Yunsong Wang (https://github.com/PointKernel) - Bradley Dice (https://github.com/bdice) URL: #14167
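The bitwise-OR aggregation can be illustrated with Python's `enum.IntFlag`. The error names below are hypothetical and chosen only for the example; the actual error codes used by the kernels are not listed in this PR description.

```python
from enum import IntFlag
from functools import reduce
from operator import or_


class ErrorCode(IntFlag):
    # Hypothetical error categories; each one occupies its own bit.
    NONE = 0
    BAD_LENGTH = 1
    BAD_ENCODING = 2
    OUT_OF_RANGE = 4


def combine(errors):
    """OR all reported codes into one aggregate value."""
    return reduce(or_, errors, ErrorCode.NONE)


reported = combine(
    [ErrorCode.BAD_LENGTH, ErrorCode.OUT_OF_RANGE, ErrorCode.BAD_LENGTH]
)
print(ErrorCode.BAD_LENGTH in reported)    # True
print(ErrorCode.BAD_ENCODING in reported)  # False
```

Since each category has a distinct bit, repeated reports of the same error are idempotent and the caller can test for any category independently.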

Previously, the parquet chunked reader operated by controlling the size of output chunks only. It would still ingest the entire input file and decompress it, which can take up a considerable amount of memory. With this new 'progressive' support, we also 'chunk' at the input level. Specifically, the user can pass a `pass_read_limit` value which controls how much memory is used for storing compressed/decompressed data. The reader will make multiple passes over the file, reading as many row groups as it can to attempt to fit within this limit. Within each pass, chunks are emitted as before.

From the external user's perspective, the chunked read mechanism is the same. You call `has_next()` and `read_chunk()`. If the user has specified a value for `pass_read_limit` the set of chunks produced might end up being different (although the concatenation of all of them will still be the same).

The core idea of the code change is to add the idea of the internal `pass`. Previously we had a `file_intermediate_data` which held data across `read_chunk()` calls. There is now a `pass_intermediate_data` struct which holds information specific to a given pass. Many of the invariant things from the file level before (row groups and chunks to process) are now stored in the pass intermediate data. As we begin each pass, we take the subset of global row groups and chunks that we are going to process for this pass, copy them to our intermediate data, and the remainder of the reader references this instead of the file-level data.

In order to avoid breaking pre-existing interfaces, there's a new constructor for the `chunked_parquet_reader` class:

```
chunked_parquet_reader(
  std::size_t chunk_read_limit,
  std::size_t pass_read_limit,
  parquet_reader_options const& options,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
```

Authors: - https://github.com/nvdbaranec Approvers: - Yunsong Wang (https://github.com/PointKernel) - Vukasin Milovanovic (https://github.com/vuule) URL: #14079
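The two-level scheme can be modelled in a few lines of Python. This is a conceptual sketch and not the libcudf implementation: `plan_passes` and `read_chunks` are hypothetical names, each row group is modelled as a `(memory_size, rows)` pair, the chunk limit is expressed in rows rather than bytes, and a limit of `0` is assumed to mean "no limit".

```python
def plan_passes(row_group_sizes, pass_read_limit):
    """Group consecutive row groups so each pass fits within the limit.

    A single row group larger than the limit still gets its own pass.
    """
    if pass_read_limit == 0:  # assumption: 0 means "no limit"
        return [list(range(len(row_group_sizes)))]
    passes, current, used = [], [], 0
    for idx, size in enumerate(row_group_sizes):
        if current and used + size > pass_read_limit:
            passes.append(current)
            current, used = [], 0
        current.append(idx)
        used += size
    if current:
        passes.append(current)
    return passes


def read_chunks(row_groups, chunk_rows, pass_read_limit):
    """Yield one list of rows per chunk, pass by pass."""
    sizes = [size for size, _ in row_groups]
    for pass_ids in plan_passes(sizes, pass_read_limit):
        # Only the row groups of the current pass are held at once.
        rows = [r for i in pass_ids for r in row_groups[i][1]]
        for start in range(0, len(rows), chunk_rows):
            yield rows[start:start + chunk_rows]


row_groups = [(100, [0, 1, 2]), (100, [3, 4]), (100, [5, 6, 7, 8])]
single_pass = list(read_chunks(row_groups, chunk_rows=2, pass_read_limit=0))
two_passes = list(read_chunks(row_groups, chunk_rows=2, pass_read_limit=200))
print(single_pass)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8]]
print(two_passes)   # [[0, 1], [2, 3], [4], [5, 6], [7, 8]]
```

The chunk boundaries differ between the two runs, but concatenating the chunks gives the same rows in the same order, which matches the behaviour described above.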

copy-pr-bot · 2023-09-28T14:57:07Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

review-notebook-app · 2023-09-28T14:57:08Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

This PR pins `dask` and `distributed` to `2023.9.2` for `23.10` release. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Ray Douglass (https://github.com/raydouglass) - Peter Andreas Entschev (https://github.com/pentschev)

Fixes a bug where floating-point values were used in decimal128 rounding, giving wrong results. Closes #14210. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Divye Gala (https://github.com/divyegala) - Mark Harris (https://github.com/harrism)
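Why a floating-point intermediate is unsafe here can be shown with plain integers: a 64-bit float carries only 53 bits of mantissa, far fewer than a 128-bit decimal needs. The sketch below is not the libcudf code; both helper functions are hypothetical, and round-half-away-from-zero is an assumed rounding mode chosen for the example.

```python
def round_half_up_int(unscaled: int, drop_digits: int) -> int:
    """Round away `drop_digits` decimal digits using integer math only."""
    factor = 10 ** drop_digits
    quotient, remainder = divmod(abs(unscaled), factor)
    if 2 * remainder >= factor:
        quotient += 1
    return quotient if unscaled >= 0 else -quotient


def round_via_float(unscaled: int, drop_digits: int) -> int:
    """Same operation routed through a 64-bit float (lossy)."""
    return int(round(float(unscaled) / 10 ** drop_digits))


# Unscaled value with scale 2, i.e. 12345678901234567890123.45
unscaled = 1234567890123456789012345
exact = round_half_up_int(unscaled, 2)
lossy = round_via_float(unscaled, 2)
print(exact)           # 12345678901234567890123
print(exact == lossy)  # False
```

For small values the two functions agree, which is why this class of bug tends to surface only with large decimal128 inputs.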

…nt values. (#14242) This is a follow-up PR to #14233. This PR fixes a bug where floating-point values were used as intermediates in ceil/floor unary operations and cast operations that require rescaling for fixed-point types, giving inaccurate results. See also: - #14233 (comment) - #14243 Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Mike Wilson (https://github.com/hyperbolic2346) - Vukasin Milovanovic (https://github.com/vuule)

raydouglass and others added 30 commits July 20, 2023 16:29

v23.10

6443f0e

Merge pull request #13730 from rapidsai/branch-23.08

0edea00

Forward-merge branch-23.08 to branch-23.10

Support more numeric types in Groupby.apply with engine='jit' (#1…

43aca00

Merge branch 'branch-23.08' into branch-23.10-merge-23.08

40e0eb0

Merge pull request #13753 from vyasr/branch-23.10-merge-23.08

32dd46d

Branch 23.10 merge 23.08

Merge pull request #13759 from rapidsai/branch-23.08

a0fdca3

Forward-merge branch-23.08 to branch-23.10

Merge pull request #13760 from rapidsai/branch-23.08

ee7e39b

Forward-merge branch-23.08 to branch-23.10

Merge pull request #13761 from rapidsai/branch-23.08

06ef2fc

Forward-merge branch-23.08 to branch-23.10

Merge pull request #13762 from rapidsai/branch-23.08

7dcf052

Forward-merge branch-23.08 to branch-23.10

Merge pull request #13764 from rapidsai/branch-23.08

e55f944

Forward-merge branch-23.08 to branch-23.10

Merge branch 'branch-23.08' into branch-23.10-merge-23.08

5a3f9fc

Merge pull request #13773 from vyasr/branch-23.10-merge-23.08

9aa2968

Branch 23.10 merge 23.08

Merge branch-23.08 into branch-23.10

5af5ca8

Merge pull request #13784 from bdice/branch-23.10-merge-23.08

7746af4

Forward-merge branch-23.08 to branch-23.10

karthikeyann and others added 8 commits September 27, 2023 03:59

raydouglass requested review from a team as code owners September 28, 2023 14:57

raydouglass requested review from vyasr, galipremsagar and vuule September 28, 2023 14:57

github-actions bot added the libcudf, Python, CMake, conda, and Java labels Sep 28, 2023

galipremsagar and others added 4 commits September 28, 2023 13:16

Update Changelog [skip ci]

1358793

raydouglass merged commit 562f70e into main Oct 11, 2023
4 checks passed
