GH-17682: [C++][Python] Bool8 Extension Type Implementation #43488

Merged
merged 21 commits into from
Aug 21, 2024

Conversation

Member

@joellubi joellubi commented Jul 30, 2024

Rationale for this change

C++ and Python implementations of #43234

What changes are included in this PR?

Are these changes tested?

Yes

Are there any user-facing changes?

Bool8 extension type will be available in C++ and Python libraries

@joellubi joellubi changed the title GH-17682: [C++] [WIP] Bool8 Extension Type Implementation GH-17682: [C++][Python] Bool8 Extension Type Implementation Aug 2, 2024
Contributor

@felipecrv felipecrv left a comment

C++ part LGTM.

Member

@westonpace westonpace left a comment

Just one question, but looks good otherwise.

Comment on lines 4472 to 4535
def to_numpy(self, zero_copy_only=True, writable=False):
    try:
        return self.storage.to_numpy().view(np.bool_)
    except ArrowInvalid as e:
        if zero_copy_only:
            raise e

    return _pc().not_equal(self.storage, 0).to_numpy(zero_copy_only=zero_copy_only, writable=writable)
Member

I'm a little confused by _pc().not_equal(self.storage, 0). Isn't this creating a copy? Wasn't the purpose of bool8 to allow zero-copy with numpy?

Member Author

Hi @westonpace. Yes, the default path for the to_numpy() method enforces zero-copy behavior, which is achieved by the line return self.storage.to_numpy().view(np.bool_). The zero_copy_only kwarg can optionally be set to False, which relaxes this requirement.

The line you indicated does create a copy, but it will only be reached if zero_copy_only is False AND the original attempt at a zero copy view failed.

Member

And in practice, this code path gets reached if there are missing values?

Member Author

Yes, correct. The outcomes of taking the various paths are demonstrated in this test.

This also matches the existing semantics of converting a normal boolean array to numpy, which currently performs a copy to an array of dtype=np.object_ if there are any missing values.
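
Editor's sketch of the two paths described in this thread, using only the Python standard library. This is a toy model, not pyarrow's implementation: `to_bools`, the `validity` list, and the use of `memoryview` as a stand-in for a NumPy view are all assumptions made for illustration. The storage is restricted to the bytes 0 and 1, because reinterpreting other byte values through the `"?"` format is not something this sketch relies on.

```python
def to_bools(storage, validity=None, zero_copy_only=True):
    """Toy model of the two conversion paths; not pyarrow's implementation."""
    if validity is None:
        # Fast path: view the same bytes as booleans; nothing is copied.
        return memoryview(storage).cast("?")
    if zero_copy_only:
        # Missing values cannot be represented in a plain boolean view.
        raise ValueError("zero-copy conversion not possible with missing values")
    # Fallback: allocate a new list; non-zero becomes True, nulls become None.
    return [(value != 0) if valid else None
            for value, valid in zip(storage, validity)]


storage = bytearray([1, 0, 1])   # int8-like storage holding only 0 and 1
view = to_bools(storage)         # zero-copy: shares memory with `storage`
storage[1] = 1                   # a write to the storage...
print(view.tolist())             # ...is visible through the view
print(to_bools(storage, [True, True, False], zero_copy_only=False))
```

The copying branch is only reachable when the caller has opted out of `zero_copy_only` and the data carries missing values, which mirrors the condition stated above.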

Member

Got it. Thanks for the explanation!

@v1gnesh v1gnesh mentioned this pull request Aug 5, 2024

v1gnesh commented Aug 5, 2024

Thank you for this, this is such an excellent addition ❤️

Member

@jorisvandenbossche jorisvandenbossche left a comment

@joellubi added some quick comments, but generally looking good! Still need to check the tests

Comment on lines 4511 to 4512
buf = foreign_buffer(obj.ctypes.data, obj.size)
return Array.from_buffers(bool8(), obj.size, [None, buf])
Member

This would lose track of the buffer owner (the numpy array obj), so you would need to pass that to the foreign_buffer function as the base argument.

However, I think we could also simplify this by first creating a pyarrow storage array of int8, and then using self.from_storage() instead of using from_buffers() ?
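
Editor's sketch of the ownership concern, using only the standard library. `ForeignView` and `owner_survives` are hypothetical names invented here; they model a buffer that wraps an address it does not own and show how an optional `base` reference keeps the owning object alive. The behaviour shown relies on CPython's reference counting and is not a statement about how pyarrow implements `foreign_buffer`.

```python
import array
import gc
import weakref


class ForeignView:
    """Hypothetical stand-in for a buffer wrapping memory it does not own."""

    def __init__(self, address, size, base=None):
        self.address = address
        self.size = size
        self.base = base  # strong reference that keeps the owner alive


def owner_survives(keep_base):
    """Report whether the owning array outlives its last named reference."""
    owner = array.array("b", [1, 0, 1])
    address, length = owner.buffer_info()
    view = ForeignView(address, length * owner.itemsize,
                       base=owner if keep_base else None)
    probe = weakref.ref(owner)
    del owner            # drop the only other strong reference
    gc.collect()
    alive = probe() is not None
    del view
    return alive


print(owner_survives(keep_base=True))
print(owner_survives(keep_base=False))
```

Without `base`, the view holds an address whose memory may already have been released, which is the hazard raised in the comment above.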

Member Author

I gave this a try and it works if the numpy array has dtype=np.int8:

np_arr = np.array([1, 0, 1], dtype=np.int8)
pa_storage_arr = pa.array(np_arr, type=pa.int8())
pa_bool8_arr = pa.ExtensionArray.from_storage(pa.bool8(), pa_storage_arr)

This does not produce any copies. The existing approach of using foreign_buffer also works with np_arr = np.array([True, False, True], dtype=np.bool_) without making a copy.

However, using the pa.array() constructor currently does make a copy when going bool -> int8. I think this would require a zero-copy casting kernel to be added to C++. That seems like it would be a better approach; I just have to wrap my head around that part of the code.

CC: @felipecrv does this sound right ^?

Member Author

Actually, now that I think about it, I don't think a casting kernel is what's needed in this specific scenario, since that goes between Arrow types and we're not trying to convert Arrow Boolean to Arrow Int8. I think what we need is to reinterpret the numpy bool as a numpy int8, then continue the same way as above for the int8 arrow array. I'll give that a try now.

Member Author

Ok I pushed up the change, let me know what you think.

Member

Yes, that looks good!
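
Editor's sketch of the reinterpretation idea this thread settled on, written against the standard library rather than NumPy or pyarrow. `reinterpret_as_int8` is a hypothetical helper, and the `memoryview` with format `"?"` merely stands in for a NumPy bool array; the point is that changing the element type of a one-byte-per-value buffer does not need to copy it.

```python
def reinterpret_as_int8(bool_view):
    """Return an int8 view over the same bytes as a boolean memoryview."""
    return bool_view.cast("b")


flags = memoryview(bytearray([1, 0, 1])).cast("?")  # stand-in for a bool array
storage = reinterpret_as_int8(flags)                # same memory, int8 format

flags[0] = False              # a write through the boolean view...
print(storage.tolist())       # ...shows up in the int8 view
print(storage.obj is flags.obj)
```

Both views report the same underlying object, so the int8 storage can then be wrapped by the extension type without any data movement.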

@joellubi
Member Author

> @joellubi We'll need to update the "Canonical Extension types" table at the end of https://arrow.apache.org/docs/status.html#data-types

@pitrou I'll update that table in a follow-up PR. I made edits to it in #43679, so the addition will be easier once that PR has merged.

@joellubi joellubi marked this pull request as ready for review August 14, 2024 20:37
@joellubi
Copy link
Member Author

@pitrou @jorisvandenbossche Any more comments on the C++ or Python sides respectively, or does this look ok to merge?

return ss.str();
}

std::string Bool8Type::Serialize() const { return ""; }
Member

Emm why is this?

Member Author

This is what's specified in "description of the serialization" for Bool8.

This method is generally used to encode type parameters, but for bool8 there are no parameters. The type is fully defined by its name and storage type.
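
Editor's sketch of why an empty serialization is sufficient for a parameter-free type. `Bool8Like` is a toy Python class, not Arrow's `Bool8Type`; the checks in `deserialize` are assumptions about what a reasonable implementation would validate, not a description of the C++ code in this PR.

```python
class Bool8Like:
    """Toy parameter-free extension type; not Arrow's Bool8Type."""

    extension_name = "arrow.bool8"
    storage_type = "int8"

    def serialize(self):
        # There are no parameters to encode, so the payload is empty.
        return ""

    @classmethod
    def deserialize(cls, storage_type, serialized):
        # Assumed validation: the name and storage type fully define the type.
        if storage_type != cls.storage_type:
            raise ValueError(
                f"expected storage {cls.storage_type}, got {storage_type}")
        if serialized != "":
            raise ValueError("unexpected serialized data for a parameter-free type")
        return cls()


restored = Bool8Like.deserialize("int8", Bool8Like().serialize())
print(type(restored).__name__, repr(restored.serialize()))
```

A parameterized extension type would instead encode its parameters in the payload and parse them back in `deserialize`.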

Member

@jorisvandenbossche jorisvandenbossche left a comment

Looks good!
I added a bunch more comments, but they are all just minor formatting / testing nits

Comment on lines 5336 to 5337
unknown_col: [[True, False, True, True, null]]
unknown_col: [[-1,0,1,2,null]]
Member

Sidenote: this is a good illustration of why we should ideally have a way to let the extension type control this string representation.

Member Author

This is a great point and certainly something I would have liked to have when going through this implementation. I'll open an issue for it.

Member

We already have #36648 covering that I think

def test_bool8_scalar():
    assert pa.ExtensionScalar.from_storage(pa.bool8(), -1).as_py()
Member

Something I didn't think about in the previous round, but it might be better to test the value explicitly in this case, instead of relying on python's general truthiness:

Suggested change
-    assert pa.ExtensionScalar.from_storage(pa.bool8(), -1).as_py()
+    assert pa.ExtensionScalar.from_storage(pa.bool8(), -1).as_py() is True

Because otherwise this test doesn't actually ensure that the result is True or False. If we were still returning the underlying storage of 0, 1, 2 etc., those tests would also pass in their current form.

(same for the lines below)
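
Editor's sketch of the difference between the two assertion styles, in plain Python. The helper names are hypothetical; they only mirror what `assert value` and `assert value is True` accept.

```python
def passes_truthiness_check(value):
    """Mirrors `assert value`: any truthy object passes."""
    return bool(value)


def passes_identity_check(value):
    """Mirrors `assert value is True`: only the bool singleton passes."""
    return value is True


raw_storage_value = -1  # what as_py() would yield if it leaked the int8 storage
print(passes_truthiness_check(raw_storage_value))  # the loose check cannot tell
print(passes_identity_check(raw_storage_value))    # the strict check can
print(passes_identity_check(True))
```

The loose form accepts a leaked storage value such as -1 or 2, so only the identity form guards against a regression where the scalar stops returning a Python bool.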

Member Author

Good idea, it reads a lot clearer now too.



def test_bool8_scalar():
    assert not pa.ExtensionScalar.from_storage(pa.bool8(), 0).as_py()
Member

Thanks for adding that support!

@felipecrv felipecrv merged commit 5258819 into apache:main Aug 21, 2024
42 of 43 checks passed

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 5258819.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 26 possible false positives for unstable benchmarks that are known to sometimes produce them.
