
GH-25118: [Python] Make NumPy an optional runtime dependency #41904

Merged 66 commits on Sep 2, 2024.

Commits (66):
8d7db71
Minimal changes to make possible to import pyarrow and create a pyarr…
raulcd May 31, 2024
bf729e0
Add minimal CI test
raulcd May 31, 2024
0763dce
Add minor linting fixes
raulcd May 31, 2024
50ebbe1
Minor fix for function call instead of dict access
raulcd May 31, 2024
6e4fbe1
Refactor tests to be able to run test_array.py with and without numpy
raulcd Jun 4, 2024
9ca7a82
Fix linting for C++ files
raulcd Jun 4, 2024
98af6e4
Fix a bunch more tests
raulcd Jun 5, 2024
60bd002
Several changes related to initial code review comments
raulcd Jun 11, 2024
70b6237
Fix wrong mark
raulcd Jun 11, 2024
a3aea68
Minor lint
raulcd Jun 11, 2024
0b92561
Add HAS_NUMPY in order to avoid 'numpy' in sys.modules every time
raulcd Jun 12, 2024
e2a2172
Add last bits that crashed some tests
raulcd Jun 12, 2024
19f013a
Some more improvements
raulcd Jun 13, 2024
3bd78cb
Use PYTEST_PYARGS to define module to test
raulcd Jun 19, 2024
d1d74a6
Add test_without_numpy test module and nonumpy marker. Add automarker…
raulcd Jun 20, 2024
b0e26c0
Add tensor test
raulcd Jun 20, 2024
8e92e7c
Add missing license
raulcd Jun 20, 2024
05c9a71
Minor tests refactor
raulcd Jun 25, 2024
6fbff8e
Add some more tests
raulcd Jun 25, 2024
fd15246
Rollback change on pyarrow/tests/test_adhoc_memory_leak.py
raulcd Jun 25, 2024
c5e1567
Remove HAS_NUMPY helper variable
raulcd Jun 25, 2024
f292b1f
Some other import modules and test changes
raulcd Jun 25, 2024
d7a8de7
Remove stray import
raulcd Jun 25, 2024
0d2886b
Remove without_numpy from test_array and mark the numpy tests
raulcd Jun 26, 2024
890f6d4
Remove without_numpy from test_compute and mark the numpy tests
raulcd Jun 26, 2024
b885984
Remove without_numpy from test_dataset and mark the numpy tests
raulcd Jun 26, 2024
80a0abe
Remove without_numpy from test_flight
raulcd Jun 26, 2024
ba4805f
Collect from test_cpp_internals only tests that do not require numpy …
raulcd Jun 26, 2024
2ca233c
Mark numpy tests for pyarrow/tests/test_csv.py
raulcd Jun 26, 2024
82cbdd8
Add pytest.mark.numpy to all test_cuda and test_cuda_numba_interop tests
raulcd Jun 26, 2024
a31e55a
Fix test collection and add corresponding marks for numpy
raulcd Jun 26, 2024
92966bf
Fix linting
raulcd Jul 23, 2024
a166d51
Fix if on Digest and Quantile Options
raulcd Jul 23, 2024
1bba8b7
Remove unnecessary test fixture and last review comments
raulcd Jul 23, 2024
e6ff932
Fix tests that without numpy where an ImportError is raised instead o…
raulcd Jul 23, 2024
b3341b7
Review comments: Update base image from no-numpy, use mamba uninstall…
raulcd Jul 31, 2024
fcd37c3
Some more review comments to remove usage of numpy on some tests
raulcd Jul 31, 2024
d28f9c8
Remove numpy from CSV tests
raulcd Jul 31, 2024
e231a05
Remove unnecessary numpy usages from test_dataset
raulcd Aug 1, 2024
9c8d5c8
Fix several tests to do not require numpy on test_io
raulcd Aug 1, 2024
1f9077d
Fix tests and remove usages of numpy mark on test_ipc, fix fixture
raulcd Aug 1, 2024
b7e2a56
Revert back and use fixture for np types using strings
raulcd Aug 1, 2024
190e1f6
Simplify if by using SUPPORTED_INPUT_ARR_TYPES
raulcd Aug 1, 2024
02911d4
Add numpy or pandas mark to wrapper when collection C++ injected tests
raulcd Aug 1, 2024
2cb7556
Remove skip when numpy on test_convert_builtin.py and add individual …
raulcd Aug 1, 2024
9b44792
Remove empty stray lines
raulcd Aug 1, 2024
3e8c97d
Fix stray change
raulcd Aug 1, 2024
8766b84
Remove stray import
raulcd Aug 2, 2024
90455d3
Mark hypothesis tests that require numpy
raulcd Aug 2, 2024
ea22c30
Use available SEQUENCE_TYPES to inner sequence on nested arrays
raulcd Aug 2, 2024
630b607
Fix test collection for tests/interchange/test_conversion.py
raulcd Aug 6, 2024
7963828
Fix test collection for tests/interchange/test_interchange_spec.py
raulcd Aug 6, 2024
547d98e
Remove numpy dependency from pyarrow/tests/test_dataset_encryption.py
raulcd Aug 6, 2024
51636e6
Apply suggestions from code review
raulcd Aug 6, 2024
ecd2e4f
Fix test collection for pyarrow/tests/test_dlpack.py
raulcd Aug 6, 2024
d88a3f5
Fix test collection for pyarrow/tests/test_extension_type.py
raulcd Aug 6, 2024
32304f8
Fix test collection for pyarrow/tests/scalars.py
raulcd Aug 6, 2024
f5420c2
Add comment to why cython tests require NumPy (still a build dependency)
raulcd Aug 6, 2024
6365ef2
Fix test collection for pyarrow/tests/pandas.py
raulcd Aug 6, 2024
3d99652
Fixes to numpy tests
raulcd Aug 6, 2024
6d54ffa
Update python/pyarrow/tests/test_pandas.py
raulcd Aug 26, 2024
8560239
Update python/pyarrow/tests/test_csv.py
raulcd Aug 26, 2024
a4e094c
Update python/pyarrow/tests/test_csv.py
raulcd Aug 26, 2024
eaabab7
Remove numpy mark and add np guard to test
raulcd Aug 26, 2024
1277b84
Rebase main and add mark to skip new tests that require numpy
raulcd Aug 26, 2024
08da867
Add comment on why we duplicate test_basics_np_required for float16
raulcd Aug 26, 2024
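The core pattern these commits apply across pyarrow is a guarded NumPy import: `import numpy` is wrapped in try/except, the module reference is set to `None` on failure, and NumPy-specific code paths check that sentinel. The sketch below illustrates the idea; the function names `to_numpy` and `is_array_like` here are illustrative stand-ins, not the real pyarrow internals.

```python
# Guarded-import pattern: degrade gracefully when NumPy is absent.
try:
    import numpy as np
except ImportError:
    np = None


def to_numpy(values):
    """Convert a sequence to a numpy.ndarray, failing clearly without NumPy."""
    if np is None:
        raise ImportError("Cannot return a numpy.ndarray if NumPy is not present")
    return np.asarray(values)


def is_array_like(obj):
    """Report whether obj is a NumPy array; trivially False without NumPy."""
    if np is None:
        return False
    return isinstance(obj, np.ndarray)
```

Code that merely accepts ndarrays can fall through to a safe default, while code that must produce one raises a clear `ImportError` instead of an opaque `NameError`.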
6 changes: 6 additions & 0 deletions .github/workflows/python.yml
@@ -59,6 +59,7 @@ jobs:
- conda-python-3.9-nopandas
- conda-python-3.8-pandas-1.0
- conda-python-3.10-pandas-latest
- conda-python-3.10-no-numpy
include:
- name: conda-python-docs
cache: conda-python-3.9
@@ -83,6 +84,11 @@ jobs:
title: AMD64 Conda Python 3.10 Pandas latest
python: "3.10"
pandas: latest
- name: conda-python-3.10-no-numpy
cache: conda-python-3.10
image: conda-python-no-numpy
title: AMD64 Conda Python 3.10 without NumPy
python: "3.10"
env:
PYTHON: ${{ matrix.python || 3.8 }}
UBUNTU: ${{ matrix.ubuntu || 20.04 }}
32 changes: 32 additions & 0 deletions docker-compose.yml
@@ -126,6 +126,7 @@ x-hierarchy:
- conda-python-hdfs
- conda-python-java-integration
- conda-python-jpype
- conda-python-no-numpy
- conda-python-spark
- conda-python-substrait
- conda-verify-rc
@@ -1258,6 +1259,37 @@ services:
volumes: *conda-volumes
command: *python-conda-command

conda-python-no-numpy:
# Usage:
# docker-compose build conda
# docker-compose build conda-cpp
# docker-compose build conda-python
# docker-compose build conda-python-no-numpy
# docker-compose run --rm conda-python-no-numpy
image: ${REPO}:${ARCH}-conda-python-${PYTHON}-no-numpy
build:
context: .
dockerfile: ci/docker/conda-python.dockerfile
cache_from:
- ${REPO}:${ARCH}-conda-python-${PYTHON}
args:
repo: ${REPO}
arch: ${ARCH}
python: ${PYTHON}
shm_size: *shm-size
environment:
<<: [*common, *ccache, *sccache]
PARQUET_REQUIRE_ENCRYPTION: # inherit
HYPOTHESIS_PROFILE: # inherit
PYARROW_TEST_HYPOTHESIS: # inherit
volumes: *conda-volumes
command:
["
/arrow/ci/scripts/cpp_build.sh /arrow /build &&
/arrow/ci/scripts/python_build.sh /arrow /build &&
mamba uninstall -y numpy &&
/arrow/ci/scripts/python_test.sh /arrow"]

conda-python-docs:
# Usage:
# archery docker run conda-python-docs
4 changes: 2 additions & 2 deletions python/CMakeLists.txt
@@ -339,17 +339,17 @@ set(PYARROW_CPP_SRCS
${PYARROW_CPP_SOURCE_DIR}/gdb.cc
${PYARROW_CPP_SOURCE_DIR}/helpers.cc
${PYARROW_CPP_SOURCE_DIR}/inference.cc
${PYARROW_CPP_SOURCE_DIR}/init.cc
${PYARROW_CPP_SOURCE_DIR}/io.cc
${PYARROW_CPP_SOURCE_DIR}/ipc.cc
${PYARROW_CPP_SOURCE_DIR}/numpy_convert.cc
${PYARROW_CPP_SOURCE_DIR}/numpy_init.cc
${PYARROW_CPP_SOURCE_DIR}/numpy_to_arrow.cc
${PYARROW_CPP_SOURCE_DIR}/python_test.cc
${PYARROW_CPP_SOURCE_DIR}/python_to_arrow.cc
${PYARROW_CPP_SOURCE_DIR}/pyarrow.cc
${PYARROW_CPP_SOURCE_DIR}/serialize.cc
${PYARROW_CPP_SOURCE_DIR}/udf.cc)
set_source_files_properties(${PYARROW_CPP_SOURCE_DIR}/init.cc
set_source_files_properties(${PYARROW_CPP_SOURCE_DIR}/numpy_init.cc
PROPERTIES SKIP_PRECOMPILE_HEADERS ON
SKIP_UNITY_BUILD_INCLUSION ON)

16 changes: 12 additions & 4 deletions python/pyarrow/_compute.pyx
@@ -33,7 +33,10 @@ from pyarrow.util import _DEPR_MSG
from libcpp cimport bool as c_bool

import inspect
import numpy as np
try:
import numpy as np
except ImportError:
np = None
import warnings


@@ -43,6 +46,11 @@ _substrait_msg = (
)


SUPPORTED_INPUT_ARR_TYPES = (list, tuple)
if np is not None:
SUPPORTED_INPUT_ARR_TYPES += (np.ndarray, )


def _pas():
global __pas
if __pas is None:
@@ -473,7 +481,7 @@ cdef class MetaFunction(Function):

cdef _pack_compute_args(object values, vector[CDatum]* out):
for val in values:
if isinstance(val, (list, np.ndarray)):
if isinstance(val, SUPPORTED_INPUT_ARR_TYPES):
val = lib.asarray(val)
Review comment (Member): Unrelated to this PR, but at some point we should accept any ArrowArrayExportable? @jorisvandenbossche

Reply (Member): We should do that indeed! (#43410 is a general issue about accepting such objects in more places)


if isinstance(val, Array):
@@ -2189,7 +2197,7 @@ class QuantileOptions(_QuantileOptions):

def __init__(self, q=0.5, *, interpolation="linear", skip_nulls=True,
min_count=0):
if not isinstance(q, (list, tuple, np.ndarray)):
if not isinstance(q, SUPPORTED_INPUT_ARR_TYPES):
q = [q]
self._set_options(q, interpolation, skip_nulls, min_count)

@@ -2222,7 +2230,7 @@ class TDigestOptions(_TDigestOptions):

def __init__(self, q=0.5, *, delta=100, buffer_size=500, skip_nulls=True,
min_count=0):
if not isinstance(q, (list, tuple, np.ndarray)):
if not isinstance(q, SUPPORTED_INPUT_ARR_TYPES):
q = [q]
self._set_options(q, delta, buffer_size, skip_nulls, min_count)

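The `SUPPORTED_INPUT_ARR_TYPES` change above replaces hard-coded `isinstance(q, (list, tuple, np.ndarray))` checks with a tuple built once at import time, so the same check works whether or not NumPy is installed. A standalone sketch of the idea (the helper `normalize_q` is illustrative, mirroring the `QuantileOptions.__init__` logic):

```python
try:
    import numpy as np
except ImportError:
    np = None

# isinstance() accepts a tuple of types, so np.ndarray is appended only
# when NumPy could actually be imported.
SUPPORTED_INPUT_ARR_TYPES = (list, tuple)
if np is not None:
    SUPPORTED_INPUT_ARR_TYPES += (np.ndarray,)


def normalize_q(q):
    """Wrap a scalar quantile into a list, as QuantileOptions does."""
    if not isinstance(q, SUPPORTED_INPUT_ARR_TYPES):
        q = [q]
    return q
```

Building the tuple once keeps every call site free of `np is None` checks, and extending the tuple later (the reviewers suggest eventually accepting any ArrowArrayExportable) only requires touching one place.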
5 changes: 5 additions & 0 deletions python/pyarrow/array.pxi
@@ -50,6 +50,8 @@ cdef _sequence_to_array(object sequence, object mask, object size,


cdef inline _is_array_like(obj):
if np is None:
return False
if isinstance(obj, np.ndarray):
return True
return pandas_api._have_pandas_internal() and pandas_api.is_array_like(obj)
@@ -1608,6 +1610,9 @@ cdef class Array(_PandasConvertible):
"""
self._assert_cpu()

if np is None:
raise ImportError(
"Cannot return a numpy.ndarray if NumPy is not present")
cdef:
PyObject* out
PandasOptions c_options
14 changes: 8 additions & 6 deletions python/pyarrow/builder.pxi
@@ -15,6 +15,8 @@
# specific language governing permissions and limitations
# under the License.

import math


cdef class StringBuilder(_Weakrefable):
"""
@@ -42,10 +44,10 @@ cdef class StringBuilder(_Weakrefable):
value : string/bytes or np.nan/None
The value to append to the string array builder.
"""
if value is None or value is np.nan:
self.builder.get().AppendNull()
elif isinstance(value, (bytes, str)):
if isinstance(value, (bytes, str)):
self.builder.get().Append(tobytes(value))
elif value is None or math.isnan(value):
self.builder.get().AppendNull()
else:
raise TypeError('StringBuilder only accepts string objects')

@@ -108,10 +110,10 @@ cdef class StringViewBuilder(_Weakrefable):
value : string/bytes or np.nan/None
The value to append to the string array builder.
"""
if value is None or value is np.nan:
self.builder.get().AppendNull()
elif isinstance(value, (bytes, str)):
if isinstance(value, (bytes, str)):
self.builder.get().Append(tobytes(value))
elif value is None or math.isnan(value):
self.builder.get().AppendNull()
else:
raise TypeError('StringViewBuilder only accepts string objects')

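The reordering in these builder diffs is deliberate: `value is np.nan` was an identity check that only matched the exact `np.nan` singleton, while its replacement `math.isnan()` raises `TypeError` for non-numeric inputs, so the string/bytes branch must now come first. A plain-Python sketch of the dispatch order (`classify_append` is a hypothetical name; the real code is a Cython `append` method):

```python
import math


def classify_append(value):
    """Mimic the StringBuilder.append dispatch order from the diff above."""
    if isinstance(value, (bytes, str)):
        # Strings must be checked before math.isnan, which would raise
        # TypeError when handed a str.
        return "append"
    elif value is None or math.isnan(value):
        # math.isnan matches any float NaN, not just the np.nan object.
        return "append_null"
    else:
        raise TypeError("StringBuilder only accepts string objects")
```

Note that non-string, non-NaN numerics such as `1.5` fall through to the `TypeError`, preserving the original builder contract.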
13 changes: 12 additions & 1 deletion python/pyarrow/conftest.py
@@ -25,7 +25,6 @@
from pyarrow.tests.util import windows_has_tzdata
import sys

import numpy as np

groups = [
'acero',
@@ -46,6 +45,8 @@
'lz4',
'memory_leak',
'nopandas',
'nonumpy',
'numpy',
'orc',
'pandas',
'parquet',
@@ -81,6 +82,8 @@
'lz4': Codec.is_available('lz4'),
'memory_leak': False,
'nopandas': False,
'nonumpy': False,
'numpy': False,
'orc': False,
'pandas': False,
'parquet': False,
@@ -158,6 +161,12 @@
except ImportError:
defaults['nopandas'] = True

try:
import numpy # noqa
defaults['numpy'] = True
except ImportError:
defaults['nonumpy'] = True

try:
import pyarrow.parquet # noqa
defaults['parquet'] = True
@@ -327,6 +336,7 @@ def unary_agg_func_fixture():
Register a unary aggregate function (mean)
"""
from pyarrow import compute as pc
import numpy as np

def func(ctx, x):
return pa.scalar(np.nanmean(x))
@@ -352,6 +362,7 @@ def varargs_agg_func_fixture():
Register a unary aggregate function
"""
from pyarrow import compute as pc
import numpy as np

def func(ctx, *args):
sum = 0.0
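The conftest.py change follows pyarrow's existing group/defaults convention: each optional dependency gets a marker group plus a boolean default computed by attempting the import, with a paired `no*` group for the opposite case. A reduced sketch of that probing logic (the real conftest defines many more groups):

```python
groups = ['numpy', 'nonumpy', 'pandas', 'nopandas']

defaults = {
    'numpy': False,
    'nonumpy': False,
    'pandas': False,
    'nopandas': False,
}

# Probe each optional dependency exactly once at collection time.
try:
    import numpy  # noqa: F401
    defaults['numpy'] = True
except ImportError:
    defaults['nonumpy'] = True

try:
    import pandas  # noqa: F401
    defaults['pandas'] = True
except ImportError:
    defaults['nopandas'] = True
```

Exactly one group of each pair is ever enabled, so a test can be marked to run only with, or only without, the dependency, and a CI job that uninstalls NumPy (as the `conda-python-no-numpy` service above does) flips the whole suite in one place.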
2 changes: 1 addition & 1 deletion python/pyarrow/includes/libarrow_python.pxd
@@ -248,7 +248,7 @@ cdef extern from "arrow/python/api.h" namespace "arrow::py::internal" nogil:
CResult[PyObject*] StringToTzinfo(c_string)


cdef extern from "arrow/python/init.h":
cdef extern from "arrow/python/numpy_init.h" namespace "arrow::py":
int arrow_init_numpy() except -1


12 changes: 9 additions & 3 deletions python/pyarrow/lib.pyx
@@ -21,7 +21,10 @@

import datetime
import decimal as _pydecimal
import numpy as np
try:
import numpy as np
except ImportError:
np = None
import os
import sys

@@ -32,8 +35,11 @@ from pyarrow.includes.common cimport PyObject_to_object
cimport pyarrow.includes.libarrow_python as libarrow_python
cimport cpython as cp

# Initialize NumPy C API
arrow_init_numpy()

# Initialize NumPy C API only if numpy was able to be imported
if np is not None:
arrow_init_numpy()

# Initialize PyArrow C++ API
# (used from some of our C++ code, see e.g. ARROW-5260)
import_pyarrow()
79 changes: 47 additions & 32 deletions python/pyarrow/pandas_compat.py
@@ -30,13 +30,17 @@
import re
import warnings

import numpy as np

try:
import numpy as np
except ImportError:
np = None
import pyarrow as pa
from pyarrow.lib import _pandas_api, frombytes, is_threading_enabled # noqa


_logical_type_map = {}
_numpy_logical_type_map = {}
_pandas_logical_type_map = {}


def get_logical_type_map():
@@ -85,27 +89,32 @@ def get_logical_type(arrow_type):
return 'object'


_numpy_logical_type_map = {
np.bool_: 'bool',
np.int8: 'int8',
np.int16: 'int16',
np.int32: 'int32',
np.int64: 'int64',
np.uint8: 'uint8',
np.uint16: 'uint16',
np.uint32: 'uint32',
np.uint64: 'uint64',
np.float32: 'float32',
np.float64: 'float64',
'datetime64[D]': 'date',
np.str_: 'string',
np.bytes_: 'bytes',
}
def get_numpy_logical_type_map():
global _numpy_logical_type_map
if not _numpy_logical_type_map:
_numpy_logical_type_map.update({
np.bool_: 'bool',
np.int8: 'int8',
np.int16: 'int16',
np.int32: 'int32',
np.int64: 'int64',
np.uint8: 'uint8',
np.uint16: 'uint16',
np.uint32: 'uint32',
np.uint64: 'uint64',
np.float32: 'float32',
np.float64: 'float64',
'datetime64[D]': 'date',
np.str_: 'string',
np.bytes_: 'bytes',
})
return _numpy_logical_type_map


def get_logical_type_from_numpy(pandas_collection):
numpy_logical_type_map = get_numpy_logical_type_map()
try:
return _numpy_logical_type_map[pandas_collection.dtype.type]
return numpy_logical_type_map[pandas_collection.dtype.type]
Review comment (Member): I suggest we make the get_numpy_logical_type_map throw a proper error when numpy is not available. Currently this check would throw a KeyError and would do pretty random things with anything that has a dtype attribute. Same for get_pandas_logical_type_map.

Reply (Member): I think all functions in this file essentially assume pandas and numpy are available without proper error checking, because this code is specifically for converting to/from pandas objects, so you know pandas and numpy are available when this gets called. The functions from this file that actually get called from elsewhere in the pyarrow code should maybe have a better error message, but those two helpers are only used here.

except KeyError:
if hasattr(pandas_collection.dtype, 'tz'):
return 'datetimetz'
@@ -1023,18 +1032,23 @@ def _is_generated_index_name(name):
return re.match(pattern, name) is not None


_pandas_logical_type_map = {
'date': 'datetime64[D]',
'datetime': 'datetime64[ns]',
'datetimetz': 'datetime64[ns]',
'unicode': np.str_,
'bytes': np.bytes_,
'string': np.str_,
'integer': np.int64,
'floating': np.float64,
'decimal': np.object_,
'empty': np.object_,
}
def get_pandas_logical_type_map():
global _pandas_logical_type_map

if not _pandas_logical_type_map:
_pandas_logical_type_map.update({
'date': 'datetime64[D]',
'datetime': 'datetime64[ns]',
'datetimetz': 'datetime64[ns]',
'unicode': np.str_,
'bytes': np.bytes_,
'string': np.str_,
'integer': np.int64,
'floating': np.float64,
'decimal': np.object_,
'empty': np.object_,
})
return _pandas_logical_type_map


def _pandas_type_to_numpy_type(pandas_type):
@@ -1050,8 +1064,9 @@ def _pandas_type_to_numpy_type(pandas_type):
dtype : np.dtype
The dtype that corresponds to `pandas_type`.
"""
pandas_logical_type_map = get_pandas_logical_type_map()
try:
return _pandas_logical_type_map[pandas_type]
return pandas_logical_type_map[pandas_type]
except KeyError:
if 'mixed' in pandas_type:
# catching 'mixed', 'mixed-integer' and 'mixed-integer-float'
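The pandas_compat.py change converts the two module-level dtype maps into lazily populated ones: attributes like `np.int64` are no longer touched at import time, only on the first pandas conversion, by which point NumPy is guaranteed to be present. The pattern, reduced to its essentials with a toy two-entry map:

```python
try:
    import numpy as np
except ImportError:
    np = None

_numpy_logical_type_map = {}  # deliberately left empty at import time


def get_numpy_logical_type_map():
    # Populate on first use; np attributes are only dereferenced here,
    # never at module import, so importing this module works without NumPy.
    global _numpy_logical_type_map
    if not _numpy_logical_type_map:
        _numpy_logical_type_map.update({
            np.int64: 'int64',
            np.float64: 'float64',
        })
    return _numpy_logical_type_map
```

As the review thread above notes, calling this helper without NumPy still fails (with an `AttributeError` on `np.int64`), which is acceptable here because the helper is only reached from pandas conversion paths.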