Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure query-planning is disabled in dask #371

Merged
merged 9 commits into from
Jul 26, 2024

Conversation

rjzamora
Copy link
Contributor

Adds logic to raise an error if "query-planning" is enabled in dask when merlin.core is imported.

We will need to make sure merlin.core is imported (across all of Merlin) before dask.dataframe is ever imported.
This is most important for RAPIDS 24.06+, but may apply to edge cases in 24.04.

@rjzamora rjzamora added the bug Something isn't working label Jul 16, 2024
@rjzamora rjzamora self-assigned this Jul 16, 2024
Copy link

Documentation preview

https://nvidia-merlin.github.io/core/review/pr-371

@rjzamora
Copy link
Contributor Author

Notes on testing with RAPIDS 24.04

This PR disables both query-planning and automatic string conversion, which mitigate most test failures with 24.04. Remaining failures are described below (TODO: file issues where necessary)...

  • The dataset-validation machinery relies on deprecated "legacy" pyarrow features. My sense is that this validation machinery must be removed:
FAILED tests/unit/io/test_io.py::test_validate_dataset_bad_schema - ValueErro...
FAILED tests/unit/io/test_io.py::test_validate_dataset[parquet] - ValueError:...
FAILED tests/unit/io/test_io.py::test_validate_and_regenerate_dataset - Value...
  • It is no-longer possible to call the values attribute on an empty cudf.Series. This poses a problem for the SeriesLike protocol/test in merlin-core (not sure of the best solution yet):
FAILED tests/unit/core/test_protocols.py::test_series_are_serieslike[None] - ...
  • cudf does not support the np.float16 dtype (anymore?):
FAILED tests/unit/dag/ops/test_udf.py::test_udf_dtype_multi_op_propagation[False]
  • NOTE: Dataset cpu/gpu conversion must be updated to use the to_backend API before query-planning is enabled.

  • Possibly harmless int32 vs int64 dtype mismatch:

FAILED tests/unit/io/test_io.py::test_hive_partitioned_data[True] - Assertion...
FAILED tests/unit/io/test_io.py::test_hive_partitioned_data[False] - Assertio...

Comment on lines +18 to +20
from merlin.config import validate_dask_configs

validate_dask_configs()
Copy link
Contributor Author

@rjzamora rjzamora Jul 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benfred @jperez999 - It doesn't seem like I can add a top-level merlin/__init__.py file to perform the dask-config validation without messing up the namespace for libraries like merlin.dataloader.

What do you think about the new merlin.config module I added in this PR?

As far as I can tell, we can be sure query-planning is disabled across merlin/nvt if we add this validate_dask_configs() call at the top of merlin/core/__init__.py, merlin/io/__init__.py, and merlin/dag/__init__.py. All internal merlin code using dask will need to import one of these modules first. Therefore, we only need to worry about the user importing dask.dataframe before importing something from merlin, and I think validate_dask_configs raises a suitable error message when this happens.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine. Is there a way for us to test this works, right now? Or do we need to wait until we start using 24.06+ cudf/dask packages?

@jperez999 jperez999 merged commit 2643923 into NVIDIA-Merlin:main Jul 26, 2024
17 checks passed
@rjzamora rjzamora deleted the disable-query-planning branch July 26, 2024 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants