Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure query-planning is disabled in dask #371

Merged
merged 9 commits into from
Jul 26, 2024
67 changes: 67 additions & 0 deletions merlin/config/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,67 @@
#
# Copyright (c) 2024, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

_DASK_QUERY_PLANNING_ENABLED = False
try:
# Disable query-planning and string conversion
import dask

dask.config.set(
{
"dataframe.query-planning": False,
"dataframe.convert-string": False,
}
)
except ImportError:
dask = None
else:
import sys

import dask.dataframe as dd
from packaging.version import parse

if parse(dask.__version__) > parse("2024.6.0"):
# For newer versions of dask, we can just check
# the official DASK_EXPR_ENABLED constant
_DASK_QUERY_PLANNING_ENABLED = dd.DASK_EXPR_ENABLED
else:
# For older versions of dask, we must assume query
# planning is enabled if dask_expr was imported
# (because we can't know for sure)
_DASK_QUERY_PLANNING_ENABLED = "dask_expr" in sys.modules


def validate_dask_configs():
"""Central check for problematic config options in Dask"""
if _DASK_QUERY_PLANNING_ENABLED:
raise NotImplementedError(
"Merlin does not support the query-planning API in "
"Dask Dataframe yet. Please make sure query-planning is "
"disabled before dask.dataframe is imported.\n\ne.g."
"dask.config.set({'dataframe.query-planning': False})"
"\n\nOr set the environment variable: "
"export DASK_DATAFRAME__QUERY_PLANNING=False"
)

if dask is not None and dask.config.get("dataframe.convert-string"):
raise NotImplementedError(
"Merlin does not support automatic string conversion in "
"Dask Dataframe yet. Please make sure this option is "
"disabled.\n\ne.g."
"dask.config.set({'dataframe.convert-string': False})"
"\n\nOr set the environment variable: "
"export DASK_DATAFRAME__CONVERT_STRING=False"
)
4 changes: 3 additions & 1 deletion merlin/core/__init__.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
#
# Copyright (c) 2022, NVIDIA CORPORATION.
# Copyright (c) 2022-2024, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
Expand All @@ -14,6 +14,8 @@
# limitations under the License.
#

from merlin.config import validate_dask_configs
from merlin.core import _version

__version__ = _version.get_versions()["version"]
validate_dask_configs()
6 changes: 5 additions & 1 deletion merlin/dag/__init__.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
#
# Copyright (c) 2022, NVIDIA CORPORATION.
# Copyright (c) 2022-2024, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
Expand All @@ -15,6 +15,10 @@
#
# flake8: noqa

from merlin.config import validate_dask_configs

validate_dask_configs()

from merlin.dag.graph import Graph
from merlin.dag.node import Node, iter_nodes, postorder_iter_nodes, preorder_iter_nodes
from merlin.dag.operator import DataFormats, Operator, Supports
Expand Down
8 changes: 6 additions & 2 deletions merlin/io/__init__.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
#
# Copyright (c) 2022, NVIDIA CORPORATION.
# Copyright (c) 2022-2024, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
Expand All @@ -13,8 +13,12 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#

# flake8: noqa

from merlin.config import validate_dask_configs

validate_dask_configs()
Comment on lines +18 to +20
Copy link
Contributor Author

@rjzamora rjzamora Jul 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benfred @jperez999 - It doesn't seem like I can add a top-level merlin/__init__.py file to perform the dask-config validation without messing up the namespace for libraries like merlin.dataloader.

What do you think about the new merlin.config module I added in this PR?

As far as I can tell, we can be sure query-planning is disabled across merlin/nvt if we add this validate_dask_configs() call at the top of merlin/core/__init__.py, merlin/io/__init__.py, and merlin/dag/__init__.py. All internal merlin code using dask will need to import one of these modules first. Therefore, we only need to worry about the user importing dask.dataframe before importing something from merlin, and I think validate_dask_configs raises a suitable error message when this happens.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine. Is there a way for us to test this works, right now? Or do we need to wait until we start using 24.06+ cudf/dask packages?


from merlin.io import dataframe_iter, dataset, shuffle
from merlin.io.dataframe_iter import DataFrameIter
from merlin.io.dataset import MERLIN_METADATA_DIR_NAME, Dataset
Expand Down
Loading