
Remove all Dask-ML uses #886

Merged: 13 commits (branch remove_daskml) into dask-contrib:main on Nov 29, 2022

Conversation

sarahyurick (Collaborator):

Closes #853

Depends on #832 and #855


codecov-commenter commented Oct 24, 2022

Codecov Report

Merging #886 (670aea0) into main (161e276) will increase coverage by 0.17%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #886      +/-   ##
==========================================
+ Coverage   75.39%   75.56%   +0.17%     
==========================================
  Files          73       73              
  Lines        4011     4019       +8     
  Branches      721      715       -6     
==========================================
+ Hits         3024     3037      +13     
+ Misses        826      818       -8     
- Partials      161      164       +3     
Impacted Files                              Coverage Δ
dask_sql/physical/rel/custom/predict.py     91.66% <100.00%> (+2.38%) ⬆️
dask_sql/_version.py                        35.31% <0.00%> (+1.41%) ⬆️


@sarahyurick sarahyurick changed the title [BLOCKED] Remove all Dask-ML uses Remove all Dask-ML uses Nov 14, 2022
@@ -918,7 +916,6 @@ def test_experiment_automl_regressor(c, client, training_df):
generations=2,
cv=2,
n_jobs=-1,
use_dask=True,
sarahyurick (Collaborator Author):

use_dask requires Dask-ML to be installed.


sarahyurick commented Nov 16, 2022

TODO: Investigate GPU cases and add tests accordingly.

@ayushdg ayushdg requested a review from VibhuJawa November 16, 2022 18:42
@@ -794,7 +813,7 @@ def test_ml_experiment(c, client, training_df):
"""
CREATE EXPERIMENT my_exp WITH (
model_class = 'sklearn.ensemble.GradientBoostingClassifier',
experiment_class = 'dask_ml.model_selection.GridSearchCV',
experiment_class = 'sklearn.model_selection.GridSearchCV',
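After this change the CPU path resolves the experiment_class string to sklearn's GridSearchCV. A rough standalone sketch of what the updated test exercises, with toy data standing in for the timeseries fixture (dask-sql imports the class by its dotted name and calls fit() on it):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Toy stand-in for the `timeseries` fixture: x, y columns and x*y > 0 AS target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# With sklearn's GridSearchCV as experiment_class, the fit dask-sql runs
# is equivalent to this (a reduced grid is used here to keep it quick).
search = GridSearchCV(
    GradientBoostingClassifier(),
    {"n_estimators": [16, 32], "learning_rate": [0.1, 0.01]},
    cv=2,
)
search.fit(X, y)
print(search.best_params_)
```

The resulting cv_results_ dict is what dask-sql turns into the experiment output table (df = pd.DataFrame(search.cv_results_) in create_experiment.py).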
sarahyurick (Collaborator Author):

I can't really get experiment_class to work on the GPU. For example:

c.sql(
    """
    CREATE EXPERIMENT my_exp WITH (
    model_class = 'sklearn.ensemble.GradientBoostingClassifier',
    experiment_class = 'cuml.model_selection.GridSearchCV',
    tune_parameters = (n_estimators = ARRAY [16, 32, 2],learning_rate = ARRAY [0.1,0.01,0.001],
                       max_depth = ARRAY [3,4,5,10]),
    target_column = 'target'
) AS (
        SELECT x, y, x*y > 0 AS target
        FROM timeseries
        LIMIT 100
    )
    """
)

errors with:

INFO:dask_sql.physical.rel.custom.create_experiment:{'n_estimators': [16, 32, 2], 'learning_rate': [0.1, 0.01, 0.001], 'max_depth': [3, 4, 5, 10]}
INFO:dask_sql.physical.rel.custom.create_experiment:{}
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [8], line 1
----> 1 c.sql(
      2     """
      3     CREATE EXPERIMENT my_exp WITH (
      4     model_class = 'sklearn.ensemble.GradientBoostingClassifier',
      5     experiment_class = 'cuml.model_selection.GridSearchCV',
      6     tune_parameters = (n_estimators = ARRAY [16, 32, 2],learning_rate = ARRAY [0.1,0.01,0.001],
      7                        max_depth = ARRAY [3,4,5,10]),
      8     target_column = 'target'
      9 ) AS (
     10         SELECT x, y, x*y > 0 AS target
     11         FROM timeseries
     12         LIMIT 100
     13     )
     14     """
     15 )

File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask_sql/context.py:501, in Context.sql(self, sql, return_futures, dataframes, gpu, config_options)
    496 else:
    497     raise RuntimeError(
    498         f"Encountered unsupported `LogicalPlan` sql type: {type(sql)}"
    499     )
--> 501 return self._compute_table_from_rel(rel, return_futures)

File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask_sql/context.py:830, in Context._compute_table_from_rel(self, rel, return_futures)
    829 def _compute_table_from_rel(self, rel: "LogicalPlan", return_futures: bool = True):
--> 830     dc = RelConverter.convert(rel, context=self)
    832     # Optimization might remove some alias projects. Make sure to keep them here.
    833     select_names = [field for field in rel.getRowType().getFieldList()]

File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask_sql/physical/rel/convert.py:61, in RelConverter.convert(cls, rel, context)
     55     raise NotImplementedError(
     56         f"No relational conversion for node type {node_type} available (yet)."
     57     )
     58 logger.debug(
     59     f"Processing REL {rel} using {plugin_instance.__class__.__name__}..."
     60 )
---> 61 df = plugin_instance.convert(rel, context=context)
     62 logger.debug(f"Processed REL {rel} into {LoggableDataFrame(df)}")
     63 return df

File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask_sql/physical/rel/custom/create_experiment.py:169, in CreateExperimentPlugin.convert(self, rel, context)
    167 search = ExperimentClass(model, {**parameters}, **experiment_kwargs)
    168 logger.info(tune_fit_kwargs)
--> 169 search.fit(
    170     X.to_dask_array(lengths=True),
    171     y.to_dask_array(lengths=True),
    172     **tune_fit_kwargs,
    173 )
    174 df = pd.DataFrame(search.cv_results_)
    175 df["model_class"] = model_class

File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/sklearn/model_selection/_search.py:786, in BaseSearchCV.fit(self, X, y, groups, **fit_params)
    783 X, y, groups = indexable(X, y, groups)
    784 fit_params = _check_fit_params(X, fit_params)
--> 786 cv_orig = check_cv(self.cv, y, classifier=is_classifier(estimator))
    787 n_splits = cv_orig.get_n_splits(X, y, groups)
    789 base_estimator = clone(self.estimator)

File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/sklearn/model_selection/_split.py:2331, in check_cv(cv, y, classifier)
   2326 cv = 5 if cv is None else cv
   2327 if isinstance(cv, numbers.Integral):
   2328     if (
   2329         classifier
   2330         and (y is not None)
-> 2331         and (type_of_target(y, input_name="y") in ("binary", "multiclass"))
   2332     ):
   2333         return StratifiedKFold(cv)
   2334     else:

File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/sklearn/utils/multiclass.py:286, in type_of_target(y, input_name)
    283 if sparse_pandas:
    284     raise ValueError("y cannot be class 'SparseSeries' or 'SparseArray'")
--> 286 if is_multilabel(y):
    287     return "multilabel-indicator"
    289 # DeprecationWarning will be replaced by ValueError, see NEP 34
    290 # https://numpy.org/neps/nep-0034-infer-dtype-is-object.html

File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/sklearn/utils/multiclass.py:152, in is_multilabel(y)
    150 warnings.simplefilter("error", np.VisibleDeprecationWarning)
    151 try:
--> 152     y = np.asarray(y)
    153 except (np.VisibleDeprecationWarning, ValueError):
    154     # dtype=object should be provided explicitly for ragged arrays,
    155     # see NEP 34
    156     y = np.array(y, dtype=object)

File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask/array/core.py:1704, in Array.__array__(self, dtype, **kwargs)
   1702     x = x.astype(dtype)
   1703 if not isinstance(x, np.ndarray):
-> 1704     x = np.array(x)
   1705 return x

File cupy/_core/core.pyx:1473, in cupy._core.core._ndarray_base.__array__()

TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.

Using model_class = 'xgboost.XGBClassifier' or model_class = 'xgboost.dask.XGBClassifier' results in the same error as above.

When I try it with a model_class from cuML, more errors arise. For example, if I try it with model_class = 'cuml.dask.ensemble.RandomForestClassifier' (cuML has no GradientBoostingClassifier), sklearn raises a

TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator <cuml.dask.ensemble.randomforestclassifier.RandomForestClassifier object at 0x7f0c5f692820> does not.
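The implicit-conversion failure can be reproduced without a GPU. In the sketch below, a hypothetical DeviceArray class stands in for cupy's ndarray: like cupy, its __array__ refuses implicit host transfer, which is exactly what trips up sklearn's type_of_target/is_multilabel checks inside GridSearchCV.fit():

```python
import numpy as np

# Hypothetical stand-in for a cupy/GPU-backed array: like cupy.ndarray,
# it refuses implicit conversion to a host NumPy array.
class DeviceArray:
    def __init__(self, data):
        self._data = np.asarray(data)  # pretend this buffer lives on the GPU

    def __array__(self, dtype=None, copy=None):
        raise TypeError(
            "Implicit conversion to a NumPy array is not allowed. "
            "Please use `.get()` to construct a NumPy array explicitly."
        )

    def get(self):
        return self._data.copy()  # explicit device-to-host copy

y = DeviceArray([0, 1, 1, 0])

# sklearn's is_multilabel() effectively does np.asarray(y), which triggers
# __array__ and raises, just like in the GridSearchCV.fit() traceback above.
try:
    np.asarray(y)
    failed = False
except TypeError:
    failed = True
print("implicit conversion raised:", failed)

# The explicit path suggested by the error message works fine.
host = y.get()
print(host.tolist())
```

This is why the fix likely lies outside dask-sql: sklearn would need to take the explicit `.get()` path (or the GPU libraries would need to allow the implicit one) before a cupy-backed Dask array can flow through check_cv.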

sarahyurick (Collaborator Author):

I could open some issues to try to get experiment_class working on the GPU. From what I've tried so far, I think the fixes would lie on the Dask and/or cuML side of things.

Collaborator:

We can open an issue to track this and follow up in a future PR.

sarahyurick (Collaborator Author):

Opened #943

@sarahyurick sarahyurick marked this pull request as ready for review November 16, 2022 21:53
ayushdg (Collaborator) left a comment:

Minor suggestions but things generally look good


tests/integration/test_model.py (outdated, resolved)
try:
prediction = model.predict(df[training_columns])
predicted_df = df.assign(target=prediction)
except TypeError:
Collaborator:

In what scenarios do we hit this case?

sarahyurick (Collaborator Author):

This is to allow test_clustering_and_prediction to pass. It would otherwise error with TypeError: Column assignment doesn't support type numpy.ndarray, because sklearn returns the list of clusters as a NumPy array (like array([1, 3, 4, 7, 7, 2, 0, 7, 5, 1, 2, 0, 1, 6, 0, 3, 6, 6, 4, 7, 0, 3, 2, 0, 3, 4, 5, 4, 1, 0], dtype=int32)), but Dask does not support column assignment with this datatype. So we have to convert it to a Series before assignment.

I think it should be OK to use Pandas in the except block because this should only happen in the CPU case with sklearn. As an example, test_gpu_clustering_and_prediction uses cuml.dask and doesn't need to go into the except block.
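The fallback can be sketched in isolation with pandas and NumPy; the frame and label array below are toy stand-ins for the test fixtures:

```python
import numpy as np
import pandas as pd

# Toy stand-ins: a small feature frame and the NumPy label array that
# sklearn's clustering predict() returns.
df = pd.DataFrame({"x": [0.1, -0.5, 0.3], "y": [1.2, 0.4, -0.7]})
prediction = np.array([1, 0, 1], dtype=np.int32)

# A Dask DataFrame rejects assignment of a raw NumPy array, so the except
# branch wraps the labels in a Series aligned to the frame's index first.
predicted_df = df.assign(target=pd.Series(prediction, index=df.index))
print(predicted_df["target"].tolist())  # [1, 0, 1]
```

With a Dask-backed frame the same assign works once the array is wrapped in a Series (and the Series is converted to a partitioned one), which is what the except block in predict.py does.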

Collaborator:

That makes sense. So this is specifically for the case where predict returns a numpy array instead of a Dask Array.

ayushdg (Collaborator) left a comment:

lgtm! Thanks for the work here @sarahyurick

@ayushdg ayushdg merged commit f2913c8 into dask-contrib:main Nov 29, 2022
@sarahyurick sarahyurick deleted the remove_daskml branch May 26, 2023 22:17
Successfully merging this pull request may close these issues.

Remove all Dask-ML dependencies and references