
Remove all Dask-ML uses #886

Merged: 13 commits (branch remove_daskml) into dask-contrib:main on Nov 29, 2022

Conversation

sarahyurick (Collaborator):

Closes #853

Depends on #832 and #855


codecov-commenter commented Oct 24, 2022

Codecov Report

Merging #886 (670aea0) into main (161e276) will increase coverage by 0.17%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #886      +/-   ##
==========================================
+ Coverage   75.39%   75.56%   +0.17%     
==========================================
  Files          73       73              
  Lines        4011     4019       +8     
  Branches      721      715       -6     
==========================================
+ Hits         3024     3037      +13     
+ Misses        826      818       -8     
- Partials      161      164       +3     
Impacted Files                              Coverage Δ
dask_sql/physical/rel/custom/predict.py     91.66% <100.00%> (+2.38%) ⬆️
dask_sql/_version.py                        35.31% <0.00%> (+1.41%) ⬆️


@sarahyurick sarahyurick changed the title [BLOCKED] Remove all Dask-ML uses Remove all Dask-ML uses Nov 14, 2022
@@ -918,7 +916,6 @@ def test_experiment_automl_regressor(c, client, training_df):
generations=2,
cv=2,
n_jobs=-1,
use_dask=True,
sarahyurick (Collaborator Author):

use_dask requires Dask-ML to be installed.


sarahyurick commented Nov 16, 2022

TODO: Investigate GPU cases and add tests accordingly.

@ayushdg ayushdg requested a review from VibhuJawa November 16, 2022 18:42
@@ -794,7 +813,7 @@ def test_ml_experiment(c, client, training_df):
"""
CREATE EXPERIMENT my_exp WITH (
model_class = 'sklearn.ensemble.GradientBoostingClassifier',
experiment_class = 'dask_ml.model_selection.GridSearchCV',
experiment_class = 'sklearn.model_selection.GridSearchCV',
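After this change the CPU path resolves the experiment_class string to sklearn's GridSearchCV. A rough standalone sketch of what the updated test exercises, with toy data standing in for the timeseries fixture (dask-sql imports the class by its dotted name and calls fit() on it):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Toy stand-in for the `timeseries` fixture: x, y columns and x*y > 0 AS target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# With sklearn's GridSearchCV as experiment_class, the fit dask-sql runs
# is equivalent to this (a reduced grid is used here to keep it quick).
search = GridSearchCV(
    GradientBoostingClassifier(),
    {"n_estimators": [16, 32], "learning_rate": [0.1, 0.01]},
    cv=2,
)
search.fit(X, y)
print(search.best_params_)
```

The resulting cv_results_ dict is what dask-sql turns into the experiment output table (df = pd.DataFrame(search.cv_results_) in create_experiment.py).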
sarahyurick (Collaborator Author):

I can't really get experiment_class to work on the GPU. For example:

c.sql(
    """
    CREATE EXPERIMENT my_exp WITH (
    model_class = 'sklearn.ensemble.GradientBoostingClassifier',
    experiment_class = 'cuml.model_selection.GridSearchCV',
    tune_parameters = (n_estimators = ARRAY [16, 32, 2],learning_rate = ARRAY [0.1,0.01,0.001],
                       max_depth = ARRAY [3,4,5,10]),
    target_column = 'target'
) AS (
        SELECT x, y, x*y > 0 AS target
        FROM timeseries
        LIMIT 100
    )
    """
)

errors with:

INFO:dask_sql.physical.rel.custom.create_experiment:{'n_estimators': [16, 32, 2], 'learning_rate': [0.1, 0.01, 0.001], 'max_depth': [3, 4, 5, 10]}
INFO:dask_sql.physical.rel.custom.create_experiment:{}
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [8], line 1
----> 1 c.sql(
      2     """
      3     CREATE EXPERIMENT my_exp WITH (
      4     model_class = 'sklearn.ensemble.GradientBoostingClassifier',
      5     experiment_class = 'cuml.model_selection.GridSearchCV',
      6     tune_parameters = (n_estimators = ARRAY [16, 32, 2],learning_rate = ARRAY [0.1,0.01,0.001],
      7                        max_depth = ARRAY [3,4,5,10]),
      8     target_column = 'target'
      9 ) AS (
     10         SELECT x, y, x*y > 0 AS target
     11         FROM timeseries
     12         LIMIT 100
     13     )
     14     """
     15 )

File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask_sql/context.py:501, in Context.sql(self, sql, return_futures, dataframes, gpu, config_options)
    496 else:
    497     raise RuntimeError(
    498         f"Encountered unsupported `LogicalPlan` sql type: {type(sql)}"
    499     )
--> 501 return self._compute_table_from_rel(rel, return_futures)

File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask_sql/context.py:830, in Context._compute_table_from_rel(self, rel, return_futures)
    829 def _compute_table_from_rel(self, rel: "LogicalPlan", return_futures: bool = True):
--> 830     dc = RelConverter.convert(rel, context=self)
    832     # Optimization might remove some alias projects. Make sure to keep them here.
    833     select_names = [field for field in rel.getRowType().getFieldList()]

File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask_sql/physical/rel/convert.py:61, in RelConverter.convert(cls, rel, context)
     55     raise NotImplementedError(
     56         f"No relational conversion for node type {node_type} available (yet)."
     57     )
     58 logger.debug(
     59     f"Processing REL {rel} using {plugin_instance.__class__.__name__}..."
     60 )
---> 61 df = plugin_instance.convert(rel, context=context)
     62 logger.debug(f"Processed REL {rel} into {LoggableDataFrame(df)}")
     63 return df

File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask_sql/physical/rel/custom/create_experiment.py:169, in CreateExperimentPlugin.convert(self, rel, context)
    167 search = ExperimentClass(model, {**parameters}, **experiment_kwargs)
    168 logger.info(tune_fit_kwargs)
--> 169 search.fit(
    170     X.to_dask_array(lengths=True),
    171     y.to_dask_array(lengths=True),
    172     **tune_fit_kwargs,
    173 )
    174 df = pd.DataFrame(search.cv_results_)
    175 df["model_class"] = model_class

File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/sklearn/model_selection/_search.py:786, in BaseSearchCV.fit(self, X, y, groups, **fit_params)
    783 X, y, groups = indexable(X, y, groups)
    784 fit_params = _check_fit_params(X, fit_params)
--> 786 cv_orig = check_cv(self.cv, y, classifier=is_classifier(estimator))
    787 n_splits = cv_orig.get_n_splits(X, y, groups)
    789 base_estimator = clone(self.estimator)

File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/sklearn/model_selection/_split.py:2331, in check_cv(cv, y, classifier)
   2326 cv = 5 if cv is None else cv
   2327 if isinstance(cv, numbers.Integral):
   2328     if (
   2329         classifier
   2330         and (y is not None)
-> 2331         and (type_of_target(y, input_name="y") in ("binary", "multiclass"))
   2332     ):
   2333         return StratifiedKFold(cv)
   2334     else:

File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/sklearn/utils/multiclass.py:286, in type_of_target(y, input_name)
    283 if sparse_pandas:
    284     raise ValueError("y cannot be class 'SparseSeries' or 'SparseArray'")
--> 286 if is_multilabel(y):
    287     return "multilabel-indicator"
    289 # DeprecationWarning will be replaced by ValueError, see NEP 34
    290 # https://numpy.org/neps/nep-0034-infer-dtype-is-object.html

File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/sklearn/utils/multiclass.py:152, in is_multilabel(y)
    150 warnings.simplefilter("error", np.VisibleDeprecationWarning)
    151 try:
--> 152     y = np.asarray(y)
    153 except (np.VisibleDeprecationWarning, ValueError):
    154     # dtype=object should be provided explicitly for ragged arrays,
    155     # see NEP 34
    156     y = np.array(y, dtype=object)

File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask/array/core.py:1704, in Array.__array__(self, dtype, **kwargs)
   1702     x = x.astype(dtype)
   1703 if not isinstance(x, np.ndarray):
-> 1704     x = np.array(x)
   1705 return x

File cupy/_core/core.pyx:1473, in cupy._core.core._ndarray_base.__array__()

TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.

Using model_class = 'xgboost.XGBClassifier' or model_class = 'xgboost.dask.XGBClassifier' results in the same error as above.

When I try it with a model_class from cuML, more errors arise. For example, if I try it with model_class = 'cuml.dask.ensemble.RandomForestClassifier' (cuML has no GradientBoostingClassifier), sklearn raises a

TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator <cuml.dask.ensemble.randomforestclassifier.RandomForestClassifier object at 0x7f0c5f692820> does not.
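The implicit-conversion failure can be reproduced without a GPU. In the sketch below, a hypothetical DeviceArray class stands in for cupy's ndarray: like cupy, its __array__ refuses implicit host transfer, which is exactly what trips up sklearn's type_of_target/is_multilabel checks inside GridSearchCV.fit():

```python
import numpy as np

# Hypothetical stand-in for a cupy/GPU-backed array: like cupy.ndarray,
# it refuses implicit conversion to a host NumPy array.
class DeviceArray:
    def __init__(self, data):
        self._data = np.asarray(data)  # pretend this buffer lives on the GPU

    def __array__(self, dtype=None, copy=None):
        raise TypeError(
            "Implicit conversion to a NumPy array is not allowed. "
            "Please use `.get()` to construct a NumPy array explicitly."
        )

    def get(self):
        return self._data.copy()  # explicit device-to-host copy

y = DeviceArray([0, 1, 1, 0])

# sklearn's is_multilabel() effectively does np.asarray(y), which triggers
# __array__ and raises, just like in the GridSearchCV.fit() traceback above.
try:
    np.asarray(y)
    failed = False
except TypeError:
    failed = True
print("implicit conversion raised:", failed)

# The explicit path suggested by the error message works fine.
host = y.get()
print(host.tolist())
```

This is why the fix likely lies outside dask-sql: sklearn would need to take the explicit `.get()` path (or the GPU libraries would need to allow the implicit one) before a cupy-backed Dask array can flow through check_cv.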

sarahyurick (Collaborator Author):

I could open some issues to try to get experiment_class working on the GPU. From what I've tried so far, I think the fixes would lie on the Dask and/or cuML side of things.

Collaborator:

We can open an issue to track this and follow up in a future PR.

sarahyurick (Collaborator Author):

Opened #943

@sarahyurick sarahyurick marked this pull request as ready for review November 16, 2022 21:53
ayushdg (Collaborator) left a comment:

Minor suggestions but things generally look good


tests/integration/test_model.py (outdated, resolved)
try:
prediction = model.predict(df[training_columns])
predicted_df = df.assign(target=prediction)
except TypeError:
Collaborator:

In what scenarios do we hit this case?

sarahyurick (Collaborator Author):

This is to allow test_clustering_and_prediction to pass. It would otherwise error with TypeError: Column assignment doesn't support type numpy.ndarray, because sklearn returns the list of clusters as a NumPy array (like array([1, 3, 4, 7, 7, 2, 0, 7, 5, 1, 2, 0, 1, 6, 0, 3, 6, 6, 4, 7, 0, 3, 2, 0, 3, 4, 5, 4, 1, 0], dtype=int32)), but Dask does not support column assignment with this datatype. So we have to convert it to a Series before assignment.

I think it should be OK to use Pandas in the except block because this should only happen in the CPU case with sklearn. As an example, test_gpu_clustering_and_prediction uses cuml.dask and doesn't need to go into the except block.
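The fallback can be sketched in isolation with pandas and NumPy; the frame and label array below are toy stand-ins for the test fixtures:

```python
import numpy as np
import pandas as pd

# Toy stand-ins: a small feature frame and the NumPy label array that
# sklearn's clustering predict() returns.
df = pd.DataFrame({"x": [0.1, -0.5, 0.3], "y": [1.2, 0.4, -0.7]})
prediction = np.array([1, 0, 1], dtype=np.int32)

# A Dask DataFrame rejects assignment of a raw NumPy array, so the except
# branch wraps the labels in a Series aligned to the frame's index first.
predicted_df = df.assign(target=pd.Series(prediction, index=df.index))
print(predicted_df["target"].tolist())  # [1, 0, 1]
```

With a Dask-backed frame the same assign works once the array is wrapped in a Series (and the Series is converted to a partitioned one), which is what the except block in predict.py does.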

Collaborator:

That makes sense. So this is specifically for the case where predict returns a numpy array instead of a Dask Array.

ayushdg (Collaborator) left a comment:

lgtm! Thanks for the work here @sarahyurick

@ayushdg ayushdg merged commit f2913c8 into dask-contrib:main Nov 29, 2022
@sarahyurick sarahyurick deleted the remove_daskml branch May 26, 2023 22:17
Successfully merging this pull request may close these issues.

Remove all Dask-ML dependencies and references