Remove all Dask-ML uses #886
Conversation
Codecov Report
@@            Coverage Diff             @@
##             main     #886      +/-   ##
==========================================
+ Coverage   75.39%   75.56%   +0.17%
==========================================
  Files          73       73
  Lines        4011     4019       +8
  Branches      721      715       -6
==========================================
+ Hits         3024     3037      +13
+ Misses        826      818       -8
- Partials      161      164       +3
@@ -918,7 +916,6 @@ def test_experiment_automl_regressor(c, client, training_df):
         generations=2,
         cv=2,
         n_jobs=-1,
-        use_dask=True,
`use_dask` requires Dask-ML to be installed.
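For context, a rough sketch of what the AutoML test constructs after this change (assuming TPOT is installed; the snippet is illustrative, not the exact test code), since `use_dask=True` relies on Dask-ML being available:

from tpot import TPOTRegressor

# Illustrative TPOT setup mirroring the test kwargs, minus use_dask.
# use_dask=True requires Dask-ML, which is the dependency this PR removes.
automl = TPOTRegressor(
    generations=2,
    cv=2,
    n_jobs=-1,
    # use_dask=True,  # removed: requires Dask-ML to be installed
)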
@@ -794,7 +813,7 @@ def test_ml_experiment(c, client, training_df):
         """
         CREATE EXPERIMENT my_exp WITH (
             model_class = 'sklearn.ensemble.GradientBoostingClassifier',
-            experiment_class = 'dask_ml.model_selection.GridSearchCV',
+            experiment_class = 'sklearn.model_selection.GridSearchCV',
I can't really get `experiment_class` to work on the GPU. For example:
c.sql(
    """
    CREATE EXPERIMENT my_exp WITH (
        model_class = 'sklearn.ensemble.GradientBoostingClassifier',
        experiment_class = 'cuml.model_selection.GridSearchCV',
        tune_parameters = (n_estimators = ARRAY [16, 32, 2],
                           learning_rate = ARRAY [0.1, 0.01, 0.001],
                           max_depth = ARRAY [3, 4, 5, 10]),
        target_column = 'target'
    ) AS (
        SELECT x, y, x*y > 0 AS target
        FROM timeseries
        LIMIT 100
    )
    """
)
errors with:
INFO:dask_sql.physical.rel.custom.create_experiment:{'n_estimators': [16, 32, 2], 'learning_rate': [0.1, 0.01, 0.001], 'max_depth': [3, 4, 5, 10]}
INFO:dask_sql.physical.rel.custom.create_experiment:{}
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [8], line 1
----> 1 c.sql(
2 """
3 CREATE EXPERIMENT my_exp WITH (
4 model_class = 'sklearn.ensemble.GradientBoostingClassifier',
5 experiment_class = 'cuml.model_selection.GridSearchCV',
6 tune_parameters = (n_estimators = ARRAY [16, 32, 2],learning_rate = ARRAY [0.1,0.01,0.001],
7 max_depth = ARRAY [3,4,5,10]),
8 target_column = 'target'
9 ) AS (
10 SELECT x, y, x*y > 0 AS target
11 FROM timeseries
12 LIMIT 100
13 )
14 """
15 )
File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask_sql/context.py:501, in Context.sql(self, sql, return_futures, dataframes, gpu, config_options)
496 else:
497 raise RuntimeError(
498 f"Encountered unsupported `LogicalPlan` sql type: {type(sql)}"
499 )
--> 501 return self._compute_table_from_rel(rel, return_futures)
File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask_sql/context.py:830, in Context._compute_table_from_rel(self, rel, return_futures)
829 def _compute_table_from_rel(self, rel: "LogicalPlan", return_futures: bool = True):
--> 830 dc = RelConverter.convert(rel, context=self)
832 # Optimization might remove some alias projects. Make sure to keep them here.
833 select_names = [field for field in rel.getRowType().getFieldList()]
File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask_sql/physical/rel/convert.py:61, in RelConverter.convert(cls, rel, context)
55 raise NotImplementedError(
56 f"No relational conversion for node type {node_type} available (yet)."
57 )
58 logger.debug(
59 f"Processing REL {rel} using {plugin_instance.__class__.__name__}..."
60 )
---> 61 df = plugin_instance.convert(rel, context=context)
62 logger.debug(f"Processed REL {rel} into {LoggableDataFrame(df)}")
63 return df
File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask_sql/physical/rel/custom/create_experiment.py:169, in CreateExperimentPlugin.convert(self, rel, context)
167 search = ExperimentClass(model, {**parameters}, **experiment_kwargs)
168 logger.info(tune_fit_kwargs)
--> 169 search.fit(
170 X.to_dask_array(lengths=True),
171 y.to_dask_array(lengths=True),
172 **tune_fit_kwargs,
173 )
174 df = pd.DataFrame(search.cv_results_)
175 df["model_class"] = model_class
File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/sklearn/model_selection/_search.py:786, in BaseSearchCV.fit(self, X, y, groups, **fit_params)
783 X, y, groups = indexable(X, y, groups)
784 fit_params = _check_fit_params(X, fit_params)
--> 786 cv_orig = check_cv(self.cv, y, classifier=is_classifier(estimator))
787 n_splits = cv_orig.get_n_splits(X, y, groups)
789 base_estimator = clone(self.estimator)
File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/sklearn/model_selection/_split.py:2331, in check_cv(cv, y, classifier)
2326 cv = 5 if cv is None else cv
2327 if isinstance(cv, numbers.Integral):
2328 if (
2329 classifier
2330 and (y is not None)
-> 2331 and (type_of_target(y, input_name="y") in ("binary", "multiclass"))
2332 ):
2333 return StratifiedKFold(cv)
2334 else:
File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/sklearn/utils/multiclass.py:286, in type_of_target(y, input_name)
283 if sparse_pandas:
284 raise ValueError("y cannot be class 'SparseSeries' or 'SparseArray'")
--> 286 if is_multilabel(y):
287 return "multilabel-indicator"
289 # DeprecationWarning will be replaced by ValueError, see NEP 34
290 # https://numpy.org/neps/nep-0034-infer-dtype-is-object.html
File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/sklearn/utils/multiclass.py:152, in is_multilabel(y)
150 warnings.simplefilter("error", np.VisibleDeprecationWarning)
151 try:
--> 152 y = np.asarray(y)
153 except (np.VisibleDeprecationWarning, ValueError):
154 # dtype=object should be provided explicitly for ragged arrays,
155 # see NEP 34
156 y = np.array(y, dtype=object)
File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask/array/core.py:1704, in Array.__array__(self, dtype, **kwargs)
1702 x = x.astype(dtype)
1703 if not isinstance(x, np.ndarray):
-> 1704 x = np.array(x)
1705 return x
File cupy/_core/core.pyx:1473, in cupy._core.core._ndarray_base.__array__()
TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.
Using `model_class = 'xgboost.XGBClassifier'` or `model_class = 'xgboost.dask.XGBClassifier'` results in the same error as above.

When I try it with a `model_class` from cuML, more errors arise. For example, if I try it with `model_class = 'cuml.dask.ensemble.RandomForestClassifier'` (cuML has no `GradientBoostingClassifier`), sklearn raises a

TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator <cuml.dask.ensemble.randomforestclassifier.RandomForestClassifier object at 0x7f0c5f692820> does not.
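For reference, the earlier conversion error can be reproduced outside of dask-sql with a cupy-backed Dask array (a minimal sketch, assuming a GPU with cupy and dask installed; this is not dask-sql code). sklearn's `type_of_target` ends up calling `np.asarray` on the labels, and cupy refuses the implicit device-to-host copy:

import cupy as cp
import dask.array as da
import numpy as np

# A cupy-backed Dask array, similar to what GridSearchCV receives on GPU.
y = da.from_array(cp.asarray([0, 1, 0, 1]), chunks=2)

try:
    np.asarray(y)  # roughly what sklearn's is_multilabel()/type_of_target() does
except TypeError as err:
    print(err)  # "Implicit conversion to a NumPy array is not allowed. ..."

# The explicit route works, but sklearn never does this on our behalf:
y_host = cp.asnumpy(y.compute())
print(y_host)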
I could open some issues to try and get `experiment_class` working on the GPU. From what I've tried so far, I think the fixes would lie on the Dask and/or cuML side of things.
We can open up an issue to track this and follow up in a future PR.
Opened #943
Minor suggestions, but things generally look good.
try:
    prediction = model.predict(df[training_columns])
    predicted_df = df.assign(target=prediction)
except TypeError:
In what scenarios do we hit this case?
This is to allow `test_clustering_and_prediction` to pass. It would error with `TypeError: Column assignment doesn't support type numpy.ndarray` because sklearn returns the list of clusters as a NumPy array (like `array([1, 3, 4, 7, 7, 2, 0, 7, 5, 1, 2, 0, 1, 6, 0, 3, 6, 6, 4, 7, 0, 3, 2, 0, 3, 4, 5, 4, 1, 0], dtype=int32)`), but Dask does not support column assignment with this datatype, so we have to convert it to a Dask Series before assignment.

I think it should be OK to use pandas in the `except` block because this should only happen in the CPU case with sklearn. As an example, `test_gpu_clustering_and_prediction` uses `cuml.dask` and doesn't need to go into the `except` block.
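A self-contained sketch of the conversion being described (the DataFrame, the KMeans model, and the column names are made up for illustration; this is not the exact dask-sql code path):

import dask.dataframe as dd
import pandas as pd
from sklearn.cluster import KMeans

pdf = pd.DataFrame({"x": [0.0, 1.0, 2.0, 10.0, 11.0, 12.0],
                    "y": [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]})
df = dd.from_pandas(pdf, npartitions=2)

model = KMeans(n_clusters=2, n_init=10).fit(pdf.to_numpy())
prediction = model.predict(pdf.to_numpy())  # sklearn returns a plain NumPy array

try:
    predicted_df = df.assign(target=prediction)
except TypeError:
    # Dask may reject a raw NumPy array ("Column assignment doesn't support
    # type numpy.ndarray"), so wrap it in a pandas Series and convert it to a
    # Dask Series with matching partitions before assigning.
    target = dd.from_pandas(pd.Series(prediction), npartitions=df.npartitions)
    predicted_df = df.assign(target=target)

print(predicted_df.compute())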
That makes sense. So this is specifically for the case where `predict` returns a NumPy array instead of a Dask Array.
lgtm! Thanks for the work here @sarahyurick
Closes #853
Depends on #832 and #855