In #886, we removed all dependencies on Dask-ML in favor of scikit-learn, cuML, and our own classes (ParallelPostFit and Incremental). Previously, when creating an experiment, experiment_class was expected to be a path to a dask_ml class, but sklearn classes were also found to be compatible. However, I couldn't get it to work with cuML classes such as cuml.model_selection.GridSearchCV. For example:
c.sql(
    """
    CREATE EXPERIMENT my_exp WITH (
        model_class = 'sklearn.ensemble.GradientBoostingClassifier',
        experiment_class = 'cuml.model_selection.GridSearchCV',
        tune_parameters = (
            n_estimators = ARRAY [16, 32, 2],
            learning_rate = ARRAY [0.1, 0.01, 0.001],
            max_depth = ARRAY [3, 4, 5, 10]
        ),
        target_column = 'target'
    ) AS (
        SELECT x, y, x*y > 0 AS target
        FROM timeseries
        LIMIT 100
    )
    """
)
This errors with:
INFO:dask_sql.physical.rel.custom.create_experiment:{'n_estimators': [16, 32, 2], 'learning_rate': [0.1, 0.01, 0.001], 'max_depth': [3, 4, 5, 10]}
INFO:dask_sql.physical.rel.custom.create_experiment:{}
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [8], line 1
----> 1 c.sql(
2 """
3 CREATE EXPERIMENT my_exp WITH (
4 model_class = 'sklearn.ensemble.GradientBoostingClassifier',
5 experiment_class = 'cuml.model_selection.GridSearchCV',
6 tune_parameters = (n_estimators = ARRAY [16, 32, 2],learning_rate = ARRAY [0.1,0.01,0.001],
7 max_depth = ARRAY [3,4,5,10]),
8 target_column = 'target'
9 ) AS (
10 SELECT x, y, x*y > 0 AS target
11 FROM timeseries
12 LIMIT 100
13 )
14 """
15 )
File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask_sql/context.py:501, in Context.sql(self, sql, return_futures, dataframes, gpu, config_options)
496 else:
497 raise RuntimeError(
498 f"Encountered unsupported `LogicalPlan` sql type: {type(sql)}"
499 )
--> 501 return self._compute_table_from_rel(rel, return_futures)
File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask_sql/context.py:830, in Context._compute_table_from_rel(self, rel, return_futures)
829 def _compute_table_from_rel(self, rel: "LogicalPlan", return_futures: bool = True):
--> 830 dc = RelConverter.convert(rel, context=self)
832 # Optimization might remove some alias projects. Make sure to keep them here.
833 select_names = [field for field in rel.getRowType().getFieldList()]
File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask_sql/physical/rel/convert.py:61, in RelConverter.convert(cls, rel, context)
55 raise NotImplementedError(
56 f"No relational conversion for node type {node_type} available (yet)."
57 )
58 logger.debug(
59 f"Processing REL {rel} using {plugin_instance.__class__.__name__}..."
60 )
---> 61 df = plugin_instance.convert(rel, context=context)
62 logger.debug(f"Processed REL {rel} into {LoggableDataFrame(df)}")
63 return df
File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask_sql/physical/rel/custom/create_experiment.py:169, in CreateExperimentPlugin.convert(self, rel, context)
167 search = ExperimentClass(model, {**parameters}, **experiment_kwargs)
168 logger.info(tune_fit_kwargs)
--> 169 search.fit(
170 X.to_dask_array(lengths=True),
171 y.to_dask_array(lengths=True),
172 **tune_fit_kwargs,
173 )
174 df = pd.DataFrame(search.cv_results_)
175 df["model_class"] = model_class
File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/sklearn/model_selection/_search.py:786, in BaseSearchCV.fit(self, X, y, groups, **fit_params)
783 X, y, groups = indexable(X, y, groups)
784 fit_params = _check_fit_params(X, fit_params)
--> 786 cv_orig = check_cv(self.cv, y, classifier=is_classifier(estimator))
787 n_splits = cv_orig.get_n_splits(X, y, groups)
789 base_estimator = clone(self.estimator)
File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/sklearn/model_selection/_split.py:2331, in check_cv(cv, y, classifier)
2326 cv = 5 if cv is None else cv
2327 if isinstance(cv, numbers.Integral):
2328 if (
2329 classifier
2330 and (y is not None)
-> 2331 and (type_of_target(y, input_name="y") in ("binary", "multiclass"))
2332 ):
2333 return StratifiedKFold(cv)
2334 else:
File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/sklearn/utils/multiclass.py:286, in type_of_target(y, input_name)
283 if sparse_pandas:
284 raise ValueError("y cannot be class 'SparseSeries' or 'SparseArray'")
--> 286 if is_multilabel(y):
287 return "multilabel-indicator"
289 # DeprecationWarning will be replaced by ValueError, see NEP 34
290 # https://numpy.org/neps/nep-0034-infer-dtype-is-object.html
File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/sklearn/utils/multiclass.py:152, in is_multilabel(y)
150 warnings.simplefilter("error", np.VisibleDeprecationWarning)
151 try:
--> 152 y = np.asarray(y)
153 except (np.VisibleDeprecationWarning, ValueError):
154 # dtype=object should be provided explicitly for ragged arrays,
155 # see NEP 34
156 y = np.array(y, dtype=object)
File ~/miniconda3/envs/dsql_rapids-22.12/lib/python3.9/site-packages/dask/array/core.py:1704, in Array.__array__(self, dtype, **kwargs)
1702 x = x.astype(dtype)
1703 if not isinstance(x, np.ndarray):
-> 1704 x = np.array(x)
1705 return x
File cupy/_core/core.pyx:1473, in cupy._core.core._ndarray_base.__array__()
TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly.
Using model_class = 'xgboost.XGBClassifier' or model_class = 'xgboost.dask.XGBClassifier' results in the same error as above.
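The final frame of the traceback shows CuPy refusing an implicit `np.array(x)` conversion and asking for an explicit `.get()`. The sketch below illustrates only that distinction, using a hypothetical `FakeDeviceArray` stand-in so it runs without a GPU. The helper `to_host` is also an assumption made for illustration, not something Dask-SQL, Dask, or cuML provides. Where such a conversion would belong in the real call chain is exactly the open question in this issue.

```python
class FakeDeviceArray:
    """Hypothetical stand-in for a GPU array (e.g. a CuPy ndarray)."""

    # CuPy arrays expose this attribute; here it is only a marker.
    __cuda_array_interface__ = {"version": 3}

    def __init__(self, values):
        self._values = list(values)

    def get(self):
        # Explicit device-to-host copy (a plain list in this sketch).
        return list(self._values)

    def __array__(self, dtype=None):
        # Mirrors the behaviour seen in the traceback: implicit conversion fails.
        raise TypeError(
            "Implicit conversion to a NumPy array is not allowed. "
            "Please use `.get()` to construct a NumPy array explicitly."
        )


def to_host(obj):
    """Return a host-side copy of device arrays; pass other objects through.

    Hypothetical helper: it treats an object as a device array only if it
    carries `__cuda_array_interface__` and a callable `.get()`.
    """
    is_device = hasattr(obj, "__cuda_array_interface__")
    if is_device and callable(getattr(obj, "get", None)):
        return obj.get()
    return obj


y_device = FakeDeviceArray([1, 0, 1])
y_host = to_host(y_device)   # explicit copy, no implicit __array__ call
unchanged = to_host([1, 2])  # host data is returned as-is
```

Note that in the failing call the target is a Dask array whose chunks are CuPy arrays, so a real fix would have to apply the conversion per chunk rather than to a single array as shown here.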
When I try it with a model_class from cuML, more errors arise. For example, if I try it with model_class = 'cuml.dask.ensemble.RandomForestClassifier' (cuML has no GradientBoostingClassifier), scikit-learn raises:
TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator <cuml.dask.ensemble.randomforestclassifier.RandomForestClassifier object at 0x7f0c5f692820> does not.
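The error above comes from scikit-learn requiring either an explicit `scoring` argument or an estimator with a `score` method. The traceback shows `experiment_kwargs` being forwarded to the experiment class constructor, so one conceivable workaround is to inject a fallback `scoring` callable when the estimator lacks `score`. The sketch below is untested against cuML. `with_fallback_scoring`, `NoScoreEstimator`, and `accuracy_scorer` are hypothetical names, and the scorer assumes host-side, list-like predictions, which would not hold for `cuml.dask` estimators without further conversion.

```python
def has_score_method(estimator):
    """True if the estimator exposes a callable `score` attribute."""
    return callable(getattr(estimator, "score", None))


def with_fallback_scoring(estimator, experiment_kwargs, fallback_scoring):
    """Return a copy of experiment_kwargs with `scoring` filled in if needed.

    Hypothetical helper: an explicit user-supplied `scoring` always wins.
    """
    kwargs = dict(experiment_kwargs)
    if kwargs.get("scoring") is None and not has_score_method(estimator):
        kwargs["scoring"] = fallback_scoring
    return kwargs


class NoScoreEstimator:
    """Hypothetical estimator with fit/predict but no score method."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [0 for _ in X]


def accuracy_scorer(estimator, X, y):
    """Scorer with the (estimator, X, y) signature scikit-learn accepts."""
    predictions = estimator.predict(X)
    return sum(p == t for p, t in zip(predictions, y)) / len(y)


experiment_kwargs = with_fallback_scoring(NoScoreEstimator(), {}, accuracy_scorer)
```

Even with `scoring` supplied, the implicit-conversion error shown earlier may still occur, since it is raised during the cross-validation setup before any scoring takes place.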
I tried a couple of different changes on the Dask-SQL side but have yet to find a solution. It's possible that this will require changes on the Dask and/or cuML side of things.
After some investigation, it seems like the issue runs pretty deep. Assuming that we can make the necessary changes on the scikit-learn side, quite a few errors still pop up on the Dask and cuML sides as well.