Commit: Docs on GAFeatureSelectionCV

rodrigo-arenas committed Nov 17, 2021
1 parent d39d3c0 commit 8dcc943
Showing 13 changed files with 517 additions and 21 deletions.
55 changes: 50 additions & 5 deletions README.rst
@@ -25,9 +25,10 @@
Sklearn-genetic-opt
###################

scikit-learn models hyperparameters tuning, using evolutionary algorithms.
scikit-learn models hyperparameters tuning and feature selection, using evolutionary algorithms.

This is meant to be an alternative from popular methods inside scikit-learn such as Grid Search and Randomized Grid Search.
This is meant to be an alternative to popular methods inside scikit-learn, such as Grid Search and Randomized Grid Search
for hyperparameters tuning, and RFE and SelectFromModel for feature selection.

Sklearn-genetic-opt uses evolutionary algorithms from the DEAP package to choose the set of hyperparameters that
optimizes (max or min) the cross-validation scores; it can be used for both regression and classification problems.
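
For example, the hyperparameters space is declared with the package's space classes;
a short sketch (the parameter names here are arbitrary examples, not a fixed schema):

.. code-block:: python

    from sklearn_genetic.space import Categorical, Integer, Continuous

    # Each hyperparameter gets a distribution or a set of categories to sample from
    param_grid = {
        'min_weight_fraction_leaf': Continuous(0.01, 0.5, distribution='log-uniform'),
        'bootstrap': Categorical([True, False]),
        'max_depth': Integer(2, 30),
    }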
@@ -37,7 +38,8 @@ Documentation is available `here <https://sklearn-genetic-opt.readthedocs.io/>`_
Main Features:
##############

* **GASearchCV**: Principal class of the package, holds the evolutionary cross-validation optimization routine.
* **GASearchCV**: Main class of the package for hyperparameters tuning; holds the evolutionary cross-validation optimization routine.
* **GAFeatureSelectionCV**: Main class of the package for feature selection.
* **Algorithms**: Set of different evolutionary algorithms to use as an optimization procedure.
* **Callbacks**: Custom evaluation strategies to generate early stopping rules,
logging (into TensorBoard, .pkl files, etc) or your custom logic.
@@ -82,8 +84,8 @@ The only optional dependency that the last command does not install, it's Tensor
it is usually advised to look further which distribution works better for you.


Example
#######
Example: Hyperparameters Tuning
###############################

.. code-block:: python
@@ -134,6 +136,49 @@ Example
    print("Best k solutions: ", evolved_estimator.hof)

Example: Feature Selection
##########################

.. code:: python3

    import matplotlib.pyplot as plt
    from sklearn_genetic import GAFeatureSelectionCV
    from sklearn.model_selection import train_test_split, StratifiedKFold
    from sklearn.svm import SVC
    from sklearn.datasets import load_iris
    from sklearn.metrics import accuracy_score
    import numpy as np

    data = load_iris()
    X, y = data["data"], data["target"]

    # Add random non-important features
    noise = np.random.uniform(0, 10, size=(X.shape[0], 5))
    X = np.hstack((X, noise))

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

    clf = SVC(gamma='auto')

    evolved_estimator = GAFeatureSelectionCV(
        estimator=clf,
        scoring="accuracy",
        population_size=30,
        generations=20,
        n_jobs=-1)

    # Train and select the features
    evolved_estimator.fit(X_train, y_train)

    # Features selected by the algorithm
    features = evolved_estimator.best_features_
    print(features)

    # Predict only with the subset of selected features
    y_predict_ga = evolved_estimator.predict(X_test[:, features])
    print(accuracy_score(y_test, y_predict_ga))

Changelog
#########

Expand Down
23 changes: 23 additions & 0 deletions docs/api/featureselectioncv.rst
@@ -0,0 +1,23 @@

FeatureSelectionCV
------------------

.. currentmodule:: sklearn_genetic

.. autosummary::

   GAFeatureSelectionCV
   GAFeatureSelectionCV.decision_function
   GAFeatureSelectionCV.fit
   GAFeatureSelectionCV.get_params
   GAFeatureSelectionCV.inverse_transform
   GAFeatureSelectionCV.predict
   GAFeatureSelectionCV.predict_proba
   GAFeatureSelectionCV.score
   GAFeatureSelectionCV.score_samples
   GAFeatureSelectionCV.set_params
   GAFeatureSelectionCV.transform

.. autoclass:: sklearn_genetic.GAFeatureSelectionCV
   :members:
   :inherited-members:
   :exclude-members: evaluate, mutate, n_features_in_, classes_
   :undoc-members: True
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -55,7 +55,7 @@
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store", "**.ipynb_checkpoints"]

# -- Options for HTML output -------------------------------------------------

Binary file added docs/images/basic_usage_accuracy_6.PNG
Binary file added docs/images/basic_usage_fitness_plot_7.PNG
Binary file added docs/images/basic_usage_train_log_5.PNG
11 changes: 8 additions & 3 deletions docs/index.rst
@@ -5,10 +5,13 @@
sklearn-genetic-opt
===================
scikit-learn models hyperparameters tuning, using evolutionary algorithms.
##########################################################################
scikit-learn models hyperparameters tuning and feature selection,
using evolutionary algorithms.

This is meant to be an alternative from popular methods inside scikit-learn such as Grid Search and Randomized Grid Search.
#################################################################

This is meant to be an alternative to popular methods inside scikit-learn, such as Grid Search and Randomized Grid Search
for hyperparameters tuning, and RFE and SelectFromModel for feature selection.

Sklearn-genetic-opt uses evolutionary algorithms from the DEAP package to choose a set of hyperparameters
that optimizes (max or min) the cross-validation scores; it can be used for both regression and classification problems.
@@ -73,6 +76,7 @@ as it is usually advised to look further which distribution works better for you

notebooks/sklearn_comparison.ipynb
notebooks/Boston_Houses_decision_tree.ipynb
notebooks/Iris_feature_selection.ipynb
notebooks/Digits_decision_tree.ipynb
notebooks/MLflow_logger.ipynb

@@ -87,6 +91,7 @@ as it is usually advised to look further which distribution works better for you
:caption: API Reference:

api/gasearchcv
api/featureselectioncv
api/callbacks
api/plots
api/mlflow
260 changes: 260 additions & 0 deletions docs/notebooks/Iris_feature_selection.ipynb

Large diffs are not rendered by default.

24 changes: 24 additions & 0 deletions docs/release_notes.rst
@@ -3,6 +3,30 @@ Release Notes

Some notes on new features in various releases


What's new in 0.7.0dev0
-----------------------

This is the current in-development version; these features are not yet
available via PyPI.

^^^^^^^^^
Features:
^^^^^^^^^

* :class:`~sklearn_genetic.GAFeatureSelectionCV` for feature selection along
  with any scikit-learn classifier or regressor. It optimizes the cv-score
  while minimizing the number of features to select. The class is compatible
  with the MLflow and TensorBoard integrations, the callbacks, and the
  ``plot_fitness_evolution`` function; a minimal usage sketch is shown below.
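
  For instance, a minimal sketch (the ``SVC`` estimator and iris data are
  illustrative choices, mirroring the README example):

  .. code:: python3

     from sklearn.datasets import load_iris
     from sklearn.svm import SVC
     from sklearn_genetic import GAFeatureSelectionCV

     X, y = load_iris(return_X_y=True)

     # Evolve boolean feature masks that maximize the cv accuracy
     # while also favoring smaller feature subsets
     selector = GAFeatureSelectionCV(
         estimator=SVC(gamma="auto"),
         cv=3,
         scoring="accuracy",
         population_size=10,
         generations=5,
     )
     selector.fit(X, y)
     print(selector.best_features_)  # boolean mask over the columns of X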

^^^^^^^^^^^^
API Changes:
^^^^^^^^^^^^

* The module :mod:`~sklearn_genetic.mlflow` was renamed to :mod:`~sklearn_genetic.mlflow_log`
  to avoid unexpected errors in name resolution; update your imports as sketched below.
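
  For example (``MLflowConfig`` is shown as a typical import from that module;
  adjust to whatever you import in your own code):

  .. code:: python3

     # Before (<= 0.6.1):
     # from sklearn_genetic.mlflow import MLflowConfig

     # From 0.7.0dev0 on:
     from sklearn_genetic.mlflow_log import MLflowConfig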

What's new in 0.6.1
-------------------

Expand Down
112 changes: 106 additions & 6 deletions docs/tutorials/basic_usage.rst
@@ -6,7 +6,8 @@ How to Use Sklearn-genetic-opt
Introduction
------------

Sklearn-genetic-opt uses evolutionary algorithms to fine-tune scikit-learn machine learning algorithms.
Sklearn-genetic-opt uses evolutionary algorithms to fine-tune scikit-learn machine learning algorithms
and perform feature selection.
It is designed to accept a `scikit-learn <http://scikit-learn.org/stable/index.html>`__
regression or classification model (or a pipeline containing one of those).

@@ -23,8 +24,8 @@ Then by using evolutionary operators as the mating, mutation, selection and eval
it generates new candidates looking to improve the cross-validation score in each generation.
It'll continue with this process until a number of generations is reached or until a callback criterion is met.
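
As an illustration, a stopping rule can be attached through the callbacks API; a sketch
using ``ConsecutiveStopping`` (one of the package's callbacks) could look like this:

.. code:: python3

    from sklearn_genetic.callbacks import ConsecutiveStopping

    # Stop the evolution if 'fitness' has not improved
    # over 5 consecutive generations
    callback = ConsecutiveStopping(generations=5, metric='fitness')

    # The callback is passed at fit time, e.g.:
    # evolved_estimator.fit(X_train, y_train, callbacks=callback)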

Example
-------
Fine-tuning Example
-------------------

First let's import some dataset and other scikit-learn standard modules, we'll use
the `digits dataset <https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html>`__.
@@ -165,10 +166,109 @@ sklearn-genetic-opt comes with a plot function to analyze this log:
.. image:: ../images/basic_usage_plot_space_4.png

What this plot shows us, is the distributione of the sampled values for each hyperparameter.
What this plot shows us is the distribution of the sampled values for each hyperparameter.
We can see for example in the *'min_weight_fraction_leaf'* that the algorithm mostly sampled values below 0.15.
You can also check every single combination of variables and the contour plot that represents the sampled values.


Feature Selection Example
-------------------------

For this example, we are going to use the well-known Iris dataset; it's a classification problem with four features.
We are also going to simulate some random noise to represent non-important features:

.. code:: python3

    import matplotlib.pyplot as plt
    from sklearn_genetic import GAFeatureSelectionCV
    from sklearn_genetic.plots import plot_fitness_evolution
    from sklearn.model_selection import train_test_split, StratifiedKFold
    from sklearn.svm import SVC
    from sklearn.datasets import load_iris
    from sklearn.metrics import accuracy_score
    import numpy as np

    data = load_iris()
    X, y = data["data"], data["target"]

    # Add random non-important features
    noise = np.random.uniform(0, 10, size=(X.shape[0], 10))
    X = np.hstack((X, noise))

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

This should give us 10 extra noisy features with our train and test set.

Now we can create the GAFeatureSelectionCV object; it's very similar to GASearchCV, and they share
most of their parameters. The main difference is that GAFeatureSelectionCV doesn't run hyperparameters optimization,
so the param_grid parameter is not available, and the estimator should be defined with its hyperparameters already set.

Feature selection is performed by creating models over subsets of features and evaluating their
cv-scores; the subsets themselves are generated with the available evolutionary algorithms.
The routine also tries to minimize the number of selected features, so it's a multi-objective optimization.
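
Conceptually, each candidate solution is a boolean mask over the feature columns. A toy
sketch of how a single mask could be scored (plain scikit-learn for illustration;
this is not the library's internal code):

.. code:: python3

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # A candidate individual: one boolean flag per feature column
    rng = np.random.default_rng(0)
    mask = rng.random(X_train.shape[1]) > 0.5
    if not mask.any():
        mask[0] = True  # ensure at least one selected feature

    # Its fitness is based on the cv-score over the masked columns;
    # the evolutionary operators then mate and mutate the best masks
    score = cross_val_score(SVC(gamma="auto"), X_train[:, mask], y_train, cv=3).mean()
    print(mask.sum(), score)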

Let's create the feature selection object; the estimator we're going to use is an SVM:

.. code:: python3

    clf = SVC(gamma='auto')

    evolved_estimator = GAFeatureSelectionCV(
        estimator=clf,
        cv=3,
        scoring="accuracy",
        population_size=30,
        generations=20,
        n_jobs=-1,
        verbose=True,
        keep_top_k=2,
        elitism=True,
    )

We are ready to run the optimization routine:

.. code:: python3

    # Train and select the features
    evolved_estimator.fit(X_train, y_train)

During the training, the same log format is displayed as before:

.. image:: ../images/basic_usage_train_log_5.png

After fitting the model, we have some extra methods to use the model right away. By default it will use the best set of
features it found; remember that since the algorithm used only a subset of features, you have to select them from the
``X_test`` array, like this:

.. code:: python3

    features = evolved_estimator.best_features_

    # Predict only with the subset of selected features
    y_predict_ga = evolved_estimator.predict(X_test[:, features])
    accuracy = accuracy_score(y_test, y_predict_ga)

.. image:: ../images/basic_usage_accuracy_6.png

In this case, we got an accuracy score in the test set of 0.98.

Notice that ``best_features_`` is a vector of bool values; each
position represents the index of the feature (column), and the value indicates
whether that feature was selected (True) or not (False) by the algorithm.
In this example, the algorithm discarded all the noisy random variables we created
and selected the original variables.
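
If you need the column indices rather than the boolean mask, plain numpy
works (an illustrative snippet, not a library method):

.. code:: python3

    import numpy as np

    # Indices of the columns selected by the algorithm
    selected_idx = np.flatnonzero(evolved_estimator.best_features_)
    print(selected_idx)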

We can also plot the fitness evolution:

.. code:: python3

    from sklearn_genetic.plots import plot_fitness_evolution

    plot_fitness_evolution(evolved_estimator)
    plt.show()

.. image:: ../images/basic_usage_fitness_plot_7.png

This concludes our introduction to the basic sklearn-genetic-opt usage.
Further tutorials will cover the GASearchCV parameters, callbacks,
different optimization algorithms and more advanced use cases.
Further tutorials will cover the GASearchCV and GAFeatureSelectionCV parameters, callbacks,
different optimization algorithms and more advanced use cases.

2 changes: 1 addition & 1 deletion sklearn_genetic/_version.py
@@ -1 +1 @@
__version__ = "0.6.1"
__version__ = "0.7.0dev0"
8 changes: 8 additions & 0 deletions sklearn_genetic/plots.py
@@ -15,6 +15,7 @@
from .utils import logbook_to_pandas
from .parameters import Metrics
from .space import Categorical
from .genetic_search import GAFeatureSelectionCV

"""
This module contains some useful functions to explore the results of the optimization routines
@@ -75,6 +76,10 @@ def plot_search_space(estimator, height=2, s=25, features: list = None):
Pair plot of the used hyperparameters during the search
"""

if isinstance(estimator, GAFeatureSelectionCV):
    raise TypeError("Estimator must be a GASearchCV instance, not a GAFeatureSelectionCV instance")

sns.set_style("white")

df = logbook_to_pandas(estimator.logbook)
@@ -131,6 +136,9 @@ def plot_parallel_coordinates(estimator, features: list = None):
"""

if isinstance(estimator, GAFeatureSelectionCV):
    raise TypeError("Estimator must be a GASearchCV instance, not a GAFeatureSelectionCV instance")

df = logbook_to_pandas(estimator.logbook)
param_grid = estimator.space.param_grid
score = df["score"]