Commit: Docs on GAFeatureSelectionCV

rodrigo-arenas committed Nov 17, 2021
1 parent d39d3c0 commit 8dcc943
Showing 13 changed files with 517 additions and 21 deletions.
55 changes: 50 additions & 5 deletions README.rst
@@ -25,9 +25,10 @@
Sklearn-genetic-opt
###################

scikit-learn models hyperparameters tuning, using evolutionary algorithms.
scikit-learn models hyperparameters tuning and feature selection, using evolutionary algorithms.

This is meant to be an alternative from popular methods inside scikit-learn such as Grid Search and Randomized Grid Search.
This is meant to be an alternative to popular methods inside scikit-learn, such as Grid Search and Randomized Grid Search
for hyperparameters tuning, and RFE and SelectFromModel for feature selection.

Sklearn-genetic-opt uses evolutionary algorithms from the DEAP package to choose the set of hyperparameters that
optimizes (max or min) the cross-validation scores; it can be used for both regression and classification problems.
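
For example, the hyperparameters space is declared with the package's space classes;
a short sketch (the parameter names here are arbitrary examples, not a fixed schema):

.. code-block:: python

    from sklearn_genetic.space import Categorical, Integer, Continuous

    # Each hyperparameter gets a distribution or a set of categories to sample from
    param_grid = {
        'min_weight_fraction_leaf': Continuous(0.01, 0.5, distribution='log-uniform'),
        'bootstrap': Categorical([True, False]),
        'max_depth': Integer(2, 30),
    }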
@@ -37,7 +38,8 @@ Documentation is available `here <https://sklearn-genetic-opt.readthedocs.io/>`_
Main Features:
##############

* **GASearchCV**: Principal class of the package, holds the evolutionary cross-validation optimization routine.
* **GASearchCV**: Main class of the package for hyperparameters tuning; holds the evolutionary cross-validation optimization routine.
* **GAFeatureSelectionCV**: Main class of the package for feature selection.
* **Algorithms**: Set of different evolutionary algorithms to use as an optimization procedure.
* **Callbacks**: Custom evaluation strategies to generate early stopping rules,
logging (into TensorBoard, .pkl files, etc) or your custom logic.
@@ -82,8 +84,8 @@ The only optional dependency that the last command does not install, it's Tensor
it is usually advised to look further which distribution works better for you.


Example
#######
Example: Hyperparameters Tuning
###############################

.. code-block:: python
@@ -134,6 +136,49 @@ Example
    print("Best k solutions: ", evolved_estimator.hof)

Example: Feature Selection
##########################

.. code:: python3

    import matplotlib.pyplot as plt
    from sklearn_genetic import GAFeatureSelectionCV
    from sklearn.model_selection import train_test_split, StratifiedKFold
    from sklearn.svm import SVC
    from sklearn.datasets import load_iris
    from sklearn.metrics import accuracy_score
    import numpy as np

    data = load_iris()
    X, y = data["data"], data["target"]

    # Add random non-important features
    noise = np.random.uniform(0, 10, size=(X.shape[0], 5))
    X = np.hstack((X, noise))

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

    clf = SVC(gamma='auto')

    evolved_estimator = GAFeatureSelectionCV(
        estimator=clf,
        scoring="accuracy",
        population_size=30,
        generations=20,
        n_jobs=-1)

    # Train and select the features
    evolved_estimator.fit(X_train, y_train)

    # Features selected by the algorithm
    features = evolved_estimator.best_features_
    print(features)

    # Predict only with the subset of selected features
    y_predict_ga = evolved_estimator.predict(X_test[:, features])
    print(accuracy_score(y_test, y_predict_ga))

Changelog
#########

Expand Down
23 changes: 23 additions & 0 deletions docs/api/featureselectioncv.rst
@@ -0,0 +1,23 @@

FeatureSelectionCV
------------------

.. currentmodule:: sklearn_genetic

.. autosummary::

   GAFeatureSelectionCV
   GAFeatureSelectionCV.decision_function
   GAFeatureSelectionCV.fit
   GAFeatureSelectionCV.get_params
   GAFeatureSelectionCV.inverse_transform
   GAFeatureSelectionCV.predict
   GAFeatureSelectionCV.predict_proba
   GAFeatureSelectionCV.score
   GAFeatureSelectionCV.score_samples
   GAFeatureSelectionCV.set_params
   GAFeatureSelectionCV.transform

.. autoclass:: sklearn_genetic.GAFeatureSelectionCV
   :members:
   :inherited-members:
   :exclude-members: evaluate, mutate, n_features_in_, classes_
   :undoc-members: True
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -55,7 +55,7 @@
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store", "**.ipynb_checkpoints"]

# -- Options for HTML output -------------------------------------------------

Binary file added docs/images/basic_usage_accuracy_6.PNG
Binary file added docs/images/basic_usage_fitness_plot_7.PNG
Binary file added docs/images/basic_usage_train_log_5.PNG
11 changes: 8 additions & 3 deletions docs/index.rst
@@ -5,10 +5,13 @@
sklearn-genetic-opt
===================
scikit-learn models hyperparameters tuning, using evolutionary algorithms.
##########################################################################
scikit-learn models hyperparameters tuning and feature selection,
using evolutionary algorithms.

This is meant to be an alternative from popular methods inside scikit-learn such as Grid Search and Randomized Grid Search.
#################################################################

This is meant to be an alternative to popular methods inside scikit-learn, such as Grid Search and Randomized Grid Search
for hyperparameters tuning, and RFE and SelectFromModel for feature selection.

Sklearn-genetic-opt uses evolutionary algorithms from the DEAP package to choose a set of hyperparameters
that optimizes (max or min) the cross-validation scores; it can be used for both regression and classification problems.
@@ -73,6 +76,7 @@ as it is usually advised to look further which distribution works better for you

notebooks/sklearn_comparison.ipynb
notebooks/Boston_Houses_decision_tree.ipynb
notebooks/Iris_feature_selection.ipynb
notebooks/Digits_decision_tree.ipynb
notebooks/MLflow_logger.ipynb

@@ -87,6 +91,7 @@ as it is usually advised to look further which distribution works better for you
:caption: API Reference:

api/gasearchcv
api/featureselectioncv
api/callbacks
api/plots
api/mlflow
260 changes: 260 additions & 0 deletions docs/notebooks/Iris_feature_selection.ipynb

Large diffs are not rendered by default.

24 changes: 24 additions & 0 deletions docs/release_notes.rst
@@ -3,6 +3,30 @@ Release Notes

Some notes on new features in various releases


What's new in 0.7.0dev0
-----------------------

This is the current in-development version; these features are not yet
available via PyPI.

^^^^^^^^^
Features:
^^^^^^^^^

* :class:`~sklearn_genetic.GAFeatureSelectionCV` for feature selection along
  with any scikit-learn classifier or regressor. It optimizes the cv-score
  while minimizing the number of features to select. The class is compatible
  with the MLflow and TensorBoard integrations, the callbacks, and the
  ``plot_fitness_evolution`` function; a minimal usage sketch is shown below.
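
  For instance, a minimal sketch (the ``SVC`` estimator and iris data are
  illustrative choices, mirroring the README example):

  .. code:: python3

     from sklearn.datasets import load_iris
     from sklearn.svm import SVC
     from sklearn_genetic import GAFeatureSelectionCV

     X, y = load_iris(return_X_y=True)

     # Evolve boolean feature masks that maximize the cv accuracy
     # while also favoring smaller feature subsets
     selector = GAFeatureSelectionCV(
         estimator=SVC(gamma="auto"),
         cv=3,
         scoring="accuracy",
         population_size=10,
         generations=5,
     )
     selector.fit(X, y)
     print(selector.best_features_)  # boolean mask over the columns of X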

^^^^^^^^^^^^
API Changes:
^^^^^^^^^^^^

* The module :mod:`~sklearn_genetic.mlflow` was renamed to :mod:`~sklearn_genetic.mlflow_log`
  to avoid unexpected errors in name resolution; update your imports as sketched below.
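
  For example (``MLflowConfig`` is shown as a typical import from that module;
  adjust to whatever you import in your own code):

  .. code:: python3

     # Before (<= 0.6.1):
     # from sklearn_genetic.mlflow import MLflowConfig

     # From 0.7.0dev0 on:
     from sklearn_genetic.mlflow_log import MLflowConfig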

What's new in 0.6.1
-------------------

Expand Down
112 changes: 106 additions & 6 deletions docs/tutorials/basic_usage.rst
@@ -6,7 +6,8 @@ How to Use Sklearn-genetic-opt
Introduction
------------

Sklearn-genetic-opt uses evolutionary algorithms to fine-tune scikit-learn machine learning algorithms.
Sklearn-genetic-opt uses evolutionary algorithms to fine-tune scikit-learn machine learning algorithms
and perform feature selection.
It is designed to accept a `scikit-learn <http://scikit-learn.org/stable/index.html>`__
regression or classification model (or a pipeline containing one of those).

@@ -23,8 +24,8 @@ Then by using evolutionary operators as the mating, mutation, selection and eval
it generates new candidates looking to improve the cross-validation score in each generation.
It'll continue with this process until a number of generations is reached or until a callback criterion is met.
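
As an illustration, a stopping rule can be attached through the callbacks API; a sketch
using ``ConsecutiveStopping`` (one of the package's callbacks) could look like this:

.. code:: python3

    from sklearn_genetic.callbacks import ConsecutiveStopping

    # Stop the evolution if 'fitness' has not improved
    # over 5 consecutive generations
    callback = ConsecutiveStopping(generations=5, metric='fitness')

    # The callback is passed at fit time, e.g.:
    # evolved_estimator.fit(X_train, y_train, callbacks=callback)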

Example
-------
Fine-tuning Example
-------------------

First let's import some dataset and other scikit-learn standard modules, we'll use
the `digits dataset <https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html>`__.
@@ -165,10 +166,109 @@ sklearn-genetic-opt comes with a plot function to analyze this log:
.. image:: ../images/basic_usage_plot_space_4.png

What this plot shows us, is the distributione of the sampled values for each hyperparameter.
What this plot shows us is the distribution of the sampled values for each hyperparameter.
We can see for example in the *'min_weight_fraction_leaf'* that the algorithm mostly sampled values below 0.15.
You can also check every single combination of variables and the contour plot that represents the sampled values.


Feature Selection Example
-------------------------

For this example, we are going to use the well-known Iris dataset; it's a classification problem with four features.
We are also going to simulate some random noise to represent non-important features:

.. code:: python3

    import matplotlib.pyplot as plt
    from sklearn_genetic import GAFeatureSelectionCV
    from sklearn_genetic.plots import plot_fitness_evolution
    from sklearn.model_selection import train_test_split, StratifiedKFold
    from sklearn.svm import SVC
    from sklearn.datasets import load_iris
    from sklearn.metrics import accuracy_score
    import numpy as np

    data = load_iris()
    X, y = data["data"], data["target"]

    # Add random non-important features
    noise = np.random.uniform(0, 10, size=(X.shape[0], 10))
    X = np.hstack((X, noise))

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

This should give us 10 extra noisy features with our train and test set.

Now we can create the GAFeatureSelectionCV object; it's very similar to GASearchCV, and they share
most of their parameters. The main difference is that GAFeatureSelectionCV doesn't run hyperparameters optimization,
so the param_grid parameter is not available, and the estimator should be defined with its hyperparameters already set.

Feature selection is performed by creating models over subsets of features and evaluating their
cv-scores; the subsets themselves are generated with the available evolutionary algorithms.
The routine also tries to minimize the number of selected features, so it's a multi-objective optimization.
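
Conceptually, each candidate solution is a boolean mask over the feature columns. A toy
sketch of how a single mask could be scored (plain scikit-learn for illustration;
this is not the library's internal code):

.. code:: python3

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # A candidate individual: one boolean flag per feature column
    rng = np.random.default_rng(0)
    mask = rng.random(X_train.shape[1]) > 0.5
    if not mask.any():
        mask[0] = True  # ensure at least one selected feature

    # Its fitness is based on the cv-score over the masked columns;
    # the evolutionary operators then mate and mutate the best masks
    score = cross_val_score(SVC(gamma="auto"), X_train[:, mask], y_train, cv=3).mean()
    print(mask.sum(), score)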

Let's create the feature selection object; the estimator we're going to use is an SVM:

.. code:: python3

    clf = SVC(gamma='auto')

    evolved_estimator = GAFeatureSelectionCV(
        estimator=clf,
        cv=3,
        scoring="accuracy",
        population_size=30,
        generations=20,
        n_jobs=-1,
        verbose=True,
        keep_top_k=2,
        elitism=True,
    )

We are ready to run the optimization routine:

.. code:: python3

    # Train and select the features
    evolved_estimator.fit(X_train, y_train)

During the training, the same log format is displayed as before:

.. image:: ../images/basic_usage_train_log_5.png

After fitting the model, we have some extra methods to use the model right away. By default it will use the best set of
features it found; remember that since the algorithm used only a subset of features, you have to select them from the
``X_test`` array, like this:

.. code:: python3

    features = evolved_estimator.best_features_

    # Predict only with the subset of selected features
    y_predict_ga = evolved_estimator.predict(X_test[:, features])
    accuracy = accuracy_score(y_test, y_predict_ga)

.. image:: ../images/basic_usage_accuracy_6.png

In this case, we got an accuracy score in the test set of 0.98.

Notice that ``best_features_`` is a vector of bool values; each
position represents the index of the feature (column), and the value indicates
whether that feature was selected (True) or not (False) by the algorithm.
In this example, the algorithm discarded all the noisy random variables we created
and selected the original variables.
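
If you need the column indices rather than the boolean mask, plain numpy
works (an illustrative snippet, not a library method):

.. code:: python3

    import numpy as np

    # Indices of the columns selected by the algorithm
    selected_idx = np.flatnonzero(evolved_estimator.best_features_)
    print(selected_idx)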

We can also plot the fitness evolution:

.. code:: python3

    from sklearn_genetic.plots import plot_fitness_evolution

    plot_fitness_evolution(evolved_estimator)
    plt.show()

.. image:: ../images/basic_usage_fitness_plot_7.png

This concludes our introduction to the basic sklearn-genetic-opt usage.
Further tutorials will cover the GASearchCV parameters, callbacks,
different optimization algorithms and more advanced use cases.
Further tutorials will cover the GASearchCV and GAFeatureSelectionCV parameters, callbacks,
different optimization algorithms and more advanced use cases.

2 changes: 1 addition & 1 deletion sklearn_genetic/_version.py
@@ -1 +1 @@
__version__ = "0.6.1"
__version__ = "0.7.0dev0"
8 changes: 8 additions & 0 deletions sklearn_genetic/plots.py
@@ -15,6 +15,7 @@
from .utils import logbook_to_pandas
from .parameters import Metrics
from .space import Categorical
from .genetic_search import GAFeatureSelectionCV

"""
This module contains some useful functions to explore the results of the optimization routines
@@ -75,6 +76,10 @@ def plot_search_space(estimator, height=2, s=25, features: list = None):
Pair plot of the used hyperparameters during the search
"""

if isinstance(estimator, GAFeatureSelectionCV):
    raise TypeError("Estimator must be a GASearchCV instance, not a GAFeatureSelectionCV instance")

sns.set_style("white")

df = logbook_to_pandas(estimator.logbook)
@@ -131,6 +136,9 @@ def plot_parallel_coordinates(estimator, features: list = None):
"""

if isinstance(estimator, GAFeatureSelectionCV):
    raise TypeError("Estimator must be a GASearchCV instance, not a GAFeatureSelectionCV instance")

df = logbook_to_pandas(estimator.logbook)
param_grid = estimator.space.param_grid
score = df["score"]