Commit

[ci skip] ENH Convert some of the Wrap-up M4 content into exercise (#731)
ArturoAmorQ committed Oct 27, 2023
1 parent e0570d1 commit adf38a8
Showing 205 changed files with 3,560 additions and 1,627 deletions.
(Several of the changed files are binary or could not be displayed in the diff view.)
127 changes: 88 additions & 39 deletions _sources/python_scripts/linear_models_ex_03.py
@@ -14,69 +14,118 @@
# %% [markdown]
# # 📝 Exercise M4.03
#
# The parameter `penalty` can control the **type** of regularization to use,
# whereas the regularization **strength** is set using the parameter `C`.
# Setting `penalty="none"` is equivalent to an infinitely large value of `C`. In
# this exercise, we ask you to train a logistic regression classifier using the
# `penalty="l2"` regularization (which happens to be the default in
# scikit-learn) to find the effect of the parameter `C` by yourself.
#
# We start by loading the dataset.
# Now, we tackle a more realistic classification problem instead of using a
# synthetic dataset. We start by loading the Adult Census dataset with the
# following snippet. For the moment we retain only the **numerical features**.

# %%
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")
target = adult_census["class"]
data = adult_census.select_dtypes(["integer", "floating"])
data = data.drop(columns=["education-num"])
data

# %% [markdown]
# ```{note}
# If you want a deeper overview regarding this dataset, you can refer to the
# Appendix - Datasets description section at the end of this MOOC.
# ```
# We confirm that all the selected features are numerical.
#
# Compute the generalization performance in terms of accuracy of a linear model
# composed of a `StandardScaler` and a `LogisticRegression`. Use a 10-fold
# cross-validation with `return_estimator=True` to be able to inspect the
# trained estimators.

# %%
import pandas as pd
# Write your code here.
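
# %% [markdown]
# A possible sketch (not the reference solution), assuming the `data` and
# `target` variables defined above; the variable names below are only
# illustrative.

# %%
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale the numerical features before fitting the logistic regression.
model_numerical = make_pipeline(StandardScaler(), LogisticRegression())
cv_results_numerical = cross_validate(
    model_numerical, data, target, cv=10, return_estimator=True
)
print(
    "Mean accuracy (numerical features only): "
    f"{cv_results_numerical['test_score'].mean():.3f}"
)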

penguins = pd.read_csv("../datasets/penguins_classification.csv")
# only keep the Adelie and Chinstrap classes
penguins = (
penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index()
)
# %% [markdown]
# What is the most important feature seen by the logistic regression?
#
# You can use a boxplot to compare the absolute values of the coefficients while
# also visualizing the variability induced by the cross-validation resampling.

# %%
# Write your code here.
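
# %% [markdown]
# A possible sketch, reusing the hypothetical `cv_results_numerical` from the
# previous sketch:

# %%
import matplotlib.pyplot as plt

# Coefficients of the logistic regression fitted on each cross-validation fold.
coefs_numerical = pd.DataFrame(
    [est[-1].coef_[0] for est in cv_results_numerical["estimator"]],
    columns=data.columns,
)
coefs_numerical.abs().plot.box(vert=False)
plt.xlabel("Absolute value of the coefficient")
_ = plt.title("Coefficient variability across CV folds")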

culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
target_column = "Species"
# %% [markdown]
# Let's now work with **both numerical and categorical features**. You can
# reload the Adult Census dataset with the following snippet:

# %%
from sklearn.model_selection import train_test_split
adult_census = pd.read_csv("../datasets/adult-census.csv")
target = adult_census["class"]
data = adult_census.drop(columns=["class", "education-num"])

# %% [markdown]
# Create a predictive model where:
# - The numerical data must be scaled.
# - The categorical data must be one-hot encoded, set `min_frequency=0.01` to
#   group categories representing less than 1% of the total samples.
# - The predictor is a `LogisticRegression`. You may need to increase
#   `max_iter`, which is 100 by default.
#
# Use the same 10-fold cross-validation strategy with `return_estimator=True` as
# above to evaluate this complex pipeline.

penguins_train, penguins_test = train_test_split(penguins, random_state=0)
# %%
# Write your code here.
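
# %% [markdown]
# A possible sketch, where `numerical_columns`, `categorical_columns`,
# `preprocessor` and `model_all` are names chosen only for illustration:

# %%
from sklearn.compose import make_column_selector as selector
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

numerical_columns = selector(dtype_include="number")(data)
categorical_columns = selector(dtype_include=object)(data)

# One-hot encode the categorical columns (grouping rare categories) and scale
# the numerical ones.
preprocessor = make_column_transformer(
    (
        OneHotEncoder(handle_unknown="ignore", min_frequency=0.01),
        categorical_columns,
    ),
    (StandardScaler(), numerical_columns),
)
model_all = make_pipeline(preprocessor, LogisticRegression(max_iter=5_000))
cv_results_all = cross_validate(
    model_all, data, target, cv=10, return_estimator=True
)
print(
    "Mean accuracy (numerical + categorical features): "
    f"{cv_results_all['test_score'].mean():.3f}"
)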

data_train = penguins_train[culmen_columns]
data_test = penguins_test[culmen_columns]
# %% [markdown]
# By comparing the cross-validation test scores of both models fold-to-fold,
# count the number of times the model using both numerical and categorical
# features has a better test score than the model using only numerical features.

target_train = penguins_train[target_column]
target_test = penguins_test[target_column]
# %%
# Write your code here.
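
# %% [markdown]
# A possible sketch comparing the two hypothetical cross-validation results
# fold by fold (the default 10-fold splitter is deterministic, so both runs use
# the same splits):

# %%
better_with_categorical = (
    cv_results_all["test_score"] > cv_results_numerical["test_score"]
)
print(
    f"{better_with_categorical.sum()} out of {better_with_categorical.size} "
    "folds favor the model using all features"
)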

# %% [markdown]
# First, let's create our predictive model.
# For the following questions, you can copy and paste the following snippet to
# get the feature names from the column transformer, here named `preprocessor`.
#
# ```python
# preprocessor.fit(data)
# feature_names = (
#     preprocessor.named_transformers_["onehotencoder"].get_feature_names_out(
#         categorical_columns
#     )
# ).tolist()
# feature_names += numerical_columns
# feature_names
# ```

# %%
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Write your code here.
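
# %% [markdown]
# A possible sketch, assuming `feature_names` was built with the snippet above
# and that every fold yields the same set of one-hot encoded columns:

# %%
coefs_all = pd.DataFrame(
    [est[-1].coef_[0] for est in cv_results_all["estimator"]],
    columns=feature_names,
)
coefs_all.abs().plot.box(vert=False, figsize=(8, 12))
plt.xlabel("Absolute value of the coefficient")
_ = plt.title("Coefficient variability across CV folds")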

logistic_regression = make_pipeline(
StandardScaler(), LogisticRegression(penalty="l2")
)
# %% [markdown]
# Notice that there are as many feature names as coefficients in the last step
# of your predictive pipeline.

# %% [markdown]
# Given the following candidates for the `C` parameter, find out the impact of
# `C` on the classifier decision boundary. You can use
# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the
# decision function boundary.
# Based on the absolute magnitude of its coefficients, which of the following
# pairs of features most impacts the predictions of the logistic regression
# classifier?

# %%
Cs = [0.01, 0.1, 1, 10]
# Write your code here.

# %% [markdown]
# Now create a similar pipeline consisting of the same preprocessor as above,
# followed by a `PolynomialFeatures` and a logistic regression with `C=0.01`.
# Use `degree=2` and `interaction_only=True` for the feature engineering step.
# Remember not to include a "bias" feature to avoid introducing a redundancy
# with the intercept of the subsequent logistic regression.

# %%
# Write your code here.
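
# %% [markdown]
# A possible sketch (names are illustrative), adding interaction terms between
# the preprocessed features before the classifier:

# %%
from sklearn.preprocessing import PolynomialFeatures

model_interactions = make_pipeline(
    preprocessor,
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LogisticRegression(C=0.01, max_iter=5_000),
)
cv_results_interactions = cross_validate(
    model_interactions, data, target, cv=10, return_estimator=True
)
print(
    "Mean accuracy (with interactions): "
    f"{cv_results_interactions['test_score'].mean():.3f}"
)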

# %% [markdown]
# Look at the impact of the `C` hyperparameter on the magnitude of the weights.
# By comparing the cross-validation test scores of both models fold-to-fold,
# count the number of times the model using multiplicative interactions and both
# numerical and categorical features has a better test score than the model
# without interactions.

# %%
# Write your code here.

# %%
# Write your code here.
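
# %% [markdown]
# A possible sketch, comparing the hypothetical `cv_results_interactions` and
# `cv_results_all` fold by fold:

# %%
better_with_interactions = (
    cv_results_interactions["test_score"] > cv_results_all["test_score"]
)
print(
    f"{better_with_interactions.sum()} out of "
    f"{better_with_interactions.size} folds favor the interaction model"
)
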
170 changes: 170 additions & 0 deletions _sources/python_scripts/linear_models_ex_04.py
@@ -0,0 +1,170 @@
# ---
# jupyter:
# jupytext:
# text_representation:
# extension: .py
# format_name: percent
# format_version: '1.3'
# jupytext_version: 1.15.2
# kernelspec:
# display_name: Python 3
# name: python3
# ---

# %% [markdown]
# # 📝 Exercise M4.04
#
# In the previous Module we tuned the hyperparameter `C` of the logistic
# regression without mentioning that it controls the regularization strength.
# Later, in the slides on 🎥 **Intuitions on regularized linear models** we
# mentioned that a small `C` provides a more regularized model, whereas a
# non-regularized model is obtained with an infinitely large value of `C`.
# Indeed, `C` behaves as the inverse of the `alpha` coefficient in the `Ridge`
# model.
#
# In this exercise, we ask you to train a logistic regression classifier using
# different values of the parameter `C` to observe its effect by yourself.
#
# We start by loading the dataset. We only keep the Adelie and Chinstrap classes
# to keep the discussion simple.


# %% [markdown]
# ```{note}
# If you want a deeper overview regarding this dataset, you can refer to the
# Appendix - Datasets description section at the end of this MOOC.
# ```

# %%
import pandas as pd

penguins = pd.read_csv("../datasets/penguins_classification.csv")
penguins = (
penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index()
)

culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
target_column = "Species"

# %%
from sklearn.model_selection import train_test_split

penguins_train, penguins_test = train_test_split(
penguins, random_state=0, test_size=0.4
)

data_train = penguins_train[culmen_columns]
data_test = penguins_test[culmen_columns]

target_train = penguins_train[target_column]
target_test = penguins_test[target_column]

# %% [markdown]
# We define a function to help us fit a given `model` and plot its decision
# boundary. We recall that by using a `DecisionBoundaryDisplay` with diverging
# colormap, `vmin=0` and `vmax=1`, we ensure that the 0.5 probability is mapped
# to the white color. Equivalently, the darker the color, the closer the
# predicted probability is to 0 or 1 and the more confident the classifier is in
# its predictions.

# %%
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.inspection import DecisionBoundaryDisplay


def plot_decision_boundary(model):
    model.fit(data_train, target_train)
    accuracy = model.score(data_test, target_test)
    C = model.get_params()["logisticregression__C"]

    disp = DecisionBoundaryDisplay.from_estimator(
        model,
        data_train,
        response_method="predict_proba",
        plot_method="pcolormesh",
        cmap="RdBu_r",
        alpha=0.8,
        vmin=0.0,
        vmax=1.0,
    )
    DecisionBoundaryDisplay.from_estimator(
        model,
        data_train,
        response_method="predict_proba",
        plot_method="contour",
        linestyles="--",
        linewidths=1,
        alpha=0.8,
        levels=[0.5],
        ax=disp.ax_,
    )
    sns.scatterplot(
        data=penguins_train,
        x=culmen_columns[0],
        y=culmen_columns[1],
        hue=target_column,
        palette=["tab:blue", "tab:red"],
        ax=disp.ax_,
    )
    plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
    plt.title(f"C: {C} \n Accuracy on the test set: {accuracy:.2f}")


# %% [markdown]
# Let's now create our predictive model.

# %%
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

logistic_regression = make_pipeline(StandardScaler(), LogisticRegression())

# %% [markdown]
# ## Influence of the parameter `C` on the decision boundary
#
# Given the following candidates for the `C` parameter and the
# `plot_decision_boundary` function, find out the impact of `C` on the
# classifier's decision boundary.
#
# - How does the value of `C` impact the confidence on the predictions?
# - How does it impact the underfit/overfit trade-off?
# - How does it impact the position and orientation of the decision boundary?
#
# Try to give an interpretation of the reason for such behavior.

# %%
Cs = [1e-6, 0.01, 0.1, 1, 10, 100, 1e6]

# Write your code here.
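
# %% [markdown]
# A possible sketch: set each candidate `C` on the existing pipeline and reuse
# the `plot_decision_boundary` helper defined above.

# %%
for C in Cs:
    # `plot_decision_boundary` refits the model on the training data.
    logistic_regression.set_params(logisticregression__C=C)
    plot_decision_boundary(logistic_regression)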

# %% [markdown]
# ## Impact of the regularization on the weights
#
# Look at the impact of the `C` hyperparameter on the magnitude of the weights.
# **Hint**: You can [access pipeline
# steps](https://scikit-learn.org/stable/modules/compose.html#access-pipeline-steps)
# by name or position. Then you can query the attributes of that step such as
# `coef_`.

# %%
# Write your code here.
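
# %% [markdown]
# A possible sketch: refit the pipeline for each `C` and collect the
# coefficients of the `LogisticRegression` step (the variable names are
# illustrative).

# %%
lr_weights = {}
for C in Cs:
    logistic_regression.set_params(logisticregression__C=C)
    logistic_regression.fit(data_train, target_train)
    # The last step of the pipeline is the fitted LogisticRegression.
    lr_weights[C] = logistic_regression[-1].coef_[0]

lr_weights = pd.DataFrame(lr_weights, index=culmen_columns).T
lr_weights.plot.barh()
plt.xlabel("Coefficient value")
_ = plt.ylabel("C")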

# %% [markdown]
# ## Impact of the regularization with non-linear feature engineering
#
# Use the `plot_decision_boundary` function to repeat the experiment using a
# non-linear feature engineering pipeline. For this purpose, insert
# `Nystroem(kernel="rbf", gamma=1, n_components=100)` between the
# `StandardScaler` and the `LogisticRegression` steps.
#
# - Does the value of `C` still impact the position of the decision boundary and
# the confidence of the model?
# - What can you say about the impact of `C` on the underfitting vs overfitting
# trade-off?

# %%
from sklearn.kernel_approximation import Nystroem

# Write your code here.
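
# %% [markdown]
# A possible sketch: insert the `Nystroem` kernel approximation between the
# scaler and the classifier, then reuse the plotting helper for each `C`
# (`random_state=0` is only set here for reproducibility of the sketch).

# %%
for C in Cs:
    nystroem_regression = make_pipeline(
        StandardScaler(),
        Nystroem(kernel="rbf", gamma=1, n_components=100, random_state=0),
        LogisticRegression(C=C),
    )
    plot_decision_boundary(nystroem_regression)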
