[Binary image diffs: 14 regenerated figure PNGs added and 5 removed under _images/.]

diff --git a/_sources/python_scripts/linear_models_sol_03.py b/_sources/python_scripts/linear_models_sol_03.py
index d789c8522..dc2a82f5c 100644
--- a/_sources/python_scripts/linear_models_sol_03.py
+++ b/_sources/python_scripts/linear_models_sol_03.py
@@ -8,14 +8,20 @@
 # %% [markdown]
 # # 📃 Solution for Exercise M4.03
 #
-# The parameter `penalty` can control the **type** of regularization to use,
-# whereas the regularization **strength** is set using the parameter `C`.
-# Setting`penalty="none"` is equivalent to an infinitely large value of `C`. In
-# this exercise, we ask you to train a logistic regression classifier using the
-# `penalty="l2"` regularization (which happens to be the default in
-# scikit-learn) to find by yourself the effect of the parameter `C`.
+# In the previous Module we tuned the hyperparameter `C` of the logistic
+# regression without mentioning that it controls the regularization strength.
+# Later, in the slides on 🎥 **Intuitions on regularized linear models** we
+# mentioned that a small `C` provides a more regularized model, whereas a
+# non-regularized model is obtained with an infinitely large value of `C`.
+# Indeed, `C` behaves as the inverse of the `alpha` coefficient in the `Ridge`
+# model.
 #
-# We start by loading the dataset.
+# In this exercise, we ask you to train a logistic regression classifier using
+# different values of the parameter `C` to find its effects by yourself.
+#
+# We start by loading the dataset. We only keep the Adelie and Chinstrap classes
+# to keep the discussion simple.
+
 # %% [markdown]
 # ```{note}
@@ -27,7 +33,6 @@
 import pandas as pd
 
 penguins = pd.read_csv("../datasets/penguins_classification.csv")
-# only keep the Adelie and Chinstrap classes
 penguins = (
     penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index()
 )
@@ -38,7 +43,9 @@
 # %%
 from sklearn.model_selection import train_test_split
 
-penguins_train, penguins_test = train_test_split(penguins, random_state=0)
+penguins_train, penguins_test = train_test_split(
+    penguins, random_state=0, test_size=0.4
+)
 
 data_train = penguins_train[culmen_columns]
 data_test = penguins_test[culmen_columns]
@@ -47,76 +54,227 @@
 target_test = penguins_test[target_column]
 
 # %% [markdown]
-# First, let's create our predictive model.
+# We define a function to help us fit a given `model` and plot its decision
+# boundary. We recall that by using a `DecisionBoundaryDisplay` with a diverging
+# colormap, `vmin=0` and `vmax=1`, we ensure that the 0.5 probability is mapped
+# to the white color. Equivalently, the darker the color, the closer the
+# predicted probability is to 0 or 1 and the more confident the classifier is in
+# its predictions.
 
 # %%
-from sklearn.pipeline import make_pipeline
-from sklearn.preprocessing import StandardScaler
-from sklearn.linear_model import LogisticRegression
-
-logistic_regression = make_pipeline(
-    StandardScaler(), LogisticRegression(penalty="l2")
-)
-
-# %% [markdown]
-# Given the following candidates for the `C` parameter, find out the impact of
-# `C` on the classifier decision boundary. You can use
-# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the
-# decision function boundary.
-
-# %%
-Cs = [0.01, 0.1, 1, 10]
-
-# solution
 import matplotlib.pyplot as plt
 import seaborn as sns
 from sklearn.inspection import DecisionBoundaryDisplay
 
-for C in Cs:
-    logistic_regression.set_params(logisticregression__C=C)
-    logistic_regression.fit(data_train, target_train)
-    accuracy = logistic_regression.score(data_test, target_test)
-
-    DecisionBoundaryDisplay.from_estimator(
-        logistic_regression,
-        data_test,
-        response_method="predict",
+
+def plot_decision_boundary(model):
+    model.fit(data_train, target_train)
+    accuracy = model.score(data_test, target_test)
+
+    disp = DecisionBoundaryDisplay.from_estimator(
+        model,
+        data_train,
+        response_method="predict_proba",
+        plot_method="pcolormesh",
         cmap="RdBu_r",
-        alpha=0.5,
+        alpha=0.8,
+        vmin=0.0,
+        vmax=1.0,
+    )
+    DecisionBoundaryDisplay.from_estimator(
+        model,
+        data_train,
+        response_method="predict_proba",
+        plot_method="contour",
+        linestyles="--",
+        linewidths=1,
+        alpha=0.8,
+        levels=[0.5],
+        ax=disp.ax_,
     )
     sns.scatterplot(
-        data=penguins_test,
+        data=penguins_train,
         x=culmen_columns[0],
         y=culmen_columns[1],
         hue=target_column,
-        palette=["tab:red", "tab:blue"],
+        palette=["tab:blue", "tab:red"],
+        ax=disp.ax_,
    )
     plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
     plt.title(f"C: {C} \n Accuracy on the test set: {accuracy:.2f}")
+
+# %% [markdown]
+# Let's now create our predictive model.
+
+# %%
+from sklearn.pipeline import make_pipeline
+from sklearn.preprocessing import StandardScaler
+from sklearn.linear_model import LogisticRegression
+
+logistic_regression = make_pipeline(StandardScaler(), LogisticRegression())
+
+# %% [markdown]
+# ## Influence of the parameter `C` on the decision boundary
+#
+# Given the following candidates for the `C` parameter and the
+# `plot_decision_boundary` function, find out the impact of `C` on the
+# classifier's decision boundary.
+#
+# - How does the value of `C` impact the confidence in the predictions?
+# - How does it impact the underfit/overfit trade-off?
+# - How does it impact the position and orientation of the decision boundary?
+#
+# Try to give an interpretation of the reason for such behavior.
+
+# %%
+Cs = [1e-6, 0.01, 0.1, 1, 10, 100, 1e6]
+
+# solution
+for C in Cs:
+    logistic_regression.set_params(logisticregression__C=C)
+    plot_decision_boundary(logistic_regression)
+
+# %% [markdown] tags=["solution"]
+#
+# On this series of plots we can observe several important points. Regarding the
+# confidence in the predictions:
+#
+# - For low values of `C` (strong regularization), the classifier is less
+#   confident in its predictions. We are enforcing a **spread sigmoid**.
+# - For high values of `C` (weak regularization), the classifier is more
+#   confident: the areas with dark blue (very confident in predicting "Adelie")
+#   and dark red (very confident in predicting "Chinstrap") nearly cover the
+#   entire feature space. We are enforcing a **steep sigmoid**.
+#
+# To answer the next question, think that misclassified data points are more
+# costly when the classifier is more confident in the decision. Decision rules
+# are mostly driven by avoiding such cost. From the previous observations we can
+# then deduce that:
+#
+# - The smaller the `C` (the stronger the regularization), the lower the cost
+#   of a misclassification. As more data points lie in the low-confidence
+#   zone, the more the decision rules are influenced almost uniformly by all
+#   the data points. This leads to a less expressive model, which may underfit.
+# - The higher the value of `C` (the weaker the regularization), the more the
+#   decision is influenced by a few training points very close to the boundary,
+#   where decisions are costly. Remember that models may overfit if the number
+#   of samples in the training set is too small, as at least a minimum of
+#   samples is needed to average the noise out.
+#
+# The orientation is the result of two factors: minimizing the number of
+# misclassified training points with high confidence and their distance to the
+# decision boundary (notice how the contour line tries to align with the most
+# misclassified data points in the dark-colored zone). This is closely related
+# to the value of the weights of the model, which is explained in the next part
+# of the exercise.
+#
+# Finally, for small values of `C` the position of the decision boundary is
+# affected by the class imbalance: when `C` is near zero, the model predicts the
+# majority class (as seen in the training set) everywhere in the feature space.
+# In our case, there are approximately two times more "Adelie" than "Chinstrap"
+# penguins. This explains why the decision boundary is shifted to the right when
+# `C` gets smaller. Indeed, the most regularized model predicts light blue
+# almost everywhere in the feature space.
+
 # %% [markdown]
+# ## Impact of the regularization on the weights
+#
 # Look at the impact of the `C` hyperparameter on the magnitude of the weights.
+# **Hint**: You can [access pipeline
+# steps](https://scikit-learn.org/stable/modules/compose.html#access-pipeline-steps)
+# by name or position. Then you can query the attributes of that step such as
+# `coef_`.
 
 # %%
 # solution
-weights_ridge = []
+lr_weights = []
 for C in Cs:
     logistic_regression.set_params(logisticregression__C=C)
     logistic_regression.fit(data_train, target_train)
     coefs = logistic_regression[-1].coef_[0]
-    weights_ridge.append(pd.Series(coefs, index=culmen_columns))
+    lr_weights.append(pd.Series(coefs, index=culmen_columns))
 
 # %% tags=["solution"]
-weights_ridge = pd.concat(weights_ridge, axis=1, keys=[f"C: {C}" for C in Cs])
-weights_ridge.plot.barh()
+lr_weights = pd.concat(lr_weights, axis=1, keys=[f"C: {C}" for C in Cs])
+lr_weights.plot.barh()
 _ = plt.title("LogisticRegression weights depending of C")
 
 # %% [markdown] tags=["solution"]
-# We see that a small `C` will shrink the weights values toward zero. It means
-# that a small `C` provides a more regularized model. Thus, `C` is the inverse
-# of the `alpha` coefficient in the `Ridge` model.
 #
-# Besides, with a strong penalty (i.e. small `C` value), the weight of the
-# feature "Culmen Depth (mm)" is almost zero. It explains why the decision
+# As a small `C` provides a more regularized model, it shrinks the weight values
+# toward zero, as in the `Ridge` model.
+#
+# In particular, with a strong penalty (e.g. `C = 0.01`), the weight of the
+# feature named "Culmen Depth (mm)" is almost zero. It explains why the decision
 # separation in the plot is almost perpendicular to the "Culmen Length (mm)"
 # feature.
+#
+# For even stronger penalty strengths (e.g. `C = 1e-6`), the weights of both
+# features are almost zero. It explains why the decision separation in the plot
+# is almost constant in the feature space: the predicted probability is only
+# based on the intercept parameter of the model (which is never regularized).
+
+# %% [markdown]
+# ## Impact of the regularization with non-linear feature engineering
+#
+# Use the `plot_decision_boundary` function to repeat the experiment using a
+# non-linear feature engineering pipeline. For this purpose, insert
+# `Nystroem(kernel="rbf", gamma=1, n_components=100)` between the
+# `StandardScaler` and the `LogisticRegression` steps.
+#
+# - Does the value of `C` still impact the position of the decision boundary and
+#   the confidence of the model?
+# - What can you say about the impact of `C` on the underfitting vs overfitting
+#   trade-off?
+
+# %%
+from sklearn.kernel_approximation import Nystroem
+
+# solution
+classifier = make_pipeline(
+    StandardScaler(),
+    Nystroem(kernel="rbf", gamma=1.0, n_components=100, random_state=0),
+    LogisticRegression(penalty="l2", max_iter=1000),
+)
+
+for C in Cs:
+    classifier.set_params(logisticregression__C=C)
+    plot_decision_boundary(classifier)
+
+# %% [markdown] tags=["solution"]
+#
+# - For the lowest values of `C`, the overall pipeline underfits: it predicts
+#   the majority class everywhere, as previously.
+# - When `C` increases, the model starts to predict some data points from the
+#   "Chinstrap" class but the model is not very confident anywhere in the
+#   feature space.
+# - The decision boundary is no longer a straight line: the linear model is now
+#   classifying in the 100-dimensional feature space created by the `Nystroem`
+#   transformer. As a result, the decision boundary induced by the overall
+#   pipeline is now expressive enough to wrap around the minority class.
+# - For `C = 1` in particular, it finds a smooth red blob around most of the
+#   "Chinstrap" data points. When moving away from the data points, the model is
+#   less confident in its predictions and again tends to predict the majority
+#   class according to the proportion in the training set.
+# - For higher values of `C`, the model starts to overfit: it is very confident
+#   in its predictions almost everywhere, but it should not be trusted: the
+#   model also makes a larger number of mistakes on the test set (not shown in
+#   the plot) while adopting a very curvy decision boundary to attempt fitting
+#   all the training points, including the noisy ones at the frontier between
+#   the two classes. This makes the decision boundary very sensitive to the
+#   sampling of the training set and, as a result, it does not generalize well
+#   in that region. This is confirmed by the (slightly) lower accuracy on the
+#   test set.
+#
+# Finally, we can also note that the linear model on the raw features was as
+# good or better than the best model using non-linear feature engineering. So in
+# this case, we did not really need this extra complexity in our pipeline.
+# **Simpler is better!**
+#
+# So to conclude, when using non-linear feature engineering, it is often
+# possible to make the pipeline overfit, even if the original feature space is
+# low-dimensional. As a result, it is important to tune the regularization
+# parameter in conjunction with the parameters of the transformers (e.g. tuning
+# `gamma` would be important here). This has a direct impact on the certainty of
+# the predictions.
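The closing advice in the solution above is to tune the regularization strength together with the parameters of the feature-engineering step. The following standalone sketch shows one hypothetical way of doing that with `GridSearchCV`; the `make_moons` toy dataset and the parameter grids are illustrative choices, not taken from the course materials.

```python
# Hypothetical sketch: jointly tune the Nystroem `gamma` and the
# LogisticRegression `C` instead of tuning them one at a time.
from sklearn.datasets import make_moons
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy non-linear dataset standing in for the penguins data.
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

pipeline = make_pipeline(
    StandardScaler(),
    Nystroem(kernel="rbf", n_components=100, random_state=0),
    LogisticRegression(max_iter=1000),
)

# Step names generated by make_pipeline are the lowercased class names.
param_grid = {
    "nystroem__gamma": [0.1, 1.0, 10.0],
    "logisticregression__C": [0.01, 0.1, 1.0, 10.0, 100.0],
}
search = GridSearchCV(pipeline, param_grid, cv=5).fit(X, y)
print(search.best_params_, f"mean CV accuracy: {search.best_score_:.3f}")
```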

diff --git a/appendix/notebook_timings.html b/appendix/notebook_timings.html
index 0bdb6d69d..94ea14b4c 100644
--- a/appendix/notebook_timings.html
+++ b/appendix/notebook_timings.html
@@ -1029,9 +1029,9 @@
[Rendered "Notebook timings" table: the row for python_scripts/linear_models_sol_03 changes from 2023-08-30 10:07 / 3.97 s to 2023-10-17 13:47 / 11.76 s (cache); the neighbouring python_scripts/linear_regression_in_sklearn row is unchanged context.]
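Returning to the decision-boundary discussion in the source diff: the solution claims that with `C` close to zero only the (unregularized) intercept is left to fit, so the classifier predicts the majority-class proportion everywhere. Below is a small self-contained check of that claim, using a made-up imbalanced dataset rather than the penguins data.

```python
# Minimal sketch on synthetic data: with a tiny C the weights are shrunk to ~0
# and predict_proba collapses toward the class prior seen in the training data,
# i.e. the model predicts the majority class everywhere.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))       # two arbitrary numeric features
y = np.array([0] * 100 + [1] * 50)  # roughly two "majority" points per "minority" point

model = make_pipeline(StandardScaler(), LogisticRegression(C=1e-6))
model.fit(X, y)

print("class-1 prior in the training data:", y.mean())
print("predicted class-1 probabilities:", model.predict_proba(X[:3])[:, 1])
print("weights:", model[-1].coef_[0], "intercept:", model[-1].intercept_[0])
```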
diff --git a/python_scripts/linear_models_sol_03.html b/python_scripts/linear_models_sol_03.html
index c11870c99..7c020c1c9 100644
--- a/python_scripts/linear_models_sol_03.html
+++ b/python_scripts/linear_models_sol_03.html
[Rendered HTML page regenerated to mirror the changes in _sources/python_scripts/linear_models_sol_03.py above: a new in-page contents sidebar, the rewritten introduction, the plot_decision_boundary helper, the new sections on the influence of C, on the regularization of the weights, and on non-linear feature engineering with Nystroem, together with updated references to the regenerated figures under _images/.]
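Finally, the weights section of the diff states that `C` behaves like the inverse of Ridge's `alpha`, so smaller values of `C` shrink the coefficients of the final pipeline step. The sketch below, again on synthetic classification data rather than the penguins CSV, makes that shrinkage visible and uses the pipeline-step access mentioned in the exercise hint.

```python
# Illustrative sketch: the magnitude of the LogisticRegression weights shrinks
# as C decreases (C plays the role of 1 / alpha in Ridge-style notation).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

for C in [1e-6, 1e-2, 1.0, 1e2]:
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C))
    model.fit(X, y)
    # Reach the fitted last step by position (model[-1]) or by name
    # (model.named_steps["logisticregression"]); coef_ holds the weights.
    weights = model[-1].coef_[0]
    print(f"C={C:g}  weights={weights.round(3)}")
```

The printed weight magnitudes grow with `C`, which is the same pattern as the bar plot of `lr_weights` in the notebook.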