diff --git a/_images/032d44039813a98a96b553b070215955b70d851f6f9fac169280d9be52105833.png b/_images/032d44039813a98a96b553b070215955b70d851f6f9fac169280d9be52105833.png new file mode 100644 index 000000000..9f3de57bb Binary files /dev/null and b/_images/032d44039813a98a96b553b070215955b70d851f6f9fac169280d9be52105833.png differ diff --git a/_images/07c9345090d559dd29999ec0ee6c672b7ca65345dbe11259882d774345129b9b.png b/_images/07c9345090d559dd29999ec0ee6c672b7ca65345dbe11259882d774345129b9b.png new file mode 100644 index 000000000..c5eb5b463 Binary files /dev/null and b/_images/07c9345090d559dd29999ec0ee6c672b7ca65345dbe11259882d774345129b9b.png differ diff --git a/_images/0cb9478c8ebafd1110a77a3591feb59f7009ca636e3eda3619943c8f950c57f3.png b/_images/0cb9478c8ebafd1110a77a3591feb59f7009ca636e3eda3619943c8f950c57f3.png new file mode 100644 index 000000000..e2f2057ec Binary files /dev/null and b/_images/0cb9478c8ebafd1110a77a3591feb59f7009ca636e3eda3619943c8f950c57f3.png differ diff --git a/_images/0e50749bd867085e6fd9c797f89a1870ff6983514e8822934f333abef22f2156.png b/_images/0e50749bd867085e6fd9c797f89a1870ff6983514e8822934f333abef22f2156.png deleted file mode 100644 index 5510b432e..000000000 Binary files a/_images/0e50749bd867085e6fd9c797f89a1870ff6983514e8822934f333abef22f2156.png and /dev/null differ diff --git a/_images/453858549e380f9ecfac873f0dc202778dafe5323277d31a517a0d5f0437aa29.png b/_images/453858549e380f9ecfac873f0dc202778dafe5323277d31a517a0d5f0437aa29.png new file mode 100644 index 000000000..02be5da5c Binary files /dev/null and b/_images/453858549e380f9ecfac873f0dc202778dafe5323277d31a517a0d5f0437aa29.png differ diff --git a/_images/4995b02ae7b2d0dec5f4214107dd113c296209f435f295ac352e7de6fd7f5abb.png b/_images/4995b02ae7b2d0dec5f4214107dd113c296209f435f295ac352e7de6fd7f5abb.png new file mode 100644 index 000000000..d3088f636 Binary files /dev/null and b/_images/4995b02ae7b2d0dec5f4214107dd113c296209f435f295ac352e7de6fd7f5abb.png differ diff --git a/_images/4b8d72068b16a40033d1ae1c3e43e24f8a0695a944be1ef86aa1f8d61d157cc2.png b/_images/4b8d72068b16a40033d1ae1c3e43e24f8a0695a944be1ef86aa1f8d61d157cc2.png new file mode 100644 index 000000000..4edcbad60 Binary files /dev/null and b/_images/4b8d72068b16a40033d1ae1c3e43e24f8a0695a944be1ef86aa1f8d61d157cc2.png differ diff --git a/_images/4f064fcce2c9e74d5ca165c4b7a9d97adde6663fed9522e918e28a68073dd161.png b/_images/4f064fcce2c9e74d5ca165c4b7a9d97adde6663fed9522e918e28a68073dd161.png new file mode 100644 index 000000000..21639e031 Binary files /dev/null and b/_images/4f064fcce2c9e74d5ca165c4b7a9d97adde6663fed9522e918e28a68073dd161.png differ diff --git a/_images/7f99620cd8492571c1c6d809a38e7f691d3bffc6797b6c8f68bcca57101d2426.png b/_images/7f99620cd8492571c1c6d809a38e7f691d3bffc6797b6c8f68bcca57101d2426.png new file mode 100644 index 000000000..bba5c7de2 Binary files /dev/null and b/_images/7f99620cd8492571c1c6d809a38e7f691d3bffc6797b6c8f68bcca57101d2426.png differ diff --git a/_images/9e0634a92164f80e2d757d4a2bc574f1dea6cc857294526748584ac48c6c2b47.png b/_images/9e0634a92164f80e2d757d4a2bc574f1dea6cc857294526748584ac48c6c2b47.png new file mode 100644 index 000000000..2dc3337ec Binary files /dev/null and b/_images/9e0634a92164f80e2d757d4a2bc574f1dea6cc857294526748584ac48c6c2b47.png differ diff --git a/_images/a4aec66658c53ca38b826f418b4b3db906c1a92ab89a9ddef7eb6c0c1867396d.png b/_images/a4aec66658c53ca38b826f418b4b3db906c1a92ab89a9ddef7eb6c0c1867396d.png deleted file mode 100644 index 5da65a37b..000000000 Binary files 
a/_images/a4aec66658c53ca38b826f418b4b3db906c1a92ab89a9ddef7eb6c0c1867396d.png and /dev/null differ diff --git a/_images/b292d3662c24fda8d7cbb974f613e06bc0eab2b7400074243c74184b04574c1c.png b/_images/b292d3662c24fda8d7cbb974f613e06bc0eab2b7400074243c74184b04574c1c.png deleted file mode 100644 index 3d4c90264..000000000 Binary files a/_images/b292d3662c24fda8d7cbb974f613e06bc0eab2b7400074243c74184b04574c1c.png and /dev/null differ diff --git a/_images/ca8118111e8a08b91b86eccb27d4c40542f7d4455efdbd2e0d985bf5615b293c.png b/_images/ca8118111e8a08b91b86eccb27d4c40542f7d4455efdbd2e0d985bf5615b293c.png new file mode 100644 index 000000000..24e853086 Binary files /dev/null and b/_images/ca8118111e8a08b91b86eccb27d4c40542f7d4455efdbd2e0d985bf5615b293c.png differ diff --git a/_images/d29a899ee748eecaddc699f51059d814483a9f000303fa449509f98bdf524abe.png b/_images/d29a899ee748eecaddc699f51059d814483a9f000303fa449509f98bdf524abe.png new file mode 100644 index 000000000..1d93f86a6 Binary files /dev/null and b/_images/d29a899ee748eecaddc699f51059d814483a9f000303fa449509f98bdf524abe.png differ diff --git a/_images/d6eafcd252a9a80ed6880db211eccc646407dd995a818a381022e7cbeba3db42.png b/_images/d6eafcd252a9a80ed6880db211eccc646407dd995a818a381022e7cbeba3db42.png deleted file mode 100644 index 4ef13c8b1..000000000 Binary files a/_images/d6eafcd252a9a80ed6880db211eccc646407dd995a818a381022e7cbeba3db42.png and /dev/null differ diff --git a/_images/dd7f27dbcdf5564035e27c670f3123f844e7a981f8c63287750ece5b61a44928.png b/_images/dd7f27dbcdf5564035e27c670f3123f844e7a981f8c63287750ece5b61a44928.png deleted file mode 100644 index f7a28b673..000000000 Binary files a/_images/dd7f27dbcdf5564035e27c670f3123f844e7a981f8c63287750ece5b61a44928.png and /dev/null differ diff --git a/_images/e0e8d7699e085bb07f9674f12dfeaa5dd14ee4b5dace4789a6d858e77b673d09.png b/_images/e0e8d7699e085bb07f9674f12dfeaa5dd14ee4b5dace4789a6d858e77b673d09.png new file mode 100644 index 000000000..08c9a6fda Binary files /dev/null and b/_images/e0e8d7699e085bb07f9674f12dfeaa5dd14ee4b5dace4789a6d858e77b673d09.png differ diff --git a/_images/e64c91b30862454474c755622a6a99fbf325cb90fde784598bc71a96daecd3ac.png b/_images/e64c91b30862454474c755622a6a99fbf325cb90fde784598bc71a96daecd3ac.png new file mode 100644 index 000000000..970fe99d1 Binary files /dev/null and b/_images/e64c91b30862454474c755622a6a99fbf325cb90fde784598bc71a96daecd3ac.png differ diff --git a/_images/eaf825e25e641c4d8d6a242333ddfe7631a486b0cf3b5a2a87d34fac8c796537.png b/_images/eaf825e25e641c4d8d6a242333ddfe7631a486b0cf3b5a2a87d34fac8c796537.png new file mode 100644 index 000000000..74e850217 Binary files /dev/null and b/_images/eaf825e25e641c4d8d6a242333ddfe7631a486b0cf3b5a2a87d34fac8c796537.png differ diff --git a/_sources/python_scripts/linear_models_sol_03.py b/_sources/python_scripts/linear_models_sol_03.py index d789c8522..dc2a82f5c 100644 --- a/_sources/python_scripts/linear_models_sol_03.py +++ b/_sources/python_scripts/linear_models_sol_03.py @@ -8,14 +8,20 @@ # %% [markdown] # # 📃 Solution for Exercise M4.03 # -# The parameter `penalty` can control the **type** of regularization to use, -# whereas the regularization **strength** is set using the parameter `C`. -# Setting`penalty="none"` is equivalent to an infinitely large value of `C`. In -# this exercise, we ask you to train a logistic regression classifier using the -# `penalty="l2"` regularization (which happens to be the default in -# scikit-learn) to find by yourself the effect of the parameter `C`. 
+# In the previous Module we tuned the hyperparameter `C` of the logistic +# regression without mentioning that it controls the regularization strength. +# Later, on the slides on 🎥 **Intuitions on regularized linear models** we +# metioned that a small `C` provides a more regularized model, whereas a +# non-regularized model is obtained with an infinitely large value of `C`. +# Indeed, `C` behaves as the inverse of the `alpha` coefficient in the `Ridge` +# model. # -# We start by loading the dataset. +# In this exercise, we ask you to train a logistic regression classifier using +# different values of the parameter `C` to find its effects by yourself. +# +# We start by loading the dataset. We only keep the Adelie and Chinstrap classes +# to keep the discussion simple. + # %% [markdown] # ```{note} @@ -27,7 +33,6 @@ import pandas as pd penguins = pd.read_csv("../datasets/penguins_classification.csv") -# only keep the Adelie and Chinstrap classes penguins = ( penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index() ) @@ -38,7 +43,9 @@ # %% from sklearn.model_selection import train_test_split -penguins_train, penguins_test = train_test_split(penguins, random_state=0) +penguins_train, penguins_test = train_test_split( + penguins, random_state=0, test_size=0.4 +) data_train = penguins_train[culmen_columns] data_test = penguins_test[culmen_columns] @@ -47,76 +54,227 @@ target_test = penguins_test[target_column] # %% [markdown] -# First, let's create our predictive model. +# We define a function to help us fit a given `model` and plot its decision +# boundary. We recall that by using a `DecisionBoundaryDisplay` with diverging +# colormap, `vmin=0` and `vmax=1`, we ensure that the 0.5 probability is mapped +# to the white color. Equivalently, the darker the color, the closer the +# predicted probability is to 0 or 1 and the more confident the classifier is in +# its predictions. # %% -from sklearn.pipeline import make_pipeline -from sklearn.preprocessing import StandardScaler -from sklearn.linear_model import LogisticRegression - -logistic_regression = make_pipeline( - StandardScaler(), LogisticRegression(penalty="l2") -) - -# %% [markdown] -# Given the following candidates for the `C` parameter, find out the impact of -# `C` on the classifier decision boundary. You can use -# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the -# decision function boundary. 
- -# %% -Cs = [0.01, 0.1, 1, 10] - -# solution import matplotlib.pyplot as plt import seaborn as sns from sklearn.inspection import DecisionBoundaryDisplay -for C in Cs: - logistic_regression.set_params(logisticregression__C=C) - logistic_regression.fit(data_train, target_train) - accuracy = logistic_regression.score(data_test, target_test) - DecisionBoundaryDisplay.from_estimator( - logistic_regression, - data_test, - response_method="predict", +def plot_decision_boundary(model): + model.fit(data_train, target_train) + accuracy = model.score(data_test, target_test) + + disp = DecisionBoundaryDisplay.from_estimator( + model, + data_train, + response_method="predict_proba", + plot_method="pcolormesh", cmap="RdBu_r", - alpha=0.5, + alpha=0.8, + vmin=0.0, + vmax=1.0, + ) + DecisionBoundaryDisplay.from_estimator( + model, + data_train, + response_method="predict_proba", + plot_method="contour", + linestyles="--", + linewidths=1, + alpha=0.8, + levels=[0.5], + ax=disp.ax_, ) sns.scatterplot( - data=penguins_test, + data=penguins_train, x=culmen_columns[0], y=culmen_columns[1], hue=target_column, - palette=["tab:red", "tab:blue"], + palette=["tab:blue", "tab:red"], + ax=disp.ax_, ) plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left") plt.title(f"C: {C} \n Accuracy on the test set: {accuracy:.2f}") + +# %% [markdown] +# Let's now create our predictive model. + +# %% +from sklearn.pipeline import make_pipeline +from sklearn.preprocessing import StandardScaler +from sklearn.linear_model import LogisticRegression + +logistic_regression = make_pipeline(StandardScaler(), LogisticRegression()) + +# %% [markdown] +# ## Influence of the parameter `C` on the decision boundary +# +# Given the following candidates for the `C` parameter and the +# `plot_decision_boundary` function, find out the impact of `C` on the +# classifier's decision boundary. +# +# - How does the value of `C` impact the confidence on the predictions? +# - How does it impact the underfit/overfit trade-off? +# - How does it impact the position and orientation of the decision boundary? +# +# Try to give an interpretation on the reason for such behavior. + +# %% +Cs = [1e-6, 0.01, 0.1, 1, 10, 100, 1e6] + +# solution +for C in Cs: + logistic_regression.set_params(logisticregression__C=C) + plot_decision_boundary(logistic_regression) + +# %% [markdown] tags=["solution"] +# +# On this series of plots we can observe several important points. Regarding the +# confidence on the predictions: +# +# - For low values of `C` (strong regularization), the classifier is less +# confident in its predictions. We are enforcing a **spread sigmoid**. +# - For high values of `C` (weak regularization), the classifier is more +# confident: the areas with dark blue (very confident in predicting "Adelie") +# and dark red (very confident in predicting "Chinstrap") nearly cover the +# entire feature space. We are enforcing a **steep sigmoid**. +# +# To answer the next question, think that misclassified data points are more +# costly when the classifier is more confident on the decision. Decision rules +# are mostly driven by avoiding such cost. From the previous observations we can +# then deduce that: +# +# - The smaller the `C` (the stronger the regularization), the lower the cost +# of a misclassification. As more data points lay in the low-confidence +# zone, the more the decision rules are influenced almost uniformly by all +# the data points. This leads to a less expressive model, which may underfit. 
+# - The higher the value of `C` (the weaker the regularization), the more the +# decision is influenced by a few training points very close to the boundary, +# where decisions are costly. Remember that models may overfit if the number +# of samples in the training set is too small, as at least a minimum of +# samples is needed to average the noise out. +# +# The orientation is the result of two factors: minimizing the number of +# misclassified training points with high confidence and their distance to the +# decision boundary (notice how the contour line tries to align with the most +# misclassified data points in the dark-colored zone). This is closely related +# to the value of the weights of the model, which is explained in the next part +# of the exercise. +# +# Finally, for small values of `C` the position of the decision boundary is +# affected by the class imbalance: when `C` is near zero, the model predicts the +# majority class (as seen in the training set) everywhere in the feature space. +# In our case, there are approximately two times more "Adelie" than "Chinstrap" +# penguins. This explains why the decision boundary is shifted to the right when +# `C` gets smaller. Indeed, the most regularized model predicts light blue +# almost everywhere in the feature space. + # %% [markdown] +# ## Impact of the regularization on the weights +# # Look at the impact of the `C` hyperparameter on the magnitude of the weights. +# **Hint**: You can [access pipeline +# steps](https://scikit-learn.org/stable/modules/compose.html#access-pipeline-steps) +# by name or position. Then you can query the attributes of that step such as +# `coef_`. # %% # solution -weights_ridge = [] +lr_weights = [] for C in Cs: logistic_regression.set_params(logisticregression__C=C) logistic_regression.fit(data_train, target_train) coefs = logistic_regression[-1].coef_[0] - weights_ridge.append(pd.Series(coefs, index=culmen_columns)) + lr_weights.append(pd.Series(coefs, index=culmen_columns)) # %% tags=["solution"] -weights_ridge = pd.concat(weights_ridge, axis=1, keys=[f"C: {C}" for C in Cs]) -weights_ridge.plot.barh() +lr_weights = pd.concat(lr_weights, axis=1, keys=[f"C: {C}" for C in Cs]) +lr_weights.plot.barh() _ = plt.title("LogisticRegression weights depending of C") # %% [markdown] tags=["solution"] -# We see that a small `C` will shrink the weights values toward zero. It means -# that a small `C` provides a more regularized model. Thus, `C` is the inverse -# of the `alpha` coefficient in the `Ridge` model. # -# Besides, with a strong penalty (i.e. small `C` value), the weight of the -# feature "Culmen Depth (mm)" is almost zero. It explains why the decision +# As small `C` provides a more regularized model, it shrinks the weights values +# toward zero, as in the `Ridge` model. +# +# In particular, with a strong penalty (e.g. `C = 0.01`), the weight of the feature +# named "Culmen Depth (mm)" is almost zero. It explains why the decision # separation in the plot is almost perpendicular to the "Culmen Length (mm)" # feature. +# +# For even stronger penalty strengths (e.g. `C = 1e-6`), the weights of both +# features are almost zero. It explains why the decision separation in the plot +# is almost constant in the feature space: the predicted probability is only +# based on the intercept parameter of the model (which is never regularized). 
+ +# %% [markdown] +# ## Impact of the regularization on with non-linear feature engineering +# +# Use the `plot_decision_boundary` function to repeat the experiment using a +# non-linear feature engineering pipeline. For such purpose, insert +# `Nystroem(kernel="rbf", gamma=1, n_components=100)` between the +# `StandardScaler` and the `LogisticRegression` steps. +# +# - Does the value of `C` still impact the position of the decision boundary and +# the confidence of the model? +# - What can you say about the impact of `C` on the underfitting vs overfitting +# trade-off? + +# %% +from sklearn.kernel_approximation import Nystroem + +# solution +classifier = make_pipeline( + StandardScaler(), + Nystroem(kernel="rbf", gamma=1.0, n_components=100, random_state=0), + LogisticRegression(penalty="l2", max_iter=1000), +) + +for C in Cs: + classifier.set_params(logisticregression__C=C) + plot_decision_boundary(classifier) + +# %% [markdown] tags=["solution"] +# +# - For the lowest values of `C`, the overall pipeline underfits: it predicts +# the majority class everywhere, as previously. +# - When `C` increases, the models starts to predict some datapoints from the +# "Chinstrap" class but the model is not very confident anywhere in the +# feature space. +# - The decision boundary is no longer a straight line: the linear model is now +# classifying in the 100-dimensional feature space created by the `Nystroem` +# transformer. As are result, the decision boundary induced by the overall +# pipeline is now expressive enough to wrap around the minority class. +# - For `C = 1` in particular, it finds a smooth red blob around most of the +# "Chinstrap" data points. When moving away from the data points, the model is +# less confident in its predictions and again tends to predict the majority +# class according to the proportion in the training set. +# - For higher values of `C`, the model starts to overfit: it is very confident +# in its predictions almost everywhere, but it should not be trusted: the +# model also makes a larger number of mistakes on the test set (not shown in +# the plot) while adopting a very curvy decision boundary to attempt fitting +# all the training points, including the noisy ones at the frontier between +# the two classes. This makes the decision boundary very sensitive to the +# sampling of the training set and as a result, it does not generalize well in +# that region. This is confirmed by the (slightly) lower accuracy on the test +# set. +# +# Finally, we can also note that the linear model on the raw features was as +# good or better than the best model using non-linear feature engineering. So in +# this case, we did not really need this extra complexity in our pipeline. +# **Simpler is better!** +# +# So to conclude, when using non-linear feature engineering, it is often +# possible to make the pipeline overfit, even if the original feature space is +# low-dimensional. As a result, it is important to tune the regularization +# parameter in conjunction with the parameters of the transformers (e.g. tuning +# `gamma` would be important here). This has a direct impact on the certainty of +# the predictions. diff --git a/appendix/notebook_timings.html b/appendix/notebook_timings.html index 0bdb6d69d..94ea14b4c 100644 --- a/appendix/notebook_timings.html +++ b/appendix/notebook_timings.html @@ -1029,9 +1029,9 @@
[appendix/notebook_timings.html: timings table row updated: 2023-08-30 10:07 → 2023-10-17 13:47, duration 3.97 → 11.76, cache, ✅]
The parameter penalty
can control the type of regularization to use,
-whereas the regularization strength is set using the parameter C
.
-Settingpenalty="none"
is equivalent to an infinitely large value of C
. In
-this exercise, we ask you to train a logistic regression classifier using the
-penalty="l2"
regularization (which happens to be the default in
-scikit-learn) to find by yourself the effect of the parameter C
.
We start by loading the dataset.
+In the previous Module we tuned the hyperparameter C
of the logistic
+regression without mentioning that it controls the regularization strength.
+Later, in the slides on 🎥 Intuitions on regularized linear models we
+mentioned that a small C
provides a more regularized model, whereas a
+non-regularized model is obtained with an infinitely large value of C
.
+Indeed, C
behaves as the inverse of the alpha
coefficient in the Ridge
+model.
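As a quick, self-contained illustration of this inverse relationship (a sketch on synthetic data, not part of the exercise; the values of C are arbitrary), the coefficients of LogisticRegression shrink as C decreases, just as Ridge coefficients shrink as alpha increases:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Smaller C <=> stronger penalty (plays the role of a larger alpha in Ridge).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
for C in [0.001, 1.0, 1000.0]:
    coef = LogisticRegression(C=C, max_iter=1000).fit(X, y).coef_
    print(f"C={C:g}  (alpha-like strength 1/C={1 / C:g})  mean |coef| = {np.abs(coef).mean():.3f}")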
In this exercise, we ask you to train a logistic regression classifier using
+different values of the parameter C
to find its effects by yourself.
We start by loading the dataset. We only keep the Adelie and Chinstrap classes
+to keep the discussion simple.
Note
If you want a deeper overview regarding this dataset, you can refer to the
@@ -700,7 +716,6 @@
import pandas as pd
penguins = pd.read_csv("../datasets/penguins_classification.csv")
-# only keep the Adelie and Chinstrap classes
penguins = (
penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index()
)
@@ -715,7 +730,9 @@ 📃 Solution for Exercise M4.03
from sklearn.model_selection import train_test_split
-penguins_train, penguins_test = train_test_split(penguins, random_state=0)
+penguins_train, penguins_test = train_test_split(
+ penguins, random_state=0, test_size=0.4
+)
data_train = penguins_train[culmen_columns]
data_test = penguins_test[culmen_columns]
@@ -726,97 +743,258 @@ 📃 Solution for Exercise M4.03
+We define a function to help us fit a given model and plot its decision
+boundary. We recall that by using a DecisionBoundaryDisplay
with diverging
+colormap, vmin=0
and vmax=1
, we ensure that the 0.5 probability is mapped
+to the white color. In other words, the darker the color, the closer the
+predicted probability is to 0 or 1 and the more confident the classifier is in
+its predictions.
+
+
+import matplotlib.pyplot as plt
+import seaborn as sns
+from sklearn.inspection import DecisionBoundaryDisplay
+
+
+def plot_decision_boundary(model):
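+ # Fit `model` on the training set, report its test accuracy, draw a filled map of the
+ # predicted probability with a dashed contour at the 0.5 level, and overlay the training points.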
+ model.fit(data_train, target_train)
+ accuracy = model.score(data_test, target_test)
+
+ disp = DecisionBoundaryDisplay.from_estimator(
+ model,
+ data_train,
+ response_method="predict_proba",
+ plot_method="pcolormesh",
+ cmap="RdBu_r",
+ alpha=0.8,
+ vmin=0.0,
+ vmax=1.0,
+ )
+ DecisionBoundaryDisplay.from_estimator(
+ model,
+ data_train,
+ response_method="predict_proba",
+ plot_method="contour",
+ linestyles="--",
+ linewidths=1,
+ alpha=0.8,
+ levels=[0.5],
+ ax=disp.ax_,
+ )
+ sns.scatterplot(
+ data=penguins_train,
+ x=culmen_columns[0],
+ y=culmen_columns[1],
+ hue=target_column,
+ palette=["tab:blue", "tab:red"],
+ ax=disp.ax_,
+ )
+ plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
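+ # `C` here is read from the enclosing scope: set it (as in the loops below) before calling this function.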
+ plt.title(f"C: {C} \n Accuracy on the test set: {accuracy:.2f}")
+
+
+
+
+Let’s now create our predictive model.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
-logistic_regression = make_pipeline(
- StandardScaler(), LogisticRegression(penalty="l2")
-)
+logistic_regression = make_pipeline(StandardScaler(), LogisticRegression())
-Given the following candidates for the C
parameter, find out the impact of
-C
on the classifier decision boundary. You can use
-sklearn.inspection.DecisionBoundaryDisplay.from_estimator
to plot the
-decision function boundary.
+
+Influence of the parameter C
on the decision boundary#
+Given the following candidates for the C
parameter and the
+plot_decision_boundary
function, find out the impact of C
on the
+classifier’s decision boundary.
+
+How does the value of C
impact the confidence on the predictions?
+How does it impact the underfit/overfit trade-off?
+How does it impact the position and orientation of the decision boundary?
+
+Try to give an interpretation on the reason for such behavior.
-Cs = [0.01, 0.1, 1, 10]
+Cs = [1e-6, 0.01, 0.1, 1, 10, 100, 1e6]
# solution
-import matplotlib.pyplot as plt
-import seaborn as sns
-from sklearn.inspection import DecisionBoundaryDisplay
-
for C in Cs:
logistic_regression.set_params(logisticregression__C=C)
- logistic_regression.fit(data_train, target_train)
- accuracy = logistic_regression.score(data_test, target_test)
-
- DecisionBoundaryDisplay.from_estimator(
- logistic_regression,
- data_test,
- response_method="predict",
- cmap="RdBu_r",
- alpha=0.5,
- )
- sns.scatterplot(
- data=penguins_test,
- x=culmen_columns[0],
- y=culmen_columns[1],
- hue=target_column,
- palette=["tab:red", "tab:blue"],
- )
- plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
- plt.title(f"C: {C} \n Accuracy on the test set: {accuracy:.2f}")
+ plot_decision_boundary(logistic_regression)
-[4 output figures removed: decision boundaries for C in 0.01, 0.1, 1, 10]
+[7 output figures added: decision boundaries for C in 1e-6, 0.01, 0.1, 1, 10, 100, 1e6]
-Look at the impact of the C
hyperparameter on the magnitude of the weights.
+On this series of plots we can observe several important points. Regarding the
+confidence on the predictions:
+
+For low values of C
(strong regularization), the classifier is less
+confident in its predictions. We are enforcing a spread sigmoid.
+For high values of C
(weak regularization), the classifier is more
+confident: the areas with dark blue (very confident in predicting “Adelie”)
+and dark red (very confident in predicting “Chinstrap”) nearly cover the
+entire feature space. We are enforcing a steep sigmoid.
+
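The difference in confidence described above can also be read directly from the predicted probabilities. A minimal sketch reusing the pipeline and data splits defined above (the two values of C are arbitrary):
# With a small C the probabilities are pulled toward the class proportions (low confidence);
# with a large C they are pushed toward 0 or 1 for most points (high confidence).
for C in [0.01, 100]:
    logistic_regression.set_params(logisticregression__C=C)
    logistic_regression.fit(data_train, target_train)
    print(f"C={C:g}", logistic_regression.predict_proba(data_test[:5]).round(2))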
+To answer the next question, keep in mind that misclassified data points are more
+costly when the classifier is more confident in its decision. Decision rules
+are mostly driven by avoiding such cost. From the previous observations we can
+then deduce that:
+
+The smaller the C
(the stronger the regularization), the lower the cost
+of a misclassification. As more data points lie in the low-confidence
+zone, the more the decision rules are influenced almost uniformly by all
+the data points. This leads to a less expressive model, which may underfit.
+The higher the value of C
(the weaker the regularization), the more the
+decision is influenced by a few training points very close to the boundary,
+where decisions are costly. Remember that models may overfit if the number
+of samples in the training set is too small, as at least a minimum of
+samples is needed to average the noise out.
+
+The orientation is the result of two factors: minimizing the number of
+training points misclassified with high confidence, and minimizing their distance to the
+decision boundary (notice how the contour line tries to align with the most
+misclassified data points in the dark-colored zone). This is closely related
+to the value of the weights of the model, which is explained in the next part
+of the exercise.
+Finally, for small values of C
the position of the decision boundary is
+affected by the class imbalance: when C
is near zero, the model predicts the
+majority class (as seen in the training set) everywhere in the feature space.
+In our case, there are approximately two times more “Adelie” than “Chinstrap”
+penguins. This explains why the decision boundary is shifted to the right when
+C
gets smaller. Indeed, the most regularized model predicts light blue
+almost everywhere in the feature space.
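To check this imbalance directly, one can look at the class proportions in the training set (a minimal sketch reusing the target_train series defined above; not part of the original exercise):
# Proportion of each class in the training set: "Adelie" is roughly twice as frequent.
print(target_train.value_counts(normalize=True))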
+
+
+Impact of the regularization on the weights#
+Look at the impact of the C
hyperparameter on the magnitude of the weights.
+Hint: You can access pipeline
+steps
+by name or position. Then you can query the attributes of that step such as
+coef_
.
# solution
-weights_ridge = []
+lr_weights = []
for C in Cs:
logistic_regression.set_params(logisticregression__C=C)
logistic_regression.fit(data_train, target_train)
coefs = logistic_regression[-1].coef_[0]
- weights_ridge.append(pd.Series(coefs, index=culmen_columns))
+ lr_weights.append(pd.Series(coefs, index=culmen_columns))
-weights_ridge = pd.concat(weights_ridge, axis=1, keys=[f"C: {C}" for C in Cs])
-weights_ridge.plot.barh()
+lr_weights = pd.concat(lr_weights, axis=1, keys=[f"C: {C}" for C in Cs])
+lr_weights.plot.barh()
_ = plt.title("LogisticRegression weights depending of C")
-[output figure removed: bar plot of LogisticRegression weights]
+[output figure added: bar plot of LogisticRegression weights for each C]
-We see that a small C
will shrink the weights values toward zero. It means
-that a small C
provides a more regularized model. Thus, C
is the inverse
-of the alpha
coefficient in the Ridge
model.
-Besides, with a strong penalty (i.e. small C
value), the weight of the
-feature “Culmen Depth (mm)” is almost zero. It explains why the decision
+
As a small C
provides a more regularized model, it shrinks the weight values
+toward zero, as in the Ridge
model.
+In particular, with a strong penalty (e.g. C = 0.01
), the weight of the feature
+named “Culmen Depth (mm)” is almost zero. It explains why the decision
separation in the plot is almost perpendicular to the “Culmen Length (mm)”
feature.
+For even stronger penalty strengths (e.g. C = 1e-6
), the weights of both
+features are almost zero. This explains why the predicted probability shown in the plot
+is almost constant across the feature space: it is only
+based on the intercept parameter of the model (which is never regularized).
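This can be verified by refitting the pipeline with the strongest penalty and inspecting the fitted attributes (a minimal sketch reusing objects defined above; not part of the original exercise):
# With C=1e-6 the coefficients collapse toward zero while the intercept typically does not.
logistic_regression.set_params(logisticregression__C=1e-6)
logistic_regression.fit(data_train, target_train)
print("coef_:     ", logistic_regression[-1].coef_)
print("intercept_:", logistic_regression[-1].intercept_)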
+
+
+Impact of the regularization with non-linear feature engineering#
+Use the plot_decision_boundary
function to repeat the experiment using a
+non-linear feature engineering pipeline. For this purpose, insert
+Nystroem(kernel="rbf", gamma=1, n_components=100)
between the
+StandardScaler
and the LogisticRegression
steps.
+
+Does the value of C
still impact the position of the decision boundary and
+the confidence of the model?
+What can you say about the impact of C
on the underfitting vs overfitting
+trade-off?
+
+
+
+from sklearn.kernel_approximation import Nystroem
+
+# solution
+classifier = make_pipeline(
+ StandardScaler(),
+ Nystroem(kernel="rbf", gamma=1.0, n_components=100, random_state=0),
+ LogisticRegression(penalty="l2", max_iter=1000),
+)
+
+for C in Cs:
+ classifier.set_params(logisticregression__C=C)
+ plot_decision_boundary(classifier)
+
+[7 output figures added: decision boundaries of the Nystroem pipeline, one per value of C]
+
+For the lowest values of C
, the overall pipeline underfits: it predicts
+the majority class everywhere, as previously.
+When C
increases, the model starts to predict some data points from the
+“Chinstrap” class but the model is not very confident anywhere in the
+feature space.
+The decision boundary is no longer a straight line: the linear model is now
+classifying in the 100-dimensional feature space created by the Nystroem
+transformer. As a result, the decision boundary induced by the overall
+pipeline is now expressive enough to wrap around the minority class.
+For C = 1
in particular, it finds a smooth red blob around most of the
+“Chinstrap” data points. When moving away from the data points, the model is
+less confident in its predictions and again tends to predict the majority
+class according to the proportion in the training set.
+For higher values of C
, the model starts to overfit: it is very confident
+in its predictions almost everywhere, but it should not be trusted; the
+model also makes a larger number of mistakes on the test set (not shown in
+the plot) while adopting a very curvy decision boundary to attempt fitting
+all the training points, including the noisy ones at the frontier between
+the two classes. This makes the decision boundary very sensitive to the
+sampling of the training set and as a result, it does not generalize well in
+that region. This is confirmed by the (slightly) lower accuracy on the test
+set.
+
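The (slightly) lower test accuracy mentioned in the last point can be checked directly (a minimal sketch reusing the classifier pipeline, Cs, and data splits defined above; not part of the original exercise):
# Test accuracy of the Nystroem pipeline for each C; per the discussion above,
# the largest values of C tend to give a slightly lower score.
for C in Cs:
    classifier.set_params(logisticregression__C=C)
    classifier.fit(data_train, target_train)
    print(f"C={C:g}: test accuracy = {classifier.score(data_test, target_test):.3f}")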
+Finally, we can also note that the linear model on the raw features was as
+good as or better than the best model using non-linear feature engineering. So in
+this case, we did not really need this extra complexity in our pipeline.
+Simpler is better!
+So to conclude, when using non-linear feature engineering, it is often
+possible to make the pipeline overfit, even if the original feature space is
+low-dimensional. As a result, it is important to tune the regularization
+parameter in conjunction with the parameters of the transformers (e.g. tuning
+gamma
would be important here). This has a direct impact on the certainty of
+the predictions.
+
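As a sketch of what such joint tuning could look like (the grid values below are illustrative assumptions, not part of the exercise):
from sklearn.model_selection import GridSearchCV

# Tune the Nystroem gamma together with the regularization strength C of the classifier.
param_grid = {
    "nystroem__gamma": [0.1, 1.0, 10.0],
    "logisticregression__C": [0.01, 0.1, 1.0, 10.0, 100.0],
}
search = GridSearchCV(classifier, param_grid=param_grid, cv=5)
search.fit(data_train, target_train)
print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")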