Commit

[ci skip] MAINT Fix broken sphinx-book-theme reference in CI (#732) 4…
ArturoAmorQ committed Oct 17, 2023
1 parent 2971615 commit 159d244
Showing 23 changed files with 460 additions and 108 deletions.
254 changes: 206 additions & 48 deletions _sources/python_scripts/linear_models_sol_03.py
@@ -8,14 +8,20 @@
# %% [markdown]
# # 📃 Solution for Exercise M4.03
#
# In the previous Module we tuned the hyperparameter `C` of the logistic
# regression without mentioning that it controls the regularization strength.
# Later, in the slides on 🎥 **Intuitions on regularized linear models** we
# mentioned that a small `C` provides a more regularized model, whereas a
# non-regularized model is obtained with an infinitely large value of `C`.
# Indeed, `C` behaves as the inverse of the `alpha` coefficient in the `Ridge`
# model.
#
# In this exercise, we ask you to train a logistic regression classifier using
# different values of the parameter `C` to find its effects by yourself.
#
# We start by loading the dataset. We only keep the Adelie and Chinstrap classes
# to simplify the discussion.
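
# %% [markdown]
# Before loading the data, the following small standalone sketch (not part of
# the original exercise) illustrates the inverse relationship mentioned above:
# increasing `alpha` in `Ridge` and decreasing `C` in `LogisticRegression` both
# shrink the learned coefficients. The synthetic data and variable names below
# are hypothetical and only used for illustration.

# %%
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 2))
y_demo_reg = X_demo @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=100)
y_demo_clf = (y_demo_reg > 0).astype(int)

for alpha, C in [(0.01, 100.0), (1.0, 1.0), (100.0, 0.01)]:
    ridge_coef = Ridge(alpha=alpha).fit(X_demo, y_demo_reg).coef_
    logreg = LogisticRegression(C=C, max_iter=1_000).fit(X_demo, y_demo_clf)
    print(
        f"alpha={alpha:>6}, C={C:>6} -> "
        f"Ridge coef: {ridge_coef.round(2)}, "
        f"LogisticRegression coef: {logreg.coef_[0].round(2)}"
    )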


# %% [markdown]
# ```{note}
@@ -27,7 +33,6 @@
import pandas as pd

penguins = pd.read_csv("../datasets/penguins_classification.csv")
penguins = (
    penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index()
)
@@ -38,7 +43,9 @@
# %%
from sklearn.model_selection import train_test_split

penguins_train, penguins_test = train_test_split(
    penguins, random_state=0, test_size=0.4
)

data_train = penguins_train[culmen_columns]
data_test = penguins_test[culmen_columns]
@@ -47,76 +54,227 @@
target_test = penguins_test[target_column]

# %% [markdown]
# We define a function to help us fit a given `model` and plot its decision
# boundary. We recall that by using a `DecisionBoundaryDisplay` with a diverging
# colormap, `vmin=0` and `vmax=1`, we ensure that the 0.5 probability is mapped
# to the white color. Equivalently, the darker the color, the closer the
# predicted probability is to 0 or 1 and the more confident the classifier is in
# its predictions.

# %%
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.inspection import DecisionBoundaryDisplay


def plot_decision_boundary(model):
    model.fit(data_train, target_train)
    accuracy = model.score(data_test, target_test)

    # filled map of the predicted probability of the positive class
    disp = DecisionBoundaryDisplay.from_estimator(
        model,
        data_train,
        response_method="predict_proba",
        plot_method="pcolormesh",
        cmap="RdBu_r",
        alpha=0.8,
        vmin=0.0,
        vmax=1.0,
    )
    # dashed contour at the 0.5 probability level, i.e. the decision boundary
    DecisionBoundaryDisplay.from_estimator(
        model,
        data_train,
        response_method="predict_proba",
        plot_method="contour",
        linestyles="--",
        linewidths=1,
        alpha=0.8,
        levels=[0.5],
        ax=disp.ax_,
    )
    sns.scatterplot(
        data=penguins_train,
        x=culmen_columns[0],
        y=culmen_columns[1],
        hue=target_column,
        palette=["tab:blue", "tab:red"],
        ax=disp.ax_,
    )
    plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
    # the value of `C` shown in the title is read from the enclosing scope,
    # where the loops below set it before calling this helper
    plt.title(f"C: {C} \n Accuracy on the test set: {accuracy:.2f}")


# %% [markdown]
# Let's now create our predictive model.

# %%
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

logistic_regression = make_pipeline(StandardScaler(), LogisticRegression())

# %% [markdown]
# ## Influence of the parameter `C` on the decision boundary
#
# Given the following candidates for the `C` parameter and the
# `plot_decision_boundary` function, find out the impact of `C` on the
# classifier's decision boundary.
#
# - How does the value of `C` impact the confidence on the predictions?
# - How does it impact the underfit/overfit trade-off?
# - How does it impact the position and orientation of the decision boundary?
#
# Try to give an interpretation of the reason for such behavior.

# %%
Cs = [1e-6, 0.01, 0.1, 1, 10, 100, 1e6]

# solution
for C in Cs:
    logistic_regression.set_params(logisticregression__C=C)
    plot_decision_boundary(logistic_regression)

# %% [markdown] tags=["solution"]
#
# On this series of plots we can observe several important points. Regarding the
# confidence on the predictions:
#
# - For low values of `C` (strong regularization), the classifier is less
# confident in its predictions. We are enforcing a **spread sigmoid**.
# - For high values of `C` (weak regularization), the classifier is more
# confident: the areas with dark blue (very confident in predicting "Adelie")
# and dark red (very confident in predicting "Chinstrap") nearly cover the
# entire feature space. We are enforcing a **steep sigmoid**.
#
# To answer the next question, keep in mind that misclassified data points are
# more costly when the classifier is more confident in its decision. Decision
# rules are mostly driven by avoiding such costs. From the previous observations
# we can then deduce that:
#
# - The smaller the `C` (the stronger the regularization), the lower the cost
# of a misclassification. As more data points lie in the low-confidence zone,
# the decision rules are influenced almost uniformly by all the data points.
# This leads to a less expressive model, which may underfit.
# - The higher the value of `C` (the weaker the regularization), the more the
# decision is influenced by a few training points very close to the boundary,
# where decisions are costly. Remember that models may overfit if the number
# of samples in the training set is too small, as a minimum number of samples
# is needed to average the noise out.
#
# The orientation of the decision boundary is the result of two factors:
# minimizing the number of training points misclassified with high confidence
# and keeping those misclassified points close to the boundary (notice how the
# contour line tries to align with the most misclassified data points in the
# dark-colored zone). This is closely related to the values of the weights of
# the model, which we examine in the next part of the exercise.
#
# Finally, for small values of `C` the position of the decision boundary is
# affected by the class imbalance: when `C` is near zero, the model predicts the
# majority class (as seen in the training set) everywhere in the feature space.
# In our case, there are approximately two times more "Adelie" than "Chinstrap"
# penguins. This explains why the decision boundary is shifted to the right when
# `C` gets smaller. Indeed, the most regularized model predicts light blue
# almost everywhere in the feature space.
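
# %% [markdown] tags=["solution"]
# As a quick check (added here as an illustration, not part of the original
# solution), we can count the classes in the training set to confirm the
# imbalance mentioned above:

# %% tags=["solution"]
target_train.value_counts()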

# %% [markdown]
# ## Impact of the regularization on the weights
#
# Look at the impact of the `C` hyperparameter on the magnitude of the weights.
# **Hint**: You can [access pipeline
# steps](https://scikit-learn.org/stable/modules/compose.html#access-pipeline-steps)
# by name or position. Then you can query the attributes of that step such as
# `coef_`.
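
# %% [markdown]
# As a quick illustration of the hint (an added sketch, not part of the
# original exercise), the three access patterns below all point to the same
# `LogisticRegression` step; the step name "logisticregression" is generated
# automatically by `make_pipeline` from the class name.

# %%
logistic_regression.fit(data_train, target_train)
print(logistic_regression[-1].coef_)  # access by position
print(logistic_regression["logisticregression"].coef_)  # access by name
print(logistic_regression.named_steps["logisticregression"].coef_)  # named_steps attribute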

# %%
# solution
lr_weights = []
for C in Cs:
    logistic_regression.set_params(logisticregression__C=C)
    logistic_regression.fit(data_train, target_train)
    coefs = logistic_regression[-1].coef_[0]
    lr_weights.append(pd.Series(coefs, index=culmen_columns))

# %% tags=["solution"]
lr_weights = pd.concat(lr_weights, axis=1, keys=[f"C: {C}" for C in Cs])
lr_weights.plot.barh()
_ = plt.title("LogisticRegression weights depending on C")

# %% [markdown] tags=["solution"]
# As a small `C` provides a more regularized model, it shrinks the weight values
# toward zero, as in the `Ridge` model.
#
# In particular, with a strong penalty (e.g. `C = 0.01`), the weight of the
# feature named "Culmen Depth (mm)" is almost zero. It explains why the decision
# separation in the plot is almost perpendicular to the "Culmen Length (mm)"
# feature.
#
# For even stronger penalty strengths (e.g. `C = 1e-6`), the weights of both
# features are almost zero. It explains why the decision separation in the plot
# is almost constant in the feature space: the predicted probability is only
# based on the intercept parameter of the model (which is never regularized).
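
# %% [markdown] tags=["solution"]
# The following small check (added as an illustration, not part of the original
# solution) shows that with a very strong penalty the weights collapse while the
# intercept survives, and that the intercept then roughly matches the log-odds
# of the class frequencies observed in the training set.

# %% tags=["solution"]
import numpy as np

logistic_regression.set_params(logisticregression__C=1e-6)
logistic_regression.fit(data_train, target_train)
print("weights:", logistic_regression[-1].coef_[0])
print("intercept:", logistic_regression[-1].intercept_[0])

# "Chinstrap" is the positive class (it comes after "Adelie" alphabetically)
p_chinstrap = (target_train == "Chinstrap").mean()
print("log-odds of 'Chinstrap':", np.log(p_chinstrap / (1 - p_chinstrap)))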

# %% [markdown]
# ## Impact of the regularization with non-linear feature engineering
#
# Use the `plot_decision_boundary` function to repeat the experiment using a
# non-linear feature engineering pipeline. For such purpose, insert
# `Nystroem(kernel="rbf", gamma=1, n_components=100)` between the
# `StandardScaler` and the `LogisticRegression` steps.
#
# - Does the value of `C` still impact the position of the decision boundary and
# the confidence of the model?
# - What can you say about the impact of `C` on the underfitting vs overfitting
# trade-off?

# %%
from sklearn.kernel_approximation import Nystroem

# solution
classifier = make_pipeline(
    StandardScaler(),
    Nystroem(kernel="rbf", gamma=1.0, n_components=100, random_state=0),
    LogisticRegression(penalty="l2", max_iter=1000),
)

for C in Cs:
    classifier.set_params(logisticregression__C=C)
    plot_decision_boundary(classifier)

# %% [markdown] tags=["solution"]
#
# - For the lowest values of `C`, the overall pipeline underfits: it predicts
# the majority class everywhere, as previously.
# - When `C` increases, the model starts to predict some data points from the
# "Chinstrap" class, but it is not very confident anywhere in the feature
# space.
# - The decision boundary is no longer a straight line: the linear model is now
# classifying in the 100-dimensional feature space created by the `Nystroem`
# transformer. As a result, the decision boundary induced by the overall
# pipeline is now expressive enough to wrap around the minority class.
# - For `C = 1` in particular, it finds a smooth red blob around most of the
# "Chinstrap" data points. When moving away from the data points, the model is
# less confident in its predictions and again tends to predict the majority
# class according to the proportion in the training set.
# - For higher values of `C`, the model starts to overfit: it is very confident
# in its predictions almost everywhere, but it should not be trusted. The
# model also makes a larger number of mistakes on the test set (not shown in
# the plot) while adopting a very curvy decision boundary to try to fit all
# the training points, including the noisy ones at the frontier between the
# two classes. This makes the decision boundary very sensitive to the sampling
# of the training set and, as a result, it does not generalize well in that
# region. This is confirmed by the (slightly) lower accuracy on the test set.
#
# Finally, we can also note that the linear model on the raw features was as
# good or better than the best model using non-linear feature engineering. So in
# this case, we did not really need this extra complexity in our pipeline.
# **Simpler is better!**
#
# So to conclude, when using non-linear feature engineering, it is often
# possible to make the pipeline overfit, even if the original feature space is
# low-dimensional. As a result, it is important to tune the regularization
# parameter in conjunction with the parameters of the transformers (e.g. tuning
# `gamma` would be important here). This has a direct impact on the certainty of
# the predictions.
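
# %% [markdown] tags=["solution"]
# One possible way to perform such a joint tuning (sketched here as an
# illustration, not part of the original solution) is a small grid-search over
# `C` and `gamma`; the grid values below are arbitrary choices:

# %% tags=["solution"]
from sklearn.model_selection import GridSearchCV

param_grid = {
    "nystroem__gamma": [0.01, 0.1, 1.0, 10.0],
    "logisticregression__C": [0.01, 0.1, 1.0, 10.0, 100.0],
}
search = GridSearchCV(classifier, param_grid=param_grid, cv=5)
search.fit(data_train, target_train)
print("Best parameters:", search.best_params_)
print(
    "Test accuracy of the tuned pipeline: "
    f"{search.score(data_test, target_test):.2f}"
)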
4 changes: 2 additions & 2 deletions appendix/notebook_timings.html
@@ -1029,9 +1029,9 @@ <h1>Notebook timings<a class="headerlink" href="#notebook-timings" title="Permal
<td><p></p></td>
</tr>
<tr class="row-odd"><td><p><a class="xref doc reference internal" href="../python_scripts/linear_models_sol_03.html"><span class="doc">python_scripts/linear_models_sol_03</span></a></p></td>
<td><p>2023-08-30 10:07</p></td>
<td><p>2023-10-17 13:47</p></td>
<td><p>cache</p></td>
<td><p>3.97</p></td>
<td><p>11.76</p></td>
<td><p></p></td>
</tr>
<tr class="row-even"><td><p><a class="xref doc reference internal" href="../python_scripts/linear_regression_in_sklearn.html"><span class="doc">python_scripts/linear_regression_in_sklearn</span></a></p></td>