Commit: Solve conflicts
ArturoAmorQ committed Apr 26, 2024
2 parents bc96354 + 1237225 commit 1cb0046
Showing 29 changed files with 84 additions and 65 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -3,7 +3,7 @@
📢 📢 📢 A new session of the [Machine learning in Python with scikit-learn
MOOC](https://www.fun-mooc.fr/en/courses/machine-learning-python-scikit-learn),
is available starting on November 8th, 2023 and will remain open on self-paced
mode. Enroll for the full MOOC experience (quizz solutions, executable
mode. Enroll for the full MOOC experience (quiz solutions, executable
notebooks, discussion forum, etc ...) !

The MOOC is free and hosted on the [FUN-MOOC](https://fun-mooc.fr/) platform
1 change: 1 addition & 0 deletions jupyter-book/_config.yml
@@ -3,6 +3,7 @@
title : Scikit-learn course
author: scikit-learn developers
logo: 'scikit-learn-logo.png'
copyright: "2022-2024"

# Information about where the book exists on the web
description: >-
6 changes: 3 additions & 3 deletions jupyter-book/index.md
@@ -36,8 +36,8 @@ interpreting their predictions.
<a href="https://www.fun-mooc.fr/en/courses/machine-learning-python-scikit-learn">
"Machine learning in Python with scikit-learn MOOC"
</a>,
is available starting on October 18, 2022 and will last for 3 months. Enroll for
the full MOOC experience (quizz solutions, executable notebooks, discussion
is available starting on November 8th, 2023 and will remain open in self-paced mode.
Enroll for the full MOOC experience (quiz solutions, executable notebooks, discussion
forum, etc ...) !
</br>
The MOOC is free and the platform does not use the student data for any other purpose
@@ -79,7 +79,7 @@ You can cite us through the project's Zenodo archive using the following DOI:
[10.5281/zenodo.7220306](https://doi.org/10.5281/zenodo.7220306).

The following repository includes the notebooks, exercises and solutions to the
exercises (but not the quizz solutions ;):
exercises (but not the quizzes' solutions ;):

https://github.com/INRIA/scikit-learn-mooc/

2 changes: 1 addition & 1 deletion jupyter-book/tuning/parameter_tuning_manual_quiz_m3_01.md
@@ -1,7 +1,7 @@
# ✅ Quiz M3.01

```{admonition} Question
Which parameters below are hyperparameters of `HistGradientBosstingClassifier`?
Which parameters below are hyperparameters of `HistGradientBoostingClassifier`?
Remember we only consider hyperparameters to be those that potentially impact
the result of the learning procedure and subsequent predictions.
4 changes: 2 additions & 2 deletions notebooks/cross_validation_ex_01.ipynb
@@ -52,7 +52,7 @@
"exercise.\n",
"\n",
"Also, this classifier can become more flexible/expressive by using a so-called\n",
"kernel that makes the model become non-linear. Again, no requirement regarding\n",
"kernel that makes the model become non-linear. Again, no understanding regarding\n",
"the mathematics is required to accomplish this exercise.\n",
"\n",
"We will use an RBF kernel where a parameter `gamma` allows to tune the\n",
@@ -160,4 +160,4 @@
},
"nbformat": 4,
"nbformat_minor": 5
}
}
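The exercise above refers to a kernelized classifier whose flexibility is tuned through `gamma`. As a minimal sketch of such a model (the dataset and estimator choices below are illustrative assumptions, since the exercise's own code and data are not shown in this hunk):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Built-in dataset standing in for the exercise's data, which is not shown here.
data, target = load_breast_cancer(return_X_y=True)

for gamma in (0.001, 0.1, 10.0):
    # A larger gamma makes the RBF kernel more local, hence a more flexible
    # (and potentially overfitting) decision function.
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=gamma))
    scores = cross_val_score(model, data, target, cv=5)
    print(f"gamma={gamma}: {scores.mean():.3f} ± {scores.std():.3f}")
```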
14 changes: 7 additions & 7 deletions notebooks/metrics_classification.ipynb
@@ -311,13 +311,13 @@
"blood when the classifier predicted so or the fraction of people predicted to\n",
"have given blood out of the total population that actually did so.\n",
"\n",
"The former metric, known as the precision, is defined as TP / (TP + FP) and\n",
"The former metric, known as the precision, is defined as `TP / (TP + FP)` and\n",
"represents how likely the person actually gave blood when the classifier\n",
"predicted that they did. The latter, known as the recall, defined as TP / (TP\n",
"+ FN) and assesses how well the classifier is able to correctly identify\n",
"people who did give blood. We could, similarly to accuracy, manually compute\n",
"these values, however scikit-learn provides functions to compute these\n",
"statistics."
"predicted that they did. The latter, known as the recall, defined as\n",
"`TP / (TP + FN)` and assesses how well the classifier is able to correctly\n",
"identify people who did give blood. We could, similarly to accuracy,\n",
"manually compute these values, however scikit-learn provides functions to\n",
"compute these statistics."
]
},
{
@@ -664,4 +664,4 @@
},
"nbformat": 4,
"nbformat_minor": 5
}
}
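To make the precision and recall definitions above concrete, here is a small worked sketch; the counts are made up for illustration and are not taken from the notebook:

```python
# Hypothetical counts of true positives, false positives and false negatives.
TP, FP, FN = 30, 10, 20

precision = TP / (TP + FP)  # 30 / 40 = 0.75
recall = TP / (TP + FN)     # 30 / 50 = 0.60
print(f"precision={precision:.2f}, recall={recall:.2f}")

# The same values with scikit-learn, given matching true and predicted labels.
from sklearn.metrics import precision_score, recall_score

y_true = [1] * 30 + [0] * 10 + [1] * 20 + [0] * 40
y_pred = [1] * 30 + [1] * 10 + [0] * 20 + [0] * 40
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```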
9 changes: 9 additions & 0 deletions python_scripts/01_tabular_data_exploration.py
@@ -70,6 +70,15 @@
# %%
adult_census.head()

# %% [markdown]
# An alternative is to omit the `head` method. This would output the initial and
# final rows and columns, but everything in between is not shown by default. It
# also provides the dataframe's dimensions at the bottom in the format `n_rows`
# x `n_columns`.

# %%
adult_census
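What counts as "shown by default" is governed by pandas display options; a short sketch, assuming the `adult_census` dataframe loaded earlier in this notebook (the option values below are arbitrary):

```python
import pandas as pd

# pandas truncates the printed representation of large dataframes; these
# options control how many rows and columns are shown before truncation
# (defaults vary slightly across pandas versions).
print(pd.get_option("display.max_rows"), pd.get_option("display.max_columns"))

# Temporarily show more rows for a single display.
with pd.option_context("display.max_rows", 20, "display.min_rows", 20):
    print(adult_census)
```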

# %% [markdown]
# The column named **class** is our target variable (i.e., the variable which we
# want to predict). The two possible classes are `<=50K` (low-revenue) and
8 changes: 4 additions & 4 deletions python_scripts/02_numerical_pipeline_hands_on.py
@@ -34,7 +34,7 @@
adult_census = pd.read_csv("../datasets/adult-census.csv")
# drop the duplicated column `"education-num"` as stated in the first notebook
adult_census = adult_census.drop(columns="education-num")
adult_census.head()
adult_census

# %% [markdown]
# The next step separates the target from the data. We performed the same
@@ -44,7 +44,7 @@
data, target = adult_census.drop(columns="class"), adult_census["class"]

# %%
data.head()
data

# %%
target
@@ -95,7 +95,7 @@
# the `object` data type.

# %%
data.head()
data

# %% [markdown]
# We see that the `object` data type corresponds to columns containing strings.
@@ -105,7 +105,7 @@

# %%
numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]
data[numerical_columns].head()
data[numerical_columns]
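Listing the numerical columns by name works, but pandas can also select them by data type; a minimal sketch, assuming the `data` dataframe defined earlier in this notebook:

```python
# "number" matches all integer and floating dtypes; with the dtypes seen
# above, this is expected to recover the same four numerical columns.
numerical_data = data.select_dtypes(include="number")
numerical_data.columns.tolist()
```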

# %% [markdown]
# Now that we limited the dataset to numerical columns only, we can analyse
6 changes: 3 additions & 3 deletions python_scripts/02_numerical_pipeline_introduction.py
@@ -39,7 +39,7 @@
# Let's have a look at the first records of this dataframe:

# %%
adult_census.head()
adult_census

# %% [markdown]
# We see that this CSV file contains all information: the target that we would
@@ -56,10 +56,10 @@

# %%
data = adult_census.drop(columns=[target_name])
data.head()
data

# %% [markdown]
# We can now linger on the variables, also denominated features, that we later
# We can now focus on the variables, also denominated features, that we later
# use to build our predictive model. In addition, we can also check how many
# samples are available in our dataset.

13 changes: 7 additions & 6 deletions python_scripts/03_categorical_pipeline.py
@@ -81,7 +81,7 @@

# %%
data_categorical = data[categorical_columns]
data_categorical.head()
data_categorical

# %%
print(f"The dataset is composed of {data_categorical.shape[1]} features")
@@ -194,7 +194,7 @@

# %%
print(f"The dataset is composed of {data_categorical.shape[1]} features")
data_categorical.head()
data_categorical

# %%
data_encoded = encoder.fit_transform(data_categorical)
@@ -253,7 +253,7 @@
# and check the generalization performance of this machine learning pipeline using
# cross-validation.
#
# Before we create the pipeline, we have to linger on the `native-country`.
# Before we create the pipeline, we have to focus on the `native-country`.
# Let's recall some statistics regarding this column.

# %%
@@ -329,9 +329,10 @@
print(f"The accuracy is: {scores.mean():.3f} ± {scores.std():.3f}")

# %% [markdown]
# As you can see, this representation of the categorical variables is
# slightly more predictive of the revenue than the numerical variables
# that we used previously.
# As you can see, this representation of the categorical variables is slightly
# more predictive of the revenue than the numerical variables that we used
# previously. This is because we have more (predictive) categorical features
# than numerical ones.
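The pipeline whose accuracy is printed above lies outside this hunk; a minimal sketch of the kind of model being evaluated (the estimator and encoder settings are assumptions, and `data_categorical` and `target` are the variables defined earlier in this notebook):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# handle_unknown="ignore" avoids errors when a validation fold contains a
# category unseen during fit, e.g. a rare `native-country` value.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"), LogisticRegression(max_iter=500)
)
cv_results = cross_validate(model, data_categorical, target)
scores = cv_results["test_score"]
print(f"The accuracy is: {scores.mean():.3f} ± {scores.std():.3f}")
```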

# %% [markdown]
#
@@ -165,7 +165,7 @@
# method. As an example, we predict on the five first samples from the test set.

# %%
data_test.head()
data_test

# %%
model.predict(data_test)[:5]
2 changes: 1 addition & 1 deletion python_scripts/cross_validation_learning_curve.py
@@ -13,7 +13,7 @@
# generalizing. Besides these aspects, it is also important to understand how
# the different errors are influenced by the number of samples available.
#
# In this notebook, we will show this aspect by looking a the variability of
# In this notebook, we will show this aspect by looking at the variability of
# the different errors.
#
# Let's first load the data and create the same model as in the previous
12 changes: 6 additions & 6 deletions python_scripts/cross_validation_stratification.py
@@ -52,12 +52,12 @@
print("TRAIN:", train_index, "TEST:", test_index)

# %% [markdown]
# By defining three splits, we use three samples (1-fold) for testing and six (2-folds) for
# training each time. `KFold` does not shuffle by default. It means that the
# three first samples are selected for the testing set at the first split, then
# the three next three samples for the second split, and the three next for the
# last split. In the end, all samples have been used in testing at least once
# among the different splits.
# By defining three splits, we use three samples (1 fold) for testing and six
# (2 folds) for training each time. `KFold` does not shuffle by default. This
# means that the first three samples are selected for the testing set in the
# first split, then the next three samples in the second split, and the last
# three in the third split. In the end, all samples have been used for testing
# at least once among the different splits.
#
# Now, let's apply this strategy to check the generalization performance of our
# model.
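Before that, here is a tiny worked example of the splitting pattern described above, using a made-up array of nine samples rather than the actual dataset:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(9).reshape(-1, 1)  # nine toy samples
cv = KFold(n_splits=3)  # no shuffling by default

for train_index, test_index in cv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)

# Expected output:
# TRAIN: [3 4 5 6 7 8] TEST: [0 1 2]
# TRAIN: [0 1 2 6 7 8] TEST: [3 4 5]
# TRAIN: [0 1 2 3 4 5] TEST: [6 7 8]
```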
10 changes: 5 additions & 5 deletions python_scripts/cross_validation_train_test.py
@@ -12,7 +12,7 @@
# of predictive models. While this section could be slightly redundant, we
# intend to go into details into the cross-validation framework.
#
# Before we dive in, let's linger on the reasons for always having training and
# Before we dive in, let's focus on the reasons for always having training and
# testing sets. Let's first look at the limitation of using a dataset without
# keeping any samples out.
#
@@ -34,22 +34,22 @@
# notebook. The target to be predicted is a continuous variable and not anymore
# discrete. This task is called regression.
#
# This, we will use a predictive model specific to regression and not to
# Thus, we will use a predictive model specific to regression and not to
# classification.

# %%
print(housing.DESCR)

# %%
data.head()
data

# %% [markdown]
# To simplify future visualization, let's transform the prices from the 100
# (k\$) range to the thousand dollars (k\$) range.

# %%
target *= 100
target.head()
target

# %% [markdown]
# ```{note}
@@ -218,7 +218,7 @@
import pandas as pd

cv_results = pd.DataFrame(cv_results)
cv_results.head()
cv_results
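For context, `cv_results` as used above is the dictionary returned by `cross_validate`; the call that produces it sits outside this hunk, so the sketch below assumes a simple regressor applied to the `data` and `target` loaded earlier in this notebook:

```python
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Assumed regressor; the estimator actually evaluated in the notebook is
# defined outside this hunk.
regressor = DecisionTreeRegressor(random_state=0)
cv_results = cross_validate(regressor, data, target, cv=5)

# cross_validate returns a dict with one entry per fold for "fit_time",
# "score_time" and "test_score", which is convenient to wrap in a dataframe.
cv_results = pd.DataFrame(cv_results)
cv_results
```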

# %% [markdown]
# ```{tip}
2 changes: 1 addition & 1 deletion python_scripts/datasets_blood_transfusion.py
@@ -46,7 +46,7 @@
# * `Recency`: the time in months since the last time a person intended to give
# blood;
# * `Frequency`: the number of time a person intended to give blood in the past;
# * `Monetary`: the amount of blood given in the past (in c.c.);
# * `Monetary`: the amount of blood given in the past (in cm³);
# * `Time`: the time in months since the first time a person intended to give
# blood.
#
7 changes: 7 additions & 0 deletions python_scripts/ensemble_sol_02.py
@@ -103,3 +103,10 @@

plt.plot(data_range[feature_name], forest_predictions, label="Random forest")
_ = plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")

# %% [markdown] tags=["solution"]
# The random forest reduces the overfitting of the individual trees but still
# overfits itself. In the section on "hyperparameter tuning with ensemble
# methods" we will see how to further mitigate this effect. Still, interested
# users may increase the number of estimators in the forest and try different
# values of, e.g., `min_samples_split`.
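A minimal sketch of the experiment suggested above; the synthetic data and parameter values are illustrative assumptions, not the exercise's own setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Noisy 1-d regression problem, made up for illustration.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.3, size=300)

for min_samples_split in (2, 10, 50):
    forest = RandomForestRegressor(
        n_estimators=300, min_samples_split=min_samples_split, random_state=0
    )
    cv = cross_validate(forest, X, y, return_train_score=True, cv=5)
    # A shrinking train/test gap indicates less overfitting.
    gap = cv["train_score"].mean() - cv["test_score"].mean()
    print(
        f"min_samples_split={min_samples_split}: "
        f"train={cv['train_score'].mean():.2f}, "
        f"test={cv['test_score'].mean():.2f}, gap={gap:.2f}"
    )
```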
2 changes: 1 addition & 1 deletion python_scripts/linear_models_ex_02.py
@@ -52,7 +52,7 @@

data = penguins_non_missing[columns]
target = penguins_non_missing[target_name]
data.head()
data

# %% [markdown]
# Now it is your turn to train a linear regression model on this dataset. First,
2 changes: 1 addition & 1 deletion python_scripts/linear_models_ex_04.py
@@ -17,7 +17,7 @@
# In the previous Module we tuned the hyperparameter `C` of the logistic
# regression without mentioning that it controls the regularization strength.
# Later, on the slides on 🎥 **Intuitions on regularized linear models** we
# metioned that a small `C` provides a more regularized model, whereas a
# mentioned that a small `C` provides a more regularized model, whereas a
# non-regularized model is obtained with an infinitely large value of `C`.
# Indeed, `C` behaves as the inverse of the `alpha` coefficient in the `Ridge`
# model.
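A small sketch illustrating the claim that a small `C` regularizes more strongly; the dataset and the values of `C` are arbitrary choices for this illustration, not the exercise's own data:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

for C in (0.01, 1.0, 100.0):
    model = make_pipeline(
        StandardScaler(), LogisticRegression(C=C, max_iter=1_000)
    ).fit(X, y)
    # Smaller C means a stronger penalty (like a larger `alpha` in Ridge), so
    # the learned weights are shrunk towards zero.
    print(f"C={C}: mean |coef| = {np.abs(model[-1].coef_).mean():.3f}")
```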
2 changes: 1 addition & 1 deletion python_scripts/linear_models_sol_02.py
@@ -46,7 +46,7 @@

data = penguins_non_missing[columns]
target = penguins_non_missing[target_name]
data.head()
data

# %% [markdown]
# Now it is your turn to train a linear regression model on this dataset. First,
2 changes: 1 addition & 1 deletion python_scripts/linear_models_sol_04.py
@@ -11,7 +11,7 @@
# In the previous Module we tuned the hyperparameter `C` of the logistic
# regression without mentioning that it controls the regularization strength.
# Later, on the slides on 🎥 **Intuitions on regularized linear models** we
# metioned that a small `C` provides a more regularized model, whereas a
# mentioned that a small `C` provides a more regularized model, whereas a
# non-regularized model is obtained with an infinitely large value of `C`.
# Indeed, `C` behaves as the inverse of the `alpha` coefficient in the `Ridge`
# model.
2 changes: 1 addition & 1 deletion python_scripts/linear_regression_without_sklearn.py
@@ -22,7 +22,7 @@
import pandas as pd

penguins = pd.read_csv("../datasets/penguins_regression.csv")
penguins.head()
penguins

# %% [markdown]
# We aim to solve the following problem: using the flipper length of a penguin,
14 changes: 7 additions & 7 deletions python_scripts/metrics_classification.py
@@ -78,7 +78,7 @@
# predictions a classifier can provide.
#
# For this reason, we will create a synthetic sample for a new potential donor:
# they donated blood twice in the past (1000 c.c. each time). The last time was
# they donated blood twice in the past (1000 cm³ each time). The last time was
# 6 months ago, and the first time goes back to 20 months ago.

# %%
@@ -188,13 +188,13 @@
# blood when the classifier predicted so or the fraction of people predicted to
# have given blood out of the total population that actually did so.
#
# The former metric, known as the precision, is defined as TP / (TP + FP) and
# The former metric, known as the precision, is defined as `TP / (TP + FP)` and
# represents how likely the person actually gave blood when the classifier
# predicted that they did. The latter, known as the recall, defined as TP / (TP
# + FN) and assesses how well the classifier is able to correctly identify
# people who did give blood. We could, similarly to accuracy, manually compute
# these values, however scikit-learn provides functions to compute these
# statistics.
# predicted that they did. The latter, known as the recall, is defined as
# `TP / (TP + FN)` and assesses how well the classifier is able to correctly
# identify people who did give blood. We could, similarly to accuracy,
# manually compute these values; however, scikit-learn provides functions to
# compute these statistics.

# %%
from sklearn.metrics import precision_score, recall_score
5 changes: 3 additions & 2 deletions python_scripts/metrics_regression.py
@@ -97,8 +97,9 @@
# %% [markdown]
# The $R^2$ score represents the proportion of variance of the target that is
# explained by the independent variables in the model. The best score possible
# is 1 but there is no lower bound. However, a model that predicts the expected
# value of the target would get a score of 0.
# is 1 but there is no lower bound. However, a model that predicts the [expected
# value](https://en.wikipedia.org/wiki/Expected_value) of the target would get a
# score of 0.
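Writing the definition out makes this concrete (a standard formulation, with $\hat{y}_i$ the predictions and $\bar{y}$ the mean of the target):

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$

A model that always predicts $\bar{y}$ makes the numerator equal to the denominator, hence $R^2 = 0$; predictions worse than that constant baseline push the ratio above 1 and the score below 0.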

# %%
from sklearn.dummy import DummyRegressor