Commit: Solve conflicts
ArturoAmorQ committed Apr 26, 2024
2 parents bc96354 + 1237225 commit 1cb0046
Showing 29 changed files with 84 additions and 65 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -3,7 +3,7 @@
📢 📢 📢 A new session of the [Machine learning in Python with scikit-learn
MOOC](https://www.fun-mooc.fr/en/courses/machine-learning-python-scikit-learn),
is available starting on November 8th, 2023 and will remain open on self-paced
mode. Enroll for the full MOOC experience (quizz solutions, executable
mode. Enroll for the full MOOC experience (quiz solutions, executable
notebooks, discussion forum, etc ...) !

The MOOC is free and hosted on the [FUN-MOOC](https://fun-mooc.fr/) platform
1 change: 1 addition & 0 deletions jupyter-book/_config.yml
@@ -3,6 +3,7 @@
title : Scikit-learn course
author: scikit-learn developers
logo: 'scikit-learn-logo.png'
copyright: "2022-2024"

# Information about where the book exists on the web
description: >-
6 changes: 3 additions & 3 deletions jupyter-book/index.md
@@ -36,8 +36,8 @@ interpreting their predictions.
<a href="https://www.fun-mooc.fr/en/courses/machine-learning-python-scikit-learn">
"Machine learning in Python with scikit-learn MOOC"
</a>,
is available starting on October 18, 2022 and will last for 3 months. Enroll for
the full MOOC experience (quizz solutions, executable notebooks, discussion
is available starting on November 8th, 2023 and will remain open in self-paced mode.
Enroll for the full MOOC experience (quiz solutions, executable notebooks, discussion
forum, etc ...) !
</br>
The MOOC is free and the platform does not use the student data for any other purpose
@@ -79,7 +79,7 @@ You can cite us through the project's Zenodo archive using the following DOI:
[10.5281/zenodo.7220306](https://doi.org/10.5281/zenodo.7220306).

The following repository includes the notebooks, exercises and solutions to the
exercises (but not the quizz solutions ;):
exercises (but not the quizzes' solutions ;):

https://github.com/INRIA/scikit-learn-mooc/

2 changes: 1 addition & 1 deletion jupyter-book/tuning/parameter_tuning_manual_quiz_m3_01.md
@@ -1,7 +1,7 @@
# ✅ Quiz M3.01

```{admonition} Question
Which parameters below are hyperparameters of `HistGradientBosstingClassifier`?
Which parameters below are hyperparameters of `HistGradientBoostingClassifier`?
Remember we only consider hyperparameters to be those that potentially impact
the result of the learning procedure and subsequent predictions.
4 changes: 2 additions & 2 deletions notebooks/cross_validation_ex_01.ipynb
@@ -52,7 +52,7 @@
"exercise.\n",
"\n",
"Also, this classifier can become more flexible/expressive by using a so-called\n",
"kernel that makes the model become non-linear. Again, no requirement regarding\n",
"kernel that makes the model become non-linear. Again, no understanding regarding\n",
"the mathematics is required to accomplish this exercise.\n",
"\n",
"We will use an RBF kernel where a parameter `gamma` allows to tune the\n",
@@ -160,4 +160,4 @@
},
"nbformat": 4,
"nbformat_minor": 5
}
}
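The exercise above refers to a kernelized classifier whose flexibility is tuned through `gamma`. As a minimal sketch of such a model (the dataset and estimator choices below are illustrative assumptions, since the exercise's own code and data are not shown in this hunk):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Built-in dataset standing in for the exercise's data, which is not shown here.
data, target = load_breast_cancer(return_X_y=True)

for gamma in (0.001, 0.1, 10.0):
    # A larger gamma makes the RBF kernel more local, hence a more flexible
    # (and potentially overfitting) decision function.
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=gamma))
    scores = cross_val_score(model, data, target, cv=5)
    print(f"gamma={gamma}: {scores.mean():.3f} ± {scores.std():.3f}")
```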
14 changes: 7 additions & 7 deletions notebooks/metrics_classification.ipynb
@@ -311,13 +311,13 @@
"blood when the classifier predicted so or the fraction of people predicted to\n",
"have given blood out of the total population that actually did so.\n",
"\n",
"The former metric, known as the precision, is defined as TP / (TP + FP) and\n",
"The former metric, known as the precision, is defined as `TP / (TP + FP)` and\n",
"represents how likely the person actually gave blood when the classifier\n",
"predicted that they did. The latter, known as the recall, defined as TP / (TP\n",
"+ FN) and assesses how well the classifier is able to correctly identify\n",
"people who did give blood. We could, similarly to accuracy, manually compute\n",
"these values, however scikit-learn provides functions to compute these\n",
"statistics."
"predicted that they did. The latter, known as the recall, defined as\n",
"`TP / (TP + FN)` and assesses how well the classifier is able to correctly\n",
"identify people who did give blood. We could, similarly to accuracy,\n",
"manually compute these values, however scikit-learn provides functions to\n",
"compute these statistics."
]
},
{
@@ -664,4 +664,4 @@
},
"nbformat": 4,
"nbformat_minor": 5
}
}
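To make the precision and recall definitions above concrete, here is a small worked sketch; the counts are made up for illustration and are not taken from the notebook:

```python
# Hypothetical counts of true positives, false positives and false negatives.
TP, FP, FN = 30, 10, 20

precision = TP / (TP + FP)  # 30 / 40 = 0.75
recall = TP / (TP + FN)     # 30 / 50 = 0.60
print(f"precision={precision:.2f}, recall={recall:.2f}")

# The same values with scikit-learn, given matching true and predicted labels.
from sklearn.metrics import precision_score, recall_score

y_true = [1] * 30 + [0] * 10 + [1] * 20 + [0] * 40
y_pred = [1] * 30 + [1] * 10 + [0] * 20 + [0] * 40
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```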
9 changes: 9 additions & 0 deletions python_scripts/01_tabular_data_exploration.py
@@ -70,6 +70,15 @@
# %%
adult_census.head()

# %% [markdown]
# An alternative is to omit the `head` method. This would output the initial and
# final rows and columns, but everything in between is not shown by default. It
# also provides the dataframe's dimensions at the bottom in the format `n_rows`
# x `n_columns`.

# %%
adult_census
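What counts as "shown by default" is governed by pandas display options; a short sketch, assuming the `adult_census` dataframe loaded earlier in this notebook (the option values below are arbitrary):

```python
import pandas as pd

# pandas truncates the printed representation of large dataframes; these
# options control how many rows and columns are shown before truncation
# (defaults vary slightly across pandas versions).
print(pd.get_option("display.max_rows"), pd.get_option("display.max_columns"))

# Temporarily show more rows for a single display.
with pd.option_context("display.max_rows", 20, "display.min_rows", 20):
    print(adult_census)
```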

# %% [markdown]
# The column named **class** is our target variable (i.e., the variable which we
# want to predict). The two possible classes are `<=50K` (low-revenue) and
8 changes: 4 additions & 4 deletions python_scripts/02_numerical_pipeline_hands_on.py
@@ -34,7 +34,7 @@
adult_census = pd.read_csv("../datasets/adult-census.csv")
# drop the duplicated column `"education-num"` as stated in the first notebook
adult_census = adult_census.drop(columns="education-num")
adult_census.head()
adult_census

# %% [markdown]
# The next step separates the target from the data. We performed the same
@@ -44,7 +44,7 @@
data, target = adult_census.drop(columns="class"), adult_census["class"]

# %%
data.head()
data

# %%
target
@@ -95,7 +95,7 @@
# the `object` data type.

# %%
data.head()
data

# %% [markdown]
# We see that the `object` data type corresponds to columns containing strings.
@@ -105,7 +105,7 @@

# %%
numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]
data[numerical_columns].head()
data[numerical_columns]
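Listing the numerical columns by name works, but pandas can also select them by data type; a minimal sketch, assuming the `data` dataframe defined earlier in this notebook:

```python
# "number" matches all integer and floating dtypes; with the dtypes seen
# above, this is expected to recover the same four numerical columns.
numerical_data = data.select_dtypes(include="number")
numerical_data.columns.tolist()
```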

# %% [markdown]
# Now that we limited the dataset to numerical columns only, we can analyse
6 changes: 3 additions & 3 deletions python_scripts/02_numerical_pipeline_introduction.py
@@ -39,7 +39,7 @@
# Let's have a look at the first records of this dataframe:

# %%
adult_census.head()
adult_census

# %% [markdown]
# We see that this CSV file contains all information: the target that we would
@@ -56,10 +56,10 @@

# %%
data = adult_census.drop(columns=[target_name])
data.head()
data

# %% [markdown]
# We can now linger on the variables, also denominated features, that we later
# We can now focus on the variables, also denominated features, that we later
# use to build our predictive model. In addition, we can also check how many
# samples are available in our dataset.

13 changes: 7 additions & 6 deletions python_scripts/03_categorical_pipeline.py
@@ -81,7 +81,7 @@

# %%
data_categorical = data[categorical_columns]
data_categorical.head()
data_categorical

# %%
print(f"The dataset is composed of {data_categorical.shape[1]} features")
@@ -194,7 +194,7 @@

# %%
print(f"The dataset is composed of {data_categorical.shape[1]} features")
data_categorical.head()
data_categorical

# %%
data_encoded = encoder.fit_transform(data_categorical)
@@ -253,7 +253,7 @@
# and check the generalization performance of this machine learning pipeline using
# cross-validation.
#
# Before we create the pipeline, we have to linger on the `native-country`.
# Before we create the pipeline, we have to focus on the `native-country`.
# Let's recall some statistics regarding this column.

# %%
@@ -329,9 +329,10 @@
print(f"The accuracy is: {scores.mean():.3f} ± {scores.std():.3f}")

# %% [markdown]
# As you can see, this representation of the categorical variables is
# slightly more predictive of the revenue than the numerical variables
# that we used previously.
# As you can see, this representation of the categorical variables is slightly
# more predictive of the revenue than the numerical variables that we used
# previously. This is because we have more (predictive) categorical features
# than numerical ones.
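The pipeline whose accuracy is printed above lies outside this hunk; a minimal sketch of the kind of model being evaluated (the estimator and encoder settings are assumptions, and `data_categorical` and `target` are the variables defined earlier in this notebook):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# handle_unknown="ignore" avoids errors when a validation fold contains a
# category unseen during fit, e.g. a rare `native-country` value.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"), LogisticRegression(max_iter=500)
)
cv_results = cross_validate(model, data_categorical, target)
scores = cv_results["test_score"]
print(f"The accuracy is: {scores.mean():.3f} ± {scores.std():.3f}")
```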

# %% [markdown]
#
@@ -165,7 +165,7 @@
# method. As an example, we predict on the five first samples from the test set.

# %%
data_test.head()
data_test

# %%
model.predict(data_test)[:5]
2 changes: 1 addition & 1 deletion python_scripts/cross_validation_learning_curve.py
@@ -13,7 +13,7 @@
# generalizing. Besides these aspects, it is also important to understand how
# the different errors are influenced by the number of samples available.
#
# In this notebook, we will show this aspect by looking a the variability of
# In this notebook, we will show this aspect by looking at the variability of
# the different errors.
#
# Let's first load the data and create the same model as in the previous
12 changes: 6 additions & 6 deletions python_scripts/cross_validation_stratification.py
@@ -52,12 +52,12 @@
print("TRAIN:", train_index, "TEST:", test_index)

# %% [markdown]
# By defining three splits, we use three samples (1-fold) for testing and six (2-folds) for
# training each time. `KFold` does not shuffle by default. It means that the
# three first samples are selected for the testing set at the first split, then
# the three next three samples for the second split, and the three next for the
# last split. In the end, all samples have been used in testing at least once
# among the different splits.
# By defining three splits, we use three samples (1 fold) for testing and six
# (2 folds) for training each time. `KFold` does not shuffle by default. This
# means that the first three samples are selected for the testing set in the
# first split, then the next three samples in the second split, and the last
# three in the third split. In the end, all samples have been used for testing
# at least once among the different splits.
#
# Now, let's apply this strategy to check the generalization performance of our
# model.
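Before that, here is a tiny worked example of the splitting pattern described above, using a made-up array of nine samples rather than the actual dataset:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(9).reshape(-1, 1)  # nine toy samples
cv = KFold(n_splits=3)  # no shuffling by default

for train_index, test_index in cv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)

# Expected output:
# TRAIN: [3 4 5 6 7 8] TEST: [0 1 2]
# TRAIN: [0 1 2 6 7 8] TEST: [3 4 5]
# TRAIN: [0 1 2 3 4 5] TEST: [6 7 8]
```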
10 changes: 5 additions & 5 deletions python_scripts/cross_validation_train_test.py
@@ -12,7 +12,7 @@
# of predictive models. While this section could be slightly redundant, we
# intend to go into details into the cross-validation framework.
#
# Before we dive in, let's linger on the reasons for always having training and
# Before we dive in, let's focus on the reasons for always having training and
# testing sets. Let's first look at the limitation of using a dataset without
# keeping any samples out.
#
@@ -34,22 +34,22 @@
# notebook. The target to be predicted is a continuous variable and not anymore
# discrete. This task is called regression.
#
# This, we will use a predictive model specific to regression and not to
# Thus, we will use a predictive model specific to regression and not to
# classification.

# %%
print(housing.DESCR)

# %%
data.head()
data

# %% [markdown]
# To simplify future visualization, let's transform the prices from the 100
# (k\$) range to the thousand dollars (k\$) range.

# %%
target *= 100
target.head()
target

# %% [markdown]
# ```{note}
@@ -218,7 +218,7 @@
import pandas as pd

cv_results = pd.DataFrame(cv_results)
cv_results.head()
cv_results
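For context, `cv_results` as used above is the dictionary returned by `cross_validate`; the call that produces it sits outside this hunk, so the sketch below assumes a simple regressor applied to the `data` and `target` loaded earlier in this notebook:

```python
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Assumed regressor; the estimator actually evaluated in the notebook is
# defined outside this hunk.
regressor = DecisionTreeRegressor(random_state=0)
cv_results = cross_validate(regressor, data, target, cv=5)

# cross_validate returns a dict with one entry per fold for "fit_time",
# "score_time" and "test_score", which is convenient to wrap in a dataframe.
cv_results = pd.DataFrame(cv_results)
cv_results
```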

# %% [markdown]
# ```{tip}
2 changes: 1 addition & 1 deletion python_scripts/datasets_blood_transfusion.py
@@ -46,7 +46,7 @@
# * `Recency`: the time in months since the last time a person intended to give
# blood;
# * `Frequency`: the number of time a person intended to give blood in the past;
# * `Monetary`: the amount of blood given in the past (in c.c.);
# * `Monetary`: the amount of blood given in the past (in cm³);
# * `Time`: the time in months since the first time a person intended to give
# blood.
#
7 changes: 7 additions & 0 deletions python_scripts/ensemble_sol_02.py
@@ -103,3 +103,10 @@

plt.plot(data_range[feature_name], forest_predictions, label="Random forest")
_ = plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")

# %% [markdown] tags=["solution"]
# The random forest reduces the overfitting of the individual trees but still
# overfits itself. In the section on "hyperparameter tuning with ensemble
# methods" we will see how to further mitigate this effect. Still, interested
# users may increase the number of estimators in the forest and try different
# values of, e.g., `min_samples_split`.
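A minimal sketch of the experiment suggested above; the synthetic data and parameter values are illustrative assumptions, not the exercise's own setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Noisy 1-d regression problem, made up for illustration.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.3, size=300)

for min_samples_split in (2, 10, 50):
    forest = RandomForestRegressor(
        n_estimators=300, min_samples_split=min_samples_split, random_state=0
    )
    cv = cross_validate(forest, X, y, return_train_score=True, cv=5)
    # A shrinking train/test gap indicates less overfitting.
    gap = cv["train_score"].mean() - cv["test_score"].mean()
    print(
        f"min_samples_split={min_samples_split}: "
        f"train={cv['train_score'].mean():.2f}, "
        f"test={cv['test_score'].mean():.2f}, gap={gap:.2f}"
    )
```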
2 changes: 1 addition & 1 deletion python_scripts/linear_models_ex_02.py
@@ -52,7 +52,7 @@

data = penguins_non_missing[columns]
target = penguins_non_missing[target_name]
data.head()
data

# %% [markdown]
# Now it is your turn to train a linear regression model on this dataset. First,
2 changes: 1 addition & 1 deletion python_scripts/linear_models_ex_04.py
@@ -17,7 +17,7 @@
# In the previous Module we tuned the hyperparameter `C` of the logistic
# regression without mentioning that it controls the regularization strength.
# Later, on the slides on 🎥 **Intuitions on regularized linear models** we
# metioned that a small `C` provides a more regularized model, whereas a
# mentioned that a small `C` provides a more regularized model, whereas a
# non-regularized model is obtained with an infinitely large value of `C`.
# Indeed, `C` behaves as the inverse of the `alpha` coefficient in the `Ridge`
# model.
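A small sketch illustrating the claim that a small `C` regularizes more strongly; the dataset and the values of `C` are arbitrary choices for this illustration, not the exercise's own data:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

for C in (0.01, 1.0, 100.0):
    model = make_pipeline(
        StandardScaler(), LogisticRegression(C=C, max_iter=1_000)
    ).fit(X, y)
    # Smaller C means a stronger penalty (like a larger `alpha` in Ridge), so
    # the learned weights are shrunk towards zero.
    print(f"C={C}: mean |coef| = {np.abs(model[-1].coef_).mean():.3f}")
```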
2 changes: 1 addition & 1 deletion python_scripts/linear_models_sol_02.py
@@ -46,7 +46,7 @@

data = penguins_non_missing[columns]
target = penguins_non_missing[target_name]
data.head()
data

# %% [markdown]
# Now it is your turn to train a linear regression model on this dataset. First,
2 changes: 1 addition & 1 deletion python_scripts/linear_models_sol_04.py
@@ -11,7 +11,7 @@
# In the previous Module we tuned the hyperparameter `C` of the logistic
# regression without mentioning that it controls the regularization strength.
# Later, on the slides on 🎥 **Intuitions on regularized linear models** we
# metioned that a small `C` provides a more regularized model, whereas a
# mentioned that a small `C` provides a more regularized model, whereas a
# non-regularized model is obtained with an infinitely large value of `C`.
# Indeed, `C` behaves as the inverse of the `alpha` coefficient in the `Ridge`
# model.
2 changes: 1 addition & 1 deletion python_scripts/linear_regression_without_sklearn.py
@@ -22,7 +22,7 @@
import pandas as pd

penguins = pd.read_csv("../datasets/penguins_regression.csv")
penguins.head()
penguins

# %% [markdown]
# We aim to solve the following problem: using the flipper length of a penguin,
14 changes: 7 additions & 7 deletions python_scripts/metrics_classification.py
@@ -78,7 +78,7 @@
# predictions a classifier can provide.
#
# For this reason, we will create a synthetic sample for a new potential donor:
# they donated blood twice in the past (1000 c.c. each time). The last time was
# they donated blood twice in the past (1000 cm³ each time). The last time was
# 6 months ago, and the first time goes back to 20 months ago.

# %%
@@ -188,13 +188,13 @@
# blood when the classifier predicted so or the fraction of people predicted to
# have given blood out of the total population that actually did so.
#
# The former metric, known as the precision, is defined as TP / (TP + FP) and
# The former metric, known as the precision, is defined as `TP / (TP + FP)` and
# represents how likely the person actually gave blood when the classifier
# predicted that they did. The latter, known as the recall, defined as TP / (TP
# + FN) and assesses how well the classifier is able to correctly identify
# people who did give blood. We could, similarly to accuracy, manually compute
# these values, however scikit-learn provides functions to compute these
# statistics.
# predicted that they did. The latter, known as the recall, is defined as
# `TP / (TP + FN)` and assesses how well the classifier is able to correctly
# identify people who did give blood. We could, similarly to accuracy,
# manually compute these values; however, scikit-learn provides functions to
# compute these statistics.

# %%
from sklearn.metrics import precision_score, recall_score
5 changes: 3 additions & 2 deletions python_scripts/metrics_regression.py
@@ -97,8 +97,9 @@
# %% [markdown]
# The $R^2$ score represents the proportion of variance of the target that is
# explained by the independent variables in the model. The best score possible
# is 1 but there is no lower bound. However, a model that predicts the expected
# value of the target would get a score of 0.
# is 1 but there is no lower bound. However, a model that predicts the [expected
# value](https://en.wikipedia.org/wiki/Expected_value) of the target would get a
# score of 0.
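Writing the definition out makes this concrete (a standard formulation, with $\hat{y}_i$ the predictions and $\bar{y}$ the mean of the target):

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
$$

A model that always predicts $\bar{y}$ makes the numerator equal to the denominator, hence $R^2 = 0$; predictions worse than that constant baseline push the ratio above 1 and the score below 0.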

# %%
from sklearn.dummy import DummyRegressor