HA W.M.ALEX 2024-10-16
true
The Objective of Capstone Project is to develop a Recommendation System that Predict Movie ratings using a Machine Learning Model based on subset of MovieLens datasets. This subset represents a smaller portion of much larger datasets containing several Millions of rating. This datasets encompasses approximately 10 Millions Movies ratings. The Primary task is to leverage this data to Predict Movie ratings.
For this endeavor, this MovieLens datasets is partitioned into edx sets for developing the Algorithm , comprising approximately 9 Million Rows and final_holdout_test sets for Testing on Final Model, consisting approximately 1 Millions Rows. The edx datasets includes 69,878 Unique Users, 10,677 Unique Movies and 797 Combined genres. Each Movie is also categorized by its Combined genres and ratings range from 0.5 and 5.0 with increment of 0.5.
Prior to Models Building, Data exploration Analysis (EDA) were conducted which encompassing Data cleaning, Data pre-processing/Featuring Engineering, Data Visualization, Stratification Analysis, Distributions Analysis and Frequency Analysis. The Model employs a Collaborative Filtering Approach, augmented by Regularization Method to estimate the Movie Effects, User Effects, Genres Effects and Time Effects. These Methods are instrumental in penalizing the magnitude of the Parameters to avoid Overfitting.
The edx datasets is also split into train sets and test sets for training and testing the Algorithm using Cross-Validation Method. Our ultimate goal is to construct a Model that minimizes the loss, measured by our loss function, Root Mean Squared Error (RMSE).
If is the number of User-Movie combinations, is the rating for Movie by User , and is our Prediction, the RMSE is defined as follow:
Regularized Movie+User+Time Based Model comply with an Lowest RMSE of 0.863759.
The Final Model is defined as follow:
The Model’s Performance was evaluated using the Metric - Root Mean Squared Error. The reliability and trustworthiness of the Model are substantiated through incorporating Regularization and Cross-Validation Method during Model development.
Optimal Machine Learning Model achieved an RMSE of 0.863759, Signifying a Remarkable Outcome. Thus, I firmly assert our confidence in adopting the ultimate Model and Algorithm to construct the Movie Recommendation System.
Purpose of the Project is to build a Recommendation System based on MovieLens Dataset that Predict Movie ratings using Machine Learning Algorithm. The Movie ratings are on a 0.5-5.0 scale, where 5.0 represents the best Movie and 0.5 suggests it to be a bad Movie. Movie Recommendation System are more complicated Machine Learning challenges because each Outcome has a different set of Features/Predictors. For example, different Users Rate a different number of Movies and Rate different Movies. Consider the following scenario:
-
If the average rating for all Movies is 3.7, and “Jurassic Park” is better than an average Movie. User “A” might Rate it 0.5 points above the average.
-
User “B” is a cranky User, tends to Rate 0.8 points lower than average. Thus, the estimate for “Jurassic Park” by User “B” would be calculated as 3.7-0.8+0.5=3.4.
Many of the Movie ratings are influenced by Effects associated with either Users and Movies of their interactions. Different Users employ different rating scales, and a User can change their rating scales over time or genres. A Movie’s popularity may also change over time or genres due to external events. For example, cranky User who tended to Rate an average Movie 4.0, may now Rate such a movie 3.0 or even lower. But in some occasion, those user may Rate much higher.
To address this, we add Time Effects and Genres Effects to the baseline features/predictors. Including Time Specific Effects does not attempt to capture future changes but aims to capture transient effects that significantly influenced past User feedback. Some people may like a Movie and remember it as my most favorites because of it’s genres. While others may dislike it and forget about it. Thus only those liking them will mark the Movie as favorites, while those disliking them will not mention them at all. This behavior is expected towards most popular Movies, which can be either remembered as very good or not to be remembered.
However,some Movies are known to be bad and people who did not like them always give them a lower scores, indicating what they do not want to watch. However, for the other part of the population, who liked those Movies, they are not going to be remembered long Time as notable. Thus, long Time after watching the Movie, only those who disliked the Movie will Rate it. Some Movies are natural selection of Movies to be Rated. some Movies are natural candidates as bad Movies, while others are natural candidates as good Movies.
Thus, Time Effects and Genres Effects can explain Portion of Variability of the Movie ratings.
The MovieLens datasets comprises approximately 10 Million Rows of data. This datasets is randomly split into two separated datasets, edx and final_holdout_test. The edx datasets serves as the training sets, while the final_holdout_test datasets is used for Final Model Testing Purpose. Both datasets contain 6 Columns/Features/Predictors. The edx datasets includes approximately 9 Millions of Rows with 69,878 Unique Users, 10,677 Unique Movies and 797 combined genres. Movie rating in this datasets range from 0.5 and 5.0 with increments of 0.5.
A) Dimension
Rows | Columns |
---|---|
9000055 | 6 |
edx datasets
There are 9000055 Rows and 6 Columns in the edx datasets.
B) Column Name and Class
Class | |
---|---|
userId | integer |
movieId | integer |
rating | numeric |
timestamp | integer |
title | character |
genres | character |
edx datasets
There are 6 Columns and Class in the edx datasets.
C) Number of Unique movieId, userId and genres.
Description | Count |
---|---|
Number of Unique movieId | 10677 |
Number of unique userId | 69878 |
Number of unique combined genres | 797 |
edx datsets
There are 10677 Unique Movies, 69878 Unique Users and 797 Unique Combined genres in the edx datasets.
List of 10 Examples - Counting the Occurrences of rating (By movieId)
# - generate 10 examples - counting the Occurrences of rating by movieId
# - display output using kable function
m_result <- edx %>% count(movieId,name="Occurency") %>%
count(Occurency, name="Count") %>%
arrange(desc(Count)) %>%
dplyr::slice(1:10)
m_result %>% knitr::kable(caption="edx datasets")
Occurency | Count |
---|---|
4 | 154 |
2 | 152 |
1 | 126 |
5 | 120 |
3 | 119 |
6 | 107 |
11 | 107 |
7 | 96 |
12 | 93 |
8 | 91 |
edx datasets
List of 10 Examples - Counting the Occurrences of rating (By userId)
# - generate 10 examples - counting the occurrences of rating by userId
# - display output using kable function
u_result <- edx %>% count(userId,name="Occurency") %>%
count(Occurency, name="Count") %>%
dplyr::slice(30:40)
u_result %>% knitr::kable(caption="edx datasets")
Occurency | Count |
---|---|
40 | 682 |
41 | 568 |
42 | 641 |
43 | 640 |
44 | 636 |
45 | 607 |
46 | 613 |
47 | 532 |
48 | 574 |
49 | 557 |
50 | 510 |
edx datasets
Plot Distributions (Number of Occurrence) - movieId and userId
# - generate distributions (number of occurrence) using movieId and userId
# - ggplot histogram - m1 and m2
m1 <- edx %>%
dplyr::count(movieId) %>%
ggplot(aes(n)) +
geom_histogram(bins = 50, fill="#006EBB", color="black") +
labs(y="Count",x="Number of Occurrence") +
scale_x_log10() +
ggtitle("Movies")
m2 <- edx %>%
dplyr::count(userId) %>%
ggplot(aes(n)) +
geom_histogram(bins = 200, fill="cyan", color="#D883B7" ) +
labs(y="Count",x="Number of Occurrence") +
scale_x_log10()+
ggtitle("Users")
grid.arrange(m1, m2, ncol = 2)
It is evident from the distributions plotted above that some Movies are rated more than the other Movies and some Users are more active and Rate more number of Movies than the other Users do.
- movieId : A Unique identification number assigned to each Movie.
- title : The title for each Movie follow by the Release Year inside the parentheses “(1996)”.
- genres : The Combined genres of each Movie, where each genres are separated by a pipe “|”.
- userId : A unique identification assigned to each User.
- rating : The rating given by a Unique User for specific Movie.
- timestamp : The timestamp associated with a User’s rating for a particular Movie.
R Codes check columns with “NA” values in “edx” datasets
# - generate missing value (if any) in column rating of edx
# - display output using kable function
na_results <- edx[apply(is.na(edx),1,any),]
na_results %>% select(userId,movieId,rating,timestamp,title,genres) %>%
knitr::kable(caption="Columns with NA values in edx datasets")
userId | movieId | rating | timestamp | title | genres |
---|
Columns with NA values in edx datasets
As we can see from the output above, there are NO missing value in rating Column of edx datasets.
R Code check “rating” column for “Zeros” value in the “edx” datasets
# generate and display zeros value (if any) in column rating of edx
length(which(edx$rating==0))
## [1] 0
There are also No zeros value were given as rating in the edx datasets.
List of 8 Examples - edx datasets
# - generate 8 examples: the original edx - rating given by one User to one movie
# - display output of edx using kable function
edx %>%
group_by(title) %>%
mutate(f=length(rating)) %>% ungroup() %>%
filter(f > 10000) %>%
distinct(title, .keep_all=TRUE) %>%
arrange(desc(f)) %>%
select(movieId,title,genres,userId,rating,timestamp) %>%
dplyr::slice(1:8) %>%
knitr::kable(caption="edx datasets")
movieId | title | genres | userId | rating | timestamp |
---|---|---|---|---|---|
296 | Pulp Fiction | Comedy|Crime|Drama | 10 | 2.0 | 941529864 |
356 | Forrest Gump | Comedy|Drama|Romance|War | 1 | 5.0 | 838983653 |
593 | Silence of the Lambs, The | Crime|Horror|Thriller | 7 | 3.0 | 1049764435 |
480 | Jurassic Park | Action|Adventure|Sci-Fi|Thriller | 4 | 5.0 | 844416834 |
318 | Shawshank Redemption, The | Drama | 18 | 4.5 | 1111545917 |
110 | Braveheart | Action|Drama|War | 2 | 5.0 | 868245777 |
457 | Fugitive, The | Thriller | 6 | 5.0 | 1001083175 |
589 | Terminator 2: Judgment Day | Action|Sci-Fi | 1 | 5.0 | 838983778 |
edx datasets
Each Row in “edx” represents a “rating” given by one User to one Movie.
The edx datasets contain Columns/Features/Predictors like timestamp, genres, rating. Not all Features may be useful for Prediction and some may even be detrimental. Therefore we will select the most important Features. We will Transform and Extract the timestamp Column into Week of Date and Year of Date for readable format that better represent the underlying problem to the Model. Additionally, We will compute the Mean of rating by genres, Week of Date and Year of Date. These new Features are important for the Predictive Model as they provide insights into the problem, potentially improving the Model Performance.
3.1.3.1 Column genres
The genres Column includes every Genre that applies to Movie so I will define it as Combined genres. Some Movies also fall under several genres. The Extraction of Combined genres from the Movie into separate Rows is NOT Necessary, even those each Genre is separated by a pipe “|”. In reality, Movies often belong to multiple genres simultaneously. Manipulating the original Combined genres Column by splitting it into individual Genre per Row could lead to Data Misrepresentation. It might introduce data bias when the combined genres of a Movie are artificially separated.
In truth, the combined genres for each Movie provide a more Accurate Representation of Data.
Following Example explain why Extraction of Combined genres into separate Rows is Not Necessary
3.1.3.1.1 Mean of rating by genres BEFORE Combined genres Separation
List Mean of rating by genres in descending order by Number of rating (count > 10000)
## # A tibble: 20 × 4
## rank genres average_rating_genres count
## <chr> <chr> <dbl> <int>
## 1 1 Crime|Mystery|Thriller 4.20 26892
## 2 2 Action|Adventure|Comedy|Fantasy|Romance 4.20 14809
## 3 3 Adventure|Drama|Film-Noir|Sci-Fi|Thriller 4.15 13957
## 4 4 Adventure|Drama|War 4.08 14137
## 5 5 Crime|Horror|Thriller 4.08 33757
## 6 6 Crime|Film-Noir|Mystery|Thriller 4.06 24961
## 7 7 Comedy|Crime|Drama|Thriller 4.06 24341
## 8 8 Adventure|Mystery|Thriller 4.04 14712
## 9 9 Comedy|Drama|Romance|War 4.01 41762
## 10 10 Crime|Drama|Sci-Fi|Thriller 4.00 10730
## 11 11 Comedy|Crime|Drama 4.00 59071
## 12 12 Adventure|Children|Fantasy|Musical 3.99 11784
## 13 13 Drama|War 3.98 111029
## 14 14 Crime|Drama 3.95 137387
## 15 15 Action|Adventure|Drama|Fantasy 3.94 11941
## 16 16 Action|Drama|War 3.92 99183
## 17 17 Action|Crime|Romance 3.91 16090
## 18 18 Crime|Drama|Film-Noir 3.91 11249
## 19 19 Adventure|Animation|Children|Comedy|Fanta… 3.90 13063
## 20 20 Crime|Horror|Mystery|Thriller 3.88 27240
Mean of rating by genres (genres=“Drama”)
## # A tibble: 1 × 4
## rank genres average_rating_genres count
## <chr> <chr> <dbl> <int>
## 1 48 Drama 3.71 733296
Mean of rating by genres (genres=“Comedy”)
## # A tibble: 1 × 4
## rank genres average_rating_genres count
## <chr> <chr> <dbl> <int>
## 1 131 Comedy 3.24 700889
3.1.3.1.2 Plot Mean of rating by genres with error bars BEFORE genres Separated
3.1.3.1.3 Mean of rating by genres AFTER Combined genres Separated
List of 15 Examples - Mean of rating by genres in descending order. (count > 20000)
## # A tibble: 15 × 4
## rank genres average_rating_genres count
## <chr> <chr> <dbl> <int>
## 1 1 War 3.78 511147
## 2 2 Mystery 3.68 568332
## 3 3 Drama 3.67 3910127
## 4 4 Crime 3.67 1327715
## 5 5 Animation 3.60 467168
## 6 6 Musical 3.56 433080
## 7 7 Romance 3.55 1712100
## 8 8 Thriller 3.51 2325899
## 9 9 Fantasy 3.50 925637
## 10 10 Adventure 3.49 1908892
## 11 11 Comedy 3.44 3540930
## 12 12 Action 3.42 2560545
## 13 13 Children 3.42 737994
## 14 14 Sci-Fi 3.40 1341183
## 15 15 Horror 3.27 691485
3.1.3.1.4 Plot Mean of rating by genres with Error Bars AFTER genres Separated
List of rank, genres, Mean of rating by genres & Number of rating of “Comedy” and “Drama”
rank | genres | average_rating_genres | count |
---|---|---|---|
48 | Drama | 3.71236 | 733296 |
131 | Comedy | 3.23786 | 700889 |
BEFORE Combined genres Separated
rank | genres | average_rating_genres | count |
---|---|---|---|
3 | Drama | 3.67313 | 3910127 |
11 | Comedy | 3.43691 | 3540930 |
After Combined genres Separated
As illustrated by above Example, Data Misrepresentation is observed in Column of “rank”, “Mean of rating” by “genres” and “Number of rating” after the Separation of “Combined genres” in edx datasets.
3.1.3.2 Column timestamp
Create new Features based on timestamp to capture insights related to Time Effects.
- Week of the Date (d_w): Transform & Extract the Week of the Date in timestamp column.
- Month of the Date (d_m): Transform & Extract the Month of the Date in timestamp column.
- Year of the Date (d_y): Transform & Extract the Year of the Date in timestamp column.
These new Features will enhance our Models ability to capture Time Effects patterns.
R codes generate new Columns (d_w, d_m, d_y)
# data Wrangling: generate new columns (d_w, d_m, d_y) from column timestamp
edx <- edx %>% mutate(d_w=format(round_date(as_datetime(timestamp),"week"),"%Y-%m-%d"))
edx <- edx %>% mutate(d_m=format(round_date(as_datetime(timestamp),"month"),"%Y-%m-%d"))
edx <- edx %>% mutate(d_y=format(round_date(as_datetime(timestamp),"year"),"%Y-%m-%d"))
3.1.3.3 Column rating
Create new Features/Predictors can provide more detailed insights into Mean of rating by various dimensions such as Week of Date, Week of Year, genres. Additionally, I will also capture the total Number of rating by movieId.
The following steps outline the Feature Engineering process:
- Compute Mean of rating by movieId and generate a new Column m_r.
- Compute Mean of rating by movieId, Week of the Date and generate a new Column m_rw.
- Compute Mean of rating by movieId, Year of the Date and generate a new Column m_ry.
- Compute Mean of rating by userId, genres and generate a new Column m_rg.
- Compute Total Number of rating by movieId and generate a new Column tot_nr.
new Features enable us to capture more information & improve the robustness of Data Analysis.
R codes generate new Columns (m_r, m_rw, m_ry, m_rg, tot_nr)
# data Wrangling: generate new columns (m_r, m_rw, m_ry, m_rg, tot_nr) from column rating
edx <- edx %>% group_by(movieId) %>% mutate(m_r = mean(rating, na.rm=TRUE))
edx <- edx %>% group_by(movieId,d_w) %>% mutate(m_rw= mean(rating, na.rm=TRUE))
edx <- edx %>% group_by(movieId,d_y) %>% mutate(m_ry= mean(rating, na.rm=TRUE))
edx <- edx %>% group_by(userId,genres) %>% mutate(m_rg=mean(rating, na.rm=TRUE))
edx <- edx %>% group_by(movieId) %>% mutate(tot_nr=n()) %>% ungroup()
3.1.3.4 Column title
To facilitate easier interpretation and explanation, we will also create the new Features that capture more information than original Features.
- Extract the Release Year of Movie from title column and generate a new column release.
This new Feature will enhance our datasets by providing a clear and easily interpretable attribute related to the Movie’s Release Year.
R codes generate new column “release” from column “title”
# data wrangling: generate new column release from column title
edx <- edx %>% mutate(release = str_extract(title,"\\d{4}")) %>%
mutate(title = str_replace(title,"\\s*\\(\\d{4}\\)",""))
List 10 Examples encompassing new Features (d_w)
## # A tibble: 11 × 5
## userId movieId title d_w rating
## <int> <int> <chr> <chr> <dbl>
## 1 39748 592 Batman 1996-03-03 4
## 2 40233 150 Apollo 13 1996-03-03 5
## 3 35139 1 Toy Story 1996-01-28 4
## 4 20095 780 Independence Day (a.k.a. ID4) 1996-06-30 5
## 5 49668 590 Dances with Wolves 1996-03-31 5
## 6 39748 527 Schindler's List 1996-03-03 5
## 7 34955 380 True Lies 1996-03-03 4
## 8 15767 1210 Star Wars: Episode VI - Return of the Jedi 1996-10-27 4
## 9 35139 32 12 Monkeys (Twelve Monkeys) 1996-01-28 5
## 10 36008 50 Usual Suspects, The 1996-02-04 5
## 11 41500 608 Fargo 1996-03-17 5
List 8 Examples encompassing new Features (release, m_r, m_rw, m_ry, m_rg)
## # A tibble: 8 × 8
## userId release title rating m_r m_rw m_ry m_rg
## <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 34955 1994 Pulp Fiction 5 4.15 5 4.04 3.5
## 2 35435 1994 Forrest Gump 4 4.01 4.17 4.13 4
## 3 34955 1991 Silence of the Lambs, The 5 4.20 5 4.31 5
## 4 37153 1993 Jurassic Park 1 3.66 3.14 3.90 1
## 5 40233 1994 Shawshank Redemption, The 3 4.46 3 4.49 4
## 6 36881 1995 Braveheart 4 4.08 4 4.32 4
## 7 35435 1993 Fugitive, The 5 4.01 4.71 4.16 5
## 8 34955 1991 Terminator 2: Judgment Day 5 3.93 3 4.06 3.5
3.2.1.1 Frequency of rating By title
List of 20 Movies which is most frequently Rated by User with frequency > 10000
# generate 20 movies: most frequently rated by User with frequency > 10000
c_1 <- edx %>%
group_by(title) %>%
mutate(frequency=length(rating)) %>% ungroup() %>%
filter(frequency > 10000) %>%
distinct(title, .keep_all=TRUE) %>%
arrange(desc(frequency)) %>%
mutate(rank=rownames(.)) %>%
select(rank,title,frequency) %>%
dplyr::slice(1:20)
# display c_1 using kable function
c_1 %>% knitr::kable(caption="Frequency of rating By title")
rank | title | frequency |
---|---|---|
1 | Pulp Fiction | 31362 |
2 | Forrest Gump | 31079 |
3 | Silence of the Lambs, The | 30382 |
4 | Jurassic Park | 29360 |
5 | Shawshank Redemption, The | 28015 |
6 | Braveheart | 26212 |
7 | Fugitive, The | 26020 |
8 | Terminator 2: Judgment Day | 25984 |
9 | Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) | 25672 |
10 | Batman | 24585 |
11 | Apollo 13 | 24284 |
12 | Toy Story | 23790 |
13 | Independence Day (a.k.a. ID4) | 23449 |
14 | Dances with Wolves | 23367 |
15 | Schindler’s List | 23193 |
16 | True Lies | 22823 |
17 | Star Wars: Episode VI - Return of the Jedi | 22584 |
18 | 12 Monkeys (Twelve Monkeys) | 21891 |
19 | Usual Suspects, The | 21648 |
20 | Fargo | 21395 |
Frequency of rating By title
The Movie “Pulp Fiction” has the greatest Number of rating and is most frequently Rated by User.
3.2.1.2 Plot - Frequency of rating By title
# ggplot frequency of rating By title
c_1 %>% mutate(title=reorder(title,frequency)) %>%
ggplot(aes(title,frequency)) +
geom_bar(stat="identity",fill="#D883B7",color="white") +
labs(y="Frequency",x="Moive title",title="Frequency of User rating By Movie title") +
theme(axis.text.x=element_text(angle=90)) +
scale_y_continuous(breaks=seq(0,33000,by=3000)) +
coord_flip()
3.2.1.3 Frequency of rating By genres
List of 20 genres which is most frequently Rated by User with frequency > 10000
# generate 20 genres: most frequently rated by User with frequency > 10000
c_2 <- edx %>%
group_by(genres) %>%
mutate(frequency=length(rating)) %>% ungroup() %>%
filter(frequency > 10000) %>%
distinct(genres, .keep_all=TRUE) %>%
arrange(desc(frequency)) %>%
mutate(rank=rownames(.)) %>%
select(rank,genres,frequency) %>%
dplyr::slice(1:20)
# display c_2
print_df <- function(title,df)
{
cat(title, "\n\n")
cat(capture.output(print(n=50,df)), sep="\n")
}
print_df("List of 20 genres - Frequency of rating By genres ",c_2)
## List of 20 genres - Frequency of rating By genres
##
## # A tibble: 20 × 3
## rank genres frequency
## <chr> <chr> <int>
## 1 1 Drama 733296
## 2 2 Comedy 700889
## 3 3 Comedy|Romance 365468
## 4 4 Comedy|Drama 323637
## 5 5 Comedy|Drama|Romance 261425
## 6 6 Drama|Romance 259355
## 7 7 Action|Adventure|Sci-Fi 219938
## 8 8 Action|Adventure|Thriller 149091
## 9 9 Drama|Thriller 145373
## 10 10 Crime|Drama 137387
## 11 11 Drama|War 111029
## 12 12 Crime|Drama|Thriller 106101
## 13 13 Action|Adventure|Sci-Fi|Thriller 105144
## 14 14 Action|Crime|Thriller 102259
## 15 15 Action|Drama|War 99183
## 16 16 Action|Thriller 96535
## 17 17 Action|Sci-Fi|Thriller 95280
## 18 18 Thriller 94662
## 19 19 Horror|Thriller 75000
## 20 20 Comedy|Crime 73286
The genres “Drama” has the greatest Number of rating and is most frequently Rated by User.
3.2.1.4 Plot - Frequency of rating By genres
# ggplot frequency of rating By genres
c_2 %>% mutate(genres=reorder(genres,frequency)) %>%
ggplot(aes(genres,frequency)) +
geom_bar(stat="identity",fill="#584298",color="white") +
labs(y="Frequency",x="genres",title="Frequency of User rating By genres") +
theme(axis.text.x=element_text(angle=90)) +
scale_y_continuous(breaks=seq(0,800000,by=80000)) +
coord_flip()
Correlation is only meaningful in a particular context. Stratification is used to identifying the Correlation is meaningful as a summary statistic for Prediction. I stratify a rating into groups and compute the Mean summaries in each group (Mean of rating, Mean of rating by Week of Date, Mean of rating by Year of Date, Mean of rating by genres). After normalizing the rating and Mean of rating, the Mean of rating will be equal to 0 and the Standard Deviation will be equal to 1. Therefore, I will set intercept to 0 and slope to Correlation coefficient ().
3.2.2.1 Plot - Correlation of Normalized Mean of rating and rating
# compute correlation coefficient (r) of rating and m_r with 6 decimals place
r <- edx %>%
summarize(r=cor(rating,m_r)) %>%
pull(r)
r <- round(r,6)
# ggplot correlation of Normalized Mean of rating and rating
edx %>% mutate(rating=scale(rating),m_r=scale(m_r)) %>%
group_by(rating) %>%
summarize(m_r=mean(m_r)) %>%
ggplot(aes(rating,m_r)) + geom_point()+
geom_abline(intercept=0, slope=r,color="#F89E78")+
labs(x="scale(rating)",y="scale(m_r)",
title=paste("Normalized Mean of rating vs rating [cor =",r,"]"))
3.2.2.2 Plot - Correlation of Normalized Mean by Week of Date and rating
3.2.2.3 Plot - Correlation of Normalized Mean by Year of Date and rating
3.2.2.4 Plot - Correlation of Normalized Mean of rating by genres and rating
3.2.2.5 Correlation table - new Features (m_r, m_rw, m_ry, m_rg) of “edx” dataset
# generate correlation coefficient table (m_r, m_rw, m_ry, m_rg) of edx
avg_r_all <- edx %>%
select(rating,m_r,m_rw,m_ry,m_rg)
cor_r_all <- cor(na.omit(avg_r_all[, unlist(lapply(avg_r_all, is.numeric))]))
# display correlation coefficient table using kable function
cor_r_all %>% knitr::kable(caption="Correlation - new Features of edx dataset")
rating | m_r | m_rw | m_ry | m_rg | |
---|---|---|---|---|---|
rating | 1.000000 | 0.458432 | 0.609944 | 0.480843 | 0.785637 |
m_r | 0.458432 | 1.000000 | 0.751597 | 0.953393 | 0.368341 |
m_rw | 0.609944 | 0.751597 | 1.000000 | 0.788227 | 0.463979 |
m_ry | 0.480843 | 0.953393 | 0.788227 | 1.000000 | 0.385547 |
m_rg | 0.785637 | 0.368341 | 0.463979 | 0.385547 | 1.000000 |
Correlation - new Features of edx dataset
Data Visualization reveals that Mean of each group appear to follow a linear relationship. Mean of rating By genres seems to have more predictive power than Mean of rating, Mean of rating By Week of Date and Mean of rating By Year of Date. The rating to Mean of rating By Week of Date and Mean of rating By genres variability are quite large and this implies that they should explain a lot of variability.
Hence, I will add Mean of rating By genres as a Parameter for Genres Effects in Model building. Adding extra Features/Predictors can improve Root Mean Squared Error(RMSE), but may not when the added Features that are highly correlated with other Features. From Correlation table, I also observe that the Correlation between rating and Mean of rating By Week of Date is second greatest. Therefore, I will also add Mean of rating By Week of Date as Parameter for Time Effects in Model building.
3.2.3.1 Plot - rating Distributions of Movie
3.2.3.2 Plot - Distributions of Movie (Most Given rating in order from Most to Least)
# generate frequency group by rating
edx_asc <- edx %>%
group_by(rating) %>%
summarise(Frequency = n()) %>%
arrange(Frequency)
# ggplot distributions of movie from most given rating in order from most to least
edx_asc %>%
ggplot(aes(x=reorder(rating, Frequency), y=Frequency)) +
geom_bar(stat="identity", width=0.5, aes(fill=Frequency)) +
scale_fill_gradient(low="red", high="blue") +
coord_flip() +
labs(y="Frequency",
x="rating",
title = "Distributions of Movie (Most Given rating in order from Most to Least)")
The Five most given rating in order from most to least are (4.0, 3.0, 5.0, 3.5, 2.0).
In general, half score rating are less common than whole score rating. For example, there are fewer rating of 3.5 than there are rating of 3.0 or 4.0
3.2.3.3 Plot - Mean of rating Distributions
3.2.3.4 Plot - Mean of rating Distributions By movieId, Week of Date
3.2.3.5 Plot - Mean of rating Distributions By movieId, Year of Date
3.2.3.6 Plot - Mean of rating Distributions (By userId, genres)
3.2.3.7 Plot - Number of rating Distributions By movieId
3.2.3.8 Plot - rating Distributions through the day of year
3.2.3.9 Plot - rating Distributions (Month)
3.2.3.10 Plot - rating Distributions (Year)
During Models development stage, edx datasest is split into separate train_set and test_set datasets for training and testing different Models. And then apply the relevant Minimum Lambda () values on the Algorithm for the Final Model building using edx and final_holdout_test sets.
I use the Collaborative Filtering Approach to build our Machine Learning Models and compare different Models by evaluating their loss functions. The goal is to build a Ultimate Model that minimizes the loss. Root Mean Squared Error (RMSE) will be used as our loss function.
If is the number of User-Movie combinations, is the rating for Movie by User , and is our Prediction, the RMSE is defined as follow:
The estimate that minimizing RMSE represents the Predicted Ratings of all Movies across all Users. I will also use Regularization Method to penalize large estimates that arise from small sample sizes. Furthermore, the Cross-Validation Method will be adopted to mimic the RMSE.
Assumes the same rating for all Movies and all Users, with all the differences explained by random variation. The simplest Model that someone can build, is a Naive Mean Based Model that Predict always the Mean of rating on edx datasets which represents the true rating for all Movies and Users. The Mean is approximately 3.512465.
The Naive Mean Based Model is defined as follow:
with is the Mean and is the independent errors sampled from the same distribution centered at 0.
RMSE of Naive Mean Based Model on final_holdout_test sets is
1.061202.
It is NOT an Acceptable Result.
We can improve our Model by adding a term
,
that represents the Mean of rating for Movie
The first Non Naive Based Model takes into account the Movie Specific
Effects. Movies are Rated higher or lower associated with each
other.
The Movie Effects Based Model is defined as follow:
with is the Mean and is the independent errors sampled from the same distribution centered at 0. The is a measure for popularity of Movie .
The RMSE of Movie Effects Based Model on final_holdout_test
sets is 0.943909.
While this is an improvement over the Naive Mean Based Model,
but it is still Far from Satisfactory.
4.1.2.1 Plot - Movie Effects (frequency vs b_i)
4.1.2.2 Plot - User Effects (frequency vs b_u)
We can further improve our Model by adding , the User Specific Effects. The Model considers that Users have different preference therefore some Users give higher rating than others.
The Movie+User Effects Based Model is defined as follow:
with is the Mean and is the independent errors sampled from the same distribution centered at 0. The is a measure for popularity of Movie . The is a measure due to effects associated with User . The Model accounts for Movie to Movie difference through and User to User differences through .
The RMSE of Movie+User Effects Based Model on
final_holdout_test sets is 0.865349.
The result has Improved Significantly and it is Almost Reaching
the Desired Performance.
The Movie+User+Genres Effects Based Model is defined as follow:
with is the Mean and is the independent errors sampled from the same distribution centered at 0. The is a measure for popularity of Movie . The is a measure for mildness of User . I define as the genres for User rating of Movie .
RMSE of Movie+User+Genres Effects Based Model on
final_holdout_test sets is 0.864947.
It’s reaching the Desired Performance but it represents only a
very Little Improvement over Movie+User Effects Based Model.
The Movie+User+Time Effects Based Model is defined as follow:
with is the Mean and is the independent errors sampled from the same distribution centered at 0. The is a measure for popularity of Movie . The is a measure for mildness of User . We define a smooth function of .
The RMSE of Movie+User+Time Effects Based Model on
final_holdout_test sets is 0.864097.
While it performs Slightly Better than Movie+User+Genres Effects
Model, the Improvement is Modest.
The Regularization Method introduces a penalty term (often denoted as Lambda ) to address the issue of Overfitting. Specifically, it penalizes Movies with large estimates based on a small sample size. By doing so, it helps prevent the Model from fitting noise in the data and encourages more robust Predictions.
In order to optimize , Regularized Movie Effects Based Model is defined as follow:
reduced to this Model as follow:
The RMSE of Regularized Movie Effects Based Model on
final_holdout_test sets is 0.943852.
However, the Improvement over the Non-Regularized Movie Effects
Based Model is Minimal.
4.1.6.1 Plot - Regularized Movie Effects (RMSES vs Lambdas)
In order to optimize , it is defined as follow:
reduced to this Model as follow:
The RMSE of Regularized Movie+User Effects Based Model on final_holdout_test sets is 0.86482.
While it shows Substantial Improvement compared to the Regularized Movie Effects Based Model but the Improvement over Non-Regularized Movie+User Effects Based Model is only Marginal.
4.1.7.1 Plot - Regularized Movie+User Effects (RMSES vs Lambdas)
In order to optimize b_g, it is defined as follow:
reduced to this Model as follow:
The RMSE of Regularized Movie+User+Genres Effects Based Model on final_holdout_test sets is 0.864456.
While it is Slight Improvement over the Non-Regularized Movie+User+Genres Effects Based Model, the RMSE remains Very Close to the Regularized Movie+User Effects Based Model.
4.1.8.1 Plot - Regularized Movie+User+Genres Effects (RMSES vs Lambdas)
In order to optimize ( is Time Specific Effects), it is defined as follow:
reduced to this Model as follow:
The RMSE of Regularized Movie+User+Time Effects Based Model on final_holdout_test sets is 0.863759.
Although the Improvement in RMSE is Not Substantial, but this Model Outperforms All other Models by achieving the lowest RMSE value. Consequently, the Regularized Movie+User+Time Effects Based Model demonstrates a Very Good Performance.
4.1.9.1 Plot - Regularized Movie+User+Time Effects (RMSES vs Lambdas)
R codes for Final Regularized Movie+User+Time Effects Based Model Building
# 1) build regularized movie+user+time effects based model:
#
# a) training of edx datasets
# b) testing on final_holdout_test sets
#
# 2) apply relevant minimum lambda value on algorithm and compute final model RMSE
# minimum lambda value of regularized movie+user+time effects based model from wrangled data
lambd <- min_lambda$movie_user_time[1]
# compute mean of ratings of edx datasets
mu <- mean(edx$rating, na.rm=TRUE)
# compute b_i (movie effects) of edx datasets
b_i <- edx %>%
group_by(movieId) %>%
summarize(b_i = sum(rating - mu)/(n()+lambd))
# compute b_u (user effects) of edx datasets
b_u <- edx %>%
left_join(b_i, by="movieId") %>%
group_by(userId) %>%
summarize(b_u = sum(rating - b_i - mu)/(n()+lambd))
# compute b_t (time effects) of edx datasets
b_t <- edx %>%
left_join(b_u, by="userId") %>%
group_by(movieId) %>%
summarize(b_t = sum( rating - b_u - m_rw ) /(n()+lambd))
# compute predicted ratings (mean of ratings + movie Effects + user Effects + time effects)
predicted_ratings <- final_holdout_test %>%
left_join(b_i, by = "movieId") %>%
left_join(b_u, by = "userId") %>%
left_join(b_t, by = "movieId") %>%
mutate(pred = mu + b_i + b_u + b_t ) %>%
pull(pred)
# compute RMSE on final_holdout_test
final_model_rmse_rmut <- RMSE(predicted_ratings, final_holdout_test$rating)
# generate regularized movie+user+time effects based model final models RMSE table
final_model_rmse_table <- bind_rows(final_model_rmse_table,
data_frame(MODEL = "Regularized Movie+User+Time Effects Based Model",
RMSE = final_model_rmse_rmut))
- Use Collaborative Filtering Approach for Models Building.
- Design and formula the linear relationship between the Predictors/Features.
- Fit our Machine Learning Models with Movie, User, Genres or Time Metrics for Prediction.
- Regularization Method is used to penalize magnitudes of Parameters to avoid Overfitting.
- Use Regularization to estimate Movie Effects, User Effects, Genres Effects and Time Effects.
- Determine Minimum Lambda values for different Models during training and testing process using Cross Validation Method.
- Predict rating of Movie by applying relevant Minimum Lambda on Algorithm.
- Compute Root Mean Squared Errors (RMSES) of different Models.
- Retain the relevant Minimum Lambda values to Compute RMSES for Final Models.
- Select the Ultimate Best Model with an lowest RMSE value among all of the Final Models.
The RMSE Metric is used to evaluate an Algorithm. These can all be derived from using values of Lambda () from 0 to 10 increment by 0.25 and then find Lambda values that minimizes the RMSES.
Lower the Root Mean Squared Error values indicate better Model Performance. The Goal is to minimize the RMSE between Predicted and Actual rating. Regularization Method is used to shrink deviations from the average towards 0. To apply the Method, I subtract the overall average before shrinking since using Regularization Method is shrinking values towards to 0. Predicted rating for each Model is divided by n + Lambda, with the size and Lambda a Regularization Parameter.
On average, RMSE Does Not Change much as gets Larger, but the Variability of RMSES Decreases.
A) Following table show Lambdas Minimum RMSE for Regularized Final Models
Effects | Lambdas |
---|---|
Regularized Movie Effects | 2.25 |
Regularized Movie+User Effects | 4.75 |
Regularized Movie+User+Genres Effects | 4.75 |
Regularized Movie+User+Time Effects | 4.00 |
Lambdas give the Minimun RMSE
B) Following table show Summary Results of RMSE for all Final Models
MODEL | RMSE |
---|---|
Naive Mean Based Model | 1.061202 |
Movie Effects Based Model | 0.943909 |
Movie+User Effects Based Model | 0.865349 |
Movie+User+Genres Effects Based Model | 0.864947 |
Movie+User+Time Effects Based Model | 0.864097 |
Regularized Movie Effects Based Model | 0.943852 |
Regularized Movie+User Effects Based Model | 0.864820 |
Regularized Movie+User+Genres Effects Based Model | 0.864456 |
Regularized Movie+User+Time Effects Based Model | 0.863759 |
Result of Final Models - RMSE
Final Model with Lowest RMSE is “Regularized Movie+User+Time Effects Based Model”.
MODEL | RMSE |
---|---|
Regularized Movie+User+Time Effects Based Model | 0.863759 |
Final Model with Lowest RMSE
After training different Models on the edx datasets and evaluating
them on final_holdout_test sets, The Regularized Movie+User+Time
Effects Based Model achieved an Lowest RMSE of 0.863759.
This Result also Demonstrates a Very Strong Performance.
Collaborative Filtering Approach captures the interactions between Users and Movies that result in diverse ratings. However, observed ratings are also influenced by Effects associated with Users and Movies, such as Genres Effects and Time Effects. During the analysis process, I explore and ascertain Time Effects account for a portion of the Variability of ratings. A observed trends indicates that More Frequently a Movie is rated, the Higher its Average rating. Through Data Visualization, Stratification Analysis, Distributions Analysis and Frequency Analysis, I determine that Mean of rating by Week of Date is the most appropriate Parameter for Modelling Time Effects.
The Regularization Method is employed to shrunk the Mean of rating towards Zero, thereby avoiding Overfitting. The Algorithm demonstrates Robust Performance in Predicting the Movie rating as evidenced by RMSE value of Final Model.
However, after incorporating Regularization and Cross Validation Method, the Improvement Over some Other Regularized Models is still Minimal. It may also be a Limitation of Regularization and Cross Validation Method when using the Collaborative Filtering Approach.
RMSE of Regularized Movie+User+Time Effects Based Model(0.863759) Outperforms Non-Regularized Movie+User+Time Effects Based Model(0.864097), as well as Regularized Movie+User+Genres Effects Based Model(0.864456) and Regularized Movie+User Effects Based Model(0.86482).
Regularized Movie+User+Time Effects Based Model is the Optimal Model, achieving an Lowest Root Mean Square Error (RMSE) of 0.863759. I am Optimistic about the Final Machine Learning Algorithm that we have Selected.
The Final Regularized Movie+User+Time Effects Based Model is defined as follow:
The Best Model is Regularized Movie+User+Time Effects Based Model, which achieved an Lowest RMSE. I am satisfied with the results, and the Reliability and Trustworthiness of the Model are Validated by the RMSE Metric Evaluation.
The Final Machine Learning Model achieved an Lowest RMSE of 0.863759 which can described as Exceptional Outcome. As a Result, I am Confident in adopting the Ultimate Model and Algorithm to build our Movie Recommendation System.
In this project, we refer our Movie Recommendation System to the Machine Learning tasks as Prediction since the predicted ratings output is Continuous. Thus, RMSE Metric is used to evaluate our Model/Algorithm. We can conduct research on Machine Learning Algorithm for assigning Predicted rating to the appropriate Classes of rating.
The 10 Classes of ratings are (0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0) because the ratings of Movie range from 0.5 to 5.0 with increment of 0.5.
Then, We can refer the Machine Learning task as Classification since the outcome is Categorical. Therefore, Metrics such as Accuracy, F1 Score, Sensitivity and Specificity can also be adopted to evaluate our Machine Learning Model in the Future.
knitr::opts_chunk$set(
message = FALSE,
warning = FALSE
)
# Note: It takes approx. 4 Minutes to generate the Data Science Report.
#############################################
# Installing Packages and Loading Libraries #
#############################################
# Install Necessary Packages if required
if(!require(tidyverse)) install.packages("tidyverse",
repos = "http://cran.us.r-project.org")
if(!require(tidyr)) install.packages("tidyr",
repos = "http://cran.us.r-project.org")
if(!require(dplyr)) install.packages("dplyr",
repos = "http://cran.us.r-project.org")
if(!require(lubridate)) install.packages("lubridate",
repos = "http://cran.us.r-project.org")
if(!require(stringr)) install.packages("stringr",
repos = "http://cran.us.r-project.org")
if(!require(ggplot2)) install.packages("ggplot2",
repos = "http://cran.us.r-project.org")
if(!require(gridExtra)) install.packages("gridExtra",
repos = "http://cran.us.r-project.org")
if(!require(knitr)) install.packages("knitr",
repos = "http://cran.us.r-project.org")
if(!require(rstudioapi)) install.packages("rstudioapi",
repos = "http://cran.us.r-project.org")
if(!require(caret)) install.packages("caret",
repos = "http://cran.us.r-project.org")
if(!require(tinytex)){
install.packages("tinytex", repos = "http://cran.us.r-project.org")
tinytex::install_tinytex()
}
# Loading Necessary Libraries
library(dslabs)
library(tidyverse)
library(dplyr)
library(tidyr)
library(lubridate)
library(stringr)
library(ggplot2)
library(gridExtra)
library(knitr)
library(rstudioapi)
library(caret)
library(tinytex)
# Set number of significant digits=6 globally
options(digits=6)
set.seed(1990)
# Loading rda files which previously generated after running recommendation-system.R
load("rda/edx-original-rda.rda")
load("rda/edx-rda.rda")
load("rda/final-holdout-test.rda")
load("rda/min-lambda.rda")
# - generate dimension of edx: number of rows and columns
# - display output using kable function
dim_df <- data.frame(Rows = dim(edx_original)[1], Columns = dim(edx_original)[2])
dim_df %>% knitr::kable(caption="edx datasets")
# - generate column name and class of edx
# - display output using kable function
class_results <- data.frame(Class=sapply(edx_original, class))
class_results %>% knitr::kable(caption="edx datasets")
# compute number of unique movieId, userId and genres
u_movie_n <- length(unique(edx_original$movieId))
u_user_n <- length(unique(edx_original$userId))
u_genres_n <- length(unique(edx_original$genres))
# - generate table - Number of Unique movieId, userId and genres.
# - display output using kable function
um_result <- data.frame(Description="Number of Unique movieId", Count=u_movie_n )
um_result <- bind_rows(um_result,
data.frame(Description="Number of unique userId", Count=u_user_n))
um_result <- bind_rows(um_result,
data.frame(Description="Number of unique combined genres", Count=u_genres_n))
um_result %>% knitr::kable(caption="edx datsets")
# - generate 10 examples - counting the Occurrences of rating by movieId
# - display output using kable function
m_result <- edx %>% count(movieId,name="Occurency") %>%
count(Occurency, name="Count") %>%
arrange(desc(Count)) %>%
dplyr::slice(1:10)
m_result %>% knitr::kable(caption="edx datasets")
# - generate 10 examples - counting the occurrences of rating by userId
# - display output using kable function
u_result <- edx %>% count(userId,name="Occurency") %>%
count(Occurency, name="Count") %>%
dplyr::slice(30:40)
u_result %>% knitr::kable(caption="edx datasets")
# - generate distributions (number of occurrence) using movieId and userId
# - ggplot histogram - m1 and m2
m1 <- edx %>%
dplyr::count(movieId) %>%
ggplot(aes(n)) +
geom_histogram(bins = 50, fill="#006EBB", color="black") +
labs(y="Count",x="Number of Occurrence") +
scale_x_log10() +
ggtitle("Movies")
m2 <- edx %>%
dplyr::count(userId) %>%
ggplot(aes(n)) +
geom_histogram(bins = 200, fill="cyan", color="#D883B7" ) +
labs(y="Count",x="Number of Occurrence") +
scale_x_log10()+
ggtitle("Users")
grid.arrange(m1, m2, ncol = 2)
# - generate missing value (if any) in column rating of edx
# - display output using kable function
na_results <- edx[apply(is.na(edx),1,any),]
na_results %>% select(userId,movieId,rating,timestamp,title,genres) %>%
knitr::kable(caption="Columns with NA values in edx datasets")
# generate and display zeros value (if any) in column rating of edx
length(which(edx$rating==0))
# - generate 8 examples: the original edx - rating given by one User to one movie
# - display output of edx using kable function
edx %>%
group_by(title) %>%
mutate(f=length(rating)) %>% ungroup() %>%
filter(f > 10000) %>%
distinct(title, .keep_all=TRUE) %>%
arrange(desc(f)) %>%
select(movieId,title,genres,userId,rating,timestamp) %>%
dplyr::slice(1:8) %>%
knitr::kable(caption="edx datasets")
# - mean of rating by genres bEFORE combined genres separation
# - generate 20 examples-mean of rating by genres
# in descending order by number of rating(count > 10000)
genres_avg <- edx %>%
group_by(genres) %>%
summarize(average_rating_genres=mean(rating, na.rm=TRUE ),
se=sd(rating, na.rm=TRUE)/sqrt(n()), count=n()) %>%
filter(count > 10000) %>%
arrange(desc(average_rating_genres)) %>%
mutate(rank=rownames(.))
# display output of genres_avg
genres_avg %>% select(rank, genres, average_rating_genres, count) %>%
dplyr::slice(1:20)
# prepare genres_avg_-plot for ggplot
genres_avg_plot <- genres_avg %>% dplyr::slice(1:20)
# display mean of rating by genres (genres="Drama")
genres_avg %>% select(rank, genres, average_rating_genres, count) %>%
dplyr::slice(48)
# display mean of rating by genres (genres="Comedy")
genres_avg %>% select(rank, genres, average_rating_genres, count) %>%
dplyr::slice(131)
# generate and display rank, Combined genres, Mean of rating by genres
# and number of rating of "Comedy" and "Drama"
genres_t1 <- genres_avg %>%
filter(genres%in%c("Comedy","Drama")) %>%
select(rank, genres, average_rating_genres, count)
# ggplot mean of rating by genres with error bars before genres separation
genres_avg_plot %>%
ggplot(aes(x=genres, y=average_rating_genres)) +
geom_errorbar(aes(ymin=average_rating_genres - se,
ymax=average_rating_genres + se),color="#7977B8") +
geom_point(color="#F89E78") +
theme(axis.text.x=element_text(angle=90, vjust=0.5, hjust=1)) +
labs(x="Combined genres", y="Mean of rating",
title="(Original Combined genres) Mean of rating by genres with Error Bars")
# - mean of rating by genres after Combined genres separation
# - data Wrangling: 1) separate combined genres into several Rows
# and each row contains only one genre
# 2) use data frame edx_original
# - generate mean of rating by genres in descending order(count > 20000)
genres_avg <- edx_original %>%
separate_longer_delim(genres, delim="|") %>%
group_by(genres) %>%
summarize(average_rating_genres=mean(rating, na.rm=TRUE ),
se=sd(rating, na.rm=TRUE)/sqrt(n()), count=n()) %>%
filter(count > 200000) %>%
arrange(desc(average_rating_genres)) %>%
mutate(rank=rownames(.))
# display output of genres_avg
genres_avg %>% select(rank, genres, average_rating_genres, count) %>%
dplyr::slice(1:20)
# generate rank, combined genres, mean of rating by genres
# and number of rating of "Comedy" and "Drama"
genres_t2 <- genres_avg %>%
filter(genres%in%c("Comedy","Drama")) %>%
select(rank, genres, average_rating_genres,count)
# Prepare genres_avg_plot for ggplot
genres_avg_plot <- genres_avg %>% dplyr::slice(1:20)
# ggplot mean of rating by genres with error bars after genres separated into several rows
genres_avg_plot %>%
ggplot(aes(x=genres, y=average_rating_genres)) +
geom_errorbar(aes(ymin= average_rating_genres- se,
ymax= average_rating_genres+ se),color="#7977B8") +
geom_point(color="#F89E78") +
theme(axis.text.x=element_text(angle=90, vjust=0.5, hjust=1)) +
labs(x="Separated genres", y="Mean of rating",
title="(After genres Separated) Mean of rating by genres with Error Bars") +
scale_y_continuous(breaks=seq(0.5,5,by=0.1))
# display output table of genres_t1 using kable function
genres_t1 %>% knitr::kable(caption="BEFORE Combined genres Separated")
# display output table of genres_t2 using kable function
genres_t2 %>% knitr::kable(caption="After Combined genres Separated")
# data Wrangling: generate new columns (d_w, d_m, d_y) from column timestamp
edx <- edx %>% mutate(d_w=format(round_date(as_datetime(timestamp),"week"),"%Y-%m-%d"))
edx <- edx %>% mutate(d_m=format(round_date(as_datetime(timestamp),"month"),"%Y-%m-%d"))
edx <- edx %>% mutate(d_y=format(round_date(as_datetime(timestamp),"year"),"%Y-%m-%d"))
# data Wrangling: generate new columns (m_r, m_rw, m_ry, m_rg, tot_nr) from column rating
edx <- edx %>% group_by(movieId) %>% mutate(m_r = mean(rating, na.rm=TRUE))
edx <- edx %>% group_by(movieId,d_w) %>% mutate(m_rw= mean(rating, na.rm=TRUE))
edx <- edx %>% group_by(movieId,d_y) %>% mutate(m_ry= mean(rating, na.rm=TRUE))
edx <- edx %>% group_by(userId,genres) %>% mutate(m_rg=mean(rating, na.rm=TRUE))
edx <- edx %>% group_by(movieId) %>% mutate(tot_nr=n()) %>% ungroup()
# data wrangling: generate new column release from column title
edx <- edx %>% mutate(release = str_extract(title,"\\d{4}")) %>%
mutate(title = str_replace(title,"\\s*\\(\\d{4}\\)",""))
# generate and display 10 examples encompassing new features (d_w)
edx %>% group_by(title) %>%
mutate(f=length(rating)) %>% ungroup() %>%
filter(f > 15000) %>%
arrange(desc(f),timestamp) %>%
distinct(title, .keep_all=TRUE) %>%
select(userId,movieId,title,d_w,rating) %>%
dplyr::slice(10:20)
# generate 8 examples encompassing new features (release, m_r, m_rw, m_ry, m_rg, tot_nr)
edx %>% group_by(title) %>%
mutate(f=length(rating)) %>% ungroup() %>%
filter(f > 20000) %>%
arrange(desc(f),timestamp) %>%
distinct(title, .keep_all=TRUE) %>%
select(userId,release,title,rating,m_r,m_rw,m_ry,m_rg) %>%
dplyr::slice(1:8)
# generate 20 movies: most frequently rated by User with frequency > 10000
c_1 <- edx %>%
group_by(title) %>%
mutate(frequency=length(rating)) %>% ungroup() %>%
filter(frequency > 10000) %>%
distinct(title, .keep_all=TRUE) %>%
arrange(desc(frequency)) %>%
mutate(rank=rownames(.)) %>%
select(rank,title,frequency) %>%
dplyr::slice(1:20)
# display c_1 using kable function
c_1 %>% knitr::kable(caption="Frequency of rating By title")
# ggplot frequency of rating By title
c_1 %>% mutate(title=reorder(title,frequency)) %>%
ggplot(aes(title,frequency)) +
geom_bar(stat="identity",fill="#D883B7",color="white") +
labs(y="Frequency",x="Moive title",title="Frequency of User rating By Movie title") +
theme(axis.text.x=element_text(angle=90)) +
scale_y_continuous(breaks=seq(0,33000,by=3000)) +
coord_flip()
# generate 20 genres: most frequently rated by User with frequency > 10000
c_2 <- edx %>%
group_by(genres) %>%
mutate(frequency=length(rating)) %>% ungroup() %>%
filter(frequency > 10000) %>%
distinct(genres, .keep_all=TRUE) %>%
arrange(desc(frequency)) %>%
mutate(rank=rownames(.)) %>%
select(rank,genres,frequency) %>%
dplyr::slice(1:20)
# display c_2
print_df <- function(title,df)
{
cat(title, "\n\n")
cat(capture.output(print(n=50,df)), sep="\n")
}
print_df("List of 20 genres - Frequency of rating By genres ",c_2)
# ggplot frequency of rating By genres
c_2 %>% mutate(genres=reorder(genres,frequency)) %>%
ggplot(aes(genres,frequency)) +
geom_bar(stat="identity",fill="#584298",color="white") +
labs(y="Frequency",x="genres",title="Frequency of User rating By genres") +
theme(axis.text.x=element_text(angle=90)) +
scale_y_continuous(breaks=seq(0,800000,by=80000)) +
coord_flip()
# compute correlation coefficient (r) of rating and m_r with 6 decimals place
r <- edx %>%
summarize(r=cor(rating,m_r)) %>%
pull(r)
r <- round(r,6)
# ggplot correlation of Normalized Mean of rating and rating
edx %>% mutate(rating=scale(rating),m_r=scale(m_r)) %>%
group_by(rating) %>%
summarize(m_r=mean(m_r)) %>%
ggplot(aes(rating,m_r)) + geom_point()+
geom_abline(intercept=0, slope=r,color="#F89E78")+
labs(x="scale(rating)",y="scale(m_r)",
title=paste("Normalized Mean of rating vs rating [cor =",r,"]"))
# compute correlation coefficient (r) of rating and m_rw with 6 decimals place
r <- edx %>%
summarize(r=cor(rating,m_rw)) %>%
pull(r)
r <- round(r,6)
# ggplot correlation of normalized mean by week of date and rating
edx %>% mutate(rating=scale(rating),m_rw=scale(m_rw)) %>%
group_by(rating) %>%
summarize(m_rw=mean(m_rw)) %>%
ggplot(aes(rating,m_rw)) + geom_point()+
geom_abline(intercept=0, slope=r,color="#F89E78")+
labs(x="scale(rating)",y="scale(m_rw)",
title=paste("Normalized Mean of rating by Week of Date vs rating [cor =",r,"]"))
# compute correlation coefficient (r) of rating and m_ry with 6 decimals place
r <- edx %>%
summarize(r=cor(rating,m_ry)) %>%
pull(r)
r <- round(r,6)
# ggplot correlation of normalized mean by year of date and rating
edx %>% mutate(rating=scale(rating),m_ry=scale(m_ry)) %>%
group_by(rating) %>%
summarize(m_ry=mean(m_ry)) %>%
ggplot(aes(rating,m_ry)) + geom_point()+
geom_abline(intercept=0, slope=r,color="#F89E78")+
labs(x="scale(rating)",y="scale(m_ry)",
title=paste("Normalized Mean of rating by Year of Date vs rating [cor =",r,"]"))
# compute correlation coefficient (r) of rating and m_rg with 6 decimals place
r <- edx %>%
summarize(r=cor(rating,m_rg)) %>%
pull(r)
r <- round(r,6)
# ggplot correlation of normalized mean of rating by genres and rating
edx %>% mutate(rating=scale(rating),m_rg=scale(m_rg)) %>%
group_by(rating) %>%
summarize(m_rg=mean(m_rg)) %>%
ggplot(aes(rating,m_rg)) + geom_point()+
geom_abline(intercept=0, slope=r,color="#F89E78") +
labs(x="scale(rating)",y="scale(m_rg)",
title=paste("Normalized Mean of rating by genres vs rating [cor =",r,"]"))
# generate correlation coefficient table (m_r, m_rw, m_ry, m_rg) of edx
avg_r_all <- edx %>%
select(rating,m_r,m_rw,m_ry,m_rg)
cor_r_all <- cor(na.omit(avg_r_all[, unlist(lapply(avg_r_all, is.numeric))]))
# display correlation coefficient table using kable function
cor_r_all %>% knitr::kable(caption="Correlation - new Features of edx dataset")
# ggplot rating distributions of movie
edx %>% ggplot(aes(rating)) +
geom_histogram(binwidth=0.5,fill="#7977B8",color="black") +
labs(y="Frequency",x="rating",title="rating Distributions of Movie") +
scale_x_continuous(breaks=seq(0.5,5,by=0.5))
# generate frequency group by rating
edx_asc <- edx %>%
group_by(rating) %>%
summarise(Frequency = n()) %>%
arrange(Frequency)
# ggplot distributions of movie from most given rating in order from most to least
edx_asc %>%
ggplot(aes(x=reorder(rating, Frequency), y=Frequency)) +
geom_bar(stat="identity", width=0.5, aes(fill=Frequency)) +
scale_fill_gradient(low="red", high="blue") +
coord_flip() +
labs(y="Frequency",
x="rating",
title = "Distributions of Movie (Most Given rating in order from Most to Least)")
# ggplot mean of rating distributions
edx %>% ggplot(aes(m_r)) +
geom_histogram(binwidth=0.1,fill="#C6DC67",color="black") +
labs(y="Frequency",x="Mean of rating",title="Mean of rating Distributions") +
theme(axis.text.x=element_text(angle=90)) +
scale_x_continuous(breaks=seq(0.5,5,by=0.1))
# ggplot mean of rating distributions by movieId, week of date
edx %>% ggplot(aes(m_rw)) +
geom_histogram(binwidth=0.1,fill="#C6DC67",color="black") +
labs(y="Frequency",x="Mean of rating",
title="Mean of rating Distributions By movieId, Week of Date") +
theme(axis.text.x=element_text(angle=90)) +
scale_x_continuous(breaks=seq(0.5,5,by=0.1))
# ggplot mean of rating distributions by movieId, year of date
edx %>% ggplot(aes(m_ry)) +
geom_histogram(binwidth=0.1,fill="#C6DC67",color="black") +
labs(y="Frequency",x="Mean of rating",
title="Mean of rating Distributions By movieId, Year of Date") +
theme(axis.text.x=element_text(angle=90)) +
scale_x_continuous(breaks=seq(0.5,5,by=0.1))
# ggplot mean of rating distributions by userId, genres
edx %>% ggplot(aes(m_rg)) +
geom_histogram(binwidth=0.1,fill="cyan",color="black") +
labs(y="Frequency",x="Mean of rating",
title="Mean of rating Distributions By userId, genres") +
theme(axis.text.x=element_text(angle=90)) +
scale_x_continuous(breaks=seq(0.5,5,by=0.1))
# ggplot number of rating distributions by movieId
edx %>% ggplot(aes(movieId)) +
geom_histogram(binwidth=1,color="#7977B8") +
labs(y="Number of rating",x="movieId",title="Number of rating Distributions By movieId") +
theme(axis.text.x=element_text(angle=90)) +
scale_x_continuous(breaks=seq(0,70000,by=7000))
# ggplot rating distributions through the day of year
edx %>% ggplot(aes(yday_dw)) +
geom_histogram(binwidth=0.05,color="#F69289") +
labs(y="Frequency",x="Day",title="rating Distributions through the Day of Year") +
scale_x_continuous(breaks=seq(0,400,by=50))
# ggplot rating distributions (month)
edx %>% ggplot(aes(month_dm)) +
geom_histogram(binwidth=1,fill="#46C5DD",color="black") +
labs(y="Frequency",x="Months",title="rating Distributions (Month)") +
scale_x_continuous(breaks=seq(1,12,by=1))
# ggplot rating distributions (year)
edx %>% ggplot(aes(year_dy)) +
geom_histogram(binwidth=1,fill="#46C5DD",color="black") +
labs(y="Frequency",x="Years",title="rating Distributions (Year)") +
theme(axis.text.x=element_text(angle=90)) +
scale_x_continuous(breaks=seq(1993,2020,by=1))
# RMSE Function
RMSE <- function(predicted_ratings, true_ratings){
sqrt(mean((true_ratings - predicted_ratings)^2))
}
######################################################################################
# Build Final Models - Training of edx and testing on final_holdout_test sets #
# #
######################################################################################
# #
# 1. Models Training of edx datasets. #
# #
# 2. Apply relevant Minimum Lambda value. #
# #
# 3. Final Models Testing on final_holdout_test sets. #
# #
# 4. Compute RMSE for Final Models. #
# #
# 5. Generate Final Model RMSE table. #
# #
######################################################################################
# build naive mean based model": 1) training of edx datasets
# 2) testing on final_holdout_test sets
# 3) compute final model RMSE
#
set.seed(755)
# compute mean of ratings using edx
mu <- mean(edx$rating, na.rm=TRUE)
# compute RMSE on final_holdout_test
final_model_rmse_naive <- RMSE(mu, final_holdout_test$rating)
# generate naive mean based model final model RMSE table
final_model_rmse_table <- data_frame(MODEL = "Naive Mean Based Model",
RMSE = final_model_rmse_naive)
# build movie effects based model: 1) training of edx datasets
# 2) testing on final_holdout_test sets
# 3) compute final model RMSE
# compute mean of ratings of edx
mu <- mean(edx$rating,na.rm=TRUE )
# compute b_i (movie effects) of edx
movie_avgs <- edx %>%
group_by(movieId) %>%
summarize(b_i = mean(rating - mu))
# compute predicted ratings (mean of ratings + movie effects)
predicted_ratings <- final_holdout_test %>%
left_join(movie_avgs, by='movieId') %>%
mutate(pred = mu + b_i ) %>%
pull(pred)
# compute RMSE on final_holdout_test
final_model_rmse_m <- RMSE(predicted_ratings, final_holdout_test$rating)
# generate movie effects based model final models RMSE table
final_model_rmse_table <- bind_rows(final_model_rmse_table,
data_frame(MODEL ="Movie Effects Based Model",
RMSE = final_model_rmse_m ))
# qplot histogram: movie effects (frequency vs b_i)
movie_avgs %>%
qplot(b_i,geom ="histogram",bins=30,data =.,color=I("black")) +
labs(title="Movie Effects Histogram (b_i)")
# ggplot histogram: user effects (frequency vs b_u)
edx %>%
group_by(userId) %>%
summarize(b_u = mean(rating)) %>%
ggplot(aes(b_u)) +
geom_histogram(bins = 30, color = "black") +
labs(title="User Effects Histogram (b_u)")
# build movie+user effects based model: 1) training of edx datasets
# 2) testing final_holdout_test sets
# 3) compute final model RMSE
# compute b_u (user effects) of edx
user_avgs <- edx %>%
left_join(movie_avgs, by='movieId') %>%
group_by(userId) %>%
summarize(b_u = mean(rating - mu - b_i))
# compute predicted ratings (mean of ratings + movie effects + user effects)
predicted_ratings <- final_holdout_test %>%
left_join(movie_avgs, by='movieId') %>%
left_join(user_avgs, by='userId') %>%
mutate(pred = mu + b_i + b_u ) %>%
pull(pred)
# compute RMSE on final_holdout_test
final_model_rmse_mu <- RMSE(predicted_ratings,final_holdout_test$rating)
# generate movie+user effects based model final models RMSE table
final_model_rmse_table <- bind_rows(final_model_rmse_table,
data_frame(MODEL="Movie+User Effects Based Model",
RMSE = final_model_rmse_mu))
# build movie+user+genres effects based model: 1) training of edx datasets
# 2) testing on final_holdout_test sets
# 3) compute final model RMSE
# compute mean of ratings of edx
mu <- mean(edx$rating, na.rm=TRUE)
# compute b_i (movie effects) of edx
b_i <- edx %>%
group_by(movieId) %>%
summarize(b_i = mean(rating - mu))
# compute b_u (user effects) of edx
b_u <- edx %>%
left_join(b_i, by="movieId") %>%
group_by(userId) %>%
summarize(b_u = mean(rating - b_i - mu))
# compute b_g (genres effects) of edx
b_g <- edx %>%
left_join(b_u, by="userId") %>%
group_by(genres) %>%
summarize(b_g = mean(rating - b_u - m_rg))
# compute predicted ratings (mean of ratings + movie effects + user Effects + genres effects)
predicted_ratings <- final_holdout_test %>%
left_join(b_i, by = "movieId") %>%
left_join(b_u, by = "userId") %>%
left_join(b_g, by = "genres") %>%
mutate(pred = mu + b_i + b_u + b_g ) %>%
pull(pred)
# compute RMSE on final_holdout_test
final_model_rmse_mug <- RMSE(predicted_ratings, final_holdout_test$rating)
# generate movie+user+genres effects based model final models RMSE table
final_model_rmse_table <- bind_rows(final_model_rmse_table,
data_frame(MODEL ="Movie+User+Genres Effects Based Model",
RMSE = final_model_rmse_mug))
# build movie+user+time effects based model: 1) training of edx datasets
# 2) testing of final_holdout_test sets
# 3) compute final model RMSE
# compute mean of ratings of edx
mu <- mean(edx$rating, na.rm=TRUE)
# compute b_i (movie effects) of edx
b_i <- edx %>%
group_by(movieId) %>%
summarize(b_i = mean(rating - mu))
# compute b_u (user Effects) of edx
b_u <- edx %>%
left_join(b_i, by="movieId") %>%
group_by(userId) %>%
summarize(b_u = mean(rating - b_i - mu))
# compute b_t (time effects) of edx
b_t <- edx %>%
left_join(b_u, by="userId") %>%
group_by(movieId) %>%
summarize(b_t = mean(rating - b_u - m_rw))
# compute predicted ratings (mean of ratings + movie effects + user effects + time effects)
predicted_ratings <- final_holdout_test %>%
left_join(b_i, by = "movieId") %>%
left_join(b_u, by = "userId") %>%
left_join(b_t, by = "movieId") %>%
mutate(pred = mu + b_i + b_u + b_t ) %>%
pull(pred)
# compute RMSE on final_holdout_test
final_model_rmse_mut <- RMSE(predicted_ratings, final_holdout_test$rating)
# generate movie+user+time effects based model final models RMSE table
final_model_rmse_table <- bind_rows(final_model_rmse_table,
data_frame(MODEL ="Movie+User+Time Effects Based Model",
RMSE = final_model_rmse_mut))
# 1) build regularized movie effects based model: - training of edx datasets and
# - testing on final_holdout_test sets
#
# 2) apply relevant minimum lambda value on algorithm and compute final model RMSE
# minimum lambda value of regularized movie effects based model from wrangled data
lambd <- min_lambda$movie[1]
# compute mean of ratings of edx
mu <- mean(edx$rating,na.rm=TRUE )
# compute b_i (movie effects) of edx
movie_reg_avgs <- edx %>%
group_by(movieId) %>%
summarize(b_i = sum(rating - mu)/(n()+lambd), n_i = n())
# compute predicted ratings (mean of ratings + movie effects)
predicted_ratings <- final_holdout_test %>%
left_join(movie_reg_avgs, by='movieId') %>%
mutate(pred = mu + b_i ) %>%
pull(pred)
# compute RMSE on final_holdout_test
final_model_rmse_rm <- RMSE(predicted_ratings, final_holdout_test$rating)
# generate regularized movie effects based model final models RMSE table
final_model_rmse_table <- bind_rows(final_model_rmse_table,
data_frame(MODEL ="Regularized Movie Effects Based Model",
RMSE = final_model_rmse_rm))
# lambdas = sequence from 0 to 10 increment by 0.25
lambdas <- seq(0, 10, 0.25)
# qplot - Regularized Movie Effects (RMSES vs Lambdas)
qplot(lambdas, min_lambda$rmses_rm) +
labs(x="Lambdas",y="RMSES",title="Regularized Movie Effects (RMSES vs Lambdas)")
# 1) build regularized movie+user effects based model:
#
# a) training of edx datasets.
# b) testing on final_holdout_test sets
#
# 2) apply relevant minimum lambda value on algorithm and compute final model RMSE
# minimum lambda value of regularized movie+user effects based model from wrangled data
lambd <- min_lambda$movie_user[1]
# compute mean of ratings of edx datasets
mu <- mean(edx$rating, na.rm=TRUE)
# compute b_i (movie effects) of edx datasets
b_i <- edx %>%
group_by(movieId) %>%
summarize(b_i = sum(rating - mu) /(n()+lambd))
# compute b_u (user effects) of edx datasets
b_u <- edx %>%
left_join(b_i, by="movieId") %>%
group_by(userId) %>%
summarize(b_u = sum(rating - b_i - mu) /(n()+lambd))
# compute predicted ratings (mean of ratings + movie effects + user Effects)
predicted_ratings <- final_holdout_test %>%
left_join(b_i, by = "movieId") %>%
left_join(b_u, by = "userId") %>%
mutate(pred = mu + b_i + b_u ) %>%
pull(pred)
# compute RMSE on final_holdout_test
final_model_rmse_rmu <- RMSE(predicted_ratings, final_holdout_test$rating)
# generate regularized movie+user effects based model final models RMSE table
final_model_rmse_table <- bind_rows(final_model_rmse_table,
data_frame(MODEL = "Regularized Movie+User Effects Based Model",
RMSE = final_model_rmse_rmu))
# lambdas = sequence from 0 to 10 increment by 0.25
lambdas <- seq(0, 10, 0.25)
# qplot regularized movie+user effects (RMSES vs Lambdas)
qplot(lambdas, min_lambda$rmses_riu) +
labs(y="RMSES",x="Lambdas",title="Regularized Movie+User Effects (RMSES vs Lambdas)")
# 1) build regularized movie+user+genres effects based model:
#
# a) training of edx datasets
# b) testing on final_holdout_test sets
#
# 2) apply relevant minimum lambda value on algorithm and compute final model RMSE
# minimum lambda value of regularized movie+user+genres effects based model from wrangled data
lambd <- min_lambda$movie_user_genres[1]
# compute mean of ratings of edx
mu <- mean(edx$rating, na.rm=TRUE)
# compute b_i (movie effects) of edx datasets
b_i <- edx %>%
group_by(movieId) %>%
summarize(b_i = sum(rating - mu)/(n()+lambd))
# compute b_u (user effects) of edx datasets
b_u <- edx %>%
left_join(b_i, by="movieId") %>%
group_by(userId) %>%
summarize(b_u = sum(rating - b_i - mu)/(n()+lambd))
# compute b_g (genres effects) of edx datasets
b_g <- edx %>%
left_join(b_u, by="userId") %>%
group_by(genres) %>%
summarize(b_g = sum(rating - b_u - m_rg)/(n()+lambd))
# compute predicted ratings (mean of ratings + movie effects + user effects + genres effects)
predicted_ratings <- final_holdout_test %>%
left_join(b_i, by = "movieId") %>%
left_join(b_u, by = "userId") %>%
left_join(b_g, by = "genres") %>%
mutate(pred = mu + b_i + b_u + b_g ) %>%
pull(pred)
# compute RMSE on final_holdout_test
final_model_rmse_rmug <- RMSE(predicted_ratings, final_holdout_test$rating)
# generate regularized movie+user+genres effects based model final models RMSE table
final_model_rmse_table <- bind_rows(final_model_rmse_table,
data_frame(MODEL = "Regularized Movie+User+Genres Effects Based Model",
RMSE = final_model_rmse_rmug))
# lambdas = sequence from 0 to 10 increment by 0.25
lambdas <- seq(0, 10, 0.25)
# qplot regularized movie+user+genres effects (RMSES vs Lambdas)
qplot(lambdas, min_lambda$rmses_rmug) +
labs(y="RMSES",x="Lambdas",title="Regularized Movie+User+Genres Effects (RMSES vs Lambdas)")
# 1) build regularized movie+user+time effects based model:
#
# a) training of edx datasets
# b) testing on final_holdout_test sets
#
# 2) apply relevant minimum lambda value on algorithm and compute final model RMSE
# minimum lambda value of regularized movie+user+time effects based model from wrangled data
lambd <- min_lambda$movie_user_time[1]
# compute mean of ratings of edx
mu <- mean(edx$rating, na.rm=TRUE)
# compute b_i (movie effects) of edx datasets
b_i <- edx %>%
group_by(movieId) %>%
summarize(b_i = sum(rating - mu)/(n()+lambd))
# compute b_u (user effects) of edx datasets
b_u <- edx %>%
left_join(b_i, by="movieId") %>%
group_by(userId) %>%
summarize(b_u = sum(rating - b_i - mu)/(n()+lambd))
# compute b_t (time effects) of edx datasets
b_t <- edx %>%
left_join(b_u, by="userId") %>%
group_by(movieId) %>%
summarize(b_t = sum( rating - b_u - m_rw ) /(n()+lambd))
# compute predicted ratings (mean of ratings + movie Effects + user Effects + time effects)
predicted_ratings <- final_holdout_test %>%
left_join(b_i, by = "movieId") %>%
left_join(b_u, by = "userId") %>%
left_join(b_t, by = "movieId") %>%
mutate(pred = mu + b_i + b_u + b_t ) %>%
pull(pred)
# compute RMSE on final_holdout_test
final_model_rmse_rmut <- RMSE(predicted_ratings, final_holdout_test$rating)
# generate regularized movie+user+time effects based model final models RMSE table
final_model_rmse_table <- bind_rows(final_model_rmse_table,
data_frame(MODEL = "Regularized Movie+User+Time Effects Based Model",
RMSE = final_model_rmse_rmut))
# lambdas = sequence from 0 to 10 increment by 0.25
lambdas <- seq(0, 10, 0.25)
# qplot regularized movie+user+time effects (RMSES vs Lambdas)
qplot(lambdas, min_lambda$rmses_rmut) +
labs(y="RMSES",x="Lambdas",title="Regularized Movie+User+Time Effects (RMSES vs Lambdas)")
# 1) build regularized movie+user+time effects based model:
#
# a) training of edx datasets
# b) testing on final_holdout_test sets
#
# 2) apply relevant minimum lambda value on algorithm and compute final model RMSE
# minimum lambda value of regularized movie+user+time effects based model from wrangled data
lambd <- min_lambda$movie_user_time[1]
# compute mean of ratings of edx datasets
mu <- mean(edx$rating, na.rm=TRUE)
# compute b_i (movie effects) of edx datasets
b_i <- edx %>%
group_by(movieId) %>%
summarize(b_i = sum(rating - mu)/(n()+lambd))
# compute b_u (user effects) of edx datasets
b_u <- edx %>%
left_join(b_i, by="movieId") %>%
group_by(userId) %>%
summarize(b_u = sum(rating - b_i - mu)/(n()+lambd))
# compute b_t (time effects) of edx datasets
b_t <- edx %>%
left_join(b_u, by="userId") %>%
group_by(movieId) %>%
summarize(b_t = sum( rating - b_u - m_rw ) /(n()+lambd))
# compute predicted ratings (mean of ratings + movie Effects + user Effects + time effects)
predicted_ratings <- final_holdout_test %>%
left_join(b_i, by = "movieId") %>%
left_join(b_u, by = "userId") %>%
left_join(b_t, by = "movieId") %>%
mutate(pred = mu + b_i + b_u + b_t ) %>%
pull(pred)
# compute RMSE on final_holdout_test
final_model_rmse_rmut <- RMSE(predicted_ratings, final_holdout_test$rating)
# generate regularized movie+user+time effects based model final models RMSE table
final_model_rmse_table <- bind_rows(final_model_rmse_table,
data_frame(MODEL = "Regularized Movie+User+Time Effects Based Model",
RMSE = final_model_rmse_rmut))
####################################################################################
# summary of lambda values that give the minimum RMSE for regularized final models #
####################################################################################
# determine minimum lambdas from wrangled data
min_lambda_m <- lambdas[which.min(min_lambda$rmses_rm)]
min_lambda_mu <- lambdas[which.min(min_lambda$rmses_riu)]
min_lambda_mug <- lambdas[which.min(min_lambda$rmses_rmug)]
min_lambda_mut <- lambdas[which.min(min_lambda$rmses_rmut)]
# generate data frame of min_lambda_result
min_lambda_result <- data.frame(Effects = "Regularized Movie Effects",
Lambdas = min_lambda_m )
min_lambda_result <- bind_rows(min_lambda_result,
data.frame(Effects="Regularized Movie+User Effects",
Lambdas=min_lambda_mu))
min_lambda_result <- bind_rows(min_lambda_result,
data.frame(Effects="Regularized Movie+User+Genres Effects",
Lambdas=min_lambda_mug))
min_lambda_result <- bind_rows(min_lambda_result,
data.frame(Effects="Regularized Movie+User+Time Effects",
Lambdas=min_lambda_mut))
# display min_lambda_result table using kable function
min_lambda_result %>% knitr::kable(caption="Lambdas give the Minimun RMSE")
###########################################
# summary RMSES of final models/algorithm #
###########################################
# display final_model_rmse_table using kable function
final_model_rmse_table %>% knitr::kable(caption="Result of Final Models - RMSE")
##########################################################################
# final regularized movie+user+time effects based model with lowest RMSE #
##########################################################################
# generate data frame of lowest_rmse_model
lowest_rmse_model <-data_frame(
MODEL=final_model_rmse_table$MODEL[which.min(final_model_rmse_table$RMSE)],
RMSE =final_model_rmse_table$RMSE[which.min(final_model_rmse_table$RMSE)])
# display lowest_rmse_model using kable function
lowest_rmse_model %>% knitr::kable(caption="Final Model with Lowest RMSE")
-
Harvard Professor “Rafael A. Irizarry”.(2019). INTRODUCTION TO DATA SCIENCE.
-
Steve,L.(2009). Netflix Awards $1Million Prize and Starts a New Contest. Bits. The New York Times.
-
https://bits.blogs.nytimes.com/2009/09/21/netflix-awards-1-million-prize-and-starts-a-mew-contest/
-
https://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/
-
https://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf
-
Comprehensive R Archive Network. https://cran.r-project.org/doc/manuals/r-release/R-lang.html
-
RDocumentation. https://www.rdocumentation.org
-
Microsoft. https://copilot.microsoft.com
-
Kaggle. https://www.kaggle.com
-
UCI Machine Learning Repository. https://archive.ics.uci.edu
-
Stack Overflow. https://stackoverflow.com
-
Statology. https://www.statology.org
I would like to express my gratitude to Harvard Professor “Rafael Irizarry” for this excellence in teaching. His expertise and insights were instrumental in shaping the direction of my learning in this Data Science Professional Certificate program.
Additionally, I also thanks all Teaching Assistants of this program at Harvard University for providing necessary feedback and assistance for my study. Special thanks to “Sapei” and Friends for their spiritual support.
Lastly, I extend my heartfelt thanks to my wife “Jess” for her unwavering understanding during the course of my professional study.