Project - Linear Regression
This is my final project for the course “Linear Regression and Modeling”. It covers the analysis of a movie data set in R, using multiple linear regression to model and predict movie popularity. This project covers:
- Discussion on Data Set
- Exploratory Data Analysis
- Multiple Linear Regression model building
- Model diagnostics
- Prediction
- Conclusion
The Project file starts here…
Modeling and predicting movie popularity
Setup
Load packages
Part 1: Data
I just got a job as a data scientist at Paramount Pictures, awesome!
As our first project I have been given data, and my task is to analyse it, then model and predict movie ratings.
The data set consists of 651 randomly selected movies released before 2016. It contains information on movie ratings from audiences and critics as well as other attributes (e.g. movie length). The information originates from Rotten Tomatoes and IMDB, which are entertainment-related websites.
The data I am analysing here is observational and not from an experiment. Hence, any findings will be reported as associations and not causal relationships.
The various movie ratings in this data set are likely biased in several ways, and therefore we should be careful about generalizing findings. The following biases should be kept in mind:
- Voluntary response bias - people’s ratings are voluntary and might be biased towards being overly positive/negative
- Undercoverage bias - we are likely looking at responses from a geographic subpopulation (possibly mainly US?), as well as a demographic subpopulation of people particularly interested in movies and therefore taking the time to rate them (I certainly never rated a movie on one of these websites... have you?). Also, critics represent a very small subpopulation of the general population.
Part 2: Research question
Can we predict whether an audience will like a movie?
I am a data scientist at Paramount Pictures and want to help the company figure out what makes a movie popular. Particularly, I want to inform business decisions to help Paramount make successful movies in the future that sell a lot of tickets. Therefore, in this instance I will focus my analysis on audience ratings and not the movie critics and aim to build an accurate model to predict movie popularity.
Part 3: Exploratory data analysis
Let’s look at the structure of the data set
## Classes 'tbl_df', 'tbl' and 'data.frame': 651 obs. of 32 variables:
## $ title : chr "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ thtr_rel_year : num 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ director : chr "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
There are a lot of variables in this data set. However, upon closer examination, many of the variables (aside from the movie popularity scores themselves) will not be useful for predicting movie popularity.
Below is a table of variables that will be excluded.
Variable | Reason to exclude from analysis
---|---
title | Title of movie, not associated with popularity
studio | Too many levels (211)
thtr_... | Variables regarding release date (except year)
dvd_... | Variables regarding DVD release
director | Too many levels (532)
actor1-5 | Names of actors, not associated with popularity
imdb_url | Just the URL of the IMDB page
rt_url | Just the URL of the Rotten Tomatoes page
Then we’re left with the below variables to include in our analysis as independent variables.
Variable | Description
---|---
title_type | Type of movie
genre | Genre of movie
runtime | Length of movie
mpaa_rating | MPAA rating of movie
best_pic_nom | Movie nominated for Oscar
best_pic_win | Movie won Oscar
best_actor_win | Actor won Oscar
best_actress_win | Actress won Oscar
best_dir_win | Director won Oscar
top200_box | Movie in top 200 BoxOfficeMojo list
thtr_rel_year | Release year
Note that “Release year” will only be used as chronological information to later test the independence assumption of the linear regression model; it will not be used as an independent variable in the model.
Next, we have to decide on our response variable.
Choosing a rating variable
As for ratings of movies, we have the choice of:
- Numerical: imdb_rating (and imdb_num_votes), critics_score, audience_score
- Categorical: critics_rating, audience_rating

We want an audience rating and not a critics rating (based on our research question), and a rating that is numerical. Based on these conditions we can eliminate critics_score, critics_rating and audience_rating, which leaves us with imdb_rating (and imdb_num_votes) and audience_score.
At this point we’ll reduce our data set to just the essential variables.
# simplify the data frame to just the variables we need
mps <- movies %>%
select(title_type, genre, runtime, mpaa_rating, best_pic_nom, best_pic_win, best_actor_win,
best_actress_win, best_dir_win, top200_box, imdb_rating, audience_score, thtr_rel_year)
# remove single NA value in the runtime variable
mps <- mps %>% filter(!is.na(runtime))
Let’s look at the distribution of both ratings variables next.
# histogram of imdb rating
ggplot(data = mps, aes(x = imdb_rating)) +
geom_histogram(binwidth = 0.4) +
xlab("IMDB rating")
As we can see from the above histogram, the distribution of imdb_rating is slightly left-skewed and unimodal. It seems that people on IMDB rate movies overly positively, and there is not a lot of variation in movie ratings.
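The visual impression of a left skew can be double-checked numerically: in a left-skewed distribution the mean is pulled below the median. A minimal sketch (an addition here, assuming the `mps` data frame from the earlier chunk):

```r
# compare mean and median of the IMDB rating;
# mean < median is consistent with a left skew
mps %>%
  summarise(mean_imdb   = mean(imdb_rating),
            median_imdb = median(imdb_rating))
```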
Let’s now look at the Rotten Tomatoes audience rating.
# histogram of audience score on rotten tomatoes
ggplot(data = mps, aes(x = audience_score)) +
geom_histogram(binwidth = 4) +
xlab("RT audience score")
The audience_score from Rotten Tomatoes shows more variability in people’s ratings. The distribution is still slightly left-skewed, as the dominant peak is towards the right end (high movie ratings) of the graph.
Let’s plot the two above ratings variables next to the critics rating variable for comparison.
# using the inbuilt R plotting to make a panel of (2x2) plots
# multiply the imdb_rating variable by 10 so the scales are easier to compare
par(mfrow = c(2,2))
hist(mps$imdb_rating * 10, main = "IMDB rating", xlab = "Score")
hist(mps$audience_score, main = "RT audience rating", xlab = "Score")
hist(movies$critics_score, main = "RT critics rating", xlab = "Score")
In the above panel we can see histograms of IMDB Rating (top left), Rotten Tomatoes audience rating (top right) and Rotten Tomatoes critics rating (bottom left). The IMDB ratings have been multiplied by 10 so we can compare the values more easily to the other variables. We can see the same distribution pattern in the two ratings variables as before, mostly high ratings of movies on IMDB, more uniformly distributed ratings for RT audience and even more uniformly distributed ratings for RT critics.
Given the possible bias towards positive ratings in imdb_rating shown in the histogram, I will choose Rotten Tomatoes’ audience_score as my dependent variable for the following linear regression. The Rotten Tomatoes audience score shows a more even distribution of scores across the whole range (much like the critics score), which seems more truthful to me, because, let’s be honest, there are a lot of good and bad movies out there.
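The choice can also be motivated numerically. A quick sketch (an addition here, assuming the `mps` data frame from above) compares the spread of the two candidate response variables on a common 0-100 scale:

```r
# compare spread of the two candidate response variables;
# imdb_rating is rescaled to 0-100 for comparability
mps %>%
  summarise(sd_imdb      = sd(imdb_rating * 10),
            iqr_imdb     = IQR(imdb_rating * 10),
            sd_audience  = sd(audience_score),
            iqr_audience = IQR(audience_score))
```

A larger standard deviation and IQR for audience_score would support the impression from the histograms that it varies more across its range.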
Next, we will look at the relationships between the dependent variable (movie rating) and the explanatory variables to see whether they look linear and are therefore suited to be included in the linear model.
Relationships between rating and other variables
First, we will look at audience_score for each type of movie (Documentary, Feature Film, TV Movie).
# boxplot of audience_score by title_type
ggplot(data = mps, aes(x = title_type, y = audience_score, color = title_type)) +
  geom_boxplot() +
  labs(title = "Popularity score by movie type") +
  xlab("Movie type") +
  ylab("Popularity score") +
  labs(color = "Movie type")
In the above boxplot we can see type of movie on the x-axis and movie popularity score on the y-axis. Interestingly, documentaries have the highest median rating, whereas TV movies show the greatest variability of scores.
Let’s calculate some summary statistics of movie popularity for each movie type.
# summary statistics for audience_score by title_type, ordered by median score
t_mtype <- mps %>%
group_by(title_type) %>%
summarise(Median_score = median(audience_score), IQR_score = IQR(audience_score),
Lowest_score = min(audience_score), Highest_score = max(audience_score)) %>%
arrange(desc(Median_score))
# display table using kable (package knitr) because it looks nicer
kable(t_mtype, align = c("l","c","c","c","c"), digits = 0)
title_type | Median_score | IQR_score | Lowest_score | Highest_score |
---|---|---|---|---|
Documentary | 86 | 11 | 68 | 96 |
TV Movie | 75 | 62 | 19 | 86 |
Feature Film | 62 | 34 | 11 | 97 |
In the above table we can see the median popularity score and the IQR for each movie type. Documentaries show the highest median score, followed by “TV Movie” and “Feature Film”. Interestingly, the last two columns show that “Feature Film” recorded both the lowest and the highest score of all categories.
Now, let’s see whether movies which won best picture oscar also receive a higher rating, on average.
# boxplot of movie score by best picture oscar win
ggplot(data = mps, aes(x = best_pic_win, y = audience_score, color = best_pic_win)) +
geom_boxplot() +
labs(title = "Popularity score by best picture oscar win") +
xlab("Best picture oscar win") +
ylab("Popularity score") +
labs(color = "Best picture\noscar win")
In the above boxplot we can see movie popularity score (y-axis) for movies that did not win a best picture oscar (“no”, x-axis) and movies that did (“yes”, x-axis). The median score for movies that won an oscar is clearly higher compared to movies that did not win an oscar. The variability in scores amongst oscar-winning movies is also comparatively much lower.
Let’s look at the actual summary statistics associated with scores for oscar-winning movies.
# summary statistics for movie score by best picture oscar win
t_bposc <- mps %>%
group_by(best_pic_win) %>%
summarise(Median_score = median(audience_score), IQR_score = IQR(audience_score),
Lowest_score = min(audience_score), Highest_score = max(audience_score))
# display table using kable (package knitr) because it looks nicer
kable(t_bposc, align = c("l","c","c","c","c"), digits = 0)
best_pic_win | Median_score | IQR_score | Lowest_score | Highest_score |
---|---|---|---|---|
no | 65 | 33 | 11 | 96 |
yes | 84 | 8 | 69 | 97 |
As in the box plot before, the table above shows that movies which won the best picture Oscar have a higher median popularity score and a lower IQR than movies that did not. Despite this, the maximum ratings are nearly equal for both categories (see Highest_score, last column), whereas the Oscar-winning movie with the lowest popularity score still sits quite high at a rating of 69 (see Lowest_score, 4th column).
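One caveat worth making explicit: the two groups are very unequal in size, since only a small fraction of movies in any random sample will have won the best picture Oscar, so the summary statistics for the “yes” group are based on few observations and are correspondingly noisy. The exact counts are not shown above, but they are easy to check (assuming the `mps` data frame):

```r
# group sizes behind the comparison above; the "yes" group is
# expected to be much smaller than the "no" group
mps %>% count(best_pic_win)
```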
Finally, let’s see if there’s a relationship between length of a movie and popularity score using a scatterplot.
# scatterplot of runtime vs. audience score with linear line
ggplot(data = mps, aes(x = runtime, y = audience_score)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Runtime vs Popularity score") +
xlab("Runtime (minutes)") +
ylab("Popularity score")
In the above scatterplot we can see runtime (x-axis) plotted against popularity score (y-axis), together with a linear fit (blue). The distribution of runtime is right-skewed with some extreme values at the high end. The extreme values away from the main cluster of data points appear highly influential, and without them the line might have a completely different slope. The data points also show an overall fan shape, which is problematic. The relationship between runtime and audience_score should not be modeled with a linear model, and therefore we will exclude the runtime variable from the following regression analysis.
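Before discarding runtime entirely, one optional check (an addition here, not part of the original analysis) is whether a log transform straightens the relationship, since runtime is right-skewed:

```r
# scatterplot of log(runtime) vs audience score; if the fan shape
# and influential points persist, excluding runtime is justified
ggplot(data = mps, aes(x = log(runtime), y = audience_score)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  xlab("log(Runtime)") +
  ylab("Popularity score")
```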
Part 4: Modeling
Now we will construct a multiple linear regression model predicting movie popularity score from features of movies. I want to build a model with high prediction accuracy, so Paramount can incorporate the information into their decision making and predict the future success of movies as accurately as possible. Therefore, I will do model selection using the adjusted R-squared criterion, building the model via backward elimination.
To start with, I build a model using all variables to predict popularity score.
# create multiple linear regression model
movsc <- lm(audience_score ~ title_type + genre + mpaa_rating + best_pic_nom + best_pic_win + best_actor_win +
best_actress_win + best_dir_win, data = mps)
summary(movsc)
##
## Call:
## lm(formula = audience_score ~ title_type + genre + mpaa_rating +
## best_pic_nom + best_pic_win + best_actor_win + best_actress_win +
## best_dir_win, data = mps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.51 -13.06 1.56 13.15 39.21
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.1202 8.0536 9.203 < 2e-16 ***
## title_typeFeature Film -11.6345 6.5499 -1.776 0.076172 .
## title_typeTV Movie -21.4380 10.3098 -2.079 0.037988 *
## genreAnimation 2.9867 6.8893 0.434 0.664786
## genreArt House & International 8.8612 5.3410 1.659 0.097596 .
## genreComedy -0.6269 2.9358 -0.214 0.830975
## genreDocumentary 14.8869 7.0223 2.120 0.034402 *
## genreDrama 10.6500 2.5119 4.240 2.58e-05 ***
## genreHorror -9.0698 4.3848 -2.068 0.039007 *
## genreMusical & Performing Arts 21.2632 5.9841 3.553 0.000409 ***
## genreMystery & Suspense 1.3220 3.2958 0.401 0.688463
## genreOther 10.5724 5.0155 2.108 0.035429 *
## genreScience Fiction & Fantasy -4.3348 6.2970 -0.688 0.491461
## mpaa_ratingNC-17 -9.3284 13.4097 -0.696 0.486905
## mpaa_ratingPG -8.6263 4.8454 -1.780 0.075511 .
## mpaa_ratingPG-13 -13.2737 4.9215 -2.697 0.007183 **
## mpaa_ratingR -8.2735 4.7916 -1.727 0.084720 .
## mpaa_ratingUnrated -3.5026 5.5163 -0.635 0.525691
## best_pic_nomyes 22.2093 4.5035 4.932 1.05e-06 ***
## best_pic_winyes -0.7219 8.0529 -0.090 0.928597
## best_actor_winyes -0.6143 2.0704 -0.297 0.766771
## best_actress_winyes -0.7577 2.3151 -0.327 0.743572
## best_dir_winyes 6.8887 2.9898 2.304 0.021546 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.67 on 627 degrees of freedom
## Multiple R-squared: 0.2634, Adjusted R-squared: 0.2376
## F-statistic: 10.19 on 22 and 627 DF, p-value: < 2.2e-16
Above we can see the output of the multiple linear regression model predicting audience_score from title_type, genre, mpaa_rating, best_pic_nom, best_pic_win, best_actor_win, best_actress_win and best_dir_win. The adjusted R-squared of the model is 0.2376, meaning 23.76% of the variation in movie popularity score is explained by the model.
Now we carry out the backward elimination process. We take a single variable out of the model at a time, fit the linear model again, and calculate the adjusted R-squared. We then pick the model with the highest improvement in adjusted R-squared over the full model’s adjusted R-squared (full model: 0.2376), and repeat this process until we can no longer achieve an improvement.
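The per-step refits below are written out by hand for transparency. The same elimination step can be expressed as a small helper function; this sketch (an addition, not part of the original analysis) is illustrated on R’s built-in `mtcars` data so it runs on its own, but it applies unchanged to the models fitted here:

```r
# one backward-elimination step by adjusted R-squared:
# drop each term in turn, refit, and report the adjusted
# R-squared of every reduced model
adj_r2_step <- function(model) {
  terms <- attr(terms(model), "term.labels")
  sapply(terms, function(t) {
    reduced <- update(model, paste(". ~ . -", t))
    summary(reduced)$adj.r.squared
  })
}

# example on built-in data: the candidate to drop is the term
# whose removal yields the highest adjusted R-squared
full <- lm(mpg ~ cyl + disp + hp + wt, data = mtcars)
sort(adj_r2_step(full), decreasing = TRUE)
```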
# model 1: Excluding title type
movsc_1 <- lm(audience_score ~ genre + mpaa_rating + best_pic_nom + best_pic_win + best_actor_win +
best_actress_win + best_dir_win, data = mps)
r1 <- summary(movsc_1)$adj.r.squared
# model 2: Excluding genre
movsc_2 <- lm(audience_score ~ title_type + mpaa_rating + best_pic_nom + best_pic_win + best_actor_win +
best_actress_win + best_dir_win, data = mps)
r2 <- summary(movsc_2)$adj.r.squared
# model 3: Excluding mpaa_rating
movsc_3 <- lm(audience_score ~ title_type + genre + best_pic_nom + best_pic_win + best_actor_win +
best_actress_win + best_dir_win, data = mps)
r3 <- summary(movsc_3)$adj.r.squared
# model 4: Excluding best_pic_nom
movsc_4 <- lm(audience_score ~ title_type + genre + mpaa_rating + best_pic_win + best_actor_win +
best_actress_win + best_dir_win, data = mps)
r4 <- summary(movsc_4)$adj.r.squared
# model 5: Excluding best_pic_win
movsc_5 <- lm(audience_score ~ title_type + genre + mpaa_rating + best_pic_nom + best_actor_win +
best_actress_win + best_dir_win, data = mps)
r5 <- summary(movsc_5)$adj.r.squared
# model 6: Excluding best_actor_win
movsc_6 <- lm(audience_score ~ title_type + genre + mpaa_rating + best_pic_nom + best_pic_win +
best_actress_win + best_dir_win, data = mps)
r6 <- summary(movsc_6)$adj.r.squared
# model 7: Excluding best_actress_win
movsc_7 <- lm(audience_score ~ title_type + genre + mpaa_rating + best_pic_nom + best_pic_win + best_actor_win +
best_dir_win, data = mps)
r7 <- summary(movsc_7)$adj.r.squared
# model 8: Excluding best_dir_win
movsc_8 <- lm(audience_score ~ title_type + genre + mpaa_rating + best_pic_nom + best_pic_win + best_actor_win,
data = mps)
r8 <- summary(movsc_8)$adj.r.squared
# save model names and r-squared-adjust values as lists
models <- list("movsc_1","movsc_2","movsc_3","movsc_4","movsc_5","movsc_6","movsc_7","movsc_8")
r_vals <- list(r1, r2, r3, r4, r5, r6, r7, r8)
# print every model and its r-squared-adjusted value
for (item in seq(length(models))) {
print(paste(models[item], ": ", round(as.numeric(r_vals[item]), 5)))
}
## [1] "movsc_1 : 0.23431"
## [1] "movsc_2 : 0.16365"
## [1] "movsc_3 : 0.22696"
## [1] "movsc_4 : 0.20929"
## [1] "movsc_5 : 0.23881"
## [1] "movsc_6 : 0.23871"
## [1] "movsc_7 : 0.23869"
## [1] "movsc_8 : 0.2335"
Removing best_pic_win from the model
We have the highest improvement in adjusted R-squared in model 5 (0.2376 -> 0.23881) when omitting the variable best_pic_win. Now we repeat the selection process.
# current final model
movsc_final <- movsc_5
# model 9: Excluding title type
movsc_9 <- lm(audience_score ~ genre + mpaa_rating + best_pic_nom + best_actor_win +
best_actress_win + best_dir_win, data = mps)
r9 <- summary(movsc_9)$adj.r.squared
# model 10: Excluding genre
movsc_10 <- lm(audience_score ~ title_type + mpaa_rating + best_pic_nom + best_actor_win +
best_actress_win + best_dir_win, data = mps)
r10 <- summary(movsc_10)$adj.r.squared
# model 11: Excluding mpaa_rating
movsc_11 <- lm(audience_score ~ title_type + genre + best_pic_nom + best_actor_win +
best_actress_win + best_dir_win, data = mps)
r11 <- summary(movsc_11)$adj.r.squared
# model 12: Excluding best_pic_nom
movsc_12 <- lm(audience_score ~ title_type + genre + mpaa_rating + best_actor_win +
best_actress_win + best_dir_win, data = mps)
r12 <- summary(movsc_12)$adj.r.squared
# model 13: Excluding best_actor_win
movsc_13 <- lm(audience_score ~ title_type + genre + mpaa_rating + best_pic_nom +
best_actress_win + best_dir_win, data = mps)
r13 <- summary(movsc_13)$adj.r.squared
# model 14: Excluding best_actress_win
movsc_14 <- lm(audience_score ~ title_type + genre + mpaa_rating + best_pic_nom + best_actor_win +
best_dir_win, data = mps)
r14 <- summary(movsc_14)$adj.r.squared
# model 15: Excluding best_dir_win
movsc_15 <- lm(audience_score ~ title_type + genre + mpaa_rating + best_pic_nom + best_actor_win +
best_actress_win, data = mps)
r15 <- summary(movsc_15)$adj.r.squared
# save model names and r-squared-adjust values as lists
models <- list("movsc_9","movsc_10","movsc_11","movsc_12","movsc_13","movsc_14","movsc_15")
r_vals <- list(r9, r10, r11, r12, r13, r14, r15)
# print every model and its r-squared-adjusted value
for (item in seq(length(models))) {
print(paste(models[item], ": ", round(as.numeric(r_vals[item]), 5)))
}
## [1] "movsc_9 : 0.23551"
## [1] "movsc_10 : 0.16483"
## [1] "movsc_11 : 0.22818"
## [1] "movsc_12 : 0.20366"
## [1] "movsc_13 : 0.23991"
## [1] "movsc_14 : 0.23988"
## [1] "movsc_15 : 0.23311"
Removing best_actor_win from the model
We have the highest improvement in adjusted R-squared in model 13 (0.23881 -> 0.23991) when omitting the variable best_actor_win. Now we do another iteration of the selection process.
# current final model
movsc_final <- movsc_13
# model 16: Excluding title type
movsc_16 <- lm(audience_score ~ genre + mpaa_rating + best_pic_nom +
best_actress_win + best_dir_win, data = mps)
r16 <- summary(movsc_16)$adj.r.squared
# model 17: Excluding genre
movsc_17 <- lm(audience_score ~ title_type + mpaa_rating + best_pic_nom +
best_actress_win + best_dir_win, data = mps)
r17 <- summary(movsc_17)$adj.r.squared
# model 18: Excluding mpaa_rating
movsc_18 <- lm(audience_score ~ title_type + genre + best_pic_nom +
best_actress_win + best_dir_win, data = mps)
r18 <- summary(movsc_18)$adj.r.squared
# model 19: Excluding best_pic_nom
movsc_19 <- lm(audience_score ~ title_type + genre + mpaa_rating +
best_actress_win + best_dir_win, data = mps)
r19 <- summary(movsc_19)$adj.r.squared
# model 20: Excluding best_actress_win
movsc_20 <- lm(audience_score ~ title_type + genre + mpaa_rating + best_pic_nom +
best_dir_win, data = mps)
r20 <- summary(movsc_20)$adj.r.squared
# model 21: Excluding best_dir_win
movsc_21 <- lm(audience_score ~ title_type + genre + mpaa_rating + best_pic_nom +
best_actress_win, data = mps)
r21 <- summary(movsc_21)$adj.r.squared
# save model names and r-squared-adjust values as lists
models <- list("movsc_16","movsc_17","movsc_18","movsc_19","movsc_20","movsc_21")
r_vals <- list(r16, r17, r18, r19, r20, r21)
# print every model and its r-squared-adjusted value
for (item in seq(length(models))) {
print(paste(models[item], ": ", round(as.numeric(r_vals[item]), 5)))
}
## [1] "movsc_16 : 0.23664"
## [1] "movsc_17 : 0.16607"
## [1] "movsc_18 : 0.22919"
## [1] "movsc_19 : 0.20476"
## [1] "movsc_20 : 0.24097"
## [1] "movsc_21 : 0.2343"
Removing best_actress_win from the model
We have the highest improvement in adjusted R-squared in model 20 (0.23991 -> 0.24097) when omitting the variable best_actress_win. Now we repeat the backward selection again.
# current final model
movsc_final <- movsc_20
# model 22: Excluding title type
movsc_22 <- lm(audience_score ~ genre + mpaa_rating + best_pic_nom +
best_dir_win, data = mps)
r22 <- summary(movsc_22)$adj.r.squared
# model 23: Excluding genre
movsc_23 <- lm(audience_score ~ title_type + mpaa_rating + best_pic_nom +
best_dir_win, data = mps)
r23 <- summary(movsc_23)$adj.r.squared
# model 24: Excluding mpaa_rating
movsc_24 <- lm(audience_score ~ title_type + genre + best_pic_nom +
best_dir_win, data = mps)
r24 <- summary(movsc_24)$adj.r.squared
# model 25: Excluding best_pic_nom
movsc_25 <- lm(audience_score ~ title_type + genre + mpaa_rating +
best_dir_win, data = mps)
r25 <- summary(movsc_25)$adj.r.squared
# model 26: Excluding best_dir_win
movsc_26 <- lm(audience_score ~ title_type + genre + mpaa_rating + best_pic_nom
, data = mps)
r26 <- summary(movsc_26)$adj.r.squared
# save model names and r-squared-adjust values as lists
models <- list("movsc_22","movsc_23","movsc_24","movsc_25","movsc_26")
r_vals <- list(r22, r23, r24, r25, r26)
# print every model and its r-squared-adjusted value
for (item in seq(length(models))) {
print(paste(models[item], ": ", round(as.numeric(r_vals[item]), 5)))
}
## [1] "movsc_22 : 0.23766"
## [1] "movsc_23 : 0.16722"
## [1] "movsc_24 : 0.23006"
## [1] "movsc_25 : 0.20556"
## [1] "movsc_26 : 0.23544"
The above adjusted R-squared values show no improvement over the current model (adjusted R-squared: 0.24097), and therefore our final model is:

audience_score ~ title_type + genre + mpaa_rating + best_pic_nom + best_dir_win
Let’s run some model diagnostics next.
Model diagnostics
Now, we create plots to see whether the conditions for least squares regression (constant variability of residuals, nearly normal residuals and independent residuals) are reasonable.
# plot of fitted values and residuals
d1 <- ggplot(data = movsc_final, aes(x = .fitted, y = .resid)) +
geom_point(color = "#E69F00") +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted Values") +
ylab("Residuals")
# normal probability plot of residuals
d2 <- ggplot(data = movsc_final, aes(sample = .resid)) +
stat_qq(color = "#E69F00") + stat_qq_line()
# histogram of residuals
d3 <- ggplot(data = movsc_final, aes(x = .resid)) +
geom_histogram(binwidth = 4, color = "#E69F00")
# independence of residuals
indep <- data.frame(residuals = movsc_final$residuals,
                    release_year = mps$thtr_rel_year)
d4 <- ggplot(data = indep, aes(x = release_year, y = residuals)) +
geom_point(color = "#E69F00") +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Order of movie release") +
ylab("Residuals")
# plot multiple plots in a grid
grid.arrange(d1, d2, d3, d4, ncol = 2, nrow = 2)
In the panel above we can see a scatterplot of fitted values vs residuals (top left), a normal probability plot of the residuals (top right), a histogram of the residuals (bottom left) and a scatterplot of release year vs residuals (bottom right).
Constant variability of residuals: The scatterplot of residuals vs fitted values shows residuals scattered around zero, but there is a clear fan shape, which is concerning: high variability of residuals is associated with lower fitted values, while higher fitted values show considerably fewer data points and much less scatter. A simple least-squares model does not seem well suited to this data.
Nearly normal residuals: The normal probability plot and the histogram show that the residuals are nearly normally distributed. There is a slight deviation from normality at the high end of the residuals and a few outliers in both tails of the distribution. Given that our data set is fairly large, this should not be a concern.
Independent residuals: As ratings data is presumably collected and updated constantly, there should be no issue regarding the order of data collection, also with regard to the other variables in the data set. Using the only chronological information available, I created a plot of release year vs residuals. The plot shows random scatter, suggesting there is no apparent structure related to the order in which movies were made. (Interestingly, there appear to be more data points for movies made more recently; there is also a possible decrease in variability over time, which we should keep an eye on in future analyses.)
Next, we look at the relationships between the response variable and the explanatory variables.
# plot of title_type vs residuals
dv1 <- ggplot(data = movsc_final, aes(x = title_type, y = .resid)) +
geom_point(color = "#0072B2") +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Title type") +
ylab("Residuals")
# plot of genre vs residuals
dv2 <- ggplot(data = movsc_final, aes(x = genre, y = .resid)) +
geom_point(color = "#0072B2") +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Genre") +
ylab("Residuals") +
theme(axis.text.x = element_blank())
# plot of mpaa_rating vs residuals
dv3 <- ggplot(data = movsc_final, aes(x = mpaa_rating, y = .resid)) +
geom_point(color = "#0072B2") +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Mpaa rating") +
ylab("Residuals")
# plot of best_pic_nom vs residuals
dv4 <- ggplot(data = movsc_final, aes(x = best_pic_nom, y = .resid)) +
geom_point(color = "#0072B2") +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Nominated for best picture") +
ylab("Residuals")
# plot of best_dir_win vs residuals
dv5 <- ggplot(data = movsc_final, aes(x = best_dir_win, y = .resid)) +
geom_point(color = "#0072B2") +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Won best director") +
ylab("Residuals")
# plot multiple plots in a grid
grid.arrange(dv1, dv2, dv3, dv4, dv5, ncol = 2, nrow = 3)
In the above panel we can see that the variables title_type, best_pic_nom and best_dir_win all show non-constant variability of residuals. A clear fan shape in the distribution of residuals can be seen for best_pic_nom (“Nominated for best picture”, 2nd row, 2nd column), title_type (“Title type”, 1st row, 1st column) and best_dir_win (“Won best director”, 3rd row, 1st column). Since the variability of the residuals fluctuates strongly across groups for several variables, indicative of non-linear relationships, we should use a more complex model rather than least squares regression to model this data.
As it’s beyond the scope of this project to use models other than least squares regression, we will carry out interpretation of model coefficients and prediction just for practice.
Interpretation of coefficients:
##
## Call:
## lm(formula = audience_score ~ title_type + genre + mpaa_rating +
## best_pic_nom + best_dir_win, data = mps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.268 -13.135 1.311 13.350 39.264
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.1461 8.0355 9.227 < 2e-16 ***
## title_typeFeature Film -11.6655 6.5349 -1.785 0.074726 .
## title_typeTV Movie -21.4636 10.2818 -2.088 0.037242 *
## genreAnimation 2.8671 6.8680 0.417 0.676482
## genreArt House & International 8.8367 5.3215 1.661 0.097301 .
## genreComedy -0.6997 2.9188 -0.240 0.810636
## genreDocumentary 14.8510 7.0058 2.120 0.034413 *
## genreDrama 10.4978 2.4809 4.231 2.67e-05 ***
## genreHorror -9.0348 4.3710 -2.067 0.039142 *
## genreMusical & Performing Arts 21.2589 5.9707 3.561 0.000398 ***
## genreMystery & Suspense 1.0866 3.2496 0.334 0.738204
## genreOther 10.5021 4.9924 2.104 0.035807 *
## genreScience Fiction & Fantasy -4.2673 6.2788 -0.680 0.496984
## mpaa_ratingNC-17 -9.4785 13.3502 -0.710 0.477975
## mpaa_ratingPG -8.7101 4.8301 -1.803 0.071821 .
## mpaa_ratingPG-13 -13.3409 4.9069 -2.719 0.006732 **
## mpaa_ratingR -8.2899 4.7802 -1.734 0.083368 .
## mpaa_ratingUnrated -3.5182 5.5037 -0.639 0.522900
## best_pic_nomyes 21.6367 3.9220 5.517 5.05e-08 ***
## best_dir_winyes 6.7103 2.8378 2.365 0.018351 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.63 on 630 degrees of freedom
## Multiple R-squared: 0.2632, Adjusted R-squared: 0.241
## F-statistic: 11.84 on 19 and 630 DF, p-value: < 2.2e-16
With an adjusted R-squared of ~0.241, our model explains 24.1% of the variation in audience score.
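As a sanity check on the summary output, the adjusted R-squared can be reproduced by hand from the multiple R-squared and the degrees of freedom (a small sketch; the sample size of 650 is inferred from the 630 residual degrees of freedom plus the 20 estimated coefficients):

```r
# Recover the adjusted R-squared reported by summary() from the
# multiple R-squared, sample size and number of coefficients.
r2 <- 0.2632   # multiple R-squared from the model summary
p  <- 20       # coefficients incl. intercept (19 slopes + 1)
n  <- 630 + p  # residual df (630) + coefficients = 650 observations

adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p)
round(adj_r2, 3)  # 0.241, matching the summary output
```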
The following interpretations of the slope coefficients for each variable are made under the assumption that all other model terms are held constant. The "74 on average" listed for the reference levels is the model intercept (74.1): the expected audience score when every variable is at its reference level.
Type | beta | higher/lower/reference score
---|---|---
Documentary | 0 | reference, 74 on average
Feature Film | -11.7 | lower
TV Movie | -21.5 | lower
Genre | beta | higher/lower/reference score
---|---|---
Action & Adventure | 0 | reference, 74 on average
Animation | 2.9 | higher
Art House & International | 8.8 | higher
Comedy | -0.7 | lower
Documentary | 14.9 | higher
Drama | 10.5 | higher
Horror | -9.0 | lower
Musical & Performing Arts | 21.3 | higher
Mystery & Suspense | 1.1 | higher
Other | 10.5 | higher
Science Fiction & Fantasy | -4.3 | lower
MPAA rating | beta | higher/lower/reference score
---|---|---
G | 0 | reference, 74 on average
NC-17 | -9.5 | lower
PG | -8.7 | lower
PG-13 | -13.3 | lower
R | -8.3 | lower
Unrated | -3.5 | lower
Nominated for best picture | beta | higher/lower/reference score
---|---|---
No | 0 | reference, 74 on average
Yes | 21.6 | higher

Director ever won an Oscar | beta | higher/lower/reference score
---|---|---
No | 0 | reference, 74 on average
Yes | 6.7 | higher
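The summary output also lets us attach uncertainty to these slopes. As a sketch from the printed estimates and standard errors (in practice, `confint()` on the fitted model object does this directly), a 95% confidence interval for the best_pic_nomyes coefficient is:

```r
# Rough 95% confidence interval for a slope from the printed summary:
# estimate +/- t* x standard error, with 630 residual degrees of freedom.
est    <- 21.6367            # estimate for best_pic_nomyes
se     <- 3.9220             # its standard error
t_crit <- qt(0.975, df = 630)

ci <- est + c(-1, 1) * t_crit * se
round(ci, 1)  # roughly 13.9 to 29.3 points
```

So, all else held constant, a best-picture nomination is associated with an audience score roughly 14 to 29 points higher.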
Overall, it is surprising that documentaries are on average more popular than other movies, and that movies in the Musical & Performing Arts genre receive higher ratings on average than other genres (all else held constant). In addition, movies suitable for all ages (MPAA rating 'G') receive higher ratings, all else held constant. Movies that have been nominated for best picture receive a substantially higher rating on average than movies that haven't, and the same is true for movies directed by an Oscar-winning director, all else held constant.
Part 5: Prediction
For the prediction, I selected the movie "Jason Bourne" (2016, audience score: 55) and took its details from the Rotten Tomatoes website.
# create data frame for movie
jason_bourne <- data.frame(title_type = "Feature Film",
genre = "Action & Adventure",
mpaa_rating = "PG-13",
best_pic_nom = "no",
best_dir_win = "no")
# use predict() function to predict its score and a 95% CI using the final model
predict(movsc_final, jason_bourne, interval = "prediction", level = 0.95)
## fit lwr upr
## 1 49.13975 14.15214 84.12735
The model predicts, with 95% confidence, an audience score for "Jason Bourne" between 14 and 84 points. The movie's actual audience score of 55 falls within the calculated interval and is also close to the point prediction of 49.
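Because "Jason Bourne" sits at the reference level for every predictor except title type and MPAA rating, the point prediction can also be reconstructed by hand from the coefficient table:

```r
# Reconstruct the point prediction from the model coefficients: for
# "Jason Bourne" only the intercept, title_typeFeature Film and
# mpaa_ratingPG-13 terms apply; genre (Action & Adventure),
# best_pic_nom ("no") and best_dir_win ("no") are at their reference
# levels and contribute 0.
fit <- 74.1461 +  # intercept
      -11.6655 +  # title_typeFeature Film
      -13.3409    # mpaa_ratingPG-13
round(fit, 1)     # 49.1, matching the fit column of predict()
```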
Part 6: Conclusion
In summary, we built a multiple linear regression model using least squares to predict movie popularity. The final model explains about 24% of the variation in audience scores. To model movie popularity with higher accuracy, we should focus on finding new variables to include in the model. Also, given that the linear model assumptions did not hold (due to unequal residual variance across groups of the categorical variables), we should explore non-linear methods, as well as a larger sample, in the future.
Paramount can use our model to get a general idea of how popular a movie might be; however, they should also consider other factors when deciding which movies could be successful. We also have to keep in mind that the movie ratings in our data set might not be related to actual ticket sales. In other words, even if a movie receives a high rating on an entertainment website, that doesn't mean a lot of people paid to go see it.