In recent years, the movie industry has become an instrumental source of entertainment in our modern world. Top studios such as Walt Disney, Universal, Paramount, Warner Bros and Fox have produced many successful films, earning impressive gross box-office revenues as well as reputations. However, many companies also fail in the movie industry, which is a main concern for new managers and directors entering the field. Therefore, identifying the factors that contribute most to a movie's success and predicting a movie's box-office revenue play a key role in the film industry. Knowing a movie's likely performance in advance allows managers to allocate appropriate resources, strategies and adjustments to promote the success of their products.
In this project, we analyze a movie dataset and construct different models to predict the revenue and profitability of a movie using variables such as budget, runtime, genre, vote count and score (which can be obtained from a test screening). In addition, we use time series analysis to examine the seasonality and trend of gross box office.
Our goal is to answer the following S.M.A.R.T questions:
Our dataset is sourced from Kaggle: https://www.kaggle.com/tmdb/tmdb-movie-metadata.
The data has nearly 5,000 movie records with many attributes. Before importing the data into R, we preprocessed the raw data in Python to obtain a cleaner data frame, since the raw data contains several columns written as JSON with many nested attributes. Attributes unnecessary for our analysis were also excluded.
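The preprocessing itself was done in Python, but the same JSON flattening can be sketched in R with the jsonlite package. The sample string below mimics a TMDB genres cell; it is illustrative, not taken from the actual file:

```r
library(jsonlite)

# A TMDB-style JSON cell: an array of genre objects (illustrative value)
genres_json <- '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed      <- fromJSON(genres_json)  # becomes a data frame with columns id, name
first_genre <- parsed$name[1]         # keep the first genre as the movie's genre
n_genres    <- nrow(parsed)           # source of the Number_Genres column

first_genre  # "Action"
n_genres     # 2
```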
Below is the structure of our imported data.
## 'data.frame': 4803 obs. of 12 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ budget : int 237000000 300000000 245000000 250000000 260000000 258000000 260000000 280000000 250000000 250000000 ...
## $ genres : Factor w/ 21 levels "","Action","Adventure",..: 2 3 2 2 2 10 4 2 3 2 ...
## $ popularity : num 150.4 139.1 107.4 112.3 43.9 ...
## $ production_companies: Factor w/ 1314 levels "","100 Bares",..: 615 1263 265 696 1263 265 1263 758 1267 320 ...
## $ release_date : Factor w/ 3281 levels "","1916-09-04",..: 2315 1945 3185 2688 2635 1940 2450 3111 2246 3234 ...
## $ revenue : num 2.79e+09 9.61e+08 8.81e+08 1.08e+09 2.84e+08 ...
## $ runtime : num 162 169 148 165 132 139 100 141 153 151 ...
## $ title : Factor w/ 4800 levels "(500) Days of Summer",..: 381 2653 3186 3614 1906 3198 3364 382 1587 444 ...
## $ vote_average : num 7.2 6.9 6.3 7.6 6.1 5.9 7.4 7.3 7.4 5.7 ...
## $ vote_count : int 11800 4500 4466 9106 2124 3576 3330 6767 5293 7004 ...
## $ Number_Genres : int 4 3 3 4 3 3 2 3 3 3 ...
In this step, we rename the data and remove duplicates and missing values (there are few missing records, so removing them does not hurt our analysis). To evaluate the company variable, we divide the movie studios into six groups: Walt Disney, Warner Bros, Sony, Universal, Paramount, and Others, which contains the less famous studios. We also create new columns such as profit, profitable, season, quarter and year to support our research. All variables are converted to their correct formats (int, num, factor, date …).
P/S: The profitable column holds a binary outcome (0, 1). If a movie has positive profit, profitable is 1; otherwise, it is 0.
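A minimal sketch of this feature engineering, with two toy rows standing in for the cleaned data:

```r
# Two toy rows standing in for the cleaned movie data
movie <- data.frame(
  budget  = c(237000000, 11000000),
  revenue = c(2790000000, 5000000),
  date    = as.Date(c("2009-12-10", "2007-05-19"))
)

movie$profit     <- movie$revenue - movie$budget
movie$profitable <- factor(ifelse(movie$profit > 0, 1, 0))

m <- as.integer(format(movie$date, "%m"))           # month number
movie$season  <- factor(c("Winter", "Spring", "Summer", "Fall")[(m %% 12) %/% 3 + 1],
                        levels = c("Spring", "Summer", "Fall", "Winter"))
movie$quarter <- factor(paste0("Q", (m - 1) %/% 3 + 1))
movie$year    <- as.numeric(format(movie$date, "%Y"))
```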
Below is the final structure of our dataframe.
## 'data.frame': 3225 obs. of 15 variables:
## $ budget : int 237000000 300000000 245000000 250000000 260000000 258000000 260000000 280000000 250000000 250000000 ...
## $ genres : Factor w/ 18 levels "Action","Adventure",..: 1 2 1 1 1 9 3 1 2 1 ...
## $ popularity: num 150.4 139.1 107.4 112.3 43.9 ...
## $ company : Factor w/ 6 levels "Others","Paramount Pictures",..: 1 5 3 1 5 3 5 5 6 6 ...
## $ date : Date, format: "2009-12-10" "2007-05-19" ...
## $ revenue : num 2.79e+09 9.61e+08 8.81e+08 1.08e+09 2.84e+08 ...
## $ runtime : num 162 169 148 165 132 139 100 141 153 151 ...
## $ title : Factor w/ 3224 levels "(500) Days of Summer",..: 259 1761 2129 2420 1265 2139 2256 260 1053 310 ...
## $ score : num 7.2 6.9 6.3 7.6 6.1 5.9 7.4 7.3 7.4 5.7 ...
## $ vote : int 11800 4500 4466 9106 2124 3576 3330 6767 5293 7004 ...
## $ profit : num 2.55e+09 6.61e+08 6.36e+08 8.35e+08 2.41e+07 ...
## $ profitable: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ season : Factor w/ 4 levels "Spring","Summer",..: 4 1 3 2 1 1 3 1 2 1 ...
## $ quarter : Factor w/ 4 levels "Q1","Q2","Q3",..: 4 2 4 3 1 2 4 2 3 1 ...
## $ year : num 2009 2007 2015 2012 2012 ...
The summary of continuous variables:
revenue | budget | popularity | runtime | score | vote | profit
---|---|---|---|---|---|---
Min. :5.00e+00 | Min. :1.00e+00 | Min. : 0 | Min. : 41 | Min. :2.30 | Min. : 1 | Min. :-1.66e+08
1st Qu.:1.71e+07 | 1st Qu.:1.05e+07 | 1st Qu.: 10 | 1st Qu.: 96 | 1st Qu.:5.80 | 1st Qu.: 179 | 1st Qu.: 2.52e+05
Median :5.52e+07 | Median :2.50e+07 | Median : 20 | Median :107 | Median :6.30 | Median : 471 | Median : 2.64e+07
Mean :1.21e+08 | Mean :4.07e+07 | Mean : 29 | Mean :111 | Mean :6.31 | Mean : 978 | Mean : 8.07e+07
3rd Qu.:1.46e+08 | 3rd Qu.:5.50e+07 | 3rd Qu.: 37 | 3rd Qu.:121 | 3rd Qu.:6.90 | 3rd Qu.: 1148 | 3rd Qu.: 9.75e+07
Max. :2.79e+09 | Max. :3.80e+08 | Max. :876 | Max. :338 | Max. :8.50 | Max. :13752 | Max. : 2.55e+09
The variables differ greatly in scale. We need to scale the data to obtain accurate models, so all predictors will be standardized.
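Standardization is a one-liner with scale(); a sketch on toy budget values:

```r
# Standardize a predictor to mean 0, sd 1 (toy budget values)
budget        <- c(237, 300, 245, 250, 260) * 1e6
budget_scaled <- as.numeric(scale(budget))

round(mean(budget_scaled), 10)  # 0
round(sd(budget_scaled), 10)    # 1
```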
Before moving to the main content, we take an overview of the data.
In reality, film managers want to predict the success of a movie before its main release. The information available to them includes the budget, runtime, genre, production company, popularity, vote count and score (vote count and score can be obtained from a preview screening; popularity can be generated by advertisements, trailers and leaks from the movie). We will use these variables as the predictors.
This section presents different types of models to predict the revenue: Linear Regression, Decision Tree (Regression Tree) and Random Forest. For each model, we perform model evaluation to obtain the best formula (adjusted R-squared, BIC and Cp for Linear Regression; pruning for the Regression Tree; tuning for Random Forest). Finally, we compare the three best models and draw a conclusion.
Before constructing the models, we take a first glance at the relationships between revenue and the predictors.
Budget, popularity and vote seem to have high correlations with the revenue.
To examine the dependency of revenue on the categorical variables, we use ANOVA tests to compare the mean revenue across genres, companies and seasons.
Overall, there is evidence that the distributions of revenue across genres are not the same; revenue appears to depend on genre.
The p-values of the ANOVA tests with genre, company and season are 2.584e-80, 1.126e-28 and 5.288e-10, respectively.
Since all p-values are smaller than the 0.05 level, there is evidence that the mean revenues of different genres, companies and seasons are not the same. We can conclude that the overall effects of genre, company and season on revenue are statistically significant.
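Each of these tests follows the same one-way ANOVA pattern; a minimal sketch on synthetic stand-in data:

```r
# One-way ANOVA of revenue on genre (synthetic stand-in data)
set.seed(1)
df <- data.frame(
  revenue = c(rnorm(30, 100), rnorm(30, 120), rnorm(30, 90)),
  genre   = rep(c("Action", "Comedy", "Drama"), each = 30)
)
fit <- aov(revenue ~ genre, data = df)
summary(fit)  # the Pr(>F) column holds the p-value reported above
```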
We split our data into a training set and a testing set with a 67:33 ratio.
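A sketch of the split; the seed is an assumption, since the report's own seed is not shown:

```r
set.seed(123)                 # assumed seed, for reproducibility
d   <- data.frame(x = 1:100)  # stands in for the cleaned movie data
idx <- sample(nrow(d), size = round(0.67 * nrow(d)))
train <- d[idx, , drop = FALSE]
test  <- d[-idx, , drop = FALSE]

nrow(train)  # 67
nrow(test)   # 33
```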
In this part, we try to build a linear regression model with numerical predictors.
We construct the model on the training set (using all numerical variables).
##
## Call:
## lm(formula = revenue ~ ., data = train1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.21e+08 -3.89e+07 -1.92e+06 2.46e+07 1.60e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 122161921 2168480 56.34 < 2e-16 ***
## budget 82502895 2822244 29.23 < 2e-16 ***
## popularity 14588304 2952684 4.94 8.4e-07 ***
## runtime -1265467 2415512 -0.52 0.60
## score 212212 2648862 0.08 0.94
## vote 85807055 3723449 23.05 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.01e+08 on 2170 degrees of freedom
## Multiple R-squared: 0.711, Adjusted R-squared: 0.711
## F-statistic: 1.07e+03 on 5 and 2170 DF, p-value: <2e-16
## budget popularity runtime score vote
## 1.68 2.15 1.26 1.50 2.95
As seen from the results, score and runtime are not statistically significant, so it seems we can exclude these variables from the model. We will perform model selection using adjusted R-squared, BIC and Cp to optimize our model.
All three methods agree that the model with budget, popularity and vote is the best. This matches our observation above that runtime and score can be removed from the model.
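This kind of selection can be reproduced with regsubsets() from the leaps package; the data below are synthetic stand-ins for ours:

```r
library(leaps)

# Best-subset search over five numeric predictors (synthetic data)
set.seed(2)
d <- data.frame(budget = rnorm(200), popularity = rnorm(200),
                runtime = rnorm(200), score = rnorm(200), vote = rnorm(200))
d$revenue <- 2 * d$budget + d$popularity + 3 * d$vote + rnorm(200)

sub <- regsubsets(revenue ~ ., data = d, nvmax = 5)
ss  <- summary(sub)
which.max(ss$adjr2)  # best model size by adjusted R-squared
which.min(ss$bic)    # ... by BIC
which.min(ss$cp)     # ... by Mallows' Cp
```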
After feature selection, we build the new, optimized model.
##
## Call:
## lm(formula = revenue ~ budget + popularity + vote, data = train1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.21e+08 -3.85e+07 -2.19e+06 2.44e+07 1.60e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.22e+08 2.17e+06 56.36 < 2e-16 ***
## budget 8.23e+07 2.62e+06 31.45 < 2e-16 ***
## popularity 1.46e+07 2.95e+06 4.96 7.8e-07 ***
## vote 8.57e+07 3.47e+06 24.72 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.01e+08 on 2172 degrees of freedom
## Multiple R-squared: 0.711, Adjusted R-squared: 0.711
## F-statistic: 1.78e+03 on 3 and 2172 DF, p-value: <2e-16
## budget popularity vote
## 1.45 2.15 2.55
All predictors are statistically significant since their p-values are smaller than the 0.05 level. The VIFs are smaller than 3, meaning the predictors are only moderately correlated; multicollinearity does not appear to be problematic. The adjusted R-squared is 0.711, the same as for the model with all numerical predictors: roughly 71% of the variance in the response variable can be explained by the predictors.
We run our linear model on the testing data and obtain metrics such as RMSE and MAE for evaluation and comparison with other models.
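These metrics can be computed with small helper functions; the toy vectors are only to show the formulas:

```r
# Error metrics used for model comparison
mae  <- function(actual, pred) mean(abs(actual - pred))
mse  <- function(actual, pred) mean((actual - pred)^2)
rmse <- function(actual, pred) sqrt(mse(actual, pred))
mape <- function(actual, pred) mean(abs((actual - pred) / actual)) * 100  # in percent

actual <- c(100, 200, 300)
pred   <- c(110, 190, 330)
mae(actual, pred)   # 16.66667
rmse(actual, pred)  # 19.14854
```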
metric | test | train
---|---|---
mae | 6.25e+07 | 5.86e+07
mse | 1.12e+16 | 1.02e+16
rmse | 1.06e+08 | 1.01e+08
mape | 1.06e+04 | 4.76e+03
The MAE and RMSE on the testing set are only slightly higher than on the training set, meaning the model's performances on the two sets are not significantly different. The model performs a little better on the training set, which is expected since the model was fitted on it. Moreover, the RMSEs are smaller than the average revenue and the R-squared is 0.711. It appears that the model neither overfits nor underfits the data.
In this part, we construct the model with all variables in the dataset.
##
## Call:
## lm(formula = revenue ~ ., data = train1_full)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.08e+08 -4.03e+07 -1.43e+06 2.90e+07 1.62e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 108110505 6817790 15.86 < 2e-16 ***
## budget 77348867 3033574 25.50 < 2e-16 ***
## popularity 13617667 2932282 4.64 3.6e-06 ***
## runtime 5900196 2594312 2.27 0.02305 *
## score -1602884 2751829 -0.58 0.56030
## vote 87028064 3716646 23.42 < 2e-16 ***
## genresAdventure 14998048 8669568 1.73 0.08378 .
## genresAnimation 87790816 13236699 6.63 4.2e-11 ***
## genresComedy 25268186 7163941 3.53 0.00043 ***
## genresCrime -10102172 11467478 -0.88 0.37845
## genresDocumentary 44711638 23188936 1.93 0.05397 .
## genresDrama 4923647 7294556 0.67 0.49976
## genresFamily 82467003 20391297 4.04 5.4e-05 ***
## genresFantasy 2870822 13660650 0.21 0.83357
## genresHistory 15689738 24998838 0.63 0.53032
## genresHorror 17619142 10531825 1.67 0.09448 .
## genresMusic 23079584 29354497 0.79 0.43182
## genresMystery 6103862 24024211 0.25 0.79946
## genresRomance 17057626 14926772 1.14 0.25327
## genresScience Fiction -15440346 15106799 -1.02 0.30686
## genresThriller -7679779 12337527 -0.62 0.53370
## genresWar -56563625 33719837 -1.68 0.09360 .
## genresWestern -3456414 24929422 -0.14 0.88974
## companyParamount Pictures 19120256 8303147 2.30 0.02139 *
## companySony Pictures 4555821 7970621 0.57 0.56767
## companyUniversal Pictures 16743343 7455117 2.25 0.02481 *
## companyWalt Disney 16706288 6373478 2.62 0.00882 **
## companyWarner Bros -535044 8559594 -0.06 0.95016
## seasonSummer -822742 6192423 -0.13 0.89431
## seasonFall -9929435 6117507 -1.62 0.10471
## seasonWinter -5891796 6282550 -0.94 0.34845
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 99300000 on 2145 degrees of freedom
## Multiple R-squared: 0.725, Adjusted R-squared: 0.721
## F-statistic: 188 on 30 and 2145 DF, p-value: <2e-16
The p-values and t-values indicate that there is no significant difference among the seasons and that score is not statistically significant; season and score do not seem to be necessary predictors. The overall effects of genre and company are significant.
When including season, genre and company in the model, the best numerical predictors are unchanged (budget, popularity and vote), and the effects of the different seasons appear to be the same. The best formula for the linear model in this case is: revenue ~ budget + vote + company + genres + popularity
We build a new model with the best formula.
##
## Call:
## lm(formula = revenue ~ budget + vote + company + genres + popularity,
## data = train1_full)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.14e+08 -4.04e+07 -8.25e+05 2.96e+07 1.62e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 103783104 5493099 18.89 < 2e-16 ***
## budget 79307464 2810718 28.22 < 2e-16 ***
## vote 87343750 3432074 25.45 < 2e-16 ***
## companyParamount Pictures 20187352 8296723 2.43 0.01505 *
## companySony Pictures 4706890 7968099 0.59 0.55477
## companyUniversal Pictures 17875529 7436574 2.40 0.01631 *
## companyWalt Disney 16364447 6369330 2.57 0.01026 *
## companyWarner Bros -135185 8558273 -0.02 0.98740
## genresAdventure 14924242 8645045 1.73 0.08443 .
## genresAnimation 80203108 12811282 6.26 4.6e-10 ***
## genresComedy 24197175 7152449 3.38 0.00073 ***
## genresCrime -8821228 11302914 -0.78 0.43522
## genresDocumentary 40500854 22990580 1.76 0.07827 .
## genresDrama 6560315 6935298 0.95 0.34429
## genresFamily 78112299 20296238 3.85 0.00012 ***
## genresFantasy 1516100 13618042 0.11 0.91136
## genresHistory 22335236 24701813 0.90 0.36599
## genresHorror 15897705 10491528 1.52 0.12985
## genresMusic 22199706 29264598 0.76 0.44818
## genresMystery 3544411 24017226 0.15 0.88269
## genresRomance 16669177 14880624 1.12 0.26276
## genresScience Fiction -15795028 15101070 -1.05 0.29570
## genresThriller -7928898 12337416 -0.64 0.52051
## genresWar -55041648 33646312 -1.64 0.10201
## genresWestern 726641 24731085 0.03 0.97656
## popularity 13447493 2930278 4.59 4.7e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 99400000 on 2150 degrees of freedom
## Multiple R-squared: 0.724, Adjusted R-squared: 0.721
## F-statistic: 225 on 25 and 2150 DF, p-value: <2e-16
The adjusted R-squared is 0.721, slightly better than the model with only numerical variables (an increase of one percentage point).
metric | test | train
---|---|---
mae | 6.19e+07 | 5.83e+07
mse | 1.09e+16 | 9.76e+15
rmse | 1.05e+08 | 9.88e+07
mape | 7.33e+03 | 8.55e+03
This model shows a small improvement over the previous model with only continuous variables: the adjusted R-squared increases by one percentage point, and the RMSE and MAE on both the training and testing sets decrease slightly.
This part compares the two linear models using AIC and BIC.
Model | AIC | BIC
---|---|---
Model 1 (continuous only) | 86395 | 86424
Model 2 (continuous + categorical) | 86343 | 86497
Model 2 has lower AIC than Model 1, which indicates that Model 2 is better for predicting the revenue.
Model 1 has lower BIC than Model 2, which indicates that Model 1 is better as a true function to explain the revenue (BIC prefers simple models).
If we want an explanatory model, the linear model with continuous variables is better. However, in our case, since we need more predictive power, we would prefer the second linear model with both continuous and categorical predictors.
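The comparison itself is straightforward in R; a sketch on toy nested models:

```r
# AIC/BIC comparison of two nested linear models (toy data)
set.seed(3)
d   <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- d$x1 + rnorm(100)
m1  <- lm(y ~ x1, data = d)
m2  <- lm(y ~ x1 + x2, data = d)

AIC(m1, m2)  # lower AIC -> preferred for prediction
BIC(m1, m2)  # BIC penalizes extra parameters more heavily, favoring simpler models
```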
With Decision Tree we can address both numerical and categorical variables in the model.
We can try two functions to build a regression tree model.
tree() function:
rpart() function:
Both functions give the same tree model; rpart() produces nicer plots.
We perform pruning to optimize our tree. To improve R-squared, we look for the CP with the lowest cross-validated relative error.
Here are the errors at each CP.
CP | Error |
---|---|
1 | 1.001 |
2 | 0.709 |
3 | 0.535 |
4 | 0.483 |
5 | 0.455 |
6 | 0.452 |
7 | 0.412 |
8 | 0.383 |
9 | 0.385 |
The error is lowest at 0.383 when CP = 8.
This is our pruned tree using CP = 8.
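The pruning step can be sketched with rpart, with mtcars standing in for our data:

```r
library(rpart)

# Grow a regression tree, then prune at the CP with the lowest
# cross-validated error (xerror)
tree    <- rpart(mpg ~ ., data = mtcars, method = "anova")
cp_best <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(tree, cp = cp_best)

printcp(pruned)  # CP table of the pruned tree
```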
The tree excludes the node with genres.
metric | test | train
---|---|---
mae | 6.59e+07 | 6.04e+07
mse | 1.32e+16 | 1.06e+16
rmse | 1.15e+08 | 1.03e+08
mape | 1.36e+04 | 7.22e+03
The R-squared of the decision tree is 0.617, smaller than the adjusted R-squared of the linear model (0.711). The RMSEs and MAEs of the decision tree are also higher than those of the linear model. Overall, the decision tree model is not as good as the linear model.
In this part, we try a model more powerful than a single decision tree: random forest. Random forest is an ensemble method that combines many trees to boost predictive power.
##
## Call:
## randomForest(formula = revenue ~ ., data = train1_full, ntree = 350)
## Type of random forest: regression
## Number of trees: 350
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 9.46e+15
## % Var explained: 73.2
The pseudo R-squared of the random forest is 0.732: 73.2% of the variance in the response variable can be explained by the predictors. This is better than both the linear model and the decision tree.
Let’s look at how random forest works.
As the number of trees increases, the mean squared error (MSE) decreases. Beyond a certain number of trees (around 100 in our case), the MSE shows no significant change. Random forest uses many trees to minimize the MSE and hence boost predictive power.
We have 8 predictors, but only two variables are tried at each split in the random forest above. The number of variables tried at each split is controlled by the parameter mtry. Different mtry values give different predictions, and in this section we find the best mtry for our model using hyperparameter tuning, a standard way to improve random forest performance.
## mtry = 2 OOB error = 9.46e+15
## Searching left ...
## mtry = 1 OOB error = 1.06e+16
## -0.118 0.05
## Searching right ...
## mtry = 4 OOB error = 9.33e+15
## 0.014 0.05
Using the tuneRF() function, we can find the best mtry. According to the graph, the lowest OOB error is achieved at mtry = 4. We will build a model with this mtry.
##
## Call:
## randomForest(formula = revenue ~ ., data = train1_full, mtry = 4, ntree = 350)
## Type of random forest: regression
## Number of trees: 350
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 9.17e+15
## % Var explained: 74
There is a slight improvement after tuning: the percentage of variance explained increases by one point. In the literature, hyperparameter tuning not only improves predictive power but can also mitigate overfitting in random forests; however, since we see no sign of overfitting in our project, this benefit does not show up here.
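The tuning step above can be sketched as follows, with mtcars as a stand-in dataset (the stepFactor and improve values are assumptions):

```r
library(randomForest)

set.seed(4)
# Search for the mtry with the lowest out-of-bag error
tuned <- tuneRF(x = mtcars[, -1], y = mtcars$mpg, ntreeTry = 350,
                stepFactor = 2, improve = 0.05, trace = FALSE, plot = FALSE)
best_mtry <- tuned[which.min(tuned[, "OOBError"]), "mtry"]

# Refit the forest at the tuned mtry
rf <- randomForest(mpg ~ ., data = mtcars, mtry = best_mtry, ntree = 350)
```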
metric | test | train
---|---|---
mae | 5.51e+07 | 5.24e+07
mse | 9.45e+15 | 9.17e+15
rmse | 9.72e+07 | 9.57e+07
mape | 5.80e+03 | 4.77e+03
There is no sign of overfitting in the random forest model. Its RMSEs and MAEs are smaller than those of both the linear model and the decision tree, indicating that random forest performs best among the three models.
Model | Linear Model | Regression Tree | Random Forest |
---|---|---|---|
R-squared (adjusted/pseudo) | 0.721 | 0.617 | 0.741 |
MAE - train | 5.8e+07 | 6e+07 | 5.2e+07 |
MAE - test | 6.2e+07 | 6.6e+07 | 5.5e+07 |
RMSE - train | 9.9e+07 | 1e+08 | 9.6e+07 |
RMSE - test | 1e+08 | 1.1e+08 | 9.7e+07 |
Higher (pseudo) R-squared means that random forest fits the data better than linear model and decision tree. Lower RMSE and MSE indicate that random forest predictions are closer to the actual values.
In summary, random forest is the best model among the three for predicting revenue, while the decision tree has the worst performance.
In this chapter, we experiment with one-hot encoding and apply Lasso and Ridge regression to predict the profit.
Budget, popularity and vote have relatively strong correlations with profit, while runtime and score are only moderately correlated. The correlations between profit and the other variables are similar to those found for revenue.
For each company, we create a new column named after that company and mark it 0 or 1 depending on whether the movie belongs to that company. We do the same for month and genre, so that individual columns can be dropped during feature selection later.
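One-hot encoding is what model.matrix() produces when the intercept is dropped; a sketch on a toy company column:

```r
# One-hot encode a factor: one 0/1 column per level
d <- data.frame(company = factor(c("Disney", "Universal", "Others")),
                profit  = c(5, 3, 1))
onehot <- model.matrix(~ company - 1, data = d)  # "-1" removes the intercept

colnames(onehot)  # "companyDisney" "companyOthers" "companyUniversal"
```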
In general, month is not highly correlated with profit, but in the plots of the rhos from both the Spearman and Pearson methods we observe that month six (June) has a relatively positive effect on profit, whereas month nine (September) has a relatively negative effect.
From the two plots of rhos for the correlations between companies and profit, Disney and Universal have relatively stronger correlations with profit; judging by the magnitudes of the rhos, the other four company groups are not strongly correlated with profit.
## Abbreviation
## budget b
## popularity p
## runtime r
## vote v
## genresAdventure gnrsAd
## genresAnimation gnrsAn
## genresComedy gnrsCm
## genresCrime gnrsCr
## genresDocumentary gnrsDc
## genresDrama gnrsDr
## genresFamily gnrsFm
## genresFantasy gnrsFn
## genresHistory gnrsHs
## genresHorror gnrsHr
## genresMusic gnrsMs
## genresMystery gnrsMy
## genresRomance gR
## genresScience Fiction gSF
## genresThriller gT
## genresWar gnrsWr
## genresWestern gnrsWs
## month2 m2
## month3 m3
## month4 m4
## month5 m5
## month6 m6
## month7 m7
## month8 m8
## month9 m9
## month10 m10
## month11 m11
## month12 m12
## score s
## companyParamount Pictures cPP
## companySony Pictures cSP
## companyUniversal Pictures cUP
## companyWalt Disney cD
## companyWarner Bros cB
According to the plots above, budget, vote, May, June, December, Animation and Family are the best predictors for constructing the model.
We try to find the optimal value of lambda; it turns out that the best lambda is 0.032.
With lambda equal to 0.032, the R-squared is 0.604, slightly better than that of the linear model. Since the best lambda is close to 0, the original linear regression model is not overfitting.
Here, the lowest MSE occurs at \(\lambda\) = 0.006 and the R-squared becomes 0.603. The lasso regression drops the 10 useless variables noted above, leaving 27 non-zero coefficients.
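Both penalized fits follow the same cv.glmnet() pattern; a sketch on synthetic data (alpha = 0 gives ridge, alpha = 1 gives lasso):

```r
library(glmnet)

set.seed(5)
x <- matrix(rnorm(100 * 10), 100, 10)    # synthetic predictors
y <- x[, 1] + 0.5 * x[, 2] + rnorm(100)  # only two are truly useful

cv_ridge <- cv.glmnet(x, y, alpha = 0)   # ridge: shrinks, never zeroes
cv_lasso <- cv.glmnet(x, y, alpha = 1)   # lasso: drops useless predictors

cv_ridge$lambda.min               # lambda with the lowest CV error
coef(cv_lasso, s = "lambda.min")  # sparse coefficient vector
```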
##
## Call:
## randomForest(x = movie_RF, y = movie$profit, ntree = 500)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 8.92e+15
## % Var explained: 64.4
The random forest model has the highest R-squared, 0.644, with 500 trees. Since month is not significant and can be treated numerically, we did not convert it to a factor; the model is in fact slightly better with month as a number (R-squared 0.639 with month as a factor), and keeping month as an integer also simplifies further predictions.
It turns out that 500 trees are enough to build a relatively good model.
To predict the profit (box office revenue minus budget), we take the then-upcoming Frozen II. With all the variables we have, plus values such as popularity, score and vote taken from its predecessor Frozen, the random forest regression estimates the profit of Frozen II at 3.556e+08.
With some polynomial terms, we achieve an adjusted R-squared of 0.653. Some of the polynomial variables, such as poly(score, 6), are significant; however, since the model requires around 60 features in total, such a polynomial regression is not preferable.
With the polynomial regression, the estimated profit of Frozen II is 5.904e+08.
Our goal in this chapter is to examine whether Principal Component Analysis (PCA) is effective for dimensionality reduction. In our data, we have 5 continuous variables for predicting revenue. We perform principal component regression directly and see how many components are sufficient for the model.
Non-centered data:
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.89e+08 3.10e+07 926 23.9 20.1 0.699
## Proportion of Variance 9.74e-01 2.62e-02 0 0.0 0.0 0.000
## Cumulative Proportion 9.74e-01 1.00e+00 1 1.0 1.0 1.000
Centered data:
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.768 1.108 0.894 0.6466 0.5045 0.4186
## Proportion of Variance 0.521 0.204 0.133 0.0697 0.0424 0.0292
## Cumulative Proportion 0.521 0.725 0.859 0.9284 0.9708 1.0000
We see significant differences in the variances and standard deviations of the components. In fact, revenue and all predictors except budget are measured in different units: revenue and budget are in the millions, scales far larger than those of vote, score, popularity, etc.
Therefore, we recommend scaling the data before constructing the model. We will see the differences between the non-centered and centered data in the following parts.
In the non-centered version, one component explains almost 100% of the variance, while in the centered version one component explains only about 50% and we need 3 components to reach 80%. This is further evidence that we should scale the data, since PC1 overwhelms the other components in the non-centered version.
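The contrast between the two versions is easy to reproduce with prcomp(), with mtcars standing in for our data:

```r
# PCA with and without standardization
pca_raw    <- prcomp(mtcars, center = FALSE, scale. = FALSE)
pca_scaled <- prcomp(mtcars, center = TRUE,  scale. = TRUE)

summary(pca_raw)$importance[2, 1]     # PC1's share of variance on raw scales
summary(pca_scaled)$importance[2, 1]  # PC1's share shrinks after scaling
```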
We construct the regression model using the PCA method directly. We plot the MSEP and R-squared against the number of components to see how many are sufficient to capture the variance in revenue.
Non-centered version:
Centered version:
Both versions agree that 2 components give the best MSEP and R-squared.
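The PCR fit itself can be sketched with the pls package (mtcars as a stand-in; svdpc is the package's default fit method, as shown in the summaries):

```r
library(pls)

set.seed(6)
# Principal component regression with 10-fold cross-validation
fit <- pcr(mpg ~ ., data = mtcars, scale = TRUE, validation = "CV")

validationplot(fit, val.type = "MSEP")  # pick the count where MSEP levels off
summary(fit)                            # RMSEP per number of components
```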
We can see the coefficients of different components.
(PC1 is at position 2; position 1 is the intercept.)
According to the graph, the change in coefficients after PC2 is not significant compared to the difference between the coefficients of PC1 and PC2. Hence, we can build the model with PC1 and PC2, since two components are sufficient to capture the variance.
Let’s look at the summary of the PCR model and compare the non-centered and centered versions.
Non-centered version
## Data: X dimension: 2176 5
## Y dimension: 2176 1
## Fit method: svdpc
## Number of components considered: 5
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
## CV 1.88e+08 122151898 108524196 107141642 102852443 102965230
## adjCV 1.88e+08 122077691 108448087 107128888 102792895 102854326
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 49.00 71.71 87.31 95.63 100.00
## revenue 57.87 66.61 67.94 70.47 71.13
Centered version
## Data: X dimension: 2176 5
## Y dimension: 2176 1
## Fit method: svdpc
## Number of components considered: 5
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps
## CV 1.88e+08 121720344 106321914 1.06e+08 103819563 102861273
## adjCV 1.88e+08 121665663 106238340 1.06e+08 103733078 102758138
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 48.50 71.69 87.41 95.61 100.00
## revenue 58.21 68.09 68.67 70.49 71.13
The scaled data explain more of the variance in revenue than the non-scaled data, which indicates we should standardize the data for more accurate prediction.
We will evaluate our PCR model by predicting on the testing data.
There is a significant increase in variance from PC1 to PC2; after that, the change is not drastic. We can say that 2 principal components are enough to capture the majority of the variance in the testing data. Our PCR model seems to perform properly on the testing data (both the training and testing sets require 2 components).
We can use the principal components as predictors in a linear model. We build models with two components for both the centered and non-centered versions.
Non-centered version
##
## Call:
## lm(formula = revenue ~ PC1 + PC2, data = movie_pcr.nc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.20e+09 -4.08e+07 -5.95e+06 2.28e+07 1.76e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 121493614 2330120 52.1 <2e-16 ***
## PC1 -89813816 1463518 -61.4 <2e-16 ***
## PC2 51258272 2149532 23.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.09e+08 on 2173 degrees of freedom
## Multiple R-squared: 0.666, Adjusted R-squared: 0.666
## F-statistic: 2.17e+03 on 2 and 2173 DF, p-value: <2e-16
## PC1 PC2
## 1 1
Centered version
##
## Call:
## lm(formula = revenue ~ PC1 + PC2, data = movie_pcr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.11e+09 -4.03e+07 -4.95e+06 2.42e+07 1.73e+09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 120012282 2277573 52.7 <2e-16 ***
## PC1 -92108152 1462914 -63.0 <2e-16 ***
## PC2 54901946 2115458 25.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.06e+08 on 2173 degrees of freedom
## Multiple R-squared: 0.681, Adjusted R-squared: 0.681
## F-statistic: 2.32e+03 on 2 and 2173 DF, p-value: <2e-16
## PC1 PC2
## 1 1
All VIFs are 1, a nice feature of PCA that prevents multicollinearity. The scaled version gives a better R-squared than the non-scaled version, another indicator of why scaling the data is recommended. There is no big difference between the adjusted R-squared of the linear model using 2 components (0.681) and the linear model using all variables (0.711), which shows that we can apply PCA without hurting the predictive power.
Compared to the linear model with numerical variables in the previous chapter, the adjusted R-squared decreases slightly, from 71.1% to 68.1%. However, instead of using 5 variables (budget + popularity + vote + score + runtime), we need only 2 (PC1 and PC2) while still capturing the majority of the variance. In conclusion, PCA lets us improve computational speed and reduce dimensionality without significantly hurting model performance.
In this chapter, we use different kinds of models to predict whether a movie earns a profit (revenue > budget). In our data, the column profitable represents this: a movie with revenue > budget is labeled 1, otherwise 0.
Our prior chi-squared test indicates that there is evidence for the dependency of revenue on season, company and genres as all p-values are smaller than 0.05.
Here is the summary of logistic model.
##
## Call:
## glm(formula = y ~ ., family = "binomial", data = train3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.683 0.000 0.297 0.735 1.743
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.05e+00 5.35e-01 -3.84 0.00012 ***
## budget -1.75e-08 2.51e-09 -6.99 2.8e-12 ***
## popularity 1.61e-02 1.26e-02 1.28 0.20159
## runtime 1.66e-03 3.28e-03 0.50 0.61421
## score 2.27e-01 8.46e-02 2.68 0.00739 **
## vote 2.67e-03 4.49e-04 5.95 2.7e-09 ***
## genresAdventure 1.73e-01 2.51e-01 0.69 0.49015
## genresAnimation 2.70e-01 3.97e-01 0.68 0.49646
## genresComedy 4.31e-01 1.91e-01 2.26 0.02360 *
## genresCrime 7.35e-02 2.96e-01 0.25 0.80411
## genresDocumentary 4.66e-01 5.25e-01 0.89 0.37479
## genresDrama 7.25e-02 1.91e-01 0.38 0.70424
## genresFamily 2.31e-01 5.54e-01 0.42 0.67656
## genresFantasy 2.28e-01 4.32e-01 0.53 0.59765
## genresHistory 6.55e-01 7.13e-01 0.92 0.35873
## genresHorror 7.92e-01 3.16e-01 2.51 0.01212 *
## genresMusic 2.74e-01 7.09e-01 0.39 0.69916
## genresMystery -4.23e-01 6.33e-01 -0.67 0.50393
## genresRomance 8.50e-01 4.26e-01 2.00 0.04590 *
## genresScience Fiction 2.64e-01 5.05e-01 0.52 0.60148
## genresThriller -1.50e-01 3.29e-01 -0.46 0.64724
## genresWar -1.36e+00 8.29e-01 -1.64 0.10019
## genresWestern 2.27e+00 1.08e+00 2.10 0.03582 *
## companyParamount Pictures 9.81e-01 2.43e-01 4.03 5.5e-05 ***
## companySony Pictures 6.20e-01 2.22e-01 2.79 0.00524 **
## companyUniversal Pictures 8.52e-01 2.34e-01 3.64 0.00027 ***
## companyWalt Disney 8.60e-01 1.84e-01 4.67 3.0e-06 ***
## companyWarner Bros 7.69e-01 2.45e-01 3.13 0.00172 **
## seasonSummer 3.90e-01 1.74e-01 2.24 0.02505 *
## seasonFall -4.65e-02 1.62e-01 -0.29 0.77361
## seasonWinter 7.04e-02 1.69e-01 0.42 0.67734
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2409.2 on 2175 degrees of freedom
## Residual deviance: 1805.0 on 2145 degrees of freedom
## AIC: 1867
##
## Number of Fisher Scoring iterations: 8
We can test the overall effect of genres/companies/seasons on the prediction using the Wald test.
The p-values for genres, company and season are 0.12, 8.4e-09 and 0.045, respectively. The overall effect of genres is therefore not statistically significant, while the effect of company is the most significant.
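The joint Wald test can be computed directly from the fitted coefficients and their covariance matrix; below is a minimal, dependency-free sketch on simulated data (the variable names are illustrative, not the report's actual data).

```r
# Sketch of a joint Wald test that all dummy coefficients of a factor are zero:
# W = b' V^{-1} b, which follows a chi-squared distribution with df = length(b).
set.seed(1)
n <- 500
genre <- factor(sample(c("Action", "Comedy", "Drama"), n, replace = TRUE))
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.3 * x + 0.5 * (genre == "Comedy")))

fit <- glm(y ~ x + genre, family = binomial)

wald_test <- function(model, pattern) {
  idx <- grep(pattern, names(coef(model)))    # coefficients in the tested group
  b   <- coef(model)[idx]
  V   <- vcov(model)[idx, idx, drop = FALSE]
  W   <- as.numeric(t(b) %*% solve(V) %*% b)  # Wald chi-squared statistic
  c(statistic = W, df = length(idx),
    p.value = pchisq(W, df = length(idx), lower.tail = FALSE))
}

wald_test(fit, "^genre")  # joint test of the two genre dummies
```

Packages such as aod (wald.test) or survey (regTermTest) compute the same statistic.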
We will optimize the model using the AIC criterion.
budget | popularity | runtime | score | vote | genres | company | season | Criterion |
---|---|---|---|---|---|---|---|---|
TRUE | FALSE | FALSE | TRUE | TRUE | FALSE | TRUE | TRUE | 1855 |
TRUE | TRUE | FALSE | TRUE | TRUE | FALSE | TRUE | TRUE | 1856 |
TRUE | FALSE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | 1857 |
TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | TRUE | 1858 |
TRUE | FALSE | FALSE | TRUE | TRUE | FALSE | TRUE | FALSE | 1858 |
The best formula for the logistic model is: y ~ budget + score + vote + company + season. The feature selection excludes genres, consistent with the Wald test above, which found no significant overall effect for genres.
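An AIC-based search of this kind can be reproduced in spirit with base R's step(); the sketch below runs on simulated data (in the report this would be run on train3).

```r
# Sketch: stepwise selection by AIC with base R's step() on simulated data.
set.seed(2)
n <- 400
d <- data.frame(budget  = rexp(n),
                score   = rnorm(n),
                vote    = rnorm(n),
                runtime = rnorm(n))
d$y <- rbinom(n, 1, plogis(-0.8 * d$budget + 0.6 * d$vote))

full <- glm(y ~ budget + score + vote + runtime, data = d, family = binomial)
best <- step(full, direction = "both", trace = 0)  # AIC-minimizing submodel

formula(best)           # the selected formula
AIC(best) <= AIC(full)  # step() never returns a worse AIC than its start
```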
##
## Call:
## glm(formula = y ~ budget + score + vote + company + season, family = "binomial",
## data = train3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.733 0.000 0.306 0.763 1.770
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.44e+00 4.57e-01 -3.15 0.00163 **
## budget -1.83e-08 2.30e-09 -7.95 1.9e-15 ***
## score 2.10e-01 7.18e-02 2.93 0.00340 **
## vote 3.18e-03 2.37e-04 13.43 < 2e-16 ***
## companyParamount Pictures 9.64e-01 2.40e-01 4.02 5.7e-05 ***
## companySony Pictures 5.48e-01 2.19e-01 2.50 0.01237 *
## companyUniversal Pictures 8.42e-01 2.30e-01 3.67 0.00024 ***
## companyWalt Disney 8.75e-01 1.80e-01 4.86 1.2e-06 ***
## companyWarner Bros 7.88e-01 2.41e-01 3.27 0.00109 **
## seasonSummer 3.84e-01 1.72e-01 2.24 0.02513 *
## seasonFall -6.93e-02 1.59e-01 -0.43 0.66375
## seasonWinter 5.37e-02 1.66e-01 0.32 0.74708
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2409.2 on 2175 degrees of freedom
## Residual deviance: 1832.9 on 2164 degrees of freedom
## AIC: 1857
##
## Number of Fisher Scoring iterations: 7
All the company coefficients are significantly different from one another. Spring, Fall and Winter have similar effects, while Summer is significantly different from them.
According to this table, Summer has the best odds ratio among the four seasons, meaning that the chance for a movie to earn a profit is higher in Summer than in the other seasons.
As for movie studios, a movie that is not from the five biggest studios has the lowest chance to earn a profit, whereas a movie from Paramount Pictures seems to have the highest chance.
Another interesting point is the exponentiated coefficient of budget. The table shows it as 1, but in fact it is about 0.9999999817. Increasing the budget reduces the chance of earning a profit, but the per-dollar decrease is negligible: a change in budget has little effect on the prediction unless it increases by billions of dollars, which may never happen in reality.
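For reference, the odds ratios discussed here are just the exponentiated coefficients; a quick sketch using the fitted budget coefficient reported above (-1.83e-08):

```r
# Odds ratios are exp(coefficient). For budget, the per-dollar odds ratio is
# indistinguishable from 1, but scaling to a $100M increase makes it visible.
b_budget <- -1.83e-08                  # coefficient from the model summary
or_per_dollar <- exp(b_budget)         # ~0.9999999817, essentially 1
or_per_100m   <- exp(b_budget * 1e8)   # odds ratio for a +$100M budget
round(or_per_100m, 2)                  # ~0.16: odds of profit drop sharply
```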
We can validate the model on the testing set with the following methods.
The Hosmer-Lemeshow goodness-of-fit test indicates that the model is an adequate fit (for this test, a good fit corresponds to a non-significant p-value above 0.05, i.e. failing to reject the null hypothesis of adequate fit).
The area under the ROC curve is 0.849, which also shows that the model is good enough; this agrees with the Hosmer-Lemeshow test.
23.9% of the variance in y is explained by the predictors in our model, which is acceptable but not impressive.
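The 23.9% figure matches McFadden's pseudo R-squared, which can be computed directly from the deviances printed in the model summary above:

```r
# McFadden's pseudo R-squared: R2 = 1 - residual deviance / null deviance,
# using the deviances from the fitted logistic model's summary.
null_dev  <- 2409.2
resid_dev <- 1832.9
r2_mcfadden <- 1 - resid_dev / null_dev
round(r2_mcfadden, 3)  # 0.239, i.e. the 23.9% quoted above
```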
Below is an example of the confusion matrix at a cut-off of 0.54 on the predicted probability.
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1049
##
##
## | prediction
## actual | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 130 | 130 | 260 |
## | 0.500 | 0.500 | 0.248 |
## | 0.688 | 0.151 | |
## | 0.124 | 0.124 | |
## -------------|-----------|-----------|-----------|
## 1 | 59 | 730 | 789 |
## | 0.075 | 0.925 | 0.752 |
## | 0.312 | 0.849 | |
## | 0.056 | 0.696 | |
## -------------|-----------|-----------|-----------|
## Column Total | 189 | 860 | 1049 |
## | 0.180 | 0.820 | |
## -------------|-----------|-----------|-----------|
##
##
Labels: negative profit = 0 (revenue < budget); positive profit = 1 (revenue > budget).
The overall accuracy is 82%.
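A sketch of how such a confusion matrix is built at a chosen cut-off; the data here are simulated, while the report applies the 0.54 cut-off to its own test set.

```r
# Sketch: confusion matrix and accuracy at a probability cut-off (0.54).
set.seed(3)
n <- 300
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, 1, plogis(1.5 * d$x))
train <- d[1:200, ]
test  <- d[201:300, ]

fit  <- glm(y ~ x, data = train, family = binomial)
prob <- predict(fit, newdata = test, type = "response")  # predicted P(y = 1)
pred <- as.integer(prob > 0.54)                          # apply the cut-off

cm <- table(actual = test$y, prediction = pred)
accuracy <- mean(pred == test$y)
cm
accuracy
```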
Below are the accuracies and kappas at different cut-off points on the predicted probability.
threshold | accuracy | kappa |
---|---|---|
0.5 | 81.9 | 0.425 |
0.6 | 80.7 | 0.486 |
0.7 | 77.4 | 0.476 |
0.8 | 69.0 | 0.379 |
0.9 | 59.2 | 0.280 |
Since we want to predict whether a movie has negative or positive profit and both outcomes matter equally, we evaluate with accuracy and kappa rather than recall or precision.
Accuracy is best at threshold 0.5, but kappa (which reflects inter-rater reliability) is best at threshold 0.6. In the range between 0.5 and 0.6, the differences in accuracy and kappa are not significant (all kappas indicate moderate reliability and all accuracies are roughly 81% or 82%), so we can select any value between 0.5 and 0.6 as the cut-off point.
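The accuracy/kappa sweep can be sketched as follows; Cohen's kappa is computed by hand so the snippet has no dependencies (the data are simulated, not the report's).

```r
# Cohen's kappa: agreement beyond chance, (po - pe) / (1 - pe).
cohens_kappa <- function(actual, pred) {
  t  <- table(factor(actual, levels = 0:1), factor(pred, levels = 0:1))
  n  <- sum(t)
  po <- sum(diag(t)) / n                     # observed agreement
  pe <- sum(rowSums(t) * colSums(t)) / n^2   # agreement expected by chance
  (po - pe) / (1 - pe)
}

set.seed(4)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(2 * x))
prob <- predict(glm(y ~ x, family = binomial), type = "response")

# Sweep cut-off thresholds and report both metrics.
for (th in c(0.5, 0.6, 0.7, 0.8, 0.9)) {
  pred <- as.integer(prob > th)
  cat(sprintf("threshold %.1f  accuracy %.3f  kappa %.3f\n",
              th, mean(pred == y), cohens_kappa(y, pred)))
}
```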
We can optimize our tree by pruning (in a complex model, pruning helps to reduce overfitting).
The cross-validated relative error is lowest at the 4th CP value, where the number of splits is 8; we use that CP to prune the tree.
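A sketch of this pruning step with rpart (the 'rpart' package ships with standard R distributions; iris stands in for the report's actual data):

```r
# Sketch: cost-complexity pruning. Grow a deliberately deep tree, then pick
# the CP with the lowest cross-validated error from the CP table and prune.
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0.001))

ct <- fit$cptable                               # CP, nsplit, rel error, xerror
best_cp <- ct[which.min(ct[, "xerror"]), "CP"]  # CP with lowest cv error
pruned  <- prune(fit, cp = best_cp)

pruned
```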
Let’s see our tree after pruning.
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1049
##
##
## | prediction
## actual | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 99 | 161 | 260 |
## | 0.381 | 0.619 | 0.248 |
## | 0.692 | 0.178 | |
## | 0.094 | 0.153 | |
## -------------|-----------|-----------|-----------|
## 1 | 44 | 745 | 789 |
## | 0.056 | 0.944 | 0.752 |
## | 0.308 | 0.822 | |
## | 0.042 | 0.710 | |
## -------------|-----------|-----------|-----------|
## Column Total | 143 | 906 | 1049 |
## | 0.136 | 0.864 | |
## -------------|-----------|-----------|-----------|
##
##
The overall accuracy is 80.5%.
In general, the decision tree's predictions are not as good as those of the logistic model.
The area under the curve is 0.764; the classification tree does not seem to be a good model in this case.
This is the confusion matrix of the KNN model at k = 3.
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1049
##
##
## | prediction
## actual | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 92 | 168 | 260 |
## | 0.354 | 0.646 | 0.248 |
## | 0.520 | 0.193 | |
## | 0.088 | 0.160 | |
## -------------|-----------|-----------|-----------|
## 1 | 85 | 704 | 789 |
## | 0.108 | 0.892 | 0.752 |
## | 0.480 | 0.807 | |
## | 0.081 | 0.671 | |
## -------------|-----------|-----------|-----------|
## Column Total | 177 | 872 | 1049 |
## | 0.169 | 0.831 | |
## -------------|-----------|-----------|-----------|
##
##
We now search for the k that gives the best accuracy.
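The search can be sketched with class::knn (the 'class' package ships with standard R distributions; iris stands in for the report's data):

```r
# Sketch: evaluate KNN test accuracy for k = 1..15 and pick the best k.
library(class)

set.seed(5)
idx     <- sample(nrow(iris), 100)   # training indices
train_x <- iris[idx, 1:4]
test_x  <- iris[-idx, 1:4]

acc <- sapply(1:15, function(k) {
  pred <- knn(train_x, test_x, cl = iris$Species[idx], k = k)
  mean(pred == iris$Species[-idx])   # test accuracy at this k
})

which.max(acc)  # the k with the highest accuracy
```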
According to the graph, accuracy is the best at k = 7.
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1049
##
##
## | prediction
## actual | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 90 | 170 | 260 |
## | 0.346 | 0.654 | 0.248 |
## | 0.577 | 0.190 | |
## | 0.086 | 0.162 | |
## -------------|-----------|-----------|-----------|
## 1 | 66 | 723 | 789 |
## | 0.084 | 0.916 | 0.752 |
## | 0.423 | 0.810 | |
## | 0.063 | 0.689 | |
## -------------|-----------|-----------|-----------|
## Column Total | 156 | 893 | 1049 |
## | 0.149 | 0.851 | |
## -------------|-----------|-----------|-----------|
##
##
The overall accuracy of KNN is 77.4%.
The KNN model is the worst among the three models: logistic regression, classification tree and KNN.
In forecasting whether a movie has negative or positive profit, logistic regression has the best performance.
P/s: We could also try a Random Forest model, which would likely give the best result, but we do not cover it here to keep the project concise.
We use time series models in this section to determine the seasonality and trend in gross box office.
Before constructing the models, we examine the revenue distribution across quarters.
As the box plots show, movies released during the 2nd and 4th quarters tend to have better revenue.
The following graph shows the change in revenue by quarter from 1995 to 2010.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.55e+09 -6.35e+08 2.92e+08 0.00e+00 9.27e+08 9.67e+08
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1.05e+09 -4.67e+08 -8.91e+07 -3.50e+06 4.29e+08 1.99e+09 4
There is an increasing trend in gross box office. This is plausible: as economies grow and incomes rise, ticket prices increase, which yields higher revenue.
We can also see a seasonal pattern in revenue each year: more income is earned during the 2nd and 4th quarters, which matches the box plots.
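The trend and quarterly seasonal effects can be separated with base R's decompose(); below is a sketch on a simulated quarterly series (the report's actual series is quarterly gross box office).

```r
# Sketch: classical decomposition of a quarterly series into trend,
# seasonal and random components. The series is simulated with an upward
# trend and stronger Q2/Q4 quarters, mimicking the pattern described above.
set.seed(6)
revenue <- ts(100 + 2 * (1:64)                     # linear trend
              + rep(c(5, 20, -10, 15), 16)         # quarterly seasonal effects
              + rnorm(64, sd = 3),                 # noise
              start = c(1995, 1), frequency = 4)

dec <- decompose(revenue)
round(dec$figure, 1)  # estimated seasonal effect of each quarter
```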
This is the Holt-Winters prediction for the next 23 quarters.
We can see how the predictions fit the actual data in testing set.
The highcharter package gives a better visualization.
Holt-Winters gives good predictions, since the actual values always lie within the 95% confidence interval of the fitted values.
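A sketch of the Holt-Winters fit and 23-quarter forecast with base R, illustrated on the built-in quarterly UKgas series rather than the report's box-office series:

```r
# Sketch: fit Holt-Winters (level, trend and quarterly seasonality) and
# forecast 23 quarters ahead with 95% prediction intervals.
hw   <- HoltWinters(UKgas)
pred <- predict(hw, n.ahead = 23,
                prediction.interval = TRUE, level = 0.95)
head(pred)  # columns: fit, upr, lwr
```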
P/s: I did not have much time to study how to choose the orders (p, d, q) manually, so I use the auto.arima() function in this part; I plan to learn how to determine appropriate (p, d, q) orders for an ARIMA model during the winter break.
This is the prediction of ARIMA model.
Let’s see how the predictions fit the actual data.
Again, highcharter for better visualization (we try a new style with the second graph).
In ARIMA model, some actual values are out of 95% confidence interval.
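For reference, a seasonal ARIMA can also be fitted with base stats::arima once orders are chosen; the report instead uses forecast::auto.arima to pick them automatically. The orders below are illustrative, on the built-in quarterly UKgas series:

```r
# Sketch: seasonal ARIMA(0,1,1)(0,1,1)[4] (the "airline" model) with base R,
# then a 23-quarter forecast with approximate 95% intervals from predict().
fit  <- arima(UKgas, order = c(0, 1, 1),
              seasonal = list(order = c(0, 1, 1), period = 4))
pred <- predict(fit, n.ahead = 23)

upper <- pred$pred + 1.96 * pred$se  # approximate 95% upper bound
lower <- pred$pred - 1.96 * pred$se  # approximate 95% lower bound
length(pred$pred)
```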
Holt-Winters forecast accuracy:
 | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil's U |
---|---|---|---|---|---|---|---|---|
Training set | 1.09e+08 | 8.33e+08 | 6.51e+08 | -1.23 | 22.9 | 0.812 | -0.156 | NA |
Test set | -3.03e+08 | 1.32e+09 | 9.99e+08 | -13.25 | 22.2 | 1.245 | -0.231 | 0.35 |
ARIMA forecast accuracy:
 | ME | RMSE | MAE | MPE | MAPE | MASE | ACF1 | Theil's U |
---|---|---|---|---|---|---|---|---|
Training set | 2.25e+07 | 7.06e+08 | 5.23e+08 | -4.1 | 17.7 | 0.652 | -0.012 | NA |
Test set | -3.20e+08 | 1.31e+09 | 9.88e+08 | -13.7 | 22.2 | 1.232 | -0.237 | 0.352 |
ARIMA does better on the training set, but the performances of the two models on the testing set are not much different.
In my opinion, HoltWinters is better, as the actual values always fall within its 95% confidence interval. Furthermore, the gap between training and testing performance is smaller for HoltWinters. ARIMA may be overfitting: it fits the training set very well but fails to generalize to the unseen data in the testing set (low bias on the training set, but high variance on the testing set).
After analyzing the movie dataset and constructing different models, we can answer the SMART questions posed at the beginning and draw some conclusions:
In summary, we have some suggestions for new movie managers and directors. To achieve success in the movie industry, they should pay attention to the budget, popularity, score and genres of their movies. April, May and June would be the ideal periods to release them.
Comments
In this project, we tried many models to predict the success of a movie. We split the data into training and testing sets to evaluate the models, and found no overfitting or underfitting issues. The Random Forest also has a low RMSE compared to the median and mean of the actual values, which means the model performs very well.
However, there are a few things we could improve in this study. We should split the data into three sets: training, validation and testing. The recommended splitting technique is k-fold cross-validation, which ensures the data in the three sets are random.
The Random Forest model is powerful, but it can sometimes overfit. In R we mitigated overfitting only through hyperparameter tuning, which may be a limitation. Another way to address this drawback is ensemble learning with different, heterogeneous models to improve predictive power (a random forest ensembles similar, homogeneous models: decision trees). Some ensemble techniques in machine learning are bagging, averaging and stacking (the most powerful technique, in my opinion).
For the ARIMA model, we could learn how to select appropriate (p, d, q) orders ourselves instead of using the auto.arima() function; this would give better results and more insight.