Chapter 1. Introduction

In recent years, the movie industry has become a major source of entertainment in the modern world. Top studios such as Walt Disney, Universal, Paramount, Warner Bros and Fox have produced many successful films, earning impressive gross box office revenues and strong reputations. However, many companies also fail in the movie industry, which is a central concern for new managers and directors entering the field. Finding the factors that contribute most to a movie's success and predicting a movie's box office revenue therefore play a key role in the film industry. Knowing a movie's likely performance in advance allows managers to allocate appropriate resources, strategies and adjustments to promote the success of their products.

In this project, we analyze a movie dataset and construct different models to predict the revenue and profitability of a movie using variables such as budget, runtime, genres, vote and score (the latter two can be obtained from a test screening). In addition, we use time series analysis to examine the seasonality and trend of the gross box office.

Our goal is to answer the following S.M.A.R.T. questions:

  1. Which are the most important factors contributing to a movie’s success?
  2. Which is the best model to predict the gross box office of a movie?
  3. Is there any seasonal pattern in the total revenue of movies? Which period of the year grants the movie industry the highest box office?

Chapter 2. Preparation

2.1. Import data

Our dataset is sourced from Kaggle: https://www.kaggle.com/tmdb/tmdb-movie-metadata.

The data contains nearly 5,000 movie records with many attributes. Before importing the data into R, we preprocessed the raw data in Python to obtain a cleaner data frame, since several raw columns are JSON strings with many nested attributes. Attributes that are not necessary for our analysis were also excluded.
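The preprocessing itself was done in Python; purely as an illustration, an equivalent step in R with the jsonlite package might look like the sketch below (the file name and the choice to keep only the first genre name are our assumptions):

```r
# Hypothetical sketch: parse the JSON-valued `genres` column of the raw
# TMDB file and keep only the first genre name plus the genre count.
library(jsonlite)

raw <- read.csv("tmdb_5000_movies.csv", stringsAsFactors = FALSE)

parsed <- lapply(raw$genres, fromJSON)   # each entry: data frame with id/name
raw$Number_Genres <- sapply(parsed, function(p) if (length(p)) nrow(p) else 0)
raw$genres        <- sapply(parsed, function(p) if (length(p)) p$name[1] else "")
```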

Below is the structure of our imported data.

## 'data.frame':    4803 obs. of  12 variables:
##  $ X                   : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ budget              : int  237000000 300000000 245000000 250000000 260000000 258000000 260000000 280000000 250000000 250000000 ...
##  $ genres              : Factor w/ 21 levels "","Action","Adventure",..: 2 3 2 2 2 10 4 2 3 2 ...
##  $ popularity          : num  150.4 139.1 107.4 112.3 43.9 ...
##  $ production_companies: Factor w/ 1314 levels "","100 Bares",..: 615 1263 265 696 1263 265 1263 758 1267 320 ...
##  $ release_date        : Factor w/ 3281 levels "","1916-09-04",..: 2315 1945 3185 2688 2635 1940 2450 3111 2246 3234 ...
##  $ revenue             : num  2.79e+09 9.61e+08 8.81e+08 1.08e+09 2.84e+08 ...
##  $ runtime             : num  162 169 148 165 132 139 100 141 153 151 ...
##  $ title               : Factor w/ 4800 levels "(500) Days of Summer",..: 381 2653 3186 3614 1906 3198 3364 382 1587 444 ...
##  $ vote_average        : num  7.2 6.9 6.3 7.6 6.1 5.9 7.4 7.3 7.4 5.7 ...
##  $ vote_count          : int  11800 4500 4466 9106 2124 3576 3330 6767 5293 7004 ...
##  $ Number_Genres       : int  4 3 3 4 3 3 2 3 3 3 ...

2.2. Data Cleaning

In this step, we rename the columns and remove duplicates and missing values (there are few missing records, so removing them does not hurt our analysis). To evaluate the company variable, we divide the movie studios into six groups: Walt Disney, Warner Bros, Sony, Universal, Paramount, and Others, which contains the less famous studios. We also create new columns such as profit, profitable, season, quarter and year to support our research. All variables are converted to their correct formats (int, num, factor, date, and so on).

P.S.: The profitable column is a binary outcome (0, 1): if a movie has positive profit, profitable is 1; otherwise it is 0.
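As a sketch, the derived columns could be created as follows (the month-to-season mapping is our assumption; names match the structure shown below):

```r
# Derived columns (sketch): profit, profitable, date, year, quarter, season.
movie$profit     <- movie$revenue - movie$budget
movie$profitable <- factor(ifelse(movie$profit > 0, 1, 0))

movie$date    <- as.Date(movie$release_date)
movie$year    <- as.numeric(format(movie$date, "%Y"))
movie$quarter <- factor(quarters(movie$date))   # "Q1" .. "Q4"

# Assumed mapping: Dec-Feb Winter, Mar-May Spring, Jun-Aug Summer, Sep-Nov Fall.
month_to_season <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
                     "Summer", "Summer", "Fall", "Fall", "Fall", "Winter")
movie$season <- factor(month_to_season[as.numeric(format(movie$date, "%m"))],
                       levels = c("Spring", "Summer", "Fall", "Winter"))
```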

Below is the final structure of our dataframe.

## 'data.frame':    3225 obs. of  15 variables:
##  $ budget    : int  237000000 300000000 245000000 250000000 260000000 258000000 260000000 280000000 250000000 250000000 ...
##  $ genres    : Factor w/ 18 levels "Action","Adventure",..: 1 2 1 1 1 9 3 1 2 1 ...
##  $ popularity: num  150.4 139.1 107.4 112.3 43.9 ...
##  $ company   : Factor w/ 6 levels "Others","Paramount Pictures",..: 1 5 3 1 5 3 5 5 6 6 ...
##  $ date      : Date, format: "2009-12-10" "2007-05-19" ...
##  $ revenue   : num  2.79e+09 9.61e+08 8.81e+08 1.08e+09 2.84e+08 ...
##  $ runtime   : num  162 169 148 165 132 139 100 141 153 151 ...
##  $ title     : Factor w/ 3224 levels "(500) Days of Summer",..: 259 1761 2129 2420 1265 2139 2256 260 1053 310 ...
##  $ score     : num  7.2 6.9 6.3 7.6 6.1 5.9 7.4 7.3 7.4 5.7 ...
##  $ vote      : int  11800 4500 4466 9106 2124 3576 3330 6767 5293 7004 ...
##  $ profit    : num  2.55e+09 6.61e+08 6.36e+08 8.35e+08 2.41e+07 ...
##  $ profitable: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ season    : Factor w/ 4 levels "Spring","Summer",..: 4 1 3 2 1 1 3 1 2 1 ...
##  $ quarter   : Factor w/ 4 levels "Q1","Q2","Q3",..: 4 2 4 3 1 2 4 2 3 1 ...
##  $ year      : num  2009 2007 2015 2012 2012 ...

2.3. Data Summary

The summary of continuous variables:

            revenue     budget      popularity  runtime  score   vote   profit
Min.        5.00e+00    1.00e+00      0          41      2.30       1   -1.66e+08
1st Qu.     1.71e+07    1.05e+07     10          96      5.80     179    2.52e+05
Median      5.52e+07    2.50e+07     20         107      6.30     471    2.64e+07
Mean        1.21e+08    4.07e+07     29         111      6.31     978    8.07e+07
3rd Qu.     1.46e+08    5.50e+07     37         121      6.90    1148    9.75e+07
Max.        2.79e+09    3.80e+08    876         338      8.50   13752    2.55e+09

The variables differ greatly in scale. We need to scale the data to achieve accurate models, so all predictors will be standardized.

2.4. Data visualization

Before moving to the main analysis, we take a brief overview of the data.

Chapter 3. Revenue Prediction

In reality, film managers want to predict the success of a movie before its main release. The information they may have includes the budget, runtime, genres, production company, popularity, vote and score (vote and score can be obtained from a preview screening; popularity can be estimated after advertisements, trailers and leaks from the movie). We use these variables as the predictors.

This section presents different types of models to predict the revenue: Linear Regression, Decision Tree (Regression Tree) and Random Forest. For each model, we perform model evaluation to obtain the best formula (adjusted R-squared, BIC and Cp for Linear Regression; pruning for the Regression Tree; tuning for Random Forest). Finally, we compare the three best models and draw conclusions.

3.1. Dependency

Before constructing the models, we take a first glance at the relationships between revenue and the predictors.

Numerical Variables

Budget, popularity and vote seem to have high correlations with the revenue.

Categorical Variables

To examine the dependency of revenue on the categorical variables, we use ANOVA to test whether the mean revenue differs across genres, companies and seasons.

Overall, there is evidence that the distributions of revenue across genres are not the same; revenue appears to depend on genre.

The p-values of the ANOVA tests for genre, company and season are 2.584e-80, 1.126e-28 and 5.288e-10, respectively.

Since all p-values are smaller than the 0.05 level, there is evidence that the mean revenues across genres, companies and seasons are not the same. We can conclude that the overall effects of genre, company and season on revenue are statistically significant.
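As a sketch, the three tests can be run with base R's aov(), where movie denotes the cleaned data frame:

```r
# One-way ANOVA of revenue on each categorical predictor.
summary(aov(revenue ~ genres,  data = movie))
summary(aov(revenue ~ company, data = movie))
summary(aov(revenue ~ season,  data = movie))
```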

3.2. Linear model

We split our data into a training set and a testing set in a 67:33 ratio.
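A minimal sketch of the split (the seed and object names are illustrative):

```r
set.seed(1)
idx    <- sample(nrow(movie), size = floor(0.67 * nrow(movie)))
train1 <- movie[idx,  c("revenue", "budget", "popularity", "runtime", "score", "vote")]
test1  <- movie[-idx, names(train1)]
```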

Numerical Variables

In this part, we build a linear regression model with the numerical predictors.

Model Construction

We construct the model on the training set, using all numerical variables.

## 
## Call:
## lm(formula = revenue ~ ., data = train1)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -6.21e+08 -3.89e+07 -1.92e+06  2.46e+07  1.60e+09 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 122161921    2168480   56.34  < 2e-16 ***
## budget       82502895    2822244   29.23  < 2e-16 ***
## popularity   14588304    2952684    4.94  8.4e-07 ***
## runtime      -1265467    2415512   -0.52     0.60    
## score          212212    2648862    0.08     0.94    
## vote         85807055    3723449   23.05  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.01e+08 on 2170 degrees of freedom
## Multiple R-squared:  0.711,  Adjusted R-squared:  0.711 
## F-statistic: 1.07e+03 on 5 and 2170 DF,  p-value: <2e-16
##     budget popularity    runtime      score       vote 
##       1.68       2.15       1.26       1.50       2.95

As seen from the results, score and runtime are not statistically significant, so it seems we can exclude these variables from the model. We will perform model selection using adjusted R-squared, BIC and Cp to optimize the model.

Feature selection

All three criteria agree that the model with budget, popularity and vote is the best. This matches our observation above that runtime and score can be removed from the model.
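A sketch of such a search with the leaps package (assuming that is the tool used; the criteria match those named above):

```r
library(leaps)
subsets <- regsubsets(revenue ~ ., data = train1, nvmax = 5)
ss <- summary(subsets)
# One row per model size; pick the size that optimizes each criterion.
data.frame(adjR2 = ss$adjr2, BIC = ss$bic, Cp = ss$cp)
```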

Best Model

After feature selection, we build the new, optimized model.

## 
## Call:
## lm(formula = revenue ~ budget + popularity + vote, data = train1)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -6.21e+08 -3.85e+07 -2.19e+06  2.44e+07  1.60e+09 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.22e+08   2.17e+06   56.36  < 2e-16 ***
## budget      8.23e+07   2.62e+06   31.45  < 2e-16 ***
## popularity  1.46e+07   2.95e+06    4.96  7.8e-07 ***
## vote        8.57e+07   3.47e+06   24.72  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.01e+08 on 2172 degrees of freedom
## Multiple R-squared:  0.711,  Adjusted R-squared:  0.711 
## F-statistic: 1.78e+03 on 3 and 2172 DF,  p-value: <2e-16
##     budget popularity       vote 
##       1.45       2.15       2.55

All predictors are statistically significant since their p-values are smaller than the 0.05 level. The VIFs are smaller than 3, which means the predictors are at most moderately correlated; multicollinearity does not appear to be problematic. The adjusted R-squared is 0.711, the same as in the model with all numerical predictors, so roughly 71% of the variance in the response variable is explained by the predictors.

Evaluation Metrics

We run our linear model on the testing data and compute metrics such as RMSE and MAE for evaluation and comparison with other models.
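A sketch of the metric computation, where fit denotes the fitted linear model above:

```r
pred <- predict(fit, newdata = test1)
err  <- test1$revenue - pred
c(mae  = mean(abs(err)),
  mse  = mean(err^2),
  rmse = sqrt(mean(err^2)),
  mape = mean(abs(err / test1$revenue)) * 100)  # inflated by near-zero revenues
```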

       test      train
mae    6.25e+07  5.86e+07
mse    1.12e+16  1.02e+16
rmse   1.06e+08  1.01e+08
mape   1.06e+04  4.76e+03

The MAE and RMSE on the testing set are only slightly higher than those on the training set, so the model's performances on the two sets are not significantly different. The model performs a little better on the training set, which is expected since it was fitted on this set. Moreover, the RMSEs are smaller than the average revenue and the R-squared is 0.711. The model does not appear to overfit or underfit the data.

Categorical and Numerical Variables

In this part, we construct the model with all variables in the dataset.

Model Construction

## 
## Call:
## lm(formula = revenue ~ ., data = train1_full)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -6.08e+08 -4.03e+07 -1.43e+06  2.90e+07  1.62e+09 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               108110505    6817790   15.86  < 2e-16 ***
## budget                     77348867    3033574   25.50  < 2e-16 ***
## popularity                 13617667    2932282    4.64  3.6e-06 ***
## runtime                     5900196    2594312    2.27  0.02305 *  
## score                      -1602884    2751829   -0.58  0.56030    
## vote                       87028064    3716646   23.42  < 2e-16 ***
## genresAdventure            14998048    8669568    1.73  0.08378 .  
## genresAnimation            87790816   13236699    6.63  4.2e-11 ***
## genresComedy               25268186    7163941    3.53  0.00043 ***
## genresCrime               -10102172   11467478   -0.88  0.37845    
## genresDocumentary          44711638   23188936    1.93  0.05397 .  
## genresDrama                 4923647    7294556    0.67  0.49976    
## genresFamily               82467003   20391297    4.04  5.4e-05 ***
## genresFantasy               2870822   13660650    0.21  0.83357    
## genresHistory              15689738   24998838    0.63  0.53032    
## genresHorror               17619142   10531825    1.67  0.09448 .  
## genresMusic                23079584   29354497    0.79  0.43182    
## genresMystery               6103862   24024211    0.25  0.79946    
## genresRomance              17057626   14926772    1.14  0.25327    
## genresScience Fiction     -15440346   15106799   -1.02  0.30686    
## genresThriller             -7679779   12337527   -0.62  0.53370    
## genresWar                 -56563625   33719837   -1.68  0.09360 .  
## genresWestern              -3456414   24929422   -0.14  0.88974    
## companyParamount Pictures  19120256    8303147    2.30  0.02139 *  
## companySony Pictures        4555821    7970621    0.57  0.56767    
## companyUniversal Pictures  16743343    7455117    2.25  0.02481 *  
## companyWalt Disney         16706288    6373478    2.62  0.00882 ** 
## companyWarner Bros          -535044    8559594   -0.06  0.95016    
## seasonSummer                -822742    6192423   -0.13  0.89431    
## seasonFall                 -9929435    6117507   -1.62  0.10471    
## seasonWinter               -5891796    6282550   -0.94  0.34845    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 99300000 on 2145 degrees of freedom
## Multiple R-squared:  0.725,  Adjusted R-squared:  0.721 
## F-statistic:  188 on 30 and 2145 DF,  p-value: <2e-16

The p-values and t-values indicate no significant differences among seasons, and score is not statistically significant; season and score do not seem to be necessary predictors. The overall effects of genre and company are significant.

Feature Selection

When including season, genre and company in the model, the best numerical predictors are unchanged (budget, popularity and vote). The effects of the different seasons seem to be the same. The best formula for the linear model in this case is: revenue ~ budget + vote + company + genres + popularity.

Best Model

We build a new model with the best formula.

## 
## Call:
## lm(formula = revenue ~ budget + vote + company + genres + popularity, 
##     data = train1_full)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -6.14e+08 -4.04e+07 -8.25e+05  2.96e+07  1.62e+09 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               103783104    5493099   18.89  < 2e-16 ***
## budget                     79307464    2810718   28.22  < 2e-16 ***
## vote                       87343750    3432074   25.45  < 2e-16 ***
## companyParamount Pictures  20187352    8296723    2.43  0.01505 *  
## companySony Pictures        4706890    7968099    0.59  0.55477    
## companyUniversal Pictures  17875529    7436574    2.40  0.01631 *  
## companyWalt Disney         16364447    6369330    2.57  0.01026 *  
## companyWarner Bros          -135185    8558273   -0.02  0.98740    
## genresAdventure            14924242    8645045    1.73  0.08443 .  
## genresAnimation            80203108   12811282    6.26  4.6e-10 ***
## genresComedy               24197175    7152449    3.38  0.00073 ***
## genresCrime                -8821228   11302914   -0.78  0.43522    
## genresDocumentary          40500854   22990580    1.76  0.07827 .  
## genresDrama                 6560315    6935298    0.95  0.34429    
## genresFamily               78112299   20296238    3.85  0.00012 ***
## genresFantasy               1516100   13618042    0.11  0.91136    
## genresHistory              22335236   24701813    0.90  0.36599    
## genresHorror               15897705   10491528    1.52  0.12985    
## genresMusic                22199706   29264598    0.76  0.44818    
## genresMystery               3544411   24017226    0.15  0.88269    
## genresRomance              16669177   14880624    1.12  0.26276    
## genresScience Fiction     -15795028   15101070   -1.05  0.29570    
## genresThriller             -7928898   12337416   -0.64  0.52051    
## genresWar                 -55041648   33646312   -1.64  0.10201    
## genresWestern                726641   24731085    0.03  0.97656    
## popularity                 13447493    2930278    4.59  4.7e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 99400000 on 2150 degrees of freedom
## Multiple R-squared:  0.724,  Adjusted R-squared:  0.721 
## F-statistic:  225 on 25 and 2150 DF,  p-value: <2e-16

The adjusted R-squared is 0.721, slightly better than the model with numerical variables only (an increase of one percentage point).

Evaluation Metrics

       test      train
mae    6.19e+07  5.83e+07
mse    1.09e+16  9.76e+15
rmse   1.05e+08  9.88e+07
mape   7.33e+03  8.55e+03

This model shows a small improvement over the previous model with continuous variables only: the adjusted R-squared increases by one percentage point, and the RMSE and MAE on both the training and testing sets decrease slightly.

Comparison

In this part we compare the two linear models using AIC and BIC.
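A sketch, assuming the two fitted model objects are named fit_num and fit_full:

```r
AIC(fit_num, fit_full)   # fit_num: budget + vote + popularity
BIC(fit_num, fit_full)   # fit_full: adds company and genres
```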

Model 1: budget + vote + popularity

AIC BIC
86395 86424

Model 2: budget + vote + popularity + company + genres

AIC BIC
86343 86497

Model 2 has lower AIC than Model 1, which indicates that Model 2 is better for predicting the revenue.

Model 1 has a lower BIC than Model 2, which indicates that Model 1 is better as an explanatory model of revenue (BIC favors simpler models).

If we want an explanatory model, the linear model with continuous variables is better. However, in our case, since we need more predictive power, we would prefer the second linear model with both continuous and categorical predictors.

3.3. Regression Tree

A decision tree can handle both the numerical and categorical variables in the model.

Tree Construction

We can try two functions to build a regression tree model.

The tree() function:

The rpart() function:

Both functions give the same tree; rpart() produces the nicer plots.
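A sketch of the rpart() construction (the rpart.plot package is assumed for the nicer plot):

```r
library(rpart)
library(rpart.plot)
tree_fit <- rpart(revenue ~ ., data = train1_full, method = "anova")
rpart.plot(tree_fit)
```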

Pruned Tree

We perform pruning to optimize the tree. To improve the R-squared, we look for the complexity parameter (cp) with the lowest cross-validated relative error.

Here are the errors at each cp value.

cp index   rel. error
1          1.001
2          0.709
3          0.535
4          0.483
5          0.455
6          0.452
7          0.412
8          0.383
9          0.385

The cross-validated error reaches its minimum, 0.383, at the 8th cp value.

This is our pruned tree using that cp value.

The pruned tree drops the split on genres.
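A sketch of the pruning step: select the cp with the lowest cross-validated error from the cp table and prune to it.

```r
best_cp <- tree_fit$cptable[which.min(tree_fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(tree_fit, cp = best_cp)
rpart.plot(pruned)
```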

Metrics

       test      train
mae    6.59e+07  6.04e+07
mse    1.32e+16  1.06e+16
rmse   1.15e+08  1.03e+08
mape   1.36e+04  7.22e+03

The R-squared of the decision tree is 0.617, which is smaller than the adjusted R-squared of the linear model (0.711). The RMSEs and MAEs of the decision tree are also higher than those of the linear model. Overall, the decision tree is not as good as the linear model.

3.4. Random Forest

In this part, we try a model more powerful than a single decision tree: random forest. Random forest is an ensemble method that combines many trees to boost predictive power.

Random Forest Model

## 
## Call:
##  randomForest(formula = revenue ~ ., data = train1_full, ntree = 350) 
##                Type of random forest: regression
##                      Number of trees: 350
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 9.46e+15
##                     % Var explained: 73.2

The pseudo R-squared of the random forest is 0.732: 73.2% of the variance in the response variable is explained by the predictors. It is better than both the linear model and the decision tree.

Let’s look at how random forest works.

As the number of trees increases, the mean squared error (MSE) decreases. After a certain number of trees (around 100 in our case), the MSE no longer changes significantly. Random forest uses many trees to minimize the MSE and thereby improve predictive power.

Hyperparameter Tuning

We have 8 predictors, but only two variables are tried at each split in the random forest above. The number of variables tried at each split is controlled by the mtry parameter. Different values of mtry give different predictions, so this section finds the best mtry for our model using hyperparameter tuning, a method to improve the performance of random forest.

## mtry = 2  OOB error = 9.46e+15 
## Searching left ...
## mtry = 1     OOB error = 1.06e+16 
## -0.118 0.05 
## Searching right ...
## mtry = 4     OOB error = 9.33e+15 
## 0.014 0.05

Using the tuneRF() function, we can find the best mtry. According to the graph, the lowest OOB error is achieved at mtry = 4, so we build a model with this value.
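A sketch of the tuning call (argument values mirror the output above; the step factor and improvement threshold shown are the package defaults):

```r
library(randomForest)
set.seed(1)
tune_res <- tuneRF(x = train1_full[, names(train1_full) != "revenue"],
                   y = train1_full$revenue,
                   ntreeTry = 350, stepFactor = 2, improve = 0.05)
tune_res   # matrix of mtry values and their OOB errors
```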

Tuned RF model

## 
## Call:
##  randomForest(formula = revenue ~ ., data = train1_full, mtry = 4,      ntree = 350) 
##                Type of random forest: regression
##                      Number of trees: 350
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 9.17e+15
##                     % Var explained: 74

There is a slight improvement after tuning: the percentage of variance explained increases by about one percentage point. In our reading, hyperparameter tuning not only improves predictive power but can also mitigate overfitting in random forests; however, we see no sign of overfitting in this project, so that benefit is not visible here.

Metrics

       test      train
mae    5.51e+07  5.24e+07
mse    9.45e+15  9.17e+15
rmse   9.72e+07  9.57e+07
mape   5.80e+03  4.77e+03

There is no sign of overfitting in the random forest model. Its RMSEs and MAEs are smaller than those of both the linear model and the decision tree, which indicates that random forest performs best among the three models.

3.5. Conclusion

                             Linear Model  Regression Tree  Random Forest
R-squared (adjusted/pseudo)  0.721         0.711            0.741
MAE - train                  5.8e+07       6e+07            5.2e+07
MAE - test                   6.2e+07       6.6e+07          5.5e+07
RMSE - train                 9.9e+07       1e+08            9.6e+07
RMSE - test                  1e+08         1.1e+08          9.7e+07

The higher (pseudo) R-squared means that the random forest fits the data better than the linear model and the decision tree, and its lower RMSE and MAE indicate that its predictions are closer to the actual values.

In summary, random forest is the best of the three models for predicting revenue, while the decision tree has the worst performance.

Chapter 4. Profit Prediction

In this chapter, we experiment with one-hot encoding and apply Lasso and Ridge regression to predict the profit.

4.1. Correlation

We find that budget, popularity and vote have relatively strong correlations with profit, while runtime and score have moderate correlations. The correlations between profit and the other variables are similar to the correlation results for revenue.

For each company we create a new column named after that company and mark it 0 or 1 depending on whether the movie belongs to that company. We do the same for month and genre, so that individual columns can be dropped during feature selection later, as shown in the sketch below.
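A sketch of this one-hot encoding with model.matrix() (the column selection is illustrative):

```r
# Each factor level becomes its own 0/1 column, so feature selection can
# later keep or drop individual levels. Assumes month is stored as a factor.
dummies <- model.matrix(~ company + month + genres - 1, data = movie)
movie_r <- cbind(movie[, c("profit", "budget", "popularity",
                           "runtime", "score", "vote")],
                 as.data.frame(dummies))
```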

In general, month is not highly correlated with profit, but in the plots of the correlation coefficients (both Spearman and Pearson) we observe that month six (June) has a relatively positive effect on profit, whereas month nine (September) has a relatively negative effect.

From the two plots of correlation coefficients between companies and profit, Disney and Universal show relatively stronger correlations with profit. Judging by the magnitudes of the coefficients, the other four company groups are not strongly correlated with profit.

4.2. Linear Regression

Initial Model

  • The first model call is lm(formula = profit ~ ., data = movie_r).
  • From the results of this initial linear regression on profit, it is unlikely that we will observe a relationship between some of the predictors (such as the uncommon genres) and the response (profit).

Model Selection

##                           Abbreviation
## budget                               b
## popularity                           p
## runtime                              r
## vote                                 v
## genresAdventure                 gnrsAd
## genresAnimation                 gnrsAn
## genresComedy                    gnrsCm
## genresCrime                     gnrsCr
## genresDocumentary               gnrsDc
## genresDrama                     gnrsDr
## genresFamily                    gnrsFm
## genresFantasy                   gnrsFn
## genresHistory                   gnrsHs
## genresHorror                    gnrsHr
## genresMusic                     gnrsMs
## genresMystery                   gnrsMy
## genresRomance                       gR
## genresScience Fiction              gSF
## genresThriller                      gT
## genresWar                       gnrsWr
## genresWestern                   gnrsWs
## month2                              m2
## month3                              m3
## month4                              m4
## month5                              m5
## month6                              m6
## month7                              m7
## month8                              m8
## month9                              m9
## month10                            m10
## month11                            m11
## month12                            m12
## score                                s
## companyParamount Pictures          cPP
## companySony Pictures               cSP
## companyUniversal Pictures          cUP
## companyWalt Disney                  cD
## companyWarner Bros                  cB

According to the plots above, budget, vote, May, June, December, Animation and Family are the best predictors for constructing the model.

Other linear models

  • With an adjusted R-squared of 0.591, 59.1% of the variation in profit can be explained by the 7 features above, such as the movie budget and whether the movie is released in May, June or December. The fit of the new model is close to that of the original, so we can reduce the number of variables to 7.
  • With these 7 features, we achieve a model with a similar level of fit to the original linear regression model.

4.3. Ridge regression

  • The glmnet() function creates 201 models, one for each of our 201 λ values. The coefficients of each model are stored in the object named ridge.mod.
  • There are 38 coefficients per model. The 201 λ values range from 10^-10 to 10^10, essentially covering both the ordinary least squares model (λ = 0) and the null/constant model (λ approaching infinity).

  • We search for an optimal value of λ; the best λ turns out to be 0.032.

  • With λ = 0.032, the R-squared is 0.604, slightly better than our linear model. Since the best λ is close to 0, the original linear regression model does not appear to overfit.
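A sketch of the ridge fit with glmnet (alpha = 0; the lasso in 4.4 is the same call with alpha = 1):

```r
library(glmnet)
x    <- model.matrix(profit ~ ., data = movie_r)[, -1]  # drop intercept column
y    <- movie_r$profit
grid <- 10^seq(10, -10, length = 201)                   # the 201 lambda values
ridge.mod <- glmnet(x, y, alpha = 0, lambda = grid)
cv.out    <- cv.glmnet(x, y, alpha = 0)                 # cross-validate lambda
cv.out$lambda.min
```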

4.4. Lasso Regression

Here, the lowest MSE occurs at λ = 0.006, where the R-squared is 0.603. The lasso drops 10 of the less useful variables above, leaving 27 non-zero coefficients.

4.5. Random Forest

## 
## Call:
##  randomForest(x = movie_RF, y = movie$profit, ntree = 500) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 8.92e+15
##                     % Var explained: 64.4

The random forest model has the highest R-squared: 0.644 with 500 trees. Since month is not a significant factor and can be treated numerically, we did not convert month to a factor variable; the model is actually slightly better with month as a number (R-squared is 0.639 with month as a factor), and an integer month also makes further predictions easier.

It turns out that 500 trees are enough to build a reasonably good model.

To predict the profit (box office revenue minus budget) of an upcoming film, we take Frozen II. Using the variables we have, plus values borrowed from its predecessor Frozen for popularity, score and vote, the random forest regression estimates the profit of Frozen II at about 3.556e+08.

4.6. Polynomial Regression

With some polynomial terms, we achieve an adjusted R-squared of 0.653. Some of the polynomial variables, such as poly(score, 6), are significant. However, since the model requires around 60 features in total, a polynomial regression like this is not preferable.
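A sketch with illustrative polynomial degrees (the actual model used around 60 features):

```r
poly_fit <- lm(profit ~ poly(budget, 3) + poly(score, 6) + poly(vote, 3),
               data = movie_r)
summary(poly_fit)$adj.r.squared
```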

With the polynomial regression, the estimated profit of Frozen II is about 5.904e+08.

Chapter 5. PCA

Our goal in this chapter is to examine whether Principal Component Analysis (PCA) is effective for dimensionality reduction. In our data, we have 5 continuous variables for predicting revenue. We perform principal component regression directly and see how many components are sufficient for the model.

5.1. Variance

Non-centered data:

## Importance of components:
##                             PC1      PC2 PC3  PC4  PC5   PC6
## Standard deviation     1.89e+08 3.10e+07 926 23.9 20.1 0.699
## Proportion of Variance 9.74e-01 2.62e-02   0  0.0  0.0 0.000
## Cumulative Proportion  9.74e-01 1.00e+00   1  1.0  1.0 1.000

Centered data:

## Importance of components:
##                          PC1   PC2   PC3    PC4    PC5    PC6
## Standard deviation     1.768 1.108 0.894 0.6466 0.5045 0.4186
## Proportion of Variance 0.521 0.204 0.133 0.0697 0.0424 0.0292
## Cumulative Proportion  0.521 0.725 0.859 0.9284 0.9708 1.0000

We see significant differences between the variances and standard deviations of the components. The revenue and the predictors other than budget are measured in different units: revenue and budget are in the millions, far larger than the scales of vote, score, popularity and so on.

Therefore, we recommend scaling the data before constructing the model. We will see the differences between the non-centered and centered versions in the following parts.
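A sketch of the two runs with prcomp(), where movie_num holds the numeric columns (what we call the "centered" version is standardized, i.e. centered and scaled):

```r
pca_raw    <- prcomp(movie_num, center = FALSE, scale. = FALSE)  # non-centered
pca_scaled <- prcomp(movie_num, center = TRUE,  scale. = TRUE)   # centered/scaled
summary(pca_raw)
summary(pca_scaled)
```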

5.2. PVE

In the non-centered version, one component explains almost 100% of the variance, while in the centered version one component explains only about 52% of the variance and we need 3 components to exceed 80%. This is further evidence that we should scale the data, since PC1 overwhelms the other components in the non-centered version.

5.3. PCR Model

Model Construction

We construct the regression model using PCA directly (principal component regression). We plot the MSEP and R² for different numbers of components to see how many are sufficient to capture the variance in revenue.
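A sketch of the PCR fit with the pls package (validation = "CV" produces the cross-validated RMSEP reported later; scale = TRUE gives the centered version, scale = FALSE the non-centered one):

```r
library(pls)
pcr_fit <- pcr(revenue ~ budget + popularity + runtime + score + vote,
               data = train1, scale = TRUE, validation = "CV")
validationplot(pcr_fit, val.type = "MSEP")
validationplot(pcr_fit, val.type = "R2")
```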

Non-centered version:

Centered version:

Both versions agree that 2 components give the optimal MSEP and R-squared.

We can see the coefficients of different components.

(PC1 is at position 2, position 1 is the intercept)

According to the graph, after PC2 the change in the coefficients is small compared to the difference between the coefficients of PC1 and PC2. Hence, we can build the model with PC1 and PC2, since two components are sufficient to capture the variance.

Summary of PCR

Let’s see the summary of PCR model and compare the non-centered version to the centered version.

Non-centered version

## Data:    X dimension: 2176 5 
##  Y dimension: 2176 1
## Fit method: svdpc
## Number of components considered: 5
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)    1 comps    2 comps    3 comps    4 comps    5 comps
## CV        1.88e+08  122151898  108524196  107141642  102852443  102965230
## adjCV     1.88e+08  122077691  108448087  107128888  102792895  102854326
## 
## TRAINING: % variance explained
##          1 comps  2 comps  3 comps  4 comps  5 comps
## X          49.00    71.71    87.31    95.63   100.00
## revenue    57.87    66.61    67.94    70.47    71.13
Centered version

## Data:    X dimension: 2176 5 
##  Y dimension: 2176 1
## Fit method: svdpc
## Number of components considered: 5
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)    1 comps    2 comps   3 comps    4 comps    5 comps
## CV        1.88e+08  121720344  106321914  1.06e+08  103819563  102861273
## adjCV     1.88e+08  121665663  106238340  1.06e+08  103733078  102758138
## 
## TRAINING: % variance explained
##          1 comps  2 comps  3 comps  4 comps  5 comps
## X          48.50    71.69    87.41    95.61   100.00
## revenue    58.21    68.09    68.67    70.49    71.13

The scaled data explain more of the variance in revenue than the non-scaled data, indicating again that we should standardize the data for more accurate prediction.

Model Evaluation

We will evaluate our PCR model by predicting on the testing data.

There is a significant increase in explained variance from PC1 to PC2; after that, the change is not drastic. Two principal components are enough to capture the majority of the variance in the testing data as well. Our PCR model appears to perform properly on the testing data (both the training and testing sets require 2 components).

5.4. Linear Model

We can use the principal components as predictors in a linear model. We build models with two components for both the centered and non-centered versions.

Non-centered version

## 
## Call:
## lm(formula = revenue ~ PC1 + PC2, data = movie_pcr.nc)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -1.20e+09 -4.08e+07 -5.95e+06  2.28e+07  1.76e+09 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 121493614    2330120    52.1   <2e-16 ***
## PC1         -89813816    1463518   -61.4   <2e-16 ***
## PC2          51258272    2149532    23.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.09e+08 on 2173 degrees of freedom
## Multiple R-squared:  0.666,  Adjusted R-squared:  0.666 
## F-statistic: 2.17e+03 on 2 and 2173 DF,  p-value: <2e-16
## PC1 PC2 
##   1   1

Centered version

## 
## Call:
## lm(formula = revenue ~ PC1 + PC2, data = movie_pcr)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -1.11e+09 -4.03e+07 -4.95e+06  2.42e+07  1.73e+09 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 120012282    2277573    52.7   <2e-16 ***
## PC1         -92108152    1462914   -63.0   <2e-16 ***
## PC2          54901946    2115458    25.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.06e+08 on 2173 degrees of freedom
## Multiple R-squared:  0.681,  Adjusted R-squared:  0.681 
## F-statistic: 2.32e+03 on 2 and 2173 DF,  p-value: <2e-16
## PC1 PC2 
##   1   1

All VIFs are 1, a nice property of PCA: the components are orthogonal, so multicollinearity is eliminated. The scaled version gives a better R-squared than the non-scaled version, another indicator of why scaling the data is recommended. There is no big difference between the adjusted R-squared of the linear model using 2 components (0.681) and the linear model using all variables (0.711), which shows we can apply PCA without much loss of predictive power.

5.5. Conclusion

Compared to the linear model (with numerical variables) in the previous chapter, the adjusted R-squared decreases slightly, from 71.1% to 68.1%. However, instead of using 5 variables (budget + popularity + vote + score + runtime), we need only 2 (PC1 and PC2) while still capturing the majority of the variance. In conclusion, PCA lets us improve computational speed and reduce dimensionality without significantly hurting model performance.

Chapter 6. Profitability

In this chapter, we use different kinds of models to predict whether a movie earns a profit (revenue > budget). In our data, the profitable column represents this outcome: a movie with revenue > budget is labeled 1, otherwise 0.

6.1. Chi-squared test

Our preliminary chi-squared tests indicate that profitability depends on season, company and genre, as all p-values are smaller than 0.05.
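A sketch of the tests, where movie denotes the cleaned data frame:

```r
chisq.test(table(movie$profitable, movie$season))
chisq.test(table(movie$profitable, movie$company))
chisq.test(table(movie$profitable, movie$genres))
```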

6.2. Logistic Regression

Model Construction

Here is the summary of the logistic model.

## 
## Call:
## glm(formula = y ~ ., family = "binomial", data = train3)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -3.683   0.000   0.297   0.735   1.743  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -2.05e+00   5.35e-01   -3.84  0.00012 ***
## budget                    -1.75e-08   2.51e-09   -6.99  2.8e-12 ***
## popularity                 1.61e-02   1.26e-02    1.28  0.20159    
## runtime                    1.66e-03   3.28e-03    0.50  0.61421    
## score                      2.27e-01   8.46e-02    2.68  0.00739 ** 
## vote                       2.67e-03   4.49e-04    5.95  2.7e-09 ***
## genresAdventure            1.73e-01   2.51e-01    0.69  0.49015    
## genresAnimation            2.70e-01   3.97e-01    0.68  0.49646    
## genresComedy               4.31e-01   1.91e-01    2.26  0.02360 *  
## genresCrime                7.35e-02   2.96e-01    0.25  0.80411    
## genresDocumentary          4.66e-01   5.25e-01    0.89  0.37479    
## genresDrama                7.25e-02   1.91e-01    0.38  0.70424    
## genresFamily               2.31e-01   5.54e-01    0.42  0.67656    
## genresFantasy              2.28e-01   4.32e-01    0.53  0.59765    
## genresHistory              6.55e-01   7.13e-01    0.92  0.35873    
## genresHorror               7.92e-01   3.16e-01    2.51  0.01212 *  
## genresMusic                2.74e-01   7.09e-01    0.39  0.69916    
## genresMystery             -4.23e-01   6.33e-01   -0.67  0.50393    
## genresRomance              8.50e-01   4.26e-01    2.00  0.04590 *  
## genresScience Fiction      2.64e-01   5.05e-01    0.52  0.60148    
## genresThriller            -1.50e-01   3.29e-01   -0.46  0.64724    
## genresWar                 -1.36e+00   8.29e-01   -1.64  0.10019    
## genresWestern              2.27e+00   1.08e+00    2.10  0.03582 *  
## companyParamount Pictures  9.81e-01   2.43e-01    4.03  5.5e-05 ***
## companySony Pictures       6.20e-01   2.22e-01    2.79  0.00524 ** 
## companyUniversal Pictures  8.52e-01   2.34e-01    3.64  0.00027 ***
## companyWalt Disney         8.60e-01   1.84e-01    4.67  3.0e-06 ***
## companyWarner Bros         7.69e-01   2.45e-01    3.13  0.00172 ** 
## seasonSummer               3.90e-01   1.74e-01    2.24  0.02505 *  
## seasonFall                -4.65e-02   1.62e-01   -0.29  0.77361    
## seasonWinter               7.04e-02   1.69e-01    0.42  0.67734    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2409.2  on 2175  degrees of freedom
## Residual deviance: 1805.0  on 2145  degrees of freedom
## AIC: 1867
## 
## Number of Fisher Scoring iterations: 8

We can test the overall effect of genres, companies and seasons on the prediction using the Wald test.

The p-values for genres, company and season are 0.12, 8.4e-09 and 0.045, respectively. The overall effect of genres is not statistically significant; the effect of company is the most significant.
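A sketch of the joint test, assuming the aod package; the model object logit_fit and the coefficient indices (here the 17 genre terms) are illustrative:

```r
library(aod)
wald.test(b = coef(logit_fit), Sigma = vcov(logit_fit), Terms = 7:23)
```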

Feature Selection

We will optimize the model using the AIC criterion; the best candidate models are shown in the table below.
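One way to run such a search is stepwise selection with step(), sketched below; the exhaustive table that follows may have come from a dedicated model-selection package:

```r
best_logit <- step(logit_fit, direction = "both", trace = 0)
formula(best_logit)   # y ~ budget + score + vote + company + season
```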

budget  popularity  runtime  score  vote  genres  company  season  Criterion
TRUE    FALSE       FALSE    TRUE   TRUE  FALSE   TRUE     TRUE    1855
TRUE    TRUE        FALSE    TRUE   TRUE  FALSE   TRUE     TRUE    1856
TRUE    FALSE       TRUE     TRUE   TRUE  FALSE   TRUE     TRUE    1857
TRUE    TRUE        TRUE     TRUE   TRUE  FALSE   TRUE     TRUE    1858
TRUE    FALSE       FALSE    TRUE   TRUE  FALSE   TRUE     FALSE   1858

The best formula for the logistic model is: y ~ budget + score + vote + company + season. The feature selection excludes genres, which our Wald test showed to have no significant overall effect.

Best Logit Model

## 
## Call:
## glm(formula = y ~ budget + score + vote + company + season, family = "binomial", 
##     data = train3)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -3.733   0.000   0.306   0.763   1.770  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -1.44e+00   4.57e-01   -3.15  0.00163 ** 
## budget                    -1.83e-08   2.30e-09   -7.95  1.9e-15 ***
## score                      2.10e-01   7.18e-02    2.93  0.00340 ** 
## vote                       3.18e-03   2.37e-04   13.43  < 2e-16 ***
## companyParamount Pictures  9.64e-01   2.40e-01    4.02  5.7e-05 ***
## companySony Pictures       5.48e-01   2.19e-01    2.50  0.01237 *  
## companyUniversal Pictures  8.42e-01   2.30e-01    3.67  0.00024 ***
## companyWalt Disney         8.75e-01   1.80e-01    4.86  1.2e-06 ***
## companyWarner Bros         7.88e-01   2.41e-01    3.27  0.00109 ** 
## seasonSummer               3.84e-01   1.72e-01    2.24  0.02513 *  
## seasonFall                -6.93e-02   1.59e-01   -0.43  0.66375    
## seasonWinter               5.37e-02   1.66e-01    0.32  0.74708    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2409.2  on 2175  degrees of freedom
## Residual deviance: 1832.9  on 2164  degrees of freedom
## AIC: 1857
## 
## Number of Fisher Scoring iterations: 7

All the company groups differ significantly from the baseline (Others). Spring, Fall and Winter have similar effects, while Summer differs significantly from them.

According to this table, Summer has the highest odds ratio among the four seasons, which means the chance for a movie to earn a profit is higher in Summer than in the other seasons.

Among the movie studios, a movie that is not from the five biggest studios has the lowest chance of earning a profit, whereas a movie from Paramount Pictures seems to have the highest.

Another interesting point is the exponentiated coefficient of budget. The table shows it as 1, but it is actually about 0.99999999817. Increasing the budget reduces the chance of earning a profit, but the effect per dollar is negligible: budget would only matter if it increased by billions of dollars, which rarely happens in reality.

Model Evaluation

We can validate the model on the testing set with following methods.

Hosmer and Lemeshow test

The test does not reject the null hypothesis that the model fits the data; note that for the Hosmer and Lemeshow test, a p-value above 0.05 (not below) indicates a good fit. The model appears adequate.

ROC curve and AUC

The area under the curve is 0.849, which also shows that the model is good enough. This result agrees with the Hosmer and Lemeshow test.
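A sketch of the AUC computation, assuming the pROC package and prob as the predicted probabilities on the testing set:

```r
library(pROC)
roc_obj <- roc(response = test3$y, predictor = prob)
auc(roc_obj)
plot(roc_obj)
```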

McFadden

McFadden's pseudo R-squared is 0.239: about 23.9% of the variation in y is explained by the predictors in our model. Not great, but acceptable.
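A sketch, assuming the pscl package:

```r
library(pscl)
pR2(best_logit)["McFadden"]
```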

Confusion Matrix

Below is an example of the confusion matrix at a cut-off of 0.54 on the predicted probability.
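A sketch of how the matrix below can be produced, assuming gmodels::CrossTable (which prints the cell layout shown):

```r
library(gmodels)
prob <- predict(best_logit, newdata = test3, type = "response")
pred <- ifelse(prob > 0.54, 1, 0)
CrossTable(x = test3$y, y = pred, prop.chisq = FALSE,
           dnn = c("actual", "prediction"))
```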

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1049 
## 
##  
##              | prediction 
##       actual |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |       130 |       130 |       260 | 
##              |     0.500 |     0.500 |     0.248 | 
##              |     0.688 |     0.151 |           | 
##              |     0.124 |     0.124 |           | 
## -------------|-----------|-----------|-----------|
##            1 |        59 |       730 |       789 | 
##              |     0.075 |     0.925 |     0.752 | 
##              |     0.312 |     0.849 |           | 
##              |     0.056 |     0.696 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       189 |       860 |      1049 | 
##              |     0.180 |     0.820 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

Negative profit = 0 (revenue < budget); positive profit = 1 (revenue > budget).

The overall accuracy is 82%.

(rows: actual class)

  • Among movies that actually have negative profit, the logit model predicts correctly 50% of the time.
  • Among movies that actually have positive profit, the logit model predicts correctly 92.5% of the time.

(columns: predicted class)

  • When the logit model predicts negative profit, the prediction is correct 68.8% of the time and incorrect 31.2% of the time.
  • When the logit model predicts positive profit, the prediction is correct 84.9% of the time and incorrect 15.1% of the time.

Below are the accuracies and kappas at different cut-off thresholds on the predicted probability.

threshold  accuracy  kappa
0.5        81.9      0.425
0.6        80.7      0.486
0.7        77.4      0.476
0.8        69.0      0.379
0.9        59.2      0.280

Since we want to predict whether a movie has negative or positive profit and both classes matter, we use accuracy and kappa rather than recall or precision.

Accuracy is best at threshold 0.5, but kappa (which reflects inter-rater reliability) is best at threshold 0.6. In the range between 0.5 and 0.6, the differences in accuracy and kappa are not significant (all kappas represent moderate reliability and all accuracies are roughly 81-82%), so we can select any value in this range as the cut-off point.

6.3. Classification Tree

Model Construction

Pruned Tree

We can optimize our tree by pruning (in complex models, pruning helps reduce overfitting).

The relative error is lowest at the 4th cp value, with 8 splits; we use this cp to prune the tree.

Let’s see our tree after pruning.

Model Accuracy

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1049 
## 
##  
##              | prediction 
##       actual |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |        99 |       161 |       260 | 
##              |     0.381 |     0.619 |     0.248 | 
##              |     0.692 |     0.178 |           | 
##              |     0.094 |     0.153 |           | 
## -------------|-----------|-----------|-----------|
##            1 |        44 |       745 |       789 | 
##              |     0.056 |     0.944 |     0.752 | 
##              |     0.308 |     0.822 |           | 
##              |     0.042 |     0.710 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       143 |       906 |      1049 | 
##              |     0.136 |     0.864 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

The overall accuracy is 80.5%.

(rows: actual class)

  • Among movies that actually have negative profit, the tree model predicts correctly 38.1% of the time.
  • Among movies that actually have positive profit, the tree model predicts correctly 94.4% of the time.

(columns: predicted class)

  • When the tree model predicts negative profit, the prediction is correct 69.2% of the time and incorrect 30.8% of the time.
  • When the tree model predicts positive profit, the prediction is correct 82.2% of the time and incorrect 17.8% of the time.

In general, the predictions of the decision tree are not as good as those of the logistic model.

ROC curve and AUC

The area under the curve is 0.764; the classification tree does not seem to be a good model in this case.

6.4. KNN

This is the confusion matrix at k = 3.

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1049 
## 
##  
##              | prediction 
##       actual |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |        92 |       168 |       260 | 
##              |     0.354 |     0.646 |     0.248 | 
##              |     0.520 |     0.193 |           | 
##              |     0.088 |     0.160 |           | 
## -------------|-----------|-----------|-----------|
##            1 |        85 |       704 |       789 | 
##              |     0.108 |     0.892 |     0.752 | 
##              |     0.480 |     0.807 |           | 
##              |     0.081 |     0.671 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       177 |       872 |      1049 | 
##              |     0.169 |     0.831 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

We now search for the k that gives the best accuracy; a sketch of the search follows.
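A sketch with class::knn, where train_x/test_x are the standardized numeric predictors and train_y/test_y the profitable labels:

```r
library(class)
acc <- sapply(1:20, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred == test_y)
})
plot(1:20, acc, type = "b", xlab = "k", ylab = "accuracy")
```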

According to the graph, accuracy is the best at k = 7.

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  1049 
## 
##  
##              | prediction 
##       actual |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |        90 |       170 |       260 | 
##              |     0.346 |     0.654 |     0.248 | 
##              |     0.577 |     0.190 |           | 
##              |     0.086 |     0.162 |           | 
## -------------|-----------|-----------|-----------|
##            1 |        66 |       723 |       789 | 
##              |     0.084 |     0.916 |     0.752 | 
##              |     0.423 |     0.810 |           | 
##              |     0.063 |     0.689 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       156 |       893 |      1049 | 
##              |     0.149 |     0.851 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

The overall accuracy of KNN is 77.4%.

The KNN model is the worst among the three models: logit, classification tree and KNN.

6.5. Conclusion

In forecasting whether a movie has negative profit or positive profit, Logit Regression has the best performance.

P.S.: We could also try a random forest model, which would likely give the best result, but to keep the project at a reasonable length we do not cover it here.

Chapter 7. Time series

We use time series models in this section to determine the seasonality and trend in the gross box office.

7.1. Visualization

Before constructing the time series models, we look at the revenue distribution across quarters.

As can be seen from the box plots, a movie released during the 2nd quarter and 4th quarter tends to have better revenue.

The following graph shows the change in revenue by quarter from 1995 to 2010.

7.2. Decomposition

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -1.55e+09 -6.35e+08  2.92e+08  0.00e+00  9.27e+08  9.67e+08
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
## -1.05e+09 -4.67e+08 -8.91e+07 -3.50e+06  4.29e+08  1.99e+09         4

There is an increasing trend in the gross box office. This is reasonable: as economies develop and people become wealthier, ticket prices rise, which yields higher revenue.

We can also see a seasonal pattern in the revenue each year: more income is earned during the 2nd and 4th quarters, which matches the box plots above.
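A sketch of the series construction and decomposition (quarterly_revenue is the aggregated revenue per quarter; the start year is illustrative):

```r
rev_ts <- ts(quarterly_revenue, start = c(1995, 1), frequency = 4)
plot(decompose(rev_ts))   # observed, trend, seasonal and random components
```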

7.3. HoltWinters

This is the prediction of HoltWinters for the next 23 quarters.

We can see how the predictions fit the actual data in testing set.

Better visualization with highcharter

HoltWinters gives good predictions: the actual values always fall within the 95% prediction intervals of the forecasts.
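A sketch of the fit and forecast, where rev_train is the training portion of the quarterly series:

```r
hw_fit  <- HoltWinters(rev_train)
hw_pred <- predict(hw_fit, n.ahead = 23,
                   prediction.interval = TRUE, level = 0.95)
```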

7.4. ARIMA Model

P.S.: We did not have time to study the selection of the (p, d, q) orders in depth, so we use the auto.arima() function in this part; we plan to learn how to determine appropriate (p, d, q) values for an ARIMA model over the winter break.
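A sketch of the automatic fit, assuming the forecast package:

```r
library(forecast)
arima_fit  <- auto.arima(rev_train)
arima_pred <- forecast(arima_fit, h = 23, level = 95)
```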

This is the prediction of ARIMA model.

Let’s see how the predictions fit the actual data.

Highcharter gives a better visualization (we try something new with the second graph).

In the ARIMA model, some actual values fall outside the 95% confidence interval.

7.5. ARIMA vs HoltWinters

HoltWinters

              ME         RMSE      MAE       MPE     MAPE  MASE   ACF1    Theil's U
Training set  1.09e+08   8.33e+08  6.51e+08  -1.23   22.9  0.812  -0.156  NA
Test set      -3.03e+08  1.32e+09  9.99e+08  -13.25  22.2  1.245  -0.231  0.35

ARIMA

              ME         RMSE      MAE       MPE    MAPE  MASE   ACF1    Theil's U
Training set  2.25e+07   7.06e+08  5.23e+08  -4.1   17.7  0.652  -0.012  NA
Test set      -3.20e+08  1.31e+09  9.88e+08  -13.7  22.2  1.232  -0.237  0.352

ARIMA does better on the training set, but the performances of the two models on the testing set do not differ much.

In our opinion, HoltWinters is better, as the actual values always fall within its 95% confidence intervals. Furthermore, the gaps between training and testing performance are smaller for HoltWinters. ARIMA may be slightly overfitting: it fits the training set very well but generalizes less well to the unseen testing data (low bias on the training set, higher variance on the testing set).

Chapter 8. Conclusion

After analyzing the movie dataset and constructing different models, we answer the S.M.A.R.T. questions posed at the beginning and draw some conclusions:

  • According to our research, budget, popularity, vote, movie studios and genres are the most important factors contributing to a movie’s success.
  • Random Forest has the best performance in predicting the gross box office among the three models (Linear Regression, Decision Tree and Random Forest).
  • To predict whether a movie has positive or negative profit, Logistic Regression has better accuracy than KNN and Decision Tree. We did not try Random Forest in that section, but we expect it would beat Logistic Regression: Random Forest ensembles many decision trees to reduce the out-of-bag error and is also very powerful in dealing with high dimensionality.
  • We observe a seasonal pattern and an increasing trend in the revenue. Both the HoltWinters and ARIMA models predict future revenue well, but we prefer HoltWinters in this case since the actual values always fall within the 95% confidence intervals of its predictions.

In summary, we have some suggestions for new movie managers and directors. To achieve success in the movie industry, they should pay attention to the budget, popularity, score and genre of their movies, and April, May and June would be ideal periods to release them.

Comments

In this project, we tried many models to predict the success of a movie. We split the data into training and testing sets to evaluate the models and saw no overfitting or underfitting issues. The Random Forest also has a low RMSE compared to the median and mean of the actual values, which means the model performs well.

However, there are a few things we could improve in this study. We should split the data into three sets: training, validation and testing. A recommended splitting technique is k-fold cross-validation, which ensures the data in the sets are random.

Random Forest is powerful, but it can sometimes overfit. In this project we relied only on hyperparameter tuning to mitigate overfitting, which may be a limitation. Another way to address this is ensemble learning with different, heterogeneous models (random forest ensembles homogeneous models: decision trees). Some ensemble techniques in machine learning are bagging, averaging and stacking, the last of which we consider the most powerful.

For the ARIMA model, we could learn how to select appropriate (p, d, q) orders ourselves instead of using the auto.arima() function; this could give better results and more insight.