Introduction: Lights, Camera, Data!
Fasten your seatbelts and embark on our interactive data story. Through this article, you will see the world of cinema from a different perspective. Together, we will explore the factors that make a movie successful and try to answer the billion-dollar question: can we predict the success of a movie?
We will take you through the journey of the movie industry, from the early days of black-and-white silent films to the modern-day blockbusters. We will explore the trends that have shaped the industry, the genres that have dominated the box office, and the actors that have brought magic to the screen.
But first of all, let's introduce ourselves. We are a team of students with different backgrounds from EPFL's CS-401 course united by this project. We dived into the CMU Movie dataset wondering if we could predict the success of a movie. But let's start from the beginning.
What is Success?
For this story, we're looking at box office revenue as our measure of success. But money isn't everything. A movie can be loved by everyone who watched it or win tons of awards even if it didn't make a ton of money. By sticking with box office revenue, we're focusing on financial success, but it's important to keep in mind there's more than one way to measure a movie's impact.
Timing is (not) Everything
As you might expect, the release date has an impact on a film's performance, although a weaker one than intuition suggests. Plotting the data first, we find that summer is the best period to release a movie, and June in particular. The correlation between release month and revenue points in the same direction, but keep in mind that it is only 0.072, which suggests that release timing alone is not a strong predictor of revenue in our dataset.
The summer peak may be linked to the fact that people are more likely to go to the cinema at that time of year: the kids are on holiday and parents are looking for activities to do with them.
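For readers who want to reproduce this kind of check, here is a minimal sketch of a Pearson correlation between release month and revenue. It is not our exact pipeline: the helper name `pearson` and the toy numbers are made up for illustration, and none of the values come from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)


# Made-up (release month, revenue in million USD) pairs, NOT taken from the dataset.
toy_movies = [(1, 40), (3, 55), (6, 120), (6, 30), (7, 90), (11, 70), (12, 60)]
months = [month for month, _ in toy_movies]
revenues = [revenue for _, revenue in toy_movies]

r = pearson(months, revenues)
print(f"correlation between release month and revenue: {r:.3f}")
```

Note that the month number is cyclical (December sits next to January), so a linear correlation is only a rough first look; comparing average revenue per month or per season is a natural complement.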
The Genre Game: Trends that shape tastes
A wise man once said: "You can't please every audience, but the right genre pleases the right crowd". With that in mind, an important step for us was to identify the main genres, both in terms of number of movies and average revenue.
Different times have different preferences, so we also investigated how the main genres evolved over the years. Taking this evolution into account could help improve the prediction of a movie's success.
We found some pretty interesting results, both about the genres and about the movie industry in general. Period pieces dominated the 30s, and almost every genre had a peak in the 70s followed by a valley in the 80s. We can also see that some genres are indeed pulling ahead of the rest, led by fantasy. But not all is good, especially for indie, which is not doing so well.
Too short, too long, or just right?
The runtime of a movie is a key factor in its success. Too short and the audience might feel cheated, too long and they might get bored. We discovered that between 1920 and 1950, movies were usually slightly shorter, probably for technical reasons. Over time, movies got longer and longer before shrinking again and stabilizing around 105 minutes.
But what about the revenue? From 1920 to 1950, revenues from short and long movies were almost identical, but from 1950 to 2020, long movies performed a lot better.
Of course, you could say that this data does not represent the ground truth but only shows what happened in the past. And you are absolutely right! We are not saying that short movies will always earn less, simply that, on average, short movies have tended to earn less than long movies.
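The short-versus-long comparison can be sketched as a simple grouped average. This is only an illustration under stated assumptions: the function name, the 105-minute cutoff between "short" and "long", the choice to put 1950 in the later period, and the toy records are all hypothetical and are not the thresholds or data used in our analysis.

```python
from collections import defaultdict
from statistics import mean


def mean_revenue_by_group(movies, runtime_cutoff=105, year_cutoff=1950):
    """Average revenue per (period, length) group.

    movies: iterable of (release_year, runtime_minutes, revenue) tuples.
    Both cutoffs are illustrative assumptions.
    """
    groups = defaultdict(list)
    for year, runtime, revenue in movies:
        period = "before 1950" if year < year_cutoff else "1950 and later"
        length = "short" if runtime < runtime_cutoff else "long"
        groups[(period, length)].append(revenue)
    return {key: mean(values) for key, values in groups.items()}


# Made-up records (year, runtime in minutes, revenue in million USD).
toy_movies = [
    (1930, 80, 10.0),
    (1930, 120, 12.0),
    (1990, 90, 50.0),
    (1990, 150, 150.0),
    (2005, 140, 250.0),
]

averages = mean_revenue_by_group(toy_movies)
for group, value in sorted(averages.items()):
    print(group, value)
```

Because this only compares group averages, it says nothing about any individual movie, which is exactly the caveat raised above.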
Exploring the Impact of Diversity in Film
The stories movies tell and the faces that bring them to life have a profound impact on how audiences connect with films. In this section, we examine how diversity, in the form of the ethnic composition of the cast, plays a role in shaping box office success. From the settings where stories unfold to the inclusivity of the people who tell them, we investigate whether diversity drives profitability and resonates with global audiences.
The Rise of Ethnic Diversity in Film Revenue
Ethnic diversity is transforming the film industry. Movies with diverse casts are increasingly outperforming those with homogeneous ensembles. Looking at the graph below, you can see how the revenue of movies featuring many ethnicities has risen sharply in recent years. In the past, casts tended to include few ethnicities. Today, this trend is shifting, reflecting the growing importance of inclusive storytelling.
Diversity as a Driver of Box Office Success
Looking closely at the data, you can see that ethnic diversity plays a key role in a movie's success. In fact, we found that movies with a more diverse cast tend to perform better at the box office.
The Dominance of White-Centered Narratives
But is that really it? Looking more closely, it seems that movies with a white cast (English, British and Australian ethnicities) get higher revenues. This is not a surprise, since the movie industry is mainly dominated by white people.
Note that some ethnicities contain others (for example, English belongs to British). However, since we have no way of separating these "ethnic families", and dropping them would lose a lot of the information they can still provide, we simply keep them.
So, does this match what you expected?
Faces That Fill Seats
Up to now, we have put aside one huge factor that makes people go to the theater to see a movie: their favorite actors. As an actor, it is very important to have a personal approach when acting and to be able to connect with the audience. Building a persona that people like is key to making them come and see you again. In the analysis below, we tried to see which faces of the past century drew the crowds and led to financial success.
The Youth Advantage vs Timeless Talent
Looking into the data, films with younger casts seem to perform better on average, a potential trend that producers might want to explore further. Younger actors may connect better with certain target audiences, particularly the youngest ones. However, while this observation is interesting, it's important to approach it with caution, as usual. The smaller sample size for younger casts makes the estimate less stable, as the larger standard error shows, although, even then, the trend is still clear. In contrast, the larger number of samples for films featuring adult casts provides more robust statistics. In the end, we are only analyzing averages, so this does not mean that a movie with a younger cast will always perform better. It's a good place to remind you to be careful when analyzing data!
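To see why a smaller group gives a less stable average, here is a minimal sketch of the standard error of the mean, computed as the sample standard deviation divided by the square root of the sample size. The function name `mean_and_sem` and the two toy groups are made up for illustration and do not come from our data.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_sem(values):
    """Return (mean, standard error of the mean) for a sample of size >= 2."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate a standard error")
    return mean(values), stdev(values) / sqrt(n)


# Two made-up groups with the same spread pattern but different sizes.
small_group = [1, 3]
large_group = [1, 3, 1, 3, 1, 3, 1, 3]

small_mean, small_sem = mean_and_sem(small_group)
large_mean, large_sem = mean_and_sem(large_group)

print(f"small group: mean={small_mean}, SEM={small_sem:.3f}")
print(f"large group: mean={large_mean}, SEM={large_sem:.3f}")
```

Both groups have the same mean, but the smaller one has a much larger standard error, which is why the average for younger casts deserves more caution.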
Actors' Gender
Gender representation in film has always been a critical topic of discussion. Now, we explore whether the gender distribution of actors in a film has any significant influence on its box office revenue. We ask: Does the proportion of male and female actors in the cast impact how much revenue the film generates?
To start, we analyzed the percentage of female characters in a movie and plotted it against the box office revenue. The graph below helps visualize whether movies with higher percentages of female characters perform better or worse financially.
Moving deeper, we investigate whether movies with predominantly male, predominantly female, or balanced casts have significantly different box office revenues. The graph below compares the average revenue for each category. This provides insights into whether gender balance or dominance correlates with financial success.
Notice that movies with predominantly male casts tend to have nearly double the average box office revenue of movies with predominantly female or balanced casts. This likely reflects historical biases in the industry, where male actors often had larger roles and greater star power, leading to higher financial returns.
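As an illustration of how a cast can be sorted into these three categories, here is a minimal sketch based on the share of female actors. The 40% and 60% boundaries, the "F"/"M" gender codes, and the function names are assumptions chosen for the example, not the exact rules behind our graphs.

```python
def female_share(genders):
    """Fraction of female actors among cast members with a known gender."""
    known = [g for g in genders if g in ("F", "M")]
    if not known:
        return None  # no usable gender information for this cast
    return sum(g == "F" for g in known) / len(known)


def cast_category(genders, low=0.4, high=0.6):
    """Label a cast from its female share; low/high are illustrative cutoffs."""
    share = female_share(genders)
    if share is None:
        return "unknown"
    if share < low:
        return "predominantly male"
    if share > high:
        return "predominantly female"
    return "balanced"


# Made-up casts; None stands for a missing gender entry.
print(cast_category(["M", "M", "M", "F"]))   # female share 0.25
print(cast_category(["F", "M", None]))       # female share 0.50
print(cast_category(["F", "F", "F", "M"]))   # female share 0.75
```

Moving the cutoffs changes how many movies land in each category, so the comparison between categories should be read with the chosen boundaries in mind.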
Emotions Behind the Words
Until now, we have been looking at explicit and quantitative data. Although looking at the numbers is important, it was clear to us that there is more to a movie than just numbers. We wanted to dive deeper into the emotions behind the movies, and for that, we performed a sentiment analysis on their plot summaries.
The analysis has shown that the sentiment scores of movie summaries can provide fascinating insights into the tone and emotional appeal of the stories. We categorized movies into different box office revenue groups and plotted the distributions of their sentiment scores.
VADER Sentiment Analysis
Using VADER, we analyzed the compound sentiment score, along with positive, negative, and neutral components for each movie summary. The graphs below illustrate how these scores are distributed across various box office revenue groups, offering insights into the emotional tone and how it might correlate with financial success.
TextBlob Sentiment Analysis
With TextBlob, we further explored the sentiment of movie summaries by calculating polarity (positivity/negativity) and subjectivity (objectiveness/opinionatedness). The graphs below show how these scores are distributed for each box office revenue group, giving a deeper perspective on how sentiment differs between successful and less successful movies.
By combining these insights, we uncovered patterns that suggest the emotional tone of a movie summary could influence its appeal and ultimately its box office performance.
Untangling Dependencies
The movie industry is a complex ecosystem with many interdependencies. In this section, we explore the relationships between different factors that contribute to a movie's success. By analyzing the data, we aim to uncover hidden patterns and insights that can help us predict the box office revenue of a movie.
Taking runtime and genre as an example, we discovered that the average runtime per genre has evolved over the years. This is a very interesting result, since it shows that the "perfect" runtime is not the same for every movie: it can be influenced by the genre. This seems pretty logical when you think about it: you would not expect a comedy to last 3 hours.
The Formula for Success: From Data to Blockbuster
In this section, we look at how data can help anticipate a movie's success at the box office. Here's an exploration of the predictive models we used, their performance, and insights into their effectiveness.
Predictive Model Overview
- Logistic Regression: Classifies movies as successes or failures.
Our focus here is on analyzing the Logistic Regression model, as it provides both simplicity and interpretability for predicting box office success.
Logistic Regression Model Analysis
The Logistic Regression model was evaluated based on multiple performance metrics. Here are the key results and their implications:
- Confusion Matrix: The confusion matrix highlights the model's performance across four categories: True Positives, True Negatives, False Positives, and False Negatives. While the model correctly identifies many non-successes (True Negatives: 1230), it struggles with predicting successes (True Positives: 69). This indicates an imbalance in the data and a limitation in the model's feature set.
- Precision-Recall Curve: This curve visualizes the trade-off between precision and recall. As recall increases (capturing more successes), precision decreases (more false positives). This trade-off highlights the challenge of optimizing the model for both.
- ROC Curve: The ROC curve provides a measure of the model's discriminatory power, with an Area Under the Curve (AUC) of 0.81. This indicates that the model separates the two classes reasonably well. However, better feature engineering and additional data could enhance this performance.
These results show the need for further refinement, including rebalancing the dataset and exploring additional features, especially budget and marketing spend.
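To make the confusion-matrix and precision/recall discussion concrete, here is a minimal sketch that counts the four categories and derives precision, recall and accuracy from them. The function names and the ten toy labels are made up for illustration; they are not our model's predictions and do not reproduce the counts reported above.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 means 'success'."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, tn, fp, fn


def precision_recall_accuracy(tp, tn, fp, fn):
    """Derive the three metrics, returning 0.0 when a denominator is empty."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy


# Made-up imbalanced example: 3 successes among 10 movies.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
precision, recall, accuracy = precision_recall_accuracy(tp, tn, fp, fn)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

In this toy case accuracy looks decent because non-successes dominate, while recall on successes stays low. That is the same pattern an imbalanced dataset can produce, and the reason accuracy alone is not a sufficient metric here.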
Explore the Logistic Regression model's performance through the following interactive visualizations:
Epilogue
Great story, isn't it? We had a lot of fun exploring the data and trying to understand the movie industry. But many aspects of this analysis still need to be discussed.
Take casting, for example: over the years, it has carried the same biases as our society. Hence, we cannot say that films with predominantly male casts will continue to perform better, since it was (and still is) harder for women to land leading roles in big movies. Movies with high ethnic diversity perform better, but the ethnicities that drive this result are mainly European.
We also have to keep in mind that the data we used is not exhaustive. We only used the CMU dataset and we know that there are many other datasets that could have been used. We also know that the data is not perfect and that we had to clean it a lot before being able to use it.
Finally, we have to remember that the movie industry is an artistic industry and that an analysis of the past does not predict the future. Our last words are a reminder that data can be misleading and that we have to be careful when analyzing it.
And you, what films are you going to watch next?