Introduction

The project goal is to explore the relationship between actors’ traits — such as age, gender, ethnicity — and the character archetypes they portray in films. By analyzing casting patterns, this project aims to find out how specific actor profiles consistently coincide with archetypal roles like heroes, villains, mentors, or lovers. Our goal is to uncover whether certain traits predispose actors to particular roles and identify any underlying biases in casting decisions. This research also explores how these patterns vary across different film industries and across time. Ultimately, we aim to tell the story of how an actor’s characteristics shape their cinematic destiny, influencing not only their career trajectory but also how audiences perceive iconic characters on screen.

Dataset

We present our dataset with films, actors, characters, and archetypes!

<Figure size 1800x600 with 3 Axes>

The dataset consists of:

  • 87,000+ actor-film pairs
  • 165 nationalities known for ~26k unique actors
  • 2 genders known for ~26k unique actors
  • no/yes education for ~26k unique actors
  • 88 religions known for ~1500 unique actors
  • 12 races known for 4000 unique actors
  • 7000 Places of Birth known for ~25k unique actors
  • Number of professions known for ~26k unique actors
  • 12 Archetypes known for ~26k unique actors
  • Dates of birth known for 26k unique actors
  • Actor heights known for ~9k unique actors
  • Actor weight known for ~400 unique actors

Data collection

Our primary dataset is the CMU Movie Summary Corpus (CMUD), which contains actor features such as height, age, and Freebase ethnicity ID. However, these features provide limited information about the actors. Additional attributes like weight, race, hairstyle, and eye color are necessary for a more comprehensive analysis. To address this, we decided to enrich the dataset by parsing additional websites and utilizing other datasets.

  1. Enriching:

The initial CMUD includes links to movie and actor IDs from Freebase, which is no longer maintained. Fortunately, a Freebase data dump from 2017 is available here, containing all the data Freebase had at that time.

We processed this file to extract information about actors, such as education, number of professions, nationality, weight, religion, and place of birth.

It is also important to include the race of an actor, as it significantly affects their appearance. The initial CMUD only provides Freebase ethnicity IDs, so we reprocessed the Freebase dump to extract ethnicity information. We categorized ethnicities into 12 racial groups (a controversial issue): European, African, Indian, Asian, Pacific Islander, Middle Eastern, Latino, Indigenous/Native American, Arab, Caucasus, Caribbean, and Mixed.

Additionally, we found that the CMUD lacks a substantial number of movie summaries, which are essential for identifying character archetypes. To address this, we parsed data from IMDB and Wikipedia.

  2. Archetype analysis:

There is a paper that identifies archetypes in movies using statistical methods.
We also used a statistical method of a specific kind: large language models (LLMs). LLMs are known to perform well on few-shot tasks, but a few-shot setup requires a handful of labeled examples. We therefore created a small dataset of 200 samples, each containing a movie summary, a character name, and the character’s archetype.
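To make the few-shot setup concrete, here is a minimal sketch of how such a prompt could be assembled from labeled examples. The names `FewShotExample`, `build_prompt`, and the shortened `ARCHETYPES` list are hypothetical, the example summaries are toy text, and no model API is called; the actual prompts used in the project are not shown in this story.

```python
from dataclasses import dataclass

# Shortened, illustrative list; the full project list has 12 archetypes.
ARCHETYPES = [
    "Love Interest / Romantic Partner",
    "Mentor / Wise Guide",
    "Warrior / Vigilante",
    "Other",
]


@dataclass
class FewShotExample:
    """One labeled sample: movie summary, character name, archetype."""
    summary: str
    character: str
    archetype: str


def build_prompt(examples, summary, character):
    """Assemble a plain-text few-shot classification prompt."""
    lines = ["Choose exactly one archetype from this list:"]
    lines += [f"- {name}" for name in ARCHETYPES]
    for ex in examples:
        lines += [
            "",
            f"Summary: {ex.summary}",
            f"Character: {ex.character}",
            f"Archetype: {ex.archetype}",
        ]
    # The query repeats the example layout and leaves the answer empty.
    lines += ["", f"Summary: {summary}", f"Character: {character}", "Archetype:"]
    return "\n".join(lines)


examples = [
    FewShotExample(
        "A teenager learns karate from a handyman.",  # toy summary
        "Mr. Miyagi",
        "Mentor / Wise Guide",
    )
]
prompt = build_prompt(examples, "A retired hitman seeks revenge.", "John Wick")
print(prompt)
```

Keeping the query in the same layout as the examples means the model only has to complete the final `Archetype:` line.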

Question 1. What Are the Most Common Character Archetypes in Movies?

Selection of archetype categories

A common method for analyzing archetypes in related literature involves using the website TV Tropes, which hosts an extensive collection of archetypes curated by TV viewers. However, we found that many archetypes on the website are highly specific and apply to only a limited number of examples (e.g., A Chat with Satan, The Drunken Sailor, Mock Millionaire). Such specificity restricts the amount of available data, making correlations less noticeable and their insights less impactful. For our analysis, we aimed to capture a broader perspective. To achieve this, we devised a set of generalized and widely recognized archetypes. As a starting point, we used a list of archetypal tropes from TV Tropes. Building on this list, we consolidated several archetypes into more general categories, resulting in the following set of archetypes:

Devised archetypes

Archetype Description Example 1 Example 2 Example 3
Love Interest / Romantic Partner A character involved in a romantic relationship with a main figure, influencing emotional arcs and sometimes motivating heroic action. "Rose DeWitt Bukater" in "Titanic" "Elizabeth Swann" in "Pirates of the Caribbean" "Rachel Dawes" in "Batman Begins"
Caregiver / Healer Provides nurture, comfort, or medical/spiritual healing; supports others’ well-being and stability. "Dr. Ellen Ripley" in "Aliens" "Samwise Gamgee" in "The Lord of the Rings" "Molly Weasley" in "Harry Potter series"
Mentor / Wise Guide Provides knowledge, training, or insight to help the hero or others grow and succeed. "Ms. Honey" in "Matilda" "Mr. Miyagi" in "The Karate Kid" "Obi-Wan Kenobi" in "Star Wars: A New Hope"
Intellectual / Creative (Scholar/Artist/Inventor) Values knowledge, innovation, or art; provides crucial insights, cultural depth, problem-solving, or visionary ideas. "Steve Jobs" in "Steve Jobs" "Leonardo da Vinci" in "The Da Vinci Code" "Ada Lovelace" in "The Imitation Game"
Ruler / Politician Holds formal power or influence; shapes policies, alliances, and social orders, whether for good or ill. "President Franklin D. Roosevelt" in "Hyde Park on Hudson" "Selina Meyer" in "Veep" "Duncan Idaho" in "Dune"
Sidekick / Loyal Companion A supportive ally, assisting the hero, offering loyalty, encouragement, and sometimes comic or emotional relief. "Ron Weasley" in "Harry Potter and the Philosopher's Stone" "Dr. John Watson" in "Sherlock Holmes: A Game of Shadows" "Robin" in "Batman: The Movie"
Warrior / Vigilante Skilled in combat and physical confrontation; may enforce justice or defend others, sometimes outside legal boundaries. "Batman" in "The Dark Knight" "John Wick" in "John Wick" "The Bride" in "Kill Bill"
Rogue / Trickster / Con Artist A cunning, rule-bending manipulator who achieves goals through deception, charm, or clever schemes. "Frank Abagnale Jr." in "Catch Me If You Can (2002)" "Irving Rosenfeld" in "American Hustle (2013)" "Lawrence Jamieson" in "Dirty Rotten Scoundrels (1988)"
Mystic / Seer Offers spiritual guidance, foresight, or mystical understanding, often steering characters with visions or cryptic wisdom. "The Oracle" in "The Matrix" "Yoda" in "The Empire Strikes Back" "Galadriel" in "The Lord of the Rings: The Fellowship of the Ring"
Outsider / Loner Operates on the fringe of society, often misunderstood or self-isolated, bringing a unique perspective. "Edward Scissorhands" in "Edward Scissorhands" "Rick Deckard" in "Blade Runner" "Chuck Noland" in "Cast Away"
Innocent / Vulnerable Naive, inexperienced, or in need of guidance; prompts protection or mentoring, and highlights moral stakes. "Daniel LaRusso" in "The Karate Kid" "Charlie Bucket" in "Willy Wonka & the Chocolate Factory" "Newt" in "Aliens"
Other Doesn't align with any of the proposed archetype categories "George McFly" in "Back to the Future" "Victor" in "John Wick" "Mayor" in "Ghostbusters"

These categories encompass the majority of characters portrayed on screen and have minimal overlap. We decided to classify characters in the dataset into these categories using large language models. The specific task for the model was to select the most appropriate archetype from the list based on the movie summary. To ensure the quality of predictions, we compiled a set of examples for each analyzed archetype. After extensive testing and evaluation of different models with various prompts, we achieved an accuracy of 0.7 on the test set. Further manual analysis of the predictions revealed that 96% of the predictions were appropriate for the characters, even when they differed from the archetypes we had assigned. Using Gemini and GPT models, archetypes were inferred for over 80,000 characters from the CMU dataset. To test the soundness of the predictions, we analyze the distribution of archetypes in relation to movie genre.

The distribution of archetypes across different genres aligns closely with expectations based on the themes and structures typically associated with each genre. For instance, in Romance movies, the Love Interest archetype is most prominent. Action movies feature a higher proportion of Warriors than other genres. Documentaries frequently feature impactful characters such as Rulers, Mentors, and Intellectuals. Interestingly, short movies include characters that don’t fit into any archetype much more often than other genres. This is likely due to the limited runtime, which doesn’t allow characters to develop within the boundaries of any archetype.
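A sanity check of this kind can be sketched with a row-normalized cross-tabulation. The data below is a toy table and the column names `genre` and `archetype` are assumptions, not the project's actual schema.

```python
import pandas as pd

# Toy roles table; each row is one character with its movie's genre.
roles = pd.DataFrame({
    "genre": ["Romance", "Romance", "Romance", "Action", "Action", "Action"],
    "archetype": [
        "Love Interest", "Love Interest", "Mentor",
        "Warrior", "Warrior", "Love Interest",
    ],
})

# Share of each archetype within a genre (every row sums to 1).
share = pd.crosstab(roles["genre"], roles["archetype"], normalize="index")

# Most prominent archetype per genre.
top = share.idxmax(axis=1)

print(share.round(2))
print(top)
```

Normalizing by row rather than by column answers "which archetypes dominate a genre", which is the comparison described above.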

Question 2

Which Actor Traits Correspond to Specific Archetypes?
Which actor traits — such as age, gender, ethnicity, and other physical attributes — are typically associated with specific archetypes? For instance, are certain traits more frequently linked to roles like heroes, villains, or mentors? Investigating these correlations can reveal patterns in casting decisions.

This question concerns how actor features relate to specific archetypes, i.e., which feature values are most probable for each archetype.

We found that there are several common traits across most archetypes:

  • Race is predominantly European, Religion is Catholicism, Nationality is USA, and Place of Birth is New York City. Exceptions include the archetype Love Interest/Romantic Partner, where the Place of Birth is Mumbai, and Innocent/Vulnerable, where it is Los Angeles. This pattern likely reflects the fact that most films are produced in the United States.
  • Gender is Female only for the archetypes Innocent/Vulnerable, Caregiver/Healer, and Love Interest/Romantic Partner.

The figure below presents the behavior of averaged, scaled values for each archetype on a single plot. Notably, BMI and Weight exhibit similar trends. The oldest archetype is Ruler/Politician, while the tallest, heaviest, and highest in BMI is Warrior/Vigilante. In contrast, the shortest, youngest, lightest, and lowest in BMI is a single archetype: Innocent/Vulnerable.

Question 3

Do Casting Patterns Exhibit Biases Based on Actor Traits?
Do casting patterns exhibit biases based on actor traits like age, gender, or ethnicity? Are there noticeable trends in how certain demographics are cast in specific roles? Examining these patterns can shed light on potential biases within the casting industry.

A specific feature value affects an archetype if the probability of playing a character with that archetype differs significantly between actors with that feature value and actors in general, that is, if \(P(Archetype|Feature)\) differs significantly from \(P(Archetype)\).
The heatmap below shows \(P(Archetype|Feature) - P(Archetype)\). Blue marks negative values, orange marks positive values, and NSS marks differences that are Not Statistically Significant.
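As a minimal sketch of one heatmap cell, the code below computes \(P(Archetype|Feature) - P(Archetype)\) on toy data. The story does not say which significance test was used, so the chi-square test on a 2x2 table here is an assumption, as are the function name `probability_shift` and the column names.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data: 40 roles played by women, 60 by men.
roles = pd.DataFrame({
    "gender": ["F"] * 40 + ["M"] * 60,
    "archetype": (
        ["Love Interest"] * 20 + ["Warrior"] * 20
        + ["Love Interest"] * 10 + ["Warrior"] * 50
    ),
})


def probability_shift(df, feature, value, archetype, alpha=0.05):
    """Return P(archetype | feature == value) - P(archetype) and a significance flag."""
    has_value = df[feature] == value
    is_archetype = df["archetype"] == archetype
    marginal = is_archetype.mean()
    conditional = is_archetype[has_value].mean()
    # 2x2 table: feature value (yes/no) vs. archetype (yes/no).
    table = pd.crosstab(has_value, is_archetype)
    _, p_value, _, _ = chi2_contingency(table)
    return conditional - marginal, p_value < alpha


shift, significant = probability_shift(roles, "gender", "F", "Love Interest")
print(f"shift = {shift:+.2f}, significant = {significant}")
```

A cell flagged as not significant would be shown as NSS in the heatmap rather than colored.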

<Figure size 1000x1000 with 2 Axes>

Conditioning on Gender reveals a striking pattern: when the probability of playing a specific archetype increases for Female actors, it correspondingly decreases for Male actors, and vice versa. This asymmetry underscores how men and women are differently positioned within cinematic narratives, highlighting potential biases in how directors and screenwriters envision characters based on gender. The implications are profound, as the film industry not only reflects societal norms but actively shapes them. The figure illustrates this bias vividly, with the most significant differences observed in archetypes such as Love Interest / Romantic Partner (Female +13%, Male -7%), Caregiver / Healer (Female +7%, Male -4%), Rogue / Trickster / Con Artist (Female -7%, Male +4%), and Sidekick / Loyal Companion (Female -5%, Male +3%). Our dataset only covers films up to 2017, so there is hope that the situation is different now.

The heatmaps below represent the difference in distributions when conditioning on features closely related to Race: Race, Religion, Nationality, and Place of Birth.

<Figure size 2000x2000 with 8 Axes>
  • For Love Interest / Romantic Partner, conditioning on these features reveals a significant increase in probability for actors from India and the Middle East. This trend could reflect the influence of Turkish and Bollywood melodramas, which may have reshaped the distribution.

  • For Warrior / Vigilante, conditioning on the features shows a notable increase in probability for actors from Asia and India. This is particularly intriguing, as the intertwined histories of Asian and Indian cultures, dating back to ancient times, are reflected here. The observation becomes even more compelling as the probabilities for Asian and Indian actors exhibit the same directional trends for almost all other archetypes, underscoring a deeper cultural connection.

  • Another fascinating finding is that the probability of playing a Rogue/Trickster/Con Artist increases by 20% for actors born in San Francisco. It is difficult to explain why: Los Angeles and San Francisco are in the same state, and this website says that Los Angeles had higher crime rates in 2019. Maybe people from San Francisco just look more suspicious.

For the features Number of Professions and Education, the effect is small, which suggests that there is little or no bias related to an actor’s number of professions or education.

<Figure size 2000x1000 with 4 Axes>

The figures below show the values obtained by conditioning on numerical features such as Weight, Height, Age in Film, and BMI. The bins are chosen so that each bin has the same probability, i.e., \(\text{bin}_i\) represents \([\operatorname{quantile}(\frac{i}{n}), \operatorname{quantile}(\frac{i + 1}{n})]\), where \(n\) is the number of bins.
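Equal-probability bins of this kind can be produced with `pandas.qcut`. The sketch below uses synthetic heights and an arbitrary number of bins; both are assumptions for illustration, not the values used for the figures.

```python
import numpy as np
import pandas as pd

# Synthetic heights in meters; the real analysis uses actor data.
rng = np.random.default_rng(0)
heights = pd.Series(rng.normal(1.75, 0.08, size=1000))

n_bins = 5  # illustrative choice

# qcut places bin edges at the quantiles i / n_bins, i = 0..n_bins,
# so every bin receives the same number of observations.
binned, edges = pd.qcut(heights, q=n_bins, retbins=True)
counts = binned.value_counts()

print(edges.round(3))
print(counts.sort_index())
```

Because every bin holds the same share of actors, differences between bins are not driven by some bins being much smaller than others.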

<Figure size 2000x2000 with 8 Axes>

Here, we refer to the difference between probability and conditional probability simply as probability.

  • For Love Interest/Romantic Partner, the probability exhibits a consistent pattern across various features: it initially increases monotonically, reaches a peak, and then decreases monotonically. This indicates a clear bias among directors and screenwriters in defining the ideal archetype for this role. Based on the maxima, the “perfect” actor for this archetype would be 1.63m tall, weigh 57kg, and be 24 years old. This suggests a significant disadvantage for older, overweight, or taller actors, highlighting the film industry’s limited and unfair representation of love, which can take many forms.

  • For Caregiver/Healer and Innocent/Vulnerable, when conditioned on Height, the probability follows a similar pattern to Love Interest/Romantic Partner, though the peaks differ by about 3cm. An interesting observation is that for Age in Film, the probability increases monotonically for older actors. This trend also applies to archetypes like Mentor/Wise Guide and Ruler/Politician. This is logical, because those roles call for a lot of experience, and in the real world older people tend to have more of it. Or at least they know how to pretend that they do.

  • For Rogue/Trickster/Con Artist and Warrior/Vigilante, the probability increases with both Height and Weight. Notably, the probability of being cast as a Warrior/Vigilante jumps by nearly 50% for actors with a BMI of 29 or higher. This reflects the expectation that warriors should have a strong, muscular build 💪💪💪

  • Another intriguing finding is that the probability of being cast as a Ruler/Politician increases with height. As noted in this article, U.S. presidents tend to be taller than the average American citizen.

Question 4. How Do Casting Trends Vary Across Genres and Film Industries?

How do these casting trends vary across different genres and film industries, such as Hollywood compared to Bollywood? Are there differences in how actors are cast for similar archetypes in different cultural or geographic contexts? Comparing casting practices can highlight cultural influences on the film industry.

There are several large movie-producing countries, so we can restrict the comparison to them. One of the most contrasting comparisons is USA vs India vs Japan. Let’s first look at the overall archetype distributions for these countries.

Some of the most popular archetypes are Trickster, Romantic Partner, Mentor, and Caregiver. Let’s analyze them one by one.

How does the trickster image differ across cultures?

At first, let’s look at gender:

It seems there’s no significant difference between countries, though Japan has a slightly higher share of female con artists.

Let’s look at the height differences. Since mean height differs between countries, we subtract the mean actor height for the movie’s country. This way, we compare “taller than the mean” with “shorter than the mean”.
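The normalization step can be sketched as a group-wise mean subtraction. The toy table and the column names below are assumptions; `height_normalized` mirrors the column name that appears in the test code further down.

```python
import pandas as pd

# Toy roles table: one row per character, with the actor's height in meters.
roles = pd.DataFrame({
    "country": ["Japan", "Japan", "Japan", "India", "India", "India"],
    "archetype": ["Trickster", "Mentor", "Warrior", "Trickster", "Mentor", "Warrior"],
    "height": [1.80, 1.70, 1.75, 1.72, 1.70, 1.74],
})

# Mean height over all roles of the same country, aligned row by row.
country_mean = roles.groupby("country")["height"].transform("mean")
roles["height_normalized"] = roles["height"] - country_mean

# Keep only trickster roles for the cross-country comparison.
tricksters = roles[roles["archetype"] == "Trickster"].set_index("country")
print(tricksters["height_normalized"].round(3))
```

The mean is computed before filtering to tricksters, so a positive value means "taller than the average actor of that country", not "taller than the average trickster".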

<Figure size 1800x1000 with 4 Axes>

That’s an interesting finding! While Tricksters are taller than the average actor in every country, the effect is most noticeable in Japan. So, we can conclude that in Japanese movies Trickster characters exceed the average actor’s height by a larger margin than in India or the USA.

The difference between the USA and India is not statistically significant, and neither is the difference between Russia and the USA.

code
import scipy.stats as sps

# `filtered` is built earlier; it is assumed to hold trickster roles
# with the country-normalized actor height in `height_normalized`.
usa_heights = filtered[filtered.country == 'United States of America'].height_normalized

# Two-sample t-test: USA vs Japan.
stat = sps.ttest_ind(
    usa_heights,
    filtered[filtered.country == 'Japan'].height_normalized,
)
print(f"USA vs Japan statistic = {stat.statistic:.3f}, pvalue = {stat.pvalue:.4f}")
# Two-sample t-test: USA vs India.
stat = sps.ttest_ind(
    usa_heights,
    filtered[filtered.country == 'India'].height_normalized,
)
print(f"USA vs India statistic = {stat.statistic:.3f}, pvalue = {stat.pvalue:.4f}")
USA vs Japan statistic = -2.789, pvalue = 0.0053
USA vs India statistic = -1.914, pvalue = 0.0556

The prominence of taller actors playing trickster roles in Japan may be linked to cultural associations between height and status. In Japanese culture, height often symbolizes superiority, authority, or being “above” others, which denotes both physical tallness and higher stature in terms of value or rank. Tricksters in Japanese folklore, such as the Tengu or Kitsune, are often depicted as figures of power and cunning, blending mischief with a commanding presence. Height could amplify this perception of dominance and charisma, making taller actors more fitting for these roles. Additionally, societal admiration for height as “cool” may further elevate the appeal of such portrayals. This contrasts with other cultures, where tricksters are less tied to physical stature and more associated with wit or humor alone.

Are there any interesting differences in race?

Yes!

Again, to account for the different racial composition of each country’s actors, we compute the share of tricksters within every race, normalized by that race’s total presence among actors in the country’s movies.
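One way to express this normalization is the fraction of trickster roles among all roles played by actors of a given race in a given country. The sketch below uses toy data and assumed column names.

```python
import pandas as pd

# Toy roles table; each row is one role.
roles = pd.DataFrame({
    "country": ["USA", "USA", "USA", "USA", "Japan", "Japan"],
    "race": ["European", "European", "Asian", "Asian", "Asian", "Asian"],
    "archetype": ["Trickster", "Mentor", "Trickster", "Trickster", "Trickster", "Mentor"],
})

is_trickster = roles["archetype"] == "Trickster"

# Mean of a boolean column = share of trickster roles in each (country, race) cell.
share = (
    is_trickster
    .groupby([roles["country"], roles["race"]])
    .mean()
    .unstack("race")
)
print(share)
```

Cells with no roles at all come out as NaN, which is how a missing race for a country shows up in a table like this.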

<Figure size 1500x1400 with 4 Axes>

The interesting observation here is not what is present but what is missing. In Japan there were no entries for Caucasus, Indian, Native American, or Pacific Islander actors, and there were no Latino entries for India.

Also, the UK stands out by casting Arab and Caucasus actors as tricksters relatively more often than other countries do.

Let’s analyze a different archetype.

How does “Love Interest / Romantic Partner” vary?

code
df_filtered['male_indicator'] = (df_filtered.gender == "Male").map(int)
mean_by_country = df[(~df.gender.isna())].explode('countries').copy()
mean_by_country['male_indicator'] = (mean_by_country.gender == "Male").map(int)
mean_by_country = mean_by_country.groupby('countries').male_indicator.mean()

stat = sps.ttest_ind(
    df_filtered[df_filtered.country == 'United States of America'].male_indicator - mean_by_country['United States of America'],
    df_filtered[df_filtered.country == 'India'].male_indicator - mean_by_country['India'],
)
print(f"USA vs India statistic = {stat.statistic:.3f}, pvalue = {stat.pvalue:.4f}")
stat = sps.ttest_ind(
    df_filtered[df_filtered.country == 'United States of America'].male_indicator - mean_by_country['United States of America'],
    df_filtered[df_filtered.country == 'Japan'].male_indicator - mean_by_country['Japan'],
)
print(f"USA vs Japan statistic = {stat.statistic:.3f}, pvalue = {stat.pvalue:.4f}")
USA vs India statistic = -6.816, pvalue = 0.0000
USA vs Japan statistic = -1.322, pvalue = 0.1861

Here, women prevail, and India has a statistically significantly higher proportion of men. The t-test accounts for the baseline male/female distribution in each country. That’s quite a funny fact.

General Observations

Now, let’s move on from cherry-picks to general observations

On the heatmap normalized for each country, we see that India features a lot of Love Interest roles. At the same time, the Netherlands managed to confuse our archetype inference pipeline the most. Hong Kong loves featuring Warriors.

Non-significant values are turned to 0

<Figure size 1500x1500 with 1 Axes>

If we look at the Other archetype in the Netherlands, we can see that most such cases are silent movies. Indeed, it’s hard to analyze character archetypes in such movies.

genres
Silent     161
Drama      150
Classic     62
Comedy      60
Crime       28
Name: count, dtype: int64

Returning to the general picture, it’s interesting to know whether, for example, actors cast as “Love Interest” differ between Romance and Comedy movies.

<Figure size 1500x1500 with 1 Axes>

We can see that in comedy movies the love interest is more likely to be female than in romance movies.

Conclusion

Indeed, we can see that casting trends vary across different genres and film industries, showing cultural and genre influence on casting choices. Some of the interesting cherry-picks are:

  • In Japanese movies, actors playing trickster characters exceed the average actor’s height by more than in India or the USA
  • In Japan there were no entries for Caucasus, Indian, Native American, or Pacific Islander actors playing tricksters. Also, there were no Latino entries for India
  • The UK differs from other countries by casting Arab and Caucasus actors as tricksters relatively more frequently
  • India has a higher proportion of men playing “Love Interest” roles than other countries
  • In comedy movies, the love interest is more likely to be played by a woman than in romance movies

Question 5. How Did Casting Trends For Different Archetypes Vary Across Time?

This section is dedicated to analyzing the impact of casting on Archetypes.

Under the “casting process”, we consider several Actor parameters, the genre factor, and broader development indicators, e.g., Box Office Revenue. In particular, we analyze the following main sub-topics:

  • We study how the Archetype changes for each Actor over time.
  • We investigate which Archetypes are more popular over time.
  • We identify which genres are more suitable for a specific Archetype.
  • We show which Archetypes are likely to be presented in movies with huge Box Office performance.
  • Finally, we present how Actors’ height and age affect the dominating Archetype in these cases and provide insights into which Actors are more suitable for a given Archetype.

Archetypes & Actors

We provide results on which Archetypes Actors prefer throughout their career paths. Some famous actors may prefer different roles and change them from time to time, while others may stick to the same Archetype roles and become the top pick for them (as we will see in the last section, especially).

<Figure size 3600x1800 with 1 Axes>
<Figure size 3600x1800 with 1 Axes>
<Figure size 3600x1800 with 1 Axes>
<Figure size 3600x1800 with 1 Axes>
<Figure size 3600x1800 with 1 Axes>

E.g., for some of the ‘‘canonical’’ examples of Actors, we may observe that they prefer different roles over time.

<Figure size 3600x1800 with 1 Axes>
<Figure size 3600x1800 with 1 Axes>

Archetype Prevalence Over Time

Let us see how movie Archetypes change over the years.

<Figure size 3600x2400 with 1 Axes>
  • What is obvious: the number of movies has increased, and thus the number of Archetypes has also increased.
  • What is less obvious: we observe the prevalence of the same 3 main Archetypes (Rogue/Trickster/Con Artist, Love Interest, Sidekick) consistently since the 1930s.

Archetypes & Genres

<Figure size 4200x3000 with 2 Axes>

We present the co-occurrence between the top 10 popular genres and Archetypes. For ‘Drama,’ the main genre, we observe that drama movies include characters representing all Archetypes. However, movies in less popular genres, such as ‘Horror,’ tend to feature many evil Archetypes. We note that for broad and popular movie genres, it is unlikely to find an Archetype that does not suit them.

Archetypes & Box Office

In this section, we consider:

  • Box Office Performance by Archetype
  • Box Office Performance by Genres
<Figure size 3600x2400 with 1 Axes>
<Figure size 3600x2400 with 1 Axes>

It is more likely that ‘Warrior/Vigilante’, ‘Sidekick’, or ‘Politician’ Archetypes will be featured in successful box office movies. However, movies in the comedy genre tend to be less successful compared to the other top four genres. Additionally, we observe a significant gap between movies that incorporate ‘Thriller,’ ‘Documentary,’ or ‘Epic/Disaster’ (our top two genres) and others.

Archetypes & their best bearers

In this section, we study some of the Actor parameters that make an Actor a good fit for a role. We are interested in the following Actor features:

  • height
  • age at the moment of movie’s release

Additionally, we present examples of Actors who frequently play specific Archetypes.

<Figure size 3000x2400 with 1 Axes>

Here it is natural. But what about age?

We see that the distributions for sufficiently old Actors look similar.

<Figure size 3600x2400 with 1 Axes>
<Figure size 3600x2400 with 1 Axes>

However, shifts may be observed between older and younger Actors, e.g., a young person is unlikely to be cast to play a ‘Wise Guide’ character compared to an older one.

<Figure size 3600x1800 with 1 Axes>

Here we show the top-tier Archetypes among Actors who are older than 50 and younger than 35. By Percentage, we mean the proportion of Actors in these Archetypes, expressed as a percentage.

<Figure size 3600x1800 with 1 Axes>

At the same time, we do not observe the same difference between Actors over 95 and 100 years old as we do with Actors over 50. Which is interesting!

Finally, let us discuss which Archetypes Actors prefer more and which Actors are well-suited for specific Archetypes.

<Figure size 1000x800 with 1 Axes>
<Figure size 1000x800 with 1 Axes>
<Figure size 1500x3000 with 12 Axes>

Question 6. Are Certain Actors More Likely to Be Typecast into Specific Archetypes?

Part 1. Actors typecast into specific archetypes

To answer this question, let’s first sort the actors by the number of roles (characters) they have played.

       actor_name        unique_character_count
12536  John Wayne        129
16022  Mammootty         127
18105  Mohanlal          119
976    Amitabh Bachchan  111
17255  Mel Blanc         97

Now let’s check how many actors have 10 or more roles played

Now let’s normalize the counts:

Now we can visualize the archetype distribution for each actor using a bar plot

code
plot_archetype_distribution('A. J. Cook', archetype_distribution)
<Figure size 1000x600 with 1 Axes>
code
plot_archetype_distribution('John Wayne', archetype_distribution)
<Figure size 1000x600 with 1 Axes>

Looking at these plots, we see that some actors appear more likely than others to be typecast into specific archetypes. Now let’s apply some statistical approaches and look at the results. For each actor, let’s define a metric of the “probability of being typecast into a specific (set of) archetypes” as the highest probability among the archetypes in the previous charts. Then let’s look at the distribution of this value across actors and visualize it.
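The metric can be sketched as follows: count roles per actor and archetype, keep actors with at least 10 roles, normalize each row, and take the row maximum. The toy data, the column names, and the exact layout of `archetype_distribution` are assumptions made for illustration.

```python
import pandas as pd

# Toy roles table: actor A and B have 10 roles each, actor C only 3.
roles = pd.DataFrame({
    "actor_name": ["A"] * 10 + ["B"] * 10 + ["C"] * 3,
    "archetype": (
        ["Warrior"] * 6 + ["Mentor"] * 4
        + ["Warrior"] * 5 + ["Mentor"] * 3 + ["Other"] * 2
        + ["Other"] * 3
    ),
})

min_roles = 10  # threshold mentioned in the text

# Role counts per actor and archetype.
counts = pd.crosstab(roles["actor_name"], roles["archetype"])
counts = counts[counts.sum(axis=1) >= min_roles]

# Normalize each actor's row into a probability distribution.
archetype_distribution = counts.div(counts.sum(axis=1), axis=0)

# Typecasting metric: share of the actor's most frequent archetype.
max_probability = archetype_distribution.max(axis=1)
print(max_probability)
```

The role-count filter matters because an actor with two or three roles would trivially get a high maximum share.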

   actor_name     max_probability
0  A. J. Cook     0.200000
1  A.K Hangal     0.411765
2  Aamir Khan     0.360000
3  Aaron Eckhart  0.315789
4  Aaron Johnson  0.272727

Let’s analyze this distribution. First, let’s look at the descriptive statistics

code
max_probabilities.describe()
max_probability
count 1744.000000
mean 0.385056
std 0.129078
min 0.150000
25% 0.294118
50% 0.357143
75% 0.454545
max 1.000000

We can see that the mean is 0.39, meaning that on average an actor’s most frequent archetype accounts for about 4 out of every 10 of their roles. However, this alone does not give us a clear understanding, so let’s also perform the Chi-Square Test:

code
from scipy.stats import chi2_contingency

max_probabilities['binned'] = pd.cut(max_probabilities['max_probability'], bins=[0, 0.33, 0.66, 1], labels=['Low', 'Medium', 'High'])
contingency_table = pd.crosstab(max_probabilities['actor_name'], max_probabilities['binned'])


chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)
print(f'Chi-square Statistic: {chi2_stat}, p-value: {p_value}, Degrees of Freedom: {dof}')
Chi-square Statistic: 3488.0000000000005, p-value: 0.48726319957658, Degrees of Freedom: 3486

Given the high p-value (0.487 > 0.05), the test finds no significant association (in this mathematical abstraction). In other words, the distribution of maximum probabilities categorized as ‘Low’, ‘Medium’, and ‘High’ does not significantly depend on the actor. Note, however, that this contingency table has one row per actor with a single observation in it, so the statistic equals \(n(k-1) = 1744 \times 2 = 3488\) by construction and the result should be read with caution. One possible refinement is to define the metric not as the maximum probability of a single archetype but as the sum of the 3 highest archetype probabilities, which we expect to give a positive answer to this question. Let us check the contingency table and then redefine max_probability. The contingency table shows how the categories are distributed among actors.
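The redefined metric, the sum of an actor's three largest archetype shares, can be sketched as below. The toy distribution table and the archetype column labels are assumptions; `sum_top_3_probabilities` mirrors the column name in the output that follows.

```python
import numpy as np
import pandas as pd

# Toy per-actor archetype distributions (each row sums to 1).
archetype_distribution = pd.DataFrame(
    [[0.4, 0.3, 0.2, 0.1],
     [0.25, 0.25, 0.25, 0.25]],
    index=["A", "B"],
    columns=["Warrior", "Mentor", "Sidekick", "Other"],
)

# Sort each row ascending and add up the last three entries.
sorted_values = np.sort(archetype_distribution.to_numpy(), axis=1)
sum_top_3_probabilities = pd.Series(
    sorted_values[:, -3:].sum(axis=1),
    index=archetype_distribution.index,
    name="sum_top_3_probabilities",
)
print(sum_top_3_probabilities)
```

With 12 archetypes, an actor whose roles were spread evenly would score 3/12 = 0.25, which gives a rough baseline for reading the values.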

<Figure size 1200x800 with 2 Axes>
code
top_3_sum_probabilities.describe()
sum_top_3_probabilities
count 1744.000000
mean 0.731918
std 0.116027
min 0.400000
25% 0.647059
50% 0.727273
75% 0.812500
max 1.000000
<Figure size 1200x800 with 2 Axes>
Chi-square Statistic: 1744.0, p-value: 0.4887414065286344, Degrees of Freedom: 1743

Even though this looks better, we should still check the statistical criterion:

For visual insight let’s look at the Contingency Table with actors and the Max Probabilities:

Conclusion

Given this definition of typecasting (based on max_probabilities), the Chi-square test didn’t provide a positive answer to our question. The answer depends on the parameters considered, so we proceed to Part 2 to analyze the insights these parameters bring.
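One caveat about the test above is worth checking numerically. The crosstab has one row per actor, and each actor falls into exactly one bin, so every row contains a single 1. For a table of that shape the chi-square statistic works out to \(n(k-1)\), where \(n\) is the number of actors and \(k\) the number of non-empty bins, whatever the data look like. This matches the reported values 3488 = 1744 × 2 and 1744 = 1744 × 1. The sketch below reproduces the effect on random toy labels; all names and data are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n_actors = 200

# One row per actor, each assigned to a bin completely at random.
toy = pd.DataFrame({
    "actor_name": [f"actor_{i}" for i in range(n_actors)],
    "binned": rng.choice(["Low", "Medium", "High"], size=n_actors),
})

# Same construction as in the test above: actors x bins, one count per row.
table = pd.crosstab(toy["actor_name"], toy["binned"])
chi2_stat, p_value, dof, _ = chi2_contingency(table)

k = table.shape[1]  # number of non-empty bins
print(f"chi2 = {chi2_stat:.1f}, n * (k - 1) = {n_actors * (k - 1)}, dof = {dof}")
```

Since random labels give the same statistic, this setup cannot separate typecast actors from others. A test built on the actor-by-archetype role counts, where each role is an observation, would be one alternative; that is a suggestion and not something this analysis ran.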

Part 2. Analyzing the typecasting of actors with specific traits to the archetypes

Gender

Let us create a separate dataframe to analyze the distribution of actor_gender by each archetype with significant occurrences.

actor_gender                        F      M  total
archetype
Caregiver / Healer               4705   2041   6746
Innocent / Vulnerable            2587   1434   4021
Mentor / Wise Guide              1644   6721   8365
Mystic / Seer                     849   1113   1962
Other                            3567   4442   8009
Outsider / Loner                  982   2491   3473
Rogue / Trickster / Con Artist   3933  13848  17781
Ruler / Politician                623   4453   5076
Sidekick / Loyal Companion       1605   7721   9326
Warrior / Vigilante               736   5066   5802

Conclusions

From here we can draw some conclusions:

  • Women represent the Caregiver / Healer archetype in the movies significantly more often than Men
  • Women represent the Innocent / Vulnerable archetype in the movies more often than Men
  • Men represent the following archetypes in the movies significantly more often than Women:
    • Mentor / Wise Guide,
    • Outsider / Loner,
    • Rogue / Trickster / Con Artist,
    • Ruler / Politician,
    • Sidekick / Loyal Companion,
    • Warrior / Vigilante

Part of this pattern could be due to sampling bias in the movies included in our dataset; however, the general trends are visible in this barplot and also correlate with stereotypical representation.

Height

Let us plot the distribution of actors by height and define 3 categories: below the 25th percentile is small, between the 25th and 75th percentiles is medium, and above the 75th percentile is high. We can then analyze how actors in each height range are distributed across the archetypes. An important note is that men are on average taller than women, so to keep the comparison clean let’s take the men’s data first (the same can be applied to women).

Small Threshold: 1.75
High Threshold: 1.85

    actor_name       actor_height  height_category
4   Humphrey Bogart          1.74  Small
6   Matt McCoy               1.85  Medium
9   Faizon Love              1.77  Medium
14  Robert Taylor            1.82  Medium
15  Humphrey Bogart          1.74  Small

height_category                   High  Medium   Small   total
archetype
Caregiver / Healer                 267     607     237    1111
Innocent / Vulnerable               78     404     253     735
Mentor / Wise Guide               1110    2301     719    4130
Mystic / Seer                      187     318     113     618
Other                              381     803     371    1555
Outsider / Loner                   308     902     310    1520
Rogue / Trickster / Con Artist    2036    4782    1693    8511
Ruler / Politician                 733    1240     341    2314
Sidekick / Loyal Companion         988    2256    1090    4334
Warrior / Vigilante               1090    1864     493    3447
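The categorization step can be sketched as follows, using the five sample rows and the two thresholds printed above. The helper name `categorize_height` is hypothetical. The strict inequalities are an assumption inferred from the sample, where 1.85 is labeled Medium.

```python
import numpy as np
import pandas as pd

def categorize_height(heights, small_threshold, high_threshold):
    """Label heights as Small / Medium / High (hypothetical helper)."""
    return np.select(
        [heights < small_threshold, heights > high_threshold],
        ["Small", "High"],
        default="Medium",
    )

# The five sample rows shown above (heights in metres).
men = pd.DataFrame({
    "actor_name": ["Humphrey Bogart", "Matt McCoy", "Faizon Love",
                   "Robert Taylor", "Humphrey Bogart"],
    "actor_height": [1.74, 1.85, 1.77, 1.82, 1.74],
})

# On the full data the thresholds would come from the quartiles, e.g.
# men["actor_height"].quantile(0.25) and men["actor_height"].quantile(0.75);
# here the reported values are used directly.
small_threshold, high_threshold = 1.75, 1.85

men["height_category"] = categorize_height(
    men["actor_height"], small_threshold, high_threshold
)
print(men)
```

Once the category column exists, the pivot by archetype follows from `pd.crosstab` on the archetype and `height_category` columns.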

From here we can see that it is uncommon for the Innocent / Vulnerable archetype, for example, to be played by tall actors. Another insight is that Ruler / Politician and Warrior / Vigilante are rarely played by actors of small height. We will use these insights in our further analysis. Now let’s do the same with women and get:

From here we can see that it is uncommon for the Innocent / Vulnerable archetype, for example, to be played by tall actresses, with an inclination towards actresses of small and medium height. However, the patterns for Ruler / Politician and Warrior / Vigilante are not the same as for men, though we cannot draw conclusions from this.

Age

Let’s follow the same logic and see the percentiles for age:

25th Percentile: 27.0
75th Percentile: 47.0

code
import plotly.express as px
from IPython import display

# Assumption: df is the actor-character dataframe used above; the quartiles
# below correspond to the 25th and 75th percentiles printed earlier.
p25 = df['age_at_release'].quantile(0.25)
p75 = df['age_at_release'].quantile(0.75)

fig = px.histogram(
    df,
    x='age_at_release',
    title='Distribution of Actors by Age at Movie Release',
    labels={'age_at_release': 'Age at Release, years'},
    nbins=30,
    text_auto=True,
    histnorm='percent'
)

fig.add_shape(
    type='line',
    x0=p25, y0=0,
    x1=p25, y1=25,
    line=dict(color='blue')
)

fig.add_shape(
    type='line',
    x0=p75, y0=0,
    x1=p75, y1=25,
    line=dict(color='red')
)

fig.add_annotation(
    x=p25,
    y=20,
    text='25th Percentile',
    showarrow=True,
    arrowhead=2
)

fig.add_annotation(
    x=p75,
    y=15,
    text='75th Percentile',
    showarrow=True,
    arrowhead=2
)
display.display_html(fig.to_html(full_html=False, include_plotlyjs='cdn'), raw=True)

For this particular case, let’s consider actors aged 20 or younger as young and actors aged 60 or older as old, and make a new pivot table as in the previous cases.

age_category                    Middle-aged    Old  Young  total
archetype
Caregiver / Healer                     5364    871    522   6757
Innocent / Vulnerable                  2213     76   1748   4037
Mentor / Wise Guide                    6421   1660    294   8375
Mystic / Seer                          1525    308    134   1967
Other                                  6672    790    583   8045
Outsider / Loner                       2899    212    372   3483
Rogue / Trickster / Con Artist        16022   1185    605  17812
Ruler / Politician                     3916   1112     58   5086
Sidekick / Loyal Companion             7869    489    997   9355
Warrior / Vigilante                    5356    198    254   5808

Here are some insights we can draw from this plot:

  • If an actor is old, it is highly likely that they play the Mentor / Wise Guide archetype (the ratio to Middle-aged is not as large as for the other archetypes). Other likely archetypes are Ruler / Politician (also with a good ratio) or Rogue / Trickster / Con Artist (though less probable)
  • If an actor is young, there is a high chance they will be typecast as the Innocent / Vulnerable archetype.
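To make the ratios behind these observations explicit, the counts from the pivot table above can be normalized in two directions: within each archetype (how old is the typical cast of this archetype?) and within each age category (given an actor's age group, which archetype is most common?). This is a sketch on the published counts only. The variable names are illustrative.

```python
import pandas as pd

# Counts copied from the age pivot table above.
age_counts = pd.DataFrame(
    {
        "Middle-aged": [5364, 2213, 6421, 1525, 6672, 2899, 16022, 3916, 7869, 5356],
        "Old": [871, 76, 1660, 308, 790, 212, 1185, 1112, 489, 198],
        "Young": [522, 1748, 294, 134, 583, 372, 605, 58, 997, 254],
    },
    index=[
        "Caregiver / Healer", "Innocent / Vulnerable", "Mentor / Wise Guide",
        "Mystic / Seer", "Other", "Outsider / Loner",
        "Rogue / Trickster / Con Artist", "Ruler / Politician",
        "Sidekick / Loyal Companion", "Warrior / Vigilante",
    ],
)

# Share of each age category within an archetype (rows sum to 1).
share_within_archetype = age_counts.div(age_counts.sum(axis=1), axis=0)

# Share of each archetype within an age category (columns sum to 1).
share_within_age = age_counts.div(age_counts.sum(axis=0), axis=1)

print(share_within_archetype.round(3))
print(share_within_age.round(3))
```

The two views answer different questions, so an archetype can rank high in one and lower in the other. Rogue / Trickster / Con Artist, for example, has many old actors in absolute terms mainly because it is the largest archetype overall.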

Insights Check

Let’s check some of our hypotheses. Let’s take, for example, young female actors of relatively small height and check their distribution across archetypes. Our analysis indicates that in most cases they would be typecast as the Innocent / Vulnerable archetype.

Unnamed: 0 prediction archetype character_name movie_name movie_fb_id actor_fb_id model wikipedia_movie_id fb_movie_id ... fb_char_actor_map_id fb_char_id fb_actor_id actor_date_of_birth movie_release_date ethn_name race age_at_release age_category actor_age
77368 77368 Innocent / Vulnerable Innocent / Vulnerable Sunny Baudelaire Lemony Snicket's A Series of Unfortunate Events /m/04k9y6 /m/04pr_b gpt-4o 1228937 /m/04k9y6 ... /m/0j_x1q /m/04x2rn /m/04pr_b 2002-08-02 2004-12-16 NaN NaN 2 Young 22
77369 77369 Innocent / Vulnerable Innocent / Vulnerable Sunny Baudelaire Lemony Snicket's A Series of Unfortunate Events /m/04k9y6 /m/075jxk9 gpt-4o 1228937 /m/04k9y6 ... /m/09rrbch /m/04x2rn /m/075jxk9 2002-08-02 2004-12-16 NaN NaN 2 Young 22
83161 83161 Caregiver / Healer Caregiver / Healer Vanessa November Christmas /m/0dsbr6l /m/0fqmc4w gpt-3.5-turbo 29602277 /m/0dsbr6l ... /m/0hzxvh1 /m/04dwzvd /m/0fqmc4w 2002-05-06 2010-01-01 NaN NaN 7 Young 22

3 rows × 24 columns

<Figure size 1200x600 with 1 Axes>

This arguably agrees with our conclusions.
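A check of this kind amounts to filtering on several trait columns at once and looking at the resulting archetype distribution. The sketch below reuses column names that appear in this analysis. The toy rows and the helper `archetype_distribution` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with the trait columns used in this section.
roles = pd.DataFrame({
    "actor_gender": ["F", "F", "F", "F", "M", "F"],
    "age_category": ["Young", "Young", "Young", "Middle-aged", "Young", "Young"],
    "height_category": ["Small", "Small", "Small", "Small", "Small", "Medium"],
    "archetype": [
        "Innocent / Vulnerable", "Innocent / Vulnerable", "Caregiver / Healer",
        "Caregiver / Healer", "Sidekick / Loyal Companion", "Mentor / Wise Guide",
    ],
})

def archetype_distribution(df, **traits):
    """Share of each archetype among rows matching all given trait values."""
    mask = pd.Series(True, index=df.index)
    for column, value in traits.items():
        mask &= df[column] == value
    return df.loc[mask, "archetype"].value_counts(normalize=True)

distribution = archetype_distribution(
    roles, actor_gender="F", age_category="Young", height_category="Small"
)
print(distribution)
```

With few matching rows the shares are noisy, so it helps to report the number of rows in the filtered subset alongside the distribution.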

Question 7. What Is the Composite Image of the “Ideal” Actor for a Given Character Archetype?

Our analysis reveals that an actor’s traits can significantly influence the roles they are cast in and the archetypes they portray. In this section, we aim to identify the most representative actor profiles for specific archetypes based on historical data. To achieve this, we calculate the average traits of actors associated with each archetype and determine the characteristics most commonly linked to that archetype. We then identify actors whose traits closely align with these averages. The results are the following:

Warrior / Vigilante
Mentor / Wise Guide
Intellectual / Creative (Scholar/Artist/Inventor)
Rogue / Trickster / Con Artist
Innocent / Vulnerable
Ruler / Politician
Mystic / Seer
Caregiver / Healer
Outsider / Loner
Love Interest / Romantic Partner
Sidekick / Loyal Companion

Average Warrior / Vigilante is

  • Male
  • 185 cm height
  • weight of 88 kg
  • 39 years
Actors resembling the average character of this archetype:
Dean Cain
height: 183cm
weight: 86kg
ethnicity: Welsh
Don Frye
height: 185cm
weight: 93kg
Antonio Tarver
height: 188cm
weight: 90kg
ethnicity: African American

Average Mentor / Wise Guide is

  • Male
  • 183 cm height
  • weight of 79 kg
  • 46 years
Actors resembling the average character of this archetype:
John Amos
height: 181cm
weight: 79kg
ethnicity: African American
Mickey Rourke
height: 180cm
weight: 77kg
ethnicity: Irish
Bryan Callen
height: 180cm
weight: 77kg
ethnicity: Irish

Average Intellectual / Creative (Scholar/Artist/Inventor) is

  • Male
  • 182 cm height
  • weight of 76 kg
  • 39 years
Actors resembling the average character of this archetype:
Kuranosuke Sasaki
height: 182cm
weight: 74kg
Bae Yong Joon
height: 180cm
weight: 75kg
ethnicity: Korean
Mickey Rourke
height: 180cm
weight: 77kg
ethnicity: Irish

Average Rogue / Trickster / Con Artist is

  • Male
  • 183 cm height
  • weight of 81 kg
  • 41 years
Actors resembling the average character of this archetype:
Neil Patrick Harris
height: 185cm
weight: 82kg
John Amos
height: 181cm
weight: 79kg
ethnicity: African American
Mickey Rourke
height: 180cm
weight: 77kg
ethnicity: Irish

Average Innocent / Vulnerable is

  • Female
  • 167 cm height
  • weight of 54 kg
  • 21 years
Actors resembling the average character of this archetype:
Stella Stevens
height: 165cm
weight: 54kg
Florette Hillier
height: 168cm
weight: 54kg
ethnicity: British
Nina Dobrev
height: 167cm
weight: 55kg
ethnicity: Bulgarian

Average Ruler / Politician is

  • Male
  • 185 cm height
  • weight of 81 kg
  • 50 years
Actors resembling the average character of this archetype:
Eric Nesterenko
height: 185cm
weight: 84kg
Mike Horner
height: 185cm
weight: 77kg
John Amos
height: 181cm
weight: 79kg
ethnicity: African American

Average Mystic / Seer is

  • Male
  • 184 cm height
  • weight of 78 kg
  • 47 years
Actors resembling the average character of this archetype:
Mike Horner
height: 185cm
weight: 77kg
Kuranosuke Sasaki
height: 182cm
weight: 74kg
Morgan Freeman
height: 188cm
weight: 79kg
ethnicity: African American

Average Caregiver / Healer is

  • Female
  • 168 cm height
  • weight of 54 kg
  • 30 years
Actors resembling the average character of this archetype:
Chyler Leigh
height: 168cm
weight: 54kg
Kate Beckinsale
height: 170cm
weight: 55kg
ethnicity: English
Tika Sumpter
height: 170cm
weight: 55kg
ethnicity: African American

Average Outsider / Loner is

  • Male
  • 181 cm height
  • weight of 78 kg
  • 39 years
Actors resembling the average character of this archetype:
Mickey Rourke
height: 180cm
weight: 77kg
ethnicity: Irish
John Amos
height: 181cm
weight: 79kg
ethnicity: African American
Guy Pearce
height: 180cm
weight: 80kg
ethnicity: Australian

Average Love Interest / Romantic Partner is

  • Female
  • 168 cm height
  • weight of 55 kg
  • 26 years
Actors resembling the average character of this archetype:
Golshifteh Farahani
height: 169cm
weight: 55kg
Kate Beckinsale
height: 170cm
weight: 55kg
ethnicity: English
Jenny McCarthy
height: 170cm
weight: 54kg
ethnicity: Irish

Average Sidekick / Loyal Companion is

  • Male
  • 187 cm height
  • weight of 90 kg
  • 36 years
Actors resembling the average character of this archetype:
Chuck Liddell
height: 188cm
weight: 93kg
Pierce Brosnan
height: 188cm
weight: 89kg
ethnicity: Irish
Michael Irvin
height: 188cm
weight: 94kg
ethnicity: African American
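The matching step described at the start of this section can be sketched as a nearest-neighbour search around the archetype's average profile. The distance used here (Euclidean on height and weight, each scaled by its standard deviation) is an assumption, since the exact metric is not stated. The helper `closest_actors` is hypothetical, and the small actor pool simply reuses heights and weights listed above.

```python
import pandas as pd

TRAITS = ["height_cm", "weight_kg"]  # illustrative column names

def closest_actors(actors, profile, k=3):
    """Return the k actors nearest to the profile (hypothetical helper)."""
    # Scale each trait so that centimetres and kilograms are comparable.
    scale = actors[TRAITS].std()
    scaled_diff = (actors[TRAITS] - profile[TRAITS]) / scale
    distance = (scaled_diff ** 2).sum(axis=1) ** 0.5
    return distance.nsmallest(k).index.tolist()

# A small pool built from actors listed in this section.
actors = pd.DataFrame(
    {
        "height_cm": [183, 185, 188, 181, 180, 170, 168],
        "weight_kg": [86, 93, 90, 79, 77, 55, 54],
    },
    index=["Dean Cain", "Don Frye", "Antonio Tarver", "John Amos",
           "Mickey Rourke", "Kate Beckinsale", "Chyler Leigh"],
)

# Average Warrior / Vigilante profile reported above. On the full data it
# would be computed as a per-archetype mean of the trait columns.
warrior_profile = pd.Series({"height_cm": 185, "weight_kg": 88})

print(closest_actors(actors, warrior_profile))
```

Scaling by the standard deviation is one of several reasonable choices. Without it, the trait with the larger numeric spread would dominate the distance.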

Conclusion

Overall, we successfully executed our proposal from the second checkpoint in its entirety. All the additional datasets we mentioned were utilized in this work successfully.

Moreover, the core idea — that we can infer an archetype for each character — proved to be effective. We managed to gather all the necessary information for this purpose and made predictions for the complete dataset, making our project idea feasible.

For every question posed in the proposal, you will find a detailed, data-driven answer on this page. We will not list all the funny findings here in conclusion, but there definitely were a lot!