Animation Development

Author: WuYan Nie

Course Project, UC Irvine, Math 10, F23

Introduction

My final project follows a personal interest: how animation has developed over the past two decades. I started watching anime in middle school, and I still miss the afternoons I spent watching animation with my dad. In this project, I focus on the genre and start year of each anime and how they relate to user rating. I am curious about whether animation has become worse over time.

First, import the tools used in this project.

import altair as alt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression

Organize the Data to Be Useful

Use pandas to read the CSV data file. First, drop the repeated rows (here, the first 841 rows).

import pandas as pd
df = pd.read_csv('imdb_anime.csv')
df = df.drop(df.index[0:841]).copy()
len(df)
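The positional drop above relies on the repeated rows being exactly the first 841 rows of the file. As a minimal sketch of a content-based alternative, the example below removes fully identical rows with `drop_duplicates`. The `toy` frame is made-up data for illustration, not taken from imdb_anime.csv, and this approach may not remove the same rows as the positional drop.

```python
import pandas as pd

# Hypothetical toy data standing in for the real CSV.
toy = pd.DataFrame({
    "Title": ["A", "A", "B", "C", "C"],
    "User Rating": ["8.9", "8.9", "7.4", "7.1", "7.1"],
})

# Keep the first copy of each fully identical row, then renumber the index.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(toy), "->", len(deduped))
```

Passing `subset=[...]` to `drop_duplicates` would compare only the listed columns, which could be useful if repeated rows differ in an unimportant column.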

Notice that the data set is very large. First, drop the columns we do not need and the rows with missing values.

df2 = df.drop(['Episode Title','Episode','Metascore','Stars','Summary'], axis=1).copy()
df2 = df2.dropna(axis=0).copy()
df2 = df2.reset_index(drop=True)
len(df2)

The data still has 14918 rows. Before filtering by range, convert the columns from object to float or int.

df2['User Rating'] = df2['User Rating'].apply(lambda x: float(x))
df2['Number of Votes'] = df2['Number of Votes'].apply(lambda x: float(x.replace(',','')))
df2['Runtime'] = df2['Runtime'].apply(lambda x: float(x.replace(' min','').replace(',','')))
df2['Gross'] = df2['Gross'].apply(lambda x: float(x))
df2

	Title	Genre	User Rating	Number of Votes	Runtime	Year	Certificate	Gross
0	One Piece	Animation, Action, Adventure	8.9	187689.0	24.0	(1999– )	TV-14	187689.0
1	Teenage Mutant Ninja Turtles: Mutant Mayhem	Animation, Action, Adventure	7.4	28895.0	99.0	(2023)	PG	28895.0
2	The Super Mario Bros. Movie	Animation, Adventure, Comedy	7.1	189108.0	92.0	(2023)	PG	189108.0
3	Attack on Titan	Animation, Action, Adventure	9.1	434457.0	24.0	(2013–2023)	TV-MA	434457.0
4	Jujutsu Kaisen	Animation, Action, Adventure	8.5	82909.0	24.0	(2020– )	TV-MA	82909.0
...	...	...	...	...	...	...	...	...
14913	Demon Slayer: Kimetsu no Yaiba	Animation, Action, Adventure	8.3	3456.0	24.0	(2019– )	TV-MA	3456.0
14914	Demon Slayer: Kimetsu no Yaiba	Animation, Action, Adventure	8.4	3401.0	24.0	(2019– )	TV-MA	3401.0
14915	Demon Slayer: Kimetsu no Yaiba	Animation, Action, Adventure	8.4	3375.0	24.0	(2019– )	TV-MA	3375.0
14916	Demon Slayer: Kimetsu no Yaiba	Animation, Action, Adventure	7.3	3432.0	24.0	(2019– )	TV-14	3432.0
14917	Demon Slayer: Kimetsu no Yaiba	Animation, Action, Adventure	7.9	3309.0	24.0	(2019– )	TV-MA	3309.0

14918 rows × 8 columns
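The conversions above call `float(...)` on every entry, which raises an error as soon as one string cannot be parsed. A minimal sketch of a more forgiving variant uses `pd.to_numeric` with `errors="coerce"`. The `raw` frame and the helper name `to_number` are hypothetical, and the `"n/a"` entry is an assumed example of a bad value, not something observed in the data.

```python
import pandas as pd

# Hypothetical raw strings in the same style as the original columns.
raw = pd.DataFrame({
    "Number of Votes": ["187,689", "28,895", "n/a"],
    "Runtime": ["24 min", "1,440 min", "99 min"],
})

def to_number(col: pd.Series) -> pd.Series:
    """Strip thousands separators and a trailing ' min', then parse as float."""
    cleaned = (col.str.replace(",", "", regex=False)
                  .str.replace(" min", "", regex=False))
    # errors="coerce" turns unparsable strings into NaN instead of raising.
    return pd.to_numeric(cleaned, errors="coerce").astype(float)

clean = raw.apply(to_number)
print(clean)
```

Rows that end up as NaN can then be removed with `dropna`, in the same way the missing rows were dropped earlier.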

Here I apply a standard statistical tool, the interquartile range (IQR) rule, to User Rating. Ratings below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are treated as outliers and removed.
Then I set a minimum Number of Votes, so that I do not analyze animations with a high rating but very few votes.

import numpy as np
data1 = df2['User Rating'].sort_values()
q3, q1 = np.percentile(data1, [75 ,25])
iqr = q3 - q1

range1 = q1 - 1.5*iqr
range2 = q3 + 1.5*iqr
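To make the rule concrete, here is a small worked sketch on made-up ratings. The helper name `iqr_bounds` and the `ratings` array are assumptions for illustration; the computation follows the same steps as the cell above.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical ratings with one obviously low value.
ratings = np.array([2.0, 7.0, 7.5, 8.0, 8.0, 8.5, 9.0])
low, high = iqr_bounds(ratings)

# Keep only the ratings inside the fences.
kept = ratings[(ratings >= low) & (ratings <= high)]
print(low, high, kept)
```

For this toy array Q1 is 7.25 and Q3 is 8.25, so the fences are 5.75 and 9.75 and only the rating 2.0 is dropped.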

Here is how I apply the IQR bounds and the vote threshold to get the data for analysis.
I want to analyze genre, user rating, and year together, so I also edit the Year column to hold the year in which each animation was first released.
The result is df3, the final data frame I will use.

ba = (df2['User Rating'] <= range2) & (df2['User Rating'] >= range1)
df3 = df2[ba].copy()
df3 = df3[df3['Number of Votes'] > 2000].copy()
df3['Year']=df3['Year'].apply(lambda x: int(x.replace('(','').replace(')','').replace('–','').replace('I','').replace(' ','')[0:4]))
df3 = df3.reset_index(drop=True)
df3

	Title	Genre	User Rating	Number of Votes	Runtime	Year	Certificate	Gross
0	One Piece	Animation, Action, Adventure	8.9	187689.0	24.0	1999	TV-14	187689.0
1	Teenage Mutant Ninja Turtles: Mutant Mayhem	Animation, Action, Adventure	7.4	28895.0	99.0	2023	PG	28895.0
2	The Super Mario Bros. Movie	Animation, Adventure, Comedy	7.1	189108.0	92.0	2023	PG	189108.0
3	Attack on Titan	Animation, Action, Adventure	9.1	434457.0	24.0	2013	TV-MA	434457.0
4	Jujutsu Kaisen	Animation, Action, Adventure	8.5	82909.0	24.0	2020	TV-MA	82909.0
...	...	...	...	...	...	...	...	...
1201	Demon Slayer: Kimetsu no Yaiba	Animation, Action, Adventure	8.3	3456.0	24.0	2019	TV-MA	3456.0
1202	Demon Slayer: Kimetsu no Yaiba	Animation, Action, Adventure	8.4	3401.0	24.0	2019	TV-MA	3401.0
1203	Demon Slayer: Kimetsu no Yaiba	Animation, Action, Adventure	8.4	3375.0	24.0	2019	TV-MA	3375.0
1204	Demon Slayer: Kimetsu no Yaiba	Animation, Action, Adventure	7.3	3432.0	24.0	2019	TV-14	3432.0
1205	Demon Slayer: Kimetsu no Yaiba	Animation, Action, Adventure	7.9	3309.0	24.0	2019	TV-MA	3309.0

1206 rows × 8 columns
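The Year cleanup above chains several `replace` calls and then keeps the first four characters. A minimal sketch of an alternative is to pull out the first four-digit number with a regular expression. The sample strings are illustrative: the first three match the formats visible in the table, while `"(I) (2019)"` and `"(2016 TV Movie)"` are assumed formats that may or may not occur in the real data.

```python
import pandas as pd

# Illustrative Year strings; the last two formats are assumptions.
years = pd.Series(["(1999– )", "(2023)", "(2013–2023)", "(I) (2019)", "(2016 TV Movie)"])

# Extract the first run of four digits, which is the start year.
first_year = years.str.extract(r"(\d{4})", expand=False).astype(int)
print(first_year.tolist())
```

If a string contained no four-digit number, `str.extract` would return NaN for it and the `astype(int)` step would fail, so such rows would need to be dropped first.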

Genre and User Rating

First, I am curious about the relationship between Genre and User Rating: which genres are popular and more likely to get a high rating?
Using value counts, I list the top 16 genre combinations. The most common one is action and adventure; I set it aside because its count is so much larger than the rest. Then I take the next 10 most common genres and fit a decision tree. (Use df4 here)

df3['Genre'].value_counts()[0:16]

Animation, Action, Adventure    585
Animation, Adventure, Comedy     89
Animation, Action, Drama         69
Animation, Comedy, Drama         59
Animation, Action, Comedy        55
Animation, Crime, Drama          46
Animation, Adventure, Drama      30
Animation, Action, Fantasy       24
Animation, Action, Crime         23
Animation, Drama, Fantasy        21
Animation, Action, Sci-Fi        21
Animation, Comedy, Romance       16
Animation, Adventure, Family     15
Animation, Drama, Family         14
Animation, Drama, Romance        14
Animation, Comedy, Fantasy       13
Name: Genre, dtype: int64

top_genre = df3['Genre'].value_counts().index[1:11]
df4 = df3[df3['Genre'].isin(top_genre)].copy()
df5 = df3[df3['Genre']=='Animation, Action, Adventure'].copy()  # used later

X_train, X_test, y_train, y_test = train_test_split(df4[['User Rating']], df4["Genre"], test_size=0.2, random_state=53)
clf = DecisionTreeClassifier(max_leaf_nodes=10, random_state=53)
clf.fit(X_train, y_train)

DecisionTreeClassifier(max_leaf_nodes=10, random_state=53)


import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
plt.figure(figsize=(50, 30))
plot_tree(clf, 
                feature_names=clf.feature_names_in_,
                class_names=clf.classes_,
                fontsize=20,
                filled=True)

[Text(0.47058823529411764, 0.9, 'User Rating <= 6.85\ngini = 0.872\nsamples = 349\nvalue = [42, 17, 56, 20, 17, 75, 25, 46, 32, 19]\nclass = Animation, Adventure, Comedy'),
 Text(0.23529411764705882, 0.7, 'User Rating <= 5.95\ngini = 0.79\nsamples = 73\nvalue = [5, 6, 4, 7, 8, 29, 6, 7, 0, 1]\nclass = Animation, Adventure, Comedy'),
 Text(0.11764705882352941, 0.5, 'User Rating <= 5.75\ngini = 0.675\nsamples = 13\nvalue = [0, 0, 0, 2, 5, 5, 0, 0, 0, 1]\nclass = Animation, Action, Sci-Fi'),
 Text(0.058823529411764705, 0.3, 'gini = 0.571\nsamples = 7\nvalue = [0, 0, 0, 0, 2, 4, 0, 0, 0, 1]\nclass = Animation, Adventure, Comedy'),
 Text(0.17647058823529413, 0.3, 'gini = 0.611\nsamples = 6\nvalue = [0, 0, 0, 2, 3, 1, 0, 0, 0, 0]\nclass = Animation, Action, Sci-Fi'),
 Text(0.35294117647058826, 0.5, 'User Rating <= 6.55\ngini = 0.786\nsamples = 60\nvalue = [5, 6, 4, 5, 3, 24, 6, 7, 0, 0]\nclass = Animation, Adventure, Comedy'),
 Text(0.29411764705882354, 0.3, 'gini = 0.74\nsamples = 22\nvalue = [0, 2, 0, 4, 0, 9, 4, 3, 0, 0]\nclass = Animation, Adventure, Comedy'),
 Text(0.4117647058823529, 0.3, 'gini = 0.784\nsamples = 38\nvalue = [5, 4, 4, 1, 3, 15, 2, 4, 0, 0]\nclass = Animation, Adventure, Comedy'),
 Text(0.7058823529411765, 0.7, 'User Rating <= 8.05\ngini = 0.871\nsamples = 276\nvalue = [37, 11, 52, 13, 9, 46, 19, 39, 32, 18]\nclass = Animation, Action, Drama'),
 Text(0.5882352941176471, 0.5, 'User Rating <= 7.15\ngini = 0.864\nsamples = 190\nvalue = [19, 9, 42, 10, 9, 38, 15, 22, 16, 10]\nclass = Animation, Action, Drama'),
 Text(0.5294117647058824, 0.3, 'User Rating <= 7.05\ngini = 0.869\nsamples = 43\nvalue = [3, 4, 9, 4, 6, 7, 5, 3, 1, 1]\nclass = Animation, Action, Drama'),
 Text(0.47058823529411764, 0.1, 'gini = 0.88\nsamples = 25\nvalue = [3, 2, 3, 3, 3, 3, 4, 3, 1, 0]\nclass = Animation, Adventure, Drama'),
 Text(0.5882352941176471, 0.1, 'gini = 0.79\nsamples = 18\nvalue = [0, 2, 6, 1, 3, 4, 1, 0, 0, 1]\nclass = Animation, Action, Drama'),
 Text(0.6470588235294118, 0.3, 'gini = 0.855\nsamples = 147\nvalue = [16, 5, 33, 6, 3, 31, 10, 19, 15, 9]\nclass = Animation, Action, Drama'),
 Text(0.8235294117647058, 0.5, 'User Rating <= 8.65\ngini = 0.848\nsamples = 86\nvalue = [18, 2, 10, 3, 0, 8, 4, 17, 16, 8]\nclass = Animation, Action, Comedy'),
 Text(0.7647058823529411, 0.3, 'gini = 0.856\nsamples = 61\nvalue = [10, 2, 5, 3, 0, 6, 4, 14, 10, 7]\nclass = Animation, Comedy, Drama'),
 Text(0.8823529411764706, 0.3, 'User Rating <= 8.95\ngini = 0.778\nsamples = 25\nvalue = [8, 0, 5, 0, 0, 2, 0, 3, 6, 1]\nclass = Animation, Action, Comedy'),
 Text(0.8235294117647058, 0.1, 'gini = 0.755\nsamples = 20\nvalue = [8, 0, 3, 0, 0, 2, 0, 2, 4, 1]\nclass = Animation, Action, Comedy'),
 Text(0.9411764705882353, 0.1, 'gini = 0.64\nsamples = 5\nvalue = [0, 0, 2, 0, 0, 0, 0, 1, 2, 0]\nclass = Animation, Action, Drama')]

[Figure: the fitted decision tree drawn by plot_tree]

clf.score(X_test, y_test)

0.18181818181818182

rfc = RandomForestClassifier(n_estimators=200, max_leaf_nodes=20)
rfc.fit(X_train, y_train)
rfc.score(X_test, y_test)

0.17045454545454544

Notice how this graph illustrates the relationship between genre and user rating. Higher ratings tend to fall in the action genres, while lower ratings tend to fall in the adventure genres; adventure appears more often on the left side of the tree.
But the score is very low: about 0.18 on the test set, with 10 different genres to choose from. This is because genre is too coarse a label to be distinguished by rating alone. Even the random forest gets a similarly low score.
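One way to put these scores in context is a majority-class baseline, which always predicts the most common training label. The root node of the tree shows that the largest class has 75 of the 349 training samples, about 21%, so accuracy near 0.18 is no better than that simple rule would achieve on the training set. The sketch below shows how such a baseline could be computed with scikit-learn's `DummyClassifier`; the data here is randomly generated as a stand-in, not the anime data, so its score is only illustrative.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 genre labels, ratings unrelated to genre.
rng = np.random.default_rng(0)
X = rng.uniform(5.5, 9.5, size=(400, 1))
y = rng.integers(0, 10, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=53
)

# Always predict the most frequent label seen in training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)
print(baseline_acc)
```

On the real data, the same baseline could be fitted on the existing `X_train` and `y_train` and compared directly with `clf.score(X_test, y_test)`.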
What about the genre that contains both action and adventure, which is also the most common genre? (Use df5 here)

df5['User Rating'].mean()

8.035042735042735

df5['User Rating'].std()

1.002199314326418

The mean rating of about 8.04 shows that audiences rate action and adventure animation highly. This high rating may also explain why there are so many anime with these themes.
What about the mean rating for the other genres?

for i in df3['Genre'].value_counts().index[1:11]:
    a = df3[df3['Genre']==i].copy()
    print(f"{a['User Rating'].mean()} for {i}")

152808988764045 for Animation, Adventure, Comedy
655072463768117 for Animation, Action, Drama
733898305084745 for Animation, Comedy, Drama
880000000000001 for Animation, Action, Comedy
11086956521739 for Animation, Crime, Drama
306666666666668 for Animation, Adventure, Drama
041666666666667 for Animation, Action, Fantasy
230434782608696 for Animation, Action, Crime
800000000000001 for Animation, Drama, Fantasy
704761904761904 for Animation, Action, Sci-Fi
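The loop above computes one mean at a time. As a minimal sketch, the same kind of summary can be produced in one pass with `groupby`, which also makes it easy to add the standard deviation and the number of titles per genre. The `toy` frame contains made-up ratings for illustration only.

```python
import pandas as pd

# Hypothetical ratings for three genres.
toy = pd.DataFrame({
    "Genre": ["Animation, Action, Adventure"] * 3
             + ["Animation, Crime, Drama"] * 2
             + ["Animation, Comedy, Drama"],
    "User Rating": [8.0, 8.5, 7.5, 8.2, 8.0, 7.7],
})

# Mean, standard deviation, and count per genre, most common genre first.
summary = (toy.groupby("Genre")["User Rating"]
              .agg(["mean", "std", "count"])
              .sort_values("count", ascending=False))
print(summary.round(2))
```

Showing the count next to the mean helps when reading the result, since a mean based on a dozen titles is less reliable than one based on several hundred.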

You can see that almost all other genres have a lower mean user rating than action and adventure.
The conclusion is that people prefer anime with action and adventure themes, which is why there are so many animations of this type. However, action seems more likely to be accepted by most audiences (judging from the decision tree). This might be because it is easier to make an exciting action anime by simply depicting the action. A good adventure anime, on the other hand, requires an insightful or wild idea to build an attractive world, so it demands more creativity than the action type. This may also explain why the combination of action and adventure is the most popular anime type.

top_genre = df3['Genre'].value_counts().index[0:12]
df4 = df3[df3['Genre'].isin(top_genre)].copy()
alt.Chart(df4).mark_circle().encode(
    x = 'Genre',
    y = 'User Rating',
    color = 'Year:Q',
    tooltip=['Year','User Rating', 'Genre']
)

alt.Chart(df4).mark_circle().encode(
    x = 'Genre',
    y = 'User Rating',
    color = 'Genre',
    tooltip=['Year','User Rating', 'Genre']
)

These two graphs show how user ratings are spread within each genre. The first one also shows the spread across years. Most action and adventure titles, the most popular type, were created in the past decade (colored in deeper blue), while the other genres are spread more evenly from 1980 to 2020.
The graphs also show that audiences give higher ratings to action and adventure anime, although its scores range from 5.7 to 9.5. This suggests that studios tried to cater to audience preferences and that some anime still failed to win people over. Even so, action and adventure has fewer ratings below 6 than the other types.

Year and User Rating

Now, I am curious about how time influences people's preferences and anime quality. I decide to build a timeline graph.

alt.Chart(df3).mark_circle().encode(
    x = alt.X('Year', scale = alt.Scale(domain = (1964,2023))),
    y = 'User Rating',
    color = 'Genre:N',
    tooltip=['Year','User Rating', 'Genre']
)

df3['Year'].value_counts()[1:10]

  107
   66
   63
   51
   50
   48
   41
   40
   39
Name: Year, dtype: int64

From this graph, you can see clear vertical lines of dots in 1998, 2006, 2013, 2019, and 2022. The largest numbers of animations and the highest user ratings also occur in those years. The red circles are the action and adventure genre.
It seems that animation started to grow in 1998, and from 1999 the dots become more diverse and spread out. This is the start of animation development.
In 2006, crime and drama dominated the animation market. This is why, in the previous genre analysis, one genre had a higher mean user rating than action and adventure: crime and drama.
Moreover, in most years action and adventure dominated the anime market and obtained high user ratings. In the other years, either a diverse mix of genres or only action and adventure received high ratings.
The dots become denser over time, meaning more and more animations were released. Meanwhile, the spread along the y-axis suggests that earlier anime were of high quality: in 1998 only one dot has a rating below 6, while more and more dots below 6 appear over time.
In conclusion, it is hard to tell whether animation has gotten worse over time, since the diversity of genres affects the judgment. Looking at the relationship between genre and user rating in each year might be helpful.

top_genre = df3['Genre'].value_counts().index[0:10]
df6 = df3[df3['Genre'].isin(top_genre)].copy()
mean_ratings = df6.groupby(['Year', 'Genre'])['User Rating'].mean().reset_index()
mean_ratings

	Year	Genre	User Rating
0	1964	Animation, Adventure, Comedy	8.000000
1	1967	Animation, Action, Adventure	7.200000
2	1968	Animation, Action, Adventure	6.600000
3	1970	Animation, Adventure, Comedy	7.700000
4	1971	Animation, Action, Adventure	7.900000
...	...	...	...
241	2022	Animation, Drama, Fantasy	8.200000
242	2023	Animation, Action, Adventure	7.028571
243	2023	Animation, Action, Comedy	8.100000
244	2023	Animation, Adventure, Comedy	7.100000
245	2023	Animation, Drama, Fantasy	8.500000

246 rows × 3 columns

alt.Chart(mean_ratings).mark_line().encode(
    x = 'Year',
    y = 'User Rating',
    color = 'Genre:N',
    column = 'Genre'
)

There is still no clear relationship between year and user rating within each genre. The mean user rating varies within a range for each genre. For most genres, the mean user rating has decreased over the past three years, so almost every line dips at its tail. Based on earlier patterns, though, the lines may go back up later.
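One way to quantify "no clear pattern" is to fit a straight line to the yearly means of each genre and look at the slope. The sketch below does this with `LinearRegression` on made-up yearly means; the genre names `G1` and `G2` and all the numbers are hypothetical, and this is only one possible check, not a result from the anime data.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical yearly mean ratings for two genres.
toy = pd.DataFrame({
    "Year": [2018, 2019, 2020, 2021, 2022, 2023] * 2,
    "Genre": ["G1"] * 6 + ["G2"] * 6,
    "User Rating": [8.0, 7.9, 7.8, 7.7, 7.6, 7.5,
                    7.0, 7.4, 6.9, 7.3, 7.0, 7.2],
})

# Fit rating = a * year + b separately for each genre and keep the slope a.
slopes = {}
for genre, group in toy.groupby("Genre"):
    model = LinearRegression().fit(group[["Year"]], group["User Rating"])
    slopes[genre] = float(model.coef_[0])

print(slopes)
```

A clearly negative slope, as for `G1` here, would indicate a steady decline, while a slope close to zero, as for `G2`, would match ratings that only fluctuate. On the real data the same loop could be run over `mean_ratings`.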

Summary

The analysis of genre and year as factors fails to conclude that animation has gotten worse over time. People tend to like action and adventure animation because it tells the kind of story found in people's fantasies. Among the other genres, drama, comedy, and fantasy get the higher scores. On the other hand, there is no clear relationship between year and user rating within each genre; the lines go up and down. It is hard to tell whether animation has gotten worse as time goes on.

I used to be really addicted to animation. I am a person with strong empathy, and each animation I watched gave me a different life experience. After following the characters' lives, I felt empty each time I finished watching. Lately, however, I have spent less time watching and looking for anime that interest me. Through this project, I wanted to figure out whether I have changed. I guess the answer is that I am no longer an innocent and ignorant child. Animation is a way for people to relax and experience different lives, and a form of art in which creators depict their imagination. Even though most famous animations were created in the last decade, I believe more interesting and excellent anime will be created in the future.

Reference

https://www.kaggle.com/datasets/lorentzyeung/all-japanese-anime-titles-in-imdb