It’s not easy to find an anime show that can be tagged with a single genre. Comedies are almost always romances, and shows portraying school life are almost never about students sitting behind their desks. It’s either mysteries or sports clubs or students fighting other students in exoskeletons or giant robots. So if I say that school life is the most popular genre on MAL, you can certainly believe me, but that tells you almost nothing about the type of anime that dominant fan groups like to watch.
This is a follow-up to my exploration of AniDB’s genre tags. The idea was to take a look at crowdsourced data and see which genres usually stick together. Last time I downloaded info from AniDB about every TV anime that started airing in the last five years or so. I calculated how similar the shows were to one another in terms of genre information and grouped them accordingly. I was hoping that this grouping would correlate well with how popular the grouped anime are, and while I think the results were pointing in the right direction, I started questioning my methodology and my data source.
This time I tried doing the same, except that I used MyAnimeList’s data. I also changed my approach to similarity scores. Instead of using cosine similarity, which didn’t make much sense from a theoretical standpoint, I employed the Jaccard similarity coefficient, which can be used for unweighted genre information. What is Jaccard similarity? Basically, take the genres that two anime have in common and divide their number by the number of distinct genres that appear on either of the two. Two anime with identical genre lists score 1, and two anime with no genre in common score 0. Do this pairwise for every two anime and you get a nice similarity matrix that we can run a clustering algorithm on. This algorithm spits out a grouping of anime titles, which enables us to check which genres are the most popular in each group. Below I present the results of this clustering procedure.
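Before getting to the results, here is a minimal sketch of the Jaccard score and the pairwise similarity matrix described above. It is not the script I actually ran. The function names, the sample titles and their tags are all made up for illustration, and treating two untagged titles as dissimilar is an assumption.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared genres divided by all distinct genres across both titles."""
    union = a | b
    if not union:
        return 0.0  # assumption: two untagged titles count as dissimilar
    return len(a & b) / len(union)


def similarity_matrix(genres: dict[str, set[str]]) -> dict[tuple[str, str], float]:
    """Pairwise Jaccard scores, keyed by (title, title) in both orders."""
    matrix = {(t, t): 1.0 for t in genres}
    for t1, t2 in combinations(genres, 2):
        score = jaccard(genres[t1], genres[t2])
        matrix[(t1, t2)] = matrix[(t2, t1)] = score
    return matrix


# Hypothetical titles and tags, for illustration only.
sample = {
    "Show A": {"comedy", "school", "romance"},
    "Show B": {"comedy", "school", "harem"},
    "Show C": {"action", "mecha", "sci-fi"},
}
sim = similarity_matrix(sample)
print(sim[("Show A", "Show B")])  # 2 shared / 4 distinct = 0.5
print(sim[("Show A", "Show C")])  # nothing shared = 0.0
```

A clustering algorithm that accepts a precomputed similarity (or distance, 1 minus similarity) matrix can then be run on the result.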
TV Anime Group 1
Titles: 136
Average voters: 57332
Most common genres:
1. comedy (128 titles)
2. school (118 titles)
3. romance (86 titles)
4. shounen (51 titles)
5. ecchi (48 titles)
6. harem (42 titles)
7. supernatural (28 titles)
8. action (23 titles)

TV Anime Group 2
Titles: 149
Average voters: 57040
Most common genres:
1. action (145 titles)
2. fantasy (101 titles)
3. shounen (72 titles)
4. adventure (61 titles)
5. supernatural (48 titles)
6. comedy (48 titles)
7. magic (33 titles)
8. game (21 titles)

TV Anime Group 3
Titles: 125
Average voters: 43241
Most common genres:
1. action (117 titles)
2. sci-fi (72 titles)
3. mecha (37 titles)
4. seinen (26 titles)
5. drama (20 titles)
6. super power (18 titles)
7. supernatural (16 titles)
8. adventure (14 titles)

TV Anime Group 4
Titles: 184
Average voters: 33462
Most common genres:
1. school (49 titles)
2. drama (43 titles)
3. slice of life (43 titles)
4. romance (40 titles)
5. fantasy (37 titles)
6. shoujo (32 titles)
7. supernatural (31 titles)
8. mystery (26 titles)

TV Anime Group 5
Titles: 110
Average voters: 25340
Most common genres:
1. slice of life (110 titles)
2. comedy (109 titles)
3. school (46 titles)
4. seinen (27 titles)
5. shounen (8 titles)
6. romance (7 titles)
7. drama (5 titles)
8. fantasy (5 titles)

TV Anime Group 6
Titles: 142
Average voters: 21975
Most common genres:
1. comedy (142 titles)
2. supernatural (27 titles)
3. shounen (22 titles)
4. fantasy (21 titles)
5. mystery (16 titles)
6. romance (16 titles)
7. action (14 titles)
8. drama (14 titles)
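For the curious, a per-group summary like the ones above could be produced roughly as follows. This is only a sketch: `summarize_group` is a hypothetical helper, and the three-title group is invented rather than taken from MAL.

```python
from collections import Counter
from statistics import mean


def summarize_group(titles, top_n=8):
    """titles: list of (genres, voters) pairs belonging to one group."""
    counts = Counter(g for genres, _ in titles for g in genres)
    return {
        "titles": len(titles),
        "average_voters": round(mean(v for _, v in titles)),
        "most_common": counts.most_common(top_n),
    }


# Hypothetical three-title group; not real MAL data.
group = [
    ({"comedy", "school", "romance"}, 90000),
    ({"comedy", "school"}, 60000),
    ({"comedy", "ecchi"}, 30000),
]
summary = summarize_group(group)
print("Titles:", summary["titles"])
print("Average voters:", summary["average_voters"])
for rank, (genre, n) in enumerate(summary["most_common"], start=1):
    print(f"{rank}. {genre} ({n} titles)")
```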
I was working with 846 anime titles and 40 different genre tags. This time I paid special attention to the number of clusters. It’s hard to say what the correct number of clusters is (it could be just one, it could be all 846). Unless a metric such as the silhouette score I used last time gives a clear answer, you have to ask yourself what kinds of anime these clusters should contain. We obviously want as few clusters as possible so that we can generalize well, while at the same time we want to see popular anime grouped in their own clusters, based above all on each title’s genre similarities.
Alongside each anime’s average user rating, MAL also lists the number of votes the title has received. Casting a vote is an expression of emotional involvement. One sign of this is that the vote count tends to be higher for very good shows and for very bad shows. We can exploit this property and say that shows with a higher vote count were also, most likely, the most watched.
Therefore we choose the number of clusters with the vote count in mind. We look at each cluster’s average vote count and try to maximize it by either lowering or increasing the number of clusters. As the number of clusters changes, anime may fall into different clusters, and their vote counts move with them.
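The comparison described above can be sketched like this: for each candidate number of clusters, compute the mean and median vote count of every cluster and look at how they change. The vote counts and cluster assignments below are hypothetical, and `vote_profile` is a name I made up for this example. In practice the assignments would come from rerunning the clustering with a different number of clusters.

```python
from collections import defaultdict
from statistics import mean, median


def vote_profile(labels, votes):
    """Return (mean, median) vote counts per cluster, most popular first."""
    by_cluster = defaultdict(list)
    for title, cluster in labels.items():
        by_cluster[cluster].append(votes[title])
    stats = [(mean(v), median(v)) for v in by_cluster.values()]
    return sorted(stats, reverse=True)


# Hypothetical vote counts and two candidate groupings (k=2 and k=3).
votes = {"A": 90000, "B": 70000, "C": 20000, "D": 10000, "E": 8000}
candidates = {
    2: {"A": 0, "B": 0, "C": 1, "D": 1, "E": 1},
    3: {"A": 0, "B": 0, "C": 1, "D": 2, "E": 2},
}
for k, labels in candidates.items():
    print(k, vote_profile(labels, votes))
```

In this toy case, going from two to three clusters splits the mid-popularity title C off from the two least voted ones, which is the kind of shift you would then judge by eye.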
There was more sorcery involved, but ultimately I settled on the six clusters that you can see above. Some are less popular than others; popularity here is measured by the aforementioned vote count average.
This optimization problem wasn’t a cakewalk, but then again I didn’t bother with any formal mathematical procedure for it. I settled on analyzing the means and medians with the eyeball method. Ask if you want me to go into more detail, but I doubt you do. Also, I had to do some scripting to make all of this work in the first place. Code will be made available sometime in the future.
I still think AniDB’s tagging system is better, because they have over 180 tags, whereas MAL has only 40 or so. AniDB also lets users assign a weight to each tag. Unfortunately these features only pay off with a large base of active users, which AniDB, compared to MAL, doesn’t have. It’s hard to say what the future has in store for this experiment, but for it to have the best conditions one would need to build a better, more detailed, and more popular anime database. Contact me if you’re into that kind of thing.