The IMDb Analysis: Genres and Ratings of Movies released between 2008-2018

A blog post by Georgia Charalambous, Cheuk Yim Edwin Cheng & Lianne Huitema
An assignment for the MOOC Data Analysis: Take it to the MAX()

Introduction

The film industry has grown dramatically in recent years. In 2016, 736 films were released in US cinemas, twice the number released in 2000. A similar pattern can be observed in data from the UK [1].

According to its website, IMDb is the world’s most popular source for movie and TV content. Its database contains more than 250 million entries, including 4 million entries about movies [2].


The purpose of our research is to examine whether the genre of a movie contributes to its rating. To find an answer, we conducted a data analysis in two main parts, preceded by some general remarks on the dataset. The first part determines which movie genres are rated highest and lowest, and then examines whether the number of genres a movie has affects its overall rating. The second part looks for trends and patterns in the ratings of movie genres and examines whether there is a connection between the number of votes and the rating of a movie genre over a specific time period.

Dataset

The dataset used in the analysis was obtained from the IMDb datasets [3], where subsets of IMDb data are offered for personal and non-commercial use [4]. The data is divided into seven datasets in tab-separated-values format. Each dataset contains different information, but every entry carries a unique identifier that links the datasets together. This analysis uses only two of them: the “basics” dataset, which contains information about the titles in IMDb, and the “ratings” dataset, which contains the IMDb rating of each title. The two datasets were merged on the unique title identifier, and only the attributes needed for the research were kept. The final dataset contains the following attributes:

  • tconst – an alphanumeric unique identifier of the titles
  • titleType – the type/format of the title (e.g. movie, tvSeries, short)
  • startYear – the release year of a title
  • genres – genres associated with the title (up to 3 genres)
  • averageRating – average of all the individual user ratings
  • numVotes – number of votes the title has received
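The merge on tconst can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the Excel and R workflow we actually used; the two tiny tab-separated samples and the helper names read_tsv and merge_on_tconst are made up for the example.

```python
import csv
import io

# Tiny made-up samples in the IMDb tab-separated layout (not real data).
BASICS_TSV = (
    "tconst\ttitleType\tstartYear\tgenres\n"
    "tt0000001\tmovie\t2010\tDrama,Romance\n"
    "tt0000002\tshort\t2012\tComedy\n"
    "tt0000003\tmovie\t2015\t\\N\n"
)
RATINGS_TSV = (
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t6.5\t1200\n"
    "tt0000003\t7.1\t30\n"
)


def read_tsv(text):
    """Parse tab-separated text into a list of dicts keyed by column name."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def merge_on_tconst(basics, ratings):
    """Attach rating columns to every basics row via the shared tconst key."""
    ratings_by_id = {row["tconst"]: row for row in ratings}
    merged = []
    for row in basics:
        rating = ratings_by_id.get(row["tconst"], {})
        merged.append({
            **row,
            # Titles without a rating get the same missing-value marker IMDb uses.
            "averageRating": rating.get("averageRating", "\\N"),
            "numVotes": rating.get("numVotes", "\\N"),
        })
    return merged


merged = merge_on_tconst(read_tsv(BASICS_TSV), read_tsv(RATINGS_TSV))
for row in merged:
    print(row)
```

Keeping unrated titles in the merged result (with the missing-value marker) is a choice made for this sketch so that the later cleaning step has something to remove.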

After merging the two datasets, we decided to focus the analysis on movies only, so we filtered out other types such as ‘short’ or ‘videoseries’. That left 483,132 movies, so we narrowed our scope to movies released in the last 10 years (2008 – April 2018). This resulted in 159,062 movies, of which only 74,307 have a rating score. According to the IMDb datasets page, the character sequence ‘\N’ indicates that a field is missing or has a null value. Before conducting the analysis we therefore cleaned the dataset by removing all entries that contained ‘\N’. This left 73,019 movie entries for our analysis.
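The filtering and cleaning steps above could look roughly like this in Python. It is only a sketch under assumptions: the rows are invented, the function name clean_movies is hypothetical, and the three conditions are applied in one pass rather than in the order we applied them, which gives the same result.

```python
MISSING = "\\N"  # IMDb's marker for a missing or null field

# Made-up rows in the merged layout (not real IMDb entries).
rows = [
    {"tconst": "tt1", "titleType": "movie", "startYear": "2010",
     "genres": "Drama", "averageRating": "6.5", "numVotes": "1200"},
    {"tconst": "tt2", "titleType": "short", "startYear": "2012",
     "genres": "Comedy", "averageRating": "7.0", "numVotes": "50"},
    {"tconst": "tt3", "titleType": "movie", "startYear": "2005",
     "genres": "Horror", "averageRating": "4.8", "numVotes": "900"},
    {"tconst": "tt4", "titleType": "movie", "startYear": "2015",
     "genres": MISSING, "averageRating": "6.4", "numVotes": "30"},
    {"tconst": "tt5", "titleType": "movie", "startYear": "2016",
     "genres": "Comedy", "averageRating": MISSING, "numVotes": MISSING},
]


def clean_movies(rows, first_year=2008, last_year=2018):
    """Keep movies in the year range that have no missing fields."""
    kept = []
    for row in rows:
        if row["titleType"] != "movie":
            continue  # drop shorts, series and other title types
        if MISSING in row.values():
            continue  # drop entries with any missing value
        if not first_year <= int(row["startYear"]) <= last_year:
            continue  # drop movies outside the studied period
        kept.append(row)
    return kept


cleaned = clean_movies(rows)
print([row["tconst"] for row in cleaned])
```

The missing-value check runs before the year is converted to an integer, so a row whose startYear is ‘\N’ is skipped rather than raising an error.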

Two different tools were used for the analysis. The first part was done in Microsoft Excel, using pivot tables and its lookup and sum functions. The second part was done in the programming language R, with plots created using the package “ggplot2” [5].

Data Analysis

General findings

Before analyzing the individual genres, we looked at the dataset as a whole. After cleaning, it contained 73,019 movies, spread over 929 unique genres or genre combinations. Of these, 219 contained only one movie. Some interesting examples of single-movie combinations are Action,Adventure,Musical; Documentary,Game-Show,History; and Family,Music,Sport. Each movie has at least one and at most three genres.

While examining the data we also found 1,288 rated movies without a genre, which is exactly the number of entries removed during cleaning (74,307 − 73,019). Their genre field was ‘\N’, which, as mentioned before, means that the information is missing. Several causes are conceivable for a “movie without a genre”. These 1,288 movies received 39,036 votes in total, about 30 votes per movie. This is far fewer than the average number of votes per movie in the whole dataset, which is roughly 3,707. The overall average rating of all movies released between 2008 and 2018 was 6.38; for the movies without a genre it was 6.44. Although 1,288 may seem like a large number of movies, it is only about 1.7% of the 74,307 rated movies. No clear conclusions can be drawn from these findings, but they should be kept in mind when interpreting the analysis.
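Counting genre combinations of this kind is straightforward. The following Python sketch uses a handful of invented genre strings (the variable names are illustrative only) and also re-checks the votes-per-movie figure quoted above.

```python
from collections import Counter

# Made-up genre fields, one per movie, in IMDb's comma-separated form.
genre_fields = [
    "Drama",
    "Comedy,Drama",
    "Drama",
    "Action,Adventure,Musical",
    "Comedy,Drama",
    "Documentary",
]

# Each distinct string counts as one genre or genre combination.
combo_counts = Counter(genre_fields)
unique_combos = len(combo_counts)

# Combinations that occur for exactly one movie.
single_movie_combos = sorted(c for c, n in combo_counts.items() if n == 1)

print(unique_combos)
print(single_movie_combos)

# Arithmetic check on the figures from the text:
# 39,036 votes spread over 1,288 genre-less movies.
votes_per_movie = round(39_036 / 1_288, 1)
print(votes_per_movie)
```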

Part A

What are the highest and lowest rated genres?

Table I. Top 5 Genres in Rating
Genre       | Average Rating | Number of Votes
Documentary | 7.27           | 218,653
Biography   | 6.82           | 4,910,776
History     | 6.69           | 1,923,857
War         | 6.61           | 1,918,097
Music       | 6.49           | 775,093

Table II. Bottom 5 Genres in Rating
Genre    | Average Rating | Number of Votes
Action   | 5.70           | 30,839,648
Western  | 5.70           | 63,005
Sci-Fi   | 5.56           | 11,097,552
Thriller | 5.46           | 9,588,469
Horror   | 4.89           | 4,172,760

In the tables above, we ranked the genres by the average rating (on a scale from 1 to 10) of the associated movies, and list the number of votes alongside. In Table I, we can see that movies categorized as Documentary have the highest average rating, 7.3, with around 200 thousand individual votes, followed by Biography, rated 6.8 on average with approximately 4.9 million votes.

In Table II, we can see that movies with Horror among their genres are rated lowest, at 4.9 out of 10. The second lowest rating belongs to Thriller, at around 5.5. Several of the lowest rated genres (Action, Sci-Fi, Thriller) also have very high vote counts. This may suggest that IMDb voters are more inclined to vote on movies they find mediocre, voicing their opinion with a below-average score when a film falls short of their expectations.
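A per-genre ranking like the one in Tables I and II can be sketched in Python as follows. The sketch assumes that a movie counts towards each of its genres and that the genre average is a plain (unweighted) mean of the movie ratings; the movies and the function name rating_by_genre are invented for illustration.

```python
from collections import defaultdict

# Made-up movies (not real IMDb entries).
movies = [
    {"genres": "Documentary", "averageRating": 7.5, "numVotes": 100},
    {"genres": "Horror,Thriller", "averageRating": 4.5, "numVotes": 2000},
    {"genres": "Drama,Thriller", "averageRating": 6.5, "numVotes": 500},
]


def rating_by_genre(movies):
    """Return {genre: (mean rating, total votes)} over all movies with that genre."""
    ratings = defaultdict(list)
    votes = defaultdict(int)
    for movie in movies:
        # A movie with several genres contributes to each of them.
        for genre in movie["genres"].split(","):
            ratings[genre].append(movie["averageRating"])
            votes[genre] += movie["numVotes"]
    return {g: (sum(r) / len(r), votes[g]) for g, r in ratings.items()}


table = rating_by_genre(movies)

# Genres ordered from highest to lowest mean rating.
ranked = sorted(table, key=lambda g: table[g][0], reverse=True)
for genre in ranked:
    print(genre, table[genre])
```

Weighting each movie by its number of votes would be an equally defensible choice and can change the ranking, so the averaging method matters when comparing results.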

This leads to another question we would like to examine:

How are movies rated when they include multiple genres?

Table III. Top 5 Combination of Genres in Rating
Genre                      | Average Rating | Number of Votes
Comedy,Drama,History       | 9.00           | 19
Animation,Biography,Comedy | 8.90           | 24
Adventure,Drama,Sci-Fi     | 8.60           | 1,164,336
Adventure,Drama,War        | 8.30           | 1,034,433
Crime,Drama,Sport          | 8.20           | 22

Table IV. Bottom 5 Combination of Genres in Rating
Genre                    | Average Rating | Number of Votes
History,War              | 2.60           | 717
Action,Comedy,Music      | 2.60           | 766
History,Horror,Thriller  | 2.30           | 268
Action,Comedy,Drama      | 2.20           | 425
Mystery,Romance          | 1.70           | 190

In the dataset, most movies are categorized under more than one genre. For example, the movie “The Amazing Spider-Man” falls under the genres Action,Adventure,Sci-Fi. A maximum of three genres per movie is allowed.

Tables III and IV show the top 5 and bottom 5 genre combinations by rating. In Table III, Comedy,Drama,History and Animation,Biography,Comedy come out on top with ratings close to 9 out of 10. However, these results may not be reliable, because they are based on very few votes (19 and 24). Adventure,Drama,Sci-Fi and Adventure,Drama,War rank third and fourth with scores of 8.6 and 8.3 respectively, each with more than 1 million votes. Overall, the genre Drama appears most often in the top 5 (in four of the five combinations), even though it does not make the top 5 when genres are ranked individually.
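One way to guard against rankings driven by a handful of votes is to require a minimum vote count before a combination is ranked. We did not do this in our analysis; the Python sketch below only illustrates the idea, using the Table III figures and a hypothetical cutoff of 1,000 votes.

```python
# (genre combination, average rating, number of votes), copied from Table III.
combos = [
    ("Comedy,Drama,History", 9.00, 19),
    ("Animation,Biography,Comedy", 8.90, 24),
    ("Adventure,Drama,Sci-Fi", 8.60, 1_164_336),
    ("Adventure,Drama,War", 8.30, 1_034_433),
    ("Crime,Drama,Sport", 8.20, 22),
]

MIN_VOTES = 1_000  # hypothetical cutoff, chosen only for this illustration


def well_supported(combos, min_votes=MIN_VOTES):
    """Keep only combinations whose rating rests on at least min_votes votes."""
    return [combo for combo in combos if combo[2] >= min_votes]


reliable = well_supported(combos)
print([name for name, _, _ in reliable])
```

With this cutoff only Adventure,Drama,Sci-Fi and Adventure,Drama,War remain, which matches the two combinations singled out in the text as having more than 1 million votes.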

Table IV lists the lowest rated genre combinations. The combination of Mystery and Romance is the least favoured by the voters. Interestingly, even though Drama usually works well with other elements, it does not go well with Comedy and Action: that combination ranks second to last in the table.

Does putting more elements in a movie result in a higher rating?

Table V. Number of Genre(s) with Average Rating
Number of Genre(s) | Average Rating
1                  | 6.09
2                  | 5.87
3                  | 5.97

Table V shows the average rating of movies broken down by the number of genres associated with them. However, this does not give a clear picture of how an increase in the number of genres affects the rating. Therefore, we next look at how the rating differs among movies that share a particular genre.
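Both breakdowns, overall and within one genre, can be produced by the same small routine. The Python sketch below uses invented movies and a hypothetical function rating_by_genre_count; passing a genre restricts the calculation to movies that include it.

```python
from collections import defaultdict

# Made-up (genres, rating) pairs, not real IMDb entries.
movies = [
    ("Biography", 6.0),
    ("Biography,Drama", 6.5),
    ("Biography,Drama,History", 7.0),
    ("Comedy", 5.0),
    ("Biography,Crime", 6.5),
]


def rating_by_genre_count(movies, genre=None):
    """Mean rating per number of genres, optionally only for movies with `genre`."""
    buckets = defaultdict(list)
    for genres, rating in movies:
        parts = genres.split(",")
        if genre is not None and genre not in parts:
            continue  # skip movies that do not include the requested genre
        buckets[len(parts)].append(rating)
    return {n: sum(r) / len(r) for n, r in sorted(buckets.items())}


overall = rating_by_genre_count(movies)
biography_only = rating_by_genre_count(movies, genre="Biography")
print(overall)
print(biography_only)
```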

Table VI. Average Rating in Biography
Biography
Number of Genre(s) | Average Rating
1                  | 6.18
2                  | 6.58
3                  | 6.94

The table above shows the average rating of movies that are categorized only as Biography and of those that combine Biography with other genres. When more genres are added to a biography, the average rating increases as well, showing a positive correlation.

But does the positive correlation hold for other genres as well?

Table VII. Average Rating in Animation
Animation
Number of Genre(s) | Average Rating
1                  | 7.10
2                  | 6.30
3                  | 6.20

Table VII suggests that putting more elements in an animation movie might not be a great idea: the average rating drops as genres are added, showing that the positive correlation does not always hold.

In a nutshell, the effect of mixing elements in a movie depends greatly on the individual genre. Putting more elements in a movie does not always mean positive feedback from the audience!

Part B

How do the ratings of the most produced movie genres change over time?

The aim of this part of the analysis is to examine how the ratings of the most produced movie genres changed over a 10-year period (2008-2018). By most produced genres we mean the genres or genre combinations with the highest number of movies. Table VIII shows these top 10 movie genres and their total number of films over the last 10 years. For these genres we collected all movies and their corresponding ratings, calculated the average rating per genre per year, and plotted the results in Figure I.

Table VIII. Number of Movies per Genre
Movie Genre          | Number of Films
Drama                | 11989
Documentary          | 10202
Comedy               | 5856
Comedy,Drama         | 2632
Horror               | 2586
Thriller             | 1578
Drama,Romance        | 1575
Comedy,Drama,Romance | 1277
Comedy,Romance       | 1273
Horror,Thriller      | 1000
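The procedure described above (pick the most produced genres, then average their ratings per year) can be sketched in Python as follows. We did this step in R; the data here is invented, the function names top_genres and yearly_average are hypothetical, and each full genre string is treated as its own label, as in Table VIII.

```python
from collections import Counter, defaultdict

# Made-up (year, genres, rating) records, not real IMDb entries.
movies = [
    (2008, "Drama", 6.0),
    (2008, "Drama", 6.5),
    (2009, "Drama", 7.0),
    (2008, "Documentary", 7.0),
    (2009, "Documentary", 7.5),
    (2009, "Horror", 4.5),
]


def top_genres(movies, n):
    """Return the n genre labels with the most movies."""
    counts = Counter(genres for _, genres, _ in movies)
    return [genres for genres, _ in counts.most_common(n)]


def yearly_average(movies, genres):
    """Mean rating per (genre, year) for the selected genre labels."""
    buckets = defaultdict(list)
    for year, label, rating in movies:
        if label in genres:
            buckets[(label, year)].append(rating)
    return {key: sum(r) / len(r) for key, r in buckets.items()}


top = top_genres(movies, 2)
trend = yearly_average(movies, top)
print(top)
print(trend)
```

The resulting (genre, year) averages are the values one would plot as one line per genre to obtain a chart like Figure I.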


Figure I. Average Rating per Movie Genre over the Years

The figure shows that almost all movie genres have an upward trend in rating. One possible explanation is an improvement in the quality of the films. Overall, the ranking of the movie genres by rating is fairly constant. Documentary movies are the highest rated in all ten years, while Horror and Horror,Thriller movies have the lowest ratings. An interesting trend is that movies with Drama in their list of genres seem to form a cluster and have approximately the same rating throughout the period. Surprisingly, this cluster also increased its rating the most: in 2008 the ratings ranged from 6.0 to 6.5, while in 2018 they ranged from 7.0 to 7.5.

None of the genres has an average rating above 7.5 out of 10 in any of the years. This leads us to the following assumption: the rating of a genre does not necessarily influence the number of films produced in that genre. For example, comedy films are the third most produced even though their average rating does not get higher than 6.5 out of 10. Other factors, such as the budget and the box office of a film, probably contribute to the decision to produce a movie of a specific genre. Moreover, it is impossible to draw firm conclusions about what influences the rating of a movie without further analysis that includes additional factors, e.g. the cast and crew of the film.

To sum up, the current analysis shows that the ratings of the most produced movie genres from 2008 to 2018 have not changed significantly, with the exception of Drama movies. It therefore seems that the preferences of viewers have remained largely the same over the last 10 years.


Figure II. Percentages of Votes per Genre over the Years

How does the number of votes for the most produced movie genres change over time?

One flaw of our dataset is that it provides limited information on the popularity of a movie, whether in terms of the total number of viewers or the number of viewers who enjoyed the film. We decided to use the attribute numVotes (the number of votes a movie has received) as an indicator of popularity: a movie that receives more votes, we assumed, is more popular than one with fewer votes. We are aware that this assumption is not entirely correct, since a movie can receive a high number of votes that are mostly low ratings. The assumption is made only for the purposes of the current analysis, and the results should be interpreted with caution.

We used this assumption to see how the popularity (based on votes) of the most produced movie genres changed over the period 2008-2018. We used the same 10 movie genres as in the previous question and computed the percentage (%) of votes that each genre received per year. To calculate the percentage, we first found the number of votes each genre received in a given year and then divided it by the total number of votes that the 10 genres received in that year. The results are plotted in Figure II.
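The percentage calculation described above amounts to the following. This is a small Python sketch with invented vote counts and a hypothetical function name vote_share; it is not the R code we used.

```python
from collections import defaultdict

# Made-up (year, genre, numVotes) records, one per movie.
votes = [
    (2008, "Drama", 600),
    (2008, "Comedy", 300),
    (2008, "Thriller", 100),
    (2009, "Drama", 250),
    (2009, "Comedy", 250),
]


def vote_share(votes):
    """Percentage of each year's votes that went to each genre."""
    per_genre_year = defaultdict(int)
    per_year = defaultdict(int)
    for year, genre, count in votes:
        per_genre_year[(genre, year)] += count  # votes for this genre in this year
        per_year[year] += count                 # votes for all selected genres that year
    return {
        (genre, year): 100 * count / per_year[year]
        for (genre, year), count in per_genre_year.items()
    }


share = vote_share(votes)
for key, pct in sorted(share.items(), key=lambda item: (item[0][1], item[0][0])):
    print(key, pct)
```

Because the denominator only covers the selected genres, the shares within each year add up to 100%, and a genre's share can rise simply because the others received fewer votes.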

Looking at the figure, we can conclude that there is no consistent pattern in how users vote for movies of the genres we examined. However, over the 10-year period Thriller, Horror and Documentary are the three genres that received the fewest votes, with Thriller in last position every year. This was also seen in the first part of the analysis. An interesting observation concerns Comedy movies. In the years before 2018, as shown in the figure, they did not receive more than 25% of the votes, but in 2018 they received more than 35%. This does not necessarily mean that users enjoy Comedy movies more this year; it could be the result of more Comedy films having been released so far in 2018 compared to the other genres in the analysis.

To sum up, using the number of votes as an indicator of popularity did not give us a clear answer on how IMDb users change their voting across movie genres over the years. The only constant observation is that Thriller remains the genre that receives the lowest percentage of votes each year. As a result, with the current parameters we cannot conclude whether one genre is more popular than the others.

What is the connection between rating and number of votes?  

As part of our analysis we wanted to examine whether there is a connection between the rating and the number of votes a movie genre received. Comparing the two figures yields some interesting observations.

Documentary, despite having the highest rating among the examined genres, is one of the three genres that received the fewest votes. This observation can be interpreted in different ways. For example, one can assume that fewer people watch Documentary movies than other genres, hence the smaller number of votes, but that these viewers rate highly, which results in a high average rating overall. One observation common to both figures is that Thriller movies have not only a low rating but also the smallest number of votes. This can mean either that viewers of Thriller movies did not enjoy them, which resulted in the lower rating, or that with so few votes a handful of low ratings was enough to pull the average down even if other viewers rated highly.

However, when comparing the two figures, we cannot conclude anything about the rest of the movie genres. As we saw earlier, movies that include Drama in their list of genres tend to stick together and have almost the same rating (~6-7.5), but their shares of the votes sometimes differ hugely. For example, in 2010 Comedy,Drama movies had approximately 7.5% of the votes while Comedy,Drama,Romance had 30%. Therefore, it would be wrong of us to assume that a higher number of votes also means a higher rating.
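We compared the two figures by eye only. A reader who wants a number could compute a correlation coefficient between the per-genre ratings and vote shares; the Python sketch below shows a plain Pearson correlation on invented values and is not a result of our analysis.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Made-up per-genre values for a single year (illustration only).
ratings = [7.0, 6.5, 6.0, 5.0]
vote_shares = [5.0, 30.0, 25.0, 10.0]

r = pearson(ratings, vote_shares)
print(round(r, 2))
```

A value near +1 or -1 would indicate a strong linear relationship and a value near 0 little or none, although with only ten genres per year any such number should be read with caution.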

Summary, Discussion and Conclusion

The aim of this report was to analyze movies and their genres using the free online datasets provided by IMDb. After explaining how the dataset was obtained, cleaned and processed with Microsoft Excel and the language R, we analyzed 73,019 movies by their genre(s) and ratings. The average rating was 6.38, and 929 different genres or genre combinations were found. Part A looked at three questions. The first asked which genres were rated highest (Documentary and Biography) and lowest (Thriller and Horror). The second repeated the analysis for movies with multiple genres: Comedy,Drama,History and Animation,Biography,Comedy were rated highest, and Action,Comedy,Drama and Mystery,Romance lowest. The third asked whether putting more elements in a movie results in a higher rating. No relation was seen across all movie genres, but focusing on single genres showed a positive correlation for Biography and a negative one for Animation. Therefore, no general conclusion can be drawn on whether using more elements is good for a movie’s rating. In Part B, we found that the ratings of the most produced movie genres did not change significantly between 2008 and 2018; the only exception was the genre Drama. Over time, there was no consistent pattern in how users vote for movies of these genres.

Several things have to be kept in mind regarding this analysis. First, while it is sometimes easy to assign a genre to a movie, for other movies it may be harder to pick the one (or even 2 or 3) genre(s) that suit the movie best. The decision about which genre a movie belongs to can be very subjective. Also, about 1.7% of the rated movies (1,288 of 74,307) were not assigned any genre. Secondly, many more factors influence the rating of a movie. The director, the choice of actors, the budget, the location and the language are just some of the endless list of factors that play a role in the overall rating. Thirdly, we assumed that the number of votes reflects popularity. A high number of votes does not necessarily mean that a movie is considered popular. It could also mean that the viewers of a particular genre are simply more driven to rate movies on IMDb.

In the end, we made some interesting discoveries about the movie industry. For a more thorough analysis of movie genres and ratings, we encourage first looking into how movies are rated on the IMDb website and how genres are assigned to a film. Still, the free dataset enabled a lot of cool analysis, and we hope that you, as the reader, enjoyed it as much as we did.

We want to end on the note that if we ever think of changing our careers from data analysts to movie directors, we now know which genre to pick so that our movie has a chance of getting high ratings!

Thanks for reading our blog!

Georgia, Edwin, Lianne

 

[1] https://stephenfollows.com/how-many-films-are-released-each-year/
[2] https://help.imdb.com/article/imdb/general-information/what-is-imdb/G836CY29Z4SGNMK5?ref_=helpsect_cons_1_1#
[3] https://datasets.imdbws.com/
[4] https://www.imdb.com/interfaces/
[5] http://ggplot2.org/
