Exploratory Data Analysis involving movie data

Jonna Wang
6 min read · Jan 17, 2021

Hmm… this blog post walks through how I analyzed multiple movie datasets and shares how my first data science project felt.

  • “Will this project be hard for a starter?”
  • “Nah, don’t be nervous.”

Data Understanding

Before starting a project like this, it helps to have a clear idea of what you want to analyze. There were multiple large datasets, so I started by previewing all of them. I was most interested in “successful” movies. Wait, what is a successful movie? That’s the first question to think about. I think a “successful” movie can be defined by how much money was put in versus how much it earned. So, in my analysis, I set the threshold as follows: a movie is successful if either its domestic gross or its worldwide gross is at least two times its budget.
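Here is a minimal sketch of that threshold in pandas. The data and the column names (`budget`, `domestic_gross`, `worldwide_gross`) are made up for illustration and may not match the real datasets.

```python
import pandas as pd

# Hypothetical figures; the column names are assumptions, not the real schema.
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "budget": [100, 50, 80, 10],
    "domestic_gross": [250, 60, 90, 5],
    "worldwide_gross": [400, 120, 100, 8],
})

# "Successful": domestic OR worldwide gross is at least twice the budget.
is_successful = (
    (movies["domestic_gross"] >= 2 * movies["budget"])
    | (movies["worldwide_gross"] >= 2 * movies["budget"])
)

successful = movies[is_successful]
unsuccessful = movies[~is_successful]
print(successful["title"].tolist())
```

Keeping the boolean mask in its own variable makes it easy to build the “unsuccessful” data frame from the same rule.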

Questions to consider:

  • What genres should be chosen among those successful movies?
  • Which directors and writers should be hired among those successful movies?
  • What is the optimal movie-length among those successful movies?
  • Which month has the most releases of successful movies?

Data Cleaning and Processing

Data cleaning is the most crucial and fundamental step in any data science project. When you have multiple data sets, I would recommend starting with the one that has unique identifiers.

My cleaning steps:

  1. When you start on a new data set, drop unnecessary columns
  2. Check the current data set for duplicates
  3. Combine the other data frames into one single data frame
  4. Make sure every column has the correct data type
  5. Check for null values
  6. Use a sub-data frame (select only the columns that relate to the question)
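The steps above can be sketched roughly as follows. Both tables, every column name, and the `$1,000` budget format are assumptions made for illustration, not the actual project data.

```python
import pandas as pd

# Hypothetical tables; all names and values are made up for illustration.
basics = pd.DataFrame({
    "movie_id": ["m1", "m2", "m2", "m3"],
    "title": ["A", "B", "B", "C"],
    "runtime_minutes": ["100", "95", "95", None],
    "notes": ["x", "y", "y", "z"],
})
budgets = pd.DataFrame({
    "movie_id": ["m1", "m2", "m3"],
    "budget": ["$1,000", "$2,500", "$300"],
})

# 1. Drop columns that are not needed.
basics = basics.drop(columns=["notes"])

# 2. Count duplicates, then drop them.
n_duplicates = basics.duplicated().sum()
basics = basics.drop_duplicates()

# 3. Combine the data frames on the unique identifier.
df = basics.merge(budgets, on="movie_id", how="inner")

# 4. Fix data types: numeric strings to numbers, "$1,000" to 1000.
df["runtime_minutes"] = pd.to_numeric(df["runtime_minutes"])
df["budget"] = df["budget"].str.replace("[$,]", "", regex=True).astype(int)

# 5. Check null values per column.
null_counts = df.isna().sum()

# 6. Sub-data frame with only the columns one question needs.
runtime_df = df[["movie_id", "runtime_minutes"]].dropna()
print(null_counts)
```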

Question 1

What genres should be chosen among those successful movies?

  • This question expects to find the most frequent genres.
  • Are the most frequent genres always a good choice?
  • I think the better way to answer this question is to approach it from two perspectives: find the frequency and the average budget of each genre.

We can see that the genre values are long strings that combine several genres. If we simply use value_counts() to find the frequency of different genres, every combination counts as its own “unique” type. So it is much better to split the long strings and convert each genre into its own row.

First, I dropped the rows that contain null values. Then I used the str.split() method to separate each long string into a list. After that, I used the .explode() method to turn each element of the list into its own row, replicating the index values.
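A small sketch of those three steps, assuming the genres are stored as comma-separated strings in a column called `genres` (both the separator and the column names are assumptions):

```python
import pandas as pd

# Hypothetical data; the comma separator and column names are assumptions.
genres_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["Action,Adventure", None, "Drama"],
    "budget": [200, 50, 30],
})

genres_long = (
    genres_df.dropna(subset=["genres"])                     # drop rows with no genre
    .assign(genres=lambda d: d["genres"].str.split(","))    # string -> list
    .explode("genres")                                      # one row per genre
)
print(genres_long)
```

Movie “A” now appears twice, once per genre, and both rows keep the original index value 0.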

I used the groupby function to group rows by genre and the agg function to find the count and average budget of each genre.
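One way this could look, using named aggregation on an already exploded table (the data and the output column names `count` and `avg_budget` are illustrative assumptions):

```python
import pandas as pd

# Hypothetical exploded table: one row per (movie, genre) pair.
genres_long = pd.DataFrame({
    "genres": ["Action", "Action", "Drama", "Comedy", "Drama", "Action"],
    "budget": [200, 100, 30, 40, 50, 150],
})

genre_stats = (
    genres_long.groupby("genres")
    .agg(count=("budget", "size"), avg_budget=("budget", "mean"))
    .sort_values("count", ascending=False)
)
print(genre_stats)
```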

I used subplots to plot the genre counts and the average budget of each genre side by side, which makes them easier to compare. For example, “Action” is one of the most frequent movie genres, but its average budget is also much higher. This bar plot suggests that the choice of genre can depend on how much budget you have.
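A rough sketch of such a side-by-side plot with matplotlib. The numbers are made up, and the `Agg` backend is only there so the sketch runs without a display; in a notebook you would likely leave that line out.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical per-genre summary; the values are made up for illustration.
genre_stats = pd.DataFrame(
    {"count": [3, 2, 1], "avg_budget": [150.0, 40.0, 40.0]},
    index=["Action", "Drama", "Comedy"],
)

# Two bar charts next to each other: frequency on the left, budget on the right.
fig, (ax_count, ax_budget) = plt.subplots(1, 2, figsize=(10, 4))
genre_stats["count"].plot.bar(ax=ax_count, title="Number of successful movies")
genre_stats["avg_budget"].plot.bar(ax=ax_budget, title="Average budget")
fig.tight_layout()
```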

Question 2

Which directors and writers should be hired among those successful movies?

  • This question expects to find the most frequent director and writer names.
  • Create two sub-data frames to analyze this question.

Wait, there were no “real” names, only numeric IDs, and those IDs were also stored as long strings…

Don’t worry, one of the given data frames uses those numeric IDs as unique identifiers for names, so we can merge it with our data frame. In other words, we translate the IDs into names. Before translating, we need to split the long strings and turn each element of the list into its own row, following similar steps to before.

I used pd.merge() to combine the two data frames and translate the IDs into names. I followed the same steps to find the top writers.
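A minimal sketch of this translation step. The ID format, the column names, and the people’s names below are all invented for illustration.

```python
import pandas as pd

# Hypothetical data; IDs, column names, and person names are all made up.
movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "directors": ["p1,p2", "p1", "p3"],
})
names = pd.DataFrame({
    "person_id": ["p1", "p2", "p3"],
    "name": ["Dana Lee", "Sam Ortiz", "Kim Park"],
})

# Split the ID strings and give each director their own row.
directors_long = (
    movies.assign(directors=movies["directors"].str.split(","))
    .explode("directors")
)

# Translate IDs into names by merging on the identifier.
named = pd.merge(
    directors_long, names,
    left_on="directors", right_on="person_id", how="left",
)

top_directors = named["name"].value_counts()
print(top_directors)
```

A left merge keeps every movie row even if an ID has no match in the name table, so missing names show up as nulls instead of silently disappearing.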

Have you considered that some directors or writers may also have “unsuccessful” movies? I searched the “unsuccessful” data frame to find the “successful” names that also appear there. As you can see, some of the top directors/writers have a perfect success rate, while others have some failures. I think successful experience is necessary, but unsuccessful experience is also important, because the best lessons are learned from failures.
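One possible way to compute a success rate per person, assuming each data frame has one row per (movie, person) pair. The names and counts are made up, and this is a sketch rather than the exact method used in the project.

```python
import pandas as pd

# Hypothetical data: one row per (movie, director) pair in each group.
successful = pd.DataFrame({"name": ["Dana Lee", "Dana Lee", "Sam Ortiz", "Kim Park"]})
unsuccessful = pd.DataFrame({"name": ["Sam Ortiz", "Alex Roe"]})

wins = successful["name"].value_counts()

# Count failures only for people who appear among the successful names;
# anyone with no failures gets 0 instead of a missing value.
losses = unsuccessful["name"].value_counts().reindex(wins.index, fill_value=0)

success_rate = wins / (wins + losses)
print(success_rate)
```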

Question 3

What is the optimal movie-length among those successful movies?

  • This question expects to find numbers that are representative of the data.

Well, you may think answering this question is just a matter of finding the average (mean) running time of successful movies. But what if there are outliers, which can easily affect the mean? The median could be a reasonable option. But I think it is much better to give a range for the optimal movie length.

I found the values between the 25% quantile and the 75% quantile (the middle 50% of the successful movie data) and plotted them as a boxplot. A boxplot is a convenient way to visualize the quantiles, min/max, and outliers. So the optimal movie length is between 97 and 121 minutes.
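To illustrate the idea, here is a small sketch with invented runtimes (not the project data) that compares the mean, the median, and the 25%/75% quantiles:

```python
import pandas as pd

# Hypothetical runtimes in minutes; 240 plays the role of an outlier.
runtimes = pd.Series(
    [80, 90, 95, 100, 105, 110, 120, 130, 240], name="runtime_minutes"
)

mean_runtime = runtimes.mean()
median_runtime = runtimes.median()

# The middle 50% of the data lies between these two quantiles.
q1, q3 = runtimes.quantile([0.25, 0.75])
print(f"mean={mean_runtime:.1f}, median={median_runtime}, range={q1}-{q3}")
```

In this toy data the single long movie pulls the mean above the median, while the quantile range is not affected by it.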

Question 4

Which month has the most releases of successful movies?

  • This question expects to find the most frequent month.

I used dt.month to extract the month, but this gives a number. So I defined a function and applied it to the sub-data frame to translate the numbers into month names.
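A sketch of how this could be done. The dates and column names are made up, and the helper `month_name` is a hypothetical stand-in for the function mentioned above.

```python
import calendar
import pandas as pd

# Hypothetical release dates; the column names are assumptions.
releases = pd.DataFrame({
    "release_date": ["2010-11-05", "2011-12-16", "2012-11-21", "2013-06-07"],
})
releases["release_date"] = pd.to_datetime(releases["release_date"])

# dt.month gives the month as a number (1-12).
releases["month_num"] = releases["release_date"].dt.month

def month_name(n):
    """Translate a month number (1-12) into its English name."""
    return calendar.month_name[n]

releases["month"] = releases["month_num"].apply(month_name)

# Share of releases per month, as fractions that sum to 1.
month_share = releases["month"].value_counts(normalize=True)
print(month_share)
```

pandas also offers `dt.month_name()`, which returns the names directly, so the custom function is optional.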

We can see that the largest shares of successful movies were released in November and December.

PS:

Haha, how do you feel about it? It wasn’t that hard, right? Before I began this project, I was nervous. But don’t let nerves beat you. Have a clear idea of what you are going to do. Break it into steps. Code not working? Google will always be your best friend.
