How to Use The Movie Database’s API

Johnny D
5 min read · Apr 4, 2021

I don’t even have any skills. You know, like nunchuk skills, computer hacking skills. Movie studios only want data scientists with skills. That’s why I enrolled in the full time data science program at Flatiron School. What you’re reading now is my own personal Napoleon Dynamite story with data science and Hollywood, and I’m hopeful that by the end of this bootcamp, I will return to this blog with my own figurative dance to Canned Heat by Jamiroquai.

What will my Canned Heat look like? Well, early on in the program I confessed to our instructor my desire to combine machine learning sentiment analysis with a broad dataset of movie and box office returns. I want to see if I can add just one additional layer of complexity to the well documented quest to understand how to predict box office hits.

Has this been done before? Yes, kind of, but not exactly in the way I’m hoping to accomplish it. At the very least, it would be fun to build a tool to actively predict box office revenues on Twitter or an interactive site as a bit of a passion project.

The stars aligned for our Phase 1 project, which coincidentally required us to use our new Python, data analysis, and visualization techniques to analyze box office revenues for a brand new hypothetical movie studio.

Our school provided some .csv files for analysis — datasets from Box Office Mojo, IMDB, Rotten Tomatoes, TMDB, and a few others. After experimenting with combining the datasets, I had a few concerns. The TMDB (The Movie Database) dataset had the most useful format for making a baseline dataframe, but there were two problems. First, I noticed a few duplicates in the data (which seemed intentionally inserted as a challenge). Second, combining TMDB data with something like Rotten Tomatoes to analyze critic and audience review scores left a lot of unused and mismatched data. Leaning on my years of data management, I felt strongly that the easiest way to resolve these issues would be to go straight to the source.

Taking a glance at the TMDB API, it seemed like it had most of what I was looking for — certainly more than what was provided in the project outset. MPAA ratings would be useful for demographic analysis. Movie budget, which could only be obtained by combining the somewhat broken dataframes provided in the project files, would be much cleaner straight from the API. Importantly, IMDB codes were also provided in the extended API data, meaning I could use these ubiquitous codes as keys to take the project further in the future.

Here’s a step-by-step guide to setting up the TMDB API for pulling data. Note: you could generate a session ID without creating a username, but generating your own API key will make things easier for repeat API pulls in the future.

  1. Create API key
  2. Head to https://www.themoviedb.org/signup?language=en-US to create an account
  3. Once created, navigate to the top right corner. You’ll see a single letter that represents your user name. Click once for the dropdown, and then click on Settings.
  4. On the left, click on API, and check out your API Key (v3 auth)

This is your key! Keep it safe and hidden.

Now that you have your key, you’re ready to start coding.

For the code below, we’ll use an example API key, ‘1234567thisisnotarealapikey’. Best practice is to store your API key in a JSON file on a remote or local drive so that sharing your code will not compromise your personal API key. More on that here.
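As a minimal sketch of that idea, the snippet below reads the key from a JSON file instead of hard-coding it. The file name tmdb_credentials.json, the "api_key" field, and the load_api_key helper are all assumptions made for illustration; any layout works as long as the file stays out of version control.

```python
import json
import os
import tempfile


def load_api_key(path, field="api_key"):
    """Read a key from a JSON file shaped like {"api_key": "..."} (assumed layout)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)[field]


# Demo only: write a throwaway credentials file so the example runs end to end.
# In practice the file would already exist somewhere outside your shared code.
demo_path = os.path.join(tempfile.mkdtemp(), "tmdb_credentials.json")
with open(demo_path, "w", encoding="utf-8") as f:
    json.dump({"api_key": "1234567thisisnotarealapikey"}, f)

api_key = load_api_key(demo_path)
```

With this in place, the rest of the code can refer to api_key without the real value ever appearing in a notebook or script you share.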

We’ll start off by importing some packages and introduce our API key as a variable:

import requests
import pandas as pd
import json

api_key = '1234567thisisnotarealapikey'

Next, we’ll write some code to retrieve data from TMDB’s ‘Discover’ feature, which returns useful general data for preliminary analysis. The ‘for loop’ below will allow us to select the years we want to run. For my analysis, I wanted to pull all movies from 2012–2019 (post Avengers through pre-pandemic).

We’ll start with the basic request URL for Discover:

https://api.themoviedb.org/3/discover/movie?api_key=1234567thisisnotarealapikey

Requesting this URL will only return the first page of results, but once we combine it with a ‘for loop’ to retrieve all pages, it will pull all movies from TMDB. That’s quite a few (over 500,000), so filters will be useful not only for cutting down our request time but also for pulling more relevant data for analysis.

What’s great about this URL is that you can filter easily with ampersands and codes provided here: https://developers.themoviedb.org/3/discover/movie-discover.

For our use case, we’ll apply the following parameters:

  • &page=

Necessary for pulling data beyond the first page of results

  • &primary_release_year=

Helpful for specifying and looping through years if desired

  • &with_original_language=en

Note that this is different from &language=en-US. After reviewing data pulled with &language=en-US, it didn’t seem to filter out movies where English was not the primary language, whereas &with_original_language=en will pull all movies where the original language is English.

  • &with_release_type=3

This restricts our results to wide theatrical releases

  • &with_runtime.gte=80

Useful for filtering out shorts and anything shorter than what you would normally find at your traditional movie theater

More parameters can be found in the Discover documentation linked above.
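To see how these filters combine into a single request URL, here is a small sketch that assembles the query string with the standard library rather than by hand. The build_discover_url helper is a hypothetical name introduced for illustration; the parameter names and values are the ones listed above.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.themoviedb.org/3/discover/movie"


def build_discover_url(api_key, year, page=1):
    """Assemble a Discover URL from the filters described in this post."""
    params = {
        "api_key": api_key,
        "page": page,
        "primary_release_year": year,
        "with_original_language": "en",
        "with_release_type": 3,
        "with_runtime.gte": 80,
    }
    # urlencode joins the pairs with ampersands and escapes anything unsafe.
    return BASE_URL + "?" + urlencode(params)


url = build_discover_url("1234567thisisnotarealapikey", 2012)
print(url)
```

Keeping the filters in a dictionary makes it easy to add or drop a parameter without touching string concatenation.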

results = []
for year in range(2012, 2020):
    page_number = 1
    total_pages = 1  # updated from the first response for each year
    while page_number <= total_pages:
        response = requests.get(
            'https://api.themoviedb.org/3/discover/movie',
            params={
                'api_key': api_key,
                'page': page_number,
                'with_original_language': 'en',
                'primary_release_year': year,
                'with_release_type': 3,
                'with_runtime.gte': 80,
            },
        )
        data = response.json()
        total_pages = data['total_pages']
        results.extend(data['results'])
        page_number += 1

# A plain list has no to_csv method, so build the dataframe first
df = pd.DataFrame(results)
df.to_csv('2012-2019.csv', index=False)

Using pandas to run df.shape will return (18846, 27), meaning we have 18,846 movies in our dataframe and 27 columns full of different details.
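If you want to convince yourself the paging logic works before spending time on real requests, one option is to pull it into a small function and feed it a stand-in for the API. This is only a sketch: fetch_all_pages and fake_fetch_page are hypothetical names, and the fake data just mimics the 'total_pages' and 'results' fields used above.

```python
from typing import Callable, Dict, List


def fetch_all_pages(fetch_page: Callable[[int], Dict]) -> List[Dict]:
    """Collect 'results' from every page, reading 'total_pages' from each response."""
    results: List[Dict] = []
    page = 1
    total_pages = 1  # replaced by the real value after the first response
    while page <= total_pages:
        payload = fetch_page(page)
        total_pages = payload["total_pages"]
        results.extend(payload["results"])
        page += 1
    return results


def fake_fetch_page(page: int) -> Dict:
    """Stand-in for an HTTP call: three pages with two made-up movies each."""
    return {
        "total_pages": 3,
        "results": [{"id": page * 10 + i} for i in range(2)],
    }


all_movies = fetch_all_pages(fake_fetch_page)
print(len(all_movies))  # 6
```

Swapping fake_fetch_page for a function that calls requests.get and returns response.json() would give the same loop against the live API.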

Now, using TMDB’s Get Movie Details feature, we can use our new dataframe to loop through all of our movie IDs (the [‘id’] column) and pull even more information.

To make things a bit less complicated, I utilized a python wrapper called tmdbsimple, which can be found here.

The code below accomplishes a few things:

  • Loops through our entire dataframe of 18,846 movies using the TMDB API’s “Get Details” URL (this will provide more information not available in the Discover URL, documentation can be found here)
  • Within the new details, digs into the ‘certifications’ column to return the MPAA rating in a new column
  • Handles errors (two were encountered in running this script without error handling)

import tmdbsimple as tmdb

tmdb.API_KEY = api_key

df_ids = df['id']
full_movies = []
for idx in df_ids:
    try:
        movie = tmdb.Movies(idx)
        movie_dic = {}
        movie_dic.update(movie.info())  # full details for this movie
        movie.releases()  # populates movie.countries
        for c in movie.countries:
            if c['iso_3166_1'] == 'US':
                certification = c['certification']
                movie_dic.update({'mpaa_rating': str(certification)})
        full_movies.append(movie_dic)
    except Exception:
        # Report the failing ID and keep going
        print(f'{idx} caused an error.')

df_full = pd.DataFrame(full_movies)
df_full.to_csv('2012-2019 FULL.csv')
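The rating lookup in that loop can also be tried on its own, without any API calls. The sketch below is a variation rather than the exact logic above: it returns the first non-empty US certification, whereas the loop keeps whichever US entry comes last. The us_certification helper and the sample entries are made up for illustration, shaped like the dictionaries the loop reads.

```python
def us_certification(countries):
    """Return the first non-empty US certification, or None if there isn't one."""
    for entry in countries:
        if entry.get("iso_3166_1") == "US" and entry.get("certification"):
            return entry["certification"]
    return None


# Hypothetical release entries, not real API output.
sample_countries = [
    {"iso_3166_1": "GB", "certification": "12A"},
    {"iso_3166_1": "US", "certification": ""},
    {"iso_3166_1": "US", "certification": "PG-13"},
]

rating = us_certification(sample_countries)
print(rating)  # PG-13
```

Skipping empty strings matters if a movie has several US entries and only some of them carry a rating.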

With our data stored safely in a .csv file, we are ready for analysis! While the file is a bit hefty, you can still open and explore it with Excel or drop it into Google Sheets for a quick glance.

Hopefully this brief tutorial is helpful for your own analysis. I would like to give a shout out to the TMDB support forum, where I got an expedient answer to my question about how to pull MPAA ratings for individual movies.

For phase 2 of our bootcamp, we will be exploring linear regression, which will give me an opportunity to replicate (and hopefully update) some interesting findings from the mid-2000s.
