BW #51: Academy Awards (solution)

Last week, we got this year's Oscar nominations. This week, we'll look through the history of Oscar winners and losers.

This week, we looked at data from the Academy Awards, aka the Oscars. These awards are presented annually by the Academy of Motion Picture Arts and Sciences (https://oscars.org). In researching this week’s topic, I learned that the Academy’s founding had a great deal to do with helping production companies avoid unions during labor negotiations. The history written at Wikipedia (https://en.wikipedia.org/wiki/Academy_of_Motion_Picture_Arts_and_Sciences) was quite eye-opening in this regard, and is worth a quick look.

The Oscars have been around for nearly a century, and while the number and names of the awards have changed, there’s definitely some cachet associated with being nominated for one, let alone winning one.

Given that nominations were announced last week (https://www.nytimes.com/2024/01/23/movies/2024-oscar-nominees-list.html?unlocked_article_code=1.R00.b7Z6.51ltAGMr6vjb&bgrp=a&smid=url-share), I thought that we could look through the history of Oscar nominations and awards, and learn something interesting about them.

Data and eight questions

David Lu (https://davidlu.dev/), a software developer with an interest in open source, has put together a data set that includes much of the information from a Kaggle data set (https://www.kaggle.com/datasets/unanimad/the-oscar-award), along with last week’s nominations. I asked you to download the CSV file that Lu provides on his site, and then gave you eight tasks and questions about the data.

As always, the Jupyter notebook that I used to solve these problems is at the end of this post.

Download the `oscars.csv` file, and turn it into a data frame. We'll only use the Year, CanonicalCategory (which should then be renamed "Category"), Film, FilmId, Name, and Winner columns. Use PyArrow for the backend data storage. The "Winner" column should have True/False values, rather than True/NA values.

The first thing that I did, as always, was to load Pandas:

import pandas as pd

Given that the file has a “csv” extension, we can expect that it’s a CSV (aka “comma-separated values”) file, which we can then read into Pandas with the “read_csv” method:

filename = 'oscars.csv'   # the file we downloaded

df = (
    pd
    .read_csv(filename)
)

However, we would be wrong; this doesn’t do the trick, because the file uses tabs (\t) to separate fields, rather than commas. I’m actually a big fan of using tabs, since they are less likely to appear inside of our data, and thus less likely to force the use of quotes.

We can tell read_csv to treat tabs as the field separator, rather than commas, by passing the “sep” keyword argument:

df = (
    pd
    .read_csv(filename, 
              sep='\t')
)

This is a good start, but there are a few more tasks to do. First of all, I asked you to include only some of the columns. We can pass a list of strings to the “usecols” keyword argument to select them:

df = (
    pd
    .read_csv(filename,
              sep='\t',
              usecols=['Year', 'CanonicalCategory', 'Film', 'FilmId', 'Name', 'Winner'])
)

Furthermore, I asked you to use the “PyArrow” (https://arrow.apache.org/docs/python/index.html) backend for Pandas. Arrow is an open-source project that provides an in-memory columnar data format suitable for use in data-analysis languages and libraries, including Pandas. By default, Pandas uses NumPy as its backend, but Arrow is the clear future direction, and the core developers are pushing ahead with adopting it.

Before you can use PyArrow as a backend, you’ll need to install it on your system with “pip install pyarrow”. Once you’ve done that, you can pass the “dtype_backend” keyword argument to read_csv:

df = (
    pd
    .read_csv(filename,
              sep='\t',
              usecols=['Year', 'CanonicalCategory', 'Film', 'FilmId', 'Name', 'Winner'],
              dtype_backend='pyarrow')
)

We now have a data frame with all of the rows and columns that we want. However, I asked you to do two additional things. First, I asked that you rename the “CanonicalCategory” column to “Category”. The original data set has two columns, one with each of these names; the “Category” column contains the official name that the Academy used in a given year, whereas “CanonicalCategory” standardizes those names, ensuring that we can compare awards across years.

We can use the “rename” method on our data frame, passing a dictionary to the “columns” keyword argument, to rename the column:

df = (
    pd
    .read_csv(filename,
              sep='\t',
              usecols=['Year', 'CanonicalCategory', 'Film', 'FilmId', 'Name', 'Winner'],
              dtype_backend='pyarrow')
    .rename(columns={'CanonicalCategory':'Category'})
)

The data frame now contains the data we want, and the column names we want. Actually, that’s not completely true: The “Winner” column, which indicates whether a nominee won their award, is theoretically a boolean value (aka True or False). But in actuality, the values are either True or pd.NA, the PyArrow (and modern Pandas) version of NaN. (More info is at https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html.)
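
To see this for ourselves, we can count the values in the column, including the missing ones:

# value_counts normally ignores missing values; dropna=False includes them
df['Winner'].value_counts(dropna=False)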

We can turn all of those pd.NA values into False with a call to “replace”. Notice that here, we’ll pass a dictionary as an argument. That dict’s keys are the columns in which we want to make replacements. The dict’s values are themselves dicts, with the keys representing what we want replaced, and the values indicating what the substitutions should be.

df = (
    pd
    .read_csv(filename,
              sep='\t',
              usecols=['Year', 'CanonicalCategory', 'Film', 'FilmId', 'Name', 'Winner'],
              dtype_backend='pyarrow')
    .rename(columns={'CanonicalCategory':'Category'})
    .replace({'Winner': {pd.NA: False}})
)

With this in place, we now have a data frame containing 11,856 rows and 6 columns. We can see that it’s using PyArrow by looking at the dtypes:
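
df.dtypes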

Year        string[pyarrow]
Category    string[pyarrow]
Film        string[pyarrow]
FilmId      string[pyarrow]
Name        string[pyarrow]
Winner        bool[pyarrow]
dtype: object

The "Year" column should contain integers, but it doesn't, because the first few awards were listed as having two years (e.g., "1927/28"). Change the values in this column to reflect the second date (e.g., "1927/28" should become "1928") and then change the column to contain integers.

Dates and times can be frustrating to work with, for a variety of reasons. But when I saw that the “Year” column hadn’t been assigned an integer dtype value when read_csv imported the file, I was rather surprised. A quick look showed what I described in the problem statement, namely that during the first few years, the Oscars were listed as being in “1927/28”, with two years named.
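
That quick look can be as simple as asking for the first few unique values in the column:

# the earliest entries use the two-year format, such as '1927/28'
df['Year'].unique()[:5]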

At first, I thought that this meant the awards were given only once every two years, but I was wrong. If we take any year value of the form “19XX/YY” and replace it with “19YY”, we get the second of the two years, as the problem requires. The column will then contain digit-only strings that we can convert to an integer dtype.

There are several ways to solve this problem, but my favorite is (of course) to use a regular expression. Moreover, I want to use one of the more powerful aspects of regular expressions, namely capturing. In other words, given a value “19XX/YY”, I’ll grab the values in YY and then replace the entire string with “19YY”. The value will then be ready for us to turn into an integer.

How does this work? Well, given that \d in a regular expression means “any digit 0-9,” the pattern we want can be described as “19\d\d/\d\d” — 19, followed by two digits, followed by a slash, and then another two digits. I can capture the final two digits with parentheses, as “19\d\d/(\d\d)”. I can then tell Pandas that I want to replace that first string with “19\1”, where \1 means, “Whatever was found in that first (and only) group of parentheses.”
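
To see this capture-and-replace behavior on its own, outside of Pandas, we can try the same pattern with Python’s built-in “re” module:

import re

# the (\d\d) group captures the two digits after the slash ("28"),
# and \1 in the replacement refers back to that captured text
re.sub(r'19\d\d/(\d\d)', r'19\1', '1927/28')   # returns '1928'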

That’s great, but how can I make this substitution? I can use the “str.replace” method, giving the search pattern, the replacement (with the \1 backreference), and passing regex=True as a keyword argument:

(
    df
    ['Year']
    .str.replace(r'19\d\d/(\d\d)', r'19\1', regex=True)
)

Notice that I used raw strings (i.e., with an r before the opening quote) to tell Python to leave the backslashes alone, passing them untouched to the regular expression engine. This reduces the potential confusion and fights between Python’s view of backslashed characters and the regular expression engine’s view.
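
To see the difference that a raw string makes, we can compare the length of the same two-character sequence with and without the r prefix:

# without the r prefix, Python turns \t into a single tab character;
# with it, both the backslash and the "t" survive
print(len('\t'))    # 1
print(len(r'\t'))   # 2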

Our call to str.replace gives us back a column containing only four-character strings, in which all of the characters are digits (aka years). However, they are still strings. We can invoke “astype” to get back an integer column, which we then assign back to the “Year” column in our data frame:

df['Year'] = (
    df
    ['Year']
    .str.replace(r'19\d\d/(\d\d)', r'19\1', regex=True)
    .astype(int)
)

Our “Year” column now contains integers, giving us a chance to work with them numerically.
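
As a quick sanity check, we can confirm the new dtype and try a numeric operation on the column; the exact minimum and maximum will, of course, depend on the data:

# the column now has an integer dtype, so numeric aggregations work
print(df['Year'].dtype)
print(df['Year'].min(), df['Year'].max())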