BW #51: Academy Awards

Last week, we got this year's Oscar nominations. This week, we'll look through the history of Oscar winners and losers.

It has been quite some time since I last saw a movie in a theater. But that doesn't mean I've stopped watching movies. On the contrary, thanks to streaming services and a projector, I've watched lots of movies over the last few years from the comfort of my own home.

I'm not the only one avoiding the cinema; industry data from The Numbers (https://www.the-numbers.com/market/) makes it pretty clear that even though North American ticket sales have rebounded somewhat since the pandemic, they still aren't back to pre-pandemic levels.

Just because people aren’t going to the theater doesn’t mean movie makers are sitting idle, though.

The Academy Awards, whose nominations were announced last week (https://www.nytimes.com/2024/01/23/movies/2024-oscar-nominees-list.html?unlocked_article_code=1.R00.b7Z6.51ltAGMr6vjb&bgrp=a&smid=url-share), are undoubtedly a way to market the film industry and encourage us to watch more movies. But there are also many talented people working in the film industry, and it's nice to be able to give them some recognition.

This week, I decided to take a break from the often heavy and difficult news that fills the world, and to look at something a bit lighter, namely the history of Academy Award nominees.

Data and eight questions

In looking for data about the Academy Awards, I found a source on Kaggle that includes data from 1927 through 2023:

https://www.kaggle.com/datasets/unanimad/the-oscar-award

I was a bit disappointed that the data set hadn't yet been updated to include this year's nominees, but then discovered that David Lu (https://davidlu.dev/), a software developer with an interest in open source, had put together a similar data set -- and it included the latest nominations. That data is available here:

https://github.com/DLu/oscar_data/tree/main

The documentation on this page serves as a data dictionary, describing the columns in the data set and how they are meant to be used.

This week, I have eight questions and tasks for you based on this data set. The learning goals include working with PyArrow, strings and regular expressions, grouping, and sorting.

I’ll be back tomorrow with my full solutions, including the Jupyter notebook I used to solve these questions myself.

  • Download the `oscars.csv` file, and turn it into a data frame. We'll only use the Year, CanonicalCategory (which should then be renamed "Category"), Film, FilmId, Name, and Winner columns. Use PyArrow for the backend data storage. The "Winner" column should have True/False values, rather than True/NA values.
  • The "Year" column should contain integers, but it doesn't, because the first few awards were listed as spanning two years (e.g., "1927/28"). Change the values in this column to reflect the second year (e.g., "1927/28" should become "1928") and then change the column's dtype to an integer type.