BW #45: Netflix (solution)

Netflix has always been extremely secretive about who is watching what shows. Until last week, that is, when it released viewing data for the first half of 2023 — data that we'll explore with Pandas.

This week, we looked at the data that Netflix revealed last week about what people are watching. This is after years in which Netflix had refused to reveal very much at all about what people were watching. Given their many millions of subscribers, and the billions that they were investing in new titles, it made sense that they wouldn’t want to let competitors know which of their shows were successful.

But a whole lot of things have changed — among them, the introduction of advertising into the streaming world — and Netflix is slowly starting to reveal at least a bit of what they know.

The data wasn’t as detailed as we might have liked, and only described viewing habits during the first half (January - June) of 2023. But it’s better than nothing, and it already contains some interesting data for us to go through.

Data and 8 questions

This week’s data was from the "engagement report" that they released on December 12th at https://about.netflix.com/en/news/what-we-watched-a-netflix-engagement-report. The data that they mentioned in the report is downloadable as an Excel file:

/content/files/4cd45et68cgf/1HyknFM84ISQpeua6TjM7A/97a0a393098937a8f29c9d29c48dbfa8/what_we_watched_a_netflix_engagement_report_2023jan-jun.xlsx

I asked eight questions this week about the data. Below are my solutions; a link to the full Jupyter notebook I used to solve the problems is at the end of this post.

Read the data from Excel into a data frame. Parse "Release Date" as a date.

Let’s start off by loading Pandas into memory:

import pandas as pd

With that in place, we can now use “read_excel” to load the spreadsheet into Pandas as a data frame. I downloaded the file, in part because I originally wasn’t sure how big it would be. (Turns out: it’s not that big.) I thus started off loading the file as follows:

filename = 'What_We_Watched_A_Netflix_Engagement_Report_2023Jan-Jun.xlsx'

df = pd.read_excel(filename)

That worked, in the sense that I got a data frame back. But it wasn’t accurate, largely because the first five rows of the Excel file are used for the Netflix banner. The header line was actually on row 6 of the spreadsheet, but because we use zero-based indexing in Python, we call that row 5 inside of Pandas.

While I was at it, I indicated that I only wanted columns 1, 2, 3, and 4 (i.e., not the empty and useless first column, at index 0). The “usecols” keyword argument to read_excel lets you use either names or index numbers for columns. Normally, I prefer to use names, because they’re easier to read and understand. But in this case, it just seemed easier to pass the integers.

Rather than pass a list of integers, I decided to just pass a call to “range”, with a starting point of 1 and an ending point (i.e., one beyond the number I really want) of 5.
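As a quick sanity check, range(1, 5) yields exactly the four column indexes we want:

```python
# range(1, 5) produces the integers 1, 2, 3, and 4 -- the four
# columns we want, skipping the empty first column at index 0
wanted_columns = list(range(1, 5))
print(wanted_columns)  # [1, 2, 3, 4]
```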

Finally, the “Release Date” column should contain datetime information, and not be treated as strings. We could add the “parse_dates” keyword argument to ensure that Pandas treats it correctly. But Excel has already tagged that column as containing datetime information, and that information is passed along to read_excel. So whereas we must use parse_dates when we’re using “read_csv”, we don’t strictly need it when we’re using read_excel; it doesn’t hurt to pass it explicitly, though.
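We can see the read_csv side of this distinction with a tiny, made-up CSV held in memory (the data here is invented for illustration, not from the Netflix file): without parse_dates, the dates arrive as plain strings.

```python
import io
import pandas as pd

# A tiny stand-in CSV; with read_csv, dates stay as strings
# unless we pass parse_dates explicitly
csv_data = 'Title,Release Date\nSome Show,2023-03-15\nAnother Show,2022-11-01\n'

without = pd.read_csv(io.StringIO(csv_data))
with_dates = pd.read_csv(io.StringIO(csv_data), parse_dates=['Release Date'])

print(without['Release Date'].dtype)     # object (i.e., strings)
print(with_dates['Release Date'].dtype)  # datetime64[ns]
```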

Our final query is thus:

filename = 'What_We_Watched_A_Netflix_Engagement_Report_2023Jan-Jun.xlsx'

df = pd.read_excel(filename,
                   header=5,
                   usecols=range(1, 5),
                   parse_dates=['Release Date'])

Which columns, if any, have missing data? Is this significant?

One of the biggest issues with data analysis is the fact that data is often missing. We don’t want to represent that with any real value, such as 0, because it might get confused with actual values. In Pandas, we use either np.nan (“not a number”), a float value from NumPy, or pd.NA, a more modern, flexible value that’s native to Pandas. Using these special not-a-value values allows us to find them, remove them, or replace them — but deciding just which of those would be appropriate very much depends on the circumstances.
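Here’s a quick illustration of the two missing-value markers, using made-up values:

```python
import numpy as np
import pandas as pd

# np.nan is a float, so an integer column containing it becomes float64
s_nan = pd.Series([1, 2, np.nan])
print(s_nan.dtype)   # float64

# pd.NA works with pandas' nullable dtypes, keeping the integers intact
s_na = pd.Series([1, 2, pd.NA], dtype='Int64')
print(s_na.dtype)    # Int64

# Both markers are recognized as missing by isna
print(s_nan.isna().sum(), s_na.isna().sum())  # 1 1
```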

In our case, I asked you to find out which columns have missing data. One trick that I’ve adopted is to invoke the “isna” method on the data frame. That returns a new data frame with the same index and columns as our original, but with all True and False values.

We can then invoke “sum” on the data frame. This sums up all of the values in each column, giving us a single integer value per column. But wait — if the output from “isna” is True or False, what happens when we sum each column?

Each True is treated as 1, and each False is treated as 0. The result of this query is thus a series whose index represents the columns of our data frame, and whose integers reflect how many of the values in that column are NaN:

df.isna().sum()

What happens when I run this on our data frame?

Title                      0
Available Globally?        0
Release Date           13359
Hours Viewed               0
dtype: int64
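The True-as-1, False-as-0 arithmetic described above is easy to verify on a tiny, made-up data frame:

```python
import pandas as pd

# Summing booleans treats True as 1 and False as 0
s = pd.Series([True, False, True, False, False])
print(s.sum())   # 2

# The same logic, applied via isna to a data frame with one missing value
df_demo = pd.DataFrame({'a': [1, None, 3], 'b': ['x', 'y', 'z']})
print(df_demo.isna().sum())  # a has 1 missing value, b has 0
```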

In other words, we see that there are only NaN values in a single column, namely “Release Date”. And there are a lot of those values! I can run this query to find out just how many:

df.isna().sum() / len(df.index)

Notice that I run “len” on df.index, which I’ve learned is the fastest way to get the number of rows in a data frame. The result:

Title                  0.000000
Available Globally?    0.000000
Release Date           0.733447
Hours Viewed           0.000000
dtype: float64
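For the curious, len(df.index) gives the same answer as the more familiar ways of counting rows; it’s just cheaper, since it avoids constructing anything new. A small demonstration with a throwaway data frame:

```python
import pandas as pd

df_demo = pd.DataFrame({'a': range(100), 'b': range(100)})

# All three return the number of rows
print(len(df_demo.index))  # 100
print(df_demo.shape[0])    # 100
print(len(df_demo))        # 100
```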

There are NaN values in only the “Release Date” column, but boy, are there a lot of them: about 73 percent of the rows. Which means that for roughly three-quarters of the titles in Netflix’s data set, we don’t know when the title was released. I’d say that this is a significant piece of data that we could make good use of, if it were available.

I should add that, strictly speaking, we don’t have NaN or NA values in the “Release Date” column. You can see that if you sort the values, for example:

df['Release Date'].sort_values()

1209    2010-04-01
2758    2010-04-01
1844    2010-04-01
3391    2010-09-22
3205    2010-09-22
           ...    
18209          NaT
18210          NaT
18211          NaT
18212          NaT
18213          NaT
Name: Release Date, Length: 18214, dtype: datetime64[ns]

Notice those values at the bottom? Those are NaT, short for “not a time,” and it’s the datetime equivalent of NaN. You can basically treat it as NaN, and it’s recognized by the same methods as NaN, such as “isna” and “dropna”. But it’s not quite the same thing.
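A small, made-up example shows NaT behaving just like NaN for these purposes:

```python
import pandas as pd

# A datetime series with one missing value; pandas stores it as NaT
s = pd.Series(pd.to_datetime(['2010-04-01', None, '2010-09-22']))
print(s.dtype)          # datetime64[ns]

# isna and dropna handle NaT exactly as they handle NaN
print(s.isna().sum())   # 1
print(len(s.dropna()))  # 2
```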