This week, we looked at two data sets that you might not think to combine and compare: Traffic fatalities and Spotify streaming numbers.
I was inspired to examine this data by Slate Money (https://slate.com/podcasts/slate-money), where they mentioned a paper from the National Bureau of Economic Research (https://www.nber.org/). The paper found that traffic fatalities increased on the days when popular albums were released (https://www.nber.org/papers/w34866), which were also the days with the highest numbers of people streaming music on Spotify.
The NBER paper used data from two sources:
- FARS, the Fatality Analysis Reporting System at the US government's National Highway Traffic Safety Administration, part of the Department of Transportation (https://www.nhtsa.gov/file-downloads?p=nhtsa/downloads/FARS/), and
- Julian Freyberg, whose Kaggle data set includes streaming data through 2022 (https://www.kaggle.com/datasets/jfreyberg/spotify-chart-data?resource=download). Spotify does provide its own downloadable data, but this is enough for our purposes.
This week, we'll dig into these data sets, and see what we can find!
Paid subscribers, both to Bamboo Weekly and to my LernerPython+data membership program (https://LernerPython.com), get all of the questions and answers, as well as downloadable data files, downloadable versions of my notebooks, one-click access to my notebooks, and invitations to monthly office hours.
Learning goals for this week include combining multiple files, dates and times, joins, and grouping.
Data and five questions
The data for this week has two sources:
- First, you'll need to download data from FARS from 2017-2022 from the FARS site (https://www.nhtsa.gov/file-downloads?p=nhtsa/downloads/FARS/). You'll want the national data in CSV format for each year. The download comes in the form of a zipfile; the item of interest within each zipfile is called accident.csv, although the precise capitalization and directory name (or lack thereof) is rather inconsistent.
- Second, you'll need to get the Kaggle data set. It's a free download for anyone with a Kaggle account (also free), at https://www.kaggle.com/datasets/jfreyberg/spotify-chart-data?resource=download .
Note that while the Spotify data looks at songs, the NBER paper looked at albums. We thus won't be able to replicate the paper precisely, but we can look at something similar.
In addition, here's a Pandas series containing the dates on which the 10 most popular albums were released, as described in the NBER paper:
album_releases = pd.Series(pd.to_datetime([
    '2022-10-21',  # Midnights - Taylor Swift
    '2021-09-03',  # Certified Lover Boy - Drake
    '2022-05-06',  # Un Verano Sin Ti - Bad Bunny
    '2018-06-29',  # Scorpion - Drake
    '2022-05-13',  # Mr. Morale - Kendrick Lamar
    '2022-05-20',  # Harry's House - Harry Styles
    '2022-11-04',  # Her Loss - Drake & 21 Savage
    '2021-08-29',  # Donda - Kanye West
    '2021-11-12',  # Red (Taylor's Version) - Taylor Swift
    '2020-07-24',  # Folklore - Taylor Swift
]))
Here are my five questions for this week, as well as my solutions and explanations:
Download the FARS data for 2017-2022, and combine all of the accident files (one in each year's zipfile) into a single Pandas data frame. Add a date column, with a dtype of datetime, that contains the year, month, and day of each accident. Keep only the 'STATENAME', 'FATALS', 'ROUTENAME', 'RUR_URBNAME', and the newly added date columns.
import pandas as pd
import requests
import zipfile
from plotly import express as px
Why these?
- I obviously wanted Pandas. (That's why we're here!)
- I used requests to download the files from FARS. I then saved each of these files on my local filesystem.
- Each FARS file was a zipfile, so I used the zipfile module.
- Finally, I did some plotting with Plotly, so I loaded that up, too.
Then I needed to actually get the data. I did this in two parts: First, I retrieved the files from FARS. I decided to use Python to retrieve them, since the URLs were all pretty standard:
for output_year in range(2017, 2023):
    url = f'https://static.nhtsa.gov/nhtsa/downloads/FARS/{output_year}/National/FARS{output_year}NationalCSV.zip'
    r = requests.get(url)
    output_filename = f'data/bw-162-fars-{output_year}.zip'
    print(f'Downloading {output_year}')
    with open(output_filename, 'wb') as w:
        w.write(r.content)
print('Done downloading')
Let's summarize the above code:
- I used range to iterate over the years I wanted to download. Remember that range, like many things in Python, is "up to and not including" the endpoint. So by saying range(2017, 2023), I was asking for 2017-2022.
- I used requests.get to retrieve the document.
- I then used an f-string to create a filename.
- I then used open to create a new file – using 'wb', because it's a binary file, so a plain 'w' will result in an error – and then wrote the content.
- By using a with block, I can be sure that the files are all flushed and closed.
After running this code, I had six zipfiles. Now, a zipfile can contain any number of files inside of it. If the zipfile contains only one CSV file, then you can hand read_csv the filename. Pandas will unzip the file, extract the CSV from it, and turn that into a data frame.
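The single-CSV case is worth seeing in isolation. Here's a minimal sketch, using a throwaway zipfile with made-up data (the filename and contents are hypothetical, just for illustration):

```python
import zipfile
import pandas as pd

# Build a zipfile containing exactly one CSV file (toy data)
with zipfile.ZipFile('single.zip', 'w') as z:
    z.writestr('accident.csv', 'FATALS,STATENAME\n2,Texas\n1,Ohio\n')

# Because there's only one file inside, read_csv can take the
# zipfile directly; pandas infers compression from the .zip suffix
df = pd.read_csv('single.zip')
print(df['FATALS'].sum())  # → 3
```

If the zipfile held more than one file, read_csv would raise an error instead, which is why the FARS files need the more careful treatment below.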
But that wasn't the case here: Each zipfile contains a bunch of CSV files. And only one of them is really of interest.
But wait, it's worse: The file we want is called accident.csv. But the capitalization of the filename varies quite a bit. Sometimes it's in a subdirectory, and sometimes not. Is there a good way to standardize these extractions?
Yes, but it's a bit complex:
First, I decided to iterate over the same range as before, since those numbers are in the filenames. I then recreated the filename (zip_filename) with an f-string.
But then I wanted to look through the contents of the zipfile in order to find the accident-reporting CSV, whatever its capitalization or full path. For that, I used with zipfile.ZipFile, which is good for opening a zipfile in Python.
I then used a weird-but-useful idiom, using if 'accident.csv' in f.lower() to look through each filename in the zipfile's list of filenames. I put that in a generator expression, and then handed the expression to next. That returned the first filename that matched our criteria, assigning it to accident_file.
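In isolation, the idiom looks like this, with made-up filenames standing in for what z.namelist() might return:

```python
# Hypothetical name list, mimicking one year's FARS zipfile contents
filenames = ['FARS2020NationalCSV/ACCIDENT.CSV',
             'FARS2020NationalCSV/PERSON.CSV',
             'FARS2020NationalCSV/VEHICLE.CSV']

# next() pulls the first value the generator expression yields,
# i.e., the first filename matching our case-insensitive test
accident_file = next(f for f in filenames
                     if 'accident.csv' in f.lower())
print(accident_file)  # → FARS2020NationalCSV/ACCIDENT.CSV
```

Note that if nothing matches, next raises StopIteration; you can pass next a second argument (e.g., None) to get a default value instead.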
Then, still within the with zipfile.ZipFile block, I opened the file (using z.open), read the file's contents into a data frame with pd.read_csv, and appended the resulting data frame to all_dfs, a list:
all_dfs = []
for zip_year in range(2017, 2023):
    print(f'Unzipping {zip_year} into a data frame')
    zip_filename = f'data/bw-162-fars-{zip_year}.zip'
    with zipfile.ZipFile(zip_filename) as z:
        accident_file = next(f for f in z.namelist()
                             if 'accident.csv' in f.lower())
        with z.open(accident_file) as f:
            all_dfs.append(pd.read_csv(f, engine='pyarrow'))
Notice that when invoking read_csv, I passed engine='pyarrow'. I like to use PyArrow to load CSV files into data frames, especially when there's a potentially large amount of data.
At this point, I have a list of data frames. How can I turn them into a single data frame, assuming that they all have the same column names? pd.concat is the answer. We give it a list of data frames, and it returns one new one. The only trick is to pass the ignore_index=True keyword argument, which tells Pandas that we want a simple, numeric index starting with 0, ignoring whatever indexes the input data frames might have:
df = (
    pd
    .concat(all_dfs, ignore_index=True)
)
However, we still needed to do two things: First, get a date column from the existing YEAR, MONTH, and DAY columns. We can do that with pd.to_datetime, inside of assign:
df = (
    pd
    .concat(all_dfs, ignore_index=True)
    .assign(date=lambda df_: pd.to_datetime(df_[['YEAR', 'MONTH', 'DAY']]))
)
Finally, given the very large number of columns in the data frame, and the very small number we'll actually use, I used [[ ]] to keep only a small number of columns:
df = (
    pd
    .concat(all_dfs, ignore_index=True)
    .assign(date=lambda df_: pd.to_datetime(df_[['YEAR', 'MONTH', 'DAY']]))
    [['STATENAME', 'date', 'FATALS', 'ROUTENAME', 'RUR_URBNAME']]
)
I had originally intended to use ROUTENAME and RUR_URBNAME, but ended up not doing so. You can leave those out, if you want.
How many fatalities were there, on average, on a given day? How many fatalities were there, on average, on days when new albums dropped?
Now that we have the data, we can start to ask and answer some interesting questions about it. First, how many fatalities were there, on average? This requires the use of groupby, since we want to get the total number of fatalities per date, and then get the mean of all dates:
(
    df
    .groupby('date')['FATALS'].sum()
    .mean()
)
The mean is 107.5. (Which is pretty awful, even in a country as large as the United States.)
The mean often gets a bad rap, because it can be influenced by outliers. I decided to check the median, too:
(
    df
    .groupby('date')['FATALS'].sum()
    .median()
)
The median is 105 – which is so close to the mean that we can believe the mean is indeed representative.
Next, I took the Pandas series from above, and defined it as album_releases:
album_releases = pd.Series(pd.to_datetime([
    '2022-10-21',  # Midnights - Taylor Swift
    '2021-09-03',  # Certified Lover Boy - Drake
    '2022-05-06',  # Un Verano Sin Ti - Bad Bunny
    '2018-06-29',  # Scorpion - Drake
    '2022-05-13',  # Mr. Morale - Kendrick Lamar
    '2022-05-20',  # Harry's House - Harry Styles
    '2022-11-04',  # Her Loss - Drake & 21 Savage
    '2021-08-29',  # Donda - Kanye West
    '2021-11-12',  # Red (Taylor's Version) - Taylor Swift
    '2020-07-24',  # Folklore - Taylor Swift
]))
These were the albums mentioned in the NBER paper, and the dates on which they dropped.
In order to get the mean number of fatal accidents on those dates, we first need to make date into the index with set_index. That allows us to use loc to retrieve from a subset of dates:
(
    df
    .set_index('date')
    .loc[album_releases]
)
With that subset of rows, we can then repeat our groupby:
(
    df
    .set_index('date')
    .loc[album_releases]
    .groupby('date')['FATALS'].sum()
    .mean()
)
The result? The mean is 144.9, and the median is 142.5. Again, not precisely the same, but pretty close to one another.
Also? Wow, that's far, far higher than on other days. So yes, we can definitely see that on those days, there were more traffic fatalities.
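To put "far higher" in numbers, here's a quick back-of-the-envelope calculation using the two means reported above. (This is a raw comparison, not a causal estimate.)

```python
# Means computed above: all days vs. album-release days
daily_mean = 107.5
album_day_mean = 144.9

# How much higher, in percentage terms, are release days?
pct_higher = (album_day_mean - daily_mean) / daily_mean * 100
print(f'{pct_higher:.1f}% more fatalities on album-release days')
# → 34.8% more fatalities on album-release days
```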