Bamboo Weekly #162: Spotify and car accidents (solution)

Get better at: Combining files, dates and times, grouping, joining, correlations, and plotting

This week, we looked at two data sets that you might not think to combine and compare: Traffic fatalities and Spotify streaming numbers.

I was inspired to examine this data by Slate Money (https://slate.com/podcasts/slate-money), where they mentioned a paper from the National Bureau of Economic Research (https://www.nber.org/). The paper found that traffic fatalities increased on days when popular albums were released (https://www.nber.org/papers/w34866) — correlating with the highest number of people streaming music on Spotify.

The NBER paper used data from two sources:

- FARS (the Fatality Analysis Reporting System), from the National Highway Traffic Safety Administration (NHTSA)
- Spotify streaming data

This week, we'll dig into these data sets, and see what we can find!

Paid subscribers, both to Bamboo Weekly and to my LernerPython+data membership program (https://LernerPython.com), get all of the questions and answers, as well as downloadable data files, downloadable versions of my notebooks, one-click access to my notebooks, and invitations to monthly office hours.

Learning goals for this week include combining multiple files, dates and times, joins, and grouping.

Data and five questions

The data for this week has two sources:

- FARS accident data from NHTSA, with one zipfile per year
- Spotify streaming data

Note that while the Spotify data looks at songs, the NBER paper looked at albums. We thus won't be able to replicate the paper precisely, but we can look at something similar.

In addition, here's a Pandas series containing the dates on which the 10 most popular albums were released, as described in the NBER paper:

album_releases = pd.Series(pd.to_datetime([
    '2022-10-21',  # Midnights - Taylor Swift
    '2021-09-03',  # Certified Lover Boy - Drake
    '2022-05-06',  # Un Verano Sin Ti - Bad Bunny
    '2018-06-29',  # Scorpion - Drake
    '2022-05-13',  # Mr. Morale - Kendrick Lamar
    '2022-05-20',  # Harry's House - Harry Styles
    '2022-11-04',  # Her Loss - Drake & 21 Savage
    '2021-08-29',  # Donda - Kanye West
    '2021-11-12',  # Red (Taylor's Version) - Taylor Swift
    '2020-07-24',  # Folklore - Taylor Swift
]))

Here are my five questions for this week, as well as my solutions and explanations:

Download the FARS data for 2017-2022, and combine all of the accident files (one in each year's zipfile) into a single Pandas data frame. Add a date column, with a dtype of datetime, that contains the year, month, and day of each accident. Keep only the 'STATENAME', 'FATALS', 'ROUTENAME', 'RUR_URBNAME', and the newly added date columns.

import pandas as pd
import requests
import zipfile
from plotly import express as px

Why these? We need pandas for our data analysis, requests to download the zipfiles, zipfile to find and extract the CSV files inside them, and Plotly Express for plotting.

Then I needed to actually get the data. I did this in two parts: First, I retrieved the files from FARS. I decided to use Python to retrieve them, since the URLs were all pretty standard:


for output_year in range(2017, 2023):
    url = f'https://static.nhtsa.gov/nhtsa/downloads/FARS/{output_year}/National/FARS{output_year}NationalCSV.zip'
    r = requests.get(url)
    output_filename = f'data/bw-162-fars-{output_year}.zip'
    print(f'Downloading {output_year}')
    with open(output_filename, 'wb') as w:
        w.write(r.content)

print('Done downloading')

Let's summarize the above code: For each year from 2017 through 2022, we construct the standard FARS download URL, retrieve the zipfile with requests.get, and write the response's bytes (r.content) to a local file named after the year.

After running this code, I had six zipfiles. Now, a zipfile can contain any number of files inside of it. If the zipfile contains only one CSV file, then you can hand read_csv the filename. Pandas will unzip the file, extract the CSV from it, and turn that into a data frame.
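Here's a self-contained sketch of that single-CSV case, building a small zip archive in memory (the filenames and values are made up for illustration):

```python
import io
import zipfile

import pandas as pd

# Build a zip archive in memory containing exactly one CSV file
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as z:
    z.writestr('example.csv', 'a,b\n1,2\n3,4\n')
buf.seek(0)

# Because the archive contains a single CSV, read_csv can decompress and
# parse it in one step. (With a buffer, we name the compression explicitly,
# since there's no .zip filename extension for Pandas to infer it from.)
df = pd.read_csv(buf, compression='zip')
print(df.shape)  # (2, 2)
```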

But that wasn't the case here: Each zipfile contains a bunch of CSV files. And only one of them is really of interest.

But wait, it's worse: The file we want is called accident.csv. But the capitalization of the filename varies quite a bit. Sometimes it's in a subdirectory, and sometimes not. Is there a good way to standardize these extractions?

Yes, but it's a bit complex:

First, I decided to iterate over the same range as before, since those numbers are in the filenames. I then recreated the filename (zip_filename) with an f-string.

But then I wanted to look through the contents of each zipfile to find the accident-reporting CSV, whatever its capitalization or full path. For that, I used zipfile.ZipFile in a with block, which opens the archive and ensures it's closed when the block ends.

I then used a weird-but-useful idiom, using if 'accident.csv' in f.lower() to look through each filename in the zipfile's list of filenames. I put that in a generator expression, and then handed the expression to next. That returned the first filename that matched our criteria, assigning it to accident_file.
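Here's that idiom in isolation, with a made-up list of filenames standing in for the zipfile's namelist():

```python
# Hypothetical filenames, mimicking what z.namelist() might return
filenames = ['FARS2020NationalCSV/PERSON.CSV',
             'FARS2020NationalCSV/ACCIDENT.CSV',
             'FARS2020NationalCSV/VEHICLE.CSV']

# next() pulls the first value from the generator expression -- i.e.,
# the first filename containing 'accident.csv', ignoring case
accident_file = next(f for f in filenames
                     if 'accident.csv' in f.lower())
print(accident_file)  # FARS2020NationalCSV/ACCIDENT.CSV
```

One caveat: if nothing matches, next raises StopIteration. You can pass a default as a second argument, as in next(genexp, None), to get None instead of an exception.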

Then, still within the with zipfile.ZipFile block, I opened the file (using z.open), read the file's contents into a data frame with pd.read_csv, and appended the resulting data frame to all_dfs, a list:

all_dfs = []

for zip_year in range(2017, 2023):
    print(f'Unzipping {zip_year} into a data frame')
    zip_filename = f'data/bw-162-fars-{zip_year}.zip'
    with zipfile.ZipFile(zip_filename) as z:
        accident_file = next(f for f in z.namelist() 
                             if 'accident.csv' in f.lower())
        with z.open(accident_file) as f:
            all_dfs.append(pd.read_csv(f, engine='pyarrow'))

Notice that when invoking read_csv, I passed engine='pyarrow'. I like to use PyArrow to load CSV files into data frames, especially when there's a potentially large amount of data.

At this point, I have a list of data frames. How can I turn them into a single data frame, assuming that they all have the same column names? pd.concat is the answer. We give it a list of data frames, and it returns one new one. The only trick is to pass the ignore_index=True keyword argument, which tells Pandas that we want a simple, numeric index starting with 0, ignoring whatever indexes the input data frames might have:

df = (
    pd
    .concat(all_dfs, ignore_index=True)
)

However, we still needed to do two things: First, get a date column from the existing YEAR, MONTH, and DAY columns. We can do that with pd.to_datetime, inside of assign:

df = (
    pd
    .concat(all_dfs, ignore_index=True)
    .assign(date=lambda df_: pd.to_datetime(df_[['YEAR', 'MONTH', 'DAY']]))
)

Finally, given the very large number of columns in the data frame, and the very small number we'll actually use, I used [[ ]] to keep only a small number of columns:

df = (
    pd
    .concat(all_dfs, ignore_index=True)
    .assign(date=lambda df_: pd.to_datetime(df_[['YEAR', 'MONTH', 'DAY']]))
    [['STATENAME', 'date', 'FATALS', 'ROUTENAME', 'RUR_URBNAME']]
)

I had originally intended to use ROUTENAME and RUR_URBNAME, but ended up not doing so. You can leave those out, if you want.

How many fatalities were there, on average, on a given day? How many fatalities were there, on average, on days when new albums dropped?

Now that we have the data, we can start to ask and answer some interesting questions about it. First, how many fatalities were there, on average? This requires the use of groupby, since we want to get the total number of fatalities per date, and then get the mean of all dates:

(
    df
    .groupby('date')['FATALS'].sum()
    .mean()
)

The mean is 107.5. (Which is pretty awful, even in a country as large as the United States.)

The mean often gets a bad rap, because it can be influenced by outliers. I decided to check the median, too:
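A tiny illustration of why the median is worth checking, using made-up numbers: one outlier shifts the mean substantially, but barely touches the median.

```python
import pandas as pd

# Four typical values plus one outlier
s = pd.Series([100, 105, 103, 101, 500])

print(s.mean())    # 181.8 -- dragged upward by the outlier
print(s.median())  # 103.0 -- barely affected
```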


(
    df
    .groupby('date')['FATALS'].sum()
    .median()
)

The median is 105 – which is so close to the mean that we can believe the mean is indeed representative.

Next, I took the Pandas series from above, and defined it as album_releases:


album_releases = pd.Series(pd.to_datetime([
    '2022-10-21',  # Midnights - Taylor Swift
    '2021-09-03',  # Certified Lover Boy - Drake
    '2022-05-06',  # Un Verano Sin Ti - Bad Bunny
    '2018-06-29',  # Scorpion - Drake
    '2022-05-13',  # Mr. Morale - Kendrick Lamar
    '2022-05-20',  # Harry's House - Harry Styles
    '2022-11-04',  # Her Loss - Drake & 21 Savage
    '2021-08-29',  # Donda - Kanye West
    '2021-11-12',  # Red (Taylor's Version) - Taylor Swift
    '2020-07-24',  # Folklore - Taylor Swift
]))

These were the albums mentioned in the NBER paper, and the dates on which they dropped.

In order to get the mean number of fatal accidents on those dates, we first need to make date into the index with set_index. That allows us to use loc to retrieve from a subset of dates:


(
    df
    .set_index('date')
    .loc[album_releases]
)

With that subset of rows, we can then repeat our groupby:


(
    df
    .set_index('date')
    .loc[album_releases]
    .groupby('date')['FATALS'].sum()
    .mean()
)

The result? The mean is 144.9, and the median is 142.5. Again, not precisely the same, but pretty close to one another.
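The median calculation isn't shown above; it's the same pipeline, ending in .median() instead of .mean(). Here's a self-contained sketch of that pipeline on a tiny synthetic data frame (the real df is the FARS data built earlier):

```python
import pandas as pd

# Tiny synthetic stand-in for the FARS data frame
df = pd.DataFrame({
    'date': pd.to_datetime(['2022-10-21', '2022-10-21', '2022-05-06']),
    'FATALS': [2, 3, 1],
})

album_releases = pd.Series(pd.to_datetime(['2022-10-21', '2022-05-06']))

# Same pipeline as for the mean, with .median() at the end
result = (
    df
    .set_index('date')
    .loc[album_releases]
    .groupby('date')['FATALS'].sum()
    .median()
)
print(result)  # daily totals are 5 and 1, so the median is 3.0
```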

Also? Wow, that's far, far higher than on other days. So yes, we can definitely see that on those days, there were more traffic fatalities.
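To quantify "far higher," a quick back-of-the-envelope calculation with the two means reported above:

```python
# Daily fatality means computed above
overall_mean = 107.5      # all days
album_day_mean = 144.9    # album-release days

increase = album_day_mean / overall_mean - 1
print(f'{increase:.0%}')  # about 35% more fatalities on release days
```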