A few weeks ago, while listening to Slate Money (https://slate.com/podcasts/slate-money), I heard a statistic that was simultaneously fantastic and awful: A recent paper published by the National Bureau of Economic Research (https://www.nber.org/) found that traffic fatalities increased on days when popular albums were released (https://www.nber.org/papers/w34866).
In other words: When Taylor Swift releases a new album, many people will stop whatever they're doing and start to listen. Today, a huge number of people listen via streaming services, such as Spotify. And of course, many of them will listen to Spotify while driving. The NBER researchers found that on days when major albums were released, the number of traffic fatalities was higher than on other days. The implication, more or less, is that enough people were listening to Spotify while driving that more of them got into accidents.
Now, one of the first things that you learn in a statistics class is that "correlation isn't causation," so we're not going to accuse Taylor Swift of homicide, or even reckless endangerment — at least, not just yet. But this was such an amazing set of findings that I thought it would be interesting (in a semi-morbid kind of way) to investigate this topic, and see if we could find similar results.
The NBER paper used data from two sources:
- FARS, the Fatality Analysis Reporting System at the US government's National Highway Traffic Safety Administration, part of the Department of Transportation (https://www.nhtsa.gov/file-downloads?p=nhtsa/downloads/FARS/), and
- Julian Freyberg, whose Kaggle data set includes streaming data through 2022 (https://www.kaggle.com/datasets/jfreyberg/spotify-chart-data?resource=download). Spotify does provide its own downloadable data, but this is enough for our purposes.
This week, we'll dig into these data sets, and see what we can find!
Paid subscribers, both to Bamboo Weekly and to my LernerPython+data membership program (https://LernerPython.com), get all of the questions and answers, as well as downloadable data files, downloadable versions of my notebooks, one-click access to my notebooks, and invitations to monthly office hours.
Learning goals for this week include combining multiple files, working with dates and times, joins, and grouping.
Data and five questions
The data for this week has two sources:
- First, you'll need to download data for 2017-2022 from the FARS site (https://www.nhtsa.gov/file-downloads?p=nhtsa/downloads/FARS/). You'll want the national data in CSV format for each year. Each download comes as a zipfile; the item of interest within each zipfile is called accident.csv, although the precise capitalization and directory name (or lack thereof) is rather inconsistent.
- Second, you'll need to get the Kaggle data set. It's a free download for anyone with a Kaggle account (also free), at https://www.kaggle.com/datasets/jfreyberg/spotify-chart-data?resource=download .
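Because the name of the accident file varies from year to year, one way to find it is to compare names case-insensitively and ignore any leading directory. Here's a minimal sketch using only Python's standard library. The helper name find_accident_member and the directory name in the demo are made up for illustration, and the demo builds a tiny in-memory zipfile rather than reading a real FARS download:

```python
import io
import zipfile


def find_accident_member(zf):
    """Return the archive member whose file name is accident.csv, ignoring case and directories."""
    for name in zf.namelist():
        # Look only at the final path component, lowercased
        if name.rsplit('/', 1)[-1].lower() == 'accident.csv':
            return name
    raise KeyError('no accident.csv found in archive')


# Hypothetical stand-in for a real download: an in-memory zipfile
# with an invented directory name and made-up contents
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('FARS2017NationalCSV/ACCIDENT.CSV', 'STATENAME,FATALS\nAlabama,1\n')
    zf.writestr('FARS2017NationalCSV/person.csv', 'x\n1\n')

with zipfile.ZipFile(buffer) as zf:
    member = find_accident_member(zf)
    with zf.open(member) as f:
        first_line = f.readline().decode().strip()
```

The object returned by zf.open is file-like, so it should be possible to hand it straight to pd.read_csv instead of reading lines by hand.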
Note that while the Spotify data looks at songs, the NBER paper looked at albums. We thus won't be able to replicate the paper precisely, but we can look at something similar.
In addition, here's a Pandas series containing the dates on which the 10 most popular albums were released, as described in the NBER paper:
import pandas as pd

album_releases = pd.Series(pd.to_datetime([
'2022-10-21', # Midnights - Taylor Swift
'2021-09-03', # Certified Lover Boy - Drake
'2022-05-06', # Un Verano Sin Ti - Bad Bunny
'2018-06-29', # Scorpion - Drake
'2022-05-13', # Mr. Morale - Kendrick Lamar
'2022-05-20', # Harry's House - Harry Styles
'2022-11-04', # Her Loss - Drake & 21 Savage
'2021-08-29', # Donda - Kanye West
'2021-11-12', # Red (Taylor's Version) - Taylor Swift
'2020-07-24', # Folklore - Taylor Swift
]))
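A series of datetime values like this one can be compared against a datetime column with isin, which returns True for each row whose date appears in the series. Here's a small sketch on toy data. The data frame, its column names, and its counts are invented for illustration, and only two of the release dates are repeated so that the block runs on its own:

```python
import pandas as pd

# Two of the release dates listed above, repeated here so the block is self-contained
album_releases = pd.Series(pd.to_datetime(['2022-10-21', '2018-06-29']))

# Hypothetical toy data: one row per day, with a made-up count
toy = pd.DataFrame({
    'date': pd.to_datetime(['2022-10-20', '2022-10-21', '2022-10-22']),
    'count': [3, 5, 4],
})

# True where the row's date matches one of the release dates
toy['is_release_day'] = toy['date'].isin(album_releases)
```

The resulting boolean column can then serve as a mask, or as a key for grouping.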
Here are my five questions for this week. I'll be back tomorrow with solutions and explanations:
- Download the FARS data for 2017-2022, and combine all of the accident files (one in each year's zipfile) into a single Pandas data frame. Add a date column, with a dtype of datetime, that contains the year, month, and day of each accident. Keep only the 'STATENAME', 'FATALS', 'ROUTENAME', 'RUR_URBNAME', and the newly added date columns.
- How many fatalities were there, on average, on a given day? How many fatalities were there, on average, on days when new albums dropped?