BW #29: Auto accidents (solution)

There seem to be more cars on the road than ever. But are there also more auto accidents? This week, we'll examine data about accidents in OECD countries.


There's been lots of talk of automated vehicles (aka "self-driving cars") over the last few years. Between Elon Musk pushing Tesla's "full self-driving" mode (which the company itself says shouldn't be trusted to drive the car without supervision) and self-driving taxis making the news in San Francisco, I feel like we're not far off from a day when half or more of the cars on the road will be self-driving. There will be lots of bumps along the way, and while I'm excited about the prospect of automated vehicles, part of me also worries about autonomous, multi-ton hunks of metal zooming down the road at high speed.

From Stable Diffusion: “A whimsical auto accident, where the cars are driven by robots, in the style of Dali”

The latest episode of Hard Fork (a New York Times technology podcast) spent a lot of time discussing self-driving cars: https://www.nytimes.com/2023/08/18/podcasts/sam-bankman-fried-goes-to-jail-back-to-school-with-ai-and-a-self-driving-car-update.html One of the main points that host Kevin Roose makes is that self-driving cars will almost certainly be safer than human-driven cars. He points to some (admittedly imperfect) data showing that, to date, self-driving cars do indeed seem to be safer.

Moreover, this week's "Make me smart" Tuesday edition did a deep dive on self-driving cars (https://www.marketplace.org/shows/make-me-smart/our-driverless-car-future/), pointing out that while self-driving cars have been having all sorts of problems in San Francisco, they're also getting far more attention than human-driven cars.

This raises the question of how safe (or unsafe) regular ol' human-driven cars are. Are countries generally doing better at reducing deaths and injuries? Are some countries doing better than others? What trends do we see?

Data … and nine questions

This week, we looked at data from the OECD (Organization for Economic Co-operation and Development), what the Economist likes to call "a club of mostly-rich countries." The OECD has collected a variety of road-accident data from its 38 member countries (plus a number of non-member countries, as we'll see below), giving us a chance to see who is doing well, who is doing poorly, and whether roads are getting safer over time.

The data comes in a single CSV file, which you can download from:

https://stats.oecd.org/sdmx-json/data/DP_LIVE/.ROADACCID.../OECD?contentType=csv&detail=code&separator=comma&csv-lang=en

We'll also make use of the Wikipedia page that translates ISO 3-letter country codes into country names:

https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes

I gave you nine tasks. Let's go through them in detail:

Load the data from OECD into a data frame. We won't look at the columns named "INDICATOR", "FREQUENCY", or "Flag Codes."

First, I loaded up Pandas:

import pandas as pd
from pandas import Series, DataFrame

Then I loaded the file into a data frame, using read_csv:

filename = 'DP_LIVE_21082023023516184.csv'
df = pd.read_csv(filename)

But that gave me the entire file in the data frame, including the three columns we were asked to ignore. Fortunately, we can use the "drop" method to remove rows or columns we don't want. To remove columns, we also have to pass the keyword argument axis='columns':

df = pd.read_csv(filename).drop(['INDICATOR', 'FREQUENCY', 'Flag Codes'], axis='columns')

I should note that we also could have expressed which columns we do want, via the "usecols" keyword argument. Here, though, the number of columns we wanted to keep was larger than the number we wanted to remove, and thus I used "drop".
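For completeness, here's a sketch of what the "usecols" alternative might look like. The exact column names are an assumption on my part; OECD DP_LIVE exports typically include LOCATION, SUBJECT, MEASURE, TIME, and Value alongside the three columns we're ignoring:

# Hypothetical alternative: name the columns we want to keep, rather
# than dropping the ones we don't. The column names assume the
# standard OECD DP_LIVE layout.
df = pd.read_csv(filename,
                 usecols=['LOCATION', 'SUBJECT', 'MEASURE', 'TIME', 'Value'])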

The entire data set is now loaded into Pandas. Let’s start with our analysis!

What is the most recent year for which we have data? Are there any countries for which the latest data isn't from that year?

This data is compiled annually, something we could see from the "FREQUENCY" column that's common to many OECD data sets. Since "FREQUENCY" only ever contained the letter "A" (for "annual"), I decided that we didn't need it. The "TIME" column thus contains an integer, the year for which the data was collected.
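If you want to verify that claim for yourself, one quick sanity check is to read just that column back from the file and look at its unique values; a sketch of that check:

# Sanity check (sketch): confirm that FREQUENCY only ever contains 'A',
# i.e., that all of the data is annual
pd.read_csv(filename, usecols=['FREQUENCY'])['FREQUENCY'].unique()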

Data is always messy, however, and not every country has provided data for every year. In particular, some countries seem not to have reported data for the most recent years. Which countries are these?

First, let’s find the most recent year for which we have any data, using the “max” method:

df['TIME'].max()

Running this returns the year 2021.

Now we need to find the most recent year for which we have data from each location. Stated differently, we want to find the maximum value of “TIME” for each value of “LOCATION”. That sounds like a groupby, and indeed that’s what we’re going to do:

df.groupby('LOCATION')['TIME'].max()

You can see, just by looking, that there are indeed many countries for which we don't have data from 2021. How can we find these? For starters, we can compare these per-location maximum values with our overall maximum value:

df.groupby('LOCATION')['TIME'].max() < df['TIME'].max()

This returns a boolean series. We can pass a boolean series to ".loc", and thus get selected rows, but only from a series or data frame whose index matches the boolean series's index, namely the country codes.

One option is to apply “.loc” to the data frame we got from the “groupby”:

df.groupby('LOCATION')['TIME'].max().loc[
    df.groupby('LOCATION')['TIME'].max() < df['TIME'].max()
]

This does indeed find all of the countries whose most recent data was earlier than 2021:

LOCATION
ARG    2017
ARM    2017
BIH    2020
BLR    2020
CHN    2019
IND    2017
KAZ    2020
KHM    2016
MAR    2018
MEX    2020
MNE    2017
ROU    2019
RUS    2020
UKR    2017
UZB    2020
Name: TIME, dtype: int64
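One small refinement: the query above runs the same "groupby" twice. We can compute the per-country maxima once, assign them to a variable, and then filter that variable with the boolean mask. This sketch is equivalent to the query above, just a bit easier to read (and a bit faster):

# Compute the per-country maximum year just once
latest = df.groupby('LOCATION')['TIME'].max()

# Keep only countries whose latest year predates the overall maximum
latest.loc[latest < df['TIME'].max()]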

By the way, notice that the index is sorted; that's standard in the results of "groupby" operations, unless you indicate that you would prefer them not to be sorted.
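For example, if we wanted the groups to appear in the order in which each country first shows up in the file, we could pass sort=False to "groupby":

# Keep groups in order of first appearance, rather than sorting by LOCATION
df.groupby('LOCATION', sort=False)['TIME'].max()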