BW #48: Aviation accidents (solution)

Last week's airplane accidents remind us that aviation can be dangerous. But how dangerous is it, and hasn't air travel gotten much safer over the years? This week, we calm ourselves down with data.

[Programming note: Office hours for all Bamboo Weekly paid subscribers will take place this coming Sunday. I’ll send e-mail about it in the coming 24 hours. Please come with your questions and comments!]

At first glance, the past week hasn’t seemed very encouraging when it comes to aviation safety. Between a crash in Japan and part of the side coming off of an Alaska Airlines plane, you might feel justified in thinking of air travel as unsafe. In many ways, I would argue the opposite: the fact that so few people were hurt in these accidents should make us impressed by, and grateful for, the engineers who invest so much time in improving airplane safety.

But what do the numbers say? Have there been fewer injuries over time or more? Do we see any specific trends?

This week, we looked at data collected by the National Transportation Safety Board (NTSB), an independent US government agency (https://www.ntsb.gov/Pages/home.aspx). The NTSB collects data about aviation (and other transportation) accidents, and investigates when an accident takes place. Its analyses lead to recommendations that help the airline industry improve and avoid repeating past mistakes.

Data and seven questions

This week's data comes from the NTSB's "CAROL" database (https://data.ntsb.gov/carol-main-public/basic-search), which lets you query all sorts of transportation incidents. You can then download the result of your query in either JSON or CSV format.

I asked you to download data from January 1st, 1998 through today from CAROL. I’ll admit that this required some annoying work, since CAROL only lets you download up to 10,000 records at a time. Through a bit of experimentation, I found that I could retrieve all of the records in JSON format by choosing “Aviation” as the search mode and running five separate searches, one for each pair of dates listed below. After each search finished, I clicked on “download JSON data.”

The zipfile that I got from each download contained a JSON-formatted file as well as a readme containing my query.

I chunked the data into these date ranges:

  • 01/01/2018 to 12/31/2024
  • 01/01/2012 to 12/31/2017
  • 01/01/2007 to 12/31/2011
  • 01/01/2002 to 12/31/2006
  • 01/01/1998 to 12/31/2001

The files that I got were named:

  • cases2024-01-10_08-14.json
  • cases2024-01-10_08-15.json
  • cases2024-01-10_08-17.json
  • cases2024-01-10_08-18.json
  • cases2024-01-10_08-20.json

From what I can tell, the filenames are based on the date when you downloaded the data, so you likely got files with different names.

This week, I gave you seven tasks and questions. Here are my answers and explanations; a link to the full Jupyter notebook is at the bottom of this message:

Combine the five JSON files into a single data frame. You will only need the following columns: `cm_mkey`, `cm_eventDate`, `cm_vehicles`, `cm_fatalInjuryCount`, `cm_seriousInjuryCount`, and `cm_minorInjuryCount`. Treat the `cm_eventDate` column as a date. Set the `cm_mkey` column to be the index. How can you be sure that you downloaded all of the files, covering all years?

First, and before doing anything else, I have to load Pandas:

import pandas as pd

Then, with Pandas loaded, how can I turn the JSON files into a data frame? We’ve previously read numerous CSV files into Pandas using “read_csv”, and Excel files into Pandas using “read_excel”. The “read_json” function is similar, taking a filename as input and returning a data frame.
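
As a quick sketch (not part of my final solution), reading a single one of the downloaded files into a data frame looks like this, using one of the filenames listed above; your filenames will differ:

# pd was imported above; read one CAROL JSON export into a data frame
one_df = pd.read_json('cases2024-01-10_08-14.json')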

That’s fine for a single file and a single data frame, but how can we combine all five inputs into a single data frame? My preferred technique is to create a list of data frames using a list comprehension, invoking read_json on each file returned by “glob.glob”. I can then invoke “pd.concat” on that list of data frames, getting one data frame back:

import glob

all_dfs = [pd.read_json(one_filename)
           for one_filename in glob.glob('cases*json')]

The good news is that this will indeed give me a list of data frames, one per file. However, I wanted to turn the “cm_eventDate” column into a datetime column. To do that, I pass the “convert_dates” keyword argument to read_json, giving it a list of columns (just one, in this case) that should be parsed as dates:

import glob

all_dfs = [pd.read_json(one_filename,
                        convert_dates=['cm_eventDate'])
           for one_filename in glob.glob('cases*json')]

Now that we have a list of data frames, we can combine them with pd.concat:

df = (pd
      .concat(all_dfs)
     )

The resulting data frame is quite wide, with 38 columns — most of which aren’t really necessary for our analysis. I thus decided to keep only those columns that I needed:

df = (pd
      .concat(all_dfs)
      [['cm_mkey', 'cm_eventDate', 'cm_vehicles', 
        'cm_fatalInjuryCount', 'cm_seriousInjuryCount', 'cm_minorInjuryCount']]
     )

I had asked that you set the “cm_mkey” column to be the index. Normally, I would have used the “index_col” keyword when reading the file from disk to choose an index column. But for reasons that I don’t quite understand, read_json doesn’t have such an option. I thus decided to handle it after creating the large data frame, invoking “set_index”:

df = (pd
      .concat(all_dfs)
      [['cm_mkey', 'cm_eventDate', 'cm_vehicles', 
        'cm_fatalInjuryCount', 'cm_seriousInjuryCount', 'cm_minorInjuryCount']]
      .set_index('cm_mkey')
     )

I don’t know about you, but I made a number of mistakes when downloading the JSON files from the NTSB. How can we be sure that we haven’t forgotten any years when downloading the data?

I ran the following query:

(
    df['cm_eventDate'].dt.year
    .drop_duplicates()
    .sort_values()
    .diff()
)

The above query grabs the year (thanks to the “dt.year” accessor on datetime columns) from each of the records, then invokes “drop_duplicates”, returning the unique years that are in the system. I sorted the years with “sort_values”, and calculated the difference between them with “diff”. The result was a series in which each row told me by how much it differed from the previous row.

My thinking was that if a year was missing, then I would have seen a gap of more than 1 between two years. But since I only had 1s in the result (except for NaN on the first row), I could be sure that I had data from all of the years.
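
Another way to check, shown here only as a sketch of the same idea (and assuming that the data should cover 1998 through 2024), is to compare the set of years in the data with the full range we expect:

expected_years = set(range(1998, 2025))          # 1998 through 2024, inclusive
actual_years = set(df['cm_eventDate'].dt.year)   # years that actually appear in the data
print(expected_years - actual_years)             # an empty set means no years are missing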

Count the number of vehicles involved in each incident. How often does each count occur?

How often does an accident only involve one vehicle, and how often does it include more? If we had an integer column indicating how many vehicles were involved in an accident, then we could easily run “value_counts” and find out.

But instead, the “cm_vehicles” column contains Python objects — specifically, lists of dictionaries, with one dict per vehicle involved in each accident. We need to calculate the length of each list. Then we can take those numbers and run them through value_counts.
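
To make that concrete, here’s a toy illustration; the dictionary contents are made up, since the real NTSB records contain many more keys:

one_cell = [{'vehicle': 'first airplane'},    # hypothetical contents, for illustration only
            {'vehicle': 'second airplane'}]
len(one_cell)                                 # 2, the number of vehicles in this accident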

The solution is to use the “apply” method, which lets us run an arbitrary function on each element in a series. Here, we’ll run the “len” function, which returns the length of each list. We get back a series of integers, on which we can invoke value_counts. I passed the normalize=True keyword argument, so that I would get proportions back, rather than the raw counts:

df['cm_vehicles'].apply(len).value_counts(normalize=True)

The result:

cm_vehicles
1    0.985132
2    0.014805
3    0.000064
Name: proportion, dtype: float64

I’ll add that this makes sense to me; 1-airplane accidents seem far more common than 2- or 3-airplane accidents, even though we know, thanks to the accident in Japan, that they do occur.

There is another way to do this, which is a bit sneaky, but which I’ve used on numerous occasions. Basically, the “str” accessor lets us run string methods on series that contain strings. So if “s” contains string data, I can call “s.str.len()” and get back a new series of integers, each representing the length of one string.
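
For example, here’s str.len on a small, made-up series of strings:

s = pd.Series(['Boeing', 'Airbus', 'Embraer'])   # a toy series of strings
s.str.len()                                      # returns 6, 6, 7 -- the length of each string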

The thing is, Pandas doesn’t check what underlying types are in a column of dtype “object”. If the underlying object supports the “len” function, then we can run it via “str.len()”, too. We can thus say:

df['cm_vehicles'].str.len().value_counts(normalize=True)

I got the same results, which raises the question of which approach is better. I used “%timeit” in my Jupyter notebook, and found that str.len ran in about 9 ms, whereas apply(len) ran in about 6.9 ms. So my sneaky trick of using str.len might be cool and interesting, but it’s also less efficient.
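
If you want to reproduce that comparison in your own Jupyter notebook, the timing cells looked roughly like this; your numbers will almost certainly differ:

%timeit df['cm_vehicles'].apply(len).value_counts(normalize=True)
%timeit df['cm_vehicles'].str.len().value_counts(normalize=True)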