BW #46: Pedestrians (solution)

Are pedestrian deaths really rising in America, when they're declining in other countries? This week, we look at some of the data regarding traffic accidents and pedestrians.


This week, we looked at the US Department of Transportation’s data on car accidents, and specifically pulled information out about those involving pedestrians. The goal was to get a better sense of how many such accidents occur, when and where they occur, and whether they have indeed been rising over the last few years.

This topic was inspired by something that I’ve seen reported several times in the last few weeks, most notably in the New York Times (https://www.nytimes.com/interactive/2023/12/11/upshot/nighttime-deaths.html?unlocked_article_code=1.JE0.6nDD.DwDlgNGWvbej&smid=url-share), but also on Slate's Political Gabfest (https://slate.com/podcasts/political-gabfest/2023/07/extreme-weather-heat-and-floods-are-killing-us-political-gabfest), and in a Vox article (https://www.vox.com/23784549/pedestrian-deaths-traffic-safety-fatalities-governors-association).


And then, just last (Wednesday) night, I was reading “The Phoenix Economy” by Felix Salmon, in which he talks about this topic as part of a general discussion of risk, and how people re-thought it during the pandemic.

Since hearing about “stroads” (https://en.wikipedia.org/wiki/Stroad) and their influence on urban life in the United States, I’ve taken greater notice of how different cities are constructed, and how that affects our ability to get around on foot vs. by car.

Data and six questions

This week’s data comes from the "Fatality Analysis Reporting System" (FARS), part of the National Highway Traffic Safety Administration, which is itself part of the US Department of Transportation. The FARS home page is at:

https://www.nhtsa.gov/research-data/fatality-analysis-reporting-system-fars

The data is all available from the following site:

https://www.nhtsa.gov/content/nhtsa-ftp/251

That site (which then redirects to another URL) contains a folder for each year of FARS data, starting in 1975 and going through 2021. Inside of each year’s folder is a sub-folder called “National.” And inside of the “National” folder, you’ll see a file named FARSYYYYNationalCSV.zip, where YYYY is the year for which the data was collected.
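
For example, here’s how we can build the path for any given year, using the static.nhtsa.gov host that we’ll use in the code below:

year = 2010   # any year from 1975 through 2021
url = f'https://static.nhtsa.gov/nhtsa/downloads/FARS/{year}/National/FARS{year}NationalCSV.zip'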

The (very long) data dictionary that describes most (but not all!) of the columns in the data we'll be looking at is here:

https://crashstats.nhtsa.dot.gov/Api/Public/ViewPublication/813426

This week, I gave you six questions and tasks for working with this data set. Most weeks, it’s fairly straightforward to load the data into a data frame — but this week, loading the data turned out to be the most difficult and complex part. But hey, that’s the way it often is with data analysis; importing and cleaning the data can often be harder than the analysis itself.

Here are my solutions to this week’s questions. As always, a link to the Jupyter notebook I used to solve these problems is at the bottom of this post. And given the complexity of what we’re doing this week, I expect more than ever to get suggestions and feedback on my solutions.

Create two data frames from the FARS data in 2021, one for accidents and one for people. Use the `requests` package (https://docs.python-requests.org/en/latest/index.html), plus the `zipfile` and `io` modules in Python's standard library, to retrieve and process these files, turning them into data frames.

Let’s start by loading up Pandas, as well as several other modules that we’ll need:

import pandas as pd
from zipfile import ZipFile
from io import BytesIO
import requests
import re

How can we use these together to download one file (say, from 2010) and turn it into a data frame?

Well, we can download the data via requests:

url = 'https://static.nhtsa.gov/nhtsa/downloads/FARS/2010/National/FARS2010NationalCSV.zip'

b = requests.get(url).content

The above returns a bytestring (aka “bytes”), a core Python data structure that contains individual 8-bit bytes. It’s easy to confuse this with ASCII characters, because ASCII was a one-byte encoding, where every character used one and only one byte. The good news is that this was easy to work with; the bad news was that it excluded most non-English languages, which rumor has it they speak in some countries. You shouldn’t think of a bytestring as characters, but rather as individual integers that can be turned into characters (i.e., a string) if and when we want.

Whenever we ask for “content” from requests, we always get a bytestring back. If the bytestring contains text, then we can turn it into a regular string with the “decode” method, but (as we’ll discuss in a bit) this can be tricky. Right now, though, our bytestring contains a zipfile, which is most definitely not text.
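
As a quick illustration (nothing to do with this week’s data), here’s the difference between bytes and strings, and why the encoding matters when we decode:

b = 'café'.encode('Latin-1')    # a bytestring, b'caf\xe9'
b.decode('Latin-1')             # gives us back the string 'café'
b.decode('utf-8')               # raises UnicodeDecodeError; 0xe9 isn't a valid UTF-8 sequence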

How can we read from the zipfile? We can use the “ZipFile” class from the “zipfile” module, handling it as a file and then extracting whatever is interesting to us. But we can’t quite do that yet, because ZipFile expects to get a file, and our bytestring is most definitely not a file.

Actually, ZipFile doesn’t really expect to get a file. Rather, it wants to get what we in the Python world call a “file-like object,” meaning something that implements the same API as a file. If you want to construct such an object from a string, simulating a text file, you can use io.StringIO. And if you want to do the same thing with bytes, then you can use io.BytesIO:

zipfile_contents = BytesIO(requests.get(url).content)

Here, we create a BytesIO object based on what we got back from requests. We can then use ZipFile on it. But then what?

One option is to extract the files that we got in the zipfile, and then read through those files on disk. But we can be much more elegant than that, taking advantage of ZipFile’s implementation of the context manager protocol — meaning, simply put, that we can put it in a “with” block. Then we can perform all sorts of actions on the zipfile and its contents without actually touching the filesystem:

    with ZipFile(zipfile_contents) as myzip:

Fine, but what should be inside of the “with” block? Basically, we want to find two files, “accident.csv” and “person.csv”. Sadly, these files might come in all sorts of weird combinations of capitalization, which makes it harder to find the file. However, we can be a bit clever, and do the following:

  • Get the list of files in the zipfile with myzip.namelist()
  • Iterate through each of those files, looking for accident.csv or person.csv, or anything looking roughly like them
  • If we find a match, then we set that to be the CSV filename we want to work with
  • If we don’t find any match, then we print an error message, showing the names that we did find.

That sounds nice, but how can we look for filenames “or anything looking roughly like them”?

The answer is simple: Regular expressions, which allow us to search for patterns of text. If you find yourself using “if” with many different possible variations, but you can describe the text you’re trying to find with a single English sentence, then a regular expression might well help you.
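
For example, here’s a small, standalone sketch of the idea, using some made-up filenames:

import re

for one_filename in ['ACCIDENT.CSV', 'accident.csv', 'Person.csv', 'vehicle.csv']:
    if re.search(r'(accident|person)\.csv', one_filename, re.I):
        print(f'{one_filename} matches')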

Here’s what I did in my code to find accident.csv and person.csv, ignoring whatever crazy capitalization they decided to use in that particular reporting year:

        for basename in ['accident', 'person']:

            for one_filename in myzip.namelist():
                if re.search(rf'{basename}\.csv', one_filename, re.I):
                    print(f'Matched {one_filename}')
                    csv_filename = one_filename
                    break
    
            else:
                print(f'No match; names are: {myzip.namelist()}')

In other words:

  • Iterate over each of “accident” and “person”, calling it “basename”
  • Go through each filename in the zipfile’s contents
  • Check to see if “basename” followed by “.csv”, ignoring capitalization, matches the filename. I use “re.search” here, because it gives us the most flexibility.
  • I pass the “re.I” flag, indicating that I want the search to be case-insensitive
  • If I find a match, then I assign it to “csv_filename” and then break out of the for loop

Notice the “else” after the “for”? That’s a great Python feature that is also a bit confusing. Basically, that “else” will fire if we reach the end of the “for” loop without encountering a “break”. In other words: We went through all of the options, and none of them seemed to fit.
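
Here’s a tiny, self-contained example of that behavior:

for name in ['a.csv', 'b.csv', 'c.csv']:
    if name == 'b.csv':
        print('Found it!')
        break
else:
    print('Reached the end of the loop without a "break", so no match')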

I have a YouTube video on this subject, if you want to learn more.

With the filenames in hand, we can then create data frames. In theory, we would like to do something as simple as this:

df = pd.read_csv(csv_filename)

But there are at least three problems with this:

  • First, I discovered (the hard way!) that these CSV files aren’t encoded in UTF-8. That is, trying to read them directly into a string or data frame with Python’s default encoding won’t work. The files were written using a one-byte encoding known as Latin-1, which covers all ASCII characters (in bytes 0-127) and many Western European characters (in bytes 128-255). Latin-1 was very popular for many years, and covered the majority of Western languages. But because its upper range (bytes 128-255) isn’t valid UTF-8, these files can’t be decoded as UTF-8. However, if we specify the encoding to “open” or “read_csv”, then we’re fine.
  • Some of the data files are also long enough that Pandas breaks them into chunks, inferring the dtype of each column separately for each chunk. This can result in mismatches and/or warnings from Pandas indicating that the dtypes might be wrong. To avoid this, you can pass “low_memory=False” as a keyword argument. This will use more RAM, but it gives a more accurate assessment of each column’s dtype.
  • Finally, remember that we’re inside of a “with” block on the ZipFile, which I’ve called “myzip”. If we want to read the file without actually extracting it, we can use “myzip.open”.

Given all that, we can use the following combination of code to open the CSV file, read it into a data frame, and assign that data frame to a variable:

            with myzip.open(csv_filename) as csv_file:
                df = pd.read_csv(csv_file,
                                 encoding='Latin-1',
                                 low_memory=False)

Repeating that for each of our two CSV files will do the trick.

When you have that working, now create two data frames, `accident_df` and `person_df`, based on all of the data from 2010 through 2021.

The above was fine for turning one or two files into data frames. But we need to repeat this (on each of the two CSV files) for each of the years. How can we do that?

Here’s my basic plan:

  • I’ll create a dictionary, “all_dfs”, with two keys (“accident” and “person”), each of whose values is an empty list.
  • I’ll then iterate through each year we want to examine with “range”, building a URL that we download with requests and turn into a BytesIO.
  • We can then use the above code to find the matching filename (using regular expressions), read the CSV file into a data frame, and append that data frame to the appropriate list in “all_dfs”
  • Finally, we can call “pd.concat” on each of the lists, resulting in a single data frame that’s the result of combining/merging all of the downloaded data.

Here’s the code that I wrote and used, including a bunch of “print” statements to keep track of what’s happening:

import pandas as pd
from zipfile import ZipFile
from io import BytesIO
import requests
import re

all_dfs = {'accident': [],
           'person': []}

for year in range(2012, 2022):
    url = f'https://static.nhtsa.gov/nhtsa/downloads/FARS/{year}/National/FARS{year}NationalCSV.zip'
    print(url)

    zipfile_contents = BytesIO(requests.get(url).content)

    with ZipFile(zipfile_contents) as myzip:
    
        for basename in ['accident', 'person']:

            for one_filename in myzip.namelist():
                if re.search(rf'{basename}\.csv', one_filename, re.I):
                    print(f'Matched {one_filename}')
                    csv_filename = one_filename
                    break
    
            else:
                print(f'No match; names are: {myzip.namelist()}')
            
            with myzip.open(csv_filename) as csv_file:
                all_dfs[basename].append(pd.read_csv(csv_file,
                                                    encoding='Latin-1',
                                                    low_memory=False))
    
accident_df = pd.concat(all_dfs['accident'])
person_df = pd.concat(all_dfs['person'])

At the end of this process, we have two data frames, one from each of the CSV collections:

  • accident_df, with 335,959 rows and 92 (!) columns, and
  • person_df, with 828,742 rows and 172 (!!) columns

And yes, if we were interested in saving memory, we would definitely be choosier about the columns that we load, rather than creating such ridiculously wide data frames. I decided that they were small enough for most modern computers, and didn’t want to make the above code even more complex by suggesting that you pass a value to the “usecols” keyword argument.
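
But if you did want to be choosier, a minimal sketch would look something like this, where the list of columns is just a hypothetical example that you would need to check against the data dictionary:

# hypothetical selection of columns; check the data dictionary for the names you actually want
interesting_columns = ['ST_CASE', 'YEAR', 'MONTH', 'DAY', 'HOUR']

with myzip.open(csv_filename) as csv_file:
    df = pd.read_csv(csv_file,
                     encoding='Latin-1',
                     low_memory=False,
                     usecols=interesting_columns)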