BW #7: Bank failures (solution)

Last week, Silicon Valley Bank collapsed. How often does this happen? Where are collapsed banks located, and how big are they?

This week’s topic: Bank failures

This week, we looked at data describing bank failures and assistance, which I (perhaps unfairly!) lumped into a single category of “failures.” This, of course, comes in the wake of the collapse of Silicon Valley Bank.

Our data set comes from the FDIC itself. You can download it in CSV format from:

https://banks.data.fdic.gov/explore/failures?aggReport=detail&displayFields=NAME%2CCERT%2CFIN%2CCITYST%2CFAILDATE%2CSAVR%2CRESTYPE%2CCOST%2CRESTYPE1%2CCHCLASS1%2CQBFDEP%2CQBFASSET&endFailYear=2023&sortField=FAILDATE&sortOrder=desc&startFailYear=1934

Our questions for this week are:

  1. According to our document, how many bank failures have there been since the FDIC was opened?
  2. What was the earliest failure in our data set? What is the most recent failure in our data set?
  3. In which five years did the greatest number of banks fail?
  4. In which three states were the greatest number of failed banks?
  5. What was the average market capitalization of the banks that failed? Given a capitalization of $200b, did that make SVB above, below, or about average?
  6. When was the most recent failure greater than SVB?
  7. Bank failures can be resolved in several different ways. How often, historically, have we seen each resolution? Were the odds good that SVB's uninsured depositors would get their money?
  8. What about bank failures in the last 25 years -- if we just look at those, do the odds change?
  9. What was the mean estimated loss in bank failures? What proportion of a bank's assets did this generally involve?

The learning goals for this week include working with dates and strings. And some insights into whether people were right to panic about losing their money when SVB went under.

Discussion

First, before anything else, I did my standard setup for working with Pandas:

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

Then I had to retrieve the file and turn it into a data frame. Getting the URL right took a bit of time; it was only as I was doing the calculations that I discovered that the default download starts in 2012, long after the FDIC opened!

The URL is thus:

https://banks.data.fdic.gov/explore/failures?aggReport=detail&displayFields=NAME%2CCERT%2CFIN%2CCITYST%2CFAILDATE%2CSAVR%2CRESTYPE%2CCOST%2CRESTYPE1%2CCHCLASS1%2CQBFDEP%2CQBFASSET&endFailYear=2023&sortField=FAILDATE&sortOrder=desc&startFailYear=1934

That URL takes us to a page from which we can download the data.

In my code, I handled it as follows, calling read_csv and passing it a URL (rather than the name of a downloaded file):

data_url = 'https://pfabankapi.app.cloud.gov/api/failures?fields=NAME%2CCERT%2CFIN%2CCITYST%2CFAILDATE%2CSAVR%2CRESTYPE%2CCOST%2CRESTYPE1%2CCHCLASS1%2CQBFDEP%2CQBFASSET&filters=FAILYR%3A%5B1934%20TO%202023%5D&limit=10000&react=true&sort_by=FAILDATE&sort_order=desc&subtotal_by=RESTYPE&total_fields=QBFDEP%2CQBFASSET%2CCOST&format=csv&download=true&filename=bank-data'

df = pd.read_csv(data_url)

That loaded the data frame just fine. However, it wasn’t quite good enough for our purposes. That’s because the “FAILDATE” column contains date information, and we’ll need the date for our calculations. We can force Pandas to treat FAILDATE as a datetime column by passing the parse_dates keyword argument:

df = pd.read_csv(data_url,
                 parse_dates=['FAILDATE'])
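
If you want to double-check that the parsing worked, one quick sanity check (not strictly necessary) is to look at the dtype of the column:

df['FAILDATE'].dtype

If read_csv parsed the dates correctly, this should report datetime64[ns], rather than object.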

We won’t be needing all columns of the data frame, but it’s a small enough data set that I’m OK keeping them around, without any more filtering.
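
If you did want to trim things down, one option (just a sketch, assuming the columns in the CSV keep the names from the URL’s field list) would be to select only the columns of interest:

df = df[['NAME', 'CITYST', 'FAILDATE', 'RESTYPE', 'COST', 'QBFDEP', 'QBFASSET']]

But again, with a data set this small, I didn’t bother.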

And with that, let’s move on to the first question!

According to our document, how many bank failures have there been since the FDIC was opened?

Assuming that each row in the data frame represents one bank failure, we merely need to grab the “shape” attribute from our data frame:

df.shape

(Remember that shape is an attribute, not a method! Trying to invoke it will result in an error.)

Of course, that returns a 2-element tuple with the number of rows and the number of columns. So you would have to retrieve index 0 from the resulting tuple to get just the number of rows.
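
In other words:

df.shape[0]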

Another way to get the number of rows is with:

len(df.index)

I’ve heard that this is the best, most idiomatic way to do it, in part because the index’s length is already known, which makes it the fastest option. I know myself, though, and I tend to just get the shape and look at the first element of the returned tuple.
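
For what it’s worth, calling len on the data frame itself gives the same answer, since the length of a data frame is defined to be its row count:

len(df)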

What you should not do is invoke the “count” method: it ignores NaN values, and when called on a data frame, it returns a series indicating the number of non-NaN values in each column, rather than a single row count.
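
Just to make the difference concrete, here’s a quick sketch; the exact numbers you get from count will depend on how many NaN values each column contains:

df.count()       # series of non-NaN counts, one per column
len(df.index)    # a single integer, the total number of rows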

What was the earliest failure in our data set? What is the most recent failure in our data set?

To get the earliest failure, we’ll need to work with “FAILDATE”, the column that we made sure to import as a date in our call to “read_csv”.

Remember that datetime objects are comparable, which means that we can use operators such as < and > on them. This also means that we can sort them. But while we could invoke sort_values on our data frame and then grab the first (or last) row, here we’re only interested in the date itself.
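
Just to show what that alternative would look like (a sketch, not the approach I ended up using), we could sort by the date column and grab the two endpoints:

sorted_df = df.sort_values('FAILDATE')
earliest = sorted_df['FAILDATE'].iloc[0]    # first row after sorting, i.e., the earliest date
latest = sorted_df['FAILDATE'].iloc[-1]     # last row after sorting, i.e., the most recent date

That works, but it does more than we actually need.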

As such, it’s probably easiest to just call “min” and “max” on that column. The minimum is:

df['FAILDATE'].min()

returning April 19th, 1934, and the maximum is

df['FAILDATE'].max()

returning October 23rd, 2020. So none of the latest failures have made it into the database as of yet. It’s almost as if the FDIC had something else to do right now.

What if we want to get both the min and max at the same time? The “agg” method lets us specify any aggregation function, or even a list of such functions:

df['FAILDATE'].agg(['min', 'max'])
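
The result is a series whose index contains “min” and “max”, and whose values are the corresponding earliest and latest failure dates.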