BW #43: Financial protection (solution)

Americans who are frustrated with a financial product or service can complain to the CFPB. How many complaints do they get a year, and from where? This week, we analyze their complaints database.

[Administrative note: I’ll be holding office hours this coming Sunday. Look for e-mail in the coming day with Zoom details. I look forward to seeing you there!]

This week, we’re looking at the database of complaints sent to the Consumer Financial Protection Bureau (https://www.consumerfinance.gov/, https://en.wikipedia.org/wiki/Consumer_Financial_Protection_Bureau), a US government agency that tries to help consumers who have had problems with financial products and services. The database is large (about 3 GB) and describes a wide variety of complaints over more than a decade, giving us a chance to explore both the data and quite a bit of Pandas functionality. Let’s get to it!

From ChatGPT: “Create a picture of a panda complaining to a bank manager about being swindled.”

Data and 8 questions

This week’s data set is the database of consumer complaints, available at the CFPB's site:

https://www.consumerfinance.gov/data-research/consumer-complaints/

You should download the entire database in CSV format from

https://files.consumerfinance.gov/ccdb/complaints.csv.zip

The zipfile is about 690 MB in size, and the unzipped file is about 3 GB in size, so make sure your computer has enough RAM to handle it!
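If you prefer to script the download and extraction rather than clicking around in the browser, here’s a minimal sketch using only the standard library (assuming the zipfile unpacks to a file named complaints.csv):

from urllib.request import urlretrieve
import zipfile

# Download the zipfile from the CFPB's server (about 690 MB)
url = 'https://files.consumerfinance.gov/ccdb/complaints.csv.zip'
urlretrieve(url, 'complaints.csv.zip')

# Unpack it; this should leave complaints.csv in the current directory
with zipfile.ZipFile('complaints.csv.zip') as z:
    z.extractall()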

Here are this week's eight tasks and questions:

Read the CFPB complaint data. Read all columns, and don't specify any dtypes. How long did the read take? How much memory does the resulting data frame take?

I started off by loading Pandas into memory, with the standard import and alias:

import pandas as pd

I also defined a “filename” variable with the name of the CSV file I want to load. I’ll be loading the file a number of times, and decided that a variable would save some repetition:

filename = 'complaints.csv'

I asked you to read the entire file (all columns) into a data frame. That’s easy enough to do with “read_csv”:

df = pd.read_csv(filename)

However, when I gave the above instruction, I got a warning from Pandas:

DtypeWarning: Columns (16) have mixed types. Specify dtype option on import or set low_memory=False.

What does this warning mean? Well, the data frame is large, with 4,367,328 rows and 18 columns. Instead of reading the entire file and then analyzing the contents of each column to decide what dtype to use, Pandas processed the file in chunks. That uses less memory during the read, but it also means that the dtype inferred from earlier chunks might not match the data in later chunks, leaving the column with an incorrectly determined dtype.
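If you’re curious about what’s actually in that column, here is a quick sketch (assuming df is the data frame we just loaded) that maps each value to its Python type and counts the results:

# Column 16 is the one named in the warning; counting the Python types
# shows exactly how the values are mixed
df.iloc[:, 16].map(type).value_counts()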

Pandas suggests getting rid of this warning by either setting low_memory=False (thus letting it examine the entirety of a column’s data before setting the dtype) or by explicitly telling Pandas what dtypes to use.

I decided that it might be interesting to see just what kind of effect these suggestions might have.

For starters, I decided to set low_memory=False:

df = pd.read_csv(filename, 
                 low_memory=False)

I wanted to know how long this took — and for such tasks, I’m a big fan of the magic “timeit” commands in Jupyter. %timeit runs a line of code repeatedly, giving you the mean run time, and %%timeit does the same for an entire cell. Note that if you’re using %%timeit, it must be the first line in the cell; Python comments (starting with #) and any other code must come after it.

%%timeit 

df = pd.read_csv(filename, 
                 low_memory=False)

The result? Loading the data on my system takes quite some time:

19.6 s ± 96.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In other words, it ran the code 7 times, and found that on average, it took 19.6 seconds to load.
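By the way, %%timeit runs the cell’s code in its own scope, so the df we assign inside it isn’t available afterward. If you want to time a single load and also keep the resulting data frame, here’s a minimal sketch using time.perf_counter from the standard library:

import time

start = time.perf_counter()
df = pd.read_csv(filename, low_memory=False)

# Report how long this one load took, keeping df around for later use
print(f'Loaded in {time.perf_counter() - start:.1f} seconds')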

How much memory does this data frame take up? An easy way to check is with the “memory_usage” method, which returns the memory usage for each column:

df.memory_usage()

I get the following results:

Index                                132
Date received                   34938624
Product                         34938624
Sub-product                     34938624
Issue                           34938624
Sub-issue                       34938624
Consumer complaint narrative    34938624
Company public response         34938624
Company                         34938624
State                           34938624
ZIP code                        34938624
Tags                            34938624
Consumer consent provided?      34938624
Submitted via                   34938624
Date sent to company            34938624
Company response to consumer    34938624
Timely response?                34938624
Consumer disputed?              34938624
Complaint ID                    34938624
dtype: int64

Does it seem a bit suspicious that each column takes up precisely the same amount of memory? It should — by default, Pandas counts only the 8-byte object references it stores for each row, not the Python strings those references point to. In other words, this isn’t a real measure of how much memory is being used. To get that, we need to pass deep=True, which takes much longer but gives an actual answer:

df.memory_usage(deep=True)

This time, we get:

Index                                  132
Date received                    292610976
Product                          462396306
Sub-product                      317189659
Issue                            422208545
Sub-issue                        373518838
Consumer complaint narrative    1812265991
Company public response          380293226
Company                          364080378
State                            256506475
ZIP code                         269867646
Tags                             157331396
Consumer consent provided?       286252782
Submitted via                    264345302
Date sent to company             292610976
Company response to consumer     355644139
Timely response?                 261984666
Consumer disputed?               160648459
Complaint ID                      34938624
dtype: int64

Since this is a series, we can total it up with “sum”:

usage = df.memory_usage(deep=True).sum()
print(f'{usage:,}')

Displaying usage in an f-string with the “,” format specifier groups the digits with commas:

6,764,694,516

The total memory usage of this data frame is… 6.7 GB! Wow, that’s a lot of memory.
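If you’d rather see that number in gigabytes than in bytes, here’s a tiny sketch that divides by one billion inside the same kind of f-string:

print(f'{usage / 1e9:.2f} GB')   # prints 6.76 GB for the total above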

Now specify categorical and date dtypes. How long did the read take? How much memory does this take?

Let’s see if we can shrink that memory usage a bit. One of the easiest ways to reduce the size of a text-based column is with a “category.” Consider a column containing country names. The country names likely repeat many times, with each mention of a country taking additional memory. If we could replace each country name with an integer, and then store the mapping from integers to country names on the side, we could save a lot of memory.

That’s precisely how a category works — except that Pandas handles it all for us automatically. Moreover, the fact that it’s a categorical column is pretty much invisible to us; we can still use all of the string functionality that Pandas makes available.
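To make that concrete, here’s a small sketch with made-up data, comparing the memory footprint of a plain string column and its categorical equivalent:

# Three country names, each repeated 100,000 times
s = pd.Series(['Canada', 'Mexico', 'Brazil'] * 100_000)

print(s.memory_usage(deep=True))                     # object dtype: one Python string per row
print(s.astype('category').memory_usage(deep=True))  # category: small integer codes plus three strings

The categorical version stores only one small integer code per row, plus the three strings themselves, so its footprint is a small fraction of the object version’s.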

If we already have a series (including a column) of strings, then we can turn it into a category by saying:

s.astype('category')

But in this case, we’ll go one better: We’ll tell Pandas that we want the text columns to be turned into categories as they’re loaded. This does mean that Pandas will need to spend some time analyzing the data in each column, finding the unique values, creating the category objects, and then rewriting the columns. So it might take a bit more time, but the memory savings could be substantial.

The way we do this is by passing the “dtype” keyword argument to “read_csv”. This argument takes a dict as a value; the dict’s keys are column names, and the values are the dtypes we want to assign to the columns. In this particular case, we’ll just want a ton of category columns:

df = pd.read_csv(filename,
                 low_memory=False,
                 dtype={'Product': 'category',
                        'Sub-product': 'category',
                        'Issue': 'category',
                        'Sub-issue': 'category',
                        'Company': 'category',
                        'State': 'category',
                        'Submitted via': 'category',
                        'Company response to consumer': 'category',
                        'ZIP code': 'category',
                        'Consumer consent provided?': 'category',
                        'Company public response': 'category',
                        'Timely response?': 'category',
                        'Consumer disputed?': 'category'})

But wait, we can do even better: There are two columns that contain date information, which we can have Pandas parse into datetime columns. We do this by passing the “parse_dates” keyword argument. Putting it all together, we have the following call:

%%timeit

df = pd.read_csv(filename,
                 low_memory=False,
                 parse_dates=['Date received', 'Date sent to company'],
                 dtype={'Product': 'category',
                        'Sub-product': 'category',
                        'Issue': 'category',
                        'Sub-issue': 'category',
                        'Company': 'category',
                        'State': 'category',
                        'Submitted via': 'category',
                        'Company response to consumer': 'category',
                        'ZIP code': 'category',
                        'Consumer consent provided?': 'category',
                        'Company public response': 'category',
                        'Timely response?': 'category',
                        'Consumer disputed?': 'category'})

Running the above, I got the following results from %%timeit:

20.4 s ± 254 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The extra analysis and building of categories did take some extra time. True, we needed the dates to be parsed, so that was useful. But the big question is whether we saved any memory. Let’s see:

usage = df.memory_usage(deep=True).sum()
print(f'{usage:,}') 

The result?

2,161,433,792

That’s right — we have reduced our memory usage by about 70 percent, simply by asking Pandas to use dtypes and categories. Not a bad savings for several minutes of work.
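Just to spell out where “about 70 percent” comes from, here’s the arithmetic on the two totals we measured:

before = 6_764_694_516   # total memory usage with default dtypes
after = 2_161_433_792    # total with categories and parsed dates

print(f'{1 - after / before:.0%} saved')   # roughly 68%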