BW #31: Poverty (solution)

The US Census Bureau just reported an uptick in poverty rates in 2022, after a previous downturn. What have US poverty numbers looked like over the last few decades, for different ages and races?

This week, we looked at a recent report from the US Census Bureau about poverty in the United States. The Census Bureau has been reporting on poverty for decades, not only keeping track of how many people are below the poverty line but also details about their race/ethnicity, age, level of education, marital status, and location.

The study came to my attention in a New York Times article from several days ago, "Poverty Rate Soared in 2022 as Aid Ended and Prices Rose", which confirmed some of the predictions that poverty experts had made several years ago — namely, that laws passed in 2020 and 2021 had made a huge impact on child poverty, and that the fact they weren’t renewed led to an uptick in child poverty afterward. Here’s an article from last year indicating how these laws reduced child poverty: https://www.vox.com/2022/9/14/23352022/child-poverty-covid-tax-credit

When I read such an article, I immediately ask myself where and how I can find the underlying data. Fortunately, the Census Bureau and other US government organizations make their findings public, and that’s the data that I decided we would look at this week.

If you’re interested in learning more about these poverty statistics, the Census Bureau’s page describing the study is at https://www.census.gov/library/publications/2023/demo/p60-280.html , and the full report is at /content/files/content/dam/Census/library/publications/2023/demo/p60-280.pdf .

From Stable Diffusion: “money disappearing gradually over time”

Data and eight questions

This week, I asked you eight questions about poverty. The questions were based on a single Excel file:

/content/files/programs-surveys/demo/tables/p60/280/tablea3_hist_pov_by_all_and_age.xlsx

The Census Bureau split their data across a number of different Excel files, all of which can be found and downloaded from:

https://www.census.gov/library/publications/2023/demo/p60-280.html

Here are the questions and tasks for this week, followed by a link to the Jupyter notebook that I used to solve things. Comments, corrections, and questions are, as always, welcome:

Read the "History POV by all and age" file into a data frame. We're only interested (for now) in the "ALL RACES" section. Turn the years into an index, and remove the "percent" measures.

Let’s start off by loading Pandas:

import pandas as pd

In theory, you can just say “import pandas”, and in fact I did exactly that for my first year or two of using Pandas. However, the fact that everyone in the Pandas world uses the “pd” alias means that if you don’t adopt it, you’ll find it hard to copy from Stack Overflow, blogs, or elsewhere. So even if you type quickly, it’s worth using the “import pandas as pd” form when loading Pandas.

It’s tempting to say that you can just load the entire Excel file into a Pandas data frame with the “read_excel” function. However, there are several problems with this, chief among them being that the spreadsheet isn’t really just a glorified CSV file:

  • Rows 1-2 are a headline
  • Row 3 is a description
  • Rows 4-6 are headers for the data
  • Rows 7-73 contain data for all races, with row 7 being a headline, “ALL RACES”
  • Rows 74-97 contain data for “WHITE ALONE” race, including row 74 itself, a headline
  • Column A sometimes contains a year, sometimes contains a year and a footnote number, and sometimes contains a description of the data that follows
  • This continues for several more races/ethnicities until we get to row 482, where we’re told that “N” represents unavailable data
  • Rows 483 and onward contain the footnotes referenced in the data (particularly by the years)

It isn’t too hard for a human to make sense of this Excel file. But if we want to read the data into Pandas and make some sense of it, we’ll need to carve up the document when we read it. Otherwise, our data frame won’t make any sense, and certainly won’t have clear or useful dtypes in each of the columns.

Where do we even start? For the first few questions, we’re only interested in the “ALL RACES” part of the spreadsheet. That means rows 8-73, since row 7 is just the subtitle for that section of the document. We also want rows 4-6 to be our headers.

Of course, while Excel starts to number rows with 1, Python and Pandas people start to number things with 0. This means that from a Pandas perspective, we’ll want the headers to be from rows 3, 4, and 5, and the data to be for 67 rows starting after them. We can take an initial stab with this code:

filename = 'tableA3_hist_pov_by_all_and_age.xlsx'

df = (
    pd.read_excel(filename,
                  header=[3,4,5],
                  nrows=67)
)

I could have just assigned the result of pd.read_excel to the “df” variable, but I’m planning to chain a bunch of additional methods onto that result. I’m thus using a style of Pandas coding that Matt Harrison has promoted, and which I’m slowly but surely warming up to. We can get away with this because when you open a set of parentheses in Python, the individual lines are all handled together; the end of a line is no longer considered the end of your Python statement.
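Here’s a toy illustration of that parentheses rule, unrelated to our actual data, just to show the mechanics:

# Inside the parentheses, Python treats all of these lines as a single
# statement, so each chained method can sit on its own line
result = (
    pd.Series([3, 1, 2])
    .sort_values()
    .reset_index(drop=True)
)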

That call to read_excel works, but we still have the “ALL RACES” headline as the first row of our data. That’ll throw off any dtype we try to use for the first column. We’ll thus want to remove that first row, which we can do with the “drop” method:

df = (
    pd.read_excel(filename,
                  header=[3,4,5],
                  nrows=67)
    .drop(0)
)

This works, but if you examine the data closely, you’ll see that in the final rows, we have the single-character string “N” in a number of places. That “N” is supposed to represent missing values, what we in the Pandas world would call either “NaN” or “NA”. If we can get those “N” values to be either NaN or NA, then Pandas will know what to do. But keeping them as “N” will prevent us from doing any calculations with those numbers.
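Here’s a toy illustration (not from our actual file) of why those “N” strings are a problem:

import numpy as np
import pandas as pd

# A column containing the string 'N' gets dtype 'object', which blocks
# numeric operations; with NaN, the column stays numeric and math works
with_n = pd.Series([100, 'N', 300])        # dtype: object
with_nan = pd.Series([100, np.nan, 300])   # dtype: float64
print(with_nan.mean())                     # 200.0 -- NaN values are skipped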

The solution is to tell “read_excel” that when it sees “N”, it should treat it as a NaN value. We can do that with the “na_values” keyword argument. That argument can take either a single string (as we’ll do here) or a list of values, all of which should be interpreted as NaN:

df = (
    pd.read_excel(filename,
                  header=[3,4,5],
                  nrows=67,
                  na_values='N')
    .drop(0)
)

At this point, the index of our data frame is a simple integer range, starting with 0. Our columns, however, are a multi-index with three levels, corresponding to rows 4, 5, and 6 in Excel. I asked you to remove the “Percent” part of the third level. (We’ll calculate those percentages ourselves.)

How can you do this? Dropping part of a multi-index shouldn’t be that hard, but dropping a subset of one level of a multi-index seems rather difficult. And yet, the “drop” method allows us to specify that we want to drop anything called “Percent” on level 2 in the columns:

df = (
    pd.read_excel(filename,
                  header=[3,4,5],
                  nrows=67,
                  na_values='N')
    .drop(0)
    .drop('Percent', axis='columns', level=2)
)

Once we have done that, we still have a three-level multi-index on our columns. However, “Below poverty” no longer has two sub-indexes under it (“Number” and “Percent”); it only has one (“Number”).
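If you want to verify this, a quick sanity check (assuming the chain above has already run) might look like this:

# The bottom level of the column multi-index should no longer contain 'Percent'
print(df.columns.get_level_values(2).unique())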

This is a good place to remind you that “drop”, like many methods in Pandas, doesn’t modify the original data frame. Rather, it returns a new data frame with the changes we’ve asked for. We can assign the result of the method call to a variable, or we can call an additional method on it, as I’ve done here. You could, in theory, use the “inplace=True” keyword argument on drop to modify the existing data frame, but the core Pandas developers strongly urge us not to do that.
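Here’s a toy demonstration of that behavior, separate from our actual data:

# drop returns a new data frame, leaving the original intact
toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
dropped = toy.drop('a', axis='columns')
print(list(toy.columns))      # ['a', 'b'] -- the original is unchanged
print(list(dropped.columns))  # ['b']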

Next, I want to make the year into the index. This turns out to be quite tricky, because so many of the years have footnotes attached. In Excel, those footnotes are available as superscript integers, and are clearly visible. But when read into Pandas, those footnotes become part of the integer! The first five years, as read into Pandas, will thus be 2022, 2021, 20201, 2019, and 2018. And no, that third number isn’t a typo; it’s the year 2020 with footnote number 1 mashed together. Yuck.

I figured that I might be able to use the “str” accessor in Pandas to invoke the “slice” method on each of these numbers, grabbing only the first four digits. But in order to do that, I need to turn the column into a bunch of strings. And in order to do that, I need to grab the column, which is a pain because of its long name and the fact that we have a multi-index.
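Here’s the core of that trick on a toy series, before wiring it into the real data frame:

# A footnote digit mashed onto a year can be stripped by converting to
# strings, slicing off the first four characters, and converting back to int
years = pd.Series([2022, 2021, 20201])   # 20201 is 2020 + footnote 1
print(years.astype(str).str.slice(None, 4).astype(int))
# 0    2022
# 1    2021
# 2    2020
# dtype: int64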

I thus decided to use “assign” to create a new column, called “Year”. While you can pass a single scalar value, or a list/series of values, to an assign keyword argument, here I decided to pass a function object via “lambda”. The lambda will:

  • Use “df_” as its parameter, since it’s a temporary data frame
  • Grab the column names from the data frame’s “columns” attribute
  • Use the 0-index column name to retrieve the hard-to-write column name
  • Use “astype” to get a new series based on that column, all strings
  • Invoke the “str.slice” method, getting only the first 4 characters
  • Again invoke “astype”, this time getting back an integer

Once I’ve got that new “Year” column, I can invoke “set_index” to turn it into our data frame’s index. And now my data frame has an index containing integers, the years (sans footnote numbers) from that first column:

filename = 'tableA3_hist_pov_by_all_and_age.xlsx'

df = (
    pd.read_excel(filename,
                  header=[3,4,5],
                  nrows=67,
                  na_values='N')
    .drop(0)
    .drop('Percent', axis='columns', level=2)
    .assign(Year=lambda df_: df_[df_.columns[0]].astype(str).str.slice(None, 4).astype(int))
    .set_index('Year')
)

This is great, but we aren’t quite done: There’s no more need for the bottom level in our multi-index of columns, so we can drop it. Here, I’ll use the “droplevel” method to get rid of level 2 from our multi-index. That returns a new multi-index data structure, which I can then assign back to “df.columns”:

df.columns = df.columns.droplevel(2)

I then (finally) get rid of the annoyingly named first column, which I don’t need anymore because I have the years in the index:

df = df.drop(df.columns[0], axis='columns')

Why did I use assignment for these final two actions, rather than keep chaining methods? I didn’t see an obvious way to use method chaining, and I searched — but if you have suggestions, please add them in comments. As I always say, Pandas is huge and complex, and I’m constantly discovering new functionality and techniques.
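One possibility, which I sketch here without having tested it against this exact file, is to use “pipe” to keep both steps inside the chain. Since “set_axis” returns a new data frame, it chains cleanly:

df = (
    pd.read_excel(filename,
                  header=[3,4,5],
                  nrows=67,
                  na_values='N')
    .drop(0)
    .drop('Percent', axis='columns', level=2)
    .assign(Year=lambda df_: df_[df_.columns[0]].astype(str).str.slice(None, 4).astype(int))
    .set_index('Year')
    # remove the bottom level of the column multi-index, staying in the chain
    .pipe(lambda df_: df_.set_axis(df_.columns.droplevel(2), axis='columns'))
    # drop the annoyingly named first column, now that the years are in the index
    .pipe(lambda df_: df_.drop(df_.columns[0], axis='columns'))
)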

Once we’re done with all of this, we have a data frame with 66 rows and 8 columns.

By the way, if you’re wondering why I didn’t use the “skiprows” keyword argument to “read_excel”: that was indeed my first instinct, to skip over a bunch of rows. But that didn’t play well with the multi-index for columns that we got from the spreadsheet. Another option would have been to read the entire file and then slice out the rows we wanted with “iloc”, at the cost of more cleanup afterward.
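A rough sketch of that alternative (the row offsets here are my assumptions, based on the layout described above):

# Read everything, then slice out the 'ALL RACES' rows with iloc;
# rows 1-66 are the data rows that follow the 'ALL RACES' headline
df_all = pd.read_excel(filename, header=[3,4,5], na_values='N')
all_races = df_all.iloc[1:67]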

In 2022, what was the number of Americans living below the poverty line?

It’s easy, when looking at poverty data, to think in terms of percentages, and changes in those percentages over time. We can feel good when the number of poor people goes down, and bad when that number goes up.

But percentages can mask actual numbers. I thus asked myself, and you: How many people in the United States are currently living below the poverty line?

The most recent year in the survey is 2022. Since we have set our index to contain years, we can retrieve the row for 2022 with the “loc” accessor:

df.loc[2022] 

This returns the entire row whose index is 2022 as a series. The index for the series will be taken from our columns, meaning that we end up with a multi-indexed series:

All people         Total            330100.0
                   Below poverty     37920.0
Under 18 years     Total             71950.0
                   Below poverty     10780.0
18 to 64 years     Total            200200.0
                   Below poverty     21240.0
65 years and over  Total             57880.0
                   Below poverty      5897.0
Name: 2022, dtype: float64

How can we get the data for all people? “loc” takes up to two arguments: the first is a row selector (here, 2022), and the second, optional argument is a column selector, specifying which column(s) we want. I could just say “All people”:

df.loc[
    2022,
    'All people'
]

Notice, by the way, that I like to use “loc” in this expanded, multi-line form, so that the row selector and the column selector each appear on their own line.

But the result that I get is a series, reflecting the two elements of the multi-index below “All people”:

Total            330100.0
Below poverty     37920.0
Name: 2022, dtype: float64

How can I say that I just want “All people”, and then “Below poverty” under that? I have to use a tuple:

df.loc[
    2022,
    ('All people', 'Below poverty')
] 

My row selector remains 2022, but my column selector is a tuple, indicating what values I want in the primary and secondary parts of the multi-index.

The number I get back is 37920.0. Which is fine, but … that seems a bit low, no? Right, because (as it says at the top of the spreadsheet) all of the numbers describe a population in thousands. In other words, to get the real number of poor Americans in 2022, we need to multiply the result by 1,000:

df.loc[
    2022,
    ('All people', 'Below poverty')
] * 1000

The result is thus not 37920, but rather 37,920,000.

In other words: In 2022, there were nearly 38 million Americans living under the poverty line. I don’t know about you, but to me that’s a shockingly high number of people, more than 3x the entire population of Israel, where I live.

I don’t have any brilliant suggestions or solutions to offer, but this number shows just how vast the scope of the problem is, and how many people are affected by poverty on a day-to-day basis in one of the world’s wealthiest countries.