BW #23: Misery index (solution)

Why do Americans feel so blah about the economy, even if it's doing well? Let's check the "misery index," which combines several factors to get a sense of why people are feeling bad.

From Stable Diffusion, the image I got from entering the prompt, “a bunch of economists, all crying because they are miserable.”

[A quick note: Meta’s new “Threads” social app is now online, as of this morning. I’m there, at https://www.threads.net/@reuvenlerner, and I’m starting to use it. I’ll likely still be more active on Twitter, LinkedIn, and YouTube, at least for the time being. But if you’re on Threads, please reach out and connect!]

I keep hearing analysts wonder how Americans can be so down about the economy. Most of the indicators are good, especially unemployment, which is at a very low level. And inflation, while higher than before, has come down quite a bit in the last two years; US inflation is certainly looking better than in much of Europe.

Sure, lots of high-tech companies are laying people off, which is extremely unpleasant. But there are still many jobs to be had, especially for skilled workers in the computer industry. And those layoffs come after record hiring over the last few years.

So… why are people so glum? One possible clue is the “misery index,” which economist Arthur Okun apparently devised in the 1960s. It adds together the unemployment and inflation rates for a given year, giving us a sense of how miserable people ought to be feeling. There are a number of variations on this misery index; for this week’s problems, I stuck with the original and simplest one. If you’re looking for extra misery, you can go further than I did here, adding other factors to the index, as some economists have done.
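In other words, the original index is nothing more than a sum. Here’s a minimal sketch of the arithmetic, using made-up numbers purely for illustration:

# Okun's original misery index: the sum of the annual unemployment rate
# and the annual inflation rate, both expressed as percentages.
# These particular values are invented, just to show the calculation.
unemployment_rate = 3.6    # hypothetical annual unemployment, in percent
inflation_rate = 4.1       # hypothetical annual inflation, in percent

misery_index = unemployment_rate + inflation_rate
print(misery_index)        # 7.7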

Data and questions

This week, we took data from two different sources, both via the amazing FRED site run by the Federal Reserve Bank of St. Louis. The idea was to calculate the misery index on our own.

This week, I gave you eight questions and tasks, with a variety of learning goals: Working with Excel files, datetime data, changes, and line plots.

Download the seasonally adjusted unemployment numbers, as calculated by the Bureau of Labor Statistics, from https://fred.stlouisfed.org/series/UNRATE . Use the Excel version of the data. Create a data frame from that information, setting the `observation_date` column to be the index, and ensuring that the dtype is good for dates and times.

The first thing that I asked you to do was download data about the unemployment rate. There are a number of ways that unemployment can be calculated; in this case, I went with the simplest values as provided by the Bureau of Labor Statistics.

Note that this involves seasonally adjusted unemployment rates. The number of workers needed in many industries changes with the seasons; you wouldn’t say that beach lifeguards are 100% unemployed in the winter, or that orange pickers are 100% unemployed in the summer, right? By taking these seasonal variations into account, the BLS tries to get a better measure of how many people are actually unemployed at any given time.

I asked you to download the Excel version of the data, in part because I wanted you to get a bit more practice working with Excel data, and also so that we could see some of the advantages of doing so.

As usual, I’ll start my work by importing Pandas:

import pandas as pd
from pandas import Series, DataFrame

The second line isn’t always necessary, but is often useful, and my fingers just type it automatically when I start to do some work with Pandas. That said, for this week’s solution, I will be creating a data frame out of existing data, so it’s not all for naught.

On the FRED site, I specified that I wanted to download the Excel file. It gave me a URL; I could theoretically have downloaded the file to my computer and then read it into Pandas, but given that it was small, I decided to just put the (long, ugly) URL directly into my call to read_excel:

unemployment_url = 'https://fred.stlouisfed.org/graph/fredgraph.xls?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1318&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=UNRATE&scale=left&cosd=1948-01-01&coed=2023-05-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2023-07-05&revision_date=2023-07-05&nd=1948-01-01'

unemployment_df = pd.read_excel(unemployment_url)

Did this work? Well, sort of… The main problem was that the file contains a number of initial lines of documentation, lines which aren’t useful data. Those mess up the column names, as well as the dtypes that Pandas assigns to each column.

I wanted to tell Pandas to ignore those initial lines. Given that one of the rows in the spreadsheet contains the column names, I can tell Pandas to skip several lines, moving directly to the column names, with the “header” keyword argument. Note that while the rows in Excel are numbered starting with 1, the header argument takes an integer starting with 0, indicating where the headers are located:

unemployment_df = pd.read_excel(unemployment_url, 
                                header=10)

This worked just fine, creating a data frame with the information that I wanted.

What dtypes did the values have? This is often a worry with CSV files, but that’s because CSV is a text-based format. When we load a CSV file, Pandas tries to guess what dtypes it should use based on the contents of each column. Sometimes it guesses correctly, but sometimes not. Plus, it takes time and memory to make such a decision. We can help it along by passing the `dtype` keyword argument, though.
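For comparison, here’s a sketch of what reading a CSV version of this data might have looked like. (The filename is hypothetical; the point is that with a text-based file, we often need to tell Pandas explicitly how to handle dates and other columns.)

# Hypothetical: reading a CSV version of the same data.  With a text-based
# format, we typically need to say which columns contain dates, and we can
# pass "dtype" to avoid (or correct) Pandas's guessing for other columns.
csv_df = pd.read_csv('UNRATE.csv',
                     parse_dates=['observation_date'],
                     dtype={'UNRATE': 'float64'})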

In the case of Excel, none of that is needed, because Excel handles a variety of dtypes. And indeed, if I run “unemployment_df.dtypes” in Pandas after reading the Excel, I get the following:

observation_date    datetime64[ns]
UNRATE                     float64
dtype: object

This is perfect; the observation_date column is a 64-bit datetime object, and the actual unemployment rate is recorded as a float.

I asked you to make the “observation_date” column into the data frame’s index. We could take the existing data frame and run set_index on it, but I find it easier to just specify the index_col keyword argument when reading the Excel file:

unemployment_df = pd.read_excel(unemployment_url, 
                                header=10, 
                                index_col='observation_date')

We now have a data frame with a single column containing the unemployment rate, and a datetime index, describing which month’s unemployment number we’re looking at.

The misery index is calculated on an annual basis. This means that we'll need to transform our data frame from containing monthly data, with 12 rows per year, into one with annual data, with one row per year. Perform this transformation, such that the annual rate for each year is the mean of the 12 monthly calculations.

We managed to get the monthly unemployment percentage into a data frame, only to find out that if we’re going to measure the misery index, we’ll need the annual numbers. This means taking the mean of all unemployment numbers in a given year, and creating a data frame in which we have one row per year.

This sounds vaguely like a “groupby” kind of problem; when we use groupby, we run an aggregation method once for each distinct value in a categorical column. For example, we can get the average sales for each district, or the total number of students from each country. Here, we want to calculate the mean unemployment number for each year.
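Indeed, we could do this with an ordinary groupby, grouping on the year portion of our datetime index. Here’s a sketch of that approach (which I didn’t end up using, but which should give the same annual means):

# A groupby-based approach: group by the year component of the datetime
# index, and take the mean of the UNRATE values in each group.  The result
# has one row per calendar year, indexed by the year as an integer.
annual_means = unemployment_df.groupby(unemployment_df.index.year)['UNRATE'].mean()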

There is a variation on groupby that comes in handy in precisely these sorts of situations — resampling, which we can invoke with the “resample” method. The basic idea is as follows:

  • We make sure our data frame has a datetime index.
  • We invoke “resample”, telling it the granularity at which we want to perform the calculation, such as “one day,” “three weeks,” or “two months.” We use a combination of numbers and letters to describe the resampling period we want.
  • We will get back a new data frame, with one row per specified period. If there are holes in our data — that is, time periods without any data — then we’ll get NaN values for those periods. But our index will mark all of the time from the earliest time in the index to the latest time.

Since I want to take the existing monthly data and turn it into annual data, calculating the mean of the monthly values, I can use a resample code of “1Y”. I then say:

unemployment_df = unemployment_df.resample('1Y').mean()

Sure enough, the first ten years of the resulting UNRATE values look like this:

observation_date
1948-12-31    3.750000
1949-12-31    6.050000
1950-12-31    5.208333
1951-12-31    3.283333
1952-12-31    3.025000
1953-12-31    2.925000
1954-12-31    5.591667
1955-12-31    4.366667
1956-12-31    4.125000
1957-12-31    4.300000
Freq: A-DEC, Name: UNRATE, dtype: float64

Notice that the result of resampling gives us an index whose values are in the final moments of the period we requested, namely December 31st. If we had asked for quarterly information, we would have had the end of March, June, September, and December. And so forth.
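If you want to see this for yourself, here’s a small sketch. Since we already overwrote unemployment_df with the annual values, it re-reads the monthly data and then resamples it by quarter; each row of the result is labeled with the final day of its quarter:

# Re-read the monthly data, then resample by quarter.  Each row of the
# result is labeled with the last day of its quarter (March 31st, June 30th,
# September 30th, and December 31st).
monthly_df = pd.read_excel(unemployment_url,
                           header=10,
                           index_col='observation_date')
quarterly_df = monthly_df.resample('1Q').mean()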

With our annual unemployment data in hand, let’s get the rest of our data, and start to calculate some misery!