BW #47: Minimum wage (solution)

For many Americans, the start of 2024 also meant higher wages, thanks to updated minimum wage requirements in many states. This week, we compare minimum wage across states and time.

This week, we looked at changes in the minimum wage in the United States. While the federal minimum wage has been $7.25/hour since 2009, each individual state can raise it beyond that level; in such cases, the state’s minimum wage applies. Given that the federal minimum wage hasn’t changed in about 15 years, and that inflation has been rather high over the last few years, it shouldn’t come as a surprise that many states’ minimum wages are now higher than the federal one.

We compared the different states, finding which ones raised their minimum wage on January 1st, and by how much.

Data and six questions

The data this week comes from FRED, the St. Louis Fed’s portal for economic data (https://fred.stlouisfed.org/). It consists of CSV files for (nearly) all 50 states, with information and download links on this page:

https://fred.stlouisfed.org/release?rid=387

Part of the challenge this week involves figuring out how to download the data. I will say that when you go to a particular state's minimum-wage page (e.g., https://fred.stlouisfed.org/series/STTMINWGOH), click on "Download" and then select "CSV," you'll get a two-column CSV file with a date column (January 1st of the year for which we're reporting) and a value column. Notice that the data on FRED comes from URLs that end with `STTMINWGXX`, where the `XX` can be replaced by the state's two-letter code. That's not quite enough to get the data, but it's a good start.

I gave you six questions and tasks based on the data. Here are my solutions; a link to my Jupyter notebook is below.

Create a series containing the abbreviations for all US states. I found a source here: https://github.com/jasonong/List-of-US-States

Before doing anything else, I loaded Pandas:

import pandas as pd

If you go to the GitHub repo that I mentioned, you’ll see that it contains the state data in both CSV and YAML formats. You might then (naturally) assume that since clicking on the link for the CSV file (https://github.com/jasonong/List-of-US-States/blob/master/states.csv) displays it, you can use that same URL to retrieve the data into Pandas.

While Pandas most often reads CSV data from files, we can also pass “read_csv” a URL (as a string) and get a data frame back. There’s no need to download the file before calling read_csv on it. You might thus try the following:

df = pd.read_csv('https://github.com/jasonong/List-of-US-States/blob/master/states.csv')

As you’ll discover upon trying this, though, you’ll get an error. The problem is that while the filename suggests we’re retrieving “states.csv”, we’re actually retrieving GitHub’s HTML page, which contains that CSV along with many other things. If we want the raw CSV file, we need to go to a different URL.
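If you’re curious, a quick way to see the failure for yourself is to wrap the call in try/except. (This is just an illustrative sketch; the exact exception type you get can vary with your Pandas version.)

import pandas as pd

# The GitHub "blob" URL returns an HTML page, not raw CSV
blob_url = 'https://github.com/jasonong/List-of-US-States/blob/master/states.csv'

try:
    pd.read_csv(blob_url)
except Exception as e:
    print(f'{type(e).__name__}: {e}')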

Fortunately, we can find that URL pretty easily by clicking on the “raw” button. That takes us to a different (but related) URL, one which gives us the CSV file on its own: https://raw.githubusercontent.com/jasonong/List-of-US-States/master/states.csv

url = 'https://raw.githubusercontent.com/jasonong/List-of-US-States/master/states.csv'
df = pd.read_csv(url)

There’s nothing wrong with the data frame that we create in this way. However, I wanted to create a series with just the state abbreviations. The easiest way to do this is to just select the “Abbreviation” column from the data frame as soon as we download it:

url = 'https://raw.githubusercontent.com/jasonong/List-of-US-States/master/states.csv'

us_states = (
    pd
    .read_csv(url)
    ['Abbreviation']
)

You might be wondering why I didn’t pass the “usecols” argument to read_csv, indicating that we were only interested in the “Abbreviation” column, and thus shrinking the size of the original data frame that we created. The reason is that we still would have ended up with a data frame, albeit a one-column data frame. I decided that given the small size of the data frame we created, it would be just fine to retrieve it all and then grab one column into a variable, “us_states”.
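That said, if you do prefer the usecols route, here’s a short sketch (using the same url as above) showing how you could then flatten the one-column data frame into a series with “squeeze”:

# Alternative sketch: read only the column we need...
df_one_col = pd.read_csv(url, usecols=['Abbreviation'])

# ... and squeeze the one-column data frame into a series
us_states = df_one_col.squeeze('columns')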

We now have a series, us_states, with 51 two-character string elements (50 states + DC). This will come in handy now, as we build the data frame with minimum-wage information about each of them.

Create a data frame in which there is one row per year, and the columns are two-letter state abbreviations. Be sure to treat the index as datetime values. If no data appears to be available for the state, then ignore it.

FRED (https://fred.stlouisfed.org/), the online information portal created and maintained by the St. Louis Fed, offers information about minimum-wage rates in various states at a set of fairly uniform URLs. Given a state’s two-letter abbreviation, we can get data about that state’s minimum wage at

https://fred.stlouisfed.org/series/STTMINWGXX

For example, if we want to see information for Illinois, whose two-letter abbreviation is “IL”, we can go to

https://fred.stlouisfed.org/series/STTMINWGIL

However, this link brings us to FRED’s display page for the data. It doesn’t give us the data itself. To retrieve that, we can click on the “download” button, select “CSV” from the menu, and download the file. If you look at the URL, though, you might be amazed by its length:

https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1318&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=STTMINWGIL&scale=left&cosd=1972-01-01&coed=2024-01-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Annual&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2024-01-04&revision_date=2024-01-04&nd=1972-01-01

A lot of this has to do with creating graphs, and doesn’t affect the CSV file in any way. But some of it does, including the starting and ending dates. I managed to trim the URL down quite a bit, without any obvious loss in data:

url = f'https://fred.stlouisfed.org/graph/fredgraph.csv?id=STTMINWG{one_state}&cosd=1968-01-01&coed=2024-01-01&fq=Annual&fam=avg&gsnd=2020-02-01&vintage_date=2024-01-03&nd=1968-01-01'

I’m sure that there are still sections that I could remove, but I was a bit nervous about doing so.

Notice that the above isn’t a URL, but rather the assignment of a string — an f-string, at that — to a variable. Inside of the f-string, I reference the variable “one_state”, which should contain the two-letter abbreviation of the state we’re looking at.

If I want to retrieve the data for each state, then I should basically be able to do the following, iterating over the values in our us_states series:

for one_state in us_states:
    print(one_state)
    url = f'https://fred.stlouisfed.org/graph/fredgraph.csv?id=STTMINWG{one_state}&cosd=1968-01-01&coed=2024-01-01&fq=Annual&fam=avg&gsnd=2020-02-01&vintage_date=2024-01-03&nd=1968-01-01'
    pd.read_csv(url)

This is a great start, except for four things:

  1. Some states don’t have minimum-wage information in FRED,
  2. We have created some data frames, but we haven’t stored them anywhere,
  3. We haven’t combined the downloaded data into a data frame, and
  4. We want to treat the index as a “datetime” type, rather than a string.

Let’s start with handling the states that don’t have any data in FRED. In such cases, we’ll get an exception, an HTTPError. That’s not one of Python’s built-in exceptions; it comes from the “urllib” module in Python’s standard library. We can trap it by saying:

from urllib.error import HTTPError

for one_state in us_states:
    print(one_state)
    try:
        url = f'https://fred.stlouisfed.org/graph/fredgraph.csv?id=STTMINWG{one_state}&cosd=1968-01-01&coed=2024-01-01&fq=Annual&fam=avg&gsnd=2020-02-01&vintage_date=2024-01-03&nd=1968-01-01'
        pd.read_csv(url)
    except HTTPError as e:
        print(f'Error: {e}')

This ensures that we’ll get all of the state data where it exists, and that we won’t end up with an unhandled exception if something goes wrong. Notice that I still print the state that we’re on with each iteration of our “for” loop; that’s a longtime habit of mine, letting me see that a potentially long-running loop is actually making progress.

I also wanted the “DATE” column, which is common to all of the downloaded CSV files, to be of type “datetime”, and to serve as the index. Both of these are easily handled with options to read_csv, as shown in the short demo after this list:

  • parse_dates — we pass a list of column names that we want to turn into datetime values
  • index_col — we pass a string (for one column) or a list (for a multi-index) of the column(s) we want to be the index.
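To see these two options in action, here’s a tiny, self-contained demo on some inline CSV text (not the FRED data):

from io import StringIO
import pandas as pd

csv_text = 'DATE,VALUE\n2023-01-01,7.25\n2024-01-01,8.00\n'

mini = pd.read_csv(StringIO(csv_text),
                   parse_dates=['DATE'],
                   index_col='DATE')

print(mini.index)  # DatetimeIndex(['2023-01-01', '2024-01-01'], ...)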

In this case, we’ll pass both of these keyword arguments, pointing each at the DATE column. It’s really convenient to have the data in a uniform format, allowing us to handle every state the same way. Our code now looks like:

from urllib.error import HTTPError

for one_state in us_states:
    print(one_state)
    try:
        url = f'https://fred.stlouisfed.org/graph/fredgraph.csv?id=STTMINWG{one_state}&cosd=1968-01-01&coed=2024-01-01&fq=Annual&fam=avg&gsnd=2020-02-01&vintage_date=2024-01-03&nd=1968-01-01'
        pd.read_csv(url, parse_dates=['DATE'], index_col='DATE')
    except HTTPError as e:
        print(f'Error: {e}')

We’re now doing almost everything right, except for the biggest and hardest part: taking each of these individual, one-column data frames and combining them into one larger data frame.

If we were just trying to combine two data frames, we would obviously want to use “pd.concat”, which takes a list of data frames and returns one new one, the result of combining them. In this case, though, it seems a bit more complex; I want to combine them, but I want each of the original data frames to be a separate column in the new, combined one, with the state abbreviation as a column name. Moreover, I want the index to represent a superset of all the dates mentioned in all of the downloaded state data.

Well, good news: pd.concat can do all of that! We just need to pass it a dictionary in which the state abbreviations are the keys, and the values are data frames. It’ll return a new data frame whose index combines all of the individual data frames’ indexes.

Let’s create the dictionary:

d = {}

for one_state in us_states:
    print(one_state)
    try:
        url = f'https://fred.stlouisfed.org/graph/fredgraph.csv?id=STTMINWG{one_state}&cosd=1968-01-01&coed=2024-01-01&fq=Annual&fam=avg&gsnd=2020-02-01&vintage_date=2024-01-03&nd=1968-01-01'
        d[one_state] = pd.read_csv(url, parse_dates=['DATE'], index_col='DATE')
    except HTTPError as e:
        print(f'Error: {e}')

When the loop has finished running, we’ll have a dict whose keys are strings (state abbreviations) and whose values are single-column data frames with datetime indexes.
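If you want to convince yourself of that before combining everything, a quick sanity check might look like this (assuming, as the output below shows, that Alaska is among the states with data):

print(len(d))          # how many states FRED had data for
print(d['AK'].head())  # a single-column data frame, indexed by DATE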

We can turn this dictionary into a data frame with:

df = pd.concat(d, axis='columns')

Notice that we have to specify axis='columns' to ensure that they’re placed side by side, rather than top-and-bottom.
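If you’re curious about the difference, here’s a small sketch comparing the two axes; because each frame has its own column name (STTMINWGAK, etc.), stacking them vertically would scatter the values across mostly-NaN columns:

stacked = pd.concat(d)               # default axis='index': frames stacked top-to-bottom
wide = pd.concat(d, axis='columns')  # frames placed side by side

print(stacked.shape)  # many rows, one column per STTMINWGXX name
print(wide.shape)     # one row per date, one (two-level) column per state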

The good news? It worked:

                  AK         AZ         AR
           STTMINWGAK STTMINWGAZ STTMINWGAR
DATE                                       
1968-01-01       2.10        NaN        NaN
1969-01-01       2.10        NaN        NaN
1970-01-01       2.10        NaN       1.10
1971-01-01       2.10        NaN       1.10
1972-01-01       2.10        NaN       1.20
1973-01-01       2.10        NaN       1.20
1974-01-01       2.10        NaN       1.20
1975-01-01       2.10        NaN       1.20
1976-01-01       2.80        NaN       1.90
1977-01-01       2.80        NaN       1.90
1978-01-01       2.80        NaN       1.90
1979-01-01       3.40        NaN       2.30

This is a subset of the data. As we can see, the index shows the years (really, January 1st of each year) for which we have data for a state. The columns are the state abbreviations. And we have values where one was reported, and NaN where one wasn’t.

And yet, something seems weird here: What’s with the multi-index?

Well, pd.concat did use our dict keys as column names. But each downloaded data frame already had its own column name (e.g., STTMINWGAK), which became a second level in the columns. We could modify our downloading code to remove or rename those columns. But another approach is to use the “droplevel” method, which lets us remove one level of a multi-index. Calling droplevel(1, axis='columns') removes the second level (level 1) from the columns:

df = (
    pd
    .concat(d, axis='columns')
    .droplevel(1, axis='columns')
)

The result? A data frame with 57 rows (one for each year of data) and 46 columns (one for each state + DC for which we have data). If any state is missing data for one of those years, we’ll have a NaN value there.
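As a final sanity check, we can confirm the shape and peek at the most recent years:

print(df.shape)   # (57, 46), matching the description above
print(df.tail())  # the most recent years of data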