This week, we looked at the PAYEMS ("Payroll employees, seasonally adjusted") data reported by the Bureau of Labor Statistics, part of the US Department of Labor. This data, normally released on the first Friday of each month, indicates how many Americans had non-farm, non-military jobs in the previous month. It's considered a reliable indicator of the state of the economy.
But when they're first released, the PAYEMS numbers are just an estimate. As more data comes in, those estimates are revised, sometimes upwards and sometimes downwards, to reflect the actual numbers. The numbers are normally revised in each of the two months following their initial release, and then again when the final, annual report comes out in February.
This month's numbers have caused a firestorm, however, because they included dramatic downward revisions to the figures from the previous two months. President Donald Trump accused the BLS of adjusting the numbers for political reasons, and fired Erika McEntarfer, the commissioner of the Bureau of Labor Statistics (https://www.washingtonpost.com/business/2025/08/05/trump-jobs-data-firing-labor-statistics/). This has led to widespread fears that US government data, which has long been seen as non-partisan and reliable, might now be subject to partisan whims and interpretations.
This week, we looked at the numbers released by the BLS, and tried to put them in context.
Data and five questions
Normally, we can get excellent financial data from FRED (https://fred.stlouisfed.org/), the St. Louis Federal Reserve's online economic data portal. However, FRED only keeps the most recent revision of data. To get "vintage" data, reflecting when it was released, we need to instead use ALFRED (https://alfred.stlouisfed.org/), which keeps archived data sets.
You can get the data by going to ALFRED's page for PAYEMS (https://alfred.stlouisfed.org/series?seid=PAYEMS).
Click on the "download" button, and ask for "all vintages." We want the data measured in thousands of people, and for both observations and vintages to start in January, 1965. I retrieved the data in Excel format. We'll use this Excel file for this week's analysis.
This week's learning goals include working with Excel files, using window functions, dates and times, and plotting with Plotly. This last point, plotting with Plotly, reflects my growing interest in this plotting library; I will likely be using it a great deal more in the future, thanks to its attractive output and easy-to-understand API.
Paid subscribers, including members of my LernerPython+data plan, can download the data from a link at the bottom of this page. You can also download my Jupyter notebook, and open it (with the data) using a one-click link to Google Colab.
Here are the five tasks and questions:
Create a Pandas data frame using the data in the Excel spreadsheet. The index should contain datetime data from the observation_date column. The columns should also have datetime values, based on the final eight digits of each column's name. What does each row represent? What does each column represent?
I started by loading up Pandas:
import pandas as pd
I then loaded the file into Pandas using read_excel, which knows how to take an Excel file and turn it into a data frame. Because the Excel file contains multiple sheets, we can pass the sheet_name keyword argument, providing one or more sheet names as strings. Or, if the sheet names are hard to write, we can use one or more integers, starting with 0 for the first sheet. The data we want is in the second sheet, so I provide sheet_name=1:
filename = 'data/bw-130-payems-2.xlsx'

df = (pd
      .read_excel(filename,
                  sheet_name=1)
     )
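If you're not sure which sheet holds the data, Pandas can list the workbook's sheet names for you. Here's a quick check; the names shown in the comment are an assumption on my part, not the actual ALFRED output:

# List the workbook's sheet names before choosing one to load
xl = pd.ExcelFile(filename)
print(xl.sheet_names)    # assumption: something like ['README', 'PAYEMS']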
However, I also asked for the observation_date column to be used as an index. We can set that with index_col:
df = (pd
      .read_excel(filename,
                  sheet_name=1,
                  index_col='observation_date')
     )
Finally, I asked for the columns to be datetime values, rather than strings, using the final six digits in each column name as a date. Looking at the columns, we can see that they have names looking like PAYEMS_20250801. (By the way, I asked you to use the final six digits as a date, but I should have said eight, since the column names contain four-digit years.)
We can use the rename method to rename the columns of a data frame. Moreover, we can pass a columns keyword argument that takes a function as a value. That function can be a lambda, a function that is invoked with each column name in turn. Whatever the function returns replaces the old column name.
We know that we want to take each column name and get a datetime value. Normally, we could do that with pd.to_datetime. However, to_datetime requires an input string that can somehow be matched to a datetime format.
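For example, here's a minimal illustration of the difference (not from the original notebook):

# A bare date string matches the format and parses cleanly
pd.to_datetime('20250801', format='%Y%m%d')    # Timestamp('2025-08-01 00:00:00')

# The full column name doesn't match the format, so this raises a ValueError
# pd.to_datetime('PAYEMS_20250801', format='%Y%m%d')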
We know that each column name contains YYYYMMDD date information. If we could get just that part of the column name passed to pd.to_datetime, then we could use a date format of %Y%m%d that would match it. How, then, can we remove the first part of the string, everything up to and including the _ character?
Well, our lambda gets each column name as a string. And as a Python string, we can run a variety of methods on it, including str.split. If we run str.split('_') on the column name, we get back a Python list, one with two elements. The second element contains the date in YYYYMMDD format, which we can then pass to pd.to_datetime. And that's indeed what I did:
df = (pd
      .read_excel(filename,
                  sheet_name=1,
                  index_col='observation_date')
      .rename(columns=lambda c_: pd.to_datetime(c_.split('_')[1],
                                                format='%Y%m%d'))
     )
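As an aside, the same transformation can be applied to the whole column index at once, using the vectorized .str accessor. This is just an equivalent sketch, assuming df was loaded without the rename step:

# Equivalent, non-chained approach: transform the column index in one shot
df.columns = pd.to_datetime(df.columns.str.split('_').str[1],
                            format='%Y%m%d')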
The result? We can summarize it with df.info():
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 727 entries, 1965-01-01 to 2025-07-01
Columns: 731 entries, 1965-01-07 to 2025-08-01
dtypes: float64(730), int64(1)
memory usage: 4.1 MB
We have datetime values for our index (on the rows), and datetime values for our columns. All but one of the columns contain float values, with the remaining one containing integers.
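If you want to confirm which column is the lone integer one, you can filter on the dtypes:

# Show only the column(s) whose dtype is int64
df.dtypes[df.dtypes == 'int64']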
Why? Why would only one column contain ints?
Consider: Each column in this data frame represents a release date for PAYEMS data. Since we most recently got data on August 1, 2025, the final (rightmost) column is labeled 2025-08-01. Each row represents a month for which we have jobs data. The most recent month for which we have PAYEMS data is July 2025, which is in the row labeled 2025-07-01.
Thus, the column 2025-08-01 and the row 2025-07-01 will have our most recent data. But what will be in the column 2025-07-03, showing data released in July, in the row 2025-07-01? That data didn't exist yet, which is why it's represented as NaN, or "not a number." And since NaN is a float value, any column containing NaN will be converted into a float.

And thus, because every column except the final one contains at least one NaN value (the final column has an integer value in each row), we are left with one int column and the rest floats. If we were using PyArrow or Pandas "extension types," which allow for nullable integer columns, then this conversion wouldn't be necessary, and everything would remain as integers.
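Here's a minimal sketch of that idea using Pandas's nullable Int64 extension dtype. (This assumes every float in the frame is a whole number, which is true here, since they all started life as integer counts.)

# Convert every column to the nullable Int64 dtype; NaN becomes <NA>
df_int = df.astype('Int64')
df_int.dtypes.value_counts()    # all 731 columns are now Int64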
What were the PAYEMS numbers for July that we got on Friday, August 1st? What were the revised numbers for May and June? By how much did the June numbers change since last month's report? By how much did the May numbers change since the June report? What sorts of adjustments did we see in May's data, first in July and then in August?
The most recent data was for July of 2025, which is in the row whose index is 2025-07-01. And that data was released on August 1st, 2025, in the column with a header of 2025-08-01. We can use .loc with its two-argument version, the first being a row selector and the second a column selector. In this particular case, each selector will just be a string:
(
    df
    .loc['2025-07-01', '2025-08-01']
)
According to the BLS, in July 2025, there were 159,539 Americans working in non-farm, non-military jobs.
Actually, that's not true: The data is in thousands of people, so we need to multiply this number by 1,000. Which makes a lot more sense! This means that the total number of workers in July 2025 was 159,539,000, or just under 160 million people.
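If you'd rather have Python do the multiplication and the pretty-printing, something like this works:

# The data is in thousands of people, so multiply by 1,000
jobs = df.loc['2025-07-01', '2025-08-01'] * 1_000
print(f'{jobs:,}')    # 159,539,000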
But on August 1st, the BLS didn't just release July data. It also released updates for the previous two months. To get the revised figures for May and June, we'll first retrieve the final three columns from the data frame, showing the data released in June, July, and August.
Pandas doesn't offer a column-wise version of tail, so we'll instead invoke df.columns and then use a slice ([-3:]) on it, to get the final three column names. Selecting with those names then gives us the final three columns.
We can then get the final three rows with tail, giving us a 3x3 data frame with the three most recent months of data and the three most recent releases. I can then get just the first two rows of that (for May and June data) with head:
(
    df[df.columns[-3:]]
    .tail(3)
    .head(2)
)
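For what it's worth, the same 2x3 window can also be selected positionally, in a single call to .iloc, using a row slice and a column slice. This is purely a stylistic alternative:

# Rows at positions -3 and -2 (May and June), final three columns
df.iloc[-3:-1, -3:]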
We got the following:
                  2025-06-06  2025-07-03  2025-08-01
observation_date
2025-05-01          159561.0    159577.0      159452
2025-06-01               NaN    159724.0      159466
Notice that we have NaN where the data month comes before the release month. In the final column, for data released on August 1st, we can see the revised figures for both May and June.
By how much did these numbers change from the previous month's release? We can use the diff method that comes with Pandas to perform that calculation:
(
    df[df.columns[-3:]]
    .tail(3)
    .diff(axis='columns')
)
Since we're only interested in finding out by how much these numbers changed from the previous BLS release in July, we can limit the result to one column:
(
    df[df.columns[-3:]]
    .tail(3)
    .diff(axis='columns')
    ['2025-08-01']
)
The results:
observation_date
2025-05-01   -125.0
2025-06-01   -258.0
2025-07-01      NaN
Name: 2025-08-01 00:00:00, dtype: float64
And now you can see what the political firestorm was about: BLS, in their August revision to May data, said that they had overestimated by 125,000 jobs. And in their August revision to June data, they had overestimated by 258,000 jobs. In other words, the economy isn't quite as strong as we had thought (or hoped) – and that is especially true for the Trump administration, which has been claiming a very strong economy since taking office.
How did the August revision of May's numbers compare with the original release in June? We'll once again grab the three most recent release columns, and also the three most recent rows. And we'll once again invoke diff, but this time we'll add periods=2 to the invocation. That means we don't want to compare each value with the one to its left, but rather with the value two columns to its left.

That leaves us with a lot of NaN values and a single number, which we can extract using .loc:
(
    df[df.columns[-3:]]
    .tail(3)
    .diff(periods=2, axis='columns')
    .loc['2025-05-01', '2025-08-01']
)
The result is -109, or a downward revision of 109,000 jobs.
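As a sanity check, we can compute the same figure directly, subtracting the original June release from the August release, without diff:

# August release of May's data, minus the original June release
df.loc['2025-05-01', '2025-08-01'] - df.loc['2025-05-01', '2025-06-06']    # -109.0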