
BW #138: Federal workers (solution)

Get better at: Working with multiple CSV files, grouping, pivot tables, dates and times, speed optimization, and memory optimization.


[Sorry that this week's solutions went out a bit later than expected! I guess I already have something to atone for on Yom Kippur of next year...]

The federal government of the United States shut down on Wednesday. It's not clear when it'll reopen, but that can only happen after Congress passes a budget. For now, Democrats say they'll only vote in favor of the budget if it restores health-insurance subsidies for low-income Americans. Republicans say that they won't accept it. And so for now, a lot of government offices are closed. President Donald Trump has said that he might use the government shutdown to fire large numbers of federal employees, which raises all sorts of additional questions.

This week, I thought it might be interesting to look at data about US federal employees – who they are, and what sort of work they do. By the way, this isn't the first time we've looked at the federal workforce around here; BW 105 (https://www.bambooweekly.com/bw-105-federal-employees-solution/) looked at a different, if related, data set.

Data and five questions

This week's data comes from the OPM, the Office of Personnel Management (https://www.opm.gov/data/datasets/). I found some other government sites with information about federal employees, but they weren't working; it wasn't clear whether they were simply having problems, if they had been taken down earlier this year, or if they were unavailable because of the shutdown.

Regardless, I was happy to find that https://www.opm.gov/data/datasets/ is still working (as of this writing), and that I could download federal employee data from March of this year, which was posted on July 1st. We'll be using the full data, which comes in three separate downloadable zipfiles from the OPM site, each of which contains a CSV file with information about federal employees. These files are listed as "data file 1," "data file 2," and "data file 3" on the download page.

Paid members, including subscribers to my LernerPython.com membership program (https://LernerPython.com), can download the data file from the bottom of this message. Paid subscribers also get all of the questions and answers, downloadable notebooks, one-click access to the data and notebook in Google Colab or Marimo Molab, and invitations to monthly office hours.

Here are this week's tasks and questions:

Create a data frame from the CSV files inside of the three zipfiles. Make sure that the DATECODE column is a datetime value. How much memory do you save by turning the string columns into categories? Make sure the SALARY column contains floats.

I started, as usual, by loading up Pandas:

import pandas as pd

Normally, I would import a file into a Pandas data frame using read_csv. But here, I have three files to import. How can I easily load all of them into a single data frame?

Let's assume that I can load each file into a separate data frame, and that I put those three data frames into a single list. I can then use pd.concat to combine them all into a single data frame.

The challenge, then, is to read the three files into a list of data frames. Fortunately, Python's list comprehensions make it relatively easy to do this. I moved the three files into a single directory with the names opm-1.txt, opm-2.txt, and opm-3.txt. (I could have left them in their original directories, but this was a bit easier.)

I then used glob (from Python's standard library) to get a list of files matching the pattern opm-?.txt from my computer's Downloads directory. Here's what my first attempt looked like:

import glob

pattern = '/Users/reuven/Downloads/opm-?.txt'

df = pd.concat([pd.read_csv(one_filename)
                for one_filename in glob.glob(pattern)])

Remember that the result of a list comprehension is a new list. In the above comprehension, Python iterates over the result of invoking glob.glob(pattern), a list of three strings containing the matching filenames. I pass each filename to read_csv, which returns a data frame. In this way, I created a list of three data frames, which I then passed to pd.concat.

However, this didn't quite work. The main reason? The separator character in these CSV files wasn't a comma (','), but rather the vertical bar ('|'). I thus had to pass sep='|' to read_csv. Along the way, I also decided to pass engine='pyarrow', which uses PyArrow to load the data – typically at much higher speed than the built-in Pandas loading system:

df = pd.concat([pd.read_csv(one_filename,
                            engine='pyarrow',
                            sep='|')
                for one_filename in glob.glob(pattern)])

This actually works! The only remaining issue is that the DATECODE column is treated as a text column, when I asked for it to have a datetime dtype. Normally, PyArrow tries to interpret datetime values, but in this case the format wasn't one that it recognized. I thus passed the parse_dates keyword argument to read_csv, indicating which columns (well, only one here) needed to be treated as datetime values. And I passed date_format='%Y%m', telling Pandas that the dates had four-digit years and two-digit months:

df = pd.concat([pd.read_csv(one_filename,
                            engine='pyarrow',
                            sep='|',
                            parse_dates=['DATECODE'],
                            date_format='%Y%m')
                for one_filename in glob.glob(pattern)])

This did the trick, loading the data into a data frame.

But... wow, what a data frame it was! It consumed 3.7 GB of memory.

Remember that to calculate the memory usage of a data frame, you can say df.info(memory_usage='deep'). Just invoking df.info will give you a memory estimate, but it won't include the most important part, namely the sizes of the strings that are being stored as Python objects! Rather, you'll get the sizes of the 64-bit pointers to those strings, which are almost always far smaller than the strings themselves.
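Here's a quick sketch of both approaches, assuming the data frame has already been loaded into df:

df.info(memory_usage='deep')   # dtypes, plus a "deep" estimate that measures the actual strings

total_bytes = df.memory_usage(deep=True).sum()   # per-column bytes, summed into a single integer
print(f'{total_bytes / 1e9:.2f} GB')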

My computer has lots of RAM, and can handle a 3.7 GB data frame without too much trouble. But not every computer has lots of memory, and there are plenty of data sets where the size can get to 10 or 100 times that size – and that can spell trouble, or at least slow things down.

I thus asked you to try to crunch this data frame down using one of my favorite techniques, categories. When we turn a Pandas column into a category, each unique value in the column is replaced by an integer. The mapping between integers and values is stored alongside the column, in the category definition.
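To see what that mapping looks like, here's a tiny, self-contained sketch; the values are made up for illustration:

import pandas as pd

s = pd.Series(['EPA', 'EPA', 'ED', 'EPA', 'ED'])
c = s.astype('category')

print(c.cat.categories)       # the unique values: Index(['ED', 'EPA'], dtype='object')
print(c.cat.codes.tolist())   # the per-row integer codes: [1, 1, 0, 1, 0]

Each row stores only the small integer code; the string itself is stored just once, in the category definition.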

Replacing strings with integers, especially in a large data frame with many repeated values, can result in pretty big savings in memory. Just how much can we save here?

I used the select_dtypes method to get a subset of the data frame, only including the object columns (which are generally going to be strings), and then iterated over its column names. I ignored the SALARY column – but for every other column name, I invoked astype('category'), and then assigned the result back to the column itself.


for column_name in df.select_dtypes('object').columns:
    print(column_name)
    if column_name == 'SALARY':
        continue
    df[column_name] = df[column_name].astype('category')

Note that the above loop included an invocation of print. I always like to do that when processing columns, so that I can be sure that what I planned matches what is actually happening.

SALARY was a bit trickier, in that I first needed to replace the string REDACTED with np.nan. Without doing that, there wasn't any way for me to turn it into a float column. But with that in place, I could use astype(float) and get the change I wanted.


import numpy as np

df['SALARY'] = df['SALARY'].replace({'REDACTED': np.nan}).astype(float)

After doing these transformations, the size of my data frame shrank to a mere 153.2 MB. That's right – a savings of about 95%. Not bad for a few lines of code, right?

The full data frame contains 2,289,472 rows and 31 columns. We won't need most of these columns – and to be honest, we won't even need the DATECODE column. I originally thought that it would come in handy, but it turns out to be the timestamp for when the data file was created, not for when the person started, stopped, or changed their job. (But at least you got some more practice using those weird date-format strings, right?)

Assuming that each row in the data frame describes one federal employee, what's the fastest and best way to find out how many employees there are? The New York Times, as of this writing, says that the Environmental Protection Agency, Department of Education, Department of Commerce, Department of Labor, and Department of Housing and Urban Development have had the greatest percentage of furloughs across government agencies. How many people work, total, at those agencies? Does the number match what the New York Times reports? Why or why not?

Another way to phrase the first part of this question is: What's the fastest way to find the number of rows in a data frame?

I can think of a few different ways to get the number of rows:

- len(df)
- len(df.index)
- df.shape[0]
- df.size (which counts every element, so you would divide by the number of columns)

Note that a fifth possibility, invoking df.count(), isn't a good idea, simply because it doesn't include NaN values. So you'll get the number of non-NaN values, which is a useful thing to know, but isn't quite what I'm asking here.
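Here's a tiny, made-up example showing the difference:

import numpy as np
import pandas as pd

small_df = pd.DataFrame({'SALARY': [50_000, np.nan, 70_000]})

print(len(small_df))                 # 3 -- every row, NaN or not
print(small_df['SALARY'].count())    # 2 -- only the non-NaN values
print(small_df.count())              # per-column counts of non-NaN values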

I've used Jupyter for years, and I got a bit addicted to the %timeit magic command that it provides for quick-and-easy timing of things. Now that I'm trying to use Marimo for more of my work, I've been a bit frustrated that there isn't a quick or easy replacement.

Except that there is: timeit is a module in the standard library, and if you call timeit.timeit, you can pass a lambda expression, i.e., an anonymous function. That function is then invoked many times (one million, by default), and timeit.timeit returns the total time needed for all of those calls.

So to find out how long it takes to run df.size, I can just say:

import timeit

print(f'df.size: {timeit.timeit(lambda: df.size)}')

Notice that I've used an f-string, allowing me to describe the test and also run it inside of the {}.

I set up the following timing tests:


import timeit

print(f'df.size: {timeit.timeit(lambda: df.size)}')
print(f'df.shape[0]: {timeit.timeit(lambda: df.shape[0])}')
print(f'len(df): {timeit.timeit(lambda: len(df))}')
print(f'len(df.index): {timeit.timeit(lambda: len(df.index))}')

And here's how they fared:

len(df): 0.3080042921938002
len(df.index): 0.21482804184779525
df.shape[0]: 0.47857395792379975
df.size: 4.209399749990553

That's right – df.size is the slowest of the bunch, taking more than 10 times longer than getting len(df). But the fastest technique of all is actually len(df.index). Years ago, someone told me that len(df.index) is the fastest way to go, and I've always been amazed to see just how much faster it is than the alternatives.

Now that it's clear how to calculate the length of a data frame in the shortest time, I asked you to find the number of employees working for the five agencies with the highest percentage of furloughed workers.

First, I kept only two columns from the data frame, AGY (the two-character agency code) and AGYT (the full agency name). I then used dropna to remove any rows in which AGY was NaN:

(
    df
    [['AGY', 'AGYT']]
    .dropna(subset='AGY')
)

I did some looking and checking, and found the two-letter codes for the agencies that interested us. (Yes, I could have used the full names in my query, but that would have made it much harder to write and read.)
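If you want to do that kind of checking inside of Pandas itself, one possible approach (a sketch, not the lookup I actually ran) is to list each unique code/name pair and scan through the result:

# Show each two-letter AGY code next to its full AGYT agency name
df[['AGY', 'AGYT']].drop_duplicates().sort_values('AGY')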

I used those codes in a loc and lambda combination, allowing me to keep rows for which the lambda function returned True. In this case, the function used str.contains along with a simple regular expression containing the two-letter codes.

I should note that I checked a few other methods for doing this, including isin – passing it both an inline list of strings and a list built by invoking str.split on a string – and they were actually much slower than the regular expression! The performance was boosted a bit by passing regex=True explicitly, presumably because Pandas didn't need to figure out whether the string was a regular expression or not.
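For the curious, an isin-based version might look something like the following sketch; it isn't the exact code I benchmarked, but it uses the same five agency codes that appear in the query below:

# A sketch of the isin alternative: keep rows whose AGY code is in the list
agency_codes = ['EP', 'CM', 'ED', 'DL', 'HF']

df.loc[df['AGY'].isin(agency_codes)]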

Because I only wanted the AGYT column after selecting with loc, I passed that column name as the second argument to loc:

(
    df
    [['AGY', 'AGYT']]
    .dropna(subset='AGY')
    .loc[lambda df_: df_['AGY'].str.contains('EP|CM|ED|DL|HF', regex=True), 'AGYT']
)

Finally, I wanted to know how many rows were in this filtered data set. I invoked pipe, which allows me to invoke a function as part of a method chain. In this case, I could have just passed len as the argument to pipe, but using Marimo's built-in cell-execution timer, I found that passing a lambda that then invoked len on the index was still the fastest:

(
    df
    [['AGY', 'AGYT']]
    .dropna(subset='AGY')
    .loc[lambda df_: df_['AGY'].str.contains('EP|CM|ED|DL|HF', regex=True), 'AGYT']
    .pipe(lambda df_: len(df_.index))
)

This query returned 83,012, which was about 6,000 more people than I totaled from the New York Times report. Why the difference? The federal government is huge, and people are constantly joining and leaving even at the most stable of times. But the last few months have not been particularly stable, with many employees being fired, taking buy-outs, or leaving of their own accord. Plus, this data is from March, which means that it's already six months out of date.

The numbers are roughly accurate, but to me, at least, this explains the discrepancy.