BW #19: Working women (solution)
Women's participation in the US workforce is at an all-time high, after a dramatic drop at the start of the pandemic. This week, we'll look through the history of working women in the US.
This week, we’re looking at data from the US Bureau of Labor Statistics, an agency that tries to keep track of labor-related information in the US economy. The BLS recently released a report saying that women’s participation in the workforce was at an all-time high. That sounded like good news to me, but it also surprised me quite a bit, given that at the start of the pandemic, that number was rather low.
I thus decided that this week, we would take a closer look at the number of women participating in the workforce.
Data and questions
The data comes in the form of an Excel spreadsheet, which you can download from this page:
Normally, it’s possible to come up with a good link for downloading a document from the Internet. But because this page allows you to choose the date range you want, there isn’t a single URL you can use to download the data.
I thus asked you to go to this page, indicate that you want data from all years (starting in 1948, and continuing through 2023). Once you’ve done that, you can hit the “Go” button, then click on the Excel logo halfway down the page. That should trigger the download of the Excel spreadsheet.
And with that, we’re off to the races! Here are the questions I asked, along with my solutions.
Load the Excel file into a data frame.
As always, the first things that I executed was my standard Pandas startup line:
import pandas as pd
With that in place, I was able to define a variable with the filename, and then load that filename into a data frame using read_excel:
filename = 'SeriesReport-20230607082216_a26d78.xlsx' df = pd.read_excel(filename)
However, if you actually run this code, you’ll have a few problems:
You get a warning (not an error) from Pandas, saying, “Workbook contains no default style, apply openpyxl's default.” I did some checking, and I can’t say that I really understand what is happening there. Moreover, the data all seemed to arrive intact. So while you generally want to listen to warnings in Python programs, here I give you permission to ignore it. (But if you know what it means, please let me know!)
The columns are all unnamed.
The values, at least in the first few rows, are all NaN (i.e., “not a number”).
The problem? Our spreadsheet’s data doesn’t actually start at the top. Rather, there are a bunch of lines with documentation and explanation. We thus need to tell Pandas to ignore the initial lines. Once we do that, the data will be tabular, and things should work better:
df = pd.read_excel(filename, header=12)
After loading the data, this is what I get from running df.head():
Wrangle the data such that the index will contain a datetime, combining the year and month. (The day can always be the first of the month.)
What do we have now? A “Year” column, and 12 month columns. The data that we want is all here, but this format turns out to be extremely inconvenient if we’re looking to measure trends over time. For example, if I want to know which month and year had the highest participation rate, it’ll be annoying to find. If we want to find the month with the greatest increase from the previous month, then it’ll be even harder — because Pandas isn’t really designed for us to compare December 1999 with January 2000.
We’ll thus need to rejigger our data frame, such that we have an index containing years and months — or to make it simpler, just dates.
But how can we do that?
In the past, we’ve taken two categorical columns and a numeric column, and we’ve turned them into a pivot table, with one categorical column becoming our index, the second categorical column becoming our column names, and the numeric column being our values.
We want to do something here, turning our wide data frame into a narrow one. Instead of 13 columns (year + 12 months), we want three columns: Year, month, and value. It’s almost the opposite of a pivot table. And it’s possible with a Pandas method known as “melt”.
Keep reading with a 7-day free trial