BW #11: Software jobs (solution)

With the recent layoffs in tech, it's easy to think there aren't any open jobs for programmers. How do coding jobs compare with the rest of the economy? And are all countries experiencing trouble?

This week’s topic: Software jobs

This week’s data set came from the GitHub repo put together by Hiring Lab, Indeed.com’s economic-research arm. We looked at the number of job postings as a proxy for job demand.

If you’re Git savvy, then you can download the entire repository with “git clone”:

https://github.com/hiring-lab/job_postings_tracker

If you aren’t familiar with Git, then you can use this URL to get the data as a zipfile:

https://github.com/hiring-lab/job_postings_tracker/archive/refs/heads/master.zip

Our questions for this week are:

  1. Turn aggregate job postings into a data frame (aggregate_df). We're interested in the date and seasonally adjusted number of job postings, but only where “variable” is “total postings”.
  2. On how many days since data collection began has the index been greater than 100 (i.e., more postings than in February 2020)?
  3. Draw a line plot with the posting index per day.
  4. Create a second data frame (sector_df) from the job_postings_by_sector_US file. We're interested in the date, postings_index, and display_name columns.
  5. Reformat the data such that the index contains the date, and the columns are the various sectors.
  6. On how many days since data collection began has the index for software development been greater than 100 (i.e., more postings than in February 2020)?
  7. Draw a line plot with the posting index for software development per day.
  8. On how many days since January 1st, 2023, has the index for software development been greater than 100 (i.e., more postings than in February 2020)?
  9. Which two sectors' job-posting indexes are most highly correlated with software development? Which two are least correlated with software development?
  10. Create a line plot, showing software job openings vs. the aggregate index.
  11. Create a line plot, showing software job openings in the US vs. those in Australia, Canada, Germany, France, and Great Britain.

Let’s get to it!

Turn aggregate job postings into a data frame (aggregate_df). We're interested in the date and seasonally adjusted number of job postings, but only where “variable” is “total postings”.

First, I had to set up Pandas with my usual imports:

import pandas as pd
from pandas import Series, DataFrame

Once I did that, I wanted to load the CSV file into a data frame. Truth be told, I could do that with a single line of code, using the “read_csv” function:

filename = 'US/aggregate_job_postings_US.csv'
aggregate_df = pd.read_csv(filename)

This would actually work, if all I wanted was a simple data frame. But I only wanted selected columns, I wanted to turn the “date” column into datetime values, and I wanted to make that column the index. I could do all of that with:

aggregate_df = pd.read_csv(filename, 
                 parse_dates=['date'],
                 index_col='date')

Here, I create the data frame based on filename. I tell Pandas to parse the “date” column as a datetime when reading it in, and I also ask it to turn that column into the index.
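
If you want to double-check that the parsing worked, you can inspect the index’s dtype, and even slice by date. Here’s a quick sanity check, not part of the original exercise:

print(aggregate_df.index.dtype)            # datetime64[ns]
print(aggregate_df.loc['2020-02'].head())  # rows from February 2020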

I then modified this call to read only selected columns. I can do that with the “usecols” parameter, passing it a list of column names. But given the long names that the Hiring Lab people used, I thought it would be nice to rename them, which I can do by passing a list of strings as the “names” keyword argument.

But wait — if I give Pandas the names I want to use, then how can I specify which columns I want with “usecols”? That is, if I rename the columns “a”, “b”, and “c”, how can Pandas know to which of the columns in the input file it should give those names?

The answer is that we can pass “usecols” a list of integers, indicating the index (starting with 0) of the columns we want to select. Passing numbers to usecols, and then passing strings to names, allows us to choose columns and rename them.

Ah, but then we have a problem: if we set the names, then Pandas no longer assumes that the first line of the file contains column names. Rather, it assumes that the first line contains data. We need to tell it that the first line is a header row to be skipped (and replaced by our names), something we can do by passing header=0.
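
Before applying this to the real file, here’s a toy illustration of how these three keyword arguments interact; the miniature CSV here is made up purely for demonstration purposes:

from io import StringIO

# A made-up three-column "file" -- not the real data, just an illustration
csv_data = StringIO('''first,second,third
1,2,3
4,5,6''')

mini_df = pd.read_csv(csv_data,
                      usecols=[0, 2],     # keep columns 0 and 2, by position
                      names=['a', 'c'],   # our names for the kept columns
                      header=0)           # skip the file's own header row

print(mini_df)
#    a  c
# 0  1  3
# 1  4  6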

In the end, our call to read_csv looks like this:

aggregate_df = pd.read_csv(filename, 
                 usecols=[0, 2, 4],    # choose columns by position
                 parse_dates=['date'],
                 names=['date', 'postings_index', 'variable'],
                 header=0,             # the file's first line is a header to skip
                 index_col='date')

(Note that the file contains two columns of index values: one raw, and one seasonally adjusted. I decided that we would only look at the seasonally adjusted data, which takes into consideration the fact that certain jobs are seasonal. For example, summer camps hire a lot in June, and then don’t hire again until September. Package-delivery companies hire in November, but not in February. By using seasonally adjusted data, economists hope to have a truer sense of what’s happening in the world.)

The data that we loaded contains two types of data for each date. One is “total postings,” indicating how many job postings there were on a given day. Another is “new postings,” indicating how many new jobs were posted on a given day. I decided that we should only look at “total postings,” in no small part because we’ll be able to compare that with the sector-specific data.
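
If you want to confirm which values actually appear in that column before filtering, a quick check (my own addition, not part of the exercise) does the trick:

# Show the distinct values in "variable", and how often each appears;
# we expect to see "total postings" and "new postings"
aggregate_df['variable'].value_counts()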

I thus asked you to keep only those rows in which “variable” contains the value “total postings”. We can do that in this way:

aggregate_df = aggregate_df.loc[aggregate_df['variable'] == 'total postings']

I create a boolean series by comparing the values in aggregate_df[‘variable’] with “total postings”. I can then apply that boolean series to “aggregate_df.loc”. That returns a new data frame, based on aggregate_df, but containing only those rows where “variable” was equal to the string “total postings”.
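
For what it’s worth, the same filter can be written with the “query” method. This is just an alternative spelling, not what I used above:

# An equivalent filter, expressed as a query string
aggregate_df = aggregate_df.query("variable == 'total postings'")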

With this, our aggregate data frame is in place, and we can start to perform some calculations and analysis.

On how many days since data collection began has the index been greater than 100 (i.e., more postings than in February 2020)?

The index starts at 100 on the data’s first day. This doesn’t mean that there were 100 job postings on that day, but rather that this day was taken as the baseline measurement. If more job postings are online than on that day, we’ll have a value greater than 100. If fewer are online, we’ll have a value smaller than 100. For example, an index value of 120 means 20 percent more postings than on the baseline day.

I thus wanted to know: on how many days since data collection began has the index been greater than 100?

The easiest way to calculate this is to make the comparison directly:

aggregate_df['postings_index'] > 100

This will return a boolean series in which True values represent days on which the value was greater than 100 and False values represent days on which it was 100 or less. We can find out how often each of these values occurred by using value_counts:

(aggregate_df['postings_index'] > 100).value_counts()

I got the following results:

postings_index
True     802
False    352
Name: count, dtype: int64
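
Because Python treats True as 1 and False as 0 in arithmetic, we could also get the count of days above 100 by summing the boolean series. This one-liner is an equivalent alternative:

# True counts as 1 when summed, giving the number of days above 100
(aggregate_df['postings_index'] > 100).sum()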

What if I want to find out not how many days there were in each category, but rather what proportion of the total they represent? For that, I can pass normalize=True to value_counts:

(aggregate_df['postings_index'] > 100).value_counts(normalize=True)

postings_index
True     0.694974
False    0.305026
Name: proportion, dtype: float64
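
Along the same lines, the mean of a boolean series is the proportion of True values, so this one-liner returns the same proportion directly:

# The mean of a boolean series is the fraction of True values (about 0.695 here)
(aggregate_df['postings_index'] > 100).mean()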

According to our data, on nearly 70 percent of the days in our data set, there were more job postings than on that first day. Assuming that job postings are a proxy for the health of the economy, this means that the economy has, on balance, grown since then.