BW #6: End of the humanities? (solution)

This week, we look into the subjects that US college students are studying — and how that compares with data from the last 50 years. What are they studying more, and what are they studying less?

This week’s topic: End of the humanities?

This week, we looked at which majors have become more popular over the last few decades at American universities, and which have become less popular.

Our data set comes from the Digest of Education Statistics, with a table showing the total number of students majoring in each of 30 different areas of study. You can view the table at https://nces.ed.gov/programs/digest/d21/tables/dt21_322.10.asp. Better yet, you can download the file in Excel format from:

https://nces.ed.gov/programs/digest/d21/tables/xls/tabn322.10.xls

Our questions for this week are:

  1. Retrieve the Excel file with selected years, and turn it into a data frame.
  2. Remove the line numbering the surveys. Remove the total. And remove the lines at the bottom, after "Other and not classified".
  3. Remove the \n\ markings (for footnotes, where n is a digit) from the "Field of study" column
  4. Remove newlines and other extraneous whitespace from the "Field of study" column
  5. Set the "Field of study" column to be the index.
  6. Which majors have had the greatest *increase* since the survey began in 1970-1971?
  7. Display the numbers with a comma every three digits -- so that instead of showing "1000" it shows "1,000"
  8. Which majors have had the greatest *decrease* since the survey began in 1970-1971?
  9. If we only look at the last 10 years of the study (i.e., starting in 2010-11), do we see similar changes in majors?
  10. What percentage decline do we see in all fields containing the word "art," "language," "history," or "culture" in their names in the last 10 years?

Discussion

First, before anything else, I did my standard setup for working with Pandas:

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

I spent some time looking through the Department of Education site, and found the page at https://nces.ed.gov/programs/digest/d21/tables/dt21_322.10.asp with a visual version of the data. Fortunately, there was also a button on that page labeled “download Excel,” making it possible to download the data into an Excel file.

I could have downloaded the file onto my computer and loaded it from there. If the file were huge, or if I would need to load it multiple times, I might have done that. But it’s a small file, and I only downloaded it a handful of times in preparing this week’s newsletter — so I took advantage of the fact that read_excel, as well as all of the other read_* methods in Pandas, can take a URL, as well as a filename. I thus downloaded the file as follows:

url = 'https://nces.ed.gov/programs/digest/d21/tables/xls/tabn322.10.xls'
df = pd.read_excel(url)

However, this wasn’t quite enough, because the Excel file was formatted a bit strangely. Pandas normally assumes that the first line contains column names, and that subsequent rows contain data. Sometimes, though, the first line might contain titles or descriptions. In such cases, we need to tell Pandas to ignore one or more lines, skipping down to where the headers are located.

We can do that by passing the header=n keyword argument to read_excel, where n is the line number (starting at 0) on which the headers are located.

I thus ended up downloading the file with:

df = pd.read_excel(url, header=1)

Remove the line numbering the surveys. Remove the total. And remove the lines at the bottom, after "Other and not classified".

However, that wasn’t quite enough to get our data frame into shape. That’s because we had a few other rows that would throw off any analysis we tried to do.

For starters, once we loaded the file into a data frame, the first two non-header rows (indexes 0 and 1) contained information that we didn’t really need — one giving the serial number of the survey in each column, and another giving the total number of people enrolled in post-secondary (i.e., college or vocational) programs in that year.
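If you want to see this for yourself, a quick peek at the top of the data frame (a minimal check, assuming df was loaded as above) makes it clear which rows are in the way:

df.head(3)   # shows the survey-number row, the totals row, and the first real major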

In other words, we want to remove the rows at indexes 0 and 1. The easiest way to do this is with the “drop” method. Note that drop can be used to drop either rows or columns, and you can specify which explicitly by passing the “axis” keyword argument, with a value of either “index” or “columns”. (Yes, you can use numbers instead, but I never remember which is 0 and which is 1, so I prefer the names.) The default is to drop rows.

We can drop one row by passing a single index value. To drop more than one row, we can pass a list of indexes.
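For example, here is a quick sketch (the “Total” column name is made up, just to illustrate dropping along the other axis):

df.drop(0)                        # drop the single row whose index is 0
df.drop([0, 1])                   # drop the rows with indexes 0 and 1
df.drop('Total', axis='columns')  # drop a column by name (hypothetical column)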

By default, “drop” returns a new data frame, identical to the original one but without the rows that were specified. You can instead pass the inplace=True keyword argument, in which case drop returns None and modifies the original data frame. However, the core Pandas developers have made it clear that there is no benefit to passing inplace=True, and that it will soon be deprecated.

In the end, we can remove the first two lines with:

df = df.drop([0, 1])

What about the final lines, after “Other and not classified”? We could again use the “drop” method, but here I think it’ll just be easiest to define a slice on our data frame, up to and not including the final five lines:

df = df[:-5]

Remove the \n\ markings (for footnotes, where n is a digit) from the "Field of study" column

Those final five rows contained numbered footnotes. And while we’ve removed those footnotes, the references to them are still in the first column, labeled “Field of study.” It wouldn’t be terrible for us to keep things the way they are, but they look pretty ugly. For example, the major

Agriculture and natural resources

is currently written as

Agriculture and natural resources\1\

I asked you to remove all of the footnote references in that column. How can we do that?

Remember that every Pandas column is a series. On a series of strings, we can apply the “.str” accessor. This gives us access to a wide variety of string methods — some from standard Python, and others that Pandas has added to our arsenal. For example, if we have a series s containing strings, we can invoke

s.str.len()

We will get back a new series with the same index as s, but with values reflecting the length of each string in s.
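To make that concrete, here’s a minimal sketch with a throwaway series (using the Series class that we imported at the top):

s = Series(['apple', 'fig', 'banana'])
s.str.len()   # returns a series of 5, 3, and 6, with the same index as s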

The “replace” method lets us replace one string with another. For example, if I say

s.str.replace('a', 'b')

then we’ll get back a new series with the same index as s, but with any “a” in s converted into “b”.

This is similar to what we want, but not quite good enough. After all, I don’t want to replace a particular character, or even a set of characters. I want to replace a pattern, one which I could describe as “a backslash, a digit, and another backslash.”

Whenever you find yourself describing text in such a way, you almost certainly want to reach for regular expressions, aka “regexp” or “regex”. Regular expressions have a long history, and are well known for being hard to learn and/or hard to use. I can assure you that they’re not that bad — in fact, my free “Regexp crash course” will teach them to you in just 14 short e-mail lessons.

How can I use regular expressions here? It turns out that the Pandas version of replace actually supports them! You should pass the regex=True keyword argument, as follows:

s.str.replace('a', 'b', regex=True)

What regular expression will we want to provide? Well, we want to find a backslash, followed by a digit, followed by another backslash.

This presents us with three different issues!

Remember that backslashes are used in Python strings for all sorts of special characters, such as \t and \n. If we want our string to include an actual backslash character, then we need to double it, as \\.

But the Python string that we create then needs to be passed along to the regular expression engine. And that engine treats backslashes specially, too! To match a literal backslash, the regexp engine needs to see it escaped with another backslash, meaning that we have to double our backslashes yet again.

That’s already ugly, but we can be saved (somewhat) by Python’s raw strings. A raw string has the letter “r” before the opening quote, and basically tells Python not to interpret backslashes specially, which has the same effect as doubling each one by hand. The idea is that you get to write things the way you want the regexp engine to see them, and Python leaves the backslashes alone. The two primary use cases for raw strings are, in my mind, (a) Windows paths and (b) regular expressions.
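A quick sketch in the Python REPL shows the difference:

len('\\')       # 1 -- the doubled backslash collapses into a single character
len(r'\\')      # 2 -- in a raw string, both backslashes survive
'\\d' == r'\d'  # True -- two ways to write backslash-followed-by-d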

My regular expression is thus looking like this:

r'\\\\'

That is: an initial “r”, followed by two backslashes (which the regexp engine will treat as a single literal backslash), followed by two more backslashes (again treated as a single literal backslash).

We’re missing something important, though — the digit in the middle. Fortunately, if you put \d in a regular expression, it’ll match any single digit. We can thus modify our regexp to read:

r'\\\d\\'

Can you believe that someone would call this unreadable? Crazy, I know.
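Before turning to Pandas, we can sanity-check the pattern with Python’s re module (a quick sketch, with a made-up example string):

import re

# '\\1\\' in a regular (non-raw) Python string is how we spell the
# footnote marking \1\
re.sub(r'\\\d\\', '', 'Agriculture and natural resources\\1\\')
# returns 'Agriculture and natural resources'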

Anyway, I want to remove anything that matches this pattern in the “Field of study” column. To do that, I’ll look for my regexp, and replace it with an empty string:

df['Field of study'].str.replace(r'\\\d\\', '', regex=True)

This returns a new series, without changing the original one. I still need to assign it back to the column in df:

df['Field of study'] = df['Field of study'].str.replace(r'\\\d\\', '', regex=True)

After executing this code, the footnotes with their weird backslash syntax are gone.
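As a quick sanity check, we can confirm that no backslashes remain anywhere in the column:

# Should return False now that the footnote markings are gone
df['Field of study'].str.contains(r'\\', regex=True).any()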