BW #9: US house prices (solution)

Housing prices in the US have gone up quite a bit in the last few years. How much have they changed? How different are prices in different regions?

This week’s topic: US house prices

This week’s data comes from FRED, the amazing financial portal from the St. Louis Federal Reserve. I actually asked you to download six separate CSV files, one for each location.

Our questions for this week are:

  1. Create a data frame with information from all six locations. The index should contain both the date of the price (as a date), and the two-letter location code.
  2. On which date, and in which region, were housing prices historically the highest?
  3. How long ago was that?
  4. In which region are the housing prices currently the highest?
  5. What are the mean and median prices for each location in our data set?
  6. What are the mean prices in New York, across the entire data set?
  7. Between 2000 and 2020, what were the average prices in New York?
  8. Between 2000 and 2020, what were the average prices in each of New York and California?
  9. Since January 2020, how much have home prices changed in each of the six regions?

The learning goals for this week include combining CSV files into a single data frame, working with time series, and extracting from multi-indexes.

Let’s dive in!

Create a data frame with information from all six locations. The index should contain both the date of the price (as a date), and the two-letter location code.

Before doing anything else, I’ll start with my standard three-line intro for anything Pandas-related:

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

Most of the time, I create a data frame from a CSV or Excel file. Sometimes, however, I have to create my data frame by stitching together several CSV files. That’s exactly what I asked you to do this week, and in some ways, that was one of the harder tasks I asked you to accomplish.

(Actually, I think that a number of the things I asked for this week were a bit hard. Sorry about that!)

We’ll start from the end, namely that Pandas provides a “pd.concat” function. It takes a list of data frames, and returns a single data frame based on them. When I use this, I typically create a list called “all_dfs”, append the data frames to it, and then pass all_dfs to pd.concat:

df = pd.concat(all_dfs)

The data frame that is returned, which I call “df”, then contains all of the data from the input data frames.
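
In skeleton form, the pattern looks like this. (This is just a sketch; the real loop body, with the actual read_csv call, appears below.)

all_dfs = []                      # will hold one data frame per CSV file

for one_filename in filenames:    # "filenames" is a stand-in for the glob results shown below
    all_dfs.append(pd.read_csv(one_filename))

df = pd.concat(all_dfs)           # a single data frame with all of the rows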

How, though, will I create data frames from all six of the downloaded CSV files? I decided to use one of my favorite modules in Python, “glob”, whose “glob” function (yes, “glob.glob”) takes a string describing a pattern of filenames. It returns a list of filenames matching that pattern.

In this case, I put all of the files in my “Downloads” directory, which means that I was able to get all of the filenames with:

glob.glob('/Users/reuven/Downloads/??STHPI.csv')
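
Each “?” in the pattern matches exactly one character, so “??” matches the two-letter state code at the start of each filename. The result is a plain Python list of full paths, along these lines (only the NY and CA files are confirmed by this week’s questions, so I’ve elided the rest):

['/Users/reuven/Downloads/CASTHPI.csv',
 '/Users/reuven/Downloads/NYSTHPI.csv',
 ...]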

That’s great… but what do I do now?

In theory, I could just concatenate all of the files together. They all look the same, with the same two columns, “date” and “price.” However, I asked that you add a third column to each data frame, namely the two-letter abbreviation for the state whose data we’re reading. In other words: while we’re reading in six two-column CSV files, I would like to build a single, three-column data frame from them.

The best way that I can think of to do this is with a regular ol’ “for” loop. And yes, I often say that you shouldn’t use for loops with Pandas — but here, I’m using the loop to create a data frame, not to iterate over one.

For each filename we get from glob.glob, we’ll create a data frame, and then add a new column whose value is the two-letter code for the current state. Where will I get the letters from? From the filename!

Since all of the filenames have the same structure, I can grab them using a slice. I decided to use negative indexes (i.e., counting from the right) for my slice, ending up with one_filename[-11:-9]. In other words, I want to get two characters, starting at index -11 (i.e., 11 from the right) up to and not including index -9 (i.e., 9 from the right).
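
Here’s a quick worked example with one such path. (The path itself is hypothetical, but its structure matches what glob.glob returns.)

one_filename = '/Users/reuven/Downloads/NYSTHPI.csv'

one_filename[-11:]      # 'NYSTHPI.csv', the final 11 characters
one_filename[-11:-9]    # 'NY'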

Here’s the code:

import glob

all_dfs = []

for one_filename in glob.glob('/Users/reuven/Downloads/??STHPI.csv'):
    one_df = pd.read_csv(one_filename,
                         names=['date', 'price'],
                         header=0,
                         parse_dates=['date'])
    one_df['location'] = one_filename[-11:-9]
    all_dfs.append(one_df)

While reading each CSV file, I also leaned on several additional options in read_csv:

  • I gave new names to the columns, using the “names” argument
  • I indicated that the file’s headers are on line 0 (i.e., the first line) of the file. Normally, the first line is assumed to contain headers, but when you pass the “names” keyword argument, Pandas assumes that all rows are data. Passing header=0 thus avoided having the original headers mixed in with our actual data.
  • I also passed the “parse_dates” argument, indicating that the “date” column should be treated as a datetime value, rather than as a simple textual value.

After creating each data frame, I add a new column, “location”, whose value comes from the filename. Assuming that all of the filenames really stick to this format, we should be fine; this ensures that I’ll know which file (and location) each row came from.

After going through all six files, all_dfs is a list of six data frames. We can then stitch them together:

df = pd.concat(all_dfs).set_index(['date', 'location']).sort_index()

I start off by invoking pd.concat, as described above. That returns a data frame. But before assigning it to df, I invoke “set_index” on the result. By passing a list of strings, I indicate that our data frame should have a multi-index, with more than one component. The primary index component will be the “date” column, and the secondary will be the “location” column.

While I could have stopped with the result of “set_index”, I decided to do one more thing, namely sort the data frame by its index. There are a number of operations, including slices, that work best if the data frame is sorted by its index, so I decided to get that done right away.
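
With the sorted multi-index in place, extracting data becomes straightforward. For example (a sketch, assuming the (date, location) index we just set, and borrowing the peak date we’ll find below):

df.loc['2022-07-01']             # all six locations for a single quarter
df.xs('NY', level='location')    # all quarters for a single location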

We’ve read our data into a single data frame, set the correct types, and set the correct index. I think we’re ready to answer this week’s questions!

On which date, and in which region, were housing prices historically the highest?

If I want to find the highest price in our “price” column, I can simply run the “max” method on df['price']. But I didn’t ask for the highest price; I asked for the date and region of the highest price, which means that I need the index associated with that price.

I could thus compare df['price'].max() with every element in df['price'], and then use the resulting boolean series to get the matching row (including its index) using df.loc:

df.loc[df['price'] == df['price'].max()]

But Pandas has a method that does this for us, namely “idxmax” (invoked here on the “price” column), which returns the index associated with the highest value:

df['price'].idxmax()

This not only cuts down on the code I have to write, but also dramatically speeds up the execution. (I checked both methods with %timeit in Jupyter, and it was 4x faster to use idxmax.)
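
One thing to notice: because our index has two parts, idxmax returns a two-element tuple, rather than a single value. Roughly speaking (the exact repr can vary across Pandas versions):

df['price'].idxmax()
# (Timestamp('2022-07-01 00:00:00'), 'NY')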

In any event, I see that the highest price was in New York, in the quarter starting July 1st, 2022. We’ve had two quarters of data since then, which means that prices have gone down in the last nine months.
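
By the way, if you’d rather compute that gap than eyeball it, you can unpack the tuple from idxmax and subtract. (A minimal sketch; the printed value naturally depends on when you run it.)

peak_date, peak_location = df['price'].idxmax()
print(pd.Timestamp.now() - peak_date)    # a Timedelta, e.g. '273 days ...'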