BW #2: Egg prices (solution)

How much have egg prices increased in the last few years? And have they started to decrease?

This week’s topic: Egg prices

Yesterday, I asked you to download information from the USDA, so that we can find out how much egg prices have increased in the last few years. The questions I posed were as follows:

  1. Given URLs for each of the last five years of egg prices, create a single data frame.
  2. What was the average low price for eggs in each of the years of data?
  3. In what year was the increase the greatest?
  4. Now get the average low price for eggs in each month and year of data.
  5. Have egg prices gone down in the last few months?

Discussion

Egg prices have been in the news quite a lot, so I thought that it would be interesting to download and explore this data. As I mentioned yesterday, I was sure that it would be relatively straightforward to do this, because the USDA’s site allows you to download data in Excel format.

But there were at least two sticking points: First, the so-called Excel format that they offer is really HTML. Second, you can only download two years’ worth of data at a time.

So the first question is a non-trivial problem in and of itself: How can we, given a template URL (which I provided) for egg prices during a given year, retrieve multiple years, and turn all of that data into a single data frame?

Let’s start with the URL f-string that I gave you:

f'https://marketnews.usda.gov/mnp/py-report?&repDate=01%2F01%2F{year}&regionsDesc=&endDate=12%2F31%2F{year}&repMonth=1&run=Run&_producttypefrom=1&endYear=2023&producttypefrom=&repYear=2020&categoryDesc=Egg&frequency=Daily&datatype=None+Selected&previouscls=Breaking+Stock&_producttype=1&report=NW_PY041&category=Egg&subcategory=Breaking+Stock&endMonth=1&commodityDesc=Breaking+Stock&runReport=true&format=excel'

Let’s say that I want to retrieve data from 2018 through this year (2023). I could do something like the following:

import pandas as pd

for year in range(2018,2024):
    print(f'Retrieving {year}')
    url = f'https://marketnews.usda.gov/mnp/py-report?&repDate=01%2F01%2F{year}&regionsDesc=&endDate=12%2F31%2F{year}&repMonth=1&run=Run&_producttypefrom=1&endYear=2023&producttypefrom=&repYear=2020&categoryDesc=Egg&frequency=Daily&datatype=None+Selected&previouscls=Breaking+Stock&_producttype=1&report=NW_PY041&category=Egg&subcategory=Breaking+Stock&endMonth=1&commodityDesc=Breaking+Stock&runReport=true&format=excel'
    pd.read_excel(url)   # retrieve each year's report as an Excel file

At first glance, this would seem to work. But when I tried it, I quickly discovered that the “Excel” file being downloaded was … actually HTML. So instead of calling read_excel, we should use read_html. The good news? read_html can accept a URL (or a string of HTML) and turn what it finds into data frames. The bad news? It actually returns a list of data frames, one for each HTML table it finds at the named URL. We thus need to take the result from read_html, and grab index 0 from it:

for year in range(2018,2024):
    print(f'Retrieving {year}')
    url = f'https://marketnews.usda.gov/mnp/py-report?&repDate=01%2F01%2F{year}&regionsDesc=&endDate=12%2F31%2F{year}&repMonth=1&run=Run&_producttypefrom=1&endYear=2023&producttypefrom=&repYear=2020&categoryDesc=Egg&frequency=Daily&datatype=None+Selected&previouscls=Breaking+Stock&_producttype=1&report=NW_PY041&category=Egg&subcategory=Breaking+Stock&endMonth=1&commodityDesc=Breaking+Stock&runReport=true&format=excel'
    pd.read_html(url)[0]

Notice that read_html is a function that we call via “pd”, the top-level Pandas module itself. That’s because we want to create a new data frame, and we’re not working with a data frame that already exists.

The thing is, we don’t just want to create a single data frame. We want to get a bunch of data frames, one for each year, and then combine them. We can do that by creating an empty list and then appending each created data frame to that list:

all_dfs = []
for year in range(2018,2024):
    print(f'Retrieving {year}')
    url = f'https://marketnews.usda.gov/mnp/py-report?&repDate=01%2F01%2F{year}&regionsDesc=&endDate=12%2F31%2F{year}&repMonth=1&run=Run&_producttypefrom=1&endYear=2023&producttypefrom=&repYear=2020&categoryDesc=Egg&frequency=Daily&datatype=None+Selected&previouscls=Breaking+Stock&_producttype=1&report=NW_PY041&category=Egg&subcategory=Breaking+Stock&endMonth=1&commodityDesc=Breaking+Stock&runReport=true&format=excel'
    all_dfs.append(pd.read_html(url)[0])

I added the call to “print” simply because it took a while to retrieve each URL, and it’s nice to have progress updates from a long-running program.

How can we combine them? The pd.concat function takes a list of data frames, and returns a single data frame based on all of them. Thus, retrieving all of the data from the last five years can look like this:

all_dfs = []
for year in range(2018,2024):
    print(f'Retrieving {year}')
    url = f'https://marketnews.usda.gov/mnp/py-report?&repDate=01%2F01%2F{year}&regionsDesc=&endDate=12%2F31%2F{year}&repMonth=1&run=Run&_producttypefrom=1&endYear=2023&producttypefrom=&repYear=2020&categoryDesc=Egg&frequency=Daily&datatype=None+Selected&previouscls=Breaking+Stock&_producttype=1&report=NW_PY041&category=Egg&subcategory=Breaking+Stock&endMonth=1&commodityDesc=Breaking+Stock&runReport=true&format=excel'
    all_dfs.append(pd.read_html(url)[0])

df = pd.concat(all_dfs)    
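
A side note: each of the downloaded data frames arrives with its own index, so the combined data frame will contain repeating index values. That won’t bother us in today’s queries, but if you prefer a fresh index, pd.concat can renumber the rows for you:

df = pd.concat(all_dfs, ignore_index=True)   # new index runs from 0 through n-1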

This works fine — but it’s also a good opportunity (I believe) to use a list comprehension instead of a regular “for” loop. List comprehensions are perfect for situations in which we already have an iterable (e.g., our call to “range”) and we want to create a list based on it.

We can get the same results in this way:

df = pd.concat([pd.read_html(f'https://marketnews.usda.gov/mnp/py-report?&repDate=01%2F01%2F{year}&regionsDesc=&endDate=12%2F31%2F{year}&repMonth=1&run=Run&_producttypefrom=1&endYear=2023&producttypefrom=&repYear=2020&categoryDesc=Egg&frequency=Daily&datatype=None+Selected&previouscls=Breaking+Stock&_producttype=1&report=NW_PY041&category=Egg&subcategory=Breaking+Stock&endMonth=1&commodityDesc=Breaking+Stock&runReport=true&format=excel')[0]
                for year in range(2018, 2024)])

And yes, the URL is long and ugly, but we’ll ignore that for now.
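
That said, if you do want to tame it, here’s one possible approach (a sketch that I haven’t tested against the USDA’s server): build the query string from a dictionary, using the standard library’s urlencode function, wrapped in a hypothetical helper that I’m calling egg_price_url. Every parameter name and value below comes straight from the URL we’ve been using:

from urllib.parse import urlencode

def egg_price_url(year):
    # All of these names and values are taken from the original URL;
    # urlencode handles the escaping ("/" becomes %2F, spaces become +)
    params = {'repDate': f'01/01/{year}',
              'regionsDesc': '',
              'endDate': f'12/31/{year}',
              'repMonth': 1,
              'run': 'Run',
              '_producttypefrom': 1,
              'endYear': 2023,
              'producttypefrom': '',
              'repYear': 2020,
              'categoryDesc': 'Egg',
              'frequency': 'Daily',
              'datatype': 'None Selected',
              'previouscls': 'Breaking Stock',
              '_producttype': 1,
              'report': 'NW_PY041',
              'category': 'Egg',
              'subcategory': 'Breaking Stock',
              'endMonth': 1,
              'commodityDesc': 'Breaking Stock',
              'runReport': 'true',
              'format': 'excel'}
    return 'https://marketnews.usda.gov/mnp/py-report?' + urlencode(params)

With that in place, the comprehension becomes pd.concat([pd.read_html(egg_price_url(year))[0] for year in range(2018, 2024)]).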

With this data frame in place, can we start to perform our queries? Not quite yet. The data frame that we downloaded has a column named “Date”, with the month, day, and year on which the egg-price data was retrieved. However, the column contains text, as you’ll see if you run either df.info() or df.dtypes. If we’re going to perform date-related calculations, we’ll need to turn that column into a “datetime” column.
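
If you only care about that one column, you can also ask for its dtype directly; a text column shows up with the catch-all “object” dtype:

df['Date'].dtype    # text columns are reported as object (dtype('O'))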

One way to create a datetime column from a text column is with the “pd.to_datetime” function:

pd.to_datetime(df['Date'])

We can then assign the result of this back to the “Date” column:

df['Date'] = pd.to_datetime(df['Date'])

With that column now a datetime, we can retrieve date-and-time-related elements from it via the “dt” accessor. That’ll come in handy very soon.
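
For example, once “Date” is a datetime column, “dt” gives us each component of the date as its own series:

df['Date'].dt.year     # the year as an integer, e.g., 2018
df['Date'].dt.month    # the month as an integer, from 1 through 12
df['Date'].dt.day      # the day of the month as an integer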

My next question was: What was the average low price for eggs in each of the years of data?

Whenever you want to get the average value of one (numeric) column for each unique value in another (categorical) column, you know you’re going to be using grouping. That’s all fine, but here, which column will we group on? We want to group by year, but there isn’t a “year” column, just a “Date” column. True, but we can retrieve the year with df[‘Date’].dt.year. That leads us to the following:

df.groupby(df['Date'].dt.year)['Low Price'].mean()

The above code asks Pandas to calculate, for each unique year it finds in df[‘Date’], the mean value of the “Low Price” column. So we’ll find the mean low price for 2018, then for 2019, then for 2020, until we get to 2023. The “groupby” method is a powerful and common one, allowing us to run an aggregation method once for each value of a categorical column.

From this data, we can already see that the price — which, according to the data set we downloaded, is in cents per dozen — went up significantly from 2021 to 2022, from about 49 cents/dozen to about 188 cents/dozen. Yikes!
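
That observation also hints at the answer to the next question, about the year in which the increase was greatest. Here’s a minimal sketch of how I’d check, starting from the yearly means we just calculated: “diff” subtracts each year’s mean from the following year’s, and “idxmax” then reports the year with the biggest jump:

yearly_means = df.groupby(df['Date'].dt.year)['Low Price'].mean()

# diff() gives the year-over-year change in the mean low price;
# idxmax() returns the index (here, the year) of the largest increase
yearly_means.diff().idxmax()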