
Bamboo Weekly #168: US gas prices (solution)

Get better at: web scraping, handling multiple files, multi-indexes, cleaning data, pivot tables, and plotting with Plotly.


The war between the US (and Israel) and Iran has figured prominently in the news over the last two months. One of the reasons, of course, is the blockade of the Strait of Hormuz, through which huge supplies of oil, fertilizer, and helium are exported to the rest of the world. This has led to a rise in many prices, including those of oil — which translates into higher gasoline prices around the world.

Now, the US typically has lower gas prices than most developed countries. But they have jumped by quite a bit over the last two months, and people are taking notice.

How much have they risen? And has that rise been spread evenly across the country?

This week, we looked at data from AAA (the American Automobile Association), known for its towing services and (in the pre-GPS era) for its maps. We'll see how much prices have changed, which types of fuel have changed the most, and where the biggest changes have taken place.

Paid subscribers, both to Bamboo Weekly and to my LernerPython+data membership program (https://LernerPython.com), get all of the questions and answers, as well as downloadable data files, downloadable versions of my notebooks, one-click access to my notebooks, and invitations to monthly office hours.

Learning goals for this week include web scraping, combining data, cleaning data, multi-indexes, grouping, window functions, and plotting with Plotly.

Data and five questions

This week's data comes from AAA's "gas prices" web site, at https://gasprices.aaa.com/ . Specifically, we'll be looking at data from each of the 50 states, at URLs of the form https://gasprices.aaa.com/?state=XX, where XX is a two-letter state abbreviation.

To add to the challenge, you cannot download the data directly from the site. Nor can you use the read_html method directly on the site's URLs; you will get a 403 ("forbidden") error message. Nor, it seems, can you even use the requests library to retrieve the data.

Instead, you'll have to use the curl_cffi package from PyPI, which has enough smarts to get around the bot-blocking software that AAA and others use. (Please download the data as few times as possible, to avoid angering them or abusing their site.) You'll then combine the 50 states' data into a single data frame, and start to perform calculations on it.
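Here's a minimal sketch of what a curl_cffi request looks like (the full loop appears in the first solution below):

from curl_cffi import requests

# curl_cffi mirrors the requests API, but can impersonate a real
# browser, which is what gets us past the bot detection
response = requests.get('https://gasprices.aaa.com/?state=NY',
                        impersonate='chrome')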

Here are my solutions and explanations for the five questions that I posed yesterday:

Using curl_cffi, create a Python dictionary with 50 key-value pairs. The keys will be the 50 US state abbreviations, and the values will be the curl_cffi response object that you got back from each state page.

I started off, as usual, by importing Pandas and Plotly:

import pandas as pd
from plotly import express as px

I wanted a list of US state names and abbreviations, and found such a module – appropriately named us – on PyPI (https://pypi.org/project/us/). If you iterate over us.states.STATES, you get an object for each of the 50 states, and each object has an abbr attribute containing the state's two-letter abbreviation.
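For example, here's a quick sketch of what the module gives you (the ordering in the comment assumes the package lists states alphabetically by name):

import us

# Each element of us.states.STATES is a State object whose .abbr
# attribute holds the two-letter postal abbreviation
print([one_state.abbr for one_state in us.states.STATES][:5])
# e.g. ['AL', 'AK', 'AZ', 'AR', 'CA'] -- 50 states in all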

Normally, I would at this point invoke read_html once for each state's page on the AAA site. read_html returns a list of data frames, each representing one HTML table on a page. So if I were to iterate over each state object, use its abbreviation to invoke read_html on that state's page, and then grab the first data frame in the resulting list, I would be in great shape. The code would look something like this:

import us

all_dfs = [pd.read_html(f'https://gasprices.aaa.com/?state={one_state.abbr}')[0]
           for one_state in us.states.STATES]

There's just one problem, which is that AAA doesn't let you read or scrape their site. Or at least they make it difficult to do that. You can make your request, and you'll get a response, but it won't have a 200 ("OK") status code, and won't have the body of the page you tried to read.

I've encountered this before; many sites don't want you to scrape them. Fortunately, the requests library (https://docs.python-requests.org/en/latest/index.html) offers some tricks, including the ability to set the user-agent header – the string that a browser sends to a server, identifying itself. By using requests, I could claim to be another kind of browser (or even use https://pypi.org/project/fake-useragent/ on PyPI to choose a random one each time). That would solve the problem, right?
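Here's a rough sketch of that approach with plain requests (the specific User-Agent string is just an illustrative example):

import requests

# Claim to be a desktop Chrome browser by overriding the
# User-Agent header that requests would otherwise send
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/125.0 Safari/537.36'}
response = requests.get('https://gasprices.aaa.com/?state=NY',
                        headers=headers)
print(response.status_code)    # hoping for 200 ('OK')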

Except that this time around, it didn't help. The reason is apparently that Cloudflare, which provides proxying and protection for many Web sites, detects bots using more than just the user-agent string. That's where I discovered curl_cffi (https://pypi.org/project/curl-cffi/), which seems to be a bit smarter about evading detection by Cloudflare.

I thus decided to start with just retrieving the data, pushing the parsing into the next question. My goal was to create a dict in which the keys are the state abbreviations and the values are the responses from the AAA site.

I thus created an empty dict, responses. I then iterated over each state object, grabbing the abbreviation for each one. I also printed the value of the current abbreviation to help with debugging.

Then, using the abbreviation and an f-string, I defined url, the page I wanted to download from the AAA site.

I then used requests.get – curl_cffi's version of the function of the same name in the requests library – to retrieve the URL. But notice that I also passed a special keyword argument, impersonate='chrome', functionality that curl_cffi offers to, well, impersonate a browser. I then took the full response object that I got back, and stuck it into the dict with the state abbreviation as the key:

from curl_cffi import requests

responses = {}

for one_state in us.states.STATES:
    abbr = one_state.abbr
    print(f'Downloading {abbr}')    # progress output, for debugging
    url = f'https://gasprices.aaa.com/?state={abbr}'

    # impersonate='chrome' has curl_cffi present itself to the
    # server as a real Chrome browser
    responses[abbr] = requests.get(url, impersonate='chrome')

Note that because I iterated over us.states.STATES, we didn't get data from Washington, DC. There is a page where the abbreviation is DC, but I decided that things were complicated enough that we could ignore DC gas prices for now. My apologies to any readers who live in DC itself!
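If you did want to include DC, the us package exposes it separately from STATES; here's a sketch, assuming the us.states.DC object that the package's documentation describes:

# DC isn't in us.states.STATES, but us defines it separately
# (assumption: us.states.DC, per the package docs)
dc_abbr = us.states.DC.abbr    # 'DC'
responses[dc_abbr] = requests.get(
    f'https://gasprices.aaa.com/?state={dc_abbr}',
    impersonate='chrome')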

Having run this code, I now have a dict that I can use to create a data frame — which is what we do in the next question.

Create a single data frame from the dict that you have created, using the first HTML table on each page (i.e., the one listing average prices for different types of gas, at different points in time). The resulting data frame should have a two-level multi-index, with the outer level being the two-letter state abbreviation, and the inner level being the description. The data frame itself should have four float columns reflecting average gas prices in a given state.

Given a dict of response objects, how can we turn them into a single data frame?

Remember that I originally wanted to do something like this:

all_dfs = [pd.read_html(f'https://gasprices.aaa.com/?state={one_state.abbr}')[0]
           for one_state in us.states.STATES]

Now I'd like to do something similar, but instead of invoking read_html on a URL, I'll invoke it on the text that I got back from each of the URLs:

[pd.read_html(one_response.text)[0]
 for one_abbrev, one_response in responses.items()]

Because responses is a dict, and because I'll want to use the abbreviation as well as the response, I use dict.items here to get the key and value on each iteration. I then grab one_response.text, the string containing the page's HTML, and pass it to read_html. Then, because we want the data frame from the first HTML table on the page, we take element [0] of the list that read_html returns.

The good news? The code is straightforward.

The bad news? It won't work. That's because read_html no longer parses HTML passed directly as a string. If you give it a string, it'll assume that it's a filename or a URL, not the text of a page.

The solution is to use io.StringIO, from Python's standard library. StringIO objects act like files (i.e., they're "file-like objects"), which means that we can pass them in place of files. We can thus say:

from io import StringIO

[pd
 .read_html(StringIO(one_response.text))[0]
 for one_abbrev, one_response in responses.items()]

In other words, we take one_response.text, and use it to create a new instance of StringIO. We then pass that StringIO to read_html, which returns a list of data frames. We grab the first data frame. In the end, we have a list of data frames, one from each US state.

This is all great, but let's do a few more things. First of all, let's rename some of the columns, using rename, passing a dict so that their names are a bit more meaningful and standardized. The first column, which gets the oh-so-clear name Unnamed: 0, will be renamed Measure, since I'm bad at naming things. And Mid-Grade will lose the hyphen, and become Midgrade.

Also, we need to know which state we're talking about! So I used assign to add a new column to the data frame, state, containing the abbreviation.

[pd
    .read_html(StringIO(one_response.text))[0]
    .rename(columns={'Unnamed: 0':'Measure', 'Mid-Grade':'Midgrade'})
    .assign(state = one_abbrev)
    for one_abbrev, one_response in responses.items()]

I next used set_index to make a two-part multi-index from state and Measure, by passing a list of the two column names.

And finally, I turned all of the columns into floats, using apply to run a lambda function on each column. The lambda used str.removeprefix to remove the dollar signs (something that I originally did with regular expressions, but this was much simpler), followed by a call to astype to turn the now-numeric strings into floats:

[pd
    .read_html(StringIO(one_response.text))[0]
    .rename(columns={'Unnamed: 0':'Measure', 'Mid-Grade':'Midgrade'})
    .assign(state = one_abbrev)
    .set_index(['state', 'Measure'])
    .apply(lambda s_: s_.str.removeprefix('$').astype(float))
    for one_abbrev, one_response in responses.items()]

But of course, this still isn't quite enough. The list that we get back needs to be turned into a single data frame. That's where pd.concat comes in: It takes a list of data frames, and combines them into a single one.

Except that actually, we don't need to pass a list of data frames. We can instead pass a generator, created via a generator expression. The total query now looks like this:

df = pd.concat((pd
                .read_html(StringIO(one_response.text))[0]
                .rename(columns={'Unnamed: 0':'Measure', 'Mid-Grade':'Midgrade'})
                .assign(state = one_abbrev)
                .set_index(['state', 'Measure'])
                .apply(lambda s_: s_.str.removeprefix('$').astype(float))
               )
                for one_abbrev, one_response in responses.items())
df

Our data frame, df, now has 250 rows (five for each state) and four columns (one for each type of fuel).
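As a quick sanity check, here's a short sketch verifying the structure described above:

# Verify the shape and structure of the combined data frame
print(df.shape)          # (250, 4): 50 states x 5 measures each
print(df.index.names)    # ['state', 'Measure']
print(df.dtypes)         # all four fuel columns should be float64

# Partial indexing on the multi-index grabs one state's rows
print(df.loc['CA'])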