BW #16: Consumer oil prices (solution)
How much do people pay for oil-based products in different countries? How do these prices vary over time?
This week, we’re looking at oil prices — not how much a barrel of oil costs, although that’s generally what you’ll see in the news, but rather how much petroleum-based products cost consumers. Given how much we’ve heard about heating oil in Europe over the last year, and how much people in the US are talking about gasoline prices as the summer starts, I thought it might be interesting to look at this.
The International Energy Agency (IEA) tracks this sort of information and makes it publicly available. This week’s data came from https://www.iea.org/data-and-statistics/data-product/monthly-oil-price-statistics-2, where they offer monthly updates on the prices for three products: Gasoline, home heating oil, and diesel fuel.
As I noted yesterday, you'll need to register for a free IEA account in order to download the data. Once you do that, you'll want to download the Excel (xlsx) version of the monthly prices excerpt; the file that I retrieved was last updated on May 10th, 2023.
Let’s get to it!
Load the "raw_data" tab of the Excel file into Pandas as a data frame.
Before we can do anything else, we’ll need to load up Pandas:
import numpy as np
import pandas as pd
Strictly speaking, you don’t need to import NumPy. But I want to make sure that we have it available so that I can reference dtypes later on.
With Pandas imported, I’ll load the Excel file into a data frame with “read_excel”:
filename = 'IEA_Energy_Prices_Monthly_Excerpt_052023.xlsx'
df = pd.read_excel(filename, sheet_name='raw_data')
There are two things to notice about how I loaded the Excel file here: First, because the Excel file contains several sheets, I needed to indicate which sheet I wanted to load. You can do this either by specifying the sheet name (which I did here) or by using its numeric index, starting with 0. If you don’t specify a sheet, then you get the first one (i.e., index 0), which is definitely not what we wanted.
The other thing to notice is that when we load a CSV file, Pandas has to parse the text in each row and column, and then decide which dtype it wants to use. We can give it hints by specifying a “dtype” keyword argument, but it’s up to us to go beyond the default guesses of int64, float64, and object (which is basically a catchall for everything else, but mostly means strings).
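We can see this guessing (and hinting) in action on a small, made-up CSV string — the column names here are just stand-ins, not the IEA data:

```python
import io

import pandas as pd

# A tiny, invented CSV; pandas must guess each column's dtype from the text
csv_data = "COUNTRY,VALUE\nFrance,1.5\nGermany,1.7\n"

df = pd.read_csv(io.StringIO(csv_data))
# COUNTRY is guessed as object (strings), VALUE as float64

# A dtype hint overrides the default guess for a column
df2 = pd.read_csv(io.StringIO(csv_data), dtype={'VALUE': 'float32'})
```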
By contrast, Excel knows all about different data types. Which means that when we read an Excel file into Pandas, it knows precisely what sort of data we’re loading. This means (as we’ll see in a bit) that the TIME column will automatically be loaded as a “datetime” object; there is no need for a “parse_dates” parameter like we would use in read_csv.
Remove rows in which the VALUE column is `..`. Turn the VALUE column into a float.
You would think that the IEA would be nice and smart enough to mark missing data with something mainstream. But no — they put a two-character string, “..”, wherever they didn’t have data.
What’s wrong with that? It means that in Excel, the column that should contain floating-point data actually contains textual data. Which means that when we read it into Pandas, we get a column with a dtype of “object”. If we want to have a numeric column, we’ll need to remove the “..” strings, and then convert the column.
We can remove the rows in which VALUE is “..” with a simple comparison:
df['VALUE'] == '..'
This returns a boolean series, with True/False values based on whether the value is indeed “..”. We can find all of the rows where that is not the case by using the ~ (tilde) to reverse the logic, and then by applying that boolean series as a mask index on df.loc:
df = df.loc[~(df['VALUE'] == '..')]
This gives us the subset of “df” which doesn’t have such values for VALUE. We can then set the dtype to be float:
df['VALUE'] = df['VALUE'].astype(np.float64)
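On a toy frame (invented values, standing in for the real IEA data), the whole mask-then-convert sequence looks like this:

```python
import numpy as np
import pandas as pd

# A toy frame; the real file has more columns and rows
df = pd.DataFrame({'VALUE': ['1.5', '..', '2.25']})

mask = df['VALUE'] == '..'        # True where the value is missing
df = df.loc[~mask].copy()         # keep only the real measurements
df['VALUE'] = df['VALUE'].astype(np.float64)
```

The .copy() call just keeps pandas from warning that we're assigning into a slice of the original frame.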
This works, and there isn’t anything wrong with it, per se. But there’s an easier way: What if we just tell Pandas that when it reads the Excel file, it should treat “..” as a synonym for NaN? We can do that with the “na_values” keyword argument, which can take a list of values it should treat as NaN. Note that these values are in addition to the usual, default values:
df = pd.read_excel(filename, sheet_name='raw_data', na_values=['..'])
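The na_values keyword works the same way in read_csv, so a small invented CSV string lets us sketch the effect without the Excel file:

```python
import io

import pandas as pd

# '..' marks missing data, just as in the IEA file
csv_data = "VALUE\n1.5\n..\n2.25\n"

df = pd.read_csv(io.StringIO(csv_data), na_values=['..'])
# '..' has become NaN, and the column arrives as float64
```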
With our data frame looking like this, we can remove all of the rows that contain NaN using dropna. Or, if we prefer, we can keep them, knowing that they might (under some circumstances) be useful. Either way, because NaN is a float, the VALUE column arrives with a float dtype, without any extra effort on our part:
COUNTRY            object
PRODUCT            object
FLOW               object
UNIT               object
TIME       datetime64[ns]
VALUE             float64
dtype: object
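If we do decide to drop the NaN rows, dropna does it in one call. Here is a sketch on a toy frame with one missing VALUE:

```python
import numpy as np
import pandas as pd

# Invented data, with one NaN left behind by na_values
df = pd.DataFrame({'COUNTRY': ['France', 'Germany', 'Japan'],
                   'VALUE': [1.5, np.nan, 2.25]})

df = df.dropna()   # drop any row containing at least one NaN
```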
When reading data from a CSV file, we need to tell Pandas to interpret a column as a `datetime` dtype. Why didn't we have to do that here?
As I mentioned above, CSV is a text format. Which means that if we don’t specify the dtypes of the columns we read, Pandas has to guess. In the case of Excel, Pandas doesn’t have to guess; the file format itself specifies, rather clearly, what type of data each column contains. Which means that we don’t need to tell Pandas to treat one or more columns as datetime values; it gets that from the Excel file.
Each piece of data is repeated, once in US dollars and once in their national currency. So that we can compare prices, keep only the US dollars.
The “UNIT” column contains two values, “US dollars” and “National currency.” Which makes sense, since every country operates in its own currency, but we want to make apples-to-apples comparisons across countries.
If we were looking at a particular country, then we might want to keep the national currency rows. But we’ll only keep dollars. How can we do that?
We’ll just find those rows where UNIT is equal to “US dollars”. That’ll give us a boolean series back:
df['UNIT'] == 'US dollars'
We can apply this boolean series to df.loc as a mask index, getting back only those rows that use dollars, and then assigning the result back to df:
df = df.loc[df['UNIT'] == 'US dollars']
In the end, we have a data frame that is solely in US dollars.
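As a quick sanity check, here is the same filter on a toy frame (invented values) that mimics the two-unit layout:

```python
import pandas as pd

# Each price appears twice, once per unit, as in the IEA data
df = pd.DataFrame({'UNIT': ['US dollars', 'National currency', 'US dollars'],
                   'VALUE': [1.5, 1.4, 2.25]})

df = df.loc[df['UNIT'] == 'US dollars']
# every remaining row is priced in dollars
```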