BW #10: Oil prices (solution)

Oil prices have started to rise, so we'll look at the history of oil-import prices in the OECD. But actually? We'll look at the speed and efficiency of Pandas 2.0's new PyArrow back end!

This week’s topic: Oil prices

This week’s data came from the OECD, a group of 38 countries that pool data for (in theory) a better understanding of trade, economics, education, and governance. Here in Israel, the OECD is constantly making suggestions about government policy; our governments sometimes even listen to those suggestions.

The OECD is a treasure trove of data, and among other things, they keep track of energy-related expenses — including how much each country spent on oil. I asked you to download a CSV file from here:

https://data.oecd.org/energy/crude-oil-import-prices.htm

I noted that you should ask for the “full indicator data” for all years (1980-2021). Moreover, I suggested that you download the file, rather than retrieve the data from a URL. That was mostly to avoid including network latency variations in our comparison between NumPy-based Pandas and PyArrow-based Pandas.

Our questions for this week are:

  1. Load the data into a data frame. We only want to load five columns: LOCATION, FREQUENCY, TIME, Value, and Flag Codes. (Note the odd capitalization.)
  2. Check the memory usage of the data frame. (You don’t need to time this.)
  3. Keep only those rows with a monthly frequency (i.e., M).
  4. Create two new integer columns, YEAR and MONTH, based on the existing “TIME” column.
  5. What has been the mean per-country import price, taken over all measurements? Which countries have paid above the mean?
  6. Grouping by YEAR then MONTH, find the mean oil-import price across all countries. When was the mean price the highest? The lowest?

Let’s get to it! I’ll start by doing everything as per usual, with a traditional Pandas data frame created via read_csv. Once we’ve collected some benchmark information, we’ll then do everything again via PyArrow. And then we’ll be able to make some comparisons, and understand the trade-offs.

Load the data into a data frame. We only want to load five columns: LOCATION, FREQUENCY, TIME, Value, and Flag Codes. (Note the odd capitalization.)

First, I’ll need to load the basics:

import pandas as pd
from pandas import Series, DataFrame

I’ve downloaded the file into the current directory, thus allowing me to import it this way:

filename = 'DP_LIVE_04042023115115682.csv'

Your file will also presumably start with “DP_LIVE” and then (from what I can tell) the day number, month number, year, and then — perhaps? — the number of seconds in the day, or something along those lines. Regardless, this is a pretty standard CSV file, and other than asking you to import only a subset of the columns, nothing really exciting is going on here:

df = pd.read_csv(filename, 
                 usecols=['LOCATION', 'FREQUENCY', 
                          'TIME', 'Value', 'Flag Codes'])

Sure enough, we get a data frame.

But wait: I wanted you to check how long it took to perform this action. There are several ways to do this, but if I’m in Jupyter (or IPython, its textual cousin), then I like to use the %timeit and %%timeit magic commands.

Magic commands all start with %, a character that isn’t allowed in Python identifiers. This means that Jupyter can notice and intercept the magic command before it ever gets to Python. These commands can be rather simple or complex; the point is that they give instructions to Jupyter itself, rather than to Python.

Magic commands with a single % operate on a single line. Those starting with a double %% apply to an entire cell. (They must also be at the very start of the cell; don’t try to put Python comments before them, or you’ll get weird error messages.) If you want to time one line of code, just put %timeit in front of it, and Jupyter will report the timing.
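For instance, a one-line version might look like this; the expression being timed here is just an illustration, not one of this week’s tasks:

%timeit df['Value'].mean()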

Behind the scenes, these magic commands use the “timeit” module from the Python standard library. It runs your code a number of times, reporting the mean of those runs (along with the standard deviation), in case there was a great deal of variation from one run to the next.
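If you’re running outside of Jupyter, you can call the standard library directly. Here’s a rough sketch of what the magic command is doing for us; the choice of 100 runs is just an example:

import timeit

# Run the read_csv call 100 times, using our existing pd and filename,
# and report the average time per run, in seconds.
stmt = """pd.read_csv(filename,
                      usecols=['LOCATION', 'FREQUENCY',
                               'TIME', 'Value', 'Flag Codes'])"""
total = timeit.timeit(stmt=stmt, globals=globals(), number=100)
print(total / 100)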

In order to see how long it took to load our data, I put the following in a Jupyter cell:

%%timeit
df = pd.read_csv(filename, 
                 usecols=['LOCATION', 'FREQUENCY', 
                          'TIME', 'Value', 'Flag Codes'])

Here’s the result that I got:

9.82 ms ± 698 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The mean running time was 9.82 ms, with a standard deviation of ± 698 µs. (There are 1,000 µs, aka microseconds, in one ms, aka millisecond.) That sounds pretty fast, especially given that there are only about 21,000 rows in this data frame.
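If you want to confirm that row count for yourself, a quick check looks like this:

print(len(df))     # number of rows
print(df.shape)    # (rows, columns)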

Check the memory usage of the data frame.

How much memory are we using? My favorite way to check is by running df.info(), a method that tells us about the data frame object. (This is different from df.describe(), which returns descriptive statistics on our data frame’s values.)
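If you want to see the difference for yourself, you can run both on our data frame:

df.info()        # prints metadata: index type, dtypes, non-null counts, memory usage
df.describe()    # returns summary statistics (count, mean, std, min, max, ...) for numeric columns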

df.info() displays quite a bit of information about our data frame: its index type, its column names and dtypes, a summary of how many columns we have of each dtype, and then a description of how much memory we’re using.

The thing is, if you invoke df.info(), the memory description will be a bit weird:

memory usage: 822.6+ KB

What the heck is going on there? Can’t the computer keep track of how much memory is in the data frame? Why is it telling us that there are 822.6+ KB?

The problem is that we have several columns of type “object,” which usually means that they contain Python strings. Pandas doesn’t store strings inside NumPy arrays, because that would be too limiting. Instead, it stores references to Python string objects, which live in Python’s memory. For Pandas to go through and measure the memory usage of each of those Python strings could take a long time. So instead, it tells us how much memory the string references themselves are using, and then says, “It might be more, too!” by way of the +.
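If you want to see where that extra memory lives on a per-column basis, df.memory_usage is a handy check. This little comparison isn’t part of this week’s questions, just a sketch:

# Shallow counts only the 8-byte references in "object" columns;
# deep=True also measures the Python strings they point to.
shallow = df.memory_usage()
deep = df.memory_usage(deep=True)
print(pd.DataFrame({'shallow': shallow, 'deep': deep}))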

If we want to get an exact reading, we need to pass memory_usage='deep' to df.info(). That’ll take more time, but it’ll also give us the true memory usage:
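In a Jupyter cell, that looks like this:

df.info(memory_usage='deep')

And now the final line of the output reads: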

memory usage: 4.7 MB

Wow, that’s a lot bigger! Previously, it was reporting only about 20 percent of the memory we were actually using.

So we now know that our data frame is using 4.7 MB of memory, thanks to the combination of Pandas references and Python objects.