Bamboo Weekly

Share this post

BW #10: Oil prices (solution)

www.bambooweekly.com

BW #10: Oil prices (solution)

Oil prices have started to rise, so we'll look at the history of oil-import prices in the OECD. But actually? We'll look at the speed and efficiency of Pandas 2.0's new PyArrow back end!

Reuven M. Lerner
Apr 6, 2023
∙ Paid
2
Share
Share this post

BW #10: Oil prices (solution)

www.bambooweekly.com

This week’s topic: Oil prices

This week’s data came from the OECD, about 30 countries that pool data for (in theory) a better understanding of trade, economics, education, and governance. Here in Israel, the OECD is constantly making suggestions regarding how government policies; our governments sometimes even listen to those suggestions.

The OECD is a treasure trove of data, and among other things, they keep track of energy-related expenses — including how much each country spent on oil. I asked you to download a CSV file from here:

https://data.oecd.org/energy/crude-oil-import-prices.htm

I noted that you should ask for the “full indicator data” for all years (1980-2021). Moreover, I suggested that you download the file, rather than retrieve the data from a URL. That was mostly to avoid including network latency variations in our comparison between NumPy-based Pandas and PyArrow-based Pandas.

Our questions for this week are:

  1. Load the data into a data frame. We only want to load five columns: LOCATION, FREQUENCY, and TIME, Value, and Flag Codes. (Note the odd capitalization.)

  2. Check the memory usage of the data frame. (You don’t need to time this.)

  3. Keep only those rows with a monthly frequency (i.e., M).

  4. Create two new integer columns, YEAR and MONTH, based on the existing “TIME” column.

  5. What has been the per-country import price, taken over all measurements? Which countries have paid above the mean?

  6. Grouping by YEAR then MONTH, find the mean oil-import price across all countries. When was the mean price the highest? The lowest?

Let’s get to it! I’ll start by doing everything as per usual, with a traditional Pandas data frame created via read_csv. Once we’ve collected some benchmark information, we’ll then do everything again via PyArrow. And then we’ll be able to make some comparisons, and understand the trade-offs.

Load the data into a data frame. We only want to load five columns: LOCATION, FREQUENCY, and TIME, Value, and Flag Codes. (Note the odd capitalization.)

First, I’ll need to load the basics:

import pandas as pd
from pandas import Series, DataFrame

I’ve downloaded the file into the current directory, thus allowing me to import it this way:

filename = 'DP_LIVE_04042023115115682.csv'

Your file will also presumably start with “DP_LIVE” and then (from what I can tell) the day number, month number, year, and then — perhaps? — the number of seconds in the day, or something along those lines. Regardless, this is a pretty standard CSV file, and other than asking you to import only a subset of the columns, nothing really exciting is going on here:

Keep reading with a 7-day free trial

Subscribe to Bamboo Weekly to keep reading this post and get 7 days of free access to the full post archives.

Already a paid subscriber? Sign in
© 2023 Reuven M. Lerner
Privacy ∙ Terms ∙ Collection notice
Start WritingGet the app
Substack is the home for great writing