BW #10: Oil prices (solution)
Oil prices have started to rise, so we'll look at the history of oil-import prices in the OECD. But actually? We'll look at the speed and efficiency of Pandas 2.0's new PyArrow back end!
This week’s topic: Oil prices
This week’s data came from the OECD, about 30 countries that pool data for (in theory) a better understanding of trade, economics, education, and governance. Here in Israel, the OECD is constantly making suggestions regarding how government policies; our governments sometimes even listen to those suggestions.
The OECD is a treasure trove of data, and among other things, they keep track of energy-related expenses — including how much each country spent on oil. I asked you to download a CSV file from here:
https://data.oecd.org/energy/crude-oil-import-prices.htm
I noted that you should ask for the “full indicator data” for all years (1980-2021). Moreover, I suggested that you download the file, rather than retrieve the data from a URL. That was mostly to avoid including network latency variations in our comparison between NumPy-based Pandas and PyArrow-based Pandas.
Our questions for this week are:
Load the data into a data frame. We only want to load five columns: LOCATION, FREQUENCY, and TIME, Value, and Flag Codes. (Note the odd capitalization.)
Check the memory usage of the data frame. (You don’t need to time this.)
Keep only those rows with a monthly frequency (i.e., M).
Create two new integer columns, YEAR and MONTH, based on the existing “TIME” column.
What has been the per-country import price, taken over all measurements? Which countries have paid above the mean?
Grouping by YEAR then MONTH, find the mean oil-import price across all countries. When was the mean price the highest? The lowest?
Let’s get to it! I’ll start by doing everything as per usual, with a traditional Pandas data frame created via read_csv. Once we’ve collected some benchmark information, we’ll then do everything again via PyArrow. And then we’ll be able to make some comparisons, and understand the trade-offs.
Load the data into a data frame. We only want to load five columns: LOCATION, FREQUENCY, and TIME, Value, and Flag Codes. (Note the odd capitalization.)
First, I’ll need to load the basics:
import pandas as pd
from pandas import Series, DataFrame
I’ve downloaded the file into the current directory, thus allowing me to import it this way:
filename = 'DP_LIVE_04042023115115682.csv'
Your file will also presumably start with “DP_LIVE” and then (from what I can tell) the day number, month number, year, and then — perhaps? — the number of seconds in the day, or something along those lines. Regardless, this is a pretty standard CSV file, and other than asking you to import only a subset of the columns, nothing really exciting is going on here:
Keep reading with a 7-day free trial
Subscribe to Bamboo Weekly to keep reading this post and get 7 days of free access to the full post archives.