BW #10: Oil prices

Oil prices have started to rise, so we'll look at the history of oil-import prices in the OECD. But actually? We'll look at the speed and efficiency of Pandas 2.0's new PyArrow back end!

Oil is one of the most important commodities: For most people, it’s what powers their automobile. But it’s used for many other types of fuel, such as for airplanes. And for plastic. And any number of other products that are part of our modern world.

Oil is also one of the biggest sources of income for many countries. If the price of oil gets too low, then those countries’ incomes go down. Oil-producing countries thus work together, more or less, to ensure that the price of oil never gets too low by cutting production.

OPEC Plus, an international consortium of oil-producing countries (often called a cartel), announced earlier this week that it would be cutting production by 1.2m barrels a day, in the hopes of keeping oil prices higher. Sure enough, prices went up, and they might well go up even more.

This week, I thus thought that it would be interesting to look at historical oil prices.

But then, even bigger news hit, at least for data nerds: Pandas 2.0 was released. Among the biggest changes in Pandas 2.0 is the use of PyArrow for in-memory storage of values. NumPy is still the default, and it’s not necessarily going away, but PyArrow is the clear direction for the future of Pandas.

So this week, we’re going to do a few simple tasks. But we’ll do everything twice, and we’ll measure how much time it takes to perform each task. When we’re done, we’ll have a sense of whether PyArrow improves the performance of our Pandas tasks — and if so, which ones.

Note that in order to answer this week’s questions, you’ll need to install Pandas 2.0. It should be a simple matter of upgrading with `pip install -U pandas`, but these things can be tricky. You might well need to install `pyarrow` explicitly as well.

Data and questions

We’ll look at oil prices and production statistics from the OECD, a group of countries that the Economist calls “a club of mostly rich countries.” (Maybe it’s just me, but I always giggle when I read that description.) In addition to using Pandas, I’ll be taking advantage of the %timeit and %%timeit magic commands in Jupyter to compare performance. You’re welcome to check the timing in other ways, if you prefer.

Our data set this week will come from the OECD’s oil import price page, at https://data.oecd.org/energy/crude-oil-import-prices.htm. We want the years 1980-2021 for all countries. Download the "full indicator data" as a CSV file, and save it. Note that the filename will reflect the date and time at which your download takes place, so my filename and yours will differ at the end.

Each of the questions should be performed twice, using `%timeit` to check how long each one takes:

  1. With the default engine and data backend
  2. With the PyArrow engine and PyArrow backend

Yes, you can use the PyArrow loading engine and the default NumPy data backend, but I decided that asking you to do everything twice was bad enough; three times would really be too much.

Also note that %timeit doesn’t “leak” variable assignments, which means that we’ll need to do everything twice — once to calculate the timing, and once to actually perform the assignment.
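The same two-pass pattern, sketched in plain Python using the standard library’s `timeit` module rather than the Jupyter magic (the file contents here are invented): time the load first, then run it once more, untimed, so the assignment sticks.

```python
import timeit
import pandas as pd
from io import StringIO

# Invented stand-in for the downloaded OECD CSV
csv_text = "LOCATION,TIME,Value\nAUS,1980,12.3\n"

# First pass: time the load. The data frame built inside the timed
# lambda is discarded -- just as %timeit discards its assignments.
elapsed = timeit.timeit(lambda: pd.read_csv(StringIO(csv_text)),
                        number=100)

# Second pass: load again, untimed, so that df actually exists
df = pd.read_csv(StringIO(csv_text))

print(f"100 loads: {elapsed:.4f} seconds; df has {len(df)} row(s)")
```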

This week’s questions:

  • Load the data into a data frame. We want to load only five columns: LOCATION, FREQUENCY, TIME, Value, and Flag Codes. (Note the odd capitalization.)
  • Check the memory usage of the data frame. (You don’t need to time this.)
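A minimal sketch of both tasks, run against a made-up two-row stand-in for the OECD file (the column names are modeled on the real download, but the rows and the extra columns are invented):

```python
import pandas as pd
from io import StringIO

# Made-up miniature of the OECD CSV; real file has many more rows
csv_text = (
    "LOCATION,INDICATOR,FREQUENCY,TIME,Value,Flag Codes\n"
    "AUS,OILIMPPRICE,A,1980,12.3,\n"
    "CAN,OILIMPPRICE,A,1980,14.8,\n"
)

# Load only the five requested columns; note the mixed capitalization
# ("Value" and "Flag Codes" vs. the all-caps names)
df = pd.read_csv(
    StringIO(csv_text),
    usecols=["LOCATION", "FREQUENCY", "TIME", "Value", "Flag Codes"],
)

# deep=True counts the actual bytes used by object-dtype (string)
# columns, not just the per-cell pointer overhead
print(df.memory_usage(deep=True))
```

With the PyArrow backend, the `deep=True` numbers are where you’d expect to see the biggest difference, since Arrow stores strings far more compactly than NumPy’s object arrays do.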