BW #41: Wine production (solution)

Climate change is causing all sorts of problems in the world, including, according to a recent report, a decline in the number of grapes grown for wine production. Let's look at the numbers on wine production.

Black Friday

Remember, my Black Friday sale, with 25% off annual memberships to my community (just Python, or Python+Data), and 40% off my courses, continues. Learn more at https://lernerpython.com/bfcm-2023/. But don’t delay, because these deals only last through Monday.

And now, back to our regularly scheduled solutions about wine.

Last summer, I heard a great story on Marketplace about how England was now producing more and more wine. Now, England is many wonderful things — but the climate never struck me as appropriate for growing grapes, let alone wine-quality grapes. It turns out that as climate change raises temperatures around the world, the UK is increasingly suitable for vineyards.

Of course, this means that places which were previously suitable for growing grapes are finding it increasingly difficult to do so. That’s part of what the OIV, the French initials for the International Organization of Vine and Wine, reported recently about wine production around the world. Climate change is having a big effect on grapes, so much so that wine production in 2023 has been going down — and will likely be the lowest in 60 years. As they wrote, “Harvests have plummeted in the Southern Hemisphere and in some major countries due to the extreme climatic conditions.” You can read the full outlook for wine in 2023 here: /content/files/sites/default/files/documents/oiv_world_wine_production_outlook_2023.pdf

Fortunately, the OIV’s data is available for download. And while we’re not going to look at all of their conclusions — especially since most data comes from before 2023 — we can definitely draw some interesting conclusions about the state of winemaking. Along the way, we’ll explore some of the most common and useful Pandas functionality having to do with grouping, joins, and pivot tables.

Data and seven questions

As I mentioned, this week's questions come from the OIV, specifically from their full database. You can view it here:

https://www.oiv.int/what-we-do/data-discovery-report?oiv

I exported the data to Excel by clicking on the three dots in their viewer and choosing to export it to a file. With that in hand, I was able to start doing some analysis.

Here are the seven questions and tasks I gave you for this week. A link to download and view my Jupyter notebook follows the final solution, below.

Download the English-language version of the report as an Excel file. Turn it into a data frame, keeping only the named columns and rows with actual data.

The first thing that I had to do was start up Pandas:

import pandas as pd

With that in place, I wanted to take the file (data.xlsx) and turn it into a data frame. On the one hand, this should be pretty straightforward, using the “read_excel” function that comes with Pandas:

filename = 'data.xlsx'

df = (
    pd
    .read_excel(filename)
)

However, there are some problems with this:

  1. I got warnings when I tried to read the file with read_excel. The import still worked, but Pandas warned me about invisible formatting in the Excel file. I found that making a small, invisible modification to the file and then re-saving it made the warning go away. If you don’t mind seeing the warning, you can simply ignore it; you can also silence it programmatically, as sketched just after this list.
  2. I ended up with a bunch of columns that I didn’t want or need.
  3. The final two rows of the file don’t contain data: the last row describes the export itself, and the row before it consists entirely of NaN values.
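
If you’d rather not see the warning at all, another option is to suppress it while reading the file, using Python’s standard “warnings” module. Here’s a minimal sketch; since the exact warning class and text depend on your version of openpyxl, I’m simply ignoring all warnings raised inside the “with” block:

import warnings

import pandas as pd

filename = 'data.xlsx'

# Read the Excel file, ignoring any warnings raised along the way
# (such as openpyxl's complaints about the file's formatting)
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    df = pd.read_excel(filename)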

Let’s first remove the columns that we don’t need. I decided to use the “filter” method, which lets me choose rows or columns based on the index or column names. Here, I passed a regular expression indicating that I want only those columns whose names consist, from start to end, of word characters (letters, digits, and underscores) and slashes. (Note that I’m using a raw string for the regex, to avoid Python’s warnings about the \w escape sequence.)

filename = 'data.xlsx'

df = (
    pd
    .read_excel(filename)
    .filter(regex=r'^[\w/]+$', axis='columns')
)

This gave me the columns I wanted, but I also wanted to remove the final two rows. (The final row has the export info, and the second-to-last row contains only NaN values.) I decided to use the “iloc” accessor, giving it a slice of [:-2], meaning that I want all of the rows in the data frame except for the final two.

The final query to load our data thus looks like this:

filename = 'data.xlsx'

df = (
    pd
    .read_excel(filename)
    .filter(regex=r'^[\w/]+$', axis='columns')
    .iloc[:-2]
)

The final data frame contains 52,795 rows and 7 columns.
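
If you want to check those dimensions yourself, the data frame’s “shape” attribute returns a (rows, columns) tuple:

df.shape    # with this export, (52795, 7)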

In how many ways is grape production measured? Which unit is used for wine?

What columns does our data frame contain? Let’s take a look:

  • Continent
  • Region/Country
  • Product
  • Variable
  • Year
  • Unit
  • Quantity

When I saw this, I was a bit surprised to see that we’re using different units to measure things. At first, I thought that the different units were being used by different countries, sort of like the English system vs. the metric system. But it turns out that no, different products are measured in different ways.

We can get a complete list of the units, and how frequently they’re used, with “value_counts”. This is one of my favorite methods: given a series (or a column), it returns a new series whose index contains the unique values from the original series, and whose values are integers indicating how many times each of those values occurred.

To find out how often each of these measures is used, we can run value_counts on the “Unit” column:

df['Unit'].value_counts()

This is what we get back:

Unit
tonnes     32674
1000 hl    17531
ha          2590
Name: count, dtype: int64

The three measures are:

  • tonnes (i.e., 1,000 kg), a measure of weight
  • hectoliters, a unit of volume — and “hecto” always means 100 in the metric system, so this represents 100 liters
  • ha, or hectares, where one hectare is 10,000 square meters — a unit of land.

It’s easy to see why these three different measures are used in this report. If you want to know how many grapes are being planted, you would measure it in hectares. If you want to know how many grapes were harvested, you would use tonnes. And if you want to know how much wine was produced, you would use hectoliters — or in our case, 1,000-hectoliter units.

By the way, most bottles of wine that I’ve seen have about 750 ml, aka 0.75 liters. So every 1,000 hectoliters would make about 133,000 bottles of wine.
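
Here’s that back-of-the-envelope calculation in code, with variable names of my own choosing, just to make the arithmetic explicit:

liters_per_hl = 100              # "hecto" means 100, so 1 hl = 100 liters
liters = 1_000 * liters_per_hl   # one "1000 hl" unit is 100,000 liters
bottles = liters / 0.75          # a standard wine bottle holds 750 ml
print(f'{bottles:,.0f}')         # roughly 133,333 bottles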

Do we indeed see that wine is measured in hectoliters? Let’s ask our data frame: We’ll find all of the rows in which the product being measured is wine, and then we’ll run value_counts on all of those rows. If they’re using different measurement units, then we’ll find out how often each is used. But if they’re all using 1,000-hl units, then we’ll see just one row:

df.loc[
    df['Product'] == 'Wine', 
    'Unit'
].value_counts()

Here, I’m using “loc” in its two-argument form:

  • The first argument is the row selector. Here, we’re providing a boolean series, the result of running a comparison between the “Product” column and the string “Wine”. Wherever this comparison returns True, we’ll get a row back from df.
  • The second argument is the column selector. Here, we just want one column, “Unit”.

In other words, we want the “Unit” column from all rows in which the product is wine. We then run value_counts on that column:

Unit
1000 hl    17531
Name: count, dtype: int64

Since the resulting series has a single item whose index is “1000 hl”, it would appear that wine is indeed always measured in 1,000 hl units. We can also see it as a proportion:

df.loc[df['Product'] == 'Wine', 'Unit'].value_counts(normalize=True)

The result, expressed as a proportion:

Unit
1000 hl    1.0
Name: proportion, dtype: float64
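
As an aside, if you prefer the “query” method over boolean indexing with “loc”, the same selection can be written roughly like this:

df.query('Product == "Wine"')['Unit'].value_counts(normalize=True)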

By the way, assuming that only a single unit of measure is used for each product, we can find out what that unit is by combining “groupby” with value_counts:

df.groupby('Product')['Unit'].value_counts()

The result:

Product       Unit   
Dried Grapes  tonnes     15249
Fresh Grapes  tonnes     11513
Table Grapes  tonnes      5912
Vineyard      ha          2590
Wine          1000 hl    17531
Name: count, dtype: int64
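
Alternatively, if we just want the product-to-unit mapping, without the counts, another option (again assuming that each product really is measured in a single unit) is to deduplicate the Product and Unit columns:

df[['Product', 'Unit']].drop_duplicates()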