BW #5: Ukrainian exports (solution)

This week, the world marks the first anniversary of Russia's invasion of Ukraine. Despite the war, how much is Ukraine successfully exporting? And who is receiving those products?

This week’s topic: Ukrainian exports

This week, we looked at various aspects of Ukraine’s agricultural exports in the wake of the Russian invasion, as well as the Black Sea Grain Initiative that allows products to be sent to other countries.

The URL to download the data in Excel format is:

https://docs.google.com/spreadsheets/d/e/2PACX-1vRisnQjodySbp6-XXPGhdsVMp2stg_gyuxw42pP41tuxeic63IARau6bV1TgjLiw_ciAWsTO5LarPqT/pub?output=xlsx

Here is what I asked you to do:

  1. Read the Excel data into a data frame. Again, the “Data” sheet is what interests us.

  2. Create a pivot table, showing how many tons of each commodity (rows) have left each port (columns).

  3. Create a pivot table, showing how many tons of each commodity (columns) were going to each destination country (rows).

  4. The worry was that lower-income countries would not have enough food to eat, because of the war. According to the data's "income group," what proportion of the grain are low-income countries receiving?

  5. Have we seen growth in the total tonnage shipped each month to developing countries?

  6. Finally, what 10 flags are most commonly used on the ships coming from Ukraine? Do any countries' names appear more than once? If so, why, and how can you fix it?

Discussion

The first thing we need to do is read the data into a data frame. Fortunately, this is fairly straightforward; we can either download the file and load it from our local filesystem, or grab it directly via a URL. Because the data dictionary was in the Excel file’s first sheet, I decided to download it onto my computer, and load it into Pandas from there.

On my computer, the loading code looked like this:

df = pd.read_excel(filename, sheet_name='Data')

I used read_excel to bring the data into Pandas. And because the Excel file contained multiple sheets, I named the sheet that was of interest to me, giving me a single data frame.

However, if I load the data frame this way, there will be a problem down the road. Specifically, the “Departure” column contains a date and time. Treating that column as a datetime not only saves memory, but also allows us to perform a number of time-based queries on the data. In theory, I could load the data as is, and then transform it into a datetime column with pd.to_datetime. However, I find it far more convenient to tell Pandas to set the dtype when it loads the file and creates the data frame. I do that by passing the column name as a value to the “parse_dates” keyword argument:

df = pd.read_excel(filename, sheet_name='Data', parse_dates=['Departure'])
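If you've already loaded the data without parse_dates, the pd.to_datetime approach mentioned above works just as well after the fact. Here's a minimal sketch on an invented stand-in data frame (the real file has many more columns and rows):

```python
import pandas as pd

# Hypothetical stand-in for the real data; dates and tonnage values
# here are invented for illustration.
df = pd.DataFrame({'Departure': ['2022-08-01', '2022-08-03'],
                   'Tonnage': [26527, 13041]})

# Convert the string column into a proper datetime64 column
df['Departure'] = pd.to_datetime(df['Departure'])

print(df['Departure'].dtype)  # datetime64[ns]
```

Either way, the result is the same: a datetime column that we can query by year, month, or any other time component.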

With our data frame fully loaded into memory, we can start to attack the questions that I raised.

1. Create a pivot table, showing how many tons of each commodity (rows) have left each port (columns).

The Black Sea Grain Initiative ensures that Ukraine’s agricultural products can be exported via three ports on the Black Sea. The data set that we are looking at tracks each of the ships carrying products from Ukraine, indicating which port they left from, where they were going to, and what they were carrying.

We could find out how much grain was shipped from each port with a simple “groupby”. We could similarly find out how much of each product was shipped with a separate “groupby”. But we can also perform a two-dimensional “groupby”, known as a “pivot table,” which will allow us to see how much of each product was exported from each of the three ports.
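For comparison, here's what those two one-dimensional "groupby" queries might look like on a toy version of the data (ports, commodities, and tonnage values are all invented here):

```python
import pandas as pd

# Invented sample rows standing in for the real shipping data
df = pd.DataFrame({
    'Departure port': ['Odesa', 'Odesa', 'Chornomorsk', 'Yuzhny'],
    'Commodity': ['Wheat', 'Corn', 'Wheat', 'Barley'],
    'Tonnage': [30000, 25000, 12000, 8000],
})

# Total tonnage shipped from each port -- one "groupby"...
per_port = df.groupby('Departure port')['Tonnage'].sum()

# ...and total tonnage of each commodity -- a second "groupby"
per_commodity = df.groupby('Commodity')['Tonnage'].sum()

print(per_port)
print(per_commodity)
```

The pivot table combines both of these views into a single two-dimensional result.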

A pivot table requires three pieces of data:

  1. The name of a categorical column, whose values will be used as the index of the resulting table.
  2. The name of a second categorical column, whose values will be used as the columns of the resulting table, and
  3. The name of a numeric column, whose values will be aggregated for each intersection of row and column.

We can create a pivot table with the “pivot_table” method for data frames. We’ll need to pass at least three keyword arguments:

  1. “index”, the name of the categorical column (“Commodity”) whose unique values will form the rows of the pivot table,
  2. “columns”, the name of the categorical column (“Departure port”) whose unique values will form the columns of the pivot table, and
  3. “values”, the name of the numeric column (“Tonnage”) whose values will be used for the intersection of every row and column.

We could thus create our pivot table as follows:

df.pivot_table(index='Commodity', columns='Departure port', values='Tonnage')

By default, pivot tables calculate the mean of all values for that row-column intersection. So the result that we’ll get will show how much tonnage, on average, of each commodity was sent from each port. That might be useful, but it’s not what we’re asking for.

We’ll thus need to add another keyword argument, “aggfunc”, naming the aggregation function we want to use. In this case, that’ll be “sum”:

df.pivot_table(index='Commodity', columns='Departure port', values='Tonnage', aggfunc='sum')

Sure enough, we get a table with three columns (the ports) and 18 rows (one for each commodity). Wherever we have a NaN value, we know that none of that commodity was shipped from that port.
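If those NaN values are distracting, we can replace them with zeros using fillna, since in this data a missing intersection really does mean "zero tons shipped." A sketch on a toy frame with invented numbers:

```python
import pandas as pd

# Invented sample rows standing in for the real data set
df = pd.DataFrame({
    'Commodity': ['Wheat', 'Wheat', 'Corn'],
    'Departure port': ['Odesa', 'Chornomorsk', 'Odesa'],
    'Tonnage': [30000, 12000, 25000],
})

pivot = df.pivot_table(index='Commodity', columns='Departure port',
                       values='Tonnage', aggfunc='sum')

# NaN here means no shipments of that commodity from that port,
# so replacing with 0 is safe
pivot = pivot.fillna(0)
print(pivot)
```

Note that fillna(0) is only appropriate because we know what NaN means here; in other data sets, NaN might indicate genuinely missing measurements rather than zero.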

By the way, it’s confusing that Pandas has both “pivot” and “pivot_table” methods. The “pivot” method doesn’t let you specify the aggregation function you want to use. So I tend to use “pivot_table” in my work.

2. Create a pivot table, showing how many tons of each commodity (columns) were going to each destination country (rows).

Now we’re going to create a second pivot table, showing what commodities were sent to which destination countries. Once again, we’ll need to pass four arguments:

  1. “index”, the name of the column (“Country”) whose unique values will be used for the rows of the pivot table,
  2. “columns”, the name of the column (“Commodity”) whose unique values will be used for the columns of the pivot table,
  3. “values”, the name of the numeric column (“Tonnage”) whose values will be aggregated, and
  4. “aggfunc”, the name of the function (“sum”) we want to apply to all values at each row-column intersection.

The code we’ll need to write is almost identical to what we used above, except that we’ll be choosing different columns:

df.pivot_table(columns='Commodity', index='Country', values='Tonnage', aggfunc='sum')

There are lots of NaNs there, indicating which commodities aren’t imported by various countries. But wow, Ukraine really supplies quite a bit of wheat and barley to the rest of the world! (Just to Israel, where I live, they exported 40k tons of barley, 11k tons of wheat, and 14k tons of sunflower meal. What the heck is sunflower meal, anyway? Ooh, it’s fed to livestock. But I digress.)

Notice, by the way, that just before the column for “Wheat,” we have a column for “What,” which only has a value for Italy. My guess is that this reflects a typo, which could be cleaned up if we were doing more serious analysis of this topic.
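One way to fold the presumed typo into the correct value would be Series.replace, applied before building the pivot table. Here's a sketch on made-up rows (the "What" row and tonnage figures are invented for illustration):

```python
import pandas as pd

# Toy frame containing the suspected typo
df = pd.DataFrame({'Commodity': ['Wheat', 'What', 'Corn'],
                   'Tonnage': [30000, 5000, 25000]})

# Fold the presumed typo "What" into "Wheat" before pivoting,
# so both rows aggregate under a single column
df['Commodity'] = df['Commodity'].replace({'What': 'Wheat'})

print(df['Commodity'].tolist())  # ['Wheat', 'Wheat', 'Corn']
```

With the values normalized this way, the stray "What" column would disappear from the pivot table, and its tonnage would be counted under "Wheat."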