BW #22: Banana index (solution)

Some foods are worse for the environment than others. How can we measure their relative badness? This week, we look at the new "Banana index" from the Economist, and learn something about what we eat.

This week, we looked at the “Banana index” from the Economist, downloading its first-ever release of data, and then finding foods which have a bigger carbon footprint than a banana.

Along the way, we looked at ways to filter columns and rows, to employ broadcasting, to look into correlations, and even to decorate our data frame in different colors depending on the values inside.

Data

The repo for the banana index is at:

https://github.com/TheEconomist/banana-index-data/

We're going to download the data from version 1.0 of the banana index, in this CSV file:

https://github.com/TheEconomist/banana-index-data/releases/download/1.0/bananaindex.csv

Questions

This week, I asked you seven questions:

1. Download the data into a data frame. Set the index to be the “entity” column. Remove the “year”, “Banana values”, “type”, and “Chart?” columns.

Let’s start by loading up Pandas:

import pandas as pd

With that in place, we need to load the CSV file. In theory, we could download the file to the local filesystem and then read it with “pd.read_csv”. However, read_csv, like most of the “read” methods in Pandas, accepts a filename, a file-like object, or a URL as its first argument. We can thus read the file directly from its URL, straight into a data frame:

url = 'https://github.com/TheEconomist/banana-index-data/releases/download/1.0/bananaindex.csv'

df = pd.read_csv(url)

Now, I asked that the “entity” column be set to the index. We could, in theory, first read the CSV file and then use “set_index” to modify it. But it’s easier and faster for us to simply pass the “index_col” keyword argument to read_csv:

df = pd.read_csv(url, index_col='entity')

In this particular case, we’re only going to use one column as an index. However, as is often the case in Pandas, you can pass a list of strings, rather than a single string, if you want to create a multi-index based on more than one of the columns in the downloaded file.
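For example, here is a minimal sketch of that multi-index technique, using a tiny made-up CSV (the column names mimic the real file, but the values are invented for illustration):

```python
import pandas as pd
from io import StringIO

# A tiny, hypothetical stand-in for the real banana-index CSV file
csv_data = StringIO('''entity,type,emissions_kg
Bananas,fruit,0.86
Beef steak,meat,128.3
''')

# Passing a list of column names to index_col builds a multi-index
df = pd.read_csv(csv_data, index_col=['entity', 'type'])
```

With this in place, df.index has two levels, and we can retrieve rows with a tuple, as in df.loc[('Bananas', 'fruit')].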

I then asked you to remove several of the columns, which we aren’t going to be using. The easiest way to do this is with the “drop” method. You can call “drop” with a single string value or with a list of strings. However, by default, “drop” works on the index (i.e., rows), rather than the columns. We can change this by passing “axis='columns'”:

df = df.drop(['year', 'Banana values', 'Chart?', 'type', 'Unnamed: 16'], axis='columns')

Note that instead of dropping these columns, I could have passed the “usecols” keyword argument to “read_csv”, indicating which columns I wanted to keep. But since I wanted to keep most of the columns and drop only a handful, I did it this way. (The “Unnamed: 16” column that we also dropped is an empty, name-less column that read_csv picked up from the CSV file.)
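For completeness, here is what the “usecols” approach would look like, again with a tiny made-up CSV standing in for the real file:

```python
import pandas as pd
from io import StringIO

# Hypothetical miniature version of the real CSV file
csv_data = StringIO('''entity,year,type,emissions_kg
Bananas,2023,fruit,0.86
Beef steak,2023,meat,128.3
''')

# usecols keeps only the named columns; the others are never loaded at all
df = pd.read_csv(csv_data,
                 usecols=['entity', 'emissions_kg'],
                 index_col='entity')
```

Note that usecols can also save memory on large files, since the unwanted columns are never parsed into the data frame in the first place.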

I now have a data frame with all of the data I want, with 160 rows and 11 columns.

Note that when I was putting together this week’s issue, I found that “read_csv” sometimes gave me an error when trying to retrieve the document from GitHub. I’m not sure what the cause of the problem was, but I found that waiting a few seconds and then retrying solved the issue. My best guess is that GitHub is trying to stop programs from retrieving data, and thus throttles the frequency with which they can make requests, but I’m really not sure.

2. Three of the columns contain pre-computed banana scores for kg, calories, and protein. For each of these columns, show the 10 highest-scoring food products.

In order to answer this question, we’ll need to do a few different things:

  1. Find the columns with the pre-computed banana scores
  2. Find the 10 highest-scoring food products for each of these columns
  3. Display these 10 highest-scoring products in each column

First: How can we find the columns with pre-computed banana scores? We can get a full list of columns with the “columns” attribute:

df.columns

Notice that it’s an attribute, not a method, so you don’t want to use parentheses.

The result that I get is an index object, which you can think of as a list of strings with some additional capabilities:

Index(['emissions_kg', 'emissions_1000kcal', 'emissions_100g_protein',
       'emissions_100g_fat', 'land_use_kg', 'land_use_1000kcal',
       'Land use per 100 grams of protein', 'Land use per 100 grams of fat',
       'Bananas index (kg)', 'Bananas index (1000 kcalories)',
       'Bananas index (100g protein)'],
      dtype='object')

There are three columns that clearly contain banana index data. I can always retrieve columns from a data frame by passing a list of columns inside of square brackets — that is, nested square brackets — as follows:

df[['Bananas index (kg)', 'Bananas index (1000 kcalories)', 'Bananas index (100g protein)']]

And yes, this will work! But… yuck, right?

I would greatly prefer to have Pandas give me all of the columns whose names start with the string “Bananas”. Or even those whose names contain the substring “Bana” in them.

Fortunately, Pandas provides us with such a method, called “filter”. Now, the “filter” method is confusing — to me, at least — because it doesn’t filter the values. Rather, it filters the columns by their names. It lets you get a subset of the columns by giving it a substring, and passing that as a value to the “like” keyword argument:

df.filter(like='Bana')

This returns a new data frame, based on df, in which we have all of the rows but only those columns containing “Bana” in their names.
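If you prefer to be more explicit, you can get the same effect by testing the column names yourself with a boolean mask. Here is a sketch using a small data frame with made-up values but the real column names:

```python
import pandas as pd

# Invented values, but column names taken from the real data set
df = pd.DataFrame({'Bananas index (kg)': [148.6, 1.0],
                   'emissions_kg': [128.3, 0.86],
                   'Bananas index (100g protein)': [8.3, 1.0]},
                  index=['Beef steak', 'Bananas'])

# str.contains returns a boolean mask over the column names,
# which .loc then uses to select matching columns
subset = df.loc[:, df.columns.str.contains('Bana')]
```

This is more verbose than df.filter(like='Bana'), but it makes clear that we're selecting on the names, not the values.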

I should add that if you prefer, you can pass a different keyword argument, namely “regex”, along with a string containing a regular expression. So we could also have said:

df.filter(regex='^Bana')

(I love regular expressions, but I realize that they can be a bit surprising and difficult for people to understand. If that describes you, check out my free, 14-part “regexp crash course” at https://RegexpCrashCourse.com/. You’ll be using regular expressions in no time!)

Anyway, now that we’ve narrowed the data frame down to the columns that we want, we’ll want to get the 10 top items for each of these columns. This means sorting our data frame by the values in each of these columns, and taking the top 10 values. Looking at the index after each sort will let us see which food items are associated with these values.

If we were only interested in sorting by one value, then it would be obvious how to handle it. How can we sort by three different columns?

I normally tell people that when you use a “for” loop in Pandas, you’re almost certainly doing something wrong. That’s because you should generally take advantage of Pandas’ built-in vectorized operations, rather than iterating in typical Python fashion.

In this case, though, we need to iterate, using a “for” loop not over the values in our data frame, but rather over the columns. We can then print each column name, followed by the result of sorting that column by its values and grabbing the top 10:

for one_column in df.filter(like='Bana'):
    print(one_column)
    print(df[one_column].sort_values(ascending=False).head(10))
    print()

Iterating over a data frame, including the one returned by “filter”, yields each column name, one at a time. We print the column name, then use it to retrieve that column from df. Note that retrieving a single column returns a series, rather than a data frame.

Then we invoke “sort_values” on that series, which returns a new series. The sort would normally be in ascending order, but because we pass the keyword argument ascending=False, we get the values in descending order. We then invoke “head(10)” to get the top 10 values for each column. Finally, we invoke “print” with no arguments to give us a blank line.
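By the way, the combination of sorting in descending order and taking the head can be written more compactly with the series method “nlargest”. A quick sketch, with invented scores:

```python
import pandas as pd

# Made-up banana scores for a handful of foods
s = pd.Series({'Beef steak': 148.6, 'Bananas': 1.0,
               'Lamb chops': 35.4, 'Prawns': 23.9})

# nlargest(n) combines the descending sort and head(n) into one call
top = s.nlargest(2)
```

Inside the loop, df[one_column].nlargest(10) would thus do the same job as sort_values(ascending=False).head(10).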

The result looks like this:

Bananas index (kg)
entity
Beef steak         148.563324
Beef mince         108.816189
Beef meatballs      81.052853
Beef burger         61.803856
Lamb chops          35.383303
Lamb (leg)          35.198904
Lamb burgers        30.833345
Cottage cheese      28.944313
Parmesan cheese     27.499275
Prawns              23.943772
Name: Bananas index (kg), dtype: float64

Bananas index (1000 kcalories)
entity
Beef steak        77.752629
Beef mince        54.049393
Beef meatballs    35.782826
Cottage cheese    33.508657
Prawns            30.063042
Lettuce           28.915534
Raspberries       28.105624
Beef burger       25.011498
Lamb (leg)        17.335509
Coconut milk      14.670763
Name: Bananas index (1000 kcalories), dtype: float64

Bananas index (100g protein)
entity
Coconut milk      33.320155
Sugar             20.685802
Grapes            15.128521
Beef steak         8.312805
Raspberries        7.878815
Beef mince         6.878002
Marmalade          6.495684
Butter             6.192303
Lettuce            5.173570
Beef meatballs     4.612929
Name: Bananas index (100g protein), dtype: float64

If you’re interested in finding the least carbon-emitting foods on the banana scale, you could simply invoke “tail” instead of “head”.
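Here is that idea in miniature, with invented scores; note that “nsmallest” gives the same values as “tail” on a descending sort, just ordered smallest-first:

```python
import pandas as pd

# Made-up banana scores
s = pd.Series({'Beef steak': 148.6, 'Lamb chops': 35.4, 'Bananas': 1.0})

# tail on a descending sort returns the lowest scorers
# (still ordered largest-first within the result)
lowest = s.sort_values(ascending=False).tail(2)
```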