BW #22: Banana index (solution)
Some foods are worse for the environment than others. How can we measure their relative badness? This week, we look at the new "Banana index" from the Economist, and learn something about what we eat.
This week, we looked at the “Banana index” from the Economist, downloading its first-ever release of data, and then finding foods which have a bigger carbon footprint than a banana.
Along the way, we looked at ways to filter columns and rows, to employ broadcasting, to look into correlations, and even to decorate our data frame in different colors depending on the values inside.
The repo for the banana index is at: https://github.com/TheEconomist/banana-index-data
We're going to download the data from version 1.0 of the banana index, in this CSV file: https://github.com/TheEconomist/banana-index-data/releases/download/1.0/bananaindex.csv
This week, I asked you seven questions:
1. Download the data into a data frame. Set the index to be the “entity” column. Remove the “year”, “Banana values”, “type”, and “Chart?” columns.
Let’s start by loading up Pandas:
import pandas as pd
With that in place, we need to load the CSV file. In theory, we could download the file to the local filesystem and read it with “pd.read_csv”. However, read_csv, like most of the “read” methods in Pandas, accepts a filename, a file-like object, or a URL as its first argument. We can thus read the file directly from its URL, straight into a data frame:
url = 'https://github.com/TheEconomist/banana-index-data/releases/download/1.0/bananaindex.csv'
df = pd.read_csv(url)
Now, I asked that the “entity” column be set to the index. We could, in theory, first read the CSV file and then use “set_index” to modify it. But it’s easier and faster for us to simply pass the “index_col” keyword argument to read_csv:
df = pd.read_csv(url, index_col='entity')
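For completeness, the two-step alternative mentioned above would look something like this. (This is a sketch using a tiny, made-up stand-in for the downloaded data, so that it runs without a network connection; the numbers are invented.)

```python
import pandas as pd

# A tiny, invented stand-in for the downloaded data
df = pd.DataFrame({'entity': ['Bananas', 'Beef steak'],
                   'emissions_kg': [0.9, 118.9]})

# Equivalent to passing index_col='entity' to read_csv,
# but done as a separate step after loading
df = df.set_index('entity')
```

The end result is the same; passing index_col just saves a step (and avoids ever having “entity” as a regular column).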
In this particular case, we’re only going to use one column as an index. However, as is often the case in Pandas, you can pass a list of strings, rather than a single string, if you want to create a multi-index based on more than one of the columns in the downloaded file.
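For example, assuming a hypothetical CSV file that has both “entity” and “type” columns, passing a list of names to index_col produces a two-level multi-index. (The inline CSV here is invented for illustration.)

```python
import io
import pandas as pd

# A hypothetical two-column CSV, inline to avoid a download
csv_data = io.StringIO(
    'entity,type,emissions_kg\n'
    'Bananas,fruit,0.9\n'
    'Beef steak,meat,118.9\n')

# A list of column names produces a MultiIndex
df = pd.read_csv(csv_data, index_col=['entity', 'type'])
```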
I then asked you to remove several of the columns, which we aren’t going to be using. The easiest way to do this is with the “drop” method. You can call “drop” with a single string value or with a list of strings. However, by default, “drop” works on the index (i.e., rows), rather than the columns. We can change this by passing “axis='columns'”:
df = df.drop(['year', 'Banana values', 'Chart?', 'type', 'Unnamed: 16'], axis='columns')
Note that instead of dropping these columns, I could have passed the “usecols” keyword argument to “read_csv”, indicating which columns I wanted to keep. The only issue there is that I wanted to keep most of them and drop only a handful, so I did it this way.
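If you did want to go the usecols route, it would look something like this. (Again, a sketch on an invented inline CSV; with the real file, you would have to list every one of the many columns you wanted to keep, which is why dropping was more convenient here.)

```python
import io
import pandas as pd

# An invented CSV with columns we do and don't want
csv_data = io.StringIO(
    'entity,year,emissions_kg,Chart?\n'
    'Bananas,2023,0.9,No\n')

# Only the named columns are loaded; the rest are never read.
# Note that the index column must be included in usecols.
df = pd.read_csv(csv_data, index_col='entity',
                 usecols=['entity', 'emissions_kg'])
```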
I now have a data frame with all of the data I want, with 160 rows and 11 columns.
Note that when I was putting together this week’s issue, I found that “read_csv” sometimes gave me an error when trying to retrieve the document from GitHub. I’m not sure what the cause of the problem was, but I found that waiting a few seconds and then retrying solved the issue. My best guess is that GitHub is trying to stop programs from retrieving data, and thus throttles the frequency with which they can make requests, but I’m really not sure.
2. Three of the columns contain pre-computed banana scores for kg, calories, and protein. For each of these columns, show the 10 highest-scoring food products.
In order to answer this question, we’ll need to do a few different things:
Find the columns with the pre-computed banana scores
Find the 10 highest-scoring food products for each of these columns
Display these 10 highest-scoring products in each column
First: How can we find the columns with pre-computed banana scores? We can get a full list of columns with the “columns” attribute:

df.columns
Notice that it’s an attribute, not a method, so you don’t want to use parentheses.
The result that I get is an index object, which you can think of as a list of strings with some additional capabilities:
Index(['emissions_kg', 'emissions_1000kcal', 'emissions_100g_protein', 'emissions_100g_fat', 'land_use_kg', 'land_use_1000kcal', 'Land use per 100 grams of protein', 'Land use per 100 grams of fat', 'Bananas index (kg)', 'Bananas index (1000 kcalories)', 'Bananas index (100g protein)'], dtype='object')
There are three columns that clearly contain banana index data. I can always retrieve columns from a data frame by passing a list of columns inside of square brackets — that is, nested square brackets — as follows:
df[['Bananas index (kg)',
    'Bananas index (1000 kcalories)',
    'Bananas index (100g protein)']]
And yes, this will work! But… yuck, right?
I would greatly prefer to have Pandas give me all of the columns whose names start with the string “Bananas”. Or even those whose names contain the substring “Bana” in them.
Fortunately, Pandas provides us with such a method, called “filter”. Now, the “filter” method is confusing — to me, at least — because it doesn’t filter the values. Rather, it filters the columns by their names. It lets you get a subset of the columns by giving it a substring, and passing that as a value to the “like” keyword argument:

df.filter(like='Bana')
This returns a new data frame, based on df, in which we have all of the rows but only those columns containing “Bana” in their names.
I should add that if you prefer, you can pass a different keyword argument, namely “regex”, along with a string containing a regular expression. So we could also have said:

df.filter(regex='^Bananas')
(I love regular expressions, but I realize that they can be a bit surprising and difficult for people to understand. If that describes you, check out my free, 14-part “regexp crash course” at https://RegexpCrashCourse.com/. You’ll be using regular expressions in no time!)
Anyway, now that we’ve narrowed the data frame down to the columns that we want, we’ll want to get the 10 top items for each of these columns. This means sorting our data frame by the values in each of these columns, and taking the top 10 values. Looking at the index after each sort will let us see which food items are associated with these values.
If we were only interested in sorting by one value, then it would be obvious how to handle it. How can we sort by three different columns?
I normally tell people that when you use a “for” loop in Pandas, you’re almost certainly doing something wrong. But that’s because you should use the built-in vectorization, rather than the (typical) Python way of doing things with iterations.
In this case, though, we need to iterate, using a “for” loop not over the values in our data frame, but rather on the columns. We can then print each column name, then the result of sorting that column by values, grabbing the top 10 values:
for one_column in df.filter(like='Bana'):
    print(one_column)
    print(df[one_column].sort_values(ascending=False).head(10))
    print()
The result of iterating over the result of “filter” is to get each column name, one at a time. We print the column name, then use it to retrieve that column from df. Note that this will return a series, rather than a data frame.
Then we invoke “sort_values” on that series, which returns a new, sorted series. By default, the sort is in ascending order, but here we pass the keyword argument ascending=False, which gives us the results in descending order. We can then invoke “head(10)” to get the 10 top values for each one. Finally, we invoke “print” with no arguments to give us a blank line.
The result looks like this:
Bananas index (kg)
entity
Beef steak         148.563324
Beef mince         108.816189
Beef meatballs      81.052853
Beef burger         61.803856
Lamb chops          35.383303
Lamb (leg)          35.198904
Lamb burgers        30.833345
Cottage cheese      28.944313
Parmesan cheese     27.499275
Prawns              23.943772
Name: Bananas index (kg), dtype: float64

Bananas index (1000 kcalories)
entity
Beef steak         77.752629
Beef mince         54.049393
Beef meatballs     35.782826
Cottage cheese     33.508657
Prawns             30.063042
Lettuce            28.915534
Raspberries        28.105624
Beef burger        25.011498
Lamb (leg)         17.335509
Coconut milk       14.670763
Name: Bananas index (1000 kcalories), dtype: float64

Bananas index (100g protein)
entity
Coconut milk       33.320155
Sugar              20.685802
Grapes             15.128521
Beef steak          8.312805
Raspberries         7.878815
Beef mince          6.878002
Marmalade           6.495684
Butter              6.192303
Lettuce             5.173570
Beef meatballs      4.612929
Name: Bananas index (100g protein), dtype: float64
If you’re interested in finding the least carbon-emitting foods on the banana scale, you could just invoke “tail”.
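For instance, since the series is sorted in descending order, “tail” gives the lowest scorers. (A sketch on a small synthetic series with invented values; with the real data, you would call tail on the sorted columns from the loop above.)

```python
import pandas as pd

# Hypothetical banana-index scores (values are invented)
scores = pd.Series({'Beef steak': 148.6,
                    'Bananas': 1.0,
                    'Lentils': 0.5,
                    'Apples': 0.4,
                    'Nuts': 0.3})

# After a descending sort, tail returns the lowest-scoring items
lowest = scores.sort_values(ascending=False).tail(3)
```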