BW #30: Uncertainty (solution)
What countries reflect greater uncertainty about their future than others? Does this measure stay constant over time? Are countries with higher uncertainty the ones we think we should worry about?
This week’s problems use data from the World Uncertainty Index (WUI, at https://worlduncertaintyindex.com and https://www.imf.org/en/Publications/fandd/issues/2020/03/imf-launches-world-uncertainty-index-wui-furceri), a metric that tries to describe how worried we should be about the world in general, or about certain regions and countries in particular.
Perhaps that’s unfair; I tend to associate “uncertainty” with “worry,” but an uncertain situation is simply one whose outcome is unknown, for good or for bad. I tend to be an optimist, but even I understand that once you start talking about uncertainty, things can look problematic. And even the researchers behind the WUI admit that a large dose of uncertainty can serve as a predictor of trouble in a nation.
This week, we looked at the WUI data. Can we identify times when uncertainty in the world went up? Which countries are more reliably uncertain? And how do Israel and Taiwan compare, with each other and with other countries in the world?
Data and seven questions
This week, I posed seven questions that we can answer with the WUI data set. It consists of a single Excel file with a number of different sheets, each with a different piece or collection of uncertainty data. You can get it from:
The learning goals for this week include breaking up date information, multi-indexes, method chaining, lambda and assign, and plotting.
Here are my seven tasks and questions for this week:
Read tab T1 into a data frame. Turn the "Year" column into a multi-index consisting of the year and quarter, both integers, and also remove the original "Year" column. Also: Do it all in a single, chained query starting with a call to "read_excel".
Before we can do anything else, we’ll need to load up Pandas. I’m going to do my longish, three-line intro that I always like to do, just to be prepared:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
With that out of the way, I downloaded the Excel file onto my system and wanted to read it into a data frame. By default, read_excel returns a data frame based on the file’s first sheet. If you pass sheet_name=None (or a list of sheet names), you instead get back a dict of data frames, one for each sheet.
Since we know which sheet we’ll want, we can just indicate that by passing the “sheet_name” keyword argument:
Sure enough, that returns a data frame. Because of how the Excel file is structured, the data frame’s columns take their names from the first row of the sheet. So far, so good. But I indicated that I want to take the “Year” column and turn it into two integer columns, one with the year and one with the quarter. How can we do that?
Well, let’s first look at the values in that column:
1990q1
1990q2
1990q3
1990q4
1991q1
1991q2
1991q3
1991q4
As you can see, each value in this column is a four-digit year, the letter “q”, and then the quarter of that year for which we have data. How can we break that apart, so that we can have separate values for the year and quarter?
If we had a regular Python string containing “1990q1”, how would we break it apart? The answer is str.split, a string method that returns a list of strings. We could say “s.split('q')” and get back a list of two strings, first the year and then the quarter.
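In plain Python, that looks like this:

```python
# Splitting a quarterly label on the letter "q"
s = '1990q1'
parts = s.split('q')
print(parts)  # ['1990', '1']
```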
We can do this in Pandas via the str accessor, which gives us all of the built-in string methods, as well as some special, additional ones. Given a series s, we can say “s.str.split('q')”. The result will be a new series, each of whose values is a two-element list of strings.
That’s great, but… then what? Well, while we normally think of the “str” accessor as only working on strings, it actually works on any kind of Python object, assuming that we invoke a method that the object supports. Given that str.split returns a list, we can then run “str.get(0)” on the resulting series of lists to get the first element of each. In other words:
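Using a small series of made-up values in the same format as the “Year” column:

```python
import pandas as pd

# a few sample values in the same format as the "Year" column
s = pd.Series(['1990q1', '1990q2', '1991q3'])

# split each value on 'q', then take the first element of each list
years = s.str.split('q').str.get(0)
print(years.tolist())  # ['1990', '1990', '1991']
```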
The above will return a series of strings, each of which contains the four-digit year from the series “s”. We can similarly use “get(1)” to get the quarter number.
But wait — we now have a series of strings. We actually want a series of integers. How can we change that? Easy; we use the “astype” method to get back a new series based on the original, whose values are the dtype we want. Given that we want integers, and that those 4-digit integers will all fit into 16 bits, we can say:
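For example, continuing with the same made-up series:

```python
import numpy as np
import pandas as pd

s = pd.Series(['1990q1', '1991q2'])

# extract the year as a string, then convert to a 16-bit integer
y = s.str.split('q').str.get(0).astype(np.int16)
print(y.dtype)     # int16
print(y.tolist())  # [1990, 1991]
```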
Note that if you do this with “np.int8”, it will seem to work… until you discover that 8-bit integers aren’t sufficient for a four-digit year, and that they’ve been transformed into negative numbers.
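You can see the wraparound with a quick NumPy check:

```python
import numpy as np

# 1990 doesn't fit in a signed 8-bit integer (range -128..127),
# so the cast silently wraps around to a negative value
print(np.array([1990]).astype(np.int8))  # [-58]
```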
It’s now clear how I can take the original data frame and get two new series of integers from it, one for the year and one for the quarter. But I asked you to do this using chained methods. How would that look and work?
In general, we can create a new column by using the “assign” method on a data frame. Assign takes any number of keyword arguments; the name of the kwarg is the name of the new column, and the value we give contains the values. The value can be a series, but it can also be a function that returns a series. It’s not unusual to use lambda, a Python construct that creates an anonymous function object.
Here’s how we can do it:
We call “assign” on the data frame
We name the column as the name for the keyword argument
The value is a lambda, which takes a single argument that is often called df_ (not to be confused with df), to indicate that it’s a temporary data frame. The lambda is passed the data frame on which we’re running assign, and the series it returns becomes the new column’s values
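Here is a minimal, standalone illustration of assign with a lambda, using a toy data frame (the column contents are made up):

```python
import pandas as pd

df = pd.DataFrame({'Year': ['1990q1', '1990q2']})

# the keyword name (Y) becomes the new column's name;
# the lambda receives the data frame and returns the column's values
result = df.assign(Y=lambda df_: df_['Year'].str.split('q').str.get(0))
print(result['Y'].tolist())  # ['1990', '1990']
```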
Here’s how that code would look:
(
    pd
    .read_excel(filename, sheet_name='T1')
    .assign(Y=lambda df_: df_['Year'].str.split('q').str.get(0).astype(np.int16),
            Q=lambda df_: df_['Year'].str.split('q').str.get(1).astype(np.int8))
)
This gives us back a data frame with all of the Excel sheet’s original data, plus the two new columns Y and Q. We can make Y and Q into an index — specifically, a multi-index — by invoking set_index:
(
    pd
    .read_excel(filename, sheet_name='T1')
    .assign(Y=lambda df_: df_['Year'].str.split('q').str.get(0).astype(np.int16),
            Q=lambda df_: df_['Year'].str.split('q').str.get(1).astype(np.int8))
    .set_index(['Y', 'Q'])
)
Notice that we pass a list of two elements (‘Y’ and ‘Q’) to set_index, thus ensuring that it’s a multi-index with the primary index element being the year, and the secondary being the quarter number.
Finally, let’s get rid of the original “Year” column by invoking “drop”:
(
    pd
    .read_excel(filename, sheet_name='T1')
    .assign(Y=lambda df_: df_['Year'].str.split('q').str.get(0).astype(np.int16),
            Q=lambda df_: df_['Year'].str.split('q').str.get(1).astype(np.int8))
    .set_index(['Y', 'Q'])
    .drop('Year', axis='columns')
)
The final result is a data frame with the columns we want, and the rows and values we want — without invoking explicit Python assignment even once.
Which of the final five columns (comparing regions) currently has the greatest uncertainty? Does this strike you as a reasonable region to be the most uncertain in the world? Looking at the absolute numbers, what do you see with the measurement?
One of the nice things about using chained methods is that we can take the work we’ve already done and just add a handful of lines to them in order to get the results we want.
To compare the regions (the final five columns) in the data frame in the most recent collection of data, we’ll use iloc, which lets us retrieve rows and columns by their integer positions. The final row is (as per normal Python tradition) at position -1. We can also use slice syntax to grab the final five columns, making our call look like “iloc[-1, -5:]”:
(
    pd
    .read_excel(filename, sheet_name='T1')
    .assign(Y=lambda df_: df_['Year'].str.split('q').str.get(0).astype(np.int16),
            Q=lambda df_: df_['Year'].str.split('q').str.get(1).astype(np.int8))
    .set_index(['Y', 'Q'])
    .drop('Year', axis='columns')
    .iloc[-1, -5:]
)
That gives us the row that is of interest to us. And as usual in Pandas, when you get a single row back, it comes as a series. We can thus run series methods on it, including “max” to get the maximum value. But we don’t want the highest value; we want the index (i.e., the label) for that highest value. For that, we’ll use “idxmax”, which returns the index for the highest value:
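The difference between max and idxmax is easy to see on a toy series (the labels and numbers are made up):

```python
import pandas as pd

s = pd.Series({'a': 10, 'b': 30, 'c': 20})
print(s.max())     # 30, the largest value
print(s.idxmax())  # 'b', the label of the largest value
```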
(
    pd
    .read_excel(filename, sheet_name='T1')
    .assign(Y=lambda df_: df_['Year'].str.split('q').str.get(0).astype(np.int16),
            Q=lambda df_: df_['Year'].str.split('q').str.get(1).astype(np.int8))
    .set_index(['Y', 'Q'])
    .drop('Year', axis='columns')
    .iloc[-1, -5:]
    .idxmax()
)
We see that “Western Hemisphere” is the region with the greatest uncertainty. When I first saw that, I thought it must be a mistake. After all, shouldn’t there be more mentions of uncertainty in other regions of the world? But then I realized that the T1 data isn’t normalized; it’s just a strict numeric count of how often the word “uncertainty” is mentioned in Economist Intelligence Unit articles. And they probably write a lot more about countries in the Western Hemisphere than elsewhere.
So yes, by absolute count, that region has the most mentions of uncertainty. But that doesn’t mean we should use this measure. And indeed, other measures in this spreadsheet normalize the numbers, looking (for example) at the number of mentions of uncertainty as a proportion of all words written, rather than as an absolute measure.