BW #1: Government corruption (solution)

Which countries are most (and least) corrupt? Which countries have improved the most in the last decade?

This week, we’re looking at the latest data from Transparency International, via the data at /content/files/images/cpi2022_globalresultstrends.xlsx. The questions I asked were:

  1. According to Transparency International, what five countries were least corrupt in 2022?
  2. According to the same data, what five countries were most corrupt in 2022?
  3. Which region of the world was least corrupt?
  4. Finally, which five countries have made the greatest improvement in 10 years, between the 2012 report and the 2022 report?

Discussion

First and foremost, I had to turn the data set into a Pandas data frame. One way to do that is to download the file onto your local computer, and then use the “read_excel” on the file.

Another way is to give the full URL to the file, and pass it to read_excel as a first argument. Which sounds great… except that in this particular case, I found that I got a 403 (“forbidden”) HTTP response code. It would seem that certain software has been forbidden from downloading the file; you can use a browser, and even wget, but not Pandas.

So I downloaded the file, and then read it into memory:

filename = '/Users/reuven/Downloads/CPI2022_GlobalResultsTrends.xlsx'
df = pd.read_excel(filename)

But there’s a problem, as we can see if we look at the first few rows of the data frame. The columns have weird names. Row index 0 is all NaN values. Row index 1 contains text. and then, starting with row index 2, we get information about the countries.

The problem? The headers are in row index 2. Everything before that is just decoration and informative. If we want to read the data, and if we want the column names in the Excel file to be used as column names in our data frame, we’ll need to tell Pandas to use row index 2 as our headers:

df = pd.read_excel(filename, header=2)

Now, when I run df.head(), things look much better.

And now, we can start to answer some questions, starting with: What five countries were least corrupt in 2022?

The column containing that information is “CPI score 2022”. I can retrieve that with:

df['CPI score 2022']

That’ll give me the column with those scores. If I want the highest five scores, then I’ll want to sort the values in that column:

df['CPI score 2022'].sort_values()

But sort_values normally works in ascending order, and I want the five top values. So I’ll just ask for the sort to happen in reverse (descending) order:

df['CPI score 2022'].sort_values(ascending=False)

But we only want the first five, so we’ll run “head” to cut it off:

df['CPI score 2022'].sort_values(ascending=False).head()

But wait a second: If we do this, then we get the top five scores, but we don’t know what countries they belong to! That’s because the countries were in a separate column, called “Country / Territory”.

The easiest solution here, I think, is to set the index of df to be that country column, then sort based on it:

df.set_index('Country / Territory')[
    'CPI score 2022'].sort_values(ascending=False).head()

In this way, I found that the five least corrupt countries were found to be Denmark, Finland, New Zealand, Norway, and Sweden.

The second question was: What five countries were most corrupt in 2022?

This is pretty easy to answer, given what we’ve just done: We just sort in descending order, and the lowest-scoring countries will be at the top:

df.set_index('Country / Territory')[
    'CPI score 2022'].sort_values().head()

The winners, if you can call them that, are: Somalia, Syria, South Sudan, Venezuela, and Yemen. I guess I’ll now have to rethink attending Python conferences in each of these countries. Oh, well.