BW #8: Happiness (solution)

Which country's residents report being the happiest? Which countries have gotten happier in the last year, and which have gotten less happy?

This week’s topic: Happiness

This week, we looked at data from the World Happiness Report for 2023, which was released on March 20th — the International Day of Happiness.

The data set came from an Excel spreadsheet provided by the researchers. Much of their data seems to come from Gallup surveys, in which people are asked to indicate whether they’re living their best possible life (10), worst possible life (1), or somewhere in the middle.

The survey has taken place for several years, and I thought that it would be fun to look at this data — especially after we’ve been looking at depressing data for the last few weeks, including Ukrainian grain exports and failed banks.

Discussion

I started with my standard setup for working with Pandas:

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

I then needed to download the data file, which was available in Excel format at:

https://happiness-report.s3.amazonaws.com/2023/DataForTable2.1WHR2023.xls

First thing, I used the URL to download the Excel file into a data frame:

url = 'https://happiness-report.s3.amazonaws.com/2023/DataForTable2.1WHR2023.xls'

df = pd.read_excel(url)

Remember that “read_excel”, the method that allows us to take input from an Excel spreadsheet and import it into Pandas as a data frame, works just like similar “read_” methods in Python. As such, its first argument can be:

  • A filename (string)
  • A file-like object, open for reading
  • A URL (string)

In the third case, Pandas retrieves the file and returns a new data frame. The file might be cached somewhere on your local system, but I’m not sure where that would happen, if at all. I love downloading data in this way, but if the file is large and you’ll be retrieving it frequently, then you should probably just put it onto your computer a single time and open it via a file path.

I didn’t ask you to do this, because I didn’t think it would make much of a difference on a relatively small data set — but it’s usually a good idea to load only those columns you truly need into a data set. When loading via “read_excel”, you can pass the “usecols” keyword argument, giving it a list of strings, the names of columns you want to load. My code thus looks like:

df = pd.read_excel(url, 
    usecols=['Country name', 'year', 
             'Life Ladder', 'Freedom to make life choices'])

With this in place, we’re all set to answer my questions for this week!

The main measure in the World Happiness Report is known as "life ladder," where people are asked where they currently are, on a scale from 1 to 10 — where 10 is the happiest possible life. According to this measure in 2022, which 10 countries are happiest?

The data is a bit unusual looking, in that we have one row for each time the survey was performed in a country. This means that we have 14 rows for Afghanistan (for years 2008-2022), 15 rows for Albania (2007-2022), 10 for Algeria (2010 - 2021), and so on.

To answer this question, it won’t be enough to get the “Life ladder” value for each country, because we have several such values for each country, one per year in which the survey was performed.

Rather, we’ll need to find all of the rows from the year 2022, and then compare the countries’ scores.

I start off by comparing the “year” column with 2022, thus creating a boolean series:

df['year'] == 2022

I can then apply that boolean series as a mask index to “df.loc”:

df.loc[df['year'] == 2022]

This returns all of the rows from df with a “year” value of 2022. But we’re only interested in two columns, “Country name” and “Life Ladder”. I’ll add a 2-element list containing those two column names as the column selector in “.loc”:

df.loc[df['year'] == 2022, ['Country name', 'Life Ladder']]

That returns a data frame containing two columns (“Country name” and “Life Ladder”), with all rows from the year 2022 in df.

If I’m interested in finding the happiest countries, then I’ll want to find those with the highest value for “Life Ladder”. I can get that by sorting our data frame by the values of “Life Ladder”, in descending order, using sort_values:

df.loc[df['year'] == 2022, ['Country name', 'Life Ladder']].sort_values('Life Ladder', ascending=False)

There isn’t anything wrong with this, but given that I’m only interested in the 10 highest-scoring countries, I’ll use “head” to get the top 10 elements:

df.loc[df['year'] == 2022, ['Country name', 'Life Ladder']].sort_values('Life Ladder', ascending=False).head(10)

According to the most recent year’s results, the happiest countries are (in descending order) Finland, Israel, Denmark, Iceland, and Sweden.

Another way to get the same results would be to take our data frame, and set the index to be the “year” column, with the “set_index” method. Then we don’t need to make the comparison with 2022; we can just retrieve all rows with that index using .loc. The rest of our query will look the same, though:

df.set_index('year').loc[2022, ['Country name', 'Life Ladder']].sort_values('Life Ladder', ascending=False).head(10)

The WHR actually calculates happiness by averaging the results from the three most recent years (i.e., 2020, 2021, and 2022). Given that measure, what are the 10 happiest countries in the world?

The above calculation is good, but the researchers who put together the annual report don’t rely on the most recent year’s survey alone. Rather, they take the mean from the three most recent surveys to give a country its happiness score. This is done to smooth over variations in the data from year to year.

I thus asked you to find the 10 happiest countries in the world using that methodology.

I started off by asking the data frame for those rows where the year was 2020, 2021, or 2022. The easiest way to do this is to use the “isin” method, which checks for membership in a list, returning True or False for each row:

df.loc[df['year'].isin([2020, 2021, 2022]

Using the returned boolean series, I extended my use of “.loc” to retrieve only two columns, “Country name” and “Life Ladder”:

df.loc[df['year'].isin([2020, 2021, 2022]),                     
      ['Country name', 'Life Ladder']]

Here, the boolean series is acting as my row selector, while the list of names (strings) acts as a column selector. The result is a data frame, a subset of df in which the only years are 2020-2022, and the only columns are “Country name” and “Life Ladder”.

With this in hand, I’m now ready to perform the calculation. Simply put, I want to get the mean for “Life Ladder” for each value of “Country name” — a classic example of using “groupby”.

In theory, I could use the following syntax:

df.loc[df['year'].isin([2020, 2021, 2022]), 
       ['Country name', 'Life Ladder']
       ].groupby('Country name')['Life Ladder'].mean()

And there’s no doubt that this will work.

However, I know that I’ll later want to use the result of this query somewhere else, and that I’ll need a data frame, with multiple columns. Thus, rather than invoke “mean” directly, I’ll use the “agg” method, which allows me to invoke any aggregate method I want on a series.:

df.loc[df['year'].isin([2020, 2021, 2022]), 
       ['Country name', 'Life Ladder']
       ].groupby('Country name')['Life Ladder'].agg(['mean'])

By passing a one-element list to “agg”, I ensure that I get a data frame back. This will make my life easier a bit down the road.

But wait, don’t I want to know which of the countries have the highest mean score from 2020-2022? Yes, and in order to get that, I’ll need to sort the results in descending order:

df.loc[df['year'].isin([2020, 2021, 2022]), 
       ['Country name', 'Life Ladder']
       ].groupby('Country name')['Life Ladder'].agg(['mean']
       ).sort_values('mean', ascending=False)

This will give the correct results, which is great. However, rather than print them (or the 10 top scorers) right away, I’m going to stick them into a variable:

happiness_2023 = df.loc[df['year'].isin([2020, 2021, 2022]), 
       ['Country name', 'Life Ladder']
       ].groupby('Country name')['Life Ladder'].agg(['mean']
       ).sort_values('mean', ascending=False)

Why a variable? Because I’ll use it later on, and having it accessible in this way will be useful.

Finally, let’s take a look at the 10 happiest countries in the world, according to this year’s survey:

happiness_2023.head(10)

The answer, in case you’re wondering, is close (but not identical) to what we saw from the raw scores: Finland, Denmark, Iceland, Israel, Netherlands, Sweden, Norway, Switzerland, Luxembourg, and New Zealand.

As someone who lives in Israel, I can assure you that this made local headlines. It ws repeatedly pointed out that the survey was taken last year, before our government made proposals that have unleashed the largest and most widespread protests we’ve seen in a long time. I’m definitely curious to hear how our score does in 2024!

I also found some articles asking how Finland is consistently ranked as the happiest country in the world. Here are two good ones:

In any event, we now have a variable, happiness_2023, containing each country and its 2023 happiness score.