BW #8: Happiness (solution)
Which country's residents report being the happiest? Which countries have gotten happier in the last year, and which have gotten less happy?
This week’s topic: Happiness
This week, we looked at data from the World Happiness Report for 2023, which was released on March 20th — the International Day of Happiness.
The data set came from an Excel spreadsheet provided by the researchers. Much of their data seems to come from Gallup surveys, in which people are asked to indicate whether they’re living their best possible life (10), worst possible life (1), or somewhere in the middle.
The survey has taken place for several years, and I thought that it would be fun to look at this data — especially after we’ve been looking at depressing data for the last few weeks, including Ukrainian grain exports and failed banks.
I started with my standard setup for working with Pandas:
import numpy as np import pandas as pd from pandas import Series, DataFrame
I then needed to download the data file, which was available in Excel format at:
First thing, I used the URL to download the Excel file into a data frame:
url = 'https://happiness-report.s3.amazonaws.com/2023/DataForTable2.1WHR2023.xls' df = pd.read_excel(url)
Remember that “read_excel”, the method that allows us to take input from an Excel spreadsheet and import it into Pandas as a data frame, works just like similar “read_” methods in Python. As such, its first argument can be:
A filename (string)
A file-like object, open for reading
A URL (string)
In the third case, Pandas retrieves the file and returns a new data frame. The file might be cached somewhere on your local system, but I’m not sure where that would happen, if at all. I love downloading data in this way, but if the file is large and you’ll be retrieving it frequently, then you should probably just put it onto your computer a single time and open it via a file path.
I didn’t ask you to do this, because I didn’t think it would make much of a difference on a relatively small data set — but it’s usually a good idea to load only those columns you truly need into a data set. When loading via “read_excel”, you can pass the “usecols” keyword argument, giving it a list of strings, the names of columns you want to load. My code thus looks like:
df = pd.read_excel(url, usecols=['Country name', 'year', 'Life Ladder', 'Freedom to make life choices'])
With this in place, we’re all set to answer my questions for this week!
The main measure in the World Happiness Report is known as "life ladder," where people are asked where they currently are, on a scale from 1 to 10 — where 10 is the happiest possible life. According to this measure in 2022, which 10 countries are happiest?
The data is a bit unusual looking, in that we have one row for each time the survey was performed in a country. This means that we have 14 rows for Afghanistan (for years 2008-2022), 15 rows for Albania (2007-2022), 10 for Algeria (2010 - 2021), and so on.
To answer this question, it won’t be enough to get the “Life ladder” value for each country, because we have several such values for each country, one per year in which the survey was performed.
Rather, we’ll need to find all of the rows from the year 2022, and then compare the countries’ scores.
I start off by comparing the “year” column with 2022, thus creating a boolean series:
df['year'] == 2022
I can then apply that boolean series as a mask index to “df.loc”:
df.loc[df['year'] == 2022]
This returns all of the rows from df with a “year” value of 2022. But we’re only interested in two columns, “Country name” and “Life Ladder”. I’ll add a 2-element list containing those two column names as the column selector in “.loc”:
Keep reading with a 7-day free trial
Subscribe to Bamboo Weekly to keep reading this post and get 7 days of free access to the full post archives.