I'm in Prague this week, attending the EuroPython conference (https://ep2025.europython.eu/). I gave a tutorial on Tuesday called "Let's build a dictionary!" (https://ep2025.europython.eu/session/let-s-build-a-dictionary), and will be speaking tomorrow about "What does = do?" (https://ep2025.europython.eu/session/what-does-do). I've also volunteered as a session chair over the last few days, introducing (some amazing) speakers. If you're at the conference, please come by and say "hello" in person!
As I wrote yesterday, I'm always amused by the way in which European conferences and get-togethers often evolve into comparisons between different countries' laws and customs. I thus decided that this week's questions would use data from Eurostat, which gathers and publishes statistics from a variety of European countries. There are obviously many different ways in which we can measure countries, so I decided to choose two topics tracked by Eurostat – the number of weekly hours worked (on average) by people in each country, and the percentage of homes with Internet access.
This week's learning goals include working with Excel files, handling date-time values, using window functions and joins, and filtering data.
Paid subscribers (including subscribers to my LernerPython+data membership program) can download the data files and my Jupyter notebook. You can also click on a link to use my notebook inside of Google Colab. All of those links are at the bottom of this post.
Data and six questions
This week's data comes from Eurostat (https://ec.europa.eu/eurostat/), which collects, analyzes, and publishes data on a wide variety of topics. We'll look at only two of these, trying to answer a few questions about European countries' similarities and differences:
- Number of weekly hours worked: https://ec.europa.eu/eurostat/databrowser/view/lfsa_ewhun2/default/table
- Internet access: https://ec.europa.eu/eurostat/databrowser/view/isoc_ci_in_h/default/table
On each of these pages, click on the "download" button above the data set. You'll be able to download an Excel file (among other formats), each with a number of sheets containing the data that we want to look at.
First, let's look at how many weekly work hours people put in. Read "Sheet 1" from the work-hours file into a Pandas data frame. Examine the file to see which values should be considered NaN, and make the country-name column the index. Suggest a reason why the EU figures and the euro-area figures might differ.
Let's start by bringing in Pandas:
import pandas as pd
We can normally read an Excel file into Pandas using read_excel. It works much like read_csv, in that it assumes we'll have a header row, and that each column has a different dtype – usually determined by the binary types that Excel assigns to each value. So in theory, we can just say:
work_filename = 'data/eurostat_labor.xlsx'
work_df = pd.read_excel(work_filename)
The good news is that we'll get a data frame. The bad news is that it'll be the wrong data, in the wrong format, and with a lot of junk that interferes with our data. We'll need to pass a bunch of keyword arguments to read_excel to ensure that we get the data we want and need.
- First, we'll pass sheet_name, which allows us to indicate (via an integer or a string) which sheet we want to get back. If we pass a single value, then we'll get a single data frame back; if we pass a list, then we'll get a dict of data frames, where the dict keys are the sheet names/indexes, and the values are the data frames themselves (see the sketch just after this list). Here, we only want Sheet 1, so we can pass that.
- Then we'll pass header, which tells Pandas which row (numbered starting with 0) should be used for the column names. In this case, I decided to forgo the column names completely, instead setting the names myself via the names keyword argument, which I'll show in a bit. I did this by passing header=None, meaning that the spreadsheet doesn't contain any header names. That's not quite true, but Pandas doesn't know, and I'm happy to play a bit of a trick on it.
- I passed skiprows=14, indicating how many rows Pandas should skip over before starting to read data.
- I passed index_col=0, which asks that the first column – identified here by a numeric index, since I indicated with header=None that there aren't any column names in the file – should be used as the data frame's index.
- Normally, Pandas treats a number of values – the empty string, NA, NAN, and the like – as NaN values. But in this spreadsheet, two strings – : and 'not available' – were stand-ins for NaN. In order to tell read_excel that these values should also be treated as NaN, I passed the na_values keyword argument, giving it a list of strings that should be treated as NaN.
- After the data itself, the spreadsheet contained a number of non-data rows. I wanted to ignore those, so I passed nrows=38, meaning that after reading 38 rows of actual values into the data frame, it should stop reading. To arrive at this number, I did a bit of experimentation, although I could have just counted the rows in the spreadsheet manually.
- Finally, I assigned names to the remaining (non-index) columns. The original spreadsheet labeled them 2015-2024, one for each year of data. I could have passed a list of integers (or strings), but decided that it was easier to pass a range object. I thus passed names=range(2015, 2025). I gave 2025 as the upper bound so that the highest number we actually get in the range is 2024; remember that in Python, ranges and slices are always "up to and not including."
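As an aside, here's a minimal sketch of that dict-returning behavior of sheet_name. The second sheet name is a hypothetical example, not necessarily present in the Eurostat file:
# Passing a list to sheet_name returns a dict of data frames,
# keyed by the sheet names (or indexes) that we asked for.
# 'Sheet 2' is assumed here purely for illustration.
sheets = pd.read_excel(work_filename, sheet_name=['Sheet 1', 'Sheet 2'])
print(type(sheets))              # <class 'dict'>
print(sheets['Sheet 1'].shape)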
In the end, the call to read_excel looked like this:
work_filename = 'data/eurostat_labor.xlsx'
work_df = pd.read_excel(work_filename,
                        sheet_name='Sheet 1',
                        header=None,
                        skiprows=14,
                        index_col=0,
                        na_values=[':', 'not available'],
                        nrows=38,
                        names=range(2015, 2025))
But there was still a problem: On my computer, at least, I got complaints from the Excel file parser. These were in the form of warnings, not exceptions, which meant that the program still executed, and gave the desired result – but that it output a warning. This was more annoying than anything else, but it still left a bad taste in my mouth.
I decided to suppress this warning by importing the warnings module from the Python standard library. I then used a with statement and warnings.catch_warnings() to open a block within which I could control the warnings system. Within the block, I applied warnings.simplefilter('ignore'). The combination meant that warnings raised inside the with block would be ignored, while warnings elsewhere would be processed as usual.
The final code for loading my data frame was thus:
import warnings

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    work_df = pd.read_excel(work_filename,
                            sheet_name='Sheet 1',
                            header=None,
                            skiprows=14,
                            index_col=0,
                            na_values=[':', 'not available'],
                            nrows=38,
                            names=range(2015, 2025))
It's common for people to believe that if you assign to a variable within an indented Python block, you're doing so in a local scope, rather than the global scope. This isn't true! Scope isn't affected by indentation, which means that in this case, our work_df variable is global, and is available after the with block closes.
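Here's a tiny, self-contained sketch of that point, unrelated to our data:
# Assignments inside with/if/for blocks still create module-level (global)
# names; only function bodies, class bodies, and comprehensions introduce new scopes.
from contextlib import suppress

with suppress(ZeroDivisionError):
    x = 10          # assigned inside an indented block

print(x)            # 10 -- x is still available after the block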
The result is a data frame with 38 rows and 10 columns. Because we treated : and 'not available' as NaN values, all of the columns' dtypes are float64 – numeric values with which we'll be able to calculate.
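If you want to double-check that, here's a quick sketch of the sort of verification I'd run:
# Sanity-check the loaded data frame: 38 rows x 10 columns, all float64
print(work_df.shape)      # expected: (38, 10)
print(work_df.dtypes)     # expected: float64 for every year column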
We can examine the first two rows, for the EU as a whole and the euro area in particular, with filter. I passed the regex keyword argument along with the regular expression '^Euro', which means that we only wanted rows whose indexes started with "Euro". I specified axis='rows', so that we would be choosing based on rows, and not based on column names:
work_df.filter(regex='^Euro', axis='rows')
To make it a bit easier to read, I transposed it with T, an alias for transpose:
work_df.filter(regex='^Euro', axis='rows').T
I got:
European Union - 27 countries (from 2020) Euro area – 20 countries (from 2023)
2015 37.4 36.7
2016 37.4 36.7
2017 37.4 36.7
2018 37.4 36.7
2019 37.3 36.7
2020 37.2 36.6
2021 37.2 36.6
2022 37.3 36.6
2023 37.1 36.5
2024 37.1 36.4
In every year, we see that people in the euro area work fewer hours than people in the EU as a whole. I find that interesting, but I'm not sure what the reason would be.
One guess, purely speculative on my part, is that people in a wealthier society have the luxury of working fewer hours, and that perhaps this points to the success of the euro in making its users wealthier. I can think of a number of counter-arguments to this claim, including the fact that several wealthy countries (e.g., Sweden and Denmark) haven't adopted the euro.
We also see that the number of working hours has been declining in both the euro zone and the EU as a whole, with the gap between the two remaining fairly steady.
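One quick way to see that gap, in a sketch that reuses the filtered data frame from above:
# Year-by-year gap between the EU-27 row and the euro-area row;
# iloc[0] is the EU row and iloc[1] is the euro-area row, in the order shown above.
euro_rows = work_df.filter(regex='^Euro', axis='rows')
print(euro_rows.iloc[0] - euro_rows.iloc[1])   # roughly 0.6-0.7 hours in every year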
So... I'm not sure, and am open to thoughts and ideas.
In 2024 (the most recent year for which we have data), which five countries worked the most hours, and which five worked the fewest? Which five countries have most increased the number of hours worked (as a percentage) since 2015, and which have most decreased it?
Since we're only interested in 2024, we can use [] to grab that column, giving us a Pandas series:
(
    work_df
    [2024]
)
We can then invoke nlargest and nsmallest, which return the 5 largest and 5 smallest values, respectively. Or we can invoke agg on our series, allowing us to run more than one aggregation method at a time:
(
    work_df
    [2024]
    .agg(['nlargest', 'nsmallest'])
)
The result is a data frame in which we get both the largest and smallest values:
nlargest nsmallest
Türkiye 44.2
Serbia 42.2
Bosnia and Herzegovina 41.7
Greece 41.0
Poland 40.1
Netherlands 31.6
Denmark 33.5
Norway 34.7
Germany 34.8
Ireland 35.7
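If you would rather get two separate series than a single data frame, here's a minimal sketch of the direct calls:
# Calling nlargest and nsmallest directly returns two separate series,
# without the NaN padding that the agg version produces.
print(work_df[2024].nlargest(5))
print(work_df[2024].nsmallest(5))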
People work the longest hours in Turkey and Serbia, and the fewest hours in the Netherlands and Denmark. There definitely seems to be a correlation here between a country's overall wealth and economic health and the number of hours that people there work, even if we don't examine the numbers in depth.
I then asked which countries have increased and decreased the most (as a percentage) from 2015 to 2024. We can apply similar logic there, but first we need to grab two columns (2015 and 2024), and apply the Pandas window function pct_change across the columns:
(
    work_df
    [[2015, 2024]]
    .pct_change(axis='columns', fill_method=None)
    [2024]
    .agg(['nlargest', 'nsmallest'])
)
Notice that I passed fill_method=None to pct_change, so that it wouldn't try to fill in the NaN values (and complain about them) before calculating. Also notice that after running pct_change, the 2015 column contained only NaN values, whereas the 2024 column contained the real results. I thus grabbed 2024, giving us a single column on which to run our analysis:
nlargest nsmallest
Sweden 0.040760869565217295
Netherlands 0.03947368421052633
Norway 0.02359882005899716
Lithuania 0.0181818181818183
Switzerland 0.016949152542372836
Türkiye -0.0714285714285714
Iceland -0.04271356783919589
Finland -0.034946236559139865
Austria -0.032520325203251876
Serbia -0.032110091743119185
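As a quick sanity check, not part of the original solution, the same percentages can be computed by hand:
# Manual equivalent of pct_change between the two columns:
# (2024 - 2015) / 2015, country by country.
manual = (work_df[2024] - work_df[2015]) / work_df[2015]
print(manual.nlargest(5))      # should match the nlargest column above
print(manual.nsmallest(5))     # should match the nsmallest column above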
The raw pct_change output is accurate, but I wanted to make it a bit more readable. So I used map, applying a lambda function that puts each float in an f-string; the 0.2% format spec multiplies the value by 100, limits the output to two digits after the decimal point, and adds a % sign:
(
    work_df
    [[2015, 2024]]
    .pct_change(axis='columns', fill_method=None)
    [2024]
    .agg(['nlargest', 'nsmallest'])
    .map(lambda x: f'{x:0.2%}')
)
Here's the result that I got:
nlargest nsmallest
Sweden 4.08% nan%
Netherlands 3.95% nan%
Norway 2.36% nan%
Lithuania 1.82% nan%
Switzerland 1.69% nan%
Türkiye nan% -7.14%
Iceland nan% -4.27%
Finland nan% -3.49%
Austria nan% -3.25%
Serbia nan% -3.21%
Interestingly, we can see that Sweden, the Netherlands, and Norway – all rich countries – have increased the number of hours worked over the last decade, while Serbia has decreased them, as have Austria and Finland. This might well poke a hole in my rich-poor country theory from above.
Countries and continent-wide aggregates are large, complex things, and it's hard to come up with a single reason for anything that happens – but with a bit more digging through the data, we could probably find some interesting correlations or explanations.