I'm in Prague this week, attending the EuroPython conference (https://ep2025.europython.eu/). I gave a tutorial on Tuesday called "Let's build a dictionary!" (https://ep2025.europython.eu/session/let-s-build-a-dictionary), and will be speaking tomorrow about "What does = do?" (https://ep2025.europython.eu/session/what-does-do). I've also volunteered as a session chair over the last few days, introducing (some amazing) speakers. If you're at the conference, please come by and say "hello" in person!
As I wrote yesterday, I'm always amused by the way in which European conferences and get-togethers often evolve into comparisons between different countries' laws and customs. I thus decided that this week's questions would use data from Eurostat, which gathers and publishes statistics from a variety of European countries. There are obviously many different ways in which we can measure countries, so I decided to choose two topics tracked by Eurostat – the number of weekly hours worked (on average) by people in each country, and the percentage of homes with Internet access.
This week's learning goals include working with Excel files, handling date-time values, using window functions and joins, and filtering data.
Paid subscribers (including subscribers to my LernerPython+data membership program) can download the data files and my Jupyter notebook. You can also click on a link to use my notebook inside of Google Colab. All of those links are at the bottom of this post.
Data and six questions
This week's data comes from Eurostat (https://ec.europa.eu/eurostat/), which collects, analyzes, and publishes data on a wide variety of topics. We'll look at only two of these, trying to answer a few questions about European countries' similarities and differences:
- Number of weekly hours worked: https://ec.europa.eu/eurostat/databrowser/view/lfsa_ewhun2/default/table
- Internet access: https://ec.europa.eu/eurostat/databrowser/view/isoc_ci_in_h/default/table
On each of these pages, click on the "download" button above the data set. You'll be able to download an Excel file (among other formats), each with a number of sheets containing the data that we want to look at.
First, let's look at how many weekly work hours people put in. Read "Sheet 1" from the work-hours file into a Pandas data frame. Examine the file to see which values should be considered NaN, and make the country-name column the index. Suggest a reason why the EU figures and the euro-area figures might differ.
Let's start by bringing in Pandas:
import pandas as pd
We can normally read an Excel file into Pandas using read_excel. It works much like read_csv, in that it assumes we'll have a header row, and that each column has a different dtype – usually determined by the binary types that Excel assigns to each value. So in theory, we can just say:
work_filename = 'data/eurostat_labor.xlsx'
work_df = pd.read_excel(work_filename)
The good news is that we'll get a data frame. The bad news is that it'll be the wrong data, in the wrong format, and with a lot of junk that interferes with our data. We'll need to pass a bunch of keyword arguments to read_excel to ensure that we get the data we want and need.
- First, we'll pass sheet_name, which allows us to indicate (via an integer or a string) which sheet we want to get back. If we pass a single value, then we'll get a single data frame back; if we pass a list, then we'll get a dict of data frames, where the dict keys are the sheet names/indexes, and the values are the data frames themselves (see the sketch just after this list). Here, we only want Sheet 1, so we can pass that.
- Then we'll pass header, which tells Pandas which row (numbered starting with 0) should be used for the column names. In this case, I decided to forgo the column names completely, instead setting the names myself via the names keyword argument, which I'll show in a bit. I did this by passing header=None, meaning that the spreadsheet doesn't contain any header names. That's not quite true, but Pandas doesn't know, and I'm happy to play a bit of a trick on it.
- I passed skiprows=14, indicating how many rows Pandas should skip over before starting to read data.
- I passed index_col=0, which asks that the first column – identified here by a numeric index, since I indicated with header=None that there aren't any column names in the file – should be used as the data frame's index.
- Normally, Pandas treats a number of values – the empty string, NA, NAN, and the like – as NaN values. But in this spreadsheet, two strings – : and 'not available' – were stand-ins for NaN. In order to tell read_excel that these values should also be treated as NaN, I passed the na_values keyword argument, giving it a list of strings that should be treated as NaN.
- After the data itself, the spreadsheet contained a number of non-data rows. I wanted to ignore those, so I passed nrows=38, meaning that after reading 38 rows of actual values into the data frame, it should stop reading. To arrive at this number, I did a bit of experimentation, although I could have just counted the rows in the spreadsheet manually.
- Finally, I assigned names to the remaining (non-index) columns. The original spreadsheet labeled them 2015-2024, one for each year of data. I could have passed a list of integers (or strings), but decided that it was easier to pass a range object. I thus passed names=range(2015, 2025). I gave 2025 as the upper bound so that the highest number we actually get in the range is 2024; remember that in Python, ranges and slices are always "up to and not including."
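As an aside, here's a minimal sketch of that dict-returning behavior of sheet_name. The second sheet name is a hypothetical example, not necessarily present in the Eurostat file:
# Passing a list to sheet_name returns a dict of data frames,
# keyed by the sheet names (or indexes) that we asked for.
# 'Sheet 2' is assumed here purely for illustration.
sheets = pd.read_excel(work_filename, sheet_name=['Sheet 1', 'Sheet 2'])
print(type(sheets))              # <class 'dict'>
print(sheets['Sheet 1'].shape)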
In the end, the call to read_excel looked like this:
work_filename = 'data/eurostat_labor.xlsx'
work_df = pd.read_excel(work_filename,
                        sheet_name='Sheet 1',
                        header=None,
                        skiprows=14,
                        index_col=0,
                        na_values=[':', 'not available'],
                        nrows=38,
                        names=range(2015, 2025))
But there was still a problem: On my computer, at least, I got complaints from the Excel file parser. These were in the form of warnings, not exceptions, which meant that the program still executed, and gave the desired result – but that it output a warning. This was more annoying than anything else, but it still left a bad taste in my mouth.
I decided to suppress this warning by importing the warnings module from the Python standard library. I then used a with statement and warnings.catch_warnings() to open a block within which I could control the warnings system. Within the block, I applied warnings.simplefilter('ignore'). The combination meant that warnings raised inside the with block would be ignored, while warnings elsewhere would be processed as usual.
The final code for loading my data frame was thus:
import warnings

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    work_df = pd.read_excel(work_filename,
                            sheet_name='Sheet 1',
                            header=None,
                            skiprows=14,
                            index_col=0,
                            na_values=[':', 'not available'],
                            nrows=38,
                            names=range(2015, 2025))
It's common for people to believe that if you assign to a variable within an indented Python block, you're doing so in a local scope, rather than the global scope. This isn't true! Scope isn't affected by indentation, which means that in this case, our work_df variable is global, and is available after the with block closes.
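Here's a tiny, self-contained sketch of that point, unrelated to our data:
# Assignments inside with/if/for blocks still create module-level (global)
# names; only function bodies, class bodies, and comprehensions introduce new scopes.
from contextlib import suppress

with suppress(ZeroDivisionError):
    x = 10          # assigned inside an indented block

print(x)            # 10 -- x is still available after the block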
The result is a data frame with 38 rows and 10 columns. Because we treated : and 'not available' as NaN values, all of the columns' dtypes are float64 – numeric values with which we'll be able to calculate.
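If you want to double-check that, here's a quick sketch of the sort of verification I'd run:
# Sanity-check the loaded data frame: 38 rows x 10 columns, all float64
print(work_df.shape)      # expected: (38, 10)
print(work_df.dtypes)     # expected: float64 for every year column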
We can examine the first two rows, for the EU as a whole and the euro area in particular, with filter. I passed the regex keyword argument along with the regular expression '^Euro', which means that we only wanted rows whose indexes started with "Euro". I specified axis='rows', so that we would be choosing based on rows, and not based on column names:
work_df.filter(regex='^Euro', axis='rows')
To make it a bit easier to read, I transposed it with T, an alias for transpose:
work_df.filter(regex='^Euro', axis='rows').T
I got:
European Union - 27 countries (from 2020) Euro area – 20 countries (from 2023)
2015 37.4 36.7
2016 37.4 36.7
2017 37.4 36.7
2018 37.4 36.7
2019 37.3 36.7
2020 37.2 36.6
2021 37.2 36.6
2022 37.3 36.6
2023 37.1 36.5
2024 37.1 36.4
In every year, we see that people in the euro area work fewer hours than people in the EU as a whole. I find that interesting, but I'm not sure what the reason would be.
One guess, purely speculative on my part, is that people in a wealthier society have the luxury of working fewer hours, and that perhaps this points to the success of the euro in making its users wealthier. I can think of a number of counter-arguments to this claim, including the fact that several wealthy countries (e.g., Sweden and Denmark) haven't adopted the euro.
We also see that the number of working hours has been declining in both the euro zone and the EU as a whole, with the gap between the two remaining fairly steady.
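One quick way to see that gap, in a sketch that reuses the filtered data frame from above:
# Year-by-year gap between the EU-27 row and the euro-area row;
# iloc[0] is the EU row and iloc[1] is the euro-area row, in the order shown above.
euro_rows = work_df.filter(regex='^Euro', axis='rows')
print(euro_rows.iloc[0] - euro_rows.iloc[1])   # roughly 0.6-0.7 hours in every year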
So... I'm not sure, and am open to thoughts and ideas.
In 2024 (the most recent year for which we have data), which five countries worked the most hours, and which five worked the fewest? Which five countries have most increased the number of hours worked (as a percentage) since 2015, and which have most decreased it?
Since we're only interested in 2024, we can use [] to grab that column, giving us a Pandas series:
(
    work_df
    [2024]
)
We can then invoke nlargest and nsmallest, which return the 5 largest and 5 smallest values, respectively. Or we can invoke agg on our series, allowing us to run more than one aggregation method at a time:
(
    work_df
    [2024]
    .agg(['nlargest', 'nsmallest'])
)
The result is a data frame in which we get both the largest and smallest values:
nlargest nsmallest
Türkiye 44.2
Serbia 42.2
Bosnia and Herzegovina 41.7
Greece 41.0
Poland 40.1
Netherlands 31.6
Denmark 33.5
Norway 34.7
Germany 34.8
Ireland 35.7
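If you would rather get two separate series than a single data frame, here's a minimal sketch of the direct calls:
# Calling nlargest and nsmallest directly returns two separate series,
# without the NaN padding that the agg version produces.
print(work_df[2024].nlargest(5))
print(work_df[2024].nsmallest(5))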
People work the longest hours in Turkey and Serbia, and the fewest hours in the Netherlands and Denmark. There definitely seems to be a correlation here between a country's overall wealth and economic health and the number of hours that people there work, even if we don't examine the numbers in depth.
I then asked which countries have increased and decreased the most (as a percentage) from 2015 to 2024. We can apply similar logic there, but first we need to grab two columns (2015 and 2024), and apply the Pandas window function pct_change across the columns:
(
    work_df
    [[2015, 2024]]
    .pct_change(axis='columns', fill_method=None)
    [2024]
    .agg(['nlargest', 'nsmallest'])
)
Notice that I passed fill_method=None to pct_change, so that it wouldn't try to fill in the NaN values (and complain about them) before calculating. Also notice that after running pct_change, the 2015 column contained only NaN values, whereas the 2024 column contained the real results. I thus grabbed 2024, giving us a single column on which to run our analysis:
nlargest nsmallest
Sweden 0.040760869565217295
Netherlands 0.03947368421052633
Norway 0.02359882005899716
Lithuania 0.0181818181818183
Switzerland 0.016949152542372836
Türkiye -0.0714285714285714
Iceland -0.04271356783919589
Finland -0.034946236559139865
Austria -0.032520325203251876
Serbia -0.032110091743119185
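As a quick sanity check, not part of the original solution, the same percentages can be computed by hand:
# Manual equivalent of pct_change between the two columns:
# (2024 - 2015) / 2015, country by country.
manual = (work_df[2024] - work_df[2015]) / work_df[2015]
print(manual.nlargest(5))      # should match the nlargest column above
print(manual.nsmallest(5))     # should match the nsmallest column above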
The raw pct_change output is accurate, but I wanted to make it a bit more readable. So I used map, applying a lambda function that puts each float in an f-string; the 0.2% format spec multiplies the value by 100, limits the output to two digits after the decimal point, and adds a % sign:
(
    work_df
    [[2015, 2024]]
    .pct_change(axis='columns', fill_method=None)
    [2024]
    .agg(['nlargest', 'nsmallest'])
    .map(lambda x: f'{x:0.2%}')
)
Here's the result that I got:
nlargest nsmallest
Sweden 4.08% nan%
Netherlands 3.95% nan%
Norway 2.36% nan%
Lithuania 1.82% nan%
Switzerland 1.69% nan%
Türkiye nan% -7.14%
Iceland nan% -4.27%
Finland nan% -3.49%
Austria nan% -3.25%
Serbia nan% -3.21%
Interestingly, we can see that Sweden, the Netherlands, and Norway – all rich countries – have increased the number of hours worked over the last decade, while Serbia has decreased them, as have Austria and Finland. This might well poke a hole in my rich-poor country theory from above.
Countries and continent-wide aggregates are large, complex things, and it's hard to come up with a single reason for anything that happens – but with a bit more digging through the data, we could probably find some interesting correlations or explanations.