[Want to get experience with agentic coding, with a mentor who will guide you the whole way? Join my upcoming cohort of HOPPy (Hands-on Projects in Python, https://LernerPython.com/hoppy).
This time, everyone will build an ambitious project — of their own choosing — using agentic tools such as Claude Code. The 8-week cohort starts Sunday, so don't delay! You can catch a recording of yesterday's info session at https://www.youtube.com/watch?v=Wk-KHDHkjkI if you have questions. Or just reach out to me, reuven@lerner.co.il.]
Summer (in the Northern Hemisphere) hasn't quite started yet, but it's already feeling quite warm – and many people I know are already discussing their summer vacation plans. That, along with a tidbit I heard on a podcast led me to wonder just how much vacation time people get in different countries.
So this week, we looked at several data sets having to do with vacation and holidays in different countries – how much people get, and also how much they take.
Data and five questions
Our main data set is Wikipedia, whose "List of annual leave per country" (https://en.wikipedia.org/wiki/List_of_minimum_annual_leave_by_country) provides a good starting point for comparing countries around the world. We'll retrieve that file, and use its table as our main starting point.
We'll also look at an interesting data set from Expedia, which asks people in a number of rich countries whether they feel deprived about vacation time. I got that from an article from CNBC, https://www.cnbc.com/2024/07/01/americans-take-less-time-off-but-europeans-are-more-vacation-deprived.html. We'll use the two (short) CSV files that you can download from that story, one about the vacation deprivation, and another about how many vacation days people in different countries take.
Paid subscribers, both to Bamboo Weekly and to my LernerPython+data membership program (https://LernerPython.com) get all of the questions and answers, as well as downloadable data files, downloadable versions of my notebooks, one-click access to my notebooks, and invitations to monthly office hours.
Learning goals for this week include retrieving data from the Web, joins, regular expressions, cleaning data, and plotting with Plotly.
Here are my five questions, along with my solutions and explanations:
Turn the main table from the Wikipedia page into a Pandas data frame. Remove all footnote references, turn missing values into NaN, take the lowest number from any range of numbers. Turn the vacation, holidays, and total columns into float dtypes. What five countries have the most mandated vacation days, and which have most holidays?
Before doing anything else, I loaded up Pandas and Plotly. I also loaded two other modules:
requests, since we'll be retrieving data via HTTP, andStringIO, which gives us a file-like interface to textual data.
The full start to my code was thus:
import pandas as pd
import requests
from io import StringIO
from plotly import express as pxMy original plan was to use read_html, a Pandas method that scrapes a Web site and returns a list of data frames that it finds there. However, Wikipedia is cracking down on scraping, which meant that we needed to try something else.
That's where requests came into play: Given a URL, you can invoke requests.get. That returns a "response" object via which you can get the numeric status code, a byte string with the content, or even the content as a regular Python string.
That's all great, except that Wikipedia won't let you scrape with requests, either. How does it know that I'm using requests, rather than a regular browser? The big giveaway is the User-Agent request header, with which a browser identifies itself. Fortunately, requests makes it quite easy to change the value of User-Agent or any other HTTP request header.
Rather than pretend to be a browser, I found that I could just give my own value for User-Agent, indicating that I'm doing this for Bamboo Weekly, and giving my e-mail address. I thought that this was a bit more honest, and went with it.
That worked, giving me a response object. I retrieved the text attribute from the response, passing that to StringIO, which returned a file-like object in memory. I then passed that StringIO to read_html, which (as I had hoped) returned a list of Pandas data frames, one for each HTML table in the file:
url = 'https://en.wikipedia.org/wiki/List_of_minimum_annual_leave_by_country'
headers = {
'User-Agent': 'BambooWeekly-DataFetch/1.0 (https://bambooweekly.com; reuven@lerner.co.il)'
}
response = requests.get(url, headers=headers)
html = response.content
all_dfs = pd.read_html(StringIO(response.text), na_values=['–'])
The only unusual thing in the above call to read_html was the use of the na_values keyword argument. This data used an en-dash (not a regular minus sign!) to indicate that there were missing values; by passing it to na_values, Pandas turns those into np.nan values.
A bit of experimenting showed that the table in which I was interested was at index 1. But as is often the case when you scrape data from a Web site, there was a lot of cleaning to do, in order to get it into usable shape. I ended up using two replace calls, indicating that I wanted to turn one thing into another:
- There were a number of footnotes in the data, all in the form of
[n], wherenwas an integer. I usedreplaceto remove all examples of[followed by one or more digits (\d+), followed by]. - Some countries had multiple measures for vacation days, expressed as
10-20or10/20. Some countries had different rules for different ages, for example. I just wanted the first number. I thus used a regular expression to ignore leading whitespace (^\s*), then grabbed\d+, meaning one or more digits, followed by one or more groups of\D.*, meaning a non-digit followed by one or more digits. I replaced all of that with the initial digits that I found.
Here's how that looks:
df = (
all_dfs[1]
.replace(r'\[\d+\]', '', regex=True)
.replace(r'^\s*(\d+)(\D.*)$', r'\1', regex=True)
)
To give the columns one-word names, I invoked set_axis, passing axis='columns', so that it wouldn't try to replace the data frame's index:
df = (
all_dfs[1]
.replace(r'\[\d+\]', '', regex=True)
.replace(r'^\s*(\d+)(\D.*)$', r'\1', regex=True)
.set_axis(['country', 'vacation', 'holidays', 'total', 'notes'], axis='columns')
)
Having cleaned the data, I was then able to convert three columns to floats. I did this with a dict comprehension, which I passed to astype. And finally, I set the index to be country names:
df = (
all_dfs[1]
.replace(r'\[\d+\]', '', regex=True)
.replace(r'^\s*(\d+)(\D.*)$', r'\1', regex=True)
.set_axis(['country', 'vacation', 'holidays', 'total', 'notes'], axis='columns')
.astype({c: float for c in ['vacation', 'holidays', 'total']})
.set_index('country')
)
The resulting data frame has 193 rows and 4 columns.
To find the countries with the most mandated vacation days, I invoked nlargest, indicating that I wanted to get rows back from df based on the vacation column, taking the 5 top rows:
(
df
.nlargest(columns='vacation', n=5)
)The result: Andorra, Algeria, Bahrain, Burkina Faso, and Equatorial Guinea. Note that if more than 5 countries share a maximum number, we'll only get the first 5.
What about holidays? Cambodia has the most, with 27 (!), followed by Iran, Lebanon, Sri Lanka, and Argentina.
What countries have 0 paid holidays? What countries have more paid holidays than vacation days?
Does any country really have 0 paid holidays? There's only one way to find out:
(
df
.loc[pd.col('vacation') == 0]
)Here, I'm using pd.col inside of .loc for a quick and easy filter on our data. And it turns out that there are four countries that mandate 0 paid holidays: Micronesia, Nauru, Palau, and the United States.
The data set discussed vacation days and holidays, and it occurred to me that perhaps there are countries with more holidays than vacation days. I created this query:
(
df
.loc[pd.col('holidays') > pd.col('vacation')]
)I quickly discovered that many countries have more paid holidays than vacation days – 32 of them, in fact.
In both of these queries, I used the combination of loc and pd.col, a powerful combination that is relatively new, given that pd.col is part of Pandas 3.0, which was released in January. If you have been using lambda along with loc , then you should strongly look into using pd.col instead. Queries using pd.col tend to be easier to write, read, and maintain than the earlier form using lambda.