Newsletter

BW #60: Iceland (solution)

I'm back from a vacation in Iceland! This week, we'll thus look at some Iceland-related statistics, including their population (not very large!) and facts about their tourism industry.

Reuven M. Lerner

Apr 4, 2024 — 15 min read

This week, we looked at a variety of data sources having to do with Iceland. As I wrote yesterday, I just got back from a vacation there, during which we drove around the country’s ring road.

It was hard not to notice how few people are in Iceland. It wasn’t unusual for us to go half an hour — on their main national highway! — without seeing any other cars, in either direction. Gas stations are all self-serve, stores have a limited number of employees, and parking lots are all automated, using a combination of apps, cameras, and threatening signs to ensure that you pay.

I thus thought it would be fun and interesting to look into some statistical data about Iceland — and along the way, use a variety of Pandas techniques and tools.

By the way, this is an AI-generated illustration of Godafoss, a huge and beautiful waterfall in northern Iceland

And here’s a photo I took at Godafoss on March 26th:

The waterfall is huge and impressive — but I gotta say, ChatGPT did a pretty great job with its illustration!

Data and six questions

This week, we looked at a few different data sets, all having to do with Iceland’s population and tourism. As always, a link to the Jupyter notebook I used in my solutions is at the bottom of this message.

Create a data frame from Wikipedia's page listing the size of each country and territory (https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area). Now create a second data frame from Wikipedia's page listing the population of each country and territory (https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations) ). Where does Iceland rank in terms of size? In terms of population? In terms of people per square kilometer of land?

The first thing I did was to load Pandas:

import pandas as pd

With that in place, I wanted to create two data frames, each of them from a table on a Wikipedia page. Fortunately, Pandas comes with “read_html”, which returns a list of data frames — one for each table that it encountered on the provided URL. I started by defining the URL and then using “read_html”, retrieving index 1 from that list:

area_url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area'
area_df = (pd
           .read_html(area_url)[1]
           )

The thing is, I wanted to get the size of each country as a number, so that I could perform calculations on it, The Wikipedia page displayed the area as both square km and square miles, and also with commas every three digits.

I first decided to remove the square miles. I did this using a combination of “assign”, which lets me add a new column to a data frame, along with “lambda”, for an anonymous function. This function is handed the data frame, which I assign to a parameter “df_”, indicating that it’s temporary.

But what did I want to assign? I called the new column “area_with_commas”, so that I’d remember that it might not have square miles, but still has commas in it. I then used “str.replace” with a regular expression to remove anything in parentheses from the string. Specifically, my regular expression said:

Look for \(, meaning a literal open parentheses
Look for \S+, meaning one or more non-whitespace characters
Look for \), meaning a literal closed parentheses

I used a Python raw string (starting with an “r”) to ensure that the backslashed characters were passed along to the regular expression engine untouched.

Given this, the string “10 (20)” would be turned into the string “10”. I passed the keyword argument “regex=True” to tell “str.replace” that my source string was a regular expression.

But “assign” can take any number of keyword arguments, and they’re processed in order. I thus passed it a second keyword argument, this one assigning to “area” — and here, I removed the commas from the number, and returned a floating-point number using “astype”.

The result of this call to “assign” is a data frame with two new columns, “area_with_commas” and “area”:

area_url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area'
area_df = (pd
           .read_html(area_url)[1]
           .assign(area_with_commas = lambda df_: df_['Total in km2 (mi2)'].str.replace(r'\(\S+\)', r'', regex=True),
                  area = lambda df_: df_['area_with_commas'].str.replace(',', '').astype(float))
          )

Next, I used a list of strings inside of square brackets to select only the columns that are of interest to me, namely “area” (which I just created) and “Country / dependency”, the name of the country:

area_url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area'
area_df = (pd
           .read_html(area_url)[1]
           .assign(area_with_commas = lambda df_: df_['Total in km2 (mi2)'].str.replace(r'\(\S+\)', r'', regex=True),
                  area = lambda df_: df_['area_with_commas'].str.replace(',', '').astype(float))
           [['area', 'Country / dependency']]
          )

Finally, I used “set_index” to make the country name into the index of the data frame:

area_url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_area'
area_df = (pd
           .read_html(area_url)[1]
           .assign(area_with_commas = lambda df_: df_['Total in km2 (mi2)'].str.replace(r'\(\S+\)', r'', regex=True),
                  area = lambda df_: df_['area_with_commas'].str.replace(',', '').astype(float))
           [['area', 'Country / dependency']]
           .set_index('Country / dependency')
          )

Why set the index in this way? Because the “join” method only works when the index matches up. By ensuring that both of our data frames (this one and the next one) both have countries in their indexes, we’ll be able to join them more easily.

And indeed, I also asked you to create a data frame based on Wikipedia’s country population page. Here, I had to do something similar. I first used “read_html” to download a list of data frames from the Wikipedia page, and selected index 0:

population_url = 'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'

population_df = (pd
                 .read_html(population_url)[0]
                )

I then used “assign” to replace the existing “Location” column with its current value, minus any letters in square brackets that served as footnotes. My regular expression looked like this:

\[, search for a literal open square bracket
\w+, search for one or more alphanumeric characters
\], search for a literal close square bracket

In this way, I replaced the existing “Location” column with one that didn’t have any footnotes.

I also assigned to “population”, just copying the existing column with the population numbers from July 1st 2023. I just found this a bit easier than renaming the column:

population_url = 'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'

population_df = (pd
                 .read_html(population_url)[0]
                 .assign(Location=lambda df_: df_['Location'].str.replace(r'\[\w+\]', '', regex=True),
                        population=lambda df_: df_['Population (1 July 2023)'])
                )

Finally, I selected the two columns to which I had assigned (“Location” and “population”), and used “set_index” to set the country to be the index:

population_url = 'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'

population_df = (pd
                 .read_html(population_url)[0]
                 .assign(Location=lambda df_: df_['Location'].str.replace(r'\[\w+\]', '', regex=True),
                        population=lambda df_: df_['Population (1 July 2023)'])
                 [['Location', 'population']]
                 .set_index('Location')
                )

With these two data frames in place, I can now answer the questions that were posed.

First, where does Iceland rank in terms of size (i.e., land area)? I’ll start by using “assign” to create a new column. The values in this new column will come from “rank”, a Pandas method that generates a numbered ranking based on the column:


(
    area_df
    .assign(rank=lambda df_: df_['area'].rank(ascending=False))
)

We can then use “loc” with the row selector (“Iceland”) and column selector (“rank”) to get Iceland’s size rank:


(
    area_df
    .assign(rank=lambda df_: df_['area'].rank(ascending=False))
    .loc['Iceland', 'rank']
)

I get a rank of 114. This means that of the various countries and regions in this table (which includes the entire planet, “Earth,” so it’s not exactly a precise ranking of countries), we’re at 114 from the top.

We can do something similar with population:

(
    population_df
    .assign(rank=lambda df_: df_['population'].rank(ascending=False))
    .loc['Iceland', 'rank']
)

Here, I get a rank of 181.

Finally, how can I rank Iceland’s population per square kilometer? I’ll need to use data from both of the data frames we’ve created. I can combine them, as I mentioned earlier, with “join”, thanks to the fact that both data frames use country names in their indexes:

df = (population_df
      .join(area_df)
     )

Note that this is a “left inner join,” the default way that joins are accomplished. That means the index of population_df will determine which rows are kept; if a country exists in area_df but not in population_df, it’ll be ignored. That’s find for our purposes, but you should know that there are both “right” and “outer” joins that take different approaches.

Next, I calculate the number of people per square km, and use “assign” to add this as a new column:

df = (population_df
      .join(area_df)
      .assign(people_per_km = lambda df_: df_['population'] / df_['area'])
     )

Finally, I use “sort_values” to sort by the new column, and I show the top 5 results — meaning, the countries with the smallest number of people per square km:

df = (population_df
      .join(area_df)
      .assign(people_per_km = lambda df_: df_['population'] / df_['area'])
      .sort_values('people_per_km')
      .head(5)
     )

The result:

                     population       area  people_per_km
Location                                                 
Greenland (Denmark)     56643.0  2166086.0       0.026150
Mongolia              3447157.0  1564116.0       2.203901
Namibia               2604172.0   824292.0       3.159283
Australia            26439112.0  7741220.0       3.415368
Iceland                375319.0   103000.0       3.643874

Iceland might not have the smallest number of people per square km, but it’s not too far off! It has about the same density as Australia and Namibia — but those countries are far larger than Iceland. And they still have much larger populations, even if they’re spread out over a greater area.

Next, we'll look at tourism to Iceland. Go to the OECD's data portal (https://stats.oecd.org/?lang=en). Even though it claims that this page is no longer relevant, that's not true! Search for "inbound tourism" in the widget on the left side, and download the CSV file with info about inbound tourism for the entire OECD. Create a data frame from that data in which the year is the index, and the value (i.e., number of tourists) is a column. You'll want only those rows where the variable is INB_ARRIVALS_TOTAL.

How many tourists come to Iceland? From walking through Reykjavik, seeing the people and the oodles of shops aimed at tourists, it would seem like there are quite a lot. But why trust my eyes, when I can check the data, instead?

I downloaded the OECD’s tourism data for countries over the last few years. (Yes, I admit that it’s a bit of a pain.) As is often the case with OECD data, they jam a bunch of different things into the same CSV file, expecting you to filter out the rows having to do with the topic you’re researching.

In this case, I used “read_csv” to read the data into Pandas. However, I wasn’t interested in all of the columns, so I passed “usecols” to select the ones I did want:

tourism_filename = 'TOURISM_INBOUND_03042024145736816.csv'

tourism_df = (pd
              .read_csv(tourism_filename,
                        usecols=['Country', 'Year', 
                                 'Value', 'VARIABLE'])
             )

I then used “loc” to select only those rows where “VARIABLE” was equal to “INB_ARRIVALS_TOTAL”, meaning the number of tourists who entered a country in a given year:

tourism_filename = 'TOURISM_INBOUND_03042024145736816.csv'

tourism_df = (pd
              .read_csv(tourism_filename,
                        usecols=['Country', 'Year', 
                                 'Value', 'VARIABLE'])
              .loc[lambda df_: df_['VARIABLE'] == 'INB_ARRIVALS_TOTAL']
             )

I used “set_index” to set the year as the index, and then kept only the “Value” and “Country” columns:

tourism_filename = 'TOURISM_INBOUND_03042024145736816.csv'

tourism_df = (pd
              .read_csv(tourism_filename,
                        usecols=['Country', 'Year', 
                                 'Value', 'VARIABLE'])
              .loc[lambda df_: df_['VARIABLE'] == 'INB_ARRIVALS_TOTAL']
              .set_index('Year')
              [['Value', 'Country']]
             )

This gave me a data frame in which the (non-unique) year values formed the index, with each row showing a country name and the number of tourists who entered that year.

The resulting data frame had 562 rows and the two columns we requested.

BW #76: Aging legislators (solution)

BW #76: Aging legislators

Pandas office hours, July 21st (recording)

Pandas office hours, July 21st