BW #53: Airport animals (solution)

How many animals entered Heathrow Airport in 2023? What do people bring into the airport? And how has this changed over the years?

BW #53: Airport animals (solution)

This week, we’re looking at data from the City of London’s Heathrow Animal Reception Centre. If an animal (legally) enters the UK by air, then it almost certainly passes through HARC, which classifies and checks it. The center (er, “centre”) also takes care of animals that are hungry or injured — sort of an airport VIP lounge for non-human passengers. I read about HARC in a recent Economist story ("How to transport a rhino," https://www.economist.com/britain/2024/01/25/how-to-transport-a-rhino), which I recommend as being short, amusing, and informative.

I found a document, dated November 1st, 2023, describing HARC’s activities over a five-year period. This week, we extracted the data from that document, and then looked at the numbers and types of animals that HARC has dealt with. Along the way, we got some practice retrieving data from PDF files and cleaning the data. Of course, tracking animals isn’t anything new — in fact, you could argue that Noah was the first person in history to keep track of animals in binary.

Data and seven questions

The data is, as I mentioned, buried inside of a letter in PDF format. The letter can be downloaded from here:

https://committees.parliament.uk/writtenevidence/126507/default/

And no, there is no filename or extension on that URL. Going to that link should force the download of the data, at least from a normal browser. Using `wget` doesn't seem to work, however. Moreover, the link seems to produce a randomly named file with each download, so the filename I use here almost certainly won’t match the one that you get on your system.

I didn’t see a data dictionary for this information, but I think that it’s mostly self-explanatory. I did look up some of the animal-related terms, and will happily bore you with the details, if you like.

This week, I have seven tasks and questions for you to answer based on the data.

Turn the table on page 3 of the PDF into a data frame. I used `tabula-py` (a wrapper around the `tabula-java` package written in Java), available on PyPI (https://pypi.org/project/tabula-py/). I also used JPype1 (https://pypi.org/project/JPype1/), which improved the Python-to-Java communication.

First, let’s load Pandas:

import pandas as pd

With that in place, we can read the table from the PDF file into a Pandas data frame. We can load the “tabula” package and then use its “read_pdf” function to grab the tables in the PDF file. I indicated that I wanted to read tables from page 3 of the PDF. However, “read_pdf” returns a list of data frames, even if there is only one on the page. For this reason, we’ll need to retrieve the first element of that list, asking for index 0:

import tabula

filename = '/Users/reuven/Downloads/RG0OM-p-.pdf'
df = tabula.read_pdf(filename, pages=3)[0]

I found, however, that reading things in this way created a huge number of warnings. Some of them came from the Java system that tabula uses in order to read from PDF files, which disappeared when I installed the “JPype1” package. But others were pretty persistent, and seemed to be the result of some soon-to-be deprecated function calls within tabula. To quiet down those warnings, I loaded the “warnings” module from the standard library, and told it that I wanted to ignore “FutureWarning”, the category that we were getting.

The final code that I used thus looked like this:

import tabula
import warnings

warnings.filterwarnings("ignore", category=FutureWarning)  

filename = '/Users/reuven/Downloads/RG0OM-p-.pdf'
df = tabula.read_pdf(filename, pages=3)[0]

The result? We now have a data frame with the same data (pretty much) as was in the PDF file.

By the way, if you want to learn more about Python’s “warnings” module, you can watch my talk from PyCon US 2019:

The final column was mis-parsed, at least on my system, such that it contains information for both consignments and animals from 2023, separated by spaces. Replace this one column with two columns.

I was quite pleasantly surprised to see how well Tabula read data from the PDF file into Pandas. But I soon realized that the final two columns, containing data from the first 2/3 of 2023, was squished together into a single column.

The columns contained integers, and that the integers were separated by spaces. (If there had been words, then separating them would have been significantly harder.)

I started by using “str.split”, a Pandas method that we can apply to our series:

df['2023 Jan - Nov'].str.split()

Fortunately, this did the trick, giving me a 2-element list of strings in place of the original single string. But now what? How can I take that list and turn it into two separate Pandas series?

One option that I considered would be to grab the first element from each row, and then the second element from each row. I could do that, but it seemed pretty messy to me.

But an even better option is to use the “expand=True” keyword argument to “str.split”. This argument tells Pandas not to return a single series with a list of strings in each element, but rather to return a data frame, with each element in its own column:

df['2023 Jan - Nov'].str.split(expand=True)

This returns a two-column data frame, split in precisely the way that I wanted. However, its first row contained the original columns. I removed them by invoking “drop”, passing 0 (i.e., the first row in the data frame):

df['2023 Jan - Nov'].str.split(expand=True).drop(0)

I then assigned these two columns to our data frame by just assigning to two new column names:

df[['2023 Consignments', '2023 Animals']] = df['2023 Jan - Nov'].str.split(expand=True).drop(0)

The only remaining problem with the mis-parsed column is that the old one is still around. We can remedy that by invoking “drop” again, this time telling it to remove a column:

df = df.drop('2023 Jan - Nov', axis='columns')

Remove the first row. Set the index to be the (first) TAXA column. Drop the (final) "total" row. Drop the columns containing only NaN values (i.e., Unnamed 1, 3, 5, and 7). Turn all values into integers.

There’s still a bit of fixing and cleaning to do, as I outlined in this question.

I removed the first row by invoking “drop”, passing the integer 0, since that’s the index of the

df = (df
      .drop(0)
     )

Then I changed the index to use the “TAXA” column, using “set_index”:

df = (df
      .drop(0)
      .set_index('TAXA')
     )

Next, I removed the final line, containing the totals for all animals and years. Once again, I used “drop” to remove it:

df = (df
      .drop(0)
      .set_index('TAXA')
      .drop('Total')
     )

Next, I wanted to remove the columns that contained only NaN values. I thought about a few different ways to do it, and finally decided that we’re only talking about four columns, and I know their names — so I can just use “drop” again, passing it a list of the columns. I did decide, however, to use a list comprehension to do it in a slightly fancier (if overkill) way:

df = (df
      .drop(0)
      .set_index('TAXA')
      .drop('Total')
      .drop([f'Unnamed: {n}'
              for n in range(1, 8, 2)], axis='columns')
     )

Finally, I invoked “astype” on the data frame. Honestly, I hadn’t ever done this before; I often use “astype” on a series, but doing it to an entire data frame seemed like a bit much. And yes, you can pass “astype” a dict that specifies different types for different columns — but here, we’ll just use regular integers for all of the columns, so I passed the Python “int” type:

df = (df
      .drop(0)
      .set_index('TAXA')
      .drop('Total')
      .drop([f'Unnamed: {n}'
              for n in range(1, 8, 2)], axis='columns')
      .astype(int)
     )

The result is a data frame that we can finally (well, almost) start to use.

Read more