
BW #131: Canadian border crossings (solution)

Get better at: CSV files, PyArrow, reducing memory usage, working with dates and times, pivot tables, and plotting.

For years, one group of travelers reliably visited the United States on a regular basis: Canadians. With the world's longest undefended border, a history of friendly relations, and a common language, it seems natural that Canadians would want to visit attractions in the United States.

But I've seen a growing number of articles (including https://www.travelandtourworld.com/news/article/canadian-travelers-are-turning-their-back-on-the-u-s-for-europe-heres-why/) indicating that Canadians have been visiting the US in smaller numbers than before. This is partly due to President Donald Trump's policies toward foreigners, including many people living legally in the United States. But it's also due to his tariff policy and his attacks on Canada, including claims that the United States could and should annex it.

I thought that it would be interesting not just to read about declines in Canadian border crossings, but to perform the calculations ourselves. And so, this week we'll look at Canadian government statistics about border crossings, and how these numbers have changed over time.

Data and five questions

This week's data comes from Statistics Canada (https://www.statcan.gc.ca/en/start), the Canadian government's statistics bureau. The specific data we'll look at is from a page with the oh-so-exciting title, "Leading indicator, International visitors entering or returning to Canada by land, by vehicle type, vehicle licence plate and traveller type," which accurately describes the contents, even if it won't win any awards for exciting text. That page is at https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=2410005701 .

However, that Web page displays only a small part of the data. I clicked on "download options," and chose the second-to-last option, for a CSV file containing all of the data. That downloaded a large (881 MB) zipfile to my computer. Unzipping it produced a 16 GB CSV file.
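If you would rather script the extraction than unzip by hand, here's a minimal sketch using Python's standard zipfile module. (The zipfile's name below is a placeholder; use whatever name Statistics Canada gives your download.)

import zipfile

# Placeholder name; substitute the actual name of the downloaded zipfile
zip_path = 'statcan-download.zip'

with zipfile.ZipFile(zip_path) as zf:
    print(zf.namelist())   # see what's inside before extracting
    zf.extractall('.')     # writes the ~16 GB CSV into the current directory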

This file can take quite a while to download, so be prepared and patient. (I was neither!)

Learning goals for this week include PyArrow, reducing memory usage, working with dates and times, pivot tables, and plotting.

Paid subscribers, including members of my LernerPython+data membership program, can download a copy of the file from the end of this post. You can also download my Marimo notebook.

Here are my five tasks and questions:

Read the CSV file into Pandas twice, once using the regular NumPy backend, and a second time using PyArrow as the backend. What is the difference in memory usage between the two? How long does it take to load each one? (Note that these data frames will consume a fair amount of memory, so you should try this on computers with lots of memory.)

I started off by importing Pandas:

import pandas as pd

Then, because the file is in CSV format, I used read_csv to read it into a Pandas data frame. However, I knew that the file was very large, and thus decided not to use the default Pandas engine for reading CSV files.

Instead, I specified engine='pyarrow'. Note that this doesn't change the way in which the data is stored in the data frame – but PyArrow is almost always faster than the default Pandas engine at reading data.

Just to clarify: We can use PyArrow to read CSV files. But we can also, separately, use PyArrow for our back-end data storage. Saying engine='pyarrow' does the former. We'll see the latter in a moment.

(
    pd
        .read_csv(filename, engine='pyarrow')
)

I also wanted to know how much memory it would consume. I added a call to memory_usage, which returns a per-column summary of how much memory each column uses. In the case of string data, memory_usage doesn't return an accurate number unless you specify deep=True. So I did that, and then invoked sum to total up the memory from all of the columns:

(
    pd
        .read_csv(filename, engine='pyarrow')
).memory_usage(deep=True).sum()

The above took just over 6 minutes on my computer. And how much memory did the data frame consume? 56,602,870,103 bytes, or 56 GB. That's already a pretty significant amount of memory for a single data frame!
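If you want to reproduce the timing yourself, here's a minimal sketch using time.perf_counter; it assumes, as above, that filename points to the unzipped CSV file. (In a Jupyter notebook, the %%time cell magic is another easy option.)

import time

import pandas as pd

# filename is the path to the 16 GB CSV file, as in the cells above
start = time.perf_counter()
df = pd.read_csv(filename, engine='pyarrow')
elapsed = time.perf_counter() - start

# deep=True counts the Python strings themselves, not just the pointers to them
mem_bytes = df.memory_usage(deep=True).sum()
print(f'Loaded in {elapsed:.0f} seconds, using {mem_bytes:,} bytes')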

PyArrow isn't just another way to read CSV files into Pandas. It primarily provides an in-memory data structure that is far more efficient than the default NumPy storage used in Pandas. Using PyArrow for our backend storage is still experimental, but we can activate it by passing dtype_backend='pyarrow' to read_csv:

(
    pd
    .read_csv(filename, engine='pyarrow', 
              dtype_backend='pyarrow')
)

Once again, I invoked memory_usage on the resulting data frame, then invoked sum to total up its memory usage:

(
    pd
    .read_csv(filename, engine='pyarrow', 
              dtype_backend='pyarrow')
).memory_usage(deep=True).sum()

The above cut the loading time in half, to only three minutes. And the memory usage was only 17,615,004,670 bytes, about one-third the size of the NumPy-backed data frame.

Using PyArrow might thus seem like a big win! It took half the time and one-third the memory of the default Pandas options.

However, I quickly discovered that things aren't quite that easy: I started to perform the other analysis tasks using my PyArrow-based data frame, and found myself getting a variety of errors having to do with the lengths of strings in my data frame, something that I definitely didn't expect to encounter.

So while PyArrow continues to tease us with faster execution speed and lower memory usage, it's still not quite ready for prime time. You can still try to use it in your production systems, but don't be surprised if you encounter some weird errors and incompatibilities along the way.
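If you do want to experiment with the PyArrow backend, one cheap sanity check is to load a small sample and look at the dtypes you get back; the string columns come back as Arrow-backed dtypes rather than object, which is a good hint about where incompatibilities might show up. Here's a sketch, using the default engine with nrows to keep the sample quick on a 16 GB file:

# Read a small sample with Arrow-backed storage, and inspect the resulting dtypes
sample = pd.read_csv(filename, nrows=100_000, dtype_backend='pyarrow')

print(sample.dtypes)   # string columns show up as Arrow-backed dtypes, not object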

Using the NumPy version, reduce the memory usage as best as you can. We only need the following columns: REF_DATE, GEO, Vehicle licence plate, Vehicle type, Traveller characteristics, Traveller type, VALUE, and UOM.

Given my failure to use PyArrow for back-end storage, I decided that it would still be worthwhile to try to crunch down the size of our data frame with a NumPy back end.

One easy way to reduce the size of a data frame is by limiting the columns that we load from the file. If we aren't going to be using a column – and it's sometimes obvious which columns you'll need, and which you won't, only after some experimentation – then we can leave it out of usecols, a keyword argument that takes a list of the columns we want to load:


df = (
    pd
    .read_csv(filename, engine='pyarrow',
              usecols=['REF_DATE', 'GEO', 'Vehicle licence plate',
                       'Vehicle type', 'Traveller characteristics',
                       'Traveller type', 'VALUE', 'UOM'])
)

Another great way to save memory, as well as to make certain calculations possible, is by treating a column as datetime rather than as a string. datetime values are always 64 bits, which is almost always smaller than the equivalent string containing a date and time. We can pass the parse_dates keyword argument to read_csv in order to specify which columns should be treated as datetime values:


df = (
    pd
    .read_csv(filename, engine='pyarrow',
              parse_dates=['REF_DATE'],
              usecols=['REF_DATE', 'GEO', 'Vehicle licence plate',
                       'Vehicle type', 'Traveller characteristics',
                       'Traveller type', 'VALUE', 'UOM'])
)

However, there is one more place where we can save memory: Most of the columns have a dtype of object, meaning that they contain Python strings. The good news is that Python strings are flexible, but the bad news is that they consume a lot of memory, and they are stored as Python objects rather than inside a NumPy array, which means that retrieving them takes extra time.

Moreover, these strings tend to repeat themselves, which means that we're unnecessarily using lots of extra memory. Fortunately, Pandas offers us the category dtype, which operates something like an enum in a programming language: It takes every unique string in the series and assigns it a number. It then stores the number-to-string translation table, and stores the integers instead of the strings.

The category dtype, when applied to a string column with a great deal of repetition, is one of the biggest and easiest memory wins you can get in a Pandas data frame. If a data frame has already been loaded, you can invoke astype('category') on a column to get back a new version that uses categories.
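For example, here's a quick sketch of converting the GEO column of the df we loaded above, peeking at the translation table, and measuring the memory savings:

# Memory used by GEO while it's still an ordinary object (string) column
before = df['GEO'].memory_usage(deep=True)

# Convert the column to a category dtype
df['GEO'] = df['GEO'].astype('category')
after = df['GEO'].memory_usage(deep=True)

print(df['GEO'].cat.categories[:5])   # the unique strings, stored only once
print(df['GEO'].cat.codes.head())     # the small integers actually stored per row
print(f'{before:,} bytes before, {after:,} bytes after')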

But in our case, loading the data frame and then converting it would just take extra time. We can instead specify the dtype for one or more columns by passing the dtype keyword argument to read_csv, giving it a Python dict for a value. Each key in the dict should be a column name, and each corresponding value is the dtype we want to associate with it – which, in our case, will be 'category':


df = (
    pd
    .read_csv(filename, engine='pyarrow',
              parse_dates=['REF_DATE'],
              usecols=['REF_DATE', 'GEO', 'Vehicle licence plate',
                       'Vehicle type', 'Traveller characteristics',
                       'Traveller type', 'VALUE', 'UOM'],
              dtype={'GEO': 'category',
                     'Vehicle licence plate': 'category',
                     'Vehicle type': 'category',
                     'Traveller characteristics': 'category',
                     'Traveller type': 'category',
                     'UOM': 'category'})
)

The result? It took a bit more than 4 minutes to load. But the memory usage was only 6,680,612,440 bytes – about 12 percent of the size of the original data frame.
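If you're curious where the savings came from, a per-column breakdown (rather than the sum) shows which columns benefited most from the category and datetime conversions:

# Per-column memory usage, largest first, to see which columns shrank the most
print(df.memory_usage(deep=True).sort_values(ascending=False))

# Confirm that the dtypes are what we asked for
print(df.dtypes)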

In other words, with just a few minutes of fiddling, we managed to cut the data frame down to a much smaller size, without losing any of the information useful for our analysis.