BW #52: Border encounters (solution)

[Administrative note: Office hours for paid Bamboo Weekly subscribers will take place on Sunday. Come with any and all questions about Pandas! I’ll send a note with the Zoom link tomorrow.]

This week, we looked at data provided by the US government regarding “border encounters,” when officers from the Department of Homeland Security’s Customs and Border Protection (CBP) met people who hadn’t legally entered the United States.

The CBP classifies each of these encounters in one of three ways: expulsion (i.e., the person was removed from the US without a hearing), apprehension (i.e., the person was found to have entered the US illegally), or inadmissible (i.e., the person tried to enter at an established port of entry, but didn’t have the appropriate documentation).

My impression, based on reading the news, was that there was a recent, massive surge in people entering the US outside of standard ports of entry. Southern states have complained bitterly about the large number of people entering illegally — although according to US and international law, many people who are classified as “apprehension” or “inadmissible” are able to claim asylum or refugee status. In such cases, they then stay in the US until their case is heard before a judge.

Data and nine questions

This week’s data came from the CBP. The main source of data that they offer is a CSV file describing all border encounters since fiscal year 2021. (We’ll discuss fiscal years in detail, below.) There are also data files for pre-2021 border encounters, but my impression is that they were reported in a different way, and this was enough data for us to get a sense of the current trends.

The CSV file can be downloaded from the main CBP data page, at:

https://www.cbp.gov/document/stats/nationwide-encounters

The specific file that I asked you to download was for information from fiscal year 2021 through fiscal year 2024, ending in December of FY 2024:

https://www.cbp.gov/sites/default/files/assets/documents/2024-Jan/nationwide-encounters-fy21-fy24-dec-aor.csv

A data dictionary, describing the different fields and values contained in the CSV file, was here:

/content/files/sites/default/files/assets/documents/2023-Sep/nationwide-encounters-data-dictionary.pdf

Here are the nine questions and tasks that I gave you, with my detailed solutions and explanations. A link to the Jupyter notebook I used to perform the calculations follows my solutions:

Read the data from the CSV file into a data frame. Convert "2024 (FYTD)" into just "2024".

Before doing anything else, I loaded Pandas into Python:

import pandas as pd

Next, I used “read_csv” to load the CSV file into a data frame:

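# Name of the CSV file downloaded from the CBP link above
filename = 'nationwide-encounters-fy21-fy24-dec-aor.csv'
df = (pd
      .read_csv(filename)
     )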

However, I wanted to change values in the “Fiscal Year” column to “2024” from “2024 (FYTD)”, meaning “fiscal year to date.” I can understand why, before FY 2024 is complete, the data would indicate that it was incomplete. However, this would cause a lot of trouble in working with dates, and I decided to standardize it.

I decided to use the “replace” method on the data frame. There are several ways to invoke this method; one is to simply pass it a dictionary whose key-value pairs tell Pandas which values should be replaced, and with what. We could, in theory, have done just that, but I decided to use a more advanced feature of replace, restricting our search-and-replace operation to a particular column. Here, the outer dict’s key is “Fiscal Year,” and its value is another dict whose key-value pairs indicate the values to be found and replaced:

df = (pd
      .read_csv(filename)
      .replace({'Fiscal Year':{'2024 (FYTD)':'2024'}})
     )
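To see the difference between the two calling styles, here is a minimal sketch on a made-up data frame. The “Note” column and all of the values are invented for illustration and are not taken from the CBP file. A flat dict replaces the value wherever it appears, while the nested dict only touches the named column:

```python
import pandas as pd

# Toy data, invented for illustration; 'Note' is a hypothetical column
toy = pd.DataFrame({
    'Fiscal Year': ['2023', '2024 (FYTD)'],
    'Note': ['2024 (FYTD)', 'complete'],
})

# Flat dict: the value is replaced wherever it appears, in any column
everywhere = toy.replace({'2024 (FYTD)': '2024'})

# Nested dict: the outer key names the only column to search
one_column = toy.replace({'Fiscal Year': {'2024 (FYTD)': '2024'}})

print(everywhere)
print(one_column)
```

In the second result, the “Note” column still contains the original string, because the replacement was limited to “Fiscal Year.”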

Create a new "date" column, based on the "Fiscal Year" and "Month (abbv)" columns, containing a datetime value for that year and month based on the fiscal year. Make that the index.

It’s always nice when we get a pre-packaged date/time field in a CSV file. We can then pass the “parse_dates” keyword argument to read_csv, and get a datetime field.
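As a rough sketch of that easy case, here is what it looks like with a tiny, made-up CSV (the column names “when” and “encounters” are my own inventions, and the text is read from a string rather than from a real file):

```python
import io
import pandas as pd

# A made-up CSV with a ready-made date column (this is not the CBP file)
csv_text = """when,encounters
2023-10-01,100
2023-11-01,120
"""

# parse_dates takes a list of the columns that should be parsed as datetimes
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=['when'])

print(toy.dtypes)
```

The “when” column comes back with a datetime dtype, with no further conversion needed.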

In this case, though, we weren’t so lucky: We got separate columns for the year (well, the fiscal year) and for the month. How can we turn that into a datetime column?

One way is to use “pd.to_datetime”, a Pandas function that takes a series of strings and returns a series of datetime objects. If we can create a series of strings in a reasonable format, then we can call to_datetime on those values.

I decided to use “assign” to create the new column. I built its values by concatenating (with “+”) the “Fiscal Year” and “Month (abbv)” columns, with a hyphen between them.

However, when I ran pd.to_datetime on the resulting string, I got some Pandas warnings that told me the format was ambiguous, and that I should specify it clearly by passing a format string.

Such format strings are the same ones accepted by “strftime” and “strptime”, which format and parse date strings, respectively. You can read about the different format codes here:

https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes

The basic idea is that the string is taken literally except for % followed by particular letters. For example, %Y means a 4-digit year, and %b is the abbreviated name of a month (e.g., “Jan”). By passing a format string of “%Y-%b”, we can tell Pandas to parse the dates we have created in our new column, resulting in a datetime dtype:

df = (
    df
    .assign(date = pd.to_datetime(df['Fiscal Year'] + 
                                  '-' + df['Month (abbv)'], 
                                  format='%Y-%b'))
)

Note that when you create a datetime object but only supply a year and month, the resulting datetime has the year and month you specified, with the day being the 1st of the month. The time component is similarly set to midnight.
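Here is a small sketch that shows those defaults, using two made-up series of year and month strings in place of the real columns. It assumes an English locale, since %b matches locale-specific month abbreviations:

```python
import pandas as pd

# Made-up year and month strings, standing in for the two CSV columns
years = pd.Series(['2023', '2024'])
months = pd.Series(['Oct', 'Jan'])

# Same technique as above: join with a hyphen, then parse with '%Y-%b'
dates = pd.to_datetime(years + '-' + months, format='%Y-%b')

print(dates)
print(dates.dt.day.tolist())    # day of the month for each value
print(dates.dt.hour.tolist())   # hour for each value
```

Every value lands on the first day of its month, at midnight.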

With this date column in place, I then asked you to use it as the data frame’s index. We can use the “set_index” method to accomplish this:

df = (
    df
    .assign(date = pd.to_datetime(df['Fiscal Year'] + 
                                  '-' + df['Month (abbv)'], 
                                  format='%Y-%b'))
    .set_index('date')
)

We now have a data frame whose index contains datetime values. This is known as a “time series,” and it’s both common and very useful.
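One example of why this is useful: with a datetime index, “loc” accepts a partial date string and returns every row that falls within that period. The following is a minimal sketch on invented numbers, not on the CBP data:

```python
import pandas as pd

# Toy time series, invented for illustration
toy = pd.DataFrame(
    {'encounters': [100, 120, 90]},
    index=pd.to_datetime(['2023-11-01', '2023-12-01', '2024-01-01']),
)

# Partial-string indexing: all rows whose date falls in 2023
in_2023 = toy.loc['2023']

# The same works at month granularity
in_jan_2024 = toy.loc['2024-01']

print(in_2023)
print(in_jan_2024)
```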

However, as we’ll soon see, there are some problems with using fiscal years as if they were calendar years.
