[Administrative note: Sorry for the publishing delay, but it has been a busy day! Slides from my presentation on the use of AI in teaching and learning Python are at https://speakerdeck.com/reuven/edu-summit-2025-dot-key. Tomorrow morning, I'm speaking on "The PyArrow revolution in Pandas." (https://us.pycon.org/2025/schedule/presentation/69/). And of course, I have a booth in the expo advertising my new-and-improved LernerPython.com learning platform — come and get a free T-shirt, as well as hang out to chat!]
This week, in the wake of massive problems at Newark Liberty Airport (and growing reports of problems at other US airports, too), we looked at flight delays in the United States, as recorded by the Department of Transportation. Their Bureau of Transportation Statistics publishes information about every flight, from the airports it flew between, to whether the flight was cancelled or delayed, to the responsible carrier.
The most recent data is from January, which means that we won't be able to look at data from the most recent month. But when that data does come out, we'll be able to apply the same techniques to understand the scope of the issues.
Normally, I ask you to solve Bamboo Weekly problems using Pandas. But this week, I asked you to solve every problem twice – once with Pandas, and a second time with Fireducks (https://fireducks-dev.github.io/), a relatively new package that claims to speed Pandas operations dramatically by (among other things) breaking them into separate pieces, each of which is then handled by a different core before getting reassembled.
We'll see that in some cases, Fireducks does indeed dramatically speed things up – but in others, at least in the case of the data set we looked at this week, it didn't seem to make any difference whatsoever.
Data and five questions
This week's data comes from the Bureau of Transportation Statistics. You can fill out a form to indicate which fields you want to download from here:
https://www.transtats.bts.gov/DL_SelectFields.aspx?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr
You'll want the following fields:
- Year
- Month
- Day_Of_Month
- Reporting_Airline
- Origin airport ID
- Destination airport ID
- DepDelay
- ArrDelay
- Cancelled
- Diverted
You will also need to download two more files, the "lookup table" (really, another CSV file) for the reporting airline, and the lookup table (yes, a third CSV file) for the airport codes.
Learning goals for this week include Fireducks, grouping, pivot tables, dates and times, and joins.
Paid subscribers can download these files at the end of this message.
If you're a paid subscriber, then you'll be able to download the data directly from a link at the bottom of this post, download my Jupyter notebook, and click on a single link that loads the notebook and the data into Google Colab.
Also: A video walk-through of me answering this week's first two questions will go online in the coming day or two. Meanwhile, check out the entire Bamboo Weekly playlist at https://www.youtube.com/playlist?list=PLbFHh-ZjYFwG34oZY24fSvFtOZT6OdIFm !
Here are my five questions:
Create two data frames – one with Fireducks, and a second with regular Pandas, from the "Ontime reporting" CSV file for January 2025. Create a `date` column based on the `YEAR`, `MONTH`, and `DAY_OF_MONTH` values. How much faster is Fireducks than Pandas? Than Pandas using the PyArrow CSV-loading engine?
Before doing anything else, we'll need to load up both Pandas and Fireducks. Normally, each is imported such that it's aliased to `pd`. Instead, I'll import Pandas as `pd` and Fireducks as `fd`, even though that goes against strong conventions. However, I need two different aliases if they'll both be loaded at the same time, so I don't have much choice:
import pandas as pd
import fireducks.pandas as fd
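If you want to confirm which versions are in play, here's an optional sanity check. (This is my own addition, and it assumes that Fireducks is installed under the distribution name "fireducks", which is how it appears on PyPI.)

from importlib.metadata import version   # standard library, Python 3.8+

# Pandas exposes its version directly; for Fireducks, ask the package metadata
print('pandas   :', pd.__version__)
print('fireducks:', version('fireducks'))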
Now that we've loaded those, and have access to both Pandas (`pd`) and Fireducks (`fd`), we can try to load the main CSV file (`T_ONTIME_REPORTING.csv`) into a data frame. We'll first use Pandas, invoking `read_csv`, which takes a filename as an argument and returns a data frame:
filename = 'data/T_ONTIME_REPORTING.csv'
pd_df = (
pd
.read_csv(filename)
)
This works, but I also asked you to turn the three separate columns `YEAR`, `MONTH`, and `DAY_OF_MONTH` into a single `datetime` column called `date`. To do that, we can use the `pd.to_datetime` function, invoked from within a `lambda` function inside of the value in `assign`. That is:
- `assign` adds a new column to the data frame
- Instead of invoking something right away, we pass a `lambda` function, one which takes the data frame as an argument and assigns it to the parameter `df_`
- The body of the `lambda` expression is a call to `pd.to_datetime`, returning a series of `datetime` values based on its input argument
- The input argument to `pd.to_datetime` is a dict, one in which we specify the year, month, and day from the three columns.
Here's how the full loading looks in Pandas:
filename = 'data/T_ONTIME_REPORTING.csv'
pd_df = (
pd
.read_csv(filename)
.assign(date = lambda df_: pd.to_datetime({'year': df_['YEAR'],
'month': df_['MONTH'],
'day': df_['DAY_OF_MONTH']}))
)
The result is a data frame with 539,747 rows and 11 columns.
I then repeated this in Fireducks, using the same code, except for invoking `read_csv` via `fd` rather than `pd`:
filename = 'data/T_ONTIME_REPORTING.csv'
fd_df = (
fd
.read_csv(filename)
.assign(date = lambda df_: pd.to_datetime({'year': df_['YEAR'],
'month': df_['MONTH'],
'day': df_['DAY_OF_MONTH']}))
)
Again, I get a data frame with 539,747 rows and 11 columns.
But wait – I wanted to find out how long it took to execute these queries. I thus added the Jupyter cell magic `%%timeit`, which calculates how much time it takes to execute a particular Jupyter cell. (It's a close cousin of `%timeit`, which checks the speed of a single line of code.) For example:
%%timeit
filename = 'data/T_ONTIME_REPORTING.csv'
pd_df = (
pd
.read_csv(filename)
.assign(date = lambda df_: pd.to_datetime({'year': df_['YEAR'],
'month': df_['MONTH'],
'day': df_['DAY_OF_MONTH']}))
)
The Pandas code, `%%timeit` reports, took 142 ms to run. The Fireducks code took much less, 48 ms. So just by making this switch, we cut the running time by more than 60 percent.
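By the way, `%%timeit` times an entire cell. If you only want to time the `read_csv` call by itself, the single-`%` line magic mentioned above works on one statement. Here's a minimal sketch, assuming `filename` is already defined in the notebook:

%timeit pd.read_csv(filename)
%timeit fd.read_csv(filename)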
But I started to wonder: What if I continued to use Pandas, but instead of loading with the regular, default loader, I used the PyArrow loader for CSV?
Here's the code I used, specifying `engine='pyarrow'` in our call to `read_csv`:
%%timeit
filename = 'data/T_ONTIME_REPORTING.csv'
pd_df = (
pd
.read_csv(filename, engine='pyarrow')
.assign(date = lambda df_: pd.to_datetime({'year': df_['YEAR'],
'month': df_['MONTH'],
'day': df_['DAY_OF_MONTH']}))
)
The result? PyArrow's CSV loader finished in 31.6 ms, faster than either of the others. My guess is that this CSV file, while not small, isn't large enough for Fireducks to demonstrate its main speed advantage, namely breaking a file into chunks and handing each chunk to a different core.
Because assignments in `%%timeit` cells aren't kept around in other cells, I then assigned to the variables in a separate cell:
filename = 'data/T_ONTIME_REPORTING.csv'
pd_df = (
pd
.read_csv(filename, engine='pyarrow')
.assign(date = lambda df_: pd.to_datetime({'year': df_['YEAR'],
'month': df_['MONTH'],
'day': df_['DAY_OF_MONTH']}))
)
fd_df = (
fd
.read_csv(filename)
.assign(date = lambda df_: pd.to_datetime({'year': df_['YEAR'],
'month': df_['MONTH'],
'day': df_['DAY_OF_MONTH']}))
)
So far, Fireducks has demonstrated that it's much faster than regular Pandas, but Pandas with the PyArrow loading engine takes away that first-place win.
Create a new data frame – again, once with Pandas and a second time with Fireducks – that combines the three CSV files together. We'll want a column for the airline name (based on `OP_UNIQUE_CARRIER` in the main CSV file), a column for the origin airport name (based on `ORIGIN_AIRPORT_ID` in the main CSV file), and a column for the destination airport name (based on `DEST_AIRPORT_ID` in the main CSV file). Compare the time needed for Pandas vs. Fireducks.
In this question, I asked you to combine the original CSV file with two others – one with information about each airport, and a second with information about each airline.
There are two basic ways to combine data frames in Pandas:
- If two data frames share (in part or in full) indexes, then we can perform a "join," creating a new data frame whose rows are based on the two previous ones.
- We can perform the same thing on two columns, one from each data frame, in a "merge." This is similar to a "join," except that we're merging on any two columns. (A minimal sketch of both appears just below.)
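Here's a small, self-contained sketch of both approaches, using tiny made-up tables (not the BTS data), just to show the mechanics:

import pandas as pd

# A tiny "flights" table that refers to airports by numeric ID
flights = pd.DataFrame({'ORIGIN_AIRPORT_ID': [10, 20, 10],
                        'DEP_DELAY': [5, 12, -3]})

# A tiny lookup table mapping each ID to a name
airports = pd.DataFrame({'Code': [10, 20],
                         'Description': ['Airport A', 'Airport B']})

# merge: match on two columns, one from each data frame
merged = flights.merge(airports,
                       left_on='ORIGIN_AIRPORT_ID',
                       right_on='Code')

# join: match on the indexes of the two data frames
joined = flights.set_index('ORIGIN_AIRPORT_ID').join(
    airports.set_index('Code'))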
The original data frame, `pd_df`, has all of the information about airlines, cities, and the like. But the carriers and airports are both listed with numbers. We would rather use names, since we're more familiar with them. What we'll do is a multi-way invocation of `merge`:
- We'll first merge `pd_df` with `pd_airport_df`, based on a match of `ORIGIN_AIRPORT_ID` in `pd_df` and `Code` in `pd_airport_df`.
- Then we'll merge the result of the above query with `pd_airport_df` (yes, a second time!), based on a match of `DEST_AIRPORT_ID` and `Code` in `pd_airport_df`.
- Then we'll merge that result with `pd_carriers_df`, wherever `OP_UNIQUE_CARRIER` matches `Code`.
There's only one problem with what I'm suggesting, which is that we'll get an error. That's because when you `join` or `merge` two data frames, you end up with a wider data frame than before, with columns from both data frames. What if more than one column has the same name? You'll get an error from `merge`.
Or you can pass the `suffixes` keyword argument, containing a 2-element tuple of strings, whose values are added to the (left, right) column names in the result:
pd_airport_df = pd.read_csv('data/L_AIRPORT_ID.csv')
pd_carriers_df = pd.read_csv('data/L_UNIQUE_CARRIERS.csv')
(
pd_df
.merge(pd_airport_df,
left_on='ORIGIN_AIRPORT_ID',
right_on='Code',
suffixes=(None, '_origin'))
.merge(pd_airport_df,
left_on='DEST_AIRPORT_ID',
right_on='Code',
suffixes=(None, '_dest'))
.merge(pd_carriers_df,
left_on='OP_UNIQUE_CARRIER',
right_on='Code',
suffixes=(None, '_carrier'))
)
How long does this take to execute? We'll use `%%timeit` again:
%%timeit
pd_airport_df = pd.read_csv('data/L_AIRPORT_ID.csv')
pd_carriers_df = pd.read_csv('data/L_UNIQUE_CARRIERS.csv')
(
pd_df
.merge(pd_airport_df,
left_on='ORIGIN_AIRPORT_ID',
right_on='Code',
suffixes=(None, '_origin'))
.merge(pd_airport_df,
left_on='DEST_AIRPORT_ID',
right_on='Code',
suffixes=(None, '_dest'))
.merge(pd_carriers_df,
left_on='OP_UNIQUE_CARRIER',
right_on='Code',
suffixes=(None, '_carrier'))
)
This took 105 ms to execute on my computer.
An almost identical query worked for Fireducks:
%%timeit
fd_airport_df = fd.read_csv('data/L_AIRPORT_ID.csv')
fd_carriers_df = fd.read_csv('data/L_UNIQUE_CARRIERS.csv')
(
fd_df
.merge(fd_airport_df,
left_on='ORIGIN_AIRPORT_ID',
right_on='Code',
suffixes=(None, '_origin'))
.merge(fd_airport_df,
left_on='DEST_AIRPORT_ID',
right_on='Code',
suffixes=(None, '_dest'))
.merge(fd_carriers_df,
left_on='OP_UNIQUE_CARRIER',
right_on='Code',
suffixes=(None, '_carrier'))
)
The result? Fireducks took 377 ms... more than 3x as long as the initial Pandas-based query. What's going on? I'm really not sure. It could be that joins and merges are especially slow in Fireducks. It could be that Fireducks only shines on truly huge data sets, and that this data isn't big enough to justify its use.
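If I wanted to dig deeper, one approach would be to benchmark outside of Jupyter's magic commands, using the standard library's `time.perf_counter`, and to make sure the result is actually materialized before the clock stops. The sketch below is only an illustration of how I might investigate; it assumes that `pd_df` and `fd_df` from earlier are already loaded, and the `len()` call is there because Fireducks is documented as evaluating lazily, so we want to force the merges to really run inside the timed region:

from time import perf_counter

def time_merges(df, airport_df, carriers_df, repeat=5):
    """Run the three-way merge several times; return the best time in seconds."""
    timings = []
    for _ in range(repeat):
        start = perf_counter()
        result = (
            df
            .merge(airport_df,
                   left_on='ORIGIN_AIRPORT_ID', right_on='Code',
                   suffixes=(None, '_origin'))
            .merge(airport_df,
                   left_on='DEST_AIRPORT_ID', right_on='Code',
                   suffixes=(None, '_dest'))
            .merge(carriers_df,
                   left_on='OP_UNIQUE_CARRIER', right_on='Code',
                   suffixes=(None, '_carrier'))
        )
        len(result)    # touch the result, so lazy evaluation can't defer the work
        timings.append(perf_counter() - start)
    return min(timings)

pd_airports = pd.read_csv('data/L_AIRPORT_ID.csv')
pd_carriers = pd.read_csv('data/L_UNIQUE_CARRIERS.csv')
fd_airports = fd.read_csv('data/L_AIRPORT_ID.csv')
fd_carriers = fd.read_csv('data/L_UNIQUE_CARRIERS.csv')

print('pandas   :', time_merges(pd_df, pd_airports, pd_carriers))
print('fireducks:', time_merges(fd_df, fd_airports, fd_carriers))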