[Administrative note: Are you at PyCon US? Please come and say "hi"! I'm giving a talk at tomorrow's Python education summit (https://us.pycon.org/2025/events/education-summit/) on the use of AI in teaching and learning Python, and then "The PyArrow revolution in Pandas" on Friday morning (https://us.pycon.org/2025/schedule/presentation/69/). I have a booth in the expo advertising my new-and-improved LernerPython.com learning platform — come get a free T-shirt and hang out to chat!]
This week marks the start of PyCon US 2025, with lots of talks, tutorials, and activities on the agenda. People come from all over the world to attend PyCon US – and this year, a number of them found their flights into Newark Liberty Airport were delayed, canceled, or rerouted to another airport.
Why? It seems that there were two basic sets of issues: equipment failures and staffing shortages. The staffing shortages ranged from electricians (needed to fix problematic equipment, as well as to do preventive maintenance) to air-traffic controllers (who were already woefully understaffed). The combination has led to a number of terrifying stories, including air-traffic controllers being unable to see where planes were for 90 seconds at a time. And Newark, as a busy airport, requires more controller training than most others (https://www.nytimes.com/2025/05/12/business/newark-airport-delays-staffing-shortage.html?unlocked_article_code=1.HE8.Ke_D.yCujYtg90-cx&smid=url-share).
This week, we'll look at airport delays, based on data from the Department of Transportation's Bureau of Transportation Statistics. The most recent available data is from January, which means we can't yet examine the month in which these problems occurred. But when that data does come out, we'll be able to apply the same techniques to understand the scope of the issues.
Normally, I ask you to solve Bamboo Weekly problems using Pandas. But this week, I'll ask you to perform each query twice: once with regular Pandas, and a second time with a relatively new package, FireDucks (https://fireducks-dev.github.io/), which implements much of the Pandas API while claiming to improve performance dramatically. We'll put that claim to the test, running a number of queries in both Pandas and FireDucks and seeing which runs faster.
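One simple way to run these head-to-head comparisons is a small timing helper built on `time.perf_counter`. This is only a sketch: the `timed` helper is my own invention, and a tiny inline CSV stands in for the real BTS data. Per the FireDucks documentation, switching libraries should require changing only the import line (assuming FireDucks is installed).

```python
import time
from io import StringIO

import pandas as pd
# To time FireDucks instead, swap the import (per its docs):
# import fireducks.pandas as pd


def timed(label, fn):
    """Run fn(), print how long it took, and return its result."""
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.4f} sec")
    return result


# Tiny stand-in for the real on-time reporting CSV file
csv_data = StringIO("YEAR,MONTH,DAY_OF_MONTH,DEP_DELAY\n"
                    "2025,1,15,12\n"
                    "2025,1,16,-3\n")

df = timed("read_csv", lambda: pd.read_csv(csv_data))
```

For a fair comparison on the real data, you'd run each loader more than once, since disk caching and library warm-up can skew a single measurement.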
Data and five questions
This week's data comes from the Bureau of Transportation Statistics. You can fill out a form to indicate which fields you want to download from here:
https://www.transtats.bts.gov/DL_SelectFields.aspx?gnoyr_VQ=FGJ&QO_fu146_anzr=b0-gvzr
You'll want the following fields:
- Year
- Month
- Day_Of_Month
- Reporting_Airline
- Origin airport ID
- Destination airport ID
- DepDelay
- ArrDelay
- Cancelled
- Diverted
You'll also need to download two more files: the "lookup table" (really, another CSV file) for the reporting airline, and the lookup table (yes, a third CSV file) for the airport codes.
Learning goals for this week include FireDucks, grouping, pivot tables, dates and times, and joins.
Paid subscribers can download these files at the end of this message.
I'll be back tomorrow with full explanations and solutions, including a downloadable Jupyter notebook.
Here are my five questions:
- Create two data frames – one with FireDucks, and a second with regular Pandas – from the "Ontime reporting" CSV file for January 2025. Create a `date` column based on the `YEAR`, `MONTH`, and `DAY_OF_MONTH` values. How much faster is FireDucks than Pandas? Than Pandas using the PyArrow CSV-loading engine?
- Create a new data frame – again, once with Pandas and a second time with FireDucks – that combines the three CSV files together. We'll want a column for the airline name (based on `OP_UNIQUE_CARRIER` in the main CSV file), a column for the origin airport name (based on `ORIGIN_AIRPORT_ID` in the main CSV file), and a column for the destination airport name (based on `DEST_AIRPORT_ID` in the main CSV file). Compare the time needed for Pandas vs. FireDucks.
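For the date-column question, one approach is `pd.to_datetime`, which can assemble dates from a frame whose columns are named `year`, `month`, and `day`. This is a sketch with made-up sample rows, not the real BTS file; with FireDucks, only the import line should need to change.

```python
import pandas as pd
# import fireducks.pandas as pd  # drop-in swap if FireDucks is installed

# Made-up sample of the relevant on-time reporting columns
df = pd.DataFrame({
    "YEAR": [2025, 2025],
    "MONTH": [1, 1],
    "DAY_OF_MONTH": [15, 16],
})

# pd.to_datetime assembles dates from columns named year/month/day,
# so rename the BTS columns before converting
df["date"] = pd.to_datetime(
    df[["YEAR", "MONTH", "DAY_OF_MONTH"]]
    .rename(columns={"YEAR": "year", "MONTH": "month", "DAY_OF_MONTH": "day"})
)

print(df["date"].dt.strftime("%Y-%m-%d").tolist())
# → ['2025-01-15', '2025-01-16']
```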
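And for the three-way join, the merges might look like this sketch, with tiny made-up lookup tables standing in for the downloaded CSVs. The `Code`/`Description` column names follow the usual BTS lookup-table format, but verify them against your actual download; the `airline_name`, `origin_name`, and `dest_name` column names are my own choices.

```python
import pandas as pd

# Tiny stand-ins for the three downloaded CSV files
flights = pd.DataFrame({
    "OP_UNIQUE_CARRIER": ["UA", "DL"],
    "ORIGIN_AIRPORT_ID": [11618, 12478],
    "DEST_AIRPORT_ID": [12478, 11618],
})
airlines = pd.DataFrame({
    "Code": ["UA", "DL"],
    "Description": ["United Air Lines Inc.", "Delta Air Lines Inc."],
})
airports = pd.DataFrame({
    "Code": [11618, 12478],
    "Description": ["Newark, NJ: Newark Liberty International",
                    "New York, NY: John F. Kennedy International"],
})

# Rename each lookup table's columns to match the join key, then merge;
# the airports table is reused for both origin and destination
merged = (
    flights
    .merge(airlines.rename(columns={"Code": "OP_UNIQUE_CARRIER",
                                    "Description": "airline_name"}),
           on="OP_UNIQUE_CARRIER", how="left")
    .merge(airports.rename(columns={"Code": "ORIGIN_AIRPORT_ID",
                                    "Description": "origin_name"}),
           on="ORIGIN_AIRPORT_ID", how="left")
    .merge(airports.rename(columns={"Code": "DEST_AIRPORT_ID",
                                    "Description": "dest_name"}),
           on="DEST_AIRPORT_ID", how="left")
)
print(merged[["airline_name", "origin_name", "dest_name"]])
```

Using `how="left"` keeps every flight row even if a code is missing from a lookup table, which makes unmatched codes easy to spot as NaN values.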