Hurricane Melissa (https://en.wikipedia.org/wiki/Hurricane_Melissa) continues to cause massive destruction in the Caribbean, as yesterday's New York Times reported (https://www.nytimes.com/2025/10/29/world/americas/hurrican-melissa-jamaica-cuba-haiti.html?unlocked_article_code=1.xU8.DEhN.OKVaOHX-GiVU&smid=url-share). With winds blowing as fast as 295 km/h, it's no wonder that the storm has caused a great deal of property destruction, as well as injuries and deaths, in countries such as the Dominican Republic, Haiti, and Jamaica.
This week, we'll thus be looking at hurricane data from the Caribbean and North/Central America. We'll explore topics from wind speed to the names that are chosen to describe storms.
Data and six questions
This week's data is from IBTrACS, the International Best Track Archive for Climate Stewardship at the National Oceanic and Atmospheric Administration in the Department of Commerce (https://www.ncei.noaa.gov/products/international-best-track-archive). Their data is available in a number of formats, including CSV. From what I've seen, it's the most comprehensive, globally standardized archive of tropical-cyclone tracks available.
We'll be looking at data from the North Atlantic (labeled NA), which can be downloaded from https://www.ncei.noaa.gov/data/international-best-track-archive-for-climate-stewardship-ibtracs/v04r01/access/csv/ibtracs.NA.list.v04r01.csv.
A good data dictionary for this data set is available at https://www.ncei.noaa.gov/data/international-best-track-archive-for-climate-stewardship-ibtracs/v04r01/doc/IBTrACS_v04r01_column_documentation.pdf.
This week's learning goals include: Cleaning data, grouping, working with dates and times, and plotting.
Paid subscribers to Bamboo Weekly, including members of my LernerPython online learning platform (https://LernerPython.com), get access to all questions and answers, a downloadable file with the data, invitations to monthly office hours, downloadable Jupyter/Marimo notebooks, and one-click access to my notebook in Google Colab or Molab, depending on the notebook style. Links to all of these are at the bottom of this issue.
Here are my solutions and explanations:
Read the data in from the CSV file. We only need some of the columns: SEASON, SUBBASIN, NAME, ISO_TIME, LAT, LON, WMO_WIND, and MLC_CLASS. ISO_TIME needs to be a datetime. WMO_WIND and SEASON should be integers. LAT and LON should be floats.
I started by loading up Pandas and Plotly Express (which I'll use in some plotting later):
import pandas as pd
from plotly import express as px

Since the data was in CSV format, it made sense that I should read it into Pandas using read_csv:
filename = '/Users/reuven/Downloads/ibtracs.NA.list.v04r01.csv'
df = (pd
.read_csv(filename)
)

I decided to use PyArrow to load the CSV file, since it's generally much faster, and is also smarter about identifying datetime columns. But loading the file this way didn't seem to work: I got some weird warnings. Worst of all, no matter what options I passed to read_csv, I couldn't get the ISO_TIME column to be in datetime format.
I finally figured out that while the first line of the file did contain column headers, the second line contained some measurement units and notes. That messed everything up, giving just about every column a dtype of object, which basically means "strings." And while strings are fine in general, they aren't very useful when trying to calculate windspeed or perform other numeric calculations. Moreover, that row prevented me from running astype on a number of columns.
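One easy way to see the problem for yourself (a quick diagnostic, not part of the final solution) is to read just the first few rows as strings and eyeball them:

# read the header plus the first three rows, forcing strings so that
# Pandas doesn't complain about mixed types; row 0 turns out to be
# the units/notes line, not actual data
pd.read_csv(filename, nrows=3, dtype=str)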
I first approached this by loading the file, removing that first (non-header) row, and then using assign on multiple columns to change the dtype. That worked, but it felt extremely clumsy. And besides, if you're running assign on nearly every column of interest, it feels like there must be a better way.
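That first attempt looked roughly like the following sketch (reconstructed from memory, with only a few of the columns shown; the real version converted every column of interest):

# clumsy first attempt: drop the units/notes row, then fix each
# column's dtype individually with assign
df = (pd
      .read_csv(filename, low_memory=False)
      .iloc[1:]   # remove the units/notes row that followed the headers
      .assign(SEASON=lambda df_: df_['SEASON'].astype(int),
              LAT=lambda df_: df_['LAT'].astype(float),
              LON=lambda df_: df_['LON'].astype(float),
              ISO_TIME=lambda df_: pd.to_datetime(df_['ISO_TIME']))
     )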
One possibility would be to use the header keyword argument to read_csv, telling Pandas that there aren't any headers at all. That, combined with skiprows, would allow me to get the data in the right format.
But of course, that would mean not having the column names, and not being able to select them easily.
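For what it's worth, that rejected approach would have looked something like this (a sketch; note the integer column labels):

# skipping both the header row and the units row; Pandas falls back
# to labeling the columns 0, 1, 2, ... so df['NAME'] no longer works
df = pd.read_csv(filename, header=None, skiprows=2)
df.columns    # RangeIndex(start=0, stop=..., step=1)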
I finally hit upon an interesting solution, which involved invoking read_csv twice:
- I read the file in once, passing skiprows=2 and also header=None, basically telling Pandas that the entire file (except for the first two lines) is legitimate data, and should be read into a data frame.
- Inside of that invocation of read_csv, I invoked read_csv a second time, on the same file! The point of this second invocation was to get the column names, so I passed nrows=5, meaning that Pandas should only read five rows. I then invoked .columns on this truncated data frame, getting its column names – which I used as the value of the names keyword argument on the first read_csv:
filename = '/Users/reuven/Downloads/ibtracs.NA.list.v04r01.csv'
df = (pd
.read_csv(filename, engine='pyarrow',
header=None, skiprows=2,
names=pd.read_csv(filename, nrows=5).columns)
)
In other words, I grabbed the column names, and only the column names, via read_csv, and then assigned those to be the column names of the data frame I retrieved without column names.
This code might have been a bit weird, but it was much cleaner and more straightforward than my previous attempts to get this to work with assign. That said, it didn't completely do the job: the WMO_WIND column, which records the windspeed, was still an object column, containing strings. That's because it contained space characters (' ') to indicate missing data.
I know that I asked you to set WMO_WIND to be integers, and that's because I had originally used str.replace to replace ' ' with '0', and then ran astype(int) on the column. But I didn't feel super comfortable with that, and it was only after I published yesterday's newsletter that the answer became obvious: use the na_values keyword argument to read_csv to indicate that ' ' should be treated as a NaN value. True, the column will then contain floats, rather than ints, but that's more than fine:
df = (pd
.read_csv(filename, engine='pyarrow',
header=None, skiprows=2, na_values=[' '],
names=pd.read_csv(filename, nrows=5).columns)
)

With the data and column names in place, I passed a list of column names (strings) inside of [] to pare down the data frame:
df = (pd
.read_csv(filename, engine='pyarrow',
header=None, skiprows=2, na_values=[' '],
names=pd.read_csv(filename, nrows=5).columns)
[['SEASON', 'SUBBASIN','NAME', 'ISO_TIME',
'LAT', 'LON', 'WMO_WIND', 'MLC_CLASS']]
)

This resulted in a data frame with 127,619 rows and 8 columns. The first storm information was recorded in June 1851 (!), and the most recent data, at least when I downloaded the file, was from October 28th.
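If you want to verify those numbers against your own download (the file is updated regularly, so your counts may differ), a quick check works:

# sanity-check the frame's dimensions and the range of timestamps
print(df.shape)                # (127619, 8) when I downloaded the file
print(df['ISO_TIME'].min())    # earliest recorded observation
print(df['ISO_TIME'].max())    # most recent recorded observation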
Which five storms had the highest measured windspeeds?
On the face of it, we could use the following query:
(
df
.nlargest(columns='WMO_WIND', n=5)
)

This asks the data frame to return the 5 rows in which WMO_WIND has the highest values. And in many ways, that's not a bad way to go about things.
But it doesn't quite do the trick, because it might return the same storm more than once. Instead, I used groupby to get the highest windspeed for each of the storms in the database. That returned a series in which the index contained storm names, and the values contained the maximum windspeed for each storm:
(
df
.groupby('NAME')['WMO_WIND'].max()
.nlargest(5)
)

This also seemed pretty reasonable – but then I realized (as we'll see in the next question) that storm names are sometimes reused, and that this might be conflating two or more storms with the same name. I thus used groupby on two columns, SEASON and NAME:
(
df
.groupby(['SEASON', 'NAME'])['WMO_WIND'].max()
.nlargest(5)
)

The result was a series with a two-level multi-index:
SEASON NAME WMO_WIND
1980 ALLEN 165
1935 UNNAMED 160
1988 GILBERT 160
2005 WILMA 160
2019 DORIAN 160
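If you'd rather have a regular data frame than a multi-indexed series, tacking reset_index onto the end of the query flattens it (a small convenience; the analysis doesn't require it):

(
    df
    .groupby(['SEASON', 'NAME'])['WMO_WIND'].max()
    .nlargest(5)
    .reset_index()     # turn the (SEASON, NAME) index into ordinary columns
)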
I should note that the wind speeds reported by the WMO are in knots; Allen's 165 knots comes to just over 305 km/h, or about 190 mph.
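Since the conversion factors are easy to get wrong, here's the arithmetic spelled out (1 knot = 1.852 km/h exactly, and 1 knot ≈ 1.15078 mph):

# convert Hurricane Allen's peak windspeed from knots to km/h and mph
knots = 165
print(f'{knots * 1.852:.1f} km/h')      # 305.6 km/h
print(f'{knots * 1.15078:.1f} mph')     # 189.9 mph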