BW #68: Dangerously hot weather (solution)
I’m writing this on my return flight from New York to Israel, finishing up 2.5 weeks in the United States — attending PyCon, meeting with clients, and even getting a chance to see some family. The weather has been pretty warm. The weather awaiting me back in Israel will be even warmer. And of course, we’re a few weeks before the summer solstice, so we’re soon going to learn (or remember) how hot things can get.
Hot weather isn’t just annoying; it can be downright dangerous. A recent New York Times article (https://www.nytimes.com/2024/05/25/climate/extreme-heat-biden-workplace.html?unlocked_article_code=1.vk0.ekJ8.Kg0h3dcMGz9d&smid=url-share) discussed the challenges that workers face as weather becomes increasingly hot.
The article cited a number of sources, one of which was from the National Weather Service (https://www.weather.gov/), which publishes statistics about various weather-related disasters and hazards:
https://www.weather.gov/hazstat/
Data and seven questions
This week, I asked you to download a document from the National Weather Service summarizing the last 80 years of weather-related fatalities and damage in the United States:
https://www.weather.gov/media/hazstat/80years_2023.pdf
As you can see from the file extension, it's a PDF file. You'll want to use the Tabula-py (https://tabula-py.readthedocs.io/en/latest/) package to read this into Pandas.
This week, I asked you to answer seven tasks and questions. My solutions are below; a link to the Jupyter notebook I used to solve the problems myself follows the solutions.
The questions:
Download the PDF file describing extreme weather incidents. Read the table into a data frame. We don't need the final "All Wx Fatalities" column. We also don't need the final three rows with summaries and totals. Ensure that both header rows are used for the header names. How much memory is being used? What dtypes are being used?
I started off, as usual, by importing Pandas:
import pandas as pd
But because I want to grab data from a PDF file, I’ll use the “tabula-py” package. Specifically, I’ll use its “read_pdf” method. I’ll also define a “filename” variable that refers to the PDF file we downloaded:
from tabula import read_pdf
filename = '/Users/reuven/Downloads/80years_2023.pdf'
With these in place, we can now use “read_pdf” to create a data frame:
df = (
    read_pdf(filename,
             pages=1,
             multiple_tables=False,
             pandas_options={'header': [0, 1]})
    [0]
    .drop('All Wx', level=0, axis='columns')
    .iloc[:-3]
)
This works, but getting there meant solving a number of problems with my first, naive call to “read_pdf”:
- We get a warning that we didn’t specify the page we want.
- We didn’t explicitly indicate that there will only be one table on the page.
- The first row is taken to be the header, but it’s actually a two-line header.
- We don’t need the “All Wx” column.
- We don’t need the final three lines.
How can I take care of these issues? Let’s handle them, one at a time. Thanks to method chaining, we can just add to our query, little by little.
First, let’s tell “read_pdf” that we only want to look at a single page, that there’s only one table, and that the header should come from both lines 0 and 1 of the table. We can do this by passing several keyword arguments to “read_pdf”: “pages=1” tells it to only look at page 1, and “multiple_tables=False” tells it that there is only one table.
But how can we tell “read_pdf” that we want to treat lines 0 and 1 as headers? That’s the sort of thing we would normally pass to “read_csv”, a Pandas method that we use all of the time. Does “read_pdf” take the same argument?
Not exactly: It doesn’t directly let us specify the header lines. But it does let us pass a keyword argument, “pandas_options”, whose value is a dict. Each key-value pair in that dict represents an option that we want to pass to “read_csv”, which is apparently invoked behind the scenes by “read_pdf”. So we can call:
df = (
    read_pdf(filename,
             pages=1,
             multiple_tables=False,
             pandas_options={'header': [0, 1]})
)
The result of calling “read_pdf” in this way isn’t a data frame. Rather, it’s a list of data frames. We only want one, and it’s the first one, so we’ll use “[0]” to retrieve the first element of that list:
df = (
    read_pdf(filename,
             pages=1,
             multiple_tables=False,
             pandas_options={'header': [0, 1]})
    [0]
)
Now let’s remove the “All Wx” column. It’s actually called “All Wx Fatalities”, but because we asked for two rows in our header, we now have a two-level multi-index. (We’ll fix that soon.) For now, I can invoke “drop”, indicating the name of the column I want to drop, as well as its level (0, indicating the outer part) and the axis (“columns”).
Remember that “drop” doesn’t modify the existing data frame, but rather returns a new one that lacks the dropped column(s):
df = (
    read_pdf(filename,
             pages=1,
             multiple_tables=False,
             pandas_options={'header': [0, 1]})
    [0]
    .drop('All Wx', level=0, axis='columns')
)
The only remaining issue, for now, is that the final three lines of the data frame contain totals and summaries that we don’t want to keep around. We can remove them by using “.iloc” to return all but the final three rows:
df = (
    read_pdf(filename,
             pages=1,
             multiple_tables=False,
             pandas_options={'header': [0, 1]})
    [0]
    .drop('All Wx', level=0, axis='columns')
    .iloc[:-3]
    .dropna(thresh=4)
)
Notice that I also invoked “dropna” to remove any row containing NaN values. That’s because I found that a row containing nothing but NaNs somehow got into the data frame when I imported it. By default, “dropna” removes any row containing even one NaN value. That’s too strict for our purposes, so I used the “thresh” keyword argument to say that as long as we have at least 4 good values, we should keep the row.
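To see how “thresh” behaves, here’s a tiny illustration using a toy data frame (not the weather data):

import pandas as pd
import numpy as np

toy_df = pd.DataFrame({'a': [1, np.nan, np.nan],
                       'b': [2, np.nan, 5],
                       'c': [3, np.nan, 6],
                       'd': [4, np.nan, np.nan]})

# Keep only rows with at least 2 non-NA values:
# row 0 (4 good values) and row 2 (2 good values) survive,
# while the all-NaN row 1 is dropped.
toy_df.dropna(thresh=2)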
The end result is a data frame containing 84 rows and 11 columns. That’s fine, except that we have a very strange multi-index of column names; they are simply spread across two lines.
How can we make the column names a bit more normal?
My first thought was just to assign a list of strings back to “df.columns”. That would work, but it felt like a bit of a cop-out. Surely there must be some way to combine the two parts of the index into one.
If you look at a multi-index (i.e., just retrieve “df.columns”), you’ll see that it is basically a list of tuples. Yes, it’s a special Pandas object, and there are a number of ways to access it — but it’s easier in some ways to think of it as a list of tuples.
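To see what I mean, here’s a small two-level column index modeled loosely on our data (the exact names here are illustrative):

import pandas as pd

columns = pd.MultiIndex.from_tuples([('Year', 'Unnamed: 0_level_1'),
                                     ('Lightning', 'Fatalities'),
                                     ('Tornado', 'Fatalities')])

# Iterating over the multi-index gives us two-element tuples
list(columns)   # [('Year', 'Unnamed: 0_level_1'), ('Lightning', 'Fatalities'), ('Tornado', 'Fatalities')]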
Since I know that each tuple contains two strings (from the outer and inner levels), maybe I can just iterate over each tuple, join the two parts together, and assign that to “df.columns”? Here’s how that would look:
df.columns = [' '.join(one_t)
              for one_t in df.columns]
This actually works! However, because the first column had a value for the outer layer and no value for the inner layer, Pandas provided one, “Unnamed: 0_level_1”. Which means that after running this comprehension, the first column has a name of “Year Unnamed: 0_level_1”. Not wrong, but not exactly what I would want.
I thus want to join the two parts of the tuple together unless the second part begins with “Unnamed”. In such a case, I’ll just take the first part. That sounds like an “if” statement in Python, but we can’t really put “if” statements, or any other statements, in list comprehensions. We can only include expressions.
One option would be to write a function, and then have that function include more complex logic, including “if” statements. The function could then return a string based on one or both tuple elements.
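Such a function might look like this (a sketch with a hypothetical name, not code from my actual solution):

def combine_levels(one_t):
    # If Pandas auto-generated the inner name, keep only the outer name
    if one_t[1].startswith('Unnamed'):
        return one_t[0]
    return ' '.join(one_t)

# df.columns = [combine_levels(one_t) for one_t in df.columns]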
But this is one of those times when even I think it’s appropriate to use the expression version of “if-else” in Python. I generally avoid it, because I find it hard to read and understand. But it’s meant for precisely these situations. Its syntax is:
VALUE_IF_TRUE if CONDITION else VALUE_IF_FALSE
This is often compared with the “ternary” operator in C-like languages, aka ?: . The “ternary” name comes from the fact that it takes three arguments, as opposed to unary and binary operators, and it doesn’t really describe what the operator does. It’s not quite the same as Python’s version, but it’s close. I even did a YouTube video about it.
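For example, a trivial (made-up) use of this syntax:

n = 7
parity = 'even' if n % 2 == 0 else 'odd'   # parity is now 'odd'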
Here, we can use it in this way:
df.columns = [one_t[0]
              if one_t[1].startswith('Unnamed')
              else ' '.join(one_t)
              for one_t in df.columns]
Having done this, I can now change the “Year” column to be an integer. (It was an “object” column before, because it contained a NaN value. Now that we’ve removed those and normalized the names, we can more easily change it.)
df['Year'] = df['Year'].astype('int16')
After running the above code, I now have a data frame containing the data from the PDF file. What dtypes does it use? Let’s check, by grabbing the value of “dtypes” from the data frame:
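df.dtypes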
Year int16
Lightning Fatalities float64
Tornado Fatalities float64
Flood Fatalities float64
Hurricane Fatalities float64
Heat Fatalities object
Cold Fatalities float64
Winter Fatalities float64
Rip Curr. Fatalities float64
Wind Fatalities float64
All Hazard Damages (M) object
dtype: object
How much memory does it use? Let’s check, using “memory_usage”. We specify “deep=True” to make sure that we count the size of each string, which is stored as a Python object outside of the data frame itself:
df.memory_usage(deep=True).sum()
I get the following back:
13293
That’s 13,293 bytes. Not a crazy-huge value, but something to compare with.
Set all columns to be of type `pd.Int16Dtype` except for where `pd.Float64Dtype` or `pd.StringDtype` would be more appropriate. Remove any rows containing only NA values. Set "Year" to be the index. How much memory (if any) do you save by using these dtypes?
People are often surprised to find that NaN is a float value. This is annoying and weird if we have NaN in an integer column, because the entire column must then have a float dtype. And if a column of strings contains NaN, we have to give it a dtype of “object”, in order to accommodate the different object types. That’s quite generic, and to some degree removes the usefulness of types in enforcing values and guaranteeing that methods will work.
We’ve talked about PyArrow in the past; one of its advantages over using NumPy is that it supports “nullable types.” In other words, you can have a string column that contains the equivalent of NaN, and it’ll stay a string column. You can have an integer column that contains the equivalent of NaN, and it’ll stay an int column. I say “the equivalent of NaN” because we don’t use NaN in this situation. Rather, we use “pd.NA”, a distinct Pandas-specific value that has the same idea as NaN, but works with these nullable types. NA is the reason why we have methods like “isna” as well as “isnan”, and keyword arguments like “skipna”.
Even if you’re not using PyArrow, Pandas provides us with built-in nullable types. These are known as “extension types,” and they have names like “pd.Int64Dtype”. The difference between extension types and regular Pandas types is that these use pd.NA rather than np.nan. In other words, they are nullable.
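Here’s a quick illustration of the difference, using toy values rather than our weather data:

import pandas as pd
import numpy as np

# With NumPy-backed dtypes, a missing value forces the column to float64
pd.Series([1, 2, np.nan]).dtype                 # dtype('float64')

# With the nullable extension type, the column stays an integer column,
# and the missing value is pd.NA rather than np.nan
pd.Series([1, 2, pd.NA], dtype='Int16').dtype   # Int16Dtype()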
I asked you to change the dtypes for all columns to their equivalent extension types, and then to compare memory usage.
We can change the type of a column, as we saw before, with “astype”. That’s the typical way we do it; we take a column, run “astype” on it, indicate what dtype we want to get back, and then assign it back to the original column.
But there’s also a data frame version of “astype”, allowing us to convert many columns to new types. In this case, we indicate the new, destination types in a dict, with the column names as the keys and the new types as the values.
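For example, here’s a minimal sketch of the data-frame version of “astype”, using a toy data frame rather than our weather data:

import pandas as pd

toy_df = pd.DataFrame({'a': [1.0, 2.0],
                       'b': ['x', 'y']})

# The dict maps each column name to its new dtype
toy_df = toy_df.astype({'a': pd.Int16Dtype(),
                        'b': pd.StringDtype()})

toy_df.dtypes   # 'a' is now Int16, 'b' is now string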
I decided that it would make sense to create such a dict using a dict comprehension. After all, we can iterate over each column name, and then provide an extension type as the value. But wait — what extension type should we provide? In most cases, as I indicated in the question, we’ll use “pd.Int16Dtype”. That should work for all columns except for one, “All Hazard Damages (M)”, which should use “pd.Float64Dtype”.
I decided that this would be a good place to use a second dictionary. (Yes, that means two dicts: the one we’ll eventually pass to “astype”, and this one, which we’ll use to build it.) This dict would indicate which columns should use Int16Dtype and which should use Float64Dtype. But creating it by hand, with an entry for every column name, seems like a lot of work.
I thus decided to use Python’s “defaultdict”. The default value our dict returns is an instance of pd.Int16Dtype. But for any key where we want something else, we can assign it explicitly. Here’s how I created it:
from collections import defaultdict
conversion_dtypes = defaultdict(pd.Int16Dtype)
conversion_dtypes['All Hazard Damages (M)'] = pd.Float64Dtype()
The above means that I can now retrieve any key I want from “conversion_dtypes”. If I retrieve “All Hazard Damages (M)”, then I’ll get an instance of pd.Float64Dtype. Anything else? I’ll get an instance of pd.Int16Dtype. And yes, you need to invoke the extension type class in order to get an instance of it. It feels weird at first, but don’t worry — it feels weird later on, too.
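For example, assuming the “conversion_dtypes” defaultdict defined above:

conversion_dtypes['All Hazard Damages (M)']   # Float64Dtype()
conversion_dtypes['Tornado Fatalities']       # Int16Dtype(), the default (the key is also added as a side effect)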
Two of the columns contain commas and dollar signs. I used regular expressions and “str.replace” to handle them:
df['Heat Fatalities'] = (
    df['Heat Fatalities']
    .str.replace(r'\D', '', regex=True)
)

df['All Hazard Damages (M)'] = (
    df['All Hazard Damages (M)']
    .str.replace(r'[^\d.]', '', regex=True)
)
In the first regular expression, I asked to replace \D (i.e., any non-digit character) with the empty string. That took care of commas, allowing us to have integers in the column.
At first, I did the same thing for the second regular expression. But then I realized that I had removed the decimal points, which multiplied all of the values by 100. Whoops! I thus constructed a new regular expression, one using a “negative character class” — meaning that we’re looking for any character that isn’t a digit (\d) or a decimal point (.). Any non-digit, non-decimal point is replaced with the empty string.
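A quick way to check both patterns, using Python’s “re” module on some made-up sample values:

import re

re.sub(r'\D', '', '1,234')          # '1234'   (the comma is removed)
re.sub(r'[^\d.]', '', '$1,234.5')   # '1234.5' (the decimal point survives)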
With this in place, we can create our “conversions” dict, using a dict comprehension:
conversions = {column_name: conversion_dtypes[column_name]
               for column_name in df.columns}
The above dict has column names as keys and the extension types as values.
I had already removed the NaN-only row in the previous solution, so all that remained was to perform the conversions and set “Year” to be our index:
df = (
    df
    .astype(conversions)
    .set_index('Year')
)
Sure enough, we can see that I’m using extension types:
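df.dtypes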
Lightning Fatalities Int16
Tornado Fatalities Int16
Flood Fatalities Int16
Hurricane Fatalities Int16
Heat Fatalities Int16
Cold Fatalities Int16
Winter Fatalities Int16
Rip Curr. Fatalities Int16
Wind Fatalities Int16
All Hazard Damages (M) Float64
dtype: object
It’s sometimes hard to see, but notice that each extension type name is capitalized, as is traditional with Python class names.
Did this save us any memory? I again run
df.memory_usage(deep=True).sum()
And I get
3276
In other words, we’re using about one quarter of the memory that we needed with the NumPy-backed equivalents. And we can now use pd.NA, which is more elegant.