BW #68: Dangerously hot weather

Summer is starting, at least in the northern hemisphere. I keep hearing people say that it's surprisingly hot out. That's certainly true in Israel, where we expect hot weather during the summer (and often during the spring and fall, too), but the rest of the world is also experiencing unusually hot summers. Just this week, the New York Times reported that heat-related deaths are a growing problem -- for workers, for employers who want to keep them safe on the job, and for the government, which wants to ensure a safe workplace. (You can read the article here: https://www.nytimes.com/2024/05/25/climate/extreme-heat-biden-workplace.html?unlocked_article_code=1.vk0.ekJ8.Kg0h3dcMGz9d&smid=url-share )

The article cited a number of sources, one of which was the National Weather Service (https://www.weather.gov/), which publishes statistics about various weather-related disasters and hazards:

https://www.weather.gov/hazstat/

As we start to enjoy (or not!) warmer weather, I thought it might be interesting to dig into this data, to see if heat-related fatalities are really increasing -- and if so, by how much.

Data and seven questions

On the National Weather Service's hazards list, there is a link to download the 80-year summary of all weather-related fatalities in the United States:

https://www.weather.gov/media/hazstat/80years_2023.pdf

As you can see from the file extension, it's a PDF file. You'll want to use the Tabula-py (https://tabula-py.readthedocs.io/en/latest/) package to read this into Pandas.
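If you haven't used tabula-py before, here is a minimal sketch of how reading a PDF with it might look. The function name `read_pdf_tables` and the constant `PDF_URL` are my own, made up for illustration. I'm also assuming that you have a Java runtime installed (tabula-py needs one) and that asking for `pages="all"` is a sensible starting point; you will likely need to experiment with other options to get a clean table.

```python
# A sketch, not a full solution: it only shows the basic tabula-py call.
PDF_URL = "https://www.weather.gov/media/hazstat/80years_2023.pdf"


def read_pdf_tables(source=PDF_URL):
    """Return a list of data frames, one per table that tabula-py detects."""
    # Imported inside the function, because tabula-py relies on a Java
    # runtime and we only want to require it when the PDF is actually read.
    import tabula

    # read_pdf accepts a local path or a URL; by default it returns a list
    # of data frames, because a PDF can contain more than one table.
    return tabula.read_pdf(source, pages="all")
```

Calling `read_pdf_tables()` gives you a list, so you'll need to pick out the data frame you want before doing anything else with it.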

This week, I have seven tasks and questions for you to answer; I'll be back tomorrow with my solutions and explanations. The learning goals for this week include working with PDF files, nullable dtypes, plotting, and correlations.
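If nullable dtypes are new to you, here is a tiny sketch on made-up numbers (not the weather data) showing the basic idea: a missing value normally forces an integer column to become `float64`, whereas a nullable dtype such as `pd.Int16Dtype` keeps the integers and marks the gap as NA.

```python
import pandas as pd

# Made-up values, purely for illustration; they are not taken from the PDF.
s = pd.Series([12, None, 7])
print(s.dtype)  # the missing value forces a float dtype

# Convert to a nullable 16-bit integer dtype; the gap becomes pd.NA.
nullable = s.astype(pd.Int16Dtype())
print(nullable.dtype)
print(nullable.isna().sum())  # number of missing values

# Compare the memory used by the values themselves (ignoring the index).
print(s.memory_usage(index=False), nullable.memory_usage(index=False))
```

On a series this small, the savings are trivial, but the same comparison works on an entire data frame.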

The questions:

  1. Download the PDF file describing extreme weather incidents. Read the table into a data frame. We don't need the final "All Wx Fatalities" column. We also don't need the final three rows with summaries and totals. Ensure that both header rows are used for the header names. How much memory is being used? What dtypes are being used?
  2. Set all columns to be of type `pd.Int16Dtype`, except where `pd.Float64Dtype` or `pd.StringDtype` would be more appropriate. Remove any rows containing only NA values. Set "Year" to be the index. How much memory (if any) do you save by using these dtypes?