BW #128: Extreme heat (solution)

Get better at: Using PyArrow, pivot tables, plotting, optimizing query speed, datetime, multi-indexes, and using xarray

It's hot outside! As I write this, it's 34 degrees Celsius, and today is forecast to be the coolest in the coming week. But I'm not alone; people I've spoken with in the US, Europe, and China have all complained about extremely hot temperatures during the summer.

Two recent New York Times articles, one about this summer's temperatures (https://www.nytimes.com/2025/07/17/climate/climate-forward-extreme-heat.html?unlocked_article_code=1.Yk8.s6B8.zsHwJuy6CYBf&smid=url-share) and the other about the effect that climate change is having on vacations (https://www.nytimes.com/2025/07/14/world/europe/spain-italy-greece-heat.html?unlocked_article_code=1.Yk8.ODAn.Zk7NaeV2a2OO&smid=url-share), show that everyone is affected by the rise in global temperatures.

This week, we'll thus look at global temperatures, exploring their rise over the last several years. But the climate questions actually take a back seat to lots of other topics this week, from xarray to PyArrow.

Data and six questions

This week's data comes from the Copernicus Climate Change Service (C3S, https://climate.copernicus.eu), which offers scientific data about weather and climate. Our data comes from their "climate data store," for which you need to register (for free) to download data.

The data comes in GRIB format, which we'll read using xarray – a format and a library that I had never used before. I hope that you'll enjoy the challenge of learning about them as much as I did.

This week's learning goals include working with xarray, optimizing memory, multi-indexes, PyArrow, datetime data, pivot tables, and plotting.

Paid subscribers, including members of my LernerPython+data platform, can download the data file below. You can also download my Jupyter notebook, or get a one-click version of the notebook that runs in Google Colab.

Meanwhile, here are my six tasks and questions:

From the climate data store (https://cds.climate.copernicus.eu/datasets/reanalysis-era5-single-levels?tab=download), download data using either their "download" button or the API, with their provided Python code. (To use the API, you'll need to get a free key from their site, and configure your computer to use it.) Load the downloaded data into an xarray data set, and then turn the "an" part of that data set into a Pandas data frame. How big is this data set? How much memory does it use?

Let's start with the easy stuff, namely importing Pandas:

import pandas as pd

When I first found the climate data, I assumed that it would be distributed in either CSV or Excel format. We've dealt with other formats here at Bamboo Weekly in the past, including fixed-width fields and even SAS files, so I figured it would be straightforward.

But then I found that their data is distributed in GRIB format (https://en.wikipedia.org/wiki/GRIB). I hadn't ever heard of this format before, but it turns out that it's quite common and popular in the weather community.

That's fine, I figured – Pandas knows how to import lots of kinds of files. I'm sure that there's a way to read GRIB files directly into Pandas.

Maybe there is, but I didn't find one! Instead, the standard path seems to be reading a GRIB file into xarray, and then converting the xarray data set into a Pandas data frame. (There are other ways, as well – one being to convert the GRIB data into CSV, and then read the resulting file into Pandas.)

Wait, what is xarray? It's a data-manipulation library, available from PyPI and documented at https://docs.xarray.dev/en/stable/index.html . From what I've learned, xarray is similar to Pandas, in that it uses NumPy for efficient storage of data, adding row and/or column labels. Unlike Pandas, but like NumPy, xarray gives you access to n-dimensional data. So if you like the labels in Pandas but need more than the two dimensions of a Pandas data frame, xarray might provide the mix that you need.
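
To make that concrete, here's a tiny, made-up example (none of these numbers come from the climate data) showing how xarray wraps a three-dimensional NumPy array with labeled dimensions and coordinates:

import numpy as np
import pandas as pd
import xarray as xr

# Fake 2m temperatures over (time, latitude, longitude) -- three dimensions,
# which a single Pandas data frame can't hold without flattening into a multi-index
temps = xr.DataArray(
    np.random.default_rng(0).normal(290, 5, size=(2, 3, 4)),
    dims=["time", "latitude", "longitude"],
    coords={
        "time": pd.date_range("2025-07-01", periods=2),
        "latitude": [10.0, 10.25, 10.5],
        "longitude": [35.0, 35.25, 35.5, 35.75],
    },
    name="t2m",
)

# Label-based selection, much like .loc in Pandas, but across three dimensions
print(temps.sel(latitude=10.25, longitude=35.5))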

In order to read our GRIB data into Pandas, we'll thus need to perform two steps:

  1. Read the GRIB data into xarray
  2. Export from xarray into a Pandas data frame

I'll soon give more details about how to do all of that. But I first had to download the data. The climate store lets you choose from a number of different options for which measurements and time segments you want. There is a maximum amount of data that you can request in a single download, and I quickly discovered that I had gone over the limit. I thus decided to pare down my request, to stay within the limits, keeping re-analysis data for both temperature and dewpoint over the last 20 years, for all months of the year, all days of the month, and three times per day (midnight, 8 a.m., and 4 p.m.).

This was within the limits of what I could download, so I clicked on "submit" and waited. And waited. And waited.

More than three hours later, I decided that perhaps I had asked for so much data that it was going to take a very long time to retrieve. And indeed, I found a note on the climate store page saying that smaller queries get higher priority.

I thus decided to pare things down even further, getting only data from 2020 until today, and only the 1st, 15th, and 28th day of each month. This still took a little while to download, and resulted in a 2.7 GB (!) file.
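
By the way, if you'd rather use the API than the Web form, the request looks roughly like the sketch below, using the cdsapi package. I used the Web form myself, so treat the request keys and values here as placeholders based on my pared-down parameters – the download page generates the exact request dictionary for you, and you'll need your free API key configured on your machine:

import cdsapi

client = cdsapi.Client()  # reads your CDS API key from its config file

# The dataset name is real; the request keys/values below are illustrative
# placeholders -- copy the exact dictionary that the CDS download page shows you.
request = {
    "product_type": "reanalysis",
    "variable": ["2m_temperature", "2m_dewpoint_temperature"],
    "year": [str(y) for y in range(2020, 2026)],
    "month": [f"{m:02d}" for m in range(1, 13)],
    "day": ["01", "15", "28"],
    "time": ["00:00", "08:00", "16:00"],
}

client.retrieve("reanalysis-era5-single-levels", request, "era5_subset.grib")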

With the file in hand, what was my next step?

I'll be honest: I started to read a bit about xarray, and between reading the documentation and doing a whole lot of investigation with ChatGPT, I managed to find a way to do what I needed.

Wait, I use ChatGPT? Isn't that cheating? No, I don't think so at all. It's faster than reading the documentation and more specific than Stack Overflow. However, as we all know, chatbots are often wrong. And indeed, it was only because I argued with ChatGPT, asked it tons of questions, and experimented quite a bit that I was able to get everything to work. I'm a big believer in using AI chatbots to help you do your work faster, and to learn new topics. You just always need a very healthy dose of skepticism. I talked about this at PyCon US (https://speakerdeck.com/reuven/edu-summit-2025-dot-key) and even uploaded a YouTube video on the subject a few weeks ago (https://www.youtube.com/watch?v=rEddzOVCAEM).

I started by importing xarray as xr, its standard abbreviation. I then invoked open_dataset, passing the name of the file I wanted to read. I learned that whereas Pandas has different functions for different file types (e.g., pd.read_csv and pd.read_excel), xarray has a single open_dataset function that takes an engine keyword argument, indicating what kind of file it's reading.

I started with this call:

import xarray as xr
filename = '/Users/reuven/Downloads/8ebcbe0b209b8ada6865b8ece2f47f38.grib'


ds = xr.open_dataset(
    filename,
    engine="cfgrib"
)

This gave me an error:

DatasetBuildError: multiple values for key 'dataType'

DatasetBuildError: multiple values for unique key, try re-open the file with one of:
    filter_by_keys={'dataType': 'an'}
    filter_by_keys={'dataType': 'em'}

There was a lot of other output in the error message between the first line and the final section, none of which really helped.

The problem, it would seem, is that GRIB-formatted data can include a number of different values for what xarray requires to be uniquely tagged. Because I had asked to import both an and em data, the GRIB file contained both – and xarray couldn't quite handle that.

The solution was to pass an additional keyword argument, backend_kwargs, which took a Python dict as a value. That dict is passed along to the GRIB-reading engine that feeds its output into xarray. And as the exception's message indicated, I wanted to choose one of the dataType key values.

I thus tried again:

ds = xr.open_dataset(
    filename,
    engine="cfgrib",
    backend_kwargs={"filter_by_keys": {"dataType": "an"}}
)

This worked, but gave me a warning (not an exception):

/Users/reuven/.pyenv/versions/3.13.5/lib/python3.13/site-packages/cfgrib/xarray_plugin.py:131: FutureWarning: In a future version, xarray will not decode timedelta values based on the presence of a timedelta-like units attribute by default. Instead it will rely on the presence of a timedelta64 dtype attribute, which is now xarray's default way of encoding timedelta64 values. To continue decoding timedeltas based on the presence of a timedelta-like units attribute, users will need to explicitly opt-in by passing True or CFTimedeltaCoder(decode_via_units=True) to decode_timedelta. To silence this warning, set decode_timedelta to True, False, or a 'CFTimedeltaCoder' instance.
  vars, attrs, coord_names = xr.conventions.decode_cf_variables(

Basically, a lot of weather data uses not just datetime values (which are fixed) but also timedelta values (which are relative to a datetime value). And it seems that xarray will be tightening things up in the future, and is thus strongly suggesting that we pass decode_timedelta=True in order to avoid ambiguity or errors.

My final version of the query was thus:

ds = xr.open_dataset(
    filename,
    engine="cfgrib",
    backend_kwargs={"filter_by_keys": {"dataType": "an"}},
    decode_timedelta=True
)

This worked, and actually worked very quickly. Suspiciously quickly, in my mind – my computer might be fast and have a lot of RAM, but it couldn't possibly have read the entire 2.7 GB file into a data structure in less than one second, right?

And indeed, my suspicions were accurate: When you ask xarray to open a data set, it doesn't load the data right away. This is known as "lazy loading," and it delays the actual loading of data into memory until we absolutely need it.
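
Here's a rough sketch of how you can see the lazy loading for yourself, timing the open against an explicit load. The timings will obviously vary by machine, and loading t2m takes a lot of RAM; this is just to show where the real work happens:

import time
import xarray as xr

start = time.perf_counter()
ds = xr.open_dataset(
    filename,
    engine="cfgrib",
    backend_kwargs={"filter_by_keys": {"dataType": "an"}},
    decode_timedelta=True
)
print(f'open_dataset: {time.perf_counter() - start:.2f} seconds')  # nearly instant

start = time.perf_counter()
ds['t2m'].load()  # force the lazily loaded values into memory
print(f'load: {time.perf_counter() - start:.2f} seconds')  # this is where the GRIB file is actually read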

Just to double check, I asked Python what ds was:

>>> type(ds)
xarray.core.dataset.Dataset

So we have a special kind of data structure, an xarray data set. And just as Pandas data frames have a large number of to_* methods, this data set has a to_dataframe method that returns a Pandas data frame. However, because I was only interested in the 2-meter temperature (t2m), I used that as the key into ds, getting a data frame for just that variable:

df = ds["t2m"].to_dataframe()

This took a while — after all, this is when the data was loaded from the 2.7 GB GRIB file into xarray, then processed, then output into a Pandas data frame. I timed the process on my computer, and it took about 3.5 minutes.

But in the end, I had a data frame! How big was it? Usually, the RAM used by a Pandas data frame is 3-4 times the size of the equivalent CSV file. But GRIB is highly compressed, so I expected the ratio here to be even larger.

I ran the memory_usage method on the data frame, passing deep=True in case there were Python objects (rather than NumPy values) whose memory needed to be counted. memory_usage returns a series with the memory usage of each column; I invoked sum on those values to get the total usage:

>>> df.memory_usage(deep=True).sum()
np.int64(26163736400)

In other words, the resulting data frame was more than 26 GB in size. That's... not small! I then asked for its shape:

>>> df.shape
(622944000, 5)

In other words, it's 622 million rows and five columns.

How much memory do you save by removing the number, step, surface, and valid_time columns, and by switching latitude and longitude to float32? Also round the t2m column to 1 decimal place.

One of the issues with Pandas is that all of the data needs to fit into memory. For small data sets, that's not a big deal. But when things get large, it can either limit how much data you can analyze or just slow things down.

This raises the question of whether we can pare down the size of this data set – and how much that'll really save us. For starters, let's look at the dtypes of the columns, via df.dtypes:

number                  int64
step          timedelta64[ns]
surface               float64
valid_time     datetime64[ns]
t2m                   float32
dtype: object

These dtypes were determined by xarray, which in turn got them from the GRIB file. But four of them – number, step, surface, and valid_time – are either irrelevant or empty, which means that we can remove those columns without losing anything. Let's do that, and see how much memory we save:

(
    df
    .drop(columns=['number', 'step', 'surface', 'valid_time'])
    .memory_usage(deep=True)
    .sum()
)

The result is np.int64(6229528400), or about 6.2 GB. In other words, dropping those columns alone cut the size of the data frame by about three quarters. I'll call that a win!

But we can go further than this: We can change the latitude and longitude columns to be float32, rather than float64. That'll be good enough for our purposes, and will literally reduce their memory usage by half.

However, both of those columns are in the index. We'll thus need to invoke reset_index to move them into the regular data frame, use assign along with astype to change them into different types, and then turn them back into the index:

(
    df
    .reset_index()
    .drop(columns=['number', 'step', 'surface', 'valid_time'])
    .assign(latitude = lambda df_: df_['latitude'].astype('float32'),
            longitude = lambda df_: df_['longitude'].astype('float32'))
    .set_index(['time', 'latitude', 'longitude'])
    .memory_usage(deep=True)
    .sum()    
)

The result? np.int64(6229507468), or 6,229,507,468 bytes. In other words, we did reduce the memory usage... but only by about 20 KB.

What happened here? How can it be that we didn't save more memory? After all, saving 4 bytes on each of two columns, across 622 million rows, should have freed up nearly 5 GB.

The problem is that we put the columns back into the multi-index – and as we'll see below, converting the values stored in a multi-index to float32 barely saves anything, removing the efficiency we had hoped for!

What if we just keep time in our index, and the other columns in the data frame itself?

(
    df
    .reset_index()
    .drop(columns=['number', 'step', 'surface', 'valid_time'])
    .assign(latitude = lambda df_: df_['latitude'].astype('float32'),
            longitude = lambda df_: df_['longitude'].astype('float32'))
    .set_index('time')
    .memory_usage(deep=True)
    .sum()
)

The result is np.int64(12458880000), or about 12.5 GB. How can this be? How can cutting the size of these columns in half end up doubling the memory usage?

I spent quite some time looking into this, because I was surprised, as well. And I learned quite a bit!

The short answer: in a multi-index, Pandas doesn't store each row's combination of values. Rather, it stores each level – the array of unique time, latitude, and longitude values – exactly once, along with compact integer codes indicating which level value belongs to each row. The per-row cost therefore comes from the integer codes, not from the floats themselves, which is why converting the levels from float64 to float32 saved only a few kilobytes. When we look at df.index and see the full multi-index printed out, that display is reconstructed from the levels and codes – it isn't what's actually stored.
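
Here's a small illustration, with made-up coordinates rather than our real data, of what a multi-index actually keeps around:

import pandas as pd

# 3 times x 3 latitudes x 2 longitudes = 18 rows in the index...
idx = pd.MultiIndex.from_product(
    [pd.date_range('2020-01-01', periods=3),
     [10.0, 10.25, 10.5],
     [35.0, 35.25]],
    names=['time', 'latitude', 'longitude']
)

print(len(idx))                                # 18 combinations
print([len(level) for level in idx.levels])    # ...but only 3 + 3 + 2 stored values
print(idx.codes)                               # plus small integer codes pointing into the levels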

Moving the longitude and latitude columns into the data frame itself did allow us to use float32. But it also meant explicitly storing a longitude and a latitude value for every one of the 622 million rows, which ironically increased the memory usage.

There is one more trick that we might be able to use here, namely categories. You can think of a category as an enum for our data frame: Pandas finds every unique value in a column, assigns each one a small integer code, and stores those codes instead of the original values. We typically use categories with strings, but if numeric values repeat often enough, maybe they'll help here, too?

(
    df
    .reset_index()
    .drop(columns=['number', 'step', 'surface', 'valid_time'])
    .assign(latitude = lambda df_: df_['latitude'].astype('category'),
            longitude = lambda df_: df_['longitude'].astype('category'))
    .set_index('time')
    .memory_usage(deep=True)
    .sum()
)

The result: 9,967,170,904 – or about 10 GB.

In other words, none of these dtype tricks reduces the memory usage of our data frame! We're better off leaving the longitude and latitude columns where they were – in the index, as 64-bit floats.

However, I did ask you to round the temperature to only one digit after the decimal point. That won't save us any memory, but it will make the values easier to read. We can do that with the round method. I then assigned the result back to df to make this easier to use and work with:

df = (
    df
    .drop(columns=['number', 'step', 'surface', 'valid_time'])
    .assign(t2m = lambda df_: df_['t2m'].round(1))
)