[Reminder: The 7th cohort of my Python Data Analysis Bootcamp (PythonDAB) will start on June 19th! I'm holding another webinar on Sunday with info about this 4-month intense-but-intimate mentoring program in Python, Git, and Pandas. Join me to learn more: https://us02web.zoom.us/webinar/register/WN_TgKu-kiwTVeylzMOQidTZA .]
Earlier this week, the World Bank released its latest report on "Global Economic Prospects" (https://www.worldbank.org/en/publication/global-economic-prospects). Its projections for economic growth over the coming year had to be revised dramatically since they first came out in January, in no small part because of changes in US trade policy.
Data and five questions
This week, we'll look at some of the data that the World Bank provided along with their written report, to better understand the current state of the world economy. Go to the GEP home page (https://www.worldbank.org/en/publication/global-economic-prospects), open the "more downloads" menu at the center of the page, and choose "GDP growth data." That'll download an Excel spreadsheet to your computer with the data we'll be examining.
Paid subscribers can download copies of the data files, as well as my notebook, from the end of this message.
Learning goals for this week include reading from Excel files, cleaning data, multi-indexes, window functions, and plotting.
Download the Excel spreadsheet with the World Bank's data, and read the first sheet into a Pandas data frame. We only want rows 5 through 34 from the Excel file. Columns A through D should together form a four-level multi-index. We do want all of the columns with projections, as well as the differences from earlier this year (columns M and N), to be included. In the columns we'll use for the index, replace each NaN with the previous non-NaN value in that column.
People sometimes ask me how I come up with ideas for Bamboo Weekly. This week's topic came from a posting from economist Justin Wolfers on Bluesky (https://bsky.app/profile/justinwolfers.bsky.social), who wrote about the World Bank's report. I quickly looked up the report, found that it came with data in Excel format, and excitedly decided to dig in.
But this week demonstrates that just because data is in an Excel spreadsheet doesn't mean that it's easy to parse or work with. The spreadsheet was clearly designed to be a report for human eyes, not input for Pandas. And so if you found that reading this week's data into Pandas was more challenging than usual... well, that's because it was.
That said, part of my goal with Bamboo Weekly is to expose you to enough real-world data sets, including poorly formatted ones, that you'll learn how to handle even the nastiest of files.
So, let's start by importing Pandas:
import pandas as pd
In an ideal world, we could just read the Excel file into Pandas with read_excel:
filename = 'data/GEP-June-2025-GDP-growth-data.xlsx'
df = (
pd
.read_excel(filename)
)
However, this falls short in a number of ways:
- The headers aren't on the first line of the document
- We don't want all of the columns
- We don't want all of the rows
I'm ignoring the fact that this Excel document contains a number of sheets; we're only interested in the first one, which means we can ignore the others.
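If you want to confirm which sheets a workbook contains before committing to the first one, pd.ExcelFile can list them. Here's a small sketch using a workbook built in memory – the sheet names are invented for the example, and writing .xlsx files requires the openpyxl package:

```python
import io

import pandas as pd

# Build a tiny two-sheet workbook in memory, standing in for the
# real GEP file (these sheet names are made up for the example).
buf = io.BytesIO()
with pd.ExcelWriter(buf) as writer:
    pd.DataFrame({"a": [1, 2]}).to_excel(writer, sheet_name="GDP growth")
    pd.DataFrame({"b": [3]}).to_excel(writer, sheet_name="Memorandum items")
buf.seek(0)

# ExcelFile lets us peek at the sheet names without loading any data.
xl = pd.ExcelFile(buf)
print(xl.sheet_names)    # ['GDP growth', 'Memorandum items']

# read_excel defaults to sheet_name=0 -- the first sheet -- which is
# why we can simply ignore the others.
df = pd.read_excel(xl, sheet_name=0, index_col=0)
```

Passing the ExcelFile object back to read_excel avoids parsing the workbook twice.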
We can indicate the line on which the headers are located with the header keyword argument. Remember that whereas Excel would say the headers are on line 4, we would call it line 3, since Python starts counting with 0.
I also passed nrows=30, so that we would only read 30 rows from the document. Yes, there is more data after those rows, but I decided that things were complex enough without getting the "memorandum items," as the spreadsheet describes them.
We can ask for a selected number of columns with the usecols keyword argument; I normally prefer to use the names, rather than the integers, but these names are odd, and they repeat – so I decided to invoke usecols with the column numbers (starting with 0), and then pass names, allowing me to rename the columns.
I didn't have a great naming convention for the first four columns, which will constitute our multi-index. So I went with "world", "major_group", "minor_group", and "country". Again, this demonstrates even further that this spreadsheet was optimized for human eyes, and not for programs to read.
I also gave names to the remaining columns I loaded. In theory, you can repeat column names in Pandas, just as an index can contain repeated values. But it's a bad idea, and Pandas tries hard to stop us from doing that. And besides, why invite trouble? I thus renamed the final two columns, which show the differences between the January forecast and the current forecast, to 2025fd and 2026fd, for "forecast differences."
Here's the query so far:
filename = 'data/GEP-June-2025-GDP-growth-data.xlsx'
df = (
pd
.read_excel(filename,
header=3, nrows=30,
usecols=[0,1,2,3,5,6,7,8,9,10,12,13],
names=['world', 'major_group', 'minor_group',
'country', '2022', '2023', '2024e',
'2025f', '2026f', '2027f', '2025fd',
'2026fd']
)
)
The above query does a decent job of grabbing only the rows we want, and only the columns we want.
In theory, we could then assign the index, either with index_col in read_excel or with set_index. But there is a bit of a problem here, in that the columns we want to use for the index contain a lot of NaN values. Those look nice on the page, in that they appear as whitespace – but they're not helpful.
For example, consider lines 7, 8, and 9 of the spreadsheet, with data about the US, euro area, and Japan. Those names are in column C; column B, which is effectively their heading, contains the text "Advanced economies" – but only once, with blank cells (i.e., NaN values) beneath it. Pandas can't exactly look up and to the left to find out which heading things are under, at least not easily.
I thought about using interpolate to fill in the NaN values with whatever non-NaN value came above them. But I got a warning from interpolate, reminding me that it's really only for numeric data. If I want to fill in string data, then I should use ffill or bfill, for "forward fill" and "backward fill." By running ffill on the four columns I want to use as an index, I can remove the NaN values and replace them with the text that came above. Great, right?
Well, mostly: The problem is that we cannot run ffill in place on only a subset of a data frame. So I decided to use assign to replace the world and major_group columns with the result of running ffill on each of them.
Why didn't I run ffill on the other two columns I want in my index? I actually did try it on minor_group, and ran into weird problems. And the countries don't need to be filled; that's the lowest and final part of the hierarchy:
filename = 'data/GEP-June-2025-GDP-growth-data.xlsx'
df = (
pd
.read_excel(filename,
header=3, nrows=30,
usecols=[0,1,2,3,5,6,7,8,9,10,12,13],
names=['world', 'major_group', 'minor_group',
'country', '2022', '2023', '2024e',
'2025f', '2026f', '2027f', '2025fd',
'2026fd']
)
.assign(world = lambda df_: df_['world'].ffill(),
major_group = lambda df_: df_['major_group'].ffill())
)
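In isolation, here's what ffill does to a heading-style column – a toy Series mimicking the shape of column B of the spreadsheet:

```python
import pandas as pd

# The heading appears once, and the rows beneath it are blank,
# just as in the spreadsheet's index columns.
major_group = pd.Series(
    ["Advanced economies", None, None,
     "Emerging market and developing economies", None]
)

# Forward-fill: each missing value takes the last non-missing
# value above it.
filled = major_group.ffill()
print(filled.tolist())
# ['Advanced economies', 'Advanced economies', 'Advanced economies',
#  'Emerging market and developing economies',
#  'Emerging market and developing economies']
```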
Finally, with all of this in place, I invoked set_index, putting my four-level multi-index into place:
filename = 'data/GEP-June-2025-GDP-growth-data.xlsx'
df = (
pd
.read_excel(filename,
header=3, nrows=30,
usecols=[0,1,2,3,5,6,7,8,9,10,12,13],
names=['world', 'major_group', 'minor_group',
'country', '2022', '2023', '2024e',
'2025f', '2026f', '2027f', '2025fd',
'2026fd']
)
.assign(world = lambda df_: df_['world'].ffill(),
major_group = lambda df_: df_['major_group'].ffill())
.set_index(['world', 'major_group', 'minor_group', 'country'])
)
Sure enough, it worked! I got a data frame with 30 rows and 8 columns. Moreover, all of the columns have a dtype of float64, so I know that the data arrived in numeric format.
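Those checks boil down to two attributes, shape and dtypes. Here's the idea on a small stand-in frame (the values and index entries are invented for illustration):

```python
import pandas as pd

# A toy stand-in for the loaded data frame: float columns under a
# four-level multi-index, with made-up numbers.
toy = pd.DataFrame(
    {"2025f": [1.4, 0.7], "2025fd": [-0.9, -0.5]},
    index=pd.MultiIndex.from_tuples(
        [("World", "Advanced economies", "United States", None),
         ("World", "Advanced economies", "Japan", None)],
        names=["world", "major_group", "minor_group", "country"],
    ),
)

print(toy.shape)                        # (2, 2) -- rows, columns
print((toy.dtypes == "float64").all())  # True: everything is numeric
```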
Column M, labeled "2025f", shows the percentage point difference from January 2025 predictions. Which five countries' predictions dropped by the greatest amount since the last forecast? What countries (if any) have improved their growth forecast since January?
Now that we have the data, we can perform some calculations. Column M in the original spreadsheet, which I named 2025fd, contains the change from the World Bank's January forecast. (I must admit that I'm not sure what the difference is between a prediction, a forecast, and an estimate, so I apologize to the economists and statisticians for my imprecision.) How can we find the five countries whose values dropped by the greatest amount?
One option is to use sort_values and then to retrieve the rows with the smallest values. But I often like to use nsmallest, a method that does the same thing, but more directly:
(
df['2025fd']
.nsmallest(10)
)
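The two approaches really are interchangeable; here's a quick check on a toy Series, with a mix of values taken from this week's data and one made-up entry ("Examplestan"):

```python
import pandas as pd

# Hypothetical forecast changes, keyed by country name.
changes = pd.Series(
    [-3.2, -1.3, -1.1, -0.9, 0.2],
    index=["Iran", "Mexico", "Thailand", "United States", "Examplestan"],
)

# sort_values + head and nsmallest give the same answer here;
# nsmallest just says it in one step.
via_sort = changes.sort_values().head(3)
via_nsmallest = changes.nsmallest(3)

assert via_sort.equals(via_nsmallest)
print(via_nsmallest.index.tolist())   # ['Iran', 'Mexico', 'Thailand']
```

(With tied values, the two can order the ties differently, so nsmallest is also the less ambiguous choice.)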
Now, the nsmallest call on df['2025fd'] returns the ten rows whose forecast growth was downgraded by the greatest degree – I asked for ten, rather than five, because some of the rows describe regions rather than individual countries. However, don't we want to know the countries' names?
Fortunately, those are in the index. And when we retrieve the ten smallest values, we'll get the associated index entries for each one:
world major_group minor_group country
World Emerging market and developing economies NaN Iran, Islamic Rep. 2 -3.2
Mexico -1.3
Thailand -1.1
South Africa -1.1
Advanced economies United States NaN -0.9
Emerging market and developing economies NaN Bangladesh 2 -0.8
Middle East and North Africa NaN -0.7
NaN Saudi Arabia -0.6
Advanced economies NaN NaN -0.5
Japan NaN -0.5
Name: 2025fd, dtype: float64
Does that look a bit jumbled to you? Yeah, it does – because with four (!) multi-index levels, it's hard to keep things readable. I decided to remove two of the index levels, so that the remaining ones would be readable, using reset_index. I not only specified the index levels with their names, but also passed drop=True, meaning that those index levels shouldn't be turned back into data-frame columns, but should rather be dropped entirely:
(
df['2025fd']
.nsmallest(10)
.reset_index(level=['world', 'major_group'], drop=True)
)
Here's what I get:
minor_group country
NaN Iran, Islamic Rep. 2 -3.2
Mexico -1.3
Thailand -1.1
South Africa -1.1
United States NaN -0.9
NaN Bangladesh 2 -0.8
Middle East and North Africa NaN -0.7
NaN Saudi Arabia -0.6
NaN -0.5
Japan NaN -0.5
Name: 2025fd, dtype: float64
Ah, much easier to read! Granted, the nature of the document means that we need to navigate two columns to know whether we're dealing with a country or a region, but this was good enough to get a good picture.
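By the way, droplevel does the same job as reset_index(level=..., drop=True) in a single step; here's a toy sketch with a two-row multi-indexed Series:

```python
import pandas as pd

s = pd.Series(
    [-3.2, -0.9],   # stand-ins for two of the forecast differences
    index=pd.MultiIndex.from_tuples(
        [("World", "EMDE", None, "Iran, Islamic Rep."),
         ("World", "Advanced economies", "United States", None)],
        names=["world", "major_group", "minor_group", "country"],
    ),
)

# Discard the two outer levels entirely -- equivalent to
# reset_index(level=['world', 'major_group'], drop=True).
trimmed = s.droplevel(["world", "major_group"])
print(list(trimmed.index.names))   # ['minor_group', 'country']
assert trimmed.equals(
    s.reset_index(level=["world", "major_group"], drop=True)
)
```

Which one to use is mostly a matter of taste; droplevel reads a bit more directly when you never want the levels back as columns.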
We can see that Iran's economic prospects were very seriously downgraded – by more than 3 percentage points! (Growth figures are typically small, with countries aiming for +2 or +3 percent in a robust year, so losing 3.2 percentage points seems unfathomably bad.)
So Iran has been downgraded by a lot since January. But the United States has been downgraded by 0.9 percentage points – not as much as Mexico, Thailand, or South Africa, but still a pretty serious hit, especially when we compare the numbers with the euro area (-0.3) and Japan (-0.5).