[Reminder: The 7th cohort of my Python Data Analysis Bootcamp (PythonDAB) will start on June 19th! I'm holding another webinar on Sunday with info about this 4-month intense-but-intimate mentoring program in Python, Git, and Pandas. Join me to learn more: https://us02web.zoom.us/webinar/register/WN_TgKu-kiwTVeylzMOQidTZA .]
Earlier this week, the World Bank released its latest report on "Global Economic Prospects" (https://www.worldbank.org/en/publication/global-economic-prospects). Its projections for economic growth over the coming year had to be revised dramatically since they first came out in January, in no small part because of changes in US trade policy.
Data and five questions
This week, we'll look at some of the data that the World Bank provided along with their written report, to better understand the current state of the world economy. Go to the GEP home page (https://www.worldbank.org/en/publication/global-economic-prospects), open the "more downloads" menu at the center of the page, and choose "GDP growth data." That'll download an Excel spreadsheet to your computer with the data we'll be examining.
Paid subscribers can download copies of the data files, as well as my notebook, from the end of this message.
Learning goals for this week include reading from Excel files, cleaning data, multi-indexes, window functions, and plotting.
Download the Excel spreadsheet with the World Bank's data, and read the first sheet into a Pandas data frame. We only want rows 5 through 34 from the Excel file. Columns A through D should together form a four-level multi-index. We do want all of the columns with projections, as well as the differences from earlier this year (columns M and N), to be included. In the columns we'll use for the index, replace each NaN with the previous non-NaN value in that column.
People sometimes ask me how I come up with ideas for Bamboo Weekly. This week's topic came from a posting from economist Justin Wolfers on Bluesky (https://bsky.app/profile/justinwolfers.bsky.social), who wrote about the World Bank's report. I quickly looked up the report, found that it came with data in Excel format, and excitedly decided to dig in.
But this week demonstrates that just because data is in an Excel spreadsheet doesn't mean that it's easy to parse or work with. The spreadsheet was clearly designed to be a report for human eyes, not input for Pandas. And so if you found that reading this week's data into Pandas was more challenging than usual... well, that's because it was.
That said, part of my goal with Bamboo Weekly is to expose you to enough real-world data sets, including poorly formatted ones, that you'll learn how to handle even the nastiest of files.
So, let's start by importing Pandas:
import pandas as pd
In an ideal world, we could just read the Excel file into Pandas with read_excel:
filename = 'data/GEP-June-2025-GDP-growth-data.xlsx'
df = (
pd
.read_excel(filename)
)
However, this falls short in a number of ways:
- The headers aren't on the first line of the document
- We don't want all of the columns
- We don't want all of the rows
I'm ignoring the fact that this Excel document contains a number of sheets; we're only interested in the first one, which means we can ignore the others.
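If you want to confirm which sheets a workbook contains before committing to the first one, pd.ExcelFile can list them. Here's a small sketch using a workbook built in memory – the sheet names are invented for the example, and writing .xlsx files requires the openpyxl package:

```python
import io

import pandas as pd

# Build a tiny two-sheet workbook in memory, standing in for the
# real GEP file (these sheet names are made up for the example).
buf = io.BytesIO()
with pd.ExcelWriter(buf) as writer:
    pd.DataFrame({"a": [1, 2]}).to_excel(writer, sheet_name="GDP growth")
    pd.DataFrame({"b": [3]}).to_excel(writer, sheet_name="Memorandum items")
buf.seek(0)

# ExcelFile lets us peek at the sheet names without loading any data.
xl = pd.ExcelFile(buf)
print(xl.sheet_names)    # ['GDP growth', 'Memorandum items']

# read_excel defaults to sheet_name=0 -- the first sheet -- which is
# why we can simply ignore the others.
df = pd.read_excel(xl, sheet_name=0, index_col=0)
```

Passing the ExcelFile object back to read_excel avoids parsing the workbook twice.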
We can indicate the line on which the headers are located with the header keyword argument. Remember that whereas Excel would say the headers are on line 4, we would call it line 3, since Python starts counting with 0.
I also passed nrows=30, so that we would only read 30 rows from the document. Yes, there is more data after those rows, but I decided that things were complex enough without getting the "memorandum items," as the spreadsheet describes them.
We can ask for a selected number of columns with the usecols keyword argument; I normally prefer to use the names, rather than the integers, but these names are odd, and they repeat – so I decided to invoke usecols with the column numbers (starting with 0), and then pass names, allowing me to rename the columns.
I didn't have a great naming convention for the first four columns, which will constitute our multi-index. So I went with "world", "major_group", "minor_group", and "country". Again, this demonstrates even further that this spreadsheet was optimized for human eyes, and not for programs to read.
I also gave names to the remaining columns I loaded. In theory, you can repeat column names in Pandas, just as an index can contain repeated values. But it's a bad idea, and Pandas tries hard to stop us from doing that. And besides, why invite trouble? I thus renamed the final two columns, which show the differences between the January forecast and the current forecast, to 2025fd and 2026fd, for "forecast differences."
Here's the query so far:
filename = 'data/GEP-June-2025-GDP-growth-data.xlsx'
df = (
pd
.read_excel(filename,
header=3, nrows=30,
usecols=[0,1,2,3,5,6,7,8,9,10,12,13],
names=['world', 'major_group', 'minor_group',
'country', '2022', '2023', '2024e',
'2025f', '2026f', '2027f', '2025fd',
'2026fd']
)
)
The above query does a decent job of grabbing only the rows we want, and only the columns we want.
In theory, we could then assign the index, either with index_col in read_excel or with set_index. But there is a bit of a problem here, in that the columns we want to use for the index contain a lot of NaN values. Those look nice on the page, in that they appear as whitespace – but they're not helpful.
For example, consider lines 7, 8, and 9 of the spreadsheet, with data about the US, euro area, and Japan. Those names are in column C; column B, which is effectively their heading, contains the text "Advanced economies" – but only once, with blank cells (i.e., NaN values) beneath it. Pandas can't exactly look up and to the left to find out which heading things are under, at least not easily.
I thought about using interpolate to fill in the NaN values with whatever non-NaN value came above them. But I got a warning from interpolate, reminding me that it's really only for numeric data. If I want to fill in string data, then I should use ffill or bfill, for "forward fill" and "backward fill." By running ffill on the four columns I want to use as an index, I can remove the NaN values and replace them with the text that came above. Great, right?
Well, mostly: The problem is that we cannot run ffill in place on only a subset of a data frame. So I decided to use assign to replace the world and major_group columns with the result of running ffill on each of them.
Why didn't I run ffill on the other two columns I want in my index? I actually did try it on minor_group, and ran into weird problems. And the countries don't need to be filled; that's the lowest and final part of the hierarchy:
filename = 'data/GEP-June-2025-GDP-growth-data.xlsx'
df = (
pd
.read_excel(filename,
header=3, nrows=30,
usecols=[0,1,2,3,5,6,7,8,9,10,12,13],
names=['world', 'major_group', 'minor_group',
'country', '2022', '2023', '2024e',
'2025f', '2026f', '2027f', '2025fd',
'2026fd']
)
.assign(world = lambda df_: df_['world'].ffill(),
major_group = lambda df_: df_['major_group'].ffill())
)
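In isolation, here's what ffill does to a heading-style column – a toy Series mimicking the shape of column B of the spreadsheet:

```python
import pandas as pd

# The heading appears once, and the rows beneath it are blank,
# just as in the spreadsheet's index columns.
major_group = pd.Series(
    ["Advanced economies", None, None,
     "Emerging market and developing economies", None]
)

# Forward-fill: each missing value takes the last non-missing
# value above it.
filled = major_group.ffill()
print(filled.tolist())
# ['Advanced economies', 'Advanced economies', 'Advanced economies',
#  'Emerging market and developing economies',
#  'Emerging market and developing economies']
```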
Finally, with all of this in place, I invoked set_index, putting my four-level multi-index into place:
filename = 'data/GEP-June-2025-GDP-growth-data.xlsx'
df = (
pd
.read_excel(filename,
header=3, nrows=30,
usecols=[0,1,2,3,5,6,7,8,9,10,12,13],
names=['world', 'major_group', 'minor_group',
'country', '2022', '2023', '2024e',
'2025f', '2026f', '2027f', '2025fd',
'2026fd']
)
.assign(world = lambda df_: df_['world'].ffill(),
major_group = lambda df_: df_['major_group'].ffill())
.set_index(['world', 'major_group', 'minor_group', 'country'])
)
Sure enough, it worked! I got a data frame with 30 rows and 8 columns. Moreover, all of the columns have a dtype of float64, so I know that the data arrived in numeric format.
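Those checks boil down to two attributes, shape and dtypes. Here's the idea on a small stand-in frame (the values and index entries are invented for illustration):

```python
import pandas as pd

# A toy stand-in for the loaded data frame: float columns under a
# four-level multi-index, with made-up numbers.
toy = pd.DataFrame(
    {"2025f": [1.4, 0.7], "2025fd": [-0.9, -0.5]},
    index=pd.MultiIndex.from_tuples(
        [("World", "Advanced economies", "United States", None),
         ("World", "Advanced economies", "Japan", None)],
        names=["world", "major_group", "minor_group", "country"],
    ),
)

print(toy.shape)                        # (2, 2) -- rows, columns
print((toy.dtypes == "float64").all())  # True: everything is numeric
```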
Column M, labeled "2025f", shows the percentage point difference from January 2025 predictions. Which five countries' predictions dropped by the greatest amount since the last forecast? What countries (if any) have improved their growth forecast since January?
Now that we have the data, we can perform some calculations. Column M in the original spreadsheet, which I named 2025fd, contains the change from the World Bank's January forecast. (I must admit that I'm not sure what the difference is between a prediction, a forecast, and an estimate, so I apologize to the economists and statisticians for my imprecision.) How can we find the five countries whose values dropped by the greatest amount?
One option is to use sort_values and then to retrieve the rows with the smallest values. But I often like to use nsmallest, a method that does the same thing, but more directly:
(
df['2025fd']
.nsmallest(10)
)
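The two approaches really are interchangeable; here's a quick check on a toy Series, with a mix of values taken from this week's data and one made-up entry ("Examplestan"):

```python
import pandas as pd

# Hypothetical forecast changes, keyed by country name.
changes = pd.Series(
    [-3.2, -1.3, -1.1, -0.9, 0.2],
    index=["Iran", "Mexico", "Thailand", "United States", "Examplestan"],
)

# sort_values + head and nsmallest give the same answer here;
# nsmallest just says it in one step.
via_sort = changes.sort_values().head(3)
via_nsmallest = changes.nsmallest(3)

assert via_sort.equals(via_nsmallest)
print(via_nsmallest.index.tolist())   # ['Iran', 'Mexico', 'Thailand']
```

(With tied values, the two can order the ties differently, so nsmallest is also the less ambiguous choice.)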
Now, the nsmallest call on df['2025fd'] returns the ten rows whose forecast growth was downgraded by the greatest degree – I asked for ten, rather than five, because some of the rows describe regions rather than individual countries. However, don't we want to know the countries' names?
Fortunately, those are in the index. And when we retrieve the ten smallest values, we'll get the associated index entries for each one:
world major_group minor_group country
World Emerging market and developing economies NaN Iran, Islamic Rep. 2 -3.2
Mexico -1.3
Thailand -1.1
South Africa -1.1
Advanced economies United States NaN -0.9
Emerging market and developing economies NaN Bangladesh 2 -0.8
Middle East and North Africa NaN -0.7
NaN Saudi Arabia -0.6
Advanced economies NaN NaN -0.5
Japan NaN -0.5
Name: 2025fd, dtype: float64
Does that look a bit jumbled to you? Yeah, it does – because with four (!) multi-index levels, it's hard to keep things readable. I decided to remove two of the index levels, so that the remaining ones would be readable, using reset_index. I not only specified the index levels with their names, but also passed drop=True, meaning that those index levels shouldn't be turned back into data-frame columns, but should rather be dropped entirely:
(
df['2025fd']
.nsmallest(10)
.reset_index(level=['world', 'major_group'], drop=True)
)
Here's what I get:
minor_group country
NaN Iran, Islamic Rep. 2 -3.2
Mexico -1.3
Thailand -1.1
South Africa -1.1
United States NaN -0.9
NaN Bangladesh 2 -0.8
Middle East and North Africa NaN -0.7
NaN Saudi Arabia -0.6
NaN -0.5
Japan NaN -0.5
Name: 2025fd, dtype: float64
Ah, much easier to read! Granted, the nature of the document means that we need to navigate two columns to know whether we're dealing with a country or a region, but this was good enough to get a good picture.
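By the way, droplevel does the same job as reset_index(level=..., drop=True) in a single step; here's a toy sketch with a two-row multi-indexed Series:

```python
import pandas as pd

s = pd.Series(
    [-3.2, -0.9],   # stand-ins for two of the forecast differences
    index=pd.MultiIndex.from_tuples(
        [("World", "EMDE", None, "Iran, Islamic Rep."),
         ("World", "Advanced economies", "United States", None)],
        names=["world", "major_group", "minor_group", "country"],
    ),
)

# Discard the two outer levels entirely -- equivalent to
# reset_index(level=['world', 'major_group'], drop=True).
trimmed = s.droplevel(["world", "major_group"])
print(list(trimmed.index.names))   # ['minor_group', 'country']
assert trimmed.equals(
    s.reset_index(level=["world", "major_group"], drop=True)
)
```

Which one to use is mostly a matter of taste; droplevel reads a bit more directly when you never want the levels back as columns.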
We can see that Iran's economic prospects were very seriously downgraded – by more than 3 percentage points! (Growth figures are typically small, with countries aiming for +2 or +3 percent in a robust year, so losing 3.2 percentage points seems unfathomably bad.)
So Iran has been downgraded by a lot since January. But the United States has been downgraded by 0.9 percentage points – not as much as Mexico, Thailand, or South Africa, but still a pretty serious hit, especially when we compare the numbers with the euro area (-0.3) and Japan (-0.5).