Skip to content
15 min read · Tags: regular-expressions excel cleaning window-functions plotting

BW #124: NATO Spending (solution)

Get better at: Working with Excel files, regular expressions, cleaning data, window functions, and plotting.

BW #124: NATO Spending (solution)

NATO (the North Atlantic Treaty Organization) was established after World War II, in no small part to protect Western democracies against the Soviet Union and other Communist threats. NATO is generally seen as a great success, and has even absorbed into its ranks a number of countries that were previously part of the Soviet Bloc before they became democracies.

In the wake of Russia's invasion of Ukraine and pressure from several US presidents, a number of NATO members have already pledged to increase defense spending to 2 percent. Several have even passed laws requiring that their governments do so.

As the New York Times notes (https://www.nytimes.com/live/2025/06/25/us/trump-nato-news?unlocked_article_code=1.Rk8.57Os._2tG-hkNEc42&smid=url-share), leaders at this week's NATO summit announced that members (but not "all members") pledged at this week's NATO summit to increase their defense spending even further.

Data and five questions

This week, we'll look at NATO members' defense spending, using data from NATO itself, on its "Defence expenditures and NATO’s 2% guideline" page (https://www.nato.int/cps/en/natohq/topics_49198.htm). At the bottom of the page, you'll find links to download reports from each year in Excel format; you want the 2024 data, at https://www.nato.int/cps/en/natohq/news_226465.htm.

Learning goals for this week include: Working with Excel files, cleaning data, window functions, regular expressions, and plotting.

Paid subscribers can download copies of the data files, as well as my notebook, from the end of this message. You can also see my Jupyter notebook directly in Google Colab.

Want to see me solve the first two questions? Check out the YouTube video at https://youtu.be/Gi20V_abj0E?list=PLbFHh-ZjYFwHEniYtZmq8ZxNBdOsDhjRN . The full Bamboo Weekly playlist is at https://www.youtube.com/watch?v=Gi20V_abj0E&list=PLbFHh-ZjYFwG34oZY24fSvFtOZT6OdIFm .

Here are the five questions and tasks:

Read the "defence expenditure" data from "table 2" into a Pandas data frame. (We want the first table on that sheet, with current prices and exchange rates.) Which five countries have increased their NATO spending most, as a percentage, in the latest budget cycle (from 2023-2024)? Which increased most since 2014, when the data was first tracked?

Before we do anything else, let's load Pandas:

import pandas as pd

Next, I'll define filename to contain the name of the Excel file we're loading. I'll then use read_excel to read it into a data frame:

filename = 'data/240617-def-exp-2024-TABLES-en.xlsx'
expenditure_df = (
        pd
        .read_excel(filename)
)

But this isn't what we want at all. That's because we want the data from the "TABLE2" sheet. To get that, we have to specify sheet_name with either the name or the integer index. The name is clear and easy to type, so I"ll use that. I'll also indicate that the column names (i.e., the "header row") are in row 7 – what Excel calls row 8, because it starts numbering with 1. And I only want the data from the first table on the sheet, so I'll pass the nrows keyword argument, such that we only read 33 rows into the data frame:

filename = 'data/240617-def-exp-2024-TABLES-en.xlsx'
expenditure_df = (
        pd
        .read_excel(filename, 
                    sheet_name='TABLE2',
                              header=7,
                              nrows=33)
)

This is good, but it's not quite good enough. That's because we have three columns that weren't named – one of which we want (with the country names) as the index, and two of which we don't care about. We can use drop to remove the columns we don't want, and set_index to set the one we do:

filename = 'data/240617-def-exp-2024-TABLES-en.xlsx'
expenditure_df = (
        pd
        .read_excel(filename, 
                    sheet_name='TABLE2',
                              header=7,
                              nrows=33)
    .drop(columns=['Unnamed: 0', 'Unnamed: 1'])
    .set_index('Unnamed: 2')
)

The only thing remaining is to get rid of the two empty lines at the start of the data frame. I decided to do this with iloc and a slice:

filename = 'data/240617-def-exp-2024-TABLES-en.xlsx'
expenditure_df = (
        pd
        .read_excel(filename, 
                    sheet_name='TABLE2',
                              header=7,
                              nrows=33)
    .drop(columns=['Unnamed: 0', 'Unnamed: 1'])
    .set_index('Unnamed: 2')
    .iloc[2:]
)

The resulting data frame has 31 rows and 11 columns. All 11 columns have a dtype of float64, so we can count on them to be numeric.

To find out which countries increased their budget as a percentage in the latest budget cycle, we first needed to calculate the percentage difference from each country's expense this year and the previous year.

We can normally compare each row with its predecessor with pct_change. But here we don't want to compare rows, we rather want to compare columns. Fortunately, we can pass axis='columns' to pct_change (and to many other Pandas methods) and have it work on the columns, rather than the rows:

(
    expenditure_df
    .pct_change(axis='columns')
)

The result is a data frame with the same size, index, and columns as expenditure_df, but in which the values have all been changed to reflect the percentage difference with the cell to its left. The leftmost column is then full of NaN ("not a number") values.

To find the percentage difference between last year and this year for each country, we can then select this year's data:

(
    expenditure_df
    .pct_change(axis='columns')
    ['2024e']
)

This returns a series. We can get the 5 largest values with nlargest . We can then see the names of the most-increased countries thanks to the fact that the country names are the data frame's index:

(
    expenditure_df
    .pct_change(axis='columns')
    ['2024e']
    .nlargest(5)
)

Here are the results:

Unnamed: 2
Romania*      0.542354
Czechia*      0.506007
Montenegro    0.414590
Türkiye       0.370916
Sweden        0.363539
Name: 2024e, dtype: float64

In other words, Romania spent 50 percent more on defense in 2024 than in 2023.

What if we compare 2024 numbers with the earliest data we have in the data frame, from 2014? In order to do that we won't want to compare the 2024 column with the 2023 column. Rather, we'll want to compare it with 2014.

One way to do this would be to retrieve just those two columns, and run pct_change on them again. But another way is to simply say, "Let's compare each column with the one n columns before it." We can do that with the periods keyword argument, which has a default value of 1, but can be set to anything.

Now, how many columns behind 2024 is 2014? I would say 9. But I decided to let Pandas calculate this, calculating len(expenditure_df.columns), subtracting 1, and passing the result to the periods keyword argument. Other than that, it's the same query as before:


(
    expenditure_df
    .pct_change(axis='columns', 
                periods=len(expenditure_df.columns)-1)
    ['2024e']
    .nlargest(5)
)

Here are the results:

Unnamed: 2
Lithuania*    4.378801
Latvia*       3.838939
Hungary       3.040984
Poland*       2.460544
Czechia*      2.459855
Name: 2024e, dtype: float64

Yes, you're reading this right: Lithuania's defense budget is now more than 4 times bigger than it was in 2014, followed by Latvia, Hungary, Poland, and Czechia. All of these countries have borders with Ukraine except for Czechia, which is still fairly close and has taken in many Ukrainian refugees.

The United States spends more on defense than any other NATO country, with Germany and the UK in the 2nd and 3rd position. Find how much the US spent (estimated) in 2024. Then see how many other NATO countries' defense spending, starting with Germany and the UK and continuing from largest to smallest, would be needed to match US defense spending. That is, if we'd like to say that "the US spends as much as the n next-largest contributors, combined," what is n?

To get the amount that the US spent in 2024, we can use loc. We often think of loc as a way to retrieve a row, but in its two-argument form, we can pass both a row selector (an index value, a list of index values, a boolean series, or a function that returns a boolean series) and a column selector (a column name or list of column names). Here, we'll use the simplest possible row and column selectors, and use them to assign a value to a variable:

us_spending = expenditure_df.loc['United States', '2024e']

Now that we know how much the US spent in 2024, we'll find how much every other country spent in 2024. We first select the column with [], and then use sort_values to put them in descending order, from the greatest spending to the least:

(
    expenditure_df
    ['2024e']
    .sort_values(ascending=False)
)

I don't want the US value, which I know will be highest. I'll thus remove it using iloc and a slice:

(
    expenditure_df
    ['2024e']
    .sort_values(ascending=False)
    .iloc[1:]
)

What I next want to do is calculate the following:

This kind of calculation is a window function. pct_changeis also a window function, but it's a "rolling" window function, performing its calculation repeatedly on a sliding "window" of rows or columns.

Here, we want an "expanding" window function, in which we sum the first two rows, then the first three, then the first four, etc., until we have the sum of them all. We can do that by invoking expanding and then invoking sum() on it:

(
    expenditure_df
    ['2024e']
    .sort_values(ascending=False)
    .iloc[1:]
    .expanding().sum()
)

We now have a series in which row n represents the total spending of the top n non-US countries. We can now use loc and lambda to find the point at which US spending is smaller than the total spending:

(
    expenditure_df
    ['2024e']
    .sort_values(ascending=False)
    .iloc[1:]
    .expanding().sum()
    .loc[lambda s_: s_ >= us_spending]
)

The result is:

Series([], Name: 2024e, dtype: float64)

What is this result? It's an empty series. That is, US defense spending is greater than all of the other NATO members put together. Which means that even when we sum all of the other countries' spending together, we fall short of the total in the US.