BW #63: Ukraine aid (solution)

Yesterday, President Joe Biden signed a package of aid to Ukraine. Which countries have been helping Ukraine the most? Has that changed over time?

Yesterday, just after Bamboo Weekly was published, President Joe Biden signed into law a package of bills that Congress had passed, providing military aid to Ukraine, as well as to Israel and Taiwan. For those of us who see Ukraine’s defense as essential, this was a welcome (if late) breath of fresh air.

But of course, the United States isn’t the only country that has been helping Ukraine. Many other countries, and particularly European nations, have been helping in numerous ways.

This week, we'll look at a data set from February describing the aid Ukraine had received by that time, and who had been supplying it.

Although I hoped to include some questions about Ukrainian refugees, they didn’t fit into the framework I established yesterday. I’ll just note that the data set we’re using this week includes information about refugees, and where they have gone. It’s worth exploring that data as well, if only to understand just how many people have been displaced as a result of this war, and how many countries are opening their borders (and budgets) to help Ukrainians find new homes while the war continues.

Data and seven questions

This week's data comes from the Kiel Institute for the World Economy (http://www.ifw-kiel.de/), located in the northern German city of Kiel. Their Ukraine Support Tracker (https://www.ifw-kiel.de/publications/ukraine-support-tracker-data-20758/) looks at all of the country-to-country ("bilateral") support that Ukraine has received since Russia invaded in February 2022. This means that it ignores assistance provided by individuals and non-profits.

It takes 6-8 hours to research and write each edition of Bamboo Weekly. Thanks to those of you who support my work with a paid subscription! Even if you can’t, it would mean a lot if you would share BW with your Python- and Pandas-using colleagues. Thanks!

The latest update to their data and reporting, published in February, is available from here:

https://www.ifw-kiel.de/publications/ukraine-support-tracker-data-20758/

You can download the data itself, in Excel format, from here:

https://www.ifw-kiel.de/fileadmin/Dateiverwaltung/IfW-Publications/fis-import/6b853c2f-90d8-4fb2-a1ad-5c1c4426fb94-Ukraine_Support_Tracker_Release_15.xlsx

This week, I gave you seven tasks and questions based on this data set. Here are my solutions; as always, a link to my Jupyter notebook is at the bottom of the post.

Turn the "Bilateral assistance, main data" tab from the Excel file into a data frame. Make sure the "Announcement Date" column is in datetime format, ignoring any rows where that column is missing or where the datetime isn't in YYYY-MM-DD format.

We’ll start off by loading Pandas:

import pandas as pd

Following that, we want to create a data frame from the “Bilateral assistance, main data” tab in the Excel file. We can always use “read_excel” to read from an Excel file, but if we don’t specify which sheet we want in a multi-sheet file, we get the first one. We can thus pass the “sheet_name” keyword argument, giving it an integer value (to number the sheets starting with 0), a string value (to name the sheet we want) or even a list of integers and/or strings (in which case we get back a dict of data frames).

Here, we’re only interested in one sheet, so we can ask for it by its index:

filename = '6b853c2f-90d8-4fb2-a1ad-5c1c4426fb94-Ukraine_Support_Tracker_Release_15.xlsx'

df = (
    pd
    .read_excel(filename, sheet_name=2)
)

Notice, by the way, the filename that I got when downloading from the Kiel site; I wouldn’t be surprised if the filename you got is both different and somewhat random looking.
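By the way, since we’re only grabbing one sheet here, the integer index is enough. But as described above, “sheet_name” also accepts a string or a list; here’s a minimal sketch of those alternatives, assuming the tab is named exactly as in the question:

# Retrieve the same sheet by its name, rather than by its index
df_by_name = pd.read_excel(filename,
                           sheet_name='Bilateral assistance, main data')

# Passing a list returns a dict of data frames, keyed by whatever we passed
sheets = pd.read_excel(filename, sheet_name=[0, 2])
sheets[2].head()

Either way, we end up with the same data frame for the “Bilateral assistance, main data” tab.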

The good news is that this gives us a data frame. The bad news, though, is that the “Announcement Date” column is not a datetime dtype. When working with CSV files, we normally need to explicitly tell Pandas which column(s) we want to treat as datetime values. But when reading from an Excel file, that’s usually not an issue, since Excel recognizes datetime as a distinct data type.

However, the values in “Announcement Date” aren’t all in a strict datetime format, and thus can’t all be treated as datetime values. Excel stores them as strings, and when we read them into Pandas, the “Announcement Date” column arrives as strings. Actually, the dtype is “object,” meaning that each value is a distinct Python object; normally that’s a string, but it could theoretically be anything at all.
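We can check this for ourselves by asking for the column’s dtype:

df['Announcement Date'].dtype    # dtype('O'), i.e., generic Python objects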

How, then, can we massage this column into a datetime dtype?

First, we’ll drop all of the NaN values with “dropna”. Normally, “dropna” returns a new data frame without any NaN values. This means that if a row contains even one NaN value, regardless of the column, the row is removed. That seems a bit extreme here; I’m OK with removing rows, but only where “Announcement Date” has a NaN value.

To do that, we can call “dropna” with the “subset” keyword argument, passing a string (for one column) or a list (for multiple columns) indicating where the NaNs need to be in order for a row to be removed:

df = (
    pd
    .read_excel(filename, sheet_name=2)
    .dropna(subset='Announcement Date')
)

Now that we have removed rows with NaN in the “Announcement Date” column, we can try to call “pd.to_datetime” on the remaining values. But that won’t work, because the “Announcement Date” column contains some strings (in a variety of formats), and some datetimes. What we’ll need to do is:

  • Turn the entire column into string values with “astype”, assigning it to a new column, “date”
  • Use the “str.contains” method to keep only those rows in YYYY-MM-DD format
  • Invoke “pd.to_datetime” on the remaining values in the “date” column
  • Use “drop” to remove the old “Announcement Date” column

To do this, I’ll use “assign” (to create the “date” column) and also a regular expression, containing the following elements:

  • Match the front of the string with ^
  • Look for 4 decimal digits with \d{4}
  • Match a literal - character
  • Look for 2 decimal digits with \d{2}
  • Match another literal - character
  • Look for 2 more decimal digits with \d{2}

Note that I don’t care what comes after that; I’ll assume that if the first part of the string is in YYYY-MM-DD format, then the rest should be fine, too. I also don’t check that the numbers are within the boundaries we would expect, so if we have a value of 9999-88-77, we’ll accept it even though “pd.to_datetime” will blow up.

Also note that as usual, it’s a good idea to use a raw string (i.e., an “r” before the opening quote) for any string containing a regular expression.
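Just to make the effect of this pattern concrete, here’s a quick check against a few sample strings (these values are made up, purely for illustration):

# Which sample values survive the YYYY-MM-DD filter?
sample = pd.Series(['2023-05-17', '2023-05-17 00:00:00', 'Feb 2023', '9999-88-77'])
sample.str.contains(r'^\d{4}-\d{2}-\d{2}', regex=True)   # True, True, False, True

As mentioned above, 9999-88-77 sneaks through the filter, even though it isn’t a real date.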

We end up with the following:

df = (
    pd
    .read_excel(filename, sheet_name=2)
    .dropna(subset='Announcement Date')
    .assign(date=lambda df_: df_['Announcement Date'].astype(str))
    .loc[lambda df_: df_['date'].str.contains(r'^\d{4}-\d{2}-\d{2}', regex=True)]
    .assign(date=lambda df_: pd.to_datetime(df_['date']))
    .drop('Announcement Date', axis='columns')
)
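If you want to confirm that the conversion worked, a quick sanity check of the new column’s dtype should now report a datetime type:

df['date'].dtype    # should be datetime64[ns]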

Calculate how many rows in the data set describe each kind of specific assistance.

The data set tracks many different types of aid that have been provided to Ukraine since the war began two years ago. How many of the rows describe each kind of help?

At first glance, this seems like an easy question to answer. First, we use square brackets to retrieve the “Type of Aid Specific” column from the data frame:

(
    df
    ['Type of Aid Specific']
)

Then we invoke the “value_counts” method, which tells us how many times each value appears in that column. The distinct values from the column form the index of the returned series, with the counts, sorted from highest to lowest, as the values:

(
    df
    ['Type of Aid Specific']
    .value_counts()
)

However, upon closer examination, this doesn’t seem quite right. The first result is “Weapons and equipment”, with 975 rows. The third result is “Weapons and Equipment”, with 331 rows. Notice the difference? In the first, “equipment” isn’t capitalized, but in the third, it is.

In other words, slight differences in capitalization and spelling can make a data set less accurate. Cleaning a data set often means ironing out exactly these sorts of inconsistencies.

In this case, I decided to run “str.title” on the column before invoking “value_counts”. The “str.title” method capitalizes the first letter of each word, while lowercasing the remaining letters. We go from 34 distinct values to 28, meaning that just standardizing the capitalization reduced the number of values by 6:

(
    df
    ['Type of Aid Specific']
    .str.title()
    .value_counts()
)

Here are the results that I got:

Type of Aid Specific
Weapons And Equipment                                         1306
Assistance                                                     333
Weapons                                                        270
Grant                                                          219
Equipment                                                      170
Military Equipment                                              92
Equipment And Assistance                                        80
Loan                                                            57
Funding For Weapon Acquisition Program                          28
Guarantee                                                       28
Material Assistance                                             17
Assistance And Equipment                                        14
Training                                                        10
Reconstruction                                                  10
European Peace Facility                                          9
Weapons And Assistance                                           9
Weapons And Military Equipment                                   8
Weapons, Equipment And Assistance                                6
Equipment                                                        4
Weapons                                                          2
Equipment And Medicines                                          2
Weapons, Equiment, Assistance                                    1
Training And Equipment                                           1
Weapons, Equipment, Funding For Weapon Acquisition Program       1
Swap-Line                                                        1
Protective Equipment And Medical Supplies                        1
Medicines                                                        1
Commitment                                                       1
Name: count, dtype: int64

Wait a second, though: We still seem to have two entries for “Equipment” and two for “Weapons.” Since they look identical to the naked eye, I wonder if there are any extra spaces there? We can find out by running “str.strip” as part of our method chain:

(
    df
    ['Type of Aid Specific']
    .str.title()
    .str.strip()
    .value_counts()
)

Sure enough, we’re now down to only 26 different entries. From a quick look at the data, we could probably improve this even more, but I’ll stop here, especially since some of these categories are overlapping and ambiguous. After all, where does “Weapons and Equipment” stop, and where do “Equipment” and “Weapons” begin?

Here is the final set of results I got:

Type of Aid Specific
Weapons And Equipment                                         1306
Assistance                                                     333
Weapons                                                        272
Grant                                                          219
Equipment                                                      174
Military Equipment                                              92
Equipment And Assistance                                        80
Loan                                                            57
Funding For Weapon Acquisition Program                          28
Guarantee                                                       28
Material Assistance                                             17
Assistance And Equipment                                        14
Training                                                        10
Reconstruction                                                  10
Weapons And Assistance                                           9
European Peace Facility                                          9
Weapons And Military Equipment                                   8
Weapons, Equipment And Assistance                                6
Equipment And Medicines                                          2
Weapons, Equiment, Assistance                                    1
Training And Equipment                                           1
Weapons, Equipment, Funding For Weapon Acquisition Program       1
Swap-Line                                                        1
Protective Equipment And Medical Supplies                        1
Medicines                                                        1
Commitment                                                       1
Name: count, dtype: int64
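Finally, if we only care about how many distinct categories remain, rather than the full breakdown, we can swap “value_counts” for “nunique”, which should match the 26 entries we counted above:

(
    df
    ['Type of Aid Specific']
    .str.title()
    .str.strip()
    .nunique()    # 26 distinct categories after normalizing case and whitespace
)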