13 min read · Tags: web-scraping pivot-table grouping regular-expressions cleaning

BW #123: Missiles (solution)

Get better at: Web scraping, pivot tables, grouping, cleaning data, and using regular expressions

I couldn't think of a cute panda picture for this week — so instead, here's a photo of the workspace I set up in our safe room, in case the siren goes off while I'm teaching a Python course. I haven't had to use this emergency setup so far, which is obviously good.

The big news of the week has been the growing conflict between Israel and Iran. Generally speaking, Israel has been attacking Iranian military and government installations. While at least some Iranian missiles have undoubtedly aimed to strike military bases and national infrastructure, the last few days saw massive destruction at a medical research building (https://www.timesofisrael.com/from-heart-tissue-to-dna-samples-weizmann-scientists-mourn-work-vaporized-in-iran-attack/), the surgical ward of a hospital, and countless homes – beyond the dozens of dead and hundreds of wounded (https://www.timesofisrael.com/heavy-damage-injuries-as-iranian-missile-hits-beersheba-hospital-dozens-hurt-in-central-cities/).

Fortunately, I managed to do all of my Python training without any air-raid sirens this week, which I'd count as a minor miracle. I've moved my safe-room "desk" and laptop back to their usual places, and hope that this will all be over before I need them again.

Missiles have obviously been on my mind quite a bit, and I decided that they would be an appropriate topic for this week's data analysis.

Data and five questions

This week's data comes from the Arms Control Association (https://www.armscontrol.org/), which describes itself as a nonpartisan organization dedicated to providing information and commentary on arms-control issues. (If you're interested, they have a page dedicated to Iran's nuclear program at https://www.armscontrol.org/factsheets/status-irans-nuclear-program-1 .)

There is no downloadable data file for paid subscribers, because downloading and importing it is part of the exercise this week. (And I'll admit that it's a bit tricky!)

But you can download the Jupyter notebook that I used to solve these problems, and there's a one-click link that'll load that notebook into Google Colab. Those links are at the bottom of this issue.

Meanwhile, here are my five tasks and questions:

Create a data frame from the "worldwide ballistic missile inventories" table at https://www.armscontrol.org/factsheets/worldwide-ballistic-missile-inventories . Use the first row as the column names. Remove the footnote numbers that appear at the end of country names, statuses, and column titles. (We'll leave them elsewhere, where removing them is much trickier.) Make the country names the index for the data frame.

First, let's load Pandas:

import pandas as pd

We're used to creating data frames with read_csv (from CSV files) and read_excel (from Excel files). But one of my favorite methods for creating a data frame is read_html. Normally, you hand read_html a URL, and it returns a list of data frames, one for each HTML table it found on the page.

So, what happened when I tried that?

url = 'https://www.armscontrol.org/factsheets/worldwide-ballistic-missile-inventories'
pd.read_html(url)

I got an exception, with the following error message:

HTTPError: HTTP Error 403: Forbidden

Getting such an error code often means that the site doesn't want to be scraped. (And if you're wondering how it can know that I'm using a scraper, rather than an actual browser, it's because every HTTP client is supposed to identify itself, via the User-Agent request header.)
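
As an illustration, here's how you could inspect or override that header with requests; the browser-style value is my own example, not something this particular site requires:

import requests

# requests announces itself with a default User-Agent along the lines
# of "python-requests/2.32.0", which a server is free to reject
response = requests.get(
    'https://example.com',
    headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'},
)
print(response.request.headers['User-Agent'])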

So, what can we do?

One option would be to go to the Web page, save it to a file, and then invoke read_html on that file. After all, all of the read_* methods can work on a URL, a filename, a file object, or a string.
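
For example (a sketch; the filename here is my own invention):

# assuming you used your browser's "save page" feature to create
# missiles.html, read_html can parse the local file directly
df = pd.read_html('missiles.html')[0]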

But what if we try to download it with another HTTP client, one which won't announce its presence quite so loudly or problematically? I'm talking about requests, one of the most popular Python packages, which acts as an HTTP client.

I tried requests, and it worked just great! I downloaded the content from the page with requests.get, assigning its return value to the variable response. I then asked for response.content, which gave me a byte string (not a regular string!) of what it retrieved from the site. I handed that to read_html, which returned a list of data frames – in this case, a 1-element list of data frames, since the page contained only a single HTML table. And then I used [0] to get the first (and only!) data frame:

import requests
response = requests.get(url)

df = (
    pd
    .read_html(response.content)[0]
)    
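
One caveat: recent versions of pandas deprecate passing literal HTML (a string or byte string) directly to read_html, asking you to wrap it in a file-like object instead. If you see a warning or error to that effect, this variation should work:

from io import StringIO

df = pd.read_html(StringIO(response.text))[0]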

We now have a data frame. But as is often the case after importing data, we're going to need to clean it up. For starters, the first row contains the column names. We could pass header=0 to read_html, and it would actually work – but we would still have to clean up our column names, removing digits (i.e., footnote numbers) from them.

We can grab the first row with .iloc[0], and we can then remove digits and asterisks from the right side of each of the values with str.rstrip, passing a string containing only digits and an asterisk. (str.rstrip is a variation on str.strip that only looks at the right side of the string.)
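
For example, with plain Python strings (values of my own invention, not taken from the table):

# rstrip removes any trailing run of the listed characters
'Country1'.rstrip('0123456789*')     # 'Country'
'Status23*'.rstrip('0123456789*')    # 'Status'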

But then what? The result of invoking str.rstrip is a series of strings. We want those to be the data frame's column names. We can do that by invoking the set_axis method on the data frame, handing it the result of invoking iloc and str.rstrip.

But how can we tie this all together? We want to set the column names based on row 0 – but only after we've done some transformations on row 0 with str.rstrip.

My solution was to use pipe, which lets us invoke a function that takes a data frame as an argument. Our function can then invoke set_axis, passing it the result of our str.rstrip as an argument. Since this will be a one-off function, we can use a lambda for it. Here's how that will look:

response = requests.get(url)

df = (
    pd
    .read_html(response.content)[0]
    .pipe(lambda df_: df_.set_axis(
        df_.iloc[0].str.rstrip('0123456789*'),
        axis='columns'))
)    

After executing the above code, our data frame's column names are what we want – the first row's strings, minus the footnote numbers.

But the column names are still in the first row, and we don't want them there. We can remove them with drop, to which we can pass the index we want to remove – 0, in this case. Note that I included the axis='rows' keyword argument, even though that is the default value:

response = requests.get(url)

df = (
    pd
    .read_html(response.content)[0]
    .pipe(lambda df_: df_.set_axis(
        df_.iloc[0].str.rstrip('0123456789*'),
        axis='columns'))
    .drop(0, axis='rows')
)   

Next, I decided to remove the footnote numbers from the Country and Status columns, mostly because they were the easiest to deal with. (The others, such as footnotes after missile names, were extremely tricky, in no small part because many missile names include numbers, making it hard to find the boundary between the name and the footnote.) I once again decided to use str.rstrip, applying it to each column. To assign the result back to the column, I used assign, which adds or replaces a column in the data frame it returns.
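
Here's a tiny, self-contained illustration of that assign pattern (the values are mine, not rows from the actual table):

import pandas as pd

df_demo = pd.DataFrame({'Country': ['Iran1', 'Israel2*']})

# assign returns a new data frame in which Country has been replaced
df_demo.assign(Country=lambda df_: df_['Country'].str.rstrip('0123456789*'))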

After making this change to Country, I used set_index to make that column into the data frame's index:

response = requests.get(url)

df = (
    pd
    .read_html(response.content)[0]
    .pipe(lambda df_: df_.set_axis(
        df_.iloc[0].str.rstrip('0123456789*'),
        axis='columns'))
    .drop(0, axis='rows')
    .assign(Country = lambda df_: 
                         df_['Country'].str.rstrip('0123456789*'),
            Status = lambda df_: 
                         df_['Status'].str.rstrip('0123456789*'))
    .set_index('Country')
)    

The resulting data frame had 171 rows and 4 columns.
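
A quick way to verify this:

df.shape    # returned (171, 4) when I ran it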

Now clean up the data in the "Range" column, such that it contains only numbers. If a cell contains two numbers to indicate high and low parts of the range, then just keep the high part. What dtype is most appropriate here?

We humans can easily look at the Range column and understand the range of each listed missile. However, each value in the column ends with the string ' km', indicating how many kilometers it can travel to hit a target. In many cases, the range is listed with two numbers, a minimum and maximum.

I wanted to perform numeric calculations on the Range column, and that meant somehow turning its values into numbers.

I decided to do the easiest things first: removing the ' km' suffix with str.removesuffix, removing the commas (used as thousands separators) with str.replace, and removing any '+' signs, also with str.replace.

Here's how that looks:

df['Range'] = (
    df['Range']
    .str.removesuffix(' km')
    .str.replace(',', '')
    .str.replace('+', '')
)

How can we take ranges of numbers, e.g., "2-100", and turn them into just "100"? Regular expressions, of course! I used str.replace with regex=True, using the following search string:

r'^\d+-(\d+)'

This matches one or more digits at the start of the value, followed by a hyphen, and then captures the digits that come after the hyphen.

Our replacement string is simple, \1, meaning, "Whatever we captured in the parentheses before." So if our original value is 2-100, the replacement value is 100. It's still a string, but closer to what we need to turn it into a numeric column.

Notice, by the way, that I'm using a raw string, with an r before the opening quote. A raw string is equivalent to one in which every backslash has been doubled, which keeps Python from interpreting sequences like \d and \1 as escape characters, letting them arrive intact at the regular expression engine and avoiding errors and warnings.
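
Here's a quick standalone demonstration of both points, using Python's re module:

import re

# \1 in the replacement refers to whatever the parentheses captured
re.sub(r'^\d+-(\d+)', r'\1', '2-100')    # returns '100'

# a raw string is equivalent to one with each backslash doubled
print(r'\1' == '\\1')                    # prints True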

Here's what our code looks like so far:


df['Range'] = (
    df['Range']
    .str.removesuffix(' km')
    .str.replace(',', '')
    .str.replace('+', '')
    .str.replace(r'^\d+-(\d+)', r'\1', regex=True)
)

With our strings fixed, we should now be able to invoke astype and get back a numeric type. Most obvious would be an integer type, perhaps int16, which would cover the largest range.

But if you try to invoke astype('int16'), you discover that we have forgotten one value, namely the string 'Unknown'. What value should we put instead of 'Unknown'? The answer, of course, is NaN, the "not a number" value that isn't a value. Or the non-value that is a value. The most important thing to remember is that NaN is a float, and that it cannot be coerced into an integer. So say "goodbye" to using an int; we'll have to use float32 instead.
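
Here's a minimal illustration of both claims, namely that int16 is big enough for these ranges, but that NaN rules out integer dtypes entirely:

import numpy as np

np.iinfo(np.int16).max                          # 32767, plenty for missile ranges
pd.Series([100.0, np.nan]).astype('float32')    # works just fine
pd.Series([100.0, np.nan]).astype('int16')      # raises IntCastingNaNError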

We could thus say:


import numpy as np

df['Range'] = (
    df['Range']
    .str.removesuffix(' km')
    .str.replace(',', '')
    .str.replace('+', '')
    .str.replace(r'^\d+-(\d+)', r'\1', regex=True)
    .replace('Unknown', np.nan)
    .astype('float32')
)

But wait a second: Why am I, at this stage of the game, transforming the string 'Unknown' into NaN? If the authors of this data set had used one of the standard NA strings (such as 'NA') instead of 'Unknown', it would have become NaN automatically when the table was read.

And that's when it hit me: I went back to the previous problem, and rewrote my call to read_html to include the na_values keyword argument. I then specified that 'Unknown' should be considered NaN when the file is read into a data frame:

response = requests.get(url)

df = (
    pd
    .read_html(response.content, na_values=['Unknown'])[0]
    .pipe(lambda df_: df_.set_axis(
        df_.iloc[0].str.rstrip('0123456789*'),
        axis='columns'))
    .drop(0)
    .assign(Country = lambda df_: 
                         df_['Country'].str.rstrip('0123456789*'),
            Status = lambda df_: 
                         df_['Status'].str.rstrip('0123456789*'))
    .set_index('Country')
)    

I was then able to trim the cleanup code for the Range column; after all of the string transformations, I simply used astype to convert everything to float32:


df['Range'] = (
    df['Range']
    .str.removesuffix(' km')
    .str.replace(',', '')
    .str.replace('+', '')
    .str.replace(r'^\d+-(\d+)', r'\1', regex=True)
#     .replace('Unknown', np.nan)
    .astype('float32')
)

We now have a data frame with the right column names, the right index, and a numeric Range column, ready for some calculations.
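
For example, we can now write one-liners like these (hypothetical queries of my own, not this week's remaining questions):

df['Range'].max()                        # longest range in the table, in km
df.groupby('Status')['Range'].mean()     # average range per missile status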