The big news of the week has been the growing conflict between Israel and Iran. Generally speaking, Israel has been attacking Iranian military and government installations. While at least some Iranian missiles have undoubtedly aimed to strike military bases and national infrastructure, the last few days saw massive destruction at a medical research building (https://www.timesofisrael.com/from-heart-tissue-to-dna-samples-weizmann-scientists-mourn-work-vaporized-in-iran-attack/), the surgical ward of a hospital, and countless homes – beyond the dozens of dead and hundreds of wounded (https://www.timesofisrael.com/heavy-damage-injuries-as-iranian-missile-hits-beersheba-hospital-dozens-hurt-in-central-cities/).
Fortunately, I managed to do all of my Python training without any air-raid sirens this week, which I'd count as a minor miracle. I've moved my safe-room "desk" and laptop back to their usual places, and hope that this will all be over before I need them again.
Missiles have obviously been on my mind quite a bit, and I decided that they would be an appropriate topic for this week's data analysis.
Data and five questions
This week's data comes from the Arms Control Association (https://www.armscontrol.org/), which describes itself as a nonpartisan organization dedicated to providing information and commentary on arms-control issues. (If you're interested, they have a page dedicated to Iran's nuclear program at https://www.armscontrol.org/factsheets/status-irans-nuclear-program-1 .)
There is no downloadable data file for paid subscribers, because downloading and importing it is part of the exercise this week. (And I'll admit that it's a bit tricky!)
But you can download the Jupyter notebook that I used to solve these problems, and there's a one-click link that'll load that notebook into Google Colab. Those links are at the bottom of this issue.
Meanwhile, here are my five tasks and questions:
Create a data frame from the "worldwide ballistic missile inventories" table at https://www.armscontrol.org/factsheets/worldwide-ballistic-missile-inventories . Use the first row as the column names. Remove the footnote numbers that appear at the end of country names, statuses and column titles. (We'll leave them elsewhere, where removing them is much trickier.) Make the country names the index for the data frame.
First, let's load Pandas:
import pandas as pd
We're used to creating data frames with read_csv (from CSV files) and read_excel (from Excel files). But one of my favorite methods for creating a data frame is read_html. Normally, you hand read_html a URL, and it returns a list of data frames, one for each HTML table it found on the page.
So, what happened when I tried that?
url = 'https://www.armscontrol.org/factsheets/worldwide-ballistic-missile-inventories'
pd.read_html(url)
I got an exception, with the following error message:
HTTPError: HTTP Error 403: Forbidden
Getting such an error often means that the site doesn't want to be scraped. (And if you're wondering how it can know that I'm using a scraper, rather than an actual browser, it's because every HTTP client is supposed to identify itself, which it does via the User-Agent request header.)
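When read_html fetches a URL on its own, it typically goes through Python's standard urllib machinery, and we can peek at the identity that machinery announces. (This is standard-library behavior, and the exact version string will vary with your Python installation.)
import urllib.request

# An OpenerDirector is what urllib uses to issue requests; its default
# headers include the User-agent string that servers see.
opener = urllib.request.build_opener()
print(opener.addheaders)    # e.g., [('User-agent', 'Python-urllib/3.12')]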
So, what can we do?
One option would be to go to the Web page, save it to a file, and then invoke read_html on that file. After all, all of the read_* methods can work on a URL, a filename, a file object, or a string.
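For example, assuming a hypothetical filename (missiles.html) that you saved from your browser, that approach would look like this:
# read_html happily accepts a local filename instead of a URL
dfs = pd.read_html('missiles.html')
df = dfs[0]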
But what if we try to download it with another HTTP client, one which won't announce its presence quite so loudly or problematically? I'm talking about requests, one of the most popular Python packages, which acts as an HTTP client.
I tried requests, and it worked just great! I downloaded the content from the page with requests.get, assigning its return value to the variable response. I then asked for response.content, which gave me a byte string (not a regular string!) of what it retrieved from the site. I handed that to read_html, which returned a list of data frames – in this case, a 1-element list, since the page contained only a single HTML table. And then I used [0] to get the first (and only!) data frame:
import requests

response = requests.get(url)
df = (
    pd
    .read_html(response.content)[0]
)
We now have a data frame. But as is often the case after importing data, we're going to need to clean it up. For starters, the first row contains the column names. We could pass header=0 to read_html, and it would actually work – but we would still have to clean up our column names, removing the digits (i.e., footnote numbers) from the names.
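Just to sketch that alternative (it's not the route I'll take below), it might look like this:
# Hypothetical alternative: let read_html treat row 0 as the header row,
# then strip the footnote digits (and asterisks) from the column names
df = pd.read_html(response.content, header=0)[0]
df.columns = df.columns.str.rstrip('0123456789*')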
We can grab the first row with .iloc[0], and we can then remove digits and asterisks from the right side of each of the values with str.rstrip, passing a string containing only digits and an asterisk. (str.rstrip is a variation on str.strip that only looks at the right side of the string.)
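Here's a tiny illustration, using made-up values, of what str.rstrip does with that argument:
s = pd.Series(['Russia1', 'Iran23*', 'France'])

# Strips any trailing run of digits and asterisks from each string
s.str.rstrip('0123456789*')    # 'Russia', 'Iran', 'France'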
But then what? The result of invoking str.rstrip is a series of strings. We want those to be the data frame's column names. We can do that by invoking the set_axis method on the data frame, handing it the result of invoking iloc and str.rstrip.
But how can we tie this all together? We want to set the column names based on row 0 – but only after we've done some transformations on row 0 with str.rstrip.
My solution was to use pipe, which lets us invoke a function that takes a data frame as an argument. Our function can then invoke set_axis, passing it the result of our str.rstrip as an argument. Since this will be a one-off function, we can use a lambda for it. Here's how that will look:
response = requests.get(url)
df = (
    pd
    .read_html(response.content)[0]
    .pipe(lambda df_: df_.set_axis(
        df_.iloc[0].str.rstrip('0123456789*'),
        axis='columns'))
)
After executing the above code, our data frame's column names are what we want – the first row's strings, minus the footnote numbers.
But the column names are still in the first row, and we don't want them there. We can remove them with drop, to which we can pass the index of the row we want to remove – 0, in this case. Note that I included the axis='rows' keyword argument, even though that is the default value:
response = requests.get(url)
df = (
    pd
    .read_html(response.content)[0]
    .pipe(lambda df_: df_.set_axis(
        df_.iloc[0].str.rstrip('0123456789*'),
        axis='columns'))
    .drop(0, axis='rows')
)
Next, I decided to remove the footnote numbers from the Country and Status columns, mostly because they were the easiest to deal with. (The others, such as footnotes after missile names, were extremely tricky, in no small part because many missile names include numbers, which makes it hard to find the border between name and footnote.) I once again decided to use str.rstrip, applying it to each column. To assign the result back to the column, I used assign, which adds or replaces a column in the returned data frame.

After making this change to Country, I used set_index to make that column into the data frame's index:
response = requests.get(url)
df = (
    pd
    .read_html(response.content)[0]
    .pipe(lambda df_: df_.set_axis(
        df_.iloc[0].str.rstrip('0123456789*'),
        axis='columns'))
    .drop(0, axis='rows')
    .assign(Country=lambda df_:
                df_['Country'].str.rstrip('0123456789*'),
            Status=lambda df_:
                df_['Status'].str.rstrip('0123456789*'))
    .set_index('Country')
)
The resulting data frame had 171 rows and 4 columns.
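As a quick sanity check:
df.shape    # (171, 4)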
Now clean up the data in the "Range" column, such that it contains only numbers. If a cell contains two numbers to indicate high and low parts of the range, then just keep the high part. What dtype is most appropriate here?
We humans can easily look at the Range column and understand the range of each listed missile. However, each value in the column ends with the string ' km', indicating how many kilometers the missile can travel to hit a target. In many cases, the range is listed with two numbers, a minimum and a maximum.

I wanted to perform numeric calculations on the Range column, and that meant somehow turning its values into numbers.
I decided to do the easiest things first:
- I used str.removesuffix to remove the ' km' suffix from each row
- I removed commas (','), which were used in some numbers to separate groups of three digits from one another
- I removed '+', which was used in some values to indicate that the range was greater than the maximum number listed
Here's how that looks:
df['Range'] = (
    df['Range']
    .str.removesuffix(' km')
    .str.replace(',', '')
    .str.replace('+', '')
)
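One version-dependent caveat: in modern versions of Pandas, str.replace treats its pattern as a literal string by default, but in older versions the default was regex=True – and since + is a regex metacharacter, the literal replacement above would fail there. If you might be running an older Pandas, it's safer to be explicit about that one step:
# Being explicit keeps '+' from being interpreted as a regex quantifier
df['Range'] = df['Range'].str.replace('+', '', regex=False)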
How can we take ranges of numbers, e.g., "2-100", and turn them into just "100"? Regular expressions, of course! I used str.replace with regex=True, using the search pattern r'^\d+-(\d+)':
- ^ anchors the regular expression to the start of the string
- \d+ means 1 or more numeric digits
- the hyphen matches a literal - character
- (\d+) again means 1 or more digits. But more importantly, because these digits are in parentheses, we have "captured" them, and can reuse them later.
Our replacement string is simply \1, meaning, "whatever we captured in the parentheses before." So if our original value is 2-100, the replacement value is 100. It's still a string, but closer to what we need to turn this into a numeric column.
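Here's the same search-and-replace in miniature, using Python's re module directly:
import re

re.sub(r'^\d+-(\d+)', r'\1', '2-100')    # returns '100'
re.sub(r'^\d+-(\d+)', r'\1', '300')      # no match, so '300' comes back unchanged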
Notice, by the way, that I'm using a raw string, with an r before the opening quote. That tells Python not to interpret the backslashes as escape characters, so they arrive intact at the regular expression engine, avoiding errors and warnings.
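The replacement string is a good example of why the r matters: without it, Python treats \1 as an octal escape, producing an entirely different (one-character) string:
len('\1')     # 1 -- Python interprets \1 as chr(1)
len(r'\1')    # 2 -- the raw string keeps the backslash and the digit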
Here's what our code looks like so far:
df['Range'] = (
    df['Range']
    .str.removesuffix(' km')
    .str.replace(',', '')
    .str.replace('+', '')
    .str.replace(r'^\d+-(\d+)', r'\1', regex=True)
)
With our strings fixed, we should now be able to invoke astype and get back a numeric type. Most obvious would be an integer type, perhaps int16, which would cover the largest range. But if you try to invoke astype('int16'), you discover that we have forgotten one value, namely the string Unknown. What value should we put instead of Unknown? The answer, of course, is NaN, the "not a number" value that isn't a value. Or the non-value that is a value. The most important thing to remember is that NaN is a float, and that it cannot be coerced into an integer. So say "goodbye" to using an int; we'll have to use float32 instead.
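Both halves of that claim are easy to verify (the exact error wording varies across Pandas versions):
import numpy as np
import pandas as pd

type(np.nan)                          # <class 'float'>
pd.Series([np.nan]).astype('int16')   # raises: cannot convert non-finite values (NA or inf) to integer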
We could thus say:
import numpy as np

df['Range'] = (
    df['Range']
    .str.removesuffix(' km')
    .str.replace(',', '')
    .str.replace('+', '')
    .str.replace(r'^\d+-(\d+)', r'\1', regex=True)
    .replace('Unknown', np.nan)
    .astype('float32')
)
But wait a second: Why am I, at this stage of the game, transforming the string 'Unknown' into NaN? If the authors of this data set had used one of the strings that Pandas recognizes as NA (such as 'NA' or 'null') instead of 'Unknown', it would have become NaN automatically.
And that's when it hit me: I went back to the previous problem, and rewrote my call to read_html to include the na_values keyword argument. I then specified that 'Unknown' should be considered NaN when the file is read into a data frame:
response = requests.get(url)
df = (
    pd
    .read_html(response.content, na_values=['Unknown'])[0]
    .pipe(lambda df_: df_.set_axis(
        df_.iloc[0].str.rstrip('0123456789*'),
        axis='columns'))
    .drop(0)
    .assign(Country=lambda df_:
                df_['Country'].str.rstrip('0123456789*'),
            Status=lambda df_:
                df_['Status'].str.rstrip('0123456789*'))
    .set_index('Country')
)
I was then able to trim this latest query; after all of the string transformations, I simply used astype to set things to float32:
df['Range'] = (
    df['Range']
    .str.removesuffix(' km')
    .str.replace(',', '')
    .str.replace('+', '')
    .str.replace(r'^\d+-(\d+)', r'\1', regex=True)
    # .replace('Unknown', np.nan)
    .astype('float32')
)
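And a final check that the column's dtype is what we asked for:
df['Range'].dtype    # dtype('float32')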
We now have a data frame with the right column names, the right index, and a numeric Range column on which we can perform some calculations.