This week, we looked at the history of presidential pardons in the United States. The US Constitution allows presidents to pardon people for federal offenses, or to commute their sentences, almost without limit.
Every president in recent memory has issued a number of questionable, self-serving, and/or politically expedient pardons – but Donald Trump, not even one year into his second term, has gone beyond his predecessors, pardoning a long series of corrupt politicians and business associates.
The list includes ousted congressman George Santos; Binance head CZ, likely because of his business connections with the Trump family; and former Honduran president Juan Orlando Hernández, convicted of smuggling 400 tons of cocaine (https://www.nytimes.com/2025/12/03/nyregion/bove-trump-honduras-pardon.html?unlocked_article_code=1.508.zVbf.sfJ_aLYQd5iz&smid=url-share).
This week, we'll look at the number of pardons and commutations that each president has given, offering some insights into this weird and wild aspect of the US legal and political systems.
For more background, here's an article showing that Joe Biden granted more acts of clemency than any prior president, along with some definitions: https://www.pewresearch.org/short-reads/2025/02/07/biden-granted-more-acts-of-clemency-than-any-prior-president/. And here's an article listing all of the Republican politicians helped by Trump: https://www.newsweek.com/list-of-trumps-pardons-and-commutations-for-gop-lawmakers-10900810
Data and five questions
This week's data comes from the US Department of Justice's page about clemency, at https://www.justice.gov/pardon/clemency-statistics . It only contains data through the Biden administration, meaning that the numbers for the second Trump administration aren't yet available or included. Also, the data is in HTML tables on that page, rather than being downloadable. Fortunately, there are ways to deal with this!
Learning goals for this week include: web scraping, combining data frames, multi-indexes, grouping, and plotting. As always, I've tagged this issue with these learning goals. Go to https://www.bambooweekly.com/tags/ for a full list of tags, if you want to practice a particular aspect of Pandas some more.
Paid subscribers, including members of my LernerPython.com membership program, can scroll to the end of this message for my Marimo notebook and a one-click link that opens it in Molab. Paid subscribers are also invited to monthly office hours and can read all questions and answers.
Here are this week's five questions, along with my solutions and explanations:
Retrieve data from the DOJ's page into a data frame for each president. Then, using a manually constructed list of presidents' names (as strings), create a single data frame whose two-level index contains the president's name and the year, and whose data columns contain either Petitions Received or Petitions Granted.
I started by loading up Pandas and Plotly Express:
import pandas as pd
from plotly import express as px

I then defined two variables. The first, presidents, is a list of strings containing the names of the presidents that appear on the DOJ page. The second is the URL of that page:
presidents = ['Biden', 'Trump', 'Obama', 'W Bush', 'Clinton', 'HW Bush', 'Reagan', 'Carter', 'Ford', 'Nixon', 'Johnson', 'Kennedy', 'Eisenhower', 'Truman', 'FDR', 'Hoover', 'Coolidge', 'Harding', 'Wilson', 'Taft', 'TR', 'McKinley']
url = 'https://www.justice.gov/pardon/clemency-statistics'
Next, I invoked pd.read_html to return a list of data frames, one for each HTML table on the DOJ's page:
all_dfs = pd.read_html(url)

When I first did this, I assumed that the data frames would contain strings, integers, and floats – and they mostly did. But a number of the numeric columns were treated as strings. When I checked into things a bit more closely, I found that those columns included '-' to indicate missing values, forcing the entire column to contain strings.
I reran the query, passing na_values='-' to pd.read_html, and that did the trick:
all_dfs = pd.read_html(url, na_values='-')
Now that I have the data frames in a list, I want to combine them. The best way to do that is with pd.concat, which by default stacks them from top to bottom – precisely what we want.
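If you haven't worked with pd.concat before, here's a minimal sketch – using made-up frames, not the DOJ data – of that default top-to-bottom stacking:

import pandas as pd

# two toy frames with the same columns
df_a = pd.DataFrame({'year': [2021, 2022], 'granted': [5, 7]})
df_b = pd.DataFrame({'year': [2023, 2024], 'granted': [2, 9]})

# by default, pd.concat stacks along the rows, keeping each frame's own index
print(pd.concat([df_a, df_b]))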
But calling:
pd.concat(all_dfs)

isn't quite enough – mostly because we will want the name of the president who performed the pardon to be in the data frame. As things stand, the presidents' names are in the title above each data frame, but not in the frame itself.
I'm starting with a list of data frames. I want to have a list of data frames that are identical to the existing ones, but add a new column, president, with the president's name. This sounds like a job for a list comprehension – assuming that there's a way to add a new column to a data frame.
And there is: the assign method, which takes keyword arguments. Each keyword name is used as a new column name, and its value is assigned to the rows of that column. If you provide a scalar value, then all rows get the same value. If you provide a list or series, then each row gets its own value.
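Here's a tiny sketch of both behaviors, using made-up values:

import pandas as pd

df_small = pd.DataFrame({'granted': [5, 7, 2]})

# a scalar value: every row gets the same value in the new column
print(df_small.assign(president='Lincoln'))

# a list value: each row gets its own value
print(df_small.assign(president=['Lincoln', 'Grant', 'Hayes']))

Note that assign returns a new data frame, leaving the original untouched – which is what makes it so convenient inside a list comprehension.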
In this case, I wanted every row of the small data frame to get the same value. But how could I get the president's name?
I combined assign with the presidents list that I created above and enumerate in Python, which gives us the index of the current value:
df = (
pd
.concat([one_df.assign(president=presidents[index])
for index, one_df in enumerate(all_dfs)])
)
I used enumerate on all_dfs to get not just each of the data frames in the list, but also its index. I then retrieved presidents[index] and assigned that value to the president column. Each small data frame thus had a president column with the president responsible for those pardons. I passed the resulting list to pd.concat, and got a single data frame back.
However, that data frame needed some trimming and work. I used a list of strings inside of [] to keep only four columns – although "columns" isn't quite accurate, because the data frame has a multi-index on its columns. The four names I specified thus match the outer layer of the column names:
df = (
pd
.concat([one_df.assign(president=presidents[index])
for index, one_df in enumerate(all_dfs)])
[['president', 'Fiscal Year', 'Petitions Received', 'Petitions Granted']]
)
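To see how outer-level selection works, here's a toy frame – not the real data – with two-level columns:

import pandas as pd

columns = pd.MultiIndex.from_tuples([('Fiscal Year', 'Fiscal Year'),
                                     ('Petitions Granted', 'P'),
                                     ('Petitions Granted', 'C')])
toy = pd.DataFrame([[1901, 3, 1], [1902, 4, 2]], columns=columns)

# a name in the selection list matches every column under that outer level
print(toy[['Petitions Granted']])   # returns both the P and C sub-columns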
Next, I wanted to remove the rows with Total in their Fiscal Year values, which wouldn't be of much help, and would interfere with the year numbers. I thus used loc and lambda along with str.contains to find rows in which Fiscal Year contained the word Total. Before doing that, I had to run astype(str) on the column, making sure that I had strings everywhere; str.contains only works on string values.
However, running str.contains found the rows that did contain the word Total, and we wanted just the opposite: the rows that did not contain it. To pull that off, I put ~ before the expression. That reversed the booleans in our mask index, and thus kept only the non-total rows:
df = (
pd
.concat([one_df.assign(president=presidents[index])
for index, one_df in enumerate(all_dfs)])
[['president', 'Fiscal Year', 'Petitions Received', 'Petitions Granted']]
.loc[lambda df_: ~df_[('Fiscal Year', 'Fiscal Year')].astype(str).str.contains('Total')]
)
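To see the mask and its negation in isolation, consider this toy example:

import pandas as pd

s = pd.Series(['1901', '1902', 'Total 1901-1902'])
mask = s.str.contains('Total')   # False, False, True
print(s[~mask])                  # keeps only '1901' and '1902'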
Next, I wanted to use the year as part of the index. But even without the "total" lines, the Fiscal Year column contained many non-digit values. However, the digits always appeared at the start of the string.
Using regular expressions and str.replace, I matched four digits at the start of the string (^\d{4}), capturing them with (). I was then able to refer to those digits with \1 in the replacement string. Because the search pattern matched every character in the string, but the replacement contained only \1, I effectively removed all of the non-digit characters. That allowed me to invoke astype(int) on the values, getting an integer year column:
df = (
pd
.concat([one_df.assign(president=presidents[index])
for index, one_df in enumerate(all_dfs)])
[['president', 'Fiscal Year', 'Petitions Received', 'Petitions Granted']]
.loc[lambda df_: ~df_[('Fiscal Year', 'Fiscal Year')].astype(str).str.contains('Total')]
.assign(year = lambda df_: df_[('Fiscal Year', 'Fiscal Year')].astype(str).str.replace(r'^(\d{4}).*$', r'\1', regex=True).astype(int))
)
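To see the regular expression on its own, here's a toy series (the annotations are invented, but mimic the kinds of non-digit text on the DOJ page):

import pandas as pd

s = pd.Series(['1901', '1902 (partial)', '1903 *'])

# the pattern matches the entire string; only the captured digits survive
print(s.str.replace(r'^(\d{4}).*$', r'\1', regex=True).astype(int))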
Finally, I did a bit more housekeeping on the data frame: I set the index to be a two-level multi-index, with president and year. I sorted the index by year. And I removed the column with Fiscal Year in both parts, outer and inner:
df = (
pd
.concat([one_df.assign(president=presidents[index])
for index, one_df in enumerate(all_dfs)])
[['president', 'Fiscal Year', 'Petitions Received', 'Petitions Granted']]
.loc[lambda df_: ~df_[('Fiscal Year', 'Fiscal Year')].astype(str).str.contains('Total')]
.assign(year = lambda df_: df_[('Fiscal Year', 'Fiscal Year')].astype(str).str.replace(r'^(\d{4}).*$', r'\1', regex=True).astype(int))
.set_index(['president', 'year'])
.sort_index(level='year')
.drop(columns=[('Fiscal Year', 'Fiscal Year')])
)
The final result? A data frame with 140 rows and 10 columns.
For modern presidents, the P and C columns are used for pardons and commutations. But for earlier presidents, the Pardon and Commutation columns are used instead. Add the Pardon and Commutation numbers under Petitions Granted to the existing P and C columns, then remove the Pardon and Commutation columns.
After getting one big, beautiful data frame, I assumed that I could start to perform some analysis. But as is so often the case, real-world data is messy. It turns out, as I wrote in the question, that older records on the DOJ site used the full words Pardon and Commutation as column headers, while newer records used the one-letter P and C designations.
If I have two variables, x and y, and I want to combine them and assign the sum to x, I can say x = x + y.
If I have two series, s1 and s2 that share an index, and I want to combine them and assign the sum to s1, then I can say s1 = s1 + s2. However, there is an issue here: If there is a NaN value in either s1 or s2, then the result at that index will be NaN.
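Here's a quick demonstration of the problem, with made-up series:

import pandas as pd

s1 = pd.Series([10, None, 30])
s2 = pd.Series([1, 2, None])

# NaN appears wherever either input was missing
print(s1 + s2)   # 11.0, NaN, NaN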
One solution is to call fillna, either on the data frame or on the individual columns. Another, which I decided to use, is to invoke the add method – the method that Pandas actually calls when you use + on two series. Why use the method when we have the shorter, easier-to-understand + operator? Because the method gives us additional, optional arguments – including fill_value=0, which tells Pandas to treat NaN as 0 when performing the operation.
I thus said:
df[('Petitions Granted', 'P')] = df[('Petitions Granted', 'P')].add(df[('Petitions Granted', 'Pardon')], fill_value=0)
df[('Petitions Granted', 'C')] = df[('Petitions Granted', 'C')].add(df[('Petitions Granted', 'Commutation')], fill_value=0)
Notice that in order to address a specific multi-indexed column, I used a tuple. Each level of the tuple corresponded to a level of the multi-index.
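Here's a minimal illustration of tuple-based access, again on a toy frame:

import pandas as pd

columns = pd.MultiIndex.from_tuples([('Petitions Granted', 'P'),
                                     ('Petitions Granted', 'C')])
toy = pd.DataFrame([[3, 1], [4, 2]], columns=columns)

# the tuple names the outer level, then the inner level, selecting one column
print(toy[('Petitions Granted', 'P')])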
I then wanted to remove the existing columns. That's easy; I can use the drop method. But then I was faced with a dilemma:
- You can use inplace=True on drop and many other Pandas methods. Doing so modifies the value in place, and returns None. However, the Pandas core developers have long said that inplace=True doesn't save time or memory, and that it removes the possibility of method chaining. They strongly recommend against using it, in favor of simply assigning back to the same variable. (A toy sketch of both options follows this list.)
- Marimo can only work its magic if every variable is defined in a single cell. As a result, assigning to a variable a second time will produce an error in Marimo, even if it's legal Python.
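Here's that toy sketch of the two options, using a made-up frame:

import pandas as pd

toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# option 1: inplace=True mutates toy and returns None (discouraged)
toy.drop(columns=['b'], inplace=True)

# option 2: assign the result back – recommended, but redefining a variable
# in a second cell runs afoul of Marimo's one-definition rule
toy2 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
toy2 = toy2.drop(columns=['b'])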
Faced with this dilemma – the alternative being to go back to the first query and update it to take these needs into account – I decided to simply assign to a new variable, creatively called df2, and to use df2 moving forward. But I'll admit that it wasn't a totally comfortable feeling:
df2 = df.drop(columns=[('Petitions Granted', 'Pardon'),
                       ('Petitions Granted', 'Commutation')])
The good news, after all this, is that we know precisely how many pardons and commutations each president in the data set (from the start of the 20th century) issued. For modern presidents, we also know how many requests they got, something we'll use in later questions.