This week, in the wake of President Donald Trump's delivery of the State of the Union address on Tuesday, we looked at some background information about this annual American political ritual.
We examined data from the American Presidency Project at the University of California Santa Barbara (UCSB), where they have an online archive of SOTU speeches and data about them. (The full project is at https://www.presidency.ucsb.edu/.) We also looked at the speeches themselves from a slightly out-of-date archive, and thus explored how to work with text in Pandas.
Paid subscribers, both to Bamboo Weekly and to my LernerPython+data membership program (https://LernerPython.com), get all of the questions and answers, as well as downloadable data files, downloadable versions of my notebooks, one-click access to my notebooks, and invitations to monthly office hours.
Learning goals for this week include Web scraping, cleaning data, grouping, working with text, and plotting.
Data and six questions
Our primary data for this week's challenges will be the table of SOTU addresses from UCSB:
We will also look at the text of the State of the Union speeches through 2017, looking at patterns in language and length. Those speeches are available from Kaggle, at:
https://www.kaggle.com/datasets/rtatman/state-of-the-union-corpus-1989-2017
Here are my six tasks and questions for the week, along with my solutions and explanations:
Create a Pandas data frame from the table of State of the Union addresses at UCSB. Keep only two header rows, and keep only those data rows with actual information about each president's address. Make the presidents' names into the index, replacing any NaN values with the name that came before them. Also, have the outer layer of the multi-indexed column names be called "speech" and "written," to shorten things. Remove empty columns, as well.
I started, as usual, by loading both Pandas and Plotly. I also imported NumPy, so that I could use np.nan in my queries:
import numpy as np
import pandas as pd
from plotly import express as px

With those in place, I then thought about how to retrieve the table from the UCSB presidency site. There isn't a downloadable version, but we do have a powerful tool in read_html, which scrapes a Web page and returns a list of data frames, one for each HTML table on the page. I decided to give it a try, and discovered that there was only one table on the page, and that it made for a pretty decent data frame:
url = 'https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/annual-messages-congress-the-state-the-union#Table%20of%20SOTU'
df = (
pd
.read_html(url)[0]
)

However, there were a bunch of things to fix. For starters, the headers needed some trimming; I passed the header keyword argument with a value of [1,2], telling read_html to use those two rows as a two-level column multi-index. I also renamed two of the columns with the rename method, passing a dict to the columns keyword argument and indicating that only the outer level needed renaming by passing level=0:
df = (
pd
.read_html(url, header=[1,2])[0]
.rename(columns={'Delivered as a Speech (Address)': 'speech',
'Delivered as a Written Message': 'written'}, level=0)
)

Next, I used dropna to remove any columns containing only NaN values. This was mainly to remove a number of extra, empty columns at the end of the data frame that were annoying. I also used replace to turn 'no messages' strings into np.nan values, which will make them easier to deal with:
df = (
pd
.read_html(url, header=[1,2])[0]
.rename(columns={'Delivered as a Speech (Address)': 'speech',
'Delivered as a Written Message': 'written'}, level=0)
.dropna(thresh=1, axis='columns')
.replace({'no messages':np.nan})
)

The table is designed for humans, not machines, to read. For this reason, a president's name appears only once, with empty spaces in that column basically meaning, "still the president from above." To make sure that the column's blank spaces were filled with the presidents' names, I first used assign to create a new column, president. I passed a tuple to pd.col to represent the outer and inner parts of the column multi-index, and then invoked ffill, for "forward fill," replacing NaN values with whatever came above them.
After creating the president column, I then used set_index to make it the index, and then used drop to remove the existing, now unnecessary, column:
df = (
pd
.read_html(url, header=[1,2])[0]
.rename(columns={'Delivered as a Speech (Address)': 'speech',
'Delivered as a Written Message': 'written'}, level=0)
.dropna(thresh=1, axis='columns')
.replace({'no messages':np.nan})
.assign(president = pd.col(('President', 'Unnamed: 0_level_1')).ffill())
.set_index('president')
.drop(columns=('President', 'Unnamed: 0_level_1'))
)

Finally, I used .iloc[:-13] to keep all but the final 13 rows that had been read from the site. Those rows contain footnotes explaining various issues that occurred in the data, which then threw off the data types:
df = (
pd
.read_html(url, header=[1,2])[0]
.rename(columns={'Delivered as a Speech (Address)': 'speech',
'Delivered as a Written Message': 'written'}, level=0)
.dropna(thresh=1, axis='columns')
.replace({'no messages':np.nan})
.assign(president = pd.col(('President', 'Unnamed: 0_level_1')).ffill())
.set_index('president')
.drop(columns=('President', 'Unnamed: 0_level_1'))
.iloc[:-13]
)

The result was a data frame with 67 rows and 11 columns. Note that most remaining columns contain years, but they still have str dtypes, because some of the years have markers (e.g., asterisks) after them that force the use of a string.
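If you wanted numeric years, one approach (a minimal sketch with hypothetical sample values, not part of the original solution) is to strip the trailing asterisks with str.rstrip, then convert with pd.to_numeric, passing errors='coerce' so that anything unparseable becomes NaN rather than raising an exception:

```python
import pandas as pd

# Hypothetical sample of year values as they appear in the scraped table;
# some carry a trailing asterisk marking a non-official address
years_raw = pd.Series(['1981*', '1982', '1983', None])

# Remove any trailing asterisks, then convert to numbers; errors='coerce'
# turns values that cannot be parsed into NaN instead of raising
years = pd.to_numeric(years_raw.str.rstrip('*'), errors='coerce')
```

If a column used markers other than asterisks, you would add those characters to the rstrip argument.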
Which presidents' first appearance in the table is marked with an asterisk (indicating a non-official SOTU)?
The only first appearances that have * at the end of the year are in the speech column. To retrieve only the column for speech and 1st, you need to pass a tuple to pd.col. I then used str.endswith to produce a boolean series, which I then passed to loc.
That combination returned the rows of the data frame in which the year ended with an asterisk. However, we wanted to know which presidents had given such speeches, and those values were in the index, which, it turns out, we can retrieve via the index property on the data frame:
(
df
.loc[pd.col(('speech','1st')).str.endswith('*')]
.index
)

The result:
Index(['Donald J. Trump - II', 'Joseph R. Biden', 'Donald J. Trump - I',
'Barack Obama', 'George W. Bush', 'William J. Clinton', 'George Bush',
'Ronald Reagan'],
dtype='str', name='president')
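By the way, pd.col is a fairly recent addition to Pandas. If your version doesn't provide it, the same query can be written with a lambda passed to loc; here's a minimal sketch using a tiny, made-up data frame with the same two-level column structure:

```python
import pandas as pd

# Tiny stand-in for the real data frame: two-level columns and made-up
# year values; only the first row's ('speech', '1st') year ends with '*'
df = pd.DataFrame(
    {('speech', '1st'): ['1981*', '1977'],
     ('written', '1st'): ['1982', '1978']},
    index=pd.Index(['Ronald Reagan', 'Jimmy Carter'], name='president'),
)

# The lambda receives the data frame and returns a boolean series,
# just as pd.col(('speech', '1st')) would
result = df.loc[lambda df_: df_[('speech', '1st')].str.endswith('*')].index
```

The lambda form has the same advantage as pd.col: it refers to the data frame at that point in the method chain, rather than to a named variable.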