BW #49: Campaign finance (solution)

How much money do candidates for election raise in the US? This week, we look at data from the Federal Election Commission, shedding light on how much they raised, and how much they spent.


This week, we looked at the most recent reports from the Federal Election Commission (https://fec.gov/), which is responsible for overseeing the finances of election campaigns in the United States. This coming November, Americans will be electing a president, as well as many senators and all representatives to Congress. (There are also numerous state and local elections, but the FEC doesn’t appear to oversee their finances.)

Because each campaign has to submit its financial records to the FEC, we can get a picture of how much each has raised and spent, along with numerous other pieces of information regarding their finances. In theory, keeping campaign finances open and transparent reduces the influence of money in politics, although that would seem to be truer in theory than in practice.

If nothing else, this provided us with a new data set with which we can explore and learn not only about the world around us, but also the numerous ways in which we can use Pandas to analyze the data.

Data and 9 questions

This week’s data set is provided by the FEC. You can see all of the data sets that they offer from this page:

https://www.fec.gov/data/browse-data/?tab=bulk-data

The specific candidate data for this week’s questions come from the 2023-2024 election season, and are located here:

https://www.fec.gov/files/bulk-downloads/2024/weball24.zip

The data dictionary, which describes the fields in the data set, is available here:

https://www.fec.gov/campaign-finance-data/all-candidates-file-description/

I gave you nine tasks and questions for this week. As usual, a link to the Jupyter notebook that I used to answer these questions is at the bottom of this posting.

The data dictionary indicates the names (and meanings) of the columns in the actual candidate fund-raising data. Create a series containing the column names.

Before doing anything else, as usual, I loaded Pandas into Python and gave it the traditional alias:

import pandas as pd

This task might have seemed a bit weird. After all, a data dictionary is usually a document that tells us, the humans who are working to interpret the data, what it means. We can look at the data dictionary, understand the various types of data that we’ll be working with, know which columns we do and don’t want to load, and that’s it.

This week, I decided to take things a bit further — for reasons that became obvious in question 2. I wanted you to retrieve the data dictionary and turn it into a series based on that information.

Clearly, you could have opened the URL for the data dictionary and typed the column names into Pandas, manually creating a series.

But there’s a better way to do it, thanks to the “read_html” method in Pandas. Whereas most “read” methods in Pandas return a data frame based on a file on disk, read_html retrieves a URL and returns a list of data frames, one for each HTML table that it found on that page.

We can thus retrieve all of the data frames on the data-dictionary page with:

all_dfs = pd.read_html('https://www.fec.gov/campaign-finance-data/all-candidates-file-description/')

In the case of this page, the data frame we want is the first, i.e., at index 0. I can thus get that data frame with:

df = all_dfs[0]

However, I didn’t really want the full data frame. I just wanted the first column, which contained the names of columns for the FEC data itself. I can retrieve just the first column of that data frame with:

headers = (
    all_dfs
    [0]   # first data frame in all_dfs
    [0]   # first column in the data frame
)  

The above will work… mostly. The problem is that this includes the first row of the HTML table, “Column name”, and treats it as one of the data’s actual column headers. We want to ignore that first row; to do that, I used “iloc” and retrieved all but the first row. Notice that I pass two arguments to iloc: The first is a slice, selecting the rows that I’m interested in. The second argument is 0, the column that we want from the data frame:

headers = (
    all_dfs
    [0]
    .iloc[1:, 0]
)

The result is a series of strings, the names of the columns that we’ll soon be retrieving from the FEC site.
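To make the two-argument iloc pattern concrete, here’s a toy sketch using a small made-up stand-in for the scraped table (hypothetical data, not the actual FEC page):

```python
import pandas as pd

# A stand-in for the scraped data frame: row 0 holds the header
# text from the HTML table, rows 1+ hold the actual column names.
toy = pd.DataFrame({0: ['Column name', 'CAND_ID', 'CAND_NAME'],
                    1: ['Description', 'Candidate ID', 'Candidate name']})

# First argument: a slice of rows (skip row 0).
# Second argument: column 0, selected by position.
headers = toy.iloc[1:, 0]

print(headers.tolist())   # ['CAND_ID', 'CAND_NAME']
```

The header text “Column name” is dropped, and only the real column names remain.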

Now create a data frame based on the candidate data. Use the column names from question 1 as the column names for the data frame.

Now we’ll download the FEC data. One way to do it is to download the file, unzip it, figure out the format, and then use the appropriate method to read it into Pandas.

However, Pandas offers us an easier way: From downloading and inspecting the file, I know that it’s in CSV format, except that it’s using “|” (vertical bars) as field separators. Rather than downloading the file and then calling “read_csv”, I can just hand the URL to read_csv. Pandas will retrieve the file, open it, and even unzip it, returning a data frame:

url = 'https://www.fec.gov/files/bulk-downloads/2024/weball24.zip'

df = pd.read_csv(url, 
                 sep='|')

However, there is a bit of a problem with the above: The file does download correctly, and we do get a data frame. But the first line is assumed to contain headers, meaning that those values are being treated as column names. This is most definitely not the case. We can suppress the first line being used for headers with the “header=None” keyword argument. We can then provide names for those columns with the “names” keyword argument. And whadaya know, we can pass “headers”, the variable we just defined based on scraping the HTML table, as the value:

url = 'https://www.fec.gov/files/bulk-downloads/2024/weball24.zip'

df = pd.read_csv(url, 
                 sep='|',
                 header=None,
                 names=headers)
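To see what “header=None” and “names” do without hitting the network, here’s a toy sketch with a two-line, pipe-separated sample in the FEC file’s general shape (the candidate IDs, names, and amounts here are invented for illustration):

```python
from io import StringIO
import pandas as pd

# Hypothetical pipe-separated data with no header row,
# mimicking the format of the FEC bulk-download file.
data = StringIO('C001|Smith, Jane|150000.0\n'
                'C002|Lee, Sam|98000.5\n')

df = pd.read_csv(data,
                 sep='|',
                 header=None,    # the first line is data, not column names
                 names=['CAND_ID', 'CAND_NAME', 'TTL_RECEIPTS'])

print(df.shape)            # (2, 3)
print(df['CAND_NAME'][0])  # Smith, Jane
```

Without header=None, the first candidate’s row would have been swallowed up as column names; with it, both rows survive as data, and “names” supplies the labels.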

The resulting data frame contains 2,889 rows and 30 columns. If the data were bigger, then I might worry about paring down some of those columns, but I can live with them, given the relatively small size (about 1.5 MB, according to df.memory_usage()) of the data.
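If you want to check a data frame’s size the same way, memory_usage reports bytes per column; passing deep=True also counts the contents of string columns, which the default estimate ignores. A small sketch with made-up data:

```python
import pandas as pd

# Toy frame standing in for the (much larger) FEC data
df = pd.DataFrame({'CAND_NAME': ['Smith, Jane', 'Lee, Sam'],
                   'TTL_RECEIPTS': [150000.0, 98000.5]})

# Per-column memory usage, in bytes (includes the index)
print(df.memory_usage(deep=True))

# Total memory, in bytes
print(df.memory_usage(deep=True).sum())
```

Summing the per-column values gives the total footprint, which is how you’d arrive at a figure like “about 1.5 MB.”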

We’ve now loaded the financial information into memory. Note that this file is not cumulative; it only contains data for the most recent FEC reporting period. We could, in theory, download previous reports and then compare them. But we’ve got enough interesting questions to answer here that I don’t see that as necessary.