BW #27: Young voters (solution)
Young Americans don't typically vote. This week, we look at the Harvard Youth Poll, including whether they think voting matters, how afraid they have been feeling, and where they get their news.
This week, we looked at young people in the US, and some of the results from the latest Harvard Youth Poll (https://iop.harvard.edu/youth-poll/45th-edition-spring-2023). The poll asked so many interesting questions that it was hard to choose just a few topics — but I hope that it gave you a taste for this data, and that you even started to look at some of the other topics it surveyed.
Data and questions
The Harvard Youth Poll’s data is reported in a single Excel document. This Excel document was messier than most I’ve seen, because it put all of the answers, to dozens of questions, in a single spreadsheet document, on a single sheet. Which meant that reading the document was… challenging and messy. But you wouldn’t be reading Bamboo Weekly if you weren’t up for a challenge, right?
I posed eight questions and tasks. Let’s get to them:
Download the Excel file.
This was, by far, the easiest task of the week! I first started up Pandas:
import pandas as pd
Then I downloaded the Excel file from the Harvard Youth Poll’s site:
Now I was ready to start working with the data itself.
Grab the data from the "Likely voter 2024" question (starting on line 168 of the Excel file) into a data frame. Set the answers ("Definitely will be voting", etc.) to be the index. Remove the "All" column and the rows for "weighted N" and "unweighted N".
The Excel document provided by HYP has two sheets. You can think of each sheet as a separate document, sort of like a zipfile with numerous CSV files inside of it. In this case, the first sheet was a list of the questions asked in the poll, and the second sheet contained the results. The questions on the first sheet were all hyperlinks to the second sheet, which was a nice touch, I’ll admit.
Let’s start with reading the second sheet into a data frame, using read_excel:
likely_voter_2024_df = pd.read_excel(filename, sheet_name=1)
The above tells Pandas to read the entire second sheet (because the first has an index of 0) into a data frame. But it turns out that we don’t really want the entire sheet. That’ll give us the data, but not in any form that we can or will use. We need the values in a column to be of the same dtype to avoid having it labeled as “object”, and the headers for each of the sub-tables in the spreadsheet are the same everywhere, we want and need to snip out just a limited number of rows for the answers to each question.
This means that we’ll need to tell Pandas:
Ignore all of the rows up to where the desired snippet begins
Ignore the rows after where that snippet ends
We can accomplish this via a combination of the “skiprows” and “nrows” keyword arguments. The first tells Pandas how many rows to skip before starting to read from the file, and the second tells Pandas how many rows to read once it has begun. We can thus say:
likely_voter_2024_df = pd.read_excel(filename, sheet_name=1, skiprows=166, nrows=6)
This works! Moreover, I asked for just 6 rows, which cuts off the bottom two rows from that analysis, namely the weighted and unweighted numbers.
But wait: I also asked you to treat the first column as the index. To do that, I’ll need to use the index_col keyword argument:
likely_voter_2024_df = pd.read_excel(filename, sheet_name=1, skiprows=166, index_col=0, nrows=6)
This is all good, but what about the column labels? The way that this data is structured, the columns will be a multi-index, with main headers and subheaders. Normally, we can pass an integer value to the “header” keyword argument, to indicate which row should be treated as a header. To get a multi-index, we simply pass a list of integers:
likely_voter_2024_df = pd.read_excel(filename, sheet_name=1, header=[0,1], skiprows=166, index_col=0, nrows=6)
We have now loaded our data, have the answers in the index, and a multi-index of columns. This is all great, except for one thing, namely the “All” column at the start of the data frame. And yes, we could just live with it there, but I thought it might be nice to get rid of it.
Normally, we could use the “drop” method to get rid of a column. But doing that with a multi-index is annoying; there’s only a name at the multi-index’s lower level. So how exactly can we drop that column?
Well, it turns out that “drop” lets us specify the level at which we want to match the name. So by saying “All” (the value in the lower level) and then passing level=1 along with axis="columns", we’ll be all set:
likely_voter_2024_df = pd.read_excel(filename, sheet_name=1, header=[0,1], skiprows=166, index_col=0, nrows=6).drop('All', level=1, axis='columns')
Sure enough, this creates a data frame with the rows and columns we want, ready for analysis.
If you opened the file in Excel, by the way, you might have noticed that the numbers were all percentages. Why are they not displayed with percent signs in Pandas? Anyway, shouldn’t they be strings, if they have percent signs?
The answer is that Excel distinguishes between its data and how it’s displayed. (We can do that in Pandas to some degree, also, I’ll admit.) So when you see something with a % after it in Excel, it’s still a float, just being shown as a percentage. And when the data is read into Pandas, it’s kept as a float — and then displayed as one. So you don’t need to turn it from a string into a float, or the like.
Keep reading with a 7-day free trial