Pandas grouping practice questions — Bamboo Weekly #69 (Election participation)

This week's headlines have been full of election news. Yes, a major story has been that Donald Trump will be the first convicted felon to be nominated for the presidency by a major party in the US. But we just got results from elections in Mexico, where Claudia Sheinbaum was won the presidential election (https://www.nytimes.com/2024/06/02/world/americas/mexico-election.html?unlocked_article_code=1.xU0.Cwip.UwcN7wsJMugP&smid=url-share).

And we're getting results from India, where it looks like Narendra Modi has won a third term as prime minister, albeit with a weakened share of the parliament (https://www.nytimes.com/2024/06/04/world/asia/modi-india-election.html?unlocked_article_code=1.xU0.L3Lg.GpC3JHeHjBen&smid=url-share). And in just one month, the United Kingdom will hold a general election, one in which the ruling Conservative party appears to be heading for electoral disaster.

I considered looking at these individual countries' elections, and we might yet do that in future weeks. But I found an interesting data set from the International Institute for Democracy and Electoral Assistance (“International IDEA,” at https://idea.int) that describes voter turnout at countries around the world. We can look at which countries' citizens vote most often.

But wait, there's a catch: Over the last year, I've heard quite a bit about DuckDB. While I was at PyCon US in Pittsburgh, I heard people talk about it even more, and started to learn about it. And then I got e-mail from a Bamboo Weekly subscriber who suggested that I mention DuckDB. And so, this week, we'll use a combination of Pandas and DuckDB to answer questions about voter participation around the world. I'll add that this week's questions only scratch the surface of what we can do when using Pandas and DuckDB together; I expect to use them more in future issues, and (as always) welcome suggestions regarding what topics you would like to see.

Unfamiliar with DuckDB? Simply put, it's an in-memory, in-process database. So when you use it inside of a Python program, it's inside of the same process, with no client, server, or file. It has a complete and rich implementation of SQL, and can run queries on Pandas data frames. A quick tutorial is here: https://duckdb.org/2021/05/14/sql-on-pandas.html

I'll add that many of this week's questions, because they're using DuckDB, expect you to have at least some familiarity with SQL. My apologies if that makes the problems too hard to solve, but I will explain the SQL queries in tomorrow's solutions.

Data and five questions

This week's data comes from International IDEA's voter turnout database, whose home page is here:

https://www.idea.int/data-tools/data/voter-turnout-database

Go to that page, and click on the "Export data" button. This will download an Excel file. We're interested in the first ("All") sheet in the document.

Here are my five questions and tasks for this week. The learning goals include: Working with DuckDB and SQL, grouping, pivot tables, and working with datetime data.

I’ll be back tomorrow with my detailed solutions, including my Jupyter notebook.

Read the Excel file into Pandas as a data frame. However, turn the "Year" column into a datetime column, the numeric columns into float (removing "%" and "," characters), and the "Election type" and "Compulsory voting" columns into categories. How much memory do we save in making such changes?
Using Pandas, which five countries with non-compulsory voting has the highest mean VAP turnout? Now use DuckDB to calculate the same thing. How long did each query take?

BW #69: Election participation

Data and five questions

BW #68: Dangerously hot weather (solution)

BW #69: Election participation (solution)

BW #69: Election participation

Data and five questions

Read next

BW #68: Dangerously hot weather (solution)

BW #69: Election participation (solution)