Pandas grouping practice solutions — Bamboo Weekly #69 (Election participation)

[Another late issue! Lots of training and meetings about training today. Also, this issue took a lot of time to write and analyze. But if you’ve always been curious about DuckDB, how its queries stack up to Pandas, and where it’s faster and slower — I think that the wait was worthwhile, with the longest analysis and description section I’ve ever written, at 4,500 words. Thanks, as always, for your support!]

The last week has been full of election news, from Donald Trump’s new status as a convicted felon, to the upcoming elections in the United Kingdom, to the election of a new president of Mexico, to weaker support in India for Narendra Modi in what will likely be his third term as prime minister.

Rather than look at one particular country’s election data, I thought it might be appropriate to analyze data across a large number of countries. Moreover, I decided that this would be a good time to look into DuckDB, an in-memory analytical database that has been growing steadily in popularity over the last few years. It was mentioned on a number of occasions at PyCon US, and I thought that this might give us a good opportunity to look at DuckDB, comparing its syntax, capabilities, and speed with what we have in Pandas.

Traditional relational databases divide tasks between a server (where the data is stored, and where queries are executed) and a client (where the queries are formulated). In the simplest case, you have two processes, one server and one client. However, you can have multiple clients submitting queries to a single server. And it’s now common to have multiple servers working together. You can thus have a large number of processes, often located on different computers, communicating with one another.

By contrast, DuckDB doesn’t even consume one process. It’s an in-process database, meaning that it is loaded as a library into whatever program wants to use it, without any external connections. Granted, that process will then grow in size to accommodate DuckDB’s functionality, as well as the data that you’re analyzing. But you won’t have the overhead of having to transfer data across processes.

Moreover, DuckDB works with a variety of in-memory data structures, including Pandas data frames. You can thus load a CSV or Excel file into DuckDB — but you can also perform SQL “select” queries directly from a data frame.

A quick tutorial on DuckDB is here: https://duckdb.org/2021/05/14/sql-on-pandas.html. If you have any experience with SQLite or PostgreSQL, you’ll likely feel at home with DuckDB’s functionality and syntax. And if you haven’t ever used SQL before, then DuckDB isn’t a bad way to start.

Data and five questions

This week's data comes from International IDEA's voter turnout database, whose home page is here:

https://www.idea.int/data-tools/data/voter-turnout-database

Go to that page, and click on the "Export data" button. This will download an Excel file. We're interested in the first ("All") sheet in the document.

I gave you five questions and tasks this week, many of which asked you to perform the same task twice — once with Pandas, and a second time with DuckDB. The point was both to compare the syntax and to get a sense of whether DuckDB would offer any performance help. To some degree, I knew going into these tests that they would be unfair, because the data set is relatively small; I will soon be doing some tests on a much larger data set, to see how much that is affected.

Here are my questions for this week. As always, a link to my Jupyter notebook is at the end of the newsletter.

Read the Excel file into Pandas as a data frame. However, turn the "Year" column into a datetime column, the numeric columns into float (removing "%" and "," characters), and the "Election type" and "Compulsory voting" columns into categories. How much memory do we save in making such changes?

Let’s start off by loading Pandas:

import pandas as pd

Next, because it’s an Excel file, I’ll use “read_excel” to load it into Pandas:

filename = 'idea_export_voter_turnout_database_region.xlsx'

df = pd.read_excel(filename, sheet_name='All')

Note that I pass the “sheet_name” keyword argument, allowing me to indicate which sheet (or multiple sheets, if I want) I would like to get back as a data frame. It might seem like I’m somehow asking for all of the sheets in the Excel file to be returned as a data frame, but actually, the first sheet has a name of “All”. We thus get back a data frame containing 3,649 rows and 13 columns.

How much memory does this take up? We can find out by invoking “memory_usage” on our data frame, which returns the number of bytes used by each column. We can then use “sum” to find the total amount of memory used by the data frame.

Note that we need to pass the “deep=True” keyword argument in order to ensure that Python-object columns are calculated correctly:

df.memory_usage(deep=True).sum()

The result? 2,622,423 bytes, or about 2.5 MB.

I’ve found that data read from a CSV file usually needs some help and hints to determine the dtype of at least a few columns. Excel files, by contrast, support a rich set of data types, most of which map quite naturally to the dtypes in Pandas. Let’s check to see what dtypes we got:

Country                  object
ISO2                     object
ISO3                     object
Election Type            object
Year                     object
Voter Turnout            object
Total vote               object
Registration             object
VAP Turnout              object
Voting age population    object
Population               object
Invalid votes            object
Compulsory voting        object
dtype: object

Ah, I see — they are all of “object” dtypes, meaning that they were all imported as strings. That’s unusual, and I’m guessing that this is because the data was originally in a textual format (e.g., CSV), and was imported into Excel without doing any cleaning or setting of types.

It’s tempting to say that we can resolve this situation by just passing a few arguments to “read_excel”, telling it to parse the “Year” column as a date, and many of the rest as floats. But that’s sadly impossible, because so many of the number-wannabe columns contain dollar signs and commas, which aren’t legal there.

So we’ll have to go through a bunch of columns, removing non-digit characters, and then turning them into float columns. (Why float? Because they might well contain NaN values. Although as we saw last week, we could use nullable types…) That’s annoying, but it works.

I’ll thus go back to loading the Excel file, but with an important addition to our call to “read_excel”:

df = pd.read_excel(filename,
                   sheet_name='All',
                   parse_dates=['Year'])

Next, I’ll iterate over a bunch of columns, invoking “str.replace” on each to remove commas and percent signs, followed by “astype(‘int’)”, then assigned back to the original variable.

for one_colname in convert_to_float:
    df[one_colname] = (df
                       [one_colname]
                       .str.replace(r'[,%]', '', regex=True)
                       .astype('float')
                      )

Finally, I took two columns that contained (repeated) text, and made each a bit more efficient by turning them into Pandas categories.

convert_to_category = ['Election Type', 'Compulsory voting']

for one_colname in convert_to_category:
    df[one_colname] = df[one_colname].astype('category')

After all this, which made the data frame more robust, have we also managed to make it smaller? I again run

df.memory_usage(deep=True).sum()

The new size is 829,452 bytes, or about 31 percent of the original size. Which means that getting our data into good dtypes not only ensures we can perform the calculations we want and need, but that we’ve saved a lot of memory, too.

Here are the final dtypes:

Country                          object
ISO2                             object
ISO3                             object
Election Type                  category
Year                     datetime64[ns]
Voter Turnout                   float64
Total vote                      float64
Registration                    float64
VAP Turnout                     float64
Voting age population           float64
Population                      float64
Invalid votes                   float64
Compulsory voting              category
dtype: object

Using Pandas, which five countries with non-compulsory voting has the highest mean VAP turnout? Now use DuckDB to calculate the same thing. How long did each query take?

I’ve often heard people complain that too few people vote in elections. I was thus curious to know in which countries the greatest proportion of voters do actually vote. However, I had to remove countries in which there is mandatory voting, since that would clearly skew the numbers.

To calculate this in Pandas, we’ll first remove rows referring to countries in which there is compulsory voting. I did this using a combination of “loc” and “lambda”, looking for rows in which the “Compulsory voting” column had a “No” value:

(
    df
    .loc[lambda df_: df_['Compulsory voting'] == 'No']
)

Next, I performed a “groupby” operation, asking for the mean value of “VAP Turnout” for each unique value in the “Country” column:

(
    df
    .loc[lambda df_: df_['Compulsory voting'] == 'No']
    .groupby('Country')['VAP Turnout'].mean()
)

This produced a series of 185 elements, with country names as the index and the mean voting percentage for each country as the values. However, I was only interested in the five countries with the greatest percentage. One option would be to use “nlargest”, but I decided here to use a combination of “sort_values” and “head”. Note that I had to indicate that I wanted the sort to be in descending order:

(
    df
    .loc[lambda df_: df_['Compulsory voting'] == 'No']
    .groupby('Country')['VAP Turnout'].mean()
    .sort_values(ascending=False)
    .head(5)
)

The result was:

Country
Croatia                         315.038800
North Macedonia, Republic of    293.369565
Somalia                         129.466667
Cook Islands                    100.640000
Viet Nam                         96.898000
Name: VAP Turnout, dtype: float64

Um, wait a second. How can there be 315 percent voter turnout? We might be aiming for high turnout, but that seems a bit … high, no?

The FAQ at https://www.idea.int/data-tools/data/voter-turnout-database indicates that VAP (“voting age population”), along with turnout and other numbers, are often estimates, and that they aren’t always updated in sync. That said, we can probably assume that voter turnout in Croatia and Maceconia is quite high, even if not quite 315 percent.

What about the next three? Well, Somalia isn’t exactly a bastion of open democracy. The Cook Islands have a population of 15,000 people, so it seems likely to me that they have high turnout rates. And Vietnam … it’s a one-party state, so I’m not sure who people are voting for, but it’s good to know (I guess) that they’re voting in high percentages.

Bottom line, having a very high percentage of the public coming to vote doesn’t necessarily mean that you’re a model democracy.

How long did it take Pandas to perform this query? I use the “%%timeit” magic command in Jupyter, and found that it took, on average, 1.04 ms to execute my query.

What about DuckDB? How do we even use DuckDB, for that matter?

DuckDB is, as I wrote above, an in-memory relational database that uses SQL. It uses standard SQL queries to create tables, update them, and retrieve from them. However, the tables all reside in memory. Moreover, it’s a columnar database, meaning that (like Pandas) it is structured primarily along columns, rather than along rows, as traditional databases did. It’s not meant for high-speed transactions; rather, DuckDB is designed for high-performance, in-process analysis of data.

Because DuckDB is in the same process, it has access to all of the objects in memory. A nice side effect, particularly useful to us, is that it can thus query a Pandas data frame as if it were a database table. Pandas queries are often similar to SQL, and I’m sure that some database veterans have long wished that they could use SQL to query their data frames. Well, your wish has come true!

You’ll first have to install DuckDB on your computer. On my Mac, I use Homebrew for open-source installations. I thus added it with

brew install duckdb

However, I also needed to install the Python bindings for it. That was equally easy:

pip install duckdb

Inside of my Jupyter notebook, I then wrote:

import duckdb

Once that was done, I was able to query my data frame:

duckdb.query('''SELECT Country, mean("VAP Turnout") as mvt
                FROM df
                WHERE "Compulsory voting" = 'No'
                GROUP BY Country
                ORDER BY mvt DESC
                LIMIT 5''')

The above SQL query basically does what my Pandas query did:

I start with SELECT, indicating what columns and values I want to get back. I asked for the country name and the mean of the “VAP Turnout” column, aliased to the name “mvt”. The alias will make it easier to order by that column.
I indicate that I want to retrieve from “df”. That’s right; I just name the variable in which my data frame is located, and DuckDB takes care of the rest. Wow!
I add a condition, indicating that I only want rows without compulsory voting. Notice that SQL, unlike Python, distinguishes between single quotes and double quotes: Single quotes are for strings, whereas double quotes allow us to reference tables and columns whose names contain spaces.
We then group by the country, indicating that we want a result for each unique value in the Country column
We then order the results by the “mvt” alias we previously created, in descending order
We only want the top 5 results.

Those results are:

┌──────────────────────────────┬────────────────────┐
│           Country            │        mvt         │
│           varchar            │       double       │
├──────────────────────────────┼────────────────────┤
│ Croatia                      │ 315.03880000000004 │
│ North Macedonia, Republic of │  293.3695652173913 │
│ Somalia                      │ 129.46666666666667 │
│ Cook Islands                 │             100.64 │
│ Viet Nam                     │             96.898 │
└──────────────────────────────┴────────────────────┘

We basically got the same results, although we are seeing some slight differences in the float precision. Notice that we get the results back as a DuckDB table, as in all databases. Want to get back a Pandas data frame? No problem; just call “to_df”:

                        Country         mvt
0                       Croatia  315.038800
1  North Macedonia, Republic of  293.369565
2                       Somalia  129.466667
3                  Cook Islands  100.640000
4                      Viet Nam   96.898000

And how long does DuckDB take to do this calculation? 2.22 ms on average, about twice as long as Pandas.

BW #69: Election participation (solution)

Data and five questions

Read the Excel file into Pandas as a data frame. However, turn the "Year" column into a datetime column, the numeric columns into float (removing "%" and "," characters), and the "Election type" and "Compulsory voting" columns into categories. How much memory do we save in making such changes?

Using Pandas, which five countries with non-compulsory voting has the highest mean VAP turnout? Now use DuckDB to calculate the same thing. How long did each query take?

BW #69: Election participation

BW #70: Moon missions

BW #69: Election participation (solution)

Data and five questions

Read the Excel file into Pandas as a data frame. However, turn the "Year" column into a datetime column, the numeric columns into float (removing "%" and "," characters), and the "Election type" and "Compulsory voting" columns into categories. How much memory do we save in making such changes?

Using Pandas, which five countries with non-compulsory voting has the highest mean VAP turnout? Now use DuckDB to calculate the same thing. How long did each query take?

Read next

BW #69: Election participation

BW #70: Moon missions