Reminder: I'll be holding my "AI-powered Pandas Practice Workshop" on Monday, February 9th. In this course, you'll not only get better at understanding and writing Pandas queries. You'll learn how to use Claude Code to improve how you write and check your queries. Join me, and improve your Python, Pandas, and use of AI. Learn more here: https://lernerpython.com/product/ai-pandas-workshop-1/
Any questions? Just e-mail me at reuven@LernerPython.com.
This week, we're looking at data about the Winter Olympics. The 25th games (https://en.wikipedia.org/wiki/2026_Winter_Olympics) will open tomorrow in northern Italy. We'll look at information about previous games and athletes, making comparisons and generally trying to learn what we can about these winter sports.
Given the winter theme, I thought it would also be appropriate to compare the style and speed of Pandas with Polars, another data-analysis tool. (I mean, polar bears like winter, right?) I'll thus ask you to perform each of these analyses twice, once in Pandas and once in Polars. We can compare the query styles, and also the speed with which the queries are processed.
Paid subscribers, including members of my LernerPython+data program, get all of the questions and answers, as well as downloadable data files, downloadable versions of my notebooks, one-click access to my notebooks, and invitations to monthly office hours.
Learning goals for this week include working with CSV files, cleaning data, joins, pivot tables, and using Polars.
Data and five questions
This week's data comes from a data set on GitHub from developer Keith Galli, at https://github.com/KeithGalli/Olympics-Dataset. We will use several of the files in the "clean-data" section of this repository, specifically:
- bios.csv, with information about the athletes' biographies
- noc_regions.csv, listing the regions in the Olympic games (what we would normally call "countries," except that there isn't a perfect overlap between the notion of a team and a country)
- results.csv, with the results from the Olympic games
Here are my five questions, along with my solutions and explanations:
Load each of the three files into a Pandas data frame, keeping only the rows for the Winter Olympics when reading results.csv. Make sure that the born_date and died_date columns (in bios.csv) are datetime values.
How many times has the Winter Olympics taken place? How many different athletes have competed? What country has won the greatest number of medals? (Show the country name, not the NOC code.) What country has won the greatest number of gold medals? (Again show the country name, not the NOC code.)
I started things off by loading both Pandas and Polars, as well as Plotly Express:
import pandas as pd
import polars as pl
from plotly import express as px
Then I wanted to load the files into Pandas data frames. I did that directly from GitHub, using the fact that read_csv in Pandas (and all of its read_* methods) can retrieve from a URL, not just from a filename. However, because these files are stored on GitHub, it's important to grab them from the "raw" link, which serves the actual CSV contents rather than the HTML page that GitHub wraps around them.
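To make the distinction concrete, here's a minimal sketch; the "blob" URL is my reconstruction of the usual GitHub page address, while the raw URL is the one used throughout this post:
# GitHub's page URL serves HTML around the file, which read_csv can't parse:
page_url = 'https://github.com/KeithGalli/Olympics-Dataset/blob/master/clean-data/bios.csv'

# The "raw" URL serves the CSV contents themselves, which is what we want:
raw_url = 'https://github.com/KeithGalli/Olympics-Dataset/raw/refs/heads/master/clean-data/bios.csv'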
The first file, bios.csv, contained information about each athlete. Here was my initial stab at downloading it. I used the PyArrow engine for reading CSV files, which is often faster, and generally picks up on datetime fields automatically:
bios_url = 'https://github.com/KeithGalli/Olympics-Dataset/raw/refs/heads/master/clean-data/bios.csv'
bios_df = (
pd
.read_csv(bios_url, engine='pyarrow')
)
In this case, though, it didn't turn either born_date or died_date into a datetime column. So I passed the parse_dates keyword argument to read_csv, and the columns came out just fine, with datetime dtypes:
bios_df = (
pd
.read_csv(bios_url, engine='pyarrow',
parse_dates=['born_date', 'died_date'])
)
The bios_df data frame had, in the end, 145,500 rows and 10 columns. Remember that this included all athletes who had competed in the Olympics, including in the Summer Olympics. We'll return to that later.
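If you want to double-check what was loaded, a quick sanity check along these lines (my own addition, not part of the original solution) confirms the shape and the datetime dtypes:
# Confirm the number of rows and columns, and the dtypes of the date columns
print(bios_df.shape)                               # (145500, 10)
print(bios_df[['born_date', 'died_date']].dtypes)  # both should be datetime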
Next, I created a data frame from the teams and regions, known as NOCs. I once again used read_csv, but this time I told Pandas to make the NOC column, with a three-letter acronym for the country, into the data frame's index. This will come in handy when we want to join:
noc_regions_url = 'https://raw.githubusercontent.com/KeithGalli/Olympics-Dataset/refs/heads/master/clean-data/noc_regions.csv'
noc_regions_df = pd.read_csv(noc_regions_url, engine='pyarrow', index_col='NOC')
This resulted in a data frame of 230 rows and 2 columns. One column (region) contains the name of the region/country. The second contains notes, something we won't use. But it doesn't take much memory, and the data frame is small, so we can keep it.
Finally, I read results.csv in, also from GitHub via read_csv. However, I used the opportunity to keep only those rows in which the type column was 'Winter'. In Pandas 2.x, we would have done that with loc and a lambda. But we're in the modern Pandas 3 era, which means that we can use pd.col to retrieve one column and run a comparison on it. After keeping only rows for the Winter Olympics, I then used drop to remove the type column:
results_url = 'https://github.com/KeithGalli/Olympics-Dataset/raw/refs/heads/master/clean-data/results.csv'
results_df = (
pd
.read_csv(results_url, engine='pyarrow')
.loc[pd.col('type') == 'Winter']
.drop(columns='type')
)
This data frame has 64,509 rows and 10 columns.
This gave me the three data frames that I wanted to answer some basic questions about the Winter Olympics.
First, how many times has it taken place? We can invoke drop_duplicates on the year column of results_df, and then invoke count on the result:
(
results_df
['year']
.drop_duplicates()
.count()
)
I got a value of 27. Which... seems weird, given that tomorrow opens the 25th Winter Olympics, no?
I did some checking, and it seems like there has been some contamination of this data set! When reading results_df into Pandas, we kept only those rows with 'Winter' as the value for type, and yet still had results from 2012, 2016, and 2020 — all of which were years of the Summer Olympics, not the Winter Olympics. So those years should have been labeled Summer, but were labeled Winter, and were thus included in our data set.
I'm going to ignore that, and just keep plowing forward. But it just goes to show that you always need to check your data.
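If you want to see the contamination for yourself, a quick check (my addition, not part of the original solution) lists the distinct years that were labeled as Winter:
# List the distinct "Winter" years; 2012, 2016, and 2020 shouldn't be here,
# since those were Summer Olympics years
(
    results_df
    ['year']
    .drop_duplicates()
    .sort_values()
    .tolist()
)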
How many different athletes competed? Since we don't need the athletes' names or other details, we can just get the unique athlete_id values from results_df:
(
results_df
['athlete_id']
.drop_duplicates()
.count()
)
We get a result of 23,439.
Which country has won the greatest number of medals? This is a perfect use of a groupby query. We invoke count (which counts non-NaN values) on the medal column, grouping by noc:
(
results_df
.groupby('noc')['medal'].count()
.nlargest(10)
)
We get results... but the question asked us to show the team names, not the abbreviations. For this to work, we'll need to join noc_regions_df with results_df:
(
results_df
.set_index('noc')
.join(noc_regions_df)
.groupby('region')['medal'].count()
.nlargest(10)
I again started with results_df, but used set_index to turn the noc column into its index. That's because join uses the indexes of the two data frames to perform the join. (This is in contrast with merge, which can use any columns.) I then performed the join between the noc-indexed results_df and noc_regions_df, giving us a new data frame that combines the two, adding the region column.
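As an aside, the same join could be written with merge, which doesn't require setting the index first; a minimal sketch:
# Equivalent join using merge: match results_df's noc column against
# noc_regions_df's index (the NOC codes), with no set_index needed
joined_df = results_df.merge(noc_regions_df,
                             left_on='noc', right_index=True)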
We can then use groupby on region, rather than noc, counting the number of times each team won a medal. To get the 10 highest-medalling countries, we use nlargest(10). The results are:
region medal
Russia 833
Germany 806
Canada 800
USA 787
Norway 603
Finland 516
Sweden 516
Austria 372
Switzerland 353
Czech Republic 258
What if we restrict our query to gold medals? The query is similar, but adds a loc line that keeps only those rows in which medal == 'Gold'. While I could have done this filtering at nearly any stage before the groupby, I decided that removing the majority of rows early on would speed up the query a bit:
(
results_df
.loc[pd.col('medal') == 'Gold']
.set_index('noc')
.join(noc_regions_df)
.groupby('region')['medal'].count()
.nlargest(10)
)
The result:
region medal
Russia 408
Canada 360
Germany 289
Norway 225
USA 219
Sweden 159
Austria 115
Switzerland 113
Finland 95
South Korea 80
Repeat all of the above, but with Polars. What is the speed difference for each query?
First, I loaded all of the data frames. I used pl.read_csv, which has a try_parse_dates keyword argument, and (like Pandas) can read directly from a URL. I set that argument to True, and sure enough, it read the date columns into datetime values:
bios_pl_df = (
pl
.read_csv(bios_url,
try_parse_dates=True)
)
Getting the regions into a data frame was straightforward:
noc_regions_pl_df = pl.read_csv(noc_regions_url)
Because Polars doesn't have indexes, there was no need (or possibility) of passing index_col to pl.read_csv.
Finally, I loaded up the results.csv file. Notice that filtering in Polars is done with the filter method, similar to what we often do with loc. And you can see where Pandas got its inspiration, with pl.col naming a column, and then performing the comparison. Finally, I used drop to get rid of the unnecessary column:
results_pl_df = (
pl
.read_csv(results_url)
.filter(pl.col("type") == "Winter")
.drop('type')
)
How much faster was Polars? Quite a bit:
- bios.csv loaded in 1.36 s (Pandas) vs. 804 ms (Polars)
- noc_regions.csv loaded in 293 ms (Pandas) vs. 334 ms (Polars), but it's a tiny data set
- results.csv loaded in 2.47 s (Pandas) vs. 1.16 s (Polars)
Of course, I was downloading the files over the network, and the numbers might reflect network fluctuations. But for the two larger files, loading took about half as long in Polars as in Pandas.
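The timing harness isn't shown here; in Jupyter, the %%time and %%timeit magics are the usual tools, but here's a minimal stand-alone sketch using time.perf_counter (the helper name is my own):
import time

def time_it(label, fn, repeats=3):
    # Run fn several times and report the best wall-clock time,
    # to reduce the effect of network and caching fluctuations
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    print(f'{label}: {best:.3f}s')

time_it('Pandas bios.csv',
        lambda: pd.read_csv(bios_url, engine='pyarrow',
                            parse_dates=['born_date', 'died_date']))
time_it('Polars bios.csv',
        lambda: pl.read_csv(bios_url, try_parse_dates=True))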
What about the other queries?
Getting the number of Winter Olympics games was quite similar to Pandas:
(
results_pl_df
['year']
.unique()
.count()
)
Notice that Polars has a unique method. So does Pandas, but it returns a NumPy array, which is less flexible.
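As another aside, Pandas can collapse those two steps into one with nunique, which counts distinct values directly:
# One-step Pandas alternative: count distinct years directly
results_df['year'].nunique()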
We were similarly able to get the number of athletes in the Winter games:
(
results_pl_df
['athlete_id']
.unique()
.count()
But what about the number of medals for each country, printing the name? That again required a join. In Pandas, join works on the index, but Polars data frames don't have an index. So we simply join one data frame with another, specifying the left-hand column and the right-hand column to use, much like merge in Pandas.
Then I grouped, using group_by (notice the underscore!). Also notice how aggregation works: we use the agg method, passing the column (with pl.col again), and then the aggregation method we want to use.
Finally, the Polars equivalent of nlargest is top_k, where we specify both the column name and the count:
(
results_pl_df
.join(noc_regions_pl_df, left_on='noc', right_on='NOC')
.group_by('region').agg(pl.col('medal').count())
.top_k(10, by='medal')
)
Finally, I wanted to get the number of gold medals per country. That was the above query, plus a call to filter:
(
results_pl_df
.filter(pl.col('medal') == 'Gold')
.join(noc_regions_pl_df, left_on='noc', right_on='NOC')
.group_by('region').agg(pl.col('medal').count())
.top_k(10, by='medal')
)
How long did these take?
- Number of games: 2 ms (Pandas) vs. 1 ms (Polars)
- Number of athletes: 2 ms (Pandas) vs. 1 ms (Polars)
- Number of medals per country: 7 ms (Pandas) vs. 5 ms (Polars)
- Number of gold medals per country: 14 ms (Pandas) vs. 5 ms (Polars)
Once again, Polars is significantly faster. Especially when you combine filtering, joining, grouping, and sorting, you can see the big difference.
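One last note: all of these Polars queries ran eagerly. Polars also has a lazy API that optimizes the whole query plan before executing it, and it can be faster still. Here's a minimal sketch of the gold-medal query in lazy form (my addition, not part of the benchmarks above):
# Build a lazy query plan from the already-loaded data frames;
# nothing executes until collect() is called
gold_top10 = (
    results_pl_df
    .lazy()
    .filter(pl.col('medal') == 'Gold')
    .join(noc_regions_pl_df.lazy(), left_on='noc', right_on='NOC')
    .group_by('region').agg(pl.col('medal').count())
    .top_k(10, by='medal')
    .collect()
)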