Sorry for the delay in getting these out, but I'm traveling before PyCon US starts next week in Long Beach, California.
This week, we're looking at the most recent data from Reporters Without Borders (known by its French acronym, RSF). Their annual report on press freedom (https://rsf.org/en/video-2026-rsf-world-press-freedom-index) shows that things aren't so great; for the first time, more than half of the countries in the world were ranked "difficult" or "very serious." The scores for a number of countries declined fairly significantly.
This week, we'll examine the data, seeing where things have improved, where they have declined, and whether RSF's multi-faceted scoring can reveal some nuances.
Paid subscribers, both to Bamboo Weekly and to my LernerPython+data membership program (https://LernerPython.com), get all of the questions and answers, as well as downloadable data files, downloadable versions of my notebooks, one-click access to my notebooks, and invitations to monthly office hours.
Learning goals for this week include combining multiple files, multi-indexes, plotting with Plotly, cutting, and pivot tables.
Data and five questions
This week's data comes from RSF. The data itself is available in a few places. I chose to retrieve it from a GitHub repo at https://github.com/dw-data/world-press-freedom-2026/tree/main . The files we want to use are in the csvs/rsf-files subdirectory.
Here are my solutions and explanations for this week's five questions:
Download the GitHub repo with the data. Create a single Pandas data frame from the CSV files in the rsf-files subdirectory, reading data from 2022 - 2026. (The directory contains more files than just these, but there was a major change in methodology starting in 2022, so we'll ignore earlier files.) You will have to deal with file-encoding issues when importing the files. Make sure that the score columns are treated as floats. Remove the Country_* columns, as well as the Score 2026 and Score 2025 columns. Make the index a two-level multi-index from ISO and Year (which you should rename from Year (N)).
I started, as usual, by loading up Pandas and Plotly:
import pandas as pd
from plotly import express as px
However, I was also going to be reading multiple files, and then turning them into a single data frame. I like to use a list comprehension for such tasks, but sometimes you need a bit of help in order for the comprehension to look reasonable. I thus loaded three additional modules from the Python standard library:
from collections import defaultdict
import os
import glob

What did I use here?

- defaultdict is, as the name indicates, a dictionary that returns a default value whenever you request a key that isn't in the dict. (It then adds the new key-value pair to the dict, so subsequent requests don't force a calculation.)
- os is the module that gives you access to the operating system, typically for dealing with files and directories.
- glob is a module that lets you search for filenames via patterns. If you've ever used *.txt, then you've used a "globbing" pattern. You can also use a[23456]b to match files that start with a, contain one of the digits 2, 3, 4, 5, or 6, and then end with b.
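To see the globbing pattern in action, here's a small self-contained sketch. Since glob matches against actual files on disk, it creates a few empty CSV files in a temporary directory first (the filenames here just mimic the RSF ones):

```python
import glob
import os
import tempfile

# Create a temporary directory containing one (empty) CSV file per year
with tempfile.TemporaryDirectory() as dirname:
    for year in range(2020, 2027):
        open(os.path.join(dirname, f'{year}.csv'), 'w').close()

    # The character class [23456] matches exactly one of those digits,
    # so this pattern selects only 2022.csv through 2026.csv
    matches = sorted(os.path.basename(p)
                     for p in glob.glob(os.path.join(dirname, '202[23456].csv')))

print(matches)  # ['2022.csv', '2023.csv', '2024.csv', '2025.csv', '2026.csv']
```

Note that glob returns matches in arbitrary order, which is why I sorted them here.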
I cloned the GitHub repo, and then used a combination of glob.glob and a list comprehension, along with pd.read_csv, to read all of the files from 2022 - 2026. The code looked like this:
[pd.read_csv(one_filename)
for one_filename in glob.glob('data/bw-169-rsf-files/202[23456].csv')]
I used a list comprehension to invoke read_csv on each file (from 2022-2026) in the directory of CSV files. This should, if all goes well, give me back a list of data frames.
However, this didn't work so well, for a number of reasons:
- By default, Python assumes that files contain characters in UTF-8 encoding. For reasons that I don't understand, most of the files use UTF-8, but two use Latin-1. We'll need to tell Python to use a different encoding for those files. We can pass the encoding keyword argument, but what can and should we use, if the files are different? We'll get back to this in a moment.
- While CSV originally stood for "comma-separated values," you can actually use any character you want as a field separator. In these files, a semicolon (;) is used to separate fields.
- Because this data set is originally French, where a comma is often used as the decimal point, you'll need to pass the decimal keyword argument, indicating that read_csv should interpret a comma between digits as a decimal point.
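Here's a minimal sketch of the sep and decimal keyword arguments, using an in-memory file rather than one of the RSF files (the data is made up; the encoding argument would be passed the same way when reading a real file from disk):

```python
import io
import pandas as pd

# Two fields separated by semicolons; a comma serves as the decimal point
csv_text = 'Country;Score\nNorway;92,72\nEritrea;10,24\n'

df = pd.read_csv(io.StringIO(csv_text),
                 sep=';',        # fields are semicolon-separated
                 decimal=',')    # treat 92,72 as the float 92.72

print(df['Score'].dtype)   # float64
print(df['Score'].max())   # 92.72
```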
Let's return for a moment to the encoding issue. One way to solve the problem would be to use a Python module that identifies the encoding (e.g., charset-normalizer), and then applies the correct one. If we were dealing with a large number of files, or a large number of encodings, then I would probably go for such a sophisticated solution.
But in reality, I found that the files were all encoded in UTF-8 except for two, which used Latin-1. And I could call those out. I didn't want to define a function, and I wanted to stick with the list comprehension.
I thus decided to use defaultdict. My thinking was that I could have it return a value of 'UTF-8' by default, but then load it up with the two Latin-1 files, with the filenames as keys and 'Latin-1' as values. Then I could retrieve the encoding from the dict based on the filename.
Putting this all together, I got:
encoding = defaultdict(lambda: 'utf-8',
{'2025.csv':'Latin-1', '2026.csv':'Latin-1'})
[(pd
.read_csv(one_filename,
encoding=encoding[os.path.basename(one_filename)],
sep=';',
decimal=',')
)
for one_filename in glob.glob('data/bw-169-rsf-files/202[23456].csv')]
This worked! Notice that I defined encoding to be a defaultdict not only with a lambda that is invoked for each new key, but also with two key-value pairs.
Also notice that I used os.path.basename to get just the final part of each filename, without the leading path. That made the code more readable and more portable.
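The lookup pattern can be sketched on its own; the filenames here mirror the ones used in the article:

```python
import os
from collections import defaultdict

# Default to utf-8, but override for the two Latin-1 files
encoding = defaultdict(lambda: 'utf-8',
                       {'2025.csv': 'Latin-1', '2026.csv': 'Latin-1'})

# basename strips the leading directories, leaving just the filename
print(os.path.basename('data/bw-169-rsf-files/2025.csv'))           # 2025.csv

# Keys in the dict get their assigned value; everything else gets the default
print(encoding[os.path.basename('data/bw-169-rsf-files/2025.csv')])  # Latin-1
print(encoding[os.path.basename('data/bw-169-rsf-files/2022.csv')])  # utf-8
```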
I then assigned the list to all_dfs:
all_dfs = [(pd
.read_csv(one_filename,
encoding=encoding[os.path.basename(one_filename)],
sep=';',
decimal=',')
)
for one_filename in glob.glob('data/bw-169-rsf-files/202[23456].csv')]

I now have a list of data frames. How can I turn those into a single data frame? The answer is pd.concat, which takes a list of data frames – which we conveniently have in all_dfs – and returns a single data frame, stacked (by default) vertically.
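Here's a tiny sketch of what pd.concat does with a list of data frames (toy data, not the RSF files):

```python
import pandas as pd

# Two small frames with the same columns, as read_csv would produce per file
df_2025 = pd.DataFrame({'ISO': ['NOR', 'ERI'], 'Year': [2025, 2025]})
df_2026 = pd.DataFrame({'ISO': ['NOR', 'ERI'], 'Year': [2026, 2026]})

# concat stacks them vertically by default
combined = pd.concat([df_2025, df_2026])

print(len(combined))   # 4
```

One thing to keep in mind: pd.concat keeps each frame's original index, so the combined index here is 0, 1, 0, 1; pass ignore_index=True if you want a fresh one. In our case it doesn't matter, because we're about to replace the index entirely with set_index.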
I then took advantage of the fact that I was already messing with the data frame, and made some other changes:
- I removed the country names using drop,
- I also removed the scores from 2026 and 2025, partly because we can calculate them ourselves, and partly because those columns only existed in specific years, again using drop and passing the columns keyword argument,
- I renamed Year (N) to just Year with the rename method, passing a dictionary of the column(s) I wanted to rename,
- I used set_index to make a two-level multi-index from the country codes and the years.
Here's the code:
pre_assign_df = (
pd.concat(all_dfs)
.drop(columns=['Country_FR', 'Country_EN', 'Country_ES',
'Country_PT', 'Country_AR', 'Country_FA',
'Score 2026', 'Score 2025'])
.rename(columns={'Year (N)': 'Year'})
.set_index(['ISO', 'Year'])
)

In case you're wondering why I set a variable called pre_assign_df, it's mainly because of aesthetics; I decided to separate the tasks in this question and the next because the method chain was looking long and complicated. And because I'm using Marimo, which doesn't allow you to define the same variable more than once, I used the silly pre_assign_df name, allowing me to use df in the cell for the next (second) question.
The Score column isn't assigned in 2025 and 2026 data. Calculate it as the mean of the Political Context, Economic Context, Legal Context, Social Context, and Safety columns. Which 10 countries have the highest overall press-freedom score for 2026? Which 10 countries have the lowest scores?
I next wanted to finalize the definition of our data frame, calculating Score based on the mean of five other columns. I used assign, setting Score to be the sum of five columns – each retrieved with pd.col – divided by 5. I also used the round method, specifying that two decimal places were enough:
df = (
pre_assign_df
.assign(Score = ((pd.col('Political Context') +
pd.col('Economic Context') +
pd.col('Legal Context') +
pd.col('Social Context') +
pd.col('Safety')) / 5).round(2))
)

The resulting data frame, df, had 900 rows and 18 columns.
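By the way, pd.col expressions are a fairly recent addition to Pandas; if your version doesn't support them, an equivalent approach is to select the five columns and take a row-wise mean. Here's a minimal sketch with a single made-up row:

```python
import pandas as pd

context_cols = ['Political Context', 'Economic Context',
                'Legal Context', 'Social Context', 'Safety']

# One invented row of scores, just to demonstrate the calculation
scores_df = pd.DataFrame([[80.0, 70.0, 90.0, 85.0, 75.0]],
                         columns=context_cols)

# mean(axis=1) averages across the columns within each row
scores_df = scores_df.assign(
    Score=scores_df[context_cols].mean(axis=1).round(2))

print(scores_df['Score'].iloc[0])   # 80.0
```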
With this in place, I wanted to find the 10 countries with the greatest press freedom in 2026. I could find that out with:
(
df
.xs(2026, level='Year')
['Score']
.nlargest(10)
)

I first used xs, the Pandas cross-section method, to indicate that I only wanted rows where the Year part of the index was set to 2026. Then I retrieved only Score, and ran nlargest(10) to get the 10 biggest ones:
ISO Score
NOR 92.72
NLD 88.92
EST 88.55
DNK 88.47
SWE 87.61
FIN 86.22
IRL 85.93
CHE 84.83
LUX 84.13
PRT 83.71

It worked, and shows that Norway, the Netherlands, and Estonia are the three countries with the greatest degree of overall press freedom. Americans might be surprised to find that even with First Amendment protections, the US doesn't rank in the top 10 countries with the freest press.
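The cross-section pattern can be demonstrated on a toy frame with the same kind of two-level index (the scores here are invented, except for Norway's):

```python
import pandas as pd

df = (
    pd.DataFrame({'ISO': ['NOR', 'NOR', 'ERI', 'ERI'],
                  'Year': [2025, 2026, 2025, 2026],
                  'Score': [92.0, 92.72, 11.0, 10.24]})
    .set_index(['ISO', 'Year'])
)

# xs selects all rows whose Year level is 2026, dropping that level,
# so the result is indexed by ISO alone
scores_2026 = df.xs(2026, level='Year')['Score']

print(scores_2026.idxmax())   # NOR
print(scores_2026.max())      # 92.72
```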
What about the 10 worst countries? You won't be surprised by what you see; I used the following query:
(
df
.xs(2026, level='Year')
['Score']
.nsmallest(10)
)

The only difference, of course, is that now I'm invoking nsmallest. The results:
ISO Score
ERI 10.24
PRK 12.67
CHN 13.85
IRN 17.45
SAU 19.11
AFG 19.51
VNM 21.15
TKM 23.06
RUS 23.15
AZE 23.95

The three lowest-scoring countries are Eritrea, North Korea, and China, followed closely by Iran and Saudi Arabia. (I must admit that I didn't realize Eritrea was that bad!)