Sorry for the delay in getting these out, but I'm traveling before PyCon US starts next week in Long Beach, California.
This week, we're looking at the most recent data from Reporters Without Borders (known by its French acronym, RSF). Their annual report on press freedom (https://rsf.org/en/video-2026-rsf-world-press-freedom-index) shows that things aren't so great; for the first time, more than half of the countries in the world were ranked "difficult" or "very serious." The scores for a number of countries declined fairly significantly.
This week, we'll examine the data, seeing where things have improved, where they have declined, and whether RSF's multi-faceted scoring can reveal some nuances.
Paid subscribers, both to Bamboo Weekly and to my LernerPython+data membership program (https://LernerPython.com), get all of the questions and answers, as well as downloadable data files, downloadable versions of my notebooks, one-click access to my notebooks, and invitations to monthly office hours.
Learning goals for this week include combining multiple files, multi-indexes, plotting with Plotly, cutting, and pivot tables.
Data and five questions
This week's data comes from RSF. The data itself is available in a few places. I chose to retrieve it from a GitHub repo at https://github.com/dw-data/world-press-freedom-2026/tree/main . The files we want to use are in the csvs/rsf-files subdirectory.
Here are my solutions and explanations for this week's five questions:
Download the GitHub repo with the data. Create a single Pandas data frame from the CSV files in the rsf-files subdirectory, reading data from 2022 - 2026. (The directory contains more files than just these, but there was a major change in methodology starting in 2022, so we'll ignore earlier files.) You will have to deal with file-encoding issues when importing the files. Make sure that the score columns are treated as floats. Remove the Country_* columns, as well as the Score 2026 and Score 2025 columns. Make the index a two-level multi-index from ISO and Year (which you should rename from Year (N)).
I started, as usual, by loading up Pandas and Plotly:
import pandas as pd
from plotly import express as px
However, I was also going to be reading multiple files, and then turning them into a single data frame. I like to use a list comprehension for such tasks, but sometimes you need a bit of help in order for the comprehension to look reasonable. I thus loaded three additional modules from the Python standard library:
from collections import defaultdict
import os
import glob

What did I use here?

- defaultdict is, as the name indicates, a dictionary that returns a default value whenever you request a key that isn't in the dict. (It then adds the new key-value pair to the dict, so subsequent requests don't force a calculation.)
- os is the module that gives you access to the operating system, typically for dealing with files and directories.
- glob is a module that lets you search for filenames via patterns. If you've ever used *.txt, then you've used a "globbing" pattern. You can also use a[23456]b to match files that start with a, contain one of the digits 2, 3, 4, 5, or 6, and then end with b.
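To see the globbing pattern in action, here's a small self-contained sketch. Since glob matches against actual files on disk, it creates a few empty CSV files in a temporary directory first (the filenames here just mimic the RSF ones):

```python
import glob
import os
import tempfile

# Create a temporary directory containing one (empty) CSV file per year
with tempfile.TemporaryDirectory() as dirname:
    for year in range(2020, 2027):
        open(os.path.join(dirname, f'{year}.csv'), 'w').close()

    # The character class [23456] matches exactly one of those digits,
    # so this pattern selects only 2022.csv through 2026.csv
    matches = sorted(os.path.basename(p)
                     for p in glob.glob(os.path.join(dirname, '202[23456].csv')))

print(matches)  # ['2022.csv', '2023.csv', '2024.csv', '2025.csv', '2026.csv']
```

Note that glob returns matches in arbitrary order, which is why I sorted them here.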
I cloned the GitHub repo, and then used a combination of glob.glob and a list comprehension, along with pd.read_csv, to read all of the files from 2022 - 2026. The code looked like this:
[pd.read_csv(one_filename)
for one_filename in glob.glob('data/bw-169-rsf-files/202[23456].csv')]
I used a list comprehension to invoke read_csv on each file (from 2022-2026) in the directory of CSV files. This should, if all goes well, give me back a list of data frames.
However, this didn't work so well, for a number of reasons:
- By default, Python assumes that files contain characters in UTF-8 encoding. For reasons that I don't understand, most of the files use UTF-8, but two use Latin-1. We'll need to tell Python to use a different encoding for those files. We can pass the encoding keyword argument, but what can and should we use, if the files are different? We'll get back to this in a moment.
- While CSV originally stood for "comma-separated values," you can actually use any character you want as a field separator. In these files, a semicolon (;) is used to separate fields.
- Because this data set is originally French, where a comma is often used as the decimal point, you'll need to pass the decimal keyword argument, indicating that read_csv should interpret a comma between digits as a decimal point.
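Here's a minimal sketch of the sep and decimal keyword arguments, using an in-memory file rather than one of the RSF files (the data is made up; the encoding argument would be passed the same way when reading a real file from disk):

```python
import io
import pandas as pd

# Two fields separated by semicolons; a comma serves as the decimal point
csv_text = 'Country;Score\nNorway;92,72\nEritrea;10,24\n'

df = pd.read_csv(io.StringIO(csv_text),
                 sep=';',        # fields are semicolon-separated
                 decimal=',')    # treat 92,72 as the float 92.72

print(df['Score'].dtype)   # float64
print(df['Score'].max())   # 92.72
```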
Let's return for a moment to the encoding issue. One way to solve the problem would be to use a Python module that identifies the encoding (e.g., charset-normalizer), and then applies the correct one. If we were dealing with a large number of files, or a large number of encodings, then I would probably go for such a sophisticated solution.
But in reality, I found that the files were all encoded in UTF-8 except for two, which used Latin-1. And I could call those out. I didn't want to define a function, and I wanted to stick with the list comprehension.
I thus decided to use defaultdict. My thinking was that I could have it return a value of 'UTF-8' by default, but then load it up with the two Latin-1 files, with the filenames as keys and 'Latin-1' as values. Then I could retrieve the encoding from the dict based on the filename.
Putting this all together, I got:
encoding = defaultdict(lambda: 'utf-8',
{'2025.csv':'Latin-1', '2026.csv':'Latin-1'})
[(pd
.read_csv(one_filename,
encoding=encoding[os.path.basename(one_filename)],
sep=';',
decimal=',')
)
for one_filename in glob.glob('data/bw-169-rsf-files/202[23456].csv')]
This worked! Notice that I defined encoding to be a defaultdict not only with a lambda that is invoked for each new key, but also with two key-value pairs.
Also notice that I used os.path.basename to get just the final part of each filename, without the leading path. That made the code more readable and more portable.
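The lookup pattern can be sketched on its own; the filenames here mirror the ones used in the article:

```python
import os
from collections import defaultdict

# Default to utf-8, but override for the two Latin-1 files
encoding = defaultdict(lambda: 'utf-8',
                       {'2025.csv': 'Latin-1', '2026.csv': 'Latin-1'})

# basename strips the leading directories, leaving just the filename
print(os.path.basename('data/bw-169-rsf-files/2025.csv'))           # 2025.csv

# Keys in the dict get their assigned value; everything else gets the default
print(encoding[os.path.basename('data/bw-169-rsf-files/2025.csv')])  # Latin-1
print(encoding[os.path.basename('data/bw-169-rsf-files/2022.csv')])  # utf-8
```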
I then assigned the list to all_dfs:
all_dfs = [(pd
.read_csv(one_filename,
encoding=encoding[os.path.basename(one_filename)],
sep=';',
decimal=',')
)
for one_filename in glob.glob('data/bw-169-rsf-files/202[23456].csv')]

I now have a list of data frames. How can I turn those into a single data frame? The answer is pd.concat, which takes a list of data frames – which we conveniently have in all_dfs – and returns a single data frame, stacked (by default) vertically.
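Here's a tiny sketch of what pd.concat does with a list of data frames (toy data, not the RSF files):

```python
import pandas as pd

# Two small frames with the same columns, as read_csv would produce per file
df_2025 = pd.DataFrame({'ISO': ['NOR', 'ERI'], 'Year': [2025, 2025]})
df_2026 = pd.DataFrame({'ISO': ['NOR', 'ERI'], 'Year': [2026, 2026]})

# concat stacks them vertically by default
combined = pd.concat([df_2025, df_2026])

print(len(combined))   # 4
```

One thing to keep in mind: pd.concat keeps each frame's original index, so the combined index here is 0, 1, 0, 1; pass ignore_index=True if you want a fresh one. In our case it doesn't matter, because we're about to replace the index entirely with set_index.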
I then took advantage of the fact that I was already messing with the data frame, and made some other changes:
- I removed the country names using drop,
- I also removed the scores from 2026 and 2025, partly because we can calculate them ourselves, and partly because those columns only existed in specific years, again using drop and passing the columns keyword argument,
- I renamed Year (N) to just Year with the rename method, passing a dictionary of the column(s) I wanted to rename,
- I used set_index to make a two-level multi-index from the country codes and the years.
Here's the code:
pre_assign_df = (
pd.concat(all_dfs)
.drop(columns=['Country_FR', 'Country_EN', 'Country_ES',
'Country_PT', 'Country_AR', 'Country_FA',
'Score 2026', 'Score 2025'])
.rename(columns={'Year (N)': 'Year'})
.set_index(['ISO', 'Year'])
)

In case you're wondering why I set a variable called pre_assign_df, it's mainly because of aesthetics; I decided to separate the tasks in this question and the next because the method chain was looking long and complicated. And because I'm using Marimo, which doesn't allow you to define the same variable more than once, I used the silly pre_assign_df name, allowing me to use df in the cell for the next (second) question.
The Score column isn't assigned in 2025 and 2026 data. Calculate it as the mean of the Political Context, Economic Context, Legal Context, Social Context, and Safety columns. Which 10 countries have the highest overall press-freedom score for 2026? Which 10 countries have the lowest scores?
I next wanted to finalize the definition of our data frame, calculating Score based on the mean of five other columns. I used assign, setting Score to be the sum of five columns – each retrieved with pd.col – divided by 5. I also used the round method, specifying that two decimal places were enough:
df = (
pre_assign_df
.assign(Score = ((pd.col('Political Context') +
pd.col('Economic Context') +
pd.col('Legal Context') +
pd.col('Social Context') +
pd.col('Safety')) / 5).round(2))
)

The resulting data frame, df, had 900 rows and 18 columns.
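By the way, pd.col expressions are a fairly recent addition to Pandas; if your version doesn't support them, an equivalent approach is to select the five columns and take a row-wise mean. Here's a minimal sketch with a single made-up row:

```python
import pandas as pd

context_cols = ['Political Context', 'Economic Context',
                'Legal Context', 'Social Context', 'Safety']

# One invented row of scores, just to demonstrate the calculation
scores_df = pd.DataFrame([[80.0, 70.0, 90.0, 85.0, 75.0]],
                         columns=context_cols)

# mean(axis=1) averages across the columns within each row
scores_df = scores_df.assign(
    Score=scores_df[context_cols].mean(axis=1).round(2))

print(scores_df['Score'].iloc[0])   # 80.0
```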
With this in place, I wanted to find the 10 countries with the greatest press freedom in 2026. I could find that out with:
(
df
.xs(2026, level='Year')
['Score']
.nlargest(10)
)

I first used xs, the Pandas cross-section method, to indicate that I only wanted rows where the Year part of the index was set to 2026. Then I retrieved only Score, and ran nlargest(10) to get the 10 biggest ones:
ISO Score
NOR 92.72
NLD 88.92
EST 88.55
DNK 88.47
SWE 87.61
FIN 86.22
IRL 85.93
CHE 84.83
LUX 84.13
PRT 83.71

It worked, and shows that Norway, the Netherlands, and Estonia are the three countries with the greatest degree of overall press freedom. Americans might be surprised to find that even with First Amendment protections, the US doesn't rank in the top 10 countries with the freest press.
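The cross-section pattern can be demonstrated on a toy frame with the same kind of two-level index (the scores here are invented, except for Norway's):

```python
import pandas as pd

df = (
    pd.DataFrame({'ISO': ['NOR', 'NOR', 'ERI', 'ERI'],
                  'Year': [2025, 2026, 2025, 2026],
                  'Score': [92.0, 92.72, 11.0, 10.24]})
    .set_index(['ISO', 'Year'])
)

# xs selects all rows whose Year level is 2026, dropping that level,
# so the result is indexed by ISO alone
scores_2026 = df.xs(2026, level='Year')['Score']

print(scores_2026.idxmax())   # NOR
print(scores_2026.max())      # 92.72
```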
What about the 10 worst countries? You won't be surprised by what you see; I used the following query:
(
df
.xs(2026, level='Year')
['Score']
.nsmallest(10)
)

The only difference, of course, is that now I'm invoking nsmallest. The results:
ISO Score
ERI 10.24
PRK 12.67
CHN 13.85
IRN 17.45
SAU 19.11
AFG 19.51
VNM 21.15
TKM 23.06
RUS 23.15
AZE 23.95

The three lowest-scoring countries are Eritrea, North Korea, and China, followed closely by Iran and Saudi Arabia. (I must admit that I didn't realize Eritrea was that bad!)