This week, we looked at data from the outbreak of hantavirus on the HV Hondius, and what has happened to the passengers there. Not that many people are involved, but the story is concerning, in part because the virus can spread from person to person — and it has already led to three deaths and dozens of hospitalizations (https://www.nytimes.com/2026/05/16/world/europe/hantavirus-hondius-cruise.html?unlocked_article_code=1.j1A.3KML.uri7nsW30w0Z&smid=url-share).
Moreover, we have access to data about the passengers: where they're from, what treatments they're getting, and whether they died as a result of this terrible virus.
Paid subscribers, both to Bamboo Weekly and to my LernerPython+data membership program (https://LernerPython.com), get all of the questions and answers, as well as downloadable data files, downloadable versions of my notebooks, one-click access to my notebooks, and invitations to monthly office hours.
Learning goals for this week include working with CSV files, cleaning data, dates and times, pivot tables, null values, and plotting with Plotly.
Data and five questions
This week's data comes from a GitHub repo that is being updated on a regular basis with information from the hantavirus infection on the HV Hondius, at https://github.com/kraemer-lab/Hondius_hantavirus_h2026 . We'll specifically be looking at a CSV file, data/linelist/2026_hantavirus.csv, from the repository.
Here are the five questions and tasks that I posed, along with my solutions and explanations:
Read the list of people exposed to hantavirus into a Pandas data frame. Make sure that all of the date columns have a datetime dtype. How long were passengers on the ship? Was it the same for all passengers?
I started off by loading both Pandas and Plotly:
import pandas as pd
from plotly import express as px
Next, I wanted to read the data. The GitHub repo actually has a lot of data about hantavirus, including a JSON archive of news articles that I thought about exploring. But in the end, I decided to look at only the CSV file showing the 2026 outbreak. I used read_csv to read it.
But it wasn't enough to just invoke read_csv on the file. I also needed to pass two options:
- na_values, so that the text 'None' would be interpreted as NaN. Otherwise, the entire column would be seen as strings.
- parse_dates, indicating which columns should be seen as dates. Normally, using the pyarrow engine to read CSV files ensures that columns with datetime-like values are turned into datetime values. But for whatever reason, that wasn't the case here. So I explicitly gave a list of columns that should be interpreted as dates, and that worked just fine.
However, I found that this wasn't quite enough: The age column wasn't seen as a float! That's because, in at least one case, the age was approximate, so it was listed with an 's' after the number. I thus used assign to assign a new value to the age column: I took the original age column (via pd.col), replaced the 's' string with an empty string using str.replace, and then used astype to get float values back:
filename = 'data/bw-171-hantavirus-repo/data/linelist/2026_hantavirus.csv'
df = (
    pd
    .read_csv(filename,
              na_values=['None'],
              parse_dates=['symptom_onset', 'ship_boarded', 'left_ship',
                           'confirmation_date', 'treatment_date', 'outcome_date'])
    .assign(age = pd.col('age').str.replace('s', '').astype(float))
)
The result? A relatively small data frame, with 20 rows and 28 columns.
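To see what these options do in isolation, here is a minimal sketch that reads a tiny, made-up CSV from a string rather than the real file. The sample rows and the names csv_text, unparsed, and parsed are assumptions for illustration only. It uses a lambda inside assign instead of pd.col, so that it also runs on Pandas versions that don't have pd.col:

```python
from io import StringIO

import pandas as pd

# Made-up sample rows, not taken from the real data file
csv_text = """age,ship_boarded,left_ship
45,2026-04-01,2026-05-10
30s,2026-04-01,None
62,2026-04-14,2026-04-27
"""

date_cols = ['ship_boarded', 'left_ship']

# Without parse_dates, the date columns are not given a datetime dtype
unparsed = pd.read_csv(StringIO(csv_text))

# With the options from the article, plus the cleanup of the age column
parsed = (
    pd
    .read_csv(StringIO(csv_text), na_values=['None'], parse_dates=date_cols)
    # remove the trailing 's' from approximate ages, then convert to float
    .assign(age=lambda df_: df_['age'].str.replace('s', '', regex=False).astype(float))
)

print(unparsed.dtypes)
print(parsed.dtypes)
```

In the parsed version, the missing departure date shows up as NaT, the datetime equivalent of NaN.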
I was curious to know if all passengers were on the ship for the same amount of time. Thankfully, we had ship_boarded and left_ship, both datetime values. If you subtract one datetime from another, you end up with a timedelta, representing the distance between two points in time. When we humans think of "time," we sometimes mean a specific point in time (a datetime), and sometimes a stretch of time (a timedelta). In Pandas and other programming systems, we represent these with two different data structures.
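Here is a small sketch of the difference between the two types, using two made-up dates (not taken from the data set):

```python
import pandas as pd

# Made-up dates, for illustration only
boarded = pd.Timestamp('2026-04-01')
left = pd.Timestamp('2026-05-10')

# Subtracting one point in time from another gives a length of time
time_on_ship = left - boarded

print(type(boarded).__name__)        # Timestamp
print(type(time_on_ship).__name__)   # Timedelta
print(time_on_ship.days)             # 39

# Adding a length of time to a point in time gives another point in time
print(boarded + time_on_ship == left)   # True
```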
But a timedelta is just a length of time, which means that we can perform calculations and operations on it, including aggregations such as min, max, and mean. I thus ran this query:
(
    (df['left_ship'] - df['ship_boarded'])
    .dropna()
    .describe()
)
First, I got the length of time each passenger spent on the ship with some subtraction. That gave me a series of timedelta values. Notice that I put the subtraction in (), because I wanted to invoke a method on the resulting timedelta series, not on the second datetime series. (It took me a while to find that mistake, too!)
Next, I removed any NaN values, just to avoid potential confusion. Truth be told, Pandas normally ignores NaN values when performing these sorts of aggregations, but it made me feel better to remove them.
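As a quick check of that claim, here is a minimal sketch with three made-up durations, one of them missing. The values are assumptions for illustration, not rows from the data set:

```python
import pandas as pd

# Made-up durations; NaT is the missing value for datetime and timedelta data
durations = pd.Series([pd.Timedelta(days=13), pd.NaT, pd.Timedelta(days=39)])

mean_with_missing = durations.mean()             # NaT is skipped automatically
mean_after_dropna = durations.dropna().mean()

print(durations.count())   # counts only the non-missing values
print(mean_with_missing)
print(mean_after_dropna)
```

Both means come out the same, so the call to dropna changes nothing here except making the intent explicit.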
Finally, I invoked describe, summarizing the resulting timedelta values for me:
         value
count    15
mean     30 days 06:24:00
std      8 days 13:45:01.702827
min      13 days 00:00:00
25%      23 days 00:00:00
50%      35 days 00:00:00
75%      39 days 00:00:00
max      39 days 00:00:00
You can see that a timedelta is displayed in days, hours, minutes, and seconds (with fractions of a second where needed). So:
- The shortest time that someone was on the ship was 13 days.
- The longest time that someone was on the ship was 39 days.
I didn't ask you to do this, but you could use corr and a translation from medical condition to numbers, and find out if there's a correlation between time spent on the ship and how sick they got.
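Here is a rough sketch of what that might look like. Everything in it is an assumption: the rows are made up, the status column is a stand-in for the combined treatment and outcome, and the severity numbers are a hypothetical ranking rather than anything in the data set. The result says nothing about the real outbreak.

```python
import pandas as pd

# Hypothetical ranking of how sick someone is; these numbers are my assumption
severity = {'monitored': 0, 'quarantine': 0, 'hospitalised': 1,
            'intensive care': 2, 'died': 3}

# Made-up rows, standing in for the real data frame
sample = pd.DataFrame({
    'ship_boarded': pd.to_datetime(['2026-04-01', '2026-04-01', '2026-04-14', '2026-04-20']),
    'left_ship': pd.to_datetime(['2026-05-10', '2026-05-06', '2026-05-07', '2026-05-03']),
    'status': ['died', 'intensive care', 'hospitalised', 'monitored'],
})

# Turn each timedelta into a whole number of days, so it can be correlated
days_on_ship = (sample['left_ship'] - sample['ship_boarded']).dt.days

# Translate each status into its numeric severity
severity_score = sample['status'].map(severity)

# Pearson correlation, the default for corr
correlation = days_on_ship.corr(severity_score)
print(correlation)
```

With only about 20 people in the real data, any correlation found this way should be treated with a great deal of caution.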
We do see that people were on the ship for different amounts of time. It wasn't a uniform group.
Combine the outcome and treatment columns into a single column. From this data, what were the three most common treatments and outcomes? Does it matter in which order you perform the combination?
The outcome column contains 'died' if the person died, and NaN otherwise. By contrast, the treatment column contains strings describing people's treatment, unless they died, in which case it is NaN.
I asked you to combine the two into a single column. The easiest way to do this was to use fillna. We often think of using fillna with a single value, such as 0, to replace all NaN values with an integer or float. However, you can instead pass a series, in which case the NaN values will be replaced by the values from the series, using the index to know what value to insert.
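Here is a minimal sketch of that index matching, using two small made-up series. The second one is deliberately in a different order, to show that values are matched by index label and not by position:

```python
import pandas as pd

# Made-up values, for illustration only
outcome = pd.Series(['died', None, None], index=[10, 11, 12])

# Same index labels, but in reverse order
treatment = pd.Series(['quarantine', 'hospitalised', None], index=[12, 11, 10])

# Each missing value in outcome is replaced by the value with the same label in treatment
combined = outcome.fillna(treatment)
print(combined)
```

The row labeled 10 keeps 'died', the row labeled 11 gets 'hospitalised', and the row labeled 12 gets 'quarantine'.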
(
    df
    ['outcome']
    .fillna(df['treatment'])
)
With that in place, I now had a series containing no NaN values. (Actually, that's not true; one person had NaN in both columns.)
Does the order of the fillna matter? That is, would we get the same result running fillna on outcome and passing treatment, as running fillna on treatment and passing it outcome? The two give the same result as long as no row has non-NaN values in both series. But if a row does have values in both, then we'll get the value from the series on which we invoke fillna, so the two orderings will differ for that row.
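A small sketch makes the difference visible. The series below are made up; in particular, the third row has a value in both series, which is exactly the case where the order matters:

```python
import pandas as pd

# Made-up values: row 2 has a value in both series, row 3 has a value in neither
outcome = pd.Series(['died', None, 'died', None])
treatment = pd.Series([None, 'hospitalised', 'intensive care', None])

outcome_first = outcome.fillna(treatment)
treatment_first = treatment.fillna(outcome)

print(outcome_first)
print(treatment_first)
```

The two results agree on rows 0, 1, and 3 (the last of which stays NaN either way), but row 2 is 'died' in one and 'intensive care' in the other.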
After getting our combined data, I then ran value_counts, passing normalize=True, to get proportions rather than raw counts. I then used round to get only two digits after the decimal point:
(
    df
    ['outcome']
    .fillna(df['treatment'])
    .value_counts(normalize=True)
    .round(2)
)
The result:
outcome                          proportion
hospitalised                     0.47
died                             0.16
intensive care                   0.11
monitored                        0.11
quarantine                       0.11
biocontainment unit, Nebraska    0.05
We thus see that just under half (47 percent) of the people in the data set are hospitalized. Another 16 percent (three people) have died. Just over 10 percent each are in intensive care, being monitored, or quarantined.
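One thing to keep in mind is that value_counts leaves out NaN by default, so these proportions are calculated only over the rows that have a value. Passing dropna=False includes the missing rows as their own category. Here is a minimal sketch with made-up values:

```python
import pandas as pd

# Made-up values, for illustration only; one of the four is missing
combined = pd.Series(['hospitalised', 'hospitalised', 'died', None])

# Default: the missing value is ignored, so proportions are out of 3
without_missing = combined.value_counts(normalize=True).round(2)

# dropna=False: the missing value is counted, so proportions are out of 4
with_missing = combined.value_counts(normalize=True, dropna=False).round(2)

print(without_missing)
print(with_missing)
```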
Notice that one person (5 percent of the population, if you must know) is being quarantined in Nebraska. Why there, rather than elsewhere? I'm not sure, but I have heard that Nebraska can be lovely – although this doesn't seem like the best way to go there as a tourist.