BW 149: Flu season

Get better at: Working with CSV files, Pandas 3, joins, pivot tables, memory optimization, speed optimization, window functions, and plotting.

I've been feeling under the weather for the last week or two, and I'm somewhat relieved to know that it's not just me: Lots of people are sick this winter. New York's acting health commissioner has said that flu cases are "skyrocketing" in the city (https://www.nytimes.com/2025/12/16/nyregion/flu-cases-nyc.html?unlocked_article_code=1.9U8.C9hd.BYSZ35XQZRqc&smid=url-share). In Israel, where I live, the health ministry has recommended that vulnerable populations and medical workers wear masks in order to slow the spread of the flu.

Winter always brings a rise in cases of the flu. In order to get a handle on the current state of affairs, as well as plan for future outbreaks, governments need to monitor flu cases. But how can you do that, especially in a large country like the United States?

The Centers for Disease Control and Prevention (CDC) has developed a three-part system to try to track the flu:

  1. First, there's ILINet, a network of thousands of doctors and medical providers who report how many of their patients have flu-like symptoms. (ILI stands for "influenza-like illness.") Not all of those patients actually have the flu, but this is a good first check of how many people are sick.
  2. When patients in clinics and hospitals are tested for flu, the institutions report test results to NREVSS (National Respiratory and Enteric Virus Surveillance System). This way, they can find out how many people are sick with the flu vs. something else (e.g., covid).
  3. Public health laboratories get samples of the positive flu tests, and report (also as part of NREVSS) which flu strains are spreading, and where.

By looking at all three of these, we can learn more about flu seasons over the years, whether this year is worse than previous years, and how many of those rough flu-like symptoms are really the flu itself.

But wait: This week, the Pandas core team announced that a release candidate for Pandas 3.0.0 is now available (https://pandas.pydata.org//community/blog/pandas-3.0-release-candidate.html). This means that Pandas 3.0 is close to release, and with it, we'll see a number of big changes. This week's Pandas topics are thus meant not only to challenge your Pandas fluency, but also to get you to compare the performance of Pandas 2 vs. Pandas 3.

To get a Pandas 3 environment going, I suggest using uv. (Unsure how to use uv? Check out my free, 15-part "uv Crash Course" at https://uvCrashCourse.com .) On my Mac, in the terminal, I typed:

$ cd ~/Desktop
$ uv init bwpd3
$ cd bwpd3
$ uv add 'pandas==3.0.0rc0' 'marimo[recommended]' pyarrow plotly
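
Once the environment is built, it's worth confirming which version of Pandas you actually got, and having a small timing harness you can run unchanged under both Pandas 2 and Pandas 3. Here is a minimal sketch, not a rigorous benchmark. The synthetic data, the column names, and the `time_it` helper are all my own inventions for illustration. I chose a string column as the grouping key because Pandas 3 changes the default string dtype, which makes string-heavy operations a reasonable place to look for differences:

```python
import time

import numpy as np
import pandas as pd

# Report which pandas the active environment actually resolved.
pandas_major = int(pd.__version__.split(".")[0])
print(f"pandas {pd.__version__} (major version {pandas_major})")

# Build a synthetic frame; the size, labels, and column names are arbitrary.
rng = np.random.default_rng(0)
n_rows = 200_000
labels = [f"Region {i}" for i in range(1, 11)]
df = pd.DataFrame(
    {
        "region": rng.choice(labels, size=n_rows),
        "value": rng.random(n_rows),
    }
)
print(df.dtypes)


def time_it(func, repeats=5):
    """Return the best wall-clock time (in seconds) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


# Time one operation; swap in whatever query you want to compare.
elapsed = time_it(lambda: df.groupby("region")["value"].mean())
print(f"groupby mean: {elapsed * 1000:.2f} ms")
```

Run the same file in a Pandas 2 environment and in the Pandas 3 one, and compare the printed dtypes and timings. Taking the best of several runs reduces noise from other things happening on your machine.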

Data and five questions

This week's data comes from the CDC's FluView portal, specifically from its main dashboard at https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html . To get the data, I clicked on the green "download data" button at the top right of the page, and asked for both data sources (ILINet and NREVSS) and all HHS (Health and Human Services) regions. The resulting zipfile contains four CSV files. We will use three of them in our analysis, one for each of the sources mentioned above.
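
Since memory optimization is one of this week's topics, here is a rough sketch of how you might compare a default `read_csv` call against one with explicit dtypes. It runs on a tiny in-memory stand-in rather than the real download. The descriptive first line, the column names, and every number in it are assumptions made up for illustration, so check the header of the actual files before borrowing any of it:

```python
import io

import pandas as pd

# Made-up stand-in for a downloaded file. The descriptive first line,
# the column names, and the numbers are all assumptions for illustration.
sample_csv = """\
Hypothetical descriptive line that precedes the real header
REGION,YEAR,WEEK,ILITOTAL
Region 1,2025,48,120
Region 1,2025,49,150
Region 2,2025,48,300
Region 2,2025,49,410
"""

# Default load: skip the descriptive line and let pandas infer dtypes.
default_df = pd.read_csv(io.StringIO(sample_csv), skiprows=1)

# Leaner load: repetitive text as a category, counts as smaller integers.
lean_df = pd.read_csv(
    io.StringIO(sample_csv),
    skiprows=1,
    dtype={
        "REGION": "category",
        "YEAR": "int16",
        "WEEK": "int8",
        "ILITOTAL": "int32",
    },
)

print(default_df.dtypes)
print(lean_df.dtypes)
print("default bytes:", default_df.memory_usage(deep=True).sum())
print("lean bytes:", lean_df.memory_usage(deep=True).sum())
```

On a four-row sample the byte counts don't tell you much. The comparison only becomes meaningful on the full files, and it's also worth running under both Pandas 2 and Pandas 3 to see how the inferred dtype of the text column differs.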

Learning goals for this week include: Working with CSV files, Pandas 3, joins, pivot tables, memory optimization, speed optimization, window functions, and plotting.

Paid subscribers, including members of my LernerPython.com membership program, get the data files provided to them, as well as all of the questions and answers each week, downloadable notebooks, and participation in monthly office hours.

Here are my five questions for this week. I'll be back tomorrow with solutions and full explanations: