BW #55: IVF (solution)

In vitro fertilization is in the news, thanks to an Alabama Supreme Court ruling. This week, we look at some data about IVF in the United States.

BW #55: IVF (solution)

Two administrative notes:

  1. I’ll be holding office hours for paid subscribers this coming Sunday, March 3rd. Look for a notice tomorrow with the full Zoom info.
  2. I’m giving two talks at PyCon US in May, including one about dates and times in Pandas. I’m also a sponsor of the conference, and will have a booth there telling people about my course subscription program, corporate training, and (of course) Bamboo Weekly. I hope to see you there!

This week, we looked at in vitro fertilization (IVF). The widespread fertility treatment made headlines when Alabama’s Supreme Court ruled that embryos are children, as per their state constitution. IVF typically involves the fertilization of several embryos outside of the womb; those that aren’t used are then frozen or destroyed. As a result, the court’s ruling has effectively shut down IVF treatment in Alabama. It has also sent political shock waves through much of the US, and might play a role in US elections this coming November.

This week’s data set came from the Centers for Disease Control and Prevention (CDC). The CDC surveyed a large number of fertility clinics across the US about their use of IVF. The questions ranged from the number of retrievals (i.e., of egg cells) that were performed to the number of transfer (i.e., insertion of fertilized embryos) that were done, to the transfer technique used, to the number of births (single and multiple) that resulted from IVF. The data set was from October 2023, with data reported in 2021; from what I can tell, IVF research commonly lags by about two years, because of the time involved in achieving pregnancy and birth.

Data and six questions

The home page for this data set is here:

https://data.cdc.gov/Assisted-Reproductive-Technology-ART-/2021-Final-Assisted-Reproductive-Technology-ART-Su/9tjt-seye/about_data

Note that while the above URL is where you have to get the data, it took me a while to figure out how to get it. Here's what I did:

  1. Click on the "export" button at the top of the page
  2. Select "CSV file" for export format (which should be selected anyway, as the default)
  3. Click on the "download" button

Here are my six tasks and questions for this week. As usual, a link to the Jupyter notebook I used to solve these is at the bottom of the post.

Download the CSV data. Read it into a data frame using both the PyArrow engine and for all of the backend dtypes (i.e., instead of NumPy).

Before doing anything else, let’s load Pandas:

import pandas as pd

I asked you to load the CSV file into a data frame, which we can do with “read_csv”:

filename = '2021_Final_Assisted_Reproductive_Technology__ART__Summary_20240228.csv'

df = pd.read_csv(filename)

This works just fine, but it takes a long time. That’s because by default, Pandas uses its own engine for reading CSV files, and uses NumPy for back-end storage. Both have served the Pandas community for a long time, but they are slowly being phased out in favor of PyArrow (https://pypi.org/project/pyarrow/), Python bindings for the Apache Arrow open-source project (https://arrow.apache.org/). Arrow is an in-memory database optimized for the sorts of things that Pandas does, which means that it is often faster than the default implementations. It also takes less memory, partly thanks to its native use of strings. Pandas is already warning us that version 3.0 will require PyArrow.

Today, we can use PyArrow in two different ways:

  1. We can specify that Pandas use it for parsing our CSV file, and
  2. We can use it instead of NumPy on the back end.

We can do both by passing the “engine” and “dtype_backend” keyword arguments to “read_csv”:

filename = '2021_Final_Assisted_Reproductive_Technology__ART__Summary_20240228.csv'

df = pd.read_csv(filename, engine='pyarrow', dtype_backend='pyarrow')

I compared the time it took to with the default settings and with the PyArrow engine and dtypes. On average, the default settings took 201 ms. By contrast, using PyArrow took only 12.9 ms on average.

Moreover, using the default NumPy dtypes consumed 88,423,682 bytes. Using PyArrow, we only needed 29,528,256 bytes, or less than half.

In other words: PyArrow loaded our data by a factor of nearly 20x the speed, and used half the memory, without any additional configuration or customization. PyArrow isn’t always faster than NumPy, and it doesn’t handle all of the functionality that NumPy does. But I’d argue that it’s worth using it wherever you can.

How many IVF clinics were surveyed for this data? How many questions were asked of each clinic?

It took me some time to understand how this data set was structured. I originally thought that each row of the CSV file represented a single fertility clinic, and that the columns were their answers to various questions. That’s not exactly the case; each question was listed in a separate row, which means (in theory) that if there were m clinics and n questions, then the data set has m*n rows.

The question text is in the “Question” column, while the answer is actually located in two columns, one called “Data_Value” which contains whatever the response was in string dtypes, and one called “data_value_num” which contains numeric values only as float dtypes. Because we’re using PyArrow, string values are “string[pyarrow]” and float values are “double[pyarrow]”.

Finding the number of clinics surveyed thus required that we get the unique values associated with the clinics.

Rows also contained “Breakout groups” and “Breakout” columns, which you can think of a form of multi-index inside of the CSV file. The breakout groups were the age of the patient, the egg/embryo type, and a generic yes/no group. The values themselves were age ranges, and types of eggs and embryos, among others. Because of the breakout groups and values, we have many more rows than the m*n calculation I mentioned above; again, if we think of the breakout groups as a multi-index, we can understand how the clinics were answering the survey questions not for all of their patients, but for specific groups, to give the CDC more precise data.

With a bit more work, we could have broken down the data in all sorts of ways — including the effectiveness of IVF for different age ranges and the effectiveness of different techniques. However, I decided that it was complex enough to deal with the questions, and focused on those for my questions this week.

And thus, how can we find out how many different clinics were surveyed? I decided to look at the “FacilityName” column, and to ask Pandas for the number of unique values it contained, using the “nunique” method:

df['FacilityName'].nunique()

I got a result of 453. Another way to do this would be to invoke the “unique” method:

df['FacilityName'].unique()

This returns a NumPy array of the unique values. There’s a third method we can try, “drop_duplicates”, which returns a Pandas series of the values:

df['FacilityName'].drop_duplicates()

Which of these does the job fastest? I again ran “%timeit” in Jupyter, and found:

  • “nunique” takes an average of 1.33 ms
  • “unique” takes an average of 1.28 ms
  • “drop_duplicates” takes an average of 2.84 ms

We again see that while Pandas offers many ways to accomplish the same (or similar) goal, choosing well can have a big impact on your query performance.

As for how many questions were asked, I ran the following:

df['Question'].nunique()

We see that the survey asked clinics to answer 45 questions.

Read more