[Reminder: The 7th cohort of my Python Data Analysis Bootcamp (PythonDAB) will start on June 19th! To learn more about this 4-month intense-but-intimate mentoring program in Python, Git, and Pandas, join me at a Webinar on Monday, June 9th: https://us02web.zoom.us/webinar/register/WN_by8ZHiQkRPGmXjZrkaPeVw .]
This week, we looked into the many scientific funding grants that the US has cancelled over the last few months. For many decades, a large proportion of science research funding in the United States came from the National Institutes of Health (NIH) and National Science Foundation (NSF). However, the Trump administration has cancelled many of these grants, leaving thousands of scientists, researchers, and doctors at universities, medical centers, and government laboratories without funding – and sometimes without jobs.
The data we looked at this week comes from Grant Watch (https://grant-watch.us/), which tracks cancelled NIH and NSF grants. We explored what kinds of research were affected and the reasons why some of these grants were cancelled.
Data and seven questions
This week's data comes from Grant Watch (https://grant-watch.us/), about which you can read on its home page. Their data comes from a variety of sources, including the government and individual researchers who have reported on terminated grants. NIH and NSF data is tabulated separately, so I'll ask a few questions about each data set. You can, of course, dig into either or both of them more deeply if and when you need to.
Paid subscribers can download copies of the data files, as well as my notebook, from the end of this message. Now that I'm back from my trip to the US, I'll soon return to posting YouTube videos walking you through solutions to each week's first two questions.
Learning goals this week included working with CSV files, grouping, text manipulation, formatting, and date/time data.
Here are my solutions and explanations:
Read the NIH data into a Pandas data frame. Which 15 institutions had the greatest dollar amount of grants (usa_total_award) terminated? Display each institution in "title" format (i.e., with each word starting with a capital letter), and display each dollar amount with commas before every three digits.
Let's start by importing Pandas:
import pandas as pd
Next, I downloaded a CSV version of Grant Watch's data, and then read it into a data frame using read_csv:
nih_filename = 'data/nih_terminations_airtable.csv'
nih_df = pd.read_csv(nih_filename)
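Before going further, it's worth a quick sanity check that the file loaded as expected. Here's a minimal sketch; the column name assumes the Grant Watch schema we use below:

# How many rows and columns did we load?
print(nih_df.shape)

# usa_total_award must be numeric for the sum aggregation we'll run next
print(nih_df['usa_total_award'].dtype)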
To find the 15 institutions with the greatest total amount of terminated grants, we can use a groupby query on the usa_total_award column, invoking the sum aggregation method:
(
    nih_df
    .groupby('org_name')['usa_total_award'].sum()
)
However, we only want the institutions with the 15 largest totals. We can get those with nlargest(15):
(
    nih_df
    .groupby('org_name')['usa_total_award'].sum()
    .nlargest(15)
)
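If nlargest is new to you: invoked on a series, nlargest(15) is effectively shorthand for sorting the values in descending order and keeping the first 15. Just as a sketch, an equivalent spelling:

(
    nih_df
    .groupby('org_name')['usa_total_award'].sum()
    .sort_values(ascending=False)    # biggest totals first
    .head(15)                        # keep only the top 15
)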
This works, but I asked you to format the values in a way that would make them more readable. For starters, I asked you to display institution names in "title" format. We can use str.title, but in order to change the existing org_name column, I decided that it was easiest to invoke assign, assigning a new column in place of the existing one:
(
    nih_df
    .assign(org_name = lambda df_: df_['org_name'].str.title())
    .groupby('org_name')['usa_total_award'].sum()
    .nlargest(15)
)
Finally, I took the numeric dollar values and used an f-string with the format code ',.2f' to add commas to the numbers, as well as two digits after the decimal point. Because I wanted to apply this formatting to each element in the series, I invoked map:
(
    nih_df
    .assign(org_name = lambda df_: df_['org_name'].str.title())
    .groupby('org_name')['usa_total_award'].sum()
    .nlargest(15)
    .map(lambda x: f'${x:,.2f}')
)
The results:
org_name usa_total_award
Fred Hutchinson Cancer Center $1,188,361,727.38
Harvard School Of Public Health $987,412,985.74
Harvard Medical School $817,395,752.22
Columbia University Health Sciences $805,368,280.60
Family Health International $679,392,068.71
Duke University $527,452,196.00
Harvard University $358,682,929.11
Emory University $233,538,382.00
University Of California, San Francisco $224,026,799.28
Allen Institute $173,380,914.00
Univ Of North Carolina Chapel Hill $133,237,001.00
Massachusetts General Hospital $127,449,149.00
University Of Minnesota $113,048,728.00
Stanford University $111,541,978.34
Northwestern University At Chicago $111,170,535.73
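If the ',.2f' format code looks cryptic, it combines two pieces of Python's format mini-language: the ',' requests thousands separators, and '.2f' requests fixed-point notation with two digits after the decimal point. A quick demonstration in plain Python, using the top value from the table above:

amount = 1188361727.38
print(f'${amount:,.2f}')    # prints $1,188,361,727.38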
Repeat this task with the NSF data, looking at nsf_total_budget. You'll want the nsf_startdate column to be treated as a datetime value. Do you see any overlap between the NIH and NSF results? What issues might we have in trying to combine the two data sets?
The two data sets are similar, but there are differences between them. I first loaded the NSF data into a data frame:
nsf_filename = 'data/nsf_terminations_airtable.csv'
nsf_df = pd.read_csv(nsf_filename,
                     parse_dates=['nsf_startdate'])
Notice that I used parse_dates to tell read_csv that the dtype of nsf_startdate should be datetime.
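If you've already read the CSV file without parse_dates, you can convert after the fact with pd.to_datetime. A minimal sketch:

# Replace the string column with datetime values
nsf_df['nsf_startdate'] = pd.to_datetime(nsf_df['nsf_startdate'])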
But before we can perform any calculations on the NSF data, we need to modify it. That's because the CSV file kept the leading dollar sign ($) at the start of every dollar amount, which means that Pandas read nsf_total_budget as a column of strings, making numeric calculations impossible.
I solved this by using assign to assign a new series of values to nsf_total_budget, taking the existing values, invoking str.replace on them to remove the '$' character, and then invoking astype(int) to get integers back:
(
    nsf_df
    .assign(nsf_total_budget =
                lambda df_: df_['nsf_total_budget'].str.replace('$', '').astype(int))
)
The rest of the query then follows the same pattern as our query on the NIH data: I added a second keyword argument to assign, invoking str.title on the institution names as we had done before. I then invoked groupby on org_name, summed the values of nsf_total_budget, chose the top 15 institutions, and added dollar signs:
(
    nsf_df
    .assign(nsf_total_budget =
                lambda df_: df_['nsf_total_budget'].str.replace('$', '').astype(int),
            org_name = lambda df_: df_['org_name'].str.title())
    .groupby('org_name')['nsf_total_budget'].sum()
    .nlargest(15)
    .map(lambda x: f'${x:,.2f}')
)
The result:
org_name nsf_total_budget
Harvard University $148,058,246.00
Arizona State University $36,774,883.00
University Of Colorado At Boulder $27,554,170.00
University Of California-Berkeley $22,752,270.00
University Of Texas At El Paso $21,304,493.00
University Of Texas At Austin $20,581,798.00
University Of California-Irvine $17,553,149.00
Regents Of The University Of Michigan - Ann Arbor $17,394,304.00
Auburn University $16,430,719.00
University Of Maryland Baltimore County $15,668,687.00
Michigan State University $15,558,603.00
University Of Washington $15,526,091.00
Sri International $14,633,152.00
University Of Wisconsin-Madison $14,027,537.00
Clark Atlanta University $13,969,850.00
As for whether we see any overlap between these two lists, Harvard stands out as being on both of them – in three places on the NIH list, and in the top place on the NSF list.
But if we were interested in combining these lists, I can already see that we would have trouble. That's because the institution names don't appear to be standardized. Harvard alone appears under three different names on the NIH list, and the two data sets even punctuate campus names differently ("University Of California, San Francisco" vs. "University Of California-Berkeley"), so combining them into a single data set would take some serious cleanup.
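Just to sketch what such cleanup might involve, here's one possibility; the normalize_names helper is hypothetical, and its rules are illustrative rather than complete:

def normalize_names(s):
    # Deliberately crude normalization: lowercase, strip whitespace, and
    # expand one abbreviation we saw above. Real matching needs more rules.
    return (s.str.lower()
             .str.strip()
             .str.replace('univ of', 'university of', regex=False))

nih_totals = (nih_df
              .assign(org_name=lambda df_: normalize_names(df_['org_name']))
              .groupby('org_name')['usa_total_award'].sum())

nsf_totals = (nsf_df
              .assign(org_name=lambda df_: normalize_names(df_['org_name']),
                      nsf_total_budget=lambda df_: (df_['nsf_total_budget']
                                                    .str.replace('$', '')
                                                    .astype(int)))
              .groupby('org_name')['nsf_total_budget'].sum())

# After concatenating the two series, institutions whose normalized names
# match will collapse into a single row when we group and sum again
combined = pd.concat([nih_totals, nsf_totals]).groupby(level=0).sum()

Even then, Harvard University, Harvard Medical School, and Harvard School Of Public Health would remain separate rows; deciding whether those should count as one institution is a judgment call, not a string operation.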