Bamboo Weekly #158: University endowments (solution)

Get better at: Excel files, CSV files, plotting with Plotly, grouping, maps, regular expressions, and joins.

This week, we looked at the endowments of a variety of North American universities. Endowments have always been an important way that universities support themselves, but especially given dramatic cuts in research funding by the Trump administration, universities are relying on their endowments more than ever.

How big are these endowments? It varies a lot by university, as reported in the Financial Times blog FT Alphaville last week (https://www.ft.com/content/4287eba7-63eb-4d1d-ba10-cdb56d8a5520). Their article used data from NACUBO, the National Association of College and University Business Officers (https://www.nacubo.org/Research/2025/Public-NCSE-Tables).

This week, we looked at that data, to better understand the distribution of such endowments across universities and geographies. And yes, inspired by the FT Alphaville article, I was especially intrigued by the geographical distribution — which dovetailed nicely with my recent discovery that Plotly supports a variety of types of maps.

Paid subscribers, both to Bamboo Weekly and to my LernerPython+data membership program (https://LernerPython.com), get all of the questions and answers, as well as downloadable data files, downloadable versions of my notebooks, one-click access to my notebooks, and invitations to monthly office hours.

Learning goals for this week include working with Excel and CSV files, grouping, plotting (including plotting maps) with Plotly, cleaning data, and joins.

Data and six questions

This week's data comes from NACUBO:

https://www.nacubo.org/Research/2025/Public-NCSE-Tables

You can download the data directly from them, or (if you're a paid subscriber) from the link at the bottom of this message.

We'll also use, in the final question, a mapping from cities to longitude and latitude, stored on GitHub at https://github.com/kelvins/US-Cities-Database.

Here are my solutions and explanations for this week's six questions:

Load the Excel file into a Pandas data frame. Make sure that the endowment amounts and changes are float values, and that the institution names only contain alphanumeric characters and whitespace. Drop the original versions of the columns.

I started off, as usual, by loading the appropriate packages into Python:

import pandas as pd
from plotly import express as px

Then I loaded the file into Pandas with a call to read_excel. However, I needed to give read_excel two hints in order to read the data correctly. First, I told it to read column headers from line 4 (what Excel labels as line 5). Second, I told it to read only 678 rows, since following the actual data rows, the file contained a bunch of footnotes that are fine for humans to read, but that mess up read_excel:

filename = 'data/bw-158-2025-NCSE-Endowment-Market-Values-for-US-and-Canadian-Institutions-FINAL.xlsx'

df = (
        pd
        .read_excel(filename, header=4, nrows=678)
)

This worked, in the sense that it gave me a data frame. But there were a number of problems with the data it read in. In particular, a number of columns, including those containing numeric values, had cells containing nothing but a space character. read_excel saw these as non-numeric strings, and thus gave the entire column a string ("object") dtype, which didn't really help.
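
Here's a quick way to see the problem for yourself (a sketch, assuming the placeholder really is a single-space string):

# Numeric-looking columns that contain ' ' placeholder strings arrive
# with an "object" dtype, rather than int64 or float64:
print(df.dtypes)

# Count how many cells in each column contain just a space character:
print((df == ' ').sum())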

My original solution involved running replace on the data frame, turning the ' ' strings into np.nan, and then converting some columns into float dtypes. But in reviewing my solution, I realized that there's a far easier way to do this, namely passing the na_values keyword argument to read_excel. This lets us name one or more additional values that should be interpreted as NaN, and it dramatically cut down on the amount of manipulation and cleaning I needed to do. Somewhat to my surprise, it also meant that the endowment-value columns contained integers, rather than floats. I was totally OK with that, given that the underlying values were whole numbers in any event.
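
For comparison, here's a minimal sketch of that original, longer approach, applied to just one of the endowment columns (the same treatment would apply to the others):

import numpy as np

# Read the file without na_values, so ' ' cells arrive as strings...
df = pd.read_excel(filename, header=4, nrows=678)

# ...replace those placeholder strings with NaN across the data frame...
df = df.replace(' ', np.nan)

# ...and convert a now-all-numeric column to a float dtype:
df['FY25 Total Endowment Market Value (in $1,000s)'] = (
    df['FY25 Total Endowment Market Value (in $1,000s)'].astype(float)
)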

The exception to that was the "change in total endowment" column, which included a number of footnote characters. I wanted to get rid of those, and just keep the numbers. To do that, I used a combination of assign and pd.col along with the replace method, removing any character that is not a digit, decimal point, or - sign, then turning the result into a float.

Within the same call to assign, I used another pd.col expression, along with str.replace, to remove funny characters from the institution names.

Oh, and while I was already calling assign, I decided to give these new columns slightly easier-to-type names:

df = (
        pd
        .read_excel(filename, header=4, nrows=678, na_values=' ')
        .assign(endowment_change = pd.col('Change in Total Endowment Market Value (%)1').replace(r'[^\d.-]+', r'', regex=True).astype(float),
                institution_name = pd.col('Institution Name').str.replace('[^0-9a-zA-Z -]+', '', regex=True))
)
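
To see what those two regular expressions actually do, here's a small illustration. (The input strings are hypothetical examples, not actual values from the data set.)

import re

# The first pattern strips anything that isn't a digit, decimal point,
# or minus sign, which removes a footnote marker from a number:
print(re.sub(r'[^\d.-]+', '', '12.3*'))     # 12.3

# The second pattern keeps only alphanumerics, spaces, and hyphens:
print(re.sub(r'[^0-9a-zA-Z -]+', '', 'Texas A&M University'))
# Texas AM University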

Next, I made the FY endowment columns easier to type by calling rename with a dict of old and new column names:

df = (
        pd
        .read_excel(filename, header=4, nrows=678, na_values=' ')
        .assign(endowment_change = pd.col('Change in Total Endowment Market Value (%)1').replace(r'[^\d.-]+', r'', regex=True).astype(float),
                institution_name = pd.col('Institution Name').str.replace('[^0-9a-zA-Z -]+', '', regex=True))
        .rename(columns={'FY24 Total Endowment Market Value (in $1,000s)': 'fy24_endowment',
                         'FY25 Total Endowment Market Value (in $1,000s)': 'fy25_endowment'})
)

Finally, I removed a number of columns that I wasn't going to use. In the case of the original endowment-change and institution-name columns, I removed them because I had already created cleaned-up versions using assign, earlier:

df = (
        pd
        .read_excel(filename, header=4, nrows=678, na_values=' ')
        .assign(endowment_change = pd.col('Change in Total Endowment Market Value (%)1').replace(r'[^\d.-]+', r'', regex=True).astype(float),
                institution_name = pd.col('Institution Name').str.replace('[^0-9a-zA-Z -]+', '', regex=True))
        .rename(columns={'FY24 Total Endowment Market Value (in $1,000s)': 'fy24_endowment',
                         'FY25 Total Endowment Market Value (in $1,000s)': 'fy25_endowment'})
        .drop(columns=['Change in Total Endowment Market Value (%)1',
                       'IPEDS UnitID2',
                       'Institution Name'])
)

The result? A data frame with 678 rows and 11 columns.
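
If you want to verify that, a one-line sanity check does the trick:

# The cleaned-up data frame should have 678 rows and 11 columns:
print(df.shape)    # (678, 11)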

Calculate the mean endowment for each institution sector, formatting the numbers with commas and only two digits after the decimal point. Which type of institution has the greatest mean endowment? Is this surprising?

Whenever you want to perform a calculation for every unique value of a column, that's an invitation to use groupby. In this case, we wanted to group by "institution sector," which has the column name IPEDS Institution Sector2. I ran groupby on it, calculating the mean value of fy25_endowment. I then ran sort_values to get the means in ascending order. Finally, I used apply along with str.format to turn each mean into a string in which the digits are separated by commas, with two digits after the decimal point:

(
    df
    .groupby('IPEDS Institution Sector2')['fy25_endowment'].mean()
    .sort_values()
    .apply('{:,.02f}'.format)
)

The result:

IPEDS Institution Sector2	fy25_endowment
Public 2-Year College	37,755.82
4-Year Public College/University	905,428.27
4-Year Private Nonprofit College/University	1,704,588.68
State System Office	5,390,054.69
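
By the way, because of the call to apply, the resulting series contains strings rather than numbers. If you'd rather keep the values numeric and change only their display, one alternative (a sketch; in Jupyter, Styler objects render as HTML tables) is to use the style attribute, which requires first turning the series into a data frame:

(
    df
    .groupby('IPEDS Institution Sector2')['fy25_endowment'].mean()
    .sort_values()
    .to_frame()                  # style works on data frames, not series
    .style.format('{:,.2f}')     # formatting applies to the display only
)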

Now, I would have expected the mean endowment of private universities to be much higher than that of any other sector. But the "State System Office" category was higher still. I'm going to assume that this is because most state university systems include a number of campuses, and that the reported endowment combines all of them, rather than belonging to any one particular school.

Still, we can see that private universities have, on average, about twice the endowment of public universities.
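
That "about twice" claim is easy to check against the means in the table above:

# Ratio of the private nonprofit mean to the public mean:
print(1_704_588.68 / 905_428.27)    # roughly 1.88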