BW #140: Stack Overflow survey (solution)

Get better at: Working with CSV, grouping, regular expressions, and filtering.

This week, we looked at data from the latest developer survey from Stack Overflow. It seems quaint now, with everyone using generative AI to answer their questions – but for a very long time, Stack Overflow provided millions of programmers with answers to their questions. (Some of those answers were even accurate!)

We'll see what the future holds for Stack Overflow itself, but the company continues to produce an annual survey about programming trends. This year's results were published about two months ago as a set of summaries (https://survey.stackoverflow.co/2025/work/). Now that the raw data is downloadable from Stack Overflow, I thought it would be interesting to examine some of their findings, seeing if we can replicate them using Pandas.

Data and five questions

This week's data comes from Stack Overflow's 2025 survey. The data was a bit hard to find, even though it's on the main "Stack Overflow Developer Survey" page at https://survey.stackoverflow.co/. Perhaps that's because the annual survey summary page (https://survey.stackoverflow.co/2025/work/) is more attractive, is linked to by more sites, and doesn't directly link to the data.

Learning goals this week include memory optimization, categorization, grouping, and working with text (including regular expressions).

Paid subscribers, including members of my LernerPython.com site (https://LernerPython.com), can download the data file from the end of this message, as well as participate in office hours, download my Marimo notebook, and open the notebooks in Molab (https://molab.marimo.io/) with a single click.

Here are this week's solutions and explanations:

Read the 2025 survey data into a Pandas data frame. How much memory can you save by using categories instead of regular string ("object") columns? The JobSat column indicates, on a scale from 0-10, how satisfied people are in their jobs. Divide the population into "dissatisfied" (0-3), "neutral" (4-7), and "satisfied" (8-10). What proportion of respondents are in each category?

I started by loading up Pandas:

import pandas as pd

I then defined filename to contain the name of the file (in this case, the CSV file that I had zipped, to avoid going over GitHub's size limit) and used pd.read_csv to read it. Notice that read_csv is just fine reading from a compressed file:

filename = 'data/bw-140-so2025.zip'
df = pd.read_csv(filename)

It actually worked just fine, and didn't even take that long – just under 2 seconds on my computer.

But I like to use PyArrow's CSV-loading engine whenever I can, and I was curious to know whether it would work here:

filename = 'data/bw-140-so2025.zip'
df = pd.read_csv(filename, engine='pyarrow')

And... it didn't work! I got an error message:

    CSV parser got out of sync with chunker. This can mean the data file contains cell values spanning multiple lines; please consider enabling the option 'newlines_in_values'.
    See the console area for a traceback.

I'm not sure whether the CSV file really does have records that span multiple lines. But it almost doesn't matter, because the only way to overcome this problem is by passing an argument to the PyArrow CSV-loading routine. And there isn't any obvious way to do that from within pd.read_csv in Pandas.

It is possible to use PyArrow directly to read the CSV file into a PyArrow table, and then turn it into a Pandas data frame. But I decided that it wasn't worth doing that.
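
If you did want to go that route, here's a minimal sketch. It assumes the CSV has been unzipped to data/bw-140-so2025.csv, since PyArrow's CSV reader doesn't open zip archives directly, and it enables the newlines_in_values option that the error message suggested:

import pyarrow.csv as pacsv

# Read the CSV with PyArrow, allowing cell values that span multiple lines,
# then convert the resulting Arrow table into a Pandas data frame
table = pacsv.read_csv(
    'data/bw-140-so2025.csv',
    parse_options=pacsv.ParseOptions(newlines_in_values=True),
)
df = table.to_pandas()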

Next, I asked you to calculate the amount of memory used by this data frame. Because so many of the columns are strings, it isn't enough to call memory_usage or df.info with their default arguments, because both will tell you how much memory the pointers to Python strings are using, not the strings themselves. To get the full picture, you need to invoke df.memory_usage(deep=True) or df.info(memory_usage='deep'). That takes longer, but gives you a more accurate picture.

df.info(memory_usage='deep')

I got:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49123 entries, 0 to 49122
Columns: 170 entries, ResponseId to JobSat
dtypes: float64(50), int64(1), object(119)
memory usage: 356.8 MB

We see that the data frame uses nearly 357 MB of memory. The CSV file (uncompressed) uses 134 MB of disk space, meaning that the in-memory version is about three times larger. That's pretty standard when we read data into Pandas, and is a bit of a problem.
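
By the way, df.memory_usage reports the same information as a per-column series of byte counts. Here's a quick sketch comparing the shallow and deep totals, where the shallow figure counts only the 8-byte pointers in the object columns, and the deep figure also measures the Python strings they point to:

# Compare shallow vs. deep memory accounting for the same data frame
shallow_mb = df.memory_usage().sum() / 1_048_576
deep_mb = df.memory_usage(deep=True).sum() / 1_048_576
print(f'shallow: {shallow_mb:.1f} MB')
print(f'deep:    {deep_mb:.1f} MB')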

However, we can use categories to reduce the memory footprint. A categorical column replaces each string with a small integer, and keeps a mapping between the integers and the strings. Each distinct string is stored a single time, dramatically reducing how much memory is needed.
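
Here's a small illustrative sketch, using the Country column, of what a categorical column actually stores:

# A categorical series stores small integer codes, plus a lookup table
# mapping those codes to the distinct strings (the categories)
country = df['Country'].astype('category')
print(country.cat.codes.head())      # one small integer per row
print(country.cat.categories[:5])    # each distinct string stored once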

One way to replace all of the string columns with categories is to use select_dtypes to look for object columns (typically strings). Then we can use astype and assign the results back to those columns:

columns_to_categorize = (
    df
    .select_dtypes('object')
    .columns
)

for index, one_column in enumerate(columns_to_categorize):
    print(f'[{index}] {one_column}')
    df[one_column] = df[one_column].astype('category')

This actually works well! But I recently realized that there's an even better way to do this. I always think of astype as a method that runs on a series, allowing us to assign the result to a new series – or back to the original series, with a new and different dtype. But astype also exists on data frames, returning a new data frame in which every column has been converted to the requested dtype.

We already have the names of all of the object columns in columns_to_categorize. We can use those names to retrieve a subset of df, run astype('category') on that entire subset, and then assign the result back to the same columns:

df[columns_to_categorize] = df[columns_to_categorize].astype('category')

Much shorter than my for loop – and it does the same thing! The question is, how much memory did it save? Let's ask:

df.info(memory_usage='deep')

After categorizing, the data frame uses 66.4 MB. In other words, we have reduced the size by more than 80 percent. However, we still have all of the functionality that we did before, including access to the string methods.
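
For example, here's a quick sketch showing that the .str accessor still works on the now-categorical Country column:

# String methods still work on a categorical column whose categories are strings
print(df['Country'].str.contains('United').sum())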

Next, I asked you to determine how many programmers filling out the survey are satisfied with their jobs. The JobSat column in the data frame contains numbers from 0-10 indicating satisfaction, but I asked you to turn those numbers into three categories: "dissatisfied" (0-3), "neutral" (4-7), and "satisfied" (8-10).

Fortunately, Pandas offers pd.cut, which takes a numeric series as an argument. It returns a new series with the same index as the input, but with categorical values. The bins keyword argument defines the dividing lines between the categories, and the labels keyword argument defines the labels. Note that if there are three labels (as here), there must be four bin edges, defining three bins.

Also note that the left side of each bin is normally not included; the include_lowest keyword argument ensures that the lowest value will indeed be included.

My query thus looked like this:

(
    pd
    .cut(df['JobSat'],
         bins=[0, 3, 7, 11],
         labels=['dissatisfied', 'neutral', 'satisfied'],
         include_lowest=True)
    .value_counts(normalize=True)
)

You can see that after getting the categorical series back – and yes, its dtype is indeed category! – I then ran value_counts, passing normalize=True in order to get proportions rather than raw counts. I got the following results:

JobSat	proportion
satisfied	0.5059490297639154
neutral	0.4319333408399955
dissatisfied	0.06211762939608903

I must admit that I was a bit surprised by these results, given the number of people I speak with who indicate they are dissatisfied with their work. Then again, this is a self-selecting group of programmers who not only use Stack Overflow but are motivated to fill out a survey sponsored by Stack Overflow. That might account for the very low (6 percent) number of dissatisfied respondents.

Of course, it could also be that I simply hang out with a bunch of unhappy pessimists who aren't anything close to a representative sample.

What proportion of developers have at least a four-year university degree? Do we see a difference between the United States and other countries? What proportion of developers have a master's degree or PhD? Again, do we see a difference between the United States and other countries? What country has the greatest proportion of developers with masters and PhD degrees?

The EdLevel column indicates the respondent's educational level. We can thus find out what proportion of respondents have reached each educational level by running value_counts on EdLevel, and passing normalize=True.


(
    df
    ['EdLevel'].value_counts(normalize=True)
)

This gives us the proportion of each educational level:

EdLevel	proportion
Bachelor’s degree (B.A., B.S., B.Eng., etc.)	0.42104934805664734
Master’s degree (M.A., M.S., M.Eng., MBA, etc.)	0.2615051885124878
Some college/university study without earning a degree	0.12841308461746417
Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)	0.07542579075425791
Professional degree (JD, MD, Ph.D, Ed.D, etc.)	0.05448458003202529
Associate degree (A.A., A.S., etc.)	0.032441200324412
Other (please specify):	0.014577744504751805
Primary/elementary school	0.01210306319795371

But I only wanted to know about bachelor's, master's, and professional degrees. Since the result of value_counts is a series, I used filter to retrieve only those rows whose index labels contained one of those words:


(
    df
    ['EdLevel'].value_counts(normalize=True)
    .filter(regex='Bachelor|Master|Professional')
)

In the above regular expression, I used |, the alternation character, to indicate that I wanted any row that contained the string "Bachelor" or "Master" or "Professional". That returned the three rows that were of interest, each with its own proportion. I then ran sum to combine them:


(
    df
    ['EdLevel'].value_counts(normalize=True)
    .filter(regex='Bachelor|Master|Professional')
    .sum()
)

The result is np.float64(0.7370391166011604), or about 74 percent.

I was curious to know if this proportion was similar both in the United States and in other countries. I thus reran the same query, but before running value_counts on the EdLevel column, I first used loc and lambda to keep only those rows in which the respondent was from the United States:

(
    df
    .loc[lambda df_: df_['Country'] == 'United States of America']
    ['EdLevel'].value_counts(normalize=True)
    .filter(regex='Bachelor|Master|Professional')
    .sum()
)

This gave a result of np.float64(0.758850478967097), or about 76 percent – a bit higher, but not by a lot.

What about people from countries outside of the United States? I ran almost the same query as above, changing the == operator to !=:

(
    df
    .loc[lambda df_: df_['Country'] != 'United States of America']
    ['EdLevel'].value_counts(normalize=True)
    .filter(regex='Bachelor|Master|Professional')
    .sum()
)

The result here was np.float64(0.7331963604343998), or about 73 percent, slightly lower than both the overall proportion and the US proportion.

I then wondered if there was a difference between the US and other countries for people with graduate degrees. I thus reran the above two queries, but left out the word "Bachelor":

(
    df
    .loc[lambda df_: df_['Country'] == 'United States of America']
    ['EdLevel'].value_counts(normalize=True)
    .filter(regex='Master|Professional')
    .sum()
)

In the United States, about 26 percent of respondents (roughly one quarter) have graduate degrees. For people from outside of the United States, I ran:

(
    df
    .loc[lambda df_: df_['Country'] != 'United States of America']
    ['EdLevel'].value_counts(normalize=True)
    .filter(regex='Master|Professional')
    .sum()
)

Here, the number was far higher, at about 32 percent, or just about one-third. That's a pretty striking difference – but it matches what I've seen when traveling to other countries, where a surprisingly large number of coders have graduate degrees.
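
As an aside, here's a sketch of an alternative approach (not what I ran above) that gets both numbers in a single query, by grouping on a boolean column marking whether the respondent is from the United States:

(
    df
    .assign(is_us=df['Country'] == 'United States of America')
    .groupby('is_us')['EdLevel']
    .apply(lambda s: s.value_counts(normalize=True)
                      .filter(regex='Master|Professional')
                      .sum())
)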

Which country has the highest proportion of people with graduate degrees? This sounds, on the face of it, like a groupby operation, since I want to perform a calculation on EdLevel, getting a result for each value of Country. But what aggregation method would I want to run?

Once again, I can use value_counts, passing normalize=True. The result will be a series with a two-level multi-index on the rows: the outer level contains the country names, and the inner level contains the educational levels:


(
    df
    .groupby('Country')['EdLevel'].value_counts(normalize=True)
)

In order to separate things out most easily, I decided to use the unstack method to turn the inner multi-index level into columns. In other words, I ended up with a data frame whose rows were labeled with country names, whose columns were educational levels, and whose values were the proportion of people with each educational level in each country:


(
    df
    .groupby('Country')['EdLevel'].value_counts(normalize=True)
    .unstack('EdLevel')
)

I was only interested in the columns for master's and professional degrees, and thus used filter again, this time matching against the column names by passing axis='columns':


(
    df
    .groupby('Country')['EdLevel'].value_counts(normalize=True)
    .unstack('EdLevel')
    .filter(regex='Master|Professional', axis='columns')
)

This gave me a data frame with the countries in the rows and the proportion of each type of graduate degree in the columns. I ran sum to add the proportions together, again specifying axis='columns', and then ran nlargest to get the 10 countries with the highest such proportion:


(
    df
    .groupby('Country')['EdLevel'].value_counts(normalize=True)
    .unstack('EdLevel')
    .filter(regex='Master|Professional', axis='columns')
    .sum(axis='columns')
    .nlargest(10)
)

Here are the results:

Country	0
Antigua and Barbuda	1.0
Gabon	1.0
Mali	1.0
Mauritania	1.0
Barbados	0.75
Democratic People's Republic of Korea	0.75
France	0.7018544935805991
Luxembourg	0.6086956521739131
Qatar	0.6
Ukraine	0.5663900414937759

Um, that was not what I was expecting! I mean, France and Luxembourg, I sort of expected. Qatar and Ukraine, I guess made sense, too. But North Korea?

And what about those countries in which 100 percent of respondents had graduate degrees? It would seem that very few people from those countries filled out the survey: only two respondents from Antigua and Barbuda, only one from Gabon, and so forth.
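
Here's a quick sketch of how you might check that, counting the respondents from the countries at the top of the list:

# Count how many respondents came from each of the countries that showed
# a 100-percent proportion of graduate degrees
(
    df['Country']
    .value_counts()
    .loc[['Antigua and Barbuda', 'Gabon', 'Mali', 'Mauritania']]
)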