BW #13: Python developers (solution)

In honor of PyCon US 2023, we'll look at some data from the most recent Python developers survey. What other programming languages do they use? What education did they get?

This week, as I slowly return home from PyCon US, I decided to look at data about Python and who uses it. The data came from the annual survey handled by JetBrains, the company behind PyCharm and other editors.

The data is in a single CSV file, which you can download from here:

https://drive.google.com/drive/folders/1nlvy45tE4gFX_oWNxG_UTC1-tLZBTcbR?usp=sharing

From that page, download the `sharing_data.csv` file onto your computer. I wasn’t able to find an easy way to give you a one-click URL to download it.

Once you’ve downloaded the file, here are the questions I asked you to answer:

  1. Load the file into a data frame. We'll only look at a handful of the file's (many!) columns:

    1. All columns starting with `job_role`

    2. All columns starting with `edu_level`

    3. All columns starting with `primary_proglang`

  2. How many people took the survey?

  3. How many people who took the survey have each kind of educational level? What percentage have a master's, doctoral degree, or professional degree?

  4. Turn the single `edu_level` column into many different columns, each indicating with a `True`/`False` value whether this person has that educational level. For example, there should be one column indicating whether they got a bachelor's degree, a second for master's degrees, a third for doctoral degrees, and so forth. Add these new columns to the data frame.

  5. Try to turn these columns back into a single one. Why does this fail?

  6. What are the 10 most common primary programming languages used by people who took the survey? Are the results surprising?

  7. How many people have more than one job role? How many have more than 5?

By the way, you might have noticed that I mislabeled the headline on yesterday’s e-mail as “BW #11,” when it was actually the 13th issue. No, this wasn’t an example of an off-by-two error; it just shows that I’m still traveling, somewhat jet lagged, and not as focused as I’d like to be. I’ve fixed the headline in the archives.

And now, let’s get started!

Load the file into a data frame. We'll only look at a handful of the file's (many!) columns:

  • All columns starting with `job_role`
  • All columns starting with `edu_level`
  • All columns starting with `primary_proglang`

Let’s start off by loading the necessary modules:

import pandas as pd
from pandas import Series, DataFrame

With that in place, I can then load the data frame into memory. It’s tempting to use read_csv as follows:

filename = 'DevEcosystem_2022_sharing_data.csv'
df = pd.read_csv(filename)

This will work! But there are a few problems:

  1. If you read the entire thing into memory, then you’ll end up with an absolutely enormous data frame, with more than 3,800 columns. It’ll take a long time to read that into memory, and then to analyze what dtype should be assigned to each column, and then just to store the data. You really want to cut down on the columns that you have.
  2. If you don’t specify the dtype for each column (which you can do, using the “dtype” keyword argument), then Pandas needs to analyze the values in each column in order to decide on a dtype. With so many rows and so many columns, that means holding a lot of data in memory just to make that determination. In such a case, Pandas gives you a warning, suggesting that you either specify dtypes explicitly or pass “low_memory=False”, which tells read_csv that it may use as much memory as it needs to perform that analysis.
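
Just as a sketch, if we did know exactly which columns we wanted and what types they should have, specifying dtypes might look something like this. (The column names and types here are placeholders meant to show the syntax, not values taken from the actual survey file.)

# A hedged sketch: name the dtypes up front, so Pandas doesn't have to guess them.
# The column names and types below are placeholders.
df = pd.read_csv(filename,
                 dtype={'edu_level': 'string',
                        'primary_proglang': 'string'})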

We’ll pass low_memory=False. But beyond that, we’re going to select a handful of columns. How can we do that?

The “usecols” keyword argument lets us specify which columns should be read. Normally, I like to specify them by passing a list of strings, the columns that I’d like to keep around. I find that to be the easiest and most readable method. But here, I want to read a lot of different columns, starting with a variety of different strings.
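
For reference, the simple list-of-strings form (with hypothetical column names, just to show the syntax) looks like this:

# Hypothetical column names, purely to demonstrate passing a list of strings to usecols
df = pd.read_csv(filename,
                 usecols=['edu_level', 'primary_proglang'],
                 low_memory=False)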

I could do this by reading one row from the CSV file into memory, and grabbing the column names:

column_names = pd.read_csv(filename, nrows=1).columns

Then I could iterate over those column names, returning only those that matched the pattern I wanted, perhaps with a list comprehension:

column_names = [one_column
                for one_column in column_names
                if (one_column.startswith('job_role') or
                    one_column.startswith('edu_level') or
                    one_column.startswith('primary_proglang'))
                ]

Notice that I’m using the “str.startswith” method, which returns True if the string in question starts with the string I pass as an argument.
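
For instance, with a couple of made-up column names:

'job_role_developer'.startswith('job_role')    # True
'country'.startswith('job_role')               # False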

Then I could say:

df = pd.read_csv(filename,
                 usecols=column_names,
                 low_memory=False)

But did you know that str.startswith can instead take a tuple as an argument? If you pass a tuple of strings, then the method returns True if any of the strings appears at the start. Meaning, I can rewrite my above list comprehension as:

column_names = [one_column
                for one_column in column_names
                if one_column.startswith(('job_role', 'edu_level',
                                           'primary_proglang'))
                ]

This is pretty great, and I believe it’s an improvement over our previous code. However, it still feels a bit clunky. That’s because I’m reading from the file to get the names of the columns, so that I can filter through the columns with some code.

We could instead pass a list of integers as the “usecols” argument. However, that’s not really going to help us here.

But there is another option, one which seems particularly appropriate: We can pass a callable to usecols, meaning a Python function or class. That callable will be invoked for each of the column names. Wherever the callable returns True, the column will be kept. And wherever the callable returns False, the column will be ignored.

I can thus wrap my list comprehension into a function, and pass the function to “usecols”, as follows:

def columns_wanted_filter(column_name):
    return column_name.startswith(('job_role',
                                   'edu_level',
                                   'primary_proglang'))

df = pd.read_csv(filename,
                 usecols=columns_wanted_filter,
                 low_memory=False)

Doing this assigns a data frame to “df” that contains only the columns that are of interest to us. If and when we want to add other columns, or modify the way in which we select columns, we can just change the function.

And yes, I could use a “lambda” here, and in some cases that might be easiest and shortest. But I’ve shied away from lambda over the last few years, in part because (in my experience) it’s harder for beginner Python developers to understand. And yes, lambda tends to be common in Pandas, and you can’t ignore it completely when using Pandas — but if I can reduce its use, I will.
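
Just to show what that would look like, here's the same read_csv call written with a lambda instead of a named function:

# The same column selection as above, using an anonymous function
df = pd.read_csv(filename,
                 usecols=lambda colname: colname.startswith(('job_role',
                                                             'edu_level',
                                                             'primary_proglang')),
                 low_memory=False)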

We end up with a data frame with 58,538 rows and 62 columns. That’s still a lot of columns (and not a small number of rows), but it’s certainly better than what we had before.

How many people took the survey?

This seems, at first glance, like an easy question to answer: We just need to check how many rows are in our data frame.

But wait: What’s the fastest, best, and easiest way to do that?

It might be tempting to use the “count” method on our data frame — or, perhaps, on one of the columns in our data frame. However, “count” only counts the number of non-NA (and non-NaN) values that are there. Which means that if you choose a column that contains a lot of NA values, you’ll get a completely wrong answer.
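
Here's a tiny illustration, using a made-up series that contains a missing value:

s = Series([10, None, 30])   # the None is stored as NaN

s.count()    # 2 -- "count" ignores the NaN
len(s)       # 3 -- counts every row, including the NaN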

Moreover, calling “count” means that you’re invoking a method, which is generally going to be kind of slow.

That said, let’s see what kind of answer we can get, and how long it takes to run, using the “timeit” magic command in Jupyter:

%timeit df.count()

The result that I get is:

64.6 ms ± 6.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In other words, “timeit” ran this code in 7 runs of 10 loops each, and found that on average, it took 64.6 ms to run. That might sound pretty fast, but maybe there’s a faster way.

What if I run it on a single column? Let’s try the first one:

%timeit df[df.columns[0]].count()

How long did that take?

1.28 ms ± 86.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

That’s a lot faster… but maybe we can get it faster still?

One alternative would be to get the shape of our data frame. The “shape” attribute returns a two-element tuple, in which the first (at index 0) contains the number of rows, and the second (at index 1) contains the number of columns. I could thus say:

df.shape[0]

This will give me the right answer, but will it be any faster than running “count”? Let’s find out:

%timeit df.shape[0]

Running that results in:

433 ns ± 22.8 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Wow. Just to remind you, there are 1,000,000 (yes, 1 million) ns in 1 ms, so that 1.28 ms comes to 1,280,000 ns. In other words, using “shape” here is roughly 3,000 times faster than calculating “count” on a single column.

But maybe we can get it to run faster yet?

For example, what if I just run “len” — yes, the standard Python “len” function — on our data frame? How long will that take?

%timeit len(df)

The output:

267 ns ± 8.07 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In other words, using “len” on the data frame takes only about 60 percent as long as retrieving the result from “shape”.

I was long under the impression that this was the best we could do. But then someone told me about an even faster technique, and I decided to try it: We could run “len” on df.index. That is, we retrieve the index from our data frame, and then calculate the length of that index object:

%timeit len(df.index)

And how long does that take?

173 ns ± 2.62 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In other words: There are a number of ways to find out how many rows are in our data frame. They’ll all give us the same results. But if you want to count the rows, you’re best off invoking “len” on the data frame’s index.
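
To summarize, here are the row-counting options we compared, from fastest to slowest in my timings:

len(df.index)                # fastest in my timings
len(df)                      # close behind
df.shape[0]                  # still very fast
df[df.columns[0]].count()    # far slower, and it undercounts if that column contains NA values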