BW #13: Python developers (solution)
In honor of PyCon US 2023, we'll look at some data from the most recent Python developers survey. What other programming languages do they use? What education did they get?
This week, as I slowly return home from PyCon US, I decided to look at data about Python and who uses it. The data came from the annual developer survey conducted by JetBrains, the company behind PyCharm and other IDEs.
The data is in a single CSV file, which you can download from here:
https://drive.google.com/drive/folders/1nlvy45tE4gFX_oWNxG_UTC1-tLZBTcbR?usp=sharing
From that page, download the `sharing_data.csv` file onto your computer. I wasn’t able to find an easy way to give you a one-click URL to download it.
Once you’ve downloaded the file, here are the questions I asked you to answer:
Load the file into a data frame. We'll only look at a handful of the file's (many!) columns:
All columns starting with `job_role`
All columns starting with `edu_level`
All columns starting with `primary_proglang`
How many people took the survey?
How many survey respondents have each educational level? What percentage have a master's, doctoral, or professional degree?
Turn the single `edu_level` column into many different columns, each indicating with a `True`/`False` value whether this person has that educational level. For example, there should be one column indicating whether they got a bachelor's degree, a second for master's degrees, a third for doctoral degrees, and so forth. Add these new columns to the data frame.
Try to turn these columns back into a single one. Why does this fail?
What are the 10 most common primary programming languages used by people who took the survey? Are the results surprising?
How many people have more than one job role? How many have more than 5?
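As a quick illustration of the indicator-column transformation described above (not the solution itself, which uses the real survey data), `pd.get_dummies` produces one `True`/`False` column per distinct value. A minimal sketch, using made-up education values:

```python
import pandas as pd

# Made-up education data, purely to illustrate the indicator-column idea;
# the real survey's edu_level values are different
df = pd.DataFrame({'edu_level': ['Bachelor', 'Master', 'Doctoral', 'Master']})

# get_dummies creates one True/False column per distinct value
dummies = pd.get_dummies(df['edu_level'], dtype=bool)

# concat adds the new columns alongside the original one
df = pd.concat([df, dummies], axis=1)

print(df.columns.tolist())
```

Each row then has exactly one `True` among the new columns, marking that person's educational level.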
By the way, you might have noticed that I mislabeled the headline on yesterday’s e-mail as “BW #11,” when it was actually the 13th issue. No, this wasn’t an example of an off-by-two error; it just shows that I’m still traveling, somewhat jet lagged, and not as focused as I’d like to be. I’ve fixed the headline in the archives.
And now, let’s get started!
Load the file into a data frame. We'll only look at a handful of the file's (many!) columns:
All columns starting with `job_role`
All columns starting with `edu_level`
All columns starting with `primary_proglang`
Let’s start off by loading the necessary modules:
```python
import pandas as pd
from pandas import Series, DataFrame
```
With that in place, I can then load the data frame into memory. It’s tempting to use `read_csv` as follows:
```python
filename = 'DevEcosystem_2022_sharing_data.csv'
df = pd.read_csv(filename)
```
This will work! But there are a few problems:
If you read the entire thing into memory, then you’ll end up with an absolutely enormous data frame, with more than 3,800 columns. It’ll take a long time to read that into memory, and then to analyze what dtype should be assigned to each column, and then just to store the data. You really want to cut down on the columns that you have.
If you don’t specify the dtype for each column (which you can do, using the `dtype` keyword argument), then Pandas needs to analyze the values in each column in order to decide on the dtype. With so many rows and so many columns, that means holding a lot of data in memory in order to make that determination. In such a case, Pandas will give you a warning, telling you that it isn't sure what to do, and that you should either specify dtypes or pass `low_memory=False`, which tells `read_csv` that it can use however much memory it needs in order to perform that analysis.
We’ll pass `low_memory=False`. But beyond that, we’re going to select a handful of columns. How can we do that?
The `usecols` keyword argument lets us specify which columns should be read. Normally, I like to pass a list of strings naming the columns I’d like to keep around; I find that to be the easiest and most readable method. But here, I want to read many different columns, whose names start with several different prefixes.
I could do this by reading one row from the CSV file into memory, and grabbing the column names:
```python
column_names = pd.read_csv(filename, nrows=1).columns
```
Then I could iterate over those column names, returning only those that matched the pattern I wanted, perhaps with a list comprehension:
```python
column_names = [one_column
                for one_column in column_names
                if (one_column.startswith('job_role') or
                    one_column.startswith('edu_level') or
                    one_column.startswith('primary_proglang'))
                ]
```
Notice that I’m using the `str.startswith` method, which returns `True` if the string in question starts with the string I pass as an argument.
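Incidentally, `startswith` also accepts a tuple of prefixes, returning `True` if the string starts with any one of them, which lets us collapse the three separate checks into one. A minimal sketch, using made-up column names in place of the real file's header:

```python
# startswith accepts a tuple of prefixes, so the three separate
# checks can be collapsed into a single test
prefixes = ('job_role', 'edu_level', 'primary_proglang')

# hypothetical column names, standing in for the real file's header
all_columns = ['job_role_dev', 'age', 'edu_level', 'primary_proglang_1']

column_names = [one_column
                for one_column in all_columns
                if one_column.startswith(prefixes)]

print(column_names)
```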
Then I could say:
```python
df = pd.read_csv(filename,
                 usecols=column_names,
                 low_memory=False)
```
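An alternative that avoids reading the header in a separate pass: `usecols` also accepts a callable, which `read_csv` invokes once per column name, keeping only the columns for which it returns `True`. Here's a sketch against a tiny made-up CSV; the column names are hypothetical stand-ins for the real file's:

```python
from io import StringIO

import pandas as pd

# A tiny stand-in for the survey file; the real CSV has thousands of columns
csv_data = StringIO('''job_role_dev,age,edu_level,primary_proglang_1
True,25,Bachelor,Python
False,40,Master,Go
''')

# usecols can be a callable; pandas calls it on each column name and
# keeps only those columns for which it returns True
df = pd.read_csv(csv_data,
                 usecols=lambda name: name.startswith(('job_role',
                                                       'edu_level',
                                                       'primary_proglang')))

print(df.columns.tolist())
```

Either way, we end up with a data frame containing only the columns we care about.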