BW #72: City travel (solution)
This week, we looked at the rich data set collected by Rafael Prieto-Curiel and Juan P. Ospina in their recent paper, "The ABC of mobility" (https://www.sciencedirect.com/science/article/pii/S0160412024001272). Their research looked at a large number of cities around the world, and how people travel to work within them. They broadly classified travel methods as A, B, or C:
- A stands for "active," and includes walking and biking
- B stands for "bus," and is the overall category for public transportation
- C stands for "car," and describes people who drive to work
They found, not surprisingly, that cities in North America rely on cars more than cities in the rest of the world do.
A recent article in the Economist (https://www.economist.com/interactive/2024-walkable-cities) summarized the paper. This seemed relevant not only because of when the paper was published, but also because of the recent cancellation of congestion pricing in Manhattan, a plan that had been set to go into effect later this month (https://www.nytimes.com/2024/06/05/nyregion/congestion-pricing-pause-hochul.html?unlocked_article_code=1.2k0.xMmn.a7Q2Lxo3J885&smid=url-share).
Data and seven questions
The researchers behind the ABC paper have a Web site (https://citiesmoving.com/) with interactive visualizations. And, fortunately for us, they have also made their data available at
https://github.com/rafaelprietocuriel/ModalShare/blob/main/ModalShare.csv
Here are the seven tasks and questions that I gave you, along with my detailed explanations. As always, a link to the Jupyter notebook that I used to solve these problems is at the bottom of the page.
Import the CSV file into a data frame. We'll want a three-part multi-index from the region, Country, and City columns.
Before doing anything else, we'll load up Pandas:
import pandas as pd
With that in place, we can load up the CSV file. Because it wasn't too large, I decided to load the entire thing all at once, using read_csv. However, it took a bit longer than I would have expected, so I used the PyArrow engine, by giving the engine='pyarrow' keyword argument:
filename = '/Users/reuven/Downloads/ModalShare.csv'
df = pd.read_csv(filename, engine='pyarrow')
(In order to use the pyarrow engine, you'll need to install the package with pip install pyarrow.)
However, I also asked you to create a three-part multi-index. We could, of course, use set_index after creating the data frame. But it's easier to just pass the index_col keyword argument to read_csv. As is almost always the case in Pandas, wherever we can pass a single string value to that keyword argument, we can instead pass a list of strings. We can thus specify our multi-index in this way:
filename = '/Users/reuven/Downloads/ModalShare.csv'
df = pd.read_csv(filename, engine='pyarrow',
                 index_col=['region', 'Country', 'City'])
The resulting data frame has 996 rows and 22 columns.
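For comparison, here's what the set_index route mentioned above would look like. This is just a sketch, assuming the same local file path and the pandas import from earlier; the end result should be the same data frame, and index_col simply saves us a step:

filename = '/Users/reuven/Downloads/ModalShare.csv'

# load everything, then promote the three columns to a multi-index
df = (
    pd.read_csv(filename, engine='pyarrow')
    .set_index(['region', 'Country', 'City'])
)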
How many distinct cities were represented in this research? Which 10 cities were surveyed the most times?
On the face of it, this doesn't seem like a particularly challenging question. But when you consider that city names repeat themselves, and that some cities were surveyed multiple times, it becomes a bit trickier.
Fortunately, the survey gives each city a unique ID number, which we can grab from the CityID column in the data frame. Counting the number of unique values of CityID will tell us how many distinct cities were included.

One option would be to take the CityID column, run drop_duplicates on its elements, and then run the count method on the results:
df['CityID'].drop_duplicates().count()
This gives us a result of 794 different cities. I worried that running two methods, drop_duplicates and then count, would take a while. (And yes, I realize that with such a small data set, nothing will really take "a while.") So I also considered using value_counts to find out how often each element appeared, then grabbing the length of the index with len:
len(df['CityID'].value_counts().index)
Both techniques gave me the same result. But which one really ran faster? Or was there no real difference?
I used the %timeit magic command in Jupyter to time both of them, and was quite surprised by the results:
- Using drop_duplicates().count() took an average of 153 µs
- Using value_counts().index and then running len took an average of 320 µs
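Here's what those two timing cells look like in Jupyter, assuming df has already been loaded as above (your numbers will almost certainly differ):

# each expression goes in its own Jupyter cell; %timeit runs it
# repeatedly and reports the average execution time

%timeit df['CityID'].drop_duplicates().count()

%timeit len(df['CityID'].value_counts().index)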
So it turns out that my instinct for what would run faster was completely wrong; the value_counts alternative took twice as long!
Next, I asked you which 10 cities were surveyed the most times. Here, it was pretty clear that I would want to use value_counts. But what would I run value_counts on? And how would I then use the results?

I actually was able to run value_counts on the index itself. Index objects aren't exactly Pandas series objects, but they're not exactly not series objects, either. We can often run series methods on them, including (as is the case here) where we have a multi-index.
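To illustrate that point, here's a tiny made-up example (not from the survey data) showing series-style methods working directly on an index:

idx = pd.Index(['a', 'b', 'a', 'c', 'a'])

idx.value_counts()    # frequency of each unique value, just as with a series
idx.str.upper()       # even the .str accessor works on an index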
I thus ran value_counts on df.index. The result of value_counts is always a series whose index contains the unique values, sorted (by default) in descending order of frequency. I can then grab the index of that result to find which cities appeared most (and least) often. By running head on the series, I can even get the 10 most frequently surveyed cities:
(
df
.index
.value_counts()
.head(10)
.index
)
Now, the output isn't beautiful; we get back a MultiIndex object. But we can read it pretty easily, and if you squint just right, you can pretend that it's a list of tuples. Which, of course, it basically is:
MultiIndex([('Europe', 'Austria', 'Graz'),
('Europe', 'Austria', 'Vienna'),
('Europe', 'Germany', 'Leipzig'),
('Europe', 'Germany', 'Erfurt'),
('Europe', 'Germany', 'Dusseldorf'),
('Europe', 'Norway', 'Oslo'),
('Europe', 'Germany', 'Hannover'),
('Europe', 'Norway', 'Fredrikstad'),
('Europe', 'United Kingdom', 'London'),
('Europe', 'Germany', 'Karlsruhe')],
names=['region', 'Country', 'City'])
We can see that cities in Austria and Germany were surveyed more often than in other countries, and that Graz and Vienna (both in Austria) were the most-surveyed cities. I should note that the authors of the ABC paper didn't conduct these studies themselves; rather, they collected a huge number of other studies, conducted by other people. So we can't really blame (or praise) them for the selection of cities; they cast a very wide net, and it turns out that there was just more data for these cities (and countries) than others.
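And if you'd rather have an actual list of tuples than squint at the MultiIndex repr, it converts easily. Here's a small follow-up sketch, using the hypothetical name top_cities for the result:

# same query as above, ending with a plain Python list of
# (region, country, city) tuples
top_cities = (
    df
    .index
    .value_counts()
    .head(10)
    .index
    .tolist()
)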