BW #21: Electric cars (solution)

How popular are electric cars? How much have they grown in popularity, and how popular are they in different regions of the world? This week, we'll look at some data about electric vars.

BW #21: Electric cars (solution)

This week, we looked at data describing the growth in the electric-car industry. The numbers, as we saw, are growing at breakneck speed, and the expectations are for even bigger growth in the years to come.

BTW, I’m sending this out a bit earlier than usual, so that anyone who wants to read my solution before office hours start will be able to do so.

Without further ado, let’s get to it:

Data

This week’s data came from the IEA (International Energy Agency (https://iea.org), which has some data about cars, oil consumption, and the like. I correctly told you that we were going to look at data about electric-car usage to date, but then I mistakenly told you that we were going to look at oil saved by using electric cars — when I actually decided to use the future projections for electric-car sales, instead. Whoops!

The data comes from this page:

https://www.iea.org/data-and-statistics/data-tools/global-ev-data-explorer

First, there is historical data about electric vehicles, at

https://api.iea.org/evs/?parameter=EV%20sales&mode=Cars&category=Historical&csv=true

Future projections of electric-car sales are here:

https://api.iea.org/evs/?parameter=EV%20sales&mode=Cars&category=Projection-STEPS&csv=true

These files are in CSV format.

(And yes, I flubbed it in yesterday’s message, telling you to download data about how much oil was saved by using electric cars. I had planned to include such data this week, but decided to look at the data you eventually saw. I hope that you saw the comment I attached to the message with a correction.)

Questions

Now that I’ve given you the correct data, let’s see if I can get you the right questions (and solutions). There were a total of seven questions:

This week’s learning goals are: Combining data frames, complex queries, pivot tables, multi-indexes, and plotting with Seaborn.

Create a data frame for the historical data.

The first thing that I’ll do is load up Pandas and Seaborn, since I’ll be using the latter for some visualization later on:

import pandas as pd
import seaborn as sns
from pandas import Series, DataFrame

The third line isn’t really necessary, but I always like to have “Series” and “DataFrame” available in the current namespace, rather than write “pd.” before everything. As usual, I load up Seaborn using the standard alias of “sns”.

With that in place, I can now load up the data:

historical_filename = 'IEA-EV-dataEV salesCarsHistorical.csv'
historical_df = pd.read_csv(historical_filename)

Notice two things: First of all, I used longer and more explicit variable names here than I usually do. (Yes, I should use better variable names, since “filename” and “df” are overused and non-descriptive. Guilty as charged.) But today I’m doing this because I’ll have a second data frame in a bit, and I want to be able to distinguish the two.

Fortunately for us, this CSV file has no irregularities, no “NaN” values, no dates, and no bad values. It’s surprisingly clean, which means that it loads into memory without a hitch.

In 2022 (the most recent year for which we have data), if we look at numbers for the entire world, which sold more, BEVs or PHEVs? How much more?

Now that we’ve loaded our data, we get to our first question: In 2022, in the “World” region, did people buy more battery-operated cars, or plug-in hybrids? There are a few different ways we can get to this solution. Here’s what I came up with:

  • In the end, we want a series with two elements, where the index contains “BEV” and “PHEV”, because then we can run the “diff” method and get our answer.
  • We can get there by setting the index to contain three columns: Region, year, and powertrain.
  • We can then use “loc” to retrieve rows for “World” and 2022, the first two elements of our multi-index.
  • What will remain is a data frame with “powertrain” as its index, and the two rows that match.
  • Then we can run diff!

In other words:

(
    historical_df.
    set_index(['region', 'year', 'powertrain']).
    sort_index().
    loc[('World', 2022), 'value'].
    diff()
)

(And yes, I’m slowly but surely migrating over to the Matt Harrison style of writing my queries. If nothing else, it makes these multi-part queries far more readable. This means putting the entire query inside of parentheses, and putting the dot at the end of each line, making it possible to invoke the method on the next line.)

I start off with historical_df, our data frame. I set its index using set_index, which returns a new data frame — identical to the previous one, but with those three columns set to be a multi-index. Yes, I could have run a query that retrieves based on the values we want, but in this case it seemed easier to do it via an index.

Next, in order to ensure that my results will work, and that Pandas won’t complain about retrieving from an unsorted data frame, I sort the data frame by its index. Because we have a three-part multi-index, Pandas will first sort by region, then by year, then by powertrain. Truth be told, that doesn’t matter much to us, given that we’ll be retrieving (and thus removing) one region and one year — but it’s predictable, which is a good place to be.

Then we retrieve all of the rows with “World” and 2022 in the outer two layers of the multi-index. We’ll get a data frame back, one in which the index contains only “BEV” and “PHEV”. However, we still have a number of columns around. We’re only interested in the “value” column, so we retrieve it via the second argument to “loc”.

Finally, we run diff, which returns NaN for BEV, but 4,400,000 for PHEV. Which means that in 2022, the IEA estimates that people bought 4,400,000 more battery-operated cars than plug-in hybrids. That’s quite a large number, given that all-electric cars barely existed a decade ago.