Skip to content
12 min read · Tags: csv plotly pivot-table correlations window-functions multi-index

BW #133: Wind power (solution)

Get better at: CSV files, pivot tables, multi-indexes, Plotly, correlations, and window functions.

BW #133: Wind power (solution)

This week's topic was wind power, inspired by the very large number of wind turbines I saw during my week-long trip to Germany. Renewable energy is increasingly cost effective, as well as good for the environment, so it's not a surprise that numerous countries are installing turbines. But still, it seemed like Germany was making a particularly big investment in this sort of power.

Sure enough, I read that Germany is well known for its investment into renewable energy. I decided that it was worth examining the numbers, to see how their turbines and growth compared with other countries.

Data and five questions

This week's data comes from Ember (https://ember-energy.org/), an energy think tank that publishes information about countries' energy production, use, and trade. The data comes from their "Yearly electricity data" page (https://ember-energy.org/data/yearly-electricity-data/). You want to download the "Yearly electricity data - long format", available at https://storage.googleapis.com/emb-prod-bkt-publicdata/public-downloads/yearly_full_release_long_format.csv .

Learning goals for this week include: Pivot tables, multi-indexes, plotting with Plotly, correlations, and window functions.

Paid subscribers, including members of my LernerPython+data membership program at https://LernerPython.com, can download a copy of the data from a link at the bottom of this message. Paid subscribers can also download my Jupyter/Marimo notebooks (at the bottom of this message) and participate in monthly office hours. This week's notebook is also available on Molab, a public notebook-sharing facility for Marimo users, with the one-click link given below.

Here are my solutions and explanations for this week's questions:

Import the CSV file into a Pandas data frame. Which five countries currently (from 2024 data) have the greatest amount of wind-generation capacity?

To solve this, I first imported Pandas:

import pandas as pd

I then defined a variable with the filename, and imported the CSV file into a data frame with read_csv:

filename = 'data/bw-133-ember-full.csv'

df = pd.read_csv(filename, engine='pyarrow')

Notice that I passed the engine='pyarrow' keyword argument, which tells Pandas to use PyArrow for reading, parsing, and importing the CSV file into Pandas. The backend data storage will still use NumPy; this is not the same as using the experimental dtype_backend keyword argument. But in my experience, PyArrow is much faster than the normal Pandas CSV-loading mechanism. It does have problems with some CSV files that the native Pandas CSV reader handles just fine – but generally speaking, you won't lose much from using the PyArrow loader.

I then asked which five countries, in 2024, have the greatest amount of wind-generation capacity.

First: I only wanted to look at countries, but the data set includes both countries and regions. One way to find rows with country information was to check the ISO 3 code column; if its value is NaN, then it doesn't have a country code – because it's not a country, and we can ignore it.

I similarly asked you to look at the capacity for production of energy using wind as a fuel. This meant looking for rows where Category was 'Capacity', where Subcategory was 'Fuel', and where Variable was Wind.

I filtered all four of these using my favorite combination of loc and lambda. The anonymous lambda function is applied to every row in the system, returning a boolean (True or False) value for each. loc treats the returned boolean series as a mask index, keeping only those rows with True values. In other words, each of the four loc/lambda lines filters out rows from the data frame, keeping only what is of interest to us:


(
    df
    .loc[lambda df_: df_['ISO 3 code'].notna() ]
    .loc[lambda df_: df_['Category'] == 'Capacity']
    .loc[lambda df_: df_['Subcategory'] == 'Fuel']
    .loc[lambda df_: df_['Variable'] == 'Wind']
)

I asked you to find only values from 2024. Another loc/lambda row does the trick here, comparing Year with 2024:


(
    df
    .loc[lambda df_: df_['ISO 3 code'].notna() ]
    .loc[lambda df_: df_['Category'] == 'Capacity']
    .loc[lambda df_: df_['Subcategory'] == 'Fuel']
    .loc[lambda df_: df_['Variable'] == 'Wind']
    .loc[lambda df_: df_['Year'] == 2024]
)

At this point, I was interested in the Value column – and specifically, the five top values – but I also wanted the country names associated with them. I decided to turn the Area column into the data frame's index with set_index, and then to retrieve only the Value column as a series:


(
    df
    .loc[lambda df_: df_['ISO 3 code'].notna() ]
    .loc[lambda df_: df_['Category'] == 'Capacity']
    .loc[lambda df_: df_['Subcategory'] == 'Fuel']
    .loc[lambda df_: df_['Variable'] == 'Wind']
    .loc[lambda df_: df_['Year'] == 2024]
    .set_index('Area')
    ['Value']
)

Finally, I invoked nlargest(5), getting the five largest values:


(
    df
    .loc[lambda df_: df_['ISO 3 code'].notna() ]
    .loc[lambda df_: df_['Category'] == 'Capacity']
    .loc[lambda df_: df_['Subcategory'] == 'Fuel']
    .loc[lambda df_: df_['Variable'] == 'Wind']
    .loc[lambda df_: df_['Year'] == 2024]
    .set_index('Area')
    ['Value']
    .nlargest(5)
)

The results:

Area	Value
China	521.75
United States of America	153.15
Germany	72.82
India	48.16
Brazil	32.96

You can see that China is far ahead of any other country's wind-generation capacity, followed by the United States. But in third place out of all countries in world is Germany, with about half the capacity of the entire United States – even though it's roughly the size of New Mexico. Next come two other much larger countries, India and Brazil.

It would indeed seem that what I saw, and what I read, are true – that Germany has a great deal of wind-generated energy, especially for a country of its size.

Which 10 countries have increased their wind-generating capacity the most from 2014 to 2023? Where is Germany in that mix?

I started by using loc/lambda to again keep only those rows that I wanted: For countries, measuring their fuel capacity:


(
    df
    .loc[lambda df_: df_['ISO 3 code'].notna()]
    .loc[lambda df_: df_['Category'] == 'Capacity']
    .loc[lambda df_: df_['Subcategory'] == 'Fuel']
)

I then performed several operations with the Year column: First, I used isin to keep only those rows where the year was either 2014 or 2023. Next, I used assign to replace the existing Year column with an equivalent one using strings, thanks to astype. Without doing so, I found that Pandas and/or Marimo would warn me in later stages against having column names that were actually integers.

I then used set_index to create a two-column multi-index on our data frame, using both Area and Year:


(
    df
    .loc[lambda df_: df_['ISO 3 code'].notna()]
    .loc[lambda df_: df_['Category'] == 'Capacity']
    .loc[lambda df_: df_['Subcategory'] == 'Fuel']
    .loc[lambda df_: df_['Year'].isin([2014, 2023])]
    .assign(Year = lambda df_: df_['Year'].astype(str))
    .set_index(['Area', 'Year'])
)

With that in place, I then made one additional selection with loc and lambda, keeping only those rows having to do with wind power. I used the two-argument version of loc to retrieve the Value column at the same time:


(
    df
    .loc[lambda df_: df_['ISO 3 code'].notna()]
    .loc[lambda df_: df_['Category'] == 'Capacity']
    .loc[lambda df_: df_['Subcategory'] == 'Fuel']
    .loc[lambda df_: df_['Year'].isin([2014, 2023])]
    .assign(Year = lambda df_: df_['Year'].astype(str))
    .set_index(['Area', 'Year'])
    .loc[lambda df_: df_['Variable'] == 'Wind', 'Value']
)

At this point, I had a series (with values) and a two-level multi-index. I moved the Year part of the multi-index into to the data frame's columns using unstack, producing a data frame in which the rows are identified by country names and the columns represent values for each country in either 2014 or 2023.


(
    df
    .loc[lambda df_: df_['ISO 3 code'].notna()]
    .loc[lambda df_: df_['Category'] == 'Capacity']
    .loc[lambda df_: df_['Subcategory'] == 'Fuel']
    .loc[lambda df_: df_['Year'].isin([2014, 2023])]
    .assign(Year = lambda df_: df_['Year'].astype(str))
    .set_index(['Area', 'Year'])
    .loc[lambda df_: df_['Variable'] == 'Wind', 'Value']
    .unstack('Year')
)

A number of countries either have no wind-generated electricity or there is no data for them. I removed such rows with a combination of fillna(0) and then loc and lambda to keep only those rows for which 2023 data shows some wind capacity:


(
    df
    .loc[lambda df_: df_['ISO 3 code'].notna()]
    .loc[lambda df_: df_['Category'] == 'Capacity']
    .loc[lambda df_: df_['Subcategory'] == 'Fuel']
    .loc[lambda df_: df_['Year'].isin([2014, 2023])]
    .assign(Year = lambda df_: df_['Year'].astype(str))
    .set_index(['Area', 'Year'])
    .loc[lambda df_: df_['Variable'] == 'Wind', 'Value']
    .unstack('Year')
    .fillna(0)
    .loc[lambda df_: df_['2023'] > 0]
)

Finally, I invoked diff, a built-in window function that compares records. By passing axis='columns, I got the difference between years (i.e., how much capacity was added) between 2014 and 2023. I then used [] to retrieve just the 2023 data, and used nlargest to find the top 10:


(
    df
    .loc[lambda df_: df_['ISO 3 code'].notna()]
    .loc[lambda df_: df_['Category'] == 'Capacity']
    .loc[lambda df_: df_['Subcategory'] == 'Fuel']
    .loc[lambda df_: df_['Year'].isin([2014, 2023])]
    .assign(Year = lambda df_: df_['Year'].astype(str))
    .set_index(['Area', 'Year'])
    .loc[lambda df_: df_['Variable'] == 'Wind', 'Value']
    .unstack('Year')
    .fillna(0)
    .loc[lambda df_: df_['2023'] > 0]
    .diff(axis='columns')
    ['2023']
    .nlargest(10)

)

The result:

Area	2023
China	345.07
United States of America	83.59
Germany	30.879999999999995
Brazil	24.08
India	22.270000000000003
United Kingdom	17.09
France	13.93
Sweden	11.129999999999999
Australia	9.11
Turkey	8.18

In the last decade or so, Germany has added more wind capacity than any other country, save the United States (with nearly three times as much added) and China (with about 10 times as much added).

It would thus seem that Germany not only has a great deal with wind-generated electrical power, but also that much of it is new, from the last decade or so.