This week's topic was wind power, inspired by the very large number of wind turbines I saw during my week-long trip to Germany. Renewable energy is increasingly cost effective, as well as good for the environment, so it's not a surprise that numerous countries are installing turbines. But still, it seemed like Germany was making a particularly big investment in this sort of power.
Sure enough, I read that Germany is well known for its investment into renewable energy. I decided that it was worth examining the numbers, to see how their turbines and growth compared with other countries.
Data and five questions
This week's data comes from Ember (https://ember-energy.org/), an energy think tank that publishes information about countries' energy production, use, and trade. The data comes from their "Yearly electricity data" page (https://ember-energy.org/data/yearly-electricity-data/). You want to download the "Yearly electricity data - long format", available at https://storage.googleapis.com/emb-prod-bkt-publicdata/public-downloads/yearly_full_release_long_format.csv .
Learning goals for this week include: Pivot tables, multi-indexes, plotting with Plotly, correlations, and window functions.
Paid subscribers, including members of my LernerPython+data membership program at https://LernerPython.com, can download a copy of the data from a link at the bottom of this message. Paid subscribers can also download my Jupyter/Marimo notebooks (at the bottom of this message) and participate in monthly office hours. This week's notebook is also available on Molab, a public notebook-sharing facility for Marimo users, with the one-click link given below.
Here are my solutions and explanations for this week's questions:
Import the CSV file into a Pandas data frame. Which five countries currently (from 2024 data) have the greatest amount of wind-generation capacity?
To solve this, I first imported Pandas:
import pandas as pd
I then defined a variable with the filename, and imported the CSV file into a data frame with read_csv
:
filename = 'data/bw-133-ember-full.csv'
df = pd.read_csv(filename, engine='pyarrow')
Notice that I passed the engine='pyarrow'
keyword argument, which tells Pandas to use PyArrow for reading, parsing, and importing the CSV file into Pandas. The backend data storage will still use NumPy; this is not the same as using the experimental dtype_backend
keyword argument. But in my experience, PyArrow is much faster than the normal Pandas CSV-loading mechanism. It does have problems with some CSV files that the native Pandas CSV reader handles just fine – but generally speaking, you won't lose much from using the PyArrow loader.
I then asked which five countries, in 2024, have the greatest amount of wind-generation capacity.
First: I only wanted to look at countries, but the data set includes both countries and regions. One way to find rows with country information was to check the ISO 3 code
column; if its value is NaN
, then it doesn't have a country code – because it's not a country, and we can ignore it.
I similarly asked you to look at the capacity for production of energy using wind as a fuel. This meant looking for rows where Category
was 'Capacity'
, where Subcategory
was 'Fuel'
, and where Variable
was Wind
.
I filtered all four of these using my favorite combination of loc
and lambda
. The anonymous lambda
function is applied to every row in the system, returning a boolean (True
or False
) value for each. loc
treats the returned boolean series as a mask index, keeping only those rows with True
values. In other words, each of the four loc
/lambda
lines filters out rows from the data frame, keeping only what is of interest to us:
(
df
.loc[lambda df_: df_['ISO 3 code'].notna() ]
.loc[lambda df_: df_['Category'] == 'Capacity']
.loc[lambda df_: df_['Subcategory'] == 'Fuel']
.loc[lambda df_: df_['Variable'] == 'Wind']
)
I asked you to find only values from 2024. Another loc
/lambda
row does the trick here, comparing Year
with 2024
:
(
df
.loc[lambda df_: df_['ISO 3 code'].notna() ]
.loc[lambda df_: df_['Category'] == 'Capacity']
.loc[lambda df_: df_['Subcategory'] == 'Fuel']
.loc[lambda df_: df_['Variable'] == 'Wind']
.loc[lambda df_: df_['Year'] == 2024]
)
At this point, I was interested in the Value
column – and specifically, the five top values – but I also wanted the country names associated with them. I decided to turn the Area
column into the data frame's index with set_index
, and then to retrieve only the Value
column as a series:
(
df
.loc[lambda df_: df_['ISO 3 code'].notna() ]
.loc[lambda df_: df_['Category'] == 'Capacity']
.loc[lambda df_: df_['Subcategory'] == 'Fuel']
.loc[lambda df_: df_['Variable'] == 'Wind']
.loc[lambda df_: df_['Year'] == 2024]
.set_index('Area')
['Value']
)
Finally, I invoked nlargest(5)
, getting the five largest values:
(
df
.loc[lambda df_: df_['ISO 3 code'].notna() ]
.loc[lambda df_: df_['Category'] == 'Capacity']
.loc[lambda df_: df_['Subcategory'] == 'Fuel']
.loc[lambda df_: df_['Variable'] == 'Wind']
.loc[lambda df_: df_['Year'] == 2024]
.set_index('Area')
['Value']
.nlargest(5)
)
The results:
Area Value
China 521.75
United States of America 153.15
Germany 72.82
India 48.16
Brazil 32.96
You can see that China is far ahead of any other country's wind-generation capacity, followed by the United States. But in third place out of all countries in world is Germany, with about half the capacity of the entire United States – even though it's roughly the size of New Mexico. Next come two other much larger countries, India and Brazil.
It would indeed seem that what I saw, and what I read, are true – that Germany has a great deal of wind-generated energy, especially for a country of its size.
Which 10 countries have increased their wind-generating capacity the most from 2014 to 2023? Where is Germany in that mix?
I started by using loc
/lambda
to again keep only those rows that I wanted: For countries, measuring their fuel capacity:
(
df
.loc[lambda df_: df_['ISO 3 code'].notna()]
.loc[lambda df_: df_['Category'] == 'Capacity']
.loc[lambda df_: df_['Subcategory'] == 'Fuel']
)
I then performed several operations with the Year
column: First, I used isin
to keep only those rows where the year was either 2014 or 2023. Next, I used assign
to replace the existing Year
column with an equivalent one using strings, thanks to astype
. Without doing so, I found that Pandas and/or Marimo would warn me in later stages against having column names that were actually integers.
I then used set_index
to create a two-column multi-index on our data frame, using both Area
and Year
:
(
df
.loc[lambda df_: df_['ISO 3 code'].notna()]
.loc[lambda df_: df_['Category'] == 'Capacity']
.loc[lambda df_: df_['Subcategory'] == 'Fuel']
.loc[lambda df_: df_['Year'].isin([2014, 2023])]
.assign(Year = lambda df_: df_['Year'].astype(str))
.set_index(['Area', 'Year'])
)
With that in place, I then made one additional selection with loc
and lambda
, keeping only those rows having to do with wind power. I used the two-argument version of loc
to retrieve the Value
column at the same time:
(
df
.loc[lambda df_: df_['ISO 3 code'].notna()]
.loc[lambda df_: df_['Category'] == 'Capacity']
.loc[lambda df_: df_['Subcategory'] == 'Fuel']
.loc[lambda df_: df_['Year'].isin([2014, 2023])]
.assign(Year = lambda df_: df_['Year'].astype(str))
.set_index(['Area', 'Year'])
.loc[lambda df_: df_['Variable'] == 'Wind', 'Value']
)
At this point, I had a series (with values) and a two-level multi-index. I moved the Year
part of the multi-index into to the data frame's columns using unstack
, producing a data frame in which the rows are identified by country names and the columns represent values for each country in either 2014 or 2023.
(
df
.loc[lambda df_: df_['ISO 3 code'].notna()]
.loc[lambda df_: df_['Category'] == 'Capacity']
.loc[lambda df_: df_['Subcategory'] == 'Fuel']
.loc[lambda df_: df_['Year'].isin([2014, 2023])]
.assign(Year = lambda df_: df_['Year'].astype(str))
.set_index(['Area', 'Year'])
.loc[lambda df_: df_['Variable'] == 'Wind', 'Value']
.unstack('Year')
)
A number of countries either have no wind-generated electricity or there is no data for them. I removed such rows with a combination of fillna(0)
and then loc
and lambda
to keep only those rows for which 2023 data shows some wind capacity:
(
df
.loc[lambda df_: df_['ISO 3 code'].notna()]
.loc[lambda df_: df_['Category'] == 'Capacity']
.loc[lambda df_: df_['Subcategory'] == 'Fuel']
.loc[lambda df_: df_['Year'].isin([2014, 2023])]
.assign(Year = lambda df_: df_['Year'].astype(str))
.set_index(['Area', 'Year'])
.loc[lambda df_: df_['Variable'] == 'Wind', 'Value']
.unstack('Year')
.fillna(0)
.loc[lambda df_: df_['2023'] > 0]
)
Finally, I invoked diff
, a built-in window function that compares records. By passing axis='columns
, I got the difference between years (i.e., how much capacity was added) between 2014 and 2023. I then used []
to retrieve just the 2023
data, and used nlargest
to find the top 10:
(
df
.loc[lambda df_: df_['ISO 3 code'].notna()]
.loc[lambda df_: df_['Category'] == 'Capacity']
.loc[lambda df_: df_['Subcategory'] == 'Fuel']
.loc[lambda df_: df_['Year'].isin([2014, 2023])]
.assign(Year = lambda df_: df_['Year'].astype(str))
.set_index(['Area', 'Year'])
.loc[lambda df_: df_['Variable'] == 'Wind', 'Value']
.unstack('Year')
.fillna(0)
.loc[lambda df_: df_['2023'] > 0]
.diff(axis='columns')
['2023']
.nlargest(10)
)
The result:
Area 2023
China 345.07
United States of America 83.59
Germany 30.879999999999995
Brazil 24.08
India 22.270000000000003
United Kingdom 17.09
France 13.93
Sweden 11.129999999999999
Australia 9.11
Turkey 8.18
In the last decade or so, Germany has added more wind capacity than any other country, save the United States (with nearly three times as much added) and China (with about 10 times as much added).
It would thus seem that Germany not only has a great deal with wind-generated electrical power, but also that much of it is new, from the last decade or so.