First, some announcements:
- The first release candidate of Pandas 3.0 is out! I'll be giving a special talk about it, and its implications for your code, to LernerPython+data members on Sunday. Full Zoom information is in our private member Discord, at https://discord.com/channels/1249743821508710531/1297876125749608479/1446976398500560926.
- We'll be holding Pandas office hours on Wednesday, December 17th. Full Zoom information will be sent early next week. Come with any/all Pandas questions you have! This is open to LernerPython+data members, as well as paid BW subscribers.
How are American manufacturers doing? To hear the Trump Administration tell it, things couldn't be better. But the US economy is showing many signs of slowing and weakness, especially in the wake of the trade war that Donald Trump has launched. Has manufacturing improved since Trump returned to the White House? Which manufacturing sectors have improved over the years, and which have done worse?
The US government itself keeps track of such information. Specifically, the International Trade Administration (https://www.trade.gov/), part of the Department of Commerce, collects data from manufacturers. This data includes not only the trade balance in each sector, but also employment and producer-price inflation. These statistics can help give us a sense of how manufacturing is doing in the US, including which areas are growing and which are shrinking.
Data and five questions
The data comes from the ITA "manufacturing industry tracker," at https://www.trade.gov/data-visualization/ita-manufacturing-industry-tracker . You can download the data by clicking on the "download data" button, which returns a CSV file.
Learning goals for this week include: Working with CSV files, date parsing, working with dates and times, joins, grouping, pivot tables, window functions, and plotting.
Paid subscribers, including members of my LernerPython.com membership program, get the data files provided to them, as well as all of the questions and answers each week, downloadable notebooks, and participation in monthly office hours.
Here are this week's questions, along with my answers and explanations.
Download the CSV file containing 3-digit data from the ITA Manufacturing Industry Tracker. Treat Month of Date as a datetime column for the first of the specified month. For each of the columns starting with Avg, rename to something shorter and turn into a numeric column, either integer or float. Remove the three-digit code from each sector, keeping just the name.
I started by importing both Pandas and Plotly, which I used for some of my plotting tasks:
import pandas as pd
from plotly import express as px
Next, I wanted to load the CSV file that I had downloaded from ITA. I wanted to have the Month of Date column imported as a datetime value, but that column's dates looked like Jul-25, with the three-letter month abbreviation followed by the year's final two digits. In order to parse this as a datetime, it wasn't enough to pass the parse_dates keyword argument. I also needed to pass the date_format argument, specifying '%b-%y', which matched the month abbreviation and two-digit year. (As always, I used https://www.strfti.me to figure out which percent codes to use; there's no way I would ever remember them on my own!)
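Before loading the whole file, a quick way to confirm that the format string behaves as expected is to try it on a single value with pd.to_datetime. (This sanity check is my own addition here, using a sample value from the file.)

pd.to_datetime('Jul-25', format='%b-%y')   # Timestamp('2025-07-01 00:00:00')

Since the data specifies only a month and a year, Pandas sets the day to 1, giving us the first of the month, just as the question asked.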
My initial stab at loading the file looked like this, using read_csv:
filename = '/Users/reuven/Downloads/3-digitdata.csv'
df = (
    pd
    .read_csv(filename,
              parse_dates=['Month of Date'],
              date_format='%b-%y')
)
This was fine, but there were a number of other problems to overcome. For one, several of the supposedly numeric columns included commas, meaning that Pandas wouldn't treat them as either integers or floats. These columns were Avg. Exports, Avg. Imports, Avg. Employment (Thousands), and Avg. Trade Balance. Not only did I want these columns to be numeric, but I wanted to make the names a bit shorter and snappier, without needing to specify Avg at the start of each column name.
The easiest way to do this is to use assign. Each keyword argument passed to assign results in a new column being added to the data frame. The keyword argument's name is used for the column name, and the value is used for the value – unless a function is used for the value, in which case the function is first invoked and then its return value is used.
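To make that rule concrete, here's a tiny example on a throwaway data frame (my own illustration, not part of the ITA data):

demo = pd.DataFrame({'x': [1, 2, 3]})

demo.assign(y=10,                        # scalar value, broadcast to every row
            z=lambda df_: df_['x'] * 2)  # callable, invoked with the data frame; its result becomes the column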
I passed four keyword arguments to assign, and for each one, I ran str.replace, removing commas from the input string. I was then able to turn the values into either integers (with astype(int)) or floats (with astype(float)), now that the commas were gone.
The fifth column that I wanted to modify, Code+Definition, needed a bit more surgery, in that I wanted to remove the leading digits and -- separator from the start of the string. For that, I again used str.replace, but with a regular expression, passing regex=True to ensure that Pandas would use the regex engine:
filename = '/Users/reuven/Downloads/3-digitdata.csv'
df = (
    pd
    .read_csv(filename,
              parse_dates=['Month of Date'],
              date_format='%b-%y')
    .assign(exports = lambda df_: df_['Avg. Exports'].str.replace(',', '').astype(int),
            imports = lambda df_: df_['Avg. Imports'].str.replace(',', '').astype(int),
            employment = lambda df_: df_['Avg. Employment (Thousands)'].str.replace(',', '').astype(float),
            trade_balance = lambda df_: df_['Avg. Trade Balance'].str.replace(',', '').astype(int),
            sector = lambda df_: df_['Code+Definition'].str.replace(r'\d+--', '', regex=True)
    )
)
At this point, I had the data frame that I wanted, more or less. But I thought it was a good idea to remove the columns I wouldn't need. I did this by invoking drop, passing the columns keyword argument and a list of the columns I wanted to remove:
filename = '/Users/reuven/Downloads/3-digitdata.csv'
df = (
    pd
    .read_csv(filename,
              parse_dates=['Month of Date'],
              date_format='%b-%y')
    .assign(exports = lambda df_: df_['Avg. Exports'].str.replace(',', '').astype(int),
            imports = lambda df_: df_['Avg. Imports'].str.replace(',', '').astype(int),
            employment = lambda df_: df_['Avg. Employment (Thousands)'].str.replace(',', '').astype(float),
            trade_balance = lambda df_: df_['Avg. Trade Balance'].str.replace(',', '').astype(int),
            sector = lambda df_: df_['Code+Definition'].str.replace(r'\d+--', '', regex=True)
    )
    .drop(columns=['Avg. Exports', 'Avg. Imports',
                   'Avg. Employment (Thousands)',
                   'Avg. Trade Balance',
                   'PPI Series Code',
                   'Note',
                   'Ip Code',
                   'Employment Series ID',
                   'Code+Definition'])
)
I then invoked rename on the data frame, giving it a dictionary mapping the original column names to the new ones I wanted:
filename = '/Users/reuven/Downloads/3-digitdata.csv'
df = (
    pd
    .read_csv(filename,
              parse_dates=['Month of Date'],
              date_format='%b-%y')
    .assign(exports = lambda df_: df_['Avg. Exports'].str.replace(',', '').astype(int),
            imports = lambda df_: df_['Avg. Imports'].str.replace(',', '').astype(int),
            employment = lambda df_: df_['Avg. Employment (Thousands)'].str.replace(',', '').astype(float),
            trade_balance = lambda df_: df_['Avg. Trade Balance'].str.replace(',', '').astype(int),
            sector = lambda df_: df_['Code+Definition'].str.replace(r'\d+--', '', regex=True)
    )
    .drop(columns=['Avg. Exports', 'Avg. Imports',
                   'Avg. Employment (Thousands)',
                   'Avg. Trade Balance',
                   'PPI Series Code',
                   'Note',
                   'Ip Code',
                   'Employment Series ID',
                   'Code+Definition'])
    .rename(columns={'Avg. Industrial Production Index': 'industrial_production_index',
                     'Avg. PPI': 'ppi',
                     'Month of Date': 'date'})
)
I ended up with a data frame containing 2,688 rows and 8 columns.
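If you want to verify that the cleanup worked, checking the shape and dtypes of the result is a quick sanity check (my own addition; the expected dtypes follow from the conversions above):

df.shape    # should be (2688, 8), matching the counts above
df.dtypes   # date should be datetime64[ns]; exports, imports, and trade_balance int64; employment float64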
Show the 5 sectors in which there has been the greatest positive change in trade balance from the start of the data to its end. Now show the 5 sectors in which we see the greatest negative change. What does it mean if there are negative numbers in the "greatest positive change" list?
Trump has long been convinced that the US shouldn't have a trade deficit, seeing it as a negative thing. (Many economists believe otherwise.) No matter what you think, though, there's no doubt that much of the manufacturing that used to take place in the US now happens in other countries. I was curious which industries had seen the largest and smallest changes in trade balance.
In order to find the difference between the start and end of the data – i.e., the earliest and latest data points for each industry – we only need three columns: date, sector, and trade_balance. Moreover, we can reshape that data into a new data frame – one in which each date (currently repeated across rows, because of multiple sectors) appears as the index and in which each sector (currently repeated across rows, because of multiple dates) appears as the columns.
In other words, I want to create a pivot table, which I often like to describe as a two-dimensional groupby operation. The index will contain the unique values from date, the columns will contain the unique values from sector, and each cell will contain the trade balance for that sector on that date.
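To make that "two-dimensional groupby" description concrete, the same result can be built from an explicit groupby on both columns, followed by unstack to move one grouping level into the columns. Here's a sketch of the equivalence (assuming pivot_table's default mean aggregation):

(
    df
    .groupby(['date', 'sector'])['trade_balance']   # group by both dimensions
    .mean()                                         # pivot_table aggregates with mean by default
    .unstack('sector')                              # move sector from the row index into the columns
)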
To create this pivot table, we can use pivot_table, which takes an index keyword argument, a columns keyword argument, and a values keyword argument:
(
    df
    .pivot_table(index='date',
                 columns='sector',
                 values='trade_balance')
)
The returned data frame contains 128 rows (one for each month of data) and 21 columns (one for each sector), so it's a bit hard to read. But really, we're only interested in the first and last rows of the data frame, which we can retrieve using iloc and a fancy index, the list [0, -1]. Row index 0 is the first, and row index -1 is the last, and because we're using iloc, the actual index doesn't matter.
This returned a two-row data frame. I ran diff on that, resulting in a new data frame with the same index and columns as the two-row one on which diff ran, but with NaN in the first row and the differences in the second row. A second call to iloc[-1] retrieved just the final row. I then invoked agg, which lets us run multiple aggregation methods, and I passed nlargest and nsmallest in a list:
(
    df
    .pivot_table(index='date',
                 columns='sector',
                 values='trade_balance')
    .iloc[[0, -1]]
    .diff()
    .iloc[-1]
    .agg(['nlargest', 'nsmallest'])
)
The result is a bit hard to read, so I'll first show the largest and then the smallest separately. Also, I decided to use apply along with str.format to add commas to the numbers, using this code:
(
    df
    .pivot_table(index='date',
                 columns='sector',
                 values='trade_balance')
    .iloc[[0, -1]]
    .diff()
    .iloc[-1]
    .nlargest()
    .apply('{:,}'.format)
)
Here's the result:
sector
Petroleum & Coal Products             3,170,097,796.0
Primary Metal Products                3,005,775,552.0
Chemicals                               265,952,938.0
Printed Matter & Related Products      -119,460,214.0
Transportation Equipment               -167,983,537.0
Name: 2025-08-01 00:00:00, dtype: object
We can see that since 2015, when this data was first collected, the US has increased its net monthly petroleum exports by $3.17 billion, and primary metal products by another $3 billion. Not too shabby!
But not all is rosy: Even within the "largest, best" trade balances, we see that only three are positive. The 4th and 5th place sectors, printed matter and transportation equipment, now export less (and import more) than was the case 10 years ago. And that's in the "5 best" list!
We can similarly get the 5 sectors in which the trade deficit has grown the most:
(
    df
    .pivot_table(index='date',
                 columns='sector',
                 values='trade_balance')
    .iloc[[0, -1]]
    .diff()
    .iloc[-1]
    .nsmallest()
    .apply('{:,}'.format)
)
Here, I used nsmallest, as you might expect. The result:
sector
Computer & Electronic Products                  -18,907,059,148.0
Electrical Equipment, Appliances & Components    -7,471,428,226.0
Machinery, Except Electrical                     -4,135,922,574.0
Food Manufactures                                -3,375,186,561.0
Miscellaneous Manufactured Commodities           -2,551,986,559.0
Name: 2025-08-01 00:00:00, dtype: object
In other words, the US is now importing, on a monthly basis, $18 billion more in computer and electronic products than was the case 10 years ago, with another $7 billion in electrical equipment and appliances.
Again, many economists wouldn't necessarily say that a trade deficit is a bad thing, especially given that the US is now exporting services to much of the world, so it's OK to buy manufactured goods. Of course, that doesn't necessarily help the people who want the factory jobs — or at least hope that they might be able to get one.