Musician and satirist Tom Lehrer (https://en.wikipedia.org/wiki/Tom_Lehrer) passed away earlier this week at the age of 97 (https://www.nytimes.com/2025/07/27/arts/music/tom-lehrer-dead.html?unlocked_article_code=1.aU8.glnK.ehoesBEb_qDb&smid=url-share).
I first encountered Lehrer's music (e.g., "Silent E" at https://www.youtube.com/watch?v=91BQqdNOUxs) on The Electric Company. I then heard more of his songs at summer camp, where counselors occasionally performed them (e.g., "The Elements" at https://www.youtube.com/watch?v=AcS3NOQnsQM). Only when I was older did I learn more about him, finally discovering his pointed satirical compositions.
Having gone through the infamous "New Math" elementary-school curriculum myself, I was particularly amused by his song of the same name (https://www.youtube.com/watch?v=UIKGV2cTgqA). (And yes, I did really learn to count in base 8 back in the 5th grade!)
This week, partly to avoid yet another week discussing such serious issues as climate change, wars, autocracy, and economic turmoil, I thought it would be fun to analyze the lyrics to Tom Lehrer's songs.
Data and six questions
Back in 2020, Lehrer released all of his lyrics and music from copyright, putting them in the public domain (https://tomlehrersongs.com/). Sadly, the files containing lyrics on his official site are all PDFs, making it a bit frustrating to extract the text.
Fortunately, someone named Graeme Cree put them into text files with some HTML markup at http://www.graeme.50webs.com/lehrer/index.htm . Our data will come from the text files on this site. However, to avoid crushing his site with millions of requests from Bamboo Weekly readers, I've put up a mirror of that part of his site on mine, at
https://files.lerner.co.il/www.graeme.50webs.com/lehrer/
I only copied the lyrics, so the site will look rather broken, without images or the musical files from the original.
Learning goals for this week include retrieving and working with text files, handling multiple files, regular expressions, and string handling in Pandas.
Paid subscribers, including members of my LernerPython+data subscription service, can download the data files in one zipfile below.
I'll be back tomorrow with my answers and explanations.
Meanwhile, here are my six tasks and questions:
- Retrieve all of the HTML files (with `htm` suffixes) from the Lehrer archive. I used `wget` (https://www.howtogeek.com/281663/how-to-use-wget-the-ultimate-command-line-downloading-tool/) for this purpose, but you might prefer to use a Python program. Ignore the `intro`, `russell`, `short`, and `index` files.
- Having retrieved the files, turn them into a Python dict of dicts, and use it to create a Pandas data frame, one with a row for each song (with an index from the filename) and with three columns: `title`, `year`, and `lyrics`. The dict of dicts should have the following properties:
  - The key for the main dict should be the filename, minus the `.htm` suffix.
  - The inner dict should have a `title` key, whose value is taken from between the HTML `A` tags inside of `H2` tags. Ignore the number, where it exists, before the name.
  - The inner dict should have a `year` key, whose value should be taken from the parentheses in the title section but after the title itself. This should be `None` if you cannot find a year.
  - The inner dict should have a `lyrics` key, whose value should be taken from between the HTML `PRE` tags. (If there are multiple `PRE` tags, then use the first one.)
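
To make the shape of that dict of dicts concrete, here is a minimal sketch of one way to build it with regular expressions. It is not the official solution, and it rests on several assumptions: the files have already been downloaded into a local directory, the year is a four-digit number in parentheses right after the closing `A` tag, and the leading song number is followed by a period. The names `parse_song`, `build_songs`, and `SAMPLE_HTML` are made up for illustration, and the sample markup is a guess at the layout rather than a copy of a real file.

```python
import re
from pathlib import Path

# Hypothetical sample; the real files may be laid out differently.
SAMPLE_HTML = """<HTML><BODY>
<H2><A NAME="elements">12. The Elements</A> (1959)</H2>
<PRE>
There's antimony, arsenic, aluminum, selenium,
And hydrogen and oxygen and nitrogen and rhenium
</PRE>
<PRE>Notes that should be ignored</PRE>
</BODY></HTML>"""

# Files we were told to ignore, keyed by filename without the suffix.
SKIP = {"intro", "russell", "short", "index"}

H2_RE = re.compile(r"<h2[^>]*>(.*?)</h2>", re.IGNORECASE | re.DOTALL)
# Group 1: text inside the A tag. Group 2: whatever follows it inside the H2.
A_RE = re.compile(r"<a\b[^>]*>(.*?)</a>(.*)", re.IGNORECASE | re.DOTALL)
PRE_RE = re.compile(r"<pre[^>]*>(.*?)</pre>", re.IGNORECASE | re.DOTALL)
YEAR_RE = re.compile(r"\((\d{4})\)")      # assumption: four-digit year
NUMBER_RE = re.compile(r"^\s*\d+\.\s*")   # assumption: number ends with "."


def parse_song(html):
    """Return a dict with 'title', 'year', and 'lyrics' for one file."""
    title = year = lyrics = None

    h2 = H2_RE.search(html)
    if h2:
        a = A_RE.search(h2.group(1))
        if a:
            # Drop the leading song number, if there is one.
            title = NUMBER_RE.sub("", a.group(1)).strip()
            year_match = YEAR_RE.search(a.group(2))
            if year_match:
                year = int(year_match.group(1))

    # search() returns the first match, so later PRE blocks are ignored.
    pre = PRE_RE.search(html)
    if pre:
        lyrics = pre.group(1).strip()

    return {"title": title, "year": year, "lyrics": lyrics}


def build_songs(directory):
    """Map each filename (minus '.htm') to its parsed inner dict."""
    songs = {}
    for path in sorted(Path(directory).glob("*.htm")):
        if path.stem in SKIP:
            continue
        # The files' encoding is unknown here, so undecodable bytes
        # are replaced rather than raising an error.
        songs[path.stem] = parse_song(path.read_text(errors="replace"))
    return songs
```

Passing the result to `pd.DataFrame.from_dict(songs, orient="index")` should then give a data frame with one row per file and the three columns. Whether `year` is better kept as an integer or a string is a judgment call; the sketch uses an integer.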