
BW #129: Tom Lehrer

Get better at: Working with multiple files, strings, regular expressions, and Web scraping

Musician and satirist Tom Lehrer (https://en.wikipedia.org/wiki/Tom_Lehrer) passed away earlier this week at the age of 97 (https://www.nytimes.com/2025/07/27/arts/music/tom-lehrer-dead.html?unlocked_article_code=1.aU8.glnK.ehoesBEb_qDb&smid=url-share).

I first encountered Lehrer's music (e.g., "Silent E" at https://www.youtube.com/watch?v=91BQqdNOUxs) on The Electric Company. I heard more of his songs at summer camp, where counselors occasionally performed them (e.g., "The Elements" at https://www.youtube.com/watch?v=AcS3NOQnsQM). Only when I was older did I learn more about him and finally discover his pointed satirical compositions.

Having gone through the infamous "New Math" elementary-school curriculum myself, I was particularly amused by his song of the same name (https://www.youtube.com/watch?v=UIKGV2cTgqA). (And yes, I did really learn to count in base 8 back in the 5th grade!)

This week, partly to avoid yet another week discussing such serious issues as climate change, wars, autocracy, and economic turmoil, I thought it would be fun to analyze the lyrics to Tom Lehrer's songs.

Data and six questions

Back in 2022, Lehrer released all of his lyrics and music from copyright, putting them in the public domain (https://tomlehrersongs.com/). Sadly, the lyrics files on his official site are all PDFs, which makes extracting the text a bit frustrating.

Fortunately, someone named Graeme Cree put them into text files with some HTML markup at http://www.graeme.50webs.com/lehrer/index.htm . Our data will come from the text files on this site. However, to avoid crushing his site with millions of requests from Bamboo Weekly readers, I've put up a mirror of that part of his site on mine, at

https://files.lerner.co.il/www.graeme.50webs.com/lehrer/

I only copied the lyrics, so the site will look rather broken, without images or the musical files from the original.

Learning goals for this week include retrieving and working with text files, handling multiple files, regular expressions, and string handling in Pandas.

Paid subscribers, including members of my LernerPython+data subscription service, can download the data files in one zipfile below.

I'll be back tomorrow with my answers and explanations.

Meanwhile, here are my six tasks and questions:

    • The key for the main dict should be the filename, minus the .htm suffix.
    • The inner dict should have a title key, whose value is taken from between the HTML A tags inside of H2 tags. Ignore the number, where it exists, before the name.
    • The inner dict should have a year key, whose value should be taken from the parentheses in the title section but after the title itself. This should be None if you cannot find a year.
    • The inner dict should have a lyrics key, whose value should be taken from between the HTML PRE tags. (If there are multiple PRE tags, then use the first one.)
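
The bullets above can be sketched in plain Python with the standard library's re and pathlib modules. This is only a rough illustration, not the official solution: the function names parse_song and load_songs are hypothetical, and the patterns assume that the year is a four-digit number in parentheses, that any leading song number looks like "12." or "12)", and that the files have been downloaded to a local directory and can be read as Latin-1. The real markup may differ, so treat each regular expression as a starting point.

```python
import re
from pathlib import Path

# Assumed patterns; adjust them after inspecting the actual files.
H2_RE = re.compile(r"<h2\b[^>]*>(.*?)</h2>", re.IGNORECASE | re.DOTALL)
A_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)
PRE_RE = re.compile(r"<pre\b[^>]*>(.*?)</pre>", re.IGNORECASE | re.DOTALL)
YEAR_RE = re.compile(r"\((\d{4})\)")               # assumes a bare 4-digit year
LEADING_NUMBER_RE = re.compile(r"^\s*\d+[.)]\s+")  # assumes "12. " or "12) "


def parse_song(html):
    """Return a dict with title, year, and lyrics keys for one page."""
    title = None
    year = None

    heading_match = H2_RE.search(html)
    if heading_match:
        heading = heading_match.group(1)
        anchor_match = A_RE.search(heading)
        if anchor_match:
            # Drop a leading song number, if there is one inside the A tags.
            title = LEADING_NUMBER_RE.sub("", anchor_match.group(1)).strip()
            # Only look for the year after the closing A tag.
            year_match = YEAR_RE.search(heading, anchor_match.end())
            if year_match:
                year = int(year_match.group(1))

    # search() returns the first match, so later PRE blocks are ignored.
    pre_match = PRE_RE.search(html)
    lyrics = pre_match.group(1).strip() if pre_match else None

    return {"title": title, "year": year, "lyrics": lyrics}


def load_songs(directory):
    """Build the outer dict, keyed by filename without the .htm suffix."""
    songs = {}
    for path in sorted(Path(directory).glob("*.htm")):
        # Latin-1 is an assumption; it never fails to decode.
        text = path.read_text(encoding="latin-1")
        songs[path.stem] = parse_song(text)
    return songs
```

Calling load_songs on a directory of downloaded files returns the dict of dicts. Note that this sketch would also pick up non-song pages such as index.htm, which would need to be filtered out, and that a song whose heading lacks a parenthesized year ends up with year set to None, as the task requires.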