This week, we looked at the music of Tom Lehrer (https://en.wikipedia.org/wiki/Tom_Lehrer), who passed away earlier this week at the age of 97 (https://www.nytimes.com/2025/07/27/arts/music/tom-lehrer-dead.html?unlocked_article_code=1.aU8.glnK.ehoesBEb_qDb&smid=url-share).
Lehrer put his music and lyrics into the public domain (https://tomlehrersongs.com/) in 2007, and there are numerous online archives from which they can be downloaded. I thought that it would be fun to assemble a data frame from his lyrics, and then analyze them in a number of ways.
Data and six questions
I found one archive of Lehrer's songs at http://www.graeme.50webs.com/lehrer/index.htm. Our data will come from the text files on this site. However, to avoid crushing his site with millions of requests from Bamboo Weekly readers, I've put up a mirror of that part of his site on mine, at
https://files.lerner.co.il/www.graeme.50webs.com/lehrer/
I only copied the lyrics, so if you go to my mirror site with your browser, it'll look rather broken, without images or the musical files from the original.
Learning goals for this week include: Retrieving and working with text files, multiple files, regular expressions, and string handling in Pandas.
Paid subscribers, including members of my LernerPython+data subscription service, can download the data files in one zipfile below. You'll also be able to download my Jupyter notebook. (Sorry, but there isn't a one-click notebook solution this week.)
Retrieve all of the HTML files (with htm suffixes) from the Lehrer archive. I used wget (https://www.howtogeek.com/281663/how-to-use-wget-the-ultimate-command-line-downloading-tool/) for this purpose, but you might prefer to use a Python program. Ignore the intro, russell, short, and index files.
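If you would rather stay in Python than use wget, here's a minimal sketch using only the standard library. It assumes the index page links to each song with a relative .htm href, which matched what I saw on the mirror; the one-second pause between requests is my own politeness measure, not something the exercise requires:

import re
import time
import urllib.request

BASE = 'https://files.lerner.co.il/www.graeme.50webs.com/lehrer/'

# Grab the index page, and pull out all of the relative .htm links
index_html = urllib.request.urlopen(BASE + 'index.htm').read().decode('Latin-1')
hrefs = set(re.findall(r'href="([^"/]+\.htm)"', index_html, re.IGNORECASE))

for href in sorted(hrefs):
    if re.search(r'intro|russell|short|index', href):    # skip the non-lyrics files
        continue
    with urllib.request.urlopen(BASE + href) as response:
        data = response.read()
    with open(href, 'wb') as outfile:                     # save under the same name
        outfile.write(data)
    time.sleep(1)                                         # don't hammer the server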
Most weeks, I point you to an existing data set that we can then read into Pandas using a builtin method. But this time, we started with raw data in HTML files – and the files were on a Web site, meaning that we needed to find a way to download them onto a local machine.
I've long used the wget program for retrieving data from remote sites. I thus knew that if I wanted to mirror a complete site, I'd need to use:
wget --recursive http://files.lerner.co.il/www.graeme.50webs.com/lehrer
The good news? This indeed downloaded all of the files. But it also tried to download files that were on the original server, and it retrieved a number of files that weren't of interest to me. I thus decided to add three arguments:
-A "*.htm"
, so that we would only download HTML files – which, on this file, all had a three-letter.htm
suffix. The-A
option lets you give a comma-separated list of extensions that you will accept when mirroring a site.-I "lehrer"
, to include only thelehrer
directory. On my mirror of the site, this wasn't necessary, since I only copied thelehrer
directory, but I did include it as-I "www.graeme.50webs.com/lehrer"
, naming the full subdirectory. However, I definitely needed this one on the original site, to avoid mirroring everything.--reject-regex='intro|russell|short|index'
, which defined a regular expression describing the files I wanted to ignore. These were the four files in the original site's directory which didn't have any lyrics, and thus messed up my parsing of the files.
With this, my wget query looked like:
wget --recursive -A "*.htm" -I "www.graeme.50webs.com/lehrer" --reject-regex='intro|russell|short|index' http://files.lerner.co.il/www.graeme.50webs.com/lehrer
Note that by indicating that I wanted to retrieve the entire /lehrer directory, but not naming index.htm specifically, wget retrieved all of the files linked via index.htm, but didn't retrieve index.htm itself.
I decided to add three more options, none of which were strictly necessary, but which I liked to use:
- --progress=bar, to see a progress bar displayed as larger files were downloaded,
- --report-speed=bits, to find out how quickly files were being downloaded, and
- --random-wait, to pause a bit between requests, so as not to overwhelm the server with nonstop requests.
The final version of my retrieving query was thus:
wget --recursive --progress=bar --report-speed=bits --random-wait -A "*.htm" -I "www.graeme.50webs.com/lehrer" --reject-regex='intro|russell|short|index' http://files.lerner.co.il/www.graeme.50webs.com/lehrer
This downloaded 50 files into the files.lerner.co.il/www.graeme.50webs.com/lehrer directory, under wherever I ran the wget query. Each file contained the lyrics to one Tom Lehrer song; the largest file was 9.2 KB, and the smallest was 1.4 KB.
Having retrieved the text files, turn them into a Python dict of dicts, and use it to create a Pandas data frame, one with a row for each song (with an index from the filename) and with three columns – title, year, and lyrics.
Next, I wanted to take these 50 HTML files, first transforming them into a dict of dicts, and then into a Pandas data frame.
I knew that I would have to iterate over all of the files that I had downloaded. When I have to do that, I usually like to use glob from the Python standard library. Globbing isn't the same as regular expressions, although it is also about matching patterns, and it does use some of the same symbols (albeit in slightly different ways). I could have invoked glob.glob, but decided to use pathlib from the standard library, since pathlib.Path directory objects have their own glob method. I thus started with:
for one_file in pathlib.Path(dirname).glob('*.htm'):
Each iteration gives us a pathlib.Path object representing one file. (By contrast, the glob.glob function returns strings, the filenames in the directory.)
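To make the difference concrete, here's a short sketch contrasting the two (dirname is the directory into which I downloaded the files):

import glob
import pathlib

# glob.glob returns a list of strings
for name in glob.glob(f'{dirname}/*.htm'):
    print(type(name))        # <class 'str'>

# Path.glob yields Path objects (PosixPath on macOS and Linux)
for one_file in pathlib.Path(dirname).glob('*.htm'):
    print(type(one_file))    # <class 'pathlib.PosixPath'>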
Normally, if we want to open a file in Python, we can use the open builtin function. But if we have a Path object, there's an open method that we can invoke, instead. I used the with construct to open each file for reading:
for one_file in pathlib.Path(dirname).glob('*.htm'):
    with one_file.open() as infile:
However, I found that this gave me an error on several files. That's because Python assumes that input files are encoded in UTF-8, but these all used Latin-1. Fortunately, we can use the encoding keyword argument to open and solve the problem:
for one_file in pathlib.Path(dirname).glob('*.htm'):
    with one_file.open(encoding='Latin-1') as infile:
        content = infile.read()
Note that I'm usually not a big fan of invoking read on an open file. That's because the file might be too big to fit into memory, and the program could crash. In this case, however, I knew that all of the files were very small, and could easily fit into memory – especially since we were reading them serially.
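If the files had been large, a more memory-friendly approach would have been to read in fixed-size chunks. Here's a sketch of that technique; process is a hypothetical stand-in for whatever you'd do with each chunk:

with one_file.open(encoding='Latin-1') as infile:
    while True:
        chunk = infile.read(65536)    # read up to 65,536 characters at a time
        if not chunk:                 # an empty string means end of file
            break
        process(chunk)                # hypothetical per-chunk handler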
I thought about parsing the file with an HTML parsing toolkit, but decided that the structure was simple and regular enough that I could use regular expressions.
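For the record, the parser-based route that I decided against might have looked something like this, assuming the third-party beautifulsoup4 package is installed:

from bs4 import BeautifulSoup

soup = BeautifulSoup(content, 'html.parser')

# html.parser matches tag names case-insensitively, so 'h2' finds <H2>
title_section = soup.find('h2').get_text().strip()
lyrics = soup.find('pre').get_text().strip()    # find returns the first <PRE>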
For starters, both the title and the year (when it existed) came from the <H2> section near the top of the file. I used str.index to find the starting and ending points, and then used a slice to retrieve just what was in that tag:
start_title_section = content.index('<H2>') + 4
end_title_section = content.index('</H2>')
title_section = content[start_title_section:end_title_section].strip()
I then used a regular expression to retrieve the title from inside of the H2 tag:
m = re.search(r'>\s*\d*\.?\s*(.+)<', title_section)
if m:
    title = m.group(1)
else:
    title = None
If you're thinking that the regular expression on the first line demonstrates why people should never use them... well, I kind of understand. But let's break it down a bit:
- Our regexp starts with >, since I want to start matching just after the closing bracket of the opening H2 tag.
- I then looked for zero or more whitespace characters. \s means whitespace, and * after it means zero or more such characters are needed to match. This gives us a cushion in case there is some whitespace between the tag and what's written.
- Many (but not all) songs then had a number and a period. I used \d to indicate that we were looking for a digit, but then * to indicate that we could see any number of digits, including zero. This allows us to match no digit, a single digit, or two digits. The \.? then matches the optional period after the number.
- I added another \s* for optional whitespace.
- Then, inside of (), I looked for .+, meaning "one or more non-newline characters." This is where the title text was located, and the parentheses allowed me to capture it as a group – group 1, since it was the first (and only) set of parentheses in the regular expression.
- Finally, I ended the regexp with <, the start of the closing </H2> tag.
I invoked re.search to look for the title text inside of title_section. If I found it (and I did, all 50 times), then I would use m.group(1) to grab the title text from group 1. If not, then I would set it to None – but truth be told, I would see that as a failure of the parser, and would tweak it to work before allowing a None title.
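To see the pattern in action, here's a quick check against the sample <H2> contents that appear as a comment in my full code below:

import re

title_section = '<A name="Agnes">I Got It From Agnes</A> (1952)'

m = re.search(r'>\s*\d*\.?\s*(.+)<', title_section)
print(m.group(1))    # I Got It From Agnes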
Next, I looked for the year:
m = re.search(r'\d{4}', title_section)
if m:
    year = int(m.group(0))
else:
    year = None
This was much easier, since the year (when it appeared) always did so as a four-digit number. I just searched for the first four-digit number in the title section. I used m.group(0), meaning the entirety of what we had matched, since I didn't use () to group in the regular expression.
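Using the same sample line as before, the year extraction is a one-liner:

import re

m = re.search(r'\d{4}', '<A name="Agnes">I Got It From Agnes</A> (1952)')
print(int(m.group(0)))    # 1952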
Next, I wanted to grab the lyrics. Here's my code:
start_lyrics = content.index('<PRE>') + 5
end_lyrics = content.index('</PRE>')
lyrics = content[start_lyrics:end_lyrics].strip()
In other words, I just grabbed everything between the <PRE> tags. Some files had more than one set of PRE tags, but because str.index finds the first occurrence of the searched-for string, I always got the contents of the first PRE section – the one with the lyrics.
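Here's a tiny illustration of that behavior, with made-up content containing two PRE sections:

content = '<PRE>first lyrics</PRE> ... <PRE>second section</PRE>'

start = content.index('<PRE>') + 5    # index returns the first occurrence
end = content.index('</PRE>')
print(content[start:end])             # first lyrics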
I now had everything needed to add a new entry into our dict:
all_info[one_file.name.removesuffix('.htm')] = {'title': title,
                                                'lyrics': lyrics,
                                                'year': year}
All together, here's my code:
import pandas as pd
import pathlib
import re

dirname = '/Users/reuven/Desktop/files.lerner.co.il/www.graeme.50webs.com/lehrer'

all_info = {}

for one_file in pathlib.Path(dirname).glob('*.htm'):
    with one_file.open(encoding='Latin-1') as infile:
        content = infile.read()

        start_title_section = content.index('<H2>') + 4
        end_title_section = content.index('</H2>')
        title_section = content[start_title_section:end_title_section].strip()

        # <A name="Agnes">I Got It From Agnes</A> (1952)

        # Get the title
        m = re.search(r'>\s*\d*\.?\s*(.+)<', title_section)
        if m:
            title = m.group(1)
        else:
            title = None

        # Get the year
        m = re.search(r'\d{4}', title_section)
        if m:
            year = int(m.group(0))
        else:
            year = None

        start_lyrics = content.index('<PRE>') + 5
        end_lyrics = content.index('</PRE>')
        lyrics = content[start_lyrics:end_lyrics].strip()

        all_info[one_file.name.removesuffix('.htm')] = {'title': title,
                                                        'lyrics': lyrics,
                                                        'year': year}
With all_info now set, it was time to turn it into a data frame. The good news? It's not hard:
df = (
    pd
    .DataFrame(all_info)
)
The bad news? It put the filenames as columns and the three sub-dict keys as rows. I thus transposed it with T:
df = (
    pd
    .DataFrame(all_info)
    .T
)
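An alternative, if you'd prefer to skip the transpose entirely, is to tell Pandas up front that the outer dict's keys should become the index. This sketch uses DataFrame.from_dict with orient='index', which should give the same result:

df = pd.DataFrame.from_dict(all_info, orient='index')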
Finally, I used assign to change the year column into a float64 dtype:
df = (
    pd
    .DataFrame(all_info)
    .T
    .assign(year = lambda df_: df_['year'].astype('float64'))
)
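Why float64? Because some songs have no year, and NaN forces a standard integer column to become floating point. If you'd rather keep whole-number years, one option – my suggestion here, not what the article does – is Pandas's nullable Int64 dtype, which supports missing values via pd.NA:

df = (
    pd
    .DataFrame(all_info)
    .T
    .assign(year = lambda df_: df_['year'].astype('Int64'))    # nullable integers
)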
The result? A data frame with 50 rows and 3 columns, with an index taken from the filenames.