
BW #129: Tom Lehrer (solution)

Get better at: Working with multiple files, strings, regular expressions, and Web scraping


This week, we looked at the music of Tom Lehrer (https://en.wikipedia.org/wiki/Tom_Lehrer), who passed away earlier this week at the age of 97 (https://www.nytimes.com/2025/07/27/arts/music/tom-lehrer-dead.html?unlocked_article_code=1.aU8.glnK.ehoesBEb_qDb&smid=url-share).

Lehrer put his music and lyrics into the public domain (https://tomlehrersongs.com/) in 2007, and there are numerous online archives from which they can be downloaded. I thought that it would be fun to assemble a data frame from his lyrics, and then analyze them in a number of ways.

Data and six questions

I found one archive of Lehrer's songs at http://www.graeme.50webs.com/lehrer/index.htm . Our data will come from the text files on this site. However, to avoid crushing his site with millions of requests from Bamboo Weekly readers, I've put up a mirror of that part of his site on mine, at

https://files.lerner.co.il/www.graeme.50webs.com/lehrer/

I only copied the lyrics, so if you go to my mirror site with your browser, it'll look rather broken, without images or the musical files from the original.

Learning goals for this week include: Retrieving and working with text files, multiple files, regular expressions, and string handling in Pandas.

Paid subscribers, including members of my LernerPython+data subscription service, can download the data files in one zipfile below. You'll also be able to download my Jupyter notebook. (Sorry, but there isn't a one-click notebook solution this week.)

Retrieve all of the HTML files (with htm suffixes) from the Lehrer archive. I used wget (https://www.howtogeek.com/281663/how-to-use-wget-the-ultimate-command-line-downloading-tool/) for this purpose, but you might prefer to use a Python program. Ignore the intro, russell, short, and index files.

Most weeks, I point you to an existing data set that we can then read into Pandas using a builtin method. But this time, we started with raw data in HTML files – and the files were on a Web site, meaning that we needed to find a way to download them onto a local machine.
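The question mentioned that you might prefer a Python program to wget. Here's a minimal sketch of that approach, assuming the requests package is installed: it fetches index.htm from the mirror, finds the links to .htm files with a simple (assumption-laden) regular expression, skips the intro, russell, short, and index files, and writes each remaining file to a local lehrer directory.

import re
from pathlib import Path

import requests

# A sketch of downloading the lyrics files with requests rather than wget.
# The href-matching regex is an assumption about the index page's markup.
BASE_URL = 'https://files.lerner.co.il/www.graeme.50webs.com/lehrer/'
SKIP_WORDS = ('intro', 'russell', 'short', 'index')

index_html = requests.get(BASE_URL + 'index.htm').text

# Find every link that points at an .htm file
filenames = set(re.findall(r'href="([^"]+\.htm)"', index_html, re.IGNORECASE))

outdir = Path('lehrer')
outdir.mkdir(exist_ok=True)

for filename in filenames:
    if any(word in filename.lower() for word in SKIP_WORDS):
        continue
    response = requests.get(BASE_URL + filename)
    response.raise_for_status()

    # Write the raw bytes, preserving the files' original (Latin-1) encoding
    (outdir / Path(filename).name).write_bytes(response.content)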

I've long used the wget program for retrieving data from remote sites. I thus knew that if I wanted to mirror a complete site, I would need to use:

wget --recursive http://files.lerner.co.il/www.graeme.50webs.com/lehrer

The good news? This indeed downloaded all of the files. But it also tried to download files that were on the original server, and it retrieved a number of files that weren't of interest to me. I thus decided to add three arguments:

- -A "*.htm", accepting only files whose names end with .htm
- -I "www.graeme.50webs.com/lehrer", restricting retrieval to that directory on the server
- --reject-regex='intro|russell|short|index', rejecting any URL that matches one of those names

With these, my wget query looked like:

wget --recursive -A "*.htm"  -I "www.graeme.50webs.com/lehrer" --reject-regex='intro|russell|short|index' http://files.lerner.co.il/www.graeme.50webs.com/lehrer

Note that because I asked for the entire /lehrer directory, rather than naming index.htm specifically, wget followed the links in index.htm and retrieved all of the files, but didn't keep index.htm itself.

I decided to add three more options, none of which were strictly necessary, but which I like to use:

- --progress=bar, displaying a progress bar for each download
- --report-speed=bits, reporting transfer speeds in bits (rather than bytes) per second
- --random-wait, pausing for a random interval between requests, to go easier on the server

The final version of my retrieving query was thus:

wget --recursive --progress=bar --report-speed=bits --random-wait  -A "*.htm"  -I "www.graeme.50webs.com/lehrer" --reject-regex='intro|russell|short|index' http://files.lerner.co.il/www.graeme.50webs.com/lehrer

This downloaded 50 files into a files.lerner.co.il/www.graeme.50webs.com/lehrer directory, created under whatever directory I ran the wget query from. Each file contained the lyrics to one Tom Lehrer song; the largest file was 9.2 KB, and the smallest was 1.4 KB.
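As a quick sanity check, here's a short pathlib snippet (my addition, not part of the assignment) that confirms how many files arrived and how large they are. It assumes you ran wget from the current directory, so the mirror's path sits beneath it:

from pathlib import Path

# Directory that wget created beneath the current working directory
downloaded = Path('files.lerner.co.il/www.graeme.50webs.com/lehrer')

sizes = [p.stat().st_size for p in downloaded.glob('*.htm')]
print(len(sizes))   # expect 50 files
print(max(sizes))   # largest file, in bytes
print(min(sizes))   # smallest file, in bytes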

Having retrieved the text files, turn them into a Python dict of dicts, and use it to create a Pandas data frame, one with a row for each song (with an index taken from the filename) and with three columns: title, year, and lyrics.

Next, I wanted to take these 50 HTML files, first transforming them into a dict of dicts, and then into a Pandas data frame.

I knew that I would have to iterate over all of the files that I had downloaded. When I have to do that, I usually like to use glob from the Python standard library. Globbing isn't the same as regular expressions, although it is also about matching patterns, and it does use some of the same symbols (albeit in slightly different ways). I could have invoked glob.glob, but decided to use pathlib from the standard library, since pathlib.Path directory objects have their own glob method. I thus started with:

for one_file in pathlib.Path(dirname).glob('*.htm'):

Each iteration gives us a pathlib.Path object representing one file. (By contrast, the glob.glob function returns plain strings rather than Path objects.)
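To make that difference concrete, here's a tiny comparison (my own illustration, with a placeholder dirname) of the two calls:

import glob
import pathlib

dirname = 'lehrer'   # placeholder; the full listing below uses the real path

# glob.glob returns a list of plain strings
print(glob.glob(f'{dirname}/*.htm')[:3])

# Path.glob returns Path objects, with methods such as open and attributes such as name
print(list(pathlib.Path(dirname).glob('*.htm'))[:3])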

Normally, if we want to open a file in Python, we can use the open builtin function. But if we have a Path object, there's an open method that we can invoke, instead. I used the with construct to open each file for reading:

for one_file in pathlib.Path(dirname).glob('*.htm'):
    with one_file.open() as infile:

However, I found that this gave me an error on several files. That's because Python assumes that input files are encoded in UTF-8, but these all used Latin-1. Fortunately, we can use the encoding keyword argument to open and solve the problem:

for one_file in pathlib.Path(dirname).glob('*.htm'):
    with one_file.open(encoding='Latin-1') as infile:
        content = infile.read()

Note that I'm usually not a big fan of invoking read on an open file. That's because the file might be too big to fit into memory, and the program could crash. In this case, however, I knew that all of the files were very small, and could easily fit into memory, especially since we were reading them one at a time.
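If the files had been larger, I could have iterated over each open file instead, reading one line at a time. Here's a sketch of that alternative, with a hypothetical handle_line function standing in for whatever per-line processing you'd want:

with one_file.open(encoding='Latin-1') as infile:
    for one_line in infile:          # reads one line at a time, never the whole file
        handle_line(one_line)        # hypothetical per-line processing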

I thought about parsing the file with an HTML parsing toolkit, but decided that the structure was simple and regular enough that I could use regular expressions.

For starters, both the title and the year (when it existed) came from the <H2> section near the top of the file. I used str.index to find the starting and ending points, and then used a slice to retrieve just what was in that tag:

        start_title_section = content.index('<H2>') + 4
        end_title_section = content.index('</H2>')
        title_section = content[start_title_section:end_title_section].strip()

I then used a regular expression to retrieve the title from inside of the H2 tag:

        m = re.search(r'>\s*\d*\.?\s*(.+)<', title_section)
        if m:
            title = m.group(1)
        else:
            title = None

If you're thinking that the regular expression on the first line demonstrates why people should never use them... well, I kind of understand. But let's break it down a bit:

- > matches the > that closes the opening <A ...> tag, just before the title
- \s* allows for optional whitespace
- \d* allows for an optional track number
- \.? allows for an optional period after that number
- \s* allows for more optional whitespace
- (.+) captures the title itself, as group 1
- < marks the start of the closing </A> tag

I invoked re.search to look for the title text inside of title_section. If I found it (and I did, all 50 times), then I would use m.group(1) to grab the title text from group 1. If not, then I would set it to None – but truth be told, I would see that as a failure of the parser, and would tweak it to work before allowing a None title.

Next, I looked for the year:

        m = re.search(r'\d{4}', title_section)
        if m:
            year = int(m.group(0))
        else:
            year = None

This was much easier, since the year (when it appeared) always did so as a four-digit number. I just searched for the first four-digit number in the title section, extracting the year from those songs that included one. I used m.group(0), meaning the entirety of what we had matched, since I didn't use () to create a group in the regular expression.
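To see both regular expressions in action, here's what they return when run against the H2 content from one of the files (the same sample line appears as a comment in the full listing below):

import re

title_section = '<A name="Agnes">I Got It From Agnes</A> (1952)'

# The title: everything between the > that closes the <A> tag and the next <
m = re.search(r'>\s*\d*\.?\s*(.+)<', title_section)
print(m.group(1))        # I Got It From Agnes

# The year: the first four-digit number, if any
m = re.search(r'\d{4}', title_section)
print(int(m.group(0)))   # 1952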

Next, I wanted to grab the lyrics. Here's my code:

        start_lyrics = content.index('<PRE>') + 5
        end_lyrics = content.index('</PRE>')
        lyrics = content[start_lyrics:end_lyrics].strip()

In other words, I just grabbed everything between the <PRE> and </PRE> tags. Some files had more than one set of PRE tags, but because str.index finds the first occurrence of the searched-for string, I always got the first such block, containing the lyrics.

I now had everything needed to add a new entry into our dict:

        all_info[one_file.name.removesuffix('.htm')] = {'title':title, 
                                                        'lyrics':lyrics,
                                                        'year':year}

Putting it all together, here's my code:


import pandas as pd
import pathlib
import re

dirname = '/Users/reuven/Desktop/files.lerner.co.il/www.graeme.50webs.com/lehrer'

all_info = {}

for one_file in pathlib.Path(dirname).glob('*.htm'):
    with one_file.open(encoding='Latin-1') as infile:

        content = infile.read()
    
        start_title_section = content.index('<H2>') + 4
        end_title_section = content.index('</H2>')
        title_section = content[start_title_section:end_title_section].strip()
    
        # <A name="Agnes">I Got It From Agnes</A> (1952)
        # Get the title
        m = re.search(r'>\s*\d*\.?\s*(.+)<', title_section)
        if m:
            title = m.group(1)
        else:
            title = None
    
        # Get the year
        m = re.search(r'\d{4}', title_section)
        if m:
            year = int(m.group(0))
        else:
            year = None
    
        start_lyrics = content.index('<PRE>') + 5
        end_lyrics = content.index('</PRE>')
        lyrics = content[start_lyrics:end_lyrics].strip()
    
        all_info[one_file.name.removesuffix('.htm')] = {'title':title, 
                                                        'lyrics':lyrics,
                                                        'year':year}

With all_info now set, it was time to turn it into a data frame. The good news? It's not hard:

df = (
      pd
      .DataFrame(all_info)
)

The bad news? It put the filenames as columns and the three sub-dict keys as rows. I thus transposed it with T:

df = (
      pd
      .DataFrame(all_info)
      .T
)
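As an aside, pandas can build the row-oriented data frame directly, without a transpose, using DataFrame.from_dict with orient='index'. This is just an alternative spelling of the same result:

# Treat the outer dict's keys as the index (one row per file)
df = pd.DataFrame.from_dict(all_info, orient='index')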

Finally, I used assign to change the year column into a float64 dtype:

df = (
      pd
      .DataFrame(all_info)
      .T
      .assign(year = lambda df_: df_['year'].astype('float64'))
)

The result? A data frame with 50 rows and 3 columns, with an index taken from the filenames.
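One final aside: the year column ends up as float64 because some songs have no year, and NaN forces a floating-point dtype. If you'd rather keep whole-number years, pandas' nullable Int64 dtype is one alternative (not what I did above); going through pd.to_numeric first handles the None values:

df = (
      pd
      .DataFrame(all_info)
      .T
      .assign(year = lambda df_: pd.to_numeric(df_['year']).astype('Int64'))
)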