
BW #135: Airline seats (solution)

Get better at: Scraping, working with multiple files, regular expressions, grouping, and pivot tables

How roomy (or not) are seats on different airlines? I have been on quite a travel spree for the last few months, mostly attending conferences but also taking a short vacation. I've thus been on a lot of airplanes, with a number of different carriers, and many different planes.

So I thought it might be interesting to find out which carriers have the roomiest (and narrowest) seats.

I was happy to find that SeatGuru, which often has useful information about which seats are better and worse on a given flight, has a set of Web pages answering these exact questions (https://www.seatguru.com/charts/generalcharts.php).

This week, we'll read through that data, and attempt to find out if there's a big difference between carriers, seats, and other amenities.

Data and five questions

This week's data, as I mentioned, comes from SeatGuru, from https://www.seatguru.com/charts/generalcharts.php . But that page doesn't contain the data itself; the data is spread across six other pages. Each page covers a different combination of flight duration (short haul or long haul, with six hours as the cutoff for short-haul flights) and cabin class.

Learning goals for this week include: Scraping data from Web pages, regular expressions, working with multiple files, pivot tables, and grouping.

Paid subscribers won't have access to the data itself this week, since scraping and downloading it is part of the task. But you will be able to download my Marimo notebook, and use a one-click link to try it in Marimo's Molab collaboration service.

Here are this week's five tasks and questions:

Download seat data from the six Web pages linked to from the "general charts" page. Combine them into a single data frame, adding a flight_type column indicating the page it was taken from. Convert seat width and pitch into floats: if the value is a range, take the maximum; if it includes text (like ‘approx.’ or ‘standard’), strip the text.

For starters, I loaded up Pandas:

import pandas as pd

Next, I saw that while the main page is at https://www.seatguru.com/charts/generalcharts.php, that page has six links, each leading to a page with seat information. The information on each page is in an HTML table, which Pandas can often extract using read_html. I experimented on one of those pages, and found that read_html did the job just fine, returning a one-element list of data frames from each page.

That's fine for one page, but how can I retrieve the table from each page? I decided to create a list of the central, or "stub" part of each page's URL, and to then iterate in a for loop over that list, running read_html on each URL. The results would then go into a list of data frames that I could combine using pd.concat.
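As a sketch, the loop-based version might look like this (the helper names here are my own, and I haven't run this against the live site, since read_html fetches each page over the network):

```python
import pandas as pd

# URL pattern for each chart page, per the article
BASE_URL = 'https://www.seatguru.com/charts/{}.php'

def build_url(stub):
    """Turn a stub like 'shorthaul_economy' into a full page URL."""
    return BASE_URL.format(stub)

def fetch_all(stubs):
    """Read the first HTML table from each page, collecting the
    resulting data frames in a list, then combine them top-to-bottom."""
    frames = []
    for one_stub in stubs:
        frames.append(pd.read_html(build_url(one_stub))[0])
    return pd.concat(frames)
```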

In the end, I used a generator expression to achieve this:

stubs = ['shorthaul_economy', 
        'shorthaul_first_class', 
        'longhaul_economy', 
        'premium_economy',
        'longhaul_business_class',
        'longhaul_first_class']

df = (pd
     .concat(pd
             .read_html(f'https://www.seatguru.com/charts/{one_stub}.php')[0]
             for one_stub in stubs)
     )

This code actually works fine, but it's missing one critical component, namely a column that indicates the type of flight the data is for. After all, we'll want to compare flight_type (or maybe I should have called it seat_type?).

I thus added a call to assign, indicating that flight_type should be assigned the value of one_stub, the part of the URL that changes:

df = (pd
     .concat(pd
             .read_html(f'https://www.seatguru.com/charts/{one_stub}.php')[0]
             .assign(flight_type=one_stub)
             for one_stub in stubs)
)

However, I also wanted to turn the Seat Width and Seat Pitch columns into numeric dtypes, likely floats. Trying to run astype on either of these columns resulted in errors, because each had some non-numeric values, such as words describing the units, or a hyphen between two numbers in a range.

I decided that the easiest way to deal with this would be to grab the number from inside of the Seat Width and Seat Pitch cells using a regular expression. The regular expression itself was ^.*?([\d.]+).*$, which admittedly looks like gobbledegook. However, it breaks down into a few simple pieces: ^.*? lazily skips any leading text, ([\d.]+) captures a run of digits and decimal points (the number we want), and .*$ swallows whatever follows, up to the end of the string. Because the pattern matches the entire string, replacing the match with the captured group (\1) leaves just the number.

I used str.replace on both Seat Width and Seat Pitch to replace each original value with the number we found inside of it. After that succeeded, I then ran astype(float) to get a floating-point version of the data into each column.
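Here's a tiny self-contained demonstration of that two-step cleanup, on made-up sample values (note that on a range such as '31-33', this pattern keeps the first number it finds):

```python
import pandas as pd

# Made-up sample values resembling the messy scraped cells
raw = pd.Series(['17.2"', '31-33"', '16'])

# Step 1: replace each value with the number found inside it
# Step 2: convert the cleaned strings to floats
cleaned = (raw
           .str.replace(r'^.*?([\d.]+).*$', r'\1', regex=True)
           .astype(float))
# cleaned now contains 17.2, 31.0, and 16.0
```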

I'll admit that I'm not a big fan of keeping spaces in column names – and using assign means that we're limited to column names that are expressible as Python identifiers. I thus chose seat_width and seat_pitch as the new names. Then I invoked drop to remove the original columns:

stubs = ['shorthaul_economy', 
        'shorthaul_first_class', 
        'longhaul_economy', 
        'premium_economy',
        'longhaul_business_class',
        'longhaul_first_class']

df = (pd
     .concat(pd
             .read_html(f'https://www.seatguru.com/charts/{one_stub}.php')[0]
             .assign(flight_type=one_stub)
             for one_stub in stubs)
     .assign(seat_width = lambda df_: df_['Seat Width']
         .str.replace(r'^.*?([\d.]+).*$', r'\1', regex=True).astype(float),
             seat_pitch = lambda df_: df_['Seat Pitch']
                 .str.replace(r'^.*?([\d.]+).*$', r'\1', regex=True).astype(float))
     .drop(columns=['Seat Pitch', 'Seat Width'])
     )

The result? A data frame with 2,731 rows and 10 columns, including a flight_type column indicating which of the six pages each row came from. All but two of the columns had a dtype of object, which theoretically means any Python object, but in practice normally means strings. The two non-string columns were the ones I created with assign, both of which were of type float64.

Notice that everything here was done inside a call to pd.concat, which takes an iterable of data frames (here, a generator expression) and combines them, by default top-to-bottom, though passing axis='columns' combines them left-to-right instead. We thus end up with one large data frame containing all of the values from the six pages at SeatGuru.
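A minimal illustration of both directions, using two toy data frames:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})

# Default behavior: stack the frames top-to-bottom
stacked = pd.concat([a, b])
# stacked has 4 rows and 1 column

# axis='columns' places them side by side instead
side_by_side = pd.concat([a, b], axis='columns')
# side_by_side has 2 rows and 2 columns
```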

Is there a difference between the width and pitch of short-haul, long-haul, and long-haul premium economy seats?

First, I wanted to keep only economy seats from our data frame. To do that, I used my favorite combination of loc and lambda, with the lambda expression returning a boolean series that can be used as a mask index.

In this case, my lambda ran str.contains('economy') on df_['flight_type'], returning True if the string showed it was an economy flight. The result of the loc was thus a data frame containing only economy flights – long-haul, short-haul, and long-haul premium:


(
    df
    .loc[lambda df_: df_['flight_type'].str.contains('economy')]
)

I then ran groupby. I grouped on flight_type, since I wanted to get a calculation for every unique value of flight_type. The calculation was going to be mean, applied to both seat_width and seat_pitch. Notice that in order to calculate on both of those columns, I had to pass a list of strings, meaning nested square brackets ([[ and ]]):


(
    df
    .loc[lambda df_: df_['flight_type'].str.contains('economy')]
    .groupby('flight_type')[['seat_width', 'seat_pitch']].mean()
)
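To see the nested-brackets idea in isolation, here's the same pattern applied to a tiny invented data frame (the numbers below are made up, not SeatGuru's):

```python
import pandas as pd

# Invented miniature version of our seat data
mini = pd.DataFrame({
    'flight_type': ['shorthaul_economy', 'shorthaul_economy',
                    'premium_economy', 'premium_economy'],
    'seat_width': [17.0, 18.0, 18.5, 19.0],
    'seat_pitch': [30.0, 32.0, 37.0, 38.0],
})

# The inner list selects two columns, so mean() is computed per column,
# giving one row per flight_type and one column per measurement
means = mini.groupby('flight_type')[['seat_width', 'seat_pitch']].mean()
```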

The result:

flight_type          seat_width  seat_pitch
longhaul_economy      17.562945   31.848465
premium_economy       18.655459   37.607143
shorthaul_economy     17.549727   31.263811

We can see that regular economy seats are, on average, about 17.55 inches wide, while premium economy seats are about 18.66 inches wide. That extra inch can make a difference in your comfort!

Even more striking is the difference in seat pitch, meaning the distance from a point on one seat to the same point on the seat in front of it, which is a good proxy for legroom. You get a bit more than 31 inches in regular economy, but about 37.6 inches in premium economy. Again, it might not seem like a big deal, but those extra inches can help you sleep and relax more easily.