How roomy (or not) are seats on different airlines? I have been on quite a travel spree for the last few months, mostly attending conferences but also taking a short vacation. I've thus been on a lot of airplanes, with a number of different carriers, and many different planes.
I thus thought it might be interesting to find which companies have the roomiest (and narrowest) seats.
I was happy to find that SeatGuru, which often has useful information about which seats are better and worse on a given flight, has a set of Web pages answering these exact questions (https://www.seatguru.com/charts/generalcharts.php).
This week, we'll read through that data, and attempt to find out if there's a big difference between carriers, seats, and other amenities.
Data and five questions
This week's data, as I mentioned, comes from SeatGuru, from https://www.seatguru.com/charts/generalcharts.php . But that page doesn't contain the data itself; that is spread across six other pages. Each page covers a different combination of flight duration (short haul or long haul, with six hours as the cap for short-haul flights) and seat class.
Learning goals for this week include: Scraping data from Web pages, regular expressions, working with multiple files, pivot tables, and grouping.
Paid subscribers won't have access to the data itself this week, since scraping and downloading it is part of the task. But you will be able to download my Marimo notebook, and use a one-click link to try it in Marimo's Molab collaboration service.
Here are this week's five tasks and questions:
Download seat data from the six Web pages linked to from the "general charts" page. Combine them into a single data frame, adding a flight_type column indicating the page it was taken from. Convert seat width and pitch into floats: if the value is a range, take the maximum; if it includes text (like 'approx.' or 'standard'), strip the text.
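Before we get to the full solution, here is one minimal sketch of that conversion rule (range → maximum, text stripped). The function name and the sample values are my own, not part of the exercise:

```python
import re

def clean_measure(value: str) -> float:
    # Hypothetical helper: pull every number out of the string, then take
    # the max, which handles both single values ("approx. 31") and
    # ranges ("30-32") in one step.
    numbers = re.findall(r'\d+(?:\.\d+)?', value)
    return max(float(n) for n in numbers)
```

As we'll see below, my actual solution took a somewhat different (regex-replacement) approach.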
For starters, I loaded up Pandas:
import pandas as pd
Next, I saw that while the main page is at https://www.seatguru.com/charts/generalcharts.php, that page has six links, each leading to a page with seat information. The information on each page is in an HTML table, which Pandas can often extract using `read_html`. I experimented on one of those pages, and found that `read_html` did the job just fine, returning a one-element list of data frames from each page.
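To see that list-returning behavior without hitting the network, here's a sketch using an invented one-table HTML snippet in place of a real SeatGuru page:

```python
import io
import pandas as pd

# A tiny HTML page with one table, standing in for a SeatGuru chart page
html = io.StringIO("""
<table>
  <tr><th>Airline</th><th>Seat Width</th></tr>
  <tr><td>Acme Air</td><td>17.2"</td></tr>
</table>
""")

# read_html always returns a *list* of data frames, one per <table> found
frames = pd.read_html(html)
```

Since there's a single table on the "page," we get a one-element list, and `frames[0]` is the data frame we want.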
That's fine for one page, but how can I retrieve the table from each page? I decided to create a list of the central, or "stub," part of each page's URL, and then to iterate in a `for` loop over that list, running `read_html` on each URL. The results would then go into a list of data frames that I could combine using `pd.concat`.
In the end, I used a generator comprehension to achieve this:
stubs = ['shorthaul_economy',
         'shorthaul_first_class',
         'longhaul_economy',
         'premium_economy',
         'longhaul_business_class',
         'longhaul_first_class']

df = (pd
      .concat(pd.read_html(f'https://www.seatguru.com/charts/{one_stub}.php')[0]
              for one_stub in stubs)
     )
This code actually works fine, but it's missing one critical component, namely a column that indicates the type of flight the data is for. After all, we'll want to compare by `flight_type` (or maybe I should have called it `seat_type`?).
I thus added a call to `assign`, indicating that `flight_type` should be assigned the value of `one_stub`, the part of the URL that changes:
df = (pd
      .concat(pd
              .read_html(f'https://www.seatguru.com/charts/{one_stub}.php')[0]
              .assign(flight_type=one_stub)
              for one_stub in stubs)
     )
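Here's the same tag-then-concatenate pattern on two invented toy frames (the names are mine, standing in for the real SeatGuru pages):

```python
import pandas as pd

# Toy stand-ins for two of the six SeatGuru pages
pages = {
    'shorthaul_economy': pd.DataFrame({'Airline': ['Acme Air']}),
    'longhaul_economy':  pd.DataFrame({'Airline': ['Zephyr']}),
}

# Tag each frame with its source via assign, then concatenate them all
df = pd.concat(frame.assign(flight_type=name)
               for name, frame in pages.items())
```

Each row in the combined frame now carries a `flight_type` value recording which "page" it came from.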
However, I also wanted to turn the `Seat Width` and `Seat Pitch` columns into numeric dtypes, likely floats. Trying to run `astype` on either of these columns resulted in errors, because each had some non-numeric values, such as words describing the units or `-` signs between two numbers in a range.
I decided that the easiest way to deal with this would be to grab the number from inside of the `Seat Width` and `Seat Pitch` cells using a regular expression. The regular expression itself was `^.*?([\d.]+).*$`, which admittedly looks like gobbledegook. However:
- `^` anchors the regexp to the start of the string.
- `.*` means, "find zero or more non-newline characters."
- The `?` after `.*` means: take the minimum number of characters needed to make this work, not the maximum.
- Then, inside of `()` for capturing, we look for `[\d.]+`. That means we're looking for any combination of digits and decimal points. Because these `()` are the first (and only) capturing parentheses in this regular expression, we'll be able to grab their contents with `\1`.
- Then we look for zero or more non-newline characters with `.*`.
- Finally, `$` anchors the regexp to the end of the string.
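Here's that regex at work on a few invented sample values (note that on a range like '30-32', the non-greedy prefix means this pattern keeps the first number it finds):

```python
import pandas as pd

s = pd.Series(['17.2"', '31 in', '30-32'])

# Replace each cell with the first run of digits/decimal points inside it,
# then convert the now-clean strings to floats
cleaned = s.str.replace(r'^.*?([\d.]+).*$', r'\1', regex=True).astype(float)
```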
I used `str.replace` on both `Seat Width` and `Seat Pitch` to replace each original value with the number we found inside of it. After that succeeded, I ran `astype(float)` to get a floating-point version of the data into that column.
I'll admit that I'm not a big fan of keeping spaces in column names – and using `assign` means that we're limited to column names that are expressible as Python identifiers. I thus named the new columns `seat_width` and `seat_pitch`, and then invoked `drop` to remove the original columns:
stubs = ['shorthaul_economy',
         'shorthaul_first_class',
         'longhaul_economy',
         'premium_economy',
         'longhaul_business_class',
         'longhaul_first_class']

df = (pd
      .concat(pd
              .read_html(f'https://www.seatguru.com/charts/{one_stub}.php')[0]
              .assign(flight_type=one_stub)
              for one_stub in stubs)
      .assign(seat_width=lambda df_: df_['Seat Width']
              .str.replace(r'^.*?([\d.]+).*$', r'\1', regex=True).astype(float),
              seat_pitch=lambda df_: df_['Seat Pitch']
              .str.replace(r'^.*?([\d.]+).*$', r'\1', regex=True).astype(float))
      .drop(columns=['Seat Pitch', 'Seat Width'])
     )
The result? A data frame with 2,731 rows and 10 columns, including a `flight_type` column indicating the type of seat reservation the person had made. All but two of the columns had a dtype of `object`, which theoretically means any Python object, but in reality normally means strings. The two non-string columns were the ones I created with `assign`, both of which were of type `float64`.
Notice that everything here was done inside a call to `pd.concat`, which takes an iterable of data frames and combines them – by default top-to-bottom, but it can also combine them left-to-right. We thus end up with one large data frame containing all of the values from the six pages at SeatGuru.
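The two directions of `pd.concat` look like this on a pair of invented toy frames:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})

# Default (axis='index'): stack frames top-to-bottom
top_bottom = pd.concat([a, b])

# axis='columns': place frames side by side, left-to-right
side_by_side = pd.concat([a, b], axis='columns')
```

For this week's task, the default top-to-bottom stacking is exactly what we want, since each page contributes rows with the same columns.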
Is there a difference between the width and pitch of short-haul, long-haul, and long-haul premium economy seats?
First, I wanted to keep only economy seats from our data frame. To do that, I used my favorite combination of `loc` and `lambda`, with the `lambda` expression returning a boolean series that can be used as a mask index.
In this case, my `lambda` ran `str.contains('economy')` on `df_['flight_type']`, returning `True` wherever the string showed it was an economy flight. The result of the `loc` was thus a data frame containing only economy flights – long-haul, short-haul, and long-haul premium:
(
    df
    .loc[lambda df_: df_['flight_type'].str.contains('economy')]
)
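Since the scraped data frame isn't available offline, here's the same `loc` + `lambda` pattern on a small invented frame:

```python
import pandas as pd

df = pd.DataFrame({'flight_type': ['shorthaul_economy',
                                   'longhaul_first_class',
                                   'premium_economy']})

# The lambda returns a boolean series; loc uses it as a mask,
# keeping only the rows where it is True
economy = df.loc[lambda df_: df_['flight_type'].str.contains('economy')]
```

Only the two rows whose `flight_type` contains the substring "economy" survive the mask.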
I then ran `groupby`. I grouped on `flight_type`, since I wanted to get a calculation for every unique value of `flight_type`. The calculation was going to be `mean`, on both `seat_width` and `seat_pitch`. Notice that in order to calculate on both of those columns, I had to pass a list of strings, meaning nested `[]`:
(
    df
    .loc[lambda df_: df_['flight_type'].str.contains('economy')]
    .groupby('flight_type')[['seat_width', 'seat_pitch']].mean()
)
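The same grouping pattern, on an invented toy frame so it runs without the scraped data:

```python
import pandas as pd

toy = pd.DataFrame({'flight_type': ['a', 'a', 'b'],
                    'seat_width':  [17.0, 18.0, 19.0],
                    'seat_pitch':  [30.0, 32.0, 38.0]})

# Nested []: the outer brackets index into the groupby object,
# the inner brackets are the list of columns to aggregate
means = toy.groupby('flight_type')[['seat_width', 'seat_pitch']].mean()
```

The result has one row per unique `flight_type` value and one column per aggregated column.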
The result:
flight_type         seat_width    seat_pitch
longhaul_economy     17.562945     31.848465
premium_economy      18.655459     37.607143
shorthaul_economy    17.549727     31.263811
We can see that regular economy seats are, on average, about 17.55 inches wide, while premium economy seats are about 18.66 inches. That extra inch can make a difference in your comfort!
Even more striking is the difference in seat pitch – the distance from a point on one seat to the same point on the seat in front of it, which is a rough proxy for legroom. You get a bit more than 31 inches in regular economy, but about 37.6 inches in premium economy – again, it might not sound like a big deal, but those extra inches can help you sleep and relax more easily.