BW #28: Pret a Manger (solution)

Are people still buying sandwiches at the UK's Pret a Manger? Are they still buying them in the same locations as they did three years ago? What does this say about the economy?


Economic statistics can take a long time to collect and publish. That’s because they can be complex to collect, complex to compute, and then complex to validate. These statistics can have a big impact, so you don’t want to get them wrong. The downside, of course, is that the delay between the data’s collection and publication also leads to a lag in policy-makers’ ability to make changes rapidly.

That’s presumably part of the reasoning behind the experimental real-time data collection project run by the UK’s Office for National Statistics (https://www.ons.gov.uk). They have identified a number of ways to collect and publish economic data quickly, helping government and business leaders act sooner.

This week, we looked at data the ONS collects about Pret a Manger, a popular chain of sandwich shops in the UK. The data describes how many in-store purchases were made in a given week at Pret shops. We don’t know what they bought, how much they paid, or how many people were eating together. We also don’t know how many orders were made at any given location; the data is grouped into regional categories.


Moreover, we don’t have the raw sales numbers. Rather, they are indexed to January 2020. If the data shows 100 for a given region’s weekly report, that means sales are equivalent to those in January 2020. A 50 would mean half that amount, and a 200 would mean twice that amount. In that way, we can see how in-store sales have changed in the last few years.
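
Just to make that arithmetic concrete, here’s a trivial sketch (using a made-up index value) of how to read one of these numbers as a percent change from the January 2020 baseline:

index_value = 112   # made-up example; 100 means "same as January 2020"

pct_change = index_value - 100   # 112 -> +12%, 50 -> -50%, 200 -> +100%
print(f'{pct_change:+d}% vs. January 2020')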

You can read about the latest real-time data, including Pret a Manger, at https://www.ons.gov.uk/economy/economicoutputandproductivity/output/bulletins/economicactivityandsocialchangeintheukrealtimeindicators/3august2023 .

Data and questions

The data can be downloaded from the main page for the Pret a Manger research:

https://www.ons.gov.uk/economy/economicoutputandproductivity/output/datasets/transactionsatpretamanger

The Excel file itself can be downloaded from here:

/content/files/file.xlsx 

A full description of these store locations, and of the overall methodology for this study, can be found at https://www.ons.gov.uk/economy/economicoutputandproductivity/output/methodologies/coronavirusandthelatestindicatorsfortheukeconomyandsocietymethodology .

This week, I gave you seven questions and tasks. Let’s get to them; a link to the Jupyter notebook that I used to solve these problems follows the answers themselves.

Read the data into a data frame, treating the "Week Ending" column as dates and using it as the index.

While CSV is certainly a popular format for data distribution, the ONS publishes this data set as an Excel spreadsheet, which Pandas handles just as easily.

We’ll start by loading Pandas:

import pandas as pd

Then we can load the Excel file. I downloaded it onto my computer, and was thus able to read it in with the read_excel function. However, a simple call to read_excel won’t be enough:

  • The data is located on the third sheet of the Excel document, which we need to tell read_excel via its sheet_name parameter,
  • The headers and data don’t start until row 4 of the sheet,
  • We want to parse the “Week Ending” column as a date, and
  • We want to set the “Week Ending” column to be the index of our data frame.

Put together, we’ll need the following code:

filename = 'transactionsatpretamangerdataset030823.xlsx'

df = pd.read_excel(filename,
                   sheet_name=2,
                   header=3,
                   parse_dates=['Week Ending'],
                   index_col='Week Ending')

Notice that because Python (and Pandas) use zero-based indexing, the third sheet is identified with an index of 2, and row 4 is indicated by passing header=3. It’s a bit confusing, especially since Excel explicitly numbers its rows starting from 1, but we manage to read the data.
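
As an aside, if you find the off-by-one counting error-prone, read_excel also accepts a sheet’s name as a string. (The tab name below is hypothetical; you would check the actual name in Excel first.)

df = pd.read_excel(filename,
                   sheet_name='Weekly transactions',   # hypothetical tab name
                   header=3,
                   parse_dates=['Week Ending'],
                   index_col='Week Ending')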

In the end, our data frame has 126 rows (one for each week during which data was collected, indexed by the final day in that week) and 10 columns (one for each regional category). The dtypes of the non-index columns are all int64, which is overkill for the numbers we’ll be using, but that’s not something to worry about in such a small data frame.
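
If you want to verify those claims yourself, and perhaps shrink the dtypes anyway, here’s a quick sketch (choosing int16 is just my assumption that these index values stay far below its maximum):

print(df.shape)    # expect (126, 10)
print(df.dtypes)   # all int64

df = df.astype('int16')   # optional; int16 easily holds these values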

What three categories of stores are doing best, as of the latest data, vs. the baseline?

We can get the most recent data — that is, the final row in the data frame — in either of two ways:

  1. We can use “loc” to retrieve that row via its date. After all, we can always retrieve a row from a data frame using “loc” and the row’s index. (A sketch of this approach follows this list.)
  2. We can also use “iloc” to retrieve the row via its numeric position, much as we retrieve elements from strings, lists, and tuples in Python. No matter what the index, iloc lets us use numbers. You can even use -1 to indicate that you want to count from the end, rather than from the start.
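
For completeness, the “loc” version might look like this; rather than hard-coding the final date, we can ask the index itself for it:

# look up the final date in the index, then retrieve that row by label
df.loc[df.index[-1]]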

I decided to use iloc:

df.iloc[-1]

This retrieved the final row:

Yorkshire              123
London: Suburban       112
Manchester             104
London: West End        96
London: Stations        90
Regional Towns         107
Scotland                96
London: Airports       163
London: City Worker     82
Regional Stations       82
Name: 2023-07-27 00:00:00, dtype: int64

But I wasn’t interested in seeing all of the values. Rather, I wanted to find the locations with the three greatest values vs. their original baseline. To get this, I needed to run “sort_values” on the series we got back. Here, I’m going to switch to a multi-line, chained-method approach, even if it isn’t strictly necessary:

(
    df
    .iloc[-1]
    .sort_values(ascending=False)
)

This gives me the following result:

London: Airports       163
Yorkshire              123
London: Suburban       112
Regional Towns         107
Manchester             104
London: West End        96
Scotland                96
London: Stations        90
London: City Worker     82
Regional Stations       82
Name: 2023-07-27 00:00:00, dtype: int64

Remember that this chained approach works because Python’s normally very strict rules for indentation and line endings melt away when you open parentheses. This allows us to put part of a query on each line, making it more readable, as well as easier to comment on and edit.

Our results are great, but I’m only interested in the three highest scores. I can use “head” for that:

(
    df
    .iloc[-1]
    .sort_values(ascending=False)
    .head(3)
)

The result:

London: Airports    163
Yorkshire           123
London: Suburban    112
Name: 2023-07-27 00:00:00, dtype: int64
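
As an aside, Pandas provides a shortcut that collapses the sorting and the call to “head” into a single step, namely “nlargest”. It should return the same three rows:

# equivalent to .sort_values(ascending=False).head(3)
df.iloc[-1].nlargest(3)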

In other words, compared with January 2020, the stores seeing the greatest number of in-person sales are located at London’s four airports, in Yorkshire, and in suburban London.

This already hints at what we’re going to see, namely that in-person purchases of lunch food in areas with office buildings went down during the pandemic and haven’t completely recovered, even if they’re doing better than they were. By contrast, airports are doing very well, reflecting the “revenge tourism” attitude that I’ve both heard about and personally experienced: flying everywhere you can, now that you can.