This week, we're looking at the history of museum heists. This is, of course, because of the theft of $100m in jewels from the Louvre in Paris on October 19th — in daylight, in only 8 minutes, by people dressed as construction workers, armed with glass cutters and a large dose of self-confidence. The New York Times recently described the theft itself in great detail, at https://www.nytimes.com/2025/10/30/world/europe/inside-louvre-jewel-heist.html?unlocked_article_code=1.0k8.3jfi.Fqdb95Os1tWk&smid=url-share.
How common are such heists? When and where do they take place? How many people are involved? And do they often use weapons? These, and other questions, were raised by Sandra Clopés and Marc Balcells, professors at Pompeu Fabra University in Barcelona, Spain, in a recent paper, "The Science of Art Theft: Using Data to Identify Criminal Patterns, 1990–2022" (https://www.cambridge.org/core/journals/international-journal-of-cultural-property/article/science-of-art-theft-using-data-to-identify-criminal-patterns-19902022/838EFE01FB7C7AE9244FB458FF103EB1), published in April in the International Journal of Cultural Property.
The paper only covers heists from 1990 through 2022, so it doesn't include the Louvre theft, but it can still give us a good understanding of museum thefts. And indeed, that's our topic for this week: trying to better understand the database of heists used in the paper.
Data and five questions
This week's data consists of an Excel spreadsheet describing the museum thefts from 1990 through 2022. The data was available in a PDF file alongside the academic paper, but the formatting was such that getting a computer to make sense of it would have been difficult. I thus e-mailed Professor Clopés, who kindly and quickly sent me the original Excel file. You can download it from here:
This week, I have five tasks and questions for you based on this database.
Paid subscribers to Bamboo Weekly, as well as members of my LernerPython subscription program, can look to the bottom of this message for my downloadable Marimo notebook, and a clickable link to view and explore the data using the Marimo "Molab" facility.
Learning goals for this week include: Working with dates and times, cleaning data, grouping, regular expressions, and handling weird multi-indexes.
Here are this week's five tasks, with my solutions and explanations:
Read the Excel file into a Pandas data frame. Use rows 2 and 3 of the Excel file as the column names, but end up with a single-level index on the columns, not a multi-index. Ensure that the Full Date and Date of heist discovery are datetime values; assume that "-" as the day of the month is really the first day of that month. Which country appears in this database the most times? Has any museum been robbed more than once in this period? Which museum had the most items stolen? Which had the greatest monetary value stolen? (For this last question, you can treat missing values as 0.)
I started off by importing Pandas. But I also used a lot of regular expressions this week, and I thus imported the re module from Python's standard library:
import pandas as pd
import re

Next, I used read_excel to read the Excel file into a data frame. I passed the header keyword argument to indicate the row on which the column names were listed. In this case, however, I wanted to use two rows, which meant passing a list of row numbers as the value for header:
filename = 'data/bw-144-heists.xlsx'
df = (
pd
.read_excel(filename,
header=[1,2])
)

This assigned a data frame, but I wanted to clean it up. First, I wanted to take the multi-index and turn it into a single index. That's doable with the set_axis method, which makes it possible to set the column names or the index during a method chain. However, I wanted to perform some processing here, which meant using a function – in this case, a lambda expression. That isn't possible with set_axis.
I thus ended up invoking pipe, which lets us invoke any function we want, so long as it takes a data frame as an argument. In this case, the function was an anonymous lambda expression, one which invoked set_axis. But the value given to set_axis was the result of running a list comprehension on the column names, invoking str.join on each one.
In other words: You can think of a multi-index as a list of tuples. I used a list comprehension to iterate over each 2-element tuple, invoking str.join on it to get a single string. From a list of tuples, we thus got a list of strings, which I passed to set_axis. Note that I joined the strings with / characters, which was an arbitrary choice.
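Here's a minimal sketch of that join on a couple of hypothetical 2-element column tuples (I'm guessing at the exact labels, but the real ones come from rows 2 and 3 of the spreadsheet):

# Hypothetical multi-index column labels, shown as a list of tuples
cols = [('Discovery', 'Number of stolen items'),
        ('Guard stations', 'Full date')]

# Join each tuple's two levels into a single string, separated by /
['/'.join(col).strip() for col in cols]
# ['Discovery/Number of stolen items', 'Guard stations/Full date']

And here's the full pipeline, applied to the real file: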
df = (
pd
.read_excel(filename,
header=[1,2])
.pipe(lambda df_: df_.set_axis(['/'.join(col).strip()
for col in df_.columns],
axis='columns'))
)

Next, I wanted to make sure that the date columns were both datetime dtypes. There were two date columns in this data frame, one of which was handled correctly, thanks to the fact that Excel is a binary format, and thus passes datetime values just fine. But the other column, Guard stations/Full date, had some missing days, which weren't interpreted correctly.
I thus used assign to create a new column, full_date, based on the original one. I used a lambda expression to invoke pd.to_datetime on the original column. For reasons that aren't totally clear to me, pd.to_datetime was OK with the '-' characters instead of days, treating them as the first of the month. So really, we just needed to invoke this function to perform the transformation.
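I don't know precisely how those cells were formatted, but as a small, related illustration of how forgiving pd.to_datetime can be, a date string that omits the day entirely is parsed as the first of the month:

import pandas as pd

# Hypothetical example: no day at all, parsed as the first of the month
pd.to_datetime('2004-05')    # Timestamp('2004-05-01 00:00:00')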
I didn't need the old Guard stations/Full date column after this, and used drop to remove it:
df = (
pd
.read_excel(filename,
header=[1,2])
.pipe(lambda df_: df_.set_axis(['/'.join(col).strip()
for col in df_.columns],
axis='columns'))
.assign(full_date = lambda df_:
pd.to_datetime(df_['Guard stations/Full date']))
.drop(columns=['Guard stations/Full date'])
)

Finally, I decided to perform some trimming on some of the column names. Wherever a cell was empty in either row of the two-row multi-index, we ended up with an "Unnamed" placeholder in that part of the column name. I again used pipe with set_axis and re.sub to remove these placeholders from the column names.
My regular expressions here just looked for Unnamed:.*/ and also /Unnamed:.*$, replacing each of those with an empty string, ''. I could have done this with a single regexp, but decided that it would be easier and clearer to do it in two steps:
df = (
pd
.read_excel(filename,
header=[1,2])
.pipe(lambda df_: df_.set_axis(['/'.join(col).strip()
for col in df_.columns],
axis='columns'))
.assign(full_date = lambda df_:
pd.to_datetime(df_['Guard stations/Full date']))
.drop(columns=['Guard stations/Full date'])
.pipe(lambda df_: df_.set_axis([re.sub('Unnamed:.*/', '', col)
for col in df_.columns],
axis='columns'))
.pipe(lambda df_: df_.set_axis([re.sub('/Unnamed:.*$', '', col)
for col in df_.columns],
axis='columns'))
)

If you haven't ever learned regular expressions, I strongly suggest that you try them! I have a free e-mail course on the subject, at https://RegexpCrashCourse.com.
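To see what those two substitutions do to a single flattened column name, here's a small sketch. (The exact placeholder text depends on what read_excel generated; names like 'Unnamed: 0_level_0' are typical, but treat these inputs as assumptions.)

import re

# Hypothetical name whose first header row was empty
re.sub('Unnamed:.*/', '', 'Unnamed: 0_level_0/Country')
# 'Country'

# Hypothetical name whose second header row was empty
re.sub('/Unnamed:.*$', '', 'Target name/Unnamed: 1_level_1')
# 'Target name'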
We now have a data frame with 40 rows (one for each heist) and 57 columns, one for each data point the professors examined in the paper.
Now we can start to answer some questions. For example, which country appears in the database most often? To answer this, I retrieved the Country column and invoked value_counts on it. value_counts always sorts its output from most common to least, so the most commonly robbed country will be on top:
(
df
['Country']
.value_counts()
)

Here are the results:
Country count
England 6
Norway 4
Germany 3
Sweden 3
France 3
Netherlands 3
Australia 2
USA 2
Brazil 2
Italy 2
Switzerland 2
Egypt 1
Belgium 1
Holland 1
South Africa 1
Austria 1
Paraguay 1
Russia 1
Korea 1

I guess I somewhat expected England to be at the top, or near to it. But I most certainly did not expect Norway to be in the second position. France, which provided some of the motivation for this week's data set, has had its museums robbed no more often than Sweden, which was quite surprising to me.
Have any museums been robbed more than once? To find out, I once again invoked value_counts, but this time on the Target name column. I then used loc and lambda to keep only those elements that are greater than 1:
(
df
['Target name']
.value_counts()
.loc[lambda s_: s_ > 1]
)

Three museums were robbed twice during the time period in the database:
Target name count
British Museum 2
Château de Fontainebleau Chinese Museum 2
Permanenten KODE Museum 2

Which museum had the most items stolen? For this, I used groupby:
- My categorical column was Target name, since I wanted a count per museum.
- The numerical column on which we calculated was 'Discovery/Number of stolen items'.
- The aggregation method was sum, to add the numbers together.
Remember that groupby returns its result sorted by the elements of the categorical column. For this reason, I then invoked sort_values, passing ascending=False to get them from most to least.
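Here's a toy sketch, with made-up museum names and counts, showing why the explicit sort is needed:

import pandas as pd

toy = pd.DataFrame({'Target name': ['B Museum', 'A Museum', 'B Museum'],
                    'items': [5, 10, 7]})

# groupby sorts by the group keys (the museum names)...
toy.groupby('Target name')['items'].sum()
# A Museum    10
# B Museum    12

# ...so we need sort_values to order by the totals instead
toy.groupby('Target name')['items'].sum().sort_values(ascending=False)
# B Museum    12
# A Museum    10

And here's the real query: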
(
df
.groupby('Target name')['Discovery/Number of stolen items'].sum()
.sort_values(ascending=False)
)

The results:
Target name Discovery/Number of stolen items
Permanenten KODE Museum 78
The Green Vault 21
Van Gogh Museum 20
Fitzwilliam Museum 18
Château de Fontainebleau Chinese Museum 15
Australian Museum 15
Isabella Stewart Gardner Museum 13
British Museum 11
Moderna Museet 8
Kunsthal Rotterdam 7
Musée d’Art Moderne de la Ville 5
Museo Nacional de Bellas Artes de Asunción 5
Pretoria Art Museum 5
Gongju National Museum 4
Museo Chácara do Céu 4
E.G. Bührle Collection 4
Galleria Nazionale d'Arte Moderna 3
Fondation Baur 3
Durham University Oriental Museum 3
Nationalmuseum 3
Schirn Kunsthalle 3
Whitworth Art Gallery 3
Munch Museum 2
Museu de Arte de São Paulo 2
Reubens House Museum 1
The Jewish Museum 1
Singer Laren Museum 1
Tretyakov Gallery 1
Art Gallery of New South Wales 1
Norway National Museum 1
Kunsthistorisches Museum 1
Henry Moore Foundation 1
Galleria d’Arte Moderna Ricci-Oddi 1
Bode Museum 1
Mohamed Mahmoud Khalil Museum 1
TEFAF Maastricht 0
Drottningholm Palace 0

It's one thing to steal a large number of items. Another question we can ask is: Which museum had the greatest monetary value of items stolen?
On the face of it, this should be a simple query, running groupby on the museum names, and then invoking sum on the amounts that were stolen from each museum. However, it turns out that the Discovery/Value of target items in euros (€) column was very stubborn, and didn't want to be treated as numeric.
I thus moved Target name to be the index, and selected only the value column, which was seen as an O (or "object") column, meaning that it contains Python objects. To wrestle it into shape, I went through a bunch of steps:
- ran fillna(0) to turn NaN values into the integer 0
- ran astype(str) to turn the series into strings
- ran str.replace to remove any non-digit characters (since there were some commas and periods in there)
- ran astype(float) to get floats back
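Here's a minimal sketch of those four steps on a couple of made-up values (I don't know exactly what the strings in the real column looked like):

import pandas as pd

values = pd.Series(['44,300,000', None], dtype='object')

(
    values
    .fillna(0)                               # None -> 0
    .astype(str)                             # everything becomes a string
    .str.replace(r'\D', '', regex=True)      # strip commas, periods, and other non-digits
    .astype(float)                           # back to numbers
)
# 0    44300000.0
# 1           0.0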
Now, with a float column, I could finally invoke groupby, naming the index (Target name) as my categorical. I also invoked sort_values, to see the amounts from highest to lowest:
(
df
.set_index('Target name')
['Discovery/Value of target items in euros (€)']
.fillna(0)
.astype(str)
.str.replace(r'\D', '', regex=True)
.astype(float)
.groupby('Target name').sum()
.sort_values(ascending=False)
)

I then added a call to apply with str.format, to see euro signs before the amounts, commas every three digits, and two digits after the decimal point:
(
df
.set_index('Target name')
['Discovery/Value of target items in euros (€)']
.fillna(0)
.astype(str)
.str.replace(r'\D', '', regex=True)
.astype(float)
.groupby('Target name').sum()
.sort_values(ascending=False)
.apply('€{0:,.02f}'.format)
)

The results:
Target name Discovery/Value of target items in euros (€)
Van Gogh Museum €448,000,000.00
Isabella Stewart Gardner Museum €441,000,000.00
The Green Vault €135,000,000.00
E.G. Bührle Collection €121,000,000.00
Munch Museum €106,000,000.00
Norway National Museum €105,800,000.00
Musée d’Art Moderne de la Ville €100,000,000.00
Galleria d’Arte Moderna Ricci-Oddi €60,000,000.00
Kunsthistorisches Museum €50,000,000.00
Moderna Museet €46,000,000.00
Museu de Arte de São Paulo €44,300,000.00
Nationalmuseum €31,000,000.00
Galleria Nazionale d'Arte Moderna €29,800,000.00
TEFAF Maastricht €27,000,000.00
Schirn Kunsthalle €21,100,000.00
Kunsthal Rotterdam €18,100,000.00
Fitzwilliam Museum €16,000,000.00
Museo Chácara do Céu €8,700,000.00
Singer Laren Museum €6,000,000.00
Whitworth Art Gallery €4,800,000.00
Mohamed Mahmoud Khalil Museum €4,570,000.00
Bode Museum €3,700,000.00
Henry Moore Foundation €3,500,000.00
Fondation Baur €3,500,000.00
Durham University Oriental Museum €2,250,000.00
Pretoria Art Museum €2,000,000.00
Museo Nacional de Bellas Artes de Asunción €1,000,000.00
Australian Museum €1,000,000.00
The Jewish Museum €1,000,000.00
Art Gallery of New South Wales €1,000,000.00
Tretyakov Gallery €167,815.00
Reubens House Museum €140,400.00
British Museum €50,000.00
Gongju National Museum €13,532.00
Permanenten KODE Museum €0.00
Drottningholm Palace €0.00
Château de Fontainebleau Chinese Museum €0.00

So yes, the Louvre theft was worth quite a lot, at roughly $100m. But the Van Gogh Museum and the Gardner Museum each lost more than €400m in art. Yikes!
In how many cases was the heist discovered on the same day as it took place? For those heists that were discovered later on, how long did it take, on average? What was the longest amount of time it took for the heist to be discovered?
I first set the index to be Target name, and then selected (using filter) columns containing the string 'ate' somewhere in the name:
(
df
.set_index('Target name')
.filter(regex='ate')
)

To find the difference between the day on which the heist happened and the day on which it was discovered, I ran diff, specifying that I wanted to subtract one column from the other. The first column was then replaced with NaT ("not a time") values, but the second column, still called full_date, was now full of timedelta values. I kept only the full_date column:
(
df
.set_index('Target name')
.filter(regex='ate')
.diff(axis='columns')
['full_date']
)

Because of the order of the columns, the differences were now negative. So I invoked abs to get the absolute value of each timedelta, then used loc and lambda to keep only those values in which the time was greater than 0.
Note that you can compare a timedelta with another timedelta value or with a string, as I did here:
(
df
.set_index('Target name')
.filter(regex='ate')
.diff(axis='columns')
['full_date']
.abs()
.loc[lambda s_: s_ > '0 seconds']
.sort_values(ascending=False)
)

The result:
Target name full_date
Whitworth Art Gallery 2 days 00:00:00
Musée d’Art Moderne de la Ville 1 days 00:00:00
The Jewish Museum 1 days 00:00:00
British Museum 1 days 00:00:00

As we can see, it generally took only 1 day (more or less) for the heist to be discovered. However, a theft at the Whitworth Art Gallery took 2 days.
On average, it thus took 1 day and 6 hours for the authorities to find out about these heists. However, given that our granularity is only in one-day increments, that average is at best a rough estimate.
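One way to compute that average is to reuse the same chain of timedeltas and invoke mean on it (a sketch; the four values shown above work out to 5 days / 4 = 1 day, 6 hours):

(
    df
    .set_index('Target name')
    .filter(regex='ate')
    .diff(axis='columns')
    ['full_date']
    .abs()
    .loc[lambda s_: s_ > '0 seconds']
    .mean()
)
# Timedelta('1 days 06:00:00')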