[Administrative note: The 9th cohort of my Python Data Analysis Bootcamp, an 18-week small program in Python, Git, Pandas, and agentic coding, starts on June 4th. Learn more about it at https://LernerPython.com/bootcamp, or come to a free info session on Monday, June 1st: https://us02web.zoom.us/webinar/register/WN_JJ3GK2DCRFGVBzy9Vc41Tw.]
The 2026 FIFA World Cup (https://en.wikipedia.org/wiki/2026_FIFA_World_Cup) starts in about two weeks, with soccer ("football") matches all over North America – mainly in the United States, but also in Canada and Mexico. When I was in the US earlier this month for PyCon, numerous cities had signs advertising the games that they would be hosting. I've met people who will be traveling to the US specifically for these games. And of course, people around the world will be glued to their televisions (or, nowadays, their phones) watching as many games as they can.
This week, we'll look at data about World Cup games through the last tournament in 2022. The data, from a GitHub repo set up by Josh Fjelstul (https://github.com/jfjelstul/worldcup), offers a wealth of information about World Cup teams, games, and players. We'll try to understand which teams (men's and women's) have played and won the most, and just how old soccer players can be while still participating in the World Cup. (Sadly, the results indicate that I might no longer be an attractive recruit for championship teams.)
Paid subscribers, both to Bamboo Weekly and to my LernerPython+data membership program (https://LernerPython.com), get all of the questions and answers, as well as downloadable data files, downloadable versions of my notebooks, one-click access to my notebooks, and invitations to monthly office hours.
Learning goals for this week include working with CSV files, pivot tables, plotting, filtering, and working with dates and times.
Data and five questions
This week's data, as I indicated, comes from Josh Fjelstul's GitHub repo, https://github.com/jfjelstul/worldcup. We'll use files in the data-csv directory, which have kindly been denormalized – meaning that we can avoid performing extensive joins.
Here are my five tasks and problems. I'll be back tomorrow with my full solutions and explanations:
- Read the tournament data into a data frame. Then create a stacked bar plot showing how many tournaments have been hosted by each country. Each bar should be divided, to show how many men's tournaments and how many women's tournaments have been hosted in each country.
- Read the matches data into a data frame. Without using the "result" column, how often does the home team win vs. the away team (vs. tie games)? Which matches have the greatest mean number of goals: those that take place in the morning (before 12 noon), in the afternoon (between 12 noon and 6 p.m.), or at night (after 6 p.m.)? Do home or away teams win more often on weekends? Is there any correlation between the total number of goals and the hour at which the match started? Answer this last question both numerically and graphically.
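As a starting point for the two tasks above, and not a full solution, here is a minimal sketch of the main Pandas techniques involved: a cross-tabulation for the hosting counts, score arithmetic for the match outcome, and binning for the time of day. It runs on tiny toy data frames rather than the real files. The column names (tournament_name, host_country, match_date, match_time, home_team_score, away_team_score) are my assumptions about the CSV layout, the match rows are made up, and the tournament rows are only there to show the shape of the result, so check everything against the actual files before relying on it.

```python
import numpy as np
import pandas as pd

# --- Tournaments: hosting counts, split by men's vs. women's ---
# Illustrative rows; column names are assumptions about tournaments.csv.
tournaments = pd.DataFrame({
    "tournament_name": [
        "1930 FIFA Men's World Cup",
        "1994 FIFA Men's World Cup",
        "1999 FIFA Women's World Cup",
        "2003 FIFA Women's World Cup",
    ],
    "host_country": ["Uruguay", "United States", "United States", "United States"],
})

# Assumes the tournament name distinguishes men's from women's events.
tournaments["competition"] = np.where(
    tournaments["tournament_name"].str.contains("Women"), "women", "men"
)

# One row per host, one column per competition; ready for a stacked bar plot.
hosts = pd.crosstab(tournaments["host_country"], tournaments["competition"])

# --- Matches: outcome from scores, time of day, weekends, correlation ---
# Made-up rows; column names are assumptions about matches.csv.
matches = pd.DataFrame({
    "match_date": ["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04"],
    "match_time": ["11:00", "15:00", "20:00", "18:00"],
    "home_team_score": [2, 0, 1, 3],
    "away_team_score": [1, 0, 3, 0],
})

# Derive the outcome from the score difference, not from a "result" column.
diff = matches["home_team_score"] - matches["away_team_score"]
matches["outcome"] = np.sign(diff).map({1: "home", 0: "draw", -1: "away"})

# Combine date and time into one datetime, then pull out the starting hour.
start = pd.to_datetime(matches["match_date"] + " " + matches["match_time"])
matches["hour"] = start.dt.hour
matches["total_goals"] = matches["home_team_score"] + matches["away_team_score"]

# Left-closed bins: [0, 12) morning, [12, 18) afternoon, [18, 24) night.
# Putting exactly 6 p.m. in "night" is a choice made for this sketch.
matches["part_of_day"] = pd.cut(
    matches["hour"],
    bins=[0, 12, 18, 24],
    right=False,
    labels=["morning", "afternoon", "night"],
)
goals_by_part = matches.groupby("part_of_day", observed=True)["total_goals"].mean()

# dayofweek is 0 for Monday, so 5 and 6 are Saturday and Sunday.
weekend_outcomes = matches.loc[start.dt.dayofweek >= 5, "outcome"].value_counts()

# Pearson correlation between starting hour and total goals.
hour_goal_corr = matches["hour"].corr(matches["total_goals"])
```

Calling hosts.plot.bar(stacked=True) on the cross-tabulation draws one bar per host country, divided by competition. For the graphical side of the correlation question, matches.plot.scatter(x="hour", y="total_goals") is one reasonable option. Both need Matplotlib installed.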