BW #78: Stock markets (solution)

This week, we looked at data from a number of major stock-market indexes. The goals were not just to find out if the major drops we saw earlier this week were unusually large in historical terms, but also to see what connections we can find across the various indexes.

Data and six questions

There are a number of ways to download the data, and you'll get roughly the same information, in roughly the same format, no matter where you get it from. I personally went to investing.com (https://investing.com) where I signed up for a free account, went to the "major indices" page at https://www.investing.com/indices/major-indices , and downloaded a CSV file from January 1, 2005 through August 7, 2024 for the following indexes:

  • DAX (Germany)
  • Dow Jones Industrial Average (US)
  • FTSE 100 (UK)
  • NASDAQ (US)
  • Nikkei 225 (Tokyo)
  • S&P 500 (US)
  • Shanghai Composite (China)

Here are the six tasks and questions that I posed for you based on this data. As usual, a link to the Jupyter notebook I used to put this together

Take the seven input CSV files, and turn them into a single data frame. The index of the data frame should be a two-part multi-index, with the first name of the filename on the outer layer and the date on the inner layer. We only care about the Date, Price, High, Low, Vol., and Change % columns.

For starters, I loaded Pandas. But I also loaded the os module from Python's standard library, since I'll use it in loading the files:

import pandas as pd
import os

I put all of the CSV files into a subdirectory called data under my Jupyter notebook's directory, and then made a list of those files:

filenames = ['DAX Historical Data.csv',
             'Dow Jones Industrial Average Historical Data.csv',
             'FTSE 100 Historical Data.csv',
             'NASDAQ Composite Historical Data.csv',
             'Nikkei 225 Futures Historical Data.csv',
             'S&P 500 Historical Data.csv',
             'Shanghai Composite Historical Data.csv']

Notice that it's totally OK for filenames to contain spaces. If you type the name on the command line, then you'll need to use quotes or backslashes – but in Python, where filenames are strings, it's easier to understand how we can include spaces.

Given those CSV files, how can we create a single data frame? The easiest way is to iterate over the filenames, creating one data frame from each. We can put those data frames into a list, and then use pd.concat to combine them.

However, there's a small snag, namely that we want to include to indicate from which file we took the data. Indeed, I want the resulting data frame to have a two-part multi-index, with the outer part indicating which index we're dealing with, and the inner part being the date of the reading.

I would normally use a list comprehension to read these files and create data frames from them, but adding the index name would complicate that. I thus decided to use a regular for loop, iterating over the list and creating a new data frame with each iteration. I created an empty all_dfs list, into which we can then append each of the data frames we create:

all_dfs = []

for one_filename in filenames:
    one_df = (pd
          .read_csv(os.path.join('data', one_filename),
                    usecols=['Date', 'Price', 'High', 
                             'Low', 'Vol.', 'Change %'],
                    parse_dates=['Date'])
          .assign(source=one_filename.split()[0])
          .set_index(['source', 'Date'])
         )
    all_dfs.append(one_df)

Let's walk through the above code:

  1. First, we run read_csv, reading in a filename. I use the usecols keyword argument to indicate which columns I actually want in the resulting data frame, and the parse_dates column to indicate that the Date column shouldn't be treated as a string, but rather as a datetime value.
  2. Note that I use os.path.join to add an initial directory name (data) to one_filename, which is cleaner than using an f-string.
  3. I then use the assign method to add a new column to the data frame. I grab the first word from one_filename, giving me a unique identifier that I can use, and assign it to all of the rows for that source column.
  4. Having created a data frame from the CSV file, and then having added the source column from the filename, I then set the two-part index with set_index, passing a list of column names, source and Date.
  5. Finally, I add the newly created data frame to all_dfs.

When the loop has finished running, all_dfs is a list of data frames. We can then pass that to pd.concat, getting a single data frame back:

df = pd.concat(all_dfs)

The result is a single data frame with a two-part multi-index on the rows, 28,637 rows, and 5 columns.

Modify the Change % to be a float column, rather than a string column. Modify Price, High, and Low, to be floats.

It's great that we managed to read the data into our data frame, but there is still work to be done. For example, the Change % column contains strings, even though we can actually work with the values if they're floats. In theory, we could use astype(float) on the column, getting a column of floats back – but we cannot do that with the values as they stand, because there is a % sign at the end of of each value.

Fortunately, Pandas comes with a lot of string functionality, all available via the str accessor. Many of the methods come from Python, but there are many that were inspired by other languages and frameworks. Even some of the Python methods have extra functionality, incorporating such features as regular expressions.

But to be honest, we don't need any of those things to turn the Change % column into a float dtype. We can use str.strip, a method well known to Python developers as able to remove leading and trailing whitespace from strings. It turns out that calling str.strip without any arguments is useful because we so often want to remove whitespace – but we can also pass a string argument to the method. In such a case, the characters in that string are all removed from the front and back of the string. For example, if I were to call str.strip('abc'), then the strings would not start or end with a, b, or c.

In this case, I'll run str.strip, passing it the single-character string '%' as an argument, to remove the % character from the end of the string. The resulting series of strings then contains strings that can be turned into floats with astype, which we then do, assigning the result back to df['Change %']:

df['Change %'] = (df
                  ['Change %']
                  .str.strip('%')
                  .astype(float)
                 )

We still have three other columns that we need to turn into float dtypes, as well. They all have the same problem, namely that the numbers contains commas every three digits – very nice for displaying numbers in a readable format, but not so good for calculations, or for using a float dtype.

We can solve this by invoking str.replace each of these columns, removing any commas. Then we can use astype(float) to get floats back. For example:

df['Low'] = df['Low'].str.replace(',', '').astype(float)

This works, but I would rather not have three nearly identical assignments in a row. Maybe I can somehow do this in a loop?

I could, but I came up with an even wilder idea: I'll create a dict in which the keys are the column names, and the values are lambda expressions containing the combination of str.replace and astype. Even better, I can do this in a dict comprehension:

rewriting = {one_column : 
               lambda df_: (df_[one_column]
                            .str.replace(',', '')
                            .astype(float))
             for one_column in ['Price', 'High', 'Low']
}

That's great, but what can I do with this dictionary? I can pass it to df.assign, which takes key-value pairs. However, we'll need to use ** to turn the dict we've created into those key-value pairs:

df = df.assign(**rewriting)

This will invoke each of our lambda expressions on the associated columns, then assigning the new (float) values in place of the string values. And indeed, after running this, I can say:

df.dtypes

And here's what I get:

Price       float64
High        float64
Low         float64
Vol.         object
Change %    float64
dtype: object

We're not there yet, but it's definitely progress.

Read more