BW #33: Fracking (solution)

Fracking is a commonly used technique to extract petroleum products. It also seems to use a great deal of water. How much water does it use, and is the same amount needed everywhere? Let's check!

This week, we’re looking at fracking (https://en.wikipedia.org/wiki/Fracking), a commonly used technique for extracting oil and gas from the earth. Fracking has long been controversial, and is even illegal in some countries — but a New York Times story earlier this week discussed a newer problem, namely the amount of water that goes into fracking. The fact that many areas of the United States are low on water makes it all the more controversial for fracking to use so much of this precious resource.

[Dall-E image, in response to: “A car (in its entirety) on the beach. A nozzle is filling the car not with gas, but with water from the ocean.”]

Data and eleven questions

This week, we looked at data from FracFocus, an organization that collects information about fracking, including chemicals injected into the earth to improve the fracking process. This data set also describes how much water is used. The data files are all in CSV format, and can be downloaded via this link:

    https://fracfocusdata.org/digitaldownload/FracFocusCSV.zip
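(If you’d rather script the download than click the link, a minimal sketch using only the standard library might look like the following. The “fracfocus” target directory is just an example name that I’m choosing here, not something from the data set itself.)

import urllib.request
import zipfile

# Download the zip file of CSVs to the current directory
zip_path, _ = urllib.request.urlretrieve(
    'https://fracfocusdata.org/digitaldownload/FracFocusCSV.zip',
    'FracFocusCSV.zip')

# Extract all of the CSV files into a "fracfocus" subdirectory
with zipfile.ZipFile(zip_path) as z:
    z.extractall('fracfocus')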

I gave you eleven tasks and questions this week. That’s a lot, but there was a lot to explore with this data set! (I thought about doing much more, but there’s a limit as to how much I can ask of you in a given week.) Here are the questions, along with my solutions and explanations:

Import the "FracFocusRegistry" CSV files into a single data frame. Include only the columns: JobStartDate, TotalBaseWaterVolume, StateName, CountyName, and FederalWell. Don't bother trying to parse the `JobStartDate` column into datetimes just yet. Let them be strings.

First, let’s load Pandas:

import pandas as pd

Now that we’ve done that, we need to create a single data frame from a bunch of CSV files. Fortunately, the CSV files all fit the same pattern; if we were in the Unix shell, then we could describe them as

FracFocusRegistry_*.csv

(Of course, we could name each of the files individually. But we’re programmers, and we want to follow the DRY (“don’t repeat yourself”) principle as often as possible.) Fortunately, Python’s standard library comes with the “glob” module, whose “glob” function (i.e., “glob.glob”) returns a list of filenames matching a pattern. We can thus say:

import glob
glob.glob('FracFocusRegistry_*.csv')

But of course, we aren’t interested in the filenames themselves. Rather, we want to take each of those CSV files, turn it into a data frame, and then combine all of those data frames into a single, large data frame. (You could, in theory, combine all of the files and then read them into Pandas with one call to read_csv, but that would raise all sorts of other issues.)

What we’ll need to do is take each filename and run read_csv on it. The result from each call to read_csv will be a data frame. If we put each of those data frames into a list, then we can call pd.concat on the list, getting a single, large data frame as a result.

My favorite way to do this is with a list comprehension, which lets us invoke a Python expression repeatedly as we iterate over a bunch of values. Since we know what columns we want from the CSV files, we can even specify that with the “usecols” keyword argument:

import glob

all_dfs = [
    pd.read_csv(one_filename,
      usecols=['JobStartDate', 
        'TotalBaseWaterVolume', 'StateName', 
        'CountyName', 'FederalWell'])
    for one_filename in glob.glob('FracFocusRegistry_*.csv')
]

df = pd.concat(all_dfs)

The above code iterates over each of the filenames we get back from calling glob.glob. It runs pd.read_csv on each filename, getting back a data frame. all_dfs is the list of data frames that results from these calls, and pd.concat takes that list and returns one new data frame built from its inputs. Here’s the full version, including the directory in which I put the downloaded files:

import glob

dirname = '/Users/reuven/Courses/Current/data/fracfocus'

all_dfs = [
    pd.read_csv(one_filename,
      usecols=['JobStartDate', 
        'TotalBaseWaterVolume', 'StateName', 
        'CountyName', 'FederalWell'])
    for one_filename in glob.glob(f'{dirname}/FracFocusRegistry_*.csv')
]

df = pd.concat(all_dfs)

We now have a single data frame with our five columns and 6,087,921 rows.
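It never hurts to sanity-check what we loaded. For example:

# How many rows and columns do we have?
df.shape

# Which dtypes did read_csv choose for each column?
df.dtypes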

Now turn `JobStartDate` into datetime dtypes. If the input date string cannot be parsed, then leave it as `NaT`.

Normally, when we read data from a CSV file, we can indicate that a particular column contains date and time information (rather than a string) by passing the “parse_dates” keyword argument, along with a list of columns we want to be parsed as dates. However, if you were to try that with these files, it wouldn’t work. That’s because there are a number of lines that contain illegal datetime values. Using “parse_dates” in “read_csv” will result in a string column, rather than datetime64.
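Just to make that concrete, here’s what the not-quite-working version would look like, reusing the loading code from above. It runs, but because of the bad values, JobStartDate comes back with object (i.e., string) dtype:

all_dfs = [
    pd.read_csv(one_filename,
      usecols=['JobStartDate', 
        'TotalBaseWaterVolume', 'StateName', 
        'CountyName', 'FederalWell'],
      parse_dates=['JobStartDate'])  # the unparseable values leave this column as strings
    for one_filename in glob.glob(f'{dirname}/FracFocusRegistry_*.csv')
]

df = pd.concat(all_dfs)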

In such a case, it’s better to read the data in as a string, and then use “pd.to_datetime” to perform the conversion. That’s because to_datetime takes a variety of arguments that we can use to specify the conversion more clearly and explicitly.

The basic idea is that we call pd.to_datetime on a series of strings, and get back a series of datetime64 objects. We can then assign that series to a new column in our data frame, or (more likely) assign it back to the same column that we’re converting, replacing strings with datetimes.

In the simplest case, we could say:

df['JobStartDate'] = pd.to_datetime(df['JobStartDate'])

But as I mentioned above, that’ll give us some warnings about the date format, and an indication that we would be better off specifying the date format explicitly, by passing the “format” keyword argument and a string describing the date format. (You have to use format codes to do this, and I never remember them. Fortunately, there’s https://strftime.org, which lists them.)
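For example, if the dates in these files were formatted like “2021-06-15T00:00:00” — an assumption I’m making for illustration, not something I checked against the actual files — the explicit version would look like this:

# '%Y-%m-%dT%H:%M:%S' is only a guess at the format; check the
# actual strings in your files and adjust accordingly
df['JobStartDate'] = pd.to_datetime(df['JobStartDate'],
                                    format='%Y-%m-%dT%H:%M:%S')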

In this particular case, I decided to try my luck with the “errors” keyword argument, passing a value of “coerce”. When you do that, any failure to parse a date produces a “NaT” value — short for “not a time,” the datetime equivalent of “NaN,” or “not a number.” This guarantees that the resulting column will contain datetime64 values, with the only question being how many of them will be NaT:

df['JobStartDate'] = pd.to_datetime(df['JobStartDate'], errors='coerce')

Sure enough, this worked just fine, with only a handful of rows containing NaT values.
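We can count exactly how many rows failed to parse, since isna treats NaT the same way it treats NaN:

# Number of rows whose JobStartDate couldn't be parsed
df['JobStartDate'].isna().sum()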