BW #26: Hot weather (solution)

Is it hot where you live? Temperatures are rising all over the globe. This week, we'll find out where it has been hottest, and whether temperatures have been rising over the last few years.

This week, we looked at high-temperature data, in an attempt to see where and when it has been particularly hot over the years, and whether we can see a general trend toward hotter temperatures.

Our data this week came from the National Centers for Environmental Information (NCEI), part of the National Oceanic and Atmospheric Administration (NOAA), part of the US Department of Commerce.

The data is huge, even after we whittle it down and use only a part of it. Part of this week’s learning goals was to take a large number of files and turn them into a data frame containing only some of the data.

I first suggested that you look through the data dictionary for this data set, to understand the structure and contents of what we’re going to be working with. The data dictionary is located here:

https://www.ncei.noaa.gov/pub/data/ghcn/daily/readme.txt

I then gave you 7 questions and tasks for this week, the bulk of the work being in question 3, where we create the data frame based on the files.

So, without further ado, let’s get to this week’s questions:

Download the list of weather stations (https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt) and turn it into a data frame. Use the specifications for the file, as described in the README. You'll want to set your own names for the column headers. Make the `id` column into the index.

There are a lot of weather stations positioned all over the world, and the data set that we’re working with this week includes data from all of them. Each weather station has a unique ID, as well as a location name, longitude, and latitude. Turning the stations into a data frame is a good first step.

Before doing anything else, I decided to load up NumPy and Pandas:

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

With those in place, I then started to work on the data itself, creating a new data frame with the station information.

Most of the time, we deal with files in CSV or Excel format. However, the stations aren’t in either of those formats, as you might have seen in the data dictionary. Instead, they are in “fixed-width format,” meaning that while each line in the file contains a single record, the fields aren’t separated by delimiter characters. Rather, each field contains a specific number of characters.
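To make the idea concrete, here’s a tiny sketch of parsing a fixed-width record by hand, using plain string slices. (The two-field layout here is made up, purely for illustration.)

# A hypothetical fixed-width record: a 10-character name field,
# followed by a 6-character number field
line = 'Jane      000042'

name = line[0:10].strip()   # characters at indexes 0-9
number = int(line[10:16])   # characters at indexes 10-15

print(name, number)         # Jane 42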

Fortunately, Pandas comes with read_fwf, a function for reading fixed-width field files. If you read through the README file, you’ll see that section IV indicates that each line contains 85 characters, divided into nine fields:

------------------------------
Variable   Columns   Type
------------------------------
ID            1-11   Character
LATITUDE     13-20   Real
LONGITUDE    22-30   Real
ELEVATION    32-37   Real
STATE        39-40   Character
NAME         42-71   Character
GSN FLAG     73-75   Character
HCN/CRN FLAG 77-79   Character
WMO ID       81-85   Character
------------------------------

In theory, we could call read_fwf without any arguments other than the filename from which we want to read. But in reality, that’ll give us a huge mess. That’s because by default, read_fwf tries to infer where the columns are. In a file like this one, where there are numerous empty fields, it’ll quickly get confused and give us the wrong number of fields.
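Here’s what that naive call looks like. (I’m assuming that we saved the downloaded file as ghcnd-stations.txt, matching the URL above; adjust the name to whatever you used.)

stations_filename = 'ghcnd-stations.txt'

# Without colspecs, read_fwf tries to infer the column boundaries --
# which, on this file, produces the wrong number of fields
stations_df = pd.read_fwf(stations_filename)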

Fortunately, the data dictionary tells us the columns used by each field, which means that we can indicate which field goes where by passing a list of tuples to the “colspecs” argument:

# Each (start, stop) tuple is 0-based, and the stop index is excluded
stations_df = pd.read_fwf(stations_filename,
                 colspecs=[(0,11), (12,20), (21,30), (31,37), (38,40), (41,71), (72,75), (76,79), (80,85)])

First of all, notice that the numbers in my tuples and the numbers in the above specification from the README aren’t quite the same. That’s because the data dictionary called the first column 1 — but in Python, the first column is 0. We thus need to subtract 1 from the starting point for each of the columns.

What about the ending point? Shouldn’t the first tuple be (0, 10) rather than the specified (1, 11)? No, because Python (and read_fwf) almost always assume “up to but not including” the endpoint. So when we say (0, 11), we mean that we want 11 characters, starting with index 0, and going through index 10.
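A quick demonstration of this half-open behavior, using an ordinary Python string:

s = '0123456789ABCDEF'

chunk = s[0:11]      # start at index 0, stop *before* index 11
print(chunk)         # 0123456789A
print(len(chunk))    # 11 -- indexes 0 through 10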

I’m not sure why, but the format used for the stations leaves an empty character between columns, which just makes it more confusing.

If you use the above code to read in the file, you’ll quickly discover that it’s still a bit weird. That’s because read_fwf, like read_csv, assumes that the first row names the columns. To convince it otherwise, we’ll need to set “header=None”, indicating that there are no headers in this file, and then pass a list of strings to the “names” argument, telling it what to call the columns.

Finally, we can pass “index_col” and name the “id” column (defined in “names” above) as our index:

stations_df = pd.read_fwf(stations_filename,
                 header=None,   # the file's first row is data, not column names
                 names='id latitude longitude elevation state name gsn_flag hcn_crn_flag wmo_id'.split(),
                 colspecs=[(0,11), (12,20), (21,30), (31,37), (38,40), (41,71), (72,75), (76,79), (80,85)],
                 index_col='id')   # use the station ID as the index

The result is a data frame with 124,954 rows, each describing a different weather-monitoring station somewhere in the world.
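As a quick sanity check, we can look at the frame’s shape, and then retrieve one station by its ID. (The ID below, which I believe belongs to New York’s Central Park station, is just an example; any ID from the file will work.)

print(stations_df.shape)   # (124954, 8) -- nine fields, minus the index

# Retrieve a single station by its ID
print(stations_df.loc['USW00094728'])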

Download the GHCND-ALL data (https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd_all.tar.gz). NOTE: This file is 3.4 GB in size, so it might take a while to download to your computer. Follow the directions in the README to un-tar the file. This will result in about 30 GB of files being created under the `ghcnd_all` directory.

Most people are used to working with zipfiles. A zipfile can contain a number of files within it, and also compresses those files. It’s thus super convenient to work with zip.

But before zip came along, there was “tar” format, short for “tape archive.” The idea of a tarfile was that you could take a whole bunch of files and put them together inside of one file, typically for backup purposes. I could “tar up” an entire directory, including its files and subdirectories, and store that file somewhere. Then, if/when I need to retrieve those files, I could untar them.

Note that tar archived files, but didn’t compress them. A few different compression schemes existed at the time, but the “GNU zip” format (aka “gzip”), from the people at the Free Software Foundation (sort of a predecessor to the open-source movement), quickly took hold. Despite the name, there was no connection between gzip and the zip that we now know; I’d argue that the name was a foolish choice on the GNU people’s part, but there you have it.

In the Unix world, it’s thus pretty common to have a file with the “.tar.gz” dual suffix. To open the file, you first need to un-gzip it, and then you have to un-tar it. There are far too many options to both tar and gzip to explain them here, but the README explains that after you have downloaded the ghcnd_all.tar.gz file, you can open it up into a directory with the following command:

tar xzvf ghcnd_all.tar.gz

In short, the above command first un-gzips the file (that’s the “z” option), then extracts the file (that’s the “x” option), doing it verbosely (i.e., telling us what it’s doing) and working on the file we specify (i.e., ghcnd_all.tar.gz).
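If you would rather stay inside of Python, the standard library’s tarfile module can do the same job. Here’s a minimal sketch:

import tarfile

# 'r:gz' un-gzips as it reads; extractall unpacks into the current
# directory, creating ghcnd_all/ with all of the .dly files
with tarfile.open('ghcnd_all.tar.gz', 'r:gz') as tf:
    tf.extractall()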

The gzipped tarfile that we downloaded is 3.4 GB in size. And when opened up, into a new directory? The complete contents are 30 GB, with 124,946 files, each ending with “.dly”, meaning that each file contains daily information from a particular weather station.
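We can double-check that count from Python with pathlib, assuming that the archive was untarred into the ghcnd_all directory as described above:

from pathlib import Path

# Count the .dly files that came out of the tarfile
dly_files = list(Path('ghcnd_all').glob('*.dly'))
print(len(dly_files))   # should print 124946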