BW #4: Eating well (solution)
Is it really true that some American children eat less than one vegetable per day? This week, we're looking at data from the National Survey of Children's Health, and see what we can learn.
This week’s topic: Vegetables
This week, we’re looking at data from the CDC’s NSCH, focusing on children who rarely eat any vegetables. I asked you to:
Download the topical data file, in SAS format, from:
https://www2.census.gov/programs-surveys/nsch/datasets/2021/nsch_2021_topical_SAS.zip. Turn it into a data frame. We're only interested in the following columns:
FIPSST
VEGETABLE
FRUIT
SUGARDRINK
Turn the
FIPSST
column into an integer, and make it the index.What percentage of children had, on average, less than one vegetable per day during the week preceding the study?
What percentage of children had, on average, less than one vegetable per day and less than one fruit per day during the week preceding the study?
What percentage of children had, on average, less than one vegetable per day and less than one fruit per day and did have a sugary drink during the week preceding the study?
Download the FIPS state reference info, in CSV format, from https://www2.census.gov/geo/docs/reference/state.txt. Turn this into a data frame, with the
STATE
column as the index. The only other column we care about isSTATE_NAME
.What percentage of children, per state, had, on average, less than one vegetaable per day during the week preceding the study?
There’s a lot to deal with here, so let’s get to it!
Discussion
The first thing that I asked you to do was import a data file stored in SAS format. Now, I’ve never used SAS, and I don’t know much about its data files. But it turns out that not everyone distributes their data in CSV or Excel files; if your organization uses SAS, then why not store things in SAS’s format? I’m sure (again, speaking from ignorance) that there are some advantages to it, such as binary storage and not having to infer dtypes.
What happens when a Pandas user encounters a SAS file? My initial guess was that I’d have to convert it to CSV. But it turns out that Pandas has a read_sas function that allows you to create a data frame directly from a SAS file.
There were two different issues that we had to deal with in opening the file. First, the data file came inside of a zipfile, so we couldn’t open it directly via a URL. We had to open the zipfile up, and then run read_sas on the file in SAS 7 format, which is what Pandas supports. That’s the file with a “sas7bdat” extension, weighing in at 182 MB.
The second problem with read_sas is that it doesn’t let you choose which columns you want in your data frame with a “usecols” keyword argument. I was thus forced to read the entire file into a data frame, and then pare that data frame down by stating which columns I wanted:
filename = '/Users/reuven/Downloads/nsch_2021_topical_SAS/nsch_2021_topical.sas7bdat'
df = pd.read_sas(filename)
df = df[['FIPSST', 'VEGETABLES', 'FRUIT', 'SUGARDRINK']]
This all seems fine, until we look at our data frame:
See how the FIPSST column’s values look? That seems kind of weird, no? And indeed, that column must have been stored in SAS as bytes, which were then read into Pandas as bytes, or one-character byte strings. We want integers, though, not bytes. And this means running the FIPSST column, which indicates which US state or territory is being referenced, through “astype”:
df['FIPSST'] = df['FIPSST'].astype(np.int8)
Since there are only a few dozen values for FIPSST, I decided that I’d use np.int8 for the dtype. I thus invoked “astype(np.int8)” on my column, getting back a new series, which I then assigned back to df[‘FIPSST’].
With this in place, the dtypes are all set correctly. We can thus finish up the first question by making the FIPSST column our data frame’s index (calling set_index):
df = df.set_index('FIPSST')
We’re now set to start answering some of the questions that I had posed.
Keep reading with a 7-day free trial
Subscribe to