This week, we looked at the use of cellphones in schools – more specifically, in schools in the United States. The data came from the National Center for Education Statistics, part of the US government's Department of Education. Their "School Pulse Panel" is a representative sample of schools that regularly provide updates on a wide range of topics. The SPP has recently started to ask about cellphones, Internet access, and related topics. The latest survey is available from https://nces.ed.gov/surveys/spp/results.asp , and I thought it would be interesting to explore it and find out just how cellphone usage is perceived.
Data and five questions
The data comes from the SPP in the form of an Excel file. Go to the SPP page, and scroll down to the "technology" section, at https://nces.ed.gov/surveys/spp/results.asp#technology-dec24-chart-1 . You can then click on the link https://nces.ed.gov/surveys/spp/docs/release/Technology.xlsx , labeled "see full data for 2024-2025 technology."
This week's learning goals include grouping and pivot tables, plotting, and multi-indexes.
Paid subscribers, including members of my LernerPython.com program, can read all of today's questions. Those subscribers will get the full set of questions and answers tomorrow, along with an invitation to office hours and downloadable data and notebooks.
Meanwhile, here are my five tasks and questions. I'll be back tomorrow with full answers and explanations:
Read the "School Level" sheet from the Excel spreadsheet, keeping the Subgroup, Respondents, Description, Category, and Dec24 Pct columns. Turn # into 0, and other symbols into NaN. Replace the Dec24 Pct column with percentage, a float column. Keep only those rows that contain the word "phone" or "internet" in the "Description" column. Keep only rows in which the Respondents value is "All Schools". What percentage of each type of school (elementary, middle, and high) have a policy regarding cell phone usage during school hours?
Before doing anything else, I loaded up Pandas:
import pandas as pd

With that in place, I then wanted to load the "School Level" sheet from the data file. Since the file is in Excel format, that means using read_excel. It works much like read_csv and other reading methods in Pandas, in that it sits in the top-level pd namespace and returns a data frame. However, read_excel doesn't natively read Excel files; it depends on other Python packages to do so. For an .xlsx file like this one, you'll thus need to install openpyxl from PyPI (xlrd, the other supported engine, now handles only legacy .xls files).
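If you want to check that the engine is installed and working before tackling the real data, you can round-trip a tiny data frame through Excel. This is just a sketch; the filename and column names here are made up for the example:

```python
import pandas as pd

# Requires openpyxl (pip install openpyxl); pandas picks it up
# automatically for .xlsx files, so no engine= argument is needed.
demo = pd.DataFrame({"Subgroup": ["Elementary", "High/Secondary"],
                     "Dec24 Pct": [96, 92]})
demo.to_excel("demo.xlsx", sheet_name="School Level", index=False)

# Reading it back should reproduce the same columns and values
result = pd.read_excel("demo.xlsx", sheet_name="School Level")
```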
I then read from the spreadsheet, indicating the filename, and sheet name, and which columns I wanted to load (with usecols):
filename = 'data/bw-143-tech.xlsx'
df = (pd
.read_excel(filename,
sheet_name='School Level',
usecols=['Subgroup', 'Respondents',
'Description', 'Category',
'Dec24 Pct']
)
)

This was a good start, but wasn't quite enough. In particular, I wanted to treat a number of symbols as NaN values, and # as 0. I decided to tackle this in two parts: First, I used the na_values keyword argument to treat those symbols as NaN. Second, I then ran replace on the returned data frame, turning '#' into '0':
df = (pd
.read_excel(filename,
sheet_name='School Level',
usecols=['Subgroup', 'Respondents',
'Description', 'Category',
'Dec24 Pct'],
na_values=['‡', '†', '—']
)
.replace('#', '0')
)

na_values is a great option to use when, as here, the authors of a data set use a number of non-standard markers instead of NaN. Here, each of these symbols had a slightly different meaning, but I decided to treat them all as NaN. All, that is, except for '#', which I interpreted as needing to be 0.
Notice that in my call to replace, the replacement value was not the integer 0, but rather the string '0'. I actually tried to replace '#' with an integer, but Pandas gave me a warning reminding me that it's a bad idea to perform such a replacement with a different data type. I thus used a string, '0', as the replacement text, knowing that I could later turn the entire column into a numeric type. (We'll get to that in a moment.)
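The same-dtype rule can be seen on a toy Series; the values here are invented, but mirror the shape of the survey column:

```python
import pandas as pd

# '#' stands for zero in the survey's notation; every value is a string
s = pd.Series(["96", "#", "92"])

# Replacing '#' with the string '0' keeps the column all-string,
# avoiding the mixed-dtype warning...
cleaned = s.replace("#", "0")

# ...and then the whole column can be converted to floats in one step
numbers = cleaned.astype(float)
```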
I next wanted to keep only those rows containing either "phone" or "internet". To do that, I fired up str.contains, putting it inside of a loc and lambda combination. However, I kept getting error messages from Pandas, telling me that it couldn't run str.contains on NaN values. It turned out that a single row (just one!) with NaN in the Description column was gumming up the works. I thus added a call to dropna just before the loc, using the subset keyword to tell Pandas that I only wanted to remove rows with NaN in the Description column; by default, dropna removes any row containing NaN in any column. I also used loc and lambda to keep only those rows in which Respondents was 'All Schools':
df = (pd
.read_excel(filename,
sheet_name='School Level',
usecols=['Subgroup', 'Respondents',
'Description', 'Category',
'Dec24 Pct'],
na_values=['‡', '†', '—']
)
.replace('#', '0')
.dropna(subset='Description')
.loc[lambda df_: df_['Description'].str.contains('phone|internet',
case=False, regex=True)]
.loc[lambda df_: df_['Respondents'] == 'All Schools']
)
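The difference that subset makes can be sketched on a two-row toy frame (column names chosen to match the survey data; note that str.contains also accepts an na= argument, which would have been another way to sidestep the error):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({"Description": ["phone policy", np.nan],
                    "Category": [np.nan, "Yes"]})

# Without arguments, dropna removes every row containing any NaN at all
no_rows = toy.dropna()

# With subset, only NaNs in the named column(s) trigger removal
one_row = toy.dropna(subset="Description")
```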
In order to work with the percentage of schools giving each answer to the survey, I then needed to turn the Dec24 Pct column into a numeric dtype. I would have used assign to overwrite the existing column, except that assign names columns with Python's keyword arguments – and the name in those keyword arguments must be a valid Python identifier, meaning without space characters. I thus decided to create a new column, percentage, turning Dec24 Pct into a float column. I then used drop to remove both Dec24 Pct (since I had replaced it with percentage) and Respondents (since they all contained the same value, 'All Schools'):
df = (pd
.read_excel(filename,
sheet_name='School Level',
usecols=['Subgroup', 'Respondents',
'Description', 'Category',
'Dec24 Pct'],
na_values=['‡', '†', '—']
)
.replace('#', '0')
.dropna(subset='Description')
.loc[lambda df_: df_['Description'].str.contains('phone|internet',
case=False, regex=True)]
.loc[lambda df_: df_['Respondents'] == 'All Schools']
.assign(percentage = lambda df_: df_['Dec24 Pct'].astype(float))
.drop(columns=['Dec24 Pct', 'Respondents'])
)
The result was a data frame with 123 rows and 4 columns.
I then wanted to find out the percentage of each type of school with a phone policy. I first used loc to select only those rows in which the Description column contained the question I wanted to ask, and in which the Category column contained the answer ("Yes") that I was looking for. I invoked set_index to make Subgroup into the data frame's index, and then selected only the percentage column:
(
df
.loc[lambda df_: df_['Description'] == "Does your school or district have a policy on students' cell phone usage during school hours?"]
.loc[lambda df_: df_['Category'] == 'Yes']
.set_index('Subgroup')
['percentage']
)

The result:
Subgroup percentage
Elementary 96
Middle/Combined 95
High/Secondary 92

In other words, a very high percentage of schools – more than 90 percent! – have explicit policies regarding phone use. This was the first time that such a question was asked, which is a shame; I would have been interested in seeing its growth over time. But I guess it's no surprise that so many schools have decided, in 2025, that they need to regulate phone use during school hours.
One of the questions (in the Description column) asks, "Which of the following best describes your school's or district's policy on students having their cell phones in the classroom?" Show the percentage of each type of school (the Subgroup column) answering "Students can have their cell phones during all classes" and "Students cannot have their cell phones during any classes." Is there a significant difference? What would explain it?
Next, I used loc and lambda to keep only those rows for the question that I was looking for, i.e., the best description of the school's policy on phones in the classroom. I then used a second loc and lambda to find those answers (Category) that were for "any classes" or "all classes," using a regular expression.
This left me with a 6-row, 4-column data frame with all of the school types and their policies. I used pivot_table to make the data more readable, making the school types (Subgroup) into the rows, the answers (school policy) into the columns, and the numbers (which didn't need any aggregation method) as the values:
(
df
.loc[lambda df_: df_['Description'] == "Which of the following best describes your school's or district's policy on students having their cell phones in the classroom?"]
.loc[lambda df_: df_['Category'].str.contains('during any classes|during all classes')]
.pivot_table(index='Subgroup',
columns='Category',
values='percentage')
)

The result, with the answers trimmed so that you can read it more easily:
Subgroup can have cannot have
Elementary 4 86
High/Secondary 9 55
Middle/Combined 4 81

We see that the majority of schools forbid cellphones all of the time. But high schools are much less likely to forbid them, and more than twice as likely to allow them. On the one hand, I understand that high schools give their students more freedom. But on the other hand, I would expect more high-school students to have cellphones, and for phones thus to be a bigger problem.
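As a final aside: pivot_table's default aggregation is the mean, which is harmless here because each (Subgroup, Category) pair appears exactly once, so the "mean" just passes each number through. A minimal sketch with invented numbers:

```python
import pandas as pd

# Long-format data, one row per (school type, answer) combination
long = pd.DataFrame({"Subgroup": ["Elementary", "Elementary", "High/Secondary"],
                     "Category": ["can have", "cannot have", "can have"],
                     "percentage": [4.0, 86.0, 9.0]})

# One value per cell, so the default aggfunc='mean' leaves values unchanged;
# combinations with no data come back as NaN
wide = long.pivot_table(index="Subgroup",
                        columns="Category",
                        values="percentage")
```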