Project 1: Digital Divide

Data Prep

Based on PPIC's Just the Facts report "California's Digital Divide"

Research Question(s):

  1. What share households in X state have access to high-speed internet?
  2. Does this number vary across demographic groups? (in this case race/ethnicity).

Goal:

  • explore datafiles (acsdata.data.gz) and create a working dataset from it.

Context:

Obtained American Community Survey (ACS) survey data from IPUMS.
It contains basic demographics:

  • age
  • gender
  • race/ethnicity

and geographic indicators:

  • state
  • county

Step 1: Set up your working environment.

Import all necessary libraries and create Paths to your data directories. This ensures reproducibility across file systems (windows uses \ instead of /)

We need

  1. pandas to work with the data.
  2. pathlib, and more specifically its Path object, to work with paths. This will ensure our code works in both Windows (which uses \ in its file paths) and MacOS/Linux (which uses /).
  3. datetime - tip: There are version control systems for data but tagging your data files with the date is not a bad first step if you're getting started.
In [1]:
# setting up working environment
import _____ as pd
from _____ import Path
from datetime import datetime as dt
today = __.today().strftime("%d-%b-%y")

print(today)
27-Apr-19

note: even if you are on windows you can type the path forward slashes / below

In [2]:
# data folder and paths
RAW_DATA_PATH = ____("../data/raw/")
XXXX_XXXXX_XXXX = ____("../data/interim/")
YYYY_YYYYY_YYYY = ____("../data/processed/")
ZZZZ_ZZZZZ_ZZZZ = ____("../data/final/")

NOTE: I've included a tools.py script with the function tree which displays a directory's tree (obtained from RealPython's tutorial on the pathlib module).

from our tools script import tree so we can use it.
In [ ]:
 
In [ ]:
tree(________)

Step 2: Load and explore the data

With pandas loading data is as easy as .read_csv(PATH_TO_CSV_FILE) and that works most of the time. Pandas read_csv() is so powerful it'll even read compressed files without any other parameter specification. Try the following:

data = pd.read_csv(RAW_DATA_PATH / 'acs_data.csv.gz')
data.head()

*make sure you change RAW_DATA_PATH to match whatever variable name you chose for it earlier.

In [ ]:


IPUMS offers a few data formats which can be more useful [docs]:

In addition to the ASCII data file, the system creates a statistical package syntax file to accompany each extract. The syntax file is designed to read in the ASCII data while applying appropriate variable and value labels. SPSS, SAS, and Stata are supported. You must download the syntax file with the extract or you will be unable to read the data. The syntax file requires minor editing to identify the location of the data file on your local computer.

In this case, we'll be using a Stata file (.dta). The main reason is that .dta files can store value labels which pandas can then read and convert columns to Categorical columns in our pandas DataFrame. This 1) saves memory, and 2) is good practice because certain social sciences really, really, really love Stata so their interesting datasets are likely .dta files.

However, pandas cannot read compressed .dta directly like it can .csv files. IPUMS, uses gzip compressed format and python includes a gzip module in its standard library.

Import gzip and try the following:

with gzip.open(RAW_DATA_PATH / 'acs_data.dta.gz') as file:
    data = pd.read_stata(file)

and then display the first five rows of your data DataFrame.

In [ ]:
# import gzip and load data
In [ ]:
# display first 5 rows

Step 3: Familiarize yourself with the dataset

We've already seen .head() - the pandas method that will display the first 5 rows of your DataFrame. This gives you an idea of what your data looks like. However, there are is a lot more .info() you can get out of your dataframe. You can also just ask the data to .describe() itself...

In [ ]:
# find out more info about your dataframe
data.____()
In [ ]:
# describing your data
data.____()

Check out the shape of your data with it's .shape attribute. Notice the lack of parentheses.

In [ ]:
data._____

Step 4: Trim your data

Right now you're working with your masterfile - a dataset containing everything you could need for your analysis. You don't really want to modify this dataset because you might be using it for other analyses. For example, we're going to be analyzing access to high-speed internet in a state of your choosing but next week you might want to run the same analysis on another state or maybe just on a specific county. To make sure you can reuse your data and code later let's create an analytical file or a working dataset, a dataset that contains only the data needed for this specific analysis at hand.

First, we are only interested in finding the "Digital Divide" of one state right now. The masterfile contains data for all 50 states and the Disctric of Columbia.

What you want to do is find all the rows where the statefip matches the your state's name. This is called boolean indexing.

Try the following

data['statefip'] == 'ohio'

Note: you can change 'ohio' to any other of the 50 states or 'district of columbia' for DC.

In [ ]:
# try boolean indexing
___['______'] == '_______'

This is going to return a pandas.Series of booleans (Trues and Falses) which then you can use to filter out any unnecessary rows.

It's good practice to save these as a variable early in your code (if you know them beforehand) or right before you use them in case you use these conditionals in more than one place. This is going to save you time if you decide to change the value you're comparing, 'ohio' for 'california' for example.

mask_state = (data['statefip'] == 'ohio')
data[mask_state].head()
In [ ]:
# try it yourself
mask_state = (________________________ == _______)
data[mask_state].____()

let's save it to another variable with a more useful name:

state_data = data[mask_state].copy()

You have to use .copy() to create actual copies of the data. If you ran

state_data = data[mask_state]

state_data would be a view of the data dataframe. This can have unintended consequences down the road if you modify your dataframes. A lot of the times you'd get just a warning and your code will run just as intented - but why take risks, right?

In [ ]:
# save your data to state_data
state_data = __________.copy()

Now, let's see what .columns we have in our dataframe. You can find these the same way you found the .shape of it earlier.

In [ ]:
state_data._____

Are there any columns that you are confident you don't need? If you are not 90% sure you won't need a variable don't drop it.

Dropping columns is as easy using .drop() on your dataframe.

state_data.drop(columns = ['list', 'of', 'columns', 'to', 'drop'])
In [ ]:
 

If there are variables you think you won't need but you're not very sure that's the case, you should explore them.

pandas dataframe's columns are pandas.Series and they have methods and attributes just like dataframes.

Let's explore the variable gq which stands for Group Quarters. From the IPUMS docs:

Group quarters are largely institutions and other group living arrangements, such as rooming houses and military barracks.

Let's see what .unique() values the state_data['gq'] series has...

In [ ]:
 

We can also see the .value_counts() which would give us a better idea of how useful this column might be. For example, if a column has 2 values but 99% of the observations have one value and 1% have the other - you could drop column altogether since it might not add a lot value to your analysis.

Some variables have 100% of it's rows with the same value... *cough* *cough* state_data['year']...

In [ ]:
 

From IPUMS docs:

There are three slightly different definitions of group quarters in the IPUMS. For the period 1940-1970 (excluding the 1940 100% dataset), group quarters are housing units with five or more individuals unrelated to the householder. Before 1940 and in 1980-1990, units with 10 or more individuals unrelated to the householder are considered group quarters. In the 2000 census, 2010 census, the ACS and the PRCS, no threshold was applied; for a household to be considered group quarters, it had to be on a list of group quarters that is continuously maintained by the Census Bureau. In earlier years, a similar list was used, with the unrelated-persons rule imposed as a safeguard.

Because of this and the fact that most of our observations fall into the 1970 and 1990 definition, we'll stick to those 2 for our analysis.

Let's create another mask to filter out households that don't fit our definition.

For multiple conditions we use: & and | operators (and and or, respectively)

In [ ]:
mask_household = ( CONDITION ONE ) | ( CONDITION TWO )

note: another value added from having categorical variables is that, if they are ordered, you can use the <, > operators for conditions as well.

mask_household = (state_data['gq'] <= 'additional households under 1990 definition')

note: since you are overwriting state_data you don't need to use .copy() but it doesn't hurt and if you're a beginner at pandas it's good practice for when you actually need to use .copy().

In [ ]:
state_data = state_data[mask_household].______()

At this point you're really close to a working_data dataset. You have:

  1. Kept one state's information and dropped the rest.
  2. Kept only those households you're interested in and dropped the rest

Our research question 1 is: "What share of households in X state have access to high-speed internet?"

Mathematically, $$ \frac{households\ with\ high\ speed\ internet}{households\ in\ state}$$

Your state_data dataset contains all you need to find the answer.


Step 5: Save your data

Now that you have trimmed your masterfile into a working_data dataset you should save it.

We've been working with a .dta file and it'd be best if we keep it that way.

Try the following:

state_data.to_stata(INTERIM_DATA_PATH / f'state_data-{today}.dta', write_index = False)

A few things:

  1. We're using f-strings to tag our datafile with today's date.
  2. You're turning off the write_index flag so you don't add a 'index' column to your .dta file. In this dataset, our index isn't meaningful. In other analysis you might have a meaningful index and you won't want to turn off this flag.
In [ ]:
 

Step 6: Bonus

What if we changed our research question a little bit, from
"What share of households in X state have access to high-speed internet?
to
"What share of households with school-age children in X state have access to high-speed internet?"

This would be an interesting statistic to policy-makers, especially if we find discrepancies across demographic groups (research question 2).

The challenge here is that the unit of observation in our state_data file is a (weighted) person and we want to filter out those households without any school-age children in them. This might sound a little complicated at first but it just requires modifying our previous workflow just a little.

We need to do a few things:

  1. Define what we mean by school-age children.
  2. Create a mask to grab all households where these children are.
  3. Create a list of unduplicated household identifiers ('serial')
  4. Use that list to drop unwanted observations.

Step 6.1: School-age children

Most people would agree school age (Elementary through High School) is 6 - 17 year olds. Some people are interested in K-12 (5 - 17 or 18). Some people wouldn't include 18 year olds. Whatever measure you choose you must be able to defend why you are choosing it.

For this analysis, I'll suggest we use 5 - 18 year olds (K-12) but you can choose whatever age range you want. Maybe high-school kids 14-18? That'd be interesting, you probably need access to high-speed internet at home a lot more in high school than you do in kindergarden.

In [ ]:
mask_children = (state_data['age'] >= ___) & (___________________ <= )
What data type is state_data['age'] again?
Categorical. This means that you even though its values _look_ like numbers, they're actually _value labels_ aka strings.

Now that we have our mask, we can use it to create a list of households with children in them.

Earlier we applied a mask to a dataframe and saved it to another variable. Here, we'll go a step further and grab just a column of that filtered out dataframe.

Try it yourself first.

Hint: How did we grab and explore a single column of a dataframe earlier?

In [ ]:
households_with_children = _________________________________________
In [ ]:
households_with_children.head()

How do you think we can .drop_duplicates()?

In [ ]:
 

Once you have your unduplicated list of households with children all you have to do is to check if a serial value from our state_data dataset .isin() our households_with_children series.

In [ ]:

Let's save that as our working_data dataset and save that to memory.

In [ ]:
working_data = _____________________
working_data.to_stata(INTERIM_DATA_PATH / f'working_data-{today}.dta', write_index = False)
In [ ]: