We'll use the working-data dataset (created in the Data_Prep notebook) to answer our research questions.

Import all necessary libraries and create Path objects for your data directories. This ensures reproducibility across file systems (Windows uses \ instead of / in its file paths).
We need:

- pandas to work with the data.
- pathlib, and more specifically its Path object, to work with paths. This ensures our code works on both Windows (which uses \ in its file paths) and macOS/Linux (which use /).
- datetime - tip: there are version control systems for data, but tagging your data files with the date is not a bad first step if you're getting started.
- tree - to display a directory's tree.

# setting up working environment
import _____ as pd
from _____ import Path
from tools import _________
from ______ import _______ as dt
today = dt.______()._______("%_-%_-%_")
print(today)
27-Apr-19
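As a minimal sketch of the date-tagging tip above (toy example with a fixed date so the output is deterministic; in the notebook you would use today's date instead):

```python
import datetime as dt

# a fixed date so the example is reproducible;
# in practice you would use dt.date.today()
release = dt.date(2019, 4, 27)

# tag a data file with the date in an unambiguous, sortable format
filename = f"working_data-{release.strftime('%Y%m%d')}.dta"
```

Date-stamped filenames sort chronologically when you use a year-first format, which makes it easy to spot the latest version of a file.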
# data folder and paths
RAW_DATA_PATH = ____("../data/raw/")
XXXX_XXXXX_XXXX = ____("../data/interim/")
YYYY_YYYYY_YYYY = ____("../data/processed/")
ZZZZ_ZZZZZ_ZZZZ = ____("../data/final/")
tree(INTERIM_DATA_PATH)
data = pd.read_stata(INTERIM_DATA_PATH / f'working_data-{today}.dta')
data._______
(44816, 14)
data._______()
year | serial | hhwt | stateicp | countyfip | cinethh | cihispeed | pernum | perwt | relate | sex | age | race | hispan | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2017 | 953662 | 57 | ohio | 0 | yes, with a subscription to an internet service | yes (cable modem, fiber optic or dsl service) | 1 | 58 | head/householder | female | 48 | white | not hispanic |
1 | 2017 | 953662 | 57 | ohio | 0 | yes, with a subscription to an internet service | yes (cable modem, fiber optic or dsl service) | 2 | 62 | child | male | 20 | white | not hispanic |
2 | 2017 | 953662 | 57 | ohio | 0 | yes, with a subscription to an internet service | yes (cable modem, fiber optic or dsl service) | 3 | 78 | child | female | 9 | white | not hispanic |
3 | 2017 | 953668 | 140 | ohio | 61 | yes, with a subscription to an internet service | yes (cable modem, fiber optic or dsl service) | 1 | 140 | head/householder | male | 28 | black/african american/negro | not hispanic |
4 | 2017 | 953668 | 140 | ohio | 61 | yes, with a subscription to an internet service | yes (cable modem, fiber optic or dsl service) | 2 | 192 | sibling | female | 16 | black/african american/negro | not hispanic |
data._____()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 44816 entries, 0 to 44815
Data columns (total 14 columns):
year         44816 non-null category
serial       44816 non-null int32
hhwt         44816 non-null int16
stateicp     44816 non-null category
countyfip    44816 non-null int16
cinethh      44816 non-null category
cihispeed    44816 non-null category
pernum       44816 non-null int8
perwt        44816 non-null int16
relate       44816 non-null category
sex          44816 non-null category
age          44816 non-null category
race         44816 non-null category
hispan       44816 non-null category
dtypes: category(9), int16(3), int32(1), int8(1)
memory usage: 1.2 MB
Our unit of observation is still a (weighted) person but we're interested in household-level data.
From IPUMS docs:
HHWT indicates how many households in the U.S. population are represented by a given household in an IPUMS sample.
It is generally a good idea to use HHWT when conducting a household-level analysis of any IPUMS sample. The use of HHWT is optional when analyzing one of the "flat" or unweighted IPUMS samples. Flat IPUMS samples include the 1% samples from 1850-1930, all samples from 1960, 1970, and 1980, the 1% unweighted samples from 1990 and 2000, the 10% 2010 sample, and any of the full count 100% census datasets. HHWT must be used to obtain nationally representative statistics for household-level analyses of any sample other than those.
Users should also be sure to select one person (e.g., PERNUM = 1) to represent the entire household.
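A minimal sketch of that household-level filter on made-up data (column names mirror the IPUMS ones, but the values are invented):

```python
import pandas as pd

# two households: the first has two members, the second has one
toy = pd.DataFrame({
    "serial": [1, 1, 2],
    "pernum": [1, 2, 1],
    "hhwt":   [100, 100, 250],
})

# keep one row per household, as the IPUMS docs recommend
households = toy[toy["pernum"] == 1]

# weighted count of households represented by the sample
total_households = households["hhwt"].sum()
```

Without the pernum filter, multi-person households would be counted once per member and inflate any household-level total.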
pernum doesn't equal 1

mask_pernum = (________ _= 1)
data[mask_pernum].shape
Save your data to an appropriately named variable.
state_households = ____[_________]
From IPUMS docs:
CINETHH reports whether any member of the household accesses the Internet. Here, "access" refers to whether or not someone in the household uses or connects to the Internet, regardless of whether or not they pay for the service.
# find the value_counts for your cinethh series
From IPUMS docs:
CIHISPEED reports whether the respondent or any member of their household subscribed to the Internet using broadband (high speed) Internet service such as cable, fiber optic, or DSL service.
User Note: The ACS 2016 introduced changes to the questions regarding computer use and Internet access. See the comparability section and questionnaire text for more information. Additional information provided by the Census Bureau regarding these question alterations are available in the report: ACS Content Test Shows Need to Update Terminology
# find the value_counts for your cihispeed series
Quick tip: .value_counts() has a normalize parameter:
pd.Series.value_counts?
Signature: pd.Series.value_counts( self, normalize=False, sort=True, ascending=False, bins=None, dropna=True, ) Docstring: Return a Series containing counts of unique values. The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default. Parameters ---------- normalize : boolean, default False If True then the object returned will contain the relative frequencies of the unique values. sort : boolean, default True Sort by values. ascending : boolean, default False Sort in ascending order. bins : integer, optional Rather than count values, group them into half-open bins, a convenience for ``pd.cut``, only works with numeric data. dropna : boolean, default True Don't include counts of NaN. Returns ------- counts : Series See Also -------- Series.count: Number of non-NA elements in a Series. DataFrame.count: Number of non-NA elements in a DataFrame. Examples -------- >>> index = pd.Index([3, 1, 2, 3, 4, np.nan]) >>> index.value_counts() 3.0 2 4.0 1 2.0 1 1.0 1 dtype: int64 With `normalize` set to `True`, returns the relative frequency by dividing all values by the sum of values. >>> s = pd.Series([3, 1, 2, 3, 4, np.nan]) >>> s.value_counts(normalize=True) 3.0 0.4 4.0 0.2 2.0 0.2 1.0 0.2 dtype: float64 **bins** Bins can be useful for going from a continuous variable to a categorical variable; instead of counting unique apparitions of values, divide the index in the specified number of half-open bins. >>> s.value_counts(bins=3) (2.0, 3.0] 2 (0.996, 2.0] 2 (3.0, 4.0] 1 dtype: int64 **dropna** With `dropna` set to `False` we can also see NaN index values. >>> s.value_counts(dropna=False) 3.0 2 NaN 1 4.0 1 2.0 1 1.0 1 dtype: int64 File: /anaconda3/envs/pycon/lib/python3.7/site-packages/pandas/core/base.py Type: function
# try it on your cinethh series
# on cihispeed
This would be the end of our analysis if we weren't working with weighted data. Weighted data means each of our observations represents more than one person or household.
perwt
= "Person's weight"
hhwt
= "Household's weight"
.value_counts(normalize=True)
counts the number of observations for each of a series' values and then divides it by the total count. If each of our observations was 1 person/household, we would have the answer already.
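A toy illustration of why normalize=True alone isn't enough with weighted data (the numbers here are invented):

```python
import pandas as pd

toy = pd.DataFrame({
    "broadband": ["yes", "yes", "no"],
    "hhwt":      [10, 10, 80],
})

# unweighted: 2 of 3 rows say "yes" -> 2/3
naive_share = toy["broadband"].value_counts(normalize=True)["yes"]

# weighted: only 20 of the 100 represented households say "yes" -> 0.2
weighted_share = toy.loc[toy["broadband"] == "yes", "hhwt"].sum() / toy["hhwt"].sum()
```

The unweighted share treats every row as one household; the weighted share counts how many households each row actually represents.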
What we need to do is aggregate.
The mechanics are kind of the same:
state_households.groupby("_________")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x114b5a710>
From the docs:
A groupby operation involves some combination of splitting the
object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
We're missing the applying a function part of it.
Try the following:
state_households.groupby("countyfip").sum()
You can pass almost any function to this. Try .mean(), .max(), .min(), .std().
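Here is the whole split-apply-combine cycle on made-up data (invented county codes and weights):

```python
import pandas as pd

toy = pd.DataFrame({
    "countyfip": [0, 0, 61],
    "hhwt":      [57, 57, 140],
})

# split by county, apply sum to each group, combine into one result
sums = toy.groupby("countyfip").sum()
```

Each unique countyfip value becomes one row in the result, and the aggregating function is applied to each group's columns separately.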
You can select columns just like you would any other regular dataframe.
state_households.groupby("________")['hhwt']._____()
n_households = state_households.groupby("cihispeed")['hhwt'].sum()[2]
_state = state_households['stateicp'].unique()[0]
print(f"""
We can see now {n_households:,} households in {_state} have access to high-speed internet. But, out of how many?
To make this easier to follow, let's save our results to a variable:
""")
households_with_highspeed_access = ____________._____("_____")["____"].___()
households_with_highspeed_access
This looks like any regular pandas.Series. How do we find the total .sum() of a series' elements?
That's our denominator!
When you apply an operation to a pandas.Series, it maps over each of its elements.
Try the following:
households_with_highspeed_access * 1_000_000
households_with_highspeed_access + 1_000_000
households_with_highspeed_access / 1_000_000
Now that you know the denominator of our equation (how many households total in X state), how would you find each of the 3 values in your households_with_highspeed_access as a share of the total?
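A sketch of that share calculation on invented counts (the category labels below are placeholders, not the real cihispeed values):

```python
import pandas as pd

# invented weighted counts, one per category
counts = pd.Series(
    [20, 30, 50],
    index=["no internet", "yes, no broadband", "yes, broadband"],
)

# dividing by the scalar total maps over every element,
# turning counts into shares that sum to 1
shares = counts / counts.sum()
```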
Now that you have answered Research Question 1, we can move on to Q2:
Does this number vary across demographic groups? (in this case race/ethnicity).
pandas' .groupby() function can take a list of columns to group by.
Try the following:
state_households.groupby(['race', 'cihispeed'])[['hhwt']].sum()
Notice that I'm passing [['hhwt']] (a 1-element list) and not just ['hhwt']. Try both yourself and let's discuss the difference.
Pandas' .loc
indexer serves not only to slice dataframes but also to assign new values to certain slices of dataframes.
For example,
mask_madeup_data = (data['column_1'] == 'no answer')
data.loc[mask_madeup_data, 'new_column'] = 'this row did not answer'
The code above grabs all the rows that satisfy the condition and then looks at 'new_column'. If the column doesn't exist, pandas creates it for you and assigns the value 'this row did not answer' to all the rows that match the condition. The rest are filled with null values (NaNs).
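Running the example above on a small made-up frame shows both behaviors at once, the assignment and the NaN fill:

```python
import pandas as pd

toy = pd.DataFrame({"column_1": ["no answer", "yes", "no answer"]})

mask_madeup_data = (toy["column_1"] == "no answer")

# 'new_column' doesn't exist yet; .loc creates it on assignment,
# leaving unmatched rows as NaN
toy.loc[mask_madeup_data, "new_column"] = "this row did not answer"
```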
mask_latino =
mask_white =
mask_black =
mask_______ =
mask_______ =
mask_______ =
Assign the values to a new column 'racen'
for Race/Ethnicity
state_households.loc[mask_latino, 'racen'] = 'Latino'
state_households.loc[mask_white, 'racen'] = 'White'
state_households.loc[mask_black, 'racen'] = 'Black/African-American'
state_households.loc[mask_______, 'racen'] = '_______'
state_households.loc[mask_______, 'racen'] = '_______'
state_households.loc[mask_______, 'racen'] = '_______'
Checking your results.
Under your new logic, all race
values should fit into racen
values so there should not be any null values, right?
Pandas .isna()
returns a series of either True or False for each value of a series depending on whether or not it is Null.
And in Python, True = 1 and False = 0.
What do you think would happen if you ask for the .sum()
total of a pandas.Series
of booleans?
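A two-line toy check of that idea:

```python
import pandas as pd

s = pd.Series(["Latino", None, "White", None])

# .isna() yields booleans; since True counts as 1,
# summing them counts the missing values
missing = s.isna().sum()
```

If your racen masks cover every race value, this sum on the new column should come out to zero.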
Now that you have derived a working variable for race/ethnicity you can aggregate your data to answer RQ2. In pandas, there are many ways to do this; some of them are:

- .groupby(), like we've done so far.
- .pivot_table()
- pd.crosstab() <- this one is a top-level pandas function, not a DataFrame method. More later.

state_households.groupby(['racen', '______'])[['______']]._____()
Let's save that to an appropriately named variable since we'll be using it later.
cihispeed_by_racen = state_households.groupby(['racen', '______'])[['______']]._____()
Now, this grouped dataframe has the total number of households in each of these racen-cihispeed groups.
We need the share of cihispeed values by racen group.
In our equation,

$$ \frac{households\ with\ high\ speed\ internet}{total\ households\ in\ racen\ group} $$

We need to find the denominator.
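One way to sketch that numerator/denominator step on made-up data: a grouped .transform keeps each group's sum aligned with the MultiIndex, so the division lines up row by row (all values below are invented).

```python
import pandas as pd

toy = pd.DataFrame({
    "racen":     ["White", "White", "Black", "Black"],
    "cihispeed": ["yes", "no", "yes", "no"],
    "hhwt":      [60, 40, 30, 70],
})

# numerator: households per racen-cihispeed group
grouped = toy.groupby(["racen", "cihispeed"])["hhwt"].sum()

# denominator: total households per racen group, broadcast
# back onto the same MultiIndex
denominator = grouped.groupby(level="racen").transform("sum")

shares = grouped / denominator
```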
# find the denominator
# divide your racen-cihispeed by denominator
# save to appropriately named variable
shares_cihispeed_by_racen =
This is a multi-level index dataframe and there are a few ways to slice it. Let's try 3:

- the .loc slice
- the .xs() cross-section method
- the .reset_index() method

Classic .loc
shares_cihispeed_by_racen.loc[INDEX_SLICER, COLUMNS]
Cross-section
shares_cihispeed_by_racen.xs?
Signature: shares_cihispeed_by_racen.xs(key, axis=0, level=None, drop_level=True) Docstring: Return cross-section from the Series/DataFrame. This method takes a `key` argument to select data at a particular level of a MultiIndex. Parameters ---------- key : label or tuple of label Label contained in the index, or partially in a MultiIndex. axis : {0 or 'index', 1 or 'columns'}, default 0 Axis to retrieve cross-section on. level : object, defaults to first n levels (n=1 or len(key)) In case of a key partially contained in a MultiIndex, indicate which levels are used. Levels can be referred by label or position. drop_level : bool, default True If False, returns object with same levels as self. Returns ------- Series or DataFrame Cross-section from the original Series or DataFrame corresponding to the selected index levels. See Also -------- DataFrame.loc : Access a group of rows and columns by label(s) or a boolean array. DataFrame.iloc : Purely integer-location based indexing for selection by position. Notes ----- `xs` can not be used to set values. MultiIndex Slicers is a generic way to get/set values on any level or levels. It is a superset of `xs` functionality, see :ref:`MultiIndex Slicers <advanced.mi_slicers>`. Examples -------- >>> d = {'num_legs': [4, 4, 2, 2], ... 'num_wings': [0, 0, 2, 2], ... 'class': ['mammal', 'mammal', 'mammal', 'bird'], ... 'animal': ['cat', 'dog', 'bat', 'penguin'], ... 
'locomotion': ['walks', 'walks', 'flies', 'walks']} >>> df = pd.DataFrame(data=d) >>> df = df.set_index(['class', 'animal', 'locomotion']) >>> df num_legs num_wings class animal locomotion mammal cat walks 4 0 dog walks 4 0 bat flies 2 2 bird penguin walks 2 2 Get values at specified index >>> df.xs('mammal') num_legs num_wings animal locomotion cat walks 4 0 dog walks 4 0 bat flies 2 2 Get values at several indexes >>> df.xs(('mammal', 'dog')) num_legs num_wings locomotion walks 4 0 Get values at specified index and level >>> df.xs('cat', level=1) num_legs num_wings class locomotion mammal walks 4 0 Get values at several indexes and levels >>> df.xs(('bird', 'walks'), ... level=[0, 'locomotion']) num_legs num_wings animal penguin 2 2 Get values at specified column and axis >>> df.xs('num_wings', axis=1) class animal locomotion mammal cat walks 0 dog walks 0 bat flies 2 bird penguin walks 2 Name: num_wings, dtype: int64 File: /anaconda3/envs/pycon/lib/python3.7/site-packages/pandas/core/generic.py Type: method
shares_cihispeed_by_racen.xs(key = '________', level = _)
.reset_index()
Another way to slice a multi-level index dataframe is to make it a not-multi-level index dataframe. To do that you need to reset its index. After that, we can slice it how we've been slicing our dataframes previously.
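A small sketch of that round trip on an invented MultiIndex result:

```python
import pandas as pd

# a small MultiIndex Series like the grouped shares (invented values)
idx = pd.MultiIndex.from_tuples(
    [("White", "no"), ("White", "yes")], names=["racen", "cihispeed"]
)
grouped = pd.Series([0.4, 0.6], index=idx, name="hhwt")

# flatten the index levels back into regular columns
flat = grouped.reset_index()

# now plain boolean masking works again
mask_yes = flat["cihispeed"] == "yes"
result = flat[mask_yes]
```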
__________ = ____________._________()
__________
mask_yes_cihispeed = (_____________ == '___________')
_______[mask_yes_cihispeed]
The second method of aggregating our data is .pivot_table().
If you've worked with Excel, you might already be familiar with what a pivot table is.
From Wikipedia:
A pivot table is a table of statistics that summarizes the data of a more extensive table (such as from a database, spreadsheet, or business intelligence program). This summary might include sums, averages, or other statistics, which the pivot table groups together in a meaningful way.
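A minimal pandas pivot table on the same kind of made-up data as before (invented values):

```python
import pandas as pd

toy = pd.DataFrame({
    "racen":     ["White", "White", "Black", "Black"],
    "cihispeed": ["yes", "no", "yes", "no"],
    "hhwt":      [60, 40, 30, 70],
})

# unique racen values become rows, unique cihispeed values become
# columns, and each cell holds the summed hhwt for that combination
table = toy.pivot_table(index="racen", columns="cihispeed",
                        values="hhwt", aggfunc="sum")
```

This is the same aggregation as groupby(['racen', 'cihispeed']).sum(), just reshaped into a two-dimensional table.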
state_households.pivot_table?
Signature: state_households.pivot_table( values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', ) Docstring: Create a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame. Parameters ---------- values : column to aggregate, optional index : column, Grouper, array, or list of the previous If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). Keys to group by on the pivot table index. If an array is passed, it is being used as the same manner as column values. columns : column, Grouper, array, or list of the previous If an array is passed, it must be the same length as the data. The list can contain any of the other types (except list). Keys to group by on the pivot table column. If an array is passed, it is being used as the same manner as column values. aggfunc : function, list of functions, dict, default numpy.mean If list of functions passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves) If dict is passed, the key is column to aggregate and value is function or list of functions fill_value : scalar, default None Value to replace missing values with margins : boolean, default False Add all row / columns (e.g. for subtotal / grand totals) dropna : boolean, default True Do not include columns whose entries are all NaN margins_name : string, default 'All' Name of the row / column that will contain the totals when margins is True. Returns ------- table : DataFrame See Also -------- DataFrame.pivot : Pivot without aggregation that can handle non-numeric data. Examples -------- >>> df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo", ... "bar", "bar", "bar", "bar"], ... 
"B": ["one", "one", "one", "two", "two", ... "one", "one", "two", "two"], ... "C": ["small", "large", "large", "small", ... "small", "large", "small", "small", ... "large"], ... "D": [1, 2, 2, 3, 3, 4, 5, 6, 7], ... "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]}) >>> df A B C D E 0 foo one small 1 2 1 foo one large 2 4 2 foo one large 2 5 3 foo two small 3 5 4 foo two small 3 6 5 bar one large 4 6 6 bar one small 5 8 7 bar two small 6 9 8 bar two large 7 9 This first example aggregates values by taking the sum. >>> table = pivot_table(df, values='D', index=['A', 'B'], ... columns=['C'], aggfunc=np.sum) >>> table C large small A B bar one 4 5 two 7 6 foo one 4 1 two NaN 6 We can also fill missing values using the `fill_value` parameter. >>> table = pivot_table(df, values='D', index=['A', 'B'], ... columns=['C'], aggfunc=np.sum, fill_value=0) >>> table C large small A B bar one 4 5 two 7 6 foo one 4 1 two 0 6 The next example aggregates by taking the mean across multiple columns. >>> table = pivot_table(df, values=['D', 'E'], index=['A', 'C'], ... aggfunc={'D': np.mean, ... 'E': np.mean}) >>> table D E mean mean A C bar large 5.500000 7.500000 small 5.500000 8.500000 foo large 2.000000 4.500000 small 2.333333 4.333333 We can also calculate multiple types of aggregations for any given value column. >>> table = pivot_table(df, values=['D', 'E'], index=['A', 'C'], ... aggfunc={'D': np.mean, ... 'E': [min, max, np.mean]}) >>> table D E mean max mean min A C bar large 5.500000 9 7.500000 6 small 5.500000 9 8.500000 8 foo large 2.000000 5 4.500000 4 small 2.333333 6 4.333333 2 File: /anaconda3/envs/pycon/lib/python3.7/site-packages/pandas/core/frame.py Type: method
What we need are four things:

- index?
- columns?
- values?
- aggfunc?

Pandas is going to grab each unique value in the variables you choose and use those as rows in your .index or separate columns in your .columns. The values variable should be quantitative in this case (but it doesn't have to be, necessarily). .pivot_table will by default find the mean of your values variable for each cell in your new table; in this case we don't care about the mean, we want to sum up the total number of households.
Try the following:
state_households.pivot_table(
index = '______',
columns = '______',
values = 'hhwt',
aggfunc = '___',
margins = True,
)
Save it to an appropriately named variable.
households_pivot_table = state_households.pivot_table(
index = '_____',
columns = '______',
______ = '______',
______ = '____',
_______ = True,
)
What do you think the next step should be?