Political Alignment Case Study

Allen Downey

MIT License

Introduction

This is the first in a series of notebooks that make up a case study in exploratory data analysis.

In this notebook, we

  1. Read data from the General Social Survey (GSS),

  2. Clean the data, particularly dealing with special codes that indicate missing data,

  3. Validate the data by comparing the values in the dataset with values documented in the codebook,

  4. Generate "resampled" datasets that correct for deliberate oversampling in the dataset, and

  5. Store the resampled data in a binary format (HDF5) that makes it easier to work with in the notebooks that follow this one.

The following cell loads the packages we need. If everything works, there should be no error messages.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Reading the data

The data we'll use is from the General Social Survey (GSS). Using the GSS Data Explorer, I selected a subset of the variables in the GSS and made it available along with this notebook.

In [2]:
# Download and unpack the data file, if necessary

import os

if not os.path.exists('gss_eda.tar.gz'):
    !wget https://github.com/AllenDowney/PoliticalAlignmentCaseStudy/raw/master/gss_eda.tar.gz
    !tar -xzf gss_eda.tar.gz

In general, Pandas can read data in most standard formats, including CSV, Excel, Stata, and SPSS.
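For example, if we had a Stata data file, we could probably read it with a single function call. Here is a sketch, using a hypothetical file name:

# hypothetical example: read a Stata .dta file directly
# (the GSS extract used in this notebook is not in this format)
df = pd.read_stata('example.dta')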

Unfortunately, the current version of Pandas cannot directly read the files generated by the GSS Data Explorer, which are a Stata dictionary file (GSS.dct) and a fixed-width data file (GSS.dat).

As a workaround, I wrote a function that reads the Stata dictionary file and uses the information there to read the data file with pd.read_fwf, which reads fixed-width files.

The following function reads the dictionary file and returns a Pandas DataFrame.

In [3]:
import re

def read_stata_dict(fp):
    """Reads a Stata dictionary file.

    fp: open file pointer

    returns: DataFrame
    """
    type_map = dict(byte=int, int=int, long=int, float=float,
                    double=float, numeric=float)

    var_info = []
    for line in fp:
        match = re.search(r'_column\(([^)]*)\)', line)
        if not match:
            continue
        start = int(match.group(1))
        t = line.split()
        vtype, name, fstring = t[1:4]
        name = name.lower()
        if vtype.startswith('str'):
            vtype = str
        else:
            vtype = type_map[vtype]
        long_desc = ' '.join(t[4:]).strip('"')
        var_info.append((start, vtype, name, fstring, long_desc))

    columns = ['start', 'type', 'name', 'fstring', 'desc']
    variables = pd.DataFrame(var_info, columns=columns)

    # fill in the end column by shifting the start column
    # NOTE: this leaves end=0 for the last variable, so the
    # last column is not read correctly
    variables['end'] = variables.start.shift(-1, fill_value=0)

    return variables

Now we can read the dictionary file, GSS.dct; the result is a DataFrame with information about the variables in the dataset.

In [5]:
with open('GSS.dct') as fp:
    variables = read_stata_dict(fp)

variables.tail()
Out[5]:
start type name fstring desc end
160 3201 <class 'float'> discaffm %20f A man won't get a job or promotion 3221
161 3221 <class 'float'> discaffw %20f A woman won't get a job or promotion 3241
162 3241 <class 'float'> fehire %20f Should hire and promote women 3261
163 3261 <class 'float'> meovrwrk %20f Men hurt family when focus on work too much 3281
164 3281 <class 'float'> avoidbuy %20f Boycotted products for pol reasons 0
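
Notice that end is 0 for the last variable, avoidbuy, which is why it is not read correctly later. One possible workaround (a sketch, not applied here) is to compute the last end position from the field width in the format string:

# hypothetical fix: parse the field width from the last fstring (e.g. '%20f')
# and use it to set the end position of the last variable
last = variables.index[-1]
width = int(re.search(r'%(\d+)', variables.loc[last, 'fstring']).group(1))
variables.loc[last, 'end'] = variables.loc[last, 'start'] + width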

We are going to use pd.read_fwf to read the data file, so we have to get the variable information into the required format.

I'll extract the "column specifications", start and end.

In [6]:
colspecs = variables[['start', 'end']]
colspecs.head()
Out[6]:
start end
0 1 21
1 21 41
2 41 61
3 61 81
4 81 101
In [7]:
colspecs = variables[['start', 'end']]
colspecs.tail()
Out[7]:
start end
160 3201 3221
161 3221 3241
162 3241 3261
163 3261 3281
164 3281 0

And the names of the variables.

In [8]:
names = variables['name']
names.head()
Out[8]:
0       year
1        id_
2     agewed
3    divorce
4       sibs
Name: name, dtype: object

Now we can use pd.read_fwf to read the data file GSS.dat.

In [9]:
with open('GSS.dat') as fp:
    gss = pd.read_fwf(fp,
                      colspecs=colspecs.values.tolist(),
                      names=names)

We can use shape and head to see what the dataset looks like.

In [10]:
print(gss.shape)
gss.head()
(64814, 165)
Out[10]:
year id_ agewed divorce sibs childs age educ paeduc maeduc ... relactiv fechld fepresch fefam fejobaff discaffm discaffw fehire meovrwrk avoidbuy
0 1972 1 0 0 3 0 23 16 10 97 ... 0 0 0 0 0 0 0 0 0 NaN
1 1972 2 21 2 4 5 70 10 8 8 ... 0 0 0 0 0 0 0 0 0 NaN
2 1972 3 20 2 5 4 48 12 8 8 ... 0 0 0 0 0 0 0 0 0 NaN
3 1972 4 24 2 5 0 27 17 16 12 ... 0 0 0 0 0 0 0 0 0 NaN
4 1972 5 22 2 2 2 61 12 8 8 ... 0 0 0 0 0 0 0 0 0 NaN

5 rows × 165 columns

This dataset has 64814 rows, one for each respondent, and 165 columns, one for each variable.

To read this dataset, we had to download an "archive file" and use tar to unpack the archive.

I had to write a function to read the dictionary file; then we could process the results and use Pandas to read the data file.

Sometimes you get lucky and loading data is easier than this; sometimes you are unlucky and it's harder.

It can take time and persistence to get a dataset ready to work with. Online resources can help you figure out the details.

Validation

Now that we've got the data loaded, it is important to validate it, which means checking for errors.

The kinds of errors you have to check for depend on the nature of the data, the collection process, how the data is stored and transmitted, etc.

For this dataset, there are three kinds of validation we'll think about:

1) We need to check the integrity of the dataset; that is, whether the data were corrupted or changed during transmission, storage, or conversion from one format to another.

2) We need to check our interpretation of the data; for example, whether the numbers used to encode the data mean what we think they mean.

3) We will also keep an eye out for data, or patterns, that might indicate problems with the survey process and the recording of the data. For example, in a different dataset I worked with, I found a surprising number of respondents whose height was supposedly 62 centimeters. After investigating, I concluded that they were probably 6 feet, 2 inches, and their heights were recorded incorrectly.

Validating data can be a tedious process, but it is important. If you interpret data incorrectly and publish invalid results, you will be embarrassed in the best case, and in the worst case you might do serious harm. See this article for a recent example.

However, I don't expect you to validate every variable in this dataset. Instead, I will demonstrate the process, and then ask you to validate one additional variable as an exercise.

The first variable we'll validate is called polviews. It records responses to the following question:

We hear a lot of talk these days about liberals and conservatives. I'm going to show you a seven-point scale on which the political views that people might hold are arranged from extremely liberal--point 1--to extremely conservative--point 7. Where would you place yourself on this scale?

You can read the documentation of this variable in the GSS codebook.

The responses are encoded like this:

1   Extremely liberal
2   Liberal
3   Slightly liberal
4   Moderate
5   Slightly conservative
6   Conservative
7   Extremely conservative
8   Don't know
9   No answer
0   Not applicable
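
If we want readable labels later, we can record this encoding in a dictionary. Here is a sketch; the labels are copied from the list above and are not used in the rest of this notebook:

# map polviews response codes to labels
polviews_labels = {
    1: 'Extremely liberal',
    2: 'Liberal',
    3: 'Slightly liberal',
    4: 'Moderate',
    5: 'Slightly conservative',
    6: 'Conservative',
    7: 'Extremely conservative',
}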

The following function, values, takes a Series that represents a single variable and returns the values in the series and their frequencies.

In [11]:
def values(series):
    """Count the values and sort.
    
    series: pd.Series
    
    returns: series mapping from values to frequencies
    """
    return series.value_counts().sort_index()

Here are the values for the variable polviews.

In [12]:
column = gss['polviews']
In [13]:
values(column)
Out[13]:
0     6777
1     1682
2     6514
3     7010
4    21370
5     8690
6     8230
7     1832
8     2326
9      383
Name: polviews, dtype: int64

To check the integrity of the data and confirm that we have loaded it correctly, we'll do a "spot check"; that is, we'll pick one year and compare the values we see in the dataset to the values reported in the codebook.

We can select values from a single year like this:

In [14]:
one_year = (gss['year'] == 1974)

And look at the values and their frequencies:

In [15]:
values(column[one_year])
Out[15]:
1     22
2    201
3    207
4    564
5    221
6    160
7     35
8     70
9      4
Name: polviews, dtype: int64

If you compare these results to the values in the codebook, you should see that they agree.

Exercise: Go back and change 1974 to another year, and compare the results to the codebook.

Missing data

For many variables, missing values are encoded with numerical codes that we need to replace before we do any analysis.

For polviews, the values 8, 9, and 0 represent "Don't know", "No answer", and "Not applicable".

"Not applicable" usually means the respondent was not asked a particular question.

To keep things simple, we'll treat all of these values as equivalent, but we lose some information by doing that. For example, if a respondent refuses to answer a question, that might suggest something about their answer. If so, treating their response as missing data might bias the results.

Fortunately, for most questions the number of respondents who refused to answer is small.

I'll replace the numeric codes 8, 9, and 0 with NaN, which is a special value used to indicate missing data.

In [16]:
clean = column.replace([0, 8, 9], np.nan)

We can use notna and sum to count the valid responses:

In [17]:
clean.notna().sum()
Out[17]:
55328

And we use isna to count the missing responses:

In [18]:
clean.isna().sum()
Out[18]:
9486

We can check these results against the codebook; at the bottom of that page, it reports the number of "Valid cases" and "Missing cases".

However, in this example, the results don't match. The codebook reports 53081 valid cases and 9385 missing cases.

To figure out what was wrong, I looked at the difference between the values in the codebook and the values I computed from the dataset.

In [19]:
clean.notna().sum() - 53081
Out[19]:
2247
In [20]:
clean.isna().sum() - 9385
Out[20]:
101

That looks like about one year of data, so I guessed that the numbers in the codebook might not include the most recent data, from 2018.

Here are the numbers from 2018.

In [21]:
one_year = (gss['year'] == 2018)
one_year.sum()
Out[21]:
2348
In [22]:
clean[one_year].notna().sum()
Out[22]:
2247
In [23]:
clean[one_year].isna().sum()
Out[23]:
101

It looks like my hypothesis is correct; the summary statistics in the codebook do not include data from 2018.

Based on these checks, it looks like the dataset is intact and we have loaded it correctly.

Replacing missing data

For the other variables in this dataset, I read through the codebook and identified the special values that indicate missing data.

I recorded that information in the following function, which is intended to replace special values with NaN.

In [26]:
def gss_replace_invalid(df):
    """Replace invalid data with NaN.
    
    df: DataFrame
    """
    df.realinc.replace([0], np.nan, inplace=True)                  
    df.educ.replace([98, 99], np.nan, inplace=True)
    
    df.age.replace([98, 99], np.nan, inplace=True)       # note: 89 means "89 or older" and is a valid value
    df.cohort.replace([0, 9999], np.nan, inplace=True)
    df.marcohrt.replace([0, 9999], np.nan, inplace=True)
    
    df.cappun.replace([0, 8, 9], np.nan, inplace=True)
    df.gunlaw.replace([0, 8, 9], np.nan, inplace=True)
    df.divlaw.replace([0, 8, 9], np.nan, inplace=True)
    df.pornlaw.replace([0, 8, 9], np.nan, inplace=True)
    df.racopen.replace([0, 8, 9], np.nan, inplace=True)
    df.grass.replace([0, 8, 9], np.nan, inplace=True)
    df.letdie1.replace([0, 8, 9], np.nan, inplace=True)
    df.prayer.replace([0, 8, 9], np.nan, inplace=True)
    df.affrmact.replace([0, 8, 9], np.nan, inplace=True)

    df.abany.replace([0, 8, 9], np.nan, inplace=True)
    df.abdefect.replace([0, 8, 9], np.nan, inplace=True)
    df.abnomore.replace([0, 8, 9], np.nan, inplace=True)
    df.abhlth.replace([0, 8, 9], np.nan, inplace=True)
    df.abpoor.replace([0, 8, 9], np.nan, inplace=True)
    df.abrape.replace([0, 8, 9], np.nan, inplace=True)
    df.absingle.replace([0, 8, 9], np.nan, inplace=True)
    
    df.fepol.replace([0, 8, 9], np.nan, inplace=True)
    df.fejobaff.replace([0, 8, 9], np.nan, inplace=True)
    df.fehire.replace([0, 8, 9], np.nan, inplace=True)
    df.fechld.replace([0, 8, 9], np.nan, inplace=True)
    df.fepresch.replace([0, 8, 9], np.nan, inplace=True)
    df.fefam.replace([0, 8, 9], np.nan, inplace=True)
    df.discaffm.replace([0, 8, 9], np.nan, inplace=True)
    df.discaffw.replace([0, 8, 9], np.nan, inplace=True)
    df.meovrwrk.replace([0, 8, 9], np.nan, inplace=True)
    
    df.sexeduc.replace([0, 8, 9], np.nan, inplace=True)
    df.premarsx.replace([0, 8, 9], np.nan, inplace=True)
    df.xmarsex.replace([0, 8, 9], np.nan, inplace=True)
    df.teensex.replace([0, 8, 9], np.nan, inplace=True)
    df.homosex.replace([0, 5, 8, 9], np.nan, inplace=True)  # 5 is other

    df.racmar.replace([0, 8, 9], np.nan, inplace=True)
    df.spanking.replace([0, 8, 9], np.nan, inplace=True)
    df.racpres.replace([0, 8, 9], np.nan, inplace=True)

    df.fear.replace([0, 8, 9], np.nan, inplace=True)
    df.databank.replace([0, 8, 9], np.nan, inplace=True)
    df.happy.replace([0, 8, 9], np.nan, inplace=True)
    df.hapmar.replace([0, 8, 9], np.nan, inplace=True)
    
    df.natroad.replace([0, 8, 9], np.nan, inplace=True)
    df.natsoc.replace([0, 8, 9], np.nan, inplace=True)
    df.natmass.replace([0, 8, 9], np.nan, inplace=True)
    df.natpark.replace([0, 8, 9], np.nan, inplace=True)
    df.natchld.replace([0, 8, 9], np.nan, inplace=True)
    df.natsci.replace([0, 8, 9], np.nan, inplace=True)
    df.natenrgy.replace([0, 8, 9], np.nan, inplace=True)
    df.natspac.replace([0, 8, 9], np.nan, inplace=True)
    df.natenvir.replace([0, 8, 9], np.nan, inplace=True)
    df.natheal.replace([0, 8, 9], np.nan, inplace=True)
    df.natcity.replace([0, 8, 9], np.nan, inplace=True)
    df.natcrime.replace([0, 8, 9], np.nan, inplace=True)
    df.natcrimy.replace([0, 8, 9], np.nan, inplace=True)
    df.natdrug.replace([0, 8, 9], np.nan, inplace=True)
    df.nateduc.replace([0, 8, 9], np.nan, inplace=True)
    df.natrace.replace([0, 8, 9], np.nan, inplace=True)
    df.natarms.replace([0, 8, 9], np.nan, inplace=True)
    df.nataid.replace([0, 8, 9], np.nan, inplace=True)
    df.natfare.replace([0, 8, 9], np.nan, inplace=True)
    
    df.health.replace([0, 8, 9], np.nan, inplace=True)
    df.life.replace([0, 8, 9], np.nan, inplace=True)
    
    df.helpful.replace([0, 8, 9], np.nan, inplace=True)
    df.fair.replace([0, 8, 9], np.nan, inplace=True)
    df.trust.replace([0, 8, 9], np.nan, inplace=True)

    df.conclerg.replace([0, 8, 9], np.nan, inplace=True)
    df.coneduc.replace([0, 8, 9], np.nan, inplace=True)
    df.confed.replace([0, 8, 9], np.nan, inplace=True)
    df.conpress.replace([0, 8, 9], np.nan, inplace=True)
    df.conjudge.replace([0, 8, 9], np.nan, inplace=True)
    df.conlegis.replace([0, 8, 9], np.nan, inplace=True)
    df.conarmy.replace([0, 8, 9], np.nan, inplace=True)

    df.confinan.replace([0, 8, 9], np.nan, inplace=True)
    df.conbus.replace([0, 8, 9], np.nan, inplace=True)
    df.conlabor.replace([0, 8, 9], np.nan, inplace=True)
    df.conmedic.replace([0, 8, 9], np.nan, inplace=True)
    df.contv.replace([0, 8, 9], np.nan, inplace=True)
    df.consci.replace([0, 8, 9], np.nan, inplace=True)  

    df.spkhomo.replace([0, 8, 9], np.nan, inplace=True)
    df.spkath.replace([0, 8, 9], np.nan, inplace=True)
    df.spkmslm.replace([0, 8, 9], np.nan, inplace=True)
    df.spkrac.replace([0, 8, 9], np.nan, inplace=True)
    df.spkcom.replace([0, 8, 9], np.nan, inplace=True)
    df.spkmil.replace([0, 8, 9], np.nan, inplace=True)
    
    df.colhomo.replace([0, 8, 9], np.nan, inplace=True)
    df.colath.replace([0, 8, 9], np.nan, inplace=True)
    df.colmslm.replace([0, 8, 9], np.nan, inplace=True)
    df.colrac.replace([0, 8, 9], np.nan, inplace=True)
    df.colcom.replace([0, 8, 9], np.nan, inplace=True)
    df.colmil.replace([0, 8, 9], np.nan, inplace=True)
    
    df.libhomo.replace([0, 8, 9], np.nan, inplace=True)
    df.libath.replace([0, 8, 9], np.nan, inplace=True)
    df.libmslm.replace([0, 8, 9], np.nan, inplace=True)
    df.librac.replace([0, 8, 9], np.nan, inplace=True)
    df.libcom.replace([0, 8, 9], np.nan, inplace=True)
    df.libmil.replace([0, 8, 9], np.nan, inplace=True)
    
    df.satjob.replace([0, 8, 9], np.nan, inplace=True)
    df.satfin.replace([0, 8, 9], np.nan, inplace=True)
    df.finrela.replace([0, 8, 9], np.nan, inplace=True)

    df.union_.replace([0, 8, 9], np.nan, inplace=True)
    df.res16.replace([0, 8, 9], np.nan, inplace=True)

    df.fund.replace([0, 8, 9], np.nan, inplace=True)
    df.memchurh.replace([0, 8, 9], np.nan, inplace=True)
    df.fund16.replace([0, 8, 9], np.nan, inplace=True)
    df.reliten.replace([0, 8, 9], np.nan, inplace=True)
    df.pray.replace([0, 8, 9], np.nan, inplace=True)
    df.sprel16.replace([0, 8, 9], np.nan, inplace=True)
    
    df.bible.replace([0, 8, 9], np.nan, inplace=True)
    df.postlife.replace([0, 8, 9], np.nan, inplace=True)
    df.god.replace([0, 8, 9], np.nan, inplace=True)
    df.reborn.replace([0, 8, 9], np.nan, inplace=True)
    df.savesoul.replace([0, 8, 9], np.nan, inplace=True)

    df.relpersn.replace([0, 8, 9], np.nan, inplace=True)
    df.sprtprsn.replace([0, 8, 9], np.nan, inplace=True)
    df.relexp.replace([0, 8, 9], np.nan, inplace=True)
    df.relactiv.replace([0, 98, 99], np.nan, inplace=True)
    df.attend.replace([9], np.nan, inplace=True)

    df.compuse.replace([0, 8, 9], np.nan, inplace=True)

    df.degree.replace([8, 9], np.nan, inplace=True)
    df.padeg.replace([8, 9], np.nan, inplace=True)
    df.madeg.replace([8, 9], np.nan, inplace=True)
    df.spdeg.replace([8, 9], np.nan, inplace=True)
    df.partyid.replace([8, 9], np.nan, inplace=True)
    df.polviews.replace([0, 8, 9], np.nan, inplace=True)

    df.chldidel.replace([-1, 8, 9], np.nan, inplace=True)
    df.childs.replace([9], np.nan, inplace=True)
    df.adults.replace([9], np.nan, inplace=True)

    df.divorce.replace([0, 8, 9], np.nan, inplace=True)
    df.agewed.replace([0, 98, 99], np.nan, inplace=True)
    df.relig.replace([0, 98, 99], np.nan, inplace=True)
    df.relig16.replace([0, 98, 99], np.nan, inplace=True)
    df.age.replace([0, 98, 99], np.nan, inplace=True)
    
    # note: sibs contains some unlikely numbers
    df.sibs.replace([-1, 98, 99], np.nan, inplace=True)
    df.educ.replace([97, 98, 99], np.nan, inplace=True)
    df.maeduc.replace([97, 98, 99], np.nan, inplace=True)
    df.paeduc.replace([97, 98, 99], np.nan, inplace=True)
    df.speduc.replace([97, 98, 99], np.nan, inplace=True)

    df.phone.replace([0, 2, 9], np.nan, inplace=True)
    df.owngun.replace([0, 3, 8, 9], np.nan, inplace=True)
    df.pistol.replace([0, 3, 8, 9], np.nan, inplace=True)
    df.hunt.replace([0, 8, 9], np.nan, inplace=True)

    df.class_.replace([0, 5, 8, 9], np.nan, inplace=True)
    df.pres04.replace([0, 8, 9], np.nan, inplace=True)
    df.pres08.replace([0, 8, 9], np.nan, inplace=True)
    df.pres12.replace([0, 8, 9], np.nan, inplace=True)
In [27]:
gss_replace_invalid(gss)
In [28]:
# convert float64 columns to float32 to reduce memory use and file size
for varname in gss.columns:
    if gss[varname].dtype == np.float64:
        gss[varname] = gss[varname].astype(np.float32)

At this point, I have only moderate confidence that this code is correct. I'm not sure I have dealt with every variable in the dataset, and I'm not sure that the special values for every variable are correct.

So I will ask for your help.

Exercise: In order to validate the other variables, I'd like each person who works with this notebook to validate one variable.

If you run the following cell, it will choose one of the columns from the dataset at random. That's the variable you will check.

If you get year or id_, run the cell again to get a different variable name.

In [29]:
np.random.seed(None)
np.random.choice(gss.columns)
Out[29]:
'sibs'

Go back through the previous two sections of this notebook and replace polviews with your randomly chosen variable. Then run the cells again and go to this online survey to report the results.

Note: Not all questions were asked during all years. If your variable doesn't have data for 1974 or 2018, you might have to choose different years.
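
If you're not sure which years your variable was asked, a quick check like this shows the number of valid responses per year (a sketch, using polviews as a placeholder):

# count non-missing responses per year; replace 'polviews' with your variable
gss.groupby('year')['polviews'].count()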

Resampling

The GSS uses stratified sampling, which means that some groups are deliberately oversampled to help with statistical validity.

As a result, each respondent has a sampling weight which is proportional to the number of people in the population they represent.

Before running any analysis, we can compensate for stratified sampling by "resampling", that is, by drawing a random sample from the dataset, where each respondent's chance of appearing in the sample is proportional to their sampling weight.

In [30]:
def resample_rows_weighted(df, column):
    """Resamples a DataFrame using probabilities proportional to given column.

    df: DataFrame
    column: string column name to use as weights

    returns: DataFrame
    """
    weights = df[column]
    sample = df.sample(n=len(df), replace=True, weights=weights)
    return sample
In [31]:
def resample_by_year(df, column):
    """Resample rows within each year.

    df: DataFrame
    column: string name of weight variable

    returns DataFrame
    """
    grouped = df.groupby('year')
    samples = [resample_rows_weighted(group, column)
               for _, group in grouped]
    sample = pd.concat(samples, ignore_index=True)
    return sample
In [32]:
np.random.seed(19)
sample = resample_by_year(gss, 'wtssall')
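
As a quick sanity check (a sketch, not part of the original analysis), we can compare the distribution of a variable before and after resampling; the weights shift the proportions slightly:

# compare the distribution of polviews in the original and resampled data
print(values(gss['polviews']) / gss['polviews'].count())
print(values(sample['polviews']) / sample['polviews'].count())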

Saving the results

I'll save the results to an HDF5 file, which is a binary format that makes it much faster to read the data back.

First I'll save the original (not resampled) data.

An HDF5 file is like a dictionary on disk. It contains keys and corresponding values.

to_hdf takes three arguments:

  • The filename, gss_eda.hdf5.

  • The key, gss.

  • The compression level, which controls how hard the algorithm works to compress the file.

So this file contains a single key, gss, which maps to the DataFrame with the original GSS data.

With compression level 3, it reduces the size of the file by a factor of 10.

In [33]:
# if the file already exists, remove it

import os

if os.path.isfile('gss_eda.hdf5'):
    !rm gss_eda.hdf5
In [34]:
# save the original

gss.to_hdf('gss_eda.hdf5', 'gss', complevel=3)
In [35]:
!ls -l gss_eda.hdf5
-rw-rw-r-- 1 downey downey 6461174 Jan  9 13:22 gss_eda.hdf5

And I'll create a second file with three random resamplings of the original dataset.

In [36]:
# if the file already exists, remove it
import os

if os.path.isfile('gss_eda.3.hdf5'):
    !rm gss_eda.3.hdf5

This file contains three keys, gss0, gss1, and gss2, which map to three DataFrames.

In [37]:
# generate and store three resamplings
keys = ['gss0', 'gss1', 'gss2']

for i in range(3):
    np.random.seed(i)
    sample = resample_by_year(gss, 'wtssall')

    sample.to_hdf('gss_eda.3.hdf5', keys[i], complevel=3)
In [38]:
!ls -l gss_eda.3.hdf5
-rw-rw-r-- 1 downey downey 19281025 Jan  9 13:23 gss_eda.3.hdf5

For the other notebooks in this case study, we'll load this resampled data rather than reading and cleaning the data every time.
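
To read one of the resampled DataFrames back, we can use pd.read_hdf with the file name and a key; here is a minimal sketch of what the later notebooks do:

# read the first resampled DataFrame from the HDF5 file
gss0 = pd.read_hdf('gss_eda.3.hdf5', 'gss0')
print(gss0.shape)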

In [ ]: