This is the fourth in a series of notebooks that make up a case study in exploratory data analysis. This case study is part of the Elements of Data Science curriculum.
This notebook is a template for a do-it-yourself, choose-your-own-adventure mini-project that explores the relationship between political alignment and other attitudes and beliefs.
I will outline the steps and provide sample code. You can choose which survey question to explore, adapt my code for your data, and write a report presenting the results.
As an example, I wrote up the results from this notebook in a blog article.
In previous notebooks, we looked at changes in political alignment over time, and explored the relationship between political alignment and survey questions related to "outlook".
The analysis in this notebook follows the steps we have seen:
For your variable of interest, you will read the code book to understand the question and valid responses.
You will compute and display the distribution (PMF) of responses and the distribution within each political group.
You will recode the variable on a numerical scale that makes it possible to interpret the mean, and then plot the mean over time.
You will use a pivot table to compute the mean of your variable over time for each political alignment group (liberal, moderate, and conservative).
Finally, you will look at results from three resamplings of the data to see whether the patterns you observed might be due to random sampling.
The following cell installs the empiricaldist library if necessary.
try:
    import empiricaldist
except ImportError:
    !pip install empiricaldist
If everything we need is installed, the following cell should run without error.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from empiricaldist import Pmf
The following cells define functions from previous notebooks we will use again.
def decorate(**options):
    """Decorate the current axes.
    Call decorate with keyword arguments like
    decorate(title='Title',
             xlabel='x',
             ylabel='y')
    The keyword arguments can be any of the axis properties
    https://matplotlib.org/api/axes_api.html
    """
    ax = plt.gca()
    ax.set(**options)
    handles, labels = ax.get_legend_handles_labels()
    if handles:
        plt.legend()
from statsmodels.nonparametric.smoothers_lowess import lowess
def make_lowess(series):
    """Use LOWESS to compute a smooth line.
    series: pd.Series
    returns: pd.Series
    """
    y = series.values
    x = series.index.values
    smooth = lowess(y, x)
    index, data = np.transpose(smooth)
    return pd.Series(data, index=index)
def plot_series_lowess(series, color):
    """Plots a series of data points and a smooth line.
    series: pd.Series
    color: string or tuple
    """
    series.plot(linewidth=0, marker="o", color=color, alpha=0.5)
    smooth = make_lowess(series)
    smooth.plot(label="_", color=color)
def plot_columns_lowess(table, columns, color_map):
    """Plot the columns in a DataFrame.
    table: DataFrame with a cross tabulation
    columns: list of column names, in the desired order
    color_map: mapping from column names to colors
    """
    for col in columns:
        series = table[col]
        plot_series_lowess(series, color_map[col])
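To see what lowess returns before make_lowess wraps it in a Series, here is a minimal sketch on synthetic data. The years, the trend, and the noise level are assumptions made up for illustration; frac is the statsmodels parameter that sets the fraction of the data used for each local fit.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Synthetic data (an assumption for illustration): a noisy downward trend.
rng = np.random.default_rng(0)
x = np.arange(1974, 2022, 2, dtype=float)
y = 0.8 - 0.01 * (x - 1974) + rng.normal(0, 0.02, size=len(x))

# lowess takes y first, then x. It returns an array with two columns:
# the sorted x values and the smoothed y values.
smooth = lowess(y, x, frac=2 / 3)
xs, ys = np.transpose(smooth)
print(smooth.shape)
```

A smaller frac follows the data points more closely; a larger one gives a smoother line.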
In the first notebook, we downloaded GSS data, loaded and cleaned it, resampled it to correct for stratified sampling, and then saved the data in an HDF5 file, which is much faster to load.
The following cell downloads the file.
from os.path import basename, exists
def download(url):
    # Download the file only if it is not already in the working directory.
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print("Downloaded " + local)
download(
    "https://github.com/AllenDowney/PoliticalAlignmentCaseStudy/raw/master/gss_pacs_resampled.hdf"
)
Now I'll load the first resampled DataFrame.
datafile = "gss_pacs_resampled.hdf"
gss = pd.read_hdf(datafile, "gss0")
gss.shape
(68846, 204)
gss.head()
| | year | id | divorce | sibs | childs | age | educ | degree | sex | race | ... | ballot | wtssall | sexbirth | sexnow | eqwlth | realinc | realrinc | coninc | conrinc | commun |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1972 | 910 | 2.0 | 7.0 | 0.0 | 60.0 | 12.0 | 1.0 | 1.0 | 2.0 | ... | NaN | 0.8893 | NaN | NaN | NaN | 30458.0 | NaN | 41667.0 | NaN | NaN |
1 | 1972 | 1181 | 2.0 | 4.0 | 0.0 | 53.0 | 6.0 | 0.0 | 2.0 | 1.0 | ... | NaN | 1.7786 | NaN | NaN | NaN | 24366.0 | NaN | 33333.0 | NaN | NaN |
2 | 1972 | 1003 | 2.0 | 5.0 | 0.0 | 72.0 | 8.0 | 0.0 | 2.0 | 2.0 | ... | NaN | 0.8893 | NaN | NaN | NaN | 2707.0 | NaN | 3704.0 | NaN | NaN |
3 | 1972 | 904 | NaN | 5.0 | 0.0 | 19.0 | 11.0 | 0.0 | 1.0 | 1.0 | ... | NaN | 1.7786 | NaN | NaN | NaN | 37226.0 | NaN | 50926.0 | NaN | NaN |
4 | 1972 | 708 | 2.0 | 7.0 | 3.0 | 44.0 | 14.0 | 1.0 | 2.0 | 1.0 | ... | NaN | 0.8893 | NaN | NaN | NaN | 43994.0 | NaN | 60185.0 | NaN | NaN |
5 rows × 204 columns
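The wtssall column in this table holds the GSS sampling weights. As a rough sketch of how weighted resampling can work, the following draws rows with replacement, with probability proportional to a weight column. The helper name resample_by_weight and the toy data are hypothetical, and this is not necessarily the exact procedure used in the first notebook.

```python
import pandas as pd


def resample_by_weight(df, weight_col="wtssall", seed=None):
    """Draw len(df) rows with replacement, weighted by weight_col.

    Hypothetical helper for illustration only.
    """
    return df.sample(n=len(df), replace=True, weights=weight_col, random_state=seed)


# Toy data (made up): only the third row has nonzero weight,
# so every resampled row must be a copy of it.
toy = pd.DataFrame({"id": [1, 2, 3], "wtssall": [0.0, 0.0, 1.5]})
resampled = resample_by_weight(toy, seed=1)
print(resampled)
```

Rows with larger weights appear more often in the resampled data, which is how resampling can compensate for groups that were under-sampled.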
The General Social Survey includes questions about a variety of social attitudes and beliefs. We can use this dataset to explore changes in the responses over time and the relationship with political alignment.
In my subset of the GSS data, I selected questions that were asked repeatedly over the interval of the survey.
To follow the process demonstrated in this notebook, you should choose a variable that you think might be interesting.
If you are not sure which variable to explore, here is a random selection of three that you can choose from:
cols = list(gss.columns)
for col in ["id", "year", "ballot", "age", "sex", "race"]:
    cols.remove(col)
np.random.shuffle(cols)
for col in cols[:3]:
    print(col)
sexornt
helpful
wtssall
Fill in the name of the variable you chose below, and select a column.
The variable I'll use as an example is homosex, which contains responses to this question (see https://gssdataexplorer.norc.org/variables/634/vshow):
What about sexual relations between two adults of the same sex--do you think it is always wrong, almost always wrong, wrong only sometimes, or not wrong at all?
varname = "homosex"
column = gss[varname]
column.tail()
68841    1.0
68842    NaN
68843    NaN
68844    4.0
68845    4.0
Name: homosex, dtype: float64
Here's the distribution of responses:
column.value_counts(dropna=False).sort_index()
1.0    24228
2.0     1785
3.0     2770
4.0    11524
5.0       94
NaN    28445
Name: homosex, dtype: int64
Read the codebook entry for the variable you chose; for homosex, that is the GSS Data Explorer page linked above.
Then fill in the following cell with the responses and their labels.
responses = [1, 2, 3, 4, 5]
labels = [
    "Always wrong",
    "Almost always wrong",
    "Sometimes wrong",
    "Not at all wrong",
    "Other",
]
And here's what the distribution looks like. I use plt.xticks to label the x-axis and rotate the labels.
pmf = Pmf.from_seq(column)
pmf.bar(alpha=0.7)
decorate(xlabel="Response", ylabel="PMF", title="Distribution of responses")
plt.xticks(responses, labels, rotation=30)
None
Remember that these results are an average over the entire interval of the survey, so you should not interpret them as a current condition.
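As a rough sketch of what the PMF computation amounts to, the same normalized frequencies can be computed with plain pandas. The toy responses are made up. value_counts drops NaN by default, so the probabilities are fractions of the valid responses; I am assuming Pmf.from_seq treats missing values the same way, which is worth checking against the empiricaldist documentation.

```python
import numpy as np
import pandas as pd

# Toy responses (made up): six valid answers and two missing values.
toy = pd.Series([1, 1, 4, np.nan, 2, 1, 4, np.nan])

# Normalized frequencies of the valid responses, ordered by response code.
pmf_toy = toy.value_counts(normalize=True).sort_index()
print(pmf_toy)
```

Because missing values are excluded, the probabilities add up to 1 over the valid responses only.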
If we make a cross tabulation of year and the variable of interest, we get the distribution of responses over time.
xtab = pd.crosstab(gss["year"], column, normalize="index")
xtab.head()
homosex | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
---|---|---|---|---|---|
year | |||||
1973 | 0.724066 | 0.057400 | 0.082988 | 0.112725 | 0.022822 |
1974 | 0.709972 | 0.047051 | 0.073736 | 0.126404 | 0.042837 |
1976 | 0.679806 | 0.057400 | 0.089212 | 0.173582 | 0.000000 |
1977 | 0.729767 | 0.062414 | 0.071331 | 0.136488 | 0.000000 |
1980 | 0.725772 | 0.055994 | 0.058148 | 0.160086 | 0.000000 |
xtab.tail()
homosex | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
---|---|---|---|---|---|
year | |||||
2012 | 0.449878 | 0.029340 | 0.075795 | 0.444988 | 0.0 |
2014 | 0.388069 | 0.029520 | 0.077491 | 0.504920 | 0.0 |
2016 | 0.379444 | 0.036667 | 0.075000 | 0.508889 | 0.0 |
2018 | 0.332471 | 0.034282 | 0.057568 | 0.575679 | 0.0 |
2021 | 0.259203 | 0.040607 | 0.069829 | 0.630361 | 0.0 |
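To make the effect of normalize="index" concrete, here is a small sketch with made-up data. Each row of the result is divided by its own total, so each year's fractions add up to 1.

```python
import pandas as pd

# Toy data (made up): three responses in 2000 and two in 2002.
toy = pd.DataFrame(
    {"year": [2000, 2000, 2000, 2002, 2002], "response": [1, 1, 4, 1, 4]}
)

# normalize="index" divides each row by its row total.
xtab_toy = pd.crosstab(toy["year"], toy["response"], normalize="index")
row_sums = xtab_toy.sum(axis=1)
print(xtab_toy)
print(row_sums)
```

Without normalize, the same call returns raw counts, which are harder to compare across years because the number of respondents varies.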
Now we can plot the results.
for response, label in zip(responses, labels):
    xtab[response].plot(label=label)
decorate(xlabel="Year", ylabel="Fraction", title="Attitudes about same-sex relations")
This visualization is useful for exploring the data, but I would not present this version to an audience.
To explore the relationship between this variable and political alignment, I'll recode political alignment into three groups:
d_polviews = {
    1: "Liberal",
    2: "Liberal",
    3: "Liberal",
    4: "Moderate",
    5: "Conservative",
    6: "Conservative",
    7: "Conservative",
}
I'll use replace and store the result as a new column in the DataFrame.
gss["polviews3"] = gss["polviews"].replace(d_polviews)
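If you adapt this step for a variable with unexpected codes, one alternative worth knowing is Series.map. Unlike replace, which leaves values that are not in the dictionary unchanged, map turns them into NaN. The sketch below uses made-up data, including a stray code 9, to show the difference; the variable names are for illustration only.

```python
import numpy as np
import pandas as pd

d_polviews = {
    1: "Liberal",
    2: "Liberal",
    3: "Liberal",
    4: "Moderate",
    5: "Conservative",
    6: "Conservative",
    7: "Conservative",
}

# Toy data (made up): three valid codes, a missing value, and a stray code 9.
toy = pd.Series([1, 4, 7, np.nan, 9])

# map looks up each value in the dictionary; anything not found becomes NaN.
mapped = toy.map(d_polviews)
print(mapped)
```

With map, a code that is missing from the dictionary shows up as a missing value rather than as a leftover number mixed in with the group names.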
With this scale, there are roughly the same number of people in each group.
gss["polviews3"].value_counts(dropna=False)
Moderate        22950
Conservative    20359
Liberal         16195
NaN              9342
Name: polviews3, dtype: int64
pmf = Pmf.from_seq(gss["polviews3"])
pmf.bar(color="C1", alpha=0.7)
decorate(
    xlabel="Political alignment",
    ylabel="PMF",
    title="Distribution of political alignment",
)
Now we can use groupby to group the respondents by political alignment.
by_polviews = gss.groupby("polviews3")
by_polviews
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f3592777460>
Next I will plot the distribution of responses in each group.
But first I'll make a dictionary that maps from each group to a color.
muted = sns.color_palette("muted", 5)
sns.palplot(muted)
color_map = {"Conservative": muted[3], "Moderate": muted[4], "Liberal": muted[0]}
Now I'll make a PMF of responses for each group.
for name, group in by_polviews:
    plt.figure()
    pmf = Pmf.from_seq(group[varname])
    pmf.bar(label=name, color=color_map[name], alpha=0.7)
    decorate(xlabel="Response", ylabel="PMF", title="Distribution of responses")
But again, these results are an average over the interval of the survey, so you should not interpret them as a current condition.
For each group, we could compute the mean of the responses, but it would be hard to interpret. So we'll recode the variable of interest to make the mean more... meaningful.
For the variable I chose, a majority of respondents chose "always wrong". I'll use that as my baseline response with code 1, and lump the other responses with code 0.
We can use replace to recode the values and store the result as a new column in the DataFrame.
d_recode = {1: 1, 2: 0, 3: 0, 4: 0, 5: 0}
gss["recoded"] = column.replace(d_recode)
gss["recoded"].name = varname
And we'll use value_counts to check whether it worked.
gss["recoded"].value_counts(dropna=False)
NaN    28445
1.0    24228
0.0    16173
Name: homosex, dtype: int64
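When you adapt the recoding for your own variable, it can help to build the dictionary from the list of responses instead of typing it by hand. The helper below is a hypothetical convenience, not part of the original notebooks: it maps the responses you choose as the baseline to 1 and all others to 0.

```python
def make_recode(responses, baseline):
    """Map each response to 1 if it is in baseline, otherwise 0.

    Hypothetical helper for illustration only.
    """
    return {response: int(response in baseline) for response in responses}


# Reproduces the dictionary used in this notebook.
d_example = make_recode([1, 2, 3, 4, 5], baseline={1})
print(d_example)

# A variation: lump the first two responses together as the baseline.
d_lumped = make_recode([1, 2, 3, 4, 5], baseline={1, 2})
print(d_lumped)
```

Whichever responses you choose, the mean of the recoded column is the fraction of respondents who gave a baseline response.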
If we compute the mean, we can interpret it as "the fraction of respondents who think same-sex sexual relations are always wrong".
NOTE: Series.mean drops NaN values before computing the mean.
gss["recoded"].mean()
0.5996881265315215
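As a check on that interpretation, the same number can be recovered from the counts in the value_counts output above: the mean of a 0/1 column is the number of 1s divided by the number of valid responses. The toy Series is made up to show that missing values do not count toward the denominator.

```python
import numpy as np
import pandas as pd

# Counts copied from the value_counts output above.
n_always_wrong = 24228
n_other = 16173
fraction = n_always_wrong / (n_always_wrong + n_other)
print(fraction)

# Toy example (made up): two 1s, one 0, and two missing values.
toy = pd.Series([1, 0, np.nan, 1, np.nan])
toy_mean = toy.mean()  # 2 of 3 valid responses, not 2 of 5
print(toy_mean)
```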
Now we can compute the mean of the recoded variable in each group.
means = by_polviews["recoded"].mean()
means
polviews3
Conservative    0.720455
Liberal         0.406851
Moderate        0.601922
Name: homosex, dtype: float64
To get the values in a particular order, we can use the group names as an index:
names = ["Conservative", "Moderate", "Liberal"]
means[names]
polviews3
Conservative    0.720455
Moderate        0.601922
Liberal         0.406851
Name: homosex, dtype: float64
Now we can make a bar plot with color-coded bars:
# Look up the colors in the same order as the bars.
colors = [color_map[name] for name in names]
means[names].plot(kind="bar", color=colors, alpha=0.7, label="")
decorate(
    xlabel="Political alignment",
    ylabel="Fraction saying yes",
    title="Are same-sex sexual relations always wrong?",
)
plt.xticks(rotation=0)
None
As we might expect, a larger fraction of conservatives say same-sex sexual relations are "always wrong", compared to moderates and liberals.
We can use groupby to group responses by year.
by_year = gss.groupby("year")
From the result we can select the recoded variable and compute the mean.
time_series = by_year["recoded"].mean()
And we can plot the results with the data points themselves as circles and a local regression model as a line.
plot_series_lowess(time_series, "C1")
decorate(
    xlabel="Year",
    ylabel="Fraction saying yes",
    title="Are same-sex sexual relations always wrong?",
)
The fraction of respondents who think same-sex sexual relations are always wrong has been falling steeply since about 1990.
So far, we have grouped by polviews3 and computed the mean of the variable of interest in each group.
Then we grouped by year and computed the mean for each year.
Now we'll use pivot_table to compute the mean in each group for each year.
table = gss.pivot_table(
    values="recoded", index="year", columns="polviews3", aggfunc="mean"
)
table.head()
polviews3 | Conservative | Liberal | Moderate |
---|---|---|---|
year | |||
1974 | 0.795620 | 0.546512 | 0.767892 |
1976 | 0.775000 | 0.489848 | 0.716698 |
1977 | 0.823666 | 0.564165 | 0.769772 |
1980 | 0.834906 | 0.539130 | 0.740171 |
1982 | 0.801126 | 0.599198 | 0.811820 |
The result is a table that has years running down the rows and political alignment running across the columns.
Each entry in the table is the mean of the variable of interest for a given group in a given year.
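One way to understand pivot_table is that it combines the two groupby operations from the previous sections. The sketch below, on made-up data, computes the same table by grouping on both columns, taking the mean, and then using unstack to move the political groups into columns.

```python
import numpy as np
import pandas as pd

# Toy data (made up): two years and two political groups.
toy = pd.DataFrame(
    {
        "year": [2000, 2000, 2000, 2002, 2002, 2002],
        "polviews3": [
            "Liberal", "Liberal", "Conservative",
            "Liberal", "Conservative", "Conservative",
        ],
        "recoded": [0, 1, 1, 0, 1, 0],
    }
)

pivot = toy.pivot_table(
    values="recoded", index="year", columns="polviews3", aggfunc="mean"
)

# The same table from groupby: group on both columns, then unstack.
grouped = toy.groupby(["year", "polviews3"])["recoded"].mean().unstack()

same = np.allclose(pivot.values, grouped.values)
print(pivot)
print(same)
```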
Now we can use plot_columns_lowess to see the results.
columns = ["Conservative", "Moderate", "Liberal"]
plot_columns_lowess(table, columns, color_map)
decorate(
    xlabel="Year",
    ylabel="Fraction saying yes",
    title="Are same-sex sexual relations always wrong?",
)
Negative attitudes about homosexuality have been declining in all three groups, starting at about the same time, and at almost the same rate.
The figures we have generated so far in this notebook are based on a single resampling of the GSS data. Some of the features we see in these figures might be due to random sampling rather than actual changes in the world.
By generating the same figures with different resampled datasets, we can get a sense of how much variation there is due to random sampling.
To make that easier, the following function contains the code from the previous analysis all in one place.
You will probably have to update this function to reflect any changes you made to my code.
def plot_by_polviews(gss, varname):
    """Plot mean response by polviews and year.
    gss: DataFrame
    varname: string column name
    """
    gss["polviews3"] = gss["polviews"].replace(d_polviews)
    column = gss[varname]
    gss["recoded"] = column.replace(d_recode)
    table = gss.pivot_table(
        values="recoded", index="year", columns="polviews3", aggfunc="mean"
    )
    columns = ["Conservative", "Moderate", "Liberal"]
    plot_columns_lowess(table, columns, color_map)
    decorate(
        xlabel="Year",
        ylabel="Fraction saying yes",
        title="Are same-sex relations always wrong?",
    )
Now we can loop through the three resampled datasets and generate a figure for each one.
datafile = "gss_pacs_resampled.hdf"
for key in ["gss0", "gss1", "gss2"]:
    df = pd.read_hdf(datafile, key)
    plt.figure()
    plot_by_polviews(df, varname)
You should review your interpretation in the previous section to see how it holds up to resampling. If you see an effect that is consistent in all three figures, it is less likely to be an artifact of random sampling.
If it varies from one resampling to the next, you should probably not take it too seriously.
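If you want a number to go with the visual comparison, one option is to put the yearly means from each resampling side by side and look at the range across them. The sketch below uses made-up data and a simple unweighted bootstrap as a stand-in for the three GSS resamplings, so the function and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (made up): 200 random 0/1 responses in each of three years.
rng = np.random.default_rng(17)
toy = pd.DataFrame(
    {
        "year": np.repeat([2000, 2002, 2004], 200),
        "recoded": rng.integers(0, 2, size=600),
    }
)


def resample_means(df, seed):
    """Resample rows with replacement and return the mean for each year."""
    sample = df.sample(frac=1, replace=True, random_state=seed)
    return sample.groupby("year")["recoded"].mean()


# One column per resampling, one row per year.
runs = pd.DataFrame({f"run{i}": resample_means(toy, i) for i in range(3)})

# Range of the yearly means across resamplings.
spread = runs.max(axis=1) - runs.min(axis=1)
print(runs)
print(spread)
```

A difference between groups or years that is smaller than this spread is a candidate for being an artifact of random sampling.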
Political Alignment Case Study
Copyright 2020 Allen B. Downey
License: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)