Political Alignment Case Study

Allen Downey

MIT License


This is the second in a series of notebooks that make up a case study in exploratory data analysis.

In this notebook, we:

  1. Look at responses to the variable polviews, which represents political alignment on a 7-point scale from liberal to conservative.

  2. Compare the distribution of responses in 1974 and 1990.

  3. Plot the mean and standard deviation of responses over time as a way of quantifying changes in political alignment and polarization.

  4. Use local regression to plot a smooth line through noisy data.

  5. Use cross tabulation to compute the fraction of respondents in each category over time.

  6. Plot the results using a custom color palette.

As an exercise, you will look at changes in political party affiliation over the same period.


If you are running this notebook in Colab, the following cell downloads the empiricaldist library.

If you are running in another environment, you will need to install it yourself.

In [1]:
# If we're running in Colab, set up the environment

import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    !pip install empiricaldist

If everything we need is installed, the following cell should run without error.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from empiricaldist import Pmf

The following cell defines a function I use to decorate the axes in plots.

In [3]:
def decorate(**options):
    """Decorate the current axes.

    Call decorate with keyword arguments like
    decorate(title='Title', xlabel='x', ylabel='y')

    The keyword arguments can be any of the axis properties.
    """
    ax = plt.gca()
    ax.set(**options)

    handles, labels = ax.get_legend_handles_labels()
    if handles:
        ax.legend(handles, labels)

    plt.tight_layout()

Loading the data

In the previous notebook, we downloaded GSS data, loaded and cleaned it, resampled it to correct for stratified sampling, and then saved the data in an HDF5 file, which is much faster to load. In this and the following notebooks, we'll download the HDF5 file and load it.

The following cell downloads the file.

In [4]:
# Load the data file

import os

datafile = 'gss_eda.3.hdf5'
if not os.path.exists(datafile):
    !wget https://github.com/AllenDowney/PoliticalAlignmentCaseStudy/raw/master/gss_eda.3.hdf5

This file contains three DataFrames containing resamples of the GSS data. We'll work with the first resampling, gss0, to get started; at the end of this notebook, we'll see the other two as well.

In [5]:
gss = pd.read_hdf(datafile, 'gss0')

Political alignment

The people surveyed as part of the GSS were asked about their "political alignment", which is where they place themselves on a spectrum from liberal to conservative.

The variable polviews contains responses to the following question:

We hear a lot of talk these days about liberals and conservatives. I'm going to show you a seven-point scale on which the political views that people might hold are arranged from extremely liberal--point 1--to extremely conservative--point 7. Where would you place yourself on this scale?

Here are the valid responses:

1   Extremely liberal
2   Liberal
3   Slightly liberal
4   Moderate
5   Slightly conservative
6   Conservative
7   Extremely conservative

To see how the responses have changed over time, we'll inspect them at the beginning and end of the observation period.

First I'll select the column.

In [6]:
polviews = gss['polviews']

Then compute a Boolean Series that's True for responses from 1974.

In [7]:
year74 = (gss['year'] == 1974)

Now we can select the responses from 1974.

In [8]:
polviews74 = polviews[year74]
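Boolean selection like this can be sketched on toy data (hypothetical values, not the GSS):

```python
import pandas as pd

# Hypothetical toy data, just to show the mechanics
df = pd.DataFrame({'year': [1974, 1974, 2018],
                   'polviews': [4, 2, 6]})

mask = (df['year'] == 1974)       # Boolean Series: True, True, False
selected = df['polviews'][mask]   # keeps only the rows where mask is True
print(selected.tolist())
```

The bracket operator with a Boolean Series keeps only the rows where the mask is True.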

As in the previous notebook, we'll use the values function to display the values in the series and their frequencies.

In [9]:
def values(series):
    """Count the values and sort.

    series: pd.Series

    returns: series mapping from values to frequencies
    """
    return series.value_counts().sort_index()

Here are the responses from 1974.

In [10]:
values(polviews74)
And here are the responses from 2018.

In [11]:
year18 = (gss['year'] == 2018)
polviews18 = polviews[year18]
values(polviews18)


To visualize these distributions, we'll use the Probability Mass Function (PMF), which is similar to a histogram. The difference is that the PMF is "normalized", which means that it shows the percentage of people who gave each response, rather than the number.
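The normalization step can be sketched by hand: divide the counts by their total so the fractions sum to 1 (toy numbers, not the GSS):

```python
import pandas as pd

responses = pd.Series([4, 4, 2, 6, 4])          # hypothetical 7-point responses
counts = responses.value_counts().sort_index()  # histogram: 2->1, 4->3, 6->1
pmf = counts / counts.sum()                     # normalize so fractions sum to 1
print(pmf.to_dict())
```

Pmf.from_seq does this counting and normalizing in one step.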

I use the Pmf class from empiricaldist to compute them.

In [12]:
from empiricaldist import Pmf

Here's the distribution from 1974:

In [13]:
pmf74 = Pmf.from_seq(polviews74)
pmf74.bar(label='1974', color='C0', alpha=0.7)

decorate(xlabel='Political view on a 7-point scale',
         ylabel='Fraction of population',
         title='Distribution of political views')


Here's the distribution from 2018:

In [14]:
pmf18 = Pmf.from_seq(polviews18)
pmf18.bar(label='2018', color='C1', alpha=0.7)

decorate(xlabel='Political view on a 7-point scale',
         ylabel='Fraction of population',
         title='Distribution of political views')


In both cases, the most common response is 4, which is the code for "moderate". And few respondents describe themselves as "extremely" liberal or conservative.

So maybe we're not so polarized after all.

To make it easier to compare the distributions, I'll plot them side by side.

In [15]:
pmf74.bar(label='1974', width=-0.45, align='edge', alpha=0.7)

pmf18.bar(label='2018', width=0.45, align='edge', alpha=0.7)

decorate(xlabel='Political view on a 7-point scale',
         ylabel='Fraction of population',
         title='Distribution of political views')


Now we can see the changes in the distribution more clearly. It looks like the fraction of people at the extremes (1 and 7) has increased, and the fraction of liberal (2) and slightly liberal (3) responses has decreased.

Exercise: To summarize these changes, we can compare the mean and standard deviation of polviews in 1974 and 2018.

The mean of the responses measures the balance of people in the population with liberal or conservative leanings. If the mean increases over time, that might indicate a shift in the population toward conservatism.

The standard deviation measures the dispersion of views in the population; if it increases over time, that might indicate an increase in polarization.

Compute the mean and standard deviation of polviews74 and polviews18.

What do they indicate about changes over this interval?

In [16]:
# Solution goes here
In [17]:
# Solution goes here
In [18]:
# Solution goes here

Time series

At this point we have looked at the endpoints, 1974 and 2018, but we don't know what happened in between.

To see how the distribution changes over time, we can group by year and compute the mean of polviews during each year.

First I'll use groupby to group the respondents by year.

In [19]:
gss_by_year = gss.groupby('year')

The result is a DataFrameGroupBy value that represents a collection of groups. We can loop through the groups and display the number of respondents in each:

In [20]:
for year, group in gss_by_year:
    print(year, len(group))

In many ways the DataFrameGroupBy behaves like a DataFrame. We can use the bracket operator to select a column:

In [21]:
polviews_by_year = gss_by_year['polviews']

A column from a DataFrameGroupBy is a SeriesGroupBy. If we invoke mean on it, the result is a Series that contains the mean of polviews for each year of the survey.
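On a toy DataFrame, the groupby-then-mean pattern looks like this (hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({'year': [1974, 1974, 2018, 2018],
                   'polviews': [2, 4, 5, 7]})

by_year = df.groupby('year')   # DataFrameGroupBy
series = by_year['polviews']   # SeriesGroupBy for one column
means = series.mean()          # Series indexed by year
print(means.to_dict())
```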

In [22]:
mean_series = polviews_by_year.mean()

And here's what it looks like.

In [23]:
mean_series.plot(color='C2', label='polviews')

decorate(xlabel='Year',
         ylabel='Mean (7 point scale)',
         title='Mean of polviews')

Exercise: The standard deviation quantifies the spread of the distribution, which is one way to measure polarization.

Plot standard deviation of polviews for each year of the survey from 1972 to 2018.

Does it show evidence of increasing polarization?

In [24]:
# Solution goes here

Local regression

In the previous section we plotted mean and standard deviation of polviews over time. Both plots are quite noisy.

We can use local regression to compute a smooth line through these data points.

The following function takes a Pandas Series and uses an algorithm called LOWESS to compute a smooth line. LOWESS stands for "locally weighted scatterplot smoothing".

In [25]:
from statsmodels.nonparametric.smoothers_lowess import lowess

def make_lowess(series):
    """Use LOWESS to compute a smooth line.

    series: pd.Series

    returns: pd.Series
    """
    y = series.values
    x = series.index.values

    smooth = lowess(y, x)
    index, data = np.transpose(smooth)

    return pd.Series(data, index=index)
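To build intuition about what LOWESS computes, here is a bare-bones sketch of local weighting in plain NumPy. It computes a locally weighted average with a tricube kernel; the statsmodels implementation additionally fits a local regression line at each point, so treat this as an illustration of the idea, not a substitute.

```python
import numpy as np

def simple_local_smooth(x, y, frac=0.5):
    """Locally weighted average: for each x[i], average the y values,
    weighting nearby points more heavily (tricube kernel)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    span = frac * (x.max() - x.min())
    smooth = np.empty_like(y)
    for i, xi in enumerate(x):
        d = np.abs(x - xi) / span             # scaled distances from x[i]
        w = np.clip(1 - d**3, 0, None) ** 3   # tricube weights, 0 beyond span
        smooth[i] = np.sum(w * y) / np.sum(w)
    return smooth
```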

We'll use the following function to plot data points and the smoothed line.

In [26]:
def plot_series_lowess(series, color):
    """Plots a series of data points and a smooth line.

    series: pd.Series
    color: string or tuple
    """
    series.plot(linewidth=0, marker='o', color=color, alpha=0.5)
    smooth = make_lowess(series)
    smooth.plot(label='_', color=color)

The following figure shows the mean of polviews and a smooth line.

In [27]:
mean_series = gss_by_year['polviews'].mean()
plot_series_lowess(mean_series, 'C2')
decorate(ylabel='Mean (7 point scale)',
         title='Mean of polviews',
         xlim=[1972, 2020])

One reason the PMFs for 1974 and 2018 did not look very different is that the mean seems to have gone up (more conservative) and then down again (more liberal).

Generally, it looks like the U.S. has been trending liberal for the last 20 years or more, at least in the sense of how people describe themselves.

Exercise: Use plot_series_lowess to plot the standard deviation of polviews with a smooth line.

In [28]:
# Solution goes here

Cross tabulation

In the previous sections, we treated polviews as a numerical quantity, so we were able to compute means and standard deviations.

But the responses are really categorical, which means that each value represents a discrete category, like "liberal" or "conservative".

In this section, we'll treat polviews as a categorical variable. Specifically, we'll compute the number of respondents in each category for each year, and plot changes over time.

Pandas provides a function called crosstab that computes a cross tabulation.

It takes two Series as arguments and returns a DataFrame.

In [29]:
year = gss['year']
column = gss['polviews']

xtab = pd.crosstab(year, column)

Here are the first few lines from the result.

In [30]:
xtab.head()
It contains one row for each value of year and one column for each value of polviews. Reading the first row, we see that in 1974, 31 people gave response 1, "extremely liberal", 201 people gave response 2, "liberal", and so on.

The number of respondents varies from year to year, so we need to "normalize" the results, which means computing for each year the fraction of respondents in each category, rather than the count.

crosstab takes an optional argument that normalizes each row.

In [31]:
xtab_norm = pd.crosstab(year, column, normalize='index')
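As a toy check of how crosstab and normalize='index' behave (hypothetical data, separate from the GSS):

```python
import pandas as pd

year = pd.Series([1974, 1974, 1974, 2018], name='year')
view = pd.Series([1, 1, 2, 2], name='polviews')

xtab_toy = pd.crosstab(year, view)                     # raw counts
norm_toy = pd.crosstab(year, view, normalize='index')  # each row sums to 1
print(norm_toy.loc[1974].tolist())
```

With normalize='index', each row is divided by its total, so the entries in a row are fractions that sum to 1.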

Here's what that looks like for the 7-point scale.

In [32]:
xtab_norm.head()

To make the results easier to interpret, I'm going to replace the numeric codes 1-7 with strings. First I'll make a dictionary that maps from numbers to strings:

In [33]:
# recode the 7 point scale with words
d7 = {1: 'Extremely liberal', 
      2: 'Liberal', 
      3: 'Slightly liberal', 
      4: 'Moderate', 
      5: 'Slightly conservative', 
      6: 'Conservative', 
      7: 'Extremely conservative'}

Then we can use the replace function like this:

In [34]:
polviews7 = gss['polviews'].replace(d7)
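On a toy Series, replace maps codes to labels like this:

```python
import pandas as pd

codes = pd.Series([1, 4, 7])
names = {1: 'Extremely liberal', 4: 'Moderate', 7: 'Extremely conservative'}
labeled = codes.replace(names)
print(labeled.tolist())
```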

We can use values to confirm that the values in polviews7 are strings.

In [35]:
values(polviews7)

If we make the cross tabulation again, we can see that the column names are strings.

In [36]:
xtab_norm = pd.crosstab(year, polviews7, normalize='index')

We are almost ready to plot the results, but first we need some colors.

Color palettes

Seaborn provides a variety of color palettes, which you can read about here.

To represent political views, I'll use a diverging palette from blue to red.

In [37]:
palette = sns.color_palette('RdBu_r', 7)

The middle color is white, which won't work when we plot it, so I will replace it with a purple color from another palette.

In [38]:
muted = sns.color_palette('muted', 7)
purple = muted[4]

Here's the modified diverging palette with purple in the middle.

In [39]:
palette[3] = purple

Now I'll make a "color map", which is a dictionary that maps from the responses to the corresponding colors.

In [40]:
columns = ['Extremely liberal', 
           'Liberal', 
           'Slightly liberal', 
           'Moderate', 
           'Slightly conservative', 
           'Conservative', 
           'Extremely conservative']
In [41]:
color_map = dict(zip(columns, palette))

for key, value in color_map.items():
    print(key, value)


To plot the results, I use the following function, which takes a DataFrame and plots each column using plot_series_lowess.

In [42]:
def plot_columns_lowess(table, columns, colors):
    """Plot the columns in a DataFrame.

    table: DataFrame with a cross tabulation
    columns: list of column names, in the desired order
    colors: mapping from column names to colors
    """
    for col in columns:
        series = table[col]
        plot_series_lowess(series, colors[col])

The following function sets the position of the figure legend.

In [43]:
def anchor_legend(x, y):
    """Place the upper left corner of the legend box.

    x: x coordinate
    y: y coordinate
    """
    plt.legend(bbox_to_anchor=(x, y), loc='upper left', ncol=1)

Here are the 7 categories plotted as a function of time.

In [44]:
plot_columns_lowess(xtab_norm, columns, color_map)

decorate(xlabel='Year',
         title='Fraction of people with each political view',
         xlim=[1972, 2020])

anchor_legend(1.02, 1.02)

This way of looking at the results suggests that changes in political alignment during this period have generally been slow and small.

The fraction of self-described moderates has not changed substantially.

The fraction of conservatives increased, but seems to be decreasing now; the fraction of liberals seems to be increasing.

The fraction of people at the extremes has increased, but it is hard to see clearly in this figure.

We can get a better view by plotting just the extremes.

In [45]:
columns2 = ['Extremely liberal', 'Extremely conservative']

plot_columns_lowess(xtab_norm, columns2, color_map)

decorate(xlabel='Year',
         title='Fraction of people with extreme political views',
         xlim=[1970, 2020])

anchor_legend(1.02, 1.02)

This figure shows that the fraction of people who describe themselves as "extreme" has increased from about 2.5% to about 4%.

In relative terms, that's a big increase. But in absolute terms these tails of the distribution are still small.
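The distinction between relative and absolute change can be made explicit with the numbers from the figure:

```python
before, after = 0.025, 0.04            # roughly 2.5% and 4%, read off the figure
absolute = after - before              # about 1.5 percentage points
relative = (after - before) / before   # about a 60% relative increase
print(absolute, relative)
```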

Exercise: Let's do a similar analysis with partyid, which encodes responses to the question:

Generally speaking, do you usually think of yourself as a Republican, Democrat, Independent, or what?

The valid responses are:

0   Strong democrat
1   Not str democrat
2   Ind,near dem
3   Independent
4   Ind,near rep
5   Not str republican
6   Strong republican
7   Other party

You can read the codebook for partyid here.

Here are the steps I suggest:

1) If you have not already saved this notebook, you should do that first. If you are running on Colab, select "Save a copy in Drive" from the File menu.

2) Now, before you modify this notebook, make another copy and give it an appropriate name.

3) Search and replace polviews with partyid (use "Edit->Find and replace").

4) Run the notebook from the beginning and see what other changes you have to make.

You will have to make changes in d7 and columns. Otherwise you might get a message like

TypeError: '<' not supported between instances of 'float' and 'str'

Also, you might have to drop "Other party" or change the color palette.

And you should change the titles of the figures.

What changes in party affiliation do you see over the last 50 years? Are things going in the directions you expected?

Write a headline (or a couple) that describe the most substantial changes you see.

In [ ]: