#!/usr/bin/env python
# coding: utf-8

# # Political Alignment and Outlook

# This is the third in a series of notebooks that make up a [case study in exploratory data analysis](https://allendowney.github.io/PoliticalAlignmentCaseStudy/).
# This case study is part of the [*Elements of Data Science*](https://allendowney.github.io/ElementsOfDataScience/) curriculum.

# In the previous chapter, we used data from the General Social Survey (GSS) to plot changes in political alignment over time.
# In this notebook, we'll explore the relationship between political alignment and respondents' beliefs about themselves and other people.
# We'll use the following variables from the GSS dataset:
# 
# * `happy`: Taken all together, how would you say things are these days--would you say that you are very happy, pretty happy, or not too happy?
# 
# * `trust`: Generally speaking, would you say that most people can be trusted or that you can't be too careful in dealing with people?
# 
# * `helpful`: Would you say that most of the time people try to be helpful, or that they are mostly just looking out for themselves?
# 
# * `fair`: Do you think most people would try to take advantage of you if they got a chance, or would they try to be fair?
# 
# We'll start with the last question; then as an exercise you can look at one of the others.
# Here's the plan:
# 
# 1. First we'll use `groupby` to compare the average response between groups and plot the average as a function of time.
# 
# 2. We'll use the Pandas function `pivot table` to compute the average response within each group as a function of time.
# 
# 3. Finally, we'll use resampling to see whether the features we see in the results might be due to randomness, or whether they are likely to reflect actual changes in the works.

# If everything we need is installed, the following cell should run without error.

# In[1]:


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


# The following cells define functions from previous notebooks we will use again.

# In[2]:


def values(series):
    """Count the values and sort.

    series: pd.Series

    returns: series mapping from values to frequencies
    """
    return series.value_counts().sort_index()


# In[3]:


def decorate(**options):
    """Decorate the current axes.

    Call decorate with keyword arguments like
    decorate(title='Title',
             xlabel='x',
             ylabel='y')

    The keyword arguments can be any of the axis properties
    https://matplotlib.org/api/axes_api.html
    """
    ax = plt.gca()
    ax.set(**options)
    handles, labels = ax.get_legend_handles_labels()
    if handles:
        ax.legend(handles, labels)


# In[4]:


from statsmodels.nonparametric.smoothers_lowess import lowess


def make_lowess(series):
    """Use LOWESS to compute a smooth line.

    series: pd.Series

    returns: pd.Series
    """
    y = series.values
    x = series.index.values

    smooth = lowess(y, x)
    index, data = np.transpose(smooth)

    return pd.Series(data, index=index)


# In[5]:


def plot_series_lowess(series, color):
    """Plots a series of data points and a smooth line.

    series: pd.Series
    color: string or tuple
    """
    series.plot(linewidth=0, marker="o", color=color, alpha=0.5)
    smooth = make_lowess(series)
    smooth.plot(label="_", color=color)


# In[6]:


def plot_columns_lowess(table, columns, colors):
    """Plot the columns in a DataFrame.

    table: DataFrame with a cross tabulation
    columns: list of column names, in the desired order
    colors: mapping from column names to colors
    """
    for col in columns:
        series = table[col]
        plot_series_lowess(series, colors[col])


# ## Loading the data
# 
# In the first notebook, we downloaded GSS data, loaded and cleaned it, resampled it to correct for stratified sampling, and then saved the data in an HDF5 file, which is much faster to load.  Now we'll download the HDF5 file and load it.
# 
# The following cell downloads the file if necessary.

# In[7]:


from os.path import basename, exists


def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + local)


download(
    "https://github.com/AllenDowney/PoliticalAlignmentCaseStudy/raw/master/gss_pacs_resampled.hdf"
)


# I've created an HDF file that contains three `DataFrame` objects with resampled GSS data.  We'll work with the first resampling, `gss0`, to get started; at the end of this chapter, we'll see the other two as well.

# In[8]:


datafile = "gss_pacs_resampled.hdf"
gss = pd.read_hdf(datafile, "gss0")
gss.shape


# ## Are People Fair?
# 
# In the GSS data, the variable `fair` contains responses to this question (see <https://gssdataexplorer.norc.org/variables/440/vshow>):
# 
# > Do you think most people would try to take advantage of you if they got a chance, or would they try to be fair?
# 
# The possible responses are:
# 
# ```
# 1	Take advantage
# 2	Fair
# 3	Depends
# ```
# 
# As always, we start by looking at the distribution of responses, that is, how many people give each response:

# In[9]:


values(gss["fair"])


# The plurality think people try to be fair (2), but a substantial minority think people would take advantage (1).
# There are also a number of NaNs, mostly respondents who were not asked this question.

# In[10]:


gss["fair"].isna().sum()


# To count the number of people who chose option 2, "people try to be fair", we'll use a dictionary to recode option 2 as `1` and the other options as `0`.

# In[11]:


recode_fair = {1: 0, 2: 1, 3: 0}


# As an alternative, we could include option 3, "depends", by mapping it to 1, or give it less weight by mapping it to an intermediate value like 0.5.
# 
# We can use `replace` to recode the values and store the result as a new column in the DataFrame.

# In[12]:


gss["fair2"] = gss["fair"].replace(recode_fair)


# And we'll use `values` to check whether it worked.

# In[13]:


values(gss["fair2"])


# ## Fairness Over Time
# 
# As we saw in the previous notebook, we can use `groupby` to group responses by year.

# In[14]:


gss_by_year = gss.groupby("year")


# From the result we can select `fair2` and compute the mean.

# In[15]:


fair_by_year = gss_by_year["fair2"].mean()


# Here's the result, which shows the fraction of people who say people try to be fair, plotted over time.  As in the previous notebook, we plot the data points themselves with circles and a local regression model as a line.

# In[16]:


plot_series_lowess(fair_by_year, "C1")

decorate(
    xlabel="Year",
    ylabel="Fraction saying yes",
    title="Would most people try to be fair?",
)


# Sadly, it looks like faith in humanity has declined, at least by this measure.
# 
# Let's see what this trend looks like if we group the respondents by political alignment.

# ## Political Views on a 3-point Scale
# 
# In the previous notebook, we looked at responses to `polviews`, which asks about political alignment.  The valid responses are:
# 
# ```
# 1	Extremely liberal
# 2	Liberal
# 3	Slightly liberal
# 4	Moderate
# 5	Slightly conservative
# 6	Conservative
# 7	Extremely conservative
# ```
# 
# To make it easier to visualize groups, we'll lump the 7-point scale into a 3-point scale.

# In[17]:


recode_polviews = {
    1: "Liberal",
    2: "Liberal",
    3: "Liberal",
    4: "Moderate",
    5: "Conservative",
    6: "Conservative",
    7: "Conservative",
}


# We'll use `replace`, as we've seen before, and store the result as a new column in the DataFrame.

# In[18]:


gss["polviews3"] = gss["polviews"].replace(recode_polviews)


# With this scale, there are roughly the same number of people in each group.

# In[19]:


values(gss["polviews3"])


# ## Fairness by Group
# 
# Now let's see who thinks people are more fair, conservatives or liberals.
# We'll group the respondents by `polviews3`.

# In[20]:


by_polviews = gss.groupby("polviews3")


# And compute the mean of `fair2` in each group.

# In[21]:


by_polviews["fair2"].mean()


# It looks like conservatives are a little more optimistic, in this sense, than liberals and moderates.
# But this result is averaged over the last 50 years.  Let's see how things have changed over time.

# ## Fairness over Time by Group
# 
# So far, we have grouped by `polviews3` and computed the mean of `fair2` in each group.
# Then we grouped by `year` and computed the mean of `fair2` for each year.
# Now I want to group by `polviews3` and `year`, and compute the mean of `fair2` in each group over time.
# 
# We could do that computation "by hand" using the tools we already have, but it is so common and useful that it has a name: it is called a **pivot table**, and Pandas provides a function that computes it (see <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html>).
# 
# The function is called `pivot_table`, and it takes the following arguments:
# 
# * `values`, which is the name of the variable we want to summarize: `fair2` in this example.
# 
# * `index`, which is the name of the variable that will provide the row labels: `year` in this example.
# 
# * `columns`, which is the name of the variable that will provide the column labels: `polview3` in this example.
# 
# * `aggfunc`, which is the function used to "aggregate", or summarize, the values: `mean` in this example.

# In[22]:


table = gss.pivot_table(
    values="fair2", index="year", columns="polviews3", aggfunc="mean"
)


# The result is a table that has years running down the rows and political alignment running across the columns.
# Each entry in the table is the mean of `fair2` for a given group in a given year.

# In[23]:


table.head()


# Reading across the first row, we can see that in 1975, moderates were slightly more optimistic than the other groups.
# 
# Reading down the first column, we can see that the estimated mean of `fair2` among conservatives varies from year to year.  It is hard to tell looking at these numbers whether it is trending up or down; we can get a better view by plotting the results.

# ## Plotting the Results

# Before we plot the results, I'll make a dictionary that maps from each group to a color.

# In[24]:


muted = sns.color_palette("muted", 5)
sns.palplot(muted)


# In[25]:


colors = {"Conservative": muted[3], "Moderate": muted[4], "Liberal": muted[0]}


# We can use `plot_columns_lowess` to see the results.

# In[26]:


columns = ["Conservative", "Liberal", "Moderate"]
plot_columns_lowess(table, columns, colors)

decorate(
    xlabel="Year",
    ylabel="Fraction saying yes",
    title="Would most people try to be fair?",
)


# The fraction of respondents who think people try to be fair has dropped in all three groups, although liberals and moderates might have leveled off.
# In 1975, liberals were the least optimistic group.  In 2018, they might be the most optimistic.
# But the responses are quite noisy, so we should not be too confident about these conclusions.
# 
# We can get a sense of how reliable they are by running the resampling process a few times and checking how much the results vary.

# ## Simulating Possible Datasets
# 
# The figures we have generated so far in this notebook are based on a single resampling of the GSS data.  Some of the features we see in these figures might be due to random sampling rather than actual changes in the world.
# 
# By generating the same figures with different resampled datasets, we can get a sense of how much variation there is due to random sampling.
# 
# To make that easier, the following function contains the code from the previous analysis all in one place.

# In[27]:


def plot_by_polviews(gss):
    """Plot mean response by polviews and year.

    gss: DataFrame
    """
    gss["polviews3"] = gss["polviews"].replace(recode_polviews)
    gss["fair2"] = gss["fair"].replace(recode_fair)

    table = gss.pivot_table(
        values="fair2", index="year", columns="polviews3", aggfunc="mean"
    )

    plot_columns_lowess(table, columns, colors)

    decorate(
        xlabel="Year",
        ylabel="Fraction saying yes",
        title="Would most people try to be fair?",
    )


# Now we can loop through the three resampled datasets in the HDF5 file and generate a figure for each one.

# In[28]:


datafile = "gss_pacs_resampled.hdf"

for key in ["gss0", "gss1", "gss2"]:
    df = pd.read_hdf(datafile, key)

    plt.figure()
    plot_by_polviews(df)


# Features that are the same in all three figures are more likely to reflect things actually happening in the world.  Features that differ substantially between the figures are more likely to be artifacts of random sampling.
# In this context, "artifact" has the sense of "something observed in a scientific investigation or experiment that is not naturally present but occurs as a result of the preparative or investigative procedure" (from <https://www.lexico.com/en/definition/artifact>).

# **Exercise:** As an exercise, you can run the same analysis with one of the other variables related to outlook including `happy`, `trust`, `helpful`, and maybe `fear` and `hapmar`.
# 
# For these variables, you will have to read the codebook to see the responses and how they are encoded, then think about which responses to report.
# In the notebook for this chapter, there are some suggestions to get you started.

# Here are the steps I suggest:
#     
# 1) If you have not already saved this notebook, you might want to do that first. If you are running on Colab, select "Save a copy in Drive" from the File menu.
# 
# 2) Now, before you modify this notebook, make another copy and give it an appropriate name.
# 
# 3) Search and replace `fair` with the name of the variable you select (use "Edit->Find and replace").
# 
# 4) Run the notebook from the beginning and see what other changes you have to make.
# 
# Write a few sentences to describe the relationships you find between political alignment and outlook.

# ## Summary
# 
# The case study in this chapter and the previous one demonstrates a process for exploring a dataset and finding relationships among the variables.
# 
# In the previous chapter, we started with a single variable, `polviews`, and visualized its distribution at the beginning and end of the observation interval.
# Then we used `groupby` to see how the mean and standard deviation changed over time.
# Looking more closely, we used cross tabulation to see how the fraction of people in each group changed over time.
# 
# In this chapter, we added a second variable, `fair`, which is one of several questions in the GSS related to respondents' beliefs about other people.
# We used `groupby` again to see how the responses have changed over time.
# Then we used a pivot table to show how the responses within each political group have changed over time.
# Finally, we used multiple resamplings of the original dataset to check whether the patterns we identified might be due to random sampling rather than real changes in the world.
# 
# The tools we used in this case study are versatile; they are useful for exploring other variables in the GSS dataset, and other datasets as well.
# And the process we followed is one I recommend whenever you are exploring a new dataset.

# Political Alignment Case Study
# 
# Copyright 2020 Allen B. Downey
# 
# License: [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)