%matplotlib inline
import sys
sys.path.append("../../")
from shared.src import quiet
from shared.src import seed
from shared.src import style
import random
from client.api.notebook import Notebook
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pymc3 as pm
import scipy
import seaborn as sns
import shared.src.utils.util as shared_util
sns.set_context("notebook", font_scale=1.7)
ok = Notebook("ok/config")
import scipy.stats
You're working as a research scientist studying social cognition and social processing. You're using fMRI machines to search for areas of the brain that respond to emotionally and socially salient stimuli. Since your grant money comes from Facebook, you're trying to find areas of the brain that respond differentially to discovering that a social media post has received one of the six reactions: like, love, haha, wow, sad, and angry, pictured below.
From a literature review, you've identified a region that is likely to respond to these stimuli. You've collected the average activation values for the center of this region for a number of participants as they observed reactions to their social media posts. The data is stored in fmri_data.pkl. (This data is simulated, not real.)
fmri_df = pd.read_pickle("data/fmri_data.pkl")
fmri_df.sample(5)
fmri_df.reaction.unique()
The area of interest takes up one "pixel" of the fMRI image. Since fMRI measures brain activity in 3D space, that is, in a volume, these are known as voxels, short for "volumetric pixels". The values you've measured are called voxel activities or voxel activations, shortened to voxel_act in the fmri_df.
For each participant, you also measured the average activation of this voxel during the presentation of non-socially-relevant, non-emotional stimuli, and subtracted this value off. Therefore, you're looking to see whether the average activation across subjects (rows of fmri_df) is different from 0. You suspect that an ANOVA is the right model for this data: a categorical effects model with a Gaussian likelihood and equal variances in each group.
Now define an ANOVA model for this data in pyMC. Define priors for the group means, making sure there is a mean for each group, and then use the group labels (react_idx) to set the parameters of a Normally-distributed variable for the likelihood. For that variable, set observed equal to the observed voxel activations. You'll also need a prior for the (pooled) standard deviation. You are free to choose your prior as you want. Refer back to the lecture material if you get stuck.
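One possible shape for this model, as a sketch rather than the definitive answer: it assumes fmri_df has the columns reaction, react_idx (integer group labels starting at 0), and voxel_act described above, and the Normal/HalfNormal priors are just one weakly-informative choice.
react_idx = fmri_df["react_idx"].values      # assumed integer labels, 0 ... n_groups - 1
voxel_act = fmri_df["voxel_act"].values
n_groups = fmri_df["reaction"].nunique()
with pm.Model() as anova_model:
    # one mean per reaction group (weakly informative prior)
    group_means = pm.Normal("group_means", mu=0.0, sd=10.0, shape=n_groups)
    # pooled standard deviation, shared by every group
    sd = pm.HalfNormal("sd", sd=10.0)
    # Gaussian likelihood: each row uses the mean of its own group
    pm.Normal("voxel_act", mu=group_means[react_idx], sd=sd, observed=voxel_act)
    anova_trace = pm.sample(2000, tune=1000)
pm.plot_posterior(anova_trace);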
Visualize the posteriors over the group means with pm.plot_posterior.
Now, run a traditional ANOVA on the data using scipy.stats.f_oneway. Save the values of $F$ and $p$ to reaction_F and reaction_p, respectively.
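For instance, a sketch assuming the reaction and voxel_act columns used above:
# build one array of voxel activations per reaction group, then compare them
groups = [
    fmri_df.loc[fmri_df["reaction"] == reaction, "voxel_act"].values
    for reaction in fmri_df["reaction"].unique()
]
reaction_F, reaction_p = scipy.stats.f_oneway(*groups)
print(f"F = {reaction_F:.2f}, p = {reaction_p:.3g}")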
ok.grade("q1")
Not every kind of effect can be detected by an ANOVA, since not all data is normal and not all changes affect the mean.
Say you're working in a lab studying Human-Computer Interaction.
Your thesis is on emoji use. What causes people to use emojis? How can we improve the emoji experience for users?
As an early step in this research, you're collecting some demographic data on emoji use.
You suspect that younger generations (those born after 1985, or with a birthdate >1985) use more emojis than do older generations (those born before 1985, or with a birthdate <1985). To test your hypothesis, you collect some data, emoji_data.pkl. (This data is simulated, not real.)
emoji_df = pd.read_pickle("data/emoji_data.pkl")
emoji_df.head()
The emoji_ct variable counts the number of emojis in the text message, while the generation_idx and birthdate variables code the generation of the text message sender, the former as a number and the latter as a string.
emoji_df.describe()
Visualize the distribution of the emoji_ct
for each generation.
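One quick option, sketched here under the assumption that birthdate holds the generation labels; any plot that separates the two generations works.
# side-by-side summary of emoji counts per generation
sns.boxplot(x="birthdate", y="emoji_ct", data=emoji_df);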
This result doesn't accord with your understanding of the data, so you decide to dig deeper.
Plot histograms of emoji_ct for each generation, using the bins as defined below.
bins = np.arange(0, max(emoji_df["emoji_ct"]))
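A sketch of one way to do that with matplotlib, assuming the birthdate labels above:
# overlay normalized histograms of emoji counts for the two generations
for generation, group in emoji_df.groupby("birthdate"):
    plt.hist(group["emoji_ct"], bins=bins, alpha=0.5, density=True, label=generation)
plt.xlabel("emoji_ct")
plt.ylabel("density")
plt.legend();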
The data within each group is decidedly not normally distributed.
Compare the >1985 histogram with the <1985 histogram.
Despite the negative ANOVA result, the two distributions look different.
It appears that there are two kinds of text messages: messages with emojis, which then have a variable number of emojis, and messages without emojis.
Given that we are working with counts, the natural distribution for modeling this data is the ZeroInflatedPoisson distribution (see the pyMC docs for a plot of the distribution). You can think of it as a random variable with two parts: its value is either equal to 0, or it is Poisson-distributed. You could implement it from scratch with pm.math.switch (a rough sketch follows the parameter description below).
It has two parameters: theta, which is the rate of the Poisson part, and psi, which is the chance the value is Poisson-distributed. In our case, psi is the chance that the text message contains any emojis, while theta is the average number of emojis in the texts that contain emojis.
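As a rough illustration only (the built-in ZeroInflatedPoisson already does this for you), the log-probability behind that switch could be sketched as:
def zip_logp(value, psi, theta):
    # P(value = 0) = (1 - psi) + psi * exp(-theta)   (structural zeros plus Poisson zeros)
    # P(value = k) = psi * Poisson(k; theta)          for k > 0
    return pm.math.switch(
        pm.math.eq(value, 0),
        pm.math.log((1 - psi) + psi * pm.math.exp(-theta)),
        pm.math.log(psi) + pm.Poisson.dist(mu=theta).logp(value),
    )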
Below, define a model for the emoji data that uses the Zero-Inflated Poisson. Write it by analogy to the ANOVA model: define priors for the model parameters, making sure each group gets a random variable for each of its parameters (notice that the ZeroInflatedPoisson variable has two parameters, not just one, and neither is shared, unlike the sd in an ANOVA). Then index into those with the group labels (generation_idx) to set the parameters of the ZeroInflatedPoisson variable. For that variable, set observed equal to the observed counts.
Think carefully about the priors for the parameters of the Zero-Inflated Poisson variable. There are multiple right answers, but also lots of wrong answers. The variable psi must be between 0 and 1 and is continuous, while the variable theta just has to be above 0 (positive-only). You might have other prior information or assumptions you want to include, which you can encode by your choice of prior distribution for these variables.
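A minimal sketch of one possible model, assuming the emoji_ct and generation_idx columns described above; the Beta and Exponential priors are one defensible choice among several.
emoji_ct = emoji_df["emoji_ct"].values
generation_idx = emoji_df["generation_idx"].values   # assumed integer labels, 0 ... n_groups - 1
n_groups = len(np.unique(generation_idx))
with pm.Model() as emoji_model:
    # per-generation chance that a message contains any emojis (must lie in (0, 1))
    psi = pm.Beta("psi", alpha=1.0, beta=1.0, shape=n_groups)
    # per-generation mean emoji count among emoji-containing messages (must be positive)
    theta = pm.Exponential("theta", lam=1.0, shape=n_groups)
    # zero-inflated Poisson likelihood, indexed by each message's generation
    pm.ZeroInflatedPoisson(
        "emoji_ct", psi=psi[generation_idx], theta=theta[generation_idx], observed=emoji_ct
    )
    emoji_trace = pm.sample(2000, tune=1000)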
Visualize the marginal posteriors over psi and theta with pm.plot_posterior or something similar.
ok.score()