%matplotlib inline
import sys
sys.path.append("../../")
from shared.src import quiet
from shared.src import seed
from shared.src import style
import random
from client.api.notebook import Notebook
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pymc3 as pm
import scipy
import seaborn as sns
import shared.src.utils.util as shared_util
sns.set_context("notebook", font_scale=1.7)
ok = Notebook("ok/config")
import scipy.stats
You're working as a research scientist studying social cognition and social processing. You're using fMRI machines to search for areas of the brain that respond to emotionally and socially salient stimuli. Since your grant money comes from Facebook, you're trying to find areas of the brain that respond differentially to discovering that a social media post has received one of the six reactions: like, love, haha, wow, sad, and angry, pictured below.
From a literature review, you've identified a region that is likely to respond to these stimuli. You've collected the average activation values for the center of this region for a number of participants as they observed reactions to their social media posts. The data is stored in fmri_data.pkl. (This data is simulated, not real.)
fmri_df = pd.read_pickle("data/fmri_data.pkl")
fmri_df.sample(5)
fmri_df.reaction.unique()
The area of interest takes up one "pixel" of the fMRI image. Since fMRI measures brain activity in 3D space, that is, in a volume, these are known as voxels, short for "volumetric pixels". The values you've measured are called voxel activities or voxel activations, shortened to voxel_act in the fmri_df.
For each participant, you also measured the average activation of this voxel during the presentation of non-socially-relevant, non-emotional stimuli, and subtracted this value off. Therefore, you're looking to see whether the average activation across subjects (rows of fmri_df) is different from 0. You suspect that an ANOVA is the right model for this data: a categorical effects model with a Gaussian likelihood and equal variances in each group.
Now define an ANOVA model for this data in pyMC. Define priors for the group means, making sure there is a mean for each group, and then use the group labels (react_idx) to set the parameters of a Normally-distributed variable for the likelihood. For that variable, set observed equal to the observed voxel activations. You'll also need a prior for the (pooled) standard deviation. You are free to choose your prior as you want. Refer back to the lecture material if you get stuck.
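One possible shape for this model, as a sketch rather than the definitive answer: it assumes fmri_df has the columns reaction, react_idx (integer group labels starting at 0), and voxel_act described above, and the Normal/HalfNormal priors are just one weakly-informative choice.
react_idx = fmri_df["react_idx"].values      # assumed integer labels, 0 ... n_groups - 1
voxel_act = fmri_df["voxel_act"].values
n_groups = fmri_df["reaction"].nunique()
with pm.Model() as anova_model:
    # one mean per reaction group (weakly informative prior)
    group_means = pm.Normal("group_means", mu=0.0, sd=10.0, shape=n_groups)
    # pooled standard deviation, shared by every group
    sd = pm.HalfNormal("sd", sd=10.0)
    # Gaussian likelihood: each row uses the mean of its own group
    pm.Normal("voxel_act", mu=group_means[react_idx], sd=sd, observed=voxel_act)
    anova_trace = pm.sample(2000, tune=1000)
pm.plot_posterior(anova_trace);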
Visualize the posteriors over the group means with pm.plot_posterior.
Now, run a traditional ANOVA on the data using scipy.stats.f_oneway. Save the values of $F$ and $p$ to reaction_F and reaction_p, respectively.
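For instance, a sketch assuming the reaction and voxel_act columns used above:
# build one array of voxel activations per reaction group, then compare them
groups = [
    fmri_df.loc[fmri_df["reaction"] == reaction, "voxel_act"].values
    for reaction in fmri_df["reaction"].unique()
]
reaction_F, reaction_p = scipy.stats.f_oneway(*groups)
print(f"F = {reaction_F:.2f}, p = {reaction_p:.3g}")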
ok.grade("q1")
Not every kind of effect can be detected by an ANOVA, since not all data is normal and not all changes affect the mean.
Say you're working in a lab studying Human-Computer Interaction.
Your thesis is on emoji use. What causes people to use emojis? How can we improve the emoji experience for users?
As an early step in this research, you're collecting some demographic data on emoji use.
You suspect that younger generations (those born after 1985, or with a birthdate >1985) use more emojis than do older generations (those born before 1985, or with a birthdate <1985). To test your hypothesis, you collect some data, emoji_data.pkl. (This data is simulated, not real.)
emoji_df = pd.read_pickle("data/emoji_data.pkl")
emoji_df.head()
The emoji_ct variable counts the number of emojis in the text message, while the generation_idx and birthdate variables code the generation of the text message sender, the former as a number and the latter as a string.
emoji_df.describe()
Visualize the distribution of the emoji_ct
for each generation.
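One quick option, sketched here under the assumption that birthdate holds the generation labels; any plot that separates the two generations works.
# side-by-side summary of emoji counts per generation
sns.boxplot(x="birthdate", y="emoji_ct", data=emoji_df);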
This result doesn't accord with your understanding of the data, so you decide to dig deeper.
Plot histograms of emoji_ct for each generation, using the bins as defined below.
bins = np.arange(0, max(emoji_df["emoji_ct"]))
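A sketch of one way to do that with matplotlib, assuming the birthdate labels above:
# overlay normalized histograms of emoji counts for the two generations
for generation, group in emoji_df.groupby("birthdate"):
    plt.hist(group["emoji_ct"], bins=bins, alpha=0.5, density=True, label=generation)
plt.xlabel("emoji_ct")
plt.ylabel("density")
plt.legend();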
The data within each group is decidedly not normally distributed.
Compare the >1985 histogram with the <1985 histogram.
Despite the negative ANOVA result, the two distributions look different.
It appears that there are two kinds of text messages: messages with emojis, which then have a variable number of emojis, and messages without emojis.
Given that we are working with counts, the natural distribution for modeling this data is the ZeroInflatedPoisson distribution (see the pyMC docs for a plot of the distribution). You can think of it as a random variable with two parts: its value is either equal to 0, or it is Poisson-distributed. You could implement it from scratch with pm.math.switch (a rough sketch follows the parameter description below).
It has two parameters: theta, which is the rate of the Poisson part, and psi, which is the chance the value is Poisson-distributed. In our case, psi is the chance that the text message contains any emojis, while theta is the average number of emojis in the texts that contain emojis.
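As a rough illustration only (the built-in ZeroInflatedPoisson already does this for you), the log-probability behind that switch could be sketched as:
def zip_logp(value, psi, theta):
    # P(value = 0) = (1 - psi) + psi * exp(-theta)   (structural zeros plus Poisson zeros)
    # P(value = k) = psi * Poisson(k; theta)          for k > 0
    return pm.math.switch(
        pm.math.eq(value, 0),
        pm.math.log((1 - psi) + psi * pm.math.exp(-theta)),
        pm.math.log(psi) + pm.Poisson.dist(mu=theta).logp(value),
    )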
Below, define a model for the emoji data that uses the Zero-Inflated Poisson. Write it by analogy to the ANOVA model: define priors for the model parameters, making sure each group gets a random variable for each of its parameters (notice that the ZeroInflatedPoisson variable has two parameters, not just one, and neither is shared, unlike the sd in an ANOVA). Then index into those with the group labels (generation_idx) to set the parameters of the ZeroInflatedPoisson variable. For that variable, set observed equal to the observed counts.
Think carefully about the priors for the parameters of the Zero-Inflated Poisson variable. There are multiple right answers, but also lots of wrong answers. The variable psi must be between 0 and 1 and is continuous, while the variable theta just has to be above 0 (positive-only). You might have other prior information or assumptions you want to include, which you can encode by your choice of prior distribution for these variables.
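A minimal sketch of one possible model, assuming the emoji_ct and generation_idx columns described above; the Beta and Exponential priors are one defensible choice among several.
emoji_ct = emoji_df["emoji_ct"].values
generation_idx = emoji_df["generation_idx"].values   # assumed integer labels, 0 ... n_groups - 1
n_groups = len(np.unique(generation_idx))
with pm.Model() as emoji_model:
    # per-generation chance that a message contains any emojis (must lie in (0, 1))
    psi = pm.Beta("psi", alpha=1.0, beta=1.0, shape=n_groups)
    # per-generation mean emoji count among emoji-containing messages (must be positive)
    theta = pm.Exponential("theta", lam=1.0, shape=n_groups)
    # zero-inflated Poisson likelihood, indexed by each message's generation
    pm.ZeroInflatedPoisson(
        "emoji_ct", psi=psi[generation_idx], theta=theta[generation_idx], observed=emoji_ct
    )
    emoji_trace = pm.sample(2000, tune=1000)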
Visualize the marginal posteriors over psi and theta with pm.plot_posterior or something similar.
ok.score()