#!/usr/bin/env python
# coding: utf-8

# Find this project in [Github: UberHowley/SPOC-File-Processing](https://github.com/UberHowley/spoc-file-processing)
# 
# ---
# 
# # Discussion Forums in a Small Private Online Course (SPOC)
# In previous work [detailed here](http://www.irishowley.com/website/pMOOChelpers.html) I found that up- and down-voting had a significant negative effect on the number of peer helpers MOOC students invited to their help seeking thread. Results from a previous survey experiment also suggested that framing tasks with a learning-oriented (or value-emphasized) instruction decreased evaluation anxiety, which is the hypothesized mechanism through which voting inhibits help seeking. Following on this work, I wished to investigate **how voting impacts help seeking in more traditional online course discussion forums** and **if we can alleviate any potential costs of effective participation while still maintaining the benefits of voting in forums**? 
# 
# So this experiment took place in a Small Private Online Course (SPOC) with a more naturalistic up/downvoting setup in a disucssion forum and used prompts that emphasized the value of participating in forums.
# 
# *I used Python to clean the logfile data, gensim to assign automated [LDA] topics to each message board post, and pandas to perform statistical analyses in order to answer my research questions.*
# 
# ## Research Questions
# Our QuickHelper system was designed to advise students on which peers they might want to help them, but also to answer  theory-based questions about student motivation and decision-making in the face of common interactional archetypes employed in MOOCs. This yielded the following research questions:
# 1. What is the effect of no/up/downvoting on {learning, posting comments, posting quality comments, help seeking}
# 1. What is the effect of neutral/goal prompting on {learning, posting comments, posting quality comments, help seeking}
# 1. Is there an interaction between votingXprompting on {learning, posting comments, posting quality comments, help seeking}
# 1. What is the relationship between {posting, posting quality, help seeking} and learning?
# 1. Which prompts are the most successful at getting a student response? (Does this change over time?) What about the highest quality responses?
# 1. What behavior features correlate strongly with learning or exam scores? (i.e., can we predict student performance with behaviors in the first couple of weeks?)
#   1. Is there a correlation between pageviews and performance?
#   1. What is the most viewed lecture? etc.
#   1. What lecture is viewed the most by non-students? etc.
# 
# ## Experimental Setup
# The parallel computing course in which this experiment took place met in-person twice per week for 1:20, but all lecture slides (of which there were 28 sets) were posted [online](http://15418.courses.cs.cmu.edu/spring2015/). Each slide had its own discussion forum-like comments beneath it, as shown below. Students enrolled in the course were required to "individually contribute one interesting comment per lecture using the course web site" for a 5% participation grade. The experiment began in the TODO week of the course, and ran for TODO weeks.
# ![SPOC Discussion Forum](http://www.irishowley.com/website/img/proj_spoc_forum.jpg)
# 
# Approximately 70 consenting students are included in this 3X2 experiment.
# 
# ### Voting
# There were three voting conditions in this dimension: no voting, upvoting only, and downvoting, as shown below. Each student was assigned to one of these conditions, and they only ever saw one voting condition. There was the issue of cross-contamination of voting conditions, if one student looked at a peer's screen, but the instructors did not see any physical evidence of this cross contamination until the last week of the course.
# ![SPOC Voting Conditions](http://www.irishowley.com/website/img/proj_spoc_voting.jpg)
# 
# ### Prompting
# Students were also assigned to only one of two prompting conditions: positive (learning-oriented/value-emphasis) or neutral. These prompts were in the form of an email sent through the system, that would have a preset "welcome prompt" followed by a customizable "instructional prompt." The welcome prompt was either value-emphasis or neutral in tone, and the instructional prompt was either a general 'restate' request or a more specific in nature, prompting the student to answer a specific question. The welcome prompts were predefined and not customizable by the sender of the prompts, while the instructional portion was customizable.  In order to maintain the utility of the email prompots, instructor-customization had to be supported.
# 
# <table><tr><td><b>Neutral Welcome Prompt</b></td><td><b>Positive Welcome Prompt</b></td></tr>
# <tr><td><ul><li>This is a discussion you can participate in.</li>
# <li>Here is a discussion you can participate in.</li>
# <li>This is a thread you can participate in.</li>
# <li>Here’s a thread you can comment on.</li>
# <li>Here’s a conversation you can comment on.</li>
# <li>Here’s a conversation you can participate in.</li>
# <li>Here is a discussion in which you can participate.</li>
# <li>This is an opportunity to engage in the class discussion.</li>
# <li>Here is an opportunity for you to engage in the class discussion.</li>
# <li>Here’s a chance to engage in the discussion.</li>
# <li>Here’s a chance to jump into the discussion.</li>
# <li>Here’s an opportunity to jump in.</li>
# <li>This is a thread you can respond to.</li>
# <li>Here is a discussion you can respond to.</li>
# <li>Here’s a conversation to jump in on.</li>
# <li>Here’s a conversation you can respond to.</li></ul>
# </td><td><ul><li>Participating in class discussions increases exposure to new ideas.</li>
# <li>Our class discussion forums increase exposure to new ideas.</li>
# <li>Exposure to new ideas is one benefit to the class discussion forums.</li>
# <li>Take some time to explore some new ideas in the class forums.</li>
# <li>Writing down your thoughts will help you think through complex ideas.</li>
# <li>Class discussions help you understand concepts, not just memorize them.</li>
# <li>Participating in class discussions will help you understand the concepts, and not just memorize them.</li>
# <li>Contributing to the course discussion forums is a good way to learn new things.</li>
# <li>Participation in course discussions will increase your learning.</li>
# <li>Expanding on others’ ideas is a great way to learn new things.</li>
# <li>Participating in the class discussions online will help you learn the concepts better.</li>
# <li>Asynchronous discussions allow for more thought-processing time.</li>
# <li>Class discussion forums get you to think through your thoughts.</li>
# <li>Participation from all students is key to everyone’s learning.</li>
# <li>Expanding on others’ ideas is a great way to check your own understanding.</li>
# <li>Writing things down is a great way to check your own understanding.</li></ul>
# </td></tr></table>
# 
# <table><tr><td><b>Context - Restate (customizable)</b></td><td><b>Context - Question (customizable)</b></td></tr>
# <li><tr><td><ul><li>Try to describe what was said on this slide in your own words.</li>
# <li>Summarize the main point of this slide.</li>
# <li>Summarize the topic of [Insert concept related to slide here].</li>
# <li>Summarize, in your own words, what @NAME was trying to say here.</li>
# <li>Restate, in your own words, what I was trying to explain on this slide.</li>
# <li>Restate, in your own words, the idea of [Insert concept related to slide here].</li>
# <li>Restate, in your own words, what @NAME was trying to say here.</li>
# <li>Try restating, in your own words, what @NAME was saying here.</li>
# <li>Try and describe the main idea in this slide.</li>
# <li>Can you summarize the discussion happening here?</li></ul>
# </td><td><ul><li>Can you answer the question in this discussion?</li>
# <li>Try to answer the question in this discussion.</li>
# <li>What’s your answer to this question?</li>
# <li>How would you answer this question?</li></ul>
# </td></tr></table>
# 
# A course instructor might decide to prompt some students to answer another students' question, as this is a valuable learning activity. When a course instructor decides to send an email prompt to a student, the system would randomly select two students. The instructor would select and customize the instructional prompt and upon submitting, the system would send an email to two students, with an appropriate version of a randomly selected welcome prompt. An example email prompt is shown below:
# ![SPOC Email Prompt Example](http://www.irishowley.com/website/img/proj_spoc_prompting.jpg)
# 
# 
# ## Processing Logfiles
# Processing the logfiles mostly involves: (1) ensuring that only consenting students are included in the dataset, (2) that each comment is assigned an LDA topic, (3) that student names are removed from the dataset, (4) removing course instructors from the dataset, (5) removing spaces from column headers, and (6) removing approximately 241 comments from the logfiles that were posted to a lecture slide more than 2-3 weeks in the past (i.e., attempted cramming for an increased participation grade).
# 
# ## Statistical Analysis
# I used the pandas and statsmodels libraries to run descriptive statistics, generate plots to better understand the data, and answer our research questions (from above). The main effects of interest were the categorical condition variables, `voting` (updownvote, upvote, novote) & `prompts` (positive, neutral) and a variety of scalar dependent variables (`number of comments`, `comment quality`, `help seeking`, and `learning`).

# In[6]:


# necessary libraries/setup
get_ipython().run_line_magic('matplotlib', 'inline')
import utilsSPOC as utils  # separate file that contains all the constants we need
import pandas as pd
import matplotlib.pyplot as plt
# need a few more libraries for ANOVA analysis
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
# importing seaborns for its factorplot
import seaborn as sns  
sns.set_style("darkgrid")
sns.set_context("notebook")

data_comments = pd.io.parsers.read_csv("150512_spoc_comments_lda.csv", encoding="utf8")
data_prompts = pd.io.parsers.read_csv("150512_spoc_prompts_mod.csv", encoding="utf8")
data = pd.io.parsers.read_csv("150512_spoc_full_data_mod.csv", encoding="utf-8-sig")
conditions = [utils.COL_VOTING, utils.COL_PROMPTS]  # all our categorical IVs of interest
covariate = utils.COL_NUM_PROMPTS  # number of prompts received needs to be controlled for
outcome = utils.COL_NUM_LEGIT_COMMENTS


# ### Descriptive Statistics
# Descriptive statistics showed that the mean number of comments students posted was 19 and the median was 17. Students also saw a mean and median number of '1' email prompts.

# In[2]:


df = data[[covariate, outcome]]
df.describe()
# note: the log-processing removes the Prompting condition of any student who did not see a prompt


# In[18]:


# histogram of num comments
fig = plt.figure()
ax1 = fig.add_subplot(121)
comments_hist = data[outcome]
ax1 = comments_hist.plot(kind='hist', title="Histogram "+outcome, by=outcome)
ax1.locator_params(axis='x')
ax1.set_xlabel(outcome)
ax1.set_ylabel("Count")

ax2 = fig.add_subplot(122)
prompts_hist = data[covariate]
ax2 = prompts_hist.plot(kind='hist', title="Histogram "+covariate, by=covariate)
ax2.locator_params(axis='x')
ax2.set_xlabel(covariate)
ax2.set_ylabel("Count")

fig.tight_layout()


# In[36]:


# plotting num prompts per day
sns.set_context("poster")
plot_prompts = data_prompts[[utils.COL_ENCOURAGEMENT_TYPE, utils.COL_TSTAMP]]
plot_prompts[utils.COL_TSTAMP] = pd.to_datetime(plot_prompts[utils.COL_TSTAMP])
plot_prompts.set_index(utils.COL_TSTAMP)
date_counts = plot_prompts[utils.COL_TSTAMP].value_counts()
date_plot = date_counts.resample('d',how=sum).plot(title="Total Prompts By Date",legend=None, kind='bar')
date_plot.set(ylim=(0, 25))
# TODO: need to find a way to drop time part of timestamp...


# When looking at the independent variables, we see a even random assignment to `Condition`. Initially, the assignment to `EncouragementType` wasn't perfectly even (41 vs. 27), but when we removed the prompting condition assignment of any student who did not receive any prompts, the distribution becomes more even.

# In[4]:


df = data[conditions+[covariate, outcome]]
for cond in conditions:
    print(pd.concat([df.groupby(cond)[cond].count(), df.groupby(cond)[covariate].mean(), df.groupby(cond)[outcome].mean()], axis=1))


# ### Answering Our Research Questions
# The research questions require a bit of statistics to answer. In the case of a single factor with two levels we use a t-test to determine if the independent variable in question has a significant effect on the outcome variable. In the case of a single factor with more than two levels, we use a one-way Analysis of Variance (ANOVA). With more than one factor we use a two-way ANOVA. These are all essentially similar linear models (ordinary least squares), with slightly different equations or statistics for determining significance. 
# 
# #### What is the effect of no/up/downvoting on {learning, posting comments, posting quality comments, help seeking}
# *(VotingCondition --> numComments)*
# 
# To answer this question, we run a one-way ANOVA (since there are more than 2 levels to this one factor). Since the p-value is above 0.05 (i.e., 0.72) we must accept the null hypothesis that there is no difference between voting conditions on number of posted comments.

# In[5]:


cond = utils.COL_VOTING
df = data[[cond, outcome]].dropna()

cond_lm = ols(outcome + " ~ C(" + cond + ")", data=df).fit()
anova_table = anova_lm(cond_lm)
print(anova_table)
print(cond_lm.summary())

# boxplot 
fig = plt.figure()
ax = fig.add_subplot(111)
ax = df.boxplot(outcome, cond, ax=plt.gca())
ax.set_xlabel(cond)
ax.set_ylabel(outcome)
fig.tight_layout()


# #### Does showing expertise information about potential helpers increase the number of helpers the student invites to her question thread? 
# *(EncouragementType + numPrompts --> numComments)*
# 
# We run an ANCOVA (since we need to control for the number of prompts each student saw) and see that the p-value is once again not significant (i.e., 0.23).

# In[6]:


cond = utils.COL_PROMPTS
df = data[[cond, covariate, outcome]].dropna()

cond_lm = ols(outcome + " ~ C(" + cond + ") + " + covariate, data=df).fit()
anova_table = anova_lm(cond_lm)
print(anova_table)
print(cond_lm.summary())

# boxplot 
fig = plt.figure()
ax = fig.add_subplot(111)
ax = df.boxplot(outcome, cond, ax=plt.gca())
ax.set_xlabel(cond)
ax.set_ylabel(outcome)
fig.tight_layout()


# #### Is there an interaction between voting X prompting on {learning, posting comments, posting quality comments, help seeking}
# *Condition X EncouragementType + numPrompts--> numComments*
# 
# The OLS output shows that neither the additive model nor the interactive/multiplicative model are significant (p = 0.53 vs. p = 0.47). The AIC scores are quite similar so we'd select the one with the lower AIC score as being a better fit to the data.

# In[7]:


col_names = [utils.COL_VOTING, utils.COL_PROMPTS, covariate, outcome]
factor_groups = data[col_names].dropna()

formula = outcome + " ~ C(" + col_names[0] + ") + C(" + col_names[1] + ") + " + covariate
formula_interaction = outcome + " ~ C(" + col_names[0] + ") * C(" + col_names[1] + ") + " + covariate

print("= = = = = = = = " + formula + " = = = = = = = =")
lm = ols(formula, data=factor_groups).fit()  # linear model: AIC 418
print(lm.summary())

print("\n= = = = = = = = " + formula_interaction + " = = = = = = = =")
lm_interaction = ols(formula_interaction, data=factor_groups).fit()  # interaction linear model: AIC 418.1
print(lm_interaction.summary())

# We can test if they're significantly different with an ANOVA (neither is sig. so not necessary)
print("= = = = = = = = = = = = = = = = = = = = = = = = = = = = = =")
print("= = " + formula + " ANOVA = = ")
print("= = vs. " + formula_interaction + " = =")
print(anova_lm(lm, lm_interaction))


# In[15]:


# These are ANOVA tests for determining if different models are significantly different from each other.
# Although we know from the previous step that none of these are signficant, so these tests aren't necessary.
# The ANOVA output provides the F-statistics which are necessary for reporting results.
# From: http://statsmodels.sourceforge.net/devel/examples/generated/example_interactions.html

# Tests whether the LM of just Condition is significantly different from the additive LM
print("= = = = = = = = = = = = = = = = = = = = = = = = = = = = = =")
f_just_first = outcome + " ~ C(" + col_names[0] + ")"
print("= = " + f_just_first + " ANOVA = = ")
print("= = vs. " + formula + " = =")
print(anova_lm(ols(f_just_first, data=factor_groups).fit(), ols(formula, data=factor_groups).fit()))

# Testing whether the LM of just EncouragementType is significantly different from the additive LM
print("= = = = = = = = = = = = = = = = = = = = = = = = = = = = = =")
f_just_second = outcome + " ~ C(" + col_names[1] + ") + " + covariate
print("= = " + f_just_second + " = = ")
print("= = vs. " + formula + " = =")
print(anova_lm(ols(f_just_second, data=factor_groups).fit(), ols(formula, data=factor_groups).fit()))


# In[8]:


# plotting
from statsmodels.graphics.api import interaction_plot
sns.set_context("notebook")

fig = plt.figure()
ax1 = sns.factorplot(x=col_names[1], y=outcome, data=factor_groups, kind='point', ci=95)
ax1.set(ylim=(0, 45))
ax2 = sns.factorplot(x=col_names[0], y=outcome, data=factor_groups, kind='point', ci=95)
ax2.set(ylim=(0, 45))
ax3 = sns.factorplot(x=col_names[1], hue=col_names[0], y=outcome, data=factor_groups, kind='point', ci=95)
ax3.set(ylim=(0, 45))


# This last interaction plot is interesting, and suggests that we might want to look at the "downvote" condition compared to the "non-downvote" conditions, a split which is captured in our 'COL_NEG_VOTE' column. Neither model appears significant, and an additional test shows neither model performs statistically better than the other (p = 0.1).

# In[ ]:


# sample code for calculating new column reformulation
def get_neg_vote(voting_cond)
        if voting_cond == utils.COND_VOTE_NONE or voting_cond == utils.COND_VOTE_UP:
            return utils.COND_OTHER
        elif voting_cond == utils.COND_VOTE_BOTH:
            return utils.COND_VOTE_BOTH
        else:
            return ""

# if you're missing the log processing, you can get a 'negativeVoting' column with something like this code:
data[utils.COL_NEG_VOTE] = data[utils.COL_VOTING].apply(lambda x: get_neg_vote(x))


# In[9]:


col_names = [utils.COL_NEG_VOTE, utils.COL_PROMPTS, covariate, outcome]
factor_groups = data[col_names].dropna()

formula = outcome + " ~ C(" + col_names[0] + ") + C(" + col_names[1] + ") + " + covariate
formula_interaction = outcome + " ~ C(" + col_names[0] + ") * C(" + col_names[1] + ") + " + covariate

print("= = = = = = = = " + formula + " = = = = = = = =")
lm = ols(formula, data=factor_groups).fit()  # linear model: AIC 416
print(lm.summary())

print("\n= = = = = = = = " + formula_interaction + " = = = = = = = =")
lm_interaction = ols(formula_interaction, data=factor_groups).fit()  # interaction linear model: AIC 414
print(lm_interaction.summary())

# We can test if they're significantly different with an ANOVA (neither is sig. so not necessary)
print("= = = = = = = = = = = = = = = = = = = = = = = = = = = = = =")
print("= = " + formula + " ANOVA = = ")
print("= = vs. " + formula_interaction + " = =")
print(anova_lm(lm, lm_interaction))

# plotting
sns.set_context("notebook") 
ax3 = sns.factorplot(x=col_names[1], hue=col_names[0], y=outcome, data=factor_groups, kind='point', ci=95)
ax3.set(ylim=(0, 45))
sns.set_context("poster")  # larger plots
ax4 = factor_groups.boxplot(return_type='axes', column=outcome, by=[col_names[0], col_names[1]])


# ### Discussion
# This analysis so far has only looked at `number of comments` as the dependent variable, of which there does not appear to be any significant effect. However, `comment quality`, `help seeking`, and `learning` are all important dependent variables to examine as well. This analysis is still in progress.

# ## Topic Modeling
# I used gensim to automatically apply topics to each forum comment. 

# In[10]:


# basic descriptive statistics
data_topic = data_comments[[utils.COL_LDA]]
data_help = data_comments[[utils.COL_HELP]]

print(data_topic[utils.COL_LDA].describe())

# histogram of LDA topics
sns.set_context("notebook") 
sns.factorplot(utils.COL_LDA, data=data_topic, size=6)


# We can also graph the number of comments over time, and (TODO) graph these comments by topic over time.

# In[11]:


# num comments of each topic type by date
data_topic = data_comments[[utils.COL_LDA, utils.COL_TIMESTAMP]]
data_topic[utils.COL_TIMESTAMP] = pd.to_datetime(data_topic[utils.COL_TIMESTAMP])
data_topic = data_topic.set_index(utils.COL_TIMESTAMP)
data_topic['day'] = data_topic.index.date
counts = data_topic.groupby(['day', utils.COL_LDA]).agg(len)
print(counts.tail())

# plotting
sns.set_context("poster")
data_topic = data_comments[[utils.COL_LDA, utils.COL_TIMESTAMP]]
data_topic[utils.COL_TIMESTAMP] = pd.to_datetime(data_topic[utils.COL_TIMESTAMP])
data_topic.set_index(utils.COL_TIMESTAMP)
date_counts = data_topic[utils.COL_TIMESTAMP].value_counts()
date_plot = date_counts.resample('d',how=sum).plot(title="Total Comments By Date",legend=None)
date_plot.set(ylim=(0, 100))


# ### Research Questions
# Basic descriptive statistics and bar graphs are simple enough to acquire, but what we might want to know is **what topic is the most popular to write about** and **what topic are students requesting the most help on**? To do this, we need to find out how many comments there are on a particular comment, and how many of these comments are help requests.

# In[14]:


def count_instances(df,uid,column): 
    """
    Return dataframe of total counts for each unique value in COLUMN, by UID
    :param df: the dataframe containing the data of interest
    :param uid: the unique index/ID to group the counts by (i.e., author user id)
    :param column: the column containing the things to count (i.e., help requests)
    :return: True if the string contains the letter 'y'
    """
    df['tot_comments'] = 1
    for item in df[column].unique():
        colname = "help_requests" if item == "True" else "other_comments"
        df['num_%s' % colname] = df[column].apply(lambda value: 1 if value == str(item) else 0)
        
    new_data = df.groupby(uid).sum()
    
    """#calculating percents
    cols = [col for col in new_data.columns if 'num' in col]
    for col in cols:
        new_data[col.replace('num','percent')] = new_data[col] / new_data['tot_comments'] * 100
    """    
    return new_data

data_topic = data_comments[[utils.COL_LDA, utils.COL_TIMESTAMP, utils.COL_HELP]]

df_by_topic = count_instances(data_topic, utils.COL_LDA, utils.COL_HELP)  # calculate number help requests per topic

data_help = data_comments[[utils.COL_ID, utils.COL_HELP]]
data_help[utils.COL_HELP] = data_help[utils.COL_HELP].astype('str')
df_help_counts = count_instances(data_help, utils.COL_ID, utils.COL_HELP)  # calculate number help requests per user

# merging our help counts table with original data
data[utils.COL_ID] = data[utils.COL_ID].astype('str')
indexed_data = data.set_index([utils.COL_ID])
combined_df = pd.concat([indexed_data, df_help_counts], axis=1, join_axes=[indexed_data.index])

# Replace NaN in help_requests/other_comments cols with '0'
combined_df["num_help_requests"].fillna(value=0, inplace=True)
combined_df["num_other_comments"].fillna(value=0, inplace=True)


# ## Help Seeking
# I used an extremely naive set of rules to determine if a comment was a help request, such as identifying if a question mark is included in the message (or the word 'question', or 'struggle' or 'stuck', etc) I identified which entries are most likely help requests and which are not. In the future, a coding scheme should be developed to determine what kinds of help are being sought.

# In[35]:


# example code - does not need to execute
def is_help_topic(sentence):
    if "help" in sentence or "question" in sentence or "?" in sentence or "dunno" in sentence or "n't know" in sentence:
        return True
    if "confus" in sentence or "struggl" in sentence or "lost" in sentence or "stuck" in sentence or "know how" in sentence:
        return True
    return False

# if you're missing the log processing, you can get an 'isHelpSeeking' column with something like this code:
data_comments[utils.COL_HELP] = data_comments[utils.COL_COMMENT].apply(lambda x: is_help_topic(x))


# In[15]:


# histogram of help requests
sns.factorplot(utils.COL_HELP, data=data_help, size=6)
data_help.describe()

# plotting
"""
sns.set_context("poster")
data_by_date = data[[utils.COL_HELP, utils.COL_TIMESTAMP]]
data_by_date[utils.COL_TIMESTAMP] = pd.to_datetime(data_by_date[utils.COL_TIMESTAMP])
data_by_date.set_index(utils.COL_TIMESTAMP)
date_counts = data_by_date[utils.COL_TIMESTAMP].value_counts()
date_plot = date_counts.resample('d',how=sum).plot(title="Total Help Requests By Date",legend=None)
date_plot.set(ylim=(0, 100))
"""


# Now, instead of using 'numComments' as our outcome variable, we can use 'numHelpRequests' as the outcome variable. But first we have to calculate that for each user. 

# In[18]:


data_help = data_comments[[utils.COL_AUTHOR, utils.COL_HELP]]
data_help.rename(columns={utils.COL_AUTHOR:utils.COL_ID}, inplace=True)  #rename so we can merge later
data_help[utils.COL_ID] = data_help[utils.COL_ID].astype('str')
data_help[utils.COL_HELP] = data_help[utils.COL_HELP].astype('str')

# merging our help counts table with original data
df_help_counts.drop("tot_comments", axis=1, inplace=True)
data[utils.COL_ID] = data[utils.COL_ID].astype('str')
indexed_data = data.set_index([utils.COL_ID])
combined_df = pd.concat([indexed_data, df_help_counts], axis=1, join_axes=[indexed_data.index])

# Replace NaN in help_requests/other_comments cols with '0'
combined_df["num_help_requests"].fillna(value=0, inplace=True)
combined_df["num_other_comments"].fillna(value=0, inplace=True)


# In[17]:


outcome = "num_help_requests"
col_names = [utils.COL_NEG_VOTE, utils.COL_PROMPTS, covariate, outcome]
factor_groups = combined_df[col_names].dropna()

formula = outcome + " ~ C(" + col_names[0] + ") + C(" + col_names[1] + ") + " + covariate
formula_interaction = outcome + " ~ C(" + col_names[0] + ") * C(" + col_names[1] + ") + " + covariate

print("= = = = = = = = " + formula + " = = = = = = = =")
lm = ols(formula, data=factor_groups).fit()  # linear model: AIC 416
print(lm.summary())

print("\n= = = = = = = = " + formula_interaction + " = = = = = = = =")
lm_interaction = ols(formula_interaction, data=factor_groups).fit()  # interaction linear model: AIC 414
print(lm_interaction.summary())

# We can test if they're significantly different with an ANOVA (neither is sig. so not necessary)
print("= = = = = = = = = = = = = = = = = = = = = = = = = = = = = =")
print("= = " + formula + " ANOVA = = ")
print("= = vs. " + formula_interaction + " = =")
print(anova_lm(lm, lm_interaction))

# plotting
sns.set_context("notebook") 
ax3 = sns.factorplot(x=col_names[1], hue=col_names[0], y=outcome, data=factor_groups, kind='point', ci=95)
ax3.set(ylim=(0, 45))
sns.set_context("poster")  # larger plots
ax4 = factor_groups.boxplot(return_type='axes', column=outcome, by=[col_names[0], col_names[1]])


# ## Conclusion
# In conclusion, ...TODO: there's still more analysis to do...

# ---
# ## Documentation
# Find this project in [Github: UberHowley/SPOC-File-Processing](https://github.com/UberHowley/spoc-file-processing)
# 
# 
#