Find this project on GitHub: UberHowley/SPOC-File-Processing
In previous work detailed here, I found that up- and down-voting had a significant negative effect on the number of peer helpers MOOC students invited to their help-seeking threads. Results from an earlier survey experiment also suggested that framing tasks with a learning-oriented (or value-emphasized) instruction decreased evaluation anxiety, the hypothesized mechanism through which voting inhibits help seeking. Following on this work, I wanted to investigate how voting impacts help seeking in more traditional online course discussion forums, and whether we can alleviate the potential costs to effective participation while still maintaining the benefits of voting in forums.
This experiment took place in a Small Private Online Course (SPOC) with a more naturalistic up/downvoting setup in a discussion forum, and it used prompts that emphasized the value of participating in forums.
I used Python to clean the logfile data, gensim to assign automated [LDA] topics to each message board post, and pandas (with statsmodels) to perform statistical analyses to answer my research questions.
Our QuickHelper system was designed to advise students on which peers they might want to invite to help them, but also to answer theory-based questions about student motivation and decision-making in the face of common interactional archetypes employed in MOOCs. This yielded the research questions addressed below.
The parallel computing course in which this experiment took place met in person twice per week for an hour and twenty minutes, but all lecture slides (of which there were 28 sets) were posted online. Each slide had its own discussion-forum-like comments section beneath it, as shown below. Students enrolled in the course were required to "individually contribute one interesting comment per lecture using the course web site" for a 5% participation grade. The experiment began in the TODO week of the course and ran for TODO weeks.
Approximately 70 consenting students are included in this 3x2 (voting condition by prompting condition) experiment.
There were three voting conditions in this dimension: no voting, upvoting only, and both up- and downvoting, as shown below. Each student was assigned to one of these conditions and only ever saw that voting condition. Cross-contamination of voting conditions was possible if one student looked at a peer's screen, but the instructors saw no physical evidence of this until the last week of the course.
Students were also assigned to only one of two prompting conditions: positive (learning-oriented/value-emphasis) or neutral. These prompts took the form of an email sent through the system, consisting of a preset "welcome prompt" followed by a customizable "instructional prompt." The welcome prompt was either value-emphasis or neutral in tone, and the instructional prompt was either a general 'restate' request or a more specific prompt asking the student to answer a particular question. The welcome prompts were predefined and not customizable by the sender, while the instructional portion was customizable; supporting this instructor customization was necessary to maintain the utility of the email prompts.
(Prompt text examples: Neutral Welcome Prompt, Positive Welcome Prompt, Context - Restate (customizable), and Context - Question (customizable).)
A course instructor might decide to prompt some students to answer another student's question, as this is a valuable learning activity. When a course instructor decided to send an email prompt, the system would randomly select two students. The instructor would select and customize the instructional prompt, and upon submitting, the system would send each of the two students an email containing an appropriate version of a randomly selected welcome prompt followed by the instructional prompt. An example email prompt is shown below:
Processing the logfiles mostly involves: (1) ensuring that only consenting students are included in the dataset, (2) assigning an LDA topic to each comment, (3) removing student names from the dataset, (4) removing course instructors from the dataset, (5) removing spaces from column headers, and (6) removing approximately 241 comments that were posted to a lecture slide from more than 2-3 weeks earlier (i.e., attempted cramming for an increased participation grade).
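As a rough sketch of what these cleaning steps look like in pandas (the column names, ID sets, file name, and the three-week cutoff below are hypothetical stand-ins, not the actual log-processing code):
# minimal sketch of the cleaning steps; column names, ID sets, and the cutoff are hypothetical
import pandas as pd
CONSENTING_IDS = {"s01", "s02", "s03"}  # placeholder roster of consenting students
INSTRUCTOR_IDS = {"prof01", "ta01"}     # placeholder instructor/TA accounts
raw = pd.read_csv("raw_comments.csv", encoding="utf8")
# (1) keep only consenting students and (4) drop course instructors
raw = raw[raw["author_id"].isin(CONSENTING_IDS) & ~raw["author_id"].isin(INSTRUCTOR_IDS)]
# (3) drop student names so only anonymized IDs remain
raw = raw.drop("author_name", axis=1)
# (5) remove spaces from column headers
raw.columns = [col.replace(" ", "") for col in raw.columns]
# (6) drop comments posted to a lecture slide more than a few weeks after it went up
raw["timestamp"] = pd.to_datetime(raw["timestamp"])
raw["lecture_date"] = pd.to_datetime(raw["lecture_date"])
raw = raw[(raw["timestamp"] - raw["lecture_date"]) <= pd.Timedelta(weeks=3)]
# (2) LDA topic assignment is handled separately with gensim (see further below)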
I used the pandas and statsmodels libraries to run descriptive statistics, generate plots to better understand the data, and answer our research questions (from above). The main effects of interest were the categorical condition variables, voting (updownvote, upvote, novote) and prompts (positive, neutral), along with a variety of scalar dependent variables (number of comments, comment quality, help seeking, and learning).
# necessary libraries/setup
%matplotlib inline
import utilsSPOC as utils # separate file that contains all the constants we need
import pandas as pd
import matplotlib.pyplot as plt
# need a few more libraries for ANOVA analysis
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
# importing seaborns for its factorplot
import seaborn as sns
sns.set_style("darkgrid")
sns.set_context("notebook")
data_comments = pd.read_csv("150512_spoc_comments_lda.csv", encoding="utf8")
data_prompts = pd.read_csv("150512_spoc_prompts_mod.csv", encoding="utf8")
data = pd.read_csv("150512_spoc_full_data_mod.csv", encoding="utf-8-sig")
conditions = [utils.COL_VOTING, utils.COL_PROMPTS] # all our categorical IVs of interest
covariate = utils.COL_NUM_PROMPTS # number of prompts received needs to be controlled for
outcome = utils.COL_NUM_LEGIT_COMMENTS
Descriptive statistics showed that the mean number of comments students posted was 19 and the median was 17. Students also saw a mean and median of 1 email prompt.
df = data[[covariate, outcome]]
df.describe()
# note: the log-processing removes the Prompting condition of any student who did not see a prompt
# histogram of num comments
fig = plt.figure()
ax1 = fig.add_subplot(121)
comments_hist = data[outcome]
ax1 = comments_hist.plot(kind='hist', title="Histogram "+outcome, by=outcome)
ax1.locator_params(axis='x')
ax1.set_xlabel(outcome)
ax1.set_ylabel("Count")
ax2 = fig.add_subplot(122)
prompts_hist = data[covariate]
ax2 = prompts_hist.plot(kind='hist', title="Histogram "+covariate, by=covariate)
ax2.locator_params(axis='x')
ax2.set_xlabel(covariate)
ax2.set_ylabel("Count")
fig.tight_layout()
# plotting num prompts per day
sns.set_context("poster")
plot_prompts = data_prompts[[utils.COL_ENCOURAGEMENT_TYPE, utils.COL_TSTAMP]]
plot_prompts[utils.COL_TSTAMP] = pd.to_datetime(plot_prompts[utils.COL_TSTAMP])
# .dt.date drops the time portion of the timestamp so prompts can be counted per calendar day
date_counts = plot_prompts[utils.COL_TSTAMP].dt.date.value_counts().sort_index()
date_plot = date_counts.plot(title="Total Prompts By Date", legend=None, kind='bar')
date_plot.set(ylim=(0, 25))
When looking at the independent variables, we see an even random assignment to Condition. Initially, the assignment to EncouragementType wasn't perfectly even (41 vs. 27), but once we removed the prompting condition assignment of any student who did not receive any prompts, the distribution became more even.
df = data[conditions+[covariate, outcome]]
for cond in conditions:
print(pd.concat([df.groupby(cond)[cond].count(), df.groupby(cond)[covariate].mean(), df.groupby(cond)[outcome].mean()], axis=1))
The research questions require a bit of statistics to answer. In the case of a single factor with two levels, we use a t-test to determine whether the independent variable in question has a significant effect on the outcome variable. In the case of a single factor with more than two levels, we use a one-way Analysis of Variance (ANOVA). With more than one factor, we use a two-way ANOVA. These are all essentially the same underlying linear model (ordinary least squares), with slightly different statistics for determining significance.
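As a concrete illustration of that last point, the sketch below uses made-up data (not the course data; the column names are illustrative, and it reuses the pandas and ols imports from above) to show that a pooled two-sample t-test and an OLS model with one two-level factor yield the same p-value:
# illustration on made-up data: a pooled two-sample t-test and an OLS model
# with a single two-level factor give the same p-value
import numpy as np
from scipy import stats
toy = pd.DataFrame({
    "prompt": ["positive"] * 30 + ["neutral"] * 30,
    "numComments": np.concatenate([np.random.normal(20, 5, 30), np.random.normal(18, 5, 30)])})
t_stat, p_val = stats.ttest_ind(toy.loc[toy["prompt"] == "positive", "numComments"],
                                toy.loc[toy["prompt"] == "neutral", "numComments"])
toy_lm = ols("numComments ~ C(prompt)", data=toy).fit()
print(p_val, toy_lm.pvalues["C(prompt)[T.positive]"])  # identical up to floating point error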
(VotingCondition --> numComments)
To answer this question, we run a one-way ANOVA (since there are more than two levels to this one factor). Since the p-value is above 0.05 (p = 0.72), we fail to reject the null hypothesis that there is no difference between voting conditions in the number of posted comments.
cond = utils.COL_VOTING
df = data[[cond, outcome]].dropna()
cond_lm = ols(outcome + " ~ C(" + cond + ")", data=df).fit()
anova_table = anova_lm(cond_lm)
print(anova_table)
print(cond_lm.summary())
# boxplot
fig = plt.figure()
ax = fig.add_subplot(111)
ax = df.boxplot(outcome, cond, ax=plt.gca())
ax.set_xlabel(cond)
ax.set_ylabel(outcome)
fig.tight_layout()
(EncouragementType + numPrompts --> numComments)
We run an ANCOVA (since we need to control for the number of prompts each student saw) and see that the effect of prompt type is once again not significant (p = 0.23).
cond = utils.COL_PROMPTS
df = data[[cond, covariate, outcome]].dropna()
cond_lm = ols(outcome + " ~ C(" + cond + ") + " + covariate, data=df).fit()
anova_table = anova_lm(cond_lm)
print(anova_table)
print(cond_lm.summary())
# boxplot
fig = plt.figure()
ax = fig.add_subplot(111)
ax = df.boxplot(outcome, cond, ax=plt.gca())
ax.set_xlabel(cond)
ax.set_ylabel(outcome)
fig.tight_layout()
(Condition X EncouragementType + numPrompts --> numComments)
The OLS output shows that neither the additive model nor the interactive/multiplicative model is significant (p = 0.53 vs. p = 0.47). The AIC scores are quite similar, so we'd select the model with the lower AIC score as the better fit to the data.
col_names = [utils.COL_VOTING, utils.COL_PROMPTS, covariate, outcome]
factor_groups = data[col_names].dropna()
formula = outcome + " ~ C(" + col_names[0] + ") + C(" + col_names[1] + ") + " + covariate
formula_interaction = outcome + " ~ C(" + col_names[0] + ") * C(" + col_names[1] + ") + " + covariate
print("= = = = = = = = " + formula + " = = = = = = = =")
lm = ols(formula, data=factor_groups).fit() # linear model: AIC 418
print(lm.summary())
print("\n= = = = = = = = " + formula_interaction + " = = = = = = = =")
lm_interaction = ols(formula_interaction, data=factor_groups).fit() # interaction linear model: AIC 418.1
print(lm_interaction.summary())
# We can test if they're significantly different with an ANOVA (neither is sig. so not necessary)
print("= = = = = = = = = = = = = = = = = = = = = = = = = = = = = =")
print("= = " + formula + " ANOVA = = ")
print("= = vs. " + formula_interaction + " = =")
print(anova_lm(lm, lm_interaction))
# These are ANOVA tests for determining whether different models are significantly different from each other.
# We already know from the previous step that none of these models are significant, so these tests aren't strictly necessary,
# but the ANOVA output provides the F-statistics needed for reporting results.
# From: http://statsmodels.sourceforge.net/devel/examples/generated/example_interactions.html
# Tests whether the LM of just Condition is significantly different from the additive LM
print("= = = = = = = = = = = = = = = = = = = = = = = = = = = = = =")
f_just_first = outcome + " ~ C(" + col_names[0] + ")"
print("= = " + f_just_first + " ANOVA = = ")
print("= = vs. " + formula + " = =")
print(anova_lm(ols(f_just_first, data=factor_groups).fit(), ols(formula, data=factor_groups).fit()))
# Testing whether the LM of just EncouragementType is significantly different from the additive LM
print("= = = = = = = = = = = = = = = = = = = = = = = = = = = = = =")
f_just_second = outcome + " ~ C(" + col_names[1] + ") + " + covariate
print("= = " + f_just_second + " = = ")
print("= = vs. " + formula + " = =")
print(anova_lm(ols(f_just_second, data=factor_groups).fit(), ols(formula, data=factor_groups).fit()))
# plotting the factor means with seaborn factorplots
sns.set_context("notebook")
ax1 = sns.factorplot(x=col_names[1], y=outcome, data=factor_groups, kind='point', ci=95)
ax1.set(ylim=(0, 45))
ax2 = sns.factorplot(x=col_names[0], y=outcome, data=factor_groups, kind='point', ci=95)
ax2.set(ylim=(0, 45))
ax3 = sns.factorplot(x=col_names[1], hue=col_names[0], y=outcome, data=factor_groups, kind='point', ci=95)
ax3.set(ylim=(0, 45))
This last interaction plot is interesting and suggests that we might want to compare the "downvote" condition against the "non-downvote" conditions, a split captured in our 'COL_NEG_VOTE' column. Neither model appears significant, and an additional test shows that neither model performs statistically better than the other (p = 0.1).
# sample code for calculating new column reformulation
def get_neg_vote(voting_cond):
if voting_cond == utils.COND_VOTE_NONE or voting_cond == utils.COND_VOTE_UP:
return utils.COND_OTHER
elif voting_cond == utils.COND_VOTE_BOTH:
return utils.COND_VOTE_BOTH
else:
return ""
# if you're missing the log processing, you can get a 'negativeVoting' column with something like this code:
data[utils.COL_NEG_VOTE] = data[utils.COL_VOTING].apply(lambda x: get_neg_vote(x))
col_names = [utils.COL_NEG_VOTE, utils.COL_PROMPTS, covariate, outcome]
factor_groups = data[col_names].dropna()
formula = outcome + " ~ C(" + col_names[0] + ") + C(" + col_names[1] + ") + " + covariate
formula_interaction = outcome + " ~ C(" + col_names[0] + ") * C(" + col_names[1] + ") + " + covariate
print("= = = = = = = = " + formula + " = = = = = = = =")
lm = ols(formula, data=factor_groups).fit() # linear model: AIC 416
print(lm.summary())
print("\n= = = = = = = = " + formula_interaction + " = = = = = = = =")
lm_interaction = ols(formula_interaction, data=factor_groups).fit() # interaction linear model: AIC 414
print(lm_interaction.summary())
# We can test if they're significantly different with an ANOVA (neither is sig. so not necessary)
print("= = = = = = = = = = = = = = = = = = = = = = = = = = = = = =")
print("= = " + formula + " ANOVA = = ")
print("= = vs. " + formula_interaction + " = =")
print(anova_lm(lm, lm_interaction))
# plotting
sns.set_context("notebook")
ax3 = sns.factorplot(x=col_names[1], hue=col_names[0], y=outcome, data=factor_groups, kind='point', ci=95)
ax3.set(ylim=(0, 45))
sns.set_context("poster") # larger plots
ax4 = factor_groups.boxplot(return_type='axes', column=outcome, by=[col_names[0], col_names[1]])
This analysis has so far only looked at the number of comments as the dependent variable, on which there does not appear to be any significant effect. However, comment quality, help seeking, and learning are all important dependent variables to examine as well. This analysis is still in progress.
I used gensim to automatically apply topics to each forum comment.
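The topic-modeling script itself isn't reproduced here, but a minimal gensim sketch of how a topic can be assigned to each comment looks roughly like this (the toy comments, whitespace tokenization, and num_topics=5 are assumptions for illustration, not the settings used on the real corpus):
# rough sketch of labeling comments with LDA topics via gensim; the tokenization
# and num_topics here are illustrative, not the settings used on the real data
from gensim import corpora, models
toy_comments = ["how does cache coherence work here",
                "the openmp scheduling example was useful",
                "stuck on the message passing assignment"]
tokenized = [comment.lower().split() for comment in toy_comments]
dictionary = corpora.Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]
lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=5, passes=10)
# label each comment with its single most probable topic
topic_ids = [max(lda.get_document_topics(bow), key=lambda pair: pair[1])[0] for bow in bow_corpus]
print(topic_ids)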
# basic descriptive statistics
data_topic = data_comments[[utils.COL_LDA]]
data_help = data_comments[[utils.COL_HELP]]
print(data_topic[utils.COL_LDA].describe())
# histogram of LDA topics
sns.set_context("notebook")
sns.factorplot(utils.COL_LDA, data=data_topic, size=6)
We can also graph the number of comments over time, and (TODO) graph these comments by topic over time.
# num comments of each topic type by date
data_topic = data_comments[[utils.COL_LDA, utils.COL_TIMESTAMP]]
data_topic[utils.COL_TIMESTAMP] = pd.to_datetime(data_topic[utils.COL_TIMESTAMP])
data_topic = data_topic.set_index(utils.COL_TIMESTAMP)
data_topic['day'] = data_topic.index.date
counts = data_topic.groupby(['day', utils.COL_LDA]).agg(len)
print(counts.tail())
# plotting total comments per day
sns.set_context("poster")
data_topic = data_comments[[utils.COL_LDA, utils.COL_TIMESTAMP]]
data_topic[utils.COL_TIMESTAMP] = pd.to_datetime(data_topic[utils.COL_TIMESTAMP])
# count comments per timestamp, then roll the counts up into daily totals
date_counts = data_topic[utils.COL_TIMESTAMP].value_counts().sort_index()
date_plot = date_counts.resample('d').sum().plot(title="Total Comments By Date", legend=None)
date_plot.set(ylim=(0, 100))
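One possible way to tackle the "by topic over time" plot mentioned above is to pivot the per-day, per-topic counts into one column per topic and plot them together; a sketch (not part of the original analysis) is below:
# sketch: pivot daily comment counts by LDA topic and plot one line per topic
data_topic = data_comments[[utils.COL_LDA, utils.COL_TIMESTAMP]]
data_topic[utils.COL_TIMESTAMP] = pd.to_datetime(data_topic[utils.COL_TIMESTAMP])
data_topic['day'] = data_topic[utils.COL_TIMESTAMP].dt.date
topic_counts = data_topic.groupby(['day', utils.COL_LDA]).size().unstack().fillna(0)
topic_plot = topic_counts.plot(title="Comments By Topic By Date")
topic_plot.set_ylabel("Number of comments")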
Basic descriptive statistics and bar graphs are simple enough to acquire, but what we really want to know is which topic is the most popular to write about and which topic students are requesting the most help on. To answer this, we need to count how many comments there are on each topic, and how many of those comments are help requests.
def count_instances(df,uid,column):
"""
Return dataframe of total counts for each unique value in COLUMN, by UID
:param df: the dataframe containing the data of interest
:param uid: the unique index/ID to group the counts by (i.e., author user id)
:param column: the column containing the things to count (i.e., help requests)
:return: dataframe of total comments, help requests, and other comments counted per UID
"""
df['tot_comments'] = 1
for item in df[column].unique():
colname = "help_requests" if item == "True" else "other_comments"
df['num_%s' % colname] = df[column].apply(lambda value: 1 if value == str(item) else 0)
new_data = df.groupby(uid).sum()
"""#calculating percents
cols = [col for col in new_data.columns if 'num' in col]
for col in cols:
new_data[col.replace('num','percent')] = new_data[col] / new_data['tot_comments'] * 100
"""
return new_data
data_topic = data_comments[[utils.COL_LDA, utils.COL_TIMESTAMP, utils.COL_HELP]]
df_by_topic = count_instances(data_topic, utils.COL_LDA, utils.COL_HELP) # calculate number help requests per topic
data_help = data_comments[[utils.COL_ID, utils.COL_HELP]]
data_help[utils.COL_HELP] = data_help[utils.COL_HELP].astype('str')
df_help_counts = count_instances(data_help, utils.COL_ID, utils.COL_HELP) # calculate number help requests per user
# merging our help counts table with original data
data[utils.COL_ID] = data[utils.COL_ID].astype('str')
indexed_data = data.set_index([utils.COL_ID])
combined_df = pd.concat([indexed_data, df_help_counts], axis=1, join_axes=[indexed_data.index])
# Replace NaN in help_requests/other_comments cols with '0'
combined_df["num_help_requests"].fillna(value=0, inplace=True)
combined_df["num_other_comments"].fillna(value=0, inplace=True)
I used an extremely naive set of rules to determine whether a comment was a help request, such as checking whether the message contains a question mark (or the word 'question', 'struggle', 'stuck', etc.), to identify which entries are most likely help requests and which are not. In the future, a coding scheme should be developed to determine what kinds of help are being sought.
# example code - does not need to execute
def is_help_topic(sentence):
if "help" in sentence or "question" in sentence or "?" in sentence or "dunno" in sentence or "n't know" in sentence:
return True
if "confus" in sentence or "struggl" in sentence or "lost" in sentence or "stuck" in sentence or "know how" in sentence:
return True
return False
# if you're missing the log processing, you can get an 'isHelpSeeking' column with something like this code:
data_comments[utils.COL_HELP] = data_comments[utils.COL_COMMENT].apply(lambda x: is_help_topic(x))
# histogram of help requests
sns.factorplot(utils.COL_HELP, data=data_help, size=6)
data_help.describe()
# plotting
"""
sns.set_context("poster")
data_by_date = data[[utils.COL_HELP, utils.COL_TIMESTAMP]]
data_by_date[utils.COL_TIMESTAMP] = pd.to_datetime(data_by_date[utils.COL_TIMESTAMP])
data_by_date.set_index(utils.COL_TIMESTAMP)
date_counts = data_by_date[utils.COL_TIMESTAMP].value_counts()
date_plot = date_counts.resample('d',how=sum).plot(title="Total Help Requests By Date",legend=None)
date_plot.set(ylim=(0, 100))
"""
Now, instead of using 'numComments' as our outcome variable, we can use 'numHelpRequests' as the outcome variable. But first we have to calculate that for each user.
data_help = data_comments[[utils.COL_AUTHOR, utils.COL_HELP]]
data_help.rename(columns={utils.COL_AUTHOR:utils.COL_ID}, inplace=True) #rename so we can merge later
data_help[utils.COL_ID] = data_help[utils.COL_ID].astype('str')
data_help[utils.COL_HELP] = data_help[utils.COL_HELP].astype('str')
# merging our help counts table with original data
df_help_counts.drop("tot_comments", axis=1, inplace=True)
data[utils.COL_ID] = data[utils.COL_ID].astype('str')
indexed_data = data.set_index([utils.COL_ID])
combined_df = pd.concat([indexed_data, df_help_counts], axis=1, join_axes=[indexed_data.index])
# Replace NaN in help_requests/other_comments cols with '0'
combined_df["num_help_requests"].fillna(value=0, inplace=True)
combined_df["num_other_comments"].fillna(value=0, inplace=True)
outcome = "num_help_requests"
col_names = [utils.COL_NEG_VOTE, utils.COL_PROMPTS, covariate, outcome]
factor_groups = combined_df[col_names].dropna()
formula = outcome + " ~ C(" + col_names[0] + ") + C(" + col_names[1] + ") + " + covariate
formula_interaction = outcome + " ~ C(" + col_names[0] + ") * C(" + col_names[1] + ") + " + covariate
print("= = = = = = = = " + formula + " = = = = = = = =")
lm = ols(formula, data=factor_groups).fit() # additive linear model
print(lm.summary())
print("\n= = = = = = = = " + formula_interaction + " = = = = = = = =")
lm_interaction = ols(formula_interaction, data=factor_groups).fit() # interaction linear model
print(lm_interaction.summary())
# We can test if they're significantly different with an ANOVA (neither is sig. so not necessary)
print("= = = = = = = = = = = = = = = = = = = = = = = = = = = = = =")
print("= = " + formula + " ANOVA = = ")
print("= = vs. " + formula_interaction + " = =")
print(anova_lm(lm, lm_interaction))
# plotting
sns.set_context("notebook")
ax3 = sns.factorplot(x=col_names[1], hue=col_names[0], y=outcome, data=factor_groups, kind='point', ci=95)
ax3.set(ylim=(0, 45))
sns.set_context("poster") # larger plots
ax4 = factor_groups.boxplot(return_type='axes', column=outcome, by=[col_names[0], col_names[1]])
In conclusion, ...TODO: there's still more analysis to do...