In T244872 and T245794, the reply tool was released as an opt-in beta feature on the following partner wikis: Arabic, Dutch, French, and Hungarian. Version 1.0 of the reply tool was deployed to the partner wikis on March 31, 2020, and Version 2.0 was deployed on June 17, 2020.
We would like to know how the tool is being used and adopted by the partner wikis as a Beta Feature prior to deploying the tool to all volunteers as an opt-out preference on the four partner wikis (T249394).
Data for this analysis comes from the PrefUpdate table, the user properties table, and the mediawiki_history table.
For metrics that were calculated using mediawiki_history, we reviewed data from the release of Version 1.0 on March 31 through the end of June 2020 (the most recent data available at the time of this analysis).
library(IRdisplay)
display_html(
'<script>
code_show=true;
function code_toggle() {
if (code_show){
$(\'div.input\').hide();
} else {
$(\'div.input\').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()">
<input type="submit" value="Click here to toggle on/off the raw code.">
</form>'
)
shhh <- function(expr) suppressPackageStartupMessages(suppressWarnings(suppressMessages(expr)))
shhh({
library(tidyverse); library(glue); library(lubridate); library(scales)
})
Notes:
# Collect users max reply edits over time period and remove bots
query <- "
SELECT
wiki,
reply_user,
max(reply_edits) as reply_edit_count
FROM (
SELECT
wiki_db AS wiki,
event_user_id as reply_user,
max(size(event_user_is_bot_by) > 0 or size(event_user_is_bot_by_historical) > 0) as bot_by_group,
Count(*) as reply_edits
FROM wmf.mediawiki_history
WHERE
ARRAY_CONTAINS(revision_tags, 'discussiontools') AND
snapshot = '2020-06' AND
event_timestamp >= '2020-03-31' AND
event_timestamp <= '2020-04-30' AND
wiki_db IN ('arwiki','nlwiki', 'frwiki', 'huwiki') AND
event_entity = 'revision' AND
event_type = 'create'
GROUP BY
wiki_db,
event_user_id
) edits
WHERE not bot_by_group
GROUP BY reply_user, wiki"
test <- wmfdata::query_hive(query)
Don't forget to authenticate with Kerberos using kinit
load("Data/reply_edit_users.RData")
reply_edit_users <- results
Warning message in readChar(con, 5L, useBytes = TRUE): “cannot open compressed file 'Data/reply_edit_users.RData', probable reason 'No such file or directory'”
Error in readChar(con, 5L, useBytes = TRUE): cannot open the connection Traceback: 1. load("Data/reply_edit_users.RData") 2. readChar(con, 5L, useBytes = TRUE)
num_users <- test %>%
summarise(n_users = n())
num_users
n_users |
---|
<int> |
161 |
# Find overall number of users that made only 1 edit
reply_edits_overall_1edit <- reply_edit_users %>%
filter(reply_edit_count == 1) %>%
summarise(n_users = n(),
percent_users = n_users/328 *100) #divide by total number of reply tool users overall
reply_edits_overall_1edit
n_users | percent_users |
---|---|
<int> | <dbl> |
100 | 30.4878 |
A total of 328 users have successfully made at least one edit using the reply tool from 31 March 2020 through the end of June.
100 of those users (30.49%) have made only 1 edit. The majority (69.5%) have made at least 2 edits using the tool.
#Divide reply edits into edit count groups with bin width set to 5 edits
b <- c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, Inf)
names <- c('1-5 edits', '6-10 edits', '11-15 edits', '16-20 edits', '21-25 edits', '26-30 edits', '31-35 edits',
'36-40 edits', '41-45 edits', '46-50 edits', 'over 50 edits')
reply_edit_overall_bygroup <- reply_edit_users %>%
mutate(edit_count_group = cut(reply_edit_count, breaks = b, labels = names)) %>%
group_by(edit_count_group) %>%
summarise(n_users = n()) %>%
mutate(percent_reply_users = n_users/sum(n_users))
reply_edit_overall_bygroup
edit_count_group | n_users | percent_reply_users |
---|---|---|
<fct> | <int> | <dbl> |
1-5 edits | 201 | 0.612804878 |
6-10 edits | 22 | 0.067073171 |
11-15 edits | 18 | 0.054878049 |
16-20 edits | 7 | 0.021341463 |
21-25 edits | 10 | 0.030487805 |
26-30 edits | 8 | 0.024390244 |
31-35 edits | 3 | 0.009146341 |
36-40 edits | 7 | 0.021341463 |
41-45 edits | 3 | 0.009146341 |
46-50 edits | 2 | 0.006097561 |
over 50 edits | 47 | 0.143292683 |
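The binning above relies on `cut()` treating each interval as closed on the right by default, so a user with exactly 5 edits falls in the first group and a user with exactly 50 edits falls in '46-50 edits'. A minimal sketch of that behaviour, using the same break points and labels but hypothetical edit counts chosen to sit on the bin edges (`bin_labels` and `example_counts` are illustrative names, not part of the analysis):

```r
# Same break points and labels as the analysis above
b <- c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, Inf)
bin_labels <- c('1-5 edits', '6-10 edits', '11-15 edits', '16-20 edits',
                '21-25 edits', '26-30 edits', '31-35 edits', '36-40 edits',
                '41-45 edits', '46-50 edits', 'over 50 edits')

# Hypothetical edit counts that sit on the bin edges
example_counts <- c(1, 5, 6, 10, 50, 51)

# cut() uses right-closed intervals by default: (0,5], (5,10], ..., (50,Inf]
example_groups <- as.character(cut(example_counts, breaks = b, labels = bin_labels))

print(data.frame(edit_count = example_counts, group = example_groups))
```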
#chart overall users by edit group
options(repr.plot.width = 10, repr.plot.height = 7)
p <- reply_edit_overall_bygroup %>%
ggplot(aes(x=edit_count_group, y = percent_reply_users)) +
geom_bar(stat = 'identity', fill = 'darkblue') +
scale_y_continuous(labels = scales::percent) +
labs (y = "Percent of reply tool users",
x = "Number of edits with the reply tool",
title = "Reply tool users overall by edit count group",
subtitle = "31 March through 30 June 2020") +
theme_bw() +
theme(
plot.title = element_text(hjust = 0.5),
text = element_text(size=16),
axis.text.x = element_text(angle=45, hjust=1),
legend.position = "none")
p
ggsave("Figures/reply_edit_users_overall.png", p, width = 16, height = 8, units = "in", dpi = 300)
While the majority of reply tool users made more than 1 edit, most did not make more than 5 edits. Over half (61.28%) of reply tool users made between 1 and 5 edits using the tool. About half of those users (100 of 201) made only 1 edit.
32% of reply tool users made more than 10 edits using the tool. 12.2% made between 6 and 15 edits.
About 14.3% of users made over 50 edits during this timeframe. Note: This might include some automated traffic not self-identified as bots.
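The shares quoted above can be re-derived from the overall edit count group table. The sketch below copies the `n_users` column from that table by hand (the variable names are illustrative) and recomputes each share as a quick arithmetic check:

```r
# n_users per edit count group, copied from the overall table above
# (order: 1-5, 6-10, 11-15, ..., 46-50, over 50 edits)
n_users <- c(201, 22, 18, 7, 10, 8, 3, 7, 3, 2, 47)
total_users <- sum(n_users)

# Share of users with more than 10 edits (all groups from 11-15 upward)
share_over_10 <- sum(n_users[3:11]) / total_users * 100

# Share of users with 6 to 15 edits (groups 6-10 and 11-15)
share_6_to_15 <- sum(n_users[2:3]) / total_users * 100

# Share of the 1-5 edits group that made exactly 1 edit (100 one-edit users)
share_one_edit_in_first_group <- 100 / n_users[1] * 100

print(round(c(share_over_10, share_6_to_15, share_one_edit_in_first_group), 1))
```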
# Find number of users on each wiki that made only 1 edit
reply_edits_overall_1edit_bywiki <- reply_edit_users %>%
group_by(wiki) %>%
mutate(total_users = n()) %>%
filter(reply_edit_count == 1) %>%
group_by(wiki, total_users) %>%
summarise(one_edit_users = n()) %>%
mutate(percent_users = one_edit_users/total_users*100)
reply_edits_overall_1edit_bywiki
wiki | total_users | one_edit_users | percent_users |
---|---|---|---|
<chr> | <int> | <int> | <dbl> |
arwiki | 83 | 29 | 34.93976 |
frwiki | 153 | 48 | 31.37255 |
huwiki | 43 | 12 | 27.90698 |
nlwiki | 49 | 11 | 22.44898 |
On a per-wiki basis, the percent of reply tool users that made only 1 edit ranged from about 22% to 35%. Arabic Wikipedia had the highest percentage of one-edit users while Dutch Wikipedia had the lowest.
reply_edit_bywiki<- reply_edit_users %>%
mutate(edit_count_group = cut(reply_edit_count, breaks = b, labels = names)) %>%
group_by(wiki, edit_count_group) %>%
summarise(n_users = n()) %>%
mutate(percent_reply_users = n_users/sum(n_users))
reply_edit_bywiki
wiki | edit_count_group | n_users | percent_reply_users |
---|---|---|---|
<chr> | <fct> | <int> | <dbl> |
arwiki | 1-5 edits | 45 | 0.542168675 |
arwiki | 6-10 edits | 8 | 0.096385542 |
arwiki | 11-15 edits | 5 | 0.060240964 |
arwiki | 16-20 edits | 3 | 0.036144578 |
arwiki | 21-25 edits | 2 | 0.024096386 |
arwiki | 26-30 edits | 2 | 0.024096386 |
arwiki | 31-35 edits | 1 | 0.012048193 |
arwiki | over 50 edits | 17 | 0.204819277 |
frwiki | 1-5 edits | 102 | 0.666666667 |
frwiki | 6-10 edits | 11 | 0.071895425 |
frwiki | 11-15 edits | 6 | 0.039215686 |
frwiki | 16-20 edits | 2 | 0.013071895 |
frwiki | 21-25 edits | 4 | 0.026143791 |
frwiki | 26-30 edits | 3 | 0.019607843 |
frwiki | 31-35 edits | 1 | 0.006535948 |
frwiki | 36-40 edits | 5 | 0.032679739 |
frwiki | 41-45 edits | 3 | 0.019607843 |
frwiki | 46-50 edits | 1 | 0.006535948 |
frwiki | over 50 edits | 15 | 0.098039216 |
huwiki | 1-5 edits | 25 | 0.581395349 |
huwiki | 6-10 edits | 2 | 0.046511628 |
huwiki | 11-15 edits | 1 | 0.023255814 |
huwiki | 21-25 edits | 3 | 0.069767442 |
huwiki | 26-30 edits | 2 | 0.046511628 |
huwiki | 31-35 edits | 1 | 0.023255814 |
huwiki | 36-40 edits | 1 | 0.023255814 |
huwiki | over 50 edits | 8 | 0.186046512 |
nlwiki | 1-5 edits | 29 | 0.591836735 |
nlwiki | 6-10 edits | 1 | 0.020408163 |
nlwiki | 11-15 edits | 6 | 0.122448980 |
nlwiki | 16-20 edits | 2 | 0.040816327 |
nlwiki | 21-25 edits | 1 | 0.020408163 |
nlwiki | 26-30 edits | 1 | 0.020408163 |
nlwiki | 36-40 edits | 1 | 0.020408163 |
nlwiki | 46-50 edits | 1 | 0.020408163 |
nlwiki | over 50 edits | 7 | 0.142857143 |
# chart by wiki
p <- reply_edit_bywiki %>%
ggplot(aes(x=edit_count_group, y = percent_reply_users, fill = edit_count_group)) +
geom_bar(stat = 'identity', fill = 'darkblue') +
facet_wrap(~wiki) +
scale_y_continuous(labels = scales::percent) +
labs (y = "Percent of reply tool users",
x = "Number of reply tool edits",
title = "Reply tool users by edit count group and wiki",
subtitle = "31 March through 30 June 2020") +
theme_bw() +
theme(
plot.title = element_text(hjust = 0.5),
text = element_text(size=16),
axis.text.x = element_text(angle=45, hjust=1),
legend.position = "none")
p
ggsave("Figures/reply_edit_users_bywiki.png", p, width = 16, height = 8, units = "in", dpi = 300)
Between 54% and 67% of reply tool users on each partner wiki made 1 to 5 edits. After that, there is a general decrease in the number of users as the number of edits increases, with a couple of exceptions.
Dutch Wikipedia had more users that made 11 to 15 edits than 6 to 10 edits. Hungarian Wikipedia had more users that made 21 to 25 edits than 6 to 10 edits, and as many users that made 26 to 30 edits as 6 to 10 edits.
Notes:
query <- "
-- obtain distinct day of edits by user
SELECT
wiki,
reply_user,
COUNT(DISTINCT(day_of_activity)) AS days_of_activity
FROM (
SELECT
wiki_db AS wiki,
event_user_id as reply_user,
to_date(event_timestamp) as day_of_activity
FROM wmf.mediawiki_history
WHERE
ARRAY_CONTAINS(revision_tags, 'discussiontools') AND
snapshot = '2020-06' AND
event_timestamp >= '2020-03-31' AND
event_timestamp <= '2020-06-30' AND
wiki_db IN ('arwiki','nlwiki', 'frwiki', 'huwiki') AND
event_entity = 'revision' AND
event_type = 'create'
) edits
GROUP BY reply_user, wiki
"
results <- collect(sql(query))
save(results, file="Data/reply_days_activity.RData")
load("Data/reply_days_activity.RData")
reply_days_activity <- results
# Keep days_of_activity numeric; it is binned into labelled groups with cut() further below
reply_days_activity$days_of_activity <- as.integer(reply_days_activity$days_of_activity)
# Find overall number of users that used the tool on just 1 day
reply_days_overall_1day <- reply_days_activity %>%
filter(days_of_activity == 1) %>%
summarise(one_day_users = n(),
percent_users = one_day_users/328 *100)
reply_days_overall_1day
one_day_users | percent_users |
---|---|
<int> | <dbl> |
127 | 38.71951 |
38.72% of all reply tool users used the tool on just one day, indicating that the majority of users (61.3%) used the tool on two or more days.
#Divide days of activity into groups with bin width set to 5 days
b <- c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, Inf)
names <- c('1-5 days', '6-10 days', '11-15 days', '16-20 days', '21-25 days', '26-30 days', '31-35 days',
'36-40 days', '41-45 days', '46-50 days', 'over 50 days')
reply_days_overall_bygroup <- reply_days_activity %>%
mutate(days_activity_group = cut(days_of_activity, breaks = b, labels = names)) %>%
group_by(days_activity_group) %>%
summarise(n_users = n()) %>%
mutate(percent_reply_users = n_users/sum(n_users))
reply_days_overall_bygroup
days_activity_group | n_users | percent_reply_users |
---|---|---|
<fct> | <int> | <dbl> |
1-5 days | 224 | 0.68292683 |
6-10 days | 29 | 0.08841463 |
11-15 days | 11 | 0.03353659 |
16-20 days | 12 | 0.03658537 |
21-25 days | 9 | 0.02743902 |
26-30 days | 5 | 0.01524390 |
31-35 days | 11 | 0.03353659 |
36-40 days | 5 | 0.01524390 |
41-45 days | 5 | 0.01524390 |
46-50 days | 6 | 0.01829268 |
over 50 days | 11 | 0.03353659 |
#chart overall users by days of activity group
p <- reply_days_overall_bygroup %>%
ggplot(aes(x=days_activity_group, y = percent_reply_users)) +
geom_bar(stat = 'identity', fill = 'darkblue') +
scale_y_continuous(labels = scales::percent) +
labs (y = "Percent of reply tool users",
x = "Distinct days of activity",
title = "Reply tool users overall by days of activity",
subtitle = "31 March through 30 June 2020") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5),
text = element_text(size=16),
axis.text.x = element_text(angle=45, hjust=1),
legend.position = "none")
p
ggsave("Figures/reply_days_activity_overall.png", p, width = 16, height = 8, units = "in", dpi = 300)
68.3% of all reply tool users made edits on between 1 and 5 distinct days. About 56.7% of those users (127 of 224) made edits on only 1 distinct day; the remaining 43.3% made edits on 2 to 5 days.
The number of users generally decreases as the number of distinct days increases, with a couple of slight fluctuations.
# Find number of users on each wiki that used the tool on just 1 day
reply_days_1day_bywiki <- reply_days_activity %>%
group_by(wiki) %>%
mutate(total_users = n()) %>%
filter(days_of_activity == 1) %>%
group_by(wiki, total_users) %>%
summarise(one_day_users = n()) %>%
mutate(percent_users = one_day_users/total_users*100)
reply_days_1day_bywiki
wiki | total_users | one_day_users | percent_users |
---|---|---|---|
<chr> | <int> | <int> | <dbl> |
arwiki | 83 | 35 | 42.16867 |
frwiki | 153 | 62 | 40.52288 |
huwiki | 43 | 14 | 32.55814 |
nlwiki | 49 | 16 | 32.65306 |
Across the partner wikis, the percent of one-day users ranges from about 32.56% to 42.17%. As with the percent of users that made only 1 edit, Arabic Wikipedia had the highest percentage of one-day users. Hungarian Wikipedia had the lowest (32.56%), narrowly below Dutch Wikipedia (32.65%).
reply_days_bywiki <- reply_days_activity %>%
mutate(days_activity_group = cut(days_of_activity, breaks = b, labels = names)) %>%
group_by(wiki, days_activity_group) %>%
summarise(n_users = n()) %>%
mutate(percent_reply_users = n_users/sum(n_users))
reply_days_bywiki
wiki | days_activity_group | n_users | percent_reply_users |
---|---|---|---|
<chr> | <fct> | <int> | <dbl> |
arwiki | 1-5 days | 55 | 0.66265060 |
arwiki | 6-10 days | 7 | 0.08433735 |
arwiki | 11-15 days | 2 | 0.02409639 |
arwiki | 16-20 days | 2 | 0.02409639 |
arwiki | 21-25 days | 3 | 0.03614458 |
arwiki | 26-30 days | 1 | 0.01204819 |
arwiki | 31-35 days | 4 | 0.04819277 |
arwiki | 36-40 days | 1 | 0.01204819 |
arwiki | 41-45 days | 3 | 0.03614458 |
arwiki | 46-50 days | 2 | 0.02409639 |
arwiki | over 50 days | 3 | 0.03614458 |
frwiki | 1-5 days | 113 | 0.73856209 |
frwiki | 6-10 days | 11 | 0.07189542 |
frwiki | 11-15 days | 6 | 0.03921569 |
frwiki | 16-20 days | 4 | 0.02614379 |
frwiki | 21-25 days | 5 | 0.03267974 |
frwiki | 26-30 days | 2 | 0.01307190 |
frwiki | 31-35 days | 2 | 0.01307190 |
frwiki | 36-40 days | 3 | 0.01960784 |
frwiki | 46-50 days | 3 | 0.01960784 |
frwiki | over 50 days | 4 | 0.02614379 |
huwiki | 1-5 days | 27 | 0.62790698 |
huwiki | 6-10 days | 2 | 0.04651163 |
huwiki | 11-15 days | 1 | 0.02325581 |
huwiki | 16-20 days | 5 | 0.11627907 |
huwiki | 21-25 days | 1 | 0.02325581 |
huwiki | 26-30 days | 1 | 0.02325581 |
huwiki | 31-35 days | 2 | 0.04651163 |
huwiki | 36-40 days | 1 | 0.02325581 |
huwiki | 41-45 days | 1 | 0.02325581 |
huwiki | over 50 days | 2 | 0.04651163 |
nlwiki | 1-5 days | 29 | 0.59183673 |
nlwiki | 6-10 days | 9 | 0.18367347 |
nlwiki | 11-15 days | 2 | 0.04081633 |
nlwiki | 16-20 days | 1 | 0.02040816 |
nlwiki | 26-30 days | 1 | 0.02040816 |
nlwiki | 31-35 days | 3 | 0.06122449 |
nlwiki | 41-45 days | 1 | 0.02040816 |
nlwiki | 46-50 days | 1 | 0.02040816 |
nlwiki | over 50 days | 2 | 0.04081633 |
# numbers of days by wiki
p <- reply_days_bywiki %>%
ggplot(aes(x=days_activity_group, y = percent_reply_users)) +
geom_col(fill= 'darkblue') +
scale_y_continuous(labels = percent) +
facet_wrap(~wiki) +
labs (y = "Percent of reply tool users",
x = "Distinct days of activity",
title = "Reply tool users by days of activity and wiki",
subtitle = "31 March through 30 June 2020") +
theme_bw() +
theme(
plot.title = element_text(hjust = 0.5),
text = element_text(size=16),
axis.text.x = element_text(angle=45, hjust=1),
legend.position = "none")
p
ggsave("Figures/reply_days_activity_bywiki.png", p, width = 16, height = 8, units = "in", dpi = 300)
On each partner wiki, the majority of reply tool users made their edits on between 1 and 5 distinct days. The number of users generally decreases as the number of distinct days increases. A few notable exceptions:
Notes:
For the analysis below:
query <- "
SELECT
event.userid as user_id,
wiki,
event.value as beta_selection,
min(CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0'))) as date,
Count(*) as n_opt
FROM event_sanitized.prefupdate
WHERE
event.property = 'discussiontools-betaenable' AND
wiki IN ('arwiki','nlwiki', 'frwiki', 'huwiki') AND
CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= '2020-03-31' AND
CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) <= '2020-06-30'
GROUP BY
event.userid,
wiki,
event.value
"
beta_selection_rate <- wmf::query_hive(query)
beta_selection_rate$date <- as.Date(beta_selection_rate$date, format = "%Y-%m-%d")
beta_selection_rate$beta_selection <- ifelse(beta_selection_rate$beta_selection == '\"0\"', "turned off", "turned on")
total_distinct_users <- beta_selection_rate %>%
group_by(beta_selection) %>%
summarise(n_users = n())
total_distinct_users
beta_selection | n_users |
---|---|
<chr> | <int> |
turned off | 4525 |
turned on | 2132 |
total_distinct_users_bywiki <- beta_selection_rate %>%
group_by(wiki, beta_selection) %>%
summarise(n_users = n())
total_distinct_users_bywiki
wiki | beta_selection | n_users |
---|---|---|
<chr> | <chr> | <int> |
arwiki | turned off | 1473 |
arwiki | turned on | 581 |
frwiki | turned off | 2298 |
frwiki | turned on | 1192 |
huwiki | turned off | 296 |
huwiki | turned on | 140 |
nlwiki | turned off | 458 |
nlwiki | turned on | 219 |
total_beta_selections <- beta_selection_rate %>%
group_by(beta_selection) %>%
summarise(n_opt = sum(n_opt))
total_beta_selections
beta_selection | n_opt |
---|---|
<chr> | <int> |
turned off | 4651 |
turned on | 2465 |
total_beta_selections_bywiki <- beta_selection_rate %>%
group_by(wiki,beta_selection) %>%
summarise(n_opt = sum(n_opt))
total_beta_selections_bywiki
wiki | beta_selection | n_opt |
---|---|---|
<chr> | <chr> | <int> |
arwiki | turned off | 1542 |
arwiki | turned on | 745 |
frwiki | turned off | 2340 |
frwiki | turned on | 1329 |
huwiki | turned off | 302 |
huwiki | turned on | 152 |
nlwiki | turned off | 467 |
nlwiki | turned on | 239 |
The reply tool feature was turned off by 4,525 distinct users and turned on by 2,132 distinct users. Several of these users turned the feature on and off multiple times. The feature was turned off a total of 4,651 times and turned on a total of 2,465 times.
The number of users that turned the feature off is higher than those that turned it on because a large portion of those users were auto enrolled in the feature.
Note that we are missing data from 11 May 2020 through 05 June 2020 (T253151), so this is an underestimate of the total number of users that enabled the feature.
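The two sets of counts above differ because one counts rows (one row per user and selection) while the other sums `n_opt`, the number of times each user made that selection. A minimal sketch of the distinction on hypothetical toy data (the user IDs and counts below are invented for illustration and are not taken from PrefUpdate):

```r
# Toy data: one row per user and selection, with the number of times
# that user made that selection (hypothetical values)
toy <- data.frame(
  user_id = c(1, 1, 2, 3, 3),
  beta_selection = c("turned on", "turned off", "turned on", "turned off", "turned on"),
  n_opt = c(2, 1, 1, 2, 1)
)

# Distinct users per selection: count rows in each group
distinct_users <- aggregate(user_id ~ beta_selection, data = toy, FUN = length)

# Total selections per selection: sum the per-user event counts
total_events <- aggregate(n_opt ~ beta_selection, data = toy, FUN = sum)

print(merge(distinct_users, total_events, by = "beta_selection"))
```

In the toy data, user 1 turned the feature on twice, so the event total for "turned on" exceeds the number of distinct users who turned it on.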
The analysis below excludes any users that turned off or on the beta feature more than once.
# Find all users that turned on the beta feature or turned the feature off only once
# Do not count users again if they re-turned it on or off again.
# In terminal
# spark2R --master yarn --executor-memory 2G --executor-cores 1 --driver-memory 4G
query <- "
-- find number of opt in or opt outs by beta users
SELECT
wiki,
date,
sum(cast(beta_selection = '\"1\"' as int)) as opt_in_users,
sum(cast(beta_selection = '\"0\"' as int)) as opt_out_users
FROM
(
SELECT
event.userid as user_id,
wiki,
event.value as beta_selection,
min(CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0'))) as date,
Count(*) as n_opt
FROM event_sanitized.prefupdate
WHERE
event.property = 'discussiontools-betaenable' AND
wiki IN ('arwiki','nlwiki', 'frwiki', 'huwiki') AND
CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= '2020-03-31' AND
CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) <= '2020-06-30'
GROUP BY
event.userid,
wiki,
event.value
) as events
WHERE n_opt = 1
GROUP BY
wiki,
date
"
results <- collect(sql(query))
save(results, file="Data/opt_in_once.RData")
load("Data/opt_in_once.RData")
opt_rate_once <- results
opt_rate_once$date <- as.Date(opt_rate_once$date, format = "%Y-%m-%d")
opt_rate_once_clean <- opt_rate_once %>%
gather("beta_selection", "n_users", 3:4)
#Note: PrefUpdate data is missing for part of this period, so the true numbers are likely higher than reported here.
opt_rate_once_overall <- opt_rate_once_clean %>%
group_by(beta_selection) %>%
summarise(n_users = sum(n_users))
opt_rate_once_overall
beta_selection | n_users |
---|---|
<chr> | <dbl> |
opt_in_users | 1895 |
opt_out_users | 4427 |
We recorded a total of 1,895 users that explicitly turned on the beta feature just once between its deployment on March 31 and June 30, 2020. 4,427 users turned the feature off once (this includes both users that had turned it on and users that had the feature automatically turned on for them).
As mentioned above, we are missing data from 11 May 2020 through 05 June 2020, so this is an underestimate of the total number of users that enabled or disabled the feature.
# Chart opt in and opt out rate over time
# Note: There was a drop in pref-update data starting on 2020-05-11 through 2020-06-05
# https://phabricator.wikimedia.org/T253151
p <- opt_rate_once_clean %>%
group_by(date, beta_selection) %>%
summarise(n_users = sum(n_users)) %>%
ggplot(aes(x= date, y = n_users, color = beta_selection)) +
geom_line(size = 1.2) +
geom_vline(xintercept = as.Date(c('2020-05-11', '2020-06-05')),
linetype = "dashed", color = "black") +
annotate("text", x = as.Date('2020-05-11'), y = 55, label = "Missing PrefUpdate Data (Bug: T253151)", size = 3.7, vjust = -1.2, angle = 90, color = "black") +
annotate("text", x = as.Date('2020-06-05'), y = 55, label = "PrefUpdate Data Events Recorded Again", size = 3.7, vjust = -1.2, angle = 90, color = "black") +
scale_x_date(date_labels = "%d-%b", date_breaks = "1 week", minor_breaks = NULL) +
labs (y = "Number of distinct users",
x = "Date",
title = "Distinct reply tool users that explicitly turned the feature on or off",
subtitle = "31 March through 30 June 2020") +
theme_bw() +
theme(legend.position = "bottom",
axis.text.x=element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5))
p
ggsave("Figures/reply_opt_rate_overall.png", p, width = 16, height = 8, units = "in", dpi = 300)
# review opt-ins by wiki
opt_rate_bywiki <- opt_rate_once_clean %>%
group_by(wiki, beta_selection) %>%
summarise(n_users = sum(n_users))
opt_rate_bywiki
# Chart opt in rate by wiki
p <- opt_rate_once_clean %>%
group_by(date, wiki, beta_selection) %>%
summarise(n_users = sum(n_users)) %>%
ggplot(aes(x= date, y = n_users, color = beta_selection)) +
geom_line() +
facet_wrap(~wiki, scales = "free_y") +
scale_x_date(date_labels = "%b", date_breaks = "1 month", minor_breaks = NULL) +
labs (y = "Number of distinct users",
x = "Date",
title = "Distinct reply tool users that explicitly turned the feature on or off by wiki",
subtitle = "Note: Data Missing from May 11th through June 5th, 2020") +
theme_bw() +
theme(legend.position = "bottom",
axis.text.x=element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5))
p
ggsave("Figures/reply_opt_rate_bywiki.png", p, width = 16, height = 8, units = "in", dpi = 300)
We reviewed how many reply tool users explicitly turned off the feature after making at least one edit. Users that turned it off and on multiple times were excluded from this analysis.
Reply tool users were broken down into those that were auto-enrolled (the feature was turned on for them because they had enabled the "Automatically enable most beta features" preference) and those that explicitly turned on the feature in Beta Features.
query <- "
--find users that opt out of the reply tool preference
WITH opt_out_users AS (
SELECT
event.userid as opt_out_user,
wiki as opt_out_wiki,
min(event.saveTimestamp) as opt_out_time,
sum(cast(event.value = '\"0\"' as int)) as opt_outs
FROM
event_sanitized.prefupdate
WHERE
event.property = 'discussiontools-betaenable' AND
wiki IN ('arwiki','nlwiki', 'frwiki', 'huwiki') AND
event.value = '\"0\"' AND
CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= '2020-03-31' AND
CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) <= '2020-06-30'
GROUP BY
event.userid,
wiki
),
--find users that opt in to the reply tool preference
opt_in_users AS (
SELECT
event.userid as opt_in_user,
wiki as opt_in_wiki,
min(event.saveTimestamp) as opt_in_time,
sum(cast(event.value = '\"1\"' as int)) as opt_ins
FROM
event_sanitized.prefupdate
WHERE
event.property = 'discussiontools-betaenable' AND
wiki IN ('arwiki','nlwiki', 'frwiki', 'huwiki') AND
event.value = '\"1\"' AND
CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= '2020-03-31' AND
CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) <= '2020-06-30'
GROUP BY
event.userid,
wiki
),
-- find users that made at least one edit with the reply tool
reply_users AS (
SELECT
event_user_id as reply_user,
wiki_db as reply_wiki,
min(mh.event_timestamp) as first_reply_time
FROM wmf.mediawiki_history AS mh
WHERE
ARRAY_CONTAINS(revision_tags, 'discussiontools') AND
snapshot = '2020-06' AND
event_timestamp >= '2020-03-31' AND
event_timestamp <= '2020-06-30' AND
wiki_db IN ('arwiki','nlwiki', 'frwiki', 'huwiki') AND NOT
ARRAY_CONTAINS(event_user_groups_historical, 'bot') AND
event_entity = 'revision' AND
event_type = 'create'
GROUP BY
event_user_id,
wiki_db
)
-- Main Query --
SELECT
opt_out_wiki AS wiki,
SUM(CAST(opt_out_user IS NOT NULL and first_reply_time < opt_out_time and opt_in_user IS NULL AS int)) AS opt_out_users_autoenrolled,
SUM(CAST(opt_out_user IS NOT NULL and first_reply_time < opt_out_time and opt_ins = 1 AS int)) AS opt_out_users_explicitly_enrolled
FROM (
SELECT
reply_users.first_reply_time,
opt_out_users.opt_out_time,
opt_out_users.opt_out_wiki,
opt_in_users.opt_in_user,
opt_in_users.opt_ins,
opt_out_users.opt_out_user
FROM reply_users
LEFT JOIN opt_out_users ON
reply_users.reply_user = opt_out_users.opt_out_user AND
reply_users.reply_wiki = opt_out_users.opt_out_wiki
LEFT JOIN opt_in_users ON
reply_users.reply_user = opt_in_users.opt_in_user AND
reply_users.reply_wiki = opt_in_users.opt_in_wiki
WHERE
opt_out_users.opt_outs = 1
) sessions
GROUP BY
sessions.opt_out_wiki
"
opt_out_after_reply <- wmf::query_hive(query)
## add column of all reply tool users that made at least 1 edit.
opt_out_after_reply$all_reply_users <- c(83,153,43,49)
opt_out_after_reply_overall <- opt_out_after_reply %>%
summarise(percent_opt_out_autoenrolled = sum(opt_out_users_autoenrolled)/sum(all_reply_users) *100,
percent_opt_out_explicitly_enrolled = sum(opt_out_users_explicitly_enrolled)/sum(all_reply_users) *100)
opt_out_after_reply_overall
percent_opt_out_autoenrolled | percent_opt_out_explicitly_enrolled |
---|---|
<dbl> | <dbl> |
3.658537 | 13.71951 |
opt_out_after_reply_bywiki <- opt_out_after_reply %>%
mutate(percent_opt_out_autoenrolled = opt_out_users_autoenrolled/all_reply_users * 100,
percent_opt_out_explicitly_enrolled = opt_out_users_explicitly_enrolled/all_reply_users * 100)
head(opt_out_after_reply_bywiki)
wiki | opt_out_users_autoenrolled | opt_out_users_explicitly_enrolled | all_reply_users | percent_opt_out_autoenrolled | percent_opt_out_explicitly_enrolled |
---|---|---|---|---|---|
<chr> | <int> | <int> | <dbl> | <dbl> | <dbl> |
arwiki | 3 | 15 | 83 | 3.614458 | 18.072289 |
frwiki | 7 | 18 | 153 | 4.575163 | 11.764706 |
huwiki | 1 | 3 | 43 | 2.325581 | 6.976744 |
nlwiki | 1 | 9 | 49 | 2.040816 | 18.367347 |
A total of 57 users (17.38% of all reply tool users) explicitly turned off the feature after making at least 1 edit with the reply tool and did not turn it back on again. The majority of these users (45 of 57) had explicitly turned on the feature in Beta Features.
On a per-wiki basis, the highest percent of users that turned the feature off after making at least one edit was on Arabic Wikipedia (21.69%) and the lowest was on Hungarian Wikipedia (9.3%).
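The combined opt-out percentages quoted above are the sum of the auto-enrolled and explicitly enrolled opt-out counts divided by each wiki's number of reply tool users. A short sketch that re-derives them from the per-wiki table (the counts are copied from that table; the data frame and column names here are illustrative):

```r
# Counts copied from the per-wiki opt-out table above
opt_out <- data.frame(
  wiki = c("arwiki", "frwiki", "huwiki", "nlwiki"),
  autoenrolled = c(3, 7, 1, 1),
  explicitly_enrolled = c(15, 18, 3, 9),
  all_reply_users = c(83, 153, 43, 49)
)

# Combined opt-outs after at least one reply tool edit, per wiki
opt_out$total_opt_out <- opt_out$autoenrolled + opt_out$explicitly_enrolled
opt_out$percent_opt_out <- round(opt_out$total_opt_out / opt_out$all_reply_users * 100, 2)

# Overall rate across the four partner wikis
overall_opt_out <- sum(opt_out$total_opt_out)
overall_percent <- round(overall_opt_out / sum(opt_out$all_reply_users) * 100, 2)

print(opt_out[, c("wiki", "total_opt_out", "percent_opt_out")])
print(c(overall_opt_out = overall_opt_out, overall_percent = overall_percent))
```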
It looks like there are several duplicate events in PrefUpdate. I checked the number of duplicates to confirm the impact. There are 100 events that have been duplicated for discussiontools-betaenable. I updated the open Phabricator task T218835 to document the identified bug.
query <-"SELECT event.property AS property, COUNT(*) AS duplicated_events
FROM (
SELECT event, COUNT(*) AS copies
FROM event.prefupdate
WHERE
event.property = 'discussiontools-betaenable' AND
wiki IN ('arwiki','nlwiki', 'frwiki', 'huwiki') AND
CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= '2020-03-31' AND
CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) <= '2020-06-26'
GROUP BY event
HAVING copies > 1) AS events
GROUP BY event.property
ORDER BY property LIMIT 10000;"
reply_beta_duplicate_events <- wmf::query_hive(query)
head(reply_beta_duplicate_events)
property | duplicated_events |
---|---|
<chr> | <int> |
discussiontools-betaenable | 100 |
This is a relatively large number of duplicates, but it shouldn't significantly impact the overall trends in the data, which indicate that the opt-in and opt-out rates for the discussion tool have stabilized after an initial overall decrease.
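The duplicate check above groups on the full event and counts events that appear more than once. The same idea can be sketched in base R on a small hypothetical set of events (the column names and values below are invented for illustration; they are not the PrefUpdate schema):

```r
# Hypothetical preference-update events; two of them were recorded twice
events <- data.frame(
  user_id = c(1, 1, 2, 3, 3, 3),
  value = c("1", "1", "0", "1", "1", "0"),
  save_timestamp = c("t1", "t1", "t2", "t3", "t3", "t4"),
  stringsAsFactors = FALSE
)

# Build one key per event from all of its fields, then count copies per key
event_key <- paste(events$user_id, events$value, events$save_timestamp, sep = "|")
copies <- table(event_key)

# Number of distinct events that were recorded more than once
duplicated_events <- sum(copies > 1)

# Dropping exact duplicates keeps one row per distinct event
deduplicated <- unique(events)

print(duplicated_events)
print(nrow(deduplicated))
```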
Notes:
wiki | Number of Auto Enrolled Beta Users |
---|---|
arwiki | 9927 |
nlwiki | 1274 |
frwiki | 8170 |
huwiki | 558 |
Notes:
## Upper bound: number of people who have made at least 1 edit,
## in any namespace, in the previous 30 day period
query <- "
SELECT
wiki,
count(*) AS n_editors
FROM (
SELECT
event_user_id as user_id,
wiki_db AS wiki,
max(size(event_user_is_bot_by) > 0 or size(event_user_is_bot_by_historical) > 0) as bot_by_group,
COUNT(*) as edits
FROM wmf.mediawiki_history
WHERE
event_timestamp >= '2020-06-01' AND
event_timestamp <= '2020-06-30' AND
event_entity = 'revision' AND
event_type = 'create' AND
wiki_db IN ('arwiki','nlwiki', 'frwiki', 'huwiki') AND
snapshot = '2020-06' AND
event_user_id != 0
GROUP BY
event_user_id,
wiki_db
) edits
WHERE
not bot_by_group
GROUP BY
wiki"
upper_bound_editors <- wmf::query_hive(query)
## Lower bound: number of people who have made at least 1 edit
##in a talk namespace in the previous 30 day period
query <- "
SELECT
wiki,
count(*) AS n_editors
FROM (
SELECT
event_user_id as user_id,
wiki_db AS wiki,
max(size(event_user_is_bot_by) > 0 or size(event_user_is_bot_by_historical) > 0) as bot_by_group,
COUNT(*) as edits
FROM wmf.mediawiki_history
WHERE
event_timestamp >= '2020-06-01' AND
event_timestamp <= '2020-06-30' AND
event_entity = 'revision' AND
event_type = 'create' AND
page_namespace_historical % 2 == 1 AND
wiki_db IN ('arwiki','nlwiki', 'frwiki', 'huwiki') AND
snapshot = '2020-06' AND
event_user_id != 0
GROUP BY
event_user_id,
wiki_db
) edits
WHERE
not bot_by_group
GROUP BY
wiki"
lower_bound_editors <- wmf::query_hive(query)
wiki | upper bound | lower bound |
---|---|---|
arwiki | 6,455 | 1,685 |
nlwiki | 4,323 | 687 |
frwiki | 20,049 | 3,539 |
huwiki | 2,015 | 404 |
Of the people who have made at least one edit with the reply tool, how many have made >5%, >10%, >25%, and >50% of their total talk page edits during the identified time period using the tool?
Notes:
# Obtain reply edits and total talk edits
# In terminal
# spark2R --master yarn --executor-memory 2G --executor-cores 1 --driver-memory 4G
query <- "
-- obtain all users who have made at least 1 edit with the reply tool
with reply_users AS(
SELECT
event_user_id AS reply_user,
wiki_db AS reply_wiki
FROM wmf.mediawiki_history
WHERE
ARRAY_CONTAINS(revision_tags, 'discussiontools') AND
snapshot = '2020-06' AND
event_timestamp >= '2020-03-31' AND
event_timestamp <= '2020-06-31' AND
wiki_db IN ('arwiki','nlwiki', 'frwiki', 'huwiki') AND NOT
ARRAY_CONTAINS(event_user_groups_historical, 'bot') AND
event_entity = 'revision' AND
event_type = 'create'
)
-- obtain user talk and reply counts
SELECT
user,
wiki,
SUM(CAST(reply_edit as int)) as reply_edits,
SUM(CAST(talk_edit as int)) as talk_edits
FROM (
SELECT
event_user_id as user,
wiki_db AS wiki,
array_contains(revision_tags, 'discussiontools') as reply_edit,
page_namespace_historical % 2 == 1 as talk_edit
FROM
wmf.mediawiki_history
WHERE
event_timestamp >= '2020-03-31' AND
event_timestamp <= '2020-06-30' AND
event_entity = 'revision' AND
event_type = 'create' AND
wiki_db IN ('arwiki','nlwiki', 'frwiki', 'huwiki') AND
snapshot = '2020-06' AND NOT
ARRAY_CONTAINS(event_user_groups_historical, 'bot')
) edits
RIGHT JOIN
reply_users ON edits.user = reply_users.reply_user AND
edits.wiki = reply_users.reply_wiki
GROUP BY user, wiki
"
results <- collect(sql(query))
save(results, file="Data/prop_reply_edits.RData")
load("Data/prop_reply_edits.RData")
prop_reply_edits <- results
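The query above joins every revision by a user to that user's rows in `reply_users`, so the number of rows per user in `reply_users` affects the summed counts. The toy sketch below (base R, with a made-up `edits` data frame and a hypothetical helper `summarise_join`) suggests that repeated rows scale `reply_edits` and `talk_edits` by the same factor, which leaves their ratio unchanged.

```r
# Hypothetical data: one user with 4 talk page edits, 2 of them made with the reply tool
edits <- data.frame(
  user = c(1L, 1L, 1L, 1L),
  wiki = "frwiki",
  reply_edit = c(TRUE, TRUE, FALSE, FALSE),
  talk_edit = TRUE
)

# One row per tagged revision, so this user appears twice
reply_users_per_revision <- edits[edits$reply_edit, c("user", "wiki")]

# Right join (all.y = TRUE keeps every reply user), then sum the flags
summarise_join <- function(reply_users) {
  joined <- merge(edits, reply_users, by = c("user", "wiki"), all.y = TRUE)
  reply_edits <- sum(joined$reply_edit)
  talk_edits <- sum(joined$talk_edit)
  c(reply_edits = reply_edits,
    talk_edits = talk_edits,
    pct_reply = reply_edits / talk_edits * 100)
}

per_revision <- summarise_join(reply_users_per_revision)      # counts are doubled
per_user <- summarise_join(unique(reply_users_per_revision))  # counts match the raw edits

per_revision
per_user
```

In this toy case both versions give the same `pct_reply`, but only the de-duplicated version returns edit counts that can be read as actual numbers of edits.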
prop_reply_edits_clean <- prop_reply_edits %>%
group_by(user, wiki) %>%
mutate(pct_reply = reply_edits/talk_edits * 100)
# Divide users into groups by percent of talk page edits made with the reply tool (bin width of 10 percentage points)
b <- c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
names <- c('1-10 percent', '11-20 percent', '21-30 percent', '31-40 percent', '41-50 percent',
'51-60 percent', '61-70 percent', '71-80 percent', '81-90 percent', '91-100 percent')
prop_reply_edits_clean <- prop_reply_edits %>%
group_by(user, wiki) %>%
mutate(pct_reply = reply_edits/talk_edits * 100,
reply_prop_group = cut(pct_reply, breaks = b, labels = names))
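As a small illustration of how this binning behaves, the sketch below applies `cut()` with the same breaks to a few made-up percentages (the values in `pct` are hypothetical, not taken from the analysis data). By default `cut()` builds right-closed intervals and excludes the lowest break, so a value of exactly 0 and any value above 100 (for example `Inf`, which arises when a user has reply edits but no counted talk edits) fall outside every bin and become `NA`.

```r
# Hypothetical percentages chosen to sit on and around the bin edges
pct <- c(0, 0.5, 10, 10.1, 50, 100, Inf)

# Same breaks as in the analysis; default labels show the interval bounds
b <- seq(0, 100, by = 10)
groups <- cut(pct, breaks = b)

data.frame(pct, groups)
```

Under these defaults a value of exactly 10 lands in the first bin and 10.1 in the second, so the bins cover 0 (exclusive) to 10 (inclusive), 10 (exclusive) to 20 (inclusive), and so on.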
# Find overall number of users that made under 5 percent of talk page edits with the reply tool
prop_reply_overall_5percent <- prop_reply_edits_clean %>%
mutate(under_5_percent = ifelse(pct_reply < 5.0, 'True', 'False')) %>%
group_by(under_5_percent) %>%
filter(under_5_percent == 'True') %>%
summarise(n_users = n(),
percent_users = n_users/328 *100)
prop_reply_overall_5percent
under_5_percent | n_users | percent_users |
---|---|---|
<chr> | <int> | <dbl> |
True | 59 | 17.9878 |
Only about 18% of the reply tool users made under 5 percent of their talk page edits during the time period using the reply tool.
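The share above is computed against a hard-coded total number of users. A minimal sketch of the same calculation that derives the denominator from the data instead is shown here, using a made-up data frame (`toy` and its values are hypothetical and only stand in for `prop_reply_edits_clean`).

```r
library(dplyr)

# Hypothetical per-user percentages of talk page edits made with the reply tool
toy <- tibble(
  user = 1:8,
  wiki = rep(c("arwiki", "frwiki"), each = 4),
  pct_reply = c(2, 4, 12, 60, 1, 30, 45, 100)
)

overall <- toy %>%
  ungroup() %>%                              # drop any earlier grouping first
  summarise(
    n_users = sum(pct_reply < 5),            # users under 5 percent
    total_users = n(),                       # denominator taken from the data
    percent_users = n_users / total_users * 100
  )

overall
```

Computing the total with `n()` keeps the percentage in step with the data if the query is rerun on a newer snapshot.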
# table of totals
prop_reply_overall_bygroup <- prop_reply_edits_clean %>%
filter(!is.na(reply_prop_group)) %>% # a couple cases where there is a reply edit but not talk edit
group_by(reply_prop_group) %>%
summarise(n_users = n()) %>%
mutate(percent_reply_users = n_users/sum(n_users))
prop_reply_overall_bygroup
reply_prop_group | n_users | percent_reply_users |
---|---|---|
<fct> | <int> | <dbl> |
1-10 percent | 98 | 0.31210191 |
11-20 percent | 44 | 0.14012739 |
21-30 percent | 31 | 0.09872611 |
31-40 percent | 37 | 0.11783439 |
41-50 percent | 30 | 0.09554140 |
51-60 percent | 10 | 0.03184713 |
61-70 percent | 10 | 0.03184713 |
71-80 percent | 9 | 0.02866242 |
81-90 percent | 8 | 0.02547771 |
91-100 percent | 37 | 0.11783439 |
#chart overall users by group
p <- prop_reply_overall_bygroup %>%
ggplot(aes(x=reply_prop_group, y = percent_reply_users)) +
geom_bar(stat = 'identity', fill = 'darkblue') +
scale_y_continuous(labels = scales::percent) +
labs (y = "Percent of reply tool users",
x = "Percent of talk edits that were made with reply tool",
title = "Reply Tool Users \n overall by percent of talk page edits made with the reply tool",
subtitle = "31 March through 30 June 2020") +
theme_bw() +
theme(
plot.title = element_text(hjust = 0.5),
text = element_text(size=16),
axis.text.x = element_text(angle=45, hjust=1),
legend.position = "none")
p
ggsave("Figures/prop_reply_edits_overall.png", p, width = 16, height = 8, units = "in", dpi = 300)
The number of users is much more evenly distributed across the identified bins compared to the number of edits and number of days metrics.
31% of reply tool users made between 1 and 10 percent of their talk page edits using the reply tool. 23.6% of reply tool users made over half of their talk page edits using the reply tool (76.4% made 50% or fewer of their talk page edits using the reply tool).
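The percentages quoted here can be re-derived from the user counts in the overall table. The short check below copies those counts by hand into a vector (the variable names are illustrative) and sums the bins above and below the 50 percent mark.

```r
# User counts per bin, copied from the overall table (lowest bin first)
n_users <- c(98, 44, 31, 37, 30, 10, 10, 9, 8, 37)

total <- sum(n_users)                              # users with a valid bin
share_lowest_bin <- n_users[1] / total             # first bin (up to 10 percent)
share_over_half <- sum(n_users[6:10]) / total      # bins above 50 percent
share_half_or_less <- sum(n_users[1:5]) / total    # bins up to 50 percent

round(c(share_lowest_bin, share_over_half, share_half_or_less) * 100, 1)
# 31.2 23.6 76.4
```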
# Find number of users that made under 5 percent of talk page edits with the reply tool by wiki
prop_reply_bywiki_5percent <- prop_reply_edits_clean %>%
group_by(wiki) %>%
mutate(total_users = n()) %>%
filter(pct_reply < 5.0) %>%
group_by(wiki, total_users) %>%
summarise(under_5percent_users = n()) %>%
mutate(percent_users = under_5percent_users/total_users*100)
prop_reply_bywiki_5percent
wiki | total_users | under_5percent_users | percent_users |
---|---|---|---|
<chr> | <int> | <int> | <dbl> |
arwiki | 83 | 11 | 13.25301 |
frwiki | 155 | 32 | 20.64516 |
huwiki | 44 | 7 | 15.90909 |
nlwiki | 49 | 9 | 18.36735 |
The proportion of reply users that made under 5 percent of their talk page edits with the reply tool ranged from 13.25% on Arabic Wikipedia to 20.65% on French Wikipedia.
# table per wiki
prop_reply_edits_bywiki <- prop_reply_edits_clean %>%
filter(!is.na(reply_prop_group)) %>% # a couple cases where there is a reply edit but not talk edit
group_by(wiki, reply_prop_group) %>%
summarise(n_users = n()) %>%
mutate(percent_reply_users = n_users/sum(n_users))
prop_reply_edits_bywiki
wiki | reply_prop_group | n_users | percent_reply_users |
---|---|---|---|
<chr> | <fct> | <int> | <dbl> |
arwiki | 1-10 percent | 21 | 0.26250000 |
arwiki | 11-20 percent | 15 | 0.18750000 |
arwiki | 21-30 percent | 8 | 0.10000000 |
arwiki | 31-40 percent | 11 | 0.13750000 |
arwiki | 41-50 percent | 8 | 0.10000000 |
arwiki | 51-60 percent | 3 | 0.03750000 |
arwiki | 61-70 percent | 2 | 0.02500000 |
arwiki | 71-80 percent | 2 | 0.02500000 |
arwiki | 91-100 percent | 10 | 0.12500000 |
frwiki | 1-10 percent | 50 | 0.33333333 |
frwiki | 11-20 percent | 22 | 0.14666667 |
frwiki | 21-30 percent | 13 | 0.08666667 |
frwiki | 31-40 percent | 20 | 0.13333333 |
frwiki | 41-50 percent | 13 | 0.08666667 |
frwiki | 51-60 percent | 3 | 0.02000000 |
frwiki | 61-70 percent | 6 | 0.04000000 |
frwiki | 71-80 percent | 3 | 0.02000000 |
frwiki | 81-90 percent | 3 | 0.02000000 |
frwiki | 91-100 percent | 17 | 0.11333333 |
huwiki | 1-10 percent | 12 | 0.30769231 |
huwiki | 11-20 percent | 3 | 0.07692308 |
huwiki | 21-30 percent | 5 | 0.12820513 |
huwiki | 31-40 percent | 3 | 0.07692308 |
huwiki | 41-50 percent | 5 | 0.12820513 |
huwiki | 51-60 percent | 2 | 0.05128205 |
huwiki | 71-80 percent | 2 | 0.05128205 |
huwiki | 81-90 percent | 2 | 0.05128205 |
huwiki | 91-100 percent | 5 | 0.12820513 |
nlwiki | 1-10 percent | 15 | 0.33333333 |
nlwiki | 11-20 percent | 4 | 0.08888889 |
nlwiki | 21-30 percent | 5 | 0.11111111 |
nlwiki | 31-40 percent | 3 | 0.06666667 |
nlwiki | 41-50 percent | 4 | 0.08888889 |
nlwiki | 51-60 percent | 2 | 0.04444444 |
nlwiki | 61-70 percent | 2 | 0.04444444 |
nlwiki | 71-80 percent | 2 | 0.04444444 |
nlwiki | 81-90 percent | 3 | 0.06666667 |
nlwiki | 91-100 percent | 5 | 0.11111111 |
# graph of per wiki numbers
p <- prop_reply_edits_bywiki %>%
ggplot(aes(x=reply_prop_group, y = percent_reply_users)) +
geom_col(fill = 'darkblue') +
facet_wrap(~wiki) +
scale_y_continuous(labels = scales::percent) +
labs (y = "Percent of reply tool users",
x = "Percent of talk edits that were made with reply tool",
title = "Reply Tool Users \n by percent of talk page edits made with the reply tool",
subtitle = "31 March through 30 June 2020") +
theme_bw() +
theme(
plot.title = element_text(hjust = 0.5),
text = element_text(size=16),
axis.text.x = element_text(angle=45, hjust=1),
legend.position = "none")
p
ggsave("Figures/prop_reply_edits_bywiki.png", p, width = 16, height = 8, units = "in", dpi = 300)
French and Dutch Wikipedia had the highest percentage of reply users (33.3% each) that made between 1 and 10 percent of their talk page edits using the reply tool.
Arabic Wikipedia had the highest percentage of reply users (18.75%) that made between 11 and 20 percent of their talk page edits using the reply tool.
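One way to read off which wiki leads each bin, rather than scanning the table by eye, is sketched below. It rebuilds only the two lowest bins from the per-wiki table by hand (the data frame `bywiki` and the column `wiki_total`, the sum of each wiki's binned users, are illustrative names) and uses `dplyr::slice_max()`, which keeps ties by default.

```r
library(dplyr)

# Counts for the two lowest bins, copied from the per-wiki table;
# wiki_total is the number of binned users per wiki (sum of that wiki's rows)
bywiki <- tibble(
  wiki = rep(c("arwiki", "frwiki", "huwiki", "nlwiki"), times = 2),
  reply_prop_group = rep(c("1-10 percent", "11-20 percent"), each = 4),
  n_users = c(21, 50, 12, 15, 15, 22, 3, 4),
  wiki_total = rep(c(80, 150, 39, 45), times = 2)
) %>%
  mutate(percent_reply_users = n_users / wiki_total)

# Wiki (or wikis, when tied) with the largest share of users in each bin
top_by_group <- bywiki %>%
  group_by(reply_prop_group) %>%
  slice_max(percent_reply_users, n = 1) %>%
  ungroup()

top_by_group
```

Because ties are kept, the lowest bin returns both French and Dutch Wikipedia, while the second bin returns Arabic Wikipedia alone.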