Adoption of the Replying Tool (Beta Features)

Task

June 2020

Background

In T244872 and T245794, the reply tool was released as an opt-in beta feature for the following partner wikis: Arabic, Dutch, French and Hungarian. Version 1.0 of the reply tool was deployed to the partner wikis on March 31, 2020 and Version 2.0 was deployed on June 17, 2020.

We would like to know how the tool is being used and adopted by the partner wikis as a Beta Feature prior to deploying the tool to all volunteers as an opt-out preference on the four partner wikis T249394.

Data

Data for this analysis comes from the PrefUpdate table, the user properties table, and the mediawiki_history table.

For metrics that were calculated using mediawiki_history, we reviewed data from the release of Version 1.0 on March 31 through the end of June 2020 (the most recent data available at the time of this analysis).

In [20]:
library(IRdisplay)

display_html(
'<script>  
code_show=true; 
function code_toggle() {
  if (code_show){
    $(\'div.input\').hide();
  } else {
    $(\'div.input\').show();
  }
  code_show = !code_show
}  
$( document ).ready(code_toggle);
</script>
  <form action="javascript:code_toggle()">
    <input type="submit" value="Click here to toggle on/off the raw code.">
 </form>'
)
In [7]:
shhh <- function(expr) suppressPackageStartupMessages(suppressWarnings(suppressMessages(expr)))
shhh({
    library(tidyverse); library(glue); library(lubridate); library(scales)
})

How many people have used the Reply tool?

Notes:

  • We reviewed the number and percentage of people, overall and by wiki, that made just 1 edit and that made edits within identified buckets (e.g. "1-5 edits," "6-10 edits," etc.).
  • Data comes from mediawiki_history using the new change tag implemented in T242184.
  • We reviewed all edits made with the discussiontools tag. These include all users who were able to successfully make an edit using the reply tool. It does not include attempts with the tool that were not saved.
  • Any self-identified bots were filtered out from the data; however, some un-identified bots might still be included.
In [12]:
# Collect users max reply edits over time period and remove bots

query <- "

SELECT
    wiki,
    reply_user,
    max(reply_edits) as reply_edit_count
FROM (
    SELECT
        wiki_db AS wiki,
        event_user_id as reply_user,
        max(size(event_user_is_bot_by) > 0 or size(event_user_is_bot_by_historical) > 0) as bot_by_group,
        Count(*) as reply_edits
FROM wmf.mediawiki_history
WHERE 
    ARRAY_CONTAINS(revision_tags, 'discussiontools') AND
    snapshot = '2020-06' AND
    event_timestamp >= '2020-03-31' AND 
    event_timestamp <= '2020-04-30' AND
    wiki_db IN ('arwiki','nlwiki', 'frwiki', 'huwiki') AND
    event_entity = 'revision' AND
    event_type = 'create'
GROUP BY
    wiki_db,
    event_user_id  
) edits
WHERE not bot_by_group
GROUP BY reply_user, wiki"
In [13]:
test <- wmfdata::query_hive(query)
Don't forget to authenticate with Kerberos using kinit

In [2]:
load("Data/reply_edit_users.RData")
reply_edit_users <- results
Warning message in readChar(con, 5L, useBytes = TRUE):
“cannot open compressed file 'Data/reply_edit_users.RData', probable reason 'No such file or directory'”
Error in readChar(con, 5L, useBytes = TRUE): cannot open the connection
Traceback:

1. load("Data/reply_edit_users.RData")
2. readChar(con, 5L, useBytes = TRUE)

How many people have made just 1 edit with the Reply Tool?

In [14]:
num_users <- test %>%
    summarise(n_users = n())

num_users
A data.frame: 1 × 1
n_users
<int>
161
In [331]:
# Find overall number of users that made only 1 edit
reply_edits_overall_1edit <- reply_edit_users %>%
    filter(reply_edit_count == 1)  %>%
    summarise(n_users = n(),
             percent_users = n_users/328 *100)  #divide by total number of reply tool users overall

reply_edits_overall_1edit
A data.frame: 1 × 2
n_users  percent_users
<int>    <dbl>
100      30.4878

A total of 328 users have successfully made at least one edit using the reply tool from 31 March 2020 through the end of June.

100 of those users (30.49%) have made only 1 edit. The majority (69.5%) have made at least 2 edits using the tool.

How many people have made edits with the Reply Tool by edit group count?

In [231]:
# Divide reply edits into edit count groups with bin width set to 5 edits
b <- c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, Inf)
names <- c('1-5 edits', '6-10 edits', '11-15 edits', '16-20 edits', '21-25 edits', '26-30 edits', '31-35 edits',
         '36-40 edits', '41-45 edits', '46-50 edits', 'over 50 edits')

reply_edit_overall_bygroup <- reply_edit_users %>%
    mutate(edit_count_group = cut(reply_edit_count, breaks = b, labels = names))  %>%
    group_by(edit_count_group) %>%
     summarise(n_users = n()) %>%
    mutate(percent_reply_users = n_users/sum(n_users))

reply_edit_overall_bygroup
A tibble: 11 × 3
edit_count_group  n_users  percent_reply_users
<fct>             <int>    <dbl>
1-5 edits         201      0.612804878
6-10 edits        22       0.067073171
11-15 edits       18       0.054878049
16-20 edits       7        0.021341463
21-25 edits       10       0.030487805
26-30 edits       8        0.024390244
31-35 edits       3        0.009146341
36-40 edits       7        0.021341463
41-45 edits       3        0.009146341
46-50 edits       2        0.006097561
over 50 edits     47       0.143292683
In [563]:
#chart overall users by edit group

options(repr.plot.width = 10, repr.plot.height = 7)

p <- reply_edit_overall_bygroup %>%
    ggplot(aes(x=edit_count_group, y = percent_reply_users)) +
    geom_bar(stat = 'identity', fill = 'darkblue') +
    scale_y_continuous(labels = scales::percent) +
    labs (y = "Percent of reply tool users",
          x = "Number of edits with the reply tool",
         title = "Reply tool users overall by edit count group",
         subtitle = "31 March through 30 June 2020")  +
     theme_bw() +
   theme(
        plot.title = element_text(hjust = 0.5),
        text = element_text(size=16),
        axis.text.x = element_text(angle=45, hjust=1),
       legend.position = "none") 
 
        
p
In [564]:
ggsave("Figures/reply_edit_users_overall.png", p, width = 16, height = 8, units = "in", dpi = 300)

While the majority of reply tool users made more than 1 edit, relatively few made more than 5 edits. Over half (61.28%) of reply tool users made between 1 and 5 edits using the tool. About half of those users (100 of 201) made only 1 edit.

32% of reply tool users made more than 10 edits using the tool. 12.2% made between 6 and 15 edits.

About 14.3% of users made over 50 edits during this timeframe. Note: This might include some automated traffic not self-identified as bots.
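
The edit count groups above depend on how cut() treats bin boundaries. The following is a minimal sketch in base R, using made-up edit counts rather than the analysis data, to show that the breaks produce right-closed bins: a user with exactly 5 edits lands in "1-5 edits" and a user with 6 edits lands in "6-10 edits". The vector `reply_edit_count` and the name `bin_labels` are illustrative assumptions.

```r
# Hypothetical per-user reply edit counts (not real data)
reply_edit_count <- c(1, 5, 6, 10, 11, 50, 51, 200)

# Same breaks and labels as the analysis; bins are right-closed: (0,5], (5,10], ...
b <- c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, Inf)
bin_labels <- c('1-5 edits', '6-10 edits', '11-15 edits', '16-20 edits', '21-25 edits',
                '26-30 edits', '31-35 edits', '36-40 edits', '41-45 edits',
                '46-50 edits', 'over 50 edits')

edit_count_group <- cut(reply_edit_count, breaks = b, labels = bin_labels)

# Count users per group (empty groups are kept) and convert to a share of all users
n_users <- table(edit_count_group)
percent_reply_users <- as.numeric(n_users) / sum(n_users)

data.frame(edit_count_group = names(n_users),
           n_users = as.integer(n_users),
           percent_reply_users = percent_reply_users)
```

Because table() keeps every factor level, groups with no users appear with a count of 0, whereas the group_by() and summarise() approach used in the notebook drops them.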

Number of users that made reply edits by partner wiki

How many people have made just 1 edit with the Reply Tool by wiki?

In [348]:
# Find number of users that made only 1 edit, by wiki
reply_edits_overall_1edit_bywiki <- reply_edit_users %>%
    group_by(wiki)  %>%
    mutate(total_users = n())  %>%
    filter(reply_edit_count == 1)  %>%
    group_by(wiki, total_users)  %>%
    summarise(one_edit_users = n())  %>%
    mutate(percent_users = one_edit_users/total_users*100)
            

reply_edits_overall_1edit_bywiki
A grouped_df: 4 × 4
wiki    total_users  one_edit_users  percent_users
<chr>   <int>        <int>           <dbl>
arwiki  83           29              34.93976
frwiki  153          48              31.37255
huwiki  43           12              27.90698
nlwiki  49           11              22.44898

On a per-wiki basis, the percent of reply tool users that made only 1 edit ranged from about 22% to 35%. Arabic Wikipedia had the highest percentage of one-edit users, while Dutch Wikipedia had the lowest.

How many people have made edits with the Reply Tool by edit group count and wiki?

In [276]:
reply_edit_bywiki<- reply_edit_users %>%
    mutate(edit_count_group = cut(reply_edit_count, breaks = b, labels = names))  %>%
    group_by(wiki, edit_count_group) %>%
     summarise(n_users = n()) %>%
    mutate(percent_reply_users = n_users/sum(n_users))

reply_edit_bywiki
A grouped_df: 36 × 4
wiki    edit_count_group  n_users  percent_reply_users
<chr>   <fct>             <int>    <dbl>
arwiki  1-5 edits         45       0.542168675
arwiki  6-10 edits        8        0.096385542
arwiki  11-15 edits       5        0.060240964
arwiki  16-20 edits       3        0.036144578
arwiki  21-25 edits       2        0.024096386
arwiki  26-30 edits       2        0.024096386
arwiki  31-35 edits       1        0.012048193
arwiki  over 50 edits     17       0.204819277
frwiki  1-5 edits         102      0.666666667
frwiki  6-10 edits        11       0.071895425
frwiki  11-15 edits       6        0.039215686
frwiki  16-20 edits       2        0.013071895
frwiki  21-25 edits       4        0.026143791
frwiki  26-30 edits       3        0.019607843
frwiki  31-35 edits       1        0.006535948
frwiki  36-40 edits       5        0.032679739
frwiki  41-45 edits       3        0.019607843
frwiki  46-50 edits       1        0.006535948
frwiki  over 50 edits     15       0.098039216
huwiki  1-5 edits         25       0.581395349
huwiki  6-10 edits        2        0.046511628
huwiki  11-15 edits       1        0.023255814
huwiki  21-25 edits       3        0.069767442
huwiki  26-30 edits       2        0.046511628
huwiki  31-35 edits       1        0.023255814
huwiki  36-40 edits       1        0.023255814
huwiki  over 50 edits     8        0.186046512
nlwiki  1-5 edits         29       0.591836735
nlwiki  6-10 edits        1        0.020408163
nlwiki  11-15 edits       6        0.122448980
nlwiki  16-20 edits       2        0.040816327
nlwiki  21-25 edits       1        0.020408163
nlwiki  26-30 edits       1        0.020408163
nlwiki  36-40 edits       1        0.020408163
nlwiki  46-50 edits       1        0.020408163
nlwiki  over 50 edits     7        0.142857143
In [565]:
# chart by wiki

p <- reply_edit_bywiki %>%
    ggplot(aes(x=edit_count_group, y = percent_reply_users, fill = edit_count_group)) +
    geom_bar(stat = 'identity', fill = 'darkblue') +
    facet_wrap(~wiki) +
    scale_y_continuous(labels = scales::percent) +
    labs (y = "Percent of reply tool users",
          x = "Number of reply tool edits",
         title = "Reply tool users by edit count group and wiki",
         subtitle = "31 March through 30 June 2020")  +
     theme_bw() +
    theme(
        plot.title = element_text(hjust = 0.5),
        text = element_text(size=16),
        axis.text.x = element_text(angle=45, hjust=1),
       legend.position = "none") 
 
        
p
In [566]:
ggsave("Figures/reply_edit_users_bywiki.png", p, width = 16, height = 8, units = "in", dpi = 300)

Over half of all reply tool users on each partner wiki made between 1 and 5 edits. After that, there is a general decrease in the number of users as the number of edits increases, with a couple of exceptions.

Dutch Wikipedia had more users that made 11 to 15 edits (6 users) than 6 to 10 edits (1 user). Hungarian Wikipedia had more users that made 21 to 25 edits (3 users) than 6 to 10 edits (2 users), and as many users that made 26 to 30 edits as 6 to 10 edits.

How often are people using the Reply tool to make talk page edits?

Notes:

  • We reviewed the number and percentage of people overall that used the tool on just 1 day and that made edits within identified buckets of days (e.g. "1-5 days," "6-10 days," etc.).
  • Similar to the calculations above, data comes from mediawiki_history. We reviewed all edits made with the discussiontools change tag.
  • A distinct day was defined as a distinct calendar day. Some edits counted on different days may therefore have occurred 24 hours apart while others occurred only a few hours apart, depending on the time of day the edits were made. Even so, the measure provides a useful estimate of repeat usage, or "stickiness," of the tool.
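
To illustrate the calendar-day definition in the notes above, here is a small sketch in base R with made-up timestamps for a single hypothetical user. It assumes timestamps are in UTC and truncates them to dates, in the same spirit as to_date() in the query; the variable names are illustrative only.

```r
# Hypothetical edit timestamps for a single user (UTC); not real data
event_timestamp <- as.POSIXct(
  c("2020-04-01 23:30:00", "2020-04-02 00:15:00", "2020-04-02 18:00:00"),
  tz = "UTC"
)

# Truncate each timestamp to its calendar day
day_of_activity <- as.Date(event_timestamp, tz = "UTC")

# Count distinct calendar days with at least one edit
days_of_activity <- length(unique(day_of_activity))
days_of_activity
```

The first two edits are only 45 minutes apart but fall on different calendar days, so this user counts as active on two distinct days.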
In [ ]:
query <- "
-- obtain distinct day of edits by user
SELECT
    wiki,
    reply_user,
    COUNT(DISTINCT(day_of_activity)) AS days_of_activity
FROM (
    SELECT
        wiki_db AS wiki,
        event_user_id as reply_user,
        to_date(event_timestamp) as day_of_activity
    FROM wmf.mediawiki_history
    WHERE 
        ARRAY_CONTAINS(revision_tags, 'discussiontools') AND
        snapshot = '2020-06' AND
        event_timestamp >= '2020-03-31' AND 
        event_timestamp <= '2020-06-30' AND
        wiki_db IN ('arwiki','nlwiki', 'frwiki', 'huwiki') AND 
        event_entity = 'revision' AND
        event_type = 'create'
) edits
GROUP BY reply_user, wiki
"

results <- collect(sql(query))
save(results, file="Data/reply_days_activity.RData")
In [279]:
load("Data/reply_days_activity.RData")
reply_days_activity <- results
In [159]:
reply_days_activity$days_of_activity <- factor(reply_days_activity$days_of_activity,levels = 
                                            c("1-5 days", '6-10 days', '11-15 days', 
                                              '16-20 days','21-25 days', '26-30 days',
                                             '31-35 days','36-40 days', '41-45 days','46-50 days', 'over 50 days'))

Reply users distinct days of activity overall

How many people edited with the tool on just 1 day?

In [349]:
# Find overall number of users that used the tool on just 1 day
reply_days_overall_1day <- reply_days_activity %>%
    filter(days_of_activity == 1)  %>%
    summarise(one_day_users = n(),
             percent_users = one_day_users/328 *100)

reply_days_overall_1day
A data.frame: 1 × 2
one_day_users  percent_users
<int>          <dbl>
127            38.71951

38.72% of all reply tool users used the tool on just one day, indicating that the majority of users (61.3%) used the tool on two or more days.

How many people edited with the Reply Tool by days of activity count?

In [283]:
#Divide days of activity into groups with bin width set to 5 days
b <- c(0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, Inf)
names <- c('1-5 days', '6-10 days', '11-15 days', '16-20 days', '21-25 days', '26-30 days', '31-35 days',
         '36-40 days', '41-45 days', '46-50 days', 'over 50 days')
In [286]:
reply_days_overall_bygroup <- reply_days_activity %>%
    mutate(days_activity_group = cut(days_of_activity, breaks = b, labels = names))  %>%
    group_by(days_activity_group) %>%
    summarise(n_users = n()) %>%
    mutate(percent_reply_users = n_users/sum(n_users))

reply_days_overall_bygroup
A tibble: 11 × 3
days_activity_group  n_users  percent_reply_users
<fct>                <int>    <dbl>
1-5 days             224      0.68292683
6-10 days            29       0.08841463
11-15 days           11       0.03353659
16-20 days           12       0.03658537
21-25 days           9        0.02743902
26-30 days           5        0.01524390
31-35 days           11       0.03353659
36-40 days           5        0.01524390
41-45 days           5        0.01524390
46-50 days           6        0.01829268
over 50 days         11       0.03353659
In [567]:
#chart overall users by days of activity group

p <- reply_days_overall_bygroup %>%
    ggplot(aes(x=days_activity_group, y = percent_reply_users)) +
    geom_bar(stat = 'identity', fill = 'darkblue') +
    scale_y_continuous(labels = scales::percent) +
    labs (y = "Percent of reply tool users",
          x = "Distinct days of activity",
         title = "Reply tool users overall by days of activity",
         subtitle = "31 March through 30 June 2020")  +
     theme_bw() +
   theme(plot.title = element_text(hjust = 0.5),
        text = element_text(size=16),
        axis.text.x = element_text(angle=45, hjust=1),
       legend.position = "none") 
 
        
p
In [568]:
ggsave("Figures/reply_days_activity_overall.png", p, width = 16, height = 8, units = "in", dpi = 300)

68.3% of all reply tool users made edits on between 1 and 5 distinct days. About 56.7% of those users (127 of 224) edited on just 1 distinct day; the remaining 43.3% edited on 2 to 5 distinct days.

The number of users generally decreases as the number of distinct days increases, with a couple of slight fluctuations.

Reply user distinct days of activity by partner wiki

How many people edited with the tool on just 1 day by wiki?

In [351]:
# Find number of users that used the tool on just 1 day, by wiki
reply_days_1day_bywiki <- reply_days_activity %>%
    group_by(wiki)  %>%
    mutate(total_users = n())  %>%
    filter(days_of_activity == 1)  %>%
    group_by(wiki, total_users)  %>%  
    summarise(one_day_users = n())  %>%
    mutate(percent_users = one_day_users/total_users*100)
            

reply_days_1day_bywiki
A grouped_df: 4 × 4
wiki    total_users  one_day_users  percent_users
<chr>   <int>        <int>          <dbl>
arwiki  83           35             42.16867
frwiki  153          62             40.52288
huwiki  43           14             32.55814
nlwiki  49           16             32.65306

Across the partner wikis, the percent of one-day users ranges from about 32.56% to 42.17%. Similar to the trend seen for users that made 1 edit, Arabic Wikipedia had the highest percentage of one-day users. Hungarian Wikipedia had the lowest (32.56%), narrowly below Dutch Wikipedia (32.65%).

How many people edited with the Reply Tool by days of activity count and wiki?

In [294]:
reply_days_bywiki <- reply_days_activity %>%
    mutate(days_activity_group = cut(days_of_activity, breaks = b, labels = names))  %>%
    group_by(wiki, days_activity_group) %>%
    summarise(n_users = n()) %>%
    mutate(percent_reply_users = n_users/sum(n_users))

reply_days_bywiki
A grouped_df: 40 × 4
wiki    days_activity_group  n_users  percent_reply_users
<chr>   <fct>                <int>    <dbl>
arwiki  1-5 days             55       0.66265060
arwiki  6-10 days            7        0.08433735
arwiki  11-15 days           2        0.02409639
arwiki  16-20 days           2        0.02409639
arwiki  21-25 days           3        0.03614458
arwiki  26-30 days           1        0.01204819
arwiki  31-35 days           4        0.04819277
arwiki  36-40 days           1        0.01204819
arwiki  41-45 days           3        0.03614458
arwiki  46-50 days           2        0.02409639
arwiki  over 50 days         3        0.03614458
frwiki  1-5 days             113      0.73856209
frwiki  6-10 days            11       0.07189542
frwiki  11-15 days           6        0.03921569
frwiki  16-20 days           4        0.02614379
frwiki  21-25 days           5        0.03267974
frwiki  26-30 days           2        0.01307190
frwiki  31-35 days           2        0.01307190
frwiki  36-40 days           3        0.01960784
frwiki  46-50 days           3        0.01960784
frwiki  over 50 days         4        0.02614379
huwiki  1-5 days             27       0.62790698
huwiki  6-10 days            2        0.04651163
huwiki  11-15 days           1        0.02325581
huwiki  16-20 days           5        0.11627907
huwiki  21-25 days           1        0.02325581
huwiki  26-30 days           1        0.02325581
huwiki  31-35 days           2        0.04651163
huwiki  36-40 days           1        0.02325581
huwiki  41-45 days           1        0.02325581
huwiki  over 50 days         2        0.04651163
nlwiki  1-5 days             29       0.59183673
nlwiki  6-10 days            9        0.18367347
nlwiki  11-15 days           2        0.04081633
nlwiki  16-20 days           1        0.02040816
nlwiki  26-30 days           1        0.02040816
nlwiki  31-35 days           3        0.06122449
nlwiki  41-45 days           1        0.02040816
nlwiki  46-50 days           1        0.02040816
nlwiki  over 50 days         2        0.04081633
In [569]:
# numbers of days by wiki

p <- reply_days_bywiki %>%
    ggplot(aes(x=days_activity_group, y = percent_reply_users)) +
    geom_col(fill= 'darkblue') +
    scale_y_continuous(labels = percent) +
    facet_wrap(~wiki) +
    labs (y = "Percent of reply tool users",
          x = "Distinct days of activity",
         title = "Reply tool users by days of activity and wiki",
         subtitle = "31 March through 30 June 2020")  +
     theme_bw() +
    theme(
        plot.title = element_text(hjust = 0.5),
        text = element_text(size=16),
        axis.text.x = element_text(angle=45, hjust=1),
       legend.position = "none")  
 
        
p
In [570]:
ggsave("Figures/reply_days_activity_bywiki.png", p, width = 16, height = 8, units = "in", dpi = 300)

On each partner wiki, the majority of reply tool users made their edits on between 1 and 5 distinct days. The number of users generally decreases as the number of distinct days increases. A few notable exceptions:

  • Hungarian Wikipedia has a higher percentage of users that edited on 16-20 distinct days compared to 6-10 or 11-15 days.
  • Dutch Wikipedia has a larger percentage of users that edited on 6-10 days compared to the other partner wikis.

How many people have had access to the Reply tool?

How many distinct people have explicitly turned on or turned off the Beta Feature?

Notes:

  • Data comes from the PrefUpdate table. We reviewed data available at the time of this analysis from 31 March until June 26th.
  • "Explicitly" turned on indicates users did not have the Automatically enable all new beta features preference checked. Note that explicitly turned off could include users that were auto enrolled and then turned off the feature.
  • There are several data QA issues with the PrefUpdate table that may impact the results.

Total number of distinct users that explicitly turned on or turned off the beta feature

For the analysis below:

  • "Explicitly Turned On": Includes only users that were not auto-enrolled in the beta feature.
  • "Explicitly Turned Off": Includes both users that explicitly turned on the feature and users that were auto enrolled.
In [558]:
query <- "
SELECT
    event.userid as user_id,
    wiki,
    event.value as beta_selection,
    min(CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0'))) as date,
    Count(*) as n_opt
FROM event_sanitized.prefupdate 
WHERE 
    event.property = 'discussiontools-betaenable' AND
    wiki IN ('arwiki','nlwiki', 'frwiki', 'huwiki') AND
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= '2020-03-31' AND
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) <= '2020-06-30'
GROUP BY
    event.userid,
    wiki,
    event.value
"
In [559]:
beta_selection_rate <- wmf::query_hive(query)
In [572]:
beta_selection_rate$date <- as.Date(beta_selection_rate$date, format = "%Y-%m-%d")
In [577]:
beta_selection_rate$beta_selection <- ifelse(beta_selection_rate$beta_selection == '\"0\"', "turned off", "turned on")

Number of distinct users overall

In [578]:
total_distinct_users <- beta_selection_rate %>%
    group_by(beta_selection) %>%
    summarise(n_users = n())

total_distinct_users
A tibble: 2 × 2
beta_selection  n_users
<chr>           <int>
turned off      4525
turned on       2132

Number of distinct users by wiki

In [579]:
total_distinct_users_bywiki <- beta_selection_rate %>%
    group_by(wiki, beta_selection) %>%
    summarise(n_users = n())

total_distinct_users_bywiki
A grouped_df: 8 × 3
wiki    beta_selection  n_users
<chr>   <chr>           <int>
arwiki  turned off      1473
arwiki  turned on       581
frwiki  turned off      2298
frwiki  turned on       1192
huwiki  turned off      296
huwiki  turned on       140
nlwiki  turned off      458
nlwiki  turned on       219

Number of times the feature was turned on and off again overall

In [580]:
total_beta_selections <- beta_selection_rate %>%
    group_by(beta_selection) %>%
    summarise(n_opt = sum(n_opt))

total_beta_selections
A tibble: 2 × 2
beta_selection  n_opt
<chr>           <int>
turned off      4651
turned on       2465

Number of times the feature was turned on and off again by wiki

In [582]:
total_beta_selections_bywiki <- beta_selection_rate %>%
    group_by(wiki,beta_selection) %>%
    summarise(n_opt = sum(n_opt))

total_beta_selections_bywiki
A grouped_df: 8 × 3
wiki    beta_selection  n_opt
<chr>   <chr>           <int>
arwiki  turned off      1542
arwiki  turned on       745
frwiki  turned off      2340
frwiki  turned on       1329
huwiki  turned off      302
huwiki  turned on       152
nlwiki  turned off      467
nlwiki  turned on       239

The reply tool feature was turned off by 4,525 distinct users and turned on by 2,132 distinct users. Some of these users turned the feature on and off multiple times. The feature was turned off a total of 4,651 times and turned on a total of 2,465 times.

The number of users that turned the feature off is higher than those that turned it on because a large portion of those users were auto enrolled in the feature.

Note that we are missing data from 11 May 2020 through 05 June 2020 (T253151), so this is an underestimate of the total number of users that enabled the feature.
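
The difference between the distinct-user counts and the total number of times the feature was toggled comes down to counting rows versus summing events. The sketch below uses base R and a tiny made-up event table (not PrefUpdate data) to show both calculations; the column names mirror the notebook, but the values are hypothetical.

```r
# Hypothetical preference-change events (one row per event); not real data
events <- data.frame(
  user_id = c(1, 1, 1, 2, 3, 3),
  wiki = c("frwiki", "frwiki", "frwiki", "frwiki", "arwiki", "arwiki"),
  beta_selection = c("turned on", "turned off", "turned on",
                     "turned off", "turned off", "turned off"),
  stringsAsFactors = FALSE
)

# One row per user, wiki and selection, with the number of events (n_opt)
per_user <- aggregate(n_opt ~ user_id + wiki + beta_selection,
                      data = transform(events, n_opt = 1L), FUN = sum)

# Distinct users per selection: count rows
distinct_users <- table(per_user$beta_selection)

# Total toggles per selection: sum the event counts
total_selections <- tapply(per_user$n_opt, per_user$beta_selection, sum)

distinct_users
total_selections
```

In this toy data, three distinct users turned the feature off four times in total, which is the same kind of gap seen between 4,525 users and 4,651 events.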

Number distinct users that turned on or turned off the beta feature once

The analysis below excludes any users that turned off or on the beta feature more than once.

In [ ]:
# Find all users that turned on the beta feature or turned the feature off only once 
# Do not count users again if they re-turned it on or off again.

# In terminal
# spark2R --master yarn --executor-memory 2G --executor-cores 1 --driver-memory 4G
query <- "
-- find number of opt in or opt outs by beta users
SELECT
    wiki,
    date,
    sum(cast(beta_selection = '\"1\"' as int)) as opt_in_users,
    sum(cast(beta_selection = '\"0\"' as int)) as opt_out_users
FROM
(
SELECT
    event.userid as user_id,
    wiki,
    event.value as beta_selection,
    min(CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0'))) as date,
    Count(*) as n_opt
FROM event_sanitized.prefupdate 
WHERE 
    event.property = 'discussiontools-betaenable' AND
    wiki IN ('arwiki','nlwiki', 'frwiki', 'huwiki') AND
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= '2020-03-31' AND
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) <= '2020-06-30'
GROUP BY
    event.userid,
    wiki,
    event.value
) as events
WHERE n_opt = 1
GROUP BY 
    wiki,
    date

"
results <- collect(sql(query))
save(results, file="Data/opt_in_once.RData")
In [541]:
load("Data/opt_in_once.RData")
opt_rate_once <- results
In [542]:
opt_rate_once$date <- as.Date(opt_rate_once$date, format = "%Y-%m-%d")
In [ ]:
opt_rate_once_clean <- opt_rate_once %>%
    gather("beta_selection", "n_users", 3:4)
In [548]:
# Note: PrefUpdate data is incomplete for this period (see T253151), so these counts are likely underestimates.
opt_rate_once_overall <- opt_rate_once_clean %>%
    group_by(beta_selection) %>%
    summarise(n_users = sum(n_users))

opt_rate_once_overall
A tibble: 2 × 2
beta_selection  n_users
<chr>           <dbl>
opt_in_users    1895
opt_out_users   4427

We recorded a total of 1,895 users that explicitly turned on the beta feature just once from deployment on March 31 through June 30, 2020. 4,427 users turned the feature off once (this includes both users that had turned it on and users that had the feature automatically turned on for them).

As mentioned above, we are missing data from 11 May 2020 through 05 June 2020, so this is an underestimate of the total number of users that enabled or disabled the feature.

In [584]:
# Chart opt in and opt out rate over time
# Note: There was a drop in pref-update data starting on 2020-05-11 through 2020-06-05	
# https://phabricator.wikimedia.org/T253151
p <- opt_rate_once_clean %>%
    group_by(date, beta_selection)  %>%
    summarise(n_users = sum(n_users)) %>%
    ggplot(aes(x= date, y = n_users, color = beta_selection)) +
    geom_line(size = 1.2) +
    geom_vline(xintercept = as.Date(c('2020-05-11', '2020-06-05')),
             linetype = "dashed", color = "black") +
  geom_text(aes(x=as.Date('2020-05-11'), y=55, label="Missing PrefUpdate Data (Bug: T253151)"), size=3.7, vjust = -1.2, angle = 90, color = "black") +
  geom_text(aes(x=as.Date('2020-06-05'), y=55, label="PrefUpdate Data Events Recorded Again"), size=3.7, vjust = -1.2, angle = 90, color = "black") +
    scale_x_date(date_labels = "%d-%b", date_breaks = "1 week", minor_breaks = NULL) +
    labs (y = "Number of distinct users",
          x = "Date",
         title = "Distinct reply tool users that explicitly turned on the feature ",
         subtitle = "31 March through 30 June 2020")  +
     theme_bw() +
   theme(legend.position = "bottom",
        axis.text.x=element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5)) 
 
        
p
In [586]:
ggsave("Figures/reply_opt_rate_overall.png", p, width = 16, height = 8, units = "in", dpi = 300)

Number distinct users that turned on or turned off the beta feature over time by wiki

In [ ]:
# review opt-ins by wiki
opt_rate_bywiki <- opt_rate_once_clean %>%
    group_by(wiki, beta_selection) %>%
    summarise(n_users = sum(n_users))

opt_rate_bywiki
In [587]:
# Chart opt in rate by wiki

p <- opt_rate_once_clean %>%
    group_by(date, wiki, beta_selection)  %>%
    summarise(n_users = sum(n_users)) %>%
    ggplot(aes(x= date, y = n_users, color = beta_selection)) +
    geom_line() +
    facet_wrap(~wiki, scales = "free_y") +
    scale_x_date(date_labels = "%b", date_breaks = "1 month", minor_breaks = NULL) +
    labs (y = "Number of distinct users",
          x = "Date",
         title = "Distinct reply tool users that explicitly turned on the feature by wiki ",
         subtitle = "Note: Data Missing from May 11th through June 5th, 2020")  +
     theme_bw() +
   theme(legend.position = "bottom",
        axis.text.x=element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = 0.5)) 
 

           
p
In [588]:
ggsave("Figures/reply_opt_rate_bywiki.png", p, width = 16, height = 8, units = "in", dpi = 300)

How many distinct people turned off the feature after making at least one edit with the Reply tool?

We reviewed how many reply tool users explicitly turned off the feature after making at least one edit. I excluded users that turned it off and on multiple times from this analysis.

Reply tool users were broken down into those that were auto enrolled (the feature was turned on for them because they had the Automatically enable all new beta features preference checked) and those that explicitly turned on the feature in Beta Features.
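
The classification described above can be sketched on a few made-up users in base R. This is an illustration of the join logic under stated assumptions, not the query itself: times are expressed as hypothetical day numbers, a user with no recorded opt-in event is treated as auto enrolled, and all data frame names and values are invented for the example.

```r
# Hypothetical per-user summaries; times are day numbers since deployment (not real data)
reply_users <- data.frame(user = c(1, 2, 3, 4), first_reply_time = c(2, 5, 1, 3))
opt_out_users <- data.frame(user = c(1, 2, 4), opt_out_time = c(10, 3, 8),
                            opt_outs = c(1, 1, 1))
opt_in_users <- data.frame(user = c(1, 3), opt_ins = c(1, 1))

# Left joins keep every reply tool user, with NA where no preference event exists
joined <- merge(reply_users, opt_out_users, by = "user", all.x = TRUE)
joined <- merge(joined, opt_in_users, by = "user", all.x = TRUE)

# Keep users who opted out exactly once, after their first reply tool edit
opted_out_after_reply <- !is.na(joined$opt_outs) & joined$opt_outs == 1 &
  joined$first_reply_time < joined$opt_out_time

# No opt-in event on record is treated as auto enrolled
auto_enrolled <- sum(opted_out_after_reply & is.na(joined$opt_ins))
explicitly_enrolled <- sum(opted_out_after_reply &
                             !is.na(joined$opt_ins) & joined$opt_ins == 1)

c(auto_enrolled = auto_enrolled, explicitly_enrolled = explicitly_enrolled)
```

In this toy data, user 1 opted in and later opted out after replying (explicitly enrolled), user 4 opted out after replying with no opt-in on record (auto enrolled), user 2 opted out before their first reply, and user 3 never opted out.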

In [1]:
query <- "
--find users that opt out of the reply tool preference 
WITH opt_out_users AS (
SELECT
    event.userid as opt_out_user,
    wiki as opt_out_wiki,
    min(event.saveTimestamp) as opt_out_time,
    sum(cast(event.value = '\"0\"' as int)) as opt_outs
FROM 
    event_sanitized.prefupdate
WHERE
    event.property = 'discussiontools-betaenable' AND
    wiki IN ('arwiki','nlwiki', 'frwiki', 'huwiki') AND
    event.value = '\"0\"' AND
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= '2020-03-31' AND
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) <= '2020-06-30'
GROUP BY 
    event.userid, 
    wiki
),
--find users that opt in to the reply tool preference 
opt_in_users AS (
SELECT
    event.userid as opt_in_user,
    wiki as opt_in_wiki,
    min(event.saveTimestamp) as opt_in_time,
    sum(cast(event.value = '\"1\"' as int)) as opt_ins
FROM 
    event_sanitized.prefupdate
WHERE
    event.property = 'discussiontools-betaenable' AND
    wiki IN ('arwiki','nlwiki', 'frwiki', 'huwiki') AND
    event.value = '\"1\"' AND
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= '2020-03-31' AND
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) <= '2020-06-30'
GROUP BY 
    event.userid, 
    wiki
),

-- find users that made at least one edit with the reply tool
reply_users AS (
SELECT
    event_user_id as reply_user,
    wiki_db as reply_wiki,
    min(mh.event_timestamp) as first_reply_time
FROM wmf.mediawiki_history AS mh
WHERE 
    ARRAY_CONTAINS(revision_tags, 'discussiontools') AND
    snapshot = '2020-06' AND
    event_timestamp >= '2020-03-31' AND 
    event_timestamp <= '2020-06-30' AND
    wiki_db IN ('arwiki','nlwiki', 'frwiki', 'huwiki') AND NOT
    ARRAY_CONTAINS(event_user_groups_historical, 'bot') AND
    event_entity = 'revision' AND 
    event_type = 'create'
GROUP BY
    event_user_id,
    wiki_db
)

-- Main Query --
SELECT
    opt_out_wiki AS wiki,
    SUM(CAST(opt_out_user IS NOT NULL and first_reply_time < opt_out_time and opt_in_user IS NULL AS int)) AS opt_out_users_autoenrolled,
    SUM(CAST(opt_out_user IS NOT NULL and first_reply_time < opt_out_time and opt_ins = 1 AS int)) AS opt_out_users_explicitly_enrolled

FROM (
SELECT
    reply_users.first_reply_time,
    opt_out_users.opt_out_time,
    opt_out_users.opt_out_wiki,
    opt_in_users.opt_in_user,
    opt_in_users.opt_ins,
    opt_out_users.opt_out_user
FROM reply_users
LEFT JOIN opt_out_users ON 
    reply_users.reply_user = opt_out_users.opt_out_user AND
    reply_users.reply_wiki = opt_out_users.opt_out_wiki 
LEFT JOIN opt_in_users ON 
    reply_users.reply_user = opt_in_users.opt_in_user AND
    reply_users.reply_wiki = opt_in_users.opt_in_wiki 
WHERE 
    opt_out_users.opt_outs = 1 
) sessions
GROUP BY
    sessions.opt_out_wiki
"
In [2]:
opt_out_after_reply <- wmf::query_hive(query)
In [3]:
## add column of all reply tool users that made at least 1 edit.

opt_out_after_reply$all_reply_users <- c(83,153,43,49)
In [17]:
opt_out_after_reply_overall <- opt_out_after_reply %>%
 summarise(percent_opt_out_autoenrolled = sum(opt_out_users_autoenrolled)/sum(all_reply_users) * 100,
          percent_opt_out_explicitly_enrolled = sum(opt_out_users_explicitly_enrolled)/sum(all_reply_users) * 100)
In [18]:
opt_out_after_reply_overall
A data.frame: 1 × 2
percent_opt_out_autoenrolled  percent_opt_out_explicitly_enrolled
<dbl>                         <dbl>
3.658537                      13.71951
In [19]:
opt_out_after_reply_bywiki <- opt_out_after_reply %>%
    mutate(percent_opt_out_autoenrolled = opt_out_users_autoenrolled/all_reply_users * 100,
          percent_opt_out_explicitly_enrolled = opt_out_users_explicitly_enrolled/all_reply_users * 100)

head(opt_out_after_reply_bywiki)
A data.frame: 4 × 6
   wiki    opt_out_users_autoenrolled  opt_out_users_explicitly_enrolled  all_reply_users  percent_opt_out_autoenrolled  percent_opt_out_explicitly_enrolled
   <chr>   <int>                       <int>                              <dbl>            <dbl>                         <dbl>
1  arwiki  3                           15                                 83               3.614458                      18.072289
2  frwiki  7                           18                                 153              4.575163                      11.764706
3  huwiki  1                           3                                  43               2.325581                      6.976744
4  nlwiki  1                           9                                  49               2.040816                      18.367347

A total of 57 or 17.38% of all reply tool users explicitly turned off the feature after making at least 1 edit with the reply tool and did not turn it back on again. The majority of these users were users that explicility turned on the feature in beta features.

On a per-wiki basis, the highest percentage of users that turned the feature off after it had been turned on was on Arabic Wikipedia (21.69%) and the lowest was on Hungarian Wikipedia (9.3%).
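
The overall and per-wiki figures quoted above can be reproduced from the per-wiki counts. The following is a minimal sketch that re-enters the counts from the table by hand; the data frame `opt_out_counts` and its shortened column names are stand-ins introduced here for illustration, not objects from the original analysis.

```r
# Per-wiki counts copied by hand from the table above
# (illustrative stand-in for opt_out_after_reply; not queried from Hive).
opt_out_counts <- data.frame(
  wiki = c("arwiki", "frwiki", "huwiki", "nlwiki"),
  autoenrolled = c(3, 7, 1, 1),
  explicitly_enrolled = c(15, 18, 3, 9),
  all_reply_users = c(83, 153, 43, 49)
)

# Users who opted out, regardless of how they were enrolled
opt_out_counts$total_opt_out <- opt_out_counts$autoenrolled +
  opt_out_counts$explicitly_enrolled

# Overall share: sum the counts first, then divide by all reply tool users
overall_opt_out <- sum(opt_out_counts$total_opt_out)
overall_pct <- overall_opt_out / sum(opt_out_counts$all_reply_users) * 100

# Per-wiki share
opt_out_counts$pct_opt_out <- opt_out_counts$total_opt_out /
  opt_out_counts$all_reply_users * 100

overall_opt_out                        # 57
round(overall_pct, 2)                  # 17.38
round(opt_out_counts$pct_opt_out, 2)   # 21.69 16.34 9.30 20.41
```

Summing the counts before dividing weights each wiki by its number of reply tool users, which is why the overall figure is not a simple average of the four per-wiki percentages.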

Duplicate Events in PrefUpdate

It looks like there are several duplicate events in PrefUpdate. I checked the number of duplicates to confirm the impact. There are 100 events that have been copied for discussiontools-betaenable. I updated the open Phabricator task T218835 to document the identified bug.

In [334]:
query <-"SELECT event.property AS property, COUNT(*) AS duplicated_events
FROM (
  SELECT event, COUNT(*) AS copies
  FROM event.prefupdate 
  WHERE 
    event.property = 'discussiontools-betaenable' AND
    wiki IN ('arwiki','nlwiki', 'frwiki', 'huwiki') AND
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= '2020-03-31' AND
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) <= '2020-06-26'
  GROUP BY event
  HAVING copies > 1) AS events
GROUP BY event.property
ORDER BY property LIMIT 10000;"
In [335]:
reply_beta_duplicate_events <- wmf::query_hive(query)
In [336]:
head(reply_beta_duplicate_events)
A data.frame: 1 × 2
   property                    duplicated_events
   <chr>                       <int>
1  discussiontools-betaenable  100

This is a relatively large number of duplicates, but it shouldn't significantly impact the overall trends in the data, which indicate that the opt-in and opt-out rates for the discussion tool have stabilized after an initial overall decrease.
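
If the duplicates did need to be removed before counting opt-ins and opt-outs, one possible approach is to drop fully identical rows. The sketch below uses a small made-up data frame; the column names and values are assumptions for illustration and do not mirror the real PrefUpdate schema, and this is not a step that was run in the analysis above.

```r
# Hypothetical PrefUpdate-style events (made-up values and column names).
events <- data.frame(
  user_id = c(1, 1, 2, 3, 3, 3),
  wiki = c("arwiki", "arwiki", "frwiki", "huwiki", "huwiki", "huwiki"),
  property = "discussiontools-betaenable",
  value = c("1", "1", "1", "0", "0", "1"),
  save_timestamp = c("2020-04-01 10:00:00", "2020-04-01 10:00:00",
                     "2020-04-02 09:30:00", "2020-04-03 12:00:00",
                     "2020-04-03 12:00:00", "2020-04-10 08:15:00"),
  stringsAsFactors = FALSE
)

# duplicated() flags every repeat of an earlier, fully identical row
is_copy <- duplicated(events)
n_extra_copies <- sum(is_copy)

# Keep only the first occurrence of each event
events_dedup <- events[!is_copy, ]

n_extra_copies       # 2
nrow(events_dedup)   # 4
```

Rows that differ in any column (for example the later event for user 3) are kept, so genuine repeated preference changes are not discarded.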

How many people had the DiscussionTools Beta Feature turned on for them?

Notes:

  • "Turned on for them": Indicates users had the following preference checked: Automatically enable all new beta features.
  • Data comes from the mediawiki user properties table, where property is equal to 'betafeatures-auto-enroll'.
  • Data represents total numbers of users that have set this preference as of 30 June 2020.
  • Analysis in separate python notebook (auto_beta_users_collect.ipynb)
Number of Auto Enrolled Beta Users
arwiki 9927
nlwiki 1274
frwiki 8170
huwiki 558

How many people should we expect to try the Replying feature when it is turned on as an opt out user preference at our four partner wikis?

Notes:

  • Data from mediawiki_history.
  • We reviewed the latest 30 days available at the time of analysis: 1-30 June 2020.
In [594]:
## Upper bound: number of people who have made at least 1 edit, 
## in any namespace, in the previous 30 day period
query <- "
    SELECT
    wiki,
    count(*) AS n_editors
FROM (
    SELECT
        event_user_id as user_id,
        wiki_db AS wiki,
        max(size(event_user_is_bot_by) > 0 or size(event_user_is_bot_by_historical) > 0) as bot_by_group,
        COUNT(*) as edits
    FROM wmf.mediawiki_history
    WHERE
        event_timestamp >= '2020-06-01' AND 
        event_timestamp <= '2020-06-30' AND
        event_entity = 'revision' AND
        event_type = 'create' AND
        wiki_db IN ('arwiki','nlwiki', 'frwiki', 'huwiki') AND
        snapshot = '2020-06' AND
        event_user_id != 0
    GROUP BY 
        event_user_id,
        wiki_db
) edits
WHERE
    not bot_by_group 
GROUP BY 
    wiki"
In [595]:
upper_bound_editors <- wmf::query_hive(query)
In [596]:
## Lower bound: number of people who have made at least 1 edit 
##in a talk namespace in the previous 30 day period
query <- "
    SELECT
    wiki,
    count(*) AS n_editors
FROM (
    SELECT
        event_user_id as user_id,
        wiki_db AS wiki,
        max(size(event_user_is_bot_by) > 0 or size(event_user_is_bot_by_historical) > 0) as bot_by_group,
        COUNT(*) as edits
    FROM wmf.mediawiki_history
    WHERE
        event_timestamp >= '2020-06-01' AND 
        event_timestamp <= '2020-06-30' AND
        event_entity = 'revision' AND
        event_type = 'create' AND
        page_namespace_historical % 2 == 1 AND
        wiki_db IN ('arwiki','nlwiki', 'frwiki', 'huwiki') AND
        snapshot = '2020-06' AND
        event_user_id != 0
    GROUP BY 
        event_user_id,
        wiki_db
) edits
WHERE
    not bot_by_group 
GROUP BY 
    wiki"
In [597]:
lower_bound_editors <- wmf::query_hive(query)

Expected Users Per Wiki: Upper and Lower Bounds (Based on numbers from 1-30 June 2020)

wiki    upper bound  lower bound
arwiki  6,455        1,685
nlwiki  4,323        687
frwiki  20,049       3,539
huwiki  2,015        404
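
As a reading aid, the two bounds can be compared directly to see what share of each wiki's active editors also edited a talk namespace. This is a small sketch that re-enters the numbers from the table by hand; the names `expected_users` and `talk_share_pct` are introduced here for illustration only.

```r
# Bounds copied by hand from the table above (1-30 June 2020).
expected_users <- data.frame(
  wiki = c("arwiki", "nlwiki", "frwiki", "huwiki"),
  upper_bound = c(6455, 4323, 20049, 2015),  # edited in any namespace
  lower_bound = c(1685, 687, 3539, 404)      # edited in a talk namespace
)

# Share of active editors who made at least one talk namespace edit
expected_users$talk_share_pct <- round(
  expected_users$lower_bound / expected_users$upper_bound * 100, 1
)

expected_users$talk_share_pct   # 26.1 15.9 17.7 20.0
```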

How many talk page edits are being made with the Reply Tool?

Of the people who have made at least one edit with the Reply tool, how many have made >5%, >10%, >25% and >50% of their total talk page edits during the identified time period using the tool?

Notes:

  • Data comes from mediawiki_history.
  • Reply tool edits: Any edit marked with the discussiontools change tag.
  • Talk Page edits: Any edit made in any of the talk pages.
  • This metric has some slight noise, as people with very different activity levels can end up looking the same in the data. For example, both of the following people would end up in the same bucket: Person A made a total of two edits to talk pages, one of which was with the Reply tool; Person B made a total of 150 talk page edits, 75 of which were with the Reply tool.
  • There are a couple of cases where reply edits were recorded but no talk page edits were recorded for the user. I removed these for this analysis, but further investigation is needed.
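
The two caveats in the notes above can be illustrated with a few made-up users. In this sketch, Person A and Person B use the counts from the example in the notes, while Person C is a hypothetical user with reply edits but no talk page edits; the data frame `example_users` is introduced here for illustration.

```r
# Made-up users: A and B follow the example in the notes above,
# C is a hypothetical user with reply edits but no recorded talk page edits.
example_users <- data.frame(
  person = c("A", "B", "C"),
  reply_edits = c(1, 75, 2),
  talk_edits = c(2, 150, 0)
)

example_users$pct_reply <- example_users$reply_edits /
  example_users$talk_edits * 100

example_users$pct_reply   # 50 50 Inf

# Dividing by zero talk edits yields Inf; keep only finite percentages
example_users_clean <- example_users[is.finite(example_users$pct_reply), ]
nrow(example_users_clean)   # 2
```

Persons A and B get the same percentage despite very different activity levels, and Person C has to be dropped because the ratio is undefined.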
In [ ]:
# Obtain reply edits and total talk edits 
# In terminal
# spark2R --master yarn --executor-memory 2G --executor-cores 1 --driver-memory 4G

query <- "
-- obtain all users who have made at least 1 edit with the reply tool

with reply_users AS(
SELECT
    event_user_id AS reply_user,
    wiki_db AS reply_wiki
FROM wmf.mediawiki_history
WHERE 
    ARRAY_CONTAINS(revision_tags, 'discussiontools') AND
    snapshot = '2020-06' AND
    event_timestamp >= '2020-03-31' AND 
    event_timestamp < '2020-07-01' AND
    wiki_db IN ('arwiki','nlwiki', 'frwiki', 'huwiki') AND NOT
    ARRAY_CONTAINS(event_user_groups_historical, 'bot') AND
    event_entity = 'revision' AND 
    event_type = 'create' 
)
-- obtain user talk and reply counts
SELECT
    user,
    wiki,
    SUM(CAST(reply_edit as int)) as reply_edits,
    SUM(CAST(talk_edit as int)) as talk_edits
FROM (
    SELECT
        event_user_id as user,
        wiki_db AS wiki,
        array_contains(revision_tags, 'discussiontools') as reply_edit,
        page_namespace_historical % 2 == 1 as talk_edit
    FROM
        wmf.mediawiki_history
    WHERE
        event_timestamp >= '2020-03-31' AND 
        event_timestamp <= '2020-06-30' AND
        event_entity = 'revision' AND
        event_type = 'create' AND
        wiki_db IN ('arwiki','nlwiki', 'frwiki', 'huwiki') AND
        snapshot = '2020-06' AND NOT
        ARRAY_CONTAINS(event_user_groups_historical, 'bot')  
) edits
RIGHT JOIN 
    reply_users ON edits.user = reply_users.reply_user AND
    edits.wiki = reply_users.reply_wiki
GROUP BY user, wiki
"


results <- collect(sql(query))
save(results, file="Data/prop_reply_edits.RData")
In [310]:
load("Data/prop_reply_edits.RData")
prop_reply_edits <- results
In [312]:
prop_reply_edits_clean <- prop_reply_edits %>%
    group_by(user, wiki) %>%
    mutate(pct_reply = reply_edits/talk_edits * 100)
In [314]:
# Divide users into groups by percent of talk page edits made with the reply tool (bin width of 10 percent)
b <- c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
names <- c('1-10 percent', '11-20 percent', '21-30 percent', '31-40 percent', '41-50 percent', 
           '51-60 percent', '61-70 percent', '71-80 percent', '81-90 percent', '91-100 percent')

prop_reply_edits_clean <- prop_reply_edits %>%
    group_by(user, wiki) %>%
    mutate(pct_reply = reply_edits/talk_edits * 100,
           reply_prop_group = cut(pct_reply, breaks = b, labels = names))
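
It may help to see how `cut()` treats the bin edges with these breaks. The sketch below uses made-up percentages and the default interval labels rather than the custom labels above, so that the edges are visible. By default the intervals are open on the left and closed on the right, so exactly 10 percent falls in the first bin, and values that are 0, above 100, or infinite are assigned `NA`.

```r
# Same breaks as in the cell above; default labels show the interval edges.
breaks <- seq(0, 100, by = 10)

# Made-up percentages chosen to probe the bin edges
pct <- c(0, 5, 10, 10.5, 50, 100, 120, Inf)

groups <- cut(pct, breaks = breaks)

as.character(groups)
# NA "(0,10]" "(0,10]" "(10,20]" "(40,50]" "(90,100]" NA NA

# Values outside (0, 100] are not assigned to any bin
sum(is.na(groups))   # 3
```

Any user whose percentage falls outside the breaks ends up with a missing group, which is what a later `filter(!is.na(reply_prop_group))` step removes.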

Overall proportion of user talk page edits made with reply tool

Proportion of user talk page edits made with reply tool that are under 5 percent

In [354]:
# Find overall number of users that made under 5 percent of talk page edits with the reply tool
prop_reply_overall_5percent <- prop_reply_edits_clean %>%
    mutate(under_5_percent = ifelse(pct_reply < 5.0, 'True', 'False')) %>% 
    group_by(under_5_percent) %>% 
    filter(under_5_percent == 'True') %>% 
    summarise(n_users = n(),
             percent_users = n_users/328 *100)  

prop_reply_overall_5percent
A tibble: 1 × 3
under_5_percent  n_users  percent_users
<chr>            <int>    <dbl>
True             59       17.9878

Only about 18% of reply tool users made under 5 percent of their talk page edits with the reply tool during the time period.

Proportion of user talk page edits made with reply tool by percent group

In [320]:
# table of totals

prop_reply_overall_bygroup <- prop_reply_edits_clean %>%
    filter(!is.na(reply_prop_group)) %>% # a couple cases where there is a reply edit but not talk edit
    group_by(reply_prop_group) %>%
    summarise(n_users = n()) %>%
    mutate(percent_reply_users = n_users/sum(n_users))

prop_reply_overall_bygroup
A tibble: 10 × 3
reply_prop_group  n_users  percent_reply_users
<fct>             <int>    <dbl>
1-10 percent      98       0.31210191
11-20 percent     44       0.14012739
21-30 percent     31       0.09872611
31-40 percent     37       0.11783439
41-50 percent     30       0.09554140
51-60 percent     10       0.03184713
61-70 percent     10       0.03184713
71-80 percent     9        0.02866242
81-90 percent     8        0.02547771
91-100 percent    37       0.11783439
In [602]:
#chart overall users by group

p <- prop_reply_overall_bygroup %>%
    ggplot(aes(x=reply_prop_group, y = percent_reply_users)) +
    geom_bar(stat = 'identity', fill = 'darkblue') +
    scale_y_continuous(labels = scales::percent) +
    labs (y = "Percent of reply tool users",
          x = "Percent of talk edits that were made with reply tool",
         title =  "Reply Tool Users \n overall by percent of talk page edits made with the reply tool",
         subtitle = "31 March through 30 June 2020")  +
     theme_bw() +
  theme(
        plot.title = element_text(hjust = 0.5),
        text = element_text(size=16),
        axis.text.x = element_text(angle=45, hjust=1),
       legend.position = "none")
 
        
p
In [603]:
ggsave("Figures/prop_reply_edits_overall.png", p, width = 16, height = 8, units = "in", dpi = 300)

The number of users is much more evenly distributed across the identified bins compared to the number of edits and number of days metrics.

31% of reply tool users made between 1 and 10 percent of their talk page edits using the reply tool. 23.6% of reply tool users made over half of their talk page edits using the reply tool (76.4% made 50% or less of their talk page edits using the reply tool).
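
The shares quoted above follow directly from the per-group counts. This is a minimal sketch that re-enters the `n_users` column from the table by hand; note that the denominator is the number of users that were assigned to a bin, which is an observation from the table rather than something stated in the original notes.

```r
# n_users per group, copied by hand from the table above
# (ordered from the lowest to the highest percent group).
n_users <- c(98, 44, 31, 37, 30, 10, 10, 9, 8, 37)

total_binned <- sum(n_users)      # users assigned to a bin
over_half <- sum(n_users[6:10])   # groups above 50 percent

total_binned                                  # 314
over_half                                     # 74
round(n_users[1] / total_binned * 100, 1)     # 31.2
round(over_half / total_binned * 100, 1)      # 23.6
round((total_binned - over_half) / total_binned * 100, 1)   # 76.4
```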

Proportion of talk page edits made with reply tool by wiki

Proportion of user talk page edits made with reply tool that are under 5 percent by wiki

In [359]:
# Find number of users that made under 5 percent of talk page edits with the reply tool by wiki
prop_reply_bywiki_5percent <- prop_reply_edits_clean %>%
    group_by(wiki)  %>%
    mutate(total_users = n()) %>%
    filter(pct_reply < 5.0) %>% 
    group_by(wiki, total_users) %>%
    summarise(under_5percent_users = n()) %>%
    mutate(percent_users = under_5percent_users/total_users*100) 
 
prop_reply_bywiki_5percent
A grouped_df: 4 × 4
wiki    total_users  under_5percent_users  percent_users
<chr>   <int>        <int>                 <dbl>
arwiki  83           11                    13.25301
frwiki  155          32                    20.64516
huwiki  44           7                     15.90909
nlwiki  49           9                     18.36735

The proportion of reply users that made under 5 percent of their talk page edits with the reply tool ranged from 13.25% on Arabic Wikipedia to 20.65% on French Wikipedia.

Proportion of user talk page edits made with reply tool by percent group and wiki

In [328]:
# table per wiki

prop_reply_edits_bywiki <- prop_reply_edits_clean %>%
    filter(!is.na(reply_prop_group)) %>% # a couple cases where there is a reply edit but not talk edit
    group_by(wiki, reply_prop_group) %>%
    summarise(n_users = n()) %>%
    mutate(percent_reply_users = n_users/sum(n_users))
    

prop_reply_edits_bywiki
A grouped_df: 38 × 4
wiki    reply_prop_group  n_users  percent_reply_users
<chr>   <fct>             <int>    <dbl>
arwiki  1-10 percent      21       0.26250000
arwiki  11-20 percent     15       0.18750000
arwiki  21-30 percent     8        0.10000000
arwiki  31-40 percent     11       0.13750000
arwiki  41-50 percent     8        0.10000000
arwiki  51-60 percent     3        0.03750000
arwiki  61-70 percent     2        0.02500000
arwiki  71-80 percent     2        0.02500000
arwiki  91-100 percent    10       0.12500000
frwiki  1-10 percent      50       0.33333333
frwiki  11-20 percent     22       0.14666667
frwiki  21-30 percent     13       0.08666667
frwiki  31-40 percent     20       0.13333333
frwiki  41-50 percent     13       0.08666667
frwiki  51-60 percent     3        0.02000000
frwiki  61-70 percent     6        0.04000000
frwiki  71-80 percent     3        0.02000000
frwiki  81-90 percent     3        0.02000000
frwiki  91-100 percent    17       0.11333333
huwiki  1-10 percent      12       0.30769231
huwiki  11-20 percent     3        0.07692308
huwiki  21-30 percent     5        0.12820513
huwiki  31-40 percent     3        0.07692308
huwiki  41-50 percent     5        0.12820513
huwiki  51-60 percent     2        0.05128205
huwiki  71-80 percent     2        0.05128205
huwiki  81-90 percent     2        0.05128205
huwiki  91-100 percent    5        0.12820513
nlwiki  1-10 percent      15       0.33333333
nlwiki  11-20 percent     4        0.08888889
nlwiki  21-30 percent     5        0.11111111
nlwiki  31-40 percent     3        0.06666667
nlwiki  41-50 percent     4        0.08888889
nlwiki  51-60 percent     2        0.04444444
nlwiki  61-70 percent     2        0.04444444
nlwiki  71-80 percent     2        0.04444444
nlwiki  81-90 percent     3        0.06666667
nlwiki  91-100 percent    5        0.11111111
In [604]:
# graph of per wiki numbers

p <- prop_reply_edits_bywiki %>%
    ggplot(aes(x=reply_prop_group, y = percent_reply_users)) +
    geom_col(fill = 'darkblue') +
    facet_wrap(~wiki) +
    scale_y_continuous(labels = scales::percent) +
    labs (y = "Percent of reply tool users",
          x = "Percent of talk edits that were made with reply tool",
         title = "Reply Tool Users \n by percent of talk page edits made with the reply tool",
         subtitle = "31 March through 30 June 2020")  +
     theme_bw() +
   theme(
        plot.title = element_text(hjust = 0.5),
        text = element_text(size=16),
        axis.text.x = element_text(angle=45, hjust=1),
       legend.position = "none")
 
        
p