Instrumentation changes were deployed on 12 February 2021 to make it possible to identify non-discussion tool EditAttemptStep events by users in the AB test.
The purpose of this post-deployment QA is to confirm that events are being logged as expected and as needed to run the AB test.
shhh <- function(expr) suppressPackageStartupMessages(suppressWarnings(suppressMessages(expr)))
shhh({
library(magrittr); library(zeallot); library(glue); library(tidyverse); library(zoo); library(lubridate)
library(scales)
})
# collect all desktop edit attempts by bucket/test group and event type
query <-
"
SELECT
date_format(dt, 'yyyy-MM-dd') as attempt_dt,
event.bucket AS experiment_group,
wiki As wiki,
event.integration AS event_type,
-- review participating wikis
IF( wiki IN ('frwiki', 'eswiki', 'itwiki', 'jawiki', 'fawiki', 'plwiki', 'hewiki', 'nlwiki',
'hiwiki', 'kowiki', 'viwiki', 'thwiki', 'ptwiki', 'bnwiki', 'arzwiki', 'swwiki', 'zhwiki',
'ukwiki', 'idwiki', 'amwiki', 'omwiki', 'afwiki'), 'TRUE', 'FALSE'
) AS is_AB_test_wiki,
CASE
WHEN min(event.user_editcount) is NULL THEN 'undefined'
WHEN min(event.user_editcount) < 100 THEN 'under 100'
WHEN min(event.user_editcount) >=100 AND min(event.user_editcount) < 500 THEN '100-499'
ELSE '500+'
END AS initial_edit_count,
event.is_oversample AS is_oversample,
event.user_id as user_id,
event.editing_session_id as edit_attempt_id,
event.editor_interface AS editor_interface
FROM event.editattemptstep
WHERE
-- Review data starting a few days prior to the AB test deployment on Feb 12th
year = 2021
AND month = 02
AND day >= 06
-- look at only desktop init events
AND event.platform = 'desktop'
AND event.action = 'init'
-- remove bots
AND useragent.is_bot = false
GROUP BY
date_format(dt, 'yyyy-MM-dd'),
event.bucket,
wiki,
event.integration,
IF( wiki IN ('frwiki', 'eswiki', 'itwiki', 'jawiki', 'fawiki', 'plwiki', 'hewiki', 'nlwiki',
'hiwiki', 'kowiki', 'viwiki', 'thwiki', 'ptwiki', 'bnwiki', 'arzwiki', 'swwiki', 'zhwiki',
'ukwiki', 'idwiki', 'amwiki', 'omwiki', 'afwiki'), 'TRUE', 'FALSE'
),
event.is_oversample,
event.user_id,
event.editing_session_id,
event.editor_interface
"
collect_edit_attempts <- wmfdata::query_hive(query)
Don't forget to authenticate with Kerberos using kinit
collect_edit_attempts$attempt_dt <- as.Date(collect_edit_attempts$attempt_dt)
edit_attempts_byexperimentgroup <- collect_edit_attempts %>%
# remove anon users and review data following AB deployment
filter(attempt_dt >= '2021-02-12',
user_id != 0) %>%
group_by (is_ab_test_wiki, experiment_group) %>%
summarise(n_users = n_distinct(user_id),
n_attempts = n_distinct(edit_attempt_id))
edit_attempts_byexperimentgroup
`summarise()` regrouping output by 'is_ab_test_wiki' (override with `.groups` argument)
is_ab_test_wiki | experiment_group | n_users | n_attempts |
---|---|---|---|
<lgl> | <chr> | <int> | <int> |
FALSE | NULL | 49398 | 212818 |
TRUE | control | 5751 | 19307 |
TRUE | NULL | 13980 | 64972 |
TRUE | test | 6338 | 21229 |
I confirmed that I can distinguish events in the AB test (indicated by experiment_group = 'control' or experiment_group = 'test') from events not in the AB test (indicated by experiment_group = 'NULL').
I can also distinguish between events logged to the Test and Control groups within the AB test with the new instrumentation. We are logging a reasonable number of attempts in each group among the expected experiment population.
Control and Test events are only logged for the wikis included in the AB test, as expected.
About 53.6% of all logged-in users who have attempted a desktop edit on the participating wikis since the AB test deployment on 12 Feb 2021 were not included in the AB test. Based on the AB test criteria, this would include users who have used the reply tool before (defined as people whose discussiontools-editmode preference is set). This indicates that the targeting for the A/B test is fairly restrictive. However, even with the restrictive criteria, I believe we should have sufficient data to complete the analysis in 2 to 3 weeks based on the current rate of daily AB events logged.
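The 53.6% figure can be reproduced from the summary table above. This is a minimal sketch, assuming the three user counts for the participating wikis are copied by hand from that table; the variable names are illustrative, and a user who appears in more than one group is counted once per group.

```r
# Distinct logged-in users on participating wikis, copied from the table above.
n_users <- c(control = 5751, not_in_test = 13980, test = 6338)

# Share of users whose events carry no AB test bucket.
share_not_in_test <- n_users[["not_in_test"]] / sum(n_users)
round(100 * share_not_in_test, 1)
# 53.6
```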
dt_attempts_byexperimentgroup <- collect_edit_attempts %>%
# review only AB test events
filter(attempt_dt >= '2021-02-12',
is_ab_test_wiki == TRUE,
experiment_group != 'NULL',
user_id != 0
) %>%
group_by (event_type, experiment_group) %>%
summarise(n_users = n_distinct(user_id),
n_attempts = n_distinct(edit_attempt_id))
dt_attempts_byexperimentgroup
`summarise()` regrouping output by 'event_type' (override with `.groups` argument)
event_type | experiment_group | n_users | n_attempts |
---|---|---|---|
<chr> | <chr> | <int> | <int> |
discussiontools | control | 19 | 36 |
discussiontools | test | 1342 | 3128 |
page | control | 5744 | 19271 |
page | test | 5689 | 18101 |
I confirmed I am able to distinguish between discussion tool events (event.integration = 'discussiontools') and non-discussion tool events (event.integration = 'page') for both the Test and Control groups.
There are very few discussion tool attempts (36 attempts from 19 users) in the control group. This is expected as these events could only be recorded for users who manually enabled DiscussionTools.
A similar number of main edit attempts (event.integration = 'page') was recorded in the test and control groups. I would expect there to be fewer non-discussion tool edit attempts in the test group compared to the control since that group is shown the reply tool as default.
To check whether this is because many of those attempts were edits to articles, where the discussion tool is not available, I reviewed edit attempts on talk pages only.
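The talk-page filter in the query below relies on the MediaWiki convention that talk namespaces have odd numbers. Here is a minimal R sketch of the same parity check, in case it needs to be applied to a data frame after collection. is_talk_ns is a hypothetical helper, not part of this notebook, and it adds an explicit guard because R's %% operator returns 1 for -1 (the Special namespace).

```r
# Hypothetical helper: TRUE for talk namespaces (odd, non-negative numbers).
is_talk_ns <- function(page_ns) {
  page_ns >= 0 & page_ns %% 2 == 1
}

# 0 = article, 1 = article talk, 2 = user, 3 = user talk, -1 = special pages
is_talk_ns(c(0, 1, 2, 3, -1))
# FALSE  TRUE FALSE  TRUE FALSE
```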
# review only talk page edit attempts
query <-
"
SELECT
date_format(dt, 'yyyy-MM-dd') as attempt_dt,
event.bucket AS experiment_group,
wiki As wiki,
event.integration AS event_type,
event.init_type AS init_type,
IF( wiki IN ('frwiki', 'eswiki', 'itwiki', 'jawiki', 'fawiki', 'plwiki', 'hewiki', 'nlwiki',
'hiwiki', 'kowiki', 'viwiki', 'thwiki', 'ptwiki', 'bnwiki', 'arzwiki', 'swwiki', 'zhwiki',
'ukwiki', 'idwiki', 'amwiki', 'omwiki', 'afwiki'), 'TRUE', 'FALSE'
) AS is_AB_test_wiki,
CASE
WHEN min(event.user_editcount) is NULL THEN 'undefined'
WHEN min(event.user_editcount) < 100 THEN 'under 100'
WHEN min(event.user_editcount) >=100 AND min(event.user_editcount) < 500 THEN '100-499'
ELSE '500+'
END AS initial_edit_count,
event.is_oversample AS is_oversample,
event.user_id as user_id,
event.editing_session_id as edit_attempt_id,
event.editor_interface AS editor_interface
FROM event.editattemptstep
WHERE
-- AB test deployed on Feb 12th
year = 2021
AND ((month = 02 AND day >= 12) OR (month > 02))
-- look at only desktop events
AND event.platform = 'desktop'
-- review all talk namespaces
AND event.page_ns % 2 = 1
AND event.action = 'init'
GROUP BY
date_format(dt, 'yyyy-MM-dd'),
event.bucket,
wiki,
event.integration,
event.init_type,
IF( wiki IN ('frwiki', 'eswiki', 'itwiki', 'jawiki', 'fawiki', 'plwiki', 'hewiki', 'nlwiki',
'hiwiki', 'kowiki', 'viwiki', 'thwiki', 'ptwiki', 'bnwiki', 'arzwiki', 'swwiki', 'zhwiki',
'ukwiki', 'idwiki', 'amwiki', 'omwiki', 'afwiki'), 'TRUE', 'FALSE'
),
event.is_oversample,
event.user_id,
event.editing_session_id,
event.editor_interface
"
collect_edit_attempts_talkonly <- wmfdata::query_hive(query)
Don't forget to authenticate with Kerberos using kinit
dt_attempts_bytestgroup_talkpage_only <- collect_edit_attempts_talkonly %>%
# review only AB test events
filter(attempt_dt >= '2021-02-12',
is_ab_test_wiki == TRUE,
experiment_group != 'NULL'
) %>%
group_by (event_type, init_type, experiment_group) %>%
summarise(n_users = n_distinct(user_id),
n_attempts = n_distinct(edit_attempt_id))
dt_attempts_bytestgroup_talkpage_only
`summarise()` regrouping output by 'event_type', 'init_type' (override with `.groups` argument)
event_type | init_type | experiment_group | n_users | n_attempts |
---|---|---|---|---|
<chr> | <chr> | <chr> | <int> | <int> |
discussiontools | page | control | 34 | 83 |
discussiontools | page | test | 1864 | 4339 |
page | page | control | 844 | 1312 |
page | page | test | 929 | 1755 |
page | section | control | 718 | 1179 |
page | section | test | 626 | 1010 |
ISSUE: There appear to be close to the same number of non-discussion tool edit attempts (event.integration = 'page') to talk pages in the control and test groups. I would expect far fewer non-DT edit attempts in the test group than in the control group, as test users are shown the reply tool by default.
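To put numbers on "close to the same", the page-integration rows of the talk-page table above can be summed per experiment group. This is a minimal sketch with the counts copied by hand from that table; only attempts are summed, because a user can appear under both init types and distinct user counts cannot be added.

```r
# Talk-page attempts with event.integration = 'page', copied from the table above.
page_attempts <- data.frame(
  experiment_group = c("control", "test", "control", "test"),
  init_type        = c("page", "page", "section", "section"),
  n_attempts       = c(1312, 1755, 1179, 1010)
)

# Total non-discussion tool attempts per experiment group.
totals <- sapply(split(page_attempts$n_attempts, page_attempts$experiment_group), sum)
totals

# Ratio of test to control attempts.
round(totals[["test"]] / totals[["control"]], 2)
```

On these counts the test group logged slightly more page attempts than the control group, not fewer.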
This could be reflective of actual user behavior. From David Lynch: "Someone who's used to replying to comments via the source editor might be sticking with what they know, rather than poking at new things. In that case, the fairly even split between control and test would be consistent with a user-type who's just not paying attention to the new functionality."
To test this hypothesis, I looked at the edit count bucket splits to see whether more experienced users account for more page events in the test group:
dt_attempts_bytestgroup_talkpage_only_editbucket <- collect_edit_attempts_talkonly %>%
# review only AB test events
filter(attempt_dt >= '2021-02-12',
is_ab_test_wiki == TRUE,
experiment_group != 'NULL'
) %>%
group_by (event_type, experiment_group, initial_edit_count) %>%
summarise(n_users = n_distinct(user_id),
n_attempts = n_distinct(edit_attempt_id))
dt_attempts_bytestgroup_talkpage_only_editbucket
`summarise()` regrouping output by 'event_type', 'experiment_group' (override with `.groups` argument)
event_type | experiment_group | initial_edit_count | n_users | n_attempts |
---|---|---|---|---|
<chr> | <chr> | <chr> | <int> | <int> |
discussiontools | control | 100-499 | 3 | 10 |
discussiontools | control | 500+ | 5 | 20 |
discussiontools | control | under 100 | 26 | 53 |
discussiontools | test | 100-499 | 186 | 477 |
discussiontools | test | 500+ | 505 | 1229 |
discussiontools | test | under 100 | 1187 | 2633 |
page | control | 100-499 | 117 | 165 |
page | control | 500+ | 555 | 1418 |
page | control | under 100 | 744 | 908 |
page | test | 100-499 | 108 | 175 |
page | test | 500+ | 554 | 1697 |
page | test | under 100 | 754 | 893 |
ab_attempts_num_wiki <- collect_edit_attempts_talkonly %>%
# review only AB edit attempts
filter(attempt_dt >= '2021-02-12',
is_ab_test_wiki == TRUE,
experiment_group != 'NULL'
) %>%
summarise(num_wikis = n_distinct(wiki))
ab_attempts_num_wiki
num_wikis |
---|
<int> |
21 |
Edit attempts by users in the AB test were recorded on 21 of the 22 participating wikis. A review of attempts by wiki indicates that no edit attempts by a user included in the AB test have yet been recorded on Oromo Wikipedia (omwiki), the only participating wiki absent from the per-wiki results. As a smaller wiki, this seems plausible if no user who met the AB test criteria has made an edit attempt yet.
ab_attempts_bywiki <- collect_edit_attempts_talkonly %>%
# review only AB edit attempts
filter(attempt_dt >= '2021-02-12',
experiment_group != 'NULL',
is_ab_test_wiki == TRUE) %>%
group_by(wiki, experiment_group) %>%
summarise(n_users = n_distinct(user_id),
n_attempts = n_distinct(edit_attempt_id))%>%
arrange(wiki)
ab_attempts_bywiki
`summarise()` regrouping output by 'wiki' (override with `.groups` argument)
wiki | experiment_group | n_users | n_attempts |
---|---|---|---|
<chr> | <chr> | <int> | <int> |
afwiki | control | 3 | 10 |
afwiki | test | 4 | 13 |
amwiki | control | 1 | 3 |
amwiki | test | 1 | 1 |
arzwiki | control | 4 | 5 |
arzwiki | test | 6 | 7 |
bnwiki | control | 20 | 29 |
bnwiki | test | 25 | 52 |
eswiki | control | 205 | 414 |
eswiki | test | 471 | 1131 |
fawiki | control | 89 | 134 |
fawiki | test | 158 | 312 |
frwiki | control | 205 | 352 |
frwiki | test | 528 | 1347 |
hewiki | control | 103 | 265 |
hewiki | test | 149 | 459 |
hiwiki | control | 13 | 14 |
hiwiki | test | 24 | 41 |
idwiki | control | 33 | 47 |
idwiki | test | 54 | 97 |
itwiki | control | 179 | 347 |
itwiki | test | 419 | 1196 |
jawiki | control | 148 | 199 |
jawiki | test | 234 | 544 |
kowiki | control | 26 | 46 |
kowiki | test | 36 | 76 |
nlwiki | control | 56 | 147 |
nlwiki | test | 125 | 354 |
plwiki | control | 81 | 131 |
plwiki | test | 166 | 354 |
ptwiki | control | 101 | 154 |
ptwiki | test | 230 | 487 |
swwiki | control | 3 | 5 |
swwiki | test | 4 | 7 |
thwiki | control | 12 | 14 |
thwiki | test | 19 | 42 |
ukwiki | control | 60 | 118 |
ukwiki | test | 98 | 233 |
viwiki | control | 30 | 47 |
viwiki | test | 39 | 77 |
zhwiki | control | 62 | 93 |
zhwiki | test | 134 | 274 |
There are no large discrepancies between the number of unique users that made an edit attempt in the control and test groups on each wiki, as would be expected if 50% of users were randomly assigned to each bucket.
Note: I reviewed the MediaWiki user preferences table to confirm that users in each wiki were bucketed appropriately based on the discussiontools-abtest preference and value. See further details in T268193#6781374.
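As a cross-check on which participating wiki has no AB test events, the wikis in the per-wiki table above can be compared against the list used in the query. This is a minimal sketch with both vectors typed in by hand; in the notebook itself, unique(ab_attempts_bywiki$wiki) could presumably stand in for the observed vector.

```r
# Participating wikis, as listed in the Hive query.
participating <- c("frwiki", "eswiki", "itwiki", "jawiki", "fawiki", "plwiki",
                   "hewiki", "nlwiki", "hiwiki", "kowiki", "viwiki", "thwiki",
                   "ptwiki", "bnwiki", "arzwiki", "swwiki", "zhwiki", "ukwiki",
                   "idwiki", "amwiki", "omwiki", "afwiki")

# Wikis that appear in the per-wiki table above.
observed <- c("afwiki", "amwiki", "arzwiki", "bnwiki", "eswiki", "fawiki",
              "frwiki", "hewiki", "hiwiki", "idwiki", "itwiki", "jawiki",
              "kowiki", "nlwiki", "plwiki", "ptwiki", "swwiki", "thwiki",
              "ukwiki", "viwiki", "zhwiki")

# Participating wikis with no recorded AB test edit attempts.
missing_wikis <- setdiff(participating, observed)
missing_wikis
# "omwiki"
```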
ab_attempts_byeditcount <- collect_edit_attempts %>%
# review only AB edit attempts
filter(attempt_dt >= '2021-02-12',
experiment_group != 'NULL',
is_ab_test_wiki == TRUE) %>%
group_by(experiment_group, initial_edit_count) %>%
summarise(n_users = n_distinct(user_id),
n_attempts = n_distinct(edit_attempt_id))
ab_attempts_byeditcount
`summarise()` regrouping output by 'experiment_group' (override with `.groups` argument)
experiment_group | initial_edit_count | n_users | n_attempts |
---|---|---|---|
<chr> | <chr> | <int> | <int> |
control | 100-499 | 316 | 838 |
control | 500+ | 1340 | 4601 |
control | under 100 | 1825 | 3104 |
test | 100-499 | 364 | 999 |
test | 500+ | 1355 | 5000 |
test | under 100 | 2132 | 4243 |
We are logging a reasonable number of attempts in each edit count group for each test group across all participating wikis.
Currently, the under 100 edits bucket has the most unique users with edit attempts in both AB experiment groups. However, the most attempts come from senior contributors with 500 or more edits.
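The contrast between user counts and attempt counts is easier to see as attempts per user. This is a minimal sketch with the values copied by hand from the table above; attempts_per_user is a derived column introduced here for illustration.

```r
# Users and attempts by experiment group and edit count bucket, from the table above.
by_editcount <- data.frame(
  experiment_group   = c("control", "control", "control", "test", "test", "test"),
  initial_edit_count = c("100-499", "500+", "under 100", "100-499", "500+", "under 100"),
  n_users            = c(316, 1340, 1825, 364, 1355, 2132),
  n_attempts         = c(838, 4601, 3104, 999, 5000, 4243)
)

# Average number of edit attempts per distinct user in each row.
by_editcount$attempts_per_user <- round(by_editcount$n_attempts / by_editcount$n_users, 2)

# Sort so the most active buckets come first.
by_editcount[order(-by_editcount$attempts_per_user), ]
```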
reply_ab_events_byloggedin <- collect_edit_attempts %>%
# the query returns the bucket as experiment_group (there is no test_group column)
filter(attempt_dt >= '2021-02-12',
is_ab_test_wiki == TRUE,
experiment_group != 'NULL') %>%
mutate(logged_in_status = ifelse(user_id == 0, 'logged-out', 'logged-in')) %>%
filter(logged_in_status == 'logged-out') %>%
group_by (wiki, experiment_group, initial_edit_count) %>%
summarise(users = n_distinct(user_id),
n_attempts = n_distinct(edit_attempt_id))
reply_ab_events_byloggedin
I confirmed that there are no edit attempts from logged out users recorded as part of the AB test.
reply_ab_events_byinterface <- collect_edit_attempts %>%
# the query returns the bucket as experiment_group (there is no test_group column)
filter(attempt_dt >= '2021-02-12',
is_ab_test_wiki == TRUE,
experiment_group != 'NULL') %>%
group_by (experiment_group, editor_interface) %>%
summarise(n_users = n_distinct(user_id),
n_attempts = n_distinct(edit_attempt_id))
reply_ab_events_byinterface
`summarise()` regrouping output by 'experiment_group' (override with `.groups` argument)
experiment_group | editor_interface | n_users | n_attempts |
---|---|---|---|
<chr> | <chr> | <int> | <int> |
control | visualeditor | 1478 | 2649 |
control | wikitext | 1348 | 3083 |
control | wikitext-2017 | 49 | 125 |
test | visualeditor | 1754 | 3781 |
test | wikitext | 1577 | 3678 |
test | wikitext-2017 | 44 | 110 |
Events from all three editor interface types (visualeditor, wikitext, and wikitext-2017) were recorded for each test group type as expected.
All of the data reported above includes the oversampled discussion tool events. Oversampling was deployed along with the other instrumentation changes on 12 February 2021.
I reviewed edit attempts by oversampling status to confirm if this change was deployed correctly to all dt-related events.
# review edit attempts by oversample status
edit_attempts_by_oversample <- collect_edit_attempts %>%
group_by (experiment_group, is_oversample) %>%
summarise(n_users = n_distinct(user_id),
n_attempts = n_distinct(edit_attempt_id))
edit_attempts_by_oversample
`summarise()` regrouping output by 'experiment_group' (override with `.groups` argument)
experiment_group | is_oversample | n_users | n_attempts |
---|---|---|---|
<chr> | <chr> | <int> | <int> |
control | false | 2641 | 5707 |
control | true | 98 | 149 |
NULL | false | 73685 | 329915 |
NULL | true | 1348 | 5530 |
test | false | 2586 | 5726 |
test | true | 851 | 1844 |
Oversampled events are recorded for each experiment group type. Now let's review discussion tool events by oversample status.
dt_attempts_by_oversample <- collect_edit_attempts %>%
filter(user_id != 0) %>%
group_by (event_type, is_oversample) %>%
summarise(n_users = n_distinct(user_id),
n_attempts = n_distinct(edit_attempt_id))
dt_attempts_by_oversample
`summarise()` regrouping output by 'event_type' (override with `.groups` argument)
event_type | is_oversample | n_users | n_attempts |
---|---|---|---|
<chr> | <chr> | <int> | <int> |
discussiontools | false | 217 | 387 |
discussiontools | true | 1510 | 5940 |
page | false | 46210 | 164384 |
page | true | 404 | 824 |
I confirmed that we are oversampling discussion tool events. There are also some non-discussion tool events (event.integration = 'page') that we are still oversampling.
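The extent of oversampling can be summarised as the share of attempts flagged is_oversample for each event type. This is a minimal sketch with the counts copied by hand from the table above; that table is not restricted to dates after the deployment, so some non-oversampled discussion tool attempts may predate the change.

```r
# Attempts by event type and oversample flag, copied from the table above.
oversample_counts <- data.frame(
  event_type    = c("discussiontools", "discussiontools", "page", "page"),
  is_oversample = c(FALSE, TRUE, FALSE, TRUE),
  n_attempts    = c(387, 5940, 164384, 824)
)

# Share of attempts flagged as oversampled, per event type.
oversampled <- oversample_counts[oversample_counts$is_oversample, ]
totals <- sapply(split(oversample_counts$n_attempts, oversample_counts$event_type), sum)
share_oversampled <- setNames(
  oversampled$n_attempts / totals[oversampled$event_type],
  oversampled$event_type
)
round(100 * share_oversampled, 1)
```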
# daily discussion tool edits attempts by oversample
dt_events_daily <- collect_edit_attempts %>%
filter(event_type == 'discussiontools') %>%
group_by(attempt_dt, is_oversample) %>%
summarize(n_attempts = n_distinct(edit_attempt_id))
`summarise()` regrouping output by 'attempt_dt' (override with `.groups` argument)
# plot daily dt events
options(repr.plot.width = 14, repr.plot.height = 8)
p <- dt_events_daily %>%
ggplot(aes(x= attempt_dt, y= n_attempts, color = is_oversample)) +
geom_line(size = 1.5) +
# linetype and color are arguments to geom_vline(), not elements of the date vector
geom_vline(xintercept = as.numeric(as.Date('2021-02-12')), linetype = "dashed", color = "black") +
geom_text(aes(x=as.Date('2021-02-12'), y=600, label="Discussion Tool Sampling Rate Change"), size=3.7, vjust = -1.2, angle = 90, color = "black") +
scale_x_date(date_labels = "%b-%d", date_breaks = "1 day", minor_breaks = NULL) +
scale_y_continuous(labels = polloi::compress) +
labs(
x = 'Date', y = "edit attempts per day",
title = "Daily Discussion Tool Edit Attempt By Oversample"
) +
# apply the complete theme first so it does not override the tweaks below
theme_bw() +
theme(legend.position = "bottom",
text = element_text(size=20))
p
The plot above shows that discussion tool events are oversampled starting on 12 Feb 2021, following the deployment of the config change. The increase in DT events is consistent with a change to a 100% sampling rate.