Post Deployment Data QA Desktop Search Instrumentation

Tasks:

  • A/B Test Setup for Search Move Changes T259250
  • SearchSatisfaction Instrumentation Changes T262300
  • AB Test Setup for new search widget T261647

QA Doc

Overview

In T256100, we added a skinVersion field and new values for the inputLocation and extraParams fields to record data for two A/B tests on deployed search changes: (1) the new location of the search widget, and (2) the new widget and search experience we're currently building in Vue.js.

These two A/B tests together support four possible configurations:

skinVersion  inputLocation        extraParams  Description
legacy       "header-navigation"  -            Vector Legacy skin with Legacy search (current master)
latest       "header-navigation"  -            Vector Latest skin with Legacy search (current master)
latest       "header-moved"       -            Vector Latest skin with Legacy search and latest location
latest       "header-moved"       "WVUI"       Vector Latest skin with Latest search and latest location
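
As a rough illustration of how the three fields combine, the four rows above can be expressed as a lookup. This is only a sketch: describe_config is a hypothetical helper (not part of the instrumentation), it takes one event's values at a time, and it assumes extraParams is either missing or exactly "WVUI".

```r
# Hypothetical helper: map the three logged fields to one of the four
# configurations in the table above. Expects scalar, non-missing
# skin_version and input_location. Returns NA for any other combination.
describe_config <- function(skin_version, input_location, extra_params = NA) {
    # Assumption: the new widget is flagged by extraParams being exactly "WVUI"
    is_wvui <- !is.na(extra_params) && extra_params == "WVUI"
    if (skin_version == "legacy" && input_location == "header-navigation" && !is_wvui) {
        "Vector Legacy skin with Legacy search"
    } else if (skin_version == "latest" && input_location == "header-navigation" && !is_wvui) {
        "Vector Latest skin with Legacy search"
    } else if (skin_version == "latest" && input_location == "header-moved" && !is_wvui) {
        "Vector Latest skin with Legacy search and latest location"
    } else if (skin_version == "latest" && input_location == "header-moved" && is_wvui) {
        "Vector Latest skin with Latest search and latest location"
    } else {
        NA_character_
    }
}

describe_config("latest", "header-moved", "WVUI")
describe_config("legacy", "header-moved")  # not one of the four configurations
```

Any combination that returns NA, such as header-moved on legacy, is one the tests do not support and is worth checking for in the data.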

New Search Location QA

The new location of the search functionality was deployed to all projects on 28 September 2020. The new location is available by default for anonymous users on our early adopter wikis, and by preference for all other users.

We are performing an A/B test of the new location with logged-in users on our early adopter wikis. 50% of logged-in users are seeing the new experience, while the other 50% are seeing the old experience. The test only applies to users on modern vector.

The SearchSatisfaction Schema will be used to track events from these changes.

Note: Bucketing is done on a search session basis.

I checked the following scenarios:

  • Check events and distinct sessions per header location (PASSED)
  • Check date when events started coming in and the number of events appear as expected (PASSED)
  • Correct inputLocations associated with each skin version (header-moved should not be recorded with legacy) (PASSED: BUG FIX DEPLOYED ON OCTOBER 13)
  • Check that skinVersion is only recorded when skin is vector (PASSED)
  • Check associated actions and sources (PASSED)
  • Check events appear for wikis as expected (PASSED)
  • Check trend of logged in and logged out sessions and events. (PASSED)
  • Check that AB test is balanced (PASSED)
In [1]:
shhh <- function(expr) suppressPackageStartupMessages(suppressWarnings(suppressMessages(expr)))
shhh({
    library(tidyverse); library(glue); library(lubridate); library(scales)
})

Collect Data

In [101]:
# collect data from SearchSatisfaction for the relevant fields.
query <- 
"SELECT
    event.inputLocation AS search_location,
    event.searchSessionId AS search_session,
    event.skinVersion AS vector_version,
    event.action AS action_type,
    event.source AS source_type,
    event.skin AS skin_type,
    wiki AS wiki,
    Count(*) AS events
FROM event.searchSatisfaction
-- review a few days prior to check when data started to come in
    WHERE year = 2020 and ((month = 09 and day >= 28) OR month >= 10) 
    -- remove bots
    AND useragent.is_bot = false 
GROUP BY 
    event.inputLocation,
    event.searchSessionId,
    event.skinVersion,
    event.action,
    event.source,
    event.skin,
    wiki"
In [102]:
search_sessions <-  wmfdata::query_hive(query)
Don't forget to authenticate with Kerberos using kinit

Total number of events and sessions per header location across all Wikis

In [103]:
# overall
search_sessions_bylocation <- search_sessions %>%
# limit to search events completed in the two header locations
    filter(search_location %in% c('header-moved', 'header-navigation')) %>%
    group_by(search_location) %>%
    summarise(unique_sessions = n_distinct(search_session),
             total_events = sum(events)) 

search_sessions_bylocation
`summarise()` ungrouping output (override with `.groups` argument)

A tibble: 2 × 3
search_location    unique_sessions  total_events
<chr>              <int>            <int>
header-moved       35570589         256259728
header-navigation  38855815         276620657

Looking across all wikis, there should be a larger number of header-navigation search sessions and events, as the new header location was only available by default for anonymous users on early adopter wikis and by preference for all other users.

While the data show a slightly larger number of header-navigation events, I would expect this difference to be much higher.

Update: Further investigation showed this was due to a bug that incorrectly logged header-navigation events as header-moved events on legacy skin. See further breakdowns below.

Search Location by Vector Version and Skin Type on All Wikis

In [104]:
search_location_check <- search_sessions %>%
    filter(search_location %in% c('header-navigation', 'header-moved')) %>%
    group_by(search_location, vector_version, skin_type) %>%
    summarise(num_events = sum(events),
             num_sessions = n_distinct(search_session))
search_location_check
`summarise()` regrouping output by 'search_location', 'vector_version' (override with `.groups` argument)

A grouped_df: 6 × 5
search_location    vector_version  skin_type  num_events  num_sessions
<chr>              <chr>           <chr>      <int>       <int>
header-moved       latest          vector     30677585    3823712
header-moved       legacy          vector     225579407   31746694
header-moved       NULL            NULL       2736        492
header-navigation  latest          vector     3992199     576515
header-navigation  legacy          vector     272625862   38278963
header-navigation  NULL            NULL       2596        499

FAILED:

ISSUE: Confirmed that the two header location types are only being recorded for the vector skin type; however, a large number of header-moved events are being recorded with legacy, which is not expected. There are also some header location events being recorded without an associated vector_version or skin.

I will further review by wiki and day below to confirm when these events started occurring.
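
To put a number on the issue, the share of header-moved events logged on legacy can be computed directly from the counts in the table above. A minimal sketch in base R, assuming the rule from the checklist that header-moved should never be recorded with legacy; the data frame and column names are illustrative.

```r
# Event counts copied from the breakdown table above (vector skin rows only).
location_check <- data.frame(
    search_location = c("header-moved", "header-moved",
                        "header-navigation", "header-navigation"),
    vector_version  = c("latest", "legacy", "latest", "legacy"),
    num_events      = c(30677585, 225579407, 3992199, 272625862)
)

# Assumed rule: header-moved should only ever be logged on the latest skin.
location_check$unexpected <- location_check$search_location == "header-moved" &
    location_check$vector_version == "legacy"

# Share of header-moved events that carry the unexpected legacy version
moved <- location_check[location_check$search_location == "header-moved", ]
unexpected_share <- sum(moved$num_events[moved$unexpected]) / sum(moved$num_events)
round(unexpected_share, 3)
```

With these counts, roughly 88% of header-moved events on the vector skin were logged with legacy, which is far too many to be explained by stale sessions alone.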

Further Investigate Legacy Events

In [105]:
## collect all search events with new header recorded on legacy by date.

query <- 
"SELECT
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) as date,
    event.inputLocation AS search_location,
    event.skin as skin_type,
    event.skinVersion AS vector_version,
    event.searchSessionId AS search_session,
    wiki,
    Count(*) as events
FROM event.searchSatisfaction 
-- review a few days prior to check when data started to come in
    WHERE year = 2020 AND ((month = 09 and day >= 20) OR month >= 10) 
-- review autocomplete searches on legacy
    AND event.action = 'searchResultPage'
    AND event.source = 'autocomplete'
    AND event.skinVersion = 'legacy'
    AND event.skin = 'vector'
    AND useragent.is_bot = false 
GROUP BY 
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')),
    event.inputLocation,
    event.skin,
    event.skinVersion,
    event.searchSessionId,
    wiki"
In [106]:
search_sessions_legacy <-  wmfdata::query_hive(query)
Don't forget to authenticate with Kerberos using kinit

In [107]:
search_sessions_legacy$date <- as.Date(search_sessions_legacy$date, format = "%Y-%m-%d")

Legacy Events with New Header Location By Date

In [108]:
legacy_sessions_withnewsearchloc <- search_sessions_legacy %>%
    filter(search_location == 'header-moved') %>%
    group_by(date, vector_version) %>%
    summarise(num_events = sum(events),
             num_sessions = n_distinct(search_session)) %>%
    arrange(date)

legacy_sessions_withnewsearchloc
`summarise()` regrouping output by 'date' (override with `.groups` argument)

A grouped_df: 35 × 4
date        vector_version  num_events  num_sessions
<date>      <chr>           <int>       <int>
2020-09-22  legacy          6           3
2020-09-23  legacy          39          8
2020-09-24  legacy          124         8
2020-09-25  legacy          10          3
2020-09-26  legacy          38          6
2020-09-27  legacy          59          6
2020-09-28  legacy          1077        246
2020-09-29  legacy          4001        843
2020-09-30  legacy          1820084     264887
2020-10-01  legacy          12434002    1831724
2020-10-02  legacy          14168856    2014952
2020-10-03  legacy          13081927    1800468
2020-10-04  legacy          14650240    1994717
2020-10-05  legacy          17662173    2457038
2020-10-06  legacy          18107765    2496282
2020-10-07  legacy          18173882    2508217
2020-10-08  legacy          17938919    2467371
2020-10-09  legacy          16807923    2322149
2020-10-10  legacy          14741838    1988172
2020-10-11  legacy          16162932    2164940
2020-10-12  legacy          19126714    2612897
2020-10-13  legacy          16117012    2281531
2020-10-14  legacy          5395902     901123
2020-10-15  legacy          3424183     596037
2020-10-16  legacy          1799897     327429
2020-10-17  legacy          984685      182818
2020-10-18  legacy          811087      154778
2020-10-19  legacy          829004      161731
2020-10-20  legacy          643590      126207
2020-10-21  legacy          418973      82357
2020-10-22  legacy          186056      39050
2020-10-23  legacy          35077       7978
2020-10-24  legacy          23123       5196
2020-10-25  legacy          19617       4450
2020-10-26  legacy          9896        2267
In [109]:
p <- search_sessions_legacy %>%
# compare header-moved to header-navigation dates
    filter(search_location %in% c('header-moved', 'header-navigation')) %>%
    group_by(date, search_location, vector_version) %>%
    summarise(num_events = sum(events),
             num_sessions = n_distinct(search_session)) %>%
    ggplot(aes(x=date, y= num_sessions, color = search_location)) +
    geom_line(size = 1.5) +
    scale_y_continuous() +
    labs (y = "Number of autocomplete search sessions per day",
          x = "Date",
         title = "Daily search sessions by search bar location on legacy vector skin")  +
     theme_bw() +
   theme(
        plot.title = element_text(hjust = 0.5),
        text = element_text(size=16),
        axis.text.x = element_text(angle=45, hjust=1)) 
 
p
`summarise()` regrouping output by 'date', 'search_location' (override with `.groups` argument)

In [118]:
ggsave("Figures/daily_legacy_search_events.png", p, width = 16, height = 8, units = "in", dpi = 300)

Header-moved events on legacy start being recorded in volume around 28 September, the date of deployment. Unlike on the latest skin, they drop sharply on 14 October and have continued to decrease since then.

Further investigation determined that the header-moved events recorded on legacy were caused by a bug in the instrumentation, which was fixed on 13 October. Since the fix, these events have continued to decline.
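
One way to locate the effect of the fix is to look at the day-over-day change in sessions around 13 October. The sketch below uses the session counts for 12 to 16 October from the table above; the variable names are illustrative and this is not necessarily the check that was run during QA.

```r
# Daily header-moved sessions on legacy, copied from the table above.
legacy_moved <- data.frame(
    date = as.Date(c("2020-10-12", "2020-10-13", "2020-10-14",
                     "2020-10-15", "2020-10-16")),
    num_sessions = c(2612897, 2281531, 901123, 596037, 327429)
)

# Percent change versus the previous day; the first row has no predecessor.
previous_day <- head(legacy_moved$num_sessions, -1)
legacy_moved$pct_change <- c(NA, diff(legacy_moved$num_sessions) / previous_day) * 100

# Date with the steepest relative drop (which.min skips the NA)
largest_drop <- legacy_moved$date[which.min(legacy_moved$pct_change)]
largest_drop
```

The steepest relative drop (about 60%) falls on 14 October, the first full day after the fix.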

Further Investigate NULL events

In [111]:
## collect all search header events that do not have an associated vector skin
# Check number of events for each to see if roughly even or number of sessions?
query <- 
"SELECT
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) as date,
    event.inputLocation AS search_location,
    event.skinVersion as vector_version,
    event.searchSessionId AS search_session,
    wiki,
    Count(*) as events
FROM event.searchSatisfaction 
-- review a few days prior to check when data started to come in
    WHERE year = 2020 AND ((month = 09 and day >= 20) OR month >= 10) 
-- further investigate instance where no skin was recorded
    AND event.skinVersion is NULL
    AND event.skin IS NULL
    AND event.action = 'searchResultPage'
    AND useragent.is_bot = false 
    AND event.source = 'autocomplete' 
GROUP BY 
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')),
    event.inputLocation,
    event.skinVersion,
    event.searchSessionId,
    wiki"
In [112]:
search_sessions_null <-  wmfdata::query_hive(query)
Don't forget to authenticate with Kerberos using kinit

In [113]:
search_sessions_null$date <- as.Date(search_sessions_null$date, format = "%Y-%m-%d")
In [114]:
daily_search_sessions_null_events <-search_sessions_null %>%
    filter(search_location %in% c('header-navigation', 'header-moved')) %>%
    group_by(search_location, vector_version, date) %>%
    summarise(num_sessions = n_distinct(search_session),
             num_events = sum(events))  %>%
    arrange(date)

daily_search_sessions_null_events
`summarise()` regrouping output by 'search_location', 'vector_version' (override with `.groups` argument)

A grouped_df: 55 × 5
search_location    vector_version  date        num_sessions  num_events
<chr>              <chr>           <date>      <int>         <int>
header-navigation  NULL            2020-09-23  1             1
header-navigation  NULL            2020-09-24  23            107
header-navigation  NULL            2020-09-25  119           630
header-navigation  NULL            2020-09-26  123           704
header-navigation  NULL            2020-09-27  104           574
header-navigation  NULL            2020-09-28  120           624
header-navigation  NULL            2020-09-29  90            453
header-moved       NULL            2020-09-30  5             30
header-navigation  NULL            2020-09-30  70            352
header-moved       NULL            2020-10-01  39            220
header-navigation  NULL            2020-10-01  40            182
header-moved       NULL            2020-10-02  36            169
header-navigation  NULL            2020-10-02  12            59
header-moved       NULL            2020-10-03  43            268
header-navigation  NULL            2020-10-03  8             36
header-moved       NULL            2020-10-04  45            289
header-navigation  NULL            2020-10-04  10            62
header-moved       NULL            2020-10-05  40            190
header-navigation  NULL            2020-10-05  9             59
header-moved       NULL            2020-10-06  51            344
header-navigation  NULL            2020-10-06  6             33
header-moved       NULL            2020-10-07  34            195
header-navigation  NULL            2020-10-07  3             13
header-moved       NULL            2020-10-08  33            169
header-navigation  NULL            2020-10-08  5             26
header-moved       NULL            2020-10-09  27            159
header-moved       NULL            2020-10-10  30            135
header-moved       NULL            2020-10-11  30            166
header-moved       NULL            2020-10-12  23            134
header-navigation  NULL            2020-10-12  2             4
header-moved       NULL            2020-10-13  21            100
header-navigation  NULL            2020-10-13  2             16
header-moved       NULL            2020-10-14  13            58
header-navigation  NULL            2020-10-14  6             33
header-moved       NULL            2020-10-15  5             16
header-navigation  NULL            2020-10-15  11            48
header-moved       NULL            2020-10-16  2             12
header-navigation  NULL            2020-10-16  9             76
header-moved       NULL            2020-10-17  2             32
header-navigation  NULL            2020-10-17  12            59
header-moved       NULL            2020-10-18  3             10
header-navigation  NULL            2020-10-18  17            87
header-moved       NULL            2020-10-19  4             18
header-navigation  NULL            2020-10-19  11            50
header-moved       NULL            2020-10-20  3             10
header-navigation  NULL            2020-10-20  5             16
header-navigation  NULL            2020-10-21  13            93
header-navigation  NULL            2020-10-22  10            69
header-navigation  NULL            2020-10-23  5             16
header-moved       NULL            2020-10-24  1             1
header-navigation  NULL            2020-10-24  9             45
header-moved       NULL            2020-10-25  1             10
header-navigation  NULL            2020-10-25  12            61
header-moved       NULL            2020-10-26  1             1
header-navigation  NULL            2020-10-26  5             24
In [115]:
# plot null sessions
options(repr.plot.width = 10, repr.plot.height = 7)
p <- search_sessions_null %>%
# remove content - advanced search page searches
    filter(search_location %in% c('header-navigation', 'header-moved'))  %>%
    #filter(search_location != 'content') %>%
    group_by(date, search_location) %>%
    summarise(num_events = sum(events),
             num_sessions = n_distinct(search_session)) %>%
    ggplot(aes(x=date, y= num_sessions, color = search_location)) +
    geom_line(size = 1.5) +
    scale_y_continuous() +
    labs (y = "Number of autocomplete search sessions per day",
          x = "Date",
         title = "Daily search sessions by search bar location without associated vector skin")  +
     theme_bw() +
   theme(
        plot.title = element_text(hjust = 0.5),
        text = element_text(size=16),
        axis.text.x = element_text(angle=45, hjust=1)) 
 
p
`summarise()` regrouping output by 'date' (override with `.groups` argument)

In [145]:
ggsave("Figures/daily_null_search_events.png", p, width = 16, height = 8, units = "in", dpi = 300)

Search events and sessions recorded with a NULL vector skin version have been decreasing since deployment. Based on the trend lines shown in the chart above, these NULL values appear to be related to caching issues as the new instrumentation was deployed.

We're currently recording fewer than 12 NULL sessions per day on each wiki, and they still seem to be decreasing, so I don't think this is an issue.

Search Location Events on Latest Skin by Date

In [116]:
## collect all latest search header events by location and date
query <- 
"SELECT
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) as date,
    event.inputLocation AS search_location,
    event.searchSessionId AS search_session,
    wiki,
    Count(*) as events
FROM event.searchSatisfaction 
-- review a few days prior to check when data started to come in
    WHERE year = 2020 AND ((month = 09 and day >= 20) OR month >= 10) 
-- only deployed on modern skin vector
    AND event.skinVersion = 'latest'
    AND event.skin = 'vector'
    AND useragent.is_bot = false 
    AND event.action = 'searchResultPage'
    AND event.source = 'autocomplete' 
GROUP BY 
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')),
    event.inputLocation,
    event.searchSessionId,
    wiki"
In [117]:
search_sessions_daily <-  wmfdata::query_hive(query)
Don't forget to authenticate with Kerberos using kinit

In [118]:
search_sessions_daily$date <- as.Date(search_sessions_daily$date, format = "%Y-%m-%d")
In [126]:
daily_search_sessions <- search_sessions_daily %>%
# remove content (advanced search page) events
    filter(search_location != 'content') %>%
    group_by(search_location, date) %>%
    summarise(unique_sessions = n_distinct(search_session),
             total_events = sum(events))  %>%
    arrange(date)
daily_search_sessions
`summarise()` regrouping output by 'search_location' (override with `.groups` argument)

A grouped_df: 97 × 4
search_location    date        unique_sessions  total_events
<chr>              <date>      <int>            <int>
header             2020-09-20  118534           922751
header             2020-09-21  143362           1092926
header             2020-09-22  140899           1054336
header             2020-09-23  141795           1078461
header-navigation  2020-09-23  392              2904
header             2020-09-24  120545           926857
header-navigation  2020-09-24  22423            144811
header             2020-09-25  43904            278737
header-navigation  2020-09-25  96679            695054
header             2020-09-26  21589            129302
header-navigation  2020-09-26  95391            717155
header             2020-09-27  9429             55990
header-navigation  2020-09-27  119616           930422
header             2020-09-28  3504             21154
header-moved       2020-09-28  504              2953
header-navigation  2020-09-28  140556           1062970
header             2020-09-29  1986             11406
header-moved       2020-09-29  1900             9595
header-navigation  2020-09-29  137554           1031666
header             2020-09-30  210              1119
header-moved       2020-09-30  20099            141768
header-navigation  2020-09-30  124920           922669
header             2020-10-01  61               268
header-moved       2020-10-01  118960           917933
header-navigation  2020-10-01  50194            309986
header             2020-10-02  30               139
header-moved       2020-10-02  122555           952519
header-navigation  2020-10-02  32252            186346
header             2020-10-03  11               54
header-moved       2020-10-03  114168           928295
⋮
header             2020-10-16  6                57
header-moved       2020-10-16  146000           1189450
header-navigation  2020-10-16  1836             9199
header             2020-10-17  3                12
header-moved       2020-10-17  117510           931602
header-navigation  2020-10-17  1933             9416
header             2020-10-18  2                10
header-moved       2020-10-18  139435           1125455
header-navigation  2020-10-18  2028             11159
header             2020-10-19  5                25
header-moved       2020-10-19  166965           1300230
header-navigation  2020-10-19  2078             9639
header             2020-10-20  4                12
header-moved       2020-10-20  167227           1312794
header-navigation  2020-10-20  2154             11209
header             2020-10-21  4                22
header-moved       2020-10-21  164099           1286373
header-navigation  2020-10-21  2199             11460
header-moved       2020-10-22  160020           1243858
header-navigation  2020-10-22  2142             10851
header-moved       2020-10-23  147414           1169522
header-navigation  2020-10-23  2037             11021
header             2020-10-24  1                8
header-moved       2020-10-24  127152           1032914
header-navigation  2020-10-24  2063             10339
header-moved       2020-10-25  145663           1166658
header-navigation  2020-10-25  2206             11965
header             2020-10-26  3                7
header-moved       2020-10-26  80299            615795
header-navigation  2020-10-26  1044             5784
In [127]:
p <-daily_search_sessions %>%
    ggplot(aes(x=date, y= unique_sessions, color = search_location)) +
    geom_line(size = 1.5) +
    scale_y_continuous() +
    labs (y = "Number of autocomplete search sessions per day",
          x = "Date",
         title = "Daily search sessions by search bar location on latest vector skin")  +
     theme_bw() +
   theme(
        plot.title = element_text(hjust = 0.5),
        text = element_text(size=16),
        axis.text.x = element_text(angle=45, hjust=1)) 
 
p
In [ ]:
ggsave("Figures/daily_latest_search_events.png", p, width = 16, height = 8, units = "in", dpi = 300)

The "header" location is how the original search bar events were recorded prior to the new instrumentation. These events are dropping off as expected, since the value was replaced by "header-navigation".

Header-moved (new search location) events start being recorded on the deployment date, 28 September. Following deployment, there is a significant drop-off of header-navigation events on the latest vector skin. The decrease is large, but a decrease was expected since the new header was deployed as opt-out for all logged-out users on early adopter wikis and to 50% of logged-in users. An additional check of logged-in versus logged-out users will help clarify.

Review Search Location Events on All Skin Types By Action and Source Type

In [121]:
action_type_check <- search_sessions %>%
    filter(search_location %in% c('header-navigation', 'header-moved')) %>%
    group_by(search_location, action_type, source_type) %>%
    summarise(events = sum(events))

action_type_check
`summarise()` regrouping output by 'search_location', 'action_type' (override with `.groups` argument)

A grouped_df: 2 × 4
search_location    action_type       source_type   events
<chr>              <chr>             <chr>         <int>
header-moved       searchResultPage  autocomplete  256259728
header-navigation  searchResultPage  autocomplete  276620657

PASSED:

The header-moved and header-navigation events are only recorded with action = 'searchResultPage' and source = 'autocomplete', as expected.

Review Search Location Events by Wiki

In [129]:
test_wiki_counts <- search_sessions %>%
    filter(search_location %in% c('header-navigation', 'header-moved'),
    ## review test wikis where deployed
     wiki %in% c('euwiki', 'frwiki', 'hewiki', 'ptwikiversity', 'frwiktionary', 'fawiki')) %>%
    group_by(search_location, wiki) %>%
    summarise(events = sum(events),
             sessions = n_distinct(search_session))  %>%
    arrange(wiki)

test_wiki_counts
`summarise()` regrouping output by 'search_location' (override with `.groups` argument)

A grouped_df: 12 × 4
search_location    wiki           events    sessions
<chr>              <chr>          <int>     <int>
header-moved       euwiki         325172    34312
header-navigation  euwiki         42852     5415
header-moved       fawiki         1199768   158652
header-navigation  fawiki         158559    24731
header-moved       frwiki         27472480  3317233
header-navigation  frwiki         3713332   526448
header-moved       frwiktionary   1480194   212714
header-navigation  frwiktionary   209442    37002
header-moved       hewiki         156401    92817
header-navigation  hewiki         27082     15833
header-moved       ptwikiversity  2672      420
header-navigation  ptwikiversity  378       85
In [130]:
## Review non test wikis to see what the difference is

nontest_wiki_counts <- search_sessions %>%
    filter(search_location %in% c('header-navigation', 'header-moved'),
    ## review some non-test wikis where deployed
     wiki %in% c('enwiki', 'eswiki', 'ruwiki', 'zhwiki', 'dewiki')) %>%
    group_by(search_location, wiki) %>%
    summarise(events = sum(events),
             sessions = n_distinct(search_session))  %>%
    arrange(wiki)

nontest_wiki_counts
`summarise()` regrouping output by 'search_location' (override with `.groups` argument)

A grouped_df: 10 × 4
search_location    wiki    events     sessions
<chr>              <chr>   <int>      <int>
header-moved       dewiki  28332468   3958386
header-navigation  dewiki  31971809   4523118
header-moved       enwiki  132251938  18201935
header-navigation  enwiki  163202407  22462494
header-moved       eswiki  12229252   1613443
header-navigation  eswiki  13987626   1815518
header-moved       ruwiki  11748048   1789305
header-navigation  ruwiki  13711217   2070640
header-moved       zhwiki  693042     301380
header-navigation  zhwiki  856770     370056

PASSED

On test wikis, there are roughly 4 to 5 times more header-moved sessions than header-navigation sessions across both skin types. This difference is expected, as on the test wikis the new header was shown by default to all anonymous users and to 50% of logged-in users.

On a sample of non-partner wikis, there are more header-navigation events and sessions than header-moved events and sessions. This is also expected, as the new location is only available as a preference on these non-partner wikis. Note: This difference should be even larger, but data from before 13 October 2020 incorrectly logged header-moved events on the legacy skin.
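
The relative difference on the test wikis can be reproduced from the session counts in the test wiki table above. A small sketch in base R, assuming the difference is measured as (header-moved minus header-navigation) divided by header-navigation; the column names are illustrative. Note that a value of 4.9 on this scale means 4.9 times more (490%), not 4.9%.

```r
# Distinct sessions per search location, copied from the test wiki table above.
test_wiki_sessions <- data.frame(
    wiki = c("euwiki", "fawiki", "frwiki", "frwiktionary", "hewiki", "ptwikiversity"),
    header_moved      = c(34312, 158652, 3317233, 212714, 92817, 420),
    header_navigation = c(5415, 24731, 526448, 37002, 15833, 85)
)

# Relative difference as a multiple of the header-navigation count
test_wiki_sessions$times_more <- with(
    test_wiki_sessions,
    (header_moved - header_navigation) / header_navigation
)

# The same quantity expressed as a percentage
test_wiki_sessions$pct_more <- test_wiki_sessions$times_more * 100

test_wiki_sessions[, c("wiki", "times_more", "pct_more")]
```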

Search Sessions by Logged in and Logged out Users

The isAnon field was added on 20 October 2020 to distinguish logged in and logged out users. I'll review events since deployment to confirm they are being recorded as expected.

In [131]:
## collect all search header events by location
# Check number of events for each to see if roughly even or number of sessions?
query <- 
"SELECT
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) as date,
    event.skinVersion AS vector_version,
    event.skin AS skin_type,
    event.inputLocation AS search_location,
    event.isAnon AS is_anonymous,
    event.searchSessionId AS search_session,
    wiki,
    Count(*) as events
FROM event.searchSatisfaction 
-- review a few days prior to check when data started to come in
    WHERE year = 2020 AND ((month = 10 and day >= 18)) 
-- review autocomplete actions
    AND event.action = 'searchResultPage'
    AND event.source = 'autocomplete'
-- remove bots
    AND useragent.is_bot = false 
GROUP BY 
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')),
    event.skinVersion,
    event.skin,
    event.inputLocation,
    event.isAnon,
    event.searchSessionId,
    wiki"
In [132]:
search_sessions_byanon <-  wmfdata::query_hive(query)
Don't forget to authenticate with Kerberos using kinit

In [134]:
search_sessions_byanon$date <- as.Date(search_sessions_byanon$date, format = "%Y-%m-%d")
In [135]:
# rename is anon column to clarify logged in status

search_sessions_byanon_clean <- search_sessions_byanon %>%
 mutate(logged_in_status = case_when(
        is_anonymous == 'NULL' ~ "NULL",
        is_anonymous == 'false'~ "logged-in",
        is_anonymous == 'true' ~ "logged-out"))

Confirm when logged in events start coming in

In [136]:
search_sessions_daily_byanon <- search_sessions_byanon_clean %>%
    filter(is_anonymous != 'NULL',
          skin_type == 'vector',
          vector_version == 'latest',
          search_location %in% c('header-moved', 'header-navigation'))  %>%
    group_by(date, search_location, logged_in_status) %>%
    summarise(num_events = sum(events),
              num_sessions = n_distinct(search_session)) %>%
    arrange(date)

head(search_sessions_daily_byanon, 10)
`summarise()` regrouping output by 'date', 'search_location' (override with `.groups` argument)

A grouped_df: 10 × 5
date        search_location    logged_in_status  num_events  num_sessions
<date>      <chr>              <chr>             <int>       <int>
2020-10-20  header-moved       logged-in         10488       1819
2020-10-20  header-moved       logged-out        842856      104794
2020-10-20  header-navigation  logged-in         7561        1383
2020-10-20  header-navigation  logged-out        6           2
2020-10-21  header-moved       logged-in         13966       2790
2020-10-21  header-moved       logged-out        1268413     160707
2020-10-21  header-navigation  logged-in         11430       2191
2020-10-21  header-navigation  logged-out        3           1
2020-10-22  header-moved       logged-in         13864       2693
2020-10-22  header-moved       logged-out        1228378     157102
In [137]:
# Plot logged in and logged out events by search location
options(repr.plot.width = 10, repr.plot.height = 7)
p <-search_sessions_daily_byanon %>%
    ggplot(aes(x=date, y= num_sessions, color = search_location)) +
    geom_line(size = 1.5) +
    scale_y_continuous() +
    facet_wrap(~logged_in_status, scales = "free_y") +
    labs (y = "Number of autocomplete search sessions per day",
          x = "Date",
         title = "Daily search sessions by search bar location on latest vector skin")  +
     theme_bw() +
   theme(
        plot.title = element_text(hjust = 0.5),
        text = element_text(size=16),
        axis.text.x = element_text(angle=45, hjust=1)) 
 
p

PASSED: Logged in events start appearing on 20 October, the date of deployment.

The differences between search location sessions for logged-out vs logged-in users appear as expected. Most logged-out sessions on the latest vector skin are header-moved events, as the new location is available by default for anonymous users on our early adopter wikis.

The new search location was deployed to 50% of logged-in users on early adopter wikis and is available only by preference to other logged-in users, which is why there is a more even split of sessions for logged-in users.

Review total number of Search Location Events by Logged in Status

In [138]:
### Confirm number of events and sessions for two new search locations

search_sessions_bysearchlocation_byanon <- search_sessions_byanon_clean %>%
# Remove null events and only review vector skin events
    filter(is_anonymous != 'NULL',
              skin_type == 'vector',
          search_location %in% c('header-moved', 'header-navigation'))  %>%
    group_by(vector_version, search_location, logged_in_status ) %>%
    summarise(num_events = sum(events),
              num_sessions = n_distinct(search_session))


search_sessions_bysearchlocation_byanon
`summarise()` regrouping output by 'vector_version', 'search_location' (override with `.groups` argument)

A grouped_df: 8 × 5
vector_version  search_location    logged_in_status  num_events  num_sessions
<chr>           <chr>              <chr>             <int>       <int>
latest          header-moved       logged-in         90586       16836
latest          header-moved       logged-out        7376053     925453
latest          header-navigation  logged-in         69639       13154
latest          header-navigation  logged-out        136         22
legacy          header-moved       logged-in         33          2
legacy          header-moved       logged-out        1048688     208045
legacy          header-navigation  logged-in         2697490     505192
legacy          header-navigation  logged-out        104985732   14152130

PASSED: Logged in status recorded for both vector versions (latest and legacy) and both search locations.

The numbers appear as expected.

  • There are very few header-navigation events for logged-out users on latest vector, which makes sense as the new search location was deployed as opt-out to these users.
  • There are a few header-moved events on legacy recorded for both logged-in and logged-out users. These are likely due to caching issues or old, long-running sessions. There are very few of these for logged-in users compared to logged-out users.
  • There are about 13 times more search events on the legacy skin than on the latest skin, as the latest skin is only available as a user preference on all wikis except for partner wikis.
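
The comparison between skins in the last bullet can be checked by summing events per vector version from the table above. A short sketch with aggregate(); the data frame is typed in by hand from that table and the names are illustrative.

```r
# Event counts by vector version, copied from the table above
# (four rows per version: two search locations x two logged-in statuses).
by_location <- data.frame(
    vector_version = rep(c("latest", "legacy"), each = 4),
    num_events = c(90586, 7376053, 69639, 136,
                   33, 1048688, 2697490, 104985732)
)

# Total events per vector version
events_by_skin <- aggregate(num_events ~ vector_version, data = by_location, FUN = sum)

latest <- events_by_skin$num_events[events_by_skin$vector_version == "latest"]
legacy <- events_by_skin$num_events[events_by_skin$vector_version == "legacy"]

# Relative difference as a multiple of the latest-skin total
times_more <- (legacy - latest) / latest
times_more
```

This gives roughly 13.4, meaning legacy has about 13 times more header search events than latest over this period.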

Review total number of search events by skin type and logged in status

In [139]:
search_sessions_byskintype_byanon <- search_sessions_byanon %>%
    filter(is_anonymous != 'NULL')  %>%
    group_by(skin_type, vector_version, is_anonymous ) %>%
    summarise(num_events = sum(events),
              num_sessions = n_distinct(search_session))


search_sessions_byskintype_byanon
`summarise()` regrouping output by 'skin_type', 'vector_version' (override with `.groups` argument)

A grouped_df: 11 × 5
skin_type    vector_version  is_anonymous  num_events  num_sessions
<chr>        <chr>           <chr>         <int>       <int>
cologneblue  NULL            false         1048        130
modern       NULL            false         27117       4949
modern       NULL            true          35          4
monobook     NULL            false         230259      42060
monobook     NULL            true          262         44
timeless     NULL            false         28029       5462
timeless     NULL            true          91          19
vector       latest          false         181311      30707
vector       latest          true          8293233     942578
vector       legacy          false         3081016     517489
vector       legacy          true          117898901   14550938

PASSED: Confirmed that the isAnon field is being logged for multiple skin types. The number of events and sessions seem as expected.

Review total number of search events by wiki and logged in status

By Test Wiki

In [140]:
search_sessions_byskintype_byanon <- search_sessions_byanon_clean %>%
    filter(is_anonymous != 'NULL',
    wiki %in% c('euwiki', 'frwiki', 'hewiki', 'ptwikiversity', 'frwiktionary', 'fawiki')) %>%
    group_by(wiki, logged_in_status ) %>%
    summarise(num_events = sum(events),
              num_sessions = n_distinct(search_session))


search_sessions_byskintype_byanon
`summarise()` regrouping output by 'wiki' (override with `.groups` argument)

A grouped_df: 12 × 4
wiki           logged_in_status  num_events  num_sessions
<chr>          <chr>             <int>       <int>
euwiki         logged-in         3538        674
euwiki         logged-out        102687      8607
fawiki         logged-in         19891       3195
fawiki         logged-out        349904      38379
frwiki         logged-in         180096      31216
frwiki         logged-out        7320373     809328
frwiktionary   logged-in         14569       2229
frwiktionary   logged-out        396035      53823
hewiki         logged-in         8499        2339
hewiki         logged-out        123024      32310
ptwikiversity  logged-in         16          7
ptwikiversity  logged-out        1082        113

PASSED

Sessions by logged-in status appear as expected for all the wikis. There are roughly 11 to 25 times more logged-out sessions than logged-in sessions on each partner wiki.

Search Location Events for AB Test Users on Test Wikis

We ran a check to confirm that the new search location was deployed to 50% of users (using the modern vector skin) on each wiki as part of the AB test. Bucketing is done on a search session basis, so we review the number of distinct sessions.

During QA, we realized that there was no way to distinguish logged-in users, who are in the AB test, from anonymous users. An isAnon field was added to the schema on 20 October 2020 to address this.

In [ ]:
# Review AB events to confirm the buckets are balanced

query <- 
"
SELECT
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) AS date,
    event.inputLocation AS search_location,
    event.isAnon AS is_anonymous,
    event.searchSessionId AS search_session,
    wiki,
    COUNT(*) AS events
FROM event.searchSatisfaction 
-- ab test restarted on Oct 20th when isAnon field was added
    WHERE year = 2020 AND ((month = 10 and day >= 20)) 
-- review autocomplete actions
    AND event.action = 'searchResultPage'
    AND event.source = 'autocomplete'
    AND event.inputLocation IN ('header-moved', 'header-navigation')
-- review test wikis
    AND wiki IN ('euwiki', 'frwiki', 'hewiki', 'ptwikiversity', 'frwiktionary', 'fawiki')
-- deployed on the new vector skin
    AND event.skinVersion = 'latest'
    AND event.skin = 'vector'
-- remove bots
    AND useragent.is_bot = false 
GROUP BY 
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')),
    event.inputLocation,
    event.isAnon,
    event.searchSessionId,
    wiki"
In [ ]:
search_sessions_ab <-  wmfdata::query_hive(query)
In [3]:
head(search_sessions_ab)
A data.frame: 6 × 6
   date        search_location  is_anonymous  search_session                wiki          events
   <chr>       <chr>            <chr>         <chr>                         <chr>         <int>
1  2020-10-20  header-moved     NULL          0095a915b3a1d213c379kghu3f7f  frwiki        5
2  2020-10-20  header-moved     NULL          01fb82cbe1d905573c0fkgh7luar  frwiktionary  4
3  2020-10-20  header-moved     NULL          04a0f9c79659f303475ckghlgphd  frwiki        3
4  2020-10-20  header-moved     NULL          0927dca0e97d2ce502bakghrezns  frwiki        6
5  2020-10-20  header-moved     NULL          09516839b31bce7bc634kghxdbua  frwiki        2
6  2020-10-20  header-moved     NULL          0b0c97100c563049ccb0kgh8tgh2  frwiki        8
In [4]:
search_sessions_ab$date <- as.Date(search_sessions_ab$date, format = "%Y-%m-%d")
In [7]:
# rename is anon column to clarify logged in status

search_sessions_ab_clean <- search_sessions_ab %>%
 mutate(logged_in_status = case_when(
        is_anonymous == 'NULL' ~ "NULL",
        is_anonymous == 'false'~ "logged-in",
        is_anonymous == 'true' ~ "logged-out"))
In [8]:
search_sessions_ab_check <- search_sessions_ab_clean %>%
# review only logged in events on wikis
    filter(logged_in_status == 'logged-in') %>%
    group_by(search_location) %>%
    summarise(num_events = sum(events),
              num_sessions = n_distinct(search_session))
`summarise()` ungrouping output (override with `.groups` argument)

Distinct Sessions for all users in AB test across wikis

In [9]:
search_sessions_ab_check
A tibble: 2 × 3
search_location    num_events  num_sessions
<chr>              <int>       <int>
header-moved       129098      22783
header-navigation  133058      25205

Distinct Sessions for all users in AB test per wiki

In [12]:
search_sessions_ab_check_bywiki <- search_sessions_ab_clean %>%
# review only logged in events on wikis
    filter(logged_in_status == 'logged-in') %>%
    group_by(search_location, wiki) %>%
    summarise(num_events = sum(events),
              num_sessions = n_distinct(search_session)) %>%
    arrange(wiki)

search_sessions_ab_check_bywiki
`summarise()` regrouping output by 'search_location' (override with `.groups` argument)

A grouped_df: 12 × 4
search_location    wiki           num_events  num_sessions
<chr>              <chr>          <int>       <int>
header-moved       euwiki         3422        694
header-navigation  euwiki         2502        503
header-moved       fawiki         12314       2031
header-navigation  fawiki         11434       1906
header-moved       frwiki         101237      17690
header-navigation  frwiki         109061      20390
header-moved       frwiktionary   10429       1498
header-navigation  frwiktionary   7974        1330
header-moved       hewiki         1659        859
header-navigation  hewiki         2055        1069
header-moved       ptwikiversity  37          11
header-navigation  ptwikiversity  32          7

The split is close to 50/50 (22,783 header-moved sessions versus 25,205 header-navigation sessions, or about 47.5% versus 52.5%) and the buckets look roughly balanced. Note: There are very few header-moved and header-navigation events recorded for ptwikiversity. As a result, it may be difficult to determine any impact of the header move on search sessions initiated on that wiki during the AB test analysis.
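
One way to quantify how close the logged-in buckets are to a 50/50 split is to compute each bucket's share of sessions and run an exact binomial test. The sketch below uses the session counts from the across-wiki table above and assumes sessions are independent draws, which is a simplification. The names ab_sessions, shares, and balance_test are illustrative. With samples this large, even a small imbalance can produce a small p-value, so the shares are worth reading alongside the test result.

```r
# Distinct logged-in sessions per bucket, copied from the across-wiki table above
ab_sessions <- c("header-moved" = 22783, "header-navigation" = 25205)

# Share of sessions in each bucket, in percent
shares <- round(100 * ab_sessions / sum(ab_sessions), 1)

# Exact binomial test against an expected 50/50 split
# (assumes sessions are independent, which is a simplification)
balance_test <- binom.test(ab_sessions[["header-moved"]], sum(ab_sessions), p = 0.5)

shares
balance_test$p.value
```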

New Search Widget QA

The new search widget was deployed on the test wiki on 17 September 2020 T259798.

I first reviewed events logged on the test wiki to ensure they were recorded as expected prior to the search widget being deployed to the pilot wikis.

Scenarios Tested:

  • Check new search widget events by search location (PASSED)
  • Check date when events started coming in and the number of events appear as expected (PASSED)
  • Check new search widget events by vector version (PASSED)
  • Check associated actions and sources (PASSED)
  • Check events appear for wikis as expected (PASSED)
  • Check trend of logged in and logged out sessions and events. (PASSED)
  • Check that new edit bucket field was added appropriately (PASSED)
  • Check that AB test is balanced (PASSED)

Test Wiki Post Deployment

In [2]:
# collect data from SearchSatisfaction for the relevant fields.
query <- 
"SELECT
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) as search_dt,
    event.inputLocation AS search_location,
    event.searchSessionId AS search_session,
    event.skinVersion AS vector_version,
    event.action AS action_type,
    event.source AS source_type,
    event.skin AS skin_type,
    event.isAnon AS is_anonymous,
    event.extraParams AS search_type,
    wiki AS wiki,
    Count(*) AS events
FROM event.searchSatisfaction
    WHERE year = 2021
    AND month = 02
    AND day >= 17
    -- remove bots
    AND useragent.is_bot = false 
    AND wiki IN ('testwiki', 'test2wiki')
GROUP BY 
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')),
    event.inputLocation,
    event.searchSessionId,
    event.skinVersion,
    event.action,
    event.source,
    event.skin,
    event.isAnon,
    event.extraParams,
    wiki"
In [4]:
search_widget_sessions <-  wmfdata::query_hive(query)
Don't forget to authenticate with Kerberos using kinit

In [3]:
search_widget_sessions$search_dt <- as.Date(search_widget_sessions$search_dt)

New Search Widget Events and Sessions By Date

In [27]:
search_sessions_bydate <- search_widget_sessions %>%
#find new search widget events
    filter(action_type == 'searchResultPage',
            source_type == 'autocomplete',
        search_type =="WVUI") %>%
    group_by(search_dt) %>%
    summarise(num_events = sum(events),
             num_sessions = n_distinct(search_session)) %>%
    arrange(search_dt)

search_sessions_bydate
`summarise()` ungrouping output (override with `.groups` argument)

A tibble: 3 × 3
search_dt   num_events  num_sessions
<date>      <int>       <int>
2021-02-17  19          4
2021-02-18  1           1
2021-02-22  1           1

A total of 6 new search widget sessions (as indicated by event.extraParams = 'WVUI') have been recorded on the testwiki as of 22 February 2021.

New Search Widget Events By Search Type and Search Location

In [19]:
search_sessions_bysearchlocation <- search_widget_sessions %>%
# review only autocomplete sessions
    filter(action_type == 'searchResultPage',
            source_type == 'autocomplete',
          skin_type == 'vector') %>%
    group_by(search_type, vector_version, search_location) %>%
    summarise(num_events = sum(events),
             num_sessions = n_distinct(search_session)) 

search_sessions_bysearchlocation
`summarise()` regrouping output by 'search_type', 'vector_version' (override with `.groups` argument)

A grouped_df: 5 × 5
search_type  vector_version  search_location    num_events  num_sessions
<chr>        <chr>           <chr>              <int>       <int>
NULL         latest          content            16          3
NULL         latest          header-moved       30          13
NULL         legacy          content            8           3
NULL         legacy          header-navigation  85          24
WVUI         latest          header-moved       21          6

The new search widget events are only logged for the new search location (event.inputLocation = 'header-moved'), as expected. We are also logging old search widget events (as indicated by event.extraParams IS NULL) for the new search location, as expected.

The old search location only has events recorded for the legacy vector version and does not include any new search widget events.

NOTE:

  • search_location = content events are searches done on the advanced search page and are not applicable to the new search widget, which is displayed on article pages.
  • There have been 21 new search widget search-initiated events, as shown in the table above, and 23 total new search widget events when click events are included.

New Search Widget Events By Skin Version

In [8]:
search_sessions_byskintype <- search_widget_sessions %>%
# review only autocomplete sessions.
    filter(action_type == 'searchResultPage',
            source_type == 'autocomplete') %>%
    group_by(search_type, skin_type, vector_version) %>%
    summarise(num_events = sum(events),
             num_sessions = n_distinct(search_session)) 

search_sessions_byskintype
`summarise()` regrouping output by 'search_type', 'skin_type' (override with `.groups` argument)

A grouped_df: 5 × 5
search_type  skin_type  vector_version  num_events  num_sessions
<chr>        <chr>      <chr>           <int>       <int>
NULL         monobook   NULL            34          8
NULL         timeless   NULL            10          3
NULL         vector     latest          46          15
NULL         vector     legacy          93          24
WVUI         vector     latest          21          6

Confirmed that new search widget events are only logged for the latest vector skin.

New Search Widget Events By Action Type

In [9]:
search_sessions_byactiontype <- search_widget_sessions %>%
# review only new search widget events.
    filter(search_type == 'WVUI') %>%
    group_by(search_type, source_type, action_type) %>%
    summarise(num_events = sum(events),
             num_sessions = n_distinct(search_session)) 

search_sessions_byactiontype
`summarise()` regrouping output by 'search_type', 'source_type' (override with `.groups` argument)

A grouped_df: 2 × 5
search_type  source_type   action_type       num_events  num_sessions
<chr>        <chr>         <chr>             <int>       <int>
WVUI         autocomplete  click             2           2
WVUI         autocomplete  searchResultPage  21          6

The new search widget events are recorded for autocomplete search sessions initiated, as expected. There were also 2 click events logged as coming from the new search widget.

Note: Click events include the event.extraParams field to indicate if the new search widget was used but do not include the event.inputLocation field to indicate the search location; however, we can use the searchSessionId to determine which search location sessions included a click to a result.
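
The session-based lookup described in the note can be sketched as a join between click events and the search-initiated events from the same session. The rows below are hypothetical and for illustration only, and the sketch assumes each session has a single search location.

```r
# Hypothetical events for illustration only; not real SearchSatisfaction data
events <- data.frame(
  search_session = c("s1", "s1", "s2", "s3", "s3"),
  action_type = c("searchResultPage", "click", "searchResultPage",
                  "searchResultPage", "click"),
  search_location = c("header-moved", NA, "header-navigation", "header-moved", NA),
  stringsAsFactors = FALSE
)

# Location per session, taken from the events that do carry inputLocation
session_location <- unique(
  events[events$action_type == "searchResultPage",
         c("search_session", "search_location")]
)

# Click events carry no location, so attach it through the session id
clicks <- events[events$action_type == "click", "search_session", drop = FALSE]
clicks_with_location <- merge(clicks, session_location, by = "search_session")

table(clicks_with_location$search_location)
```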

New Search Widget Events By Logged In Status

Logged in vs Logged out Users that see new header search location

In [25]:
search_sessions_byanon_newloc <- search_widget_sessions %>%
## review search actions in new search location
    filter(action_type == 'searchResultPage',
           search_location == 'header-moved') %>%
        #search_type == 'WVUI') %>%
    group_by(search_type, is_anonymous) %>%
    summarise(num_events = sum(events),
             num_sessions = n_distinct(search_session)) 

search_sessions_byanon_newloc
`summarise()` regrouping output by 'search_type' (override with `.groups` argument)

A grouped_df: 3 × 4
search_type  is_anonymous  num_events  num_sessions
<chr>        <chr>         <int>       <int>
NULL         false         30          13
WVUI         false         14          1
WVUI         true          7           5

Logged in vs Logged out Users by Search Location

In [34]:
search_sessions_byanon_newsearch <- search_widget_sessions %>%
## review search actions by search location
    filter( action_type == 'searchResultPage',
          search_location %in% c('header-moved', 'header-navigation')) %>%
    group_by(search_location, is_anonymous, vector_version) %>%
    summarise(num_events = sum(events),
             num_sessions = n_distinct(search_session)) 

search_sessions_byanon_newsearch
`summarise()` regrouping output by 'search_location', 'is_anonymous' (override with `.groups` argument)

A grouped_df: 4 × 5
search_location    is_anonymous  vector_version  num_events  num_sessions
<chr>              <chr>         <chr>           <int>       <int>
header-moved       false         latest          44          14
header-moved       true          latest          7           5
header-navigation  false         legacy          80          23
header-navigation  true          legacy          5           1

On the latest Vector desktop version, search sessions for logged-out users are recorded only for the new search location and the new search widget, as expected. Logged-in users see either the new search location with the new search widget or the new search location with the old search widget, depending on which AB treatment they receive.

New search widget events have been recorded for both logged in (1 session) and logged out (5 sessions) users.

New Search Widget AB Test Sessions Initiated

In [30]:
search_sessions_ab_test <- search_widget_sessions %>%
# review search sessions initiated that meet AB bucketing criteria
    filter(
         search_location %in% c('header-moved', 'header-navigation'),
        is_anonymous == 'false',
          source_type == 'autocomplete',
          action_type == 'searchResultPage',
          vector_version == 'latest',
          skin_type == 'vector') %>%
    group_by(search_type, search_location) %>%
    summarise(num_events = sum(events),
             num_sessions = n_distinct(search_session)) 

search_sessions_ab_test
`summarise()` regrouping output by 'search_type' (override with `.groups` argument)

A grouped_df: 2 × 4
search_type  search_location  num_events  num_sessions
<chr>        <chr>            <int>       <int>
NULL         header-moved     30          13
WVUI         header-moved     14          1

There have not been a sufficient number of events to confirm AB bucketing, but I've confirmed that the AB instrumentation is in place as needed.

Search Widget Session By Edit Count

An edit count bucket was added to the SearchSatisfaction schema in T272991. The change was deployed on 23 February 2021.

In [27]:
# collect data from SearchSatisfaction for the relevant fields.
query <- 
"SELECT
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) as search_dt,
    event.inputLocation AS search_location,
    event.searchSessionId AS search_session,
    event.skinVersion AS vector_version,
    event.action AS action_type,
    event.source AS source_type,
    event.isAnon AS is_anonymous,
    event.extraParams AS search_type,
    event.usereditbucket as edit_bucket,
    wiki AS wiki,
    Count(*) AS events
FROM event.searchSatisfaction
    WHERE year = 2021
    AND month = 02
-- change deployed on 24 Feb 2021 
    AND day >= 24
    -- remove bots
    AND useragent.is_bot = false 
    AND event.skin = 'vector'
GROUP BY 
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')),
    event.inputLocation,
    event.searchSessionId,
    event.skinVersion,
    event.action,
    event.source,
    event.isAnon,
    event.extraParams,
    event.usereditbucket,
    wiki"
In [28]:
search_sessions_weditbucket <-  wmfdata::query_hive(query)

Edit bucket count events by wiki

In [35]:
editbucket_events_bywiki <- search_sessions_weditbucket %>%
    filter(edit_bucket != 'NULL') %>%
    group_by(wiki, edit_bucket) %>%
    summarise(num_events = sum(events),
             num_sessions = n_distinct(search_session))


editbucket_events_bywiki
`summarise()` regrouping output by 'wiki' (override with `.groups` argument)

A grouped_df: 6 × 4
wiki              edit_bucket    num_events  num_sessions
<chr>             <chr>          <int>       <int>
mediawikiwiki     0 edits        1396        111
mediawikiwiki     1-4 edits      6           1
mediawikiwiki     100-999 edits  50          2
mediawikiwiki     1000+ edits    5           1
mediawikiwiki     5-99 edits     58          6
testwikidatawiki  0 edits        2           1

Search sessions with edit bucket data have been recorded on mediawikiwiki (121 search sessions) and testwikidatawiki (1 session) so far. We have events logged for all 5 edit bucket types and the number of events per edit bucket seems reasonable.

Edit bucket count events by logged in status

In [37]:
editbucket_events_byloggedin <- search_sessions_weditbucket %>%
    filter(edit_bucket != 'NULL') %>%
    group_by(wiki, is_anonymous, edit_bucket) %>%
    summarise(num_events = sum(events),
             num_sessions = n_distinct(search_session))


editbucket_events_byloggedin
`summarise()` regrouping output by 'wiki', 'is_anonymous' (override with `.groups` argument)

A grouped_df: 7 × 5
wiki              is_anonymous  edit_bucket    num_events  num_sessions
<chr>             <chr>         <chr>          <int>       <int>
mediawikiwiki     false         0 edits        75          14
mediawikiwiki     false         1-4 edits      6           1
mediawikiwiki     false         100-999 edits  50          2
mediawikiwiki     false         1000+ edits    5           1
mediawikiwiki     false         5-99 edits     58          6
mediawikiwiki     true          0 edits        1321        99
testwikidatawiki  true          0 edits        2           1

All logged out users are recorded as having 0 edits as expected.
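
This expectation can also be written as an explicit assertion so that the check fails loudly if a logged-out session is ever assigned a non-zero bucket. The sketch below is a minimal base R example; the editbucket_check data frame is illustrative and simply mirrors the is_anonymous and edit_bucket combinations from the table above.

```r
# is_anonymous / edit_bucket combinations copied from the table above
editbucket_check <- data.frame(
  is_anonymous = c("false", "false", "false", "false", "false", "true", "true"),
  edit_bucket = c("0 edits", "1-4 edits", "100-999 edits", "1000+ edits",
                  "5-99 edits", "0 edits", "0 edits"),
  stringsAsFactors = FALSE
)

# Distinct edit buckets recorded for logged-out (anonymous) users
anon_buckets <- unique(
  editbucket_check$edit_bucket[editbucket_check$is_anonymous == "true"]
)

# Stops with an error if any logged-out row has a bucket other than "0 edits"
stopifnot(identical(anon_buckets, "0 edits"))
```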

Edit bucket count events by search location and type

In [36]:
editbucket_events_bysearchtype <- search_sessions_weditbucket %>%
# review search sessions initiated with edit bucket recorded
    filter(edit_bucket != 'NULL',
            source_type == 'autocomplete',
          action_type == 'searchResultPage') %>%
    group_by(edit_bucket, search_location, vector_version, search_type) %>%
    summarise(num_events = sum(events),
             num_sessions = n_distinct(search_session))


editbucket_events_bysearchtype
`summarise()` regrouping output by 'edit_bucket', 'search_location', 'vector_version' (override with `.groups` argument)

A grouped_df: 7 × 6
edit_bucket    search_location    vector_version  search_type  num_events  num_sessions
<chr>          <chr>              <chr>           <chr>        <int>       <int>
0 edits        content            legacy          NULL         266         22
0 edits        header-navigation  legacy          NULL         432         76
1-4 edits      header-navigation  legacy          NULL         5           1
100-999 edits  header-navigation  legacy          NULL         18          2
1000+ edits    header-navigation  legacy          NULL         2           1
5-99 edits     header-moved       latest          NULL         1           1
5-99 edits     header-navigation  legacy          NULL         30          5

Edit bucket events are recorded on both vector versions and both search location types. No new search widget events have been recorded since this feature has only been deployed on testwiki and we do not yet have any events there.

Post-Deployment QA to Pilot Wikis

In [216]:
# collect data from SearchSatisfaction for the relevant fields.
query <- 

"WITH search_sessions AS (
-- find search session start date
SELECT
    min(CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0'))) as search_start,
    event.searchSessionId AS search_session,
    wiki AS wiki
FROM event.SearchSatisfaction
WHERE
Year = 2021
AND useragent.is_bot = false 
AND event.skin = 'vector'
AND wiki IN ('frwiktionary', 'hewiki', 'ptwikiversity', 'frwiki', 
    'euwiki', 'fawiki', 'ptwiki', 'kowiki', 'trwiki', 'srwiki', 'bnwiki', 'dewikivoyage', 'vecwiki' )
GROUP BY
event.searchSessionId,
    wiki)

SELECT
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) as event_dt,
    search_sessions.search_start,
    event.inputLocation AS search_location,
    event.searchSessionId AS search_session,
    event.action AS action_type,
    event.source AS source_type,
    event.isAnon AS is_anonymous,
    event.extraParams AS search_type,
    event.usereditbucket as edit_bucket,
    ss.wiki AS wiki,
    Count(*) AS events
FROM event.searchSatisfaction ss
INNER JOIN search_sessions ON
 event.searchSessionId = search_sessions.search_session AND
 ss.wiki = search_sessions.wiki
    WHERE year = 2021
-- Review following bug fix on March 10.
    AND ((Month = 03 AND Day = 10 AND HOUR >= 19) OR (Month = 03 AND Day >=11))
--deployed at 7:00 UTC
    AND HOUR >= 19
-- change deployed on 24 Feb 2021 
    -- remove bots
    AND useragent.is_bot = false 
    AND event.skin = 'vector'
    AND event.skinVersion = 'latest'
    AND ss.wiki IN ('frwiktionary', 'hewiki', 'ptwikiversity', 'frwiki', 
    'euwiki', 'fawiki', 'ptwiki', 'kowiki', 'trwiki', 'srwiki', 'bnwiki', 'dewikivoyage', 'vecwiki' )
GROUP BY 
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')),
    event.inputLocation,
    event.searchSessionId,
    event.action,
    event.source,
    event.isAnon,
    event.extraParams,
    event.usereditbucket,
    ss.wiki,
    search_sessions.search_start"
In [217]:
search_widget_sessions_pilotwikis <-  wmfdata::query_hive(query)

In [218]:
search_widget_sessions_pilotwikis$event_dt <- as.Date(search_widget_sessions_pilotwikis$event_dt)
search_widget_sessions_pilotwikis$search_start <- as.Date(search_widget_sessions_pilotwikis$search_start)
In [219]:
# add column to clarify stage deployments

search_widget_sessions_pilotwikis<- search_widget_sessions_pilotwikis %>%
    mutate(deployment_stage = case_when(
  wiki %in% c('frwiktionary', 'hewiki', 'ptwikiversity') ~ "stage 2",
  wiki %in% c('frwiki', 'euwiki', 'fawiki') ~ "stage 3",
  wiki %in% c('ptwiki', 'kowiki', 'trwiki', 'srwiki', 'bnwiki', 'dewikivoyage', 'vecwiki') ~ "stage 4",
 TRUE ~ "other"
))

Search Sessions Initiated by Pilot Wiki

I reviewed search sessions initiated by pilot wiki to confirm that the split between each group is close to what would be expected with a 50/50 split.

Note: There was an initial issue with the instrumentation that caused a number of the new search widget sessions not to be logged. This was fixed on March 10th. See https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/670459/

In [220]:
search_sessions_ab_test_pilot <- search_widget_sessions_pilotwikis %>%
# review search sessions initiated that meet AB bucketing criteria
    filter(
         search_location %in% c('header-moved', 'header-navigation'),
        # exclude long-running search sessions that started prior to deployment
         search_start >= '2021-03-10',
        is_anonymous == 'false',
         source_type == 'autocomplete',
         action_type == 'searchResultPage') %>%
    group_by(wiki,search_location, search_type) %>%
    summarise(
             num_sessions = n_distinct(search_session)) 

search_sessions_ab_test_pilot
`summarise()` regrouping output by 'wiki', 'search_location' (override with `.groups` argument)

A grouped_df: 26 × 4
wiki           search_location  search_type  num_sessions
<chr>          <chr>            <chr>        <int>
bnwiki         header-moved     NULL         11
bnwiki         header-moved     WVUI         5
dewikivoyage   header-moved     NULL         3
dewikivoyage   header-moved     WVUI         3
euwiki         header-moved     NULL         59
euwiki         header-moved     WVUI         73
fawiki         header-moved     NULL         163
fawiki         header-moved     WVUI         235
frwiki         header-moved     NULL         2262
frwiki         header-moved     WVUI         2142
frwiktionary   header-moved     NULL         198
frwiktionary   header-moved     WVUI         183
hewiki         header-moved     NULL         100
hewiki         header-moved     WVUI         367
kowiki         header-moved     NULL         55
kowiki         header-moved     WVUI         44
ptwiki         header-moved     NULL         453
ptwiki         header-moved     WVUI         500
ptwikiversity  header-moved     NULL         5
ptwikiversity  header-moved     WVUI         1
srwiki         header-moved     NULL         39
srwiki         header-moved     WVUI         116
trwiki         header-moved     NULL         183
trwiki         header-moved     WVUI         145
vecwiki        header-moved     NULL         1
vecwiki        header-moved     WVUI         3
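
To compare each pilot wiki against the expected 50/50 split, the new search widget share of sessions can be computed per wiki from the table above. This is a minimal base R sketch; the pilot_sessions data frame and the new_share_pct column are illustrative names, and wikis with very few sessions will naturally show noisy shares.

```r
# Logged-in session counts per pilot wiki, copied from the table above
pilot_sessions <- data.frame(
  wiki = c("bnwiki", "dewikivoyage", "euwiki", "fawiki", "frwiki",
           "frwiktionary", "hewiki", "kowiki", "ptwiki", "ptwikiversity",
           "srwiki", "trwiki", "vecwiki"),
  old_search = c(11, 3, 59, 163, 2262, 198, 100, 55, 453, 5, 39, 183, 1),
  new_search = c(5, 3, 73, 235, 2142, 183, 367, 44, 500, 1, 116, 145, 3)
)

# Share of sessions that used the new search widget (WVUI), in percent
pilot_sessions$total <- pilot_sessions$old_search + pilot_sessions$new_search
pilot_sessions$new_share_pct <- round(
  100 * pilot_sessions$new_search / pilot_sessions$total, 1
)

# Sort so that wikis furthest below or above 50% are easy to spot
pilot_sessions[order(pilot_sessions$new_share_pct), ]
```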

Daily Search Sessions By Search Type

In [221]:
# plot search location move trends

p <- search_widget_sessions_pilotwikis %>%
    mutate(search_type = ifelse(search_type == 'NULL', "old search", "new search")) %>%
# review search sessions initiated that meet AB bucketing criteria
    filter(
         search_location %in% c('header-moved', 'header-navigation'),
         is_anonymous == 'false',
          source_type == 'autocomplete',
          action_type == 'searchResultPage',
         search_start >= '2021-03-10') %>%
    group_by(event_dt, search_type) %>%
    summarise(num_events = sum(events),
             num_sessions = n_distinct(search_session))  %>%
  ggplot(aes(x=event_dt, y = num_sessions, color = search_type)) +
    geom_line(size = 1.5) +
    labs(y = "number of unique search sessions per day",
          x = "Date",
         title = "Daily AB Test Search Sessions Initiated by Search Type")  +
     theme_bw() +
    theme(
        plot.title = element_text(hjust = 0.5),
        text = element_text(size=16),
        legend.position="bottom",
        axis.text.x = element_text(angle=45, hjust=1)) 

p

ggsave("Figures/daily_search_sessions_by_type.png", p, width = 16, height = 8, units = "in", dpi = 300)
`summarise()` regrouping output by 'event_dt' (override with `.groups` argument)

Check non-pilot wikis events to confirm the data differs as expected

In [222]:
# collect non-pilot search session data
query <- 
"
SELECT
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) as event_dt,
    event.inputLocation AS search_location,
    event.searchSessionId AS search_session,
    event.action AS action_type,
    event.source AS source_type,
    event.isAnon AS is_anonymous,
    event.extraParams AS search_type,
    event.usereditbucket as edit_bucket,
    wiki AS wiki,
    Count(*) AS events
FROM event.searchSatisfaction 
    WHERE year = 2021
-- review events from February 2021 onward
    AND Month >= 02
    -- remove bots
    AND useragent.is_bot = false 
    AND event.skin = 'vector'
    AND event.skinVersion = 'latest'
    AND wiki IN ('slwiki', 'bewiki', 'dewiki', 'ruwiktionary', 'eswiktionary', 'cawiki',
            'da.wikipedia', 'rowiki', 'arwiki', 'idwiki', 'jawikiversity', 'jawiki', 'eswiki' )
GROUP BY 
    CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')),
    event.inputLocation,
    event.searchSessionId,
    event.action,
    event.source,
    event.isAnon,
    event.extraParams,
    event.usereditbucket,
    wiki"
In [223]:
search_widget_sessions_non_pilotwikis <-  wmfdata::query_hive(query)

In [224]:
search_widget_sessions_non_pilotwikis$event_dt <- as.Date(search_widget_sessions_non_pilotwikis$event_dt)
In [225]:
search_sessions_non_pilot <- search_widget_sessions_non_pilotwikis %>%
# review search sessions initiated that meet AB bucketing criteria
    filter(
         search_location %in% c('header-moved', 'header-navigation'),
        is_anonymous == 'false',
         source_type == 'autocomplete',
         action_type == 'searchResultPage') %>%
    group_by(wiki,search_location, search_type) %>%
    summarise(
             num_sessions = n_distinct(search_session)) 

search_sessions_non_pilot
`summarise()` regrouping output by 'wiki', 'search_location' (override with `.groups` argument)

A grouped_df: 17 × 4
wiki          search_location    search_type  num_sessions
<chr>         <chr>              <chr>        <int>
arwiki        header-moved       NULL         524
arwiki        header-moved       WVUI         11
bewiki        header-moved       NULL         31
cawiki        header-moved       NULL         354
cawiki        header-moved       WVUI         2
dewiki        header-moved       NULL         2522
dewiki        header-moved       WVUI         10
eswiki        header-moved       NULL         350
eswiki        header-moved       WVUI         3
eswiki        header-navigation  NULL         24
eswiktionary  header-moved       NULL         2
idwiki        header-moved       NULL         679
idwiki        header-moved       WVUI         10
jawiki        header-moved       NULL         852
jawiki        header-moved       WVUI         33
rowiki        header-moved       NULL         420
ruwiktionary  header-moved       NULL         826

New search widget sessions (indicated by search_type = "WVUI") were recorded on some of these non-test wikis, but not as many as would be expected if the widget were available to all logged-in users who have opted in to the new skin.

In [231]:
p <- search_widget_sessions_non_pilotwikis %>%
    mutate(search_type = ifelse(search_type == 'NULL', "old search", "new search")) %>%
# review search sessions initiated that meet AB bucketing criteria
    filter(
         event_dt < '2021-03-15' , #remove date with incomplete data
        search_location %in% c('header-moved', 'header-navigation'),
         is_anonymous == 'false',
          source_type == 'autocomplete',
          action_type == 'searchResultPage') %>%
    group_by(event_dt, search_type) %>%
    summarise(num_events = sum(events),
             num_sessions = n_distinct(search_session))  %>%
  ggplot(aes(x=event_dt, y = num_sessions, color = search_type)) +
    geom_line(size = 1.5) +
    labs(y = "number of unique search sessions per day",
          x = "Date",
         title = "Daily Non-AB Test Search Sessions Initiated by Search Type")  +
     theme_bw() +
    theme(
        plot.title = element_text(hjust = 0.5),
        text = element_text(size=16),
        legend.position="bottom",
        axis.text.x = element_text(angle=45, hjust=1)) 

p

ggsave("Figures/daily_search_sessions_by_type_nonAB.png", p, width = 16, height = 8, units = "in", dpi = 300)
`summarise()` regrouping output by 'event_dt' (override with `.groups` argument)

We're not seeing any new widget sessions for opted-in logged-in users on non-pilot wikis after March 2nd. New widget events were recorded from Feb 25th through March 2nd only.

Update: Per discussions with the Web team, it was clarified that the new search widget was turned off on all non-pilot wikis after the AB test started so that it would not interfere with the AB test results.
