In this analysis, we use Adam's topic dataset, which gives the "best" topic prediction for each article, for pages accessed in September 2019 (see the example of the first 10K non-randomized rows for an HTML view).
The outcome topics come from the "predicted" field, which is the post-enrichment best guess for the articles.
from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
The raw code for this notebook is by default hidden for easier reading.
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code"></form>
''')
import requests
import pandas as pd
import json
import matplotlib.pyplot as plt
import gzip
from wmfdata import hive
import numpy as np
You can find the source for `wmfdata` at https://github.com/neilpquinn/wmfdata
##read topic prediction file
topic = pd.read_csv('topic_prediction.tsv.gz', sep='\t',compression='gzip', header=0)
pageview_query = '''
SELECT
CONCAT(year,"-",month,"-01") AS date,
page_id,
SUM(view_count) AS pageviews
FROM
wmf.pageview_hourly
WHERE year = "{year}"
AND month = "{month}"
AND project = "{wiki}"
AND namespace_id = 0
AND agent_type = "user"
AND NOT (
country_code IN ("PK", "IR", "AF")
AND user_agent_map["browser_family"] = "IE" AND user_agent_map["browser_major"] = 7
)
GROUP BY CONCAT(year,"-",month,"-01"), page_id
'''
enwiki_pv_sept_all = hive.run([
"SET mapreduce.map.memory.mb=4096",
pageview_query.format(
year = 2019,
month = 9,
wiki = "en.wikipedia")
])
enwiki_pv_sept_all['proportion']= enwiki_pv_sept_all['pageviews']/enwiki_pv_sept_all['pageviews'].sum()
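# Sort in place by pageviews so that positional slices below select the most-viewed pages
enwiki_pv_sept_all = enwiki_pv_sept_all.sort_values(by='pageviews', ascending=False)
# Rows with a missing page_id
enwiki_pv_sept_all[enwiki_pv_sept_all.page_id.isnull()]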
| | date | page_id | pageviews | proportion |
---|---|---|---|---|
4156974 | 2019-9-01 | NaN | 48039 | 0.000007 |
print('Total page views in September: ' + str(enwiki_pv_sept_all.pageviews.sum()))
Total page views: 7198970305
print('Number of unique pages in September: ' + str(enwiki_pv_sept_all.shape[0]))
Number of unique pages: 8043636
print('Top 1M pages account for ' + str(round(enwiki_pv_sept_all.proportion[:1000000].sum() * 100,2)) + '% of total page views in September.')
Top 1M pages account for 91.8% of total page views.
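The top-1M figure above relies on the rows being sorted by pageviews before the positional slice is taken. A minimal sketch of the same calculation on toy data follows; `toy_pv` and `top_n_share` are hypothetical names introduced only for illustration, and the values are made up.

```python
import pandas as pd

# Toy data (hypothetical values) standing in for enwiki_pv_sept_all
toy_pv = pd.DataFrame({
    "page_id": [1, 2, 3, 4, 5],
    "pageviews": [10, 500, 40, 250, 200],
})

def top_n_share(df, n):
    """Share of total pageviews held by the n most-viewed pages."""
    # Sort explicitly so the result does not depend on the incoming row order
    ranked = df.sort_values(by="pageviews", ascending=False)
    return ranked["pageviews"].head(n).sum() / df["pageviews"].sum()

share_top2 = top_n_share(toy_pv, 2)
print(round(share_top2 * 100, 2))  # 75.0
```

Sorting inside the helper keeps the share correct even if the frame is passed in unsorted.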
pageview_title_query = '''
WITH v AS (
SELECT page_id, SUM(view_count) AS pageviews
FROM wmf.pageview_hourly
WHERE year = "{year}"
AND month = "{month}"
AND project = "{wiki}"
AND namespace_id = 0
AND agent_type = "user"
AND NOT (
country_code IN ("PK", "IR", "AF") AND user_agent_map["browser_family"] = "IE" AND user_agent_map["browser_major"] = 7
)
GROUP BY page_id
LIMIT 10000000
), p AS (
SELECT page_id, page_title, page_latest
FROM wmf_raw.mediawiki_page
WHERE wiki_db = "enwiki"
AND snapshot = "{snapshot}"
AND page_id IS NOT NULL
AND page_namespace = 0
AND NOT page_is_redirect
)
SELECT v.page_id, p.page_title, v.pageviews
FROM v LEFT JOIN p ON v.page_id=p.page_id
'''
enwiki_pv_sept = hive.run([
"SET mapreduce.map.memory.mb=4096",
pageview_title_query.format(
year = 2019,
month = 9,
wiki = "en.wikipedia",
snapshot = "2019-09")
])
enwiki_pv_sept['proportion']= enwiki_pv_sept['pageviews']/enwiki_pv_sept['pageviews'].sum()
enwiki_pv_sept = enwiki_pv_sept.sort_values(by='pageviews', ascending=False)
enwiki_pv_topic_sept = enwiki_pv_sept.merge(topic, how = 'left', on = 'page_id')
enwiki_pv_topic_sept['predicted'] = enwiki_pv_topic_sept['predicted'].fillna(value='Unknown')
enwiki_pv_topic_sept['proportion']= enwiki_pv_topic_sept['pageviews']/enwiki_pv_topic_sept['pageviews'].sum()
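The left join above keeps every viewed page and labels pages without a topic prediction as `Unknown`. A small sketch of that pattern on toy data follows; the frames and values are hypothetical, and the `validate` argument is an optional safeguard that is not part of the original analysis.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for the pageview and topic tables
toy_views = pd.DataFrame({"page_id": [1, 2, 3], "pageviews": [60, 30, 10]})
toy_topics = pd.DataFrame({"page_id": [1, 3], "predicted": ["Sports", "Space"]})

# Left join keeps every viewed page; validate raises if a page_id
# appears more than once in the topic table (which would duplicate rows)
merged = toy_views.merge(toy_topics, how="left", on="page_id",
                         validate="many_to_one")

# Pages with no prediction get an explicit label instead of NaN
merged["predicted"] = merged["predicted"].fillna("Unknown")
merged["proportion"] = merged["pageviews"] / merged["pageviews"].sum()

# Share of pageviews that could not be assigned a topic
unmatched_share = merged.loc[merged["predicted"] == "Unknown", "proportion"].sum()
print(unmatched_share)
```

Checking the `Unknown` share after the merge is a quick way to see how much traffic falls outside the topic dataset.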
The table below shows the 50 most-viewed articles on English Wikipedia in September 2019, with each article's proportion of total pageviews and its best predicted topic.
enwiki_page_sept_summary = enwiki_pv_topic_sept[['page_title','pageviews','proportion','predicted']].sort_values(by='pageviews', ascending=False).reset_index(drop=True).head(50)
print('Top 50 articles account for ' + str(round(enwiki_page_sept_summary.proportion.sum() * 100,2))+ '% of total page views in September.')
Top 50 articles account for 7.84% of total page views in September.
enwiki_page_sept_summary
| | page_title | pageviews | proportion | predicted |
---|---|---|---|---|
0 | Main_Page | 473316359 | 0.065746 | Internet culture |
1 | Wikipedia | 7376833 | 0.001025 | Language and literature |
2 | List_of_Queen_of_the_South_episodes | 6426243 | 0.000893 | Broadcasting |
3 | It_Chapter_Two | 3777663 | 0.000525 | Entertainment |
4 | Deaths_in_2019 | 3225335 | 0.000448 | Time |
5 | Greta_Thunberg | 3146905 | 0.000437 | History and society |
6 | Saaho | 2927278 | 0.000407 | Entertainment |
7 | Joker_(2019_film) | 2864258 | 0.000398 | Visual arts |
8 | September_11_attacks | 2435866 | 0.000338 | Politics and government |
9 | Antonio_Brown | 2359216 | 0.000328 | Sports |
10 | Algorithms_for_calculating_variance | 2222385 | 0.000309 | Mathematics |
11 | Chandrayaan-2 | 2164700 | 0.000301 | Space |
12 | Hustlers_(2019_film) | 1909821 | 0.000265 | Entertainment |
13 | The_Bahamas | 1724124 | 0.000239 | The_Bahamas |
14 | Eli_Cohen | 1718906 | 0.000239 | Military and warfare |
15 | Storm_Area_51,_They_Can't_Stop_All_of_Us | 1716245 | 0.000238 | Internet culture |
16 | Ad_Astra_(film) | 1713237 | 0.000238 | Entertainment |
17 | 6ix9ine | 1599551 | 0.000222 | Internet culture |
18 | Bianca_Andreescu | 1594039 | 0.000221 | Sports |
19 | Ric_Ocasek | 1538688 | 0.000214 | Performing arts |
20 | 2019_FIBA_Basketball_World_Cup | 1536804 | 0.000213 | Sports |
21 | Unbelievable_(miniseries) | 1530294 | 0.000213 | Broadcasting |
22 | Line_shaft | 1526395 | 0.000212 | Technology |
23 | Billie_Eilish | 1496020 | 0.000208 | Performing arts |
24 | Mindhunter_(TV_series) | 1479612 | 0.000206 | Broadcasting |
25 | Solar_System | 1455847 | 0.000202 | Space |
26 | Freddie_Mercury | 1449146 | 0.000201 | Performing arts |
27 | Rafael_Nadal | 1426186 | 0.000198 | Sports |
28 | It_(2017_film) | 1382100 | 0.000192 | Entertainment |
29 | United_States | 1370054 | 0.000190 | United_States |
30 | List_of_Bollywood_films_of_2019 | 1359921 | 0.000189 | Entertainment |
31 | John_Bercow | 1342478 | 0.000186 | Politics and government |
32 | Hurricane_Dorian | 1302209 | 0.000181 | Politics and government |
33 | Peaky_Blinders_(TV_series) | 1286720 | 0.000179 | Broadcasting |
34 | Elton_John | 1253440 | 0.000174 | History and society |
35 | Wayne_Williams | 1238887 | 0.000172 | History and society |
36 | Once_Upon_a_Time_in_Hollywood | 1229706 | 0.000171 | Entertainment |
37 | Donald_Trump | 1217332 | 0.000169 | Politics and government |
38 | Joaquin_Phoenix | 1206288 | 0.000168 | Entertainment |
39 | Moving_average | 1186636 | 0.000165 | Mathematics |
40 | Apple_Network_Server | 1157855 | 0.000161 | Technology |
41 | Eddie_Money | 1151955 | 0.000160 | Performing arts |
42 | Sylvester_Stallone | 1141540 | 0.000159 | Entertainment |
43 | Judy_Garland | 1128921 | 0.000157 | History and society |
44 | Elizabeth_II | 1117237 | 0.000155 | Language and literature |
45 | Atlanta_murders_of_1979–1981 | 1116224 | 0.000155 | History and society |
46 | YouTube | 1114222 | 0.000155 | Entertainment |
47 | Charles_Manson | 1099406 | 0.000153 | History and society |
48 | Clash_of_Champions_(2019) | 1097502 | 0.000152 | Entertainment |
49 | Clint_Eastwood | 1074402 | 0.000149 | Entertainment |
The table below shows pageviews for the 50 most-viewed topics on English Wikipedia in September 2019. The Main Page is excluded from this table.
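# Aggregate pageviews and proportions per topic, excluding the Main Page.
# Columns are selected with a list; tuple-style selection is rejected by recent pandas versions.
enwiki_topic_sept_summary = (enwiki_pv_topic_sept[enwiki_pv_topic_sept.page_title != 'Main_Page']
.groupby('predicted', as_index = False)[['pageviews', 'proportion']]
.sum()
.sort_values(by='pageviews', ascending=False))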
print('Top 10 topics account for ' + str(round(enwiki_topic_sept_summary.proportion[:10].sum() * 100,2))+ '% of total page views in September.')
print('Top 50 topics account for ' + str(round(enwiki_topic_sept_summary.proportion[:50].sum() * 100,2))+ '% of total page views in September.')
Top 10 topics account for 62.49% of total page views in September.
Top 50 topics account for 92.03% of total page views in September.
enwiki_topic_sept_summary.head(50)
| | predicted | pageviews | proportion |
---|---|---|---|
148 | Entertainment | 951015501 | 0.132101 |
424 | Sports | 651186372 | 0.090453 |
339 | Performing arts | 582187240 | 0.080869 |
75 | Broadcasting | 508963501 | 0.070698 |
202 | History and society | 490649682 | 0.068154 |
347 | Politics and government | 390908702 | 0.054299 |
245 | Language and literature | 233216220 | 0.032395 |
446 | Technology | 233007905 | 0.032366 |
83 | Business and economics | 228801818 | 0.031782 |
342 | Philosophy and religion | 228716606 | 0.031770 |
56 | Biology | 220055114 | 0.030567 |
278 | Medicine | 197890857 | 0.027488 |
459 | Transportation | 189356967 | 0.026303 |
282 | Military and warfare | 187945486 | 0.026107 |
503 | Visual arts | 161411619 | 0.022421 |
355 | Regional geography | 115483591 | 0.016041 |
357 | Regional society | 110243831 | 0.015313 |
167 | Food and drink | 99778456 | 0.013860 |
215 | Internet culture | 99516902 | 0.013823 |
429 | Structures of note | 80883092 | 0.011235 |
142 | Education | 73448968 | 0.010202 |
478 | United States | 59122893 | 0.008212 |
130 | Disambiguation | 58367391 | 0.008108 |
343 | Physics | 56172269 | 0.007803 |
98 | Chemistry | 54165745 | 0.007524 |
422 | Space | 43268636 | 0.006010 |
272 | Mathematics | 41116650 | 0.005711 |
182 | Geosciences | 27172192 | 0.003774 |
279 | Meteorology | 22631546 | 0.003144 |
393 | Science | 22147222 | 0.003076 |
211 | India | 20943288 | 0.002909 |
58 | Bodies of water | 18631407 | 0.002588 |
455 | Time | 18066422 | 0.002510 |
146 | Engineering | 16869893 | 0.002343 |
294 | Music | 15193024 | 0.002110 |
244 | Landforms | 14819421 | 0.002058 |
277 | Media | 13913078 | 0.001933 |
115 | Crafts and hobbies | 13450923 | 0.001868 |
492 | Unknown | 12500400 | 0.001736 |
267 | Maps | 9156431 | 0.001272 |
90 | Canada | 7718540 | 0.001072 |
36 | Australia | 7413482 | 0.001030 |
33 | Arts | 6264661 | 0.000870 |
168 | France | 5750101 | 0.000799 |
214 | Information science | 5459799 | 0.000758 |
184 | Germany | 5000979 | 0.000695 |
223 | Italy | 4353786 | 0.000605 |
178 | Games and toys | 4268883 | 0.000593 |
371 | Russia | 3669942 | 0.000510 |
100 | China | 3425639 | 0.000476 |
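The per-topic summary can be sketched on toy data as below. The frame and its values are hypothetical. One point the sketch makes visible: because `proportion` is computed before the Main Page is filtered out, the per-topic proportions sum to less than 1, which appears to be why the topic shares reported here do not add up to 100%.

```python
import pandas as pd

# Toy data (hypothetical values) standing in for enwiki_pv_topic_sept
toy = pd.DataFrame({
    "page_title": ["Main_Page", "A", "B", "C"],
    "predicted": ["Internet culture", "Sports", "Sports", "Space"],
    "pageviews": [500, 200, 100, 200],
})
# Proportion is taken over ALL pageviews, Main Page included
toy["proportion"] = toy["pageviews"] / toy["pageviews"].sum()

summary = (toy[toy["page_title"] != "Main_Page"]
           .groupby("predicted", as_index=False)[["pageviews", "proportion"]]
           .sum()
           .sort_values(by="pageviews", ascending=False))

print(summary)
print(summary["proportion"].sum())  # below 1: the Main Page share is missing
```

If shares among non-Main-Page traffic are wanted instead, the proportion would need to be recomputed after the filter.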
enwiki_pv_aug_all = hive.run([
"SET mapreduce.map.memory.mb=4096",
pageview_query.format(
year = 2019,
month = 8,
wiki = "en.wikipedia")
])
enwiki_pv_aug_all['proportion']= enwiki_pv_aug_all['pageviews']/enwiki_pv_aug_all['pageviews'].sum()
enwiki_pv_aug_all = enwiki_pv_aug_all.sort_values(by='pageviews', ascending=False)
print('Total page views in August: ' + str(enwiki_pv_aug_all.pageviews.sum()))
Total page views in August: 7212202447
print('Number of unique pages in August: ' + str(enwiki_pv_aug_all.shape[0]))
Number of unique pages in August: 8813929
enwiki_pv_aug = hive.run([
"SET mapreduce.map.memory.mb=4096",
pageview_title_query.format(
year = 2019,
month = 8,
wiki = "en.wikipedia",
snapshot = "2019-09")
])
enwiki_pv_topic_aug = enwiki_pv_aug.merge(topic, how = 'left', on = 'page_id')
enwiki_pv_topic_aug['predicted'] = enwiki_pv_topic_aug['predicted'].fillna(value='Unknown')
enwiki_pv_topic_aug['proportion']= enwiki_pv_topic_aug['pageviews']/enwiki_pv_topic_aug['pageviews'].sum()
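# Same per-topic aggregation for August, excluding the Main Page.
# Columns are selected with a list; tuple-style selection is rejected by recent pandas versions.
enwiki_topic_aug_summary = (enwiki_pv_topic_aug[enwiki_pv_topic_aug.page_title != 'Main_Page']
.groupby('predicted', as_index = False)[['pageviews', 'proportion']]
.sum()
.sort_values(by='pageviews', ascending=False))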
print('Top 10 topics account for ' + str(round(enwiki_topic_aug_summary.proportion[:10].sum() * 100,2))+ '% of total page views in August')
print('Top 50 topics account for ' + str(round(enwiki_topic_aug_summary.proportion[:50].sum() * 100,2))+ '% of total page views in August')
Top 10 topics account for 62.48% of total page views in August
Top 50 topics account for 91.99% of total page views in August
# Rank topics by share of pageviews within each month (rank 1 = largest share)
enwiki_topic_sept_summary["sept_rank"] = enwiki_topic_sept_summary["proportion"].rank(ascending=False)
enwiki_topic_aug_summary["aug_rank"] = enwiki_topic_aug_summary["proportion"].rank(ascending=False)
topic_rank = enwiki_topic_sept_summary.merge(enwiki_topic_aug_summary, how = 'left', on = 'predicted')
topic_rank = topic_rank.rename(columns={'predicted': 'topic', 'proportion_x': 'proportion_sept','proportion_y': 'proportion_aug','pageviews_x':'pageviews_sept','pageviews_y':'pageviews_aug'})
topic_rank[['topic','proportion_sept','proportion_aug','sept_rank','aug_rank']].head(50)
| | topic | proportion_sept | proportion_aug | sept_rank | aug_rank |
---|---|---|---|---|---|
0 | Entertainment | 0.132101 | 0.138937 | 1.0 | 1.0 |
1 | Sports | 0.090453 | 0.087186 | 2.0 | 2.0 |
2 | Performing arts | 0.080869 | 0.081190 | 3.0 | 3.0 |
3 | Broadcasting | 0.070698 | 0.071853 | 4.0 | 4.0 |
4 | History and society | 0.068154 | 0.069659 | 5.0 | 5.0 |
5 | Politics and government | 0.054299 | 0.052139 | 6.0 | 6.0 |
6 | Language and literature | 0.032395 | 0.032178 | 7.0 | 7.0 |
7 | Technology | 0.032366 | 0.030678 | 8.0 | 9.0 |
8 | Business and economics | 0.031782 | 0.030793 | 9.0 | 8.0 |
9 | Philosophy and religion | 0.031770 | 0.030154 | 10.0 | 11.0 |
10 | Biology | 0.030567 | 0.030221 | 11.0 | 10.0 |
11 | Medicine | 0.027488 | 0.025845 | 12.0 | 14.0 |
12 | Transportation | 0.026303 | 0.027177 | 13.0 | 12.0 |
13 | Military and warfare | 0.026107 | 0.026282 | 14.0 | 13.0 |
14 | Visual arts | 0.022421 | 0.023707 | 15.0 | 15.0 |
15 | Regional geography | 0.016041 | 0.017046 | 16.0 | 16.0 |
16 | Regional society | 0.015313 | 0.016365 | 17.0 | 17.0 |
17 | Food and drink | 0.013860 | 0.014437 | 18.0 | 19.0 |
18 | Internet culture | 0.013823 | 0.014654 | 19.0 | 18.0 |
19 | Structures of note | 0.011235 | 0.011200 | 20.0 | 20.0 |
20 | Education | 0.010202 | 0.009655 | 21.0 | 21.0 |
21 | United States | 0.008212 | 0.008730 | 22.0 | 22.0 |
22 | Disambiguation | 0.008108 | 0.008134 | 23.0 | 23.0 |
23 | Physics | 0.007803 | 0.006779 | 24.0 | 24.0 |
24 | Chemistry | 0.007524 | 0.006209 | 25.0 | 25.0 |
25 | Space | 0.006010 | 0.005227 | 26.0 | 26.0 |
26 | Mathematics | 0.005711 | 0.004364 | 27.0 | 27.0 |
27 | Geosciences | 0.003774 | 0.003477 | 28.0 | 28.0 |
28 | Meteorology | 0.003144 | 0.002263 | 29.0 | 35.0 |
29 | Science | 0.003076 | 0.002795 | 30.0 | 32.0 |
30 | India | 0.002909 | 0.003165 | 31.0 | 29.0 |
31 | Bodies of water | 0.002588 | 0.002971 | 32.0 | 30.0 |
32 | Time | 0.002510 | 0.002395 | 33.0 | 33.0 |
33 | Engineering | 0.002343 | 0.002810 | 34.0 | 31.0 |
34 | Music | 0.002110 | 0.002175 | 35.0 | 37.0 |
35 | Landforms | 0.002058 | 0.002204 | 36.0 | 36.0 |
36 | Media | 0.001933 | 0.001866 | 37.0 | 39.0 |
37 | Crafts and hobbies | 0.001868 | 0.001894 | 38.0 | 38.0 |
38 | Unknown | 0.001736 | 0.002364 | 39.0 | 34.0 |
39 | Maps | 0.001272 | 0.001173 | 40.0 | 40.0 |
40 | Canada | 0.001072 | 0.001172 | 41.0 | 41.0 |
41 | Australia | 0.001030 | 0.001022 | 42.0 | 42.0 |
42 | Arts | 0.000870 | 0.000856 | 43.0 | 44.0 |
43 | France | 0.000799 | 0.000862 | 44.0 | 43.0 |
44 | Information science | 0.000758 | 0.000655 | 45.0 | 46.0 |
45 | Germany | 0.000695 | 0.000740 | 46.0 | 45.0 |
46 | Italy | 0.000605 | 0.000627 | 47.0 | 47.0 |
47 | Games and toys | 0.000593 | 0.000594 | 48.0 | 48.0 |
48 | Russia | 0.000510 | 0.000516 | 49.0 | 49.0 |
49 | China | 0.000476 | 0.000515 | 50.0 | 50.0 |
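Among the top 50 topics, proportions and ranks changed little between August and September 2019: the top seven topics hold the same rank in both months, and the largest shift is Meteorology, which moved from 35th in August to 29th in September.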
topic_rank['rank_diff_abs'] = abs(topic_rank['sept_rank'] - topic_rank['aug_rank'])
topic_rank[['topic','proportion_sept','proportion_aug','sept_rank','aug_rank','rank_diff_abs']].head(100).sort_values(by='rank_diff_abs', ascending=False).head(10)
| | topic | proportion_sept | proportion_aug | sept_rank | aug_rank | rank_diff_abs |
---|---|---|---|---|---|---|
59 | The_Bahamas | 0.000239 | 0.000022 | 60.0 | 196.0 | 136.0 |
99 | Hong_Kong | 0.000074 | 0.000156 | 100.0 | 73.0 | 27.0 |
78 | predicted | 0.000133 | 0.000104 | 79.0 | 89.0 | 10.0 |
73 | Netherlands | 0.000147 | 0.000127 | 74.0 | 82.0 | 8.0 |
76 | Syria | 0.000139 | 0.000120 | 77.0 | 85.0 | 8.0 |
28 | Meteorology | 0.003144 | 0.002263 | 29.0 | 35.0 | 6.0 |
69 | Israel | 0.000164 | 0.000146 | 70.0 | 76.0 | 6.0 |
84 | Denmark | 0.000105 | 0.000138 | 85.0 | 79.0 | 6.0 |
94 | Saudi Arabia | 0.000088 | 0.000075 | 95.0 | 101.0 | 6.0 |
86 | Argentina | 0.000104 | 0.000100 | 87.0 | 92.0 | 5.0 |
Looking at the ten largest rank changes from August to September among the 100 most-viewed topics, the "Country/Region" topics change the most between the two months.
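The rank comparison can be sketched on toy data as follows. The frames and values are hypothetical, and the `suffixes` argument is used here only to name the merged columns directly rather than renaming them afterwards.

```python
import pandas as pd

# Toy per-topic shares (hypothetical values) for two months
sept = pd.DataFrame({"predicted": ["A", "B", "C"], "proportion": [0.5, 0.3, 0.2]})
aug = pd.DataFrame({"predicted": ["A", "B", "C"], "proportion": [0.5, 0.1, 0.4]})

# rank() returns floats; with the default method, ties share an averaged rank
sept["sept_rank"] = sept["proportion"].rank(ascending=False)
aug["aug_rank"] = aug["proportion"].rank(ascending=False)

# Topics present in September but absent in August would get NaN ranks here
ranks = sept.merge(aug, how="left", on="predicted", suffixes=("_sept", "_aug"))
ranks["rank_diff_abs"] = (ranks["sept_rank"] - ranks["aug_rank"]).abs()

print(ranks.sort_values(by="rank_diff_abs", ascending=False))
```

Small topics can swing many ranks on very few pageviews, so a large rank change is worth reading together with the proportion columns.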
Compare pageviews for the top 10 topics between August and September 2019.
## Load the rpy2 IPython extension so we can use R for graphs
%load_ext rpy2.ipython
%%R
library(ggplot2)
library(tidyverse)
library(data.table)
%%R -i topic_rank
data.table(topic_rank)[1:10] %>%
melt(id.vars = c("topic"), measure.vars = c("pageviews_sept", "pageviews_aug"),variable.name = "month", value.name = 'count') %>%
ggplot(aes(fill=month, y=count, x=reorder(topic,count))) +
geom_bar(position="dodge", stat="identity",width = 0.6) + coord_flip() +
scale_y_continuous("Pageviews per Topic",
labels = polloi::compress) +
theme(axis.title.y=element_blank(),
axis.text=element_text(size=11),
legend.position = c(0.8, 0.15), legend.title = element_blank(),legend.text =element_text( hjust = 0,size = 10))+
labs(color = "type",
title = "Top 10 Viewed Topics Pageviews (Aug vs. Sept)")
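For readers who prefer to stay in Python, the wide-to-long reshaping that the R `melt` call performs before plotting can be sketched with pandas. The `wide` frame and its values are hypothetical; the column names mirror those in `topic_rank`.

```python
import pandas as pd

# Toy wide table (hypothetical values): one row per topic, one column per month
wide = pd.DataFrame({
    "topic": ["A", "B"],
    "pageviews_sept": [10, 20],
    "pageviews_aug": [12, 18],
})

# One row per (topic, month) pair, which is the shape a grouped bar chart needs
long = wide.melt(id_vars=["topic"],
                 value_vars=["pageviews_sept", "pageviews_aug"],
                 var_name="month", value_name="count")

print(long)
```

The resulting `month` column plays the role of the `fill` aesthetic in the ggplot call.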
Compare the 10 topics, among the top 50 by September pageviews, with the largest relative change in pageviews between August and September 2019.
topic_rank['pv_diff_pct'] = abs(topic_rank['pageviews_aug'] / topic_rank['pageviews_sept']-1)
pv_diff = topic_rank[['topic','pageviews_sept','pageviews_aug','pv_diff_pct']].head(50).sort_values(by='pv_diff_pct', ascending=False).head(10)
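The `pv_diff_pct` metric above is an absolute relative difference that uses September as the base. The sketch below reproduces it on toy data (hypothetical values) next to a signed alternative that uses August as the base; `change_vs_aug` is a hypothetical column added only for comparison and is not part of the original analysis.

```python
import pandas as pd

# Toy per-topic pageviews (hypothetical values)
toy = pd.DataFrame({
    "topic": ["A", "B", "C"],
    "pageviews_sept": [100, 200, 50],
    "pageviews_aug": [110, 100, 50],
})

# Metric as defined in the notebook: |aug / sept - 1|, direction is discarded
toy["pv_diff_pct"] = (toy["pageviews_aug"] / toy["pageviews_sept"] - 1).abs()

# Signed alternative: growth from August to September, relative to August
toy["change_vs_aug"] = toy["pageviews_sept"] / toy["pageviews_aug"] - 1

print(toy.sort_values(by="pv_diff_pct", ascending=False))
```

The two columns can differ noticeably for topics whose traffic roughly doubled or halved, so it helps to state which month is the base when reporting a percentage.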
%%R -i pv_diff
data.table(pv_diff) %>%
filter(topic != 'Unknown') %>%
melt(id.vars = c("topic"), measure.vars = c("pageviews_sept", "pageviews_aug"),variable.name = "month", value.name = 'count') %>%
ggplot(aes(fill=month, y=count, x=reorder(topic,count))) +
geom_bar(position="dodge", stat="identity",width = 0.6) + coord_flip() +
scale_y_continuous("Pageviews per Topic",
labels = polloi::compress) +
theme(axis.title.y=element_blank(),
axis.text=element_text(size=11),
legend.position = c(0.8, 0.15), legend.title = element_blank(),legend.text =element_text( hjust = 0,size = 10))+
labs(color = "type",
title = "Top 10 Pageviews %Diff Topics (Aug vs. Sept)")