from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
    if (code_show){
        $('div.input').hide();
    } else {
        $('div.input').show();
    }
    code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')
Motivation: Along with the explosion in popularity of mobile devices, mobile games have become more and more popular nowadays. From earlier hits such as Plants vs. Zombies and Flappy Bird to more recent multiplayer games such as Clash of Clans and Pokemon Go, a wide variety of titles have achieved big success with very different strategies. Investigating how popular they are and what makes them popular (or not) yields invaluable information for planning a product and service over the short and long term.
Text mining of social media such as Twitter comes in handy for this kind of task. Twitter data constitute a rich source of information about almost any topic imaginable: they can be used to find trends related to a specific keyword, to measure brand sentiment, and to gather feedback about new products and services.
In this proposal, I demonstrate the possibility of using Twitter to (1) conduct a very efficient and productive survey of the mobile game business, and (2) perform a detailed analysis of a given product (here, Pokemon Go) to find important information such as hot terms/topics, product usage patterns, and customer sentiment.
Methods/Tools: I use the Twitter Streaming API together with Python libraries such as Tweepy and twitterscraper to collect user information and relevant tweets (from the past 1-2 weeks) about a set of popular mobile games. I then use Natural Language Processing libraries such as NLTK and TextBlob to find keywords that characterize each game, and matplotlib and Vincent to render plots.
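The collection code itself runs outside this notebook; the following is a minimal sketch of what the streaming side could look like with Tweepy, assuming placeholder credentials and a hypothetical output file path.
import json
import tweepy

# Placeholder credentials (assumed): create an app at https://apps.twitter.com
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')

class GameStreamListener(tweepy.StreamListener):
    # Append each incoming tweet's JSON to a local file (hypothetical path)
    def on_status(self, status):
        with open('data/stream_tweets.json', 'a') as f:
            f.write(json.dumps(status._json) + '\n')

    def on_error(self, status_code):
        if status_code == 420:  # rate-limited: disconnect
            return False

stream = tweepy.Stream(auth=auth, listener=GameStreamListener())
# Track mentions of a few of the games studied here
stream.filter(track=['pokemon go', 'clash of clans', 'candy crush'])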
For the popularity survey, I first use the Twitter Streaming API together with the Python library Tweepy to extract user information, such as number of followers, account creation time, and number of likes, from the official Twitter accounts of the 14 most popular mobile games, based on internet reviews in 2015 and 2016.
I then collect all tweets related to each game over the past week (Oct 23 to Oct 29, 2016) using the web scraping library twitterscraper. This yields about 100K tweets in total.
To perform a more detailed study of a single mobile game, I pick the most popular one, Pokemon Go, as an example, and expand the data set to include all tweets from the past two weeks. That gives a total tweet count of about 120K for Pokemon Go alone.
One easy way to gauge whether a game is popular is to pull information associated with its official Twitter account, such as the number of followers and the number of likes. Normally, a more popular game garners more followers and more likes. However, caution is needed: not every Twitter user who plays a particular game follows its official account.
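The account metadata used below can be pulled with Tweepy's REST interface. Here is a minimal sketch of that step, reusing the auth handler from the streaming sketch above; the handles listed are hypothetical examples, and the real survey covers 14 games.
import tweepy
import pandas as pd

api = tweepy.API(auth)  # auth configured as in the streaming sketch

# Hypothetical official handles; the actual list covers 14 games
accounts = ['PokemonGoApp', 'ClashofClans', 'CandyCrushSaga']
rows = []
for name in accounts:
    u = api.get_user(screen_name=name)
    rows.append({'followers': u.followers_count,
                 'likes': u.favourites_count,  # the "Likes" count shown on the profile
                 'created': u.created_at})
df_gl = pd.DataFrame(rows, index=accounts)
df_gl.to_csv('data/pop1.csv', sep='\t')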
The following plot shows the total number of followers to date for a list of 14 popular mobile games (list based on internet reviews in 2015 and 2016):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pylab
pylab.rcParams['figure.figsize'] = [10, 5]
df_gl = pd.read_csv('data/pop1.csv', sep='\t', index_col=0)
# Account age in days, measured from the date of the data snapshot
current_time = pd.to_datetime('2016-10-29 00:00:00')
df_gl['days'] = (current_time - pd.to_datetime(df_gl['created'])).apply(lambda x: x / np.timedelta64(1, 'D'))
df_gl.plot.bar(y='followers', rot=80)
To show this bias, or to extract a less biased comparison, we plot the number of followers (left) and the number of days each game's account has existed (right) side by side. Note that Pokemon Go was only released in July 2016, but its official account was registered in 2014, when the game was conceived.
import pylab
pylab.rcParams['figure.figsize'] = [10, 5]
# Horizontal bars: followers (left, scaled to thousands) vs. account age (right)
dftmp = df_gl.sort_values('followers')
ygl = np.arange(df_gl['likes'].size)
fig, axes = plt.subplots(ncols=2, sharey=True)
axes[0].barh(ygl, dftmp['followers'] * 0.001, align='center', color='blue', zorder=10)
axes[0].set(title='Number of followers (' + r'$\times 1000$)')
axes[1].barh(ygl, dftmp['days'], align='center', color='red', zorder=10)
axes[1].set(title='Number of days')
axes[0].invert_xaxis()
axes[0].set(yticks=ygl, yticklabels=dftmp.index)
axes[0].yaxis.tick_right()
for ax in axes.flat:
    ax.margins(0.01)
    ax.grid(True)
fig.tight_layout()
fig.subplots_adjust(wspace=0.4)
plt.show()
On the other hand, the number of likes may not be a good indicator of a game's recent popularity. Some games gained more likes simply through longevity, such as Angry Birds and Plants vs Zombies, while popular newcomers such as Pokemon Go have fewer likes partly because of their very young age. One exception is Ingress, a location-based multiplayer online game and a relative of Pokemon Go: it attracts the third-highest number of likes within a time span shorter than the average lifetime of the games in the list.
dftmp = df_gl.sort_values('days')
ygl = np.arange(df_gl['likes'].size)
fig, axes = plt.subplots(ncols=2, sharey=True)
axes[0].barh(ygl, dftmp['likes'], align='center', color='blue', zorder=10)
axes[0].set(title='Number of likes')
axes[1].barh(ygl, dftmp['days'], align='center', color='red', zorder=10)
axes[1].set(title='Number of days')
axes[0].invert_xaxis()
axes[0].set(yticks=ygl, yticklabels=dftmp.index)
axes[0].yaxis.tick_right()
for ax in axes.flat:
    ax.margins(0.01)
    ax.grid(True)
fig.tight_layout()
fig.subplots_adjust(wspace=0.4)
plt.show()
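A rough way to correct for this longevity bias is to normalize each account's like count by its age in days. Below is a minimal sketch using the df_gl frame built above.
# Likes accrued per day of account lifetime: a crude longevity-adjusted rate
df_gl['likes_per_day'] = df_gl['likes'] / df_gl['days']
df_gl['likes_per_day'].sort_values().plot.barh(grid=True)
plt.xlabel('Likes per day')
plt.show()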
Another way to measure a game's popularity is to count how many times Twitter users mention it. I therefore scraped tweets from the web for the past week for the following six games: Candy Crush, Clash of Clans, Flappy Bird, Plants vs Zombies, Subway Surfers, and Pokemon Go. I chose these games partly because my kids and wife like to play them, and so do I. A sketch of the scraping step is shown below; the cell after it merges the resulting per-game files.
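The scraping itself runs outside this notebook. The following is a minimal sketch of how the per-game CSVs merged in the next cell could be produced, assuming twitterscraper's query_tweets(query, limit) interface; the Tweet attribute names (user, id, timestamp, fullname, text) are assumptions and may differ between versions.
import os
import pandas as pd
from twitterscraper import query_tweets

games = ['ClashofClans', 'FlappyBird', 'PvZ2', 'SubwaySurfer', 'CandyCrush', 'PokemonGo']
for name in games:
    if not os.path.exists('data/' + name):
        os.makedirs('data/' + name)
    # Twitter search operators restrict results to the week of interest
    tweets = query_tweets(name + ' since:2016-10-23 until:2016-10-30', 20000)
    rows = [(t.user, t.id, t.timestamp, t.fullname, t.text) for t in tweets]
    df = pd.DataFrame(rows, columns=['userid', 'tweetid', 'time', 'username', 'text'])
    df.to_csv('data/' + name + '/' + name + '.csv', sep='\t',
              index=False, header=False, encoding='utf-8')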
import glob
import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt
import random
import os

def merge_files(filenames, outfilename):
    # Concatenate the non-empty per-day CSV files into one file per game
    first_time = True
    for file in filenames:
        if os.stat(file).st_size == 0:  # skip empty scrapes
            continue
        if first_time:
            df = pd.read_csv(file, index_col=None, header=None, sep='\t')
            first_time = False
        else:
            print file
            tmp = pd.read_csv(file, index_col=None, header=None, sep='\t')
            df = df.append(tmp, ignore_index=True)
    df.to_csv(outfilename, sep='\t')
    print 'there are total tweets = ', df[0].count()

games = ['ClashofClans', 'FlappyBird', 'PvZ2', 'SubwaySurfer', 'CandyCrush', 'PokemonGo']
for name in games:
    filenames = glob.glob("data/" + name + "/*.csv")
    output = 'data/' + name + '.csv'
    print 'in ', output, ':'
    merge_files(filenames, output)
The following plot shows the number of tweets that mentioned each game in the past week.
It indicates that either (1) Pokemon Go is currently far more popular than the other five games combined, or (2) Pokemon Go players are simply more inclined to tweet about their game, although the second might not be so likely. Further study with data over a longer time span, and/or with user information, would be very useful to tell. At any rate, Pokemon Go does a great job of keeping its players excited, as indicated by the huge number of tweets over a relatively short period of time.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pylab
pylab.rcParams['figure.figsize'] = [4, 6]

nbin = 2
# Column 0: Pokemon Go alone; column 1: the other five games stacked
pokemon = np.array([72486, 0])
flappy = np.array([0, 2524])
pvz2 = np.array([0, 6810])
subway = np.array([0, 1675])
candy = np.array([0, 1164])
clash = np.array([0, 6640])

ind = np.arange(nbin)  # the x locations for the groups
width = 0.15           # the width of the bars

# Each series sits on the cumulative total of the series below it
p1 = plt.bar(ind, pokemon, width, color='r', align='center')
p2 = plt.bar(ind, flappy, width, color='y', align='center', bottom=pokemon)
p3 = plt.bar(ind, pvz2, width, color='b', align='center', bottom=pokemon + flappy)
p4 = plt.bar(ind, subway, width, color='g', align='center', bottom=pokemon + flappy + pvz2)
p5 = plt.bar(ind, candy, width, color='brown', align='center', bottom=pokemon + flappy + pvz2 + subway)
p6 = plt.bar(ind, clash, width, color='cyan', align='center', bottom=pokemon + flappy + pvz2 + subway + candy)

plt.ylabel('Total # of tweets (in one week)', fontsize=15)
plt.xticks(ind, ('PokemonGo', 'Others'))
plt.legend((p1[0], p2[0], p3[0], p4[0], p5[0], p6[0]),
           ['PokemonGo', 'FlappyBird', 'PvZ2', 'SubwaySurfer', 'CandyCrush', 'ClashofClans'])
plt.show()
Before we move on to the text content, let's make a histogram of tweet counts grouped by day of the week. For this and the following studies, I have doubled the data size, which now spans the past two weeks (120K tweets in total). The following plot shows noticeably more tweets from Monday through Thursday than over the rest of the week.
Whether this indicates that people spend more time on Pokemon Go from Monday to Thursday, or simply that people tend to tweet more on those days, is unclear at this moment.
import glob
import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt
%matplotlib inline
import random
import os

df = pd.read_csv('data/PokemonGo_week.csv', sep='\t', index_col=0)
df.columns = ['userid', 'tweetid', 'time', 'username', 'text']

# Convert the scraped 12-hour timestamps into 'YYYY-MM-DD HH:MM:00' strings,
# keeping track of rows whose time field failed to parse so we can drop them
timestamp = []
falselist = []
count = 0
for stamp in df['time']:
    if type(stamp) is str:
        line = stamp.split(' ')
        hr, mn = line[0].split(':')
        if line[1] == 'PM' and int(hr) < 12:
            hr = str(int(hr) + 12)
        timestamp.append('2016-10-' + line[3] + ' ' + hr + ':' + mn + ':00')
    else:
        falselist.append(count)
    count += 1
df1 = df.drop(df.index[falselist])
print df1['userid'].size

df1['timestamp'] = pd.Series(timestamp, index=df1.index)
df1['dayofweek'] = pd.to_datetime(df1['timestamp']).apply(lambda x: x.weekday())

# Histogram over days of the week (0 = Monday ... 6 = Sunday)
ax = df1.plot.hist(y='dayofweek', alpha=0.5, grid=True, bins=7, align='mid',
                   range=[-0.5, 6.5], label='Day of week')
ax.set_xticks(np.arange(7))
ax.set_xticklabels(['M', 'T', 'W', 'Th', 'F', 'S', 'Su'])
# Histogram over hours of the day (0-23)
hourlist = [int(item.split(' ')[1].split(':')[0]) for item in df1['timestamp']]
df1['hour'] = pd.Series(hourlist, index=df1.index)
df1.plot.hist(y='hour', alpha=0.5, grid=True, bins=24, align='mid',
              range=[-0.5, 23.5], label='Hour')
By searching and sorting all meaningful single words in the text of tweets from the last two weeks, the 10 most frequent words are: ('twitter', 38406), ('pic', 23536), ('halloween', 10795), ('video', 9545), ('new', 8537), ('update', 7604), ('liked', 7068), ('event', 6850), ('play', 5300), ('get', 4227)
from nltk.corpus import stopwords
import string
import operator
import json
from collections import Counter
import re

emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""

regex_str = [
    emoticons_str,
    r'<[^>]+>',  # HTML tags
    r'(?:@[\w_]+)',  # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",  # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+',  # URLs
    r'(?:(?:\d+,?)+(?:\.?\d+)?)',  # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])",  # words with - and '
    r'(?:[\w_]+)',  # other words
    r'(?:\S)'  # anything else
]

tokens_re = re.compile(r'(' + '|'.join(regex_str) + ')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^' + emoticons_str + '$', re.VERBOSE | re.IGNORECASE)

def tokenize(s):
    return tokens_re.findall(s)

def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens
# Tokenize every tweet, flatten, drop stopwords/punctuation, strip non-ASCII
wordlist = []
for item in df1['text']:
    if type(item) == str:
        wordlist.append(preprocess(item))
wordlist = [item for sublist in wordlist for item in sublist]

punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt', 'via', 'com', 'r']
wordlist = [item.lower() for item in wordlist if item not in stop]
wordlist_nounicode = [re.sub(r'[^\x00-\x7F]+', '', item) for item in wordlist]
# Count hashtags only
terms_hash = [term for term in wordlist_nounicode if term.startswith('#')]
# Count mentions only
terms_ment = [term for term in wordlist_nounicode if term.startswith('@')]
# Count plain terms only (no hashtags, no mentions)
terms_only = [term for term in wordlist_nounicode if term not in stop and not term.startswith(('#', '@'))]

count1 = Counter()
count1.update(terms_hash)
print "the 40 most frequent hashtags:"
print(count1.most_common(40))
print
print
count2 = Counter()
count2.update(terms_ment)
print "the 10 most frequent mentions:"
print(count2.most_common(10))
print
print
count3 = Counter()
count3.update(terms_only)
print "the 40 most frequent words:"
# skip the first entry, the search term itself
print(count3.most_common(41)[1:])
import vincent
vincent.core.initialize_notebook()
from vincent import AxisProperties, PropertySet, ValueRef
word_freq = [('twitter', 38406), ('pic', 23536), ('halloween', 10795), ('video', 9545), ('new', 8537),
('update', 7604), ('liked', 7068), ('event', 6850),
('play', 5300), ('get', 4227), ('like', 4135),
('game', 4055), ('still', 3708),
('candy', 3117), ("i'm", 3077), ('first', 2790),
('catch', 2757), ('plus', 2705), ('hack', 2638), ('niantic', 2587),
('gym', 2378)]
labels, freq = zip(*word_freq)
data = {'data': freq, 'x': labels}
bar = vincent.Bar(data, iter_idx='x')
#rotate x axis labels
ax = AxisProperties(
labels = PropertySet(angle=ValueRef(value=-30)))
bar.axes[0].properties = ax
bar.display()
from nltk import bigrams

# Build bigrams (pairs of adjacent tokens) from each tweet
punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt', 'via', 'com', 'r']
wordlist_bigram = []
for item in df1['text']:
    if type(item) == str:
        tmplist = [term for term in preprocess(item, lowercase=True) if term not in stop]
        for term in bigrams(tmplist):
            wordlist_bigram.append(' '.join(term))

count = Counter()
count.update(wordlist_bigram)
print "the 20 most frequent bigrams:"
print(count.most_common(20))
import cPickle as pickle
#pickle.dump(wordlist_nounicode, open( "word_onegram.p", "wb" ) )
#pickle.dump(terms_only, open( "terms_only.p", "wb" ) )
#pickle.dump( wordlist_bigram, open( "word_bigram.p", "wb" ) )
terms_only = pickle.load( open( "terms_only.p", "rb" ) )
#wordlist_nounicode = pickle.load( open( "word_onegram.p", "rb" ) )
#wordlist_bigram = pickle.load( open( "word_bigram.p", "rb" ) )
Other rankings are as follows.
The 20 most frequent hashtags:
('#pokemongo', 45988), ('#pokemon', 6280), ('#funny', 3757), ('#minecraft', 3723), ('#agario', 3703), ('#amazing', 3695), ('#game', 3501), ('#new', 3342), ('#europe', 3319), ('#turkey', 3319), ('#trolling', 3318), ('#love', 2584), ('#pok', 2055), ('#gaming', 1163), ('#pokeballs', 1053), ('#pokemongocoinspic', 1015), ('#teamvalor', 779), ('#pokecoins', 605), ('#tech', 601), ('#halloween', 555)
The 10 most frequent mentions:
('@youtube', 9318), ('@leafyishere', 1304), ('@omgitsalia', 1239), ('@pokemongoapp', 1210), ('@trnrtips', 1029), ('@nianticlabs', 907), ('@fsu_atl', 522), ('@lachlanyt', 301), ('@suknives', 286), ('@witelightinghwd', 222)
A search based on bigrams gives the 20 most frequent pairs of adjacent words:
('pic twitter', 23446), ('@youtube video', 7179), ('#pokemongo pic', 4509), ('halloween event', 3673), ('#game #trolling', 3318), ('play pokemon', 2750), ('go update', 2228), ('#love #amazing', 2199), ('new pokemon', 1497), ('halloween update', 1182), ('celebrates halloween', 968), ('need pokecoins', 911), ('rare pokemon', 814), ('apple pen', 768), ('pine apple', 764), ('ppap pine', 764), ('still play', 747), ('still playing', 744), ('candy count', 695), ('go cheats', 685)
which indicates that discussion in this period was dominated by the Halloween event/update, and that a large share of tweets shared pictures and YouTube videos.
# save to json file and then host the plot with webserver; abandoned
import vincent
from vincent import AxisProperties, PropertySet, ValueRef
word_freq = [('twitter', 38406), ('pic', 23536), ('halloween', 10795), ('video', 9545), ('new', 8537),
('update', 7604), ('liked', 7068), ('event', 6850),
('play', 5300), ('get', 4227), ('like', 4135),
('game', 4055), ('still', 3708),
('candy', 3117), ("i'm", 3077), ('first', 2790),
('catch', 2757), ('plus', 2705), ('hack', 2638), ('niantic', 2587),
('gym', 2378)]
labels, freq = zip(*word_freq)
data = {'data': freq, 'x': labels}
bar = vincent.Bar(data, iter_idx='x')
#rotate x axis labels
ax = AxisProperties(
labels = PropertySet(angle=ValueRef(value=-30)))
bar.axes[0].properties = ax
bar.to_json('term_freq.json')
We can also measure sentiment scores of the tweets about Pokemon Go, which provides a way to gauge players' opinions of the game. We use the same data sample as above, but take only the most recent 4000 tweets as a test and run them through a sentiment classifier built on Word Sense Disambiguation using WordNet and word-occurrence statistics from NLTK.
import cPickle as pickle
terms_only = pickle.load(open("terms_only.p", "rb"))
from senti_classifier import senti_classifier

# Aggregate positive and negative sentiment scores over the most recent terms
pos_score, neg_score = senti_classifier.polarity_scores(terms_only[-4000:])
print pos_score, neg_score
85.625 33.931
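The aggregate positive score (85.6) well exceeds the negative one (33.9), suggesting that players' opinion of Pokemon Go over this period is, on balance, positive. As a quick cross-check, TextBlob (mentioned in the Methods section) can score each tweet individually; a minimal sketch using the df1 frame built above:
from textblob import TextBlob

# Mean polarity over the most recent 4000 tweets (-1 = negative, +1 = positive)
polarities = [TextBlob(text).sentiment.polarity
              for text in df1['text'].tail(4000) if type(text) == str]
print(sum(polarities) / len(polarities))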
In this proposal, I have demonstrated that we can make use of data from Twitter to gain insight into the mobile game industry.
Further improvements: