from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
    if (code_show){
        $('div.input').hide();
    } else {
        $('div.input').show();
    }
    code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
The raw code for this IPython notebook is by default hidden for easier reading.
To toggle on/off the raw code, click <a href="javascript:code_toggle()">here</a>.''')
Motivation: Along with the explosion in popularity of mobile devices, mobile games have become more and more popular nowadays. From earlier hits such as Plants vs. Zombies and Flappy Bird to more recent multiplayer games such as Clash of Clans and Pokemon Go, a wide variety of titles have achieved big success with very different strategies. Investigating how popular they are and what makes them popular (or not) yields invaluable information for planning a product and service over the short and long term.
Text mining of social media such as Twitter comes in handy for this kind of task. Twitter data constitute a rich source of information about almost any topic imaginable: they can be used to find trends related to a specific keyword, to measure brand sentiment, and to gather feedback about new products and services.
In this proposal, I demonstrate the possibility of using Twitter to (1) conduct a very efficient and productive survey of the mobile game business, and (2) perform a detailed analysis of a given product (here, Pokemon Go) to find important information such as hot terms/topics, product usage patterns, and customer sentiment.
Methods/Tools: I use the Twitter Streaming API together with Python libraries such as Tweepy and twitterscraper to collect user information and relevant tweets (from the past 1-2 weeks) about a set of popular mobile games. I then use Natural Language Processing libraries such as NLTK and TextBlob to find keywords that characterize each game, and matplotlib and Vincent to render plots.
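The collection code itself runs outside this notebook; the following is a minimal sketch of what the streaming side could look like with Tweepy, assuming placeholder credentials and a hypothetical output file path.
import json
import tweepy

# Placeholder credentials (assumed): create an app at https://apps.twitter.com
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')

class GameStreamListener(tweepy.StreamListener):
    # Append each incoming tweet's JSON to a local file (hypothetical path)
    def on_status(self, status):
        with open('data/stream_tweets.json', 'a') as f:
            f.write(json.dumps(status._json) + '\n')

    def on_error(self, status_code):
        if status_code == 420:  # rate-limited: disconnect
            return False

stream = tweepy.Stream(auth=auth, listener=GameStreamListener())
# Track mentions of a few of the games studied here
stream.filter(track=['pokemon go', 'clash of clans', 'candy crush'])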
For the popularity survey, I first use the Twitter Streaming API together with the Python library Tweepy to extract user information, such as number of followers, account creation time, and number of likes, from the official Twitter accounts of the 14 most popular mobile games, based on internet reviews in 2015 and 2016.
I then collect all tweets related to each game over the past week (Oct 23 to Oct 29, 2016) using the web scraping library twitterscraper. This yields about 100K tweets in total.
To perform a more detailed study of a single mobile game, I pick the most popular one, Pokemon Go, as an example, and expand the data set to include all tweets from the past two weeks. That gives a total tweet count of about 120K for Pokemon Go alone.
One easy way to gauge whether a game is popular is to pull information associated with its official Twitter account, such as the number of followers and the number of likes. Normally, a more popular game garners more followers and more likes. However, caution is needed: not every Twitter user who plays a particular game follows its official account.
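The account metadata used below can be pulled with Tweepy's REST interface. Here is a minimal sketch of that step, reusing the auth handler from the streaming sketch above; the handles listed are hypothetical examples, and the real survey covers 14 games.
import tweepy
import pandas as pd

api = tweepy.API(auth)  # auth configured as in the streaming sketch

# Hypothetical official handles; the actual list covers 14 games
accounts = ['PokemonGoApp', 'ClashofClans', 'CandyCrushSaga']
rows = []
for name in accounts:
    u = api.get_user(screen_name=name)
    rows.append({'followers': u.followers_count,
                 'likes': u.favourites_count,  # the "Likes" count shown on the profile
                 'created': u.created_at})
df_gl = pd.DataFrame(rows, index=accounts)
df_gl.to_csv('data/pop1.csv', sep='\t')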
The following plot shows the total number of followers to date for a list of 14 popular mobile games (list based on internet reviews in 2015 and 2016):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pylab
pylab.rcParams['figure.figsize'] = [10, 5]
df_gl = pd.read_csv('data/pop1.csv', sep='\t', index_col=0)
# Account age in days, measured from the date of the data snapshot
current_time = pd.to_datetime('2016-10-29 00:00:00')
df_gl['days'] = (current_time - pd.to_datetime(df_gl['created'])).apply(lambda x: x / np.timedelta64(1, 'D'))
df_gl.plot.bar(y='followers', rot=80)
To show this bias, or to extract a less biased comparison, we plot the number of followers (left) and the number of days each game's account has existed (right) side by side. Note that Pokemon Go was only released in July 2016, but its official account was registered in 2014, when the game was conceived.
import pylab
pylab.rcParams['figure.figsize'] = [10, 5]
# Horizontal bars: followers (left, scaled to thousands) vs. account age (right)
dftmp = df_gl.sort_values('followers')
ygl = np.arange(df_gl['likes'].size)
fig, axes = plt.subplots(ncols=2, sharey=True)
axes[0].barh(ygl, dftmp['followers'] * 0.001, align='center', color='blue', zorder=10)
axes[0].set(title='Number of followers (' + r'$\times 1000$)')
axes[1].barh(ygl, dftmp['days'], align='center', color='red', zorder=10)
axes[1].set(title='Number of days')
axes[0].invert_xaxis()
axes[0].set(yticks=ygl, yticklabels=dftmp.index)
axes[0].yaxis.tick_right()
for ax in axes.flat:
    ax.margins(0.01)
    ax.grid(True)
fig.tight_layout()
fig.subplots_adjust(wspace=0.4)
plt.show()
On the other hand, the number of likes may not be a good indicator of a game's recent popularity. Some games gained more likes simply through longevity, such as Angry Birds and Plants vs Zombies, while popular newcomers such as Pokemon Go have fewer likes partly because of their very young age. One exception is Ingress, a location-based multiplayer online game and a relative of Pokemon Go: it attracts the third-highest number of likes within a time span shorter than the average lifetime of the games in the list.
dftmp = df_gl.sort_values('days')
ygl = np.arange(df_gl['likes'].size)
fig, axes = plt.subplots(ncols=2, sharey=True)
axes[0].barh(ygl, dftmp['likes'], align='center', color='blue', zorder=10)
axes[0].set(title='Number of likes')
axes[1].barh(ygl, dftmp['days'], align='center', color='red', zorder=10)
axes[1].set(title='Number of days')
axes[0].invert_xaxis()
axes[0].set(yticks=ygl, yticklabels=dftmp.index)
axes[0].yaxis.tick_right()
for ax in axes.flat:
    ax.margins(0.01)
    ax.grid(True)
fig.tight_layout()
fig.subplots_adjust(wspace=0.4)
plt.show()
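A rough way to correct for this longevity bias is to normalize each account's like count by its age in days. Below is a minimal sketch using the df_gl frame built above.
# Likes accrued per day of account lifetime: a crude longevity-adjusted rate
df_gl['likes_per_day'] = df_gl['likes'] / df_gl['days']
df_gl['likes_per_day'].sort_values().plot.barh(grid=True)
plt.xlabel('Likes per day')
plt.show()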
Another way to measure a game's popularity is to count how many times Twitter users mention it. I therefore scraped tweets from the web for the past week for the following six games: Candy Crush, Clash of Clans, Flappy Bird, Plants vs Zombies, Subway Surfers, and Pokemon Go. I chose these games partly because my kids and wife like to play them, and so do I. A sketch of the scraping step is shown below; the cell after it merges the resulting per-game files.
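The scraping itself runs outside this notebook. The following is a minimal sketch of how the per-game CSVs merged in the next cell could be produced, assuming twitterscraper's query_tweets(query, limit) interface; the Tweet attribute names (user, id, timestamp, fullname, text) are assumptions and may differ between versions.
import os
import pandas as pd
from twitterscraper import query_tweets

games = ['ClashofClans', 'FlappyBird', 'PvZ2', 'SubwaySurfer', 'CandyCrush', 'PokemonGo']
for name in games:
    if not os.path.exists('data/' + name):
        os.makedirs('data/' + name)
    # Twitter search operators restrict results to the week of interest
    tweets = query_tweets(name + ' since:2016-10-23 until:2016-10-30', 20000)
    rows = [(t.user, t.id, t.timestamp, t.fullname, t.text) for t in tweets]
    df = pd.DataFrame(rows, columns=['userid', 'tweetid', 'time', 'username', 'text'])
    df.to_csv('data/' + name + '/' + name + '.csv', sep='\t',
              index=False, header=False, encoding='utf-8')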
import glob
import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt
import random
import os

def merge_files(filenames, outfilename):
    # Concatenate the non-empty per-day CSV files into one file per game
    first_time = True
    for file in filenames:
        if os.stat(file).st_size == 0:  # skip empty scrapes
            continue
        if first_time:
            df = pd.read_csv(file, index_col=None, header=None, sep='\t')
            first_time = False
        else:
            print file
            tmp = pd.read_csv(file, index_col=None, header=None, sep='\t')
            df = df.append(tmp, ignore_index=True)
    df.to_csv(outfilename, sep='\t')
    print 'there are total tweets = ', df[0].count()

games = ['ClashofClans', 'FlappyBird', 'PvZ2', 'SubwaySurfer', 'CandyCrush', 'PokemonGo']
for name in games:
    filenames = glob.glob("data/" + name + "/*.csv")
    output = 'data/' + name + '.csv'
    print 'in ', output, ':'
    merge_files(filenames, output)
The following plot shows the number of tweets that mentioned each game in the past week.
It indicates that either (1) Pokemon Go is currently far more popular than the other five games combined, or (2) Pokemon Go players are simply more inclined to tweet about their game, although the second might not be so likely. Further study with data over a longer time span, and/or with user information, would be very useful to tell. At any rate, Pokemon Go does a great job of keeping its players excited, as indicated by the huge number of tweets over a relatively short period of time.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pylab
pylab.rcParams['figure.figsize'] = [4, 6]

nbin = 2
# Column 0: Pokemon Go alone; column 1: the other five games stacked
pokemon = np.array([72486, 0])
flappy = np.array([0, 2524])
pvz2 = np.array([0, 6810])
subway = np.array([0, 1675])
candy = np.array([0, 1164])
clash = np.array([0, 6640])

ind = np.arange(nbin)  # the x locations for the groups
width = 0.15           # the width of the bars

# Each series sits on the cumulative total of the series below it
p1 = plt.bar(ind, pokemon, width, color='r', align='center')
p2 = plt.bar(ind, flappy, width, color='y', align='center', bottom=pokemon)
p3 = plt.bar(ind, pvz2, width, color='b', align='center', bottom=pokemon + flappy)
p4 = plt.bar(ind, subway, width, color='g', align='center', bottom=pokemon + flappy + pvz2)
p5 = plt.bar(ind, candy, width, color='brown', align='center', bottom=pokemon + flappy + pvz2 + subway)
p6 = plt.bar(ind, clash, width, color='cyan', align='center', bottom=pokemon + flappy + pvz2 + subway + candy)

plt.ylabel('Total # of tweets (in one week)', fontsize=15)
plt.xticks(ind, ('PokemonGo', 'Others'))
plt.legend((p1[0], p2[0], p3[0], p4[0], p5[0], p6[0]),
           ['PokemonGo', 'FlappyBird', 'PvZ2', 'SubwaySurfer', 'CandyCrush', 'ClashofClans'])
plt.show()
Before we move on to the text content, let's make a histogram of tweet counts grouped by day of the week. For this and the following studies, I have doubled the data size, which now spans the past two weeks (120K tweets in total). The following plot shows noticeably more tweets from Monday through Thursday than over the rest of the week.
Whether this indicates that people spend more time on Pokemon Go from Monday to Thursday, or simply that people tend to tweet more on those days, is unclear at this moment.
import glob
import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt
%matplotlib inline
import random
import os

df = pd.read_csv('data/PokemonGo_week.csv', sep='\t', index_col=0)
df.columns = ['userid', 'tweetid', 'time', 'username', 'text']

# Convert the scraped 12-hour timestamps into 'YYYY-MM-DD HH:MM:00' strings,
# keeping track of rows whose time field failed to parse so we can drop them
timestamp = []
falselist = []
count = 0
for stamp in df['time']:
    if type(stamp) is str:
        line = stamp.split(' ')
        hr, mn = line[0].split(':')
        if line[1] == 'PM' and int(hr) < 12:
            hr = str(int(hr) + 12)
        timestamp.append('2016-10-' + line[3] + ' ' + hr + ':' + mn + ':00')
    else:
        falselist.append(count)
    count += 1
df1 = df.drop(df.index[falselist])
print df1['userid'].size

df1['timestamp'] = pd.Series(timestamp, index=df1.index)
df1['dayofweek'] = pd.to_datetime(df1['timestamp']).apply(lambda x: x.weekday())

# Histogram over days of the week (0 = Monday ... 6 = Sunday)
ax = df1.plot.hist(y='dayofweek', alpha=0.5, grid=True, bins=7, align='mid',
                   range=[-0.5, 6.5], label='Day of week')
ax.set_xticks(np.arange(7))
ax.set_xticklabels(['M', 'T', 'W', 'Th', 'F', 'S', 'Su'])
# Histogram over hours of the day (0-23)
hourlist = [int(item.split(' ')[1].split(':')[0]) for item in df1['timestamp']]
df1['hour'] = pd.Series(hourlist, index=df1.index)
df1.plot.hist(y='hour', alpha=0.5, grid=True, bins=24, align='mid',
              range=[-0.5, 23.5], label='Hour')
By searching and sorting all meaningful single words in the text of tweets from the last two weeks, the 10 most frequent words are: ('twitter', 38406), ('pic', 23536), ('halloween', 10795), ('video', 9545), ('new', 8537), ('update', 7604), ('liked', 7068), ('event', 6850), ('play', 5300), ('get', 4227)
from nltk.corpus import stopwords
import string
import operator
import json
from collections import Counter
import re

emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""

regex_str = [
    emoticons_str,
    r'<[^>]+>',  # HTML tags
    r'(?:@[\w_]+)',  # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",  # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+',  # URLs
    r'(?:(?:\d+,?)+(?:\.?\d+)?)',  # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])",  # words with - and '
    r'(?:[\w_]+)',  # other words
    r'(?:\S)'  # anything else
]

tokens_re = re.compile(r'(' + '|'.join(regex_str) + ')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^' + emoticons_str + '$', re.VERBOSE | re.IGNORECASE)

def tokenize(s):
    return tokens_re.findall(s)

def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens
# Tokenize every tweet, flatten, drop stopwords/punctuation, strip non-ASCII
wordlist = []
for item in df1['text']:
    if type(item) == str:
        wordlist.append(preprocess(item))
wordlist = [item for sublist in wordlist for item in sublist]

punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt', 'via', 'com', 'r']
wordlist = [item.lower() for item in wordlist if item not in stop]
wordlist_nounicode = [re.sub(r'[^\x00-\x7F]+', '', item) for item in wordlist]
# Count hashtags only
terms_hash = [term for term in wordlist_nounicode if term.startswith('#')]
# Count mentions only
terms_ment = [term for term in wordlist_nounicode if term.startswith('@')]
# Count plain terms only (no hashtags, no mentions)
terms_only = [term for term in wordlist_nounicode if term not in stop and not term.startswith(('#', '@'))]

count1 = Counter()
count1.update(terms_hash)
print "the 40 most frequent hashtags:"
print(count1.most_common(40))
print
print
count2 = Counter()
count2.update(terms_ment)
print "the 10 most frequent mentions:"
print(count2.most_common(10))
print
print
count3 = Counter()
count3.update(terms_only)
print "the 40 most frequent words:"
# skip the first entry, the search term itself
print(count3.most_common(41)[1:])
import vincent
vincent.core.initialize_notebook()
from vincent import AxisProperties, PropertySet, ValueRef
word_freq = [('twitter', 38406), ('pic', 23536), ('halloween', 10795), ('video', 9545), ('new', 8537),
('update', 7604), ('liked', 7068), ('event', 6850),
('play', 5300), ('get', 4227), ('like', 4135),
('game', 4055), ('still', 3708),
('candy', 3117), ("i'm", 3077), ('first', 2790),
('catch', 2757), ('plus', 2705), ('hack', 2638), ('niantic', 2587),
('gym', 2378)]
labels, freq = zip(*word_freq)
data = {'data': freq, 'x': labels}
bar = vincent.Bar(data, iter_idx='x')
#rotate x axis labels
ax = AxisProperties(
labels = PropertySet(angle=ValueRef(value=-30)))
bar.axes[0].properties = ax
bar.display()
from nltk import bigrams

# Build bigrams (pairs of adjacent tokens) from each tweet
punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt', 'via', 'com', 'r']
wordlist_bigram = []
for item in df1['text']:
    if type(item) == str:
        tmplist = [term for term in preprocess(item, lowercase=True) if term not in stop]
        for term in bigrams(tmplist):
            wordlist_bigram.append(' '.join(term))

count = Counter()
count.update(wordlist_bigram)
print "the 20 most frequent bigrams:"
print(count.most_common(20))
import cPickle as pickle
#pickle.dump(wordlist_nounicode, open( "word_onegram.p", "wb" ) )
#pickle.dump(terms_only, open( "terms_only.p", "wb" ) )
#pickle.dump( wordlist_bigram, open( "word_bigram.p", "wb" ) )
terms_only = pickle.load( open( "terms_only.p", "rb" ) )
#wordlist_nounicode = pickle.load( open( "word_onegram.p", "rb" ) )
#wordlist_bigram = pickle.load( open( "word_bigram.p", "rb" ) )
Other rankings are as follows.
The 20 most frequent hashtags:
('#pokemongo', 45988), ('#pokemon', 6280), ('#funny', 3757), ('#minecraft', 3723), ('#agario', 3703), ('#amazing', 3695), ('#game', 3501), ('#new', 3342), ('#europe', 3319), ('#turkey', 3319), ('#trolling', 3318), ('#love', 2584), ('#pok', 2055), ('#gaming', 1163), ('#pokeballs', 1053), ('#pokemongocoinspic', 1015), ('#teamvalor', 779), ('#pokecoins', 605), ('#tech', 601), ('#halloween', 555)
The 10 most frequent mentions:
('@youtube', 9318), ('@leafyishere', 1304), ('@omgitsalia', 1239), ('@pokemongoapp', 1210), ('@trnrtips', 1029), ('@nianticlabs', 907), ('@fsu_atl', 522), ('@lachlanyt', 301), ('@suknives', 286), ('@witelightinghwd', 222)
A search based on bigrams gives the 20 most frequent pairs of adjacent words:
('pic twitter', 23446), ('@youtube video', 7179), ('#pokemongo pic', 4509), ('halloween event', 3673), ('#game #trolling', 3318), ('play pokemon', 2750), ('go update', 2228), ('#love #amazing', 2199), ('new pokemon', 1497), ('halloween update', 1182), ('celebrates halloween', 968), ('need pokecoins', 911), ('rare pokemon', 814), ('apple pen', 768), ('pine apple', 764), ('ppap pine', 764), ('still play', 747), ('still playing', 744), ('candy count', 695), ('go cheats', 685)
which indicates that discussion in this period was dominated by the Halloween event/update, and that a large share of tweets shared pictures and YouTube videos.
# save to json file and then host the plot with webserver; abandoned
import vincent
from vincent import AxisProperties, PropertySet, ValueRef
word_freq = [('twitter', 38406), ('pic', 23536), ('halloween', 10795), ('video', 9545), ('new', 8537),
('update', 7604), ('liked', 7068), ('event', 6850),
('play', 5300), ('get', 4227), ('like', 4135),
('game', 4055), ('still', 3708),
('candy', 3117), ("i'm", 3077), ('first', 2790),
('catch', 2757), ('plus', 2705), ('hack', 2638), ('niantic', 2587),
('gym', 2378)]
labels, freq = zip(*word_freq)
data = {'data': freq, 'x': labels}
bar = vincent.Bar(data, iter_idx='x')
#rotate x axis labels
ax = AxisProperties(
labels = PropertySet(angle=ValueRef(value=-30)))
bar.axes[0].properties = ax
bar.to_json('term_freq.json')
We can also measure sentiment scores of the tweets about Pokemon Go, which provides a way to gauge players' opinions of the game. We use the same data sample as above, but take only the most recent 4000 tweets as a test and run them through a sentiment classifier built on Word Sense Disambiguation using WordNet and word-occurrence statistics from NLTK.
import cPickle as pickle
terms_only = pickle.load(open("terms_only.p", "rb"))
from senti_classifier import senti_classifier

# Aggregate positive and negative sentiment scores over the most recent terms
pos_score, neg_score = senti_classifier.polarity_scores(terms_only[-4000:])
print pos_score, neg_score
85.625 33.931
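The aggregate positive score (85.6) well exceeds the negative one (33.9), suggesting that players' opinion of Pokemon Go over this period is, on balance, positive. As a quick cross-check, TextBlob (mentioned in the Methods section) can score each tweet individually; a minimal sketch using the df1 frame built above:
from textblob import TextBlob

# Mean polarity over the most recent 4000 tweets (-1 = negative, +1 = positive)
polarities = [TextBlob(text).sentiment.polarity
              for text in df1['text'].tail(4000) if type(text) == str]
print(sum(polarities) / len(polarities))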
In this proposal, I have demonstrated that we can make use of data from Twitter to gain insight into the mobile game industry.
Further improvements: