Text-Mining the DLD14 Conference

This is the IPython Notebook behind this analysis of the Twitter buzz around the DLD Conference 2014.

Reading the data

The Tweets for this analysis were collected with the TAGS Google Drive script and the search query "#DLD14 OR #DLD". The data (the "Archive" tab) were then exported as CSV files. Because a TAGS document tends to hit the Google Documents size limit quite fast, I split the data gathering across multiple TAGS files.

The first step of the analysis is to read the CSV files and combine them into a single data frame. This step also requires deleting duplicate Tweets.

In [2]:
import pandas as pd

# Read all TAGS export files and combine them into one data frame
files = ['TAGS - DLD14 - Archive.csv', 'TAGS DLD14.2 - Archive.csv',
         'TAGS DLD14.3 - Archive.csv', 'TAGS DLD14.4 - Archive.csv',
         'TAGS DLD14.5 - Archive.csv']
data = pd.concat([pd.read_csv(f, parse_dates={'Timestamp': ['created_at']})
                  for f in files])
In [3]:
data_old = pd.read_csv("dld13.csv", sep=",",
                       parse_dates={'Timestamp': ['created_at']})

The following is an example of duplicate content: the same Tweet was picked up by several search sweeps. We have to clean the data set to get rid of these duplicates.

In [4]:
from collections import Counter

c = Counter(data.id_str)
data[data.id_str == c.most_common()[1][0]][:3]
Out[4]:
Timestamp id_str from_user text time geo_coordinates user_lang in_reply_to_user_id_str in_reply_to_screen_name from_user_id_str in_reply_to_status_id_str source profile_image_url user_followers_count user_friends_count user_utc_offset status_url entities_str
6904 2014-01-19 12:01:09 4.248742e+17 ardakutsal RT @atillayurtseven: @webrazzi #DLD14 medya sp... 19/01/2014 12:01:09 NaN en NaN NaN 43854330 NaN <a href="http://twitter.com/download/iphone" r... http://pbs.twimg.com/profile_images/3027796712... 18151 676 7200 http://twitter.com/ardakutsal/statuses/4248741... {"symbols":[],"urls":[],"hashtags":[{"text":"D...
9072 2014-01-19 12:01:09 4.248742e+17 ardakutsal RT @atillayurtseven: @webrazzi #DLD14 medya sp... 19/01/2014 12:01:09 NaN en NaN NaN 43854330 NaN <a href="http://twitter.com/download/iphone" r... http://pbs.twimg.com/profile_images/3027796712... 18149 676 7200 http://twitter.com/ardakutsal/statuses/4248741... {"symbols":[],"urls":[],"hashtags":[{"text":"D...
11010 2014-01-19 12:01:09 4.248742e+17 ardakutsal RT @atillayurtseven: @webrazzi #DLD14 medya sp... 19/01/2014 12:01:09 NaN en NaN NaN 43854330 NaN <a href="http://twitter.com/download/iphone" r... http://pbs.twimg.com/profile_images/3027796712... 18148 676 7200 http://twitter.com/ardakutsal/statuses/4248741... {"symbols":[],"urls":[],"hashtags":[{"text":"D...

Data Cleaning

Remove entries that have a broken date.

In [387]:
data = data[data['Timestamp'] >= '2014-01-10']

To de-duplicate the data, I used a rather unsophisticated hack: I removed all the columns that were responsible for differences between otherwise duplicate rows. For example, people gained followers between two sweeps of the Twitter search, so their entries were no longer identified as duplicates. Alternatively, the de-duplication could be done on the id_str column.
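A minimal sketch of that id_str-based alternative (the `dedup_by_id` helper is my illustration, not part of the original notebook); it keeps the first collected copy of each Tweet regardless of volatile columns such as the follower count:

```python
import pandas as pd

def dedup_by_id(df):
    # De-duplicate on the Tweet ID instead of dropping volatile columns;
    # keep='first' retains the earliest collected copy of each Tweet.
    return df.drop_duplicates(subset='id_str', keep='first')
```

Usage would simply be `data = dedup_by_id(data)`.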

In [388]:
tweets = {}

# Reduce data frame
tweets['2014'] = data[['Timestamp', 'id_str', 'from_user', 'source', 'text']]
tweets['2013'] = data_old[['Timestamp', 'id_str', 'from_user', 'source', 'text']]

for year in ['2013', '2014']:
    # De-dup
    tweets[year] = tweets[year].drop_duplicates()
    # Set Timestamp as index for the DataFrame
    tweets[year] = tweets[year].set_index('Timestamp')

Because both the Twitter API and the TAGS script changed between 2013 and 2014, the collected data must be further cleaned and homogenized to allow for comparisons. For example, the 2014 data records the web client as the plain string 'web'; I converted it back to the anchor-tag format used in 2013. This makes the extraction of the Twitter clients much easier.

In [389]:
tweets['2014'].loc[tweets['2014']['source'] == 'web', 'source'] = '<a href="http://twitter.com">web</a>'
tweets['2013']['source'] = [x.replace('&lt;', '<').replace('&gt;', '>').replace('&quot;', '"') 
                            for x in tweets['2013']['source']]

for year in ['2013', '2014']:
    tweets[year]['source'] = [x.replace('  ', ' ') for x in tweets[year]['source']]
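The manual entity replacements above cover only &lt;, &gt; and &quot;. As an aside, the standard library can decode all HTML entities in one call (`html.unescape` in Python 3; the notebook itself runs on Python 2, where `HTMLParser.HTMLParser().unescape` was the equivalent). A sketch:

```python
import html

def unescape_sources(sources):
    # Decode all HTML entities (&lt;, &gt;, &quot;, &amp;, ...) in one pass
    return [html.unescape(s) for s in sources]
```

This avoids having to enumerate each entity by hand.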

This is one of the most important parts: the data frame now gets turned into a real timeline. I'm aggregating all Tweets into 15-minute intervals, but you can play around with the interval size to find other resolutions.

In [390]:
ticks = {}

for year in ['2013', '2014']:
    tweets[year]['Tweets'] = 1
    ticks[year] = tweets[year].ix[:, ['Tweets']]
    ticks[year] = ticks[year].Tweets.resample('15min', how='count')
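Note that the `how` argument to `resample` was removed in later pandas versions; if you rerun this on a current pandas, the equivalent spelling chains the aggregation instead (the `count_per_interval` helper is my illustration):

```python
import pandas as pd

def count_per_interval(series, freq='15min'):
    # Modern replacement for series.resample(freq, how='count')
    return series.resample(freq).count()
```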

Data Analysis

Let's do a first visualization of the buzz for the conference:

In [490]:
%matplotlib inline 
import matplotlib.pyplot as plt

#plt.figure(figsize=(10,6))
plt.title('DLD14 Conference Buzz')
plt.ylabel('Number of Tweets')
ticks["2014"].ix['2014-01-18':'2014-01-22'].plot()
plt.savefig('DLD14_Buzz.png')

Which year had the highest peak? See answer below.

In [392]:
print max(ticks["2013"])
print max(ticks["2014"])
247
346

DLD 2013 took place one year before DLD 2014. To overlay both buzz timelines, the 2013 values have to be time-shifted.

In [484]:
ticks["2014o"] = ticks["2014"] * 1.0  # multiply by 1.0 to get a copy
ticks["2013o"] = ticks["2013"].tshift(364, freq="d")  # shift 2013 forward by 364 days
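`tshift` was deprecated and later removed from pandas; on a current version, `shift` with a `freq` argument performs the same index shift (the `shift_by_days` wrapper is my illustration):

```python
import pandas as pd

def shift_by_days(series, days):
    # Shift the datetime index forward by `days` days
    # (replacement for series.tshift(days, freq='d'))
    return series.shift(days, freq='D')
```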

Here's a direct comparison of 2013 and 2014.

In [491]:
fig, ax = plt.subplots()
ticks["2013o"].ix['2014-01-18':'2014-01-23'].plot(color="red", label="DLD 2013")
ticks["2014o"].ix['2014-01-18':'2014-01-23'].plot(label="DLD 2014")
legend = ax.legend(loc='upper left', shadow=True)
plt.xlabel('Date')
plt.title('#DLD14 Conference Buzz')
plt.ylabel('Number of Tweets')
#plt.show()
plt.savefig('DLD14_Buzz_Comp_Comparison.png')

Let's take a look at the clients people were using in 2014 and compare them to 2013. Which Twitter clients have gained importance?

In [395]:
from prettytable import PrettyTable
import re
devices = {}

for y in tweets:
    devices[y] = tweets[y].groupby("source", as_index=False)["source", "Tweets"].sum()
  
rel = {}

# Extract the client name from each anchor tag and build a per-year frequency table
for y in ["2013", "2014"]:
    dev_names = []
    dev_count = []
    for x in devices[y]["source"]:
        m = re.match("<.*>(.*)</a>", x)
        if m:
            dev_names.append(m.group(1))
            dev_count.extend(devices[y]["Tweets"][devices[y]["source"] == x])
    rel[y] = pd.DataFrame({"Device": dev_names, "Tweets_" + y: dev_count})
    rel[y]["Rel_" + y] = 100.0 * rel[y]["Tweets_" + y] / sum(rel[y]["Tweets_" + y])

f = pd.merge(rel["2014"], rel["2013"], how="outer", on="Device")
f["Growth"] = f.Rel_2014 / f.Rel_2013

tb = f[(f['Rel_2014'] >= 0.05) & (f['Growth'] >= 0)].sort(columns="Growth", ascending=False)[["Device", "Rel_2013", "Rel_2014", "Growth"]]
tb.columns = ["Device", "% 2013", "% 2014", "Index"]
tb["% 2014"] = tb["% 2014"].round(2)
tb["% 2013"] = tb["% 2013"].round(2)
tb["Index"] = (100*tb["Index"]).round(1)

pt = PrettyTable(field_names=["Twitter Clients #DLD14", "Percent 2013", "Percent 2014", "Index"]) 
[pt.add_row(a[1]) for a in tb.iterrows()]
pt.align['Twitter Clients #DLD14'], pt.align['Percent 2013'], pt.align['Percent 2014'], pt.align['Index'] = 'l', 'r', 'r', 'r'
print pt
+---------------------------+--------------+--------------+-------+
|   Twitter Clients #DLD14  | Percent 2013 | Percent 2014 | Index |
+---------------------------+--------------+--------------+-------+
|  TweetCaster for Android  |     0.12     |     0.5      | 426.2 |
|           IFTTT           |     0.16     |     0.7      | 424.8 |
|          Janetter         |     0.02     |     0.09     | 398.1 |
|          dlvr.it          |     0.74     |     2.49     | 337.5 |
|            iOS            |     0.21     |     0.68     | 322.6 |
|          Storify          |     0.04     |     0.08     | 218.5 |
|          Facebook         |     0.16     |     0.35     | 210.7 |
|         TweetList!        |     0.75     |     1.3      | 173.4 |
|     Plume for Android     |     0.11     |     0.18     | 171.7 |
|    Twitter for Android    |     4.67     |     7.95     | 170.1 |
|         Flipboard         |     0.19     |     0.32     | 169.8 |
|       Sprout Social       |     0.07     |     0.12     | 163.9 |
|        Twitterrific       |     0.61     |     0.99     | 162.1 |
|        twitterfeed        |     2.96     |     4.61     | 155.5 |
|      Twitter for iPad     |     3.76     |     5.0      | 132.9 |
|      Mobile Web (M5)      |     0.56     |     0.74     | 130.7 |
|           Buffer          |     0.35     |     0.45     | 128.0 |
|     Twitter for iPhone    |    17.48     |    22.21     | 127.1 |
|    TweetCaster for iOS    |     0.06     |     0.07     | 121.8 |
|      Tweetbot for iOS     |     2.4      |     2.9      | 120.6 |
|         TweetDeck         |     6.16     |     6.94     | 112.6 |
|        Tweet Button       |     2.59     |     2.9      | 112.1 |
| Twitter for Windows Phone |     0.26     |     0.29     | 110.7 |
|      Tweetbot for Mac     |     0.46     |     0.5      | 109.3 |
|      Twitter for Mac      |     1.57     |     1.67     | 106.6 |
|          Echofon          |     0.87     |     0.86     |  99.4 |
|         HootSuite         |     4.29     |     4.15     |  96.9 |
|      Mobile Web (M2)      |     0.28     |     0.25     |  87.8 |
|         Instagram         |     1.3      |     1.08     |  82.7 |
|         MetroTwit         |     0.08     |     0.07     |  80.3 |
|           Botize          |     0.13     |     0.09     |  72.4 |
|          SharedBy         |     0.19     |     0.13     |  67.3 |
|            OS X           |     0.08     |     0.05     |  66.9 |
|         foursquare        |     0.61     |     0.4      |  65.7 |
|          LinkedIn         |     0.13     |     0.08     |  63.9 |
|            web            |    39.28     |    22.37     |  57.0 |
|  Twitter for BlackBerry®  |     2.1      |     0.61     |  29.3 |
|        Twittelator        |     0.35     |     0.07     |  20.3 |
|    Twitter for Android    |     4.67     |     0.24     |  5.2  |
+---------------------------+--------------+--------------+-------+

Now to the content. Here are a few lines that do some data munging on the Tweet texts.

In [396]:
import nltk

# English plus German stopwords
stop = nltk.corpus.stopwords.words('english')
stop = stop + nltk.corpus.stopwords.words('german')

text = {}
words = {}

for year in tweets:
    raw = " ".join(tweets[year]["text"])
    tokens = nltk.WordPunctTokenizer().tokenize(raw)
    text[year] = nltk.Text(tokens)
    # Lowercase, then drop very short tokens and stopwords
    words[year] = [w.lower() for w in text[year]]
    words[year] = [w for w in words[year] if len(w) > 2]
    words[year] = [w for w in words[year] if w not in stop]
    # Drop punctuation tokens (substring test against the punctuation string)
    words[year] = filter(lambda word: word not in '"\'!%,-:()$\/;?.’–“”#@&', words[year])
    # Drop URL fragments, Twitter noise and frequent Spanish/French function words
    words[year] = [w for w in words[year] if w not in ["://", "http", "co", "rt", "va", "l", "se", "...", ".\"", 
                                                       "amp", "us", "en", "el", "y", "de", "que", "via", "12", 
                                                       "000", "hoy", "por", "les", "per", "la", "los", "5", "1", 
                                                       "[email protected]", "con"]]
    # Strip stray bytes left over from broken UTF-8 decoding
    words[year] = [w.replace("\xe2", "") for w in words[year]]
    words[year] = [w.replace("\xc3", "") for w in words[year]]
    words[year] = [w.replace("\xb3", "") for w in words[year]]

How many words were used in 2013 and 2014? Let's compare the lexical diversity (computed here as the total number of words per unique word, so higher values mean more repetition).

In [488]:
numwords = {}
uniwords = {}
lexi = {}

for year in text:
    numwords[year] = len(text[year])
    uniwords[year] = len(set(text[year]))
    lexi[year] = 1.0*numwords[year]/uniwords[year]

print numwords
print uniwords
print lexi
{'2014': 487592, '2013': 195371}
{'2014': 24124, '2013': 15607}
{'2014': 20.21190515669043, '2013': 12.518164925994746}

Printing the Top 25 trending (= growing in frequency) words in 2014
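The cell below references a `freq_table` DataFrame whose construction is not shown in the notebook. A minimal reconstruction of how it could have been built from the cleaned word lists above (the `build_freq_table` helper and its exact shape are my assumption, not the original code):

```python
from collections import Counter

import pandas as pd

def build_freq_table(words):
    # words: dict mapping a year string to its list of cleaned tokens
    counts = {year: Counter(words[year]) for year in words}
    vocab = sorted(set().union(*counts.values()))
    table = pd.DataFrame({'Word': vocab})
    for year in sorted(counts):
        # Counter returns 0 for words absent in that year
        table['Freq_' + year] = [counts[year][w] for w in vocab]
    return table
```

With that in place, `freq_table = build_freq_table(words)` provides the `Word`, `Freq_2013` and `Freq_2014` columns the following cells rely on.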

In [399]:
from prettytable import PrettyTable
import codecs

for year in numwords:
    freq_table["Perc_" + year] = 100.0 * freq_table["Freq_" + year] / numwords[year]

for year in ["2014"]:
    freq_table["Growth_" + year] = 100.0 * freq_table["Perc_" + year] / freq_table["Perc_" + str(int(year)-1)]

    tb = freq_table[freq_table['Perc_' + str(year)] >= 0.09].sort(columns="Growth_" + str(year), 
                                                                  ascending=False)[["Word", "Freq_" + str(year), 
                                                                                    "Perc_" + str(year), 
                                                                                    "Growth_" + str(year)]]
    tb.columns = ["Word", "Freq", "Percent", "Index"]
    tb.Index = tb['Index'].round(1)
    tb.Percent = tb['Percent'].round(4)
    
    pt = PrettyTable(field_names=[str(year), 'Frequency', 'Percent', "Index"]) 
    [pt.add_row(a[1]) for a in tb[:25].iterrows()]
    pt.align[str(year)], pt.align['Frequency'], pt.align['Percent'], pt.align['Index'] = 'l', 'r', 'r', 'r'

    print pt
+---------------+-----------+---------+----------+
| 2014          | Frequency | Percent |    Index |
+---------------+-----------+---------+----------+
| dld14         |     16075 |  3.2968 | 322050.9 |
| billion       |       646 |  0.1325 |   2157.0 |
| experience    |       492 |  0.1009 |   1159.6 |
| founder       |       512 |   0.105 |    820.6 |
| anked         |       439 |    0.09 |    586.3 |
| user          |       450 |  0.0923 |    487.3 |
| bill_gross    |      1919 |  0.3936 |    427.2 |
| today         |       644 |  0.1321 |    248.1 |
| todo          |       541 |   0.111 |    240.9 |
| marketing     |       485 |  0.0995 |    234.1 |
| ceo           |       777 |  0.1594 |    225.6 |
| google        |       703 |  0.1442 |    199.8 |
| mobile        |       748 |  0.1534 |    168.4 |
| internet      |       594 |  0.1218 |    165.3 |
| day           |       644 |  0.1321 |    160.3 |
| great         |       810 |  0.1661 |    144.9 |
| digital       |       877 |  0.1799 |    129.7 |
| jeffjarvis    |       470 |  0.0964 |    121.5 |
| people        |       612 |  0.1255 |    116.8 |
| live          |       545 |  0.1118 |     97.1 |
| dldconference |      1134 |  0.2326 |     72.9 |
| munich        |       472 |  0.0968 |     66.1 |
| new           |       498 |  0.1021 |     56.4 |
| dld           |      1399 |  0.2869 |     48.0 |
| data          |       623 |  0.1278 |     44.4 |
+---------------+-----------+---------+----------+

Top Words in 2014

In [400]:
tb = freq_table.sort(columns="Perc_2014", ascending=False)[["Word", "Freq_2014", "Perc_2014", 
                                                            "Growth_2014"]]
tb.columns = ["Word", "Freq", "Percent", "Index"]
tb.Index = tb['Index'].round(1)
tb.Percent = tb['Percent'].round(4)

pt = PrettyTable(field_names=["Top 2014", 'Frequency', 'Percent', "Index"]) 
[pt.add_row(a[1]) for a in tb[:25].iterrows()]
pt.align["Top 2014"], pt.align['Frequency'], pt.align['Percent'], pt.align['Index'] = 'l', 'r', 'r', 'r'

print pt
+---------------+-----------+---------+----------+
| Top 2014      | Frequency | Percent |    Index |
+---------------+-----------+---------+----------+
| dld14         |     16075 |  3.2968 | 322050.9 |
| bill_gross    |      1919 |  0.3936 |    427.2 |
| dld           |      1399 |  0.2869 |     48.0 |
| dldconference |      1134 |  0.2326 |     72.9 |
| digital       |       877 |  0.1799 |    129.7 |
| great         |       810 |  0.1661 |    144.9 |
| ceo           |       777 |  0.1594 |    225.6 |
| mobile        |       748 |  0.1534 |    168.4 |
| google        |       703 |  0.1442 |    199.8 |
| billion       |       646 |  0.1325 |   2157.0 |
| day           |       644 |  0.1321 |    160.3 |
| today         |       644 |  0.1321 |    248.1 |
| data          |       623 |  0.1278 |     44.4 |
| people        |       612 |  0.1255 |    116.8 |
| internet      |       594 |  0.1218 |    165.3 |
| live          |       545 |  0.1118 |     97.1 |
| todo          |       541 |   0.111 |    240.9 |
| founder       |       512 |   0.105 |    820.6 |
| new           |       498 |  0.1021 |     56.4 |
| experience    |       492 |  0.1009 |   1159.6 |
| marketing     |       485 |  0.0995 |    234.1 |
| munich        |       472 |  0.0968 |     66.1 |
| jeffjarvis    |       470 |  0.0964 |    121.5 |
| user          |       450 |  0.0923 |    487.3 |
| anked         |       439 |    0.09 |    586.3 |
+---------------+-----------+---------+----------+

Top Words in 2013

In [401]:
tb = freq_table.sort(columns="Perc_2013", ascending=False)[["Word", "Freq_2013", "Perc_2013"]]
tb.columns = ["Word", "Freq", "Percent"]
tb.Percent = tb['Percent'].round(4)

pt = PrettyTable(field_names=["Top 2013", 'Frequency', 'Percent']) 
[pt.add_row(a[1]) for a in tb[:25].iterrows()]
pt.align["Top 2013"], pt.align['Frequency'], pt.align['Percent'] = 'l', 'r', 'r'

print pt
+---------------+-----------+---------+
| Top 2013      | Frequency | Percent |
+---------------+-----------+---------+
| dld13         |      7808 |  3.9965 |
| dld           |      1167 |  0.5973 |
| dldconference |       623 |  0.3189 |
| data          |       562 |  0.2877 |
| big           |       481 |  0.2462 |
| new           |       354 |  0.1812 |
| love          |       351 |  0.1797 |
| like          |       343 |  0.1756 |
| munich        |       286 |  0.1464 |
| digital       |       271 |  0.1387 |
| live          |       225 |  0.1152 |
| great         |       224 |  0.1147 |
| see           |       219 |  0.1121 |
| sex           |       215 |    0.11 |
| media         |       214 |  0.1095 |
| future        |       211 |   0.108 |
| people        |       210 |  0.1075 |
| really        |       195 |  0.0998 |
| times         |       193 |  0.0988 |
| conference    |       184 |  0.0942 |
| york          |       182 |  0.0932 |
| bill_gross    |       180 |  0.0921 |
| mobile        |       178 |  0.0911 |
| social        |       176 |  0.0901 |
| day           |       161 |  0.0824 |
+---------------+-----------+---------+

Top Bigrams in 2014 and 2013

In [402]:
from prettytable import PrettyTable
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()

for year in ["2013", "2014"]:
    words[year] = [w.replace("\x80", "") for w in words[year]]
    words[year] = [w.replace("\x99", "") for w in words[year]]
    words[year] = [w for w in words[year] if w not in ["como", "oro", "las", "nadie", "cmo", "todos", "hablan", "una", "hacerlo", 
                                                       "sabe", ")", "todo", "decidir", "slo", "adida"]]
    print "Top Bigrams " + str(year)
    finder = BigramCollocationFinder.from_words(words[year])
    scored = finder.score_ngrams(bigram_measures.raw_freq)

    pt = PrettyTable(field_names=['Bigram', 'Frequency']) 
    [ pt.add_row([" ".join(kv[0]), round(kv[1], 4)]) for kv in scored[:35] ]
    pt.align['Bigram'], pt.align['Frequency'] = 'l', 'r' # Set column alignment
    
    print pt
Top Bigrams 2013
+---------------------+-----------+
| Bigram              | Frequency |
+---------------------+-----------+
| big data            |     0.005 |
| new york            |    0.0024 |
| dld13 dldconference |    0.0022 |
| york times          |     0.002 |
| times dld13         |    0.0019 |
| dld13 tenemos       |    0.0018 |
| tenemos abrirnos    |    0.0018 |
| sulzberger new      |    0.0018 |
| abrirnos redes      |    0.0018 |
| redes sociales      |    0.0018 |
| adolescentes dld13  |    0.0017 |
| comparado sexo      |    0.0017 |
| entre adolescentes  |    0.0017 |
| sexo entre          |    0.0017 |
| data comparado      |    0.0017 |
| dean ornish         |    0.0016 |
| convergencia zorra  |    0.0015 |
| dld13 convergencia  |    0.0015 |
| partners dld13      |    0.0015 |
| kawaja luma         |    0.0015 |
| luma partners       |    0.0015 |
| dld13 dld           |    0.0014 |
| dld13 ltwnds        |    0.0012 |
| teenage sex         |    0.0012 |
| ben horowitz        |    0.0012 |
| jeffjarvis dld13    |    0.0012 |
| future authority    |    0.0012 |
| like teenage        |    0.0012 |
| data like           |    0.0011 |
| ozlem_denizmen dld  |    0.0011 |
| dld13 ibcmunich     |    0.0011 |
| ornish dld13        |     0.001 |
| sex everybody       |     0.001 |
| dldconference dld13 |     0.001 |
| social media        |     0.001 |
+---------------------+-----------+
Top Bigrams 2014
+-----------------------+-----------+
| Bigram                | Frequency |
+-----------------------+-----------+
| dld14 bill_gross      |    0.0029 |
| rovio dld14           |    0.0025 |
| user experience       |    0.0021 |
| vesterbacka rovio     |     0.002 |
| dld14 user            |     0.002 |
| bill_gross whatsapp   |     0.002 |
| today dld14           |    0.0017 |
| jimmy wales           |    0.0016 |
| active users          |    0.0016 |
| million active        |    0.0015 |
| billion smartphones   |    0.0014 |
| earlier today         |    0.0014 |
| users engineers       |    0.0014 |
| 435 million           |    0.0014 |
| engineers 10m         |    0.0014 |
| whatsapp 435          |    0.0014 |
| 10m actives           |    0.0014 |
| actives engineer      |    0.0013 |
| dld14 bitcoin         |    0.0013 |
| people check          |    0.0012 |
| engineer dld14        |    0.0011 |
| 150x day              |    0.0011 |
| check 150x            |    0.0011 |
| smartphones worldwide |    0.0011 |
| hubert burda          |    0.0011 |
| moore law             |     0.001 |
| bitcoin ventaja       |     0.001 |
| dldconference dld14   |     0.001 |
| ventaja puedes        |     0.001 |
| puedes verdad         |     0.001 |
| verdad pagar          |     0.001 |
| joe schoendorf        |    0.0009 |
| cheaper build         |    0.0009 |
| day every             |    0.0009 |
| dld14 qu              |    0.0009 |
+-----------------------+-----------+

Find the context for a word.

In [130]:
text["2014"].concordance("xenon")
Displaying 25 of 166 matches:
12 , 000 transistors . Intel ' s new Xenon has 5 BILLION and Moore ' s Law is S
12 , 000 transistors . Intel ' s new Xenon has 5 BILLION and Moore ' s Law is S
12 , 000 transistors . Intel ' s new Xenon has 5 BILLION and Moore ' s Law is S
12 , 000 transistors . Intel ' s new Xenon has 5 BILLION and Moore ' s Law is S
12 , 000 transistors . Intel ' s new Xenon has 5 BILLION and Moore ' s Law is S
12 , 000 transistors . Intel ' s new Xenon has 5 BILLION and Moore ' s Law is S
12 , 000 transistors . Intel ' s new Xenon has 5 BILLION and Moore ' s Law is S
12 , 000 transistors . Intel ' s new Xenon has 5 BILLION and Moore ' s Law is S
12 , 000 transistors . Intel ' s new Xenon has 5 BILLION and Moore ' s Law is S
12 , 000 transistors . Intel ' s new Xenon has 5 BILLION and Moore ' s Law is S
12 , 000 transistors . Intel ' s new Xenon has 5 BILLION and Moore ' s Law is S
12 , 000 transistors . Intel ' s new Xenon has 5 BILLION and Moore ' s Law is S
12 , 000 transistors . Intel ' s new Xenon has 5 BILLION and Moore ' s Law is S
12 , 000 transistors . Intel ' s new Xenon has 5 BILLION and Moore ' s Law is S
12 , 000 transistors . Intel ' s new Xenon has 5 BILLION and Moore ' s Law is S
12 , 000 transistors . Intel ' s new Xenon has 5 BILLION and Moore ' s Law is S
12 , 000 transistors . Intel ' s new Xenon has 5 BILLION and Moore ' s Law is S
12 , 000 transistors . Intel ' s new Xenon has 5 BILLION and Moore ' s Law is S
12 , 000 transistors . Intel ' s new Xenon has 5 BILLION and Moore ' s Law is S
12 , 000 transistors . Intel ' s new Xenon has 5 BILLION and Moore ' s Law is S
12 , 000 transistors . Intel ' s new Xenon has 5 BILLION and Moore ' s Law is S
12 , 000 transistors . Intel ' s new Xenon has 5 BILLION and Moore ' s Law is S
12 , 000 transistors . Intel ' s new Xenon has 5 BILLION and Moore ' s Law is S
12 , 000 transistors . Intel ' s new Xenon has 5 BILLION and Moore ' s Law is S
12 , 000 transistors . Intel ' s new Xenon has 5 BILLION and Moore ' s Law is S

Visualizing different words over time

In [482]:
query = ["xenon", "wales", "data", "rovio"]
col = ["red", "blue", "green", "black"]
frame = tweets["2014"].copy()  # work on a copy so tweets["2014"] is not modified
frame['text'] = frame['text'].str.lower()
results = {}

fig, ax = plt.subplots()

for q in range(len(query)):
    results[q] = frame[frame["text"].str.contains(query[q])].ix[:, ['Tweets']]
    results[q] = results[q].Tweets.resample('30min', how='count')
    results[q].ix['2014-01-19':'2014-01-22'].plot(color=col[q], label=query[q])

legend = ax.legend(loc='upper right', shadow=True)
plt.xlabel('Date')
plt.title('#DLD14 Conference Buzz for ' + ", ".join(query))
plt.ylabel('Number of Tweets')
Out[482]:
<matplotlib.text.Text at 0x2066c4e0>