Let's load one day's worth of tweets from India. These were captured via the Twitter API. The file is at http://files.gramener.com/data/tweets.20130919.json.gz. It's just under 7MB.
First, let's download the file.
import os
import urllib
tweetfile = 'tweets.json.gz'
if not os.path.exists(tweetfile):
url = 'http://files.gramener.com/data/tweets.20130919.json.gz'
urllib.urlretrieve(url, tweetfile)
This file is not quite a gzipped JSON file, despite the file name. Each row is a JSON string. Some lines might be blank -- especially alternate lines.
import gzip
for line in gzip.open(tweetfile).readlines()[:8]:
if line.strip():
print line[:80]
{"created_at":"Wed Sep 18 03:39:02 +0000 2013","id":380174094936702976,"id_str": {"created_at":"Wed Sep 18 03:39:02 +0000 2013","id":380174096635416577,"id_str": {"created_at":"Wed Sep 18 03:39:06 +0000 2013","id":380174111076405248,"id_str": {"created_at":"Wed Sep 18 03:39:16 +0000 2013","id":380174154751696896,"id_str":
Let's load this into a Pandas data structure. After some experimentation, I find that this is a reasonably fast way of loading it.
import pandas as pd
import json
series = pd.Series([
line for line in gzip.open(tweetfile) if line.strip()
]).apply(json.loads)
data = pd.DataFrame({
'id' : series.apply(lambda t: t['id_str']),
'name': series.apply(lambda t: t['user']['screen_name']),
'text': series.apply(lambda t: t['text']),
}).set_index('id')
We've extracted just a few things from the tweets -- such as the ID (which we set as the index), the person who tweeted it, the text of the tweet.
data.head()
name | text | |
---|---|---|
id | ||
380174094936702976 | rgokul | பின்னாடி பாத்தா பர்ஸ்னாலிட்டி, முன்னாடி பாத்தா... |
380174096635416577 | fknadaf | @rehu123 \nHi..re h r u..???? |
380174111076405248 | neetakolhatkar | @sohamsabnis mhanunach jau dya..tyat phile jod... |
380174154751696896 | pinashah1 | @Miragpur7 jok of tha day |
380174182803202050 | MeghaLvsShaleen | @ilovearrt @shweet_tasu @akanksha_pooh31 @Miss... |
words = pd.Series(' '.join(data['text']).split(' '))
words.value_counts().head()
to 3256 the 3235 2441 in 2275 a 2193 dtype: int64
There are lots of errors in the assumption that words are split by a single space. That ignores punctuation, multiple spaces, hyphenation, and a lot of other things. But it's not a bad starting point and you can start making reasonable inferences as a first approximation.
.word_tokenize()
¶The process of converting a sentence into words is called tokenization. NLTK offers an nltk.word_tokenize()
function for this. Let's try it out:
import nltk
for i in range(2, 6):
print data['text'][i]
print nltk.word_tokenize(data['text'][i])
print ''
@sohamsabnis mhanunach jau dya..tyat phile jodi ne ahe...mhanje imagination la break nahi [u'@', u'sohamsabnis', u'mhanunach', u'jau', u'dya..tyat', u'phile', u'jodi', u'ne', u'ahe', u'...', u'mhanje', u'imagination', u'la', u'break', u'nahi'] @Miragpur7 jok of tha day [u'@', u'Miragpur7', u'jok', u'of', u'tha', u'day'] @ilovearrt @shweet_tasu @akanksha_pooh31 @MissHal96 @Mishtithakur @SalgaonkarPriya @Shaleen_Ki_Pari Super cute :p [u'@', u'ilovearrt', u'@', u'shweet_tasu', u'@', u'akanksha_pooh31', u'@', u'MissHal96', u'@', u'Mishtithakur', u'@', u'SalgaonkarPriya', u'@', u'Shaleen_Ki_Pari', u'Super', u'cute', u':', u'p'] Looking forward to interacting with the dynamic students, faculty & team of @SriSriU. Its fast becoming a global centre of excellence ! [u'Looking', u'forward', u'to', u'interacting', u'with', u'the', u'dynamic', u'students', u',', u'faculty', u'&', u'amp', u';', u'team', u'of', u'@', u'SriSriU', u'.', u'Its', u'fast', u'becoming', u'a', u'global', u'centre', u'of', u'excellence', u'!']
There are a few problems with this. User names like @ilovearrt
are split into @
and iloverrrt
. Similarly, &
is split. And so on.
NLTK offers other tokenizers, including the ability to custom-write your own. But for now, we'll just go with our simple list of space-separated words.
NOTE: Tokenization is usually specific to a given dataset.
nltk.corpus.stopwords
and .drop()
¶The bigger problem is that the most common words are also the most often used -- to, the, in, a, etc. These are called stopwords. We need a way of finding and removing them.
NLTK offers a standard list of stopwords. This is what we get if we remove those.
from nltk.corpus import stopwords
ignore = set(stopwords.words('english')) & set(words.unique())
words.value_counts().drop(ignore)
2441 I 1817 I'm 970 u 695 - 604 @ 507 The 503 :) 493 & 467 like 390 hai 364 (@ 363 good 333 one 285 get 285 ! 281 time 280 love 269 n 266 r 263 day 245 RT 244 people 242 :D 240 7 240 #ForSale 227 #Flat 226 don't 223 iOS 222 ur 222 ... gained 1 grips 1 agreed.. 1 election.#congreefights 1 din.... 1 http://t.co/RXM8hgBBoS 1 langsunglah 1 policies. 1 thriller 1 dummy 1 Amen! 1 थीं 1 meetings 1 #Mathura 1 @AmypichardAmy 1 http://t.co/necdoOUAHU 1 Ravjiani, 1 #TheAsianAge 1 coffee.!! 1 race's 1 http://t.co/aH4i8A0Nz1" 1 real… 1 http://t.co/GNzghJBYX1 1 update?? 1 #lazy 1 107,#Gurgaon, 1 annaru 1 snooping 1 @BangaloreAshram 1 जैन। 1 dtype: int64
Still, it's not really clear what the words are. We need to go further.
relevant_words = words.str.lower()
relevant_words = relevant_words[~relevant_words.str.contains(r'[^a-z]')]
relevant_words = relevant_words[relevant_words.str.len() > 1]
ignore = set(stopwords.words('english')) & set(relevant_words.unique())
relevant_words.value_counts().drop(ignore)
good 543 like 418 hai 386 one 365 love 351 time 321 get 300 new 298 people 297 see 273 day 271 ios 255 rt 247 ki 242 ur 242 know 228 go 221 life 219 best 214 se 205 back 201 morning 200 make 192 never 192 hi 192 follow 188 still 188 want 185 india 180 way 178 ... chattarpur 1 pleaseeeeee 1 bhujiya 1 chuploo 1 enuff 1 roost 1 cantt 1 parsvnath 1 expired 1 beam 1 beshak 1 cld 1 pace 1 mushtaq 1 howdy 1 ghalib 1 leya 1 pudhcha 1 pilgrim 1 soiled 1 lool 1 krissh 1 imo 1 muaaaaah 1 pranam 1 bevkoof 1 destroyed 1 quater 1 vasundhara 1 validity 1 dtype: int64
This list is a lot more meaningful.
But before we go ahead, let's take a quick look at the words we've ignored to see if we should've taken something from there.
words.drop(relevant_words.index).str.lower().value_counts().head(30)
2441 a 2377 i 2161 i'm 980 u 778 - 604 @ 507 :) 493 & 467 (@ 363 don't 292 n 291 :p 287 r 285 ! 281 it's 243 :d 241 7 240 #ios7 232 #forsale 227 #flat 226 2 217 . 216 ? 215 !! 204 #residential 204 .. 196 , 191 #bappamorya 189 :-) 173 dtype: int64
... Ah! We're missing all the smileys (which may be OK) and the hashtags (which could be useful). Should we just pull in the hashtags alone? Let's do that. We'll allow #
as an exception. We'll also ignore @
which usually indicates reply to a person.
relevant_words = words.str.lower()
relevant_words = relevant_words[~relevant_words.str.contains(r'[^#@a-z]')]
relevant_words = relevant_words[relevant_words.str.len() > 1]
ignore = set(stopwords.words('english')) & set(relevant_words.unique())
relevant_words.value_counts().drop(ignore)
good 543 like 418 hai 386 one 365 love 351 time 321 get 300 new 298 people 297 see 273 day 271 ios 255 rt 247 ki 242 ur 242 know 228 #forsale 227 #flat 226 go 221 life 219 best 214 se 205 #residential 204 back 201 morning 200 never 192 make 192 hi 192 #bappamorya 189 follow 188 ... circumstances 1 vaat 1 parag 1 recreate 1 #pounding 1 #nestle 1 meuble 1 #thingsthatmakemehappy 1 primarily 1 kanipinchadu 1 #kathmandont 1 ruhu 1 kashif 1 tidak 1 bl 1 dekhaunchu 1 jokingly 1 inclination 1 bd 1 bf 1 sants 1 @itweetfacts 1 dictated 1 bk 1 #instaholic 1 jaaoege 1 mahmood 1 br 1 #justbeingme 1 betch 1 dtype: int64
We haven't added anything to the list of top words, but further down, it may be useful.
nltk.PorterStemmer()
¶Let's look at all the words that start with time
, like timing
, timer
, etc.
relevant_words[relevant_words.str.startswith('tim')].value_counts()
time 321 times 51 timeline 6 timings 3 timeless 3 timesnow 2 timetable 1 tim 1 timely 1 timezone 1 timing 1 timed 1 timline 1 timesheet 1 timepass 1 timro 1 dtype: int64
At the very least, we want time
and times
to mean the same word. These are word stems. Here's one way of doing this in NLTK.
porter = nltk.PorterStemmer()
stemmed_words = relevant_words.apply(porter.stem)
stemmed_words[stemmed_words.str.startswith('tim')].value_counts()
time 378 timelin 6 timeless 3 timesnow 2 timlin 1 timezon 1 tim 1 timet 1 timesheet 1 timepass 1 timro 1 dtype: int64
Notice that this introduces words like timelin
instead of timeline
. These can be avoided through the use of a process called lemmatization
(see nltk.WordNetLemmatizer()
). However, this is relatively slower.
For now, we'll just stick to the original words.
nltk.collocations
¶What if we want to find phrases? If we're looking for 2-word combinations (bigrams), we can use the nltk.collocations.BigramCollocationFinder
. These are the top 30 word pairs.
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
bcf = BigramCollocationFinder.from_words(relevant_words)
for pair in bcf.nbest(BigramAssocMeasures.likelihood_ratio, 30):
print ' '.join(pair)
#bappamorya #bappamorya good morning #flat #forsale #jacksonville #jobs will be #residentialplot #land to be agle baras international airport #land #forsale posted photo baras tu tu jaldi #smwmumbai #mumbaiisamazing now trending happy birthday just posted in the waiting for trending topic jaldi aa gracious acts #apartment #flat @smwmumbai #smwmumbai follow back the best railway station cycling km god bless shows up
Let's get the data into a DataFrame
top_words = relevant_words.value_counts().drop(ignore).reset_index()
top_words.columns = ['word', 'count']
top_words.head()
word | count | |
---|---|---|
0 | good | 543 |
1 | like | 418 |
2 | hai | 386 |
3 | one | 365 |
4 | love | 351 |
(Work in progress...)
import re
re_separator = re.compile(r'[\s"#\.\?,;\(\)!/]+')
re_url = re.compile(r'http.*?($|\s)')
def tokenize(sentence):
sentence = re_url.sub('', sentence)
words = re_separator.split(sentence)
return [word for word in words if
len(word) > 1]
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(
# analyser='word', # Separate using punctuations
# analyzer=re_separator.split, # Separate using spaces
# analyzer=re_separator.split, # Separate using custom separator
analyzer=tokenize, # Separate using custom separator
min_df=10, # Ignore words that occur less than 10 times in the corpus
)
# Note: for these 18,000 documents, sklearn takes about ~0.5 seconds on my system
X = vectorizer.fit_transform(data['text'])
# Here are some of the terms that have special characters
print '# terms: %d' % len(vectorizer.vocabulary_)
for key in vectorizer.vocabulary_.keys():
if re.search('\W', key) and not re.search(r'[@#\']', key) and re.search('\w', key):
print key, vectorizer.vocabulary_[key]
# terms: 2482 ^_^ 869 :-D 51 I’m 482 don’t 1203 & 0 [pic] 867 :-P 52 -www 7 [pic]: 868 > 1 alert: 908 here: 1437 100% 11 Job: 490 IN: 454 Café 292 :3 53 :o 57 :p 58 :O 55 :D 54 :P 56 Méridien 592 -_- 6 < 2 10:30 12
# Apply TF-IDF
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X)
# Let's see the unusual terms
import numpy as np
terms = np.array(vectorizer.get_feature_names())
for index in range(100):
t = terms[(tfidf[index] >= 0.99).toarray()[0]]
if len(t):
print index, t, data['text'][index]
6 [u'sorry'] @b50 oops...sorry typo. 'Than' 7 [u'place'] 9h09 place au someil maintenant 24 [u'GM'] @satish_bsk GM 25 [u'org'] @mrlumpyU_U xbek menindas org yg xblik msia. Ngagaha 26 [u'Hey'] Hey evrybuddy http://t.co/vH89PFhyYg 35 [u'ha'] @gauthamvarma04 ha ha 85 [u'ma'] @bindeshpandya gm$... Jay ma bharat..vande mataram.. Namo namah...@BJYM @BJP_Gujarat
# Segment by those with above median followers
followers_count = series.map(lambda v: v['user']['followers_count'])
segment = followers_count.values > followers_count.median()
count1 = X[segment].sum(axis=0)
count2 = X[~segment].sum(axis=0)
# Count of term in each segment
df = pd.DataFrame(np.concatenate([count1, count2]).T).astype(float)
df.columns = ['a', 'b']
df['term'] = terms
df.head()
a | b | term | |
---|---|---|---|
0 | 261 | 242 | & |
1 | 74 | 53 | > |
2 | 48 | 44 | < |
3 | 4 | 7 | 's |
4 | 4 | 16 | -- |
total = df['a'] + df['b']
contrast = df['a'] / total - 0.5
freq = total.rank() / len(df)
df['significance'] = freq / 2 + contrast.abs()
df.sort_values('significance', ascending=False).head()
a | b | term | significance | |
---|---|---|---|---|
664 | 290 | 1 | Property | 0.985282 |
370 | 239 | 0 | ForSale | 0.985093 |
365 | 232 | 0 | Flat | 0.983783 |
252 | 0 | 189 | BappaMorya | 0.980459 |
688 | 222 | 2 | Residential | 0.973747 |
def termdiff(terms, counts, segment):
df = pd.DataFrame(np.concatenate([
counts[segment].sum(axis=0),
counts[~segment].sum(axis=0)
]).T).astype(float)
df.columns = ['a', 'b']
df['term'] = terms
total = df['a'] + df['b']
df['contrast'] = 2 * (df['a'] / total - 0.5)
df['freq'] = total.rank() / len(df)
df['significance'] = (df['freq'] + df['contrast'].abs()) / 2
return df.sort_values('significance', ascending=False)
termdiff(terms, X, segment).head()
a | b | term | contrast | freq | significance | |
---|---|---|---|---|---|---|
664 | 290 | 1 | Property | 0.993127 | 0.977438 | 0.985282 |
370 | 239 | 0 | ForSale | 1.000000 | 0.970185 | 0.985093 |
365 | 232 | 0 | Flat | 1.000000 | 0.967566 | 0.983783 |
252 | 0 | 189 | BappaMorya | -1.000000 | 0.960919 | 0.980459 |
688 | 222 | 2 | Residential | 0.982143 | 0.965351 | 0.973747 |
There seem to be several influential people on Twitter tweetings about properties for sale. Non-influential people are tweeting about BappaMorya.
with_hashtags = series.apply(lambda v: len(v['entities']['hashtags']) > 0).values
tdiff = termdiff(terms, X, with_hashtags)
tdiff[tdiff['b'] > tdiff['a']].head(10)
a | b | term | contrast | freq | significance | |
---|---|---|---|---|---|---|
2419 | 0 | 135 | टन | -1.000000 | 0.943191 | 0.971595 |
451 | 35 | 946 | I'm | -0.928644 | 0.995568 | 0.962106 |
551 | 2 | 134 | Maharashtra | -0.970588 | 0.943795 | 0.957192 |
2399 | 3 | 98 | और | -0.940594 | 0.922643 | 0.931619 |
1163 | 2 | 76 | dear | -0.948718 | 0.891620 | 0.920169 |
2409 | 10 | 156 | के | -0.879518 | 0.956285 | 0.917902 |
1604 | 0 | 56 | lessons | -1.000000 | 0.835012 | 0.917506 |
2465 | 19 | 246 | है | -0.856604 | 0.974416 | 0.915510 |
697 | 0 | 54 | Rumi | -1.000000 | 0.829371 | 0.914686 |
2468 | 1 | 63 | है। | -0.968750 | 0.860596 | 0.914673 |
Tweets without hashtags tend to be Hindi tweets.
The word "I'm" often is used without hashtags. (These are typically tweets that say "I'm at".)
data.ix[X.T[451].toarray()[0] > 0]['text'].values[:5]
array([u"I'm at LINK (Mumbai, Maharashtra) http://t.co/ComXHpCbua", u"I'm getting fragrance of a dish being cooked in pure ghee... seems yum", u"I'm at Le Meridien - @spg (Bangalore, Karnataka) http://t.co/GhzDYpdTRu", u"I'm at Lajpat Nagar Metro Station (New Delhi, new delhi) http://t.co/MNEHQ9Qesg", u"I'm at Godrej Memorial Hospital (Mumbai, Maharashtra) http://t.co/8lieJFiZH5"], dtype=object)
The word "dear" is often used in hashtags. These are typically replies.
data.ix[X.T[1163].toarray()[0] > 0]['text'].values[:5]
array([u"@Ghislainemonie hi dear what's new for dinner today. I can't decide..", u'@bbhhaappyy So nice of you dear! Trust me and have a great day ahead! Stay blessed and keep connected. Rabh rakha hai!', u'@MJCfan keep smiling dear..have a great day ahead..', u'@sonamakapoor @PerniaQureshi. Hi good morning dear', u'@2ps664 @skelkar07 @keerti07 @TahminaJaved @sheetal3176 @Jyoramesh10 hi dear how r u'], dtype=object)
tdiff = termdiff(terms, X, series.map(lambda v: v['user']['location'].lower().startswith('bangalore')))
tdiff.head()
a | b | term | contrast | freq | significance | |
---|---|---|---|---|---|---|
435 | 842 | 18155 | Hi | -0.911354 | 0.999799 | 0.955576 |
1921 | 842 | 18155 | re | -0.911354 | 0.999799 | 0.955576 |
0 | 0 | 0 | & | NaN | 0.499799 | NaN |
1 | 0 | 0 | > | NaN | 0.499799 | NaN |
2 | 0 | 0 | < | NaN | 0.499799 | NaN |