Twitter text analysis¶

Let's load one day's worth of tweets from India. These were captured via the Twitter API. The file is at http://files.gramener.com/data/tweets.20130919.json.gz. It's just under 7MB.

First, let's download the file.

In [1]:

import os
import urllib

tweetfile = 'tweets.json.gz'
if not os.path.exists(tweetfile):
    url = 'http://files.gramener.com/data/tweets.20130919.json.gz'
    urllib.urlretrieve(url, tweetfile)

This file is not quite a gzipped JSON file, despite the file name. Each row is a JSON string. Some lines might be blank -- especially alternate lines.

In [3]:

import gzip
for line in gzip.open(tweetfile).readlines()[:8]:
    if line.strip():
        print line[:80]

{"created_at":"Wed Sep 18 03:39:02 +0000 2013","id":380174094936702976,"id_str":
{"created_at":"Wed Sep 18 03:39:02 +0000 2013","id":380174096635416577,"id_str":
{"created_at":"Wed Sep 18 03:39:06 +0000 2013","id":380174111076405248,"id_str":
{"created_at":"Wed Sep 18 03:39:16 +0000 2013","id":380174154751696896,"id_str":

Let's load this into a Pandas data structure. After some experimentation, I find that this is a reasonably fast way of loading it.

In [7]:

import pandas as pd
import json

series = pd.Series([
    line for line in gzip.open(tweetfile) if line.strip()
]).apply(json.loads)

data = pd.DataFrame({
  'id'  : series.apply(lambda t: t['id_str']),
  'name': series.apply(lambda t: t['user']['screen_name']),
  'text': series.apply(lambda t: t['text']),
}).set_index('id')

We've extracted just a few things from the tweets -- such as the ID (which we set as the index), the person who tweeted it, the text of the tweet.

In [8]:

data.head()

Out[8]:

	name	text
id
380174094936702976	rgokul	பின்னாடி பாத்தா பர்ஸ்னாலிட்டி, முன்னாடி பாத்தா...
380174096635416577	fknadaf	@rehu123 \nHi..re h r u..????
380174111076405248	neetakolhatkar	@sohamsabnis mhanunach jau dya..tyat phile jod...
380174154751696896	pinashah1	@Miragpur7 jok of tha day
380174182803202050	MeghaLvsShaleen	@ilovearrt @shweet_tasu @akanksha_pooh31 @Miss...

Pure Python¶

Now let's do some basic text analysis on this.

Most frequent words: `.split(' ')` and `.value_counts()`¶

Let's get the full text as a string and count the words. Let's assume that words are split by a single space.

In [10]:

words = pd.Series(' '.join(data['text']).split(' '))
words.value_counts().head()

Out[10]:

to     3256
the    3235
       2441
in     2275
a      2193
dtype: int64

There are lots of errors in the assumption that words are split by a single space. That ignores punctuation, multiple spaces, hyphenation, and a lot of other things. But it's not a bad starting point and you can start making reasonable inferences as a first approximation.

NLTK: `.word_tokenize()`¶

The process of converting a sentence into words is called tokenization. NLTK offers an nltk.word_tokenize() function for this. Let's try it out:

In [11]:

import nltk
for i in range(2, 6):
    print data['text'][i]
    print nltk.word_tokenize(data['text'][i])
    print ''

@sohamsabnis mhanunach jau dya..tyat phile jodi ne ahe...mhanje imagination la break nahi
[u'@', u'sohamsabnis', u'mhanunach', u'jau', u'dya..tyat', u'phile', u'jodi', u'ne', u'ahe', u'...', u'mhanje', u'imagination', u'la', u'break', u'nahi']

@Miragpur7 jok of tha day
[u'@', u'Miragpur7', u'jok', u'of', u'tha', u'day']

@ilovearrt @shweet_tasu @akanksha_pooh31 @MissHal96 @Mishtithakur @SalgaonkarPriya @Shaleen_Ki_Pari Super cute :p
[u'@', u'ilovearrt', u'@', u'shweet_tasu', u'@', u'akanksha_pooh31', u'@', u'MissHal96', u'@', u'Mishtithakur', u'@', u'SalgaonkarPriya', u'@', u'Shaleen_Ki_Pari', u'Super', u'cute', u':', u'p']

Looking forward to interacting with the dynamic students, faculty &amp; team of @SriSriU. Its fast becoming a global centre of excellence !
[u'Looking', u'forward', u'to', u'interacting', u'with', u'the', u'dynamic', u'students', u',', u'faculty', u'&', u'amp', u';', u'team', u'of', u'@', u'SriSriU', u'.', u'Its', u'fast', u'becoming', u'a', u'global', u'centre', u'of', u'excellence', u'!']

There are a few problems with this. User names like @ilovearrt are split into @ and iloverrrt. Similarly, & is split. And so on.

NLTK offers other tokenizers, including the ability to custom-write your own. But for now, we'll just go with our simple list of space-separated words.

NOTE: Tokenization is usually specific to a given dataset.

NLTK¶

Remove stopwords: `nltk.corpus.stopwords` and `.drop()`¶

The bigger problem is that the most common words are also the most often used -- to, the, in, a, etc. These are called stopwords. We need a way of finding and removing them.

NLTK offers a standard list of stopwords. This is what we get if we remove those.

In [12]:

from nltk.corpus import stopwords
ignore = set(stopwords.words('english')) & set(words.unique())
words.value_counts().drop(ignore)

Out[12]:

                           2441
I                          1817
I'm                         970
u                           695
-                           604
@                           507
The                         503
:)                          493
&amp;                       467
like                        390
hai                         364
(@                          363
good                        333
one                         285
get                         285
!                           281
time                        280
love                        269
n                           266
r                           263
day                         245
RT                          244
people                      242
:D                          240
7                           240
#ForSale                    227
#Flat                       226
don't                       223
iOS                         222
ur                          222
                           ... 
gained                        1
grips                         1
agreed..                      1
election.#congreefights       1
din....                       1
http://t.co/RXM8hgBBoS        1
langsunglah                   1
policies.                     1
thriller                      1
dummy                         1
Amen!                         1
थीं                           1
meetings                      1
#Mathura                      1
@AmypichardAmy                1
http://t.co/necdoOUAHU        1
Ravjiani,                     1
#TheAsianAge                  1
coffee.!!                     1
race's                        1
http://t.co/aH4i8A0Nz1"       1
real…                         1
http://t.co/GNzghJBYX1        1
update??                      1
#lazy                         1
107,#Gurgaon,                 1
annaru                        1
snooping                      1
@BangaloreAshram              1
जैन।                          1
dtype: int64

Still, it's not really clear what the words are. We need to go further.

Let's use lowecase for standardisation.
Let's remove punctuations. Maybe any word that even contains punctuation, like "I'm" or "&"
All single-letter words are a good idea to drop off too, like "u".

In [13]:

relevant_words = words.str.lower()
relevant_words = relevant_words[~relevant_words.str.contains(r'[^a-z]')]
relevant_words = relevant_words[relevant_words.str.len() > 1]

In [14]:

ignore = set(stopwords.words('english')) & set(relevant_words.unique())
relevant_words.value_counts().drop(ignore)

Out[14]:

good           543
like           418
hai            386
one            365
love           351
time           321
get            300
new            298
people         297
see            273
day            271
ios            255
rt             247
ki             242
ur             242
know           228
go             221
life           219
best           214
se             205
back           201
morning        200
make           192
never          192
hi             192
follow         188
still          188
want           185
india          180
way            178
              ... 
chattarpur       1
pleaseeeeee      1
bhujiya          1
chuploo          1
enuff            1
roost            1
cantt            1
parsvnath        1
expired          1
beam             1
beshak           1
cld              1
pace             1
mushtaq          1
howdy            1
ghalib           1
leya             1
pudhcha          1
pilgrim          1
soiled           1
lool             1
krissh           1
imo              1
muaaaaah         1
pranam           1
bevkoof          1
destroyed        1
quater           1
vasundhara       1
validity         1
dtype: int64

This list is a lot more meaningful.

But before we go ahead, let's take a quick look at the words we've ignored to see if we should've taken something from there.

In [15]:

words.drop(relevant_words.index).str.lower().value_counts().head(30)

Out[15]:

                2441
a               2377
i               2161
i'm              980
u                778
-                604
@                507
:)               493
&amp;            467
(@               363
don't            292
n                291
:p               287
r                285
!                281
it's             243
:d               241
7                240
#ios7            232
#forsale         227
#flat            226
2                217
.                216
?                215
!!               204
#residential     204
..               196
,                191
#bappamorya      189
:-)              173
dtype: int64

... Ah! We're missing all the smileys (which may be OK) and the hashtags (which could be useful). Should we just pull in the hashtags alone? Let's do that. We'll allow # as an exception. We'll also ignore @ which usually indicates reply to a person.

In [16]:

relevant_words = words.str.lower()
relevant_words = relevant_words[~relevant_words.str.contains(r'[^#@a-z]')]
relevant_words = relevant_words[relevant_words.str.len() > 1]
ignore = set(stopwords.words('english')) & set(relevant_words.unique())
relevant_words.value_counts().drop(ignore)

Out[16]:

good                      543
like                      418
hai                       386
one                       365
love                      351
time                      321
get                       300
new                       298
people                    297
see                       273
day                       271
ios                       255
rt                        247
ki                        242
ur                        242
know                      228
#forsale                  227
#flat                     226
go                        221
life                      219
best                      214
se                        205
#residential              204
back                      201
morning                   200
never                     192
make                      192
hi                        192
#bappamorya               189
follow                    188
                         ... 
circumstances               1
vaat                        1
parag                       1
recreate                    1
#pounding                   1
#nestle                     1
meuble                      1
#thingsthatmakemehappy      1
primarily                   1
kanipinchadu                1
#kathmandont                1
ruhu                        1
kashif                      1
tidak                       1
bl                          1
dekhaunchu                  1
jokingly                    1
inclination                 1
bd                          1
bf                          1
sants                       1
@itweetfacts                1
dictated                    1
bk                          1
#instaholic                 1
jaaoege                     1
mahmood                     1
br                          1
#justbeingme                1
betch                       1
dtype: int64

We haven't added anything to the list of top words, but further down, it may be useful.

Word stems: `nltk.PorterStemmer()`¶

Let's look at all the words that start with time, like timing, timer, etc.

In [17]:

relevant_words[relevant_words.str.startswith('tim')].value_counts()

Out[17]:

time         321
times         51
timeline       6
timings        3
timeless       3
timesnow       2
timetable      1
tim            1
timely         1
timezone       1
timing         1
timed          1
timline        1
timesheet      1
timepass       1
timro          1
dtype: int64

At the very least, we want time and times to mean the same word. These are word stems. Here's one way of doing this in NLTK.

In [18]:

porter = nltk.PorterStemmer()
stemmed_words = relevant_words.apply(porter.stem)
stemmed_words[stemmed_words.str.startswith('tim')].value_counts()

Out[18]:

time         378
timelin        6
timeless       3
timesnow       2
timlin         1
timezon        1
tim            1
timet          1
timesheet      1
timepass       1
timro          1
dtype: int64

Notice that this introduces words like timelin instead of timeline. These can be avoided through the use of a process called lemmatization (see nltk.WordNetLemmatizer()). However, this is relatively slower.

For now, we'll just stick to the original words.

Bigrams: `nltk.collocations`¶

What if we want to find phrases? If we're looking for 2-word combinations (bigrams), we can use the nltk.collocations.BigramCollocationFinder. These are the top 30 word pairs.

In [19]:

from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

bcf = BigramCollocationFinder.from_words(relevant_words)
for pair in bcf.nbest(BigramAssocMeasures.likelihood_ratio, 30):
    print ' '.join(pair)

#bappamorya #bappamorya
good morning
#flat #forsale
#jacksonville #jobs
will be
#residentialplot #land
to be
agle baras
international airport
#land #forsale
posted photo
baras tu
tu jaldi
#smwmumbai #mumbaiisamazing
now trending
happy birthday
just posted
in the
waiting for
trending topic
jaldi aa
gracious acts
#apartment #flat
@smwmumbai #smwmumbai
follow back
the best
railway station
cycling km
god bless
shows up

See this as a word cloud¶

Let's get the data into a DataFrame

In [20]:

top_words = relevant_words.value_counts().drop(ignore).reset_index()
top_words.columns = ['word', 'count']
top_words.head()

Out[20]:

	word	count
0	good	543
1	like	418
2	hai	386
3	one	365
4	love	351

(Work in progress...)

sklearn¶

In [21]:

import re

re_separator = re.compile(r'[\s"#\.\?,;\(\)!/]+')
re_url = re.compile(r'http.*?($|\s)')
def tokenize(sentence):
    sentence = re_url.sub('', sentence)
    words = re_separator.split(sentence)
    return [word for word in words if
            len(word) > 1]

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(
    # analyser='word',             # Separate using punctuations
    # analyzer=re_separator.split, # Separate using spaces
    # analyzer=re_separator.split, # Separate using custom separator
    analyzer=tokenize,             # Separate using custom separator
    min_df=10,                   # Ignore words that occur less than 10 times in the corpus
)

In [22]:

# Note: for these 18,000 documents, sklearn takes about ~0.5 seconds on my system
X = vectorizer.fit_transform(data['text'])

In [24]:

# Here are some of the terms that have special characters 
print '# terms: %d' % len(vectorizer.vocabulary_)
for key in vectorizer.vocabulary_.keys():
    if re.search('\W', key) and not re.search(r'[@#\']', key) and re.search('\w', key):
        print key, vectorizer.vocabulary_[key]

# terms: 2482
^_^ 869
:-D 51
I’m 482
don’t 1203
&amp 0
[pic] 867
:-P 52
-www 7
[pic]: 868
&gt 1
alert: 908
here: 1437
100% 11
Job: 490
IN: 454
Café 292
:3 53
:o 57
:p 58
:O 55
:D 54
:P 56
Méridien 592
-_- 6
&lt 2
10:30 12

In [25]:

# Apply TF-IDF
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X)

In [26]:

# Let's see the unusual terms
import numpy as np
terms = np.array(vectorizer.get_feature_names())
for index in range(100):
    t = terms[(tfidf[index] >= 0.99).toarray()[0]]
    if len(t):
        print index, t, data['text'][index]

6 [u'sorry'] @b50 oops...sorry typo. 'Than'
7 [u'place'] 9h09 place au someil maintenant
24 [u'GM'] @satish_bsk GM
25 [u'org'] @mrlumpyU_U xbek menindas org yg xblik msia. Ngagaha
26 [u'Hey'] Hey evrybuddy http://t.co/vH89PFhyYg
35 [u'ha'] @gauthamvarma04 ha ha
85 [u'ma'] @bindeshpandya gm$... Jay ma bharat..vande mataram.. Namo namah...@BJYM @BJP_Gujarat

In [28]:

# Segment by those with above median followers
followers_count = series.map(lambda v: v['user']['followers_count'])
segment = followers_count.values > followers_count.median()
count1 = X[segment].sum(axis=0)
count2 = X[~segment].sum(axis=0)

In [29]:

# Count of term in each segment
df = pd.DataFrame(np.concatenate([count1, count2]).T).astype(float)
df.columns = ['a', 'b']
df['term'] = terms
df.head()

Out[29]:

	a	b	term
0	261	242	&amp
1	74	53	&gt
2	48	44	&lt
3	4	7	's
4	4	16	--

In [30]:

total = df['a'] + df['b']
contrast = df['a'] / total - 0.5
freq = total.rank() / len(df)
df['significance'] = freq / 2 + contrast.abs()

In [31]:

df.sort_values('significance', ascending=False).head()

Out[31]:

	a	b	term	significance
664	290	1	Property	0.985282
370	239	0	ForSale	0.985093
365	232	0	Flat	0.983783
252	0	189	BappaMorya	0.980459
688	222	2	Residential	0.973747

In [52]:

def termdiff(terms, counts, segment):
    df = pd.DataFrame(np.concatenate([
                counts[segment].sum(axis=0),
                counts[~segment].sum(axis=0)
            ]).T).astype(float)
    df.columns = ['a', 'b']
    df['term'] = terms
    total = df['a'] + df['b']
    df['contrast'] = 2 * (df['a'] / total - 0.5)
    df['freq'] = total.rank() / len(df)
    df['significance'] = (df['freq'] + df['contrast'].abs()) / 2
    return df.sort_values('significance', ascending=False)

In [53]:

termdiff(terms, X, segment).head()

Out[53]:

	a	b	term	contrast	freq	significance
664	290	1	Property	0.993127	0.977438	0.985282
370	239	0	ForSale	1.000000	0.970185	0.985093
365	232	0	Flat	1.000000	0.967566	0.983783
252	0	189	BappaMorya	-1.000000	0.960919	0.980459
688	222	2	Residential	0.982143	0.965351	0.973747

There seem to be several influential people on Twitter tweetings about properties for sale. Non-influential people are tweeting about BappaMorya.

In [54]:

with_hashtags = series.apply(lambda v: len(v['entities']['hashtags']) > 0).values

In [59]:

tdiff = termdiff(terms, X, with_hashtags)
tdiff[tdiff['b'] > tdiff['a']].head(10)

Out[59]:

	a	b	term	contrast	freq	significance
2419	0	135	टन	-1.000000	0.943191	0.971595
451	35	946	I'm	-0.928644	0.995568	0.962106
551	2	134	Maharashtra	-0.970588	0.943795	0.957192
2399	3	98	और	-0.940594	0.922643	0.931619
1163	2	76	dear	-0.948718	0.891620	0.920169
2409	10	156	के	-0.879518	0.956285	0.917902
1604	0	56	lessons	-1.000000	0.835012	0.917506
2465	19	246	है	-0.856604	0.974416	0.915510
697	0	54	Rumi	-1.000000	0.829371	0.914686
2468	1	63	है।	-0.968750	0.860596	0.914673

Tweets without hashtags tend to be Hindi tweets.

The word "I'm" often is used without hashtags. (These are typically tweets that say "I'm at".)

In [84]:

data.ix[X.T[451].toarray()[0] > 0]['text'].values[:5]

Out[84]:

array([u"I'm at LINK (Mumbai, Maharashtra) http://t.co/ComXHpCbua",
       u"I'm getting fragrance of a dish being cooked in pure ghee... seems yum",
       u"I'm at Le Meridien - @spg (Bangalore, Karnataka) http://t.co/GhzDYpdTRu",
       u"I'm at Lajpat Nagar Metro Station (New Delhi, new delhi) http://t.co/MNEHQ9Qesg",
       u"I'm at Godrej Memorial Hospital (Mumbai, Maharashtra) http://t.co/8lieJFiZH5"], dtype=object)

The word "dear" is often used in hashtags. These are typically replies.

In [85]:

data.ix[X.T[1163].toarray()[0] > 0]['text'].values[:5]

Out[85]:

array([u"@Ghislainemonie hi dear what's new for dinner today. I can't decide..",
       u'@bbhhaappyy So nice of you dear! Trust me and have a great day ahead! Stay blessed and keep connected. Rabh rakha hai!',
       u'@MJCfan keep smiling dear..have a great day ahead..',
       u'@sonamakapoor @PerniaQureshi.       Hi good morning dear',
       u'@2ps664 @skelkar07 @keerti07 @TahminaJaved @sheetal3176 @Jyoramesh10 hi dear how r u'], dtype=object)

In [92]:

tdiff = termdiff(terms, X, series.map(lambda v: v['user']['location'].lower().startswith('bangalore')))

In [96]:

tdiff.head()

Out[96]:

	a	b	term	contrast	freq	significance
435	842	18155	Hi	-0.911354	0.999799	0.955576
1921	842	18155	re	-0.911354	0.999799	0.955576
0	0	0	&amp	NaN	0.499799	NaN
1	0	0	&gt	NaN	0.499799	NaN
2	0	0	&lt	NaN	0.499799	NaN

Lessons learnt¶

Tokenisation and filtering of words always have a manual element -- so make that easy.
- But are there some robust English tokenisation patterns?
Have a single function that tells me what token is unusual about a group
For each token, show the concordance for context

spaCy¶

Install spaCy

conda config --add channels spacy
conda install spacy
python -m spacy.en.download all

If you get an SSL error, run:

conda config --set ssl_verify False

and re-run the above commands. This adds an ssl_verify: False statement to ~/.condarc add a line:

In [ ]: