# Data Cleaning: Twitter Data


## Data cleaning

• Clean up broken rows
• Split the columns correctly
• Extract the content to be analyzed
• Preprocess large-scale data line by line or in chunks
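
As a rough illustration of the last point, the sketch below reads a large text file in chunks of whole lines instead of loading it all at once. The helper names `iter_chunks` and `count_lines` and the `size_hint` default are assumptions made for illustration; they are not part of the original notebook.

```python
def iter_chunks(path, size_hint=1000000):
    """Yield lists of whole lines, roughly size_hint bytes per list."""
    with open(path, 'rb') as f:
        while True:
            # readlines(hint) stops once the lines read so far reach about `hint` bytes
            chunk = f.readlines(size_hint)
            if not chunk:  # an empty list means end of file
                break
            yield chunk


def count_lines(path, size_hint=1000000):
    """Count lines chunk by chunk, so only one chunk is held in memory at a time."""
    total = 0
    for chunk in iter_chunks(path, size_hint):
        total += len(chunk)
    return total
```

Any per-chunk cleaning step can replace the counting in `count_lines`; the point is only that each chunk is processed and then discarded.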

# 1. Extract a sample of tweets for experimentation

In [23]:
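bigfile = open('/Users/chengjun/百度云同步盘/Writing/OWS/ows-raw.txt', 'rb')
chunkSize = 1000000
# Read whole lines until roughly chunkSize bytes have been consumed
chunk = bigfile.readlines(chunkSize)
bigfile.close()
print len(chunk)
# Save the sample so that the later steps can work on a small file
with open("/Users/chengjun/GitHub/cjc2016/data/ows_tweets_sample.txt", 'w') as f:
    for i in chunk:
        f.write(i)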
2752

# 2. Cleaning up broken lines

In [13]:
with open("/Users/chengjun/GitHub/cjc2016/data/ows_tweets_sample.txt", 'rb') as f:
    lines = f.readlines()  # read every line of the sample into a list
In [14]:
# Total number of lines
len(lines)
Out[14]:
2752
In [15]:
# Look at the first line
lines[0]
Out[15]:
'"Twitter ID",Text,"Profile Image URL",Day,Hour,Minute,"Created At",Geo,"From User","From User ID",Language,"To User","To User ID",Source\n'

# Problem: the first line holds the variable names

## 2. How do we get each variable name?

In [16]:
varNames = lines[0].replace('\n', '').split(',')
varNames
Out[16]:
['"Twitter ID"',
'Text',
'"Profile Image URL"',
'Day',
'Hour',
'Minute',
'"Created At"',
'Geo',
'"From User"',
'"From User ID"',
'Language',
'"To User"',
'"To User ID"',
'Source']
In [17]:
len(varNames)
Out[17]:
14
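
Splitting on commas keeps the double quotes around names such as `"Twitter ID"`. One possible alternative, sketched here under the assumption that the header follows the usual CSV quoting rules, is to let the `csv` module parse the line. The `header` string is copied from `lines[0]` above; `col_index` is a hypothetical helper mapping.

```python
import csv

header = ('"Twitter ID",Text,"Profile Image URL",Day,Hour,Minute,"Created At",'
          'Geo,"From User","From User ID",Language,"To User","To User ID",Source\n')

# csv.reader accepts any iterable of lines, so a one-element list is enough
var_names = next(csv.reader([header]))

# Map each name to its column index, e.g. to look up the user column by name
col_index = {name: i for i, name in enumerate(var_names)}
```

With this mapping, `col_index['From User']` can stand in for a hard-coded column position.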
In [26]:
lines[1344]
Out[26]:
'121818600490283009,"RT @chachiTHEgr8: RT @TheNewDeal: First they ignore you, then they laugh at you, then they fight you, then you win. - Gandhi #OccupyWallStreet #OWS #p2",http://a0.twimg.com/profile_images/326662126/Photo_233_normal.jpg,2011-10-06,5,26,"2011-10-06 05:26:32",N;,k_l_h_j,382233343,en,,0,"&lt;a href=&quot;http://twitter.com/#!/download/iphone&quot; rel=&quot;nofollow&quot;&gt;Twitter for iPhone&lt;/a&gt;"\n'

# How do we handle wrongly broken lines?

In [30]:
with open("/Users/chengjun/GitHub/cjc2016/data/ows_tweets_sample_clean.txt", 'w') as f:
    right_line = ''  # the record being assembled; starts as an empty string
    blocks = []  # records judged complete are appended to blocks
    for line in lines:
        right_line += line.replace('\n', ' ')
        line_length = len(right_line.split(','))
        # A complete record has at least 14 comma-separated pieces
        if line_length >= 14:
            blocks.append(right_line)
            right_line = ''
    for i in blocks:
        f.write(i + '\n')
In [20]:
len(blocks)
Out[20]:
2627
In [27]:
blocks[1344]
Out[27]:
'121818879105310720,"RT @Min_Reyes: RT @The99Percenters: New video to go viral. From We Are Change\rhttp://t.co/6Ff718jk Listen to the guy begging... #ows #cdnpoli",http://a3.twimg.com/sticky/default_profile_images/default_profile_0_normal.png,2011-10-06,5,27,"2011-10-06 05:27:38",N;,MiyazakiMegu,260948518,en,,0,"&lt;a href=&quot;http://www.tweetdeck.com&quot; rel=&quot;nofollow&quot;&gt;TweetDeck&lt;/a&gt;"'
In [32]:
with open("/Users/chengjun/GitHub/cjc2016/data/ows_tweets_sample_clean4.txt", 'w') as f:
    right_line = ''  # the record being assembled; starts as an empty string
    blocks = []  # records judged complete are appended to blocks
    for line in lines:
        # Also replace carriage returns, which survived the first attempt
        right_line += line.replace('\n', ' ').replace('\r', ' ')
        line_length = len(right_line.split(','))
        if line_length >= 14:
            blocks.append(right_line)
            right_line = ''
    for i in blocks:
        f.write(i + '\n')
In [33]:
blocks[1344]
Out[33]:
'121818879105310720,"RT @Min_Reyes: RT @The99Percenters: New video to go viral. From We Are Change http://t.co/6Ff718jk Listen to the guy begging... #ows #cdnpoli",http://a3.twimg.com/sticky/default_profile_images/default_profile_0_normal.png,2011-10-06,5,27,"2011-10-06 05:27:38",N;,MiyazakiMegu,260948518,en,,0,"&lt;a href=&quot;http://www.tweetdeck.com&quot; rel=&quot;nofollow&quot;&gt;TweetDeck&lt;/a&gt;" '
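
Counting commas is a heuristic: a tweet that itself contains commas can make a half-finished record look complete. If the stray line breaks always fall inside a quoted field (an assumption about this data set, not something verified here), the `csv` module can stitch such records back together on its own. The sketch below uses a made-up three-column sample and is written for Python 3.

```python
import csv
import io

# Invented sample: the second record is broken across two physical lines
raw = ('id,text,user\n'
       '1,"first half\nsecond half",alice\n'
       '2,"plain, with a comma",bob\n')

# newline='' hands line endings to the csv module untouched, so a quoted
# field that spans several physical lines is read as one field
reader = csv.reader(io.StringIO(raw, newline=''))

records = []
for row in reader:
    # Collapse any whitespace run (including \n and \r) into a single space
    records.append([' '.join(field.split()) for field in row])
```

Unlike the comma count, this treats the comma inside `"plain, with a comma"` as part of the text, not as a column separator.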

# 3. Reading the data and splitting the columns correctly

In [2]:
# Note: you may need to change the path below
with open("/Users/chengjun/GitHub/cjc2016/data/ows_tweets_sample_clean.txt", 'rb') as f:
    chunk = f.readlines()  # one cleaned record per line
In [3]:
len(chunk)
Out[3]:
2627
In [4]:
chunk[:3]
Out[4]:
['"Twitter ID",Text,"Profile Image URL",Day,Hour,Minute,"Created At",Geo,"From User","From User ID",Language,"To User","To User ID",Source\n',
'121813146137657344,"@jamiekilstein @allisonkilkenny Interesting interview (never aired, wonder why??) by Fox with #ows protester http://t.co/Fte55Kh7",http://a2.twimg.com/profile_images/1574715503/Kate6_normal.jpg,2011-10-06,5,4,"2011-10-06 05:04:51",N;,KittyHybrid,34532053,en,jamiekilstein,2149053,"&lt;a href=&quot;http://twitter.com/&quot;&gt;web&lt;/a&gt;"\n']
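In [5]:
import csv
# Drop NUL characters, which the csv module refuses to parse
clean_lines = (line.replace('\x00','') for line in chunk[1:])
# Split each record into columns, honoring double-quoted fields
lines = csv.reader(clean_lines, delimiter=',', quotechar='"')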
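
To see what such a reader produces, here is a small self-contained sketch on invented rows. It uses `csv.DictReader`, a swapped-in convenience and not what the notebook itself does, so that columns can be accessed by name.

```python
import csv

# Invented sample: a header line plus two records, one with a stray NUL character
sample = ['"Twitter ID",Text,"From User"\n',
          '1,"hello, world",alice\n',
          '2,"stray\x00byte",bob\n']

# Remove NUL characters lazily, line by line, before parsing
clean = (line.replace('\x00', '') for line in sample)

# DictReader takes the first line as the field names
rows = list(csv.DictReader(clean))
texts = [row['Text'] for row in rows]
users = [row['From User'] for row in rows]
```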

# 4. Counting

### Distribution of the number of users by number of tweets posted

In [39]:
from collections import defaultdict
data_dict = defaultdict(int)  # user name -> number of tweets
line_num = 0
lines = csv.reader((line.replace('\x00','') for line in chunk[1:]), delimiter=',', quotechar='"')
for i in lines:
    line_num += 1
    data_dict[i[8]] += 1  # i[8] is the user ("From User" column)
In [41]:
data_dict.items()[:5]
Out[41]:
[('Mike_EH_52', 1),
('neeirish', 1),
('heroofcamelot', 1),
('agentdarkapple', 1),
('KliphScurlock', 1)]
In [72]:
print line_num
2626
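
The same tally can be written with `collections.Counter`; this is only an alternative sketch on invented user names. A second `Counter` over the counts gives, for each number of tweets, how many users posted that many.

```python
from collections import Counter

# Invented sample: one entry per tweet, holding the name of the user who posted it
users = ['alice', 'bob', 'alice', 'carol', 'alice', 'bob']

# Number of tweets per user
tweets_per_user = Counter(users)

# Number of users per tweet count (the distribution plotted in this section)
users_per_count = Counter(tweets_per_user.values())

# The single most active user with their tweet count
most_active = tweets_per_user.most_common(1)
```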
In [48]:
%matplotlib inline
from matplotlib.font_manager import FontProperties
import matplotlib.pyplot as plt
import matplotlib
#matplotlib.rcParams['font.sans-serif'] = ['Microsoft YaHei'] # set the default font
matplotlib.rc("savefig", dpi=100)
font = FontProperties(fname=r'/Users/chengjun/github/cjc2016/data/msyh.ttf', size=14) # Note: change this path to your own copy of the font
In [49]:
plt.hist(data_dict.values())
#plt.yscale('log')
#plt.xscale('log')
plt.xlabel(u'发帖数', fontproperties=font)  # x label: "number of tweets posted"
plt.ylabel(u'人数', fontproperties=font)  # y label: "number of users"
plt.show()
In [50]:
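tweet_dict = defaultdict(int)  # number of tweets -> number of users with that many tweets
for i in data_dict.values():
    tweet_dict[i] += 1

plt.loglog(tweet_dict.keys(), tweet_dict.values(), 'ro',linewidth=2)
plt.xlabel(u'推特数', fontproperties=font)  # x label: "number of tweets"
plt.ylabel(u'人数', fontproperties=font)  # y label: "number of users"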
plt.show()
In [51]:
import numpy as np
import statsmodels.api as sm

def powerPlot(d_value, d_freq, color, marker):
    # Add 1 to every frequency so that empty bins do not produce log(0)
    d_freq = [i + 1 for i in d_freq]
    d_prob = [float(i)/sum(d_freq) for i in d_freq]
    #d_rank = ss.rankdata(d_value).astype(int)
    x = np.log(d_value)
    y = np.log(d_prob)
    # Design matrix with the intercept column first, then log(value)
    xx = sm.add_constant(x, prepend=True)
    res = sm.OLS(y,xx).fit()
    constant,beta = res.params
    r2 = res.rsquared
    plt.plot(d_value, d_prob, linestyle = '',  color = color, marker = marker)
    plt.plot(d_value, np.exp(constant+x*beta),"red")
    plt.xscale('log'); plt.yscale('log')
    plt.text(max(d_value)/2,max(d_prob)/10,
             r'$\beta$ = ' + str(round(beta,2)) +'\n' + r'$R^2$ = ' + str(round(r2, 2)))
In [52]:
histo, bin_edges = np.histogram(data_dict.values(), 15)
bin_center = 0.5*(bin_edges[1:] + bin_edges[:-1])
powerPlot(bin_center,histo, 'r', 'o')
#lg=plt.legend(labels = [u'Tweets', u'Fit'], loc=3, fontsize=20)
plt.ylabel(u'概率', fontproperties=font)  # y label: "probability"
plt.xlabel(u'推特数', fontproperties=font)  # x label: "number of tweets"
plt.show()
In [53]:
import statsmodels.api as sm
from collections import defaultdict
import numpy as np

def powerPlot(data):
    d = sorted(data, reverse = True )
    d_table = defaultdict(int)  # value -> how often it occurs
    for k in d:
        d_table[k] += 1
    d_value = sorted(d_table)
    # Look up the frequencies before shifting the values, so the keys still match
    d_freq = [d_table[i]+1 for i in d_value]
    d_value = [i+1 for i in d_value]
    d_prob = [float(i)/sum(d_freq) for i in d_freq]
    #d_rank = ss.rankdata(d_value).astype(int)
    x = np.log(d_value)
    y = np.log(d_prob)
    # Design matrix with the intercept column first, then log(value)
    xx = sm.add_constant(x, prepend=True)
    res = sm.OLS(y,xx).fit()
    constant,beta = res.params
    r2 = res.rsquared
    plt.plot(d_value, d_prob, 'ro')
    plt.plot(d_value, np.exp(constant+x*beta),"red")
    plt.xscale('log'); plt.yscale('log')
    plt.text(max(d_value)/2,max(d_prob)/5,
             'Beta = ' + str(round(beta,2)) +'\n' + 'R squared = ' + str(round(r2, 2)))
    plt.title('Distribution')
    plt.ylabel('P(K)')
    plt.xlabel('K')
    plt.show()

In [54]:
powerPlot(data_dict.values())
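
What the fit inside `powerPlot` does can be shown without plotting. If P(k) is proportional to k^beta, then log P(k) = constant + beta * log k, so beta is the slope of a straight line fitted in log-log space. The sketch below computes that slope with the closed-form least-squares formula in plain Python on synthetic data. `loglog_slope` is a hypothetical helper standing in for the `statsmodels` call, not the notebook's own code.

```python
import math

def loglog_slope(values, probs):
    """Return (constant, beta) of the least-squares line log(p) = constant + beta*log(v)."""
    x = [math.log(v) for v in values]
    y = [math.log(p) for p in probs]
    n = float(len(x))
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    beta = sxy / sxx
    constant = mean_y - beta * mean_x
    return constant, beta

# Synthetic data that follows a power law with exponent -2 exactly
ks = list(range(1, 51))
weights = [k ** -2.0 for k in ks]
total = sum(weights)
ps = [w / total for w in weights]

constant, beta = loglog_slope(ks, ps)
```

On real tweet counts the points scatter around the line, which is why `powerPlot` also reports R squared.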

# 5. Cleaning the tweet text

In [60]:
tweet = '''RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!!
#OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE @chengjun @mili http://computational-communication.com
http://ccc.nju.edu.cn RT !!HELP!!!!'''
In [61]:
import re

# If the installation does not work for you

## You can open a terminal in Spyder and install it from there

In [33]:
import re

tweet = '''RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!!
#OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE @chengjun @mili http://computational-communication.com
http://ccc.nju.edu.cn RT !!HELP!!!!'''

rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)
rt_user_name = rt_patterns.findall(tweet)[0][1].strip(' @')
rt_user_name
Out[33]:
'AnonKitsu'
In [35]:
import re

tweet = '''@AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!!
#OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE @chengjun @mili http://computational-communication.com
http://ccc.nju.edu.cn RT !!HELP!!!!'''

rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)
rt_user_name = rt_patterns.findall(tweet)
print rt_user_name

if rt_user_name:
    print 'it exists.'
else:
    print 'None'
[]
None
In [39]:
import re

def extract_rt_user(tweet):
    rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)
    rt_user_name = rt_patterns.findall(tweet)
    if rt_user_name:
        rt_user_name = rt_user_name[0][1].strip(' @')
    else:
        rt_user_name = None
    return rt_user_name
In [38]:
tweet = '''@AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!!
#OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE @chengjun @mili http://computational-communication.com
http://ccc.nju.edu.cn RT !!HELP!!!!'''

print extract_rt_user(tweet)
None

# Getting the cleaned tweet text

In [10]:
def extract_tweet_text(tweet, at_names, urls):
    for i in at_names:
        tweet = tweet.replace(i, '')
    for j in urls:
        tweet = tweet.replace(j, '')
    marks = ['RT @', '@', '&quot;', '#', '\n', '\t', '  ']
    for k in marks:
        tweet = tweet.replace(k, '')
    return tweet
In [31]:

tweet = '''RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!!
#OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE @chengjun @mili http://computational-communication.com
http://ccc.nju.edu.cn RT !!HELP!!!!'''

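# Assumption: `ex` is never defined in this export; it is presumably an
# Extractor from the twitter-text-py package (imported as twitter_text)
import twitter_text
ex = twitter_text.Extractor(tweet)
at_names = ex.extract_mentioned_screen_names()
urls = ex.extract_urls()
hashtags = ex.extract_hashtags()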
rt_user = extract_rt_user(tweet)
tweet_text = extract_tweet_text(tweet, at_names, urls)

print at_names, urls, hashtags, rt_user,'-------->', tweet_text
[u'AnonKitsu', u'chengjun', u'mili'] [u'http://computational-communication.com', u'http://ccc.nju.edu.cn'] [u'OCCUPYWALLSTREET', u'OWS', u'OCCUPYNY'] AnonKitsu --------> : ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!! OCCUPYWALLSTREET OWS OCCUPYNY PLEASE RT !!HELP!!!!
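
If a dedicated extractor library is not available, mentions, URLs and hashtags can be approximated with plain regular expressions. The sketch below is a rough stand-in under that assumption: the patterns and the function `extract_entities` are hypothetical, and they will miss edge cases that a real tweet parser handles (for example, e-mail addresses would be picked up as mentions).

```python
import re

MENTION_RE = re.compile(r'@(\w+)')
HASHTAG_RE = re.compile(r'#(\w+)')
URL_RE = re.compile(r'https?://\S+')

def extract_entities(tweet):
    """Return (mentions, urls, hashtags) found in a tweet, in order of appearance."""
    urls = URL_RE.findall(tweet)
    # Remove URLs first so that a '#' or '@' inside a link is not counted
    text = URL_RE.sub(' ', tweet)
    mentions = MENTION_RE.findall(text)
    hashtags = HASHTAG_RE.findall(text)
    return mentions, urls, hashtags

tweet = '''RT @AnonKitsu: ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!!
#OCCUPYWALLSTREET #OWS #OCCUPYNY PLEASE @chengjun @mili http://computational-communication.com
http://ccc.nju.edu.cn RT !!HELP!!!!'''

mentions, urls, hashtags = extract_entities(tweet)
```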
In [65]:
import csv

lines = csv.reader((line.replace('\x00','') for line in chunk[1:]), delimiter=',', quotechar='"')
tweets = [i[1] for i in lines]
In [66]:
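for tweet in tweets[:5]:
    # Assumption: `ex` is an Extractor from twitter-text-py (needs `import twitter_text`),
    # rebuilt for each tweet
    ex = twitter_text.Extractor(tweet)
    at_names = ex.extract_mentioned_screen_names()
    urls = ex.extract_urls()
    hashtags = ex.extract_hashtags()
    rt_user = extract_rt_user(tweet)
    tweet_text = extract_tweet_text(tweet, at_names, urls)

    print at_names, urls, hashtags, rt_user,
    print tweet_text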
[u'AnonKitsu'] [] [u'OCCUPYWALLSTREET', u'OWS', u'OCCUPYNY'] AnonKitsu : ALERT!!!!!!!!!!COPS ARE KETTLING PROTESTERS IN PARK W HELICOPTERS AND PADDYWAGONS!!!! OCCUPYWALLSTREET OWS OCCUPYNY PLEASE RT !!HELP!!!!
[u'jamiekilstein', u'allisonkilkenny'] [u'http://t.co/Fte55Kh7'] [u'ows'] None Interesting interview (never aired, wonder why??) by Fox with ows protester
[u'Seductivpancake'] [] [u'ows'] None  Right! Those guys have a victory condition: regime change. ows doesn't seem to have a goal I can figure out.
[u'bembel'] [u'http://j.mp/rhHavq'] [u'OccupyWallStreet', u'OWS'] bembel  Occupy Wall Street als linke Antwort auf die Tea Party?OccupyWallStreet OWS
[] [] [u'ows'] None ows White shirt= Brown shirt.

rt_user1, user1

rt_user2, user3

rt_user2, user4

...
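
The pairs listed above can be read as rows of an edge list: the retweeted user followed by the user who posted the retweet. Below is a minimal sketch of building such a list, assuming each record supplies the poster (the `From User` column) and the tweet text. The function `build_rt_edges` and the sample rows are hypothetical; `extract_rt_user` is restated compactly so that the block runs on its own.

```python
import re
from collections import Counter

RT_PATTERN = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)

def extract_rt_user(tweet):
    """Return the user name matched after RT/via, or None if nothing matches."""
    found = RT_PATTERN.findall(tweet)
    return found[0][1].strip(' @') if found else None

def build_rt_edges(rows):
    """rows: iterable of (from_user, text); returns a list of (rt_user, from_user)."""
    edges = []
    for from_user, text in rows:
        rt_user = extract_rt_user(text)
        if rt_user is not None:
            edges.append((rt_user, from_user))
    return edges

# Invented sample rows: (poster, tweet text)
rows = [('alice', 'RT @bob: hello #ows'),
        ('carol', 'nothing to see here'),
        ('dave', 'well said, via @bob')]

edges = build_rt_edges(rows)
# Count repeated pairs, which could serve as edge weights
edge_weights = Counter(edges)
```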
