# Programming Tools for Data Science: Big Data


# everyone thinks everyone else is doing it, so everyone claims they are doing it.

--Dan Ariely of Duke University

# Cloud Computing

2006: AWS EC2 (cloud-based computing clusters)

# Map/Reduce

Dean and Ghemawat (Google), "MapReduce: Simplified Data Processing on Large Clusters," 2004

• word count
• network?
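As a rough illustration of the word count example, here is a minimal single-machine sketch of the map, shuffle, and reduce steps in plain Python. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample documents are hypothetical, chosen only for this sketch; a real MapReduce system distributes these steps across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key (the word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for a single word."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


# Hypothetical sample input, for illustration only
docs = ["big data is big", "data science uses data"]
counts = word_count(docs)
print(counts)
```

Because each document is mapped independently and each word is reduced independently, both steps could in principle run in parallel, which is the property that MapReduce exploits.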

# An Alternative to Hadoop: Spark with Python

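• Spark, from the University of California, Berkeley
• TensorFlow, from Google
• Dato GraphLab, from the University of Washington
• Petuum, from Carnegie Mellon University
• DMTK, from Microsoft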

# Giant Data Sets Are Around

In [7]:
from IPython.display import HTML
# Embed the webpage we would like to crawl as an iframe in the notebook
HTML('<iframe src="http://ccc.nju.edu.cn/newsmap/" width="1000" height="500"></iframe>')


### Big Data and whole data are not the same. Without taking into account the sample of a data set, the size of the data set is meaningless. For example, a researcher may seek to understand the topical frequency of tweets, yet if Twitter removes all tweets that contain problematic words or content – such as references to pornography or spam – from the stream, the topical frequency would be inaccurate. Regardless of the number of tweets, it is not a representative sample as the data is skewed from the beginning.

danah boyd and Kate Crawford, "Critical Questions for Big Data," Information, Communication & Society 15(5), 2012. http://www.tandfonline.com/doi/abs/10.1080/1369118X.2012.678878
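The sampling argument in the quotation above can be made concrete with a small sketch. All topic names and counts below are invented for illustration; they are not real Twitter data. The sketch assumes a platform that removes certain topics before the stream reaches the researcher, and shows that the observed topical frequency stays skewed however large the stream becomes.

```python
from collections import Counter


def topic_shares(topics):
    """Return the fraction of items that fall under each topic."""
    counts = Counter(topics)
    total = sum(counts.values())
    return {topic: count / total for topic, count in counts.items()}


# Hypothetical population of tweets, labeled by topic (made-up counts)
population = ["news"] * 300 + ["sports"] * 200 + ["spam"] * 400 + ["adult"] * 100

# Assumption: the platform drops these topics before the stream is published
removed_topics = {"spam", "adult"}
stream = [topic for topic in population if topic not in removed_topics]

true_shares = topic_shares(population)
observed_shares = topic_shares(stream)

# Making the data set 1000 times larger does not remove the skew
bigger_stream = [t for t in population * 1000 if t not in removed_topics]
bigger_shares = topic_shares(bigger_stream)

print("true share of news:    ", true_shares["news"])
print("observed share of news:", observed_shares["news"])
print("observed, 1000x data:  ", bigger_shares["news"])
```

In this toy setup, news makes up 30% of the full population but 60% of the filtered stream, and the larger stream gives the same 60%.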

Google Flu Trends: The Limits of Big Data (NYT)

Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science 343 (14 March): 1203-1205.

### The first lesson of Web-scale learning is to use available large-scale data rather than hoping for annotated data that isn’t available. For instance, we find that useful semantic relationships can be automatically learned from the statistics of search queries and the corresponding results-- or from the accumulated evidence of Web-based text patterns and formatted tables-- in both cases without needing any manually annotated data.

Halevy, Norvig, and Pereira, "The Unreasonable Effectiveness of Data," IEEE Intelligent Systems, 2009

# Type A: Analysis

• making sense of data
• very similar to a statistician

# Type B: Builders

• mainly interested in using data in production
• strong coders, possibly trained software engineers

# References
