Some words about the term project

  1. Data collection

    • [corpus] crawling web (evaluative) data
    • [lexicon] compiling (emotional) lexical data from various sources (dictionary, thesaurus, WordNet, etc.)
  2. Data pre-processing

    • [corpus] cleaning, tokenization/segmentation, POS tagging
    • [lexicon] munging
  3. Data annotation and pattern extraction

    • [corpus] ngram, collocates, lexical bundles
    • [lexicon] integrating patterns and scoring
  4. Data analysis

    • [corpus and lexicon] exploratory data analysis: descriptive statistics, plotting, and hypothesis/algorithm development
  5. Prediction/classification model

    • [corpus and lexicon] model training, testing and tuning
  6. Implementation and Report

    • [app] implementation, publishing and user feedback.

Processing corpus resources

  • We have introduced the NLTK way of working with corpora.
  • A case study of the CHILDES and TCCM corpora.
  • We now move on to the Web as Corpus (WaC) for our term project.

  • Review of the handout from week_4.2
  • A case study of the Taiwan Corpus of Child Mandarin (TCCM)

Web as Corpus (WaC)

  • Using the Web as a large linguistic data repository.
  • Scenarios and essential knowledge for WaC.
In [1]:
# Basic scenario
import matplotlib.pyplot as plt

img = plt.imread('pipeline.png')
plt.imshow(img)
Out[1]:
<matplotlib.image.AxesImage at 0x1115b1490>

Review of collecting textual data

Raw texts

  • raw texts from the local file(s)
  • raw texts from the web (known (pre-defined) URL(s)); see the sketch below
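
A minimal sketch of these two cases; the local file name is a placeholder, and the URL is just the example text used in the NLTK book:

# raw text from a local file (placeholder file name)
raw_local = open('mytext.txt').read()

# raw text from a known (pre-defined) URL
import urllib2
raw_web = urllib2.urlopen('http://www.gutenberg.org/files/2554/2554.txt').read()

print len(raw_local), len(raw_web)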

Web scraping and accessing various Web services via APIs

  • html texts from the web (known (pre-defined) URL(s))
  • URLs from any website (using BeautifulSoup; see the sketch after this list)
  • Search engine results (Google APIs)
  • RSS feeds
  • Meta-data and posts from social network APIs (Facebook, Twitter, Weibo, Plurk)
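
As a sketch of the BeautifulSoup case above (the target URL is only an example; any page works), we can pull every link out of a page:

import urllib2
from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3, the same package imported later in this handout

# fetch the page and parse the HTML
html = urllib2.urlopen('http://languagelog.ldc.upenn.edu/nll/').read()
soup = BeautifulSoup(html)

# collect the href attribute of every <a> tag on the page
links = [a.get('href') for a in soup.findAll('a') if a.get('href')]
print len(links), links[:5]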

Web crawling

  • Crawling voluminous data for (linguistic) purposes (using lingSpider)

In [3]:
import nltk, pprint
from __future__ import division
In [4]:
# your turn
  • Accessing Blogs via RSS feeds

    • RSS stands for Rich Site Summary and uses standard web feed formats to publish frequently updated information: blog entries, news headlines, audio, video.

    • An RSS document (called a "feed") includes full or summarized text and metadata such as the publishing date and the author's name.

    • The most commonly used elements in RSS feeds are "title", "link", "description", "publication date", and "entry ID". Less commonly used elements are "image", "categories", "enclosures", and "cloud".

With the help of a library called feedparser, you can access the content of a blog. Feedparser parses feeds in all known formats, including Atom, RSS, and RDF. Use pip (a tool for installing and managing Python packages) to install it:

(sudo) pip install feedparser

In [6]:
import feedparser
In [14]:
# Let's try the RSS feed of "languagelog"(http://languagelog.ldc.upenn.edu/nll/)
d = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
In [15]:
# The feed elements are available in d.feed (Remember the "RSS Elements" above); and the 
# items are available in d.entries, which is a list. You access items in the list in the 
# same order in which they appear in the original feed, so the first item is available 
# in d.entries[0].
d.entries[0]

# num. of entries
# print len(d['entries'])
Out[15]:
{'author': u'Mark Liberman',
 'author_detail': {'href': u'http://ling.upenn.edu/~myl',
  'name': u'Mark Liberman'},
 'authors': [{'href': u'http://ling.upenn.edu/~myl',
   'name': u'Mark Liberman'}],
 'content': [{'base': u'http://languagelog.ldc.upenn.edu/nll/?p=7997',
   'language': u'en',
   'type': u'text/html',
   'value': u'<p>In Meg Wilson\'s post on marmoset vs. human conversational turn-taking, \xa0I learned about Tanya Stievers et al., "<a href="http://www.pnas.org/content/106/26/10587.full" target="_blank">Universals and cultural variation in turn-taking in conversation</a>", PNAS 2009, which compared response offsets to polar ("yes-no") questions in 10 languages. Here\'s their plot of the data for English:</p>\n<p><img alt="" src="http://languagelog.ldc.upenn.edu/myl/EnglishGaps.jpg" /></p>\n<p>Based on examination of a Dutch corpus, they argue that "the use of question\u2013answer sequences is a reasonable proxy for turn-taking more generally"; and in their cross-language data, they found that "the response timings for each language, although slightly skewed to the right, have a unimodal distribution with a mode offset for each language between 0 and +200 ms, and an overall mode of 0 ms. The medians are also quite uniform, ranging from 0 ms (English, Japanese, Tzeltal, and Y\xe9l\xee-Dnye) to +300 ms (Danish, \u2021\u0100khoe Hai\u2016om, Lao) (overall cross-linguistic median +100 ms)."</p>\n<p><span id="more-7997"></span></p>\n<p>So for today\'s Breakfast Experiment\u2122, \xa0I decided to take a look at similar measurements from one of the standard speech-technology datasets, namely the\xa0<a href="http://www.isip.piconepress.com/projects/switchboard/" target="_blank">1/29/2003 release of the Mississippi State alignments of the Switchboard corpus</a>. For details on the corpus itself, see J.J. Godfrey et al., "<a href="http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=225858" target="_blank">SWITCHBOARD: telephone speech corpus for research and development</a>", IEEE ICASSP 1992). Here\'s a random selection from one of the conversations:</p>\n<p>The (hand-checked) alignments indicate the start and end of all words, noises, and silences, for each speaker in each conversation. I counted all cases in which a speaker starts talking after the other speaker has been talking, either starting after the other speaker has stopped (yielding a positive offset equal to the silent gap), or before the other speaker has stopped (yielding a negative offset equal to the amount of overlap).</p>\n<p>The result is a distribution in general agreement with Stievers et al. (although I\'m looking at all speaker changes, not just answers to polar questions):</p>\n<p>&nbsp;</p>\n<p><a href="http://languagelog.ldc.upenn.edu/myl/SWBpauses3.png"><img alt="" src="http://languagelog.ldc.upenn.edu/myl/SWBpauses3.png" title="Click to embiggen" width="490" /></a></p>\n<p>But the much larger dataset brings out some perhaps-interesting additional structure, especially an apparent increase in counts around -100 msec, 0 msec, and 100 msec (indicated by red vertical lines in the plot). This might be connected to the "periodic structure" postulated in <a href="http://languagelog.ldc.upenn.edu/myl/WilsonZimmerman1986.pdf" target="_blank">Wilson &amp; Zimmerman 1986</a> and <a href="https://docs.google.com/viewer?url=http%3A%2F%2Flanguagelog.ldc.upenn.edu%2Fmyl%2FWilsonWilson2005.pdf" target="_blank">Wilson &amp; Wilson 2005</a>, though they found conversation-specific differences in the time-structure suggesting that such effects should be washed out in a collective histogram of this sort.</p>\n<p>Since there\'s some demographic data available for the speakers in the SWB corpus, we can look at possible differences according to sex, age, years of education, geographical region, and so on. 
For this morning, I\'ll just take a lot at sex, and in particular whether there\'s any difference in speaker-change offsets between \xa0female/female and male/male conversations:</p>\n<p><a href="http://languagelog.ldc.upenn.edu/myl/SWB_sex_switches.png"><img alt="" src="http://languagelog.ldc.upenn.edu/myl/SWB_sex_switches.png" title="Click to embiggen" width="490" /></a></p>\n<p>It\'s clear from the plot that overall, interactions in the FF conversations have shorter offsets than in the MM conversations. (FWIW, the median is 130 msec for the males vs. 30 msec for the females.) \xa0As usual, this raises more questions: Is this a difference across all types of interaction? Or are things different for "back-channel" responses vs. question-answer pairs vs. substantive comments? And what happens in mixed-sex conversations?</p>\n<p>It might also be interesting to look at speaker age effects, regional effects, and so on. I\'ve run out of time this morning &#8212; but isn\'t it fun to be able to do an interesting empirical investigation in an hour or so? And isn\'t it too bad that there\'s not more communication between the disciplines centered on conversational analysis and the disciplines centered on speech technology?</p>\n<p>&nbsp;</p>\n<p>&nbsp;</p>'}],
 'guidislink': False,
 'href': u'http://ling.upenn.edu/~myl',
 'id': u'http://languagelog.ldc.upenn.edu/nll/?p=7997',
 'link': u'http://languagelog.ldc.upenn.edu/nll/?p=7997',
 'links': [{'href': u'http://languagelog.ldc.upenn.edu/nll/?p=7997',
   'rel': u'alternate',
   'type': u'text/html'},
  {'href': u'http://languagelog.ldc.upenn.edu/myl/sw02317.mp3',
   'length': u'630281',
   'rel': u'enclosure',
   'type': u'audio/mpeg'},
  {'count': u'0',
   'href': u'http://languagelog.ldc.upenn.edu/nll/?p=7997#comments',
   'rel': u'replies',
   'thr:count': u'0',
   'type': u'text/html'},
  {'count': u'0',
   'href': u'http://languagelog.ldc.upenn.edu/nll/?feed=atom&p=7997',
   'rel': u'replies',
   'thr:count': u'0',
   'type': u'application/atom+xml'}],
 'published': u'2013-10-22T13:33:11Z',
 'published_parsed': time.struct_time(tm_year=2013, tm_mon=10, tm_mday=22, tm_hour=13, tm_min=33, tm_sec=11, tm_wday=1, tm_yday=295, tm_isdst=0),
 'summary': u'In Meg Wilson\'s post on marmoset vs. human conversational turn-taking, \xa0I learned about Tanya Stievers et al., "Universals and cultural variation in turn-taking in conversation", PNAS 2009, which compared response offsets to polar ("yes-no") questions in 10 languages. Here\'s their plot of the data for English: Based on examination of a Dutch corpus, they argue [...]',
 'summary_detail': {'base': u'http://languagelog.ldc.upenn.edu/nll/wp-atom.php',
  'language': u'en',
  'type': u'text/html',
  'value': u'In Meg Wilson\'s post on marmoset vs. human conversational turn-taking, \xa0I learned about Tanya Stievers et al., "Universals and cultural variation in turn-taking in conversation", PNAS 2009, which compared response offsets to polar ("yes-no") questions in 10 languages. Here\'s their plot of the data for English: Based on examination of a Dutch corpus, they argue [...]'},
 'tags': [{'label': None,
   'scheme': u'http://languagelog.ldc.upenn.edu/nll',
   'term': u'Computational linguistics'}],
 u'thr_total': u'0',
 'title': u'Speaker-change offsets',
 'title_detail': {'base': u'http://languagelog.ldc.upenn.edu/nll/wp-atom.php',
  'language': u'en',
  'type': u'text/html',
  'value': u'Speaker-change offsets'},
 'updated': u'2013-10-22T14:07:49Z',
 'updated_parsed': time.struct_time(tm_year=2013, tm_mon=10, tm_mday=22, tm_hour=14, tm_min=7, tm_sec=49, tm_wday=1, tm_yday=295, tm_isdst=0)}
In [16]:
# Print the title of the feed
d['feed']['title']
Out[16]:
u'Language Log'
In [17]:
print "Fetched %s entries from '%s'" %(len(d.entries), d.feed.title)
Fetched 15 entries from 'Language Log'
In [18]:
post = d.entries[4]
post.title
Out[18]:
u'Prepositional identity'
In [19]:
content = post.content[0].value
print content
<p>From Tim Leonard:</p>
<p style="padding-left: 30px;"><span style="color: #000080;">I read </span><span style="text-decoration: underline;"><a href="http://2001italia.blogspot.ca/2013/10/2001-aliens-that-almost-were.html" target="_blank"><span style="color: #000080;">here</span></a></span><span style="color: #000080;"> that Arthur C. Clarke wrote in his diary, "… are virtually identical with us." I was surprised that he would use "identical with" rather than "identical to," since I find it ungrammatical. So I checked Google Ngram Viewer, and was delighted to discover that the preposition that goes with "identical" appears to be a previously fixed choice that's in the process of changing:</span></p>
<p><a href="http://languagelog.ldc.upenn.edu/myl/identicalto_with.png"><img alt="" src="http://languagelog.ldc.upenn.edu/myl/identicalto_with.png" title="Click to embiggen" width="490" /></a><br />
<span id="more-7973"></span></p>
<p>In contrast, <em>similar</em> has always selected for <em>to</em>:</p>
<p><a href="http://languagelog.ldc.upenn.edu/myl/similarto_with.png"><img alt="" src="http://languagelog.ldc.upenn.edu/myl/similarto_with.png" title="Click to embiggen" width="490" /></a></p>
<p>And likewise <em>equivalent</em>:</p>
<p><a href="http://languagelog.ldc.upenn.edu/myl/equivalentto_with.png"><img alt="" src="http://languagelog.ldc.upenn.edu/myl/equivalentto_with.png" title="Click to embiggen" width="490" /></a></p>
<p>As Geoff Pullum recent wrote ("<a href="http://languagelog.ldc.upenn.edu/nll/?p=7476" target="_blank">At Cologne</a>", 10/2/2013):</p>
<p style="padding-left: 30px;"><span style="color: #800000;">The problem is that each specific verb will have certain idiosyncratic demands regarding the particular prepositions it will accept as the head of its preposition-phrase complement. <em>Arrive</em> allows<em> at</em> or <em>in</em> (among others), but not (for example) <em>to</em> or <em>into</em>. And <em>Welcome</em> allows <em>to</em>, but not <em>at </em>or <em>in</em>.</span></p>
<p style="padding-left: 30px;"><span style="color: #800000;">You arrive <strong>at</strong> or <strong>in</strong> a place, not <strong>to</strong> a place, but you welcome someone <strong>to</strong> a place. That's just the way it is. Nobody promised you a rose garden: nobody guaranteed that languages would be easy or fair or logical or commonsensical. They are simply as they are. Deal with it.</span></p>
<p>And adjectives are no easier or fairer or more logical.</p>
<p>Also, the rules can change, sometimes quickly.</p>
<p>&nbsp;</p>
In [20]:
nltk.word_tokenize(nltk.clean_html(content))
Out[20]:
[u'From',
 u'Tim',
 u'Leonard',
 u':',
 u'I',
 u'read',
 u'here',
 u'that',
 u'Arthur',
 u'C.',
 u'Clarke',
 u'wrote',
 u'in',
 u'his',
 u'diary',
 u',',
 u'``',
 u'\u2026',
 u'are',
 u'virtually',
 u'identical',
 u'with',
 u'us.',
 u"''",
 u'I',
 u'was',
 u'surprised',
 u'that',
 u'he',
 u'would',
 u'use',
 u'``',
 u'identical',
 u'with',
 u"''",
 u'rather',
 u'than',
 u'``',
 u'identical',
 u'to',
 u',',
 u"''",
 u'since',
 u'I',
 u'find',
 u'it',
 u'ungrammatical.',
 u'So',
 u'I',
 u'checked',
 u'Google',
 u'Ngram',
 u'Viewer',
 u',',
 u'and',
 u'was',
 u'delighted',
 u'to',
 u'discover',
 u'that',
 u'the',
 u'preposition',
 u'that',
 u'goes',
 u'with',
 u'``',
 u'identical',
 u"''",
 u'appears',
 u'to',
 u'be',
 u'a',
 u'previously',
 u'fixed',
 u'choice',
 u'that',
 u"'s",
 u'in',
 u'the',
 u'process',
 u'of',
 u'changing',
 u':',
 u'In',
 u'contrast',
 u',',
 u'similar',
 u'has',
 u'always',
 u'selected',
 u'for',
 u'to',
 u':',
 u'And',
 u'likewise',
 u'equivalent',
 u':',
 u'As',
 u'Geoff',
 u'Pullum',
 u'recent',
 u'wrote',
 u'(',
 u'``',
 u'At',
 u'Cologne',
 u'``',
 u',',
 u'10/2/2013',
 u')',
 u':',
 u'The',
 u'problem',
 u'is',
 u'that',
 u'each',
 u'specific',
 u'verb',
 u'will',
 u'have',
 u'certain',
 u'idiosyncratic',
 u'demands',
 u'regarding',
 u'the',
 u'particular',
 u'prepositions',
 u'it',
 u'will',
 u'accept',
 u'as',
 u'the',
 u'head',
 u'of',
 u'its',
 u'preposition-phrase',
 u'complement.',
 u'Arrive',
 u'allows',
 u'at',
 u'or',
 u'in',
 u'(',
 u'among',
 u'others',
 u')',
 u',',
 u'but',
 u'not',
 u'(',
 u'for',
 u'example',
 u')',
 u'to',
 u'or',
 u'into',
 u'.',
 u'And',
 u'Welcome',
 u'allows',
 u'to',
 u',',
 u'but',
 u'not',
 u'at',
 u'or',
 u'in',
 u'.',
 u'You',
 u'arrive',
 u'at',
 u'or',
 u'in',
 u'a',
 u'place',
 u',',
 u'not',
 u'to',
 u'a',
 u'place',
 u',',
 u'but',
 u'you',
 u'welcome',
 u'someone',
 u'to',
 u'a',
 u'place.',
 u'That',
 u"'s",
 u'just',
 u'the',
 u'way',
 u'it',
 u'is.',
 u'Nobody',
 u'promised',
 u'you',
 u'a',
 u'rose',
 u'garden',
 u':',
 u'nobody',
 u'guaranteed',
 u'that',
 u'languages',
 u'would',
 u'be',
 u'easy',
 u'or',
 u'fair',
 u'or',
 u'logical',
 u'or',
 u'commonsensical.',
 u'They',
 u'are',
 u'simply',
 u'as',
 u'they',
 u'are.',
 u'Deal',
 u'with',
 u'it.',
 u'And',
 u'adjectives',
 u'are',
 u'no',
 u'easier',
 u'or',
 u'fairer',
 u'or',
 u'more',
 u'logical.',
 u'Also',
 u',',
 u'the',
 u'rules',
 u'can',
 u'change',
 u',',
 u'sometimes',
 u'quickly',
 u'.']
In [21]:
# Use a for loop to print all posts and their links.
for post in d.entries:
    print post.title + ": " + post.link + "\n"
Speaker-change offsets: http://languagelog.ldc.upenn.edu/nll/?p=7997

The Slants v. the USPTO: http://languagelog.ldc.upenn.edu/nll/?p=8001

Marmoset conversation: http://languagelog.ldc.upenn.edu/nll/?p=7989

Linguistic change on a short time scale: http://languagelog.ldc.upenn.edu/nll/?p=7980

Prepositional identity: http://languagelog.ldc.upenn.edu/nll/?p=7973

Bad Science: http://languagelog.ldc.upenn.edu/nll/?p=7952

An endless flawing stream of translation: http://languagelog.ldc.upenn.edu/nll/?p=7883

&quot;If someone has no intelligence&quot;: http://languagelog.ldc.upenn.edu/nll/?p=7904

Something in the water?: http://languagelog.ldc.upenn.edu/nll/?p=7888

Strategery: http://languagelog.ldc.upenn.edu/nll/?p=7893

Magic grinds the wound, bringing invalidity: http://languagelog.ldc.upenn.edu/nll/?p=7834

Indo-Egyptian mystery: http://languagelog.ldc.upenn.edu/nll/?p=7875

Stupid police investigation of racist language: http://languagelog.ldc.upenn.edu/nll/?p=7868

&quot;I been laying in this bed all night long&quot;: http://languagelog.ldc.upenn.edu/nll/?p=7861

Non-projective flavor: http://languagelog.ldc.upenn.edu/nll/?p=7851

  • Building a blog corpus: More advanced processing using NLTK
In [23]:
import os
import sys
import json
from BeautifulSoup import BeautifulStoneSoup
from nltk import clean_html
In [24]:
def cleanHtml(html):
    # strip tags with nltk.clean_html, then decode HTML entities via BeautifulStoneSoup
    return BeautifulStoneSoup(clean_html(html),
                convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]
In [25]:
blog_posts = []
for e in d.entries:
    blog_posts.append({'title': e.title, 'content': cleanHtml(e.content[0].value), 'link': e.links[0].href})

out_file = os.path.join('feed.json')
f = open(out_file, 'w')
f.write(json.dumps(blog_posts, indent=1))
f.close()

print 'Wrote output file to %s' % (f.name, )
Wrote output file to feed.json
In [26]:
nltk.download('stopwords')

BLOG_DATA = "feed.json"

blog_data = json.loads(open(BLOG_DATA).read())

# Customize your list of stopwords as needed. Here, we add common
# punctuation and contraction artifacts.

stop_words = nltk.corpus.stopwords.words('english') + [
    '.',
    ',',
    '--',
    '\'s',
    '?',
    ')',
    '(',
    ':',
    '\'',
    '\'re',
    '"',
    '-',
    '}',
    '{',
    u'—',
    ]
[nltk_data] Downloading package 'stopwords' to /Users/shukai
[nltk_data]     1/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
In [27]:
# Tokenize each post into sentences and lower-cased words; after the loop,
# sentences/words/fdist hold the values for the last post, which the stats below use.
for post in blog_data:
    sentences = nltk.tokenize.sent_tokenize(post['content'])
    words = [w.lower() for sentence in sentences
             for w in nltk.tokenize.word_tokenize(sentence)]
    fdist = nltk.FreqDist(words)
In [28]:
# Basic stats

num_words = sum([i[1] for i in fdist.items()])
num_unique_words = len(fdist.keys())

# Hapaxes are words that appear only once

num_hapaxes = len(fdist.hapaxes())

# fdist.items() is sorted by decreasing frequency in NLTK 2, so slicing gives the top 10
top_10_words_sans_stop_words = [w for w in fdist.items() if w[0] not in stop_words][:10]
In [29]:
print post['title']
print '\tNum Sentences:'.ljust(25), len(sentences)
print '\tNum Words:'.ljust(25), num_words
print '\tNum Unique Words:'.ljust(25), num_unique_words
print '\tNum Hapaxes:'.ljust(25), num_hapaxes
print '\tTop 10 Most Frequent Words (sans stop words):\n\t\t', \
        '\n\t\t'.join(['%s (%s)'
        % (w[0], w[1]) for w in top_10_words_sans_stop_words])
Non-projective flavor
	Num Sentences:           19
	Num Words:               541
	Num Unique Words:        236
	Num Hapaxes:             154
	Top 10 Most Frequent Words (sans stop words):
		`` (14)
		dependency (12)
		'' (10)
		graph (9)
		non-projective (9)
		example (6)
		sentence (6)
		parsing (4)
		word (4)
		edges (3)

[Exercise] Chinese Blog

In [33]:
FEED_URL = 'http://monofika.blogspot.com/feeds/posts/default' # change 'monofika'
In [34]:
k = feedparser.parse(FEED_URL)
In [35]:
print "Fetched %s entries from '%s'" %(len(k.entries), k.feed.title)
Fetched 25 entries from '還有一句話'
In [39]:
post = k.entries[4]
post.title
Out[39]:
u'20131017'
In [40]:
content = post.content[0].value
print content
FI:抱歉我沒有煩惱的事可以分享!<br /><br />KA:樓上需要煩惱的話可以無息貸給你。
In [41]:
print cleanHtml(content[:70])
FI:抱歉我沒有煩惱的事可以分享! KA:樓上需要煩惱的話可以無息貸給你。
In [42]:
# pip install jieba
import jieba
In [49]:
content_cleaned = cleanHtml(content[:70])
seg_list = jieba.cut(content_cleaned, cut_all=False)
In [50]:
for w in seg_list:
    print w
FI
:
抱歉
我
沒
有
煩惱
的
事
可以
分享
!
 
KA
:
樓上
需要
煩惱
的
話
可以
無息
貸給
你
。

[Exercise]

Use the feedparser library and NLTK to create a small corpus of blog posts on topics of your choice.
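
One possible starting point (a minimal sketch only; the feed list is a placeholder to replace with feeds on your chosen topics, and cleanHtml() is the helper defined above):

import feedparser, json, nltk

FEED_URLS = ['http://languagelog.ldc.upenn.edu/nll/?feed=atom']   # add your own feeds here

corpus = []
for url in FEED_URLS:
    d = feedparser.parse(url)
    for e in d.entries:
        # assumes Atom feeds that provide full content; fall back to e.summary otherwise
        text = cleanHtml(e.content[0].value)
        corpus.append({'title': e.title,
                       'link': e.links[0].href,
                       'tokens': nltk.word_tokenize(text)})

json.dump(corpus, open('blog_corpus.json', 'w'), indent=1)
print 'Saved %d posts to blog_corpus.json' % len(corpus)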


Processing lexical resources

  • We have learned the NLTK way of working with lexical resources that carry rich (hierarchical) information, such as WordNet and VerbNet.
  • We have also introduced how to process simple word lists (a quick reminder is sketched below).
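
A quick reminder of the two kinds of resources mentioned above (this assumes the wordnet and words corpora have already been fetched with nltk.download()):

from nltk.corpus import wordnet as wn, words

# rich (hierarchical) resource: WordNet synsets and their hypernyms
print wn.synsets('joy')
print wn.synset('joy.n.01').hypernyms()

# simple word list: the Unix words corpus as a plain Python set
wordlist = set(w.lower() for w in words.words())
print 'happiness' in wordlist, len(wordlist)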

Review

  • More on word lists and spelling checkers (a toy spelling checker is sketched below).
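
A toy illustration (a sketch, not a production spelling checker): use the NLTK words corpus as the dictionary and nltk.metrics.edit_distance to rank suggestions for an out-of-vocabulary token.

from nltk.corpus import words
from nltk.metrics import edit_distance

wordlist = words.words()

def suggest(token, n=5):
    # restrict to words starting with the same letter to keep the toy example fast
    candidates = [w for w in wordlist if w and w[0].lower() == token[0].lower()]
    return sorted(candidates, key=lambda w: edit_distance(token, w))[:n]

print suggest('langage')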

References

Word Frequency

  • Google Books Ngram
  • SUBTLEX-NL: a database of Dutch word frequencies based on 44 million words from film and television subtitles.

The Lexicon Project

  • Lexical decision data (a generic loading sketch follows this list):
    • The British Lexicon Project contains lexical decision data for over 28,000 monosyllabic and disyllabic English words, which you can find here.
    • The French Lexicon Project contains lexical decision times for over 38,000 French words, which you can find here.
    • The Dutch Lexicon Project contains lexical decision data for more than 14,000 Dutch words, which can be downloaded here. The online viewer allows you to get reaction times and accuracy scores for your stimuli, or to generate lists of stimuli on the basis of filters.
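
Both the frequency lists and the lexicon-project files above are essentially word-level tables. A generic loading sketch follows; the file name and the column order are assumptions, so check the documentation of whichever resource you download:

import codecs

scores = {}
for line in codecs.open('wordlist_with_scores.txt', encoding='utf8'):
    fields = line.strip().split('\t')
    # assume: first column = word form, second column = count or reaction time
    if len(fields) >= 2:
        try:
            scores[fields[0]] = float(fields[1])
        except ValueError:
            pass    # skip the header line and malformed rows

print len(scores)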

Homework

Read one of the papers from the BL, FL, and DL projects. Give your comments on how these resources can be used in an emotion classification task.