News Headline Analysis

In this project we're analyzing news headlines written by two journalists – a finance reporter from the Business Insider, and a celebrity reporter from the Huffington post – to find similarities and differences between the ways that these authors write headlines for their news articles and blog posts. Our selected reporters are:

  • Akin Oyedele the Business Insider who covers market updates; and
  • Carly Ledbetter from the Huffington Post who mainly writes about celebrities.

Approach

We're initially going to collect and parse news headline from each of the authors, to obtain a parse tree, and then we're going to extract certain information from these parse trees that are indicative of the overall structure of the headline.

Next, we will define a simple sequence similarity metric to compare any pair of headlines quantitatively, and we will apply the same method to all of the headlines we've gathered for each author, to find out how similar each pair of headlines is.

Finally, we're going to use K-Means and tSNE to produce a visual map of all the headlines, where we can see the similarities and the differences between the two authors more clearly.

Data

For this project we've gathered 700 headlines for each author using the AYLIEN News API which we're going to analyze using Python. You can obtain the Pickled data files directly from the GitHub repository, or by using the data collection notebook that we've prepared for this project.

A primer on parse trees

In linguistics, a parse tree is a rooted tree that represents the syntactic structure of a sentence, according to some pre-defined grammar.

For a simple sentence like "The cat sat on the mat", a parse tree might look like this:

The cat sat on the mat

We're going to use the Pattern library for Python to parse the headlines and create parse trees for them:

In [36]:
from pattern.en import parsetree

Let's see an example:

In [37]:
s = parsetree('The cat sat on the mat.')
for sentence in s:
    for chunk in sentence.chunks:
        print chunk.type, [(w.string, w.type) for w in chunk.words]
NP [(u'The', u'DT'), (u'cat', u'NN')]
VP [(u'sat', u'VBD')]
PP [(u'on', u'IN')]
NP [(u'the', u'DT'), (u'mat', u'NN')]

Loading the data

Let's load the Pickled data file for the first author (Akin Oyedele) which contains 700 headlines, and let's see an example of what a headline might look like:

In [66]:
import cPickle as pickle
author1 = pickle.load( open( "author1.p", "rb" ) )
author1[0]
Out[66]:
{u'title': u"One corner of the real-estate market might've peaked"}

Parsing the data

Now that we have all the headlines for the first author loaded, we're going to analyze them, and create parse trees for each headline, and store them together with some basic information about the headline in the same object:

In [67]:
for story in author1:
    story["title_length"] = len(story["title"])
    story["title_chunks"] = [chunk.type for chunk in parsetree(story["title"])[0].chunks]
    story["title_chunks_length"] = len(story["title_chunks"])
In [40]:
author1[0]
Out[40]:
{u'title': u"One corner of the real-estate market might've peaked",
 'title_chunks': [u'NP', u'PP', u'NP', u'VP'],
 'title_chunks_length': 4,
 'title_length': 52}

Let's see what the numeric attributes for headlines written by this author look like. We're going to use Pandas for this.

In [41]:
import pandas as pd

df1 = pd.DataFrame.from_dict(author1)
In [42]:
df1.describe()
Out[42]:
title_chunks_length title_length
count 700.000000 700.000000
mean 5.691429 57.730000
std 3.762884 28.035283
min 1.000000 9.000000
25% 2.000000 35.000000
50% 5.000000 53.000000
75% 7.000000 77.000000
max 30.000000 188.000000

From this information, we're going to extract the chunk type sequence of each headline (i.e. the first level of the parse tree) and use it as an indicator of the overall structure of the headline. So in the above example, we would extract and use the following sequence of chunk types in our analysis:

['NP', 'PP', 'NP', 'VP']

Similarity

We have loaded all the headlines written by the first author, and created and stored their parse trees. Next, we need to find a similarity metric that given two chunk type sequences, tells us how similar these two headlines are, from a structural perspective.

For that we're going to use the SequenceMatcher class of difflib, which produces a similarity score between 0 and 1 for any two sequences (Python lists):

In [65]:
import difflib
print "Similarity scores for...\n"
print "Two identical sequences: ", difflib.SequenceMatcher(None,["A","B","C"],["A","B","C"]).ratio()
print "Two similar sequences: ", difflib.SequenceMatcher(None,["A","B","C"],["A","B","D"]).ratio()
print "Two completely different sequences: ", difflib.SequenceMatcher(None,["A","B","C"],["X","Y","Z"]).ratio()
Similarity scores for...

Two identical sequences:  1.0
Two similar sequences:  0.666666666667
Two completely different sequences:  0.0

Now let's see how that works with our chunk type sequences, for two randomly selected headlines from the first author:

In [68]:
v1 = author1[3]["title_chunks"]
v2 = author1[1]["title_chunks"]

print v1, v2, difflib.SequenceMatcher(None,v1,v2).ratio()
[u'NP', u'NP', u'VP', u'NP', u'NP', u'VP', u'PP'] [u'NP', u'VP', u'NP', u'PP', u'NP', u'NP'] 0.615384615385

Pair-wise similarity matrix for the headlines

We're now going to apply the same sequence similarity metric to all of our headlines, and create a 700x700 matrix of pairwise similarity scores between the headlines:

In [44]:
import numpy as np
chunks = [author["title_chunks"] for author in author1]
m = np.zeros((700,700))
for i, chunkx in enumerate(chunks):
    for j, chunky in enumerate(chunks):
        m[i][j] = difflib.SequenceMatcher(None,chunkx,chunky).ratio()

Visualization

To make things clearer and more understandable, let's try and put all the headlines written by the first author on a 2d scatter plot, where similarly structured headlines are close together.

For that we're going to first use tSNE to reduce the dimensionality of our similarity matrix from 700 down to 2:

In [45]:
from sklearn.manifold import TSNE
tsne_model = TSNE(n_components=2, verbose=1, random_state=0)
In [46]:
tsne = tsne_model.fit_transform(m)
[t-SNE] Computing pairwise distances...
[t-SNE] Computed conditional probabilities for sample 700 / 700
[t-SNE] Mean sigma: 0.000000
[t-SNE] Error after 83 iterations with early exaggeration: 13.379313
[t-SNE] Error after 144 iterations: 0.633875

And to a bit of color to our visualization, let's K-Means to identify 5 clusters of similar headlines, which we will use in our visualization:

In [47]:
from sklearn.cluster import MiniBatchKMeans

kmeans_model = MiniBatchKMeans(n_clusters=5, init='k-means++', n_init=1, 
                         init_size=1000, batch_size=1000, verbose=False, max_iter=1000)
kmeans = kmeans_model.fit(m)
kmeans_clusters = kmeans.predict(m)
kmeans_distances = kmeans.transform(m)

Finally let's plot the actual chart using Bokeh:

In [48]:
import bokeh.plotting as bp
from bokeh.models import HoverTool, BoxSelectTool
from bokeh.plotting import figure, show, output_notebook

colormap = np.array([
    "#1f77b4", "#aec7e8", "#ff7f0e", "#ffbb78", "#2ca02c", 
    "#98df8a", "#d62728", "#ff9896", "#9467bd", "#c5b0d5", 
    "#8c564b", "#c49c94", "#e377c2", "#f7b6d2", "#7f7f7f", 
    "#c7c7c7", "#bcbd22", "#dbdb8d", "#17becf", "#9edae5"
])

output_notebook()
plot_author1 = bp.figure(plot_width=900, plot_height=700, title="Author1",
    tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
    x_axis_type=None, y_axis_type=None, min_border=1)

plot_author1.scatter(x=tsne[:,0], y=tsne[:,1],
                    color=colormap[kmeans_clusters],
                    source=bp.ColumnDataSource({
                        "chunks": [x["title_chunks"] for x in author1], 
                        "title": [x["title"] for x in author1],
                        "cluster": kmeans_clusters
                    }))

hover = plot_author1.select(dict(type=HoverTool))
hover.tooltips={"chunks": "@chunks (title: \"@title\")", "cluster": "@cluster"}
show(plot_author1)
Loading BokehJS ...