#!/usr/bin/env python
# coding: utf-8

# # Analyze articles on Hacker News using NLP!

# ## 1. Introduction

# This notebook demonstrates the usage of the [`news-analyze`](https://github.com/jayantj/news-analyze) library, which makes use of topic modeling and clustering to extract topics and themes from a corpus of news articles. The key features are -
#
# 1. Extracting high quality, human-interpretable topics from a collection of articles
# 2. Visualizations of trends in topics over time
# 3. Automatically ranking topics by "interesting-ness"
# 4. Clustering topics into groups of related topics
# 5. Auto-tagging new unseen articles with topics
#
# The goal of the library is to provide a way to qualitatively explore topics and trends in a news corpus to gain insight into it.
#
# The notebook presents the usage of these features using a model trained on a year's worth of Hacker News data, which is present in the repo and directly usable. The library doesn't yet provide a documented API for training new models on your own data; this is a work in progress.
#
# This library was one of the things I worked on while I was part of the [Recurse Center](https://www.recurse.com/), a programmers' retreat for people from a variety of backgrounds and experience levels looking to get better at programming. You should check them out!
#
# A significant motivation behind this initial alpha release and demo is to get feedback about the following -
#
# 1. Specific applications and areas where this could be useful
# 2. Other datasets on which the library could be used
# 3. New features that could be helpful
# 4. Problems with existing features
# 5. Improvements to the API and usage docs

# ## 2. Data and preprocessing
#
# The data used for training the model is a collection of posts on [Hacker News](http://news.ycombinator.com/), available [here](https://www.kaggle.com/hacker-news/hacker-news-posts/data). The raw data contains 293119 posts from September 2015 to September 2016. A post here refers to an article that was posted to Hacker News, not the comments. The article text is not included, only the url, along with some metadata (time of post, number of points and comments received).
#
# First, any posts that received under 50 points were filtered out, in order to focus on links that received a fair amount of attention on HN; this left 20148 posts. Next, to extract the full text of these articles, the content from the urls was scraped and parsed using [newspaper](https://github.com/codelucas/newspaper), a Python library that extracts the full text of news articles from HTML. Content from some urls could not be extracted correctly in this process (mostly 404s), resulting in 15016 parsed articles.
#
# Topic models were trained on these using [Gensim](http://github.com/RaRe-Technologies/gensim), a Python library that has both native implementations of various topic modeling algorithms and wrappers to external topic modeling frameworks. The final model in the repository was trained using a wrapper to [Mallet](https://github.com/mimno/Mallet). [spaCy](https://github.com/explosion/spaCy) was used for tokenization and lemmatization. Tokens that were extremely frequent or extremely rare were filtered out. For more specific details, please have a look at [this file](https://github.com/jayantj/news-analyze/blob/master/models/utils.py).
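# To make the pipeline above concrete, here is a minimal sketch using the same building blocks - spaCy for lemmatization, a Gensim `Dictionary` with `filter_extremes`, and Gensim's Mallet wrapper. This is an illustration of the general approach, not the library's actual training code: `texts` and `mallet_path` are placeholders, and the parameter values are only indicative.

import spacy
from gensim.corpora import Dictionary
from gensim.models.wrappers import LdaMallet  # available in gensim < 4.0

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def preprocess(text):
    # Tokenize and lemmatize, keeping only alphabetic, non-stopword tokens
    return [tok.lemma_.lower() for tok in nlp(text)
            if tok.is_alpha and not tok.is_stop]

texts = []  # placeholder: the full article texts scraped with newspaper
tokenized = [preprocess(text) for text in texts]

# Drop extremely rare and extremely frequent tokens
dictionary = Dictionary(tokenized)
dictionary.filter_extremes(no_below=5, no_above=0.5)
corpus = [dictionary.doc2bow(doc) for doc in tokenized]

# Train an LDA model through Gensim's Mallet wrapper
# (mallet_path and hyperparameters are illustrative placeholders)
mallet_path = '/path/to/mallet/bin/mallet'
lda = LdaMallet(mallet_path, corpus=corpus, num_topics=100,
                id2word=dictionary, alpha=5)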
# ## 3. Demonstration

# The insights and use-cases presented in this section are on the dataset described above. I don't yet know how well these techniques can generalize to new datasets, and your mileage may vary. Also, the repository does not contain the original text scraped from the HN posts, as these are from a variety of websites, some of which might have terms and conditions that do not permit their data to be publicly released. As a result, the notebook might not be runnable on your local machine. I'm currently looking into how to work around this issue.

# ### Import required packages

# In[1]:

get_ipython().run_line_magic('cd', '..')

# In[2]:

get_ipython().run_line_magic('load_ext', 'autoreload')
get_ipython().run_line_magic('autoreload', '2')

# In[3]:

import os
import pickle

# In[4]:

from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

# In[5]:

get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib
matplotlib.rcParams['figure.figsize'] = [12, 8]

# ### Load trained model

# In[15]:

with open('data/models/hn_ldam_mallet_100t_5a', 'rb') as f:
    model = pickle.load(f)

# ### 3.1 Print all topics, ordered by "interesting-ness" scores

# This is the list of all topics that were extracted from the corpus, printed in human-readable form. Note that in the underlying model, each topic is a vector of scores over all words in the corpus. Here, only the top 10 words for each topic are displayed, for ease of reading and in order to get a sense of what each topic is about.
#
# The topics are ordered in decreasing order of "interesting-ness", which is described in a [later section](#topic-interestingness) in the notebook.

# In[53]:

model.print_topics_table()

# A topic here is NOT exactly the same as the commonly used interpretation of the word `topic`; it is simply a list of "related words". It is intended to represent a broad theme of interest, and doesn't carry a specific label attached to it.

# ### 3.2 Find articles for a specific topic

# This prints the articles that contain a specific topic (along with a snippet of their content), in decreasing order of the topic's score for the article, which is a measure of how central the topic was to the article. The top 5 articles are shown here for ease of reading.

# In[56]:

model.show_topic_articles(99, top_n=5)

# In[58]:

model.show_topic_articles(44, top_n=5)

# ### 3.3 Find topics for a given article

# #### 3.3.1 Article from the corpus

# This displays the topics that were extracted from a specific article in the corpus.

# In[59]:

model.show_article_topics(10577102)

# The last topic looks strange here - as it turns out, it is an unintended artifact of the data collection process. The `newspaper` library used to extract text from articles also picks up text from some of the advertisements and subscribe buttons on NYTimes articles. As a result, this set of words co-occurs with each other extremely frequently and co-occurs with other words much less frequently, and hence forms a very natural topic for topic modeling algorithms.

# #### 3.3.2 Finding topics for a new, unseen article

# In[1]:

url = "https://www.ligo.caltech.edu/news/ligo20170927"

# In[61]:

model.show_article_topics_from_url(url)
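# Under the hood, tagging an unseen article amounts to inferring a topic distribution for a new document with the trained model. Here is a minimal sketch of how this can be done with `newspaper` and Gensim, assuming the hypothetical `preprocess`, `dictionary` and `lda` objects from the sketch in section 2 - an illustration of the general approach, not the library's internals.

from newspaper import Article

def topics_for_url(url, top_n=5):
    # Download and parse the article text
    article = Article(url)
    article.download()
    article.parse()
    # Convert the text into the bag-of-words space of the trained model
    bow = dictionary.doc2bow(preprocess(article.text))
    # Infer the topic distribution and keep the strongest topics
    topic_scores = lda[bow]
    return sorted(topic_scores, key=lambda x: x[1], reverse=True)[:top_n]

topics_for_url("https://www.ligo.caltech.edu/news/ligo20170927")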
# ### 3.4 Plotting topic trends

# The popularity of topics can be plotted over time. Some cherry-picking for interesting results -

# In[62]:

iplot(model.topic_trend_plot(11))

# The topic contains the words `flight, fly, air, space, aircraft, launch` and sees a huge surge in popularity around March - May 2016. This was the time when SpaceX first successfully landed a rocket booster on a drone ship at sea. And of course, things related to Elon Musk have a tendency to be wildly popular on Hacker News :)
#
# A quick look at the articles for this topic agrees with this hypothesis -

# In[65]:

model.show_topic_articles(11, top_n=5)

# In[66]:

iplot(model.topic_trend_plot(35))

# This topic looks a little more strange. The words `uk london british people` seem fairly coherent, but the presence of words like `image copyright caption` is rather strange. It turns out to be another artifact of the data collection process - a number of the articles with the words `uk london british people` are from the BBC, and the text parser picks up image captions from the BBC site, which contain the words `image caption copyright` very frequently.
#
# As for the popularity trend for the topic, the topic seems fairly dormant most of the time, seeing a massive spike around June 2016. No prizes for guessing what this is due to -

# In[73]:

model.show_topic_articles(35, top_n=5)

# In[69]:

iplot(model.topic_trend_plot(44))

# This topic has a more interesting trend. Privacy and government surveillance have long been popular topics on Hacker News, and this is clear from the relatively high popularity values in comparison to the other topics plotted so far. As for the significant increase in popularity around February 2016, this corresponds to the San Bernardino case, when there was a large amount of debate on privacy and surveillance, centered around whether Apple, under pressure from the FBI, should or should not unlock an iPhone used by one of the shooters.
#
# There are also numerous other spikes in this graph, and it'd be interesting to look at them in more detail to see if they can be traced to specific events.

# ### 3.5 Topic Intersection

# Topics can be combined to find articles that are relevant to both topics. Here, we see that combining two separate topics consisting of the words `game player play move win` and `google computer technology machine human` gives us articles related to AlphaGo's success against the human Go champion, Lee Sedol.

# In[26]:

model.show_topic_articles([65, 66], top_n=5)

# ### 3.6 Similar topics

# Topics that are similar to a specific topic can be found using -

# In[27]:

model.show_similar_topics(44, top_n=5)

# ### 3.7 Topic Interesting-ness

# Certain topics occur more frequently in articles than others, but with lower scores. The hypothesis is that these topics are more common and generic, whereas interesting topics occur less frequently in articles, but with higher scores. Common and generic topics would have low scores frequently, indicating they are rarely the main focus of an article, whereas the opposite is true for interesting topics.
#
# Plotting the distribution of scores over all articles for two topics -

# In[76]:

topics_of_interest = [43, 95]

# In[75]:

model.print_topics_table(topics_of_interest)

# In[83]:

iplot(model.plot_topic_article_distribution(topics_of_interest))

# As expected, the histogram for topic #95 (`test, code, error, bug, problem`), a rather generic topic, at least for Hacker News content, is heavily concentrated at low scores, indicating the topic occurs with low scores very frequently in articles, and almost never with a high score. The histogram for topic #43 (`quantum, theory, physics, particle, universe`) is much flatter, indicating it is the main theme of an article much more often.

# Computing the median of scores across all articles seems like a decent mathematical way of capturing this intuition of "interesting-ness".
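# To make this concrete, here is a minimal sketch of the median-based ranking. The `doc_topic` array is a hypothetical stand-in for the per-article topic scores that the trained model provides; the random data below is purely for illustration.

import numpy as np

# Placeholder: a (num_articles x num_topics) array of per-article topic scores
doc_topic = np.random.dirichlet(np.ones(100) * 0.05, size=15016)

# Median score of each topic across all articles
median_scores = np.median(doc_topic, axis=0)

# Topic ids sorted by "interesting-ness", most interesting first
ranked_topics = np.argsort(median_scores)[::-1]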
# Sorting topics by the computed median scores in decreasing order, we get -

# In[84]:

model.print_topics_table()

# This seems to give reasonably good results. Specific, focused topics are at the top, whereas common, generic topics are at the bottom. It is possible that this metric of interesting-ness could be flawed for certain kinds of data, where either the notion of interesting-ness is different in the first place (it is a subjective notion, after all), or where the topic-article distribution is significantly different.

# ### 3.8 Topic Clusters

# The model has a notion of similarity between topics based on a few metrics. The two basic ideas are -
# 1. Two topics are similar if they have similar words
# 2. Two topics are similar if they co-occur frequently in articles
#
# The first captures the notion of lexical similarity, whereas the second captures the notion of relatedness.
#
# Plotting the topic similarity matrix for the `word_doc_sim` metric, which combines both (1) and (2) -

# In[86]:

model.plot_topic_similarities(metric='word_doc_sim')

# Looking at this matrix, it is possible to discern a couple of patterns -
# 1. Certain topics are similar to many of the other topics. These stand out as distinctly dark rows/columns in the above matrix.
# 2. Certain topics are similar to almost none of the other topics - stand-alone topics. These stand out as almost completely white rows/columns in the matrix above.
#
# Both these kinds of topics are ill-suited for clustering. Stand-alone topics should typically be clusters of their own, and it is difficult to assign a single cluster to common topics, as they are similar or related to many of the other topics. From a graph-theoretic point of view, these topics would be hub nodes - connected to many of the other topics, and not part of any single graph partition.
#
# So, the method to cluster topics provides options to exclude such topics from the clustering process. In addition, the model includes an internal method to determine cluster quality based on the [silhouette scores](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html) of its constituent nodes (a sketch of how such a quality measure could be computed appears at the end of this section). The printed clusters are sorted in decreasing order of cluster quality.

# In[34]:

model.cluster_topics(metric='word_doc_sim', exclude_common=True, exclude_standalone=True)

# In[35]:

model.print_topic_clusters()

# Plotting the similarity matrix for the clustered topics -

# In[107]:

model.plot_clustered_topic_similarities(metric='word_doc_sim', threshold_percentile=85)

# The common as well as standalone topics excluded from the clustering are at the end. Also note that the diagonal values (self-similarity) have been zeroed in the matrix above to allow for easier visualization.
#
# A number of the clusters look fairly reasonable, grouping together related topics. However, there is one large cluster at the end which contains a large number of disparate, wide-ranging topics - from the matrix, it is also evident that these topics are not very similar to most of the topics included in the clustering. This suggests that excluding certain topics from the clustering can introduce disadvantages of its own.
#
# There are also some other tricky aspects to clustering topics -
# 1. Choosing an appropriate number of clusters
# 2. Whether clustering topics is even suitable for the kind of data and topics you have
# 3. How to handle common as well as stand-alone topics while clustering
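# For a concrete picture of the clustering step, here is a minimal sketch of clustering topics from a precomputed similarity matrix and scoring clusters with silhouette scores, using scikit-learn. The random `similarities` matrix is a stand-in for the model's `word_doc_sim` matrix, and the algorithm choice is illustrative - this is one plausible approach, not necessarily the library's internals.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_samples

# Placeholder: a symmetric (num_topics x num_topics) similarity matrix
similarities = np.random.rand(100, 100)
similarities = (similarities + similarities.T) / 2
np.fill_diagonal(similarities, 1.0)

# Clustering algorithms expect distances, not similarities
distances = 1 - similarities

clustering = AgglomerativeClustering(
    n_clusters=10, affinity='precomputed', linkage='average')
labels = clustering.fit_predict(distances)

# Per-topic silhouette scores; a cluster's quality can then be taken
# as the mean score of its member topics
scores = silhouette_samples(distances, labels, metric='precomputed')
cluster_quality = {c: scores[labels == c].mean() for c in np.unique(labels)}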
# ## 4. Future Steps
#
# 1. A clearly defined, well-documented API to allow extracting topics from a user-supplied dataset
# 2. Interactive visualization that allows users to browse articles and topics in a single graph
# 3. Extracting topics hierarchically - being able to extract sub-topics from the articles associated with a particular topic, in order to focus on a more specific theme of interest
# 4. Linking spikes in topic popularity to events from news articles, in order to understand why the interest in a certain topic varied as it did
#
# In conclusion, I'd love to get more feedback about whether and how this could be useful. Please do get in touch if you have any ideas. Feel free to do so if you wish to talk about NLP more generally, too!