Analyze articles on Hacker News using NLP!

1. Introduction

This notebook demonstrates the news-analyze library, which uses topic modeling and clustering to extract topics and themes from a corpus of news articles. Its key features are:

  1. Extracting high quality, human-interpretable topics from a collection of articles
  2. Visualizations of trends in topics over time
  3. Automatically ranking topics by "interesting-ness"
  4. Clustering topics into groups of related topics
  5. Auto-tagging new unseen articles with topics

The goal of the library is to provide a way to qualitatively explore topics and trends in a news corpus to gain insight into it.

The notebook presents these features using a model trained on a year's worth of Hacker News data, which is present in the repo and directly usable. The library doesn't yet provide a documented API for training new models on your own data; this is a work in progress.

This library was one of the things I worked on while I was part of the Recurse Center, a programmer's retreat for people from a variety of backgrounds and experience levels looking to get better at programming. You should check them out!

A significant motivation behind this initial alpha release and demo is to get feedback on the following:

  1. Specific applications and areas where this could be useful
  2. Other datasets on which the library could be used
  3. New features that could be helpful
  4. Problems with existing features
  5. Improvements to the API and usage docs

2. Data and preprocessing

The data used for training the model is a collection of posts on Hacker News, available here. The raw data contains 293119 posts from September 2015 to September 2016. A post here refers to an article that was posted to Hacker News, not the comments on it. The article text is not included, only the URL along with some metadata (time of post, number of points and comments received).

First, articles that received fewer than 50 points were filtered out, in order to focus on links that received a fair amount of attention on HN. This left 20148 posts. Next, to extract the full text of these articles, the content at each URL was scraped and parsed using newspaper, a Python library for extracting the full text of news articles from HTML. Content from some URLs could not be extracted correctly (mostly 404s), which left 15016 parsed articles.
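
The points filter described above can be sketched in a few lines. This is only an illustration, not the code used to build the dataset: the record layout and the field names ("id", "url", "points") are assumptions, and the sample posts are made up.

```python
# Hypothetical post records; the field names are assumptions for
# illustration, not the actual schema of the Hacker News dataset.
MIN_POINTS = 50  # threshold stated in the text


def filter_by_points(posts, min_points=MIN_POINTS):
    """Keep only posts that received at least `min_points` points."""
    return [post for post in posts if post["points"] >= min_points]


sample_posts = [
    {"id": 1, "url": "http://example.com/a", "points": 120},
    {"id": 2, "url": "http://example.com/b", "points": 49},
    {"id": 3, "url": "http://example.com/c", "points": 50},
]

kept = filter_by_points(sample_posts)
print([post["id"] for post in kept])  # ids of the posts that survive the filter
```

A post with exactly 50 points is kept, since only posts under 50 points are dropped.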

Topic models were trained on these articles using Gensim, a Python library that has both native implementations of various topic modeling algorithms and wrappers for external topic modeling frameworks. The final model in the repository was trained using a wrapper for MALLET. spaCy was used for tokenization and lemmatization. Tokens that were extremely frequent or extremely rare were filtered out. For more specific details, please have a look at this file.
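
To make the frequency filter concrete, here is a minimal sketch that drops tokens by document frequency. The thresholds (`min_docs`, `max_doc_fraction`) and the function name are assumptions chosen for the toy example; the actual cut-offs used for the model are not stated here. Gensim's `Dictionary.filter_extremes` performs a comparable filter, though whether the library relies on it is not confirmed in this notebook.

```python
from collections import Counter


def filter_extreme_tokens(docs, min_docs=2, max_doc_fraction=0.5):
    """Drop tokens that appear in fewer than `min_docs` documents or in
    more than `max_doc_fraction` of all documents (assumed thresholds)."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    n_docs = len(docs)
    keep = {
        token
        for token, df in doc_freq.items()
        if df >= min_docs and df / n_docs <= max_doc_fraction
    }
    return [[token for token in tokens if token in keep] for tokens in docs]


# Toy, already-lemmatized documents (made up for illustration).
docs = [
    ["the", "neural", "network", "model"],
    ["the", "docker", "container", "network"],
    ["the", "quantum", "physics"],
    ["the", "docker", "image", "quantum"],
]

filtered = filter_extreme_tokens(docs)
print(filtered)
```

In this toy corpus "the" is removed for being in every document, and words that occur in a single document are removed for being too rare.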

3. Demonstration

The insights and use-cases presented in this section are on the dataset described above. I don't yet know how well these techniques can generalize to new datasets, and your mileage may vary. Also, the repository does not contain the original text scraped from the HN posts as these are from a variety of websites, some of which might have terms and conditions that do not permit their data to be publicly released. As a result, the notebook might not be runnable on your local machine. I'm currently looking into how to work around this issue.

Import required packages

In [1]:
%cd ..
/home/jayant/Projects/recurse/hn_analyze
In [2]:
%load_ext autoreload
%autoreload 2
In [3]:
import os
import pickle
In [4]:
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
In [5]:
%matplotlib inline
import matplotlib
matplotlib.rcParams['figure.figsize'] = [12, 8]

Load trained model

In [15]:
with open('data/models/hn_ldam_mallet_100t_5a', 'rb') as model_file:
    model = pickle.load(model_file)

3.1 Print all topics, ordered by "interesting-ness" scores

This is the list of all topics that were extracted from the corpus, printed in human-readable form. Note that in the underlying model, each topic is a vector of scores over all words in the corpus. Here, only the top 10 words for each topic are displayed, for ease of reading and in order to get a sense of what each topic is about.

The topics are listed in decreasing order of "interesting-ness", which is described in a later section of the notebook.
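
As a rough illustration of how a topic's score vector is reduced to its top words, consider the sketch below. It is not the library's implementation: the `top_words` helper is hypothetical, and the scores in `toy_topic` are made up.

```python
import heapq


def top_words(topic_scores, n=10):
    """Return the `n` highest-scoring words of a topic, best first.

    `topic_scores` maps each word to its score under the topic.
    """
    return heapq.nlargest(n, topic_scores, key=topic_scores.get)


# Made-up scores for a handful of words; a real topic has one score
# for every word in the vocabulary.
toy_topic = {
    "network": 0.09,
    "model": 0.07,
    "learning": 0.06,
    "neural": 0.05,
    "banana": 0.001,
}

print(top_words(toy_topic, n=3))
```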

In [53]:
model.print_topics_table()
   Topic #99      Topic #29      Topic #38      Topic #43      Topic #56      Topic #70   
  ----------     ----------     ----------     ----------     ----------     ----------   
    network         earth        container       quantum        bitcoin          car      
     model          space         docker         theory       transaction      vehicle    
   learning         star            run          physics      blockchain        drive     
    neural         planet         service       particle        network         tesla     
     learn          orbit          image        universe        wright          road      
    machine         moon        application     physicist        block          bike      
     deep           year          deploy          wave         ethereum        driver     
   training         mars          cluster         field          trust          model     
     layer         galaxy         machine         hole         currency       electric    
     image        telescope        host           state        exchange         wheel     


   Topic #71      Topic #11      Topic #94      Topic #2       Topic #12      Topic #13   
  ----------     ----------     ----------     ----------     ----------     ----------   
    animal         flight          cell            key           food           stack     
     human           fly           gene        certificate        eat        instruction  
    specie           air            dna         security          fat         register    
      dog           space          human       encryption        sugar         address    
     bird         aircraft        genome         encrypt         diet           code      
      cat          launch         genetic       password         meat           call      
     year           plane         protein        secure          fruit         memory     
     tree           drone          mouse         secret           egg           byte      
     live           pilot         cancer         public         farmer         program    
     find          rocket        bacteria          tls           grow         function    


   Topic #75      Topic #67      Topic #30      Topic #31      Topic #92      Topic #62   
  ----------     ----------     ----------     ----------     ----------     ----------   
   component         al            stock           git          police          drug      
     react          state           tax          github          crime         patient    
   function        attack         market       repository       officer        health     
      var           group         company        commit          drug          medical    
    element      government        fund          branch         prison         disease    
    return         country         share         change        criminal        doctor     
     state         islamic       investor         code          arrest         cancer     
    render        terrorist      financial        merge          year         treatment   
      dom           saudi          bank          request         call           death     
    import          iran           price         project          law           year      


   Topic #44      Topic #52      Topic #97      Topic #77      Topic #16      Topic #39   
  ----------     ----------     ----------     ----------     ----------     ----------   
  government         pi            phone         memory          node          company    
    agency          board         network          cpu          system         startup    
   security          usb         internet         core           read          founder    
      nsa           chip           radio          intel          state        investor    
 surveillance       power         signal       performance       write          tech      
      fbi         hardware        mobile          cache         cluster        valley     
    snowden         card          device        processor     distribute        start     
 intelligence        km           channel         chip          latency        silicon    
   document        device         service          op           message       business    
  information    controller         fi             gpu         operation        money     


   Topic #93      Topic #86      Topic #27      Topic #50      Topic #91      Topic #10   
  ----------     ----------     ----------     ----------     ----------     ----------   
    energy         service         file         security         code           light     
     power          datum         command        attack        compiler         laser     
     solar           aws            run       vulnerability      rust         electron    
     cost           cloud         install        exploit        compile         field     
    battery       instance        script         hacker        function        energy     
     year          server          build        password     optimization       high      
      gas        application     directory      attacker        library        fusion     
     plant           run          default         hack           call          charge     
     fuel          storage        package         find          memory         produce    
      oil          system           set          system       performance        ray      


   Topic #48      Topic #36      Topic #21      Topic #84      Topic #87      Topic #0    
  ----------     ----------     ----------     ----------     ----------     ----------   
   database         ship           sleep         device          city          upgrade    
     query           sea            day           phone           san            fix      
     datum          water          hour          camera          area           close     
     table          year         exercise        battery        street           add      
     index          ocean         mental         laptop         housing          al       
      row           find          people           vr            york            doc      
    column         island         health         screen        francisco       rebuild    
      sql           river       depression     smartphone        home           david     
     data           land           feel           home         building        update     
    select          site          stress        hardware        people         michael    


   Topic #20      Topic #22      Topic #25      Topic #80      Topic #85      Topic #33   
  ----------     ----------     ----------     ----------     ----------     ----------   
      int          number          uber           brain        facebook         music     
    return          point         amazon          study           ad            video     
   function        matrix         driver        cognitive       twitter         sound     
     const        algorithm       service        memory          user           audio     
     void         function         trip          neuron          post           play      
    struct         vector         airbnb       participant       site           song      
     null           prime          ride          effect         people         stream     
     type           graph          lyft            al            news          record     
   template         line           taxi          ability        content         note      
     char           curve          city       intelligence      medium         listen     


   Topic #58      Topic #34      Topic #63      Topic #73      Topic #28      Topic #57   
  ----------     ----------     ----------     ----------     ----------     ----------   
    thread           src           image           law            war          student    
    process         llvm           color          court        military        school     
     lock           tool           pixel          case          weapon         college    
     call            gnu            map           legal         soviet          learn     
     event          clang          frame          rule          nuclear      university   
     queue         module          draw           state         russian         teach     
     task          include          red          lawyer          force        education   
      run           patch          light       government       missile         class     
     wait         solution        render          judge          bomb           high      
   function        problem         blue           order        american        teacher    


   Topic #74      Topic #1       Topic #41      Topic #65      Topic #14      Topic #96   
  ----------     ----------     ----------     ----------     ----------     ----------   
     type           team          people          game          windows        license    
   function        people        economic        player          linux        software    
    haskell          job           money          play          system        copyright   
   language        company          dao           move          kernel         patent     
     monad        interview      contract          win         microsoft        free      
    return          hire          social          world           os           oracle     
    define        engineer        income          chess           run          include    
     list         employee         rich         computer         boot          source     
    lambda         manager        wealth          level          user           term      
    promise          day        inequality        sport         driver          copy      


   Topic #46      Topic #40      Topic #26      Topic #7       Topic #49      Topic #82   
  ----------     ----------     ----------     ----------     ----------     ----------   
    percent       function        server          water           app            web      
     year          string         network          air          android         page      
      job          return       connection     temperature      google         browser    
    worker        variable        packet          flow           user           site      
     rate           code            ip           surface         apps          content    
    income          match         client          heat           swift         website    
     high        expression         tcp          bridge         mobile          user      
      low           list          address       material          ios        javascript   
   increase          def         protocol        oxygen           add           html      
    growth          call           send         chemical       developer       chrome     


   Topic #23      Topic #79      Topic #15      Topic #8       Topic #19      Topic #24   
  ----------     ----------     ----------     ----------     ----------     ----------   
    support         bank          request        python          film           book      
    release         money         server         library         show           write     
    version         card          client          code            art          century    
    feature        account         http         language         movie          world     
    change         credit        response         java          artist         history    
      add            pay        application       ruby          netflix         great     
      fix          payment          url        javascript        star            man      
    update          cash            api         framework        world         modern     
    include      transaction      service         read            le            year      
     issue         number         header          write         disney          life      


   Topic #64      Topic #6       Topic #60      Topic #5       Topic #32      Topic #45   
  ----------     ----------     ----------     ----------     ----------     ----------   
     type          problem         email          word          project         datum     
    object         machine        message         book           open          memory     
     class         theory          send         language        source          byte      
    method         number           tor           text         developer        file      
    string      mathematical      address         read           build           bit      
   function       computer        account        english       community        size      
     code        mathematic        mail         document         tool           hash      
    public          proof         domain        character     development        key      
     call       mathematician     contact        letter          team            set      
    return        question         user           write        software        buffer     


   Topic #42      Topic #54      Topic #81      Topic #61      Topic #66      Topic #76   
  ----------     ----------     ----------     ----------     ----------     ----------   
    company        design        language         text          google          china     
     year           build         program         mode         computer        country    
   employee         wall           code          window       technology       chinese    
    million       building      programming      screen         machine         world     
   business         part           write          line           human         united     
   executive        small       programmer       editor         system          india     
    billion         room         software        button          world         states     
    accord          shape         system          click           ai         government   
     firm         material       computer        display         year           north     
      ceo          create         design           key           robot        american    


   Topic #9       Topic #17      Topic #90      Topic #3       Topic #53      Topic #47   
  ----------     ----------     ----------     ----------     ----------     ----------   
     apple          datum           api           group          child         product    
     font          result           bot          public          woman        customer    
    iphone         number         direct        political         age         business    
      mac          average         slack          state           man          service    
    design          model          total         member          study         revenue    
     phone        analysis          sun          policy          group          share     
     size          sample           avg         campaign        parent         growth     
    device          show            sat         president        male         platform    
   software     distribution        sms           party          adult         result     
      ios          measure       anonymous     government         sex         software    


   Topic #37      Topic #69      Topic #35      Topic #98      Topic #83      Topic #78   
  ----------     ----------     ----------     ----------     ----------     ----------   
     life           story          image          datum          yahoo         package    
    family        continue          uk            user        restaurant        full      
     year           read          london       information      coffee         debian     
    friend      advertisement     caption         data            bar           text      
      day           main            mr           privacy         house         subject    
    people          times        copyright       access          food           link      
     live        newsletter       british        service         drink          send      
     home           sign           japan        internet         mayer          mbox      
      man           york           year          company         chef          mozilla    
     house        subscribe       people         provide         club           date      


   Topic #4       Topic #68      Topic #18      Topic #89      Topic #51      Topic #95   
  ----------     ----------     ----------     ----------     ----------     ----------   
    people           day           thing       university        price          test      
     thing          drive         people        research         sell           code      
     feel            ms             lot          science        company         error     
     fact            bob           start          paper         market           bug      
     human          year           year        researcher         buy          problem    
     point         august           big           study        business          fix      
     world          store         problem       scientist       product         check     
   question         july           back          publish          pay           fail      
    person          hour          happen         journal         cost           issue     
      bad           april          talk        scientific        sale            run      


   Topic #55      Topic #88      Topic #72      Topic #59   
  ----------     ----------     ----------     ----------   
      day          country         thing         system     
     back         european         find          problem    
     hand          europe          post          change     
      run            de            give          require    
     head          french          write         design     
      sit          france          start        approach    
     begin         germany         read           large     
     walk          german          point          level     
     hour           world         article        process    
      man           paris           ne           provide    


A topic here is NOT exactly the same as the everyday meaning of the word "topic"; it is simply a list of "related words". It is intended to represent a broad theme of interest, and doesn't carry a specific label.

3.2 Find articles for a specific topic

This prints the articles (along with a snippet of their content) that contain a specific topic, sorted in decreasing order of the article's topic score, which is a measure of how central the topic is to the article. Only the top 5 articles are shown here for ease of reading.
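
The ranking idea can be sketched independently of the library. The helper name, the data layout (a mapping from article id to per-topic scores) and the toy values below are all assumptions for illustration; the library's internal representation may differ.

```python
def top_articles_for_topic(article_topic_scores, topic_id, top_n=5):
    """Return (article_id, score) pairs for `topic_id`, highest score first."""
    scored = [
        (article_id, scores[topic_id])
        for article_id, scores in article_topic_scores.items()
        if topic_id in scores  # skip articles that do not contain the topic
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy article -> {topic id: score} mapping (made-up ids and scores).
toy_scores = {
    "a1": {99: 0.83, 44: 0.02},
    "a2": {44: 0.78},
    "a3": {99: 0.40, 38: 0.35},
    "a4": {99: 0.74},
}

ranked = top_articles_for_topic(toy_scores, 99, top_n=2)
print(ranked)
```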

In [56]:
model.show_topic_articles(99, top_n=5)
   Topic #99   
  ----------   
    network    
     model     
   learning    
    neural     
     learn     
    machine    
     deep      
   training    
     layer     
     image     


---------------------------------------------------------------------
Article #11052034 - http://www.wildml.com/deep-learning-glossary/
Deep Learning Glossary
Topic score: 0.83

Article text:
 This glossary is work in progress and I am planning to continuously update it. If you find a mistake or think an important term is missing, please let me know in the comments or via email.

Deep Learning terminology can be quite overwhelming to newcomers. This glossary tries to define commonly used terms and link to original references and additional resources to help readers dive deeper into a specific topic.

The boundary between what is Deep Learning vs. “general” Machine Learning termino (...)(trimmed)


---------------------------------------------------------------------
Article #10384279 - http://blog.christianperone.com/2015/08/convolutional-neural-networks-and-feature-extraction-with-python/
Convolutional neural networks and feature extraction with Python
Topic score: 0.80

Article text:
 Convolutional neural networks (or ConvNets) are biologically-inspired variants of MLPs, they have different kinds of layers and each different layer works different than the usual MLP layers. If you are interested in learning more about ConvNets, a good course is the CS231n – Convolutional Neural Newtorks for Visual Recognition. The architecture of the CNNs are shown in the images below:

As you can see, the ConvNets works with 3D volumes and transformations of these 3D volumes. I won’t repe (...)(trimmed)


---------------------------------------------------------------------
Article #11840175 - https://github.com/rasbt/python-machine-learning-book/blob/master/faq/difference-deep-and-normal-learning.md
What is the difference between deep learning and usual machine learning?
Topic score: 0.74

Article text:
 What is the difference between deep learning and usual machine learning?

That's an interesting question, and I try to answer this in a very general way.

In essence, deep learning offers a set of techniques and algorithms that help us to parameterize deep neural network structures -- artificial neural networks with many hidden layers and parameters. One of the key ideas behind deep learning is to extract high level features from the given dataset. Thereby, deep learning aims to overcome the cha (...)(trimmed)


---------------------------------------------------------------------
Article #12196388 - https://github.com/karandesai-96/digit-classifier
MNIST Handwritten Digit Classifier  beginner neural network project
Topic score: 0.73

Article text:
 MNIST Handwritten Digit Classifier

An implementation of multilayer neural network using numpy library. The implementation is a modified version of Michael Nielsen's implementation in Neural Networks and Deep Learning book.

Brief Background:

If you are familiar with basics of Neural Networks, feel free to skip this section. For total beginners who landed up here before reading anything about Neural Networks:

Neural networks are made up of building blocks known as Sigmoid Neurons . These are n (...)(trimmed)


---------------------------------------------------------------------
Article #11701665 - http://blog.keras.io/building-autoencoders-in-keras.html
Building autoencoders in Keras
Topic score: 0.72

Article text:
 Sat 14 May 2016 In Tutorials.

In this tutorial, we will answer some common questions about autoencoders, and we will cover code examples of the following models:

a simple autoencoder based on a fully-connected layer

a sparse autoencoder

a deep fully-connected autoencoder

a deep convolutional autoencoder

an image denoising model

a sequence-to-sequence autoencoder

a variational autoencoder

Note: all code examples have been updated to the Keras 2.0 API on March 14, 2017. You will need Kera (...)(trimmed)


In [58]:
model.show_topic_articles(44, top_n=5)
   Topic #44   
  ----------   
  government   
    agency     
   security    
      nsa      
 surveillance  
      fbi      
    snowden    
 intelligence  
   document    
  information  


---------------------------------------------------------------------
Article #10304864 - https://edwardsnowden.com/
Edwardsnowden.com
Topic score: 0.78

Article text:
 Who Is Edward Snowden?

Edward Snowden is a 31 year old US citizen, former Intelligence Community officer and whistleblower. The documents he revealed provided a vital public window into the NSA and its international intelligence partners’ secret mass surveillance programs and capabilities. These revelations generated unprecedented attention around the world on privacy intrusions and digital security, leading to a global debate on the issue.

Snowden worked in various roles within the US Intel (...)(trimmed)


---------------------------------------------------------------------
Article #11748746 - http://www.theguardian.com/us-news/2016/may/22/snowden-whistleblower-protections-john-crane
Snowden calls for whistleblower shield after claims by new Pentagon source
Topic score: 0.69

Article text:
 Accusations that Pentagon retaliated against a whistleblower undermine argument that there were options for Snowden other than leaking to the media

Edward Snowden has called for a complete overhaul of US whistleblower protections after a new source from deep inside the Pentagon came forward with a startling account of how the system became a “trap” for those seeking to expose wrongdoing.



The account of John Crane, a former senior Pentagon investigator, appears to undermine Barack Obama,  (...)(trimmed)


---------------------------------------------------------------------
Article #10615250 - https://www.washingtonpost.com/news/the-switch/wp/2015/11/20/why-its-so-hard-to-keep-up-with-how-the-u-s-government-is-spying-on-its-own-people/
Why its so hard to keep up with how the U.S. gov't is spying on its own people
Topic score: 0.68

Article text:
 Since 2013, Americans have gained immense insight about how the government conducts digital spying programs, largely thanks to the revelations made by former security contractor Edward Snowden. But a new report shows it's really hard to keep track of all the ways the United States is snooping on its own people.

After Snowden revealed the National Security Agency was collecting data en masse about American e-mails, the government said it had ended that particular program in 2011.

But it turns o (...)(trimmed)


---------------------------------------------------------------------
Article #11837578 - https://news.vice.com/article/edward-snowden-leaks-tried-to-tell-nsa-about-surveillance-concerns-exclusive
Snowden Tried to Tell NSA About Surveillance Concerns, Documents Reveal
Topic score: 0.68

Article text:
 On the morning of May 29, 2014, an overcast Thursday in Washington, D.C., the general counsel of the Office of the Director of National Intelligence, Robert Litt, wrote an email to high-level officials at the National Security Agency and the White House.

The topic: what to do about Edward Snowden.

Snowden’s leaks had first come to light the previous June, when the Guardian’s Glenn Greenwald and the Washington Post’s Barton Gellman published stories based on highly classified documents pr (...)(trimmed)


---------------------------------------------------------------------
Article #11400686 - http://thehill.com/policy/national-security/274840-report-clinton-could-be-interviewed-by-fbi-within-days
Report: FBI moves to interview Clinton over emails
Topic score: 0.67

Article text:
 Hillary Clinton Hillary Rodham ClintonAssange meets U.S. congressman, vows to prove Russia did not leak him documents High-ranking FBI official leaves Russia probe OPINION | Steve Bannon is Trump's indispensable man — don't sacrifice him to the critics MORE and her top aides might be questioned by FBI officials about her private email server within the next few days, according to a new report from Al Jazeera America.

The news outlet reported that the FBI has concluded its examination of Clint (...)(trimmed)


3.3 Find topics for a given article

3.3.1 Article from the corpus

This displays the topics that were extracted from a specific article in the corpus.

In [59]:
model.show_article_topics(10577102)
---------------------------------------------------------------------
Article #10577102 - http://www.nytimes.com/2015/11/17/us/after-paris-attacks-cia-director-rekindles-debate-over-surveillance.html
After Paris Attacks, C.I.A. Director Rekindles Debate Over Surveillance
Article text:
 “As far as I know, there’s no evidence the French lacked some kind of surveillance authority that would have made a difference,” said Jameel Jaffer, deputy legal director of the American Civil Liberties Union. “When we’ve invested new powers in the government in response to events like the Paris attacks, they have often been abused.”

The debate over the proper limits on government dates to the origins of the United States, with periodic overreaching in the name of security being cur (...)(trimmed)


   Topic #44      Topic #67      Topic #69   
 Score (0.32)   Score (0.19)   Score (0.15)  
  ----------     ----------     ----------   
  government         al            story     
    agency          state        continue    
   security        attack          read      
      nsa           group      advertisement 
 surveillance    government        main      
      fbi          country         times     
    snowden        islamic      newsletter   
 intelligence     terrorist        sign      
   document         saudi          york      
  information       iran         subscribe   


The last topic looks strange here. As it turns out, it is an unintended artifact of the data collection process: the newspaper library used to extract text from articles also picks up text from some of the advertisements and subscribe buttons on NYTimes pages. As a result, these words co-occur with one another extremely frequently and with other words much less frequently, and so form a very natural topic for topic modeling algorithms.

3.3.2 Finding topics for a new, unseen article

In [1]:
url = "https://www.ligo.caltech.edu/news/ligo20170927"
In [61]:
model.show_article_topics_from_url(url)
Article: https://www.ligo.caltech.edu/news/ligo20170927
Article text:
 News Release • September 27, 2017

The LIGO Scientific Collaboration and the Virgo collaboration report the first joint detection of gravitational waves with both the LIGO and Virgo detectors. This is the fourth announced detection of a binary black hole system and the first significant gravitational-wave signal recorded by the Virgo detector, and highlights the scientific potential of a three-detector network of gravitational-wave detectors.

The three-detector observation was made on August 14 (...)(trimmed)

Most relevant topics:

   Topic #29      Topic #10      Topic #89   
 Score (0.31)   Score (0.15)   Score (0.13)  
  ----------     ----------     ----------   
     earth          light       university   
     space          laser        research    
     star         electron        science    
    planet          field          paper     
     orbit         energy       researcher   
     moon           high           study     
     year          fusion        scientist   
     mars          charge         publish    
    galaxy         produce        journal    
   telescope         ray        scientific   


3.4 Topic trends

The popularity of topics can be plotted over time. Here are a few cherry-picked topics with interesting results -

In [62]:
iplot(model.topic_trend_plot(11))
   Topic #11   
  ----------   
    flight     
      fly      
      air      
     space     
   aircraft    
    launch     
     plane     
     drone     
     pilot     
    rocket     


The topic contains the words flight, fly, air, space, aircraft, launch and sees a huge surge in popularity around March - May 2016. This was the time when SpaceX first successfully landed the first stage of its Falcon 9 rocket on a drone ship at sea. And of course, things related to Elon Musk have a tendency to be wildly popular on Hacker News :)

A quick look at the articles for this topic agrees with this hypothesis -

In [65]:
model.show_topic_articles(11, top_n=5)
   Topic #11   
  ----------   
    flight     
      fly      
      air      
     space     
   aircraft    
    launch     
     plane     
     drone     
     pilot     
    rocket     


---------------------------------------------------------------------
Article #11460935 - http://techcrunch.com/2016/04/08/spacex-just-landed-a-rocket-on-a-drone-ship-for-the-first-time/
SpaceX just landed a rocket on a drone ship for the first time
Topic score: 0.82

Article text:
 At 4:43 pm EST, SpaceX successfully launched their next resupply mission to the International Space Station (ISS). In addition to a seamless launch, SpaceX landed the first stage of their Falcon 9 rocket on an autonomous drone ship for the very first time.

Landing from the chase plane pic.twitter.com/2Q5qCaPq9P — SpaceX (@SpaceX) April 8, 2016

This was SpaceX’s fifth landing attempt on a drone ship — all previous attempts ended in explosions. Although in December of last year, Elon Musk (...)(trimmed)


---------------------------------------------------------------------
Article #11459183 - http://mobile.reuters.com/article/idUSKCN0X5228
SpaceX makes breakthrough by landing rocket at sea
Topic score: 0.79

Article text:
 CAPE CANAVERAL, Fla. (Reuters) - A SpaceX Falcon 9 rocket blasted off from Florida on a NASA cargo run to the International Space Station on Friday, and its reusable main-stage booster landed on an ocean platform minutes later in a dramatic spaceflight first.

The successful autonomous touchdown of the booster at sea marked another milestone for billionaire entrepreneur Elon Musk and his privately owned Space Exploration Technologies in the quest to develop a cheap, reusable rocket, expanding hi (...)(trimmed)


---------------------------------------------------------------------
Article #11642855 - http://phys.org/news/2016-05-spacex-successfully-rockets-stage-space.html
SpaceX lands rocket at sea second time after satellite launch
Topic score: 0.76

Article text:
 This photo provided by SpaceX shows the first stage of the company's Falcon rocket after it landed on a platform in the Atlantic Ocean just off the Florida coast on Friday, May 6, 2016, after launching a Japanese communications satellite. (SpaceX via AP) For the second month in a row, the aerospace upstart SpaceX landed a rocket on an ocean platform early Friday, this time following the successful launch of a Japanese communications satellite.

A live webcast showed the first-stage booster touch (...)(trimmed)


---------------------------------------------------------------------
Article #11791272 - http://www.theverge.com/2016/5/27/11787532/spacex-falcon-9-rocket-landing-success-sea-drone-ship
SpaceX successfully lands a Falcon 9 rocket at sea for the third time
Topic score: 0.74

Article text:
 SpaceX just successfully landed the first stage of its Falcon 9 rocket on a drone ship in the Atlantic Ocean. It was the third time in a row the company has landed a rocket booster at sea, and the fourth time overall.

The landing occurred a few minutes before the second stage of the Falcon 9 delivered the THAICOM-8 satellite to space, where it will make its way to geostationary transfer orbit (GTO). GTO is a high-elliptical orbit that is popular for satellites, sitting more than 20,000 miles ab (...)(trimmed)


---------------------------------------------------------------------
Article #11817878 - https://www.washingtonpost.com/graphics/business/rockets/
The New Space Race
Topic score: 0.71

Article text:
 Launch configurations

Launch abort system jettisons the crew to safety in the event of a launchpad failure.

Launch abort system

Orion crew vehicle

Cargo fairing

Exploration upper stage

The core stage of the rocket is orange because that is the natural color of the insulation that will cover it.

Core stage

Solid rocket boosters

Advanced boosters

RS-25 engines

A

B

C

D

A. An initial mission will take an unmanned crew vehicle around the moon and back to demonstrate the capabilities of (...)(trimmed)


In [66]:
iplot(model.topic_trend_plot(35))
   Topic #35   
  ----------   
     image     
      uk       
    london     
    caption    
      mr       
   copyright   
    british    
     japan     
     year      
    people     


This topic looks a little stranger. The words uk london british people seem fairly coherent, but the presence of words like image copyright caption is odd. It turns out to be another artifact of the data collection process - a number of the articles containing the words uk london british people are from the BBC, and the text parser picks up image captions from the BBC site, which very frequently contain the words image caption copyright.

As for the popularity trend, the topic seems fairly dormant most of the time, with a massive spike around June 2016. No prizes for guessing what this is due to -

In [73]:
model.show_topic_articles(35, top_n=5)
   Topic #35   
  ----------   
     image     
      uk       
    london     
    caption    
      mr       
   copyright   
    british    
     japan     
     year      
    people     


---------------------------------------------------------------------
Article #11970960 - http://www.bbc.com/news/uk-politics-eu-referendum-36620401
Petition for London independence signed by thousands after Brexit vote
Topic score: 0.67

Article text:
 Image copyright Reuters Image caption The overwhelming majority of Londoners voted to remain in the EU

A petition calling for Sadiq Khan to declare London an independent state after the UK voted to quit the EU has been signed by thousands of people.

The petition's organiser James O'Malley, said the capital was "a world city" which should "remain at the heart of Europe".

Nearly 60% of people in the capital backed the Remain campaign, in stark contrast to most of the country.

The LSE's directo (...)(trimmed)


---------------------------------------------------------------------
Article #11966167 - http://www.bbc.co.uk/news/uk-politics-36615028
UK votes to leave EU
Topic score: 0.66

Article text:
 Media playback is unsupported on your device Media caption EU vote: David Cameron says the UK "needs fresh leadership"

Prime Minister David Cameron is to step down by October after the UK voted to leave the European Union.

Speaking outside 10 Downing Street, he said "fresh leadership" was needed.

The PM had urged the country to vote Remain but was defeated by 52% to 48% despite London, Scotland and Northern Ireland backing staying in.

UKIP leader Nigel Farage hailed it as the UK's "independe (...)(trimmed)


---------------------------------------------------------------------
Article #11967959 - http://www.mirror.co.uk/news/uk-news/young-voters-wanted-brexit-least-8271517
Young voters wanted Brexit the least  and will have to live with it the longest
Topic score: 0.58

Article text:
 Get politics updates directly to your inbox + Subscribe Thank you for subscribing! Could not subscribe, try again later Invalid Email

Younger voters will be the losers from today's historic vote to leave the EU after polls repeatedly showed they back Remain.

Brexiters were led to victory in the referendum overnight by triumphing in Tory shires and Old Labour heartlands in Wales and the north of England.

But the Kingdom is no longer United after London, Scotland and Northern Ireland all backed (...)(trimmed)


---------------------------------------------------------------------
Article #11975945 - https://www.theguardian.com/uk-news/2016/jun/25/sturgeon-seeks-urgent-brussels-talks-to-protect-scotlands-eu-membership
Sturgeon seeks Brussels talks to protect Scotland's EU membership
Topic score: 0.52

Article text:
 First minister to set up panel to advise her on Scotland’s relationship with EU, as Labour considers endorsing independence

Nicola Sturgeon is to lobby EU member states directly for support in ensuring that Scotland can remain part of the bloc, after Scots voted emphatically against Brexit on Thursday.



The first minister has disclosed that she is to invite all EU diplomats based in Scotland to a summit at her official residence in Edinburgh within the next two weeks in a bid to sidestep th (...)(trimmed)


---------------------------------------------------------------------
Article #11967478 - http://www.theguardian.com/politics/2016/jun/24/david-cameron-resigns-after-uk-votes-to-leave-european-union
David Cameron announces resignation
Topic score: 0.51

Article text:
 David Cameron has resigned, bringing an abrupt end to his six-year premiership, after the British public took the momentous decision to reject his entreaties and turn their back on the European Union.

Just a year after he clinched a surprise majority in the general election, a visibly emotional Cameron, standing outside Number 10 on Friday morning alongside his wife, Samantha, said: “The will of the British people is an instruction that must be delivered.”

The prime minister campaigned har (...)(trimmed)


In [69]:
iplot(model.topic_trend_plot(44))
   Topic #44   
  ----------   
  government   
    agency     
   security    
      nsa      
 surveillance  
      fbi      
    snowden    
 intelligence  
   document    
  information  


This topic has a more interesting trend. Privacy and government surveillance have long been popular topics on Hacker News, and this is clear from the relatively high popularity values in comparison to the other topics plotted so far. The significant increase in popularity around February 2016 corresponds to the aftermath of the San Bernardino shooting, when there was a large amount of debate on privacy and surveillance, centered around whether Apple, under pressure from the FBI, should unlock an iPhone used by one of the shooters.

There are also numerous other spikes in this graph, and it'd be interesting to look at them in more detail to see if they can be traced to specific events.
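
As a starting point for tracing spikes to events, here is a minimal sketch of how a monthly popularity series could be computed and its unusually high months flagged. It is not the library's implementation: the `articles` structure, both function names, and the "mean plus 1.5 standard deviations" rule are assumptions made for illustration, and the data is made up.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, pstdev


def monthly_topic_popularity(articles, topic_id):
    """Sum one topic's scores over all articles posted in each month.

    `articles` is a hypothetical list of (post_date, {topic_id: score}) pairs.
    """
    totals = defaultdict(float)
    for post_date, topic_scores in articles:
        month = (post_date.year, post_date.month)
        # Articles without the topic contribute 0, so quiet months still appear.
        totals[month] += topic_scores.get(topic_id, 0.0)
    return dict(sorted(totals.items()))


def find_spikes(series, num_std=1.5):
    """Return the months whose popularity exceeds mean + num_std * std."""
    values = list(series.values())
    threshold = mean(values) + num_std * pstdev(values)
    return [month for month, value in series.items() if value > threshold]


# Made-up toy data: (post date, topic scores for that article).
articles = [
    (date(2016, 1, 5), {44: 0.10}),
    (date(2016, 2, 17), {44: 0.60}),
    (date(2016, 2, 18), {44: 0.55}),
    (date(2016, 2, 20), {44: 0.50, 67: 0.20}),
    (date(2016, 3, 2), {44: 0.15}),
    (date(2016, 4, 9), {11: 0.80}),
    (date(2016, 5, 1), {44: 0.12}),
]

trend = monthly_topic_popularity(articles, topic_id=44)
spikes = find_spikes(trend)
print(trend)
print(spikes)
```

The flagged months could then be passed to a date-filtered article listing to see which stories drove each spike.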

3.5 Topic Intersection

Topics can be combined to find articles that are relevant to both topics. Here, combining two separate topics, one containing the words game player play move win and the other google computer technology machine human, gives us articles related to AlphaGo's success against the human Go champion, Lee Sedol.

In [26]:
model.show_topic_articles([65, 66], top_n=5)
   Topic #65      Topic #66   
  ----------     ----------   
     game          google     
    player        computer    
     play        technology   
     move          machine    
      win           human     
     world         system     
     chess          world     
   computer          ai       
     level          year      
     sport          robot     


---------------------------------------------------------------------
Article #11250871 - http://googleasiapacific.blogspot.com/2016/03/alphagos-ultimate-challenge.html
AlphaGos ultimate challenge: a five-game match against Lee Sedol
Topic score: 0.35

Article text:
 Game 3 - March 12, 2016

“It’s arguable that in the first two games Lee Sedol was playing differently than his true style, trying to find a weakness in the computer. Today Lee was definitely playing his own game, from his strong opening to the complicated moves in the final kō. AlphaGo was ready for everything, including the kō fights, and was able to take the win. I’d like to congratulate the people who actually made this accomplishment possible, because it’s a work of art.”

“Lee (...)(trimmed)


---------------------------------------------------------------------
Article #11258168 - http://www.shanghaidaily.com/national/AlphaGo-cant-beat-me-says-Chinese-Go-grandmaster-Ke-Jie/shdaily.shtml
AlphaGo Can't Beat Me, Says Chinese Go Grandmaster Ke Jie
Topic score: 0.33

Article text:
 Home » Nation

ALPHAGO, the computer created by DeepMind, the Artificial Intelligence (AI) arm of Google, defeated world champion Lee Sedol of South Korea Wednesday in Game One of human vs. machine Go-chess showdown. The result is out of the expectations of many, including China's Go grandmaster Ke Jie, but Ke put it clear "AlphaGo is not in my match now".

Ke admitted Thursday he had underestimated AlphaGo's capability before the opening match, but he still believes he will be the winner shoul (...)(trimmed)


---------------------------------------------------------------------
Article #11300892 - https://googleblog.blogspot.com/2016/03/what-we-learned-in-seoul-with-alphago.html
What we learned in Seoul with AlphaGo
Topic score: 0.31

Article text:
 Go may be one of the oldest games in existence, but the attention to our five-game tournament exceeded even our wildest imaginations. Searches for Go rules and Go boards spiked in the U.S. In China, tens of millions watched live streams of the matches, and the “Man vs. Machine Go Showdown” hashtag saw 200 million pageviews on Sina Weibo. Sales of Go boards even surged in Korea.



Our public test of AlphaGo, however, was about more than winning at Go. We founded DeepMind in 2010 to create ge (...)(trimmed)


---------------------------------------------------------------------
Article #10981682 - https://googleblog.blogspot.com/2016/01/alphago-machine-learning-game-go.html
Google AI beats a pro at the game of Go
Topic score: 0.31

Article text:
 The game of Go originated in China more than 2,500 years ago. Confucius wrote about the game, and it is considered one of the four essential arts required of any true Chinese scholar. Played by more than 40 million people worldwide, the rules of the game are simple: Players take turns to place black or white stones on a board, trying to capture the opponent's stones or surround empty space to make points of territory. The game is played primarily through intuition and feel, and because of its be (...)(trimmed)


---------------------------------------------------------------------
Article #11129076 - http://venturebeat.com/2016/02/18/civilization-25-years-66-versions-33m-copies-sold-1-billion-hours-played/
Civilization: 25 years, 33M copies sold, 1B hours played, and 66 versions
Topic score: 0.30

Article text:
 LAS VEGAS — Civilization is one of the gods of strategy games, where you oversee the creation of a whole society in competition with other civilizations. It debuted in 1991, and now at 25, it has become one of the cultural touchstones of the game industry, something that everyone recognizes or has played in the past.

Image Credit: MicroProse

Few game franchises live to see a 25th anniversary, but Civ, as most gamers and industry folk call it, is thriving. It has 33 million copies in sales to (...)(trimmed)
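
The notebook does not say how scores from several topics are combined into one. One simple possibility, sketched here as an assumption rather than as the library's method, is to score each article by the minimum of its scores for the requested topics, so that an article only ranks highly if it is relevant to all of them. The article ids, scores, and function names are made up.

```python
def intersection_score(topic_scores, topic_ids):
    """Score an article for a set of topics as the minimum of its scores."""
    return min(topic_scores.get(topic_id, 0.0) for topic_id in topic_ids)


def top_articles_for_topics(article_topics, topic_ids, top_n=5):
    """Rank articles by intersection score, dropping those missing any topic."""
    scored = [
        (intersection_score(scores, topic_ids), article_id)
        for article_id, scores in article_topics.items()
    ]
    scored = [(score, article_id) for score, article_id in scored if score > 0]
    scored.sort(reverse=True)
    return [(article_id, score) for score, article_id in scored[:top_n]]


# Made-up toy data: article id -> {topic id: score}.
article_topics = {
    "a1": {65: 0.40, 66: 0.35},
    "a2": {65: 0.70},
    "a3": {65: 0.30, 66: 0.50},
    "a4": {66: 0.60, 44: 0.20},
}

ranked = top_articles_for_topics(article_topics, [65, 66])
print(ranked)
```

With a minimum, an article that is strongly about games but not about AI (like `a2` here) is excluded entirely. A product or geometric mean of the scores would behave similarly but penalize unbalanced articles less sharply.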


3.6 Similar topics

Topics that are similar to a specific topic can be found using -

In [27]:
model.show_similar_topics(44, top_n=5)
   Topic #44   
  ----------   
  government   
    agency     
   security    
      nsa      
 surveillance  
      fbi      
    snowden    
 intelligence  
   document    
  information  


Topics similar to topic #44
---------------------------

   Topic #73      Topic #3       Topic #98      Topic #50      Topic #67   
 Score (0.23)   Score (0.18)   Score (0.17)   Score (0.16)   Score (0.15)  
  ----------     ----------     ----------     ----------     ----------   
      law           group          datum        security          al       
     court         public          user          attack          state     
     case         political     information   vulnerability     attack     
     legal          state          data          exploit         group     
     rule          member         privacy        hacker       government   
     state         policy         access        password        country    
    lawyer        campaign        service       attacker        islamic    
  government      president      internet         hack         terrorist   
     judge          party         company         find           saudi     
     order       government       provide        system          iran      
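
The similarity measure behind these scores is not described in this notebook. A common choice, shown here purely as an illustrative assumption, is cosine similarity between the word-weight vectors of two topics. The weights and function names are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word -> weight dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_topics(topics, topic_id, top_n=5):
    """Return (topic id, similarity) pairs for the topics closest to `topic_id`."""
    target = topics[topic_id]
    scores = [
        (other_id, cosine_similarity(target, words))
        for other_id, words in topics.items()
        if other_id != topic_id
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


# Made-up word weights for a handful of topics.
topics = {
    44: {"government": 0.5, "security": 0.3, "nsa": 0.2},
    73: {"law": 0.4, "court": 0.2, "government": 0.4},
    50: {"security": 0.6, "attack": 0.4},
    12: {"food": 0.7, "eat": 0.3},
}

similar = most_similar_topics(topics, 44, top_n=2)
print(similar)
```

Topics that share heavily weighted words (here `government` and `security`) come out as the nearest neighbours, which matches the flavor of the table above. A measure based on how often topics co-occur in the same articles would be an equally plausible alternative.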


3.7 Topic Interesting-ness

Some topics occur more frequently in articles than others, but with lower scores. The hypothesis is that these topics are more common and generic, whereas interesting topics occur less frequently in articles, but with higher scores. Common and generic topics frequently have low scores, indicating they are rarely the main focus of an article, whereas the opposite is true for interesting topics.

Plotting the distribution of scores over all articles for two topics -

In [76]:
topics_of_interest = [43, 95]
In [75]:
model.print_topics_table(topics_of_interest)
   Topic #43      Topic #95   
  ----------     ----------   
    quantum         test      
    theory          code      
    physics         error     
   particle          bug      
   universe        problem    
   physicist         fix      
     wave           check     
     field          fail      
     hole           issue     
     state           run      


In [83]:
iplot(model.plot_topic_article_distribution(topics_of_interest))

As expected, the histogram for topic #95 (test, code, error, bug, problem), a rather generic topic, at least for Hacker News content, is heavily concentrated at the left end, indicating it occurs with low scores very frequently in articles, and almost never with a high score. The histogram for topic #43 (quantum, theory, physics, particle, universe) is much flatter, indicating it is the main theme of an article much more often.
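
The difference between the two histograms can be summarized in a single number per topic. A minimal sketch using the median score follows. It assumes that only articles in which the topic actually appears (score above `min_score`) are counted, since the notebook does not state how absent topics are handled. The data and the function name are made up.

```python
from statistics import median


def topic_interestingness(article_topics, min_score=0.0):
    """Median score per topic over the articles where the topic appears."""
    per_topic = {}
    for scores in article_topics:
        for topic_id, score in scores.items():
            if score > min_score:
                per_topic.setdefault(topic_id, []).append(score)
    return {topic_id: median(values) for topic_id, values in per_topic.items()}


# Made-up toy data: one {topic id: score} dict per article.
article_topics = [
    {43: 0.70, 95: 0.05},
    {95: 0.10},
    {43: 0.55},
    {95: 0.08, 43: 0.40},
    {95: 0.12},
]

scores = topic_interestingness(article_topics)
# Most "interesting" topics first.
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

In this toy data the generic topic #95 appears in more articles but always with a low score, so its median is far below that of topic #43.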

Computing the median of scores across all articles seems like a decent mathematical way of capturing this intuition of "interesting-ness". Sorting topics by the computed median scores in decreasing order, we get -

In [84]:
model.print_topics_table()
   Topic #99      Topic #29      Topic #38      Topic #43      Topic #56      Topic #70   
  ----------     ----------     ----------     ----------     ----------     ----------   
    network         earth        container       quantum        bitcoin          car      
     model          space         docker         theory       transaction      vehicle    
   learning         star            run          physics      blockchain        drive     
    neural         planet         service       particle        network         tesla     
     learn          orbit          image        universe        wright          road      
    machine         moon        application     physicist        block          bike      
     deep           year          deploy          wave         ethereum        driver     
   training         mars          cluster         field          trust          model     
     layer         galaxy         machine         hole         currency       electric    
     image        telescope        host           state        exchange         wheel     


   Topic #71      Topic #11      Topic #94      Topic #2       Topic #12      Topic #13   
  ----------     ----------     ----------     ----------     ----------     ----------   
    animal         flight          cell            key           food           stack     
     human           fly           gene        certificate        eat        instruction  
    specie           air            dna         security          fat         register    
      dog           space          human       encryption        sugar         address    
     bird         aircraft        genome         encrypt         diet           code      
      cat          launch         genetic       password         meat           call      
     year           plane         protein        secure          fruit         memory     
     tree           drone          mouse         secret           egg           byte      
     live           pilot         cancer         public         farmer         program    
     find          rocket        bacteria          tls           grow         function    


   Topic #75      Topic #67      Topic #30      Topic #31      Topic #92      Topic #62   
  ----------     ----------     ----------     ----------     ----------     ----------   
   component         al            stock           git          police          drug      
     react          state           tax          github          crime         patient    
   function        attack         market       repository       officer        health     
      var           group         company        commit          drug          medical    
    element      government        fund          branch         prison         disease    
    return         country         share         change        criminal        doctor     
     state         islamic       investor         code          arrest         cancer     
    render        terrorist      financial        merge          year         treatment   
      dom           saudi          bank          request         call           death     
    import          iran           price         project          law           year      


   Topic #44      Topic #52      Topic #97      Topic #77      Topic #16      Topic #39   
  ----------     ----------     ----------     ----------     ----------     ----------   
  government         pi            phone         memory          node          company    
    agency          board         network          cpu          system         startup    
   security          usb         internet         core           read          founder    
      nsa           chip           radio          intel          state        investor    
 surveillance       power         signal       performance       write          tech      
      fbi         hardware        mobile          cache         cluster        valley     
    snowden         card          device        processor     distribute        start     
 intelligence        km           channel         chip          latency        silicon    
   document        device         service          op           message       business    
  information    controller         fi             gpu         operation        money     


   Topic #93      Topic #86      Topic #27      Topic #50      Topic #91      Topic #10   
  ----------     ----------     ----------     ----------     ----------     ----------   
    energy         service         file         security         code           light     
     power          datum         command        attack        compiler         laser     
     solar           aws            run       vulnerability      rust         electron    
     cost           cloud         install        exploit        compile         field     
    battery       instance        script         hacker        function        energy     
     year          server          build        password     optimization       high      
      gas        application     directory      attacker        library        fusion     
     plant           run          default         hack           call          charge     
     fuel          storage        package         find          memory         produce    
      oil          system           set          system       performance        ray      


   Topic #48      Topic #36      Topic #21      Topic #84      Topic #87      Topic #0    
  ----------     ----------     ----------     ----------     ----------     ----------   
   database         ship           sleep         device          city          upgrade    
     query           sea            day           phone           san            fix      
     datum          water          hour          camera          area           close     
     table          year         exercise        battery        street           add      
     index          ocean         mental         laptop         housing          al       
      row           find          people           vr            york            doc      
    column         island         health         screen        francisco       rebuild    
      sql           river       depression     smartphone        home           david     
     data           land           feel           home         building        update     
    select          site          stress        hardware        people         michael    


   Topic #20      Topic #22      Topic #25      Topic #80      Topic #85      Topic #33   
  ----------     ----------     ----------     ----------     ----------     ----------   
      int          number          uber           brain        facebook         music     
    return          point         amazon          study           ad            video     
   function        matrix         driver        cognitive       twitter         sound     
     const        algorithm       service        memory          user           audio     
     void         function         trip          neuron          post           play      
    struct         vector         airbnb       participant       site           song      
     null           prime          ride          effect         people         stream     
     type           graph          lyft            al            news          record     
   template         line           taxi          ability        content         note      
     char           curve          city       intelligence      medium         listen     


   Topic #58      Topic #34      Topic #63      Topic #73      Topic #28      Topic #57   
  ----------     ----------     ----------     ----------     ----------     ----------   
    thread           src           image           law            war          student    
    process         llvm           color          court        military        school     
     lock           tool           pixel          case          weapon         college    
     call            gnu            map           legal         soviet          learn     
     event          clang          frame          rule          nuclear      university   
     queue         module          draw           state         russian         teach     
     task          include          red          lawyer          force        education   
      run           patch          light       government       missile         class     
     wait         solution        render          judge          bomb           high      
   function        problem         blue           order        american        teacher    


   Topic #74      Topic #1       Topic #41      Topic #65      Topic #14      Topic #96   
  ----------     ----------     ----------     ----------     ----------     ----------   
     type           team          people          game          windows        license    
   function        people        economic        player          linux        software    
    haskell          job           money          play          system        copyright   
   language        company          dao           move          kernel         patent     
     monad        interview      contract          win         microsoft        free      
    return          hire          social          world           os           oracle     
    define        engineer        income          chess           run          include    
     list         employee         rich         computer         boot          source     
    lambda         manager        wealth          level          user           term      
    promise          day        inequality        sport         driver          copy      


   Topic #46      Topic #40      Topic #26      Topic #7       Topic #49      Topic #82   
  ----------     ----------     ----------     ----------     ----------     ----------   
    percent       function        server          water           app            web      
     year          string         network          air          android         page      
      job          return       connection     temperature      google         browser    
    worker        variable        packet          flow           user           site      
     rate           code            ip           surface         apps          content    
    income          match         client          heat           swift         website    
     high        expression         tcp          bridge         mobile          user      
      low           list          address       material          ios        javascript   
   increase          def         protocol        oxygen           add           html      
    growth          call           send         chemical       developer       chrome     


   Topic #23      Topic #79      Topic #15      Topic #8       Topic #19      Topic #24   
  ----------     ----------     ----------     ----------     ----------     ----------   
    support         bank          request        python          film           book      
    release         money         server         library         show           write     
    version         card          client          code            art          century    
    feature        account         http         language         movie          world     
    change         credit        response         java          artist         history    
      add            pay        application       ruby          netflix         great     
      fix          payment          url        javascript        star            man      
    update          cash            api         framework        world         modern     
    include      transaction      service         read            le            year      
     issue         number         header          write         disney          life      


   Topic #64      Topic #6       Topic #60      Topic #5       Topic #32      Topic #45   
  ----------     ----------     ----------     ----------     ----------     ----------   
     type          problem         email          word          project         datum     
    object         machine        message         book           open          memory     
     class         theory          send         language        source          byte      
    method         number           tor           text         developer        file      
    string      mathematical      address         read           build           bit      
   function       computer        account        english       community        size      
     code        mathematic        mail         document         tool           hash      
    public          proof         domain        character     development        key      
     call       mathematician     contact        letter          team            set      
    return        question         user           write        software        buffer     


   Topic #42      Topic #54      Topic #81      Topic #61      Topic #66      Topic #76   
  ----------     ----------     ----------     ----------     ----------     ----------   
    company        design        language         text          google          china     
     year           build         program         mode         computer        country    
   employee         wall           code          window       technology       chinese    
    million       building      programming      screen         machine         world     
   business         part           write          line           human         united     
   executive        small       programmer       editor         system          india     
    billion         room         software        button          world         states     
    accord          shape         system          click           ai         government   
     firm         material       computer        display         year           north     
      ceo          create         design           key           robot        american    


   Topic #9       Topic #17      Topic #90      Topic #3       Topic #53      Topic #47   
  ----------     ----------     ----------     ----------     ----------     ----------   
     apple          datum           api           group          child         product    
     font          result           bot          public          woman        customer    
    iphone         number         direct        political         age         business    
      mac          average         slack          state           man          service    
    design          model          total         member          study         revenue    
     phone        analysis          sun          policy          group          share     
     size          sample           avg         campaign        parent         growth     
    device          show            sat         president        male         platform    
   software     distribution        sms           party          adult         result     
      ios          measure       anonymous     government         sex         software    


   Topic #37      Topic #69      Topic #35      Topic #98      Topic #83      Topic #78   
  ----------     ----------     ----------     ----------     ----------     ----------   
     life           story          image          datum          yahoo         package    
    family        continue          uk            user        restaurant        full      
     year           read          london       information      coffee         debian     
    friend      advertisement     caption         data            bar           text      
      day           main            mr           privacy         house         subject    
    people          times        copyright       access          food           link      
     live        newsletter       british        service         drink          send      
     home           sign           japan        internet         mayer          mbox      
      man           york           year          company         chef          mozilla    
     house        subscribe       people         provide         club           date      


   Topic #4       Topic #68      Topic #18      Topic #89      Topic #51      Topic #95   
  ----------     ----------     ----------     ----------     ----------     ----------   
    people           day           thing       university        price          test      
     thing          drive         people        research         sell           code      
     feel            ms             lot          science        company         error     
     fact            bob           start          paper         market           bug      
     human          year           year        researcher         buy          problem    
     point         august           big           study        business          fix      
     world          store         problem       scientist       product         check     
   question         july           back          publish          pay           fail      
    person          hour          happen         journal         cost           issue     
      bad           april          talk        scientific        sale            run      


   Topic #55      Topic #88      Topic #72      Topic #59   
  ----------     ----------     ----------     ----------   
      day          country         thing         system     
     back         european         find          problem    
     hand          europe          post          change     
      run            de            give          require    
     head          french          write         design     
      sit          france          start        approach    
     begin         germany         read           large     
     walk          german          point          level     
     hour           world         article        process    
      man           paris           ne           provide    


This seems to give reasonably good results. Specific, focused topics are at the top, whereas common, generic topics are at the bottom. This metric of interesting-ness could still be flawed for certain kinds of data, either because the notion of interesting-ness is itself different there (it is a subjective notion), or because the topic-article distribution is significantly different.

3.8 Topic Clusters

The model has a notion of similarity between topics based on a few metrics. The two basic ideas are -

  1. Two topics are similar if they have similar words
  2. Two topics are similar if they co-occur frequently in articles

The first captures the notion of lexical similarity, whereas the second captures the notion of relatedness.
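
As a rough illustration of these two ideas (not the library's actual implementation), the sketch below computes a word-based similarity as the cosine between two topics' word weights, and a co-occurrence-based similarity as the cosine between their per-article weights. The toy matrices, the helper names (cosine, word_sim, doc_sim, combined_sim) and the equal weighting alpha=0.5 are all assumptions made for illustration. How word_doc_sim actually combines the two is not described in this notebook.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy data: 3 topics over a 4-word vocabulary.
# topic_word[t][w] is the weight of word w in topic t.
topic_word = [
    [0.5, 0.4, 0.1, 0.0],  # topic 0
    [0.4, 0.5, 0.0, 0.1],  # topic 1: nearly the same words as topic 0
    [0.0, 0.0, 0.5, 0.5],  # topic 2: different vocabulary
]

# doc_topic[d][t] is the weight of topic t in article d (5 articles).
doc_topic = [
    [0.9, 0.0, 0.1],
    [0.0, 0.8, 0.2],
    [0.5, 0.0, 0.5],
    [0.0, 0.9, 0.1],
    [0.6, 0.0, 0.4],
]


def word_sim(i, j):
    """Idea (1): topics are similar if they weight the same words."""
    return cosine(topic_word[i], topic_word[j])


def doc_sim(i, j):
    """Idea (2): topics are similar if they show up in the same articles."""
    col_i = [row[i] for row in doc_topic]
    col_j = [row[j] for row in doc_topic]
    return cosine(col_i, col_j)


def combined_sim(i, j, alpha=0.5):
    """Assumed combination: a simple weighted average of the two scores."""
    return alpha * word_sim(i, j) + (1 - alpha) * doc_sim(i, j)
```

In this toy data, topics 0 and 1 share vocabulary but never appear in the same article, while topics 0 and 2 share few words but often co-occur, so each measure picks up a relationship the other misses.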

Plotting the topic similarity matrix for the word_doc_sim metric, which combines both (1) and (2) -

In [86]:
model.plot_topic_similarities(metric='word_doc_sim')
Out[86]:
<matplotlib.image.AxesImage at 0x7f3a8b8d3be0>

Looking at this matrix, it is possible to discern a couple of patterns -

  1. Certain topics are similar to many of the other topics. These stand out as distinctly dark rows/columns in the above matrix.
  2. Certain topics are similar to almost none of the other topics - stand-alone topics. These stand out as almost completely white rows/columns in the matrix above.

Neither of these kinds of topics is well-suited for clustering. Stand-alone topics should typically be clusters of their own, and it is difficult to assign a single cluster to common topics, as they are similar or related to many of the other topics. From a graph-theoretic point of view, common topics would be hub nodes - connected to many of the other topics, and not part of any single graph partition.

So, the method to cluster topics provides options to exclude such topics from the clustering process.
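
One simple way to flag such topics, shown here only as a sketch under assumed thresholds, is to count how many other topics each topic is similar to. The function name classify_topics, the cut-offs sim_threshold and common_fraction, and the toy matrix are hypothetical. The library's own criteria for exclude_common and exclude_standalone may differ.

```python
def classify_topics(sim, sim_threshold=0.5, common_fraction=0.6):
    """Split topic indices into (common, standalone, regular) lists.

    sim is a square, symmetric similarity matrix given as a list of lists.
    A topic is 'standalone' if no other topic reaches sim_threshold, and
    'common' if at least common_fraction of the other topics do.
    """
    n = len(sim)
    common, standalone, regular = [], [], []
    for i in range(n):
        neighbours = sum(
            1 for j in range(n) if j != i and sim[i][j] >= sim_threshold
        )
        if neighbours == 0:
            standalone.append(i)
        elif neighbours / (n - 1) >= common_fraction:
            common.append(i)
        else:
            regular.append(i)
    return common, standalone, regular


# Hypothetical 5-topic similarity matrix:
# topic 3 is close to most topics (a hub), topic 4 is close to none.
toy_sim = [
    [1.0, 0.8, 0.1, 0.6, 0.0],
    [0.8, 1.0, 0.2, 0.7, 0.0],
    [0.1, 0.2, 1.0, 0.9, 0.1],
    [0.6, 0.7, 0.9, 1.0, 0.2],
    [0.0, 0.0, 0.1, 0.2, 1.0],
]

common, standalone, regular = classify_topics(toy_sim)
```

Only the topics in regular would then be passed on to the clustering step.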

In addition, the model includes an internal method to determine cluster quality based on the silhouette scores of its constituent nodes. The printed clusters are sorted in decreasing order of cluster quality.
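
For reference, the silhouette score of an item compares a, its mean distance to the other members of its own cluster, with b, its mean distance to the nearest other cluster, as (b - a) / max(a, b). The sketch below ranks clusters by the mean silhouette of their members. It assumes distances are available as a square matrix (for example 1 - similarity), and the function names and toy data are illustrative rather than the library's internals.

```python
def silhouette_scores(dist, labels):
    """Silhouette score per item from a square distance matrix and labels."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Convention: a singleton cluster gets a score of 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(dist[i][j] for j in own) / len(own)
        # b: mean distance to the closest other cluster.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def rank_clusters(dist, labels):
    """Return (label, mean silhouette) pairs, best cluster first."""
    scores = silhouette_scores(dist, labels)
    per_cluster = {}
    for lab, s in zip(labels, scores):
        per_cluster.setdefault(lab, []).append(s)
    quality = {lab: sum(v) / len(v) for lab, v in per_cluster.items()}
    return sorted(quality.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical distances: topics 0-1 form a tight cluster, 2-4 a looser one.
toy_dist = [
    [0.0, 0.1, 0.9, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.4, 0.5],
    [0.9, 0.9, 0.4, 0.0, 0.4],
    [0.9, 0.9, 0.5, 0.4, 0.0],
]
toy_labels = [0, 0, 1, 1, 1]

ranked = rank_clusters(toy_dist, toy_labels)
```

Here the tight cluster ranks first, which matches the intuition that a well-separated, cohesive cluster is of higher quality.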

In [34]:
model.cluster_topics(metric='word_doc_sim', exclude_common=True, exclude_standalone=True)
In [35]:
model.print_topic_clusters()
Cluster 1----------------------------------

   Topic #39      Topic #42      Topic #47      Topic #51   
  ----------     ----------     ----------     ----------   
    company        company        product         price     
    startup         year         customer         sell      
    founder       employee       business        company    
   investor        million        service        market     
     tech         business        revenue          buy      
    valley        executive        share        business    
     start         billion        growth         product    
    silicon        accord        platform          pay      
   business         firm          result          cost      
     money           ceo         software         sale      




Cluster 9----------------------------------

   Topic #16      Topic #48      Topic #86   
  ----------     ----------     ----------   
     node         database        service    
    system          query          datum     
     read           datum           aws      
     state          table          cloud     
     write          index        instance    
    cluster          row          server     
  distribute       column       application  
    latency          sql            run      
    message         data          storage    
   operation       select         system     




Cluster 6----------------------------------

   Topic #44      Topic #73      Topic #92      Topic #98   
  ----------     ----------     ----------     ----------   
  government         law          police          datum     
    agency          court          crime          user      
   security         case          officer      information  
      nsa           legal          drug           data      
 surveillance       rule          prison         privacy    
      fbi           state        criminal        access     
    snowden        lawyer         arrest         service    
 intelligence    government        year         internet    
   document         judge          call          company    
  information       order           law          provide    




Cluster 13----------------------------------

   Topic #12      Topic #21      Topic #53      Topic #62      Topic #71      Topic #80   
  ----------     ----------     ----------     ----------     ----------     ----------   
     food           sleep          child          drug          animal          brain     
      eat            day           woman         patient         human          study     
      fat           hour            age          health         specie        cognitive   
     sugar        exercise          man          medical          dog          memory     
     diet          mental          study         disease         bird          neuron     
     meat          people          group         doctor           cat        participant  
     fruit         health         parent         cancer          year          effect     
      egg        depression        male         treatment        tree            al       
    farmer          feel           adult          death          live          ability    
     grow          stress           sex           year           find       intelligence  


   Topic #89      Topic #94   
  ----------     ----------   
  university        cell      
   research         gene      
    science          dna      
     paper          human     
  researcher       genome     
     study         genetic    
   scientist       protein    
    publish         mouse     
    journal        cancer     
  scientific      bacteria    




Cluster 8----------------------------------

   Topic #3       Topic #28      Topic #35      Topic #67      Topic #76      Topic #88   
  ----------     ----------     ----------     ----------     ----------     ----------   
     group           war           image           al            china         country    
    public        military          uk            state         country       european    
   political       weapon         london         attack         chinese        europe     
     state         soviet         caption         group          world           de       
    member         nuclear          mr         government       united         french     
    policy         russian       copyright       country         india         france     
   campaign         force         british        islamic        states         germany    
   president       missile         japan        terrorist     government       german     
     party          bomb           year           saudi          north          world     
  government      american        people          iran         american         paris     




Cluster 0----------------------------------

   Topic #13      Topic #20      Topic #40      Topic #58      Topic #64      Topic #74   
  ----------     ----------     ----------     ----------     ----------     ----------   
     stack           int         function        thread          type           type      
  instruction      return         string         process        object        function    
   register       function        return          lock           class         haskell    
    address         const        variable         call          method        language    
     code           void           code           event         string          monad     
     call          struct          match          queue        function        return     
    memory          null        expression        task           code          define     
     byte           type           list            run          public          list      
    program       template          def           wait           call          lambda     
   function         char           call         function        return         promise    


   Topic #75      Topic #91   
  ----------     ----------   
   component        code      
     react        compiler    
   function         rust      
      var          compile    
    element       function    
    return      optimization  
     state         library    
    render          call      
      dom          memory     
    import       performance  




Cluster 5----------------------------------

   Topic #2       Topic #15      Topic #26      Topic #50      Topic #60   
  ----------     ----------     ----------     ----------     ----------   
      key          request        server        security         email     
  certificate      server         network        attack         message    
   security        client       connection    vulnerability      send      
  encryption        http          packet         exploit          tor      
    encrypt       response          ip           hacker         address    
   password      application      client        password        account    
    secure           url            tcp         attacker         mail      
    secret           api          address         hack          domain     
    public         service       protocol         find          contact    
      tls          header          send          system          user      




Cluster 11----------------------------------

   Topic #7       Topic #10      Topic #11      Topic #29      Topic #36      Topic #43   
  ----------     ----------     ----------     ----------     ----------     ----------   
     water          light         flight          earth          ship          quantum    
      air           laser           fly           space           sea          theory     
  temperature     electron          air           star           water         physics    
     flow           field          space         planet          year         particle    
    surface        energy        aircraft         orbit          ocean        universe    
     heat           high          launch          moon           find         physicist   
    bridge         fusion          plane          year          island          wave      
   material        charge          drone          mars           river          field     
    oxygen         produce         pilot         galaxy          land           hole      
   chemical          ray          rocket        telescope        site           state     


   Topic #54      Topic #93   
  ----------     ----------   
    design         energy     
     build          power     
     wall           solar     
   building         cost      
     part          battery    
     small          year      
     room            gas      
     shape          plant     
   material         fuel      
    create           oil      




Cluster 12----------------------------------

   Topic #6       Topic #17      Topic #22      Topic #45      Topic #61      Topic #63   
  ----------     ----------     ----------     ----------     ----------     ----------   
    problem         datum         number          datum          text           image     
    machine        result          point         memory          mode           color     
    theory         number         matrix          byte          window          pixel     
    number         average       algorithm        file          screen           map      
 mathematical       model        function          bit           line           frame     
   computer       analysis        vector          size          editor          draw      
  mathematic       sample          prime          hash          button           red      
     proof          show           graph           key           click          light     
 mathematician  distribution       line            set          display        render     
   question        measure         curve         buffer           key           blue      


   Topic #99   
  ----------   
    network    
     model     
   learning    
    neural     
     learn     
    machine    
     deep      
   training    
     layer     
     image     




Cluster 10----------------------------------

   Topic #30      Topic #41      Topic #79   
  ----------     ----------     ----------   
     stock         people          bank      
      tax         economic         money     
    market          money          card      
    company          dao          account    
     fund         contract        credit     
     share         social           pay      
   investor        income         payment    
   financial        rich           cash      
     bank          wealth       transaction  
     price       inequality       number     




Cluster 7----------------------------------

   Topic #14      Topic #52      Topic #77      Topic #84      Topic #97   
  ----------     ----------     ----------     ----------     ----------   
    windows          pi           memory         device          phone     
     linux          board           cpu           phone         network    
    system           usb           core          camera        internet    
    kernel          chip           intel         battery         radio     
   microsoft        power       performance      laptop         signal     
      os          hardware         cache           vr           mobile     
      run           card         processor       screen         device     
     boot            km            chip        smartphone       channel    
     user          device           op            home          service    
    driver       controller         gpu         hardware          fi       




Cluster 14----------------------------------

   Topic #8       Topic #27      Topic #81      Topic #95   
  ----------     ----------     ----------     ----------   
    python          file         language         test      
    library        command        program         code      
     code            run           code           error     
   language        install      programming        bug      
     java          script          write         problem    
     ruby           build       programmer         fix      
  javascript      directory      software         check     
   framework       default        system          fail      
     read          package       computer         issue     
     write           set          design           run      




Cluster 4----------------------------------

   Topic #23   
  ----------   
    support    
    release    
    version    
    feature    
    change     
      add      
      fix      
    update     
    include    
     issue     




Cluster 2----------------------------------

   Topic #1       Topic #49      Topic #82      Topic #85   
  ----------     ----------     ----------     ----------   
     team            app            web         facebook    
    people         android         page            ad       
      job          google         browser        twitter    
    company         user           site           user      
   interview        apps          content         post      
     hire           swift         website         site      
   engineer        mobile          user          people     
   employee          ios        javascript        news      
    manager          add           html          content    
      day         developer       chrome         medium     




Cluster 3----------------------------------

   Topic #0       Topic #5       Topic #9       Topic #19      Topic #25      Topic #31   
  ----------     ----------     ----------     ----------     ----------     ----------   
    upgrade         word           apple          film           uber            git      
      fix           book           font           show          amazon         github     
     close        language        iphone           art          driver       repository   
      add           text            mac           movie         service        commit     
      al            read          design         artist          trip          branch     
      doc          english         phone         netflix        airbnb         change     
    rebuild       document         size           star           ride           code      
     david        character       device          world          lyft           merge     
    update         letter        software          le            taxi          request    
    michael         write           ios          disney          city          project    


   Topic #33      Topic #38      Topic #56      Topic #57      Topic #65      Topic #68   
  ----------     ----------     ----------     ----------     ----------     ----------   
     music        container       bitcoin        student         game            day      
     video         docker       transaction      school         player          drive     
     sound           run        blockchain       college         play            ms       
     audio         service        network         learn          move            bob      
     play           image         wright       university         win           year      
     song        application       block          teach          world         august     
    stream         deploy        ethereum       education        chess          store     
    record         cluster         trust          class        computer         july      
     note          machine       currency         high           level          hour      
    listen          host         exchange        teacher         sport          april     


   Topic #69      Topic #70      Topic #78      Topic #83      Topic #87      Topic #96   
  ----------     ----------     ----------     ----------     ----------     ----------   
     story           car          package         yahoo          city          license    
   continue        vehicle         full        restaurant         san         software    
     read           drive         debian         coffee          area         copyright   
 advertisement      tesla          text            bar          street         patent     
     main           road          subject         house         housing         free      
     times          bike           link           food           york          oracle     
  newsletter       driver          send           drink        francisco       include    
     sign           model          mbox           mayer          home          source     
     york         electric        mozilla         chef         building         term      
   subscribe        wheel          date           club          people          copy      




Stand-alone topics----------------------------------

   Topic #34      Topic #90   
  ----------     ----------   
      src            api      
     llvm            bot      
     tool          direct     
      gnu           slack     
     clang          total     
    module           sun      
    include          avg      
     patch           sat      
   solution          sms      
    problem       anonymous   




Common topics----------------------------------

   Topic #4       Topic #18      Topic #24      Topic #32      Topic #37      Topic #46   
  ----------     ----------     ----------     ----------     ----------     ----------   
    people          thing          book          project         life          percent    
     thing         people          write          open          family          year      
     feel            lot          century        source          year            job      
     fact           start          world        developer       friend         worker     
     human          year          history         build           day           rate      
     point           big           great        community       people         income     
     world         problem          man           tool           live           high      
   question         back          modern       development       home            low      
    person         happen          year           team            man         increase    
      bad           talk           life         software         house         growth     


   Topic #55      Topic #59      Topic #66      Topic #72   
  ----------     ----------     ----------     ----------   
      day          system         google          thing     
     back          problem       computer         find      
     hand          change       technology        post      
      run          require        machine         give      
     head          design          human          write     
      sit         approach        system          start     
     begin          large          world          read      
     walk           level           ai            point     
     hour          process         year          article    
      man          provide         robot           ne       




Plotting the similarity matrix for the clustered topics -

In [107]:
model.plot_clustered_topic_similarities(metric='word_doc_sim', threshold_percentile=85)
Out[107]:
<matplotlib.image.AxesImage at 0x7f3a8b9c93c8>

The common as well as standalone topics excluded from the clustering are at the end. Also note that the diagonal values (self-similarity) have been zeroed in the matrix above to allow for easier visualization.
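
Zeroing the diagonal matters because every topic has the maximum similarity with itself, which would otherwise dominate the colour scale. The sketch below shows that step together with one plausible reading of threshold_percentile, namely that off-diagonal values below the given percentile are hidden. That interpretation, the nearest-rank percentile, and the function name prepare_for_plot are assumptions, not the library's documented behaviour.

```python
import math


def prepare_for_plot(sim, threshold_percentile=85):
    """Return a copy of sim with the diagonal zeroed and weak entries hidden."""
    n = len(sim)
    # Zero the self-similarity entries so they do not dominate the plot.
    out = [[0.0 if i == j else sim[i][j] for j in range(n)] for i in range(n)]

    off_diag = sorted(out[i][j] for i in range(n) for j in range(n) if i != j)
    if not off_diag:
        return out

    # Nearest-rank percentile of the off-diagonal values (assumed semantics).
    rank = max(1, math.ceil(threshold_percentile / 100 * len(off_diag)))
    cutoff = off_diag[rank - 1]

    # Keep only entries at or above the cutoff.
    return [[v if v >= cutoff else 0.0 for v in row] for row in out]


# Hypothetical 3-topic similarity matrix.
toy_sim = [
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.4],
    [0.9, 0.4, 1.0],
]

half = prepare_for_plot(toy_sim, threshold_percentile=50)
strict = prepare_for_plot(toy_sim, threshold_percentile=85)
```

The resulting matrix could then be passed to a plotting call such as matplotlib's imshow.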

A number of the clusters look fairly reasonable, grouping together related topics. However, there is one large cluster at the end which contains a large number of disparate, wide-ranging topics. From the matrix, it is also evident that these topics are not very similar to most of the other topics that were clustered.

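This suggests that excluding common and stand-alone topics has a cost: topics that remain but are only weakly related to the rest may end up lumped together in a single catch-all cluster.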

There are also some other tricky aspects to clustering topics -

  1. Choosing an appropriate number of clusters
  2. Whether clustering topics is even suitable for the kind of data and topics you have
  3. How to handle common as well as stand-alone topics while clustering

4. Future Steps

  1. A clearly defined, well-documented API to allow extracting topics from a user-supplied dataset
  2. Interactive visualization that allows users to browse articles and topics in a single graph
  3. Extracting topics hierarchically - being able to extract sub-topics from the articles associated with a particular topic in order to focus on a more specific theme of interest
  4. Tagging events from news articles to spikes in topic popularity, in order to understand why the interest in a certain topic varied as it did

In conclusion, I'd love to get more feedback about whether and how this could be useful. Please do get in touch at [email protected] if you have any ideas. Feel free to reach out if you'd like to talk about NLP more generally, too!