This notebook demonstrates the usage of the news-analyze library, which uses topic modeling and clustering to extract topics and themes from a corpus of news articles. The key features are -
The goal of the library is to provide a way to qualitatively explore topics and trends in a news corpus to gain insight into it.
The notebook presents these features using a model trained on a year's worth of Hacker News data, which is present in the repo and directly usable. The library doesn't yet provide a documented API for training new models on your own data. This is a work in progress.
This library was one of the things I worked on while I was part of the Recurse Center, a programmer's retreat for people from a variety of backgrounds and experience levels looking to get better at programming. You should check them out!
A significant motivation behind this initial alpha release and demo is to get feedback about the following -
The data used for training the model is a collection of posts on Hacker News, available here. The raw data contains 293119 posts from September 2015 to September 2016. A post here refers to an article that was posted to Hacker News, not the comments. The article text is not included, only the url, along with some metadata (time of post, number of points and comments received).
First, articles that received fewer than 50 points were filtered out in order to focus on links that received a fair amount of attention on HN, leaving 20148 posts. Next, to extract the full text of these articles, the content at each URL was scraped and parsed using newspaper, a Python library for extracting the full text of news articles from HTML. Content from some URLs could not be extracted correctly in this process (mostly 404s), leaving 15016 parsed articles.
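As a rough sketch of the points filter described above (not the code actually used to build this dataset), the step might look like the following. The record layout, the field names and the `filter_by_points` helper are assumptions made for illustration.

```python
# Hypothetical post records; the field names are assumptions,
# not the schema of the original Hacker News dataset.
posts = [
    {"id": 1, "url": "http://example.com/a", "points": 120},
    {"id": 2, "url": "http://example.com/b", "points": 12},
    {"id": 3, "url": "http://example.com/c", "points": 50},
]

MIN_POINTS = 50  # posts with fewer points than this are dropped


def filter_by_points(posts, min_points=MIN_POINTS):
    """Keep only posts that received at least `min_points` points."""
    return [post for post in posts if post["points"] >= min_points]


popular = filter_by_points(posts)
print([post["id"] for post in popular])
```

Only the URLs of the surviving posts would then need to be scraped, which keeps the slow network-bound step as small as possible.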
Topic models were trained on these using Gensim, a Python library that has both native implementations of various topic modeling algorithms and wrappers to external topic modeling frameworks. The final model in the repository was trained using a wrapper to MALLET. spaCy was used for tokenization and lemmatization. Tokens that were extremely frequent or extremely rare were filtered out. For more specific details, please have a look at this file.
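The frequency filtering mentioned above can be illustrated with a small pure-Python sketch based on document frequency. The thresholds and the `filter_extreme_tokens` helper are hypothetical; the values used for the actual model are not stated here. Gensim's `Dictionary.filter_extremes` provides similar filtering, though whether the library relies on it is an assumption.

```python
from collections import Counter


def filter_extreme_tokens(docs, min_docs=2, max_doc_fraction=0.5):
    """Drop tokens that appear in too few or too many documents.

    `min_docs` and `max_doc_fraction` are illustrative thresholds only.
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = max_doc_fraction * len(docs)
    keep = {tok for tok, df in doc_freq.items() if min_docs <= df <= max_docs}
    return [[tok for tok in doc if tok in keep] for doc in docs]


# Tiny tokenized corpus: "the" is too frequent, "launch" etc. are too rare.
docs = [
    ["the", "rocket", "launch"],
    ["the", "rocket", "land"],
    ["the", "bitcoin", "block"],
    ["the", "bitcoin", "wallet"],
]
filtered = filter_extreme_tokens(docs)
print(filtered)
```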
The insights and use-cases presented in this section are on the dataset described above. I don't yet know how well these techniques can generalize to new datasets, and your mileage may vary. Also, the repository does not contain the original text scraped from the HN posts as these are from a variety of websites, some of which might have terms and conditions that do not permit their data to be publicly released. As a result, the notebook might not be runnable on your local machine. I'm currently looking into how to work around this issue.
%cd ..
/home/jayant/Projects/recurse/hn_analyze
%load_ext autoreload
%autoreload 2
import os
import pickle
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
%matplotlib inline
import matplotlib
matplotlib.rcParams['figure.figsize'] = [12, 8]
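# Load the pre-trained topic model shipped with the repo; `with` closes the file handle
with open('data/models/hn_ldam_mallet_100t_5a', 'rb') as f:
    model = pickle.load(f)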
This is the list of all topics that were extracted from the corpus, printed in human-readable form. Note that in the underlying model, each topic is a vector of scores over all words in the corpus. Here, only the top 10 words for each topic are displayed, for ease of reading and in order to get a sense of what each topic is about.
The topics are ordered in decreasing order of "interesting-ness", which is described in a later section in the notebook.
model.print_topics_table()
Topic #99      Topic #29      Topic #38      Topic #43      Topic #56      Topic #70
----------     ----------     ----------     ----------     ----------     ----------
network        earth          container      quantum        bitcoin        car
model          space          docker         theory         transaction    vehicle
learning       star           run            physics        blockchain     drive
neural         planet         service        particle       network        tesla
learn          orbit          image          universe       wright         road
machine        moon           application    physicist      block          bike
deep           year           deploy         wave           ethereum       driver
training       mars           cluster        field          trust          model
layer          galaxy         machine        hole           currency       electric
image          telescope      host           state          exchange       wheel

Topic #71      Topic #11      Topic #94      Topic #2       Topic #12      Topic #13
----------     ----------     ----------     ----------     ----------     ----------
animal         flight         cell           key            food           stack
human          fly            gene           certificate    eat            instruction
specie         air            dna            security       fat            register
dog            space          human          encryption     sugar          address
bird           aircraft       genome         encrypt        diet           code
cat            launch         genetic        password       meat           call
year           plane          protein        secure         fruit          memory
tree           drone          mouse          secret         egg            byte
live           pilot          cancer         public         farmer         program
find           rocket         bacteria       tls            grow           function

Topic #75      Topic #67      Topic #30      Topic #31      Topic #92      Topic #62
----------     ----------     ----------     ----------     ----------     ----------
component      al             stock          git            police         drug
react          state          tax            github         crime          patient
function       attack         market         repository     officer        health
var            group          company        commit         drug           medical
element        government     fund           branch         prison         disease
return         country        share          change         criminal       doctor
state          islamic        investor       code           arrest         cancer
render         terrorist      financial      merge          year           treatment
dom            saudi          bank           request        call           death
import         iran           price          project        law            year

Topic #44      Topic #52      Topic #97      Topic #77      Topic #16      Topic #39
----------     ----------     ----------     ----------     ----------     ----------
government     pi             phone          memory         node           company
agency         board          network        cpu            system         startup
security       usb            internet       core           read           founder
nsa            chip           radio          intel          state          investor
surveillance   power          signal         performance    write          tech
fbi            hardware       mobile         cache          cluster        valley
snowden        card           device         processor      distribute     start
intelligence   km             channel        chip           latency        silicon
document       device         service        op             message        business
information    controller     fi             gpu            operation      money

Topic #93      Topic #86      Topic #27      Topic #50      Topic #91      Topic #10
----------     ----------     ----------     ----------     ----------     ----------
energy         service        file           security       code           light
power          datum          command        attack         compiler       laser
solar          aws            run            vulnerability  rust           electron
cost           cloud          install        exploit        compile        field
battery        instance       script         hacker         function       energy
year           server         build          password       optimization   high
gas            application    directory      attacker       library        fusion
plant          run            default        hack           call           charge
fuel           storage        package        find           memory         produce
oil            system         set            system         performance    ray

Topic #48      Topic #36      Topic #21      Topic #84      Topic #87      Topic #0
----------     ----------     ----------     ----------     ----------     ----------
database       ship           sleep          device         city           upgrade
query          sea            day            phone          san            fix
datum          water          hour           camera         area           close
table          year           exercise       battery        street         add
index          ocean          mental         laptop         housing        al
row            find           people         vr             york           doc
column         island         health         screen         francisco      rebuild
sql            river          depression     smartphone     home           david
data           land           feel           home           building       update
select         site           stress         hardware       people         michael

Topic #20      Topic #22      Topic #25      Topic #80      Topic #85      Topic #33
----------     ----------     ----------     ----------     ----------     ----------
int            number         uber           brain          facebook       music
return         point          amazon         study          ad             video
function       matrix         driver         cognitive      twitter        sound
const          algorithm      service        memory         user           audio
void           function       trip           neuron         post           play
struct         vector         airbnb         participant    site           song
null           prime          ride           effect         people         stream
type           graph          lyft           al             news           record
template       line           taxi           ability        content        note
char           curve          city           intelligence   medium         listen

Topic #58      Topic #34      Topic #63      Topic #73      Topic #28      Topic #57
----------     ----------     ----------     ----------     ----------     ----------
thread         src            image          law            war            student
process        llvm           color          court          military       school
lock           tool           pixel          case           weapon         college
call           gnu            map            legal          soviet         learn
event          clang          frame          rule           nuclear        university
queue          module         draw           state          russian        teach
task           include        red            lawyer         force          education
run            patch          light          government     missile        class
wait           solution       render         judge          bomb           high
function       problem        blue           order          american       teacher

Topic #74      Topic #1       Topic #41      Topic #65      Topic #14      Topic #96
----------     ----------     ----------     ----------     ----------     ----------
type           team           people         game           windows        license
function       people         economic       player         linux          software
haskell        job            money          play           system         copyright
language       company        dao            move           kernel         patent
monad          interview      contract       win            microsoft      free
return         hire           social         world          os             oracle
define         engineer       income         chess          run            include
list           employee       rich           computer       boot           source
lambda         manager        wealth         level          user           term
promise        day            inequality     sport          driver         copy

Topic #46      Topic #40      Topic #26      Topic #7       Topic #49      Topic #82
----------     ----------     ----------     ----------     ----------     ----------
percent        function       server         water          app            web
year           string         network        air            android        page
job            return         connection     temperature    google         browser
worker         variable       packet         flow           user           site
rate           code           ip             surface        apps           content
income         match          client         heat           swift          website
high           expression     tcp            bridge         mobile         user
low            list           address        material       ios            javascript
increase       def            protocol       oxygen         add            html
growth         call           send           chemical       developer      chrome

Topic #23      Topic #79      Topic #15      Topic #8       Topic #19      Topic #24
----------     ----------     ----------     ----------     ----------     ----------
support        bank           request        python         film           book
release        money          server         library        show           write
version        card           client         code           art            century
feature        account        http           language       movie          world
change         credit         response       java           artist         history
add            pay            application    ruby           netflix        great
fix            payment        url            javascript     star           man
update         cash           api            framework      world          modern
include        transaction    service        read           le             year
issue          number         header         write          disney         life

Topic #64      Topic #6       Topic #60      Topic #5       Topic #32      Topic #45
----------     ----------     ----------     ----------     ----------     ----------
type           problem        email          word           project        datum
object         machine        message        book           open           memory
class          theory         send           language       source         byte
method         number         tor            text           developer      file
string         mathematical   address        read           build          bit
function       computer       account        english        community      size
code           mathematic     mail           document       tool           hash
public         proof          domain         character      development    key
call           mathematician  contact        letter         team           set
return         question       user           write          software       buffer

Topic #42      Topic #54      Topic #81      Topic #61      Topic #66      Topic #76
----------     ----------     ----------     ----------     ----------     ----------
company        design         language       text           google         china
year           build          program        mode           computer       country
employee       wall           code           window         technology     chinese
million        building       programming    screen         machine        world
business       part           write          line           human          united
executive      small          programmer     editor         system         india
billion        room           software       button         world          states
accord         shape          system         click          ai             government
firm           material       computer       display        year           north
ceo            create         design         key            robot          american

Topic #9       Topic #17      Topic #90      Topic #3       Topic #53      Topic #47
----------     ----------     ----------     ----------     ----------     ----------
apple          datum          api            group          child          product
font           result         bot            public         woman          customer
iphone         number         direct         political      age            business
mac            average        slack          state          man            service
design         model          total          member         study          revenue
phone          analysis       sun            policy         group          share
size           sample         avg            campaign       parent         growth
device         show           sat            president      male           platform
software       distribution   sms            party          adult          result
ios            measure        anonymous      government     sex            software

Topic #37      Topic #69      Topic #35      Topic #98      Topic #83      Topic #78
----------     ----------     ----------     ----------     ----------     ----------
life           story          image          datum          yahoo          package
family         continue       uk             user           restaurant     full
year           read           london         information    coffee         debian
friend         advertisement  caption        data           bar            text
day            main           mr             privacy        house          subject
people         times          copyright      access         food           link
live           newsletter     british        service        drink          send
home           sign           japan          internet       mayer          mbox
man            york           year           company        chef           mozilla
house          subscribe      people         provide        club           date

Topic #4       Topic #68      Topic #18      Topic #89      Topic #51      Topic #95
----------     ----------     ----------     ----------     ----------     ----------
people         day            thing          university     price          test
thing          drive          people         research       sell           code
feel           ms             lot            science        company        error
fact           bob            start          paper          market         bug
human          year           year           researcher     buy            problem
point          august         big            study          business       fix
world          store          problem        scientist      product        check
question       july           back           publish        pay            fail
person         hour           happen         journal        cost           issue
bad            april          talk           scientific     sale           run

Topic #55      Topic #88      Topic #72      Topic #59
----------     ----------     ----------     ----------
day            country        thing          system
back           european       find           problem
hand           europe         post           change
run            de             give           require
head           french         write          design
sit            france         start          approach
begin          germany        read           large
walk           german         point          level
hour           world          article        process
man            paris          ne             provide
A topic here is NOT exactly the same as the commonly used interpretation of the word "topic"; it is simply a list of "related words". It is intended to represent a broad theme of interest, and doesn't carry a specific label.
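To make the "vector of scores, top 10 words" idea concrete, here is a minimal sketch of how the displayed words could be derived from a topic's word scores. The scores below are made up and `top_words` is a hypothetical helper, not part of the library's API.

```python
def top_words(topic_scores, n=10):
    """Return the n highest-scoring words of one topic, best first."""
    ranked = sorted(topic_scores.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]


# Made-up scores for a handful of words; a real topic has a score for
# every word in the vocabulary.
topic = {"network": 0.031, "model": 0.027, "learning": 0.025,
         "neural": 0.020, "banana": 0.0001}
print(top_words(topic, n=3))
```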
This prints all the articles (along with a snippet of their content) that contained a specific topic, ordered in decreasing order of the topic score for the article, which is a measure of how central the topic was to the article. The top 5 articles are shown here for ease of reading.
model.show_topic_articles(99, top_n=5)
Topic #99
----------
network
model
learning
neural
learn
machine
deep
training
layer
image
---------------------------------------------------------------------
Article #11052034 - http://www.wildml.com/deep-learning-glossary/
Deep Learning Glossary
Topic score: 0.83
Article text: This glossary is work in progress and I am planning to continuously update it. If you find a mistake or think an important term is missing, please let me know in the comments or via email. Deep Learning terminology can be quite overwhelming to newcomers. This glossary tries to define commonly used terms and link to original references and additional resources to help readers dive deeper into a specific topic. The boundary between what is Deep Learning vs. “general” Machine Learning termino (...)(trimmed)
---------------------------------------------------------------------
Article #10384279 - http://blog.christianperone.com/2015/08/convolutional-neural-networks-and-feature-extraction-with-python/
Convolutional neural networks and feature extraction with Python
Topic score: 0.80
Article text: Convolutional neural networks (or ConvNets) are biologically-inspired variants of MLPs, they have different kinds of layers and each different layer works different than the usual MLP layers. If you are interested in learning more about ConvNets, a good course is the CS231n – Convolutional Neural Newtorks for Visual Recognition. The architecture of the CNNs are shown in the images below: As you can see, the ConvNets works with 3D volumes and transformations of these 3D volumes. I won’t repe (...)(trimmed)
---------------------------------------------------------------------
Article #11840175 - https://github.com/rasbt/python-machine-learning-book/blob/master/faq/difference-deep-and-normal-learning.md
What is the difference between deep learning and usual machine learning?
Topic score: 0.74
Article text: What is the difference between deep learning and usual machine learning? That's an interesting question, and I try to answer this in a very general way. In essence, deep learning offers a set of techniques and algorithms that help us to parameterize deep neural network structures -- artificial neural networks with many hidden layers and parameters. One of the key ideas behind deep learning is to extract high level features from the given dataset. Thereby, deep learning aims to overcome the cha (...)(trimmed)
---------------------------------------------------------------------
Article #12196388 - https://github.com/karandesai-96/digit-classifier
MNIST Handwritten Digit Classifier beginner neural network project
Topic score: 0.73
Article text: MNIST Handwritten Digit Classifier An implementation of multilayer neural network using numpy library. The implementation is a modified version of Michael Nielsen's implementation in Neural Networks and Deep Learning book. Brief Background: If you are familiar with basics of Neural Networks, feel free to skip this section. For total beginners who landed up here before reading anything about Neural Networks: Neural networks are made up of building blocks known as Sigmoid Neurons . These are n (...)(trimmed)
---------------------------------------------------------------------
Article #11701665 - http://blog.keras.io/building-autoencoders-in-keras.html
Building autoencoders in Keras
Topic score: 0.72
Article text: Sat 14 May 2016 In Tutorials. In this tutorial, we will answer some common questions about autoencoders, and we will cover code examples of the following models: a simple autoencoder based on a fully-connected layer a sparse autoencoder a deep fully-connected autoencoder a deep convolutional autoencoder an image denoising model a sequence-to-sequence autoencoder a variational autoencoder Note: all code examples have been updated to the Keras 2.0 API on March 14, 2017. You will need Kera (...)(trimmed)
model.show_topic_articles(44, top_n=5)
Topic #44
----------
government
agency
security
nsa
surveillance
fbi
snowden
intelligence
document
information
---------------------------------------------------------------------
Article #10304864 - https://edwardsnowden.com/
Edwardsnowden.com
Topic score: 0.78
Article text: Who Is Edward Snowden? Edward Snowden is a 31 year old US citizen, former Intelligence Community officer and whistleblower. The documents he revealed provided a vital public window into the NSA and its international intelligence partners’ secret mass surveillance programs and capabilities. These revelations generated unprecedented attention around the world on privacy intrusions and digital security, leading to a global debate on the issue. Snowden worked in various roles within the US Intel (...)(trimmed)
---------------------------------------------------------------------
Article #11748746 - http://www.theguardian.com/us-news/2016/may/22/snowden-whistleblower-protections-john-crane
Snowden calls for whistleblower shield after claims by new Pentagon source
Topic score: 0.69
Article text: Accusations that Pentagon retaliated against a whistleblower undermine argument that there were options for Snowden other than leaking to the media Edward Snowden has called for a complete overhaul of US whistleblower protections after a new source from deep inside the Pentagon came forward with a startling account of how the system became a “trap” for those seeking to expose wrongdoing. The account of John Crane, a former senior Pentagon investigator, appears to undermine Barack Obama, (...)(trimmed)
---------------------------------------------------------------------
Article #10615250 - https://www.washingtonpost.com/news/the-switch/wp/2015/11/20/why-its-so-hard-to-keep-up-with-how-the-u-s-government-is-spying-on-its-own-people/
Why its so hard to keep up with how the U.S. gov't is spying on its own people
Topic score: 0.68
Article text: Since 2013, Americans have gained immense insight about how the government conducts digital spying programs, largely thanks to the revelations made by former security contractor Edward Snowden. But a new report shows it's really hard to keep track of all the ways the United States is snooping on its own people. After Snowden revealed the National Security Agency was collecting data en masse about American e-mails, the government said it had ended that particular program in 2011. But it turns o (...)(trimmed)
---------------------------------------------------------------------
Article #11837578 - https://news.vice.com/article/edward-snowden-leaks-tried-to-tell-nsa-about-surveillance-concerns-exclusive
Snowden Tried to Tell NSA About Surveillance Concerns, Documents Reveal
Topic score: 0.68
Article text: On the morning of May 29, 2014, an overcast Thursday in Washington, D.C., the general counsel of the Office of the Director of National Intelligence, Robert Litt, wrote an email to high-level officials at the National Security Agency and the White House. The topic: what to do about Edward Snowden. Snowden’s leaks had first come to light the previous June, when the Guardian’s Glenn Greenwald and the Washington Post’s Barton Gellman published stories based on highly classified documents pr (...)(trimmed)
---------------------------------------------------------------------
Article #11400686 - http://thehill.com/policy/national-security/274840-report-clinton-could-be-interviewed-by-fbi-within-days
Report: FBI moves to interview Clinton over emails
Topic score: 0.67
Article text: Hillary Clinton Hillary Rodham ClintonAssange meets U.S. congressman, vows to prove Russia did not leak him documents High-ranking FBI official leaves Russia probe OPINION | Steve Bannon is Trump's indispensable man — don't sacrifice him to the critics MORE and her top aides might be questioned by FBI officials about her private email server within the next few days, according to a new report from Al Jazeera America. The news outlet reported that the FBI has concluded its examination of Clint (...)(trimmed)
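The ordering used in the listings above, by decreasing topic score, can be sketched as follows. The `doc_topics` mapping and the `top_articles` helper are assumptions for illustration; how the library stores per-article scores internally is not documented here.

```python
def top_articles(doc_topics, topic_id, top_n=5):
    """Rank article ids by their score for `topic_id`, highest first."""
    scored = [
        (article_id, scores[topic_id])
        for article_id, scores in doc_topics.items()
        if topic_id in scores  # skip articles that do not contain the topic
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical article id -> {topic id: score} mapping.
doc_topics = {
    "a1": {99: 0.83, 44: 0.02},
    "a2": {99: 0.10},
    "a3": {44: 0.78},
    "a4": {99: 0.74},
}
print(top_articles(doc_topics, 99, top_n=2))
```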
This displays the topics that were extracted from a specific article in the corpus.
model.show_article_topics(10577102)
---------------------------------------------------------------------
Article #10577102 - http://www.nytimes.com/2015/11/17/us/after-paris-attacks-cia-director-rekindles-debate-over-surveillance.html
After Paris Attacks, C.I.A. Director Rekindles Debate Over Surveillance
Article text: “As far as I know, there’s no evidence the French lacked some kind of surveillance authority that would have made a difference,” said Jameel Jaffer, deputy legal director of the American Civil Liberties Union. “When we’ve invested new powers in the government in response to events like the Paris attacks, they have often been abused.” The debate over the proper limits on government dates to the origins of the United States, with periodic overreaching in the name of security being cur (...)(trimmed)

Topic #44      Topic #67      Topic #69
Score (0.32)   Score (0.19)   Score (0.15)
----------     ----------     ----------
government     al             story
agency         state          continue
security       attack         read
nsa            group          advertisement
surveillance   government     main
fbi            country        times
snowden        islamic        newsletter
intelligence   terrorist      sign
document       saudi          york
information    iran           subscribe
The last topic looks strange here. As it turns out, it is an unintended artifact of the data collection process. The newspaper library used to extract text from articles also picks up text from some of the advertisements and subscribe buttons on NYTimes articles. As a result, these words co-occur with each other extremely frequently and with other words much less frequently, and hence form a very natural topic for topic modeling algorithms.
url = "https://www.ligo.caltech.edu/news/ligo20170927"
model.show_article_topics_from_url(url)
Article: https://www.ligo.caltech.edu/news/ligo20170927
Article text: News Release • September 27, 2017 The LIGO Scientific Collaboration and the Virgo collaboration report the first joint detection of gravitational waves with both the LIGO and Virgo detectors. This is the fourth announced detection of a binary black hole system and the first significant gravitational-wave signal recorded by the Virgo detector, and highlights the scientific potential of a three-detector network of gravitational-wave detectors. The three-detector observation was made on August 14 (...)(trimmed)

Most relevant topics:
Topic #29      Topic #10      Topic #89
Score (0.31)   Score (0.15)   Score (0.13)
----------     ----------     ----------
earth          light          university
space          laser          research
star           electron       science
planet         field          paper
orbit          energy         researcher
moon           high           study
year           fusion         scientist
mars           charge         publish
galaxy         produce        journal
telescope      ray            scientific
The popularity of topics can be plotted over time. Here are a few cherry-picked examples with interesting results -
iplot(model.topic_trend_plot(11))
Topic #11
----------
flight
fly
air
space
aircraft
launch
plane
drone
pilot
rocket
The topic contains the words flight, fly, air, space, aircraft, launch and sees a huge surge in popularity around March - May 2016. This was the time when SpaceX first successfully landed the first stage of its Falcon 9 rocket on a drone ship at sea, and then repeated the feat after subsequent satellite launches. And of course, things related to Elon Musk have a tendency to be wildly popular on Hacker News :)
A quick look at the articles for this topic agrees with this hypothesis -
model.show_topic_articles(11, top_n=5)
Topic #11
----------
flight
fly
air
space
aircraft
launch
plane
drone
pilot
rocket
---------------------------------------------------------------------
Article #11460935 - http://techcrunch.com/2016/04/08/spacex-just-landed-a-rocket-on-a-drone-ship-for-the-first-time/
SpaceX just landed a rocket on a drone ship for the first time
Topic score: 0.82
Article text: At 4:43 pm EST, SpaceX successfully launched their next resupply mission to the International Space Station (ISS). In addition to a seamless launch, SpaceX landed the first stage of their Falcon 9 rocket on an autonomous drone ship for the very first time. Landing from the chase plane pic.twitter.com/2Q5qCaPq9P — SpaceX (@SpaceX) April 8, 2016 This was SpaceX’s fifth landing attempt on a drone ship — all previous attempts ended in explosions. Although in December of last year, Elon Musk (...)(trimmed)
---------------------------------------------------------------------
Article #11459183 - http://mobile.reuters.com/article/idUSKCN0X5228
SpaceX makes breakthrough by landing rocket at sea
Topic score: 0.79
Article text: CAPE CANAVERAL, Fla. (Reuters) - A SpaceX Falcon 9 rocket blasted off from Florida on a NASA cargo run to the International Space Station on Friday, and its reusable main-stage booster landed on an ocean platform minutes later in a dramatic spaceflight first. The successful autonomous touchdown of the booster at sea marked another milestone for billionaire entrepreneur Elon Musk and his privately owned Space Exploration Technologies in the quest to develop a cheap, reusable rocket, expanding hi (...)(trimmed)
---------------------------------------------------------------------
Article #11642855 - http://phys.org/news/2016-05-spacex-successfully-rockets-stage-space.html
SpaceX lands rocket at sea second time after satellite launch
Topic score: 0.76
Article text: This photo provided by SpaceX shows the first stage of the company's Falcon rocket after it landed on a platform in the Atlantic Ocean just off the Florida coast on Friday, May 6, 2016, after launching a Japanese communications satellite. (SpaceX via AP) For the second month in a row, the aerospace upstart SpaceX landed a rocket on an ocean platform early Friday, this time following the successful launch of a Japanese communications satellite. A live webcast showed the first-stage booster touch (...)(trimmed)
---------------------------------------------------------------------
Article #11791272 - http://www.theverge.com/2016/5/27/11787532/spacex-falcon-9-rocket-landing-success-sea-drone-ship
SpaceX successfully lands a Falcon 9 rocket at sea for the third time
Topic score: 0.74
Article text: SpaceX just successfully landed the first stage of its Falcon 9 rocket on a drone ship in the Atlantic Ocean. It was the third time in a row the company has landed a rocket booster at sea, and the fourth time overall. The landing occurred a few minutes before the second stage of the Falcon 9 delivered the THAICOM-8 satellite to space, where it will make its way to geostationary transfer orbit (GTO). GTO is a high-elliptical orbit that is popular for satellites, sitting more than 20,000 miles ab (...)(trimmed)
---------------------------------------------------------------------
Article #11817878 - https://www.washingtonpost.com/graphics/business/rockets/
The New Space Race
Topic score: 0.71
Article text: Launch configurations Launch abort system jettisons the crew to safety in the event of a launchpad failure. Launch abort system Orion crew vehicle Cargo fairing Exploration upper stage The core stage of the rocket is orange because that is the natural color of the insulation that will cover it. Core stage Solid rocket boosters Advanced boosters RS-25 engines A B C D A. An initial mission will take an unmanned crew vehicle around the moon and back to demonstrate the capabilities of (...)(trimmed)
iplot(model.topic_trend_plot(35))
Topic #35
----------
image
uk
london
caption
mr
copyright
british
japan
year
people
This topic looks a little more strange. The words uk, london, british, people seem fairly coherent, but the presence of words like image, copyright, caption is rather odd. It turns out to be another artifact of the data collection process - a number of the articles containing the words uk, london, british, people are from the BBC, and the text parser picks up image captions from the BBC site, which very frequently contain the words image, caption, copyright.
As for the popularity trend, the topic seems fairly dormant most of the time, with a massive spike around June 2016. No prizes for guessing what this is due to -
model.show_topic_articles(35, top_n=5)
Topic #35
----------
image
uk
london
caption
mr
copyright
british
japan
year
people
---------------------------------------------------------------------
Article #11970960 - http://www.bbc.com/news/uk-politics-eu-referendum-36620401
Petition for London independence signed by thousands after Brexit vote
Topic score: 0.67
Article text: Image copyright Reuters Image caption The overwhelming majority of Londoners voted to remain in the EU A petition calling for Sadiq Khan to declare London an independent state after the UK voted to quit the EU has been signed by thousands of people. The petition's organiser James O'Malley, said the capital was "a world city" which should "remain at the heart of Europe". Nearly 60% of people in the capital backed the Remain campaign, in stark contrast to most of the country. The LSE's directo (...)(trimmed)
---------------------------------------------------------------------
Article #11966167 - http://www.bbc.co.uk/news/uk-politics-36615028
UK votes to leave EU
Topic score: 0.66
Article text: Media playback is unsupported on your device Media caption EU vote: David Cameron says the UK "needs fresh leadership" Prime Minister David Cameron is to step down by October after the UK voted to leave the European Union. Speaking outside 10 Downing Street, he said "fresh leadership" was needed. The PM had urged the country to vote Remain but was defeated by 52% to 48% despite London, Scotland and Northern Ireland backing staying in. UKIP leader Nigel Farage hailed it as the UK's "independe (...)(trimmed)
---------------------------------------------------------------------
Article #11967959 - http://www.mirror.co.uk/news/uk-news/young-voters-wanted-brexit-least-8271517
Young voters wanted Brexit the least and will have to live with it the longest
Topic score: 0.58
Article text: Get politics updates directly to your inbox + Subscribe Thank you for subscribing! Could not subscribe, try again later Invalid Email Younger voters will be the losers from today's historic vote to leave the EU after polls repeatedly showed they back Remain. Brexiters were led to victory in the referendum overnight by triumphing in Tory shires and Old Labour heartlands in Wales and the north of England. But the Kingdom is no longer United after London, Scotland and Northern Ireland all backed (...)(trimmed)
---------------------------------------------------------------------
Article #11975945 - https://www.theguardian.com/uk-news/2016/jun/25/sturgeon-seeks-urgent-brussels-talks-to-protect-scotlands-eu-membership
Sturgeon seeks Brussels talks to protect Scotland's EU membership
Topic score: 0.52
Article text: First minister to set up panel to advise her on Scotland’s relationship with EU, as Labour considers endorsing independence Nicola Sturgeon is to lobby EU member states directly for support in ensuring that Scotland can remain part of the bloc, after Scots voted emphatically against Brexit on Thursday. The first minister has disclosed that she is to invite all EU diplomats based in Scotland to a summit at her official residence in Edinburgh within the next two weeks in a bid to sidestep th (...)(trimmed)
---------------------------------------------------------------------
Article #11967478 - http://www.theguardian.com/politics/2016/jun/24/david-cameron-resigns-after-uk-votes-to-leave-european-union
David Cameron announces resignation
Topic score: 0.51
Article text: David Cameron has resigned, bringing an abrupt end to his six-year premiership, after the British public took the momentous decision to reject his entreaties and turn their back on the European Union. Just a year after he clinched a surprise majority in the general election, a visibly emotional Cameron, standing outside Number 10 on Friday morning alongside his wife, Samantha, said: “The will of the British people is an instruction that must be delivered.” The prime minister campaigned har (...)(trimmed)
iplot(model.topic_trend_plot(44))
Topic #44
----------
government
agency
security
nsa
surveillance
fbi
snowden
intelligence
document
information
This topic has a more interesting trend. Privacy and government surveillance have long been popular topics on Hacker News, and this is clear from the relatively high popularity values compared to the other topics plotted so far. The significant increase in popularity around February 2016 corresponds to the aftermath of the San Bernardino shooting, when there was a large amount of debate on privacy and surveillance, centered on whether Apple, under pressure from the FBI, should unlock an iPhone used by one of the shooters.
There are also numerous other spikes in this graph, and it'd be interesting to look at them in more detail to see if they can be traced to specific events.
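As a rough starting point for tracing spikes back to events, one could flag the periods where a topic's popularity is unusually high. The sketch below is not part of the library: the function name `find_spikes`, the z-score threshold and the monthly values are all made up for illustration, and it assumes the trend is available as a list of (period, popularity) pairs.

```python
from statistics import mean, pstdev

def find_spikes(series, z_threshold=2.0):
    """Return the (label, value) pairs whose z-score exceeds z_threshold.

    `series` is a list of (label, value) pairs, e.g. one entry per month.
    This is a hypothetical helper, not a news-analyze function.
    """
    values = [value for _, value in series]
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # a perfectly flat series has no spikes
    return [(label, value) for label, value in series
            if (value - mu) / sigma > z_threshold]

# Made-up monthly popularity values, for illustration only.
monthly_popularity = [
    ("2015-09", 1.0), ("2015-10", 1.1), ("2015-11", 0.9), ("2015-12", 1.0),
    ("2016-01", 1.2), ("2016-02", 3.0), ("2016-03", 1.1), ("2016-04", 1.0),
]
spikes = find_spikes(monthly_popularity)
print(spikes)
```

The flagged periods could then be passed to a date-restricted article listing to see which stories drove each spike.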
Topics can be combined to find articles that are relevant to both topics. Here, we see that combining two separate topics, consisting of the words game player play move win and google computer technology machine human, gives us articles related to AlphaGo's success against the human Go champion, Lee Sedol.
model.show_topic_articles([65, 66], top_n=5)
Topic #65       Topic #66
----------      ----------
game            google
player          computer
play            technology
move            machine
win             human
world           system
chess           world
computer        ai
level           year
sport           robot

---------------------------------------------------------------------
Article #11250871 - http://googleasiapacific.blogspot.com/2016/03/alphagos-ultimate-challenge.html
AlphaGos ultimate challenge: a five-game match against Lee Sedol
Topic score: 0.35
Article text:
Game 3 - March 12, 2016 “It’s arguable that in the first two games Lee Sedol was playing differently than his true style, trying to find a weakness in the computer. Today Lee was definitely playing his own game, from his strong opening to the complicated moves in the final kō. AlphaGo was ready for everything, including the kō fights, and was able to take the win. I’d like to congratulate the people who actually made this accomplishment possible, because it’s a work of art.” “Lee (...)(trimmed)
---------------------------------------------------------------------
Article #11258168 - http://www.shanghaidaily.com/national/AlphaGo-cant-beat-me-says-Chinese-Go-grandmaster-Ke-Jie/shdaily.shtml
AlphaGo Can't Beat Me, Says Chinese Go Grandmaster Ke Jie
Topic score: 0.33
Article text:
Home » Nation ALPHAGO, the computer created by DeepMind, the Artificial Intelligence (AI) arm of Google, defeated world champion Lee Sedol of South Korea Wednesday in Game One of human vs. machine Go-chess showdown. The result is out of the expectations of many, including China's Go grandmaster Ke Jie, but Ke put it clear "AlphaGo is not in my match now".
Ke admitted Thursday he had underestimated AlphaGo's capability before the opening match, but he still believes he will be the winner shoul (...)(trimmed)
---------------------------------------------------------------------
Article #11300892 - https://googleblog.blogspot.com/2016/03/what-we-learned-in-seoul-with-alphago.html
What we learned in Seoul with AlphaGo
Topic score: 0.31
Article text:
Go may be one of the oldest games in existence, but the attention to our five-game tournament exceeded even our wildest imaginations. Searches for Go rules and Go boards spiked in the U.S. In China, tens of millions watched live streams of the matches, and the “Man vs. Machine Go Showdown” hashtag saw 200 million pageviews on Sina Weibo. Sales of Go boards even surged in Korea. Our public test of AlphaGo, however, was about more than winning at Go. We founded DeepMind in 2010 to create ge (...)(trimmed)
---------------------------------------------------------------------
Article #10981682 - https://googleblog.blogspot.com/2016/01/alphago-machine-learning-game-go.html
Google AI beats a pro at the game of Go
Topic score: 0.31
Article text:
The game of Go originated in China more than 2,500 years ago. Confucius wrote about the game, and it is considered one of the four essential arts required of any true Chinese scholar. Played by more than 40 million people worldwide, the rules of the game are simple: Players take turns to place black or white stones on a board, trying to capture the opponent's stones or surround empty space to make points of territory.
The game is played primarily through intuition and feel, and because of its be (...)(trimmed)
---------------------------------------------------------------------
Article #11129076 - http://venturebeat.com/2016/02/18/civilization-25-years-66-versions-33m-copies-sold-1-billion-hours-played/
Civilization: 25 years, 33M copies sold, 1B hours played, and 66 versions
Topic score: 0.30
Article text:
LAS VEGAS — Civilization is one of the gods of strategy games, where you oversee the creation of a whole society in competition with other civilizations. It debuted in 1991, and now at 25, it has become one of the cultural touchstones of the game industry, something that everyone recognizes or has played in the past. Image Credit: MicroProse Few game franchises live to see a 25th anniversary, but Civ, as most gamers and industry folk call it, is thriving. It has 33 million copies in sales to (...)(trimmed)
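To give an intuition for what combining topics means, here is a minimal sketch that ranks articles by their weakest score among the requested topics, so that an article only ranks highly if every topic is well represented. Using the minimum is an assumption made for illustration; the way the library actually computes the combined topic score may differ. The function name, article ids and scores below are hypothetical.

```python
def rank_by_combined_topics(doc_topic_scores, topic_ids, top_n=5):
    """Rank articles by their lowest score among `topic_ids`, highest first.

    `doc_topic_scores` maps an article id to a {topic_id: score} dict.
    Taking the minimum is an illustrative choice, not the library's method.
    """
    ranked = []
    for article_id, scores in doc_topic_scores.items():
        # A topic missing from an article counts as a score of 0.
        combined = min(scores.get(topic_id, 0.0) for topic_id in topic_ids)
        ranked.append((article_id, combined))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]

# Made-up scores: only the first article is about both games and Google/AI.
doc_topic_scores = {
    "alphago-match": {65: 0.40, 66: 0.35},
    "chess-history": {65: 0.70, 66: 0.02},
    "self-driving": {65: 0.01, 66: 0.60},
}
top = rank_by_combined_topics(doc_topic_scores, [65, 66], top_n=2)
print(top)
```

With a scheme like this, an article that is strongly about only one of the topics (such as a pure chess history piece) drops below articles that touch both.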
Topics that are similar to a specific topic can be found using -
model.show_similar_topics(44, top_n=5)
Topic #44
----------
government
agency
security
nsa
surveillance
fbi
snowden
intelligence
document
information

Topics similar to topic #44
---------------------------
Topic #73       Topic #3        Topic #98       Topic #50       Topic #67
Score (0.23)    Score (0.18)    Score (0.17)    Score (0.16)    Score (0.15)
----------      ----------      ----------      ----------      ----------
law             group           datum           security        al
court           public          user            attack          state
case            political       information     vulnerability   attack
legal           state           data            exploit         group
rule            member          privacy         hacker          government
state           policy          access          password        country
lawyer          campaign        service         attacker        islamic
government      president       internet        hack            terrorist
judge           party           company         find            saudi
order           government      provide         system          iran
There are certain topics which occur more frequently in articles than others, but with lower scores. The hypothesis is that these topics are more common and generic, whereas interesting topics occur in fewer articles, but with higher scores. Common and generic topics would frequently have low scores, indicating they are rarely the main focus of an article, whereas the opposite is true for interesting topics.
Plotting the distribution of scores over all articles for two topics -
topics_of_interest = [43, 95]
model.print_topics_table(topics_of_interest)
Topic #43       Topic #95
----------      ----------
quantum         test
theory          code
physics         error
particle        bug
universe        problem
physicist       fix
wave            check
field           fail
hole            issue
state           run
iplot(model.plot_topic_article_distribution(topics_of_interest))
As expected, the histogram for topic #95 (test, code, error, bug, problem), a rather generic topic, at least for Hacker News content, is heavily concentrated at the low end of the score range, indicating it occurs with low scores very frequently in articles, and almost never with a high score. The histogram for topic #43 (quantum, theory, physics, particle, universe) is much flatter, indicating it is the main theme of an article much more often.
Computing the median of scores across all articles seems like a decent mathematical way of capturing this intuition of "interesting-ness". Sorting topics by the computed median scores in decreasing order, we get -
model.print_topics_table()
Topic #99 Topic #29 Topic #38 Topic #43 Topic #56 Topic #70 ---------- ---------- ---------- ---------- ---------- ---------- network earth container quantum bitcoin car model space docker theory transaction vehicle learning star run physics blockchain drive neural planet service particle network tesla learn orbit image universe wright road machine moon application physicist block bike deep year deploy wave ethereum driver training mars cluster field trust model layer galaxy machine hole currency electric image telescope host state exchange wheel Topic #71 Topic #11 Topic #94 Topic #2 Topic #12 Topic #13 ---------- ---------- ---------- ---------- ---------- ---------- animal flight cell key food stack human fly gene certificate eat instruction specie air dna security fat register dog space human encryption sugar address bird aircraft genome encrypt diet code cat launch genetic password meat call year plane protein secure fruit memory tree drone mouse secret egg byte live pilot cancer public farmer program find rocket bacteria tls grow function Topic #75 Topic #67 Topic #30 Topic #31 Topic #92 Topic #62 ---------- ---------- ---------- ---------- ---------- ---------- component al stock git police drug react state tax github crime patient function attack market repository officer health var group company commit drug medical element government fund branch prison disease return country share change criminal doctor state islamic investor code arrest cancer render terrorist financial merge year treatment dom saudi bank request call death import iran price project law year Topic #44 Topic #52 Topic #97 Topic #77 Topic #16 Topic #39 ---------- ---------- ---------- ---------- ---------- ---------- government pi phone memory node company agency board network cpu system startup security usb internet core read founder nsa chip radio intel state investor surveillance power signal performance write tech fbi hardware mobile cache cluster valley snowden card device processor 
distribute start intelligence km channel chip latency silicon document device service op message business information controller fi gpu operation money Topic #93 Topic #86 Topic #27 Topic #50 Topic #91 Topic #10 ---------- ---------- ---------- ---------- ---------- ---------- energy service file security code light power datum command attack compiler laser solar aws run vulnerability rust electron cost cloud install exploit compile field battery instance script hacker function energy year server build password optimization high gas application directory attacker library fusion plant run default hack call charge fuel storage package find memory produce oil system set system performance ray Topic #48 Topic #36 Topic #21 Topic #84 Topic #87 Topic #0 ---------- ---------- ---------- ---------- ---------- ---------- database ship sleep device city upgrade query sea day phone san fix datum water hour camera area close table year exercise battery street add index ocean mental laptop housing al row find people vr york doc column island health screen francisco rebuild sql river depression smartphone home david data land feel home building update select site stress hardware people michael Topic #20 Topic #22 Topic #25 Topic #80 Topic #85 Topic #33 ---------- ---------- ---------- ---------- ---------- ---------- int number uber brain facebook music return point amazon study ad video function matrix driver cognitive twitter sound const algorithm service memory user audio void function trip neuron post play struct vector airbnb participant site song null prime ride effect people stream type graph lyft al news record template line taxi ability content note char curve city intelligence medium listen Topic #58 Topic #34 Topic #63 Topic #73 Topic #28 Topic #57 ---------- ---------- ---------- ---------- ---------- ---------- thread src image law war student process llvm color court military school lock tool pixel case weapon college call gnu map legal soviet learn event clang 
frame rule nuclear university queue module draw state russian teach task include red lawyer force education run patch light government missile class wait solution render judge bomb high function problem blue order american teacher Topic #74 Topic #1 Topic #41 Topic #65 Topic #14 Topic #96 ---------- ---------- ---------- ---------- ---------- ---------- type team people game windows license function people economic player linux software haskell job money play system copyright language company dao move kernel patent monad interview contract win microsoft free return hire social world os oracle define engineer income chess run include list employee rich computer boot source lambda manager wealth level user term promise day inequality sport driver copy Topic #46 Topic #40 Topic #26 Topic #7 Topic #49 Topic #82 ---------- ---------- ---------- ---------- ---------- ---------- percent function server water app web year string network air android page job return connection temperature google browser worker variable packet flow user site rate code ip surface apps content income match client heat swift website high expression tcp bridge mobile user low list address material ios javascript increase def protocol oxygen add html growth call send chemical developer chrome Topic #23 Topic #79 Topic #15 Topic #8 Topic #19 Topic #24 ---------- ---------- ---------- ---------- ---------- ---------- support bank request python film book release money server library show write version card client code art century feature account http language movie world change credit response java artist history add pay application ruby netflix great fix payment url javascript star man update cash api framework world modern include transaction service read le year issue number header write disney life Topic #64 Topic #6 Topic #60 Topic #5 Topic #32 Topic #45 ---------- ---------- ---------- ---------- ---------- ---------- type problem email word project datum object machine message book open 
memory class theory send language source byte method number tor text developer file string mathematical address read build bit function computer account english community size code mathematic mail document tool hash public proof domain character development key call mathematician contact letter team set return question user write software buffer Topic #42 Topic #54 Topic #81 Topic #61 Topic #66 Topic #76 ---------- ---------- ---------- ---------- ---------- ---------- company design language text google china year build program mode computer country employee wall code window technology chinese million building programming screen machine world business part write line human united executive small programmer editor system india billion room software button world states accord shape system click ai government firm material computer display year north ceo create design key robot american Topic #9 Topic #17 Topic #90 Topic #3 Topic #53 Topic #47 ---------- ---------- ---------- ---------- ---------- ---------- apple datum api group child product font result bot public woman customer iphone number direct political age business mac average slack state man service design model total member study revenue phone analysis sun policy group share size sample avg campaign parent growth device show sat president male platform software distribution sms party adult result ios measure anonymous government sex software Topic #37 Topic #69 Topic #35 Topic #98 Topic #83 Topic #78 ---------- ---------- ---------- ---------- ---------- ---------- life story image datum yahoo package family continue uk user restaurant full year read london information coffee debian friend advertisement caption data bar text day main mr privacy house subject people times copyright access food link live newsletter british service drink send home sign japan internet mayer mbox man york year company chef mozilla house subscribe people provide club date Topic #4 Topic #68 Topic #18 Topic #89 Topic #51 Topic 
#95 ---------- ---------- ---------- ---------- ---------- ---------- people day thing university price test thing drive people research sell code feel ms lot science company error fact bob start paper market bug human year year researcher buy problem point august big study business fix world store problem scientist product check question july back publish pay fail person hour happen journal cost issue bad april talk scientific sale run Topic #55 Topic #88 Topic #72 Topic #59 ---------- ---------- ---------- ---------- day country thing system back european find problem hand europe post change run de give require head french write design sit france start approach begin germany read large walk german point level hour world article process man paris ne provide
This seems to give reasonably good results. Specific, focused topics are at the top, whereas common generic topics are at the bottom. It is possible that this metric of interesting-ness could be flawed for certain kinds of data, where either the notion of interesting-ness is different in the first place (as it is a subjective notion), or where the topic-article distribution is significantly different.
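The median-based ranking described above can be sketched in a few lines. This is only an illustration under stated assumptions: it assumes each article records scores only for the topics that occur in it, and the function name and scores are made up rather than taken from the library.

```python
from statistics import median

def rank_topics_by_median(doc_topic_scores):
    """Sort topics by the median of their per-article scores, highest first.

    `doc_topic_scores` is a list with one {topic_id: score} dict per article.
    Hypothetical helper; assumes scores are only recorded for topics that
    actually occur in an article.
    """
    per_topic = {}
    for scores in doc_topic_scores:
        for topic_id, score in scores.items():
            per_topic.setdefault(topic_id, []).append(score)
    medians = {topic_id: median(values) for topic_id, values in per_topic.items()}
    return sorted(medians.items(), key=lambda item: item[1], reverse=True)

# Made-up data: topic 43 is rare but dominant, topic 95 is common but weak.
articles = [
    {43: 0.60, 95: 0.05},
    {95: 0.10},
    {43: 0.50, 95: 0.08},
    {95: 0.04},
]
ranking = rank_topics_by_median(articles)
print(ranking)
```

In this toy data the focused topic (43) comes out ahead of the generic one (95), which mirrors the ordering seen in the table above.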
The model has a notion of similarity between topics based on a few metrics. The two basic ideas are -
The first captures the notion of lexical similarity, whereas the second captures the notion of relatedness.
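To make the distinction a bit more concrete, here is one way such a combined metric could look: cosine similarity between the word weights of two topics (lexical similarity), averaged with cosine similarity between their per-article scores (relatedness). This is only a sketch of the general idea. It is an assumption on my part and not necessarily how word_doc_sim is defined in the library; all names, weights and values below are hypothetical.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def combined_similarity(words_a, words_b, docs_a, docs_b, weight=0.5):
    """Blend word-based and document-based similarity (illustrative only)."""
    lexical = cosine(words_a, words_b)   # do the topics share words?
    related = cosine(docs_a, docs_b)     # do they occur in the same articles?
    return weight * lexical + (1 - weight) * related

# Made-up topic-word weights and topic-article scores.
surveillance_words = {"government": 0.5, "nsa": 0.3, "security": 0.2}
hacking_words = {"security": 0.5, "attack": 0.3, "exploit": 0.2}
surveillance_docs = {"article-1": 0.6, "article-2": 0.1}
hacking_docs = {"article-2": 0.5, "article-3": 0.4}

score = combined_similarity(surveillance_words, hacking_words,
                            surveillance_docs, hacking_docs)
print(round(score, 3))
```

The `weight` parameter is an arbitrary knob in this sketch for trading off the two notions of similarity.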
Plotting the topic similarity matrix for the word_doc_sim metric, which combines both (1) and (2) -
model.plot_topic_similarities(metric='word_doc_sim')
Looking at this matrix, it is possible to discern a couple of patterns -
Neither of these kinds of topics is well-suited for clustering. Stand-alone topics should typically be clusters of their own, and it is difficult to assign a single cluster to common topics, as they are similar or related to many of the other topics. From a graph-theoretic point of view, the common topics would be hub nodes - connected to many of the other topics, and not part of any single graph partition.
So, the method to cluster topics provides options to exclude such topics from the clustering process.
In addition, the model includes an internal method to determine cluster quality based on the silhouette scores of its constituent nodes. The printed clusters are sorted in decreasing order of cluster quality.
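For readers unfamiliar with silhouette scores, the sketch below shows the general idea on a tiny hand-made distance matrix: each node's silhouette compares its mean distance to its own cluster with its mean distance to the nearest other cluster, and a cluster's quality is taken here as the mean silhouette of its nodes. Averaging is an assumption for illustration; the library's internal method may aggregate differently. The function names, distances and labels are made up.

```python
def silhouette_scores(distance, labels):
    """Silhouette score for each node, given a full distance matrix."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette scores need at least two clusters")
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the node's own cluster
        a = sum(distance[i][j] for j in own) / len(own)
        # b: mean distance to the closest other cluster
        b = min(sum(distance[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != label)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores

def cluster_quality(distance, labels):
    """Mean silhouette per cluster, sorted from best to worst cluster."""
    grouped = {}
    for label, score in zip(labels, silhouette_scores(distance, labels)):
        grouped.setdefault(label, []).append(score)
    quality = {label: sum(v) / len(v) for label, v in grouped.items()}
    return sorted(quality.items(), key=lambda item: item[1], reverse=True)

# Made-up distances (e.g. 1 - similarity) between four topics.
distance = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.6],
    [0.8, 0.9, 0.6, 0.0],
]
labels = ["tight", "tight", "loose", "loose"]
quality = cluster_quality(distance, labels)
print(quality)
```

A cluster whose topics are close to each other and far from everything else gets a quality near 1, while a loosely connected cluster scores closer to 0.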
model.cluster_topics(metric='word_doc_sim', exclude_common=True, exclude_standalone=True)
model.print_topic_clusters()
Cluster 1---------------------------------- Topic #39 Topic #42 Topic #47 Topic #51 ---------- ---------- ---------- ---------- company company product price startup year customer sell founder employee business company investor million service market tech business revenue buy valley executive share business start billion growth product silicon accord platform pay business firm result cost money ceo software sale Cluster 9---------------------------------- Topic #16 Topic #48 Topic #86 ---------- ---------- ---------- node database service system query datum read datum aws state table cloud write index instance cluster row server distribute column application latency sql run message data storage operation select system Cluster 6---------------------------------- Topic #44 Topic #73 Topic #92 Topic #98 ---------- ---------- ---------- ---------- government law police datum agency court crime user security case officer information nsa legal drug data surveillance rule prison privacy fbi state criminal access snowden lawyer arrest service intelligence government year internet document judge call company information order law provide Cluster 13---------------------------------- Topic #12 Topic #21 Topic #53 Topic #62 Topic #71 Topic #80 ---------- ---------- ---------- ---------- ---------- ---------- food sleep child drug animal brain eat day woman patient human study fat hour age health specie cognitive sugar exercise man medical dog memory diet mental study disease bird neuron meat people group doctor cat participant fruit health parent cancer year effect egg depression male treatment tree al farmer feel adult death live ability grow stress sex year find intelligence Topic #89 Topic #94 ---------- ---------- university cell research gene science dna paper human researcher genome study genetic scientist protein publish mouse journal cancer scientific bacteria Cluster 8---------------------------------- Topic #3 Topic #28 Topic #35 Topic #67 Topic #76 Topic #88 
---------- ---------- ---------- ---------- ---------- ---------- group war image al china country public military uk state country european political weapon london attack chinese europe state soviet caption group world de member nuclear mr government united french policy russian copyright country india france campaign force british islamic states germany president missile japan terrorist government german party bomb year saudi north world government american people iran american paris Cluster 0---------------------------------- Topic #13 Topic #20 Topic #40 Topic #58 Topic #64 Topic #74 ---------- ---------- ---------- ---------- ---------- ---------- stack int function thread type type instruction return string process object function register function return lock class haskell address const variable call method language code void code event string monad call struct match queue function return memory null expression task code define byte type list run public list program template def wait call lambda function char call function return promise Topic #75 Topic #91 ---------- ---------- component code react compiler function rust var compile element function return optimization state library render call dom memory import performance Cluster 5---------------------------------- Topic #2 Topic #15 Topic #26 Topic #50 Topic #60 ---------- ---------- ---------- ---------- ---------- key request server security email certificate server network attack message security client connection vulnerability send encryption http packet exploit tor encrypt response ip hacker address password application client password account secure url tcp attacker mail secret api address hack domain public service protocol find contact tls header send system user Cluster 11---------------------------------- Topic #7 Topic #10 Topic #11 Topic #29 Topic #36 Topic #43 ---------- ---------- ---------- ---------- ---------- ---------- water light flight earth ship quantum air laser fly space sea 
theory temperature electron air star water physics flow field space planet year particle surface energy aircraft orbit ocean universe heat high launch moon find physicist bridge fusion plane year island wave material charge drone mars river field oxygen produce pilot galaxy land hole chemical ray rocket telescope site state Topic #54 Topic #93 ---------- ---------- design energy build power wall solar building cost part battery small year room gas shape plant material fuel create oil Cluster 12---------------------------------- Topic #6 Topic #17 Topic #22 Topic #45 Topic #61 Topic #63 ---------- ---------- ---------- ---------- ---------- ---------- problem datum number datum text image machine result point memory mode color theory number matrix byte window pixel number average algorithm file screen map mathematical model function bit line frame computer analysis vector size editor draw mathematic sample prime hash button red proof show graph key click light mathematician distribution line set display render question measure curve buffer key blue Topic #99 ---------- network model learning neural learn machine deep training layer image Cluster 10---------------------------------- Topic #30 Topic #41 Topic #79 ---------- ---------- ---------- stock people bank tax economic money market money card company dao account fund contract credit share social pay investor income payment financial rich cash bank wealth transaction price inequality number Cluster 7---------------------------------- Topic #14 Topic #52 Topic #77 Topic #84 Topic #97 ---------- ---------- ---------- ---------- ---------- windows pi memory device phone linux board cpu phone network system usb core camera internet kernel chip intel battery radio microsoft power performance laptop signal os hardware cache vr mobile run card processor screen device boot km chip smartphone channel user device op home service driver controller gpu hardware fi Cluster 14---------------------------------- Topic #8 Topic 
#27 Topic #81 Topic #95 ---------- ---------- ---------- ---------- python file language test library command program code code run code error language install programming bug java script write problem ruby build programmer fix javascript directory software check framework default system fail read package computer issue write set design run Cluster 4---------------------------------- Topic #23 ---------- support release version feature change add fix update include issue Cluster 2---------------------------------- Topic #1 Topic #49 Topic #82 Topic #85 ---------- ---------- ---------- ---------- team app web facebook people android page ad job google browser twitter company user site user interview apps content post hire swift website site engineer mobile user people employee ios javascript news manager add html content day developer chrome medium Cluster 3---------------------------------- Topic #0 Topic #5 Topic #9 Topic #19 Topic #25 Topic #31 ---------- ---------- ---------- ---------- ---------- ---------- upgrade word apple film uber git fix book font show amazon github close language iphone art driver repository add text mac movie service commit al read design artist trip branch doc english phone netflix airbnb change rebuild document size star ride code david character device world lyft merge update letter software le taxi request michael write ios disney city project Topic #33 Topic #38 Topic #56 Topic #57 Topic #65 Topic #68 ---------- ---------- ---------- ---------- ---------- ---------- music container bitcoin student game day video docker transaction school player drive sound run blockchain college play ms audio service network learn move bob play image wright university win year song application block teach world august stream deploy ethereum education chess store record cluster trust class computer july note machine currency high level hour listen host exchange teacher sport april Topic #69 Topic #70 Topic #78 Topic #83 Topic #87 Topic #96 
---------- ---------- ---------- ---------- ---------- ---------- story car package yahoo city license continue vehicle full restaurant san software read drive debian coffee area copyright advertisement tesla text bar street patent main road subject house housing free times bike link food york oracle newsletter driver send drink francisco include sign model mbox mayer home source york electric mozilla chef building term subscribe wheel date club people copy Stand-alone topics---------------------------------- Topic #34 Topic #90 ---------- ---------- src api llvm bot tool direct gnu slack clang total module sun include avg patch sat solution sms problem anonymous Common topics---------------------------------- Topic #4 Topic #18 Topic #24 Topic #32 Topic #37 Topic #46 ---------- ---------- ---------- ---------- ---------- ---------- people thing book project life percent thing people write open family year feel lot century source year job fact start world developer friend worker human year history build day rate point big great community people income world problem man tool live high question back modern development home low person happen year team man increase bad talk life software house growth Topic #55 Topic #59 Topic #66 Topic #72 ---------- ---------- ---------- ---------- day system google thing back problem computer find hand change technology post run require machine give head design human write sit approach system start begin large world read walk level ai point hour process year article man provide robot ne
Plotting the similarity matrix for the clustered topics -
model.plot_clustered_topic_similarities(metric='word_doc_sim', threshold_percentile=85)
The common as well as standalone topics excluded from the clustering are at the end. Also note that the diagonal values (self-similarity) have been zeroed in the matrix above to allow for easier visualization.
A number of the clusters look fairly reasonable, grouping together related topics. However, there is one large cluster at the end which contains a large number of disparate, wide-ranging topics - from the matrix, it is also evident that these topics are not very similar to most of the other topics that were clustered.
This suggests that excluding certain topics can introduce certain disadvantages.
There are also some other tricky aspects to clustering topics -
In conclusion, I'd love to get more feedback about whether and how this could be useful. Please do get in touch at jayantjain1992@gmail.com if you have any ideas. Feel free to reach out as well if you'd just like to talk about NLP in general!