From idea to prototype in AI.
If you've ever been around a startup or in the tech world for any significant amount of time, you've definitely encountered some, if not all of the following phrases: "agile software development", "prototyping", "feedback loop", "rapid iteration", etc.
This Silicon Valley techno-babble can be distilled down to one simple concept, which just so happens to be the mantra of many a successful entrepreneur: test out your idea as quickly as possible, and then make it better over time. Stated more verbosely, before you invest mind and money into creating a cutting-edge solution to a problem, it might benefit you to get a baseline performance for your task using off-the-shelf techniques. Once you establish the efficacy of a low-cost, easy approach, you can then put on your Elon Musk hat and drive towards #innovation and #disruption.
A concrete example might help illustrate this point:
Let's say our goal was to create a natural language system that effectively allowed someone to converse with an academic paper. This task could be step one of many towards the development of an automated scientific discovery tool. Society can thank us later.
But where do we begin? Well, a part of the solution has to deal with knowledge extraction. In order to create a conversational engine that understands scientific papers, we'll first need to develop an entity recognition module, and this, lucky for us, is the topic of our notebook!
"What's an entity?" you ask? Excellent question. Take a look at the following sentence:
Dr. Abraham is the primary author of this paper, and a physician in the specialty of internal medicine.
Now, it should be relatively straightforward for an English-speaking human to pick out the important concepts in this sentence:
[Dr. Abraham] is the [primary author] of this [paper], and a [physician] in the [specialty] of [internal medicine].
These words and/or phrases are categorized as "entities" because they represent salient ideas, nouns, and noun phrases in the real world. A subset of entities can be "named", in that they correspond to specific places, people, organizations, and so on. A named entity is to a regular entity what "Dr. Abraham" is to a "physician". The good doctor is a real person and an instance of the "physician" class, and is therefore considered "named". Examples of named entities include "Google", "Neil deGrasse Tyson", and "Tokyo", while regular, garden-variety entities can include the list just mentioned, as well as things like "dog", "newspaper", "task", etc.
Let's see if we can get a computer to run this kind of analysis to pull important concepts from sentences.
For our conversational academic paper program, we won't be satisfied with simply capturing named entities, because we need to understand the relationships between general concepts as well as actual things, places, etc. Unfortunately, while most out-of-the-box text processing libraries have a moderately useful named entity recognizer, they have little to no support for a generalized entity recognizer.
This is because of a subtle, yet important constraint.
Entities, as we've discussed, correspond to a superset of named entities, which should make them easier to extract. Indeed, blindly pulling all entities from a text source is in fact simple, but it's sadly not all that useful. In order to justify this exercise, we'd need to develop an entity extraction approach that is restricted to, or is cognizant of, some particular domain, for example, neuroscience, psychology, computer science, economics, etc. This paradoxical complexity makes it nontrivial to create a generic, but useful, entity recognizer. Hence the lack of support in most open-source libraries that deal with natural language processing.
To summarize our task, then: we must generate a set of entities from a scientific paper that is larger than a simple list of named entities, but smaller than the giant list of all entities, restricted to the domain of the particular paper in question.
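Stated in set terms, our target sits strictly between two known sets. Here's a tiny sketch of the idea (all three vocabularies below are made-up placeholders, not real extraction output):

```python
# toy vocabularies -- every set here is a hypothetical placeholder
named_ents = {'google', 'tokyo', 'neil degrasse tyson'}
all_ents = named_ents | {'dog', 'newspaper', 'task', 'gradient', 'convexity'}
domain_vocab = {'gradient', 'convexity', 'inference'}  # e.g. a machine learning lexicon

# keep the named entities, plus any generic entity the domain lexicon recognizes
target = named_ents | (all_ents & domain_vocab)

# bigger than the named-entity list, smaller than the list of all entities
assert named_ents < target < all_ents
print(sorted(target))
```

The hard part, of course, is that no ready-made `domain_vocab` exists for an arbitrary paper; the rest of this notebook is about approximating one with heuristics.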
Yikes. Are you sweating a little? Because I am. Instead of reaching for some Ibuprofen and deep learning pills, let's make a prototype using a little ingenuity, simple open-source code, and a lot of heuristics. Hopefully, through this process, we'll also learn a bit about the text processing pipeline that brings understanding natural language into the realm of the possible.
Enough chit-chat. Let's get to it!
%load_ext autoreload
%autoreload 2
Fun fact: Curious about what 'autoreload' does? Check this out.
import pandas as pd
import spacy
from spacy.displacy.render import EntityRenderer
from IPython.core.display import display, HTML
Let's do some basic housekeeping before we start diving headfirst into entity extraction. We'll need to deal with visualization, load up a language model, and of course, examine/set-up our data source.
Our prototype will lean heavily on a popular natural language processing (NLP) library known as spaCy, which also has a wonderful set of classes and methods defined to help visualize parts of the NLP pipeline. Up top, where we've imported modules, you'll have noticed that we're pulling 'EntityRenderer' from spaCy's displacy module, as we'll be repurposing some of this code for our... um... purposes. In general, this is a good exercise if you ever want to get your hands dirty and really learn how certain classes work in your friendly neighborhood open-source projects. Nothing should ever be off-limits or a black box; always dissect and play with your code before you eat it.
Wander on over to spaCy's website, and you'll quickly discover that they've put some serious thought into making the user interface absolutely gorgeous. (While Matthew undeniably had some input on this, I'm going to make an intelligent assumption that the design ideas are probably Ines' contribution).
<rant> Why spend so much time discussing visualization? Well, one of my biggest pet peeves is this: even if you can create a product, if you don't put in the time to make it look beautiful, or delightful to use, then you don't care about packaging your ideas for export to an audience. And that makes me sad. Once you get something working, make it pretty. </rant>
def custom_render(doc, df, column, options={}, page=False, minify=False, idx=0):
"""Overload the spaCy built-in rendering to allow custom part-of-speech (POS) tags.
Keyword arguments:
doc -- a spaCy nlp doc object
df -- a pandas dataframe object
column -- the name of a column of interest in the dataframe
options -- various options to feed into the spaCy renderer, including colors
page -- rendering markup as full HTML page (default False)
minify -- for compact HTML (default False)
idx -- index for specific query or doc in dataframe (default 0)
"""
renderer, converter = EntityRenderer, parse_custom_ents
renderer = renderer(options=options)
parsed = [converter(doc, df=df, idx=idx, column=column)]
html = renderer.render(parsed, page=page, minify=minify).strip()
return display(HTML(html))
def parse_custom_ents(doc, df, idx, column):
"""Parse custom entity types that aren't in the original spaCy module.
Keyword arguments:
doc -- a spaCy nlp doc object
df -- a pandas dataframe object
idx -- index for specific query or doc in dataframe
column -- the name of a column of interest in the dataframe
"""
if column in df.columns:
entities = df[column][idx]
ents = [{'start': ent[1], 'end': ent[2], 'label': ent[3]}
for ent in entities]
else:
ents = [{'start': ent.start_char, 'end': ent.end_char, 'label': ent.label_}
for ent in doc.ents]
return {'text': doc.text, 'ents': ents, 'title': None}
def render_entities(idx, df, options={}, column='named_ents'):
"""A wrapper function to get text from a dataframe and render it visually in Jupyter notebooks.
Keyword arguments:
idx -- index for specific query or doc in dataframe (default 0)
df -- a pandas dataframe object
options -- various options to feed into the spaCy renderer, including colors
column -- the name of a column of interest in the dataframe (default 'named_ents')
"""
text = df['text'][idx]
custom_render(nlp(text), df=df, column=column, options=options, idx=idx)
# colors for additional part of speech tags we want to visualize
options = {
'colors': {'COMPOUND': '#FE6BFE', 'PROPN': '#18CFE6', 'NOUN': '#18CFE6', 'NP': '#1EECA6', 'ENTITY': '#FF8800'}
}
pd.set_option('display.max_rows', 10) # edit how jupyter will render our pandas dataframes
pd.options.mode.chained_assignment = None # prevent warning about working on a copy of a dataframe
spaCy's pre-built models are trained on different corpora of text to capture parts of speech, extract named entities, and, in general, understand how to tokenize words into chunks that have meaning in a given language.
We'll grab the 'en_core_web_lg' model by running the following command in the shell (comment it out once you've run it so you don't keep downloading it every time you go through the notebook).
# !python -m spacy download en_core_web_lg
nlp = spacy.load('en_core_web_lg')
Fun fact: We can run shell commands in a Jupyter notebook by using the bang operator. This is another example of special IPython syntax, of which we saw an example at the beginning with '%autoreload'.
As our data source, we'll be using papers presented at the Neural Information Processing Systems (NIPS) conference held in a different location around the world each year. NIPS is the premier conference for all things machine learning, and considering our goal with this notebook, is an apropos choice to source our data. We'll pull a conveniently packaged dataset from Kaggle, a data science competition site, and then work with a subset of the papers to keep our prototyping as lean and fast as possible.
Once we've grabbed the files using Kaggle's API, we'll take a look at what we're working with. Let's store everything in a separate 'data' folder to keep our directory clean. I've discarded all extra files and renamed the essential one to 'nips.csv'. You'll see a few other files in there, but ignore them for now.
PATH = './data/'
!ls {PATH}
freq_words.csv nips.csv
Fun fact: You can use python variables in shell commands by nesting them inside curly braces.
file = 'nips.csv'
df = pd.read_csv(f'{PATH}{file}')
mini_df = df[:10]
mini_df.index = pd.RangeIndex(len(mini_df.index))
# comment this out to run on full dataset
df = mini_df
Now that we're all ready to get started, let's come up with a general list of tasks to guide our approach:
1. Extract named entities using spaCy's built-in recognizer.
2. Widen the net by pulling nouns with the part-of-speech tagger.
3. Merge the nouns and named entities.
4. Capture multi-word concepts with noun phrases and compound nouns.
5. Combine everything into one set and clean up the results.
That doesn't look too bad, now does it? Let's build ourselves a prototype entity extractor.
display(df)
Id | Title | EventType | PdfName | Abstract | PaperText | |
---|---|---|---|---|---|---|
0 | 5677 | Double or Nothing: Multiplicative Incentive Me... | Poster | 5677-double-or-nothing-multiplicative-incentiv... | Crowdsourcing has gained immense popularity in... | Double or Nothing: Multiplicative\nIncentive M... |
1 | 5941 | Learning with Symmetric Label Noise: The Impor... | Spotlight | 5941-learning-with-symmetric-label-noise-the-i... | Convex potential minimisation is the de facto ... | Learning with Symmetric Label Noise: The\nImpo... |
2 | 6019 | Algorithmic Stability and Uniform Generalization | Poster | 6019-algorithmic-stability-and-uniform-general... | One of the central questions in statistical le... | Algorithmic Stability and Uniform Generalizati... |
3 | 6035 | Adaptive Low-Complexity Sequential Inference f... | Poster | 6035-adaptive-low-complexity-sequential-infere... | We develop a sequential low-complexity inferen... | Adaptive Low-Complexity Sequential Inference f... |
4 | 5978 | Covariance-Controlled Adaptive Langevin Thermo... | Poster | 5978-covariance-controlled-adaptive-langevin-t... | Monte Carlo sampling for Bayesian posterior in... | Covariance-Controlled Adaptive Langevin\nTherm... |
5 | 5714 | Robust Portfolio Optimization | Poster | 5714-robust-portfolio-optimization.pdf | We propose a robust portfolio optimization app... | Robust Portfolio Optimization\n\nFang Han\nDep... |
6 | 5937 | Logarithmic Time Online Multiclass prediction | Spotlight | 5937-logarithmic-time-online-multiclass-predic... | We study the problem of multiclass classificat... | Logarithmic Time Online Multiclass prediction\... |
7 | 5802 | Planar Ultrametrics for Image Segmentation | Poster | 5802-planar-ultrametrics-for-image-segmentatio... | We study the problem of hierarchical clusterin... | Planar Ultrametrics for Image Segmentation\n\n... |
8 | 5776 | Expressing an Image Stream with a Sequence of ... | Poster | 5776-expressing-an-image-stream-with-a-sequenc... | We propose an approach for generating a sequen... | Expressing an Image Stream with a Sequence of\... |
9 | 5814 | Parallel Correlation Clustering on Big Graphs | Poster | 5814-parallel-correlation-clustering-on-big-gr... | Given a similarity graph between items, correl... | Parallel Correlation Clustering on Big Graphs\... |
lower = lambda x: x.lower() # make everything lowercase
df = pd.DataFrame(df['Abstract'].apply(lower))
df.columns = ['text']
display(df)
text | |
---|---|
0 | crowdsourcing has gained immense popularity in... |
1 | convex potential minimisation is the de facto ... |
2 | one of the central questions in statistical le... |
3 | we develop a sequential low-complexity inferen... |
4 | monte carlo sampling for bayesian posterior in... |
5 | we propose a robust portfolio optimization app... |
6 | we study the problem of multiclass classificat... |
7 | we study the problem of hierarchical clusterin... |
8 | we propose an approach for generating a sequen... |
9 | given a similarity graph between items, correl... |
Initially, there was quite a bit of metadata associated with each entry, including a unique identifier, the type of paper presented at the conference, as well as the actual paper text. After pulling out just the abstracts, we've ended up with a clean, ready-to-go dataframe and can begin extracting entities.
def extract_named_ents(text):
"""Extract named entities, with beginning and end idx, using spaCy's out-of-the-box model.
Keyword arguments:
text -- the actual text source from which to extract entities
"""
return [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in nlp(text).ents]
def add_named_ents(df):
"""Create new column in data frame with named entity tuple extracted.
Keyword arguments:
df -- a dataframe object
"""
df['named_ents'] = df['text'].apply(extract_named_ents)
add_named_ents(df)
display(df)
text | named_ents | |
---|---|---|
0 | crowdsourcing has gained immense popularity in... | [(several hundred, 896, 911, CARDINAL)] |
1 | convex potential minimisation is the de facto ... | [(2008, 109, 113, DATE), (2008, 500, 504, DATE... |
2 | one of the central questions in statistical le... | [(one, 0, 3, CARDINAL)] |
3 | we develop a sequential low-complexity inferen... | [] |
4 | monte carlo sampling for bayesian posterior in... | [] |
5 | we propose a robust portfolio optimization app... | [] |
6 | we study the problem of multiclass classificat... | [] |
7 | we study the problem of hierarchical clusterin... | [] |
8 | we propose an approach for generating a sequen... | [] |
9 | given a similarity graph between items, correl... | [(3-approximation, 257, 272, CARDINAL), (graph... |
column = 'named_ents'
render_entities(9, df, options=options, column=column) # take a look at one of the abstracts
A quick glance at some of the abstracts shows that while we are able to extract numeric entities, not much else comes through. Not great. But then again, this is exactly why simply extracting named entities is not enough. On the plus side, our intuition about built-in models and scientific text was spot on! The spaCy named entity recognizer just wasn't exposed to this category of corpora and was instead trained on blogs, news, and comments. Academic papers don't use the most common English words, so it isn't unreasonable to expect a generally trained model to fail when confronted with text in such a restricted domain.
Look at a few more abstracts by changing the index parameter in our "render_entities" function to convince yourself of the following notion:
We need to widen our search.
def extract_nouns(text):
"""Extract a few types of nouns, with beginning and end idx, using spaCy's POS (part-of-speech) tagger.
Keyword arguments:
text -- the actual text source from which to extract entities
"""
keep_pos = ['PROPN', 'NOUN']
return [(tok.text, tok.idx, tok.idx+len(tok.text), tok.pos_) for tok in nlp(text) if tok.pos_ in keep_pos]
def add_nouns(df):
"""Create new column in data frame with nouns extracted.
Keyword arguments:
df -- a dataframe object
"""
df['nouns'] = df['text'].apply(extract_nouns)
add_nouns(df)
display(df)
text | named_ents | nouns | |
---|---|---|---|
0 | crowdsourcing has gained immense popularity in... | [(several hundred, 896, 911, CARDINAL)] | [(crowdsourcing, 0, 13, NOUN), (popularity, 33... |
1 | convex potential minimisation is the de facto ... | [(2008, 109, 113, DATE), (2008, 500, 504, DATE... | [(minimisation, 17, 29, NOUN), (approach, 46, ... |
2 | one of the central questions in statistical le... | [(one, 0, 3, CARDINAL)] | [(questions, 19, 28, NOUN), (learning, 44, 52,... |
3 | we develop a sequential low-complexity inferen... | [] | [(complexity, 28, 38, NOUN), (inference, 39, 4... |
4 | monte carlo sampling for bayesian posterior in... | [] | [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN... |
5 | we propose a robust portfolio optimization app... | [] | [(portfolio, 20, 29, NOUN), (optimization, 30,... |
6 | we study the problem of multiclass classificat... | [] | [(problem, 13, 20, NOUN), (multiclass, 24, 34,... |
7 | we study the problem of hierarchical clusterin... | [] | [(problem, 13, 20, NOUN), (clustering, 37, 47,... |
8 | we propose an approach for generating a sequen... | [] | [(approach, 14, 22, NOUN), (sequence, 40, 48, ... |
9 | given a similarity graph between items, correl... | [(3-approximation, 257, 272, CARDINAL), (graph... | [(similarity, 8, 18, NOUN), (graph, 19, 24, NO... |
column = 'nouns'
render_entities(0, df, options=options, column=column)
This is more colorful. But is it useful? It appears as if we are able to pull out a lot of concepts, but things like "rest", "popularity", and "data" aren't all that interesting (at least in the first abstract). Our search is too wide at this point.
Good to know. Let's power through for now, and merge our lists of entities.
def extract_named_nouns(row_series):
"""Combine nouns and non-numerical entities.
Keyword arguments:
row_series -- a Pandas Series object
"""
ents = set()
idxs = set()
# remove duplicates and merge two lists together
for noun_tuple in row_series['nouns']:
for named_ents_tuple in row_series['named_ents']:
if noun_tuple[1] == named_ents_tuple[1]:
idxs.add(noun_tuple[1])
ents.add(named_ents_tuple)
if noun_tuple[1] not in idxs:
ents.add(noun_tuple)
return sorted(list(ents), key=lambda x: x[1])
def add_named_nouns(df):
"""Create new column in data frame with nouns and named ents.
Keyword arguments:
df -- a dataframe object
"""
df['named_nouns'] = df.apply(extract_named_nouns, axis=1)
add_named_nouns(df)
display(df)
text | named_ents | nouns | named_nouns | |
---|---|---|---|---|
0 | crowdsourcing has gained immense popularity in... | [(several hundred, 896, 911, CARDINAL)] | [(crowdsourcing, 0, 13, NOUN), (popularity, 33... | [(crowdsourcing, 0, 13, NOUN), (popularity, 33... |
1 | convex potential minimisation is the de facto ... | [(2008, 109, 113, DATE), (2008, 500, 504, DATE... | [(minimisation, 17, 29, NOUN), (approach, 46, ... | [(minimisation, 17, 29, NOUN), (approach, 46, ... |
2 | one of the central questions in statistical le... | [(one, 0, 3, CARDINAL)] | [(questions, 19, 28, NOUN), (learning, 44, 52,... | [(questions, 19, 28, NOUN), (learning, 44, 52,... |
3 | we develop a sequential low-complexity inferen... | [] | [(complexity, 28, 38, NOUN), (inference, 39, 4... | [(complexity, 28, 38, NOUN), (inference, 39, 4... |
4 | monte carlo sampling for bayesian posterior in... | [] | [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN... | [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN... |
5 | we propose a robust portfolio optimization app... | [] | [(portfolio, 20, 29, NOUN), (optimization, 30,... | [(portfolio, 20, 29, NOUN), (optimization, 30,... |
6 | we study the problem of multiclass classificat... | [] | [(problem, 13, 20, NOUN), (multiclass, 24, 34,... | [(problem, 13, 20, NOUN), (multiclass, 24, 34,... |
7 | we study the problem of hierarchical clusterin... | [] | [(problem, 13, 20, NOUN), (clustering, 37, 47,... | [(problem, 13, 20, NOUN), (clustering, 37, 47,... |
8 | we propose an approach for generating a sequen... | [] | [(approach, 14, 22, NOUN), (sequence, 40, 48, ... | [(approach, 14, 22, NOUN), (sequence, 40, 48, ... |
9 | given a similarity graph between items, correl... | [(3-approximation, 257, 272, CARDINAL), (graph... | [(similarity, 8, 18, NOUN), (graph, 19, 24, NO... | [(similarity, 8, 18, NOUN), (graph, 19, 24, NO... |
column = 'named_nouns'
render_entities(1, df, options=options, column=column)
In this step, we're just combining the named entities extracted using spaCy's built-in model with nouns identified by the part-of-speech (POS) tagger. We're dropping any numeric entities for now because they are harder to deal with and don't really represent new concepts. You'll notice (if you look closely enough) that we are also ignoring any hyphenated entities. In spaCy's tokenizer, it is possible to prevent hyphenated words from being split apart, but we'll reserve this, along with other types of advanced fine-tuning or low-level editing, for if and when we move beyond the prototype phase.
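For the curious, here's roughly what that tokenizer tweak could look like. Treat it as a sketch rather than part of the prototype: the filter assumes the hyphen alternation appears verbatim in spaCy's default infix patterns (which can vary between versions), and a blank pipeline is enough to demonstrate the effect:

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank('en')  # the tokenizer alone is enough for this demo
print([tok.text for tok in nlp('a low-complexity procedure')])  # 'low-complexity' gets split

# drop the default infix patterns that split on hyphens between letters
# (assumes the hyphen characters appear verbatim in the pattern strings)
infixes = [pattern for pattern in nlp.Defaults.infixes if '-|–|—' not in pattern]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer
print([tok.text for tok in nlp('a low-complexity procedure')])  # now kept as one token
```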
In the past few steps, we've dealt with one-word entities. However, it's also entirely permissible for combinations of two or more words to represent a single concept. This means that in order for our prototype to successfully capture the most relevant concepts, we'll need to pull n-length phrases from our academic abstracts in addition to single-word entities.
Even mild exposure to computer science, or any of the various isoforms of engineering, will have introduced you to the idea of an abstraction, wherein low-level concepts are bundled into higher-order relationships. The noun phrase or chunk is an abstraction which consists of two or more words, and is the by-product of dependency parsing, POS tagging, and tokenization. spaCy's POS tagger is essentially a statistical model which learns to predict the tag (noun, verb, adjective, etc.) for a given word using examples of tagged sentences.
This supervised machine learning approach relies on tokens generated from splitting text into somewhat atomic units using a rule-based tokenizer (although there are some interesting unsupervised models out there as well). Dependency parsing then uncovers relationships between these tagged tokens, allowing us to finally extract noun chunks or phrases of relevance.
The full pipeline goes something like this:
raw text → tokenization → POS tagging → dependency parsing → noun chunk extraction
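To make the hand-off between those stages concrete, here's a toy, rule-based stand-in for the last step. Real noun chunking walks the dependency parse; this greedy version just groups runs of determiner/adjective/noun tags, which is only a rough approximation:

```python
def chunk_nouns(tagged):
    """Group consecutive determiner/adjective/noun tags into chunks --
    a toy approximation of dependency-based noun chunking."""
    chunks, run = [], []
    for word, tag in tagged + [('', 'EOS')]:  # sentinel flushes the final run
        if tag in ('DET', 'ADJ', 'NOUN', 'PROPN'):
            run.append((word, tag))
        else:
            if any(t in ('NOUN', 'PROPN') for _, t in run):  # a chunk must contain a noun
                chunks.append(' '.join(w for w, _ in run))
            run = []
    return chunks

# hypothetical tagger output for part of our dummy sentence
tagged = [('the', 'DET'), ('primary', 'ADJ'), ('author', 'NOUN'),
          ('of', 'ADP'), ('this', 'DET'), ('paper', 'NOUN')]
print(chunk_nouns(tagged))  # ['the primary author', 'this paper']
```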
Theoretically, one could swap out noun chunk extraction for named entity recognition, but that's the part of the pipeline we are attempting to modify for our own purposes, because we want n-length entities. Barring our custom intrusion, however, this is exactly how spaCy's built-in model works! If you don't believe me (which you shouldn't, since you're a scientist), scroll up to the very top of this notebook to convince yourself.
Neat huh? Need a visualization of tokenization, POS tagging, and dependency parsing to convince you of just how cool this is?
Take a look:
text = "Dr. Abraham is the primary author of this paper, and a physician in the specialty of internal medicine."
spacy.displacy.render(nlp(text), jupyter=True) # generating raw-markup using spacy's built-in renderer
Just gorgeous. Following our pipeline, let's use this dependency tree to tease out the noun phrases in our dummy sentence. We'll have to create a few functions to do the heavy lifting first (we can reuse these guys for our full dataset later), and then use a simple procedure to visualize our example.
def extract_noun_phrases(text):
"""Combine noun phrases.
Keyword arguments:
text -- the actual text source from which to extract entities
"""
return [(chunk.text, chunk.start_char, chunk.end_char, chunk.label_) for chunk in nlp(text).noun_chunks]
def add_noun_phrases(df):
"""Create new column in data frame with noun phrases.
Keyword arguments:
df -- a dataframe object
"""
df['noun_phrases'] = df['text'].apply(extract_noun_phrases)
def visualize_noun_phrases(text):
"""Create a temporary dataframe to extract and visualize noun phrases.
Keyword arguments:
text -- the actual text source from which to extract entities
"""
df = pd.DataFrame([text])
df.columns = ['text']
add_noun_phrases(df)
column = 'noun_phrases'
render_entities(0, df, options=options, column=column)
visualize_noun_phrases(text)
Compare this to what we'd originally set out to accomplish:
[Dr. Abraham] is the [primary author] of this [paper], and a [physician] in the [specialty] of [internal medicine].
I don't know about you, but every time I see this work, I'm blown away by both the intricate complexity and beautiful simplicity of this process. Ignoring the prepositions, with one single move, we've done a damn-near perfect job of extracting the main ideas from this sentence. How amazing is that?!
Hats off to spaCy, and the hordes of data scientists, machine learning engineers, and linguists that made this possible.
Now, if we just use this approach and add together the single-word entities we extracted from our academic abstracts earlier, we should be getting close to a pretty awesome set of concepts! Let's capture some noun phrases and see what we get.
add_noun_phrases(df)
display(df)
text | named_ents | nouns | named_nouns | noun_phrases | |
---|---|---|---|---|---|
0 | crowdsourcing has gained immense popularity in... | [(several hundred, 896, 911, CARDINAL)] | [(crowdsourcing, 0, 13, NOUN), (popularity, 33... | [(crowdsourcing, 0, 13, NOUN), (popularity, 33... | [(crowdsourcing, 0, 13, NP), (immense populari... |
1 | convex potential minimisation is the de facto ... | [(2008, 109, 113, DATE), (2008, 500, 504, DATE... | [(minimisation, 17, 29, NOUN), (approach, 46, ... | [(minimisation, 17, 29, NOUN), (approach, 46, ... | [(convex potential minimisation, 0, 29, NP), (... |
2 | one of the central questions in statistical le... | [(one, 0, 3, CARDINAL)] | [(questions, 19, 28, NOUN), (learning, 44, 52,... | [(questions, 19, 28, NOUN), (learning, 44, 52,... | [(the central questions, 7, 28, NP), (statisti... |
3 | we develop a sequential low-complexity inferen... | [] | [(complexity, 28, 38, NOUN), (inference, 39, 4... | [(complexity, 28, 38, NOUN), (inference, 39, 4... | [(we, 0, 2, NP), (a sequential low-complexity ... |
4 | monte carlo sampling for bayesian posterior in... | [] | [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN... | [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN... | [(bayesian posterior inference, 25, 53, NP), (... |
5 | we propose a robust portfolio optimization app... | [] | [(portfolio, 20, 29, NOUN), (optimization, 30,... | [(portfolio, 20, 29, NOUN), (optimization, 30,... | [(we, 0, 2, NP), (a robust portfolio optimizat... |
6 | we study the problem of multiclass classificat... | [] | [(problem, 13, 20, NOUN), (multiclass, 24, 34,... | [(problem, 13, 20, NOUN), (multiclass, 24, 34,... | [(we, 0, 2, NP), (the problem, 9, 20, NP), (mu... |
7 | we study the problem of hierarchical clusterin... | [] | [(problem, 13, 20, NOUN), (clustering, 37, 47,... | [(problem, 13, 20, NOUN), (clustering, 37, 47,... | [(we, 0, 2, NP), (the problem, 9, 20, NP), (hi... |
8 | we propose an approach for generating a sequen... | [] | [(approach, 14, 22, NOUN), (sequence, 40, 48, ... | [(approach, 14, 22, NOUN), (sequence, 40, 48, ... | [(we, 0, 2, NP), (an approach, 11, 22, NP), (a... |
9 | given a similarity graph between items, correl... | [(3-approximation, 257, 272, CARDINAL), (graph... | [(similarity, 8, 18, NOUN), (graph, 19, 24, NO... | [(similarity, 8, 18, NOUN), (graph, 19, 24, NO... | [(a similarity graph, 6, 24, NP), (items, 33, ... |
column = 'noun_phrases'
render_entities(0, df, options=options, column=column)
Hmm... should've seen this coming. While we've now done a great job of extracting noun phrases from our abstracts, we're running into the same problem as before. Our funnel is too wide, and we're pulling uninteresting bigrams like "the simplicity", "the rest", and "this mechanism". These chunks are indeed noun phrases, but not domain-specific concepts. Not to mention, we still have to deal with those pesky prepositions (try saying that five times fast).
Let's see if we can narrow our search and just get the most important phrases.
def extract_compounds(text):
"""Extract compound noun phrases with beginning and end idxs.
Keyword arguments:
text -- the actual text source from which to extract entities
"""
comp_idx = 0
compound = []
compound_nps = []
tok_idx = 0
for idx, tok in enumerate(nlp(text)):
if tok.dep_ == 'compound':
# capture hyphenated compounds
children = ''.join([c.text for c in tok.children])
if '-' in children:
compound.append(''.join([children, tok.text]))
else:
compound.append(tok.text)
# remember starting index of first child in compound or word
try:
tok_idx = [c for c in tok.children][0].idx
except IndexError:
if len(compound) == 1:
tok_idx = tok.idx
comp_idx = tok.i
# append the last word in a compound phrase
if tok.i - comp_idx == 1:
compound.append(tok.text)
if len(compound) > 1:
compound = ' '.join(compound)
compound_nps.append((compound, tok_idx, tok_idx+len(compound), 'COMPOUND'))
# reset parameters
tok_idx = 0
compound = []
return compound_nps
def add_compounds(df):
"""Create new column in data frame with compound noun phrases.
Keyword arguments:
df -- a dataframe object
"""
df['compounds'] = df['text'].apply(extract_compounds)
add_compounds(df)
display(df)
text | named_ents | nouns | named_nouns | noun_phrases | compounds | |
---|---|---|---|---|---|---|
0 | crowdsourcing has gained immense popularity in... | [(several hundred, 896, 911, CARDINAL)] | [(crowdsourcing, 0, 13, NOUN), (popularity, 33... | [(crowdsourcing, 0, 13, NOUN), (popularity, 33... | [(crowdsourcing, 0, 13, NP), (immense populari... | [(machine learning applications, 47, 76, COMPO... |
1 | convex potential minimisation is the de facto ... | [(2008, 109, 113, DATE), (2008, 500, 504, DATE... | [(minimisation, 17, 29, NOUN), (approach, 46, ... | [(minimisation, 17, 29, NOUN), (approach, 46, ... | [(convex potential minimisation, 0, 29, NP), (... | [(label noise, 143, 154, COMPOUND), (function ... |
2 | one of the central questions in statistical le... | [(one, 0, 3, CARDINAL)] | [(questions, 19, 28, NOUN), (learning, 44, 52,... | [(questions, 19, 28, NOUN), (learning, 44, 52,... | [(the central questions, 7, 28, NP), (statisti... | [(learning theory, 44, 59, COMPOUND), (inferen... |
3 | we develop a sequential low-complexity inferen... | [] | [(complexity, 28, 38, NOUN), (inference, 39, 4... | [(complexity, 28, 38, NOUN), (inference, 39, 4... | [(we, 0, 2, NP), (a sequential low-complexity ... | [(low-complexity inference procedure, 28, 62, ... |
4 | monte carlo sampling for bayesian posterior in... | [] | [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN... | [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN... | [(bayesian posterior inference, 25, 53, NP), (... | [(monte carlo sampling, 0, 20, COMPOUND), (mac... |
5 | we propose a robust portfolio optimization app... | [] | [(portfolio, 20, 29, NOUN), (optimization, 30,... | [(portfolio, 20, 29, NOUN), (optimization, 30,... | [(we, 0, 2, NP), (a robust portfolio optimizat... | [(portfolio optimization approach, 20, 51, COM... |
6 | we study the problem of multiclass classificat... | [] | [(problem, 13, 20, NOUN), (multiclass, 24, 34,... | [(problem, 13, 20, NOUN), (multiclass, 24, 34,... | [(we, 0, 2, NP), (the problem, 9, 20, NP), (mu... | [(test time, 134, 143, COMPOUND), (tree constr... |
7 | we study the problem of hierarchical clusterin... | [] | [(problem, 13, 20, NOUN), (clustering, 37, 47,... | [(problem, 13, 20, NOUN), (clustering, 37, 47,... | [(we, 0, 2, NP), (the problem, 9, 20, NP), (hi... | [(lp relaxation, 182, 195, COMPOUND), (cost pe... |
8 | we propose an approach for generating a sequen... | [] | [(approach, 14, 22, NOUN), (sequence, 40, 48, ... | [(approach, 14, 22, NOUN), (sequence, 40, 48, ... | [(we, 0, 2, NP), (an approach, 11, 22, NP), (a... | [(image stream, 77, 89, COMPOUND), (image stre... |
9 | given a similarity graph between items, correl... | [(3-approximation, 257, 272, CARDINAL), (graph... | [(similarity, 8, 18, NOUN), (graph, 19, 24, NO... | [(similarity, 8, 18, NOUN), (graph, 19, 24, NO... | [(a similarity graph, 6, 24, NP), (items, 33, ... | [(similarity graph, 8, 24, COMPOUND), (correla... |
column = 'compounds'
render_entities(0, df, options=options, column=column)
That's starting to look pretty good! By targeting words in the dependency tree that were tagged as belonging to a compound, we were able to drive the number of noun phrases down rather nicely. Next, we'll add these phrases to the list of entities we extracted from each abstract, to create a set that will include unigrams, bigrams, and more. Oh my!
def extract_comp_nouns(row_series, cols=[]):
    """Combine compound noun phrases and entities.
    Keyword arguments:
    row_series -- a Pandas Series object
    cols -- a list of column names whose entity tuples should be merged
    """
    return {noun_tuple[0] for col in cols for noun_tuple in row_series[col]}
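As a quick illustration (on a hypothetical two-column row, with made-up spans), the set comprehension above takes the first element of each tuple in every listed column and merges them into one set:

```python
import pandas as pd

# hypothetical row with two columns of (text, start, end, label) tuples
row = pd.Series({
    'nouns':     [('crowdsourcing', 0, 13, 'NOUN'), ('popularity', 33, 43, 'NOUN')],
    'compounds': [('machine learning applications', 47, 76, 'COMPOUND')],
})

# same comprehension as extract_comp_nouns: keep just the entity text
merged = {t[0] for col in ['nouns', 'compounds'] for t in row[col]}
print(merged)
# {'crowdsourcing', 'popularity', 'machine learning applications'}
```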
def add_comp_nouns(df, cols=[]):
    """Create new column in data frame with merged entities.
    Keyword arguments:
    df -- a dataframe object
    cols -- a list of column names that need to be merged
    """
    df['comp_nouns'] = df.apply(extract_comp_nouns, axis=1, cols=cols)
cols = ['nouns', 'compounds']
add_comp_nouns(df, cols=cols)
display(df)
text | named_ents | nouns | named_nouns | noun_phrases | compounds | comp_nouns | clean_ents | |
---|---|---|---|---|---|---|---|---|
0 | crowdsourcing has gained immense popularity in... | [(several hundred, 896, 911, CARDINAL)] | [(crowdsourcing, 0, 13, NOUN), (popularity, 33... | [(crowdsourcing, 0, 13, NOUN), (popularity, 33... | [(crowdsourcing, 0, 13, NP), (immense populari... | [(machine learning applications, 47, 76, COMPO... | {mechanisms, low-quality data, problem, requir... | {mechanisms, low-quality data, problem, worker... |
1 | convex potential minimisation is the de facto ... | [(2008, 109, 113, DATE), (2008, 500, 504, DATE... | [(minimisation, 17, 29, NOUN), (approach, 46, ... | [(minimisation, 17, 29, NOUN), (approach, 46, ... | [(convex potential minimisation, 0, 29, NP), (... | [(label noise, 143, 154, COMPOUND), (function ... | {function class, solution, result, performance... | {function class, solution, result, svm, paper,... |
2 | one of the central questions in statistical le... | [(one, 0, 3, CARDINAL)] | [(questions, 19, 28, NOUN), (learning, 44, 52,... | [(questions, 19, 28, NOUN), (learning, 44, 52,... | [(the central questions, 7, 28, NP), (statisti... | [(learning theory, 44, 59, COMPOUND), (inferen... | {result, conditions, dimensionality reduction ... | {dimensionality reduction methods, conditions,... |
3 | we develop a sequential low-complexity inferen... | [] | [(complexity, 28, 38, NOUN), (inference, 39, 4... | [(complexity, 28, 38, NOUN), (inference, 39, 4... | [(we, 0, 2, NP), (a sequential low-complexity ... | [(low-complexity inference procedure, 28, 62, ... | {large-sample limit, concentration, asymptotic... | {classes, form parametric, number, function, e... |
4 | monte carlo sampling for bayesian posterior in... | [] | [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN... | [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN... | [(bayesian posterior inference, 25, 53, NP), (... | [(monte carlo sampling, 0, 20, COMPOUND), (mac... | {machine learning, discrete-time analogues, gr... | {machine learning, discrete-time analogues, gr... |
5 | we propose a robust portfolio optimization app... | [] | [(portfolio, 20, 29, NOUN), (optimization, 30,... | [(portfolio, 20, 29, NOUN), (optimization, 30,... | [(we, 0, 2, NP), (a robust portfolio optimizat... | [(portfolio optimization approach, 20, 51, COM... | {work, optimization, portfolio, dependence, th... | {dependence, theory, events, method, dimension... |
6 | we study the problem of multiclass classificat... | [] | [(problem, 13, 20, NOUN), (multiclass, 24, 34,... | [(problem, 13, 20, NOUN), (multiclass, 24, 34,... | [(we, 0, 2, NP), (the problem, 9, 20, NP), (mu... | [(test time, 134, 143, COMPOUND), (tree constr... | {entropy, problem, classes, conditions, number... | {problem, classes, conditions, number, functio... |
7 | we study the problem of hierarchical clusterin... | [] | [(problem, 13, 20, NOUN), (clustering, 37, 47,... | [(problem, 13, 20, NOUN), (clustering, 37, 47,... | [(we, 0, 2, NP), (the problem, 9, 20, NP), (hi... | [(lp relaxation, 182, 195, COMPOUND), (cost pe... | {problem, image, distances, partitions, cost, ... | {space, matching, terms, algorithm, cost perfe... |
8 | we propose an approach for generating a sequen... | [] | [(approach, 14, 22, NOUN), (sequence, 40, 48, ... | [(approach, 14, 22, NOUN), (sequence, 40, 48, ... | [(we, 0, 2, NP), (an approach, 11, 22, NP), (a... | [(image stream, 77, 89, COMPOUND), (image stre... | {text-image parallel, language descriptions, m... | {text-image parallel, language descriptions, m... |
9 | given a similarity graph between items, correl... | [(3-approximation, 257, 272, CARDINAL), (graph... | [(similarity, 8, 18, NOUN), (graph, 19, 24, NO... | [(similarity, 8, 18, NOUN), (graph, 19, 24, NO... | [(a similarity graph, 6, 24, NP), (items, 33, ... | [(similarity graph, 8, 24, COMPOUND), (correla... | {practice kwikcluster, similarity, ratio, prac... | {practice kwikcluster, ratio, serializability,... |
# take a look at all the nouns again
column = 'named_nouns'
render_entities(0, df, options=options, column=column)
# take a look at all the compound noun phrases again
column = 'compounds'
render_entities(0, df, options=options, column=column)
# take a look at combined entities
df['comp_nouns'][0]
{'amounts', 'applications', 'benefit', 'challenge', 'crowdsourcing', 'data', 'error', 'error rates', 'expenditure', 'experiments', 'form', 'incentive', 'low-quality data', 'lunch', 'machine', 'machine learning applications', 'mechanism', 'mechanisms', 'no', 'no-free-lunch requirement', 'payment', 'payment mechanism', 'popularity', 'problem', 'quality', 'questions', 'rates', 'reduction', 'requirement', 'rest', 'simplicity', 'spammers', 'workers'}
Now that we have all the entities grouped together, we can see how well we're doing. We've successfully captured single-word entities as well as n-grams, but there appear to be a lot of duplicates. Words that should've been included in a phrase were somehow split apart, most likely because we didn't properly deal with hyphenation when we first tokenized our abstracts.
Not to worry, this should be relatively easy to take care of. We'll also apply a few other heuristics to clean up our list, and remove the most common English words to further pare down the set of entities.
def drop_duplicate_np_splits(ents):
    """Drop any entities that are already captured by noun phrases.
    Keyword arguments:
    ents -- a set of entities
    """
    drop_ents = set()
    for ent in ents:
        if len(ent.split(' ')) > 1:  # multi-word entity
            for e in ent.split(' '):
                if e in ents:  # a word in the phrase also appears on its own
                    drop_ents.add(e)
    return ents - drop_ents
def drop_single_char_nps(ents):
    """Within an entity, drop single characters.
    Keyword arguments:
    ents -- a set of entities
    """
    # note: an entity made up of only single characters collapses to ''
    return {' '.join([e for e in ent.split(' ') if not len(e) == 1]) for ent in ents}
def drop_double_char(ents):
    """Drop any entities that are less than three characters.
    Keyword arguments:
    ents -- a set of entities
    """
    drop_ents = {ent for ent in ents if len(ent) < 3}
    return ents - drop_ents
def keep_alpha(ents):
    """Keep only entities made of ASCII alphabetical characters, hyphens, and spaces.
    Keyword arguments:
    ents -- a set of entities
    """
    keep_char = set('-abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ')
    drop_ents = {ent for ent in ents if not set(ent).issubset(keep_char)}
    return ents - drop_ents
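Before wiring these into the pipeline, here's a quick sanity check on a toy, made-up entity set. Two of the four heuristics are reproduced from above so the snippet runs standalone:

```python
def drop_duplicate_np_splits(ents):
    """Drop entities already captured inside a multi-word phrase."""
    drop_ents = {e for ent in ents if len(ent.split(' ')) > 1
                 for e in ent.split(' ') if e in ents}
    return ents - drop_ents

def drop_double_char(ents):
    """Drop entities shorter than three characters."""
    return {ent for ent in ents if len(ent) >= 3}

sample = {'error', 'rates', 'error rates', 'ab', 'machine learning'}
sample = drop_duplicate_np_splits(sample)  # drops 'error' and 'rates'
sample = drop_double_char(sample)          # drops 'ab'
print(sample)  # {'error rates', 'machine learning'}
```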
These last four functions will slice and dice the list of entities gathered from each abstract in various ways. In addition to this granular processing, we'll also want to remove words that are frequent in the English language, as a heuristic to naturally drop stop words and uncover the domain of each academic source.
Why is this?
Well, in NLP, as in search engine optimization (SEO), the most common words in a given corpus are known as stop words. These unfortunate candidates are hunted down with extreme prejudice and removed from the population to improve search results, enhance semantic analysis, and in our case, help restrict the domain. Removing stop words automatically limits the vocabulary of a corpus to its less frequent words, which are more likely to be specific to a given abstract than to appear anywhere else.
You can, of course, argue that the most common words in a scientific paper might in fact be its most important concepts. That objection is exactly why we aren't going to simply take the most common words in one specific abstract and remove them. Instead, we'll target the most frequent words in a large, general-domain sample of the English language, where stop words are overwhelmingly overrepresented.
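To make that intuition concrete, here's a minimal, self-contained sketch using two made-up "corpora" of a few words each: dropping the globally most frequent words leaves a vocabulary skewed toward domain-specific terms.

```python
from collections import Counter

# made-up stand-ins: a tiny "general English" sample and one abstract
general = "the of and a in is the of to the a"
abstract = "the convex potential minimisation of label noise"

# treat the top-3 most frequent general-domain words as stop words
stop_words = {w for w, _ in Counter(general.split()).most_common(3)}

# what survives is the domain-specific vocabulary
domain = [w for w in abstract.split() if w not in stop_words]
print(domain)  # ['convex', 'potential', 'minimisation', 'label', 'noise']
```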
The "freq_words.csv" file you might have noticed earlier in our file path is actually a list generated from a corpus of 10 billion words gathered by the good people at Word Frequency Data.
Let's take a look at the list and then remove these words from our set of entities.
!ls {PATH}
freq_words.csv nips.csv
filename = 'freq_words.csv'
freq_words_df = pd.read_csv(f'{PATH}{filename}')
display(freq_words_df)
Rank | Word | Part of speech | Frequency | Dispersion | Unnamed: 5 | Unnamed: 6 | |
---|---|---|---|---|---|---|---|
0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 1.0 | the | a | 22038615.0 | 0.98 | NaN | NaN |
2 | 2.0 | be | v | 12545825.0 | 0.97 | NaN | NaN |
3 | 3.0 | and | c | 10741073.0 | 0.99 | NaN | NaN |
4 | 4.0 | of | i | 10343885.0 | 0.97 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... |
4996 | 4996.0 | plaintiff | n | 5312.0 | 0.88 | NaN | NaN |
4997 | 4997.0 | kid | v | 5094.0 | 0.92 | NaN | NaN |
4998 | 4998.0 | middle-class | j | 5025.0 | 0.93 | NaN | NaN |
4999 | 4999.0 | apology | n | 4972.0 | 0.94 | NaN | NaN |
5000 | 5000.0 | till | i | 5079.0 | 0.92 | NaN | NaN |
5001 rows × 7 columns
freq_words = freq_words_df['Word'].iloc[1:]
display(freq_words)
1 the 2 be 3 and 4 of 5 a ... 4996 plaintiff 4997 kid 4998 middle-class 4999 apology 5000 till Name: Word, Length: 5000, dtype: object
def remove_freq_words(ents):
    """Drop any entities in the 5000 most common words in the English language.
    Keyword arguments:
    ents -- a set of entities
    """
    filename = 'freq_words.csv'
    PATH = './data/'
    freq_words = pd.read_csv(f'{PATH}{filename}')['Word'].iloc[1:]
    for word in freq_words:
        try:
            ents.remove(word)
        except KeyError:
            continue  # ignore the stop word if it's not in the set of abstract entities
    return ents
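A small design note: since the entities are already a set, the same result can be written as a one-line set difference. A minimal sketch, with a tiny made-up stand-in for the real 5,000-word list:

```python
# freq_words here is a made-up stand-in for the real frequent-word Series
freq_words = ['the', 'of', 'problem', 'number']
ents = {'problem', 'label noise', 'svm', 'number'}

# set difference drops every entity that appears in the frequent-word list
clean = ents - set(freq_words)
print(clean)  # {'label noise', 'svm'} (set order may vary)
```

The try/except version above mutates the set in place; the set-difference form returns a new set instead, which can be preferable when the original should be left intact.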
def add_clean_ents(df, funcs=[]):
    """Create new column in data frame with cleaned entities.
    Keyword arguments:
    df -- a dataframe object
    funcs -- a list of heuristic functions to be applied to entities
    """
    col = 'clean_ents'
    df[col] = df['comp_nouns']
    for f in funcs:
        df[col] = df[col].apply(f)
funcs = [drop_duplicate_np_splits, drop_double_char, keep_alpha, drop_single_char_nps, remove_freq_words]
add_clean_ents(df, funcs)
display(df)
text | named_ents | nouns | named_nouns | noun_phrases | compounds | comp_nouns | clean_ents | |
---|---|---|---|---|---|---|---|---|
0 | crowdsourcing has gained immense popularity in... | [(several hundred, 896, 911, CARDINAL)] | [(crowdsourcing, 0, 13, NOUN), (popularity, 33... | [(crowdsourcing, 0, 13, NOUN), (popularity, 33... | [(crowdsourcing, 0, 13, NP), (immense populari... | [(machine learning applications, 47, 76, COMPO... | {mechanisms, low-quality data, problem, requir... | {mechanisms, low-quality data, workers, questi... |
1 | convex potential minimisation is the de facto ... | [(2008, 109, 113, DATE), (2008, 500, 504, DATE... | [(minimisation, 17, 29, NOUN), (approach, 46, ... | [(minimisation, 17, 29, NOUN), (approach, 46, ... | [(convex potential minimisation, 0, 29, NP), (... | [(label noise, 143, 154, COMPOUND), (function ... | {function class, solution, result, performance... | {function class, svm, guessing, learners, conv... |
2 | one of the central questions in statistical le... | [(one, 0, 3, CARDINAL)] | [(questions, 19, 28, NOUN), (learning, 44, 52,... | [(questions, 19, 28, NOUN), (learning, 44, 52,... | [(the central questions, 7, 28, NP), (statisti... | [(learning theory, 44, 59, COMPOUND), (inferen... | {result, conditions, dimensionality reduction ... | {dimensionality reduction methods, conditions,... |
3 | we develop a sequential low-complexity inferen... | [] | [(complexity, 28, 38, NOUN), (inference, 39, 4... | [(complexity, 28, 38, NOUN), (inference, 39, 4... | [(we, 0, 2, NP), (a sequential low-complexity ... | [(low-complexity inference procedure, 28, 62, ... | {large-sample limit, concentration, asymptotic... | {classes, form parametric, methods, dirichlet ... |
4 | monte carlo sampling for bayesian posterior in... | [] | [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN... | [(carlo, 6, 11, NOUN), (bayesian, 25, 33, NOUN... | [(bayesian posterior inference, 25, 53, NP), (... | [(monte carlo sampling, 0, 20, COMPOUND), (mac... | {machine learning, discrete-time analogues, gr... | {machine learning, discrete-time analogues, gr... |
5 | we propose a robust portfolio optimization app... | [] | [(portfolio, 20, 29, NOUN), (optimization, 30,... | [(portfolio, 20, 29, NOUN), (optimization, 30,... | [(we, 0, 2, NP), (a robust portfolio optimizat... | [(portfolio optimization approach, 20, 51, COM... | {work, optimization, portfolio, dependence, th... | {dependence, events, asset returns, dimensions... |
6 | we study the problem of multiclass classificat... | [] | [(problem, 13, 20, NOUN), (multiclass, 24, 34,... | [(problem, 13, 20, NOUN), (multiclass, 24, 34,... | [(we, 0, 2, NP), (the problem, 9, 20, NP), (mu... | [(test time, 134, 143, COMPOUND), (tree constr... | {entropy, problem, classes, conditions, number... | {classes, conditions, partitions, multiclass, ... |
7 | we study the problem of hierarchical clusterin... | [] | [(problem, 13, 20, NOUN), (clustering, 37, 47,... | [(problem, 13, 20, NOUN), (clustering, 37, 47,... | [(we, 0, 2, NP), (the problem, 9, 20, NP), (hi... | [(lp relaxation, 182, 195, COMPOUND), (cost pe... | {problem, image, distances, partitions, cost, ... | {matching, algorithm, cost perfect, lp relaxat... |
8 | we propose an approach for generating a sequen... | [] | [(approach, 14, 22, NOUN), (sequence, 40, 48, ... | [(approach, 14, 22, NOUN), (sequence, 40, 48, ... | [(we, 0, 2, NP), (an approach, 11, 22, NP), (a... | [(image stream, 77, 89, COMPOUND), (image stre... | {text-image parallel, language descriptions, m... | {text-image parallel, language descriptions, m... |
9 | given a similarity graph between items, correl... | [(3-approximation, 257, 272, CARDINAL), (graph... | [(similarity, 8, 18, NOUN), (graph, 19, 24, NO... | [(similarity, 8, 18, NOUN), (graph, 19, 24, NO... | [(a similarity graph, 6, 24, NP), (items, 33, ... | [(similarity graph, 8, 24, COMPOUND), (correla... | {practice kwikcluster, similarity, ratio, prac... | {practice kwikcluster, serializability, result... |
def visualize_entities(df, idx=0):
    """Visualize the entities for a given abstract in the dataframe.
    Keyword arguments:
    df -- a dataframe object
    idx -- the index of interest for the dataframe (default 0)
    """
    # store entity start and end index for visualization in a dummy df
    ents = []
    abstract = df['text'][idx]
    for ent in df['clean_ents'][idx]:
        i = abstract.find(ent)  # first occurrence; -1 if the cleaned entity no longer appears verbatim
        ents.append((ent, i, i + len(ent), 'ENTITY'))
    ents.sort(key=lambda tup: tup[1])  # sort spans by start index
    dummy_df = pd.DataFrame([abstract, ents]).T  # transpose dataframe
    dummy_df.columns = ['text', 'clean_ents']
    column = 'clean_ents'
    render_entities(0, dummy_df, options=options, column=column)
visualize_entities(df, 0)
That's a good-looking list of concepts, wouldn't you say? By removing stop words and fine-tuning our set, we were able to capture only the most important entities in this first abstract! Let's finish up with a quick recapitulation of our approach and some thoughts on what we can do going forward.
Well, at the risk of tooting our own horn, I feel rather confident saying that we've accomplished what we set out to do! We took an abstract from a scientific paper, combined named and regular entities, extracted compound noun phrases, and pared down the final list using heuristics and stop word domain restriction to generate a set of important concepts.
Keep in mind that this exercise wasn't to create the world's best entity extractor. It was to get a fast baseline for what we can do with limited knowledge about the domain, and limited use of deep learning superpowers. We've now ended up with a prototype that shows we can get relatively far using out-of-the-box methods, with minor scripting for customization. And the best part? Our approach didn't require any extensive compute or proprietary software!
Going forward, we'd want to test our approach on larger data sets (perhaps full scientific papers), and create an easy-to-use API for visualization, as well as individual and batch processing of text sources. Improving the actual entity extraction itself might involve a language model trained on academic papers or the addition of other intelligent heuristics. At some point, we'd also want to link each entity to an external database with further information, so that our conversational academic paper program would be able to orient these concepts within a larger knowledge graph.
At the end of all of this, we've built a fast entity extraction prototype that confidently moves us towards creating an engine to communicate with academic papers, which will (hopefully) set the foundation for an automated scientific discovery tool.
Great work!