This notebook demonstrates the interoperability features of the open source library Text Extensions for Pandas. Specifically, we use Pandas DataFrames as a bridge between multiple natural language processing libraries. The example shown here uses IBM's Watson Natural Language Understanding service and SpaCy to perform a number of NLP tasks, such as extracting entities, relations, spans, and sentiment.
This notebook requires a Python 3.7 or later environment with the following packages:
ibm-watson
spacy
text_extensions_for_pandas
You can satisfy the dependency on text_extensions_for_pandas in either of two ways:
Run pip install text_extensions_for_pandas before running this notebook. This command adds the library to your Python environment from the latest PyPI release.
Alternatively, run this notebook from within the notebooks directory of the project source tree; the import code below falls back to the local copy of the library.
# Uncomment and run this cell if you are using this notebook in a cloud environment such
# as IBM Watson Studio or Google Colab and you want to install the required packages.
# Note: This will install packages to your environment so only run if you need to install
# these packages.
# Uncomment below cell to install packages
# !pip install ibm_watson spacy text_extensions_for_pandas
# Core Python libraries
import json
import os
import sys
import pandas as pd
from typing import *
# IBM Watson libraries
import ibm_watson
import ibm_watson.natural_language_understanding_v1 as nlu
import ibm_cloud_sdk_core
# SpaCy
import spacy
# And of course we need the text_extensions_for_pandas library itself.
try:
import text_extensions_for_pandas as tp
except ModuleNotFoundError as e:
# If we're running from within the project source tree and the parent Python
# environment doesn't have the text_extensions_for_pandas package, use the
# version in the local source tree.
if not os.getcwd().endswith("notebooks"):
raise e
if ".." not in sys.path:
sys.path.insert(0, "..")
import text_extensions_for_pandas as tp
In this section, we will set up the Watson NLU service and pass our example document through it to obtain various features and outputs. The section is divided into subsections that set up, connect to, and use the service.
In this part of the notebook, we will use the Watson Natural Language Understanding (NLU) service to extract key features from our example document.
You can create an instance of Watson NLU on the IBM Cloud for free by navigating to this page and clicking on the button marked "Get started free". You can also install your own instance of Watson NLU on OpenShift by using IBM Watson Natural Language Understanding for IBM Cloud Pak for Data.
You'll need two pieces of information to access your instance of Watson NLU: An API key and a service URL. If you're using Watson NLU on the IBM Cloud, you can find your API key and service URL in the IBM Cloud web UI. Navigate to the resource list and click on your instance of Natural Language Understanding to open the management UI for your service. Then click on the "Manage" tab to show a page with your API key and service URL.
The cell that follows assumes that you are using the environment variables IBM_API_KEY and IBM_SERVICE_URL to store your credentials. If you're running this notebook in Jupyter on your laptop, you can set these environment variables while starting up jupyter notebook or jupyter lab. For example:
IBM_API_KEY='<my API key>' \
IBM_SERVICE_URL='<my service URL>' \
jupyter lab
Alternately, you can uncomment the first two lines of code below to set the IBM_API_KEY and IBM_SERVICE_URL environment variables directly.
Be careful not to store your API key in any publicly-accessible location!
# If you need to embed your credentials inline, uncomment the following two lines and
# paste your credentials in the indicated locations.
# os.environ["IBM_API_KEY"] = "<API key goes here>"
# os.environ["IBM_SERVICE_URL"] = "<Service URL goes here>"
# Retrieve the API key for your Watson NLU service instance
if "IBM_API_KEY" not in os.environ:
raise ValueError("Expected Watson NLU api key in the environment variable 'IBM_API_KEY'")
api_key = os.environ.get("IBM_API_KEY")
# Retrieve the service URL for your Watson NLU service instance
if "IBM_SERVICE_URL" not in os.environ:
raise ValueError("Expected Watson NLU service URL in the environment variable 'IBM_SERVICE_URL'")
service_url = os.environ.get("IBM_SERVICE_URL")
This notebook uses the IBM Watson Python SDK to perform authentication on the IBM Cloud via the IAMAuthenticator class. See the IBM Watson Python SDK documentation for more information.
We start by using the API key and service URL from the previous cell to create an instance of the Python API for Watson NLU.
natural_language_understanding = ibm_watson.NaturalLanguageUnderstandingV1(
version="2019-07-12",
authenticator=ibm_cloud_sdk_core.authenticators.IAMAuthenticator(api_key)
)
natural_language_understanding.set_service_url(service_url)
natural_language_understanding
<ibm_watson.natural_language_understanding_v1.NaturalLanguageUnderstandingV1 at 0x7f86b07053d0>
Once you've opened a connection to the Watson NLU service, you can pass documents through the service by invoking the analyze() method.
The example document that we use here is an excerpt from the plot summary for Monty Python and the Holy Grail, drawn from the Wikipedia entry for that movie.
Let's preview what the raw text looks like:
from IPython.display import display, HTML
doc_file = "../resources/holy_grail_short.txt"
with open(doc_file, "r") as f:
doc_text = f.read()
display(HTML(f"<b>Document Text:</b><blockquote>{doc_text}</blockquote>"))
In AD 932, King Arthur and his squire, Patsy, travel throughout Britain searching for men to join the Knights of the Round Table. Along the way, he recruits Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their squires and Robin's troubadours. Arthur leads the men to Camelot, but upon further consideration (thanks to a musical number) he decides not to go there because it is "a silly place". As they turn away, God (an image of W. G. Grace) speaks to them and gives Arthur the task of finding the Holy Grail.
Watson Natural Language Understanding can perform multiple kinds of analysis on the example document.
We will be looking at the following: entities, keywords, relations, semantic roles, and syntax.
See the Watson NLU documentation for a full description of the types of analysis that NLU can perform.
# Make the request
response = natural_language_understanding.analyze(
text=doc_text,
# TODO: Use this URL once we've pushed the shortened document to Github
#url="https://raw.githubusercontent.com/CODAIT/text-extensions-for-pandas/master/resources/holy_grail_short.txt",
return_analyzed_text=True,
features=nlu.Features(
entities=nlu.EntitiesOptions(sentiment=True),
keywords=nlu.KeywordsOptions(sentiment=True, emotion=True),
relations=nlu.RelationsOptions(),
semantic_roles=nlu.SemanticRolesOptions(),
syntax=nlu.SyntaxOptions(sentences=True,
tokens=nlu.SyntaxOptionsTokens(lemma=True, part_of_speech=True))
)).get_result()
The response from the analyze() method is a Python dictionary. The dictionary contains an entry for each pass of analysis requested, plus some additional entries with metadata about the API request itself. Here's a list of the keys in response:
response.keys()
dict_keys(['usage', 'syntax', 'semantic_roles', 'relations', 'language', 'keywords', 'entities', 'analyzed_text'])
Text Extensions for Pandas includes a handy function, tp.io.watson.nlu.parse_response(), that turns the output of Watson NLU's analyze() method into a dictionary of Pandas DataFrames. This makes it much easier to process the output from NLU and perform downstream operations. Let's run the NLU response object through that conversion below.
dfs = tp.io.watson.nlu.parse_response(response)
dfs.keys()
dict_keys(['syntax', 'entities', 'entity_mentions', 'keywords', 'relations', 'semantic_roles'])
The output of each analysis pass that Watson NLU performed is now a DataFrame. Let's look at the outputs of the "relations" pass. Here's the original output as Python objects:
response["relations"]
[{'type': 'partOfMany', 'sentence': "Along the way, he recruits Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their squires and Robin's troubadours.", 'score': 0.610221, 'arguments': [{'text': 'Galahad', 'location': [208, 215], 'entities': [{'type': 'Person', 'text': 'Galahad'}]}, {'text': 'their', 'location': [323, 328], 'entities': [{'type': 'Person', 'text': 'their'}]}]}, {'type': 'partOfMany', 'sentence': "Along the way, he recruits Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their squires and Robin's troubadours.", 'score': 0.710112, 'arguments': [{'text': 'Lancelot', 'location': [266, 274], 'entities': [{'type': 'Person', 'text': 'Lancelot'}]}, {'text': 'their', 'location': [323, 328], 'entities': [{'type': 'Person', 'text': 'their'}]}]}, {'type': 'parentOf', 'sentence': "Along the way, he recruits Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their squires and Robin's troubadours.", 'score': 0.3821, 'arguments': [{'text': 'their', 'location': [323, 328], 'entities': [{'type': 'Person', 'text': 'their'}]}, {'text': 'squires', 'location': [329, 336], 'entities': [{'type': 'Person', 'text': 'squires'}]}]}, {'type': 'residesIn', 'sentence': 'Arthur leads the men to Camelot, but upon further consideration (thanks to a musical number) he decides not to go there because it is "a silly place".', 'score': 0.492869, 'arguments': [{'text': 'Arthur', 'location': [362, 368], 'entities': [{'type': 'Person', 'text': 'King Arthur'}]}, {'text': 'Camelot', 'location': [386, 393], 'entities': [{'type': 'GeopoliticalEntity', 'text': 'Camelot'}]}]}, {'type': 'locatedAt', 'sentence': 'Arthur leads the men to Camelot, but 
upon further consideration (thanks to a musical number) he decides not to go there because it is "a silly place".', 'score': 0.339446, 'arguments': [{'text': 'men', 'location': [379, 382], 'entities': [{'type': 'Person', 'text': 'men'}]}, {'text': 'Camelot', 'location': [386, 393], 'entities': [{'type': 'GeopoliticalEntity', 'text': 'Camelot'}]}]}, {'type': 'affectedBy', 'sentence': 'As they turn away, God (an image of W. G. Grace) speaks to them and gives Arthur the task of finding the Holy Grail.', 'score': 0.604304, 'arguments': [{'text': 'them', 'location': [572, 576], 'entities': [{'type': 'Person', 'text': 'their'}]}, {'text': 'speaks', 'location': [562, 568], 'entities': [{'type': 'EventCommunication', 'text': 'speaks'}]}]}]
And here's the DataFrame version of the same information:
dfs["relations"]
type | sentence_span | score | arguments.0.span | arguments.1.span | arguments.0.entities.type | arguments.1.entities.type | arguments.0.entities.text | arguments.1.entities.text | |
---|---|---|---|---|---|---|---|---|---|
0 | partOfMany | [130, 361): 'Along the way, he recruits Sir Be... | 0.610221 | [208, 215): 'Galahad' | [323, 328): 'their' | Person | Person | Galahad | their |
1 | partOfMany | [130, 361): 'Along the way, he recruits Sir Be... | 0.710112 | [266, 274): 'Lancelot' | [323, 328): 'their' | Person | Person | Lancelot | their |
2 | parentOf | [130, 361): 'Along the way, he recruits Sir Be... | 0.382100 | [323, 328): 'their' | [329, 336): 'squires' | Person | Person | their | squires |
3 | residesIn | [362, 512): 'Arthur leads the men to Camelot, ... | 0.492869 | [362, 368): 'Arthur' | [386, 393): 'Camelot' | Person | GeopoliticalEntity | King Arthur | Camelot |
4 | locatedAt | [362, 512): 'Arthur leads the men to Camelot, ... | 0.339446 | [379, 382): 'men' | [386, 393): 'Camelot' | Person | GeopoliticalEntity | men | Camelot |
5 | affectedBy | [513, 629): 'As they turn away, God (an image ... | 0.604304 | [572, 576): 'them' | [562, 568): 'speaks' | Person | EventCommunication | their | speaks |
As you can see above, the information is much better organized and more convenient to work with once we have it as a DataFrame.
Each row in the DataFrame contains information about a single relationship that Watson Natural Language Understanding identified in our input text. As you can see, Watson NLU returns a lot of information about each relationship. For simplicity, let's focus on three columns:
relations = dfs["relations"][["type", "arguments.0.span", "arguments.1.span"]].copy()
relations
type | arguments.0.span | arguments.1.span | |
---|---|---|---|
0 | partOfMany | [208, 215): 'Galahad' | [323, 328): 'their' |
1 | partOfMany | [266, 274): 'Lancelot' | [323, 328): 'their' |
2 | parentOf | [323, 328): 'their' | [329, 336): 'squires' |
3 | residesIn | [362, 368): 'Arthur' | [386, 393): 'Camelot' |
4 | locatedAt | [379, 382): 'men' | [386, 393): 'Camelot' |
5 | affectedBy | [572, 576): 'them' | [562, 568): 'speaks' |
Text Extensions for Pandas uses Pandas extension types to represent spans (regions of a document) and tensors (multi-dimensional arrays). For example, the "arguments.0.span" and "arguments.1.span" columns in the above DataFrame are both stored using the extension type for spans.
Here's the Pandas data type (also known as "dtype") information for the three columns of this DataFrame:
relations.dtypes
type                   object
arguments.0.span    SpanDtype
arguments.1.span    SpanDtype
dtype: object
Note how the "arguments.0.span" and "arguments.1.span" columns are of dtype SpanDtype. SpanDtype is a Pandas extension type from the Text Extensions for Pandas library.
The SpanDtype data type corresponds to two Python classes: Span for scalar values and SpanArray for array values. SpanArray is a subclass of the Pandas ExtensionArray class, which is the base class for custom 1-D array types in Pandas.
You can access the array object behind any Pandas extension type via the pandas.Series.array property:
print(relations["arguments.0.span"].array)
<SpanArray> [ [208, 215): 'Galahad', [266, 274): 'Lancelot', [323, 328): 'their', [362, 368): 'Arthur', [379, 382): 'men', [572, 576): 'them'] Length: 6, dtype: SpanDtype
Extension types support most of the functionality of built-in Pandas types like Int64Dtype and DatetimeTZDtype.
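As a quick illustration using a built-in extension type (standard Pandas, nothing specific to Text Extensions for Pandas), the nullable Int64Dtype supports the same arithmetic and aggregation operations as an ordinary NumPy-backed numeric column:

```python
import pandas as pd

# Pandas' nullable integer extension type supports the same operations as a
# NumPy-backed numeric column, while representing missing values as pd.NA.
s = pd.Series([1, 2, None], dtype=pd.Int64Dtype())

print(s.sum())        # aggregation skips missing values by default -> 3
print((s * 2).dtype)  # elementwise arithmetic preserves the extension dtype -> Int64
```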
For example, SpanDtype defines the + operation (also known as __add__()) for spans to mean "the shortest span that completely covers both input spans". So we can "add" the contents of the "arguments.0.span" and "arguments.1.span" columns of our DataFrame to obtain a span that covers both arguments, plus the text in between them.
The cell below demonstrates a simple + operation with spans.
relations["context"] = relations["arguments.0.span"] + relations["arguments.1.span"]
relations
type | arguments.0.span | arguments.1.span | context | |
---|---|---|---|---|
0 | partOfMany | [208, 215): 'Galahad' | [323, 328): 'their' | [208, 328): 'Galahad the Pure, Sir Robin the N... |
1 | partOfMany | [266, 274): 'Lancelot' | [323, 328): 'their' | [266, 328): 'Lancelot, and Sir Not-Appearing-i... |
2 | parentOf | [323, 328): 'their' | [329, 336): 'squires' | [323, 336): 'their squires' |
3 | residesIn | [362, 368): 'Arthur' | [386, 393): 'Camelot' | [362, 393): 'Arthur leads the men to Camelot' |
4 | locatedAt | [379, 382): 'men' | [386, 393): 'Camelot' | [379, 393): 'men to Camelot' |
5 | affectedBy | [572, 576): 'them' | [562, 568): 'speaks' | [562, 576): 'speaks to them' |
Take a look at the last row of the above DataFrame. The span in "arguments.0.span" comes after "arguments.1.span" in the last row, but the "context" column is still correct.
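The order-independence follows directly from the semantics of + for spans. The sketch below illustrates those semantics in plain Python; SimpleSpan is a made-up name for illustration, not the library's implementation.

```python
from dataclasses import dataclass

# A minimal sketch of the "+" semantics for spans: the shortest span that
# completely covers both inputs. Offsets are half-open [begin, end) positions
# into a shared document string.
@dataclass(frozen=True)
class SimpleSpan:
    text: str   # the full document text
    begin: int
    end: int

    def __add__(self, other: "SimpleSpan") -> "SimpleSpan":
        # Order-independent: minimum of the begins, maximum of the ends.
        return SimpleSpan(self.text,
                          min(self.begin, other.begin),
                          max(self.end, other.end))

    @property
    def covered_text(self) -> str:
        return self.text[self.begin:self.end]

doc = "speaks to them"
a = SimpleSpan(doc, 10, 14)   # 'them'
b = SimpleSpan(doc, 0, 6)     # 'speaks'
print((a + b).covered_text)   # 'speaks to them' -- same result as b + a
```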
A SpanArray can also render itself using Jupyter notebook callbacks. To see the HTML representation of the SpanArray, pass the array object to Jupyter's display() function, or make that object the last line of the cell, as in the following example:
relations["context"].array
In AD 932, King Arthur and his squire, Patsy, travel throughout Britain searching for men to join the Knights of the Round Table. Along the way, he recruits Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their squires and Robin's troubadours. Arthur leads the men to Camelot, but upon further consideration (thanks to a musical number) he decides not to go there because it is "a silly place". As they turn away, God (an image of W. G. Grace) speaks to them and gives Arthur the task of finding the Holy Grail.
This makes it very easy to visually inspect relevant portions of a document and to present any findings.
You can also convert an individual element of the array into a Python object of type Span that represents that single span as a scalar value:
target_span = relations.iloc[0]["arguments.1.span"]
target_span
[323, 328): 'their'
You can use a Span object to create a Pandas selection condition. For example, we can use a selection condition to select the rows from the relations DataFrame whose second argument's span matches the span we just stored in the variable target_span:
relations[relations["arguments.1.span"] == target_span]
type | arguments.0.span | arguments.1.span | context | |
---|---|---|---|---|
0 | partOfMany | [208, 215): 'Galahad' | [323, 328): 'their' | [208, 328): 'Galahad the Pure, Sir Robin the N... |
1 | partOfMany | [266, 274): 'Lancelot' | [323, 328): 'their' | [266, 328): 'Lancelot, and Sir Not-Appearing-i... |
Pandas extension types also support aggregation. Let's use the sum() aggregate to find the portion of the document that includes the context for all the relationships in the above DataFrame.
Recall that Text Extensions for Pandas defines the addition operator for spans as "the shortest span that completely covers both input spans". Similarly, the "sum" of a collection of spans is the shortest span that completely covers all of the spans.
max_context_span = relations[relations["arguments.1.span"] == target_span]["context"].sum()
print(f"""
Span: {str(max_context_span)}
Covered text: "{max_context_span.covered_text}"
""")
Span: [208, 328): 'Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and [...]'
Covered text: "Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their"
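The covering-interval idea extends naturally from pairs of spans to whole collections. As a plain-Python sketch of the "sum" semantics over begin/end offsets (illustrative only, not the library's code):

```python
from functools import reduce

# "Summing" spans reduces a collection to its shortest covering interval:
# the minimum begin and the maximum end across all spans.
spans = [(208, 215), (266, 274), (208, 328)]  # half-open [begin, end) offsets

def cover(a, b):
    return (min(a[0], b[0]), max(a[1], b[1]))

print(reduce(cover, spans))  # -> (208, 328)
```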
With Text Extensions for Pandas, you can use Pandas DataFrames as a common representation for your NLP application's intermediate data, regardless of which NLP library you used to produce that data.
In the cell that follows, we take the text that we just ran through Watson NLU and feed it through a SpaCy language model. Then we use the make_tokens_and_features() function from Text Extensions for Pandas to convert this output to a Pandas DataFrame of token features.
Before loading the SpaCy language model, download it with the following command:
$ python -m spacy download en_core_web_sm
Alternatively, you can add a line to the cell below to install it inline:
!python -m spacy download en_core_web_sm
doc_text = response["analyzed_text"]
spacy_language_model = spacy.load("en_core_web_sm")
token_features = tp.io.spacy.make_tokens_and_features(doc_text, spacy_language_model)
token_features
id | span | lemma | pos | tag | dep | head | shape | ent_iob | ent_type | is_alpha | is_stop | sentence | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | [0, 2): 'In' | in | ADP | IN | prep | 12 | Xx | O | True | True | [0, 129): 'In AD 932, King Arthur and his squi... | |
1 | 1 | [3, 5): 'AD' | ad | NOUN | NN | pobj | 0 | XX | B | DATE | True | False | [0, 129): 'In AD 932, King Arthur and his squi... |
2 | 2 | [6, 9): '932' | 932 | NUM | CD | nummod | 1 | ddd | I | DATE | False | False | [0, 129): 'In AD 932, King Arthur and his squi... |
3 | 3 | [9, 10): ',' | , | PUNCT | , | punct | 12 | , | O | False | False | [0, 129): 'In AD 932, King Arthur and his squi... | |
4 | 4 | [11, 15): 'King' | King | PROPN | NNP | compound | 5 | Xxxx | B | PERSON | True | False | [0, 129): 'In AD 932, King Arthur and his squi... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
142 | 142 | [606, 613): 'finding' | find | VERB | VBG | pcomp | 141 | xxxx | O | True | False | [513, 629): 'As they turn away, God (an image ... | |
143 | 143 | [614, 617): 'the' | the | DET | DT | det | 145 | xxx | B | FAC | True | True | [513, 629): 'As they turn away, God (an image ... |
144 | 144 | [618, 622): 'Holy' | Holy | PROPN | NNP | compound | 145 | Xxxx | I | FAC | True | False | [513, 629): 'As they turn away, God (an image ... |
145 | 145 | [623, 628): 'Grail' | Grail | PROPN | NNP | dobj | 142 | Xxxxx | I | FAC | True | False | [513, 629): 'As they turn away, God (an image ... |
146 | 146 | [628, 629): '.' | . | PUNCT | . | punct | 133 | . | O | False | False | [513, 629): 'As they turn away, God (an image ... |
147 rows × 13 columns
Recall that, in the previous section of this notebook, we defined a variable max_context_span containing the region of the text that covers the elements of some relationships that Watson Natural Language Understanding identified:
max_context_span
[208, 328): 'Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and [...]'
Let's identify all the rows of our SpaCy token features DataFrame that overlap with this span of the document.
The SpanArray class has a built-in operation, overlaps(), for building Pandas selection conditions based on overlapping spans. Here we use overlaps() to filter the token_features DataFrame based on the value of max_context_span:
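For reference, the overlap test on half-open spans reduces to a two-comparison predicate. Here is a plain-Python sketch (spans_overlap is a made-up name for illustration; the library's overlaps() applies this kind of test across a whole SpanArray):

```python
# Two half-open [begin, end) spans overlap exactly when each begins before
# the other ends. Spans that merely touch share no characters, so they do
# not count as overlapping.
def spans_overlap(begin1, end1, begin2, end2):
    return begin1 < end2 and begin2 < end1

print(spans_overlap(208, 215, 208, 328))  # True: 'Galahad' lies inside the context
print(spans_overlap(130, 208, 208, 328))  # False: adjacent spans do not overlap
```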
spacy_context_tokens = token_features[token_features["span"].array.overlaps(max_context_span)]
spacy_context_tokens
id | span | lemma | pos | tag | dep | head | shape | ent_iob | ent_type | is_alpha | is_stop | sentence | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
44 | 44 | [208, 215): 'Galahad' | Galahad | PROPN | NNP | appos | 39 | Xxxxx | B | PERSON | True | False | [130, 361): 'Along the way, he recruits Sir Be... |
45 | 45 | [216, 219): 'the' | the | DET | DT | det | 46 | xxx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
46 | 46 | [220, 224): 'Pure' | pure | ADJ | JJ | appos | 44 | Xxxx | O | True | False | [130, 361): 'Along the way, he recruits Sir Be... | |
47 | 47 | [224, 225): ',' | , | PUNCT | , | punct | 39 | , | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
48 | 48 | [226, 229): 'Sir' | Sir | PROPN | NNP | compound | 49 | Xxx | O | True | False | [130, 361): 'Along the way, he recruits Sir Be... | |
49 | 49 | [230, 235): 'Robin' | Robin | PROPN | NNP | appos | 39 | Xxxxx | B | PERSON | True | False | [130, 361): 'Along the way, he recruits Sir Be... |
50 | 50 | [236, 239): 'the' | the | DET | DT | det | 63 | xxx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
51 | 51 | [240, 243): 'Not' | not | PART | RB | neg | 53 | Xxx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
52 | 52 | [243, 244): '-' | - | PUNCT | HYPH | punct | 53 | - | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
53 | 53 | [244, 249): 'Quite' | quite | VERB | VB | nmod | 63 | Xxxxx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
54 | 54 | [249, 250): '-' | - | PUNCT | HYPH | punct | 53 | - | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
55 | 55 | [250, 252): 'So' | so | SCONJ | IN | advmod | 57 | Xx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
56 | 56 | [252, 253): '-' | - | PUNCT | HYPH | punct | 57 | - | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
57 | 57 | [253, 258): 'Brave' | brave | VERB | VB | pobj | 53 | Xxxxx | O | True | False | [130, 361): 'Along the way, he recruits Sir Be... | |
58 | 58 | [258, 259): '-' | - | PUNCT | HYPH | punct | 57 | - | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
59 | 59 | [259, 261): 'as' | as | ADP | IN | prep | 57 | xx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
60 | 60 | [261, 262): '-' | - | PUNCT | HYPH | punct | 59 | - | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
61 | 61 | [262, 265): 'Sir' | sir | NOUN | NN | pobj | 59 | Xxx | O | True | False | [130, 361): 'Along the way, he recruits Sir Be... | |
62 | 62 | [265, 266): '-' | - | PUNCT | HYPH | punct | 63 | - | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
63 | 63 | [266, 274): 'Lancelot' | Lancelot | PROPN | NNP | appos | 49 | Xxxxx | O | True | False | [130, 361): 'Along the way, he recruits Sir Be... | |
64 | 64 | [274, 275): ',' | , | PUNCT | , | punct | 39 | , | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
65 | 65 | [276, 279): 'and' | and | CCONJ | CC | cc | 39 | xxx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
66 | 66 | [280, 283): 'Sir' | Sir | PROPN | NNP | npadvmod | 69 | Xxx | O | True | False | [130, 361): 'Along the way, he recruits Sir Be... | |
67 | 67 | [284, 287): 'Not' | not | PART | RB | neg | 69 | Xxx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
68 | 68 | [287, 288): '-' | - | PUNCT | HYPH | punct | 69 | - | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
69 | 69 | [288, 297): 'Appearing' | appear | VERB | VBG | conj | 39 | Xxxxx | O | True | False | [130, 361): 'Along the way, he recruits Sir Be... | |
70 | 70 | [297, 298): '-' | - | PUNCT | HYPH | punct | 69 | - | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
71 | 71 | [298, 300): 'in' | in | ADP | IN | prep | 69 | xx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
72 | 72 | [300, 301): '-' | - | PUNCT | HYPH | punct | 71 | - | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
73 | 73 | [301, 305): 'this' | this | PRON | DT | pobj | 71 | xxxx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
74 | 74 | [305, 306): '-' | - | PUNCT | HYPH | punct | 75 | - | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
75 | 75 | [306, 310): 'Film' | Film | PROPN | NNP | appos | 69 | Xxxx | O | True | False | [130, 361): 'Along the way, he recruits Sir Be... | |
76 | 76 | [310, 311): ',' | , | PUNCT | , | punct | 69 | , | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
77 | 77 | [312, 317): 'along' | along | ADP | IN | prep | 69 | xxxx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
78 | 78 | [318, 322): 'with' | with | ADP | IN | prep | 77 | xxxx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
79 | 79 | [323, 328): 'their' | their | PRON | PRP$ | poss | 80 | xxxx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... |
All of the tokens in the previous cell come from a single sentence, but a SpaCy language model can incorrectly split a long sentence like this one into multiple smaller sentences, which would show up as multiple distinct values in the "sentence" column. We can use pandas.DataFrame.drop_duplicates() to show exactly which sentence spans are present in this slice of the SpaCy output:
spacy_context_tokens[["sentence"]].drop_duplicates()
sentence | |
---|---|
44 | [130, 361): 'Along the way, he recruits Sir Be... |
Alternately, we can drill down to the SpanArray object to show an HTML representation of these sentence spans in context:
spacy_context_tokens["sentence"].array.unique()
In AD 932, King Arthur and his squire, Patsy, travel throughout Britain searching for men to join the Knights of the Round Table. Along the way, he recruits Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their squires and Robin's troubadours. Arthur leads the men to Camelot, but upon further consideration (thanks to a musical number) he decides not to go there because it is "a silly place". As they turn away, God (an image of W. G. Grace) speaks to them and gives Arthur the task of finding the Holy Grail.
Watson Natural Language Understanding also identifies sentence boundaries, which gives us a second source of sentence information to compare against SpaCy's.
Earlier in this notebook, we created a Python dictionary dfs, where each item in the dictionary is a DataFrame. The DataFrame under the key "syntax" holds the output of Watson NLU's syntax analysis, which includes sentence information. Let's extract the section of this Watson NLU output that matches our target span.
watson_syntax = dfs["syntax"]
watson_context_tokens = watson_syntax[watson_syntax["span"].array.overlaps(max_context_span)]
watson_context_tokens
span | part_of_speech | lemma | sentence | |
---|---|---|---|---|
44 | [208, 215): 'Galahad' | PROPN | None | [130, 361): 'Along the way, he recruits Sir Be... |
45 | [216, 219): 'the' | DET | the | [130, 361): 'Along the way, he recruits Sir Be... |
46 | [220, 224): 'Pure' | PROPN | None | [130, 361): 'Along the way, he recruits Sir Be... |
47 | [224, 225): ',' | PUNCT | None | [130, 361): 'Along the way, he recruits Sir Be... |
48 | [226, 229): 'Sir' | PROPN | Sir | [130, 361): 'Along the way, he recruits Sir Be... |
49 | [230, 235): 'Robin' | PROPN | Robin | [130, 361): 'Along the way, he recruits Sir Be... |
50 | [236, 239): 'the' | DET | the | [130, 361): 'Along the way, he recruits Sir Be... |
51 | [240, 243): 'Not' | PROPN | None | [130, 361): 'Along the way, he recruits Sir Be... |
52 | [243, 244): '-' | PUNCT | None | [130, 361): 'Along the way, he recruits Sir Be... |
53 | [244, 249): 'Quite' | PROPN | None | [130, 361): 'Along the way, he recruits Sir Be... |
54 | [249, 250): '-' | PUNCT | None | [130, 361): 'Along the way, he recruits Sir Be... |
55 | [250, 252): 'So' | ADV | so | [130, 361): 'Along the way, he recruits Sir Be... |
56 | [252, 253): '-' | PUNCT | None | [130, 361): 'Along the way, he recruits Sir Be... |
57 | [253, 258): 'Brave' | PROPN | Brave | [130, 361): 'Along the way, he recruits Sir Be... |
58 | [258, 259): '-' | PUNCT | None | [130, 361): 'Along the way, he recruits Sir Be... |
59 | [259, 261): 'as' | ADP | as | [130, 361): 'Along the way, he recruits Sir Be... |
60 | [261, 262): '-' | PUNCT | None | [130, 361): 'Along the way, he recruits Sir Be... |
61 | [262, 265): 'Sir' | PROPN | Sir | [130, 361): 'Along the way, he recruits Sir Be... |
62 | [265, 266): '-' | PUNCT | None | [130, 361): 'Along the way, he recruits Sir Be... |
63 | [266, 274): 'Lancelot' | PROPN | None | [130, 361): 'Along the way, he recruits Sir Be... |
64 | [274, 275): ',' | PUNCT | None | [130, 361): 'Along the way, he recruits Sir Be... |
65 | [276, 279): 'and' | CCONJ | and | [130, 361): 'Along the way, he recruits Sir Be... |
66 | [280, 283): 'Sir' | PROPN | Sir | [130, 361): 'Along the way, he recruits Sir Be... |
67 | [284, 287): 'Not' | ADV | not | [130, 361): 'Along the way, he recruits Sir Be... |
68 | [287, 288): '-' | PUNCT | None | [130, 361): 'Along the way, he recruits Sir Be... |
69 | [288, 297): 'Appearing' | PROPN | None | [130, 361): 'Along the way, he recruits Sir Be... |
70 | [297, 298): '-' | PUNCT | None | [130, 361): 'Along the way, he recruits Sir Be... |
71 | [298, 300): 'in' | ADP | in | [130, 361): 'Along the way, he recruits Sir Be... |
72 | [300, 301): '-' | PUNCT | None | [130, 361): 'Along the way, he recruits Sir Be... |
73 | [301, 305): 'this' | PRON | this | [130, 361): 'Along the way, he recruits Sir Be... |
74 | [305, 306): '-' | PUNCT | None | [130, 361): 'Along the way, he recruits Sir Be... |
75 | [306, 310): 'Film' | PROPN | Film | [130, 361): 'Along the way, he recruits Sir Be... |
76 | [310, 311): ',' | PUNCT | None | [130, 361): 'Along the way, he recruits Sir Be... |
77 | [312, 317): 'along' | ADP | along | [130, 361): 'Along the way, he recruits Sir Be... |
78 | [318, 322): 'with' | ADP | with | [130, 361): 'Along the way, he recruits Sir Be... |
79 | [323, 328): 'their' | PRON | their | [130, 361): 'Along the way, he recruits Sir Be... |
The Watson NLU output correctly maps every token to the same sentence, and the span of the sentence is correct:
watson_context_tokens["sentence"].unique()
In AD 932, King Arthur and his squire, Patsy, travel throughout Britain searching for men to join the Knights of the Round Table. Along the way, he recruits Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their squires and Robin's troubadours. Arthur leads the men to Camelot, but upon further consideration (thanks to a musical number) he decides not to go there because it is "a silly place". As they turn away, God (an image of W. G. Grace) speaks to them and gives Arthur the task of finding the Holy Grail.
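This kind of sanity check works on any token DataFrame, not just the ones in this notebook: when each token row carries its covering sentence as a column value, calling unique() on that column should return exactly one element per distinct sentence. A minimal sketch with toy data (the variable names here are illustrative, not the notebook's):

```python
import pandas as pd

# Toy token table: each token row records the sentence it belongs to.
tokens = pd.DataFrame({
    "token": ["Along", "the", "way", ",", "he", "recruits"],
    "sentence": ["Along the way, he recruits Sir Bedevere ..."] * 6,
})

# If every token maps to the same sentence, unique() has exactly one element.
sentences = tokens["sentence"].unique()
print(len(sentences))  # 1
```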
Let's create a DataFrame of token metadata that combines the higher-quality sentence information from Watson NLU with the token features from SpaCy.
context_tokens = spacy_context_tokens.copy() # Make a copy so we can modify the copy
context_tokens["sentence"] = watson_context_tokens["sentence"].copy()
context_tokens.head(10) # Show first 10 rows
id | span | lemma | pos | tag | dep | head | shape | ent_iob | ent_type | is_alpha | is_stop | sentence | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
44 | 44 | [208, 215): 'Galahad' | Galahad | PROPN | NNP | appos | 39 | Xxxxx | B | PERSON | True | False | [130, 361): 'Along the way, he recruits Sir Be... |
45 | 45 | [216, 219): 'the' | the | DET | DT | det | 46 | xxx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
46 | 46 | [220, 224): 'Pure' | pure | ADJ | JJ | appos | 44 | Xxxx | O | True | False | [130, 361): 'Along the way, he recruits Sir Be... | |
47 | 47 | [224, 225): ',' | , | PUNCT | , | punct | 39 | , | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
48 | 48 | [226, 229): 'Sir' | Sir | PROPN | NNP | compound | 49 | Xxx | O | True | False | [130, 361): 'Along the way, he recruits Sir Be... | |
49 | 49 | [230, 235): 'Robin' | Robin | PROPN | NNP | appos | 39 | Xxxxx | B | PERSON | True | False | [130, 361): 'Along the way, he recruits Sir Be... |
50 | 50 | [236, 239): 'the' | the | DET | DT | det | 63 | xxx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
51 | 51 | [240, 243): 'Not' | not | PART | RB | neg | 53 | Xxx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
52 | 52 | [243, 244): '-' | - | PUNCT | HYPH | punct | 53 | - | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
53 | 53 | [244, 249): 'Quite' | quite | VERB | VB | nmod | 63 | Xxxxx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... |
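The grafting step above works because both token DataFrames were produced from the same tokenization, so their rows line up one-to-one and pandas can assign the new column by index. The general pattern, sketched with toy data (these DataFrames are illustrative, not the notebook's actual variables):

```python
import pandas as pd

# Two token tables produced by different libraries, aligned row-for-row.
spacy_like = pd.DataFrame({
    "token": ["King", "Arthur", "travels"],
    "pos":   ["PROPN", "PROPN", "VERB"],
})
watson_like = pd.DataFrame({
    "token":    ["King", "Arthur", "travels"],
    "sentence": ["King Arthur travels"] * 3,
})

# Copy first so the original stays untouched, then graft the column over.
# Both frames share the same default RangeIndex, so assignment aligns rows.
combined = spacy_like.copy()
combined["sentence"] = watson_like["sentence"]
print(combined.columns.tolist())  # ['token', 'pos', 'sentence']
```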
The columns "head", "id", and "dep" of the SpaCy features map the tokens to nodes of the sentence's dependency parse. Specifically:
* "id" is the token's index within the document;
* "head" holds the "id" of the token's parent node in the parse tree (by SpaCy convention, the root token's head is itself);
* "dep" is the label of the dependency relationship between the token and its parent.
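To make the encoding concrete, here is a small illustration of resolving each token's parent by mapping "head" values back to "id" values. The parse fragment below is invented for illustration and is not taken from the notebook's document:

```python
import pandas as pd

# Toy dependency-parse fragment: "head" holds the id of each token's parent
# node; by SpaCy convention the root token's head is itself.
parse = pd.DataFrame({
    "id":   [0, 1, 2],
    "span": ["King", "Arthur", "rules"],
    "dep":  ["compound", "nsubj", "ROOT"],
    "head": [1, 2, 2],
})

# Map head -> id to recover each token's parent surface form.
id_to_span = parse.set_index("id")["span"]
parse["parent"] = parse["head"].map(id_to_span)
print(parse[["span", "dep", "parent"]].to_string(index=False))
```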
Text Extensions for Pandas includes a function render_parse_tree() that displays parse trees using displaCy. Let's use render_parse_tree() to render the SpaCy parse tree information for the tokens in our DataFrame:
tp.io.spacy.render_parse_tree(spacy_context_tokens)
In this notebook we demonstrated how Text Extensions for Pandas can be used to perform various NLP tasks. We started by loading our document and passing it through the Watson NLU service, extracting entities and relations. We used Text Extensions for Pandas to manipulate the span data and visualize some of our findings. We then passed the document through a SpaCy language model, which gave us additional insights such as part-of-speech tags. Finally, we combined the results from both libraries to render a parse tree.
This notebook also demonstrates how easily Text Extensions for Pandas interoperates with other popular NLP packages such as SpaCy, pandas, and IBM Watson NLU.