This notebook demonstrates the interoperability features of the open source library Text Extensions for Pandas. Specifically, we use Pandas DataFrames as a bridge between multiple natural language processing libraries. The example shown here uses IBM's Watson Natural Language Understanding service and SpaCy to perform a number of NLP tasks, such as extracting entities, relations, spans, and sentiment.
This notebook requires a Python 3.7 or later environment with the following packages:
ibm-watson
spacy
text_extensions_for_pandas
You can satisfy the dependency on text_extensions_for_pandas in either of two ways:
Run pip install text_extensions_for_pandas before running this notebook. This command adds the library to your Python environment from the latest PyPI release.
Alternatively, run this notebook from within the notebooks directory of the project source tree; the import code below falls back to the local copy of the library.
# Uncomment and run this cell if you are using this notebook in a cloud environment such
# as IBM Watson Studio or Google Colab and you want to install the required packages.
# Note: This will install packages to your environment so only run if you need to install
# these packages.
# Uncomment below cell to install packages
# !pip install ibm_watson spacy text_extensions_for_pandas
# Core Python libraries
import json
import os
import sys
import pandas as pd
from typing import *
# IBM Watson libraries
import ibm_watson
import ibm_watson.natural_language_understanding_v1 as nlu
import ibm_cloud_sdk_core
# SpaCy
import spacy
# And of course we need the text_extensions_for_pandas library itself.
try:
import text_extensions_for_pandas as tp
except ModuleNotFoundError as e:
# If we're running from within the project source tree and the parent Python
# environment doesn't have the text_extensions_for_pandas package, use the
# version in the local source tree.
if not os.getcwd().endswith("notebooks"):
raise e
if ".." not in sys.path:
sys.path.insert(0, "..")
import text_extensions_for_pandas as tp
In this section, we will set up the Watson NLU service and pass our example document through it to obtain various features and outputs. The section is divided into subsections that set up, connect to, and use the service.
In this part of the notebook, we will use the Watson Natural Language Understanding (NLU) service to extract key features from our example document.
You can create an instance of Watson NLU on the IBM Cloud for free by navigating to this page and clicking on the button marked "Get started free". You can also install your own instance of Watson NLU on OpenShift by using IBM Watson Natural Language Understanding for IBM Cloud Pak for Data.
You'll need two pieces of information to access your instance of Watson NLU: An API key and a service URL. If you're using Watson NLU on the IBM Cloud, you can find your API key and service URL in the IBM Cloud web UI. Navigate to the resource list and click on your instance of Natural Language Understanding to open the management UI for your service. Then click on the "Manage" tab to show a page with your API key and service URL.
The cell that follows assumes that you are using the environment variables IBM_API_KEY and IBM_SERVICE_URL to store your credentials. If you're running this notebook in Jupyter on your laptop, you can set these environment variables while starting up jupyter notebook or jupyter lab. For example:
IBM_API_KEY='<my API key>' \
IBM_SERVICE_URL='<my service URL>' \
jupyter lab
Alternately, you can uncomment the first two lines of code below to set the IBM_API_KEY and IBM_SERVICE_URL environment variables directly.
Be careful not to store your API key in any publicly-accessible location!
# If you need to embed your credentials inline, uncomment the following two lines and
# paste your credentials in the indicated locations.
# os.environ["IBM_API_KEY"] = "<API key goes here>"
# os.environ["IBM_SERVICE_URL"] = "<Service URL goes here>"
# Retrieve the API key for your Watson NLU service instance
if "IBM_API_KEY" not in os.environ:
raise ValueError("Expected Watson NLU api key in the environment variable 'IBM_API_KEY'")
api_key = os.environ.get("IBM_API_KEY")
# Retrieve the service URL for your Watson NLU service instance
if "IBM_SERVICE_URL" not in os.environ:
raise ValueError("Expected Watson NLU service URL in the environment variable 'IBM_SERVICE_URL'")
service_url = os.environ.get("IBM_SERVICE_URL")
This notebook uses the IBM Watson Python SDK to perform authentication on the IBM Cloud via the IAMAuthenticator class. See the IBM Watson Python SDK documentation for more information.
We start by using the API key and service URL from the previous cell to create an instance of the Python API for Watson NLU.
natural_language_understanding = ibm_watson.NaturalLanguageUnderstandingV1(
version="2019-07-12",
authenticator=ibm_cloud_sdk_core.authenticators.IAMAuthenticator(api_key)
)
natural_language_understanding.set_service_url(service_url)
natural_language_understanding
<ibm_watson.natural_language_understanding_v1.NaturalLanguageUnderstandingV1 at 0x7f86b07053d0>
Once you've opened a connection to the Watson NLU service, you can pass documents through the service by invoking the analyze() method.
The example document that we use here is an excerpt from the plot summary for Monty Python and the Holy Grail, drawn from the Wikipedia entry for that movie.
Let's preview what the raw text looks like:
from IPython.display import display, HTML
doc_file = "../resources/holy_grail_short.txt"
with open(doc_file, "r") as f:
doc_text = f.read()
display(HTML(f"<b>Document Text:</b><blockquote>{doc_text}</blockquote>"))
In AD 932, King Arthur and his squire, Patsy, travel throughout Britain searching for men to join the Knights of the Round Table. Along the way, he recruits Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their squires and Robin's troubadours. Arthur leads the men to Camelot, but upon further consideration (thanks to a musical number) he decides not to go there because it is "a silly place". As they turn away, God (an image of W. G. Grace) speaks to them and gives Arthur the task of finding the Holy Grail.
Watson Natural Language Understanding can perform multiple kinds of analysis on the example document.
We will be looking at the following: entities, keywords, relations, semantic roles, and syntax.
See the Watson NLU documentation for a full description of the types of analysis that NLU can perform.
# Make the request
response = natural_language_understanding.analyze(
text=doc_text,
# TODO: Use this URL once we've pushed the shortened document to Github
#url="https://raw.githubusercontent.com/CODAIT/text-extensions-for-pandas/master/resources/holy_grail_short.txt",
return_analyzed_text=True,
features=nlu.Features(
entities=nlu.EntitiesOptions(sentiment=True),
keywords=nlu.KeywordsOptions(sentiment=True, emotion=True),
relations=nlu.RelationsOptions(),
semantic_roles=nlu.SemanticRolesOptions(),
syntax=nlu.SyntaxOptions(sentences=True,
tokens=nlu.SyntaxOptionsTokens(lemma=True, part_of_speech=True))
)).get_result()
The response from the analyze() method is a Python dictionary. The dictionary contains an entry for each pass of analysis requested, plus some additional entries with metadata about the API request itself. Here's a list of the keys in response:
response.keys()
dict_keys(['usage', 'syntax', 'semantic_roles', 'relations', 'language', 'keywords', 'entities', 'analyzed_text'])
Text Extensions for Pandas includes a handy function, tp.io.watson.nlu.parse_response(), that turns the output of Watson NLU's analyze() method into a dictionary of Pandas DataFrames. This makes it much easier to process the output from NLU and perform downstream operations. Let's run the NLU response object through that conversion below.
dfs = tp.io.watson.nlu.parse_response(response)
dfs.keys()
dict_keys(['syntax', 'entities', 'entity_mentions', 'keywords', 'relations', 'semantic_roles'])
The output of each analysis pass that Watson NLU performed is now a DataFrame. Let's look at the outputs of the "relations" pass. Here's the original output as Python objects:
response["relations"]
[{'type': 'partOfMany', 'sentence': "Along the way, he recruits Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their squires and Robin's troubadours.", 'score': 0.610221, 'arguments': [{'text': 'Galahad', 'location': [208, 215], 'entities': [{'type': 'Person', 'text': 'Galahad'}]}, {'text': 'their', 'location': [323, 328], 'entities': [{'type': 'Person', 'text': 'their'}]}]}, {'type': 'partOfMany', 'sentence': "Along the way, he recruits Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their squires and Robin's troubadours.", 'score': 0.710112, 'arguments': [{'text': 'Lancelot', 'location': [266, 274], 'entities': [{'type': 'Person', 'text': 'Lancelot'}]}, {'text': 'their', 'location': [323, 328], 'entities': [{'type': 'Person', 'text': 'their'}]}]}, {'type': 'parentOf', 'sentence': "Along the way, he recruits Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their squires and Robin's troubadours.", 'score': 0.3821, 'arguments': [{'text': 'their', 'location': [323, 328], 'entities': [{'type': 'Person', 'text': 'their'}]}, {'text': 'squires', 'location': [329, 336], 'entities': [{'type': 'Person', 'text': 'squires'}]}]}, {'type': 'residesIn', 'sentence': 'Arthur leads the men to Camelot, but upon further consideration (thanks to a musical number) he decides not to go there because it is "a silly place".', 'score': 0.492869, 'arguments': [{'text': 'Arthur', 'location': [362, 368], 'entities': [{'type': 'Person', 'text': 'King Arthur'}]}, {'text': 'Camelot', 'location': [386, 393], 'entities': [{'type': 'GeopoliticalEntity', 'text': 'Camelot'}]}]}, {'type': 'locatedAt', 'sentence': 'Arthur leads the men to Camelot, but 
upon further consideration (thanks to a musical number) he decides not to go there because it is "a silly place".', 'score': 0.339446, 'arguments': [{'text': 'men', 'location': [379, 382], 'entities': [{'type': 'Person', 'text': 'men'}]}, {'text': 'Camelot', 'location': [386, 393], 'entities': [{'type': 'GeopoliticalEntity', 'text': 'Camelot'}]}]}, {'type': 'affectedBy', 'sentence': 'As they turn away, God (an image of W. G. Grace) speaks to them and gives Arthur the task of finding the Holy Grail.', 'score': 0.604304, 'arguments': [{'text': 'them', 'location': [572, 576], 'entities': [{'type': 'Person', 'text': 'their'}]}, {'text': 'speaks', 'location': [562, 568], 'entities': [{'type': 'EventCommunication', 'text': 'speaks'}]}]}]
And here's the DataFrame version of the same information:
dfs["relations"]
type | sentence_span | score | arguments.0.span | arguments.1.span | arguments.0.entities.type | arguments.1.entities.type | arguments.0.entities.text | arguments.1.entities.text | |
---|---|---|---|---|---|---|---|---|---|
0 | partOfMany | [130, 361): 'Along the way, he recruits Sir Be... | 0.610221 | [208, 215): 'Galahad' | [323, 328): 'their' | Person | Person | Galahad | their |
1 | partOfMany | [130, 361): 'Along the way, he recruits Sir Be... | 0.710112 | [266, 274): 'Lancelot' | [323, 328): 'their' | Person | Person | Lancelot | their |
2 | parentOf | [130, 361): 'Along the way, he recruits Sir Be... | 0.382100 | [323, 328): 'their' | [329, 336): 'squires' | Person | Person | their | squires |
3 | residesIn | [362, 512): 'Arthur leads the men to Camelot, ... | 0.492869 | [362, 368): 'Arthur' | [386, 393): 'Camelot' | Person | GeopoliticalEntity | King Arthur | Camelot |
4 | locatedAt | [362, 512): 'Arthur leads the men to Camelot, ... | 0.339446 | [379, 382): 'men' | [386, 393): 'Camelot' | Person | GeopoliticalEntity | men | Camelot |
5 | affectedBy | [513, 629): 'As they turn away, God (an image ... | 0.604304 | [572, 576): 'them' | [562, 568): 'speaks' | Person | EventCommunication | their | speaks |
As you can see above, the information is much better organized and more convenient to work with once we have it as a DataFrame.
Each row in the DataFrame contains information about a single relationship that Watson Natural Language Understanding identified in our input text. As you can see, Watson NLU returns a lot of information about each relationship. For simplicity, let's focus on three columns:
relations = dfs["relations"][["type", "arguments.0.span", "arguments.1.span"]].copy()
relations
type | arguments.0.span | arguments.1.span | |
---|---|---|---|
0 | partOfMany | [208, 215): 'Galahad' | [323, 328): 'their' |
1 | partOfMany | [266, 274): 'Lancelot' | [323, 328): 'their' |
2 | parentOf | [323, 328): 'their' | [329, 336): 'squires' |
3 | residesIn | [362, 368): 'Arthur' | [386, 393): 'Camelot' |
4 | locatedAt | [379, 382): 'men' | [386, 393): 'Camelot' |
5 | affectedBy | [572, 576): 'them' | [562, 568): 'speaks' |
Text Extensions for Pandas uses Pandas extension types to represent spans (regions of a document) and tensors (multi-dimensional arrays). For example, the "arguments.0.span" and "arguments.1.span" columns in the above DataFrame are both stored using the extension type for spans.
Here's the Pandas data type (also known as "dtype") information for the three columns of this DataFrame:
relations.dtypes
type                   object
arguments.0.span    SpanDtype
arguments.1.span    SpanDtype
dtype: object
Note how the "arguments.0.span" and "arguments.1.span" columns are of dtype SpanDtype. SpanDtype is a Pandas extension type from the Text Extensions for Pandas library.
The SpanDtype data type corresponds to two Python classes: Span for scalar values and SpanArray for array values. SpanArray is a subclass of the Pandas ExtensionArray class, which is the base class for custom 1-D array types in Pandas.
You can access the array object behind any Pandas extension type via the pandas.Series.array property:
print(relations["arguments.0.span"].array)
<SpanArray> [ [208, 215): 'Galahad', [266, 274): 'Lancelot', [323, 328): 'their', [362, 368): 'Arthur', [379, 382): 'men', [572, 576): 'them'] Length: 6, dtype: SpanDtype
Extension types support most of the functionality of built-in Pandas types like Int64Dtype and DatetimeTZDtype.
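As a quick illustration using a built-in extension type (standard Pandas, nothing specific to Text Extensions for Pandas), the nullable Int64Dtype supports the same arithmetic and aggregation operations as an ordinary NumPy-backed numeric column:

```python
import pandas as pd

# Pandas' nullable integer extension type supports the same operations as a
# NumPy-backed numeric column, while representing missing values as pd.NA.
s = pd.Series([1, 2, None], dtype=pd.Int64Dtype())

print(s.sum())        # aggregation skips missing values by default -> 3
print((s * 2).dtype)  # elementwise arithmetic preserves the extension dtype -> Int64
```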
For example, SpanDtype defines the + operation (also known as __add__()) for spans to mean "the shortest span that completely covers both input spans". So we can "add" the contents of the "arguments.0.span" and "arguments.1.span" columns of our DataFrame to obtain a span that covers both arguments, plus the text in between them.
The cell below demonstrates a simple + operation with spans.
relations["context"] = relations["arguments.0.span"] + relations["arguments.1.span"]
relations
type | arguments.0.span | arguments.1.span | context | |
---|---|---|---|---|
0 | partOfMany | [208, 215): 'Galahad' | [323, 328): 'their' | [208, 328): 'Galahad the Pure, Sir Robin the N... |
1 | partOfMany | [266, 274): 'Lancelot' | [323, 328): 'their' | [266, 328): 'Lancelot, and Sir Not-Appearing-i... |
2 | parentOf | [323, 328): 'their' | [329, 336): 'squires' | [323, 336): 'their squires' |
3 | residesIn | [362, 368): 'Arthur' | [386, 393): 'Camelot' | [362, 393): 'Arthur leads the men to Camelot' |
4 | locatedAt | [379, 382): 'men' | [386, 393): 'Camelot' | [379, 393): 'men to Camelot' |
5 | affectedBy | [572, 576): 'them' | [562, 568): 'speaks' | [562, 576): 'speaks to them' |
Take a look at the last row of the above DataFrame. The span in "arguments.0.span" comes after "arguments.1.span" in the last row, but the "context" column is still correct.
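The order-independence follows directly from the semantics of + for spans. The sketch below illustrates those semantics in plain Python; SimpleSpan is a made-up name for illustration, not the library's implementation.

```python
from dataclasses import dataclass

# A minimal sketch of the "+" semantics for spans: the shortest span that
# completely covers both inputs. Offsets are half-open [begin, end) positions
# into a shared document string.
@dataclass(frozen=True)
class SimpleSpan:
    text: str   # the full document text
    begin: int
    end: int

    def __add__(self, other: "SimpleSpan") -> "SimpleSpan":
        # Order-independent: minimum of the begins, maximum of the ends.
        return SimpleSpan(self.text,
                          min(self.begin, other.begin),
                          max(self.end, other.end))

    @property
    def covered_text(self) -> str:
        return self.text[self.begin:self.end]

doc = "speaks to them"
a = SimpleSpan(doc, 10, 14)   # 'them'
b = SimpleSpan(doc, 0, 6)     # 'speaks'
print((a + b).covered_text)   # 'speaks to them' -- same result as b + a
```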
A SpanArray can also render itself using Jupyter notebook callbacks. To see the HTML representation of the SpanArray, pass the array object to Jupyter's display() function, or make that object the last line of the cell, as in the following example:
relations["context"].array
In AD 932, King Arthur and his squire, Patsy, travel throughout Britain searching for men to join the Knights of the Round Table. Along the way, he recruits Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their squires and Robin's troubadours. Arthur leads the men to Camelot, but upon further consideration (thanks to a musical number) he decides not to go there because it is "a silly place". As they turn away, God (an image of W. G. Grace) speaks to them and gives Arthur the task of finding the Holy Grail.
This makes it very easy to visually inspect relevant portions of a document and to present any findings.
You can also convert an individual element of the array into a Python object of type Span that represents that single span as a scalar value:
target_span = relations.iloc[0]["arguments.1.span"]
target_span
[323, 328): 'their'
You can use a Span object to create a Pandas selection condition. For example, we can use a selection condition to select the rows from the relations DataFrame whose second argument's span matches the span we just stored in the variable target_span:
relations[relations["arguments.1.span"] == target_span]
type | arguments.0.span | arguments.1.span | context | |
---|---|---|---|---|
0 | partOfMany | [208, 215): 'Galahad' | [323, 328): 'their' | [208, 328): 'Galahad the Pure, Sir Robin the N... |
1 | partOfMany | [266, 274): 'Lancelot' | [323, 328): 'their' | [266, 328): 'Lancelot, and Sir Not-Appearing-i... |
Pandas extension types also support aggregation. Let's use the sum() aggregate to find the portion of the document that includes the context for all the relationships in the above DataFrame.
Recall that Text Extensions for Pandas defines the addition operator for spans as "the shortest span that completely covers both input spans". Similarly, the "sum" of a collection of spans is the shortest span that completely covers all of the spans.
max_context_span = relations[relations["arguments.1.span"] == target_span]["context"].sum()
print(f"""
Span: {str(max_context_span)}
Covered text: "{max_context_span.covered_text}"
""")
Span: [208, 328): 'Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and [...]'
Covered text: "Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their"
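The covering-interval idea extends naturally from pairs of spans to whole collections. As a plain-Python sketch of the "sum" semantics over begin/end offsets (illustrative only, not the library's code):

```python
from functools import reduce

# "Summing" spans reduces a collection to its shortest covering interval:
# the minimum begin and the maximum end across all spans.
spans = [(208, 215), (266, 274), (208, 328)]  # half-open [begin, end) offsets

def cover(a, b):
    return (min(a[0], b[0]), max(a[1], b[1]))

print(reduce(cover, spans))  # -> (208, 328)
```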
With Text Extensions for Pandas, you can use Pandas DataFrames as a common representation for your NLP application's intermediate data, regardless of which NLP library you used to produce that data.
In the cell that follows, we take the text that we just ran through Watson NLU and feed it through a SpaCy language model. Then we use the make_tokens_and_features() function from Text Extensions for Pandas to convert this output to a Pandas DataFrame of token features.
Before loading the SpaCy language model, download it with the following command:
$ python -m spacy download en_core_web_sm
Alternatively, you can add a line to the cell below to install it inline:
!python -m spacy download en_core_web_sm
doc_text = response["analyzed_text"]
spacy_language_model = spacy.load("en_core_web_sm")
token_features = tp.io.spacy.make_tokens_and_features(doc_text, spacy_language_model)
token_features
id | span | lemma | pos | tag | dep | head | shape | ent_iob | ent_type | is_alpha | is_stop | sentence | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | [0, 2): 'In' | in | ADP | IN | prep | 12 | Xx | O | True | True | [0, 129): 'In AD 932, King Arthur and his squi... | |
1 | 1 | [3, 5): 'AD' | ad | NOUN | NN | pobj | 0 | XX | B | DATE | True | False | [0, 129): 'In AD 932, King Arthur and his squi... |
2 | 2 | [6, 9): '932' | 932 | NUM | CD | nummod | 1 | ddd | I | DATE | False | False | [0, 129): 'In AD 932, King Arthur and his squi... |
3 | 3 | [9, 10): ',' | , | PUNCT | , | punct | 12 | , | O | False | False | [0, 129): 'In AD 932, King Arthur and his squi... | |
4 | 4 | [11, 15): 'King' | King | PROPN | NNP | compound | 5 | Xxxx | B | PERSON | True | False | [0, 129): 'In AD 932, King Arthur and his squi... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
142 | 142 | [606, 613): 'finding' | find | VERB | VBG | pcomp | 141 | xxxx | O | True | False | [513, 629): 'As they turn away, God (an image ... | |
143 | 143 | [614, 617): 'the' | the | DET | DT | det | 145 | xxx | B | FAC | True | True | [513, 629): 'As they turn away, God (an image ... |
144 | 144 | [618, 622): 'Holy' | Holy | PROPN | NNP | compound | 145 | Xxxx | I | FAC | True | False | [513, 629): 'As they turn away, God (an image ... |
145 | 145 | [623, 628): 'Grail' | Grail | PROPN | NNP | dobj | 142 | Xxxxx | I | FAC | True | False | [513, 629): 'As they turn away, God (an image ... |
146 | 146 | [628, 629): '.' | . | PUNCT | . | punct | 133 | . | O | False | False | [513, 629): 'As they turn away, God (an image ... |
147 rows × 13 columns
Recall that, in the previous section of this notebook, we defined a variable max_context_span containing the region of the text that covers the elements of some relationships that Watson Natural Language Understanding identified:
max_context_span
[208, 328): 'Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and [...]'
Let's identify all the rows of our SpaCy token features DataFrame that overlap with this span of the document.
The SpanArray class has a built-in operation, overlaps(), for building Pandas selection conditions based on overlapping spans. Here we use overlaps() to filter the token_features DataFrame based on the value of max_context_span:
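For reference, the overlap test on half-open spans reduces to a two-comparison predicate. Here is a plain-Python sketch (spans_overlap is a made-up name for illustration; the library's overlaps() applies this kind of test across a whole SpanArray):

```python
# Two half-open [begin, end) spans overlap exactly when each begins before
# the other ends. Spans that merely touch share no characters, so they do
# not count as overlapping.
def spans_overlap(begin1, end1, begin2, end2):
    return begin1 < end2 and begin2 < end1

print(spans_overlap(208, 215, 208, 328))  # True: 'Galahad' lies inside the context
print(spans_overlap(130, 208, 208, 328))  # False: adjacent spans do not overlap
```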
spacy_context_tokens = token_features[token_features["span"].array.overlaps(max_context_span)]
spacy_context_tokens
id | span | lemma | pos | tag | dep | head | shape | ent_iob | ent_type | is_alpha | is_stop | sentence | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
44 | 44 | [208, 215): 'Galahad' | Galahad | PROPN | NNP | appos | 39 | Xxxxx | B | PERSON | True | False | [130, 361): 'Along the way, he recruits Sir Be... |
45 | 45 | [216, 219): 'the' | the | DET | DT | det | 46 | xxx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
46 | 46 | [220, 224): 'Pure' | pure | ADJ | JJ | appos | 44 | Xxxx | O | True | False | [130, 361): 'Along the way, he recruits Sir Be... | |
47 | 47 | [224, 225): ',' | , | PUNCT | , | punct | 39 | , | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
48 | 48 | [226, 229): 'Sir' | Sir | PROPN | NNP | compound | 49 | Xxx | O | True | False | [130, 361): 'Along the way, he recruits Sir Be... | |
49 | 49 | [230, 235): 'Robin' | Robin | PROPN | NNP | appos | 39 | Xxxxx | B | PERSON | True | False | [130, 361): 'Along the way, he recruits Sir Be... |
50 | 50 | [236, 239): 'the' | the | DET | DT | det | 63 | xxx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
51 | 51 | [240, 243): 'Not' | not | PART | RB | neg | 53 | Xxx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
52 | 52 | [243, 244): '-' | - | PUNCT | HYPH | punct | 53 | - | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
53 | 53 | [244, 249): 'Quite' | quite | VERB | VB | nmod | 63 | Xxxxx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
54 | 54 | [249, 250): '-' | - | PUNCT | HYPH | punct | 53 | - | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
55 | 55 | [250, 252): 'So' | so | SCONJ | IN | advmod | 57 | Xx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
56 | 56 | [252, 253): '-' | - | PUNCT | HYPH | punct | 57 | - | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
57 | 57 | [253, 258): 'Brave' | brave | VERB | VB | pobj | 53 | Xxxxx | O | True | False | [130, 361): 'Along the way, he recruits Sir Be... | |
58 | 58 | [258, 259): '-' | - | PUNCT | HYPH | punct | 57 | - | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
59 | 59 | [259, 261): 'as' | as | ADP | IN | prep | 57 | xx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
60 | 60 | [261, 262): '-' | - | PUNCT | HYPH | punct | 59 | - | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
61 | 61 | [262, 265): 'Sir' | sir | NOUN | NN | pobj | 59 | Xxx | O | True | False | [130, 361): 'Along the way, he recruits Sir Be... | |
62 | 62 | [265, 266): '-' | - | PUNCT | HYPH | punct | 63 | - | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
63 | 63 | [266, 274): 'Lancelot' | Lancelot | PROPN | NNP | appos | 49 | Xxxxx | O | True | False | [130, 361): 'Along the way, he recruits Sir Be... | |
64 | 64 | [274, 275): ',' | , | PUNCT | , | punct | 39 | , | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
65 | 65 | [276, 279): 'and' | and | CCONJ | CC | cc | 39 | xxx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
66 | 66 | [280, 283): 'Sir' | Sir | PROPN | NNP | npadvmod | 69 | Xxx | O | True | False | [130, 361): 'Along the way, he recruits Sir Be... | |
67 | 67 | [284, 287): 'Not' | not | PART | RB | neg | 69 | Xxx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
68 | 68 | [287, 288): '-' | - | PUNCT | HYPH | punct | 69 | - | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
69 | 69 | [288, 297): 'Appearing' | appear | VERB | VBG | conj | 39 | Xxxxx | O | True | False | [130, 361): 'Along the way, he recruits Sir Be... | |
70 | 70 | [297, 298): '-' | - | PUNCT | HYPH | punct | 69 | - | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
71 | 71 | [298, 300): 'in' | in | ADP | IN | prep | 69 | xx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
72 | 72 | [300, 301): '-' | - | PUNCT | HYPH | punct | 71 | - | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
73 | 73 | [301, 305): 'this' | this | PRON | DT | pobj | 71 | xxxx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
74 | 74 | [305, 306): '-' | - | PUNCT | HYPH | punct | 75 | - | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
75 | 75 | [306, 310): 'Film' | Film | PROPN | NNP | appos | 69 | Xxxx | O | True | False | [130, 361): 'Along the way, he recruits Sir Be... | |
76 | 76 | [310, 311): ',' | , | PUNCT | , | punct | 69 | , | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
77 | 77 | [312, 317): 'along' | along | ADP | IN | prep | 69 | xxxx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
78 | 78 | [318, 322): 'with' | with | ADP | IN | prep | 77 | xxxx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
79 | 79 | [323, 328): 'their' | their | PRON | PRP$ | poss | 80 | xxxx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... |
All of the tokens in the previous cell come from a single sentence, but a SpaCy language model can incorrectly split a long sentence like this one into multiple smaller sentences, which would show up as multiple distinct values in the "sentence" column. We can use pandas.DataFrame.drop_duplicates() to show exactly which sentence spans are present in this slice of the SpaCy output:
spacy_context_tokens[["sentence"]].drop_duplicates()
sentence | |
---|---|
44 | [130, 361): 'Along the way, he recruits Sir Be... |
Alternately, we can drill down to the SpanArray object to show an HTML representation of these sentence spans in context:
spacy_context_tokens["sentence"].array.unique()
In AD 932, King Arthur and his squire, Patsy, travel throughout Britain searching for men to join the Knights of the Round Table. Along the way, he recruits Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their squires and Robin's troubadours. Arthur leads the men to Camelot, but upon further consideration (thanks to a musical number) he decides not to go there because it is "a silly place". As they turn away, God (an image of W. G. Grace) speaks to them and gives Arthur the task of finding the Holy Grail.
Watson Natural Language Understanding also identifies sentence boundaries, which gives us a second source of sentence information to compare against SpaCy's.
Earlier in this notebook, we created a Python dictionary dfs, where each item in the dictionary is a DataFrame. The DataFrame under the key "syntax" holds the output of Watson NLU's syntax analysis, which includes sentence information. Let's extract the section of this Watson NLU output that matches our target span.
watson_syntax = dfs["syntax"]
watson_context_tokens = watson_syntax[watson_syntax["span"].array.overlaps(max_context_span)]
watson_context_tokens
span | part_of_speech | lemma | sentence | |
---|---|---|---|---|
44 | [208, 215): 'Galahad' | PROPN | None | [130, 361): 'Along the way, he recruits Sir Be... |
45 | [216, 219): 'the' | DET | the | [130, 361): 'Along the way, he recruits Sir Be... |
46 | [220, 224): 'Pure' | PROPN | None | [130, 361): 'Along the way, he recruits Sir Be... |
47 | [224, 225): ',' | PUNCT | None | [130, 361): 'Along the way, he recruits Sir Be... |
48 | [226, 229): 'Sir' | PROPN | Sir | [130, 361): 'Along the way, he recruits Sir Be... |
49 | [230, 235): 'Robin' | PROPN | Robin | [130, 361): 'Along the way, he recruits Sir Be... |
50 | [236, 239): 'the' | DET | the | [130, 361): 'Along the way, he recruits Sir Be... |
51 | [240, 243): 'Not' | PROPN | None | [130, 361): 'Along the way, he recruits Sir Be... |
52 | [243, 244): '-' | PUNCT | None | [130, 361): 'Along the way, he recruits Sir Be... |
53 | [244, 249): 'Quite' | PROPN | None | [130, 361): 'Along the way, he recruits Sir Be... |
54 | [249, 250): '-' | PUNCT | None | [130, 361): 'Along the way, he recruits Sir Be... |
55 | [250, 252): 'So' | ADV | so | [130, 361): 'Along the way, he recruits Sir Be... |
56 | [252, 253): '-' | PUNCT | None | [130, 361): 'Along the way, he recruits Sir Be... |
57 | [253, 258): 'Brave' | PROPN | Brave | [130, 361): 'Along the way, he recruits Sir Be... |
58 | [258, 259): '-' | PUNCT | None | [130, 361): 'Along the way, he recruits Sir Be... |
59 | [259, 261): 'as' | ADP | as | [130, 361): 'Along the way, he recruits Sir Be... |
60 | [261, 262): '-' | PUNCT | None | [130, 361): 'Along the way, he recruits Sir Be... |
61 | [262, 265): 'Sir' | PROPN | Sir | [130, 361): 'Along the way, he recruits Sir Be... |
62 | [265, 266): '-' | PUNCT | None | [130, 361): 'Along the way, he recruits Sir Be... |
63 | [266, 274): 'Lancelot' | PROPN | None | [130, 361): 'Along the way, he recruits Sir Be... |
64 | [274, 275): ',' | PUNCT | None | [130, 361): 'Along the way, he recruits Sir Be... |
65 | [276, 279): 'and' | CCONJ | and | [130, 361): 'Along the way, he recruits Sir Be... |
66 | [280, 283): 'Sir' | PROPN | Sir | [130, 361): 'Along the way, he recruits Sir Be... |
67 | [284, 287): 'Not' | ADV | not | [130, 361): 'Along the way, he recruits Sir Be... |
68 | [287, 288): '-' | PUNCT | None | [130, 361): 'Along the way, he recruits Sir Be... |
69 | [288, 297): 'Appearing' | PROPN | None | [130, 361): 'Along the way, he recruits Sir Be... |
70 | [297, 298): '-' | PUNCT | None | [130, 361): 'Along the way, he recruits Sir Be... |
71 | [298, 300): 'in' | ADP | in | [130, 361): 'Along the way, he recruits Sir Be... |
72 | [300, 301): '-' | PUNCT | None | [130, 361): 'Along the way, he recruits Sir Be... |
73 | [301, 305): 'this' | PRON | this | [130, 361): 'Along the way, he recruits Sir Be... |
74 | [305, 306): '-' | PUNCT | None | [130, 361): 'Along the way, he recruits Sir Be... |
75 | [306, 310): 'Film' | PROPN | Film | [130, 361): 'Along the way, he recruits Sir Be... |
76 | [310, 311): ',' | PUNCT | None | [130, 361): 'Along the way, he recruits Sir Be... |
77 | [312, 317): 'along' | ADP | along | [130, 361): 'Along the way, he recruits Sir Be... |
78 | [318, 322): 'with' | ADP | with | [130, 361): 'Along the way, he recruits Sir Be... |
79 | [323, 328): 'their' | PRON | their | [130, 361): 'Along the way, he recruits Sir Be... |
The Watson NLU output correctly maps every token to the same sentence, and the span of the sentence is correct:
watson_context_tokens["sentence"].unique()
In AD 932, King Arthur and his squire, Patsy, travel throughout Britain searching for men to join the Knights of the Round Table. Along the way, he recruits Sir Bedevere the Wise, Sir Lancelot the Brave, Sir Galahad the Pure, Sir Robin the Not-Quite-So-Brave-as-Sir-Lancelot, and Sir Not-Appearing-in-this-Film, along with their squires and Robin's troubadours. Arthur leads the men to Camelot, but upon further consideration (thanks to a musical number) he decides not to go there because it is "a silly place". As they turn away, God (an image of W. G. Grace) speaks to them and gives Arthur the task of finding the Holy Grail.
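This kind of sanity check works on any token DataFrame, not just the ones in this notebook: when each token row carries its covering sentence as a column value, calling unique() on that column should return exactly one element per distinct sentence. A minimal sketch with toy data (the variable names here are illustrative, not the notebook's):

```python
import pandas as pd

# Toy token table: each token row records the sentence it belongs to.
tokens = pd.DataFrame({
    "token": ["Along", "the", "way", ",", "he", "recruits"],
    "sentence": ["Along the way, he recruits Sir Bedevere ..."] * 6,
})

# If every token maps to the same sentence, unique() has exactly one element.
sentences = tokens["sentence"].unique()
print(len(sentences))  # 1
```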
Let's create a DataFrame of token metadata that combines the higher-quality sentence information from Watson NLU with the token features from SpaCy.
context_tokens = spacy_context_tokens.copy() # Make a copy so we can modify the copy
context_tokens["sentence"] = watson_context_tokens["sentence"].copy()
context_tokens.head(10) # Show first 10 rows
id | span | lemma | pos | tag | dep | head | shape | ent_iob | ent_type | is_alpha | is_stop | sentence | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
44 | 44 | [208, 215): 'Galahad' | Galahad | PROPN | NNP | appos | 39 | Xxxxx | B | PERSON | True | False | [130, 361): 'Along the way, he recruits Sir Be... |
45 | 45 | [216, 219): 'the' | the | DET | DT | det | 46 | xxx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
46 | 46 | [220, 224): 'Pure' | pure | ADJ | JJ | appos | 44 | Xxxx | O | True | False | [130, 361): 'Along the way, he recruits Sir Be... | |
47 | 47 | [224, 225): ',' | , | PUNCT | , | punct | 39 | , | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
48 | 48 | [226, 229): 'Sir' | Sir | PROPN | NNP | compound | 49 | Xxx | O | True | False | [130, 361): 'Along the way, he recruits Sir Be... | |
49 | 49 | [230, 235): 'Robin' | Robin | PROPN | NNP | appos | 39 | Xxxxx | B | PERSON | True | False | [130, 361): 'Along the way, he recruits Sir Be... |
50 | 50 | [236, 239): 'the' | the | DET | DT | det | 63 | xxx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
51 | 51 | [240, 243): 'Not' | not | PART | RB | neg | 53 | Xxx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... | |
52 | 52 | [243, 244): '-' | - | PUNCT | HYPH | punct | 53 | - | O | False | False | [130, 361): 'Along the way, he recruits Sir Be... | |
53 | 53 | [244, 249): 'Quite' | quite | VERB | VB | nmod | 63 | Xxxxx | O | True | True | [130, 361): 'Along the way, he recruits Sir Be... |
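The grafting step above works because both token DataFrames were produced from the same tokenization, so their rows line up one-to-one and pandas can assign the new column by index. The general pattern, sketched with toy data (these DataFrames are illustrative, not the notebook's actual variables):

```python
import pandas as pd

# Two token tables produced by different libraries, aligned row-for-row.
spacy_like = pd.DataFrame({
    "token": ["King", "Arthur", "travels"],
    "pos":   ["PROPN", "PROPN", "VERB"],
})
watson_like = pd.DataFrame({
    "token":    ["King", "Arthur", "travels"],
    "sentence": ["King Arthur travels"] * 3,
})

# Copy first so the original stays untouched, then graft the column over.
# Both frames share the same default RangeIndex, so assignment aligns rows.
combined = spacy_like.copy()
combined["sentence"] = watson_like["sentence"]
print(combined.columns.tolist())  # ['token', 'pos', 'sentence']
```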
The columns "head", "id", and "dep" of the SpaCy features map the tokens to nodes of the sentence's dependency parse. Specifically:
* "id" is the token's index within the document;
* "head" holds the "id" of the token's parent node in the parse tree (by SpaCy convention, the root token's head is itself);
* "dep" is the label of the dependency relationship between the token and its parent.
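To make the encoding concrete, here is a small illustration of resolving each token's parent by mapping "head" values back to "id" values. The parse fragment below is invented for illustration and is not taken from the notebook's document:

```python
import pandas as pd

# Toy dependency-parse fragment: "head" holds the id of each token's parent
# node; by SpaCy convention the root token's head is itself.
parse = pd.DataFrame({
    "id":   [0, 1, 2],
    "span": ["King", "Arthur", "rules"],
    "dep":  ["compound", "nsubj", "ROOT"],
    "head": [1, 2, 2],
})

# Map head -> id to recover each token's parent surface form.
id_to_span = parse.set_index("id")["span"]
parse["parent"] = parse["head"].map(id_to_span)
print(parse[["span", "dep", "parent"]].to_string(index=False))
```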
Text Extensions for Pandas includes a function render_parse_tree() that displays parse trees using displaCy. Let's use render_parse_tree() to render the SpaCy parse tree information for the tokens in our DataFrame:
tp.io.spacy.render_parse_tree(spacy_context_tokens)
In this notebook we demonstrated how Text Extensions for Pandas can be used to perform various NLP tasks. We started by loading our document and passing it through the Watson NLU service, extracting entities and relations. We used Text Extensions for Pandas to manipulate the span data and visualize some of our findings. We then passed the document through a SpaCy language model, which gave us additional insights such as part-of-speech tags. Finally, we combined the results from both libraries to render a parse tree.
This notebook also demonstrates how easily Text Extensions for Pandas interoperates with other popular NLP packages such as SpaCy, pandas, and IBM Watson NLU.