This notebook demonstrates how to visualize term frequency distributions using the Rosette API via the /morphology/lemmas endpoint. You can check out the code on GitHub.
The first section imports the Rosette API Python binding module, along with some helper methods from visualize.py and compare_vocabulary.py. These modules can also be used via their command-line drivers if preferred (run ./visualize.py -h and ./compare_vocabulary.py -h for usage instructions). Finally, we import some helper methods for rendering inline HTML within the notebook.
import os
from visualize import visualize, color_key
from compare_vocabulary import fdist, load_stopwords
from rosette.api import API
from IPython.display import display, HTML
The next step is to initialize the Rosette API client so that we can make API calls. For this we need a Rosette API key; if you already have a key, or you want to sign up for one, head over to https://developer.rosette.com. After instantiating a rosette.api.API instance, we also set the output URL parameter to rosette because we want the detailed morphological analyses from Rosette's Annotated Data Model (ADM), which include the part-of-speech annotations.
from getpass import getpass

api = API(
    user_key=(
        os.environ.get('ROSETTE_USER_KEY') or  # load key from environment variable if possible
        getpass(prompt='Enter your Rosette API key: ')  # fall back to prompting the user for a key
    ),
    service_url='https://api.rosette.com/rest/v1/'
)
api.set_url_parameter('output', 'rosette')
The next step is to determine which part-of-speech tags we are interested in comparing. The less interesting tags have been commented out below, but you can experiment with different tags based on your interests.
POS_TAGS = {
    'ADJ',    # Adjective
    'ADP',    # Adposition
    'ADV',    # Adverb
    'AUX',    # Auxiliary verb
    'CONJ',   # Coordinating conjunction
    #'DET',   # Determiner
    'INTJ',   # Interjection
    'NOUN',   # Noun
    'NUM',    # Numeral
    #'PART',  # Particle
    'PRON',   # Pronoun
    'PROPN',  # Proper noun
    #'PUNCT', # Punctuation
    'SCONJ',  # Subordinating conjunction
    'SYM',    # Symbol
    'VERB',   # Verb
    'X',      # Other
}
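As a quick illustration of how a tag set like this can be used, lemma/part-of-speech pairs can be filtered by set membership before counting. This is only a sketch with hypothetical data; the real pairs come from the Rosette morphology results, and the helper functions in compare_vocabulary.py may do this differently.

```python
from collections import Counter

# Abbreviated tag set for the example
POS_TAGS = {'ADJ', 'NOUN', 'VERB'}

# Hypothetical (lemma, POS) pairs, standing in for real morphology results
lemma_pos_pairs = [
    ('raven', 'NOUN'), ('the', 'DET'), ('dreary', 'ADJ'),
    ('ponder', 'VERB'), ('raven', 'NOUN'), ('and', 'CONJ'),
]

# Keep only lemmas whose part-of-speech tag is in the set of interest
filtered = Counter(lemma for lemma, pos in lemma_pos_pairs if pos in POS_TAGS)
print(filtered.most_common())
```

Terms tagged DET and CONJ drop out of the distribution entirely, so commenting a tag in or out of POS_TAGS directly controls which words appear in the visualization.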
data Directories

The following block identifies the directories data/{carroll,frost,poe,shakespeare,whitman,yeats} as corpora to analyze. These directories comprise small collections of poems by famous poets. You can add your own corpora simply by adding directories of plain-text .txt files to the data directory and replacing the directory names below. We also pick a value for n here, which determines the cut-off point limiting each frequency distribution to the top-n most frequent terms in the corpus. If you want more results you can increase n, and if you want to analyze the entire vocabulary you can set n = None.
corpora = 'carroll', 'frost', 'poe', 'shakespeare', 'whitman', 'yeats'
n = 100 # visualize frequencies for top n most frequent terms
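The top-n cut-off behaves the way Python's collections.Counter.most_common does: passing an integer keeps only the most frequent terms, and passing None returns everything. A minimal sketch with a toy distribution (the real distributions come from fdist):

```python
from collections import Counter

# A toy frequency distribution standing in for a real corpus's counts
fd = Counter({'love': 5, 'night': 3, 'rose': 2, 'dream': 1})

print(fd.most_common(2))     # only the top-2 terms
print(fd.most_common(None))  # None means no cut-off: the full vocabulary
```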
To help interpret the color-coded part-of-speech tags for each term, a color key is rendered below.
display(HTML(color_key()))
| Tag | Name | Color |
|---|---|---|
| ADJ | Adjective | seagreen |
| ADP | Adposition | brown |
| ADV | Adverb | limegreen |
| AUX | Auxiliary verb | blue |
| CONJ | Coordinating conjunction | orangered |
| DET | Determiner | silver |
| INTJ | Interjection | mocha |
| NOUN | Noun | orange |
| NUM | Numeral | skyblue |
| PART | Particle | magenta |
| PRON | Pronoun | red |
| PROPN | Proper noun | violet |
| PUNCT | Punctuation | teal |
| SCONJ | Subordinating conjunction | goldenrod |
| SYM | Symbol | olive |
| VERB | Verb | purple |
| X | Other | black |
The following code loops over each directory and computes a frequency distribution from the terms that occur in the corpus. Note that the lemmas in each frequency distribution are filtered against an English stopword list. Each term is then rendered in a color corresponding to its part of speech, sized relative to its frequency. You can hover your mouse over individual terms to see the numerical frequencies.
stopwords = load_stopwords('stopwords.json')['eng']

for corpus in corpora:
    display(HTML(f'<h1>{corpus}</h1>'))
    fd = fdist(f'data/{corpus}', api, n, stopwords)
    display(HTML(visualize(fd, pos_tags=POS_TAGS)))
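The stopword filtering mentioned above can be pictured as dropping entries from the frequency distribution before visualization. This is a simplified sketch with a stand-in stopword set rather than the real contents of stopwords.json, and it may not match the internals of fdist exactly:

```python
from collections import Counter

# Stand-in stopword set; the notebook loads the real list from stopwords.json
stopwords = {'the', 'and', 'of'}

counts = Counter(['the', 'raven', 'and', 'the', 'nevermore', 'raven'])

# Drop stopwords from the frequency distribution before visualizing
fd = Counter({term: c for term, c in counts.items() if term not in stopwords})
print(fd)
```

Without this step, high-frequency function words would dominate the top-n cut-off and crowd out the content words that distinguish one poet's vocabulary from another's.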