Visualize Term Frequency Distributions

This notebook demonstrates how to visualize term frequency distributions using the Rosette API's /morphology/lemmas endpoint. You can check out the code on GitHub.

Setup

The first section imports the Rosette API Python binding module, along with helper functions from visualize.py and compare_vocabulary.py. Both modules can also be run through their command-line drivers if preferred (run ./visualize.py -h and ./compare_vocabulary.py -h for usage instructions). Finally, we import helpers for rendering inline HTML within the notebook.

In [1]:
import os

from visualize import visualize, color_key
from compare_vocabulary import fdist, load_stopwords
from rosette.api import API
from IPython.display import display, HTML

Instantiating a Rosette API Instance

The next step is to initialize the Rosette API client so that we can make API calls. This requires a Rosette API key; if you do not have one yet, you can sign up at https://developer.rosette.com. After instantiating a rosette.api.API instance, we set the output URL parameter to rosette so that responses use Rosette's Annotated Data Model (ADM), whose detailed morphological analyses give us access to the part-of-speech annotations.

In [2]:
from getpass import getpass

api = API(
    user_key=(
        os.environ.get('ROSETTE_USER_KEY') or # load key from environment variable if possible
        getpass(prompt='Enter your Rosette API key: ') # fall back to prompting user for key
    ),
    service_url='https://api.rosette.com/rest/v1/'
)
api.set_url_parameter('output', 'rosette')
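
To make the role of the ADM output more concrete, here is a minimal sketch of pulling (lemma, part-of-speech) pairs out of an ADM-style response. It does not call the service: sample_adm is a hand-written stand-in, and its field names (attributes, token, items, analyses, lemma, partOfSpeech) are assumptions about the ADM layout rather than a verified copy of Rosette's output. The helper lemma_pos_pairs is likewise hypothetical and is not part of the Rosette binding.

```python
# Hand-written stand-in for an ADM-style response. The field names are
# assumptions for illustration only, not verified Rosette output.
sample_adm = {
    "attributes": {
        "token": {
            "items": [
                {"text": "Oysters",
                 "analyses": [{"lemma": "oyster", "partOfSpeech": "NOUN"}]},
                {"text": "shone",
                 "analyses": [{"lemma": "shine", "partOfSpeech": "VERB"}]},
                {"text": "!",
                 "analyses": [{"lemma": "!", "partOfSpeech": "PUNCT"}]},
            ]
        }
    }
}


def lemma_pos_pairs(adm):
    """Yield (lemma, part-of-speech) from the first analysis of each token."""
    tokens = adm.get("attributes", {}).get("token", {}).get("items", [])
    for token in tokens:
        analyses = token.get("analyses") or []
        if not analyses:
            continue  # skip tokens that carry no morphological analysis
        first = analyses[0]
        # Fall back to the surface text if no lemma is present.
        yield first.get("lemma", token.get("text")), first.get("partOfSpeech")


pairs = list(lemma_pos_pairs(sample_adm))
```

Whatever the exact response layout turns out to be, the idea is the same: each token contributes a lemma to count and a part-of-speech tag to color it by.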

Decide which Part-of-Speech (POS) Tags to Include

The next step is to determine which part-of-speech tags we are interested in comparing. The less interesting tags have been commented out below, but you can experiment with different tags based on your interests.

In [3]:
POS_TAGS = {
    'ADJ',     # Adjective
    'ADP',     # Adposition
    'ADV',     # Adverb
    'AUX',     # Auxiliary verb
    'CONJ',    # Coordinating conjunction
    #'DET',    # Determiner
    'INTJ',    # Interjection
    'NOUN',    # Noun
    'NUM',     # Numeral
    #'PART',   # Particle
    'PRON',    # Pronoun
    'PROPN',   # Proper noun
    #'PUNCT',  # Punctuation
    'SCONJ',   # Subordinating conjunction
    'SYM',     # Symbol
    'VERB',    # Verb
    'X',       # Other
}
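
As a rough illustration of what a tag set like POS_TAGS is for, the sketch below keeps only the (lemma, tag) pairs whose tag is in a chosen set. The names SELECTED_TAGS, tagged_lemmas, and keep_selected are hypothetical and exist only for this example; the actual filtering happens inside the visualize helper, whose internals are not shown in this notebook.

```python
# Hypothetical subset of tags, mirroring the idea of POS_TAGS above.
SELECTED_TAGS = {'NOUN', 'VERB', 'PROPN'}

# Made-up (lemma, tag) pairs standing in for analyzed text.
tagged_lemmas = [
    ('the', 'DET'),
    ('Walrus', 'PROPN'),
    ('and', 'CONJ'),
    ('oyster', 'NOUN'),
    ('walk', 'VERB'),
    ('.', 'PUNCT'),
]


def keep_selected(pairs, tags):
    """Return only the (lemma, tag) pairs whose tag is in `tags`."""
    return [(lemma, tag) for lemma, tag in pairs if tag in tags]


filtered = keep_selected(tagged_lemmas, SELECTED_TAGS)
```

Because the tags are stored in a set, each membership check is constant time, and commenting a tag in or out changes the result without touching the filtering logic.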

Load Corpora from data Directories

The following block identifies the directories data/{carroll,frost,poe,shakespeare,whitman,yeats} as the corpora to analyze. Each directory contains a small collection of poems by a famous poet. To analyze your own corpora, add directories of plain-text .txt files to the data directory and replace the directory names below. We also pick a value for n, the cut-off that limits each frequency distribution to the n most frequent terms in the corpus. Increase n to see more results, or set n = None to analyze the entire vocabulary.

In [4]:
corpora = 'carroll', 'frost', 'poe', 'shakespeare', 'whitman', 'yeats'
n = 100 # visualize frequencies for top n most frequent terms
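
The effect of the cut-off n can be sketched with the standard library's collections.Counter. This is only an illustration of the top-n idea; the function top_n and the sample terms are hypothetical, and the fdist helper used later may be implemented differently.

```python
from collections import Counter


def top_n(terms, n=None):
    """Return (term, count) pairs for the n most frequent terms.

    With n=None every distinct term is returned, most frequent first.
    """
    return Counter(terms).most_common(n)


# Made-up lemmas standing in for a small corpus.
sample_terms = ['sea', 'bird', 'sea', 'door', 'sea', 'bird']

top_two = top_n(sample_terms, 2)
everything = top_n(sample_terms)
```

Counter.most_common accepts None to mean "no limit", which matches the n = None convention described above.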

Display Color Key

To help interpret the color-coded part-of-speech tags for each term, a color key is rendered below.

In [5]:
display(HTML(color_key()))

Color Key

Tag    Name                       Color
ADJ    Adjective                  seagreen
ADP    Adposition                 brown
ADV    Adverb                     limegreen
AUX    Auxiliary verb             blue
CONJ   Coordinating conjunction   orangered
DET    Determiner                 silver
INTJ   Interjection               mocha
NOUN   Noun                       orange
NUM    Numeral                    skyblue
PART   Particle                   magenta
PRON   Pronoun                    red
PROPN  Proper noun                violet
PUNCT  Punctuation                teal
SCONJ  Subordinating conjunction  goldenrod
SYM    Symbol                     olive
VERB   Verb                       purple
X      Other                      black

Visualize the Frequency Distributions

The following code loops over each corpus directory and computes a frequency distribution from the terms that occur in it. The lemmas in each frequency distribution are filtered against an English stopword list. Each term is then rendered in the color of its part-of-speech tag, at a size scaled relative to its frequency. Hover your mouse over an individual term to see its numerical frequency.
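
For readers curious how such a rendering could work, here is a minimal sketch under stated assumptions. The function render_terms, the COLORS subset, the pixel range, and the linear scaling are all hypothetical choices made for this example; they are not taken from visualize.py, which may use a different layout and scaling.

```python
from html import escape

# Two colors borrowed from the color key above; other tags fall back to black.
COLORS = {'NOUN': 'orange', 'VERB': 'purple'}


def render_terms(freqs, stopwords, min_px=12, max_px=48):
    """Render {(lemma, tag): count} as HTML spans sized by frequency.

    Stopwords are dropped, the most frequent term gets max_px, and each
    span's title attribute holds the count so it shows on hover.
    """
    kept = {key: count for key, count in freqs.items()
            if key[0].lower() not in stopwords}
    if not kept:
        return ''
    peak = max(kept.values())
    spans = []
    # Most frequent terms first.
    for (lemma, tag), count in sorted(kept.items(), key=lambda kv: -kv[1]):
        size = min_px + (max_px - min_px) * count / peak  # linear scaling
        color = COLORS.get(tag, 'black')
        spans.append(
            f'<span title="{count}" '
            f'style="color:{color};font-size:{size:.0f}px">'
            f'{escape(lemma)}</span>'
        )
    return ' '.join(spans)


html_snippet = render_terms(
    {('sea', 'NOUN'): 4, ('the', 'DET'): 9, ('fly', 'VERB'): 2},
    stopwords={'the'},
)
```

In a notebook, the resulting string could be shown with display(HTML(html_snippet)), in the same way the cells in this notebook display their output.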

In [6]:
stopwords = load_stopwords('stopwords.json')['eng']
    
for corpus in corpora:
    display(HTML(f'<h1>{corpus}</h1>'))
    fd = fdist(f'data/{corpus}', api, n, stopwords)
    display(HTML(visualize(fd, pos_tags=POS_TAGS)))

carroll

say come Walrus Carpenter give oyster youth one jaw stand eye shine good Oysters four little old Jabberwock bird hand two go head day sea make odd think pleasant know Father William yet Twas brillig slithy tove gyre gimble mimsy borogoves mome raths outgrabe beware son claw take vorpal time rest thought leave sun might night get sand dry see cloud fly walk weep anything clear seven suppose tear o eldest shake young shoe yet thick wait thing cry fat now can turn nothing seem trot none

frost

one go make wood say good tree know see wall think come birch ice will keep take leaf back do not ground stone neighbor across like like away stop though snow dark two bend one day way sun leave cow boy break learn climb fire fill must frozen year give ask mile sleep road diverge travel long far just wear morning black oh tell age something love boulder gap even can find line set fall use stay turn pine apple never get fence head wall top hand seem father well

poe

thou door sea upon soul nevermore chamber Lenore Raven bird love eye nothing heart say still Annabel Lee come angel now within wind far shall lie shadow die name let bust sit tell leave o'er one maiden never feel dream ever word quoth fly Heaven kingdom love Eldorado lie stone tap ah floor seek sorrow sad long speak see mystery day shore song unto light fall prophet take heaven beautiful go night life throne tower sky alas rap Tis grow implore sure hear dream back yore perch

shakespeare

thou shall sonnet eye love can see woe time death love summer shake fair never day art hath eternal Death long man long life sweet thing new dear friend night think war fire find doom behold nothing make like go leave upon see'st black take must well alter ever Love lip cheek far red white wire rose mistress yet XVIII compare lovely temperate rough wind darling bud May lease short date sometimes hot heaven shine often gold complexion dim sometime decline chance nature change course untrimmed fade lose possession owest

whitman

O one sing alone book make love utter love hour joy o day ocean look know think word free one much man will space air lose life repay lover see Louisiana grow without joyous leaf friend leave self person yet worthy say far power form step plea hardly go time besiege place artillery shut door library shelf yet need bring intellect madness deep yield dark mine something must come long die meet separate carry diverse little dear City live pass eye offer live oak stand moss near upon seek

yeats

heart now will go make come upon man know give peace eye seem beauty O can day hear water old like thing still head bring Second say Time burn never dream drop wing blood o fall crowd face grow hand cloud among well great bid year young can not Coming turn lion form breed amid secret Innisfree arise build nine alone shall slow sing full always lake shore deep Maid Quiet wind gather show last record walk street whereon autumn stone since one see suddenly look lover away meet somewhere fight Kiltartan