Word2Vec Analysis on the Gnadenhutten Massacre

Author: Ung, Lik Teng
Class: DH150, Winter 2019
Instructor: Professor Ashley Sanders Garcia

Word2Vec is a popular word embedding technique that models words in a high-dimensional vector space, going beyond simple frequency counts. Its advantage is that it can capture the "contexts" in which a word appears within a specific corpus. I trained a Word2Vec model on nine newspaper articles about the Gnadenhutten Massacre, which happened on March 8, 1782. I am interested in how the different sides involved in this massacre were discussed in public discourse; specifically, I am interested in the words most closely associated with the Moravian Indians and the American militia.
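What "capturing context" means numerically can be sketched with cosine similarity, the measure `most_similar` ranks by. The three-dimensional vectors below are invented purely for illustration; a real model such as the one trained later in this notebook learns vectors with hundreds of dimensions from the corpus itself.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two word vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d "embeddings" (invented for illustration only).
vectors = {
    "militia": np.array([0.9, 0.1, 0.2]),
    "soldier": np.array([0.8, 0.2, 0.1]),
    "prayer":  np.array([0.1, 0.9, 0.7]),
}

# Words used in similar contexts end up with similar vectors,
# so their cosine similarity is high.
print(cosine(vectors["militia"], vectors["soldier"]))  # high (close to 1)
print(cosine(vectors["militia"], vectors["prayer"]))   # low
```

Word2Vec's training objective pushes words that occur in similar contexts toward nearby points in this space, which is why cosine similarity serves as a proxy for contextual association.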


In [35]:
import cython, os #ENSURE cython package is installed on computer/canopy
import string, re, collections
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from string import ascii_letters, digits
from smart_open import smart_open

import gensim
from gensim.models import phrases 
from gensim import corpora, models, similarities #calc all similarities at once, from http://radimrehurek.com/gensim/tut3.html
from gensim.models import Word2Vec, KeyedVectors

from sklearn.manifold import TSNE
from sklearn.feature_extraction.text import CountVectorizer

# from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
# import plotly.offline as py 
# py.init_notebook_mode(connected=True)
# import plotly.graph_objs as go
# import plotly.tools as tls

import plotly.plotly as py
import plotly.tools as plotly_tools
import plotly.graph_objs as go

plotly_tools.set_credentials_file(username='unglikteng', api_key='YOUR_API_KEY')  # replace with your own Plotly API key; never commit a real key

from nltk import word_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

plt.style.use('ggplot')
%matplotlib inline

1. Documents Import

The directory path and filenames are hardcoded here. Point them at your own text documents if you would like to analyze them with Word2Vec.


In [3]:
primaries = []
primaryPath = os.path.join(os.path.realpath(""), "primary")
for root, directories, filenames in os.walk("primary"):
    for txt in filenames:
        path = os.path.join(primaryPath, txt)
        with open(path, "r") as f:
            primaries.append("".join(f.read().splitlines()))
In [4]:
# 9 Newspaper articles
len(primaries)
Out[4]:
9
In [5]:
# Get a sense of what the articles look like
primaries[0]
Out[5]:
'Centenary of GnadenhuttenInformation about the Old Ohio Moravian Settlement and its MassacreSpecial Correspondence of the Cincinnati Gazette Newark, O., April 27, - Gnadenhutten was established by a Christian Indian named Joshua, who brought with him a party of Mohicans, and proceeded to lay out the town on 24th day of September, 1772. It was on the west side of the Tuscarawas river, four miles above Schronbrunn (a Moravian village already established), and was called “Upper Town”. This location, however, was not satisfactory to the Netawatwees, then the reigning chief of the Delaware nation, who caused it to be removed to a point about eight miles below Schonbrunn, on the east side of the river. Here Gnadenhutten (Tents of Grace) was laid out October 9, 1772, by Joshua and his party, who were from the Moravian village of Friedenstadt (City of Peace), located on the Beaver river, in Pennsylvania. This village was subsequently removed and added to the villages of Gnadenhutten and Schonbrunn. Rev. David Zeisberger preached the first sermon in Gnadenhutten October 17, 1772.The British seized the Moravian Gnadenhutten, and with their horses, cattle, etc., drove them prisoners to the “Sandusky plains,” by Captain Matthew Elliott of the British army (a while American renegade, however), who had under his command at the time 300 hostile Indians. They were made captives September 11, 1781, and the party reached Sandusky river on the first day of October following, when they went into camp. The leaders of these Moravians at the time of the removal were Revs. Zeisberger, Senseman, and Jungman, of New Schonbrunn; Revs. Heckwelder and Jung, of Salem, and Rev. William Edwards, of Gnadenhutten. This camp, subsequently known as “Captives’ Town,” was located in the heart of the then hostile Wyandot country, on the Sandusky river, about a mile above the mouth of Broken Sword creek, and ten miles from the present town of Upper Sandusky. 
Here the captives were allowed to build huts and go into winter quarters. Late in October, 1781, leaders only were ordered to Detroit, there to go before the British commandant, Major DePeyster, to answer to the charge against them of aiding the Americans. They soon proved themselves innocent, and were sent back to “Captives’ Town,” on the Sandusky.'

2. Text Preprocessing


In [6]:
# Define our Text Preprocessor class
## Tokenize -> Remove stopwords -> Stemming 
class Preprocessor:
    def tokenize_word(self, sentence, to_token = None):
        # all lower case 
        lower = sentence.strip().lower()
        
        # remove punctuation
        punctuation_table = str.maketrans(string.punctuation, len(string.punctuation)*' ' )
        noPunc = lower.translate(punctuation_table)
        
        # remove digit
        nodigit = re.sub(r'\d+', '', noPunc)
        nodigit = re.sub(r'\s+', ' ', nodigit).strip()
        if to_token:
            tokenized = word_tokenize(nodigit)
            return tokenized
        return nodigit 
    
    def stem_word(self, tokens):
        stemmer = SnowballStemmer("english")
        stemmed = []
        for token in tokens:
            stemmed.append(stemmer.stem(token))
        return stemmed
    
    def remove_stopwords(self, tokens):
        stopword_list = stopwords.words("english")
        filtered = [w for w in tokens if w not in stopword_list]
        return filtered
    
In [7]:
# Define preprocessor object
preprocessor = Preprocessor()

primariesToken = [preprocessor.stem_word(
                        preprocessor.remove_stopwords(
                            preprocessor.tokenize_word(line, to_token=True))) 
                              for line in primaries]

primariesUnstemmed = [preprocessor.remove_stopwords(
                        preprocessor.tokenize_word(line, to_token=True)) 
                              for line in primaries]
In [8]:
# Build stemming dictionary
# This dictionary will help us trace back to the unstemmed words
stem_dict = {}
for i, row in enumerate(primariesUnstemmed):
    for j, token in enumerate(row):
        stem_dict.update({primariesToken[i][j]:token})
        
In [9]:
# Visualize the corpus - Frequency Analysis

def word_counter(list_of_doc):
    countVec = CountVectorizer()
    df_cv = countVec.fit_transform(list_of_doc)
    word_freq = dict(zip(countVec.get_feature_names(), np.asarray(df_cv.sum(axis=0)).ravel()))
    word_counter = collections.Counter(word_freq)
    word_counter_df = pd.DataFrame(word_counter.most_common(20), columns = ['word','freq'])

    a4_dims = (15, 10)
    fig, ax = plt.subplots(figsize = a4_dims)
    sns.barplot(x="word", y="freq", data=word_counter_df, palette= "PuBuGn_d",ax=ax)
    return word_counter
In [10]:
wc_primaries = word_counter([" ".join(tokens) for tokens in primariesToken])

3. Word2Vec Training


In [11]:
# Text Preprocessing -> Phrase Detection with Gensim -> Word2Vec Training 
## Phrase Detection

bigram_transformer = phrases.Phrases(primariesToken) 
bigram= phrases.Phraser(bigram_transformer)

Word2Vec Hyperparameters:

  • sg=1: the skip-gram method is used instead of CBOW (Continuous Bag of Words), since skip-gram generally performs better on small datasets
  • size=500: the dimensionality of the word vectors
  • min_count=2: since the corpus is quite small, setting min_count to 2 is reasonable
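The effect of `min_count` can be sketched in plain Python: during vocabulary building, any token whose total frequency across the corpus falls below the threshold is dropped before training. The token list below is invented for illustration.

```python
from collections import Counter

# Sketch of what min_count=2 does during vocabulary building:
# tokens appearing fewer than min_count times are discarded.
min_count = 2
tokens = ["massacr", "moravian", "massacr", "ohio", "militia", "moravian"]

counts = Counter(tokens)
vocab = {w for w, c in counts.items() if c >= min_count}
print(sorted(vocab))  # ['massacr', 'moravian']
```

With a corpus this small, a higher threshold would discard too many words; min_count=2 keeps the vocabulary usable while still filtering one-off OCR noise.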
In [12]:
model_primaries = Word2Vec(bigram[primariesToken], workers=4, sg=1,size=500,window=5, min_count = 2, sample=1e-3)

model_primaries.init_sims(replace=True) #Precompute L2-normalized vectors. If replace is set to TRUE, forget the original vectors and only keep the normalized ones. Saves lots of memory, but can't continue to train the model.
model_primaries.save("model_primaries") #save your model for later use! change the name to something to remember the hyperparameters you trained it with
In [13]:
# Load the model
model_p = Word2Vec.load("model_primaries")
In [14]:
# There are 1318 words in the vocabulary
len(model_p.wv.vocab)
Out[14]:
1318
In [15]:
model_p.wv.vocab.keys()
Out[15]:
dict_keys(['old', 'ohio', 'moravian', 'settlement', 'correspond', 'cincinnati', 'newark', 'april', 'gnadenhutten', 'establish', 'christian_indian', 'name', 'joshua', 'brought', 'parti', 'proceed', 'lay', 'town', 'th', 'day', 'septemb', 'west', 'side', 'tuscarawa_river', 'four', 'mile', 'villag', 'alreadi', 'call', '“', 'upper', '”', 'locat', 'howev', 'reign', 'chief', 'delawar', 'nation', 'caus', 'remov', 'point', 'eight', 'schonbrunn', 'east', 'river', 'tent', 'grace', 'laid', 'octob', 'peac', 'pennsylvania', 'subsequ', 'rev', 'david_zeisberg', 'preach', 'first', 'british', 'seiz', 'hors', 'cattl', 'etc', 'drove', 'prison', 'sanduski', 'plain', 'captain', 'matthew', 'elliott', 'armi', 'american', 'command', 'time', 'hostil', 'indian', 'made', 'captiv', 'reach', 'follow', 'went', 'camp', 'leader', 'zeisberg', 'senseman', 'new', 'heckweld', 'jung', 'salem', 'william', 'known', '’', 'heart', 'wyandot', 'countri', 'broken', 'creek', 'ten', 'present', 'allow', 'build', 'hut', 'go', 'winter', 'quarter', 'late', 'order', 'detroit', 'major', 'answer', 'charg', 'aid', 'soon', 'prove', 'innoc', 'sent', 'back', 'fear', 'massacr', 'interest', 'event', 'advoc', 'week', 'two', 'happen', 'tuscarawa', 'counti', 'took', 'visit', 'site', 'ancient', 'trust', 'brief', 'account', 'place', 'terribl', 'may', 'reader', 'conclud', 'make', 'subject', 'communic', 'deserv', 'among', 'respect', 'valuabl', 'church', 'great', 'britain', 'origin', 'brethren', 'law', 'christ', 'know', 'unit', 'one', 'peculiar', 'unusu', 'belief', 'say', 'exhibit', 'submit', 'import', 'concern', 'member', 'lot', 'consist', 'number', 'small', 'cylind', 'inch', 'long', 'half', 'construct', 'end', 'pull', 'apart', 'disclos', 'word', 'yes', 'case', 'alik', 'far', 'appear', 'contain', 'use', 'princip', 'matter', 'instanc', 'young', 'man', 'mind', 'would', 'like', 'certain', 'woman', 'wife', 'minist', 'state', 'take', 'littl', 'put', 'thorough', 'consid', 'provid', 'approv', 'match', 'reason', 'although', 'need', 
'marri', 'yet', 'also', 'much', 'decid', 'whether', 'accept', 'missionari', 'field', 'labor', 'character', 'hold', 'exampl', 'other', 'everi', 'bodi', 'christian', 'whatev', 'persuad', 'engag', 'mission', 'care', 'quarrel', 'carri', 'address', 'men', 'given', 'bethlehem', 'alway', 'still', 'center', 'unit_state', 'home', 'offic', 'societi', 'sever', 'year', 'revolutionari', 'war', 'wilder', 'convert', 'tribe', 'met', 'good', 'success', 'larg', 'savag', 'built', 'inhabit', 'three', 'within', 'goshen', 'washington', 'stand', 'beauti', 'situat', 'bank', 'south', 'philadelphia', 'columbus', 'railroad', 'hundr', 'faith', 'neat', 'meet', 'hous', 'pass', 'street', 'struck', 'quiet', 'throughout', 'modern', 'mani', 'gabl', 'ret', 'tradit', 'past', 'peopl', 'rush', 'hurri', 'world', 'around', 'simpl', 'tast', 'deepli', 'religi', 'earth', 'desir', 'daili', 'prayer', 'strife', 'sober', 'wish', 'never', 'learn', 'along', 'life', 'keep', 'way', 'attend', 'servic', 'sunday', 'inform', 'school', 'upon', 'congreg', 'born', 'seventi', 'year_ago', 'kind', 'accompani', 'spot', 'edg', 'eye', 'sacr', 'found', 'modest', 'foot', 'part', 'embrac', 'adjoin', 'graveyard', 'enclos', 'fenc', 'stood', 'occur', 'cruel', 'honor', 'grow', 'forest', 'tree', 'grown', 'sinc', 'seen', 'appl', 'plant', 'garden', 'ground', 'cellar', 'visibl', 'taken', 'char', 'corn', 'wood', 'pick', 'stone', 'piec', 'burn', 'red', 'hard', 'bore', 'mark', 'heat', 'lie', 'heap', 'purpos', 'erect', 'monument', 'last', 'rest', 'perhap', 'incid', 'histori', 'cruelti', 'equal', 'butcheri', 'pale', 'bloodi', 'slaughter', 'murder', 'king', 'excus', 'issu', 'mistak', 'true', 'addit', 'civil', 'nineti', 'act', 'without', 'even', 'grassi', 'could', 'scarc', 'horror', 'listen', 'recit', 'dark', 'deed', 'commit', 'sat', 'said', 'live', 'attach', 'wa', 'troubl', 'whose', 'territori', 'alli', 'look', 'leagu', 'white', 'hand', 'station', 'fort', 'pitt', 'pretend', 'harass', 'fire', 'summer', 'band', 'came', 'threat', 'promis', 
'safeti', 'leav', 'crop', 'drag', 'hear', 'governor', 'noth', 'discharg', 'suffer', 'untold', 'privat', 'cold', 'permit', 'return', 'women_children', 'gather', 'advis', 'near', 'starv', 'els', 'arm', 'hunt', 'depred', 'frontier', 'rob', 'famili', 'mingo', 'cloth', 'stolen', 'pursu', 'compani', 'immedi', 'rais', 'colonel', 'williamson', 'set', 'arriv', 'night', 'march', 'next', 'morn', 'discov', 'advanc', 'cross', 'see', 'accost', 'come', 'protect', 'give', 'began', 'differ', 'spirit', 'bound', 'shut', 'nineti_six', 'consult', 'held', 'done', 'soldier', 'form', 'line', 'col_williamson', 'question', 'favor', 'save', 'step', 'forward', 'death', 'eighteen', 'whole', 'resolv', 'blood', 'men_women', 'children', 'meantim', 'suspect', 'dread', 'result', 'prepar', 'fate', 'pray', 'sing', 'hymn', 'aw', 'execut', 'commenc', 'doom', 'victim', 'brain', 'cooper', 'mallet', 'continu', 'cours', 'left', 'work', 'accomplish', 'repeat', 'treacheri', 'tragedi', 'right', 'escap', 'unobserv', 'warn', 'stun', 'blow', 'scalp', 'recov', 'conscious', 'departur', 'tell', 'stori', 'anoth', 'boy', 'hide', 'confin', 'flame', 'third', 'beg', 'conceal', 'horrid', 'enact', 'god', 'disast', 'die', 'kill', 'lightn', 'fiendish', 'torn', 'least', 'us', 'crawford', 'autumn', 'reveng', 'inhuman', 'chin', 'seem', 'silent', 'think', 'join', 'martyr', 'love', 'well', 'bone', 'sad', 'gray', 'hair', 'forc', 'august', 'breez', 'chant', 'depart', 'dreami', 'hill', 'safe', 'breath', 'break', 'sleep', 'shriek', 'aros', 'pain', 'chicago', 'find', 'month', 'scene', 'horribl', 'record', 'detail', 'previous', 'compos', 'english', 'mask', 'friendship', 'endur', 'hardship', 'persecut', 'blame', 'thank', 'prais', 'divis', 'fifti', 'forsaken', 'greater', 'portion', 'fall', 'face', 'actor', 'notori', 'david', 'destroy', 'suppos', 'trade', 'wild', 'actual', 'wrong', 'busi', 'usual', 'captur', 'militari', 'pittsburgh', 'vote', 'eighti', 'twenti', 'determin', 'news', 'almost', 'implicit', 'confid', 'devout', 'serv', 
'offer', 'strength', 'encourag', 'infant', 'closer', 'mother', 'breast', 'brave', 'women', 'chosen', 'led', 'enter', 'butcher', 'pretti', 'poor', 'merci', 'dead', 'captor', 'manner', 'shot', 'tomahawk', 'various', 'attempt', 'rise', 'despatch', 'slaughter_hous', 'partial', 'consum', 'remain', 'buri', 'friend', 'person', 'perpetr', 'light', 'evid', 'fail', 'secur', 'kept', 'fortun', 'togeth', 'forti', 'thirti', 'former', 'six', 'acr', 'purchas', 'organ', 'object', 'perpetu', 'memori', 'exact', 'precious', 'nine', 'dollar', 'view', 'solicit', 'public', 'general', 'receiv', 'acknowledg', 'suggest', 'moravian_missionari', 'strong', 'beyond', 'doubt', 'grave', 'marbl', 'slab', 'commemor', 'tribun', 'histor', 'sun', 'testimoni', 'afterward', 'june', 'appropri', 'inscript', 'hope', 'midway', 'increas', 'popul', 'store', 'effort', 'fix', 'capit', 'today', 'celebr', 'white_settler', 'circumst', 'lead', 'render', 'induc', 'covert', 'land', 'clear', 'thrive', 'enjoy', 'comfort', 'outrag', 'treatment', 'endeavor', 'valley', 'refus', 'natur', 'thus', 'avoid', 'border', 'bring', 'start', 'human', 'expedit', 'plunder', 'pillag', 'influenc', 'knew', 'atroc', 'wallac', 'eastern', 'fled', 'toward', 'mrs', 'settler', 'pursuit', 'harbor', 'dress', 'earli', 'col', 'recent', 'barbar', 'finish', 'treat', 'harm', 'sooner', 'surrend', 'accus', 'told', 'propos', 'thirsti', 'share', 'decis', 'therefor', 'ask', 'moment', 'devot', 'tie', 'behind', 'scatter', 'room', 'progress', 'aliv', 'gun', 'knife', 'brutal', 'turn', 'pen', 'regist', 'regret', 'feel', 'fact', 'centenni', 'translat', 'mean', 'seven', 'fell', 'cheer', 'mingl', 'shout', 'signific', 'necessari', 'rare', 'pieti', 'heard', 'heathen', 'sold', 'trader', 'gentl', 'evil', 'effect', 'forcibl', 'becam', 'spread', 'flourish', 'convers', 'conspiraci', 'famous', 'ottawa', 'plot', 'though', 'harsh', 'ignor', 'canaanit', 'exist', '…', 'nativ', 'fit', 'fair', 'high', 'settl', 'fertil', 'soil', 'pleasant', 'prevail', 'pipe', 'coloni', 'broke', 
'teacher', 'seat', 'headquart', 'cruelli', 'accord', 'crime', 'move', 'head', 'surround', 'smile', 'final', 'conduct', 'molest', 'complet', 'deceiv', 'prevent', 'manifest', 'council', 'mode', 'excit', 'idea', 'futur', 'campaign', 'bleed', 'written', 'demonstr', 'must', 'morrow', 'affair', 'believ', 'show', 'realiz', 'fortitud', 'larger', 'watch', 'outsid', 'produc', 'babe', 'parent', 'etern', 'quick', 'cut', 'similar', 'dwell', 'stay', 'short', 'frequent', 'retir', 'canada', 'pious', 'sixti', 'reflect', 'imposs', 'except', 'st', 'proper', 'observ', 'approach', 'especi', 'presid', 'arthur', 'hay', 'invit', 'deliv', 'new_fairfield', 'particip', 'wyom', 'probabl', 'struggl', '–', 'gospel', 'mahon', 'lehighton', 'elev', 'posit', 'novemb', 'attack', 'burnt', 'inscrib', 'lord', 'renew', 'mohagan', 'maintain', 'north', 'lehigh', 'possess', 'separ', 'begin', 'road', 'path', 'mountain', 'warrior', 'plantat', 'martin', 'resid', 'succeed', 'polici', 'chang', 'might', 'alon', 'mohegan', 'languag', 'bishop', 'foundat', 'shawne', 'hatchet', 'french', 'resolut', 'chapel', 'weissport', 'cultur', 'defeat', 'open', 'neighbor', 'caution', 'enemi', 'possibl', 'compli', 'nov', 'georg', 'custard', 'expect', 'joseph', 'sturg', 'got', 'door', 'ran', 'partch', 'window', 'child', 'stair', 'best', 'jump', 'hid', 'stump', 'saw', 'abus', 'perish', 'twelv', 'stabl', 'five', 'parich', 'blanket', 'meant', 'abl', 'deliver', 'report', 'delay', 'contrari', 'bed', 'brother', 'notic', 'militia', 'ventur', 'troop', 'stockad', 'properti', 'strategi', 'quit', 'later', 'januari', 'benjamin', 'susanna', 'lost', 'surpris', 'moravian_convert', 'search', 'canadian', 'borderland', 'john', 'p', 'bow', 'journal', 'entri', 'narrow', 'lt', 'muskingum', 'develop', 'ohio_valley', 'close', 'communiti', 'eighteenth', 'centuri', 'aftermath', 'surviv', 'most', 'munse', 'refug', 'rumor', 'offici', 'earlier', 'complic', 'passag', 'decad', 'often', 'resist', 'presenc', 'lake', 'region', 'disappear', 'deal', 'particular', 
'symbol', 'rang', 'impact', 'fairfield', 'moraviantown', 'upper_canada', 'northern', 'thame_river', 'stabil', 'boundari', 'problem', 'incurs', 'immigr', 'intern', 'migrat', 'simpli', 'govern', 'resid_fairfield', 'difficult', 'demand', 'relat', 'british_offici', 'inde', 'southern', 'illustr', 'avail', 'option', 'nineteenth', 'articl', 'despit', 'pressur', 'examin', 'specif', 'individu', 'second', 'polit', 'elimin', 'tragic', 'backcountri', 'defin', 'appalachian', 'conflict', 'warfar', 'local', 'neutral', 'neither', 'violenc', 'readili', 'label', 'limit', 'mid', 'zone', 'grew', 'immin', 'conclus', 'agreement', 'affect', 'raid', 'constant', 'includ', 'impend', 'sometim', 'michigan_histor', 'review', 'headman', 'power', 'jaw', 'danger', 'battl', 'risk', 'play', 'spring', 'tri', 'reloc', 'want', 'across', 'sens', 'wit', 'movement', 'potenti', 'extent', 'convinc', 'juli', 'clinton', 'northwest', 'refuge', 'slowli', 'longer', 'thing', 'lennachgo', 'destruct', 'negat', 'ripen', 'ojibw', 'away', 'welcom', 'son', 'agre', 'father', 'instead', 'journey', 'josiah', 'harmar', 'headmen', 'alcohol', 'stop', 'huron', 'better', 'drop', 'shift', 'auglaiz', 'becom', 'milit', 'confederaci', 'gen', 'messag', 'wampum', 'singular', 'repres', 'advic', 'easi', 'declin', 'area', 'spiritu', 'assist', 'support', 'thought', 'remark', 'agent', 'mckee', 'treati', 'sign', 'unfortun', 'volatil', 'jay', 'term', 'paper', 'econom', 'thame', 'cultiv', 'suffici', 'travel', 'knowledg', 'lieuten', 'simco', 'request', 'bushel', 'mere', 'cano', 'vital', 'season', 'fur', 'heighten', 'pari', 'contend', 'perform', 'western', 'fort_malden', 'negoti', 'somewhat', 'uneasi', 'america', 'worri', 'threaten', 'tension', 'capt', 'mississippi', 'compar', 'anxieti', 'surfac', 'niagara', 'nevertheless', 'begun', 'period', 'regul', 'strict', 'rule', 'requir', 'sin', 'social', 'standard', 'reveal', 'dramat', 'initi', 'primari', 'practic', 'deterior', 'group', 'less', 'diari', 'read', 'michael', 'diarist', 'complain', 
'difficulti', 'relationship', 'gift', 'expans', 'wrote', 'count', 'enough', 'connect', 'respons', 'chesapeak', 'alleg', 'desert', 'drew', 'elliot', 'prophet', 'indiana', 'tecumseh', 'amherstburg', 'indic', 'revit', 'fellow', 'onim', 'either', 'gave', 'belong', 'stream', 'malden', 'denk', 'deep', 'hospit', 'chao', 'henri', 'schnall', 'food', 'dispers', 'daughter', 'meanwhil', 'cass', 'kinship', 'charl', 'killbuck', 'gelemend', 'perfect', 'jstor', 'www', 'org', 'scienc', 'art', 'heckeweld', 'guidanc', 'secret', 'verifi', 'exhort', 'resign', 'applic', 'liberti', 'acquaint', 'untim', 'impress', 'justic', 'mr_heckeweld', 'charact', 'intercours', 'marietta', 'amidst', 'full', 'publish', 'putnam', 'wabash', 'fever', 'common', 'barg', 'alarm', 'chimney', 'pocket', 'assur', 'releas', 'intellig', 'messeng', 'shebosh', 'wound', 'latter', 'display', 'glorious', 'youth', 'abel'])
In [16]:
# model_p.wv.most_similar(positive = ["white", "american"],
#                         negative = ["british"])

$\overrightarrow{Dimension_g} = \overrightarrow{white} + \overrightarrow{american} - \overrightarrow{british}$

In [17]:
whiteAmerican_british = [('moravian', 0.9998588562011719),
                         ('indian', 0.9998587369918823),
                         ('fairfield', 0.9998579621315002),
                         ('mani', 0.9998571872711182),
                         ('live', 0.9998558163642883),
                         ('murder', 0.9998539686203003),
                         ('day', 0.9998528957366943),
                         ('missionari', 0.9998528957366943),
                         ('massacr', 0.9998518824577332),
                         ('ohio', 0.999851405620575)]
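The arithmetic behind `most_similar(positive=..., negative=...)` can be sketched directly: compose a query vector by adding the positive vectors and subtracting the negative one, then rank every other word by cosine similarity to that query. The unit-normalized random vectors below stand in for real embeddings and are invented for illustration.

```python
import numpy as np

# Toy unit-normalized vectors (invented for illustration).
rng = np.random.default_rng(0)
words = ["white", "american", "british", "moravian", "indian", "massacr"]
vecs = {w: v / np.linalg.norm(v)
        for w, v in zip(words, rng.standard_normal((len(words), 8)))}

# Compose the query: white + american - british, then renormalize.
query = vecs["white"] + vecs["american"] - vecs["british"]
query /= np.linalg.norm(query)

# Rank the remaining words by cosine similarity to the composed vector;
# this is essentially what model.wv.most_similar computes.
candidates = [w for w in words if w not in ("white", "american", "british")]
ranked = sorted(candidates, key=lambda w: float(query @ vecs[w]), reverse=True)
print(ranked)
```

Because all vectors are unit-normalized, the dot product `query @ vecs[w]` equals the cosine similarity, matching how gensim scores candidates after `init_sims`.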
In [19]:
## Visualize words most similar to White American
my_word_list = []
my_word_vectors = []

for i in whiteAmerican_british: 
    if i[0] not in my_word_list:
        my_word_list.append(i[0])
        my_word_vectors.append(model_p.wv[i[0]])
        
tsne_model = TSNE(perplexity=5, n_components=2, init='pca', n_iter=3000, random_state=23) # you may need to tune these, especially the perplexity. init='pca' uses PCA to seed the 2-D layout, an "X" and a "Y"
new_values = tsne_model.fit_transform(my_word_vectors)

x = []
y = []
for value in new_values:
    x.append(value[0])
    y.append(value[1])


trace1 = go.Scatter(
    x = x,
    y = y,
    mode = 'markers+text',
    name = "Similar to White American",
    text = [stem_dict[word] if stem_dict.get(word) else word for word in my_word_list],
    textposition='bottom center'
)


data = [trace1]

layout = go.Layout(dict(title = "Most similar Words to White American",
                    yaxis   = dict(title = "Dimension2"),
                    xaxis   = dict(title = "Dimension1"),
                    plot_bgcolor  = "rgb(243,243,243)",
                    paper_bgcolor = "rgb(243,243,243)",
                   )
              )

fig = go.Figure(data=data,layout=layout)
py.iplot(fig)

Some of the significant words that come up along this dimension include:

  • Massacre
  • Moravian Indians
  • War
  • Missionary
  • Ohio

These words reconstruct what we know about the Gnadenhutten Massacre: the Moravian Indians, who were not allies of Britain, were massacred by American militia in Gnadenhutten, Ohio.


4. Word2Vec to Tensor

I am using the Google Embedding Projector to visualize my Word2Vec model. Since it is built on TensorFlow, we need to convert our Word2Vec output to the tensor (TSV) format the Projector expects. The function below does just that.

The final visualization of this Word2Vec model can be found here.
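The file layout the Embedding Projector loads is simple: a `*_tensor.tsv` file with one tab-separated vector per line, and a `*_metadata.tsv` file whose line N labels the vector on line N. A minimal sketch with invented toy vectors (the filenames and values are hypothetical):

```python
# Sketch of the two-file layout the Embedding Projector reads:
# one tab-separated vector per line, with the matching label on the
# same line number of the metadata file.
toy_vectors = {"moravian": [0.1, 0.2, 0.3], "militia": [0.4, 0.5, 0.6]}

with open("toy_tensor.tsv", "w") as fv, open("toy_metadata.tsv", "w") as fm:
    for word, vec in toy_vectors.items():
        fm.write(word + "\n")
        fv.write("\t".join(str(x) for x in vec) + "\n")

print(open("toy_tensor.tsv").read())
```

The function in the next cell writes the same layout for the full model, substituting the unstemmed word (via `stem_dict`) as the label where one is available.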


In [23]:
def word2vec2tensor(model, tensor_filename):
    outfiletsv = tensor_filename + '_tensor.tsv'
    outfiletsvmeta = tensor_filename + '_metadata.tsv'
                
    with smart_open(outfiletsv, 'wb') as file_vector, smart_open(outfiletsvmeta, 'wb') as file_metadata:
        for word in model.wv.index2word:
            # look up the vector with the stemmed key BEFORE mapping back,
            # since only stemmed tokens exist in the model's vocabulary
            vector_row = '\t'.join(str(x) for x in model.wv[word])
            # map the stemmed token back to a readable word for the label
            label = stem_dict[word] if stem_dict.get(word) else word
            file_metadata.write(gensim.utils.to_utf8(label) + gensim.utils.to_utf8('\n'))
            file_vector.write(gensim.utils.to_utf8(vector_row) + gensim.utils.to_utf8('\n'))
In [28]:
# word2vec2tensor(model_p, "model_primaries")