Author: Ung, Lik Teng
Class: DH150, Winter 2019
Instructor: Professor Ashley Sanders Garcia
Word2Vec is a popular word-embedding method that represents words as vectors in a high-dimensional space, going beyond simple frequency counts. Its advantage is that it captures the contexts in which a word appears within a specific corpus. I trained a Word2Vec model on 9 newspaper articles about the Gnadenhutten Massacre of March 8, 1782. I am interested in how the different sides involved in the massacre were discussed in public discourse, and specifically in the words most closely associated with the Moravian Indians and the American militia.
import cython, os #ENSURE cython package is installed on computer/canopy
import string, re, collections
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from string import ascii_letters, digits
from smart_open import smart_open
import gensim
from gensim.models import phrases
from gensim import corpora, models, similarities #calc all similarities at once, from http://radimrehurek.com/gensim/tut3.html
from gensim.models import Word2Vec, KeyedVectors
from sklearn.manifold import TSNE
from sklearn.feature_extraction.text import CountVectorizer
# from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
# import plotly.offline as py
# py.init_notebook_mode(connected=True)
# import plotly.graph_objs as go
# import plotly.tools as tls
import plotly.plotly as py
import plotly.tools as plotly_tools
import plotly.graph_objs as go
plotly_tools.set_credentials_file(username='unglikteng', api_key='ho4TAl3mWMMNK6DnRSCL')
from nltk import word_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
plt.style.use('ggplot')
%matplotlib inline
The directory path and filename are hardcoded here. Import your text documents if you would like to analyze them with Word2Vec.
primaries = []
primaryPath = os.path.join(os.path.realpath(""), "primary")
for root, directories, files in os.walk(primaryPath):
    for txt in files:
        path = os.path.join(root, txt)
        # Read each article and join its lines into a single string
        with open(path, "r") as f:
            primaries.append("".join(f.read().splitlines()))
# 9 Newspaper articles
len(primaries)
9
# Get a sense of what an article looks like
primaries[0]
'Centenary of GnadenhuttenInformation about the Old Ohio Moravian Settlement and its MassacreSpecial Correspondence of the Cincinnati Gazette Newark, O., April 27, - Gnadenhutten was established by a Christian Indian named Joshua, who brought with him a party of Mohicans, and proceeded to lay out the town on 24th day of September, 1772. It was on the west side of the Tuscarawas river, four miles above Schronbrunn (a Moravian village already established), and was called “Upper Town”. This location, however, was not satisfactory to the Netawatwees, then the reigning chief of the Delaware nation, who caused it to be removed to a point about eight miles below Schonbrunn, on the east side of the river. Here Gnadenhutten (Tents of Grace) was laid out October 9, 1772, by Joshua and his party, who were from the Moravian village of Friedenstadt (City of Peace), located on the Beaver river, in Pennsylvania. This village was subsequently removed and added to the villages of Gnadenhutten and Schonbrunn. Rev. David Zeisberger preached the first sermon in Gnadenhutten October 17, 1772.The British seized the Moravian Gnadenhutten, and with their horses, cattle, etc., drove them prisoners to the “Sandusky plains,” by Captain Matthew Elliott of the British army (a while American renegade, however), who had under his command at the time 300 hostile Indians. They were made captives September 11, 1781, and the party reached Sandusky river on the first day of October following, when they went into camp. The leaders of these Moravians at the time of the removal were Revs. Zeisberger, Senseman, and Jungman, of New Schonbrunn; Revs. Heckwelder and Jung, of Salem, and Rev. William Edwards, of Gnadenhutten. This camp, subsequently known as “Captives’ Town,” was located in the heart of the then hostile Wyandot country, on the Sandusky river, about a mile above the mouth of Broken Sword creek, and ten miles from the present town of Upper Sandusky. 
Here the captives were allowed to build huts and go into winter quarters. Late in October, 1781, leaders only were ordered to Detroit, there to go before the British commandant, Major DePeyster, to answer to the charge against them of aiding the Americans. They soon proved themselves innocent, and were sent back to “Captives’ Town,” on the Sandusky.'
# Define our Text Preprocessor class
## Tokenize -> Remove stopwords -> Stemming
class Preprocessor:
    def tokenize_word(self, sentence, to_token = None):
        # all lower case
        lower = sentence.strip().lower()
        # remove punctuation
        punctuation_table = str.maketrans(string.punctuation, len(string.punctuation)*' ' )
        noPunc = lower.translate(punctuation_table)
        # remove digit
        nodigit = re.sub(r'\d+', '', noPunc)
        nodigit = re.sub(r'\s+', ' ', nodigit).strip()
        if to_token:
            tokenized = word_tokenize(nodigit)
            return tokenized
        return nodigit

    def stem_word(self, tokens):
        stemmer = SnowballStemmer("english")
        stemmed = []
        for token in tokens:
            stemmed.append(stemmer.stem(token))
        return stemmed

    def remove_stopwords(self, tokens):
        stopword_list = stopwords.words("english")
        filtered = [w for w in tokens if w not in stopword_list]
        return filtered
# Define preprocessor object
preprocessor = Preprocessor()
primariesToken = [preprocessor.stem_word(
                      preprocessor.remove_stopwords(
                          preprocessor.tokenize_word(line, to_token=True)))
                  for line in primaries]
primariesUnstemmed = [preprocessor.remove_stopwords(
                          preprocessor.tokenize_word(line, to_token=True))
                      for line in primaries]
# Build stemming dictionary
# This dictionary will help us trace back to the unstemmed words
stem_dict = {}
for i, row in enumerate(primariesUnstemmed):
    for j, token in enumerate(row):
        stem_dict.update({primariesToken[i][j]:token})
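Note that the loop above keeps whichever unstemmed form was seen last for each stem. As a minimal sketch of one alternative, assuming we would rather label each stem with its most frequent surface form, the mapping could be built with a counter. The function name `build_stem_dict` and the toy token lists are hypothetical and only stand in for `primariesToken` and `primariesUnstemmed`.

```python
from collections import Counter, defaultdict

def build_stem_dict(stemmed_docs, unstemmed_docs):
    """Map each stem to its most frequent unstemmed surface form."""
    forms = defaultdict(Counter)
    for stemmed, unstemmed in zip(stemmed_docs, unstemmed_docs):
        # The two token lists are aligned position by position.
        for stem, token in zip(stemmed, unstemmed):
            forms[stem][token] += 1
    return {stem: counts.most_common(1)[0][0] for stem, counts in forms.items()}

# Toy data (hypothetical), standing in for primariesToken / primariesUnstemmed
toy_stemmed = [["massacr", "indian", "massacr"], ["massacr", "villag"]]
toy_unstemmed = [["massacre", "indians", "massacre"], ["massacred", "village"]]
toy_stem_dict = build_stem_dict(toy_stemmed, toy_unstemmed)
```

With this variant the stem `massacr` is labeled `massacre`, because that form occurs twice while `massacred` occurs once.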
# Visualize the corpus - Frequency Analysis
def word_counter(list_of_doc):
    countVec = CountVectorizer()
    df_cv = countVec.fit_transform(list_of_doc)
    word_freq = dict(zip(countVec.get_feature_names(), np.asarray(df_cv.sum(axis=0)).ravel()))
    word_counter = collections.Counter(word_freq)
    word_counter_df = pd.DataFrame(word_counter.most_common(20), columns = ['word','freq'])
    a4_dims = (15, 10)
    fig, ax = plt.subplots(figsize = a4_dims)
    sns.barplot(x="word", y="freq", data=word_counter_df, palette= "PuBuGn_d",ax=ax)
    return word_counter
wc_primaries = word_counter([" ".join(tokens) for tokens in primariesToken])
# Text Preprocessing -> Phrase Detection with Gensim -> Word2Vec Training
## Phrase Detection
bigram_transformer = phrases.Phrases(primariesToken)
bigram= phrases.Phraser(bigram_transformer)
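To give an intuition for what phrase detection does, here is a rough sketch of a collocation score in the spirit of the one gensim's `Phrases` uses by default: a bigram scores high when it occurs more often than its two words' individual counts would suggest. This is an illustration only. The function name `bigram_scores` and the toy sentences are hypothetical, and gensim's exact bookkeeping (vocabulary size, default `min_count` and `threshold`) may differ from this simplified version.

```python
from collections import Counter

def bigram_scores(sentences, min_count=1):
    """Score each adjacent word pair: (count(ab) - min_count) * V / (count(a) * count(b))."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)  # simplification: counts unigrams only
    return {
        pair: (n - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, n in bigrams.items()
    }

# Toy corpus (hypothetical): "tuscarawa river" recurs, "river valley" appears once
toy_sentences = [
    ["tuscarawa", "river", "bank"],
    ["tuscarawa", "river", "valley"],
    ["river", "bank"],
]
scores = bigram_scores(toy_sentences)
```

Pairs whose score exceeds a chosen threshold would be merged into a single token such as `tuscarawa_river`, which is how entries like `christian_indian` end up in the vocabulary.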
Word2Vec Hyperparameters:
model_primaries = Word2Vec(bigram[primariesToken], workers=4, sg=1,size=500,window=5, min_count = 2, sample=1e-3)
model_primaries.init_sims(replace=True) #Precompute L2-normalized vectors. If replace is set to TRUE, forget the original vectors and only keep the normalized ones. Saves lots of memory, but can't continue to train the model.
model_primaries.save("model_primaries") #save your model for later use! change the name to something to remember the hyperparameters you trained it with
# Load the model
model_p = Word2Vec.load("model_primaries")
# There are 1318 words in the vocabulary
len(model_p.wv.vocab)
1318
model_p.wv.vocab.keys()
dict_keys(['old', 'ohio', 'moravian', 'settlement', 'correspond', 'cincinnati', 'newark', 'april', 'gnadenhutten', 'establish', 'christian_indian', 'name', 'joshua', 'brought', 'parti', 'proceed', 'lay', 'town', 'th', 'day', 'septemb', 'west', 'side', 'tuscarawa_river', 'four', 'mile', 'villag', 'alreadi', 'call', '“', 'upper', '”', 'locat', 'howev', 'reign', 'chief', 'delawar', 'nation', 'caus', 'remov', 'point', 'eight', 'schonbrunn', 'east', 'river', 'tent', 'grace', 'laid', 'octob', 'peac', 'pennsylvania', 'subsequ', 'rev', 'david_zeisberg', 'preach', 'first', 'british', 'seiz', 'hors', 'cattl', 'etc', 'drove', 'prison', 'sanduski', 'plain', 'captain', 'matthew', 'elliott', 'armi', 'american', 'command', 'time', 'hostil', 'indian', 'made', 'captiv', 'reach', 'follow', 'went', 'camp', 'leader', 'zeisberg', 'senseman', 'new', 'heckweld', 'jung', 'salem', 'william', 'known', '’', 'heart', 'wyandot', 'countri', 'broken', 'creek', 'ten', 'present', 'allow', 'build', 'hut', 'go', 'winter', 'quarter', 'late', 'order', 'detroit', 'major', 'answer', 'charg', 'aid', 'soon', 'prove', 'innoc', 'sent', 'back', 'fear', 'massacr', 'interest', 'event', 'advoc', 'week', 'two', 'happen', 'tuscarawa', 'counti', 'took', 'visit', 'site', 'ancient', 'trust', 'brief', 'account', 'place', 'terribl', 'may', 'reader', 'conclud', 'make', 'subject', 'communic', 'deserv', 'among', 'respect', 'valuabl', 'church', 'great', 'britain', 'origin', 'brethren', 'law', 'christ', 'know', 'unit', 'one', 'peculiar', 'unusu', 'belief', 'say', 'exhibit', 'submit', 'import', 'concern', 'member', 'lot', 'consist', 'number', 'small', 'cylind', 'inch', 'long', 'half', 'construct', 'end', 'pull', 'apart', 'disclos', 'word', 'yes', 'case', 'alik', 'far', 'appear', 'contain', 'use', 'princip', 'matter', 'instanc', 'young', 'man', 'mind', 'would', 'like', 'certain', 'woman', 'wife', 'minist', 'state', 'take', 'littl', 'put', 'thorough', 'consid', 'provid', 'approv', 'match', 'reason', 'although', 'need', 
'marri', 'yet', 'also', 'much', 'decid', 'whether', 'accept', 'missionari', 'field', 'labor', 'character', 'hold', 'exampl', 'other', 'everi', 'bodi', 'christian', 'whatev', 'persuad', 'engag', 'mission', 'care', 'quarrel', 'carri', 'address', 'men', 'given', 'bethlehem', 'alway', 'still', 'center', 'unit_state', 'home', 'offic', 'societi', 'sever', 'year', 'revolutionari', 'war', 'wilder', 'convert', 'tribe', 'met', 'good', 'success', 'larg', 'savag', 'built', 'inhabit', 'three', 'within', 'goshen', 'washington', 'stand', 'beauti', 'situat', 'bank', 'south', 'philadelphia', 'columbus', 'railroad', 'hundr', 'faith', 'neat', 'meet', 'hous', 'pass', 'street', 'struck', 'quiet', 'throughout', 'modern', 'mani', 'gabl', 'ret', 'tradit', 'past', 'peopl', 'rush', 'hurri', 'world', 'around', 'simpl', 'tast', 'deepli', 'religi', 'earth', 'desir', 'daili', 'prayer', 'strife', 'sober', 'wish', 'never', 'learn', 'along', 'life', 'keep', 'way', 'attend', 'servic', 'sunday', 'inform', 'school', 'upon', 'congreg', 'born', 'seventi', 'year_ago', 'kind', 'accompani', 'spot', 'edg', 'eye', 'sacr', 'found', 'modest', 'foot', 'part', 'embrac', 'adjoin', 'graveyard', 'enclos', 'fenc', 'stood', 'occur', 'cruel', 'honor', 'grow', 'forest', 'tree', 'grown', 'sinc', 'seen', 'appl', 'plant', 'garden', 'ground', 'cellar', 'visibl', 'taken', 'char', 'corn', 'wood', 'pick', 'stone', 'piec', 'burn', 'red', 'hard', 'bore', 'mark', 'heat', 'lie', 'heap', 'purpos', 'erect', 'monument', 'last', 'rest', 'perhap', 'incid', 'histori', 'cruelti', 'equal', 'butcheri', 'pale', 'bloodi', 'slaughter', 'murder', 'king', 'excus', 'issu', 'mistak', 'true', 'addit', 'civil', 'nineti', 'act', 'without', 'even', 'grassi', 'could', 'scarc', 'horror', 'listen', 'recit', 'dark', 'deed', 'commit', 'sat', 'said', 'live', 'attach', 'wa', 'troubl', 'whose', 'territori', 'alli', 'look', 'leagu', 'white', 'hand', 'station', 'fort', 'pitt', 'pretend', 'harass', 'fire', 'summer', 'band', 'came', 'threat', 'promis', 
'safeti', 'leav', 'crop', 'drag', 'hear', 'governor', 'noth', 'discharg', 'suffer', 'untold', 'privat', 'cold', 'permit', 'return', 'women_children', 'gather', 'advis', 'near', 'starv', 'els', 'arm', 'hunt', 'depred', 'frontier', 'rob', 'famili', 'mingo', 'cloth', 'stolen', 'pursu', 'compani', 'immedi', 'rais', 'colonel', 'williamson', 'set', 'arriv', 'night', 'march', 'next', 'morn', 'discov', 'advanc', 'cross', 'see', 'accost', 'come', 'protect', 'give', 'began', 'differ', 'spirit', 'bound', 'shut', 'nineti_six', 'consult', 'held', 'done', 'soldier', 'form', 'line', 'col_williamson', 'question', 'favor', 'save', 'step', 'forward', 'death', 'eighteen', 'whole', 'resolv', 'blood', 'men_women', 'children', 'meantim', 'suspect', 'dread', 'result', 'prepar', 'fate', 'pray', 'sing', 'hymn', 'aw', 'execut', 'commenc', 'doom', 'victim', 'brain', 'cooper', 'mallet', 'continu', 'cours', 'left', 'work', 'accomplish', 'repeat', 'treacheri', 'tragedi', 'right', 'escap', 'unobserv', 'warn', 'stun', 'blow', 'scalp', 'recov', 'conscious', 'departur', 'tell', 'stori', 'anoth', 'boy', 'hide', 'confin', 'flame', 'third', 'beg', 'conceal', 'horrid', 'enact', 'god', 'disast', 'die', 'kill', 'lightn', 'fiendish', 'torn', 'least', 'us', 'crawford', 'autumn', 'reveng', 'inhuman', 'chin', 'seem', 'silent', 'think', 'join', 'martyr', 'love', 'well', 'bone', 'sad', 'gray', 'hair', 'forc', 'august', 'breez', 'chant', 'depart', 'dreami', 'hill', 'safe', 'breath', 'break', 'sleep', 'shriek', 'aros', 'pain', 'chicago', 'find', 'month', 'scene', 'horribl', 'record', 'detail', 'previous', 'compos', 'english', 'mask', 'friendship', 'endur', 'hardship', 'persecut', 'blame', 'thank', 'prais', 'divis', 'fifti', 'forsaken', 'greater', 'portion', 'fall', 'face', 'actor', 'notori', 'david', 'destroy', 'suppos', 'trade', 'wild', 'actual', 'wrong', 'busi', 'usual', 'captur', 'militari', 'pittsburgh', 'vote', 'eighti', 'twenti', 'determin', 'news', 'almost', 'implicit', 'confid', 'devout', 'serv', 
'offer', 'strength', 'encourag', 'infant', 'closer', 'mother', 'breast', 'brave', 'women', 'chosen', 'led', 'enter', 'butcher', 'pretti', 'poor', 'merci', 'dead', 'captor', 'manner', 'shot', 'tomahawk', 'various', 'attempt', 'rise', 'despatch', 'slaughter_hous', 'partial', 'consum', 'remain', 'buri', 'friend', 'person', 'perpetr', 'light', 'evid', 'fail', 'secur', 'kept', 'fortun', 'togeth', 'forti', 'thirti', 'former', 'six', 'acr', 'purchas', 'organ', 'object', 'perpetu', 'memori', 'exact', 'precious', 'nine', 'dollar', 'view', 'solicit', 'public', 'general', 'receiv', 'acknowledg', 'suggest', 'moravian_missionari', 'strong', 'beyond', 'doubt', 'grave', 'marbl', 'slab', 'commemor', 'tribun', 'histor', 'sun', 'testimoni', 'afterward', 'june', 'appropri', 'inscript', 'hope', 'midway', 'increas', 'popul', 'store', 'effort', 'fix', 'capit', 'today', 'celebr', 'white_settler', 'circumst', 'lead', 'render', 'induc', 'covert', 'land', 'clear', 'thrive', 'enjoy', 'comfort', 'outrag', 'treatment', 'endeavor', 'valley', 'refus', 'natur', 'thus', 'avoid', 'border', 'bring', 'start', 'human', 'expedit', 'plunder', 'pillag', 'influenc', 'knew', 'atroc', 'wallac', 'eastern', 'fled', 'toward', 'mrs', 'settler', 'pursuit', 'harbor', 'dress', 'earli', 'col', 'recent', 'barbar', 'finish', 'treat', 'harm', 'sooner', 'surrend', 'accus', 'told', 'propos', 'thirsti', 'share', 'decis', 'therefor', 'ask', 'moment', 'devot', 'tie', 'behind', 'scatter', 'room', 'progress', 'aliv', 'gun', 'knife', 'brutal', 'turn', 'pen', 'regist', 'regret', 'feel', 'fact', 'centenni', 'translat', 'mean', 'seven', 'fell', 'cheer', 'mingl', 'shout', 'signific', 'necessari', 'rare', 'pieti', 'heard', 'heathen', 'sold', 'trader', 'gentl', 'evil', 'effect', 'forcibl', 'becam', 'spread', 'flourish', 'convers', 'conspiraci', 'famous', 'ottawa', 'plot', 'though', 'harsh', 'ignor', 'canaanit', 'exist', '…', 'nativ', 'fit', 'fair', 'high', 'settl', 'fertil', 'soil', 'pleasant', 'prevail', 'pipe', 'coloni', 'broke', 
'teacher', 'seat', 'headquart', 'cruelli', 'accord', 'crime', 'move', 'head', 'surround', 'smile', 'final', 'conduct', 'molest', 'complet', 'deceiv', 'prevent', 'manifest', 'council', 'mode', 'excit', 'idea', 'futur', 'campaign', 'bleed', 'written', 'demonstr', 'must', 'morrow', 'affair', 'believ', 'show', 'realiz', 'fortitud', 'larger', 'watch', 'outsid', 'produc', 'babe', 'parent', 'etern', 'quick', 'cut', 'similar', 'dwell', 'stay', 'short', 'frequent', 'retir', 'canada', 'pious', 'sixti', 'reflect', 'imposs', 'except', 'st', 'proper', 'observ', 'approach', 'especi', 'presid', 'arthur', 'hay', 'invit', 'deliv', 'new_fairfield', 'particip', 'wyom', 'probabl', 'struggl', '–', 'gospel', 'mahon', 'lehighton', 'elev', 'posit', 'novemb', 'attack', 'burnt', 'inscrib', 'lord', 'renew', 'mohagan', 'maintain', 'north', 'lehigh', 'possess', 'separ', 'begin', 'road', 'path', 'mountain', 'warrior', 'plantat', 'martin', 'resid', 'succeed', 'polici', 'chang', 'might', 'alon', 'mohegan', 'languag', 'bishop', 'foundat', 'shawne', 'hatchet', 'french', 'resolut', 'chapel', 'weissport', 'cultur', 'defeat', 'open', 'neighbor', 'caution', 'enemi', 'possibl', 'compli', 'nov', 'georg', 'custard', 'expect', 'joseph', 'sturg', 'got', 'door', 'ran', 'partch', 'window', 'child', 'stair', 'best', 'jump', 'hid', 'stump', 'saw', 'abus', 'perish', 'twelv', 'stabl', 'five', 'parich', 'blanket', 'meant', 'abl', 'deliver', 'report', 'delay', 'contrari', 'bed', 'brother', 'notic', 'militia', 'ventur', 'troop', 'stockad', 'properti', 'strategi', 'quit', 'later', 'januari', 'benjamin', 'susanna', 'lost', 'surpris', 'moravian_convert', 'search', 'canadian', 'borderland', 'john', 'p', 'bow', 'journal', 'entri', 'narrow', 'lt', 'muskingum', 'develop', 'ohio_valley', 'close', 'communiti', 'eighteenth', 'centuri', 'aftermath', 'surviv', 'most', 'munse', 'refug', 'rumor', 'offici', 'earlier', 'complic', 'passag', 'decad', 'often', 'resist', 'presenc', 'lake', 'region', 'disappear', 'deal', 'particular', 
'symbol', 'rang', 'impact', 'fairfield', 'moraviantown', 'upper_canada', 'northern', 'thame_river', 'stabil', 'boundari', 'problem', 'incurs', 'immigr', 'intern', 'migrat', 'simpli', 'govern', 'resid_fairfield', 'difficult', 'demand', 'relat', 'british_offici', 'inde', 'southern', 'illustr', 'avail', 'option', 'nineteenth', 'articl', 'despit', 'pressur', 'examin', 'specif', 'individu', 'second', 'polit', 'elimin', 'tragic', 'backcountri', 'defin', 'appalachian', 'conflict', 'warfar', 'local', 'neutral', 'neither', 'violenc', 'readili', 'label', 'limit', 'mid', 'zone', 'grew', 'immin', 'conclus', 'agreement', 'affect', 'raid', 'constant', 'includ', 'impend', 'sometim', 'michigan_histor', 'review', 'headman', 'power', 'jaw', 'danger', 'battl', 'risk', 'play', 'spring', 'tri', 'reloc', 'want', 'across', 'sens', 'wit', 'movement', 'potenti', 'extent', 'convinc', 'juli', 'clinton', 'northwest', 'refuge', 'slowli', 'longer', 'thing', 'lennachgo', 'destruct', 'negat', 'ripen', 'ojibw', 'away', 'welcom', 'son', 'agre', 'father', 'instead', 'journey', 'josiah', 'harmar', 'headmen', 'alcohol', 'stop', 'huron', 'better', 'drop', 'shift', 'auglaiz', 'becom', 'milit', 'confederaci', 'gen', 'messag', 'wampum', 'singular', 'repres', 'advic', 'easi', 'declin', 'area', 'spiritu', 'assist', 'support', 'thought', 'remark', 'agent', 'mckee', 'treati', 'sign', 'unfortun', 'volatil', 'jay', 'term', 'paper', 'econom', 'thame', 'cultiv', 'suffici', 'travel', 'knowledg', 'lieuten', 'simco', 'request', 'bushel', 'mere', 'cano', 'vital', 'season', 'fur', 'heighten', 'pari', 'contend', 'perform', 'western', 'fort_malden', 'negoti', 'somewhat', 'uneasi', 'america', 'worri', 'threaten', 'tension', 'capt', 'mississippi', 'compar', 'anxieti', 'surfac', 'niagara', 'nevertheless', 'begun', 'period', 'regul', 'strict', 'rule', 'requir', 'sin', 'social', 'standard', 'reveal', 'dramat', 'initi', 'primari', 'practic', 'deterior', 'group', 'less', 'diari', 'read', 'michael', 'diarist', 'complain', 
'difficulti', 'relationship', 'gift', 'expans', 'wrote', 'count', 'enough', 'connect', 'respons', 'chesapeak', 'alleg', 'desert', 'drew', 'elliot', 'prophet', 'indiana', 'tecumseh', 'amherstburg', 'indic', 'revit', 'fellow', 'onim', 'either', 'gave', 'belong', 'stream', 'malden', 'denk', 'deep', 'hospit', 'chao', 'henri', 'schnall', 'food', 'dispers', 'daughter', 'meanwhil', 'cass', 'kinship', 'charl', 'killbuck', 'gelemend', 'perfect', 'jstor', 'www', 'org', 'scienc', 'art', 'heckeweld', 'guidanc', 'secret', 'verifi', 'exhort', 'resign', 'applic', 'liberti', 'acquaint', 'untim', 'impress', 'justic', 'mr_heckeweld', 'charact', 'intercours', 'marietta', 'amidst', 'full', 'publish', 'putnam', 'wabash', 'fever', 'common', 'barg', 'alarm', 'chimney', 'pocket', 'assur', 'releas', 'intellig', 'messeng', 'shebosh', 'wound', 'latter', 'display', 'glorious', 'youth', 'abel'])
# model_p.wv.most_similar(positive = ["white", "american"],
# negative = ["british"])
$\overrightarrow{Dimension_g} = \overrightarrow{white} + \overrightarrow{american} - \overrightarrow{british}$
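The vector arithmetic in the formula above can be sketched in plain Python: add the "positive" vectors, subtract the "negative" ones, and rank the remaining words by cosine similarity to the result. This is a toy illustration, not gensim's implementation (gensim's `most_similar` also unit-normalizes each input vector before combining them). The 3-dimensional vectors and the function names `cosine` and `analogy_query` are made up for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy_query(vectors, positive, negative, topn=3):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    excluded = set(positive) | set(negative)  # query words are not candidates
    ranked = sorted(
        ((w, cosine(query, v)) for w, v in vectors.items() if w not in excluded),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:topn]

# Toy 3-D vectors (hypothetical values, not taken from the trained model)
toy_vectors = {
    "white": [1.0, 0.0, 0.0],
    "american": [0.0, 1.0, 0.0],
    "british": [0.0, 0.0, 1.0],
    "militia": [0.9, 0.8, 0.1],
    "detroit": [0.1, 0.0, 0.9],
}
result = analogy_query(toy_vectors, ["white", "american"], ["british"])
```

In this toy space `militia` ranks first because it points in the "white + american" direction and away from "british", while `detroit` receives a negative score.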
whiteAmerican_british = [('moravian', 0.9998588562011719),
('indian', 0.9998587369918823),
('fairfield', 0.9998579621315002),
('mani', 0.9998571872711182),
('live', 0.9998558163642883),
('murder', 0.9998539686203003),
('day', 0.9998528957366943),
('missionari', 0.9998528957366943),
('massacr', 0.9998518824577332),
('ohio', 0.999851405620575)]
## Visualize words most similar to White American
my_word_list=[]
my_word_vectors=[]
# label=[]
for i in whiteAmerican_british:
    # i is a (word, similarity) pair; skip words that were already added
    if i[0] not in my_word_list:
        my_word_list.append(i[0])
        my_word_vectors.append(model_p.wv[i[0]])
# You may need to tune these parameters, especially the perplexity.
# t-SNE (initialized with PCA) reduces the vectors to 2-D: an "X" and a "Y".
tsne_model = TSNE(perplexity=5, n_components=2, init='pca', n_iter=3000, random_state=23)
new_values = tsne_model.fit_transform(my_word_vectors)
x = []
y = []
for value in new_values:
    x.append(value[0])
    y.append(value[1])
trace1 = go.Scatter(
    x = x,
    y = y,
    mode = 'markers+text',
    name = "Similar to White American",
    text = [stem_dict[word] if stem_dict.get(word) else word for word in my_word_list],
    textposition='bottom center'
)
data = [trace1]
layout = go.Layout(dict(title = "Most similar Words to White American",
                        yaxis = dict(title = "Dimension2"),
                        xaxis = dict(title = "Dimension1"),
                        plot_bgcolor = "rgb(243,243,243)",
                        paper_bgcolor = "rgb(243,243,243)",
                        )
                   )
fig = go.Figure(data=data,layout=layout)
py.iplot(fig)
Some of the significant words that come up from this specific dimension include;
These words reconstruct what we know about the Gnadenhutten Massacre: "The Moravian Indians, who were not allies of Britain, were massacred by American militia in Gnadenhutten, Ohio."
I am using Google's Embedding Projector to visualize my Word2Vec model. Since it is built on TensorFlow, we need to convert our Word2Vec output to the tensor format. The function below does just that.
The final visualization of this Word2Vec model can be found here.
def word2vec2tensor(model, tensor_filename):
    outfiletsv = tensor_filename + '_tensor.tsv'
    outfiletsvmeta = tensor_filename + '_metadata.tsv'
    with smart_open(outfiletsv, 'wb') as file_vector, smart_open(outfiletsvmeta, 'wb') as file_metadata:
        for word in model.wv.index2word:
            # Label each point with the unstemmed form, but look up the vector
            # with the stemmed key, since only stems are in the model vocabulary
            label = stem_dict[word] if stem_dict.get(word) else word
            file_metadata.write(gensim.utils.to_utf8(label) + gensim.utils.to_utf8('\n'))
            vector_row = '\t'.join(str(x) for x in model.wv[word])
            file_vector.write(gensim.utils.to_utf8(vector_row) + gensim.utils.to_utf8('\n'))
# word2vec2tensor(model_p, "model_primaries")
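For readers without gensim or smart_open at hand, the two-file layout can be sketched with the standard library alone: one tab-separated row of numbers per word in the vector file, and the matching label on the same row of the metadata file. This is a minimal sketch under my own assumptions. The function name `write_projector_tsv`, the toy vectors and the in-memory buffers are hypothetical, and a single-column metadata file is written without a header row.

```python
import csv
import io

def write_projector_tsv(vectors, labels, vector_file, metadata_file):
    """Write one vector per row (tab-separated) and one label per row, in the same order."""
    vec_writer = csv.writer(vector_file, delimiter="\t", lineterminator="\n")
    for word, vec in vectors.items():
        vec_writer.writerow(vec)
        # Fall back to the stemmed key when no unstemmed label is known
        metadata_file.write(labels.get(word, word) + "\n")

# Toy data (hypothetical): stemmed keys with 2-D vectors and a partial label map
toy_vectors = {"massacr": [0.1, 0.2], "villag": [0.3, 0.4]}
toy_labels = {"massacr": "massacre"}

vector_buf, metadata_buf = io.StringIO(), io.StringIO()
write_projector_tsv(toy_vectors, toy_labels, vector_buf, metadata_buf)
```

In practice the two buffers would be replaced by files opened in text mode, for example `open("model_primaries_tensor.tsv", "w", newline="")`.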