Employers constantly strive to motivate their team members and create a pleasant work environment, with the goal of increasing productivity while maintaining strong employee retention. It is no coincidence that each year, Employers compete to land on top-100 rankings such as "Canada's Top 100 Employers" and "Great Place To Work."
To evaluate the quality of each Employer, we need to analyze Employer Reviews written by both former and current Employees. Luckily, "Glassdoor" was created for exactly this reason, giving an inside scoop on each Employer. By understanding the main topics in their Employer Reviews, Employers can make adjustments to improve their work environment, which ultimately improves Employee productivity and retention.
However, some Employers have hundreds or even thousands of reviews, which can take a lot of time and resources to read through before any conclusions can be drawn.
Business Solutions:
To solve this issue, we will extract the main topics from all Employer Reviews for each Employer, and then determine the overall consensus.
We will perform Topic Modeling, an unsupervised learning technique, using both the Latent Dirichlet Allocation (LDA) Model and the LDA Mallet (MAchine Learning for LanguagE Toolkit) Model.
We will also determine the dominant topic associated with each Employee Review, as well as the most representative Employee Review for each dominant topic, for an in-depth analysis.
Benefits:
Robustness:
To ensure the model performs well, we will take the following steps:
Note that the main difference between the LDA Model and the LDA Mallet Model is that the LDA Model uses the Variational Bayes method, which is faster but less precise than the LDA Mallet Model, which uses Gibbs Sampling.
Assumption:
Future:
This model is Part Two of "Quality Control for Banking using LDA and LDA Mallet," where we showcase information on Employer Reviews with full visualization of the results.
import pandas as pd
csv = "employee_reviews.csv"
df = pd.read_csv(csv, encoding='latin1') # Solves encoding issue when importing the csv
df.head(5)
Unnamed: 0 | company | location | dates | job-title | summary | pros | cons | advice-to-mgmt | overall-ratings | work-balance-stars | culture-values-stars | carrer-opportunities-stars | comp-benefit-stars | senior-mangemnet-stars | helpful-count | link | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | google | none | Dec 11, 2018 | Current Employee - Anonymous Employee | Best Company to work for | People are smart and friendly | Bureaucracy is slowing things down | none | 5.0 | 4.0 | 5.0 | 5.0 | 4.0 | 5.0 | 0 | https://www.glassdoor.com/Reviews/Google-Revie... | |
1 | 2 | google | Mountain View, CA | Jun 21, 2013 | Former Employee - Program Manager | Moving at the speed of light, burn out is inev... | 1) Food, food, food. 15+ cafes on main campus ... | 1) Work/life balance. What balance? All those ... | 1) Don't dismiss emotional intelligence and ad... | 4.0 | 2.0 | 3.0 | 3.0 | 5.0 | 3.0 | 2094 | https://www.glassdoor.com/Reviews/Google-Revie... | |
2 | 3 | google | New York, NY | May 10, 2014 | Current Employee - Software Engineer III | Great balance between big-company security and... | * If you're a software engineer, you're among ... | * It *is* becoming larger, and with it comes g... | Keep the focus on the user. Everything else wi... | 5.0 | 5.0 | 4.0 | 5.0 | 5.0 | 4.0 | 949 | https://www.glassdoor.com/Reviews/Google-Revie... | |
3 | 4 | google | Mountain View, CA | Feb 8, 2015 | Current Employee - Anonymous Employee | The best place I've worked and also the most d... | You can't find a more well-regarded company th... | I live in SF so the commute can take between 1... | Keep on NOT micromanaging - that is a huge ben... | 5.0 | 2.0 | 5.0 | 5.0 | 4.0 | 5.0 | 498 | https://www.glassdoor.com/Reviews/Google-Revie... | |
4 | 5 | google | Los Angeles, CA | Jul 19, 2018 | Former Employee - Software Engineer | Unique, one of a kind dream job | Google is a world of its own. At every other c... | If you don't work in MTV (HQ), you will be giv... | Promote managers into management for their man... | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 49 | https://www.glassdoor.com/Reviews/Google-Revie... |
df['company'].unique()
array(['google', 'amazon', 'facebook', 'netflix', 'apple', 'microsoft'], dtype=object)
After importing the data, we see that the "summary" column contains the Employer Reviews for each Employer. This is the column we will use for extracting topics.
We also see that there are 6 different Employers under the "company" column. We will review only the first Employer to capture the results of its Employer Reviews.
Note: The same steps that we apply to the first Employer can be replicated for the other Employers.
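As a hedged sketch of that replication (using a tiny made-up DataFrame here rather than the real CSV), the same filtering can be looped over every Employer in the "company" column:

```python
# Illustrative only: a tiny stand-in DataFrame with the same two columns
import pandas as pd

df = pd.DataFrame({
    'company': ['google', 'google', 'amazon'],
    'summary': ['Best Company to work for', 'Good pay', 'Fast paced'],
})

samples = {}
for company in df['company'].unique():
    # Filter to one Employer, keep only the review text, cap the sample size
    samples[company] = df.loc[df['company'] == company, ['summary']].head(500)

print(sorted(samples))         # → ['amazon', 'google']
print(len(samples['google']))  # → 2
```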
# Filters to Google only
dfg = df[df['company'] == 'google']
# Filters the data to the column needed for topic modeling
dfg = dfg[['summary']]
# Use the first 500 reviews as our sample
dfg = dfg.head(500)
Here we filtered the dataset to the first Employer, then kept the "summary" column containing the Employer Reviews. Lastly, we reduced the sample size to 500 to save computation time and power.
We will use regular expressions to clean out any unfavorable characters in our dataset, and then preview what the data looks like after the cleaning.
data = dfg['summary'].values.tolist() # convert to list
# Use Regex to remove all characters except letters and space
import re
data = [re.sub(r'[^a-zA-Z ]+', '', str(sent)) for sent in data]
# Preview the first list of the cleaned data
from pprint import pprint
pprint(data[:1])
['Best Company to work for']
With our data now cleaned, the next step is to pre-process it so that it can be used as input for our LDA model.
We will perform the following:
- Tokenization and additional cleaning, using Gensim's simple_preprocess
- Stopword removal (once again), using NLTK's corpus.stopwords
- Bigram and trigram creation, using Gensim's models.phrases.Phraser
- Lemmatization, using spacy.load('en'), which is SpaCy's English model
# Implement simple_preprocess for Tokenization and additional cleaning
import gensim
from gensim.utils import simple_preprocess
def sent_to_words(sentences):
for sentence in sentences:
yield(gensim.utils.simple_preprocess(str(sentence), deacc=True)) # deacc=True removes punctuations
data_words = list(sent_to_words(data))
# Remove stopwords using gensim's simple_preprocess and NLTK's stopwords
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use']) # Add additional stop words
def remove_stopwords(texts):
return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]
data_words_nostops = remove_stopwords(data_words)
# Create and Apply Bigrams and Trigrams
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # Higher threshold fewer phrases
trigram = gensim.models.Phrases(bigram[data_words], threshold=100) # Trigrams are built on top of bigrams
bigram_mod = gensim.models.phrases.Phraser(bigram) # Faster way to get a sentence into a trigram/bigram
trigram_mod = gensim.models.phrases.Phraser(trigram)
def make_trigram(texts):
return [trigram_mod[bigram_mod[doc]] for doc in texts]
data_words_trigrams = make_trigram(data_words_nostops)
# Lemmatize the data
import spacy
nlp = spacy.load('en', disable=['parser', 'ner'])
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
texts_out = []
for sent in texts:
doc = nlp(" ".join(sent)) # Parse the joined tokens with SpaCy's English model
texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
# lemma_ is the base form and pos_ is the part of speech
return texts_out
data_lemmatized = lemmatization(data_words_trigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
# Preview the data
print(data_lemmatized[:1])
[['good', 'company', 'work']]
Here we see texts that are tokenized, cleaned (stopwords removed), and lemmatized, with applicable bigrams and trigrams applied.
Now that our data has been cleaned and pre-processed, here are the final steps we need to implement before our data is ready for LDA input:
- Create a dictionary, using corpora.Dictionary
- Convert the corpus into bag-of-words term frequencies, using .doc2bow
import gensim.corpora as corpora
id2word = corpora.Dictionary(data_lemmatized) # Create dictionary
texts = data_lemmatized # Create corpus
corpus = [id2word.doc2bow(text) for text in texts] # Apply Term Frequency
print(corpus[:1]) # Preview the data
[[(0, 1), (1, 1), (2, 1)]]
We can see that our corpus is a list of (word index, count frequency) pairs for every word in each document.
id2word[0]
'company'
We can also see the actual word of each index by calling the index from our pre-processed data dictionary.
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]
[[('company', 1), ('good', 1), ('work', 1)]]
Lastly, we can see every word in its actual form (instead of index form), followed by its count frequency, using a simple for loop.
Now that we have created our dictionary and corpus, we can feed the data into our LDA Model.
Latent (hidden) Dirichlet Allocation is a generative probabilistic model of documents (composites) made up of words (parts). The model is based on the probability of words when sampling topics (categories), and the probability of topics when sampling a document.
Essentially, we are extracting topics in documents by looking at the probability of words to determine the topics, and then the probability of topics to determine the documents.
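To make this generative story concrete, here is a minimal sketch with made-up topic and word distributions (the vocabulary and probabilities below are illustrative assumptions, not values learned from our data):

```python
# Toy sketch of LDA's generative story: each document mixes topics,
# and each topic is a distribution over words. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
vocab = ['work', 'culture', 'pay', 'manager', 'benefit', 'team']

# Assumed word probabilities for 2 hypothetical topics
topic_word = np.array([
    [0.40, 0.30, 0.05, 0.05, 0.10, 0.10],  # topic 0: environment-flavored
    [0.05, 0.05, 0.40, 0.20, 0.25, 0.05],  # topic 1: compensation-flavored
])
doc_topic = np.array([0.7, 0.3])  # this document leans toward topic 0

def generate_doc(n_words):
    words = []
    for _ in range(n_words):
        z = rng.choice(2, p=doc_topic)      # sample a topic for this word slot
        w = rng.choice(6, p=topic_word[z])  # sample a word from that topic
        words.append(vocab[w])
    return words

print(generate_doc(5))
```

LDA inference runs this story in reverse: given only the words, it estimates the topic-word and document-topic distributions that most plausibly generated them.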
There are two LDA algorithms: Variational Bayes, used by Gensim's LDA Model, and Gibbs Sampling, used by the LDA Mallet Model through Gensim's wrapper package.
Here is the general overview of Variational Bayes and Gibbs Sampling:
# Build LDA Model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics = 7, random_state = 100,
update_every = 1, chunksize = 100, passes = 10, alpha = 'auto',
per_word_topics=True) # Here we selected 7 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]
[(0, '0.102*"google" + 0.071*"great" + 0.036*"sale" + 0.029*"job" + 0.024*"pay" + ' '0.023*"love" + 0.022*"perk" + 0.022*"balance" + 0.020*"internship" + ' '0.018*"business"'), (1, '0.054*"experience" + 0.040*"lead" + 0.023*"grad" + 0.023*"associate" + ' '0.020*"analytical" + 0.019*"designer" + 0.017*"strategist" + 0.011*"brand" ' '+ 0.010*"solution" + 0.010*"interactive"'), (2, '0.172*"manager" + 0.063*"account" + 0.044*"program" + 0.041*"senior" + ' '0.038*"product" + 0.025*"marketing" + 0.024*"culture" + 0.022*"bad" + ' '0.020*"staff" + 0.017*"new"'), (3, '0.233*"engineer" + 0.206*"software" + 0.071*"review" + 0.022*"cloud" + ' '0.016*"engineering" + 0.015*"developer" + 0.014*"pgm" + 0.013*"need" + ' '0.013*"senior" + 0.011*"legal"'), (4, '0.166*"work" + 0.143*"good" + 0.141*"place" + 0.139*"great" + ' '0.112*"company" + 0.013*"career" + 0.012*"awesome" + 0.011*"excellent" + ' '0.008*"start" + 0.008*"stuff"'), (5, '0.070*"intern" + 0.067*"amazing" + 0.055*"analyst" + 0.027*"technical" + ' '0.025*"specialist" + 0.020*"long" + 0.017*"skill" + 0.017*"term" + ' '0.017*"come" + 0.013*"sure"'), (6, '0.123*"great" + 0.056*"people" + 0.044*"benefit" + 0.029*"culture" + ' '0.027*"overall" + 0.023*"director" + 0.023*"meh" + 0.021*"lot" + ' '0.020*"many" + 0.019*"time"')]
After building the LDA Model using Gensim, we display the 7 topics in our document, along with the top 10 keywords and their corresponding weights that make up each topic.
# Compute perplexity
print('Perplexity: ', lda_model.log_perplexity(corpus))
# Compute coherence score
from gensim.models import CoherenceModel
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)
Perplexity: -5.49701415002346 Coherence Score: 0.6258108598839495
In order to determine the accuracy of the topics that we used, we will compute the Perplexity Score and the Coherence Score. The Perplexity score measures how well the LDA Model predicts the sample (the lower the perplexity score, the better the model predicts). The Coherence score measures the quality of the topics that were learned (the higher the coherence score, the higher the quality of the learned topics).
Here we see a Perplexity score of -5.49 (negative due to log space), and Coherence score of 0.62.
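As a hedged aside (assuming Gensim's log_perplexity returns a per-word bound in base-2 log space), the printed score can be converted back into an actual perplexity value:

```python
# Convert Gensim's per-word log-space bound into a perplexity number
# (assumption: the bound is in base-2 log space, so perplexity = 2 ** -bound)
bound = -5.49701415002346  # the value printed above
perplexity = 2 ** (-bound)
print(round(perplexity, 1))
```

Lower perplexity is better; the negative sign of the printed score simply reflects the log space, not a negative perplexity.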
Note: We will use the Coherence score moving forward, since we want to optimize the number of topics in our documents.
import warnings
warnings.filterwarnings("ignore", category=FutureWarning) # Hides all future warnings
import pyLDAvis
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis
We are using pyLDAvis to visualize our topics.
For interpretation of pyLDAvis:
Now that we have completed our Topic Modeling using the "Variational Bayes" algorithm from Gensim's LDA, we will explore Mallet's LDA (which is more accurate but slower), using Gibbs Sampling (a Markov Chain Monte Carlo method) through Gensim's wrapper package.
Mallet's LDA Model is more accurate, since it uses Gibbs Sampling, drawing one variable at a time conditional upon all other variables.
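To build intuition for "sampling one variable at a time conditional upon all other variables," here is a generic Gibbs sampler on a toy bivariate normal. This illustrates the mechanism only, not Mallet's actual topic sampler:

```python
# Minimal Gibbs sampler: alternate draws of x | y and y | x.
# Target: a standard bivariate normal with correlation rho.
import numpy as np

rng = np.random.default_rng(42)
rho = 0.8
x, y = 0.0, 0.0
samples = []
for _ in range(5000):
    # Conditionals of a standard bivariate normal:
    # x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y | x
    x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))
    y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))
    samples.append((x, y))

xs, ys = np.array(samples).T
print(round(float(np.corrcoef(xs, ys)[0, 1]), 2))  # close to rho = 0.8
```

Mallet applies the same idea to topic assignments: each word's topic is resampled conditional on every other word's current topic, which is slower than Variational Bayes but converges to the true posterior.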
import os
from gensim.models.wrappers import LdaMallet
os.environ.update({'MALLET_HOME':r'/Users/Mick/Desktop/mallet/'}) # Set environment
mallet_path = '/Users/Mick/Desktop/mallet/bin/mallet' # Update this path
# Build the LDA Mallet Model
ldamallet = LdaMallet(mallet_path,corpus=corpus,num_topics=7,id2word=id2word) # Here we selected 7 topics again
pprint(ldamallet.show_topics(formatted=False))
[(0, [('work', 0.42162162162162165), ('benefit', 0.04864864864864865), ('culture', 0.043243243243243246), ('engineer', 0.03783783783783784), ('lot', 0.02702702702702703), ('environment', 0.02702702702702703), ('long', 0.021621621621621623), ('world', 0.016216216216216217), ('designer', 0.016216216216216217), ('early', 0.010810810810810811)]), (1, [('engineer', 0.2781065088757396), ('software', 0.2781065088757396), ('senior', 0.05917159763313609), ('cloud', 0.029585798816568046), ('nice', 0.023668639053254437), ('lead', 0.023668639053254437), ('staff', 0.01775147928994083), ('partner', 0.01775147928994083), ('developer', 0.01775147928994083), ('big', 0.01775147928994083)]), (2, [('great', 0.33076923076923076), ('sale', 0.06153846153846154), ('program', 0.046153846153846156), ('love', 0.046153846153846156), ('director', 0.038461538461538464), ('good', 0.03076923076923077), ('time', 0.023076923076923078), ('datum', 0.015384615384615385), ('stuff', 0.015384615384615385), ('analytical', 0.015384615384615385)]), (3, [('manager', 0.18787878787878787), ('account', 0.06666666666666667), ('analyst', 0.06666666666666667), ('people', 0.06060606060606061), ('product', 0.048484848484848485), ('perk', 0.04242424242424243), ('balance', 0.03636363636363636), ('project', 0.030303030303030304), ('associate', 0.030303030303030304), ('meh', 0.024242424242424242)]), (4, [('place', 0.24025974025974026), ('google', 0.11688311688311688), ('amazing', 0.09090909090909091), ('review', 0.08441558441558442), ('job', 0.045454545454545456), ('excellent', 0.025974025974025976), ('specialist', 0.01948051948051948), ('rough', 0.012987012987012988), ('bad', 0.012987012987012988), ('depend', 0.006493506493506494)]), (5, [('company', 0.32), ('great', 0.3142857142857143), ('career', 0.022857142857142857), ('pay', 0.022857142857142857), ('grad', 0.017142857142857144), ('dream', 0.017142857142857144), ('perfect', 0.017142857142857144), ('marketing', 0.011428571428571429), ('security', 
0.011428571428571429), ('executive', 0.005714285714285714)]), (6, [('good', 0.3611111111111111), ('place', 0.1388888888888889), ('intern', 0.07222222222222222), ('experience', 0.05), ('awesome', 0.027777777777777776), ('technical', 0.016666666666666666), ('tech', 0.016666666666666666), ('ambitious', 0.011111111111111112), ('fix', 0.011111111111111112), ('data', 0.011111111111111112)])]
After building the LDA Mallet Model using Gensim's wrapper package, we see our 7 new topics in the document, along with the top 10 keywords and their corresponding weights that make up each topic.
# Compute coherence score
coherence_model_ldamallet = CoherenceModel(model=ldamallet, texts=data_lemmatized, dictionary=id2word, coherence="c_v")
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)
Coherence Score: 0.7737574215817109
Here we see that the Coherence Score for our LDA Mallet Model is 0.77, a clear improvement over the 0.62 Coherence Score from the LDA Model above. Given that we are now using the more accurate Gibbs Sampling model, and that the Coherence Score measures the quality of the topics that were learned, our next step is to improve this score further, which will ultimately improve the overall quality of the topics learned.
To improve the quality of the topics learned, we need to find the optimal number of topics in our document; once we do, our Coherence Score will be optimized, since all the topics in the document will be extracted without redundancy.
We will use the following compute_coherence_values function to run our LDA Mallet Model:
Note: We will train our model to find topics in the range of 2 to 40, with an interval of 6.
# Compute a list of LDA Mallet Models and corresponding Coherence Values
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
coherence_values = []
model_list = []
for num_topics in range(start, limit, step):
model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=id2word)
model_list.append(model)
coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
coherence_values.append(coherencemodel.get_coherence())
return model_list, coherence_values
model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=data_lemmatized,
start=2, limit=40, step=6)
# Visualize the optimal LDA Mallet Model
import matplotlib.pyplot as plt
limit=40; start=2; step=6;
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel('Num Topics')
plt.ylabel('Coherence score')
plt.legend(['coherence_values'], loc='best')
plt.show()
# Print the coherence scores
for m, cv in zip(x, coherence_values):
print('Num Topics =', m, ' has Coherence Value of', round(cv, 4))
Num Topics = 2 has Coherence Value of 0.7254 Num Topics = 8 has Coherence Value of 0.7591 Num Topics = 14 has Coherence Value of 0.782 Num Topics = 20 has Coherence Value of 0.7849 Num Topics = 26 has Coherence Value of 0.7771 Num Topics = 32 has Coherence Value of 0.7673 Num Topics = 38 has Coherence Value of 0.7494
With our models trained and their performance visualized, we can see that the optimal number of topics is 20, with a Coherence Score of 0.78, slightly higher than our previous result of 0.77. Since the 20-topic model is the highest-scoring model, this implies that there are 20 dominant topics in this document.
We will proceed and select our final model using 20 topics.
# Select the model with highest coherence value and print the topics
optimal_model = model_list[3] # Index 3 corresponds to 20 topics (start=2, step=6: 2, 8, 14, 20, ...)
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10)) # Set the num_words parameter to show 10 words per topic
[(0, '0.282*"company" + 0.231*"people" + 0.077*"place" + 0.051*"grad" + ' '0.026*"challenge" + 0.026*"compensation" + 0.026*"contract" + ' '0.026*"progress" + 0.026*"smart" + 0.026*"intern"'), (1, '0.367*"company" + 0.102*"cloud" + 0.041*"strategist" + 0.041*"environment" ' '+ 0.041*"employee" + 0.020*"pry" + 0.020*"usa" + 0.020*"break" + ' '0.020*"atmosphere" + 0.020*"term"'), (2, '0.164*"review" + 0.164*"amazing" + 0.145*"senior" + 0.109*"job" + ' '0.109*"balance" + 0.055*"great" + 0.036*"big" + 0.018*"compensation" + ' '0.018*"java" + 0.018*"ep"'), (3, '0.333*"good" + 0.083*"review" + 0.083*"pay" + 0.062*"environment" + ' '0.042*"senior" + 0.042*"outstanding" + 0.021*"depend" + 0.021*"point" + ' '0.021*"hype" + 0.021*"worrying"'), (4, '0.314*"great" + 0.137*"program" + 0.078*"perk" + 0.059*"time" + ' '0.059*"staff" + 0.039*"analytical" + 0.020*"technician" + 0.020*"average" + ' '0.020*"coworker" + 0.020*"datum"'), (5, '0.530*"place" + 0.045*"perfect" + 0.045*"technical" + 0.030*"life" + ' '0.015*"overpay" + 0.015*"iii" + 0.015*"learn" + 0.015*"vice" + ' '0.015*"leader" + 0.015*"write"'), (6, '0.431*"company" + 0.308*"good" + 0.077*"product" + 0.015*"team" + ' '0.015*"ad" + 0.015*"depend" + 0.015*"effect" + 0.015*"consultant" + ' '0.015*"starting" + 0.015*"altering"'), (7, '0.473*"great" + 0.145*"benefit" + 0.073*"excellent" + 0.055*"product" + ' '0.055*"lot" + 0.018*"class" + 0.018*"customer" + 0.018*"real" + ' '0.018*"promo" + 0.018*"run"'), (8, '0.579*"good" + 0.053*"analyst" + 0.053*"business" + 0.035*"pgm" + ' '0.035*"year" + 0.018*"educator" + 0.018*"workplace" + 0.018*"accountant" + ' '0.018*"bad" + 0.018*"unclear"'), (9, '0.316*"google" + 0.088*"director" + 0.035*"phenomenal" + 0.035*"early" + ' '0.035*"fun" + 0.018*"fulfillment" + 0.018*"workload" + 0.018*"exact" + ' '0.018*"hire" + 0.018*"eager"'), (10, '0.437*"manager" + 0.155*"account" + 0.085*"project" + 0.042*"tech" + ' '0.042*"awesome" + 0.028*"start" + 0.014*"unethical" + 0.014*"make" + ' 
'0.014*"googlex" + 0.014*"goal"'), (11, '0.435*"software" + 0.065*"dream" + 0.065*"developer" + 0.043*"data" + ' '0.043*"meh" + 0.043*"rough" + 0.022*"expectation" + 0.022*"unique" + ' '0.022*"large" + 0.022*"senior"'), (12, '0.338*"work" + 0.118*"culture" + 0.074*"associate" + 0.074*"amazing" + ' '0.059*"experience" + 0.015*"realistic" + 0.015*"listen" + 0.015*"glassdoor" ' '+ 0.015*"undeniably" + 0.015*"workplace"'), (13, '0.348*"work" + 0.188*"intern" + 0.116*"analyst" + 0.058*"internship" + ' '0.058*"love" + 0.058*"engineering" + 0.014*"demand" + 0.014*"assignment" + ' '0.014*"learn" + 0.014*"gr"'), (14, '0.585*"work" + 0.057*"ambitious" + 0.038*"wonderful" + 0.038*"awesome" + ' '0.019*"add" + 0.019*"wlb" + 0.019*"surpass" + 0.019*"enjoy" + ' '0.019*"working" + 0.019*"brand"'), (15, '0.609*"engineer" + 0.078*"lead" + 0.062*"experience" + 0.047*"bad" + ' '0.016*"stuff" + 0.016*"teammate" + 0.016*"datacenter" + 0.016*"challang" + ' '0.016*"market" + 0.016*"practicum"'), (16, '0.479*"great" + 0.083*"career" + 0.062*"marketing" + 0.042*"lot" + ' '0.021*"brand" + 0.021*"worker" + 0.021*"employee" + 0.021*"org" + ' '0.021*"job" + 0.021*"meh"'), (17, '0.429*"software" + 0.238*"engineer" + 0.048*"designer" + 0.032*"fix" + ' '0.016*"competitive" + 0.016*"hrbp" + 0.016*"challenge" + 0.016*"boost" + ' '0.016*"reward" + 0.016*"founder"'), (18, '0.441*"great" + 0.118*"sale" + 0.044*"long" + 0.015*"geekland" + ' '0.015*"quality" + 0.015*"production" + 0.015*"shiny" + 0.015*"colleague" + ' '0.015*"stuff" + 0.015*"adsense"'), (19, '0.364*"place" + 0.045*"partner" + 0.045*"specialist" + 0.045*"perk" + ' '0.030*"love" + 0.030*"worklife" + 0.030*"fun" + 0.015*"issue" + ' '0.015*"store" + 0.015*"hardware"')]
By using our optimal LDA Mallet Model through Gensim's wrapper package, we display the 20 topics in our document, along with the top 10 keywords and their corresponding weights that make up each topic.
# Wordcloud of Top N words in each topic
from matplotlib import pyplot as plt
from wordcloud import WordCloud, STOPWORDS
import matplotlib.colors as mcolors
cols = [color for name, color in mcolors.TABLEAU_COLORS.items()]
cloud = WordCloud(stopwords=stop_words,
background_color='white',
width=2500,
height=1800,
max_words=10,
colormap='tab10',
color_func=lambda *args, **kwargs: cols[i],
prefer_horizontal=1.0)
topics = optimal_model.show_topics(formatted=False)
fig, axes = plt.subplots(2, 2, figsize=(10,10), sharex=True, sharey=True)
for i, ax in enumerate(axes.flatten()):
fig.add_subplot(ax)
topic_words = dict(topics[i][1])
cloud.generate_from_frequencies(topic_words, max_font_size=300)
plt.gca().imshow(cloud)
plt.gca().set_title('Topic ' + str(i), fontdict=dict(size=16))
plt.gca().axis('off')
plt.subplots_adjust(wspace=0, hspace=0)
plt.axis('off')
plt.margins(x=0, y=0)
plt.tight_layout()
plt.show()
Here we also visualize the first 4 of the topics in our document, along with their top 10 keywords. Each keyword's corresponding weight is shown by the size of its text.
Based on the visualization, we see the following topics:
Now that our Optimal Model is constructed, we will apply the model and determine the following:
def format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=data):
sent_topics_df = pd.DataFrame()
# Get dominant topic in each document
for i, row in enumerate(ldamodel[corpus]):
row = sorted(row, key=lambda x: (x[1]), reverse=True)
# Get the Dominant topic, Perc Contribution and Keywords for each document
for j, (topic_num, prop_topic) in enumerate(row):
if j == 0:
wp = ldamodel.show_topic(topic_num)
topic_keywords = ", ".join([word for word, prop in wp])
sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic,4),
topic_keywords]), ignore_index=True)
else:
break
sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'] # Create dataframe title
# Add original text to the end of the output (recall that texts = data_lemmatized)
contents = pd.Series(texts)
sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
return(sent_topics_df)
df_topic_sents_keywords = format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=data)
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Document']
df_dominant_topic.head(10)
Document_No | Dominant_Topic | Topic_Perc_Contrib | Keywords | Document | |
---|---|---|---|---|---|
0 | 0 | 6.0 | 0.0577 | company, good, product, team, ad, depend, effe... | Best Company to work for |
1 | 1 | 4.0 | 0.0798 | great, program, perk, time, staff, analytical,... | Moving at the speed of light burn out is inevi... |
2 | 2 | 10.0 | 0.0731 | manager, account, project, tech, awesome, star... | Great balance between bigcompany security and ... |
3 | 3 | 13.0 | 0.0648 | work, intern, analyst, internship, love, engin... | The best place Ive worked and also the most de... |
4 | 4 | 11.0 | 0.0833 | software, dream, developer, data, meh, rough, ... | Unique one of a kind dream job |
5 | 5 | 13.0 | 0.0723 | work, intern, analyst, internship, love, engin... | NICE working in GOOGLE as an INTERN |
6 | 6 | 15.0 | 0.0630 | engineer, lead, experience, bad, stuff, teamma... | Software engineer |
7 | 7 | 0.0 | 0.0628 | company, people, place, grad, challenge, compe... | great place to work and progress |
8 | 8 | 14.0 | 0.0660 | work, ambitious, wonderful, awesome, add, wlb,... | Google Surpasses Realistic Expectations |
9 | 9 | 15.0 | 0.0588 | engineer, lead, experience, bad, stuff, teamma... | Execellent for engineers |
Here we see the first 10 documents with their corresponding dominant topics attached.
# Find the most representative document for each of the 20 dominant topics
sent_topics_sorteddf_mallet = pd.DataFrame()
sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('Dominant_Topic')
for i, grp in sent_topics_outdf_grpd:
sent_topics_sorteddf_mallet = pd.concat([sent_topics_sorteddf_mallet,
grp.sort_values(['Perc_Contribution'], ascending=[0]).head(1)], axis=0)
sent_topics_sorteddf_mallet.reset_index(drop=True, inplace=True)
sent_topics_sorteddf_mallet.columns = ['Topic_Num', "Topic_Perc_Contrib", "Keywords", "Document"]
sent_topics_sorteddf_mallet
Topic_Num | Topic_Perc_Contrib | Keywords | Document | |
---|---|---|---|---|
0 | 0.0 | 0.0804 | company, people, place, grad, challenge, compe... | Company full of people running around caring o... |
1 | 1.0 | 0.0833 | company, cloud, strategist, environment, emplo... | I broke down crying on the datacenter floor |
2 | 2.0 | 0.0717 | review, amazing, senior, job, balance, great, ... | Amazing place to develop technical skills |
3 | 3.0 | 0.0744 | good, review, pay, environment, senior, outsta... | Good pay and work |
4 | 4.0 | 0.0807 | great, program, perk, time, staff, analytical,... | Average with a hint of arrogance |
5 | 5.0 | 0.0778 | place, perfect, technical, life, overpay, iii,... | Not perfect but still the best place in the wo... |
6 | 6.0 | 0.0702 | company, good, product, team, ad, depend, effe... | Best Company in the world |
7 | 7.0 | 0.0874 | great, benefit, excellent, product, lot, class... | Great benefits but large enough to get lost in |
8 | 8.0 | 0.0713 | good, analyst, business, pgm, year, educator, ... | Good company with good benefits lots of red ta... |
9 | 9.0 | 0.0828 | google, director, phenomenal, early, fun, fulf... | Early Childhood Educator |
10 | 10.0 | 0.0865 | manager, account, project, tech, awesome, star... | Project Manager |
11 | 11.0 | 0.0833 | software, dream, developer, data, meh, rough, ... | Unique one of a kind dream job |
12 | 12.0 | 0.0759 | work, culture, associate, amazing, experience,... | Massage Therapist |
13 | 13.0 | 0.0849 | work, intern, analyst, internship, love, engin... | Software Engineering Intern |
14 | 14.0 | 0.0723 | work, ambitious, wonderful, awesome, add, wlb,... | wonderful place to work |
15 | 15.0 | 0.0702 | engineer, lead, experience, bad, stuff, teamma... | Engineering Practicum Internship |
16 | 16.0 | 0.0751 | great, career, marketing, lot, brand, worker, ... | Google is great recruiting org not so much |
17 | 17.0 | 0.0798 | software, engineer, designer, fix, competitive... | Sr Interactive Designer Sr Solution Consultant |
18 | 18.0 | 0.0844 | great, sale, long, geekland, quality, producti... | Adsense Publisher |
19 | 19.0 | 0.0765 | place, partner, specialist, perk, love, workli... | Love working at Google in Boulder CO |
Here we see the most representative document for each of the 20 dominant topics.
# Number of Documents for Each Topic
topic_counts = df_topic_sents_keywords['Dominant_Topic'].value_counts()
topic_contribution = round(topic_counts/topic_counts.sum(), 4)
topic_num_keywords = {'Topic_Num': pd.Series([0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,
11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0])}
topic_num_keywords = pd.DataFrame(topic_num_keywords)
df_dominant_topics = pd.concat([topic_num_keywords, topic_counts, topic_contribution], axis=1)
df_dominant_topics.reset_index(drop=True, inplace=True)
df_dominant_topics.columns = ['Dominant Topic', 'Num_Document', 'Perc_Document']
df_dominant_topics
Dominant Topic | Num_Document | Perc_Document | |
---|---|---|---|
0 | 0.0 | 35 | 0.070 |
1 | 1.0 | 32 | 0.064 |
2 | 2.0 | 20 | 0.040 |
3 | 3.0 | 24 | 0.048 |
4 | 4.0 | 27 | 0.054 |
5 | 5.0 | 35 | 0.070 |
6 | 6.0 | 19 | 0.038 |
7 | 7.0 | 17 | 0.034 |
8 | 8.0 | 22 | 0.044 |
9 | 9.0 | 33 | 0.066 |
10 | 10.0 | 43 | 0.086 |
11 | 11.0 | 18 | 0.036 |
12 | 12.0 | 21 | 0.042 |
13 | 13.0 | 26 | 0.052 |
14 | 14.0 | 15 | 0.030 |
15 | 15.0 | 24 | 0.048 |
16 | 16.0 | 17 | 0.034 |
17 | 17.0 | 33 | 0.066 |
18 | 18.0 | 19 | 0.038 |
19 | 19.0 | 20 | 0.040 |
Here we see the number of documents and the percentage of overall documents that contribute to each of the 20 dominant topics.
Based on our modeling above, we were able to use a highly accurate model based on Gibbs Sampling, and further optimize it by finding the optimal number of dominant topics without redundancy.
As a result, we can now see the 20 dominant topics that were extracted from our dataset. Furthermore, we can see the dominant topic for each of the 500 documents, and determine the most representative document for each dominant topic.
With the in-depth analysis of individual topics and documents above, Employers can use this approach to learn the topics in their Employer Reviews and make appropriate adjustments to improve their work environment, which can ultimately improve Employee productivity and retention.