Employers constantly strive to motivate their team members and create a pleasant work environment, with the goal of increasing productivity while maintaining strong employee retention. It is no coincidence that each year, Employers compete to land on top-100 rankings such as "Canada's Top 100 Employers" and "Great Place To Work."
To evaluate the quality of each Employer, we need to analyze Employer Reviews written by both former and current Employees. Luckily, "Glassdoor" was created for exactly this reason, giving an inside scoop on each Employer. By understanding the main topics in their Employer Reviews, Employers can make adjustments to improve their work environment, which ultimately improves Employee productivity and retention.
However, some Employers have hundreds or even thousands of reviews, which can take a lot of time and resources to read through before any conclusions can be drawn.
Business Solutions:
To solve this issue, we will extract the main topics from all Employer Reviews for each Employer, and then determine the overall consensus.
We will perform Topic Modeling, an unsupervised learning technique, using both the Latent Dirichlet Allocation (LDA) Model and the LDA Mallet (MAchine Learning for LanguagE Toolkit) Model.
We will also determine the dominant topic associated with each Employee Review, as well as the most representative Employee Review for each dominant topic, for an in-depth analysis.
Benefits:
Robustness:
To ensure the model performs well, we will take the following steps:
Note that the main difference between the LDA Model and the LDA Mallet Model is that the LDA Model uses the Variational Bayes method, which is faster but less precise than the LDA Mallet Model, which uses Gibbs Sampling.
Assumption:
Future:
This model is Part Two of "Quality Control for Banking using LDA and LDA Mallet," where we showcase information on Employer Reviews with full visualization of the results.
import pandas as pd
csv = "employee_reviews.csv"
df = pd.read_csv(csv, encoding='latin1') # Solves encoding issue when importing the csv
df.head(5)
Unnamed: 0 | company | location | dates | job-title | summary | pros | cons | advice-to-mgmt | overall-ratings | work-balance-stars | culture-values-stars | carrer-opportunities-stars | comp-benefit-stars | senior-mangemnet-stars | helpful-count | link | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | google | none | Dec 11, 2018 | Current Employee - Anonymous Employee | Best Company to work for | People are smart and friendly | Bureaucracy is slowing things down | none | 5.0 | 4.0 | 5.0 | 5.0 | 4.0 | 5.0 | 0 | https://www.glassdoor.com/Reviews/Google-Revie... | |
1 | 2 | google | Mountain View, CA | Jun 21, 2013 | Former Employee - Program Manager | Moving at the speed of light, burn out is inev... | 1) Food, food, food. 15+ cafes on main campus ... | 1) Work/life balance. What balance? All those ... | 1) Don't dismiss emotional intelligence and ad... | 4.0 | 2.0 | 3.0 | 3.0 | 5.0 | 3.0 | 2094 | https://www.glassdoor.com/Reviews/Google-Revie... | |
2 | 3 | google | New York, NY | May 10, 2014 | Current Employee - Software Engineer III | Great balance between big-company security and... | * If you're a software engineer, you're among ... | * It *is* becoming larger, and with it comes g... | Keep the focus on the user. Everything else wi... | 5.0 | 5.0 | 4.0 | 5.0 | 5.0 | 4.0 | 949 | https://www.glassdoor.com/Reviews/Google-Revie... | |
3 | 4 | google | Mountain View, CA | Feb 8, 2015 | Current Employee - Anonymous Employee | The best place I've worked and also the most d... | You can't find a more well-regarded company th... | I live in SF so the commute can take between 1... | Keep on NOT micromanaging - that is a huge ben... | 5.0 | 2.0 | 5.0 | 5.0 | 4.0 | 5.0 | 498 | https://www.glassdoor.com/Reviews/Google-Revie... | |
4 | 5 | google | Los Angeles, CA | Jul 19, 2018 | Former Employee - Software Engineer | Unique, one of a kind dream job | Google is a world of its own. At every other c... | If you don't work in MTV (HQ), you will be giv... | Promote managers into management for their man... | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 49 | https://www.glassdoor.com/Reviews/Google-Revie... |
df['company'].unique()
array(['google', 'amazon', 'facebook', 'netflix', 'apple', 'microsoft'], dtype=object)
After importing the data, we see that the "summary" column contains the Employer Reviews for each Employer. This is the column we will use for extracting topics.
We also see that there are 6 different Employers under the "company" column. We will review only the first Employer to capture the results of its Employer Reviews.
Note: The same steps that we apply to the first Employer can be replicated for the other Employers.
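As a hedged sketch of that replication (using a tiny made-up DataFrame here rather than the real CSV), the same filtering can be looped over every Employer in the "company" column:

```python
# Illustrative only: a tiny stand-in DataFrame with the same two columns
import pandas as pd

df = pd.DataFrame({
    'company': ['google', 'google', 'amazon'],
    'summary': ['Best Company to work for', 'Good pay', 'Fast paced'],
})

samples = {}
for company in df['company'].unique():
    # Filter to one Employer, keep only the review text, cap the sample size
    samples[company] = df.loc[df['company'] == company, ['summary']].head(500)

print(sorted(samples))         # → ['amazon', 'google']
print(len(samples['google']))  # → 2
```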
# Filters to Google only
dfg = df[df['company'] == 'google']
# Filters the data to the column needed for topic modeling
dfg = dfg[['summary']]
# Use the first 500 reviews as our sample
dfg = dfg.head(500)
Here we filtered the dataset to the first Employer, then kept the "summary" column containing the Employer Reviews. Lastly, we reduced the sample size to 500 to save computation time and power.
We will use regular expressions to clean out any unfavorable characters in our dataset, and then preview what the data looks like after the cleaning.
data = dfg['summary'].values.tolist() # convert to list
# Use Regex to remove all characters except letters and space
import re
data = [re.sub(r'[^a-zA-Z ]+', '', str(sent)) for sent in data]
# Preview the first list of the cleaned data
from pprint import pprint
pprint(data[:1])
['Best Company to work for']
With our data now cleaned, the next step is to pre-process it so that it can be used as input for our LDA model.
We will perform the following:
- Tokenization and additional cleaning, using Gensim's simple_preprocess
- Stopword removal (once again), using NLTK's corpus.stopwords
- Bigram and trigram creation, using Gensim's models.phrases.Phraser
- Lemmatization, using spacy.load('en'), which is SpaCy's English model
# Implement simple_preprocess for Tokenization and additional cleaning
import gensim
from gensim.utils import simple_preprocess
def sent_to_words(sentences):
for sentence in sentences:
yield(gensim.utils.simple_preprocess(str(sentence), deacc=True)) # deacc=True removes punctuations
data_words = list(sent_to_words(data))
# Remove stopwords using gensim's simple_preprocess and NLTK's stopwords
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use']) # Add additional stop words
def remove_stopwords(texts):
return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]
data_words_nostops = remove_stopwords(data_words)
# Create and Apply Bigrams and Trigrams
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # Higher threshold fewer phrases
trigram = gensim.models.Phrases(bigram[data_words], threshold=100) # Trigrams are built on top of bigrams
bigram_mod = gensim.models.phrases.Phraser(bigram) # Faster way to get a sentence into a trigram/bigram
trigram_mod = gensim.models.phrases.Phraser(trigram)
def make_trigram(texts):
return [trigram_mod[bigram_mod[doc]] for doc in texts]
data_words_trigrams = make_trigram(data_words_nostops)
# Lemmatize the data
import spacy
nlp = spacy.load('en', disable=['parser', 'ner'])
def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
texts_out = []
for sent in texts:
doc = nlp(" ".join(sent)) # Parse the joined tokens with SpaCy's English model
texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
# lemma_ is the base form and pos_ is the part of speech
return texts_out
data_lemmatized = lemmatization(data_words_trigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
# Preview the data
print(data_lemmatized[:1])
[['good', 'company', 'work']]
Here we see texts that are tokenized, cleaned (stopwords removed), and lemmatized, with applicable bigrams and trigrams applied.
Now that our data has been cleaned and pre-processed, here are the final steps we need to implement before our data is ready for LDA input:
- Create a dictionary, using corpora.Dictionary
- Convert the corpus into bag-of-words term frequencies, using .doc2bow
import gensim.corpora as corpora
id2word = corpora.Dictionary(data_lemmatized) # Create dictionary
texts = data_lemmatized # Create corpus
corpus = [id2word.doc2bow(text) for text in texts] # Apply Term Frequency
print(corpus[:1]) # Preview the data
[[(0, 1), (1, 1), (2, 1)]]
We can see that our corpus is a list of (word index, count frequency) pairs for every word in each document.
id2word[0]
'company'
We can also see the actual word of each index by calling the index from our pre-processed data dictionary.
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]
[[('company', 1), ('good', 1), ('work', 1)]]
Lastly, we can see every word in its actual form (instead of index form), followed by its count frequency, using a simple for loop.
Now that we have created our dictionary and corpus, we can feed the data into our LDA Model.
Latent (hidden) Dirichlet Allocation is a generative probabilistic model of documents (composites) made up of words (parts). The model is based on the probability of words when sampling topics (categories), and the probability of topics when sampling a document.
Essentially, we are extracting topics in documents by looking at the probability of words to determine the topics, and then the probability of topics to determine the documents.
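To make this generative story concrete, here is a minimal sketch with made-up topic and word distributions (the vocabulary and probabilities below are illustrative assumptions, not values learned from our data):

```python
# Toy sketch of LDA's generative story: each document mixes topics,
# and each topic is a distribution over words. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
vocab = ['work', 'culture', 'pay', 'manager', 'benefit', 'team']

# Assumed word probabilities for 2 hypothetical topics
topic_word = np.array([
    [0.40, 0.30, 0.05, 0.05, 0.10, 0.10],  # topic 0: environment-flavored
    [0.05, 0.05, 0.40, 0.20, 0.25, 0.05],  # topic 1: compensation-flavored
])
doc_topic = np.array([0.7, 0.3])  # this document leans toward topic 0

def generate_doc(n_words):
    words = []
    for _ in range(n_words):
        z = rng.choice(2, p=doc_topic)      # sample a topic for this word slot
        w = rng.choice(6, p=topic_word[z])  # sample a word from that topic
        words.append(vocab[w])
    return words

print(generate_doc(5))
```

LDA inference runs this story in reverse: given only the words, it estimates the topic-word and document-topic distributions that most plausibly generated them.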
There are two LDA algorithms: Variational Bayes, used by Gensim's LDA Model, and Gibbs Sampling, used by the LDA Mallet Model through Gensim's wrapper package.
Here is the general overview of Variational Bayes and Gibbs Sampling:
# Build LDA Model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics = 7, random_state = 100,
update_every = 1, chunksize = 100, passes = 10, alpha = 'auto',
per_word_topics=True) # Here we selected 7 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]
[(0, '0.102*"google" + 0.071*"great" + 0.036*"sale" + 0.029*"job" + 0.024*"pay" + ' '0.023*"love" + 0.022*"perk" + 0.022*"balance" + 0.020*"internship" + ' '0.018*"business"'), (1, '0.054*"experience" + 0.040*"lead" + 0.023*"grad" + 0.023*"associate" + ' '0.020*"analytical" + 0.019*"designer" + 0.017*"strategist" + 0.011*"brand" ' '+ 0.010*"solution" + 0.010*"interactive"'), (2, '0.172*"manager" + 0.063*"account" + 0.044*"program" + 0.041*"senior" + ' '0.038*"product" + 0.025*"marketing" + 0.024*"culture" + 0.022*"bad" + ' '0.020*"staff" + 0.017*"new"'), (3, '0.233*"engineer" + 0.206*"software" + 0.071*"review" + 0.022*"cloud" + ' '0.016*"engineering" + 0.015*"developer" + 0.014*"pgm" + 0.013*"need" + ' '0.013*"senior" + 0.011*"legal"'), (4, '0.166*"work" + 0.143*"good" + 0.141*"place" + 0.139*"great" + ' '0.112*"company" + 0.013*"career" + 0.012*"awesome" + 0.011*"excellent" + ' '0.008*"start" + 0.008*"stuff"'), (5, '0.070*"intern" + 0.067*"amazing" + 0.055*"analyst" + 0.027*"technical" + ' '0.025*"specialist" + 0.020*"long" + 0.017*"skill" + 0.017*"term" + ' '0.017*"come" + 0.013*"sure"'), (6, '0.123*"great" + 0.056*"people" + 0.044*"benefit" + 0.029*"culture" + ' '0.027*"overall" + 0.023*"director" + 0.023*"meh" + 0.021*"lot" + ' '0.020*"many" + 0.019*"time"')]
After building the LDA Model using Gensim, we display the 7 topics in our document, along with the top 10 keywords and their corresponding weights that make up each topic.
# Compute perplexity
print('Perplexity: ', lda_model.log_perplexity(corpus))
# Compute coherence score
from gensim.models import CoherenceModel
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)
Perplexity: -5.49701415002346 Coherence Score: 0.6258108598839495
In order to determine the accuracy of the topics that we used, we will compute the Perplexity Score and the Coherence Score. The Perplexity score measures how well the LDA Model predicts the sample (the lower the perplexity score, the better the model predicts). The Coherence score measures the quality of the topics that were learned (the higher the coherence score, the higher the quality of the learned topics).
Here we see a Perplexity score of -5.49 (negative due to log space), and Coherence score of 0.62.
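As a hedged aside (assuming Gensim's log_perplexity returns a per-word bound in base-2 log space), the printed score can be converted back into an actual perplexity value:

```python
# Convert Gensim's per-word log-space bound into a perplexity number
# (assumption: the bound is in base-2 log space, so perplexity = 2 ** -bound)
bound = -5.49701415002346  # the value printed above
perplexity = 2 ** (-bound)
print(round(perplexity, 1))
```

Lower perplexity is better; the negative sign of the printed score simply reflects the log space, not a negative perplexity.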
Note: We will use the Coherence score moving forward, since we want to optimize the number of topics in our documents.
import warnings
warnings.filterwarnings("ignore", category=FutureWarning) # Hides all future warnings
import pyLDAvis
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis
We are using pyLDAvis to visualize our topics.
For interpretation of pyLDAvis:
Now that we have completed our Topic Modeling using the "Variational Bayes" algorithm from Gensim's LDA, we will explore Mallet's LDA (which is more accurate but slower), using Gibbs Sampling (a Markov Chain Monte Carlo method) through Gensim's wrapper package.
Mallet's LDA Model is more accurate, since it uses Gibbs Sampling, drawing one variable at a time conditional upon all other variables.
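To build intuition for "sampling one variable at a time conditional upon all other variables," here is a generic Gibbs sampler on a toy bivariate normal. This illustrates the mechanism only, not Mallet's actual topic sampler:

```python
# Minimal Gibbs sampler: alternate draws of x | y and y | x.
# Target: a standard bivariate normal with correlation rho.
import numpy as np

rng = np.random.default_rng(42)
rho = 0.8
x, y = 0.0, 0.0
samples = []
for _ in range(5000):
    # Conditionals of a standard bivariate normal:
    # x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y | x
    x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))
    y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))
    samples.append((x, y))

xs, ys = np.array(samples).T
print(round(float(np.corrcoef(xs, ys)[0, 1]), 2))  # close to rho = 0.8
```

Mallet applies the same idea to topic assignments: each word's topic is resampled conditional on every other word's current topic, which is slower than Variational Bayes but converges to the true posterior.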
import os
from gensim.models.wrappers import LdaMallet
os.environ.update({'MALLET_HOME':r'/Users/Mick/Desktop/mallet/'}) # Set environment
mallet_path = '/Users/Mick/Desktop/mallet/bin/mallet' # Update this path
# Build the LDA Mallet Model
ldamallet = LdaMallet(mallet_path,corpus=corpus,num_topics=7,id2word=id2word) # Here we selected 7 topics again
pprint(ldamallet.show_topics(formatted=False))
[(0, [('work', 0.42162162162162165), ('benefit', 0.04864864864864865), ('culture', 0.043243243243243246), ('engineer', 0.03783783783783784), ('lot', 0.02702702702702703), ('environment', 0.02702702702702703), ('long', 0.021621621621621623), ('world', 0.016216216216216217), ('designer', 0.016216216216216217), ('early', 0.010810810810810811)]), (1, [('engineer', 0.2781065088757396), ('software', 0.2781065088757396), ('senior', 0.05917159763313609), ('cloud', 0.029585798816568046), ('nice', 0.023668639053254437), ('lead', 0.023668639053254437), ('staff', 0.01775147928994083), ('partner', 0.01775147928994083), ('developer', 0.01775147928994083), ('big', 0.01775147928994083)]), (2, [('great', 0.33076923076923076), ('sale', 0.06153846153846154), ('program', 0.046153846153846156), ('love', 0.046153846153846156), ('director', 0.038461538461538464), ('good', 0.03076923076923077), ('time', 0.023076923076923078), ('datum', 0.015384615384615385), ('stuff', 0.015384615384615385), ('analytical', 0.015384615384615385)]), (3, [('manager', 0.18787878787878787), ('account', 0.06666666666666667), ('analyst', 0.06666666666666667), ('people', 0.06060606060606061), ('product', 0.048484848484848485), ('perk', 0.04242424242424243), ('balance', 0.03636363636363636), ('project', 0.030303030303030304), ('associate', 0.030303030303030304), ('meh', 0.024242424242424242)]), (4, [('place', 0.24025974025974026), ('google', 0.11688311688311688), ('amazing', 0.09090909090909091), ('review', 0.08441558441558442), ('job', 0.045454545454545456), ('excellent', 0.025974025974025976), ('specialist', 0.01948051948051948), ('rough', 0.012987012987012988), ('bad', 0.012987012987012988), ('depend', 0.006493506493506494)]), (5, [('company', 0.32), ('great', 0.3142857142857143), ('career', 0.022857142857142857), ('pay', 0.022857142857142857), ('grad', 0.017142857142857144), ('dream', 0.017142857142857144), ('perfect', 0.017142857142857144), ('marketing', 0.011428571428571429), ('security', 
0.011428571428571429), ('executive', 0.005714285714285714)]), (6, [('good', 0.3611111111111111), ('place', 0.1388888888888889), ('intern', 0.07222222222222222), ('experience', 0.05), ('awesome', 0.027777777777777776), ('technical', 0.016666666666666666), ('tech', 0.016666666666666666), ('ambitious', 0.011111111111111112), ('fix', 0.011111111111111112), ('data', 0.011111111111111112)])]
After building the LDA Mallet Model using Gensim's wrapper package, we see our 7 new topics in the document, along with the top 10 keywords and their corresponding weights that make up each topic.
# Compute coherence score
coherence_model_ldamallet = CoherenceModel(model=ldamallet, texts=data_lemmatized, dictionary=id2word, coherence="c_v")
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)
Coherence Score: 0.7737574215817109
Here we see that the Coherence Score for our LDA Mallet Model is 0.77, a clear improvement over the 0.62 Coherence Score from the LDA Model above. Given that we are now using the more accurate Gibbs Sampling model, and that the Coherence Score measures the quality of the topics that were learned, our next step is to improve this score further, which will ultimately improve the overall quality of the topics learned.
To improve the quality of the topics learned, we need to find the optimal number of topics in our document; once we do, our Coherence Score will be optimized, since all the topics in the document will be extracted without redundancy.
We will use the following compute_coherence_values function to run our LDA Mallet Model:
Note: We will train our model to find topics in the range of 2 to 40, with an interval of 6.
# Compute a list of LDA Mallet Models and corresponding Coherence Values
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
coherence_values = []
model_list = []
for num_topics in range(start, limit, step):
model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=id2word)
model_list.append(model)
coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
coherence_values.append(coherencemodel.get_coherence())
return model_list, coherence_values
model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=data_lemmatized,
start=2, limit=40, step=6)
# Visualize the optimal LDA Mallet Model
import matplotlib.pyplot as plt
limit=40; start=2; step=6;
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel('Num Topics')
plt.ylabel('Coherence score')
plt.legend(['coherence_values'], loc='best')
plt.show()
# Print the coherence scores
for m, cv in zip(x, coherence_values):
print('Num Topics =', m, ' has Coherence Value of', round(cv, 4))
Num Topics = 2 has Coherence Value of 0.7254 Num Topics = 8 has Coherence Value of 0.7591 Num Topics = 14 has Coherence Value of 0.782 Num Topics = 20 has Coherence Value of 0.7849 Num Topics = 26 has Coherence Value of 0.7771 Num Topics = 32 has Coherence Value of 0.7673 Num Topics = 38 has Coherence Value of 0.7494
With our models trained and their performance visualized, we can see that the optimal number of topics is 20, with a Coherence Score of 0.78, slightly higher than our previous result of 0.77. Since the 20-topic model is the highest-scoring model, this implies that there are 20 dominant topics in this document.
We will proceed and select our final model using 20 topics.
# Select the model with highest coherence value and print the topics
optimal_model = model_list[3] # Index 3 corresponds to 20 topics (start=2, step=6: 2, 8, 14, 20, ...)
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10)) # Set the num_words parameter to show 10 words per topic
[(0, '0.282*"company" + 0.231*"people" + 0.077*"place" + 0.051*"grad" + ' '0.026*"challenge" + 0.026*"compensation" + 0.026*"contract" + ' '0.026*"progress" + 0.026*"smart" + 0.026*"intern"'), (1, '0.367*"company" + 0.102*"cloud" + 0.041*"strategist" + 0.041*"environment" ' '+ 0.041*"employee" + 0.020*"pry" + 0.020*"usa" + 0.020*"break" + ' '0.020*"atmosphere" + 0.020*"term"'), (2, '0.164*"review" + 0.164*"amazing" + 0.145*"senior" + 0.109*"job" + ' '0.109*"balance" + 0.055*"great" + 0.036*"big" + 0.018*"compensation" + ' '0.018*"java" + 0.018*"ep"'), (3, '0.333*"good" + 0.083*"review" + 0.083*"pay" + 0.062*"environment" + ' '0.042*"senior" + 0.042*"outstanding" + 0.021*"depend" + 0.021*"point" + ' '0.021*"hype" + 0.021*"worrying"'), (4, '0.314*"great" + 0.137*"program" + 0.078*"perk" + 0.059*"time" + ' '0.059*"staff" + 0.039*"analytical" + 0.020*"technician" + 0.020*"average" + ' '0.020*"coworker" + 0.020*"datum"'), (5, '0.530*"place" + 0.045*"perfect" + 0.045*"technical" + 0.030*"life" + ' '0.015*"overpay" + 0.015*"iii" + 0.015*"learn" + 0.015*"vice" + ' '0.015*"leader" + 0.015*"write"'), (6, '0.431*"company" + 0.308*"good" + 0.077*"product" + 0.015*"team" + ' '0.015*"ad" + 0.015*"depend" + 0.015*"effect" + 0.015*"consultant" + ' '0.015*"starting" + 0.015*"altering"'), (7, '0.473*"great" + 0.145*"benefit" + 0.073*"excellent" + 0.055*"product" + ' '0.055*"lot" + 0.018*"class" + 0.018*"customer" + 0.018*"real" + ' '0.018*"promo" + 0.018*"run"'), (8, '0.579*"good" + 0.053*"analyst" + 0.053*"business" + 0.035*"pgm" + ' '0.035*"year" + 0.018*"educator" + 0.018*"workplace" + 0.018*"accountant" + ' '0.018*"bad" + 0.018*"unclear"'), (9, '0.316*"google" + 0.088*"director" + 0.035*"phenomenal" + 0.035*"early" + ' '0.035*"fun" + 0.018*"fulfillment" + 0.018*"workload" + 0.018*"exact" + ' '0.018*"hire" + 0.018*"eager"'), (10, '0.437*"manager" + 0.155*"account" + 0.085*"project" + 0.042*"tech" + ' '0.042*"awesome" + 0.028*"start" + 0.014*"unethical" + 0.014*"make" + ' 
'0.014*"googlex" + 0.014*"goal"'), (11, '0.435*"software" + 0.065*"dream" + 0.065*"developer" + 0.043*"data" + ' '0.043*"meh" + 0.043*"rough" + 0.022*"expectation" + 0.022*"unique" + ' '0.022*"large" + 0.022*"senior"'), (12, '0.338*"work" + 0.118*"culture" + 0.074*"associate" + 0.074*"amazing" + ' '0.059*"experience" + 0.015*"realistic" + 0.015*"listen" + 0.015*"glassdoor" ' '+ 0.015*"undeniably" + 0.015*"workplace"'), (13, '0.348*"work" + 0.188*"intern" + 0.116*"analyst" + 0.058*"internship" + ' '0.058*"love" + 0.058*"engineering" + 0.014*"demand" + 0.014*"assignment" + ' '0.014*"learn" + 0.014*"gr"'), (14, '0.585*"work" + 0.057*"ambitious" + 0.038*"wonderful" + 0.038*"awesome" + ' '0.019*"add" + 0.019*"wlb" + 0.019*"surpass" + 0.019*"enjoy" + ' '0.019*"working" + 0.019*"brand"'), (15, '0.609*"engineer" + 0.078*"lead" + 0.062*"experience" + 0.047*"bad" + ' '0.016*"stuff" + 0.016*"teammate" + 0.016*"datacenter" + 0.016*"challang" + ' '0.016*"market" + 0.016*"practicum"'), (16, '0.479*"great" + 0.083*"career" + 0.062*"marketing" + 0.042*"lot" + ' '0.021*"brand" + 0.021*"worker" + 0.021*"employee" + 0.021*"org" + ' '0.021*"job" + 0.021*"meh"'), (17, '0.429*"software" + 0.238*"engineer" + 0.048*"designer" + 0.032*"fix" + ' '0.016*"competitive" + 0.016*"hrbp" + 0.016*"challenge" + 0.016*"boost" + ' '0.016*"reward" + 0.016*"founder"'), (18, '0.441*"great" + 0.118*"sale" + 0.044*"long" + 0.015*"geekland" + ' '0.015*"quality" + 0.015*"production" + 0.015*"shiny" + 0.015*"colleague" + ' '0.015*"stuff" + 0.015*"adsense"'), (19, '0.364*"place" + 0.045*"partner" + 0.045*"specialist" + 0.045*"perk" + ' '0.030*"love" + 0.030*"worklife" + 0.030*"fun" + 0.015*"issue" + ' '0.015*"store" + 0.015*"hardware"')]
By using our optimal LDA Mallet Model through Gensim's wrapper package, we display the 20 topics in our document, along with the top 10 keywords and their corresponding weights that make up each topic.
# Wordcloud of Top N words in each topic
from matplotlib import pyplot as plt
from wordcloud import WordCloud, STOPWORDS
import matplotlib.colors as mcolors
cols = [color for name, color in mcolors.TABLEAU_COLORS.items()]
cloud = WordCloud(stopwords=stop_words,
background_color='white',
width=2500,
height=1800,
max_words=10,
colormap='tab10',
color_func=lambda *args, **kwargs: cols[i],
prefer_horizontal=1.0)
topics = optimal_model.show_topics(formatted=False)
fig, axes = plt.subplots(2, 2, figsize=(10,10), sharex=True, sharey=True)
for i, ax in enumerate(axes.flatten()):
fig.add_subplot(ax)
topic_words = dict(topics[i][1])
cloud.generate_from_frequencies(topic_words, max_font_size=300)
plt.gca().imshow(cloud)
plt.gca().set_title('Topic ' + str(i), fontdict=dict(size=16))
plt.gca().axis('off')
plt.subplots_adjust(wspace=0, hspace=0)
plt.axis('off')
plt.margins(x=0, y=0)
plt.tight_layout()
plt.show()
Here we also visualize the first 4 of the topics in our document, along with their top 10 keywords. Each keyword's corresponding weight is shown by the size of its text.
Based on the visualization, we see the following topics:
Now that our Optimal Model is constructed, we will apply the model and determine the following:
def format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=data):
sent_topics_df = pd.DataFrame()
# Get dominant topic in each document
for i, row in enumerate(ldamodel[corpus]):
row = sorted(row, key=lambda x: (x[1]), reverse=True)
# Get the Dominant topic, Perc Contribution and Keywords for each document
for j, (topic_num, prop_topic) in enumerate(row):
if j == 0:
wp = ldamodel.show_topic(topic_num)
topic_keywords = ", ".join([word for word, prop in wp])
sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic,4),
topic_keywords]), ignore_index=True)
else:
break
sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'] # Create dataframe title
# Add original text to the end of the output (recall that texts = data_lemmatized)
contents = pd.Series(texts)
sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
return(sent_topics_df)
df_topic_sents_keywords = format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=data)
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Document']
df_dominant_topic.head(10)
Document_No | Dominant_Topic | Topic_Perc_Contrib | Keywords | Document | |
---|---|---|---|---|---|
0 | 0 | 6.0 | 0.0577 | company, good, product, team, ad, depend, effe... | Best Company to work for |
1 | 1 | 4.0 | 0.0798 | great, program, perk, time, staff, analytical,... | Moving at the speed of light burn out is inevi... |
2 | 2 | 10.0 | 0.0731 | manager, account, project, tech, awesome, star... | Great balance between bigcompany security and ... |
3 | 3 | 13.0 | 0.0648 | work, intern, analyst, internship, love, engin... | The best place Ive worked and also the most de... |
4 | 4 | 11.0 | 0.0833 | software, dream, developer, data, meh, rough, ... | Unique one of a kind dream job |
5 | 5 | 13.0 | 0.0723 | work, intern, analyst, internship, love, engin... | NICE working in GOOGLE as an INTERN |
6 | 6 | 15.0 | 0.0630 | engineer, lead, experience, bad, stuff, teamma... | Software engineer |
7 | 7 | 0.0 | 0.0628 | company, people, place, grad, challenge, compe... | great place to work and progress |
8 | 8 | 14.0 | 0.0660 | work, ambitious, wonderful, awesome, add, wlb,... | Google Surpasses Realistic Expectations |
9 | 9 | 15.0 | 0.0588 | engineer, lead, experience, bad, stuff, teamma... | Execellent for engineers |
Here we see the first 10 documents with their corresponding dominant topics attached.
# Find the most representative document for each of the 20 dominant topics
sent_topics_sorteddf_mallet = pd.DataFrame()
sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('Dominant_Topic')
for i, grp in sent_topics_outdf_grpd:
sent_topics_sorteddf_mallet = pd.concat([sent_topics_sorteddf_mallet,
grp.sort_values(['Perc_Contribution'], ascending=[0]).head(1)], axis=0)
sent_topics_sorteddf_mallet.reset_index(drop=True, inplace=True)
sent_topics_sorteddf_mallet.columns = ['Topic_Num', "Topic_Perc_Contrib", "Keywords", "Document"]
sent_topics_sorteddf_mallet
Topic_Num | Topic_Perc_Contrib | Keywords | Document | |
---|---|---|---|---|
0 | 0.0 | 0.0804 | company, people, place, grad, challenge, compe... | Company full of people running around caring o... |
1 | 1.0 | 0.0833 | company, cloud, strategist, environment, emplo... | I broke down crying on the datacenter floor |
2 | 2.0 | 0.0717 | review, amazing, senior, job, balance, great, ... | Amazing place to develop technical skills |
3 | 3.0 | 0.0744 | good, review, pay, environment, senior, outsta... | Good pay and work |
4 | 4.0 | 0.0807 | great, program, perk, time, staff, analytical,... | Average with a hint of arrogance |
5 | 5.0 | 0.0778 | place, perfect, technical, life, overpay, iii,... | Not perfect but still the best place in the wo... |
6 | 6.0 | 0.0702 | company, good, product, team, ad, depend, effe... | Best Company in the world |
7 | 7.0 | 0.0874 | great, benefit, excellent, product, lot, class... | Great benefits but large enough to get lost in |
8 | 8.0 | 0.0713 | good, analyst, business, pgm, year, educator, ... | Good company with good benefits lots of red ta... |
9 | 9.0 | 0.0828 | google, director, phenomenal, early, fun, fulf... | Early Childhood Educator |
10 | 10.0 | 0.0865 | manager, account, project, tech, awesome, star... | Project Manager |
11 | 11.0 | 0.0833 | software, dream, developer, data, meh, rough, ... | Unique one of a kind dream job |
12 | 12.0 | 0.0759 | work, culture, associate, amazing, experience,... | Massage Therapist |
13 | 13.0 | 0.0849 | work, intern, analyst, internship, love, engin... | Software Engineering Intern |
14 | 14.0 | 0.0723 | work, ambitious, wonderful, awesome, add, wlb,... | wonderful place to work |
15 | 15.0 | 0.0702 | engineer, lead, experience, bad, stuff, teamma... | Engineering Practicum Internship |
16 | 16.0 | 0.0751 | great, career, marketing, lot, brand, worker, ... | Google is great recruiting org not so much |
17 | 17.0 | 0.0798 | software, engineer, designer, fix, competitive... | Sr Interactive Designer Sr Solution Consultant |
18 | 18.0 | 0.0844 | great, sale, long, geekland, quality, producti... | Adsense Publisher |
19 | 19.0 | 0.0765 | place, partner, specialist, perk, love, workli... | Love working at Google in Boulder CO |
Here we see the most representative document for each of the 20 dominant topics.
# Number of Documents for Each Topic
topic_counts = df_topic_sents_keywords['Dominant_Topic'].value_counts()
topic_contribution = round(topic_counts/topic_counts.sum(), 4)
topic_num_keywords = {'Topic_Num': pd.Series([0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,
11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0])}
topic_num_keywords = pd.DataFrame(topic_num_keywords)
df_dominant_topics = pd.concat([topic_num_keywords, topic_counts, topic_contribution], axis=1)
df_dominant_topics.reset_index(drop=True, inplace=True)
df_dominant_topics.columns = ['Dominant Topic', 'Num_Document', 'Perc_Document']
df_dominant_topics
Dominant Topic | Num_Document | Perc_Document | |
---|---|---|---|
0 | 0.0 | 35 | 0.070 |
1 | 1.0 | 32 | 0.064 |
2 | 2.0 | 20 | 0.040 |
3 | 3.0 | 24 | 0.048 |
4 | 4.0 | 27 | 0.054 |
5 | 5.0 | 35 | 0.070 |
6 | 6.0 | 19 | 0.038 |
7 | 7.0 | 17 | 0.034 |
8 | 8.0 | 22 | 0.044 |
9 | 9.0 | 33 | 0.066 |
10 | 10.0 | 43 | 0.086 |
11 | 11.0 | 18 | 0.036 |
12 | 12.0 | 21 | 0.042 |
13 | 13.0 | 26 | 0.052 |
14 | 14.0 | 15 | 0.030 |
15 | 15.0 | 24 | 0.048 |
16 | 16.0 | 17 | 0.034 |
17 | 17.0 | 33 | 0.066 |
18 | 18.0 | 19 | 0.038 |
19 | 19.0 | 20 | 0.040 |
Here we see the number of documents and the percentage of overall documents that contribute to each of the 20 dominant topics.
Based on our modeling above, we were able to use a highly accurate model based on Gibbs Sampling, and further optimize it by finding the optimal number of dominant topics without redundancy.
As a result, we can now see the 20 dominant topics that were extracted from our dataset. Furthermore, we can see the dominant topic for each of the 500 documents, and determine the most representative document for each dominant topic.
With the in-depth analysis of individual topics and documents above, Employers can use this approach to learn the topics in their Employer Reviews and make appropriate adjustments to improve their work environment, which can ultimately improve Employee productivity and retention.