We approach the problem with standard libraries.
With more than six million reviews to work with, the computation space is massive. We will need a method to work with the data outside of memory space. Dask has seen remarkable development in recent years - we will be trying our hand with this.
Update 9/4/19: Even with Dask and a modern PC, it was not feasible to compute on a train data set covering half of the text. This has been reduced to about one million reviews, or 15 percent of the data set. These computations are still very complex, taking an hour in some situations.
What's in a review?
Steps to find review value
For this project, we define value as an entry containing any useful votes.
The three steps below will be the analyses we take to uncover value in reviews. They increase in complexity.
1. The basic bag of words and TfIdf approach is always a good start.
2. There are methods to score the readability of a document. We will analyze the correlation between usefulness and readability, and we will also consider whether the number of words in a review correlates with usefulness.
3. Do word embeddings provide any insight? We will use the gensim library to see if we can find value or trends here.
Do these metrics work in tandem? We will see if we can aggregate them and produce a classification model.
The 'val' and 'noval' prefixes in variable names indicate value (useful) and no value (not useful).
import matplotlib.pyplot as plt
import seaborn as sns
import yelp_tool  # All libraries within this module
# -----------------------------------------------------------------------------------------------------------#
# new_vec = load('../../../_Storage/Data/yelp_dataset/vec.joblib')
sns.set_style('darkgrid')
laptop = '/media/seapea/Blade HDD/_Storage/Data/yelp_dataset/'
Xy_train = yelp_tool.load('Xy_train.joblib')
Xy_test = yelp_tool.load('Xy_test.joblib')
%matplotlib inline
#------------------------------------------------------------------------------------------------------------#
print("Done!")
Throughout this project, I have found it very difficult to work with Dask. It is not covered in the course anywhere, despite being a more realistic tool when too much data exists to analyze things in memory. Arithmetic computations and other analyses across the Dask dataframe chunks pose challenges we haven't seen when operating only in memory.
Some basic issues encountered with Dask include:
.loc querying - single query returning many values, i.e.
.loc slicing - throws errors, i.e.
.loc - totally non-operational
.iloc slicing - distributed tables don't have an absolute index
spaCy pipelines (memory allocation)
Word2Vec conversions (memory allocation)
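One plausible cause of the first .loc issue can be sketched without Dask at all. The sketch below assumes (the notebook does not confirm this) that the data was read in chunks and that each partition kept its own zero-based index, so a single label matches one row per partition. The `partitions` data and the `loc` and `reindex` helpers are hypothetical stand-ins, not Dask API.

```python
# Hypothetical stand-in for a partitioned dataframe: each partition is a
# list of (index_label, text) rows, and every partition restarts at 0.
partitions = [
    [(0, "great tacos"), (1, "slow service")],
    [(0, "friendly staff"), (1, "too loud")],
    [(0, "would return")],
]

def loc(parts, label):
    """Label lookup across partitions: every row whose index equals label."""
    return [text for part in parts for idx, text in part if idx == label]

def reindex(parts):
    """Assign one globally unique, increasing index across all partitions."""
    out, offset = [], 0
    for part in parts:
        out.append([(offset + i, text) for i, (_, text) in enumerate(part)])
        offset += len(part)
    return out

matches = loc(partitions, 0)           # one hit per partition
unique = loc(reindex(partitions), 0)   # exactly one hit after reindexing
```

If this is what happened, giving the table a unique index before querying would make label lookups behave as they do in memory.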
Because of these limitations, some of the work and discovery efforts behind the scenes might not be displayed in detail. Furthermore, after working through a few basics, we elected to use a smaller text selection so that we could work within memory. In doing so, we are able to select about one million reviews each for test and train, using about 15 MB apiece. This is a far cry from the ~5.5 GB original file, but the move is absolutely necessary, as matrix computations have only now become feasible.
Nonetheless, Dask has excellent documentation, so we have worked through the basics and will use it again in the future.
In a previous project, I limited my vocabulary and text subset to one million words as spaCy throws an error when too large a vocabulary exists for use in memory. I have since discovered a lazy computation design for this exists within spaCy called the pipe. This allows spaCy to batch the documents and prevent breaking memory constraints. This has been a helpful discovery.
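The batching idea behind the pipe can be illustrated in plain Python. This is only a sketch of lazy, batched processing with generators; `batched` and `process` are hypothetical helpers and do not reproduce spaCy's internals.

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield lists of at most batch_size items without materializing the input."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def process(texts, batch_size=2):
    """Lazily yield one token count per text, working through one batch at a time."""
    for batch in batched(texts, batch_size):
        for text in batch:
            yield len(text.split())

# A generator of texts, so the full corpus is never held in memory at once.
reviews = (f"review number {i}" for i in range(5))
token_counts = list(process(reviews))
```

Only one batch of texts is alive at any moment, which is what keeps memory use bounded regardless of corpus size.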
While sparse matrices are efficient, some computations on them at the size of this corpus will still overload memory. We must keep this in mind with any BoW or TfIdf computations.
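A back-of-the-envelope estimate shows why this matters. The figures below assume one million documents, 300 features (the size used later in this notebook), 8-byte values, 4-byte indices, and an average of 20 non-zero features per document; that last number is purely an illustrative assumption, not a measurement.

```python
def dense_bytes(n_rows, n_cols, value_bytes=8):
    """Memory for a dense matrix: every cell is stored."""
    return n_rows * n_cols * value_bytes

def csr_bytes(n_rows, nnz, value_bytes=8, index_bytes=4):
    """Memory for a CSR matrix: values + column indices + row pointers."""
    return nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes

n_docs, n_features = 1_000_000, 300
avg_nonzero_per_doc = 20  # illustrative assumption, not measured

dense = dense_bytes(n_docs, n_features)
sparse = csr_bytes(n_docs, n_docs * avg_nonzero_per_doc)

print(f"dense:  {dense / 1e9:.2f} GB")
print(f"sparse: {sparse / 1e9:.2f} GB")
```

Any operation that densifies the matrix along the way pays the first cost rather than the second, and the gap widens quickly as the feature count grows.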
When working with some arrays - in one case, the LSA conversions - the kernel hangs permanently. I decided to work with much smaller subsets of the text (as noted above), yet this does not stop the computational overload. Problems like these have caused massive delays in finishing this project. One million records and only 15 is still nothing to take lightly.
More than once, the kernel, the shell, or Jupyter would hang and halt the entire process, clearing variables and stopping my workflow.
Testing the basics first
Let's try some basic Dask computations, verifying integrity and observing the distribution of review ratings.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16,9))
# Logarithms to transform the shape, adding 2 because there were some -1 reviews
sns.distplot(np.log((y_train + 2).compute()), ax=ax1)
sns.boxplot(y_train.compute(), ax=ax2)
ax1.set_title
<matplotlib.axes._subplots.AxesSubplot at 0x7f466ba29b38>
Below is another simple task to test the library - we will try the TfidfVectorizer.
# This was a quick look at the entire dataset's text features
TV = TfidfVectorizer(max_features=30)
# TV is already an instance, so call its methods directly rather than TV()
matrix_X = TV.fit_transform(Xy_train.text.apply(fix_nl))
print(TV.get_feature_names())
['best', 'came', 'chicken', 'come', 'definitely', 'did', 'didn', 'don', 'food', 'friendly', 'good', 'got', 'great', 'just', 'like', 'little', 'love', 'nice', 'order', 'ordered', 'people', 'place', 'really', 'restaurant', 'service', 'staff', 'time', 'try', 've', 'went']
Our vectorizer didn't overload the system!
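For reference, the weighting behind this step can be written out by hand. The sketch below uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is scikit-learn's documented default, and leaves out the L2 row normalization that TfidfVectorizer also applies by default. The three toy documents are made up for illustration.

```python
import math
from collections import Counter

docs = [
    "great food great service",
    "food was cold",
    "friendly staff",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(tokens):
    """Raw term count times idf for each term in one document (no normalization)."""
    counts = Counter(tokens)
    return {term: count * idf(term) for term, count in counts.items()}

weights = [tfidf(tokens) for tokens in tokenized]
```

Terms that appear in many documents (here "food") receive a lower idf than terms confined to one document, which is why very common words contribute less after weighting.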
We will try to run another basic vectorizer, this time with some preprocessing: lemmatization should increase the diversity of words selected as features. The goal here is to verify the pipeline functionality.
# This cell will test the steps of my approach against the much smaller dataframe
df.text = df.text.apply(fix_nl)
new_txt = list(nlp.pipe(df.iloc[:5000].text))
cvec = CountVectorizer(stop_words='english', max_features=100)
cvec.fit_transform([lemma_sent(txt) for txt in new_txt])
cvec.transform([lemma_sent(txt) for txt in list(nlp.pipe(df.iloc[5000:6000].text))])
<5000x100 sparse matrix of type '<class 'numpy.int64'>' with 52872 stored elements in Compressed Sparse Row format>
While testing the pipeline function above, we are able to verify that transform even works against new data! We will now attempt the same process against the trained features from the entire 'rev' dataset.
Note: This vectorizer is based on half the text, an approach I later abandoned due to the computation times and memory errors (see the introduction notes). I did save it and use it, as it captures the best feature representation of a larger training set.
myX = new_vec.transform(Xy_train.text.apply(fix_nl))
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(16,9), sharey=True, sharex=True)
myInd1 = Xy_train[Xy_train.useful == True].index
myInd0 = Xy_train[Xy_train.useful == False].index
ax1.bar(range(300), np.array(myX[myInd1].sum(axis=0)).tolist(), width=1)
ax2.bar(range(300), np.array(myX[myInd0].sum(axis=0)).tolist(), width=1)
ax1.set_title('Useful reviews counted against 300 features from Bag of Words', size=16)
ax2.set_title('Not useful reviews counted against 300 features from Bag of Words', size=16)
Text(0.5, 1.0, 'Not useful reviews counted against 300 features from Bag of Words')
Visually, the useful and not useful reviews look strikingly similar: the peaks and valleys appear, on first impression, to be consistent. The standout features, however, have significantly different values per y outcome. The class imbalance is only about six percent, so this is a welcome difference that might help our models.
2. Readability and review length
Another strong contender
Will either correlate with review usefulness?
spaCy is an excellent library. An optional pipeline component, Readability, will calculate the readability score. After a little research, I selected the Flesch-Kincaid reading ease value. This seems to have stood the test of time, accepted and modified since its rise to popularity in the 1970s.
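For context, the Flesch reading ease score is computed as 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words). The sketch below applies that formula with a deliberately naive syllable counter (counting vowel groups), which is an assumption made for illustration; the spaCy component used here likely counts syllables differently, so its scores will not match exactly.

```python
import re

def count_syllables(word):
    """Naive estimate: number of vowel groups, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch reading ease; higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

score = flesch_reading_ease("The cat sat. The dog ran.")
```

The score has no lower bound: a review written as one very long sentence is pushed far below zero, which may explain the extreme negative values that are filtered out before plotting in this section.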
# val_doc = nlp.pipe(Xy_train[Xy_train.useful == True].text)
# noval_doc = nlp.pipe(Xy_train[Xy_train.useful == False].text)
# val_read_len = [(txt._.flesch_kincaid_reading_ease, len(txt)) for txt in val_doc]
# noval_read_len = [(txt._.flesch_kincaid_reading_ease, len(txt)) for txt in noval_doc]
val_read_len = load('val_read_len.joblib')
noval_read_len = load('noval_read_len.joblib')
val_read_len = np.array(val_read_len)
noval_read_len = np.array(noval_read_len)
A basic scatterplot can give us insight about the feature relationship, if any.
# Column 0 holds the reading ease score and column 1 the review length;
# column 2 is added here to flag usefulness
new1 = pd.DataFrame(val_read_len)
new1[2] = True
new2 = pd.DataFrame(noval_read_len)
new2[2] = False
frame = pd.concat([new1, new2])
other = frame[frame[0] > -200]  # cut the extreme readability outliers
fig, ax = plt.subplots(figsize=(16,9))
ax.set_title('Readability rating versus review length, on scatterplot', size=16)
ax.set_xlabel('Flesch-Kincaid reading ease rating', size=14)
ax.set_ylabel('Review length in characters', size=14)
sns.scatterplot(x=other[0], y=other[1], hue=other[2], ax=ax, alpha=.01)
plt.show()
Based on this scatterplot, we can deduce that a relationship exists. I have cut the extreme outliers for visualization's sake, and we can now see the less useful reviews tend to sink in this plot. Good to know.
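One way to put a number on such a relationship is a Pearson correlation between the binary usefulness flag and a numeric feature (this special case is known as the point-biserial correlation). The sketch below is self-contained and runs on made-up toy values, not on the Yelp data; `pearson` is a hypothetical helper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy values for illustration only: 1 = useful, 0 = not useful.
useful = [1, 1, 1, 0, 0, 0]
length = [900, 750, 600, 300, 450, 200]

r = pearson(useful, length)
```

A value near zero would suggest the visual pattern is weaker than it looks, while a clearly positive value would support using the feature in a classifier.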
Once I learned a little more about Dask and its limitations, and essentially removed it from my workflow as described so far, I saw massive improvements in computation speed and progress (obviously!). I considered increasing the sample sizes of train and test after my successes, but this calculation of readability, even with an efficient pipeline, still took more than an hour. We are sticking with one million samples!
3. Embeddings to create features
The most complex task
Doc2Vec: How to handle the large corpus
After some trial, error, and research, it is my conclusion that the model computation will not differ between one large chunk and several smaller ones - one million documents is still vast. We would have to train and then update the document vector model.
This information comes from a blog written by the library's author - Radim Rehurek - discussing multiprocessing for faster running times. If the content can be processed in parallel, it does not require information from the entire corpus and therefore can be run in pieces.
We will verify if this can be done below. Within gensim, we can access the document and word vectors after training updates to see if the model updates.
vec1, tags1 = tag(ddf).compute()
test_arr = vec1.docvecs
vec1.docvecs.most_similar(0, topn=5)
[(17033, 0.38854920864105225), (23295, 0.3664354085922241), (12518, 0.3616259694099426), (12915, 0.3600383400917053), (14743, 0.35090503096580505)]
vec2, tags2 = tag(rev).compute()
vec1.train(tags2, len(tags1) + len(tags2), epochs=5)
test_arr1 = vec1.docvecs
vec1.docvecs.most_similar(0, topn=5)
[(17033, 0.38854920864105225), (23295, 0.3664354085922241), (12518, 0.3616259694099426), (12915, 0.3600383400917053), (14743, 0.35090503096580505)]
# Every single value in array is equal!!!
sum(test_arr != test_arr1)
It appears the document vectors within the model are not returning different results when compared with the most_similar method. I did see this somewhere but wanted to verify. We can verify the document vectors themselves are not different, as shown when comparing the test arrays. This is further reinforced via Google queries and documentation, showing Doc2Vec does not allow the same training update approach that word vectors allow. We will have to work with the embeddings differently.
We also tried the vanilla LSI approach with weak results - displayed below. We will return to other gensim options as this is not promising.
lsa_df = []
for corp, use in zip(mycorp, Xy_train.useful):
    # keep the topic weight from each (topic_id, weight) pair
    lsa_df.append([ele[1] for ele in model[corp]])
clf_gbc = yelp_tool.GradientBoostingClassifier()
clf_gbc.fit(X=df.drop(columns='useful'), y=df.useful)

def tester(n):
    # Score the classifier on a random sample of n test reviews
    df = Xy_test.sample(n)
    mycorp_t = [mydct.doc2bow(yelp_tool.fix_nl(txt).split()) for txt in df.text]
    lsa_df = []
    for corp in mycorp_t:
        lsa_df.append([ele[1] for ele in model[corp]])
    try:
        return clf_gbc.score(X=yelp_tool.pd.DataFrame(lsa_df), y=df.useful)
    except:
        return yelp_tool.np.NAN

results = []
for i in range(500):
    results.append(tester(50))
yelp_tool.pd.Series(results).describe()
count    292.000000
mean       0.625548
std        0.069801
min        0.400000
25%        0.580000
50%        0.620000
75%        0.660000
max        0.800000
dtype: float64
If we dig deeper, we can actually use a different similarity computation behind the scenes. Within the library, I discovered there are faster ways to model the text with Latent Semantic Indexing, then put text into the reduced space and calculate vectors. The heart of this is the 'Similarity' object.
- Create gensim Dictionary
- Create corpus of tuples
- Create LSI model consisting of each input
- Query the index with tuple doc2bow object and fit it into the vector space
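The last step, querying the index, comes down to comparing the query vector with every indexed document vector; gensim's similarity classes use cosine similarity for this. The sketch below shows that comparison in plain Python on made-up two-dimensional vectors. `cosine` and `query_index` are hypothetical helpers, not gensim functions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def query_index(index, query, topn=2):
    """Return (doc_id, similarity) pairs for the topn most similar documents."""
    scored = [(doc_id, cosine(vec, query)) for doc_id, vec in enumerate(index)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up document vectors in a reduced two-topic space.
index = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
best = query_index(index, query=[1.0, 0.0])
```

Because cosine similarity ignores vector length, a short review and a long review about the same topics can still score as close neighbors.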
As we have seen in the past, manually graphing LSA in two dimensions is not very visually informative. The major reason is that the mathematics restricts the values: LSA does not shift and recenter the data, as PCA would. This weakness is demonstrated below; the distribution resembles a cone.
# sklearn tool faster than [fix_nl(txt).split] for txt in rev.text.compute()])
# mydct variable saved for future
mydct = Dictionary([new_vec.get_feature_names()])
mycorp = [mydct.doc2bow(fix_nl(txt).split()) for txt in X_train.compute()]
mylsi_2 = LsiModel(mycorp, id2word=mydct, num_topics=2)
LSA2 = [mylsi_2[bow] for bow in mycorp]
x = []
y = []
hue = []
for val, h in zip(LSA2, (y_train > 0).compute()):
    # Each val is a list of (topic_id, weight) pairs; skip documents missing either topic
    if len(val) < 2:
        continue
    x.append(val[0][1])
    y.append(val[1][1])
    hue.append(h)
fig, ax = plt.subplots(figsize=(16,9))
sns.scatterplot(x=x, y=y, hue=hue, alpha=0.05)
ax.set_title('LSA representation of text', size=16)
Text(0.5, 1.0, 'LSA representation of text')
As I have seen in other projects with text, LSI tends to be limited in visual reproduction.
Let's save 20 LSI features (it isn't much, but higher numbers have created computational obstacles), and then pump those into the TSNE object of scikit-learn. This is generally a better visualization method when many dimensions are involved. Maybe that will show us whether we are on the right track.
mydct = load('mydct.joblib')
mycorp = [mydct.doc2bow(fix_nl(txt).split()) for txt in X_train.compute()]
mylsi_20 = LsiModel(mycorp, id2word=mydct, num_topics=50)
LSA20 = [mylsi_20[bow] for bow in mycorp]
LSA20_array = []
for i, val in enumerate(LSA20):
    try:
        LSA20_array.append(np.asarray(LSA20[i])[:,1])
    except:
        pass
tsne = TSNE()
transformed = tsne.fit_transform(X=np.asarray(LSA20_array))
fig, ax = plt.subplots(figsize=(16,9))
sns.scatterplot(data=transformed, hue=(y_train > 0).compute(), alpha=0.003)
ax.set_title('TSNE transformation of train data', size=16)
Text(0.5, 1.0, 'TSNE transformation of train data')