Yelp data challenge - text

Can we predict "useful" review labels?

If companies and Yelp could identify the strongest reviewers, they could promote better-quality reviews.

The Yelp dataset contains more than 6.5 million reviews. This project aims to clean the text and crunch the numbers to see if useful labels are predictable.

Approach

We approach the problem with standard libraries.

  • numpy
  • pandas
  • scikit-learn
  • spaCy

With more than six million reviews to work with, the computational load is massive, and we will need a way to work with data that does not fit in memory. Dask has seen remarkable development in recent years, so we will try our hand at it.

Update 9/4/19: Even with Dask and a modern PC, it was not feasible to compute on a training set covering half of the text. The training set has been reduced to about one million reviews, roughly 15 percent of the data set. Even these computations are demanding, taking an hour in some cases.



What's in a review?

Steps to find review value

For this project, we define value as a review having received any useful votes.
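As a toy illustration of that labeling (the dataframe here is made up; the Yelp review file carries an integer count of useful votes):

import pandas as pd

# Any useful votes -> the review is considered valuable
rev = pd.DataFrame({'text': ['Great tacos!', 'Meh.'], 'useful': [3, 0]})
rev['useful_label'] = rev['useful'] > 0
print(rev)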

The three analyses below are the steps we will take to uncover value in reviews; they increase in complexity.

1. Consider the count of words

The basic bag of words and TfIdf approach is always a good start.

2. Readability and length of review

There are established methods for scoring the readability of a document. We will dig into the correlation between usefulness and readability, and also consider whether the length of a review correlates with usefulness.

3. Embeddings

Do word embeddings provide any insight? We will use the gensim library to see if we can find value or trends here.

Combining the metrics

Do these metrics work in tandem? We will see if we can aggregate them and produce a classification model.
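As a rough, self-contained sketch of what that aggregation could look like (toy data and a generic classifier; the real project uses the Yelp reviews and its own models):

import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["great food and friendly staff", "terrible service, will not return",
         "decent place", "loved the chicken, would definitely come back"]
readability = np.array([[70.1], [65.4], [80.2], [72.8]])   # e.g. reading-ease scores
lengths = np.array([[len(t)] for t in texts])
y = [1, 1, 0, 1]                                           # toy "useful" labels

# Stack the text features side by side with the numeric features
bow = TfidfVectorizer().fit_transform(texts)
X = hstack([bow, csr_matrix(np.hstack([readability, lengths]))])

clf = LogisticRegression().fit(X, y)
print(clf.predict(X))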

Establishing functions for project

For clarity and workflow, let's store the project's helper functions at the top of the notebook - just like the imports. These include (a hypothetical sketch of the first two follows the list):

  1. Fix natural language - basic text cleaning on delayed object
  2. Lemmatize sentence - return a string of lemmas
  3. Small frame - this project has enormous amounts of text; this provides a smaller dataframe
  4. Variables - load regular variables into workspace
  5. Tagger - preprocesses data into tagged documents for gensim
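The real implementations live in yelp_tool; the sketch below is only a hypothetical version of the first two helpers, assuming spaCy's small English model:

import re
import spacy

def fix_nl(text):
    """Basic text cleaning: lowercase, drop newlines, punctuation, and digits."""
    text = text.lower().replace('\n', ' ')
    text = re.sub(r'[^a-z\s]', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

def lemma_sent(doc):
    """Return a single string of lemmas from a spaCy Doc."""
    return ' '.join(tok.lemma_ for tok in doc)

nlp = spacy.load('en_core_web_sm')
print(lemma_sent(nlp(fix_nl("The fries were amazing!\nWe'll be back."))))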

Note: prefixes and following along

The 'val' and 'noval' prefixes indicate whether the reviews carry value (useful votes) or no value.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from joblib import load
import yelp_tool  # All remaining libraries live within this module
# -----------------------------------------------------------------------------------------------------------#
# new_vec = load('../../../_Storage/Data/yelp_dataset/vec.joblib')
sns.set_style('darkgrid')
laptop = '/media/seapea/Blade HDD/_Storage/Data/yelp_dataset/'
Xy_train = yelp_tool.load('Xy_train.joblib')
Xy_test = yelp_tool.load('Xy_test.joblib')
%matplotlib inline
#------------------------------------------------------------------------------------------------------------#
print("Done!")
Done!

Dask - distributing workers

Challenges in a distributed work environment

Dask is a library for handling larger-than-memory datasets: it splits a dataframe into partitions and computes on them lazily, so only part of the data needs to sit in memory at once.
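As a rough illustration (the file path and block size here are hypothetical), the raw review file could be loaded lazily like this:

import dask.dataframe as dd

# Yelp's review file is line-delimited JSON, so it can be split into blocks
ddf = dd.read_json('yelp_dataset/review.json', lines=True, blocksize=2**28)

# Nothing is read yet; work happens partition by partition when .compute() is called
print(ddf['useful'].mean().compute())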

Throughout this project, I have found it very difficult to work with Dask. It is not covered anywhere in the course, despite being a more realistic tool when there is too much data to analyze in memory. Arithmetic and other computations across Dask dataframe partitions pose challenges we never see when operating purely in memory.

Some basic issues encountered with Dask include:

  • Boolean indexing
  • .loc querying - a single query returning many values, e.g. X.loc[0]
  • .loc slicing - throws errors, e.g. X.loc[:50]
  • Series .loc - totally non-operational
  • .iloc slicing - distributed tables don't have an absolute positional index
  • spaCy pipelines (memory allocation)
  • Word2Vec conversions (memory allocation)
  • Basic computations (very long turnaround, typically 20 minutes and up to several hours)
  • Limited train / test split options

Because of these limitations, some of the work and discovery behind the scenes may not be displayed in detail. Furthermore, after working through a few basics, we elected to use a smaller text selection so that we can work within memory. In doing so, we select about one million reviews each for train and test, at roughly 15 MB apiece. This is a far cry from the ~5.5 GB original file, but the move is absolutely necessary, as the matrix computations have only now become feasible.

Nonetheless, Dask has excellent documentation, so we have worked through the basics and will use it again in the future.

Other efficiency improvements: memory

Previous attempts at spaCy

In a previous project, I limited my vocabulary and text subset to one million words, as spaCy throws an error when too much text is pushed through it in memory at once. I have since discovered that spaCy includes a lazy, streaming design for exactly this: nlp.pipe. It batches the documents so that processing stays within memory constraints. This has been a helpful discovery.
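A minimal sketch of that streaming pattern (the model name and batch size are illustrative, and the generator stands in for the real review texts):

import spacy

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

texts = ("toy review text number %d" % i for i in range(10_000))  # stand-in for real reviews
docs = nlp.pipe(texts, batch_size=500)    # generator: documents are produced batch by batch

lemma_strings = [' '.join(tok.lemma_ for tok in doc) for doc in docs]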

Sparse matrices

While sparse matrices are efficient, some computations on them at the size of this corpus will still overload memory. We must keep this in mind with any BoW or TfIdf computations.
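A toy illustration of the scale problem (the shape and density are arbitrary):

from scipy import sparse

X = sparse.random(100_000, 30_000, density=0.001, format='csr', random_state=0)

sparse_mb = (X.data.nbytes + X.indices.nbytes + X.indptr.nbytes) / 1e6
dense_gb = X.shape[0] * X.shape[1] * 8 / 1e9
print(f"sparse storage: ~{sparse_mb:.0f} MB, dense equivalent: ~{dense_gb:.0f} GB")

col_sums = X.sum(axis=0)   # fine: stays sparse-aware
# X.toarray()              # would try to allocate the full dense array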

Persistent issues: computation times

Oversized arrays

When working with some arrays - in one case, the LSA conversions - the kernel hangs permanently. I have decided to work with much smaller subsets of the text (as noted above), yet this does not stop the computational overload. Problems like these have caused massive delays in finishing this project. One million records, only 15 percent of the data set, is still nothing to take lightly.

Jupyter and kernel hang

More than once, the kernel, the shell, or Jupyter would hang and halt the entire process, clearing variables and stopping my workflow.



Testing the basics first

Distribution of review ratings

Let's try some basic Dask computations, verifying integrity and observing the distribution of review ratings.

In [36]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16,9))
# Log transform to reshape the distribution; adding two because some values were -1
sns.distplot(np.log((y_train + 2).compute()), ax=ax1)
sns.boxplot(y_train.compute(), ax=ax2)
ax1.set_title('Log-transformed distribution of review ratings')
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f466ba29b38>

Approaching parsing of vast text

Below is another simple task to test the library - we will try the TfidfVectorizer.

In [7]:
# This was a quick look at the entire dataset's text features
TV = TfidfVectorizer(max_features=30)
matrix_X = TV.fit_transform(Xy_train.text.apply(fix_nl))
print(TV.get_feature_names())
['best', 'came', 'chicken', 'come', 'definitely', 'did', 'didn', 'don', 'food', 'friendly', 'good', 'got', 'great', 'just', 'like', 'little', 'love', 'nice', 'order', 'ordered', 'people', 'place', 'really', 'restaurant', 'service', 'staff', 'time', 'try', 've', 'went']

Success!

Our vectorizer didn't overload the system!

Crunching spaCy pipeline design

We will try to run another basic vectorizer, this time with some preprocessing - lemmatization collapses inflected forms, so the selected features should represent a more diverse set of words. The goal here is to verify the pipeline functionality.

In [412]:
# This cell will test the steps of my approach against the much smaller dataframe
df.text = df.text.apply(fix_nl)
new_txt = list(nlp.pipe(df.iloc[:5000].text))
cvec = CountVectorizer(stop_words='english', max_features=100)
cvec.fit_transform([lemma_sent(txt) for txt in new_txt])
cvec.transform([lemma_sent(txt) for txt in list(nlp.pipe(df.iloc[5000:6000].text))])
Out[412]:
<5000x100 sparse matrix of type '<class 'numpy.int64'>'
	with 52872 stored elements in Compressed Sparse Row format>


1. Bag of words

The standard starting line

Vectorized text transformation

While testing the pipeline function above, we are able to verify fit_transform functionality, and transform even works against new data! We will now attempt the same process against the trained features from the entire 'rev' dataset.

Note: This vectorizer is based on half the text, an approach I later abandoned due to the computation times and memory errors (see the introduction notes). I did save it and use it, as it captures the best feature representation of a larger training set.

In [200]:
myX = new_vec.transform(Xy_train.text.apply(fix_nl))

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(16,9), sharey=True, sharex=True)
myInd1 = Xy_train[Xy_train.useful == True].index
myInd0 = Xy_train[Xy_train.useful == False].index

ax1.bar(range(300), np.array(myX[myInd1].sum(axis=0)).tolist()[0], width=1)
ax2.bar(range(300), np.array(myX[myInd0].sum(axis=0)).tolist()[0], width=1)
ax1.set_title('Useful reviews counted against 300 features from Bag of Words', size=16)
ax2.set_title('Not useful reviews counted against 300 features from Bag of Words', size=16)
Out[200]:
Text(0.5, 1.0, 'Not useful reviews counted against 300 features from Bag of Words')

Visual differences

Visually, the useful and not-useful reviews look strikingly similar: the peaks and valleys appear consistent at first glance. However, the standout features do take significantly different values between the two outcomes. Since the class imbalance is only about six percent, this is a welcome difference that might help our models.



2. Readability and review length

Another strong contender

Will either correlate with review usefulness?

spaCy is an excellent library, and an optional pipeline component - Readability - will calculate a readability score for each document. After a little research, I selected the Flesch-Kincaid reading ease value. It has stood the test of time, accepted and refined since its rise to popularity in the 1970s.
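For reference, the Flesch reading-ease formula is 206.835 - 1.015*(words per sentence) - 84.6*(syllables per word); higher scores indicate easier text. Below is a minimal sketch of attaching the component, assuming the third-party spacy_readability package and the spaCy 2.x pipeline API used at the time:

import spacy
from spacy_readability import Readability

# Attach the readability component to the end of the pipeline (spaCy 2.x style)
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(Readability(), last=True)

doc = nlp("The staff was friendly and the food came out quickly.")
print(doc._.flesch_kincaid_reading_ease)  # higher = easier to read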

In [201]:
# val_doc = nlp.pipe(Xy_train[Xy_train.useful == True].text)
# noval_doc = nlp.pipe(Xy_train[Xy_train.useful == False].text)
# val_read_len = [(txt._.flesch_kincaid_reading_ease, len(txt)) for txt in val_doc]
# noval_read_len = [(txt._.flesch_kincaid_reading_ease, len(txt)) for txt in noval_doc]

val_read_len = load('val_read_len.joblib')
noval_read_len = load('noval_read_len.joblib')

val_read_len = np.array(val_read_len)
noval_read_len = np.array(noval_read_len)

Visualizing the two features

A basic scatterplot can give us insight about the feature relationship, if any.

In [202]:
new1 = pd.DataFrame(val_read_len)
new1[2] = True
new2 = pd.DataFrame(noval_read_len)
new2[2] = False
frame = pd.concat([new1, new2])
other = frame[frame[0] > -200]
fig, ax = plt.subplots(figsize=(16,9))
ax.set_title('Readability rating versus review length, on scatterplot', size=16)
ax.set_xlabel('Flesch-Kincaid reading ease rating', size=14)
ax.set_ylabel('Review length in tokens', size=14)
sns.scatterplot(x=other[0], y=other[1], hue=other[2], ax=ax, alpha=.01)
plt.show()

A trend is apparent

Based on this scatterplot, a relationship does appear to exist. I have cut the extreme outliers for visualization's sake, and we can now see that the less useful reviews tend to sit lower in the plot - they are shorter. Good to know.

Note: Thankfully, we haven't increased our corpus

Once I learned a little more about Dask and its limitations and largely removed it from my workflow as described above, I saw massive improvements in computation time and progress (obviously!). I considered increasing the train and test sample sizes after these successes, but the readability calculation, even with an efficient pipeline, still took more than an hour. We are sticking with one million samples!



3. Embeddings to create features

The most complex task

Doc2Vec: How to handle the large corpus

After some trial, error, and research, my conclusion is that the computation will not differ much between one large corpus and several chunks - one million documents is still vast either way. To work in chunks, we would have to train the document vector model and then update it.

This information comes from a blog post written by the library's author, Radim Rehurek, discussing multiprocessing for faster running times: if the content can be processed in parallel, it does not require information from the entire corpus at once and can therefore be run in pieces.
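For orientation, a minimal sketch of training gensim's Doc2Vec on one chunk looks roughly like the following (the toy texts and hyperparameters are illustrative; the project's tagger helper builds the real TaggedDocuments):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = ["great food friendly staff", "slow service cold fries", "loved the tacos"]
tagged = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(texts)]

# Build the vocabulary once, then train for a fixed number of epochs
model = Doc2Vec(vector_size=50, min_count=1, workers=4, epochs=20)
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

print(model.docvecs.most_similar(0, topn=2))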

Verifying training updates

We will verify below whether this can be done. Within gensim, we can inspect the document vectors before and after a training update to see whether the model actually changes.

In [27]:
vec1, tags1 = tag(ddf).compute()
test_arr = vec1.docvecs[0]
vec1.docvecs.most_similar(0, topn=5)
Out[27]:
[(17033, 0.38854920864105225),
 (23295, 0.3664354085922241),
 (12518, 0.3616259694099426),
 (12915, 0.3600383400917053),
 (14743, 0.35090503096580505)]
In [28]:
vec2, tags2 = tag(rev).compute()
vec1.train(tags2, total_examples=len(tags1) + len(tags2), epochs=5)
test_arr1 = vec1.docvecs[0]
vec1.docvecs.most_similar(0, topn=5)
Out[28]:
[(17033, 0.38854920864105225),
 (23295, 0.3664354085922241),
 (12518, 0.3616259694099426),
 (12915, 0.3600383400917053),
 (14743, 0.35090503096580505)]
In [31]:
# Every single value in array is equal!!!
sum(test_arr != test_arr1)
Out[31]:
0

Basic retraining failed

It appears the most_similar results do not change after the training update, and comparing the saved test arrays confirms the document vectors themselves are unchanged. I had seen this mentioned elsewhere but wanted to verify it. Searches and the documentation further reinforce that Doc2Vec does not support the same incremental training-update approach that word vectors allow. We will have to work with the embeddings differently.

gensim LsiModel

We also tried the vanilla LSI approach with weak results - displayed below. We will return to other gensim options as this is not promising.

In [4]:
lsa_df = []
useful = []
for corp, use in zip(mycorp, Xy_train.useful):
    lsa_df.append([ele[1] for ele in model[corp]])
    useful.append(use)

# Assemble the LSI features and aligned labels into one frame for the classifier
df = yelp_tool.pd.DataFrame(lsa_df)
df['useful'] = useful

clf_gbc = yelp_tool.GradientBoostingClassifier()
clf_gbc.fit(X=df.drop(columns='useful'), y=df.useful)

def tester(n):
    df = Xy_test.sample(n)
    mycorp_t = [mydct.doc2bow(yelp_tool.fix_nl(txt).split()) for txt in df.text]

    lsa_df = []
    for corp in mycorp_t:
        lsa_df.append([ele[1] for ele in model[corp]])
        
    try:    
        return clf_gbc.score(X=yelp_tool.pd.DataFrame(lsa_df), y=df.useful)
    except:
        return yelp_tool.np.NAN

results = []
for i in range(500):
    results.append(tester(50))

yelp_tool.pd.Series(results).describe()
Out[4]:
count    292.000000
mean       0.625548
std        0.069801
min        0.400000
25%        0.580000
50%        0.620000
75%        0.660000
max        0.800000
dtype: float64

Similarity in gensim

If we dig deeper, we can actually use a different similarity computation behind the scenes. Within the library, I discovered faster ways to model the text with Latent Semantic Indexing, project text into the reduced space, and calculate similarity vectors. The heart of this is the 'Similarity' object.

To use it properly, we follow these steps:

  1. Create a gensim Dictionary
  2. Create a corpus of (id, count) tuples with doc2bow
  3. Create an LSI model over the corpus
  4. Query the index with a doc2bow representation projected into the LSI vector space
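A minimal sketch of those four steps with toy documents (the texts, num_topics, and shard prefix are illustrative):

from gensim import corpora, models, similarities

texts = [t.split() for t in ["great food friendly staff",
                             "slow service cold fries",
                             "loved the tacos and the staff"]]

dictionary = corpora.Dictionary(texts)                            # 1. gensim Dictionary
corpus = [dictionary.doc2bow(t) for t in texts]                   # 2. corpus of (id, count) tuples
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)   # 3. LSI model over the corpus

# The Similarity object shards its index to disk, so it scales past memory
index = similarities.Similarity('lsi_shard', lsi[corpus], num_features=2)

query = dictionary.doc2bow("friendly staff great tacos".split())
print(index[lsi[query]])                                          # 4. project the query and score it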

Manually graphing

As we have seen in the past, manually plotting LSA in two dimensions is not very visually informative. The major reason is that the mathematics restricts the values it can produce: unlike PCA, it does not recenter the data before projecting, so the points are squeezed into a narrow region. This weakness is demonstrated below; the distribution resembles a cone.

In [35]:
# Building the Dictionary from the sklearn vectorizer's vocabulary is faster than
# tokenizing the whole corpus, i.e. [fix_nl(txt).split() for txt in rev.text.compute()]
# mydct variable saved for future use
mydct = Dictionary([new_vec.get_feature_names()])
mycorp = [mydct.doc2bow(fix_nl(txt).split()) for txt in X_train.compute()]
mylsi_2 = LsiModel(mycorp, id2word=mydct, num_topics=2)
LSA2 = [mylsi_2[bow] for bow in mycorp]

x = []
y = []
hue = []

for val, h in zip(LSA2, (y_train > 0).compute()):
    try:
        x.append(val[0][1])
        y.append(val[1][1])
        hue.append(h)
    except:
        pass

fig, ax = plt.subplots(figsize=(16,9))
sns.scatterplot(x=x, y=y, hue=hue, alpha=0.05)
ax.set_title('LSA representation of text', size=16)
Out[35]:
Text(0.5, 1.0, 'LSA representation of text')

LSI Cone

As I have seen in other text projects, LSI tends to be limited as a visualization technique.

20 LSI features

Let's save 20 LSI features (not many, but higher numbers have created computational obstacles) and feed those into scikit-learn's TSNE. t-SNE is generally a better visualization method for high-dimensional data. Perhaps that will tell us whether we are on the right track.

In [47]:
mydct = load('mydct.joblib')
mycorp = [mydct.doc2bow(fix_nl(txt).split()) for txt in X_train.compute()]
mylsi_20 = LsiModel(mycorp, id2word=mydct, num_topics=20)
LSA20 = [mylsi_20[bow] for bow in mycorp]

# Keep the topic weights and the matching labels, skipping documents with no topics
LSA20_array = []
hue = []

for val, h in zip(LSA20, (y_train > 0).compute()):
    try:
        LSA20_array.append(np.asarray(val)[:, 1])
        hue.append(h)
    except Exception:
        pass

tsne = TSNE()
transformed = tsne.fit_transform(X=np.asarray(LSA20_array))

fig, ax = plt.subplots(figsize=(16,9))
sns.scatterplot(x=transformed[:, 0], y=transformed[:, 1], hue=hue, alpha=0.003, ax=ax)
ax.set_title('TSNE transformation of train data', size=16)
Out[47]:
Text(0.5, 1.0, 'TSNE transformation of train data')