mlcourse.ai – Open Machine Learning Course

Individual Project

Predicting Wine Expert Rating

Author: Maxim Klyuchnikov


In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from scipy import stats
import plotly.offline as py
import warnings
import pycountry
from statsmodels.graphics.gofplots import qqplot
from wordcloud import WordCloud, STOPWORDS

warnings.filterwarnings('ignore')

import re
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from scipy.sparse import csr_matrix, hstack
from yellowbrick.model_selection import ValidationCurve, LearningCurve

py.init_notebook_mode(connected=True)
import plotly.graph_objs as go

RANDOM_SEED=17

Project Description

Dataset

The data is taken from the Kaggle dataset https://www.kaggle.com/zynicide/wine-reviews/home, which in turn was scraped by the dataset's author from https://www.winemag.com/
There are a lot of reviews from different experts for wines from all over the world. Some wine-specific information is also provided as part of the dataset.
The dataset consists of the following fields (per info from https://github.com/zackthoutt/wine-deep-learning):

Features

  • Points: the number of points WineEnthusiast rated the wine on a scale of 1-100 (though they say they only post reviews for wines that score >=80)
  • Title: the title of the wine review, which often contains the vintage if you're interested in extracting that feature
  • Variety: the type of grapes used to make the wine (e.g. Pinot Noir)
  • Description: a few sentences from a sommelier describing the wine's taste, smell, look, feel, etc.
  • Country: the country that the wine is from
  • Province: the province or state that the wine is from
  • Region 1: the wine growing area in a province or state (e.g. Napa)
  • Region 2: sometimes there are more specific regions specified within a wine growing area (e.g. Rutherford inside the Napa Valley), but this value can sometimes be blank
  • Winery: the winery that made the wine
  • Designation: the vineyard within the winery where the grapes that made the wine are from
  • Price: the cost for a bottle of the wine, in US$
  • Taster Name: name of the person who tasted and reviewed the wine
  • Taster Twitter Handle: Twitter handle for the person who tasted and reviewed the wine

Target

Our target is the wine rating (Points). Reviewers on the original site rate wines on a scale from 80 to 100; here are the details of the different ranges:

Range    Mark        Description
98–100   Classic     The pinnacle of quality
94–97    Superb      A great achievement
90–93    Excellent   Highly recommended
87–89    Very Good   Often good value; well recommended
83–86    Good        Suitable for everyday consumption; often good value
80–82    Acceptable  Can be employed in casual, less-critical circumstances

Our goal and possible applications

Originally, the dataset author collected the data to create a predictive model that identifies wines through blind tasting, like a master sommelier would. Here we will try to solve a simpler task that is still useful in real life: predict the wine rating based on the wine features and the words used in its review. This can have the following practical applications:

Understanding the unrated wine quality

Unlike other beverages, wines come in an overwhelming variety: about 10k grape varieties exist (and their number is growing), they can be blended in different proportions, the harvest year and growing conditions come into play, the wine may be aged for different amounts of time in different types of barrels, etc.

So a review of a specific wine, or a list like "top 10 wines of the season", is of limited use: if you go to two different local stores, there is a good chance you won't find the same wine in both of them. Finding a specific wine may require a journey to another city or even country :) In such conditions it is worth having a model that can predict the wine quality without an exact expert rating, based on the wine features you can read from the bottle.

Blind testing the expert predictions

While this is an area of purely personal taste, professionals always try to free themselves from biases and provide objective observations. Blind testing may help reveal the biases of a specific reviewer.
Actually, the model could be used for cross-validation of the expert ratings :)

Data Analysis and Cleaning

Let's download the data from Kaggle, extract it into the data folder and check the main properties of the resulting DataFrame:

In [2]:
df = pd.read_csv('data/winemag-data-130k-v2.csv', index_col=0)
df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 129971 entries, 0 to 129970
Data columns (total 13 columns):
country                  129908 non-null object
description              129971 non-null object
designation              92506 non-null object
points                   129971 non-null int64
price                    120975 non-null float64
province                 129908 non-null object
region_1                 108724 non-null object
region_2                 50511 non-null object
taster_name              103727 non-null object
taster_twitter_handle    98758 non-null object
title                    129971 non-null object
variety                  129970 non-null object
winery                   129971 non-null object
dtypes: float64(1), int64(1), object(11)
memory usage: 128.9 MB

As we can see, there are many null values in the data; we will need to deal with them later.
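To make the amount of missing data easier to compare across columns, the non-null counts can be turned into a share of missing values per column. This is a minimal sketch on a made-up toy frame (the `toy` values are illustrative, not taken from the dataset); the same expression should work on `df`:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the wine data (values are made up for illustration)
toy = pd.DataFrame({
    'country': ['Italy', 'US', None, 'France'],
    'price': [15.0, np.nan, np.nan, 40.0],
    'points': [87, 88, 90, 91],
})

# Share of missing values per column, largest first
null_share = toy.isna().mean().sort_values(ascending=False)
print(null_share)
```

Columns with a large share (such as region_2 in the real data) are candidates for being dropped or merged into another feature rather than imputed.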

In [3]:
df.head()
Out[3]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks

Let's check the data for possible categorical features:

In [4]:
df.nunique()
Out[4]:
country                      43
description              119955
designation               37979
points                       21
price                       390
province                    425
region_1                   1229
region_2                     17
taster_name                  19
taster_twitter_handle        15
title                    118840
variety                     707
winery                    16757
dtype: int64

Looks like the following features can be represented as categorical:

  • designation
  • province
  • region_1
  • region_2
  • taster_name
  • taster_twitter_handle
  • variety
  • winery

Let's explore the data now to get better acquainted with the dataset:

Country

In [5]:
plt.figure(figsize=(13, 10))
ax = sns.countplot(y=df.country, order=df.country.value_counts().index, palette='tab10')
for p, label in zip(ax.patches, df.country.value_counts()):
    ax.annotate("{0:,d}".format(label), (p.get_width() + 50, p.get_y() + 0.7))
ax.set_title('Number of wine reviews per country', fontsize=18);

We see that we have a lot of reviews for wines from the US, which can be explained by the fact that the reviewers are mostly located in the US.
It should also be noted that some countries have very few reviews, which may cause problems.

Let's see how the countries are distributed on the map, along with the number of reviews for each.
To make the choropleth coloring easier to read, let's log1p-transform the number of reviews per country:

In [6]:
countries = df.groupby('country').size().reset_index()
countries.columns = ['name', 'size']
countries.name = countries.name.replace({ # making the country names compatible with pycountry
    'England': 'United Kingdom',
    'Czech Republic': 'Czechia',
    'Macedonia': 'Macedonia, Republic of',
    'Moldova': 'Moldova, Republic of',
    'US': 'United States'
})

data = pd.DataFrame(index=countries.index)
data['name'] = countries.name
data['size'] = countries['size']
data['code'] = countries.apply(lambda x: pycountry.countries.get(name=x['name']), axis=1)
data['code'] = data.code.apply(lambda x: x.alpha_3 if x else None)
data = data.dropna()

choropleth_data = [dict(
    type='choropleth',
    locations=data['code'],
    z=np.log1p(data['size']),
    #showscale=False,
    text=data['name'],
    marker=dict(
        line=dict(
            color='rgb(180,180,180)',
            width=0.5
        )),
)]

layout = dict(
    title='Number of wine reviews per country, log-transformed',
    geo=dict(
        showframe=False,
        showcoastlines=True,
        projection=dict(
            type='natural earth'
        )
    ))

fig = dict(data=choropleth_data, layout=layout)
py.iplot(fig, validate=False)
In [7]:
top_rated_countries = df[['country', 'points']].groupby('country').mean().reset_index().sort_values('points', ascending=False).country[:10]

data = df[df.country.isin(top_rated_countries)]

plt.figure(figsize=(15, 7))
ax = sns.violinplot(x='country', y='points', data=data, order=top_rated_countries, palette='tab10')
ax.set_title('Top 10 countries with highest average rating', fontsize=18);

Here we can see that some of the countries with a low number of reviews have a pretty high average rating.
Probably, it's because the wines with the highest potential rating are the first to be reviewed by the experts.
The dependency between Country and Points is clearly visible.

Cleaning and transforming

Countries with few reviews do not have much predictive power and introduce unnecessary noise, so let's replace those with fewer than 100 reviews with the name 'Other':

In [8]:
vc = df.country.value_counts()
df['trans_country'] = df.country.replace(vc[vc < 100].index, 'Other')
In [9]:
top_rated_countries = df[['trans_country', 'points']].groupby('trans_country').mean().reset_index().sort_values('points', ascending=False).trans_country[:10]

top_rated_countries_data = df[df.trans_country.isin(top_rated_countries)]

plt.figure(figsize=(15, 7))
ax = sns.violinplot(x='trans_country', y='points', data=top_rated_countries_data, order=top_rated_countries, palette='tab10')
ax.set_title('Top 10 countries with highest average rating', fontsize=18);

Now we see a better distribution of the rating among the countries in the top 10 list.
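The same "lump rare categories into 'Other'" step is applied to several features in this notebook, so it can be wrapped in a small helper. This is a sketch only: `lump_rare` and its parameters are hypothetical names introduced here, not part of the original code, and the toy values are made up:

```python
import pandas as pd

def lump_rare(series, min_count, other='Other'):
    """Replace categories seen fewer than `min_count` times with `other`."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Series.where keeps values where the condition holds and substitutes `other` elsewhere
    return series.where(~series.isin(rare), other)

toy = pd.Series(['US', 'US', 'US', 'Italy', 'Italy', 'Egypt', None])
lumped = lump_rare(toy, min_count=2)
print(lumped.tolist())
```

Missing values are left untouched by this helper, so they can still be handled separately later.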

Province, Region 1 and Region 2

These features are actually parts of the wine location hierarchy, so it is better to join them into one field together with the Country. Let's take a look at them:

In [10]:
df[['trans_country', 'province', 'region_1', 'region_2']].head()
Out[10]:
trans_country province region_1 region_2
0 Italy Sicily & Sardinia Etna NaN
1 Portugal Douro NaN NaN
2 US Oregon Willamette Valley Willamette Valley
3 US Michigan Lake Michigan Shore NaN
4 US Oregon Willamette Valley Willamette Valley
In [11]:
df[['trans_country', 'province', 'region_1', 'region_2']].nunique()
Out[11]:
trans_country      20
province          425
region_1         1229
region_2           17
dtype: int64
In [12]:
print('Countries with Region 2:', df[~df.region_2.isna()].trans_country.unique())
Countries with Region 2: ['US']

Looks like Region 2 is a US-specific feature, but it won't hurt to include it as well, so that we get a finer categorization for US wines.

In [13]:
df['location'] = df.apply(lambda x: ' / '.join([y for y in [str(x['trans_country']), str(x['province']), str(x['region_1']), str(x['region_2'])] if y != 'nan']), axis=1)
df.location.head()
Out[13]:
0                     Italy / Sicily & Sardinia / Etna
1                                     Portugal / Douro
2    US / Oregon / Willamette Valley / Willamette V...
3                  US / Michigan / Lake Michigan Shore
4    US / Oregon / Willamette Valley / Willamette V...
Name: location, dtype: object
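An alternative way to build the same kind of string, shown here as a sketch on a made-up toy frame, is to drop the missing parts of each row before joining. This avoids converting values to strings and comparing them with 'nan':

```python
import numpy as np
import pandas as pd

# Toy rows mirroring the first rows shown above
toy = pd.DataFrame({
    'trans_country': ['Italy', 'Portugal', 'US'],
    'province': ['Sicily & Sardinia', 'Douro', 'Oregon'],
    'region_1': ['Etna', np.nan, 'Willamette Valley'],
    'region_2': [np.nan, np.nan, 'Willamette Valley'],
})

cols = ['trans_country', 'province', 'region_1', 'region_2']
# Join only the non-missing parts of each row
location = toy[cols].apply(lambda row: ' / '.join(row.dropna()), axis=1)
print(location.tolist())
```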

Now let's try to see if there is a dependency between the Points and Location:

In [14]:
df_top_locations = df[df.location.isin(df.location.value_counts().index[:10])]

plt.figure(figsize=(12, 10))
ax = sns.violinplot(y='location', x='points', data=df_top_locations, palette='tab10');
ax.set_title('Wine rating distribution over top 10 locations by number of reviews', fontsize=18);

Cleaning and transforming

Let's see if we can get something from the title:

In [15]:
df[['region_1', 'title']].head()
Out[15]:
region_1 title
0 Etna Nicosia 2013 Vulkà Bianco (Etna)
1 NaN Quinta dos Avidagos 2011 Avidagos Red (Douro)
2 Willamette Valley Rainstorm 2013 Pinot Gris (Willamette Valley)
3 Lake Michigan Shore St. Julian 2013 Reserve Late Harvest Riesling ...
4 Willamette Valley Sweet Cheeks 2012 Vintner's Reserve Wild Child...

As we can see, some regions are repeated in the title, so even if the region is NaN it is possible to fill it with the value from the title. Let's do it:

In [16]:
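def extract_region_1(row):
    # Keep the region if it is already present
    if pd.notna(row.region_1):
        return row.region_1
    # Otherwise fall back to the text in the trailing parentheses of the title
    if not row.title.endswith(')') or '(' not in row.title:
        return np.nan
    return row.title[row.title.rindex('(')+1:-1]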

df.region_1 = df.apply(extract_region_1, axis=1)
In [17]:
df[['region_1', 'title']].head()
Out[17]:
region_1 title
0 Etna Nicosia 2013 Vulkà Bianco (Etna)
1 Douro Quinta dos Avidagos 2011 Avidagos Red (Douro)
2 Willamette Valley Rainstorm 2013 Pinot Gris (Willamette Valley)
3 Lake Michigan Shore St. Julian 2013 Reserve Late Harvest Riesling ...
4 Willamette Valley Sweet Cheeks 2012 Vintner's Reserve Wild Child...
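The fill-in from the title can also be written without a row-wise apply. The sketch below assumes the region is always the text inside the last pair of parentheses at the very end of the title; the third title is a made-up example of a title without parentheses:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'region_1': ['Etna', np.nan, np.nan],
    'title': ['Nicosia 2013 Vulkà Bianco (Etna)',
              'Quinta dos Avidagos 2011 Avidagos Red (Douro)',
              'Some Winery 2010 Red'],  # hypothetical title with no parentheses
})

# Capture the text inside the parentheses that close the title
from_title = toy.title.str.extract(r'\(([^()]*)\)\s*$', expand=False)
# Keep the existing region and use the title only where the region is missing
filled = toy.region_1.fillna(from_title)
print(filled.tolist())
```

Rows whose title carries no parenthesized suffix simply stay missing.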

Great, now let's recreate the Location:

In [18]:
df['location'] = df.apply(lambda x: ' / '.join([y for y in [str(x['trans_country']), str(x['province']), str(x['region_1']), str(x['region_2'])] if y != 'nan']), axis=1)
df.location.head()
Out[18]:
0                     Italy / Sicily & Sardinia / Etna
1                             Portugal / Douro / Douro
2    US / Oregon / Willamette Valley / Willamette V...
3                  US / Michigan / Lake Michigan Shore
4    US / Oregon / Willamette Valley / Willamette V...
Name: location, dtype: object

Now let's replace the locations that have only a single review with the name 'Other':

In [19]:
vc = df.location.value_counts()
df.location = df.location.replace(vc[vc < 2].index, 'Other')

Price

Price is given in US$; let's see how it's distributed:

In [20]:
plt.figure(figsize=(15, 5))
data = df[~df.price.isna()]
plt.scatter(range(data.shape[0]), np.sort(data.price.values)[::-1])
plt.title("Distribution of wine prices", fontsize=18)
plt.ylabel('Price');

Wow, there are wines priced at more than $3000. That's not a usual weekend wine :)

As we see, the price distribution is very skewed; let's try to log-transform it:

In [21]:
plt.figure(figsize=(15, 3))
series_price = df[~df.price.isna()].price.apply(np.log1p)
ax = sns.distplot(series_price);
ax.set_title("Distribution of wine prices (log1p)", fontsize=18)
ax.set_xlabel('Price (log1p)')
ax.set_ylabel('Density');

Still, the distribution is not normal:

In [22]:
print('Shapiro-Wilk test:', stats.shapiro(series_price))
print('Kolmogorov-Smirnov test:', stats.kstest(series_price, cdf='norm'))
Shapiro-Wilk test: (0.9742796421051025, 0.0)
Kolmogorov-Smirnov test: KstestResult(statistic=0.9809554303012358, pvalue=0.0)
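A caveat about the Kolmogorov-Smirnov output: `stats.kstest(..., 'norm')` with no extra arguments compares the sample against the standard normal N(0, 1), so a sample whose mean is far from 0 gets a large statistic regardless of its shape. That may explain the 0.98 above. Below is a minimal sketch on synthetic data (not the wine prices), assuming we standardize with the sample mean and standard deviation first. Note that estimating these parameters from the data makes the reported p-value only approximate:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(17)
# Synthetic normal sample with a mean and scale far from N(0, 1)
sample = rng.normal(loc=3.3, scale=0.6, size=5000)

# Raw sample against the standard normal: the statistic is driven by location and scale
raw = stats.kstest(sample, 'norm')

# Standardized sample: the statistic now reflects the shape of the distribution
z = (sample - sample.mean()) / sample.std(ddof=1)
standardized = stats.kstest(z, 'norm')

print(raw.statistic, standardized.statistic)
```

On the real log-transformed prices the standardized test may still reject normality; the point is only that the statistic then becomes interpretable.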

But it is not very skewed anymore:

In [23]:
print('Skewness:', series_price.skew())
print('Kurtosis:', series_price.kurt())
Skewness: 0.6570262197115684
Kurtosis: 0.8646195548501945
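To illustrate why log1p reduces skewness, here is a small sketch on synthetic, log-normally distributed "prices" (generated data, not the wine prices; the distribution parameters are arbitrary assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(17)
# Synthetic right-skewed prices: a few very large values, many small ones
prices = pd.Series(rng.lognormal(mean=3.0, sigma=0.7, size=10000))

raw_skew = prices.skew()            # strongly positive for a log-normal sample
log_skew = np.log1p(prices).skew()  # close to zero after the transform

print(raw_skew, log_skew)
```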

Now let's see a connection between the Price (not log-transformed) and Points:

In [24]:
plt.figure(figsize=(15, 5))
ax = sns.regplot(x='points', y='price', data=df, fit_reg=False, x_jitter=True)
ax.set_title('Correlation between the wine price and points given', fontsize=18);

And now let's see which countries have the most expensive wines (on average):

In [25]:
plt.figure(figsize=(13, 7))
data = df[['country', 'price']].groupby('country').mean().reset_index().sort_values('price', ascending=False)
ax = sns.barplot(y='country', x='price', data=data, palette='tab10')

for p, label in zip(ax.patches, data.price):
    if np.isnan(label):
        continue
    ax.annotate('{0:.2f}'.format(label), (p.get_width() + 0.2, p.get_y() + 0.5))

ax.set_title('Top countries with the most expensive average wine prices');

Interestingly, we see, for example, Germany, Hungary and France among the leaders here, which are also among the leaders for average wine rating above.

Let's take the countries with the top-rated wines and see the price distribution in them:

In [26]:
plt.figure(figsize=(15, 5))
sns.violinplot(x='country', y='price', data=top_rated_countries_data, order=top_rated_countries, palette='tab10');

Well, not good: the Price needs to be transformed.

Cleaning and transforming

In [27]:
df['trans_price'] = df.price.apply(np.log1p)
In [28]:
top_rated_countries = df[['trans_country', 'points']].groupby('trans_country').mean().reset_index().sort_values('points', ascending=False).trans_country[:10]

top_rated_countries_data = df[df.trans_country.isin(top_rated_countries)]

plt.figure(figsize=(15, 5))
sns.violinplot(x='trans_country', y='trans_price', data=top_rated_countries_data, order=top_rated_countries, palette='tab10');
In [29]:
plt.figure(figsize=(15, 5))
ax = sns.regplot(x='points', y='trans_price', data=df, fit_reg=False, x_jitter=True)
ax.set_title('Correlation between the wine price (log) and points given', fontsize=18);

Variety

Let's see the top 10 varieties with their wine counts:

In [30]:
df_top_varieties = df[df.variety.isin(df.variety.value_counts().index[:10])]
plt.figure(figsize=(13, 5))
ax = sns.countplot(y=df_top_varieties.variety, order=df_top_varieties.variety.value_counts().index, palette='tab10')
for p, label in zip(ax.patches, df_top_varieties.variety.value_counts()):
    ax.annotate("{0:,d}".format(label), (p.get_width() + 50, p.get_y() + 0.5))
ax.set_title('Number of wines per variety', fontsize=18);

Now let's see the dependency between the Variety and Points:

In [31]:
plt.figure(figsize=(14, 10))
ax = sns.violinplot(y='variety', x='points', data=df_top_varieties, palette='tab10', order=df_top_varieties.variety.value_counts().index)
ax.set_title('Wine rating distribution over top 10 varieties by wine count', fontsize=18);

As we see, some varieties get higher points than others, and the points distribution also varies.

Variety has the same problem as the other categorical features: there are some varieties with almost no samples, but they affect the points heavily:

In [32]:
top_rated_varietes = df[['variety', 'points']].groupby('variety').mean().reset_index().sort_values('points', ascending=False).variety[:10]

top_rated_varietes_data = df[df.variety.isin(top_rated_varietes)]

plt.figure(figsize=(15, 5))
ax = sns.violinplot(x='variety', y='points', data=top_rated_varietes_data, order=top_rated_varietes, palette='tab10');
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);

Cleaning and transforming

In [33]:
vc = df.variety.value_counts()
df['trans_variety'] = df.variety.replace(vc[vc < 2].index, 'Other')
In [34]:
top_rated_varietes = df[['trans_variety', 'points']].groupby('trans_variety').mean().reset_index().sort_values('points', ascending=False).trans_variety[:10]

top_rated_varietes_data = df[df.trans_variety.isin(top_rated_varietes)]

plt.figure(figsize=(15, 5))
ax = sns.violinplot(x='trans_variety', y='points', data=top_rated_varietes_data, order=top_rated_varietes, palette='tab10');
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);

Title

In [35]:
df.title.head(10)
Out[35]:
0                    Nicosia 2013 Vulkà Bianco  (Etna)
1        Quinta dos Avidagos 2011 Avidagos Red (Douro)
2        Rainstorm 2013 Pinot Gris (Willamette Valley)
3    St. Julian 2013 Reserve Late Harvest Riesling ...
4    Sweet Cheeks 2012 Vintner's Reserve Wild Child...
5    Tandem 2011 Ars In Vitro Tempranillo-Merlot (N...
6     Terre di Giurfo 2013 Belsito Frappato (Vittoria)
7                Trimbach 2012 Gewurztraminer (Alsace)
8    Heinz Eifel 2013 Shine Gewürztraminer (Rheinhe...
9    Jean-Baptiste Adam 2012 Les Natures Pinot Gris...
Name: title, dtype: object

The title itself does not seem to contain valuable information beyond what we already used to fill the nulls in Region 1, except for the Year (vintage), which we can extract from it; we will do that later.
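One possible way to pull the vintage out of a title, as a minimal sketch: take the first standalone four-digit number starting with 19 or 20. This pattern is an assumption and can misfire on winery names that contain a number; the third title below is a made-up non-vintage example:

```python
import pandas as pd

titles = pd.Series([
    'Nicosia 2013 Vulkà Bianco (Etna)',
    'Quinta dos Avidagos 2011 Avidagos Red (Douro)',
    'Hypothetical Winery NV Brut (Champagne)',  # made-up title without a year
])

# First standalone four-digit number that looks like a year
extracted = titles.str.extract(r'\b((?:19|20)\d{2})\b', expand=False)
# Convert to numbers; titles without a year become NaN
year = pd.to_numeric(extracted)
print(year.tolist())
```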

Description

This is a typical text variable, which we can try to analyze with word clouds.

Let's see what the experts say about wines that have a low rating:

In [36]:
stopwords = set(STOPWORDS)
stopwords.update(['wine', 'a', 'about', 'above', 'across', 'after', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'among', 'an', 'and', 'another', 'any', 'anybody', 'anyone', 'anything', 'anywhere', 'are', 'area', 'areas', 'around', 'as', 'ask', 'asked', 'asking', 'asks', 'at', 'away', 'b', 'back', 'backed', 'backing', 'backs', 'be', 'became', 'because', 'become', 'becomes', 'been', 'before', 'began', 'behind', 'being', 'beings', 'best', 'better', 'between', 'big', 'both', 'but', 'by', 'c', 'came', 'can', 'cannot', 'case', 'cases', 'certain', 'certainly', 'clear', 'clearly', 'come', 'could', 'd', 'did', 'differ', 'different', 'differently', 'do', 'does', 'done', 'down', 'down', 'downed', 'downing', 'downs', 'during', 'e', 'each', 'early', 'either', 'end', 'ended', 'ending', 'ends', 'enough', 'even', 'evenly', 'ever', 'every', 'everybody', 'everyone', 'everything', 'everywhere', 'f', 'face', 'faces', 'fact', 'facts', 'far', 'felt', 'few', 'find', 'finds', 'first', 'for', 'four', 'from', 'full', 'fully', 'further', 'furthered', 'furthering', 'furthers', 'g', 'gave', 'general', 'generally', 'get', 'gets', 'give', 'given', 'gives', 'go', 'going', 'good', 'goods', 'got', 'great', 'greater', 'greatest', 'group', 'grouped', 'grouping', 'groups', 'h', 'had', 'has', 'have', 'having', 'he', 'her', 'here', 'herself', 'high', 'high', 'high', 'higher', 'highest', 'him', 'himself', 'his', 'how', 'however', 'i', 'if', 'important', 'in', 'interest', 'interested', 'interesting', 'interests', 'into', 'is', 'it', 'its', 'itself', 'j', 'just', 'k', 'keep', 'keeps', 'kind', 'knew', 'know', 'known', 'knows', 'l', 'large', 'largely', 'last', 'later', 'latest', 'least', 'less', 'let', 'lets', 'like', 'likely', 'long', 'longer', 'longest', 'm', 'made', 'make', 'making', 'man', 'many', 'may', 'me', 'member', 'members', 'men', 'might', 'more', 'most', 'mostly', 'mr', 'mrs', 'much', 'must', 'my', 'myself', 'n', 'necessary', 'need', 'needed', 
'needing', 'needs', 'never', 'new', 'new', 'newer', 'newest', 'next', 'no', 'nobody', 'non', 'noone', 'not', 'nothing', 'now', 'nowhere', 'number', 'numbers', 'o', 'of', 'off', 'often', 'old', 'older', 'oldest', 'on', 'once', 'one', 'only', 'open', 'opened', 'opening', 'opens', 'or', 'order', 'ordered', 'ordering', 'orders', 'other', 'others', 'our', 'out', 'over', 'p', 'part', 'parted', 'parting', 'parts', 'per', 'perhaps', 'place', 'places', 'point', 'pointed', 'pointing', 'points', 'possible', 'present', 'presented', 'presenting', 'presents', 'problem', 'problems', 'put', 'puts', 'q', 'quite', 'r', 'rather', 'really', 'right', 'right', 'room', 'rooms', 's', 'said', 'same', 'saw', 'say', 'says', 'second', 'seconds', 'see', 'seem', 'seemed', 'seeming', 'seems', 'sees', 'several', 'shall', 'she', 'should', 'show', 'showed', 'showing', 'shows', 'side', 'sides', 'since', 'small', 'smaller', 'smallest', 'so', 'some', 'somebody', 'someone', 'something', 'somewhere', 'state', 'states', 'still', 'still', 'such', 'sure', 't', 'take', 'taken', 'than', 'that', 'the', 'their', 'them', 'then', 'there', 'therefore', 'these', 'they', 'thing', 'things', 'think', 'thinks', 'this', 'those', 'though', 'thought', 'thoughts', 'three', 'through', 'thus', 'to', 'today', 'together', 'too', 'took', 'toward', 'turn', 'turned', 'turning', 'turns', 'two', 'u', 'under', 'until', 'up', 'upon', 'us', 'use', 'used', 'uses', 'v', 'very', 'w', 'want', 'wanted', 'wanting', 'wants', 'was', 'way', 'ways', 'we', 'well', 'wells', 'went', 'were', 'what', 'when', 'where', 'whether', 'which', 'while', 'who', 'whole', 'whose', 'why', 'will', 'with', 'within', 'without', 'work', 'worked', 'working', 'works', 'would', 'x', 'y', 'year', 'years', 'yet', 'you', 'young', 'younger', 'youngest', 'your', 'yours', 'z'])

wordcloud = WordCloud(background_color='white', stopwords=stopwords,
    max_words=500, max_font_size=200, width=2000, height=800,
    random_state=RANDOM_SEED).generate(' '.join(df[df.points < 83].description.str.lower()))

plt.figure(figsize=(15, 7))
plt.imshow(wordcloud)
plt.title("Low Rated Wines Description Word Cloud", fontsize=20)
plt.axis('off');
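A word cloud is hard to read precisely, so a plain frequency table can complement it. The sketch below counts words with the standard library on a few made-up descriptions and a tiny illustrative stop-word set (both are assumptions, not taken from the dataset):

```python
import re
from collections import Counter

# Made-up descriptions standing in for low-rated reviews
descriptions = [
    'This wine is thin and bitter',
    'Bitter finish with a thin body',
    'Simple and bitter',
]
# Tiny illustrative stop-word set; the notebook uses a much longer list
stop = {'the', 'and', 'a', 'with', 'wine', 'is', 'this'}

# Lowercase, split into words and drop the stop words
tokens = [w for text in descriptions
          for w in re.findall(r"[a-z']+", text.lower())
          if w not in stop]

top_words = Counter(tokens).most_common(2)
print(top_words)
```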