Process the Wine Review file from Kaggle

  • Do some very high level exploration of the data
  • Filter data down to only include wines from Australia
  • Other minor changes to support plotting

This notebook is a companion to the post at Practical Business Python

In [1]:
import pandas as pd
from bokeh.palettes import viridis
In [2]:
# Download and unpack the file from
df = pd.read_csv('winemag-data-130k-v2.csv', index_col=0)
In [3]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks
In [4]:
# Let's see the top 15 Countries[:15]
US              54504
France          22093
Italy           19540
Spain            6645
Portugal         5691
Chile            4472
Argentina        3800
Austria          3345
Australia        2329
Germany          2165
New Zealand      1419
South Africa     1401
Israel            505
Greece            466
Canada            257
Name: country, dtype: int64
In [5]:
# Parse out just the Australian wines
df_Aussie = df[ == "Australia"].copy()
In [6]:
# Let's see what the price and point distribution looks like
points price
count 2329.000000 2294.000000
mean 88.580507 35.437663
std 2.989900 49.049458
min 80.000000 5.000000
25% 87.000000 15.000000
50% 89.000000 21.000000
75% 91.000000 38.000000
max 100.000000 850.000000
In [7]:
# Let's see where null values might be
country                  False
description              False
designation               True
points                   False
price                     True
province                 False
region_1                  True
region_2                  True
taster_name               True
taster_twitter_handle     True
title                    False
variety                  False
winery                   False
dtype: bool

For this analysis, price and points are going to be most important. Let's see how many null prices there are

In [8]:
In [9]:
# We can drop these or fill with the average. For the sake of this analysis, I'll fill with the mean
df_Aussie['price'].fillna((df_Aussie['price'].mean()), inplace=True)
In [10]:
# Double check
In [11]:
# There are a couple of titles that are duplicated and it causes minor problems when plotting. 
df_Aussie.drop_duplicates(subset=['title'], inplace=True)
In [12]:
# One way we will filter the data is by province. Let's see how many there are
South Australia      1268
Victoria              293
Western Australia     274
Australia Other       230
New South Wales        82
Tasmania               36
Name: province, dtype: int64
In [13]:
# It will be nice to color code the plot by the different varieties
Shiraz                       583
Chardonnay                   378
Cabernet Sauvignon           255
Riesling                     142
Pinot Noir                   119
Grenache                      72
Red Blend                     58
Sauvignon Blanc               50
Shiraz-Viognier               38
Viognier                      32
Sparkling Blend               31
Shiraz-Cabernet Sauvignon     29
Bordeaux-style Red Blend      28
Rosé                          25
Pinot Grigio                  25
Name: variety, dtype: int64
In [14]:
# Use the viridis pallette and create a column that contains the desired color
varieties = list(df_Aussie.variety.unique())
colors = viridis(len(varieties))
color_map = dict(zip(varieties, colors))
df_Aussie["variety_color"] = df_Aussie["variety"].map(color_map)
In [15]:
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery variety_color
77 Australia This medium-bodied Chardonnay features aromas ... Made With Organic Grapes 86 18.0 South Australia South Australia NaN Joe Czerwinski @JoeCz Yalumba 2016 Made With Organic Grapes Chardonn... Chardonnay Yalumba #440154
83 Australia Pale copper in hue, this wine exudes passion f... Jester Sangiovese 86 20.0 South Australia McLaren Vale NaN Joe Czerwinski @JoeCz Mitolo 2016 Jester Sangiovese Rosé (McLaren Vale) Rosé Mitolo #450558
123 Australia The blend is roughly two-thirds Shiraz and one... Parson's Flat 92 40.0 South Australia Padthaway NaN Joe Czerwinski @JoeCz Henry's Drive Vignerons 2006 Parson's Flat Shi... Shiraz-Cabernet Sauvignon Henry's Drive Vignerons #460B5E
191 Australia From the little-known region of Padthaway, thi... The Trial of John Montford 87 30.0 South Australia Padthaway NaN Joe Czerwinski @JoeCz Henry's Drive Vignerons 2006 The Trial of John... Cabernet Sauvignon Henry's Drive Vignerons #471163
232 Australia Lifted cedar and pine notes interspersed with ... Red Belly Black 85 12.0 South Australia South Australia NaN NaN NaN Angove's 2006 Red Belly Black Shiraz (South Au... Shiraz Angove's #471669
In [16]:
# Save the filtered and processed file to be used in our bokeh plot
In [ ]: