This is a perfect competition for people with a beginner-level understanding of machine learning and data science who are looking to polish their skills and see how they stand against a larger community.
Data scientists take their beer very seriously. Recommendations from friends? No thank you. Websites? Too many pop-ups. Ads? Yeah, as if. They trust only solid numbers. Here’s a fun fact: last year, Indians drank a total of 4.7 billion litres of beer, and that number is expected to reach 6.5 billion litres by 2022.
Newer brands are also entering the market — take Delhi-based startup Bira 91, for example. With the $50 million they have raised, Bira plans to flood India with more beer and fill in that gap between traditional inexpensive brands and expensive ones.
So how will data scientists choose their beer? Will they look at the combination of barley, water, hops and yeast arrived upon? There are many things to consider — a series of complex biochemical reactions need to take place to make the perfect beer.
That’s why, here at MachineHack, we have entrusted this very important job to the most trustworthy people in the world (especially when it comes to beer): you, the data scientists.
The train and test data consist of various features that describe a beer. In many beer cellars, important factors such as temperature and humidity are maintained by a climate control system, so features like Cellar Temperature and Serving Temperature become really important. This is a real data set, curated over months of primary and secondary research by our team. Each row is a fixed-size record of nine features, each accessible by its name.
With the given nine features (categorical and continuous), build a model to predict the score of the beer.
Goal: It is your job to predict the score for each beer. For each beer in the test set, you must predict the score variable.
Metric: Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the predicted value and observed score values. The final score calculation is done in the following way:
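The metric can be sketched directly in code. This is a minimal illustration of RMSE as described above, not the organisers' official scoring script:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-Mean-Squared-Error between observed and predicted scores."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Predictions that are off by exactly 1 everywhere give an RMSE of 1.0
print(rmse([3.0, 4.0, 5.0], [2.0, 3.0, 4.0]))  # 1.0
```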
Submission File Format: Please do not change the format of the test file when submitting. Just fill in the Score column without touching any other data in the file.
import pandas as pd # pandas is a dataframe library
import matplotlib.pyplot as plt # matplotlib.pyplot plots data
import numpy as np # numpy provides N-dim object support
# do plotting inline instead of in a separate window
%matplotlib inline
df = pd.read_csv("data/Beer Train Data Set.csv") # load the beer training data; adjust the path as necessary
df.shape
(185643, 10)
df.dtypes
ABV                    float64
Brewing Company          int64
Food Paring             object
Glassware Used          object
Beer Name                int64
Ratings                 object
Style Name              object
Cellar Temperature      object
Serving Temperature     object
Score                  float64
dtype: object
df.head(5)
| | ABV | Brewing Company | Food Paring | Glassware Used | Beer Name | Ratings | Style Name | Cellar Temperature | Serving Temperature | Score |
|---|---|---|---|---|---|---|---|---|---|---|
0 | 6.5 | 8929 | (Curried,Thai)Cheese(pepperyMontereyPepperJack... | PintGlass(orBecker,Nonic,Tumbler),Mug(orSeidel... | 15121 | 22 | AmericanIPA | 40-45 | 45-50 | 3.28 |
1 | 5.5 | 13187 | (PanAsian)Cheese(earthyCamembert,Fontina,nutty... | PintGlass(orBecker,Nonic,Tumbler),Mug(orSeidel... | 59817 | 1 | AmericanPaleAle(APA) | 35-40 | 40-45 | 3.52 |
2 | 8.1 | 6834 | Meat(Pork,Poultry) | PintGlass(orBecker,Nonic,Tumbler),Mug(orSeidel... | 32669 | 3 | IrishRedAle | 35-40 | 40-45 | 4.01 |
3 | NaN | 11688 | (Indian,LatinAmerican,PanAsian)General(Aperitif) | PintGlass(orBecker,Nonic,Tumbler),PilsenerGlas... | 130798 | 0 | AmericanMaltLiquor | 35-40 | 35-40 | 0.00 |
4 | 6.0 | 10417 | Meat(Poultry,Fish,Shellfish) | PilsenerGlass(orPokal) | 124087 | 1 | EuroPaleLager | 35-40 | 40-45 | 2.73 |
df.tail(5)
| | ABV | Brewing Company | Food Paring | Glassware Used | Beer Name | Ratings | Style Name | Cellar Temperature | Serving Temperature | Score |
|---|---|---|---|---|---|---|---|---|---|---|
185638 | 4.5 | 9105 | (Dessert,Aperitif,Digestive) | PintGlass(orBecker,Nonic,Tumbler),Mug(orSeidel... | 141522 | 0 | HerbedSpicedBeer | NaN | 45-50 | 0.00 |
185639 | 4.5 | 3348 | (Barbecue,Italian)Cheese(earthyCamembert,Fonti... | PilsenerGlass(orPokal) | 85557 | 1 | AmericanPaleLager | 35-40 | 40-45 | 4.19 |
185640 | NaN | 8216 | Cheese(earthyCamembert,Fontina,nuttyAsiago,Col... | PintGlass(orBecker,Nonic,Tumbler),Mug(orSeidel... | 105072 | 1 | EnglishBrownAle | 40-45 | 45-50 | 3.11 |
185641 | 6.2 | 1755 | (Curried,Thai)Cheese(pepperyMontereyPepperJack... | PintGlass(orBecker,Nonic,Tumbler),Mug(orSeidel... | 70788 | 2 | AmericanIPA | 40-45 | 45-50 | 3.40 |
185642 | 6.4 | 4341 | (Curried,Thai)Cheese(pepperyMontereyPepperJack... | PintGlass(orBecker,Nonic,Tumbler),Mug(orSeidel... | 149979 | 1 | AmericanIPA | 40-45 | 45-50 | 4.31 |
df.isnull().values.any()
True
df.describe()
| | ABV | Brewing Company | Beer Name | Score |
|---|---|---|---|---|
count | 170513.000000 | 185643.000000 | 185643.000000 | 185643.000000 |
mean | 6.354961 | 7008.757659 | 83738.220111 | 3.198432 |
std | 1.907205 | 3914.168053 | 48520.065146 | 1.358862 |
min | 0.010000 | 0.000000 | 0.000000 | 0.000000 |
25% | 5.000000 | 3825.000000 | 41232.500000 | 3.270000 |
50% | 6.000000 | 7111.000000 | 83335.000000 | 3.710000 |
75% | 7.200000 | 10402.000000 | 125148.500000 | 3.970000 |
max | 80.000000 | 13541.000000 | 168534.000000 | 5.000000 |
df['Ratings'] = df['Ratings'].astype('category')
df['Ratings'].head(5)
0    22
1     1
2     3
3     0
4     1
Name: Ratings, dtype: category
Categories (1824, object): [0, 1, 1,000, 1,001, ..., 995, 996, 997, 999]
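Notice that the 1,824 categories include values like "1,000" — the ratings are really counts stored as strings with thousands separators. If you would rather treat them as numbers than as categories, a hedged sketch of the conversion (using a small stand-in Series rather than the real column):

```python
import pandas as pd

# 'Ratings' arrives as strings, some with thousands separators like "1,000"
ratings = pd.Series(["22", "1", "1,000", "0", "995"])

# Strip the commas and convert; errors='coerce' turns anything unparseable into NaN
ratings_numeric = pd.to_numeric(
    ratings.str.replace(",", "", regex=False), errors="coerce"
)
print(ratings_numeric.tolist())  # [22, 1, 1000, 0, 995]
```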
df['Cellar Temperature'] = df['Cellar Temperature'].astype('category')
df['Cellar Temperature'].head(5)
0    40-45
1    35-40
2    35-40
3    35-40
4    35-40
Name: Cellar Temperature, dtype: category
Categories (3, object): [35-40, 40-45, 45-50]
df['Serving Temperature'] = df['Serving Temperature'].astype('category')
df['Serving Temperature'].head(5)
0    45-50
1    40-45
2    40-45
3    35-40
4    40-45
Name: Serving Temperature, dtype: category
Categories (4, object): [35-40, 40-45, 45-50, 50-55]
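Both temperature columns are ordered bands, so a model can also use them numerically. One option (an illustration, not the only encoding) is to map each band like "35-40" to its midpoint:

```python
import pandas as pd

def range_midpoint(value):
    """Turn a temperature band like '35-40' into its numeric midpoint (37.5)."""
    if pd.isna(value):
        return None  # leave missing bands missing; impute separately
    low, high = value.split("-")
    return (float(low) + float(high)) / 2

bands = ["35-40", "40-45", "45-50"]
print([range_midpoint(b) for b in bands])  # [37.5, 42.5, 47.5]
```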
print("# rows in dataframe : {0}".format(len(df)))
print("# rows missing ABV : {0}".format(len(df.loc[np.isnan(df['ABV'])])))
print("# rows missing Brewing Company : {0}".format(len(df.loc[np.isnan(df['Brewing Company'])])))
print("# rows missing Food Paring : {0}".format(len(df.loc[df['Food Paring']==''])))
print("# rows missing Glassware Used : {0}".format(len(df.loc[df['Glassware Used']==''])))
print("# rows missing Beer Name : {0}".format(len(df.loc[np.isnan(df['Beer Name'])])))
print("# rows missing Ratings : {0}".format(len(df.loc[df['Ratings']==''])))
print("# rows missing Style Name : {0}".format(len(df.loc[df['Style Name']== ''])))
print("# rows missing Cellar Temperature : {0}".format(len(df.loc[df['Cellar Temperature'].isnull()])))
print("# rows missing Serving Temperature : {0}".format(len(df.loc[df['Serving Temperature'].isnull()])))
# rows in dataframe : 185643
# rows missing ABV : 15130
# rows missing Brewing Company : 0
# rows missing Food Paring : 0
# rows missing Glassware Used : 0
# rows missing Beer Name : 0
# rows missing Ratings : 0
# rows missing Style Name : 0
# rows missing Cellar Temperature : 6781
# rows missing Serving Temperature : 193
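ABV and the two temperature columns carry the missing values, so they need imputation before modelling. A minimal sketch on a toy frame (the column names come from the dataset; the fill strategies — median for the numeric column, most frequent band for the categorical one — are one reasonable choice, not the only one):

```python
import numpy as np
import pandas as pd

# Toy frame with the same kinds of gaps the real data shows
toy = pd.DataFrame({
    "ABV": [6.5, np.nan, 8.1, 5.0],
    "Cellar Temperature": ["40-45", "35-40", None, "35-40"],
})

# Numeric column: fill with the median, which is robust to outliers like ABV = 80
toy["ABV"] = toy["ABV"].fillna(toy["ABV"].median())

# Categorical column: fill with the most frequent band
toy["Cellar Temperature"] = toy["Cellar Temperature"].fillna(
    toy["Cellar Temperature"].mode()[0]
)

print(toy["ABV"].tolist())                 # [6.5, 6.5, 8.1, 5.0]
print(toy["Cellar Temperature"].tolist())  # ['40-45', '35-40', '35-40', '35-40']
```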
def plot_corr(df, size=10):
"""
Function plots a graphical correlation matrix for each pair of columns in the dataframe.
Input:
df: pandas DataFrame
size: vertical and horizontal size of the plot
Displays:
    matrix of correlation between columns. With the default colormap,
    darker => less correlated, brighter => more correlated (0 to 1).
    Expect a bright line running from top left to bottom right, since
    each column correlates perfectly with itself.
"""
    corr = df.corr(numeric_only=True) # correlate the numeric columns only (object columns would raise in recent pandas)
fig, ax = plt.subplots(figsize=(size, size))
ax.matshow(corr) # color code the rectangles by correlation value
plt.xticks(range(len(corr.columns)), corr.columns) # draw x tick marks
plt.yticks(range(len(corr.columns)), corr.columns) # draw y tick marks
plot_corr(df)
df.corr()
from sklearn.model_selection import train_test_split # sklearn.cross_validation was removed in scikit-learn 0.20
featured_col_names = ['ABV', 'Brewing Company', 'Beer Name', 'Ratings', 'Style Name', 'Cellar Temperature', 'Serving Temperature']
predicted_class_name = ['Score']
x = df[featured_col_names].values # predictor feature columns (7 x m)
y = df[predicted_class_name].values # predicted Score column (1 x m)
split_test_size = 0.30
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = split_test_size, random_state = 42)
# test_size = 0.30 holds out 30% of the rows; random_state = 42 fixes the seed so the split is reproducible
print("{0:0.2f}% in Training set".format((len(x_train)/len(df.index))*100))
print("{0:0.2f}% in Test set".format((len(x_test)/len(df.index))*100))
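With the split in place, the remaining step is to fit a regressor and evaluate it with the competition's RMSE metric. A hedged end-to-end sketch on synthetic numeric data (the real features would first need the categorical encoding discussed above; RandomForestRegressor is chosen purely for illustration, not prescribed by the competition):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for encoded numeric beer features
rng = np.random.RandomState(42)
X = rng.rand(500, 4)
y = 5 * X[:, 0] + rng.normal(scale=0.1, size=500)  # target mostly driven by one feature

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

pred = model.predict(X_test)
rmse = np.sqrt(np.mean((y_test - pred) ** 2))
print(round(rmse, 3))  # small, since the target is nearly linear in X[:, 0]
```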