This is a perfect competition for people with a beginner-level understanding of machine learning and data science who are looking to polish their skills and see how they stand against a larger community.
Data scientists take their beer very seriously. Recommendations from friends? No thank you. Websites? Too many pop-ups. Ads? Yeah, as if. They trust only solid numbers. Here’s a fun fact: last year, Indians drank a total of 4.7 billion litres of beer, and that number is expected to reach 6.5 billion litres by 2022.
Newer brands are also entering the market — take Delhi-based startup Bira 91, for example. With the $50 million they have raised, Bira plans to flood India with more beer and fill in that gap between traditional inexpensive brands and expensive ones.
So how will data scientists choose their beer? Will they look at the combination of barley, water, hops and yeast arrived upon? There are many things to consider — a series of complex biochemical reactions need to take place to make the perfect beer.
That’s why, here at MachineHack, we have entrusted this very important job to the most trustworthy people in the world (especially when it comes to beer): you, the data scientists.
The train and test data consist of various features that describe a beer. In many beer cellars, important factors such as temperature and humidity are maintained by a climate control system, so features like Cellar Temperature and Serving Temperature become really important. This is a real data set, curated over months of primary and secondary research by our team. Each row is a fixed-size record of nine features, each accessible by its name.
With the given nine features (categorical and continuous), build a model to predict the score of the beer.
Goal: It is your job to predict the score for each beer. For each beer in the test set, you must predict the score variable.
Metric: Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the predicted value and observed score values. The final score calculation is done in the following way:
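The metric can be sketched directly in code. This is a minimal illustration of RMSE as described above, not the organisers' official scoring script:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-Mean-Squared-Error between observed and predicted scores."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Predictions that are off by exactly 1 everywhere give an RMSE of 1.0
print(rmse([3.0, 4.0, 5.0], [2.0, 3.0, 4.0]))  # 1.0
```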
Submission File Format: Please do not change the format of the test file when submitting. Just fill in the Score column without touching any other data in the file.
import pandas as pd # pandas is a dataframe library
import matplotlib.pyplot as plt # matplotlib.pyplot plots data
import numpy as np # numpy provides N-dim object support
# do plotting inline instead of in a separate window
%matplotlib inline
df = pd.read_csv("data/Beer Train Data Set.csv") # load the beer training data; adjust the path as necessary
df.shape
(185643, 10)
df.dtypes
ABV                    float64
Brewing Company          int64
Food Paring             object
Glassware Used          object
Beer Name                int64
Ratings                 object
Style Name              object
Cellar Temperature      object
Serving Temperature     object
Score                  float64
dtype: object
df.head(5)
| | ABV | Brewing Company | Food Paring | Glassware Used | Beer Name | Ratings | Style Name | Cellar Temperature | Serving Temperature | Score |
|---|---|---|---|---|---|---|---|---|---|---|
0 | 6.5 | 8929 | (Curried,Thai)Cheese(pepperyMontereyPepperJack... | PintGlass(orBecker,Nonic,Tumbler),Mug(orSeidel... | 15121 | 22 | AmericanIPA | 40-45 | 45-50 | 3.28 |
1 | 5.5 | 13187 | (PanAsian)Cheese(earthyCamembert,Fontina,nutty... | PintGlass(orBecker,Nonic,Tumbler),Mug(orSeidel... | 59817 | 1 | AmericanPaleAle(APA) | 35-40 | 40-45 | 3.52 |
2 | 8.1 | 6834 | Meat(Pork,Poultry) | PintGlass(orBecker,Nonic,Tumbler),Mug(orSeidel... | 32669 | 3 | IrishRedAle | 35-40 | 40-45 | 4.01 |
3 | NaN | 11688 | (Indian,LatinAmerican,PanAsian)General(Aperitif) | PintGlass(orBecker,Nonic,Tumbler),PilsenerGlas... | 130798 | 0 | AmericanMaltLiquor | 35-40 | 35-40 | 0.00 |
4 | 6.0 | 10417 | Meat(Poultry,Fish,Shellfish) | PilsenerGlass(orPokal) | 124087 | 1 | EuroPaleLager | 35-40 | 40-45 | 2.73 |
df.tail(5)
| | ABV | Brewing Company | Food Paring | Glassware Used | Beer Name | Ratings | Style Name | Cellar Temperature | Serving Temperature | Score |
|---|---|---|---|---|---|---|---|---|---|---|
185638 | 4.5 | 9105 | (Dessert,Aperitif,Digestive) | PintGlass(orBecker,Nonic,Tumbler),Mug(orSeidel... | 141522 | 0 | HerbedSpicedBeer | NaN | 45-50 | 0.00 |
185639 | 4.5 | 3348 | (Barbecue,Italian)Cheese(earthyCamembert,Fonti... | PilsenerGlass(orPokal) | 85557 | 1 | AmericanPaleLager | 35-40 | 40-45 | 4.19 |
185640 | NaN | 8216 | Cheese(earthyCamembert,Fontina,nuttyAsiago,Col... | PintGlass(orBecker,Nonic,Tumbler),Mug(orSeidel... | 105072 | 1 | EnglishBrownAle | 40-45 | 45-50 | 3.11 |
185641 | 6.2 | 1755 | (Curried,Thai)Cheese(pepperyMontereyPepperJack... | PintGlass(orBecker,Nonic,Tumbler),Mug(orSeidel... | 70788 | 2 | AmericanIPA | 40-45 | 45-50 | 3.40 |
185642 | 6.4 | 4341 | (Curried,Thai)Cheese(pepperyMontereyPepperJack... | PintGlass(orBecker,Nonic,Tumbler),Mug(orSeidel... | 149979 | 1 | AmericanIPA | 40-45 | 45-50 | 4.31 |
df.isnull().values.any()
True
df.describe()
| | ABV | Brewing Company | Beer Name | Score |
|---|---|---|---|---|
count | 170513.000000 | 185643.000000 | 185643.000000 | 185643.000000 |
mean | 6.354961 | 7008.757659 | 83738.220111 | 3.198432 |
std | 1.907205 | 3914.168053 | 48520.065146 | 1.358862 |
min | 0.010000 | 0.000000 | 0.000000 | 0.000000 |
25% | 5.000000 | 3825.000000 | 41232.500000 | 3.270000 |
50% | 6.000000 | 7111.000000 | 83335.000000 | 3.710000 |
75% | 7.200000 | 10402.000000 | 125148.500000 | 3.970000 |
max | 80.000000 | 13541.000000 | 168534.000000 | 5.000000 |
df['Ratings'] = df['Ratings'].astype('category')
df['Ratings'].head(5)
0    22
1     1
2     3
3     0
4     1
Name: Ratings, dtype: category
Categories (1824, object): [0, 1, 1,000, 1,001, ..., 995, 996, 997, 999]
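Notice that the 1,824 categories include values like "1,000" — the ratings are really counts stored as strings with thousands separators. If you would rather treat them as numbers than as categories, a hedged sketch of the conversion (using a small stand-in Series rather than the real column):

```python
import pandas as pd

# 'Ratings' arrives as strings, some with thousands separators like "1,000"
ratings = pd.Series(["22", "1", "1,000", "0", "995"])

# Strip the commas and convert; errors='coerce' turns anything unparseable into NaN
ratings_numeric = pd.to_numeric(
    ratings.str.replace(",", "", regex=False), errors="coerce"
)
print(ratings_numeric.tolist())  # [22, 1, 1000, 0, 995]
```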
df['Cellar Temperature'] = df['Cellar Temperature'].astype('category')
df['Cellar Temperature'].head(5)
0    40-45
1    35-40
2    35-40
3    35-40
4    35-40
Name: Cellar Temperature, dtype: category
Categories (3, object): [35-40, 40-45, 45-50]
df['Serving Temperature'] = df['Serving Temperature'].astype('category')
df['Serving Temperature'].head(5)
0    45-50
1    40-45
2    40-45
3    35-40
4    40-45
Name: Serving Temperature, dtype: category
Categories (4, object): [35-40, 40-45, 45-50, 50-55]
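Both temperature columns are ordered bands, so a model can also use them numerically. One option (an illustration, not the only encoding) is to map each band like "35-40" to its midpoint:

```python
import pandas as pd

def range_midpoint(value):
    """Turn a temperature band like '35-40' into its numeric midpoint (37.5)."""
    if pd.isna(value):
        return None  # leave missing bands missing; impute separately
    low, high = value.split("-")
    return (float(low) + float(high)) / 2

bands = ["35-40", "40-45", "45-50"]
print([range_midpoint(b) for b in bands])  # [37.5, 42.5, 47.5]
```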
print("# rows in dataframe : {0}".format(len(df)))
print("# rows missing ABV : {0}".format(len(df.loc[np.isnan(df['ABV'])])))
print("# rows missing Brewing Company : {0}".format(len(df.loc[np.isnan(df['Brewing Company'])])))
print("# rows missing Food Paring : {0}".format(len(df.loc[df['Food Paring']==''])))
print("# rows missing Glassware Used : {0}".format(len(df.loc[df['Glassware Used']==''])))
print("# rows missing Beer Name : {0}".format(len(df.loc[np.isnan(df['Beer Name'])])))
print("# rows missing Ratings : {0}".format(len(df.loc[df['Ratings']==''])))
print("# rows missing Style Name : {0}".format(len(df.loc[df['Style Name']== ''])))
print("# rows missing Cellar Temperature : {0}".format(len(df.loc[df['Cellar Temperature'].isnull()])))
print("# rows missing Serving Temperature : {0}".format(len(df.loc[df['Serving Temperature'].isnull()])))
# rows in dataframe : 185643
# rows missing ABV : 15130
# rows missing Brewing Company : 0
# rows missing Food Paring : 0
# rows missing Glassware Used : 0
# rows missing Beer Name : 0
# rows missing Ratings : 0
# rows missing Style Name : 0
# rows missing Cellar Temperature : 6781
# rows missing Serving Temperature : 193
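ABV and the two temperature columns carry the missing values, so they need imputation before modelling. A minimal sketch on a toy frame (the column names come from the dataset; the fill strategies — median for the numeric column, most frequent band for the categorical one — are one reasonable choice, not the only one):

```python
import numpy as np
import pandas as pd

# Toy frame with the same kinds of gaps the real data shows
toy = pd.DataFrame({
    "ABV": [6.5, np.nan, 8.1, 5.0],
    "Cellar Temperature": ["40-45", "35-40", None, "35-40"],
})

# Numeric column: fill with the median, which is robust to outliers like ABV = 80
toy["ABV"] = toy["ABV"].fillna(toy["ABV"].median())

# Categorical column: fill with the most frequent band
toy["Cellar Temperature"] = toy["Cellar Temperature"].fillna(
    toy["Cellar Temperature"].mode()[0]
)

print(toy["ABV"].tolist())                 # [6.5, 6.5, 8.1, 5.0]
print(toy["Cellar Temperature"].tolist())  # ['40-45', '35-40', '35-40', '35-40']
```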
def plot_corr(df, size=10):
"""
Function plots a graphical correlation matrix for each pair of columns in the dataframe.
Input:
df: pandas DataFrame
size: vertical and horizontal size of the plot
Displays:
    matrix of correlation between columns. With the default colormap,
    darker => less correlated, brighter => more correlated (0 to 1).
    Expect a bright line running from top left to bottom right, since
    each column correlates perfectly with itself.
"""
    corr = df.corr(numeric_only=True) # correlate the numeric columns only (object columns would raise in recent pandas)
fig, ax = plt.subplots(figsize=(size, size))
ax.matshow(corr) # color code the rectangles by correlation value
plt.xticks(range(len(corr.columns)), corr.columns) # draw x tick marks
plt.yticks(range(len(corr.columns)), corr.columns) # draw y tick marks
plot_corr(df)
df.corr()
from sklearn.model_selection import train_test_split # sklearn.cross_validation was removed in scikit-learn 0.20
featured_col_names = ['ABV', 'Brewing Company', 'Beer Name', 'Ratings', 'Style Name', 'Cellar Temperature', 'Serving Temperature']
predicted_class_name = ['Score']
x = df[featured_col_names].values # predictor feature columns (7 x m)
y = df[predicted_class_name].values # predicted Score column (1 x m)
split_test_size = 0.30
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = split_test_size, random_state = 42)
# test_size = 0.30 holds out 30% of the rows; random_state = 42 fixes the seed so the split is reproducible
print("{0:0.2f}% in Training set".format((len(x_train)/len(df.index))*100))
print("{0:0.2f}% in Test set".format((len(x_test)/len(df.index))*100))
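With the split in place, the remaining step is to fit a regressor and evaluate it with the competition's RMSE metric. A hedged end-to-end sketch on synthetic numeric data (the real features would first need the categorical encoding discussed above; RandomForestRegressor is chosen purely for illustration, not prescribed by the competition):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for encoded numeric beer features
rng = np.random.RandomState(42)
X = rng.rand(500, 4)
y = 5 * X[:, 0] + rng.normal(scale=0.1, size=500)  # target mostly driven by one feature

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

pred = model.predict(X_test)
rmse = np.sqrt(np.mean((y_test - pred) ** 2))
print(round(rmse, 3))  # small, since the target is nearly linear in X[:, 0]
```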