The goal of this project is to create a model that estimates the subjective rating of a wine as graded by experts, using as inputs (features) the physicochemical properties of the wine.
There are two datasets, one for white wine (4,898 wines) and one for red (1,599 wines). They are all variants of the Portuguese "Vinho Verde" wine.
This dataset could be used as a classification or a regression problem. I will be doing regression. Since wine properties are very different for white and red wines, I will be doing a separate analysis for each type of wine. I will be only doing the white wine dataset in this notebook. There is another notebook for the red wine.
We found that it is possible to achieve reasonably accurate predictions of the rating by using a Random Forest approach. Using this data based approach can help wine producers supplement their traditional tasting methods.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis.
"Modeling wine preferences by data mining from physicochemical properties."
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
@Elsevier: http://dx.doi.org/10.1016/j.dss.2009.05.016
Pre-press (pdf): http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
bib: http://www3.dsi.uminho.pt/pcortez/dss09.bib
# import basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('darkgrid')
import pickle
# import pipeline and preprocessing libraries
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# import model selection and metrics libraries
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score, mean_squared_error
# import the different models we are going to use
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
# read the datasets
df = pd.read_csv('winequality-white.csv', sep=';')
Let's look at the first five lines of our data:
display(df.head())
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.0010 | 3.00 | 0.45 | 8.8 | 6 |
1 | 6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.9940 | 3.30 | 0.49 | 9.5 | 6 |
2 | 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
And the last 5 as well:
display(df.tail())
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
4893 | 6.2 | 0.21 | 0.29 | 1.6 | 0.039 | 24.0 | 92.0 | 0.99114 | 3.27 | 0.50 | 11.2 | 6 |
4894 | 6.6 | 0.32 | 0.36 | 8.0 | 0.047 | 57.0 | 168.0 | 0.99490 | 3.15 | 0.46 | 9.6 | 5 |
4895 | 6.5 | 0.24 | 0.19 | 1.2 | 0.041 | 30.0 | 111.0 | 0.99254 | 2.99 | 0.46 | 9.4 | 6 |
4896 | 5.5 | 0.29 | 0.30 | 1.1 | 0.022 | 20.0 | 110.0 | 0.98869 | 3.34 | 0.38 | 12.8 | 7 |
4897 | 6.0 | 0.21 | 0.38 | 0.8 | 0.020 | 22.0 | 98.0 | 0.98941 | 3.26 | 0.32 | 11.8 | 6 |
We can also verify that the dataset has 1,599 wines.
print(df.shape)
(4898, 12)
Let's now take a look at the data type of the variables:
print(df.dtypes)
fixed acidity float64 volatile acidity float64 citric acid float64 residual sugar float64 chlorides float64 free sulfur dioxide float64 total sulfur dioxide float64 density float64 pH float64 sulphates float64 alcohol float64 quality int64 dtype: object
We can see the basic properties of the different features, they all seem to make sense. Since the features are have different ranges of values we will need to do some standardization before using the machine learning models.
display(df.describe())
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 | 4898.000000 |
mean | 6.854788 | 0.278241 | 0.334192 | 6.391415 | 0.045772 | 35.308085 | 138.360657 | 0.994027 | 3.188267 | 0.489847 | 10.514267 | 5.877909 |
std | 0.843868 | 0.100795 | 0.121020 | 5.072058 | 0.021848 | 17.007137 | 42.498065 | 0.002991 | 0.151001 | 0.114126 | 1.230621 | 0.885639 |
min | 3.800000 | 0.080000 | 0.000000 | 0.600000 | 0.009000 | 2.000000 | 9.000000 | 0.987110 | 2.720000 | 0.220000 | 8.000000 | 3.000000 |
25% | 6.300000 | 0.210000 | 0.270000 | 1.700000 | 0.036000 | 23.000000 | 108.000000 | 0.991723 | 3.090000 | 0.410000 | 9.500000 | 5.000000 |
50% | 6.800000 | 0.260000 | 0.320000 | 5.200000 | 0.043000 | 34.000000 | 134.000000 | 0.993740 | 3.180000 | 0.470000 | 10.400000 | 6.000000 |
75% | 7.300000 | 0.320000 | 0.390000 | 9.900000 | 0.050000 | 46.000000 | 167.000000 | 0.996100 | 3.280000 | 0.550000 | 11.400000 | 6.000000 |
max | 14.200000 | 1.100000 | 1.660000 | 65.800000 | 0.346000 | 289.000000 | 440.000000 | 1.038980 | 3.820000 | 1.080000 | 14.200000 | 9.000000 |
It is good to see there are no missing values in our data:
print(df.isnull().sum())
fixed acidity 0 volatile acidity 0 citric acid 0 residual sugar 0 chlorides 0 free sulfur dioxide 0 total sulfur dioxide 0 density 0 pH 0 sulphates 0 alcohol 0 quality 0 dtype: int64
Let's take a look at the distribution of the features
df.hist(figsize=(14,14), xrot=-45, column=sorted(df.columns))
plt.show()
We can visualize the distribution of our target variable vs the other features to look for patterns that could help us in our analysis.
The violin plots can also help us look for outliers in the data.
columns = list(df.columns)
columns.remove('quality')
for col in columns:
sns.violinplot(x="quality", y=col, data=df)
plt.show()
We can also take a look at the correlations between the variables with the help of a heatmap.
# calculate correlations
correlations = df.corr()
# create a mask to see only the bottom triangle
mask = np.zeros_like(correlations)
mask[np.triu_indices_from(mask)] = 1
# crate the heatmap
sns.set_style('white')
plt.figure(figsize=(9,8))
sns.heatmap(correlations * 100,
cmap='RdBu_r',
annot=True,
fmt='.0f',
mask=mask,
cbar=False)
plt.show()
The highest correlation is between density and residual sugar. The second highest correlation is between total sulfur dioxide and free sulfur dioxide which makes sense.
Quality, our target variable, has its highest correlation with alcohol_and _density but neither is that high.
With this correlation information, we can perform some additional visual exploration.
First, let's explore our target variable with the two features that it has the highest correlation: alcohol and density:
sns.violinplot(x="quality", y="alcohol", data=df)
plt.show()
sns.violinplot(x="quality", y="density", data=df)
plt.show()
Let's visualize 3 variable pairs with high correlation:
sns.scatterplot(x="density", y="residual sugar", data=df)
plt.show()
sns.scatterplot(x="density", y="alcohol", data=df)
plt.show()
sns.scatterplot(x="total sulfur dioxide", y="free sulfur dioxide", data=df)
plt.show()
Look for duplicated rows
duplicated_rows = df[df.duplicated()]
display(duplicated_rows)
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.99560 | 3.19 | 0.40 | 9.900000 | 6 |
5 | 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30.0 | 97.0 | 0.99510 | 3.26 | 0.44 | 10.100000 | 6 |
7 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.00100 | 3.00 | 0.45 | 8.800000 | 6 |
8 | 6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.99400 | 3.30 | 0.49 | 9.500000 | 6 |
20 | 6.2 | 0.66 | 0.48 | 1.2 | 0.029 | 29.0 | 75.0 | 0.98920 | 3.33 | 0.39 | 12.800000 | 8 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4828 | 6.4 | 0.23 | 0.35 | 10.3 | 0.042 | 54.0 | 140.0 | 0.99670 | 3.23 | 0.47 | 9.200000 | 5 |
4850 | 7.0 | 0.36 | 0.35 | 2.5 | 0.048 | 67.0 | 161.0 | 0.99146 | 3.05 | 0.56 | 11.100000 | 6 |
4851 | 6.4 | 0.33 | 0.44 | 8.9 | 0.055 | 52.0 | 164.0 | 0.99488 | 3.10 | 0.48 | 9.600000 | 5 |
4856 | 7.1 | 0.23 | 0.39 | 13.7 | 0.058 | 26.0 | 172.0 | 0.99755 | 2.90 | 0.46 | 9.000000 | 6 |
4880 | 6.6 | 0.34 | 0.40 | 8.1 | 0.046 | 68.0 | 170.0 | 0.99494 | 3.15 | 0.50 | 9.533333 | 6 |
937 rows × 12 columns
This seems to be that several wine tasters rated the same wine similarly. We will keep the duplicates for now.
From our visual exploration, there does not seem to be other structural errors, outliers or missing data that needs to be fixed for now.
Without further domain knowledge, it is hard to choose interaction features or indicator features that could help us engineer new features.
We will leave the data as is for now.
Let's create our analytical base table (ABT):
df.to_csv('analytical_base_table.csv', index=None)
We can treat this problem as a classification problem with 7 classes (all wines have a score of either 3, 4, 5, 6, 7, 8, or 9).
We could also see it as a regression problem since the model can predict a continuous value and it will correctly represent the relative quality of the wine. That is, if our model predicts a wine to have quality 3.1 and the true value was 3, we can see it as good prediction.
I believe it is a better approach to model the problem as a regression problem.
We will use the following algorithms and see which one performs best:
I chose the first three because the regularization they provide normally improves the linear regression, especially in a case like this where there linear regression has many variables to fit.
Random Forest and Gradient Boosting Regressors are normally good choices in regression problems.
For Lasso and Ridge we will tune the alpha parameter. For ElasticNet we will focus on the alpha and the l1 ratio.
For the Random Forest Regressor we will focus on the number of trees and the number of features to use per tree.
For the Gradient Boosting Regressor we will tune the number of estimators, the learning rate and the maximum depth.
Now that we have selected our algorithms, we can start to train them on the white wine data. The first step is to separate the features from our target variable:
# load ABT
df = pd.read_csv('analytical_base_table.csv')
# Create X and y tables
y = df.quality
X = df.drop('quality', axis=1)
We can now split our data into train and test data. We will keep 20% for testing and train and select the model on the train split. We also set random_state to 775 for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2,
random_state=775)
From our exploration, we saw that we have different ranges of values in our data. Our models will be better if the data has similar scales, so we will add a scaling step to our pipeline.
With this dictionary, we can keep our pipelines for each algorithm in the same place. We will set random_state to 111 for repoducibility.
pipelines = {
'lasso' : make_pipeline(StandardScaler(), Lasso(random_state=111)),
'ridge' : make_pipeline(StandardScaler(), Ridge(random_state=111)),
'enet' : make_pipeline(StandardScaler(), ElasticNet(random_state=111)),
'rf' : make_pipeline(StandardScaler(), RandomForestRegressor(random_state=111)),
'gb' : make_pipeline(StandardScaler(), GradientBoostingRegressor(random_state=111)),
}
As a sanity check, let's confirm that the pipelines were created:
for pipeline in pipelines:
print(pipeline, type(pipelines[pipeline]))
lasso <class 'sklearn.pipeline.Pipeline'> ridge <class 'sklearn.pipeline.Pipeline'> enet <class 'sklearn.pipeline.Pipeline'> rf <class 'sklearn.pipeline.Pipeline'> gb <class 'sklearn.pipeline.Pipeline'>
The next step will be to train our algorithms. We will create a dictionary with the hyperparameters to tune for each algorithm. As an intermediate step, I like to take a look at the tunable parameters for each algorithm so that we can:
for pipeline in pipelines:
display(pipelines[pipeline].get_params())
{'memory': None, 'steps': [('standardscaler', StandardScaler()), ('lasso', Lasso(random_state=111))], 'verbose': False, 'standardscaler': StandardScaler(), 'lasso': Lasso(random_state=111), 'standardscaler__copy': True, 'standardscaler__with_mean': True, 'standardscaler__with_std': True, 'lasso__alpha': 1.0, 'lasso__copy_X': True, 'lasso__fit_intercept': True, 'lasso__max_iter': 1000, 'lasso__normalize': False, 'lasso__positive': False, 'lasso__precompute': False, 'lasso__random_state': 111, 'lasso__selection': 'cyclic', 'lasso__tol': 0.0001, 'lasso__warm_start': False}
{'memory': None, 'steps': [('standardscaler', StandardScaler()), ('ridge', Ridge(random_state=111))], 'verbose': False, 'standardscaler': StandardScaler(), 'ridge': Ridge(random_state=111), 'standardscaler__copy': True, 'standardscaler__with_mean': True, 'standardscaler__with_std': True, 'ridge__alpha': 1.0, 'ridge__copy_X': True, 'ridge__fit_intercept': True, 'ridge__max_iter': None, 'ridge__normalize': False, 'ridge__random_state': 111, 'ridge__solver': 'auto', 'ridge__tol': 0.001}
{'memory': None, 'steps': [('standardscaler', StandardScaler()), ('elasticnet', ElasticNet(random_state=111))], 'verbose': False, 'standardscaler': StandardScaler(), 'elasticnet': ElasticNet(random_state=111), 'standardscaler__copy': True, 'standardscaler__with_mean': True, 'standardscaler__with_std': True, 'elasticnet__alpha': 1.0, 'elasticnet__copy_X': True, 'elasticnet__fit_intercept': True, 'elasticnet__l1_ratio': 0.5, 'elasticnet__max_iter': 1000, 'elasticnet__normalize': False, 'elasticnet__positive': False, 'elasticnet__precompute': False, 'elasticnet__random_state': 111, 'elasticnet__selection': 'cyclic', 'elasticnet__tol': 0.0001, 'elasticnet__warm_start': False}
{'memory': None, 'steps': [('standardscaler', StandardScaler()), ('randomforestregressor', RandomForestRegressor(random_state=111))], 'verbose': False, 'standardscaler': StandardScaler(), 'randomforestregressor': RandomForestRegressor(random_state=111), 'standardscaler__copy': True, 'standardscaler__with_mean': True, 'standardscaler__with_std': True, 'randomforestregressor__bootstrap': True, 'randomforestregressor__ccp_alpha': 0.0, 'randomforestregressor__criterion': 'mse', 'randomforestregressor__max_depth': None, 'randomforestregressor__max_features': 'auto', 'randomforestregressor__max_leaf_nodes': None, 'randomforestregressor__max_samples': None, 'randomforestregressor__min_impurity_decrease': 0.0, 'randomforestregressor__min_impurity_split': None, 'randomforestregressor__min_samples_leaf': 1, 'randomforestregressor__min_samples_split': 2, 'randomforestregressor__min_weight_fraction_leaf': 0.0, 'randomforestregressor__n_estimators': 100, 'randomforestregressor__n_jobs': None, 'randomforestregressor__oob_score': False, 'randomforestregressor__random_state': 111, 'randomforestregressor__verbose': 0, 'randomforestregressor__warm_start': False}
{'memory': None, 'steps': [('standardscaler', StandardScaler()), ('gradientboostingregressor', GradientBoostingRegressor(random_state=111))], 'verbose': False, 'standardscaler': StandardScaler(), 'gradientboostingregressor': GradientBoostingRegressor(random_state=111), 'standardscaler__copy': True, 'standardscaler__with_mean': True, 'standardscaler__with_std': True, 'gradientboostingregressor__alpha': 0.9, 'gradientboostingregressor__ccp_alpha': 0.0, 'gradientboostingregressor__criterion': 'friedman_mse', 'gradientboostingregressor__init': None, 'gradientboostingregressor__learning_rate': 0.1, 'gradientboostingregressor__loss': 'ls', 'gradientboostingregressor__max_depth': 3, 'gradientboostingregressor__max_features': None, 'gradientboostingregressor__max_leaf_nodes': None, 'gradientboostingregressor__min_impurity_decrease': 0.0, 'gradientboostingregressor__min_impurity_split': None, 'gradientboostingregressor__min_samples_leaf': 1, 'gradientboostingregressor__min_samples_split': 2, 'gradientboostingregressor__min_weight_fraction_leaf': 0.0, 'gradientboostingregressor__n_estimators': 100, 'gradientboostingregressor__n_iter_no_change': None, 'gradientboostingregressor__random_state': 111, 'gradientboostingregressor__subsample': 1.0, 'gradientboostingregressor__tol': 0.0001, 'gradientboostingregressor__validation_fraction': 0.1, 'gradientboostingregressor__verbose': 0, 'gradientboostingregressor__warm_start': False}
After checking, we can now create the hyperparameter grid. First we create a dictionary for each algorithm with the hyperparameters we want to tune and the values to use.
lasso_hyperparameters = {
'lasso__alpha' : [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]
}
ridge_hyperparameters = {
'ridge__alpha' : [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10]
}
enet_hyperparameters = {
'elasticnet__alpha' : [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10],
'elasticnet__l1_ratio' : [0.1, 0.3, 0.5, 0.7, 0.9]
}
rf_hyperparameters = {
'randomforestregressor__n_estimators' : [100, 200, 300, 400, 500],
'randomforestregressor__max_features' : ['auto', 'sqrt', 0.33]
}
gb_hyperparameters = {
'gradientboostingregressor__n_estimators' : [100, 200],
'gradientboostingregressor__learning_rate' : [0.05, 0.1, 0.2],
'gradientboostingregressor__max_depth' : [1, 3, 5]
}
And to keep everything in one place, we create a dictionary with all the hyperparameter dictionaries:
hyperparameters = {
'lasso' : lasso_hyperparameters,
'ridge' : ridge_hyperparameters,
'enet' : enet_hyperparameters,
'rf' : rf_hyperparameters,
'gb' : gb_hyperparameters
}
With this setup, we can now loop over our dictionary and train each of our algorithms and tune the hyperparameters all in one step. We will use 10 fold cross-validation for the hyperparameter tuning.
fitted_models = {}
for name, pipeline in pipelines.items():
model = GridSearchCV(pipeline, hyperparameters[name], cv=10)
model.fit(X_train, y_train)
fitted_models[name] = model
print(name, 'has been fitted')
lasso has been fitted ridge has been fitted enet has been fitted rf has been fitted gb has been fitted
It seems that Random Forest gets the best results.
for name, model in fitted_models.items():
print(name, model.best_score_)
lasso 0.27855099133259004 ridge 0.2785310732323244 enet 0.27963894991160776 rf 0.5417770593072861 gb 0.48340324064416196
Let's calculate the R squared error and the Mean Squared error to select the best algorithm:
for name, model in fitted_models.items():
pred = model.predict(X_test)
print(name)
print('-'*8)
print(f'R^2: {r2_score(y_test, pred)}')
print(f'MSE: {mean_squared_error(y_test, pred)}')
print()
lasso -------- R^2: 0.24763868296405078 MSE: 0.5974895730410107 ridge -------- R^2: 0.2491886891065973 MSE: 0.596258631886341 enet -------- R^2: 0.24863075786174593 MSE: 0.5967017143438307 rf -------- R^2: 0.5137982992091297 MSE: 0.3861182653061224 gb -------- R^2: 0.4522951893692222 MSE: 0.4349611098368771
Random Forest gives us the best R squared and the lowest MSE.
We can visualize predicted versus actual values using our trained model.
rf_pred = fitted_models['rf'].predict(X_test)
plt.scatter(rf_pred, y_test)
plt.xlabel('predicted')
plt.ylabel('actual')
plt.show()
We can also get a look at the parameters that worked best for this algorithm:
fitted_models['rf'].best_estimator_
Pipeline(steps=[('standardscaler', StandardScaler()), ('randomforestregressor', RandomForestRegressor(max_features='sqrt', n_estimators=500, random_state=111))])
We can now create a pickle file of our best model
with open('final_model_white.pkl', 'wb') as f:
pickle.dump(fitted_models['rf'].best_estimator_, f)
Random forest beat out boosted trees and regularized regression approaches.
A mean squared error lower that 0.39 is enough to help winemakers to estimate ratings and supplement their regular testing protocols.
This model could be further improved in several ways: