This dataset relates to the red variants of the Portuguese "Vinho Verde" wine. Due to privacy and logistics issues, only physicochemical (input) and sensory (output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
The dataset can be viewed as a classification or regression task. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones).
This dataset is also available from the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality ; I just shared it on Kaggle for convenience. (If I am mistaken and the public license type disallows me from doing so, I will take this down if requested.)
Original dataset was obtained from https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4):547-553, 2009.
Most acids involved with wine are fixed or nonvolatile (they do not evaporate readily).
The amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegary taste.
Found in small quantities, citric acid can add 'freshness' and flavor to wines.
The amount of sugar remaining after fermentation stops. It is rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet.
The amount of salt in the wine.
free sulfur dioxide
the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.
total sulfur dioxide
The amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
The density of wine is close to that of water, depending on the percent alcohol and sugar content.
Describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale.
A wine additive which can contribute to sulfur dioxide (SO2) levels; SO2 acts as an antimicrobial and antioxidant.
The percent alcohol content of the wine.
output variable (based on sensory data, score between 0 and 10).
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import statsmodels.stats.api as sm
%pylab inline
Populating the interactive namespace from numpy and matplotlib
wine = pd.read_csv('winequality-red.csv', sep=',', header=0)
wine.head()
| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
Let's do some plotting to see how the data columns are distributed in the dataset.
fig = plt.figure(figsize=(10, 6))
sns.barplot(x='quality', y='fixed acidity', data=wine)
Here we see that fixed acidity shows no clear trend with quality.
fig = plt.figure(figsize=(10, 6))
sns.barplot(x='quality', y='volatile acidity', data=wine)
Here we see a clear downward trend in volatile acidity as quality goes up.
fig = plt.figure(figsize=(10, 6))
sns.barplot(x='quality', y='citric acid', data=wine)
Citric acid content increases as wine quality goes up.
fig = plt.figure(figsize=(10, 6))
sns.barplot(x='quality', y='residual sugar', data=wine)
Residual sugar shows no clear trend with quality.
fig = plt.figure(figsize=(10, 6))
sns.barplot(x='quality', y='chlorides', data=wine)
Chloride content also goes down as wine quality goes up.
fig = plt.figure(figsize=(10, 6))
sns.barplot(x='quality', y='free sulfur dioxide', data=wine)
Free sulfur dioxide shows no clear trend with quality.
fig = plt.figure(figsize=(10, 6))
sns.barplot(x='quality', y='total sulfur dioxide', data=wine)
fig = plt.figure(figsize=(10, 6))
sns.barplot(x='quality', y='sulphates', data=wine)
Sulphate levels go up with wine quality.
fig = plt.figure(figsize=(10, 6))
sns.barplot(x='quality', y='alcohol', data=wine)
Alcohol content also goes up as wine quality increases.
Here is the distribution of expert assessments of wines in the sample:
plt.figure(figsize=(8, 6))
stat = wine.groupby('quality')['quality'].agg(lambda x: float(len(x)) / wine.shape[0])
stat.plot(kind='bar', fontsize=14, width=0.9, color='red')
plt.xticks(rotation=0)
plt.ylabel('Proportion', fontsize=14)
plt.xlabel('Quality', fontsize=14)
We can see that the distribution is roughly bell-shaped: most wines are rated 5 or 6, with few very poor or excellent ones.
Let's look at a heatmap of the correlations between features.
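The cell that drew the heatmap did not survive the export; here is a minimal sketch of how it can be done with seaborn. The small synthetic frame is a stand-in so the snippet runs on its own; with the real data, pass `wine.corr()` from the DataFrame loaded above.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the `wine` DataFrame loaded earlier.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.normal(size=(100, 4)),
                     columns=['fixed acidity', 'volatile acidity', 'pH', 'quality'])

corr = frame.corr()  # pairwise Pearson correlations, values in [-1, 1]
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Feature correlation heatmap')
```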
<matplotlib.axes._subplots.AxesSubplot at 0x23354ca5d68>
Interestingly, we can now see that the acidity and pH level of a wine are inversely related.
This result illustrates a fact from chemistry: pH = -log10[H+], where [H+] in effect represents acidity. So higher acidity means a lower pH level.
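A quick numeric illustration of that formula (the hydrogen-ion concentration below is an illustrative value in the typical wine range, not taken from the dataset):

```python
import math

# pH = -log10([H+]): a larger hydrogen-ion concentration (more acidic)
# gives a lower pH.
hydrogen_ion = 10 ** -3.4          # illustrative [H+] in mol/L
ph = -math.log10(hydrogen_ion)
print(round(ph, 2))  # 3.4
```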
plt.figure(figsize=(10, 8))
plt.scatter(wine['fixed acidity'], wine['pH'])
plt.xlabel('Acidity').set_size(20)
plt.ylabel('pH').set_size(20)
However, there is no direct connection between total acidity and pH (it is possible to find wines with both a high pH and high acidity). In wine tasting, the term "acidity" refers to the fresh, tart and sour attributes of the wine, which are evaluated in relation to how well the acidity balances out the sweetness and the bitter components of the wine, such as tannins.
Acidity in wine is important.
As much as modern health has demonized acidic foods, acidity is an essential trait in wine that’s necessary for quality. Great wines are in balance with their 4 fundamental traits (Acidity, tannin, alcohol and sweetness) and as wines age, the acidity acts as a buffer to preserve the wine longer. For example, Sauternes, a wine with both high acidity and sweetness, is known to age several decades.
Summarising the previous results, we can see that:
The classes are ordered but imbalanced, with most wines concentrated in the middle quality scores. Several features (volatile acidity, citric acid, chlorides, sulphates, alcohol) show clear trends with quality, which means they should carry weight in a model. This suggests using MSE as a measure of estimator quality: it is always non-negative, and values closer to zero are better.
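As a toy illustration of the metric (the scores below are made-up quality values, not model output):

```python
from sklearn.metrics import mean_squared_error

# MSE averages the squared differences between true and predicted scores;
# it is always non-negative and equals 0 only for a perfect fit.
y_true = [5, 6, 5, 7]
y_pred = [5, 5, 6, 7]
print(mean_squared_error(y_true, y_pred))  # 0.5
```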
For our task we will consider the following models: logistic regression, k-nearest neighbors, a decision tree, a random forest, SVM and XGBoost.
Check how many NaN values there are in the dataset:
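The check itself is a one-liner; a minimal sketch on a tiny stand-in frame so it runs on its own (on the real data, call these on the `wine` DataFrame loaded above):

```python
import pandas as pd

# Tiny stand-in frame; on the real data, use the `wine` DataFrame.
frame = pd.DataFrame({'fixed acidity': [7.4, 7.8], 'quality': [5, 5]})

print(frame.isnull().sum())        # NaN count per column
print(frame.isnull().sum().sum())  # total NaN count across the frame
frame.info()                       # dtypes and non-null counts per column
```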
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
fixed acidity           1599 non-null float64
volatile acidity        1599 non-null float64
citric acid             1599 non-null float64
residual sugar          1599 non-null float64
chlorides               1599 non-null float64
free sulfur dioxide     1599 non-null float64
total sulfur dioxide    1599 non-null float64
density                 1599 non-null float64
pH                      1599 non-null float64
sulphates               1599 non-null float64
alcohol                 1599 non-null float64
quality                 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
We can see that we don't have any NaN values (very good). So most of the preprocessing is done.
# Start Modeling
# Split training and testing data (within the training data file only)
# Hold-out method validation
from sklearn.model_selection import train_test_split

X_train = wine.iloc[:, :-1]       # all 11 physicochemical input features (everything except 'quality')
Y_train = wine.loc[:, 'quality']  # the target

# Scaling features (only the features, NOT the target)
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Scaling - brings values to the range (0 to 1)
ScaleFn = MinMaxScaler()
X_Scale = ScaleFn.fit_transform(X_train)

# Standardise - brings values to zero mean, unit variance
ScaleFn = StandardScaler()
X_Strd = ScaleFn.fit_transform(X_train)
from sklearn.decomposition import PCA
pca = PCA()
x_pca = pca.fit_transform(X_Strd)
Plot the cumulative explained variance to find how many principal components to keep.
plt.figure(figsize=(10, 10))
plt.plot(np.cumsum(pca.explained_variance_ratio_), 'ro-')
plt.grid()
As per the graph, the first 8 principal components account for roughly 90% of the variation in the data.
We shall pick the first 8 components for our prediction.
pca_new = PCA(n_components=8)
x_new = pca_new.fit_transform(X_Strd)
print(x_new)
[[-1.85550029 -0.11566443 -1.09357046 ... -0.75643601 -0.54724227 -0.12543802] [-0.80070544 1.49109015 -1.17367872 ... -0.24160267 1.08018456 -0.54859163] [-0.84153176 0.44660569 -1.03827391 ... -0.2839886 0.65509512 -0.22355232] ... [-0.75716274 0.69393496 0.80821779 ... -0.22768437 0.05590351 -0.79624827] [-1.63220619 1.13054855 0.63442137 ... -0.43474156 -0.40891784 -0.41109861] [ 0.4677948 -0.11499324 1.1798357 ... 0.6522725 -0.23266225 0.34922552]]
test_size = .30
seedNo = 11
X_train, X_test, Y_train, Y_test = train_test_split(x_new, Y_train, test_size=test_size, random_state=seedNo)
print("train X", X_train.shape)
print("train Y", Y_train.shape)
print("test X", X_test.shape)
print("test Y", Y_test.shape)

# Choose algorithms and train
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

mymodel = []
mymodel.append(('LogReg', LogisticRegression()))
mymodel.append(('KNN', KNeighborsClassifier()))
mymodel.append(('DeciTree', DecisionTreeClassifier()))
mymodel.append(('RandForest', RandomForestClassifier()))
mymodel.append(('SVM', SVC()))
mymodel.append(('XGBoost', XGBClassifier()))

All_model_result = []
All_model_name = []
for algoname, algorithm in mymodel:
    kfoldFn = KFold(n_splits=11, shuffle=True, random_state=seedNo)
    Eval_result = cross_val_score(algorithm, X_train, Y_train, cv=kfoldFn, scoring='accuracy')
    All_model_result.append(Eval_result)
    All_model_name.append(algoname)
    print("Modelname and Model accuracy:", algoname, 100 * Eval_result.mean(), "%")
train X (1119, 8)
train Y (1119,)
test X (480, 8)
test Y (480,)
Modelname and Model accuracy: LogReg 54.50927445685744 %
Modelname and Model accuracy: KNN 55.67939146855863 %
Modelname and Model accuracy: DeciTree 57.461040221669215 %
Modelname and Model accuracy: RandForest 61.93325214874428 %
Modelname and Model accuracy: SVM 59.42623674132119 %
Modelname and Model accuracy: XGBoost 60.771959548896056 %
Build a random forest regression model:
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=3)
rf.fit(X_train, Y_train)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=3, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False)
The quality improved even further, although the model overfits:
True wine quality scores versus the random forest's predictions:
plt.figure(figsize=(16, 7))

plt.subplot(121)
plt.scatter(Y_train, rf.predict(X_train), color='red', alpha=0.1)
plt.xlim(2.5, 9.5)
plt.ylim(2.5, 9.5)
plt.plot(range(11), color='black')
plt.grid()
plt.title('Train set', fontsize=20)
plt.xlabel('Quality', fontsize=14)
plt.ylabel('Estimated quality', fontsize=14)

plt.subplot(122)
plt.scatter(Y_test, rf.predict(X_test), color='red', alpha=0.1)
plt.xlim(2.5, 9.5)
plt.ylim(2.5, 9.5)
plt.plot(range(11), color='black')
plt.grid()
plt.title('Test set', fontsize=20)
plt.xlabel('Quality', fontsize=14)
plt.ylabel('Estimated quality', fontsize=14)
The coefficient of determination for the random forest:
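The cell that computed it did not survive the export; here is a sketch of how R² can be computed with scikit-learn. Synthetic data stands in for the PCA-transformed features so the snippet is self-contained; with the real model, call `r2_score(Y_test, rf.predict(X_test))` or, equivalently, `rf.score(X_test, Y_test)`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic regression data standing in for the PCA-transformed features.
rng = np.random.default_rng(11)
X = rng.normal(size=(200, 8))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)

rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=3, random_state=0)
rf.fit(X, y)
r2 = r2_score(y, rf.predict(X))  # R^2 on the training data, close to 1
print(r2)
```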
In conclusion, I want to say that we can:
Thank you for reading!