Data standardization with scikit-learn

Many machine learning techniques require standardized data. In this notebook, we discuss typical standardization schemes offered by scikit-learn's preprocessing module: http://scikit-learn.org/stable/modules/preprocessing.html
import io
import pandas
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler
Let us assume we have the following data, read using the pandas.read_csv function:
file_content = io.StringIO("""age;weight;height
23;70;180
22;65;160
31;80;190
26;80;175
22;65;170
""")
df = pandas.read_csv(file_content, sep=";")
print(df)
   age  weight  height
0   23      70     180
1   22      65     160
2   31      80     190
3   26      80     175
4   22      65     170
All three variables have rather different means and variances, which can be problematic for some machine learning tools:
print("Means")
print(df.mean(axis=0))
print("\nStandard deviations")
print(df.std(axis=0))
Means
age        24.8
weight     72.0
height    175.0
dtype: float64

Standard deviations
age        3.834058
weight     7.582875
height    11.180340
dtype: float64
In this case, scikit-learn offers scaler objects that rescale data on a per-variable basis. In this tutorial, we will present the following scalers:

- StandardScaler rescales the data to have zero mean and unit variance (take a look at the doc if you want to do either unit-variance normalization only or zero-mean normalization only);
- MinMaxScaler rescales the data to lie in the [0, 1] interval (take a look at the doc if you want to change the interval boundaries);
- MaxAbsScaler rescales the data so that it lies in the [-1, 1] interval.

scaler = StandardScaler()
scaler.fit(df)
df_scaled = scaler.transform(df)
print("Scaled data")
print(df_scaled)
print("\nMeans")
print(df_scaled.mean(axis=0))
print("\nStandard deviations")
print(df_scaled.std(axis=0))
Scaled data
[[-0.52489066 -0.29488391  0.5       ]
 [-0.81649658 -1.03209369 -1.5       ]
 [ 1.80795671  1.17953565  1.5       ]
 [ 0.34992711  1.17953565  0.        ]
 [-0.81649658 -1.03209369 -0.5       ]]

Means
[ -1.99840144e-16   0.00000000e+00   0.00000000e+00]

Standard deviations
[ 1.  1.  1.]
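As a sanity check, we can recompute the StandardScaler output by hand on the same toy data. One subtlety: StandardScaler divides by the population standard deviation (ddof=0), whereas pandas' DataFrame.std defaults to ddof=1, which is why the standard deviations printed earlier (e.g. 3.834058 for age) differ slightly from the divisors the scaler actually uses.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [23, 22, 31, 26, 22],
                   "weight": [70, 65, 80, 80, 65],
                   "height": [180, 160, 190, 175, 170]})

# Subtract the column mean and divide by the population standard
# deviation (ddof=0), which is what StandardScaler does internally.
manual = (df - df.mean(axis=0)) / df.std(axis=0, ddof=0)
auto = StandardScaler().fit_transform(df)
print(np.allclose(manual.values, auto))  # True
```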
You can notice that pandas dataframes are turned into numpy arrays after scaling (scikit-learn works with numpy arrays).
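If you prefer to keep working with labeled data, one simple option is to wrap the scaled array back into a dataframe yourself, reusing the original columns and index, as in this sketch:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [23, 22, 31, 26, 22],
                   "weight": [70, 65, 80, 80, 65],
                   "height": [180, 160, 190, 175, 170]})

scaler = StandardScaler()
scaler.fit(df)

# transform() returns a numpy array; rebuild a dataframe around it so
# column names and row index are preserved.
df_scaled = pd.DataFrame(scaler.transform(df),
                         columns=df.columns, index=df.index)
print(type(df_scaled))
```

Recent scikit-learn releases (1.2 and later) also let transformers return dataframes directly via scaler.set_output(transform="pandas").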
Once the scaler has been fitted to the data, the transform method turns unscaled data into its scaled equivalent, while inverse_transform transforms scaled data back to its unscaled representation:
print(scaler.inverse_transform(df_scaled))
[[  23.   70.  180.]
 [  22.   65.  160.]
 [  31.   80.  190.]
 [  26.   80.  175.]
 [  22.   65.  170.]]
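As a side note, the fit-then-transform pattern used above can be condensed with the fit_transform method, which fits the scaler and transforms the same data in one call:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [23, 22, 31, 26, 22],
                   "weight": [70, 65, 80, 80, 65],
                   "height": [180, 160, 190, 175, 170]})

# fit() followed by transform() ...
scaler = StandardScaler()
two_steps = scaler.fit(df).transform(df)

# ... is equivalent to a single fit_transform() call.
one_step = StandardScaler().fit_transform(df)
print(np.allclose(two_steps, one_step))  # True
```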
Other scalers can be used in a similar manner:
scaler = MinMaxScaler()
scaler.fit(df)
df_scaled = scaler.transform(df)
print("Scaled data")
print(df_scaled)
print("\nMinimum values")
print(df_scaled.min(axis=0))
print("\nMaximum values")
print(df_scaled.max(axis=0))
print("\nInverse transforms")
print(scaler.inverse_transform(df_scaled))
Scaled data
[[ 0.11111111  0.33333333  0.66666667]
 [ 0.          0.          0.        ]
 [ 1.          1.          1.        ]
 [ 0.44444444  1.          0.5       ]
 [ 0.          0.          0.33333333]]

Minimum values
[ 0.  0.  0.]

Maximum values
[ 1.  1.  1.]

Inverse transforms
[[  23.   70.  180.]
 [  22.   65.  160.]
 [  31.   80.  190.]
 [  26.   80.  175.]
 [  22.   65.  170.]]
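The same kind of hand check works for MinMaxScaler, whose per-column transformation is (x - min) / (max - min). A small sketch on the same toy data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"age": [23, 22, 31, 26, 22],
                   "weight": [70, 65, 80, 80, 65],
                   "height": [180, 160, 190, 175, 170]})

# Map each column so that its minimum becomes 0 and its maximum 1.
manual = (df - df.min(axis=0)) / (df.max(axis=0) - df.min(axis=0))
auto = MinMaxScaler().fit_transform(df)
print(np.allclose(manual.values, auto))  # True
```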
scaler = MaxAbsScaler()
scaler.fit(df)
df_scaled = scaler.transform(df)
print("Scaled data")
print(df_scaled)
print("\nMinimum values")
print(df_scaled.min(axis=0))
print("\nMaximum values")
print(df_scaled.max(axis=0))
print("\nInverse transforms")
print(scaler.inverse_transform(df_scaled))
Scaled data
[[ 0.74193548  0.875       0.94736842]
 [ 0.70967742  0.8125      0.84210526]
 [ 1.          1.          1.        ]
 [ 0.83870968  1.          0.92105263]
 [ 0.70967742  0.8125      0.89473684]]

Minimum values
[ 0.70967742  0.8125      0.84210526]

Maximum values
[ 1.  1.  1.]

Inverse transforms
[[  23.   70.  180.]
 [  22.   65.  160.]
 [  31.   80.  190.]
 [  26.   80.  175.]
 [  22.   65.  170.]]
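One last practical remark: in a real learning setup, the scaler should be fitted on the training data only, and the resulting statistics reused to transform the test data, so that no information leaks from the test set. A minimal sketch, using an arbitrary 3/2 row split of our toy data for illustration:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [23, 22, 31, 26, 22],
                   "weight": [70, 65, 80, 80, 65],
                   "height": [180, 160, 190, 175, 170]})

# Hypothetical split: first 3 rows as training set, last 2 as test set.
train, test = df.iloc[:3], df.iloc[3:]

# Fit on the training rows only; the learned means and scales are then
# applied unchanged to the test rows.
scaler = StandardScaler()
scaler.fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)
print(train_scaled.shape, test_scaled.shape)
```

The same pattern applies to MinMaxScaler and MaxAbsScaler.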