Data standardization with scikit-learn

Many machine learning techniques require standardized data. In this notebook, we discuss typical standardization schemes offered by scikit-learn's preprocessing module: http://scikit-learn.org/stable/modules/preprocessing.html
import io
import pandas
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler
Let us assume we have the following data, read using the pandas.read_csv function:
file_content = io.StringIO("""age;weight;height
23;70;180
22;65;160
31;80;190
26;80;175
22;65;170
""")
df = pandas.read_csv(file_content, sep=";")
print(df)
   age  weight  height
0   23      70     180
1   22      65     160
2   31      80     190
3   26      80     175
4   22      65     170
All three variables have rather different means and variances, which can be problematic for some machine learning tools:
print("Means")
print(df.mean(axis=0))
print("\nStandard deviations")
print(df.std(axis=0))
Means
age        24.8
weight     72.0
height    175.0
dtype: float64

Standard deviations
age        3.834058
weight     7.582875
height    11.180340
dtype: float64
In this case, scikit-learn offers scaler objects that rescale data on a per-variable basis. In this tutorial, we will present the following scalers:

- StandardScaler rescales the data to have zero mean and unit variance (take a look at the doc if you want to do either unit-variance normalization only or zero-mean normalization only);
- MinMaxScaler rescales the data to lie in the [0, 1] interval (take a look at the doc if you want to change the interval boundaries);
- MaxAbsScaler rescales the data so that it lies in the [-1, 1] interval.

scaler = StandardScaler()
scaler.fit(df)
df_scaled = scaler.transform(df)
print("Scaled data")
print(df_scaled)
print("\nMeans")
print(df_scaled.mean(axis=0))
print("\nStandard deviations")
print(df_scaled.std(axis=0))
Scaled data
[[-0.52489066 -0.29488391  0.5       ]
 [-0.81649658 -1.03209369 -1.5       ]
 [ 1.80795671  1.17953565  1.5       ]
 [ 0.34992711  1.17953565  0.        ]
 [-0.81649658 -1.03209369 -0.5       ]]

Means
[ -1.99840144e-16   0.00000000e+00   0.00000000e+00]

Standard deviations
[ 1.  1.  1.]
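As a sanity check, we can recompute the StandardScaler output by hand on the same toy data. One subtlety: StandardScaler divides by the population standard deviation (ddof=0), whereas pandas' DataFrame.std defaults to ddof=1, which is why the standard deviations printed earlier (e.g. 3.834058 for age) differ slightly from the divisors the scaler actually uses.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [23, 22, 31, 26, 22],
                   "weight": [70, 65, 80, 80, 65],
                   "height": [180, 160, 190, 175, 170]})

# Subtract the column mean and divide by the population standard
# deviation (ddof=0), which is what StandardScaler does internally.
manual = (df - df.mean(axis=0)) / df.std(axis=0, ddof=0)
auto = StandardScaler().fit_transform(df)
print(np.allclose(manual.values, auto))  # True
```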
You can notice that pandas dataframes are turned into numpy arrays after scaling (scikit-learn works with numpy arrays).
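If you prefer to keep working with labeled data, one simple option is to wrap the scaled array back into a dataframe yourself, reusing the original columns and index, as in this sketch:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [23, 22, 31, 26, 22],
                   "weight": [70, 65, 80, 80, 65],
                   "height": [180, 160, 190, 175, 170]})

scaler = StandardScaler()
scaler.fit(df)

# transform() returns a numpy array; rebuild a dataframe around it so
# column names and row index are preserved.
df_scaled = pd.DataFrame(scaler.transform(df),
                         columns=df.columns, index=df.index)
print(type(df_scaled))
```

Recent scikit-learn releases (1.2 and later) also let transformers return dataframes directly via scaler.set_output(transform="pandas").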
Once the scaler has been fitted to the data, the transform method turns unscaled data into its scaled equivalent, while inverse_transform transforms scaled data back to its unscaled representation:
print(scaler.inverse_transform(df_scaled))
[[  23.   70.  180.]
 [  22.   65.  160.]
 [  31.   80.  190.]
 [  26.   80.  175.]
 [  22.   65.  170.]]
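As a side note, the fit-then-transform pattern used above can be condensed with the fit_transform method, which fits the scaler and transforms the same data in one call:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [23, 22, 31, 26, 22],
                   "weight": [70, 65, 80, 80, 65],
                   "height": [180, 160, 190, 175, 170]})

# fit() followed by transform() ...
scaler = StandardScaler()
two_steps = scaler.fit(df).transform(df)

# ... is equivalent to a single fit_transform() call.
one_step = StandardScaler().fit_transform(df)
print(np.allclose(two_steps, one_step))  # True
```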
Other scalers can be used in a similar manner:
scaler = MinMaxScaler()
scaler.fit(df)
df_scaled = scaler.transform(df)
print("Scaled data")
print(df_scaled)
print("\nMinimum values")
print(df_scaled.min(axis=0))
print("\nMaximum values")
print(df_scaled.max(axis=0))
print("\nInverse transforms")
print(scaler.inverse_transform(df_scaled))
Scaled data
[[ 0.11111111  0.33333333  0.66666667]
 [ 0.          0.          0.        ]
 [ 1.          1.          1.        ]
 [ 0.44444444  1.          0.5       ]
 [ 0.          0.          0.33333333]]

Minimum values
[ 0.  0.  0.]

Maximum values
[ 1.  1.  1.]

Inverse transforms
[[  23.   70.  180.]
 [  22.   65.  160.]
 [  31.   80.  190.]
 [  26.   80.  175.]
 [  22.   65.  170.]]
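The same kind of hand check works for MinMaxScaler, whose per-column transformation is (x - min) / (max - min). A small sketch on the same toy data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"age": [23, 22, 31, 26, 22],
                   "weight": [70, 65, 80, 80, 65],
                   "height": [180, 160, 190, 175, 170]})

# Map each column so that its minimum becomes 0 and its maximum 1.
manual = (df - df.min(axis=0)) / (df.max(axis=0) - df.min(axis=0))
auto = MinMaxScaler().fit_transform(df)
print(np.allclose(manual.values, auto))  # True
```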
scaler = MaxAbsScaler()
scaler.fit(df)
df_scaled = scaler.transform(df)
print("Scaled data")
print(df_scaled)
print("\nMinimum values")
print(df_scaled.min(axis=0))
print("\nMaximum values")
print(df_scaled.max(axis=0))
print("\nInverse transforms")
print(scaler.inverse_transform(df_scaled))
Scaled data
[[ 0.74193548  0.875       0.94736842]
 [ 0.70967742  0.8125      0.84210526]
 [ 1.          1.          1.        ]
 [ 0.83870968  1.          0.92105263]
 [ 0.70967742  0.8125      0.89473684]]

Minimum values
[ 0.70967742  0.8125      0.84210526]

Maximum values
[ 1.  1.  1.]

Inverse transforms
[[  23.   70.  180.]
 [  22.   65.  160.]
 [  31.   80.  190.]
 [  26.   80.  175.]
 [  22.   65.  170.]]
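One last practical remark: in a real learning setup, the scaler should be fitted on the training data only, and the resulting statistics reused to transform the test data, so that no information leaks from the test set. A minimal sketch, using an arbitrary 3/2 row split of our toy data for illustration:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [23, 22, 31, 26, 22],
                   "weight": [70, 65, 80, 80, 65],
                   "height": [180, 160, 190, 175, 170]})

# Hypothetical split: first 3 rows as training set, last 2 as test set.
train, test = df.iloc[:3], df.iloc[3:]

# Fit on the training rows only; the learned means and scales are then
# applied unchanged to the test rows.
scaler = StandardScaler()
scaler.fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)
print(train_scaled.shape, test_scaled.shape)
```

The same pattern applies to MinMaxScaler and MaxAbsScaler.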