#!/usr/bin/env python
# coding: utf-8

# # Scaling data with `scikit-learn`
#
# Many machine learning techniques require standardized data. In this notebook, we discuss typical standardization schemes offered by the `preprocessing` module of `scikit-learn`.

# In[1]:


import io
import pandas
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler


# Let us assume we have the following data, read using the `pandas.read_csv` function:

# In[2]:


file_content = io.StringIO("""age;weight;height
23;70;180
22;65;160
31;80;190
26;80;175
22;65;170
""")
df = pandas.read_csv(file_content, sep=";")
print(df)


# The three variables have rather different means and variances, which can be problematic for some machine learning tools:

# In[3]:


print("Means")
print(df.mean(axis=0))
print("\nStandard deviations")
# pandas computes the sample standard deviation (ddof=1) by default
print(df.std(axis=0))


# For such cases, `scikit-learn` offers scaler objects that rescale data on a per-variable basis. In this tutorial, we will present the following scalers:
# * [`StandardScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) rescales the data to have zero mean and unit variance (take a look at the doc if you want to do either unit variance normalization only or zero mean normalization only);
# * [`MinMaxScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler) rescales the data to lie in the [0,1] interval (take a look at the doc if you want to change the interval boundaries);
# * [`MaxAbsScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler) divides each variable by its maximum absolute value, so that the data lies in the [-1,1] interval.
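
# The three rescaling rules listed above can be written out by hand. The cell below is a minimal sketch using plain `numpy`, assuming the default settings of each scaler; the variable names (`X`, `standardized`, `min_max_scaled`, `max_abs_scaled`) are illustrative and not part of `scikit-learn`. It is meant to show the arithmetic, not to replace the scaler objects.

```python
import numpy as np

# Same toy data as above: columns are age, weight, height.
X = np.array([
    [23, 70, 180],
    [22, 65, 160],
    [31, 80, 190],
    [26, 80, 175],
    [22, 65, 170],
], dtype=float)

# StandardScaler rule: subtract the column mean, then divide by the
# column standard deviation (population formula, ddof=0).
standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# MinMaxScaler rule with the default (0, 1) range: shift by the column
# minimum, then divide by the column range.
col_min = X.min(axis=0)
col_max = X.max(axis=0)
min_max_scaled = (X - col_min) / (col_max - col_min)

# MaxAbsScaler rule: divide by the largest absolute value per column.
max_abs_scaled = X / np.abs(X).max(axis=0)

print("Standardized means:", standardized.mean(axis=0))
print("Min-max range:", min_max_scaled.min(axis=0), min_max_scaled.max(axis=0))
print("Max-abs maxima:", np.abs(max_abs_scaled).max(axis=0))
```

# Note that `numpy`'s `std` uses `ddof=0` by default while `pandas`' `std` uses `ddof=1`, so standard deviations printed from a dataframe and from a scaled array are not computed with the same formula.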
# In[4]:


scaler = StandardScaler()
scaler.fit(df)  # learn the per-variable mean and standard deviation
df_scaled = scaler.transform(df)
print("Scaled data")
print(df_scaled)
print("\nMeans")
print(df_scaled.mean(axis=0))
print("\nStandard deviations")
print(df_scaled.std(axis=0))


# Note that `pandas` dataframes are turned into `numpy` arrays after scaling (by default, `scikit-learn` scalers return `numpy` arrays).
#
# Once the scaler has been fitted to the data, the `transform` method turns unscaled data into its scaled equivalent, while `inverse_transform` maps scaled data back to its unscaled representation:

# In[5]:


print(scaler.inverse_transform(df_scaled))


# Other scalers can be used in a similar manner:

# In[6]:


scaler = MinMaxScaler()
scaler.fit(df)  # learn the per-variable minimum and maximum
df_scaled = scaler.transform(df)
print("Scaled data")
print(df_scaled)
print("\nMinimum values")
print(df_scaled.min(axis=0))
print("\nMaximum values")
print(df_scaled.max(axis=0))
print("\nInverse transforms")
print(scaler.inverse_transform(df_scaled))


# In[7]:


scaler = MaxAbsScaler()
scaler.fit(df)  # learn the per-variable maximum absolute value
df_scaled = scaler.transform(df)
print("Scaled data")
print(df_scaled)
print("\nMinimum values")
print(df_scaled.min(axis=0))
print("\nMaximum values")
print(df_scaled.max(axis=0))
print("\nInverse transforms")
print(scaler.inverse_transform(df_scaled))
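
# Since the scalers return `numpy` arrays, the column names of the original dataframe are lost after scaling. One possible way to keep them, sketched below, is to wrap the scaled array in a new dataframe that reuses the original column labels and index. The name `df_scaled_labeled` is hypothetical, and `fit_transform` is used here as a shorthand for calling `fit` followed by `transform` on the same data.

```python
import io

import pandas
from sklearn.preprocessing import StandardScaler

# Same toy data as in the rest of the notebook.
file_content = io.StringIO("""age;weight;height
23;70;180
22;65;160
31;80;190
26;80;175
22;65;170
""")
df = pandas.read_csv(file_content, sep=";")

scaler = StandardScaler()
scaled_array = scaler.fit_transform(df)  # numpy array, labels are dropped

# Rebuild a dataframe with the original column names and row index.
df_scaled_labeled = pandas.DataFrame(
    scaled_array, columns=df.columns, index=df.index
)
print(df_scaled_labeled)
```

# The same wrapping applies to the output of `MinMaxScaler` and `MaxAbsScaler`, as well as to the output of `inverse_transform`.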