#!/usr/bin/env python
# coding: utf-8

# # Scaling data with `scikit-learn`
# 
# Many machine learning techniques require standardized data. In this notebook, we discuss typical standardization schemes offered by `scikit-learn`'s `preprocessing` module: <http://scikit-learn.org/stable/modules/preprocessing.html>

# In[1]:


import io
import pandas
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler


# Let us assume we have the following data, read using the `pandas.read_csv` function:

# In[2]:


file_content = io.StringIO("""age;weight;height
23;70;180
22;65;160
31;80;190
26;80;175
22;65;170
""")

df = pandas.read_csv(file_content, sep=";")
print(df)


# The three variables have rather different means and standard deviations, which can be problematic for some machine learning tools:

# In[3]:


print("Means")
print(df.mean(axis=0))
print("\nStandard deviations")
print(df.std(axis=0))


# For such cases, `scikit-learn` offers scaler objects that rescale data on a per-variable basis. In this tutorial, we will present the following scalers:
# * [`StandardScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) rescales the data to have zero mean and unit variance (see the `with_mean` and `with_std` parameters in the doc if you want zero-mean normalization only or unit-variance normalization only);
# * [`MinMaxScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler) rescales the data to lie in the [0,1] interval (see the `feature_range` parameter in the doc if you want to change the interval boundaries);
# * [`MaxAbsScaler`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler) divides each variable by its maximum absolute value, so that the data lies in the [-1,1] interval.

# In[4]:


scaler = StandardScaler()
scaler.fit(df)
df_scaled = scaler.transform(df)

print("Scaled data")
print(df_scaled)
print("\nMeans")
print(df_scaled.mean(axis=0))
print("\nStandard deviations")
print(df_scaled.std(axis=0))
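
# To make the effect of `StandardScaler` concrete, the sketch below recomputes the scaling by hand and compares it with the scaler's output. It assumes the same five-row table as above; the names `data`, `manual` and `via_sklearn` are introduced here for illustration only. Note that `StandardScaler` relies on the population standard deviation (`ddof=0`), whereas `pandas` uses `ddof=1` by default.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same toy table as above (illustrative copy, so this cell is self-contained)
data = pd.DataFrame({
    "age": [23, 22, 31, 26, 22],
    "weight": [70, 65, 80, 80, 65],
    "height": [180, 160, 190, 175, 170],
})

# Manual standardization: subtract the per-column mean, then divide by the
# per-column population standard deviation (ddof=0)
manual = (data - data.mean(axis=0)) / data.std(axis=0, ddof=0)

# Same operation through scikit-learn
via_sklearn = StandardScaler().fit_transform(data)

# Both computations should agree up to floating-point error
print(np.allclose(manual.to_numpy(), via_sklearn))
```

# This also explains why the standard deviations printed in the previous cell (computed by `numpy` with `ddof=0`) are equal to 1, while `pandas`' default `std` on the same scaled values would give a slightly larger number.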


# Note that `pandas` dataframes are turned into `numpy` arrays after scaling (by default, `scikit-learn` scalers return `numpy` arrays).
# 
# Once the scaler has been fitted to the data, the `transform` method turns unscaled data into its scaled equivalent, while `inverse_transform` maps scaled data back to its unscaled representation:

# In[5]:


print(scaler.inverse_transform(df_scaled))
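
# As a minimal sketch of how a fitted scaler can be reused, the cell below wraps the scaled array back into a dataframe, checks the round trip numerically, and applies the same scaler to a previously unseen sample. The names `example_df`, `example_scaler`, `scaled_df` and `new_sample`, as well as the values of the new sample, are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative copy of the toy table used in this notebook
example_df = pd.DataFrame({
    "age": [23, 22, 31, 26, 22],
    "weight": [70, 65, 80, 80, 65],
    "height": [180, 160, 190, 175, 170],
})

# Fit once; the scaler stores the per-column means and scales
example_scaler = StandardScaler().fit(example_df)

# transform returns a numpy array; restore column names and index if needed
scaled_df = pd.DataFrame(
    example_scaler.transform(example_df),
    columns=example_df.columns,
    index=example_df.index,
)

# inverse_transform should recover the original values (up to rounding)
restored = example_scaler.inverse_transform(scaled_df)
print(np.allclose(restored, example_df.to_numpy()))

# The fitted scaler can be applied to new data with the same columns
# (hypothetical sample, not part of the original data)
new_sample = pd.DataFrame({"age": [40], "weight": [75], "height": [185]})
new_scaled = example_scaler.transform(new_sample)
print(new_scaled)
```

# The new sample is scaled with the statistics learned during `fit`, not with its own mean and variance, which is what one usually wants when applying a scaler to test data.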


# Other scalers can be used in a similar manner:

# In[6]:


scaler = MinMaxScaler()
scaler.fit(df)
df_scaled = scaler.transform(df)

print("Scaled data")
print(df_scaled)
print("\nMinimum values")
print(df_scaled.min(axis=0))
print("\nMaximum values")
print(df_scaled.max(axis=0))
print("\nInverse transforms")
print(scaler.inverse_transform(df_scaled))
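
# The min-max rescaling can also be written out by hand, which may help to see what the scaler computes. The sketch below assumes the same toy table; the names `minmax_data`, `manual_minmax`, `via_minmax` and `custom_range` are illustrative. It also shows the `feature_range` parameter, used here to target the [-1,1] interval instead of the default [0,1].

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Illustrative copy of the toy table used in this notebook
minmax_data = pd.DataFrame({
    "age": [23, 22, 31, 26, 22],
    "weight": [70, 65, 80, 80, 65],
    "height": [180, 160, 190, 175, 170],
})

# Manual min-max scaling: (x - min) / (max - min), column by column
col_min = minmax_data.min(axis=0)
col_max = minmax_data.max(axis=0)
manual_minmax = (minmax_data - col_min) / (col_max - col_min)

# Same operation through scikit-learn (default feature_range is (0, 1))
via_minmax = MinMaxScaler().fit_transform(minmax_data)
print(np.allclose(manual_minmax.to_numpy(), via_minmax))

# Changing the target interval with feature_range
custom_range = MinMaxScaler(feature_range=(-1, 1)).fit_transform(minmax_data)
print(custom_range.min(axis=0), custom_range.max(axis=0))
```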


# In[7]:


scaler = MaxAbsScaler()
scaler.fit(df)
df_scaled = scaler.transform(df)

print("Scaled data")
print(df_scaled)
print("\nMinimum values")
print(df_scaled.min(axis=0))
print("\nMaximum values")
print(df_scaled.max(axis=0))
print("\nInverse transforms")
print(scaler.inverse_transform(df_scaled))
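
# In the output above, the minimum values are not -1: `MaxAbsScaler` only divides each variable by its maximum absolute value, without shifting the data, so strictly positive variables stay strictly positive. The sketch below reproduces this by hand on the same toy table; the names `maxabs_data`, `manual_maxabs` and `via_maxabs` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler

# Illustrative copy of the toy table used in this notebook
maxabs_data = pd.DataFrame({
    "age": [23, 22, 31, 26, 22],
    "weight": [70, 65, 80, 80, 65],
    "height": [180, 160, 190, 175, 170],
})

# Manual max-abs scaling: divide each column by its largest absolute value
manual_maxabs = maxabs_data / maxabs_data.abs().max(axis=0)

# Same operation through scikit-learn
via_maxabs = MaxAbsScaler().fit_transform(maxabs_data)
print(np.allclose(manual_maxabs.to_numpy(), via_maxabs))

# All values are positive here, so the scaled data lies in (0, 1]
print(via_maxabs.min(axis=0), via_maxabs.max(axis=0))
```

# Because no centering is involved, zeros in the input remain zeros after scaling.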