In machine learning models, features are mapped into an n-dimensional space. Say there are two variables (x, y) mapped into a 2D coordinate system. If one variable, say y, spans a much larger range than the other, x, then the Euclidean distance between points is dominated by the larger one and the smaller one is effectively ignored. Valuable information is lost, and feature scaling is used to solve this problem.
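To see the effect, here is a minimal sketch with made-up age/salary values (the numbers are purely illustrative):

import numpy as np

# Two people described by (age, salary): ages differ a lot, salaries only slightly.
a = np.array([25.0, 50000.0])
b = np.array([60.0, 50500.0])

# Unscaled: the Euclidean distance is driven almost entirely by the salary column.
print(np.linalg.norm(a - b))  # ~501 -- the large age difference barely matters

# After dividing each column by a rough range, the age difference dominates instead.
scale = np.array([100.0, 100000.0])  # assumed column ranges, for illustration only
print(np.linalg.norm(a / scale - b / scale))  # ~0.35 -- now driven by age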
Additional reasons for transformation: gradient-based optimizers converge faster when features are on comparable scales, and distance-based algorithms (e.g. K-Means, k-NN) weight every feature equally only after scaling. The common transformations are listed below; a plain NumPy sketch of each follows the list.
Min-Max Scaling: scales the input to a minimum of 0 and a maximum of 1, i.e. into the range [0, 1]. This is useful when the parameters have to be on the same positive scale, but it is sensitive to outliers: a single extreme value compresses the rest of the data into a narrow band. $$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$$
Standardization: scales the input to have a mean of 0 and a variance of 1. $$X_{stand} = \frac{X - \mu}{\sigma}$$
Normalization: scales each sample (row) to have a norm of 1. For instance, with 3 features every sample lies on the unit sphere.
Log transformation: taking the log of the data after any of the above transformations.
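Below is a plain NumPy sketch of the three formulas above, using the first few Age/Salary rows from the dataset used later in this post (Normalizer works per row, the other two per column):

import numpy as np

X = np.array([[44.0, 72000.0],
              [27.0, 48000.0],
              [30.0, 54000.0]])

# Min-max scaling: (X - min) / (max - min) per column -> values in [0, 1]
x_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: (X - mean) / std per column -> mean 0, variance 1
x_stand = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalization: divide each row by its Euclidean norm -> unit-norm samples
x_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

print(x_norm)
print(x_stand)
print(x_unit)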
Scaling inputs to unit norms is a common operation for text classification or clustering. For instance, the dot product of two l2-normalized TF-IDF vectors is the cosine similarity of the vectors and is the base similarity metric for the Vector Space Model commonly used by the Information Retrieval community.
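As a small sketch of that point, scikit-learn's TfidfVectorizer l2-normalizes each row by default, so the dot product of two rows equals their cosine similarity (the two sentences here are made up for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["feature scaling for machine learning",
        "scaling features before machine learning"]

# Rows of the TF-IDF matrix are l2-normalized (norm='l2' is the default)
tfidf = TfidfVectorizer().fit_transform(docs)

# Dot product of two unit-norm vectors equals their cosine similarity
dot = tfidf[0].dot(tfidf[1].T).toarray()[0, 0]
cos = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(dot, cos)  # the two numbers match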
For most applications, standardization is recommended. Min-max scaling is recommended for neural networks, and normalization is recommended for clustering, e.g. K-Means.
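In practice the chosen scaler is usually fit on the training data only, typically inside a pipeline, so the test set is transformed with the training statistics and no information leaks. A minimal sketch (the estimator choice here is arbitrary):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# The scaler's mean and std are learned from the training data only.
model = make_pipeline(StandardScaler(), LogisticRegression())
# model.fit(X_train, y_train)
# model.predict(X_test)

The code below compares the three scikit-learn scalers directly on a small Age/Salary dataset.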
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer, MinMaxScaler
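# Load the dataset, drop rows with missing values, and keep the numeric Age/Salary columns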
df = pd.read_csv('Data.csv').dropna()
print(df)
X = df[["Age", "Salary"]].values.astype(np.float64)
   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
5   France  35.0  58000.0       Yes
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes
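# Fit each scaler on the Age/Salary matrix and compare the transformed values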
standard_scaler = StandardScaler()
normalizer = Normalizer()
min_max_scaler = MinMaxScaler()
print("Standardization")
print(standard_scaler.fit_transform(X))
print("Normalizing")
print(normalizer.fit_transform(X))
print("MinMax Scaling")
print(min_max_scaler.fit_transform(X))
Standardization
[[ 0.69985807  0.58989097]
 [-1.51364653 -1.50749915]
 [-1.12302807 -0.98315162]
 [-0.08137885 -0.37141284]
 [-0.47199731 -0.6335866 ]
 [ 1.22068269  1.20162976]
 [ 1.48109499  1.55119478]
 [-0.211585    0.1529347 ]]
Normalizing
[[ 6.11110997e-04  9.99999813e-01]
 [ 5.62499911e-04  9.99999842e-01]
 [ 5.55555470e-04  9.99999846e-01]
 [ 6.22950699e-04  9.99999806e-01]
 [ 6.03448166e-04  9.99999818e-01]
 [ 6.07594825e-04  9.99999815e-01]
 [ 6.02409529e-04  9.99999819e-01]
 [ 5.52238722e-04  9.99999848e-01]]
MinMax Scaling
[[ 0.73913043  0.68571429]
 [ 0.          0.        ]
 [ 0.13043478  0.17142857]
 [ 0.47826087  0.37142857]
 [ 0.34782609  0.28571429]
 [ 0.91304348  0.88571429]
 [ 1.          1.        ]
 [ 0.43478261  0.54285714]]
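As a quick sanity check on the output (a minimal sketch reusing the objects defined above), the standardized columns have mean ~0 and standard deviation 1, and the min-max scaled columns span exactly [0, 1]:

scaled = standard_scaler.fit_transform(X)
print(scaled.mean(axis=0))  # ~[0, 0] up to floating-point error
print(scaled.std(axis=0))   # [1, 1]

ranged = min_max_scaler.fit_transform(X)
print(ranged.min(axis=0), ranged.max(axis=0))  # [0, 0] and [1, 1]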