1 Data Standardization

1.1 scale: sklearn.preprocessing.scale(X, axis=0, with_mean=True, with_std=True, copy=True)

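with_mean : boolean, True by default. If True, center the data before scaling, so that the mean along the chosen axis is 0.
with_std : boolean, True by default. If True, scale the data to unit variance, so that the variance (and therefore the standard deviation) along the chosen axis is 1.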
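The effect of the two flags can be seen in isolation with a small sketch. The array `X` and the names `centered_only` and `scaled_only` are made up for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import scale

# Hypothetical toy data: 3 samples, 2 features
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

# Only subtract the column means; the spread of each column is unchanged
centered_only = scale(X, with_mean=True, with_std=False)

# Only divide by the column standard deviations; the data is not centered
scaled_only = scale(X, with_mean=False, with_std=True)

print(centered_only.mean(axis=0))  # column means after centering
print(scaled_only.std(axis=0))     # column standard deviations after scaling
```

With both flags left at their defaults, the two steps are combined, which is what the next cell does.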

In [1]:

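import numpy as np
# Explicit imports instead of a wildcard; these names are used throughout the notebook
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler, RobustScaler,
                                   StandardScaler, normalize, scale)

rg = np.random.RandomState(2017)
X_train = rg.uniform(0, 5, (4, 3))
X_scaled = scale(X_train)
# np.int was removed in NumPy 1.24; the builtin int behaves the same here
print('Mean: {}, \nStd: {}'.format(X_scaled.mean(axis=0, dtype=int), X_scaled.std(axis=0)))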

Mean: [0 0 0], 
Std: [1. 1. 1.]

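With axis=0 (the default), scale standardizes each column: it subtracts the column mean and then divides by the column's standard deviation (not its variance).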

In [2]:

def f(array):
    result = (array - np.mean(array)) / np.std(array, ddof=0)  # ddof defaults to 0 (population standard deviation)
    return result

scale_result = np.apply_along_axis(f, axis=0, arr=X_train)
assert np.allclose(X_scaled, scale_result)
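For contrast, a minimal sketch of `axis=1`, which standardizes each row (sample) rather than each column. The array `X` below is hypothetical toy data, not taken from the notebook.

```python
import numpy as np
from sklearn.preprocessing import scale

# Hypothetical toy data: 2 samples, 3 features
X = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 60.0]])

# axis=1: each row is centered and scaled using its own mean and standard deviation
X_rows = scale(X, axis=1)

row_means = X_rows.mean(axis=1)
row_stds = X_rows.std(axis=1)
print(row_means, row_stds)
```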

1.2 StandardScaler: sklearn.preprocessing.StandardScaler(copy=True, with_mean=True, with_std=True)

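The fit method learns the mean and standard deviation of each feature from one dataset (typically the training set); the transform method then applies the same standardization to other data (for example, the test set).

Advantages:

1) It can speed up model convergence.

2) It puts all features on a comparable scale, which can improve model accuracy.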

In [3]:

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris 

dataset = load_iris()
np.random.seed(2017)
iris = pd.DataFrame(dataset.data, columns=dataset.feature_names).sample(5)
iris

Out[3]:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
143	6.8	3.2	5.9	2.3
115	6.4	3.2	5.3	2.3
102	7.1	3.0	5.9	2.1
51	6.4	3.2	4.5	1.5
76	6.8	2.8	4.8	1.4

Calling fit_transform directly on a DataFrame is equivalent to applying scale to each column separately.

In [4]:

scaler = StandardScaler()
iris_scaled = scaler.fit_transform(iris)  
iris_scaled

Out[4]:

array([[ 0.372678  ,  0.75      ,  1.0932857 ,  0.96958969],
       [-1.11803399,  0.75      ,  0.03526728,  0.96958969],
       [ 1.49071198, -0.5       ,  1.0932857 ,  0.45927933],
       [-1.11803399,  0.75      , -1.37542395, -1.07165176],
       [ 0.372678  , -1.75      , -0.84641474, -1.32680694]])

In [5]:

# np.int was removed in NumPy 1.24; the builtin int behaves the same here
iris_scaled.mean(axis=0, dtype=int), iris_scaled.std(axis=0)

Out[5]:

(array([0, 0, 0, 0]), array([1., 1., 1., 1.]))

In [6]:

# Equivalent to calling scale on each column separately
np.allclose(scaler.fit_transform(iris), scale(iris))

Out[6]:

True
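The fit/transform split described at the start of this subsection can be sketched as follows. The arrays `X_fit` and `X_new` are hypothetical stand-ins for a training set and unseen data; `mean_` and `scale_` are the attributes StandardScaler stores after fitting.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training data and one unseen sample
X_fit = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0]])
X_new = np.array([[2.0, 40.0]])

# Learn per-column mean and standard deviation from X_fit only
std_scaler = StandardScaler().fit(X_fit)

# Apply the learned statistics to the unseen sample
X_new_scaled = std_scaler.transform(X_new)

# The same result computed by hand from the stored statistics
X_new_manual = (X_new - std_scaler.mean_) / std_scaler.scale_
print(X_new_scaled)
```

Because the statistics come from `X_fit` alone, the transformed new data does not generally have zero mean or unit variance.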

1.2 Min-max normalization: maps the data to a specified range, removing the effect of the different units and magnitudes of each dimension

sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), copy=True)

Transformation

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

X_scaled = X_std * (max - min) + min

where min, max = feature_range.

In [7]:

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler()
scaler.fit(data)

Out[7]:

MinMaxScaler(copy=True, feature_range=(0, 1))

In [8]:

scaler.transform(data)

Out[8]:

array([[0.  , 0.  ],
       [0.25, 0.25],
       [0.5 , 0.5 ],
       [1.  , 1.  ]])

In [9]:

# Or call fit_transform directly
scaler.fit_transform(data)

Out[9]:

array([[0.  , 0.  ],
       [0.25, 0.25],
       [0.5 , 0.5 ],
       [1.  , 1.  ]])
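The transformation formula for MinMaxScaler can be checked by hand, here with a non-default feature_range. This is a sketch: the range (-1, 1) and the names `lo`, `hi`, `manual`, and `via_sklearn` are chosen for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)
lo, hi = -1, 1  # illustrative target range

# Step 1: rescale each column to [0, 1]
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
# Step 2: stretch and shift to [lo, hi]
manual = X_std * (hi - lo) + lo

via_sklearn = MinMaxScaler(feature_range=(lo, hi)).fit_transform(X)
print(np.allclose(manual, via_sklearn))
```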

1.3 MaxAbsScaler

In [10]:

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

max_abs_scaler = MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)
X_train_maxabs           

Out[10]:

array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])

In [11]:

max_abs_scaler.scale_    

Out[11]:

array([2., 1., 2.])

In [12]:

X_test = np.array([[ -3., -1.,  4.]])
X_test_maxabs = max_abs_scaler.transform(X_test)
X_test_maxabs                 

Out[12]:

array([[-1.5, -1. ,  2. ]])
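The outputs above are consistent with MaxAbsScaler dividing each column by its maximum absolute value in the fitted data. A small sketch to confirm this by hand (the names `X_fit`, `col_max_abs`, and `mas` are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X_fit = np.array([[1., -1.,  2.],
                  [2.,  0.,  0.],
                  [0.,  1., -1.]])
X_new = np.array([[-3., -1., 4.]])

# Maximum absolute value of each column in the fitted data
col_max_abs = np.abs(X_fit).max(axis=0)

mas = MaxAbsScaler().fit(X_fit)
fit_manual = X_fit / col_max_abs
new_manual = X_new / col_max_abs  # may fall outside [-1, 1] for unseen data

print(np.allclose(mas.transform(X_fit), fit_manual))
print(np.allclose(mas.transform(X_new), new_manual))
```

Since no centering is involved, zeros stay zeros, which is why this scaler is often suggested for sparse data.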

1.4 RobustScaler

Transformation: (x - median) / IQR, where the IQR (interquartile range) is the 75th percentile minus the 25th percentile

In [13]:

np.random.seed(2018)
X_train = np.random.randn(4,3)

# Renamed from max_abs_scaler: this is a RobustScaler, not a MaxAbsScaler
robust_scaler = RobustScaler()
X_train_robust = robust_scaler.fit_transform(X_train)
X_train_robust

Out[13]:

array([[-0.12669367,  0.61018033,  1.22235048],
       [-2.02310438, -0.03205827,  0.34615926],
       [ 0.12669367, -3.19747007, -0.70069397],
       [ 1.2167336 ,  0.03205827, -0.34615926]])

In [14]:

# Median of each column
robust_scaler.center_

Out[14]:

array([-0.20977884,  0.50624895,  0.34544916])

In [15]:

# IQR of each column
robust_scaler.scale_

Out[15]:

array([0.52874591, 0.12390117, 1.47498622])

In [16]:

# Check that robust_scaler.scale_ is indeed the IQR
IQR = np.percentile(X_train, 75, axis=0) - np.percentile(X_train, 25, axis=0)
np.allclose(robust_scaler.scale_, IQR)

Out[16]:

True
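Putting the two checks together, the full RobustScaler transformation can be reproduced by hand. This is a sketch on hypothetical random data (seed and shape chosen arbitrarily), assuming the default quantile_range of (25.0, 75.0).

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.RandomState(0)  # arbitrary seed for illustration
X = rng.randn(6, 3)

# Per-column median and interquartile range
median = np.median(X, axis=0)
iqr = np.percentile(X, 75, axis=0) - np.percentile(X, 25, axis=0)

manual = (X - median) / iqr
via_sklearn = RobustScaler().fit_transform(X)
print(np.allclose(manual, via_sklearn))
```

Because the median and IQR are much less affected by extreme values than the mean and standard deviation, this scaling is a common choice when the data contains outliers.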

2 Norm-based normalization (scaling each sample to unit norm)

2.1 L1 normalization: each element of a row is divided by that row's L1 norm

In [17]:

x = [[1,-1,2],[2, 0,0],[0, 1, -1]]
df = pd.DataFrame(x, columns=list('ABC'))

x_norm1 = normalize(x, norm='l1')
df_norm1 = pd.DataFrame(x_norm1)
print('L1 normalization:')
df_norm1

L1 normalization:

Out[17]:

	0	1	2
0	0.25	-0.25	0.5
1	1.00	0.00	0.0
2	0.00	0.50	-0.5

In [18]:

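df_norm1 = df.astype(float)  # float copy, so normalized rows are not written into integer columns
for idx in range(len(df)):
    l1_row = df.iloc[idx].abs().sum()  # L1 norm of the row
    df_norm1.iloc[idx] = df.iloc[idx] / l1_row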
    
df_norm1  

Out[18]:

	A	B	C
0	0.25	-0.25	0.5
1	1.00	0.00	0.0
2	0.00	0.50	-0.5
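The row-by-row loop can also be written as a single vectorized NumPy expression. This is a sketch; the names `l1_norms`, `x_l1_manual`, and `x_l1_sklearn` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import normalize

x = np.array([[1, -1, 2], [2, 0, 0], [0, 1, -1]], dtype=float)

# L1 norm of each row, kept as a column vector so it broadcasts across the row
l1_norms = np.abs(x).sum(axis=1, keepdims=True)
x_l1_manual = x / l1_norms

x_l1_sklearn = normalize(x, norm='l1')
print(np.allclose(x_l1_manual, x_l1_sklearn))
```

After L1 normalization, the absolute values in each row sum to 1.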

2.2 L2 normalization: each element of a row is divided by that row's L2 norm

In [19]:

x = [[1,-1,2],[2, 0,0],[0, 1, -1]]
df = pd.DataFrame(x, columns=list('ABC'))
df

Out[19]:

	A	B	C
0	1	-1	2
1	2	0	0
2	0	1	-1

In [20]:

x_norm2 = normalize(x, norm='l2')
df_norm2 = pd.DataFrame(x_norm2)
df_norm2

Out[20]:

	0	1	2
0	0.408248	-0.408248	0.816497
1	1.000000	0.000000	0.000000
2	0.000000	0.707107	-0.707107

In [21]:

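df_norm2 = df.astype(float)  # float copy, so normalized rows are not written into integer columns
for idx in range(len(df)):
    l2_row = np.sqrt(np.square(df.iloc[idx]).sum())  # L2 norm of the row
    df_norm2.iloc[idx] = df.iloc[idx] / l2_row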

df_norm2     

Out[21]:

	A	B	C
0	0.408248	-0.408248	0.816497
1	1.000000	0.000000	0.000000
2	0.000000	0.707107	-0.707107
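As a vectorized alternative to the loop, the L2 case can use np.linalg.norm directly. This is a sketch with illustrative names (`l2_norms`, `x_l2_manual`, `x_l2_sklearn`).

```python
import numpy as np
from sklearn.preprocessing import normalize

x = np.array([[1, -1, 2], [2, 0, 0], [0, 1, -1]], dtype=float)

# Euclidean (L2) norm of each row, kept 2-D for broadcasting
l2_norms = np.linalg.norm(x, axis=1, keepdims=True)
x_l2_manual = x / l2_norms

x_l2_sklearn = normalize(x, norm='l2')
print(np.allclose(x_l2_manual, x_l2_sklearn))

# Every row now has unit Euclidean length
print(np.linalg.norm(x_l2_manual, axis=1))
```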
