In [1]:
%matplotlib inline
from preamble import *

4. Representing Data and Engineering Features

  • Kinds of features
    • Continuous features
      • Quantitative
        • pixel brightness
        • iris measurements
    • Categorical features, also called discrete features
      • Qualitative
        • brand of a product
        • color
        • sales category (books, clothing, hardware)
  • Feature engineering
    • Finding the best representation of the data for a particular application
    • How you represent the data usually has a bigger effect on a machine learning model's performance than the exact form the raw data comes in
    • Using the right data representation usually matters more for performance than choosing the right parameters for a supervised model

4.1 Categorical Variables

  • Adult Data Set
    • https://archive.ics.uci.edu/ml/datasets/adult
    • Income data on adults in the United States, derived from the 1994 census database
    • Features
      • Continuous features
        • age of the worker (age)
        • hours worked per week (hours-per-week)
      • Categorical features
        • type of employment (workclass)
          • self-employed (self-emp-not-inc)
          • private-sector employee (private)
          • state government employee (state-gov)
        • education level (education)
          • Bachelors
          • Masters
          • ...
        • gender
        • occupation
  • The task
    • A classification problem: predict whether a worker's income exceeds $50,000 or not
    • Target column: income
      • <=50K
      • >50K
  • The machine learning model to use
    • Logistic regression
    • Problem
      • Categorical feature values cannot be plugged directly into the logistic regression formula
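  • The minimal sketch below (hypothetical toy data, not part of the original notebook) shows why: scikit-learn estimators expect numeric arrays, so fitting on a raw string column raises an error until the column is encoded.

# Sketch: feeding a raw string column to LogisticRegression fails with a ValueError
import pandas as pd
from sklearn.linear_model import LogisticRegression

toy = pd.DataFrame({"workclass": ["Private", "State-gov", "Private", "Private"],
                    "income": [0, 1, 0, 1]})
try:
    LogisticRegression().fit(toy[["workclass"]], toy["income"])
except ValueError as err:
    print("ValueError:", err)  # e.g. could not convert string to float: 'Private'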

4.1.1 One-Hot-Encoding (Dummy variables)

  • Also called one-out-of-N encoding
  • Dummy variables
  • Load the data with pandas and convert the categorical variables to a one-hot encoding
In [5]:
import os
# The file has no headers naming the columns, so we pass header=None
# and provide the column names explicitly in "names"
adult_path = os.path.join(mglearn.datasets.DATA_PATH, "adult.data")

data = pd.read_csv(
    adult_path, 
    header=None, 
    index_col=False,
    names=['age', 'workclass', 'fnlwgt', 'education',  'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'gender',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
           'income']
)

# For illustration purposes, we only select some of the columns:
data = data[['age', 'workclass', 'education', 'gender', 'hours-per-week', 'occupation', 'income']]
# IPython.display allows nice output formatting within the Jupyter notebook

display(data.head())

print(data.size)
age workclass education gender hours-per-week occupation income
0 39 State-gov Bachelors Male 40 Adm-clerical <=50K
1 50 Self-emp-not-inc Bachelors Male 13 Exec-managerial <=50K
2 38 Private HS-grad Male 40 Handlers-cleaners <=50K
3 53 Private 11th Male 40 Handlers-cleaners <=50K
4 28 Private Bachelors Female 40 Prof-specialty <=50K
227927
Checking string-encoded categorical data
  • Checking the contents of a column
    • value_counts()
      • Shows how many times each unique value appears
In [6]:
print(data.gender.value_counts())
 Male      21790
 Female    10771
Name: gender, dtype: int64
In [10]:
print(data.workclass.value_counts())
 Private             22696
 Self-emp-not-inc     2541
 Local-gov            2093
 ?                    1836
 State-gov            1298
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
 Never-worked            7
Name: workclass, dtype: int64
In [7]:
print("Original features:", list(data.columns))
print("length of features:", len(data.columns))
Original features: ['age', 'workclass', 'education', 'gender', 'hours-per-week', 'occupation', 'income']
length of features: 7
  • pandas.get_dummies(data)
    • Automatically converts the categorical (string) columns in data into numeric dummy columns
In [8]:
data_dummies = pd.get_dummies(data)
print("Features after get_dummies:\n", list(data_dummies.columns))
print("length of dummy features:", len(data_dummies.columns))
Features after get_dummies:
 ['age', 'hours-per-week', 'workclass_ ?', 'workclass_ Federal-gov', 'workclass_ Local-gov', 'workclass_ Never-worked', 'workclass_ Private', 'workclass_ Self-emp-inc', 'workclass_ Self-emp-not-inc', 'workclass_ State-gov', 'workclass_ Without-pay', 'education_ 10th', 'education_ 11th', 'education_ 12th', 'education_ 1st-4th', 'education_ 5th-6th', 'education_ 7th-8th', 'education_ 9th', 'education_ Assoc-acdm', 'education_ Assoc-voc', 'education_ Bachelors', 'education_ Doctorate', 'education_ HS-grad', 'education_ Masters', 'education_ Preschool', 'education_ Prof-school', 'education_ Some-college', 'gender_ Female', 'gender_ Male', 'occupation_ ?', 'occupation_ Adm-clerical', 'occupation_ Armed-Forces', 'occupation_ Craft-repair', 'occupation_ Exec-managerial', 'occupation_ Farming-fishing', 'occupation_ Handlers-cleaners', 'occupation_ Machine-op-inspct', 'occupation_ Other-service', 'occupation_ Priv-house-serv', 'occupation_ Prof-specialty', 'occupation_ Protective-serv', 'occupation_ Sales', 'occupation_ Tech-support', 'occupation_ Transport-moving', 'income_ <=50K', 'income_ >50K']
length of dummy features: 46
In [9]:
display(data_dummies.head(n=10))
age hours-per-week workclass_ ? workclass_ Federal-gov ... occupation_ Tech-support occupation_ Transport-moving income_ <=50K income_ >50K
0 39 40 0 0 ... 0 0 1 0
1 50 13 0 0 ... 0 0 1 0
2 38 40 0 0 ... 0 0 1 0
3 53 40 0 0 ... 0 0 1 0
4 28 40 0 0 ... 0 0 1 0
5 37 40 0 0 ... 0 0 1 0
6 49 16 0 0 ... 0 0 1 0
7 52 45 0 0 ... 0 0 0 1
8 31 50 0 0 ... 0 0 0 1
9 42 40 0 0 ... 0 0 0 1

10 rows × 46 columns

In [12]:
one_hot_encoded = data_dummies.loc[:, 'workclass_ ?':'workclass_ Without-pay']
display(one_hot_encoded.head(n=10))
workclass_ ? workclass_ Federal-gov workclass_ Local-gov workclass_ Never-worked ... workclass_ Self-emp-inc workclass_ Self-emp-not-inc workclass_ State-gov workclass_ Without-pay
0 0 0 0 0 ... 0 0 1 0
1 0 0 0 0 ... 0 1 0 0
2 0 0 0 0 ... 0 0 0 0
3 0 0 0 0 ... 0 0 0 0
4 0 0 0 0 ... 0 0 0 0
5 0 0 0 0 ... 0 0 0 0
6 0 0 0 0 ... 0 0 0 0
7 0 0 0 0 ... 0 1 0 0
8 0 0 0 0 ... 0 0 0 0
9 0 0 0 0 ... 0 0 0 0

10 rows × 9 columns

In [14]:
print(type(one_hot_encoded.values))
print(one_hot_encoded.values[:10])
<class 'numpy.ndarray'>
[[0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 1 0 0]
 [0 0 0 0 1 0 0 0 0]
 [0 0 0 0 1 0 0 0 0]
 [0 0 0 0 1 0 0 0 0]
 [0 0 0 0 1 0 0 0 0]
 [0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 1 0 0]
 [0 0 0 0 1 0 0 0 0]
 [0 0 0 0 1 0 0 0 0]]
  • Excluding the target column from the training data
In [15]:
# Get only the columns containing features
# that is all columns from 'age' to 'occupation_ Transport-moving'
# This range contains all the features but not the target

features = data_dummies.loc[:, 'age':'occupation_ Transport-moving']

# extract NumPy arrays
X = features.values
y = data_dummies['income_ >50K'].values
print("X.shape: {}  y.shape: {}".format(X.shape, y.shape))
X.shape: (32561, 44)  y.shape: (32561,)
  • Fitting and evaluating a logistic regression model
In [25]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

print("Test score: {:.2f}".format(logreg.score(X_test, y_test)))
Test score: 0.81
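  • One caveat, illustrated with a hypothetical sketch below (not part of the original notebook): calling get_dummies separately on training and test data can produce mismatched columns when a category is missing from one split; here the encoding was done before splitting, which avoids the problem.

# Sketch: a category present in only one split leads to different dummy columns
import pandas as pd

train_part = pd.DataFrame({"workclass": ["Private", "State-gov"]})
test_part = pd.DataFrame({"workclass": ["Private", "Never-worked"]})

train_d = pd.get_dummies(train_part)
test_d = pd.get_dummies(test_part)
print(list(train_d.columns))  # ['workclass_Private', 'workclass_State-gov']
print(list(test_d.columns))   # ['workclass_Private', 'workclass_Never-worked']

# Align the test columns to the training columns, filling missing dummies with 0
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_aligned.columns))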

4.1.2 Numbers can encode categoricals

In [26]:
# create a DataFrame with an integer feature and a categorical string feature
demo_df = pd.DataFrame(
    {'Integer Feature': [0, 1, 2, 1],
     'Categorical Feature': ['socks', 'fox', 'socks', 'box']}
)

display(demo_df)
Categorical Feature Integer Feature
0 socks 0
1 fox 1
2 socks 2
3 box 1
In [27]:
display(pd.get_dummies(demo_df))
Integer Feature Categorical Feature_box Categorical Feature_fox Categorical Feature_socks
0 0 0 0 1
1 1 0 1 0
2 2 0 0 1
3 1 1 0 0
  • Convert the numeric feature to strings with astype and then apply get_dummies()
  • As a result, both the numeric and the string features of the original data are one-hot encoded
In [28]:
demo_df['Integer Feature'] = demo_df['Integer Feature'].astype(str)
display(pd.get_dummies(demo_df, columns=['Integer Feature', 'Categorical Feature']))
Integer Feature_0 Integer Feature_1 Integer Feature_2 Categorical Feature_box Categorical Feature_fox Categorical Feature_socks
0 1 0 0 0 0 1
1 0 1 0 0 1 0
2 0 0 1 0 0 1
3 0 1 0 1 0 0

4.2 Binning, Discretization, Linear Models and Trees

  • Binning, also called discretization
In [32]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X, y = mglearn.datasets.make_wave(n_samples=100)

print("X.shape: {}".format(X.shape))
print("y.shape: {}".format(y.shape))
print()
for i in range(10):
    print(X[i], y[i])
X.shape: (100, 1)
y.shape: (100,)

[-0.753] -0.3979485798878842
[2.704] 0.7105775485755936
[1.392] 0.41392866721449156
[0.592] -0.3483837936512941
[-2.064] -1.6020040642044855
[-2.064] -1.3135709853245343
[-2.651] -0.12426799844607195
[2.197] 1.1366058452312982
[0.607] 0.22684365004805757
[1.248] -0.10700112891754687
In [33]:
line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)
print(line.shape)

reg = DecisionTreeRegressor(min_samples_split=3).fit(X, y)
plt.plot(line, reg.predict(line), label="decision tree")

reg = LinearRegression().fit(X, y)
plt.plot(line, reg.predict(line), label="linear regression")

plt.plot(X[:, 0], y, 'o', c='k')

plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.legend(loc="best")
(1000, 1)
Out[33]:
<matplotlib.legend.Legend at 0x10d12bb00>
  • Binning (discretization)
    • One way to make linear models more powerful on continuous data
In [35]:
bins = np.linspace(-3, 3, 11)
print("bins: {}".format(bins))
bins: [-3.  -2.4 -1.8 -1.2 -0.6  0.   0.6  1.2  1.8  2.4  3. ]
  • First bin: 1, covering [-3.0, -2.4)
  • Second bin: 2, covering [-2.4, -1.8)
  • ...
  • Tenth bin: 10, covering [2.4, 3.0)
  • np.digitize(X, bins)
    • Records, for each data point in X, which of the bins it falls into
    • In other words, it converts the continuous data into discrete data
In [36]:
which_bin = np.digitize(X, bins=bins)
print("\nData points:\n", X[:5])
print("\nBin membership for data points:\n", which_bin[:5])
Data points:
 [[-0.753]
 [ 2.704]
 [ 1.392]
 [ 0.592]
 [-2.064]]

Bin membership for data points:
 [[ 4]
 [10]
 [ 8]
 [ 6]
 [ 2]]
  • OneHotEncoder
    • Converts the discrete bin-membership data into one-hot vectors
In [43]:
from sklearn.preprocessing import OneHotEncoder

# transform using the OneHotEncoder
encoder = OneHotEncoder(sparse=False)
encoder.fit(which_bin)
X_binned = encoder.transform(which_bin)

print(X_binned[:5])
[[0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]]
In [40]:
print("X.shape: {}".format(X.shape))
print("X_binned.shape: {}".format(X_binned.shape))
X.shape: (100, 1)
X_binned.shape: (100, 10)
  • Refit the decision tree and the linear regression model on the one-hot-encoded data
In [47]:
encoder = OneHotEncoder(sparse=False)
which_bin = np.digitize(X, bins=bins)
encoder.fit(which_bin)
X_binned = encoder.transform(which_bin)

line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)
print("line.shape:", line.shape)
line_binned = encoder.transform(np.digitize(line, bins=bins))
print("line_binned.shape:", line_binned.shape)

reg = LinearRegression().fit(X_binned, y)
plt.plot(line, reg.predict(line_binned), label='linear regression binned')

reg = DecisionTreeRegressor(min_samples_split=3).fit(X_binned, y)
plt.plot(line, reg.predict(line_binned), label='decision tree binned')

plt.plot(X[:, 0], y, 'o', c='k')
plt.vlines(bins, -3, 3, linewidth=1, alpha=.2)
plt.legend(loc="best")
plt.ylabel("Regression output")
plt.xlabel("Input feature")
line.shape: (1000, 1)
line_binned.shape: (1000, 10)
Out[47]:
Text(0.5,0,'Input feature')
  • The linear regression and the decision tree now produce exactly the same predictions
  • Within each bin the feature value is constant, so any model predicts the same value for all points in a bin
  • Effect of training the same models after the binned (digitized) transformation, i.e. on the one-hot-encoded data:
    • LinearRegression: the model becomes more flexible --> a big gain
    • DecisionTreeRegressor: the model becomes less flexible
  • In short, binning the data can boost performance when you have to use a linear model; a quick score comparison follows below
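  • The sketch below quantifies this claim; it reuses X, y, and X_binned from the cells above, and the printed scores are not recorded here.

# Sketch: compare the training-set R^2 of a linear model on the raw vs. the binned feature
from sklearn.linear_model import LinearRegression

print("R^2 on the raw feature:    {:.3f}".format(LinearRegression().fit(X, y).score(X, y)))
print("R^2 on the binned feature: {:.3f}".format(LinearRegression().fit(X_binned, y).score(X_binned, y)))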

4.3 Interactions and Polynomials

  • Ways to enrich the feature representation of the original data
In [48]:
X_combined = np.hstack([X, X_binned])
print(X_combined.shape)
print(X_combined[:5])
(100, 11)
[[-0.753  0.     0.     0.     1.     0.     0.     0.     0.     0.
   0.   ]
 [ 2.704  0.     0.     0.     0.     0.     0.     0.     0.     0.
   1.   ]
 [ 1.392  0.     0.     0.     0.     0.     0.     0.     1.     0.
   0.   ]
 [ 0.592  0.     0.     0.     0.     0.     1.     0.     0.     0.
   0.   ]
 [-2.064  0.     1.     0.     0.     0.     0.     0.     0.     0.
   0.   ]]
In [49]:
reg = LinearRegression().fit(X_combined, y)

line_combined = np.hstack([line, line_binned])
plt.plot(line, reg.predict(line_combined), label='linear regression combined')

plt.vlines(bins, -3, 3, linewidth=1, alpha=.2)

plt.legend(loc="best")
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.plot(X[:, 0], y, 'o', c='k')
Out[49]:
[<matplotlib.lines.Line2D at 0x10e92efd0>]
  • In the plot above the model has learned an offset for each bin, plus a single slope
    • The slope is the same in every bin --> it would be better if each bin had its own slope
In [50]:
X_product = np.hstack([X_binned, X * X_binned])
print(X_product.shape)
print(X_product[:5])
(100, 20)
[[ 0.     0.     0.     1.     0.     0.     0.     0.     0.     0.
  -0.    -0.    -0.    -0.753 -0.    -0.    -0.    -0.    -0.    -0.   ]
 [ 0.     0.     0.     0.     0.     0.     0.     0.     0.     1.
   0.     0.     0.     0.     0.     0.     0.     0.     0.     2.704]
 [ 0.     0.     0.     0.     0.     0.     0.     1.     0.     0.
   0.     0.     0.     0.     0.     0.     0.     1.392  0.     0.   ]
 [ 0.     0.     0.     0.     0.     1.     0.     0.     0.     0.
   0.     0.     0.     0.     0.     0.592  0.     0.     0.     0.   ]
 [ 0.     1.     0.     0.     0.     0.     0.     0.     0.     0.
  -0.    -2.064 -0.    -0.    -0.    -0.    -0.    -0.    -0.    -0.   ]]
In [51]:
reg = LinearRegression().fit(X_product, y)

line_product = np.hstack([line_binned, line * line_binned])
plt.plot(line, reg.predict(line_product), label='linear regression product')

for bin in bins:
    plt.plot([bin, bin], [-3, 3], ':', c='k', linewidth=1)

plt.plot(X[:, 0], y, 'o', c='k')
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.legend(loc="best")
Out[51]:
<matplotlib.legend.Legend at 0x10e7db7f0>
In [57]:
from sklearn.preprocessing import PolynomialFeatures

# include polynomials up to x ** 10:
# the default "include_bias=True" adds a feature that's constantly 1
poly = PolynomialFeatures(degree=10, include_bias=False)
poly.fit(X)
X_poly = poly.transform(X)
In [58]:
print("X.shape: {}".format(X.shape))
print("X_poly.shape: {}".format(X_poly.shape))
X.shape: (100, 1)
X_poly.shape: (100, 10)
In [59]:
print("Entries of X:\n{}".format(X[:5]))
print("Entries of X_poly:\n{}".format(X_poly[:5]))
Entries of X:
[[-0.753]
 [ 2.704]
 [ 1.392]
 [ 0.592]
 [-2.064]]
Entries of X_poly:
[[   -0.753     0.567    -0.427     0.321    -0.242     0.182    -0.137
      0.103    -0.078     0.058]
 [    2.704     7.313    19.777    53.482   144.632   391.125  1057.714
   2860.36   7735.232 20918.278]
 [    1.392     1.938     2.697     3.754     5.226     7.274    10.125
     14.094    19.618    27.307]
 [    0.592     0.35      0.207     0.123     0.073     0.043     0.025
      0.015     0.009     0.005]
 [   -2.064     4.26     -8.791    18.144   -37.448    77.289  -159.516
    329.222  -679.478  1402.367]]
In [60]:
print("Polynomial feature names:\n{}".format(poly.get_feature_names()))
Polynomial feature names:
['x0', 'x0^2', 'x0^3', 'x0^4', 'x0^5', 'x0^6', 'x0^7', 'x0^8', 'x0^9', 'x0^10']
  • Build a polynomial regression model using PolynomialFeatures
In [61]:
reg = LinearRegression().fit(X_poly, y)

line_poly = poly.transform(line)
plt.plot(line, reg.predict(line_poly), label='polynomial linear regression')
plt.plot(X[:, 0], y, 'o', c='k')
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.legend(loc="best")
Out[61]:
<matplotlib.legend.Legend at 0x10ee045f8>

  • For comparison: a kernel SVM regressor, SVR, which learns a complex model without any explicit feature transformation

In [62]:
from sklearn.svm import SVR

for gamma in [1, 10]:
    svr = SVR(gamma=gamma).fit(X, y)
    plt.plot(line, svr.predict(line), label='SVR gamma={}'.format(gamma))

plt.plot(X[:, 0], y, 'o', c='k')
plt.ylabel("Regression output")
plt.xlabel("Input feature")
plt.legend(loc="best")
Out[62]:
<matplotlib.legend.Legend at 0x10d5364e0>
  • The Boston Housing dataset
In [63]:
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=0)

# rescale data
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
In [64]:
poly = PolynomialFeatures(degree=2).fit(X_train_scaled)
X_train_poly = poly.transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)
print("X_train.shape: {}".format(X_train.shape))
print("X_train_poly.shape: {}".format(X_train_poly.shape))
X_train.shape: (379, 13)
X_train_poly.shape: (379, 105)
  • The original data has 13 features; the degree-2 polynomial transform expands this to 105 features
    • 1 + 13 + H(13, 2) = 1 + 13 + C(14, 2) = 1 + 13 + 91 = 105 (the bias term, the 13 original features, and all degree-2 monomials including squares)
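  • A quick check of that count (a sketch using Python's math.comb, not part of the original notebook):

from math import comb

# bias term + the 13 original features + all degree-2 monomials (13 choose 2 with repetition)
print(1 + 13 + comb(14, 2))   # 105
# equivalently, all monomials of degree <= 2 in 13 variables: C(13 + 2, 2)
print(comb(15, 2))            # 105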
In [65]:
print("Polynomial feature names:\n{}".format(poly.get_feature_names()))
Polynomial feature names:
['1', 'x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x0^2', 'x0 x1', 'x0 x2', 'x0 x3', 'x0 x4', 'x0 x5', 'x0 x6', 'x0 x7', 'x0 x8', 'x0 x9', 'x0 x10', 'x0 x11', 'x0 x12', 'x1^2', 'x1 x2', 'x1 x3', 'x1 x4', 'x1 x5', 'x1 x6', 'x1 x7', 'x1 x8', 'x1 x9', 'x1 x10', 'x1 x11', 'x1 x12', 'x2^2', 'x2 x3', 'x2 x4', 'x2 x5', 'x2 x6', 'x2 x7', 'x2 x8', 'x2 x9', 'x2 x10', 'x2 x11', 'x2 x12', 'x3^2', 'x3 x4', 'x3 x5', 'x3 x6', 'x3 x7', 'x3 x8', 'x3 x9', 'x3 x10', 'x3 x11', 'x3 x12', 'x4^2', 'x4 x5', 'x4 x6', 'x4 x7', 'x4 x8', 'x4 x9', 'x4 x10', 'x4 x11', 'x4 x12', 'x5^2', 'x5 x6', 'x5 x7', 'x5 x8', 'x5 x9', 'x5 x10', 'x5 x11', 'x5 x12', 'x6^2', 'x6 x7', 'x6 x8', 'x6 x9', 'x6 x10', 'x6 x11', 'x6 x12', 'x7^2', 'x7 x8', 'x7 x9', 'x7 x10', 'x7 x11', 'x7 x12', 'x8^2', 'x8 x9', 'x8 x10', 'x8 x11', 'x8 x12', 'x9^2', 'x9 x10', 'x9 x11', 'x9 x12', 'x10^2', 'x10 x11', 'x10 x12', 'x11^2', 'x11 x12', 'x12^2']
  • Feature augmentation with PolynomialFeatures is better suited to linear models
In [66]:
from sklearn.linear_model import Ridge

ridge = Ridge().fit(X_train_scaled, y_train)
print("Score without interactions: {:.3f}".format(ridge.score(X_test_scaled, y_test)))

ridge = Ridge().fit(X_train_poly, y_train)
print("Score with interactions: {:.3f}".format(ridge.score(X_test_poly, y_test)))
Score without interactions: 0.621
Score with interactions: 0.753
  • RandomForestRegressor is unlikely to gain from this kind of feature augmentation
In [67]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100).fit(X_train_scaled, y_train)
print("Score without interactions: {:.3f}".format(rf.score(X_test_scaled, y_test)))

rf = RandomForestRegressor(n_estimators=100).fit(X_train_poly, y_train)
print("Score with interactions: {:.3f}".format(rf.score(X_test_poly, y_test)))
Score without interactions: 0.819
Score with interactions: 0.769

4.4 Univariate Non-linear transformations

  • Feature transformations using functions such as log, exp, or sin
    • Used to improve the performance of linear models or neural networks
    • sin (and cos) are well suited to datasets with periodic patterns
    • log and exp are well suited to integer count data
      • e.g., a feature like "how often did a user log in?"
      • Counts are never negative and often follow particular statistical distributions
      • Most models perform best when each feature is roughly Gaussian-distributed
  • Creating synthetic count data
    • 1) Generate random data from a standard normal distribution
    • 2) Draw counts from a Poisson distribution whose mean is 10 * exp(the random data), as in the code below
In [68]:
rnd = np.random.RandomState(0)
X_org = rnd.normal(size=(1000, 3))
w = rnd.normal(size=3)
print("X_org.shape:", X_org.shape)
print("w.shape:", w.shape)
print()

X = rnd.poisson(lam = 10 * np.exp(X_org))
y = np.dot(X_org, w)
print("X.shape:", X.shape)
print("y.shape:", y.shape)
print()

print("X_org[:10, 0]:", X_org[:10, 0])
print("X[:10, 0]", X[:10, 0])
print("y[:10]", y[:10])
X_org.shape: (1000, 3)
w.shape: (3,)

X.shape: (1000, 3)
y.shape: (1000,)

X_org[:10, 0]: [ 1.764  2.241  0.95   0.411  0.761  0.334  0.313  0.654  2.27  -0.187]
X[:10, 0] [ 56  81  25  20  27  18  12  21 109   7]
y[:10] [2.926 4.744 1.439 0.57  1.231 1.405 0.305 1.618 2.784 0.405]
  • Counting how often each integer value appears with np.bincount
In [69]:
print("Number of feature appearances:\n{}".format(np.bincount(X[:, 0])))
Number of feature appearances:
[28 38 68 48 61 59 45 56 37 40 35 34 36 26 23 26 27 21 23 23 18 21 10  9
 17  9  7 14 12  7  3  8  4  5  5  3  4  2  4  1  1  3  2  5  3  8  2  5
  2  1  2  3  3  2  2  3  3  0  1  2  1  0  0  3  1  0  0  0  1  3  0  1
  0  2  0  1  1  0  0  0  0  1  0  0  2  2  0  1  1  0  0  0  0  1  1  0
  0  0  0  0  0  0  1  0  0  0  0  0  1  1  0  0  1  0  0  0  0  0  0  0
  1  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1]
In [70]:
bins = np.bincount(X[:, 0])
plt.bar(range(len(bins)), bins, color='grey')
plt.ylabel("Number of appearances")
plt.xlabel("Value")
Out[70]:
Text(0.5,0,'Value')
  • Applying Ridge, a linear regression model, to data like this
    • The performance is not very good
In [71]:
from sklearn.linear_model import Ridge
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
score = Ridge().fit(X_train, y_train).score(X_test, y_test)
print("Test score: {:.3f}".format(score))
Test score: 0.622
  • Transform the original data X using log(X + 1)
In [72]:
X_train_log = np.log(X_train + 1)
X_test_log = np.log(X_test + 1)
  • After the transformation the data is much closer to a Gaussian distribution
In [73]:
plt.hist(X_train_log[:, 0], bins=25, color='gray')
plt.ylabel("Number of appearances")
plt.xlabel("Value")
plt.show()
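  • A natural follow-up (a sketch continuing the cells above; the resulting score is not recorded here) is to refit Ridge on the log-transformed features, which should now suit the linear model much better than the raw counts.

# Sketch: Ridge on the log-transformed counts, reusing X_train_log, X_test_log, y_train, y_test
score = Ridge().fit(X_train_log, y_train).score(X_test_log, y_test)
print("Test score: {:.3f}".format(score))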