Shestakov A.V. Data Analysis Minor, 31/05/2016
Metric methods for classification and regression are among the simplest models. They rest on the compactness (continuity) hypothesis: close objects tend to have close answers.
Only a small matter remains:
Which drawbacks of the kNN method do you remember from the lecture?
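The compactness hypothesis can be made concrete with a minimal from-scratch sketch (illustrative only; below we use scikit-learn): a query point simply receives the label of its closest training point.

```python
import numpy as np

def one_nn_predict(X_train, y_train, X_query):
    """1-nearest-neighbour prediction: label of the closest training point."""
    preds = []
    for q in X_query:
        dists = np.linalg.norm(X_train - q, axis=1)  # Euclidean distances
        preds.append(y_train[np.argmin(dists)])
    return np.array(preds)

X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
y_train = np.array([0, 1])
print(one_nn_predict(X_train, y_train, np.array([[0.1, 0.2]])))  # closest to class 0
```

This also makes the main cost visible: every prediction scans the whole training set, which is why kNN prediction is slow on large data.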
from sklearn.datasets import make_moons
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
X, y = make_moons(noise=0.3, random_state=123)
plt.scatter(X[:,0], X[:,1], c=y)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
knn = KNeighborsClassifier(n_neighbors=10, weights='distance', metric='euclidean')
knn.fit(X, y)
x_range = np.linspace(X.min(), X.max(), 100)
xx1, xx2 = np.meshgrid(x_range, x_range)
Y = knn.predict_proba(np.c_[xx1.ravel(), xx2.ravel()])[:,1]
Y = Y.reshape(xx1.shape)
plt.contourf(xx1, xx2, Y, alpha=0.3)
plt.scatter(X[:,0], X[:,1],c=y)
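A quick way to see how the neighbourhood size `k` controls the flexibility of the boundary is to compare training accuracy for a very small and a very large `k` on the same `make_moons` data (a sketch, not part of the original seminar code):

```python
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(noise=0.3, random_state=123)
for k in (1, 50):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    # k=1 memorizes the training set (training accuracy 1.0);
    # a large k averages over many points and gives a smoother boundary
    print(k, knn.score(X, y))
```

High training accuracy at `k=1` is a symptom of overfitting, not of a good model, which is why `k` should be chosen by cross-validation as done later in this notebook.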
data = np.loadtxt('kengo.csv', skiprows=1, delimiter=',')
X = data[:,0].reshape(-1,1)
y = data[:,1]
plt.scatter(X, y)
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=5, weights='uniform', metric='euclidean')
knn.fit(X, y)
x_range = np.linspace(X.min(), X.max(), 100).reshape(-1,1)
y_hat = knn.predict(x_range)
plt.scatter(X, y)
plt.plot(x_range, y_hat, 'r')
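The `weights` parameter changes the regression fit noticeably. A sketch on synthetic 1-D data (the `kengo.csv` file is not reproduced here, so a noisy sine wave stands in for it):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 40)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=40)

for w in ('uniform', 'distance'):
    knn = KNeighborsRegressor(n_neighbors=5, weights=w).fit(X, y)
    # With weights='distance', each training point is its own zero-distance
    # neighbour, so the model interpolates the training set exactly (R^2 = 1)
    print(w, round(knn.score(X, y), 3))
```

This is why distance-weighted kNN always looks perfect on training data: its quality must be judged on held-out points.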
from sklearn.datasets import load_boston
data = load_boston()
print(data['DESCR'])
Boston House Prices dataset

Notes
-----
Data Set Characteristics:

    :Number of Instances: 506
    :Number of Attributes: 13 numeric/categorical predictive
    :Median Value (attribute 14) is usually the target
    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's
    :Missing Attribute Values: None
    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing

This dataset was taken from the StatLib library which is maintained at Carnegie
Mellon University. The Boston house-price data of Harrison, D. and Rubinfeld, D.L.
'Hedonic prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...',
Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261
of the latter. The Boston house-price data has been used in many machine learning
papers that address regression problems.

**References**

- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
- many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)
X = data.data
y = data.target
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error
cv = KFold(n_splits=3)
scores = []
for k in range(1, 20):
    cv_score = []
    for train_idx, test_idx in cv.split(X):
        # Normalization: fit the scaler on the training fold only
        scaler = StandardScaler()
        X_train, y_train = X[train_idx], y[train_idx]
        X_train = scaler.fit_transform(X_train)
        # Train kNN
        knn = KNeighborsRegressor(n_neighbors=k, weights='distance')
        knn.fit(X_train, y_train)
        # Predict on the test fold, scaled with the training-fold statistics
        X_test, y_test = X[test_idx], y[test_idx]
        X_test = scaler.transform(X_test)
        y_hat = knn.predict(X_test)
        cv_score.append(mean_absolute_error(y_test, y_hat))
    scores.append(np.mean(cv_score))
plt.plot(np.arange(1,20), scores)
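The manual scale-then-fit loop above can be expressed more compactly with a `Pipeline`, which re-fits the scaler inside every training fold automatically, so leakage is impossible by construction. A sketch on synthetic data (`make_regression` stands in for the Boston dataset, which newer scikit-learn versions no longer ship):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with 13 features, like the Boston dataset
X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=0)

cv = KFold(n_splits=3)
scores = []
for k in range(1, 20):
    pipe = make_pipeline(StandardScaler(),
                         KNeighborsRegressor(n_neighbors=k, weights='distance'))
    # cross_val_score handles the fold loop; MAE is returned negated
    mae = -cross_val_score(pipe, X, y, cv=cv,
                           scoring='neg_mean_absolute_error').mean()
    scores.append(mae)
print(len(scores))
```

The resulting `scores` list plays the same role as in the manual loop and can be plotted against `k` in the same way.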
Load the newsgroup text dataset. Pick 2 categories, split the documents into words (n-grams), and "train" the nearest-neighbour method to categorize texts by their content.
Use the cosine similarity measure.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
news_docs = fetch_20newsgroups(subset='all',
categories=['alt.atheism', 'comp.graphics'])
X = news_docs.data
y = news_docs.target
from sklearn.metrics import roc_auc_score
cv = KFold(n_splits=3)
scores = []
vect = CountVectorizer(stop_words='english')
X_bow = vect.fit_transform(X)
for k in range(1, 40):
    cv_score = []
    for train_idx, test_idx in cv.split(X_bow):
        # The texts are already vectorized, so just index the sparse matrix
        X_train, y_train = X_bow[train_idx], y[train_idx]
        # Train kNN with brute-force search and the cosine metric
        knn = KNeighborsClassifier(n_neighbors=k, weights='uniform', algorithm='brute',
                                   metric='cosine')
        knn.fit(X_train, y_train)
        # Predict class probabilities and score with ROC AUC
        X_test, y_test = X_bow[test_idx], y[test_idx]
        y_hat = knn.predict_proba(X_test)
        cv_score.append(roc_auc_score(y_test, y_hat[:, 1]))
    scores.append(np.mean(cv_score))
plt.plot(np.arange(1,40), scores)
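`TfidfVectorizer` is imported above but never used; tf-idf weighting usually pairs well with the cosine metric, since it downweights common words before comparing document directions. A self-contained sketch on a tiny made-up corpus (the real `fetch_20newsgroups` data requires a download; the words and class assignments here are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in corpus: 0 ~ alt.atheism, 1 ~ comp.graphics
docs = ["god religion faith belief",
        "atheism god debate argument",
        "graphics rendering image pixel",
        "image pixel screen graphics"]
labels = [0, 0, 1, 1]

vect = TfidfVectorizer()
X_tfidf = vect.fit_transform(docs)

# Brute-force search is required for the cosine metric on sparse input
knn = KNeighborsClassifier(n_neighbors=1, algorithm='brute', metric='cosine')
knn.fit(X_tfidf, labels)
pred = knn.predict(vect.transform(["3d graphics rendering"]))
print(pred)  # the query shares terms only with the graphics documents
```

Swapping `CountVectorizer` for `TfidfVectorizer` in the cross-validation loop above is a one-line change and is worth comparing on the real newsgroup data.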