The usual scenario for the learning tasks presented in this book includes a list of instances (represented as feature/value pairs) and a special feature (the target class) that we want to predict for future instances based on the values of the remaining features. However, the source data does not usually come in this format. We have to extract what we think are potentially useful features and convert them to our learning format. This process is called feature extraction or feature engineering; it is an often underestimated but very important and time-consuming phase in most real-world machine learning tasks.
Start by importing numpy, scikit-learn, pandas, and matplotlib, the Python libraries we will be using in this chapter, and show the versions we will be using (in case you have problems running the notebooks).
%pylab inline
import IPython
import sklearn as sk
import numpy as np
import matplotlib
import pandas as pd
import matplotlib.pyplot as plt
print 'IPython version:', IPython.__version__
print 'numpy version:', np.__version__
print 'scikit-learn version:', sk.__version__
print 'matplotlib version:', matplotlib.__version__
print 'pandas version:', pd.__version__
Populating the interactive namespace from numpy and matplotlib
IPython version: 2.1.0
numpy version: 1.8.2
scikit-learn version: 0.15.1
matplotlib version: 1.3.1
pandas version: 0.14.1
The Python package pandas (http://pandas.pydata.org/), for example, provides data structures and tools for data analysis. It aims to provide similar features to those of R, the popular language and environment for statistical computing. We will use pandas to import the Titanic data we presented in Chapter 2, Supervised Learning, and convert it to the scikit-learn format.
titanic = pd.read_csv('data/titanic.csv')
print titanic
      row.names pclass  survived  \
0             1    1st         1
1             2    1st         0
2             3    1st         0
3             4    1st         0
4             5    1st         1
...         ...    ...       ...
1312       1313    3rd         0

                                                 name      age     embarked  \
0                        Allen, Miss Elisabeth Walton  29.0000  Southampton
1                         Allison, Miss Helen Loraine   2.0000  Southampton
2                 Allison, Mr Hudson Joshua Creighton  30.0000  Southampton
3     Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)  25.0000  Southampton
4                       Allison, Master Hudson Trevor   0.9167  Southampton
...                                               ...      ...          ...
1312                                   Zimmerman, Leo      NaN          NaN

(output abridged; the remaining columns are home.dest, room, ticket, boat, and sex)

[1313 rows x 11 columns]
You can see that each CSV column has a corresponding feature in the DataFrame, and that the feature type is inferred from the available data. We can inspect some features to see what they look like.
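If you want to confirm the inferred types, pandas exposes them through the DataFrame's dtypes attribute; this quick check is not part of the original workflow, just a sanity test:

# Show the type pandas inferred for each column (e.g. float64 or object)
print titanic.dtypes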
print titanic.head()[['pclass', 'survived', 'age', 'embarked', 'boat', 'sex']]
  pclass  survived      age     embarked   boat     sex
0    1st         1  29.0000  Southampton      2  female
1    1st         0   2.0000  Southampton    NaN  female
2    1st         0  30.0000  Southampton  (135)    male
3    1st         0  25.0000  Southampton    NaN  female
4    1st         1   0.9167  Southampton     11    male
titanic.describe()
 | row.names | survived | age
---|---|---|---
count | 1313.000000 | 1313.000000 | 633.000000
mean | 657.000000 | 0.341965 | 31.194181
std | 379.174762 | 0.474549 | 14.747525
min | 1.000000 | 0.000000 | 0.166700
25% | 329.000000 | 0.000000 | 21.000000
50% | 657.000000 | 0.000000 | 30.000000
75% | 985.000000 | 1.000000 | 41.000000
max | 1313.000000 | 1.000000 | 71.000000
The main difficulty we have now is that scikit-learn methods expect real numbers as feature values. In Chapter 2, Supervised Learning, we used the LabelEncoder and OneHotEncoder preprocessing methods to manually convert certain categorical features into 1-of-K values (generating a new feature for each possible value, valued 1 if the original feature had the corresponding value and 0 otherwise). This time, we will use a similar scikit-learn method, DictVectorizer, which automatically builds these features from the different original feature values. Moreover, we will program a method to encode a set of columns in a single step.
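Before applying DictVectorizer to the Titanic data, it may help to see what it does on a tiny, made-up dataset (the toy values below are ours, purely for illustration): string values get one 1-of-K column per observed value, while numeric values pass through unchanged.

from sklearn.feature_extraction import DictVectorizer

# Two toy instances: 'pclass' is categorical, 'age' is numeric
toy = [{'pclass': '1st', 'age': 29.0}, {'pclass': '3rd', 'age': 2.0}]
toy_vec = DictVectorizer()
# Expect the columns ['age', 'pclass=1st', 'pclass=3rd']
print toy_vec.fit_transform(toy).toarray()
print toy_vec.get_feature_names()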
from sklearn import feature_extraction
def one_hot_dataframe(data, cols, replace=False):
    """ Takes a dataframe and a list of columns that need to be encoded.
    Returns a 2-tuple comprising the data (with the encoded columns
    appended, and the original ones dropped if replace is True) and
    the vectorized data alone.
    Modified from https://gist.github.com/kljensen/5452382
    """
    vec = feature_extraction.DictVectorizer()
    # Turn each row of the selected columns into a dict and let
    # DictVectorizer build the 1-of-K features from its values
    vecData = pd.DataFrame(vec.fit_transform(
        data[cols].to_dict(outtype='records')).toarray())
    vecData.columns = vec.get_feature_names()
    vecData.index = data.index
    if replace:
        data = data.drop(cols, axis=1)
        data = data.join(vecData)
    return (data, vecData)
titanic, titanic_n = one_hot_dataframe(titanic, ['pclass', 'embarked', 'sex'], replace=True)
titanic.describe()
 | row.names | survived | age | embarked | embarked=Cherbourg | embarked=Queenstown | embarked=Southampton | pclass=1st | pclass=2nd | pclass=3rd | sex=female | sex=male
---|---|---|---|---|---|---|---|---|---|---|---|---
count | 1313.000000 | 1313.000000 | 633.000000 | 821 | 1313.000000 | 1313.000000 | 1313.000000 | 1313.000000 | 1313.000000 | 1313.000000 | 1313.000000 | 1313.000000
mean | 657.000000 | 0.341965 | 31.194181 | 0 | 0.154608 | 0.034273 | 0.436405 | 0.245240 | 0.213252 | 0.541508 | 0.352628 | 0.647372
std | 379.174762 | 0.474549 | 14.747525 | 0 | 0.361668 | 0.181998 | 0.496128 | 0.430393 | 0.409760 | 0.498464 | 0.477970 | 0.477970
min | 1.000000 | 0.000000 | 0.166700 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000
25% | 329.000000 | 0.000000 | 21.000000 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000
50% | 657.000000 | 0.000000 | 30.000000 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000
75% | 985.000000 | 1.000000 | 41.000000 | 0 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000
max | 1313.000000 | 1.000000 | 71.000000 | 0 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000
What does the 'embarked' feature contain now? Note that a plain 'embarked' column survives alongside the 1-of-K columns: DictVectorizer one-hot encodes string values but passes numeric values through unchanged, so the rows whose embarkation port was missing (NaN, a float) keep that NaN in a numeric 'embarked' feature.
print titanic_n.head(5)
print titanic_n[titanic_n['embarked'] != 0].head()
   embarked  embarked=Cherbourg  embarked=Queenstown  embarked=Southampton  \
0         0                   0                    0                     1
1         0                   0                    0                     1
2         0                   0                    0                     1
3         0                   0                    0                     1
4         0                   0                    0                     1

   pclass=1st  pclass=2nd  pclass=3rd  sex=female  sex=male
0           1           0           0           1         0
1           1           0           0           1         0
2           1           0           0           0         1
3           1           0           0           1         0
4           1           0           0           0         1

     embarked  embarked=Cherbourg  embarked=Queenstown  embarked=Southampton  \
62        NaN                   0                    0                     0
165       NaN                   0                    0                     0
195       NaN                   0                    0                     0
196       NaN                   0                    0                     0
229       NaN                   0                    0                     0

     pclass=1st  pclass=2nd  pclass=3rd  sex=female  sex=male
62            1           0           0           0         1
165           1           0           0           0         1
195           1           0           0           0         1
196           1           0           0           0         1
229           1           0           0           0         1
Convert the remaining categorical features...
print titanic.head()
titanic, titanic_n = one_hot_dataframe(titanic, ['home.dest', 'room', 'ticket', 'boat'], replace=True)
   row.names  survived                                              name  \
0          1         1                      Allen, Miss Elisabeth Walton
1          2         0                       Allison, Miss Helen Loraine
2          3         0               Allison, Mr Hudson Joshua Creighton
3          4         0   Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)
4          5         1                     Allison, Master Hudson Trevor

       age                        home.dest room      ticket   boat  embarked  \
0  29.0000                     St Louis, MO  B-5  24160 L221      2         0
1   2.0000  Montreal, PQ / Chesterville, ON  C26         NaN    NaN         0
2  30.0000  Montreal, PQ / Chesterville, ON  C26         NaN  (135)         0
3  25.0000  Montreal, PQ / Chesterville, ON  C26         NaN    NaN         0
4   0.9167  Montreal, PQ / Chesterville, ON  C22         NaN     11         0

   embarked=Cherbourg  embarked=Queenstown  embarked=Southampton  pclass=1st  \
0                   0                    0                     1           1
1                   0                    0                     1           1
2                   0                    0                     1           1
3                   0                    0                     1           1
4                   0                    0                     1           1

   pclass=2nd  pclass=3rd  sex=female  sex=male
0           0           0           1         0
1           0           0           1         0
2           0           0           0         1
3           0           0           1         0
4           0           0           0         1
We also have to deal with missing values, since the DecisionTreeClassifier we plan to use does not accept them as input. pandas allows us to replace them with a fixed value using the fillna method. We will use the mean age for the age feature, and 0 for the remaining missing attributes. First, adjust the missing ages with the mean age:
print titanic['age'].describe()
mean = titanic['age'].mean()
titanic['age'].fillna(mean, inplace=True)
print titanic['age'].describe()
count    633.000000
mean      31.194181
std       14.747525
min        0.166700
25%       21.000000
50%       30.000000
75%       41.000000
max       71.000000
dtype: float64

count    1313.000000
mean       31.194181
std        10.235540
min         0.166700
25%        30.000000
50%        31.194181
75%        31.194181
max        71.000000
dtype: float64
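If you prefer to keep the imputation step inside scikit-learn (for example, to learn the mean on the training set only and reuse it on the test set), the preprocessing module offers an Imputer transformer. A minimal sketch, assuming the column still holds NaNs (ours does not anymore, since we just filled it):

from sklearn.preprocessing import Imputer

# Replace NaN entries in each column with that column's mean;
# fit_transform learns the means and applies them in one step
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
age_filled = imp.fit_transform(titanic[['age']].values)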
Then, complete the remaining missing values with zeros:
titanic.fillna(0, inplace=True)
print titanic
      row.names  survived                                 name        age  \
0             1         1         Allen, Miss Elisabeth Walton  29.000000
1             2         0          Allison, Miss Helen Loraine   2.000000
2             3         0  Allison, Mr Hudson Joshua Creighton  30.000000
...         ...       ...                                  ...        ...
1312       1313         0                       Zimmerman, Leo  31.194181

(output abridged; the missing ages now hold the mean value 31.194181, and the
hundreds of one-hot columns generated for home.dest, room, ticket, and boat
are filled with zeros)

[1313 rows x 581 columns]
Build the training and testing datasets:
from sklearn.cross_validation import train_test_split
titanic_target = titanic['survived']
titanic_data = titanic.drop(['name', 'row.names', 'survived'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(titanic_data, titanic_target, test_size=0.25, random_state=33)
Let's see how a decision tree works with the current feature set.
from sklearn import tree
dt = tree.DecisionTreeClassifier(criterion='entropy')
dt = dt.fit(X_train, y_train)
import pydot, StringIO
# Export the fitted tree in Graphviz dot format and render it as a PNG image
dot_data = StringIO.StringIO()
tree.export_graphviz(dt, out_file=dot_data, feature_names=titanic_data.columns)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
graph.write_png('titanic.png')
from IPython.core.display import Image
Image(filename='titanic.png')
from sklearn import metrics
def measure_performance(X, y, clf, show_accuracy=True,
                        show_classification_report=True,
                        show_confusion_matrix=True):
    """Print the requested evaluation metrics for clf on the (X, y) set."""
    y_pred = clf.predict(X)
    if show_accuracy:
        print "Accuracy:{0:.3f}".format(metrics.accuracy_score(y, y_pred)), "\n"
    if show_classification_report:
        print "Classification report"
        print metrics.classification_report(y, y_pred), "\n"
    if show_confusion_matrix:
        print "Confusion matrix"
        print metrics.confusion_matrix(y, y_pred), "\n"
measure_performance(X_test, y_test, dt, show_confusion_matrix=False, show_classification_report=False)
Accuracy:0.839
Working with a smaller feature set may lead to better results, so we want a way to algorithmically find the best features. This task is called feature selection, and it is a crucial step when we aim to get decent results with machine learning algorithms: if we have poor features, our results will be poor no matter how sophisticated the learning algorithm is. Let's select only the 20 percent most important features, using a chi2 test:
from sklearn import feature_selection
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=20)
X_train_fs = fs.fit_transform(X_train, y_train)
print titanic_data.columns[fs.get_support()]
print fs.scores_[2]
print titanic_data.columns[2]
Index([u'age', u'embarked=Cherbourg', u'embarked=Southampton', u'pclass=1st',
       u'pclass=2nd', u'pclass=3rd', u'sex=female', u'sex=male', u'boat=1',
       u'boat=10', u'boat=11', u'boat=12', u'boat=13', u'boat=14',
       u'boat=14/12', u'boat=14/D', u'boat=15', u'boat=16', u'boat=2',
       u'boat=3', u'boat=4', u'boat=5', u'boat=5/7', u'boat=6', u'boat=7',
       u'boat=8', u'boat=9', u'boat=A', u'boat=B', u'boat=C', u'boat=D',
       ...
       u'home.dest=St Louis, MO', u'home.dest=Sweden Winnipeg, MN',
       u'home.dest=Syria New York, NY', u'home.dest=Tuxedo Park, NY', ...],
      dtype='object')
41.2650346212
embarked=Cherbourg
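If you want the raw score and p-value arrays to do your own ranking, the chi2 scorer can also be called directly; a small sketch (the variable names are ours):

from sklearn.feature_selection import chi2

# chi2 returns one (score, p-value) pair per feature; higher scores mean
# stronger dependence between the feature and the survival target
chi2_scores, chi2_pvalues = chi2(X_train, y_train)
print chi2_scores[:5]
print chi2_pvalues[:5]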
Evaluate performance with the new feature set
dt.fit(X_train_fs, y_train)
X_test_fs = fs.transform(X_test)
measure_performance(X_test_fs, y_test, dt, show_confusion_matrix=False, show_classification_report=False)
Accuracy:0.848
Find the best percentile using cross-validation on the training set:
from sklearn import cross_validation
percentiles = range(1, 100, 5)
results = []
for i in percentiles:
    fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=i)
    X_train_fs = fs.fit_transform(X_train, y_train)
    scores = cross_validation.cross_val_score(dt, X_train_fs, y_train, cv=5)
    results = np.append(results, scores.mean())

# np.where returns an array of indices; take the first (scalar) match
optimal_percentil = np.where(results == results.max())[0][0]
print "Optimal percentile: {0}".format(percentiles[optimal_percentil]), "\n"
# Plot the percentile of features selected VS. cross-validation scores
import pylab as pl
pl.figure()
pl.xlabel("Percentile of features selected")
pl.ylabel("Cross-validation accuracy")
pl.plot(percentiles, results)
print "Mean scores:", results
Optimal percentile: 6

Mean scores: [ 0.83332303  0.87804576  0.87195424  0.86994434  0.87399505
  0.86891363  0.86992373  0.86991342  0.87195424  0.86991342  0.87194393
  0.87398475  0.86991342  0.87093383  0.86992373  0.86074005  0.86583179
  0.86790353  0.8648423 ]
Evaluate the best percentile of features on the test set:
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=percentiles[optimal_percentil])
X_train_fs = fs.fit_transform(X_train, y_train)
dt.fit(X_train_fs, y_train)
X_test_fs = fs.transform(X_test)
measure_performance(X_test_fs, y_test, dt, show_confusion_matrix=False, show_classification_report=False)
Accuracy:0.860
print dt.get_params()
{'splitter': 'best', 'min_density': None, 'compute_importances': None, 'max_leaf_nodes': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'random_state': None, 'criterion': 'entropy', 'max_features': None, 'max_depth': None}
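The symmetric set_params method lets you change any of these hyperparameters in place, without recreating the classifier; for instance (max_depth=3 is an arbitrary illustrative value, not one we tuned):

# set_params returns the estimator itself, so calls can be chained
dt.set_params(max_depth=3)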
Find the best tree criterion, using cross-validation on the training set (see the next notebook, on Model Selection):
dt = tree.DecisionTreeClassifier(criterion='entropy')
scores = cross_validation.cross_val_score(dt, X_train_fs, y_train, cv=5)
print "Entropy criterion accuracy on cv: {0:.3f}".format(scores.mean())
dt = tree.DecisionTreeClassifier(criterion='gini')
scores = cross_validation.cross_val_score(dt, X_train_fs, y_train, cv=5)
print "Gini criterion accuracy on cv: {0:.3f}".format(scores.mean())
Entropy criterion accuracy on cv: 0.879
Gini criterion accuracy on cv: 0.880
dt.fit(X_train_fs, y_train)
X_test_fs = fs.transform(X_test)
measure_performance(X_test_fs, y_test, dt, show_confusion_matrix=False, show_classification_report=False)
Accuracy:0.863