Exploratory Analyses of Titanic data

Load packages and data

In [1]:
using Gadfly
using DataFrames
df=readtable("train.csv")
Out[1]:
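Row | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
1 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NA | S
2 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
3 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NA | S
4 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S
5 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NA | S
6 | 6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | NA | Q
7 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S
8 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NA | S
9 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NA | S
10 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NA | C
11 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7 | G6 | S
12 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.55 | C103 | S
13 | 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20.0 | 0 | 0 | A/5. 2151 | 8.05 | NA | S
14 | 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39.0 | 1 | 5 | 347082 | 31.275 | NA | S
15 | 15 | 0 | 3 | Vestrom, Miss. Hulda Amanda Adolfina | female | 14.0 | 0 | 0 | 350406 | 7.8542 | NA | S
16 | 16 | 1 | 2 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 0 | 0 | 248706 | 16.0 | NA | S
17 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.0 | 4 | 1 | 382652 | 29.125 | NA | Q
18 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NA | 0 | 0 | 244373 | 13.0 | NA | S
19 | 19 | 0 | 3 | Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele) | female | 31.0 | 1 | 0 | 345763 | 18.0 | NA | S
20 | 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NA | 0 | 0 | 2649 | 7.225 | NA | C
21 | 21 | 0 | 2 | Fynney, Mr. Joseph J | male | 35.0 | 0 | 0 | 239865 | 26.0 | NA | S
22 | 22 | 1 | 2 | Beesley, Mr. Lawrence | male | 34.0 | 0 | 0 | 248698 | 13.0 | D56 | S
23 | 23 | 1 | 3 | McGowan, Miss. Anna "Annie" | female | 15.0 | 0 | 0 | 330923 | 8.0292 | NA | Q
24 | 24 | 1 | 1 | Sloper, Mr. William Thompson | male | 28.0 | 0 | 0 | 113788 | 35.5 | A6 | S
25 | 25 | 0 | 3 | Palsson, Miss. Torborg Danira | female | 8.0 | 3 | 1 | 349909 | 21.075 | NA | S
26 | 26 | 1 | 3 | Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson) | female | 38.0 | 1 | 5 | 347077 | 31.3875 | NA | S
27 | 27 | 0 | 3 | Emir, Mr. Farred Chehab | male | NA | 0 | 0 | 2631 | 7.225 | NA | C
28 | 28 | 0 | 1 | Fortune, Mr. Charles Alexander | male | 19.0 | 3 | 2 | 19950 | 263.0 | C23 C25 C27 | S
29 | 29 | 1 | 3 | O'Dwyer, Miss. Ellen "Nellie" | female | NA | 0 | 0 | 330959 | 7.8792 | NA | Q
30 | 30 | 0 | 3 | Todoroff, Mr. Lalio | male | NA | 0 | 0 | 349216 | 7.8958 | NA | S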

Summarize the data

In [2]:
describe(df)
PassengerId
Min      1.0
1st Qu.  223.5
Median   446.0
Mean     446.0
3rd Qu.  668.5
Max      891.0
NAs      0
NA%      0.0%

Survived
Min      0.0
1st Qu.  0.0
Median   0.0
Mean     0.3838383838383838
3rd Qu.  1.0
Max      1.0
NAs      0
NA%      0.0%

Pclass
Min      1.0
1st Qu.  2.0
Median   3.0
Mean     2.308641975308642
3rd Qu.  3.0
Max      3.0
NAs      0
NA%      0.0%

Name
Length  891
Type    UTF8String
NAs     0
NA%     0.0%
Unique  891

Sex
Length  891
Type    UTF8String
NAs     0
NA%     0.0%
Unique  2

Age
Min      0.42
1st Qu.  20.125
Median   28.0
Mean     29.69911764705882
3rd Qu.  38.0
Max      80.0
NAs      177
NA%      19.87%

SibSp
Min      0.0
1st Qu.  0.0
Median   0.0
Mean     0.5230078563411896
3rd Qu.  1.0
Max      8.0
NAs      0
NA%      0.0%

Parch
Min      0.0
1st Qu.  0.0
Median   0.0
Mean     0.38159371492704824
3rd Qu.  0.0
Max      6.0
NAs      0
NA%      0.0%

Ticket
Length  891
Type    UTF8String
NAs     0
NA%     0.0%
Unique  681

Fare
Min      0.0
1st Qu.  7.9104
Median   14.4542
Mean     32.20420796857464
3rd Qu.  31.0
Max      512.3292
NAs      0
NA%      0.0%

Cabin
Length  891
Type    UTF8String
NAs     687
NA%     77.1%
Unique  148

Embarked
Length  891
Type    UTF8String
NAs     2
NA%     0.22%
Unique  4
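
The NAs and NA% rows in the summary above are a count of missing entries and that count divided by the column length. A minimal sketch of the arithmetic, written for Julia 1.x with `missing` rather than the NA type used in this notebook; `missing_summary` and the toy column are hypothetical names introduced only for illustration:

```julia
# Hypothetical helper: count missing entries and report their share in percent.
function missing_summary(values)
    n_missing = count(ismissing, values)
    pct = round(100 * n_missing / length(values), digits=2)
    return n_missing, pct
end

# Toy column, not the real Age data.
toy = Union{Missing,Float64}[22.0, missing, 26.0, 35.0, missing]
println(missing_summary(toy))               # (2, 40.0)

# The same arithmetic applied to the counts reported above for Age.
println(round(100 * 177 / 891, digits=2))   # 19.87
```

The second line reproduces the 19.87% reported for Age from its 177 NAs out of 891 rows.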

Check the data type

In [3]:
typeof(df)
Out[3]:
DataFrame (constructor with 22 methods)

Get the first row

In [4]:
df[1,:]
Out[4]:
Row | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
1 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NA | S

Read the passenger name column

In [5]:
df[:Name]
Out[5]:
891-element DataArray{UTF8String,1}:
 "Braund, Mr. Owen Harris"                            
 "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
 "Heikkinen, Miss. Laina"                             
 "Futrelle, Mrs. Jacques Heath (Lily May Peel)"       
 "Allen, Mr. William Henry"                           
 "Moran, Mr. James"                                   
 "McCarthy, Mr. Timothy J"                            
 "Palsson, Master. Gosta Leonard"                     
 "Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)"  
 "Nasser, Mrs. Nicholas (Adele Achem)"                
 "Sandstrom, Miss. Marguerite Rut"                    
 "Bonnell, Miss. Elizabeth"                           
 "Saundercock, Mr. William Henry"                     
 ⋮                                                    
 "Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)"      
 "Shelley, Mrs. William (Imanita Parrish Hall)"       
 "Markun, Mr. Johann"                                 
 "Dahlberg, Miss. Gerda Ulrika"                       
 "Banfield, Mr. Frederick James"                      
 "Sutehall, Mr. Henry Jr"                             
 "Rice, Mrs. William (Margaret Norton)"               
 "Montvila, Rev. Juozas"                              
 "Graham, Miss. Margaret Edith"                       
 "Johnston, Miss. Catherine Helen \"Carrie\""         
 "Behr, Mr. Karl Howell"                              
 "Dooley, Mr. Patrick"                                

Convert data to factors

In [6]:
pool!(df,[:Sex])
pool!(df,[:Survived])
pool!(df,[:Pclass])
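
Pooling stores each distinct value once and keeps a small integer reference per row, which is why the pooled columns later report a `Uint8` reference type. The idea can be sketched in Julia 1.x as follows; `pool_values` is a hypothetical helper for illustration and is not the implementation behind `pool!`:

```julia
# Hypothetical helper: encode a vector as (levels, refs), the idea behind pooling.
function pool_values(values::AbstractVector{<:AbstractString})
    lvls = sort(unique(values))                    # distinct categories, sorted for determinism
    index = Dict(level => i for (i, level) in enumerate(lvls))
    refs = [UInt8(index[v]) for v in values]       # one small integer per row
    return lvls, refs
end

lvls, refs = pool_values(["male", "female", "female", "male"])
println(lvls)        # ["female", "male"]
println(Int.(refs))  # [2, 1, 1, 2]
```

A `UInt8` reference can distinguish at most 255 levels in this sketch, which is ample for columns such as Sex, Survived, and Pclass.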

Plot a histogram of Sex, colored by Survived

In [7]:
plot(df,x="Sex",color="Survived",Geom.histogram)
Out[7]:
[Figure: histogram with Sex (male, female) on the x-axis, colored by Survived]
In [8]:
df[!isna(df[:Age]),:]
Out[8]:
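Row | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
1 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NA | S
2 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
3 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NA | S
4 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S
5 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NA | S
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NA | S
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NA | S
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NA | C
10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7 | G6 | S
11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.55 | C103 | S
12 | 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20.0 | 0 | 0 | A/5. 2151 | 8.05 | NA | S
13 | 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39.0 | 1 | 5 | 347082 | 31.275 | NA | S
14 | 15 | 0 | 3 | Vestrom, Miss. Hulda Amanda Adolfina | female | 14.0 | 0 | 0 | 350406 | 7.8542 | NA | S
15 | 16 | 1 | 2 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 0 | 0 | 248706 | 16.0 | NA | S
16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.0 | 4 | 1 | 382652 | 29.125 | NA | Q
17 | 19 | 0 | 3 | Vander Planke, Mrs. Julius (Emelia Maria Vandemoortele) | female | 31.0 | 1 | 0 | 345763 | 18.0 | NA | S
18 | 21 | 0 | 2 | Fynney, Mr. Joseph J | male | 35.0 | 0 | 0 | 239865 | 26.0 | NA | S
19 | 22 | 1 | 2 | Beesley, Mr. Lawrence | male | 34.0 | 0 | 0 | 248698 | 13.0 | D56 | S
20 | 23 | 1 | 3 | McGowan, Miss. Anna "Annie" | female | 15.0 | 0 | 0 | 330923 | 8.0292 | NA | Q
21 | 24 | 1 | 1 | Sloper, Mr. William Thompson | male | 28.0 | 0 | 0 | 113788 | 35.5 | A6 | S
22 | 25 | 0 | 3 | Palsson, Miss. Torborg Danira | female | 8.0 | 3 | 1 | 349909 | 21.075 | NA | S
23 | 26 | 1 | 3 | Asplund, Mrs. Carl Oscar (Selma Augusta Emilia Johansson) | female | 38.0 | 1 | 5 | 347077 | 31.3875 | NA | S
24 | 28 | 0 | 1 | Fortune, Mr. Charles Alexander | male | 19.0 | 3 | 2 | 19950 | 263.0 | C23 C25 C27 | S
25 | 31 | 0 | 1 | Uruchurtu, Don. Manuel E | male | 40.0 | 0 | 0 | PC 17601 | 27.7208 | NA | C
26 | 34 | 0 | 2 | Wheadon, Mr. Edward H | male | 66.0 | 0 | 0 | C.A. 24579 | 10.5 | NA | S
27 | 35 | 0 | 1 | Meyer, Mr. Edgar Joseph | male | 28.0 | 1 | 0 | PC 17604 | 82.1708 | NA | C
28 | 36 | 0 | 1 | Holverson, Mr. Alexander Oskar | male | 42.0 | 1 | 0 | 113789 | 52.0 | NA | S
29 | 38 | 0 | 3 | Cann, Mr. Ernest Charles | male | 21.0 | 0 | 0 | A./5. 2152 | 8.05 | NA | S
30 | 39 | 0 | 3 | Vander Planke, Miss. Augusta Maria | female | 18.0 | 2 | 0 | 345764 | 18.0 | NA | S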
In [9]:
averageAge=mean(df[!isna(df[:Age]),:Age])
Out[9]:
29.69911764705882

Fill NAs in df[:Age] with the average age

In [10]:
df[:Age]=array(df[:Age],averageAge)
Out[10]:
891-element Array{Float64,1}:
 22.0   
 38.0   
 26.0   
 35.0   
 35.0   
 29.6991
 54.0   
  2.0   
 27.0   
 14.0   
  4.0   
 58.0   
 20.0   
  ⋮     
 56.0   
 25.0   
 33.0   
 22.0   
 28.0   
 25.0   
 39.0   
 27.0   
 19.0   
 29.6991
 26.0   
 32.0   
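
The two steps above (average the observed ages, then substitute that average for every NA) amount to mean imputation. A minimal sketch of the same idea in Julia 1.x, assuming `missing` in place of NA; `fill_with_mean` and the toy vector are hypothetical and stand in for the notebook's `array(df[:Age], averageAge)` call:

```julia
using Statistics

# Hypothetical helper: replace missing entries with the mean of the observed ones.
function fill_with_mean(values::AbstractVector)
    avg = mean(skipmissing(values))                     # mean over non-missing entries only
    return Float64[coalesce(v, avg) for v in values]    # missing -> avg, others unchanged
end

# Toy ages, not the real column.
ages = Union{Missing,Float64}[22.0, 38.0, missing, 35.0]
filled = fill_with_mean(ages)
println(filled[3] ≈ 95 / 3)   # true: the gap now holds the mean of 22, 38 and 35
```

Mean imputation leaves the column mean unchanged but concentrates many rows on one value, so it can shift the quartiles and the median of the imputed column.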
In [11]:
typeof(df[:Sex])
Out[11]:
PooledDataArray{UTF8String,Uint8,1} (constructor with 1 method)

Clean the Embarked column

In [12]:
plot(x=df[!isna(df[:Embarked]),:Embarked],Geom.histogram)
Out[12]:
[Figure: histogram of Embarked with categories S, C, Q on the x-axis]
In [13]:
df[:Embarked]=array(df[:Embarked],utf8("S"))
Out[13]:
891-element Array{UTF8String,1}:
 "S"
 "C"
 "S"
 "S"
 "S"
 "Q"
 "S"
 "S"
 "S"
 "C"
 "S"
 "S"
 "S"
 ⋮  
 "C"
 "S"
 "S"
 "S"
 "S"
 "S"
 "Q"
 "S"
 "S"
 "S"
 "C"
 "Q"
In [14]:
pool!(df,[:Embarked])
typeof(df[:Embarked])
Out[14]:
PooledDataArray{UTF8String,Uint8,1} (constructor with 1 method)
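
Filling the two missing Embarked values with "S" is consistent with most-frequent-value imputation, although the notebook does not state why "S" was chosen, so that reading is an assumption. A sketch of that technique in Julia 1.x; `most_frequent` and the toy data are hypothetical names used only for illustration:

```julia
# Hypothetical helper: return the most frequent non-missing value.
function most_frequent(values)
    counts = Dict{String,Int}()
    for v in skipmissing(values)
        counts[v] = get(counts, v, 0) + 1
    end
    best_value, best_count = "", 0
    for (value, n) in counts
        if n > best_count
            best_value, best_count = value, n
        end
    end
    return best_value
end

# Toy ports, not the real column.
ports = Union{Missing,String}["S", "C", missing, "S", "Q"]
fill_value = most_frequent(ports)
filled_ports = [coalesce(p, fill_value) for p in ports]
println(filled_ports)   # ["S", "C", "S", "S", "Q"]
```

Ties between equally frequent values are broken arbitrarily in this sketch.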

Select features and form a new DataFrame for the classification tree

In [15]:
newdata=df[:,[:Pclass,:Age,:Sex,:SibSp,:Parch,:Fare,:Embarked]]
Out[15]:
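Row | Pclass | Age | Sex | SibSp | Parch | Fare | Embarked
1 | 3 | 22.0 | male | 1 | 0 | 7.25 | S
2 | 1 | 38.0 | female | 1 | 0 | 71.2833 | C
3 | 3 | 26.0 | female | 0 | 0 | 7.925 | S
4 | 1 | 35.0 | female | 1 | 0 | 53.1 | S
5 | 3 | 35.0 | male | 0 | 0 | 8.05 | S
6 | 3 | 29.69911764705882 | male | 0 | 0 | 8.4583 | Q
7 | 1 | 54.0 | male | 0 | 0 | 51.8625 | S
8 | 3 | 2.0 | male | 3 | 1 | 21.075 | S
9 | 3 | 27.0 | female | 0 | 2 | 11.1333 | S
10 | 2 | 14.0 | female | 1 | 0 | 30.0708 | C
11 | 3 | 4.0 | female | 1 | 1 | 16.7 | S
12 | 1 | 58.0 | female | 0 | 0 | 26.55 | S
13 | 3 | 20.0 | male | 0 | 0 | 8.05 | S
14 | 3 | 39.0 | male | 1 | 5 | 31.275 | S
15 | 3 | 14.0 | female | 0 | 0 | 7.8542 | S
16 | 2 | 55.0 | female | 0 | 0 | 16.0 | S
17 | 3 | 2.0 | male | 4 | 1 | 29.125 | Q
18 | 2 | 29.69911764705882 | male | 0 | 0 | 13.0 | S
19 | 3 | 31.0 | female | 1 | 0 | 18.0 | S
20 | 3 | 29.69911764705882 | female | 0 | 0 | 7.225 | C
21 | 2 | 35.0 | male | 0 | 0 | 26.0 | S
22 | 2 | 34.0 | male | 0 | 0 | 13.0 | S
23 | 3 | 15.0 | female | 0 | 0 | 8.0292 | Q
24 | 1 | 28.0 | male | 0 | 0 | 35.5 | S
25 | 3 | 8.0 | female | 3 | 1 | 21.075 | S
26 | 3 | 38.0 | female | 1 | 5 | 31.3875 | S
27 | 3 | 29.69911764705882 | male | 0 | 0 | 7.225 | C
28 | 1 | 19.0 | male | 3 | 2 | 263.0 | S
29 | 3 | 29.69911764705882 | female | 0 | 0 | 7.8792 | Q
30 | 3 | 29.69911764705882 | male | 0 | 0 | 7.8958 | S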
In [16]:
describe(newdata)
Pclass
Min      1.0
1st Qu.  2.0
Median   3.0
Mean     2.308641975308642
3rd Qu.  3.0
Max      3.0
NAs      0
NA%      0.0%

Age
Min      0.42
1st Qu.  22.0
Median   29.69911764705882
Mean     29.699117647058845
3rd Qu.  35.0
Max      80.0
NAs      0
NA%      0.0%

Sex
Length  891
Type    Pooled UTF8String
NAs     0
NA%     0.0%
Unique  2

SibSp
Min      0.0
1st Qu.  0.0
Median   0.0
Mean     0.5230078563411896
3rd Qu.  1.0
Max      8.0
NAs      0
NA%      0.0%

Parch
Min      0.0
1st Qu.  0.0
Median   0.0
Mean     0.38159371492704824
3rd Qu.  0.0
Max      6.0
NAs      0
NA%      0.0%

Fare
Min      0.0
1st Qu.  7.9104
Median   14.4542
Mean     32.20420796857464
3rd Qu.  31.0
Max      512.3292
NAs      0
NA%      0.0%

Embarked
Length  891
Type    Pooled UTF8String
NAs     0
NA%     0.0%
Unique  3

Tree Classification

In [18]:
using DecisionTree
In [19]:
xTrain=newdata
In [20]:
yTrain=df[:Survived]
Out[20]:
891-element PooledDataArray{Int64,Uint8,1}:
 0
 1
 1
 1
 0
 0
 0
 0
 1
 1
 1
 1
 0
 ⋮
 1
 1
 0
 0
 0
 0
 0
 0
 1
 0
 1
 0
In [26]:
yTrain=array(yTrain)
Out[26]:
891-element Array{Int64,1}:
 0
 1
 1
 1
 0
 0
 0
 0
 1
 1
 1
 1
 0
 ⋮
 1
 1
 0
 0
 0
 0
 0
 0
 1
 0
 1
 0
In [27]:
xTrain=array(xTrain)
Out[27]:
891x7 Array{Any,2}:
 3  22.0     "male"    1  0   7.25    "S"
 1  38.0     "female"  1  0  71.2833  "C"
 3  26.0     "female"  0  0   7.925   "S"
 1  35.0     "female"  1  0  53.1     "S"
 3  35.0     "male"    0  0   8.05    "S"
 3  29.6991  "male"    0  0   8.4583  "Q"
 1  54.0     "male"    0  0  51.8625  "S"
 3   2.0     "male"    3  1  21.075   "S"
 3  27.0     "female"  0  2  11.1333  "S"
 2  14.0     "female"  1  0  30.0708  "C"
 3   4.0     "female"  1  1  16.7     "S"
 1  58.0     "female"  0  0  26.55    "S"
 3  20.0     "male"    0  0   8.05    "S"
 ⋮                            ⋮          
 1  56.0     "female"  0  1  83.1583  "C"
 2  25.0     "female"  0  1  26.0     "S"
 3  33.0     "male"    0  0   7.8958  "S"
 3  22.0     "female"  0  0  10.5167  "S"
 2  28.0     "male"    0  0  10.5     "S"
 3  25.0     "male"    0  0   7.05    "S"
 3  39.0     "female"  0  5  29.125   "Q"
 2  27.0     "male"    0  0  13.0     "S"
 1  19.0     "female"  0  0  30.0     "S"
 3  29.6991  "female"  1  2  23.45    "S"
 1  26.0     "male"    0  0  30.0     "C"
 3  32.0     "male"    0  0   7.75    "Q"
In [31]:
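# 4-fold cross-validation of a random forest:
# 5 random features, 20 trees, 4 folds, 0.7 of the samples per tree
accuracy = nfoldCV_forest(yTrain, xTrain, 5, 20, 4, 0.7)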
Fold 1
Classes:  {0,1}
Matrix:   
[125 12
 23 62]
Accuracy: 0.8423423423423423
Kappa:    0.6579804560260585

Fold 2
Classes:  {0,1}
Matrix:   
[134 11
 26 51]
Accuracy: 0.8333333333333334
Kappa:    0.6145471609572971

Fold 3
Classes:  {0,1}
Matrix:   
[121 19
 26 56]
Accuracy: 0.7972972972972973
Kappa:    0.5570630486831604

Fold 4
Classes:  {0,1}
Matrix:   
[110 15
 26 71]
Accuracy: 0.8153153153153153
Kappa:    0.6198312588756161

Mean Accuracy: 0.822072072072072
Out[31]:
4-element Array{Float64,1}:
 0.842342
 0.833333
 0.797297
 0.815315
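
The Accuracy and Kappa values printed for each fold follow from its confusion matrix. A minimal sketch of that arithmetic in Julia 1.x, using the standard definition of Cohen's kappa; `accuracy_and_kappa` is a hypothetical helper for illustration, not a function from DecisionTree:

```julia
# Hypothetical helper: accuracy and Cohen's kappa from a square confusion matrix.
# Both values are unchanged if the matrix is transposed, so it does not matter
# whether rows hold the actual or the predicted classes.
function accuracy_and_kappa(cm::AbstractMatrix{<:Integer})
    total = sum(cm)
    n = size(cm, 1)
    observed = sum(cm[i, i] for i in 1:n) / total          # share of correct predictions
    row_totals = sum(cm, dims=2)
    col_totals = sum(cm, dims=1)
    expected = sum(row_totals[i] * col_totals[i] for i in 1:n) / total^2   # chance agreement
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa
end

fold1 = [125 12; 23 62]          # confusion matrix printed for Fold 1
acc, kappa = accuracy_and_kappa(fold1)
println(acc)
println(kappa)
```

For Fold 1 this gives 187 correct predictions out of 222, matching the reported accuracy of about 0.8423 and kappa of about 0.6580.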