**Data Architecture**: Data architecture is a set of rules, policies, standards, and models that govern and define the type of data collected and how it is used, stored, managed, and integrated within an organization and its database systems.
**Data Governance**: Data governance is a defined process an organization follows to ensure that high-quality data exists throughout the complete lifecycle. The key focus areas of data governance include availability, usability, integrity, and security.
**Data Extraction**: Data extraction is the act or process of retrieving data out of (usually unstructured or poorly structured) data sources for further data processing or data storage (data migration).
**Data Cleaning (Cleansing)**: Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. It involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.
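To make the data-cleaning definition concrete, below is a minimal, hypothetical sketch (the column names and values are made up purely for illustration); the same ideas are applied to the Titanic data later in this notebook.
#toy illustration of data cleaning with pandas (made-up data, not the Titanic set)
import numpy as np
import pandas as pd
toy = pd.DataFrame({'age': [22, -1, 35, np.nan], 'city': ['NYC', 'nyc', 'LA', 'LA']})
toy['age'] = toy['age'].replace(-1, np.nan)          #correct an aberrant value
toy['age'] = toy['age'].fillna(toy['age'].median())  #complete missing values with the median
toy['city'] = toy['city'].str.upper()                #convert to a consistent format
print(toy)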
The following code is written in Python 3.x. Libraries provide pre-written functionality to perform the necessary tasks. The idea: why write ten lines of code when you can write one?
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
#load packages
import sys #access to system parameters https://docs.python.org/3/library/sys.html
print("Python version: {}".format(sys.version))
import pandas as pd #collection of functions for data processing and analysis modeled after R dataframes with SQL-like features
print("pandas version: {}".format(pd.__version__))
import matplotlib #collection of functions for scientific and publication-ready visualization
print("matplotlib version: {}".format(matplotlib.__version__))
import numpy as np #foundational package for scientific computing
print("NumPy version: {}".format(np.__version__))
import scipy as sp #collection of functions for scientific computing and advanced mathematics
print("SciPy version: {}".format(sp.__version__))
import IPython
from IPython import display #pretty printing of dataframes in Jupyter notebook
print("IPython version: {}".format(IPython.__version__))
import sklearn #collection of machine learning algorithms
print("scikit-learn version: {}".format(sklearn.__version__))
#misc libraries
import random
import time
#ignore warnings
import warnings
warnings.filterwarnings('ignore')
print('-'*25)
# Input data files are available in the "../data/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
from subprocess import check_output
print(check_output(["ls", "../data"]).decode("utf8"))
# Any results you write to the current directory are saved as output.
Python version: 3.6.3 |Anaconda custom (64-bit)| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)]
pandas version: 0.20.3
matplotlib version: 2.1.0
NumPy version: 1.13.3
SciPy version: 0.19.1
IPython version: 6.1.0
scikit-learn version: 0.19.1
-------------------------
titanic_test.csv
titanic_train.csv
We will use the popular scikit-learn library to develop our machine learning algorithms. In sklearn, algorithms are called Estimators and are implemented in their own classes. For data visualization, we will use the matplotlib and seaborn libraries. Below are the common modules and classes to load.
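As a quick, minimal sketch of the Estimator interface (using sklearn's built-in iris toy dataset rather than our Titanic data), every estimator follows the same constructor/fit/predict pattern:
#minimal sketch of the sklearn Estimator interface on a toy dataset
from sklearn import datasets, tree
iris = datasets.load_iris()
clf = tree.DecisionTreeClassifier()          #an Estimator: hyperparameters are set in the constructor
clf.fit(iris.data, iris.target)              #fit() learns from training data
print(clf.predict(iris.data[:5]))            #predict() returns predictions for new observations
print(clf.score(iris.data, iris.target))     #score() returns a default metric (accuracy for classifiers)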
Note: You will have to run the following command in a command prompt WITHIN THIS FOLDER before proceeding. This installs xgboost, which is not included in the default Anaconda distribution.
> pip install .\xgboost-0.7-cp36-cp36m-win_amd64.whl
*the .whl (wheel) file is contained within this folder*
#Common Model Algorithms
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, discriminant_analysis, gaussian_process
from xgboost import XGBClassifier
#Common Model Helpers
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn import metrics
#Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
from pandas.plotting import scatter_matrix #pandas.tools.plotting is deprecated; scatter_matrix now lives in pandas.plotting
#Configure Visualization Defaults
#%matplotlib inline = show plots in Jupyter Notebook browser
%matplotlib inline
# The following are just some configuration steps to make sure our graphs look good.
mpl.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 12,8
This is the meet and greet step. Get to know your data by first name and learn a little bit about it. What does it look like (datatypes and values), what makes it tick (independent/feature variable(s)), what are its goals in life (dependent/target variable(s))? Think of it like a first date, before you jump in and start poking it in the bedroom.
To begin this step, we first import our data. Next, we use the info() and sample() functions to get a quick and dirty overview of variable datatypes (i.e. qualitative vs quantitative). See https://www.kaggle.com/c/titanic/data for the source data dictionary.
#import data from file: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
data_raw = pd.read_csv('../data/titanic_train.csv')
#a dataset should be broken into 3 splits: train, test, and (final) validation
#the test file provided is the validation file for competition submission
#we will split the train set into train and test data in future sections
data_val = pd.read_csv('../data/titanic_test.csv')
#to play with our data we'll create a copy
#remember, Python assignment passes object references rather than copies of values, so we use the copy function: https://stackoverflow.com/questions/46327494/python-pandas-dataframe-copydeep-false-vs-copydeep-true-vs
data1 = data_raw.copy(deep = True)
#however passing by reference is convenient, because we can clean both datasets at once
data_cleaner = [data1, data_val]
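#sanity check (illustrative only): data1 is an independent deep copy of data_raw,
#while data_cleaner holds references, so cleaning through data_cleaner also updates data1 and data_val
assert data1 is not data_raw       #copy(deep=True) created a new object
assert data_cleaner[0] is data1    #the list element is the very same object as data1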
#preview data
print (data_raw.info()) #https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html
#data_raw.head() #https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html
#data_raw.tail() #https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.tail.html
data_raw.sample(10) #https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
---|---|---|---|---|---|---|---|---|---|---|---|---
237 | 238 | 1 | 2 | Collyer, Miss. Marjorie "Lottie" | female | 8.0 | 0 | 2 | C.A. 31921 | 26.2500 | NaN | S |
225 | 226 | 0 | 3 | Berglund, Mr. Karl Ivar Sven | male | 22.0 | 0 | 0 | PP 4348 | 9.3500 | NaN | S |
97 | 98 | 1 | 1 | Greenfield, Mr. William Bertram | male | 23.0 | 0 | 1 | PC 17759 | 63.3583 | D10 D12 | C |
808 | 809 | 0 | 2 | Meyer, Mr. August | male | 39.0 | 0 | 0 | 248723 | 13.0000 | NaN | S |
711 | 712 | 0 | 1 | Klaber, Mr. Herman | male | NaN | 0 | 0 | 113028 | 26.5500 | C124 | S |
94 | 95 | 0 | 3 | Coxon, Mr. Daniel | male | 59.0 | 0 | 0 | 364500 | 7.2500 | NaN | S |
466 | 467 | 0 | 2 | Campbell, Mr. William | male | NaN | 0 | 0 | 239853 | 0.0000 | NaN | S |
388 | 389 | 0 | 3 | Sadlier, Mr. Matthew | male | NaN | 0 | 0 | 367655 | 7.7292 | NaN | Q |
594 | 595 | 0 | 2 | Chapman, Mr. John Henry | male | 37.0 | 1 | 0 | SC/AH 29037 | 26.0000 | NaN | S |
307 | 308 | 1 | 1 | Penasco y Castellana, Mrs. Victor de Satode (M... | female | 17.0 | 1 | 0 | PC 17758 | 108.9000 | C65 | C |
In this stage, we will clean our data by 1) correcting aberrant values and outliers, 2) completing missing information, 3) creating new features for analysis, and 4) converting fields to the correct format for calculations and presentation.
print('Train columns with null values:\n', data1.isnull().sum())
print("-"*20)
print('Test/Validation columns with null values:\n', data_val.isnull().sum())
print("-"*20)
data_raw.describe(include = 'all')
Train columns with null values:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
--------------------
Test/Validation columns with null values:
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
--------------------
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
---|---|---|---|---|---|---|---|---|---|---|---|---
count | 891.000000 | 891.000000 | 891.000000 | 891 | 891 | 714.000000 | 891.000000 | 891.000000 | 891 | 891.000000 | 204 | 889 |
unique | NaN | NaN | NaN | 891 | 2 | NaN | NaN | NaN | 681 | NaN | 147 | 3 |
top | NaN | NaN | NaN | Hickman, Mr. Lewis | male | NaN | NaN | NaN | 347082 | NaN | C23 C25 C27 | S |
freq | NaN | NaN | NaN | 1 | 577 | NaN | NaN | NaN | 7 | NaN | 4 | 644 |
mean | 446.000000 | 0.383838 | 2.308642 | NaN | NaN | 29.699118 | 0.523008 | 0.381594 | NaN | 32.204208 | NaN | NaN |
std | 257.353842 | 0.486592 | 0.836071 | NaN | NaN | 14.526497 | 1.102743 | 0.806057 | NaN | 49.693429 | NaN | NaN |
min | 1.000000 | 0.000000 | 1.000000 | NaN | NaN | 0.420000 | 0.000000 | 0.000000 | NaN | 0.000000 | NaN | NaN |
25% | 223.500000 | 0.000000 | 2.000000 | NaN | NaN | 20.125000 | 0.000000 | 0.000000 | NaN | 7.910400 | NaN | NaN |
50% | 446.000000 | 0.000000 | 3.000000 | NaN | NaN | 28.000000 | 0.000000 | 0.000000 | NaN | 14.454200 | NaN | NaN |
75% | 668.500000 | 1.000000 | 3.000000 | NaN | NaN | 38.000000 | 1.000000 | 0.000000 | NaN | 31.000000 | NaN | NaN |
max | 891.000000 | 1.000000 | 3.000000 | NaN | NaN | 80.000000 | 8.000000 | 6.000000 | NaN | 512.329200 | NaN | NaN |
###COMPLETING: complete or delete missing values in train and test/validation dataset
for dataset in data_cleaner:
    #complete missing age with median
    dataset['Age'].fillna(dataset['Age'].median(), inplace = True)
    #complete embarked with mode
    dataset['Embarked'].fillna(dataset['Embarked'].mode()[0], inplace = True)
    #complete missing fare with median
    dataset['Fare'].fillna(dataset['Fare'].median(), inplace = True)
###DELETING: delete the cabin feature/column and others previously stated to exclude in train dataset
drop_column = ['PassengerId','Cabin', 'Ticket']
data1.drop(drop_column, axis=1, inplace = True)
print(data1.isnull().sum())
print("-"*20)
print(data_val.isnull().sum())
Survived      0
Pclass        0
Name          0
Sex           0
Age           0
SibSp         0
Parch         0
Fare          0
Embarked      0
dtype: int64
----------
PassengerId    0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin        327
Embarked       0
dtype: int64
###CREATE: Feature Engineering for train and test/validation dataset
for dataset in data_cleaner:
    #Discrete variables
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
    dataset['IsAlone'] = 1 #initialize to yes/1 is alone
    dataset.loc[dataset['FamilySize'] > 1, 'IsAlone'] = 0 #now update to no/0 if family size is greater than 1
    #quick and dirty code to split title from name: http://www.pythonforbeginners.com/dictionary/python-split
    dataset['Title'] = dataset['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
    #Continuous variable bins; qcut vs cut: https://stackoverflow.com/questions/30211923/what-is-the-difference-between-pandas-qcut-and-pandas-cut
    #Fare Bins/Buckets using qcut or frequency bins: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.qcut.html
    dataset['FareBin'] = pd.qcut(dataset['Fare'], 4)
    #Age Bins/Buckets using cut or value bins: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html
    dataset['AgeBin'] = pd.cut(dataset['Age'].astype(int), 5)
#cleanup rare title names
#print(data1['Title'].value_counts())
stat_min = 10 #while small is arbitrary, we'll use the common minimum in statistics: http://nicholasjjackson.com/2012/03/08/sample-size-is-10-a-magic-number/
title_names = (data1['Title'].value_counts() < stat_min) #this will create a true false series with title name as index
#apply and lambda functions are quick and dirty code to find and replace with fewer lines of code: https://community.modeanalytics.com/python/tutorial/pandas-groupby-and-python-lambda-functions/
data1['Title'] = data1['Title'].apply(lambda x: 'Misc' if title_names.loc[x] == True else x)
print(data1['Title'].value_counts())
print("-"*20)
#preview data again
data1.info()
data_val.info()
data1.sample(10)
Mr        517
Miss      182
Mrs       125
Master     40
Misc       27
Name: Title, dtype: int64
--------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
Survived      891 non-null int64
Pclass        891 non-null int64
Name          891 non-null object
Sex           891 non-null object
Age           891 non-null float64
SibSp         891 non-null int64
Parch         891 non-null int64
Fare          891 non-null float64
Embarked      891 non-null object
FamilySize    891 non-null int64
IsAlone       891 non-null int64
Title         891 non-null object
FareBin       891 non-null category
AgeBin        891 non-null category
dtypes: category(2), float64(2), int64(6), object(4)
memory usage: 85.5+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 16 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            418 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           418 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
FamilySize     418 non-null int64
IsAlone        418 non-null int64
Title          418 non-null object
FareBin        418 non-null category
AgeBin         418 non-null category
dtypes: category(2), float64(2), int64(6), object(6)
memory usage: 46.8+ KB
 | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Embarked | FamilySize | IsAlone | Title | FareBin | AgeBin
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
354 | 0 | 3 | Yousif, Mr. Wazli | male | 28.0 | 0 | 0 | 7.2250 | C | 1 | 1 | Mr | (-0.001, 7.91] | (16.0, 32.0] |
778 | 0 | 3 | Kilgannon, Mr. Thomas J | male | 28.0 | 0 | 0 | 7.7375 | Q | 1 | 1 | Mr | (-0.001, 7.91] | (16.0, 32.0] |
598 | 0 | 3 | Boulos, Mr. Hanna | male | 28.0 | 0 | 0 | 7.2250 | C | 1 | 1 | Mr | (-0.001, 7.91] | (16.0, 32.0] |
738 | 0 | 3 | Ivanoff, Mr. Kanio | male | 28.0 | 0 | 0 | 7.8958 | S | 1 | 1 | Mr | (-0.001, 7.91] | (16.0, 32.0] |
425 | 0 | 3 | Wiseman, Mr. Phillippe | male | 28.0 | 0 | 0 | 7.2500 | S | 1 | 1 | Mr | (-0.001, 7.91] | (16.0, 32.0] |
729 | 0 | 3 | Ilmakangas, Miss. Pieta Sofia | female | 25.0 | 1 | 0 | 7.9250 | S | 2 | 0 | Miss | (7.91, 14.454] | (16.0, 32.0] |
557 | 0 | 1 | Robbins, Mr. Victor | male | 28.0 | 0 | 0 | 227.5250 | C | 1 | 1 | Mr | (31.0, 512.329] | (16.0, 32.0] |
144 | 0 | 2 | Andrew, Mr. Edgardo Samuel | male | 18.0 | 0 | 0 | 11.5000 | S | 1 | 1 | Mr | (7.91, 14.454] | (16.0, 32.0] |
214 | 0 | 3 | Kiernan, Mr. Philip | male | 28.0 | 1 | 0 | 7.7500 | Q | 2 | 0 | Mr | (-0.001, 7.91] | (16.0, 32.0] |
174 | 0 | 1 | Smith, Mr. James Clinch | male | 56.0 | 0 | 0 | 30.6958 | C | 1 | 1 | Mr | (14.454, 31.0] | (48.0, 64.0] |
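To make the qcut vs cut distinction used above concrete, here is a minimal sketch on a made-up series (values chosen only for illustration): qcut builds equal-frequency bins from quantiles of the data, while cut builds equal-width bins across the value range.
#toy illustration of frequency bins (qcut) vs value bins (cut)
import pandas as pd
toy_fares = pd.Series([5, 7, 8, 10, 15, 30, 80, 500])
print(pd.qcut(toy_fares, 4).value_counts())  #4 bins with roughly equal counts per bin
print(pd.cut(toy_fares, 4).value_counts())   #4 equal-width bins; counts can be very uneven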
We will convert categorical data to dummy variables for mathematical analysis. There are multiple ways to encode categorical variables; we will use the sklearn and pandas functions.
In this step, we will also define our x (independent/features/explanatory/predictor/etc.) and y (dependent/target/outcome/response/etc.) variables for data modeling.
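As a minimal sketch of the difference on a made-up column (not our actual data), LabelEncoder maps each category to a single integer code, while pd.get_dummies expands a column into one 0/1 indicator column per category:
#toy illustration: integer codes vs dummy/indicator variables
import pandas as pd
from sklearn.preprocessing import LabelEncoder
toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
print(LabelEncoder().fit_transform(toy['Embarked']))  #one integer per category, e.g. [2 0 1 2]
print(pd.get_dummies(toy['Embarked']))                #three indicator columns: C, Q, S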
#CONVERT: encode categorical features as integer codes using LabelEncoder for train and test/validation dataset
#code categorical data
label = LabelEncoder()
for dataset in data_cleaner:
    dataset['Sex_Code'] = label.fit_transform(dataset['Sex'])
    dataset['Embarked_Code'] = label.fit_transform(dataset['Embarked'])
    dataset['Title_Code'] = label.fit_transform(dataset['Title'])
    dataset['AgeBin_Code'] = label.fit_transform(dataset['AgeBin'])
    dataset['FareBin_Code'] = label.fit_transform(dataset['FareBin'])
#define y variable aka target/outcome
Target = ['Survived']
#define x variables for original features aka feature selection
data1_x = ['Sex','Pclass', 'Embarked', 'Title','SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone'] #pretty name/values for charts
data1_x_calc = ['Sex_Code','Pclass', 'Embarked_Code', 'Title_Code','SibSp', 'Parch', 'Age', 'Fare'] #coded for algorithm calculation
data1_xy = Target + data1_x
print('Original X Y: ', data1_xy, '\n')
#define x variables for original w/bin features to remove continuous variables
data1_x_bin = ['Sex_Code','Pclass', 'Embarked_Code', 'Title_Code', 'FamilySize', 'AgeBin_Code', 'FareBin_Code']
data1_xy_bin = Target + data1_x_bin
print('Bin X Y: ', data1_xy_bin, '\n')
#define x and y variables for dummy features original
data1_dummy = pd.get_dummies(data1[data1_x])
data1_x_dummy = data1_dummy.columns.tolist()
data1_xy_dummy = Target + data1_x_dummy
print('Dummy X Y: ', data1_xy_dummy, '\n')
data1_dummy.head()
Original X Y:  ['Survived', 'Sex', 'Pclass', 'Embarked', 'Title', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone']
Bin X Y:  ['Survived', 'Sex_Code', 'Pclass', 'Embarked_Code', 'Title_Code', 'FamilySize', 'AgeBin_Code', 'FareBin_Code']
Dummy X Y:  ['Survived', 'Pclass', 'SibSp', 'Parch', 'Age', 'Fare', 'FamilySize', 'IsAlone', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Title_Master', 'Title_Misc', 'Title_Miss', 'Title_Mr', 'Title_Mrs']
 | Pclass | SibSp | Parch | Age | Fare | FamilySize | IsAlone | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S | Title_Master | Title_Misc | Title_Miss | Title_Mr | Title_Mrs
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 3 | 1 | 0 | 22.0 | 7.2500 | 2 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
1 | 1 | 1 | 0 | 38.0 | 71.2833 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | 3 | 0 | 0 | 26.0 | 7.9250 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
3 | 1 | 1 | 0 | 35.0 | 53.1000 | 2 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
4 | 3 | 0 | 0 | 35.0 | 8.0500 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
Now that we've cleaned our data, let's do a discount double-check!
print('Train columns with null values: \n', data1.isnull().sum())
print("-"*10)
print (data1.info())
print("-"*10)
print('Test/Validation columns with null values: \n', data_val.isnull().sum())
print("-"*10)
print (data_val.info())
print("-"*10)
data_raw.describe(include = 'all')
Train columns with null values:
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Fare             0
Embarked         0
FamilySize       0
IsAlone          0
Title            0
FareBin          0
AgeBin           0
Sex_Code         0
Embarked_Code    0
Title_Code       0
AgeBin_Code      0
FareBin_Code     0
dtype: int64
----------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 19 columns):
Survived         891 non-null int64
Pclass           891 non-null int64
Name             891 non-null object
Sex              891 non-null object
Age              891 non-null float64
SibSp            891 non-null int64
Parch            891 non-null int64
Fare             891 non-null float64
Embarked         891 non-null object
FamilySize       891 non-null int64
IsAlone          891 non-null int64
Title            891 non-null object
FareBin          891 non-null category
AgeBin           891 non-null category
Sex_Code         891 non-null int64
Embarked_Code    891 non-null int64
Title_Code       891 non-null int64
AgeBin_Code      891 non-null int64
FareBin_Code     891 non-null int64
dtypes: category(2), float64(2), int64(11), object(4)
memory usage: 120.3+ KB
None
----------
Test/Validation columns with null values:
PassengerId        0
Pclass             0
Name               0
Sex                0
Age                0
SibSp              0
Parch              0
Ticket             0
Fare               0
Cabin            327
Embarked           0
FamilySize         0
IsAlone            0
Title              0
FareBin            0
AgeBin             0
Sex_Code           0
Embarked_Code      0
Title_Code         0
AgeBin_Code        0
FareBin_Code       0
dtype: int64
----------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 21 columns):
PassengerId      418 non-null int64
Pclass           418 non-null int64
Name             418 non-null object
Sex              418 non-null object
Age              418 non-null float64
SibSp            418 non-null int64
Parch            418 non-null int64
Ticket           418 non-null object
Fare             418 non-null float64
Cabin            91 non-null object
Embarked         418 non-null object
FamilySize       418 non-null int64
IsAlone          418 non-null int64
Title            418 non-null object
FareBin          418 non-null category
AgeBin           418 non-null category
Sex_Code         418 non-null int64
Embarked_Code    418 non-null int64
Title_Code       418 non-null int64
AgeBin_Code      418 non-null int64
FareBin_Code     418 non-null int64
dtypes: category(2), float64(2), int64(11), object(6)
memory usage: 63.1+ KB
None
----------
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
---|---|---|---|---|---|---|---|---|---|---|---|---
count | 891.000000 | 891.000000 | 891.000000 | 891 | 891 | 714.000000 | 891.000000 | 891.000000 | 891 | 891.000000 | 204 | 889 |
unique | NaN | NaN | NaN | 891 | 2 | NaN | NaN | NaN | 681 | NaN | 147 | 3 |
top | NaN | NaN | NaN | Hickman, Mr. Lewis | male | NaN | NaN | NaN | 347082 | NaN | C23 C25 C27 | S |
freq | NaN | NaN | NaN | 1 | 577 | NaN | NaN | NaN | 7 | NaN | 4 | 644 |
mean | 446.000000 | 0.383838 | 2.308642 | NaN | NaN | 29.699118 | 0.523008 | 0.381594 | NaN | 32.204208 | NaN | NaN |
std | 257.353842 | 0.486592 | 0.836071 | NaN | NaN | 14.526497 | 1.102743 | 0.806057 | NaN | 49.693429 | NaN | NaN |
min | 1.000000 | 0.000000 | 1.000000 | NaN | NaN | 0.420000 | 0.000000 | 0.000000 | NaN | 0.000000 | NaN | NaN |
25% | 223.500000 | 0.000000 | 2.000000 | NaN | NaN | 20.125000 | 0.000000 | 0.000000 | NaN | 7.910400 | NaN | NaN |
50% | 446.000000 | 0.000000 | 3.000000 | NaN | NaN | 28.000000 | 0.000000 | 0.000000 | NaN | 14.454200 | NaN | NaN |
75% | 668.500000 | 1.000000 | 3.000000 | NaN | NaN | 38.000000 | 1.000000 | 0.000000 | NaN | 31.000000 | NaN | NaN |
max | 891.000000 | 1.000000 | 3.000000 | NaN | NaN | 80.000000 | 8.000000 | 6.000000 | NaN | 512.329200 | NaN | NaN |
As mentioned previously, the test file provided is really validation data for competition submission. So, we will use sklearn's train_test_split function to split the training data into two datasets with a 75/25 split. This is important so we don't overfit our model, meaning the algorithm becomes so specific to a given subset that it cannot accurately generalize to another subset from the same dataset. It's important that our algorithm has not seen the subset we will use to test, so it doesn't "cheat" by memorizing the answers. In later sections we will also use sklearn's cross-validation functions, which split our dataset into train and test folds for model comparison.
#split train and test data with function defaults
#random_state -> seed or control random number generator: https://www.quora.com/What-is-seed-in-random-number-generation
train1_x, test1_x, train1_y, test1_y = model_selection.train_test_split(data1[data1_x_calc], data1[Target], random_state = 0)
train1_x_bin, test1_x_bin, train1_y_bin, test1_y_bin = model_selection.train_test_split(data1[data1_x_bin], data1[Target] , random_state = 0)
train1_x_dummy, test1_x_dummy, train1_y_dummy, test1_y_dummy = model_selection.train_test_split(data1_dummy[data1_x_dummy], data1[Target], random_state = 0)
print("Data1 Shape: {}".format(data1.shape))
print("Train1 Shape: {}".format(train1_x.shape))
print("Test1 Shape: {}".format(test1_x.shape))
train1_x_bin.head()
Data1 Shape: (891, 19)
Train1 Shape: (668, 8)
Test1 Shape: (223, 8)
 | Sex_Code | Pclass | Embarked_Code | Title_Code | FamilySize | AgeBin_Code | FareBin_Code
---|---|---|---|---|---|---|---
105 | 1 | 3 | 2 | 3 | 1 | 1 | 0 |
68 | 0 | 3 | 2 | 2 | 7 | 1 | 1 |
253 | 1 | 3 | 2 | 3 | 2 | 1 | 2 |
320 | 1 | 3 | 2 | 3 | 1 | 1 | 0 |
706 | 0 | 2 | 2 | 4 | 1 | 2 | 1 |
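As noted above, later sections will use sklearn's cross-validation functions; as a minimal preview sketch (the model here is chosen purely for illustration), cross_val_score repeats the split/fit/score cycle over several folds and reports the score of each:
#minimal preview of cross validation: score a simple model across several train/test folds
from sklearn import model_selection, tree
cv_model = tree.DecisionTreeClassifier(random_state = 0)
cv_scores = model_selection.cross_val_score(cv_model, data1[data1_x_bin], data1[Target].values.ravel(), cv = 5)
print("CV accuracy per fold: {}".format(cv_scores))
print("CV accuracy mean: {:.3f}".format(cv_scores.mean()))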