#!/usr/bin/env python
# coding: utf-8

# # Kaggle Titanic Survivors

# This is my first kernel on Kaggle. The goal of this project is to achieve more than 80% accuracy in predicting whether a given passenger survived or died. In this kernel I'm going to use Support Vector Machines (http://scikit-learn.org/stable/modules/svm.html) as the approach to this problem.

# In[1]:

# Importing libraries to work with the data
import numpy as np
import pandas as pd

# In[2]:

# Importing the data itself
train_set = pd.read_csv('train.csv')
test_set = pd.read_csv('test.csv')

# With the data imported, I want to take a look at it: the column information and the first rows.

# In[3]:

train_set.info()

# In[4]:

test_set.info()

# In[5]:

train_set.head()

# In[6]:

test_set.head()

# ## Learning from the data

# First I will follow my intuition and see whether women and children have better chances of survival.

# In[7]:

# Importing matplotlib to plot graphs
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')

# In[8]:

train_set.groupby('Sex').Survived.mean().plot(kind='bar')

# In[9]:

train_set.groupby('Age').Survived.mean().plot(kind='line')

# As we predicted, the survival rate is good among women and children. Let's see if that's enough.

# ## Setting up a Support Vector Machine

# Before we start, we have to standardize features such as Age and Sex. For Sex, we just map 'male' and 'female' to 0 and 1. For Age, we round the values and predict the missing ones. We are already using a Support Vector Machine here to make the age prediction; the model itself is explained later.

# In[10]:

# Mapping Sex to 0 and 1
train_set['Sex'] = train_set['Sex'].map({'male': 0, 'female': 1}).astype(int)

# In[11]:

# Rounding the ages to whole numbers so scikit-learn treats them as
# discrete class labels rather than a continuous target
train_set['Age'] = train_set.Age.round()

# In[12]:

# Separating the rows with known ages (to train on) from the rows
# with missing ages (to be predicted)
X_train = train_set[train_set.Age.notnull()][['Pclass', 'Sex', 'SibSp', 'Parch', 'Fare']]
X_test = train_set[train_set.Age.isnull()][['Pclass', 'Sex', 'SibSp', 'Parch', 'Fare']]
y = train_set.Age.dropna()

# In[13]:

# Predicting the missing ages
from sklearn.svm import SVC

age_classifier = SVC()
age_classifier.fit(X_train, y)
prediction = age_classifier.predict(X_test)
agePrediction = pd.DataFrame(data=prediction, index=X_test.index.values, columns=['Age'])
# combine_first fills the NaN ages in train_set with the predicted
# values, aligning on the row index
train_set = train_set.combine_first(agePrediction)

# In[14]:

# Just confirming that no ages are missing anymore
train_set.Age.isnull().sum()
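# Note that the SVC above is doing classification over the rounded ages, not regression. A simpler baseline worth comparing against is a median fill per passenger class and sex; the sketch below is only illustrative (it works on a fresh copy of the data and is not used by the rest of the kernel).

# In[ ]:

# Baseline sketch (not used below): fill missing ages with the median
# age of passengers sharing the same Pclass and Sex
baseline = pd.read_csv('train.csv')
baseline['Age'] = baseline.groupby(['Pclass', 'Sex'])['Age'].transform(
    lambda ages: ages.fillna(ages.median()))
print(baseline.Age.isnull().sum())  # 0 means no missing ages remain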
# ## Predicting with Age and Sex

# Now that we have our data set ready and our intuition backed by the graphs, let's see how well these features work with the SVM. In the code below, we split the given training data into a new training set and a test set, using only the Sex and Age features.

# In[15]:

from sklearn.model_selection import train_test_split

# Taking only the features that matter for now
X = train_set[['Sex', 'Age']]
# Taking the labels (Survived or Not Survived)
y = train_set['Survived']
# Splitting into an 80% training set and a 20% test set so we can measure our accuracy
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# In[16]:

X_train.info()  # Just checking that everything went right

# In[17]:

X_train.head()

# In[18]:

len(y_train)

# In[19]:

y_train.head()

# ### Coding a Support Vector Machine

# Here we actually use an SVM to make predictions on our data. Scikit-learn offers several SVM variants; for this project I chose C-Support Vector Classification (http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) because I find it the simplest. I will not tune it for now and will just see how it goes.

# In[20]:

# Importing C-Support Vector Classification from scikit-learn
from sklearn.svm import SVC

# Declaring the SVC with no tuning
classifier = SVC()
# Fitting the data. This is where the SVM learns
classifier.fit(X_train, y_train)
# Scoring the held-out test set to get the accuracy
score = classifier.score(X_test, y_test)
print(score)

# 78.77% on the first shot! I'm surprised! We are very close to our goal.

# ## Predicting with Class

# Now we have to do a little better to reach 80% accuracy. My second intuition is that the rich fared better than the poor, because, you know, that's how the world is. Let's take a look at the data.

# In[21]:

train_set.groupby('Pclass').Survived.mean().plot(kind='bar')

# Yep. More than 60% of the people in first class made it. Let's see those numbers among women.

# In[22]:

train_set.query('Sex == 1').groupby('Pclass').Survived.mean().plot(kind='bar')

# Wow! Almost every woman in first class made it! We are definitely heading in the right direction. Let's add Pclass to our Support Vector Machine and see how it goes.

# In[23]:

# Taking only the features that matter for now
X = train_set[['Sex', 'Age', 'Pclass']]
# Taking the labels (Survived or Not Survived)
y = train_set['Survived']
# Splitting into an 80% training set and a 20% test set so we can measure our accuracy
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# In[24]:

# Declaring the SVC with no tuning
classifier = SVC()
# Fitting the data. This is where the SVM learns
classifier.fit(X_train, y_train)
# Scoring the held-out test set to get the accuracy
score = classifier.score(X_test, y_test)
print(score)

# Yay! We made it! With only three features we got an accuracy of more than 80%! Awesome.

# # Conclusion

# The goal of this notebook was to reach 80% accuracy predicting the survivors of the Titanic disaster, and we did it. We also saw the power and simplicity of Support Vector Machines: with a few simple lines of code we reached our goal. Because the disaster is such a well-known event, anyone can explore it with intuition. It's obvious that women and children from the upper class had better chances of survival, but the SVC did a great job classifying them. Hope you enjoyed it.

# ## Appendix

# Now let's generate the data to submit to the competition. For this, we apply the same preprocessing we did on `train_set`, but now on `test_set`.

# In[25]:

# Mapping Sex to 0 and 1
test_set['Sex'] = test_set['Sex'].map({'male': 0, 'female': 1}).astype(int)

# In[26]:

# Rounding the ages
test_set['Age'] = test_set.Age.round()

# In[27]:

# Separating the rows with known ages from the rows with missing ages.
# Fare is excluded this time: the test set has one missing Fare value,
# and scikit-learn raises an error when fitting on NaN input
X_train = test_set[test_set.Age.notnull()][['Pclass', 'Sex', 'SibSp', 'Parch']]
X_test = test_set[test_set.Age.isnull()][['Pclass', 'Sex', 'SibSp', 'Parch']]
y = test_set.Age.dropna()

# In[28]:

age_classifier = SVC()
age_classifier.fit(X_train, y)
prediction = age_classifier.predict(X_test)
agePrediction = pd.DataFrame(data=prediction, index=X_test.index.values, columns=['Age'])
test_set = test_set.combine_first(agePrediction)

# Now we use the SVM to predict and, after that, generate the CSV.

# In[29]:

to_predict = test_set[['Sex', 'Age', 'Pclass']]
predictions = classifier.predict(to_predict)
dfPrediction = pd.DataFrame(data=predictions, index=test_set.PassengerId.values, columns=['Survived'])
# Kaggle expects the columns PassengerId and Survived, so label the index
dfPrediction.to_csv('contest_output.csv', index_label='PassengerId')
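# As a final check, we can read the file back and confirm it has the shape the competition expects: one prediction for each of the 418 passengers in the Kaggle test set. This is just a convenience sketch, not a required step.

# In[ ]:

# Optional sanity check on the submission file written above
submission = pd.read_csv('contest_output.csv')
print(submission.shape)             # expected: (418, 2)
print(submission.columns.tolist())  # expected: ['PassengerId', 'Survived']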