Enron POI Identification - Project

Enron, once one of the largest US companies, had collapsed into bankruptcy by 2002 due to widespread corporate fraud. As a result of the federal investigation, confidential information was made public, including emails and financial data for executives. We will build a POI (Person of Interest) identifier, using machine learning and sklearn, to pick out the individuals involved in the fraud from the data.

The email features in final_project_dataset.pkl are aggregated from the email dataset, and they record the number of messages to or from a given person/email address, as well as the number of messages to or from a known POI email address and the number of messages that have shared receipt with a POI.

Of the 146 people in the data set, 18 are POIs. Machine learning is useful here because of the complexity of the data set: it lets us look for patterns that distinguish POIs. The resulting model may then help us identify POIs in new data -- if a new person's data is sent through the model, the model predicts whether that person may be a POI or not.

In this project I will build a person of interest identifier based on financial and email data made public as a result of the Enron scandal. I use email and financial data for 146 executives at Enron to identify persons of interest in the fraud case. A person of interest (POI) is someone who was indicted for fraud, settled with the government, or testified in exchange for immunity. This report documents the machine learning techniques used in building a POI identifier.

There are four major steps in my project:

  • Enron dataset
  • Feature processing
  • Algorithm
  • Validation

Let's start with the first step.

Section 1: Enron Dataset

In [1]:
import sys
import pickle
import pprint
from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data,test_classifier

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

import warnings
import seaborn as sns
/home/aishwary/anaconda2/lib/python2.7/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
In [2]:
with open("final_project_dataset.pkl", "r") as data_file:
    data_dict = pickle.load(data_file) # this is the original copy of the data we are extracting.
df = pd.DataFrame.from_dict(data_dict, orient='index')

Let's look at our DataFrame :-) DataFrames give us many convenient ways to manipulate the data for our tasks.
Creating plots is also much easier with DataFrames than with dictionaries.

In [3]:
df.head()
salary to_messages deferral_payments total_payments exercised_stock_options bonus restricted_stock shared_receipt_with_poi restricted_stock_deferred total_stock_value ... loan_advances from_messages other from_this_person_to_poi poi director_fees deferred_income long_term_incentive email_address from_poi_to_this_person
ALLEN PHILLIP K 201955 2902 2869717 4484442 1729541 4175000 126027 1407 -126027 1729541 ... NaN 2195 152 65 False NaN -3081055 304805 [email protected] 47
BADUM JAMES P NaN NaN 178980 182466 257817 NaN NaN NaN NaN 257817 ... NaN NaN NaN NaN False NaN NaN NaN NaN NaN
BANNANTINE JAMES M 477 566 NaN 916197 4046157 NaN 1757552 465 -560222 5243487 ... NaN 29 864523 0 False NaN -5104 NaN [email protected] 39
BAXTER JOHN C 267102 NaN 1295738 5634343 6680544 1200000 3942714 NaN NaN 10623258 ... NaN NaN 2660303 NaN False NaN -1386055 1586055 NaN NaN
BAY FRANKLIN R 239671 NaN 260455 827696 NaN 400000 145796 NaN -82782 63014 ... NaN NaN 69 NaN False NaN -201641 NaN [email protected] NaN

5 rows × 21 columns

In [4]:
# We're having a look at the first few entries in the data dictionary.
for count, name in enumerate(data_dict.keys()[:15], 1):
    print count, " ", name

Let's understand how the data inside one entry of the dictionary looks.

In [5]:
pprint.pprint (data_dict['BAXTER JOHN C'])
{'bonus': 1200000,
 'deferral_payments': 1295738,
 'deferred_income': -1386055,
 'director_fees': 'NaN',
 'email_address': 'NaN',
 'exercised_stock_options': 6680544,
 'expenses': 11200,
 'from_messages': 'NaN',
 'from_poi_to_this_person': 'NaN',
 'from_this_person_to_poi': 'NaN',
 'loan_advances': 'NaN',
 'long_term_incentive': 1586055,
 'other': 2660303,
 'poi': False,
 'restricted_stock': 3942714,
 'restricted_stock_deferred': 'NaN',
 'salary': 267102,
 'shared_receipt_with_poi': 'NaN',
 'to_messages': 'NaN',
 'total_payments': 5634343,
 'total_stock_value': 10623258}

Let's also look at the basic counts we're interested in: people, POIs, and features.

In [6]:
print 'Number of people:', df['poi'].count()
print 'Number of POIs:', df.loc[df.poi == True, 'poi'].count()
print 'Fraction of examples that are POIs:', \
    float(df.loc[df.poi == True, 'poi'].count()) / df['poi'].count()
print 'Number of features:', df.shape[1]
Number of people: 146
Number of POIs: 18
Fraction of examples that are POIs: 0.123287671233
Number of features: 21
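
Because only about 12% of the records are POIs, accuracy on its own can be misleading. The following is a minimal sketch, using only the counts printed above, of the accuracy a hypothetical classifier would reach by labelling everyone as a non-POI. The variable names are illustrative and not part of the project code.

```python
# Counts reported above: 146 people, 18 of them POIs.
n_people = 146
n_poi = 18

# Share of records that are POIs.
poi_fraction = n_poi / float(n_people)

# A classifier that always predicts "not a POI" is right for every non-POI.
baseline_accuracy = (n_people - n_poi) / float(n_people)

print("POI fraction: %.4f" % poi_fraction)
print("Always-non-POI accuracy: %.4f" % baseline_accuracy)
```

Any model we build should therefore be judged on precision and recall as well, since a high accuracy can be reached without finding a single POI.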
In [7]:
fpoi = open("poi_names.txt", "r")
rfile = fpoi.readlines()
poi = len(rfile[2:])
print "There were " + str(poi) + " poi's total."
There were 35 poi's total.

The text file lists 35 POIs in total, but for this project we are considering only the 18 POIs present in the data set.

The DataFrame also appears to contain a lot of NaN values, so let's count them.

In [8]:
df.isnull().sum()
salary                       0
to_messages                  0
deferral_payments            0
total_payments               0
exercised_stock_options      0
bonus                        0
restricted_stock             0
shared_receipt_with_poi      0
restricted_stock_deferred    0
total_stock_value            0
expenses                     0
loan_advances                0
from_messages                0
other                        0
from_this_person_to_poi      0
poi                          0
director_fees                0
deferred_income              0
long_term_incentive          0
email_address                0
from_poi_to_this_person      0
dtype: int64

It seems that although the NaNs exist, they are stored as the string 'NaN', which pandas does not recognize as missing. So let's make them recognizable.

In [9]:
df = df.replace('NaN', np.nan)
df.isnull().sum()
salary                        51
to_messages                   60
deferral_payments            107
total_payments                21
exercised_stock_options       44
bonus                         64
restricted_stock              36
shared_receipt_with_poi       60
restricted_stock_deferred    128
total_stock_value             20
expenses                      51
loan_advances                142
from_messages                 60
other                         53
from_this_person_to_poi       60
poi                            0
director_fees                129
deferred_income               97
long_term_incentive           80
email_address                 35
from_poi_to_this_person       60
dtype: int64
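
The first count returned zero everywhere because the dictionary stores missing values as the text 'NaN' rather than as a real floating-point NaN. Here is a small dependency-free sketch of counting both kinds of missing value per feature. It is written for Python 3, and the records and the helper names is_missing and count_missing are made up for illustration.

```python
import math

# Toy records in the same shape as data_dict; names and values are made up.
records = {
    "PERSON A": {"salary": 100000, "bonus": "NaN"},
    "PERSON B": {"salary": "NaN", "bonus": "NaN"},
    "PERSON C": {"salary": 250000, "bonus": 50000},
}

def is_missing(value):
    """True for the 'NaN' string placeholder or a real float NaN."""
    if value == "NaN":
        return True
    return isinstance(value, float) and math.isnan(value)

def count_missing(records):
    """Return {feature: number of records where the feature is missing}."""
    counts = {}
    for person in records.values():
        for feature, value in person.items():
            counts[feature] = counts.get(feature, 0) + int(is_missing(value))
    return counts

missing_counts = count_missing(records)
print(missing_counts)
```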

One important observation is that the following email features all have the same null count:

  • to_messages
  • shared_receipt_with_poi
  • from_messages
  • from_this_person_to_poi
  • from_poi_to_this_person

All five are missing for the same number of people (60), which suggests a single cause: there is no email data at all for those persons.
In [10]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 146 entries, ALLEN PHILLIP K to YEAP SOON
Data columns (total 21 columns):
salary                       95 non-null float64
to_messages                  86 non-null float64
deferral_payments            39 non-null float64
total_payments               125 non-null float64
exercised_stock_options      102 non-null float64
bonus                        82 non-null float64
restricted_stock             110 non-null float64
shared_receipt_with_poi      86 non-null float64
restricted_stock_deferred    18 non-null float64
total_stock_value            126 non-null float64
expenses                     95 non-null float64
loan_advances                4 non-null float64
from_messages                86 non-null float64
other                        93 non-null float64
from_this_person_to_poi      86 non-null float64
poi                          146 non-null bool
director_fees                17 non-null float64
deferred_income              49 non-null float64
long_term_incentive          66 non-null float64
email_address                111 non-null object
from_poi_to_this_person      86 non-null float64
dtypes: bool(1), float64(19), object(1)
memory usage: 24.1+ KB

The financial data also contains NaNs, for example in the bonus column.
So the natural question is: why do these NaNs exist? To find out, I checked the insider pay PDF document and found that the entries in a few columns, such as bonus, were simply blank, most likely because that item did not apply to the person. The way the totals are calculated in that document supports this view.

So I decided to replace all the NaNs in the financial features with zero.

In [11]:
financial_features = ['salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus',
                      'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses',
                      'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock',
                      'director_fees']
df[financial_features] = df[financial_features].fillna(0)
df.isnull().sum()
salary                        0
to_messages                  60
deferral_payments             0
total_payments                0
exercised_stock_options       0
bonus                         0
restricted_stock              0
shared_receipt_with_poi      60
restricted_stock_deferred     0
total_stock_value             0
expenses                      0
loan_advances                 0
from_messages                60
other                         0
from_this_person_to_poi      60
poi                           0
director_fees                 0
deferred_income               0
long_term_incentive           0
email_address                35
from_poi_to_this_person      60
dtype: int64

Now that we have handled the financial features of the dataset, let's deal with the email features as well.

I have intentionally kept these two groups of features separate, to reflect how different the two kinds of data are.

In [12]:
email_features = ['to_messages', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi',
                  'shared_receipt_with_poi']
In [13]:
df[email_features] = df[email_features].fillna(df[email_features].median())
In [14]:
df.isnull().sum()
salary                        0
to_messages                   0
deferral_payments             0
total_payments                0
exercised_stock_options       0
bonus                         0
restricted_stock              0
shared_receipt_with_poi       0
restricted_stock_deferred     0
total_stock_value             0
expenses                      0
loan_advances                 0
from_messages                 0
other                         0
from_this_person_to_poi       0
poi                           0
director_fees                 0
deferred_income               0
long_term_incentive           0
email_address                35
from_poi_to_this_person       0
dtype: int64
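
The two imputation choices made above (zero for the financial features, the column median for the email features) can be illustrated on toy columns. This is only a sketch in plain Python 3: the function fill_missing and the sample values are hypothetical, and None stands in for a missing entry.

```python
from statistics import median

def fill_missing(values, strategy):
    """Replace None with 0 ('zero') or with the median of the known values ('median')."""
    known = [v for v in values if v is not None]
    if strategy == "zero":
        fill = 0
    elif strategy == "median":
        fill = median(known)
    else:
        raise ValueError("unknown strategy: %r" % strategy)
    return [fill if v is None else v for v in values]

# Made-up columns; None marks a missing entry.
bonus = [400000, None, 1200000, None]
to_messages = [2902, None, 566, 1200, None]

bonus_filled = fill_missing(bonus, "zero")
to_messages_filled = fill_missing(to_messages, "median")
print(bonus_filled)
print(to_messages_filled)
```

The median is less sensitive than the mean to a few very heavy email users, which is one reason it can be a reasonable default for count-like features.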

After processing the email features, I decided not to touch email_address: it is not required for this project, and processing it would be out of scope.

Let's Visualize some data and understand how Salary and Bonus are related.

In [15]:
features = ["salary", "bonus"]
#data_dict.pop('TOTAL', 0)
data = featureFormat(data_dict, features)
### plot features
for point in data:
    salary = point[0]
    bonus = point[1]
    plt.scatter( salary, bonus )

plt.title("Salary vs Bonus")
<matplotlib.text.Text at 0x7f826e2e5c90>
In [16]:
outlier = max(data,key=lambda item:item[1])
print (outlier)

my_dataset = data_dict # note: this is an alias, not a copy, so removals below also change data_dict

for person in my_dataset:
    if my_dataset[person]['salary'] == outlier[0] and my_dataset[person]['bonus'] == outlier[1]:
        print "The outlier is : ",person
[ 26704229.  97343619.]
The outlier is :  TOTAL

So, from the observation above, it is clear that our outlier is the "TOTAL" row. It is the line at the end of the data that holds the sum of all the other entries, so it is necessarily far larger than any individual's values. It is not a person and is of no use to us.

We are going to remove it from our dataset and then look at the scatter plot again without it.
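
One way to confirm that a row such as TOTAL is an aggregate rather than a person is to check whether its value equals the sum of all the other rows. This is a minimal sketch on made-up numbers: the function find_aggregate_rows and the toy records are hypothetical, and on the real data the match may not be exact, hence the tolerance parameter.

```python
def find_aggregate_rows(records, feature, tolerance=0):
    """Return names whose value equals the sum of everyone else's values."""
    # Ignore records where the feature is missing.
    values = {name: row[feature] for name, row in records.items()
              if row[feature] != "NaN"}
    grand_total = sum(values.values())
    return [name for name, value in values.items()
            if abs(value - (grand_total - value)) <= tolerance]

# Made-up salaries plus a spreadsheet-style summary row.
toy = {
    "PERSON A": {"salary": 100},
    "PERSON B": {"salary": 250},
    "PERSON C": {"salary": "NaN"},
    "TOTAL": {"salary": 350},
}
aggregate_rows = find_aggregate_rows(toy, "salary")
print(aggregate_rows)
```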

There is one more problem in our dataset: comparing against the PDF, I noticed that every single one of Eugene Lockhart's features is either "NaN" or 0, so his record carries no information.

In [17]:
print "Before removing TOTAL length of our dataset is ",len(my_dataset)
Before removing TOTAL length of our dataset is  146

One entry in our dataset stands out because every one of its features is either 0 or NaN.
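
Records like this can also be found programmatically instead of by reading the PDF. The sketch below assumes records shaped like data_dict and skips the poi label when checking. The helper name empty_records and the toy data are made up for illustration.

```python
def empty_records(records, ignore=("poi",)):
    """Return sorted names whose every feature outside `ignore` is 'NaN' or 0."""
    empty = []
    for name in sorted(records):
        values = [v for k, v in records[name].items() if k not in ignore]
        if all(v == "NaN" or v == 0 for v in values):
            empty.append(name)
    return empty

# Made-up records in the same shape as data_dict.
toy = {
    "PERSON A": {"poi": False, "salary": 100000, "bonus": "NaN"},
    "PERSON B": {"poi": False, "salary": "NaN", "bonus": "NaN"},
    "PERSON C": {"poi": True, "salary": 0, "bonus": "NaN"},
}
no_information = empty_records(toy)
print(no_information)
```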

In [18]:
my_dataset.pop('LOCKHART EUGENE E')
{'bonus': 'NaN',
 'deferral_payments': 'NaN',
 'deferred_income': 'NaN',
 'director_fees': 'NaN',
 'email_address': 'NaN',
 'exercised_stock_options': 'NaN',
 'expenses': 'NaN',
 'from_messages': 'NaN',
 'from_poi_to_this_person': 'NaN',
 'from_this_person_to_poi': 'NaN',
 'loan_advances': 'NaN',
 'long_term_incentive': 'NaN',
 'other': 'NaN',
 'poi': False,
 'restricted_stock': 'NaN',
 'restricted_stock_deferred': 'NaN',
 'salary': 'NaN',
 'shared_receipt_with_poi': 'NaN',
 'to_messages': 'NaN',
 'total_payments': 'NaN',
 'total_stock_value': 'NaN'}

The pop above already removed that entry from our dataset. Next we remove TOTAL as well.

In [19]:
my_dataset.pop('TOTAL', 0)
print "After removing TOTAL length of our dataset is ",len(my_dataset)
After removing TOTAL length of our dataset is  144
In [20]:
data = featureFormat(my_dataset,features) # rebuild the salary/bonus array from the cleaned dataset

So our outlier and the empty record were removed as expected. Now let's see what happens when we plot the data again.

In [21]:
for person in data:
    salary = person[0]
    bonus = person[1]
    plt.scatter(salary, bonus)
plt.title('Salary vs Bonus')

The scatter plot now shows the relationship between salary and bonus on a sensible scale, with the individual points clearly separated.

In [22]:
(df["salary"]).describe().apply(lambda x: format(x, '.2f'))
count         146.00
mean       365811.36
std       2203574.96
min             0.00
25%             0.00
50%        210596.00
75%        270850.50
max      26704229.00
Name: salary, dtype: object

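The mean value for the salary is 365,811.
The standard deviation is 2,203,575, which is high!! Note that these statistics come from df, which at this point still contains the TOTAL row and has missing salaries filled with zero.
So, on an arbitrary basis, let's consider an outlier to be someone with a salary of at least a million (1,000,000) and a bonus of at least 5,000,000.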

In [23]:
for person in my_dataset:
    if my_dataset[person]['salary'] != 'NaN' and my_dataset[person]['bonus'] != 'NaN' \
    and my_dataset[person]['salary'] >= 1000000 and my_dataset[person]['bonus'] >= 5000000:
        print person, 'Salary:', my_dataset[person]['salary'], 'Bonus:', my_dataset[person]['bonus']
LAY KENNETH L Salary: 1072321 Bonus: 7000000
SKILLING JEFFREY K Salary: 1111258 Bonus: 5600000

Two people turn out to have outlier values:

  • Kenneth L. Lay
  • Jeffrey K. Skilling

Let's find out more about them.
In [24]:
pprint.pprint (my_dataset['LAY KENNETH L'])
{'bonus': 7000000,
 'deferral_payments': 202911,
 'deferred_income': -300000,
 'director_fees': 'NaN',
 'email_address': '[email protected]',
 'exercised_stock_options': 34348384,
 'expenses': 99832,
 'from_messages': 36,
 'from_poi_to_this_person': 123,
 'from_this_person_to_poi': 16,
 'loan_advances': 81525000,
 'long_term_incentive': 3600000,
 'other': 10359729,
 'poi': True,
 'restricted_stock': 14761694,
 'restricted_stock_deferred': 'NaN',
 'salary': 1072321,
 'shared_receipt_with_poi': 2411,
 'to_messages': 4273,
 'total_payments': 103559793,
 'total_stock_value': 49110078}
In [25]:
pprint.pprint (my_dataset['SKILLING JEFFREY K'])
{'bonus': 5600000,
 'deferral_payments': 'NaN',
 'deferred_income': 'NaN',
 'director_fees': 'NaN',
 'email_address': '[email protected]',
 'exercised_stock_options': 19250000,
 'expenses': 29336,
 'from_messages': 108,
 'from_poi_to_this_person': 88,
 'from_this_person_to_poi': 30,
 'loan_advances': 'NaN',
 'long_term_incentive': 1920000,
 'other': 22122,
 'poi': True,
 'restricted_stock': 6843672,
 'restricted_stock_deferred': 'NaN',
 'salary': 1111258,
 'shared_receipt_with_poi': 2042,
 'to_messages': 3627,
 'total_payments': 8682716,
 'total_stock_value': 26093672}

As shown, their salary and bonus values are above the outlier thresholds we defined. Looking into them further, we learn that their positions were as follows:

  • Kenneth L. Lay, CEO, and then Chairman of Enron Corporation
  • Jeffrey K. Skilling, COO and then CEO of Enron Corporation
In [26]:
df = df.drop('TOTAL')
In [27]:
df = df.drop('LOCKHART EUGENE E')

Here we are going to explore more about the financial features of our dataset.

In [28]:
g = sns.PairGrid(df, vars=['salary','bonus','total_stock_value','total_payments'],
                 hue='poi')  # hue='poi' is assumed from the POI vs non-POI discussion of this plot
g.map_diag(sns.kdeplot, lw=1)


This shows how the POI and non-POI values are distributed: in general the POIs are mixed in with the rest of the data, but some POI values stand out as outliers.

Let's also explore the email features of our dataset, particularly the to_messages and from_messages features.

In [29]:
pois = df[df.poi]
plt.scatter(pois.from_messages, pois.to_messages, c='red');
non_pois = df[~df.poi]
plt.scatter(non_pois.from_messages, non_pois.to_messages, c='blue');

plt.xlabel('From messages')
plt.ylabel('To messages')
plt.legend(['POIs', 'non-POIs'])
<matplotlib.legend.Legend at 0x7f826de16f90>

Here we find an outlier with a from_messages value of more than 14,000. The outliers are people who send or receive far more emails than the average user. The next-largest sender is at around 7,000, and there is one more point in the top left corner. The code below identifies these three points.

In [30]:
# extracting the outlier points
outliers = df[np.logical_or(df.from_messages > 6000, df.to_messages > 10000)]

#plot them in red with the originals
plt.scatter(df.from_messages, df.to_messages, c='blue');
plt.scatter(outliers.from_messages, outliers.to_messages, c='red')
plt.xlabel('From messages')
plt.ylabel('To messages')
plt.legend(['Inliers', 'Potential Outliers'])

# Let's look at our outliers!
salary to_messages deferral_payments total_payments exercised_stock_options bonus restricted_stock shared_receipt_with_poi restricted_stock_deferred total_stock_value ... loan_advances from_messages other from_this_person_to_poi poi director_fees deferred_income long_term_incentive email_address from_poi_to_this_person
KAMINSKI WINCENTY J 275101.0 4607.0 0.0 1086821.0 850010.0 400000.0 126027.0 583.0 0.0 976037.0 ... 0.0 14368.0 4669.0 171.0 False 0.0 0.0 323466.0 vince.kaminski@enron.com 41.0
KEAN STEVEN J 404338.0 12754.0 0.0 1747522.0 2022048.0 1000000.0 4131594.0 3639.0 0.0 6153642.0 ... 0.0 6759.0 1231.0 387.0 False 0.0 0.0 300000.0 steven.kean@enron.com 140.0
SHAPIRO RICHARD S 269076.0 15149.0 0.0 1057548.0 607837.0 650000.0 379164.0 4527.0 0.0 987001.0 ... 0.0 1215.0 705.0 65.0 False 0.0 0.0 0.0 richard.shapiro@enron.com 74.0

3 rows × 21 columns

So now they've been identified.
In my view, these potential outliers do not give us much information for identifying POIs.
Why? These points may be the result of an error or, more plausibly, they do not reflect the general trend of email usage and instead belong to people whose job requires a great deal of communication by email.
What matters here is that their values are far above the average and could distort our exploration and the results of the machine learning algorithms, so I will remove them.

In [31]:
In [32]:
# removing the outlier candidates from the data set
df = df[df.from_messages < 6000]
In [33]:
df = df[df.to_messages < 10000]
In [34]:

Section 2: Feature Processing

Now we're ready to move on to feature processing.

To recap, our feature list is divided into two parts:

  • Financial Features
  • Email Features

Let's begin with our first attempt to classify the dataset.

We start by exploring the financial features.

We are going to work with the data dictionary that we created above.

Creating New Features

The new features that we create are:

  • net_worth: the sum of total_payments and total_stock_value
  • proportion_from_poi: from_poi_to_this_person divided by from_messages
  • proportion_to_poi: from_this_person_to_poi divided by to_messages
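
Ratio features like these need a rule for missing inputs. The sketch below shows one possible guard that returns 0.0 when either input is 'NaN'. The zero-denominator check is an extra assumption added here, and the name safe_ratio is hypothetical. The example numbers are Phillip Allen's from_poi_to_this_person (47) and from_messages (2195) from the table shown earlier.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 if either is 'NaN' or the denominator is 0."""
    if numerator == "NaN" or denominator == "NaN" or denominator == 0:
        return 0.0
    return float(numerator) / denominator

allen_ratio = safe_ratio(47, 2195)
print(allen_ratio)
print(safe_ratio("NaN", 2195))  # missing numerator falls back to 0.0
print(safe_ratio(5, 0))         # zero denominator falls back to 0.0
```

Falling back to 0.0 treats "no email data" the same as "no contact with POIs", which is a simplification worth keeping in mind when interpreting these features.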
In [35]:
# Helper Functions
from sklearn.feature_selection import SelectKBest,f_classif
f_scores =[]

def proportionFromPOI(data_dict):
    for k, v in data_dict.iteritems():
        # Assigning value to the feature 'proportion_from_poi'
        if v['from_poi_to_this_person'] != 'NaN' and v['from_messages'] != 'NaN':
            v['proportion_from_poi'] = float(v['from_poi_to_this_person']) / v['from_messages']
        else:
            v['proportion_from_poi'] = 0.0
    return data_dict

def proportionToPOI(data_dict):
    for k, v in data_dict.iteritems():
        # Assigning value to the feature 'proportion_to_poi'
        if v['from_this_person_to_poi'] != 'NaN' and v['to_messages'] != 'NaN':
            v['proportion_to_poi'] = float(v['from_this_person_to_poi']) / v['to_messages']
        else:
            v['proportion_to_poi'] = 0.0
    return data_dict

def net_worth(data_dict):
    features = ['total_payments','total_stock_value']
    for key in data_dict:
        name = data_dict[key]
        # net_worth is only defined when both inputs are present
        is_null = False
        for feature in features:
            if name[feature] == 'NaN':
                is_null = True
        if not is_null:
            name['net_worth'] = name[features[0]] + name[features[1]]
        else:
            name['net_worth'] = 'NaN'
    return data_dict

def select_features(features, labels, features_list, k=10):
    # features_list[0] is 'poi', so selected column i maps to features_list[i + 1]
    clf = SelectKBest(f_classif, k=k)
    new_features = clf.fit_transform(features, labels)
    features_l = [features_list[i+1] for i in clf.get_support(indices=True)]
    # pair every candidate feature with its score, highest first
    f_scores = zip(features_list[1:], clf.scores_[:])
    f_scores = sorted(f_scores, key=lambda x: x[1], reverse=True)
    return new_features, ['poi'] + features_l, f_scores
In [36]:
data_dict = net_worth(data_dict)
data_dict = proportionFromPOI(data_dict)
data_dict = proportionToPOI(data_dict)
pprint.pprint(data_dict['ALLEN PHILLIP K'])
{'bonus': 4175000,
 'deferral_payments': 2869717,
 'deferred_income': -3081055,
 'director_fees': 'NaN',
 'email_address': 'phillip.allen@enron.com',
 'exercised_stock_options': 1729541,
 'expenses': 13868,
 'from_messages': 2195,
 'from_poi_to_this_person': 47,
 'from_this_person_to_poi': 65,
 'loan_advances': 'NaN',
 'long_term_incentive': 304805,
 'net_worth': 6213983,
 'other': 152,
 'poi': False,
 'proportion_from_poi': 0.0214123006833713,
 'proportion_to_poi': 0.022398345968297727,
 'restricted_stock': 126027,
 'restricted_stock_deferred': -126027,
 'salary': 201955,
 'shared_receipt_with_poi': 1407,
 'to_messages': 2902,
 'total_payments': 4484442,
 'total_stock_value': 1729541}
In [37]:
# we will add these features to our financial_features 
In [38]:
my_dataset = data_dict
In [39]:
data = featureFormat(my_dataset, financial_features, sort_keys = True)
labels, features = targetFeatureSplit(data)
In [40]:
for point in data_dict:
    from_poi = data_dict[point]["proportion_from_poi"]
    to_poi = data_dict[point]["proportion_to_poi"]
    plt.scatter(from_poi, to_poi)
In [41]:
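# call the function which uses SelectKBest
# (k=6 is inferred from the six features selected in the output below)
features, financial_features, f_scores = select_features(features, labels, financial_features, k=6)
print ("features_list---" ,financial_features)
print("feature scores")
for i in f_scores:
    print (i)
data = featureFormat(my_dataset, financial_features, sort_keys = True)
labels, features = targetFeatureSplit(data)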
('features_list---', ['poi', 'loan_advances', 'bonus', 'deferred_income', 'other', 'long_term_incentive', 'proportion_to_poi'])
feature scores
('loan_advances', inf)
('bonus', 772.43341185601332)
('other', 556.77730806873853)
('deferred_income', 287.2664203370756)
('long_term_incentive', 52.561787970591261)
('proportion_to_poi', 33.494255767406649)
('total_payments', 24.176713973334053)
('restricted_stock', 16.643636767803986)
('net_worth', 13.673228759023853)
('proportion_from_poi', 7.2097443906077103)
('deferral_payments', 2.7152382606791057)
('total_stock_value', 2.4765688698248112)
('expenses', 1.712424636015998)
('exercised_stock_options', 1.3179737674187317)
('director_fees', 0.12814399385851574)
('restricted_stock_deferred', 0.017226936204361963)
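
The scores above come from f_classif, which computes a one-way ANOVA F-value for each feature: the variation between the class means divided by the variation within the classes. As a rough illustration of that statistic, and not of sklearn's actual implementation, here is a plain-Python sketch on toy data. The function name anova_f is hypothetical, and the sketch does not handle the case where the within-group variation is zero.

```python
def anova_f(values, labels):
    """One-way ANOVA F statistic of `values` grouped by `labels`."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(float(v))

    n_total = len(values)
    grand_mean = sum(float(v) for v in values) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    # Variation of each point around its own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2
                    for g in groups.values() for x in g)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Toy feature: the two classes have means 2 and 5 with the same spread.
f_value = anova_f([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
print(f_value)
```

A larger F-value means the class means are further apart relative to the spread inside each class, which is why high-scoring features are kept by SelectKBest.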
In [42]:
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
	Accuracy: 0.75575	Precision: 0.25666	Recall: 0.24550	F1: 0.25096	F2: 0.24765
	Total predictions: 12000	True positives:  491	False positives: 1422	False negatives: 1509	True negatives: 8578
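
The summary metrics follow directly from the four confusion-matrix counts in the output above. As a worked check, the sketch below recomputes them from the GaussianNB counts. The function name classification_metrics is made up for illustration, and returning None for an undefined precision or recall is a choice made here.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, F1 and F2 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / float(total)
    # Precision is undefined when nothing was predicted positive.
    precision = tp / float(tp + fp) if (tp + fp) else None
    # Recall is undefined when there are no actual positives.
    recall = tp / float(tp + fn) if (tp + fn) else None
    if precision is None or recall is None or (precision + recall) == 0:
        f1 = f2 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)
        f2 = 5 * precision * recall / (4 * precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "f2": f2}

# Counts taken from the GaussianNB output above.
nb = classification_metrics(tp=491, fp=1422, fn=1509, tn=8578)
for name in ("accuracy", "precision", "recall", "f1", "f2"):
    print("%s: %.5f" % (name, nb[name]))
```

When a classifier never predicts a POI, tp + fp is zero and precision cannot be computed, which is the situation the undefined case in this sketch represents.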

In [43]:
from sklearn import tree
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
	Accuracy: 0.76367	Precision: 0.25296	Recall: 0.21400	F1: 0.23185	F2: 0.22080
	Total predictions: 12000	True positives:  428	False positives: 1264	False negatives: 1572	True negatives: 8736

In [44]:
from sklearn.ensemble import AdaBoostClassifier
clf2 = AdaBoostClassifier()
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None)
	Accuracy: 0.77017	Precision: 0.27914	Recall: 0.23950	F1: 0.25780	F2: 0.24650
	Total predictions: 12000	True positives:  479	False positives: 1237	False negatives: 1521	True negatives: 8763

In [45]:
from sklearn.neighbors import KNeighborsClassifier 
clf3=KNeighborsClassifier(n_neighbors = 4)
Got a divide by zero when trying out: KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=4, p=2,
Precision or recall may be undefined due to a lack of true positive predicitons.
In [46]:
from sklearn.neighbors.nearest_centroid import NearestCentroid
clf4 = NearestCentroid()
NearestCentroid(metric='euclidean', shrink_threshold=None)
	Accuracy: 0.82375	Precision: 0.36407	Recall: 0.07700	F1: 0.12712	F2: 0.09142
	Total predictions: 12000	True positives:  154	False positives:  269	False negatives: 1846	True negatives: 9731

In [47]:
from sklearn.cross_validation import train_test_split
features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)
In [48]:
from time import time
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV

t0 = time()

pipe1 = Pipeline([('pca',PCA()),('classifier',GaussianNB())])
param = {'pca__n_components':[4,5,6]}
gsv = GridSearchCV(pipe1, param_grid=param,n_jobs=2,scoring = 'f1',cv=2)
gsv.fit(features_train, labels_train)  # the search must be fitted before best_estimator_ exists
clf = gsv.best_estimator_
print("GausianNB with PCA fitting time: %rs" % round(time()-t0, 3))
pred = clf.predict(features_test)

t0 = time()
test_classifier(clf,my_dataset,financial_features,folds = 1000)
print("GausianNB  evaluation time: %rs" % round(time()-t0, 3))
GausianNB with PCA fitting time: 0.387s
Pipeline(steps=[('pca', PCA(copy=True, iterated_power='auto', n_components=4, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('classifier', GaussianNB(priors=None))])
	Accuracy: 0.82500	Precision: 0.40775	Recall: 0.11050	F1: 0.17388	F2: 0.12936
	Total predictions: 12000	True positives:  221	False positives:  321	False negatives: 1779	True negatives: 9679

GausianNB  evaluation time: 2.707s
In [49]:
# AdaBoost tuned for comparison with the final algorithm
from sklearn.tree import DecisionTreeClassifier
abc = AdaBoostClassifier(random_state=40)
data = featureFormat(my_dataset, financial_features, sort_keys = True)
labels, features = targetFeatureSplit(data)
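dt = []
for i in range(6):
    # candidate base estimators with max_depth 1..6
    # (loop body assumed; the output below shows max_depth=6 being selected)
    dt.append(DecisionTreeClassifier(max_depth=i + 1))
ab_params = {'base_estimator': dt,'n_estimators': [60,45, 101,10]}
t0 = time()
abt = GridSearchCV(abc, ab_params, scoring='f1')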
abt = abt.fit(features_train,labels_train)
print("AdaBoost fitting time: %rs" % round(time()-t0, 3))
abc = abt.best_estimator_
t0 = time()
test_classifier(abc, data_dict, financial_features, folds = 100)
print("AdaBoost evaluation time: %rs" % round(time()-t0, 3))
AdaBoost fitting time: 9.306s
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=6,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best'),
          learning_rate=1.0, n_estimators=60, random_state=40)
	Accuracy: 0.79000	Precision: 0.27966	Recall: 0.16500	F1: 0.20755	F2: 0.17974
	Total predictions: 1200	True positives:   33	False positives:   85	False negatives:  167	True negatives:  915

AdaBoost evaluation time: 17.214s