However, what looked like a dream company from the outside was actually crumbling under the weight of billions of dollars of debt and failed projects from within. These debts and losses were skillfully, willfully and systematically swept under the rug through accounting and auditing fraud over a sustained period of time. Eventually, Enron filed for bankruptcy in 2001, wiping out $78 billion in stock market value. It has since been known as the biggest case of corporate fraud in U.S. history.
After the federal investigation of the Enron scandal, confidential information regarding the case crept into the public domain. This information contained 600,000 emails generated by 158 Enron employees and came to be known as the Enron Corpus.
My objective for this project is to use machine learning techniques to identify Persons of Interest from the data available in the Enron Corpus. A Person of Interest (POI) in this case refers to a person who was involved in the case.
In the cells below, we can see the data related to the queries I had; later, I summarised these observations for more clarity:
Here is a summary of the data from the cells above:
I also wanted to know how many values were missing for each employee in the dataset. However, instead of plotting the names of all the employees and their missing values, I chose to plot a bar chart for the top 20 employees only.
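For reference, the counting step could be sketched roughly as below; `data_dict` is an assumed name for the dataset loaded as a dictionary of employee records, with missing values stored as the string 'NaN', as in the original Enron data.

```python
import matplotlib.pyplot as plt

# Count how many features are missing ('NaN') for each employee.
nan_counts = {
    name: sum(1 for value in features.values() if value == 'NaN')
    for name, features in data_dict.items()
}

# Keep only the 20 employees with the most missing values for the bar chart.
top_20 = sorted(nan_counts.items(), key=lambda item: item[1], reverse=True)[:20]

names, counts = zip(*top_20)
plt.figure(figsize=(10, 4))
plt.bar(names, counts)
plt.xticks(rotation=90)
plt.ylabel('Number of missing values')
plt.show()
```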
I further examined the NaN values in the email and financial features of the dataset separately. For this, I used the `email_features` and `financial_features` lists I created earlier. I thought this would help me detect outliers that might exist within these features. Below is a bar chart displaying the number of NaN values in the financial features for the top 20 executives.
Below is a bar chart displaying the number of NaN values in the email features for all the employees.
I then took a closer look at four financial features: `salary`, `bonus`, `exercised_stock_options` and `long_term_incentives`. I printed summaries of these features using the `describe()` method, which gave me a good idea of how the values within these features were dispersed.
Looking at the summaries, I realised there were extreme values, or outliers, lying outside the interquartile range of these features. I created a function called `find_outlier` to inspect these outliers. This function was imported into the project from the `custom_functions.py` file.
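The exact implementation lives in `custom_functions.py`; a minimal IQR-based sketch of what such a function might look like is shown below (using pandas and a DataFrame `df` of the financial data is my assumption).

```python
import pandas as pd

def find_outlier(df, feature):
    """Return rows whose `feature` value lies outside 1.5 * IQR
    of the first and third quartiles (the usual Tukey rule)."""
    values = pd.to_numeric(df[feature], errors='coerce')
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outlier_mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return df[outlier_mask]
```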
Upon inspection, I found one particular record that seemed out of place: 'TOTAL'. 'TOTAL' was obviously the sum of the financial features across all the employees. However, it was being treated as one of the entries in the dataset and was skewing the figures quite acutely.
All 'TOTAL' values are printed below for reference:
Below is a series of histograms displaying the distribution of various features of interest. The most obvious thing to notice in these distributions is how all of them are skewed to the right.
Below is a series of scatter plots examining the correlations between `from_poi_to_this_person` and `from_this_person_to_poi`, `salary` and `bonus`, `salary` and `long_term_incentives`, and `salary` and `exercised_stock_options`. These are all meaningfully positive correlations. I thought `from_poi_to_this_person`, `from_this_person_to_poi`, `salary`, `bonus`, `long_term_incentives` and `exercised_stock_options` were important features because they could be helpful in identifying POIs.
I engineered five new features: `from_ratio`, `to_ratio`, `exercised_stock_to_salary_ratio`, `bonus_to_salary_ratio` and `incentives_to_salary_ratio`.
`from_ratio` and `to_ratio` represent the proportions of emails a person received from POIs and sent to POIs, respectively. Because POIs are more likely to contact each other than non-POIs, I thought these two features could help the models make better predictions.
The other three features are based on the financial features of the dataset. In fact, they are all derived by keeping `salary` as the constant denominator in computing the ratios. I thought salary could be used to distinguish POIs from non-POIs, as POIs might be receiving higher incomes than non-POIs. And since salary has a strong positive correlation with `bonus`, `exercised_stock_options` and `long_term_incentive`, it is reasonable to assume that POIs might be getting higher bonuses, incentives and stock options than non-POIs as well. Therefore, using these ratios might improve the models' performance based on the composition of those ratios.
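As an illustration, the ratio features could be computed roughly as below. The dictionary layout (`data_dict`), the email denominators (`to_messages`, `from_messages`) and the guard against missing or zero values are assumptions about how the actual code works.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 'NaN' when either value is missing or zero."""
    if numerator in ('NaN', None) or denominator in ('NaN', None, 0):
        return 'NaN'
    return float(numerator) / float(denominator)

for name, person in data_dict.items():
    # Email-based ratios: share of mail received from / sent to POIs.
    person['from_ratio'] = safe_ratio(person['from_poi_to_this_person'], person['to_messages'])
    person['to_ratio'] = safe_ratio(person['from_this_person_to_poi'], person['from_messages'])
    # Financial ratios: each feature relative to salary.
    person['bonus_to_salary_ratio'] = safe_ratio(person['bonus'], person['salary'])
    person['incentives_to_salary_ratio'] = safe_ratio(person['long_term_incentive'], person['salary'])
    person['exercised_stock_to_salary_ratio'] = safe_ratio(person['exercised_stock_options'], person['salary'])
```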
I used the SelectKBest function to select the best features for my models. From this list, I removed the two new features, `to_ratio` and `bonus_to_salary_ratio`, that I had created. I did so to retain only the original features of the dataset, as these were the final features I wanted to use to build my models. Here is the list of all the selected features and their scores (including `to_ratio` and `bonus_to_salary_ratio`):
Since there is a lot of variation in the data, I used MinMax scaler to normalize the data and bring the chosen features to a common scale.
Furthermore, to test how the model will perform if new features are added to it, I created another feature list. This time I added all the newly created features to the list of features I had obtained from the SelectKBest function. I used the MinMax scaler to normalize and rescale these features as well.
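Roughly, the selection and scaling steps could look like the sketch below; `features`, `labels` and `feature_names` (the candidate features excluding the `poi` label) are assumed variables, and `k=10` is a placeholder rather than the value actually used.

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import MinMaxScaler

# Score every candidate feature against the POI labels and keep the k best.
selector = SelectKBest(score_func=f_classif, k=10)
selected_features = selector.fit_transform(features, labels)

# Print each feature with its score, highest first.
for name, score in sorted(zip(feature_names, selector.scores_),
                          key=lambda item: item[1], reverse=True):
    print('{}: {:.2f}'.format(name, score))

# Bring the selected features onto a common 0-1 scale.
scaler = MinMaxScaler()
rescaled_features = scaler.fit_transform(selected_features)
```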
In model validation, a trained model is evaluated on a testing data set. The testing data set is a separate portion of the same data set from which the training set is derived. The main purpose of the testing data set is to check how well the model will perform when new data points are introduced.
However, when splitting a data set into training and test sets there is a tradeoff: data points are lost from the training set when they are assigned to the test set. This is where the cross-validation technique known as k-fold cross-validation comes into the picture. In k-fold cross-validation, the original data set is randomly partitioned into k subsamples, and one subsample is held out as the test set in each iteration. The overall test performance is computed by averaging the scores over all k iterations.
If cross-validation is not performed, that is, if a data set is not split into training and test sets, a machine learning algorithm can fall victim to overfitting. Because the algorithm is fit to the entire data set, there is no held-out data to check how well it generalizes, and it will not be able to adapt to new information. As a result, the algorithm will perform poorly whenever it is exposed to new data points.
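A minimal sketch of k-fold cross-validation with scikit-learn is shown below; the classifier and the choice of k = 5 are placeholders, and `rescaled_features` and `labels` are the assumed variables from the earlier sketch.

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Each of the 5 folds is held out once as the test set while the
# remaining folds are used for training.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(GaussianNB(), rescaled_features, labels, cv=kfold)

# The reported performance is the average over the held-out folds.
print('Mean cross-validated accuracy: {:.3f}'.format(scores.mean()))
```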
Below is a list of the five algorithms I tried, with the objective of classifying which employees are more likely to be Persons of Interest:
I used the `train_test_split` function to split the data into a 70:30 ratio: 70% for training and 30% for testing. Furthermore, I used the average scores of three metrics, accuracy, precision and recall, to evaluate the performance of my models. This average was computed over 20 iterations of each algorithm. Below is a quick demo of each of the five algorithms, tested with and without the new features. These algorithms are used straight out of the box, without tuning any parameters. I used two functions, `train_predict_evaluate` and `score_chart`, to evaluate the scores of each algorithm and plot a comparison.
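The helper functions themselves are defined in `custom_functions.py`; the condensed sketch below shows what the evaluation loop amounts to under my assumptions (repeated 70:30 splits, scores averaged over 20 iterations).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score

def average_scores(clf, features, labels, n_iterations=20):
    """Average accuracy, precision and recall over repeated 70:30 splits."""
    accuracy, precision, recall = [], [], []
    for i in range(n_iterations):
        features_train, features_test, labels_train, labels_test = train_test_split(
            features, labels, test_size=0.3, random_state=i)
        clf.fit(features_train, labels_train)
        predictions = clf.predict(features_test)
        accuracy.append(accuracy_score(labels_test, predictions))
        precision.append(precision_score(labels_test, predictions, zero_division=0))
        recall.append(recall_score(labels_test, predictions, zero_division=0))
    return np.mean(accuracy), np.mean(precision), np.mean(recall)
```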
Scores from this preliminary test showed that the accuracy scores of all the algorithms dropped marginally after the introduction of the new features. On the other hand, there was a decent improvement in the recall and precision scores of the Decision Tree, Random Forest and Logistic Regression models after the new features were introduced.
Here is a brief report on the parameters that were tuned for each algorithm:
Decision Tree Classifier:
criterion = 'gini' or 'entropy'
splitter = 'best' or 'random'
min_samples_leaf = 1,2,3,4,5
min_samples_split = 2,3,4,5
Random Forest Classifier:
min_samples_leaf = 1,2,3,4,5
min_samples_split = 2,3,4,5
n_estimators = 5,10,20,30
Support Vector Classifier:
kernel = 'linear', 'rbf' or 'sigmoid'
gamma = 1, 0.1, 0.01, 0.001, 0.0001
C = 0.1, 1, 10, 100, 1000
Logistic Regression:
C = 0.05, 0.5, 1, 10, 100, 1000, 10^5, 10^10
tol = 0.1, 10^-5, 10^-10
I used a combination of two functions, `parameter_tuning` and `model_evaluation`, to print out the average accuracy, precision and recall scores along with the best hyperparameters for a model after tuning it over 100 iterations.
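As a rough illustration of the tuning step, a grid search over the Decision Tree parameters listed above could look like the sketch below; the use of `GridSearchCV`, the F1 scoring and the 5-fold cross-validation are my assumptions, not necessarily what `parameter_tuning` does internally.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'min_samples_leaf': [1, 2, 3, 4, 5],
    'min_samples_split': [2, 3, 4, 5],
}

# Exhaustively search the grid and report the best parameter combination.
grid = GridSearchCV(DecisionTreeClassifier(), param_grid, scoring='f1', cv=5)
grid.fit(features_train, labels_train)
print(grid.best_params_)
print('Best cross-validated F1: {:.3f}'.format(grid.best_score_))
```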
Below are the results of the final test:
The accuracy score of an algorithm measures the number of correct predictions divided by the total number of predictions made. For example, the accuracy of our Naive Bayes model is 0.855, meaning that out of the total number of examined cases, the percentage of true results (both true positives and true negatives) is 85.5%.
The precision score of an algorithm measures the proportion of true positives among all cases classified as positive. It is also called the Positive Predictive Value. For our Naive Bayes model, the precision score is 0.433, meaning that out of 100 executives classified as POIs, about 43 are actually POIs. Precision can be thought of as a measure of a classifier's exactness.
The recall score of an algorithm is the number of true positive predictions divided by the number of actual positives in the test data. It is also called Sensitivity or the True Positive Rate. For our Naive Bayes model, a recall of 0.373 means that out of 100 true POIs existing in the dataset, about 37 are correctly classified as POIs. Recall can be thought of as a measure of a classifier's completeness.
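All three metrics can be computed directly with scikit-learn, as sketched below for an assumed set of test labels and predictions.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# accuracy  = (true positives + true negatives) / all predictions
# precision = true positives / (true positives + false positives)
# recall    = true positives / (true positives + false negatives)
print('Accuracy:  {:.3f}'.format(accuracy_score(labels_test, predictions)))
print('Precision: {:.3f}'.format(precision_score(labels_test, predictions)))
print('Recall:    {:.3f}'.format(recall_score(labels_test, predictions)))
```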
As we can see, in terms of accuracy all the algorithms are almost on par with each other. Logistic Regression, however, gave the highest accuracy score of 0.866, while the Gaussian Naive Bayes classifier gave the best precision score of 0.433 and recall score of 0.373.
Therefore, Gaussian Naive Bayes, with an accuracy of 0.855, a precision of 0.433 and a recall of 0.373, is the best performing algorithm in this case.