This is a practice hackathon. We have a dataset with information about customers, and the goal is to predict whether the company should give them loans or not.
First I do some quick modelling to see which features are important. Then I explore the data to get some insights and fill in missing values. The prediction is done using RandomForest.
2.1 Loan_ID
2.2 Gender
2.3 Dependents
2.4 Education
2.5 Self_Employed
2.6 ApplicantIncome
2.8 LoanAmount
2.9 Loan_Amount_Term
2.10 Credit_History
2.11 Property_Area
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from scipy.stats import skew
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
These are the customers' details available in the dataset.
The idea is to get a basic benchmark and to see which features are important while spending no time on data analysis. This will give a rough estimate, but it is useful.
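#Only numeric columns have a mean; categorical NaNs are left for get_dummies, which ignores them.
train = train.fillna(train.mean(numeric_only=True))
test = test.fillna(test.mean(numeric_only=True))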
#LoanID is just an index, so it isn't useful. LoanID in test data is necessary to create a submission file.
train.drop(['Loan_ID'], axis=1, inplace=True)
test_id = test.Loan_ID
test.drop(['Loan_ID'], axis=1, inplace=True)
for col in train.columns.drop('Loan_Status'):
    if train[col].dtype != 'object':
        if skew(train[col]) > 0.75:
            train[col] = np.log1p(train[col])
    else:
        dummies = pd.get_dummies(train[col], drop_first=False)
        dummies = dummies.add_prefix("{}_".format(col))
        train.drop(col, axis=1, inplace=True)
        train = train.join(dummies)
for col in test.columns:
    if test[col].dtype != 'object':
        if skew(test[col]) > 0.75:
            test[col] = np.log1p(test[col])
    else:
        dummies = pd.get_dummies(test[col], drop_first=False)
        dummies = dummies.add_prefix("{}_".format(col))
        test.drop(col, axis=1, inplace=True)
        test = test.join(dummies)
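Because the dummy columns are built separately for train and test, a category that occurs in only one of the files would give the two frames different columns. A minimal sketch of one way to guard against that, on toy data (`align_columns`, `toy_train` and `toy_test` are hypothetical names used only for illustration, not part of the notebook):

```python
import pandas as pd

def align_columns(train_df, test_df, fill_value=0):
    """Return test_df with exactly the columns of train_df, in the same order."""
    # Columns missing from test are created and filled with fill_value;
    # columns that exist only in test are dropped.
    return test_df.reindex(columns=train_df.columns, fill_value=fill_value)

# Toy data: 'Rural' never occurs in the toy test set.
toy_train = pd.get_dummies(pd.DataFrame({'Property_Area': ['Urban', 'Rural', 'Semiurban']}))
toy_test = pd.get_dummies(pd.DataFrame({'Property_Area': ['Urban', 'Semiurban']}))

toy_test_aligned = align_columns(toy_train, toy_test)
print(list(toy_test_aligned.columns))
```

In this dataset the categories happen to match, so the check is only a safety net.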
X_train = train.drop('Loan_Status', axis=1)
le = LabelEncoder()
Y_train = le.fit_transform(train.Loan_Status.values)
X_test = test
#Estimating feature importance.
clf = RandomForestClassifier(n_estimators=200)
clf = clf.fit(X_train, Y_train)
indices = np.argsort(clf.feature_importances_)[::-1]
print('Feature ranking:')
for f in range(X_train.shape[1]):
    print('%d. feature %d %s (%f)' % (f + 1, indices[f], X_train.columns[indices[f]],
                                      clf.feature_importances_[indices[f]]))
Credit history, income, loan amount and loan amount term come out as the most important features. The other variables have less importance and may be ignored for now.
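The ranking can also be kept as a labelled pandas Series, which makes it easy to slice the top features by name. A small sketch on synthetic data (the frame `toy_X`, its column names and the target `toy_y` are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n = 200
# One informative column and two pure-noise columns.
toy_X = pd.DataFrame({
    'signal': rng.normal(size=n),
    'noise_a': rng.normal(size=n),
    'noise_b': rng.normal(size=n),
})
toy_y = (toy_X['signal'] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(toy_X, toy_y)

# Index the importances by column name and sort from most to least important.
ranking = pd.Series(forest.feature_importances_, index=toy_X.columns).sort_values(ascending=False)
print(ranking)

top_features = ranking.index[:2]  # names of the two highest-ranked columns
```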
#I'll use top-5 most important features.
best_features=X_train.columns[indices[0:5]]
X = X_train[best_features]
Xt = X_test[best_features]
Xtrain, Xtest, ytrain, ytest = train_test_split(X, Y_train, test_size=0.20, random_state=36)
RandomForest is a suitable choice here.
clf = RandomForestClassifier(n_estimators=300, n_jobs=-1, criterion = 'gini')
#CalibratedClassifierCV - probability calibration with cross-validation.
calibrated_clf = CalibratedClassifierCV(clf, method='isotonic', cv=5)
calibrated_clf.fit(Xtrain, ytrain)
y_val = calibrated_clf.predict_proba(Xtest)
y_f = [1 if y_val[i][0] < 0.5 else 0 for i in range(len(ytest))]
print("Validation accuracy: ", sum(y_f == ytest) / len(ytest))
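The list comprehension above predicts class 1 whenever the calibrated probability of class 0 is below 0.5. A vectorised sketch of the same rule on hand-made probabilities (the helper `labels_from_proba` and the toy arrays are hypothetical and only illustrate the thresholding):

```python
import numpy as np

def labels_from_proba(proba, threshold=0.5):
    """Predict class 1 when the probability of class 0 is below the threshold."""
    proba = np.asarray(proba)
    return (proba[:, 0] < threshold).astype(int)

# Each row is [P(class 0), P(class 1)].
toy_proba = np.array([[0.7, 0.3], [0.2, 0.8], [0.5, 0.5], [0.1, 0.9]])
toy_true = np.array([0, 1, 1, 1])

toy_pred = labels_from_proba(toy_proba)
toy_accuracy = np.mean(toy_pred == toy_true)  # share of correct predictions
print("Toy accuracy:", toy_accuracy)
```

Note that a tie at exactly 0.5 is assigned to class 0, just as in the original expression.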
clf = RandomForestClassifier(n_estimators=300, n_jobs=-1, criterion = 'gini')
calibrated_clf = CalibratedClassifierCV(clf, method='isotonic', cv=5)
calibrated_clf.fit(X, Y_train)
y_submit = calibrated_clf.predict_proba(Xt)
submission = pd.DataFrame({'Loan_ID':test_id,
'Loan_Status':le.inverse_transform([1 if y_submit[i][0] < 0.5 else 0 for i in range(len(Xt))])})
submission.to_csv('Loan.csv', index=False)
This submission had 0.75 accuracy when submitted, which is a good result. Let's see how it can be improved after paying more attention to data.
Input the path to the files instead of "../input".
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
train.info()
test.info()
train.describe(include='all')
train.head()
There are a lot of missing values. I think that the score could be improved by careful imputation of missing values for important features.
train.drop(['Loan_ID'], axis=1, inplace=True)
test_id = test.Loan_ID
test.drop(['Loan_ID'], axis=1, inplace=True)
train.Gender.value_counts()
sns.stripplot(x="Gender", y="ApplicantIncome", data=train, hue='Loan_Status', jitter=True)
There are many more men than women in the dataset.
sns.boxplot(x='Gender', y='ApplicantIncome', data=train.loc[train.ApplicantIncome < 25000])
This boxplot shows the distribution of income by gender for incomes below 25000, as only men have higher incomes. The difference in income between genders isn't large.
train.groupby(['Gender'])['Loan_Status'].value_counts(normalize=True)
Gender also has little impact on Loan Status.
sns.factorplot(x="Credit_History", hue="Loan_Status", col="Gender", data=train, kind="count")
Considering all this information, I'll fill NaN with the most common value.
train['Gender'] = train['Gender'].fillna('Male')
test['Gender'] = test['Gender'].fillna('Male')
train.Married.value_counts()
pd.crosstab(train.Married, train.Loan_Status)
train.groupby(['Gender'])['Married'].value_counts(normalize=True)
sns.factorplot(x="Married", hue="Loan_Status", col="Gender", data=train, kind="count")
Women are less likely to be married than men.
train.loc[train.Married.isnull() == True]
Two men and one woman have a missing value. I fill it with the most common marital status for each gender.
train.loc[(train.Gender == 'Male') & (train.Married.isnull() == True), 'Married'] = 'Yes'
train.loc[(train.Gender == 'Female') & (train.Married.isnull() == True), 'Married'] = 'No'
train.Dependents.value_counts()
train.groupby(['Dependents'])['Loan_Status'].value_counts(normalize=True)
sns.factorplot("Loan_Status", col="Dependents", col_wrap=4, data=train, kind="count", size=2.4, aspect=.8)
The most common number of Dependents is zero. People with 2 Dependents are more likely to get the loan.
train.groupby(['Gender', 'Married', 'Property_Area'])['Dependents'].value_counts(normalize=True)
But even with grouping, zero dependents is the most common value, so I'll use it to fill NaN.
train['Dependents'] = train['Dependents'].fillna(train['Dependents'].mode().iloc[0])
test['Dependents'] = test['Dependents'].fillna(test['Dependents'].mode().iloc[0])
sns.factorplot(x="Education", hue="Loan_Status", data=train, kind="count")
It isn't surprising that graduates are more likely to get the loan.
train.groupby(['Self_Employed'])['Loan_Status'].value_counts(normalize=True)
sns.factorplot("Loan_Status", col="Self_Employed", col_wrap=4, data=train, kind="count", size=2.4, aspect=.8)
It seems that it doesn't really matter whether the customer is Self Employed or not.
train.groupby(['Education', 'Married', 'Dependents', 'Gender', 'Property_Area'])['Self_Employed'].apply(lambda x: x.mode())
I thought that this grouping would make sense, but there is only one case where the most common value is 'Yes'. In all other cases 'No' is more common.
train.loc[(train.Education == 'Graduate') & (train.Married == 'Yes')
& (train.Dependents == '2') & (train.Gender == 'Male') & (train.Property_Area == 'Urban')
& (train.Self_Employed.isnull() == True), 'Self_Employed'] = 'Yes'
test.loc[(test.Education == 'Graduate') & (test.Married == 'Yes')
& (test.Dependents == '2') & (test.Gender == 'Male') & (test.Property_Area == 'Urban')
& (test.Self_Employed.isnull() == True), 'Self_Employed'] = 'Yes'
train['Self_Employed'] = train['Self_Employed'].fillna('No')
test['Self_Employed'] = test['Self_Employed'].fillna('No')
sns.distplot(train['ApplicantIncome'], kde=False, color='c', hist_kws={'alpha': 0.9})
The values are highly skewed. The logarithm of the data looks better.
sns.distplot(np.log1p(train['ApplicantIncome']), kde=False, color='c', hist_kws={'alpha': 0.9})
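To put a number on "looks better", the skewness can be compared before and after the transform. A quick sketch on synthetic log-normal data, which only stands in for an income-like variable and is not the real column:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.RandomState(0)
# Synthetic right-skewed values, a stand-in for an income-like variable.
toy_income = rng.lognormal(mean=8.0, sigma=1.0, size=1000)

raw_skew = skew(toy_income)            # strongly positive for a long right tail
log_skew = skew(np.log1p(toy_income))  # much closer to zero after log1p

print("skew before: %.2f, after log1p: %.2f" % (raw_skew, log_skew))
```

The 0.75 threshold used in the preprocessing loops is the notebook's own cut-off for deciding when to apply `np.log1p`.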
I think that income could be divided into several groups, and these groups could have different rates of getting a loan. I begin with 10 groups; if some groups have a much higher or lower rate, the groups could then be combined.
train['Income_group'] = pd.qcut(train.ApplicantIncome, 10, labels=[0,1,2,3,4,5,6,7,8,9])
test['Income_group'] = pd.qcut(test.ApplicantIncome, 10, labels=[0,1,2,3,4,5,6,7,8,9])
train['Income_group'] = train['Income_group'].astype(str)
test['Income_group'] = test['Income_group'].astype(str)
train.groupby(['Income_group'])['Loan_Status'].value_counts(normalize=True)
This doesn't seem to be a good feature sadly. We'll see later.
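One caveat: calling `pd.qcut` separately on train and test gives each file its own bin edges, so group 3 in train need not cover the same incomes as group 3 in test. A sketch of an alternative that reuses the training edges, shown on toy values (the series below are invented, and this is not what the notebook itself does):

```python
import numpy as np
import pandas as pd

toy_train_income = pd.Series([1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000])
toy_test_income = pd.Series([500, 3500, 9000])

# retbins=True also returns the quantile edges computed on the training data.
train_groups, edges = pd.qcut(toy_train_income, 4, labels=False, retbins=True)

# Replace the outer edges with +/- infinity so that test values outside
# the training range still fall into the first or last group.
open_edges = np.concatenate(([-np.inf], edges[1:-1], [np.inf]))
test_groups = pd.cut(toy_test_income, bins=open_edges, labels=False)

print(train_groups.tolist(), test_groups.tolist())
```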
sns.distplot(train['CoapplicantIncome'], kde=False, color='c', hist_kws={'alpha': 0.9})
sns.distplot(np.log1p(train['CoapplicantIncome']), kde=False, color='c', hist_kws={'alpha': 0.9})
This variable is also skewed, but the logarithm isn't much better. The data has a bimodal distribution, so let's divide it into two groups.
train['Coap_group'] = pd.qcut(train.CoapplicantIncome, 2, labels=[0,1])
test['Coap_group'] = pd.qcut(test.CoapplicantIncome, 2, labels=[0,1])
train['Coap_group'] = train['Coap_group'].astype(str)
test['Coap_group'] = test['Coap_group'].astype(str)
train.groupby(['Coap_group'])['Loan_Status'].value_counts(normalize=True)
Also not good.
plt.scatter(train['ApplicantIncome'], train['LoanAmount'])
People with higher income want higher loans. Well, this is reasonable.
train.groupby(['Education', 'Gender', 'Income_group', 'Self_Employed'])['LoanAmount'].median()
train.groupby(['Education', 'Gender', 'Self_Employed'])['LoanAmount'].median()
First I fill the missing values with the mean by Education, Gender, Income Group and Self Employment. Some groups have no data at all, so a second imputation with a coarser grouping is necessary.
#transform keeps the original row index, so the filled values align with the frame.
train['LoanAmount'] = train.groupby(['Education', 'Gender', 'Income_group', 'Self_Employed'])['LoanAmount'].transform(lambda x: x.fillna(x.mean()))
test['LoanAmount'] = test.groupby(['Education', 'Gender', 'Income_group', 'Self_Employed'])['LoanAmount'].transform(lambda x: x.fillna(x.mean()))
train['LoanAmount'] = train.groupby(['Education', 'Gender', 'Self_Employed'])['LoanAmount'].transform(lambda x: x.fillna(x.mean()))
test['LoanAmount'] = test.groupby(['Education', 'Gender', 'Self_Employed'])['LoanAmount'].transform(lambda x: x.fillna(x.mean()))
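The two-step imputation can be written as a loop over progressively coarser groupings. A minimal sketch on a toy frame (the helper `fill_by_group_mean` and the frame `toy` are hypothetical, and the final fallback to the overall mean is my own addition):

```python
import numpy as np
import pandas as pd

def fill_by_group_mean(df, target, group_levels):
    """Fill NaNs in df[target] with group means, trying each grouping in turn."""
    filled = df[target].copy()
    for keys in group_levels:
        # transform returns one value per row, aligned with the original index.
        group_mean = df.groupby(keys)[target].transform('mean')
        filled = filled.fillna(group_mean)
    # Last resort (an assumption, not in the notebook): the overall mean.
    return filled.fillna(df[target].mean())

toy = pd.DataFrame({
    'Education': ['Graduate', 'Graduate', 'Graduate', 'Not Graduate', 'Not Graduate'],
    'Gender': ['Male', 'Male', 'Female', 'Male', 'Male'],
    'LoanAmount': [100.0, np.nan, np.nan, 60.0, np.nan],
})
toy['LoanAmount'] = fill_by_group_mean(toy, 'LoanAmount',
                                       [['Education', 'Gender'], ['Education']])
print(toy)
```

The female graduate has no observed value in her fine-grained group, so she only gets a value from the coarser Education grouping.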
sns.distplot(train['LoanAmount'], kde=False, color='c', hist_kws={'alpha': 0.9})
Loan Amount seems to be closer to a normal distribution than the previous variables.
train['Loan_group'] = pd.qcut(train.LoanAmount, 10, labels=[0,1,2,3,4,5,6,7,8,9])
test['Loan_group'] = pd.qcut(test.LoanAmount, 10, labels=[0,1,2,3,4,5,6,7,8,9])
train['Loan_group'] = train['Loan_group'].astype(str)
test['Loan_group'] = test['Loan_group'].astype(str)
train.Loan_Amount_Term.value_counts()
It seems that this feature is in fact categorical and not continuous.
sns.factorplot("Loan_Status", col="Loan_Amount_Term", col_wrap=3,
data=train.loc[train.Loan_Amount_Term !=360.], kind="count", size=3.4, aspect=.8)
Different loan terms have different rates of getting the loan.
train.groupby(['Education', 'Income_group', 'Loan_group'])['Loan_Amount_Term'].apply(lambda x: x.mode())
But 360 is truly the most common one.
train['Loan_Amount_Term'] = train['Loan_Amount_Term'].fillna(360.0)
test['Loan_Amount_Term'] = test['Loan_Amount_Term'].fillna(360.0)
train['Loan_Amount_Term'] = train['Loan_Amount_Term'].astype(str)
test['Loan_Amount_Term'] = test['Loan_Amount_Term'].astype(str)
train.Credit_History.value_counts()
train.groupby(['Education', 'Self_Employed', 'Property_Area', 'Income_group'])['Credit_History'].apply(lambda x: x.mode())
This is one of the key variables, so filling its missing values is an important decision. I'll fill them with the mode values based on the grouping above: 0.0 for the few groups where that is the mode, and 1.0 everywhere else.
#The target column is Credit_History: these are the groups where the mode is 0.0.
train.loc[(train.Education == 'Graduate') & (train.Self_Employed == 'Yes')
          & (train.Property_Area == 'Urban') & (train.Income_group == '9') & (train.Credit_History.isnull() == True),
          'Credit_History'] = 0.0
train.loc[(train.Education == 'Not Graduate') & (train.Self_Employed == 'No')
          & (train.Property_Area == 'Rural') & (train.Income_group == '7') & (train.Credit_History.isnull() == True),
          'Credit_History'] = 0.0
train.loc[(train.Education == 'Not Graduate') & (train.Self_Employed == 'No')
          & (train.Property_Area == 'Urban') & (train.Income_group == '2') & (train.Credit_History.isnull() == True),
          'Credit_History'] = 0.0
test.loc[(test.Education == 'Graduate') & (test.Self_Employed == 'Yes')
         & (test.Property_Area == 'Urban') & (test.Income_group == '9') & (test.Credit_History.isnull() == True),
         'Credit_History'] = 0.0
test.loc[(test.Education == 'Not Graduate') & (test.Self_Employed == 'No')
         & (test.Property_Area == 'Rural') & (test.Income_group == '7') & (test.Credit_History.isnull() == True),
         'Credit_History'] = 0.0
test.loc[(test.Education == 'Not Graduate') & (test.Self_Employed == 'No')
         & (test.Property_Area == 'Urban') & (test.Income_group == '2') & (test.Credit_History.isnull() == True),
         'Credit_History'] = 0.0
train['Credit_History'] = train['Credit_History'].fillna(1.0)
test['Credit_History'] = test['Credit_History'].fillna(1.0)
train['Credit_History'] = train['Credit_History'].astype(str)
test['Credit_History'] = test['Credit_History'].astype(str)
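Filling by group mode does not have to be spelled out one group at a time. A sketch of a generic version on a toy frame (the helper `group_mode` and the frame `toy` are hypothetical; the real grouping uses more columns):

```python
import numpy as np
import pandas as pd

def group_mode(values):
    """Most frequent non-missing value in a group, or NaN if the group has none."""
    modes = values.mode()
    return modes.iloc[0] if not modes.empty else np.nan

toy = pd.DataFrame({
    'Education': ['Graduate'] * 4 + ['Not Graduate'] * 3,
    'Credit_History': [1.0, 1.0, 0.0, np.nan, 0.0, 0.0, np.nan],
})

# transform broadcasts each group's mode back to every row of that group.
per_group = toy.groupby('Education')['Credit_History'].transform(group_mode)

# Fill from the group mode first, then fall back to the overall most common value.
toy['Credit_History'] = toy['Credit_History'].fillna(per_group).fillna(1.0)
print(toy)
```

When a group has several equally common values, `mode()` returns them sorted and this sketch takes the smallest one.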
sns.factorplot('Loan_Status', col='Property_Area', col_wrap=3, data=train, kind='count', size=2.5, aspect=.8)
It seems that people living in Semiurban areas are more likely to get loans.
train.dtypes
for col in train.columns.drop('Loan_Status'):
    if train[col].dtype != 'object':
        if skew(train[col]) > 0.75:
            train[col] = np.log1p(train[col])
    else:
        dummies = pd.get_dummies(train[col], drop_first=False)
        dummies = dummies.add_prefix("{}_".format(col))
        if col == 'Credit_History' or col == 'Loan_Amount_Term':
            pass
        else:
            train.drop(col, axis=1, inplace=True)
        train = train.join(dummies)
for col in test.columns:
    if test[col].dtype != 'object':
        if skew(test[col]) > 0.75:
            test[col] = np.log1p(test[col])
    else:
        dummies = pd.get_dummies(test[col], drop_first=False)
        dummies = dummies.add_prefix("{}_".format(col))
        if col == 'Credit_History' or col == 'Loan_Amount_Term':
            pass
        else:
            test.drop(col, axis=1, inplace=True)
        test = test.join(dummies)
#I leave these two variables as they seem to be important by themselves.
train['Credit_History'] = train['Credit_History'].astype(float)
train['Loan_Amount_Term'] = train['Loan_Amount_Term'].astype(float)
test['Credit_History'] = test['Credit_History'].astype(float)
test['Loan_Amount_Term'] = test['Loan_Amount_Term'].astype(float)
X_train = train.drop('Loan_Status', axis=1)
le = LabelEncoder()
Y_train = le.fit_transform(train.Loan_Status.values)
X_test = test
clf = RandomForestClassifier(n_estimators=200)
clf = clf.fit(X_train, Y_train)
indices = np.argsort(clf.feature_importances_)[::-1]
print('Feature ranking:')
for f in range(X_train.shape[1]):
    print('%d. feature %d %s (%f)' % (f + 1, indices[f], X_train.columns[indices[f]],
                                      clf.feature_importances_[indices[f]]))
Well, little changed. The most important variables are the same. Also Credit History is really important.
best_features = X_train.columns[indices[0:6]]
X = X_train[best_features]
Xt = X_test[best_features]
Xtrain, Xtest, ytrain, ytest = train_test_split(X, Y_train, test_size=0.20, random_state=36)
clf = RandomForestClassifier(n_estimators=300, n_jobs=-1, criterion = 'gini')
calibrated_clf = CalibratedClassifierCV(clf, method='isotonic', cv=5)
calibrated_clf.fit(Xtrain, ytrain)
y_val = calibrated_clf.predict_proba(Xtest)
y_f = [1 if y_val[i][0] < 0.5 else 0 for i in range(len(ytest))]
sum(y_f == ytest) / len(ytest)
I tried using other algorithms, but they had worse results. Also I tried tuning RandomForest parameters, but it led to overfitting.
clf = RandomForestClassifier(n_estimators=300, n_jobs=-1, criterion = 'gini')
calibrated_clf = CalibratedClassifierCV(clf, method='isotonic', cv=5)
calibrated_clf.fit(X, Y_train)
y_submit = calibrated_clf.predict_proba(Xt)
y_pred = le.inverse_transform([1 if y_submit[i][0] < 0.5 else 0 for i in range(len(Xt))])
submission = pd.DataFrame({'Loan_ID':test_id, 'Loan_Status':y_pred})
submission.to_csv('Loan.csv', index=False)
This solution had an accuracy of 0.784722222222, and I couldn't improve it. Then I accidentally made a prediction using an estimator fitted not on the whole dataset, but only on the training part (split from the main train data), and reached a new best accuracy of 0.798611. This is the fifth best score. I'm not sure what caused the increase. I suppose the reason is the small amount of data: adding or removing a few samples could change the weights assigned by the estimator. So while the score is higher, there could be overfitting. On bigger datasets, training the model on the whole training data is better and more adequate.
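One way to see how much the score moves with the choice of samples is to repeat stratified cross-validation and look at the spread. A rough sketch on synthetic data (`make_classification` only stands in for the loan data, and all sizes and seeds here are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for a small binary classification dataset.
toy_X, toy_y = make_classification(n_samples=300, n_features=6,
                                   n_informative=3, random_state=0)

# 5 folds repeated 3 times gives 15 accuracy estimates on different splits.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(model, toy_X, toy_y, cv=cv, scoring='accuracy')

print("accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```

If the spread across folds is comparable to the gap between two submissions, that gap may well be noise rather than a real improvement.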