by: Keith Qu

We use the data set from Kaggle's Give Me Some Credit competition, which contains 150,000 observations with 10 features; the objective is to predict whether a borrower will experience a serious delinquency (90+ days past due) within 2 years.

Using the methods described, we were able to attain an AUC score of 0.866752 on the private leaderboard.


In [141]:

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.manifold import TSNE
from scipy.stats import randint, uniform
from scipy import linalg
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_predict, RandomizedSearchCV
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.metrics import accuracy_score, r2_score, mean_squared_error
import xgboost as xgb
import random
%matplotlib inline
```

We have the following variables:

- **SeriousDlqin2yrs** (target): individual experiences a 90+ day overdue payment within 2 years
- **age**
- **RevolvingUtilizationOfUnsecuredLines**: balance on credit cards and lines of credit / sum of credit limits (excluding real estate and installment debt)
- **DebtRatio**: monthly debt, alimony, and living cost payments divided by monthly income
- **NumberRealEstateLoansOrLines**: real estate loans only
- **NumberOfOpenCreditLinesAndLoans**: number of installment loans and lines of credit
- **NumberOfTime30-59DaysPastDueNotWorse**: number of times the person was 30-59 days late, but never worse
- **NumberOfTime60-89DaysPastDueNotWorse**: same as above for 60-89 days
- **NumberOfTimes90DaysLate**: same as above for 90+ days
- **NumberOfDependents**: number of family members excluding self
- **MonthlyIncome**

In [2]:

```
df = pd.read_csv('cs-training.csv',index_col=0)
df.head()
```

Out[2]:

In [3]:

```
df.info()
```

Only 6.684% of the observations have delinquencies.

In [4]:

```
df['SeriousDlqin2yrs'].value_counts().plot(kind='bar')
```

Out[4]:

In [5]:

```
df['SeriousDlqin2yrs'].mean()
```

Out[5]:

In [6]:

```
df['SeriousDlqin2yrs'].describe()
```

Out[6]:

Nothing odd here.

There's someone aged 0, which must be an error. Let's set it to 21, the minimum age that makes sense. We could also set it to the median age, but with 150,000 observations this won't make a meaningful difference.

In [7]:

```
df['age'].value_counts().sort_index().iloc[:20]
```

Out[7]:

In [8]:

```
df['age'].value_counts().sort_index().iloc[-10:]
```

Out[8]:

In [9]:

```
df.loc[df['age']==0,'age'] = 21
```
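
The median-age alternative mentioned above would be a one-liner; a minimal sketch (not what we actually use here):

```
# Alternative (not used): set the zero age to the median age instead of 21
df.loc[df['age'] == 0, 'age'] = df['age'].median()
```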

RevolvingUtilizationOfUnsecuredLines should be a ratio, but for many observations the values appear to be dollar amounts. Perhaps the denominator was 1.

In [10]:

```
df['RevolvingUtilizationOfUnsecuredLines'].describe()
```

Out[10]:

In [11]:

```
df['RevolvingUtilizationOfUnsecuredLines'].median()
```

Out[11]:

In [12]:

```
freq=pd.DataFrame(sorted(df['RevolvingUtilizationOfUnsecuredLines']))
freq.quantile(q=np.arange(0.995,0.99999,0.0001)).plot()
plt.title('Revolving Utilization by Percentile')
```

Out[12]:

The values only start going crazy at some point after the 99.8th percentile, so there are fewer than 300 "irregular" values.

In [13]:

```
freq.quantile(q=[0.998])
```

Out[13]:

In [14]:

```
df[df['RevolvingUtilizationOfUnsecuredLines']>=2.761009].head()
```

Out[14]:

With a median of 0.154 and a mean of 6.04, there is huge positive skew. It's possible that there is some inconsistency in the data: even accounts that are victims of fraud will not have utilization rates in the thousands. It is possible that these values were recorded as dollar amounts, or that 1 was substituted in the denominator when credit card limits were unavailable.

Regardless, we should deal with this problem in some way, such as censoring all amounts above some threshold. We will play it safe and use 2.5.

In [15]:

```
df['RevolvingUtilizationOfUnsecuredLines'] = df['RevolvingUtilizationOfUnsecuredLines'].apply(lambda i:i if i <= 2.5 else 2.5)
```

In [16]:

```
plt.plot(df['RevolvingUtilizationOfUnsecuredLines'], 'bo', markersize=2)
```

Out[16]:

There are also other ways to deal with the outliers. For one, we can simply ignore them, since decision tree models split on thresholds and are largely insensitive to extreme values.

- NumberOfTime30-59DaysPastDueNotWorse
- NumberOfTime60-89DaysPastDueNotWorse
- NumberOfTimes90DaysLate

There are some outliers in these values. While it may be feasible, if unlikely, for someone to have nearly 100 loans past due, in the actual data, these people with huge numbers of defaults don't seem more likely to default than anyone else.

In [17]:

```
df['NumberOfTime30-59DaysPastDueNotWorse'].value_counts().sort_index()
```

Out[17]:

In [18]:

```
df['NumberOfTime30-59DaysPastDueNotWorse'] = [v if v <= 13 else np.median(df['NumberOfTime30-59DaysPastDueNotWorse'])
                                              for v in df['NumberOfTime30-59DaysPastDueNotWorse']]
```

In [19]:

```
df['NumberOfTime60-89DaysPastDueNotWorse'].value_counts().sort_index()
```

Out[19]:

In [20]:

```
df['NumberOfTime60-89DaysPastDueNotWorse'] = [v if v <= 11 else np.median(df['NumberOfTime60-89DaysPastDueNotWorse'])
                                              for v in df['NumberOfTime60-89DaysPastDueNotWorse']]
```

In [21]:

```
df['NumberOfTimes90DaysLate'].value_counts().sort_index()
```

Out[21]:

In [22]:

```
df['NumberOfTimes90DaysLate'] = [v if v <= 17 else np.median(df['NumberOfTimes90DaysLate'])
                                 for v in df['NumberOfTimes90DaysLate']]
```

In [23]:

```
latecor = df[['SeriousDlqin2yrs','NumberOfTime30-59DaysPastDueNotWorse','NumberOfTime60-89DaysPastDueNotWorse','NumberOfTimes90DaysLate']].copy()
```

In [25]:

```
# Binarize: any lateness in the bucket counts as 1 (the target is already 0/1)
for c in latecor.columns:
    latecor[c] = [v if v < 1 else 1 for v in latecor[c]]
```

In [26]:

```
sns.heatmap(latecor.corr())
```

Out[26]:

In [27]:

```
latecor.corr()
```

Out[27]:

We can see that the correlation with delinquency within 2 years increases with the severity of past lateness.

Past lateness alone is not a very accurate predictor of future delinquency, but it is a valuable contributor. In a linear model, the probability of delinquency conditional on prior lateness would be higher the more severe that past lateness was. In a decision tree we can imagine, for example, that a path with a larger number of prior late notices, low income, high debt, and a larger number of dependents would be very likely to end in a serious delinquency. Conditional probability is one of the most powerful concepts in the universe.

There is no reason to suspect any of these numbers, even the 2 outliers with 13 and 20 dependents. Those numbers are not out of the question at all. We could set those 2 outliers to 10, but we'll leave them be for now.

There is missing data, which we will fill with the median; that turns out to just be 0.

In [29]:

```
df['NumberOfDependents'].value_counts().sort_index()
```

Out[29]:

In [30]:

```
df['NumberOfDependents'].median()
```

Out[30]:

In [31]:

```
df['NumberOfDependents'].fillna(0, inplace=True)
```

As with unsecured credit utilization, there is a small number of outliers that we can choose to censor. Let's set 15 as the threshold; these higher numbers of real estate loans don't necessarily give us much additional relevant information.

In [32]:

```
df['NumberRealEstateLoansOrLines'].value_counts().sort_index()
```

Out[32]:

In [33]:

```
df['NumberRealEstateLoansOrLines'] = [v if v <= 15 else np.median(df['NumberRealEstateLoansOrLines'])
                                      for v in df['NumberRealEstateLoansOrLines']]
```

In [34]:

```
df.info()
```

As with revolving utilization, the ratio is not very informative when the denominator is very low. 329,664 doesn't make much sense as a debt ratio, but it makes a lot of sense as a dollar amount. Indeed, the documentation describes DebtRatio as debt payments, alimony and living costs divided by monthly gross income. We can surmise that when income was 0 or missing, 1 was substituted in the denominator.

In [35]:

```
df['DebtRatio'].describe()
```

Out[35]:

In [36]:

```
df[df['MonthlyIncome'].isnull()].describe()['DebtRatio']
```

Out[36]:

This actually gives us an interesting opportunity to create a monthly payments feature. But first, let's look at the accounts with a monthly income of 1. It's very interesting that there are significantly fewer delinquencies among people with an income of 1 than among those with 0, and fewer even than among those with higher incomes. Then again, this could just mean that those with lower incomes receive fewer loans, and perhaps those with 0 income did not have 0 income when the loans were taken out.

In [37]:

```
df['MonthlyIncome'].describe()
```

Out[37]:

In [38]:

```
df['MonthlyIncome'].value_counts().sort_index()[:50]
```

Out[38]:

In [39]:

```
df[(df['MonthlyIncome']==0)]['SeriousDlqin2yrs'].mean()
```

Out[39]:

In [40]:

```
df[(df['MonthlyIncome']==1)]['SeriousDlqin2yrs'].mean()
```

Out[40]:

In [41]:

```
df[(df['MonthlyIncome']>=750) & (df['MonthlyIncome']<=1250)]['SeriousDlqin2yrs'].mean()
```

Out[41]:

In [42]:

```
df[(df['MonthlyIncome']>=1) & (df['MonthlyIncome']<=15)]['SeriousDlqin2yrs'].mean()
```

Out[42]:

In [43]:

```
df[(df['MonthlyIncome']>=16) & (df['MonthlyIncome']<=5400)]['SeriousDlqin2yrs'].mean()
```

Out[43]:

In [44]:

```
df[(df['MonthlyIncome']>=5400) & (df['MonthlyIncome']<=8249)]['SeriousDlqin2yrs'].mean()
```

Out[44]:

In [45]:

```
df[(df['MonthlyIncome']>=8249)]['SeriousDlqin2yrs'].mean()
```

Out[45]:

It's probably best to leave these values as they are. While, for example, 1000 might have mistakenly been entered as 1, overall the non-zero low-income observations have very low delinquency rates, and it would be unwise to make any assumptions. The "1" can mean anything: unknown income, zero income, people with trust funds, people with non-traditional incomes, etc. But they are a very small group (605) with an exceptionally low rate of serious delinquency.

From monthly income and debt ratio, we can determine the monthly payments. This could be extremely valuable, since people with the same debt ratio may be in vastly different economic situations. Imagine 3 cases with a 1.0 debt ratio:

- \$200 income, \$200 payment
- \$2000 income, \$2000 payment
- \$20000 income, \$20000 payment

The first case seems like a college student with a part time job and limited expenses, but who has their room and board covered by their parents. The second case could be a working class person struggling to make ends meet, while the third case could be someone with a high income who is borrowing for investment purposes, has a gambling problem, or got hit with a nasty divorce settlement. Each person has their own distinct challenges when it comes to repaying their loans. Consider a situation where all three lose their jobs: the third person is going to have the most to make up for.

These values can also interact with age: for the first case, the person being 21 or 22 would be a vastly different situation than if it was a 50 year old. In short, the separate dollar amounts can give us different information than the ratios.

In [51]:

```
# MonthlyIncome * DebtRatio recovers the monthly payment when income is known;
# when income is missing (NaN >= 0 is False), DebtRatio already holds a dollar amount
payments = []
for i in range(len(df)):
    if df['MonthlyIncome'].iloc[i] >= 0:
        payments.append(df['MonthlyIncome'].iloc[i] * df['DebtRatio'].iloc[i])
    else:
        payments.append(df['DebtRatio'].iloc[i])
```

In [52]:

```
df['MonthlyPayments']=payments
```

In [53]:

```
df.head()
```

Out[53]:

In [54]:

```
df['MonthlyPayments'].describe()
```

Out[54]:

In [55]:

```
df[df['MonthlyPayments']>250000]
```

Out[55]:

We need to estimate the missing values. If we had education data, we could fit a Mincer human capital model. But let's see what we can do with what we have.

Log wages should be useful (but we set the 0s to the lowest non-zero amount, 1, to avoid the log-of-zero problem). We could also use +1 logs, i.e. log(x+1).

In [56]:

```
df['MonthlyIncome'].describe()
```

Out[56]:

In [57]:

```
df['MonthlyIncome'].value_counts().sort_index()[:20]
```

Out[57]:

In [58]:

```
# Since log(1) = 0, mapping 0 -> 0 here is the same as "setting" the 0 incomes to 1 and taking logs,
# and it leaves the missing (NaN) values untouched rather than prematurely filling them in.
# Do the same to the monthly payments.
df['logMI'] = [np.log(v) if v > 0 else v for v in df['MonthlyIncome']]
df['logMP'] = [np.log(v) if v > 0 else v for v in df['MonthlyPayments']]
```
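
The +1 log mentioned earlier would also be a one-liner; a minimal sketch (not used below), since `np.log1p` likewise maps 0 to 0 and leaves NaN untouched:

```
# Alternative transform (not used): log(x + 1)
logMI_alt = np.log1p(df['MonthlyIncome'])    # NaN stays NaN, 0 maps to 0
logMP_alt = np.log1p(df['MonthlyPayments'])
```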

In [59]:

```
fig,ax = plt.subplots(1,2,figsize=(12,6))
plt.subplot(1,2,1)
df['logMI'].hist(bins=100)
plt.ylabel('freq')
plt.xlabel('log wage')
plt.title('Log wage histogram')
plt.subplot(1,2,2)
df['logMP'].hist(bins=100)
plt.ylabel('freq')
plt.xlabel('log payment')
plt.title('Log payment histogram')
```

Out[59]:

Ignoring the 0s, there is something very "normal" looking about log wages, which should not shock anyone in a dataset of this size. Log payment is negatively skewed. In any case, the log transformations make the data much more manageable.

In [60]:

```
train = df[df['logMI'].isnull() == False]
test = df[df['logMI'].isnull()]
```

In [61]:

```
X_train = train.drop(['MonthlyIncome','logMI','SeriousDlqin2yrs','MonthlyPayments','logMP'], axis=1)
y_train = train['logMI']
X_test = test.drop(['MonthlyIncome','logMI','SeriousDlqin2yrs','MonthlyPayments','logMP'], axis=1)
```

In [62]:

```
X_train.head()
```

Out[62]:

In [63]:

```
grad=GradientBoostingRegressor(learning_rate=0.05,n_estimators=350)
```

In [64]:

```
pred = cross_val_predict(grad,X_train,y_train,cv=10)
```

In [65]:

```
print(r2_score(y_train,pred))
print(mean_squared_error(y_train,pred))
```

Let's fill in the missing incomes.

In [66]:

```
grad.fit(X_train,y_train)
test_pred = grad.predict(X_test)
```

In [67]:

```
df.loc[df['logMI'].isnull(),'logMI'] = test_pred
```

In [68]:

```
df.info()
```

In [69]:

```
df.to_csv('cs-trainmod.csv')
```

In [70]:

```
features = ['RevolvingUtilizationOfUnsecuredLines','age','NumberOfTime30-59DaysPastDueNotWorse',
            'NumberOfOpenCreditLinesAndLoans','NumberOfTimes90DaysLate','NumberRealEstateLoansOrLines',
            'NumberOfTime60-89DaysPastDueNotWorse','NumberOfDependents','logMI','logMP']
```

Let's transform the n feature dimensions into 2. t-SNE emphasizes placing similar observations near each other, as opposed to methods like PCA, which focus on maximizing variance and separating differences. We can see that t-SNE clusters a large number of the delinquent observations together.

In [71]:

```
X = df[features]
y = df['SeriousDlqin2yrs']
sns.set(rc={'figure.figsize':(12,12)})
```

In [145]:

```
tsne = TSNE(early_exaggeration=4.0, learning_rate=25.0, n_iter=250)
# this stuff takes a long time, let's take a smaller sample
X_samp, X_notused, y_samp, y_notused = train_test_split(X, y, test_size=0.33)
transtsne = tsne.fit_transform(X_samp)
tsnedf = pd.DataFrame(transtsne, columns=['1', '2'])
y_samp.index = range(0,len(tsnedf))
tsnedf['class'] = y_samp
sns.lmplot(x='1', y='2', data = tsnedf,hue='class',size=10,palette="Set2",fit_reg=False,scatter_kws={'alpha':0.2})
```

Out[145]:

In [73]:

```
ss = StandardScaler()
X_norm = ss.fit_transform(X)  # scaled copy (not used below; the PCA here is fit on the raw features)
forPca = pd.DataFrame(X)
pct = PCA(n_components=2)
X_trans = pct.fit_transform(X)
pcadf = pd.DataFrame(X_trans,columns=['1','2'])
pcadf['class'] = y
sns.lmplot(x='1',y='2',data=pcadf,hue='class',size=10,fit_reg=False,palette="Set2",scatter_kws={'alpha':0.2})
#plt.axis([-500, 1500, -50, 50])
```

Out[73]:

We can see that there are significant overlaps, but there are regions where there are clearly more class 0s than 1s. Particularly in the t-SNE graph, there is a major high risk concentration near the center of the cloud.

In [74]:

```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```

As is customary, let's form a linear baseline with logistic regression.

In [102]:

```
lr = LogisticRegression()
lr.fit(X_train,y_train)
log_pred = lr.predict_proba(X_test)[:,1]
print ('Logistic regression test AUC')
roc_auc_score(y_test,log_pred)
```

Out[102]:

In [105]:

```
print('Logistic regression training AUC')
roc_auc_score(y_train, lr.predict_proba(X_train)[:,1])
```

Out[105]:

In [103]:

```
rf = RandomForestClassifier(n_estimators=20)
rf.fit(X_train,y_train)
rf_pred = rf.predict_proba(X_test)[:,1]
print ('Random forest test AUC')
roc_auc_score(y_test,rf_pred)
```

Out[103]:

In [104]:

```
print('Random forest training AUC')
roc_auc_score(y_train, rf.predict_proba(X_train)[:,1])
```

Out[104]:

We can see that logistic regression performs noticeably better on the test set than random forest. It is our experience that in many binary classification problems - especially credit risk problems - random forest severely overfits the training set. In fact, with a sufficiently large number of trees (n_estimators) and unrestricted depth, a random forest can usually achieve nearly 100% training set accuracy. Of course, such a model is not a good predictor.
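
To illustrate the point, a quick sanity check on the split above (illustrative only, not part of the pipeline) would look something like this:

```
# Illustrative: a larger forest with unrestricted depth essentially memorizes the training set
rf_big = RandomForestClassifier(n_estimators=200)
rf_big.fit(X_train, y_train)
print('Training AUC:', roc_auc_score(y_train, rf_big.predict_proba(X_train)[:, 1]))  # close to 1
print('Test AUC:', roc_auc_score(y_test, rf_big.predict_proba(X_test)[:, 1]))
```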

The following randomized search takes a while. Maybe get some pizza and watch a movie.

In [106]:

```
grad_test = GradientBoostingRegressor(learning_rate=0.05)
hyper_params = {'n_estimators': randint(10, 650),
                'max_depth': randint(1, 10),
                'min_samples_split': randint(2, 5),
                'min_samples_leaf': randint(1, 5)}
```

In [107]:

```
search = RandomizedSearchCV(grad_test, param_distributions=hyper_params, n_jobs=4, scoring='roc_auc', verbose=4, cv=5)
search.fit(X_train,y_train)
search.best_params_, search.best_score_
```

Out[107]:

In [108]:

```
probs = search.best_estimator_.predict(X_test)
```

In [109]:

```
roc_auc_score(y_test,probs)
```

Out[109]:

Using this method on the test set, we scored 0.866752 on the Kaggle private leaderboard (70% of the test set), compared to the first place score of 0.869558.

The numbers seem close, but this disparity could mean several hundred more wrong lending decisions out of 150,000.

In [143]:

```
xg = xgb.XGBRegressor(learning_rate=0.01, n_jobs=4, eval_metric='mae')
hyper_params1 = {'n_estimators': randint(1000, 2000),
                 'max_depth': randint(1, 12),
                 'subsample': uniform(0, 1),
                 'min_child_weight': randint(1, 10),
                 'max_delta_step': randint(1, 10)}
search1 = RandomizedSearchCV(xg, param_distributions=hyper_params1, n_jobs=4, verbose=4, cv=5)
search1.fit(X_train, y_train)
search1.best_params_, search1.best_score_
```

Out[143]:

In [144]:

```
probs1 = search1.best_estimator_.predict(X_test)
roc_auc_score(y_test,probs1)
```

Out[144]:

Let's try to set an optimal risk threshold for a lender, since lending to someone with a slightly below 50% chance of default is probably not a great idea. We will stick with the gradient boosted model since that's what we ultimately used for our Kaggle submission.

In [110]:

```
# Sweep thresholds in increments of 1/n and keep the one that maximizes AUC on the binarized decisions
highest_auc = 0
opt_thresh = 0
n = 10000
for i in range(n):
    thresh = i/n
    dec = [1 if p > thresh else 0 for p in probs]
    auc = roc_auc_score(y_test, dec)
    if auc > highest_auc:
        highest_auc = auc
        opt_thresh = thresh
print(opt_thresh)
dec = [1 if p > opt_thresh else 0 for p in probs]
```

Only lend when the predicted probability of default is below 7.2%.

Compared to simply giving everyone a loan, we deny 1547 bad borrowers at the cost of denying 5381 good borrowers. With more data, it would be possible to go further and estimate the present values of these loans to put a monetary value on a particular lending scheme. It is reasonable to assume that the average amount lost from a delinquency greatly exceeds the discounted profit from a good loan.
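
As a back-of-the-envelope illustration, with made-up loss and profit figures (placeholders, not estimates from this data), the value of the threshold policy relative to lending to everyone could be sketched as:

```
# Hypothetical figures; with real loan data these would come from exposure at default,
# loss given default, and discounted interest income
avg_loss_per_delinquency = 5000    # assumed average loss on a bad loan
avg_profit_per_good_loan = 1000    # assumed average discounted profit on a good loan
denied_bad, denied_good = 1547, 5381   # counts reported above
policy_value = denied_bad * avg_loss_per_delinquency - denied_good * avg_profit_per_good_loan
print(policy_value)  # positive => the threshold policy beats lending to everyone under these assumptions
```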

In [140]:

```
conf = confusion_matrix(y_test,dec)
norm_conf = conf.astype('float') / conf.sum(axis=1)[:,np.newaxis]
print ('Confusion matrix\n')
print (conf)
print ('\nNormalized confusion matrix\n')
print (norm_conf)
print ('\nClassification report\n')
print (classification_report(y_test,dec))
plt.figure(figsize=(8,8))
sns.heatmap(norm_conf)
plt.title('Confusion heatmap')
plt.ylabel('True classification')
plt.xlabel('Lender classification')
```

Out[140]:

Age is useful in income prediction, but how good is it at assessing risk in this data set? Should we try to create a model with binned ages, or leave age as is for now?

We can use a more traditional method such as weight of evidence (WoE) to assess age.

WoE = log(% of non-events / % of events), where the "event" in question is a serious delinquency.

Age is technically already binned, since time is continuous and we're rounding down to whole years. Still, grouping ages further may make the variable more useful. As we saw from the income analysis above, there is evidence that income could be usefully binned as well, but we will stick with age for now.

In [117]:

```
plt.figure(figsize=(8,8))
df['age'].hist(bins=100)
plt.title('Number of loans by age')
```

Out[117]:

Let's be generous and say that everyone above 81 is now 81.

In [124]:

```
df2 = df.copy()
df2['age'] = [v if v <= 81 else 81 for v in df2['age']]
```

In [125]:

```
unique_ages = df2['age'].value_counts().index.sort_values()
```

In [126]:

```
bad_age = []
good_age = []
for a in unique_ages:
    good_age.append(df2[df2['age']==a].groupby('SeriousDlqin2yrs').count()['age'][0])
    bad_age.append(df2[df2['age']==a].groupby('SeriousDlqin2yrs').count()['age'][1])
```

In [127]:

```
age_df = pd.DataFrame(list(zip(unique_ages,good_age,bad_age)),columns=['Age','TotalNonDlq','TotalDlq'])
total_good = sum(age_df['TotalNonDlq'])
total_dlq = sum(age_df['TotalDlq'])
```

In [128]:

```
age_df['PercentNonDlq'] = age_df['TotalNonDlq'].apply(lambda v:v/total_good)
age_df['PercentDlq'] = age_df['TotalDlq'].apply(lambda v:v/total_dlq)
```

In [129]:

```
age_df['WoE'] = np.log(age_df['PercentNonDlq']/age_df['PercentDlq'])
```

In [130]:

```
plt.figure(figsize=(8,8))
sns.regplot('Age','WoE',age_df,order=3)
plt.title ('WoE by Age')
```

Out[130]:

The cubic fit line doesn't really have any use; it just looks nice.

We can also find the information value for each age.

In [131]:

```
age_df['IV'] = age_df['WoE']*(age_df['PercentNonDlq'] - age_df['PercentDlq'])
age_df.head()
```

Out[131]:

Siddiqi (2006) suggests that a bin with an IV of under 0.1 has weak predictive power, and under 0.02 it's not useful at all. We should aim for the (0.3,0.5) range.
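
Siddiqi's thresholds are usually applied to the characteristic as a whole, where the total IV is simply the sum of the per-bin contributions; with the age_df built above, a minimal check would be:

```
# Total information value of age = sum of the per-bin IV contributions
age_df['IV'].sum()
```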

In [132]:

```
plt.figure(figsize=[8,8])
age_df['IV'].plot()
```

Out[132]:

The IVs are very low, so let's try binning them further.

In [133]:

```
freq=pd.DataFrame(sorted(df2['age']))
```

In [134]:

```
df2['ageCat'] = np.digitize(np.array(df2['age']),freq.quantile(q=np.arange(0.1,0.9,0.1))[0].values)
```

In [135]:

```
unique_cats = df2['ageCat'].value_counts().index.sort_values()
```

In [136]:

```
bad_cat = []
good_cat = []
for a in unique_cats:
    good_cat.append(df2[df2['ageCat']==a].groupby('SeriousDlqin2yrs').count()['ageCat'][0])
    bad_cat.append(df2[df2['ageCat']==a].groupby('SeriousDlqin2yrs').count()['ageCat'][1])
```

In [137]:

```
cat_df = pd.DataFrame(list(zip(unique_cats,good_cat,bad_cat)),columns=['ageCat','TotalNonDlq','TotalDlq'])
```

In [138]:

```
cat_df['PercentNonDlq'] = cat_df['TotalNonDlq'].apply(lambda v:v/total_good)
cat_df['PercentDlq'] = cat_df['TotalDlq'].apply(lambda v:v/total_dlq)
cat_df['WoE'] = np.log(cat_df['PercentNonDlq']/cat_df['PercentDlq'])
cat_df['IV'] = cat_df['WoE']*(cat_df['PercentNonDlq'] - cat_df['PercentDlq'])
```

In [139]:

```
cat_df
```

Out[139]:

It seems that age by itself is not worth binning, given such low IV scores. However, these scores only apply to the age categories on their own. It is entirely possible that they can be the basis for useful derived features. A few that come to mind: the log difference between an individual's income and their age group average, or a similar measure of dependents relative to the group average.

Measures such as these may give additional information about an individual's economic situation and ability to pay, and they would be useful to explore in any further research.
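
As a rough sketch of the first idea, using the ageCat bins from df2 and the logMI column created earlier (the new column names here are hypothetical, and these features are not used in the models above):

```
# Sketch of possible derived features: income and dependents relative to the age-group average
df2['logMI_vs_ageGroup'] = df2['logMI'] - df2.groupby('ageCat')['logMI'].transform('mean')
df2['deps_vs_ageGroup'] = (df2['NumberOfDependents']
                           - df2.groupby('ageCat')['NumberOfDependents'].transform('mean'))
df2[['logMI_vs_ageGroup', 'deps_vs_ageGroup']].describe()
```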

In terms of the baseline models, logistic regression performs very well. This is not very surprising, as it is the traditional method of risk classification, with nearly 100 years of theoretical development and practical use. It is also a deterministic model that requires relatively little computing power, so it can easily and quickly be reproduced even on a mobile device, and its results are 100% consistent for any given set of observations and features. However, on average, the results from boosting methods are clearly superior. Without using XGBoost or further ensembling, we managed to achieve a score of 0.866752 on the Kaggle private leaderboard with just a randomized cross-validated gradient boosting model.

The log transformations removed many of the outliers, but it's not clear how much of a difference this made in the final model, since decision tree methods tend to be robust to monotonic (order-preserving) transformations. However, they did make the data more palatable and easier to visualize. In addition to the age-based features mentioned above, we could try traditional methods like adding squared features or trying different interactions and ratios between the existing features. We could also try to improve the monthly income estimation, which is a large topic in itself; finding more uses for age could be a major component of that as well.
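
For illustration, a few engineered features of the squared/interaction/ratio kind might look like the following (hypothetical column names, not evaluated here):

```
# Sketch of possible engineered features: a squared term, an interaction, and a ratio
X_ext = X.copy()
X_ext['age_sq'] = X_ext['age'] ** 2
X_ext['util_x_90late'] = (X_ext['RevolvingUtilizationOfUnsecuredLines']
                          * X_ext['NumberOfTimes90DaysLate'])
X_ext['payment_per_line'] = df['MonthlyPayments'] / (df['NumberOfOpenCreditLinesAndLoans'] + 1)
```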

The prize-winning team used over 20 features, and their solution was developed over the course of several months. Intellectually, it would have been fascinating to be a fly on the wall during their deliberations. But in a practical sense, we have to consider the tradeoff between time and money, and the diminishing returns from spending a lot more time to make relatively minor improvements to the model.