Kiva is a non-profit organization that hosts a platform, kiva.org, to crowdfund loans to individuals and groups who otherwise may not have had access to capital. They provide a rich dataset of historical loans from 2006 to 2019.
In this notebook, we will first explore the dataset to glean insights and deeper understanding about the platform.
We will then consider the expected personal loss of making a single loan, and then withdrawing any repayment as soon as possible.
Finally, we will attempt to use machine learning models to predict which loans will get funded, and which will get fully repaid.
import pandas as pd
import matplotlib.colors as mplc
import matplotlib.patches as patches
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from scipy import stats
import numpy as np
import seaborn as sns
%matplotlib inline
# activate plot theme
import qeds
qeds.themes.mpl_style();
# Source: https://www.kiva.org/build/data-snapshots
loans_raw = pd.read_csv("datasets/loans_big.csv", parse_dates=["POSTED_TIME","PLANNED_EXPIRATION_TIME","DISBURSE_TIME","RAISED_TIME",])
The above dataset was provided by Kiva as a snapshot of their historical loans. Below is an example row:
pd.set_option('display.max_columns', 40)
loans_raw.dropna().head(1)
| | LOAN_ID | LOAN_NAME | ORIGINAL_LANGUAGE | DESCRIPTION | DESCRIPTION_TRANSLATED | FUNDED_AMOUNT | LOAN_AMOUNT | STATUS | IMAGE_ID | VIDEO_ID | ACTIVITY_NAME | SECTOR_NAME | LOAN_USE | COUNTRY_CODE | COUNTRY_NAME | TOWN_NAME | CURRENCY_POLICY | CURRENCY_EXCHANGE_COVERAGE_RATE | CURRENCY | PARTNER_ID | POSTED_TIME | PLANNED_EXPIRATION_TIME | DISBURSE_TIME | RAISED_TIME | LENDER_TERM | NUM_LENDERS_TOTAL | NUM_JOURNAL_ENTRIES | NUM_BULK_ENTRIES | TAGS | BORROWER_NAMES | BORROWER_GENDERS | BORROWER_PICTURED | REPAYMENT_INTERVAL | DISTRIBUTION_MODEL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 77657 | 807094 | GUSTAVO | Spanish | Gustavo es soltero y vive con sus padres en La... | Gustavo is single and lives with his parents i... | 500.0 | 500.0 | funded | 1745738.0 | 3002.0 | Higher education costs | Education | to pay university tuition | BO | Bolivia | La Paz | shared | 0.1 | USD | 48.0 | 2014-11-27 15:25:02+00:00 | 2015-01-25 02:20:02+00:00 | 2014-11-21 08:00:00+00:00 | 2014-12-21 15:17:44+00:00 | 20.0 | 17 | 2 | 1 | user_favorite, user_favorite, user_favorite | GUSTAVO | male | true | monthly | field_partner |
min_date = loans_raw["POSTED_TIME"].min()
max_date = loans_raw["POSTED_TIME"].max()
print(f"The dataset contains {len(loans_raw)} loans in the date range {min_date} to {max_date}.")
The dataset contains 1682790 loans in the date range 2006-04-16 07:10:50+00:00 to 2019-02-25 04:12:27+00:00.
We will first explore the quality of the data and see if there will be any issues with missing data.
print(loans_raw.isnull().sum())
LOAN_ID 0 LOAN_NAME 48555 ORIGINAL_LANGUAGE 44209 DESCRIPTION 44244 DESCRIPTION_TRANSLATED 453635 FUNDED_AMOUNT 0 LOAN_AMOUNT 0 STATUS 0 IMAGE_ID 44209 VIDEO_ID 1681943 ACTIVITY_NAME 0 SECTOR_NAME 0 LOAN_USE 44232 COUNTRY_CODE 29 COUNTRY_NAME 0 TOWN_NAME 163515 CURRENCY_POLICY 0 CURRENCY_EXCHANGE_COVERAGE_RATE 337326 CURRENCY 0 PARTNER_ID 18325 POSTED_TIME 0 PLANNED_EXPIRATION_TIME 371834 DISBURSE_TIME 3189 RAISED_TIME 85150 LENDER_TERM 24 NUM_LENDERS_TOTAL 0 NUM_JOURNAL_ENTRIES 0 NUM_BULK_ENTRIES 0 TAGS 831171 BORROWER_NAMES 48555 BORROWER_GENDERS 44209 BORROWER_PICTURED 44209 REPAYMENT_INTERVAL 0 DISTRIBUTION_MODEL 0 dtype: int64
There are some missing cells. However, the core data relating to the loans themselves are complete. Although cells relating to currencies are listed, manually investigating a sample revealed that all loan amounts are in USD.
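Raw null counts are hard to compare at a glance, so it can help to express them as a share of rows. The following is a minimal sketch on a made-up frame (the `toy` values are hypothetical and only borrow column names from the dataset); on the real data the same expression would be applied to `loans_raw`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for loans_raw; the values are made up.
toy = pd.DataFrame({
    "LOAN_ID": [1, 2, 3, 4],
    "TOWN_NAME": ["La Paz", None, "Nairobi", None],
    "VIDEO_ID": [np.nan, np.nan, np.nan, 3002.0],
})

# isnull() marks missing cells; the column mean is the fraction missing.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print((missing_share * 100).round(1))
```

Sorting by share puts nearly empty columns such as `VIDEO_ID` at the top, which makes it easier to decide which columns to drop and which to impute.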
We will proceed by asking some basic questions about the loans.
plt.rcParams['figure.figsize'] = (10, 5)
What are these loans going towards?
sectors = loans_raw["SECTOR_NAME"].value_counts().sort_values()
ax = sectors.plot.barh()
ax.set_title("Loans by Sector")
Text(0.5, 1.0, 'Loans by Sector')
What about if we get slightly more granular?
activities = loans_raw["ACTIVITY_NAME"].value_counts().nlargest(15).sort_values()
ax = activities.plot.barh()
ax.set_title("Loans by Activity")
Text(0.5, 1.0, 'Loans by Activity')
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text
additional_stopwords = ['buy', 'purchase', 'sell', 'pay', 'additional', 'items', 'materials', 'stock', 'like', 'goods',
'increase', 'new', 'improve', 'build', 'make', 'products', 'supplies']
stop_words = text.ENGLISH_STOP_WORDS.union(additional_stopwords)
vectorizer = CountVectorizer(stop_words=stop_words)
words = vectorizer.fit_transform(loans_raw["LOAN_USE"].dropna())
words_sum = words.sum(axis=0)
words_freq = [(word, words_sum[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
# The double sorting is because barh puts the first element at the bottom
top_20_words = sorted(words_freq[0:20], key=lambda x:x[1])
y = [word[0] for word in top_20_words]
width = [freq[1] for freq in top_20_words]
fig, ax = plt.subplots()
ax.set_title("Most common specific description words")
ax.barh(y, width=width)
<BarContainer object of 20 artists>
As you can see, the most common reasons for loans are to help small businesses, such as farming and local retail.
What does the distribution of loan amounts look like?
amounts = loans_raw['LOAN_AMOUNT']
# Remove outliers to be able to view amounts at a reasonable scale.
# These outliers are quite rare, but some have extremely large amounts, skewing the image.
def filter_outliers(data, crit=4):
    """Removes data points with a zscore magnitude greater than the critical value crit."""
    z = np.abs(stats.zscore(data))
    return data[(z < crit)]
filtered_amounts = filter_outliers(amounts)
ax = sns.distplot(filtered_amounts)
ax.set_xticks(np.arange(0, np.max(filtered_amounts) + 1, 500))
ax.set_xlabel("Loan Amounts in USD")
Text(0.5, 0, 'Loan Amounts in USD')
We can see that the most common loan amount is 250 USD, though multiples of 500 are generally more popular than surrounding amounts.
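To make the effect of the z-score filter concrete, here is a small sketch on made-up amounts (not the Kiva data). The helper `zscore_filter` is a hypothetical restatement of the filter used above, written for a plain NumPy array.

```python
import numpy as np
from scipy import stats

def zscore_filter(values, crit=4):
    """Keep only the values whose z-score magnitude is below crit."""
    values = np.asarray(values, dtype=float)
    z = np.abs(stats.zscore(values))
    return values[z < crit]

# Made-up sample: many small loans plus one very large one.
amounts_toy = np.array([250.0] * 50 + [500.0] * 30 + [100_000.0])
kept = zscore_filter(amounts_toy)
print(len(amounts_toy), len(kept), kept.max())
```

Because the mean and standard deviation are themselves pulled up by extreme values, this kind of filter only removes points that are far out relative to an already inflated spread. That is acceptable for plotting, but it is an assumption worth keeping in mind when reading the histogram.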
How many people tend to contribute to each loan?
filtered_lenders = filter_outliers(loans_raw["NUM_LENDERS_TOTAL"])
ax = sns.distplot(filtered_lenders)
ax.set_xticks(np.arange(0, 141, 10))
ax.set_xlabel("Number of lenders per loan")
print(filtered_lenders.value_counts().head())
8 78933 1 72962 9 72353 7 72134 5 69761 Name: NUM_LENDERS_TOTAL, dtype: int64
We see that the distribution is centered around 8 lenders, with a long right tail. However, we can also see that a significant number of loans were funded by only a single person.
The loans are not guaranteed to get fully funded. How many do not?
loan_diffs = amounts - loans_raw["FUNDED_AMOUNT"]
funding_status = ["funded" if diff == 0 else "unfunded" for diff in loan_diffs]
loan_diffs = loan_diffs[loan_diffs != 0]
percent_unfunded = ((loan_diffs.count() / amounts.count()) * 100).round(1)
print(f"{percent_unfunded}% of loans do not get fully funded.")
unfunded_asking_amounts = amounts[loan_diffs.index]
# loan_diffs is the amount still missing, so this ratio is the shortfall rather than the funded share
unfunded_ratios = loan_diffs / unfunded_asking_amounts
mean_shortfall = (unfunded_ratios.mean() * 100).round(1)
print(f"Of those, the mean shortfall is {mean_shortfall}% of the ask.")
sns.countplot(x=funding_status)
5.1% of loans do not get fully funded. Of those, the mean shortfall is 56.4% of the ask.
<matplotlib.axes._subplots.AxesSubplot at 0x1da5508b748>
loans_raw["TIME_TO_FUND"] = loans_raw["RAISED_TIME"] - loans_raw["POSTED_TIME"]
loans_raw["DAYS_TO_FUND"] = [time.days for time in loans_raw["TIME_TO_FUND"]]
days_to_fund = filter_outliers(loans_raw["DAYS_TO_FUND"].dropna())
ax = sns.distplot(days_to_fund)
ax.set_xticks(np.arange(0, 71, 5))
plt.show()
Is there a relationship between the loan amount and the number of days it takes to fund the loan?
# Need to filter outliers but don't want to filter by different amounts
amounts_and_days = loans_raw[["LOAN_AMOUNT", "DAYS_TO_FUND"]]
amounts_and_days = amounts_and_days[(amounts_and_days["LOAN_AMOUNT"].isin(filtered_amounts))
& (amounts_and_days["DAYS_TO_FUND"].isin(days_to_fund))]
sample = amounts_and_days.sample(250)
sns.lmplot(x="LOAN_AMOUNT", y="DAYS_TO_FUND", data=sample, fit_reg=True)
<seaborn.axisgrid.FacetGrid at 0x1da53a875c8>
There does not seem to be a clear relationship.
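A scatter plot of 250 sampled points is a fairly informal check. One way to put a number on it is a rank correlation, which does not assume a linear relationship. The sketch below uses synthetic, independently drawn values (not Kiva data) purely to show the call; `loan_amount_toy` and `days_to_fund_toy` are hypothetical names.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins: amounts in steps of 25 USD, days drawn independently.
loan_amount_toy = rng.integers(1, 41, size=500) * 25
days_to_fund_toy = rng.integers(0, 31, size=500)

# spearmanr returns the rank correlation and a p-value for "no association".
rho, p_value = stats.spearmanr(loan_amount_toy, days_to_fund_toy)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3f}")
```

On the notebook's data, the same call could be made on the `LOAN_AMOUNT` and `DAYS_TO_FUND` columns of `amounts_and_days`; a value of rho near zero would support the visual impression.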
Once the loans are distributed, there is no guarantee that they will be repaid. How many do not?
Unfortunately, there was no information to answer this question in the original dataset. We instead turn to a previously supplied dataset of 5000 randomly sampled loans with their repayment details, taken from https://stat.duke.edu/resources/datasets/kiva-loans
# Source: https://stat.duke.edu/resources/datasets/kiva-loans
repayments = pd.read_excel("datasets/loan_repayment_samples.xlsx")
# basket_amount was all n/a, video.youtube_id was not used
repayments = repayments.drop(["basket_amount", "video.youtube_id"], axis=1)
# This dataset has a row for each individual payment; aggregate these into a single row per loan
grouped_repayments = repayments.groupby("id", as_index=False).agg(lambda x: x.tolist())
# Get the payment status of the loan
grouped_repayments["status"] = grouped_repayments["status"].agg(lambda x: "defaulted" if "defaulted" in x else "paid")
status_counts = grouped_repayments["status"].value_counts()
default_ratio = ((status_counts["defaulted"] / len(grouped_repayments)) * 100).round(1)
print(f"{default_ratio} % of loans in this sample have defaulted.")
fig, ax = plt.subplots()
ax.pie(x=status_counts, labels=status_counts.index)  # take labels from the data so they cannot be mismatched
plt.show()
2.0 % of loans in this sample have defaulted.
What is the gender ratio of those receiving loans?
genders = loans_raw["BORROWER_GENDERS"].dropna()
# The gender entries include all recipients as a comma separated string, so a simple value_counts will not work.
females = 0
males = 0
for entry in genders:
    recipients = entry.split(", ")
    for gender in recipients:
        if gender == "female":
            females += 1
        else:
            males += 1
fig, ax = plt.subplots()
ax.pie(x=[males, females], labels=["males", "females"])
female_ratio = round((females / (females + males)) * 100, 1)
male_ratio = round(((males / (females + males)) * 100), 1)
print(f"Loan recipients are {female_ratio}% female and {male_ratio}% male.")
Loan recipients are 80.0% female and 20.0% male.
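The nested loop above can also be expressed with pandas string methods. This is a sketch of an alternative, not the notebook's method, and it runs on a made-up series (`genders_toy` is hypothetical) rather than the real column.

```python
import pandas as pd

# Made-up entries in the same comma-separated format as BORROWER_GENDERS.
genders_toy = pd.Series(
    ["female", "female, female, male", "male", None, "female, male"]
)

# Split each entry into a list, give every borrower its own row, then count.
counts = genders_toy.dropna().str.split(", ").explode().value_counts()
shares = (counts / counts.sum() * 100).round(1)
print(counts)
print(shares)
```

Unlike the loop, which counts every value other than "female" as male, `value_counts` would surface any unexpected labels as their own category.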
What about the lenders themselves? What can we determine about them?
lenders_raw = pd.read_csv("datasets/lenders.csv")
lenders_raw.dropna().head(1)
| | PERMANENT_NAME | DISPLAY_NAME | MAIN_PIC_ID | CITY | STATE | COUNTRY_CODE | MEMBER_SINCE | PERSONAL_URL | OCCUPATION | LOAN_BECAUSE | OTHER_INFO | LOAN_PURCHASE_NUM | INVITED_BY | NUM_INVITED |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11476 | barketing | Erika Godwin | 2899846.0 | Ottawa | Ontario | CA | 1528393535 | www.barketing.co/ | Business Owner | I loan because I want to help fellow female en... | I am co-founder of ProPet Software and owner o... | 5 | Bella | 0 |
map_loans_to_lenders = pd.read_csv("datasets/loans_lenders.csv")
loans_raw = pd.merge(loans_raw, map_loans_to_lenders, on="LOAN_ID")
What are the nationalities of the lenders?
import squarify
lenders_by_country = lenders_raw["COUNTRY_CODE"].value_counts()
ax = squarify.plot(
sizes=lenders_by_country.values[0:20],
label=lenders_by_country.index[0:20],
alpha=.7
)
Evidently the majority of the users live in the US. Which states in particular?
state_mappings = pd.read_csv("datasets/states.csv")
state_pops = pd.read_csv("datasets/us_state_populations.csv")
import plotly.express as px
state_values = lenders_raw["STATE"].value_counts()
num_lenders_per_state = {}
for state in state_values.index[0:200]: # ignore the irrelevant state entries from other countries
    # Some entries have state as its full name, others its code; combine the two
    if state in state_mappings["State"].values:
        code = state_mappings[state_mappings["State"] == state]["Abbreviation"].iat[0]
        num_lenders_per_state[code] = num_lenders_per_state.get(code, 0) + state_values[state]
    elif state in state_mappings["Abbreviation"].values:
        num_lenders_per_state[state] = num_lenders_per_state.get(state, 0) + state_values[state]
us_lenders_by_state = pd.Series(num_lenders_per_state).sort_values(ascending=False)
print(us_lenders_by_state.head())
fig = px.choropleth(
locations=us_lenders_by_state.index,
color=us_lenders_by_state.values,
locationmode="USA-states",
title="Number of lenders per US state",
scope="usa",
color_continuous_scale=px.colors.sequential.Plasma
)
fig.show()
CA 57270 NY 20708 TX 16410 WA 15102 IL 12441 dtype: int64
California has more than twice as many users as the next state, New York. The plots above use raw counts, though; what happens when we adjust for population? Let's again look at countries first:
populations = pd.read_csv("datasets/world_pop.csv")
# Load a dictionary to convert iso2 to iso3 for other datasets and plotly
import urllib.request, json
with urllib.request.urlopen("http://country.io/iso3.json") as url:
iso2_to_3_dict = json.loads(url.read().decode())
country_codes = []
per_capita_values = []
for country in lenders_by_country.index:
    code = iso2_to_3_dict.get(country)
    pop = populations[populations["Country Code"] == code]["2016"]
    if pop is None or len(pop) == 0:
        continue
    val = lenders_by_country[country] / pop.iat[0]
    country_codes.append(code)
    per_capita_values.append(val)
lenders_per_capita = pd.Series(per_capita_values, index=country_codes).sort_values(ascending=False)
ax = squarify.plot(
sizes=lenders_per_capita.values[0:20],
label=lenders_per_capita.index[0:20],
alpha=0.7,
)
Per-capita, user nationality is much more evenly distributed, with Canada winning out. What about states?
us_per_capita_lenders = {}
for state in us_lenders_by_state.index:
    full_name = state_mappings[state_mappings["Abbreviation"] == state]["State"].iat[0]
    pop = state_pops[state_pops["NAME"] == full_name]["POPESTIMATE2019"].iat[0]
    us_per_capita_lenders[state] = us_lenders_by_state[state] / pop
us_per_capita_lenders = pd.Series(us_per_capita_lenders).sort_values(ascending=False)
print(us_per_capita_lenders.head())
fig = px.choropleth(
locations=us_per_capita_lenders.index,
color=us_per_capita_lenders.values,
locationmode="USA-states",
title="Number of per-capita lenders per state",
scope="usa",
color_continuous_scale=px.colors.sequential.Plasma
)
fig.show()
DC 0.004748 WA 0.001983 OR 0.001890 AK 0.001792 MA 0.001604 dtype: float64
After adjusting for population, California is no longer even in the top 5. Though difficult to see on the map, the number one spot goes to the District of Columbia. Another surprise is that Alaska comes in at number 4.
How many available loans are there at once? How has this changed over time as the organization has grown?
timeseries = loans_raw.set_index("POSTED_TIME").sort_index()
timeseries = timeseries.set_index(timeseries.index.to_pydatetime())
ax = timeseries["LOAN_ID"].rolling("30d").count().plot() # Loans are available for 30 days once posted
ax.set_title("Number of available loans on the site")
Text(0.5, 1.0, 'Number of available loans on the site')
ax = timeseries["FUNDED_AMOUNT"].rolling("30d").sum().plot() # Loans are available for 30 days once posted
ax.set_title("Amount Funded in 10s of millions")
Text(0.5, 1.0, 'Amount Funded in 10s of millions')
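The two plots above rely on time-based rolling windows. As a sketch of what `rolling("30d")` computes, the example below uses a made-up, date-indexed frame (`toy` and its values are hypothetical): each row's result covers the trailing 30 days that end at that row's timestamp.

```python
import pandas as pd

# Made-up posting dates and amounts, already sorted by time.
posted = pd.to_datetime(
    ["2019-01-01", "2019-01-10", "2019-01-20", "2019-02-05", "2019-03-01"]
)
toy = pd.DataFrame(
    {"LOAN_ID": [1, 2, 3, 4, 5],
     "FUNDED_AMOUNT": [100.0, 200.0, 50.0, 25.0, 300.0]},
    index=posted,
)

# Number of loans posted within the trailing 30 days of each row.
recent_count = toy["LOAN_ID"].rolling("30d").count()
# Total funded amount of the loans posted within the same window.
recent_funded = toy["FUNDED_AMOUNT"].rolling("30d").sum()
print(pd.DataFrame({"count": recent_count, "funded": recent_funded}))
```

Note that the count is a proxy: it treats every loan as listed for exactly 30 days after posting, which is the assumption stated in the notebook's code comment, and it ignores loans that were funded or withdrawn earlier.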
Which countries receive the most loans?
loans_raw["ISO3_CODES"] = [iso2_to_3_dict.get(code, np.nan) for code in loans_raw["COUNTRY_CODE"]]
countries = loans_raw["ISO3_CODES"].dropna()
loans_per_country = countries.value_counts()
fig = px.choropleth(
locations=loans_per_country.index,
color=loans_per_country.values,
title="Number of loans per country",
color_continuous_scale=px.colors.sequential.Plasma
)
fig.show()
ax = squarify.plot(sizes=loans_per_country.values[0:20], label=loans_per_country.index[0:20], alpha=.7)
We can see that two countries have received significantly more loans than others: the Philippines and Kenya, with the Philippines receiving more than 300 thousand loans and Kenya receiving more than 170 thousand. After that, Peru, Cambodia, and El Salvador have each received close to 90 thousand loans.
What about the actual amounts received, rather than number of loans?
amounts_by_country = loans_raw.groupby("ISO3_CODES")["FUNDED_AMOUNT"].agg(sum)
sorted_amounts = amounts_by_country.sort_values(ascending=False)
fig = px.choropleth(
locations=amounts_by_country.index,
color=amounts_by_country,
title="Dollars received in loans per country",
color_continuous_scale=px.colors.sequential.Plasma
)
fig.show()
ax = squarify.plot(sizes=sorted_amounts.values[0:20], label=sorted_amounts.index[0:20], alpha=.7)
ax.set_title("Dollars received in loans per country")
<matplotlib.axes._subplots.AxesSubplot at 0x1dabe2923c8>
When sorting by amount received rather than number of loans, the United States appears in the top 10. Why might that be?
us_loans = loans_raw[loans_raw["COUNTRY_NAME"] == "United States"]
us_activities_by_amount = us_loans.groupby("ACTIVITY_NAME")["FUNDED_AMOUNT"].agg(sum).sort_values(ascending=False)
ax = us_activities_by_amount[0:20].sort_values().plot.barh()
ax.set_title("US loans by Activity")
Text(0.5, 1.0, 'US loans by Activity')
Although I originally hypothesized that perhaps the loans were for personal health care, Kiva focuses its within-US efforts on supporting entrepreneurs, leading to the activity breakdown above.
Compared to the rest of the countries that receive significant amounts of money through Kiva, the US seems like an outlier. How does its distribution of lenders compare to other countries?
us_lenders = us_loans["LENDERS"]
made_us_loan = set()
def get_lender_names(lenders_csv):
    lender_usernames = lenders_csv.split(", ")
    for name in lender_usernames:
        made_us_loan.add(name)
us_lenders.apply(get_lender_names)
lenders_by_username = lenders_raw.set_index("PERMANENT_NAME")
filtered_us_lenders = lenders_raw[lenders_raw["PERMANENT_NAME"].isin(made_us_loan)]
filtered_us_lenders = filtered_us_lenders[filtered_us_lenders["COUNTRY_CODE"].notna()]
country_codes = lenders_raw["COUNTRY_CODE"]
lender_countries = {country: 0 for country in country_codes.unique()}
def count_lender_countries(lender):
    country_code = lender["COUNTRY_CODE"]
    lender_countries[country_code] += 1
filtered_us_lenders.apply(count_lender_countries, axis=1)
codes_df = pd.DataFrame(lender_countries.values(), index=lender_countries.keys(), columns=["LENDER_COUNT"])
us_lender_countries = codes_df.nlargest(20, columns="LENDER_COUNT")
ax = squarify.plot(sizes=us_lender_countries.values[0:20], label=us_lender_countries.index[0:20], alpha=.7)
ax.set_title("Lender countries for US loans")
Text(0.5, 1.0, 'Lender countries for US loans')
As one might expect, the share of lenders based in the US is larger for US loans than the baseline for all loans, rising from roughly 70% to over 80%.
Looking into the United States led to the question: what are the largest average loan amounts per country?
def get_large_sample_size_mean(country_loans):
    """Returns the mean of a country's loans if the sample size is at least 500, otherwise 0."""
    return 0 if len(country_loans) < 500 else np.mean(country_loans)
avg_amt_by_country = loans_raw.groupby("COUNTRY_NAME")["FUNDED_AMOUNT"].agg(get_large_sample_size_mean).sort_values(ascending=False)
ax = avg_amt_by_country[0:20].sort_values().plot.barh()
ax.set_title("Average loan amounts by country")
Text(0.5, 1.0, 'Average loan amounts by country')
Using Duke's sample dataset, we previously estimated that the loan default rate was roughly 2%. We will proceed using this estimate to calculate the expected personal loss in present value when making a single loan.
We will make the following simplifying assumptions:
In order to compute the expected loss, we must figure out the expected repayment intervals.
monthly_loans = loans_raw[loans_raw["REPAYMENT_INTERVAL"] == "monthly"]
num_repayments = monthly_loans["LENDER_TERM"].mean().round()
print(num_repayments)
print(default_ratio)
13.0 2.0
def NPV(pay_per_period, discount_rate, n_periods):
    """Return the present value of n_periods equal end-of-period payments."""
    result = 0
    for t in range(1, int(n_periods) + 1):
        result += pay_per_period / (1 + discount_rate)**t
    return result
def expected_loss(initial_amount, discount_rate=0.035, num_repayments=13):
    """Return the expected change in present value from lending initial_amount."""
    pay_per_period = initial_amount / num_repayments
    # default_ratio is a percentage (e.g. 2.0); convert it to the probability of being repaid
    chance_to_repay = (100 - default_ratio) / 100
    expected_discounted_value = chance_to_repay * NPV(pay_per_period, discount_rate, num_repayments)
    # discounted value - current value is how much we're losing (negative means a loss)
    return expected_discounted_value - initial_amount
x = np.linspace(0, 2500, 50) # amounts of money we could loan
discount = 0.0052 # current long-term average S&P 500 monthly return
# (according to https://ycharts.com/indicators/sp_500_monthly_total_return)
y = expected_loss(x, discount, num_repayments)
fig, ax = plt.subplots(1)
ax.set_xlabel("Amount loaned")
ax.set_ylabel("Change in present value")
ax.set_title("Expected result with 0.52% monthly discount rate")
ax.plot(x, y, 'r')
[<matplotlib.lines.Line2D at 0x1dad794c148>]
Assuming the money would otherwise have gone into an index fund, a lender who is willing to accept the variance of these loans can provide \$1000 worth of capital while losing only roughly \$55 in expected present value.
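As a cross-check of the \$55 figure, the discounted sum can be replaced by the closed-form annuity factor. The sketch below is standalone and restates the inputs used above (13 monthly repayments, a 0.52% monthly discount rate, a 2% default rate); the helper `annuity_present_value` is a hypothetical name. Like the code above, it treats default as all-or-nothing by applying the repayment probability to the whole payment stream.

```python
def annuity_present_value(payment, rate, n_periods):
    """Closed-form present value of n equal end-of-period payments."""
    return payment * (1 - (1 + rate) ** -n_periods) / rate

amount = 1000.0        # amount loaned, in USD
n_repayments = 13      # mean monthly lender term from the dataset
monthly_rate = 0.0052  # discount rate used in the plot above
p_repaid = 0.98        # 1 minus the 2% default estimate

pv = annuity_present_value(amount / n_repayments, monthly_rate, n_repayments)
expected_change = p_repaid * pv - amount
print(f"PV if repaid: {pv:.2f}, expected change: {expected_change:.2f}")

# The closed form should agree with summing the discounted payments directly.
direct = sum((amount / n_repayments) / (1 + monthly_rate) ** t
             for t in range(1, n_repayments + 1))
print(f"difference from direct sum: {abs(pv - direct):.2e}")
```

Roughly two thirds of the expected loss comes from discounting alone and the remainder from the default probability, so under these assumptions the result is more sensitive to the chosen discount rate than to the default estimate.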
We now turn to the task of attempting to use the dataset to predict whether a loan will be fully funded or not. Because the dataset has multiple different data types, missing values, and superfluous features, it requires quite a bit of preprocessing before we can pass it to machine learning models.
First, we will split the features into 3 sets (numeric, categorical, and free-text description features) and ensure that each set can be trained on.
Second, we will try learning on these sets of features individually to try to see if any are more predictive than others. We will construct an ensemble of classification models to first benchmark each model's individual performance, and then the performance of the models together.
Finally, we will combine all that we have done previously into one large benchmark. If we get any promising preliminary results for a particular model, we may spend more time optimizing that final model by tuning scikit-learn parameters and model hyperparameters.
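One possible way to combine the three feature sets into a single matrix is scikit-learn's `ColumnTransformer`. The sketch below is an illustration on a made-up frame, not the notebook's pipeline: the choice of `StandardScaler`, `OneHotEncoder`, and `CountVectorizer` for the numeric, categorical, and text columns is an assumption, and `toy` and `combine` are hypothetical names.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows with one column from each of the three feature sets.
toy = pd.DataFrame({
    "LOAN_AMOUNT": [250.0, 500.0, 1200.0, 300.0],
    "SECTOR_NAME": ["Agriculture", "Retail", "Education", "Agriculture"],
    "LOAN_USE": ["buy seeds", "buy fabric", "pay tuition", "buy seeds and tools"],
})

combine = ColumnTransformer([
    # Lists select 2D column blocks for the numeric and categorical encoders.
    ("numeric", StandardScaler(), ["LOAN_AMOUNT"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["SECTOR_NAME"]),
    # A bare string passes a 1D column, which text vectorizers expect.
    ("text", CountVectorizer(), "LOAN_USE"),
])

X_all = combine.fit_transform(toy)
print(X_all.shape)
```

Here the result has one scaled numeric column, three one-hot columns, and seven word-count columns. Fitting the transformer on the training split only, and then applying it to the test split, keeps vocabulary and category information from leaking across the split.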
# For the sake of the exercise, ignore features that:
# 1) are not populated upon the loan first being available
# 2) are irrelevant ids unique to the loan
# Note that disbursed dates are typically available when first viewing the loan,
# either as pre-disbursal from the partner or as the expected date
number_features = [
'LOAN_AMOUNT',
'DISBURSE_TIME',
'LENDER_TERM',
'CURRENCY_EXCHANGE_COVERAGE_RATE',
'POSTED_TIME'
]
categorical_features = [
'ACTIVITY_NAME',
'SECTOR_NAME',
'COUNTRY_CODE',
'COUNTRY_NAME',
'TOWN_NAME',
'CURRENCY_POLICY',
'CURRENCY',
'TAGS',
'BORROWER_GENDERS',
'BORROWER_PICTURED',
'PARTNER_ID',
]
description_features = [
'LOAN_USE',
'DESCRIPTION',
'DESCRIPTION_TRANSLATED'
]
import logging
import sys
from time import time
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import RidgeCV
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.extmath import density
from sklearn import metrics
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier
# Source: forked from https://scikit-learn.org/stable/auto_examples/text/plot_document_classification_20newsgroups.html#sphx-glr-auto-examples-text-plot-document-classification-20newsgroups-py
# License: BSD 3 clause
# Display progress logs on stdout
logging.basicConfig(level=logging.INFO,
format='%(asctime)s %(levelname)s %(message)s')
def benchmark(clf, name, X_train, X_test, y_train, y_test):
    """Trains and tests a model. Prints information about its performance."""
    print('=' * 80)
    print(name)
    print('_' * 80)
    print("Training: ")
    print(clf)
    t0 = time()
    clf.fit(X_train, y_train)
    train_time = time() - t0
    print("train time: %0.3fs" % train_time)
    t0 = time()
    pred = clf.predict(X_test)
    test_time = time() - t0
    print("test time: %0.3fs" % test_time)
    score = metrics.accuracy_score(y_test, pred)
    print("accuracy: %0.3f" % score)
    if hasattr(clf, 'coef_'):
        print("dimensionality: %d" % clf.coef_.shape[1])
        print("density: %f" % density(clf.coef_))
        print()
    print("confusion matrix:")
    print(metrics.confusion_matrix(y_test, pred))
    print()
    clf_descr = str(clf).split('(')[0]
    return clf_descr, score, train_time, test_time
def rank_classifiers(X_train, X_test, y_train, y_test, n_iters=100, njobs=-1):
    """Creates, trains, and evaluates many classification models."""
    results = []
    classifiers = []
    # Baseline: always predict the most common class in the training set
    mode = y_train.value_counts().index[0]
    score = len(y_test[y_test.values == mode]) / len(y_test)
    results.append(
        ("Predict Mode",
         score,
         0,
         0)
    )
    print('=' * 80)
    print("Predict the Mode")
    print('_' * 80)
    print("Score:", score)
    print()
    for clf, name in (
            (RidgeClassifier(tol=1e-2, solver="auto"), "Ridge Classifier"),
            (Perceptron(max_iter=n_iters, n_jobs=njobs), "Perceptron"),
            (MLPClassifier(max_iter=n_iters, early_stopping=True), "Neural Network"),
            (PassiveAggressiveClassifier(max_iter=n_iters * 10, n_jobs=njobs), "Passive-Aggressive"),
            (KNeighborsClassifier(n_neighbors=5, n_jobs=njobs), "kNN"),
            (RandomForestClassifier(max_depth=n_iters, n_jobs=njobs), "Random forest")):
        results.append(benchmark(clf, name, X_train, X_test, y_train, y_test))
        classifiers.append((name, clf))
    for penalty in ["l2", "l1"]:
        name = f"Liblinear with {penalty.upper()} penalty"
        clf = LinearSVC(penalty=penalty, dual=False, tol=1e-3, max_iter=n_iters * 50)
        results.append(benchmark(clf, name, X_train, X_test, y_train, y_test))
        classifiers.append((name, clf))
        name = f"Ordinary SGD with {penalty.upper()} penalty"
        clf = SGDClassifier(alpha=.0001, max_iter=n_iters, penalty=penalty, n_jobs=njobs)
        results.append(benchmark(clf, name, X_train, X_test, y_train, y_test))
        classifiers.append((name, clf))
    name = "SGD with Elastic Net penalty"
    clf = SGDClassifier(alpha=.0001, max_iter=n_iters, penalty="elasticnet", n_jobs=njobs)
    results.append(benchmark(clf, name, X_train, X_test, y_train, y_test))
    classifiers.append((name, clf))
    name = "LinearSVC with L1-based feature selection"
    # The smaller C, the stronger the regularization.
    # The more regularization, the more sparsity.
    clf = Pipeline([
        ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False, tol=1e-3))),
        ('classification', LinearSVC(penalty="l2"))
    ])
    results.append(benchmark(clf, name, X_train, X_test, y_train, y_test))
    classifiers.append((name, clf))
    # Ensemble methods were causing intermittent mmap file errors on windows
    try:
        name = "Voting classifier with all previous estimators"
        voting_classifier = VotingClassifier(estimators=classifiers)
        results.append(benchmark(voting_classifier, name, X_train, X_test, y_train, y_test))
        name = "Stacking classifier with all previous estimators"
        stacking_classifier = StackingClassifier(estimators=classifiers)
        results.append(benchmark(stacking_classifier, name, X_train, X_test, y_train, y_test))
    except Exception as exc:
        # Log the failure instead of silently discarding it
        logging.warning("Skipping remaining ensemble benchmarks: %s", exc)
    return results
def plot_rankings(results):
    """Given tuples of a model name, score, training time, and testing time, produces a plot of each model's performance."""
    indices = np.arange(len(results))
    results = [[x[i] for x in results] for i in range(4)]
    clf_names, score, training_time, test_time = results
    training_time = np.array(training_time) / np.max(training_time)
    test_time = np.array(test_time) / np.max(test_time)
    plt.figure(figsize=(14, 10))
    plt.title("Score")
    plt.barh(indices, score, .2, label="score", color='navy')
    plt.barh(indices + .3, training_time, .2, label="training time", color='c')
    plt.barh(indices + .6, test_time, .2, label="test time", color='darkorange')
    plt.yticks(())
    plt.legend(loc='best')
    plt.subplots_adjust(left=.25)
    plt.subplots_adjust(top=.95)
    plt.subplots_adjust(bottom=.05)
    for i, c in zip(indices, clf_names):
        plt.text(-.3, i, c)
    plt.show()
For the sake of time, we only run the models on a very small subset of the dataset (1% for training and another 1% for testing). Previous trial runs that used the full dataset did not show significant differences in performance.
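When only a small fraction of rows is sampled, rare outcome classes can end up under-represented in one of the splits by chance. A stratified split is one way to guard against that. The sketch below is an illustration on made-up labels, not what the notebook does: the `stratify` and `random_state` arguments are additions, and the class names and sizes are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels: 95 common outcomes and 5 rare ones.
y_toy = pd.Series(["funded"] * 95 + ["expired"] * 5)
X_toy = pd.DataFrame({"LOAN_AMOUNT": range(100)})

# stratify=y_toy keeps the class proportions equal in both subsets;
# random_state makes the sample reproducible between runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.4, test_size=0.4, stratify=y_toy, random_state=0
)
print(y_tr.value_counts())
print(y_te.value_counts())
```

Stratification requires every class to have at least two members, so extremely rare statuses may need to be merged or dropped before splitting.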
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
X_cat = loans_raw[categorical_features]
X_cat = X_cat.fillna("")
y = loans_raw["STATUS"]
categories = [X_cat[column].unique() for column in X_cat]
cat_pipe = make_pipeline(
SimpleImputer(missing_values=None,
strategy='constant',
fill_value='missing'),
OneHotEncoder(categories=categories)
)
X_train, X_test, y_train, y_test = train_test_split(X_cat, y, train_size=0.01, test_size=0.01)
cat_pipe.fit(X_train)
X_train = cat_pipe.transform(X_train)
X_test = cat_pipe.transform(X_test)
cat_results = rank_classifiers(X_train, X_test, y_train, y_test, 100)
plot_rankings(cat_results)
================================================================================
Predict the Mode
________________________________________________________________________________
Score: 0.947670864408852

================================================================================
Ridge Classifier
________________________________________________________________________________
Training:
RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, random_state=None, solver='auto', tol=0.01)
train time: 2.647s
test time: 0.007s
accuracy: 0.944
dimensionality: 185898
density: 0.046068
confusion matrix:
[[   20     0   731     0]
 [    0     0    34     0]
 [   76     2 15424     0]
 [    0     0    68     3]]

================================================================================
Perceptron
________________________________________________________________________________
Training:
Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=1.0, fit_intercept=True, max_iter=100, n_iter_no_change=5, n_jobs=-1, penalty=None, random_state=0, shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False)
train time: 0.125s
test time: 0.011s
accuracy: 0.924
dimensionality: 185898
density: 0.013029
confusion matrix:
[[  162     1   588     0]
 [    1     0    33     0]
 [  545     8 14942     7]
 [    0     0    53    18]]

================================================================================
Neural Network
________________________________________________________________________________
Training:
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9, beta_2=0.999, early_stopping=True, epsilon=1e-08, hidden_layer_sizes=(100,), learning_rate='constant', learning_rate_init=0.001, max_fun=15000, max_iter=100, momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5, random_state=None, shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1, verbose=False, warm_start=False)
train time: 1375.956s
test time: 0.061s
accuracy: 0.947
confusion matrix:
[[    9     0   742     0]
 [    0     0    34     0]
 [   18     0 15484     0]
 [    0     0    71     0]]

================================================================================
Passive-Aggressive
________________________________________________________________________________
Training:
PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None, early_stopping=False, fit_intercept=True, loss='hinge', max_iter=1000, n_iter_no_change=5, n_jobs=-1, random_state=None, shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False)
train time: 0.226s
test time: 0.011s
accuracy: 0.934
dimensionality: 185898
density: 0.021068
confusion matrix:
[[   83     1   667     0]
 [    1     0    33     0]
 [  303     6 15175    18]
 [    0     0    47    24]]

================================================================================
kNN
________________________________________________________________________________
Training:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=-1, n_neighbors=5, p=2, weights='uniform')
train time: 0.029s
test time: 11.451s
accuracy: 0.943
confusion matrix:
[[   47     0   704     0]
 [    0     0    34     0]
 [  124     1 15374     3]
 [    2     0    67     2]]

================================================================================
Random forest
________________________________________________________________________________
Training:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=100, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1, oob_score=False, random_state=None, verbose=0, warm_start=False)
train time: 24.794s
test time: 0.217s
accuracy: 0.948
confusion matrix:
[[    0     0   751     0]
 [    0     0    34     0]
 [    2     0 15500     0]
 [    0     0    71     0]]

================================================================================
Liblinear with L2 penalty
________________________________________________________________________________
Training:
LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class='ovr', penalty='l2', random_state=None, tol=0.001, verbose=0)
train time: 1.388s
test time: 0.003s
accuracy: 0.939
dimensionality: 185898
density: 0.046068
confusion matrix:
[[   58     0   693     0]
 [    0     0    34     0]
 [  205     1 15293     3]
 [    0     0    54    17]]

================================================================================
Ordinary SGD with L2 penalty
________________________________________________________________________________
Training:
SGDClassifier(alpha=0.0001, average=False, class_weight=None, early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=100, n_iter_no_change=5, n_jobs=-1, penalty='l2', power_t=0.5, random_state=None, shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False)
train time: 0.122s
test time: 0.011s
accuracy: 0.946
dimensionality: 185898
density: 0.015328
confusion matrix:
[[   20     0   731     0]
 [    0     0    34     0]
 [   45     0 15457     0]
 [    0     0    69     2]]

================================================================================
Liblinear with L1 penalty
________________________________________________________________________________
Training:
LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class='ovr', penalty='l1', random_state=None, tol=0.001, verbose=0)
train time: 3.318s
test time: 0.006s
accuracy: 0.942
dimensionality: 185898
density: 0.003877
confusion matrix:
[[   26     0   725     0]
 [    0     0    34     0]
 [  128     4 15370     0]
 [    0     0    65     6]]

================================================================================
Ordinary SGD with L1 penalty
________________________________________________________________________________
Training:
SGDClassifier(alpha=0.0001, average=False, class_weight=None, early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=100, n_iter_no_change=5, n_jobs=-1, penalty='l1', power_t=0.5, random_state=None, shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False)
train time: 0.129s
test time: 0.012s
accuracy: 0.946
dimensionality: 185898
density: 0.000769
confusion matrix:
[[    3     0   748     0]
 [    0     0    34     0]
 [   27     1 15473     1]
 [    0     0    69     2]]

================================================================================
SGD with Elastic Net penalty
________________________________________________________________________________
Training:
SGDClassifier(alpha=0.0001, average=False, class_weight=None, early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=100, n_iter_no_change=5, n_jobs=-1, penalty='elasticnet', power_t=0.5, random_state=None, shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False)
train time: 0.233s
test time: 0.011s
accuracy: 0.947
dimensionality: 185898
density: 0.005589
confusion matrix:
[[    4     0   747     0]
 [    0     0    34     0]
 [   26     0 15476     0]
 [    0     0    68     3]]

================================================================================
LinearSVC with L1-based feature selection
________________________________________________________________________________
Training:
Pipeline(memory=None, steps=[('feature_selection', SelectFromModel(estimator=LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class='ovr', penalty='l1', random_state=None, tol=0.001, verbose=0), max_features=None, norm_order=1, prefit=False, threshold=None)), ('classification', LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0))], verbose=False)
train time: 3.752s
test time: 0.030s
accuracy: 0.942
confusion matrix:
[[   37     0   714     0]
 [    0     0    34     0]
 [  144     1 15355     2]
 [    0     0    56    15]]

================================================================================
Voting classifier with all previous estimators
________________________________________________________________________________
Training:
VotingClassifier(estimators=[('Ridge Classifier', RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, random_state=None, solver='auto', tol=0.01)), ('Perceptron', Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=1.0, fit_intercept=True, max_iter=100, n_iter_no_change=5, n_jobs=-1, penalty... verbose=0), max_features=None, norm_order=1, prefit=False, threshold=None)), ('classification', LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0))], verbose=False))], flatten_transform=True, n_jobs=None, voting='hard', weights=None)
train time: 1170.299s
test time: 12.258s
accuracy: 0.945
confusion matrix:
[[   20     0   731     0]
 [    0     0    34     0]
 [   68     0 15434     0]
 [    0     0    68     3]]

================================================================================
Stacking classifier with all previous estimators
________________________________________________________________________________
Training:
StackingClassifier(cv=None, estimators=[('Ridge Classifier', RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, random_state=None, solver='auto', tol=0.01)), ('Perceptron', Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=1.0, fit_intercept=True, max_iter=100, n_iter_no_change=5, n_jobs=... norm_order=1, prefit=False, threshold=None)), ('classification', LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0))], verbose=False))], final_estimator=None, n_jobs=None, passthrough=False, stack_method='auto', verbose=0)
train time: 6410.119s
test time: 10.902s
accuracy: 0.948
confusion matrix:
[[   10     0   741     0]
 [    0     0    34     0]
 [   23     0 15471     8]
 [    0     0    51    20]]
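Every accuracy in the output above sits within a few points of the "Predict the Mode" score, so accuracy alone says little about the minority classes. The sketch below recomputes accuracy, the mode baseline, and per-class recall from the stacking classifier's confusion matrix printed above. It assumes rows are true classes and columns are predicted classes (the scikit-learn convention, and consistent with the row sums being identical across classifiers); the helper name `summarize_confusion` is hypothetical.

```python
import numpy as np

# Confusion matrix copied from the stacking classifier output above.
# Assumption: rows are true classes, columns are predicted classes.
stacking_confusion = np.array([
    [10, 0, 741, 0],
    [0, 0, 34, 0],
    [23, 0, 15471, 8],
    [0, 0, 51, 20],
])


def summarize_confusion(confusion):
    """Return (accuracy, mode baseline, per-class recall) for a confusion matrix."""
    confusion = np.asarray(confusion, dtype=float)
    true_counts = confusion.sum(axis=1)   # number of test rows in each true class
    total = confusion.sum()
    accuracy = np.trace(confusion) / total
    mode_baseline = true_counts.max() / total   # always predict the largest class
    # Recall per class; classes with no test rows get recall 0 instead of NaN.
    recall = np.divide(
        np.diag(confusion),
        true_counts,
        out=np.zeros_like(true_counts),
        where=true_counts > 0,
    )
    return accuracy, mode_baseline, recall


accuracy, mode_baseline, recall = summarize_confusion(stacking_confusion)
print(f"accuracy:          {accuracy:.3f}")
print(f"mode baseline:     {mode_baseline:.3f}")
print(f"per-class recall:  {np.round(recall, 3)}")
print(f"balanced accuracy: {recall.mean():.3f}")
```

Balanced accuracy (the mean of the per-class recalls) is one way to compare these models that does not reward simply predicting the dominant class.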
# Convert the datetime columns to integer timestamps so they can be used as numeric features
loans_raw['DISBURSE_TIME'] = pd.to_numeric(loans_raw['DISBURSE_TIME'])
loans_raw['POSTED_TIME'] = pd.to_numeric(loans_raw['POSTED_TIME'])
X_num = loans_raw[number_features]
y = loans_raw["STATUS"]
# Train on a 1% sample of the loans and test on a disjoint 1% sample
X_train, X_test, y_train, y_test = train_test_split(X_num, y, train_size=0.01, test_size=0.01)
# Fill missing numeric values with the column mean learned from the training sample
num_pipe = make_pipeline(SimpleImputer(strategy='mean'))
X_train = num_pipe.fit_transform(X_train)
X_test = num_pipe.transform(X_test)
number_results = rank_classifiers(X_train, X_test, y_train, y_test, 1000)
plot_rankings(number_results)
================================================================================
Predict the Mode
________________________________________________________________________________
Score: 0.9461425602151853

================================================================================
Ridge Classifier
________________________________________________________________________________
Training:
RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, random_state=None, solver='auto', tol=0.01)
train time: 0.106s
test time: 0.001s
accuracy: 0.947
dimensionality: 5
density: 1.000000
confusion matrix:
[[   14     0   752     0]
 [    0     0    46     0]
 [    4     0 15473     0]
 [    7     0    62     0]]

================================================================================
Perceptron
________________________________________________________________________________
Training:
Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=1.0, fit_intercept=True, max_iter=1000, n_iter_no_change=5, n_jobs=-1, penalty=None, random_state=0, shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False)
train time: 0.124s
test time: 0.001s
accuracy: 0.947
dimensionality: 5
density: 1.000000
confusion matrix:
[[   14     0   752     0]
 [    0     0    46     0]
 [    0     0 15477     0]
 [    6     0    63     0]]

================================================================================
Neural Network
________________________________________________________________________________
Training:
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9, beta_2=0.999, early_stopping=True, epsilon=1e-08, hidden_layer_sizes=(100,), learning_rate='constant', learning_rate_init=0.001, max_fun=15000, max_iter=1000, momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5, random_state=None, shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1, verbose=False, warm_start=False)
train time: 1.418s
test time: 0.045s
accuracy: 0.947
confusion matrix:
[[   14     0   752     0]
 [    0     0    46     0]
 [    0     0 15477     0]
 [    6     0    63     0]]

================================================================================
Passive-Aggressive
________________________________________________________________________________
Training:
PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None, early_stopping=False, fit_intercept=True, loss='hinge', max_iter=10000, n_iter_no_change=5, n_jobs=-1, random_state=None, shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False)
train time: 0.122s
test time: 0.001s
accuracy: 0.947
dimensionality: 5
density: 1.000000
confusion matrix:
[[   14     0   752     0]
 [    0     0    46     0]
 [    0     0 15477     0]
 [    6     0    63     0]]

================================================================================
kNN
________________________________________________________________________________
Training:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=-1, n_neighbors=5, p=2, weights='uniform')
train time: 0.079s
test time: 0.825s
accuracy: 0.944
confusion matrix:
[[   35     0   731     0]
 [    0    16    30     0]
 [   72    15 15390     0]
 [    6     0    63     0]]

================================================================================
Random forest
________________________________________________________________________________
Training:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=1000, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1, oob_score=False, random_state=None, verbose=0, warm_start=False)
train time: 0.526s
test time: 0.107s
accuracy: 0.944
confusion matrix:
[[   54     0   712     0]
 [    0    20    26     0]
 [   94    10 15370     3]
 [    4     0    62     3]]

================================================================================
Liblinear with L2 penalty
________________________________________________________________________________
Training:
LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=10000, multi_class='ovr', penalty='l2', random_state=None, tol=0.001, verbose=0)
train time: 0.050s
test time: 0.000s
accuracy: 0.947
dimensionality: 5
density: 1.000000
confusion matrix:
[[   14     0   752     0]
 [    0     0    46     0]
 [    0     0 15477     0]
 [    6     0    63     0]]

================================================================================
Ordinary SGD with L2 penalty
________________________________________________________________________________
Training:
SGDClassifier(alpha=0.0001, average=False, class_weight=None, early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=1000, n_iter_no_change=5, n_jobs=-1, penalty='l2', power_t=0.5, random_state=None, shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False)
train time: 0.323s
test time: 0.003s
accuracy: 0.947
dimensionality: 5
density: 1.000000
confusion matrix:
[[   14     0   752     0]
 [    0     0    46     0]
 [    0     0 15477     0]
 [    6     0    63     0]]

================================================================================
Liblinear with L1 penalty
________________________________________________________________________________
Training:
LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=10000, multi_class='ovr', penalty='l1', random_state=None, tol=0.001, verbose=0)
train time: 10.301s
test time: 0.003s
accuracy: 0.946
dimensionality: 5
density: 0.600000
confusion matrix:
[[    0     0   766     0]
 [    0     0    46     0]
 [    6     0 15471     0]
 [    1     0    68     0]]

================================================================================
Ordinary SGD with L1 penalty
________________________________________________________________________________
Training:
SGDClassifier(alpha=0.0001, average=False, class_weight=None, early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=1000, n_iter_no_change=5, n_jobs=-1, penalty='l1', power_t=0.5, random_state=None, shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False)
train time: 0.525s
test time: 0.002s
accuracy: 0.947
dimensionality: 5
density: 0.900000
confusion matrix:
[[   14     0   752     0]
 [    0     0    46     0]
 [    0     3 15474     0]
 [    6     0    63     0]]

================================================================================
SGD with Elastic Net penalty
________________________________________________________________________________
Training:
SGDClassifier(alpha=0.0001, average=False, class_weight=None, early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=1000, n_iter_no_change=5, n_jobs=-1, penalty='elasticnet', power_t=0.5, random_state=None, shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False)
train time: 0.625s
test time: 0.003s
accuracy: 0.947
dimensionality: 5
density: 1.000000
confusion matrix:
[[   14     0   752     0]
 [    0     0    46     0]
 [    0     0 15477     0]
 [    6     0    63     0]]

================================================================================
LinearSVC with L1-based feature selection
________________________________________________________________________________
Training:
Pipeline(memory=None, steps=[('feature_selection', SelectFromModel(estimator=LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class='ovr', penalty='l1', random_state=None, tol=0.001, verbose=0), max_features=None, norm_order=1, prefit=False, threshold=None)), ('classification', LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0))], verbose=False)
train time: 5.417s
test time: 0.005s
accuracy: 0.946
confusion matrix:
[[    0     0   766     0]
 [    0     0    46     0]
 [    0     3 15474     0]
 [    0     1    68     0]]

================================================================================
Voting classifier with all previous estimators
________________________________________________________________________________
Training:
VotingClassifier(estimators=[('Ridge Classifier', RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, random_state=None, solver='auto', tol=0.01)), ('Perceptron', Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=1.0, fit_intercept=True, max_iter=1000, n_iter_no_change=5, n_jobs=-1, penalt... verbose=0), max_features=None, norm_order=1, prefit=False, threshold=None)), ('classification', LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0))], verbose=False))], flatten_transform=True, n_jobs=None, voting='hard', weights=None)
train time: 21.122s
test time: 1.331s
accuracy: 0.947
confusion matrix:
[[   14     0   752     0]
 [    0     0    46     0]
 [    0     0 15477     0]
 [    6     0    63     0]]

================================================================================
Stacking classifier with all previous estimators
________________________________________________________________________________
Training:
StackingClassifier(cv=None, estimators=[('Ridge Classifier', RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, random_state=None, solver='auto', tol=0.01)), ('Perceptron', Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=1.0, fit_intercept=True, max_iter=1000, n_iter_no_change=5, n_jobs... norm_order=1, prefit=False, threshold=None)), ('classification', LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0))], verbose=False))], final_estimator=None, n_jobs=None, passthrough=False, stack_method='auto', verbose=0)
train time: 102.763s
test time: 0.276s
accuracy: 0.047
confusion matrix:
[[  766     0     0     0]
 [   46     0     0     0]
 [15477     0     0     0]
 [   69     0     0     0]]
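The stacking classifier above collapses to 0.047 accuracy by predicting the first class for every row. One possible contributor (an assumption, not verified against this dataset) is feature scale: integer timestamps are many orders of magnitude larger than loan amounts, which can destabilise linear and gradient-based models. The sketch below shows one alternative preprocessing: converting timestamps to float seconds since the Unix epoch so that missing values stay `NaN` for the imputer, then standardising. The helper `to_epoch_seconds`, the pipeline name `scaled_num_pipe`, and the toy frame are hypothetical stand-ins for the notebook's own data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

EPOCH = pd.Timestamp("1970-01-01", tz="UTC")


def to_epoch_seconds(timestamps):
    """Convert a tz-aware datetime Series to float seconds since the Unix epoch.

    Missing timestamps (NaT) come out as NaN, so a later imputer can see them.
    """
    return (timestamps - EPOCH).dt.total_seconds()


# Hypothetical toy frame standing in for loans_raw[number_features].
toy = pd.DataFrame({
    "POSTED_TIME": pd.to_datetime(
        ["2014-11-27 15:25:02+00:00", None, "2015-01-25 02:20:02+00:00"], utc=True
    ),
    "LOAN_AMOUNT": [500.0, 1200.0, np.nan],
})
toy["POSTED_TIME"] = to_epoch_seconds(toy["POSTED_TIME"])

# Impute missing values with the column mean, then rescale each column
# to zero mean and unit variance.
scaled_num_pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
toy_scaled = scaled_num_pipe.fit_transform(toy)
print(toy_scaled)
```

Whether scaling changes the rankings here would need to be checked by rerunning the benchmark; this is only a sketch of the preprocessing step.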
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import coo_matrix, hstack
# Note: TfidfVectorizer already produces TF-IDF weights; TfidfTransformer re-weights that output
description_pipe = make_pipeline(
    TfidfVectorizer(),
    TfidfTransformer()
)
X_desc = loans_raw[description_features].fillna("")
y = loans_raw["STATUS"]
X_train, X_test, y_train, y_test = train_test_split(X_desc, y, train_size=0.01, test_size=0.01)
# Vectorize each text column separately and stack the resulting sparse matrices side by side.
# The same pipeline is refit for every column, so each test transform must directly follow its fit.
def get_description_features(X_train, X_test, description_pipe):
    train_blocks, test_blocks = [], []
    for column in ["LOAN_USE", "DESCRIPTION", "DESCRIPTION_TRANSLATED"]:
        train_blocks.append(description_pipe.fit_transform(X_train[column]))
        test_blocks.append(description_pipe.transform(X_test[column]))
    return (hstack(train_blocks), hstack(test_blocks))
sparse_X_train, sparse_X_test = get_description_features(X_train, X_test, description_pipe)
description_results = rank_classifiers(sparse_X_train, sparse_X_test, y_train, y_test, 1000)
plot_rankings(description_results)
================================================================================
Predict the Mode
________________________________________________________________________________
Score: 0.9466316175571585

================================================================================
Ridge Classifier
________________________________________________________________________________
Training:
RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, random_state=None, solver='auto', tol=0.01)
train time: 2.500s
test time: 0.069s
accuracy: 0.945
dimensionality: 87527
density: 1.000000
confusion matrix:
[[   13     0   754     0]
 [    1     0    35     0]
 [   35     0 15450     0]
 [    0     0    70     0]]

================================================================================
Perceptron
________________________________________________________________________________
Training:
Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=1.0, fit_intercept=True, max_iter=1000, n_iter_no_change=5, n_jobs=-1, penalty=None, random_state=0, shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False)
train time: 0.358s
test time: 0.072s
accuracy: 0.928
dimensionality: 87527
density: 0.192006
confusion matrix:
[[   54     0   699    14]
 [    3     0    32     1]
 [  220     4 15126   135]
 [    0     0    70     0]]

================================================================================
Neural Network
________________________________________________________________________________
Training:
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9, beta_2=0.999, early_stopping=True, epsilon=1e-08, hidden_layer_sizes=(100,), learning_rate='constant', learning_rate_init=0.001, max_fun=15000, max_iter=1000, momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5, random_state=None, shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1, verbose=False, warm_start=False)
train time: 546.074s
test time: 0.903s
accuracy: 0.947
confusion matrix:
[[    0     0   767     0]
 [    0     0    36     0]
 [    0     0 15485     0]
 [    0     0    70     0]]

================================================================================
Passive-Aggressive
________________________________________________________________________________
Training:
PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None, early_stopping=False, fit_intercept=True, loss='hinge', max_iter=10000, n_iter_no_change=5, n_jobs=-1, random_state=None, shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False)
train time: 0.482s
test time: 0.070s
accuracy: 0.936
dimensionality: 87527
density: 0.502382
confusion matrix:
[[   59     0   708     0]
 [    3     0    33     0]
 [  238     1 15246     0]
 [    0     0    70     0]]

================================================================================
kNN
________________________________________________________________________________
Training:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=-1, n_neighbors=5, p=2, weights='uniform')
train time: 0.171s
test time: 13.992s
accuracy: 0.921
confusion matrix:
[[    1     0   746    20]
 [    0     0    35     1]
 [    6     0 15069   410]
 [    0     0    70     0]]

================================================================================
Random forest
________________________________________________________________________________
Training:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=1000, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1, oob_score=False, random_state=None, verbose=0, warm_start=False)
train time: 3.194s
test time: 0.473s
accuracy: 0.946
confusion matrix:
[[    3     0   764     0]
 [    0     0    36     0]
 [    7     0 15478     0]
 [    0     0    70     0]]

================================================================================
Liblinear with L2 penalty
________________________________________________________________________________
Training:
LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=10000, multi_class='ovr', penalty='l2', random_state=None, tol=0.001, verbose=0)
train time: 2.864s
test time: 0.057s
accuracy: 0.942
dimensionality: 87527
density: 1.000000
confusion matrix:
[[   39     0   728     0]
 [    2     0    34     0]
 [  114     0 15371     0]
 [    0     0    70     0]]

================================================================================
Ordinary SGD with L2 penalty
________________________________________________________________________________
Training:
SGDClassifier(alpha=0.0001, average=False, class_weight=None, early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=1000, n_iter_no_change=5, n_jobs=-1, penalty='l2', power_t=0.5, random_state=None, shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False)
train time: 0.459s
test time: 0.069s
accuracy: 0.946
dimensionality: 87527
density: 0.307094
confusion matrix:
[[   11     0   756     0]
 [    1     0    35     0]
 [   21     0 15464     0]
 [    0     0    70     0]]

================================================================================
Liblinear with L1 penalty
________________________________________________________________________________
Training:
LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=10000, multi_class='ovr', penalty='l1', random_state=None, tol=0.001, verbose=0)
train time: 5.252s
test time: 0.057s
accuracy: 0.942
dimensionality: 87527
density: 0.011739
confusion matrix:
[[   19     0   748     0]
 [    1     0    35     0]
 [   92     7 15386     0]
 [    0     0    70     0]]

================================================================================
Ordinary SGD with L1 penalty
________________________________________________________________________________
Training:
SGDClassifier(alpha=0.0001, average=False, class_weight=None, early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=1000, n_iter_no_change=5, n_jobs=-1, penalty='l1', power_t=0.5, random_state=None, shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False)
train time: 0.569s
test time: 0.070s
accuracy: 0.947
dimensionality: 87527
density: 0.000654
confusion matrix:
[[    0     0   767     0]
 [    0     0    36     0]
 [    0     0 15485     0]
 [    0     0    70     0]]

================================================================================
SGD with Elastic Net penalty
________________________________________________________________________________
Training:
SGDClassifier(alpha=0.0001, average=False, class_weight=None, early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=1000, n_iter_no_change=5, n_jobs=-1, penalty='elasticnet', power_t=0.5, random_state=None, shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False)
train time: 0.675s
test time: 0.068s
accuracy: 0.947
dimensionality: 87527
density: 0.027354
confusion matrix:
[[    0     0   767     0]
 [    0     0    36     0]
 [    0     0 15485     0]
 [    0     0    70     0]]

================================================================================
LinearSVC with L1-based feature selection
________________________________________________________________________________
Training:
Pipeline(memory=None, steps=[('feature_selection', SelectFromModel(estimator=LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class='ovr', penalty='l1', random_state=None, tol=0.001, verbose=0), max_features=None, norm_order=1, prefit=False, threshold=None)), ('classification', LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0))], verbose=False)
train time: 5.187s
test time: 0.056s
accuracy: 0.943
confusion matrix:
[[   30     0   737     0]
 [    1     0    35     0]
 [   97     0 15388     0]
 [    0     0    70     0]]

================================================================================
Voting classifier with all previous estimators
________________________________________________________________________________
Training:
VotingClassifier(estimators=[('Ridge Classifier', RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, random_state=None, solver='auto', tol=0.01)), ('Perceptron', Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=1.0, fit_intercept=True, max_iter=1000, n_iter_no_change=5, n_jobs=-1, penalt... verbose=0), max_features=None, norm_order=1, prefit=False, threshold=None)), ('classification', LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0))], verbose=False))], flatten_transform=True, n_jobs=None, voting='hard', weights=None)
train time: 571.932s
test time: 16.037s
accuracy: 0.946
confusion matrix:
[[   11     0   756     0]
 [    1     0    35     0]
 [   24     0 15461     0]
 [    0     0    70     0]]

================================================================================
Stacking classifier with all previous estimators
________________________________________________________________________________
Training:
StackingClassifier(cv=None, estimators=[('Ridge Classifier', RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, random_state=None, solver='auto', tol=0.01)), ('Perceptron', Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=1.0, fit_intercept=True, max_iter=1000, n_iter_no_change=5, n_jobs... norm_order=1, prefit=False, threshold=None)), ('classification', LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0))], verbose=False))], final_estimator=None, n_jobs=None, passthrough=False, stack_method='auto', verbose=0)
train time: 3007.199s
test time: 13.469s
accuracy: 0.946
confusion matrix:
[[    8     0   759     0]
 [    0     0    36     0]
 [   18     0 15467     0]
 [    0     0    70     0]]
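The text features above are built by refitting one pipeline per column and stacking the results by hand. As an alternative sketch (not the approach this notebook uses), scikit-learn's `make_column_transformer` can hold one independent `TfidfVectorizer` per text column and do the stacking itself. The toy frames and the names `TEXT_COLUMNS` and `text_transformer` are hypothetical; passing each column as a plain string makes the transformer receive a 1-D series of documents, which is what `TfidfVectorizer` expects.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer

TEXT_COLUMNS = ["LOAN_USE", "DESCRIPTION", "DESCRIPTION_TRANSLATED"]

# One independent vectorizer per text column; all other columns are dropped.
text_transformer = make_column_transformer(
    *[(TfidfVectorizer(), column) for column in TEXT_COLUMNS],
    remainder="drop",
)

# Hypothetical toy data standing in for the loan descriptions.
toy_train = pd.DataFrame({
    "LOAN_USE": ["to pay university tuition", "to buy seeds and fertilizer"],
    "DESCRIPTION": ["Gustavo es soltero y vive con sus padres", "Maria cultiva maiz"],
    "DESCRIPTION_TRANSLATED": [
        "Gustavo is single and lives with his parents",
        "Maria grows corn",
    ],
})
toy_test = pd.DataFrame({
    "LOAN_USE": ["to pay school fees"],
    "DESCRIPTION": ["Ana vive con sus padres"],
    "DESCRIPTION_TRANSLATED": ["Ana lives with her parents"],
})

# Fit the vocabularies on the training rows only, then reuse them for the test rows.
train_matrix = text_transformer.fit_transform(toy_train)
test_matrix = text_transformer.transform(toy_test)
print(train_matrix.shape, test_matrix.shape)
```

Because every column keeps its own fitted vectorizer, the fitted transformer can be reused later on new rows, which the single refit pipeline cannot do for the first two columns.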
from sklearn.compose import make_column_transformer
feature_transformer = make_column_transformer(
    (cat_pipe, categorical_features),
    (num_pipe, number_features),
    remainder="drop"
)
X = pd.concat([X_num, X_cat, X_desc], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.005, test_size=0.01)
sparse_X_train, sparse_X_test = get_description_features(X_train, X_test, description_pipe)
X_train = feature_transformer.fit_transform(X_train)
X_test = feature_transformer.transform(X_test)
sparse_X_train = hstack([sparse_X_train, X_train])
sparse_X_test = hstack([sparse_X_test, X_test])
# Silence NumPy floating-point warnings (overflow, divide by zero, etc.) during the benchmark
np.seterr(all='ignore')
description_results = rank_classifiers(sparse_X_train, sparse_X_test, y_train, y_test, 100)
plot_rankings(description_results)
================================================================================
Predict the Mode
________________________________________________________________________________
Score: 0.9463259567184252

================================================================================
Ridge Classifier
________________________________________________________________________________
Training:
RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, random_state=None, solver='auto', tol=0.01)
train time: 3363.492s
test time: 0.067s
accuracy: 0.047
dimensionality: 246749
density: 1.000000
confusion matrix:
[[  763     0     0     0]
 [   39     0     0     0]
 [15480     0     0     0]
 [   76     0     0     0]]

================================================================================
Perceptron
________________________________________________________________________________
Training:
Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=1.0, fit_intercept=True, max_iter=100, n_iter_no_change=5, n_jobs=-1, penalty=None, random_state=0, shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False)
train time: 0.288s
test time: 0.081s
accuracy: 0.947
dimensionality: 246749
density: 0.111970
confusion matrix:
[[    9     0   754     0]
 [    0     0    39     0]
 [    2     0 15478     0]
 [    2     0    74     0]]

================================================================================
Neural Network
________________________________________________________________________________
Training:
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9, beta_2=0.999, early_stopping=True, epsilon=1e-08, hidden_layer_sizes=(100,), learning_rate='constant', learning_rate_init=0.001, max_fun=15000, max_iter=100, momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5, random_state=None, shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1, verbose=False, warm_start=False)
train time: 881.287s
test time: 1.271s
accuracy: 0.946
confusion matrix:
[[    0     0   754     9]
 [    0     0    39     0]
 [    0     0 15478     2]
 [    0     0    74     2]]

================================================================================
Passive-Aggressive
________________________________________________________________________________
Training:
PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None, early_stopping=False, fit_intercept=True, loss='hinge', max_iter=1000, n_iter_no_change=5, n_jobs=-1, random_state=None, shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False)
train time: 0.199s
test time: 0.085s
accuracy: 0.947
dimensionality: 246749
density: 0.141683
confusion matrix:
[[    9     0   754     0]
 [    0     0    39     0]
 [    2     0 15478     0]
 [    2     0    74     0]]

================================================================================
kNN
________________________________________________________________________________
Training:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=-1, n_neighbors=5, p=2, weights='uniform')
train time: 0.103s
test time: 8.126s
accuracy: 0.944
confusion matrix:
[[   19     0   741     3]
 [    0     3    36     0]
 [   45     9 15426     0]
 [    0     0    74     2]]

================================================================================
Random forest
________________________________________________________________________________
Training:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=100, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1, oob_score=False, random_state=None, verbose=0, warm_start=False)
train time: 2.957s
test time: 0.494s
accuracy: 0.946
confusion matrix:
[[    0     0   763     0]
 [    0     0    39     0]
 [    3     0 15477     0]
 [    0     0    76     0]]

================================================================================
Liblinear with L2 penalty
________________________________________________________________________________
Training:
LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class='ovr', penalty='l2', random_state=None, tol=0.001, verbose=0)
train time: 0.523s
test time: 0.068s
accuracy: 0.947
dimensionality: 246749
density: 0.268694
confusion matrix:
[[    9     0   754     0]
 [    0     0    39     0]
 [    2     0 15478     0]
 [    2     0    74     0]]

================================================================================
Ordinary SGD with L2 penalty
________________________________________________________________________________
Training:
SGDClassifier(alpha=0.0001, average=False, class_weight=None, early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=100, n_iter_no_change=5, n_jobs=-1, penalty='l2', power_t=0.5, random_state=None, shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False)
train time: 0.791s
test time: 0.082s
accuracy: 0.947
dimensionality: 246749
density: 0.187064
confusion matrix:
[[    9     0   754     0]
 [    0     0    39     0]
 [    2     0 15478     0]
 [    2     0    74     0]]

================================================================================
Liblinear with L1 penalty
________________________________________________________________________________
Training:
LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class='ovr', penalty='l1', random_state=None, tol=0.001, verbose=0)
train time: 114.357s
test time: 0.058s
accuracy: 0.941
dimensionality: 246749
density: 0.002444
confusion matrix:
[[   44     0   719     0]
 [    1     0    38     0]
 [  142     4 15332     2]
 [    0     0    53    23]]

================================================================================
Ordinary SGD with L1 penalty
________________________________________________________________________________
Training:
SGDClassifier(alpha=0.0001, average=False, class_weight=None, early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=100, n_iter_no_change=5, n_jobs=-1, penalty='l1', power_t=0.5, random_state=None, shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False)
train time: 1.808s
test time: 0.072s
accuracy: 0.946
dimensionality: 246749
density: 0.002931
confusion matrix:
[[    9     0   754     0]
 [    0     0    39     0]
 [    2     6 15470     2]
 [    2     1    73     0]]

================================================================================
SGD with Elastic Net penalty
________________________________________________________________________________
Training:
SGDClassifier(alpha=0.0001, average=False, class_weight=None, early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True, l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=100, n_iter_no_change=5, n_jobs=-1, penalty='elasticnet', power_t=0.5, random_state=None, shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0, warm_start=False)
train time: 2.404s
test time: 0.065s
accuracy: 0.947
dimensionality: 246749
density: 0.020402
confusion matrix:
[[    9     0   754     0]
 [    0     0    39     0]
 [    2     0 15478     0]
 [    2     0    74     0]]

================================================================================
LinearSVC with L1-based feature selection
________________________________________________________________________________
Training:
Pipeline(memory=None, steps=[('feature_selection', SelectFromModel(estimator=LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class='ovr', penalty='l1', random_state=None, tol=0.001, verbose=0), max_features=None, norm_order=1, prefit=False, threshold=None)), ('classification', LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0))], verbose=False)
train
time: 117.112s test time: 0.060s accuracy: 0.946 confusion matrix: [[ 0 0 763 0] [ 0 0 39 0] [ 0 0 15480 0] [ 0 0 76 0]] ================================================================================ Voting classifier with all previous estimators ________________________________________________________________________________ Training: VotingClassifier(estimators=[('Ridge Classifier', RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, random_state=None, solver='auto', tol=0.01)), ('Perceptron', Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=1.0, fit_intercept=True, max_iter=100, n_iter_no_change=5, n_jobs=-1, penalty... verbose=0), max_features=None, norm_order=1, prefit=False, threshold=None)), ('classification', LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0))], verbose=False))], flatten_transform=True, n_jobs=None, voting='hard', weights=None) train time: 4395.905s test time: 9.928s accuracy: 0.947 confusion matrix: [[ 9 0 754 0] [ 0 0 39 0] [ 2 0 15478 0] [ 2 0 74 0]] ================================================================================ Stacking classifier with all previous estimators ________________________________________________________________________________ Training: StackingClassifier(cv=None, estimators=[('Ridge Classifier', RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True, max_iter=None, normalize=False, random_state=None, solver='auto', tol=0.01)), ('Perceptron', Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=1.0, fit_intercept=True, max_iter=100, n_iter_no_change=5, n_jobs=... 
norm_order=1, prefit=False, threshold=None)), ('classification', LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class='ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0))], verbose=False))], final_estimator=None, n_jobs=None, passthrough=False, stack_method='auto', verbose=0)
Unfortunately, although using all features gave slight improvements, none of the models predicted meaningfully better than the baseline of always guessing that a loan will be fully funded: the best accuracy was 0.947, against 0.946 for predicting the mode. Better-tuned models, more of the dataset, better feature selection, or more training iterations might help, but we found no evidence of this during development.
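To make the comparison with the mode baseline concrete, the sketch below recomputes accuracy directly from three of the confusion matrices printed above. It also reports balanced accuracy (the mean of per-class recall), which is an added metric that the original analysis did not use; it is included here only as one possible way to see how much of each score comes from the dominant class. The helper names `accuracy`, `mode_baseline`, and `balanced_accuracy` are illustrative and not part of the notebook, and the matrices are assumed to follow the scikit-learn convention of rows as true classes and columns as predicted classes.

```python
# Confusion matrices copied from the benchmark output above.
# Assumed layout: rows = true class, columns = predicted class.
confusion = {
    "Perceptron": [[9, 0, 754, 0], [0, 0, 39, 0], [2, 0, 15478, 0], [2, 0, 74, 0]],
    "Liblinear with L1 penalty": [[44, 0, 719, 0], [1, 0, 38, 0], [142, 4, 15332, 2], [0, 0, 53, 23]],
    "Random forest": [[0, 0, 763, 0], [0, 0, 39, 0], [3, 0, 15477, 0], [0, 0, 76, 0]],
}


def accuracy(cm):
    """Fraction of test rows on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def mode_baseline(cm):
    """Accuracy obtained by always predicting the most common true class."""
    support = [sum(row) for row in cm]  # number of test rows per true class
    return max(support) / sum(support)


def balanced_accuracy(cm):
    """Mean per-class recall, so every class counts equally (added metric)."""
    recalls = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]
    return sum(recalls) / len(recalls)


for name, cm in confusion.items():
    print(
        f"{name}: accuracy={accuracy(cm):.4f}, "
        f"mode baseline={mode_baseline(cm):.4f}, "
        f"balanced accuracy={balanced_accuracy(cm):.4f}"
    )
```

With four classes, always predicting the mode gives a balanced accuracy of exactly 0.25, so that value is the natural reference point when reading the last column.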
Ultimately, it may simply not be possible to predict the outcome from the information available when a loan is first posted; the outcome may have more to do with which loans Kiva presents to its users.
Kiva has grown significantly over the past 15 years. Although the majority of loans still go to poorer countries, the average user living in the US has many options to help domestically if they so choose. Relative to ordinary loans, the default rates here are extremely low, implying that the loans tend to be successful. As shown in the section on the expected cost of lending, if one has any amount of disposable income, wishes to help others, and believes in the power of investing in small businesses, then these microloans can be a worthwhile charitable option to consider without needing to give up much in the long run.