Kiva is a non-profit organization that hosts a platform, kiva.org, to crowdfund loans to individuals and groups who otherwise may not have had access to capital. They provide a rich dataset of historical loans from 2006 to 2019.
In this notebook, we will first explore the dataset to glean insights and deeper understanding about the platform.
We will then consider the expected personal loss of making a single loan, and then withdrawing any repayment as soon as possible.
Finally, we will attempt to use machine learning models to predict which loans will get funded, and which will get fully repaid.
import pandas as pd
import matplotlib.colors as mplc
import matplotlib.patches as patches
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from scipy import stats
import numpy as np
import seaborn as sns
%matplotlib inline
# activate plot theme
import qeds
qeds.themes.mpl_style();
# Source: https://www.kiva.org/build/data-snapshots
loans_raw = pd.read_csv("datasets/loans_big.csv", parse_dates=["POSTED_TIME","PLANNED_EXPIRATION_TIME","DISBURSE_TIME","RAISED_TIME",])
The above dataset was provided by Kiva as a snapshot of their historical loans. Below is an example row:
pd.set_option('display.max_columns', 40)
loans_raw.dropna().head(1)
min_date = loans_raw["POSTED_TIME"].min()
max_date = loans_raw["POSTED_TIME"].max()
print(f"The dataset contains {len(loans_raw)} loans in the date range {min_date} to {max_date}.")
We will first explore the quality of the data and see if there will be any issues with missing data.
print(loans_raw.isnull().sum())
There are some missing cells. However, the core data relating to the loans themselves are complete. Although cells relating to currencies are listed, manually investigating a sample revealed that all loan amounts are in USD.
We will proceed by asking some basic questions about the loans.
plt.rcParams['figure.figsize'] = (10, 5)
What are these loans going towards?
sectors = loans_raw["SECTOR_NAME"].value_counts().sort_values()
ax = sectors.plot.barh()
ax.set_title("Loans by Sector")
What about if we get slightly more granular?
activities = loans_raw["ACTIVITY_NAME"].value_counts().nlargest(15).sort_values()
ax = activities.plot.barh()
ax.set_title("Loans by Activity")
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text
additional_stopwords = ['buy', 'purchase', 'sell', 'pay', 'additional', 'items', 'materials', 'stock', 'like', 'goods',
'increase', 'new', 'improve', 'build', 'make', 'products', 'supplies']
stop_words = text.ENGLISH_STOP_WORDS.union(additional_stopwords)
vectorizer = CountVectorizer(stop_words=stop_words)
words = vectorizer.fit_transform(loans_raw["LOAN_USE"].dropna())
words_sum = words.sum(axis=0)
words_freq = [(word, words_sum[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
# The double sorting is because barh puts the first element at the bottom
top_20_words = sorted(words_freq[0:20], key=lambda x:x[1])
y = [word[0] for word in top_20_words]
width = [freq[1] for freq in top_20_words]
fig, ax = plt.subplots()
ax.set_title("Most common specific description words")
ax.barh(y, width=width)
As you can see, the most common reasons for loans are to help small businesses, such as farming and local retail.
What does the distribution of loan amounts look like?
amounts = loans_raw['LOAN_AMOUNT']
# Remove outliers to be able to view amounts at a reasonable scale.
# These outliers are quite rare, but some have extremely large amounts, skewing the image.
def filter_outliers(data, crit=4):
"""Removes data points with a zscore magnitude greater than the critical value crit."""
z = np.abs(stats.zscore(data))
return data[(z < crit)]
filtered_amounts = filter_outliers(amounts)
ax = sns.distplot(filtered_amounts)
ax.set_xticks(np.arange(0, np.max(filtered_amounts) + 1, 500))
ax.set_xlabel("Loan Amounts in USD")
We can see that the most common loan amount is 250 USD, though multiples of 500 are generally more popular than surrounding amounts.
How many people tend to contribute to each loan?
filtered_lenders = filter_outliers(loans_raw["NUM_LENDERS_TOTAL"])
ax = sns.distplot(filtered_lenders)
ax.set_xticks(np.arange(0, 141, 10))
ax.set_xlabel("Number of lenders per loan")
print(filtered_lenders.value_counts().head())
We see that the distribution is centered around 8 lenders, with a long right tail. However, we can also see that a significant number of loans were funded by only a single person.
The loans are not guaranteed to get fully funded. How many do not?
loan_diffs = amounts - loans_raw["FUNDED_AMOUNT"]
funding_status = ["funded" if diff == 0 else "unfunded" for diff in loan_diffs]
loan_diffs = loan_diffs[loan_diffs != 0]
percent_unfunded = ((loan_diffs.count() / amounts.count()) * 100).round(1)
print(f"{percent_unfunded}% of loans do not get fully funded.")
unfunded_asking_amounts = amounts[loan_diffs.index]
unfunded_ratios = loan_diffs / unfunded_asking_amounts
how_much_funded = (unfunded_ratios.mean() * 100).round(1)
print(f"Of those, the mean amount funded is {how_much_funded}% of the ask.")
sns.countplot(funding_status)
loans_raw["TIME_TO_FUND"] = loans_raw["RAISED_TIME"] - loans_raw["POSTED_TIME"]
loans_raw["DAYS_TO_FUND"] = [time.days for time in loans_raw["TIME_TO_FUND"]]
days_to_fund = filter_outliers(loans_raw["DAYS_TO_FUND"].dropna())
ax = sns.distplot(days_to_fund)
ax.set_xticks(np.arange(0, 71, 5))
plt.show()
Is there a relationship between the loan amount and the number of days it takes to fund the loan?
# Need to filter outliers but don't want to filter by different amounts
amounts_and_days = loans_raw[["LOAN_AMOUNT", "DAYS_TO_FUND"]]
amounts_and_days = amounts_and_days[(amounts_and_days["LOAN_AMOUNT"].isin(filtered_amounts))
& (amounts_and_days["DAYS_TO_FUND"].isin(days_to_fund))]
sample = amounts_and_days.sample(250)
sns.lmplot(x="LOAN_AMOUNT", y="DAYS_TO_FUND", data=sample, fit_reg=True)
There does not seem to be a clear relationship.
Once the loans are distributed, there is no guarantee that they will be repaid. How many do not?
Unfortunately, there was no information to answer this question in the original dataset. We instead turn to a previously supplied dataset of 5000 randomly sampled loans with their repayment details, taken from https://stat.duke.edu/resources/datasets/kiva-loans
# Source: https://stat.duke.edu/resources/datasets/kiva-loans
repayments = pd.read_excel("datasets/loan_repayment_samples.xlsx")
# basket_amount was all n/a, video.youtube_id was not used
repayments = repayments.drop(["basket_amount", "video.youtube_id"], axis=1)
# This dataset has a row for each individual payment; aggregate these into a single row per loan
grouped_repayments = repayments.groupby("id", as_index=False).agg(lambda x: x.tolist())
# Get the payment status of the loan
grouped_repayments["status"] = grouped_repayments["status"].agg(lambda x: "defaulted" if "defaulted" in x else "paid")
status_counts = grouped_repayments["status"].value_counts()
default_ratio = ((status_counts["defaulted"] / len(grouped_repayments)) * 100).round(1)
print(f"{default_ratio} % of loans in this sample have defaulted.")
fig, ax = plt.subplots()
ax.pie(x=status_counts, labels=["paid", "defaulted"])
plt.show()
What is the gender ratio of those receiving loans?
genders = loans_raw["BORROWER_GENDERS"].dropna()
# The gender entries include all recipients as a comma separated string, so a simple value_counts will not work.
females = 0
males = 0
for entry in genders:
recipients = entry.split(", ")
for gender in recipients:
if gender == "female":
females += 1
else:
males += 1
fig, ax = plt.subplots()
ax.pie(x=[males, females], labels=["males", "females"])
female_ratio = round((females / (females + males)) * 100, 1)
male_ratio = round(((males / (females + males)) * 100), 1)
print(f"Loan recipients are {female_ratio}% female and {male_ratio}% male.")
What about the lenders themselves? What can we determine about them?
lenders_raw = pd.read_csv("datasets/lenders.csv")
lenders_raw.dropna().head(1)
map_loans_to_lenders = pd.read_csv("datasets/loans_lenders.csv")
loans_raw = pd.merge(loans_raw, map_loans_to_lenders, on="LOAN_ID")
What are the nationalities of the lenders?
import squarify
lenders_by_country = lenders_raw["COUNTRY_CODE"].value_counts()
ax = squarify.plot(
sizes=lenders_by_country.values[0:20],
label=lenders_by_country.index[0:20],
alpha=.7
)
Evidently the majority of the users live in the US. Which states in particular?
state_mappings = pd.read_csv("datasets/states.csv")
state_pops = pd.read_csv("datasets/us_state_populations.csv")
import plotly.express as px
state_values = lenders_raw["STATE"].value_counts()
num_lenders_per_state = {}
for state in state_values.index[0:200]: # ignore the irrelevant state entries from other countries
# Some entries have state as its full name, others its code; combine the two
if state in state_mappings["State"].values:
code = state_mappings[state_mappings["State"] == state]["Abbreviation"].iat[0]
num_lenders_per_state[code] = num_lenders_per_state.get(code, 0) + state_values[state]
elif state in state_mappings["Abbreviation"].values:
num_lenders_per_state[state] = num_lenders_per_state.get(state, 0) + state_values[state]
us_lenders_by_state = pd.Series(num_lenders_per_state).sort_values(ascending=False)
print(us_lenders_by_state.head())
fig = px.choropleth(
locations=us_lenders_by_state.index,
color=us_lenders_by_state.values,
locationmode="USA-states",
title="Number of lenders per US state",
scope="usa",
color_continuous_scale=px.colors.sequential.Plasma
)
fig.show()