A look into kiva.org

Kiva is a non-profit organization that hosts a platform, kiva.org, to crowdfund loans to individuals and groups who otherwise may not have had access to capital. They provide a rich dataset of historical loans from 2006 to 2019.

In this notebook, we will first explore the dataset to glean insights and deeper understanding about the platform.

We will then consider the expected personal loss of making a single loan, and then withdrawing any repayment as soon as possible.

Finally, we will attempt to use machine learning models to predict which loans will get funded, and which will get fully repaid.

In [1]:
import pandas as pd
import matplotlib.colors as mplc
import matplotlib.patches as patches
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from scipy import stats
import numpy as np
import seaborn as sns
In [2]:
%matplotlib inline
# activate plot theme
import qeds
qeds.themes.mpl_style();

Data Exploration

In [3]:
# Source: https://www.kiva.org/build/data-snapshots
loans_raw = pd.read_csv("datasets/loans_big.csv", parse_dates=["POSTED_TIME","PLANNED_EXPIRATION_TIME","DISBURSE_TIME","RAISED_TIME",])

The above dataset was provided by Kiva as a snapshot of their historical loans. Below is an example row:

In [4]:
pd.set_option('display.max_columns', 40)
loans_raw.dropna().head(1)
Out[4]:
LOAN_ID LOAN_NAME ORIGINAL_LANGUAGE DESCRIPTION DESCRIPTION_TRANSLATED FUNDED_AMOUNT LOAN_AMOUNT STATUS IMAGE_ID VIDEO_ID ACTIVITY_NAME SECTOR_NAME LOAN_USE COUNTRY_CODE COUNTRY_NAME TOWN_NAME CURRENCY_POLICY CURRENCY_EXCHANGE_COVERAGE_RATE CURRENCY PARTNER_ID POSTED_TIME PLANNED_EXPIRATION_TIME DISBURSE_TIME RAISED_TIME LENDER_TERM NUM_LENDERS_TOTAL NUM_JOURNAL_ENTRIES NUM_BULK_ENTRIES TAGS BORROWER_NAMES BORROWER_GENDERS BORROWER_PICTURED REPAYMENT_INTERVAL DISTRIBUTION_MODEL
77657 807094 GUSTAVO Spanish Gustavo es soltero y vive con sus padres en La... Gustavo is single and lives with his parents i... 500.0 500.0 funded 1745738.0 3002.0 Higher education costs Education to pay university tuition BO Bolivia La Paz shared 0.1 USD 48.0 2014-11-27 15:25:02+00:00 2015-01-25 02:20:02+00:00 2014-11-21 08:00:00+00:00 2014-12-21 15:17:44+00:00 20.0 17 2 1 user_favorite, user_favorite, user_favorite GUSTAVO male true monthly field_partner
In [5]:
min_date = loans_raw["POSTED_TIME"].min()
max_date = loans_raw["POSTED_TIME"].max()
print(f"The dataset contains {len(loans_raw)} loans in the date range {min_date} to {max_date}.")
The dataset contains 1682790 loans in the date range 2006-04-16 07:10:50+00:00 to 2019-02-25 04:12:27+00:00.

We will first explore the quality of the data and see if there will be any issues with missing data.

In [6]:
print(loans_raw.isnull().sum())
LOAN_ID                                  0
LOAN_NAME                            48555
ORIGINAL_LANGUAGE                    44209
DESCRIPTION                          44244
DESCRIPTION_TRANSLATED              453635
FUNDED_AMOUNT                            0
LOAN_AMOUNT                              0
STATUS                                   0
IMAGE_ID                             44209
VIDEO_ID                           1681943
ACTIVITY_NAME                            0
SECTOR_NAME                              0
LOAN_USE                             44232
COUNTRY_CODE                            29
COUNTRY_NAME                             0
TOWN_NAME                           163515
CURRENCY_POLICY                          0
CURRENCY_EXCHANGE_COVERAGE_RATE     337326
CURRENCY                                 0
PARTNER_ID                           18325
POSTED_TIME                              0
PLANNED_EXPIRATION_TIME             371834
DISBURSE_TIME                         3189
RAISED_TIME                          85150
LENDER_TERM                             24
NUM_LENDERS_TOTAL                        0
NUM_JOURNAL_ENTRIES                      0
NUM_BULK_ENTRIES                         0
TAGS                                831171
BORROWER_NAMES                       48555
BORROWER_GENDERS                     44209
BORROWER_PICTURED                    44209
REPAYMENT_INTERVAL                       0
DISTRIBUTION_MODEL                       0
dtype: int64

There are some missing cells. However, the core data relating to the loans themselves are complete. Although cells relating to currencies are listed, manually investigating a sample revealed that all loan amounts are in USD.

We will proceed by asking some basic questions about the loans.

In [7]:
plt.rcParams['figure.figsize'] = (10, 5)

What are these loans going towards?

In [8]:
sectors = loans_raw["SECTOR_NAME"].value_counts().sort_values()
ax = sectors.plot.barh()
ax.set_title("Loans by Sector")
Out[8]:
Text(0.5, 1.0, 'Loans by Sector')

What about if we get slightly more granular?

In [9]:
activities = loans_raw["ACTIVITY_NAME"].value_counts().nlargest(15).sort_values()
ax = activities.plot.barh()
ax.set_title("Loans by Activity")
Out[9]:
Text(0.5, 1.0, 'Loans by Activity')
In [10]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text

additional_stopwords = ['buy', 'purchase', 'sell', 'pay', 'additional', 'items', 'materials', 'stock', 'like', 'goods',
                        'increase', 'new', 'improve', 'build', 'make', 'products', 'supplies']

stop_words = text.ENGLISH_STOP_WORDS.union(additional_stopwords)

vectorizer = CountVectorizer(stop_words=stop_words)
words = vectorizer.fit_transform(loans_raw["LOAN_USE"].dropna())
words_sum = words.sum(axis=0)

words_freq = [(word, words_sum[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)

# The double sorting is because barh puts the first element at the bottom
top_20_words = sorted(words_freq[0:20], key=lambda x:x[1])

y = [word[0] for word in top_20_words]
width = [freq[1] for freq in top_20_words]

fig, ax = plt.subplots()
ax.set_title("Most common specific description words")
ax.barh(y, width=width)
Out[10]:
<BarContainer object of 20 artists>

As you can see, the most common reasons for loans are to help small businesses, such as farming and local retail.

What does the distribution of loan amounts look like?

In [11]:
amounts = loans_raw['LOAN_AMOUNT']

# Remove outliers to be able to view amounts at a reasonable scale.
# These outliers are quite rare, but some have extremely large amounts, skewing the image.
def filter_outliers(data, crit=4):
    """Removes data points with a zscore magnitude greater than the critical value crit."""
    z = np.abs(stats.zscore(data))
    return data[(z < crit)]

filtered_amounts = filter_outliers(amounts)

ax = sns.distplot(filtered_amounts)
ax.set_xticks(np.arange(0, np.max(filtered_amounts) + 1, 500))
ax.set_xlabel("Loan Amounts in USD")
Out[11]:
Text(0.5, 0, 'Loan Amounts in USD')

We can see that the most common loan amount is 250 USD, though multiples of 500 are generally more popular than surrounding amounts.

How many people tend to contribute to each loan?

In [12]:
filtered_lenders = filter_outliers(loans_raw["NUM_LENDERS_TOTAL"])

ax = sns.distplot(filtered_lenders)
ax.set_xticks(np.arange(0, 141, 10))
ax.set_xlabel("Number of lenders per loan")

print(filtered_lenders.value_counts().head())
8    78933
1    72962
9    72353
7    72134
5    69761
Name: NUM_LENDERS_TOTAL, dtype: int64

We see that the distribution is centered around 8 lenders, with a long right tail. However, we can also see that a significant number of loans were funded by only a single person.

The loans are not guaranteed to get fully funded. How many do not?

In [13]:
loan_diffs = amounts - loans_raw["FUNDED_AMOUNT"]
funding_status = ["funded" if diff == 0 else "unfunded" for diff in loan_diffs]

loan_diffs = loan_diffs[loan_diffs != 0]
percent_unfunded = ((loan_diffs.count() / amounts.count()) * 100).round(1)
print(f"{percent_unfunded}% of loans do not get fully funded.")

unfunded_asking_amounts = amounts[loan_diffs.index]
unfunded_ratios = loan_diffs / unfunded_asking_amounts
how_much_funded = (unfunded_ratios.mean() * 100).round(1)
print(f"Of those, the mean amount funded is {how_much_funded}% of the ask.")

sns.countplot(funding_status)
5.1% of loans do not get fully funded.
Of those, the mean amount funded is 56.4% of the ask.
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x1da5508b748>
In [14]:
loans_raw["TIME_TO_FUND"] = loans_raw["RAISED_TIME"] - loans_raw["POSTED_TIME"]
loans_raw["DAYS_TO_FUND"] = [time.days for time in loans_raw["TIME_TO_FUND"]]
days_to_fund = filter_outliers(loans_raw["DAYS_TO_FUND"].dropna())
In [15]:
ax = sns.distplot(days_to_fund)
ax.set_xticks(np.arange(0, 71, 5))
plt.show()

Is there a relationship between the loan amount and the number of days it takes to fund the loan?

In [16]:
# Need to filter outliers but don't want to filter by different amounts
amounts_and_days = loans_raw[["LOAN_AMOUNT", "DAYS_TO_FUND"]]
amounts_and_days = amounts_and_days[(amounts_and_days["LOAN_AMOUNT"].isin(filtered_amounts))
                                    & (amounts_and_days["DAYS_TO_FUND"].isin(days_to_fund))]
In [17]:
sample = amounts_and_days.sample(250)
sns.lmplot(x="LOAN_AMOUNT", y="DAYS_TO_FUND", data=sample, fit_reg=True)
Out[17]:
<seaborn.axisgrid.FacetGrid at 0x1da53a875c8>

There does not seem to be a clear relationship.

Once the loans are distributed, there is no guarantee that they will be repaid. How many do not?

Unfortunately, there was no information to answer this question in the original dataset. We instead turn to a previously supplied dataset of 5000 randomly sampled loans with their repayment details, taken from https://stat.duke.edu/resources/datasets/kiva-loans

In [18]:
# Source: https://stat.duke.edu/resources/datasets/kiva-loans
repayments = pd.read_excel("datasets/loan_repayment_samples.xlsx")

# basket_amount was all n/a, video.youtube_id was not used
repayments = repayments.drop(["basket_amount", "video.youtube_id"], axis=1)
In [19]:
# This dataset has a row for each individual payment; aggregate these into a single row per loan
grouped_repayments = repayments.groupby("id", as_index=False).agg(lambda x: x.tolist())
In [20]:
# Get the payment status of the loan
grouped_repayments["status"] = grouped_repayments["status"].agg(lambda x: "defaulted" if "defaulted" in x else "paid")

status_counts = grouped_repayments["status"].value_counts()
default_ratio = ((status_counts["defaulted"] / len(grouped_repayments)) * 100).round(1)
print(f"{default_ratio} % of loans in this sample have defaulted.")

fig, ax = plt.subplots()
ax.pie(x=status_counts, labels=["paid", "defaulted"])
plt.show()
2.0 % of loans in this sample have defaulted.

What is the gender ratio of those receiving loans?

In [21]:
genders = loans_raw["BORROWER_GENDERS"].dropna()

# The gender entries include all recipients as a comma separated string, so a simple value_counts will not work.
females = 0
males = 0
for entry in genders:
    recipients = entry.split(", ")
    for gender in recipients:
        if gender == "female":
            females += 1
        else:
            males += 1
            
fig, ax = plt.subplots()
ax.pie(x=[males, females], labels=["males", "females"])

female_ratio = round((females / (females + males)) * 100, 1)
male_ratio = round(((males / (females + males)) * 100), 1)

print(f"Loan recipients are {female_ratio}% female and {male_ratio}% male.")
Loan recipients are 80.0% female and 20.0% male.

What about the lenders themselves? What can we determine about them?

In [22]:
lenders_raw = pd.read_csv("datasets/lenders.csv")
lenders_raw.dropna().head(1)
Out[22]:
PERMANENT_NAME DISPLAY_NAME MAIN_PIC_ID CITY STATE COUNTRY_CODE MEMBER_SINCE PERSONAL_URL OCCUPATION LOAN_BECAUSE OTHER_INFO LOAN_PURCHASE_NUM INVITED_BY NUM_INVITED
11476 barketing Erika Godwin 2899846.0 Ottawa Ontario CA 1528393535 www.barketing.co/ Business Owner I loan because I want to help fellow female en... I am co-founder of ProPet Software and owner o... 5 Bella 0
In [23]:
map_loans_to_lenders = pd.read_csv("datasets/loans_lenders.csv")
loans_raw = pd.merge(loans_raw, map_loans_to_lenders, on="LOAN_ID")

What are the nationalities of the lenders?

In [24]:
import squarify
In [25]:
lenders_by_country = lenders_raw["COUNTRY_CODE"].value_counts()
ax = squarify.plot(
    sizes=lenders_by_country.values[0:20],
    label=lenders_by_country.index[0:20],
    alpha=.7
)

Evidently the majority of the users live in the US. Which states in particular?

In [26]:
state_mappings = pd.read_csv("datasets/states.csv")
state_pops = pd.read_csv("datasets/us_state_populations.csv")
In [27]:
import plotly.express as px
In [28]:
state_values = lenders_raw["STATE"].value_counts()
num_lenders_per_state = {}
for state in state_values.index[0:200]: # ignore the irrelevant state entries from other countries
    # Some entries have state as its full name, others its code; combine the two
    if state in state_mappings["State"].values:
        code = state_mappings[state_mappings["State"] == state]["Abbreviation"].iat[0]
        num_lenders_per_state[code] = num_lenders_per_state.get(code, 0) + state_values[state]
    elif state in state_mappings["Abbreviation"].values:
        num_lenders_per_state[state] = num_lenders_per_state.get(state, 0) + state_values[state]

us_lenders_by_state = pd.Series(num_lenders_per_state).sort_values(ascending=False)
print(us_lenders_by_state.head())

fig = px.choropleth(
    locations=us_lenders_by_state.index,
    color=us_lenders_by_state.values,
    locationmode="USA-states",
    title="Number of lenders per US state",
    scope="usa",
    color_continuous_scale=px.colors.sequential.Plasma
)
fig.show()
CA    57270
NY    20708
TX    16410
WA    15102
IL    12441
dtype: int64