• Logistic regression, to determine which features of the data set contribute to someone paying off or defaulting on their loan in the United States.
• Using CartoDB to map the features of the data set and see which states stand out from the rest in terms of paying their loans back.
import pandas as pd
import numpy as np
from datetime import datetime
from matplotlib import pyplot as plt
%matplotlib inline
url = '/Users/olehdubno/Desktop/python_tests/LoanStats3b2.csv'
loan = pd.read_csv(url, low_memory = False)
Creating a separate set of features we will be cleaning and working with.
loan_2 = loan[['funded_amnt','emp_length','annual_inc','loan_status','home_ownership','addr_state','tax_liens','grade']]
loan_2.head()
|   | funded_amnt | emp_length | annual_inc | loan_status | home_ownership | addr_state | tax_liens | grade |
|---|---|---|---|---|---|---|---|---|
| 0 | 24000 | 10+ years | 100000 | Current | MORTGAGE | MI | 0 | B |
| 1 | 11100 | 10+ years | 90000 | Current | MORTGAGE | NY | 0 | C |
| 2 | 12000 | 3 years | 96500 | Current | MORTGAGE | TX | 0 | A |
| 3 | 15000 | 10+ years | 98000 | Fully Paid | RENT | NY | 0 | C |
| 4 | 27600 | 6 years | 73000 | Current | MORTGAGE | CO | 0 | D |
Cleaning:
• Convert "loan_status" to booleans
• Clean up "emp_length"
• Convert "grade" to integer values
• Drop N/A values from fields
• Account for outliers
Dropping N/A values (it's only 4 rows, so not very significant):
loan_2 = loan_2.dropna()
loan_2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 188123 entries, 0 to 188122
Data columns (total 8 columns):
funded_amnt       188123 non-null float64
emp_length        188123 non-null object
annual_inc        188123 non-null float64
loan_status       188123 non-null object
home_ownership    188123 non-null object
addr_state        188123 non-null object
tax_liens         188123 non-null float64
grade             188123 non-null object
dtypes: float64(3), object(5)
Let's plot Annual Income against Funded Amount.
plt.figure(figsize=(10,5))
plt.scatter(loan_2['annual_inc'], loan_2['funded_amnt'])
plt.title("Plotting Annual Income against Funded Amount")
plt.ylabel('Funded Amount')
plt.xlabel('Annual Income')
plt.show()
loan_2.annual_inc.hist(figsize=(10,5))
plt.ylabel('Number of Loans')
plt.xlabel('Annual Income')
There are several outliers to account for. Let's limit the data to annual incomes below $200,000.
loan_2 = loan_2[loan_2['annual_inc']<200000]
loan_2.annual_inc.hist(figsize=(10,5))
plt.ylabel('Number of Loans')
plt.xlabel('Annual Income')
Much better!
Let's take a quick look at the funded amount. We will plot funded amount both from the unfiltered data frame and the filtered data frame (annual income < $200,000).
loan.funded_amnt.hist()
plt.title("Loans with income maximum of $8,000,000.00")
plt.xlabel("Funded Amount")
plt.show()
loan_2.funded_amnt.hist()
plt.title("Loans with income maximum of $200,000.00")
plt.xlabel("Funded Amount")
plt.show()
There's no significant difference in Funded Amount when we remove the Annual Income outliers.
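As a quick, hedged sanity check on that claim, we can compare summary statistics of Funded Amount before and after the income filter:
print loan.funded_amnt.describe()    # unfiltered
print loan_2.funded_amnt.describe()  # annual income < $200,000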
Our feature "loan_status" has seven unique values; for logistic regression we need two.
Below is a bar chart highlighting each of the seven values. We'll remove "Current", since our goal is to focus on who did or didn't pay their loans. "Fully Paid" will remain as is, and the rest of the statuses will be grouped as "Unpaid"; after all, that's pretty much what they are.
loan_2.loan_status.value_counts().plot(kind='bar',alpha=.30)
#cleaning "loan_status"
loan_2['loan_status_clean'] = loan_2['loan_status'].map({'Current': 2, 'Fully Paid': 1, 'Charged Off':0, 'Late(31-120 days)':0, 'In Grace Period': 0, 'Late(16-30 days)': 0, 'Default': 0})
loan_2 = loan_2[loan_2.loan_status_clean != 2]
loan_2["loan_status_clean"] = loan_2["loan_status_clean"].apply(lambda loan_status_clean: 0 if loan_status_clean == 0 else 1)
loan_2.loan_status_clean.value_counts().plot(kind='bar',alpha=.30)
loan_2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 53487 entries, 3 to 188121
Data columns (total 9 columns):
funded_amnt          53487 non-null float64
emp_length           53487 non-null object
annual_inc           53487 non-null float64
loan_status          53487 non-null object
home_ownership       53487 non-null object
addr_state           53487 non-null object
tax_liens            53487 non-null float64
grade                53487 non-null object
loan_status_clean    53487 non-null int64
dtypes: float64(3), int64(1), object(5)
#cleaning "emp_length" into numeric years: strip "+", "<" and "year(s)", map "n/a" to 0
loan_2['emp_length_clean'] = loan_2.emp_length.str.replace('+','')
loan_2['emp_length_clean'] = loan_2.emp_length_clean.str.replace('<','')
loan_2['emp_length_clean'] = loan_2.emp_length_clean.str.replace('years','')
loan_2['emp_length_clean'] = loan_2.emp_length_clean.str.replace('year','')
loan_2['emp_length_clean'] = loan_2.emp_length_clean.str.replace('n/a','0')
loan_2.emp_length_clean.unique()
array(['10 ', '2 ', '1 ', '9 ', '5 ', ' 1 ', '8 ', '0', '7 ', '4 ', '3 ', '6 '], dtype=object)
loan_2['emp_length_clean'] = loan_2.emp_length_clean.map(float)
Here, we'll assign a numeric value to the letter grade given to each loan. "A", the highest rating, will receive the value 7. "G", the lowest rating, will receive the value 1.
Using the map function, we assign the values:
loan_2['grade_clean'] = loan_2['grade'].map({'A':7,'B':6,'C':5,'D':4,'E':3,'F':2,'G':1})
#backfilling any remaining missing values with the column mean
funded_amnt = loan_2.funded_amnt
mean_funded_amnt = loan_2[loan_2.funded_amnt.notnull()].funded_amnt.mean()
loan_2.funded_amnt.fillna(mean_funded_amnt, inplace=True)
annual_inc = loan_2.annual_inc
mean_annual_inc = loan_2[loan_2.annual_inc.notnull()].annual_inc.mean()
loan_2.annual_inc.fillna(mean_annual_inc, inplace=True)
emp_length = loan_2.emp_length_clean
mean_emp_length_clean = loan_2[loan_2.emp_length_clean.notnull()].emp_length_clean.mean()
loan_2.emp_length_clean.fillna(mean_emp_length_clean, inplace=True)
grade = loan_2.grade
mean_grade_clean = loan_2[loan_2.grade.notnull()].grade_clean.mean()
loan_2.grade_clean.fillna(mean_grade_clean, inplace=True)
import statsmodels.api as sm
from sklearn import linear_model, datasets
from sklearn.cross_validation import train_test_split
Predicting whether a loan will be paid off using Employment Length and Grade of the Loan.
X_Variables = ['emp_length_clean', 'grade_clean']
X = loan_2[X_Variables]
X = X.values
y = loan_2['loan_status_clean'].values
clf = linear_model.LogisticRegression()
model = clf.fit(X,y)
model.score(X, y)
0.77702993250696428
pd.DataFrame(zip(X_Variables,model.coef_.T))
|   | 0 | 1 |
|---|---|---|
| 0 | emp_length_clean | [0.0160344150048] |
| 1 | grade_clean | [0.312299443185] |
Above we have our coefficients:

• 0.0160 for the length of employment
• 0.3123 for the grade a loan received
Let's take a look at the grade a loan receives. For every one-step increase in grade, from "G" to "F" (in our encoding, from 1 to 2), the log-odds of the loan being paid off increase by 0.3123.
Makes intuitive sense, right? Lending Club wouldn't give a high grade to a loan it thinks is faulty, so as the grade of a loan increases, so does the chance of the loan being paid off, here by a coefficient of 0.3123.
Alright, what about the years someone has been employed? That could certainly serve as a predictor, but here it isn't a strong one: each additional year of employment raises the log-odds of repayment by only 0.0160.
Predicting whether a loan will be paid off using Funded Amount and Annual Income.
X_Variables_2 = ['funded_amnt', 'annual_inc']
X_2 = loan_2[X_Variables_2]
X_2 = X_2.values
y_2 = loan_2['loan_status_clean'].values
model_2 = clf.fit(X_2,y_2)  # note: .fit retrains clf in place, so inspect one model's coefficients before fitting the next
model_2.score(X_2, y_2)
0.77809561201787347
pd.DataFrame(zip(X_Variables_2,model_2.coef_.T))
|   | 0 | 1 |
|---|---|---|
| 0 | funded_amnt | [-2.38695184337e-05] |
| 1 | annual_inc | [2.2999063003e-05] |
The reason for such low coefficients on funded amount and annual income is that those features are measured in dollars (tens of thousands of them), while loan status is binary, ranging from 0 to 1.
Let's look at the amount funded. As the amount funded increases by $10,000, the log-odds of the loan being paid back decrease by 0.238 (10,000 × -0.0000238).
Similarly, as annual income increases, so does the chance of the loan being paid off. Intuitive, right? This is supported by the positive coefficient 0.0000230: if my annual income increases by $10,000, the log-odds of my paying back the loan increase by 0.230 (10,000 × 0.0000230).
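To make that scaling concrete, here's a sketch multiplying the fitted coefficients by a $10,000 change:
print 10000 * model_2.coef_  # roughly [-0.239, 0.230]: the change in log-odds per extra $10,000 funded or earned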
Predicting whether a loan will be paid off given the individual's home ownership status.
We currently have a column "home_ownership" with five unique values: "Rent", "Mortgage", "Own", "None", "Other".
loan_2.home_ownership.unique().tolist()
['RENT', 'MORTGAGE', 'OWN', 'NONE', 'OTHER']
Before regressing loan status on this list of values, we have to create an individual column for each value, referred to as dummy variables.
Each column will hold a True/False (1/0) flag indicating whether the individual loan has "Rent", "Mortgage", "Own", "None", or "Other".
home_ownership = pd.get_dummies(loan_2.home_ownership)
loan_2 = loan_2.join(home_ownership)
Below are our dummy variables.
loan_2.head()
|   | funded_amnt | emp_length | annual_inc | loan_status | home_ownership | addr_state | tax_liens | grade | loan_status_clean | emp_length_clean | grade_clean | MORTGAGE | NONE | OTHER | OWN | RENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 15000 | 10+ years | 98000 | Fully Paid | RENT | NY | 0 | C | 1 | 10 | 5 | 0 | 0 | 0 | 0 | 1 |
| 12 | 3000 | 10+ years | 25000 | Fully Paid | RENT | FL | 0 | B | 1 | 10 | 6 | 0 | 0 | 0 | 0 | 1 |
| 15 | 4800 | 2 years | 39600 | Fully Paid | MORTGAGE | TX | 0 | B | 1 | 2 | 6 | 1 | 0 | 0 | 0 | 0 |
| 22 | 6000 | 1 year | 70000 | Fully Paid | MORTGAGE | NC | 0 | B | 1 | 1 | 6 | 1 | 0 | 0 | 0 | 0 |
| 26 | 10075 | 2 years | 55000 | Fully Paid | MORTGAGE | DE | 0 | E | 1 | 2 | 3 | 1 | 0 | 0 | 0 | 0 |
Let's run the logistic regression.
X_Variables_3 = ['RENT', 'MORTGAGE', 'OWN', 'NONE', 'OTHER']
X_3 = loan_2[X_Variables_3]
X_3 = X_3.values
y_3 = loan_2['loan_status_clean'].values
model_3 = clf.fit(X_3,y_3)
model_3.score(X_3,y_3)
0.77809561201787347
pd.DataFrame(zip(X_Variables_3, model_3.coef_.T))
|   | 0 | 1 |
|---|---|---|
| 0 | RENT | [0.219178922014] |
| 1 | MORTGAGE | [0.553450501195] |
| 2 | OWN | [0.317744061658] |
| 3 | NONE | [0.434085592491] |
| 4 | OTHER | [-0.654472520962] |
My read on someone putting "OTHER" for home ownership on a loan application is that they either did not want to reveal their home ownership situation, are hiding something, or are bad at filling out applications. "NONE" could be an honest answer from someone who may be living with their parents.
Regardless, it seems that if someone checks off "OTHER" and gets funded, there's a very good chance of that individual defaulting on his or her loan.
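As a rough, hedged check, exponentiating the fitted coefficients turns them into odds multipliers, and the "OTHER" figure stands out:
print np.exp(model_3.coef_)  # exp(-0.654) ~ 0.52: checking "OTHER" roughly halves the odds of repayment, holding the other flags fixed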
Using our column of years employed, we create dummies so that we can easily run the logistic regression. We're trying to see which length of employment best predicts whether someone pays back their loan.
emp_dummies = pd.get_dummies(loan_2.emp_length)
loan_2 = loan_2.join(emp_dummies)
X_Variables_4 = ['< 1 year','1 year','2 years','3 years','4 years','5 years','6 years','7 years','8 years','9 years','10+ years']
X_4 = loan_2[X_Variables_4]
y_4 = loan_2['loan_status_clean'].values
model_4 = clf.fit(X_4,y_4)
model_4.score(X_4, y_4)
0.77809561201787347
pd.DataFrame(zip(X_Variables_4,model_4.coef_.T))
|   | 0 | 1 |
|---|---|---|
| 0 | < 1 year | [0.391361177496] |
| 1 | 1 year | [0.452628193439] |
| 2 | 2 years | [0.508271648822] |
| 3 | 3 years | [0.475809147396] |
| 4 | 4 years | [0.395646588617] |
| 5 | 5 years | [0.449174342544] |
| 6 | 6 years | [0.405863824018] |
| 7 | 7 years | [0.42051748587] |
| 8 | 8 years | [0.418944522839] |
| 9 | 9 years | [0.414805343338] |
| 10 | 10+ years | [0.496719497268] |
There doesn't appear to be much variance among the coefficients for the different employment lengths. It looks like, so long as a person is employed, they tend to pay back their loan.
That said, "< 1 year" does carry one of the lowest coefficients, consistent with the idea that less than a year of employment means a somewhat lower chance of repayment.
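One hedged way to quantify "not much variance" is the spread of the fitted coefficients (a quick sketch using model_4 from above):
coefs = model_4.coef_.ravel()
print coefs.max() - coefs.min()  # ~0.12 on the log-odds scale across all employment lengths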
Below we generate a CSV file, with select columns, so that we can pull it into CartoDB, a web mapping tool.
loan_cartodb2 = loan_2[['addr_state','funded_amnt','emp_length_clean','annual_inc','grade_clean','loan_status_clean']]
#loan_cartodb2.to_csv('/Users/olehdubno/Desktop/loan_cartodb2.csv', index=False)
Here we use www.cartodb.com, a very intuitive and friendly way of generating maps.
Before mapping our data, CartoDB conveniently converts the state abbreviations into latitudes and longitudes.
After selecting the features we want to play with, CartoDB generates a map and several ways to share it. One of those is an iframe, which uses HTML to embed content from one source into another.
from IPython.display import HTML
HTML("<iframe width='100%' height='520' frameborder='0' src='http://shortyskater456.cartodb.com/viz/40d16f7e-6d3e-11e4-a898-0e9d821ea90d/embed_map' allowfullscreen webkitallowfullscreen mozallowfullscreen oallowfullscreen msallowfullscreen></iframe>")
The above map is a choropleth map, "a thematic map in which areas are shaded or patterned in proportion to the measurement of the statistical variable being displayed" (Wikipedia).
As the color intensity increases (gets closer to 1), a larger share of the people residing in that state have, on average, paid off their loans.
The number near each point is the number of loans given out in that state.
By the looks of the map, I wouldn't give out loans in Oregon, Wisconsin, Nevada, Tennessee, Virginia, Indiana, and maybe a few others.
Of course, this is an average of individual loans per state, discounting specific regions within each state, and is not the best estimate of whether a funded individual in that state is likely to repay their loan.
However, maybe the other features can help determine which states are less likely to pay off a loan.
HTML("<iframe width='100%' height='520' frameborder='0' src='http://olehdubno.cartodb.com/viz/0dce85a4-6d10-11e4-98f3-0e9d821ea90d/embed_map' allowfullscreen webkitallowfullscreen mozallowfullscreen oallowfullscreen msallowfullscreen></iframe>")
Reviewing the map, we can see that Oregon, Montana, and Mississippi have, on average, taken out larger loans (closer to $35,000).
According to the "Paid vs Unpaid" map, Oregon is not only taking out the largest loans, it's also not paying them back.
On average, individuals receiving a loan in Oregon are much more likely to default on it, and they're also likelier to receive bigger loans.
HTML("<iframe width='100%' height='520' frameborder='0' src='http://olehdubno.cartodb.com/viz/c2c9b8a4-6ba6-11e4-aadc-0e4fddd5de28/embed_map' allowfullscreen webkitallowfullscreen mozallowfullscreen oallowfullscreen msallowfullscreen></iframe>")
Several outliers in annual income have been removed, using CartoDB.
Before removing the outliers, the income ranged from $33,504.72 to $7,241,778, which is an obscene upper end. I limit it to $500,000.00.
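For reference, the same cap could be applied locally in pandas; this sketch (with a hypothetical loan_cartodb2_capped name) mirrors the filter done inside CartoDB:
loan_cartodb2_capped = loan_cartodb2[loan_cartodb2['annual_inc'] < 500000]  # hypothetical local equivalent of the CartoDB filter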
Interestingly, Oregon is the state with the highest income, the lowest payback rate, and, on average, the largest loans.
HTML("<iframe width='100%' height='520' frameborder='0' src='http://olehdubno.cartodb.com/viz/57bfeb6c-6ba8-11e4-a74d-0e4fddd5de28/embed_map' allowfullscreen webkitallowfullscreen mozallowfullscreen oallowfullscreen msallowfullscreen></iframe>")
Staying with Oregon, a state I'm not too familiar with: according to the data, its loans carry fairly good grades from Lending Club, even though its repayment record doesn't really back that up.
I can understand why Lending Club would, on average, give a pretty good grade to loans in Oregon. The population there has some of the highest incomes, as we saw on the income map earlier.
Here, we use the train_test_split function from sklearn to split the features into training and test sets. Our test set will be 25% of the actual data.
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
X_Variables = ['emp_length_clean', 'grade_clean']
X = loan_2[X_Variables]
X = X.values
y = loan_2['loan_status_clean'].values
from sklearn.cross_validation import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,y,test_size=0.25)
GaussianNB() creates a classifier object, called "clf", and clf.fit trains it; this new "fitted" object, "clf", can then do things like score and predict.
clf = GaussianNB()
clf.fit(X_train,Y_train)
clf.score(X_train,Y_train)
0.74363704349993764
clf.score(X_test,Y_test)
0.74132515704457069
from sklearn import metrics
def measure_performance(X, y, clf, show_accuracy=True, show_classification_report=True, show_confusion_matrix=True):
    y_pred = clf.predict(X)
    if show_accuracy:
        print "Accuracy:{0:.3f}".format(metrics.accuracy_score(y, y_pred)), "\n"
    if show_classification_report:
        print "Classification report"
        print metrics.classification_report(y, y_pred), "\n"
    if show_confusion_matrix:
        print "Confusion matrix"
        print metrics.confusion_matrix(y, y_pred), "\n"
measure_performance(X_train,Y_train,clf, show_classification_report=True, show_confusion_matrix=True)
Accuracy:0.744

Classification report
             precision    recall  f1-score   support

          0       0.35      0.19      0.25      8847
          1       0.80      0.90      0.85     31268

avg / total       0.70      0.74      0.71     40115

Confusion matrix
[[ 1708  7139]
 [ 3145 28123]]
from IPython.display import Image
Image(filename='/Users/olehdubno/Desktop/python_tests/confusion_matrix.png')
The confusion matrix allows for more detailed analysis than the mere proportion of correct guesses.
For instance, 3,145 loans that were actually paid were incorrectly predicted as unpaid.
Based on the entries in the confusion matrix above, the total number of correct predictions made by the model is 1,708 + 28,123 = 29,831 loans, and the total number of incorrect predictions is 7,139 + 3,145 = 10,284.
The confusion matrix provides the information needed to determine how well a classification model performs. The performance metric, accuracy, summarizes this information with a single number, 0.744.
Accuracy takes the total number of correct predictions and divides it by the total number of all predictions made.
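A quick sketch verifying that arithmetic against the confusion matrix printed above:
correct = 1708 + 28123    # true unpaid + true paid (the diagonal)
incorrect = 7139 + 3145   # the two off-diagonal cells
print correct / float(correct + incorrect)  # ~0.744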
predictions = [p[1] for p in clf.predict_proba(X_train)]
fpr_p, tpr_p, thresholds_p = metrics.roc_curve(Y_train,predictions)
fig = plt.figure()
fig.set_figwidth(10)
fig.suptitle('AUC for Naive Bayes Classifier Predicting Loans Paid')
ax1 = plt.subplot(1, 2, 1)
ax1.set_xlabel('false positive rate')
ax1.set_ylabel('true positive rate')
ax1.plot(fpr_p, tpr_p)
fpr, tpr, thresholds = metrics.roc_curve(Y_train,clf.predict(X_train))
ax2 = plt.subplot(1, 2, 2)
ax2.set_xlabel('false positive rate')
ax2.set_ylabel('true positive rate')
ax2.plot(fpr, tpr)
print "False-positive rate:", fpr
print "True-positive rate: ", tpr
print "Thresholds: ", thresholds
print fig
False-positive rate: [ 0.          0.80694021  1.        ]
True-positive rate:  [ 0.          0.89941794  1.        ]
Thresholds:  [2 1 0]
Figure(800x320)
loan_limit_by_inc = loan_2[loan_2['annual_inc']<200000]
import statsmodels.formula.api as smf
# OLS, or ordinary least squares, takes a y (dependent variable) and X (independent variables) (formula = y ~ X)
# Below, we fit a single-variable linear model (and a log-log version of it)
# to return a test statistic and p-value, to see how strong of a relationship annual income and funded amount have.
loan_limit_by_inc['log_annual_inc'] = np.log(loan_limit_by_inc['annual_inc'])
loan_limit_by_inc['log_funded_amnt'] = np.log(loan_limit_by_inc['funded_amnt'])
fig, axes = plt.subplots(nrows=1,ncols=2)
axes[0].plot(loan_limit_by_inc.annual_inc, loan_limit_by_inc.funded_amnt, 'go')
model = smf.ols(formula='funded_amnt ~ annual_inc', data=loan_limit_by_inc)
results = model.fit()
print 'NORMAL FIT SUMMARY'
print(results.summary())
print
axes[1].plot(loan_limit_by_inc.log_annual_inc, loan_limit_by_inc.log_funded_amnt, 'mo')
log_model = smf.ols(formula='log_funded_amnt ~ log_annual_inc', data=loan_limit_by_inc)
log_results = log_model.fit()
print 'LOG-LOG FIT SUMMARY'
print(log_results.summary())
print fig
NORMAL FIT SUMMARY
                            OLS Regression Results
==============================================================================
Dep. Variable:            funded_amnt   R-squared:                       0.201
Model:                            OLS   Adj. R-squared:                  0.201
Method:                 Least Squares   F-statistic:                 1.346e+04
Date:                Thu, 20 Nov 2014   Prob (F-statistic):               0.00
Time:                        20:47:36   Log-Likelihood:            -5.5036e+05
No. Observations:               53487   AIC:                         1.101e+06
Df Residuals:                   53485   BIC:                         1.101e+06
Df Model:                           1
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept   6142.0477     72.531     84.682      0.000      5999.887  6284.209
annual_inc     0.1122      0.001    116.031      0.000         0.110     0.114
==============================================================================
Omnibus:                     1305.010   Durbin-Watson:                   1.979
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1402.884
Skew:                           0.397   Prob(JB):                    2.33e-305
Kurtosis:                       2.989   Cond. No.                     1.77e+05
==============================================================================
Warnings:
[1] The condition number is large, 1.77e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

LOG-LOG FIT SUMMARY
                            OLS Regression Results
==============================================================================
Dep. Variable:        log_funded_amnt   R-squared:                       0.201
Model:                            OLS   Adj. R-squared:                  0.201
Method:                 Least Squares   F-statistic:                 1.349e+04
Date:                Thu, 20 Nov 2014   Prob (F-statistic):               0.00
Time:                        20:47:36   Log-Likelihood:                -48153.
No. Observations:               53487   AIC:                         9.631e+04
Df Residuals:                   53485   BIC:                         9.633e+04
Df Model:                           1
==================================================================================
                     coef    std err          t      P>|t|      [95.0% Conf. Int.]
----------------------------------------------------------------------------------
Intercept          2.2661      0.061     37.187      0.000         2.147     2.386
log_annual_inc     0.6417      0.006    116.158      0.000         0.631     0.653
==============================================================================
Omnibus:                     7182.767   Durbin-Watson:                   1.986
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            11028.790
Skew:                          -0.964   Prob(JB):                         0.00
Kurtosis:                       4.108   Cond. No.                         263.
==============================================================================
Figure(480x320)
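As a rough reading of those summaries, here's a sketch that plugs a hypothetical $60,000 income into the fitted normal-fit line, with a note on the log-log slope:
print 6142.0477 + 0.1122 * 60000  # ~ $12,874 predicted funded amount for a hypothetical $60,000 earner
# The log-log slope of 0.6417 reads as an elasticity: a 1% rise in income
# is associated with roughly a 0.64% rise in funded amount.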
#multilinear regression OLS
import pandas as pd
import numpy as np
import statsmodels.api as sm
X = loan_limit_by_inc[['annual_inc','emp_length_clean','grade_clean']]
y = loan_limit_by_inc['funded_amnt']
X = sm.add_constant(X)
est = sm.OLS(y,X).fit()
est.summary()
| Dep. Variable: | funded_amnt | R-squared: | 0.269 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.269 |
| Method: | Least Squares | F-statistic: | 6551. |
| Date: | Thu, 20 Nov 2014 | Prob (F-statistic): | 0.00 |
| Time: | 20:47:40 | Log-Likelihood: | -5.4799e+05 |
| No. Observations: | 53487 | AIC: | 1.096e+06 |
| Df Residuals: | 53483 | BIC: | 1.096e+06 |
| Df Model: | 3 | | |

| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| const | 1.327e+04 | 140.208 | 94.669 | 0.000 | 1.3e+04 1.35e+04 |
| annual_inc | 0.1094 | 0.001 | 117.082 | 0.000 | 0.108 0.111 |
| emp_length_clean | 136.6026 | 8.394 | 16.275 | 0.000 | 120.151 153.054 |
| grade_clean | -1503.2291 | 22.103 | -68.010 | 0.000 | -1546.551 -1459.907 |

| Omnibus: | 510.645 | Durbin-Watson: | 1.977 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 525.249 |
| Skew: | 0.243 | Prob(JB): | 8.78e-115 |
| Kurtosis: | 3.012 | Cond. No. | 3.60e+05 |
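Plugging a hypothetical borrower into the fitted equation (a sketch assuming $60,000 income, 5 years employed, and grade B, encoded as 6):
print 1.327e4 + 0.1094 * 60000 + 136.6026 * 5 - 1503.2291 * 6  # ~ $11,498 predicted funded amount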
state_geo = r'https://gist.githubusercontent.com/datadave/108b5f382c838c3963d7/raw/3036216d894d49205948dbbfd562754ef3814785/us-states.json'
df = loan_2[['addr_state','funded_amnt','annual_inc','emp_length_clean','loan_status_clean','grade_clean']]
df = df[df['annual_inc']<200000]
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 53487 entries, 3 to 188121
Data columns (total 6 columns):
addr_state           53487 non-null object
funded_amnt          53487 non-null float64
annual_inc           53487 non-null float64
emp_length_clean     53487 non-null float64
loan_status_clean    53487 non-null int64
grade_clean          53487 non-null int64
dtypes: float64(3), int64(2), object(1)
import folium
from IPython.display import HTML
map = folium.Map(location=[40, -100], zoom_start=4) # Initialize map
thresh = [2, 3, 4, 5, 6, 7] # set the threshold, use histogram as guide
map.geo_json(geo_path=state_geo, data=df,
columns=['addr_state', 'grade_clean'], # pick columns
key_on='feature.id',
threshold_scale = thresh, # set threshold
fill_color='YlOrRd', fill_opacity=0.75, line_opacity=0.5, # colors
legend_name='Grade (G-1 F-2 E-3 D-4 C-5 B-6 A-7)') # legend
map.create_map(path='grade5_chloropleth.html') #draw map
# Locally, you can use HTML library to Display the map inline
# HTML('<iframe src=grade5_chloropleth.html width=1000 height = 500><iframe>')
# For publishing on the gist, showing image via markdown
Image(filename='/Users/olehdubno/Desktop/python_tests/map_grade.png')
map = folium.Map(location=[40, -100], zoom_start=4) # Initialize map
thresh = [5000, 10000, 20000, 30000, 35000] # set the threshold, use histogram as guide
map.geo_json(geo_path=state_geo, data=df,
columns=['addr_state', 'funded_amnt'], # pick columns
key_on='feature.id',
threshold_scale = thresh, # set threshold
fill_color='YlOrRd', fill_opacity=0.75, line_opacity=0.5, # colors
legend_name='Funded Amount') # legend
map.create_map(path='funded5_chloropleth.html') #draw map
# Locally, you can use HTML library to Display the map inline
# HTML('<iframe src=grade5_chloropleth.html width=1000 height = 500><iframe>')
# For publishing on the gist, showing image via markdown
Image(filename='/Users/olehdubno/Desktop/python_tests/map_funded_amount.png')
df.annual_inc.hist()
map = folium.Map(location=[40, -100], zoom_start=4) # Initialize map
thresh = [20000, 40000, 60000, 80000, 100000, 120000] # set the threshold, use histogram as guide
map.geo_json(geo_path=state_geo, data=df,
columns=['addr_state', 'annual_inc'], # pick columns
key_on='feature.id',
threshold_scale = thresh, # set threshold
fill_color='YlOrRd', fill_opacity=0.75, line_opacity=0.5, # colors
legend_name='Annual Income') # legend
map.create_map(path='income4_chloropleth.html') #draw map
# Locally, you can use HTML library to Display the map inline
# HTML('<iframe src=grade5_chloropleth.html width=1000 height = 500><iframe>')
# For publishing on the gist, showing image via markdown
Image(filename='/Users/olehdubno/Desktop/python_tests/map_annual_income.png')
map = folium.Map(location=[40, -100], zoom_start=4) # Initialize map
thresh = [1, 2, 4, 6, 8, 10] # set the threshold, use histogram as guide
map.geo_json(geo_path=state_geo, data=df,
columns=['addr_state', 'emp_length_clean'], # pick columns
key_on='feature.id',
threshold_scale = thresh, # set threshold
fill_color='YlOrRd', fill_opacity=0.75, line_opacity=0.5, # colors
legend_name='Employment Length (years)') # legend
map.create_map(path='emp_length2_chloropleth.html') #draw map
# Locally, you can use HTML library to Display the map inline
# HTML('<iframe src=grade5_chloropleth.html width=1000 height = 500><iframe>')
# For publishing on the gist, showing image via markdown
Image(filename='/Users/olehdubno/Desktop/python_tests/map_emp_length.png')
Doing some cleaning on loan status to determine the average repayment rate per state:
df1 = df.groupby(['addr_state']).loan_status_clean.mean().reset_index() # reset_index turns the grouped Series back into a DataFrame, no CSV round-trip needed
#df1.to_csv("/Users/olehdubno/Desktop/python_tests/df1.csv")
#df1 = pd.read_csv("/Users/olehdubno/Desktop/python_tests/df1.csv")
df = df1
df.head()
|   | addr_state | loan_status_clean |
|---|---|---|
| 0 | AK | 0.818792 |
| 1 | AL | 0.736349 |
| 2 | AR | 0.778947 |
| 3 | AZ | 0.785294 |
| 4 | CA | 0.794257 |
map = folium.Map(location=[40, -100], zoom_start=4) # Initialize map
thresh = [0, .2, .4, .6, .8, 1] # set the threshold, use histogram as guide
map.geo_json(geo_path=state_geo, data=df,
columns=['addr_state', 'loan_status_clean'], # pick columns
key_on='feature.id',
threshold_scale = thresh, # set threshold
fill_color='YlOrRd', fill_opacity=0.75, line_opacity=0.5, # colors
legend_name='Loan Status') # legend
map.create_map(path='loan_status3_chloropleth.html') #draw map
# Locally, you can use HTML library to Display the map inline
# HTML('<iframe src=grade5_chloropleth.html width=1000 height = 500><iframe>')
# For publishing on the gist, showing image via markdown
Image(filename='/Users/olehdubno/Desktop/python_tests/map_loan_status.png')