Here we get into the detailed steps involved in data cleanup of the Lending Club dataset. We glossed over this in the Data Exploration lesson and left it as an exercise. This is where we actually do the cleanup step by step.
%pylab inline
import pandas as pd
from numpy import nan as NA
loansData = pd.read_csv('https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv')
loansData
Populating the interactive namespace from numpy and matplotlib
Amount.Requested | Amount.Funded.By.Investors | Interest.Rate | Loan.Length | Loan.Purpose | Debt.To.Income.Ratio | State | Home.Ownership | Monthly.Income | FICO.Range | Open.CREDIT.Lines | Revolving.CREDIT.Balance | Inquiries.in.the.Last.6.Months | Employment.Length | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
81174 | 20000 | 20000.00 | 8.90% | 36 months | debt_consolidation | 14.90% | SC | MORTGAGE | 6541.67 | 735-739 | 14 | 14272 | 2 | < 1 year |
99592 | 19200 | 19200.00 | 12.12% | 36 months | debt_consolidation | 28.36% | TX | MORTGAGE | 4583.33 | 715-719 | 12 | 11140 | 1 | 2 years |
80059 | 35000 | 35000.00 | 21.98% | 60 months | debt_consolidation | 23.81% | CA | MORTGAGE | 11500.00 | 690-694 | 14 | 21977 | 1 | 2 years |
15825 | 10000 | 9975.00 | 9.99% | 36 months | debt_consolidation | 14.30% | KS | MORTGAGE | 3833.33 | 695-699 | 10 | 9346 | 0 | 5 years |
33182 | 12000 | 12000.00 | 11.71% | 36 months | credit_card | 18.78% | NJ | RENT | 3195.00 | 695-699 | 11 | 14469 | 0 | 9 years |
62403 | 6000 | 6000.00 | 15.31% | 36 months | other | 20.05% | CT | OWN | 4891.67 | 670-674 | 17 | 10391 | 2 | 3 years |
48808 | 10000 | 10000.00 | 7.90% | 36 months | debt_consolidation | 26.09% | MA | RENT | 2916.67 | 720-724 | 10 | 15957 | 0 | 10+ years |
22090 | 33500 | 33450.00 | 17.14% | 60 months | credit_card | 14.70% | LA | MORTGAGE | 13863.42 | 705-709 | 12 | 27874 | 0 | 10+ years |
76404 | 14675 | 14675.00 | 14.33% | 36 months | credit_card | 26.92% | CA | RENT | 3150.00 | 685-689 | 9 | 7246 | 1 | 8 years |
15867 | 7000 | 7000.00 | 6.91% | 36 months | credit_card | 7.10% | CA | RENT | 5000.00 | 715-719 | 8 | 7612 | 0 | 3 years |
94971 | 2000 | 2000.00 | 19.72% | 36 months | moving | 10.29% | FL | RENT | 3575.00 | 670-674 | 10 | 12036 | 0 | 6 years |
36911 | 10625 | 10625.00 | 14.27% | 36 months | debt_consolidation | 12.54% | CA | MORTGAGE | 4250.00 | 665-669 | 14 | 10767 | 0 | < 1 year |
41200 | 28000 | 27975.00 | 21.67% | 60 months | debt_consolidation | 13.07% | CT | MORTGAGE | 14166.67 | 670-674 | 12 | 10311 | 0 | 1 year |
83869 | 35000 | 34950.00 | 8.90% | 36 months | debt_consolidation | 20.46% | CT | RENT | 9166.67 | 735-739 | 19 | 21536 | 0 | 1 year |
53853 | 9600 | 9600.00 | 7.62% | 36 months | debt_consolidation | 3.45% | DC | RENT | 11250.00 | 725-729 | 13 | 4606 | 0 | < 1 year |
21399 | 25000 | 24975.00 | 15.65% | 60 months | debt_consolidation | 21.99% | CA | RENT | 5416.67 | 730-734 | 6 | 13929 | 0 | 9 years |
62127 | 10000 | 10000.00 | 12.12% | 36 months | debt_consolidation | 17.72% | CA | RENT | 9000.00 | 695-699 | 18 | 20317 | 0 | 7 years |
23446 | 14000 | 13900.25 | 10.37% | 60 months | debt_consolidation | 11.95% | OH | RENT | 4333.33 | 740-744 | 6 | 7419 | 0 | 9 years |
44987 | 10000 | 10000.00 | 9.76% | 36 months | credit_card | 7.13% | FL | RENT | 2733.33 | 730-734 | 7 | 6112 | 2 | 3 years |
17977 | 5200 | 5175.00 | 9.99% | 60 months | debt_consolidation | 10.29% | AL | MORTGAGE | 3750.00 | 760-764 | 10 | 16094 | 0 | < 1 year |
86099 | 22000 | 21975.00 | 21.98% | 36 months | debt_consolidation | 11.19% | TX | MORTGAGE | 6666.67 | 665-669 | 9 | 23124 | 0 | 10+ years |
99483 | 30000 | 30000.00 | 19.05% | 60 months | credit_card | 21.25% | FL | MORTGAGE | 6250.00 | 695-699 | 12 | 34927 | 0 | 6 years |
28798 | 6500 | 6500.00 | 17.99% | 60 months | car | 19.63% | FL | RENT | 4100.00 | 665-669 | 11 | 11697 | 1 | 2 years |
24168 | 17400 | 17400.00 | 11.99% | 36 months | credit_card | 12.47% | AZ | RENT | 6833.33 | 695-699 | 7 | 26587 | 0 | 7 years |
10356 | 4000 | 4000.00 | 16.82% | 60 months | vacation | 13.71% | GA | MORTGAGE | 4500.00 | 670-674 | 5 | 20804 | 0 | 3 years |
46027 | 7200 | 7200.00 | 7.90% | 36 months | debt_consolidation | 24.82% | TX | RENT | 5416.67 | 705-709 | 8 | 12017 | 0 | 7 years |
2238 | 8000 | 8000.00 | 14.42% | 36 months | debt_consolidation | 24.63% | MA | RENT | 2964.17 | 675-679 | 9 | 8928 | 2 | 6 years |
65278 | 8000 | 8000.00 | 15.31% | 36 months | debt_consolidation | 15.46% | CA | MORTGAGE | 2916.67 | 675-679 | 13 | 7152 | 1 | 5 years |
4227 | 3000 | 3000.00 | 8.59% | 36 months | other | 3.72% | MA | MORTGAGE | 4167.00 | 765-769 | 4 | 7074 | 0 | 5 years |
50182 | 14500 | 14500.00 | 7.90% | 36 months | debt_consolidation | 4.85% | GA | MORTGAGE | 3958.33 | 760-764 | 4 | 9598 | 0 | 4 years |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
84265 | 20000 | 20000.00 | 22.95% | 60 months | debt_consolidation | 7.10% | NJ | RENT | 6750.00 | 665-669 | 6 | 16104 | 1 | 7 years |
80231 | 19000 | 19000.00 | 7.90% | 36 months | debt_consolidation | 9.76% | RI | MORTGAGE | 5166.67 | 770-774 | 18 | 43617 | 2 | 10+ years |
49533 | 17300 | 17250.00 | 22.45% | 60 months | wedding | 3.58% | PA | MORTGAGE | 5500.00 | 685-689 | 11 | 2306 | 3 | 4 years |
102514 | 7000 | 711.54 | 15.13% | 36 months | major_purchase | 18.91% | CO | MORTGAGE | 3833.00 | 650-654 | 13 | 12634 | 0 | 2 years |
78618 | 7200 | 7200.00 | 18.75% | 36 months | debt_consolidation | 16.21% | MI | RENT | 8333.33 | 660-664 | 8 | 28916 | 0 | 10+ years |
86953 | 10000 | 10000.00 | 14.09% | 36 months | major_purchase | 9.71% | CO | RENT | 4583.33 | 675-679 | 6 | 3859 | 0 | < 1 year |
80129 | 4000 | 3925.00 | 14.09% | 36 months | credit_card | 12.27% | FL | MORTGAGE | 8583.33 | 675-679 | 9 | 36943 | 1 | 10+ years |
85216 | 17500 | 17500.00 | 8.90% | 36 months | debt_consolidation | 10.94% | UT | MORTGAGE | 25000.00 | 730-734 | 9 | 34545 | 0 | 6 years |
38247 | 20000 | 20000.00 | 11.71% | 36 months | credit_card | 9.58% | SD | MORTGAGE | 5416.67 | 725-729 | 12 | 18267 | 1 | 4 years |
91245 | 16200 | 16200.00 | 15.80% | 60 months | debt_consolidation | 7.92% | PA | MORTGAGE | 4833.33 | 680-684 | 12 | 12313 | 2 | 10+ years |
53041 | 10000 | 10000.00 | 6.03% | 36 months | small_business | 13.03% | FL | RENT | 5000.00 | 760-764 | 8 | 3952 | 0 | 2 years |
63051 | 27000 | 27000.00 | 6.62% | 36 months | debt_consolidation | 12.21% | OH | MORTGAGE | 9250.00 | 810-814 | 12 | 4211 | 0 | 5 years |
14446 | 4500 | 4475.00 | 7.51% | 36 months | small_business | 20.27% | VA | MORTGAGE | 7075.50 | 720-724 | 15 | 68618 | 2 | 10+ years |
68628 | 15875 | 15875.00 | 14.33% | 36 months | small_business | 17.44% | MD | MORTGAGE | 3416.67 | 675-679 | 11 | 15891 | 0 | 2 years |
98758 | 15000 | 15000.00 | 10.16% | 36 months | credit_card | 28.28% | OH | MORTGAGE | 6666.67 | 690-694 | 15 | 14880 | 0 | 10+ years |
13070 | 25000 | 24950.00 | 10.75% | 36 months | debt_consolidation | 20.48% | OR | MORTGAGE | 7083.33 | 765-769 | 10 | 25429 | 0 | 6 years |
45836 | 7000 | 7000.00 | 17.27% | 36 months | other | 18.38% | NY | OWN | 2464.37 | 665-669 | 9 | 7089 | 0 | 3 years |
52330 | 15000 | 15000.00 | 19.99% | 36 months | wedding | 18.05% | CA | RENT | 8000.00 | 660-664 | 6 | 45976 | 1 | 2 years |
48243 | 17000 | 17000.00 | 15.81% | 36 months | debt_consolidation | 17.01% | CO | RENT | 3833.33 | 685-689 | 6 | 15484 | 1 | 6 years |
63256 | 19075 | 19075.00 | 18.75% | 36 months | debt_consolidation | 15.23% | NY | RENT | 5166.67 | 670-674 | 17 | 13749 | 3 | 10+ years |
42124 | 10000 | 10000.00 | 11.71% | 36 months | debt_consolidation | 8.40% | CA | RENT | 4500.00 | 710-714 | 8 | 8404 | 1 | 3 years |
78043 | 8475 | 8475.00 | 7.62% | 36 months | debt_consolidation | 15.88% | CA | RENT | 3983.33 | 720-724 | 9 | 6882 | 0 | n/a |
925 | 6400 | 6350.00 | 10.08% | 36 months | debt_consolidation | 8.11% | NJ | MORTGAGE | 5166.67 | 710-714 | 5 | 5815 | 2 | 10+ years |
74047 | 30000 | 30000.00 | 23.28% | 60 months | other | 12.10% | IL | MORTGAGE | 7083.33 | 675-679 | 16 | 17969 | 1 | 10+ years |
49957 | 24000 | 23975.00 | 14.65% | 36 months | debt_consolidation | 15.29% | WA | MORTGAGE | 6666.67 | 685-689 | 13 | 17521 | 0 | 5 years |
23735 | 30000 | 29950.00 | 16.77% | 60 months | debt_consolidation | 19.23% | NY | MORTGAGE | 9250.00 | 705-709 | 15 | 45880 | 1 | 8 years |
65882 | 16000 | 16000.00 | 14.09% | 60 months | home_improvement | 21.54% | MD | OWN | 8903.25 | 740-744 | 18 | 18898 | 1 | 10+ years |
55610 | 10000 | 10000.00 | 13.99% | 36 months | debt_consolidation | 4.89% | PA | MORTGAGE | 2166.67 | 680-684 | 4 | 4544 | 0 | 10+ years |
38576 | 6000 | 6000.00 | 12.42% | 36 months | major_purchase | 16.66% | NJ | RENT | 3500.00 | 675-679 | 8 | 7753 | 0 | 5 years |
3116 | 9000 | 5242.75 | 13.79% | 36 months | debt_consolidation | 6.76% | NY | RENT | 3875.00 | 670-674 | 7 | 7589 | 0 | 10+ years |
2500 rows × 14 columns
loansData['Interest.Rate'][0:5] # first five rows of Interest.Rate
81174 8.90% 99592 12.12% 80059 21.98% 15825 9.99% 33182 11.71% Name: Interest.Rate, dtype: object
ir = loansData['Interest.Rate']
ir[0:5]
81174 8.90% 99592 12.12% 80059 21.98% 15825 9.99% 33182 11.71% Name: Interest.Rate, dtype: object
type(ir[0:1])
irbak = ir
type(irbak)
loansDataBak = loansData
loansDataBak.reset_index()
loansDataBak
loansDataBak['Interest.Rate']
loansData
<class 'pandas.core.frame.DataFrame'> Int64Index: 2500 entries, 81174 to 3116 Data columns (total 14 columns): Amount.Requested 2500 non-null values Amount.Funded.By.Investors 2500 non-null values Interest.Rate 2500 non-null values Loan.Length 2500 non-null values Loan.Purpose 2500 non-null values Debt.To.Income.Ratio 2500 non-null values State 2500 non-null values Home.Ownership 2500 non-null values Monthly.Income 2499 non-null values FICO.Range 2500 non-null values Open.CREDIT.Lines 2498 non-null values Revolving.CREDIT.Balance 2498 non-null values Inquiries.in.the.Last.6.Months 2498 non-null values Employment.Length 2500 non-null values dtypes: float64(5), int64(1), object(8)
ldb = loansDataBak.reset_index() # explain
ldb['Interest.Rate'][0:5]
irates = ldb['Interest.Rate'][0:]
type(irates)
[irates[k] for k in (0,1,2,3,4)]
#int_rates = [ float(x[:-1]) for x in irates[k] ]
#"""
#---------------------------------------------------------------------------
#ValueError Traceback (most recent call last)
#<ipython-input-46-5c4f2333efd1> in <module>()
# 5
# 6 [irates[k] for k in (0,1,2,3,4)]
#----> 7 int_rates = [ float(x[:-1]) for x in irates[k] ]
#
#ValueError: could not convert string to float:
#"""
['8.90%', '12.12%', '21.98%', '9.99%', '11.71%']
type(ldb)
pandas.core.frame.DataFrame
srates = ldb['Interest.Rate']
#nas = [ x for x in srates if x.isnull() ] # AttributeError: 'str' object has no attribute 'isnull'
nas = [ x for x in srates if not(x[0].isdigit()) ] # AttributeError: 'str' object has no attribute 'isnull'
len(nas)
0
srates[0][:-1]
'8.90'
float(srates[0][:-1])
8.9
nopct = [ x[:-1] for x in srates ]
flrates = [float(x) for x in nopct]
flrates[0:5]
[8.9, 12.12, 21.98, 9.99, 11.71]
flrate = map(float, nopct)
flrate[0:5]
[8.9, 12.12, 21.98, 9.99, 11.71]
len(flrate)
2500
ldb
<class 'pandas.core.frame.DataFrame'> Int64Index: 2500 entries, 0 to 2499 Data columns (total 15 columns): index 2500 non-null values Amount.Requested 2500 non-null values Amount.Funded.By.Investors 2500 non-null values Interest.Rate 2500 non-null values Loan.Length 2500 non-null values Loan.Purpose 2500 non-null values Debt.To.Income.Ratio 2500 non-null values State 2500 non-null values Home.Ownership 2500 non-null values Monthly.Income 2499 non-null values FICO.Range 2500 non-null values Open.CREDIT.Lines 2498 non-null values Revolving.CREDIT.Balance 2498 non-null values Inquiries.in.the.Last.6.Months 2498 non-null values Employment.Length 2500 non-null values dtypes: float64(5), int64(2), object(8)
ldb['Interest.Rate'] = flrate
ldb[0:5]
<class 'pandas.core.frame.DataFrame'> Int64Index: 5 entries, 0 to 4 Data columns (total 15 columns): index 5 non-null values Amount.Requested 5 non-null values Amount.Funded.By.Investors 5 non-null values Interest.Rate 5 non-null values Loan.Length 5 non-null values Loan.Purpose 5 non-null values Debt.To.Income.Ratio 5 non-null values State 5 non-null values Home.Ownership 5 non-null values Monthly.Income 5 non-null values FICO.Range 5 non-null values Open.CREDIT.Lines 5 non-null values Revolving.CREDIT.Balance 5 non-null values Inquiries.in.the.Last.6.Months 5 non-null values Employment.Length 5 non-null values dtypes: float64(6), int64(2), object(7)
ldb['Interest.Rate'][0:5]
0 8.90 1 12.12 2 21.98 3 9.99 4 11.71 Name: Interest.Rate, dtype: float64
srates = loansData['Interest.Rate']
nopct = [ x[:-1] for x in srates ]
flrates = [float(x) for x in nopct]
rates = [float(x[:-1]) for x in srates] # use this
flrates == rates
True
loansData['Interest.Rate'] = flrates
loansData['Interest.Rate'][0:5]
81174 8.90 99592 12.12 80059 21.98 15825 9.99 33182 11.71 Name: Interest.Rate, dtype: float64
withmons = ldb['Loan.Length']
wmons = withmons[0:]
wmons[0:5]
0 36 months 1 36 months 2 60 months 3 36 months 4 36 months Name: Loan.Length, dtype: object
wmons
0 36 months 1 36 months 2 60 months 3 36 months 4 36 months 5 36 months 6 36 months 7 60 months 8 36 months 9 36 months 10 36 months 11 36 months 12 60 months 13 36 months 14 36 months ... 2485 36 months 2486 36 months 2487 36 months 2488 36 months 2489 36 months 2490 36 months 2491 36 months 2492 36 months 2493 60 months 2494 36 months 2495 60 months 2496 60 months 2497 36 months 2498 36 months 2499 36 months Name: Loan.Length, Length: 2500, dtype: object
wmons[0].split()
['36', 'months']
wmons[0].split()[0]
'36'
int(wmons[0].split()[0])
36
x = wmons[0].split()
x[0]
'36'
int(x[0])
36
intmons = [ int(x.split()[0]) for x in wmons ]
intmons[0:10]
[36, 36, 60, 36, 36, 36, 36, 60, 36, 36]
loansData['Loan.Length']
81174 36 months 99592 36 months 80059 60 months 15825 36 months 33182 36 months 62403 36 months 48808 36 months 22090 60 months 76404 36 months 15867 36 months 94971 36 months 36911 36 months 41200 60 months 83869 36 months 53853 36 months ... 13070 36 months 45836 36 months 52330 36 months 48243 36 months 63256 36 months 42124 36 months 78043 36 months 925 36 months 74047 60 months 49957 36 months 23735 60 months 65882 60 months 55610 36 months 38576 36 months 3116 36 months Name: Loan.Length, Length: 2500, dtype: object
loansData['Loan.Length'].value_counts()
36 months 1952 60 months 548 dtype: int64
loansData['Monthly.Income'].describe()
count 2499.000000 mean 5688.931321 std 3963.118185 min 588.500000 25% 3500.000000 50% 5000.000000 75% 6800.000000 max 102750.000000 dtype: float64
First remove implausible values. We see the max value to be 102750.
This is a MONTHLY income of 100K dollars, which is certainly possible, but ....
highly implausible for a person seeking a loan of a few 10's of K dollars, i.e. implausible in this context.
loansData['Monthly.Income'].idxmax() # find the place where the max occurs
54487
But there's a better way - a row filter i.e. an expression used as a way to restrict the rows in a dataframe.
In our case we want to eliminate rows above 100K dollars. i.e. only keep those less than 100K dollars.
loansData['Monthly.Income'][loansData['Monthly.Income'] < 100000]
81174 6541.67 99592 4583.33 80059 11500.00 15825 3833.33 33182 3195.00 62403 4891.67 48808 2916.67 22090 13863.42 76404 3150.00 15867 5000.00 94971 3575.00 36911 4250.00 41200 14166.67 83869 9166.67 53853 11250.00 ... 13070 7083.33 45836 2464.37 52330 8000.00 48243 3833.33 63256 5166.67 42124 4500.00 78043 3983.33 925 5166.67 74047 7083.33 49957 6666.67 23735 9250.00 65882 8903.25 55610 2166.67 38576 3500.00 3116 3875.00 Name: Monthly.Income, Length: 2498, dtype: float64
loansData['Monthly.Income'].describe()
count 2499.000000 mean 5688.931321 std 3963.118185 min 588.500000 25% 3500.000000 50% 5000.000000 75% 6800.000000 max 102750.000000 dtype: float64
ldlt100 = ldb[ldb['Monthly.Income'] < 100000]
ldlt100
<class 'pandas.core.frame.DataFrame'> Int64Index: 2498 entries, 0 to 2499 Data columns (total 15 columns): index 2498 non-null values Amount.Requested 2498 non-null values Amount.Funded.By.Investors 2498 non-null values Interest.Rate 2498 non-null values Loan.Length 2498 non-null values Loan.Purpose 2498 non-null values Debt.To.Income.Ratio 2498 non-null values State 2498 non-null values Home.Ownership 2498 non-null values Monthly.Income 2498 non-null values FICO.Range 2498 non-null values Open.CREDIT.Lines 2497 non-null values Revolving.CREDIT.Balance 2497 non-null values Inquiries.in.the.Last.6.Months 2497 non-null values Employment.Length 2498 non-null values dtypes: float64(6), int64(2), object(7)
len(ldlt100)
2498
Now drop any rows that have 'NA' values ie data not available.
In database terminiology these would be 'NULL' values.
ldb2 = ldlt100.dropna()
ldb2
<class 'pandas.core.frame.DataFrame'> Int64Index: 2497 entries, 0 to 2499 Data columns (total 15 columns): index 2497 non-null values Amount.Requested 2497 non-null values Amount.Funded.By.Investors 2497 non-null values Interest.Rate 2497 non-null values Loan.Length 2497 non-null values Loan.Purpose 2497 non-null values Debt.To.Income.Ratio 2497 non-null values State 2497 non-null values Home.Ownership 2497 non-null values Monthly.Income 2497 non-null values FICO.Range 2497 non-null values Open.CREDIT.Lines 2497 non-null values Revolving.CREDIT.Balance 2497 non-null values Inquiries.in.the.Last.6.Months 2497 non-null values Employment.Length 2497 non-null values dtypes: float64(6), int64(2), object(7)
len(ldb2)
2497
So we dropped one row that had an NA value somewhere.
Now that we have removed the "bad" data, let's take on the final data-cleaning task for this data set - converting the ranges to single integers.
Note that the FICO values are given in a range which is in the form of a string that looks like lowerlimit-upperlimit, eg 720-724.
We want to convert these values to a single int value representing the lower limit. e.g. 720 in the above example.
ficostr = ldb2['FICO.Range']
ficostr[0:10]
0 735-739 1 715-719 2 690-694 3 695-699 4 695-699 5 670-674 6 720-724 7 705-709 8 685-689 9 715-719 Name: FICO.Range, dtype: object
ficostr[0]
'735-739'
ficoint = [ int(x.split('-')[0]) for x in ficostr ]
ficoint[0:10]
[735, 715, 690, 695, 695, 670, 720, 705, 685, 715]
ldb2['FICO.Range'] = ficoint
len(ficoint)
2497
ldb2['FICO.Range']
0 735 1 715 2 690 3 695 4 695 5 670 6 720 7 705 8 685 9 715 10 670 11 665 12 670 13 735 14 725 ... 2485 765 2486 665 2487 660 2488 685 2489 670 2490 710 2491 720 2492 710 2493 675 2494 685 2495 705 2496 740 2497 680 2498 675 2499 670 Name: FICO.Range, Length: 2497, dtype: int64
Now we have a dataset that we can use for our data exploration and analysis