Linear Regression - Data Cleanup

Here we get into the detailed steps involved in data cleanup of the Lending Club dataset. We glossed over this in the Data Exploration lesson and left it as an exercise. This is where we actually do the cleanup step by step.

  1. Remove '%'
    • strip the last letter
    • map to float
  1. Remove ' months' suffix

    • strip the tailing letters
    • map to int
  2. Remove bad data

    • n/a values
    • outlier income
    • histograms
    • describe
  3. Map FICO.Range to single int

    • split on '-'
    • use bottom value
    • convert to int
    • show how to do mid-point
In [1]:
%pylab inline
import pandas as pd
from numpy import nan as NA
loansData = pd.read_csv('https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv')
loansData
Populating the interactive namespace from numpy and matplotlib
Out[1]:
Amount.Requested Amount.Funded.By.Investors Interest.Rate Loan.Length Loan.Purpose Debt.To.Income.Ratio State Home.Ownership Monthly.Income FICO.Range Open.CREDIT.Lines Revolving.CREDIT.Balance Inquiries.in.the.Last.6.Months Employment.Length
81174 20000 20000.00 8.90% 36 months debt_consolidation 14.90% SC MORTGAGE 6541.67 735-739 14 14272 2 < 1 year
99592 19200 19200.00 12.12% 36 months debt_consolidation 28.36% TX MORTGAGE 4583.33 715-719 12 11140 1 2 years
80059 35000 35000.00 21.98% 60 months debt_consolidation 23.81% CA MORTGAGE 11500.00 690-694 14 21977 1 2 years
15825 10000 9975.00 9.99% 36 months debt_consolidation 14.30% KS MORTGAGE 3833.33 695-699 10 9346 0 5 years
33182 12000 12000.00 11.71% 36 months credit_card 18.78% NJ RENT 3195.00 695-699 11 14469 0 9 years
62403 6000 6000.00 15.31% 36 months other 20.05% CT OWN 4891.67 670-674 17 10391 2 3 years
48808 10000 10000.00 7.90% 36 months debt_consolidation 26.09% MA RENT 2916.67 720-724 10 15957 0 10+ years
22090 33500 33450.00 17.14% 60 months credit_card 14.70% LA MORTGAGE 13863.42 705-709 12 27874 0 10+ years
76404 14675 14675.00 14.33% 36 months credit_card 26.92% CA RENT 3150.00 685-689 9 7246 1 8 years
15867 7000 7000.00 6.91% 36 months credit_card 7.10% CA RENT 5000.00 715-719 8 7612 0 3 years
94971 2000 2000.00 19.72% 36 months moving 10.29% FL RENT 3575.00 670-674 10 12036 0 6 years
36911 10625 10625.00 14.27% 36 months debt_consolidation 12.54% CA MORTGAGE 4250.00 665-669 14 10767 0 < 1 year
41200 28000 27975.00 21.67% 60 months debt_consolidation 13.07% CT MORTGAGE 14166.67 670-674 12 10311 0 1 year
83869 35000 34950.00 8.90% 36 months debt_consolidation 20.46% CT RENT 9166.67 735-739 19 21536 0 1 year
53853 9600 9600.00 7.62% 36 months debt_consolidation 3.45% DC RENT 11250.00 725-729 13 4606 0 < 1 year
21399 25000 24975.00 15.65% 60 months debt_consolidation 21.99% CA RENT 5416.67 730-734 6 13929 0 9 years
62127 10000 10000.00 12.12% 36 months debt_consolidation 17.72% CA RENT 9000.00 695-699 18 20317 0 7 years
23446 14000 13900.25 10.37% 60 months debt_consolidation 11.95% OH RENT 4333.33 740-744 6 7419 0 9 years
44987 10000 10000.00 9.76% 36 months credit_card 7.13% FL RENT 2733.33 730-734 7 6112 2 3 years
17977 5200 5175.00 9.99% 60 months debt_consolidation 10.29% AL MORTGAGE 3750.00 760-764 10 16094 0 < 1 year
86099 22000 21975.00 21.98% 36 months debt_consolidation 11.19% TX MORTGAGE 6666.67 665-669 9 23124 0 10+ years
99483 30000 30000.00 19.05% 60 months credit_card 21.25% FL MORTGAGE 6250.00 695-699 12 34927 0 6 years
28798 6500 6500.00 17.99% 60 months car 19.63% FL RENT 4100.00 665-669 11 11697 1 2 years
24168 17400 17400.00 11.99% 36 months credit_card 12.47% AZ RENT 6833.33 695-699 7 26587 0 7 years
10356 4000 4000.00 16.82% 60 months vacation 13.71% GA MORTGAGE 4500.00 670-674 5 20804 0 3 years
46027 7200 7200.00 7.90% 36 months debt_consolidation 24.82% TX RENT 5416.67 705-709 8 12017 0 7 years
2238 8000 8000.00 14.42% 36 months debt_consolidation 24.63% MA RENT 2964.17 675-679 9 8928 2 6 years
65278 8000 8000.00 15.31% 36 months debt_consolidation 15.46% CA MORTGAGE 2916.67 675-679 13 7152 1 5 years
4227 3000 3000.00 8.59% 36 months other 3.72% MA MORTGAGE 4167.00 765-769 4 7074 0 5 years
50182 14500 14500.00 7.90% 36 months debt_consolidation 4.85% GA MORTGAGE 3958.33 760-764 4 9598 0 4 years
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
84265 20000 20000.00 22.95% 60 months debt_consolidation 7.10% NJ RENT 6750.00 665-669 6 16104 1 7 years
80231 19000 19000.00 7.90% 36 months debt_consolidation 9.76% RI MORTGAGE 5166.67 770-774 18 43617 2 10+ years
49533 17300 17250.00 22.45% 60 months wedding 3.58% PA MORTGAGE 5500.00 685-689 11 2306 3 4 years
102514 7000 711.54 15.13% 36 months major_purchase 18.91% CO MORTGAGE 3833.00 650-654 13 12634 0 2 years
78618 7200 7200.00 18.75% 36 months debt_consolidation 16.21% MI RENT 8333.33 660-664 8 28916 0 10+ years
86953 10000 10000.00 14.09% 36 months major_purchase 9.71% CO RENT 4583.33 675-679 6 3859 0 < 1 year
80129 4000 3925.00 14.09% 36 months credit_card 12.27% FL MORTGAGE 8583.33 675-679 9 36943 1 10+ years
85216 17500 17500.00 8.90% 36 months debt_consolidation 10.94% UT MORTGAGE 25000.00 730-734 9 34545 0 6 years
38247 20000 20000.00 11.71% 36 months credit_card 9.58% SD MORTGAGE 5416.67 725-729 12 18267 1 4 years
91245 16200 16200.00 15.80% 60 months debt_consolidation 7.92% PA MORTGAGE 4833.33 680-684 12 12313 2 10+ years
53041 10000 10000.00 6.03% 36 months small_business 13.03% FL RENT 5000.00 760-764 8 3952 0 2 years
63051 27000 27000.00 6.62% 36 months debt_consolidation 12.21% OH MORTGAGE 9250.00 810-814 12 4211 0 5 years
14446 4500 4475.00 7.51% 36 months small_business 20.27% VA MORTGAGE 7075.50 720-724 15 68618 2 10+ years
68628 15875 15875.00 14.33% 36 months small_business 17.44% MD MORTGAGE 3416.67 675-679 11 15891 0 2 years
98758 15000 15000.00 10.16% 36 months credit_card 28.28% OH MORTGAGE 6666.67 690-694 15 14880 0 10+ years
13070 25000 24950.00 10.75% 36 months debt_consolidation 20.48% OR MORTGAGE 7083.33 765-769 10 25429 0 6 years
45836 7000 7000.00 17.27% 36 months other 18.38% NY OWN 2464.37 665-669 9 7089 0 3 years
52330 15000 15000.00 19.99% 36 months wedding 18.05% CA RENT 8000.00 660-664 6 45976 1 2 years
48243 17000 17000.00 15.81% 36 months debt_consolidation 17.01% CO RENT 3833.33 685-689 6 15484 1 6 years
63256 19075 19075.00 18.75% 36 months debt_consolidation 15.23% NY RENT 5166.67 670-674 17 13749 3 10+ years
42124 10000 10000.00 11.71% 36 months debt_consolidation 8.40% CA RENT 4500.00 710-714 8 8404 1 3 years
78043 8475 8475.00 7.62% 36 months debt_consolidation 15.88% CA RENT 3983.33 720-724 9 6882 0 n/a
925 6400 6350.00 10.08% 36 months debt_consolidation 8.11% NJ MORTGAGE 5166.67 710-714 5 5815 2 10+ years
74047 30000 30000.00 23.28% 60 months other 12.10% IL MORTGAGE 7083.33 675-679 16 17969 1 10+ years
49957 24000 23975.00 14.65% 36 months debt_consolidation 15.29% WA MORTGAGE 6666.67 685-689 13 17521 0 5 years
23735 30000 29950.00 16.77% 60 months debt_consolidation 19.23% NY MORTGAGE 9250.00 705-709 15 45880 1 8 years
65882 16000 16000.00 14.09% 60 months home_improvement 21.54% MD OWN 8903.25 740-744 18 18898 1 10+ years
55610 10000 10000.00 13.99% 36 months debt_consolidation 4.89% PA MORTGAGE 2166.67 680-684 4 4544 0 10+ years
38576 6000 6000.00 12.42% 36 months major_purchase 16.66% NJ RENT 3500.00 675-679 8 7753 0 5 years
3116 9000 5242.75 13.79% 36 months debt_consolidation 6.76% NY RENT 3875.00 670-674 7 7589 0 10+ years

2500 rows × 14 columns

In [330]:
loansData['Interest.Rate'][0:5] # first five rows of Interest.Rate
Out[330]:
81174     8.90%
99592    12.12%
80059    21.98%
15825     9.99%
33182    11.71%
Name: Interest.Rate, dtype: object
  • removing % signs from rates
In [331]:
ir = loansData['Interest.Rate']
ir[0:5]
Out[331]:
81174     8.90%
99592    12.12%
80059    21.98%
15825     9.99%
33182    11.71%
Name: Interest.Rate, dtype: object
In [332]:
type(ir[0:1])
irbak = ir
type(irbak)
loansDataBak = loansData
loansDataBak.reset_index()
loansDataBak
loansDataBak['Interest.Rate']
loansData
Out[332]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2500 entries, 81174 to 3116
Data columns (total 14 columns):
Amount.Requested                  2500  non-null values
Amount.Funded.By.Investors        2500  non-null values
Interest.Rate                     2500  non-null values
Loan.Length                       2500  non-null values
Loan.Purpose                      2500  non-null values
Debt.To.Income.Ratio              2500  non-null values
State                             2500  non-null values
Home.Ownership                    2500  non-null values
Monthly.Income                    2499  non-null values
FICO.Range                        2500  non-null values
Open.CREDIT.Lines                 2498  non-null values
Revolving.CREDIT.Balance          2498  non-null values
Inquiries.in.the.Last.6.Months    2498  non-null values
Employment.Length                 2500  non-null values
dtypes: float64(5), int64(1), object(8)
In [333]:
ldb = loansDataBak.reset_index() # explain
ldb['Interest.Rate'][0:5]
irates = ldb['Interest.Rate'][0:]
type(irates)

[irates[k] for k in (0,1,2,3,4)]
#int_rates = [ float(x[:-1]) for x in irates[k] ]
#"""
#---------------------------------------------------------------------------
#ValueError                                Traceback (most recent call last)
#<ipython-input-46-5c4f2333efd1> in <module>()
#      5 
#      6 [irates[k] for k in (0,1,2,3,4)]
#----> 7 int_rates = [ float(x[:-1]) for x in irates[k] ]
#
#ValueError: could not convert string to float: 
#"""
Out[333]:
['8.90%', '12.12%', '21.98%', '9.99%', '11.71%']
In [334]:
type(ldb)
Out[334]:
pandas.core.frame.DataFrame
In [335]:
srates = ldb['Interest.Rate']
In [336]:
#nas = [ x for x in srates if x.isnull() ] # AttributeError: 'str' object has no attribute 'isnull'
In [337]:
nas = [ x for x in srates if not(x[0].isdigit()) ] # AttributeError: 'str' object has no attribute 'isnull'
In [338]:
len(nas)
Out[338]:
0
In [339]:
srates[0][:-1]
Out[339]:
'8.90'
In [340]:
float(srates[0][:-1])
Out[340]:
8.9
In [341]:
nopct = [ x[:-1] for x in srates ]
In [342]:
flrates = [float(x) for x in nopct]
In [343]:
flrates[0:5]
Out[343]:
[8.9, 12.12, 21.98, 9.99, 11.71]
In [344]:
flrate = map(float, nopct)
In [345]:
flrate[0:5]
Out[345]:
[8.9, 12.12, 21.98, 9.99, 11.71]
In [346]:
len(flrate)
Out[346]:
2500
In [347]:
ldb
Out[347]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2500 entries, 0 to 2499
Data columns (total 15 columns):
index                             2500  non-null values
Amount.Requested                  2500  non-null values
Amount.Funded.By.Investors        2500  non-null values
Interest.Rate                     2500  non-null values
Loan.Length                       2500  non-null values
Loan.Purpose                      2500  non-null values
Debt.To.Income.Ratio              2500  non-null values
State                             2500  non-null values
Home.Ownership                    2500  non-null values
Monthly.Income                    2499  non-null values
FICO.Range                        2500  non-null values
Open.CREDIT.Lines                 2498  non-null values
Revolving.CREDIT.Balance          2498  non-null values
Inquiries.in.the.Last.6.Months    2498  non-null values
Employment.Length                 2500  non-null values
dtypes: float64(5), int64(2), object(8)
In [348]:
ldb['Interest.Rate'] = flrate
In [349]:
ldb[0:5]
Out[349]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 15 columns):
index                             5  non-null values
Amount.Requested                  5  non-null values
Amount.Funded.By.Investors        5  non-null values
Interest.Rate                     5  non-null values
Loan.Length                       5  non-null values
Loan.Purpose                      5  non-null values
Debt.To.Income.Ratio              5  non-null values
State                             5  non-null values
Home.Ownership                    5  non-null values
Monthly.Income                    5  non-null values
FICO.Range                        5  non-null values
Open.CREDIT.Lines                 5  non-null values
Revolving.CREDIT.Balance          5  non-null values
Inquiries.in.the.Last.6.Months    5  non-null values
Employment.Length                 5  non-null values
dtypes: float64(6), int64(2), object(7)
In [350]:
ldb['Interest.Rate'][0:5]
Out[350]:
0     8.90
1    12.12
2    21.98
3     9.99
4    11.71
Name: Interest.Rate, dtype: float64
In [351]:
srates = loansData['Interest.Rate']
In [352]:
nopct = [ x[:-1] for x in srates ]
In [353]:
flrates = [float(x) for x in nopct]
In [354]:
rates = [float(x[:-1]) for x in srates] # use this
In [355]:
flrates == rates
Out[355]:
True
In [356]:
loansData['Interest.Rate'] = flrates
In [357]:
loansData['Interest.Rate'][0:5]
Out[357]:
81174     8.90
99592    12.12
80059    21.98
15825     9.99
33182    11.71
Name: Interest.Rate, dtype: float64

Conclusion of step 1

  • ok! whew! we're done with the % symbol stuff
  • we learnt quite a few things along the way that will be useful in the next part

Step 2: Remove the months

In [358]:
withmons = ldb['Loan.Length'] 
In [358]:
 
In [359]:
wmons = withmons[0:]
In [360]:
wmons[0:5]
Out[360]:
0    36 months
1    36 months
2    60 months
3    36 months
4    36 months
Name: Loan.Length, dtype: object
In [361]:
wmons
Out[361]:
0     36 months
1     36 months
2     60 months
3     36 months
4     36 months
5     36 months
6     36 months
7     60 months
8     36 months
9     36 months
10    36 months
11    36 months
12    60 months
13    36 months
14    36 months
...
2485    36 months
2486    36 months
2487    36 months
2488    36 months
2489    36 months
2490    36 months
2491    36 months
2492    36 months
2493    60 months
2494    36 months
2495    60 months
2496    60 months
2497    36 months
2498    36 months
2499    36 months
Name: Loan.Length, Length: 2500, dtype: object
In [362]:
wmons[0].split()
Out[362]:
['36', 'months']
In [363]:
wmons[0].split()[0]
Out[363]:
'36'
In [364]:
int(wmons[0].split()[0])
Out[364]:
36
In [365]:
x = wmons[0].split()
In [366]:
x[0]
Out[366]:
'36'
In [367]:
int(x[0])
Out[367]:
36
In [368]:
intmons = [ int(x.split()[0]) for x in wmons ] 
In [369]:
intmons[0:10]
Out[369]:
[36, 36, 60, 36, 36, 36, 36, 60, 36, 36]
In [370]:
loansData['Loan.Length']
Out[370]:
81174    36 months
99592    36 months
80059    60 months
15825    36 months
33182    36 months
62403    36 months
48808    36 months
22090    60 months
76404    36 months
15867    36 months
94971    36 months
36911    36 months
41200    60 months
83869    36 months
53853    36 months
...
13070    36 months
45836    36 months
52330    36 months
48243    36 months
63256    36 months
42124    36 months
78043    36 months
925      36 months
74047    60 months
49957    36 months
23735    60 months
65882    60 months
55610    36 months
38576    36 months
3116     36 months
Name: Loan.Length, Length: 2500, dtype: object
In [371]:
loansData['Loan.Length'].value_counts()
Out[371]:
36 months    1952
60 months     548
dtype: int64

Conclusion of Step 2

  • Here we used the techniques we learned in Step 1.
    • Pull out a column from a data frame
    • Operate on it, perform some transformations
    • Replace the column in the original dataframe with this new column
  • We applied them to removing the ' months' suffix in the Loan.Length column.

Step 3: Remove bad data

In [372]:
loansData['Monthly.Income'].describe()
Out[372]:
count      2499.000000
mean       5688.931321
std        3963.118185
min         588.500000
25%        3500.000000
50%        5000.000000
75%        6800.000000
max      102750.000000
dtype: float64

First remove implausible values. We see the max value to be 102750.
This is a MONTHLY income of 100K dollars, which is certainly possible, but ....
highly implausible for a person seeking a loan of a few 10's of K dollars, i.e. implausible in this context.

In [373]:
loansData['Monthly.Income'].idxmax() # find the place where the max occurs
Out[373]:
54487

But there's a better way - a row filter i.e. an expression used as a way to restrict the rows in a dataframe.
In our case we want to eliminate rows above 100K dollars. i.e. only keep those less than 100K dollars.

In [374]:
loansData['Monthly.Income'][loansData['Monthly.Income'] < 100000]
Out[374]:
81174     6541.67
99592     4583.33
80059    11500.00
15825     3833.33
33182     3195.00
62403     4891.67
48808     2916.67
22090    13863.42
76404     3150.00
15867     5000.00
94971     3575.00
36911     4250.00
41200    14166.67
83869     9166.67
53853    11250.00
...
13070    7083.33
45836    2464.37
52330    8000.00
48243    3833.33
63256    5166.67
42124    4500.00
78043    3983.33
925      5166.67
74047    7083.33
49957    6666.67
23735    9250.00
65882    8903.25
55610    2166.67
38576    3500.00
3116     3875.00
Name: Monthly.Income, Length: 2498, dtype: float64
In [374]:
 
In [375]:
loansData['Monthly.Income'].describe()
Out[375]:
count      2499.000000
mean       5688.931321
std        3963.118185
min         588.500000
25%        3500.000000
50%        5000.000000
75%        6800.000000
max      102750.000000
dtype: float64
In [376]:
ldlt100 = ldb[ldb['Monthly.Income'] < 100000]
In [377]:
ldlt100
Out[377]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2498 entries, 0 to 2499
Data columns (total 15 columns):
index                             2498  non-null values
Amount.Requested                  2498  non-null values
Amount.Funded.By.Investors        2498  non-null values
Interest.Rate                     2498  non-null values
Loan.Length                       2498  non-null values
Loan.Purpose                      2498  non-null values
Debt.To.Income.Ratio              2498  non-null values
State                             2498  non-null values
Home.Ownership                    2498  non-null values
Monthly.Income                    2498  non-null values
FICO.Range                        2498  non-null values
Open.CREDIT.Lines                 2497  non-null values
Revolving.CREDIT.Balance          2497  non-null values
Inquiries.in.the.Last.6.Months    2497  non-null values
Employment.Length                 2498  non-null values
dtypes: float64(6), int64(2), object(7)
In [378]:
len(ldlt100)
Out[378]:
2498

Now drop any rows that have 'NA' values ie data not available.
In database terminiology these would be 'NULL' values.

In [379]:
ldb2 = ldlt100.dropna()
In [380]:
ldb2
Out[380]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2497 entries, 0 to 2499
Data columns (total 15 columns):
index                             2497  non-null values
Amount.Requested                  2497  non-null values
Amount.Funded.By.Investors        2497  non-null values
Interest.Rate                     2497  non-null values
Loan.Length                       2497  non-null values
Loan.Purpose                      2497  non-null values
Debt.To.Income.Ratio              2497  non-null values
State                             2497  non-null values
Home.Ownership                    2497  non-null values
Monthly.Income                    2497  non-null values
FICO.Range                        2497  non-null values
Open.CREDIT.Lines                 2497  non-null values
Revolving.CREDIT.Balance          2497  non-null values
Inquiries.in.the.Last.6.Months    2497  non-null values
Employment.Length                 2497  non-null values
dtypes: float64(6), int64(2), object(7)
In [381]:
len(ldb2)
Out[381]:
2497

So we dropped one row that had an NA value somewhere.

Conclusion of step 3

  • We used techniques we learned in step 1 to pick out a columns and operate on it
  • We also learned how to filter the data based on expressions involving column values
  • Finally we learnt how to drop NA values

Now that we have removed the "bad" data, let's take on the final data-cleaning task for this data set - converting the ranges to single integers.

Step 4: Change FICO range to a single value

Note that the FICO values are given in a range which is in the form of a string that looks like lowerlimit-upperlimit, eg 720-724.
We want to convert these values to a single int value representing the lower limit. e.g. 720 in the above example.

In [382]:
ficostr = ldb2['FICO.Range'] 
In [383]:
ficostr[0:10]
Out[383]:
0    735-739
1    715-719
2    690-694
3    695-699
4    695-699
5    670-674
6    720-724
7    705-709
8    685-689
9    715-719
Name: FICO.Range, dtype: object
In [384]:
ficostr[0]
Out[384]:
'735-739'
In [385]:
ficoint = [ int(x.split('-')[0]) for x in ficostr ]
In [386]:
ficoint[0:10]
Out[386]:
[735, 715, 690, 695, 695, 670, 720, 705, 685, 715]
In [387]:
ldb2['FICO.Range'] = ficoint
In [388]:
len(ficoint)
Out[388]:
2497
In [389]:
ldb2['FICO.Range']
Out[389]:
0     735
1     715
2     690
3     695
4     695
5     670
6     720
7     705
8     685
9     715
10    670
11    665
12    670
13    735
14    725
...
2485    765
2486    665
2487    660
2488    685
2489    670
2490    710
2491    720
2492    710
2493    675
2494    685
2495    705
2496    740
2497    680
2498    675
2499    670
Name: FICO.Range, Length: 2497, dtype: int64

Conclusion of Step 4

  • We used techniques similar to the ones in Step 1
    • we picked a column - the FICO.Range column
    • we split the values on the separator '-'
    • we picked the fist value, i.e. the lower limit
    • we converted it to an int

Now we have a dataset that we can use for our data exploration and analysis

In [389]: