Here we get into the detailed steps involved in cleaning up the Lending Club dataset. We glossed over this in the Data Exploration lesson and left it as an exercise; here we actually do the cleanup, step by step.
The cleanup tasks are:
- Remove the ' months' suffix
- Remove bad data
- Map FICO.Range to a single int
%pylab inline
import pandas as pd
from numpy import nan as NA
loansData = pd.read_csv('https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv')
loansData
loansData['Interest.Rate'][0:5] # first five rows of Interest.Rate
ir = loansData['Interest.Rate']
ir[0:5]
type(ir[0:1])
irbak = ir # note: this binds another name to the same Series; it is not a copy
type(irbak)
loansDataBak = loansData # likewise a second name for the same DataFrame, not an independent copy
loansDataBak.reset_index() # returns a new DataFrame; loansDataBak itself is unchanged
loansDataBak
loansDataBak['Interest.Rate']
loansData
ldb = loansDataBak.reset_index() # replace the loan-ID index with integer labels 0..n-1; the old index becomes a column
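The effect of reset_index can be seen on a tiny hypothetical frame (the index values below are made up, mimicking the loan-ID index of loansData):

```python
import pandas as pd

# Two-row sketch with a made-up loan-ID index
df = pd.DataFrame({'Amount.Requested': [20000, 19200]}, index=[81174, 99592])
df2 = df.reset_index()

# The old index becomes an ordinary column named 'index',
# and the rows are relabeled 0..n-1
print(df2.columns.tolist())  # ['index', 'Amount.Requested']
print(df2.index.tolist())    # [0, 1]
```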
ldb['Interest.Rate'][0:5]
irates = ldb['Interest.Rate'][0:]
type(irates)
[irates[k] for k in (0,1,2,3,4)]
# A first attempt that fails: irates[k] is a single string, so the
# comprehension iterates over its characters rather than over the rates.
#int_rates = [ float(x[:-1]) for x in irates[k] ]
#"""
#---------------------------------------------------------------------------
#ValueError Traceback (most recent call last)
#<ipython-input-46-5c4f2333efd1> in <module>()
# 5
# 6 [irates[k] for k in (0,1,2,3,4)]
#----> 7 int_rates = [ float(x[:-1]) for x in irates[k] ]
#
#ValueError: could not convert string to float:
#"""
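The commented-out traceback above is worth understanding. `irates[k]` is a single string such as '8.90%', so the comprehension loops over its characters; `x[:-1]` applied to a one-character string is the empty string, and `float('')` raises exactly the "could not convert string to float:" error shown. A minimal sketch (the rate values here are made up):

```python
# irates[k] is one string; iterating it yields single characters
s = "8.90%"
chars = [x[:-1] for x in s]
print(chars)  # every element is '' because each x is one character long

# float('') raises the ValueError seen in the traceback above
try:
    float('')
except ValueError as e:
    print(e)

# The correct version iterates over the collection of strings instead
rates = [float(x[:-1]) for x in ["8.90%", "12.12%"]]
print(rates)  # [8.9, 12.12]
```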
type(ldb)
srates = ldb['Interest.Rate']
#nas = [ x for x in srates if x.isnull() ] # AttributeError: 'str' object has no attribute 'isnull'
nas = [ x for x in srates if not(x[0].isdigit()) ] # instead, count entries that don't start with a digit, e.g. string 'NA' values
len(nas)
srates[0][:-1]
float(srates[0][:-1])
nopct = [ x[:-1] for x in srates ]
flrates = [float(x) for x in nopct]
flrates[0:5]
flrate = list(map(float, nopct)) # list() needed in Python 3, where map returns an iterator
flrate[0:5]
len(flrate)
ldb
ldb['Interest.Rate'] = flrate
ldb[0:5]
ldb['Interest.Rate'][0:5]
srates = loansData['Interest.Rate']
nopct = [ x[:-1] for x in srates ]
flrates = [float(x) for x in nopct]
rates = [float(x[:-1]) for x in srates] # use this: strip the trailing '%' and convert, in one step
flrates == rates
loansData['Interest.Rate'] = flrates
loansData['Interest.Rate'][0:5]
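The list comprehension works, but pandas can do the same conversion without leaving the Series, via the `.str` accessor. A sketch on made-up rate values:

```python
import pandas as pd

rates = pd.Series(['8.90%', '12.12%', '21.98%'], name='Interest.Rate')

# strip the trailing '%' from every element, then convert the whole Series
clean = rates.str.rstrip('%').astype(float)
print(clean.tolist())  # [8.9, 12.12, 21.98]
```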
withmons = ldb['Loan.Length']
wmons = withmons[0:]
wmons[0:5]
wmons
wmons[0].split()
wmons[0].split()[0]
int(wmons[0].split()[0])
x = wmons[0].split()
x[0]
int(x[0])
intmons = [ int(x.split()[0]) for x in wmons ]
intmons[0:10]
loansData['Loan.Length']
loansData['Loan.Length'].value_counts()
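As with the interest rates, the split-and-convert step for Loan.Length can be done vectorized with the `.str` accessor (sample values made up):

```python
import pandas as pd

lengths = pd.Series(['36 months', '60 months', '36 months'], name='Loan.Length')

# split each element on whitespace, take the first token, convert to int
months = lengths.str.split().str[0].astype(int)
print(months.tolist())  # [36, 60, 36]
```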
loansData['Monthly.Income'].describe()
First remove implausible values. We see that the max value is 102750.
That is a MONTHLY income of over 100K dollars, which is certainly possible, but
highly implausible for a person seeking a loan of a few tens of thousands of dollars, i.e. implausible in this context.
loansData['Monthly.Income'].idxmax() # find the place where the max occurs
But there's a better way: a row filter, i.e. a boolean expression used to restrict the rows of a DataFrame.
In our case we want to eliminate rows at 100K dollars or above, i.e. only keep those less than 100K dollars.
loansData['Monthly.Income'][loansData['Monthly.Income'] < 100000]
loansData['Monthly.Income'].describe()
ldlt100 = ldb[ldb['Monthly.Income'] < 100000]
ldlt100
len(ldlt100)
Now drop any rows that have NA values, i.e. data not available.
In database terminology these would be NULL values.
ldb2 = ldlt100.dropna()
ldb2
len(ldb2)
So we dropped one row that had an NA value somewhere.
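The behavior of dropna can be sketched on a tiny frame with one missing value (column names from loansData, values made up):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Monthly.Income': [6541.67, np.nan, 11500.0],
                   'Open.CREDIT.Lines': [14, 12, 11]})

# dropna drops any row containing at least one NaN
clean = df.dropna()
print(len(df), len(clean))  # 3 2
```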
Now that we have removed the "bad" data, let's take on the final data-cleaning task for this data set: converting the FICO ranges to single integers.
Note that the FICO values are given as a range, in the form of a string that looks like lowerlimit-upperlimit, e.g. 720-724.
We want to convert each of these to a single int representing the lower limit, e.g. 720 in the above example.
ficostr = ldb2['FICO.Range']
ficostr[0:10]
ficostr[0]
ficoint = [ int(x.split('-')[0]) for x in ficostr ]
ficoint[0:10]
ldb2['FICO.Range'] = ficoint # ldb2 was derived by filtering, so pandas may emit a SettingWithCopyWarning here
len(ficoint)
ldb2['FICO.Range']
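For completeness, the same FICO conversion can be done without a Python-level loop (sample ranges made up):

```python
import pandas as pd

fico = pd.Series(['735-739', '715-719', '690-694'], name='FICO.Range')

# split each 'lower-upper' string on '-' and keep the lower limit as an int
lower = fico.str.split('-').str[0].astype(int)
print(lower.tolist())  # [735, 715, 690]
```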
Now we have a dataset that we can use for our data exploration and analysis.
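The whole cleanup can be gathered into one function. This is a sketch that assumes the column names of loansData.csv; it is run here on a small made-up sample rather than the real file:

```python
import pandas as pd
import numpy as np

def clean_loans(raw):
    """Apply all cleanup steps from this lesson to a copy of the raw frame."""
    df = raw.copy()
    df['Interest.Rate'] = df['Interest.Rate'].str.rstrip('%').astype(float)
    df['Loan.Length'] = df['Loan.Length'].str.split().str[0].astype(int)
    df = df[df['Monthly.Income'] < 100000]   # drop implausible incomes
    df = df.dropna().copy()                  # drop rows with NA values
    df['FICO.Range'] = df['FICO.Range'].str.split('-').str[0].astype(int)
    return df

# Made-up three-row sample: one implausible income, one NA income
sample = pd.DataFrame({
    'Interest.Rate': ['8.90%', '12.12%', '21.98%'],
    'Loan.Length': ['36 months', '60 months', '36 months'],
    'Monthly.Income': [6541.67, 102750.0, np.nan],
    'FICO.Range': ['735-739', '715-719', '690-694'],
})
cleaned = clean_loans(sample)
print(len(cleaned))  # 1: only the first row survives the filters
```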