Here we get into the detailed steps involved in cleaning up the Lending Club dataset. We glossed over this in the Data Exploration lesson and left it as an exercise; here we actually do the cleanup, step by step.
The cleanup tasks are:
- Remove the ' months' suffix
- Remove bad data
- Map FICO.Range to a single int
%pylab inline
import pandas as pd
from numpy import nan as NA
loansData = pd.read_csv('https://spark-public.s3.amazonaws.com/dataanalysis/loansData.csv')
loansData
loansData['Interest.Rate'][0:5] # first five rows of Interest.Rate
ir = loansData['Interest.Rate']
ir[0:5]
type(ir[0:1])
irbak = ir # note: this binds another name to the same Series; it is not a copy
type(irbak)
loansDataBak = loansData # likewise a second name for the same DataFrame, not an independent copy
loansDataBak.reset_index() # returns a new DataFrame; loansDataBak itself is unchanged
loansDataBak
loansDataBak['Interest.Rate']
loansData
ldb = loansDataBak.reset_index() # replace the loan-ID index with integer labels 0..n-1; the old index becomes a column
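The effect of reset_index can be seen on a tiny hypothetical frame (the index values below are made up, mimicking the loan-ID index of loansData):

```python
import pandas as pd

# Two-row sketch with a made-up loan-ID index
df = pd.DataFrame({'Amount.Requested': [20000, 19200]}, index=[81174, 99592])
df2 = df.reset_index()

# The old index becomes an ordinary column named 'index',
# and the rows are relabeled 0..n-1
print(df2.columns.tolist())  # ['index', 'Amount.Requested']
print(df2.index.tolist())    # [0, 1]
```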
ldb['Interest.Rate'][0:5]
irates = ldb['Interest.Rate'][0:]
type(irates)
[irates[k] for k in (0,1,2,3,4)]
# A first attempt that fails: irates[k] is a single string, so the
# comprehension iterates over its characters rather than over the rates.
#int_rates = [ float(x[:-1]) for x in irates[k] ]
#"""
#---------------------------------------------------------------------------
#ValueError Traceback (most recent call last)
#<ipython-input-46-5c4f2333efd1> in <module>()
# 5
# 6 [irates[k] for k in (0,1,2,3,4)]
#----> 7 int_rates = [ float(x[:-1]) for x in irates[k] ]
#
#ValueError: could not convert string to float:
#"""
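The commented-out traceback above is worth understanding. `irates[k]` is a single string such as '8.90%', so the comprehension loops over its characters; `x[:-1]` applied to a one-character string is the empty string, and `float('')` raises exactly the "could not convert string to float:" error shown. A minimal sketch (the rate values here are made up):

```python
# irates[k] is one string; iterating it yields single characters
s = "8.90%"
chars = [x[:-1] for x in s]
print(chars)  # every element is '' because each x is one character long

# float('') raises the ValueError seen in the traceback above
try:
    float('')
except ValueError as e:
    print(e)

# The correct version iterates over the collection of strings instead
rates = [float(x[:-1]) for x in ["8.90%", "12.12%"]]
print(rates)  # [8.9, 12.12]
```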
type(ldb)
srates = ldb['Interest.Rate']
#nas = [ x for x in srates if x.isnull() ] # AttributeError: 'str' object has no attribute 'isnull'
nas = [ x for x in srates if not(x[0].isdigit()) ] # instead, count entries that don't start with a digit, e.g. string 'NA' values
len(nas)
srates[0][:-1]
float(srates[0][:-1])
nopct = [ x[:-1] for x in srates ]
flrates = [float(x) for x in nopct]
flrates[0:5]
flrate = list(map(float, nopct)) # list() needed in Python 3, where map returns an iterator
flrate[0:5]
len(flrate)
ldb
ldb['Interest.Rate'] = flrate
ldb[0:5]
ldb['Interest.Rate'][0:5]
srates = loansData['Interest.Rate']
nopct = [ x[:-1] for x in srates ]
flrates = [float(x) for x in nopct]
rates = [float(x[:-1]) for x in srates] # use this: strip the trailing '%' and convert, in one step
flrates == rates
loansData['Interest.Rate'] = flrates
loansData['Interest.Rate'][0:5]
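The list comprehension works, but pandas can do the same conversion without leaving the Series, via the `.str` accessor. A sketch on made-up rate values:

```python
import pandas as pd

rates = pd.Series(['8.90%', '12.12%', '21.98%'], name='Interest.Rate')

# strip the trailing '%' from every element, then convert the whole Series
clean = rates.str.rstrip('%').astype(float)
print(clean.tolist())  # [8.9, 12.12, 21.98]
```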
withmons = ldb['Loan.Length']
wmons = withmons[0:]
wmons[0:5]
wmons
wmons[0].split()
wmons[0].split()[0]
int(wmons[0].split()[0])
x = wmons[0].split()
x[0]
int(x[0])
intmons = [ int(x.split()[0]) for x in wmons ]
intmons[0:10]
loansData['Loan.Length']
loansData['Loan.Length'].value_counts()
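As with the interest rates, the split-and-convert step for Loan.Length can be done vectorized with the `.str` accessor (sample values made up):

```python
import pandas as pd

lengths = pd.Series(['36 months', '60 months', '36 months'], name='Loan.Length')

# split each element on whitespace, take the first token, convert to int
months = lengths.str.split().str[0].astype(int)
print(months.tolist())  # [36, 60, 36]
```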
loansData['Monthly.Income'].describe()
First remove implausible values. We see that the max value is 102750.
That is a MONTHLY income of over 100K dollars, which is certainly possible, but
highly implausible for a person seeking a loan of a few tens of thousands of dollars, i.e. implausible in this context.
loansData['Monthly.Income'].idxmax() # find the place where the max occurs
But there's a better way: a row filter, i.e. a boolean expression used to restrict the rows of a DataFrame.
In our case we want to eliminate rows at 100K dollars or above, i.e. only keep those less than 100K dollars.
loansData['Monthly.Income'][loansData['Monthly.Income'] < 100000]
loansData['Monthly.Income'].describe()
ldlt100 = ldb[ldb['Monthly.Income'] < 100000]
ldlt100
len(ldlt100)
Now drop any rows that have NA values, i.e. data not available.
In database terminology these would be NULL values.
ldb2 = ldlt100.dropna()
ldb2
len(ldb2)
So we dropped one row that had an NA value somewhere.
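The behavior of dropna can be sketched on a tiny frame with one missing value (column names from loansData, values made up):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Monthly.Income': [6541.67, np.nan, 11500.0],
                   'Open.CREDIT.Lines': [14, 12, 11]})

# dropna drops any row containing at least one NaN
clean = df.dropna()
print(len(df), len(clean))  # 3 2
```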
Now that we have removed the "bad" data, let's take on the final data-cleaning task for this data set: converting the FICO ranges to single integers.
Note that the FICO values are given as a range, in the form of a string that looks like lowerlimit-upperlimit, e.g. 720-724.
We want to convert each of these to a single int representing the lower limit, e.g. 720 in the above example.
ficostr = ldb2['FICO.Range']
ficostr[0:10]
ficostr[0]
ficoint = [ int(x.split('-')[0]) for x in ficostr ]
ficoint[0:10]
ldb2['FICO.Range'] = ficoint # ldb2 was derived by filtering, so pandas may emit a SettingWithCopyWarning here
len(ficoint)
ldb2['FICO.Range']
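For completeness, the same FICO conversion can be done without a Python-level loop (sample ranges made up):

```python
import pandas as pd

fico = pd.Series(['735-739', '715-719', '690-694'], name='FICO.Range')

# split each 'lower-upper' string on '-' and keep the lower limit as an int
lower = fico.str.split('-').str[0].astype(int)
print(lower.tolist())  # [735, 715, 690]
```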
Now we have a dataset that we can use for our data exploration and analysis.
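The whole cleanup can be gathered into one function. This is a sketch that assumes the column names of loansData.csv; it is run here on a small made-up sample rather than the real file:

```python
import pandas as pd
import numpy as np

def clean_loans(raw):
    """Apply all cleanup steps from this lesson to a copy of the raw frame."""
    df = raw.copy()
    df['Interest.Rate'] = df['Interest.Rate'].str.rstrip('%').astype(float)
    df['Loan.Length'] = df['Loan.Length'].str.split().str[0].astype(int)
    df = df[df['Monthly.Income'] < 100000]   # drop implausible incomes
    df = df.dropna().copy()                  # drop rows with NA values
    df['FICO.Range'] = df['FICO.Range'].str.split('-').str[0].astype(int)
    return df

# Made-up three-row sample: one implausible income, one NA income
sample = pd.DataFrame({
    'Interest.Rate': ['8.90%', '12.12%', '21.98%'],
    'Loan.Length': ['36 months', '60 months', '36 months'],
    'Monthly.Income': [6541.67, 102750.0, np.nan],
    'FICO.Range': ['735-739', '715-719', '690-694'],
})
cleaned = clean_loans(sample)
print(len(cleaned))  # 1: only the first row survives the filters
```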