Chapter 12 -- Additional Data Handling

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame, Index
from IPython.display import Image

Sorting and Sort Sequences

The example below is used in the Time Series Walk-Through in Chapter 09 -- Pandas Time Series and Date Handling. It uses the read_csv() function to construct the 'df_states' DataFrame, beginning its read on row 3,082.

In [2]:
df_states = pd.read_csv("C:\\Data\\HPI_master.csv",
            skiprows=3082,
            usecols=(0, 1, 2, 3, 4, 5, 6, 7, 8),
            names=('hpi_type', 'hpi_flavor', 'frequency', 'level', 'place_name', 'place_id', 'yr', 'period', 'index_nsa'),
            header=None)

The following SAS Data Step reads the same .csv file using FIRSTOBS= to begin reading from an arbitrary row position.

    /********************************/
    /* c09_read()_csv_df_states.sas */
    /********************************/
    data df_states;
          infile 'C:\Data\HPI_master.csv' delimiter=',' missover dsd firstobs=3081; 
          informat hpi_type $12.
                   hpi_flavor $16.
                   frequency $9.
                   level $28.
                   place_name $33.
                   place_id $8.
                   yr $5.
                   period $6.
                   index_nsa 8.;
             input hpi_type $
                   hpi_flavor $
                   frequency $
                   level $
                   place_name $
                   place_id $
                   yr $
                   period $
                   index_nsa ;
    run;

Verify the read using the .shape attribute.

In [3]:
df_states.shape
Out[3]:
(96244, 9)

Inspect the first 5 rows of the 'df_states' DataFrame.

In [4]:
df_states.head()
Out[4]:
hpi_type hpi_flavor frequency level place_name place_id yr period index_nsa
0 traditional all-transactions quarterly MSA Abilene, TX 10180 1986 4 94.95
1 traditional all-transactions quarterly MSA Abilene, TX 10180 1987 1 100.91
2 traditional all-transactions quarterly MSA Abilene, TX 10180 1987 2 100.76
3 traditional all-transactions quarterly MSA Abilene, TX 10180 1987 3 94.02
4 traditional all-transactions quarterly MSA Abilene, TX 10180 1987 4 91.43

Inspect the first 5 rows of the SAS data set df_states.

In [5]:
Image(filename='Anaconda3\\output\\df_states_output.JPG')  
Out[5]:

Examine the default sort by supplying the sort key as the only required argument to the .sort_values() method. The sort operation creates the new 'default_srt' DataFrame.

In [6]:
default_srt = df_states.sort_values('index_nsa')

Display the first 4 rows of the sorted 'default_srt' DataFrame.

In [7]:
default_srt.iloc[0:4]
Out[7]:
hpi_type hpi_flavor frequency level place_name place_id yr period index_nsa
42692 traditional all-transactions quarterly MSA San Francisco-Redwood City-South San Francisco... 41884 1975 3 18.31
42856 traditional all-transactions quarterly MSA San Jose-Sunnyvale-Santa Clara, CA 41940 1975 4 18.82
48737 traditional all-transactions quarterly MSA Honolulu ('Urban Honolulu'), HI 46520 1981 4 18.91
42693 traditional all-transactions quarterly MSA San Francisco-Redwood City-South San Francisco... 41884 1975 4 19.26

Examining the first four rows of the sorted DataFrame 'default_srt' shows the default sort sequence is ascending. Of course, the documentation for pandas.DataFrame.sort_values confirms this as well.
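
As an aside, .sort_values() also accepts lists for its by= and ascending= arguments, sorting on several keys with mixed directions. A minimal sketch, assuming the 'df_states' columns above (the 'multi_srt' name is ours, not the chapter's):

    # Sketch: sort ascending by place_name, then descending by yr within
    # each place name.
    multi_srt = df_states.sort_values(by=['place_name', 'yr'],
                                      ascending=[True, False])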

The default SAS sort syntax is:

    /******************************************************/
    /* c12_print_first4_rows_sorted.sas                   */
    /******************************************************/
    84       proc sort data=df_states;
    85          by index_nsa;
    NOTE: 96244 observations were read from "WORK.df_states"
    NOTE: Data set "WORK.df_states" has 96244 observation(s) and 9 variable(s)
    86
    87       proc print data=df_states (obs=4);

Like pandas, the default SAS sort sequence is ascending. This is confirmed by the SAS documentation.

In [8]:
Image(filename='Anaconda3\\output\\default_first_four.JPG')  
Out[8]:

In contrast to the pandas default sort, the default SAS sort sequence places missing values first in the data set, as illustrated above.

Displaying the last 4 rows in the 'default_srt' DataFrame finds the 2 NaNs sorted last, since this is the default location where pandas places missing values.

In [9]:
default_srt.tail(4)  
Out[9]:
hpi_type hpi_flavor frequency level place_name place_id yr period index_nsa
54262 traditional all-transactions quarterly State District of Columbia DC 2016 1 780.98
54263 traditional all-transactions quarterly State District of Columbia DC 2016 2 791.71
60576 traditional all-transactions quarterly State Vermont VT 1976 1 NaN
61098 traditional all-transactions quarterly State West Virginia WV 1982 1 NaN

Because pandas places NaNs last in the sort sequence by default, sorting can be used as an alternative to Boolean masks with the .loc indexer for detecting missing values.
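
A minimal sketch of the Boolean-mask alternative just mentioned (the 'missing_rows' name is ours):

    # Sketch: .isnull() builds the Boolean mask and .loc applies it,
    # returning only the rows where index_nsa is missing.
    missing_rows = df_states.loc[df_states['index_nsa'].isnull()]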

The SAS program to access the last four observations in the data set is:

    /******************************************************/
    /* c12_print_last4_rows_sorted.sas                    */
    /******************************************************/
    79         data last4;  
    80           set df_states (firstobs=96241);
    81           by index_nsa;

    NOTE: 4 observations were read from "WORK.df_states"
In [10]:
Image(filename='Anaconda3\\output\\sort_last_four.JPG')  
Out[10]:

Create the 'states_desc' DataFrame by supplying additional arguments to the .sort_values() method. The example below sorts descending, placing missing values at the beginning of the DataFrame.

In [11]:
states_desc = df_states.sort_values('index_nsa', ascending=False, na_position='first')

Setting the na_position= argument to 'first' places NaNs at the beginning of the sort sequence. This is the same behavior as the SAS sort sequence. Details describing the SAS sort order for missing values are in the SAS documentation.

The first two rows in the 'states_desc' DataFrame contain the NaN values for the 'index_nsa' column, while the next 2 rows contain the highest values.

In [12]:
states_desc.iloc[0:4,]
Out[12]:
hpi_type hpi_flavor frequency level place_name place_id yr period index_nsa
60576 traditional all-transactions quarterly State Vermont VT 1976 1 NaN
61098 traditional all-transactions quarterly State West Virginia WV 1982 1 NaN
54263 traditional all-transactions quarterly State District of Columbia DC 2016 2 791.71
54262 traditional all-transactions quarterly State District of Columbia DC 2016 1 780.98

SAS has 28 missing value indicators for numerics, described in the SAS documentation. However, if you want missing values to sort first in a descending sort sequence, the missing value indicator would need to be an actual numeric value larger than the largest non-missing value. In other words, the SAS missing value indicators for numerics always sort as the smallest numeric values.

The SAS example below sorts the df_states data set in descending order by the 'index_nsa' variable. Like the pandas example above, the

out=states_srt2

syntax creates the new SAS data set 'states_srt2'.

    /******************************************************/
    /* c12_print_last4_rows_sorted_descending.sas         */
    /******************************************************/
    44        proc sort data=df_states
    45           out=states_srt2;
    46           by descending index_nsa;
    NOTE: 96244 observations were read from "WORK.df_states"
    NOTE: Data set "WORK.states_srt2" has 96244 observation(s) and 9 variable(s)
    47        
    48         data first4;
    49           set states_srt2 (obs=4);
    50           by descending index_nsa;
In [13]:
Image(filename='Anaconda3\\output\\sort_last4_descending.JPG')  
Out[13]:

The section below examines a number of common data management patterns for analysis.

1. Read the Lending Club Loan Status file.

In [14]:
lc = pd.read_csv("C:\\Data\\LC_Loan_Stats.csv",
                 low_memory=False)     
lc.shape
Out[14]:
(42633, 51)

Check for missing values. Clearly, some columns are not useful.

In [15]:
lc.isnull().sum()
Out[15]:
ID                                 2
Member_ID                          3
Loan_Amnt                          3
Term                               3
Int_Rate                           3
Installment                        3
Grade                              3
Sub_Grade                          3
Emp_Length                         3
Home_Ownership                     3
Annual_Inc                         7
Verification_Status                3
Loan_Status                        3
Purpose                            3
Zip_Code                           3
Addr_State                         3
DTI                                3
Delinq_2yrs                       32
Earliest_Cr_Line                  32
Inq_Last_6mths                    32
Mths_Since_Last_Delinq         26998
Mths_Since_Last_Record         38978
Open_Acc                          32
Pub_Rec                           32
Revol_Bal                          3
Revol_Util                        93
Total_Acc                         32
Initial_List_Status                3
Out_Prncp                          3
Out_Prncp_Inv                      3
Total_Pymnt                        3
Total_Pymnt_Inv                    3
Total_Rec_Prncp                    3
Total_Rec_Int                      3
Total_Rec_Late_Fee                 3
Recoveries                         3
Collection_Recovery_Fee            3
Last_Pymnt_D                      85
Last_Pymnt_Amnt                    3
Next_Pymnt_D                   39412
Last_Credit_Pull_D                 7
Collections_12_Mths_Ex_Med       148
Mths_Since_Last_Major_Derog    42633
Policy_Code                        3
Application_Type                   3
Annual_Inc_Joint               42633
Acc_Now_Delinq                    32
Chargeoff_Within_12_Mths         148
Delinq_Amnt                       32
Pub_Rec_Bankruptcies            1368
Tax_Liens                        108
dtype: int64
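
As an aside, rather than hand-picking the useful columns as the next step does, columns dominated by missing values can be dropped programmatically. A hedged sketch, with a threshold of our choosing (the 'lc_trimmed' name is ours, not the chapter's):

    # Sketch: drop any column having fewer than len(lc)//2 non-missing
    # values; thresh= is the count of non-missing values a column must
    # have to survive.
    lc_trimmed = lc.dropna(axis=1, thresh=len(lc) // 2)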

Drop/Keep Columns

Keep some of the columns in the 'lc' DataFrame.

In [16]:
lc = lc[['ID', 'Member_ID', 'Loan_Amnt', 'Term', 'Int_Rate', 'Installment', 'Grade', 'Sub_Grade', 'Emp_Length',
                     'Home_Ownership', 'Annual_Inc', 'Verification_Status', 'Loan_Status', 'Purpose', 'Zip_Code',
                     'Addr_State', 'DTI', 'Delinq_2yrs', 'Earliest_Cr_Line', 'Inq_Last_6mths', 'Open_Acc', 'Revol_Bal',                          
                     'Revol_Util']]

Rename Columns

In [17]:
lc = lc.rename(columns = {
           'Member_ID'              :'mem_id',
           'Loan_Amnt'              :'ln_amt',
           'Int_Rate'               :'rate',
           'Home_Ownership'         :'own_rnt',
           'Verification_Status'    :'vrfy_stat',
           'Loan_Status'            :'ln_stat',
           'Addr_State'             :'state',
           'Earliest_Cr_Line'       :'earliest_ln'
    })

Lower-case all the 'lc' DataFrame column names.

In [18]:
lc.columns = map(str.lower, lc.columns)
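
In Python 3, map() returns an iterator, which pandas materializes into an Index. An equivalent, arguably more idiomatic, sketch uses the Index's vectorized .str accessor:

    # Equivalent sketch: a pandas Index exposes vectorized string methods
    # through the .str accessor.
    lc.columns = lc.columns.str.lower()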

An investigation of the LC_Loan_Stats.csv file reveals it has 2 parts. The first part, rows 0 to 39786, contains data for loans meeting Lending Club's credit policies. Rows 39790 to the end contain data for loans made outside the normal lending policies.

Create the 'lc1' DataFrame by reading a subset of columns and rows.

In [19]:
lc1 = pd.read_csv("C:\\Data\\LC_Loan_Stats.csv", 
          low_memory=False,            
          usecols=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 15, 16, 17, 18, 19, 22, 24, 25),
          names=('id','mem_id', 'ln_amt', 'term','rate', 'm_pay', 'grade', 'sub_grd', 'emp_len',
                 'own_rnt', 'income', 'ln_stat', 'purpose', 'state', 'dti', 'delinq_2yrs', 'ln_fst',
                'inq_6mnth', 'open_acc', 'revol_bal', 'revol_util' ),
          skiprows=1,
          nrows=39786,
          header=None)
lc1.shape
Out[19]:
(39786, 21)

The analog SAS program reads subsets of rows and columns from the 'LC_Loan_Stats.csv' file.

    /******************************************************/
    /* c12_read_1st_half_loan_cvs.sas                     */
    /******************************************************/
    4         options obs=39787;
    5         proc import datafile="c:\data\LC_Loan_Stats.csv"
    6              dbms=csv
    7              out=loans
    8              replace;
    9
    10          data lc1(rename=(Member_ID     = mem_id
    12                        Loan_Amnt        = ln_amt
    13                        Int_Rate         = rate
    14                        Installment      = m_pay
    15                        Sub_Grade        = sub_grd
    16                        Emp_Length       = emp_len
    17                        Home_Ownership   = own_rnt
    18                        Annual_Inc       = income
    19                        Loan_Status      = ln_stat
    20                        Addr_State       = state
    21                        Earliest_Cr_Line = ln_fst
    22                        Inq_Last_6mths   = inq_6mnth));
    23       
    24         set loans(keep =  ID
    25                           Member_ID
    26                           Loan_Amnt
    27                           Term
    28                           Int_Rate
    29                           Installment
    30                           Grade
    31                           Sub_Grade
    32                           Emp_Length
    33                           Home_Ownership
    34                           Annual_Inc
    35                           Loan_Status
    36                           Purpose
    37                           Addr_State
    38                           DTI
    39                           Delinq_2yrs
    40                           Earliest_Cr_Line
    41                           Inq_Last_6mths
    42                           Open_Acc
    43                           Revol_Bal
    44                           Revol_Util);
    45
    46        ln_plcy = 'True';

    NOTE: 39786 observations were read from "WORK.loans"
    NOTE: Data set "WORK.lc1" has 39786 observation(s) and 22 variable(s)

Find Duplicate Values

In the 'lc1' DataFrame we expect to find no duplicate 'id' values. The .duplicated() method returns a Boolean Series marking duplicate values True, in our case for the lc1['id'] column. The .duplicated() method for DataFrames is documented in the pandas API reference.

Setting keep=False marks every occurrence of a duplicate as True. Create the new 'lc1_dups' DataFrame containing the duplicates for the lc1['id'] column. If its length is zero, there are no duplicates. Locating the duplicate rows is not the same as removing them, which is illustrated below.

Begin by setting the index to the lc1['id'] column. Next, create a Boolean mask locating the duplicate values for the lc1['id'] column. The keep='first' argument behaves similarly to the SAS FIRST.variable in BY-Group processing. We want to ignore the first occurrence of each duplicate 'id' value and extract the remaining ones.

In [20]:
lc1.set_index('id', inplace=True, drop=False)
dup_mask = lc1.duplicated('id', keep='first')

Extract Duplicate Values

Apply the Boolean mask using the .loc indexer to create the new 'lc1_dups' DataFrame. The .shape attribute shows there are 35 rows with duplicate 'id' values.

In [21]:
lc1_dups = lc1.loc[dup_mask]
lc1_dups.shape
Out[21]:
(35, 21)
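
To inspect both members of each duplicate pair rather than only the later occurrences, keep=False marks every occurrence as True. A minimal sketch (the 'all_dups' name is ours):

    # Sketch: keep=False flags every occurrence of a duplicated 'id',
    # returning the first occurrences as well as the later ones.
    all_dups = lc1.loc[lc1.duplicated('id', keep=False)]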

The SAS SORT option NODUPKEY checks for and eliminates observations with duplicate BY values. The DUPOUT= option names the target SAS data set to write the duplicates.

    /******************************************************/
    /* c12_sort_nodupkey.sas                              */
    /******************************************************/
    50       proc sort data = lc1 dupout=lc1_dups nodupkey;
    51          by id;
    NOTE: 39786 observations were read from "WORK.lc1"
    NOTE: 35 observations were found and deleted due to having duplicate sort keys
    NOTE: Data set "WORK.lc1" has 39751 observation(s) and 21 variable(s)
    NOTE: Data set "WORK.lc1_dups" has 35 observation(s) and 21 variable(s)

Drop Duplicate Rows

In [22]:
lc1.shape
Out[22]:
(39786, 21)

Use the .drop_duplicates() method to drop duplicate values.

In [23]:
lc1.drop_duplicates(['id'], keep='first', inplace=True)
In [24]:
lc1.shape 
Out[24]:
(39751, 21)

Add a New DataFrame Column

Create the new column lc1['ln_plcy'] as a flag to indicate all rows read from row 2 to 39786 are loans granted within standard lending guidelines.

In [25]:
lc1['ln_plcy'] = lc1['id'].map(lambda x: True)
lc1.shape
Out[25]:
(39751, 22)
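
As an aside, the lambda above ignores its argument, so the same flag can be created by broadcasting a scalar. A minimal sketch:

    # Equivalent sketch: assigning a scalar broadcasts True to every row.
    lc1['ln_plcy'] = True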

Cast Strings to Floats

In [26]:
lc1.describe(include=['O'])
Out[26]:
term rate grade sub_grd emp_len own_rnt ln_stat purpose state ln_fst revol_util
count 39751 39751 39751 39751 39751 39751 39751 39751 39751 39751 39701
unique 2 371 7 35 12 5 7 14 50 529 1092
top 36 months 10.99% B B3 10+ years RENT Fully Paid debt_consolidation CA Nov-98 0%
freq 29070 958 12021 2918 8891 18893 33639 18661 7095 371 980

Both the lc1['rate'] and lc1['revol_util'] column values are formatted with Excel's percent format (%), making them string values. The .replace() method replaces the '%' with an empty string. The .astype() method converts the string into a float, and the resulting float is divided by 100.

In [27]:
lc1['revol_util'] = lc1.revol_util.replace('%','',regex=True).astype('float')/100
lc1['rate'] = lc1.rate.replace('%','',regex=True).astype('float')/100
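
As an aside, pd.to_numeric() offers a more defensive version of the same cast. A sketch, shown for a column still holding the raw '%' strings (running it after the cell above would fail, since 'rate' is already a float):

    # Alternative sketch to the cast above, for a column that still holds
    # strings: .str.rstrip() drops the trailing '%' and errors='coerce'
    # turns any unparseable value into NaN instead of raising.
    lc1['rate'] = pd.to_numeric(lc1['rate'].str.rstrip('%'),
                                errors='coerce') / 100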

With the lc1['revol_util'] column values cast to floats, use the .fillna() method to replace missing values with the mean calculated from the lc1['revol_util'] column.

In [28]:
lc1["revol_util"] = lc1[["revol_util"]].fillna(lc1.revol_util.mean())
lc1.revol_util.isnull().sum()
Out[28]:
0

The analog SAS program uses PROC SQL to insert the mean value for 'revol_util' into the SAS macro variable 'mean_revol'. A WHERE clause locates the NULL values for the variable 'revol_util', and the UPDATE statement replaces them with the value held in the macro variable &mean_revol.

The CREATE TABLE statement starting at line 13 tests if there are missing values found for the column 'revol_util'.

    /******************************************************/
    /* c12_update_missing_revol_until.sas                 */
    /******************************************************/
    6       proc sql;
    7          select mean(revol_util) format 6.2 into :mean_revol
    8          from lc1;
    9       update lc1
    10            set revol_util = &mean_revol
    11            where revol_util is null;
    NOTE: 50 record(s) updated in table WORK.lc1
    12       
    13       create table miss as
    14          select revol_util
    15          from lc1
    16          where revol_util is null;
    NOTE: Data set "WORK.miss" has 0 observation(s) and 1 variable(s)
    17       quit;

2. Read the second portion of the LC_Loan_Stats.csv file beginning with row 39790.

Create the 'lc0' DataFrame by reading the loan status file, supplying a tuple of integers for the usecols= argument and a tuple of names for the column labels.

In [29]:
lc0 = pd.read_csv("C:\\Data\\LC_Loan_Stats.csv", 
          low_memory=False,            
          usecols=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 15, 16, 17, 18, 19, 22, 24, 25),
          names=('id','mem_id', 'ln_amt', 'term','rate', 'm_pay', 'grade', 'sub_grd', 'emp_len',
                 'own_rnt', 'income', 'ln_stat', 'purpose', 'state', 'dti', 'delinq_2yrs', 'ln_fst',
                'inq_6mnth', 'open_acc', 'revol_bal', 'revol_util' ),
          skiprows=39790,
          header=None)
lc0.shape
Out[29]:
(2844, 21)

Create the column lc0['ln_plcy'] with a value of False to indicate rows 39790 to the end of the file are loans made outside the standard credit lending policy. The assignment below uses the .map() method with an anonymous lambda function to create the new column lc0['ln_plcy'].

In [30]:
lc0['ln_plcy'] = lc0['id'].map(lambda x: False)

Similar to the above, the lc0['rate'] and lc0['revol_util'] column values are formatted with Excel's percent format (%), making them string values. The .replace() method replaces the '%' with an empty string. The .astype() method converts the string into a float, and the resulting float is divided by 100.

In [31]:
lc0['revol_util'] = lc0.revol_util.replace('%','',regex=True).astype('float')/100
lc0['rate'] = lc0.rate.replace('%','',regex=True).astype('float')/100

Calculate the mean value for 'revol_util' and use the .fillna() method to replace the missing values.

In [32]:
lc0['revol_util'] = lc0["revol_util"].fillna(lc0.revol_util.mean())

Return the number of missing values.

lc0.revol_util.isnull().sum()

Concatenating DataFrames (Join)

Chapter 7 -- Pandas, Part 2 covers joining operations by illustrating the pd.merge() method, which follows SQL's relational algebra. An alternative is the pd.concat() function, described in the pandas API reference.

In [33]:
print(lc0.shape)
print(lc1.shape)
(2844, 22)
(39751, 22)

Use the pd.concat() function to concatenate the 'lc1' and 'lc0' DataFrames.

In [34]:
frames = [lc1, lc0]
df = pd.concat(frames)
df.shape
Out[34]:
(42595, 22)
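
Note that 'lc1' carries 'id' values as its index (set earlier with set_index()) while 'lc0' keeps its default RangeIndex, so the concatenated index mixes label types. A sketch using the ignore_index= argument (the 'df_reindexed' name is ours):

    # Sketch: ignore_index=True discards both source indexes and builds a
    # clean 0..n-1 RangeIndex on the concatenated result.
    df_reindexed = pd.concat(frames, ignore_index=True)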

The analog SAS program uses the SET statement to concatenate the data sets 'lc0' and 'lc1'. The variable names are identical in both data sets.

    /******************************************************/
    /* c12_concatenate_lc0_lc1.sas                        */
    /******************************************************/
    7       data df;
    8          set lc0
    9              lc1;
    10       run;

    NOTE: 2844 observations were read from "WORK.lc0"
    NOTE: 39751 observations were read from "WORK.lc1"
    NOTE: Data set "WORK.df" has 42595 observation(s) and 22 variable(s)

Crosstabs

In [35]:
pd.crosstab(df.grade, df.own_rnt , margins=True)
Out[35]:
own_rnt MORTGAGE NONE OTHER OWN RENT All
grade
A 5227 5 25 872 4073 10202
B 5427 2 36 957 5986 12408
C 3587 0 24 642 4494 8747
D 2368 0 26 433 3198 6025
E 1499 0 14 242 1646 3401
F 620 1 4 75 600 1300
G 253 0 7 36 216 512
All 18981 8 136 3257 20213 42595
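
As an aside, pd.crosstab() also accepts a normalize= argument. A minimal sketch converting each row of counts into proportions:

    # Sketch: normalize='index' divides each cell by its row total, so
    # each row of proportions sums to 1.
    pd.crosstab(df.grade, df.own_rnt, normalize='index')
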
In [36]:
pd.crosstab([df.grade, df.own_rnt], df.purpose, margins=True)
Out[36]:
purpose car credit_card debt_consolidation educational home_improvement house major_purchase medical moving other renewable_energy small_business vacation wedding All
grade own_rnt
A MORTGAGE 359 685 1813 36 806 42 458 100 49 540 20 187 60 72 5227
NONE 0 0 1 0 0 0 0 0 0 4 0 0 0 0 5
OTHER 1 4 7 2 2 1 3 1 0 2 0 2 0 0 25
OWN 62 90 315 3 99 6 90 23 6 116 2 26 14 20 872
RENT 195 638 1761 47 66 53 287 89 118 460 8 136 61 154 4073
B MORTGAGE 252 732 2400 28 742 40 237 90 33 450 20 270 43 90 5427
NONE 0 0 0 0 0 0 0 0 0 2 0 0 0 0 2
OTHER 1 3 17 1 3 1 4 0 0 2 0 3 0 1 36
OWN 44 96 409 10 104 8 71 19 11 122 3 33 11 16 957
RENT 208 871 2948 72 87 67 312 100 147 709 14 193 77 181 5986
C MORTGAGE 98 487 1671 30 467 31 139 63 18 297 9 194 16 67 3587
OTHER 1 2 8 0 1 0 1 1 1 7 0 2 0 0 24
OWN 27 57 278 10 69 10 48 15 7 77 3 28 2 11 642
RENT 132 605 2238 75 68 44 221 74 102 576 6 162 54 137 4494
D MORTGAGE 49 277 1172 12 268 21 88 43 8 177 5 180 14 54 2368
OTHER 0 4 12 0 0 0 0 2 1 3 1 3 0 0 26
OWN 20 43 205 4 47 3 24 6 1 53 1 14 3 9 433
RENT 80 361 1743 36 40 35 162 48 65 371 4 145 19 89 3198
E MORTGAGE 16 170 790 3 172 14 42 26 6 126 2 105 8 19 1499
OTHER 0 3 5 0 0 0 0 0 1 4 0 1 0 0 14
OWN 8 23 113 6 25 5 11 8 2 27 1 9 2 2 242
RENT 42 164 918 28 31 19 71 22 33 147 2 109 12 48 1646
F MORTGAGE 5 71 345 3 63 5 13 7 2 40 1 56 1 8 620
NONE 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1
OTHER 0 0 1 0 1 0 0 1 0 0 0 1 0 0 4
OWN 2 4 33 0 8 1 5 0 1 10 0 11 0 0 75
RENT 9 53 344 8 5 13 13 9 11 61 2 55 2 15 600
G MORTGAGE 1 29 129 1 25 6 1 3 1 16 0 36 0 5 253
OTHER 0 0 2 2 0 1 0 0 0 1 0 1 0 0 7
OWN 0 2 16 1 2 1 4 1 0 3 0 6 0 0 36
RENT 2 21 119 4 2 0 5 2 5 23 2 25 1 5 216
All 1614 5495 19813 422 3203 427 2310 753 629 4427 106 1993 400 1003 42595

Sampling

In [37]:
samp = df.sample(n=300)
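
As an aside, .sample() also accepts a frac= argument to draw a proportion of rows, and random_state= to make the draw reproducible. A minimal sketch (the 'samp_frac' name is ours):

    # Sketch: draw a reproducible 1% sample instead of a fixed 300 rows.
    samp_frac = df.sample(frac=0.01, random_state=42)
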
In [38]:
import bokeh.charts
import bokeh.charts.utils
import bokeh.io
import bokeh.models
import bokeh.palettes
import bokeh.plotting
from bokeh.charts import Bar, Scatter, BoxPlot, Histogram
bokeh.io.output_notebook()
Loading BokehJS ...
In [39]:
p = BoxPlot(samp, values='income', label='grade', color='firebrick',
            title="Income by Credit Grade")

bokeh.io.show(p)

Binning Continuous Values

Locate the minimum and maximum values. Scalars are returned; retrieving the full row containing a minimum or maximum value is sketched below.

In [40]:
print(df.dti.min())
print(df.dti.max())
0.0
29.99
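
A minimal sketch of retrieving the full rows behind those scalars, using .idxmin() and .idxmax() to obtain the index labels of the extreme values (the 'row_min' and 'row_max' names are ours):

    # Sketch: .idxmin()/.idxmax() return index labels; .loc maps a label
    # back to its row (or rows, if index labels are duplicated).
    row_min = df.loc[df['dti'].idxmin()]
    row_max = df.loc[df['dti'].idxmax()]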

Create the list 'bins' containing integer values used to define bin levels. Construct the list 'names' to define the labels for the bins.

In [41]:
bins = [0, 5, 10, 15, 20, 25, 30]
names=['0-5%', '5-10%', '10-15%', '15-20%', '20-25%', '25-30%']
df['dti_cat'] = pd.cut(df['dti'], bins, labels=names, include_lowest=True)

Call the pd.value_counts() function to return a count of values for each bin level.

In [42]:
pd.value_counts(df['dti_cat'])
Out[42]:
10-15%    10502
15-20%     9775
5-10%      8563
20-25%     7376
0-5%       5552
25-30%      827
Name: dti_cat, dtype: int64

Return the count of missing values for the df['income'] column.

In [43]:
df['income'].isnull().sum()
Out[43]:
4

Call the .fillna() method to replace missing values for the df['income'] column with the calculated column mean. Additional examples for detecting and replacing missing values are described in the pandas documentation.

In [44]:
df['income'] = df["income"].fillna(df.income.mean())
In [45]:
print(df.income.max())
print(df.income.min())
6000000.0
1896.0

Call the pd.qcut() function to create deciles for the df['income'] column and copy them into the new df['inc_cat'] column. A more detailed example of creating deciles is described in the section Understanding Binning in Chapter 10 -- GroupBy.

In [46]:
df['inc_cat'] = pd.qcut(df['income'].values, 10).codes
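
An equivalent sketch passes labels=False, which returns the integer bin codes directly and skips the intermediate Categorical:

    # Sketch: labels=False makes pd.qcut() return integer bin codes 0-9
    # rather than a Categorical with interval labels.
    df['inc_cat'] = pd.qcut(df['income'], 10, labels=False)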

Return the count of values for each income decile in the df['inc_cat'] column.

In [47]:
pd.value_counts(df['inc_cat'])
Out[47]:
0    5088
7    4436
4    4309
3    4306
2    4265
9    4260
6    4253
5    4163
8    4082
1    3433
Name: inc_cat, dtype: int64

The crosstab below uses the income deciles in the df['inc_cat'] column as rows, with df['grade'] values nested inside df['ln_plcy'] values as the column header.

In [48]:
pd.crosstab(df.inc_cat, [df.ln_plcy, df.grade] , margins=True)
Out[48]:
ln_plcy False True All
grade A B C D E F G A B C D E F G
inc_cat
0 23 73 105 130 92 31 18 1219 1468 1025 625 213 51 15 5088
1 6 29 46 54 57 19 9 824 999 711 427 183 57 12 3433
2 10 25 71 85 53 20 17 1025 1264 849 522 242 62 20 4265
3 15 39 57 62 58 23 16 1034 1196 887 565 263 75 16 4306
4 8 26 65 66 55 25 16 1070 1221 805 531 309 91 21 4309
5 10 29 54 54 40 20 23 1023 1177 789 524 281 110 29 4163
6 12 37 50 62 45 25 13 1037 1224 753 537 303 122 33 4253
7 16 41 67 55 57 33 23 1033 1263 791 559 318 140 40 4436
8 9 43 66 57 38 23 18 940 1094 763 499 316 160 56 4082
9 14 45 66 77 49 28 41 874 1115 727 534 429 185 76 4260
All 123 387 647 702 544 247 194 10079 12021 8100 5323 2857 1053 318 42595

Save to Disk

The .to_pickle() method serializes the 'df' DataFrame and writes it to disk. The .to_pickle() method is documented in the pandas API reference. The DataFrame is read back in Chapter 10 -- GroupBy.

In [49]:
df.to_pickle('lending_club.pkl')
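
A quick round-trip check, assuming the pickle just written sits in the working directory (the 'df2' name is ours):

    # Sketch: pd.read_pickle() deserializes the DataFrame written above.
    df2 = pd.read_pickle('lending_club.pkl')
    df2.shape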

We can also write the 'lc0' and 'lc1' DataFrames out as .csv files.

In [50]:
lc0.to_csv("C:\\Data\\Loans_lc0.csv")
lc1.to_csv("C:\\Data\\Loans_lc1.csv")

Resources

Delete Duplicates in Pandas, by Chris Albon (chrisalbon.com).

pandas.cut API documentation (pandas.pydata.org).