import numpy as np
import pandas as pd
from pandas import Series, DataFrame, Index
from IPython.display import Image
The example below is used in Time Series Walk-Through in Chapter 09 -- Pandas Time Series and Date Handling. It uses the read_csv() method to construct the 'df_states' DataFrame, beginning its read on row 3,082.
df_states = pd.read_csv("C:\\Data\\HPI_master.csv",
                        skiprows=3082,
                        usecols=(0, 1, 2, 3, 4, 5, 6, 7, 8),
                        names=('hpi_type', 'hpi_flavor', 'frequency', 'level', 'place_name',
                               'place_id', 'yr', 'period', 'index_nsa'),
                        header=None)
The following SAS DATA step reads the same .csv file, using FIRSTOBS= to begin reading from an arbitrary row position.
/********************************/
/* c09_read()_csv_df_states.sas */
/********************************/
data df_states;
infile 'C:\Data\HPI_master.csv' delimiter=',' missover dsd firstobs=3081;
informat hpi_type $12.
hpi_flavor $16.
frequency $9.
level $28.
place_name $33.
place_id $8.
yr $5.
period $6.
index_nsa 8.;
input hpi_type $
hpi_flavor $
frequency $
level $
place_name $
place_id $
yr $
period $
index_nsa ;
Verify the read using the .shape attribute.
df_states.shape
(96244, 9)
Inspect the first 5 rows of the 'df_states' DataFrame.
df_states.head()
hpi_type | hpi_flavor | frequency | level | place_name | place_id | yr | period | index_nsa | |
---|---|---|---|---|---|---|---|---|---|
0 | traditional | all-transactions | quarterly | MSA | Abilene, TX | 10180 | 1986 | 4 | 94.95 |
1 | traditional | all-transactions | quarterly | MSA | Abilene, TX | 10180 | 1987 | 1 | 100.91 |
2 | traditional | all-transactions | quarterly | MSA | Abilene, TX | 10180 | 1987 | 2 | 100.76 |
3 | traditional | all-transactions | quarterly | MSA | Abilene, TX | 10180 | 1987 | 3 | 94.02 |
4 | traditional | all-transactions | quarterly | MSA | Abilene, TX | 10180 | 1987 | 4 | 91.43 |
Inspect the first 5 rows of the SAS data set df_states.
Image(filename='Anaconda3\\output\\df_states_output.JPG')
Examine the default sort by supplying the sort key as the only required argument to the .sort_values() method. The sort operation creates the new 'default_srt' DataFrame.
default_srt = df_states.sort_values('index_nsa')
Display the first 4 rows of the sorted DataFrame 'default_srt'.
default_srt.iloc[0:4]
hpi_type | hpi_flavor | frequency | level | place_name | place_id | yr | period | index_nsa | |
---|---|---|---|---|---|---|---|---|---|
42692 | traditional | all-transactions | quarterly | MSA | San Francisco-Redwood City-South San Francisco... | 41884 | 1975 | 3 | 18.31 |
42856 | traditional | all-transactions | quarterly | MSA | San Jose-Sunnyvale-Santa Clara, CA | 41940 | 1975 | 4 | 18.82 |
48737 | traditional | all-transactions | quarterly | MSA | Honolulu ('Urban Honolulu'), HI | 46520 | 1981 | 4 | 18.91 |
42693 | traditional | all-transactions | quarterly | MSA | San Francisco-Redwood City-South San Francisco... | 41884 | 1975 | 4 | 19.26 |
Examining the first four rows of the sorted DataFrame 'default_srt' shows the default sort sequence is ascending. Of course, reading the documentation for pandas.DataFrame.sort_values confirms this as well.
The default SAS sort syntax is:
/******************************************************/
/* c12_print_first4_rows_sorted.sas */
/******************************************************/
84 proc sort data=df_states;
85 by index_nsa;
NOTE: 96244 observations were read from "WORK.df_states"
NOTE: Data set "WORK.df_states" has 96244 observation(s) and 9 variable(s)
86
87 proc print data=df_states (obs=4);
Like pandas, the default SAS sort sequence is ascending. This is confirmed in the SAS documentation.
Image(filename='Anaconda3\\output\\default_first_four.JPG')
In contrast to the pandas default sort, the default SAS sort sequence places missing values first in the data set, as illustrated above.
Displaying the last 4 rows of the 'default_srt' DataFrame finds the two NaN's sorted last, since this is the default position where pandas places missing values.
default_srt.tail(4)
hpi_type | hpi_flavor | frequency | level | place_name | place_id | yr | period | index_nsa | |
---|---|---|---|---|---|---|---|---|---|
54262 | traditional | all-transactions | quarterly | State | District of Columbia | DC | 2016 | 1 | 780.98 |
54263 | traditional | all-transactions | quarterly | State | District of Columbia | DC | 2016 | 2 | 791.71 |
60576 | traditional | all-transactions | quarterly | State | Vermont | VT | 1976 | 1 | NaN |
61098 | traditional | all-transactions | quarterly | State | West Virginia | WV | 1982 | 1 | NaN |
pandas places NaN's last in the sort sequence by default, so sorting can serve as an alternative to boolean operators and the .loc indexer for detecting missing values.
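This can be shown on a small, hypothetical example: sorting pushes NaN values to the tail, while a boolean mask with .isna() finds the same rows directly.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'index_nsa' column.
s = pd.DataFrame({'index_nsa': [94.9, np.nan, 18.3, np.nan, 791.7]})

# Default sort: NaN values land at the end of the sorted frame.
tail = s.sort_values('index_nsa').tail(2)

# Equivalent detection with a boolean mask -- usually the clearer choice.
missing = s[s['index_nsa'].isna()]

print(tail.index.tolist())      # [1, 3]
print(missing.index.tolist())   # [1, 3]
```

Both approaches identify the same rows; the mask avoids the cost of a full sort.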
The SAS program to access the last four observations in the data set is:
/******************************************************/
/* c12_print_last4_rows_sorted.sas */
/******************************************************/
79 data last4;
80 set df_states (firstobs=96241);
81 by index_nsa;
NOTE: 4 observations were read from "WORK.last4"
Image(filename='Anaconda3\\output\\sort_last_four.JPG')
Alter the sort order by supplying additional arguments to the .sort_values() method. The example below, which creates the 'states_desc' DataFrame, sorts descending and places missing values at the beginning of the DataFrame.
states_desc = df_states.sort_values('index_nsa', ascending=False, na_position='first')
Setting the na_position= argument to 'first' places NaN's at the beginning of the sort sequence. This is the same behavior as SAS' default sort sequence. Details describing the SAS sort order for missing values are in the SAS documentation.
The first two rows in the DataFrame 'states_desc' contain the NaN values for the 'index_nsa' column, while the next two rows contain the highest values.
states_desc.iloc[0:4,]
hpi_type | hpi_flavor | frequency | level | place_name | place_id | yr | period | index_nsa | |
---|---|---|---|---|---|---|---|---|---|
60576 | traditional | all-transactions | quarterly | State | Vermont | VT | 1976 | 1 | NaN |
61098 | traditional | all-transactions | quarterly | State | West Virginia | WV | 1982 | 1 | NaN |
54263 | traditional | all-transactions | quarterly | State | District of Columbia | DC | 2016 | 2 | 791.71 |
54262 | traditional | all-transactions | quarterly | State | District of Columbia | DC | 2016 | 1 | 780.98 |
SAS has 28 missing value indicators for numerics, described in the SAS documentation. However, if you want missing values to appear last in an ascending sort sequence, then the missing value must be replaced with an actual numeric value larger than the largest non-missing value. In other words, by default, a SAS numeric missing value always sorts as the smallest numeric value.
The SAS example below sorts the df_states data set in descending order by the 'index_nsa' variable. Like the pandas example above, the out=states_srt2 syntax creates the new SAS data set 'states_srt2'.
/******************************************************/
/* c12_print_last4_rows_sorted_descending.sas */
/*****************************************************/
44 proc sort data=df_states
45 out=states_srt2;
46 by descending index_nsa;
NOTE: 96244 observations were read from "WORK.df_states"
NOTE: Data set "WORK.states_srt2" has 96244 observation(s) and 9 variable(s)
47
48 data first4;
49 set states_srt2 (obs=4);
50 by descending index_nsa;
Image(filename='Anaconda3\\output\\sort_last4_descending.JPG')
The section below examines a number of common data management patterns for analysis.
lc = pd.read_csv("C:\\Data\\LC_Loan_Stats.csv",
low_memory=False)
lc.shape
(42633, 51)
Check for missing values. Clearly, some columns are not useful.
lc.isnull().sum()
ID                                 2
Member_ID                          3
Loan_Amnt                          3
Term                               3
Int_Rate                           3
Installment                        3
Grade                              3
Sub_Grade                          3
Emp_Length                         3
Home_Ownership                     3
Annual_Inc                         7
Verification_Status                3
Loan_Status                        3
Purpose                            3
Zip_Code                           3
Addr_State                         3
DTI                                3
Delinq_2yrs                       32
Earliest_Cr_Line                  32
Inq_Last_6mths                    32
Mths_Since_Last_Delinq         26998
Mths_Since_Last_Record         38978
Open_Acc                          32
Pub_Rec                           32
Revol_Bal                          3
Revol_Util                        93
Total_Acc                         32
Initial_List_Status                3
Out_Prncp                          3
Out_Prncp_Inv                      3
Total_Pymnt                        3
Total_Pymnt_Inv                    3
Total_Rec_Prncp                    3
Total_Rec_Int                      3
Total_Rec_Late_Fee                 3
Recoveries                         3
Collection_Recovery_Fee            3
Last_Pymnt_D                      85
Last_Pymnt_Amnt                    3
Next_Pymnt_D                   39412
Last_Credit_Pull_D                 7
Collections_12_Mths_Ex_Med       148
Mths_Since_Last_Major_Derog    42633
Policy_Code                        3
Application_Type                   3
Annual_Inc_Joint               42633
Acc_Now_Delinq                    32
Chargeoff_Within_12_Mths         148
Delinq_Amnt                       32
Pub_Rec_Bankruptcies            1368
Tax_Liens                        108
dtype: int64
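One way to act on these counts (a sketch, not from the original text) is to drop any column whose share of missing values exceeds a threshold; the 50% cutoff and the toy data below are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for 'lc'; column names are hypothetical.
toy = pd.DataFrame({'id':           [1, 2, 3, 4],
                    'mostly_there': [1.0, 2.0, np.nan, 4.0],
                    'mostly_gone':  [np.nan, np.nan, np.nan, 9.0]})

# Fraction of missing values per column.
miss_share = toy.isnull().mean()

# Keep only columns with at most 50% missing.
kept = toy.loc[:, miss_share <= 0.5]
print(list(kept.columns))   # ['id', 'mostly_there']
```

Applied to 'lc', this rule would remove entirely empty columns such as Mths_Since_Last_Major_Derog and Annual_Inc_Joint shown above.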
Keep some of the columns in the 'lc' DataFrame.
lc = lc[['ID', 'Member_ID', 'Loan_Amnt', 'Term', 'Int_Rate', 'Installment', 'Grade', 'Sub_Grade', 'Emp_Length',
'Home_Ownership', 'Annual_Inc', 'Verification_Status', 'Loan_Status', 'Purpose', 'Zip_Code',
'Addr_State', 'DTI', 'Delinq_2yrs', 'Earliest_Cr_Line', 'Inq_Last_6mths', 'Open_Acc', 'Revol_Bal',
'Revol_Util']]
lc = lc.rename(columns = {
    'Member_ID'           :'mem_id',
    'Loan_Amnt'           :'ln_amt',
    'Int_Rate'            :'rate',
    'Home_Ownership'      :'own_rnt',
    'Verification_Status' :'vrfy_stat',
    'Loan_Status'         :'ln_stat',
    'Addr_State'          :'state',
    'Earliest_Cr_Line'    :'earliest_ln'
    })
Lower-case all the 'lc' DataFrame column names.
lc.columns = map(str.lower, lc.columns)
An investigation of the LC_Loan_Stats.csv file reveals it has two parts. The first part, rows 0 to 39786, contains data for loans meeting the lender's credit policies. Rows 39790 to the end contain data for loans made outside the normal lending policies.
Create the 'lc1' DataFrame by reading a subset of columns and rows.
lc1 = pd.read_csv("C:\\Data\\LC_Loan_Stats.csv",
low_memory=False,
usecols=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 15, 16, 17, 18, 19, 22, 24, 25),
names=('id','mem_id', 'ln_amt', 'term','rate', 'm_pay', 'grade', 'sub_grd', 'emp_len',
'own_rnt', 'income', 'ln_stat', 'purpose', 'state', 'dti', 'delinq_2yrs', 'ln_fst',
'inq_6mnth', 'open_acc', 'revol_bal', 'revol_util' ),
skiprows=1,
nrows=39786,
header=None)
lc1.shape
(39786, 21)
The analogous SAS program reads subsets of rows and columns from the 'LC_Loan_Stats.csv' file.
/******************************************************/
/* c12_read_1st_half_loan_cvs.sas */
/******************************************************/
4 options obs=39787;
5 proc import datafile="c:\data\LC_Loan_Stats.csv"
6 dbms=csv
7 out=loans
8 replace;
9
10 data lc1(rename=(Member_ID = mem_id
12 Loan_Amnt = ln_amt
13 Int_Rate = rate
14 Installment = m_pay
15 Sub_Grade = sub_grd
16 Emp_Length = emp_len
17 Home_Ownership = own_rnt
18 Annual_Inc = income
19 Loan_Status = ln_stat
20 Addr_State = state
21 Earliest_Cr_Line = ln_fst
22 Inq_Last_6mths = inq_6mnth));
23
24 set loans(keep = ID
25 Member_ID
26 Loan_Amnt
27 Term
28 Int_Rate
29 Installment
30 Grade
31 Sub_Grade
32 Emp_Length
33 Home_Ownership
34 Annual_Inc
35 Loan_Status
36 Purpose
37 Addr_State
38 DTI
39 Delinq_2yrs
40 Earliest_Cr_Line
41 Inq_Last_6mths
42 Open_Acc
43 Revol_Bal
44 Revol_Util);
45
46 ln_plcy = 'True';
NOTE: 39786 observations were read from "WORK.loans"
NOTE: Data set "WORK.lc1" has 39786 observation(s) and 22 variable(s)
In the 'lc1' DataFrame we expect to find no duplicate id values. The .duplicated() method returns a boolean Series marking duplicate values, in our case for the lc1['id'] column. The .duplicated() method for DataFrames is documented in the pandas API reference.
With keep=False, every occurrence of a duplicated value is marked True. Here we create the new 'lc1_dups' DataFrame containing the duplicates for the lc1['id'] column; if its length is zero, then there are no duplicates. Locating the duplicate rows is not the same as removing them, which is illustrated below.
Begin by setting the index to the lc1['id'] column. Next, create a boolean mask locating the duplicate values for the lc1['id'] column. The keep='first' argument behaves similarly to the SAS FIRST.variable in BY-group processing: we ignore the first occurrence of each duplicated 'id' value and extract the remaining ones.
lc1.set_index('id', inplace=True, drop=False)
dup_mask = lc1.duplicated('id', keep='first')
Apply the boolean mask using the .loc indexer to create the new 'lc1_dups' DataFrame. The .shape attribute shows there are 35 rows with duplicate 'id' values.
lc1_dups = lc1.loc[dup_mask]
lc1_dups.shape
(35, 21)
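The keep= argument is easiest to see on a toy Series (hypothetical values): keep='first' flags every occurrence after the first, while keep=False flags all occurrences of a duplicated value.

```python
import pandas as pd

# Hypothetical id values with 101 appearing three times.
ids = pd.Series([100, 101, 101, 102, 101])

# keep='first': the first 101 is not flagged, the later ones are.
print(ids.duplicated(keep='first').tolist())
# [False, False, True, False, True]

# keep=False: every row whose value occurs more than once is flagged.
print(ids.duplicated(keep=False).tolist())
# [False, True, True, False, True]
```

The walkthrough above uses keep='first' so that one representative of each duplicated id survives.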
The SAS SORT option NODUPKEY checks for and eliminates observations with duplicate BY values. The DUPOUT= option names the target SAS data set to write the duplicates.
/******************************************************/
/* c12_sort_nodupkey.sas */
/******************************************************/
50 proc sort data = lc1 dupout=lc1_dups nodupkey;
51 by id;
NOTE: 39786 observations were read from "WORK.lc1"
NOTE: 35 observations were found and deleted due to having duplicate sort keys
NOTE: Data set "WORK.lc1" has 39751 observation(s) and 21 variable(s)
NOTE: Data set "WORK.lc1_dups" has 35 observation(s) and 21 variable(s)
lc1.shape
(39786, 21)
Use the .drop_duplicates() method to drop the duplicate rows.
lc1.drop_duplicates(['id'], keep='first', inplace=True)
lc1.shape
(39751, 21)
Create the new column lc1['ln_plcy'] as a flag to indicate all rows read from row 2 to 39786 are loans granted within standard lending guidelines.
lc1['ln_plcy'] = lc1['id'].map(lambda x: True)
lc1.shape
(39751, 22)
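Since the flag is the same for every row, the map/lambda call above can also be written as a plain scalar assignment, which pandas broadcasts down the column. A minimal sketch on hypothetical data:

```python
import pandas as pd

toy = pd.DataFrame({'id': [1, 2, 3]})

# Broadcasting a scalar fills the whole column -- equivalent to
# toy['id'].map(lambda x: True) when the value does not depend on x.
toy['ln_plcy'] = True

print(toy['ln_plcy'].tolist())   # [True, True, True]
```

The map/lambda form becomes useful when the flag actually depends on each row's value.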
lc1.describe(include=['O'])
term | rate | grade | sub_grd | emp_len | own_rnt | ln_stat | purpose | state | ln_fst | revol_util | |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 39751 | 39751 | 39751 | 39751 | 39751 | 39751 | 39751 | 39751 | 39751 | 39751 | 39701 |
unique | 2 | 371 | 7 | 35 | 12 | 5 | 7 | 14 | 50 | 529 | 1092 |
top | 36 months | 10.99% | B | B3 | 10+ years | RENT | Fully Paid | debt_consolidation | CA | Nov-98 | 0% |
freq | 29070 | 958 | 12021 | 2918 | 8891 | 18893 | 33639 | 18661 | 7095 | 371 | 980 |
Both the lc1['rate'] and lc1['revol_util'] column values are formatted with Excel's percent format (%), making them string values. The .replace() method removes the '%' by replacing it with an empty string. The .astype() method converts the string into a float, and the resulting float is divided by 100.
lc1['revol_util'] = lc1.revol_util.replace('%','',regex=True).astype('float')/100
lc1['rate'] = lc1.rate.replace('%','',regex=True).astype('float')/100
With the lc1['revol_util'] column values cast to floats, use the .fillna() method to replace missing values with the mean calculated from the lc1['revol_util'] column.
lc1["revol_util"] = lc1[["revol_util"]].fillna(lc1.revol_util.mean())
lc1.revol_util.isnull().sum()
0
The analog SAS program uses PROC SQL to insert the mean value for 'revol_util' into the SAS macro variable 'mean_revol'. A WHERE clause locates the NULL values for the variable 'revol_util', and the UPDATE statement replaces them with the value held in the macro variable &mean_revol.
The CREATE TABLE statement starting at line 13 tests if there are missing values found for the column 'revol_util'.
/******************************************************/
/* c12_update_missing_revol_until.sas */
/******************************************************/
6 proc sql;
7 select mean(revol_util) format 6.2 into :mean_revol
8 from lc1;
9 update lc1
10 set revol_util = &mean_revol
11 where revol_util is null;
NOTE: 50 record(s) updated in table WORK.lc1
12
13 create table miss as
14 select revol_util
15 from lc1
16 where revol_util is null;
NOTE: Data set "WORK.miss" has 0 observation(s) and 1 variable(s)
17 quit;
Create the 'lc0' DataFrame by reading the loan status file, supplying a tuple of integers for the usecols= argument and a tuple of names for the column labels.
lc0 = pd.read_csv("C:\\Data\\LC_Loan_Stats.csv",
low_memory=False,
usecols=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 15, 16, 17, 18, 19, 22, 24, 25),
names=('id','mem_id', 'ln_amt', 'term','rate', 'm_pay', 'grade', 'sub_grd', 'emp_len',
'own_rnt', 'income', 'ln_stat', 'purpose', 'state', 'dti', 'delinq_2yrs', 'ln_fst',
'inq_6mnth', 'open_acc', 'revol_bal', 'revol_util' ),
skiprows=39790,
header=None)
lc0.shape
(2844, 21)
Create the column lc0['ln_plcy'] with a value of False to indicate that rows 39790 to the end of the file are loans made outside the standard credit lending policy. The assignment below uses the .map() method with an anonymous lambda function to create the new column lc0['ln_plcy'].
lc0['ln_plcy'] = lc0['id'].map(lambda x: False)
Similar to above, the lc0['rate'] and lc0['revol_util'] column values are formatted with Excel's percent format (%), making them string values. The .replace() method removes the '%' by replacing it with an empty string. The .astype() method converts the string into a float, and the resulting float is divided by 100.
lc0['revol_util'] = lc0.revol_util.replace('%','',regex=True).astype('float')/100
lc0['rate'] = lc0.rate.replace('%','',regex=True).astype('float')/100
Calculate the mean value for 'revol_util' and use the .fillna() method to replace the missing values.
lc0['revol_util'] = lc0[["revol_util"]].fillna(lc0.revol_util.mean())
Return the number of missing values.
lc0.revol_util.isnull().sum()
Chapter 7 -- Pandas, Part 2 covers joining operations by illustrating the pd.merge() method, which is modeled on SQL join semantics. An alternative is the pd.concat() function, described in the pandas documentation.
print(lc0.shape)
print(lc1.shape)
(2844, 22)
(39751, 22)
Use the pd.concat() function to concatenate the 'lc1' and 'lc0' DataFrames.
frames = [lc1, lc0]
df = pd.concat(frames)
df.shape
(42595, 22)
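Note that pd.concat() keeps each frame's original row labels by default; passing ignore_index=True builds a fresh RangeIndex instead. A small sketch with hypothetical frames:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})

# Default: the labels 0, 1 from each input frame are preserved.
stacked = pd.concat([a, b])
print(stacked.index.tolist())      # [0, 1, 0, 1]

# ignore_index=True renumbers the rows 0..n-1.
renumbered = pd.concat([a, b], ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3]
```

Duplicate labels after concatenation can surprise later .loc lookups, so renumbering is often the safer choice.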
The analog SAS program uses the SET statement to concatenate the data sets 'lc0' and 'lc1'. The variable names are identical in both data sets.
/******************************************************/
/* c12_concatenate_lc0_lc1.sas */
/******************************************************/
7 data df;
8 set lc0
9 lc1;
10 run;
NOTE: 2844 observations were read from "WORK.lc0"
NOTE: 39751 observations were read from "WORK.lc1"
NOTE: Data set "WORK.df" has 42595 observation(s) and 22 variable(s)
pd.crosstab(df.grade, df.own_rnt , margins=True)
own_rnt | MORTGAGE | NONE | OTHER | OWN | RENT | All |
---|---|---|---|---|---|---|
grade | ||||||
A | 5227 | 5 | 25 | 872 | 4073 | 10202 |
B | 5427 | 2 | 36 | 957 | 5986 | 12408 |
C | 3587 | 0 | 24 | 642 | 4494 | 8747 |
D | 2368 | 0 | 26 | 433 | 3198 | 6025 |
E | 1499 | 0 | 14 | 242 | 1646 | 3401 |
F | 620 | 1 | 4 | 75 | 600 | 1300 |
G | 253 | 0 | 7 | 36 | 216 | 512 |
All | 18981 | 8 | 136 | 3257 | 20213 | 42595 |
pd.crosstab([df.grade, df.own_rnt], df.purpose, margins=True)
purpose | car | credit_card | debt_consolidation | educational | home_improvement | house | major_purchase | medical | moving | other | renewable_energy | small_business | vacation | wedding | All | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
grade | own_rnt | |||||||||||||||
A | MORTGAGE | 359 | 685 | 1813 | 36 | 806 | 42 | 458 | 100 | 49 | 540 | 20 | 187 | 60 | 72 | 5227 |
NONE | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 5 | |
OTHER | 1 | 4 | 7 | 2 | 2 | 1 | 3 | 1 | 0 | 2 | 0 | 2 | 0 | 0 | 25 | |
OWN | 62 | 90 | 315 | 3 | 99 | 6 | 90 | 23 | 6 | 116 | 2 | 26 | 14 | 20 | 872 | |
RENT | 195 | 638 | 1761 | 47 | 66 | 53 | 287 | 89 | 118 | 460 | 8 | 136 | 61 | 154 | 4073 | |
B | MORTGAGE | 252 | 732 | 2400 | 28 | 742 | 40 | 237 | 90 | 33 | 450 | 20 | 270 | 43 | 90 | 5427 |
NONE | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 2 | |
OTHER | 1 | 3 | 17 | 1 | 3 | 1 | 4 | 0 | 0 | 2 | 0 | 3 | 0 | 1 | 36 | |
OWN | 44 | 96 | 409 | 10 | 104 | 8 | 71 | 19 | 11 | 122 | 3 | 33 | 11 | 16 | 957 | |
RENT | 208 | 871 | 2948 | 72 | 87 | 67 | 312 | 100 | 147 | 709 | 14 | 193 | 77 | 181 | 5986 | |
C | MORTGAGE | 98 | 487 | 1671 | 30 | 467 | 31 | 139 | 63 | 18 | 297 | 9 | 194 | 16 | 67 | 3587 |
OTHER | 1 | 2 | 8 | 0 | 1 | 0 | 1 | 1 | 1 | 7 | 0 | 2 | 0 | 0 | 24 | |
OWN | 27 | 57 | 278 | 10 | 69 | 10 | 48 | 15 | 7 | 77 | 3 | 28 | 2 | 11 | 642 | |
RENT | 132 | 605 | 2238 | 75 | 68 | 44 | 221 | 74 | 102 | 576 | 6 | 162 | 54 | 137 | 4494 | |
D | MORTGAGE | 49 | 277 | 1172 | 12 | 268 | 21 | 88 | 43 | 8 | 177 | 5 | 180 | 14 | 54 | 2368 |
OTHER | 0 | 4 | 12 | 0 | 0 | 0 | 0 | 2 | 1 | 3 | 1 | 3 | 0 | 0 | 26 | |
OWN | 20 | 43 | 205 | 4 | 47 | 3 | 24 | 6 | 1 | 53 | 1 | 14 | 3 | 9 | 433 | |
RENT | 80 | 361 | 1743 | 36 | 40 | 35 | 162 | 48 | 65 | 371 | 4 | 145 | 19 | 89 | 3198 | |
E | MORTGAGE | 16 | 170 | 790 | 3 | 172 | 14 | 42 | 26 | 6 | 126 | 2 | 105 | 8 | 19 | 1499 |
OTHER | 0 | 3 | 5 | 0 | 0 | 0 | 0 | 0 | 1 | 4 | 0 | 1 | 0 | 0 | 14 | |
OWN | 8 | 23 | 113 | 6 | 25 | 5 | 11 | 8 | 2 | 27 | 1 | 9 | 2 | 2 | 242 | |
RENT | 42 | 164 | 918 | 28 | 31 | 19 | 71 | 22 | 33 | 147 | 2 | 109 | 12 | 48 | 1646 | |
F | MORTGAGE | 5 | 71 | 345 | 3 | 63 | 5 | 13 | 7 | 2 | 40 | 1 | 56 | 1 | 8 | 620 |
NONE | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | |
OTHER | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 4 | |
OWN | 2 | 4 | 33 | 0 | 8 | 1 | 5 | 0 | 1 | 10 | 0 | 11 | 0 | 0 | 75 | |
RENT | 9 | 53 | 344 | 8 | 5 | 13 | 13 | 9 | 11 | 61 | 2 | 55 | 2 | 15 | 600 | |
G | MORTGAGE | 1 | 29 | 129 | 1 | 25 | 6 | 1 | 3 | 1 | 16 | 0 | 36 | 0 | 5 | 253 |
OTHER | 0 | 0 | 2 | 2 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 7 | |
OWN | 0 | 2 | 16 | 1 | 2 | 1 | 4 | 1 | 0 | 3 | 0 | 6 | 0 | 0 | 36 | |
RENT | 2 | 21 | 119 | 4 | 2 | 0 | 5 | 2 | 5 | 23 | 2 | 25 | 1 | 5 | 216 | |
All | 1614 | 5495 | 19813 | 422 | 3203 | 427 | 2310 | 753 | 629 | 4427 | 106 | 1993 | 400 | 1003 | 42595 |
samp = df.sample(n=300)
import bokeh.charts
import bokeh.charts.utils
import bokeh.io
import bokeh.models
import bokeh.palettes
import bokeh.plotting
from bokeh.charts import Bar, Scatter, BoxPlot, Histogram
bokeh.io.output_notebook()
p = BoxPlot(samp, values='income', label='grade', color='firebrick',
            title="Income by Credit Grade")
bokeh.io.show(p)
print(df.dti.min())
print(df.dti.max())
0.0
29.99
Create the list 'bins' containing integer values used to define bin levels. Construct the list 'names' to define the labels for the bins.
bins = [0, 5, 10, 15, 20, 25, 30]
names=['0-5%', '5-10%', '10-15%', '15-20%', '20-25%', '25-30%']
df['dti_cat'] = pd.cut(df['dti'], bins, labels=names, include_lowest=True)
Call the pd.value_counts() function to return a count of values for each bin level.
pd.value_counts(df['dti_cat'])
10-15%    10502
15-20%     9775
5-10%      8563
20-25%     7376
0-5%       5552
25-30%      827
Name: dti_cat, dtype: int64
Return the count of missing values for the df['income'] column.
df['income'].isnull().sum()
4
Call the .fillna() method to replace missing values for the df['income'] column with the calculated column mean. Additional examples for detecting and replacing missing values are described in the pandas documentation.
df['income'] = df[["income"]].fillna(df.income.mean())
print(df.income.max())
print(df.income.min())
6000000.0
1896.0
Call the pd.qcut() function to create deciles for the df['income'] column and copy them into the new df['inc_cat'] column. A more detailed example of creating deciles is described in the section Understanding Binning in Chapter 10 -- GroupBy.
df['inc_cat'] = pd.qcut(df['income'].values, 10).codes
Return the count of values from the income decile values in the df['inc_cat'] column.
pd.value_counts(df['inc_cat'])
0    5088
7    4436
4    4309
3    4306
2    4265
9    4260
6    4253
5    4163
8    4082
1    3433
Name: inc_cat, dtype: int64
Build a crosstab with the income deciles from the df['inc_cat'] column as the rows, and the values of df['grade'] nested inside the values of the df['ln_plcy'] column as the header.
pd.crosstab(df.inc_cat, [df.ln_plcy, df.grade] , margins=True)
ln_plcy | False | True | All | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
grade | A | B | C | D | E | F | G | A | B | C | D | E | F | G | |
inc_cat | |||||||||||||||
0 | 23 | 73 | 105 | 130 | 92 | 31 | 18 | 1219 | 1468 | 1025 | 625 | 213 | 51 | 15 | 5088 |
1 | 6 | 29 | 46 | 54 | 57 | 19 | 9 | 824 | 999 | 711 | 427 | 183 | 57 | 12 | 3433 |
2 | 10 | 25 | 71 | 85 | 53 | 20 | 17 | 1025 | 1264 | 849 | 522 | 242 | 62 | 20 | 4265 |
3 | 15 | 39 | 57 | 62 | 58 | 23 | 16 | 1034 | 1196 | 887 | 565 | 263 | 75 | 16 | 4306 |
4 | 8 | 26 | 65 | 66 | 55 | 25 | 16 | 1070 | 1221 | 805 | 531 | 309 | 91 | 21 | 4309 |
5 | 10 | 29 | 54 | 54 | 40 | 20 | 23 | 1023 | 1177 | 789 | 524 | 281 | 110 | 29 | 4163 |
6 | 12 | 37 | 50 | 62 | 45 | 25 | 13 | 1037 | 1224 | 753 | 537 | 303 | 122 | 33 | 4253 |
7 | 16 | 41 | 67 | 55 | 57 | 33 | 23 | 1033 | 1263 | 791 | 559 | 318 | 140 | 40 | 4436 |
8 | 9 | 43 | 66 | 57 | 38 | 23 | 18 | 940 | 1094 | 763 | 499 | 316 | 160 | 56 | 4082 |
9 | 14 | 45 | 66 | 77 | 49 | 28 | 41 | 874 | 1115 | 727 | 534 | 429 | 185 | 76 | 4260 |
All | 123 | 387 | 647 | 702 | 544 | 247 | 194 | 10079 | 12021 | 8100 | 5323 | 2857 | 1053 | 318 | 42595 |
The .to_pickle() method serializes the 'df' DataFrame and writes it to disk. The .to_pickle() method is documented in the pandas API reference. The DataFrame is read back in Chapter 10 -- GroupBy.
df.to_pickle('lending_club.pkl')
We can also write the 'lc0' and 'lc1' DataFrames out as .csv files.
lc0.to_csv("C:\\Data\\Loans_lc0.csv")
lc1.to_csv("C:\\Data\\Loans_lc1.csv")