Introduction to Pandas/NumPy¶

An introduction to pandas and NumPy using Google Colab.

toc: true
badges: true
comments: true
categories: [Pandas, Python, numpy]
image: images/chart-preview.png

Importing libraries¶

In [0]:

import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt  # the plotting API lives in the pyplot submodule

The data Set¶

In [0]:

url = "https://raw.githubusercontent.com/daddyawesome/PythonStat/master/Basics/data_test_loan.csv"  # URL where the CSV file is hosted
df = pd.read_csv(url)  # read the dataset into a DataFrame using pandas

Quick Data Exploration¶

Once you have read the dataset, you can look at the first few rows with the head() function:

df.head(10)

In [6]:

df.head(10)

Out[6]:

	Loan_ID	Gender	Married	Dependents	Education	Self_Employed	ApplicantIncome	CoapplicantIncome	LoanAmount	Loan_Amount_Term	Credit_History	Property_Area
0	LP001015	Male	Yes	0	Graduate	No	5720	0	110.0	360.0	1.0	Urban
1	LP001022	Male	Yes	1	Graduate	No	3076	1500	126.0	360.0	1.0	Urban
2	LP001031	Male	Yes	2	Graduate	No	5000	1800	208.0	360.0	1.0	Urban
3	LP001035	Male	Yes	2	Graduate	No	2340	2546	100.0	360.0	NaN	Urban
4	LP001051	Male	No	0	Not Graduate	No	3276	0	78.0	360.0	1.0	Urban
5	LP001054	Male	Yes	0	Not Graduate	Yes	2165	3422	152.0	360.0	1.0	Urban
6	LP001055	Female	No	1	Not Graduate	No	2226	0	59.0	360.0	1.0	Semiurban
7	LP001056	Male	Yes	2	Not Graduate	No	3881	0	147.0	360.0	0.0	Rural
8	LP001059	Male	Yes	2	Graduate	NaN	13633	0	280.0	240.0	1.0	Urban
9	LP001067	Male	No	0	Not Graduate	No	2400	2400	123.0	360.0	1.0	Semiurban

This should print 10 rows. Alternatively, you can look at more rows by printing the whole dataset. Next, you can get a summary of the numerical fields with the describe() function:

df.describe()

In [7]:

df.describe() #get summary of numerical variables

Out[7]:

	ApplicantIncome	CoapplicantIncome	LoanAmount	Loan_Amount_Term	Credit_History
count	367.000000	367.000000	362.000000	361.000000	338.000000
mean	4805.599455	1569.577657	136.132597	342.537396	0.825444
std	4910.685399	2334.232099	61.366652	65.156643	0.380150
min	0.000000	0.000000	28.000000	6.000000	0.000000
25%	2864.000000	0.000000	100.250000	360.000000	1.000000
50%	3786.000000	1025.000000	125.000000	360.000000	1.000000
75%	5060.000000	2430.500000	158.000000	360.000000	1.000000
max	72529.000000	24000.000000	550.000000	480.000000	1.000000

The describe() function reports the count, mean, standard deviation (std), min, quartiles, and max of each numeric column. The count covers non-missing values only.

Here are a few inferences you can draw from the output of describe(), taking 367 (the count for ApplicantIncome and CoapplicantIncome) as the number of rows, which assumes those two columns have no missing values:

LoanAmount has (367 - 362) 5 missing values.
Loan_Amount_Term has (367 - 361) 6 missing values.
Credit_History has (367 - 338) 29 missing values.
We can also see that about 83% of the applicants with a recorded value have a credit history. How? The mean of the Credit_History field is 0.825 (remember, Credit_History is 1 for those who have a credit history and 0 otherwise).
The ApplicantIncome distribution seems to be in line with expectations, and the same holds for CoapplicantIncome.
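Missing values can also be counted directly instead of being derived from the describe() counts. The following is a minimal sketch on a small frame named sample (a name chosen here for illustration), built from three columns of the ten rows printed by df.head(10) above. On the full dataset the same call would be df.isnull().sum().

```python
import numpy as np
import pandas as pd

# Ten rows of three columns, copied from the df.head(10) output above.
# "sample" is an illustrative name, not part of the original notebook.
sample = pd.DataFrame({
    "ApplicantIncome": [5720, 3076, 5000, 2340, 3276, 2165, 2226, 3881, 13633, 2400],
    "Self_Employed": ["No", "No", "No", "No", "No", "Yes", "No", "No", np.nan, "No"],
    "Credit_History": [1.0, 1.0, 1.0, np.nan, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0],
})

# isnull() flags missing cells; summing the booleans counts them per column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Same idea as the subtraction used with describe(): rows minus non-null count.
missing_from_counts = len(sample) - sample.count()
print(missing_from_counts)
```

Unlike describe(), this approach also covers non-numeric columns such as Self_Employed.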

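Note that we can get an idea of possible skew in the data by comparing the mean to the median, i.e. the 50% figure. For ApplicantIncome, the mean (4805.6) sits well above the median (3786), which points to a right-skewed distribution.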
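The mean-versus-median comparison can be sketched in a few lines. The example below uses only the ten ApplicantIncome values shown by df.head(10), so the numbers differ from the full-dataset figures in the describe() output. The variable names are illustrative.

```python
import pandas as pd

# ApplicantIncome for the ten rows shown by df.head(10).
income = pd.Series(
    [5720, 3076, 5000, 2340, 3276, 2165, 2226, 3881, 13633, 2400],
    name="ApplicantIncome",
)

mean_income = income.mean()
median_income = income.median()
print(f"mean={mean_income:.1f}, median={median_income:.1f}")

# A mean above the median hints at a long right tail.
# Series.skew() gives a numeric measure; positive means right-skewed.
skewness = income.skew()
print(f"skew={skewness:.2f}")
```

Here a single large income (13633) pulls the mean above the median, which is the same pattern the full dataset shows.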

For the non-numerical values (e.g. Property_Area, Credit_History, etc.), we can look at the frequency distribution to check whether they make sense. The frequency table can be printed with the following command:

df['Property_Area'].value_counts()

Similarly, we can look at the unique values of Credit_History. Note that dfname['column_name'] is a basic indexing technique to access a particular column of the DataFrame. The index can also be a list of column names, which returns a DataFrame with just those columns.
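As a small illustration of both ideas, the sketch below builds a frame from three columns of the ten rows shown by df.head(10), prints frequency tables, and selects columns by name and by list. The frame name sample is an assumption made for this example. On the full dataset, df would take its place.

```python
import numpy as np
import pandas as pd

# Three columns of the ten rows shown by df.head(10).
sample = pd.DataFrame({
    "Property_Area": ["Urban", "Urban", "Urban", "Urban", "Urban", "Urban",
                      "Semiurban", "Rural", "Urban", "Semiurban"],
    "Credit_History": [1.0, 1.0, 1.0, np.nan, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0],
    "LoanAmount": [110.0, 126.0, 208.0, 100.0, 78.0, 152.0, 59.0, 147.0, 280.0, 123.0],
})

# Frequency table of a categorical column.
area_counts = sample["Property_Area"].value_counts()
print(area_counts)

# value_counts() skips NaN by default; dropna=False lists it as its own row.
history_counts = sample["Credit_History"].value_counts(dropna=False)
print(history_counts)

# A single name returns a Series; a list of names returns a DataFrame.
one_column = sample["Property_Area"]
subset = sample[["Property_Area", "LoanAmount"]]
print(type(one_column).__name__, type(subset).__name__)
```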

For more information, refer to the "10 Minutes to pandas" guide in the pandas documentation.

Distribution analysis¶

Now that we are familiar with the basic characteristics of the data, let us study the distribution of various variables. Let us start with the numeric variables, namely ApplicantIncome and LoanAmount.

Let's start by plotting the histogram of ApplicantIncome using the following command:

df['ApplicantIncome'].hist(bins=50)

In [8]:

df['ApplicantIncome'].hist(bins=50)

Out[8]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f9b18f56c88>

Here we observe that there are a few extreme values. This is also the reason why 50 bins are required to depict the distribution clearly. Next, we look at box plots to understand the distributions. The box plot for ApplicantIncome can be plotted by:

df.boxplot(column='ApplicantIncome')

In [9]:

df.boxplot(column='ApplicantIncome')

Out[9]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f9b18e17978>

This confirms the presence of a lot of outliers/extreme values. This can be attributed to income disparity in society. Part of it may be driven by the fact that we are looking at people with different education levels. Let us segregate them by Education:

df.boxplot(column='ApplicantIncome', by = 'Education')

In [10]:

df.boxplot(column='ApplicantIncome', by = 'Education')

Out[10]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f9b1897ca20>

We can see that there is no substantial difference between the median income of graduates and non-graduates (the line inside each box marks the median, not the mean). But there is a higher number of graduates with very high incomes, which appear as outliers.
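The per-group comparison that the box plot shows visually can also be computed as numbers. The sketch below groups ApplicantIncome by Education using only the ten rows shown by df.head(10). With so few rows the group statistics are not representative of the full dataset, so treat it as a pattern to apply to df rather than as a result. The names sample and summary are illustrative.

```python
import pandas as pd

# Education and ApplicantIncome for the ten rows shown by df.head(10).
sample = pd.DataFrame({
    "Education": ["Graduate", "Graduate", "Graduate", "Graduate", "Not Graduate",
                  "Not Graduate", "Not Graduate", "Not Graduate", "Graduate",
                  "Not Graduate"],
    "ApplicantIncome": [5720, 3076, 5000, 2340, 3276, 2165, 2226, 3881, 13633, 2400],
})

# One row per education level, with the median (the line inside each box),
# the mean, and the maximum (the most extreme high value) as columns.
summary = sample.groupby("Education")["ApplicantIncome"].agg(["median", "mean", "max"])
print(summary)
```

Comparing the median and max columns side by side makes it easy to see whether one group carries the extreme values.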

Now, let's look at the histogram and box plot of LoanAmount using the following commands:

df['LoanAmount'].hist(bins=50)

In [11]:

df['LoanAmount'].hist(bins=50)

Out[11]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f9b188d5940>

In [12]:

df.boxplot(column='LoanAmount')

Out[12]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f9b187cf550>
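The points drawn beyond the whiskers of a box plot can be listed explicitly. The sketch below applies the conventional 1.5 x IQR rule, which to our knowledge matches matplotlib's default whisker setting (whis=1.5), to the ten LoanAmount values shown by df.head(10). The variable names are illustrative, and the fences computed on ten rows will differ from those of the full dataset.

```python
import pandas as pd

# LoanAmount for the ten rows shown by df.head(10).
loan_amount = pd.Series(
    [110.0, 126.0, 208.0, 100.0, 78.0, 152.0, 59.0, 147.0, 280.0, 123.0],
    name="LoanAmount",
)

# Quartiles and interquartile range.
q1 = loan_amount.quantile(0.25)
q3 = loan_amount.quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = loan_amount[(loan_amount < lower_fence) | (loan_amount > upper_fence)]
print(outliers)
```

Listing the flagged values makes it possible to inspect the corresponding rows before deciding how to treat them.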
