import pandas as pd
import numpy as np # needed, as pandas is built on NumPy
from IPython.display import Image # to display images
The Pandas library is built on NumPy and provides easy-to-use data structures and data analysis tools for the Python programming language.
Refer to these cheatsheets:
https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PandasPythonForDataScience+(1).pdf
https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Pandas_Cheat_Sheet_2.pdf
Different types of data can be loaded into a pandas DataFrame. A DataFrame looks like a spreadsheet table (just a rough analogy).
Most common input types:
- pd.read_csv
- pd.read_excel / pd.ExcelFile
- pd.read_feather (the feather format reduces memory load, as data is saved in binary form)
- pd.read_json
- pd.read_html
- pd.read_pickle (can also infer whether the pickled object is compressed via compression='infer')

Output types have the format to_xxx, mirroring the input formats.
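For instance, a minimal round trip through CSV (the file name here is just an illustration):

# write a small dataframe out, then read it back
tmp = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
tmp.to_csv('tmp.csv', index=False) # index=False avoids writing the row index as a column
pd.read_csv('tmp.csv')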
Pandas data structures include Series and DataFrame.
Series: A one-dimensional labeled array capable of holding any data type.
Image('../images/series.png', width=500) # not pandas, just showing example series
# index will set the index for further reference
# data can be passed as list
s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
s
Can be indexed using either the index label or the position number:
- s[number:] selects entries from that position onward
- s[:number] selects entries before that position
(label slices such as s[:'c'] include the endpoint)
s[1:]    # everything from position 1 onward
s['a']   # element with label 'a'
s[:'c']  # label slice, inclusive of 'c'
s.index
# series using dictionary
s2 = pd.Series({'a':3, 'b': -1, 'c': 12}); s2
s2['b']
DataFrame: A two-dimensional labeled data structure with columns of potentially different types. It is similar to an Excel table.
Can make a DataFrame using a dictionary or a list of lists (both shown below).
Image('../images/df.png', width=500)
data = {'Country': ['Belgium', 'India', 'Brazil'],
'Capital': ['Brussels', 'New Delhi', 'Brasília'],
'Population': [11190846, 1303171035, 207847528]}
df_sample = pd.DataFrame(data,
columns=['Country', 'Capital', 'Population'])
df_sample
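The same table can be built from a list of lists; a quick sketch:

# each inner list is one row; column names are passed separately
pd.DataFrame([['Belgium', 'Brussels', 11190846],
              ['India', 'New Delhi', 1303171035],
              ['Brazil', 'Brasília', 207847528]],
             columns=['Country', 'Capital', 'Population'])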
Using the famous Titanic data for analysis and exploration:
https://www.kaggle.com/c/titanic/data
Methods for a first look at the data:
- df.shape
- df.head()
- df.dtypes
- df.columns
- df.nunique()
- df.info()
df = pd.read_csv('../data/train.csv') # read csv file
df.shape
df.head() # see top 5 rows of data
df.dtypes # see datatype of each variable
df.columns # column names
df.nunique() # unique value for each variable
df.info() # the non-null counts are very useful for seeing how many nulls the data has
iloc
Select based on integer location (that's why the i). Can select single or multiple values.
df.iloc[0, 4] # 0 row, 4 column
df.iloc[1:4, 2:6] # indexes are maintained. Can reset_index() to start index from 0
loc
Select based on label names (can select single or multiple).
df.loc[1:2,'Name':"Age"] # here row indexes are numbers but column indexes are name of columns
df.loc[2,['Name',"Age"]] # here row indexes are numbers.
ix
ix has been deprecated in recent pandas versions. It was used to select by label or position, but we can always use loc to select with labels and iloc to select with integers/positions.
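For example, what df.ix[0, 'Name'] used to do can be written either way (a small sketch):

df.loc[0, 'Name']  # by label
df.iloc[0, 3]      # by position; 'Name' is the 4th column in this data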
Boolean indexing
Returns rows where the stated condition is True.
- condition 1 | condition 2 (or; the Python keyword or throws an ambiguity error on Series)
- condition 1 & condition 2 (and; the keyword and likewise throws an ambiguity error)
- ~condition (not)
- == (equality condition)
- any() -> columns/rows with any value matching the condition
- all() -> columns/rows with all values matching the condition

# select rows with either Sex as female or Pclass as 1
df[(df.Sex == 'female') | (df.iloc[:,2] == 1) ].iloc[:3] # () are important
# first 3 rows of the columns where (df > 1) holds for every row (behavior on string columns varies across pandas versions)
df.loc[:,(df > 1).all()][:3]
# first 3 rows of all columns whose values are all non-null
df.loc[:,(df.notnull().all() )][:3]
# first 3 rows of all columns which have at least 1 null value
df.loc[:, df.isnull().any()][:3]
df[(df.iloc[:,2] == 1) & (df.Sex == 'female')].shape
# fraction of males with Age > 25, df.shape[0] -> number of rows
sum((df.Age > 25) & (df.Sex == 'male'))/df.shape[0]
# number of people who survived and were not in class 3
sum((df.Survived != 0) & (~(df.Pclass == 3)) )
querying
Query the dataframe (filter rows) with a boolean expression.
# filter all rows which have Age > Passenger ID
df.query('Age > PassengerId')
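query also accepts compound expressions; for example (a sketch):

# females older than 30; the keywords and/or work inside query strings
df.query('Sex == "female" and Age > 30').head()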
filter
Filter the dataframe on column names or row names (labels) by regex or by item names.
# filter only sex and age columns (first 2 rows)
df.filter(items=['Age', 'Sex'])[:2]
# filter only 0 and 5 row index
df.filter(items=[0,5], axis=0)
# first 2 rows of column names ending with "ed" (think of past tense)
df.filter(like = 'ed', axis=1)[:2]
# Can use same thing as above using regex also
df.filter(regex='ed$', axis=1)[:2]
isin
Filter rows of a column based on a list of multiple values.
df[df.Pclass.isin([0,1])].head()
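The ~ operator from boolean indexing combines naturally with isin; a sketch:

# rows whose Embarked is NOT one of the listed ports
df[~df.Embarked.isin(['S', 'C'])].head()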
Setting and resetting the index is important when we merge or group by two dataframes and want to do further analysis on the new dataframe. A dataframe with repeated indexes can cause problems in filtering. Apart from this, we can set a column as the index, which makes merging much faster.
set_index()
Set any column you want as the index of the df.
# setting
df.set_index('Ticket')[:2]
# can set multiple columns as index also. Just pass them in list
# Setting Ticket and Name as index
df.set_index(['Ticket', 'Name'])[:2]
# can see what the values of the index are
# checking the index of the 1st row
df.set_index(['Ticket', 'Name']).index[0]
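Once a column is the index, loc can look rows up by its values directly; a small sketch:

# look up a passenger row by PassengerId after making it the index
df.set_index('PassengerId').loc[1]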
reset_index()
Can reset the index back to 0...nrows-1.
df_index = df.set_index(['Ticket', 'Name'])
df_index[:2]
df_index.reset_index()[:2]
In above case, index is back to 0,1...
rename()
Rename column names or row indexes of the dataframe. The default axis is the index.
df.rename(columns={'Name': 'Whats_name', 'Fare':'Price'})[:2]
# can use some mapper function also. default axis='index' (0)
df.rename(mapper=str.lower, axis='columns')[:2]
unique()
Unique values in a column of the df (use nunique() for the count of unique values in each column).
df.Sex.unique()
duplicated()
Check for duplicates in a column. Returns True/False for each row.
sum(df.PassengerId.duplicated()) # there are no duplicate PassengerIds. good thing to check
# can check duplicates in index also.
# useful if doubtful about duplicates in index doing bad things
sum(df.index.duplicated())
drop_duplicates
Drop duplicate rows.
# can help in getting unique combinations of multiple columns
# unique() doesn't work in this case
df.loc[:,['Sex', 'Embarked']].drop_duplicates()
Group by some column(s); then we can aggregate to get the mean, count, sum, or a custom function over each group.
groupby
# group by sex, then count
# returns counts for each column; they differ across columns because of nulls
# can do iloc[:,0] to get only the first column (see below)
df.groupby(by = ['Sex']).count()
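As noted in the comment above, slicing out the first column gives a single clean count per group:

# counts per Sex, keeping just the first column
df.groupby(by=['Sex']).count().iloc[:, 0]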
# can use multiple conditions
# group by sex and survived -> mean of age
df.groupby(by = ['Sex', 'Survived']).mean().loc[:,'Age']
# can also group by indexes using level=
# useful when we have multi-indexes
# can use agg with a lambda function
df_index = df.set_index(['Sex', 'Pclass'])
df_index.groupby(level=[0,1]).agg({'Fare': lambda x: sum(x)/len(x), # this is also just mean actually
'Age' : np.mean})
Interesting! The ticket price for 1st class females is approximately double that of 1st class males.
transform
Can apply such functions to all columns using transform, which transforms every row.
# shape of below code is same as original df
df_index.groupby(level=[0,1]).transform(lambda x: sum(x)/len(x)).head()
dropna
Drop rows with NA values.
# how='any' -> drop rows where any column is NA
df.dropna(axis=0, how='any').shape
# how='all' -> drop rows where all columns are NA
df.dropna(axis=0, how='all').shape
# columns dropped because they contain at least one NA
set(df.columns) - set(df.dropna(axis=1, how='any').columns)
Three columns have been removed
fillna
# replace NAs with the mean of each numeric column
# can put any specific value instead
# np.mean alone would not work for string columns like Cabin, so restrict to numeric columns
df.fillna(df.mean(numeric_only=True))[:1]
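fillna also accepts a per-column dict, which sidesteps the string-column issue; a sketch:

# fill Age with its mean and Embarked with a constant; other columns are left untouched
df.fillna({'Age': df.Age.mean(), 'Embarked': 'S'})[:1]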
merge / join
data1 = pd.DataFrame({'x1': list('abc'), 'x2': [11.432, 1.303, 99.906]})
data2 = pd.DataFrame({'x1': list('abd'), 'x3': [20.784, np.NaN, 20.784]})
data1
data2
# inner join keeps keys present in both tables (like SQL)
data1.merge(data2, how='inner', on='x1')
# outer join keeps all keys from both dfs and fills missing values with NA
data1.merge(data2, how='outer', on='x1')
Can also use join, but merge is faster; just use merge.
# if columns overlap, we have to specify a suffix
data1.join(data2, on='x1', how='left', lsuffix='L')
concatenate
# concatenate over axis=0, i.e. combine rows
# also adds all columns, filling missing values with NA
pd.concat([data1, data2], axis=0)
Notice that it has duplicate indexes, as it maintains the original df indexes. Can use ignore_index=True to make the index start from 0.
pd.concat([data1, data2], axis=0, ignore_index=True)
data2.loc[3] = ['g', 500] # adding new row
data2
# concatenate over axis=1, i.e. combine columns
pd.concat([data1, data2], axis=1)
- to_datetime() -> converts its argument to datetime (anything that can be parsed as a datetime)
- date_range() -> generates a range of datetimes
- DatetimeIndex -> builds a datetime index

pd.to_datetime('2018-2-19')
# returns a DatetimeIndex
pd.date_range('2018-4-18', periods=6, freq='d')
data1['date'] = pd.date_range('2018-4-18', periods=3, freq='d')
data1
data1.date
pd.DatetimeIndex(data1.date)
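Once a column has a datetime dtype, the .dt accessor exposes its components; for example:

data1.date.dt.day   # day of the month for each row
data1.date.dt.month # month number for each row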
pivot -> reshape data
Ever used a pivot table in Excel? It's the same idea.
# index = new index, columns = new_columns, values = values to put
df.pivot(index='Sex', columns = 'PassengerId', values = 'Age')
In above case, use of pivot doesn't make sense but this is just an example
stack
Convert the whole df into one long format.
df.stack()
You won't generally use it; I have rarely come across a use for it in my experience with Python.
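Its inverse is unstack; stacking and then unstacking roughly round-trips back to the wide shape (not exactly, since stack drops NaN cells by default):

df.stack().unstack()[:2]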
To get (index, value) pairs:
- iteritems() -> (column-index, Series) pairs on a DataFrame, or (index, value) pairs on a Series
- iterrows() -> (row-index, Series) pairs

list(df.Sex.iteritems())[:5]
list(df.iterrows())[0]
- apply -> apply a function over the df (each column is passed in as a whole Series; think column-wise)
- applymap -> apply a function elementwise

# function squares when type(x) = float, cubes when type(x) = int, returns x unchanged otherwise
f = lambda x: x**2 if type(x) == float else x**3 if type(x) == int else x
# whole series is passed
df.Fare.apply(f)[:3]
# elements are passed
df.applymap(f)[:3]
What can we do when we have string data in a pandas dataframe/series?
str
Working with strings in a pandas Series/DataFrame. We can use:
- str.upper()/str.lower() to convert strings to upper or lower case
- str.len() to find the length of a string
- str.strip()/lstrip()/rstrip() to strip spaces
- str.replace() to replace parts of a string
- str.split() to split a string on whitespace or some other delimiter
- str.get() to access elements of the split list
- str.rsplit() to split from the end of the string based on some delimiter
- str.extract() to extract a specific pattern (letters or numbers) from a string

Let's see how to use all that on a pandas Series. Keep in mind a pandas DataFrame has no attribute called str; it works on Series objects only. So grab a column of the df, then apply str.
# converts all rows into lower
df.Name.str.lower().head()
# converts all rows into upper
df.Sex.str.upper().head()
# counts all the characters including spaces
df.Name.str.len().head()
# splits strings in each row on commas
# expand=True : expand the result into separate columns
# pat = pattern/regex to split on
# (names in this data are written "Last, Title First", so the first piece is the last name)
df.Name.str.split(pat=',', expand=True).head().rename(columns={0: 'Last_Name', 1: 'Rest_of_Name'})
# splits strings in each row on whitespace
# expand=False : doesn't expand columns
# pat = regex to split on
df.Name.str.split(expand=False).head()
# replace 'Mr.' with an empty string
df.Name.str.replace('Mr.', '').head()
# str.get() grabs a particular element of each split list (here the second token)
df.Name.str.split().str.get(1).head()
df.Name[:10]
# Extract just last name
df.Name.str.extract('(?P<Last_Name>[a-zA-Z]+)', expand=True).head()