Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
You can think of pandas as an extremely powerful version of Excel.
In this section, we are going to cover the core pandas data structures and operations.
First, we are going to talk about the Series data type.
A Series is very similar to a NumPy array; in fact, it is built on top of the NumPy array object. What differentiates a Series is that it can have axis labels, meaning it can be indexed by labels instead of just number locations.
Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.). The axis labels are collectively called index.
# Let's import NumPy and pandas
import numpy as np
import pandas as pd
# We can convert a list, NumPy array, or dict to a Series
labels = ['Shivendra','Ragavendra','Narendra']
my_list= [21,25,30]
arr=np.array([10,20,30])
d={'Shivendra':21,'Raghavendra':25,'Narendra':30}
# Using List
pd.Series(data=my_list)
0 21 1 25 2 30 dtype: int64
pd.Series(data=my_list, index=labels)
Shivendra 21 Ragavendra 25 Narendra 30 dtype: int64
pd.Series(my_list,labels)
Shivendra 21 Ragavendra 25 Narendra 30 dtype: int64
# NumPy Array
pd.Series(arr)
0 10 1 20 2 30 dtype: int32
pd.Series(data=arr, index=labels)
Shivendra 10 Ragavendra 20 Narendra 30 dtype: int32
# Dictionary
pd.Series(d)
Shivendra 21 Raghavendra 25 Narendra 30 dtype: int64
A pandas Series can hold a variety of object types.
pd.Series(data=labels)
0 Shivendra 1 Ragavendra 2 Narendra dtype: object
The key to using a Series is understanding its index. Pandas makes use of these index names or numbers by allowing for fast lookups of information (it works like a hash table or dictionary).
Let's see some examples of how to grab information from a Series. Let us create two Series, ser1 and ser2:
ser1 = pd.Series([1,2,3,4], index=['Chennai','Bihar','West Bengal','Rajasthan'])
ser1
Chennai 1 Bihar 2 West Bengal 3 Rajasthan 4 dtype: int64
ser2 = pd.Series([1,2,5,4], index=['Chennai','Bihar','Assam','Rajasthan'])
ser2
Chennai 1 Bihar 2 Assam 5 Rajasthan 4 dtype: int64
ser1['Chennai']
1
# Operations are then also done based off of index:
ser1+ser2
Assam NaN Bihar 4.0 Chennai 2.0 Rajasthan 8.0 West Bengal NaN dtype: float64
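If the NaN values produced by unmatched labels are unwanted, the `add` method with a `fill_value` can be used instead of the `+` operator. A small sketch using the same two Series:

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['Chennai', 'Bihar', 'West Bengal', 'Rajasthan'])
ser2 = pd.Series([1, 2, 5, 4], index=['Chennai', 'Bihar', 'Assam', 'Rajasthan'])

# fill_value treats a label missing from one Series as 0 instead of producing NaN
total = ser1.add(ser2, fill_value=0)
```

Now `Assam` comes out as 5.0 and `West Bengal` as 3.0 rather than NaN.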
A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes; that is, data is aligned in a tabular fashion in rows and columns. A DataFrame consists of three principal components: the data, the rows, and the columns.
DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!
import pandas as pd
import numpy as np
from numpy.random import randn
np.random.seed(101)
df = pd.DataFrame(randn(5,5), index='Chennai Bihar UtterPredesh Delhi Mumbai'.split(), columns='SRM NIT_PATNA BHU IIT_DELHI IIT_Bombay'.split())
df
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | |
---|---|---|---|---|---|
Chennai | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 |
Bihar | -0.319318 | -0.848077 | 0.605965 | -2.018168 | 0.740122 |
UtterPredesh | 0.528813 | -0.589001 | 0.188695 | -0.758872 | -0.933237 |
Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Mumbai | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -0.134841 |
Let's learn the various methods to grab data from a DataFrame
df['SRM']
Chennai 2.706850 Bihar -0.319318 UtterPredesh 0.528813 Delhi 0.955057 Mumbai 0.302665 Name: SRM, dtype: float64
# We can pass a list of column names
df[['SRM', 'BHU']]
SRM | BHU | |
---|---|---|
Chennai | 2.706850 | 0.907969 |
Bihar | -0.319318 | 0.605965 |
UtterPredesh | 0.528813 | 0.188695 |
Delhi | 0.955057 | 1.978757 |
Mumbai | 0.302665 | -1.706086 |
df.SRM # Attribute-style access; convenient, but can clash with DataFrame method names
Chennai 2.706850 Bihar -0.319318 UtterPredesh 0.528813 Delhi 0.955057 Mumbai 0.302665 Name: SRM, dtype: float64
# DataFrame columns are just Series
type(df['SRM'])
pandas.core.series.Series
# Creating new columns
df['UPES']=df['SRM']
df['Harshita']=df['SRM'] + df['BHU']
df
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | UPES | Harshita | |
---|---|---|---|---|---|---|---|
Chennai | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 | 2.706850 | 3.614819 |
Bihar | -0.319318 | -0.848077 | 0.605965 | -2.018168 | 0.740122 | -0.319318 | 0.286647 |
UtterPredesh | 0.528813 | -0.589001 | 0.188695 | -0.758872 | -0.933237 | 0.528813 | 0.717509 |
Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 0.955057 | 2.933814 |
Mumbai | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -0.134841 | 0.302665 | -1.403420 |
df.drop('UPES', axis=1) # axis=1 refers to columns
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | |
---|---|---|---|---|---|
Chennai | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 |
Bihar | -0.319318 | -0.848077 | 0.605965 | -2.018168 | 0.740122 |
UtterPredesh | 0.528813 | -0.589001 | 0.188695 | -0.758872 | -0.933237 |
Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Mumbai | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -0.134841 |
df # But the dropped column still appears; we need inplace=True to remove it permanently
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | UPES | Harshita | |
---|---|---|---|---|---|---|---|
Chennai | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 | 2.706850 | 3.614819 |
Bihar | -0.319318 | -0.848077 | 0.605965 | -2.018168 | 0.740122 | -0.319318 | 0.286647 |
UtterPredesh | 0.528813 | -0.589001 | 0.188695 | -0.758872 | -0.933237 | 0.528813 | 0.717509 |
Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 0.955057 | 2.933814 |
Mumbai | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -0.134841 | 0.302665 | -1.403420 |
df.drop('UPES', axis=1, inplace=True)
df.drop('Harshita', axis=1, inplace=True)
df
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | |
---|---|---|---|---|---|
Chennai | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 |
Bihar | -0.319318 | -0.848077 | 0.605965 | -2.018168 | 0.740122 |
UtterPredesh | 0.528813 | -0.589001 | 0.188695 | -0.758872 | -0.933237 |
Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Mumbai | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -0.134841 |
df.drop('Delhi',axis=0)
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | |
---|---|---|---|---|---|
Chennai | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 |
Bihar | -0.319318 | -0.848077 | 0.605965 | -2.018168 | 0.740122 |
UtterPredesh | 0.528813 | -0.589001 | 0.188695 | -0.758872 | -0.933237 |
Mumbai | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -0.134841 |
df.loc['Chennai']
SRM 2.706850 NIT_PATNA 0.628133 BHU 0.907969 IIT_DELHI 0.503826 IIT_Bombay 0.651118 Name: Chennai, dtype: float64
# We can also select rows purely by integer position
df.iloc[1]
SRM -0.319318 NIT_PATNA -0.848077 BHU 0.605965 IIT_DELHI -2.018168 IIT_Bombay 0.740122 Name: Bihar, dtype: float64
df.loc['Bihar','NIT_PATNA']
-0.8480769834036315
df.loc[['Bihar','Mumbai'],['NIT_PATNA','IIT_DELHI']]
NIT_PATNA | IIT_DELHI | |
---|---|---|
Bihar | -0.848077 | -2.018168 |
Mumbai | 1.693723 | -1.159119 |
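The same subset can be grabbed positionally with `iloc`: Bihar and Mumbai are rows 1 and 4, while NIT_PATNA and IIT_DELHI are columns 1 and 3. A sketch assuming the same seeded DataFrame:

```python
import numpy as np
import pandas as pd
from numpy.random import randn

np.random.seed(101)
df = pd.DataFrame(randn(5, 5),
                  index='Chennai Bihar UtterPredesh Delhi Mumbai'.split(),
                  columns='SRM NIT_PATNA BHU IIT_DELHI IIT_Bombay'.split())

# Positional equivalent of df.loc[['Bihar','Mumbai'], ['NIT_PATNA','IIT_DELHI']]
subset = df.iloc[[1, 4], [1, 3]]
```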
df
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | |
---|---|---|---|---|---|
Chennai | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 |
Bihar | -0.319318 | -0.848077 | 0.605965 | -2.018168 | 0.740122 |
UtterPredesh | 0.528813 | -0.589001 | 0.188695 | -0.758872 | -0.933237 |
Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Mumbai | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -0.134841 |
df>0
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | |
---|---|---|---|---|---|
Chennai | True | True | True | True | True |
Bihar | False | False | True | False | True |
UtterPredesh | True | False | True | False | False |
Delhi | True | True | True | True | True |
Mumbai | True | True | False | False | False |
df[df>0]
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | |
---|---|---|---|---|---|
Chennai | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 |
Bihar | NaN | NaN | 0.605965 | NaN | 0.740122 |
UtterPredesh | 0.528813 | NaN | 0.188695 | NaN | NaN |
Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Mumbai | 0.302665 | 1.693723 | NaN | NaN | NaN |
df[df['SRM']>0] # Bihar is excluded because its SRM value is negative
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | |
---|---|---|---|---|---|
Chennai | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 |
UtterPredesh | 0.528813 | -0.589001 | 0.188695 | -0.758872 | -0.933237 |
Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Mumbai | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -0.134841 |
df[df['SRM']>0]['BHU'] # Bihar is excluded because its SRM value is negative
Chennai 0.907969 UtterPredesh 0.188695 Delhi 1.978757 Mumbai -1.706086 Name: BHU, dtype: float64
df[df['BHU']>0]['SRM'] # Mumbai is excluded because its BHU value is negative
Chennai 2.706850 Bihar -0.319318 UtterPredesh 0.528813 Delhi 0.955057 Name: SRM, dtype: float64
df[df['SRM']>0][['BHU','IIT_DELHI']]
BHU | IIT_DELHI | |
---|---|---|
Chennai | 0.907969 | 0.503826 |
UtterPredesh | 0.188695 | -0.758872 |
Delhi | 1.978757 | 2.605967 |
Mumbai | -1.706086 | -1.159119 |
df[(df['SRM']>0.955) & (df['BHU']>0)] # each condition needs its own parentheses
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | |
---|---|---|---|---|---|
Chennai | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 |
Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
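Besides `&`, pandas provides `|` for element-wise OR and `~` for negation; `and`/`or`/`not` do not work on Series. A sketch on the same seeded DataFrame:

```python
import numpy as np
import pandas as pd
from numpy.random import randn

np.random.seed(101)
df = pd.DataFrame(randn(5, 5),
                  index='Chennai Bihar UtterPredesh Delhi Mumbai'.split(),
                  columns='SRM NIT_PATNA BHU IIT_DELHI IIT_Bombay'.split())

# | is element-wise OR; each condition still needs its own parentheses
either = df[(df['SRM'] > 2) | (df['BHU'] > 1.9)]

# ~ negates a boolean Series
not_positive = df[~(df['SRM'] > 0)]
```

Here `either` keeps Chennai and Delhi, while `not_positive` keeps only Bihar.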
Let's discuss some more features of indexing, including resetting the index or setting it something else. We'll also talk about index hierarchy!
df
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | |
---|---|---|---|---|---|
Chennai | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 |
Bihar | -0.319318 | -0.848077 | 0.605965 | -2.018168 | 0.740122 |
UtterPredesh | 0.528813 | -0.589001 | 0.188695 | -0.758872 | -0.933237 |
Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Mumbai | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -0.134841 |
df.reset_index()
index | SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | |
---|---|---|---|---|---|---|
0 | Chennai | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 |
1 | Bihar | -0.319318 | -0.848077 | 0.605965 | -2.018168 | 0.740122 |
2 | UtterPredesh | 0.528813 | -0.589001 | 0.188695 | -0.758872 | -0.933237 |
3 | Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
4 | Mumbai | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -0.134841 |
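By default `reset_index` keeps the old index as a new `index` column; passing `drop=True` discards it instead. And as with `drop`, nothing sticks without `inplace=True`. A sketch on the same seeded DataFrame:

```python
import numpy as np
import pandas as pd
from numpy.random import randn

np.random.seed(101)
df = pd.DataFrame(randn(5, 5),
                  index='Chennai Bihar UtterPredesh Delhi Mumbai'.split(),
                  columns='SRM NIT_PATNA BHU IIT_DELHI IIT_Bombay'.split())

# drop=True discards the old labels entirely instead of keeping them as a column
reset = df.reset_index(drop=True)
```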
newind ='Tamil_Nadu BIHAR UP Delhi Maharastra'.split()
df['States']=newind
df
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | States | |
---|---|---|---|---|---|---|
Chennai | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 | Tamil_Nadu |
Bihar | -0.319318 | -0.848077 | 0.605965 | -2.018168 | 0.740122 | BIHAR |
UtterPredesh | 0.528813 | -0.589001 | 0.188695 | -0.758872 | -0.933237 | UP |
Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 | Delhi |
Mumbai | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -0.134841 | Maharastra |
df.set_index('States')
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | |
---|---|---|---|---|---|
States | |||||
Tamil_Nadu | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 |
BIHAR | -0.319318 | -0.848077 | 0.605965 | -2.018168 | 0.740122 |
UP | 0.528813 | -0.589001 | 0.188695 | -0.758872 | -0.933237 |
Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Maharastra | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -0.134841 |
df
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | States | |
---|---|---|---|---|---|---|
Chennai | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 | Tamil_Nadu |
Bihar | -0.319318 | -0.848077 | 0.605965 | -2.018168 | 0.740122 | BIHAR |
UtterPredesh | 0.528813 | -0.589001 | 0.188695 | -0.758872 | -0.933237 | UP |
Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 | Delhi |
Mumbai | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -0.134841 | Maharastra |
df.set_index('States',inplace=True)
df
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | |
---|---|---|---|---|---|
States | |||||
Tamil_Nadu | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 |
BIHAR | -0.319318 | -0.848077 | 0.605965 | -2.018168 | 0.740122 |
UP | 0.528813 | -0.589001 | 0.188695 | -0.758872 | -0.933237 |
Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Maharastra | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -0.134841 |
Let us go over how to work with a MultiIndex. First, we'll create a quick example of what a multi-indexed DataFrame looks like:
# Index levels
outside =['Big_data','Big_data','Big_data','AI','AI','AI']
inside =[1,2,3,1,2,3]
hier_index=list(zip(outside,inside))
hier_index=pd.MultiIndex.from_tuples(hier_index)
hier_index
MultiIndex([('Big_data', 1), ('Big_data', 2), ('Big_data', 3), ( 'AI', 1), ( 'AI', 2), ( 'AI', 3)], )
df = pd.DataFrame(np.random.rand(6,2), index=hier_index, columns=['Core','volunteers'])
df
Core | volunteers | ||
---|---|---|---|
Big_data | 1 | 0.701371 | 0.487635 |
2 | 0.680678 | 0.521548 | |
3 | 0.043397 | 0.223937 | |
AI | 1 | 0.575205 | 0.120434 |
2 | 0.500117 | 0.138010 | |
3 | 0.052808 | 0.178277 |
Now let's show how to index this! For an index hierarchy we use df.loc[]; if this were on the columns axis, you would just use normal bracket notation df[]. Calling one level of the index returns the sub-DataFrame:
df.loc['Big_data']
Core | volunteers | |
---|---|---|
1 | 0.701371 | 0.487635 |
2 | 0.680678 | 0.521548 |
3 | 0.043397 | 0.223937 |
df.loc['Big_data'].loc[1]
Core 0.701371 volunteers 0.487635 Name: 1, dtype: float64
df.index.names
FrozenList([None, None])
df.index.names=['Domain','S.NO']
df
Core | volunteers | ||
---|---|---|---|
Domain | S.NO | ||
Big_data | 1 | 0.701371 | 0.487635 |
2 | 0.680678 | 0.521548 | |
3 | 0.043397 | 0.223937 | |
AI | 1 | 0.575205 | 0.120434 |
2 | 0.500117 | 0.138010 | |
3 | 0.052808 | 0.178277 |
df.xs('Big_data')
Core | volunteers | |
---|---|---|
S.NO | ||
1 | 0.701371 | 0.487635 |
2 | 0.680678 | 0.521548 |
3 | 0.043397 | 0.223937 |
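`xs` becomes especially handy when selecting from an *inner* level, something plain `.loc` cannot do as directly. A sketch rebuilding the same hierarchy (the random values will differ from the tables above):

```python
import numpy as np
import pandas as pd

outside = ['Big_data', 'Big_data', 'Big_data', 'AI', 'AI', 'AI']
inside = [1, 2, 3, 1, 2, 3]
hier_index = pd.MultiIndex.from_tuples(list(zip(outside, inside)))

df = pd.DataFrame(np.random.rand(6, 2), index=hier_index, columns=['Core', 'volunteers'])
df.index.names = ['Domain', 'S.NO']

# Cross-section at the inner level: one row per Domain where S.NO == 1
first_rows = df.xs(1, level='S.NO')
```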
weather_data = {
'day': ['1/1/2017','1/2/2017','1/3/2017','1/4/2017','1/5/2017','1/6/2017'],
'temperature': [32,35,28,24,32,31],
'windspeed': [6,7,2,7,4,2],
'event': ['Rain', 'Sunny', 'Snow','Snow','Rain', 'Sunny']
}
df=pd.DataFrame(weather_data)
df
day | temperature | windspeed | event | |
---|---|---|---|---|
0 | 1/1/2017 | 32 | 6 | Rain |
1 | 1/2/2017 | 35 | 7 | Sunny |
2 | 1/3/2017 | 28 | 2 | Snow |
3 | 1/4/2017 | 24 | 7 | Snow |
4 | 1/5/2017 | 32 | 4 | Rain |
5 | 1/6/2017 | 31 | 2 | Sunny |
df.shape # rows, columns = df.shape
(6, 4)
df.head() # df.head(3)
day | temperature | windspeed | event | |
---|---|---|---|---|
0 | 1/1/2017 | 32 | 6 | Rain |
1 | 1/2/2017 | 35 | 7 | Sunny |
2 | 1/3/2017 | 28 | 2 | Snow |
3 | 1/4/2017 | 24 | 7 | Snow |
4 | 1/5/2017 | 32 | 4 | Rain |
df.tail() # df.tail(2)
day | temperature | windspeed | event | |
---|---|---|---|---|
1 | 1/2/2017 | 35 | 7 | Sunny |
2 | 1/3/2017 | 28 | 2 | Snow |
3 | 1/4/2017 | 24 | 7 | Snow |
4 | 1/5/2017 | 32 | 4 | Rain |
5 | 1/6/2017 | 31 | 2 | Sunny |
df[1:3]
day | temperature | windspeed | event | |
---|---|---|---|---|
1 | 1/2/2017 | 35 | 7 | Sunny |
2 | 1/3/2017 | 28 | 2 | Snow |
df.columns
Index(['day', 'temperature', 'windspeed', 'event'], dtype='object')
df['day']
0 1/1/2017 1 1/2/2017 2 1/3/2017 3 1/4/2017 4 1/5/2017 5 1/6/2017 Name: day, dtype: object
type(df['day'])
pandas.core.series.Series
df[['day','temperature']]
day | temperature | |
---|---|---|
0 | 1/1/2017 | 32 |
1 | 1/2/2017 | 35 |
2 | 1/3/2017 | 28 |
3 | 1/4/2017 | 24 |
4 | 1/5/2017 | 32 |
5 | 1/6/2017 | 31 |
df['temperature'].max()
35
df[df['temperature']>32]
day | temperature | windspeed | event | |
---|---|---|---|---|
1 | 1/2/2017 | 35 | 7 | Sunny |
df['day'][df['temperature'] == df['temperature'].max()] # similar to a WHERE clause in SQL
1 1/2/2017 Name: day, dtype: object
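An equivalent, slightly more idiomatic way is `idxmax`, which returns the row label of the maximum directly so the whole row (or any cell of it) can be looked up in one step. A sketch with the same data:

```python
import pandas as pd

weather_data = {
    'day': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017', '1/5/2017', '1/6/2017'],
    'temperature': [32, 35, 28, 24, 32, 31],
    'windspeed': [6, 7, 2, 7, 4, 2],
    'event': ['Rain', 'Sunny', 'Snow', 'Snow', 'Rain', 'Sunny'],
}
df = pd.DataFrame(weather_data)

# idxmax() returns the index label of the maximum temperature
hottest_day = df.loc[df['temperature'].idxmax(), 'day']  # '1/2/2017'
```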
df['temperature'].std()
3.8297084310253524
df['event'].max() # But mean() won't work since data type is string
'Sunny'
df.describe()
temperature | windspeed | |
---|---|---|
count | 6.000000 | 6.000000 |
mean | 30.333333 | 4.666667 |
std | 3.829708 | 2.338090 |
min | 24.000000 | 2.000000 |
25% | 28.750000 | 2.500000 |
50% | 31.500000 | 5.000000 |
75% | 32.000000 | 6.750000 |
max | 35.000000 | 7.000000 |
Let's show a few convenient methods to deal with Missing Data in pandas:
df = pd.DataFrame({'A':[1,2,np.nan],
'B':[5,np.nan,np.nan],
'C':[1,2,3]})
df
A | B | C | |
---|---|---|---|
0 | 1.0 | 5.0 | 1 |
1 | 2.0 | NaN | 2 |
2 | NaN | NaN | 3 |
df.dropna()
A | B | C | |
---|---|---|---|
0 | 1.0 | 5.0 | 1 |
df.dropna(axis=1)
C | |
---|---|
0 | 1 |
1 | 2 |
2 | 3 |
df.dropna(thresh=2)
A | B | C | |
---|---|---|---|
0 | 1.0 | 5.0 | 1 |
1 | 2.0 | NaN | 2 |
df.fillna(value='shivendra')
A | B | C | |
---|---|---|---|
0 | 1 | 5 | 1 |
1 | 2 | shivendra | 2 |
2 | shivendra | shivendra | 3 |
df['A'].fillna(value=df['A'].mean())
0 1.0 1 2.0 2 1.5 Name: A, dtype: float64
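The same idea extends to the whole frame: passing `df.mean()` (a Series of column means) to `fillna` fills each column with its own mean. A sketch on the same toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# Each NaN is replaced by the mean of its own column (A -> 1.5, B -> 5.0)
filled = df.fillna(df.mean())
```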
The groupby method allows you to group rows of data together and call aggregate functions on each group.
import pandas as pd
# Create dataframe
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
'Person':['Shivendra','Abhishek','Sowjanya','Manish','Mini','Satya'],
'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)
df
Company | Person | Sales | |
---|---|---|---|
0 | GOOG | Shivendra | 200 |
1 | GOOG | Abhishek | 120 |
2 | MSFT | Sowjanya | 340 |
3 | MSFT | Manish | 124 |
4 | FB | Mini | 243 |
5 | FB | Satya | 350 |
**Now you can use the .groupby() method to group rows together based off of a column name. For instance, let's group based off of Company. This will create a DataFrameGroupBy object:**
df.groupby('Company')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001F12523CB08>
#You can save this object as a new variable:
by_comp = df.groupby("Company")
#And then call aggregate methods off the object:
by_comp.mean()
Sales | |
---|---|
Company | |
FB | 296.5 |
GOOG | 160.0 |
MSFT | 232.0 |
df.groupby('Company').mean()
Sales | |
---|---|
Company | |
FB | 296.5 |
GOOG | 160.0 |
MSFT | 232.0 |
#More examples of aggregate methods:
by_comp.std()
Sales | |
---|---|
Company | |
FB | 75.660426 |
GOOG | 56.568542 |
MSFT | 152.735065 |
by_comp.max()
Person | Sales | |
---|---|---|
Company | ||
FB | Satya | 350 |
GOOG | Shivendra | 200 |
MSFT | Sowjanya | 340 |
by_comp.min()
Person | Sales | |
---|---|---|
Company | ||
FB | Mini | 243 |
GOOG | Abhishek | 120 |
MSFT | Manish | 124 |
by_comp.describe()
Sales | ||||||||
---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | |
Company | ||||||||
FB | 2.0 | 296.5 | 75.660426 | 243.0 | 269.75 | 296.5 | 323.25 | 350.0 |
GOOG | 2.0 | 160.0 | 56.568542 | 120.0 | 140.00 | 160.0 | 180.00 | 200.0 |
MSFT | 2.0 | 232.0 | 152.735065 | 124.0 | 178.00 | 232.0 | 286.00 | 340.0 |
by_comp.describe().transpose()
Company | FB | GOOG | MSFT | |
---|---|---|---|---|
Sales | count | 2.000000 | 2.000000 | 2.000000 |
mean | 296.500000 | 160.000000 | 232.000000 | |
std | 75.660426 | 56.568542 | 152.735065 | |
min | 243.000000 | 120.000000 | 124.000000 | |
25% | 269.750000 | 140.000000 | 178.000000 | |
50% | 296.500000 | 160.000000 | 232.000000 | |
75% | 323.250000 | 180.000000 | 286.000000 | |
max | 350.000000 | 200.000000 | 340.000000 |
by_comp.describe().transpose()['GOOG']
Sales count 2.000000 mean 160.000000 std 56.568542 min 120.000000 25% 140.000000 50% 160.000000 75% 180.000000 max 200.000000 Name: GOOG, dtype: float64
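When only a few statistics are needed rather than the full `describe()`, `agg` can apply several aggregate functions in one call. A sketch on the same data:

```python
import pandas as pd

data = {'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
        'Person': ['Shivendra', 'Abhishek', 'Sowjanya', 'Manish', 'Mini', 'Satya'],
        'Sales': [200, 120, 340, 124, 243, 350]}
df = pd.DataFrame(data)

# One row per company, one column per aggregate function
summary = df.groupby('Company')['Sales'].agg(['mean', 'min', 'max'])
```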
There are three main ways of combining DataFrames: merging, joining, and concatenating. In this section we will discuss these methods with examples.
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
'B': ['B4', 'B5', 'B6', 'B7'],
'C': ['C4', 'C5', 'C6', 'C7'],
'D': ['D4', 'D5', 'D6', 'D7']},
index=[4, 5, 6, 7])
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
'B': ['B8', 'B9', 'B10', 'B11'],
'C': ['C8', 'C9', 'C10', 'C11'],
'D': ['D8', 'D9', 'D10', 'D11']},
index=[8, 9, 10, 11])
df1
A | B | C | D | |
---|---|---|---|---|
0 | A0 | B0 | C0 | D0 |
1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 |
3 | A3 | B3 | C3 | D3 |
df2
A | B | C | D | |
---|---|---|---|---|
4 | A4 | B4 | C4 | D4 |
5 | A5 | B5 | C5 | D5 |
6 | A6 | B6 | C6 | D6 |
7 | A7 | B7 | C7 | D7 |
df3
A | B | C | D | |
---|---|---|---|---|
8 | A8 | B8 | C8 | D8 |
9 | A9 | B9 | C9 | D9 |
10 | A10 | B10 | C10 | D10 |
11 | A11 | B11 | C11 | D11 |
Concatenation basically glues together DataFrames. Keep in mind that dimensions should match along the axis you are concatenating on. You can use pd.concat and pass in a list of DataFrames to concatenate together:
pd.concat([df1,df2,df3])
A | B | C | D | |
---|---|---|---|---|
0 | A0 | B0 | C0 | D0 |
1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 |
3 | A3 | B3 | C3 | D3 |
4 | A4 | B4 | C4 | D4 |
5 | A5 | B5 | C5 | D5 |
6 | A6 | B6 | C6 | D6 |
7 | A7 | B7 | C7 | D7 |
8 | A8 | B8 | C8 | D8 |
9 | A9 | B9 | C9 | D9 |
10 | A10 | B10 | C10 | D10 |
11 | A11 | B11 | C11 | D11 |
pd.concat([df1,df2,df3],axis=1)
A | B | C | D | A | B | C | D | A | B | C | D | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | A0 | B0 | C0 | D0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | A1 | B1 | C1 | D1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | A2 | B2 | C2 | D2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | A3 | B3 | C3 | D3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | NaN | NaN | NaN | NaN | A4 | B4 | C4 | D4 | NaN | NaN | NaN | NaN |
5 | NaN | NaN | NaN | NaN | A5 | B5 | C5 | D5 | NaN | NaN | NaN | NaN |
6 | NaN | NaN | NaN | NaN | A6 | B6 | C6 | D6 | NaN | NaN | NaN | NaN |
7 | NaN | NaN | NaN | NaN | A7 | B7 | C7 | D7 | NaN | NaN | NaN | NaN |
8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A8 | B8 | C8 | D8 |
9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A9 | B9 | C9 | D9 |
10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A10 | B10 | C10 | D10 |
11 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A11 | B11 | C11 | D11 |
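Merging, by contrast, combines DataFrames SQL-style on shared column values rather than by position. A small sketch with two hypothetical frames, `left` and `right`:

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K3'], 'B': ['B0', 'B1', 'B3']})

# how='inner' keeps only keys present in both frames (K0 and K1 here)
merged = pd.merge(left, right, on='key', how='inner')
```

Other `how` values ('outer', 'left', 'right') behave like their SQL join counterparts.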
There are lots of operations in pandas that will be really useful to you but don't fall into any distinct category. Let's show them here in this lecture:
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df.head()
col1 | col2 | col3 | |
---|---|---|---|
0 | 1 | 444 | abc |
1 | 2 | 555 | def |
2 | 3 | 666 | ghi |
3 | 4 | 444 | xyz |
df['col2'].unique()
array([444, 555, 666], dtype=int64)
df['col2'].nunique()
3
df['col2'].value_counts()
444 2 555 1 666 1 Name: col2, dtype: int64
#Select from DataFrame using criteria from multiple columns
newdf = df[(df['col1']>2) & (df['col2']==444)]
newdf
col1 | col2 | col3 | |
---|---|---|---|
3 | 4 | 444 | xyz |
# Applying Functions
def times2(x):
return x*2
df['col1'].apply(times2)
0 2 1 4 2 6 3 8 Name: col1, dtype: int64
df['col3'].apply(len)
0 3 1 3 2 3 3 3 Name: col3, dtype: int64
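For a one-off transformation, a lambda is the more common idiom than defining a named function like `times2` above. A sketch with the same frame:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [444, 555, 666, 444],
                   'col3': ['abc', 'def', 'ghi', 'xyz']})

# Same effect as apply(times2), without the separate function definition
doubled = df['col1'].apply(lambda x: x * 2)
```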
df['col1'].sum()
10
**Permanently Removing a Column**
del df['col1']
df
col2 | col3 | |
---|---|---|
0 | 444 | abc |
1 | 555 | def |
2 | 666 | ghi |
3 | 444 | xyz |
# get columns and index names
df.columns
Index(['col2', 'col3'], dtype='object')
df.index
RangeIndex(start=0, stop=4, step=1)
df
col2 | col3 | |
---|---|---|
0 | 444 | abc |
1 | 555 | def |
2 | 666 | ghi |
3 | 444 | xyz |
df.sort_values(by='col2') #inplace=False by default
col2 | col3 | |
---|---|---|
0 | 444 | abc |
3 | 444 | xyz |
1 | 555 | def |
2 | 666 | ghi |
# Check whether there are any null values
df.isnull()
col2 | col3 | |
---|---|---|
0 | False | False |
1 | False | False |
2 | False | False |
3 | False | False |
# Drop rows with NaN Values
df.dropna()
col2 | col3 | |
---|---|---|
0 | 444 | abc |
1 | 555 | def |
2 | 666 | ghi |
3 | 444 | xyz |
df = pd.DataFrame({'col1':[1,2,3,np.nan],
'col2':[np.nan,555,666,444],
'col3':['abc','def','ghi','xyz']})
df.head()
col1 | col2 | col3 | |
---|---|---|---|
0 | 1.0 | NaN | abc |
1 | 2.0 | 555.0 | def |
2 | 3.0 | 666.0 | ghi |
3 | NaN | 444.0 | xyz |
df.fillna('FILL')
col1 | col2 | col3 | |
---|---|---|---|
0 | 1 | FILL | abc |
1 | 2 | 555 | def |
2 | 3 | 666 | ghi |
3 | FILL | 444 | xyz |
data = {'A':['foo','foo','foo','bar','bar','bar'],
'B':['one','one','two','two','one','one'],
'C':['x','y','x','y','x','y'],
'D':[1,3,2,5,4,1]}
df = pd.DataFrame(data)
df
A | B | C | D | |
---|---|---|---|---|
0 | foo | one | x | 1 |
1 | foo | one | y | 3 |
2 | foo | two | x | 2 |
3 | bar | two | y | 5 |
4 | bar | one | x | 4 |
5 | bar | one | y | 1 |
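This layout is the typical input for `pivot_table`, which reshapes long data into a spreadsheet-style summary; presumably that is where the lecture is heading. A sketch of what it produces:

```python
import pandas as pd

data = {'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
        'B': ['one', 'one', 'two', 'two', 'one', 'one'],
        'C': ['x', 'y', 'x', 'y', 'x', 'y'],
        'D': [1, 3, 2, 5, 4, 1]}
df = pd.DataFrame(data)

# A and B become a row MultiIndex, C values become columns,
# and D is aggregated (mean by default) into the cells
table = df.pivot_table(values='D', index=['A', 'B'], columns=['C'])
```

Combinations that never occur in the data, such as ('foo', 'two') with C == 'y', show up as NaN.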