Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
You can think of pandas as an extremely powerful version of Excel.
In this section, we are going to cover the core pandas data structures and operations.
First, we are going to talk about the Series data type.
A Series is very similar to a NumPy array; in fact, it is built on top of the NumPy array object. What differentiates a Series is that it can have axis labels, meaning it can be indexed by labels instead of just number locations.
Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.). The axis labels are collectively called index.
# Let's import NumPy and pandas
import numpy as np
import pandas as pd
# We can convert a list, NumPy array, or dict to a Series
labels = ['Shivendra','Ragavendra','Narendra']
my_list= [21,25,30]
arr=np.array([10,20,30])
d={'Shivendra':21,'Raghavendra':25,'Narendra':30}
# Using List
pd.Series(data=my_list)
0 21 1 25 2 30 dtype: int64
pd.Series(data=my_list, index=labels)
Shivendra 21 Ragavendra 25 Narendra 30 dtype: int64
pd.Series(my_list,labels)
Shivendra 21 Ragavendra 25 Narendra 30 dtype: int64
# NumPy Array
pd.Series(arr)
0 10 1 20 2 30 dtype: int32
pd.Series(data=arr, index=labels)
Shivendra 10 Ragavendra 20 Narendra 30 dtype: int32
# Dictionary
pd.Series(d)
Shivendra 21 Raghavendra 25 Narendra 30 dtype: int64
A pandas Series can hold a variety of object types.
pd.Series(data=labels)
0 Shivendra 1 Ragavendra 2 Narendra dtype: object
The key to using a Series is understanding its index. Pandas makes use of these index names or numbers by allowing for fast lookups of information (it works like a hash table or dictionary).
Let's see some examples of how to grab information from a Series. Let us create two Series, ser1 and ser2:
ser1 = pd.Series([1,2,3,4], index=['Chennai','Bihar','West Bengal','Rajasthan'])
ser1
Chennai 1 Bihar 2 West Bengal 3 Rajasthan 4 dtype: int64
ser2 = pd.Series([1,2,5,4], index=['Chennai','Bihar','Assam','Rajasthan'])
ser2
Chennai 1 Bihar 2 Assam 5 Rajasthan 4 dtype: int64
ser1['Chennai']
1
# Operations are then also done based off of index:
ser1+ser2
Assam NaN Bihar 4.0 Chennai 2.0 Rajasthan 8.0 West Bengal NaN dtype: float64
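If the NaN values produced by unmatched labels are unwanted, the `add` method with a `fill_value` can be used instead of the `+` operator. A small sketch using the same two Series:

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['Chennai', 'Bihar', 'West Bengal', 'Rajasthan'])
ser2 = pd.Series([1, 2, 5, 4], index=['Chennai', 'Bihar', 'Assam', 'Rajasthan'])

# fill_value treats a label missing from one Series as 0 instead of producing NaN
total = ser1.add(ser2, fill_value=0)
```

Now `Assam` comes out as 5.0 and `West Bengal` as 3.0 rather than NaN.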
A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes; that is, data is aligned in a tabular fashion in rows and columns. A DataFrame consists of three principal components: the data, the rows, and the columns.
DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let's use pandas to explore this topic!
import pandas as pd
import numpy as np
from numpy.random import randn
np.random.seed(101)
df = pd.DataFrame(randn(5,5), index='Chennai Bihar UtterPredesh Delhi Mumbai'.split(), columns='SRM NIT_PATNA BHU IIT_DELHI IIT_Bombay'.split())
df
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | |
---|---|---|---|---|---|
Chennai | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 |
Bihar | -0.319318 | -0.848077 | 0.605965 | -2.018168 | 0.740122 |
UtterPredesh | 0.528813 | -0.589001 | 0.188695 | -0.758872 | -0.933237 |
Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Mumbai | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -0.134841 |
Let's learn the various methods to grab data from a DataFrame
df['SRM']
Chennai 2.706850 Bihar -0.319318 UtterPredesh 0.528813 Delhi 0.955057 Mumbai 0.302665 Name: SRM, dtype: float64
# We can pass a list of column names
df[['SRM', 'BHU']]
SRM | BHU | |
---|---|---|
Chennai | 2.706850 | 0.907969 |
Bihar | -0.319318 | 0.605965 |
UtterPredesh | 0.528813 | 0.188695 |
Delhi | 0.955057 | 1.978757 |
Mumbai | 0.302665 | -1.706086 |
df.SRM # Attribute-style access; convenient, but can clash with DataFrame method names
Chennai 2.706850 Bihar -0.319318 UtterPredesh 0.528813 Delhi 0.955057 Mumbai 0.302665 Name: SRM, dtype: float64
# DataFrame columns are just Series
type(df['SRM'])
pandas.core.series.Series
# Creating new columns
df['UPES']=df['SRM']
df['Harshita']=df['SRM'] + df['BHU']
df
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | UPES | Harshita | |
---|---|---|---|---|---|---|---|
Chennai | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 | 2.706850 | 3.614819 |
Bihar | -0.319318 | -0.848077 | 0.605965 | -2.018168 | 0.740122 | -0.319318 | 0.286647 |
UtterPredesh | 0.528813 | -0.589001 | 0.188695 | -0.758872 | -0.933237 | 0.528813 | 0.717509 |
Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 0.955057 | 2.933814 |
Mumbai | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -0.134841 | 0.302665 | -1.403420 |
df.drop('UPES', axis=1) # axis=1 refers to columns
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | |
---|---|---|---|---|---|
Chennai | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 |
Bihar | -0.319318 | -0.848077 | 0.605965 | -2.018168 | 0.740122 |
UtterPredesh | 0.528813 | -0.589001 | 0.188695 | -0.758872 | -0.933237 |
Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Mumbai | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -0.134841 |
df # But the dropped column still appears; we need inplace=True to remove it permanently
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | UPES | Harshita | |
---|---|---|---|---|---|---|---|
Chennai | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 | 2.706850 | 3.614819 |
Bihar | -0.319318 | -0.848077 | 0.605965 | -2.018168 | 0.740122 | -0.319318 | 0.286647 |
UtterPredesh | 0.528813 | -0.589001 | 0.188695 | -0.758872 | -0.933237 | 0.528813 | 0.717509 |
Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 0.955057 | 2.933814 |
Mumbai | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -0.134841 | 0.302665 | -1.403420 |
df.drop('UPES', axis=1, inplace=True)
df.drop('Harshita', axis=1, inplace=True)
df
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | |
---|---|---|---|---|---|
Chennai | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 |
Bihar | -0.319318 | -0.848077 | 0.605965 | -2.018168 | 0.740122 |
UtterPredesh | 0.528813 | -0.589001 | 0.188695 | -0.758872 | -0.933237 |
Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Mumbai | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -0.134841 |
df.drop('Delhi',axis=0)
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | |
---|---|---|---|---|---|
Chennai | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 |
Bihar | -0.319318 | -0.848077 | 0.605965 | -2.018168 | 0.740122 |
UtterPredesh | 0.528813 | -0.589001 | 0.188695 | -0.758872 | -0.933237 |
Mumbai | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -0.134841 |
df.loc['Chennai']
SRM 2.706850 NIT_PATNA 0.628133 BHU 0.907969 IIT_DELHI 0.503826 IIT_Bombay 0.651118 Name: Chennai, dtype: float64
# We can also select rows purely by integer position
df.iloc[1]
SRM -0.319318 NIT_PATNA -0.848077 BHU 0.605965 IIT_DELHI -2.018168 IIT_Bombay 0.740122 Name: Bihar, dtype: float64
df.loc['Bihar','NIT_PATNA']
-0.8480769834036315
df.loc[['Bihar','Mumbai'],['NIT_PATNA','IIT_DELHI']]
NIT_PATNA | IIT_DELHI | |
---|---|---|
Bihar | -0.848077 | -2.018168 |
Mumbai | 1.693723 | -1.159119 |
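The same subset can be grabbed positionally with `iloc`: Bihar and Mumbai are rows 1 and 4, while NIT_PATNA and IIT_DELHI are columns 1 and 3. A sketch assuming the same seeded DataFrame:

```python
import numpy as np
import pandas as pd
from numpy.random import randn

np.random.seed(101)
df = pd.DataFrame(randn(5, 5),
                  index='Chennai Bihar UtterPredesh Delhi Mumbai'.split(),
                  columns='SRM NIT_PATNA BHU IIT_DELHI IIT_Bombay'.split())

# Positional equivalent of df.loc[['Bihar','Mumbai'], ['NIT_PATNA','IIT_DELHI']]
subset = df.iloc[[1, 4], [1, 3]]
```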
df
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | |
---|---|---|---|---|---|
Chennai | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 |
Bihar | -0.319318 | -0.848077 | 0.605965 | -2.018168 | 0.740122 |
UtterPredesh | 0.528813 | -0.589001 | 0.188695 | -0.758872 | -0.933237 |
Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Mumbai | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -0.134841 |
df>0
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | |
---|---|---|---|---|---|
Chennai | True | True | True | True | True |
Bihar | False | False | True | False | True |
UtterPredesh | True | False | True | False | False |
Delhi | True | True | True | True | True |
Mumbai | True | True | False | False | False |
df[df>0]
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | |
---|---|---|---|---|---|
Chennai | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 |
Bihar | NaN | NaN | 0.605965 | NaN | 0.740122 |
UtterPredesh | 0.528813 | NaN | 0.188695 | NaN | NaN |
Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Mumbai | 0.302665 | 1.693723 | NaN | NaN | NaN |
df[df['SRM']>0] # Bihar is excluded because its SRM value is negative
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | |
---|---|---|---|---|---|
Chennai | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 |
UtterPredesh | 0.528813 | -0.589001 | 0.188695 | -0.758872 | -0.933237 |
Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Mumbai | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -0.134841 |
df[df['SRM']>0]['BHU'] # Bihar is excluded because its SRM value is negative
Chennai 0.907969 UtterPredesh 0.188695 Delhi 1.978757 Mumbai -1.706086 Name: BHU, dtype: float64
df[df['BHU']>0]['SRM'] # Mumbai is excluded because its BHU value is negative
Chennai 2.706850 Bihar -0.319318 UtterPredesh 0.528813 Delhi 0.955057 Name: SRM, dtype: float64
df[df['SRM']>0][['BHU','IIT_DELHI']]
BHU | IIT_DELHI | |
---|---|---|
Chennai | 0.907969 | 0.503826 |
UtterPredesh | 0.188695 | -0.758872 |
Delhi | 1.978757 | 2.605967 |
Mumbai | -1.706086 | -1.159119 |
df[(df['SRM']>0.955) & (df['BHU']>0)] # each condition needs its own parentheses
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | |
---|---|---|---|---|---|
Chennai | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 |
Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
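Besides `&`, pandas provides `|` for element-wise OR and `~` for negation; `and`/`or`/`not` do not work on Series. A sketch on the same seeded DataFrame:

```python
import numpy as np
import pandas as pd
from numpy.random import randn

np.random.seed(101)
df = pd.DataFrame(randn(5, 5),
                  index='Chennai Bihar UtterPredesh Delhi Mumbai'.split(),
                  columns='SRM NIT_PATNA BHU IIT_DELHI IIT_Bombay'.split())

# | is element-wise OR; each condition still needs its own parentheses
either = df[(df['SRM'] > 2) | (df['BHU'] > 1.9)]

# ~ negates a boolean Series
not_positive = df[~(df['SRM'] > 0)]
```

Here `either` keeps Chennai and Delhi, while `not_positive` keeps only Bihar.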
Let's discuss some more features of indexing, including resetting the index or setting it something else. We'll also talk about index hierarchy!
df
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | |
---|---|---|---|---|---|
Chennai | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 |
Bihar | -0.319318 | -0.848077 | 0.605965 | -2.018168 | 0.740122 |
UtterPredesh | 0.528813 | -0.589001 | 0.188695 | -0.758872 | -0.933237 |
Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Mumbai | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -0.134841 |
df.reset_index()
index | SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | |
---|---|---|---|---|---|---|
0 | Chennai | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 |
1 | Bihar | -0.319318 | -0.848077 | 0.605965 | -2.018168 | 0.740122 |
2 | UtterPredesh | 0.528813 | -0.589001 | 0.188695 | -0.758872 | -0.933237 |
3 | Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
4 | Mumbai | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -0.134841 |
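By default `reset_index` keeps the old index as a new `index` column; passing `drop=True` discards it instead. And as with `drop`, nothing sticks without `inplace=True`. A sketch on the same seeded DataFrame:

```python
import numpy as np
import pandas as pd
from numpy.random import randn

np.random.seed(101)
df = pd.DataFrame(randn(5, 5),
                  index='Chennai Bihar UtterPredesh Delhi Mumbai'.split(),
                  columns='SRM NIT_PATNA BHU IIT_DELHI IIT_Bombay'.split())

# drop=True discards the old labels entirely instead of keeping them as a column
reset = df.reset_index(drop=True)
```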
newind ='Tamil_Nadu BIHAR UP Delhi Maharastra'.split()
df['States']=newind
df
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | States | |
---|---|---|---|---|---|---|
Chennai | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 | Tamil_Nadu |
Bihar | -0.319318 | -0.848077 | 0.605965 | -2.018168 | 0.740122 | BIHAR |
UtterPredesh | 0.528813 | -0.589001 | 0.188695 | -0.758872 | -0.933237 | UP |
Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 | Delhi |
Mumbai | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -0.134841 | Maharastra |
df.set_index('States')
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | |
---|---|---|---|---|---|
States | |||||
Tamil_Nadu | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 |
BIHAR | -0.319318 | -0.848077 | 0.605965 | -2.018168 | 0.740122 |
UP | 0.528813 | -0.589001 | 0.188695 | -0.758872 | -0.933237 |
Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Maharastra | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -0.134841 |
df
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | States | |
---|---|---|---|---|---|---|
Chennai | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 | Tamil_Nadu |
Bihar | -0.319318 | -0.848077 | 0.605965 | -2.018168 | 0.740122 | BIHAR |
UtterPredesh | 0.528813 | -0.589001 | 0.188695 | -0.758872 | -0.933237 | UP |
Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 | Delhi |
Mumbai | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -0.134841 | Maharastra |
df.set_index('States',inplace=True)
df
SRM | NIT_PATNA | BHU | IIT_DELHI | IIT_Bombay | |
---|---|---|---|---|---|
States | |||||
Tamil_Nadu | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 0.651118 |
BIHAR | -0.319318 | -0.848077 | 0.605965 | -2.018168 | 0.740122 |
UP | 0.528813 | -0.589001 | 0.188695 | -0.758872 | -0.933237 |
Delhi | 0.955057 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Maharastra | 0.302665 | 1.693723 | -1.706086 | -1.159119 | -0.134841 |
Let us go over how to work with a MultiIndex. First, we'll create a quick example of what a multi-indexed DataFrame looks like:
# Index levels
outside =['Big_data','Big_data','Big_data','AI','AI','AI']
inside =[1,2,3,1,2,3]
hier_index=list(zip(outside,inside))
hier_index=pd.MultiIndex.from_tuples(hier_index)
hier_index
MultiIndex([('Big_data', 1), ('Big_data', 2), ('Big_data', 3), ( 'AI', 1), ( 'AI', 2), ( 'AI', 3)], )
df = pd.DataFrame(np.random.rand(6,2), index=hier_index, columns=['Core','volunteers'])
df
Core | volunteers | ||
---|---|---|---|
Big_data | 1 | 0.701371 | 0.487635 |
2 | 0.680678 | 0.521548 | |
3 | 0.043397 | 0.223937 | |
AI | 1 | 0.575205 | 0.120434 |
2 | 0.500117 | 0.138010 | |
3 | 0.052808 | 0.178277 |
Now let's show how to index this! For an index hierarchy we use df.loc[]; if this were on the columns axis, you would just use normal bracket notation df[]. Calling one level of the index returns the sub-DataFrame:
df.loc['Big_data']
Core | volunteers | |
---|---|---|
1 | 0.701371 | 0.487635 |
2 | 0.680678 | 0.521548 |
3 | 0.043397 | 0.223937 |
df.loc['Big_data'].loc[1]
Core 0.701371 volunteers 0.487635 Name: 1, dtype: float64
df.index.names
FrozenList([None, None])
df.index.names=['Domain','S.NO']
df
Core | volunteers | ||
---|---|---|---|
Domain | S.NO | ||
Big_data | 1 | 0.701371 | 0.487635 |
2 | 0.680678 | 0.521548 | |
3 | 0.043397 | 0.223937 | |
AI | 1 | 0.575205 | 0.120434 |
2 | 0.500117 | 0.138010 | |
3 | 0.052808 | 0.178277 |
df.xs('Big_data')
Core | volunteers | |
---|---|---|
S.NO | ||
1 | 0.701371 | 0.487635 |
2 | 0.680678 | 0.521548 |
3 | 0.043397 | 0.223937 |
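`xs` becomes especially handy when selecting from an *inner* level, something plain `.loc` cannot do as directly. A sketch rebuilding the same hierarchy (the random values will differ from the tables above):

```python
import numpy as np
import pandas as pd

outside = ['Big_data', 'Big_data', 'Big_data', 'AI', 'AI', 'AI']
inside = [1, 2, 3, 1, 2, 3]
hier_index = pd.MultiIndex.from_tuples(list(zip(outside, inside)))

df = pd.DataFrame(np.random.rand(6, 2), index=hier_index, columns=['Core', 'volunteers'])
df.index.names = ['Domain', 'S.NO']

# Cross-section at the inner level: one row per Domain where S.NO == 1
first_rows = df.xs(1, level='S.NO')
```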
weather_data = {
'day': ['1/1/2017','1/2/2017','1/3/2017','1/4/2017','1/5/2017','1/6/2017'],
'temperature': [32,35,28,24,32,31],
'windspeed': [6,7,2,7,4,2],
'event': ['Rain', 'Sunny', 'Snow','Snow','Rain', 'Sunny']
}
df=pd.DataFrame(weather_data)
df
day | temperature | windspeed | event | |
---|---|---|---|---|
0 | 1/1/2017 | 32 | 6 | Rain |
1 | 1/2/2017 | 35 | 7 | Sunny |
2 | 1/3/2017 | 28 | 2 | Snow |
3 | 1/4/2017 | 24 | 7 | Snow |
4 | 1/5/2017 | 32 | 4 | Rain |
5 | 1/6/2017 | 31 | 2 | Sunny |
df.shape # rows, columns = df.shape
(6, 4)
df.head() # df.head(3)
day | temperature | windspeed | event | |
---|---|---|---|---|
0 | 1/1/2017 | 32 | 6 | Rain |
1 | 1/2/2017 | 35 | 7 | Sunny |
2 | 1/3/2017 | 28 | 2 | Snow |
3 | 1/4/2017 | 24 | 7 | Snow |
4 | 1/5/2017 | 32 | 4 | Rain |
df.tail() # df.tail(2)
day | temperature | windspeed | event | |
---|---|---|---|---|
1 | 1/2/2017 | 35 | 7 | Sunny |
2 | 1/3/2017 | 28 | 2 | Snow |
3 | 1/4/2017 | 24 | 7 | Snow |
4 | 1/5/2017 | 32 | 4 | Rain |
5 | 1/6/2017 | 31 | 2 | Sunny |
df[1:3]
day | temperature | windspeed | event | |
---|---|---|---|---|
1 | 1/2/2017 | 35 | 7 | Sunny |
2 | 1/3/2017 | 28 | 2 | Snow |
df.columns
Index(['day', 'temperature', 'windspeed', 'event'], dtype='object')
df['day']
0 1/1/2017 1 1/2/2017 2 1/3/2017 3 1/4/2017 4 1/5/2017 5 1/6/2017 Name: day, dtype: object
type(df['day'])
pandas.core.series.Series
df[['day','temperature']]
day | temperature | |
---|---|---|
0 | 1/1/2017 | 32 |
1 | 1/2/2017 | 35 |
2 | 1/3/2017 | 28 |
3 | 1/4/2017 | 24 |
4 | 1/5/2017 | 32 |
5 | 1/6/2017 | 31 |
df['temperature'].max()
35
df[df['temperature']>32]
day | temperature | windspeed | event | |
---|---|---|---|---|
1 | 1/2/2017 | 35 | 7 | Sunny |
df['day'][df['temperature'] == df['temperature'].max()] # similar to a WHERE clause in SQL
1 1/2/2017 Name: day, dtype: object
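An equivalent, slightly more idiomatic way is `idxmax`, which returns the row label of the maximum directly so the whole row (or any cell of it) can be looked up in one step. A sketch with the same data:

```python
import pandas as pd

weather_data = {
    'day': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017', '1/5/2017', '1/6/2017'],
    'temperature': [32, 35, 28, 24, 32, 31],
    'windspeed': [6, 7, 2, 7, 4, 2],
    'event': ['Rain', 'Sunny', 'Snow', 'Snow', 'Rain', 'Sunny'],
}
df = pd.DataFrame(weather_data)

# idxmax() returns the index label of the maximum temperature
hottest_day = df.loc[df['temperature'].idxmax(), 'day']  # '1/2/2017'
```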
df['temperature'].std()
3.8297084310253524
df['event'].max() # But mean() won't work since data type is string
'Sunny'
df.describe()
temperature | windspeed | |
---|---|---|
count | 6.000000 | 6.000000 |
mean | 30.333333 | 4.666667 |
std | 3.829708 | 2.338090 |
min | 24.000000 | 2.000000 |
25% | 28.750000 | 2.500000 |
50% | 31.500000 | 5.000000 |
75% | 32.000000 | 6.750000 |
max | 35.000000 | 7.000000 |
Let's show a few convenient methods to deal with Missing Data in pandas:
df = pd.DataFrame({'A':[1,2,np.nan],
'B':[5,np.nan,np.nan],
'C':[1,2,3]})
df
A | B | C | |
---|---|---|---|
0 | 1.0 | 5.0 | 1 |
1 | 2.0 | NaN | 2 |
2 | NaN | NaN | 3 |
df.dropna()
A | B | C | |
---|---|---|---|
0 | 1.0 | 5.0 | 1 |
df.dropna(axis=1)
C | |
---|---|
0 | 1 |
1 | 2 |
2 | 3 |
df.dropna(thresh=2)
A | B | C | |
---|---|---|---|
0 | 1.0 | 5.0 | 1 |
1 | 2.0 | NaN | 2 |
df.fillna(value='shivendra')
A | B | C | |
---|---|---|---|
0 | 1 | 5 | 1 |
1 | 2 | shivendra | 2 |
2 | shivendra | shivendra | 3 |
df['A'].fillna(value=df['A'].mean())
0 1.0 1 2.0 2 1.5 Name: A, dtype: float64
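The same idea extends to the whole frame: passing `df.mean()` (a Series of column means) to `fillna` fills each column with its own mean. A sketch on the same toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# Each NaN is replaced by the mean of its own column (A -> 1.5, B -> 5.0)
filled = df.fillna(df.mean())
```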
The groupby method allows you to group rows of data together and call aggregate functions on each group.
import pandas as pd
# Create dataframe
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
'Person':['Shivendra','Abhishek','Sowjanya','Manish','Mini','Satya'],
'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)
df
Company | Person | Sales | |
---|---|---|---|
0 | GOOG | Shivendra | 200 |
1 | GOOG | Abhishek | 120 |
2 | MSFT | Sowjanya | 340 |
3 | MSFT | Manish | 124 |
4 | FB | Mini | 243 |
5 | FB | Satya | 350 |
**Now you can use the .groupby() method to group rows together based off of a column name. For instance, let's group based off of Company. This will create a DataFrameGroupBy object:**
df.groupby('Company')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001F12523CB08>
#You can save this object as a new variable:
by_comp = df.groupby("Company")
#And then call aggregate methods off the object:
by_comp.mean()
Sales | |
---|---|
Company | |
FB | 296.5 |
GOOG | 160.0 |
MSFT | 232.0 |
df.groupby('Company').mean()
Sales | |
---|---|
Company | |
FB | 296.5 |
GOOG | 160.0 |
MSFT | 232.0 |
#More examples of aggregate methods:
by_comp.std()
Sales | |
---|---|
Company | |
FB | 75.660426 |
GOOG | 56.568542 |
MSFT | 152.735065 |
by_comp.max()
Person | Sales | |
---|---|---|
Company | ||
FB | Satya | 350 |
GOOG | Shivendra | 200 |
MSFT | Sowjanya | 340 |
by_comp.min()
Person | Sales | |
---|---|---|
Company | ||
FB | Mini | 243 |
GOOG | Abhishek | 120 |
MSFT | Manish | 124 |
by_comp.describe()
Sales | ||||||||
---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | |
Company | ||||||||
FB | 2.0 | 296.5 | 75.660426 | 243.0 | 269.75 | 296.5 | 323.25 | 350.0 |
GOOG | 2.0 | 160.0 | 56.568542 | 120.0 | 140.00 | 160.0 | 180.00 | 200.0 |
MSFT | 2.0 | 232.0 | 152.735065 | 124.0 | 178.00 | 232.0 | 286.00 | 340.0 |
by_comp.describe().transpose()
Company | FB | GOOG | MSFT | |
---|---|---|---|---|
Sales | count | 2.000000 | 2.000000 | 2.000000 |
mean | 296.500000 | 160.000000 | 232.000000 | |
std | 75.660426 | 56.568542 | 152.735065 | |
min | 243.000000 | 120.000000 | 124.000000 | |
25% | 269.750000 | 140.000000 | 178.000000 | |
50% | 296.500000 | 160.000000 | 232.000000 | |
75% | 323.250000 | 180.000000 | 286.000000 | |
max | 350.000000 | 200.000000 | 340.000000 |
by_comp.describe().transpose()['GOOG']
Sales count 2.000000 mean 160.000000 std 56.568542 min 120.000000 25% 140.000000 50% 160.000000 75% 180.000000 max 200.000000 Name: GOOG, dtype: float64
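When only a few statistics are needed rather than the full `describe()`, `agg` can apply several aggregate functions in one call. A sketch on the same data:

```python
import pandas as pd

data = {'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
        'Person': ['Shivendra', 'Abhishek', 'Sowjanya', 'Manish', 'Mini', 'Satya'],
        'Sales': [200, 120, 340, 124, 243, 350]}
df = pd.DataFrame(data)

# One row per company, one column per aggregate function
summary = df.groupby('Company')['Sales'].agg(['mean', 'min', 'max'])
```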
There are three main ways of combining DataFrames: merging, joining, and concatenating. In this section we will discuss these methods with examples.
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
'B': ['B4', 'B5', 'B6', 'B7'],
'C': ['C4', 'C5', 'C6', 'C7'],
'D': ['D4', 'D5', 'D6', 'D7']},
index=[4, 5, 6, 7])
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
'B': ['B8', 'B9', 'B10', 'B11'],
'C': ['C8', 'C9', 'C10', 'C11'],
'D': ['D8', 'D9', 'D10', 'D11']},
index=[8, 9, 10, 11])
df1
A | B | C | D | |
---|---|---|---|---|
0 | A0 | B0 | C0 | D0 |
1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 |
3 | A3 | B3 | C3 | D3 |
df2
A | B | C | D | |
---|---|---|---|---|
4 | A4 | B4 | C4 | D4 |
5 | A5 | B5 | C5 | D5 |
6 | A6 | B6 | C6 | D6 |
7 | A7 | B7 | C7 | D7 |
df3
A | B | C | D | |
---|---|---|---|---|
8 | A8 | B8 | C8 | D8 |
9 | A9 | B9 | C9 | D9 |
10 | A10 | B10 | C10 | D10 |
11 | A11 | B11 | C11 | D11 |
Concatenation basically glues together DataFrames. Keep in mind that dimensions should match along the axis you are concatenating on. You can use pd.concat and pass in a list of DataFrames to concatenate together:
pd.concat([df1,df2,df3])
A | B | C | D | |
---|---|---|---|---|
0 | A0 | B0 | C0 | D0 |
1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 |
3 | A3 | B3 | C3 | D3 |
4 | A4 | B4 | C4 | D4 |
5 | A5 | B5 | C5 | D5 |
6 | A6 | B6 | C6 | D6 |
7 | A7 | B7 | C7 | D7 |
8 | A8 | B8 | C8 | D8 |
9 | A9 | B9 | C9 | D9 |
10 | A10 | B10 | C10 | D10 |
11 | A11 | B11 | C11 | D11 |
pd.concat([df1,df2,df3],axis=1)
A | B | C | D | A | B | C | D | A | B | C | D | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | A0 | B0 | C0 | D0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | A1 | B1 | C1 | D1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | A2 | B2 | C2 | D2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | A3 | B3 | C3 | D3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | NaN | NaN | NaN | NaN | A4 | B4 | C4 | D4 | NaN | NaN | NaN | NaN |
5 | NaN | NaN | NaN | NaN | A5 | B5 | C5 | D5 | NaN | NaN | NaN | NaN |
6 | NaN | NaN | NaN | NaN | A6 | B6 | C6 | D6 | NaN | NaN | NaN | NaN |
7 | NaN | NaN | NaN | NaN | A7 | B7 | C7 | D7 | NaN | NaN | NaN | NaN |
8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A8 | B8 | C8 | D8 |
9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A9 | B9 | C9 | D9 |
10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A10 | B10 | C10 | D10 |
11 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A11 | B11 | C11 | D11 |
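Merging, by contrast, combines DataFrames SQL-style on shared column values rather than by position. A small sketch with two hypothetical frames, `left` and `right`:

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K3'], 'B': ['B0', 'B1', 'B3']})

# how='inner' keeps only keys present in both frames (K0 and K1 here)
merged = pd.merge(left, right, on='key', how='inner')
```

Other `how` values ('outer', 'left', 'right') behave like their SQL join counterparts.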
There are lots of operations in pandas that will be really useful to you but don't fall into any distinct category. Let's show them here in this lecture:
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df.head()
col1 | col2 | col3 | |
---|---|---|---|
0 | 1 | 444 | abc |
1 | 2 | 555 | def |
2 | 3 | 666 | ghi |
3 | 4 | 444 | xyz |
df['col2'].unique()
array([444, 555, 666], dtype=int64)
df['col2'].nunique()
3
df['col2'].value_counts()
444 2 555 1 666 1 Name: col2, dtype: int64
#Select from DataFrame using criteria from multiple columns
newdf = df[(df['col1']>2) & (df['col2']==444)]
newdf
col1 | col2 | col3 | |
---|---|---|---|
3 | 4 | 444 | xyz |
# Applying Functions
def times2(x):
return x*2
df['col1'].apply(times2)
0 2 1 4 2 6 3 8 Name: col1, dtype: int64
df['col3'].apply(len)
0 3 1 3 2 3 3 3 Name: col3, dtype: int64
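For a one-off transformation, a lambda is the more common idiom than defining a named function like `times2` above. A sketch with the same frame:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [444, 555, 666, 444],
                   'col3': ['abc', 'def', 'ghi', 'xyz']})

# Same effect as apply(times2), without the separate function definition
doubled = df['col1'].apply(lambda x: x * 2)
```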
df['col1'].sum()
10
**Permanently Removing a Column**
del df['col1']
df
col2 | col3 | |
---|---|---|
0 | 444 | abc |
1 | 555 | def |
2 | 666 | ghi |
3 | 444 | xyz |
# get columns and index names
df.columns
Index(['col2', 'col3'], dtype='object')
df.index
RangeIndex(start=0, stop=4, step=1)
df
col2 | col3 | |
---|---|---|
0 | 444 | abc |
1 | 555 | def |
2 | 666 | ghi |
3 | 444 | xyz |
df.sort_values(by='col2') #inplace=False by default
col2 | col3 | |
---|---|---|
0 | 444 | abc |
3 | 444 | xyz |
1 | 555 | def |
2 | 666 | ghi |
# Check whether there are any null values
df.isnull()
col2 | col3 | |
---|---|---|
0 | False | False |
1 | False | False |
2 | False | False |
3 | False | False |
# Drop rows with NaN Values
df.dropna()
col2 | col3 | |
---|---|---|
0 | 444 | abc |
1 | 555 | def |
2 | 666 | ghi |
3 | 444 | xyz |
df = pd.DataFrame({'col1':[1,2,3,np.nan],
'col2':[np.nan,555,666,444],
'col3':['abc','def','ghi','xyz']})
df.head()
col1 | col2 | col3 | |
---|---|---|---|
0 | 1.0 | NaN | abc |
1 | 2.0 | 555.0 | def |
2 | 3.0 | 666.0 | ghi |
3 | NaN | 444.0 | xyz |
df.fillna('FILL')
col1 | col2 | col3 | |
---|---|---|---|
0 | 1 | FILL | abc |
1 | 2 | 555 | def |
2 | 3 | 666 | ghi |
3 | FILL | 444 | xyz |
data = {'A':['foo','foo','foo','bar','bar','bar'],
'B':['one','one','two','two','one','one'],
'C':['x','y','x','y','x','y'],
'D':[1,3,2,5,4,1]}
df = pd.DataFrame(data)
df
A | B | C | D | |
---|---|---|---|---|
0 | foo | one | x | 1 |
1 | foo | one | y | 3 |
2 | foo | two | x | 2 |
3 | bar | two | y | 5 |
4 | bar | one | x | 4 |
5 | bar | one | y | 1 |
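This layout is the typical input for `pivot_table`, which reshapes long data into a spreadsheet-style summary; presumably that is where the lecture is heading. A sketch of what it produces:

```python
import pandas as pd

data = {'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
        'B': ['one', 'one', 'two', 'two', 'one', 'one'],
        'C': ['x', 'y', 'x', 'y', 'x', 'y'],
        'D': [1, 3, 2, 5, 4, 1]}
df = pd.DataFrame(data)

# A and B become a row MultiIndex, C values become columns,
# and D is aggregated (mean by default) into the cells
table = df.pivot_table(values='D', index=['A', 'B'], columns=['C'])
```

Combinations that never occur in the data, such as ('foo', 'two') with C == 'y', show up as NaN.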