Pandas¶

Pandas contains high level data structures and manipulation tools to make data analysis fast and easy in Python.

In [2]:

import pandas as pd #I am importing pandas as pd
from pandas import Series, DataFrame # Series and Data Frame are two data structures available in python

Series¶

Series is a one-dimensional array like object containing an array of data(any Numpy data type, and an associated array of data labels, called its index.

In [13]:

mjp= Series([5,4,3,2,1])# a simple series
print mjp        # A series is represented by index on the left and values on the right
print mjp.values # similar to dictionary. ".values" command returns values in a series 

0    5
1    4
2    3
3    2
4    1
dtype: int64
[5 4 3 2 1]

In [14]:

print mjp.index # returns the index values of the series

Int64Index([0, 1, 2, 3, 4], dtype='int64')

In [27]:

jeeva = Series([5,4,3,2,1,-7,-29], index =['a','b','c','d','e','f','h']) # The index is specified
print jeeva # try jeeva.index and jeeva.values
print jeeva['a'] # selecting a particular value from a Series, by using index

a     5
b     4
c     3
d     2
e     1
f    -7
h   -29
dtype: int64
5

In [28]:

jeeva['d'] = 9 # change the value of a particular element in series
print jeeva
jeeva[['a','b','c']] # select a group of values

a     5
b     4
c     3
d     9
e     1
f    -7
h   -29
dtype: int64

Out[28]:

a    5
b    4
c    3
dtype: int64

In [31]:

print jeeva[jeeva>0] # returns only the positive values
print jeeva *2 # multiplies 2 to each element of a series

a    5
b    4
c    3
d    9
e    1
dtype: int64
a    10
b     8
c     6
d    18
e     2
f   -14
h   -58
dtype: int64

In [34]:

import numpy as np
np.mean(jeeva) # you can apply numpy functions to a Series

Out[34]:

-2.0

In [37]:

print 'b' in jeeva # checks whether the index is present in Series or not
print 'z' in jeeva

True
False

In [46]:

player_salary ={'Rooney': 50000, 'Messi': 75000, 'Ronaldo': 85000, 'Fabregas':40000, 'Van persie': 67000} 
new_player = Series(player_salary)# converting a dictionary to a series
print new_player # the series has keys of a dictionary

Fabregas      40000
Messi         75000
Ronaldo       85000
Rooney        50000
Van persie    67000
dtype: int64

In [49]:

players =['Klose', 'Messi', 'Ronaldo', 'Van persie', 'Ballack'] 
player_1 =Series(player_salary, index= players)
print player_1 # I have changed the index of the Series. Since, no value was not found for Klose and Ballack, it appears as NAN

Klose           NaN
Messi         75000
Ronaldo       85000
Van persie    67000
Ballack         NaN
dtype: float64

In [53]:

pd.isnull(player_1)#checks for Null values in player_1, pd denotes a pandas dataframe

Out[53]:

Klose          True
Messi         False
Ronaldo       False
Van persie    False
Ballack        True
dtype: bool

In [52]:

pd.notnull(player_1)# Checks for null values that are not Null

Out[52]:

Klose         False
Messi          True
Ronaldo        True
Van persie     True
Ballack       False
dtype: bool

In [64]:

player_1.name ='Bundesliga players' # name for the Series
player_1.index.name='Player names' #name of the index
player_1

Out[64]:

Player names
Klose             NaN
Messi           75000
Ronaldo         85000
Van persie      67000
Ballack           NaN
Name: Bundesliga players, dtype: float64

In [67]:

player_1.index =['Neymar', 'Hulk', 'Pirlo', 'Buffon', 'Anderson'] # is used to alter the index of Series
player_1 

Out[67]:

Neymar        NaN
Hulk        75000
Pirlo       85000
Buffon      67000
Anderson      NaN
Name: Bundesliga players, dtype: float64

Data Frame¶

Data frame is a spread sheet like structure, containing ordered collection of columns. Each column can have different value type. Data frame has both row index and column index.

In [74]:

states ={'State' :['Gujarat', 'Tamil Nadu', ' Andhra', 'Karnataka', 'Kerala'],
                  'Population': [36, 44, 67,89,34],
                  'Language' :['Gujarati', 'Tamil', 'Telugu', 'Kannada', 'Malayalam']}
india = DataFrame(states) # creating a data frame
india

Out[74]:

	Language	Population	State
0	Gujarati	36	Gujarat
1	Tamil	44	Tamil Nadu
2	Telugu	67	Andhra
3	Kannada	89	Karnataka
4	Malayalam	34	Kerala

In [75]:

DataFrame(states, columns=['State', 'Language', 'Population']) # change the sequence of column index

Out[75]:

	State	Language	Population
0	Gujarat	Gujarati	36
1	Tamil Nadu	Tamil	44
2	Andhra	Telugu	67
3	Karnataka	Kannada	89
4	Kerala	Malayalam	34

In [82]:

new_farme = DataFrame(states, columns=['State', 'Language', 'Population', 'Per Capita Income'], index =['a','b','c','d','e'])
#if you pass a column that isnt in states, it will appear with Na values

In [86]:

print new_farme.columns
print new_farme['State'] # retrieveing data like dictionary

Index([u'State', u'Language', u'Population', u'Per Capita Income'], dtype='object')
a       Gujarat
b    Tamil Nadu
c        Andhra
d     Karnataka
e        Kerala
Name: State, dtype: object

In [89]:

new_farme.Population # like Series

Out[89]:

a    36
b    44
c    67
d    89
e    34
Name: Population, dtype: int64

In [91]:

new_farme.ix[3] # rows can be retrieved using .ic function
# here I have retrieved 3rd row

Out[91]:

State                Karnataka
Language               Kannada
Population                  89
Per Capita Income          NaN
Name: d, dtype: object

In [94]:

 new_farme

Out[94]:

	State	Language	Population	Per Capita Income
a	Gujarat	Gujarati	36	NaN
b	Tamil Nadu	Tamil	44	NaN
c	Andhra	Telugu	67	NaN
d	Karnataka	Kannada	89	NaN
e	Kerala	Malayalam	34	NaN

In [97]:

new_farme['Per Capita Income'] = 99 # the empty per capita income column can be assigned a value
new_farme

Out[97]:

	State	Language	Population	Per Capita Income
a	Gujarat	Gujarati	36	99
b	Tamil Nadu	Tamil	44	99
c	Andhra	Telugu	67	99
d	Karnataka	Kannada	89	99
e	Kerala	Malayalam	34	99

In [99]:

new_farme['Per Capita Income'] = np.arange(5) # assigning a value to the last column
new_farme

Out[99]:

	State	Language	Population	Per Capita Income
a	Gujarat	Gujarati	36	0
b	Tamil Nadu	Tamil	44	1
c	Andhra	Telugu	67	2
d	Karnataka	Kannada	89	3
e	Kerala	Malayalam	34	4

In [104]:

series = Series([44,33,22], index =['b','c','d'])
new_farme['Per Capita Income'] = series
#when assigning list or arrays to a column, the values lenght should match the length of the DataFrame
new_farme # again the missing values are displayed as NAN

Out[104]:

	State	Language	Population	Per Capita Income
a	Gujarat	Gujarati	36	NaN
b	Tamil Nadu	Tamil	44	44
c	Andhra	Telugu	67	33
d	Karnataka	Kannada	89	22
e	Kerala	Malayalam	34	NaN

In [119]:

new_farme['Development'] = new_farme.State == 'Gujarat'# assigning a new column
print new_farme
del new_farme['Development'] # will delete the column 'Development'
new_farme

        State   Language  Population  Per Capita Income Development
a     Gujarat   Gujarati          36                NaN        True
b  Tamil Nadu      Tamil          44                 44       False
c      Andhra     Telugu          67                 33       False
d   Karnataka    Kannada          89                 22       False
e      Kerala  Malayalam          34                NaN       False

Out[119]:

	State	Language	Population	Per Capita Income
a	Gujarat	Gujarati	36	NaN
b	Tamil Nadu	Tamil	44	44
c	Andhra	Telugu	67	33
d	Karnataka	Kannada	89	22
e	Kerala	Malayalam	34	NaN

In [16]:

new_data ={'Modi': {2010: 72, 2012: 78, 2014 : 98},'Rahul': {2010: 55, 2012: 34, 2014: 22}}
elections = DataFrame(new_data) 
print elections# the outer dict keys are columns and inner dict keys are rows
elections.T # transpose of a data frame

      Modi  Rahul
2010    72     55
2012    78     34
2014    98     22

Out[16]:

	2010	2012	2014
Modi	72	78	98
Rahul	55	34	22

In [17]:

DataFrame(new_data, index =[2012, 2014, 2016]) # you can assign index for the data frame

Out[17]:

	Modi	Rahul
2012	78	34
2014	98	22
2016	NaN	NaN

In [18]:

ex= {'Gujarat':elections['Modi'][:-1], 'India': elections['Rahul'][:2]}
px =DataFrame(ex)
px

Out[18]:

	Gujarat	India
2010	72	55
2012	78	34

In [150]:

from IPython.display import Image
i = Image(filename='Constructors.png')
i # list of things you can pass to a dataframe

Out[150]:

In [155]:

px.index.name = 'year'
px.columns.name = 'politicians'
px

Out[155]:

politicians	Gujarat	India
year
2010	72	55
2012	78	34

In [156]:

px.values

Out[156]:

array([[72, 55],
       [78, 34]], dtype=int64)

In [3]:

jeeva = Series([5,4,3,2,1,-7,-29], index =['a','b','c','d','e','f','h'])
index = jeeva.index
print index #u denotes unicode
print index[1:]# returns all the index elements except a. 
index[1] = 'f' # you cannot modify an index element. It will generate an error. In other words, they are immutable

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-e8b7ee2d0552> in <module>()
      3 print index #u denotes unicode
      4 print index[1:]# returns all the index elements except a.
----> 5 index[1] = 'f' # you cannot modify an index element. It will generate an error. In other words, they are immutable

C:\Users\tk\AppData\Local\Enthought\Canopy32\User\lib\site-packages\pandas\core\base.pyc in _disabled(self, *args, **kwargs)
    177         """This method will not function because object is immutable."""
    178         raise TypeError("'%s' does not support mutable operations." %
--> 179                         self.__class__)
    180 
    181     __setitem__ = __setslice__ = __delitem__ = __delslice__ = _disabled

TypeError: '<class 'pandas.core.index.Index'>' does not support mutable operations.

Index([u'a', u'b', u'c', u'd', u'e', u'f', u'h'], dtype='object')
Index([u'b', u'c', u'd', u'e', u'f', u'h'], dtype='object')

In [22]:

print px
2013 in px.index # checks if 2003 is an index in data frame px

      Gujarat  India
2010       72     55
2012       78     34

Out[22]:

False

Reindex¶

In [27]:

var = Series(['Python', 'Java', 'c', 'c++', 'Php'], index =[5,4,3,2,1])
print var
var1 = var.reindex([1,2,3,4,5])# reindex creates a new object 
print var1 

5    Python
4      Java
3         c
2       c++
1       Php
dtype: object
1       Php
2       c++
3         c
4      Java
5    Python
dtype: object

In [28]:

var.reindex([1,2,3,4,5,6,7])# introduces new indexes with values Nan

Out[28]:

1       Php
2       c++
3         c
4      Java
5    Python
6       NaN
7       NaN
dtype: object

In [31]:

var.reindex([1,2,3,4,5,6,7], fill_value =1) # you can use fill value to fill the Nan values. Here I have used fill value as 1. You can use any value.

Out[31]:

1       Php
2       c++
3         c
4      Java
5    Python
6         1
7         1
dtype: object

In [35]:

gh =Series(['Dhoni', 'Sachin', 'Kohli'], index =[0,2,4])
print gh
gh.reindex(range(6), method ='ffill') #ffill is forward fill. It forward fills the values

0     Dhoni
2    Sachin
4     Kohli
dtype: object

Out[35]:

0     Dhoni
1     Dhoni
2    Sachin
3    Sachin
4     Kohli
5     Kohli
dtype: object

In [36]:

gh.reindex(range(6), method ='bfill')# bfill, backward fills the values

Out[36]:

0     Dhoni
1    Sachin
2    Sachin
3     Kohli
4     Kohli
5       NaN
dtype: object

In [45]:

import numpy as np
fp = DataFrame(np.arange(9).reshape((3,3)),index =['a','b','c'], columns =['Gujarat','Tamil Nadu', 'Kerala'])
fp

Out[45]:

	Gujarat	Tamil Nadu	Kerala
a	0	1	2
b	3	4	5
c	6	7	8

In [55]:

fp1 =fp.reindex(['a', 'b', 'c', 'd'], columns = states) # reindexing columns and indices
fp1

Out[55]:

	Gujarat	Assam	Kerala
a	0	NaN	2
b	3	NaN	5
c	6	NaN	8
d	NaN	NaN	NaN

Other Reindexing arguments
limit When forward- or backfilling, maximum size gap to fill
level Match simple Index on level of MultiIndex, otherwise select subset of
copy Do not copy underlying data if new index is equivalent to old index. True by default (i.e. always copy data).

Dropping entries from an axis¶

In [62]:

er = Series(np.arange(5), index =['a','b','c','d','e'])
print er
er.drop(['a','b']) #drop method will return a new object  with values deleted from an axis

a    0
b    1
c    2
d    3
e    4
dtype: int32

Out[62]:

c    2
d    3
e    4
dtype: int32

In [77]:

states ={'State' :['Gujarat', 'Tamil Nadu', ' Andhra', 'Karnataka', 'Kerala'],
                  'Population': [36, 44, 67,89,34],
                  'Language' :['Gujarati', 'Tamil', 'Telugu', 'Kannada', 'Malayalam']}
india = DataFrame(states, columns =['State', 'Population', 'Language'])
print india
india.drop([0,1])# will drop index 0 and 1

        State  Population   Language
0     Gujarat          36   Gujarati
1  Tamil Nadu          44      Tamil
2      Andhra          67     Telugu
3   Karnataka          89    Kannada
4      Kerala          34  Malayalam

Out[77]:

	State	Population	Language
2	Andhra	67	Telugu
3	Karnataka	89	Kannada
4	Kerala	34	Malayalam

In [82]:

india.drop(['State', 'Population'], axis =1 )# the function dropped population and state columns. Apply the same concept with axis =0

Out[82]:

	Language
0	Gujarati
1	Tamil
2	Telugu
3	Kannada
4	Malayalam

Selection, Indexing and Filtering¶

In [102]:

var = Series(['Python', 'Java', 'c', 'c++', 'Php'], index =[5,4,3,2,1])
var

Out[102]:

5    Python
4      Java
3         c
2       c++
1       Php
dtype: object

In [103]:

print var[5]
print var[2:4]

Python
3      c
2    c++
dtype: object

In [104]:

var[[3,2,1]]

Out[104]:

3      c
2    c++
1    Php
dtype: object

In [109]:

var[var == 'Php']

Out[109]:

1    Php
dtype: object

In [111]:

states ={'State' :['Gujarat', 'Tamil Nadu', ' Andhra', 'Karnataka', 'Kerala'],
                  'Population': [36, 44, 67,89,34],
                  'Language' :['Gujarati', 'Tamil', 'Telugu', 'Kannada', 'Malayalam']}
india = DataFrame(states, columns =['State', 'Population', 'Language'])
india

Out[111]:

	State	Population	Language
0	Gujarat	36	Gujarati
1	Tamil Nadu	44	Tamil
2	Andhra	67	Telugu
3	Karnataka	89	Kannada
4	Kerala	34	Malayalam

In [114]:

india[['Population', 'Language']] # retrieve data from data frame

Out[114]:

	Population	Language
0	36	Gujarati
1	44	Tamil
2	67	Telugu
3	89	Kannada
4	34	Malayalam

In [115]:

india[india['Population'] > 50] # returns data for population greater than 50

Out[115]:

	State	Population	Language
2	Andhra	67	Telugu
3	Karnataka	89	Kannada

In [117]:

india[:3] # first three rows

Out[117]:

	State	Population	Language
0	Gujarat	36	Gujarati
1	Tamil Nadu	44	Tamil
2	Andhra	67	Telugu

In [4]:

# for selecting specific rows and columns, you can use ix function
import pandas as pd
states ={'State' :['Gujarat', 'Tamil Nadu', ' Andhra', 'Karnataka', 'Kerala'],
                  'Population': [36, 44, 67,89,34],
                  'Language' :['Gujarati', 'Tamil', 'Telugu', 'Kannada', 'Malayalam']}
india = DataFrame(states, columns =['State', 'Population', 'Language'], index =['a', 'b', 'c', 'd', 'e'])
india

Out[4]:

	State	Population	Language
a	Gujarat	36	Gujarati
b	Tamil Nadu	44	Tamil
c	Andhra	67	Telugu
d	Karnataka	89	Kannada
e	Kerala	34	Malayalam

In [128]:

india.ix[['a','b'], ['State','Language']] # this is how you select subset of rows

Out[128]:

	State	Language
a	Gujarat	Gujarati
b	Tamil Nadu	Tamil