Drawing

In [2]:
import pandas as pd
import numpy as np # necessity as pandas is built on np
from IPython.display import Image # to display images 

The Pandas library is built on NumPy and provides easy-to-use data structures and data analysis tools for the Python programming language.

Refer to these cheatsheets:
https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PandasPythonForDataScience+(1).pdf https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Pandas_Cheat_Sheet_2.pdf

Input/Output

Different types of data can be loaded in pandas dataframe. Pandas dataframe is like looks spreadsheet table (just a rough analogy)

  • Most common input types

    • pd.read_csv
    • pd.read_excel/ pd.ExcelFile
    • pd.read_feather (feather format is used to reduce memory load in df as data is saved in binary form)
    • pd.read_json
    • pd.read_html
    • pd.read_pickle (can also infer if pickled object is zipped using infer=)
  • Output types have format to_xxx similar to input formats

Pandas data structures

Pandas data strctures include series and dataframe

Series: A one-dimensional labeled array a capable of holding any data type

In [18]:
Image('../images/series.png', width=500) # not pandas, just showing example series
Out[18]:
In [22]:
# index will set the index for further reference
# data can be passed as list
s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd']) 
In [21]:
s
Out[21]:
a    3
b   -5
c    7
d    4
dtype: int64

Can be indexed using both index name or number.
number: --> filter indexes after value of number
:number --> filter indexes before value of number

In [42]:
s[1:]
Out[42]:
b   -5
c    7
d    4
dtype: int64
In [46]:
s['a'] 
Out[46]:
3
In [47]:
s[:'c']
Out[47]:
a    3
b   -5
c    7
dtype: int64
In [50]:
s.index
Out[50]:
Index(['a', 'b', 'c', 'd'], dtype='object')
In [37]:
# series using dictionary

s2 = pd.Series({'a':3, 'b': -1, 'c': 12}); s2
Out[37]:
a     3
b    -1
c    12
dtype: int64
In [38]:
s2['b']
Out[38]:
-1

Dataframe: A two-dimensional labeled data structure with columns of potentially different types. It is similar to excel table.

Can make data frame using dictionary, list of list

In [25]:
Image('../images/df.png', width=500) 
Out[25]:
In [288]:
data = {'Country': ['Belgium', 'India', 'Brazil'],
 'Capital': ['Brussels', 'New Delhi', 'Brasília'],
 'Population': [11190846, 1303171035, 207847528]}
In [289]:
df_sample = pd.DataFrame(data,
 columns=['Country', 'Capital', 'Population'])
In [290]:
df_sample
Out[290]:
Country Capital Population
0 Belgium Brussels 11190846
1 India New Delhi 1303171035
2 Brazil Brasília 207847528

Common dataframe functionality

Using famous titanic data for analysis and exploration. https://www.kaggle.com/c/titanic/data

  • Common things to do when we get data in a dataframe (examples shown below):
    • see dataframe shape (number of rows, number of columns) using df.shape
    • see top 5 rows using pd.head()
    • check datatype of each column using pd.dtypes
    • check column names using pd.columns
    • count unique values of each column to see cardinality levels using pd.nunique()
    • number of non null in each column, memory usage of df, datatype (especially for large df) using pd.info()
In [3]:
df = pd.read_csv('../data/train.csv') # read csv file
In [33]:
df.shape
Out[33]:
(891, 12)
In [8]:
df.head() # see top 5 rows of data
Out[8]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [26]:
df.dtypes # see datatype of each variable
Out[26]:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
In [27]:
df.columns # column names
Out[27]:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
In [32]:
df.nunique() # unique value for each variable
Out[32]:
PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64
In [34]:
df.info() # not null part is very useful to see how many nulls are there in data
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

Advanced Indexing

iloc

Select based on integer location (that's why i). Can select single or multiple

In [56]:
df.iloc[0, 4] # 0 row, 4 column
Out[56]:
'male'
In [58]:
df.iloc[1:4, 2:6] # indexes are maintained. Can reset_index() to start index from 0  
Out[58]:
Pclass Name Sex Age
1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0
2 3 Heikkinen, Miss. Laina female 26.0
3 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0

loc

Select based on label name of column (can select single or multiple)

In [63]:
df.loc[1:2,'Name':"Age"] # here row indexes are numbers but column indexes are name of columns 
Out[63]:
Name Sex Age
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0
2 Heikkinen, Miss. Laina female 26.0
In [4]:
df.loc[2,['Name',"Age"]] # here row indexes are numbers. 
Out[4]:
Name    Heikkinen, Miss. Laina
Age                         26
Name: 2, dtype: object

ix

ix has been deprecated in latest pandas library. It was used to select by label or position. But we can always use loc to select with labels and iloc to select with integers/position

Boolean indexing

Returns rows where the stated condition returns true

  • or -> condition 1 | condition 2 (or also works but throws ambiguity error for multiple conditions)
  • and -> condition 1 & condition 2 (and also works but throws ambiguity error for multiple conditions
  • not -> ~ (not condition)
  • equal -> == Satisfying condition
  • any() -> columns/rows with any value matching condition
  • all() > columns/rows with all values matching some condition
In [31]:
# select rows with either sex as female or Pclass as 1
df[(df.Sex == 'female') | (df.iloc[:,2] == 1) ].iloc[:3] # () are important
Out[31]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
In [63]:
# first 3 rows of gives all columns which have all string values or all int > 1 values
df.loc[:,(df > 1).all()][:3] 
Out[63]:
Name Sex Ticket Cabin Embarked
0 Braund, Mr. Owen Harris male A/5 21171 NaN S
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female PC 17599 C85 C
2 Heikkinen, Miss. Laina female STON/O2. 3101282 NaN S
In [65]:
# first 3 rows of all columns which have all not null values
df.loc[:,(df.notnull().all() )][:3]
Out[65]:
PassengerId Survived Pclass Name Sex SibSp Parch Ticket Fare
0 1 0 3 Braund, Mr. Owen Harris male 1 0 A/5 21171 7.2500
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 1 0 PC 17599 71.2833
2 3 1 3 Heikkinen, Miss. Laina female 0 0 STON/O2. 3101282 7.9250
In [67]:
# first 3 rows of all columns which have atleast 1 null value
df.loc[:, df.isnull().any()][:3]
Out[67]:
Age Cabin Embarked
0 22.0 NaN S
1 38.0 C85 C
2 26.0 NaN S
In [33]:
df[(df.iloc[:,2] == 1) & (df.Sex == 'female')].shape
Out[33]:
(94, 12)
In [39]:
# fraction of males with Age > 25, df.shape[0] -> number of rows 

sum((df.Age > 25) & (df.Sex == 'male'))/df.shape[0] 
Out[39]:
0.3075196408529742
In [50]:
# number of people who survived and were not in class 3

sum((df.Survived != 0) & (~(df.Pclass == 3)) ) 
Out[50]:
223

querying

Query columns (filter rows) of dataframe with boolean expression (Filter based on condition)

In [75]:
# filter all rows which have Age > Passenger ID
df.query('Age > PassengerId')
Out[75]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.0 1 0 345763 18.0000 NaN S
20 21 0 2 Fynney, Mr. Joseph J male 35.0 0 0 239865 26.0000 NaN S
21 22 1 2 Beesley, Mr. Lawrence male 34.0 0 0 248698 13.0000 D56 S
23 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 A6 S
25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.0 1 5 347077 31.3875 NaN S
30 31 0 1 Uruchurtu, Don. Manuel E male 40.0 0 0 PC 17601 27.7208 NaN C
33 34 0 2 Wheadon, Mr. Edward H male 66.0 0 0 C.A. 24579 10.5000 NaN S
35 36 0 1 Holverson, Mr. Alexander Oskar male 42.0 1 0 113789 52.0000 NaN S
54 55 0 1 Ostby, Mr. Engelhart Cornelius male 65.0 0 1 113509 61.9792 B30 C

filter

Filter dataframe on column names or row names (labels) by regrex or just item names

In [3]:
# filter only sex and age columns (first 2 rows)

df.filter(items=['Age', 'Sex'])[:2]
Out[3]:
Age Sex
0 22.0 male
1 38.0 female
In [6]:
# filter only 0 and 5 row index

df.filter(items=[0,5], axis=0)
Out[6]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
In [8]:
# first 2 rows of column names ending with "ed" (think of past tense)

df.filter(like = 'ed', axis=1)[:2]
Out[8]:
Survived Embarked
0 0 S
1 1 C
In [20]:
# Can use same thing as above using regex also

df.filter(regex='ed$', axis=1)[:2]
Out[20]:
Series([], Name: Name, dtype: object)

isin

Filter rows of column based on list of multiple values

In [28]:
df[df.Pclass.isin([0,1])].head()
Out[28]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
23 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 A6 S

Setting/ Resetting Index

Setting and resetting index are important when we merge/groupby 2 dataframe and want to do further analysis on new dataframe. A dataframe with repeated indexes can cause problems in filtering. Apart from this we cvan set a column into index which makes merging much faster

set_index()

Set any column you want as index of df

In [76]:
# setting 
df.set_index('Ticket')[:2]
Out[76]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Fare Cabin Embarked
Ticket
A/5 21171 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 7.2500 NaN S
PC 17599 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 71.2833 C85 C
In [77]:
# can set multiple columns as index also. Just pass them in list
# Setting Ticket and Name as index

df.set_index(['Ticket', 'Name'])[:2]
Out[77]:
PassengerId Survived Pclass Sex Age SibSp Parch Fare Cabin Embarked
Ticket Name
A/5 21171 Braund, Mr. Owen Harris 1 0 3 male 22.0 1 0 7.2500 NaN S
PC 17599 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 2 1 1 female 38.0 1 0 71.2833 C85 C
In [90]:
# can see what are values of index. 
# checking index of 1st row

df.set_index(['Ticket', 'Name']).index[0]
Out[90]:
('A/5 21171', 'Braund, Mr. Owen Harris')

reset_index()

Can reset index back to 0....nrows-1

In [91]:
df_index = df.set_index(['Ticket', 'Name'])
In [92]:
df_index[:2]
Out[92]:
PassengerId Survived Pclass Sex Age SibSp Parch Fare Cabin Embarked
Ticket Name
A/5 21171 Braund, Mr. Owen Harris 1 0 3 male 22.0 1 0 7.2500 NaN S
PC 17599 Cumings, Mrs. John Bradley (Florence Briggs Thayer) 2 1 1 female 38.0 1 0 71.2833 C85 C
In [93]:
df_index.reset_index()[:2]
Out[93]:
Ticket Name PassengerId Survived Pclass Sex Age SibSp Parch Fare Cabin Embarked
0 A/5 21171 Braund, Mr. Owen Harris 1 0 3 male 22.0 1 0 7.2500 NaN S
1 PC 17599 Cumings, Mrs. John Bradley (Florence Briggs Th... 2 1 1 female 38.0 1 0 71.2833 C85 C

In above case, index is back to 0,1...

rename()

Renaming column names or row indexes of dataframe. Default is index

In [94]:
df.rename(columns={'Name': 'Whats_name', 'Fare':'Price'})[:2]
Out[94]:
PassengerId Survived Pclass Whats_name Sex Age SibSp Parch Ticket Price Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
In [96]:
# can use some mapper function also. default axis='index' (0)

df.rename(mapper=str.lower, axis='columns')[:2]
Out[96]:
passengerid survived pclass name sex age sibsp parch ticket fare cabin embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C

Duplicated data

unique()

Number of unique values in a column of df (Use nunique() for count of unique in each column)

In [98]:
df.Sex.unique()
Out[98]:
array(['male', 'female'], dtype=object)

duplicated()

Check duplicated in column. Returns True/False

In [102]:
sum(df.PassengerId.duplicated()) # there are no duplicate passegerid. good thing to check
Out[102]:
0
In [115]:
# can check duplicates in index also.
# useful if doubtful about duplicates in index doing bad things 

sum(df.index.duplicated())
Out[115]:
0

drop_duplicates

Drop rows which have duplicates

In [106]:
# can help in getting unique combination of multiple columns
# unique() dosn't work in this case
df.loc[:,['Sex', 'Embarked']].drop_duplicates()
Out[106]:
Sex Embarked
0 male S
1 female C
2 female S
5 male Q
22 female Q
26 male C
61 female NaN

Grouping data

Group by some column/columns, then we can aggregate to get mean, count, sum or custom function based on the group

groupby

In [123]:
# group by sex then count. 
# returns count in each column. difference in some cases because of nulls in those columns
# can do iloc[:,0] to only get first column 

df.groupby(by = ['Sex']).count()
Out[123]:
PassengerId Survived Pclass Name Age SibSp Parch Ticket Fare Cabin Embarked
Sex
female 314 314 314 314 261 314 314 314 314 97 312
male 577 577 577 577 453 577 577 577 577 107 577
In [125]:
# can use multiple conditions
# group by sex and survived -> mean of age

df.groupby(by = ['Sex', 'Survived']).mean().loc[:,'Age']
Out[125]:
Sex     Survived
female  0           25.046875
        1           28.847716
male    0           31.618056
        1           27.276022
Name: Age, dtype: float64
In [145]:
# can group by indexes also by using levels= 
# useful when we have multindexes
# can use agg function with lambda func

df_index = df.set_index(['Sex', 'Pclass'])
df_index.groupby(level=[0,1]).agg({'Fare': lambda x: sum(x)/len(x), # this is also just mean actually
                                  'Age' : np.mean})
Out[145]:
Fare Age
Sex Pclass
female 1 106.125798 34.611765
2 21.970121 28.722973
3 16.118810 21.750000
male 1 67.226127 41.281386
2 19.741782 30.740707
3 12.661633 26.507589

Interesting! Ticket price of 1st class female is approximately double of 1st class male

transform

Can apply such functions for all columns also using transform which transforms all rows

In [153]:
# shape of below code is same as original df

df_index.groupby(level=[0,1]).transform(lambda x: sum(x)/len(x)).head()
Out[153]:
PassengerId Survived Age SibSp Parch Fare
Sex Pclass
male 3 455.515850 0.135447 NaN 0.498559 0.224784 12.661633
female 1 469.212766 0.968085 NaN 0.553191 0.457447 106.125798
3 399.729167 0.500000 NaN 0.895833 0.798611 16.118810
1 469.212766 0.968085 NaN 0.553191 0.457447 106.125798
male 3 455.515850 0.135447 NaN 0.498559 0.224784 12.661633

Handling missing data

dropna

Drop rows with na

In [156]:
# how=any -> row with any column = NA

df.dropna(axis=0,  how='any').shape
Out[156]:
(183, 12)
In [159]:
# how=any -> row with all columns = NA

df.dropna(axis=0, how='all').shape
Out[159]:
(891, 12)
In [167]:
# drops column which have any row of NA

[set(df.columns) - set(df.dropna(axis=1, how='any').columns)]
Out[167]:
[{'Age', 'Cabin', 'Embarked'}]

Three columns have been removed

fillna

In [173]:
# replace with mean of that column
# can put any specific value also
# would not work for columns with string type like Cabin

df.fillna(np.mean)[:1]
Out[173]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.25 <function mean at 0x10c7d7d90> S

Combining Data

Drawing

merge / join

  • how = 'left', 'right', 'outer', 'inner'
  • on
In [182]:
data1 = pd.DataFrame({'x1': list('abc'), 'x2': [11.432, 1.303, 99.906]})
In [197]:
data2 = pd.DataFrame({'x1': list('abd'), 'x3': [20.784,  np.NaN, 20.784]})
In [183]:
data1
Out[183]:
x1 x2
0 a 11.432
1 b 1.303
2 c 99.906
In [198]:
data2
Out[198]:
x1 x3
0 a 20.784
1 b NaN
2 d 20.784
In [199]:
# inner join when both table have that key (like sql)

data1.merge(data2, how='inner', on='x1')
Out[199]:
x1 x2 x3
0 a 11.432 20.784
1 b 1.303 NaN
In [200]:
# outer joins on all keys in both df and creates NA

data1.merge(data2, how='outer', on='x1')
Out[200]:
x1 x2 x3
0 a 11.432 20.784
1 b 1.303 NaN
2 c 99.906 NaN
3 d NaN 20.784

can also use join but merge is faster. just use merge

In [202]:
# if columns overlap, have to specify suffix as it makes for all

data1.join(data2, on='x1', how='left', lsuffix='L')
Out[202]:
x1L x2 x1 x3
0 a 11.432 NaN NaN
1 b 1.303 NaN NaN
2 c 99.906 NaN NaN

concatenate

In [223]:
# join over axis=0, i.e rows combine 
# also adds all columns with na

pd.concat([data1, data2], axis=0)
Out[223]:
x1 x2 x3
0 a 11.432 NaN
1 b 1.303 NaN
2 c 99.906 NaN
0 a NaN 20.784
1 b NaN NaN
2 d NaN 20.784
3 g NaN 500.000

Notice that it has index duplicates as it maintain original df index

Can use ignore_index=True to make index start from 0

In [211]:
pd.concat([data1, data2], axis=0, ignore_index=True)
Out[211]:
x1 x2 x3
0 a 11.432 NaN
1 b 1.303 NaN
2 c 99.906 NaN
3 a NaN 20.784
4 b NaN NaN
5 d NaN 20.784
In [217]:
data2.loc[3] = ['g', 500] # adding new row
data2
Out[217]:
x1 x3
0 a 20.784
1 b NaN
2 d 20.784
3 g 500.000
In [228]:
# join over axis=1, i.e columns combine 

pd.concat([data1, data2], axis=1)
Out[228]:
x1 x2 x1 x3
0 a 11.432 a 20.784
1 b 1.303 b NaN
2 c 99.906 d 20.784
3 NaN NaN g 500.000

Date formatting

  • to_datetime() -> convert whatever format argument to datetime (obviously that can be parsed to datetime)
  • date_range() -> generates datetime data
  • Datetimeindex -> datetypeindex data
In [246]:
pd.to_datetime('2018-2-19')
Out[246]:
Timestamp('2018-02-19 00:00:00')
In [250]:
# gives datetimeindex format

pd.date_range('2018-4-18', periods=6, freq='d')
Out[250]:
DatetimeIndex(['2018-04-18', '2018-04-19', '2018-04-20', '2018-04-21',
               '2018-04-22', '2018-04-23'],
              dtype='datetime64[ns]', freq='D')
In [235]:
data1['date'] = pd.date_range('2018-4-18', periods=3, freq='d')
In [236]:
data1
Out[236]:
x1 x2 date
0 a 11.432 2018-04-18
1 b 1.303 2018-04-19
2 c 99.906 2018-04-20
In [248]:
data1.date
Out[248]:
0   2018-04-18
1   2018-04-19
2   2018-04-20
Name: date, dtype: datetime64[ns]
In [247]:
pd.DatetimeIndex(data1.date)
Out[247]:
DatetimeIndex(['2018-04-18', '2018-04-19', '2018-04-20'], dtype='datetime64[ns]', name='date', freq=None)

Reshaping data

pivot -> reshape data

Ever used pivot table in excel? It's same.

In [252]:
# index = new index, columns = new_columns, values = values to put

df.pivot(index='Sex', columns = 'PassengerId', values = 'Age')
Out[252]:
PassengerId 1 2 3 4 5 6 7 8 9 10 ... 882 883 884 885 886 887 888 889 890 891
Sex
female NaN 38.0 26.0 35.0 NaN NaN NaN NaN 27.0 14.0 ... NaN 22.0 NaN NaN 39.0 NaN 19.0 NaN NaN NaN
male 22.0 NaN NaN NaN 35.0 NaN 54.0 2.0 NaN NaN ... 33.0 NaN 28.0 25.0 NaN 27.0 NaN NaN 26.0 32.0

2 rows × 891 columns

In above case, use of pivot doesn't make sense but this is just an example

stack

Convert whole df into 1 long format

In [258]:
df.stack()
Out[258]:
0    PassengerId                                                    1
     Survived                                                       0
     Pclass                                                         3
     Name                                     Braund, Mr. Owen Harris
     Sex                                                         male
     Age                                                           22
     SibSp                                                          1
     Parch                                                          0
     Ticket                                                 A/5 21171
     Fare                                                        7.25
     Embarked                                                       S
1    PassengerId                                                    2
     Survived                                                       1
     Pclass                                                         1
     Name           Cumings, Mrs. John Bradley (Florence Briggs Th...
     Sex                                                       female
     Age                                                           38
     SibSp                                                          1
     Parch                                                          0
     Ticket                                                  PC 17599
     Fare                                                     71.2833
     Cabin                                                        C85
     Embarked                                                       C
2    PassengerId                                                    3
     Survived                                                       1
     Pclass                                                         3
     Name                                      Heikkinen, Miss. Laina
     Sex                                                       female
     Age                                                           26
     SibSp                                                          0
                                          ...                        
888  Name                    Johnston, Miss. Catherine Helen "Carrie"
     Sex                                                       female
     SibSp                                                          1
     Parch                                                          2
     Ticket                                                W./C. 6607
     Fare                                                       23.45
     Embarked                                                       S
889  PassengerId                                                  890
     Survived                                                       1
     Pclass                                                         1
     Name                                       Behr, Mr. Karl Howell
     Sex                                                         male
     Age                                                           26
     SibSp                                                          0
     Parch                                                          0
     Ticket                                                    111369
     Fare                                                          30
     Cabin                                                       C148
     Embarked                                                       C
890  PassengerId                                                  891
     Survived                                                       0
     Pclass                                                         3
     Name                                         Dooley, Mr. Patrick
     Sex                                                         male
     Age                                                           32
     SibSp                                                          0
     Parch                                                          0
     Ticket                                                    370376
     Fare                                                        7.75
     Embarked                                                       Q
Length: 9826, dtype: object

You won't generally use it. I have never come across its use over my experience with python

Iteration

To get column/row indexes, series pair.

  • iteritems() for column-index, series
  • iterrows() for row-index, series
In [286]:
list(df.Sex.iteritems())[:5]
Out[286]:
[(0, 'male'), (1, 'female'), (2, 'female'), (3, 'female'), (4, 'male')]
In [285]:
list(df.iterrows())[0]
Out[285]:
(0, PassengerId                          1
 Survived                             0
 Pclass                               3
 Name           Braund, Mr. Owen Harris
 Sex                               male
 Age                                 22
 SibSp                                1
 Parch                                0
 Ticket                       A/5 21171
 Fare                              7.25
 Cabin                              NaN
 Embarked                             S
 Name: 0, dtype: object)

Apply functions

  • apply -> apply function over df
  • apply_map -> apply function elementwise (for each series of df. think of column wise)
In [338]:
# function squares when type(x) = float, cubes when type(x) = int, return same when other

f = lambda x: x**2 if type(x) == float else x**3 if type(x) == int else x
In [335]:
# whole series is passed

df.Fare.apply(f)[:3]
Out[335]:
0      52.562500
1    5081.308859
2      62.805625
Name: Fare, dtype: float64
In [339]:
# elements are passed

df.applymap(f)[:3]
Out[339]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 27 Braund, Mr. Owen Harris male 484.0 1 0 A/5 21171 52.562500 NaN S
1 8 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 1444.0 1 0 PC 17599 5081.308859 C85 C
2 27 1 27 Heikkinen, Miss. Laina female 676.0 0 0 STON/O2. 3101282 62.805625 NaN S

Working with text data

What all can we do when we have string datatype in pandas dataframe/series ?

str

Working with string format in pandas series/df

We can do:

  • str.upper()/lower() to convert string into upper or lower case
  • str.len() to find the length of sting
  • str.strip()/lstrip()/rstrip() to strip spaces
  • str.replace() to replace anything from string
  • str.split() to split words of string or using some other delimiter
  • str.get() to access elements in slit list
  • str.resplit() spit in reverse order of string based on some delimiter
  • str.extract() extract specific thing from string. alphabet or number

Let's see how to use all that in pandas series. Keep in mind pandas DataFrame has no attribute called str and works on Series object only. So, grab column of df, then apply str

In [10]:
# converts all rows into lower

df.Name.str.lower().head()
Out[10]:
0                              braund, mr. owen harris
1    cumings, mrs. john bradley (florence briggs th...
2                               heikkinen, miss. laina
3         futrelle, mrs. jacques heath (lily may peel)
4                             allen, mr. william henry
Name: Name, dtype: object
In [16]:
# converts all rows into upper 

df.Sex.str.upper().head()
Out[16]:
0      MALE
1    FEMALE
2    FEMALE
3    FEMALE
4      MALE
Name: Sex, dtype: object
In [17]:
# counts all the characters including spaces

df.Name.str.len().head()
Out[17]:
0    23
1    51
2    22
3    44
4    24
Name: Name, dtype: int64
In [46]:
# splits strings in each row over whitespaces ()
# expand=True : expand columns
# pat = regex to split on

df.Name.str.split(pat=',',expand=True).head().rename(columns={0:'First_Name', 1: 'Last_Name'})
Out[46]:
First_Name Last_Name
0 Braund Mr. Owen Harris
1 Cumings Mrs. John Bradley (Florence Briggs Thayer)
2 Heikkinen Miss. Laina
3 Futrelle Mrs. Jacques Heath (Lily May Peel)
4 Allen Mr. William Henry
In [42]:
# splits strings in each row over whitespaces ()
# expand=False : doesn't expand columns
# pat = regex to split on

df.Name.str.split(expand=False).head()
Out[42]:
0                         [Braund,, Mr., Owen, Harris]
1    [Cumings,, Mrs., John, Bradley, (Florence, Bri...
2                           [Heikkinen,, Miss., Laina]
3    [Futrelle,, Mrs., Jacques, Heath, (Lily, May, ...
4                        [Allen,, Mr., William, Henry]
Name: Name, dtype: object
In [49]:
# replace Mr. with empty space

df.Name.str.replace('Mr.', '').head()
Out[49]:
0                                Braund,  Owen Harris
1    Cumings, . John Bradley (Florence Briggs Thayer)
2                              Heikkinen, Miss. Laina
3           Futrelle, . Jacques Heath (Lily May Peel)
4                               Allen,  William Henry
Name: Name, dtype: object
In [71]:
# get() is used to get particular row of split

df.Name.str.split().get(1)
Out[71]:
['Cumings,', 'Mrs.', 'John', 'Bradley', '(Florence', 'Briggs', 'Thayer)']
In [17]:
df.Name[:10]
Out[17]:
0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
5                                     Moran, Mr. James
6                              McCarthy, Mr. Timothy J
7                       Palsson, Master. Gosta Leonard
8    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9                  Nasser, Mrs. Nicholas (Adele Achem)
Name: Name, dtype: object
In [28]:
# Extract just last name

df.Name.str.extract('(?P<Last_Name>[a-zA-Z]+)', expand=True).head()
Out[28]:
Last_Name
0 Braund
1 Cumings
2 Heikkinen
3 Futrelle
4 Allen

End