import pandas as pd
import numpy as np # needed, as pandas is built on NumPy
from IPython.display import Image # to display images
The Pandas library is built on NumPy and provides easy-to-use data structures and data analysis tools for the Python programming language.
Refer to these cheatsheets:
https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PandasPythonForDataScience+(1).pdf
https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Pandas_Cheat_Sheet_2.pdf
Different types of data can be loaded into a pandas DataFrame. A DataFrame looks roughly like a spreadsheet table (just a rough analogy).
Most common input types:
pd.read_csv
pd.read_excel / pd.ExcelFile
pd.read_feather (the feather format reduces memory load, as data is saved in binary form)
pd.read_json
pd.read_html
pd.read_pickle (can also infer whether the pickled object is compressed via compression='infer')
Output types follow the format to_xxx, mirroring the input formats
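As a quick sketch of this read/write symmetry (toy data, round-tripped through an in-memory buffer instead of a real file):

```python
import io
import pandas as pd

# hypothetical tiny dataset
df_io = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
csv_text = df_io.to_csv(index=False)            # output: to_csv mirrors read_csv
df_back = pd.read_csv(io.StringIO(csv_text))    # input: read it back
```

The same pattern applies to the other pairs, e.g. to_json/read_json.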
**Pandas data structures include Series and DataFrame**
Series: A one-dimensional labeled array capable of holding any data type
Image('../images/series.png', width=500) # not pandas, just showing example series
# index will set the index for further reference
# data can be passed as list
s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
s
a    3
b   -5
c    7
d    4
dtype: int64
Can be indexed using either the index label or the position number.
number: --> keeps entries from that position onward
:number --> keeps entries before that position (label slices like s[:'c'] include the endpoint)
s[1:]
b   -5
c    7
d    4
dtype: int64
s['a']
3
s[:'c']
a    3
b   -5
c    7
dtype: int64
s.index
Index(['a', 'b', 'c', 'd'], dtype='object')
# series using dictionary
s2 = pd.Series({'a':3, 'b': -1, 'c': 12}); s2
a     3
b    -1
c    12
dtype: int64
s2['b']
-1
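One handy Series property worth noting here: arithmetic between two Series aligns on index labels, not positions. A small sketch with the same toy values:

```python
import pandas as pd

s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series({'a': 3, 'b': -1, 'c': 12})

# addition aligns on labels; 'd' exists only in s, so the result there is NaN
total = s + s2
```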
Dataframe: A two-dimensional labeled data structure with columns of potentially different types. It is similar to an Excel table.
A DataFrame can be made from a dictionary, a list of lists, and more
Image('../images/df.png', width=500)
data = {'Country': ['Belgium', 'India', 'Brazil'],
'Capital': ['Brussels', 'New Delhi', 'Brasília'],
'Population': [11190846, 1303171035, 207847528]}
df_sample = pd.DataFrame(data,
columns=['Country', 'Capital', 'Population'])
df_sample
Country | Capital | Population | |
---|---|---|---|
0 | Belgium | Brussels | 11190846 |
1 | India | New Delhi | 1303171035 |
2 | Brazil | Brasília | 207847528 |
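The same frame can be built from a list of lists, as mentioned above; a minimal sketch:

```python
import pandas as pd

# rows as a list of lists, with column names given explicitly
rows = [['Belgium', 'Brussels', 11190846],
        ['India', 'New Delhi', 1303171035],
        ['Brazil', 'Brasília', 207847528]]
df_rows = pd.DataFrame(rows, columns=['Country', 'Capital', 'Population'])
```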
Using the famous Titanic data for analysis and exploration:
https://www.kaggle.com/c/titanic/data
df.shape
df.head()
df.dtypes
df.columns
df.nunique()
df.info()
df = pd.read_csv('../data/train.csv') # read csv file
df.shape
(891, 12)
df.head() # see top 5 rows of data
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
df.dtypes # see datatype of each variable
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
df.columns # column names
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')
df.nunique() # unique value for each variable
PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64
df.info() # the non-null counts are very useful for seeing how many nulls each column has
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
iloc
Select based on integer location (that's what the "i" stands for). Can select single or multiple rows/columns.
df.iloc[0, 4] # row 0, column 4
'male'
df.iloc[1:4, 2:6] # indexes are maintained. Can reset_index() to start index from 0
Pclass | Name | Sex | Age | |
---|---|---|---|---|
1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 |
2 | 3 | Heikkinen, Miss. Laina | female | 26.0 |
3 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 |
loc
Select based on labels (can select single or multiple rows/columns)
df.loc[1:2,'Name':"Age"] # here row indexes are numbers but column indexes are name of columns
Name | Sex | Age | |
---|---|---|---|
1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 |
2 | Heikkinen, Miss. Laina | female | 26.0 |
df.loc[2,['Name',"Age"]] # here row indexes are numbers.
Name    Heikkinen, Miss. Laina
Age                         26
Name: 2, dtype: object
ix
ix has been deprecated in recent pandas versions. It was used to select by label or position, but we can always use loc to select by label and iloc to select by integer position.
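To make the loc/iloc contrast concrete, here is a small sketch on a toy frame (hypothetical data, not the Titanic set):

```python
import pandas as pd

# string index so the label/position difference is visible
df_toy = pd.DataFrame({'x': [10, 20, 30]}, index=['r1', 'r2', 'r3'])

by_label = df_toy.loc['r2', 'x']    # label-based lookup
by_position = df_toy.iloc[1, 0]     # position-based lookup -> same cell

# .loc slices include the end label; .iloc slices exclude the end position
label_slice = df_toy.loc['r1':'r2', 'x'].tolist()
position_slice = df_toy.iloc[0:1, 0].tolist()
```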
Boolean indexing
Returns rows where the stated condition is True
| --> condition 1 or condition 2 (Python's or raises an ambiguity error on Series, so use |)
& --> condition 1 and condition 2 (Python's and raises the same ambiguity error, so use &)
~ --> not condition
== --> equals condition
any() --> columns/rows with any value matching the condition
all() --> columns/rows with all values matching the condition
# select rows with either Sex as female or Pclass as 1
df[(df.Sex == 'female') | (df.iloc[:,2] == 1) ].iloc[:3] # () are important
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
# first 3 rows of all columns whose values are either all strings or all numbers > 1
df.loc[:,(df > 1).all()][:3]
Name | Sex | Ticket | Cabin | Embarked | |
---|---|---|---|---|---|
0 | Braund, Mr. Owen Harris | male | A/5 21171 | NaN | S |
1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | PC 17599 | C85 | C |
2 | Heikkinen, Miss. Laina | female | STON/O2. 3101282 | NaN | S |
# first 3 rows of all columns which have no null values
df.loc[:,(df.notnull().all() )][:3]
PassengerId | Survived | Pclass | Name | Sex | SibSp | Parch | Ticket | Fare | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 1 | 0 | A/5 21171 | 7.2500 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 1 | 0 | PC 17599 | 71.2833 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 0 | 0 | STON/O2. 3101282 | 7.9250 |
# first 3 rows of all columns which have at least 1 null value
df.loc[:, df.isnull().any()][:3]
Age | Cabin | Embarked | |
---|---|---|---|
0 | 22.0 | NaN | S |
1 | 38.0 | C85 | C |
2 | 26.0 | NaN | S |
df[(df.iloc[:,2] == 1) & (df.Sex == 'female')].shape
(94, 12)
# fraction of males with Age > 25, df.shape[0] -> number of rows
sum((df.Age > 25) & (df.Sex == 'male'))/df.shape[0]
0.3075196408529742
# number of people who survived and were not in class 3
sum((df.Survived != 0) & (~(df.Pclass == 3)) )
223
querying
Query the dataframe with a boolean expression (filter rows based on a condition)
# filter all rows which have Age > PassengerId
df.query('Age > PassengerId')
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S |
12 | 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20.0 | 0 | 0 | A/5. 2151 | 8.0500 | NaN | S |
13 | 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39.0 | 1 | 5 | 347082 | 31.2750 | NaN | S |
15 | 16 | 1 | 2 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 0 | 0 | 248706 | 16.0000 | NaN | S |
18 | 19 | 0 | 3 | Vander Planke, Mrs. Julius (Emelia Maria Vande... | female | 31.0 | 1 | 0 | 345763 | 18.0000 | NaN | S |
20 | 21 | 0 | 2 | Fynney, Mr. Joseph J | male | 35.0 | 0 | 0 | 239865 | 26.0000 | NaN | S |
21 | 22 | 1 | 2 | Beesley, Mr. Lawrence | male | 34.0 | 0 | 0 | 248698 | 13.0000 | D56 | S |
23 | 24 | 1 | 1 | Sloper, Mr. William Thompson | male | 28.0 | 0 | 0 | 113788 | 35.5000 | A6 | S |
25 | 26 | 1 | 3 | Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... | female | 38.0 | 1 | 5 | 347077 | 31.3875 | NaN | S |
30 | 31 | 0 | 1 | Uruchurtu, Don. Manuel E | male | 40.0 | 0 | 0 | PC 17601 | 27.7208 | NaN | C |
33 | 34 | 0 | 2 | Wheadon, Mr. Edward H | male | 66.0 | 0 | 0 | C.A. 24579 | 10.5000 | NaN | S |
35 | 36 | 0 | 1 | Holverson, Mr. Alexander Oskar | male | 42.0 | 1 | 0 | 113789 | 52.0000 | NaN | S |
54 | 55 | 0 | 1 | Ostby, Mr. Engelhart Cornelius | male | 65.0 | 0 | 1 | 113509 | 61.9792 | B30 | C |
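query can also reference Python variables with @, which keeps thresholds out of the string. A sketch on a hypothetical mini version of the columns used above:

```python
import pandas as pd

df_q = pd.DataFrame({'Age': [22.0, 38.0, 26.0, 35.0],
                     'Pclass': [3, 1, 3, 1]})

cutoff = 30
older_first_class = df_q.query('Age > @cutoff and Pclass == 1')  # @ pulls in the variable
```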
filter
Filter dataframe on column names or row labels by regex or just item names
# filter only sex and age columns (first 2 rows)
df.filter(items=['Age', 'Sex'])[:2]
Age | Sex | |
---|---|---|
0 | 22.0 | male |
1 | 38.0 | female |
# filter only 0 and 5 row index
df.filter(items=[0,5], axis=0)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
# first 2 rows of column names ending with "ed" (think of past tense)
df.filter(like = 'ed', axis=1)[:2]
Survived | Embarked | |
---|---|---|
0 | 0 | S |
1 | 1 | C |
# can do the same as above using a regex
df.filter(regex='ed$', axis=1)[:2]
Survived | Embarked | |
---|---|---|
0 | 0 | S |
1 | 1 | C |
isin
Filter rows where a column's value is in a list of values
df[df.Pclass.isin([0,1])].head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S |
23 | 24 | 1 | 1 | Sloper, Mr. William Thompson | male | 28.0 | 0 | 0 | 113788 | 35.5000 | A6 | S |
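The complement works too: ~ in front of isin keeps rows whose value is not in the list. A toy sketch:

```python
import pandas as pd

df_c = pd.DataFrame({'Pclass': [3, 1, 3, 1, 2]})

# keep rows whose Pclass is NOT in the list
not_first = df_c[~df_c.Pclass.isin([1])]
```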
Setting and resetting the index is important when we merge/group by two dataframes and want to do further analysis on the new dataframe. A dataframe with repeated indexes can cause problems in filtering. Apart from this, we can set a column as the index, which makes merging much faster.
set_index()
Set any column you want as the index of the df
# setting
df.set_index('Ticket')[:2]
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
Ticket | |||||||||||
A/5 21171 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | 7.2500 | NaN | S |
PC 17599 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
# can set multiple columns as index also. Just pass them in list
# Setting Ticket and Name as index
df.set_index(['Ticket', 'Name'])[:2]
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked | ||
---|---|---|---|---|---|---|---|---|---|---|---|
Ticket | Name | ||||||||||
A/5 21171 | Braund, Mr. Owen Harris | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | NaN | S |
PC 17599 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
# can see what are values of index.
# checking index of 1st row
df.set_index(['Ticket', 'Name']).index[0]
('A/5 21171', 'Braund, Mr. Owen Harris')
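The payoff of setting an index: rows can then be looked up by label with loc instead of filtering a column. A toy sketch with hypothetical tickets:

```python
import pandas as pd

df_t = pd.DataFrame({'Ticket': ['A1', 'B2', 'C3'],
                     'Fare': [7.25, 71.28, 8.05]})

# after set_index, loc does a direct label lookup
fare = df_t.set_index('Ticket').loc['B2', 'Fare']
```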
reset_index()
Resets the index back to 0, 1, ..., nrows-1
df_index = df.set_index(['Ticket', 'Name'])
df_index[:2]
PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked | ||
---|---|---|---|---|---|---|---|---|---|---|---|
Ticket | Name | ||||||||||
A/5 21171 | Braund, Mr. Owen Harris | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | NaN | S |
PC 17599 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
df_index.reset_index()[:2]
Ticket | Name | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | A/5 21171 | Braund, Mr. Owen Harris | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | NaN | S |
1 | PC 17599 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
In above case, index is back to 0,1...
rename()
Rename the column names or row indexes of a dataframe. The default axis is the index.
df.rename(columns={'Name': 'Whats_name', 'Fare':'Price'})[:2]
PassengerId | Survived | Pclass | Whats_name | Sex | Age | SibSp | Parch | Ticket | Price | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
# can use some mapper function also. default axis='index' (0)
df.rename(mapper=str.lower, axis='columns')[:2]
passengerid | survived | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
unique()
The unique values in a column of the df (use nunique() for the count of unique values per column)
df.Sex.unique()
array(['male', 'female'], dtype=object)
duplicated()
Check for duplicates in a column. Returns True/False per row.
sum(df.PassengerId.duplicated()) # there are no duplicate PassengerIds. good thing to check
0
# can check duplicates in index also.
# useful if doubtful about duplicates in index doing bad things
sum(df.index.duplicated())
0
drop_duplicates
Drop rows that are duplicates
# can help in getting unique combinations of multiple columns
# unique() doesn't work in this case
df.loc[:,['Sex', 'Embarked']].drop_duplicates()
Sex | Embarked | |
---|---|---|
0 | male | S |
1 | female | C |
2 | female | S |
5 | male | Q |
22 | female | Q |
26 | male | C |
61 | female | NaN |
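drop_duplicates also takes subset= to limit which columns define a duplicate, and keep= to choose which copy survives. A toy sketch:

```python
import pandas as pd

df_d = pd.DataFrame({'Sex': ['male', 'male', 'female'],
                     'Embarked': ['S', 'S', 'C'],
                     'Fare': [7.25, 8.05, 71.28]})

# rows 0 and 1 duplicate on (Sex, Embarked); keep='last' keeps the second one
deduped = df_d.drop_duplicates(subset=['Sex', 'Embarked'], keep='last')
```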
Group by some column/columns, then aggregate to get the mean, count, sum, or a custom function for each group
groupby
# group by sex, then count.
# returns the count in each column; counts differ in some columns because of nulls
# can do iloc[:,0] to get only the first column
df.groupby(by = ['Sex']).count()
PassengerId | Survived | Pclass | Name | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|
Sex | |||||||||||
female | 314 | 314 | 314 | 314 | 261 | 314 | 314 | 314 | 314 | 97 | 312 |
male | 577 | 577 | 577 | 577 | 453 | 577 | 577 | 577 | 577 | 107 | 577 |
# can group by multiple columns
# group by sex and survived -> mean of age
df.groupby(by = ['Sex', 'Survived']).mean().loc[:,'Age']
Sex     Survived
female  0           25.046875
        1           28.847716
male    0           31.618056
        1           27.276022
Name: Age, dtype: float64
# can also group by index levels using level=
# useful when we have a MultiIndex
# can use agg function with lambda func
df_index = df.set_index(['Sex', 'Pclass'])
df_index.groupby(level=[0,1]).agg({'Fare': lambda x: sum(x)/len(x), # this is also just mean actually
'Age' : np.mean})
Fare | Age | ||
---|---|---|---|
Sex | Pclass | ||
female | 1 | 106.125798 | 34.611765 |
2 | 21.970121 | 28.722973 | |
3 | 16.118810 | 21.750000 | |
male | 1 | 67.226127 | 41.281386 |
2 | 19.741782 | 30.740707 | |
3 | 12.661633 | 26.507589 |
Interesting! The ticket price of a 1st class female is approximately double that of a 1st class male
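agg can also take a list of functions to compute several statistics per group at once; a sketch on toy fares:

```python
import pandas as pd

df_g = pd.DataFrame({'Sex': ['male', 'female', 'male', 'female'],
                     'Fare': [8.0, 70.0, 12.0, 30.0]})

# one row per group, one column per aggregation
stats = df_g.groupby('Sex').Fare.agg(['mean', 'max'])
```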
transform
Can apply such functions across all columns with transform, which returns one transformed value per original row
# the result below has the same number of rows as the original df (non-numeric columns are dropped)
df_index.groupby(level=[0,1]).transform(lambda x: sum(x)/len(x)).head()
PassengerId | Survived | Age | SibSp | Parch | Fare | ||
---|---|---|---|---|---|---|---|
Sex | Pclass | ||||||
male | 3 | 455.515850 | 0.135447 | NaN | 0.498559 | 0.224784 | 12.661633 |
female | 1 | 469.212766 | 0.968085 | NaN | 0.553191 | 0.457447 | 106.125798 |
3 | 399.729167 | 0.500000 | NaN | 0.895833 | 0.798611 | 16.118810 | |
1 | 469.212766 | 0.968085 | NaN | 0.553191 | 0.457447 | 106.125798 | |
male | 3 | 455.515850 | 0.135447 | NaN | 0.498559 | 0.224784 | 12.661633 |
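A common practical use of transform is group-wise imputation: because the result keeps the original row shape, it can fill each row's NaN with its own group's mean. A toy sketch:

```python
import pandas as pd
import numpy as np

df_a = pd.DataFrame({'Sex': ['male', 'male', 'female', 'female'],
                     'Age': [20.0, np.nan, 30.0, 40.0]})

# fill missing Age with the mean Age of that row's group
df_a['Age_filled'] = df_a.groupby('Sex').Age.transform(lambda x: x.fillna(x.mean()))
```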
dropna
Drop rows with NA values
# how=any -> drop rows where any column is NA
df.dropna(axis=0, how='any').shape
(183, 12)
# how=all -> drop rows where all columns are NA
df.dropna(axis=0, how='all').shape
(891, 12)
# drops columns which have any NA; show which ones were removed
[set(df.columns) - set(df.dropna(axis=1, how='any').columns)]
[{'Age', 'Cabin', 'Embarked'}]
Three columns have been removed
fillna
# note: passing np.mean like this does NOT fill with the column means;
# the function object itself is inserted into the NaN cells (see the Cabin column below)
# pass a value, or a Series of per-column values, to do a real fill
df.fillna(np.mean)[:1]
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | <function mean at 0x10c7d7d90> | S |
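Note that passing np.mean itself puts the function object into the NaN cells (visible in the Cabin column above). To fill with column means, compute them first and pass the resulting Series to fillna, which fills each numeric column with its own mean. A toy sketch:

```python
import pandas as pd
import numpy as np

df_f = pd.DataFrame({'Age': [22.0, np.nan, 26.0],
                     'Cabin': ['C85', None, None]})

# df_f.mean(numeric_only=True) is a Series of per-column means;
# fillna maps it onto matching columns (Cabin is left untouched)
filled = df_f.fillna(df_f.mean(numeric_only=True))
```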
merge / join
data1 = pd.DataFrame({'x1': list('abc'), 'x2': [11.432, 1.303, 99.906]})
data2 = pd.DataFrame({'x1': list('abd'), 'x3': [20.784, np.NaN, 20.784]})
data1
x1 | x2 | |
---|---|---|
0 | a | 11.432 |
1 | b | 1.303 |
2 | c | 99.906 |
data2
x1 | x3 | |
---|---|---|
0 | a | 20.784 |
1 | b | NaN |
2 | d | 20.784 |
# inner join keeps only keys present in both tables (like sql)
data1.merge(data2, how='inner', on='x1')
x1 | x2 | x3 | |
---|---|---|---|
0 | a | 11.432 | 20.784 |
1 | b | 1.303 | NaN |
# outer join keeps all keys from both dfs and fills gaps with NA
data1.merge(data2, how='outer', on='x1')
x1 | x2 | x3 | |
---|---|---|---|
0 | a | 11.432 | 20.784 |
1 | b | 1.303 | NaN |
2 | c | 99.906 | NaN |
3 | d | NaN | 20.784 |
Can also use join, but merge is faster; just use merge
# if column names overlap, a suffix must be specified. Note that join matches on data2's
# index (0, 1, 2), not its x1 column, which is why x3 is all NaN here
data1.join(data2, on='x1', how='left', lsuffix='L')
x1L | x2 | x1 | x3 | |
---|---|---|---|---|
0 | a | 11.432 | NaN | NaN |
1 | b | 1.303 | NaN | NaN |
2 | c | 99.906 | NaN | NaN |
concatenate
# concat over axis=0, i.e. stack rows
# columns missing from one df are added and filled with NA
pd.concat([data1, data2], axis=0)
x1 | x2 | x3 | |
---|---|---|---|
0 | a | 11.432 | NaN |
1 | b | 1.303 | NaN |
2 | c | 99.906 | NaN |
0 | a | NaN | 20.784 |
1 | b | NaN | NaN |
2 | d | NaN | 20.784 |
Notice the duplicate index values: concat maintains each original df's index.
Can use ignore_index=True
to make the index start from 0
pd.concat([data1, data2], axis=0, ignore_index=True)
x1 | x2 | x3 | |
---|---|---|---|
0 | a | 11.432 | NaN |
1 | b | 1.303 | NaN |
2 | c | 99.906 | NaN |
3 | a | NaN | 20.784 |
4 | b | NaN | NaN |
5 | d | NaN | 20.784 |
data2.loc[3] = ['g', 500] # adding new row
data2
x1 | x3 | |
---|---|---|
0 | a | 20.784 |
1 | b | NaN |
2 | d | 20.784 |
3 | g | 500.000 |
# concat over axis=1, i.e. combine columns side by side
pd.concat([data1, data2], axis=1)
x1 | x2 | x1 | x3 | |
---|---|---|---|---|
0 | a | 11.432 | a | 20.784 |
1 | b | 1.303 | b | NaN |
2 | c | 99.906 | d | 20.784 |
3 | NaN | NaN | g | 500.000 |
to_datetime() -> converts its argument to datetime (provided it can be parsed as a datetime)
date_range() -> generates a range of datetime data
DatetimeIndex -> a datetime-typed index
pd.to_datetime('2018-2-19')
Timestamp('2018-02-19 00:00:00')
# gives datetimeindex format
pd.date_range('2018-4-18', periods=6, freq='d')
DatetimeIndex(['2018-04-18', '2018-04-19', '2018-04-20', '2018-04-21', '2018-04-22', '2018-04-23'], dtype='datetime64[ns]', freq='D')
data1['date'] = pd.date_range('2018-4-18', periods=3, freq='d')
data1
x1 | x2 | date | |
---|---|---|---|
0 | a | 11.432 | 2018-04-18 |
1 | b | 1.303 | 2018-04-19 |
2 | c | 99.906 | 2018-04-20 |
data1.date
0 2018-04-18 1 2018-04-19 2 2018-04-20 Name: date, dtype: datetime64[ns]
pd.DatetimeIndex(data1.date)
DatetimeIndex(['2018-04-18', '2018-04-19', '2018-04-20'], dtype='datetime64[ns]', name='date', freq=None)
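Once a column is datetime64, the .dt accessor exposes its parts (year, month, day, weekday, ...). A small sketch:

```python
import pandas as pd

dates = pd.Series(pd.date_range('2018-04-18', periods=3, freq='D'))

days = dates.dt.day.tolist()       # day-of-month for each entry
months = dates.dt.month.tolist()   # month number for each entry
```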
pivot -> reshape data
Ever used a pivot table in Excel? It's the same idea.
# index = new index, columns = new_columns, values = values to put
df.pivot(index='Sex', columns = 'PassengerId', values = 'Age')
PassengerId | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 882 | 883 | 884 | 885 | 886 | 887 | 888 | 889 | 890 | 891 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Sex | |||||||||||||||||||||
female | NaN | 38.0 | 26.0 | 35.0 | NaN | NaN | NaN | NaN | 27.0 | 14.0 | ... | NaN | 22.0 | NaN | NaN | 39.0 | NaN | 19.0 | NaN | NaN | NaN |
male | 22.0 | NaN | NaN | NaN | 35.0 | NaN | 54.0 | 2.0 | NaN | NaN | ... | 33.0 | NaN | 28.0 | 25.0 | NaN | 27.0 | NaN | NaN | 26.0 | 32.0 |
2 rows × 891 columns
In the above case, using pivot doesn't make much sense, but it shows the mechanics
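When index/columns pairs repeat, pivot raises an error; pivot_table handles duplicates by aggregating them (mean by default). A toy sketch:

```python
import pandas as pd

df_p = pd.DataFrame({'Sex': ['male', 'female', 'male', 'female'],
                     'Pclass': [1, 1, 3, 3],
                     'Fare': [50.0, 100.0, 8.0, 12.0]})

# one cell per (Sex, Pclass) pair, aggregated with the mean
table = df_p.pivot_table(index='Sex', columns='Pclass', values='Fare', aggfunc='mean')
```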
stack
Convert the whole df into one long format
df.stack()
0    PassengerId                                                1
     Survived                                                   0
     Pclass                                                     3
     Name                                 Braund, Mr. Owen Harris
     Sex                                                     male
     Age                                                       22
     SibSp                                                      1
     Parch                                                      0
     Ticket                                             A/5 21171
     Fare                                                    7.25
     Embarked                                                   S
1    PassengerId                                                2
     Survived                                                   1
     Pclass                                                     1
     Name           Cumings, Mrs. John Bradley (Florence Briggs Th...
     Sex                                                   female
     Age                                                       38
     SibSp                                                      1
     Parch                                                      0
     Ticket                                              PC 17599
     Fare                                                 71.2833
     Cabin                                                    C85
     Embarked                                                   C
2    PassengerId                                                3
     Survived                                                   1
     Pclass                                                     3
     Name                                  Heikkinen, Miss. Laina
     Sex                                                   female
     Age                                                       26
     SibSp                                                      0
                                    ...
888  Name               Johnston, Miss. Catherine Helen "Carrie"
     Sex                                                   female
     SibSp                                                      1
     Parch                                                      2
     Ticket                                            W./C. 6607
     Fare                                                   23.45
     Embarked                                                   S
889  PassengerId                                              890
     Survived                                                   1
     Pclass                                                     1
     Name                                   Behr, Mr. Karl Howell
     Sex                                                     male
     Age                                                       26
     SibSp                                                      0
     Parch                                                      0
     Ticket                                                111369
     Fare                                                      30
     Cabin                                                   C148
     Embarked                                                   C
890  PassengerId                                              891
     Survived                                                   0
     Pclass                                                     3
     Name                                     Dooley, Mr. Patrick
     Sex                                                     male
     Age                                                       32
     SibSp                                                      0
     Parch                                                      0
     Ticket                                                370376
     Fare                                                    7.75
     Embarked                                                   Q
Length: 9826, dtype: object
You won't generally use it; I have never come across a use for it in my experience with Python
To get index/Series pairs:
iteritems() for (column-label, Series) pairs; on a Series it yields (index, value) pairs
iterrows() for (row-index, Series) pairs
list(df.Sex.iteritems())[:5]
[(0, 'male'), (1, 'female'), (2, 'female'), (3, 'female'), (4, 'male')]
list(df.iterrows())[0]
(0, PassengerId 1 Survived 0 Pclass 3 Name Braund, Mr. Owen Harris Sex male Age 22 SibSp 1 Parch 0 Ticket A/5 21171 Fare 7.25 Cabin NaN Embarked S Name: 0, dtype: object)
apply -> apply a function over a df or Series
applymap -> apply a function elementwise, i.e. to every element of the df
# function squares when type(x) = float, cubes when type(x) = int, returns x unchanged otherwise
f = lambda x: x**2 if type(x) == float else x**3 if type(x) == int else x
# whole series is passed
df.Fare.apply(f)[:3]
0 52.562500 1 5081.308859 2 62.805625 Name: Fare, dtype: float64
# elements are passed
df.applymap(f)[:3]
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 27 | Braund, Mr. Owen Harris | male | 484.0 | 1 | 0 | A/5 21171 | 52.562500 | NaN | S |
1 | 8 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 1444.0 | 1 | 0 | PC 17599 | 5081.308859 | C85 | C |
2 | 27 | 1 | 27 | Heikkinen, Miss. Laina | female | 676.0 | 0 | 0 | STON/O2. 3101282 | 62.805625 | NaN | S |
What can we do when we have string data in a pandas dataframe/series?
str
Working with string data in a pandas Series/df.
We can do:
str.upper()/lower() to convert strings to upper or lower case
str.len() to find the length of a string
str.strip()/lstrip()/rstrip() to strip spaces
str.replace() to replace anything in a string
str.split() to split the words of a string, or split on some other delimiter
str.get() to access elements of the split list
str.rsplit() to split from the end of the string based on some delimiter
str.extract() to extract a specific pattern from a string, e.g. letters or numbers
Let's see how to use all that on a pandas Series. Keep in mind that a pandas DataFrame has no attribute called str; it works on Series objects only. So grab a column of the df, then apply str.
# converts all rows into lower
df.Name.str.lower().head()
0 braund, mr. owen harris 1 cumings, mrs. john bradley (florence briggs th... 2 heikkinen, miss. laina 3 futrelle, mrs. jacques heath (lily may peel) 4 allen, mr. william henry Name: Name, dtype: object
# converts all rows into upper
df.Sex.str.upper().head()
0 MALE 1 FEMALE 2 FEMALE 3 FEMALE 4 MALE Name: Sex, dtype: object
# counts all the characters including spaces
df.Name.str.len().head()
0 23 1 51 2 22 3 44 4 24 Name: Name, dtype: int64
# splits strings in each row on the comma
# expand=True : expand the splits into columns
# pat = pattern/regex to split on
df.Name.str.split(pat=',',expand=True).head().rename(columns={0:'First_Name', 1: 'Last_Name'})
First_Name | Last_Name | |
---|---|---|
0 | Braund | Mr. Owen Harris |
1 | Cumings | Mrs. John Bradley (Florence Briggs Thayer) |
2 | Heikkinen | Miss. Laina |
3 | Futrelle | Mrs. Jacques Heath (Lily May Peel) |
4 | Allen | Mr. William Henry |
# splits strings in each row on whitespace (the default when pat is omitted)
# expand=False : keeps the splits as lists instead of columns
df.Name.str.split(expand=False).head()
0 [Braund,, Mr., Owen, Harris] 1 [Cumings,, Mrs., John, Bradley, (Florence, Bri... 2 [Heikkinen,, Miss., Laina] 3 [Futrelle,, Mrs., Jacques, Heath, (Lily, May, ... 4 [Allen,, Mr., William, Henry] Name: Name, dtype: object
# replace 'Mr.' with an empty string. The pattern is treated as a regex, so '.' matches
# any character and 'Mrs' also matches (see rows 1 and 3)
df.Name.str.replace('Mr.', '').head()
0 Braund, Owen Harris 1 Cumings, . John Bradley (Florence Briggs Thayer) 2 Heikkinen, Miss. Laina 3 Futrelle, . Jacques Heath (Lily May Peel) 4 Allen, William Henry Name: Name, dtype: object
# Series.get(1) here fetches the element with index 1 from the Series of split lists
df.Name.str.split().get(1)
['Cumings,', 'Mrs.', 'John', 'Bradley', '(Florence', 'Briggs', 'Thayer)']
df.Name[:10]
0 Braund, Mr. Owen Harris 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 2 Heikkinen, Miss. Laina 3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 Allen, Mr. William Henry 5 Moran, Mr. James 6 McCarthy, Mr. Timothy J 7 Palsson, Master. Gosta Leonard 8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) 9 Nasser, Mrs. Nicholas (Adele Achem) Name: Name, dtype: object
# Extract just last name
df.Name.str.extract('(?P<Last_Name>[a-zA-Z]+)', expand=True).head()
Last_Name | |
---|---|
0 | Braund |
1 | Cumings |
2 | Heikkinen |
3 | Futrelle |
4 | Allen |
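extract shines with a more targeted pattern; for instance, a named group can pull out the title between the comma and the dot (a sketch on a few hypothetical names):

```python
import pandas as pd

names = pd.Series(['Braund, Mr. Owen Harris',
                   'Cumings, Mrs. John Bradley',
                   'Heikkinen, Miss. Laina'])

# capture the word between ', ' and the next '.' as a named group
titles = names.str.extract(r',\s(?P<Title>\w+)\.', expand=True)
```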