Data Manupaltion using Python Pandas

The following operations in Pandas are done to achieve the same results shown in vignette of data.table package in R.

In [1]:
import pandas as pd

1. Basics

In [2]:
# Read csv file in pandas:

flights = pd.read_csv("D:\PythonPandas\FlightsNew.csv")
flights.head()
Out[2]:
year month day dep_time dep_delay arr_time arr_delay cancelled carrier tailnum flight origin dest air_time distance hour min
0 2014 1 1 914 14 1238 13 0 AA N338AA 1 JFK LAX 359 2475 9 14
1 2014 1 1 1157 -3 1523 13 0 AA N335AA 3 JFK LAX 363 2475 11 57
2 2014 1 1 1902 2 2224 9 0 AA N327AA 21 JFK LAX 351 2475 19 2
3 2014 1 1 722 -8 1014 -26 0 AA N3EHAA 29 LGA PBI 157 1035 7 22
4 2014 1 1 1347 2 1706 1 0 AA N319AA 117 JFK LAX 350 2475 13 47

In this notebook, we will :

start with basics - how to subset rows, select and compute on columns

and then we will look at performing data aggregations by group.

Get all the flights with “JFK” as the origin airport in the month of March.

In [3]:
ans=flights.loc[(flights["origin"].isin(["JFK"]))&(flights["month"]==3)].reset_index()
ans.head()
Out[3]:
index year month day dep_time dep_delay arr_time arr_delay cancelled carrier tailnum flight origin dest air_time distance hour min
0 43609 2014 3 1 849 -11 1306 36 0 AA N784AA 1 JFK LAX 375 2475 8 49
1 43610 2014 3 1 1157 -3 1529 14 0 AA N787AA 3 JFK LAX 368 2475 11 57
2 43611 2014 3 1 652 -3 1019 4 0 AA N785AA 9 JFK LAX 366 2475 6 52
3 43612 2014 3 1 1117 47 1502 77 0 AA N788AA 19 JFK LAX 379 2475 11 17
4 43613 2014 3 1 1950 50 2321 66 0 AA N336AA 21 JFK LAX 350 2475 19 50

Get the first two rows from flights.

In [4]:
flights_tworows=flights.loc[0:1]
flights_tworows.head()
Out[4]:
year month day dep_time dep_delay arr_time arr_delay cancelled carrier tailnum flight origin dest air_time distance hour min
0 2014 1 1 914 14 1238 13 0 AA N338AA 1 JFK LAX 359 2475 9 14
1 2014 1 1 1157 -3 1523 13 0 AA N335AA 3 JFK LAX 363 2475 11 57

Sort flights first by column origin in ascending order, and then by dest in descending order:

In [5]:
flights_sort=flights.sort_values(["origin","dest"],ascending=[1,0]).reset_index()
flights_sort.head()
Out[5]:
index year month day dep_time dep_delay arr_time arr_delay cancelled carrier tailnum flight origin dest air_time distance hour min
0 2790 2014 1 5 836 6 1151 49 0 EV N12175 4419 EWR XNA 195 1131 8 36
1 3351 2014 1 6 833 7 1111 13 0 EV N24128 4419 EWR XNA 190 1131 8 33
2 4066 2014 1 7 811 -6 1035 -13 0 EV N12142 4419 EWR XNA 179 1131 8 11
3 4825 2014 1 8 810 -7 1036 -12 0 EV N11193 4419 EWR XNA 184 1131 8 10
4 5658 2014 1 9 833 16 1055 7 0 EV N14198 4419 EWR XNA 181 1131 8 33

Select both arr_delay and dep_delay columns.

In [6]:
arr_dep_delay=flights[["arr_delay","dep_delay"]].copy()
arr_dep_delay.head()
Out[6]:
arr_delay dep_delay
0 13 14
1 13 -3
2 9 2
3 -26 -8
4 1 2

Select both arr_delay and dep_delay columns and rename them to delay_arr and delay_dep.

In [7]:
delay_arr_dep=flights[["arr_delay","dep_delay"]].rename(columns={"arr_delay":"delay_arr","dep_delay":"delay_dep"})
delay_arr_dep.head()
Out[7]:
delay_arr delay_dep
0 13 14
1 13 -3
2 9 2
3 -26 -8
4 1 2

How many trips have had total delay < 0?

In [8]:
#Step1: Create a column called "delay" in "flights" dataframe and assign 0 to it.
flights["delay"]=0
#Step2: Check for cases where arr_delay + dep_delay is less than 0
flights.loc[(flights["arr_delay"]+flights["dep_delay"])<0,"delay"]=1
#Step3: Count the number of 1 in the delay column
flights.loc[flights["delay"]==1]["delay"].count()
Out[8]:
141814

Calculate the average arrival and departure delay for all flights with “JFK” as the origin airport in the month of June.

In [9]:
flights.loc[flights["origin"].isin(["JFK"])][["arr_delay","dep_delay"]].mean()
Out[9]:
arr_delay     7.731465
dep_delay    11.446167
dtype: float64

How many trips have been made in 2014 from “JFK” airport in the month of June?

In [10]:
len(flights.loc[(flights["origin"].isin(["JFK"]))&(flights["month"]==6)]["origin"].tolist())
Out[10]:
8422

2. Aggregations

How can we get the number of trips corresponding to each origin airport?

In [11]:
aggregations={"origin":"count"}
             
In [12]:
flights.groupby(["origin"]).agg({"origin":"count"}).rename(columns={"origin":"number_trips"}).reset_index()
Out[12]:
origin number_trips
0 EWR 87400
1 JFK 81483
2 LGA 84433

How can we calculate the number of trips for each origin airport for carrier code “AA”?

In [13]:
flights.loc[flights["carrier"].isin(["AA"])].groupby(["origin"]).count().reset_index()[['origin','flight']].rename(columns={"flight":"number_trips"})
Out[13]:
origin number_trips
0 EWR 2649
1 JFK 11923
2 LGA 11730

How can we get the total number of trips for each origin, dest pair for carrier code “AA”?

In [14]:
flights.loc[flights["carrier"].isin(["AA"])].groupby(["origin","dest"]).count().reset_index()[["origin","dest","flight"]].rename(columns={"flight":"number_trips"}).head()
Out[14]:
origin dest number_trips
0 EWR DFW 1618
1 EWR LAX 62
2 EWR MIA 848
3 EWR PHX 121
4 JFK AUS 297

How can we get the average arrival and departure delay for each orig,dest pair for each month for carrier code “AA”?

In [15]:
flights.loc[flights["carrier"].isin(["AA"])].groupby(["origin","dest","month"])[["arr_delay","dep_delay"]].mean().reset_index().head()
Out[15]:
origin dest month arr_delay dep_delay
0 EWR DFW 1 6.427673 10.012579
1 EWR DFW 2 10.536765 11.345588
2 EWR DFW 3 12.865031 8.079755
3 EWR DFW 4 17.792683 12.920732
4 EWR DFW 5 18.487805 18.682927

3. Melting Operation

Melting operation is converting from wide form to long form

In [16]:
family_data=pd.DataFrame({'family_id':[1,2,3,4,5], 
                          'age_mother':[30,27,26,32,29],
                          'dob_child1':['1998-11-26','1996-06-22','2002-07-11','2004-10-10','2000-12-05'],
                          'dob_child2':['NA','2004-04-05','2004-04-05','2009-08-27','2005-02-28'],
                          'dob_child3':['NA','NA','2007-09-02','2012-07-21', ""]
                         
                         })[['family_id','age_mother','dob_child1','dob_child2','dob_child3']]
family_data
Out[16]:
family_id age_mother dob_child1 dob_child2 dob_child3
0 1 30 1998-11-26 NA NA
1 2 27 1996-06-22 2004-04-05 NA
2 3 26 2002-07-11 2004-04-05 2007-09-02
3 4 32 2004-10-10 2009-08-27 2012-07-21
4 5 29 2000-12-05 2005-02-28

We need to melt the above dataframe

In [17]:
family_melt=pd.melt(family_data,
                   id_vars=['family_id','age_mother'],
                   value_vars=['dob_child1','dob_child2','dob_child3'],
                   var_name=['dob_child'],
                   value_name='dob')
family_melt
Out[17]:
family_id age_mother dob_child dob
0 1 30 dob_child1 1998-11-26
1 2 27 dob_child1 1996-06-22
2 3 26 dob_child1 2002-07-11
3 4 32 dob_child1 2004-10-10
4 5 29 dob_child1 2000-12-05
5 1 30 dob_child2 NA
6 2 27 dob_child2 2004-04-05
7 3 26 dob_child2 2004-04-05
8 4 32 dob_child2 2009-08-27
9 5 29 dob_child2 2005-02-28
10 1 30 dob_child3 NA
11 2 27 dob_child3 NA
12 3 26 dob_child3 2007-09-02
13 4 32 dob_child3 2012-07-21
14 5 29 dob_child3

4. Dcast operation

In [18]:
#family_melt.set_index(['dob','family_id']).unstack('dob')

family_dcast=family_melt.pivot_table(index=['family_id','age_mother'],columns="dob_child",values="dob",aggfunc='first')
In [19]:
family_dcast
Out[19]:
dob_child dob_child1 dob_child2 dob_child3
family_id age_mother
1 30 1998-11-26 NA NA
2 27 1996-06-22 2004-04-05 NA
3 26 2002-07-11 2004-04-05 2007-09-02
4 32 2004-10-10 2009-08-27 2012-07-21
5 29 2000-12-05 2005-02-28
In [20]:
family_dcast.columns
Out[20]:
Index(['dob_child1', 'dob_child2', 'dob_child3'], dtype='object', name='dob_child')
In [47]:
family_dcast.reset_index().rename_axis("",axis="columns")
Out[47]:
family_id age_mother dob_child1 dob_child2 dob_child3
0 1 30 1998-11-26 NA NA
1 2 27 1996-06-22 2004-04-05 NA
2 3 26 2002-07-11 2004-04-05 2007-09-02
3 4 32 2004-10-10 2009-08-27 2012-07-21
4 5 29 2000-12-05 2005-02-28