QuantLab: pandas¶

(Go to Quant Lab)¶

Source: Python for Finance (2nd ed.)¶

Mastering Data-Driven Finance

pandas Basics¶

First Steps with DataFrame Class¶

In [4]:

import pandas as pd

Create a dataframe by passing data, column names, and row names as arguments

In [5]:

df = pd.DataFrame([10, 20, 30, 40],  
                  columns=['numbers'],  
                  index=['a', 'b', 'c', 'd'])  
df

Out[5]:

	numbers
a	10
b	20
c	30
d	40

Access the row names of a dataframe:

In [6]:

Out[6]:

Index(['a', 'b', 'c', 'd'], dtype='object')

Access the column names

In [8]:

Out[8]:

Index(['numbers'], dtype='object')

Access a row by its name

In [6]:

Out[6]:

numbers    30
Name: c, dtype: int64

Or access multiple rows

In [7]:

Out[7]:

	numbers
a	10
d	40

Access rows by index number rather than its name

In [8]:

Out[8]:

	numbers
b	20
c	30

Take sums along each column

In [9]:

Out[9]:

numbers    100
dtype: int64

Apply a specific operation to each column

In [10]:

Out[10]:

	numbers
a	100
b	400
c	900
d	1600

Another way of doing this operation

In [11]:

Out[11]:

	numbers
a	100
b	400
c	900
d	1600

Generate a new column called floats

In [12]:

Out[12]:

	numbers	floats
a	10	1.5
b	20	2.5
c	30	3.5
d	40	4.5

Access the floats column

In [13]:

Out[13]:

a    1.5
b    2.5
c    3.5
d    4.5
Name: floats, dtype: float64

Initialize a new column with names Yves, Sandra, Lilli, Henry at indices d, a, b, c

In [17]:

Out[17]:

	floats	names	numbers
a	1.50	Sandra	10
b	2.50	Lilli	20
c	3.50	Henry	30
d	4.50	Yves	40
y	5.75	NaN	100

Add a new row with entries [100, 5.75, 'Jil'] while ignoring the index

In [18]:

Out[18]:

	floats	names	numbers
0	1.50	Sandra	10
1	2.50	Lilli	20
2	3.50	Henry	30
3	4.50	Yves	40
4	5.75	NaN	100
5	5.75	Jil	100

Add a new row with entries [100, 5.75, 'Jil'] at index y

In [19]:

/Users/israeldiego/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py:6201: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=True'.

To retain the current behavior and silence the warning, pass sort=False

  sort=sort)

Out[19]:

	floats	names	numbers
a	1.50	Sandra	10
b	2.50	Lilli	20
c	3.50	Henry	30
d	4.50	Yves	40
y	5.75	NaN	100
y	5.75	Jil	100

Add a new row with name Liz at index z

In [20]:

Check the data types for each column

In [22]:

Out[22]:

numbers    float64
floats     float64
names       object
dtype: object

Get the Averages for columns: numbers and floats

In [23]:

Out[23]:

numbers    40.00
floats      3.55
dtype: float64

Get the standard deviation for columns: numbers and floats

In [24]:

Out[24]:

numbers    35.355339
floats      1.662077
dtype: float64

Second Steps with DataFrame Class¶

In [23]:

import numpy as np
np.random.seed(100)

Generate 9x4 array of random numbers from $N(0,1)$ distribution

In [24]:

Out[24]:

array([[-1.74976547,  0.3426804 ,  1.1530358 , -0.25243604],
       [ 0.98132079,  0.51421884,  0.22117967, -1.07004333],
       [-0.18949583,  0.25500144, -0.45802699,  0.43516349],
       [-0.58359505,  0.81684707,  0.67272081, -0.10441114],
       [-0.53128038,  1.02973269, -0.43813562, -1.11831825],
       [ 1.61898166,  1.54160517, -0.25187914, -0.84243574],
       [ 0.18451869,  0.9370822 ,  0.73100034,  1.36155613],
       [-0.32623806,  0.05567601,  0.22239961, -1.443217  ],
       [-0.75635231,  0.81645401,  0.75044476, -0.45594693]])

Convert the previous numpy array to a dataframe

In [27]:

Out[27]:

	0	1	2	3
0	-1.749765	0.342680	1.153036	-0.252436
1	0.981321	0.514219	0.221180	-1.070043
2	-0.189496	0.255001	-0.458027	0.435163
3	-0.583595	0.816847	0.672721	-0.104411
4	-0.531280	1.029733	-0.438136	-1.118318
5	1.618982	1.541605	-0.251879	-0.842436
6	0.184519	0.937082	0.731000	1.361556
7	-0.326238	0.055676	0.222400	-1.443217
8	-0.756352	0.816454	0.750445	-0.455947

Change column names to ['No1', 'No2', 'No3', 'No4']

In [28]:

Out[28]:

	No1	No2	No3	No4
0	-1.749765	0.342680	1.153036	-0.252436
1	0.981321	0.514219	0.221180	-1.070043
2	-0.189496	0.255001	-0.458027	0.435163
3	-0.583595	0.816847	0.672721	-0.104411
4	-0.531280	1.029733	-0.438136	-1.118318
5	1.618982	1.541605	-0.251879	-0.842436
6	0.184519	0.937082	0.731000	1.361556
7	-0.326238	0.055676	0.222400	-1.443217
8	-0.756352	0.816454	0.750445	-0.455947

Get the mean of column No2

In [29]:

Out[29]:

0.7010330941456459

Generate dates for end of day-month for first 9 months of 2019

In [32]:

Out[32]:

DatetimeIndex(['2019-01-31', '2019-02-28', '2019-03-31', '2019-04-30',
               '2019-05-31', '2019-06-30', '2019-07-31', '2019-08-31',
               '2019-09-30'],
              dtype='datetime64[ns]', freq='M')

Make these dates the new row indices of our dataframe

In [33]:

Out[33]:

	No1	No2	No3	No4
2019-01-31	-1.749765	0.342680	1.153036	-0.252436
2019-02-28	0.981321	0.514219	0.221180	-1.070043
2019-03-31	-0.189496	0.255001	-0.458027	0.435163
2019-04-30	-0.583595	0.816847	0.672721	-0.104411
2019-05-31	-0.531280	1.029733	-0.438136	-1.118318
2019-06-30	1.618982	1.541605	-0.251879	-0.842436
2019-07-31	0.184519	0.937082	0.731000	1.361556
2019-08-31	-0.326238	0.055676	0.222400	-1.443217
2019-09-30	-0.756352	0.816454	0.750445	-0.455947

Extract only the data from our dataframe

In [34]:

Out[34]:

array([[-1.74976547,  0.3426804 ,  1.1530358 , -0.25243604],
       [ 0.98132079,  0.51421884,  0.22117967, -1.07004333],
       [-0.18949583,  0.25500144, -0.45802699,  0.43516349],
       [-0.58359505,  0.81684707,  0.67272081, -0.10441114],
       [-0.53128038,  1.02973269, -0.43813562, -1.11831825],
       [ 1.61898166,  1.54160517, -0.25187914, -0.84243574],
       [ 0.18451869,  0.9370822 ,  0.73100034,  1.36155613],
       [-0.32623806,  0.05567601,  0.22239961, -1.443217  ],
       [-0.75635231,  0.81645401,  0.75044476, -0.45594693]])

Extract the data using numpy command

In [35]:

Out[35]:

array([[-1.74976547,  0.3426804 ,  1.1530358 , -0.25243604],
       [ 0.98132079,  0.51421884,  0.22117967, -1.07004333],
       [-0.18949583,  0.25500144, -0.45802699,  0.43516349],
       [-0.58359505,  0.81684707,  0.67272081, -0.10441114],
       [-0.53128038,  1.02973269, -0.43813562, -1.11831825],
       [ 1.61898166,  1.54160517, -0.25187914, -0.84243574],
       [ 0.18451869,  0.9370822 ,  0.73100034,  1.36155613],
       [-0.32623806,  0.05567601,  0.22239961, -1.443217  ],
       [-0.75635231,  0.81645401,  0.75044476, -0.45594693]])

Basic Analytics¶

In [36]:

df.info()  

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 9 entries, 2019-01-31 to 2019-09-30
Freq: M
Data columns (total 4 columns):
No1    9 non-null float64
No2    9 non-null float64
No3    9 non-null float64
No4    9 non-null float64
dtypes: float64(4)
memory usage: 360.0 bytes

Get descriptive statistics for data frame

In [41]:

Out[41]:

	No1	No2	No3	No4
count	9.000000	9.000000	9.000000	9.000000
mean	-0.150212	0.701033	0.289193	-0.387788
std	0.988306	0.457685	0.579920	0.877532
min	-1.749765	0.055676	-0.458027	-1.443217
25%	-0.583595	0.342680	-0.251879	-1.070043
50%	-0.326238	0.816454	0.222400	-0.455947
75%	0.184519	0.937082	0.731000	-0.104411
max	1.618982	1.541605	1.153036	1.361556

Get column sums

In [ ]:

Get column means

In [43]:

Out[43]:

No1   -0.150212
No2    0.701033
No3    0.289193
No4   -0.387788
dtype: float64

Get column means by specifying axis

In [44]:

Out[44]:

No1   -0.150212
No2    0.701033
No3    0.289193
No4   -0.387788
dtype: float64

Get row means by specifying axis

In [45]:

Out[45]:

2019-01-31   -0.126621
2019-02-28    0.161669
2019-03-31    0.010661
2019-04-30    0.200390
2019-05-31   -0.264500
2019-06-30    0.516568
2019-07-31    0.803539
2019-08-31   -0.372845
2019-09-30    0.088650
Freq: M, dtype: float64

Get column cumulative sum

In [46]:

Out[46]:

	No1	No2	No3	No4
2019-01-31	-1.749765	0.342680	1.153036	-0.252436
2019-02-28	-0.768445	0.856899	1.374215	-1.322479
2019-03-31	-0.957941	1.111901	0.916188	-0.887316
2019-04-30	-1.541536	1.928748	1.588909	-0.991727
2019-05-31	-2.072816	2.958480	1.150774	-2.110045
2019-06-30	-0.453834	4.500086	0.898895	-2.952481
2019-07-31	-0.269316	5.437168	1.629895	-1.590925
2019-08-31	-0.595554	5.492844	1.852294	-3.034142
2019-09-30	-1.351906	6.309298	2.602739	-3.490089

Take log of each entry

In [48]:

# raises warning

/Users/yves/miniconda3/envs/py4fi2nd/lib/python3.6/site-packages/ipykernel_launcher.py:2: RuntimeWarning: invalid value encountered in log

Out[48]:

	No1	No2	No3	No4
2019-01-31	NaN	-1.070957	0.142398	NaN
2019-02-28	-0.018856	-0.665106	-1.508780	NaN
2019-03-31	NaN	-1.366486	NaN	-0.832033
2019-04-30	NaN	-0.202303	-0.396425	NaN
2019-05-31	NaN	0.029299	NaN	NaN
2019-06-30	0.481797	0.432824	NaN	NaN
2019-07-31	-1.690005	-0.064984	-0.313341	0.308628
2019-08-31	NaN	-2.888206	-1.503279	NaN
2019-09-30	NaN	-0.202785	-0.287089	NaN

Take square root of each entry

In [49]:

Out[49]:

	No1	No2	No3	No4
2019-01-31	1.322787	0.585389	1.073795	0.502430
2019-02-28	0.990616	0.717091	0.470297	1.034429
2019-03-31	0.435311	0.504977	0.676777	0.659669
2019-04-30	0.763934	0.903796	0.820196	0.323127
2019-05-31	0.728890	1.014757	0.661918	1.057506
2019-06-30	1.272392	1.241614	0.501876	0.917843
2019-07-31	0.429556	0.968030	0.854986	1.166857
2019-08-31	0.571173	0.235958	0.471593	1.201340
2019-09-30	0.869685	0.903578	0.866282	0.675238

Basic Visualization¶

In [52]:

from pylab import plt, mpl  
%matplotlib inline

Plot the cumulative sum of each column from our df

In [53]:

Plot Bar chart for each date

In [54]:

GroupBy Operations¶

In [64]:

df['Quarter'] = ['Q1', 'Q1', 'Q1', 'Q2', 'Q2',
                 'Q2', 'Q3', 'Q3', 'Q3']
df

Out[64]:

	No1	No2	No3	No4	Quarter
2019-01-31	-1.749765	0.342680	1.153036	-0.252436	Q1
2019-02-28	0.981321	0.514219	0.221180	-1.070043	Q1
2019-03-31	-0.189496	0.255001	-0.458027	0.435163	Q1
2019-04-30	-0.583595	0.816847	0.672721	-0.104411	Q2
2019-05-31	-0.531280	1.029733	-0.438136	-1.118318	Q2
2019-06-30	1.618982	1.541605	-0.251879	-0.842436	Q2
2019-07-31	0.184519	0.937082	0.731000	1.361556	Q3
2019-08-31	-0.326238	0.055676	0.222400	-1.443217	Q3
2019-09-30	-0.756352	0.816454	0.750445	-0.455947	Q3

Group by quarter

In [65]:

Get count for each group

In [66]:

Out[66]:

Quarter
Q1    3
Q2    3
Q3    3
dtype: int64

Get mean of each group

In [67]:

Out[67]:

	No1	No2	No3	No4
Quarter
Q1	-0.319314	0.370634	0.305396	-0.295772
Q2	0.168035	1.129395	-0.005765	-0.688388
Q3	-0.299357	0.603071	0.567948	-0.179203

Get maximum of each group

In [68]:

Out[68]:

	No1	No2	No3	No4
Quarter
Q1	0.981321	0.514219	1.153036	0.435163
Q2	1.618982	1.541605	0.672721	-0.104411
Q3	0.184519	0.937082	0.750445	1.361556

Get minimum and maximum of each group

In [69]:

Out[69]:

	No1		No2		No3		No4
	min	max	min	max	min	max	min	max
Quarter
Q1	-1.75	0.98	0.26	0.51	-0.46	1.15	-1.07	0.44
Q2	-0.58	1.62	0.82	1.54	-0.44	0.67	-1.12	-0.10
Q3	-0.76	0.18	0.06	0.94	0.22	0.75	-1.44	1.36

Now we add another column to group

In [70]:

df['Odd_Even'] = ['Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even',
                  'Odd', 'Even', 'Odd']

Group by Quarter followed by Odd_Even

In [71]:

In [72]:

Out[72]:

Quarter  Odd_Even
Q1       Even        1
         Odd         2
Q2       Even        2
         Odd         1
Q3       Even        1
         Odd         2
dtype: int64

In [73]:

groups[['No1', 'No4']].aggregate([sum, np.mean])

Out[73]:

		No1		No4
		sum	mean	sum	mean
Quarter	Odd_Even
Q1	Even	0.981321	0.981321	-1.070043	-1.070043
Q1	Odd	-1.939261	-0.969631	0.182727	0.091364
Q2	Even	1.035387	0.517693	-0.946847	-0.473423
Q2	Odd	-0.531280	-0.531280	-1.118318	-1.118318
Q3	Even	-0.326238	-0.326238	-1.443217	-1.443217
Q3	Odd	-0.571834	-0.285917	0.905609	0.452805

Complex Selection¶

In [74]:

data = np.random.standard_normal((10, 2))  

In [75]:

df = pd.DataFrame(data, columns=['x', 'y'])  

In [76]:

df.info()  

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 2 columns):
x    10 non-null float64
y    10 non-null float64
dtypes: float64(2)
memory usage: 240.0 bytes

In [77]:

df.head()  

Out[77]:

	x	y
0	1.189622	-1.690617
1	-1.356399	-1.232435
2	-0.544439	-0.668172
3	0.007315	-0.612939
4	1.299748	-1.733096

In [78]:

df.tail()  

Out[78]:

	x	y
5	-0.983310	0.357508
6	-1.613579	1.470714
7	-1.188018	-0.549746
8	-0.940046	-0.827932
9	0.108863	0.507810

Filter rows where x>0 and y < 0

In [84]:

Out[84]:

	x	y
0	1.189622	-1.690617
3	0.007315	-0.612939
4	1.299748	-1.733096

Concatenation, Joining and Merging¶

Create our first dataframe

In [89]:

df1 = pd.DataFrame(['100', '200', '300', '400'], 
                    index=['a', 'b', 'c', 'd'],
                    columns=['A',])

In [90]:

df1

Out[90]:

	A
a	100
b	200
c	300
d	400

Create a second dataframe

In [91]:

df2 = pd.DataFrame(['200', '150', '50'], 
                    index=['f', 'b', 'd'],
                    columns=['B',])

In [92]:

df2

Out[92]:

	B
f	200
b	150
d	50

Concatenation¶

Concatenate the columns and indices of both dataframes

In [93]:

df1.append(df2, sort=False)  

Out[93]:

	A	B
a	100	NaN
b	200	NaN
c	300	NaN
d	400	NaN
f	NaN	200
b	NaN	150
d	NaN	50

In [94]:

df1.append(df2, ignore_index=True, sort=False)  

Out[94]:

	A	B
0	100	NaN
1	200	NaN
2	300	NaN
3	400	NaN
4	NaN	200
5	NaN	150
6	NaN	50

Can also achieve the same with concat method

In [95]:

pd.concat((df1, df2), sort=False)  

Out[95]:

	A	B
a	100	NaN
b	200	NaN
c	300	NaN
d	400	NaN
f	NaN	200
b	NaN	150
d	NaN	50

In [96]:

pd.concat((df1, df2), ignore_index=True, sort=False)  

Out[96]:

	A	B
0	100	NaN
1	200	NaN
2	300	NaN
3	400	NaN
4	NaN	200
5	NaN	150
6	NaN	50

Joining¶

Join performs left join by default

In [97]:

df1.join(df2)  

Out[97]:

	A	B
a	100	NaN
b	200	150
c	300	NaN
d	400	50

In [98]:

df2.join(df1)  

Out[98]:

	B	A
f	200	NaN
b	150	200
d	50	400

In [99]:

df1.join(df2, how='left')  

Out[99]:

	A	B
a	100	NaN
b	200	150
c	300	NaN
d	400	50

In [100]:

df1.join(df2, how='right')  

Out[100]:

	A	B
f	NaN	200
b	200	150
d	400	50

In [101]:

df1.join(df2, how='inner')  

Out[101]:

	A	B
b	200	150
d	400	50

In [102]:

df1.join(df2, how='outer')  

Out[102]:

	A	B
a	100	NaN
b	200	150
c	300	NaN
d	400	50
f	NaN	200

In [103]:

df = pd.DataFrame()

In [104]:

df['A'] = df1['A']  

In [105]:

df

Out[105]:

	A
a	100
b	200
c	300
d	400

In [106]:

df['B'] = df2  

In [107]:

df

Out[107]:

	A	B
a	100	NaN
b	200	150
c	300	NaN
d	400	50

In [108]:

df = pd.DataFrame({'A': df1['A'], 'B': df2['B']})  

In [109]:

df

Out[109]:

	A	B
a	100	NaN
b	200	150
c	300	NaN
d	400	50
f	NaN	200

Merging¶

In [110]:

c = pd.Series([250, 150, 50], index=['b', 'd', 'c'])
df1['C'] = c
df2['C'] = c

In [111]:

df1

Out[111]:

	A	C
a	100	NaN
b	200	250.0
c	300	50.0
d	400	150.0

In [112]:

df2

Out[112]:

	B	C
f	200	NaN
b	150	250.0
d	50	150.0

In [113]:

pd.merge(df1, df2)  

Out[113]:

	A	C	B
0	100	NaN	200
1	200	250.0	150
2	400	150.0	50

In [114]:

pd.merge(df1, df2, on='C')  

Out[114]:

	A	C	B
0	100	NaN	200
1	200	250.0	150
2	400	150.0	50

In [115]:

pd.merge(df1, df2, how='outer')  

Out[115]:

	A	C	B
0	100	NaN	200
1	200	250.0	150
2	300	50.0	NaN
3	400	150.0	50

In [116]:

pd.merge(df1, df2, left_on='A', right_on='B')

Out[116]:

	A	C_x	B	C_y
0	200	250.0	200	NaN

In [117]:

pd.merge(df1, df2, left_on='A', right_on='B', how='outer')  

Out[117]:

	A	C_x	B	C_y
0	100	NaN	NaN	NaN
1	200	250.0	200	NaN
2	300	50.0	NaN	NaN
3	400	150.0	NaN	NaN
4	NaN	NaN	150	250.0
5	NaN	NaN	50	150.0

In [118]:

pd.merge(df1, df2, left_index=True, right_index=True)

Out[118]:

	A	C_x	B	C_y
b	200	250.0	150	250.0
d	400	150.0	50	150.0

In [119]:

pd.merge(df1, df2, on='C', left_index=True)

Out[119]:

	A	C	B
f	100	NaN	200
b	200	250.0	150
d	400	150.0	50

In [120]:

pd.merge(df1, df2, on='C', right_index=True)

Out[120]:

	A	C	B
a	100	NaN	200
b	200	250.0	150
d	400	150.0	50

In [121]:

pd.merge(df1, df2, on='C', left_index=True, right_index=True)

Out[121]:

	A	C	B
b	200	250.0	150
d	400	150.0	50