Mastering Data-Driven Finance
© Dr. Yves J. Hilpisch | The Python Quants GmbH
import pandas as pd
Create a dataframe by passing data, column names, and row names as arguments
df = pd.DataFrame([10, 20, 30, 40],
columns=['numbers'],
index=['a', 'b', 'c', 'd'])
df
numbers | |
---|---|
a | 10 |
b | 20 |
c | 30 |
d | 40 |
Access the row names of a dataframe:
Index(['a', 'b', 'c', 'd'], dtype='object')
Access the column names
Index(['numbers'], dtype='object')
Access a row by its name
numbers 30 Name: c, dtype: int64
Or access multiple rows
numbers | |
---|---|
a | 10 |
d | 40 |
Access rows by index number rather than its name
numbers | |
---|---|
b | 20 |
c | 30 |
Take sums along each column
numbers 100 dtype: int64
Apply a specific operation to each column
numbers | |
---|---|
a | 100 |
b | 400 |
c | 900 |
d | 1600 |
Another way of doing this operation
numbers | |
---|---|
a | 100 |
b | 400 |
c | 900 |
d | 1600 |
Generate a new column called floats
numbers | floats | |
---|---|---|
a | 10 | 1.5 |
b | 20 | 2.5 |
c | 30 | 3.5 |
d | 40 | 4.5 |
Access the floats column
a 1.5 b 2.5 c 3.5 d 4.5 Name: floats, dtype: float64
Initialize a new column with names Yves, Sandra, Lilli, Henry
at indices d, a, b, c
floats | names | numbers | |
---|---|---|---|
a | 1.50 | Sandra | 10 |
b | 2.50 | Lilli | 20 |
c | 3.50 | Henry | 30 |
d | 4.50 | Yves | 40 |
y | 5.75 | NaN | 100 |
Add a new row with entries [100, 5.75, 'Jil']
while ignoring the index
floats | names | numbers | |
---|---|---|---|
0 | 1.50 | Sandra | 10 |
1 | 2.50 | Lilli | 20 |
2 | 3.50 | Henry | 30 |
3 | 4.50 | Yves | 40 |
4 | 5.75 | NaN | 100 |
5 | 5.75 | Jil | 100 |
Add a new row with entries [100, 5.75, 'Jil']
at index y
/Users/israeldiego/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py:6201: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version of pandas will change to not sort by default. To accept the future behavior, pass 'sort=True'. To retain the current behavior and silence the warning, pass sort=False sort=sort)
floats | names | numbers | |
---|---|---|---|
a | 1.50 | Sandra | 10 |
b | 2.50 | Lilli | 20 |
c | 3.50 | Henry | 30 |
d | 4.50 | Yves | 40 |
y | 5.75 | NaN | 100 |
y | 5.75 | Jil | 100 |
Add a new row with name Liz
at index z
Check the data types for each column
numbers float64 floats float64 names object dtype: object
Get the Averages for columns: numbers
and floats
numbers 40.00 floats 3.55 dtype: float64
Get the standard deviation for columns: numbers
and floats
numbers 35.355339 floats 1.662077 dtype: float64
import numpy as np
np.random.seed(100)
Generate 9x4
array of random numbers from $N(0,1)$ distribution
array([[-1.74976547, 0.3426804 , 1.1530358 , -0.25243604], [ 0.98132079, 0.51421884, 0.22117967, -1.07004333], [-0.18949583, 0.25500144, -0.45802699, 0.43516349], [-0.58359505, 0.81684707, 0.67272081, -0.10441114], [-0.53128038, 1.02973269, -0.43813562, -1.11831825], [ 1.61898166, 1.54160517, -0.25187914, -0.84243574], [ 0.18451869, 0.9370822 , 0.73100034, 1.36155613], [-0.32623806, 0.05567601, 0.22239961, -1.443217 ], [-0.75635231, 0.81645401, 0.75044476, -0.45594693]])
Convert the previous numpy array to a dataframe
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | -1.749765 | 0.342680 | 1.153036 | -0.252436 |
1 | 0.981321 | 0.514219 | 0.221180 | -1.070043 |
2 | -0.189496 | 0.255001 | -0.458027 | 0.435163 |
3 | -0.583595 | 0.816847 | 0.672721 | -0.104411 |
4 | -0.531280 | 1.029733 | -0.438136 | -1.118318 |
5 | 1.618982 | 1.541605 | -0.251879 | -0.842436 |
6 | 0.184519 | 0.937082 | 0.731000 | 1.361556 |
7 | -0.326238 | 0.055676 | 0.222400 | -1.443217 |
8 | -0.756352 | 0.816454 | 0.750445 | -0.455947 |
Change column names to ['No1', 'No2', 'No3', 'No4']
No1 | No2 | No3 | No4 | |
---|---|---|---|---|
0 | -1.749765 | 0.342680 | 1.153036 | -0.252436 |
1 | 0.981321 | 0.514219 | 0.221180 | -1.070043 |
2 | -0.189496 | 0.255001 | -0.458027 | 0.435163 |
3 | -0.583595 | 0.816847 | 0.672721 | -0.104411 |
4 | -0.531280 | 1.029733 | -0.438136 | -1.118318 |
5 | 1.618982 | 1.541605 | -0.251879 | -0.842436 |
6 | 0.184519 | 0.937082 | 0.731000 | 1.361556 |
7 | -0.326238 | 0.055676 | 0.222400 | -1.443217 |
8 | -0.756352 | 0.816454 | 0.750445 | -0.455947 |
Get the mean of column No2
0.7010330941456459
Generate dates for end of day-month for first 9 months of 2019
DatetimeIndex(['2019-01-31', '2019-02-28', '2019-03-31', '2019-04-30', '2019-05-31', '2019-06-30', '2019-07-31', '2019-08-31', '2019-09-30'], dtype='datetime64[ns]', freq='M')
Make these dates the new row indices of our dataframe
No1 | No2 | No3 | No4 | |
---|---|---|---|---|
2019-01-31 | -1.749765 | 0.342680 | 1.153036 | -0.252436 |
2019-02-28 | 0.981321 | 0.514219 | 0.221180 | -1.070043 |
2019-03-31 | -0.189496 | 0.255001 | -0.458027 | 0.435163 |
2019-04-30 | -0.583595 | 0.816847 | 0.672721 | -0.104411 |
2019-05-31 | -0.531280 | 1.029733 | -0.438136 | -1.118318 |
2019-06-30 | 1.618982 | 1.541605 | -0.251879 | -0.842436 |
2019-07-31 | 0.184519 | 0.937082 | 0.731000 | 1.361556 |
2019-08-31 | -0.326238 | 0.055676 | 0.222400 | -1.443217 |
2019-09-30 | -0.756352 | 0.816454 | 0.750445 | -0.455947 |
Extract only the data from our dataframe
array([[-1.74976547, 0.3426804 , 1.1530358 , -0.25243604], [ 0.98132079, 0.51421884, 0.22117967, -1.07004333], [-0.18949583, 0.25500144, -0.45802699, 0.43516349], [-0.58359505, 0.81684707, 0.67272081, -0.10441114], [-0.53128038, 1.02973269, -0.43813562, -1.11831825], [ 1.61898166, 1.54160517, -0.25187914, -0.84243574], [ 0.18451869, 0.9370822 , 0.73100034, 1.36155613], [-0.32623806, 0.05567601, 0.22239961, -1.443217 ], [-0.75635231, 0.81645401, 0.75044476, -0.45594693]])
Extract the data using numpy command
array([[-1.74976547, 0.3426804 , 1.1530358 , -0.25243604], [ 0.98132079, 0.51421884, 0.22117967, -1.07004333], [-0.18949583, 0.25500144, -0.45802699, 0.43516349], [-0.58359505, 0.81684707, 0.67272081, -0.10441114], [-0.53128038, 1.02973269, -0.43813562, -1.11831825], [ 1.61898166, 1.54160517, -0.25187914, -0.84243574], [ 0.18451869, 0.9370822 , 0.73100034, 1.36155613], [-0.32623806, 0.05567601, 0.22239961, -1.443217 ], [-0.75635231, 0.81645401, 0.75044476, -0.45594693]])
df.info()
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 9 entries, 2019-01-31 to 2019-09-30 Freq: M Data columns (total 4 columns): No1 9 non-null float64 No2 9 non-null float64 No3 9 non-null float64 No4 9 non-null float64 dtypes: float64(4) memory usage: 360.0 bytes
Get descriptive statistics for data frame
No1 | No2 | No3 | No4 | |
---|---|---|---|---|
count | 9.000000 | 9.000000 | 9.000000 | 9.000000 |
mean | -0.150212 | 0.701033 | 0.289193 | -0.387788 |
std | 0.988306 | 0.457685 | 0.579920 | 0.877532 |
min | -1.749765 | 0.055676 | -0.458027 | -1.443217 |
25% | -0.583595 | 0.342680 | -0.251879 | -1.070043 |
50% | -0.326238 | 0.816454 | 0.222400 | -0.455947 |
75% | 0.184519 | 0.937082 | 0.731000 | -0.104411 |
max | 1.618982 | 1.541605 | 1.153036 | 1.361556 |
Get column sums
Get column means
No1 -0.150212 No2 0.701033 No3 0.289193 No4 -0.387788 dtype: float64
Get column means by specifying axis
No1 -0.150212 No2 0.701033 No3 0.289193 No4 -0.387788 dtype: float64
Get row means by specifying axis
2019-01-31 -0.126621 2019-02-28 0.161669 2019-03-31 0.010661 2019-04-30 0.200390 2019-05-31 -0.264500 2019-06-30 0.516568 2019-07-31 0.803539 2019-08-31 -0.372845 2019-09-30 0.088650 Freq: M, dtype: float64
Get column cumulative sum
No1 | No2 | No3 | No4 | |
---|---|---|---|---|
2019-01-31 | -1.749765 | 0.342680 | 1.153036 | -0.252436 |
2019-02-28 | -0.768445 | 0.856899 | 1.374215 | -1.322479 |
2019-03-31 | -0.957941 | 1.111901 | 0.916188 | -0.887316 |
2019-04-30 | -1.541536 | 1.928748 | 1.588909 | -0.991727 |
2019-05-31 | -2.072816 | 2.958480 | 1.150774 | -2.110045 |
2019-06-30 | -0.453834 | 4.500086 | 0.898895 | -2.952481 |
2019-07-31 | -0.269316 | 5.437168 | 1.629895 | -1.590925 |
2019-08-31 | -0.595554 | 5.492844 | 1.852294 | -3.034142 |
2019-09-30 | -1.351906 | 6.309298 | 2.602739 | -3.490089 |
Take log of each entry
# raises warning
/Users/yves/miniconda3/envs/py4fi2nd/lib/python3.6/site-packages/ipykernel_launcher.py:2: RuntimeWarning: invalid value encountered in log
No1 | No2 | No3 | No4 | |
---|---|---|---|---|
2019-01-31 | NaN | -1.070957 | 0.142398 | NaN |
2019-02-28 | -0.018856 | -0.665106 | -1.508780 | NaN |
2019-03-31 | NaN | -1.366486 | NaN | -0.832033 |
2019-04-30 | NaN | -0.202303 | -0.396425 | NaN |
2019-05-31 | NaN | 0.029299 | NaN | NaN |
2019-06-30 | 0.481797 | 0.432824 | NaN | NaN |
2019-07-31 | -1.690005 | -0.064984 | -0.313341 | 0.308628 |
2019-08-31 | NaN | -2.888206 | -1.503279 | NaN |
2019-09-30 | NaN | -0.202785 | -0.287089 | NaN |
Take square root of each entry
No1 | No2 | No3 | No4 | |
---|---|---|---|---|
2019-01-31 | 1.322787 | 0.585389 | 1.073795 | 0.502430 |
2019-02-28 | 0.990616 | 0.717091 | 0.470297 | 1.034429 |
2019-03-31 | 0.435311 | 0.504977 | 0.676777 | 0.659669 |
2019-04-30 | 0.763934 | 0.903796 | 0.820196 | 0.323127 |
2019-05-31 | 0.728890 | 1.014757 | 0.661918 | 1.057506 |
2019-06-30 | 1.272392 | 1.241614 | 0.501876 | 0.917843 |
2019-07-31 | 0.429556 | 0.968030 | 0.854986 | 1.166857 |
2019-08-31 | 0.571173 | 0.235958 | 0.471593 | 1.201340 |
2019-09-30 | 0.869685 | 0.903578 | 0.866282 | 0.675238 |
from pylab import plt, mpl
%matplotlib inline
Plot the cumulative sum of each column from our df
Plot Bar chart for each date
df['Quarter'] = ['Q1', 'Q1', 'Q1', 'Q2', 'Q2',
'Q2', 'Q3', 'Q3', 'Q3']
df
No1 | No2 | No3 | No4 | Quarter | |
---|---|---|---|---|---|
2019-01-31 | -1.749765 | 0.342680 | 1.153036 | -0.252436 | Q1 |
2019-02-28 | 0.981321 | 0.514219 | 0.221180 | -1.070043 | Q1 |
2019-03-31 | -0.189496 | 0.255001 | -0.458027 | 0.435163 | Q1 |
2019-04-30 | -0.583595 | 0.816847 | 0.672721 | -0.104411 | Q2 |
2019-05-31 | -0.531280 | 1.029733 | -0.438136 | -1.118318 | Q2 |
2019-06-30 | 1.618982 | 1.541605 | -0.251879 | -0.842436 | Q2 |
2019-07-31 | 0.184519 | 0.937082 | 0.731000 | 1.361556 | Q3 |
2019-08-31 | -0.326238 | 0.055676 | 0.222400 | -1.443217 | Q3 |
2019-09-30 | -0.756352 | 0.816454 | 0.750445 | -0.455947 | Q3 |
Group by quarter
Get count for each group
Quarter Q1 3 Q2 3 Q3 3 dtype: int64
Get mean of each group
No1 | No2 | No3 | No4 | |
---|---|---|---|---|
Quarter | ||||
Q1 | -0.319314 | 0.370634 | 0.305396 | -0.295772 |
Q2 | 0.168035 | 1.129395 | -0.005765 | -0.688388 |
Q3 | -0.299357 | 0.603071 | 0.567948 | -0.179203 |
Get maximum of each group
No1 | No2 | No3 | No4 | |
---|---|---|---|---|
Quarter | ||||
Q1 | 0.981321 | 0.514219 | 1.153036 | 0.435163 |
Q2 | 1.618982 | 1.541605 | 0.672721 | -0.104411 |
Q3 | 0.184519 | 0.937082 | 0.750445 | 1.361556 |
Get minimum and maximum of each group
No1 | No2 | No3 | No4 | |||||
---|---|---|---|---|---|---|---|---|
min | max | min | max | min | max | min | max | |
Quarter | ||||||||
Q1 | -1.75 | 0.98 | 0.26 | 0.51 | -0.46 | 1.15 | -1.07 | 0.44 |
Q2 | -0.58 | 1.62 | 0.82 | 1.54 | -0.44 | 0.67 | -1.12 | -0.10 |
Q3 | -0.76 | 0.18 | 0.06 | 0.94 | 0.22 | 0.75 | -1.44 | 1.36 |
Now we add another column to group
df['Odd_Even'] = ['Odd', 'Even', 'Odd', 'Even', 'Odd', 'Even',
'Odd', 'Even', 'Odd']
Group by Quarter followed by Odd_Even
Quarter Odd_Even Q1 Even 1 Odd 2 Q2 Even 2 Odd 1 Q3 Even 1 Odd 2 dtype: int64
groups[['No1', 'No4']].aggregate([sum, np.mean])
No1 | No4 | ||||
---|---|---|---|---|---|
sum | mean | sum | mean | ||
Quarter | Odd_Even | ||||
Q1 | Even | 0.981321 | 0.981321 | -1.070043 | -1.070043 |
Odd | -1.939261 | -0.969631 | 0.182727 | 0.091364 | |
Q2 | Even | 1.035387 | 0.517693 | -0.946847 | -0.473423 |
Odd | -0.531280 | -0.531280 | -1.118318 | -1.118318 | |
Q3 | Even | -0.326238 | -0.326238 | -1.443217 | -1.443217 |
Odd | -0.571834 | -0.285917 | 0.905609 | 0.452805 |
data = np.random.standard_normal((10, 2))
df = pd.DataFrame(data, columns=['x', 'y'])
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10 entries, 0 to 9 Data columns (total 2 columns): x 10 non-null float64 y 10 non-null float64 dtypes: float64(2) memory usage: 240.0 bytes
df.head()
x | y | |
---|---|---|
0 | 1.189622 | -1.690617 |
1 | -1.356399 | -1.232435 |
2 | -0.544439 | -0.668172 |
3 | 0.007315 | -0.612939 |
4 | 1.299748 | -1.733096 |
df.tail()
x | y | |
---|---|---|
5 | -0.983310 | 0.357508 |
6 | -1.613579 | 1.470714 |
7 | -1.188018 | -0.549746 |
8 | -0.940046 | -0.827932 |
9 | 0.108863 | 0.507810 |
Filter rows where x
>0 and y
< 0
x | y | |
---|---|---|
0 | 1.189622 | -1.690617 |
3 | 0.007315 | -0.612939 |
4 | 1.299748 | -1.733096 |
Create our first dataframe
df1 = pd.DataFrame(['100', '200', '300', '400'],
index=['a', 'b', 'c', 'd'],
columns=['A',])
df1
A | |
---|---|
a | 100 |
b | 200 |
c | 300 |
d | 400 |
Create a second dataframe
df2 = pd.DataFrame(['200', '150', '50'],
index=['f', 'b', 'd'],
columns=['B',])
df2
B | |
---|---|
f | 200 |
b | 150 |
d | 50 |
Concatenate the columns and indices of both dataframes
df1.append(df2, sort=False)
A | B | |
---|---|---|
a | 100 | NaN |
b | 200 | NaN |
c | 300 | NaN |
d | 400 | NaN |
f | NaN | 200 |
b | NaN | 150 |
d | NaN | 50 |
df1.append(df2, ignore_index=True, sort=False)
A | B | |
---|---|---|
0 | 100 | NaN |
1 | 200 | NaN |
2 | 300 | NaN |
3 | 400 | NaN |
4 | NaN | 200 |
5 | NaN | 150 |
6 | NaN | 50 |
Can also achieve the same with concat method
pd.concat((df1, df2), sort=False)
A | B | |
---|---|---|
a | 100 | NaN |
b | 200 | NaN |
c | 300 | NaN |
d | 400 | NaN |
f | NaN | 200 |
b | NaN | 150 |
d | NaN | 50 |
pd.concat((df1, df2), ignore_index=True, sort=False)
A | B | |
---|---|---|
0 | 100 | NaN |
1 | 200 | NaN |
2 | 300 | NaN |
3 | 400 | NaN |
4 | NaN | 200 |
5 | NaN | 150 |
6 | NaN | 50 |
Join performs left join by default
df1.join(df2)
A | B | |
---|---|---|
a | 100 | NaN |
b | 200 | 150 |
c | 300 | NaN |
d | 400 | 50 |
df2.join(df1)
B | A | |
---|---|---|
f | 200 | NaN |
b | 150 | 200 |
d | 50 | 400 |
df1.join(df2, how='left')
A | B | |
---|---|---|
a | 100 | NaN |
b | 200 | 150 |
c | 300 | NaN |
d | 400 | 50 |
df1.join(df2, how='right')
A | B | |
---|---|---|
f | NaN | 200 |
b | 200 | 150 |
d | 400 | 50 |
df1.join(df2, how='inner')
A | B | |
---|---|---|
b | 200 | 150 |
d | 400 | 50 |
df1.join(df2, how='outer')
A | B | |
---|---|---|
a | 100 | NaN |
b | 200 | 150 |
c | 300 | NaN |
d | 400 | 50 |
f | NaN | 200 |
df = pd.DataFrame()
df['A'] = df1['A']
df
A | |
---|---|
a | 100 |
b | 200 |
c | 300 |
d | 400 |
df['B'] = df2
df
A | B | |
---|---|---|
a | 100 | NaN |
b | 200 | 150 |
c | 300 | NaN |
d | 400 | 50 |
df = pd.DataFrame({'A': df1['A'], 'B': df2['B']})
df
A | B | |
---|---|---|
a | 100 | NaN |
b | 200 | 150 |
c | 300 | NaN |
d | 400 | 50 |
f | NaN | 200 |
c = pd.Series([250, 150, 50], index=['b', 'd', 'c'])
df1['C'] = c
df2['C'] = c
df1
A | C | |
---|---|---|
a | 100 | NaN |
b | 200 | 250.0 |
c | 300 | 50.0 |
d | 400 | 150.0 |
df2
B | C | |
---|---|---|
f | 200 | NaN |
b | 150 | 250.0 |
d | 50 | 150.0 |
pd.merge(df1, df2)
A | C | B | |
---|---|---|---|
0 | 100 | NaN | 200 |
1 | 200 | 250.0 | 150 |
2 | 400 | 150.0 | 50 |
pd.merge(df1, df2, on='C')
A | C | B | |
---|---|---|---|
0 | 100 | NaN | 200 |
1 | 200 | 250.0 | 150 |
2 | 400 | 150.0 | 50 |
pd.merge(df1, df2, how='outer')
A | C | B | |
---|---|---|---|
0 | 100 | NaN | 200 |
1 | 200 | 250.0 | 150 |
2 | 300 | 50.0 | NaN |
3 | 400 | 150.0 | 50 |
pd.merge(df1, df2, left_on='A', right_on='B')
A | C_x | B | C_y | |
---|---|---|---|---|
0 | 200 | 250.0 | 200 | NaN |
pd.merge(df1, df2, left_on='A', right_on='B', how='outer')
A | C_x | B | C_y | |
---|---|---|---|---|
0 | 100 | NaN | NaN | NaN |
1 | 200 | 250.0 | 200 | NaN |
2 | 300 | 50.0 | NaN | NaN |
3 | 400 | 150.0 | NaN | NaN |
4 | NaN | NaN | 150 | 250.0 |
5 | NaN | NaN | 50 | 150.0 |
pd.merge(df1, df2, left_index=True, right_index=True)
A | C_x | B | C_y | |
---|---|---|---|---|
b | 200 | 250.0 | 150 | 250.0 |
d | 400 | 150.0 | 50 | 150.0 |
pd.merge(df1, df2, on='C', left_index=True)
A | C | B | |
---|---|---|---|
f | 100 | NaN | 200 |
b | 200 | 250.0 | 150 |
d | 400 | 150.0 | 50 |
pd.merge(df1, df2, on='C', right_index=True)
A | C | B | |
---|---|---|---|
a | 100 | NaN | 200 |
b | 200 | 250.0 | 150 |
d | 400 | 150.0 | 50 |
pd.merge(df1, df2, on='C', left_index=True, right_index=True)
A | C | B | |
---|---|---|---|
b | 200 | 250.0 | 150 |
d | 400 | 150.0 | 50 |