This short notebook will introduce a method to calculate rolling correlations of time series data using R. The intent is not to create a rolling pairwise correlation, but rather the average of each pairwise correlation of a group of time series. Data is a group of stocks prices forming two portfolios.
This is a companion to the Python implementation of the same functionality.
Data will come from a csv saved in my github repository.
# Required packages
install.packages('dplyr')
install.packages('ramify')
install.packages('tibble')
install.packages('tidyr')
install.packages('purrr')
install.packages('slider')
library('dplyr', verbose = FALSE, warn.conflicts = FALSE)
library('ramify', verbose = FALSE, warn.conflicts = FALSE)
library('tibble', verbose = FALSE, warn.conflicts = FALSE)
library('tidyr', verbose = FALSE, warn.conflicts = FALSE)
library('purrr', verbose = FALSE, warn.conflicts = FALSE)
library('slider', verbose = FALSE, warn.conflicts = FALSE)
Installing package into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified) Installing package into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified) Installing package into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified) Installing package into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified) Installing package into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified) Installing package into ‘/usr/local/lib/R/site-library’ (as ‘lib’ is unspecified)
A time series of 12 stocks, 6 per sector.
csv <- 'https://github.com/Brent-Morrison/Misc_scripts/raw/master/daily_price_ts_vw_20201018.csv'
daily_price_ts_vw_20201018 <- read.csv(csv)
tail(daily_price_ts_vw_20201018)
symbol | sector | date_stamp | close | adjusted_close | volume | sp500 | |
---|---|---|---|---|---|---|---|
<chr> | <int> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | |
1751 | AMD | 2 | 2020-07-28 | 67.61 | 67.61 | 94181374 | 3218.44 |
1752 | AMD | 2 | 2020-07-29 | 76.09 | 76.09 | 132969679 | 3258.44 |
1753 | AMD | 2 | 2020-07-30 | 78.20 | 78.20 | 80286888 | 3246.22 |
1754 | AMD | 2 | 2020-07-31 | 77.43 | 77.43 | 71699667 | 3271.12 |
1755 | AMD | 2 | 2020-08-03 | 77.67 | 77.67 | 42628817 | 3294.61 |
1756 | AMD | 2 | 2020-08-04 | 85.04 | 85.04 | 155676106 | 3306.51 |
head(daily_price_ts_vw_20201018)
symbol | sector | date_stamp | close | adjusted_close | volume | sp500 | |
---|---|---|---|---|---|---|---|
<chr> | <int> | <chr> | <dbl> | <dbl> | <dbl> | <dbl> | |
1 | AAL | 1 | 2020-01-02 | 29.09 | 28.9880 | 6275633 | 3257.85 |
2 | AAL | 1 | 2020-01-03 | 27.65 | 27.5531 | 14020066 | 3234.85 |
3 | AAL | 1 | 2020-01-06 | 27.32 | 27.2242 | 6108646 | 3246.28 |
4 | AAL | 1 | 2020-01-07 | 27.22 | 27.1246 | 6197079 | 3237.18 |
5 | AAL | 1 | 2020-01-08 | 27.84 | 27.7424 | 10497296 | 3253.05 |
6 | AAL | 1 | 2020-01-09 | 27.95 | 27.8520 | 6901065 | 3274.70 |
Construct a small matrix to serve as dummy data for development.
mtrx <- mat('1,2,3,4; 2,1,5,6; 3,5,1,7; 4,6,7,1')
# View as a tibble for nice formatting
as_tibble(mtrx)
Warning message: “The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0. Using compatibility `.name_repair`. This warning is displayed once every 8 hours. Call `lifecycle::last_warnings()` to see where this warning was generated.”
V1 | V2 | V3 | V4 |
---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> |
1 | 2 | 3 | 4 |
2 | 1 | 5 | 6 |
3 | 5 | 1 | 7 |
4 | 6 | 7 | 1 |
Extract the upper triangle of the matrix. These are the values of interest for calculating the average of a correlation matrix.
Note the values on the diagonal and below the diagonal are returned as zero.
mtrx_triu <- triu(mtrx, diag = FALSE)
as_tibble(mtrx_triu)
V1 | V2 | V3 | V4 |
---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> |
0 | 2 | 3 | 4 |
0 | 0 | 5 | 6 |
0 | 0 | 0 | 7 |
0 | 0 | 0 | 0 |
Take the mean omitting the zero values.
mean(mtrx_triu[mtrx_triu != 0], na.rm= TRUE)
Combine to a single call.
mean(triu(mtrx, diag = FALSE)[triu(mtrx, diag = FALSE) != 0], na.rm= TRUE)
Having dug out the ramify package, specifically for its triu
function, I've just found the base function upper.tri
.
This is a lot simpler.
mean(mtrx[upper.tri(mtrx)])
Write as a function. The input, x
is a correlation matrix.
mean_mtrx <- function(x) {
mean(x[upper.tri(x)])
}
mean_mtrx(mtrx)
The code below creates a correlation matrix of the returns of the stocks loaded in the daily_price_ts_vw_20201018
data initially loaded.
daily_price_ts_vw_20201018 %>%
group_by(symbol) %>%
mutate(rtn_log_1d = log(adjusted_close) - lag(log(adjusted_close))) %>%
slice(2:n()) %>% # remove first row for each group, 'rtn_log_1d' will be NA
ungroup() %>%
select(date_stamp, symbol, rtn_log_1d) %>%
pivot_wider(names_from = symbol, values_from = rtn_log_1d) %>%
select(-date_stamp) %>%
cor()
AAL | AAN | AAPL | AAWW | ABM | ACCO | ACM | ADBE | ADI | ADT | AKAM | AMD | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
AAL | 1.0000000 | 0.5342663 | 0.3257310 | 0.30581378 | 0.4705691 | 0.5045357 | 0.5098054 | 0.2195771 | 0.3736394 | NA | 0.10189296 | 0.2363810 |
AAN | 0.5342663 | 1.0000000 | 0.5115475 | 0.20691703 | 0.4901531 | 0.5652053 | 0.6067528 | 0.4285318 | 0.5752995 | NA | 0.18240040 | 0.4104081 |
AAPL | 0.3257310 | 0.5115475 | 1.0000000 | 0.33300143 | 0.5812976 | 0.3612282 | 0.6011034 | 0.8251769 | 0.7234479 | NA | 0.58102879 | 0.6827976 |
AAWW | 0.3058138 | 0.2069170 | 0.3330014 | 1.00000000 | 0.4195440 | 0.3527609 | 0.5151434 | 0.4556682 | 0.5315856 | NA | 0.05984882 | 0.3314828 |
ABM | 0.4705691 | 0.4901531 | 0.5812976 | 0.41954401 | 1.0000000 | 0.5012358 | 0.6307948 | 0.5497051 | 0.5602179 | NA | 0.33346050 | 0.4111256 |
ACCO | 0.5045357 | 0.5652053 | 0.3612282 | 0.35276092 | 0.5012358 | 1.0000000 | 0.6885873 | 0.2760395 | 0.4853297 | NA | 0.10028232 | 0.2727514 |
ACM | 0.5098054 | 0.6067528 | 0.6011034 | 0.51514345 | 0.6307948 | 0.6885873 | 1.0000000 | 0.6153312 | 0.7075872 | NA | 0.24644228 | 0.4947761 |
ADBE | 0.2195771 | 0.4285318 | 0.8251769 | 0.45566822 | 0.5497051 | 0.2760395 | 0.6153312 | 1.0000000 | 0.7720009 | NA | 0.54210911 | 0.6958997 |
ADI | 0.3736394 | 0.5752995 | 0.7234479 | 0.53158558 | 0.5602179 | 0.4853297 | 0.7075872 | 0.7720009 | 1.0000000 | NA | 0.33402275 | 0.6333702 |
ADT | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | NA | NA |
AKAM | 0.1018930 | 0.1824004 | 0.5810288 | 0.05984882 | 0.3334605 | 0.1002823 | 0.2464423 | 0.5421091 | 0.3340228 | NA | 1.00000000 | 0.4926676 |
AMD | 0.2363810 | 0.4104081 | 0.6827976 | 0.33148280 | 0.4111256 | 0.2727514 | 0.4947761 | 0.6958997 | 0.6333702 | NA | 0.49266762 | 1.0000000 |
The default setting to the cor
function does not handle missing observations or NA's.
Inspection of the underlying data however reveals that the stock ADT does not have observation for the period 2020-06-19 to 2020-08-04. We need to be able to handle this situation.
Setting the use
parameter to pairwise.complete.obs
will calculate the correlation between each pair of variables using all complete pairs of observations of those variables.
Lets try that.
daily_price_ts_vw_20201018 %>%
group_by(symbol) %>%
mutate(rtn_log_1d = log(adjusted_close) - lag(log(adjusted_close))) %>%
slice(2:n()) %>% # remove first row for each group, 'rtn_log_1d' will be NA
ungroup() %>%
select(date_stamp, symbol, rtn_log_1d) %>%
pivot_wider(names_from = symbol, values_from = rtn_log_1d) %>%
select(-date_stamp) %>%
cor(use = 'pairwise.complete.obs')
AAL | AAN | AAPL | AAWW | ABM | ACCO | ACM | ADBE | ADI | ADT | AKAM | AMD | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
AAL | 1.0000000 | 0.5342663 | 0.3257310 | 0.30581378 | 0.4705691 | 0.5045357 | 0.5098054 | 0.2195771 | 0.3736394 | 0.4282776 | 0.10189296 | 0.2363810 |
AAN | 0.5342663 | 1.0000000 | 0.5115475 | 0.20691703 | 0.4901531 | 0.5652053 | 0.6067528 | 0.4285318 | 0.5752995 | 0.5542218 | 0.18240040 | 0.4104081 |
AAPL | 0.3257310 | 0.5115475 | 1.0000000 | 0.33300143 | 0.5812976 | 0.3612282 | 0.6011034 | 0.8251769 | 0.7234479 | 0.4782523 | 0.58102879 | 0.6827976 |
AAWW | 0.3058138 | 0.2069170 | 0.3330014 | 1.00000000 | 0.4195440 | 0.3527609 | 0.5151434 | 0.4556682 | 0.5315856 | 0.3686245 | 0.05984882 | 0.3314828 |
ABM | 0.4705691 | 0.4901531 | 0.5812976 | 0.41954401 | 1.0000000 | 0.5012358 | 0.6307948 | 0.5497051 | 0.5602179 | 0.3873883 | 0.33346050 | 0.4111256 |
ACCO | 0.5045357 | 0.5652053 | 0.3612282 | 0.35276092 | 0.5012358 | 1.0000000 | 0.6885873 | 0.2760395 | 0.4853297 | 0.5251819 | 0.10028232 | 0.2727514 |
ACM | 0.5098054 | 0.6067528 | 0.6011034 | 0.51514345 | 0.6307948 | 0.6885873 | 1.0000000 | 0.6153312 | 0.7075872 | 0.5887212 | 0.24644228 | 0.4947761 |
ADBE | 0.2195771 | 0.4285318 | 0.8251769 | 0.45566822 | 0.5497051 | 0.2760395 | 0.6153312 | 1.0000000 | 0.7720009 | 0.4323994 | 0.54210911 | 0.6958997 |
ADI | 0.3736394 | 0.5752995 | 0.7234479 | 0.53158558 | 0.5602179 | 0.4853297 | 0.7075872 | 0.7720009 | 1.0000000 | 0.5440289 | 0.33402275 | 0.6333702 |
ADT | 0.4282776 | 0.5542218 | 0.4782523 | 0.36862451 | 0.3873883 | 0.5251819 | 0.5887212 | 0.4323994 | 0.5440289 | 1.0000000 | 0.20803371 | 0.4137642 |
AKAM | 0.1018930 | 0.1824004 | 0.5810288 | 0.05984882 | 0.3334605 | 0.1002823 | 0.2464423 | 0.5421091 | 0.3340228 | 0.2080337 | 1.00000000 | 0.4926676 |
AMD | 0.2363810 | 0.4104081 | 0.6827976 | 0.33148280 | 0.4111256 | 0.2727514 | 0.4947761 | 0.6958997 | 0.6333702 | 0.4137642 | 0.49266762 | 1.0000000 |
This looks like a good solution, the NA's returned in the initial correlation matrix have now been populated.
It should be noted the correlation for pairs against ADT may be spurious when applied on a rolling basis. In the extreme, the current code may return a correlation based on only two observations.
Some other form of error checking needs to be implemented. This is probably best implemented using a filter in the dplyr chain that reshapes the data for ingestion to the cor
function.
We will now pipe this into the custom mean_mtrx
function.
daily_price_ts_vw_20201018 %>%
group_by(symbol) %>%
mutate(rtn_log_1d = log(adjusted_close) - lag(log(adjusted_close))) %>%
slice(2:n()) %>% # remove first row for each group, 'rtn_log_1d' will be NA
ungroup() %>%
select(date_stamp, symbol, rtn_log_1d) %>%
pivot_wider(names_from = symbol, values_from = rtn_log_1d) %>%
select(-date_stamp) %>%
cor(use = 'pairwise.complete.obs') %>%
mean_mtrx()
Move the pivot_wider
, cor
and mean_mtrx
components into a function.
This function accepts a dataframe.
ipc <- function(df) {
max_date = max(df$date_stamp)
ipc = df %>%
select(date_stamp, symbol, rtn_log_1d) %>%
pivot_wider(names_from = symbol, values_from = rtn_log_1d) %>%
select(-date_stamp) %>%
cor(use = 'pairwise.complete.obs') %>%
mean_mtrx()
return(tibble(date_stamp = max_date, ipc = ipc))
}
Create a data frame with the rtn_log_1d
column...
daily_price_rtn <- daily_price_ts_vw_20201018 %>%
mutate(
date_stamp = as.Date(date_stamp),
rtn_log_1d = log(adjusted_close) - lag(log(adjusted_close))
) %>%
slice(2:n()) %>%
arrange(date_stamp)
head(daily_price_rtn)
symbol | sector | date_stamp | close | adjusted_close | volume | sp500 | rtn_log_1d | |
---|---|---|---|---|---|---|---|---|
<chr> | <int> | <date> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | |
1 | AAN | 1 | 2020-01-02 | 57.67 | 57.4919 | 658505 | 3257.85 | 1.611909 |
2 | AAPL | 2 | 2020-01-02 | 300.35 | 298.8389 | 33911864 | 3257.85 | 1.744631 |
3 | AAWW | 1 | 2020-01-02 | 27.73 | 27.7300 | 226199 | 3257.85 | -2.761210 |
4 | ABM | 1 | 2020-01-02 | 38.42 | 37.9229 | 358600 | 3257.85 | -0.384784 |
5 | ACCO | 1 | 2020-01-02 | 9.11 | 8.8936 | 505243 | 3257.85 | -1.397909 |
6 | ACM | 1 | 2020-01-02 | 42.98 | 42.9800 | 855310 | 3257.85 | 1.864615 |
... and apply the ipc
function.
ipc(daily_price_rtn)
date_stamp | ipc |
---|---|
<date> | <dbl> |
2020-08-04 | 0.03777869 |
Lets see if we can apply by group.
daily_price_rtn %>%
filter(date_stamp >= '2020-02-11' & date_stamp <= '2020-07-31') %>%
split(.$sector) %>%
map_dfr(., ipc, .id = 'sector')
sector | date_stamp | ipc |
---|---|---|
<chr> | <date> | <dbl> |
1 | 2020-07-31 | 0.4958920 |
2 | 2020-07-31 | 0.5700418 |
daily_price_rtn %>%
filter(date_stamp >= '2020-02-11' & date_stamp <= '2020-07-31') %>%
group_by(sector) %>%
group_modify(~ ipc(.x))
sector | date_stamp | ipc |
---|---|---|
<int> | <date> | <dbl> |
1 | 2020-07-31 | 0.4958920 |
2 | 2020-07-31 | 0.5700418 |
Validate the results for sector 2.
daily_price_rtn %>%
filter(sector == 2) %>%
filter(date_stamp >= '2020-02-11' & date_stamp <= '2020-07-31') %>%
select(date_stamp, symbol, rtn_log_1d) %>%
pivot_wider(names_from = symbol, values_from = rtn_log_1d) %>%
select(-date_stamp) %>%
cor(use = 'pairwise.complete.obs') %>%
mean_mtrx()
All good so far.
ipc_by_grp <- function(df) {
df %>%
split(.$sector) %>%
map_dfr(., ipc, .id = 'sector')
}
daily_price_rtn %>%
filter(date_stamp >= '2020-02-11' & date_stamp <= '2020-07-31') %>%
ipc_by_grp()
sector | date_stamp | ipc |
---|---|---|
<chr> | <date> | <dbl> |
1 | 2020-07-31 | 0.4958920 |
2 | 2020-07-31 | 0.5700418 |
Finally we slide it.
slide_period_dfr(
.x = daily_price_rtn,
.i = daily_price_rtn$date_stamp,
.period = "month",
.f = ipc_by_grp,
.before = 5,
.complete = TRUE
)
sector | date_stamp | ipc |
---|---|---|
<chr> | <date> | <dbl> |
1 | 2020-06-30 | 0.09560748 |
2 | 2020-06-30 | -0.14320115 |
1 | 2020-07-31 | 0.49081102 |
2 | 2020-07-31 | 0.56886202 |
1 | 2020-08-04 | 0.51351245 |
2 | 2020-08-04 | 0.56957118 |
The results for July 2020 differ very slightly - 0.57 versus 0.568 for sector 2. This likely due to the slide_period_dfr
function being applied by month, as opposed to the specific day range for the ipc_by_grp
function.
We can check this by changing the day range on the earlier function.
daily_price_rtn %>%
filter(date_stamp >= '2020-02-01' & date_stamp <= '2020-07-31') %>%
ipc_by_grp()
sector | date_stamp | ipc |
---|---|---|
<chr> | <date> | <dbl> |
1 | 2020-07-31 | 0.490811 |
2 | 2020-07-31 | 0.568862 |
In agreement. That's good enough for me.