Rolling portfolio correlations - R¶

This short notebook will introduce a method to calculate rolling correlations of time series data using R. The intent is not to create a rolling pairwise correlation, but rather the average of each pairwise correlation of a group of time series. Data is a group of stocks prices forming two portfolios.

This is a companion to the Python implementation of the same functionality.

Data will come from a csv saved in my github repository.

In [1]:

# Required packages
install.packages('dplyr')
install.packages('ramify')
install.packages('tibble')
install.packages('tidyr')
install.packages('purrr')
install.packages('slider')

library('dplyr', verbose = FALSE, warn.conflicts = FALSE)
library('ramify', verbose = FALSE, warn.conflicts = FALSE)
library('tibble', verbose = FALSE, warn.conflicts = FALSE)
library('tidyr', verbose = FALSE, warn.conflicts = FALSE)
library('purrr', verbose = FALSE, warn.conflicts = FALSE)
library('slider', verbose = FALSE, warn.conflicts = FALSE)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

The data¶

A time series of 12 stocks, 6 per sector.

In [2]:

csv <- 'https://github.com/Brent-Morrison/Misc_scripts/raw/master/daily_price_ts_vw_20201018.csv'
daily_price_ts_vw_20201018 <- read.csv(csv)

tail(daily_price_ts_vw_20201018)

A data.frame: 6 × 7
	symbol	sector	date_stamp	close	adjusted_close	volume	sp500
	<chr>	<int>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>
1751	AMD	2	2020-07-28	67.61	67.61	94181374	3218.44
1752	AMD	2	2020-07-29	76.09	76.09	132969679	3258.44
1753	AMD	2	2020-07-30	78.20	78.20	80286888	3246.22
1754	AMD	2	2020-07-31	77.43	77.43	71699667	3271.12
1755	AMD	2	2020-08-03	77.67	77.67	42628817	3294.61
1756	AMD	2	2020-08-04	85.04	85.04	155676106	3306.51

In [3]:

head(daily_price_ts_vw_20201018)

A data.frame: 6 × 7
	symbol	sector	date_stamp	close	adjusted_close	volume	sp500
	<chr>	<int>	<chr>	<dbl>	<dbl>	<dbl>	<dbl>
1	AAL	1	2020-01-02	29.09	28.9880	6275633	3257.85
2	AAL	1	2020-01-03	27.65	27.5531	14020066	3234.85
3	AAL	1	2020-01-06	27.32	27.2242	6108646	3246.28
4	AAL	1	2020-01-07	27.22	27.1246	6197079	3237.18
5	AAL	1	2020-01-08	27.84	27.7424	10497296	3253.05
6	AAL	1	2020-01-09	27.95	27.8520	6901065	3274.70

Averaging a correlation matrix¶

Construct a small matrix to serve as dummy data for development.

In [4]:

mtrx <- mat('1,2,3,4; 2,1,5,6; 3,5,1,7; 4,6,7,1')

# View as a tibble for nice formatting
as_tibble(mtrx)

Warning message:
“The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
Using compatibility `.name_repair`.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.”

A tibble: 4 × 4
V1	V2	V3	V4
<dbl>	<dbl>	<dbl>	<dbl>
1	2	3	4
2	1	5	6
3	5	1	7
4	6	7	1

Extract the upper triangle of the matrix. These are the values of interest for calculating the average of a correlation matrix.

Note the values on the diagonal and below the diagonal are returned as zero.

In [5]:

mtrx_triu <- triu(mtrx, diag = FALSE)
as_tibble(mtrx_triu)

A tibble: 4 × 4
V1	V2	V3	V4
<dbl>	<dbl>	<dbl>	<dbl>
0	2	3	4
0	0	5	6
0	0	0	7
0	0	0	0

Take the mean omitting the zero values.

In [6]:

mean(mtrx_triu[mtrx_triu != 0], na.rm= TRUE)

4.5

Combine to a single call.

In [7]:

mean(triu(mtrx, diag = FALSE)[triu(mtrx, diag = FALSE) != 0], na.rm= TRUE)

4.5

Having dug out the ramify package, specifically for its triu function, I've just found the base function upper.tri.

This is a lot simpler.

In [8]:

mean(mtrx[upper.tri(mtrx)])

4.5

Write as a function. The input, x is a correlation matrix.

In [9]:

mean_mtrx <- function(x) {
   mean(x[upper.tri(x)])
}

In [10]:

mean_mtrx(mtrx)

4.5

Getting back to the stock data.¶

The code below creates a correlation matrix of the returns of the stocks loaded in the daily_price_ts_vw_20201018 data initially loaded.

In [11]:

daily_price_ts_vw_20201018 %>% 
  group_by(symbol) %>%
  mutate(rtn_log_1d = log(adjusted_close) - lag(log(adjusted_close))) %>%
  slice(2:n()) %>%  # remove first row for each group, 'rtn_log_1d' will be NA
  ungroup() %>%
  select(date_stamp, symbol, rtn_log_1d) %>%
  pivot_wider(names_from = symbol, values_from = rtn_log_1d) %>% 
  select(-date_stamp) %>% 
  cor()

A matrix: 12 × 12 of type dbl
	AAL	AAN	AAPL	AAWW	ABM	ACCO	ACM	ADBE	ADI	ADT	AKAM	AMD
AAL	1.0000000	0.5342663	0.3257310	0.30581378	0.4705691	0.5045357	0.5098054	0.2195771	0.3736394	NA	0.10189296	0.2363810
AAN	0.5342663	1.0000000	0.5115475	0.20691703	0.4901531	0.5652053	0.6067528	0.4285318	0.5752995	NA	0.18240040	0.4104081
AAPL	0.3257310	0.5115475	1.0000000	0.33300143	0.5812976	0.3612282	0.6011034	0.8251769	0.7234479	NA	0.58102879	0.6827976
AAWW	0.3058138	0.2069170	0.3330014	1.00000000	0.4195440	0.3527609	0.5151434	0.4556682	0.5315856	NA	0.05984882	0.3314828
ABM	0.4705691	0.4901531	0.5812976	0.41954401	1.0000000	0.5012358	0.6307948	0.5497051	0.5602179	NA	0.33346050	0.4111256
ACCO	0.5045357	0.5652053	0.3612282	0.35276092	0.5012358	1.0000000	0.6885873	0.2760395	0.4853297	NA	0.10028232	0.2727514
ACM	0.5098054	0.6067528	0.6011034	0.51514345	0.6307948	0.6885873	1.0000000	0.6153312	0.7075872	NA	0.24644228	0.4947761
ADBE	0.2195771	0.4285318	0.8251769	0.45566822	0.5497051	0.2760395	0.6153312	1.0000000	0.7720009	NA	0.54210911	0.6958997
ADI	0.3736394	0.5752995	0.7234479	0.53158558	0.5602179	0.4853297	0.7075872	0.7720009	1.0000000	NA	0.33402275	0.6333702
ADT	NA	NA	NA	NA	NA	NA	NA	NA	NA	1	NA	NA
AKAM	0.1018930	0.1824004	0.5810288	0.05984882	0.3334605	0.1002823	0.2464423	0.5421091	0.3340228	NA	1.00000000	0.4926676
AMD	0.2363810	0.4104081	0.6827976	0.33148280	0.4111256	0.2727514	0.4947761	0.6958997	0.6333702	NA	0.49266762	1.0000000

The default setting to the cor function does not handle missing observations or NA's.

Inspection of the underlying data however reveals that the stock ADT does not have observation for the period 2020-06-19 to 2020-08-04. We need to be able to handle this situation.

Setting the use parameter to pairwise.complete.obs will calculate the correlation between each pair of variables using all complete pairs of observations of those variables.

Lets try that.

In [12]:

daily_price_ts_vw_20201018 %>% 
  group_by(symbol) %>%
  mutate(rtn_log_1d = log(adjusted_close) - lag(log(adjusted_close))) %>%
  slice(2:n()) %>%  # remove first row for each group, 'rtn_log_1d' will be NA
  ungroup() %>%
  select(date_stamp, symbol, rtn_log_1d) %>%
  pivot_wider(names_from = symbol, values_from = rtn_log_1d) %>% 
  select(-date_stamp) %>% 
  cor(use = 'pairwise.complete.obs')

A matrix: 12 × 12 of type dbl
	AAL	AAN	AAPL	AAWW	ABM	ACCO	ACM	ADBE	ADI	ADT	AKAM	AMD
AAL	1.0000000	0.5342663	0.3257310	0.30581378	0.4705691	0.5045357	0.5098054	0.2195771	0.3736394	0.4282776	0.10189296	0.2363810
AAN	0.5342663	1.0000000	0.5115475	0.20691703	0.4901531	0.5652053	0.6067528	0.4285318	0.5752995	0.5542218	0.18240040	0.4104081
AAPL	0.3257310	0.5115475	1.0000000	0.33300143	0.5812976	0.3612282	0.6011034	0.8251769	0.7234479	0.4782523	0.58102879	0.6827976
AAWW	0.3058138	0.2069170	0.3330014	1.00000000	0.4195440	0.3527609	0.5151434	0.4556682	0.5315856	0.3686245	0.05984882	0.3314828
ABM	0.4705691	0.4901531	0.5812976	0.41954401	1.0000000	0.5012358	0.6307948	0.5497051	0.5602179	0.3873883	0.33346050	0.4111256
ACCO	0.5045357	0.5652053	0.3612282	0.35276092	0.5012358	1.0000000	0.6885873	0.2760395	0.4853297	0.5251819	0.10028232	0.2727514
ACM	0.5098054	0.6067528	0.6011034	0.51514345	0.6307948	0.6885873	1.0000000	0.6153312	0.7075872	0.5887212	0.24644228	0.4947761
ADBE	0.2195771	0.4285318	0.8251769	0.45566822	0.5497051	0.2760395	0.6153312	1.0000000	0.7720009	0.4323994	0.54210911	0.6958997
ADI	0.3736394	0.5752995	0.7234479	0.53158558	0.5602179	0.4853297	0.7075872	0.7720009	1.0000000	0.5440289	0.33402275	0.6333702
ADT	0.4282776	0.5542218	0.4782523	0.36862451	0.3873883	0.5251819	0.5887212	0.4323994	0.5440289	1.0000000	0.20803371	0.4137642
AKAM	0.1018930	0.1824004	0.5810288	0.05984882	0.3334605	0.1002823	0.2464423	0.5421091	0.3340228	0.2080337	1.00000000	0.4926676
AMD	0.2363810	0.4104081	0.6827976	0.33148280	0.4111256	0.2727514	0.4947761	0.6958997	0.6333702	0.4137642	0.49266762	1.0000000

This looks like a good solution, the NA's returned in the initial correlation matrix have now been populated.

It should be noted the correlation for pairs against ADT may be spurious when applied on a rolling basis. In the extreme, the current code may return a correlation based on only two observations.

Some other form of error checking needs to be implemented. This is probably best implemented using a filter in the dplyr chain that reshapes the data for ingestion to the cor function.

We will now pipe this into the custom mean_mtrx function.

In [13]:

daily_price_ts_vw_20201018 %>% 
  group_by(symbol) %>%
  mutate(rtn_log_1d = log(adjusted_close) - lag(log(adjusted_close))) %>%
  slice(2:n()) %>%  # remove first row for each group, 'rtn_log_1d' will be NA
  ungroup() %>%
  select(date_stamp, symbol, rtn_log_1d) %>%
  pivot_wider(names_from = symbol, values_from = rtn_log_1d) %>% 
  select(-date_stamp) %>% 
  cor(use = 'pairwise.complete.obs') %>%
  mean_mtrx()

0.457381420427434

As a function¶

Move the pivot_wider, cor and mean_mtrx components into a function.

This function accepts a dataframe.

In [14]:

ipc <- function(df) {
  
  max_date = max(df$date_stamp)
  
  ipc = df %>%
    select(date_stamp, symbol, rtn_log_1d) %>%
    pivot_wider(names_from = symbol, values_from = rtn_log_1d) %>% 
    select(-date_stamp) %>% 
    cor(use = 'pairwise.complete.obs') %>%
    mean_mtrx()

  return(tibble(date_stamp = max_date, ipc = ipc))
}

Create a data frame with the rtn_log_1d column...

In [15]:

daily_price_rtn <- daily_price_ts_vw_20201018 %>% 
  mutate(
    date_stamp = as.Date(date_stamp),
    rtn_log_1d = log(adjusted_close) - lag(log(adjusted_close))
    ) %>%
  slice(2:n()) %>%
  arrange(date_stamp)

head(daily_price_rtn)

A data.frame: 6 × 8
	symbol	sector	date_stamp	close	adjusted_close	volume	sp500	rtn_log_1d
	<chr>	<int>	<date>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
1	AAN	1	2020-01-02	57.67	57.4919	658505	3257.85	1.611909
2	AAPL	2	2020-01-02	300.35	298.8389	33911864	3257.85	1.744631
3	AAWW	1	2020-01-02	27.73	27.7300	226199	3257.85	-2.761210
4	ABM	1	2020-01-02	38.42	37.9229	358600	3257.85	-0.384784
5	ACCO	1	2020-01-02	9.11	8.8936	505243	3257.85	-1.397909
6	ACM	1	2020-01-02	42.98	42.9800	855310	3257.85	1.864615

... and apply the ipc function.

In [16]:

ipc(daily_price_rtn)

A tibble: 1 × 2
date_stamp	ipc
<date>	<dbl>
2020-08-04	0.03777869

Lets see if we can apply by group.

Method 1¶

In [17]:

daily_price_rtn %>% 
  filter(date_stamp >= '2020-02-11' & date_stamp <= '2020-07-31') %>%
  split(.$sector) %>%
  map_dfr(., ipc, .id = 'sector')

A tibble: 2 × 3
sector	date_stamp	ipc
<chr>	<date>	<dbl>
1	2020-07-31	0.4958920
2	2020-07-31	0.5700418

Method 2¶

In [18]:

daily_price_rtn %>% 
  filter(date_stamp >= '2020-02-11' & date_stamp <= '2020-07-31') %>%
  group_by(sector) %>%
  group_modify(~ ipc(.x))

A grouped_df: 2 × 3
sector	date_stamp	ipc
<int>	<date>	<dbl>
1	2020-07-31	0.4958920
2	2020-07-31	0.5700418

Validate the results for sector 2.

In [19]:

daily_price_rtn %>% 
  filter(sector == 2) %>% 
  filter(date_stamp >= '2020-02-11' & date_stamp <= '2020-07-31') %>%
  select(date_stamp, symbol, rtn_log_1d) %>%
  pivot_wider(names_from = symbol, values_from = rtn_log_1d) %>% 
  select(-date_stamp) %>% 
  cor(use = 'pairwise.complete.obs') %>%
  mean_mtrx()

0.57004180297818

All good so far.

Rolling intra-portfolio correlation¶

We are looking at time series data. Our function needs to be applied on a rolling basis.

This is where the slider package, and its slide_period_dfr function come in.

First the ipc function is wrapped in an outer function in order to apply by group.

In [20]:

ipc_by_grp <- function(df) { 
  df %>% 
    split(.$sector) %>%
    map_dfr(., ipc, .id = 'sector')
}

In [21]:

daily_price_rtn %>% 
  filter(date_stamp >= '2020-02-11' & date_stamp <= '2020-07-31') %>%
  ipc_by_grp()

A tibble: 2 × 3
sector	date_stamp	ipc
<chr>	<date>	<dbl>
1	2020-07-31	0.4958920
2	2020-07-31	0.5700418

Finally we slide it.

In [22]:

slide_period_dfr(
  .x = daily_price_rtn,
  .i = daily_price_rtn$date_stamp,
  .period = "month",
  .f = ipc_by_grp,
  .before = 5,
  .complete = TRUE
)

A tibble: 6 × 3
sector	date_stamp	ipc
<chr>	<date>	<dbl>
1	2020-06-30	0.09560748
2	2020-06-30	-0.14320115
1	2020-07-31	0.49081102
2	2020-07-31	0.56886202
1	2020-08-04	0.51351245
2	2020-08-04	0.56957118

The results for July 2020 differ very slightly - 0.57 versus 0.568 for sector 2. This likely due to the slide_period_dfr function being applied by month, as opposed to the specific day range for the ipc_by_grp function.

We can check this by changing the day range on the earlier function.

In [23]:

daily_price_rtn %>% 
  filter(date_stamp >= '2020-02-01' & date_stamp <= '2020-07-31') %>%
  ipc_by_grp()

A tibble: 2 × 3
sector	date_stamp	ipc
<chr>	<date>	<dbl>
1	2020-07-31	0.490811
2	2020-07-31	0.568862

In agreement. That's good enough for me.