#!/usr/bin/env python
# coding: utf-8

# ## Chapter Goals
#
# 1. Understand the basics of stationarity
# 2. Understand how asset returns violate the rules of stationarity
# 3. Demonstrate the concept that asset returns can be viewed as being generated by different distributions
# 4. Motivate the use of a more robust prediction methodology

# ## Chapter Outline
#
# 1. Getting and Storing a Dataset
# 2. Defining Stationarity
# 3. How Asset Returns Violate Rules of Stationarity
# 4. Conclusions

# ## 1. Getting and Storing a Dataset
#
# Start by importing the required modules and tools.

# In[3]:

get_ipython().run_line_magic('load_ext', 'watermark')
get_ipython().run_line_magic('watermark', '')

from IPython.display import display

# import standard libs
from pathlib import PurePath, Path
import sys
import time

# get project dir
pp = PurePath(Path.cwd()).parts[:-1]
pdir = PurePath(*pp)
script_dir = pdir / 'scripts'
sys.path.append(script_dir.as_posix())

# import python scientific stack
import pandas as pd
pd.options.display.float_format = '{:,.4f}'.format
import pandas_datareader.data as web
import numpy as np
import sklearn.mixture as mix
import statsmodels.tsa.api as smt
import statsmodels.api as sm
import scipy.stats as stats

# import visual tools
from mpl_toolkits import mplot3d
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
get_ipython().run_line_magic('matplotlib', 'inline')
import plotnine as pn
import mizani.breaks as mzb
import mizani.formatters as mzf
import seaborn as sns

# import util libs
import pyarrow as pa
import pyarrow.parquet as pq
import warnings
warnings.filterwarnings("ignore")

from utils import cprint

# set globals
plt.style.use('seaborn-ticks')
plt.rcParams['font.family'] = 'DejaVu Sans Mono'
plt.rcParams['font.size'] = 9.5
plt.rcParams['font.weight'] = 'medium'
plt.rcParams['figure.figsize'] = 10, 7
blue, green, red, purple, gold, teal = sns.color_palette('colorblind', 6)
RANDOM_STATE = 777
print()
get_ipython().run_line_magic('watermark', '-p pandas,pandas_datareader,numpy,sklearn,statsmodels,scipy,matplotlib,seaborn,plotnine,pyarrow')

# #### First let's get some market data from Yahoo. We'll get the adjusted closing price for multiple ETFs.

# In[4]:

get_price = lambda sym, start, end: web.DataReader(sym, 'yahoo', start, end)['Adj Close']

chosen_syms = ['SPY', 'QQQ', 'TLT', 'GLD', 'EFA', 'EEM']
end = pd.to_datetime('2017-09-30')
start = end - 25 * 252 * pd.tseries.offsets.BDay()

#%time df = (pd.DataFrame({sym:get_price(sym, start, end) for sym in chosen_syms}).dropna())
#cprint(df)

# #### Next we create the return dataframe and save it using pyarrow.
#
# Please note that although this dataset is small enough to save as csv or h5, if you intend to deal with larger datasets aka "big data" on distributed systems, I recommend developing the pyarrow habit now. This will save you time and storage costs down the road.

# In[5]:

# compute log returns
#rs = np.log(df/df.shift(1)).dropna()
#cprint(rs)

# In[6]:

# simple save as parquet func
save_parquet = lambda df, save_fp: pq.write_table(pa.Table.from_pandas(df), save_fp)

#save_fp = pdir.as_posix() + f'/data/etf_returns_{rs.index.min().date()}-{rs.index.max().date()}.parquet'
#save_parquet(rs, save_fp)
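# A quick aside (my addition, not part of the original workflow): when adopting the parquet habit it's worth sanity-checking the round trip once. Below is a minimal self-contained sketch using synthetic data and a hypothetical temp path.

# In[ ]:

demo_df = pd.DataFrame({'r': [0.01, -0.02, 0.003]},
                       index=pd.to_datetime(['2017-01-03', '2017-01-04', '2017-01-05']))
demo_fp = '/tmp/parquet_roundtrip_demo.parquet'  # hypothetical path; point at your project's data dir
pq.write_table(pa.Table.from_pandas(demo_df), demo_fp)
demo_back = pq.read_table(demo_fp).to_pandas()
pd.testing.assert_frame_equal(demo_df, demo_back)  # raises if values, dtypes, or index changed on disk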
# After saving the data we can load it back into the environment like so...

# In[9]:

# simple load func
load_parquet = lambda fp: pq.read_table(fp)

load_fp = pdir.as_posix()+'/data/etf_returns_2004-11-19-2017-09-29.parquet'
rs = (load_parquet(load_fp)
      .to_pandas() # convert to pandas df
      .assign(year=lambda df: df.index.year)) # add year column for later conv.
cprint(rs)

# Please be aware that the load function used above immediately converts the `parquet` file into a `pandas.DataFrame` object. For smaller datasets (less than 1 GB) and modern laptop hardware that is not a problem. However, once your dataset gets into the multi-GB range you may experience degraded performance, in which case I recommend learning and using [Dask](http://dask.pydata.org/en/latest/).

# ## 2. Defining Stationarity
#
# The following image sums up this concept best. The simple intuition is that the series mean and variance should NOT change as a result of passing time. The covariance between returns in the time series should also NOT be a function of time.

# In[7]:

from IPython.display import Image
media = './visuals/01_Motivation/stationarity infographic screenshot-www.seanabu.com-2016-11-01-14-16-35.png'
Image(media)

# ## 3. How Asset Returns Violate Rules of Stationarity
#
# Now that you are familiar with the concept of a stationary time series, the remainder of this notebook will demonstrate and discuss some of the nuances and challenges one runs into when trying to predict time series.
#
# Before we start, just know that the behavior of securities' returns violates these requirements of stationarity, and undermines our attempts at prediction in ways that are sometimes unknowable. This is a humbling realization, but it is a fact that requires a strategic response in the practice of return prediction.
#
# The remainder of this notebook will use the canonical `SPY` etf as the primary example, but I urge you to download the notebook and experiment with the other securities. Before the plotting begins, the sketch below gives a quick numerical check of the stationarity intuition defined above.
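# As a numerical companion to the visual definition above (my addition; a minimal sketch using the `rs` dataframe loaded earlier): the augmented Dickey-Fuller test from statsmodels tests the null hypothesis that a series has a unit root, i.e. is non-stationary, and rolling moments give the same intuition less formally. Note that daily returns usually pass this test (the unit root is rejected) even though, as the rest of this chapter shows, their variance still shifts over time -- ADF only probes one kind of non-stationarity.

# In[ ]:

from statsmodels.tsa.stattools import adfuller

adf_stat, adf_p = adfuller(rs['SPY'].dropna())[:2]  # first two outputs: test stat, p-value
print(f'ADF stat: {adf_stat:.2f}, p-value: {adf_p:.4f}')  # small p rejects a unit root

# rolling mean/std over ~1 trading year; for a stationary series these should wander very little
roll = rs['SPY'].rolling(252)
print(roll.mean().std(), roll.std().std())  # dispersion of the rolling moments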
# Below I define some convenience functions for plotting.

# In[8]:

def add_mean_std_text(x, **kwargs):
    """fn: add mean, std text to seaborn plot

    # Args
        x : pd.Series()
    """
    mean, std = x.mean(), x.std()
    mean_tx = f"mean: {mean:.4%}\nstd: {std:.4%}"

    txkw = dict(size=14, fontweight='demi', color='red', rotation=0)
    ymin, ymax = plt.gca().get_ylim()
    plt.text(mean+0.025, 0.8*ymax, mean_tx, **txkw)
    return

def plot_dist(rs, ex):
    """fn: to plot single distro with fitted histograms using FacetGrid

    # Args
        rs : pd.DataFrame(), return df
        ex : str(), security/column name
    """
    plt.style.use('dark_background')
    plt.rcParams['font.size'] = 14

    g = (rs
         .pipe(sns.FacetGrid, size=5, aspect=1.5)
         .map(sns.distplot, ex, kde=False, fit=stats.norm,
              fit_kws={'color':green, 'lw':2.5, 'label':'norm'})
         .map(sns.distplot, ex, kde=False, fit=stats.laplace,
              fit_kws={'linestyle':'--', 'color':gold, 'lw':2.5, 'label':'laplace'})
         .map(sns.distplot, ex, kde=False, fit=stats.johnsonsu,
              fit_kws={'linestyle':'-', 'color':red, 'lw':2.5, 'label':'jsu'})
         .map(add_mean_std_text, ex))
    g.add_legend()
    sns.despine(offset=1)
    plt.title(f'{ex} returns')
    return

def plot_facet_hist(rs, ex):
    """fn: to plot multiple fitted histograms using FacetGrid

    # Args
        rs : pd.DataFrame(), return df
        ex : str(), security/column name
    """
    plt.style.use('dark_background')
    plt.rcParams['font.size'] = 12

    df = rs.assign(year=lambda df: df.index.year)
    g = (sns.FacetGrid(df, col='year', col_wrap=2, size=4, aspect=1.2)
         # make sure to add legend
         .map(sns.distplot, ex, kde=False, fit=stats.norm,
              fit_kws={'color':green, 'lw':2.5, 'label':'norm'})
         .map(sns.distplot, ex, kde=False, fit=stats.laplace,
              fit_kws={'linestyle':'--', 'color':gold, 'lw':2.5, 'label':'laplace'})
         .map(sns.distplot, ex, kde=False, fit=stats.johnsonsu,
              fit_kws={'linestyle':'-', 'color':red, 'lw':2.5, 'label':'jsu'})
         .map(add_mean_std_text, ex))
    g.add_legend()
    g.fig.subplots_adjust(hspace=.20)
    sns.despine(offset=1)
    return

# #### First let's address the concept of normality. Are SPY returns normally distributed? (short answer: sometimes)
#
# Let's look at the entire return distribution of SPY as a histogram using seaborn's distplot. You'll notice that I have layered fitted density plots using the following distributions: *normal*, *laplace*, and *johnsonsu*.

# In[9]:

ex = 'SPY'
df = rs.loc['2005':].copy() # use 2005 cutoff b/c it's first full year of data
plot_dist(df, ex)

# Looking at the chart above we can see how badly the normal distribution approximates the dataset. Both the laplace and jsu distributions do a much better job of approximating the returns. All this shows is that the return distribution of the SPY etf, over the selected time period, is certainly not normal; the short sketch below puts a number on that impression.
#
# Let's explore this a bit further. We can use a quantile plot, which allows us to compare the quantile distribution of our parameter, e.g. SPY returns, with that of a theoretical distribution, e.g. the Normal.
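# Before the quantile plot, a quick numeric companion (my addition; a minimal sketch): scipy's Jarque-Bera test checks whether the sample's skew and kurtosis match those of a normal distribution.

# In[ ]:

jb_stat, jb_p = stats.jarque_bera(df[ex])
print(f'Jarque-Bera stat: {jb_stat:,.1f}, p-value: {jb_p:.4g}')
# a large statistic / tiny p-value rejects normality, consistent with the histogram above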
# In[10]:

def quantile_plot(x, **kwargs):
    """fn: plot a quantile plot of returns vs normal distribution

    # Args
        x : pd.Series()
    """
    plt.rcParams['font.family'] = 'dejavu sans mono'
    res = stats.probplot(x, fit=True, plot=plt)
    _slope, _int, _r = res[-1]
    ax = plt.gca()
    ax.get_lines()[0].set_marker('s')
    ax.get_lines()[0].set_markerfacecolor('r')
    ax.get_lines()[0].set_markersize(13.0)
    ax.get_children()[-2].set_fontsize(22.)
    txkw = dict(size=14, fontweight='demi', color='r')
    # note: probplot returns r, the correlation coefficient, so square it for r^2
    r2_tx = "r^2 = {:.2%}\nslope = {:.4f}".format(_r**2, _slope)
    ymin, ymax = ax.get_ylim()
    xmin, xmax = ax.get_xlim()
    ax.text(0.5*xmax, .8*ymin, r2_tx, **txkw)
    return

# In[11]:

s = df[ex]
quantile_plot(s)

# As most of you already know, SPY returns exhibit a phenomenon known as "fat tails", as clearly demonstrated by the above plot. All this means is that extreme returns, both positive and negative, occur at a much higher frequency than the normal distribution predicts. When you read various papers, authors may also refer to this concept by saying "the distribution has more MASS in the TAILS"; in this context they mean the same thing. So this shows that, over this 12+ year period, SPY returns are not distributed according to the Normal distribution. (The sketch at the end of this section counts the tail events explicitly.)
#
# But we haven't really addressed the stationarity aspect yet.
#
# One way we can examine whether the returns are stationary is to look at the mean and standard deviation of returns by year. If those two quantities are not supposed to be a function of time, then they should be pretty similar across years, right?

# In[12]:

get_ipython().run_cell_magic('time', '', "\nplot_facet_hist(df, ex)\nplt.savefig(f'./visuals/01_Motivation/spy return faceted histogram plot-mean std.png', dpi=300, bbox_inches='tight') \n")

# #### The annual means and standard deviations vary significantly across years.
#
# Just by eyeballing the plots you can see how varied the return distribution is. Sometimes the distribution is highly peaked, like 2006 and 2017, and other times it's wider and flatter, like 2008-09. We can see that variation a bit more clearly below.

# In[13]:

df.groupby('year')[ex].agg(['mean', 'std']).plot(marker='o', subplots=True);

# Did you notice that the fitted normal density plots in the faceted histogram looked like they did a better job of approximating the returns for the individual years? Does that mean our previous conclusion, that returns are NOT normally distributed, is simply a matter of the time scale? Let's look into this a little further with a faceted quantile plot.

# In[14]:

def plot_facet_qq(rs, ex):
    """fn: to plot faceted quantile plots using FacetGrid

    # Args
        rs : pd.DataFrame()
        ex : str(), security/column name
    """
    g = (rs
         .pipe(sns.FacetGrid, col='year', col_wrap=2, size=7, aspect=1.3)
         .map(quantile_plot, ex)
         .fig.subplots_adjust(hspace=0.2))
    sns.despine(offset=1, trim=True)
    return

# In[15]:

get_ipython().run_line_magic('time', 'plot_facet_qq(df, ex)')
plt.savefig(f'./visuals/01_Motivation/spy return faceted quantile plot-kstest.png', dpi=300, bbox_inches='tight')

# Unsurprisingly, we see that in certain years the return distribution is indistinguishable from a normally distributed approximation. In other years we see that the assumption of normality would have been disastrous. So we can say, by visual observation, that the means and variances of the return series are at minimum a function of time, of the time scale, and of the quantity of data points.
#
# What we can also observe is the effect of central bank intervention on US-listed multinational corporations and their asset prices. In 2005-06 the return distributions are effectively normal. By 2007 some returns were more negative, and occurring with more frequency, than a normal approximation would predict. By 2008 the wheels were falling off the bus and the volatility and magnitude of the returns, both positive and negative, were highly non-normal. 2009-10 marks the time that the Fed stepped in to stabilize markets. However, it was not enough, as the financial "contagion" had spread globally, especially to European markets. 2011 was the Euro-Crisis, and we can see a similar non-normal return pattern, especially in negative returns. Then the global recapitalization, aka the global debt monetization scheme, kicked in around 2012, and we've seen an unprecedented decline in volatility and return dispersion since.
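# Returning to the "fat tails" point above (my addition; a minimal sketch): under normality, roughly 0.27% of daily returns should land beyond three standard deviations. Counting the actual exceedances makes the excess tail mass concrete.

# In[ ]:

z = (df[ex] - df[ex].mean()) / df[ex].std()  # standardized returns
n_extreme = int((z.abs() > 3).sum())         # observed 3-sigma days
n_expected = len(z) * 2 * stats.norm.sf(3)   # expected count under a fitted normal
print(f'observed 3-sigma days: {n_extreme}, expected if normal: {n_expected:.1f}')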
# In[16]:

norm = stats.norm

def generate_norm_rvs(ser, N=None):
    if not N: N = ser.shape[0]
    return norm.rvs(ser.mean(), ser.std(), size=N, random_state=RANDOM_STATE)

def generate_norm_pdf(ser, N=None):
    if not N: N = ser.shape[0]
    _min, _max = ser.min(), ser.max()
    x = np.linspace(_min, _max, N)
    y = norm.pdf(x, ser.mean(), ser.std())
    return x, y

def generate_norm_cdf(ser, N=None):
    if not N: N = ser.shape[0]
    _min, _max = ser.min(), ser.max()
    x = np.linspace(_min, _max, N)
    y = norm.cdf(x, ser.mean(), ser.std())
    return x, y

# #### The final test - Kolmogorov-Smirnov.
#
# Let's test this more rigorously. For this we will compare the empirical CDF of the returns with a fitted normal CDF using the Kolmogorov-Smirnov test (`scipy.stats.kstest`). This test lets us state more definitively whether the sample could have been drawn from the reference distribution, e.g. the Normal distribution. The kstest outputs the KS statistic and a p-value. A p-value less than 0.05 means we can reject the null hypothesis that the sample was drawn from the fitted normal. The smaller the p-value, the more confident we are that the returns came from a different distribution.
#
# First we will analyze the entire return series, then compare the test across years.

# In[17]:

def plot_cdf(ser, **kwds):
    """fn: to plot single CDF comparison

    # Args
        ser : pd.Series()
    """
    plt.style.use('dark_background')
    plt.rcParams['font.family'] = 'dejavu sans mono'

    # plot empirical kde-based cdf
    g = sns.kdeplot(ser, cumulative=True, lw=3, color=blue)
    x, y = generate_norm_cdf(ser) # generate fitted normal CDF
    g.plot(x, y, color=red, lw=3, label='fitted normal CDF')

    # one-sample ks test vs fitted normal
    # https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html
    ks, p = stats.kstest(ser, 'norm', args=(ser.mean(), ser.std()))

    xmin, xmax = plt.gca().get_xlim()
    ymin, ymax = plt.gca().get_ylim()

    # add text
    txkw = dict(size=14, fontweight='demi', color=gold, rotation=0)
    tx_args = (ks, p, ser.shape[0])
    tx = 'ks test vs fitted normal\nks: {:.4f}, p: {:.4f}\nsample N: {:.0f}'.format(*tx_args)
    plt.text(xmax*0.2, 0.2*ymax, tx, **txkw)

    sns.despine(offset=1)
    (plt.legend(frameon=True, prop={'weight':'demi', 'size':12})
     .get_frame()
     .set_edgecolor(blue))
    return

plot_cdf(df[ex])

# This further confirms that the full-period returns and the fitted normal are in fact drawn from separate distributions. Let's look at the same information by year.
# In[18]:

def plot_facet_cdf(rs, ex, **kwds):
    """fn: to plot faceted CDF comparisons using FacetGrid

    # Args
        rs : pd.DataFrame(), returns dataframe
        ex : str(), security/column name
    """
    g = (rs
         .pipe(sns.FacetGrid, col='year', col_wrap=2, size=5, aspect=1.3)
         .map(plot_cdf, ex, **kwds))
    g.add_legend()
    g.fig.subplots_adjust(hspace=.20)
    sns.despine(offset=1)
    return

get_ipython().run_line_magic('time', 'plot_facet_cdf(df, ex)')
plt.savefig(f'./visuals/01_Motivation/spy return cdf compare with ecdf-kstest.png', dpi=300, bbox_inches='tight')

# From 2007-11 the kstest is pretty confident that the return CDF and the fitted normal CDF are different, as the p-values are smaller than our arbitrary threshold of 0.05. From 2012-13, and in 2015, the kstest doesn't see a difference between them. And in 2014, and from 2016-17, the kstest tells us they came from different distributions.
#
# Finally, we can look for signs of serially correlated data points by plotting the return series along with its autocorrelation and partial autocorrelation functions. First we'll plot the entire period, and then the faceted plots by year.

# In[19]:

def tsplot(y, lags=None, **kwds):
    if not isinstance(y, pd.Series):
        y = pd.Series(y)

    plt.style.use('dark_background')
    plt.rcParams['font.family'] = 'dejavu sans mono'

    fig = plt.figure(figsize=(10,8))
    layout = (2, 2)
    ts_ax = plt.subplot2grid(layout, (0, 0), colspan=2)
    acf_ax = plt.subplot2grid(layout, (1, 0))
    pacf_ax = plt.subplot2grid(layout, (1, 1))

    y.plot(ax=ts_ax)
    ts_ax.axhline(0, color=red, ls='--', lw=2.)

    smt.graphics.plot_acf(y, lags=lags, ax=acf_ax, alpha=0.05, color=red)
    acf_ax.get_lines()[1].set_marker('s')
    acf_ax.get_lines()[1].set_markerfacecolor('cyan')
    acf_ax.get_lines()[1].set_markeredgecolor('cyan')

    smt.graphics.plot_pacf(y, lags=lags, ax=pacf_ax, alpha=0.05, color=red)
    pacf_ax.get_lines()[1].set_marker('s')
    pacf_ax.get_lines()[1].set_markerfacecolor('cyan')
    pacf_ax.get_lines()[1].set_markeredgecolor('cyan')

    plt.tight_layout()
    return

tsplot(df[ex], lags=120);
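# The ACF/PACF panels above are visual; for a numeric companion (my addition; a minimal sketch assuming statsmodels >= 0.11 for the `return_df` keyword), the Ljung-Box test checks the joint null hypothesis of no autocorrelation up to a given lag. Running it on squared returns highlights volatility clustering even when raw returns look uncorrelated.

# In[ ]:

from statsmodels.stats.diagnostic import acorr_ljungbox

lb_raw = acorr_ljungbox(df[ex], lags=[10], return_df=True)    # raw returns
lb_sq = acorr_ljungbox(df[ex]**2, lags=[10], return_df=True)  # squared returns
print(lb_raw, lb_sq, sep='\n')  # tiny p-values reject "no autocorrelation up to lag 10"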
# In[20]:

def facet_tsplot(df, ex, _lags, _years, _dt, _partial_len):
    """fn: to plot faceted timeseries plot using gridspec

    # Args
        df : pd.DataFrame(), return data
        ex : str(), etf symbol
        _lags : numeric
        _years : array-like, list of years
        _dt : int(), number of columns
        _partial_len : int(), number of rows
    """
    plt.style.use('dark_background')
    fig = plt.figure(figsize=(25,40))
    outer_grid = gridspec.GridSpec(_partial_len, _dt)

    for i, year in zip(range(int(_partial_len*_dt)), _years):
        y = df[ex].loc[str(year)]
        inner_grid = gridspec.GridSpecFromSubplotSpec(2, 2, subplot_spec=outer_grid[i],
                                                      wspace=0.0, hspace=0.5)
        ts_ax = plt.Subplot(fig, inner_grid[0, :])
        acf_ax = plt.Subplot(fig, inner_grid[1, 0])
        pacf_ax = plt.Subplot(fig, inner_grid[1, -1])

        y.plot(ax=ts_ax)
        ts_ax.axhline(0, color=red, ls='--', lw=2.)
        ts_ax.tick_params(axis='x', size=9)
        ts_ax.set_title(str(year), y=.8)

        smt.graphics.plot_acf(y, lags=_lags, ax=acf_ax, alpha=0.05, color=red)
        acf_ax.get_lines()[1].set_marker('s')
        acf_ax.get_lines()[1].set_markerfacecolor('cyan')
        acf_ax.get_lines()[1].set_markeredgecolor('cyan')

        smt.graphics.plot_pacf(y, lags=_lags, ax=pacf_ax, alpha=0.05, color=red)
        pacf_ax.get_lines()[1].set_marker('s')
        pacf_ax.get_lines()[1].set_markerfacecolor('cyan')
        pacf_ax.get_lines()[1].set_markeredgecolor('cyan')
        pacf_ax.set_yticks([])

        fig.add_subplot(ts_ax)
        fig.add_subplot(acf_ax)
        fig.add_subplot(pacf_ax)
    sns.despine()
    plt.show()
    return

# note: np.arange excludes the endpoint, so 2017 is dropped, keeping the grid an even 12 panels
_years = np.arange(df.year.min(), df.year.max()) # array([2005, 2006, ..., 2016])
_dt = 2
_partial_len = int(len(_years)/_dt)
_lags = 120

get_ipython().run_line_magic('time', 'facet_tsplot(df, ex, _lags, _years, _dt, _partial_len)')
plt.savefig(f'./visuals/01_Motivation/spy return faceted tsplot acf and pacf.png', dpi=300, bbox_inches='tight')

# So we can see clearly that SPY returns are nonstationary in the mean and the variance, and exhibit periods of variable autocorrelation. Furthermore, the aggregate `SPY` return distribution is non-normal, yet at smaller scales, e.g. annually, there are periods of normality. So what happens when we try to use past returns, summarized by a fitted normal's parameters (mean, variance), to predict future return distributions?
#
# To explore this we'll use the `TimeSeriesSplit` method from `sklearn`. This provides a walk-forward form of cross-validation, where previous data points are used to predict the following period in sequence, thus preserving time-dependent information. (The short sketch after the next cell shows the split mechanics in isolation.)

# In[21]:

get_ipython().run_cell_magic('time', '', "\nfrom sklearn.model_selection import TimeSeriesSplit\n\nxx = df[ex].copy() # pd series representing example data\n\n_base = 252 # 1 trading year\n_max_train_sizes = [_base*1, _base*2, _base*3, _base*5] # historical lookbacks\n#_splits = np.arange(2, 12, 2)\n_n_split=5 # try with different values\n\ngs = gridspec.GridSpec(_n_split, len(_max_train_sizes), wspace=0.0)\n\nfig = plt.figure(figsize=(20,25))\n\nrows = []\nfor j, max_size in enumerate(_max_train_sizes):\n    tscv = TimeSeriesSplit(n_splits=_n_split, max_train_size=max_size)\n\n    for i, (train, test) in enumerate(tscv.split(xx)):\n        tmp_train = xx.iloc[train] # temporary train data\n        tmp_test = xx.iloc[test] # temporary test data\n\n        # get date ranges for datasets\n        min_train_dt, max_train_dt = tmp_train.index.min(), tmp_train.index.max()\n        min_test_dt, max_test_dt = tmp_test.index.min(), tmp_test.index.max()\n\n        # compare train distro vs test distro\n        # https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html\n        ks, p = stats.ks_2samp(tmp_train, tmp_test) # get ks statistic and pvalue\n\n        df_row = (max_size, ks, p,\n                  min_train_dt.date(), max_train_dt.date(),\n                  min_test_dt.date(), max_test_dt.date())\n        rows.append(df_row) # append data for df\n\n        tmp_ax = plt.subplot(gs[i, j]) # get temp ax from gridspec object\n\n        # remove yticks except for first column\n        if i in [0,1,2,3,4] and j != 0: tmp_ax.set_yticks([])\n        # plot cdfs\n        sns.kdeplot(tmp_train, cumulative=True, lw=3, color=blue, ax=tmp_ax, label='train')\n        sns.kdeplot(tmp_test, cumulative=True, lw=3, color=red, ax=tmp_ax, label='test')\n        plt.title('max train size: {}, ks: {:.4f}, p: {:.4f}\\ntrain dates: {}_{}\\ntest dates: {}_{}'\n                  .format(max_size, ks, p,\n                          min_train_dt.date(), max_train_dt.date(),\n                          min_test_dt.date(), max_test_dt.date()),\n                  fontsize=11.)\n        plt.subplots_adjust(top=1.03) # adjust title spacing\n        plt.tight_layout()\nplt.savefig(f'./visuals/01_Motivation/spy return cdf compare-time series split {_n_split} folds-kstest.png',\n            dpi=300, bbox_inches='tight')\n")
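# As promised above, here are the bare mechanics of `TimeSeriesSplit` in isolation (my addition; a minimal sketch on a toy array). Note the train window always precedes the test window.

# In[ ]:

from sklearn.model_selection import TimeSeriesSplit

toy = np.arange(12)
for fold, (tr, te) in enumerate(TimeSeriesSplit(n_splits=3, max_train_size=4).split(toy)):
    print(f'fold {fold}: train idx {tr.min()}-{tr.max()}, test idx {te.min()}-{te.max()}')
# with max_train_size set, the train window rolls forward instead of expanding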
# We can see how varied the CDFs of the train and test periods are across the different lookback periods, including times where they sync up almost exactly. To quantify our observations, we can create a dataframe containing the output of the ks test along with information on the lookback periods and the data splits.

# In[22]:

# create data table with ks test results and tss information
cols = ['max_size', 'ks', 'p',
        'min_train_dt', 'max_train_dt',
        'min_test_dt', 'max_test_dt']
tss_ks_df = pd.DataFrame(rows, columns=cols)

obj_cols = tss_ks_df.select_dtypes(include=['object']).columns
tss_ks_df[obj_cols] = tss_ks_df[obj_cols].apply(pd.to_datetime)
tss_ks_df

# We can now examine the percentage of train/test pairs that reject the null hypothesis at alpha equal to 0.05 and 0.01.

# In[23]:

get_pct_reject = lambda df, col, alpha: df.query(f'{col} < {alpha}').shape[0] / df.shape[0]

a1 = 0.05
a2 = 0.01
# percentage of pairs rejecting the null hypothesis that the train and test samples
# came from the same distribution, at alphas 0.05 and 0.01 respectively
get_pct_reject(tss_ks_df, 'p', a1), get_pct_reject(tss_ks_df, 'p', a2)

# Unsurprisingly, the results are terrible. 60% of the train/test pairs reject the null hypothesis that the training and testing data came from the same distribution. This means that if we were to apply the mean and variance parameters of the training data to predict the test data's return distribution, we're going to be wrong a lot of the time.
#
# For fun, we can look at the periods where the kstest fails to reject the null hypothesis and see if any patterns jump out.

# In[24]:

tss_ks_df.query('p > 0.05')

# At a glance, it looks like the kstest tended to not reject the null hypothesis when the training and test periods contained significant stretches of low and/or stable volatility.

# ## 4. Conclusions
#
# We've seen that SPY returns are sometimes normal but mostly non-normally distributed, that the means and variances are not stationary, and that there is autocorrelation between returns at differing lags and multiple time scales. We've also seen the potential for large errors/losses if we assume that the return distribution carries over from the previous period(s) to the next, because in most cases it does not.
#
# Hopefully this notebook shows that there is a need for a modeling framework that can overcome these challenges in a robust manner. In the next chapter I will demonstrate how mixture models approach these issues and the benefits they bring to this task. The short teaser below gives a flavor of where we're headed.
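# A teaser of the next chapter's approach (my addition; a minimal sketch, not the chapter's actual implementation): fit a two-component gaussian mixture to SPY returns using the `sklearn.mixture` module imported at the top (as `mix`). If returns really are generated by different distributions at different times, the components should separate into a calm, low-variance state and a turbulent, high-variance state.

# In[ ]:

X = df[ex].values.reshape(-1, 1)  # sklearn expects a 2-D array
gmm = mix.GaussianMixture(n_components=2, covariance_type='full',
                          random_state=RANDOM_STATE).fit(X)
means = gmm.means_.ravel()
stds = np.sqrt(gmm.covariances_.ravel())
weights = gmm.weights_
for k in range(2):
    print(f'component {k}: weight={weights[k]:.2f}, mean={means[k]:.4%}, std={stds[k]:.4%}')
# gmm.predict(X) labels each day with its most likely component -- a crude regime indicator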