#!/usr/bin/env python
# coding: utf-8

# # 3. Characterizing outperformed & underperformed companies 1-Month Post-COVID19 Vaccine Success
# 
# ## Instructions
# In 2020 August, the author attempted a characterization of [S&P500 companies based on fundamental metrics during covid19](https://medium.com/analytics-vidhya/characterising-companies-based-on-financial-metrics-during-covid19-1a6ce9cc4ada). Here, the author sought to characterize the performance of S&P500 companies 1-month after [the first covid19 vaccine success announcement by Pfizer](https://www.pfizer.com/news/press-release/press-release-detail/pfizer-and-biontech-conclude-phase-3-study-covid-19-vaccine) on the Nov 18, 2020. Briefly, the stock return was calculated based on the difference between the end of day (eod) stock prices on the Nov 17 and Dec 18, 2020. The contemporary fundamental metrics of the companies were retrieved to investigate their correlations with the top and worst performer during the month.
# 
# ## Packages
# You'll need to you install the quanp package (https://quanp.readthedocs.io/en/latest/installation.html) that should install all necessary packages/libraries required to execute the codes in this tutorial. Please create and use virtualenv with python version 3.6 to avoid dependency problem.
# 
# ### Install Packages

# In[1]:


# import sys
# !conda install seaborn scikit-learn statsmodels numba pytables
# !conda install -c conda-forge python-igraph leidenalg
# !{sys.executable} -m pip install quanp


# In[30]:


# importing library packages
from datetime import datetime, timedelta
import os

import matplotlib.pyplot as pl
import numpy as np
import pandas as pd
import quanp as qp
import re

from IPython.display import display
from matplotlib import rcParams

# setting visualization/logging parameters
pd.set_option('display.max_columns', None)
qp.set_figure_params(dpi=100, color_map = 'viridis_r')
qp.settings.verbosity = 1
qp.logging.print_versions()


# ### Download data
# Here, we get the 505 S&P500 member companies listed on the wikipedia & get a list of fundamental metrics for each company from the TD Ameritrade API (All functions are available from the quanp tools).

# In[56]:


# S&P 500 metadata
df_metadata = qp.datasets.get_wiki_sp500_metadata()

# # S&P 500 fundamentals
# df_fundamental = qp.datasets.download_tickers_fundamental()

# # Download End of Day (eod) price history from the current sp500 list
# qp.datasets.download_tickers_price_history_fromlist(df_metadata.index)


# ### Loading and preparing data

# In[57]:


# Optional: The data retried in cell above were saved as csv file. You may execute this cell to avoid 
# rerunning the downloading cell above.
df_fundamental = pd.read_csv('data/metadata/sp500_metadata_fundamentalAdded.csv', index_col=0)
print(df_fundamental.columns)


# In[58]:


# prepare End of Day (eod) for each company
df_eod = qp.datasets.process_eod_price(df_metadata.index, startdate='2020-11-17', enddate='2020-12-18')

display(df_eod[:2])
display(df_eod[-2:])


# In[59]:


# calculate the 1-month log return of the stock price for each company, i.e. price difference
# between Nov 17 and Dec 18, 2020.
df_log_return_1mth = pd.DataFrame(np.log(df_eod.iloc[-1]) - np.log(df_eod.iloc[0]), 
                                  columns=['log_return_1mth'])

# merging fundamental data with the log_return data
df_fundamental_logreturn = pd.merge(df_fundamental, df_log_return_1mth, how='inner', left_index=True, 
                                    right_index=True)

# describe the 1-month log return of all sp500 companies.
df_fundamental_logreturn['log_return_1mth'].describe()


# In[60]:


ls_fundamental_target = ['log_return_1mth'] + ['beta','bookValuePerShare','currentRatio', 'quickRatio', 
'dividendYield', 'epsChangePercentTTM', 'epsChangeYear', 'epsTTM', 'grossMarginMRQ', 'grossMarginTTM', 
'interestCoverage', 'marketCap', 'marketCapFloat', 'netProfitMarginMRQ','netProfitMarginTTM', 
'operatingMarginMRQ', 'operatingMarginTTM', 'peRatio', 'pegRatio', 'pbRatio', 'pcfRatio',
'prRatio', 'returnOnAssets', 'returnOnEquity', 'returnOnInvestment', 'revChangeIn', 'revChangeTTM', 
'revChangeYear', 'sharesOutstanding', 'shortIntDayToCover', 'shortIntToFloat', 'totalDebtToCapital', 
'totalDebtToEquity', 'ltDebtToEquity', 'vol10DayAvg', 'vol1DayAvg', 'vol3MonthAvg']


# In[61]:


from sklearn.preprocessing import MinMaxScaler

# Minmax scaling (0, 1) for all the variables/features
scaler = MinMaxScaler((0, 1))
df_fundamental_logreturn_minmax = pd.DataFrame(scaler.fit_transform(df_fundamental_logreturn[ls_fundamental_target]), 
                                  index=df_fundamental_logreturn.index,
                                  columns=ls_fundamental_target)

df_fundamental_logreturn_minmax[ls_fundamental_target].describe()


# In[62]:


# Loading pandas dataframe as anndata 
adata = qp.AnnData(df_fundamental_logreturn_minmax[ls_fundamental_target])

# Saving raw data for visualization later
adata.raw = adata

# log(x+1) transformation for all data
qp.pp.log1p(adata)

# Standardization scaling per feature
qp.pp.scale(adata)


# In[63]:


# add a new `.obs` column for all comapnanies called `GICS_Sector`
adata.obs['GICS_Sector'] = df_fundamental_logreturn['GICS Sector'].values

adata


# ### Principal component analysis (PCA)
# Reduce the dimensionality of the data by running PCA, which reveals the main axes of variation and denoises the data.

# Let us inspect the contribution of single PCs to the total variance in the data. This gives us information about how many PCs we should consider in order to compute the neighborhood relations of cells, e.g. used in the clustering function ```qp.tl.leiden()```, ```qp.tl.louvain()```, or ```tSNE qp.tl.tsne()```. In our experience, often, a rough estimate of the number of PCs does fine. The 'elbow' point seems to suggest at least up to PC5 will be useful to characterize the companies. We are going to do further dimensional reduction based on the first 6 PCs later. We will perform a factor analysis to check for the underlying features that made up these PCs in a separate tutorial notebook later.

# In[64]:


rcParams['figure.figsize'] = 5,3
qp.tl.pca(adata, svd_solver='arpack', random_state=42)
qp.pl.pca_variance_ratio(adata, n_pcs=len(adata.var_names))


# In[65]:


# Optional: save the anndata in h5ad for reloading of the processed data later
adata.write('data/sp500_1mthlogreturn_afterVaccineSuccess.h5ad')
# adata.read('data/sp500_1mthlogreturn_afterVaccineSuccess.h5ad')
adata


# ### Computing the T-distributed Stochastic Neighbor Embedding (tSNE)
# 
# Let us further reduce the dimensionality of the signficant PCs identified above wholly in to 2 dimensions using the tSNE tool implemented as ```qp.tl.tsne(adata)```.

# In[66]:


qp.tl.tsne(adata, n_pcs=5, random_state=42); # only consider the first 8 pcs


# ### Computing the neighborhood graph
# 
# Before we view the tsne plots with Sector annotations, Let us compute the neighborhood graph of companies using the PCA representation of the data matrix. This will give rise to distances and connectivities in each company. Here, we consider 10 nearest neighbors with 5 PCs derived from the PCA

# In[67]:


qp.pp.neighbors(adata, n_neighbors=10, n_pcs=5, random_state=42);


# ### Clustering the neighborhood graph
# 
# Here, we use Leiden graph-clustering method (community detection based on optimizing modularity) by Traag *et al.* (2018) to cluster the neighborhood graph of companies, which we already computed in the previous section.

# In[68]:


qp.tl.leiden(adata)


# We can now map and view the annotations of leiden clustering, GICS_Sector, or any financial metrics/features on the tsne plots. We can see that companies from the Leiden Clusters 8 and 9 seems to perform the best 1 month post-vaccine-success, while the worst performers seem to be ones from the Clusters 7 and 6.

# In[69]:


# # Check data distribution of the log_return_1mth to set a appropriate vmin cutoff
df_fundamental_logreturn_minmax['log_return_1mth'].describe()


# In[70]:


rcParams['figure.figsize'] = 8,8
qp.pl.tsne(adata, color=['leiden', 'GICS_Sector', 'log_return_1mth'], 
           vmin=0.277238, legend_loc='on data', use_raw=True, legend_fontsize=20)


# ### Embedding the neighborhood graph
# 
# We can also embed the neighborhood graph in 2 dimensions using UMAP (McInnes et al., 2018), see below. It is potentially more faithful to the global connectivity of the manifold than tSNE. Before running the UMAP, we compute the correlations between clusters as initiating positions for the UMAP.

# In[71]:


rcParams['figure.figsize'] = 5,5
qp.tl.paga(adata)
qp.pl.paga(adata, color=['GICS_Sector'], plot=True)


# We can now map and view the annotations of leiden clustering or any financial metrics/features on the umap plots. Again, we can see that companies from the Leiden Clusters 8, 9, and 3 seems to perform the best 1 month post-vaccine-success, while the worst performers seem to be ones from the Clusters 6 and 7.

# In[72]:


rcParams['figure.figsize'] = 8,8
qp.tl.umap(adata, init_pos='paga', random_state=42)
qp.pl.umap(adata, color=['leiden', 'log_return_1mth'], 
           legend_loc='on data', frameon=False, ncols=4, 
           vmin=0.277238, vmax=1.0, use_raw=True, legend_fontsize=20)


# ### Characterizing the clusters of companies

# We run `qp.tl.dendrogram` to compute hierarchical clustering. Multiple visualizations that can
# then include a dendrogram: `qp.pl.matrixplot`, `qp.pl.heatmap`, `qp.pl.dotplot` and `qp.pl.stacked_violin`. 

# In[73]:


qp.tl.dendrogram(adata, 'leiden', var_names=adata.var_names);


# In[74]:


qp.pl.matrixplot(adata, var_names=adata.var_names, groupby='leiden', use_raw=True, 
              cmap='RdBu_r', dendrogram=True, standard_scale='var')


# The matrixplot shows the median for each cluster/group of companies. Again, we are seeing Cluster 8 was the top performer in 'log_return_1mth', followed by Cluster 9, Cluster 3, and Cluster 7. Besides, we can also see that the 'beta' was high in both Cluster 8. In order to confirm the results, we are printing out the exact median values of log_return_1mth for each cluster/group of companies in dataframe.

# In[75]:


df_leiden = pd.DataFrame(adata.obs['leiden'])
df_all = pd.merge(df_fundamental_logreturn, df_leiden, how='inner', left_index=True, right_index=True)

# sort clusters by median log_return_1mth
stats_perCluster = df_all.groupby(['leiden']).agg({'log_return_1mth': ['median', 'min', 
                     'max', 'mean']}).dropna()
stats_perCluster.columns = ['median', 'min', 'max', 'mean']
stats_perCluster.sort_values(['median'], ascending=False).style.background_gradient(cmap='bwr', axis=0)


# Dataframe above shows that the median of 'log_return_1mth' was the highest in Leiden Cluster 8, while the lowest were Cluster 2. To confirm the 'beta', we did the similar in the following cell. We typically dislike [high beta as it usually means the company is more risky](https://www.investopedia.com/investing/beta-know-risk/).

# In[76]:


# sort clusters by median beta
stats_perCluster = df_all.groupby(['leiden']).agg({'beta': ['median', 'min', 
                     'max', 'mean']}).dropna()
stats_perCluster.columns = ['median', 'min', 'max', 'mean']
stats_perCluster.sort_values(['median'], ascending=False).style.background_gradient(cmap='bwr', axis=0)


# ## Visualizing the important features defining each cluster
# Instead of looking at all features of clusters as previously, we can identify features/metrics that are differentially characterizing each cluster. Here, we can see that the Cluster 8 was significantly positive-correlated with higher beta, volatility 3-month average (vol3MonthAvg), volatility 10-day average (vol10DayAvg), vol1DayAvg, but negatively associated with lower earning per share Trailing-twelve-month (epsTTM), price per earning ration (peRatio), operatingMarginTTM, netProfitMarginTTM, returnOnAssets, etc. 
# 
# The worst performer, Cluster 6, was associated with higher dividendYield, grossMarginMRQ, grossMarginTTM, TotalDebtToEquity, etc, and lower current, quick ratio, returnOnAssets, etc.
# 
# Here, the author is particularly interested in the companies from the Cluster 9, which associated with lower beta, that typically means that the stocks in this cluster are considered less risky. Also, the Cluster 9 were associated with favorable higher Operating Profitability Factors ('netProfitMarginTTM', 'netProfitMarginMRQ', 'grossMarginTTM', 'grossMarginMRQ', 'returnOnInvestment', 'returnOnAssets', 'operatingMarginTTM', 'operatingMarginMRQ') and lower Financial Risk Factors ('ITDebtToEquity' and 'TotalDebtToEquity')

# In[77]:


qp.tl.rank_features_groups(adata, 'leiden', groups=['8', '9', '7', '6'], method='wilcoxon')
qp.pl.rank_features_groups(adata, n_features=38, sharey=False, fontsize=10, ncols=2)


# We can map and view the annotations of leiden clustering based on the financial metrics/features on the umap plots below. Here, we see that Leiden Cluster 8 was indeed mostly featured by high beta and unfavorable lower Operating Profitabiity Factors; Cluster 9 was featured by lower beta and many favourable higher Operating Profitability Factors; Clusters 6 and 7 were featured by unfavorable lower Solvency Factors (current, quick ratios).

# In[78]:


rcParams['figure.figsize'] = 5,5
qp.pl.umap(adata, color=['leiden', 'beta', 'vol3MonthAvg', 'epsTTM', 'netProfitMarginTTM', 
                         'netProfitMarginMRQ', 'grossMarginTTM', 'grossMarginMRQ', 
                         'returnOnInvestment', 'returnOnAssets', 'operatingMarginTTM', 
                         'operatingMarginMRQ', 'currentRatio', 'quickRatio', 'pbRatio'], 
           legend_loc='on data', frameon=False, ncols=4, cmap='bwr', vmax=0.5, legend_fontsize=25)


# ### Sector performance in the Cluster 9 companies

# In[79]:


stats_perCluster = df_all.groupby(['leiden', 'GICS Sector']).agg({'log_return_1mth': ['median', 
                                                                   'min', 'max', 'mean']}).dropna()

stats_perCluster.columns = ['median', 'min', 'max', 'mean']
stats_perCluster = stats_perCluster.reset_index()
stats_perCluster[stats_perCluster.leiden == '6'].sort_values(['median'], ascending=False).\
                                                style.background_gradient(cmap='bwr', axis=0)


# In[80]:


# Printing out Cluster 9 companies
heading_properties = [('font-size', '9px')]
cell_properties = [('font-size', '9.5px')]
dfstyle = [dict(selector="th", props=heading_properties),\
           dict(selector="td", props=cell_properties)]
df_all[df_all.leiden == '9'][['Security', 'GICS Sector', 'GICS Sub-Industry']].style.set_table_styles(dfstyle)


# In[81]:


# Printing out Cluster 6 companies
heading_properties = [('font-size', '8px')]
cell_properties = [('font-size', ' 8px')]
dfstyle = [dict(selector="th", props=heading_properties),\
           dict(selector="td", props=cell_properties)]

df_all[df_all.leiden == '6'][['Security', 'GICS Sector', 'GICS Sub-Industry']].style.set_table_styles(dfstyle)


# The sunburst chart below shows that Cluster 9 mostly consisted of Information Technology Sector, Cluster 10 mostly Energy Sector, Cluster 9 Financial Sector, and Cluster 2 Real Estate Sector.

# In[49]:


import plotly.express as px

# grouping statistics
ds = df_all.groupby(['leiden', 'GICS Sector', 'GICS Sub-Industry'])['log_return_1mth'].count().reset_index()
ds.columns = ['leiden', 'GICS Sector', 'GICS Sub-Industry', 'count']

# plotting sunburst
fig = px.sunburst(
    ds, 
    path=[
        'leiden', 'GICS Sector', 'GICS Sub-Industry'
    ], 
    values='count', 
    title='Sunburst chart',
    width=900,
    height=900
)
fig.show()


# The sunburst chart below shows that most of the Real Estate companies Fall on the worst performer 1-month post vaccine success, Cluster 6.

# In[55]:


import plotly.express as px
from itertools import repeat

Other_clusters = [c for c in df_all.leiden.unique() if c not in ['8', '9', '7', '6']]
dict_regroup = {key: item for key, item in zip(Other_clusters, repeat('Others'))}
dict_regroup['8'] = 'C8 - Top performer'
dict_regroup['9'] = 'C9 - Second top (ideal) performer'
dict_regroup['7'] = 'C7 - Second to Worst performer'
dict_regroup['6'] = 'C6 - Worst performer'

df_regrouped = df_all.replace({'leiden': dict_regroup})

# grouping statistics
ds = df_regrouped.groupby(['GICS Sector', 'leiden'])['log_return_1mth'].count().reset_index()
ds.columns = ['GICS Sector', 'leiden', 'count']

# plotting sunburst
fig = px.sunburst(
    ds, 
    path=[
        'GICS Sector','leiden'
    ], 
    values='count', 
    title='Sunburst chart',
    width=700,
    height=700
)
fig.show()


# ## Conclusions:-
# In this tutorial, we found that although the Cluster 8 was the top performer but it was associated with high beta companies — higher risk in stock market. Their performance were unjustifiable with statistically lower Operating Profit Factor. In comparison, Cluster 9 companies have lower beta — less risky, and its performance is justified by favourable higher Operating Profit Factor and lower Financial Risk Factors.
# 
# On the other hand, the worst performer was the Cluster 6, followed by Cluster 7. Both clusters have significantly lower solvency (current and quick ratios). Cluster 6 was also associated with higher financial risk, i.e. higher debt-to-capital/Equity. Of note, the cluster 6 was consisted mainly of the Real Estate companies. The worst performance probably justified by the on-going second wave during the current winter season, where city lockdown is widely adopted.

# REFERENCES:
# 1. https://quanp.readthedocs.io/en/latest/tutorials.html
# 2. Tabachnick & Fidell. Using Multivariate Statistics, Sixth Edition. PEARSON 2013; ISBN-13:9780205956227.
# 3. Traag et al. (2018), From Louvain to Leiden: guaranteeing well-connected communities arXiv.
# 4. Chincarini & Kim. Quantitative Equity Portfolio Management — An Active Approach to Portfolio Construction and Management. McGraw-Hill 2006; ISBN:0071459405.