#!/usr/bin/env python
# coding: utf-8

# # Financial Network
# 
# **Author**: [Erika Fille Legara](http://www.erikalegara.net/)
# 
# You are free to use (or change) this notebook for any purpose you'd like. However, please respect the MIT License that governs its use, and for copying permission.
# 
# Copyright © 2016 Erika Fille Legara
# 
# ---
# ## Description
# I have been receiving requests to release the Python code I wrote to produce the financial network discussed in my blog post at [erikafille.ph](http://erikafille.ph) titled [PSE Correlation-Based Network](https://erikafille.ph/2016/08/16/pse-correlation-based-network/). Here it is. 
# 
# In this notebook, we build a correlation-based network (minimum spanning tree (MST)) of companies listed in the [Philippine Stock Exchange](https://www.pse.com.ph/stockMarket/home.html). The entire process can be summarized into three steps:
# 1. Set up the **correlation matrix** for the stock prices in the Philippine Stock Exchange.
# 2. Convert the resulting correlation matrix into a **distance matrix**.
# 3. Build a **minimum spanning tree** from the distance matrix.
# 
# ### Data
# I got the link to the data from this [stock forum post](http://www.stockmarketpilipinas.com/thread-500.html). In this post it says [Drop Box: 2006-present worth of CSV files uploaded by Mr. Coelacanth](https://www.dropbox.com/sh/1dluf0lawy9a7rm/fHREemAjWS). The last I checked, the 2016 historical dataset is only until the 15th of July. 
# 
# ---
# ## Let's Dig In
# Import the necessary packages.

# In[1]:


from __future__ import division

try:
    import networkx as nx
    import pandas as pd
    import os
    import matplotlib.pyplot as plt
    import seaborn as sns
    import math
    import numpy as np
    from datetime import datetime
    get_ipython().run_line_magic('matplotlib', 'inline')
    
except:
    import traceback
    traceback.print_exc()
    raise ImportError('Something failed, see above.')


# ### Load Data
# We first load the list of companies in the PSE and store it in a Pandas data frame we call `pse_companies`.

# In[2]:


pse_companies = pd.read_csv("PSE-listed-companies.csv")
pse_companies = pse_companies[["Company Name", "Stock Symbol", "Sector", "Subsector", "Listing Date"]]
pse_companies.head()


# Then, load all files under the folder "2016" to load all 2016 data. Each file inside a folder contains the prices of all stock quotes in the PSE for the day (as indicated in the filename). As mentioned, the dataset I have for the year 2016 only runs until the 15th of July. 
# 
# The cell block below loads all files in the directory `"./2016/"`.

# In[3]:


files2016 = os.listdir("./2016/")


# Let's now explore the content of a file (stock prices for 1 day) inside the directory (folder). 
# 
# *I am not exactly sure what the last two columns are, so I'm assigning them the variables X1 and X2, respectively.*

# In[4]:


print "Day: ", files2016[0]
df0 = pd.read_csv("./2016/" + files2016[0], header=None)
df0.head()


# In[5]:


df0.columns = ["Company_Index", "Date", "Open", "High", "Low", "Close", "X1", "X2"]
df0.head()


# After my initial data exploration, I made a list of companies I am discarding in this analysis. I am also excluding the indices. In any case, you may add or delete stocks to your liking.

# In[6]:


pse_comp = list(pse_companies["Stock Symbol"])
discard = ['UW', 'VMC', 'VVT', 'PRIM', 'MJIC', 'MACAY', 'PMT', 'REG', 'ROX', \
           'RCI', 'SPC', 'SPM', 'STR', 'STN', 'SRDC', 'SGP', 'MAH', 'MGH', \
           'NXGEN', 'PCP', 'PMPC', 'PAX', 'PHC', 'H2O', 'PNC', 'PRC', 'PTT', \
           'PTC', 'PORT', 'GPH', 'GREEN', 'KPH', 'LMG', 'LSC', 'CHP', 'CAT', \
           'CIP', 'CSB', 'DWC', 'ECP', 'EVER', 'EIBA', 'FEU', 'FFI', 'FYN', \
           'FAF', 'ABC', 'AAA', 'ATI', 'AB', 'BH', 'CHI', 'CPV', "BCOR"]
companies = [company for company in pse_comp if company not in discard ]
print companies


# We then filter the dataframe to companies listed in `companies`. We also drop the other columns and only retain the following  `["Company_Index", "Date", "Close"]` columns

# In[7]:


allprices = pd.DataFrame()
for f in files2016:
    df = pd.read_csv("./2016/" + f, header=None)
    df.columns = ["Company_Index", "Date", "Open", "High", "Low", "Close", "Volume1", "Volume2"]
    df = df[df.Company_Index.isin(companies)]
    df.Date = pd.to_datetime(df.Date)
    df = df[["Company_Index", "Date", "Close"]]
    allprices = pd.concat([allprices,df], ignore_index = True)


# In[8]:


allprices.head()


# In[9]:


## List of all comapnies we are considering in this notebook

print set(list(allprices.Company_Index))


# In the variable `subset` below, we further filter the dataset by date range. Here, I only want to look at the prices from January 01, 2016 to the end of July. I then reshape the data frame `subset` and store it in the variable `final_df` where the columns are the assets and the rows are the prices of the assets on a particular day.

# In[10]:


subset = allprices[(allprices.Date > datetime(2016,1,1,0,0,0)) & (allprices.Date < datetime(2016,7,31,23,59,59))]
final_df = subset.pivot(index="Date", columns="Company_Index", values="Close")


# In[11]:


final_df.head()


# ---
# ### Step 1: Build Correlation Matrix
# From the data frame above, we now build the **correlation matrix**.
# <blockquote>In finance/stock trading, **correlation** is just a measure of the extent at which two equities behave with respect to each other.</blockquote>
# Below, we build the correlation matrix from `final_df` and store it in the variable `price_corr`. The matrix provides us with the corresponding **correlation coefficients (-1.0 to +1.0)** for all stock pairs in the list of companies. [Straightforward](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html), isn't it? 

# In[12]:


price_corr = final_df.corr()


# When two assets are *positively correlated*, it means that the general trends of the two stocks are similar; when one goes up, the other one goes up as well. When, on the other hand, two assets are *negatively correlated*, their trends go in opposite directions. 
# 
# I made a quick sketch below to illustrate these relationships. 
# 
# ![alt text](./img/corr.png "Correlation")
# 
# For a more concrete example, let us have a look at the relationships between (`$ABS` and `$ACR`), (`$FMETF` and `$ALI`) and (`$BLOOM` and `$BDO`); the `$` sign is the hashtag used for stock quotes. I chose these securities based on the correlation values of the pairs in `price_corr`. Can you tell which of the pairs are positively correlated? Negatively correlation?

# In[13]:


## ABS and ACR
plt.figure(figsize=(8,5))
plt.plot(final_df['ABS'], final_df["ACR"], '.', markersize=10)
plt.xlabel("Price of $ABS (PhP)")
plt.ylabel("Price of $ACR (PhP)")
_ = plt.show()


# In[14]:


## FMETF and ALI

plt.figure(figsize=(8,5))
plt.plot(final_df['FMETF'], final_df["ALI"], '.', markersize=10)
plt.xlabel("Price of $FMETF (PhP)")
plt.ylabel("Price of $ALI (PhP)")
_ = plt.show()


# In[15]:


## BLOOM and BDO

plt.figure(figsize=(8,5))
plt.plot(final_df['BLOOM'], final_df["BDO"], '.', markersize=10)
plt.xlabel("Price of $BLOOM (PhP)")
plt.ylabel("Price of $BDO (PhP)")
_ = plt.show()


# Below, we draw the heatmap of the resulting correlation matrix.

# In[16]:


## Source: https://stanford.edu/~mwaskom/software/seaborn/examples/many_pairwise_correlations.html

## Generate a mask for the upper triangle
mask = np.zeros_like(price_corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

## Set up the matplotlib figure
f, ax = plt.subplots(figsize=(15, 15))

## Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

## Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(price_corr, mask=mask, cmap=cmap, vmax=.3,
            square=True, xticklabels=2, yticklabels=2,
            linewidths=.5, cbar_kws={"shrink": .5}, ax=ax)

_ = plt.show()


# ---
# ### Step 2: Build Distance Matrix
# As mentioned in the blog post, we will use two distance metrics in building the distance matrices. The first metric is from Bonanno et al.
# 
# \begin{equation}d_{ij} = \sqrt{2 \times (1 - c_{ij})}\end{equation}
# 
# where $c_{ij}$ is the correlation cofficient of stocks $i$ and $j$. In the equation, when $c_{ij}=1$, $d_{ij}=0$; and, when $c_{ij}=-1$, $d_{ij}=2$. That is, when there is a perfectly positive correlation (+1), the distance is 0; and, when the correlation is perfectly negative, the distance is the farthest at 2. The next distance measure is from [SliceMatrix](http://SliceMatrix.com) (mktstk) and it is given by
# 
# \begin{equation}d_{ij} = 1 - |c_{ij}|.\end{equation}
# 
# This equation does not distinguish between a positively or a negatively correlated pair of stocks; as long as two stocks are highly correlated, the distance is minimized. 
# 
# Here, we define the distance matrices as `dist_bonanno` and `dist_mktstk`.

# In[17]:


dist_bonanno = np.sqrt(2*(1-(price_corr)))
dist_mktstk = 1-abs(price_corr)

## I am just defining the labels
labs_bonanno = list(dist_bonanno.index)
labs_mktstk = list(dist_mktstk.index)


# ---
# ### Step 3: Build Minimum Spanning Tree (MST)
# 
# Now, we are ready to build the minimum spanning tree. The idea is to connect the ones that have the closest distance to each other, i.e. connect those that are highly correlated. Let's first build the "weighted" networks `G_bonanno` and `G_mktstk` from the distance matrices `dist_bonanno` and `dist_mktstk`, respectively. Using the Python package NetworkX, that's pretty straightforward to do.

# In[18]:


G_bonanno = nx.from_numpy_matrix(dist_bonanno.as_matrix())
G_mktstk = nx.from_numpy_matrix(dist_mktstk.as_matrix())


# Once we have the distance networks, we can already build minimum spanning trees (MST). Here, we use Kruskal's algorithm. Below is the pseudo-code copied from the Wikipedia entry on the algorithm.
# 
# <pre>
# KRUSKAL(G):
# 1 A = ∅
# 2 foreach v ∈ G.V:
# 3    MAKE-SET(v)
# 4 foreach (u, v) in G.E ordered by weight(u, v), increasing:
# 5    if FIND-SET(u) ≠ FIND-SET(v):
# 6       A = A ∪ {(u, v)}
# 7       UNION(u, v)
# 8 return A
# </pre>
# 
# Again, we can use NetworkX to build the MST with the graphs as inputs.

# In[19]:


MST_b = nx.minimum_spanning_tree(G_bonanno)  
MST_m = nx.minimum_spanning_tree(G_mktstk) 


# Finally, let's add more attributes to the "nodes" or the stocks. The attributes that I want to include here are: 
# 
# - `label` (the stock symbol)
# - `sector` (which sector the stock belongs)
# - `change` (the $\%$ change of the stock for the period under study)
# 
# This way, when we draw the MSTs, we can choose to color the nodes by either `sector` or `change`.

# In[20]:


change = (final_df.iloc[-1] - final_df.iloc[0]) * 100 / final_df.iloc[0]


# In[21]:


for node in MST_b.nodes():
    sector = pse_companies[pse_companies["Stock Symbol"] == labs_bonanno[node]].Sector.iloc[0]
    MST_b.node[node]["sector"] = sector
    MST_b.node[node]["label"] = labs_bonanno[node]
    if math.isnan(change[labs_bonanno[node]]):
        MST_b.node[node]["color"] = "black"
    elif change[labs_bonanno[node]] < -10:
        MST_b.node[node]["color"] = "red"
    elif change[labs_bonanno[node]] > 10:
        MST_b.node[node]["color"] = "green"
    else:
        MST_b.node[node]["color"] = "blue"


# In[22]:


for node in MST_m.nodes():
    sector = pse_companies[pse_companies["Stock Symbol"] == labs_mktstk[node]].Sector.iloc[0]
    MST_m.node[node]["sector"] = sector
    MST_m.node[node]["label"] = labs_mktstk[node]
    if math.isnan(change[labs_mktstk[node]]):
        #print change[labs_mktstk[node]], labs_mktstk[node]
        #Gm.node[node]["change"] = 101
        MST_m.node[node]["color"] = "black"
    elif change[labs_mktstk[node]] < -10:
        MST_m.node[node]["color"] = "red"
    elif change[labs_mktstk[node]] > 10:
        MST_m.node[node]["color"] = "green"
    else:
        MST_m.node[node]["color"] = "blue"


# ### Drawing the MSTs

# In[23]:


plt.figure(figsize=(10,10))
nx.draw_networkx(MST_b)


# In[24]:


plt.figure(figsize=(10,10))
nx.draw_networkx(MST_m)


# ### Write out MSTs
# Below, we write the MSTs as `gexf` files so we can use them in [Gephi](https://gephi.org/) (open-source and free) to generate much prettier networks/trees. 
# 
# <blockquote>Gephi is the leading visualization and exploration software for all kinds of graphs and networks. Gephi is open-source and free.</blockquote>
# 

# In[25]:


nx.write_gexf(MST_b, "corrmat_bonanno.gexf")
nx.write_gexf(MST_m, "corrmat_mktstk.gexf")


# Below is the resulting network (`MST_b`) drawn using Gephi.
# ![MST Bonanno](./img/mst_bonanno.png "MST Bonanno")

# In[ ]: