Quentin Batista
The University of Tokyo
Relative valuation is a popular technique that relies on the market value of similar assets to price a given asset. For example, to price a laundry business, you might calculate the price to earnings ratio of other similar businesses and multiply this ratio by the earnings of the laundry business to obtain a valuation. Titman and Martin describe this methodology as a 4-step process (a short numeric sketch follows the steps below):
Step 1: Identify similar or comparable investments and recent market prices for each.
Step 2: Calculate a valuation metric for use in valuing the asset.
Step 3: Calculate an initial estimate of value.
Step 4: Refine or tailor your initial valuation estimate to the specific characteristics of the investment.
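As a rough illustration of steps 1 through 3, the following sketch prices the hypothetical laundry business from the example above using the average P/E of a few comparable businesses. All figures are invented for illustration; step 4 (adjusting for size, growth, risk, and so on) is a judgment call and is not shown.
# Step 1: comparable businesses with observed market prices and earnings
# (hypothetical numbers)
comparables = {'laundromat_a': {'price': 900_000, 'earnings': 100_000},
               'laundromat_b': {'price': 1_250_000, 'earnings': 125_000},
               'laundromat_c': {'price': 660_000, 'earnings': 60_000}}
# Step 2: valuation metric -- here, the average price to earnings (P/E) ratio
pe_ratios = [c['price'] / c['earnings'] for c in comparables.values()]
avg_pe = sum(pe_ratios) / len(pe_ratios)
# Step 3: initial estimate of value for the business being priced
target_earnings = 80_000
initial_estimate = avg_pe * target_earnings
print('Average comparable P/E: ' + str(round(avg_pe, 2)))
print('Initial valuation estimate: ' + str(round(initial_estimate)))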
Some of the most commonly used multiples are:
Price to earnings (P/E)
Market to book value of equity
Enterprise Value (EV) to EBITDA
Enterprise Value (EV) to revenue
Enterprise Value (EV) to cash flow
This list is not exhaustive, and it is possible to create your own multiple. This is particularly popular in the technology sector, where analysts have come up with multiples such as Enterprise Value to unique visitors or website hits. In doing so, you must ensure that the components of the multiple are consistent with each other. For example, you might consider using the price to sales ratio as a valuation multiple. However, price is an equity value while sales accrue to all capital providers, so the ratio implicitly assumes that comparable companies have identical capital structures, which is very rarely the case in practice. When this assumption is violated, the multiple becomes inconsistent because it is affected by how much debt a company carries relative to its equity. A short numeric illustration of this point follows.
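To make the consistency point concrete, here is a small sketch comparing two hypothetical firms with identical operations (same revenue and the same enterprise value) but different capital structures. The numbers are made up purely for illustration.
# Enterprise value (EV) = market value of equity + net debt
revenue = 100.0
enterprise_value = 200.0
firms = {'all_equity': {'net_debt': 0.0},
         'levered': {'net_debt': 120.0}}
for name, firm in firms.items():
    equity_value = enterprise_value - firm['net_debt']
    print(name,
          '| EV/Revenue:', round(enterprise_value / revenue, 2),
          '| Price/Sales:', round(equity_value / revenue, 2))
# EV/Revenue is 2.0 for both firms, but Price/Sales is 2.0 vs 0.8 purely
# because of the debt load -- which is why price to sales comparisons are
# only meaningful across firms with similar capital structures.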
Finally, a key step in applying this methodology is to determine which multiple is appropriate for the asset you are trying to value. For example, valuing young firms and startups using the P/E ratio is likely to be inappropriate if those firms have negative or highly volatile earnings; using the EV to sales ratio would likely give a better estimate. Additionally, it is important to realize that these multiples have different characteristics: EV to EBITDA ignores the firm's CapEx, depreciation, and capital structure, while the P/E ratio takes those into account. Using these multiples concurrently allows you to see the big picture and understand what is driving the valuation of an asset.
In this notebook, we systematize the methodology introduced above for companies in the S&P 500 using two different Machine Learning approaches. First, we replicate Gael Varoquaux's scikit-learn analysis, which extracts a graphical structure from the correlation of intraday price variations and then applies Affinity Propagation to group together stocks that behave similarly. Second, we analyze each company's latest 10-K using the Doc2Vec implementation in gensim to quantify similarity between filings. In the first case, we use the average multiple in each cluster to value each company. In the second case, we pick the top 3 comparable companies and transform the similarity measures into weights (sketched below) to compute a weighted average comparable multiple. We then apply this multiple to each company's financials to obtain a valuation.
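The weighting scheme used in the Doc2Vec case can be sketched as follows, with made-up similarity scores (the actual implementation appears in the valuation section further below): the similarities of the top 3 comparables are shifted by the score of a lower-ranked comparable and then normalized so the weights sum to one.
import numpy as np
# Hypothetical similarity scores of one company to the others,
# sorted from most to least similar
sorted_similarities = np.array([0.82, 0.75, 0.70, 0.55, 0.40])
top_3 = sorted_similarities[:3]
baseline = sorted_similarities[4]  # a lower-ranked comparable as baseline
weights = (top_3 - baseline) / np.sum(top_3 - baseline)
print(weights)  # the three weights sum to one
# The weighted comparable multiple is then np.dot(weights, top_3_multiples),
# where top_3_multiples holds the multiple (e.g. P/E) of the 3 most
# similar companies.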
The hope of using this approach is to capture relationships between companies that human analysts would not be able to perceive. Reading through the financial statements of all the companies in the S&P 500 and forming an opinion about how these companies relate to each other would be an extremely time-consuming process. As a result, most analysts on the Street cover a specific industry group, and therefore they might not realize that a company they are researching is actually very similar to a company classified in a different industry group. By using Machine Learning, we can alleviate this issue and, hopefully, gain some valuable insights about the structure of the stock market.
We scrape the list of companies in the S&P 500, along with their tickers and CIK codes, from Wikipedia. We then use the CIK code to scrape the latest 10-K from EDGAR. There are 505 companies in the Wikipedia list because some companies trade under multiple symbols (for example, Discovery Communications Inc.). A few companies' financial statements are not available on EDGAR for various reasons -- we ignore those companies. We clean the data by removing "Table of Contents" markers when they exist, as well as page numbers, line breaks, punctuation, and numbers from the statements. We scrape company fundamentals and historical prices from Yahoo! Finance. Note that we drop the companies with negative multiples to simplify valuation. While Yahoo! Finance might not be a particularly reliable source for company fundamentals, it is the best among free data sources.
!pip install numpy
!pip install pandas
!pip install beautifulsoup4
!pip install gensim
!pip install nltk
!pip install scikit-learn
!pip install bokeh
!pip install pandas_datareader
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
from collections import namedtuple
from gensim.models import doc2vec
from gensim.similarities import docsim
from gensim import corpora
import itertools
from string import punctuation, digits
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.manifold import TSNE
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import ColumnDataSource, LabelSet, Legend, HoverTool
from bokeh.palettes import all_palettes
from pandas_datareader.data import DataReader
from datetime import date
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
def table_extractor(soup):
"""Extract the tables from a soup object"""
for table in soup.find_all("table"):
table.extract()
return soup
sp_500_wiki_link = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
soup_wiki = BeautifulSoup(urlopen(sp_500_wiki_link), 'html.parser')
table_wiki = soup_wiki.find("table", {"class": "wikitable sortable"})
# Fail now if the right table hasn't been found
header = table_wiki.findAll('th')
if header[0].string != "Ticker symbol" or header[1].string != "Security":
raise Exception("Can't parse wikipedia's table!")
# Retrieve the values in the table
records = []
rows = table_wiki.findAll('tr')
for row in rows:
fields = row.findAll('td')
if fields:
# Get info and SEC company link
symbol = fields[0].string
        # The href scraped from Wikipedia already starts with "/wiki/..."
        wiki_link = "https://en.wikipedia.org" + fields[1].a.get('href')
CIK = fields[7].string
sec_company_link = "https://www.sec.gov/cgi-bin/browse-edgar?" + \
"action=getcompany&CIK=" + CIK + \
"&type=10-K&dateb=&owner=include&count=40"
name = fields[1].a.string
sector = fields[3].string
# Get link for the page with latest 10-K related filings
soup_comp = BeautifulSoup(urlopen(sec_company_link), 'html.parser')
table_comp = soup_comp.find("table", {"class": "tableFile2"})
try:
# Get the latest filing page
filings_link = "https://www.sec.gov" + \
table_comp.a.get('href')
# Get the link for the latest 10K
soup_filings = BeautifulSoup(urlopen(filings_link),
'html.parser')
table_filings = soup_filings.find("table", {"class": "tableFile"})
_10K_link = "https://www.sec.gov" + table_filings.a.get('href')
# Extracting the text from the latest 10K
try:
soup_latest_10K = BeautifulSoup(urlopen(_10K_link).read(),
'html.parser')
soup_latest_10K = table_extractor(soup_latest_10K)
_latest_10K_txt = soup_latest_10K.get_text()
except:
# If the latest 10K is not available, return N/A
_latest_10K_txt = np.nan
except:
# If the filings are not available, return N/A
_10K_link = np.nan
_latest_10K_txt = np.nan
# Append results
records.append([symbol, wiki_link, name, sector, sec_company_link,
CIK, _latest_10K_txt])
headers = ['Symbol', 'Wikipedia Link', 'Name', 'Sector', 'SEC Filings Link',
'CIK', 'Latest 10K']
data = pd.DataFrame(records, columns=headers)
# Correct ambiguous tickers for later purpose
ambiguous_tickers = ['BRK.B', 'BF.B']
corrected_tickers = ['BRK-B', 'BF-B']
for i, ticker in enumerate(ambiguous_tickers):
data['Symbol'] = data['Symbol'].replace(ticker, corrected_tickers[i])
def unit_converter(data):
billion = 1_000_000_000
million = 1_000_000
if data[-1] == 'B':
return float(data[:-1])*billion
elif data[-1] == 'M':
return float(data[:-1])*million
else:
return float(data)
items = ['Enterprise Value', 'Enterprise Value/Revenue', 'Diluted EPS',
'Enterprise Value/EBITDA', 'Revenue', 'EBITDA', 'Trailing P/E']
for i, ticker in enumerate(data['Symbol']):
key_stats_link = 'https://finance.yahoo.com/quote/' + ticker + \
'/key-statistics?p=' + ticker
key_stats_soup = BeautifulSoup(urlopen(key_stats_link).read(),
'html.parser').findAll('td')
for j, row in enumerate(key_stats_soup):
for item in items:
try:
if item == row.span.string:
# Dangerous assumption
data.loc[i, item] = \
unit_converter(key_stats_soup[j+1].string)
            except:
                # Cell has no span or no matching value -- move on
                continue
Disclaimer: This scraper makes an important assumption about the structure of the HTML file. Specifically, it assumes that the value for each item of interest is the string of text located in the 'td' tag immediately following the one where the name of the item was found, as illustrated below.
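The toy example below illustrates that assumption on a made-up HTML fragment (not the actual Yahoo! Finance markup, which may differ and can change at any time): the item's name sits in a 'span' inside one 'td', and its value is read from the 'td' that follows.
toy_html = """
<table>
  <tr><td><span>Trailing P/E</span></td><td>17.52</td></tr>
  <tr><td><span>Revenue</span></td><td>2.5B</td></tr>
</table>
"""
cells = BeautifulSoup(toy_html, 'html.parser').findAll('td')
for j, cell in enumerate(cells):
    if cell.span is not None and cell.span.string == 'Revenue':
        # The value is assumed to be the string in the next 'td' tag
        print(unit_converter(cells[j + 1].string))  # prints 2500000000.0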
from pandas_datareader.data import DataReader
from datetime import date
start = date(2013, 1, 2)
end = date.today()
data_source = 'yahoo'
# Sometimes fails -- retry if it does
historical_prices_panel = DataReader(data['Symbol'], data_source, start, end,
retry_count=5)
# Current price is last close
last_price = \
historical_prices_panel['Close'].tail(1).T.iloc[:, 0].rename('Current Price')
data = data.join(last_price, on='Symbol')
close_price = historical_prices_panel['Close']
open_price = historical_prices_panel['Open']
# Uncomment to save the data
# data.to_csv('10K_data.csv')
# historical_prices_panel['Close'].to_csv('close_price_data.csv')
# historical_prices_panel['Open'].to_csv('open_price_data.csv')
# Uncomment to read the data
# data = pd.read_csv('10K_data.csv')
# close_price = pd.read_csv('close_price_data.csv')
# open_price = pd.read_csv('open_price_data.csv')
# Remove companies without filings
no_filings_data = data[data['Latest 10K'].isnull()]
data = data[~data['Latest 10K'].isnull()]
# Remove duplicates (keep first)
data = data.drop_duplicates(keep='first')
# Drop Google duplicate
data = data[data['Symbol'] != 'GOOG']
# Drop NA rows (about 60 companies)
data = data.dropna()
# Drop companies with negative EPS or EBITDA (about 30 companies)
data = data[(data[['EBITDA', 'Diluted EPS']] > 0).all(1)]
# Reset index
data = data.reset_index(drop=True)
from collections import namedtuple
from nltk import word_tokenize
from nltk.corpus import stopwords
from string import punctuation, digits
def _10K_string_cleaner(_10K):
# Lowercase the text
_10K = _10K.lower()
stopchar = punctuation + digits + '’“”'
for ch in stopchar:
# Replace stopchar by whitespace
_10K = _10K.replace(ch, ' ')
# Tokenize
_10K = word_tokenize(_10K)
# Remove stopwords
_10K = [word for word in _10K if word not in stopwords.words('english')]
return _10K
corpus = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for i, text in enumerate(data['Latest 10K']):
corpus.append(analyzedDocument(_10K_string_cleaner(text), [i]))
Doc2Vec is an implementation of the Paragraph Vector unsupervised learning algorithm proposed by Quoc Le and Tomas Mikolov for learning fixed-length feature representations from variable-length pieces of text. It aims to overcome some of the weaknesses of the more traditional bag-of-words (BOW) methodology.
from gensim.models import doc2vec
from gensim.similarities import docsim
model_NLP = doc2vec.Doc2Vec(corpus, size=500, window=300, min_count=1,
workers=4)
import itertools
similarity_matrix_NLP = np.empty(0)
length_docvecs = len(model_NLP.docvecs)
for item in itertools.product(range(length_docvecs), range(length_docvecs)):
similarity_matrix_NLP = np.append(similarity_matrix_NLP,
model_NLP.docvecs.similarity(*item))
similarity_matrix_NLP = similarity_matrix_NLP.reshape((length_docvecs,
length_docvecs))
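As a quick sanity check on the learned document vectors, we can also query the model directly for the filings most similar to a given company. This is only an illustration of the gensim API on the model trained above; the pairwise similarity matrix computed above is what the valuation uses.
# The 3 most similar 10-K filings to the first company in `data`
for doc_id, score in model_NLP.docvecs.most_similar(positive=[0], topn=3):
    print(data['Name'][doc_id], round(score, 3))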
We use three different methodologies for visualizing the structure of the stock market. First, we create a simple scatter plot of the companies' EV/EBITDA against their P/E ratio. Second, we use the correlation matrix of intraday variations of the S&P 500 stocks from the start of 2013 to the most recent close. Third, we use the similarity matrix produced by the Doc2Vec model. In the last two cases, t-SNE is used for dimensionality reduction.
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import ColumnDataSource, LabelSet, Legend, HoverTool
from bokeh.palettes import all_palettes
output_notebook()
category_items = data['Sector'].unique()
palette = all_palettes['Viridis'][len(category_items)]
colormap = dict(zip(category_items, palette))
data['Color'] = data['Sector'].map(colormap)
TOOLS = "crosshair,pan,wheel_zoom,reset,tap,save,box_select"
source = ColumnDataSource(dict(x=data['Trailing P/E'],
y=data['Enterprise Value/EBITDA'],
color=data['Color'],
label=data['Name'],
ticker=data['Symbol'],
sector=data['Sector']))
hover = HoverTool(tooltips=[
("index", "$index"),
("name", "@label"),
("sector", "@sector"),
("ticker", "@ticker"),
("(x,y)", "($x, $y)"),
])
p = figure(tools=[TOOLS, hover], plot_width=800, plot_height=700)
labels = LabelSet(x='x', y='y', text='label', source=source,
text_font_size='8pt')
# Remove "legend='sector'," to remove legend
p.scatter(x='x', y='y', color='color', legend='sector', source=source)
p.title.text = "Current Valuation Structure"
p.title.text_font_size = "25px"
# Uncomment the following line to add labels
# p.add_layout(labels)
p.xaxis.axis_label = 'Trailing P/E'
p.yaxis.axis_label = 'EV/EBITDA'
show(p)
There is a clear positive relationship between a company's EV/EBITDA and P/E ratio. While most companies are clustered in the bottom left quadrant, there are a few outliers. We choose to keep those outliers for valuation because they still provide valuable information about companies as we will see later.
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0, perplexity=10.0)
Y = tsne.fit_transform(similarity_matrix_NLP)
plotting_df = pd.concat([data[['Name', 'Symbol', 'Sector', 'Color']],
pd.DataFrame(Y, columns=['x', 'y'])], axis=1)
source = ColumnDataSource(dict(x=plotting_df['x'],
y=plotting_df['y'],
color=plotting_df['Color'],
label=plotting_df['Name'],
ticker=plotting_df['Symbol'],
sector=plotting_df['Sector']))
hover = HoverTool(tooltips=[
("index", "$index"),
("name", "@label"),
("sector", "@sector"),
("ticker", "@ticker"),
("(x,y)", "($x, $y)"),
])
p = figure(tools=[TOOLS, hover], plot_width=800, plot_height=700)
labels = LabelSet(x='x', y='y', text='label', source=source,
text_font_size='8pt')
p.scatter(x='x', y='y', color='color', legend='sector', source=source)
p.title.text = "2D Company Similarity Visualization (NLP)"
p.title.text_font_size = "25px"
# Uncomment the following line to add labels
# p.add_layout(labels)
show(p)
While companies are mostly clustered by their industry groups, this analysis seems to capture relationships between companies which go beyond those. The fact that companies such as Activision Blizzard and Electronic Arts are close to each other suggests that the model is doing something right. However, note that it is possible that the model relies on trivial factors, for example, attributing a high similarity score to companies whose financial statements have the same auditor (say, because the writing style is similar).
from sklearn import cluster, covariance
# This code was adapted from Gael Varoquaux's work (see references)
# Calculate intraday variation
variation_df = close_price - open_price
variation_df = variation_df.T.reindex(data['Symbol'])
# Get name, sector and color from the data dataframe
variation_df = \
data[['Symbol', 'Name', 'Sector', 'Color']].join(variation_df, on='Symbol')
# Drop rows with NAs
variation_df = variation_df.dropna(axis=0)
# Data for the model
var_data = variation_df.drop(['Symbol', 'Name', 'Sector', 'Color'], axis=1).T
# Learn a graphical structure from the correlations
edge_model = covariance.GraphLassoCV()
# Standardize the time series: using correlations rather than covariance is
# more efficient for structure recovery
var_data /= var_data.std(axis=0)
edge_model.fit(var_data)
# Cluster using affinity propagation
_, labels = cluster.affinity_propagation(edge_model.covariance_)
n_labels = labels.max()
variation_df['Cluster'] = np.nan
for i in range(n_labels + 1):
variation_df.loc[labels == i, 'Cluster'] = i + 1
print('Cluster %i: %s' % ((i + 1),
', '.join(variation_df['Name'][labels == i])))
# Find a low-dimension embedding for visualization: find the best position of
# the nodes (the stocks) on a 2D plane
embedding = tsne.fit_transform(var_data.T)
# Display a graph of the partial correlations
partial_correlations = edge_model.precision_.copy()
d = 1 / np.sqrt(np.diag(partial_correlations))
partial_correlations *= d
partial_correlations *= d[:, np.newaxis]
non_zero = (np.abs(np.triu(partial_correlations, k=1)) > 0.06)
# Plot the edges
start_idx, end_idx = np.where(non_zero)
segments = [[embedding.T[:, start], embedding.T[:, stop]]
for start, stop in zip(start_idx, end_idx)]
values = np.abs(partial_correlations[non_zero])
d1 = variation_df[['Name', 'Symbol', 'Sector', 'Color']].reset_index(drop=True)
d2 = pd.DataFrame(embedding, columns=['x', 'y'])
plotting_df = pd.concat([d1, d2], axis=1)
source = ColumnDataSource(dict(x=plotting_df['x'],
y=plotting_df['y'],
color=plotting_df['Color'],
label=plotting_df['Name'],
ticker=plotting_df['Symbol'],
sector=plotting_df['Sector']))
hover = HoverTool(tooltips=[
("index", "$index"),
("name", "@label"),
("sector", "@sector"),
("ticker", "@ticker"),
("(x,y)", "($x, $y)"),
])
p = figure(tools=[TOOLS, hover], plot_width=800, plot_height=700)
labels = LabelSet(x='x', y='y', text='label', source=source,
text_font_size='8pt')
p.scatter(x='x', y='y', color='color', legend='sector', source=source)
p.title.text = "2D Company Similarity Visualization (Correlation)"
p.title.text_font_size = "25px"
# Comment this line to remove segments
p.segment(*np.reshape(np.array(segments).flatten(), (len(segments), 4)).T)
# Uncomment the following line to add labels
# p.add_layout(labels)
show(p)
Note: Segments represent partial correlation coefficients whose absolute value exceeds 0.06.
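For reference, the quantity plotted above is, up to sign, the partial correlation implied by the estimated precision matrix; the sign does not matter here because the threshold is applied to the absolute value. A minimal sketch of the same computation for a single pair of stocks, using the edge_model fitted above:
# Partial correlation between stocks i and j from the precision matrix Theta:
#     rho_ij = -Theta_ij / sqrt(Theta_ii * Theta_jj)
# The code above computes Theta_ij / sqrt(Theta_ii * Theta_jj), i.e. the same
# matrix without the minus sign, which leaves the absolute value unchanged.
theta = edge_model.precision_
i, j = 0, 1  # any pair of stocks
rho_ij = -theta[i, j] / np.sqrt(theta[i, i] * theta[j, j])
print(round(rho_ij, 4))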
Here, companies are much more clearly clustered by industry groups. For example, one can see that airline companies (American Airlines Group, Delta Air Lines, etc.) are clustered together and connected to each other.
# Subtract identity matrix from similarity matrix to avoid selecting the
# original company
sim_mat = similarity_matrix_NLP - np.eye(len(similarity_matrix_NLP))
def valuation_calculator(index, data, multiple, similarity_matrix):
    """Value company `index` using a weighted average of the multiples of
    its 3 most similar comparables."""
    sorted_similarity_array = -np.sort(-similarity_matrix[index])
    sorted_similarity_indices = (-similarity_matrix[index]).argsort()
    top_3_comps = sorted_similarity_array[:3]
    # Use the similarity of a lower-ranked comparable as a baseline
    baseline_comp = sorted_similarity_array[4]
    normalized_weights = ((top_3_comps - baseline_comp) /
                          sum(top_3_comps - baseline_comp))
    top_3_multiples = data[multiple][sorted_similarity_indices[:3]]
    weighted_multiple = np.dot(normalized_weights, top_3_multiples)
    if multiple == 'Trailing P/E':
        valuation = weighted_multiple * data['Diluted EPS'][index]
    elif multiple == 'Enterprise Value/EBITDA':
        valuation = weighted_multiple * data['EBITDA'][index]
    elif multiple == 'Enterprise Value/Revenue':
        valuation = weighted_multiple * data['Revenue'][index]
    return valuation
valuation_df_nlp = data[['Name', 'Symbol', 'Sector', 'Color', 'Current Price',
'Enterprise Value']].copy()
for i, company in enumerate(data['Name']):
valuation_df_nlp.loc[i, 'Valuation (P/E)'] = \
valuation_calculator(i, data, 'Trailing P/E', sim_mat)
valuation_df_nlp.loc[i, 'Valuation (EV/EBITDA)'] = \
valuation_calculator(i, data, 'Enterprise Value/EBITDA', sim_mat)
valuation_df_nlp.loc[i, 'Valuation (EV/Revenue)'] = \
valuation_calculator(i, data, 'Enterprise Value/Revenue', sim_mat)
valuation_df_nlp['% Over/Undervaluation (EV/Revenue)'] = \
(valuation_df_nlp['Valuation (EV/Revenue)'] /
valuation_df_nlp['Enterprise Value']-1)*100
valuation_df_nlp['% Over/Undervaluation (EV/EBITDA)'] = \
(valuation_df_nlp['Valuation (EV/EBITDA)'] /
valuation_df_nlp['Enterprise Value']-1)*100
valuation_df_nlp['% Over/Undervaluation (P/E)'] = \
(valuation_df_nlp['Valuation (P/E)'] / valuation_df_nlp['Current Price']-1)*100
valuation_df_nlp['% Over/Undervaluation (EV/Revenue)'].describe()
valuation_df_nlp['% Over/Undervaluation (EV/EBITDA)'].describe()
valuation_df_nlp['% Over/Undervaluation (P/E)'].describe()
# Add the price on the last day of 2016 to the dataframe
last_day_2016 = close_price.loc['2016-12-30', :].rename('2016-12-30 Price')
valuation_df_nlp = valuation_df_nlp.join(last_day_2016, on='Symbol')
# Calculates the percentage of correct predictions year to date
valuation_df_nlp['Actual Change'] = \
valuation_df_nlp['Current Price'] > valuation_df_nlp['2016-12-30 Price']
valuation_df_nlp['Return'] = \
valuation_df_nlp['Current Price'] / valuation_df_nlp['2016-12-30 Price'] - 1
valuation_df_nlp['Prediction'] = \
valuation_df_nlp['Valuation (P/E)'] > valuation_df_nlp['2016-12-30 Price']
avg_return = \
np.mean(valuation_df_nlp['Prediction'].replace(0, -1) *
valuation_df_nlp['Return'])
outcome = (valuation_df_nlp['Prediction'] == valuation_df_nlp['Actual Change'])
correct_pred = sum(outcome) / len(outcome)
print('Percentage of correct predictions YTD (NLP): ' + str(correct_pred))
print('Average Return: ' + str(avg_return))
Disclaimer: While the system appears to be promising, there are many reasons, luck aside, why these results should not be taken at face value. First, we use the last price of 2016 as a reference, but the financial statements are published later even though they reflect the state of the company at the end of the year. Additionally, many more tests would need to be performed to evaluate the quality of these signals, which is beyond the aim of this notebook. As such, you should not consider using these results for making investment decisions.
source = \
ColumnDataSource(dict(x=valuation_df_nlp['% Over/Undervaluation (P/E)'],
y=valuation_df_nlp['% Over/Undervaluation (EV/EBITDA)'],
color=valuation_df_nlp['Color'],
label=valuation_df_nlp['Name'],
ticker=valuation_df_nlp['Symbol'],
sector=valuation_df_nlp['Sector']))
hover = HoverTool(tooltips=[
("index", "$index"),
("name", "@label"),
("sector", "@sector"),
("ticker", "@ticker"),
("(x,y)", "($x, $y)"),
])
p = figure(tools=[TOOLS, hover], plot_width=800, plot_height=700)
labels = LabelSet(x='x', y='y', text='label', source=source,
text_font_size='8pt')
# Remove "legend='sector'," to remove legend
p.scatter(x='x', y='y', color='color', legend='sector', source=source)
p.title.text = "Over/Undervaluation (NLP) Based On..."
p.title.text_font_size = "25px"
# Uncomment the following line to add labels
# p.add_layout(labels)
p.xaxis.axis_label = 'Trailing P/E'
p.yaxis.axis_label = 'EV/EBITDA'
show(p)
A positive value means that the company should be trading at a higher value (as of the end of 2016). Overall, there appears to be a positive relationship between the valuations based on these two multiples. One caveat, however, is the limited number of companies. Some companies in the S&P 500 have no good comparables, and therefore the percentage of over/undervaluation suggested by the model should not be taken at face value. For example, companies such as Netflix and Amazon appear highly overvalued based on this model. However, this is because idiosyncratic factors, which are not captured by this methodology, have a large influence on these companies. In fact, on an EV/Revenue basis, the model actually suggests that Amazon is undervalued. Conducting this analysis with a much larger dataset would partially alleviate this issue. An interesting observation is that Nvidia, which has performed very well recently due to developments in the GPU space, is still not as "overvalued" as Netflix on the basis of those two multiples.
cols_to_be_used = ['Name', 'Symbol', 'Sector', 'Color', 'Current Price',
'Enterprise Value', 'Enterprise Value/Revenue',
'Enterprise Value/EBITDA', 'Trailing P/E']
df1 = variation_df[['Name', 'Cluster']].set_index('Name')
val_df_corr = data[cols_to_be_used].join(df1, on='Name')
val_df_corr = val_df_corr.dropna()
for i in range(n_labels + 2):
val_df_corr.loc[val_df_corr['Cluster'] == i, 'Cluster P/E'] = \
np.mean(val_df_corr[val_df_corr['Cluster'] == i]['Trailing P/E'])
val_df_corr.loc[val_df_corr['Cluster'] == i, 'Cluster EV/EBITDA'] = \
np.mean(val_df_corr[val_df_corr['Cluster'] == i]
['Enterprise Value/EBITDA'])
val_df_corr.loc[val_df_corr['Cluster'] == i, 'Cluster EV/Revenue'] = \
np.mean(val_df_corr[val_df_corr['Cluster'] == i]
['Enterprise Value/Revenue'])
val_df_corr['% Over/Undervaluation (EV/Revenue)'] = \
(val_df_corr['Enterprise Value/Revenue'] /
val_df_corr['Cluster EV/Revenue']-1)*100
val_df_corr['% Over/Undervaluation (EV/EBITDA)'] = \
(val_df_corr['Enterprise Value/EBITDA'] /
val_df_corr['Cluster EV/EBITDA'] - 1) * 100
val_df_corr['% Over/Undervaluation (P/E)'] = \
(val_df_corr['Trailing P/E'] / val_df_corr['Cluster P/E'] - 1) * 100
val_df_corr['% Over/Undervaluation (EV/Revenue)'].round(3).describe()
val_df_corr['% Over/Undervaluation (EV/EBITDA)'].round(3).describe()
val_df_corr['% Over/Undervaluation (P/E)'].round(3).describe()
source = \
ColumnDataSource(dict(x=val_df_corr['% Over/Undervaluation (P/E)'],
y=val_df_corr['% Over/Undervaluation (EV/EBITDA)'],
color=val_df_corr['Color'],
label=val_df_corr['Name'],
ticker=val_df_corr['Symbol'],
sector=val_df_corr['Sector']))
hover = HoverTool(tooltips=[
("index", "$index"),
("name", "@label"),
("sector", "@sector"),
("ticker", "@ticker"),
("(x,y)", "($x, $y)"),
])
p = figure(tools=[TOOLS, hover], plot_width=800, plot_height=700)
labels = LabelSet(x='x', y='y', text='label', source=source,
text_font_size='8pt')
# Remove "legend='sector'," to remove legend
p.scatter(x='x', y='y', color='color', legend='sector', source=source)
p.title.text = "Over/Undervaluation (Correlation) Based On..."
p.title.text_font_size = "25px"
# Uncomment the following line to add labels
# p.add_layout(labels)
p.xaxis.axis_label = 'Trailing P/E'
p.yaxis.axis_label = 'EV/EBITDA'
show(p)
This methodology gives valuations which are much less extreme than the NLP approach. One potential explanation for this is that clusters are closer to the traditional industry groups, and therefore, to how the market values these companies.
These results were obtained as of:
date.today()
Titman, Sheridan and Martin, John D. Valuation: the Art and Science of Corporate Investment Decisions. Prentice Hall, 2015. Print.
Quoc Le and Tomas Mikolov. Distributed Representations of Sentences and Documents. http://arxiv.org/pdf/1405.4053v2.pdf
Laurens van der Maaten and Geoffrey Hinton. Visualizing Data using t-SNE. http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf
Gael Varoquaux. Visualizing the Stock Market Structure. http://scikit-learn.org/stable/auto_examples/applications/plot_stock_market.html#sphx-glr-auto-examples-applications-plot-stock-market-py