Welcome to your second data project! In this portfolio project we will be looking at data from the stock market, particularly some technology stocks. We will learn how to use pandas to get stock information, visualize different aspects of it, and finally we will look at a few ways of analyzing the risk of a stock, based on its previous performance history. We will also be predicting future stock prices through a Monte Carlo method!
We'll be answering the following questions along the way:
1.) What was the change in price of the stock over time?
2.) What was the daily return of the stock on average?
3.) What was the moving average of the various stocks?
4.) What was the correlation between different stocks' closing prices?
4.) What was the correlation between different stocks' daily returns?
5.) How much value do we put at risk by investing in a particular stock?
6.) How can we attempt to predict future stock behavior?
In this section we'll go over how to handle requesting stock information with pandas, and how to analyze basic attributes of a stock.
#Let's go ahead and start with some imports
import pandas as pd
from pandas import Series,DataFrame
import numpy as np
# For Visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline
# For reading stock data from yahoo
from pandas.io.data import DataReader
# For time stamps
from datetime import datetime
# For division
from __future__ import division
Let's use Yahoo and pandas to grab some data for some tech stocks.
# The tech stocks we'll use for this analysis
tech_list = ['AAPL','GOOG','MSFT','AMZN']
# Set up End and Start times for data grab
end = datetime.now()
start = datetime(end.year - 1,end.month,end.day)
#For loop for grabing yahoo finance data and setting as a dataframe
for stock in tech_list:
# Set DataFrame as the Stock Ticker
globals()[stock] = DataReader(stock,'yahoo',start,end)
Quick note: Using globals() is a sloppy way of setting the DataFrame names, but its simple
Let's go ahead and play aorund with the AAPL DataFrame to get a feel for the data
# Summary Stats
AAPL.describe()
Open | High | Low | Close | Volume | Adj Close | |
---|---|---|---|---|---|---|
count | 252.000000 | 252.000000 | 252.000000 | 252.000000 | 2.520000e+02 | 252.000000 |
mean | 192.191905 | 193.834008 | 190.688413 | 192.401905 | 5.776950e+07 | 102.697817 |
std | 182.130756 | 183.650776 | 181.165264 | 182.732488 | 2.409724e+07 | 14.860106 |
min | 90.210000 | 90.700000 | 89.650000 | 90.280000 | 1.447960e+07 | 72.620000 |
25% | 99.390000 | 100.672500 | 98.527500 | 99.620000 | 4.220820e+07 | 92.690000 |
50% | 111.045000 | 112.475000 | 109.190000 | 110.300000 | 5.325340e+07 | 100.360000 |
75% | 127.242500 | 128.130000 | 125.677500 | 126.777500 | 6.793232e+07 | 113.480000 |
max | 649.900000 | 651.260000 | 644.470000 | 647.350000 | 1.899779e+08 | 133.000000 |
# General Info
AAPL.info()
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 252 entries, 2014-04-03 00:00:00 to 2015-04-02 00:00:00 Data columns (total 6 columns): Open 252 non-null float64 High 252 non-null float64 Low 252 non-null float64 Close 252 non-null float64 Volume 252 non-null int64 Adj Close 252 non-null float64 dtypes: float64(5), int64(1) memory usage: 13.8 KB
Now that we've seen the DataFrame, let's go ahead and plot out the volume and closing price of the stocks
# Let's see a historical view of the closing price
AAPL['Adj Close'].plot(legend=True,figsize=(10,4))
<matplotlib.axes._subplots.AxesSubplot at 0x4269cf8>
# Now let's plot the total volume of stock being traded each day over the past 5 years
AAPL['Volume'].plot(legend=True,figsize=(10,4))
<matplotlib.axes._subplots.AxesSubplot at 0x19007b38>
Now that we've seen the visualizations for the closing price and the volume traded each day, let's go ahead and caculate the moving average for the stock.
For more info on the moving average check out the following links:
1.) http://www.investopedia.com/terms/m/movingaverage.asp
2.) http://www.investopedia.com/articles/active-trading/052014/how-use-moving-average-buy-stocks.asp
# Luckily pandas has a built-in rolling mean calculator
# Let's go ahead and plot out several moving averages
ma_day = [10,20,50]
for ma in ma_day:
column_name = "MA for %s days" %(str(ma))
AAPL[column_name]=pd.rolling_mean(AAPL['Adj Close'],ma)
Now let's go ahead and plot all the additional Moving Averages
AAPL[['Adj Close','MA for 10 days','MA for 20 days','MA for 50 days']].plot(subplots=False,figsize=(10,4))
<matplotlib.axes._subplots.AxesSubplot at 0x192582b0>
Now that we've done some baseline analysis, let's go ahead and dive a little deeper. We're now going to analyze the risk of the stock. In order to do so we'll need to take a closer look at the daily changes of the stock, and not just its absolute value. Let's go ahead and use pandas to retrieve teh daily returns for the Apple stock.
# We'll use pct_change to find the percent change for each day
AAPL['Daily Return'] = AAPL['Adj Close'].pct_change()
# Then we'll plot the daily return percentage
AAPL['Daily Return'].plot(figsize=(12,4),legend=True,linestyle='--',marker='o')
<matplotlib.axes._subplots.AxesSubplot at 0x1b30a630>
Great, now let's get an overall look at the average daily return using a histogram. We'll use seaborn to create both a histogram and kde plot on the same figure.
# Note the use of dropna() here, otherwise the NaN values can't be read by seaborn
sns.distplot(AAPL['Daily Return'].dropna(),bins=100,color='purple')
# Could have also done:
#AAPL['Daily Return'].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x1b3a5da0>
Now what if we wanted to analyze the returns of all the stocks in our list? Let's go ahead and build a DataFrame with all the ['Close'] columns for each of the stocks dataframes.
# Grab all the closing prices for the tech stock list into one DataFrame
closing_df = DataReader(['AAPL','GOOG','MSFT','AMZN'],'yahoo',start,end)['Adj Close']
# Let's take a quick look
closing_df.head()
AAPL | AMZN | GOOG | MSFT | |
---|---|---|---|---|
Date | ||||
2014-04-03 | 75.54 | 333.62 | 569.74 | 39.93 |
2014-04-04 | 74.56 | 323.00 | 543.14 | 38.82 |
2014-04-07 | 73.39 | 317.76 | 538.15 | 38.75 |
2014-04-08 | 73.38 | 327.07 | 554.90 | 38.77 |
2014-04-09 | 74.35 | 331.81 | 564.14 | 39.41 |
Now that we have all the closing prices, let's go ahead and get the daily return for all the stocks, like we did for the Apple stock.
# Make a new tech returns DataFrame
tech_rets = closing_df.pct_change()
Now we can compare the daily percentage return of two stocks to check how correlated. First let's see a sotck compared to itself.
# Comparing Google to itself should show a perfectly linear relationship
sns.jointplot('GOOG','GOOG',tech_rets,kind='scatter',color='seagreen')
<seaborn.axisgrid.JointGrid at 0x1b992160>
So now we can see that if two stocks are perfectly (and positivley) correlated with each other a linear relationship bewteen its daily return values should occur. So let's go ahead and compare Google and Microsoft the same way.
# We'll use joinplot to compare the daily returns of Google and Microsoft
sns.jointplot('GOOG','MSFT',tech_rets,kind='scatter')
<seaborn.axisgrid.JointGrid at 0x1bc48eb8>
Intersting, the pearsonr value (officially known as the Pearson product-moment correlation coefficient) can give you a sense of how correlated the daily percentage returns are. You can find more information about it at this link:
url - http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
But for a quick intuitive sense, check out the picture below.
from IPython.display import SVG
SVG(url='http://upload.wikimedia.org/wikipedia/commons/d/d4/Correlation_examples2.svg')
Seaborn and pandas make it very easy to repeat this comparison analysis for every possible combination of stocks in our technology stock ticker list. We can use sns.pairplot() to automatically create this plot
# We can simply call pairplot on our DataFrame for an automatic visual analysis of all the comparisons
sns.pairplot(tech_rets.dropna())
<seaborn.axisgrid.PairGrid at 0x1c2e9518>
Above we can see all the relationships on daily returns between all the stocks. A quick glance shows an interesting correlation between Google and Amazon daily returns. It might be interesting to investigate that individual comaprison. While the simplicity of just calling sns.pairplot() is fantastic we can also use sns.PairGrid() for full control of the figure, including what kind of plots go in the diagonal, the upper triangle, and the lower triangle. Below is an example of utilizing the full power of seaborn to achieve this result.
# Set up our figure by naming it returns_fig, call PairPLot on the DataFrame
returns_fig = sns.PairGrid(tech_rets.dropna())
# Using map_upper we can specify what the upper triangle will look like.
returns_fig.map_upper(plt.scatter,color='purple')
# We can also define the lower triangle in the figure, inclufing the plot type (kde) or the color map (BluePurple)
returns_fig.map_lower(sns.kdeplot,cmap='cool_d')
# Finally we'll define the diagonal as a series of histogram plots of the daily return
returns_fig.map_diag(plt.hist,bins=30)
We could have also analyzed the correlation of the closing prices using this exact same technique. Here it is shown, the code repeated from above with the exception of the DataFrame called.
# Set up our figure by naming it returns_fig, call PairPLot on the DataFrame
returns_fig = sns.PairGrid(closing_df)
# Using map_upper we can specify what the upper triangle will look like.
returns_fig.map_upper(plt.scatter,color='purple')
# We can also define the lower triangle in the figure, inclufing the plot type (kde) or the color map (BluePurple)
returns_fig.map_lower(sns.kdeplot,cmap='cool_d')
# Finally we'll define the diagonal as a series of histogram plots of the closing price
returns_fig.map_diag(plt.hist,bins=30)
Finally, we could also do a correlation plot, to get actual numerical values for the correlation between the stocks' daily return values. By comparing the closing prices, we see an interesting relationship between Microsoft and Apple.
# Let's go ahead and use sebron for a quick correlation plot for the daily returns
sns.corrplot(tech_rets.dropna(),annot=True)
<matplotlib.axes._subplots.AxesSubplot at 0x2370a898>
Fantastic! Just like we suspected in our PairPlot we see here numerically and visually that Amazon and Google had the strongest correlation of daily stock return. It's also interesting to see that all the technology comapnies are positively correlated.
Great! Now that we've done some daily return analysis, let's go ahead and start looking deeper into actual risk analysis.
There are many ways we can quantify risk, one of the most basic ways using the information we've gathered on daily percentage returns is by comparing the expected return with the standard deviation of the daily returns.
# Let's start by defining a new DataFrame as a clenaed version of the oriignal tech_rets DataFrame
rets = tech_rets.dropna()
area = np.pi*20
plt.scatter(rets.mean(), rets.std(),alpha = 0.5,s =area)
# Set the x and y limits of the plot (optional, remove this if you don't see anything in your plot)
plt.ylim([0.01,0.025])
plt.xlim([-0.003,0.004])
#Set the plot axis titles
plt.xlabel('Expected returns')
plt.ylabel('Risk')
# Label the scatter plots, for more info on how this is done, chekc out the link below
# http://matplotlib.org/users/annotations_guide.html
for label, x, y in zip(rets.columns, rets.mean(), rets.std()):
plt.annotate(
label,
xy = (x, y), xytext = (50, 50),
textcoords = 'offset points', ha = 'right', va = 'bottom',
arrowprops = dict(arrowstyle = '-', connectionstyle = 'arc3,rad=-0.3'))
Let's go ahead and define a value at risk parameter for our stocks. We can treat value at risk as the amount of money we could expect to lose (aka putting at risk) for a given confidence interval. Theres several methods we can use for estimating a value at risk. Let's go ahead and see some of them in action.
For this method we will calculate the empirical quantiles from a histogram of daily returns. For more information on quantiles, check out this link: http://en.wikipedia.org/wiki/Quantile
Let's go ahead and repeat the daily returns histogram for Apple stock.
# Note the use of dropna() here, otherwise the NaN values can't be read by seaborn
sns.distplot(AAPL['Daily Return'].dropna(),bins=100,color='purple')
<matplotlib.axes._subplots.AxesSubplot at 0x20ca53c8>
Now we can use quantile to get the risk value for the stock.
# The 0.05 empirical quantile of daily returns
rets['AAPL'].quantile(0.05)
-0.019003328976079148
The 0.05 empirical quantile of daily returns is at -0.019. That means that with 95% confidence, our worst daily loss will not exceed 1.9%. If we have a 1 million dollar investment, our one-day 5% VaR is 0.019 * 1,000,000 = $19,000.
Go ahead and repeat this for the other stocks in out portfolio, then afterwards we'll look at value at risk by implementing a Monte Carlo method.
Using the Monte Carlo to run many trials with random market conditions, then we'll calculate portfolio losses for each trial. After this, we'll use the aggregation of all these simulations to establish how risky the stock is.
Let's start with a brief explanation of what we're going to do:
We will use the geometric Brownian motion (GBM), which is technically known as a Markov process. This means that the stock price follows a random walk and is consistent with (at the very least) the weak form of the efficient market hypothesis (EMH): past price information is already incorporated and the next price movement is "conditionally independent" of past price movements.
This means that the past information on the price of a stock is independent of where the stock price will be in the future, basically meaning, you can't perfectly predict the future solely based on the previous price of a stock.
The equation for geometric Browninan motion is given by the following equation:
$$\frac{\Delta S}{S} = \mu\Delta t + \sigma \epsilon \sqrt{\Delta t}$$Where S is the stock price, mu is the expected return (which we calculated earlier),sigma is the standard deviation of the returns, t is time, and epsilon is the random variable.
We can mulitply both sides by the stock price (S) to rearrange the formula and solve for the stock price.
Now we see that the change in the stock price is the current stock price multiplied by two terms. The first term is known as "drift", which is the average daily return multiplied by the change of time. The second term is known as "shock", for each tiem period the stock will "drift" and then experience a "shock" which will randomly push the stock price up or down. By simulating this series of steps of drift and shock thousands of times, we can begin to do a simulation of where we might expect the stock price to be.
For more info on the Monte Carlo method for stocks, check out the following link: http://www.investopedia.com/articles/07/montecarlo.asp
To demonstrate a basic Monte Carlo method, we will start with just a few simulations. First we'll define the variables we'll be using the Google DataFrame GOOG
# Set up our time horizon
days = 365
# Now our delta
dt = 1/days
# Now let's grab our mu (drift) from the expected return data we got for AAPL
mu = rets.mean()['GOOG']
# Now let's grab the volatility of the stock from the std() of the average return
sigma = rets.std()['GOOG']
Next, we will create a function that takes in the starting price and number of days, and uses teh sigma and mu we already calculated form out daily returns
def stock_monte_carlo(start_price,days,mu,sigma):
''' This function takes in starting stock price, days of simulation,mu,sigma, and returns simulated price array'''
# Define a price array
price = np.zeros(days)
price[0] = start_price
# Schok and Drift
shock = np.zeros(days)
drift = np.zeros(days)
# Run price array for number of days
for x in xrange(1,days):
# Calculate Schock
shock[x] = np.random.normal(loc=mu * dt, scale=sigma * np.sqrt(dt))
# Calculate Drift
drift[x] = mu * dt
# Calculate Price
price[x] = price[x-1] + (price[x-1] * (drift[x] + shock[x]))
return price
Grate now let's put our function to work!
# Get start price from GOOG.head()
start_price = 569.85
for run in xrange(100):
plt.plot(stock_monte_carlo(start_price,days,mu,sigma))
plt.xlabel("Days")
plt.ylabel("Price")
plt.title('Monte Carlo Analysis for Google')
<matplotlib.text.Text at 0x1f5f42e8>
Let's go ahead and get a histogram of the end results for a much larger run. (note: This could take a little while to run , depending on the number of runs chosen)
# Set a large numebr of runs
runs = 10000
# Create an empty matrix to hold the end price data
simulations = np.zeros(runs)
# Set the print options of numpy to only display 0-5 points from an array to suppress output
np.set_printoptions(threshold=5)
for run in xrange(runs):
# Set the simulation data point as the last stock price for that run
simulations[run] = stock_monte_carlo(start_price,days,mu,sigma)[days-1];
Now that we have our array of simulations, we can go ahead and plot a histogram ,as well as use qunatile to define our risk for this stock.
For more info on quantiles, check out this link: http://en.wikipedia.org/wiki/Quantile
# Now we'lll define q as the 1% empirical qunatile, this basically means that 99% of the values should fall between here
q = np.percentile(simulations, 1)
# Now let's plot the distribution of the end prices
plt.hist(simulations,bins=200)
# Using plt.figtext to fill in some additional information onto the plot
# Starting Price
plt.figtext(0.6, 0.8, s="Start price: $%.2f" %start_price)
# Mean ending price
plt.figtext(0.6, 0.7, "Mean final price: $%.2f" % simulations.mean())
# Variance of the price (within 99% confidence interval)
plt.figtext(0.6, 0.6, "VaR(0.99): $%.2f" % (start_price - q,))
# Display 1% quantile
plt.figtext(0.15, 0.6, "q(0.99): $%.2f" % q)
# Plot a line at the 1% quantile result
plt.axvline(x=q, linewidth=4, color='r')
# Title
plt.title(u"Final price distribution for Google Stock after %s days" % days, weight='bold');
Awesome! Now we have looked at the 1% empirical quantile of the final price distribution to estimate the Value at Risk for the Google stock, which looks to be $18.38 for every investment of 569.85 (the price of one inital google stock).
This basically menas for every initial stock you purchase your putting about $18.38 at risk 99% of the time from our Monte Carlo Simulation.
Congrats on finishing the Stock Market Data Analysis project! Here are some additional quesitons and excercises for you to do:
1.) Estimate the values at risk using both methods we learned in this project for a stock not related to technology.
2.) Build a practice portfolio and see how well you can predict you risk values with real stock information!
3.) Look further into correlatino of two stocks and see if that gives you any insight into future possible stock prices.