In this notebook, we'll take a look at the Alpha Vertex Top 100/500 Securities PreCog dataset, available on Quantopian. This dataset spans 2010 through the current day. PreCog uses machine learning models to forecast stock returns at multiple horizons.
The Top 100 dataset contains 5-day predicted log returns for the top 100 securities by market cap; the Top 500 dataset contains the same predictions for the top 500 securities by market cap.
Update time: daily data is updated close to midnight for the previous day, so on the 27th you will have data with an asof_date of the 26th.
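Since the predictions are log returns, they can be converted to simple returns with exp(r) - 1. A quick illustration (the value here is just an example, not taken from the live feed):

```python
import math

# A predicted 5-day log return of 0.064 corresponds to a
# simple return of exp(0.064) - 1, roughly 6.6%.
predicted_log_return = 0.064
simple_return = math.exp(predicted_log_return) - 1
print(round(simple_return, 4))  # 0.0661
```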
There are two ways to access the data and you'll find both of them listed below. Just click on the section you'd like to read through.
The result of any expression is limited to 10,000 rows to protect against runaway memory usage. To be clear, you have access to all the data server side. We are limiting the size of the responses back from Blaze.
There is a free version of this dataset as well as a paid one. The free sample includes data until 2 months prior to the current date.
To access the most up-to-date values for this data set for trading a live algorithm (as with other partner sets), you need to purchase access to the full set.
Partner datasets are available on Quantopian Research through an API service known as Blaze. Blaze provides the Quantopian user with a convenient interface to access very large datasets, in an interactive, generic manner.
Blaze provides an important function for accessing these datasets. Some of these datasets contain many millions of records, and bringing that data directly into Quantopian Research is not viable. Blaze allows us to provide a simple querying interface and shift the computational burden to the server side.
It is common to use Blaze to perform a reduction expression on your dataset so that you don't have to pull the whole dataset into memory. You can convert the result of a blaze expression to a Pandas data structure (e.g. a DataFrame) and perform further computation, manipulation, and visualization on that structure.
Helpful links:
Once you have a Blaze expression that reduces the dataset to less than 10,000 rows, you can convert it to a pandas DataFrame using:
from odo import odo
odo(expr, pandas.DataFrame)
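Blaze expressions largely mirror pandas syntax, so the reduction-then-convert pattern carries over naturally. As a rough illustration of the kind of reduction you might write (shown here on a small in-memory pandas frame rather than the hosted Blaze dataset; the values are made up):

```python
import pandas as pd

# Toy stand-in for a converted slice of the PreCog data.
df = pd.DataFrame({
    'symbol': ['AAPL', 'AAPL', 'ABT', 'ABT'],
    'predicted_five_day_log_return': [0.000, 0.003, -0.001, 0.002],
})

# A grouped reduction like this keeps the result far below the
# 10,000-row response limit; in Blaze the equivalent expression
# would be evaluated server side before conversion with odo.
mean_by_symbol = df.groupby('symbol')['predicted_five_day_log_return'].mean()
print(mean_by_symbol)
```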
You can work through the interactive overview below, or head straight to the Pipeline Overview section of this notebook.
# import the free sample of the dataset
from quantopian.interactive.data.alpha_vertex import (
# Top 100 Securities
precog_top_100 as dataset_100,
# Top 500 Securities
precog_top_500 as dataset_500
)
# import data operations
from odo import odo
# import other libraries we will use
import pandas as pd
import matplotlib.pyplot as plt
# Let's use Blaze to understand the data a bit. First, the most recent asof_date:
dataset_500.asof_date.max()
# And how many rows are there?
# N.B. we're using a Blaze function to do this, not len()
dataset_500.count()
# Let's see what the data looks like. We'll grab the first few rows.
dataset_500.peek()
|    | symbol | name | sid | predicted_five_day_log_return | asof_date | timestamp |
|----|--------|------|-----|-------------------------------|-----------|-----------|
| 0  | AA   | ALCOA INC | 2 | 0.064 | 2010-01-04 | 2010-01-05 |
| 1  | AAPL | APPLE INC | 24 | 0.000 | 2010-01-04 | 2010-01-05 |
| 2  | ABT  | ABBOTT LABORATORIES | 62 | -0.001 | 2010-01-04 | 2010-01-05 |
| 3  | ABX  | BARRICK GOLD CORP | 64 | 0.013 | 2010-01-04 | 2010-01-05 |
| 4  | ADSK | AUTODESK INC | 67 | -0.040 | 2010-01-04 | 2010-01-05 |
| 5  | TAP  | MOLSON COORS BREWING CO | 76 | 0.012 | 2010-01-04 | 2010-01-05 |
| 6  | ADBE | ADOBE SYSTEMS INC | 114 | -0.013 | 2010-01-04 | 2010-01-05 |
| 7  | ADI  | ANALOG DEVICES INC | 122 | -0.023 | 2010-01-04 | 2010-01-05 |
| 8  | ADM  | ARCHER-DANIELS-MIDLAND CO | 128 | -0.020 | 2010-01-04 | 2010-01-05 |
| 9  | AEP  | AMERICAN ELECTRIC POWER CO INC | 161 | 0.013 | 2010-01-04 | 2010-01-05 |
| 10 | AES  | AES CORP / VA | 166 | -0.042 | 2010-01-04 | 2010-01-05 |
Let's go over the columns. Fields like timestamp and sid are standardized across all Quantopian Store Datasets, so the datasets are easy to combine. The sid field is also standardized across all Quantopian equity databases.
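Because sid and asof_date are standardized, combining a partner dataset with other per-security data is a straightforward merge once both sides have been converted to pandas. A minimal sketch with toy frames (the second dataset and its some_other_signal column are hypothetical):

```python
import pandas as pd

# Toy slice of prediction data, keyed by sid and asof_date.
predictions = pd.DataFrame({
    'sid': [24, 24],
    'asof_date': pd.to_datetime(['2010-01-04', '2010-01-05']),
    'predicted_five_day_log_return': [0.000, 0.003],
})

# Hypothetical second dataset keyed the same way.
other_data = pd.DataFrame({
    'sid': [24, 24],
    'asof_date': pd.to_datetime(['2010-01-04', '2010-01-05']),
    'some_other_signal': [1.2, 0.8],
})

# Because the keys are standardized, an inner merge lines the rows up.
combined = predictions.merge(other_data, on=['sid', 'asof_date'])
print(combined)
```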
Now that we understand the data a bit better, let's get the predicted_five_day_log_return data for Apple (sid 24) and visualize it with a chart.
# We start by defining a Blaze expression that gets the rows where symbol == AAPL.
aapl_data = dataset_500[dataset_500.symbol == 'AAPL']
# We then convert the Blaze expression to a pandas DataFrame, which is populated
# with the data resulting from our Blaze expression.
aapl_df = odo(aapl_data, pd.DataFrame)
# Display the first few rows of the DataFrame.
aapl_df.head()
|   | symbol | name | sid | predicted_five_day_log_return | asof_date | timestamp |
|---|--------|------|-----|-------------------------------|-----------|-----------|
| 0 | AAPL | APPLE INC | 24 | 0.000 | 2010-01-04 | 2010-01-05 |
| 1 | AAPL | APPLE INC | 24 | 0.003 | 2010-01-05 | 2010-01-06 |
| 2 | AAPL | APPLE INC | 24 | 0.016 | 2010-01-06 | 2010-01-07 |
| 3 | AAPL | APPLE INC | 24 | -0.022 | 2010-01-07 | 2010-01-08 |
| 4 | AAPL | APPLE INC | 24 | 0.011 | 2010-01-08 | 2010-01-09 |
# For plotting purposes, set the index of the DataFrame to the asof_date.
aapl_df.set_index('asof_date', inplace=True)
# Plot the predicted 5-day log return data.
aapl_df['predicted_five_day_log_return'].plot()
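A natural follow-up is to compare the predictions against realized 5-day log returns. A minimal sketch using synthetic prices (a real study would pull actual pricing data from Quantopian instead):

```python
import numpy as np
import pandas as pd

# Synthetic daily closing prices, for illustration only.
prices = pd.Series([100.0, 101.0, 99.5, 102.0, 103.0, 104.0, 102.5, 105.0])

# Realized 5-day log return aligned to the prediction date:
# log(P[t+5] / P[t]). The last 5 entries are NaN by construction.
realized = np.log(prices.shift(-5) / prices)
print(realized)

# With real data, you could then measure agreement, e.g.:
# aapl_df['predicted_five_day_log_return'].corr(realized)
```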
Pipeline is a tool that can be used to define computations called factors, filters, or classifiers. These computations can be used in an algorithm to dynamically select securities, compute portfolio weights, compute risk factors, and more.
In research, pipeline is mostly used to explore these computations.
The only method for accessing partner data within an algorithm on Quantopian is in a pipeline. Before moving to the IDE to work on an algorithm, it's a good idea to define your pipeline in research, so that you can iterate on an idea and analyze the output.
To start, we need to import the following:
from quantopian.pipeline import Pipeline
from quantopian.research import run_pipeline
To access partner data in pipeline, you must import the dataset. If the data is in a format that's difficult to use (e.g. event-based datasets), the data is sometimes available via built-in factors or filters. There are no such built-ins for the Alpha Vertex dataset as the prediction data is in a nice, usable format.
Let's import the pipeline version of the Alpha Vertex dataset:
# These imports can be found in the store panel for each dataset
# (https://www.quantopian.com/data). Note that not all store datasets
# can be used in pipeline yet.
from quantopian.pipeline.data.alpha_vertex import (
# Top 100 Securities
precog_top_100 as dataset_100,
# Top 500 Securities
precog_top_500 as dataset_500
)
Now that we've imported the data, let's take a look at which fields are available for each dataset, along with their datatypes.
print("Here are the list of available fields per dataset:")
print("---------------------------------------------------\n")

def _print_fields(dataset):
    print("Dataset: %s\n" % dataset.__name__)
    print("Fields:")
    for field in list(dataset.columns):
        print("%s - %s" % (field.name, field.dtype))
    print("\n")

_print_fields(dataset_500)
print("---------------------------------------------------\n")
Here are the list of available fields per dataset:
---------------------------------------------------

Dataset: precog_top_500

Fields:
name - object
predicted_five_day_log_return - float64
asof_date - datetime64[ns]
symbol - object

---------------------------------------------------
# Import the Q1500US pipeline filter.
from quantopian.pipeline.filters.morningstar import Q1500US
# We only want to get the signal for stocks in the Q1500US that have a non-null
# latest predicted_five_day_log_return.
universe = (Q1500US() & dataset_500.predicted_five_day_log_return.latest.notnull())
# Define our pipeline to return the latest prediction for the stocks in `universe`.
pipe = Pipeline(
    columns={
        'prediction': dataset_500.predicted_five_day_log_return.latest,
    },
    screen=universe,
)
# Run our pipeline (this gets the data).
pipe_output = run_pipeline(pipe, start_date='2014-01-01', end_date='2017-01-01')
# The result is a pandas DataFrame with a MultiIndex.
pipe_output.head()
|   |   | prediction |
|---|---|------------|
| 2014-01-02 00:00:00+00:00 | Equity(2 [ARNC]) | -0.028 |
|   | Equity(24 [AAPL]) | -0.014 |
|   | Equity(62 [ABT])  | -0.004 |
|   | Equity(67 [ADSK]) | 0.022 |
|   | Equity(76 [TAP])  | 0.022 |
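The MultiIndexed output lends itself to per-day transformations via groupby on the date level. For instance, here is a sketch that ranks each day's predictions within that day, using a small synthetic frame shaped like pipe_output (tickers and values are illustrative):

```python
import pandas as pd

# Synthetic frame shaped like pipe_output: a (date, security) MultiIndex.
index = pd.MultiIndex.from_tuples([
    ('2014-01-02', 'AAPL'), ('2014-01-02', 'ABT'), ('2014-01-02', 'TAP'),
    ('2014-01-03', 'AAPL'), ('2014-01-03', 'ABT'), ('2014-01-03', 'TAP'),
])
preds = pd.DataFrame({'prediction': [-0.014, -0.004, 0.022,
                                     0.010, -0.020, 0.005]}, index=index)

# Rank predictions within each day; a higher rank means a stronger
# predicted return, which could feed a long/short weighting scheme.
preds['daily_rank'] = preds.groupby(level=0)['prediction'].rank()
print(preds)
```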
# Let's see how many securities we have a prediction for each day.
pipe_output.groupby(pipe_output.index.get_level_values(0)).count().plot()
The set of ~500 stocks in the PreCog top 500 is derived from market cap at the beginning of each year, which we can see above!
Now, you can try writing an algorithm using this pipeline. The final lesson in the Pipeline Tutorial gives an example of moving from research to the IDE.
There is also an example algorithm using the PreCog 500 that can be found here.