pricing = get_pricing(['AAPL', 'MSFT'],
start_date='2014-01-01',
end_date='2014-01-07',
frequency='minute',
fields='price')
pricing
is a DataFrame
with the same structure as the return value of history
on quantopian.
pricing.head(10)
Security(24 [AAPL]) | Security(5061 [MSFT]) | |
---|---|---|
2014-01-02 14:31:00+00:00 | 79.446 | 37.340 |
2014-01-02 14:32:00+00:00 | 79.424 | 37.375 |
2014-01-02 14:33:00+00:00 | 79.490 | 37.260 |
2014-01-02 14:34:00+00:00 | 79.502 | 37.260 |
2014-01-02 14:35:00+00:00 | 79.252 | 37.280 |
2014-01-02 14:36:00+00:00 | 79.184 | 37.283 |
2014-01-02 14:37:00+00:00 | 79.260 | 37.270 |
2014-01-02 14:38:00+00:00 | 79.300 | 37.300 |
2014-01-02 14:39:00+00:00 | 79.259 | 37.300 |
2014-01-02 14:40:00+00:00 | 79.222 | 37.280 |
Pandas' built-in groupby
and apply
operations are extremely powerful. For more information on these features, see http://pandas.pydata.org/pandas-docs/stable/groupby.html.
from pandas.tseries.tools import normalize_date
def my_grouper(ts):
"Function to apply to the index of the DataFrame to break it into groups."
# Returns midnight of the supplied date.
return normalize_date(ts)
def first_thirty_minutes(frame):
"Function to apply to the resulting groups."
return frame.iloc[:30]
The result of a groupby
computation is a Hierarchichally-Indexed DataFrame where the outermost layer of the index is the groupby key, and the secondary layers are the values from the frame's original index.
data = pricing.groupby(my_grouper).apply(first_thirty_minutes)
data.head(40)
Security(24 [AAPL]) | Security(5061 [MSFT]) | ||
---|---|---|---|
2014-01-02 00:00:00+00:00 | 2014-01-02 14:31:00+00:00 | 79.446 | 37.3400 |
2014-01-02 14:32:00+00:00 | 79.424 | 37.3750 | |
2014-01-02 14:33:00+00:00 | 79.490 | 37.2600 | |
2014-01-02 14:34:00+00:00 | 79.502 | 37.2600 | |
2014-01-02 14:35:00+00:00 | 79.252 | 37.2800 | |
2014-01-02 14:36:00+00:00 | 79.184 | 37.2830 | |
2014-01-02 14:37:00+00:00 | 79.260 | 37.2700 | |
2014-01-02 14:38:00+00:00 | 79.300 | 37.3000 | |
2014-01-02 14:39:00+00:00 | 79.259 | 37.3000 | |
2014-01-02 14:40:00+00:00 | 79.222 | 37.2800 | |
2014-01-02 14:41:00+00:00 | 79.187 | 37.2500 | |
2014-01-02 14:42:00+00:00 | 79.069 | 37.2500 | |
2014-01-02 14:43:00+00:00 | 79.160 | 37.2300 | |
2014-01-02 14:44:00+00:00 | 79.186 | 37.2325 | |
2014-01-02 14:45:00+00:00 | 79.213 | 37.1950 | |
2014-01-02 14:46:00+00:00 | 79.152 | 37.1601 | |
2014-01-02 14:47:00+00:00 | 79.183 | 37.1800 | |
2014-01-02 14:48:00+00:00 | 79.202 | 37.1900 | |
2014-01-02 14:49:00+00:00 | 79.159 | 37.1550 | |
2014-01-02 14:50:00+00:00 | 79.190 | 37.1700 | |
2014-01-02 14:51:00+00:00 | 79.150 | 37.1600 | |
2014-01-02 14:52:00+00:00 | 79.064 | 37.2150 | |
2014-01-02 14:53:00+00:00 | 79.134 | 37.2500 | |
2014-01-02 14:54:00+00:00 | 79.074 | 37.2400 | |
2014-01-02 14:55:00+00:00 | 79.139 | 37.2350 | |
2014-01-02 14:56:00+00:00 | 79.164 | 37.1800 | |
2014-01-02 14:57:00+00:00 | 79.163 | 37.1600 | |
2014-01-02 14:58:00+00:00 | 79.217 | 37.1800 | |
2014-01-02 14:59:00+00:00 | 79.165 | 37.2001 | |
2014-01-02 15:00:00+00:00 | 79.144 | 37.2501 | |
2014-01-03 00:00:00+00:00 | 2014-01-03 14:31:00+00:00 | 78.937 | 37.1800 |
2014-01-03 14:32:00+00:00 | 78.702 | 37.1700 | |
2014-01-03 14:33:00+00:00 | 78.756 | 37.1301 | |
2014-01-03 14:34:00+00:00 | 78.552 | 37.1450 | |
2014-01-03 14:35:00+00:00 | 78.573 | 37.1500 | |
2014-01-03 14:36:00+00:00 | 78.616 | 37.1400 | |
2014-01-03 14:37:00+00:00 | 78.693 | 37.1100 | |
2014-01-03 14:38:00+00:00 | 78.630 | 37.1250 | |
2014-01-03 14:39:00+00:00 | 78.589 | 37.0900 | |
2014-01-03 14:40:00+00:00 | 78.543 | 37.1040 |
Because our DataFrame
is Hierarchically-Indexed, we can query it by our groupby keys.
from pandas import Timestamp
# This gives us the first thirty minutes of January 3rd.
data.loc[Timestamp('2014-01-03', tz='UTC')]
Security(24 [AAPL]) | Security(5061 [MSFT]) | |
---|---|---|
2014-01-03 14:31:00+00:00 | 78.937 | 37.1800 |
2014-01-03 14:32:00+00:00 | 78.702 | 37.1700 |
2014-01-03 14:33:00+00:00 | 78.756 | 37.1301 |
2014-01-03 14:34:00+00:00 | 78.552 | 37.1450 |
2014-01-03 14:35:00+00:00 | 78.573 | 37.1500 |
2014-01-03 14:36:00+00:00 | 78.616 | 37.1400 |
2014-01-03 14:37:00+00:00 | 78.693 | 37.1100 |
2014-01-03 14:38:00+00:00 | 78.630 | 37.1250 |
2014-01-03 14:39:00+00:00 | 78.589 | 37.0900 |
2014-01-03 14:40:00+00:00 | 78.543 | 37.1040 |
2014-01-03 14:41:00+00:00 | 78.424 | 37.1000 |
2014-01-03 14:42:00+00:00 | 78.462 | 37.1400 |
2014-01-03 14:43:00+00:00 | 78.539 | 37.1500 |
2014-01-03 14:44:00+00:00 | 78.573 | 37.1600 |
2014-01-03 14:45:00+00:00 | 78.509 | 37.1120 |
2014-01-03 14:46:00+00:00 | 78.510 | 37.0600 |
2014-01-03 14:47:00+00:00 | 78.477 | 37.0600 |
2014-01-03 14:48:00+00:00 | 78.519 | 37.0600 |
2014-01-03 14:49:00+00:00 | 78.501 | 37.0200 |
2014-01-03 14:50:00+00:00 | 78.419 | 37.0100 |
2014-01-03 14:51:00+00:00 | 78.392 | 37.0400 |
2014-01-03 14:52:00+00:00 | 78.317 | 37.0450 |
2014-01-03 14:53:00+00:00 | 78.287 | 37.0200 |
2014-01-03 14:54:00+00:00 | 78.216 | 37.0300 |
2014-01-03 14:55:00+00:00 | 78.230 | 37.0200 |
2014-01-03 14:56:00+00:00 | 78.244 | 37.0150 |
2014-01-03 14:57:00+00:00 | 78.299 | 37.0200 |
2014-01-03 14:58:00+00:00 | 78.252 | 37.0301 |
2014-01-03 14:59:00+00:00 | 78.344 | 37.0500 |
2014-01-03 15:00:00+00:00 | 78.332 | 37.0372 |
If we want to query on the second layer of the index, we have to use .xs
with a level argument instead of .loc
.
Note that level=1
means the second level of the index, because the levels start at index 0.
data.xs(Timestamp('2014-01-03 14:58:00', tz='UTC'), level=1)
Security(24 [AAPL]) | Security(5061 [MSFT]) | |
---|---|---|
2014-01-03 00:00:00+00:00 | 78.252 | 37.0301 |
If we just want to work with the original index values, we can drop the extra level from our index.
data_copy = data.copy()
data_copy.index = data_copy.index.droplevel(0)
data_copy.head()
Security(24 [AAPL]) | Security(5061 [MSFT]) | |
---|---|---|
2014-01-02 14:31:00+00:00 | 79.446 | 37.340 |
2014-01-02 14:32:00+00:00 | 79.424 | 37.375 |
2014-01-02 14:33:00+00:00 | 79.490 | 37.260 |
2014-01-02 14:34:00+00:00 | 79.502 | 37.260 |
2014-01-02 14:35:00+00:00 | 79.252 | 37.280 |