2017-02-03, Josh Montague
This is Part 1 of the "Time Series Modeling in Python" series. In this session, we spend time working through creation and manipulation of time series data in the pandas data structures. We'll get a sense for the patterns of use, and try out a handful of the handy built-in methods on dataframes and series that are made specifically for time series data.
import random
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
To start, we'll look at the specific data types in pandas which facilitate working with time series. They're largely built on top of numpy (as is much of the pandas library, which makes it very fast and efficient), and are designed for interoperability with Python's built-in datetime objects.
Timestamp
The Timestamp is similar to Python's datetime (and the docs will sometimes interchange the names). But the pandas data structures have a bunch of super useful metadata and methods on them; often, these are precalculated to make subsequent calculations on them more efficient.
There's a top-of-the-module constructor that is flexible in parsing the arguments.
pd.Timestamp('2016')
# the parser can do a lot of inference
pd.Timestamp('2016 04')
pd.Timestamp('2016-06-01T11:03')
# this is a handy method for flexibly converting things to time-like things via pandas
pd.to_datetime('2017')
pd.to_datetime('2017-02')
The Timestamp object also exposes a bunch of useful attributes and methods:
# tons of helpful attributes and methods
ts = pd.Timestamp('2016-06-01 11:03')
# filter out the things that aren't intended to be "user-facing"
[ x for x in dir(ts) if not x.startswith('_') ]
# for example, the "weekday_name" attribute
print(ts.weekday_name)
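Since the Timestamp is designed to interoperate with Python's built-in datetime, round-tripping between the two is straightforward. A quick sketch (the specific timestamp is arbitrary):

```python
import datetime

import pandas as pd

ts = pd.Timestamp('2016-06-01 11:03')

# Timestamp -> built-in datetime
dt = ts.to_pydatetime()

# built-in datetime -> Timestamp
ts2 = pd.Timestamp(datetime.datetime(2016, 6, 1, 11, 3))
```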
DatetimeIndex
A DatetimeIndex is a sequence of Timestamps (the flexible interchange of "datetime" and "timestamp" is slightly confusing), with a type, frequency, and some other attributes.
These data types inherit a bunch of super handy things under the hood (like pre-calculated time intervals, and special slicing) which make calculations much faster and easier.
pd.date_range('2017-02-02', freq='D', periods=10)
When dealing with time intervals, there are a lot of magic strings in pandas. I have a general distaste for magic strings, but pandas uses them relatively sparingly.
I always get frustrated looking them up, too, so I made bit.ly/pd-offsets
in the hopes of saving a little time trying to remember the correct shorthand for such arcane things as "semi-month end frequency (15th and end of month)" and "business year start frequency". Fortunately "weekly" ('W') and "daily" ('D') are sensible.
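To make a couple of those aliases concrete, here's a quick sketch using one arcane offset string and one sensible one (the exact output dates depend on the calendar):

```python
import pandas as pd

# 'SM': semi-month end frequency (the 15th and the end of each month)
sm = pd.date_range('2017-01-01', freq='SM', periods=4)

# 'B': business day frequency (weekdays only)
b = pd.date_range('2017-02-02', freq='B', periods=5)
```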
# weekly data points that fit within start and end
pd.date_range(start='2017-01', end='2017-02', freq='W')
# specific set of (irregular) epoch seconds
pd.to_datetime([1486074696, 1486074698, 1486074699], unit='s')
# the datetimeindex is also a handy iterable (for the array values)
ts = pd.date_range(start='2017-01', end='2017-02', freq='W')
for t in ts:
    print(t, random.choice(['hi','hey','ohai','cheerio']))
Timedelta
A Timedelta is the difference between two timestamps. Very useful for date math, and with a similarly flexible constructor.
pd.Timedelta('3D')
pd.Timedelta('-2 days')
pd.Timedelta(weeks=1, days=2, minutes=7, seconds=3)
pd.Timestamp('2017-02-02 16:00') + pd.Timedelta('-5 hours')
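Subtracting two Timestamps also yields a Timedelta, which exposes its components and a total_seconds() method. A quick sketch with arbitrary dates:

```python
import pandas as pd

# subtracting Timestamps yields a Timedelta
delta = pd.Timestamp('2017-02-02') - pd.Timestamp('2017-01-01')

delta.days             # number of whole days in the difference
delta.total_seconds()  # the whole difference, as a float of seconds
```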
DatetimeIndex on a DataFrame
Much of the original motivation for writing pandas was for financial time series modeling. As a result, there are many efficiently implemented methods and conveniences built into the data structures.
Many of these depend on dataframe rows being addressed (indexed) with a time-like data type, which is exactly what the DatetimeIndex is for.
If our data has some sort of timestamp in it, how do we convert it to an associated datetimeindex?
d = pd.read_csv('misc/small.csv', names=['date','value'])
d.tail()
# is 'date' a date-like data type?
d.dtypes
A pandas object dtype means that at least one of the items in the column was a string. There is no str dtype in pandas.
Let's set the 'date' column to be the index first.
# by default, also drops the specified column from the frame
d.set_index('date').tail()
# what is the index type?
d.set_index('date').index[:5]
Ok, so that approach didn't work. Another way to consider it is to assume (correctly) that if the data is already date-like when we move it to the index, pandas will make a datetimeindex.
You've probably seen (and used) this explicit over-writing step to convert a column to a different type. We can use the nice convenience method we saw earlier.
# overwrite the column
d['date'] = pd.to_datetime(d['date'])
d.tail()
# now we have times
d.dtypes
# setting a time-like type column to the index automatically converts it to a datetimeindex
d.set_index('date').index[:5]
Another approach to getting this kind of index from the original file is to parse the dates on file-read.
# we know the dates are in column 0
d2 = pd.read_csv('misc/small.csv', names=['date','value'], parse_dates=[0])
d2.dtypes
d2.tail()
d2.set_index('date').tail()
And, there are probably other ways, to be totally honest. Those are the ones that I use most often.
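One such alternative: read_csv can parse the dates and set the index in a single call by combining parse_dates with index_col. A sketch, using inline data to stand in for misc/small.csv (whose exact contents are assumed here):

```python
import io

import pandas as pd

# hypothetical stand-in for misc/small.csv
csv = io.StringIO("2017-01-01,1\n2017-01-02,2\n")

# parse the dates *and* set the datetimeindex in one read_csv call
d3 = pd.read_csv(csv, names=['date', 'value'],
                 parse_dates=['date'], index_col='date')
```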
Note that we can parse the dates and move them right over to the DatetimeIndex all in one compact step with some chaining.
# NB: method chaining can be a nice way to make a handful of sequential manipulations readable
d = (pd.read_csv('misc/small.csv', names=['date','value'], parse_dates=[0])
.set_index('date')
)
d.head()
We're familiar with the convenience method for accessing elements of a dataframe: [] is shorthand for __getitem__, and grabs columns of the frame (returning a series, or another frame if the argument is an array).
# default slicing => *columns*
d['value'].head()
type(d['value'])
# what about this?
d['2017'].head()
This behavior is a special corner of the pandas world, specifically those dataframes that have a datetimeindex. It's a "fallback" functionality of the slicing via square brackets. And it is terrible! If you happen to have columns and index values that are the same, it's too ambiguous.
Instead, when addressing (accessing) a dataframe by row, always use .loc (name) or .iloc (integer).
Think: .loc, like "location"; .iloc, like "integer location"*.
Another option is .ix, which accesses rows by either name or position (integer), but honestly, why risk the ambiguity? Just choose an explicit one and let it KeyError if you choose wrong. Eventually you will have a row indexed by the string '0' and you will spend 15 minutes debugging this, and then you will be sad.
*(or "index location", but the term "index" is pretty overloaded in this context)
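To make the '0'-as-a-string pitfall concrete, here's a small sketch with a made-up frame:

```python
import pandas as pd

# a frame whose index holds the *string* '0', not the integer 0
df = pd.DataFrame({'value': [10, 20]}, index=['0', '1'])

df.loc['0']    # works: the string '0' is a label in the index
df.iloc[0]     # works: position-based, always the first row

try:
    df.loc[0]  # the integer 0 is not a label here
except KeyError:
    print("KeyError: the integer 0 is not in the index")
```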
# location
d.loc['2017-01-02']
# integer location
d.iloc[1]
Note the difference in precision between the argument in .loc['2017-01-02']
and that shown in the 'Name:' (and remember that numpy's datetimes have nanosecond precision, under the hood).
# the timestamp repr displays down to the second
d.index[1]
This is enabled by partial string indexing - another awesome feature of the datetimeindex. You can think of it as automatically expanding to encompass "all of the index values that would match as much of a criteria as you've specified."
d.loc['2017-02'].tail()
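Partial strings also work as slice endpoints with .loc, expanding to cover the full days at each end. A sketch on a made-up daily frame (the real d from misc/small.csv behaves the same way):

```python
import pandas as pd

# made-up daily data for illustration
dd = pd.DataFrame({'value': range(60)},
                  index=pd.date_range('2017-01-01', periods=60, freq='D'))

# partial strings as slice endpoints, inclusive of both full days
window = dd.loc['2017-01-15':'2017-02-03']

# a whole month in one string
feb = dd.loc['2017-02']
```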
This isn't really a time series-related transformation, but it's useful so why not mention it...
d.head()
# shift (either + or -)
d.shift(2).head()
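One common use of shift() is computing period-over-period changes: subtracting the shifted series is equivalent to calling .diff(). A sketch with made-up values:

```python
import pandas as pd

# made-up daily series
s = pd.Series([1, 3, 6, 10],
              index=pd.date_range('2017-02-01', periods=4, freq='D'))

# day-over-day change via shift...
change = s - s.shift(1)

# ...which is exactly what .diff() computes
s.diff()
```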
Resampling
This is where the power of the datetimeindex really starts to show. These calculations are fast, even on very large dataframes.
Resampling takes advantage of the same split-apply-combine methodology as groupby calculations. This is one of the places where pandas can run many calculations at the same time for a big speed-up (RELEASE THE KRAKE... er, GIL).
d.head()
# kind of like a groupby object - needs an 'and then what?'
d.resample('7d')
(d.resample('7d')
.sum()
.head()
)
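The resampler's "and then what?" can also be several aggregations at once via .agg(). A sketch on made-up daily data:

```python
import pandas as pd

# made-up daily data
dd = pd.DataFrame({'value': range(14)},
                  index=pd.date_range('2017-02-01', periods=14, freq='D'))

# multiple aggregations in one pass; the columns become a MultiIndex
out = dd.resample('7d').agg(['sum', 'mean', 'min'])
```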
# convenient for plotting
for freq in ('1d', '3d', '1w'):
    plt.plot((d.resample(freq)
              # choose one!
              .sum()
              #.mean()
              #.min()
              ),
             'o--', label=freq)
plt.legend()
fig = plt.gcf(); fig.autofmt_xdate()
You can also resample in the other direction (upsampling), where you don't have all the data.
# daily data
d.head()
# "resample" to 6-hourly data (4x more observations)
(d.resample('6h')
.sum()
.head(20)
)
That's not immediately useful because all we did was add a bunch of NaNs. But that's ok, because we can prescribe how the frame should fill in those observations...
# options for filling the NaNs - choose one!
(d.resample('6h')
# forward fill
#.ffill()
# backward fill
#.bfill()
# interpolate
.interpolate('linear')
# arb. (asfreq() => df)
#.asfreq().fillna(-1)
).head(20)
And when we chart it, we can see a little more clearly how e.g. the linear interpolation works.
rs = d.resample('6h').interpolate('linear')
plt.figure(figsize=(10,6))
# NB: mpl 2.0 takes dataframes as arguments!
plt.plot(rs['2017-02'], 'o--', label='interpolated')
plt.plot(d['2017-02'], 'o', markersize=20, markerfacecolor='none', label='original')
fig = plt.gcf(); fig.autofmt_xdate()
plt.legend()
Here are some of the useful tabs I had open while working on this session.