Pandas Time series

In [1]:
import addutils.toc ; addutils.toc.js(ipy_notebook=True)
Out[1]:

For further documentation see Time Series / Date functionality on the pandas documentation.

In [2]:
import numpy as np
import pandas as pd
from numpy import NaN
from IPython.display import (display, HTML)
from pandas.tseries.offsets import *
from addutils import side_by_side2
from addutils import css_notebook
css_notebook()
Out[2]:

1 Timestamps and DatetimeIndex

to_datetime convertsa list of date-like objects to Time Stamps. If the list is homegeneous infer_datetime_format=True can give a great speed-up. Otherwise with format is possible to define the format of the strings a-priori.

In [3]:
pd.to_datetime(['10-11-2000 17:52:13.7731+02:00', '3-5-2014 11:57:18.7754+01:00'], dayfirst=True, utc=True)
Out[3]:
DatetimeIndex(['2000-11-10 15:52:13.773100+00:00', '2014-05-03 10:57:18.775400+00:00'], dtype='datetime64[ns, UTC]', freq=None)

It’s also possible to convert integer or float epoch times. The float value is interpreted as a Unix timestamp and the default unit of measure is nanoseconds bu a different unit (D,s,ms,us,ns) can be specified:

In [4]:
print (pd.to_datetime([0]))
print (pd.to_datetime([1]))
print (pd.to_datetime([1], unit='D'))
DatetimeIndex(['1970-01-01'], dtype='datetime64[ns]', freq=None)
DatetimeIndex(['1970-01-01 00:00:00.000000001'], dtype='datetime64[ns]', freq=None)
DatetimeIndex(['1970-01-02'], dtype='datetime64[ns]', freq=None)

Index of Timestamps: use date_range and bdate_range to create regular frequency timestamp indexes. These functions return DatetimeIndex objects that are array of Timestamps:

In [5]:
rng = pd.date_range('1/1/2012 15:53:25.335',
                    periods=6,
                    freq='2d3h35min42s115ms',
                    tz='Europe/Rome')
for date in rng:
    print (date)
2012-01-01 15:53:25.335000+01:00
2012-01-03 19:29:07.450000+01:00
2012-01-05 23:04:49.565000+01:00
2012-01-08 02:40:31.680000+01:00
2012-01-10 06:16:13.795000+01:00
2012-01-12 09:51:55.910000+01:00
In [6]:
rng[0]
Out[6]:
Timestamp('2012-01-01 15:53:25.335000+0100', tz='Europe/Rome', freq='185742115L')

This can be used as an index for Series, Dataframes and Panels:

In [7]:
data = np.random.randint(0,99,(len(rng),3))
d1 = pd.DataFrame(data, index=rng, columns=list('ABC'))
display(d1)
A B C
2012-01-01 15:53:25.335000+01:00 25 4 32
2012-01-03 19:29:07.450000+01:00 20 15 13
2012-01-05 23:04:49.565000+01:00 68 65 13
2012-01-08 02:40:31.680000+01:00 51 53 4
2012-01-10 06:16:13.795000+01:00 2 41 53
2012-01-12 09:51:55.910000+01:00 26 5 6

Pandas provides a timezone conversion. Here we produce new data by converting the previous timeseries form the Rome timezone to the Eastern US timezone. When data with different timezones are combined toghether (as in d3 = d1+d2) the results are given in UTC time which can be in turn converted in any timezone. In this example the UTC timezone is shown with the notation +00:00

In [8]:
d2 = d1.tz_convert('US/Eastern')
HTML(side_by_side2(d1, d2))
Out[8]:
A B C
2012-01-01 15:53:25.335000+01:00 25 4 32
2012-01-03 19:29:07.450000+01:00 20 15 13
2012-01-05 23:04:49.565000+01:00 68 65 13
2012-01-08 02:40:31.680000+01:00 51 53 4
2012-01-10 06:16:13.795000+01:00 2 41 53
2012-01-12 09:51:55.910000+01:00 26 5 6
A B C
2012-01-01 09:53:25.335000-05:00 25 4 32
2012-01-03 13:29:07.450000-05:00 20 15 13
2012-01-05 17:04:49.565000-05:00 68 65 13
2012-01-07 20:40:31.680000-05:00 51 53 4
2012-01-10 00:16:13.795000-05:00 2 41 53
2012-01-12 03:51:55.910000-05:00 26 5 6
In [9]:
d3 = d1+d2
d3
Out[9]:
A B C
2012-01-01 14:53:25.335000+00:00 50 8 64
2012-01-03 18:29:07.450000+00:00 40 30 26
2012-01-05 22:04:49.565000+00:00 136 130 26
2012-01-08 01:40:31.680000+00:00 102 106 8
2012-01-10 05:16:13.795000+00:00 4 82 106
2012-01-12 08:51:55.910000+00:00 52 10 12

2 DateOffsets objects

In the previous example we used a time frequency string (with Offset Aliases) freq='2d3h35min42s115ms with date_range to create a DatetimeIndex. These frequency strings are being translated into an instance of pandas DateOffset, which represents a regular frequency increment. Specific offset logic like “month”, “business day”, or “one hour” is represented in its various subclasses.

The key features of a DateOffset object are:

  • it can be used to shift a datetime object
  • it can be multiplied by an integer
  • it has rollforward and rollback methods for moving a date forward or backward
In [10]:
from pandas.tseries.offsets import *
In [11]:
print (pd.datetime(2014, 1, 1) + Week() + Hour(13) + Minute()*15 + Milli()*350)
print (pd.datetime(2014, 1, 1) + MonthEnd())
print (pd.datetime(2014, 1, 1) + MonthBegin())
print (pd.datetime(2014, 1, 1) + BQuarterBegin(2))
2014-01-08 13:15:00.350000
2014-01-31 00:00:00
2014-02-01 00:00:00
2014-06-02 00:00:00

Offset Aliases can be used to define to define time series frequencies:

  • B - Business Day
  • D - Calendar Day
  • H - Hour
  • T - Minute
  • (see the Pandas documentation for a complete list)
In [12]:
print (pd.date_range('2014/1/1', periods=2, freq=Minute(4)))
print (pd.date_range('2014/1/1', periods=2, freq='4T'))
print (pd.date_range('2014/1/1', periods=2, freq='4min'))
DatetimeIndex(['2014-01-01 00:00:00', '2014-01-01 00:04:00'], dtype='datetime64[ns]', freq='4T')
DatetimeIndex(['2014-01-01 00:00:00', '2014-01-01 00:04:00'], dtype='datetime64[ns]', freq='4T')
DatetimeIndex(['2014-01-01 00:00:00', '2014-01-01 00:04:00'], dtype='datetime64[ns]', freq='4T')
In [13]:
start = pd.datetime(2014, 1, 1, 12, 30)
end = pd.datetime(2014, 1, 12)
idx2 = pd.date_range(start, end, freq='1B1H10T')
for i in idx2:
    print (i)
2014-01-01 12:30:00
2014-01-02 13:40:00
2014-01-03 14:50:00
2014-01-06 16:00:00
2014-01-07 17:10:00
2014-01-08 18:20:00
2014-01-09 19:30:00

3 Indexing with a DateTime index

As the name can suggest Datetime Index objects can be used to index a DataFrames. Try uncommenting some of the following lines to see alternative ways for selecting dates:

In [14]:
data = np.random.randint(0,99,(len(idx2),3))
d2 = pd.DataFrame(data, index=idx2, columns=list('ABC'))
d2
Out[14]:
A B C
2014-01-01 12:30:00 65 8 4
2014-01-02 13:40:00 95 15 90
2014-01-03 14:50:00 71 92 2
2014-01-06 16:00:00 92 45 74
2014-01-07 17:10:00 48 65 59
2014-01-08 18:20:00 45 33 91
2014-01-09 19:30:00 94 28 46
In [15]:
# Remember that in Pandas endpoints are included
d2.loc['2014/01/02':'2014/01/08', ['A', 'C']]
Out[15]:
A C
2014-01-02 13:40:00 95 90
2014-01-03 14:50:00 71 2
2014-01-06 16:00:00 92 74
2014-01-07 17:10:00 48 59
2014-01-08 18:20:00 45 91
In [16]:
d2.loc['2014/01/06 12:00':'2014/01/08 12:00', ['A', 'C']]
Out[16]:
A C
2014-01-06 16:00:00 92 74
2014-01-07 17:10:00 48 59

It is possible to mangle with indexes using offset objects. BDay means business day (be aware that xmas is considered a business day).

4 Frequency conversion

We can convert all TimeSeries to specified frequency using DateOffset objects. Optionally we can provide a fill method to handle missing values.

In [17]:
# TODO: Fix
HTML(side_by_side2(d2,
                  d2.asfreq(Day(), method='ffill'),
                  d2.asfreq(Day(), method=None)))
Out[17]:
A B C
2014-01-01 12:30:00 65 8 4
2014-01-02 13:40:00 95 15 90
2014-01-03 14:50:00 71 92 2
2014-01-06 16:00:00 92 45 74
2014-01-07 17:10:00 48 65 59
2014-01-08 18:20:00 45 33 91
2014-01-09 19:30:00 94 28 46
A B C
2014-01-01 12:30:00 65 8 4
2014-01-02 12:30:00 65 8 4
2014-01-03 12:30:00 95 15 90
2014-01-04 12:30:00 71 92 2
2014-01-05 12:30:00 71 92 2
2014-01-06 12:30:00 71 92 2
2014-01-07 12:30:00 92 45 74
2014-01-08 12:30:00 48 65 59
2014-01-09 12:30:00 45 33 91
A B C
2014-01-01 12:30:00 65.0 8.0 4.0
2014-01-02 12:30:00 NaN NaN NaN
2014-01-03 12:30:00 NaN NaN NaN
2014-01-04 12:30:00 NaN NaN NaN
2014-01-05 12:30:00 NaN NaN NaN
2014-01-06 12:30:00 NaN NaN NaN
2014-01-07 12:30:00 NaN NaN NaN
2014-01-08 12:30:00 NaN NaN NaN
2014-01-09 12:30:00 NaN NaN NaN

5 Filling gaps

We are going to see some ways to let pandas fill NaN values on a dataframe.

The first method is called forward filling and consists on using the first element above that isn't NaN.

In [18]:
d3 = d2.copy()
d3.iloc[2:5, 0] = np.nan
d3.iloc[0:4, 1] = np.nan
d3.iloc[4:, 2] = np.nan
cols = ['A', 'B', 'C']
HTML(side_by_side2(d3, d3[cols].fillna(method='ffill')))
Out[18]:
A B C
2014-01-01 12:30:00 65.0 NaN 4.0
2014-01-02 13:40:00 95.0 NaN 90.0
2014-01-03 14:50:00 NaN NaN 2.0
2014-01-06 16:00:00 NaN NaN 74.0
2014-01-07 17:10:00 NaN 65.0 NaN
2014-01-08 18:20:00 45.0 33.0 NaN
2014-01-09 19:30:00 94.0 28.0 NaN
A B C
2014-01-01 12:30:00 65.0 NaN 4.0
2014-01-02 13:40:00 95.0 NaN 90.0
2014-01-03 14:50:00 95.0 NaN 2.0
2014-01-06 16:00:00 95.0 NaN 74.0
2014-01-07 17:10:00 95.0 65.0 74.0
2014-01-08 18:20:00 45.0 33.0 74.0
2014-01-09 19:30:00 94.0 28.0 74.0

Notice that on column 'B' the NaN values are at the beginning. So the method hasn't been able to fill those holes. The backward filling methods is complementary to the one above. It fills gaps using the first non NaN value below the cell.

In [19]:
HTML(side_by_side2(d3, d3[cols].fillna(method='bfill')))
Out[19]:
A B C
2014-01-01 12:30:00 65.0 NaN 4.0
2014-01-02 13:40:00 95.0 NaN 90.0
2014-01-03 14:50:00 NaN NaN 2.0
2014-01-06 16:00:00 NaN NaN 74.0
2014-01-07 17:10:00 NaN 65.0 NaN
2014-01-08 18:20:00 45.0 33.0 NaN
2014-01-09 19:30:00 94.0 28.0 NaN
A B C
2014-01-01 12:30:00 65.0 65.0 4.0
2014-01-02 13:40:00 95.0 65.0 90.0
2014-01-03 14:50:00 45.0 65.0 2.0
2014-01-06 16:00:00 45.0 65.0 74.0
2014-01-07 17:10:00 45.0 65.0 NaN
2014-01-08 18:20:00 45.0 33.0 NaN
2014-01-09 19:30:00 94.0 28.0 NaN

Visit www.add-for.com for more tutorials and updates.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.