Pandas Time Series Date Gaps¶

This notebook explains how to identify date gaps in time series data with pandas.

Packages¶

This tutorial uses:

pandas
numpy

In [1]:

import pandas as pd
import numpy as np

Creating the data¶

We will create a dataframe that contains multiple time series, one for each group.

In [2]:

np.random.seed(1066)
dates = pd.date_range(start='2010-01-01', end='2010-12-31', freq='D')
df = pd.DataFrame({'date': dates,
                   'group': 'A',
                   'value': np.random.randint(0, 100, size=len(dates))
                  }).append(pd.DataFrame({'date': dates,
                                          'group': 'B',
                                          'value': np.random.randint(0, 100, size=len(dates))
                  })).append(pd.DataFrame({'date': dates,
                                           'group': 'C',
                                           'value': np.random.randint(0, 100, size=len(dates))
                                          })).reset_index(drop=True)
df

Out[2]:

	date	group	value
0	2010-01-01	A	57
1	2010-01-02	A	11
2	2010-01-03	A	83
3	2010-01-04	A	83
4	2010-01-05	A	93
...	...	...	...
1090	2010-12-27	C	50
1091	2010-12-28	C	59
1092	2010-12-29	C	85
1093	2010-12-30	C	32
1094	2010-12-31	C	3

1095 rows × 3 columns

Drop some rows randomly to create gaps in the data.

In [3]:

length = df.shape[0]
droplist = np.unique(np.sort(np.random.randint(0, length, size=100))).tolist()
df = df.drop(droplist).reset_index(drop=True)
df

Out[3]:

	date	group	value
0	2010-01-01	A	57
1	2010-01-02	A	11
2	2010-01-03	A	83
3	2010-01-04	A	83
4	2010-01-05	A	93
...	...	...	...
992	2010-12-27	C	50
993	2010-12-28	C	59
994	2010-12-29	C	85
995	2010-12-30	C	32
996	2010-12-31	C	3

997 rows × 3 columns

Identify Date Gaps¶

In a single series¶

The pandas function diff applied to the date field will generate the time between subsequent datas. Comparing that to 1 day can identify date gaps in the data.

In [4]:

singleseries = df[df.group == 'A']
singleseries['gap'] = singleseries['date'].sort_values().diff() > pd.to_timedelta('1 day')
singleseries[singleseries.gap]

Out[4]:

	date	group	value	gap
39	2010-02-10	A	97	True
44	2010-02-17	A	93	True
45	2010-02-19	A	88	True
57	2010-03-04	A	92	True
77	2010-03-25	A	44	True
81	2010-03-30	A	94	True
86	2010-04-05	A	7	True
89	2010-04-10	A	65	True
92	2010-04-15	A	85	True
99	2010-04-23	A	7	True
115	2010-05-10	A	46	True
129	2010-05-25	A	50	True
132	2010-05-29	A	71	True
136	2010-06-03	A	42	True
164	2010-07-02	A	26	True
175	2010-07-14	A	50	True
198	2010-08-07	A	6	True
201	2010-08-11	A	84	True
209	2010-08-20	A	14	True
212	2010-08-24	A	60	True
238	2010-09-20	A	39	True
246	2010-09-29	A	34	True
259	2010-10-13	A	27	True
270	2010-10-25	A	19	True
271	2010-10-27	A	3	True
291	2010-11-17	A	1	True
297	2010-11-24	A	39	True
298	2010-11-26	A	28	True
302	2010-12-01	A	3	True

In multiple time series¶

The same process can be applied to data with multiple time series, we just need to group on the time series identifier (group) before applying the

In [5]:

df['gap'] = df[['group', 'date']].sort_values(by=['group', 'date']).groupby('group').diff() > pd.to_timedelta('1 day')
df[df.gap]

Out[5]:

	date	group	value	gap
39	2010-02-10	A	97	True
44	2010-02-17	A	93	True
45	2010-02-19	A	88	True
57	2010-03-04	A	92	True
77	2010-03-25	A	44	True
...	...	...	...	...
920	2010-10-11	C	98	True
950	2010-11-11	C	0	True
958	2010-11-20	C	72	True
974	2010-12-08	C	1	True
985	2010-12-20	C	39	True

90 rows × 4 columns