This notebook explains how to identify date gaps in time series data with pandas
.
import pandas as pd
import numpy as np
We will create a dataframe that contains multiple time series, one for each group.
np.random.seed(1066)
dates = pd.date_range(start='2010-01-01', end='2010-12-31', freq='D')
df = pd.DataFrame({'date': dates,
'group': 'A',
'value': np.random.randint(0, 100, size=len(dates))
}).append(pd.DataFrame({'date': dates,
'group': 'B',
'value': np.random.randint(0, 100, size=len(dates))
})).append(pd.DataFrame({'date': dates,
'group': 'C',
'value': np.random.randint(0, 100, size=len(dates))
})).reset_index(drop=True)
df
date | group | value | |
---|---|---|---|
0 | 2010-01-01 | A | 57 |
1 | 2010-01-02 | A | 11 |
2 | 2010-01-03 | A | 83 |
3 | 2010-01-04 | A | 83 |
4 | 2010-01-05 | A | 93 |
... | ... | ... | ... |
1090 | 2010-12-27 | C | 50 |
1091 | 2010-12-28 | C | 59 |
1092 | 2010-12-29 | C | 85 |
1093 | 2010-12-30 | C | 32 |
1094 | 2010-12-31 | C | 3 |
1095 rows × 3 columns
Drop some rows randomly to create gaps in the data.
length = df.shape[0]
droplist = np.unique(np.sort(np.random.randint(0, length, size=100))).tolist()
df = df.drop(droplist).reset_index(drop=True)
df
date | group | value | |
---|---|---|---|
0 | 2010-01-01 | A | 57 |
1 | 2010-01-02 | A | 11 |
2 | 2010-01-03 | A | 83 |
3 | 2010-01-04 | A | 83 |
4 | 2010-01-05 | A | 93 |
... | ... | ... | ... |
992 | 2010-12-27 | C | 50 |
993 | 2010-12-28 | C | 59 |
994 | 2010-12-29 | C | 85 |
995 | 2010-12-30 | C | 32 |
996 | 2010-12-31 | C | 3 |
997 rows × 3 columns
The pandas function diff
applied to the date field will generate the time between subsequent datas. Comparing that to 1 day can identify date gaps in the data.
singleseries = df[df.group == 'A']
singleseries['gap'] = singleseries['date'].sort_values().diff() > pd.to_timedelta('1 day')
singleseries[singleseries.gap]
date | group | value | gap | |
---|---|---|---|---|
39 | 2010-02-10 | A | 97 | True |
44 | 2010-02-17 | A | 93 | True |
45 | 2010-02-19 | A | 88 | True |
57 | 2010-03-04 | A | 92 | True |
77 | 2010-03-25 | A | 44 | True |
81 | 2010-03-30 | A | 94 | True |
86 | 2010-04-05 | A | 7 | True |
89 | 2010-04-10 | A | 65 | True |
92 | 2010-04-15 | A | 85 | True |
99 | 2010-04-23 | A | 7 | True |
115 | 2010-05-10 | A | 46 | True |
129 | 2010-05-25 | A | 50 | True |
132 | 2010-05-29 | A | 71 | True |
136 | 2010-06-03 | A | 42 | True |
164 | 2010-07-02 | A | 26 | True |
175 | 2010-07-14 | A | 50 | True |
198 | 2010-08-07 | A | 6 | True |
201 | 2010-08-11 | A | 84 | True |
209 | 2010-08-20 | A | 14 | True |
212 | 2010-08-24 | A | 60 | True |
238 | 2010-09-20 | A | 39 | True |
246 | 2010-09-29 | A | 34 | True |
259 | 2010-10-13 | A | 27 | True |
270 | 2010-10-25 | A | 19 | True |
271 | 2010-10-27 | A | 3 | True |
291 | 2010-11-17 | A | 1 | True |
297 | 2010-11-24 | A | 39 | True |
298 | 2010-11-26 | A | 28 | True |
302 | 2010-12-01 | A | 3 | True |
The same process can be applied to data with multiple time series, we just need to group on the time series identifier (group) before applying the
df['gap'] = df[['group', 'date']].sort_values(by=['group', 'date']).groupby('group').diff() > pd.to_timedelta('1 day')
df[df.gap]
date | group | value | gap | |
---|---|---|---|---|
39 | 2010-02-10 | A | 97 | True |
44 | 2010-02-17 | A | 93 | True |
45 | 2010-02-19 | A | 88 | True |
57 | 2010-03-04 | A | 92 | True |
77 | 2010-03-25 | A | 44 | True |
... | ... | ... | ... | ... |
920 | 2010-10-11 | C | 98 | True |
950 | 2010-11-11 | C | 0 | True |
958 | 2010-11-20 | C | 72 | True |
974 | 2010-12-08 | C | 1 | True |
985 | 2010-12-20 | C | 39 | True |
90 rows × 4 columns