This notebook is for challenge 2 of the Cambridge Energy Data Lab GitHub tasks.

URL: https://github.com/camenergydatalab/EnergyDataSimulationChallenge

In [395]:
import numpy
import pandas
import pylab

Steps 1 & 2

  • Download the data-set total-watt.csv
  • The data-set consists of two columns: a time stamp and the energy consumption

Load data file as pandas DataFrame

In [396]:
# The file has no header row, so the column names are supplied explicitly.
df_total_watt = pandas.read_csv(
    '../../data/total_watt.csv',
    header=None,
    names=['datetime', 'consumption'],
    parse_dates=['datetime'],
)

Quickly check loaded data

In [398]:
df_total_watt
Out[398]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1601 entries, 0 to 1600
Data columns (total 2 columns):
datetime       1601  non-null values
consumption    1601  non-null values
dtypes: datetime64[ns](1), float64(1)
In [399]:
df_total_watt.head()
Out[399]:
datetime consumption
0 2011-04-18 13:22:00 925.840614
1 2011-04-18 13:52:00 483.295892
2 2011-04-18 14:22:00 915.761634
3 2011-04-18 14:52:00 609.043491
4 2011-04-18 15:22:00 745.155434
In [405]:
df_total_watt.consumption.describe()
Out[405]:
count    1601.000000
mean      509.789108
std       788.525174
min        55.535874
25%       179.004211
50%       312.145277
75%       510.456476
max      9107.922620
dtype: float64
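
As a self-contained illustration of the loading step, the sketch below parses two rows out of an in-memory string instead of the real file. The rows are copied from the head() output above, and the comma-separated layout is assumed from the separator used when loading. The names `sample_csv` and `df_sample` exist only in this example.

```python
import io
import pandas

# Two rows copied from the head() output above; the real file has 1601 rows.
sample_csv = (
    "2011-04-18 13:22:00,925.840614\n"
    "2011-04-18 13:52:00,483.295892\n"
)

df_sample = pandas.read_csv(
    io.StringIO(sample_csv),
    header=None,                        # no header row in the data
    names=['datetime', 'consumption'],  # so the names are given here
    parse_dates=['datetime'],           # parse the time stamps as datetime64
)
print(df_sample.dtypes)
```

Checking the dtypes right after loading is a cheap way to confirm that the time stamps were parsed as dates and not left as strings.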

Step 3

  • visualise the data-set

Visualize the data-set as it is

Use a line plot instead of a bar chart because there are too many data points

In [402]:
pylab.plot(df_total_watt.datetime, df_total_watt.consumption)
pylab.ylabel('consumption (Wh)')
pylab.xlabel('time')
pylab.title('energy consumption per 30mins')
Out[402]:
<matplotlib.text.Text at 0x124ee3e10>

Visualize changes per 30mins

In [406]:
pylab.plot(df_total_watt.datetime, df_total_watt.consumption.diff())
pylab.ylabel('consumption change (Wh)')
pylab.xlabel('time')
pylab.title('energy consumption change per 30mins')
Out[406]:
<matplotlib.text.Text at 0x127621e50>
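
To make clear what diff() plots, here is a small sketch on the first three consumption values from the head() output above. Each entry is the current reading minus the previous one, and the first entry has no predecessor, so it is NaN. The names `s` and `changes` are only for this example.

```python
import pandas

# First three consumption values from the head() output above.
s = pandas.Series([925.840614, 483.295892, 915.761634])

# diff() subtracts the previous row from each row; the first row becomes NaN.
changes = s.diff()
print(changes)
```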

Step 4

  • visualise the data-set as values per day

Group by date

In [407]:
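# Extract the calendar date from each time stamp.
df_total_watt['date'] = df_total_watt['datetime'].dt.date
# Sum only the consumption column; the datetime column cannot be summed.
df_total_watt_daily = df_total_watt.groupby('date')['consumption'].sum().reset_index()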
In [408]:
df_total_watt_daily.head()
Out[408]:
date consumption
0 2011-04-18 17105.982347
1 2011-04-19 30440.466453
2 2011-04-20 24027.226338
3 2011-04-21 17475.725746
4 2011-04-22 18776.278103
In [409]:
pylab.bar(df_total_watt_daily.date, df_total_watt_daily.consumption)
pylab.ylabel('consumption(Wh)')
pylab.xlabel('date')
pylab.title('energy consumption per day')
Out[409]:
<matplotlib.text.Text at 0x128062590>
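
The daily totals above come from grouping on a derived date column. An alternative that may be worth knowing is resampling on a datetime index; the sketch below shows it on invented readings (the values and the names `df_demo` and `daily` are assumptions for illustration, not part of the data-set).

```python
import pandas

# Hypothetical readings (values invented for illustration).
df_demo = pandas.DataFrame({
    'datetime': pandas.to_datetime([
        '2011-04-18 13:22:00', '2011-04-18 13:52:00', '2011-04-19 00:22:00',
    ]),
    'consumption': [100.0, 200.0, 50.0],
})

# Sum consumption per calendar day using the datetime index.
daily = df_demo.set_index('datetime')['consumption'].resample('D').sum()
print(daily)
```

One difference from groupby: resample('D') also emits a row for any day inside the range that has no readings, with a sum of 0, which may or may not be what is wanted.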

Step 5

  • cluster the values per day into 3 groups: low, medium, and high energy consumption

Quickly check basic stats

In [362]:
df_total_watt_daily.consumption.describe()
Out[362]:
count       35.000000
mean     23319.210323
std      14109.683226
min       8278.602258
25%      13794.652107
50%      18776.278103
75%      25235.158554
max      66411.835632
dtype: float64

Check how total Wh/day is distributed

Most days are concentrated in the range of 10000 to 25000 Wh.

In [640]:
count = {}
interval = 5000
consumption_ranges = range(0,70000,interval)
for consumption_range in consumption_ranges:
    count[consumption_range] = \
    df_total_watt_daily[df_total_watt_daily['consumption'] >= consumption_range].consumption.count() - \
    df_total_watt_daily[df_total_watt_daily['consumption'] >= consumption_range + interval].consumption.count()
    
# Keys are the left edges of the bins, so align the bars on their edges.
pylab.bar(list(count.keys()), list(count.values()), width=interval, align='edge')
pylab.ylabel('frequency (days)')
pylab.xlabel('total consumption per day (Wh)')
pylab.title('distribution of energy consumption per day')
Out[640]:
<matplotlib.text.Text at 0x1328e3150>
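
The counting loop above builds a histogram by hand. A more compact way to get the same kind of counts is numpy.histogram; the sketch below uses invented daily totals (the array `daily_totals` is an assumption for illustration) with the same 5000-wide bins.

```python
import numpy

# Hypothetical daily totals in Wh (invented for illustration).
daily_totals = numpy.array([8000.0, 12000.0, 14000.0, 18000.0, 26000.0, 66000.0])

interval = 5000
bin_edges = numpy.arange(0, 70000 + interval, interval)  # 0, 5000, ..., 70000

# counts[i] is the number of days with bin_edges[i] <= total < bin_edges[i + 1].
counts, edges = numpy.histogram(daily_totals, bins=bin_edges)

for left, n in zip(edges[:-1], counts):
    if n:
        print(int(left), int(n))
```

The only difference from the loop is that numpy.histogram treats the last bin as closed on the right, so a value of exactly 70000 would be counted in it.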

Define high, low, and medium as follows.

  • high > mean + std
  • low < mean - std
  • mean - std <= medium <= mean + std

(*) As the chart above shows, consumption is not normally distributed, so the interval mean ± 1 std does not cover roughly 68% of the days as it would for a normal distribution.
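
The rule above can be sketched as a small function. The mean and std are copied from the describe() output of the daily totals, and the three test values are daily totals that appear in this notebook's tables. The function name `classify` and the threshold variable names are assumptions made for this example.

```python
# Summary statistics copied from the describe() output above (Wh per day).
mean = 23319.210323
std = 14109.683226

low_threshold = mean - std    # days below this are "low"
high_threshold = mean + std   # days above this are "high"

def classify(consumption):
    """Return 'low', 'medium', or 'high' for one daily total in Wh."""
    if consumption > high_threshold:
        return 'high'
    if consumption < low_threshold:
        return 'low'
    return 'medium'

# Daily totals for 2011-05-06, 2011-04-18, and 2011-05-23.
for value in (8278.602258, 17105.982347, 40378.805967):
    print(value, classify(value))
```

With these statistics the cut-offs are about 9210 Wh and 37429 Wh, which is why only very few days can fall into the low group.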

In [410]:
df_total_watt_daily_high = df_total_watt_daily[df_total_watt_daily.consumption > df_total_watt_daily.consumption.mean() + df_total_watt_daily.consumption.std()]
In [411]:
df_total_watt_daily_low = df_total_watt_daily[df_total_watt_daily.consumption < df_total_watt_daily.consumption.mean() - df_total_watt_daily.consumption.std()]
In [412]:
df_total_watt_daily_middle = df_total_watt_daily[df_total_watt_daily.consumption <= df_total_watt_daily.consumption.mean() + df_total_watt_daily.consumption.std()]
df_total_watt_daily_middle = df_total_watt_daily_middle[df_total_watt_daily_middle.consumption >= df_total_watt_daily.consumption.mean() - df_total_watt_daily.consumption.std()]
In [413]:
df_total_watt_daily_low
Out[413]:
date consumption
16 2011-05-06 8278.602258
In [414]:
df_total_watt_daily_high
Out[414]:
date consumption
5 2011-04-23 41551.530034
6 2011-04-24 58647.570168
12 2011-04-30 66411.835632
13 2011-05-01 42598.899117
32 2011-05-22 51413.203602
33 2011-05-23 40378.805967
In [415]:
df_total_watt_daily_middle.head()
Out[415]:
date consumption
0 2011-04-18 17105.982347
1 2011-04-19 30440.466453
2 2011-04-20 24027.226338
3 2011-04-21 17475.725746
4 2011-04-22 18776.278103

Step 6

  • visualise the clusters
In [416]:
middle = pylab.bar(df_total_watt_daily_middle.date, df_total_watt_daily_middle.consumption, color='green')
high = pylab.bar(df_total_watt_daily_high.date, df_total_watt_daily_high.consumption, color='red')
low = pylab.bar(df_total_watt_daily_low.date, df_total_watt_daily_low.consumption, color='blue')
pylab.ylabel('energy consumption per day (Wh)')
pylab.xlabel('date')
pylab.title('clustered energy consumption per day')
pylab.legend((middle, high, low), ('middle','high', 'low'))
Out[416]:
<matplotlib.legend.Legend at 0x127d36550>

(Optional) Use log values for categorization

Plot daily consumption in order

The log of the values looks more suitable for clustering

In [598]:
# sort_values replaces the older DataFrame.sort, which newer pandas no longer provides.
pylab.bar(range(df_total_watt_daily.consumption.count()), df_total_watt_daily.sort_values('consumption')['consumption'])
pylab.ylabel('energy consumption per day (Wh)')
pylab.xlabel('order')
pylab.title('energy consumption per day ordered by consumption')
Out[598]:
<matplotlib.text.Text at 0x131c79290>
In [597]:
# Same chart as above, with a logarithmic y-axis.
pylab.bar(range(df_total_watt_daily.consumption.count()), df_total_watt_daily.sort_values('consumption')['consumption'],
          log=True)
pylab.ylabel('energy consumption per day (Wh)')
pylab.xlabel('order')
pylab.title('energy consumption per day ordered by consumption')
Out[597]:
<matplotlib.text.Text at 0x131a00310>

Use log values this time, with the mean and std computed on log(consumption)

  • high > mean + std
  • low < mean - std
  • mean - std <= medium <= mean + std
In [560]:
# Natural log of the daily totals; numpy.log works on the whole column at once.
df_total_watt_daily['consumption_log'] = numpy.log(df_total_watt_daily.consumption)
In [418]:
df_total_watt_daily['consumption_log'].describe()
Out[418]:
count    35.000000
mean      9.915222
std       0.517556
min       9.021429
25%       9.531914
50%       9.840350
75%      10.135928
max      11.103631
dtype: float64
In [419]:
df_total_watt_daily_log_high = df_total_watt_daily[df_total_watt_daily['consumption_log'] > df_total_watt_daily['consumption_log'].mean() + df_total_watt_daily['consumption_log'].std()]
df_total_watt_daily_log_low = df_total_watt_daily[df_total_watt_daily['consumption_log'] < df_total_watt_daily['consumption_log'].mean() - df_total_watt_daily['consumption_log'].std()]
df_total_watt_daily_log_middle = df_total_watt_daily[df_total_watt_daily['consumption_log'] <= df_total_watt_daily['consumption_log'].mean() + df_total_watt_daily['consumption_log'].std()]
df_total_watt_daily_log_middle = df_total_watt_daily_log_middle[df_total_watt_daily_log_middle['consumption_log'] >= df_total_watt_daily['consumption_log'].mean() - df_total_watt_daily['consumption_log'].std()]
In [420]:
middle = pylab.bar(df_total_watt_daily_log_middle.date, df_total_watt_daily_log_middle.consumption, color='green')
high = pylab.bar(df_total_watt_daily_log_high.date, df_total_watt_daily_log_high.consumption, color='red')
low = pylab.bar(df_total_watt_daily_log_low.date, df_total_watt_daily_log_low.consumption, color='blue')
pylab.ylabel('energy consumption per day (Wh)')
pylab.xlabel('date')
pylab.title('clustered energy consumption per day')
pylab.legend((middle, high, low), ('middle','high', 'low'))
Out[420]:
<matplotlib.legend.Legend at 0x12a023550>
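
To see what the log-based rule means in the original units, the thresholds can be converted back with exp(). The sketch below uses the mean and std of consumption_log copied from the describe() output above; the variable names are assumptions for this example.

```python
import math

# Statistics of consumption_log copied from the describe() output above.
mean_log = 9.915222
std_log = 0.517556

# exp() undoes the natural log, giving the cut-offs in Wh per day.
low_threshold_wh = math.exp(mean_log - std_log)
high_threshold_wh = math.exp(mean_log + std_log)

print(round(low_threshold_wh), round(high_threshold_wh))
```

The cut-offs come out at roughly 12,000 Wh and 34,000 Wh. Compared with the linear rule, the low cut-off moves up and the high cut-off moves down, so more days land in the low and high groups.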

(Optional 2) Use K-means for Categorization

Simply create three clusters by k-means

k-means clustering using the 25%, 50%, and 75% percentiles as initial centroids

In [463]:
from scipy.cluster.vq import kmeans2, whiten
In [578]:
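# Rescale the daily totals to unit variance before clustering.
whitened = whiten(df_total_watt_daily.consumption.values)
# Start from the quartiles so the centroids are ordered low, middle, high.
initial_centroids = numpy.array([
    numpy.percentile(whitened, 25),
    numpy.percentile(whitened, 50),
    numpy.percentile(whitened, 75),
])
centroids, labels = kmeans2(whitened, initial_centroids)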
In [579]:
centroids, labels
Out[579]:
(array([ 0.97785871,  1.70177426,  3.60740752]),
 array([0, 1, 1, 0, 1, 2, 2, 1, 1, 1, 0, 0, 2, 2, 0, 1, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 0, 0, 0, 0, 0, 2, 2, 1]))
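
The centroids printed above are in whitened units, not Wh. A minimal sketch of that rescaling in plain numpy follows, under the assumption that whitening a single 1-D feature amounts to dividing by its standard deviation. The sample values are invented for illustration.

```python
import numpy

# Hypothetical daily totals in Wh (invented for illustration).
totals = numpy.array([10000.0, 20000.0, 30000.0])

# Whitening a single feature: divide by its standard deviation.
whitened = totals / totals.std()

# Multiplying by the same standard deviation converts back to Wh.
restored = whitened * totals.std()
print(whitened, restored)
```

The same multiplication can be applied to the centroids to read them in Wh.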
In [580]:
df_total_watt_daily['labels'] = labels
In [581]:
df_total_watt_daily_cluster_low = df_total_watt_daily[df_total_watt_daily['labels'] == 0]
df_total_watt_daily_cluster_middle = df_total_watt_daily[df_total_watt_daily['labels'] == 1]
df_total_watt_daily_cluster_high = df_total_watt_daily[df_total_watt_daily['labels'] == 2]
In [582]:
low = pylab.bar(df_total_watt_daily_cluster_low.date, df_total_watt_daily_cluster_low.consumption, color='blue')
middle = pylab.bar(df_total_watt_daily_cluster_middle.date, df_total_watt_daily_cluster_middle.consumption, color='green')
high = pylab.bar(df_total_watt_daily_cluster_high.date, df_total_watt_daily_cluster_high.consumption, color='red')
pylab.ylabel('energy consumption per day (Wh)')
pylab.xlabel('date')
pylab.title('clustered energy consumption per day')
pylab.legend((middle, high, low), ('middle','high', 'low'))
Out[582]:
<matplotlib.legend.Legend at 0x12eb65610>
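
For readers who want to see what the clustering step does, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm) on 1-D data with explicit starting centroids. It is an illustration of the idea, not scipy's implementation; the function name `kmeans_1d` and the sample totals are assumptions for this example.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Lloyd's algorithm on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: abs(v - centroids[k]))
            for v in values
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, label in zip(values, labels) if label == k]
            if members:  # leave an empty cluster's centroid where it is
                centroids[k] = sum(members) / len(members)
    return centroids, labels

# Hypothetical daily totals in Wh (invented for illustration).
totals = [9000.0, 10000.0, 11000.0, 20000.0, 21000.0, 22000.0, 50000.0, 60000.0]
centroids, labels = kmeans_1d(totals, centroids=[10000.0, 20000.0, 40000.0])
print(centroids, labels)
```

In this example the starting centroids are in ascending order, and the labels 0, 1, 2 end up meaning low, middle, high. The notebook relies on the same correspondence when it maps labels to groups; if the starting centroids were not ordered, sorting the final centroids before naming the groups would be safer.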