Fine dust prediction (하루 평균 미세 먼지 수치 예측)

Training Convolutional LSTM network to predict fine-dust.

pm10 단위의 하루 평균 미세 먼지 수치를 예측하기 위해 ConvLSTM 을 훈련시킵니다.

In [1]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
LOCATION_CNT = 39
BOROUGH_CNT = 25
WEEKLY_BATCH = 7

TEST_SIZE = 40
TRAIN_SIZE = 240
BATCH_SIZE = 40

TRAIN_ITER = 100
BATCH_ITER = TRAIN_SIZE // BATCH_SIZE

Pre-processing data

Pre-processing the daily average amount of the fine dust(pm10) to bundle it in 7 days.

For the convolution layer, the amount of fine dust for each region is mapped to the geographical map of Seoul (seoul_map).

The data was extracted from the csv file Seoul Daily Average Air Pollution Degree Information provided by the Seoul Open Data Plaza.

하루 평균 미세 먼지 수치를 7일 단위로 전처리 합니다.

convolution layer를 위해, 지역별 미세 먼지 수치는 실제 서울시 지도에 대응됩니다 (seoul_map).

데이터는 서울 열린 데이터 광장에서 제공하는 서울시 일별 평균 대기 오염도 정보 csv 파일에서 추출하였습니다.

In [3]:
csv_name = 'MonthlyAverageAirPollutionInSeoul.csv'

seoul_map = [ # 9 by 8 matrix, 25 borough
    [0, 0, 0, 0, '도봉구', '노원구', 0, 0],
    [0, 0, 0, '강북구', '강북구', '노원구', '노원구', 0],
    [0, '은평구', '종로구', '성북구', '성북구', '성북구', '중랑구', 0],
    [0, '은평구', '서대문구', '종로구', '종로구', '동대문구', '중랑구', 0],
    [0, '은평구', '서대문구', '서대문구', '중구', '성동구', '광진구', '강동구'],
    [0, '마포구', '마포구', '마포구', '용산구', '강남구', '송파구', '강동구'],
    ['강서구', '강서구', '영등포구', '동작구', '서초구', '강남구', '송파구', 0],
    [0, '양천구', '영등포구', '관악구', '서초구', '강남구', '송파구', 0],
    [0, '구로구', '금천구', '관악구', '서초구', 0, 0, 0]
]
In [4]:
def read_csv(name):
    """Read the csv file and return the amount of fine dust for each region grouped by date.
    
    Args:
        name: The name of the csv file to read.
    
    Returns:
        Dictionary object mapping the amount of fine dust for each region by date.
        {'20170506': {'강남구': 50, '강동구': 60, '강서구':70}, '20170507': {'강남구': } ...}
    
    Raises:
        ValueError: If the number of the data per day in the csv file is not equal to LOCATION_CNT.
        
    """
    with open(name) as f:
        raw_data = f.read().strip()

    del_quote = raw_data.replace("\"", '')
    data = list(map(lambda x: x.split(','), del_quote.split('\n')))[1:] # [1:] csv header

    splitted = []

    ptr = 0
    for i in range(len(data) // LOCATION_CNT):
        splitted.append(data[ptr:ptr+LOCATION_CNT])
        ptr += LOCATION_CNT
    
    ## test case
    for date_list in splitted:
        date = date_list[0][0] # index 0:date
        for local in date_list:
            if date != local[0]:
                raise ValueError(date + ' is not same as ' + 'local[0]')
    
    def filter_borough(dic):
        return dict(filter(lambda t: '구' in t[0], dic.items())) #filter not road name only borough 

    # index 0:date, 1:local name, 6:pms
    pms = dict(map(lambda x: (x[0][0], dict(map(lambda t: (t[1], t[6]), x))), splitted))
    pms_filtered = dict(filter(lambda x: '' not in x[1].values(), pms.items())) # csv data contains spaces
    pms_filtered2 = dict(map(lambda x: (x[0], filter_borough(x[1])), pms_filtered.items()))
    
    return pms_filtered2
In [5]:
def geographical_mapping(pms_data):
    """Map the amount of fine dust for each region to geographical map of Seoul.
    
    Args:
        pms_data: Fine dust data pre-processed by read_csv.
            Dictionary obejct mapping the amount of fine dust for each region by date.
            {'20170506': {'강남구': 50, '강동구': 60, '강서구':70}, '20170507': {'강남구': } ...}
    
    Returns:
        Dictionary that map the amount of fine dust to the geographical map of Seoul.
        {'20170506': [[0, 0, 0, 0, 50, 60, 0, 0], [0, 0, 0, 40, ...]]}
      
    """
    def dict2seoul(p):
        return list(map(lambda t: list(map(lambda x: int(p[x]) if x != 0 else 0, t)), seoul_map))

    # map dict to seoul geographic map
    pms_mapped = dict(map(lambda p: (p[0], dict2seoul(p[1])), pms_data.items())) 
    return pms_mapped
In [6]:
def generate_dataset(data):
    """ Generate the daily average amount of the fine dust(pm10) bundled in 7 days.
    
    Args:
        data: Fine dust data pre-processed by read_csv
            Dictionary object mapping the amount of fine dust for each region by date.
            {'20170506': {'강남구': 50, '강동구': 60, '강서구':70}, '20170507': {'강남구': } ...}
    
    Returns:
        pms_sampled: the amount of fine dust bundled in 7 days.
        pms_result: the amount of fine dust on the next day.
        
    """
    pms_mapped = geographical_mapping(data)
    
    # tie data to WEEKLY_BATCH(7) batches
    pms_data = list(map(lambda x: x[1], sorted(pms_mapped.items()))) 
    pms_sampled = list(map(lambda i: pms_data[i:i+WEEKLY_BATCH], range(len(pms_data) - WEEKLY_BATCH - 1)))
    pms_result = pms_data[WEEKLY_BATCH:]
    
    return pms_sampled, pms_result
In [7]:
pms_data = read_csv(csv_name)
pms_sampled, pms_result = generate_dataset(pms_data)
In [8]:
test_sampled = pms_sampled[-TEST_SIZE:]
test_result = pms_result[-TEST_SIZE:]

train_set = list(zip(pms_sampled[:TRAIN_SIZE], pms_result[:TRAIN_SIZE]))
np.random.shuffle(train_set)

train_sampled = list(map(lambda x: x[0], train_set))
train_result = list(map(lambda x: x[1], train_set))

Generate Model

Generate ConvLSTM Networks to predict fine-dust. The original paper is Convolutional LSTM.

미세 먼지 예측이 이용할 ConvLSTM 네트워크를 구축합니다. ConvLSTM 네트워크는 Convlutional LSTM 논문을 참고하였습니다.

In [9]:
X = tf.placeholder(tf.float32, [None, WEEKLY_BATCH, 9, 8]) # matrix size [9, 8]
Y = tf.placeholder(tf.float32, [None, 9, 8])

Xr = tf.reshape(X, [-1, WEEKLY_BATCH, 9, 8, 1])
Xt = tf.transpose(Xr, [1, 0, 2, 3, 4])
In [10]:
class ConvLSTMCell(tf.contrib.rnn.RNNCell):
    """ Convolutional LSTM Model. 
    Similar as LSTM Cell, but it calculate hidden states with convolution.
        
    Attributes:
        shape (tf.TensorShape): tensor shape of the output value
        kernel (list): kernel shape [f_h, f_w]
        depth (int): depth of the output
        name (str): name of the variable scope
    """
    def __init__(self, shape, kernel, depth, name='ConvLSTM_'):
        """ Convolutional LSTM initializer. It creates some tensorflow variables.
        
        Args:
            shape (tf.TensorShape): tensor shape of the output value
            kernel (list): kernel shape [f_h, f_w]
            depth (int): depth of the output
            name (str): name of the variable scope
        """
        self._shape = tf.TensorShape(shape + [depth])
        self._kernel = kernel
        self._depth = depth
        self._name = name
        
    @property
    def state_size(self):
        return tf.contrib.rnn.LSTMStateTuple(self._shape, self._shape)
        
    @property
    def output_size(self):
        return self._shape
        
    def __call__(self, x, state):
        """ Feed input tensor to the ConvLSTM and return hidden units and cells.
        
        Args:
            x (tf.Tensor): Input value of the ConvLSTM. 5D tensor
                [input dim, batch size, width, height, channel]
        
        Returns:
            h (tf.Tensor): hidden units of the ConvLSTM
            state (LSTMStateTuple): states of the ConvLSTM [cell, hidden]
        """
        with tf.variable_scope(self._name):
            c_prev, h_prev = state
            
            x = tf.concat([x, h_prev], axis=3)
            in_dim = x.shape[-1].value
            h_dim = 4 * self._depth
            
            w = tf.get_variable('w', self._kernel + [in_dim, h_dim],
                               initializer=tf.random_normal_initializer(stddev=0.02))
            conv = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')
            hi, hf, hc, ho = tf.split(conv, 4, axis=3)
            
            hi = tf.nn.sigmoid(hi)
            hf = tf.nn.sigmoid(hf)
            ho = tf.nn.sigmoid(ho)
            hc = tf.nn.tanh(hc)
            
            c = c_prev * hf + hc * hi
            h = ho * tf.nn.tanh(c)
            
            state = tf.contrib.rnn.LSTMStateTuple(c, h)
            
            return h, state
In [11]:
encoder1 = ConvLSTMCell([9, 8], [5, 5], 128, name='encoder1')
encoder2 = ConvLSTMCell([9, 8], [5, 5], 64, name='encoder2')
encoder3 = ConvLSTMCell([9, 8], [5, 5], 64, name='encoder3')

h1, state1 = tf.nn.dynamic_rnn(encoder1, Xt, dtype=tf.float32, time_major=True)
h2, state2 = tf.nn.dynamic_rnn(encoder2, h1, dtype=tf.float32, time_major=True)
h3, state3 = tf.nn.dynamic_rnn(encoder3, h2, dtype=tf.float32, time_major=True)

hidden = tf.concat([h1[-1], h2[-1], h3[-1]], axis=3)
In [12]:
# 1x1 convolution for flattening data
Wf = tf.get_variable('wf', shape=[1, 1, 256, 1],
                    initializer=tf.random_normal_initializer(stddev=0.02))
out = tf.nn.conv2d(hidden, Wf, strides=[1, 1, 1, 1], padding='SAME')

pred = tf.reshape(out, [-1, 9, 8])

loss = tf.reduce_mean(tf.square(Y - pred)) # MSE loss
opt = tf.train.RMSPropOptimizer(1e-3, 0.9).minimize(loss)

Training

Train ConvLSTM Networks with batch size 40, train iter 200.

Record training and cross validation errors during training.

ConvLSTM 네트워크를 학습시킵니다. batch 크기는 40개, 반복 횟수는 200번입니다.

훈련 과정에서 training error 와 cross validation error 을 계산, 기록합니다.

In [13]:
sess = tf.Session()
sess.run(tf.global_variables_initializer())
In [14]:
train_loss = []
test_loss = []

for i in range(TRAIN_ITER + 1):
    ptr = 0
    for _ in range(BATCH_ITER):
        _, trainloss = sess.run([opt, loss], feed_dict={X: train_sampled[ptr:ptr+BATCH_SIZE], 
                                                        Y: train_result[ptr:ptr+BATCH_SIZE]})
        
        testloss = sess.run(loss, feed_dict={X: test_sampled, Y: test_result})
        
        ptr += BATCH_SIZE

    train_loss.append(trainloss)
    test_loss.append(testloss)
    
    if i % (TRAIN_ITER // 10) == 0:
        print(i/TRAIN_ITER, testloss)
0.0 2634.04
0.1 1438.95
0.2 873.191
0.3 501.78
0.4 404.776
0.5 380.151
0.6 355.992
0.7 378.235
0.8 401.75
0.9 406.401
1.0 426.977

Visualization

Visualize loss graph and actual value and estimates for the test set.

loss graph 와 test set 에 대한 실제 값과 평가치를 시각화합니다.

In [15]:
plt.plot(train_loss, label='train')
plt.plot(test_loss, label='test')
plt.legend()
Out[15]:
<matplotlib.legend.Legend at 0x7f2fc6258978>
In [16]:
predict = sess.run(pred, feed_dict={X: test_sampled})
pred_gangnam = list(map(lambda x: x[6][5], predict[-15:]))
real_gangnam = list(map(lambda x: x[6][5], test_result[-15:]))

plt.plot(pred_gangnam, label='predict')
plt.plot(real_gangnam, label='real')
plt.legend()
Out[16]:
<matplotlib.legend.Legend at 0x7f2fc618ecf8>
In [17]:
pred_dongdaemun = list(map(lambda x: x[3][5], predict[-15:]))
real_dongdaemun = list(map(lambda x: x[3][5], test_result[-15:]))

plt.plot(pred_dongdaemun, label='predict')
plt.plot(real_dongdaemun, label='real')
plt.legend()
Out[17]:
<matplotlib.legend.Legend at 0x7f2fc6139f60>
In [18]:
sess.close()