:label:sec_sequence
Imagine that you are watching movies on Netflix. As a good Netflix user, you decide to rate each of the movies religiously. After all, a good movie is a good movie, and you want to watch more of them, right? As it turns out, things are not quite so simple. People's opinions on movies can change quite significantly over time. In fact, psychologists even have names for some of the effects:
:cite:Wu.Ahmed.Beutel.ea.2017
.
In short, movie ratings are anything but stationary. Thus, using temporal dynamics
led to more accurate movie recommendations :cite:Koren.2009
.
Of course, sequence data are not just about movie ratings. The following gives more illustrations.
We need statistical tools and new deep neural network architectures to deal with sequence data. To keep things simple, we use the stock price (FTSE 100 index) illustrated in :numref:fig_ftse100
as an example.
:width:400px
:label:fig_ftse100
Let us denote the prices by $x_t$, i.e., at time step $t \in \mathbb{Z}^+$ we observe price $x_t$. Note that for sequences in this text, $t$ will typically be discrete and vary over integers or its subset. Suppose that a trader who wants to do well in the stock market on day $t$ predicts $x_t$ via
$$x_t \sim P(x_t \mid x_{t-1}, \ldots, x_1).$$In order to achieve this, our trader could use a regression model such as the one that we trained in :numref:sec_linear_concise
.
There is just one major problem: the number of inputs, $x_{t-1}, \ldots, x_1$ varies, depending on $t$.
That is to say, the number increases with the amount of data that we encounter, and we will need an approximation to make this computationally tractable.
Much of what follows in this chapter will revolve around how to estimate $P(x_t \mid x_{t-1}, \ldots, x_1)$ efficiently. In a nutshell it boils down to two strategies as follows.
First, assume that the potentially rather long sequence $x_{t-1}, \ldots, x_1$ is not really necessary. In this case we might content ourselves with some timespan of length $\tau$ and only use $x_{t-1}, \ldots, x_{t-\tau}$ observations. The immediate benefit is that now the number of arguments is always the same, at least for $t > \tau$. This allows us to train a deep network as indicated above. Such models will be called autoregressive models, as they quite literally perform regression on themselves.
The second strategy, shown in :numref:fig_sequence-model
, is to keep some summary $h_t$ of the past observations, and at the same time update $h_t$ in addition to the prediction $\hat{x}_t$.
This leads to models that estimate $x_t$ with $\hat{x}_t = P(x_t \mid h_{t})$ and moreover updates of the form $h_t = g(h_{t-1}, x_{t-1})$. Since $h_t$ is never observed, these models are also called latent autoregressive models.
:label:fig_sequence-model
Both cases raise the obvious question of how to generate training data. One typically uses historical observations to predict the next observation given the ones up to right now. Obviously we do not expect time to stand still. However, a common assumption is that while the specific values of $x_t$ might change, at least the dynamics of the sequence itself will not. This is reasonable, since novel dynamics are just that, novel and thus not predictable using data that we have so far. Statisticians call dynamics that do not change stationary. Regardless of what we do, we will thus get an estimate of the entire sequence via
$$P(x_1, \ldots, x_T) = \prod_{t=1}^T P(x_t \mid x_{t-1}, \ldots, x_1).$$Note that the above considerations still hold if we deal with discrete objects, such as words, rather than continuous numbers. The only difference is that in such a situation we need to use a classifier rather than a regression model to estimate $P(x_t \mid x_{t-1}, \ldots, x_1)$.
Recall the approximation that in an autoregressive model we use only $x_{t-1}, \ldots, x_{t-\tau}$ instead of $x_{t-1}, \ldots, x_1$ to estimate $x_t$. Whenever this approximation is accurate we say that the sequence satisfies a Markov condition. In particular, if $\tau = 1$, we have a first-order Markov model and $P(x)$ is given by
$$P(x_1, \ldots, x_T) = \prod_{t=1}^T P(x_t \mid x_{t-1}) \text{ where } P(x_1 \mid x_0) = P(x_1).$$Such models are particularly nice whenever $x_t$ assumes only a discrete value, since in this case dynamic programming can be used to compute values along the chain exactly. For instance, we can compute $P(x_{t+1} \mid x_{t-1})$ efficiently:
$$\begin{aligned} P(x_{t+1} \mid x_{t-1}) &= \frac{\sum_{x_t} P(x_{t+1}, x_t, x_{t-1})}{P(x_{t-1})}\\ &= \frac{\sum_{x_t} P(x_{t+1} \mid x_t, x_{t-1}) P(x_t, x_{t-1})}{P(x_{t-1})}\\ &= \sum_{x_t} P(x_{t+1} \mid x_t) P(x_t \mid x_{t-1}) \end{aligned} $$by using the fact that we only need to take into account a very short history of past observations: $P(x_{t+1} \mid x_t, x_{t-1}) = P(x_{t+1} \mid x_t)$. Going into details of dynamic programming is beyond the scope of this section. Control and reinforcement learning algorithms use such tools extensively.
In principle, there is nothing wrong with unfolding $P(x_1, \ldots, x_T)$ in reverse order. After all, by conditioning we can always write it via
$$P(x_1, \ldots, x_T) = \prod_{t=T}^1 P(x_t \mid x_{t+1}, \ldots, x_T).$$In fact, if we have a Markov model, we can obtain a reverse conditional probability distribution, too. In many cases, however, there exists a natural direction for the data, namely going forward in time. It is clear that future events cannot influence the past. Hence, if we change $x_t$, we may be able to influence what happens for $x_{t+1}$ going forward but not the converse. That is, if we change $x_t$, the distribution over past events will not change. Consequently, it ought to be easier to explain $P(x_{t+1} \mid x_t)$ rather than $P(x_t \mid x_{t+1})$. For instance, it has been shown that in some cases we can find $x_{t+1} = f(x_t) + \epsilon$ for some additive noise $\epsilon$, whereas the converse is not true :cite:Hoyer.Janzing.Mooij.ea.2009
. This is great news, since it is typically the forward direction that we are interested in estimating.
The book by Peters et al. has
explained more on this topic :cite:Peters.Janzing.Scholkopf.2017
.
We are barely scratching the surface of it.
After reviewing so many statistical tools, let us try this out in practice. We begin by generating some data. To keep things simple we generate our sequence data by using a sine function with some additive noise for time steps $1, 2, \ldots, 1000$.
%load ../utils/djl-imports
%load ../utils/plot-utils
%load ../utils/Functions.java
NDManager manager = NDManager.newBaseManager();
public static Figure plot(double[] x, double[] y, String xLabel, String yLabel) {
ScatterTrace trace = ScatterTrace.builder(x,y)
.mode(ScatterTrace.Mode.LINE)
.build();
Layout layout = Layout.builder()
.showLegend(true)
.xAxis(Axis.builder().title(xLabel).build())
.yAxis(Axis.builder().title(yLabel).build())
.build();
return new Figure(layout, trace);
}
int T = 1000; //Generate a total of 1000 points
NDArray time = manager.arange(1f, T+1);
NDArray x = time.mul(0.01).sin().add(
manager.randomNormal(0f, 0.2f, new Shape(T), DataType.FLOAT32));
double[] xAxis = Functions.floatToDoubleArray(time.toFloatArray());
double[] yAxis = Functions.floatToDoubleArray(x.toFloatArray());
plot(xAxis, yAxis, "time", "x");
Next, we need to turn such a sequence into features and labels that our model can train on. Based on the embedding dimension $\tau$ we map the data into pairs $y_t = x_t$ and $\mathbf{x}_t = [x_{t-\tau}, \ldots, x_{t-1}]$. The astute reader might have noticed that this gives us $\tau$ fewer data examples, since we do not have sufficient history for the first $\tau$ of them. A simple fix, in particular if the sequence is long, is to discard those few terms. Alternatively we could pad the sequence with zeros. Here we only use the first 600 feature-label pairs for training.
int tau = 4;
NDArray features = manager.zeros(new Shape(T - tau, tau));
for (int i = 0; i < tau; i++) {
features.set(new NDIndex(":, {}", i), x.get(new NDIndex("{}:{}", i, T - tau + i)));
}
NDArray labels = x.get(new NDIndex("" + tau + ":")).reshape(new Shape(-1,1));
int batchSize = 16;
int nTrain = 600;
// Only the first `nTrain` examples are used for training
ArrayDataset trainIter = new ArrayDataset.Builder()
.setData(features.get(new NDIndex(":{}", nTrain)))
.optLabels(labels.get(new NDIndex(":{}", nTrain)))
.setSampling(batchSize, true)
.build();
Here we keep the architecture fairly simple: just an MLP with two fully-connected layers, ReLU activation, and squared loss.
// A simple MLP
public static SequentialBlock getNet() {
SequentialBlock net = new SequentialBlock();
net.add(Linear.builder().setUnits(10).build());
net.add(Activation::relu);
net.add(Linear.builder().setUnits(1).build());
return net;
}
Now we are ready to train the model. The code below is essentially identical to the training loop in previous sections,
such as :numref:sec_linear_concise
.
Thus, we will not delve into much detail.
// We add this outside of the function `train` to keep a strong reference to the trainer object in the notebook (else sometimes it may get closed)
Trainer trainer = null;
public static Model train(SequentialBlock net, ArrayDataset dataset, int batchSize, int numEpochs, float learningRate)
throws IOException, TranslateException {
// Square Loss
Loss loss = Loss.l2Loss();
Tracker lrt = Tracker.fixed(learningRate);
Optimizer adam = Optimizer.adam().optLearningRateTracker(lrt).build();
DefaultTrainingConfig config = new DefaultTrainingConfig(loss)
.optOptimizer(adam) // Optimizer (loss function)
.optInitializer(new XavierInitializer(), "")
.addTrainingListeners(TrainingListener.Defaults.logging()); // Logging
Model model = Model.newInstance("sequence");
model.setBlock(net);
trainer = model.newTrainer(config);
for (int epoch = 1; epoch <= numEpochs; epoch++) {
// Iterate over dataset
for (Batch batch : trainer.iterateDataset(dataset)) {
// Update loss and evaulator
EasyTrain.trainBatch(trainer, batch);
// Update parameters
trainer.step();
batch.close();
}
// reset training and validation evaluators at end of epoch
trainer.notifyListeners(listener -> listener.onEpoch(trainer));
System.out.printf("Epoch %d\n", epoch);
System.out.printf("Loss %f\n", trainer.getTrainingResult().getTrainLoss());
}
return model;
}
SequentialBlock net = getNet();
Model model = train(net, trainIter, batchSize, 5, 0.01f);
Since the training loss is small, we would expect our model to work well. Let us see what this means in practice. The first thing to check is how well the model is able to predict what happens just in the next time step, namely the one-step-ahead prediction.
Translator translator = new NoopTranslator(null);
Predictor predictor = model.newPredictor(translator);
NDArray onestepPreds = ((NDList) predictor.predict(new NDList(features))).get(0);
ScatterTrace trace = ScatterTrace.builder(Functions.floatToDoubleArray(time.toFloatArray()),
Functions.floatToDoubleArray(x.toFloatArray()))
.mode(ScatterTrace.Mode.LINE)
.name("data")
.build();
ScatterTrace trace2 = ScatterTrace.builder(Functions.floatToDoubleArray(time.get(new NDIndex("{}:", tau)).toFloatArray()),
Functions.floatToDoubleArray(onestepPreds.toFloatArray()))
.mode(ScatterTrace.Mode.LINE)
.name("1-step preds")
.build();
Layout layout = Layout.builder()
.showLegend(true)
.xAxis(Axis.builder().title("time").build())
.yAxis(Axis.builder().title("x").build())
.build();
new Figure(layout, trace, trace2);
The one-step-ahead predictions look nice, just as we expected.
Even beyond 604 (n_train + tau
) observations the predictions still look trustworthy.
However, there is just one little problem to this:
if we observe sequence data only until time step 604, we cannot hope to receive the inputs for all the future one-step-ahead predictions.
Instead, we need to work our way forward one step at a time:
Generally, for an observed sequence up to $x_t$, its predicted output $\hat{x}_{t+k}$ at time step $t+k$ is called the $k$-step-ahead prediction. Since we have observed up to $x_{604}$, its $k$-step-ahead prediction is $\hat{x}_{604+k}$. In other words, we will have to use our own predictions to make multistep-ahead predictions. Let us see how well this goes.
NDArray multiStepPreds = manager.zeros(new Shape(T));
multiStepPreds.set(new NDIndex(":{}", nTrain + tau), x.get(new NDIndex(":{}", nTrain + tau)));
for (int i = nTrain + tau; i < T; i++) {
NDArray tempX = multiStepPreds.get(new NDIndex("{}:{}", i - tau, i)).reshape(new Shape(1, -1));
NDArray prediction = ((NDList) predictor.predict(new NDList(tempX))).get(0);
multiStepPreds.set(new NDIndex(i), prediction);
}
ScatterTrace trace = ScatterTrace.builder(Functions.floatToDoubleArray(time.toFloatArray()),
Functions.floatToDoubleArray(x.toFloatArray()))
.mode(ScatterTrace.Mode.LINE)
.name("data")
.build();
ScatterTrace trace2 = ScatterTrace.builder(Functions.floatToDoubleArray(time.get(new NDIndex("{}:", tau)).toFloatArray()),
Functions.floatToDoubleArray(onestepPreds.toFloatArray()))
.mode(ScatterTrace.Mode.LINE)
.name("1-step preds")
.build();
ScatterTrace trace3 = ScatterTrace.builder(Functions.floatToDoubleArray(time.get(
new NDIndex("{}:", nTrain + tau)).toFloatArray()),
Functions.floatToDoubleArray(multiStepPreds.get(
new NDIndex("{}:", nTrain + tau)).toFloatArray()))
.mode(ScatterTrace.Mode.LINE)
.name("multistep preds")
.build();
Layout layout = Layout.builder()
.showLegend(true)
.xAxis(Axis.builder().title("time").build())
.yAxis(Axis.builder().title("x").build())
.build();
new Figure(layout, trace, trace2, trace3);
As the above example shows, this is a spectacular failure. The predictions decay to a constant pretty quickly after a few prediction steps. Why did the algorithm work so poorly? This is ultimately due to the fact that the errors build up. Let us say that after step 1 we have some error $\epsilon_1 = \bar\epsilon$. Now the input for step 2 is perturbed by $\epsilon_1$, hence we suffer some error in the order of $\epsilon_2 = \bar\epsilon + c \epsilon_1$ for some constant $c$, and so on. The error can diverge rather rapidly from the true observations. This is a common phenomenon. For instance, weather forecasts for the next 24 hours tend to be pretty accurate but beyond that the accuracy declines rapidly. We will discuss methods for improving this throughout this chapter and beyond.
Let us take a closer look at the difficulties in $k$-step-ahead predictions by computing predictions on the entire sequence for $k = 1, 4, 16, 64$.
int maxSteps = 64;
NDArray features = manager.zeros(new Shape(T - tau - maxSteps + 1, tau + maxSteps));
// Column `i` (`i` < `tau`) are observations from `x` for time steps from
// `i + 1` to `i + T - tau - maxSteps + 1`
for (int i = 0; i < tau; i++) {
features.set(new NDIndex(":, {}", i), x.get(new NDIndex("{}:{}", i, i + T - tau - maxSteps + 1)));
}
// Column `i` (`i` >= `tau`) are the (`i - tau + 1`)-step-ahead predictions for
// time steps from `i + 1` to `i + T - tau - maxSteps + 1`
for (int i = tau; i < tau + maxSteps; i++) {
NDArray tempX = features.get(new NDIndex(":, {}:{}", i - tau, i));
NDArray prediction = ((NDList) predictor.predict(new NDList(tempX))).get(0);
features.set(new NDIndex(":, {}", i), prediction.reshape(-1));
}
int[] steps = new int[] {1, 4, 16, 64};
ScatterTrace[] traces = new ScatterTrace[4];
for (int i = 0; i < traces.length; i++) {
int step = steps[i];
traces[i] = ScatterTrace.builder(Functions.floatToDoubleArray(time.get(new NDIndex("{}:{}", tau + step - 1, T - maxSteps + i)).toFloatArray()),
Functions.floatToDoubleArray(features.get(
new NDIndex(":,{}", tau + step - 1)).toFloatArray())
)
.mode(ScatterTrace.Mode.LINE)
.name(step + "-step preds")
.build();
}
Layout layout = Layout.builder()
.showLegend(true)
.xAxis(Axis.builder().title("time").build())
.yAxis(Axis.builder().title("x").build())
.build();
new Figure(layout, traces);
This clearly illustrates how the quality of the prediction changes as we try to predict further into the future. While the 4-step-ahead predictions still look good, anything beyond that is almost useless.