#!/usr/bin/env python # coding: utf-8 # In[1]: import keras keras.__version__ # # Predicting house prices: a regression example # # This notebook contains the code samples found in Chapter 3, Section 6 of [Deep Learning with Python](https://www.manning.com/books/deep-learning-with-python?a_aid=keras&a_bid=76564dff). Note that the original text features far more content, in particular further explanations and figures: in this notebook, you will only find source code and related comments. # # ---- # # # In our two previous examples, we were considering classification problems, where the goal was to predict a single discrete label of an # input data point. Another common type of machine learning problem is "regression", which consists of predicting a continuous value instead # of a discrete label. For instance, predicting the temperature tomorrow, given meteorological data, or predicting the time that a # software project will take to complete, given its specifications. # # Do not mix up "regression" with the algorithm "logistic regression": confusingly, "logistic regression" is not a regression algorithm, # it is a classification algorithm. # ## The Boston Housing Price dataset # # # We will be attempting to predict the median price of homes in a given Boston suburb in the mid-1970s, given a few data points about the # suburb at the time, such as the crime rate, the local property tax rate, etc. # # The dataset we will be using has another interesting difference from our two previous examples: it has very few data points, only 506 in # total, split between 404 training samples and 102 test samples, and each "feature" in the input data (e.g. the crime rate is a feature) has # a different scale. For instance some values are proportions, which take a values between 0 and 1, others take values between 1 and 12, # others between 0 and 100... # # Let's take a look at the data: # In[2]: from keras.datasets import boston_housing (train_data, train_targets), (test_data, test_targets) = boston_housing.load_data() # In[3]: train_data.shape # In[4]: test_data.shape # # As you can see, we have 404 training samples and 102 test samples. The data comprises 13 features. The 13 features in the input data are as # follow: # # 1. Per capita crime rate. # 2. Proportion of residential land zoned for lots over 25,000 square feet. # 3. Proportion of non-retail business acres per town. # 4. Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). # 5. Nitric oxides concentration (parts per 10 million). # 6. Average number of rooms per dwelling. # 7. Proportion of owner-occupied units built prior to 1940. # 8. Weighted distances to five Boston employment centres. # 9. Index of accessibility to radial highways. # 10. Full-value property-tax rate per \$10,000. # 11. Pupil-teacher ratio by town. # 12. 1000 * (Bk - 0.63) ** 2 where Bk is the proportion of Black people by town. # 13. % lower status of the population. # # The targets are the median values of owner-occupied homes, in thousands of dollars: # In[5]: train_targets # # The prices are typically between \\$10,000 and \\$50,000. If that sounds cheap, remember this was the mid-1970s, and these prices are not # inflation-adjusted. # ## Preparing the data # # # It would be problematic to feed into a neural network values that all take wildly different ranges. The network might be able to # automatically adapt to such heterogeneous data, but it would definitely make learning more difficult. A widespread best practice to deal # with such data is to do feature-wise normalization: for each feature in the input data (a column in the input data matrix), we # will subtract the mean of the feature and divide by the standard deviation, so that the feature will be centered around 0 and will have a # unit standard deviation. This is easily done in Numpy: # In[6]: mean = train_data.mean(axis=0) train_data -= mean std = train_data.std(axis=0) train_data /= std test_data -= mean test_data /= std # # Note that the quantities that we use for normalizing the test data have been computed using the training data. We should never use in our # workflow any quantity computed on the test data, even for something as simple as data normalization. # ## Building our network # # # Because so few samples are available, we will be using a very small network with two # hidden layers, each with 64 units. In general, the less training data you have, the worse overfitting will be, and using # a small network is one way to mitigate overfitting. # In[7]: from keras import models from keras import layers def build_model(): # Because we will need to instantiate # the same model multiple times, # we use a function to construct it. model = models.Sequential() model.add(layers.Dense(64, activation='relu', input_shape=(train_data.shape[1],))) model.add(layers.Dense(64, activation='relu')) model.add(layers.Dense(1)) model.compile(optimizer='rmsprop', loss='mse', metrics=['mae']) return model # # Our network ends with a single unit, and no activation (i.e. it will be linear layer). # This is a typical setup for scalar regression (i.e. regression where we are trying to predict a single continuous value). # Applying an activation function would constrain the range that the output can take; for instance if # we applied a `sigmoid` activation function to our last layer, the network could only learn to predict values between 0 and 1. Here, because # the last layer is purely linear, the network is free to learn to predict values in any range. # # Note that we are compiling the network with the `mse` loss function -- Mean Squared Error, the square of the difference between the # predictions and the targets, a widely used loss function for regression problems. # # We are also monitoring a new metric during training: `mae`. This stands for Mean Absolute Error. It is simply the absolute value of the # difference between the predictions and the targets. For instance, a MAE of 0.5 on this problem would mean that our predictions are off by # \\$500 on average. # ## Validating our approach using K-fold validation # # # To evaluate our network while we keep adjusting its parameters (such as the number of epochs used for training), we could simply split the # data into a training set and a validation set, as we were doing in our previous examples. However, because we have so few data points, the # validation set would end up being very small (e.g. about 100 examples). A consequence is that our validation scores may change a lot # depending on _which_ data points we choose to use for validation and which we choose for training, i.e. the validation scores may have a # high _variance_ with regard to the validation split. This would prevent us from reliably evaluating our model. # # The best practice in such situations is to use K-fold cross-validation. It consists of splitting the available data into K partitions # (typically K=4 or 5), then instantiating K identical models, and training each one on K-1 partitions while evaluating on the remaining # partition. The validation score for the model used would then be the average of the K validation scores obtained. # In terms of code, this is straightforward: # In[8]: import numpy as np k = 4 num_val_samples = len(train_data) // k num_epochs = 100 all_scores = [] for i in range(k): print('processing fold #', i) # Prepare the validation data: data from partition # k val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples] val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples] # Prepare the training data: data from all other partitions partial_train_data = np.concatenate( [train_data[:i * num_val_samples], train_data[(i + 1) * num_val_samples:]], axis=0) partial_train_targets = np.concatenate( [train_targets[:i * num_val_samples], train_targets[(i + 1) * num_val_samples:]], axis=0) # Build the Keras model (already compiled) model = build_model() # Train the model (in silent mode, verbose=0) model.fit(partial_train_data, partial_train_targets, epochs=num_epochs, batch_size=1, verbose=0) # Evaluate the model on the validation data val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0) all_scores.append(val_mae) # In[9]: all_scores # In[10]: np.mean(all_scores) # # As you can notice, the different runs do indeed show rather different validation scores, from 2.1 to 2.9. Their average (2.4) is a much more # reliable metric than any single of these scores -- that's the entire point of K-fold cross-validation. In this case, we are off by \\$2,400 on # average, which is still significant considering that the prices range from \\$10,000 to \\$50,000. # # Let's try training the network for a bit longer: 500 epochs. To keep a record of how well the model did at each epoch, we will modify our training loop # to save the per-epoch validation score log: # In[21]: from keras import backend as K # Some memory clean-up K.clear_session() # In[22]: num_epochs = 500 all_mae_histories = [] for i in range(k): print('processing fold #', i) # Prepare the validation data: data from partition # k val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples] val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples] # Prepare the training data: data from all other partitions partial_train_data = np.concatenate( [train_data[:i * num_val_samples], train_data[(i + 1) * num_val_samples:]], axis=0) partial_train_targets = np.concatenate( [train_targets[:i * num_val_samples], train_targets[(i + 1) * num_val_samples:]], axis=0) # Build the Keras model (already compiled) model = build_model() # Train the model (in silent mode, verbose=0) history = model.fit(partial_train_data, partial_train_targets, validation_data=(val_data, val_targets), epochs=num_epochs, batch_size=1, verbose=0) mae_history = history.history['val_mean_absolute_error'] all_mae_histories.append(mae_history) # We can then compute the average of the per-epoch MAE scores for all folds: # In[23]: average_mae_history = [ np.mean([x[i] for x in all_mae_histories]) for i in range(num_epochs)] # Let's plot this: # In[26]: import matplotlib.pyplot as plt plt.plot(range(1, len(average_mae_history) + 1), average_mae_history) plt.xlabel('Epochs') plt.ylabel('Validation MAE') plt.show() # # It may be a bit hard to see the plot due to scaling issues and relatively high variance. Let's: # # * Omit the first 10 data points, which are on a different scale from the rest of the curve. # * Replace each point with an exponential moving average of the previous points, to obtain a smooth curve. # In[38]: def smooth_curve(points, factor=0.9): smoothed_points = [] for point in points: if smoothed_points: previous = smoothed_points[-1] smoothed_points.append(previous * factor + point * (1 - factor)) else: smoothed_points.append(point) return smoothed_points smooth_mae_history = smooth_curve(average_mae_history[10:]) plt.plot(range(1, len(smooth_mae_history) + 1), smooth_mae_history) plt.xlabel('Epochs') plt.ylabel('Validation MAE') plt.show() # # According to this plot, it seems that validation MAE stops improving significantly after 80 epochs. Past that point, we start overfitting. # # Once we are done tuning other parameters of our model (besides the number of epochs, we could also adjust the size of the hidden layers), we # can train a final "production" model on all of the training data, with the best parameters, then look at its performance on the test data: # In[28]: # Get a fresh, compiled model. model = build_model() # Train it on the entirety of the data. model.fit(train_data, train_targets, epochs=80, batch_size=16, verbose=0) test_mse_score, test_mae_score = model.evaluate(test_data, test_targets) # In[29]: test_mae_score # We are still off by about \\$2,550. # ## Wrapping up # # # Here's what you should take away from this example: # # * Regression is done using different loss functions from classification; Mean Squared Error (MSE) is a commonly used loss function for # regression. # * Similarly, evaluation metrics to be used for regression differ from those used for classification; naturally the concept of "accuracy" # does not apply for regression. A common regression metric is Mean Absolute Error (MAE). # * When features in the input data have values in different ranges, each feature should be scaled independently as a preprocessing step. # * When there is little data available, using K-Fold validation is a great way to reliably evaluate a model. # * When little training data is available, it is preferable to use a small network with very few hidden layers (typically only one or two), # in order to avoid severe overfitting. # # This example concludes our series of three introductory practical examples. You are now able to handle common types of problems with vector data input: # # * Binary (2-class) classification. # * Multi-class, single-label classification. # * Scalar regression. # # In the next chapter, you will acquire a more formal understanding of some of the concepts you have encountered in these first examples, # such as data preprocessing, model evaluation, and overfitting.