Deadline: Sunday, June 16, 9pm
Late Penalty: There is a penalty-free grace period of one hour past the deadline. Any work that is submitted between 1 hour and 24 hours past the deadline will receive a 20% grade deduction. No other late work is accepted. Quercus submission time will be used, not your local computer time. You can submit your labs as many times as you want before the deadline, so please submit often and early.
TA: Huan Ling
In this lab, you will build and train an autoencoder to impute (or "fill in") missing data.
We will be using the Adult Data Set provided by the UCI Machine Learning Repository [1], available at https://archive.ics.uci.edu/ml/datasets/adult. The data set contains census records of adults, including their age, marital status, the type of work they do, and other features.
Normally, people use this data set to build a supervised classification model that predicts whether a person is a high income earner. We will not use the data set for its original intended purpose.
Instead, we will perform the task of imputing (or "filling in") missing values in the dataset. For example, we may be missing one person's marital status, another person's age, and a third person's level of education. Our model will predict the missing features based on the information that we do have about each person.
We will use a variation of a denoising autoencoder to solve this data imputation problem. Our autoencoder will be trained using inputs that have one categorical feature artificially removed, and the goal of the autoencoder is to correctly reconstruct all features, including the one removed from the input.
In the process, you are expected to learn to clean and process continuous and categorical data for machine learning, build and train an autoencoder on one-hot feature encodings, tune its hyperparameters, and compare its performance against a simple baseline model.
[1] Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Submit a PDF file containing all your code, outputs, and write-up. You can produce a PDF of your Google Colab file by going to File > Print and then save as PDF. The Colab instructions have more information.
Do not submit any other files produced by your code.
Include a link to your Colab file in your submission.
Include a link to your Colab file here. If you would like the TA to look at your Colab file in case your solutions are cut off, please make sure that your Colab file is publicly accessible at the time of submission.
Colab Link:
import csv
import numpy as np
import random
import torch
import torch.utils.data
We will be using a package called pandas for this assignment. If you are using Colab, pandas should already be available. If you are using your own computer, installation instructions for pandas are available at https://pandas.pydata.org/pandas-docs/stable/install.html
import pandas as pd
The adult.data file is available at https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
The function pd.read_csv loads the adult.data file into a pandas dataframe. You can read the pandas documentation for pd.read_csv at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
header = ['age', 'work', 'fnlwgt', 'edu', 'yredu', 'marriage', 'occupation',
'relationship', 'race', 'sex', 'capgain', 'caploss', 'workhr', 'country']
df = pd.read_csv(
"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
names=header,
index_col=False)
df.shape # there are 32561 rows (records) in the data frame, and 14 columns (features)
(32561, 14)
For each of the columns ["age", "yredu", "capgain", "caploss", "workhr"], report the minimum, maximum, and average value across the dataset. Then, normalize each of the features ["age", "yredu", "capgain", "caploss", "workhr"] so that their values are always between 0 and 1. Make sure that you are actually modifying the dataframe df.
Like numpy arrays and torch tensors, pandas data frames can be sliced. For example, we can display the first 3 rows of the data frame (3 records) below.
df[:3] # show the first 3 records
 | age | work | fnlwgt | edu | yredu | marriage | occupation | relationship | race | sex | capgain | caploss | workhr | country
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States
1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States
2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States
Alternatively, we can slice based on column names, for example df["race"], df["workhr"], or even index multiple columns like below.
subdf = df[["age", "yredu", "capgain", "caploss", "workhr"]]
subdf[:3] # show the first 3 records
 | age | yredu | capgain | caploss | workhr
---|---|---|---|---|---
0 | 39 | 13 | 2174 | 0 | 40
1 | 50 | 13 | 0 | 0 | 13
2 | 38 | 9 | 0 | 0 | 40
Numpy works nicely with pandas, like below:
np.sum(subdf["caploss"])
2842700
Just like numpy arrays, you can modify entire columns of data rather than one scalar element at a time. For example, the code
df["age"] = df["age"] + 1
would increment everyone's age by 1.
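Putting these pieces together, here is a hedged sketch of the kind of min-max normalization asked for in Part (a); it is one possible approach, not necessarily the only acceptable one.
# A minimal sketch, assuming df is the dataframe loaded above
for col in ["age", "yredu", "capgain", "caploss", "workhr"]:
    print(col, df[col].min(), df[col].max(), df[col].mean())   # report min, max, average
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())   # rescale to [0, 1]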
What percentage of people in our data set are male? Note that the data labels all have an unfortunate space in the beginning, e.g. " Male" instead of "Male".
What percentage of people in our data set are female?
# hint: you can do something like this in pandas
sum(df["sex"] == " Male")
21790
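For instance, a hedged sketch of turning the count above into a percentage (pct_male is an illustrative variable name, not part of the starter code):
num_male = sum(df["sex"] == " Male")   # note the leading space in " Male"
pct_male = 100 * num_male / len(df)    # len(df) is the total number of records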
Before proceeding, we will modify our data frame in a couple more ways: we will restrict ourselves to a subset of the features (the continuous columns in contcols and the categorical columns in catcols below), and we will separate out the records that contain missing values (an entry of " ?" in any categorical feature).
Both of these steps are done for you, below.
How many records contained missing features? What percentage of records were removed?
contcols = ["age", "yredu", "capgain", "caploss", "workhr"]
catcols = ["work", "marriage", "occupation", "edu", "relationship", "sex"]
features = contcols + catcols
df = df[features]
missing = pd.concat([df[c] == " ?" for c in catcols], axis=1).any(axis=1)
df_with_missing = df[missing]
df_not_missing = df[~missing]
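A hedged sketch of one way to answer the question above, using the two dataframes just created:
num_missing = len(df_with_missing)          # records containing at least one " ?" value
pct_removed = 100 * num_missing / len(df)   # df still contains every record at this point
print(num_missing, pct_removed)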
What are all the possible values of the feature "work" in df_not_missing? You may find the Python function set useful.
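As a hedged hint, one possible one-liner:
set(df_not_missing["work"])   # the distinct values of the "work" feature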
We will be using a one-hot encoding to represent each of the categorical variables. Our autoencoder will be trained using these one-hot encodings.
We will use the pandas function get_dummies to produce one-hot encodings for all of the categorical variables in df_not_missing.
data = pd.get_dummies(df_not_missing)
data[:3]
The dataframe data contains the cleaned and normalized data that we will use to train our denoising autoencoder.
How many columns (features) are in the dataframe data? Briefly explain where that number comes from.
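A hedged sketch for checking your answer (the second line breaks the count into its continuous and categorical contributions):
print(data.shape[1])                                                       # number of columns in `data`
print(len(contcols) + sum(df_not_missing[c].nunique() for c in catcols))   # continuous + one-hot columns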
We will convert the pandas data frame data into numpy, so that it can be further converted into a PyTorch tensor. However, in doing so, we lose the column label information that a pandas data frame automatically stores.
Complete the function get_categorical_value that will return the named value of a feature given a one-hot embedding. You may find the global variables cat_index and cat_values useful. (Display them and figure out what they are first.) We will need this function in the next part of the lab to interpret our autoencoder outputs. So, the input to our function get_categorical_value might not actually be "one-hot" -- the input may instead contain real-valued predictions from our neural network. (An illustrative sketch of one possible approach appears after the code below.)
datanp = data.values.astype(np.float32)
cat_index = {} # Mapping of feature -> start index of feature in a record
cat_values = {} # Mapping of feature -> list of categorical values the feature can take
# build up the cat_index and cat_values dictionary
for i, header in enumerate(data.keys()):
if "_" in header: # categorical header
feature, value = header.split()
feature = feature[:-1] # remove the last char; it is always an underscore
if feature not in cat_index:
cat_index[feature] = i
cat_values[feature] = [value]
else:
cat_values[feature].append(value)
def get_onehot(record, feature):
"""
Return the portion of `record` that is the one-hot encoding
of `feature`. For example, since the feature "work" is stored
    in the indices [5:12] in each record, calling `get_onehot(record, "work")`
is equivalent to accessing `record[5:12]`.
Args:
- record: a numpy array representing one record, formatted
        the same way as a row in `datanp`
- feature: a string, should be an element of `catcols`
"""
start_index = cat_index[feature]
stop_index = cat_index[feature] + len(cat_values[feature])
return record[start_index:stop_index]
def get_categorical_value(onehot, feature):
"""
Return the categorical value name of a feature given
a one-hot vector representing the feature.
Args:
- onehot: a numpy array one-hot representation of the feature
- feature: a string, should be an element of `catcols`
Examples:
>>> get_categorical_value(np.array([0., 0., 0., 0., 0., 1., 0.]), "work")
'State-gov'
>>> get_categorical_value(np.array([0.1, 0., 1.1, 0.2, 0., 1., 0.]), "work")
'Private'
"""
# <----- TODO: WRITE YOUR CODE HERE ----->
# You may find the variables `cat_index` and `cat_values`
# (created above) useful.
# more useful code, used during training, that depends on the function
# you write above
def get_feature(record, feature):
"""
Return the categorical feature value of a record
"""
onehot = get_onehot(record, feature)
return get_categorical_value(onehot, feature)
def get_features(record):
"""
Return a dictionary of all categorical feature values of a record
"""
return { f: get_feature(record, f) for f in catcols }
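As promised above, here is a hedged sketch of one possible approach to get_categorical_value: pick the categorical value whose entry in the input vector is largest, which works both for true one-hot vectors and for real-valued network outputs. The name get_categorical_value_sketch is used here only to avoid clobbering the function you are asked to complete.
def get_categorical_value_sketch(onehot, feature):
    """Illustrative sketch only: return the value name with the largest entry."""
    return cat_values[feature][int(np.argmax(onehot))]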
Randomly split the data into approximately 70% training, 15% validation and 15% test.
Report the number of items in your training, validation, and test set.
# set the numpy seed for reproducibility
# https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.seed.html
np.random.seed(50)
# todo
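A hedged sketch of one way to do the split (the names data_train, data_valid, and data_test are assumptions, chosen to match the DataLoader example that appears later in this handout):
idx = np.random.permutation(len(datanp))   # shuffle the record indices
n_train = int(0.70 * len(datanp))
n_valid = int(0.15 * len(datanp))
data_train = datanp[idx[:n_train]]
data_valid = datanp[idx[n_train:n_train + n_valid]]
data_test  = datanp[idx[n_train + n_valid:]]
print(len(data_train), len(data_valid), len(data_test))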
Design a fully-connected autoencoder by modifying the encoder and decoder below. The input to this autoencoder will be the features of the data, with one categorical feature zeroed out to simulate a missing value. The output of the autoencoder should be the reconstruction of the same features, but with the missing value filled in.
Note: Do not reduce the dimensionality of the input too much! The output of your embedding is expected to contain information about ~11 features.
from torch import nn
class AutoEncoder(nn.Module):
def __init__(self):
super(AutoEncoder, self).__init__()
self.encoder = nn.Sequential(
nn.Linear(57, 57) # TODO -- FILL OUT THE CODE HERE!
)
self.decoder = nn.Sequential(
nn.Linear(57, 57), # TODO -- FILL OUT THE CODE HERE!
nn.Sigmoid() # get to the range (0, 1)
)
def forward(self, x):
x = self.encoder(x)
x = self.decoder(x)
return x
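For illustration only, a hedged sketch of what a deeper encoder/decoder pair could look like; the layer widths below are arbitrary assumptions, not a recommended design, and you should choose your own.
# Illustrative sketch only -- layer widths are arbitrary assumptions
encoder = nn.Sequential(
    nn.Linear(57, 32),
    nn.ReLU(),
    nn.Linear(32, 16),
)
decoder = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 57),
    nn.Sigmoid(),  # get to the range (0, 1)
)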
Explain why there is a sigmoid activation in the last step of the decoder.
(Note: the values inside the data frame data and the training code in Part 3 might be helpful.)
We will train our autoencoder in the following way: in each iteration, we will zero out one randomly chosen categorical feature of each record in the minibatch using the zero_out_random_feature function, and train the autoencoder to reconstruct all of the features, including the one that was zeroed out.
Complete the code to train the autoencoder, and plot the training and validation loss every few iterations. You may also want to plot training and validation "accuracy" every few iterations, as we will define in part (b). You may also want to checkpoint your model every few iterations or epochs.
Use nn.MSELoss() as your loss function. (Side note: you might recognize that this loss function is not ideal for this problem, but we will use it anyway.)
def zero_out_feature(records, feature):
""" Set the feature missing in records, by setting the appropriate
columns of records to 0
"""
start_index = cat_index[feature]
stop_index = cat_index[feature] + len(cat_values[feature])
records[:, start_index:stop_index] = 0
return records
def zero_out_random_feature(records):
""" Set one random feature missing in records, by setting the
appropriate columns of records to 0
"""
return zero_out_feature(records, random.choice(catcols))
def train(model, train_loader, valid_loader, num_epochs=5, learning_rate=1e-4):
""" Training loop. You should update this."""
torch.manual_seed(42)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for epoch in range(num_epochs):
for data in train_loader:
datam = zero_out_random_feature(data.clone()) # zero out one categorical feature
recon = model(datam)
loss = criterion(recon, data)
loss.backward()
optimizer.step()
optimizer.zero_grad()
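A hedged usage sketch (data_train and data_valid refer to the split from Part 2; the batch sizes here are arbitrary choices):
model = AutoEncoder()
train_loader = torch.utils.data.DataLoader(data_train, batch_size=64, shuffle=True)
valid_loader = torch.utils.data.DataLoader(data_valid, batch_size=256, shuffle=True)
train(model, train_loader, valid_loader, num_epochs=5, learning_rate=1e-4)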
While plotting training and validation loss is valuable, loss values are harder to compare than accuracy percentages. It would be nice to have a measure of "accuracy" in this problem.
Since we will only be imputing missing categorical values, we will define an accuracy measure. For each record and for each categorical feature, we determine whether the model can predict the categorical feature given all the other features of the record.
A function get_accuracy is written for you. It is up to you to figure out how to use the function. You don't need to submit anything in this part. To earn the marks, correctly plot the training and validation accuracy every few iterations as part of your training curve.
def get_accuracy(model, data_loader):
"""Return the "accuracy" of the autoencoder model across a data set.
That is, for each record and for each categorical feature,
we determine whether the model can successfully predict the value
of the categorical feature given all the other features of the
record. The returned "accuracy" measure is the percentage of times
that our model is successful.
Args:
- model: the autoencoder model, an instance of nn.Module
- data_loader: an instance of torch.utils.data.DataLoader
Example (to illustrate how get_accuracy is intended to be called.
Depending on your variable naming this code might require
modification.)
>>> model = AutoEncoder()
>>> vdl = torch.utils.data.DataLoader(data_valid, batch_size=256, shuffle=True)
>>> get_accuracy(model, vdl)
"""
total = 0
acc = 0
for col in catcols:
for item in data_loader: # minibatches
inp = item.detach().numpy()
out = model(zero_out_feature(item.clone(), col)).detach().numpy()
for i in range(out.shape[0]): # record in minibatch
acc += int(get_feature(out[i], col) == get_feature(inp[i], col))
total += 1
return acc / total
Run your updated training code, using reasonable initial hyperparameters.
Include your training curve in your submission.
Tune your hyperparameters, training at least 4 different models (4 sets of hyperparameters).
Do not include all your training curves. Instead, explain what hyperparameters you tried, what their effect was, and what your thought process was as you chose the next set of hyperparameters to try.
Based on the test accuracy alone, it is difficult to assess whether our model is actually performing well. We don't know whether a high accuracy is due to the simplicity of the problem, or if a poor accuracy is a result of the inherent difficulty of the problem.
It is therefore very important to be able to compare our model to at least one alternative. In particular, we consider a simple baseline model that is not very computationally expensive. Our neural network should at least outperform this baseline model. If our network is not much better than the baseline, then it is not doing well.
For our data imputation problem, consider the following baseline model: to predict a missing feature, the baseline model will look at the most common value of the feature in the training set.
For example, if the feature "marriage" is missing, then this model's prediction will be the most common value for "marriage" in the training set, which happens to be "Married-civ-spouse".
What would be the test accuracy of this baseline model?
Do not actually implement this baseline model. You should be able to compute the test accuracy by reasoning about how the baseline model behaves.
How does your test accuracy from part (a) compare to your baseline test accuracy in part (b)?
Look at the first item in your test data. Do you think it is reasonable for a human to be able to guess this person's education level based on their other features? Explain.
What is your model's prediction of this person's education level, given their other features?
What is the baseline model's prediction of this person's education level?