This notebook demonstrates how to use the ART library to conduct an adversarial attack on video action recognition.
First, we are going to show how to use MXNet and GluonCV for video action recognition. MXNet provides a set of pretrained models for this classification task. In our demonstration we use the pretrained i3d_resnet50_v1_ucf101
model, which is based on Carreira and Zisserman '18.
In the following we are going to use a video clip of a basketball action taken from the UCF101 data set. Namely, we will show how to correctly classify the following short video clip.
Let's walk through some initial work steps ensuring that the notebook will work smoothly. We will
import os
import tempfile
import decord
from gluoncv import utils
from gluoncv.data.transforms import video
from gluoncv.model_zoo import get_model
from gluoncv.utils.filesystem import try_import_decord
import imageio
from matplotlib.image import imsave
import matplotlib.pyplot as plt
import mxnet as mx
from mxnet import gluon, nd, image
from mxnet.gluon.data.vision import transforms
import numpy as np
from art.attacks.evasion import FastGradientMethod, FrameSaliencyAttack
from art import config
from art.defences.preprocessor import VideoCompression
from art.estimators.classification import MXClassifier
# set global variables
PRETRAINED_MODEL_NAME = 'i3d_resnet50_v1_ucf101'
VIDEO_SAMPLE_URI = 'https://github.com/bryanyzhu/tiny-ucf101/raw/master/v_Basketball_g01_c01.avi'
# set seed
np.random.seed(123)
def predict_top_k(video_input, model, k=5, verbose=True):
"""Return top-k class indices."""
pred = model(nd.array(video_input))
classes = model.classes
ind = nd.topk(pred, k=k)[0].astype('int')
if verbose:
msg = "The video sample clip is classified to be"
for i in range(k):
msg += f"\n\t[{classes[ind[i].asscalar()]}], with probability {nd.softmax(pred)[0][ind[i]].asscalar():.3f}."
print(msg)
return ind
def sample_to_gif(sample, output="sample.gif", path=config.ART_DATA_PATH, postprocess=None):
"""Convert a numpy video sample of shape 3xFramesxMxN into GIF."""
frame_count = sample.shape[1]
output_path = os.path.join(path, output)
with tempfile.TemporaryDirectory() as tmpdir, imageio.get_writer(output_path, mode='I') as writer:
for frame in range(frame_count):
file_path = os.path.join(tmpdir, f"{frame}.png")
imsave(file_path, np.transpose(sample[:,frame,:,:], (1,2,0)))
writer.append_data(imageio.imread(file_path))
return output_path
# download video sample
decord = try_import_decord()
video_fname = utils.download(VIDEO_SAMPLE_URI, path=config.ART_DATA_PATH);
video_reader = decord.VideoReader(video_fname)
frame_id_list = range(0, 64, 2)
video_data = video_reader.get_batch(frame_id_list).asnumpy()
video_sample_lst = [video_data[vid, :, :, :] for vid, _ in enumerate(frame_id_list)]
# preprocess benign video sample
transform_fn = video.VideoGroupValTransform(size=224, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
sample = np.stack(transform_fn(video_sample_lst), axis=0)
sample = sample.reshape((-1,) + (32, 3, 224, 224))
sample = np.transpose(sample, (0, 2, 1, 3, 4))
print(f"`{video_fname}` has been downloaded and preprocessed.")
`/home/hessel/.art/data/v_Basketball_g01_c01.avi` has been downloaded and preprocessed.
# load pretrained model
model = get_model(PRETRAINED_MODEL_NAME, nclass=101, pretrained=True)
print(f"`{PRETRAINED_MODEL_NAME}` model was successfully loaded.")
`i3d_resnet50_v1_ucf101` model was successfully loaded.
# evaluate model on basketball video sample
_ = predict_top_k(sample, model)
The video sample clip is classified to be [Basketball], with probability 0.725. [TennisSwing], with probability 0.212. [VolleyballSpiking], with probability 0.036. [SoccerJuggling], with probability 0.012. [TableTennisShot], with probability 0.007.
We observe that for the given video sample, the model correctly classified it as playing basketball.
We are now ready to employ the ART library and craft an adversarial attack via the Fast Gradient Method. The attack will corrupt the video sample in such a way that it will be misclassified. We will also show how to convert the adversarial example into a GIF as shown below.
# preprocess adversarial video sample input
transform_fn_unnormalized = video.VideoGroupValTransform(size=224, mean=[0, 0, 0], std=[1, 1, 1])
adv_sample_input = np.stack(transform_fn_unnormalized(video_sample_lst), axis=0)
adv_sample_input = adv_sample_input.reshape((-1,) + (32, 3, 224, 224))
adv_sample_input = np.transpose(adv_sample_input, (0, 2, 1, 3, 4))
# wrap model in a ART classifier
model_wrapper = gluon.nn.Sequential()
with model_wrapper.name_scope():
model_wrapper.add(model)
# prepare mean and std arrays for ART classifier preprocessing
mean = np.array([0.485, 0.456, 0.406] * (32 * 224 * 224)).reshape((3, 32, 224, 224), order='F')
std = np.array([0.229, 0.224, 0.225] * (32 * 224 * 224)).reshape((3, 32, 224, 224), order='F')
classifier_art = MXClassifier(
model=model_wrapper,
loss=gluon.loss.SoftmaxCrossEntropyLoss(),
input_shape=(3, 32, 224, 224),
nb_classes=101,
preprocessing=(mean, std),
clip_values=(0, 1),
channels_first=True,
)
# verify that ART classifier predictions are consistent with original model:
pred = nd.array(classifier_art.predict(adv_sample_input))
ind = nd.topk(pred, k=5)[0].astype('int')
msg = "The video sample clip is classified by the ART classifier to be"
for i in range(len(ind)):
msg += f"\n\t[{model.classes[ind[i].asscalar()]}], with probability {nd.softmax(pred)[0][ind[i]].asscalar():.3f}."
print(msg)
The video sample clip is classified by the ART classifier to be [Basketball], with probability 0.725. [TennisSwing], with probability 0.212. [VolleyballSpiking], with probability 0.036. [SoccerJuggling], with probability 0.012. [TableTennisShot], with probability 0.007.
# craft adversarial attack with FGM
epsilon = 8/255
fgm = FastGradientMethod(
classifier_art,
eps=epsilon,
)
%%time
adv_sample = fgm.generate(
x=adv_sample_input
)
CPU times: user 3min 7s, sys: 1.77 s, total: 3min 9s Wall time: 2min 48s
# print results
_ = predict_top_k((adv_sample-mean)/std, model)
The video sample clip is classified to be [ThrowDiscus], with probability 0.266. [Hammering], with probability 0.244. [TennisSwing], with probability 0.155. [HulaHoop], with probability 0.082. [JavelinThrow], with probability 0.055.
# now we save the adversarial example to gif:
adversarial_gif = sample_to_gif(np.squeeze(adv_sample), "adversarial_basketball.gif")
print(f"`{adversarial_gif}` has been successfully created.")
`/home/hessel/.art/data/adversarial_basketball.gif` has been successfully created.
Using the Frame Saliency Attack, we finally show how to create a sparse adversarial example. The final result is shown in the GIF below. As we will see, only one frame needs to be perturbed to achieve a misclassification.
# We create a Frame Saliency Attack. Note: we specify here the frame axis, which is 2.
fsa = FrameSaliencyAttack(
classifier_art,
fgm,
"iterative_saliency",
frame_index = 2
)
%%time
adv_sample_sparse = fsa.generate(
x=adv_sample_input
)
CPU times: user 6min 48s, sys: 3.1 s, total: 6min 52s Wall time: 6min 6s
# We print the resulting predictions:
_ = predict_top_k((adv_sample_sparse-mean)/std, model)
The video sample clip is classified to be [TennisSwing], with probability 0.497. [Basketball], with probability 0.425. [VolleyballSpiking], with probability 0.040. [SoccerJuggling], with probability 0.014. [TableTennisShot], with probability 0.009.
# Again, we can save the adversarial example to gif:
adversarial_sparse_gif = sample_to_gif(np.squeeze(adv_sample_sparse), "adversarial_basketball_sparse.gif")
print(f"`{adversarial_sparse_gif}` has been successfully created.")
`/home/hessel/.art/data/adversarial_basketball_sparse.gif` has been successfully created.
# And now count the number of perturbed frames:
x_diff = adv_sample_sparse - adv_sample_input
x_diff = np.swapaxes(x_diff, 1, 2)
x_diff = np.reshape(x_diff, x_diff.shape[:2] + (np.prod(x_diff.shape[2:]), ))
x_diff_norm = np.sign(np.round(np.linalg.norm(x_diff, axis=-1), decimals=4))
print(f"Number of perturbed frames: {int(np.sum(x_diff_norm))}")
Number of perturbed frames: 1
Next we are going to apply a simple input preprocessing defense, namely VideoCompression
. Ideally, we want this defense to result in correct predictions when applied both to the original and the adversarial video input.
# initialize VideoCompression defense
video_compression = VideoCompression(video_format="avi", constant_rate_factor=30, channels_first=True)
# apply defense to original input
adv_sample_input_compressed = video_compression(adv_sample_input * 255)[0] / 255
# apply defense to sparse adversarial sample
adv_sample_sparse_compressed = video_compression(adv_sample_sparse * 255)[0] / 255
# print the resulting predictions on compressed original input
_ = predict_top_k((adv_sample_input_compressed-mean)/std, model)
# print the resulting predictions on sparse adversarial sample
_ = predict_top_k((adv_sample_sparse_compressed-mean)/std, model)
The video sample clip is classified to be [Basketball], with probability 0.516. [TennisSwing], with probability 0.443. [VolleyballSpiking], with probability 0.018. [TableTennisShot], with probability 0.010. [SoccerJuggling], with probability 0.005. The video sample clip is classified to be [Basketball], with probability 0.697. [TennisSwing], with probability 0.242. [VolleyballSpiking], with probability 0.026. [SoccerJuggling], with probability 0.011. [TableTennisShot], with probability 0.009.
In this notebook we have demonstrated how we can apply the ART library to a video action recognition task. By employing a pretrained MXNet model and loading it via ART's MXNetClassifier
we can easily plug in several off the shelf attacks like Fast Gradient Method.