We will finish the semester with a light task. We are going to build a small CNN to classify images from the CIFAR10 dataset and then see how easy it is to construct (near identical) images that our model has difficulty in classifying.
This is an example of an adversarial attack, where specialised inputs are created with the purpose of confusing a neural network, resulting in a misclassification.
The attacks are based on taking valid inputs and applying changes that, while indistinguishable to the human eye, cause the network to fail to identify the contents of the image. There are several types of such attacks; here the focus is on the fast gradient sign method (FGSM) attack, a white-box attack whose goal is to cause misclassification.
A white-box attack is one where the attacker has complete access to the model being attacked. One of the most famous adversarial images, shown below, is taken from the paper "Explaining and Harnessing Adversarial Examples" by Goodfellow et al.
Panda to gibbon example (Goodfellow et al.)
Since this is the last practical of the semester we are going light on the theory and will just explore the FGSM attack. Also, to cut down on training time we will use the smaller CIFAR10 dataset, but you might still want to do this week's lab using Colab and select the T4 GPU runtime option.
Imports and Setup
First import our standard python modules for data mining. Module tqdm is used to generate progress bars.
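The exact import list is not reproduced here; as a sketch, a minimal set along the following lines covers the code used in this practical (tqdm is optional and only needed if you want progress bars).

# Standard imports for this practical (a sketch; your own module list may differ slightly)
import numpy as np
import matplotlib.pyplot as plt

import tensorflow as tf
from tensorflow import keras

from tqdm import tqdm   # progress bars (optional)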
The CIFAR-10 dataset (Canadian Institute for Advanced Research, 10 classes) is a subset of the Tiny Images dataset and consists of 60,000 32x32 color images. The images are labelled with one of 10 mutually exclusive classes: airplane, automobile (but not truck or pickup truck), bird, cat, deer, dog, frog, horse, ship, and truck (but not pickup truck). There are 6,000 images per class with 5,000 training and 1,000 testing images per class.
So we have split the 60,000 observations into 50,000 for training and 10,000 for testing. Each observation is a 32x32 pixel x 3 channel image, and each channel value is a single integer in the range 0..255.
Note that the outputs, y_train and y_test, have a different shape to those in the MNIST dataset: each label is stored as a 1-element array, so y_train has shape (50000, 1).
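In case the loading step is not already in your notebook, the dataset can be fetched directly through keras.datasets, for example as below. The variable names height, width, channels, nb_classes and label_names are assumptions based on how they are used in the rest of the practical.

# Load CIFAR-10 (downloads on first use) -- a sketch; the names below
# (height, width, channels, nb_classes, label_names) are assumed from
# how they are used later in the code.
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()

height, width, channels = 32, 32, 3
nb_classes = 10
label_names = ["airplane", "automobile", "bird", "cat", "deer",
               "dog", "frog", "horse", "ship", "truck"]

print("x_train shape:", x_train.shape)   # (50000, 32, 32, 3)
print("y_train shape:", y_train.shape)   # (50000, 1)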
The following code shows 40 randomly selected observations from the training set and their labels.
fig, axs = plt.subplots(5, 8, sharex=True, sharey=True, figsize=(10, 8))
for ax in axs.flat:
    k = np.random.randint(0, x_train.shape[0])
    ax.imshow(x_train[k])
    ax.set_title(label_names[y_train[k, 0]])
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
plt.suptitle("Sample images in training set")
plt.savefig("Sample_images_in_training.png", bbox_inches="tight")
plt.show()
Sample inputs and their labels.
Preprocessing
Similar to the MNIST dataset, we scale the inputs from range 0..255 to 0..1.
# Scale images to the [0, 1] range
x_train = x_train.astype("float32") / 255
x_test = x_test.astype("float32") / 255
In the MNIST practical we then added an extra channel dimension of size one. CIFAR-10 images already have three channels, so the reshape below is not actually needed; it is kept only to mirror the MNIST workflow and to confirm the shape (N, 32, 32, 3) expected by the neural network.
# Make sure images have shape (32, 32, 3) - NOT NEEDED
print("original x_train shape:", x_train.shape)
x_train = x_train.reshape((-1, height, width, channels))
x_test = x_test.reshape((-1, height, width, channels))
print("x_train shape:", x_train.shape)
Then we convert the integer class labels (values 0-9) to vectors of 10 binary values (i.e., one-hot encoding).
# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, nb_classes)
y_test = keras.utils.to_categorical(y_test, nb_classes)
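The code for building and training the CNN is not reproduced here. As a rough sketch only, a small network along the following lines will do the job; the architecture, optimiser and number of epochs are assumptions, and the model actually used in the lab may differ.

# A small CNN for CIFAR-10 -- a sketch only; the lab's actual architecture
# and hyperparameters may differ.
model = keras.Sequential([
    keras.Input(shape=(height, width, channels)),
    keras.layers.Conv2D(32, 3, activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(64, 3, activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(nb_classes, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])

history = model.fit(x_train, y_train, batch_size=128, epochs=15,
                    validation_split=0.1)

# Plot training/validation loss and accuracy
fig, axs = plt.subplots(1, 2, figsize=(10, 4))
axs[0].plot(history.history["loss"], label="train")
axs[0].plot(history.history["val_loss"], label="validation")
axs[0].set_title("Loss")
axs[0].legend()
axs[1].plot(history.history["accuracy"], label="train")
axs[1].plot(history.history["val_accuracy"], label="validation")
axs[1].set_title("Accuracy")
axs[1].legend()
plt.show()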
Training the model and plotting the loss and accuracy history should generate graphs similar to the following.
Model loss and accuracy.
Adversarial Attacks
Now we want to see how to build images that are near identical to the originals but that the model will misclassify.
Fast gradient sign method
The fast gradient sign method works by using the gradients of the neural network to create an adversarial example. For an input image, the method uses the gradients of the loss with respect to the input image to create a new image that maximises the loss. This new image is called the adversarial image. This can be summarised using the following expression:
\[ x_{adv} = x + \epsilon \cdot \operatorname{sign}\big(\nabla_x J(\theta, x, y)\big) \]
where
\(x_{adv}\) is the adversarial image,
\(x\) is the original input image,
\(y\) is the original input label,
\(\epsilon\) is a multiplier to ensure the perturbations are small,
\(\theta\) are the model parameters, and
\(J\) is the loss function.
An intriguing property here is the fact that the gradients are taken with respect to the input image. This is done because the objective is to create an image that maximises the loss. A way to accomplish this is to find how much each pixel in the image contributes to the loss value, and to add a perturbation accordingly. This works pretty quickly because it is easy to find how each input pixel contributes to the loss by applying the chain rule to obtain the required gradients. Hence, the gradients are taken with respect to the image. In addition, since the model is no longer being trained, the gradient is not taken with respect to the trainable variables (the model parameters), which remain constant. The only goal is to fool an already trained model.

The first step is to create the perturbations that will be used to distort the original image, resulting in an adversarial image. As mentioned above, the gradients are taken with respect to the image.
def generate_adversary(image, label):
    image = tf.cast(image, tf.float32)

    with tf.GradientTape() as tape:
        tape.watch(image)
        prediction = model(image)
        loss = tf.keras.losses.MSE(label, prediction)

    # compute gradients of the loss w.r.t. the input image
    gradient = tape.gradient(loss, image)

    # only the sign of the gradients is used to create the perturbation
    sign_grad = tf.sign(gradient)
    return sign_grad
Adversarial attacks - single image
Selecting a random image for testing
np.random.seed(72)
rand_idx = np.random.randint(0, 49999)
image = x_train[rand_idx].reshape((1, height, width, channels))
label = y_train[rand_idx]

print(f'True label: {label_names[np.where(label==1)[0][0]]}')

plt.figure(figsize=(3, 3))
plt.imshow(image.reshape((height, width, channels)))
plt.savefig("Original.png", bbox_inches="tight")
plt.show()
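The step that actually builds the adversarial image from this example is not shown in the handout. Judging by how perturbations and adversarial are used in the comparison code below (and the 0.03 multiplier used later in the batch generator), it is something along these lines; the epsilon value here is an assumption you can experiment with.

# Build the adversarial version of the selected image (sketch).
epsilon = 0.1   # assumed value -- try different magnitudes
perturbations = generate_adversary(image, label).numpy()
adversarial = image + epsilon * perturbations
adversarial = np.clip(adversarial, 0, 1)   # optional: keep pixel values in [0, 1]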
# Comparing both images
fig, axs = plt.subplots(1, 3, sharey=True)
axs[0].imshow(image.reshape(height, width, channels))
orig_label = label_names[np.where(label==1)[0][0]]
orig_predict = label_names[model.predict(image).argmax()]
axs[0].set_title(f"Original\n({orig_label}/{orig_predict})")
axs[1].imshow(((perturbations + 1) / 2).reshape(height, width, channels))
axs[1].set_title("Perturbation\n")
attack_predict = label_names[model.predict(adversarial).argmax()]
axs[2].imshow(adversarial.reshape(height, width, channels))
axs[2].set_title(f"Adversary\n({attack_predict})")
plt.savefig("Original_and_Adversary.png", bbox_inches="tight")
plt.show()
Note the effect of different values of epsilon. As epsilon is increased, it becomes easier to fool the network; the trade-off is that the perturbations become more noticeable to the human eye.
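One way to see this trade-off is to generate the adversarial image for several values of epsilon and compare the predictions. The sketch below reuses the image and perturbations computed above; the epsilon values are arbitrary choices.

# Compare adversarial images for a range of epsilon values (sketch).
epsilons = [0.0, 0.01, 0.03, 0.1, 0.3]   # arbitrary choices

fig, axs = plt.subplots(1, len(epsilons), figsize=(3 * len(epsilons), 3))
for ax, eps in zip(axs, epsilons):
    adv = np.clip(image + eps * perturbations, 0, 1)
    pred = label_names[model.predict(adv, verbose=0).argmax()]
    ax.imshow(adv.reshape(height, width, channels))
    ax.set_title(f"eps={eps}\n({pred})")
    ax.axis("off")
plt.show()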
Adversarial attacks - adversarial versions of the training set
from random import randint   # needed for the random index below

# Function to generate a batch of images with adversary
def adversary_generator(batch_size):
    while True:
        images = []
        labels = []
        for batch in range(batch_size):
            N = randint(0, 49999)
            label = y_train[N]
            image = x_train[N].reshape((1, height, width, channels))

            perturbations = generate_adversary(image, label).numpy()
            adversarial = image + (perturbations * 0.03)

            images.append(adversarial)
            labels.append(label)

            if batch % 1000 == 0:
                print(f"{batch} images generated")

        images = np.asarray(images).reshape((batch_size, height, width, channels))
        labels = np.asarray(labels)

        yield images, labels
Testing model accuracy on adversarial examples
x_adversarial, y_adversarial = next(adversary_generator(10000))
ad_acc = model.evaluate(x_adversarial, y_adversarial, verbose=0)
print(f"Accuracy on Adversarial Examples: {ad_acc[1]*100}")
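For comparison, it is worth evaluating the same model on the clean test set; the gap between the two numbers shows how effective even a small (0.03) perturbation can be. A short sketch:

# Baseline accuracy on the unmodified test set, for comparison (sketch).
clean_acc = model.evaluate(x_test, y_test, verbose=0)
print(f"Accuracy on clean test set:       {clean_acc[1]*100:.2f}")
print(f"Accuracy on adversarial examples: {ad_acc[1]*100:.2f}")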