Adversarial Attacks
As we have seen, it is possible to modify the input \(\mathbf{x}\) so that the model's output approximates a chosen target. This concept can also be applied to generate adversarial examples, i.e. images which have been intentionally modified to cause a model to misclassify them. In addition, we usually want the modification to be minimal, or almost imperceptible to the human eye.
One common method for generating adversarial examples is known as the fast gradient sign method (FGSM). Starting from an input \(\mathbf{x}^{0}\) which our model correctly classifies, we choose a target output \(\mathbf{y}^{*}\) which corresponds to a wrong classification, and follow the procedure described in the previous section with a slight modification. Instead of updating the input according to Eq. (49), we use the following update rule:

\[
\mathbf{x} \rightarrow \mathbf{x} - \eta \, \textrm{sign}\left(\frac{\partial L}{\partial \mathbf{x}}\right),
\]

where \(L\) is given by Eq. (48). The factor \(\textrm{sign}(\dots) \in \lbrace -1, 1 \rbrace\) both enhances the signal and acts as a constraint limiting the size of the modification. By choosing \(\eta = \frac{\epsilon}{T}\) and performing only \(T\) iterations, we can then guarantee that each component of the final input \(\mathbf{x}^{*}\) satisfies

\[
|x^{*}_{i} - x^{0}_{i}| \leq \epsilon,
\]
which is important since we want our final image \(\mathbf{x}^{*}\) to be only minimally modified. We summarize this algorithm as follows:
Fast Gradient Sign Method
Input: A classification model \(\mathbf{f}\), a loss function \(L\), an initial image \(\mathbf{x}^{0}\), a target label \(\mathbf{y}_{\textrm{target}}\), perturbation size \(\epsilon\) and number of iterations \(T\)
Output: Adversarial example \(\mathbf{x}^{*}\) with \(|x^{*}_{i} - x^{0}_{i}| \leq \epsilon\)
\(\eta = \epsilon/T\)
for \(i = 1 \dots T\) do
\(\quad\) \(\mathbf{x} = \mathbf{x} - \eta \ \textrm{sign}\left(\frac{\partial L}{\partial \mathbf{x}}\right)\)
end
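As a concrete illustration, a minimal sketch of this procedure in TensorFlow/Keras (an assumption about the framework) could look as follows. The names `model`, `loss_fn`, `x0`, and `y_target` are placeholders for whatever classifier, loss, image, and target label are at hand; the only essential ingredients are the gradient of the loss with respect to the input and the sign operation.

```python
import tensorflow as tf

def fgsm_attack(model, loss_fn, x0, y_target, epsilon=0.01, T=1):
    """Targeted fast gradient sign method (sketch).

    model    -- differentiable classifier f
    loss_fn  -- loss L(y_target, f(x)), e.g. categorical cross-entropy
    x0       -- initial image(s), shape (batch, height, width, channels)
    y_target -- the (wrong) target label(s), shape (batch, num_classes)
    """
    eta = epsilon / T                      # step size guarantees |x*_i - x0_i| <= epsilon
    x = tf.identity(x0)
    for _ in range(T):
        with tf.GradientTape() as tape:
            tape.watch(x)                  # x is not a tf.Variable, so watch it explicitly
            loss = loss_fn(y_target, model(x))
        grad = tape.gradient(loss, x)      # dL/dx
        x = x - eta * tf.sign(grad)        # step towards the target label
    return x
```

For \(T=1\) this reduces to a single signed gradient step of size \(\epsilon\).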
This process of generating adversarial examples is called an adversarial attack, and such attacks fall into two broad categories: white-box and black-box attacks. In a white-box attack, the attacker has full access to the network \(\mathbf{f}\) and is thus able to compute or estimate the gradients with respect to the input. In a black-box attack, on the other hand, the adversarial examples are generated without using the target network \(\mathbf{f}\). In this case, a possible strategy for the attacker is to train their own model \(\mathbf{g}\), find an adversarial example for that model, and use it against the target \(\mathbf{f}\) without ever having access to it. Although it might seem surprising, this strategy has been found to work, albeit with a lower success rate than white-box methods. We illustrate these concepts in the example below.
Example
We shall use the same plant leaves classification example as above. The target model \(\mathbf{f}\) which we want to ‘attack’ is a pretrained model based on Google’s well-known InceptionV3 deep convolutional neural network, containing over \(20\) million parameters [1]. The model achieved a test accuracy of \(\sim 95\%\). Assuming we have access to the gradients of the model \(\mathbf{f}\), we can consider a white-box attack. Starting from an image in the dataset which the target model correctly classifies and applying the fast gradient sign method with \(\epsilon=0.01\) and \(T=1\), we obtain an adversarial image which differs from the original by an almost imperceptible amount of noise, as depicted on the left of Fig. 31. Any human would still correctly identify the image, yet the network, which has around \(95\%\) accuracy, completely fails.
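In code, the white-box attack amounts to calling the sketch above directly on the target model \(\mathbf{f}\); the tensors `x0` and `y_target` are placeholders for illustration:

```python
# White-box attack: the attacker can differentiate through the target model f itself.
loss_fn = tf.keras.losses.CategoricalCrossentropy()
x_adv = fgsm_attack(f, loss_fn, x0, y_target, epsilon=0.01, T=1)

# The prediction flips even though x_adv is visually indistinguishable from x0.
print(tf.argmax(f(x0), axis=1).numpy(), tf.argmax(f(x_adv), axis=1).numpy())
```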
If, however, the gradients and outputs of the target model \(\mathbf{f}\) are hidden, the above white-box attack strategy becomes unfeasible. In this case, we can adopt the following ‘black-box attack’ strategy: we train a secondary model \(\mathbf{g}\) and then apply the FGSM algorithm to \(\mathbf{g}\) to generate adversarial examples for it. Note that it is not necessary for \(\mathbf{g}\) to have the same network architecture as the target model \(\mathbf{f}\). In fact, we might not even know the architecture of our target model.
Let us consider another pretrained network, based on MobileNet, containing about \(2\) million parameters. After retraining the top classification layer of this model to a test accuracy of \(\sim 95\%\), we apply the FGSM algorithm to generate some adversarial examples. If we now test these examples on our target model \(\mathbf{f}\), we notice a significant drop in accuracy, as shown in the graph on the right of Fig. 32. The fact that the drop in accuracy is greater for the black-box adversarial images than for images with random noise of the same scale added to them shows that adversarial images have some degree of transferability between models. As a side note, on the left of Fig. 32 we observe that black-box attacks are more effective when only \(T=1\) iteration of the FGSM algorithm is used, contrary to the situation for the white-box attack. This is because, with more iterations, the method tends to overfit the secondary model, resulting in adversarial images which are less transferable.
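Schematically, this transfer attack could be written as below. The surrogate `g` (the retrained MobileNet-based model), the test tensors, and the evaluation calls are placeholders, and we assume \(\mathbf{f}\) has been compiled with an accuracy metric:

```python
# Black-box transfer attack: generate adversarial examples on the surrogate model g,
# then evaluate them on the hidden target model f (no gradients of f are used).
x_adv = fgsm_attack(g, tf.keras.losses.CategoricalCrossentropy(),
                    x_test, y_target, epsilon=0.01, T=1)

# Accuracy of f on clean images versus transferred adversarial images.
_, acc_clean = f.evaluate(x_test, y_true, verbose=0)
_, acc_adv = f.evaluate(x_adv, y_true, verbose=0)
print(acc_clean, acc_adv)
```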
These forms of attack highlight a serious vulnerability of such data-driven machine learning techniques. Defending against such attacks is an active area of research, but it remains largely a cat-and-mouse game between attacker and defender.
[1] This is an example of transfer learning. The base model, InceptionV3, has been trained on a different classification dataset, ImageNet, with over \(1000\) classes. To apply this network to our binary classification problem, we simply replace the top layer with a two-output dense softmax layer. We keep the weights of the base model fixed and only train the top layer.
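A rough sketch of this transfer-learning setup in Keras is given below; the input size, optimizer, and training call are assumptions for illustration:

```python
import tensorflow as tf

# InceptionV3 base pretrained on ImageNet, with its original 1000-class top removed.
base = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet",
    input_shape=(299, 299, 3), pooling="avg")
base.trainable = False                     # keep the base model's weights fixed

# Replace the top with a two-output dense softmax layer for the binary problem.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, ...)  # only the top layer's weights are updated
```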