Adversarial perturbations

Ashish Gaurav

Last updated - June 22, 2023

Note: This article is not yet complete!

Trained neural networks can be brittle with respect to their learned decision boundaries. That is, it may be possible to perturb the inputs by a very small amount (imperceptible to the human eye) and cause a significant change in the outputs. Training robust networks that are not affected by such adversarial perturbations is crucial for real-world deployment of these models.

Generating adversarial examples

Classification

In the context of classification, we can find such perturbations in several ways.

Box-constrained Optimization \cite{szegedy2013intriguing}: Suppose we have an $m$-dimensional image in some dataset $\mathbf x \in \mathcal D$. The goal is to perturb $\mathbf x$ by a minimal amount $\mathbf r$ such that $\mathbf x + \mathbf r$ is still a valid image and the predicted label changes from $f(\mathbf x)$ to a target label $l := f(\mathbf x+\mathbf r)$, where the target label differs from the label of the original image. This is equivalent to the following problem, which can be solved approximately using an algorithm like L-BFGS-B.

$$ \min_{\mathbf r} \;\; c\,\|\mathbf r\|_2 + \text{loss}(f(\mathbf x+\mathbf r), l) \quad \text{ s.t. } \quad \mathbf{x}+\mathbf{r} \in [0, 1]^m $$
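As a rough sketch of this approach, the snippet below (PyTorch plus SciPy's L-BFGS-B, assuming CPU tensors) minimizes the penalized objective for a single image. The classifier `model`, the fixed penalty weight `c`, and the helper name are illustrative assumptions; the original method additionally searches over $c$ to find the minimal perturbation.

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.optimize import minimize

def box_constrained_attack(model, x, target_label, c=0.1):
    """Illustrative sketch: minimize c*||r||_2 + loss(f(x + r), target_label)
    over r, with box bounds keeping every pixel of x + r inside [0, 1].
    `x` is a single image tensor, e.g. of shape (C, H, W)."""
    x = x.detach()
    x_flat = x.cpu().numpy().ravel().astype(np.float64)

    def objective(r_np):
        r = torch.tensor(r_np, dtype=torch.float32, requires_grad=True)
        x_adv = (x.flatten() + r).view_as(x)
        l2 = torch.sqrt((r * r).sum() + 1e-12)  # smooth ||r||_2, safe gradient at r = 0
        loss = c * l2 + F.cross_entropy(model(x_adv.unsqueeze(0)),
                                        torch.tensor([target_label]))
        loss.backward()
        return loss.item(), r.grad.numpy().astype(np.float64)

    # Box constraints on r: each pixel of x + r must stay in [0, 1].
    bounds = [(-xi, 1.0 - xi) for xi in x_flat]
    result = minimize(objective, np.zeros_like(x_flat), jac=True,
                      method="L-BFGS-B", bounds=bounds)
    r = torch.tensor(result.x, dtype=torch.float32).view_as(x)
    return x + r  # adversarial example
```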

Fast Gradient Sign Method \cite{goodfellow2014explaining}: Given a neural network with parameters $\boldsymbol \theta$ and training loss $\mathcal {L}(\boldsymbol\theta,\mathbf x, \mathbf y)$, we know that $-\nabla_{\boldsymbol \theta} \mathcal L (\boldsymbol \theta, \mathbf x, \mathbf y)$ gives the direction in which to change $\boldsymbol \theta$ so that the loss decreases (eventually towards a minimum). Similarly, we can compute $\nabla_{\mathbf x} \mathcal L (\boldsymbol \theta, \mathbf x, \mathbf y)$, which gives the direction in which to change $\mathbf x$ so that the loss increases. However, we also want the perturbed input to remain in $[0, 1]^m$, so we apply the $\text{sign}(\cdot)$ function to the gradient, which bounds the change in each pixel by $\epsilon$: $$ \widetilde {\mathbf x} := \mathbf x + \epsilon \, \text{sign}\left(\nabla_{\mathbf x}\mathcal L(\boldsymbol \theta, \mathbf x, \mathbf y)\right) $$ The $\epsilon$ chosen should ensure that the perturbation still produces a valid image, that is, the most perturbed pixel still has a value in $[0, 1]$.
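As a concrete illustration, a minimal FGSM step in PyTorch might look as follows; the classifier `model`, the batched inputs, and the final clamp to $[0, 1]$ are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon=0.01):
    """One FGSM step: move x in the direction that increases the training loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)   # L(theta, x, y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()   # x + eps * sign(grad_x L)
    # Keep the result a valid image in [0, 1]^m.
    return x_adv.clamp(0.0, 1.0).detach()
```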

Targeted attacks \cite{kurakin2016adversarial}: Instead of using the true label $\mathbf y$ for image $\mathbf x$, we can use a target label $\mathbf y'$ and minimize the loss of $\mathbf x$ with respect to $\mathbf y'$ by adding a different perturbation: $$ \widetilde {\mathbf x} := \mathbf x + \epsilon \, \text{sign}\left(-\nabla_{\mathbf x}\mathcal L(\boldsymbol \theta, \mathbf x, \mathbf y')\right) $$ Here $\mathbf y'$ could be a random label, or the label corresponding to the least-likely class.
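A corresponding sketch of the targeted variant, picking the least-likely class as $\mathbf y'$ (again assuming a PyTorch classifier `model`; names are illustrative):

```python
import torch
import torch.nn.functional as F

def targeted_fgsm(model, x, epsilon=0.01):
    """Targeted step: move x so that the loss w.r.t. the target label decreases."""
    x = x.clone().detach().requires_grad_(True)
    logits = model(x)
    y_target = logits.argmin(dim=1)            # least-likely class as y'
    loss = F.cross_entropy(logits, y_target)
    loss.backward()
    x_adv = x + epsilon * (-x.grad).sign()     # x + eps * sign(-grad_x L(theta, x, y'))
    return x_adv.clamp(0.0, 1.0).detach()
```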

Other kinds of attacks:

Reinforcement Learning

State space attacks: If we have access to the states, we can craft an adversarial perturbation of the states. For example, we can use the Fast Gradient Sign Method to attack at every timestep \cite{huang2017adversarial}. We can also attack only at specific, crucial timesteps \cite{lin2017tactics}, decide when to attack based on the value function \cite{kos2017delving}, or use a cross-entropy loss to increase the probability of the worst action \cite{pattanaik2017robust}. A sketch of the per-state attack is given below.
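The snippet below is a rough sketch in the spirit of the all-timestep FGSM attack: it perturbs a single observation so that the policy's currently preferred action becomes less likely. The discrete-action `policy` (returning action logits) and the exact loss choice are assumptions of this illustration.

```python
import torch
import torch.nn.functional as F

def perturb_observation(policy, state, epsilon=0.01):
    """FGSM-style perturbation of one observation for a discrete-action policy."""
    s = state.clone().detach().requires_grad_(True)
    logits = policy(s.unsqueeze(0))
    preferred = logits.argmax(dim=1)
    # Increasing this loss pushes probability mass away from the preferred action.
    loss = F.cross_entropy(logits, preferred)
    loss.backward()
    return (s + epsilon * s.grad.sign()).detach()
```

In the all-timestep attack \cite{huang2017adversarial}, a perturbation like this would be applied to every observation the agent receives; the selective variants instead apply it only when their respective criteria are met.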

Action-based perturbations: If the actions can be perturbed, then we can use projected-gradient perturbations or a lookahead-based approach \cite{lee2020spatiotemporally}.

Reward-based perturbations: Wang et al. (2020) \cite{wang2020reinforcement} investigate the robustness of learning when reward functions are perturbed, and provide an algorithm for reward-robust RL.