Why Machine Learning is vulnerable to adversarial attacks and how to fix it

Something beyond human perception?

Just recently, a group from MIT posted a study hypothesizing that the previously reported vulnerability of machine learning to adversarial techniques is a direct consequence of inherent patterns within standard data sets.

While incomprehensible to humans, these patterns exist as natural features that supervised learning algorithms fundamentally rely on.

In this paper, Andrew Ilyas et al. describe a notion of robust and non-robust features existing within a data set, both of which provide useful signals for classification.

In hindsight, this might not be surprising when we recall that classifiers typically are trained only to maximize accuracy.

Which features a model uses to obtain the highest accuracy – as determined by a human’s supervisory labeling – is left to the model to sort through with brute force.

What looks to us like a person’s nose in an input image is just another combination of 0s and 1s compared to a chin from the perspective of the algorithm.

To demonstrate how these human-perceptible and non-perceptible features support accurate classification, the MIT team constructed a framework to separate the features into distinct training data sets.

Both data sets are run through training, and each on its own is sufficient to produce a classifier that performs as expected.

As illustrated in Figure 1, the models use non-robust features just as meaningfully as robust ones, even though the non-robust features look like garbled information to us.

Figure 1. An image of a frog is separated into “robust” features, used to train a classifier that accurately identifies it as a frog, and “non-robust” features, which are also used to train a classifier that identifies the frog accurately, but not when adversarial perturbations are included in the input. These non-robust features used by the model are inherent to the image and are not manipulated or inserted by an attacker [from Ilyas et al., with permission].
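To make the idea of “separating the features” more concrete, here is a minimal, illustrative sketch of the non-robust-data-set construction in the spirit of Ilyas et al.: each image is perturbed a small amount toward some target class and then relabeled with that target, so only non-robust features remain predictive of the new labels. The function and hyperparameter names below are assumptions for illustration, not the authors’ actual code, and the paper’s full framework (including the complementary “robustified” data set) is more involved.

```python
import torch
import torch.nn.functional as F

def make_nonrobust_dataset(model, loader, num_classes, eps=0.1, steps=20, step_size=0.01):
    """Perturb each image toward a target class, then relabel it with that target class."""
    xs, ys = [], []
    for x, y in loader:
        t = (y + 1) % num_classes                 # deterministic target class (random also works)
        x_adv = x.clone().requires_grad_(True)
        for _ in range(steps):
            loss = F.cross_entropy(model(x_adv), t)
            grad, = torch.autograd.grad(loss, x_adv)
            with torch.no_grad():
                x_adv = x_adv - step_size * grad.sign()                 # step toward the target class
                x_adv = torch.max(torch.min(x_adv, x + eps), x - eps)   # keep the perturbation small
                x_adv = x_adv.clamp(0, 1)
            x_adv.requires_grad_(True)
        xs.append(x_adv.detach())
        ys.append(t)                              # note: relabeled with the *target* class
    return torch.cat(xs), torch.cat(ys)
```

The striking observation reported in the paper is that a standard classifier trained on such a relabeled set still generalizes to ordinary test images, which is what Figure 1 summarizes.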

Let’s be clear on this point again.

Legitimate data contains human-imperceptible features that influence training.

These strange features, which don’t appear to connect to the human-assigned classification, are not artifacts or the result of overfitting; they are naturally occurring patterns that are sufficient on their own to yield a learned classifier.

In other words, patterns useful for classification that we as humans expect to see might be found by the algorithm.

However, other patterns are being identified that make no sense to us – but they are still patterns that support the algorithm in making good classifications.

Dimitris Tsipras from the MIT team is excited about their result as “it convincingly demonstrates that models can indeed learn imperceptible well-generalizing features.” While these patterns don’t look like anything meaningful to us, the ML classifier is finding them and using them.

When facing adversarial inputs (e.g., small perturbations in the new data), this same set of nonsensical features breaks down, leading to incorrect classification.

Now it’s not only some third-party attacker inserting misleading data into a sample; the non-robust features existing in the real data are themselves what make the classifier vulnerable to an adversary’s taunting.
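For concreteness, the kind of “small perturbation” referred to here can be as simple as the classic fast gradient sign method (FGSM) of Goodfellow et al. A minimal PyTorch-style sketch follows; `model` is assumed to be any differentiable classifier over images scaled to [0, 1].

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8 / 255):
    """One-step perturbation that nudges x in the direction that increases the loss."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()                                   # gradient of the loss w.r.t. the input
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()
```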

Recalling the notion of transferability mentioned above from Papernot et al., it’s also these non-robust features, present across data sets, that make models sensitive to adversarial examples crafted against other models.

An adversarial example crafted to break one model will likely break another, because it exploits the inherent non-robust features learned by both models.
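A hedged sketch of how this transferability might be measured in practice, reusing the `fgsm` helper above: craft examples against one model and count how often an independently trained model is fooled by them. The model and loader names are placeholders.

```python
import torch

def transfer_rate(model_a, model_b, loader, eps=8 / 255):
    """Fraction of examples crafted against model_a that model_b also misclassifies."""
    fooled, total = 0, 0
    for x, y in loader:
        x_adv = fgsm(model_a, x, y, eps)               # crafted against model_a only
        with torch.no_grad():
            pred_b = model_b(x_adv).argmax(dim=1)      # evaluated on a different model
        fooled += (pred_b != y).sum().item()
        total += y.numel()
    return fooled / total
```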

So the vulnerability of supervised machine learning to adversarial examples is a direct consequence of how algorithms inherently look at the data.

The algorithm is performing just fine chugging away at whichever patterns of features lead it to the best accuracy.

If an adversary messes with its classification result, then the model still thinks it’s performing quite well.

The fact that the result is not what we as humans would perceive is our problem.

To add a bit more of a human touch, Tsipras said that since “models might rely on all sorts of features of the data, we need to explicitly enforce priors on what we want the model to learn.” In other words, if we want to create models that are interpretable to us, then we need to encode our expected perceptions into the training process and not leave the algorithm to its own, non-human, devices.

Well, so what if the algorithm sees things differently than we humans do? Why is this unexpected? Is it scary? Is it something we don’t want to see happening? We know ML processes our world differently than we do.

Just compare this to the foundational notion in NLP that the meaning of a word can be extracted not from a memorized dictionary (as we humans do) but from exploring the terms near it in human-written text (as in Word2vec).
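As a small illustration of that NLP point (using the gensim library, with a toy corpus invented here purely for demonstration), Word2vec infers which words are similar solely from the company they keep:

```python
from gensim.models import Word2Vec

# Toy corpus: each "sentence" is a list of tokens. With so little data the
# neighbours will be noisy; the point is only that similarity comes from context.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["a", "cat", "chased", "a", "dog", "across", "the", "yard"],
]
model = Word2Vec(sentences, vector_size=32, window=2, min_count=1, epochs=100, seed=1)
print(model.wv.most_similar("cat"))   # nearest neighbours derived purely from co-occurrence
```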

It should already be appreciated that algorithms running on silicon see information differently than synapses firing in brains.

However, it’s this potential for an alternate, though equally effective, viewpoint that can lead AI down paths of intention different from our own.

It’s what might make us lose control.

From the work of Ilyas et al., a key focus, then, should be to embed human biases into ML algorithms to always nudge them in the direction we expect from our viewpoint.

For their future work, Tsipras said, “we are actively working on utilizing robustness as a tool to bias models towards learning features that are more meaningful to humans.” Building our models to be more human-aligned by including priors during development and training may be the crucial starting point for ensuring we don’t lose our controlling grasp on the future of AI.
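One common way to enforce such a prior, presumably related in spirit to what Tsipras describes, is adversarial (robust) training in the style of Madry et al., where the model is updated on worst-case perturbed inputs rather than on the clean ones. Below is a minimal sketch reusing the `fgsm` helper from earlier; `model`, `optimizer`, and the loader are assumed names, and in practice a stronger multi-step attack would replace FGSM.

```python
import torch.nn.functional as F

def adversarial_training_epoch(model, optimizer, loader, eps=8 / 255):
    """One epoch of training on perturbed inputs instead of the clean ones."""
    model.train()
    for x, y in loader:
        x_adv = fgsm(model, x, y, eps)    # craft perturbed inputs on the fly
        optimizer.zero_grad()             # discard gradients left over from crafting
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
```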

So, consider keeping a little touch of “human” in your next machine learning project.
