Neural Networks seem to follow a puzzlingly simple strategy to classify images

Why should a ResNet-50 learn about complex large-scale relationships like object shape if the abundance of local image features is sufficient to solve the task? To test the hypothesis that modern DNNs follow a similar strategy to simple bag-of-features networks, we test different ResNets, DenseNets and VGGs on the following "signatures" of BagNets:

Decisions are invariant to spatial shuffling of image features (this could only be tested on VGG models).

Modifications of different image parts should be independent (in terms of their effect on the total class evidence).

Errors made by standard CNNs and BagNets should be similar.

Standard CNNs and BagNets should be sensitive to similar features.
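The first signature, invariance to spatial shuffling, can be illustrated with a minimal numpy sketch. The `shuffle_patches` helper below is hypothetical (not from the paper's code); the point is that any model which simply sums evidence pooled over local patches cannot distinguish the original from the shuffled image:

```python
import numpy as np

def shuffle_patches(image, patch_size, rng):
    """Cut an image (H, W, C) into non-overlapping patches and permute them."""
    h, w, c = image.shape
    ph, pw = patch_size
    patches = [image[i:i + ph, j:j + pw]
               for i in range(0, h, ph)
               for j in range(0, w, pw)]
    rng.shuffle(patches)
    cols = w // pw
    out = np.zeros_like(image)
    for k, p in enumerate(patches):
        i, j = divmod(k, cols)
        out[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw] = p
    return out

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
shuffled = shuffle_patches(img, (56, 56), rng)

# A bag-of-features readout pools local evidence over all patches, so any
# statistic computed per patch and then pooled (here: the global mean) is
# unchanged by the shuffle, while object shape is destroyed.
assert np.isclose(img.mean(), shuffled.mean())
```

Running the real experiment would of course feed `img` and `shuffled` through a trained network and compare its class predictions instead of a pooled pixel statistic.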

In all four experiments we find a strikingly similar behaviour between CNNs and BagNets.
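The second signature, independence of image parts, follows directly from the bag-of-features structure: if the class evidence is a sum of local terms, the effects of modifying disjoint parts add up exactly. A minimal sketch with a hypothetical linear readout (the `weights` array stands in for learned local evidence, not for any actual network):

```python
import numpy as np

rng = np.random.default_rng(2)
weights = rng.random((8, 8))   # hypothetical local evidence weights
img = rng.random((8, 8))

def class_evidence(x):
    """Bag-of-features readout: total class evidence is a sum of local terms."""
    return float((x * weights).sum())

base = class_evidence(img)

mod_a = img.copy(); mod_a[:4, :4] = 0.0    # modify part A only
mod_b = img.copy(); mod_b[4:, 4:] = 0.0    # modify part B only
mod_ab = img.copy(); mod_ab[:4, :4] = 0.0; mod_ab[4:, 4:] = 0.0  # modify both

delta_a = class_evidence(mod_a) - base
delta_b = class_evidence(mod_b) - base
delta_ab = class_evidence(mod_ab) - base

# For a pure bag-of-features model the effects are exactly additive:
assert np.isclose(delta_ab, delta_a + delta_b)
```

For a real CNN the interesting question is how far `delta_ab` deviates from `delta_a + delta_b`; the closer to zero the deviation, the more bag-of-features-like the network behaves.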

As an example, in the last experiment we show that the image parts to which the BagNets are most sensitive (e.g. if you occlude those parts) are basically the same ones to which CNNs are most sensitive.

In fact, the heatmaps (the spatial map of sensitivity) of BagNets are better predictors of the sensitivity of DenseNet-169 than heatmaps generated by attribution methods like DeepLift (which compute heatmaps directly for DenseNet-169).
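This kind of comparison can be sketched numerically: compute an occlusion heatmap for two scoring functions and correlate them. The sketch below uses stand-in linear scores rather than actual BagNet or DenseNet-169 outputs, so the correlation here only illustrates the procedure, not the paper's result:

```python
import numpy as np

def occlusion_heatmap(score_fn, image, patch=8):
    """Sensitivity of a scalar class score to zeroing out each image patch."""
    base = score_fn(image)
    h, w = image.shape
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = 0.0  # occlude one patch
            heat[i // patch, j // patch] = base - score_fn(occluded)
    return heat

rng = np.random.default_rng(1)
img = rng.random((32, 32))
w1 = rng.random((32, 32))               # stand-in "BagNet" evidence weights
w2 = w1 + 0.1 * rng.random((32, 32))    # stand-in "CNN" with similar weights

h1 = occlusion_heatmap(lambda x: float((x * w1).sum()), img)
h2 = occlusion_heatmap(lambda x: float((x * w2).sum()), img)

# Agreement between the two sensitivity maps:
corr = np.corrcoef(h1.ravel(), h2.ravel())[0, 1]
```

In the actual experiments, `score_fn` would be a trained network's logit for the predicted class, and the BagNet heatmap's ability to predict the DenseNet's occlusion sensitivity is what gets compared against attribution methods like DeepLift.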

Of course, DNNs do not perfectly resemble bag-of-features models and show some deviations.

In particular, we find increased feature sizes and more long-range dependencies the deeper the networks get.

Hence, deeper neural networks do improve over simpler bag-of-feature models but I don’t think that the core classification strategy has really changed.

Going beyond bag-of-features classification

Viewing the decision-making of CNNs as a bag-of-features strategy could explain several weird observations about CNNs.

First, it would explain why CNNs have such a strong texture-bias.

Second, it would explain why CNNs are so insensitive to the shuffling of image parts.

It might even explain the existence of adversarial stickers and adversarial perturbations in general: one can place misleading signals anywhere in the image and the CNN will still reliably pick up that signal, whether or not these signals fit the rest of the image.

At its core our work shows that CNNs use the many weak statistical regularities present in natural images for classification and don’t make the jump towards object-level integration of image parts like humans.

The same is likely true for other tasks and sensory modalities.

We have to think hard about how to build our architectures, tasks and learning methods to counteract this tendency towards weak statistical correlations.

One angle would be to improve the inductive biases of CNNs away from small local features towards more global features.

Another angle would be to remove or replace those features on which the networks should not rely, which is exactly what we did in another ICLR 2019 publication using a style transfer preprocessing to remove the natural object texture.

One of the biggest problems, however, is certainly the task of image classification itself: if local image features are sufficient to solve the task, there is no incentive to learn the true “physics” of the natural world.

We will have to restructure the task itself in a way that pushes models to learn the physical nature of objects.

This likely has to go beyond purely observational learning of correlations between input and output features in order to allow models to extract causal dependencies.

Taken together, our results suggest that CNNs might follow an extremely simple classification strategy.

The fact that such a discovery can still be made in the year 2019 highlights how little we have yet understood about the inner workings of deep neural networks.

This lack of understanding prevents us from developing fundamentally better models and architectures that close the gap between human and machine perception.

Deepening our understanding will allow us to uncover ways to close this gap.

This can be extremely fruitful: when we biased CNNs towards more physical properties of objects, we suddenly reached human-like robustness to noise.

I expect many more exciting results on our way towards CNNs that truly learn the physical and causal nature of our world.

