RNNs: The Trade-Off Between Long-Term Memory and Smoothness

Zachary Manesiotis · Jun 20

It is well-known that retaining long-term information when learning via gradient descent is a difficult task for Recurrent Neural Networks (RNNs), due to the vanishing or exploding gradient problem.

A recent study examined this behaviour in more depth and came to an interesting conclusion about the relationship between two seemingly disparate properties of an RNN: the smoothness of its cost function and its long-term memory retention.

What does it mean if a cost function is “smooth”?

Smoothness here is based on the Lipschitz continuity of the cost function and of its gradient.

Essentially, a small Lipschitz constant means the cost function is less intricate (more “smooth”), and convergence is therefore possible even with larger step sizes.
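For reference, the standard definitions are as follows (textbook material, not taken from the paper); here J denotes the cost and θ the RNN parameters:

```latex
% J is K-Lipschitz if its value cannot change faster than K per unit change in the parameters:
|J(\theta_1) - J(\theta_2)| \;\le\; K \,\lVert \theta_1 - \theta_2 \rVert
\quad \text{for all } \theta_1, \theta_2,

% and J is L-smooth if its gradient is L-Lipschitz:
\lVert \nabla J(\theta_1) - \nabla J(\theta_2) \rVert \;\le\; L \,\lVert \theta_1 - \theta_2 \rVert .

% For an L-smooth cost, a gradient step with \eta \le 1/L can never increase the cost (descent lemma):
J\bigl(\theta - \eta \nabla J(\theta)\bigr) \;\le\; J(\theta) - \tfrac{\eta}{2}\,\lVert \nabla J(\theta) \rVert^2 ,
% which is why a smaller smoothness constant permits a larger stable step size.
```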

Lipschitz continuity – Wikipedia: “In mathematical analysis, Lipschitz continuity, named after Rudolf Lipschitz, is a strong form of uniform continuity…” (en.wikipedia.org)

How does this relate to long-term memory in RNNs?

Examples have shown us that, for some parameter values, an LSTM model exhibits chaotic behaviour (which essentially gives it infinite long-term memory, in the sense that small variations in the initial conditions can lead to vastly different outcomes).
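To make the sensitivity-to-initial-conditions point concrete, here is a minimal sketch. It is my own illustration, not code or parameters from the study: the study’s example uses an LSTM with particular trained parameters, whereas a plain tanh RNN with random weights is used here only because the effect is easy to reproduce. With the gain g pushed above 1, random networks of this kind are commonly reported to behave chaotically, and a tiny initial difference grows by many orders of magnitude; with g below 1 the two trajectories collapse together instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Vanilla tanh RNN, h_{t+1} = tanh(W h_t), used as a simple stand-in for the
# LSTM example in the study.  The gain g scales the recurrent weights.
N = 100            # number of hidden units
g = 1.5            # try g = 0.7 to see the contractive (non-chaotic) behaviour
W = rng.normal(0.0, g / np.sqrt(N), size=(N, N))

h_a = rng.normal(size=N)
h_b = h_a + 1e-8 * rng.normal(size=N)   # tiny perturbation of the initial state

for t in range(101):
    if t % 20 == 0:
        print(f"t = {t:3d}   ||h_a - h_b|| = {np.linalg.norm(h_a - h_b):.2e}")
    h_a = np.tanh(W @ h_a)
    h_b = np.tanh(W @ h_b)
```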

Interestingly, in this regime the cost function exhibits intricate and highly non-smooth behaviour, which suggests that there may be an underlying relationship between the smoothness of the cost function and long-term information retention.

In the study, Ribeiro, Tiels, Aguirre, and Schön show that this relationship runs deep: they quantify the amount of long-term information retained by an RNN and derive a bound showing that the larger the Lipschitz constant, the slower the decay of the retained information.

Recall that a large Lipschitz constant also implies that the cost function is intricate (with many local minima), which makes convergence difficult.
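The sketch below is a rough, hand-rolled illustration of that trade-off, not the paper’s bound or code. It tracks the spectral norm of ∂h_T/∂h_0 for the same toy tanh RNN: this quantity is a crude proxy for how much information about the initial state survives after T steps, and it is also the factor that multiplies gradients propagated back through time, so when it decays slowly the cost gradients (and hence the relevant Lipschitz constants) can become very large.

```python
import numpy as np

def state_sensitivity(gain, N=100, T=60, seed=0):
    """Spectral norm of d h_T / d h_0 for the toy RNN h_{t+1} = tanh(W h_t),
    used as a crude proxy for how much of the initial state is still
    'remembered' after T steps (and for how strongly gradients propagate
    back through time)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, gain / np.sqrt(N), size=(N, N))
    h = rng.normal(size=N)
    J = np.eye(N)                              # running Jacobian d h_t / d h_0
    norms = []
    for _ in range(T):
        h = np.tanh(W @ h)
        J = np.diag(1.0 - h**2) @ W @ J        # chain rule through one time step
        norms.append(np.linalg.norm(J, 2))
    return norms

for gain in (0.7, 1.5):
    n = state_sensitivity(gain)
    print(f"gain = {gain}: ||d h_T / d h_0|| at T = 20, 40, 60 -> "
          f"{n[19]:.1e}, {n[39]:.1e}, {n[59]:.1e}")
```

In typical runs, the contractive network (gain 0.7) forgets its initial state and the norm shrinks towards zero, while the chaotic one (gain 1.5) retains it and the norm keeps growing, mirroring the memory-versus-smoothness trade-off described above.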

The precise bound and its derivation are given in the paper. The authors conclude by stating:

“the insight provided here might be used for the design of new regularization techniques and optimization algorithms that might help to properly explore challenging and interesting regions of the RNN parameter space.”

Although RNNs have gradually been losing popularity since the development of high-performance feed-forward architectures for sequential learning tasks, this deep insight into their behaviour will no doubt shed some light on how their training difficulties might be overcome.

