Deconstructing BERT: Distilling 6 Patterns from 100 Million Parameters

Pattern 1: Attention to next word. Left: attention weights for all tokens. Right: attention weights for selected token (“i”)

On the left, we can see that the [SEP] token disrupts the next-token attention pattern, as most of the attention from [SEP] is directed to [CLS] rather than to the next token. Thus this pattern appears to operate primarily within each sentence. This pattern is related to a right-to-left version of an RNN, where state updates are made sequentially from right to left. Pattern 1 appears over multiple layers of the model, in some sense emulating the recurrent updates of an RNN.

Pattern 2: Attention to previous word

In this pattern, much of the attention is directed to the previous token in the sentence. For example, most of the attention for “went” is directed to the previous word “i” in the figure below. The pattern is not as distinct as the last one; some attention is also dispersed to other tokens, especially the [SEP] tokens. Like Pattern 1, this is loosely related to a sequential RNN, in this case the more traditional left-to-right RNN.

Pattern 2: Attention to previous word. Left: attention weights for all tokens. Right: attention weights for selected token (“went”)

Pattern 3: Attention to identical/related words

In this pattern, attention is paid to identical or related words, including the source word itself. In the example below, most of the attention for the first occurrence of “store” is directed to itself and to the second occurrence of “store”. This pattern is not as distinct as some of the others, with attention dispersed over many different words.

Pattern 3: Attention to identical/related tokens. Left: attention weights for all tokens. Right: attention weights for selected token (“store”)

Pattern 4: Attention to identical/related words in other sentence

In this pattern, attention is paid to identical or related words in the other sentence.
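
As a rough illustration of how these attention maps can be inspected in code, the sketch below uses the Hugging Face transformers library (an assumption; it is not the visualization tool used for the figures in this article) to pull per-head attention weights from a pretrained BERT model and print, for each token, the token it attends to most strongly. The sentence pair and the layer/head indices are illustrative choices, since each pattern lives in particular heads.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

# Illustrative sentence pair with a repeated word ("store") across sentences.
sentence_a = "i went to the store."
sentence_b = "at the store, i bought fresh strawberries."
inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each shaped (batch, num_heads, seq_len, seq_len).
layer, head = 2, 0  # hypothetical indices; scan layers/heads to find each pattern
attn = outputs.attentions[layer][0, head]

# For every token, report the token that receives the most attention from it.
for i, tok in enumerate(tokens):
    j = int(attn[i].argmax())
    print(f"{tok:>12} -> {tokens[j]:<12} weight={attn[i, j].item():.2f}")
```

Sweeping this loop over all layers and heads of BERT-base is a quick way to spot heads matching Patterns 1 and 2 (argmax targets shifted one position right or left), while heads matching Patterns 3 and 4 show attention concentrated on repeated or related words within or across the two sentences.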
