Are BERT Features InterBERTible?

Let's take a look at another example, this time using the word pie.

We generate 5 sentences:

- pie (using pie without any context)
- the man ate a pie (using pie as the object)
- the man threw a pie (using pie as the object)
- the pie was delicious (using pie as the subject)
- the pie ate a man (using pie in a non-admissible context)

We observe a trend that's very similar to what we saw in our previous refrigerator example.

Make pies, not war.

Next, let’s take a look at our original example with the words king, queen, man, and woman.

We construct 4 nearly identical sentences, swapping out their subjects.

- the king passed a law
- the queen passed a law
- the man passed a law
- the woman passed a law

From these sentences, we extract the BERT representation of the subject.

In this case, we get a better result: subtracting man from king and adding woman shifts us very slightly closer towards queen.

A slight improvement from before, albeit still not convincing.

Finally, we explore how word representations change when the structure of the sentence is fixed but the sentiment is not.

Here, we construct 3 sentences:

- math is a difficult subject
- math is a hard subject
- math is a simple subject

Using these sentences, we would like to probe what happens to the subject and adjective representations as we vary the sentiment.

Interestingly enough, we find that the adjectives that are synonymous (i.e. difficult and hard) have similar representations, but the adjectives that are antonymous (i.e. difficult and simple) have very different representations.

synonyms vs. antonyms.
Additionally, as we vary the sentiment, we find that the representation of the subject, math, is more similar when the contexts have the same sentiment (i.e. difficult and hard) than when the contexts have different sentiments (i.e. difficult and simple).

shifting subject representations.

Conclusion

In conclusion, the results seem to signal that, like Word2Vec, BERT may also learn a semantic vector representation (albeit much less pronounced).

It also seems that BERT really does rely heavily on contextual information: a word without any context is represented very differently from the same word with some context, and shifting contexts (like changing sentiments) also shifts subject representations.

Keep in mind that there is always the danger of overgeneralizing with limited evidence.

These experiments are not complete and this is just the beginning.

We use a very small sample size (out of the vast lexicon of English words), evaluate with a single, very specific distance metric (cosine distance), and rely on a rather ad-hoc set of experiments.
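For reference, the cosine distance used throughout is simple to state; here is a minimal NumPy version (one of several equivalent formulations):

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cos(theta): 0 for identical directions, 1 for orthogonal, 2 for opposite."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_distance([1, 0], [1, 0]))  # → 0.0
print(cosine_distance([1, 0], [0, 1]))  # → 1.0
```

Note that cosine distance ignores vector magnitude entirely, which is itself a modeling choice; other metrics (e.g. Euclidean distance) could tell a different story.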

Future work on analyzing BERT representations should expand on all of these aspects.

Finally, thanks to John Hewitt and Zack Lipton for providing useful discussion on the subject.

