And why? Even without understanding machine learning, it is not difficult to reckon that the more context we capture, the more accurate our predictions can be.
As such, a model's ability to capture as much context as possible, deeply and efficiently, is the winning recipe.
Let’s play a game: what are [Guess1] and [Guess2] in the context below?

[‘Natural’, ‘language’, ‘processing’, ‘is’, ‘a’, ‘marriage’, ‘of’, [Guess1], [Guess2], ‘and’, ‘linguistics’]

Given the 3-minute constraint, let me reveal the answer right away, and instead ask you: which model (GPT, BERT, or XLNet) would you find the most helpful for working out the answer?
Answer: [‘Natural’, ‘language’, ‘processing’, ‘is’, ‘a’, ‘marriage’, ‘of’, ‘machine’, ‘learning’, ‘and’, ‘linguistics’]

From here on, we use the notation Pr(Guess | Context).
Literally, it means the probability of the guess given the context.
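To make the notation concrete, here is the game written out as plain Python data (the names below are purely illustrative, not from any library):

```python
# The context is a list of tokens with two blanks; a "guess" is a word we
# score against it. Names here are illustrative only.
context = ['Natural', 'language', 'processing', 'is', 'a',
           'marriage', 'of', None, None, 'and', 'linguistics']
guess_positions = (7, 8)   # the two blanks, [Guess1] and [Guess2]

# Pr(Guess | Context): how likely a candidate word is at a blank,
# given whatever part of the context a model is allowed to see.
```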
GPT — We read from left to right, so we do not know the context that comes after ‘machine’ and ‘learning’:

Pr(‘machine’ | [‘Natural’, ‘language’, ‘processing’, ‘is’, ‘a’, ‘marriage’, ‘of’])

Pr(‘learning’ | [‘Natural’, ‘language’, ‘processing’, ‘is’, ‘a’, ‘marriage’, ‘of’, ‘machine’])

Knowing ‘machine’ actually helps you guess ‘learning’, because ‘learning’ frequently follows ‘machine’ (machine learning is a popular term).
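Here is a minimal sketch of that left-to-right factorisation, assuming a hypothetical `lm_prob(token, left_context)` function that returns Pr(token | left context) from some language model:

```python
import math

def gpt_style_log_prob(tokens, lm_prob):
    """Score a sentence the GPT way: each token is predicted
    only from the tokens to its LEFT."""
    total = 0.0
    for i, token in enumerate(tokens):
        left_context = tokens[:i]        # nothing to the right is visible
        total += math.log(lm_prob(token, left_context))
    return total

# For our game:
#   Pr('machine'  | first 7 tokens)
#   Pr('learning' | first 7 tokens + ['machine'])   # knowing 'machine' helps
```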
BERT — In contrast with GPT, we know the context on both sides, but we are guessing ‘machine’ and ‘learning’ based on the same context:

Pr(‘machine’ | [‘Natural’, ‘language’, ‘processing’, ‘is’, ‘a’, ‘marriage’, ‘of’, ‘and’, ‘linguistics’])

Pr(‘learning’ | [‘Natural’, ‘language’, ‘processing’, ‘is’, ‘a’, ‘marriage’, ‘of’, ‘and’, ‘linguistics’])

Having ‘linguistics’ actually helps you guess ‘machine learning’, because you know that natural language processing is a beautiful blend of machine learning and linguistics.
Even if you don’t know that, the presence of ‘linguistics’ at least tells you that the answer is not ‘linguistics’.
The obvious con of BERT is that, because the two words are guessed independently from the same context, it cannot exploit the fact that ‘machine’ and ‘learning’ form quite a common term.
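Here is a minimal sketch of BERT-style guessing, assuming a hypothetical `mlm_prob(word, visible_tokens, position)` function that returns Pr(word | visible context) for a masked position:

```python
MASK = '[MASK]'

def bert_style_guess(tokens, masked_positions, vocab, mlm_prob):
    """Each masked token is guessed from the SAME two-sided context;
    the guesses never see each other (they are made independently)."""
    visible = [MASK if i in masked_positions else t
               for i, t in enumerate(tokens)]
    guesses = {}
    for pos in masked_positions:
        # Uses both left and right context, but NOT the other masked word.
        guesses[pos] = max(vocab, key=lambda w: mlm_prob(w, visible, pos))
    return guesses
```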
How do we combine the pros of both GPT and BERT?

XLNet — The Best of Both: Permutation!

The power of permutation is that even though we still only read from left to right, permuting the sequence lets us capture the context from both sides (as if reading left to right and right to left).
One permutation that captures the context from both sides:

[‘Natural’, ‘language’, ‘processing’, ‘is’, ‘a’, ‘marriage’, ‘of’, ‘and’, ‘linguistics’, ‘machine’, ‘learning’]

Pr(‘machine’ | [‘Natural’, ‘language’, ‘processing’, ‘is’, ‘a’, ‘marriage’, ‘of’, ‘and’, ‘linguistics’])

Pr(‘learning’ | [‘Natural’, ‘language’, ‘processing’, ‘is’, ‘a’, ‘marriage’, ‘of’, ‘and’, ‘linguistics’, ‘machine’])

This time you have the full context, and after guessing ‘machine’ you can immediately guess ‘learning’.
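Here is a minimal sketch of the permutation idea, again with a hypothetical `lm_prob(token, context)` function. This is not XLNet's actual implementation (the paper uses two-stream attention so the original token order is still encoded); it only illustrates the factorisation XLNet trains on:

```python
import math
import random

def permutation_log_prob(tokens, target_positions, lm_prob):
    """Pick one factorisation order, then predict each target token from all
    tokens already visited in that order, wherever they sit in the sentence."""
    order = list(range(len(tokens)))
    random.shuffle(order)        # one random order of positions; the example
                                 # above puts 'machine' and 'learning' last
    seen, total = [], 0.0
    for pos in order:
        if pos in target_positions:
            context = [tokens[p] for p in seen]   # may include words from BOTH sides
            total += math.log(lm_prob(tokens[pos], context))
        seen.append(pos)         # this token is now visible to later guesses
    return total
```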
You can see clearly that XLNet combines the benefits of both GPT and BERT.
That’s all, hopefully it’s just a 3 min read.
Please clap and share if you enjoyed this article! Of course, read the XLNet paper if you want to know more.