How do subwords help your NLP model?

Introduction to subword

Edward Ma, May 18

Classic word representations cannot handle unseen or rare words well.

Character embeddings are one solution to overcome the out-of-vocabulary (OOV) problem.

However, they may be too fine-grained and miss some important information.

Subwords sit between words and characters.

They are not too fine-grained, yet they can still handle unseen and rare words.

For example, we can split “subword” into “sub” and “word”.

In other words, we use two vectors (i.e. “sub” and “word”) to represent “subword”.

You may argue that this takes more resources to compute, but in reality it has a smaller footprint compared to a word-level representation.

This story discusses SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing (Kudo et al., 2018) and then covers different subword algorithms.

The following will be covered:
- Byte Pair Encoding (BPE)
- WordPiece
- Unigram Language Model
- SentencePiece

Byte Pair Encoding (BPE)

Sennrich et al. (2016) proposed Byte Pair Encoding (BPE) to build a subword dictionary. Radford et al. adopted BPE to construct subword vectors for GPT-2 in 2019.

Algorithm

1. Prepare a large enough training corpus.
2. Define a desired subword vocabulary size.
3. Split each word into a sequence of characters and append the suffix “</w>” to the end of the word, keeping the word frequency, so the basic unit at this stage is the character. For example, if the frequency of “low” is 5, we rephrase it as “l o w </w>”: 5.
4. Generate a new subword by merging the pair of symbols with the highest co-occurrence frequency.
5. Repeat step 4 until the subword vocabulary size defined in step 2 is reached or the next highest-frequency pair has a count of 1.

Algorithm of BPE (Sennrich et al., 2015)

Example

Taking “low: 5”, “lower: 2”, “newest: 6” and “widest: 3” as an example, the highest-frequency subword pair is “e” and “s”.

This is because we get a count of 6 from “newest” and 3 from “widest”.

A new subword “es” is then formed, and it becomes a candidate in the next iteration.

In the second iteration, the next highest-frequency subword pair is “es” (generated in the previous iteration) and “t”. Again, this is because we get a count of 6 from “newest” and 3 from “widest”. Keep iterating until the desired vocabulary size is reached or the next highest-frequency pair has a count of 1.
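To make this concrete, here is a minimal Python sketch of the BPE merge loop, adapted from the style of the pseudocode in Sennrich et al. (2016); the helper names (get_pair_counts, merge_pair) and num_merges are illustrative, not part of any library.

import re
from collections import Counter

def get_pair_counts(vocab):
    # Count the frequency of every adjacent symbol pair across the vocabulary.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every adjacent occurrence of the pair with the merged symbol.
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus from the example above: words split into characters plus "</w>".
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

num_merges = 10  # stand-in for the target subword vocabulary size
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    if pairs[best] == 1:  # stop if the next highest-frequency pair occurs only once
        break
    vocab = merge_pair(best, vocab)
    print(best, pairs[best])  # first two merges: ('e', 's') 9, then ('es', 't') 9

Running it reproduces the example: “e” and “s” merge first (count 9), then “es” and “t”.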

WordPiece

WordPiece is another word segmentation algorithm, similar to BPE. Schuster and Nakajima introduced WordPiece in 2012 while working on Japanese and Korean voice search. Basically, WordPiece is similar to BPE; the difference is that it forms a new subword based on likelihood rather than on the next highest-frequency pair.

Algorithm

1. Prepare a large enough training corpus.
2. Define a desired subword vocabulary size.
3. Split each word into a sequence of characters.
4. Build a language model on the data from step 3.
5. Choose the new word unit, out of all possible ones, that most increases the likelihood of the training data when added to the model.
6. Repeat step 5 until the subword vocabulary size defined in step 2 is reached or the likelihood increase falls below a certain threshold (see the sketch below for one way to score candidates).
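For illustration, the sketch below scores a candidate merge by count(xy) / (count(x) * count(y)), which is one common reading of the likelihood criterion; it is a simplification under that assumption, not the exact procedure of Schuster and Nakajima (2012).

from collections import Counter

def best_merge_by_likelihood(vocab):
    # Score each adjacent symbol pair by count(xy) / (count(x) * count(y)).
    # Merging the top-scoring pair gives the largest gain in unigram
    # language-model likelihood -- a common interpretation of the WordPiece
    # criterion, not the exact algorithm of Schuster and Nakajima (2012).
    pair_counts, unit_counts = Counter(), Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for s in symbols:
            unit_counts[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_counts[(a, b)] += freq
    return max(pair_counts,
               key=lambda p: pair_counts[p] / (unit_counts[p[0]] * unit_counts[p[1]]))

# Same toy corpus as in the BPE example.
vocab = {'l o w': 5, 'l o w e r': 2, 'n e w e s t': 6, 'w i d e s t': 3}
print(best_merge_by_likelihood(vocab))
# ('i', 'd') -- unlike BPE, the most frequent pair ('e', 's') is not chosen,
# because likelihood favours pairs whose parts rarely occur apart.

Note how the chosen pair differs from BPE's frequency-based choice on the same data, which is exactly the point of the likelihood criterion.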

Unigram Language Model

Kudo introduced the unigram language model as another algorithm for subword segmentation. One of its assumptions is that all subword occurrences are independent, so the probability of a subword sequence is the product of the individual subword occurrence probabilities. Both WordPiece and the unigram language model leverage a language model to build the subword vocabulary.

Algorithm

1. Prepare a large enough training corpus.
2. Define a desired subword vocabulary size.
3. Optimize the subword occurrence probabilities given the word sequences.
4. Compute the loss of each subword.
5. Sort the subwords by loss and keep the top X% (e.g. X can be 80). To avoid out-of-vocabulary issues, the single characters are recommended to be kept as a subset of the subwords.
6. Repeat steps 3–5 until the subword vocabulary size defined in step 2 is reached or there is no change in step 5.
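To make the independence assumption concrete, here is a small Viterbi-style sketch that picks the most probable segmentation of a word under a fixed unigram vocabulary. The vocabulary and probabilities are made up for illustration; real implementations such as SentencePiece work over a much larger vocabulary and lattice.

import math

# Hypothetical unigram probabilities for a tiny subword vocabulary
# (made up for illustration only).
subword_logp = {s: math.log(p) for s, p in {
    's': 0.02, 'u': 0.02, 'b': 0.02, 'w': 0.02, 'o': 0.02, 'r': 0.02, 'd': 0.02,
    'su': 0.01, 'sub': 0.04, 'word': 0.05, 'subword': 0.001,
}.items()}

def best_segmentation(text):
    # Viterbi over character positions: best[i] holds the best (log-prob,
    # segmentation) of text[:i]. Under the independence assumption, the
    # sequence probability is the product of subword probabilities,
    # i.e. the sum of their log-probabilities.
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in subword_logp and best[start][1] is not None:
                score = best[start][0] + subword_logp[piece]
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1]

print(best_segmentation('subword'))  # approx (-6.21, ['sub', 'word'])

With these toy probabilities, “sub” + “word” beats both the whole word and the character-level split, which mirrors the intuition from the introduction.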

SentencePiece

So, is there an existing library we can leverage for our text processing? Kudo and Richardson implemented the SentencePiece library. You have to train your tokenizer on your own data so that you can encode and decode your data for downstream tasks.

First of all, prepare a plain text file containing your data, then call the following API to train the model:

import sentencepiece as spm
spm.SentencePieceTrainer.Train('--input=test/botchan.txt --model_prefix=m --vocab_size=1000')

It is super fast, and you can load the model by:

sp = spm.SentencePieceProcessor()
sp.Load("m.model")

To encode your text, you just need:

sp.EncodeAsIds("This is a test")

For more examples and usage, you can access this repo.

Take Away

Subwords balance vocabulary size and footprint. In the extreme case, we could use only 26 tokens (i.e. characters) to represent all English words. A vocabulary of 16k or 32k subwords is the recommended size for good results.

Words in many Asian languages cannot be separated by spaces, so the initial vocabulary is much larger than for English. You may need to prepare over 10k initial words to kick-start the word segmentation. In their research, Schuster and Nakajima proposed using 22k words for Japanese and 11k words for Korean respectively.

Like to learn? I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related topics. Feel free to connect with me on LinkedIn or follow me on Medium or Github.

Extension Reading

- Classic word representation
- Character embeddings
- Too powerful NLP model (GPT-2)
- SentencePiece GIT repo

Reference

T. Kudo and J. Richardson. SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. 2018.

R. Sennrich, B. Haddow and A. Birch. Neural Machine Translation of Rare Words with Subword Units. 2015.

M. Schuster and K. Nakajima. Japanese and Korean Voice Search. 2012.

Taku Kudo. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. 2018.
