Hugging Face Releases New NLP ‘Tokenizers’ Library Version (v0.8.0)

Hugging Face is at the forefront of a lot of updates in the NLP space.

They have released one groundbreaking NLP library after another in the last few years.

Honestly, I have learned and improved my own NLP skills a lot thanks to the work open-sourced by Hugging Face.

And today, they’ve released another big update – a brand new version of their popular Tokenizers library.

A Quick Introduction to Tokenization

So, what is tokenization? Tokenization is a crucial cog in Natural Language Processing (NLP).

It’s a fundamental step in both traditional NLP methods like the Count Vectorizer and advanced deep learning-based architectures like Transformers.

Tokens are the building blocks of Natural Language.

Tokenization is a way of separating a piece of text into smaller units called tokens.

Here, tokens can be either words, characters, or subwords.

Hence, tokenization can be broadly classified into 3 types – word, character, and subword (n-gram characters) tokenization.

For example, consider the sentence: “Never give up”.

The most common way of forming tokens is based on space.

Assuming space as the delimiter, tokenizing the sentence results in 3 tokens – “Never”, “give”, and “up”.

Since each token is a word, this is an example of word tokenization.
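
In Python, this whitespace-based word tokenization is just a string split. A minimal illustration:

sentence = "Never give up"
tokens = sentence.split()  # split on whitespace
print(tokens)  # ['Never', 'give', 'up']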

Why is Tokenization Required?

As tokens are the building blocks of Natural Language, the most common way of processing the raw text happens at the token level.

The sentences or phrases of a text dataset are first tokenized and then those tokens are converted into integers which are then fed into the deep learning models.

For example, Transformer-based models – the State-of-the-Art (SOTA) Deep Learning architectures in NLP – process the raw text at the token level.

Similarly, the most popular deep learning architectures for NLP like RNN, GRU, and LSTM also process the raw text at the token level.

Hugging Face’s Tokenizers Library

We all know about Hugging Face thanks to their Transformers library, which provides a high-level API to state-of-the-art transformer-based models such as BERT, GPT2, ALBERT, RoBERTa, and many more.

The Hugging Face team also happens to maintain another highly efficient and super fast library for text tokenization called Tokenizers.

Recently, they have released v0.8.0 of the library.

In this article, I’ll show how you can easily get started with this Tokenizers library for NLP tasks.

Getting Started with Tokenizers

I’ll be using Google Colab for this demo.

However, you are free to use any other platform or IDE of your choice.

So, first of all, let’s quickly install the tokenizers library:

!pip install tokenizers

You can check the version of the library by executing the commands below:

import tokenizers
tokenizers.__version__

Let’s import some required libraries and the BertWordPieceTokenizer from the tokenizers library:
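
The import cell is embedded as a Gist in the original article; a minimal sketch of what it needs for the rest of this walkthrough:

# the BERT WordPiece tokenizer class used throughout this demo
from tokenizers import BertWordPieceTokenizer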

Other tokenization schemes are available as well, such as ByteLevelBPETokenizer, CharBPETokenizer, and SentencePieceBPETokenizer.

In this article, I will be using BertWordPieceTokenizer only.

This is the tokenization scheme used in the BERT model.

Tokenization

Next, we have to download a vocabulary set for our tokenizer:

# Bert Base Uncased Vocabulary
!wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt

Now, let’s tokenize a sample sentence:
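
The tokenization code itself is a Gist embed; here is a minimal sketch, with the sample sentence reconstructed from the outputs shown below:

# build the tokenizer from the downloaded BERT vocabulary (lowercase to match bert-base-uncased)
tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)

sentence = "Language is a thing of beauty. But mastering a new language from scratch is quite a daunting prospect."

# encode the sentence – returns an Encoding object with ids, tokens, and offsets
encoded_output = tokenizer.encode(sentence)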

The three main components of “encoded_output” are:

ids – The integer values assigned to the tokens of the input sentence.

tokens – The tokens after tokenization.

offsets – The position of all the tokens in the input sentence.

print(encoded_output.ids)

Output: [101, 2653, 2003, 1037, 2518, 1997, 5053, 1012, 2021, 11495, 1037, 2047, 2653, 2013, 11969, 2003, 3243, 1037, 4830, 16671, 2075, 9824, 1012, 102]

print(encoded_output.tokens)

Output: ['[CLS]', 'language', 'is', 'a', 'thing', 'of', 'beauty', '.', 'but', 'mastering', 'a', 'new', 'language', 'from', 'scratch', 'is', 'quite', 'a', 'da', '##unt', '##ing', 'prospect', '.', '[SEP]']

print(encoded_output.offsets)

Output: [(0, 0), (0, 8), (9, 11), (12, 13), (14, 19), (20, 22), (23, 29), (29, 30), (31, 34), (35, 44), (45, 46), (47, 50), (51, 59), (60, 64), (65, 72), (73, 75), (76, 81), (82, 83), (84, 86), (86, 89), (89, 92), (93, 101), (101, 102), (0, 0)]

Saving and Loading Tokenizer

The tokenizers library also allows us to easily save our tokenizer as a JSON file and load it for later use.

This is helpful for large text datasets.

We won’t have to initialize the tokenizer again and again.

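The save-and-load code is a Gist embed in the original; a minimal sketch, assuming the v0.8.0 serialization API (tokenizer.save() and Tokenizer.from_file()) and an example file name:

# save the full tokenizer (vocabulary + configuration) to a single JSON file
tokenizer.save("bert_wordpiece_tokenizer.json")

# later, reload it without initializing from the vocabulary file again
from tokenizers import Tokenizer
reloaded_tokenizer = Tokenizer.from_file("bert_wordpiece_tokenizer.json")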

Encode Pre-Tokenized Sequences

While working with text data, there are often situations where the data is already tokenized.

However, it is not tokenized according to the desired tokenization scheme.

In such a case, the tokenizers library can come in handy as it can encode pre-tokenized text sequences as well.

So, instead of the input sentence, we will pass the tokenized form of the sentence as input.

Here, we have tokenized the sentence based on the space between two consecutive words:

print(sentence.split())

Output: ['Language', 'is', 'a', 'thing', 'of', 'beauty.', 'But', 'mastering', 'a', 'new', 'language', 'from', 'scratch', 'is', 'quite', 'a', 'daunting', 'prospect.']
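
The encoding step is again a Gist embed; a minimal sketch, assuming the is_pretokenized flag of the encode method:

# pass the list of words instead of the raw string
encoded_pretokenized = tokenizer.encode(sentence.split(), is_pretokenized=True)
print(encoded_pretokenized.tokens)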

Output: ['[CLS]', 'language', 'is', 'a', 'thing', 'of', 'beauty', '.', 'but', 'mastering', 'a', 'new', 'language', 'from', 'scratch', 'is', 'quite', 'a', 'da', '##unt', '##ing', 'prospect', '.', '[SEP]']

It turns out that this output is identical to the output we got when the input was a text string.

Speed Testing Tokenizers

As I mentioned above, tokenizers is a fast tokenization library.

Let’s test it out on a large text corpus.

I will use the WikiText-103 dataset (181 MB in size).

Let’s first download it and then unzip it:

!wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip
!unzip wikitext-103-v1.zip

The unzipped data contains three files – wiki.train.tokens, wiki.valid.tokens, and wiki.test.tokens. We will use the wiki.train.tokens file only for benchmarking:
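
The counting code is a Gist embed; a rough sketch, where the wikitext-103/ extraction path is an assumption:

# read the training split and count the number of text sequences (lines)
with open("wikitext-103/wiki.train.tokens", encoding="utf-8") as f:
    texts = f.read().splitlines()
print(len(texts))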

Output: 1801350

There are close to two million sequences of text in the train set. That’s a huge number.

Let’s see how the tokenizers library deals with this huge data.

We will use “encode_batch” instead of “encode” because now we are going to tokenize more than one sequence:
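
The benchmarking code is a Gist embed; a minimal sketch of how the timing might be measured:

import time

start = time.time()
# tokenize all ~1.8 million sequences in a single batched call
encoded_texts = tokenizer.encode_batch(texts)
print(time.time() - start)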

Output: 218.2345

This is mind-blowing! It took just 218 seconds, or close to 3.5 minutes, to tokenize 1.8 million text sequences. Most of the other tokenization methods would crash even on Colab.

Go ahead, try it out and let me know your experience using Hugging Face’s Tokenizers NLP library!