Hugging Face is at the cutting edge of many developments in the NLP space. They have released one groundbreaking NLP library after another over the last couple of years. Honestly, I have learned and improved my own NLP skills a lot thanks to the work open-sourced by Hugging Face.
And today, they've released another major update – a brand new version of their popular Tokenizers library.
Consider the sentence: "Never give up".
A Quick Introduction to Tokenization.
Tokenization is a way of splitting a piece of text into smaller units called tokens. Here, tokens can be either characters, words, or subwords. Tokenization can hence be broadly classified into 3 types – word, character, and subword (n-gram characters) tokenization.
The most common way of forming tokens is based on space. Assuming space as a delimiter, the tokenization of the sentence results in 3 tokens – Never-give-up. As each token is a word, it becomes an example of Word tokenization.
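As a minimal illustration, this space-based word tokenization can be done in plain Python:

```python
sentence = "Never give up"

# Split on whitespace: each resulting word is one token
tokens = sentence.split()
print(tokens)  # ['Never', 'give', 'up']
```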
What is tokenization? Tokenization is a vital cog in Natural Language Processing (NLP). It's an essential step in both traditional NLP methods like Count Vectorizer and advanced Deep Learning-based architectures like Transformers.
Tokens are the building blocks of Natural Language.
Why is Tokenization Required?
Now, let's tokenize a sample sentence:
The Hugging Face team also happens to maintain another highly efficient and blazing fast library for text tokenization called Tokenizers. Recently, they released version 0.8.0 of the library.
You can check the version of the library by executing the command below:
Transformer-based models – the state-of-the-art (SOTA) Deep Learning architectures in NLP – process the raw text at the token level. Similarly, the most popular deep learning architectures for NLP, such as RNN, GRU, and LSTM, also process the raw text at the token level.
# Bert Base Uncased Vocabulary
!wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
Key Highlights of Tokenizers v0.8.0.
There are various other types of tokenization schemes available as well, such as ByteLevelBPETokenizer, CharBPETokenizer, and SentencePieceBPETokenizer. In this article, I will be using BertWordPieceTokenizer only. This is the tokenization scheme used in the BERT model.
I'll be using Google Colab for this demonstration. You are free to use any other platform or IDE of your choice. First of all, let's quickly install the tokenizers library:
Getting Started with Tokenizers.
Now both pre-tokenized sequences and raw text strings can be encoded.
Training a custom tokenizer is now 5 to 10 times faster.
Saving a tokenizer is easier than ever. It takes just one line of code to save a tokenizer as a JSON file.
And many other improvements and fixes.
ids – the integer values assigned to the tokens of the input sentence.
As tokens are the building blocks of Natural Language, the most common way of processing raw text happens at the token level. The sentences or phrases of a text dataset are first tokenized, and then those tokens are converted into integers, which are then fed into the deep learning models.
tokens – the tokens obtained after tokenization.
Next, we need to download a vocabulary file for our tokenizer:
tokenizers.__version__
Hugging Face's Tokenizers Library.
Let's import the required libraries and the BertWordPieceTokenizer from the tokenizers library:
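The setup might look like the following sketch. To keep it self-contained and runnable, it builds a tiny stand-in vocabulary file (tiny-vocab.txt is my own placeholder); in the article's flow you would pass the downloaded bert-base-uncased-vocab.txt instead, and the result of encoding the sample sentence is stored as encoded_output:

```python
from tokenizers import BertWordPieceTokenizer

# Tiny stand-in vocabulary so the snippet runs on its own.
# In practice, pass the downloaded bert-base-uncased-vocab.txt here.
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
         "never", "give", "up"]
with open("tiny-vocab.txt", "w") as f:
    f.write("\n".join(vocab))

tokenizer = BertWordPieceTokenizer("tiny-vocab.txt", lowercase=True)

# encode() returns an Encoding object exposing ids, tokens, and offsets
encoding = tokenizer.encode("Never give up")
print(encoding.tokens)   # ['[CLS]', 'never', 'give', 'up', '[SEP]']
print(encoding.ids)
print(encoding.offsets)
```

Because the vocabulary contains the [CLS] and [SEP] tokens, the tokenizer automatically wraps the sequence with them, as BERT expects.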
The three main parts of encoded_output are:
!pip install tokenizers
To see the whole list of changes and updates, refer to this link. In this article, I'll show how you can easily get started with this latest version of the Tokenizers library for NLP tasks.
We all know about Hugging Face thanks to their Transformers library, which provides a high-level API to state-of-the-art transformer-based models such as BERT, GPT-2, ALBERT, RoBERTa, and many more.
print(encoded_output.ids)
offsets – the positions (start and end character indices) of all the tokens in the input sentence.
Output: [101, 2653, 2003, 1037, 2518, 1997, 5053, 1012, 2021, 11495, 1037, 2047, 2653, 2013, 11969, 2003, 3243, 1037, 4830, 16671, 2075, 9824, 1012, 102]
print(encoded_output.tokens)

Output: ['[CLS]', 'language', 'is', 'a', 'thing', 'of', 'beauty', '.', 'but', 'mastering', 'a', 'new', 'language', 'from', 'scratch', 'is', 'quite', 'a', 'da', '##unt', '##ing', 'prospect', '.', '[SEP]']

print(encoded_output.offsets)
Output: [(0, 0), (0, 8), (9, 11), (12, 13), (14, 19), (20, 22), (23, 29), (29, 30), (31, 34), (35, 44), (45, 46), (47, 50), (51, 59), (60, 64), (65, 72), (73, 75), (76, 81), (82, 83), (84, 86), (86, 89), (89, 92), (93, 101), (101, 102), (0, 0)]
Encode Pre-Tokenized Sequences.
As I mentioned above, tokenizers is a fast tokenization library. Let's test it out on a large text corpus.
While working with text data, there are often situations where the data is already tokenized, but not as per the desired tokenization scheme. In such a case, the tokenizers library comes in handy, as it can encode pre-tokenized text sequences.
Speed Testing Tokenizers.
Saving and Loading Tokenizer.
I will use the WikiText-103 dataset (181 MB in size). Let's first download it and then unzip it:
So, instead of the input sentence, we will pass the tokenized form of the sentence as input. Here, we have tokenized the sentence based on the spaces between consecutive words:
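With the is_pretokenized argument introduced in v0.8.0, the call might look like this sketch (again using a tiny stand-in vocabulary so the snippet runs on its own; the article's flow uses the full BERT vocabulary):

```python
from tokenizers import BertWordPieceTokenizer

# Tiny stand-in vocabulary so the snippet is self-contained
with open("tiny-vocab.txt", "w") as f:
    f.write("\n".join(["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
                       "never", "give", "up"]))

tokenizer = BertWordPieceTokenizer("tiny-vocab.txt", lowercase=True)

# The input is already split on spaces; is_pretokenized=True tells the
# tokenizer to treat each list element as one pre-tokenized word
encoded_output = tokenizer.encode(["Never", "give", "up"], is_pretokenized=True)
print(encoded_output.tokens)  # ['[CLS]', 'never', 'give', 'up', '[SEP]']
```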
Output: ['[CLS]', 'language', 'is', 'a', 'thing', 'of', 'beauty', '.', 'but', 'mastering', 'a', 'new', 'language', 'from', 'scratch', 'is', 'quite', 'a', 'da', '##unt', '##ing', 'prospect', '.', '[SEP]']

It turns out that this output is identical to the output we got when the input was a raw text string.
The tokenizers library also allows us to easily save our tokenizer as a JSON file and load it for later use. This is convenient for large text datasets: we won't have to initialize the tokenizer again and again.
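A sketch of that save/load round trip (filenames are my own choices, and a tiny stand-in vocabulary keeps it self-contained):

```python
from tokenizers import BertWordPieceTokenizer, Tokenizer

# Self-contained setup with a tiny stand-in vocabulary
with open("tiny-vocab.txt", "w") as f:
    f.write("\n".join(["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
                       "never", "give", "up"]))
tokenizer = BertWordPieceTokenizer("tiny-vocab.txt", lowercase=True)

# One line of code to serialize the whole tokenizer to a JSON file
tokenizer.save("tokenizer.json")

# Load it back later without re-initializing from the vocabulary file
loaded = Tokenizer.from_file("tokenizer.json")
print(loaded.encode("Never give up").tokens)
```

The JSON file captures the full pipeline (normalizer, pre-tokenizer, model, and post-processor), so the loaded tokenizer behaves exactly like the original.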
This is mind-blowing! It took just 218 seconds, or close to 3.5 minutes, to tokenize 1.8 million text sequences. Most other tokenization methods would crash even on Colab.
The unzipped data contains three files – wiki.train.tokens, wiki.test.tokens, and wiki.valid.tokens. We will use the wiki.train.tokens file only for benchmarking:
There are close to 2 million sequences of text in the train set. Let's see how the tokenizers library handles this huge dataset.
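A benchmarking sketch along these lines, with a small synthetic corpus standing in for the lines of wiki.train.tokens and a tiny stand-in vocabulary so it runs on its own (the reported timing in the article comes from the real 1.8-million-line file):

```python
import time
from tokenizers import BertWordPieceTokenizer

# Tiny stand-in vocabulary; the real benchmark uses bert-base-uncased-vocab.txt
with open("tiny-vocab.txt", "w") as f:
    f.write("\n".join(["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
                       "never", "give", "up"]))
tokenizer = BertWordPieceTokenizer("tiny-vocab.txt", lowercase=True)

# Synthetic stand-in for the lines read from wiki.train.tokens
lines = ["never give up"] * 10_000

start = time.time()
encodings = tokenizer.encode_batch(lines)  # tokenize all sequences in one call
print(f"Tokenized {len(encodings)} sequences in {time.time() - start:.2f} s")
```

encode_batch processes the sequences in parallel under the hood, which is where much of the library's speed comes from.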
Go ahead, try it out, and let me know your experience using Hugging Face's Tokenizers NLP library!