Tokenizers, or how do machines read?

Statistics and Data Science

1 May 2020

15 May 2020

8 May 2020

This article is based on https://blog.floydhub.com/tokenization-nlp/

The past year has been huge for natural language processing (NLP). Neural networks can now be deployed faster thanks to optimized libraries and efficient hardware. However, tokenization remains one of the bottlenecks in modern deep learning-based NLP pipelines, in particular the lack of implementations that are both comprehensive and framework-independent.

The world of deep learning and NLP is evolving at a rapid pace. The two most prominent developments are the Transformer architecture (2017) and the BERT language model (2018), the best-known model built on that architecture. Together, these two developments have improved machine performance on a wide range of language reading tasks.

Why is reading difficult for machines?

People can understand language before they learn to read. Machines don't have that phonetic starting point. They know nothing about language, so you have to develop systems that let them process text without the human ability to associate sounds with the meaning of words.

How can machines start processing text if they know nothing about grammar, sounds, words, or sentences? You can create rules that tell the machine how to process the text so that it can perform a dictionary-style lookup. However, in this scenario the machine is not learning anything. You would need a static data set covering every possible combination of words and all their grammatical variants.

Instead of training the machine to look things up in fixed dictionaries, you need to teach it to recognize and "read" text in such a way that it can learn from that action. In other words, the more it reads, the more it learns. People do this by building on the phonetic sounds they learned earlier. Machines have no such knowledge to draw on, so they need to be told how to divide the text into standard units in order to process it. They do this using a system called "tokenization," in which sequences of text are broken down into smaller chunks, or "tokens," and then fed as input into an NLP deep learning model such as BERT. But before we look at the different ways to tokenize text, let's first see if we need to use tokenization at all.
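To make the idea of "tokens fed as input" concrete, here is a minimal sketch of word-level tokenization in Python. The function names, the toy corpus, and the "[UNK]" placeholder for unknown tokens are assumptions chosen for illustration; they are not taken from any particular library.

```python
import re

def tokenize(text):
    """Split text into lowercase word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(corpus):
    """Give every distinct token an integer id; id 0 is reserved for unknown tokens."""
    vocab = {"[UNK]": 0}
    for sentence in corpus:
        for token in tokenize(sentence):
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab):
    """Map each token to its id, falling back to the [UNK] id for unseen tokens."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokenize(text)]

corpus = ["Machines read text.", "Machines learn from text."]
vocab = build_vocab(corpus)
ids = encode("Machines read books.", vocab)
print(ids)  # the word "books" never appeared in the corpus, so it maps to id 0
```

A model only ever sees the integer ids. Any word missing from the vocabulary collapses into the same unknown id, which is the weakness that subword methods try to address.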

Do we need Tokenizers?

To teach a deep learning model such as BERT or GPT-2 to perform well on natural language processing tasks, we need to give it a lot of text. Given a suitable architecture, the model will learn some level of syntactic and semantic understanding. How much semantic understanding these models actually acquire remains an area of active research. They are thought to learn syntactic knowledge in the lower layers of the neural network and semantic knowledge in the higher layers, as they begin to pick up signals specific to a language domain, e.g., medical versus technical texts.

The specific type of architecture used has a significant impact on the tasks the model can handle, how quickly it can learn, and how well it performs. GPT-2, for example, uses a decoder architecture because its job is to predict the next word in a sequence. In contrast, BERT uses an encoder architecture because it is trained for a wider range of NLP tasks, such as next-sentence prediction, question answering, and classification. No matter how they are designed, all of these models need to receive text through their input layers to do any kind of learning.

Open-source tokenizers have been released to provide fast, state-of-the-art, and easy-to-use tokenization that works well with modern NLP pipelines. As the name suggests, Tokenizers is an implementation of today's most popular tokenizers, with an emphasis on performance and versatility. The implementation consists of a pipeline of processes, each applying a different transformation to the text.

Subword tokenization

Classical word-level representations do not handle rare words well. Character embeddings are one way to overcome the out-of-vocabulary problem. However, characters may be too fine-grained, so important information can be lost. A subword sits between the word and the character. It is not too fine-grained, yet it can handle both unseen and rare words.
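The difference in granularity can be illustrated with a small sketch. The piece list below is hand-picked for the example, and the greedy longest-match split is only one simple way to apply a subword vocabulary; real tokenizers learn their pieces from data.

```python
def split_subwords(word, pieces):
    """Greedy longest-match split of one word using a given set of subword pieces."""
    out, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in pieces:
                out.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            out.append(word[start])
            start += 1
    return out

# Hypothetical subword vocabulary, chosen by hand for illustration.
pieces = {"token", "ization", "un", "read", "able"}

word = "tokenization"
char_tokens = list(word)                      # character level: 12 very fine-grained tokens
subword_tokens = split_subwords(word, pieces)  # subword level: a few meaningful pieces
print(char_tokens)
print(subword_tokens)
print(split_subwords("unreadable", pieces))
```

Even though neither "tokenization" nor "unreadable" is in the vocabulary as a whole word, both can be represented by a short sequence of known pieces instead of an unknown-word placeholder.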

Byte pair encoding (BPE)

One popular subword tokenization algorithm that follows the above approach is BPE. BPE was originally used to compress data by finding common combinations of byte pairs. It can also be applied to NLP to find an efficient way to represent text. In short, BPE starts from individual characters and repeatedly merges the most frequent pair of adjacent symbols into a new token until the vocabulary reaches the desired size.
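The merge procedure can be sketched in a few lines of Python. This is a simplified illustration, not a production tokenizer: the word frequencies are made up, words are treated as already split on whitespace, and end-of-word markers are omitted.

```python
from collections import Counter

def pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(word_freqs, num_merges):
    """Start from characters and apply the most frequent pair merge `num_merges` times."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

# Made-up word frequencies for illustration.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, words = learn_bpe(word_freqs, 3)
print(merges)
print(words)
```

With these counts, the frequent ending shared by "newest" and "widest" is merged into a single "est" token within the first two merges, which shows how BPE discovers common word parts purely from pair frequencies.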

Unigram subword tokenization

Using subword pattern frequencies for tokenization can result in ambiguous final encodings. The problem is that we have no way of predicting which particular token is likely to be the best one when encoding new input text. Fortunately, predicting the most likely sequence of text is not a problem unique to tokenization: it is what language models do, and we can use this knowledge to build a better tokenizer. The simplest approach is a unigram model, which considers only the probability of the current token.

The unigram approach differs from BPE in that it attempts to select the most likely option instead of the best option at each iteration. To generate the unigram subword token set, you must first define the desired final token set size, as well as an initial seed set of subword tokens. You can choose the seed set in a similar way to BPE, by selecting the most frequently occurring substrings.
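As a sketch of what "most likely" means here, the snippet below finds the segmentation of a word whose pieces have the highest combined probability under a unigram model, using a Viterbi-style search. It covers only the encoding step, not the training that builds and prunes the token set, and the probabilities are made-up values for illustration.

```python
import math

def best_segmentation(text, probs):
    """Find the split of `text` that maximizes the product of piece probabilities."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best log-probability of text[:i]
    best_score[0] = 0.0
    back = [0] * (n + 1)                # start index of the last piece in the best split
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in probs and best_score[start] > -math.inf:
                score = best_score[start] + math.log(probs[piece])
                if score > best_score[end]:
                    best_score[end] = score
                    back[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1], best_score[n]

# Hypothetical unigram probabilities for a tiny subword vocabulary.
probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.10,
         "hu": 0.10, "ug": 0.15, "hug": 0.20, "gs": 0.05}

pieces, log_prob = best_segmentation("hugs", probs)
print(pieces, math.exp(log_prob))
```

Several splits of "hugs" are possible with this vocabulary, for example "hu" + "gs" or "h" + "ug" + "s", but "hug" + "s" has the highest probability, so it is the one selected.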

WordPiece

The world of subword tokenization, like the world of deep learning NLP, is evolving rapidly. So when BERT was released in 2018, it came with yet another subword algorithm, called WordPiece. On first reading, you might think you're back to square one and need to learn a different subword model. However, WordPiece turns out to be very similar to BPE. Think of WordPiece as an intermediary between the BPE approach and the unigram approach. BPE looks at pairs of adjacent tokens, counts how often each pair occurs, and then merges the pair with the highest total frequency count. It only takes into account the most common pair at each step, nothing more.

WordPiece's general approach is similar to BPE, but, like the unigram approach, it relies on likelihood rather than raw frequency to decide which tokens to merge.
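The difference between the two merge criteria can be sketched as follows. The scoring rule used here, pair frequency divided by the product of the two individual symbol frequencies, is how the WordPiece criterion is commonly described; treat it as an assumption rather than the exact rule of the original implementation. The word counts are made up, and the "##" continuation prefix that BERT vocabularies use is left out for brevity.

```python
from collections import Counter

def merge_candidates(words):
    """Return raw pair counts (BPE criterion) and likelihood-style scores (WordPiece-like criterion)."""
    symbol_counts, pair_counts = Counter(), Counter()
    for symbols, freq in words.items():
        for symbol in symbols:
            symbol_counts[symbol] += freq
        for pair in zip(symbols, symbols[1:]):
            pair_counts[pair] += freq
    # Assumed scoring rule: count(a, b) / (count(a) * count(b)).
    scores = {
        pair: count / (symbol_counts[pair[0]] * symbol_counts[pair[1]])
        for pair, count in pair_counts.items()
    }
    return pair_counts, scores

# Made-up corpus: each word is already split into characters.
words = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}

pair_counts, scores = merge_candidates(words)
bpe_choice = max(pair_counts, key=pair_counts.get)
wordpiece_choice = max(scores, key=scores.get)
print("BPE would merge:", bpe_choice)
print("WordPiece-style score would merge:", wordpiece_choice)
```

On this toy corpus the frequency criterion picks ("u", "g"), the most common pair, while the likelihood-style score picks ("g", "s"): that pair is rarer, but "s" almost never appears without a preceding "g", so merging it is more informative.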

SentencePiece

SentencePiece essentially tries to bring all subword tokenization tools and techniques under one banner. It's a bit like a Swiss Army knife for subword tokenization.

Issues that SentencePiece addresses:

All other models assume that the input data is already tokenized: BPE and unigram are great models, but they share one big flaw: both need their input to be tokenized already. SentencePiece solves this problem by taking raw text as input and doing everything needed for subword tokenization from there.
Language agnostic: Since all other subword algorithms require pre-tokenized input, their applicability to many languages is limited. You need to create separate pre-tokenization rules for each language before you can feed text to your model. It gets very messy very quickly.
Decoding is hard: Another problem caused by models like BPE and unigram requiring pre-tokenized input is that you don't know which pre-tokenization rules were used. This creates problems when trying to reproduce or confirm results.
No comprehensive solution: These are just some of the problems, which means that BPE and unigram are not complete, end-to-end solutions. You can't just plug in raw input and get a result. Instead, they are just one part of the solution. SentencePiece brings together everything you need for a comprehensive solution in one neat package.

How does SentencePiece solve all these problems?

SentencePiece uses a number of features that address all of the above issues. They are described in detail in both the accompanying paper and the corresponding GitHub repository. Both are great resources, but if you're short on time, browsing the repository may be the best way to get a quick overview of SentencePiece and all its Swiss Army knife greatness.

For now, it will suffice to highlight some of the techniques SentencePiece uses to address the above shortcomings.
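One such technique is to treat whitespace as an ordinary symbol: SentencePiece replaces spaces with the meta symbol "▁" (U+2581) before segmenting, so the original text can be recovered exactly by concatenating the pieces. The sketch below imitates that idea in plain Python. It does not use the SentencePiece library, and toy_segment is a hypothetical stand-in for a trained subword model.

```python
WS = "\u2581"  # the meta symbol that stands in for a space

def escape(text):
    """Turn spaces into a visible symbol, marking the start of the text as well."""
    return WS + text.replace(" ", WS)

def toy_segment(escaped, size=3):
    """Hypothetical stand-in for a trained model: cut the string into fixed-size pieces."""
    return [escaped[i:i + size] for i in range(0, len(escaped), size)]

def detokenize(pieces):
    """Join the pieces, restore the spaces, and drop the leading marker."""
    return "".join(pieces).replace(WS, " ")[1:]

text = "Hello world."
pieces = toy_segment(escape(text))
print(pieces)
print(detokenize(pieces))
```

Because the spaces are stored inside the pieces themselves, decoding does not depend on any language-specific rules about where spaces belong. This works no matter how the pieces were cut, which is why the same approach applies to languages that do not separate words with spaces.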

With the rapid development of deep learning, it is easy to look only at the core architecture of models such as BERT and XLNet. However, the format in which these models receive text is critical to what they learn. Understanding the basics of subword tokenizers will allow you to quickly get up to speed with the latest innovations in the field. It will also help if you are trying to build these networks yourself, or are dealing with multilingual issues in traditional NLP tasks. Either way, knowing about models like SentencePiece will be useful if you are interested in deep learning NLP.

greenlogic