Tokenization is the process of converting raw text, such as a set of paragraphs, into smaller units such as sentences or words. The resulting tokens can be individual words, characters, sentences, or paragraphs. It is one of the first and most important steps in the NLP pipeline, converting unstructured text into a data format that models can work with. In this article we will look at tokenization in a nutshell, along with its types and tools.
Types of Tokenization
1. Word Tokenization
Word tokenization is the most commonly used approach. It splits a section of text into individual words based on certain delimiters, and depending on the delimiter, different word-level tokens are formed. Pre-trained word embeddings such as Word2Vec and GloVe fall under this category.
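To see how the choice of delimiter changes the resulting tokens, here is a minimal plain-Python sketch (the example string is made up for illustration):

```python
# A comma-separated string with no spaces.
text = "red,green,blue"

# Splitting on whitespace (the default) yields a single token,
# because the string contains no spaces.
print(text.split())     # ['red,green,blue']

# Splitting on a comma delimiter yields word-level tokens instead.
print(text.split(","))  # ['red', 'green', 'blue']
```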
2. Sentence Tokenization
Sentence tokenization is the process of splitting text into individual sentences.
Why is Tokenization Important?
Tokenization is the first and foremost step in the NLP pipeline. A tokenizer breaks the data into small chunks that are easy to interpret. The most popular deep learning architectures for NLP, such as RNNs, GRUs, and LSTMs, also process raw text at the token level, which makes tokenization a critical step in modeling text data. Tokenization is performed on the corpus to obtain tokens, which are then used to prepare a vocabulary. A vocabulary is the set of unique tokens in a corpus, and it can be built either from all the individual tokens in the corpus or from the top K most common words.
Creating Vocabulary is the ultimate goal of Tokenization
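The two ways of building a vocabulary described above can be sketched with the standard library alone; the tiny corpus here is an invented example:

```python
from collections import Counter

# A toy corpus of two documents (invented for illustration).
corpus = ["this is a sentence", "this is another sentence"]

# Tokenize every document with simple whitespace splitting.
tokens = [tok for doc in corpus for tok in doc.split()]

# Option 1: the vocabulary is the set of unique tokens in the corpus.
vocabulary = set(tokens)
print(sorted(vocabulary))  # ['a', 'another', 'is', 'sentence', 'this']

# Option 2: keep only the top-K most common tokens.
top_k = [word for word, count in Counter(tokens).most_common(3)]
print(top_k)  # ['this', 'is', 'sentence']
```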
Different Techniques and Tools For Tokenization
There are several ways to tokenize text data, and you can choose a method based on your language, library, and modeling goals. Tokenization can be done with various libraries such as NLTK, spaCy, TextBlob, Keras, and Gensim. Let’s first understand some basic terminology:
Corpus: A corpus is a large collection of linguistic data. Use your corpus to train different models for sentiment analysis, classification, and more.
Tokens: Tokens are the output of tokenization. They are usually words, but can also be characters, subwords, or sentences.
Tokenization with Python
Let’s start with Python’s basic built-in method. You can use the split() method to split a string and return a list where each word is a list item. This is also known as whitespace tokenization, because by default split() uses whitespace as the delimiter, though you can pass a different delimiter if needed.
An example of word-level tokenization:
sentence = "Hello, this is a blog on tokenization."
tokens = sentence.split()
print("Python built-in tokenization:\n", tokens)

# Output
# Python built-in tokenization:
# ['Hello,', 'this', 'is', 'a', 'blog', 'on', 'tokenization.']
The built-in split() method works fine, but it has an issue: punctuation marks are not treated as separate tokens. For example, “Hello,” remains a single token instead of being split into “Hello” and “,”.
For sentence tokenization, we can also use the split() method, this time with a period followed by a space ('. ') as the delimiter, so the text is split at the end of each sentence.
sentence = "This is a learning platform. Which provides knowledge of Tech"
tokens = sentence.split('. ')
print(tokens)

# Output
# ['This is a learning platform', 'Which provides knowledge of Tech']
Tokenization with NLTK
The Natural Language Toolkit (NLTK) is a Python library for natural language processing (NLP). NLTK has modules for both word tokenization and sentence tokenization. First, install the library:
!pip install --user -U nltk
1. An example of the word tokenizer in NLTK
from nltk.tokenize import word_tokenize
# Note: word_tokenize requires the 'punkt' resource,
# downloadable via nltk.download('punkt')

sentence = "This is a *learning platform.s!. Which provides #knowledge of Tech"
tokens = word_tokenize(sentence)
print(tokens)

# Output
# ['This', 'is', 'a', '*', 'learning', 'platform.s', '!', '.', 'Which', 'provides', '#', 'knowledge', 'of', 'Tech']
Note: Punctuation is also treated as a separate token when using the NLTK word tokenizer.
2. An example of the sentence tokenizer in NLTK
from nltk.tokenize import sent_tokenize

sentence = "This is a *learning platform.s!. Which provides #knowledge of ML"
tokens = sent_tokenize(sentence)
print(tokens)

# Output
# ['This is a *learning platform.s!.', 'Which provides #knowledge of ML']
3. Punctuation-based tokenizer
A punctuation-based tokenizer splits the given text on both punctuation and spaces. For example, word_tokenize keeps ‘platform.s’ as a whole token, but the punctuation-based tokenizer splits it into ‘platform’, ‘.’, and ‘s’.
from nltk.tokenize import wordpunct_tokenize

sentence = "This is a *learning platform.s!. Which provides #knowledge of ML"
tokens = wordpunct_tokenize(sentence)
print(tokens)

# Output
# ['This', 'is', 'a', '*', 'learning', 'platform', '.', 's', '!.', 'Which', 'provides', '#', 'knowledge', 'of', 'ML']
Tokenization with TextBlob
TextBlob is an open-source Python library for text processing. TextBlob is faster than NLTK, easier to use, and exposes callable functions. It can be used in simple applications to perform various operations on textual data, such as sentiment analysis, noun phrase extraction, and classification. Let’s start by installing the library:
!pip install textblob
1. Word Tokenizer
For word-level tokenization, use the words attribute, which returns a list of word objects. Note that the TextBlob word tokenizer removes punctuation from the text.
from textblob import TextBlob

sentence = "This is a *learning platform.s!. Which provides #knowledge of ML"
token = TextBlob(sentence)
print(token.words)

# Output
# ['This', 'is', 'a', 'learning', 'platform.s', 'Which', 'provides', 'knowledge', 'of', 'ML']
2. Sentence Tokenizer
from textblob import TextBlob

sentence = "This is a *learning platform.s!. Which provides #knowledge of ML"
token = TextBlob(sentence)
print(token.sentences)

# Output
# [Sentence("This is a *learning platform.s!."), Sentence("Which provides #knowledge of ML")]
Tokenization with RegEx
A regular expression is a sequence of characters that defines a search pattern. With the help of regular expressions, you can match patterns in a string and perform word tokenization, sentence tokenization, or character tokenization. Here is an example of word tokenization with re.findall():
import re

sentence = "This is! a learning platform! Which provides knowledge of ML"
tokens = re.findall(r"[\w]+", sentence)
print(tokens)

# Output
# ['This', 'is', 'a', 'learning', 'platform', 'Which', 'provides', 'knowledge', 'of', 'ML']
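Regular expressions can handle sentence-level tokenization as well. A minimal sketch with re.split(), splitting on sentence-ending punctuation followed by whitespace (the sample text is invented for illustration):

```python
import re

text = "This is a learning platform. Which provides knowledge of ML. Enjoy!"

# Split wherever '.', '!', or '?' is followed by whitespace;
# drop any empty strings produced by the split.
sentences = [s for s in re.split(r"[.!?]\s+", text) if s]
print(sentences)
# ['This is a learning platform', 'Which provides knowledge of ML', 'Enjoy!']
```

Note that the trailing 'Enjoy!' keeps its punctuation, since no whitespace follows it; a more robust splitter such as NLTK's sent_tokenize handles such edge cases better.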
We have explained what tokenization is and why it matters: it is the first and most important step in the NLP pipeline. We also discussed the various libraries used for tokenization and how tokenization is performed at the word and sentence level. After reading this article, you should be able to use these libraries and tools to implement different kinds of tokenization.