Text Data Vectorization Techniques in Natural Language Processing

Features in machine learning algorithms are generally numerical values on which mathematical operations can be performed. Machine learning algorithms cannot work on raw text directly; they need a numerical representation of the text in the form of vectors (or a matrix). To convert textual data into numerical features, we can use the following text vectorization techniques in Natural Language Processing:

  • Bag Of Words (Count Vectorizer)
  • Term Frequency and Inverse Document Frequency (TF-IDF)
  • Word2Vec

Raw text contains numerical values, punctuation, extra spaces and special characters, which can hamper the performance of a model. So it is necessary to pre-process the data first. For that we can use various pre-processing techniques such as:

  • Regular expressions – for removing numerical values, punctuation, special characters etc.
  • Lowercase the text data
  • Tokenization – splitting sentences into individual tokens (words)
  • Removing stopwords from text data. Example – of, in, the etc.
  • Stemming and/or Lemmatization – reducing a word to its stem or to its dictionary form (lemma)
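
The first four steps above can be sketched in plain Python as follows. This is a minimal illustration, not a complete pipeline: the `preprocess` helper and the tiny `STOPWORDS` set are assumptions made for this example, and stemming or lemmatization is left out because it normally relies on a library such as NLTK.

```python
import re

# Tiny illustrative stopword list (an assumption for this sketch; real
# projects usually take a fuller list from an NLP library).
STOPWORDS = {"of", "in", "the", "a", "an", "is", "and", "to"}

def preprocess(text):
    """Clean a raw string and return its remaining tokens."""
    text = text.lower()                     # lowercase the text
    text = re.sub(r"[^a-z\s]", " ", text)   # drop digits, punctuation, special characters
    tokens = text.split()                   # simple whitespace tokenization
    return [tok for tok in tokens if tok not in STOPWORDS]  # remove stopwords

tokens = preprocess("In 2021, the price of Bitcoin rose by 60%!")
print(tokens)  # ['price', 'bitcoin', 'rose', 'by']
```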

After applying these pre-processing techniques, we need to convert the cleaned text into numerical features in order to build our model. This is where text data vectorization techniques come into the picture.

Let’s have a look at each of them in detail:

Bag Of Words

Bag of Words (BOW) is a text vectorization model commonly used for document representation in the field of information retrieval.

The BOW model ignores the word order, grammar and syntax of a document and treats it as an unordered collection of words. The appearance of each word in the document is assumed to be independent of whether other words appear.

Let’s look at the following code :
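
A minimal sketch of such an example, assuming scikit-learn is installed. The three sentences are illustrative (they are an assumption of this sketch) and were picked so that together they contain 12 unique words.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative sentences (an assumption); together they contain 12 unique words.
sentences = [
    "the quick brown fox jumps",
    "the lazy dog sleeps all day",
    "the fox and the dog play",
]

vectorizer = CountVectorizer()               # lowercases and tokenizes by default
bow = vectorizer.fit_transform(sentences)    # sparse matrix, one row per sentence

# vocabulary_ maps each word to its column index; sort the words by that index.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocabulary)
print(bow.toarray())
```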

In the example above, three sentences are taken which contain 12 unique words in all. The order in which the words appear in a sentence is not preserved. The sentences are transformed to vectors using the CountVectorizer() function. Each output vector contains 12 elements, where the i-th element is the number of times the i-th word in the dictionary appears in that sentence.

Imagine a huge document set D with a total of M documents. The unique words across all documents are extracted to form a vocabulary of N words. In the Bag of Words model, each document is then represented as an N-dimensional vector, so the whole set becomes an M × N matrix.
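
The same idea can be written out by hand to make M and N concrete. The sketch below uses a hypothetical toy corpus of M = 3 pre-tokenized documents; the names `documents`, `index` and `to_vector` are chosen for this illustration only.

```python
# Hypothetical toy corpus: M = 3 documents, already tokenized.
documents = [
    ["cat", "sat", "mat"],
    ["dog", "sat", "log"],
    ["cat", "cat", "dog"],
]

# The N unique words, kept in a fixed (sorted) order, form the vocabulary.
vocabulary = sorted({word for doc in documents for word in doc})
index = {word: i for i, word in enumerate(vocabulary)}

def to_vector(doc):
    """Return the N-dimensional count vector for one document."""
    vector = [0] * len(vocabulary)
    for word in doc:
        vector[index[word]] += 1
    return vector

matrix = [to_vector(doc) for doc in documents]   # M rows, N columns
print(vocabulary)  # ['cat', 'dog', 'log', 'mat', 'sat']
print(matrix[2])   # [2, 1, 0, 0, 0]
```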

The BOW model can be considered as a statistical histogram. It is used in text retrieval, document classification and processing applications.

TF-IDF (Term Frequency – Inverse Document Frequency)

Another popular text vectorization technique for extracting features from data is TF-IDF. TF-IDF is a numerical statistic used to measure how relevant a word is to a document that is part of a larger collection of documents (a corpus).

The two metrics TF and IDF are as follows:

Term Frequency (TF) – In TF, we give each word or token a score based on how often it occurs in a document. The raw frequency of a word depends on the length of the document: a word is likely to occur more times in a long document than in a short or medium-sized one.

To overcome this problem, we divide the frequency of a word by the length of the document (its total number of words) to normalize it. Using this technique, we create a matrix, usually sparse, holding the normalized frequency of every word in each document.

TF = number of times the term occurs in a document / total number of words in the document

Inverse Document Frequency (IDF) – It is a measure of the general importance of a word. The main idea is that the fewer documents contain a term t, the larger its IDF, which means the term is better at distinguishing one document or category from another. The IDF of a specific word is calculated by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of that quotient.

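IDF = log (total number of documents / number of documents containing term t)

The natural logarithm (base e) is a common choice. The base only rescales the scores, and the worked example below uses base 10.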

The formula to calculate the complete TF-IDF value is the product of the two:

TF-IDF = TF * IDF
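
The TF and IDF definitions above, and their product, can be sketched in plain Python. This is a minimal illustration using the natural logarithm; the toy `corpus` and the function names `tf`, `idf` and `tf_idf` are assumptions made for this example.

```python
import math

# Hypothetical tokenized corpus used only for illustration.
corpus = [
    ["cat", "sat", "on", "mat"],
    ["dog", "sat", "on", "log"],
    ["cat", "chased", "dog"],
]

def tf(term, doc):
    """Occurrences of the term divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Natural log of (number of documents / documents containing the term)."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)   # assumes the term occurs somewhere

def tf_idf(term, doc, docs):
    """TF-IDF weight of a term in one document of the corpus."""
    return tf(term, doc) * idf(term, docs)

print(tf("cat", corpus[0]))            # 0.25
print(round(idf("mat", corpus), 3))    # 1.099 (appears in 1 of 3 documents)
print(round(idf("sat", corpus), 3))    # 0.405 (appears in 2 of 3 documents)
```

In the first document, "mat" and "sat" have the same term frequency, but "mat" receives the higher TF-IDF weight because it appears in fewer documents.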

Example:

Consider a document containing 100 words in which the word cat appears 3 times. The term frequency (TF) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these.

Then, the inverse document frequency (IDF), using a base-10 logarithm, is log10(10,000,000 / 1,000) = 4.

Thus, the TF-IDF weight is the product of these quantities: 0.03 * 4 = 0.12.

In code, the TF-IDF of three sentences can be computed and converted to vectors using TfidfVectorizer(). The TF-IDF value of a word increases with its frequency in a document and decreases with the number of documents that contain it. As with Bag of Words, this technique does not capture the semantic meaning of words.
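
A minimal sketch of this step, assuming scikit-learn is installed and using three illustrative sentences (an assumption of this example). Note that with its default settings TfidfVectorizer applies a smoothed IDF and L2-normalizes each row, so the numbers will not match the plain formula given earlier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative sentences (an assumption); together they contain 12 unique words.
sentences = [
    "the quick brown fox jumps",
    "the lazy dog sleeps all day",
    "the fox and the dog play",
]

vectorizer = TfidfVectorizer()                 # default: smoothed IDF, L2-normalized rows
tfidf = vectorizer.fit_transform(sentences)    # sparse matrix, one row per sentence

# vocabulary_ maps each word to its column index; sort the words by that index.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Print the non-zero weights of the first sentence.
for term, weight in zip(terms, tfidf.toarray()[0]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

In the first sentence, "the" gets a lower weight than "quick": both occur once there, but "the" appears in every sentence.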

TF-IDF applications

  1. Search Engine
  2. Keyword Extraction
  3. Text Similarity
  4. Text Summarization

Word2Vec

With the Bag of Words and TF-IDF techniques we do not get the semantic meaning of words. But many NLP tasks, such as sentiment classification and sarcasm detection, require the semantic meaning of a word and its semantic relationships with other words.

Word embeddings capture semantic and syntactic relationships between words, as well as the context of words in a document. Word2vec is a technique used to learn word embeddings.

A Word2vec model takes a large corpus as input and produces a vector space as output, typically with several hundred dimensions. Each unique word in the corpus is assigned a vector in this space. Words that share common contexts in the corpus are placed close to each other, so the position of a word vector reflects the contexts in which the word is used.
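
Closeness between word vectors is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration (they are not the output of any trained model) to show how related words would score higher than unrelated ones.

```python
import math

# Made-up 3-dimensional vectors, purely for illustration; real Word2vec
# vectors are learned from a corpus and have far more dimensions.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(cosine_similarity(vectors["king"], vectors["queen"]), 3))  # 0.993
print(round(cosine_similarity(vectors["king"], vectors["apple"]), 3))  # 0.303
```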

The Word2vec method learns these relationships between words while building a model. For this purpose Word2vec uses two methods:

  1. Skip-gram
  2. CBOW (Continuous Bag of Words)

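The Word2vec model captures relationships between words with the help of a context window in both the skip-gram and CBOW methods. The window size sets how many neighbouring words on each side of a center word are treated as its context, similar in spirit to n-grams, where we create sequences of n words.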

Skip-gram

The skip-gram method takes the center word of the window as input and the context words (neighbour words) as outputs. In other words, the model is trained to predict the context words of a center word. Skip-gram works well with a small dataset and represents rare words well.
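
To make the input and output concrete, here is a minimal sketch of how (center word, context word) training pairs could be generated for skip-gram. It only builds the pairs and does not train a model; the helper name `skipgram_pairs` is an assumption of this example.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every position in the sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)                 # left edge of the window
        end = min(len(tokens), i + window + 1)     # right edge (exclusive)
        for j in range(start, end):
            if j != i:                             # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["the", "quick", "brown", "fox"]
print(skipgram_pairs(tokens, window=1))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```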

Continuous Bag-of-Words (CBOW)

CBOW is the reverse of the skip-gram method. Here we take the context words as input and predict the center word within the window. Another difference from skip-gram is that CBOW trains faster and gives better representations for the most frequent words.
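
The reversed direction can be seen in how the training examples are formed. The sketch below builds (context words, center word) examples without training anything; the helper name `cbow_examples` is an assumption of this example.

```python
def cbow_examples(tokens, window=2):
    """Return (context_words, center) examples for every position."""
    examples = []
    for i, center in enumerate(tokens):
        # Words to the left and to the right of the center word.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, center))
    return examples

tokens = ["the", "quick", "brown", "fox"]
for context, center in cbow_examples(tokens, window=1):
    print(context, "->", center)
# ['quick'] -> the
# ['the', 'brown'] -> quick
# ['quick', 'fox'] -> brown
# ['brown'] -> fox
```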

Let’s look at the implementation using CBOW :

Word2Vec has its applications in knowledge discovery and recommendation systems.

Conclusion:

We can use any one of these text feature extraction techniques based on our project requirements, because every method has its advantages: Bag of Words is suitable for text classification, TF-IDF for document classification, and if you want semantic relations between words then go with Word2vec.

We can’t say in advance which type of feature extraction will give better results. One more thing: building word embeddings from our own dataset or corpus tends to give better results, but we don’t always have a large enough dataset, so in that case we can use pre-trained models with transfer learning.
