NLP using TensorFlow: A Small Guide


Natural Language Processing (NLP) is now one of the most important machine learning skills for an AI/ML practitioner to possess. NLP is widely used today, in applications such as text reading, number plate reading, email spam filtering, predictive text and so on. In this blog we are going to look at some key things to know in order to implement NLP using TensorFlow.

Introduction:

So let's first understand what NLP is.

Natural language processing is a sub-field of linguistics, computer science, and artificial intelligence. It is concerned with the interactions between computers and human language, in particular how to program computers to process and analyse large amounts of natural language data. It strives to build machines that understand text or voice data and respond with text or speech of their own, the same way humans do.

NLP with TensorFlow

TensorFlow gives you invaluable tools to work with the immense volume of unstructured data in today's data streams, and to apply these tools to specific NLP tasks. In TensorFlow there are a few important areas which we need to know in order to build a model. These areas are tokenization and sequencing. So let's understand them.

Tokenization:

Tokenization helps us represent words in a way that a computer can process, so that we can later build a neural network that can understand them. Tokenization is the process of splitting text into smaller units such as sentences, words or subwords. Let's see how we can process a text corpus by tokenizing it into words:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

# Initialize the list of sentences
sentences = [
    'Life is so beautiful',
    'Hope keeps us going',
    'Let us celebrate life!'
]

# Instantiate the tokenizer and call the fit_on_texts method
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)

# word_index maps each word to its integer index
word_index = tokenizer.word_index
print(word_index)

When you run the above code, it will print output like this:

{'life': 1, 'us': 2, 'is': 3, 'so': 4, 'beautiful': 5, 'hope': 6, 'keeps': 7, 'going': 8, 'let': 9, 'celebrate': 10}

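As we can see, each unique word in the sentences is now mapped to an integer index. The words are lowercased, punctuation is stripped, and the most frequent words (here 'life' and 'us') get the lowest indexes.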
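
To make the indexing rule concrete, here is a minimal pure-Python sketch of word-level tokenization. It is not the Keras implementation: build_word_index is a hypothetical helper, and it assumes lowercasing, replacing ASCII punctuation with spaces (close to, but not identical to, the Keras default filter), and ranking words by frequency with ties broken by first appearance.

```python
import string
from collections import Counter


def build_word_index(texts):
    """Map each unique word to an integer index, most frequent word first.

    Hypothetical helper for illustration only; not part of TensorFlow.
    """
    # Assumption: every ASCII punctuation character is replaced by a space.
    space_out = str.maketrans({ch: " " for ch in string.punctuation})
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(space_out).split())
    # most_common() sorts by count and keeps first-seen order for ties.
    # Indexes start at 1, which leaves 0 free for padding.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


sentences = [
    'Life is so beautiful',
    'Hope keeps us going',
    'Let us celebrate life!'
]
word_index = build_word_index(sentences)
print(word_index)
```

For the three example sentences this sketch gives 'life' and 'us' the indexes 1 and 2 because each appears twice, and numbers the remaining words in the order they first appear.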

Understanding Sequencing:

Next, we shall build on the tokenized text, using the generated tokens to convert each sentence into a sequence of integers. We can get the sequences by calling the texts_to_sequences method.

sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)

# Output: [[1, 3, 4, 5], [6, 7, 2, 8], [9, 2, 10, 1]]
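
The lookup that texts_to_sequences performs can be sketched in plain Python. This is an illustrative approximation, not the Keras source: to_sequences is a hypothetical helper, the word_index dictionary is copied from the tokenizer output shown earlier, and words missing from the vocabulary are simply skipped, which is how the Keras Tokenizer behaves when no oov_token is set.

```python
import string

# Vocabulary copied from the tokenizer output shown earlier.
word_index = {'life': 1, 'us': 2, 'is': 3, 'so': 4, 'beautiful': 5,
              'hope': 6, 'keeps': 7, 'going': 8, 'let': 9, 'celebrate': 10}


def to_sequences(texts, word_index):
    """Replace each known word with its index (hypothetical helper)."""
    # Assumption: punctuation is replaced by spaces before splitting.
    space_out = str.maketrans({ch: " " for ch in string.punctuation})
    sequences = []
    for text in texts:
        words = text.lower().translate(space_out).split()
        # Words that are not in the vocabulary are dropped.
        sequences.append([word_index[w] for w in words if w in word_index])
    return sequences


sentences = [
    'Life is so beautiful',
    'Hope keeps us going',
    'Let us celebrate life!'
]
print(to_sequences(sentences, word_index))
# [[1, 3, 4, 5], [6, 7, 2, 8], [9, 2, 10, 1]]

# 'a' and 'celebration' were never seen during fitting, so they vanish.
print(to_sequences(['Life is a celebration'], word_index))
# [[1, 3]]
```

The second call shows why unseen words matter: a new sentence can lose words when it is converted, which is worth keeping in mind when the training vocabulary is small.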

#import pad_sequences function to pad our sequences
from tensorflow.keras.preprocessing.sequence import pad_sequences
padded = pad_sequences(sequences)
print(padded)

Output would be:

[[ 1  3  4  5]
 [ 6  7  2  8]
 [ 9  2 10  1]]

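By default, the length of each padded sequence equals the length of the longest sentence. Here all three sentences have four words, so nothing is actually padded. We can also set the length explicitly with the maxlen argument: shorter sequences are then padded with zeros at the front, and longer ones are truncated. For example:

padded = pad_sequences(sequences, maxlen=5)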
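
What pad_sequences does with maxlen can be sketched in plain Python. The pad function below is a hypothetical stand-in, not the Keras implementation. It assumes the Keras defaults of padding with 0 at the front and, for sequences longer than maxlen, dropping values from the front.

```python
def pad(sequences, maxlen=None, value=0):
    """Pad or truncate every sequence to the same length (hypothetical helper).

    Assumes 'pre' padding and 'pre' truncating, the Keras defaults.
    """
    if maxlen is None:
        # Default: use the length of the longest sequence.
        maxlen = max((len(seq) for seq in sequences), default=0)
    padded = []
    for seq in sequences:
        # Keep only the last maxlen items (truncate from the front).
        kept = seq[max(len(seq) - maxlen, 0):]
        # Fill the remaining positions at the front with the padding value.
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded


sequences = [[1, 3, 4, 5], [6, 7, 2, 8], [9, 2, 10, 1]]

print(pad(sequences, maxlen=5))
# [[0, 1, 3, 4, 5], [0, 6, 7, 2, 8], [0, 9, 2, 10, 1]]

print(pad(sequences, maxlen=3))
# [[3, 4, 5], [7, 2, 8], [2, 10, 1]]

print(pad([[1, 3], [6, 7, 2, 8]]))
# [[0, 0, 1, 3], [6, 7, 2, 8]]
```

With maxlen=5 each four-word sentence gains one leading zero, and with maxlen=3 the first word of each sentence is cut off. The real pad_sequences also accepts padding='post' and truncating='post' to work from the end of the sequence instead.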

Tokenization and sequencing, together with TensorFlow Keras layers such as dense layers, can be used to build a good model.

Conclusion:

NLP has become a very important machine learning practice, and with TensorFlow it has only improved further. We can gain insights into text data and get hands-on experience using those insights to train NLP models. So do your best to understand these NLP practices and develop some good models.

References: