MachineX: Ultimate guide to NLP (Part 1)


In this blog, we will look at some basic text operations in NLP that help solve a range of problems.

This blog is part of the series Ultimate guide to NLP. This part focuses on basic text pre-processing techniques.

Some of the major areas that we will cover in this series include the following:

  1. Text Pre-Processing
  2. Understanding of Text & Feature Engineering
  3. Supervised Learning Models for Text Data
  4. Unsupervised Learning Models for Text Data

Introduction to NLP

We all know that computers are really good at learning from spreadsheets of data filled with numbers, but we humans communicate with words, not with numbers.

A lot of information in the world is unstructured — raw text in English or another human language. How can we get a computer to understand the unstructured text and extract data from it?

Natural Language Processing, or NLP, is a subfield of Artificial Intelligence (AI) that gives machines the ability to read, understand, and derive meaning from human languages. Its ultimate objective is to read, decipher, understand, and make sense of human languages in a manner that is valuable.

Applications of NLP

NLP has many applications nowadays, for example:

  • Sentiment classification
  • Topic Extraction
  • Search Engine
  • Entity Extraction
  • Autocomplete
  • Speech to text
  • Review Rating Prediction
  • Translation
  • Question Answering

and so on.

Nowadays, most of us have smartphones with speech recognition, which uses NLP to understand what the user is saying. Many people also use laptops whose operating system has built-in speech recognition.

Some examples:


Siri is the virtual assistant in Apple Inc.’s iOS, watchOS, macOS, HomePod, and tvOS operating systems. You can do a lot of things with voice commands: start a call, text someone, send an email, set a timer, take a picture, open an app, set an alarm, use navigation, and so on.

Whenever we receive an email in Gmail, it scans the content and heading of the email to decide whether it is spam. It uses a machine learning model, trained on features such as the topic of the email, the sender's address, the content, and many more, to predict how likely the email is to be spam.

Pre-processing the data

It is very important to pre-process text data, because it comes in many formats and often contains a lot of unwanted information. For example:

Can you he.lp me with loan? 🙂

So, the text might contain abbreviations, unintentional characters, symbols, or emojis, which can confuse our NLP model when it makes a prediction or a decision. That’s why it is essential to process our text data before feeding it to our NLP or machine learning model.

There are many ways to process text data. We are going to discuss the following:

  • Removing weird spaces
  • Tokenization
  • Removing stopwords
  • Contraction mapping
  • Stemming
  • Emoji handling
  • Cleaning HTML

Removing weird spaces

Removing weird spaces is a big challenge whenever we extract text from a PDF or read it from other file formats. The data often comes with unwanted spaces because of formatting issues. In Python, we can remove them like this:

sample_text = "I want to remove spaces "

def remove_space(text):
    # Trim both ends, then collapse every run of whitespace into one space
    text = text.strip()
    text = text.split()
    return " ".join(text)

remove_space(sample_text)


‘I want to remove spaces’
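Text extracted from PDFs can also carry invisible characters that split() does not treat as whitespace. Here is a minimal sketch that handles those too. The name normalize_whitespace and the short list of zero-width characters are assumptions made for this example, not an exhaustive treatment:

```python
import re

# Invisible characters that str.split() does not treat as whitespace:
# zero-width space (U+200B) and byte order mark (U+FEFF).
# This short list is an illustrative assumption, not a complete one.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\ufeff"))


def normalize_whitespace(text):
    """Drop zero-width characters, then collapse runs of whitespace."""
    text = text.translate(ZERO_WIDTH)  # mapping to None deletes the character
    # \s also matches tabs, newlines and the non-breaking space (U+00A0)
    return re.sub(r"\s+", " ", text).strip()


sample = "I\u200b want\u00a0to \t remove\n spaces \ufeff"
print(normalize_whitespace(sample))  # I want to remove spaces
```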


Tokenization

Tokenization is also an important step of data pre-processing. It means converting the text into tokens (words and punctuation marks). In English and many other languages that use some form of the Latin alphabet, the space is a good approximation of a word divider.

However, splitting only on spaces can still cause problems. Some English compound nouns are written in more than one way, and sometimes they contain a space. In most cases we use a library to get the wanted result, so don’t worry too much about the details.


import nltk
nltk.download('punkt')  # first use only; newer NLTK releases may ask for 'punkt_tab'
from nltk.tokenize import word_tokenize

text = "hello, how are you?"
tokens = word_tokenize(text)
print(tokens)


['hello', ',', 'how', 'are', 'you', '?']
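To see why splitting on spaces alone is not enough, here is a small standard-library sketch that compares split() with a regular-expression tokenizer. The name simple_tokenize and the pattern are assumptions for illustration. This is not how NLTK works internally, and NLTK will split some inputs (contractions, for example) differently:

```python
import re

# A word (optionally with internal apostrophes or hyphens), or any single
# character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")


def simple_tokenize(text):
    """Rough tokenizer: separates punctuation from the words around it."""
    return TOKEN_RE.findall(text)


text = "hello, how are you?"
print(text.split())           # ['hello,', 'how', 'are', 'you?']
print(simple_tokenize(text))  # ['hello', ',', 'how', 'are', 'you', '?']
```

With split(), the comma and the question mark stay glued to their words, so "you?" and "you" would count as different tokens.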

Stop words

Stopwords are the most common words in any natural language. For the purpose of analyzing text data and building NLP models, these stopwords might not add much value to the meaning of the document. These words can add a lot of noise. That’s why we want to remove these irrelevant words.

For example, let's look at this sentence:

“There is a notebook on the computer desk”.

You can see that this statement contains words like “is”, “a”, “on”, and “the”, which add no meaning to the statement while parsing it. On the other hand, words like “notebook”, “computer”, and “desk” are the keywords and tell us what the statement is all about.

So what is the solution? NLTK has a predefined list of stopwords that covers the most common words. If you are using it for the first time, you need to download the list with nltk.download("stopwords"). Once the download completes, we can import the stopwords package from nltk.corpus and use it to load the stop words:

from nltk.corpus import stopwords
print(stopwords.words("english"))


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

In the first line, we import the stopwords package from nltk.corpus, and then use it to load the stop words. The output is the list of stopwords predefined in NLTK’s stopwords corpus.

Note: if you are using it for the first time, download the stop words first with nltk.download("stopwords").

Let's see how to remove stopwords from a sentence.

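import nltk
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
sentence = "We are one of the organization using tensorflow heavily."
words = nltk.word_tokenize(sentence)
# The comparison is case-sensitive, so the capitalized "We" is kept
without_stop_words = [word for word in words if word not in stop_words]
print(without_stop_words)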


['We', 'one', 'organization', 'using', 'tensorflow', 'heavily', '.']
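Notice that "We" survived, because the stop list is lowercase and the comparison is case-sensitive. Here is a minimal sketch of a case-insensitive filter. The tiny STOP_WORDS set is a hand-picked stand-in so the example runs without any download. In practice you would pass set(stopwords.words("english")) instead:

```python
# Hand-picked stand-in for NLTK's English stop list (an assumption made
# so this sketch runs without downloading anything).
STOP_WORDS = {"we", "are", "of", "the", "is", "a", "on"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


tokens = ["We", "are", "one", "of", "the", "organization",
          "using", "tensorflow", "heavily", "."]
print(remove_stop_words(tokens))
# ['one', 'organization', 'using', 'tensorflow', 'heavily', '.']
```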

Contraction mapping

Contractions are shortened forms that we write with an apostrophe, such as “ain’t” or “aren’t”. Since we want to standardize our text, it makes sense to expand these contractions. A model may be trained on standardized data, but in production users or customers can send text full of contractions.

To expand contractions, we first have to define a dictionary that contains all the word mappings, like this:

contraction = {
    "'cause": 'because',
    ',cause': 'because',
    ';cause': 'because',
    "ain't": 'am not',
    'ain,t': 'am not',
    'ain;t': 'am not',
    'ain´t': 'am not',
    'ain’t': 'am not',
    "aren't": 'are not',
    'aren,t': 'are not',
    'aren;t': 'are not',
    'aren´t': 'are not',
    'aren’t': 'are not'
}

So here, we have defined some of our contraction mappings. You can define many more according to your use case.

We can expand the contractions in our sentences using this dictionary like this:

def mapping_replacer(x, dic):
    # Replace each key that appears as a space-delimited word
    for word in dic.keys():
        if " " + word + " " in x:
            x = x.replace(" " + word + " ", " " + dic[word] + " ")
    return x

text = "we aren't able to come ,cause of heavy rain"
mapping_replacer(text, contraction)


'we are not able to come because of heavy rain'
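The function above only matches a contraction that has a space on both sides, so it misses one at the very start or end of a sentence, or one followed by a comma. Here is a hedged sketch of a regex-based variant. The name expand_contractions and the small CONTRACTIONS dictionary are assumptions for this example:

```python
import re

# Small illustrative mapping; extend it for your own use case.
CONTRACTIONS = {"aren't": "are not", "ain't": "am not", "'cause": "because"}


def expand_contractions(text, mapping=CONTRACTIONS):
    """Expand contractions wherever they are not part of a longer word."""
    # Longest keys first, so a longer key wins over a shorter one that
    # starts at the same position.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(k) for k in keys) + r")(?!\w)",
        flags=re.IGNORECASE,
    )
    # Keys are lowercase, so look the match up in lowercase.
    return pattern.sub(lambda m: mapping[m.group(1).lower()], text)


print(expand_contractions("I ain't coming, 'cause they aren't"))
# I am not coming, because they are not
```

Note that the replacement is always lowercase, so a capitalized "Aren't" becomes "are not". That is fine if you lowercase the text anyway.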


Stemming

For grammatical reasons, documents can contain different forms of a word, such as drive, drives, and driving. Also, sometimes we have related words with a similar meaning, such as nation, national, and nationality. Stemming helps us standardize words to their base or root stem, irrespective of their inflections, which helps many applications like classifying or clustering text, and even information retrieval. Let’s see the popular Porter stemmer in action now!

from nltk.stem import PorterStemmer

def my_stemmer(text):
    ps = PorterStemmer()
    # Stem each whitespace-separated word and join the stems back together
    text = ' '.join([ps.stem(word) for word in text.split()])
    return text

my_stemmer("My system keeps crashing his crashed yesterday, ours crashes")


'My system keep crash hi crash yesterday, our crash'
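To get a feel for what a stemmer does, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm, which applies many more rules. The suffix list, the minimum stem length, and the name naive_stem are all assumptions for illustration:

```python
# Suffixes are tried in this order; the first one that fits is removed.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word, min_stem_len=3):
    """Strip one common suffix, unless the remaining stem gets too short."""
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return word


words = ["crash", "crashes", "crashed", "crashing"]
print([naive_stem(w) for w in words])  # ['crash', 'crash', 'crash', 'crash']
print(naive_stem("his"))               # his (stem would be too short)
```

A rule set this small breaks quickly (it turns "drives" into "driv", for example), which is why we normally rely on a tested stemmer such as Porter's.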

Emoji handling

Emojis are not a problem when the data comes from a machine or some software. But with real-time textual data, as in chatbots, customers can enter whatever they want: any symbol, special characters, and, most of the time, emojis.

To handle this kind of input, you just have to install a Python library called emoji, like this:

pip install emoji 

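After that, you have to import the library and load the Unicode data for all the emoji symbols. Note that UNICODE_EMOJI is available in releases of the emoji library before 2.0; newer releases expose emoji.EMOJI_DATA instead: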

import emoji
emojis = emoji.UNICODE_EMOJI
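If you only want to strip emojis and prefer not to depend on a particular version of the emoji library, here is a rough standard-library sketch. The Unicode ranges below are an approximation chosen for this example, not a complete list of every emoji, and strip_emoji is a hypothetical helper name:

```python
import re

# Approximate emoji ranges (an assumption, not an exhaustive list).
EMOJI_RE = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flags)
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, and more
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector and zero-width joiner
    "]+"
)


def strip_emoji(text):
    """Remove characters in the ranges above and trim leftover spaces."""
    return EMOJI_RE.sub("", text).strip()


print(strip_emoji("Can you he.lp me with loan? \U0001F642"))
# Can you he.lp me with loan?
```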

Cleaning HTML

There is a lot of data available on websites nowadays, but whenever we scrape these pages, they come with a lot of unwanted tags. That makes it a headache for a developer to build a dataset out of these files.

Let’s see how we can clean HTML content. First, we need some HTML content, which looks like this:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="" class="sister" id="link1">Elsie</a>,
<a href="" class="sister" id="link2">Lacie</a> and
<a href="" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

Then we have to install the Beautiful Soup Python library:

pip3 install beautifulsoup4

or, on Debian or Ubuntu:

sudo apt-get install python3-bs4

Now we run this document through Beautiful Soup, which gives us a BeautifulSoup object that represents the document as a nested data structure:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

Here are some simple ways to navigate that data structure:

soup.title
# <title>The Dormouse's story</title>
soup.title.name
# 'title'
soup.title.string
# "The Dormouse's story"
soup.title.parent.name
# 'head'
soup.p
# <p class="title"><b>The Dormouse's story</b></p>
soup.p['class']
# ['title']
soup.a
# <a class="sister" href="" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="" id="link1">Elsie</a>,
# <a class="sister" href="" id="link2">Lacie</a>,
# <a class="sister" href="" id="link3">Tillie</a>]
soup.find(id="link3")
# <a class="sister" href="" id="link3">Tillie</a>

One common task is extracting all the URLs found within a page’s <a> tags:

for link in soup.find_all('a'):
    print(link.get('href'))


Another common task is extracting all the text from a page:

print(soup.get_text())


The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
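If installing Beautiful Soup is not an option, a rough version of the same idea can be sketched with Python's built-in html.parser module. TextExtractor and html_to_text are hypothetical names for this example. This sketch keeps every text node, including the contents of script and style tags, so treat it as a starting point only:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text found between tags
        if data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


snippet = ('<p class="title"><b>The Dormouse\'s story</b></p>'
           '<p>They lived at the bottom of a well.</p>')
print(html_to_text(snippet))
# The Dormouse's story They lived at the bottom of a well.
```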

For more operations on HTML, do refer to this guide.

So these are some basic text pre-processing steps that can help you clean your text and standardize it.
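In a real project these steps usually run one after another. Here is a small sketch of how they could be chained. The helper names (make_pipeline, strip_tags, and so on) are assumptions for this example, and the regex-based tag stripping is only a crude stand-in for a real HTML parser:

```python
import re


def strip_tags(text):
    """Crude tag removal; use a real HTML parser for anything serious."""
    return re.sub(r"<[^>]+>", " ", text)


def lowercase(text):
    return text.lower()


def collapse_spaces(text):
    return " ".join(text.split())


def make_pipeline(*steps):
    """Return a function that applies each step in the given order."""
    def run(text):
        for step in steps:
            text = step(text)
        return text
    return run


clean = make_pipeline(strip_tags, lowercase, collapse_spaces)
print(clean("<p>We   aren't  <b>READY</b> </p>"))  # we aren't ready
```

The order matters: here the tags are removed first, so that the spaces they leave behind are collapsed by the last step.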

Additionally, once you’ve fully preprocessed your data and are ready to create your bag of words (count vectorizer) or TF-IDF vectorizer, you can adjust the parameters to fit the requirements of your machine learning problem.

We will cover these feature extraction and text representation techniques in the next part of this blog series.

I hope this guide speeds up the preprocessing of your text data for your next NLP project. Feel free to leave any thoughts and insights.

Stay tuned, happy learning 🙂

And try to remove the emoji from the above sentence :p


Written by 

Shubham Goyal is a Data Scientist at Knoldus Inc. He is also an artificial intelligence researcher, interested in research on problems across different domains, and a regular contributor to society through blogs and webinars on machine learning and artificial intelligence. He has also written a few research papers on machine learning. Moreover, he is a conference speaker and an official author at Towards Data Science.