In this blog, we are going to see how we can use the NLP library NLTK for sentiment analysis.

Sentiment analysis is a common NLP task nowadays, and one that almost every data scientist, or anyone working in data science, needs to perform at some point.
Introduction to NLP
Natural Language Processing
Natural Language Processing (NLP) is a subfield of artificial intelligence that helps computers understand human language. NLP enables machines to derive meaning from the human language so that we can gain valuable insights from online communication.
Interest in natural language processing (NLP) is often traced back to 1950, when Alan Turing published his paper entitled “Computing Machinery and Intelligence.” That paper proposed what is now called the Turing Test. Turing basically asserted that a computer could be considered intelligent if it could carry on a conversation with a human being without the human realizing they were talking to a machine.
NLTK
NLTK is one of the most popular Python packages for natural language processing. It provides tools for importing, cleaning, and pre-processing text data in human language, and for then applying computational linguistics algorithms such as sentiment analysis.
Data
The dataset was introduced by Bo Pang and Lillian Lee and is redistributed with NLTK with permission from the authors. Today we will download the data directly through the NLTK library.
This dataset contains 1000 positive and 1000 negative processed reviews in the form of text files:
- 1000 text files with positive reviews
- 1000 text files with negative reviews
So let’s move on to the solution.
Implementation
First, we will import the NLTK package. It also includes many easy-to-use datasets in the nltk.corpus package; we can download, for example, the movie_reviews dataset using the nltk.download function.
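As a rough sketch of this step (the original code cell is not shown here, so this is an assumption about what it looked like), the download could be written as follows. nltk.download returns True when the package is installed or already up to date, and False when the download fails.

```python
import nltk

# Fetch the movie_reviews corpus into the local nltk_data directory.
# The call returns True on success and False if the download failed,
# for example when there is no network connection.
downloaded = nltk.download("movie_reviews")
print("download succeeded:", downloaded)
```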

Similarly, if you want to see all the datasets available for download, you can simply call
nltk.download()
This opens the NLTK downloader, which lists all the datasets available with NLTK.
Once the data has been downloaded, we can import it from nltk.corpus.

Inspect the Movie Reviews Dataset
The fileids method, provided by all the datasets in nltk.corpus, gives access to a list of all the files available.
In particular, the movie_reviews dataset has 2000 text files, each of them a review of a movie, and they are already split into a neg folder for the negative reviews and a pos folder for the positive reviews.

fileids can also filter the available files based on their category, which is the name of the subfolder they are located in. Therefore we can have lists of positive and negative reviews separately.
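A minimal sketch of both uses of fileids, assuming the corpus has already been downloaded. The helper name fileids_by_category is made up for this illustration and is not part of NLTK; categories() and fileids(categories=...) are the corpus reader methods it relies on.

```python
from nltk.corpus import movie_reviews


def fileids_by_category(corpus=movie_reviews):
    """Return a dict mapping each category name to its list of file ids."""
    return {
        category: corpus.fileids(categories=[category])
        for category in corpus.categories()
    }
```

With movie_reviews, fileids_by_category() should return the keys neg and pos with 1000 file ids each, matching the counts given in the Data section.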

We can inspect one of the reviews using the raw method of movie_reviews. Each file is split into sentences, and the curators of this dataset also removed any direct mention of the movie's rating from each review.
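One way to do this, sketched under the assumption that the corpus is already downloaded. The helper first_review_text is a hypothetical name chosen for this example; fileids and raw are the corpus reader methods described above.

```python
from nltk.corpus import movie_reviews


def first_review_text(category, corpus=movie_reviews):
    """Return the raw text of the first file in the given category."""
    first_fileid = corpus.fileids(categories=[category])[0]
    # raw() returns the file content as one string, one sentence per line.
    return corpus.raw(fileids=[first_fileid])
```

Calling print(first_review_text("pos")) should then print a full positive review as plain lowercase text.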

Output:
films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before .
for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen .
to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd .
the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes .
in other words , don't dismiss this film because of its source .
if you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes .
getting the hughes brothers to direct this seems almost as ludicrous as casting carrot top in , well , anything , but riddle me this : who better to direct a film that's set in the ghetto and features really violent street crime than the mad geniuses behind menace ii society ?
the ghetto in question is , of course , whitechapel in 1888 london's east end .
it's a filthy , sooty place where the whores ( called " unfortunates " ) are starting to get a little nervous about this mysterious psychopath who has been carving through their profession with surgical precision .
when the first stiff turns up , copper peter godley ( robbie coltrane , the world is not enough ) calls in inspector frederick abberline ( johnny depp , blow ) to crack the case .
abberline , a widower , has prophetic dreams he unsuccessfully tries to quell with copious amounts of absinthe and opium .
upon arriving in whitechapel , he befriends an unfortunate named mary kelly ( heather graham , say it isn't so ) and proceeds to investigate the horribly gruesome crimes that even the police surgeon can't stomach .
i don't think anyone needs to be briefed on jack the ripper , so i won't go into the particulars here , other than to say moore and campbell have a unique and interesting theory about both the identity of the killer and the reasons he chooses to slay .
in the comic , they don't bother cloaking the identity of the ripper , but screenwriters terry hayes ( vertical limit ) and rafael yglesias ( les mis ? rables ) do a good job of keeping him hidden from viewers until the very end .
it's funny to watch the locals blindly point the finger of blame at jews and indians because , after all , an englishman could never be capable of committing such ghastly acts .
and from hell's ending had me whistling the stonecutters song from the simpsons for days ( " who holds back the electric car/who made steve guttenberg a star ? " ) .
don't worry - it'll all make sense when you see it .
now onto from hell's appearance : it's certainly dark and bleak enough , and it's surprising to see how much more it looks like a tim burton film than planet of the apes did ( at times , it seems like sleepy hollow 2 ) .
the print i saw wasn't completely finished ( both color and music had not been finalized , so no comments about marilyn manson ) , but cinematographer peter deming ( don't say a word ) ably captures the dreariness of victorian-era london and helped make the flashy killing scenes remind me of the crazy flashbacks in twin peaks , even though the violence in the film pales in comparison to that in the black-and-white comic .
oscar winner martin childs' ( shakespeare in love ) production design turns the original prague surroundings into one creepy place .
even the acting in from hell is solid , with the dreamy depp turning in a typically strong performance and deftly handling a british accent .
ians holm ( joe gould's secret ) and richardson ( 102 dalmatians ) log in great supporting roles , but the big surprise here is graham .
i cringed the first time she opened her mouth , imagining her attempt at an irish accent , but it actually wasn't half bad .
the film , however , is all good .
2 : 00 - r for strong violence/gore , sexuality , language and drug content
Tokenize Text into Words

The first thing we generally have to do is split the text into words. This process might appear simple, but handling all the corner cases is tedious. See, for example, all the issues with punctuation we have to solve if we just split on whitespace, where punctuation marks stay attached to the neighbouring words.
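The whitespace split can be sketched as follows. The passage is reconstructed from the output shown below, and the line breaks inside the string are an assumption.

```python
romeo_text = """Why then, O brawling love! O loving hate!
O any thing, of nothing first create!
O heavy lightness, serious vanity,
Misshapen chaos of well-seeming forms,
Feather of lead, bright smoke, cold fire, sick health,
Still-waking sleep, that is not what it is!
This love feel I, that feel no love in this."""

# str.split() with no arguments splits on any run of whitespace,
# so commas, exclamation marks and full stops stay glued to the words.
whitespace_tokens = romeo_text.split()
print(whitespace_tokens)
```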

Output:
['Why',
'then,',
'O',
'brawling',
'love!',
'O',
'loving',
'hate!',
'O',
'any',
'thing,',
'of',
'nothing',
'first',
'create!',
'O',
'heavy',
'lightness,',
'serious',
'vanity,',
'Misshapen',
'chaos',
'of',
'well-seeming',
'forms,',
'Feather',
'of',
'lead,',
'bright',
'smoke,',
'cold',
'fire,',
'sick',
'health,',
'Still-waking',
'sleep,',
'that',
'is',
'not',
'what',
'it',
'is!',
'This',
'love',
'feel',
'I,',
'that',
'feel',
'no',
'love',
'in',
'this.']
The movie_reviews corpus already gives direct access to tokenized text through the words method.
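A short sketch of that call, assuming the corpus is already downloaded. The wrapper name review_words is hypothetical; the only NLTK call used is the words method with a fileids argument.

```python
from nltk.corpus import movie_reviews


def review_words(fileid, corpus=movie_reviews):
    """Return the tokens of one review as a plain Python list."""
    # corpus.words() returns a lazy view over the file; list() materializes it.
    return list(corpus.words(fileids=[fileid]))
```

For example, review_words(movie_reviews.fileids(categories=["pos"])[0])[:10] would give the first ten tokens of the first positive review, with punctuation already separated from the words.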

Build a bag-of-words model
The simplest model for analyzing text is just to think about text as an unordered collection of words (bag-of-words). This is often enough to infer the category, the topic, or the sentiment of the text.
From the bag-of-words model, we can build features to be used by a classifier. Here we assume that each word is a feature that can be either True or False.
We implement this in Python as a dictionary in which each word in a sentence is associated with True. If a word is missing from the dictionary, that is the same as assigning False.
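A minimal sketch of such a feature builder. The function name is an assumption, and the regular expression is only a simple stand-in for a real tokenizer (such as the ones NLTK provides), chosen so that the example runs without extra downloads.

```python
import re

romeo_text = """Why then, O brawling love! O loving hate!
O any thing, of nothing first create!
O heavy lightness, serious vanity,
Misshapen chaos of well-seeming forms,
Feather of lead, bright smoke, cold fire, sick health,
Still-waking sleep, that is not what it is!
This love feel I, that feel no love in this."""


def build_bag_of_words_features(words):
    """Map each distinct word to True; absent words are implicitly False."""
    return {word: True for word in words}


# Stand-in tokenizer: keeps hyphenated words together and
# turns each punctuation mark into its own token.
romeo_words = re.findall(r"[\w-]+|[^\w\s]", romeo_text)
features = build_bag_of_words_features(romeo_words)
print(features)
```

Because a dictionary keeps one entry per key, repeated words such as "O" or "love" appear only once, which is exactly the unordered, count-free view that bag-of-words describes.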

Output:
{'Why': True,
'then': True,
',': True,
'O': True,
'brawling': True,
'love': True,
'!': True,
'loving': True,
'hate': True,
'any': True,
'thing': True,
'of': True,
'nothing': True,
'first': True,
'create': True,
'heavy': True,
'lightness': True,
'serious': True,
'vanity': True,
'Misshapen': True,
'chaos': True,
'well-seeming': True,
'forms': True,
'Feather': True,
'lead': True,
'bright': True,
'smoke': True,
'cold': True,
'fire': True,
'sick': True,
'health': True,
'Still-waking': True,
'sleep': True,
'that': True,
'is': True,
'not': True,
'what': True,
'it': True,
'This': True,
'feel': True,
'I': True,
'no': True,
'in': True,
'this': True,
'.': True}

This is what we wanted, but we notice that punctuation like “!” and words that are useless for classification purposes, like “of” or “that”, are also included.
Such words are called “stopwords”, and nltk has a convenient stopwords corpus that we can download.

Using the Python string.punctuation constant and the English stopwords, we can build better features by filtering out the words that would not help in the classification.
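The filtering can be sketched as below. To keep the example runnable without downloads, it uses a small hand-picked subset of stopwords; this subset is an assumption for illustration, and the full list would come from nltk.corpus.stopwords.words("english") after nltk.download("stopwords"). The regular expression is again only a stand-in tokenizer.

```python
import re
import string

# Illustrative subset of English stopwords (assumption, see the note above).
sample_stopwords = {"then", "any", "of", "that", "is", "not",
                    "what", "it", "no", "in", "this"}

# string.punctuation is a string of ASCII punctuation characters;
# converting it to a set lets us check single-character tokens against it.
useless_words = sample_stopwords | set(string.punctuation)


def build_bag_of_words_features_filtered(words):
    """Keep only the words that are neither stopwords nor punctuation."""
    return {word: 1 for word in words if word not in useless_words}


tokens = re.findall(r"[\w-]+|[^\w\s]", "This love feel I, that feel no love in this.")
print(build_bag_of_words_features_filtered(tokens))
# {'This': 1, 'love': 1, 'feel': 1, 'I': 1}
```

The comparison is case-sensitive, which is why capitalized words such as "This" and "I" survive the filter. Lowercasing the tokens first would remove them as well.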

Output:
{'Why': 1,
'O': 1,
'brawling': 1,
'love': 1,
'loving': 1,
'hate': 1,
'thing': 1,
'nothing': 1,
'first': 1,
'create': 1,
'heavy': 1,
'lightness': 1,
'serious': 1,
'vanity': 1,
'Misshapen': 1,
'chaos': 1,
'well-seeming': 1,
'forms': 1,
'Feather': 1,
'lead': 1,
'bright': 1,
'smoke': 1,
'cold': 1,
'fire': 1,
'sick': 1,
'health': 1,
'Still-waking': 1,
'sleep': 1,
'This': 1,
'feel': 1,
'I': 1}
Plotting Frequencies of Words
It is common to explore a dataset before starting the analysis. In this section we will find the most common words and plot their frequency.
Using the words() method with no argument, we can extract the words from the entire dataset and check that there are about 1.6 million of them.

This dataset also contains many unwanted tokens, such as stopwords and punctuation, which we need to filter out. Doing so also makes the dataset smaller.
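A small sketch of this filtering step. The token list is a made-up sample in the style of the corpus (lowercase, punctuation as separate tokens), and the stopword set is an illustrative subset; with NLTK, the inputs would be movie_reviews.words() and nltk.corpus.stopwords.words("english").

```python
import string

# Hypothetical sample tokens standing in for movie_reviews.words().
all_words = ["plot", ":", "two", "teen", "couples", "go", "to", "a", "church",
             "party", ",", "drink", "and", "then", "drive", "."]

# Illustrative stopword subset plus all ASCII punctuation characters.
useless_words = {"to", "a", "and", "then"} | set(string.punctuation)

# Keep every token that is neither a stopword nor a punctuation mark.
filtered_words = [word for word in all_words if word not in useless_words]
print(filtered_words)
# ['plot', 'two', 'teen', 'couples', 'go', 'church', 'party', 'drink', 'drive']
```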

Output:
['plot',
'two',
'teen',
'couples',
'go',
'church',
'party',
'drink',
'drive',
'get',
'accident',
'one',
'guys',
'dies',
'girlfriend',
'continues',
'see',
'life',
'nightmares',
'deal']
Now we can use the Counter class from Python's collections module to count the frequencies of the words in our list.
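As a quick illustration on a toy list (the words below are made up for the example, not taken from the corpus), counting works like this:

```python
from collections import Counter

# Toy word list standing in for the filtered corpus words.
filtered_words = ["film", "one", "film", "movie", "one", "film", "story"]

# Counter maps each distinct word to the number of times it occurs.
word_counter = Counter(filtered_words)

# most_common(n) returns the n most frequent (word, count) pairs.
print(word_counter.most_common(2))
# [('film', 3), ('one', 2)]
```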

Output:
[('film', 9517),
('one', 5852),
('movie', 5771),
('like', 3690),
('even', 2565),
('good', 2411),
('time', 2411),
('story', 2169),
('would', 2109),
('much', 2049)]
Counter also provides the most_common() method, which returns the most frequent words in our list; the output above shows the ten most common.
Visualize the data
Now, we are going to use the matplotlib library for data visualization.

We can sort the word counts and plot their values on logarithmic axes to check the shape of the distribution. This visualization is particularly useful when comparing two or more datasets: a flatter distribution indicates a large vocabulary, while a peaked distribution indicates a restricted vocabulary, often due to a focused topic or specialized language.
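A sketch of such a plot with matplotlib. The counter here is a toy one that reuses a few of the counts listed earlier; the real one would be built from the whole filtered word list, and the axis labels are my own choice.

```python
from collections import Counter

import matplotlib.pyplot as plt

# Toy counter with a handful of words; stands in for the full word counter.
word_counter = Counter({"film": 9517, "one": 5852, "movie": 5771,
                        "like": 3690, "even": 2565, "good": 2411})

# Sort the counts from the most to the least frequent word.
sorted_word_counts = sorted(word_counter.values(), reverse=True)

# Ranks start at 1 because 0 cannot be shown on a logarithmic axis.
ranks = range(1, len(sorted_word_counts) + 1)

fig, ax = plt.subplots()
ax.loglog(ranks, sorted_word_counts)
ax.set_xlabel("Word rank")
ax.set_ylabel("Word count")
```

In a notebook the figure is displayed automatically; in a script, plt.show() or fig.savefig(...) would be needed to see it.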


Another related plot is the histogram of sorted_word_counts, which displays how many words have a count in a specific range.
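A minimal histogram sketch, using the ten counts listed earlier as stand-in data. The number of bins and the axis labels are assumptions made for this small sample.

```python
import matplotlib.pyplot as plt

# Stand-in data: the ten largest word counts from the earlier output.
sorted_word_counts = [9517, 5852, 5771, 3690, 2565, 2411, 2411, 2169, 2109, 2049]

fig, ax = plt.subplots()
# hist() returns the number of values per bin, the bin edges, and the drawn bars.
counts_per_bin, bin_edges, _ = ax.hist(sorted_word_counts, bins=5)
ax.set_xlabel("Word count")
ax.set_ylabel("Number of words")
```

On the full vocabulary, most words occur only a few times, so more bins and a logarithmic y-axis (ax.set_yscale("log")) would likely make the plot easier to read.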




Now we have some insight into the dataset, and we need to build an ML model to classify the sentiments.
Making a Machine learning model
Now the final step is to make a machine learning model and train it with the features we have just built.
For the further implementation, you can download it from here: movie rating review with NLTK
Stay tuned, happy learning 🙂
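The model code itself is not shown in this post. As one possible sketch, and not necessarily the approach used in the full implementation, NLTK's NaiveBayesClassifier can be trained on (features, label) pairs shaped like the bag-of-words dictionaries built earlier. The tiny training set below is invented purely for illustration.

```python
from nltk.classify import NaiveBayesClassifier

# Invented toy training data: each item is a (bag-of-words features, label) pair.
train_set = [
    ({"great": True, "fun": True}, "pos"),
    ({"wonderful": True, "acting": True}, "pos"),
    ({"boring": True, "plot": True}, "neg"),
    ({"awful": True, "acting": True}, "neg"),
]

# Train the classifier on the labelled feature dictionaries.
classifier = NaiveBayesClassifier.train(train_set)

# Classify a new review represented by its bag-of-words features.
prediction = classifier.classify({"great": True, "acting": True})
print(prediction)
```

With the real data, the training pairs would be built from the filtered features of each positive and negative review, keeping some reviews aside to measure accuracy.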
Follow MachineX Intelligence for more: