Understanding the Rasa NLU Pipeline

Reading Time: 4 minutes

Rasa is an open source machine learning framework for automated text- and voice-based conversations. It understands messages, holds conversations, and connects to messaging channels and APIs.
In this blog we cover the NLU pipeline of Rasa. The goal of this guide is to explain the role each component plays in the Rasa NLU pipeline and how the components interact with each other.

The NLU Pipeline

A machine learning pipeline can be defined as a sequence of tasks that is used to train a machine learning model.
As shown in the figure below, the Rasa NLU pipeline consists of a sequence of predefined tasks known as “components”, each with its own configuration. The pipeline is defined in the ‘config.yml’ file of a Rasa project; this file describes all the steps of the pipeline that Rasa will use to classify intents and entities.

pipeline:
   - name: WhitespaceTokenizer
   - name: RegexFeaturizer
   - name: LexicalSyntacticFeaturizer
   - name: CountVectorsFeaturizer
   - name: DIETClassifier
   - name: EntitySynonymMapper
   - name: ResponseSelector
   - name: FallbackClassifier

Components in Rasa NLU Pipeline

According to Rasa, the components make up your NLU pipeline and work sequentially to process user input into structured output. There are components for entity extraction, intent classification, response selection, preprocessing, and more.

Components in Rasa are classified as follows:

  1. Tokenizers
  2. Featurizers
  3. Intent Classifiers
  4. Entity Extractors
  5. Selectors

Let’s discuss what each of the above components does and how it is used in the above pipeline.

1. Tokenizers

Tokenizers break up the original text into pieces called tokens and return a list of words or tokens. This is the first step in any NLU pipeline and must happen before text is featurized for machine learning. As shown above, the first component used in the pipeline is “WhitespaceTokenizer”. It tokenizes the raw text input using whitespace as a separator.
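The idea behind whitespace tokenization can be illustrated with a minimal sketch (this is not Rasa’s actual implementation, just the underlying concept):

```python
def whitespace_tokenize(text):
    """Split the raw message on runs of whitespace and return the tokens."""
    return text.split()

tokens = whitespace_tokenize("book a table  for two")
# tokens == ["book", "a", "table", "for", "two"]
```

Note that splitting on whitespace also collapses repeated spaces, so the token list contains only the words themselves.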

2. Featurizers

A featurizer transforms raw input data into a processed form known as a feature vector: a machine-readable representation that is used as input to the machine learning model.

In Rasa, text featurizers are divided into two categories:

a) Sparse featurizers

b) Dense featurizers

Sparse featurizers return feature vectors in which most values are zero. Since such vectors would normally take up a lot of memory, we store them as sparse features: only the non-zero values and their positions in the vector are kept.

Dense featurizers, on the other hand, return feature vectors containing pretrained embeddings, typically 50–300 dimensions long. Because similar-meaning words get similar representations, dense vectors can capture semantic similarity that sparse vectors miss.
For example, “home” and “house” are two unrelated entries in a sparse vector representation, while a dense representation captures the similarity between these words.
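The memory saving behind sparse storage can be sketched as follows (an illustrative toy, not Rasa’s internal format):

```python
def to_sparse(dense):
    """Return {position: value} for the non-zero entries of a vector."""
    return {i: v for i, v in enumerate(dense) if v != 0}

dense = [0, 0, 3, 0, 1, 0, 0, 0]
sparse = to_sparse(dense)
# sparse == {2: 3, 4: 1} -- 2 stored entries instead of 8
```

For real bag-of-words vectors over a vocabulary of tens of thousands of words, almost every entry is zero, so the saving is dramatic.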

   - name: RegexFeaturizer
   - name: LexicalSyntacticFeaturizer
   - name: CountVectorsFeaturizer

One may be interested in extracting different types of features from the text and concatenating them together to feed as input to the machine learning model. As you can see in the above pipeline, we have used three different featurizers in series; a featurizer at one point relies on the output of the components before it.

RegexFeaturizer creates a sparse feature-vector representation of the raw text using regular expressions.

LexicalSyntacticFeaturizer creates lexical and syntactic features for the raw text to support entity extraction.

CountVectorsFeaturizer creates a bag-of-words representation of the raw text, intents, and responses.
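A bag-of-words representation, in the spirit of what CountVectorsFeaturizer produces, can be sketched like this (the vocabulary here is made up for illustration; Rasa builds it from the training data):

```python
from collections import Counter

def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vocab = ["book", "a", "table", "restaurant"]
vector = bag_of_words(["book", "a", "table", "a"], vocab)
# vector == [1, 2, 1, 0]
```

The resulting vector is sparse in practice: most vocabulary words do not appear in any single message, so most counts are zero.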

3. Intent Classifiers

Once we’ve generated features for all of the tokens and for the entire sentence, we can pass them to an intent classification model. Intent classifiers assign one of the intents defined in the domain file to incoming user messages. We recommend using Rasa’s DIET model, which can handle both intent classification and entity extraction, and is able to learn from both token-level and sentence-level features.

In the above pipeline we have used the DIETClassifier, which extracts entities and intents and outputs entities, intents, and intent rankings, as shown below.

{
    "intent": {"name": "greet", "confidence": 0.7800},
    "intent_ranking": [
        {
            "confidence": 0.7800,
            "name": "greet"
        },
        {
            "confidence": 0.1400,
            "name": "goodbye"
        },
        {
            "confidence": 0.0800,
            "name": "restaurant_search"
        }
    ],
    "entities": [{
        "end": 53,
        "entity": "time",
        "start": 48,
        "value": "2017-04-10T00:00:00.000+02:00",
        "confidence": 1.0,
        "extractor": "DIETClassifier"
    }]
}

Another intent classifier used in the above pipeline is the “FallbackClassifier”. When the DIETClassifier is unable to classify an intent with a confidence greater than or equal to the threshold value, the FallbackClassifier classifies the input message with the intent named “nlu_fallback”. It can also predict the fallback intent when the confidence scores of the two top-ranked intents are closer than the ambiguity_threshold. Like the DIETClassifier, it also outputs entities, intents, and an intent ranking, as shown below:


    {
        "intent": {"name": "nlu_fallback", "confidence": 0.7183846840434321},
        "intent_ranking": [
            {
                "confidence": 0.7183846840434321,
                "name": "nlu_fallback"
            },
            {
                "confidence": 0.28161531595656784,
                "name": "restaurant_search"
            }
        ],
        "entities": [{
            "end": 53,
            "entity": "time",
            "start": 48,
            "value": "2017-04-10T00:00:00.000+02:00",
            "confidence": 1.0,
            "extractor": "DIETClassifier"
        }]
    }
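The two fallback conditions described above can be sketched in a few lines (the threshold values here are illustrative, not Rasa’s defaults, and the ranking is assumed to be sorted by descending confidence):

```python
def should_fall_back(intent_ranking, threshold=0.7, ambiguity_threshold=0.1):
    """Return True if the top intent should be replaced by nlu_fallback."""
    top = intent_ranking[0]["confidence"]
    # Condition 1: the classifier is not confident enough overall.
    if top < threshold:
        return True
    # Condition 2: the two top-ranked intents are too close to call.
    if len(intent_ranking) > 1:
        second = intent_ranking[1]["confidence"]
        if top - second < ambiguity_threshold:
            return True
    return False

ranking = [{"name": "greet", "confidence": 0.78},
           {"name": "goodbye", "confidence": 0.14}]
# should_fall_back(ranking) == False: confident and unambiguous
```

With a top confidence of 0.78 against a threshold of 0.7, and a 0.64 gap to the runner-up, neither condition fires, so the prediction stands.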

4. Entity Extractors

Entity extractors extract entities, such as person names or locations, from the text data. Even though DIET is capable of learning how to detect entities, we don’t necessarily recommend using it for every type of entity out there. For example, entities that follow a structured pattern, like phone numbers, don’t really need an algorithm to detect them.

As shown in the above pipeline, we have used the “EntitySynonymMapper” for entity extraction. It maps synonymous entity values to the same value. If the training data contains defined synonyms of an entity, this component will make sure that detected entity values are mapped to the same value. For example, if your training data contains similar entities like United Kingdom and UK, this component allows you to map both United Kingdom and UK to 'uk'. The entity extraction will then return 'uk' even though the message contains United Kingdom. When this component changes an existing entity, it appends itself to the processor list of that entity.
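The mapping step itself amounts to a lookup table, as in this toy sketch (the synonym dictionary is a made-up example; in Rasa it is learned from the synonyms defined in the training data):

```python
# Hypothetical synonym table: every surface form maps to a canonical value.
SYNONYMS = {"united kingdom": "uk", "great britain": "uk"}

def map_synonyms(entities):
    """Replace detected entity values with their canonical form, if known."""
    for entity in entities:
        canonical = SYNONYMS.get(entity["value"].lower())
        if canonical is not None:
            entity["value"] = canonical
    return entities

entities = [{"entity": "country", "value": "United Kingdom"}]
# map_synonyms(entities)[0]["value"] == "uk"
```

Values with no entry in the table pass through unchanged.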

5. Selectors

After the extraction of intents and entities, one may need to generate a response to the given input message. This is where selectors come in: they predict a response from a set of predefined responses according to the confidence of the intents.

In the above pipeline we have used the “ResponseSelector” component to return the response. It outputs a dictionary whose key is the retrieval intent of the response selector and whose value contains the predicted responses, the confidence, and the response key under the retrieval intent.
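Conceptually, the selection step picks the predefined response whose key has the highest confidence, as in this sketch (the response keys and texts are invented for illustration; they are not Rasa output):

```python
# Hypothetical predefined responses, keyed by retrieval intent/response key.
RESPONSES = {
    "faq/opening_hours": "We are open 9am to 5pm.",
    "faq/location": "We are based in London.",
}

def select_response(predictions):
    """Pick the predefined response with the highest predicted confidence.

    predictions: list of {"key": ..., "confidence": ...} candidates.
    """
    best = max(predictions, key=lambda p: p["confidence"])
    return RESPONSES[best["key"]]

preds = [{"key": "faq/opening_hours", "confidence": 0.9},
         {"key": "faq/location", "confidence": 0.1}]
# select_response(preds) == "We are open 9am to 5pm."
```

In a real Rasa project the candidate responses come from the retrieval intents defined in the training data, and the confidences from the trained selector model.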

Conclusion

In this blog post, we’ve reviewed the different types of components in the Rasa NLU pipeline. It’s good to understand how the components interact, because it will help you decide which components are relevant to your model and how to customise the pipeline accordingly.