How to tokenize your search by N-Grams using Elasticsearch in Scala?


N-grams can be used to search text that contains compound words. The German language is famous for combining several small words into one massive compound word in order to capture precise or complex meanings.

An n-gram is a fragment of a word; the more of a query's fragments that match a document's fragments, the better the match. The fragment length is controlled by the min_gram and max_gram settings, and a trigram (length 3) is a good length to start with.
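To make this concrete, here is a minimal sketch of how a word is broken into trigrams. The trigrams function is our own plain-shell illustration of what the ngram token filter does with min_gram = max_gram = 3, not an Elasticsearch API:

```shell
#!/bin/sh
# Emit every trigram (n-gram of length 3) of a word, one per line,
# mimicking the ngram token filter with min_gram = max_gram = 3.
trigrams() {
  word=$1
  i=0
  while [ $((i + 3)) -le "${#word}" ]; do
    # cut -c extracts characters (i+1)..(i+3) of the word
    printf '%s\n' "$word" | cut -c $((i + 1))-$((i + 3))
    i=$((i + 1))
  done
}

trigrams "search"
# -> sea
#    ear
#    arc
#    rch
```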

For example, we have the following words:

Aussprachewörterbuch
	Meaning : Pronunciation dictionary
Weltgesundheitsorganisation
	Meaning : World Health Organization
Weißkopfseeadler
	Meaning : White-headed sea eagle, or bald eagle

Set up the Index

Now we create the index with the following command:

curl -XPUT 'localhost:9200/dictionary' -H 'Content-Type: application/json' -d '{
    "settings": {
        "analysis": {
            "filter": {
                "trigrams_filter": {
                    "type":     "ngram",
                    "min_gram": 3,
                    "max_gram": 3
                }
            },
            "analyzer": {
                "trigrams": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter":   [
                        "lowercase",
                        "trigrams_filter"
                    ]
                }
            }
        }
    },
    "mappings": {
        "germanDictionary": {
            "properties": {
                "text": {
                    "type":     "string",
                    "analyzer": "trigrams" 
                }
            }
        }
    }
}'

Insert Documents in Bulk

curl -XPOST 'localhost:9200/dictionary/germanDictionary/_bulk?pretty' -H 'Content-Type: application/json' -d '
{ "index":{"_id":"1"} }
{ "text": "Aussprachewörterbuch" }
{ "index":{"_id":"2"} }
{ "text": "Weltgesundheitsorganisation" }
{ "index":{"_id":"3"} }
{ "text": "Weißkopfseeadler" }
'

The index dictionary is now created with the type germanDictionary and contains the three documents indexed in bulk.

Applying search with N-Grams

If we search for wörterbuch, we should get Aussprachewörterbuch as a result (the lowercase filter makes the search case-insensitive):

curl -XGET 'http://localhost:9200/dictionary/germanDictionary/_search?q=text:Wörterbuch'

Elasticsearch will return the following response, with the source that matched the token wörterbuch:

{
    "took": 8,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 0.29921588,
        "hits": [
            {
                "_index": "dictionary",
                "_type": "germanDictionary",
                "_id": "1",
                "_score": 0.29921588,
                "_source": {
                    "text": "Aussprachewörterbuch"
                }
            }
        ]
    }
}
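The match happens because every trigram produced from the query term also appears among the document's trigrams. Here is a sketch of that overlap check in plain shell; it uses the English meanings, dictionary and pronunciationdictionary, as ASCII-only stand-ins for wörterbuch and Aussprachewörterbuch, and the trigrams helper is our own illustration, not Elasticsearch code:

```shell
#!/bin/sh
# Show that every trigram of the query term occurs among the
# trigrams of the indexed document, which is why the query matches.
trigrams() {
  word=$1
  i=0
  while [ $((i + 3)) -le "${#word}" ]; do
    printf '%s\n' "$word" | cut -c $((i + 1))-$((i + 3))
    i=$((i + 1))
  done
}

# ASCII stand-ins for "Aussprachewörterbuch" and "wörterbuch"
doc_grams=$(trigrams "pronunciationdictionary")
found=0
total=0
for g in $(trigrams "dictionary"); do
  total=$((total + 1))
  # -x matches whole lines, so "dic" only matches the trigram "dic"
  if printf '%s\n' "$doc_grams" | grep -qx "$g"; then
    found=$((found + 1))
  fi
done
echo "$found of $total query trigrams found"
# -> 8 of 8 query trigrams found
```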

Now, if we search for Adler, we should get Weißkopfseeadler as a result:

curl -XGET 'http://localhost:9200/dictionary/germanDictionary/_search?q=text:Adler'

Elasticsearch will return the following response, with the source that matched the token adler:

{
    "took": 5,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 0.53148466,
        "hits": [
            {
                "_index": "dictionary",
                "_type": "germanDictionary",
                "_id": "3",
                "_score": 0.53148466,
                "_source": {
                    "text": "Weißkopfseeadler"
                }
            }
        ]
    }
}

Finally, if we search for er, we should not get any successful result:

curl -XGET 'http://localhost:9200/dictionary/germanDictionary/_search?q=text:er'

This time, Elasticsearch will return the following response:

{
    "took": 3,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 0,
        "max_score": null,
        "hits": []
    }
}

Comparing this result with the results of the two queries above:

  • The total number of documents found is zero
  • The maximum score (max_score) is null
  • hits is an empty array

Why did Elasticsearch return this response?

The answer lies in this filter:

{
    "trigrams_filter": {
        "type": "ngram",
        "min_gram": 3,
        "max_gram": 3
    }
}

Here, the minimum token length required for a match is 3, so the two-character query er cannot produce any hits.
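Put differently, a two-character query yields zero trigrams, so there is nothing for the index lookup to match at all. The same sketch-style trigrams helper (our own illustration, not Elasticsearch code) shows this:

```shell
#!/bin/sh
# A query shorter than min_gram produces no trigrams at all,
# so an ngram-analyzed search has nothing to match against.
trigrams() {
  word=$1
  i=0
  while [ $((i + 3)) -le "${#word}" ]; do
    printf '%s\n' "$word" | cut -c $((i + 1))-$((i + 3))
    i=$((i + 1))
  done
}

count=$(trigrams "er" | wc -l | tr -d ' ')
echo "trigrams for 'er': $count"
# -> trigrams for 'er': 0
```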

This is how we can use n-grams for token-based searching.

Happy Blogging !!
