How to tokenize your search with N-Grams using Elasticsearch in Scala?


N-grams can be used to search data that contains compound words. German is famous for combining several small words into one massive compound word in order to capture a precise or complex meaning.

N-grams are the fragments into which a word is broken; the more of those fragments match the query, the more relevant the document. The fragment length is bounded by min_gram and max_gram, and a trigram (length 3) is a good length to start with.

For example, consider the following words:

Aussprachewörterbuch
	Meaning: Pronunciation dictionary
Weltgesundheitsorganisation
	Meaning: World Health Organization
Weißkopfseeadler
	Meaning: White-headed sea eagle, or bald eagle
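Before setting anything up in Elasticsearch, it helps to see what a trigram split actually produces. Here is a small Scala sketch (an illustration only, not part of the setup; the `trigrams` helper is ours) that mimics a fixed-length n-gram filter with min_gram = max_gram = 3:

```scala
// Illustration: break a token into fixed-length fragments (trigrams),
// the way an ngram filter with min_gram = max_gram = 3 would.
// The token is lowercased first, as the lowercase filter in the analyzer does.
def trigrams(token: String): Seq[String] =
  token.toLowerCase.sliding(3).filter(_.length == 3).toSeq

// "Wörterbuch" breaks into: wör, ört, rte, ter, erb, rbu, buc, uch
println(trigrams("Wörterbuch").mkString(", "))
```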

Set up the index

Now we create the index with the following command:

curl -XPUT 'localhost:9200/dictionary' -d '{
    "settings": {
        "analysis": {
            "filter": {
                "trigrams_filter": {
                    "type":     "ngram",
                    "min_gram": 3,
                    "max_gram": 3
                }
            },
            "analyzer": {
                "trigrams": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter":   [
                        "lowercase",
                        "trigrams_filter"
                    ]
                }
            }
        }
    },
    "mappings": {
        "germanDictionary": {
            "properties": {
                "text": {
                    "type":     "string",
                    "analyzer": "trigrams" 
                }
            }
        }
    }
}'
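To make the analyzer's behavior concrete, here is a hedged Scala simulation of the custom trigrams analyzer defined above: a crude stand-in for the standard tokenizer, then lowercasing, then the trigram filter. It only approximates what Elasticsearch does, and the `analyze` name is ours:

```scala
// Rough simulation of the "trigrams" analyzer chain:
// standard tokenizer (approximated by splitting on non-letter characters)
// -> lowercase token filter -> ngram filter with min_gram = max_gram = 3.
def analyze(text: String): Seq[String] =
  text.split("\\P{L}+").toSeq
    .filter(_.nonEmpty)                           // drop empty tokens
    .map(_.toLowerCase)                           // lowercase filter
    .flatMap(_.sliding(3).filter(_.length == 3))  // trigrams_filter
```

Running Weißkopfseeadler through this chain yields fragments such as adl, dle, and ler, which is what later lets a query for Adler find the document.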

Insert documents in bulk

curl -XPOST 'localhost:9200/dictionary/germanDictionary/_bulk?pretty' -d '
{ "index":{"_id":"1"} }
{ "text": "Aussprachewörterbuch" }
{ "index":{"_id":"2"} }
{ "text": "Weltgesundheitsorganisation" }
{ "index":{"_id":"3"} }
{ "text": "Weißkopfseeadler" }'

The index dictionary is now created with the type germanDictionary and contains the three documents indexed in bulk.

Applying search with N-Grams

If we search for Wörterbuch, we should get Aussprachewörterbuch as the result, using the command:

curl -XGET 'http://localhost:9200/dictionary/germanDictionary/_search?q=text:Wörterbuch'

Elasticsearch returns the following response, with the source that matched the token wörterbuch:

{
    "took": 8,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 0.29921588,
        "hits": [
            {
                "_index": "dictionary",
                "_type": "germanDictionary",
                "_id": "1",
                "_score": 0.29921588,
                "_source": {
                    "text": "Aussprachewörterbuch"
                }
            }
        ]
    }
}
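Why does the query match? Both the document and the query are analyzed with the same trigram analyzer, and every fragment of the query occurs in the document. A Scala sketch of that set containment (a simulation in our own code, not an Elasticsearch call):

```scala
// Break a token into lowercase trigrams, as the analyzer would.
def trigrams(token: String): Set[String] =
  token.toLowerCase.sliding(3).filter(_.length == 3).toSet

val docGrams   = trigrams("Aussprachewörterbuch")
val queryGrams = trigrams("Wörterbuch")

// Every trigram of the query also occurs in the document, so it scores a hit.
println(queryGrams.subsetOf(docGrams)) // true
```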

Similarly, if we search for Adler, we should get Weißkopfseeadler as the result:

curl -XGET 'http://localhost:9200/dictionary/germanDictionary/_search?q=text:Adler'

Elasticsearch returns the following response, with the source that matched the token adler:

{
    "took": 5,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 0.53148466,
        "hits": [
            {
                "_index": "dictionary",
                "_type": "germanDictionary",
                "_id": "3",
                "_score": 0.53148466,
                "_source": {
                    "text": "Weißkopfseeadler"
                }
            }
        ]
    }
}

Finally, if we search for er, we should not get any successful result:

curl -XGET 'http://localhost:9200/dictionary/germanDictionary/_search?q=text:er'

This time Elasticsearch returns the following response:

{
    "took": 3,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 0,
        "max_score": null,
        "hits": []
    }
}

Comparing this result with the results of the two search queries above:

  • The total number of documents found is zero
  • The maximum score (max_score) is null
  • hits is an empty array

Why did Elasticsearch return this response?

The answer lies in this filter:

{
    "trigrams_filter": {
        "type": "ngram",
        "min_gram": 3,
        "max_gram": 3
    }
}

Because min_gram is 3, a query token must be at least three characters long to produce any fragments that can match; er is too short.
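In the same simulated form, a two-character query produces no trigrams at all, so there is nothing to look up in the index:

```scala
// Break a token into lowercase trigrams, as the analyzer would.
def trigrams(token: String): Seq[String] =
  token.toLowerCase.sliding(3).filter(_.length == 3).toSeq

// "er" is shorter than min_gram (3), so the filter emits no fragments at all.
println(trigrams("er").isEmpty) // true
```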

This is how we can use n-grams for token-based searching.

Happy Blogging !!


About Harsh Sharma Khandal

Harsh is a Sr. Software Consultant at Knoldus Software LLP with 4 year of experience. He is a fan of programming standards and conventions. He has good knowledge of Scala, Java, 3D Modeling and 3D animation. His current passions include utilizing the power of Scala, Akka and Play to make reactive applications. He is a technologist and is never too far away from the keyboard. He believes in standard coding practices. His focus always remains on practical work. He has Master's in Computer Applications from Rajasthan Technical University, Kota. His hobbies include reading books and writing the code in multiple ways to find the best way it can be represented.
