Apache Solr with Java: Result Grouping with Solrj

This blog is a detailed, step-by-step guide to implementing group-by-field queries in Apache Solr using SolrJ.

Note: Grouping is different from faceting in Apache Solr. While grouping returns the documents grouped by the specified field, faceting returns the count of documents for each distinct value of the specified field. However, you can combine grouping and faceting in Solr. This blog talks about grouping without the use of faceting, implemented through SolrJ (version 6.4.1).

Without much ado, let’s get to it.

First, you need a running Solr instance with correctly indexed data.

Note: A Solr core is basically an index of the text and fields found in documents. A single Solr instance can contain multiple “cores”, which are separate from each other based on local criteria. If you don’t have a running Solr instance with a core set up, Apache Solr provides a number of useful examples to help you learn about key features. You can launch the examples using the -e flag.

Set up your local Solr by following the directions below:

(i) Download Solr for your operating system from Apache Solr – Downloads

(ii) Go to the Solr directory and start Solr with the demo data provided by Apache Solr by running the following command:

bin/solr -e techproducts
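While the SolrJ code comes later in the post, the grouping request itself boils down to a handful of HTTP parameters. Here is a stdlib-only Java sketch that assembles them against the techproducts core started above. The grouping parameters (group, group.field, group.limit) are standard Solr; the field name manu_id_s is an assumption about the example schema — substitute any string field of your own.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class GroupQuerySketch {

    /**
     * Builds the request URL for a grouped search against the techproducts core.
     * The field name passed in (e.g. "manu_id_s") is an assumption based on the
     * techproducts example schema.
     */
    static String buildGroupQuery(String field) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "*:*");            // match all documents
        params.put("group", "true");       // enable result grouping
        params.put("group.field", field);  // group documents by this field
        params.put("group.limit", "3");    // documents returned per group
        StringJoiner query = new StringJoiner("&");
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                query.add(e.getKey() + "=" + URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            throw new AssertionError("UTF-8 is always supported", e);
        }
        return "http://localhost:8983/solr/techproducts/select?" + query;
    }

    public static void main(String[] args) {
        System.out.println(buildGroupQuery("manu_id_s"));
    }
}
```

The same parameters are what SolrJ sets on a SolrQuery under the hood when you enable grouping.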

Continue reading


Data modeling in Cassandra

Role of Partitioning & Clustering Keys in Cassandra

Primary and clustering keys are among the very first things you should learn about when modeling Cassandra data. In this post I will cover the different types of primary keys, how they can be used, what their purpose is, and how they affect your queries.

Primary key

Primary Keys are defined when you create your table. The most basic primary key is a single column. A single column is great for when you know the value that you will be searching for. The following table has a single column, comp_id, as the primary key:

CREATE TABLE company_Information (
comp_id text,
name text,
city text,
state_province text,
country_code text,
PRIMARY KEY (comp_id)
);

A single column Primary Key is also called a Partition Key.  When Cassandra is deciding where in the cluster to store this particular piece of data, it will hash the partition key. The value of that hash dictates where the data will reside and which replicas will be responsible for it.

Partition Key

The Partition Key is responsible for the distribution of data among the nodes. Suppose there are four nodes A, B, C, and D, and assume hash values fall between 0 and 100, with the ranges 0-25, 25-50, 50-75, and 75-100 assigned to nodes A, B, C, and D respectively. When we insert the first row into the company_Information table, the value of comp_id will be hashed. Let’s assume that the first record hashes to 34. That falls into the range assigned to node B, the second node.
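The bucket arithmetic from this example can be sketched as a toy in Java (illustrative only; Cassandra’s real partitioner, Murmur3 by default, works over a much larger token range):

```java
public class TokenRingSketch {

    // Toy ring from the example: hash range 0-100 split evenly across four nodes.
    static final String[] NODES = {"A", "B", "C", "D"};

    /** Maps a toy hash value in [0, 100) to its node: 0-25 -> A, 25-50 -> B, 50-75 -> C, 75-100 -> D. */
    static String nodeFor(int hash) {
        return NODES[hash / 25];
    }

    public static void main(String[] args) {
        System.out.println(nodeFor(34)); // the example record lands on node B
    }
}
```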

Compound Key

  • Primary Keys can also span more than one column.
  • A multi-column Primary Key is called a Compound Key.

CREATE TABLE company_Information (
comp_id text,
name text,
city text,
state_province text,
country_code text,
PRIMARY KEY (country_code, city, name, comp_id)
);

This example has four columns in the Primary Key clause. An interesting characteristic of Compound Keys is that only the first column is considered the Partition Key. The rest of the columns in the Primary Key clause are Clustering Keys.

Order By

You can change the default sorting order from ascending to descending with an ORDER BY clause. There is an additional WITH clause that you need to add to the CREATE TABLE to make this possible.

CREATE TABLE company_Information (
comp_id text,
name text,
city text,
state_province text,
country_code text,
PRIMARY KEY (country_code, city, name, comp_id)
) WITH CLUSTERING ORDER BY (city DESC, name ASC, comp_id DESC);

Now we’ve changed the ordering of the Clustering Keys to sort city in descending order. Did you notice that I did not specify a sort for country_code? Since it’s the partition key, there is nothing to sort, as hashed values won’t be close to each other in the cluster.
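To see what that ordering means, here is a plain-Java sketch that sorts rows of the hypothetical company_Information partition the same way the clause specifies — city descending, then name ascending, then comp_id descending. This is illustrative only; Cassandra maintains this order on disk, not at query time.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class ClusteringOrderSketch {

    /** A row within a single country_code partition; fields mirror the clustering columns. */
    static final class Row {
        final String city, name, compId;
        Row(String city, String name, String compId) {
            this.city = city; this.name = name; this.compId = compId;
        }
    }

    // city DESC, name ASC, comp_id DESC -- mirrors the WITH CLUSTERING ORDER BY clause.
    static final Comparator<Row> CLUSTERING_ORDER =
            Comparator.comparing((Row r) -> r.city, Comparator.reverseOrder())
                      .thenComparing((Row r) -> r.name)
                      .thenComparing((Row r) -> r.compId, Comparator.reverseOrder());

    static List<Row> sorted(List<Row> rows) {
        return rows.stream().sorted(CLUSTERING_ORDER).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Row> out = sorted(Arrays.asList(
                new Row("Delhi", "Acme", "1"),
                new Row("Pune", "Acme", "2"),
                new Row("Pune", "Acme", "9")));
        out.forEach(r -> System.out.println(r.city + " " + r.name + " " + r.compId));
    }
}
```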

Clustering Keys

Each additional column that is added to the Primary Key clause is called a Clustering Key. A clustering key is responsible for sorting data within the partition. In our example company_Information table, country_code is the partition key with city, name & comp_id acting as the clustering keys. By default, the clustering key columns are sorted in ascending order.

Composite Key

A Composite Key is when you have a multi-column Partition Key. The above example only used country_code for partitioning. This means that all records with a country_code value of “INDIA” are in the same partition. Avoiding wide rows is the perfect reason to move to a Composite Key. Let’s change the Partition Key to include the comp_id & city columns. We do this by nesting parentheses around the columns that are to form the Composite Key, as follows:

CREATE TABLE company_Information (
comp_id text,
name text,
city text,
state_province text,
country_code text,
PRIMARY KEY ((country_code, city, comp_id), name)
);

What this does is change the hash value from being calculated off of only country_code. Now it will be calculated off of the combination of country_code, city & comp_id. Each combination of the three columns has its own hash value and will be stored in a completely different partition in the cluster.


Installing and Running Presto

Hi folks!
In my previous blog, I talked about Getting Introduced with Presto.
In today’s blog, I shall talk about setting up (installing) and running Presto.

The basic pre-requisites for setting up Presto are:

  • Linux or Mac OS X
  • Java 8, 64-bit
  • Python 2.4+


  1. Download the Presto Tarball from here
  2. Unpack the Tarball
  3. After unpacking you will see a directory presto-server-0.175 which we will call the installation directory.


Inside the installation directory, create a directory called etc. This directory will hold the following configurations:

  1. Node Properties: environmental configuration specific to each node
  2. JVM Config: command line options for the Java Virtual Machine
  3. Config Properties: configuration for the Presto server
  4. Catalog Properties: configuration for Connectors (data sources)
  5. Log Properties: configuration of the log levels

Now we will set up the above properties one by one.

Step 1 : Setting up Node Properties

Create a file called node.properties inside the etc folder. This file will contain the configuration specific to each node. Given below is a description of the properties we need to set in this file:

  • node.environment: The name of the Presto environment. All the nodes in the cluster must have the same environment name.
  • node.id: This is the unique identifier for every node.
  • node.data-dir: The path of the data directory.

Note: Presto stores the logs and other data at the location specified in node.data-dir. It is recommended to create the data directory outside the installation directory; this allows easy preservation during upgrades.
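As a sketch, a minimal node.properties could look like this — the values below are placeholders (an environment name, a unique id per node, and a data path); use your own:

```properties
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data
```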

Continue reading


Avro Communication over TCP Sockets

Storing or transferring objects is a requirement of most applications. What if there is a need for communication between machines having incompatible architectures? Java Serialization won’t work for that. If you are now thinking about a serialization framework, then you are right. So, let’s start with one such serialization framework: Apache Avro.

What is Avro?

Apache Avro is a language-neutral data serialization system. It’s a schema-based system which serializes data having a built-in schema into a compact binary format, after which the data can be deserialized by any application having the same schema.

In this post, I will demonstrate how to read a schema using the parser library and send/receive the serialized data over a Java socket.

Let’s create a new Maven project and add the Avro dependency in pom.xml.
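For reference, a typical Avro dependency entry might look like the following — the version number 1.8.1 is only an example; use whichever release you need:

```xml
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>1.8.1</version>
</dependency>
```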


Now, create a new Avro schema file in the schema directory of the project, say employee.avsc:

{
 "namespace": "example.avro",
 "type": "record",
 "name": "employee",
 "fields": [
  {"name": "Name", "type": "string"},
  {"name": "id", "type": "string"}
 ]
}
Here we are considering an employee schema having a name and an id.

Instantiate the Schema.Parser class by passing the file path where the schema is stored to its parse method.

Schema schema = new Schema.Parser().parse(new File("src/schema/employee.avsc"));

After parsing the schema, we need to create a record using GenericData and store data in it using its put method.

Continue reading


Partitioning in Apache Hive


Hive is a good tool for performing queries on large datasets, especially datasets that require full table scans. But quite often there are instances where users need to filter the data on specific column values. That’s where partitioning comes into play: a partition is nothing but a directory which contains a chunk of the data, and when we partition, we create a partition for each unique value of the column.

Let’s run a simple example to see how it works.

The syntax to create a partitioned table is:

create table tablename(colname type) partitioned by(colname type);

If hive.exec.dynamic.partition.mode is set to strict, then you need to do at least one static partition. In non-strict mode, all partitions are allowed to be dynamic.
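For example, to allow fully dynamic partitions for the current session you could run the following (a sketch; these are the standard Hive properties, which you can also set permanently in hive-site.xml):

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
```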



Here we create a table named empinfo with two fields, name and address. We partition the table by the column id of type int, and then we insert values into this table.

It’s important to consider the cardinality of the column that will be partitioned on. Selecting a column with high cardinality will result in fragmentation of data. Do not over-partition the data: with too many small partitions, the task of recursively scanning the directories becomes more expensive than a full table scan of the table.

The syntax for inserting values is:

insert into table tablename partition(partcol=value) values(...);

First we insert a record with id=1, then we insert another record with id=2.


Now go to the /user/hive/warehouse/default/empinfo directory in your HDFS.


As we can see, there are two partitions, one named id=1 and the other id=2. Now when a select query is fired with a where clause, it will not scan the full table; it will only scan the required partition.


If you try the same query on a non-partitioned table with a large dataset, it will take more time in comparison because it will have to go through a full table scan.

I hope this blog will be helpful. Happy coding!



Understanding the Optimized Logical Plan in Spark

A LogicalPlan is a tree that represents both schema and data. These trees are manipulated and optimized by the Catalyst framework.

There are three types of logical plans:
○ Parsed logical plan
○ Analyzed logical plan
○ Optimized logical plan

The analyzed logical plan goes through a series of optimization rules, and the optimized plan is produced.

The optimizer allows Spark to plug in a set of optimization rules, and developers can even plug their own rules into the optimizer.

This optimized logical plan gets converted to a physical plan for further execution. These plans live inside the DataFrame API. Now let’s run an example to see these plans and the differences between them.


Using our RDD, we created a DataFrame with columns c1, c2, c3 and data values 1 to 100.
To see the plan of a DataFrame, we use the explain command. If you run it without the true argument, it gives only the physical plan; the physical plan is always over RDDs.
To see all three plans, run the explain command with the true argument; explain also shows the physical plan.
If we look at that first example, all the plans look the same, so what is the difference between the optimized logical plan and the analyzed logical plan? Now run the example with two filters.
Here is the actual difference:
== Analyzed Logical Plan ==
c1: string, c2: string, c3: string
Filter NOT (cast(c2#14 as double) = cast(0 as double))
+- Filter NOT (cast(c1#13 as double) = cast(0 as double))
+- LogicalRDD [c1#13, c2#14, c3#15]

== Optimized Logical Plan ==
Filter (((isnotnull(c1#13) && NOT (cast(c1#13 as double) = 0.0)) && isnotnull(c2#14)) && NOT (cast(c2#14 as double) = 0.0))
+- LogicalRDD [c1#13, c2#14, c3#15]
In the optimized logical plan, Spark does the optimization itself: it sees that there is no need for two filters, because the same task can be done with only one filter using the and operator.
So it does the execution in one filter.
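The same collapse can be sketched outside Spark with plain Java streams: two chained filters and one predicate combined with and() select exactly the same rows, which is why Catalyst can safely rewrite the plan. This is a toy analogy, not Spark’s implementation; the modulo predicates are arbitrary stand-ins for the casts in the real plan.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class FilterFusionSketch {

    static final Predicate<Integer> C1_NON_ZERO = v -> v % 3 != 0; // stand-in for NOT (c1 = 0)
    static final Predicate<Integer> C2_NON_ZERO = v -> v % 5 != 0; // stand-in for NOT (c2 = 0)

    /** Two chained filters, like the two Filter nodes in the analyzed plan. */
    static List<Integer> twoFilters(List<Integer> rows) {
        return rows.stream().filter(C1_NON_ZERO).filter(C2_NON_ZERO).collect(Collectors.toList());
    }

    /** One fused filter joined with and(), like the single Filter node in the optimized plan. */
    static List<Integer> fusedFilter(List<Integer> rows) {
        return rows.stream().filter(C1_NON_ZERO.and(C2_NON_ZERO)).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 10, 15);
        System.out.println(twoFilters(data).equals(fusedFilter(data))); // both select the same rows
    }
}
```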



Multiple Feeds at one place: MultiFeed App

Okay, what if I tell you there is an app? :D Ever feel like having an app where you can add all your interesting blog feeds?

Here it is. Maybe there are many others available in the Play Store, but this one is simple — actually, very simple: just add your interesting blog feed URL, get the top 20 feeds as a simple list, click a post, read it in the app, close the app. Done.

Here is a simple running image to show the working of this multi feed app.

Can’t wait? OK, just download it for free from the Play Store.


Keep Learning Keep Sharing 🙂


Starting Hive-Client Programmatically With Scala

Hive defines a simple SQL-like query language for querying and managing large datasets, called HiveQL (HQL). It’s easy to use if you’re familiar with SQL. Hive also allows programmers who are familiar with the MapReduce framework to plug in custom logic to perform more sophisticated analysis.

In this blog, we will learn how to create a Hive client with Scala to execute basic HQL commands. First, create a Scala project with Scala version 2.12.

Now add the following settings in your build.sbt file:

name := "hive_cli_client"

version := "1.0"

scalaVersion := "2.12.2"

libraryDependencies += "org.apache.hive" % "hive-exec" % "1.2.1" excludeAll
                       ExclusionRule(organization = "org.pentaho")

libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.7.3"

libraryDependencies += "org.apache.httpcomponents" % "httpclient" % "4.3.4"

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0"

libraryDependencies += "org.apache.hive" % "hive-service" % "1.2.1"

libraryDependencies += "org.apache.hive" % "hive-cli" % "1.2.1"

libraryDependencies += "org.scalatest" % "scalatest_2.12" % "3.0.3"

In my case I am using Hive 1.2.1 (as in the dependencies above), but you can use any version. Let the dependencies resolve, then add a Scala class to your project named HiveClient:

package cli

import java.io.IOException

import scala.util.Try

import org.apache.hadoop.hive.cli.CliSessionState
import org.apache.hadoop.hive.conf.HiveConf
import org.apache.hadoop.hive.ql.Driver
import org.apache.hadoop.hive.ql.session.SessionState

/**
 * Hive meta API client for testing purposes
 * @author Anubhav
 */
class HiveClient {

  val hiveConf = new HiveConf(classOf[HiveClient])

  /**
   * Get the hive ql driver to execute ddl or dml
   * @return the driver with a started CLI session
   */
  private def getDriver: Driver = {
    val driver = new Driver(hiveConf)
    SessionState.start(new CliSessionState(hiveConf))
    driver
  }

  /**
   * Execute the given hql statement and return its response code
   * @param hql the statement to run
   * @throws org.apache.hadoop.hive.ql.CommandNeedRetryException
   * @return int response code (0 on success)
   */
  def executeHQL(hql: String): Int = {
    val responseOpt = Try(getDriver.run(hql)).toEither
    val response = responseOpt match {
      case Right(response) => response
      case Left(exception) => throw new Exception(s"${ exception.getMessage }")
    }
    val responseCode = response.getResponseCode
    if (responseCode != 0) {
      val err: String = response.getErrorMessage
      throw new IOException("Failed to execute hql [" + hql + "], error message is: " + err)
    }
    responseCode
  }
}


It has one public method, executeHQL, which calls the private method getDriver to get a Hive Driver instance and execute the HQL with it. This method gives back the response code.

Now write a test case to test this Hive client:

import cli.HiveClient
import org.scalatest.FunSuite

class HiveClientTest extends FunSuite {

  val hiveClient = new HiveClient

  test("testing for the hql query") {
    assert(hiveClient.executeHQL("DROP TABLE IF EXISTS DEMO") == 0)
    assert(hiveClient.executeHQL("CREATE TABLE IF NOT EXISTS DEMO(id int)") == 0)
    assert(hiveClient.executeHQL("INSERT INTO DEMO VALUES(1)") == 0)
    assert(hiveClient.executeHQL("SELECT * FROM DEMO") == 0)
    assert(hiveClient.executeHQL("SELECT COUNT(*) FROM DEMO") == 0)
  }
}



Now run these test cases.

I hope this blog will be helpful. Happy coding!



Getting Started with Apache Spark


Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 in UC Berkeley’s AMPLab, and open sourced in 2010 as an Apache project.
Spark has several advantages compared to other big data and Map Reduce technologies like Hadoop and Storm.
Apache Spark is an improvement on the original MapReduce component of the Hadoop big data ecosystem.

Features Of Spark

  • Supports more than just Map and Reduce functions.
  • Lazy evaluation of big data queries which helps with the optimization of the overall data processing workflow.
  • Provides concise and consistent APIs in Scala, Java and Python.
  • Offers an interactive shell for Scala and Python. This is not available for Java yet.
  • Spark is written in Scala Programming Language and runs on Java Virtual Machine (JVM) environment.

Spark Ecosystem

Continue reading



Hi all,

Knoldus organized a 30-minute session on 24th March 2017 at 4:50 PM. The topic was Pure.css. Many people joined and enjoyed the session. I am going to share the slides here. Please let me know if you have any questions related to the linked slides.

Here’s the video of the session:

For any queries, ask in the comments.

