Partition-Aware Data Loading in Spark SQL

Data loading, in Spark SQL, means loading data into the memory/cache of the Spark worker nodes, for which we usually write the following code:

import java.util.Properties

val connectionProperties = new Properties()
connectionProperties.put("user", "username")
connectionProperties.put("password", "password")
// spark is an existing SparkSession
val jdbcDF =
  .jdbc("jdbc:postgresql:dbserver", "schema.table", connectionProperties)

Here we are using the jdbc function of Spark SQL's DataFrameReader API to load the data from the table into the Spark executor's memory, no matter how many rows the table contains.

Here is an example of jdbc implementation:

val df ="jdbcUrl", "person", connectionProperties)

In the code above, the data will be loaded into the Spark cluster. But only one worker will iterate over the table and try to load the whole of it into its memory, because only one partition is created. This might work if the table contains a few hundred thousand records. This fact can be confirmed by the following snapshot of the Spark UI:


But what happens if the table that needs to be loaded into the Spark cluster contains 10, 20 or 50 million rows? In that case, loading the whole table into one Spark worker node will not be very efficient, as it will take both more time and more memory.

To improve this situation, we can load the table into the Spark cluster in a partition-aware manner, i.e., asking each worker node to pick up its own partition of the table from the database and load it into its memory. For that we use the same jdbc function of the DataFrameReader API, but in a different way:

val df ="jdbcUrl", "person", "personId", 0, 2944089, 4, connectionProperties)

In the code above, we have provided the personId column, along with its minimum and maximum values, using which Spark will partition the rows: partition 1 will contain rows with IDs from 0 to 736022, partition 2 will contain rows with IDs from 736023 to 1472045, and so on. You can increase the number of partitions if you want, but keep in mind that the more partitions there are, the more connections Spark will open to the database to fetch the data.
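To see what this partitioning actually means, here is a simplified sketch (not Spark's exact implementation, just the idea) of how a partitioned jdbc() read can split the [lowerBound, upperBound] range into one WHERE clause per partition:

```scala
// Simplified sketch (not Spark's exact code) of how a partitioned jdbc()
// read turns [lower, upper] into one WHERE clause per partition.
def partitionClauses(column: String, lower: Long, upper: Long, numPartitions: Int): Seq[String] = {
  val stride = (upper - lower) / numPartitions
  (0 until numPartitions).map { i =>
    val lo = lower + i * stride
    if (i == 0) s"$column < ${lo + stride}"            // first range also catches values below lower
    else if (i == numPartitions - 1) s"$column >= $lo" // last range also catches values above upper
    else s"$column >= $lo AND $column < ${lo + stride}"
  }
}
```

For the bounds used above (0 to 2944089, 4 partitions), this yields predicates like personId < 736022, …, personId >= 2208066, and each worker runs the table query with its own predicate appended.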

Now, when Spark loads the data from the person table, each worker node will fetch only the partition it is asked to load by the master node. The following snapshot of the Spark UI confirms that each node is loading only its own partition from the database:


Here we can see that the person table is partitioned into 4 parts, which are being loaded by different executors in parallel.

In this way we can ease the pain of loading large data sets from an RDBMS into a Spark cluster by leveraging Spark's partition-aware data loading.

If you have any doubts or suggestions, kindly leave a comment!


Posted in Scala, Spark | 2 Comments

Jmeter-Database testing with jmeter

Now we come to a new topic: database testing with JMeter. For database testing we have to download the mysql-connector-java jar file and place it in the lib folder of JMeter. Then we start with the Thread Group.



Now we make a JDBC connection configuration.



In the JDBC Connection Configuration we define the database URL, JDBC driver class, MySQL username, password, etc.



Now we add a JDBC Request sampler.


In the JDBC Request we define the query; for example, if we run a select query, we choose Select Statement as the query type.


We can see the result in the View Results Tree listener.



We can also see the result in the Summary Report.




Posted in Scala | Leave a comment

Jenkins Build Jobs

In continuation of my previous blogs, Introduction to Jenkins and Jenkins – Manage Security, I will now talk about creating build jobs with Jenkins.
It is easy and simple to create a new build job in Jenkins. Follow the given steps to get started:
  • From the Jenkins Dashboard, Click on “New Item”
  • Name your project and select project type.


  • Click on “Ok” to continue.
Understanding Build Job Types
Jenkins supports different types of build jobs. Some of them are :
  • Freestyle software project
Freestyle build jobs are general-purpose build jobs, which provide maximum flexibility. This is the central feature of Jenkins: it will build your project, combining any SCM with any build system, and it can even be used for something other than a software build.
  • Maven project
Builds a Maven project. Jenkins takes advantage of your POM files and drastically reduces the configuration.
  • External Job
This type of job allows you to record the execution of a process run outside Jenkins, even on a remote machine. This is designed so that you can use Jenkins as a dashboard of your existing automation system.

Continue reading

Posted in Tutorial, integration, testing, Performance Testing | 1 Comment

Creating A Simple Hive Udf In Scala

Sometimes the query you want to write can't be expressed easily (or at all) using the built-in functions that Hive provides. By allowing you to write a user-defined function (UDF), Hive makes it easy to plug in your own processing code and invoke it from a Hive query. UDFs are usually written in Java, the language that Hive itself is written in, but in this blog we will write one in Scala.

A UDF must satisfy the following two properties:

• A UDF must be a subclass of org.apache.hadoop.hive.ql.exec.UDF.

• A UDF must implement at least one evaluate() method.

The evaluate() method is not defined by an interface, since it may take an arbitrary number of arguments, of arbitrary types, and it may return a value of arbitrary type.

Hive introspects the UDF to find the evaluate() method that matches the Hive function that was invoked.
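As an illustration of this overload resolution, here is a hypothetical class (written without the Hive dependency so you can try it in a plain Scala REPL; a real UDF would additionally extend org.apache.hadoop.hive.ql.exec.UDF) with two evaluate() overloads:

```scala
// Hypothetical sketch: several evaluate() overloads in one UDF class.
// Hive introspects the class and picks the overload whose signature
// matches the arguments of the Hive function call.
class TrimUdf /* extends org.apache.hadoop.hive.ql.exec.UDF in a real UDF */ {
  def evaluate(str: String): String = str.trim
  def evaluate(str: String, ch: String): String = {
    // two-argument variant: trim, then strip a custom character from both ends
    str.trim.stripPrefix(ch).stripSuffix(ch)
  }
}
```

Calling trim('x') from Hive would dispatch to the one-argument overload, and trim('x', '*') to the two-argument one.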

Let's get started. The Scala version I am using is 2.11. Now add the following properties to your build.sbt file:

name := "hiveudf_example"

version := "1.0"

scalaVersion := "2.11.1"

unmanagedJars in Compile += file("/usr/lib/hive/lib/hive-exec-2.0.0.jar")

The path in the file is the path of your Hive home; I am hardcoding it here, but you can give yours. Create your main file as follows:

package com.knoldus.udf

import org.apache.hadoop.hive.ql.exec.UDF

class Scala_Hive_Udf extends UDF {

  def evaluate(str: String): String = {
    str.trim
  }
}
I am creating a UDF for the trim method in Hive; you can create any method you want. The next task is to create an assembly for the project, so add the sbt-assembly plugin to your plugins.sbt file:

logLevel := Level.Warn

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

The next step is to create the jar. Go to your sbt console and run the command:

sbt assembly

You can find your jar inside the target folder. Now submit this jar to Hive as a UDF: first start Hive using the hive command, then submit the jar using the ADD JAR command followed by the path of your jar:

Logging initialized using configuration in jar:file:/home/knoldus/Documents/apache-hive-1.2.1-bin/lib/hive-common-1.2.1.jar!/
hive> ADD JAR /home/knoldus/Desktop/opensource/hiveudf_example/target/scala-2.11/hiveudf_example-assembly-1.0.jar
> ;
Added [/home/knoldus/Desktop/opensource/hiveudf_example/target/scala-2.11/hiveudf_example-assembly-1.0.jar] to class path

Create a function with this UDF:

hive> CREATE FUNCTION trim AS 'com.knoldus.udf.Scala_Hive_Udf';
Time taken: 0.47 seconds

Now we will call this function as below:

hive> select trim(" hello ");
OK
hello
Time taken: 1.304 seconds, Fetched: 1 row(s)

This is the simplest way to create a UDF in Hive. I hope this blog helps. Happy coding!


Posted in Scala | Leave a comment

Hotfix To Install Latest Oracle JDK On Linux EC2 Instance (CentOS)

In this blog we will explore the procedure to install the latest Java 8 version, i.e., JDK 8u121 (released on 17th January, 2017), on an EC2 Linux instance which comes with CentOS as its default operating system. The following steps must be followed sequentially:

Step 1 -> SSH to your instance

Refer to Step 4 of this blog to SSH to your instance.

Step 2 -> Check the Java version

java -version


As can be seen in the image above, Java 7 is installed on the EC2 instance by default. We will now update to Java 8.

Step 3 -> Download RPM package of Oracle JDK (8u121)

Continue reading

Posted in Scala | Leave a comment

Knolx: An Introduction to BDD

Hi all,

Knoldus organized a one-hour session on 23rd Dec 2016 at 4:00 PM. The topic was "An Introduction to BDD". Many people joined and enjoyed the session. I am sharing the slides here. Please let me know if you have any questions related to the linked slides.


Posted in Scala | Leave a comment

Blending Cucumber, Cassandra and Akka-Http


Knoldus has always pioneered deep dives into the best ways to use cutting-edge technologies. In the past few days, one of our teams did just that by integrating Cucumber with Akka-Http, Cassandra and, of course, Scala. In this blog, we explain and show how this can be done.


Cucumber is a tool for Behavior Driven Development (BDD). The approach of Cucumber is to write the behaviors of the application and then run them for acceptance testing.


Akka-Http is a general toolkit provided by Akka to implement HTTP services. It supports both client-side and server-side services.


Cassandra is a database that provides high scalability and availability without compromising performance.

Continue reading

Posted in Akka, akka-http, Cassandra, Cucumber, database, knoldus, sbt, Scala, scalatest, Test, testing, tests | 6 Comments

Application compatibility for different Spark versions

Recently Spark version 2.1 was released, and there are significant differences between the two versions.

Spark 1.6 has DataFrame and SQLContext, while 2.1 has Dataset and SparkSession.

Now the question arises: how do we write code so that both versions of Spark are supported?

Fortunately, Maven provides the feature of building your application with different profiles.
In this blog I will show you how to make your application compatible with different Spark versions.

Let's start by creating an empty Maven project.
You can use the maven-archetype-quickstart archetype to set up your project.
Archetypes provide a basic template for your project and maven has a rich collection of these templates for all your needs.

Once the project setup is done, we need to make 3 modules. Let's name them core, spark and spark2, and set the artifact id of each module to its respective name.
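As a rough sketch (the profile ids, version numbers and property names here are illustrative, not taken from a real build), the parent pom.xml could declare one profile per Spark version, with each profile pulling in the matching module:

```xml
<!-- Illustrative fragment of a parent pom.xml: one profile per Spark version.
     Activate with: mvn package -P spark-1.6   or   mvn package -P spark-2.1 -->
<profiles>
  <profile>
    <id>spark-1.6</id>
    <properties>
      <spark.version>1.6.3</spark.version>
    </properties>
    <modules>
      <module>core</module>
      <module>spark</module>
    </modules>
  </profile>
  <profile>
    <id>spark-2.1</id>
    <properties>
      <spark.version>2.1.0</spark.version>
    </properties>
    <modules>
      <module>core</module>
      <module>spark2</module>
    </modules>
  </profile>
</profiles>
```

Each version-specific module can then reference ${spark.version} in its own dependencies, so only the selected module is compiled against the selected Spark version.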

Continue reading

Posted in apache spark, Java, Scala, Spark | 3 Comments

Knoldus Bags the Prestigious Huawei Partner of the Year Award

Knoldus was humbled to receive the prestigious partner of the year award from Huawei at a recently held ceremony in Bangalore, India.


It means a lot for us and is a validation of the quality and focus that we put on the Scala and Spark Ecosystem. Huawei recognized Knoldus for the expertise in Scala and Spark along with the excellent software development process practices under the Knolway™ umbrella. Knolway™ is the Knoldus Way of developing software which we have curated and refined over the past 6 years of developing Reactive and Big Data products.

Our heartiest thanks to Mr. V.Gupta, Mr. Vadiraj and Mr. Raghunandan for this honor.


About Huawei

Huawei is a leading global information and communications technology (ICT) solutions provider. Driven by responsible operations, ongoing innovation, and open collaboration, we have established a competitive ICT portfolio of end-to-end solutions in telecom and enterprise networks, devices, and cloud computing. Our ICT solutions, products, and services are used in more than 170 countries and regions, serving over one-third of the world’s population. With more than 170,000 employees, Huawei is committed to enabling the future information society and building a Better Connected World.

About Knoldus
Knoldus specializes in building reactive products using the Scala ecosystem and data science platforms using Spark. Our strong background in Scala makes us attractive when you need deep Spark understanding. We work on a broad variety of big data initiatives like knowledge graph systems (ontology and semantic web), time series and real-time data (IoT), and structured data (relational and NoSQL), and cover varied domains including Pharma, Manufacturing, and Finance.

Knoldus is active in the Open source community with contributions to various libraries like Spark, Scala, Akka, Neo4j, Deeplearning4j etc. Knoldus is known for its tech blogs and open source contributions in the area of big data and functional programming. Some of the known direct clients of Knoldus include Huawei, Philips, Schlumberger, Elsevier, BoA and quite a few niche startups.

Posted in Akka, Scala, Spark | 5 Comments

Mounting an EBS (Elastic Block Store) volume to an EC2 (Elastic Compute Cloud) instance

In this blog we will go over the simple steps by which we can attach and mount a volume (Elastic Block Store) to an EC2 instance. The following steps must be followed:

Step 1 -> Creating an ec2 instance

We will be creating a Linux instance (m4.large) which comes with CentOS as the default operating system. The steps for creating an instance are clearly mentioned in the official Amazon docs; check here.

Step 2 -> Creating a volume

1) On the left panel of the EC2 console, click on Volumes, present under the heading ELASTIC BLOCK STORE.

2) Click on Create Volume button.

3) In the box that appears, enter the size of the volume you want to attach in Size (GiB), and under the Availability Zone field, enter the same availability zone as your instance.


The volume is created once its State changes from creating (yellow) to available (blue).


Step 3 -> Attaching volume with the instance

Continue reading

Posted in Amazon, Amazon EC2, AWS, AWS Services, Cloud, Devops, Scala, scaladays | 1 Comment