Spark SQL

Kafka And Spark Streams: The happily ever after !!

Hi everyone, today we are going to look at using Spark Streaming to transform and transport data between Kafka topics. The demand for stream processing is increasing every day, because processing big volumes of data is often not enough. We need real-time processing of data, especially when we need to handle continuously increasing volumes of data and also need … Continue Reading

What’s new in Apache Spark 2.2

Apache recently released a new version of Spark, Apache Spark 2.2. The new version brings improvements as well as new functionality. The headline change in this release is Structured Streaming: it has been marked as production ready and its experimental tag has been removed. Some of the high-level changes and improvements: production-ready Structured Streaming; expanded SQL functionality; new distributed … Continue Reading

Partition-Aware Data Loading in Spark SQL

Data loading, in Spark SQL, means loading data into the memory/cache of Spark worker nodes. For this we usually write code like the following:

    import java.util.Properties

    val connectionProperties = new Properties()
    connectionProperties.put("user", "username")
    connectionProperties.put("password", "password")
    val jdbcDF = spark.read
      .jdbc("jdbc:postgresql:dbserver", "schema.table", connectionProperties)

Here we are using the jdbc function of the DataFrameReader API of Spark SQL to load the data from the table into the Spark executors' memory, no matter how many rows are … Continue Reading

Cassandra with Spark 2.0 : Building Rest API !

In this tutorial, we will demonstrate how to build a REST service in Spark using Akka HTTP as a sidekick 😉 and Cassandra as the data store. We have seen the power of Spark earlier, and when it is combined with Cassandra in the right way it becomes even more powerful. Earlier we saw how to build a REST API on Spark and Couchbase … Continue Reading

UDF overloading in spark

UDFs are user-defined functions that are registered with the Hive context so that custom functions can be used in Spark SQL queries. For example, if you want to prepend a string to another string or column, you can create the following UDF:

    def addSymbol(input: String, symbol: String): String = {
      symbol + input
    }

To use the above function in hiveContext, we need to register it as a UDF as follows:

    hiveContext.udf.register("addSymbol", (input: String, symbol: String) => addSymbol(input, symbol))

Now you can use … Continue Reading

Meetup: An Overview of Spark DataFrames with Scala

Knoldus organized a Meetup on Wednesday, 18 Nov 2015. In this Meetup, we gave an overview of Spark DataFrames with Scala. Apache Spark is a distributed compute engine for large-scale data processing. A wide range of organizations are using it to process large datasets. Many Spark and Scala enthusiasts attended this session and learned why DataFrames are the best fit for building an application in Spark with Scala … Continue Reading

Meetup: Introduction to Spark with Scala

Knoldus organized a Meetup on Wednesday, 1 April 2015. In this Meetup, we gave a brief introduction to Spark with Scala. Apache Spark is a fast and general engine for large-scale data processing. A wide range of organizations are using it to process large datasets. Many Spark and Scala enthusiasts attended this session and gained insight into Apache Spark. The examples shown in the slides above can be downloaded from here.

Play with Spark: Building Spark SQL in a Play Spark Application

In the last post of our Play with Spark! series, we saw how to integrate Spark Streaming into a Play Scala application. In this post we will see how to add Spark SQL to a Play Scala application. Spark SQL is a powerful component of Apache Spark. It allows relational queries, expressed in SQL, HiveQL, or Scala, to be executed using Spark. Apache Spark has a new … Continue Reading
