Author: Pinku Swargiary

Reading SBT dependency tree

Reading Time: 3 minutes In this post, we are going to look into reading the sbt dependency tree and resolving one such scenario using an example. While upgrading the library versions in our repository, we often run into issues such as incompatibilities between library versions. In such situations, dependencyTree is one of the tools that can help us peek into the different library versions our build Continue Reading
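A rough sketch of the kind of setup the post describes (the plugin coordinates, versions, and the jackson-databind example below are assumptions, not taken from the excerpt): on sbt versions before 1.4 the dependencyTree task comes from the sbt-dependency-graph plugin, and once `sbt dependencyTree` reveals two conflicting versions of a transitive library, one version can be pinned with dependencyOverrides.

```scala
// project/plugins.sbt -- assumed plugin coordinates; sbt 1.4+ ships a built-in
// dependencyTree task, so this line may not be needed on newer sbt versions.
addSbtPlugin("net.virtual-void" % "sbt-dependency-graph" % "0.10.0-RC1")

// build.sbt -- hypothetical fix: after `sbt dependencyTree` shows two versions
// of a transitive library on the classpath, pin it to a single version.
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.13.4"
```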

Writing Unit Test for Apache Spark using Memory Streams

Reading Time: 2 minutes In this post, we are going to look into how we can leverage Apache Spark’s memory streams for unit testing. What is it? Apache Spark’s MemoryStream is a concrete streaming source backed by in-memory data that supports reading in Micro-Batch Stream Processing. Let’s jump into it. We will be using a memory stream, writing some test data into memory as a stream. We Continue Reading
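A minimal sketch of the technique the excerpt describes (not the post’s exact test): feed test data through a MemoryStream, run the transformation under test, and collect the output with the in-memory sink so it can be asserted on.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.MemoryStream

object MemoryStreamSketch extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("memory-stream-test")
    .getOrCreate()
  import spark.implicits._
  implicit val sqlCtx = spark.sqlContext

  val input = MemoryStream[Int]          // streaming source backed by memory
  input.addData(1, 2, 3)                 // enqueue a micro-batch of test data

  val doubled = input.toDS().map(_ * 2)  // the logic under test

  val query = doubled.writeStream
    .format("memory")                    // in-memory sink, queryable as a table
    .queryName("result")
    .outputMode("append")
    .start()
  query.processAllAvailable()            // drain all pending micro-batches

  val out = spark.table("result").as[Int].collect().sorted
  assert(out.sameElements(Array(2, 4, 6)))
  query.stop()
  spark.stop()
}
```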

All you need to know about Avro schema

Reading Time: 4 minutes In this post, we are going to dive into the basics of the Avro schema. We will create a sample Avro schema, serialize data to a sample output file, and also read the file back according to the Avro schema as an example. Intro to Avro Apache Avro is a data serialization system developed by Doug Cutting, the father of Hadoop, that helps with data Continue Reading
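A minimal sketch of the workflow the excerpt outlines, using the Apache Avro generic API (the record schema, field names, and file name here are illustrative assumptions, not the post’s actual example):

```scala
import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.{DataFileReader, DataFileWriter}
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}

object AvroSketch extends App {
  // A hypothetical record schema with two fields.
  val schemaJson =
    """{
      |  "type": "record",
      |  "name": "User",
      |  "namespace": "com.example",
      |  "fields": [
      |    {"name": "name", "type": "string"},
      |    {"name": "age",  "type": "int"}
      |  ]
      |}""".stripMargin
  val schema = new Schema.Parser().parse(schemaJson)

  // Serialize one record to a sample output file.
  val user: GenericRecord = new GenericData.Record(schema)
  user.put("name", "Pinku")
  user.put("age", 30)

  val outFile = new File("users.avro")
  val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
  writer.create(schema, outFile)
  writer.append(user)
  writer.close()

  // Read the file back according to the same schema.
  val reader = new DataFileReader[GenericRecord](outFile, new GenericDatumReader[GenericRecord](schema))
  while (reader.hasNext) println(reader.next())
  reader.close()
}
```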

Streaming from Kafka to PostgreSQL through Spark Structured Streaming

Reading Time: 3 minutes Hello everyone, in this blog we are going to learn how to do Structured Streaming in Spark with Kafka and PostgreSQL on our local system. We will be doing all this using Scala, so without any further pause, let’s begin. Setting up the necessities first: 1. Dependencies Set up the required dependencies for Scala, Spark, Kafka and PostgreSQL. 2. PostgreSQL setup Let’s start fresh by Continue Reading
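A minimal sketch of the pipeline the excerpt sets up, assuming a local Kafka broker on localhost:9092 and a local PostgreSQL instance; the topic, table, and credential values are placeholders, not the post’s actual configuration. Since Structured Streaming has no built-in JDBC sink, each micro-batch is written with the batch JDBC writer inside foreachBatch.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object KafkaToPostgres extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("kafka-to-postgres")
    .getOrCreate()

  // Read the Kafka topic as a streaming DataFrame and keep key/value as strings.
  val messages = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "test-topic")
    .load()
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

  val props = new java.util.Properties()
  props.setProperty("user", "postgres")
  props.setProperty("password", "postgres")
  props.setProperty("driver", "org.postgresql.Driver")

  // Write every micro-batch to PostgreSQL through the batch JDBC writer.
  val query = messages.writeStream
    .foreachBatch { (batch: DataFrame, batchId: Long) =>
      batch.write
        .mode("append")
        .jdbc("jdbc:postgresql://localhost:5432/postgres", "messages", props)
    }
    .start()

  query.awaitTermination()
}
```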

Kryo Serialization in Spark

Reading Time: 4 minutes Spark provides two serialization libraries: Java serialization (the default) and Kryo serialization. For faster serialization and deserialization, Spark itself recommends using Kryo serialization in any network-intensive application. Then why is it not the default? Why Kryo is not set as the default in Spark? The only reason Kryo is not the default is that it requires custom registration. Although, Kryo is Continue Reading
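A minimal sketch of enabling Kryo with the custom registration step the excerpt mentions; the Person class and the job itself are illustrative only, not from the post.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Illustrative application class that Spark will serialize with Kryo.
case class Person(name: String, age: Int)

object KryoSketch extends App {
  val conf = new SparkConf()
    .setAppName("kryo-example")
    .setMaster("local[*]")
    // Switch from the default Java serializer to Kryo.
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // Custom registration: the extra step that keeps Kryo from being the default.
    .registerKryoClasses(Array(classOf[Person]))

  val spark = SparkSession.builder().config(conf).getOrCreate()
  val ages = spark.sparkContext
    .parallelize(Seq(Person("a", 25), Person("b", 30)))
    .map(_.age)
  println(ages.sum())
  spark.stop()
}
```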