Big Data and Fast Data

Data Lake – Build it in Phases

Reading Time: 3 minutes Data Lake – how to build a data lake and which phases are involved.

Apache Spark: Read Data from S3 Bucket

Reading Time: 2 minutes Anyone working with Spark is familiar with reading files from the local file system, from a table, or from HDFS. But do you know how tricky it can be to read data into Spark from an S3 bucket? This blog gives a step-by-step guide to reading data from an S3 bucket.

Scale Out with Cluster Sharding

Reading Time: 3 minutes If your actors are distributed across several nodes in the cluster, Cluster Sharding lets you interact with them using only their logical identifier, without worrying about their physical location. Even if an actor relocates to a new node, Akka will take care of locating it for you. You just send a message to it as if it were located on your local node.
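
The idea of addressing an actor by logical identifier can be sketched with a toy model. This is plain Python and not Akka's API: the class `ToyShardRegion`, its methods, the CRC32-based shard function, and the node names are all hypothetical, chosen only to illustrate that the sender's call stays the same when an entity's shard moves to another node.

```python
import zlib


class ToyShardRegion:
    """Toy model of shard-based routing. Hypothetical; not Akka's API."""

    def __init__(self, nodes, number_of_shards=10):
        self.number_of_shards = number_of_shards
        # Assumption for illustration: shards are spread round-robin over nodes.
        self.shard_home = {
            shard: nodes[shard % len(nodes)] for shard in range(number_of_shards)
        }
        self.delivered = []  # (node, entity_id, message) tuples, in send order

    def shard_of(self, entity_id):
        # A stable hash of the logical identifier picks the shard.
        return zlib.crc32(entity_id.encode("utf-8")) % self.number_of_shards

    def tell(self, entity_id, message):
        # The sender supplies only the logical identifier; the region
        # resolves which node currently hosts the entity's shard.
        node = self.shard_home[self.shard_of(entity_id)]
        self.delivered.append((node, entity_id, message))
        return node

    def move_shard(self, shard, new_node):
        # Simulates a shard (and the entities in it) relocating to a new node.
        self.shard_home[shard] = new_node


region = ToyShardRegion(["node-a", "node-b", "node-c"])
first_node = region.tell("user-42", "hello")

# Relocate the shard that holds "user-42"; the sender's code does not change.
region.move_shard(region.shard_of("user-42"), "node-d")
second_node = region.tell("user-42", "hello again")
```

In this sketch both calls to `tell` use the same identifier, and only the region's internal lookup changes after the move.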

IoT Application Development: Tips & Tricks for success

Reading Time: 6 minutes The Internet of Things, or IoT, is everywhere. From smart homes and smart cities to fitness trackers and connected cars, we have seen them all, and there’s more to come. As we gear up for 2020, studies suggest that IoT will comprise 30 billion connected devices, and that number may go up to 500 billion in another 10 years. IoT is changing the trajectory…

Akka Cluster Formation Fundamentals

Reading Time: 3 minutes Akka Cluster Formation: Every actor has an address in Akka. The actor could be local or remote, and remote actors require communication over the network. Each actor system in a cluster is called a member or node. A node is addressed by a combination of hostname, port, and UUID (regenerated whenever the actor system restarts). An actor can join the cluster with this combination to…

Apache Spark: Repartitioning v/s Coalesce

Reading Time: 3 minutes Does partitioning help you increase or decrease job performance? Spark splits data into partitions, and computation is done in parallel for each partition. It is very important to understand how data is partitioned and when you need to modify the partitioning manually to run Spark applications efficiently. Now, diving into our main topic, i.e., Repartitioning v/s Coalesce. What is Coalesce? The coalesce method reduces the number…

Understanding data persistence in Lagom

Reading Time: 4 minutes When we create a microservice, or any service in general, one of the biggest tasks is managing data persistence. Lagom supports various databases for this task. By default, Lagom uses Cassandra to persist data.

Big Data Landscape explained

Reading Time: 5 minutes Big Data has now evolved into a buzzword, and it seems everyone is either working on it or wants to work on it. However, most people associate Big Data with some of the popular tool sets such as Hadoop, Spark, Hive, and NoSQL databases like Cassandra and HBase. HDFS made Big Data popular as it gave us an option to distribute the data…

Understanding the working of Spark Driver and Executor

Reading Time: 4 minutes This blog is about Apache Spark. We will see how Spark’s driver and executors communicate with each other to process a given job. So let’s get started. First, let’s see what Apache Spark is. The official definition says that “Apache Spark™ is a unified analytics engine for large-scale data processing.” It is an in-memory computation engine where the data is…

Understanding how Spark runs on YARN with HDFS

Reading Time: 6 minutes This blog is about Apache Spark and YARN (Yet Another Resource Negotiator). We will see how Spark runs on YARN with HDFS. So let’s get started. First, let’s see what Apache Spark is. The official definition says that “Apache Spark™ is a unified analytics engine for large-scale data processing.” It is an in-memory computation engine where the data is kept…

Kafka Timestamp Extractor

Reading Time: 3 minutes Hi folks, I hope you’re all doing well. If you have landed here, you are probably looking for the timestamp extractor for Kafka Streams. So what is the buzz all about? In this blog we will look at what it is and explore it, so buckle up. The Timestamp Extractor: As per the docs, a timestamp extractor extracts a timestamp from an…

Custom Partitioner in Kafka: Let’s Take Quick Tour!

Reading Time: 5 minutes In this blog, we are going to explore the Kafka partitioner. We will try to understand why the default partitioner is not always enough and when you might need a custom partitioner. We will also look at a use case and write the code for a custom partitioner. I assume that you have a sound knowledge of Kafka. Let’s understand the behavior of the default partitioner. The default…

Kryo Serialization in Spark

Reading Time: 4 minutes Spark provides two serialization libraries: Java serialization (the default) and Kryo serialization. For faster serialization and deserialization, Spark itself recommends using Kryo serialization in any network-intensive application. So why is Kryo not the default in Spark? The only reason is that it requires custom registration. Although Kryo is…