Big Data and Fast Data

Akka Cluster in use (Part 4): Managing a Cluster

Reading Time: 3 minutes Hello friends, I hope you are all safe during the COVID-19 pandemic and learning new tools and tech while staying at home. In our last blog post on Akka Cluster, we saw an Akka Cluster in action and learnt how the existing nodes react to new nodes in the Cluster. Now that we know how to create and set up an Akka Cluster, let’s learn how to Continue Reading

Spark: createDataFrame() vs toDF()

Reading Time: 2 minutes There are two ways to create a DataFrame in Spark: toDF() and createDataFrame(). In this blog we will see how to create a DataFrame with each method and what exactly differs between them. toDF() The toDF() method provides a very concise way to create a DataFrame. It can be applied to a sequence of objects. To access Continue Reading

Cluster vs Client: Execution modes for a Spark application

Reading Time: 3 minutes Whenever we submit a Spark application to the cluster, the driver (or Spark application master) is started first. The driver then starts N workers. The Spark driver manages the SparkContext object to share data, and it coordinates with the workers and the cluster manager across the cluster. The cluster manager can be Spark Standalone, Hadoop YARN, or Mesos. Workers will Continue Reading

Knime Analytics Platform: A dream for a data scientist

Reading Time: 3 minutes In this blog, we will look at what the KNIME Analytics Platform is and at the features that make it easy to build an analytics workflow. Introduction to KNIME Analytics Platform KNIME is a platform built for powerful analytics on a GUI-based workflow. This means you do not need to know how to code to work with KNIME and derive Continue Reading

Serialization in Kafka

Reading Time: 2 minutes Serialization is the process of converting an object into a stream of bytes for transmission. Deserialization, as the name suggests, does the opposite: it converts byte arrays back into the desired data type. Apache Kafka stores and transmits these byte arrays in its queue. Continue Reading

Spark: Type Safety in Dataset vs DataFrame

Reading Time: 4 minutes With type safety, a programming language prevents type errors. In other words, the compiler validates types at compile time and reports an error when we try to assign a value of the wrong type to a variable. Spark, a unified analytics engine for big data processing, provides two very useful APIs, DataFrame and Dataset, that are easy to use, intuitive, and expressive, which makes Continue Reading

Dynamic Partition Pruning in Spark 3.0

Reading Time: 6 minutes Dynamic Partition Pruning in Spark 3.0 The release of Spark 3.0 brought big improvements that make Spark execute faster, along with many new features. Dynamic partition pruning is one of them. Before diving into what is new in dynamic partition pruning, let us understand what partition pruning is. Partition Pruning in Spark In standard database pruning Continue Reading

Akka-gRPC

Reading Time: 3 minutes Akka gRPC provides support for building streaming gRPC servers and clients on top of Akka Streams and Akka HTTP. Features of Akka-gRPC A generator that starts from a protobuf service definition and produces: model classes; the service API as a Scala trait using Akka Stream Sources; on the server side, code to create an Akka HTTP route based on your implementation of the service; on the client side, a client for the Continue Reading

Akka Cluster in use (Part 3): Setup a Local Akka Cluster

Reading Time: 4 minutes Hello friends, I hope you are all safe during the COVID-19 pandemic and learning new tools and tech while staying at home. In our last blog post on Akka Cluster, we learnt about the configuration we need in order to form an Akka Cluster. But we didn’t see it in action. Hence, in this blog post, we will see one in action. Step 1: Download the Continue Reading

Rebalancing: What the fuss is all about?

Reading Time: 4 minutes Apache Kafka is ruling the world of Big Data. It is not just a messaging queue but a full-fledged event streaming platform. We have looked at the basic idea of Kafka and what makes it faster than any other messaging queue. You can read about it in my previous blog. We also looked at Partitions, Replicas, and ISR. We are now ready for our Continue Reading

Akka-Streams: All About Graphs!

Reading Time: 4 minutes In my previous blogs, I discussed the basics of akka-streams and materialization. Now let’s dig deeper into Graphs in Akka-Streams. Graphs So far we know how to create a linear pipeline (a linear graph). But in real-life scenarios we generally don’t have linear graphs to implement. The graphs can be complex. In Akka Streams, computation graphs are written in a more graph-resembling DSL. It aims Continue Reading