Author: RakhiPareek

Receivers in Apache Spark Streaming

Reading Time: 2 minutes
Receivers are special objects in Spark Streaming. A receiver’s job is to consume data from a data source and move it into Spark. The streaming context creates receivers as long-running tasks on different executors. We can build a custom receiver by extending the abstract class Receiver. Two methods start and stop the receiver. onStart(): this method contains the essential setup, such as opening connections, creating threads, …
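As a rough illustration of the start/stop lifecycle described in this excerpt, here is a minimal plain-Python sketch. It is an analogy only and does not use the Spark API: the class `SketchReceiver`, its `stored` queue, and the helper `run_demo` are hypothetical names invented for this example. The idea it tries to show is that the start method only launches a background thread and returns, while the stop method signals that thread to finish.

```python
import queue
import threading


class SketchReceiver:
    """Plain-Python stand-in for a receiver (hypothetical, not the Spark API)."""

    def __init__(self, source):
        self._source = source              # any iterable acting as the data source
        self._stopped = threading.Event()  # set when the receiver should stop
        self._thread = None
        self.stored = queue.Queue()        # stands in for handing records to Spark

    def on_start(self):
        # Return quickly: the actual reading happens on a background thread.
        self._thread = threading.Thread(target=self._receive, daemon=True)
        self._thread.start()

    def on_stop(self):
        # Signal the background thread to finish, then wait for it.
        self._stopped.set()
        if self._thread is not None:
            self._thread.join()

    def _receive(self):
        # Read records until the source is exhausted or a stop is requested.
        for record in self._source:
            if self._stopped.is_set():
                break
            self.stored.put(record)


def run_demo():
    receiver = SketchReceiver(["a", "b", "c"])
    receiver.on_start()
    receiver._thread.join()  # the demo source is finite, so wait until it is drained
    receiver.on_stop()
    items = []
    while not receiver.stored.empty():
        items.append(receiver.stored.get())
    return items
```

In this sketch, `run_demo()` collects the three records pushed by the background thread. A real Spark receiver would hand records to Spark instead of filling a local queue.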

Cache and Persist in Apache Spark Dataframe

Reading Time: 2 minutes
Spark computations are generally faster than MapReduce jobs. If we haven’t designed our jobs to reuse computations, performance degrades when working with billions or trillions of records. Hence, we may need to look at the stages and apply optimization techniques to improve performance. The cache() and persist() methods provide an optimization mechanism that stores the intermediate computation of a Spark DataFrame. So we …
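The idea of reusing an intermediate computation can be sketched in plain Python. This is an analogy under stated assumptions, not Spark code: `LazyDataset`, its `collect` method, and the `evaluations` counter are hypothetical names made up for this example. Without caching, every action recomputes the result; after `cache()`, the first result is kept and reused.

```python
class LazyDataset:
    """Toy lazily evaluated dataset (hypothetical, not the Spark DataFrame API)."""

    def __init__(self, compute):
        self._compute = compute   # function that produces the data
        self._use_cache = False
        self._cached = None
        self.evaluations = 0      # how many times the computation actually ran

    def cache(self):
        # Mark the dataset so that the next computed result is kept for reuse.
        self._use_cache = True
        return self

    def collect(self):
        # Reuse the stored result if caching is enabled and a result exists.
        if self._use_cache and self._cached is not None:
            return self._cached
        self.evaluations += 1
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return result


def squares():
    return [x * x for x in range(5)]


uncached = LazyDataset(squares)
uncached.collect()
uncached.collect()   # recomputed: uncached.evaluations is now 2

cached = LazyDataset(squares).cache()
cached.collect()
cached.collect()     # reused: cached.evaluations stays at 1
```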

Deploy modes in Apache Spark

Reading Time: 2 minutes
Spark is an open-source engine for big data processing and analysis, known for its speed and ease of use. It has built-in modules for graph processing, machine learning, streaming, SQL, and more. The Spark execution engine supports in-memory computation and cyclic data flow, which makes it faster. It can run in either cluster mode or standalone mode, and can also access diverse …

Great start with Elixir

Reading Time: 2 minutes
What is Elixir? Elixir is a functional programming language that runs on the Erlang VM, known for running low-latency, distributed, and fault-tolerant systems. It is used successfully in web development, embedded software, data ingestion, and multimedia processing, across a wide range of industries. Elixir Installation in Ubuntu: Here is how to install Elixir on Ubuntu. The only prerequisite for installing Elixir is Erlang …

Why modern systems need Akka

Reading Time: 4 minutes
In this blog, we are going to learn why concurrent and parallel programming needs the Actor programming model. Challenge of encapsulation in a multithreaded environment: Encapsulation wraps objects and their data into one unit so that the data cannot be accessed directly from the outside. The object is responsible for exposing safe operations that protect the invariant nature of its encapsulated data. This message sequence chart shows …

Important things you need to know about Spark RDD

Reading Time: 4 minutes
What is RDD in Spark? RDD stands for “Resilient Distributed Dataset”. An RDD in Apache Spark is a data structure: an immutable collection of objects that is computed on the different nodes of the cluster. Resilient, i.e. fault-tolerant: the data is present on multiple executor nodes so that, if any node fails, it can be recovered from another node. Distributed, since data …

Simple Guidance for Concurrency vs Parallelism

Reading Time: 3 minutes
I was learning about concurrency and parallelism. The two terms have very similar definitions, which raises an important question: what is the difference between concurrency and parallelism? I will answer this question in this blog. Concurrency: Concurrency is when two or more tasks run in overlapping time periods. In a more generalized way, we can say at …
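To make "overlapping time" concrete, here is a small plain-Python sketch of concurrency without parallelism. The names `task` and `run_concurrently` are hypothetical, chosen for this example. A single thread switches between two tasks in round-robin order, so both are in progress over the same period even though only one step executes at any instant. Parallelism would additionally require the steps to execute at the same moment, for example on separate CPU cores.

```python
def task(name, steps):
    # A task that yields control back to the scheduler after each step.
    for i in range(steps):
        yield f"{name}{i}"


def run_concurrently(tasks):
    # Round-robin scheduler: take one step from each unfinished task in turn.
    log = []
    pending = list(tasks)
    while pending:
        for t in list(pending):   # iterate over a copy so removal is safe
            try:
                log.append(next(t))
            except StopIteration:
                pending.remove(t)
    return log


interleaved = run_concurrently([task("A", 2), task("B", 3)])
print(interleaved)  # ['A0', 'B0', 'A1', 'B1', 'B2']
```

The steps of A and B are interleaved in the log, which is the sense in which the two tasks overlap in time on a single thread.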