Author: RakhiPareek

AWS S3 bucket

Reading Time: 3 minutes Amazon S3 (Simple Storage Service) is a cloud-based object storage service offered by Amazon Web Services (AWS). S3 is a scalable, secure, and highly available way to store and retrieve data. An AWS S3 bucket is a container for storing objects in the Amazon S3 cloud. Uses of AWS S3 Bucket: Backup and Disaster Recovery. An AWS S3 bucket is an excellent choice for backing up data Continue Reading

Lazy Evaluation in Scala

Reading Time: 3 minutes Lazy evaluation is a technique that delays the computation of an expression until it is needed. This can be useful for improving performance and reducing memory usage in certain situations. Lazy Evaluation in Scala: In Scala, lazy evaluation is achieved through lazy vals. A lazy val is a value that is computed lazily. Its value is not evaluated until it is accessed Continue Reading
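
The behaviour described in this excerpt can be illustrated with a minimal sketch. The object name `LazyValDemo`, the `log` buffer, and the value `expensive` are hypothetical names chosen for this illustration and do not come from the original post. The sketch records the order of events to show that the body of a lazy val runs only on first access and that the result is then reused.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical demo object; all names here are illustrative assumptions.
object LazyValDemo {
  // Records the order in which things happen, so the timing is visible.
  private val log = ListBuffer.empty[String]

  def run(): List[String] = {
    log.clear()

    // The block below is NOT executed at this point.
    lazy val expensive: Int = {
      log += "computing"
      21 * 2
    }

    log += "declared"

    val first = expensive  // first access: the block runs now
    val second = expensive // second access: the cached result is reused

    log += s"values: $first, $second"
    log.toList
  }

  def main(args: Array[String]): Unit =
    run().foreach(println)
}
```

Running `main` prints "declared" before "computing", and "computing" appears only once even though `expensive` is read twice.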

Kafka Introduction with CLI Commands

Reading Time: 3 minutes What is Apache Kafka? Apache Kafka is a distributed, open-source system designed specifically for streaming data. Kafka is mostly used in real-time streaming data architectures to provide real-time analytics. It is fault-tolerant, high-throughput, and horizontally scalable, and it supports geographically distributed data streams and stream processing applications. Basic Components of Kafka: Producer: A producer is an entity/application that publishes data to a Kafka cluster. Broker: A broker is responsible for receiving Continue Reading

Receivers in Apache Spark Streaming

Reading Time: 2 minutes Receivers are special objects in Spark Streaming. A receiver’s goal is to consume data from data sources and move it into Spark. Receivers are created through the streaming context and run as long-running tasks on different executors. We can build a receiver by extending the abstract class Receiver. There are two methods to start and stop a receiver: onStart(): This method contains all the important setup work, such as opening connections, creating threads, Continue Reading


Cache and Persist in Apache Spark Dataframe

Reading Time: 2 minutes Spark computations are faster than MapReduce jobs. However, if our jobs are not designed to reuse computations, performance degrades when they process billions or trillions of records. Hence, we may need to look at the stages and use optimization techniques to improve performance. The cache() and persist() methods provide an optimization mechanism for storing the intermediate computation of a Spark DataFrame. So we Continue Reading

Deploy modes in Apache Spark

Reading Time: 2 minutes Spark is an open-source engine for big data processing and analysis, known for its speed and ease of use. Spark has built-in modules for graph processing, machine learning, streaming, SQL, etc. The Spark execution engine supports in-memory computation and cyclic data flow, which makes it faster. It can run in either cluster mode or standalone mode and can also access diverse Continue Reading

Great start with Elixir

Reading Time: 2 minutes What is Elixir? Elixir is a functional programming language that runs on the Erlang VM, which is known for running low-latency, distributed, and fault-tolerant systems. It is used successfully in web development, embedded software, data ingestion, and multimedia processing across a wide range of industries. Elixir Installation in Ubuntu: Here is how to install Elixir on Ubuntu. The only prerequisite for installing Elixir is Erlang Continue Reading

Why modern systems need Akka

Reading Time: 4 minutes In this blog, we will see why concurrent and parallel programming needs the Actor programming model. The challenge of encapsulation in a multithreaded environment: Encapsulation wraps objects and their data into one unit so that the data cannot be accessed directly from the outside. The object is responsible for exposing safe operations that protect the invariant nature of its encapsulated data. This message sequence chart shows Continue Reading

Important things you need to know about Spark RDD

Reading Time: 4 minutes What is RDD in Spark? RDD stands for “Resilient Distributed Dataset”. An RDD in Apache Spark is a data structure: an immutable collection of objects that is computed on different nodes of the cluster. Resilient, i.e. fault-tolerant: the data is present on multiple executor nodes, so if any node fails, the data can be recovered from another executor node. Distributed, since data Continue Reading

Simple Guidance for Concurrency vs Parallelism

Reading Time: 3 minutes While learning about concurrency and parallelism, I found that the two terms have very similar definitions, which raises an important question: what is the difference between concurrency and parallelism? I will answer this question in this blog. Concurrency: Concurrency is when two or more tasks run during overlapping time periods. In a more generalized way, we can say at Continue Reading