Apache Spark


Concept of UDF in Spark: User-Defined Function

Reading Time: 3 minutes As we all know, Spark contains a wide variety of built-in functions through which you can do any sort of transformation in your data frame and achieve your desired output, but sometimes you may find that they don’t meet your requirement. Then what? In that case, you can define your own functions, known as UDFs (User-Defined Functions), which make it possible to write your own Continue Reading
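As a minimal sketch of what such a UDF might look like (the column name and the capitalisation logic are illustrative assumptions, not taken from the post), one can wrap a plain Scala function with `udf` and apply it to a Dataframe column:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object UdfExample extends App {
  val spark = SparkSession.builder().appName("udf-example").master("local[*]").getOrCreate()
  import spark.implicits._

  // Hypothetical data: a column of names we want to capitalise with a UDF.
  val df = Seq("alice", "bob").toDF("name")

  // Wrap an ordinary Scala function as a Spark UDF.
  val capitalize = udf((s: String) => s.capitalize)

  df.withColumn("capitalized", capitalize(col("name"))).show()
}
```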

Apache Spark’s Developers Friendly Structured APIs: Dataframe and Datasets

Reading Time: 3 minutes This is the second part of the blog series on Spark‘s structured APIs: Dataframe & Datasets. In the first part we covered Dataframe, and I recommend you read that blog first if you are new to Spark. In this blog we’ll cover the Spark Datasets API, so let’s get started. The Datasets API Datasets are also the combination of two characteristics: typed and untyped Continue Reading
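A small sketch of the typed side of the Datasets API (the `Person` case class and its fields are assumptions made for illustration):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type to illustrate a typed Dataset.
case class Person(name: String, age: Int)

object DatasetExample extends App {
  val spark = SparkSession.builder().appName("dataset-example").master("local[*]").getOrCreate()
  import spark.implicits._

  // A typed Dataset: the compiler checks field names and types.
  val people = Seq(Person("Alice", 30), Person("Bob", 25)).toDS()

  // Typed transformation: filter works on Person objects directly.
  people.filter(_.age > 26).show()
}
```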

Dataframe and Datasets: Apache Spark’s Developers Friendly Structured APIs

Reading Time: 4 minutes This is a two-part blog series in which we’ll first cover the Dataframe API and, in the second part, Datasets. Spark 2.x introduced the idea of structuring Spark through two concepts: one, to express computation using common patterns found in data analysis, such as filtering, selecting, counting, aggregating, and grouping; and the second, to order and structure your data in a Continue Reading
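A brief sketch of those common Dataframe patterns on made-up data (the column names and values below are assumptions for illustration only):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object DataframeExample extends App {
  val spark = SparkSession.builder().appName("dataframe-example").master("local[*]").getOrCreate()
  import spark.implicits._

  // Hypothetical sales data to show filtering, grouping, and aggregating.
  val sales = Seq(("books", 10.0), ("books", 20.0), ("games", 15.0)).toDF("category", "price")

  sales
    .filter($"price" > 5)                 // filtering
    .groupBy($"category")                 // grouping
    .agg(avg($"price").as("avg_price"))   // aggregating
    .show()
}
```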


Brief Introduction to Apache Spark

Reading Time: 3 minutes What is Apache Spark: Apache Spark is an open-source data processing engine to store and process data in real time across clusters of computers using simple programming constructs. It supports programming languages such as Scala, Python, Java, and R. Spark Architecture: Uses of Apache Spark: It is used for data processing applications, batch processing, processing structured data, machine learning, processing graph data, and processing streaming data. Features of Continue Reading

Writing Unit Test for Apache Spark using Memory Streams

Reading Time: 2 minutes In this post, we are going to look into how we can leverage Apache Spark’s memory streams for unit testing. What is it? Apache Spark’s MemoryStream is a concrete streaming source backed by in-memory data that supports reading in Micro-Batch Stream Processing. Let’s jump into it. We will be using a memory stream, writing some test data in memory as a stream. We Continue Reading
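A minimal sketch of the idea, assuming the test data and the `test_output` query name are made up for illustration: push a few records into a `MemoryStream`, run the query, and read the results back from the in-memory sink.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.MemoryStream

object MemoryStreamExample extends App {
  val spark = SparkSession.builder().appName("memory-stream-test").master("local[*]").getOrCreate()
  import spark.implicits._
  implicit val sqlCtx = spark.sqlContext

  // MemoryStream lets a test feed data into a streaming query without Kafka or files.
  val input = MemoryStream[String]
  input.addData("spark", "kafka", "cassandra")

  val query = input.toDS()
    .writeStream
    .format("memory")          // keep results in an in-memory table for assertions
    .queryName("test_output")  // illustrative query name
    .start()

  query.processAllAvailable()
  spark.sql("select * from test_output").show()
}
```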

Apache Spark’s Join Algorithms

Reading Time: 4 minutes Joins in Apache Spark are fundamental transformations, but if you are not familiar with their internal algorithms, they can become very expensive.
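One common way to steer the join algorithm is to hint a broadcast join when one side is small; a rough sketch (the tables and column names are illustrative assumptions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object JoinExample extends App {
  val spark = SparkSession.builder().appName("join-example").master("local[*]").getOrCreate()
  import spark.implicits._

  val orders    = Seq((1, "laptop"), (2, "phone")).toDF("customer_id", "item")
  val customers = Seq((1, "Alice"), (2, "Bob")).toDF("id", "name")

  // Broadcasting the small side avoids the shuffle a sort-merge join would need.
  val joined = orders.join(broadcast(customers), $"customer_id" === $"id")
  joined.explain()   // the plan should show a BroadcastHashJoin
  joined.show()
}
```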

Creating Data Pipeline with Spark streaming, Kafka and Cassandra

Reading Time: 3 minutes Hi Folks!! In this blog, we are going to learn how we can integrate Spark Structured Streaming with Kafka and Cassandra to build a simple data pipeline. Spark Structured Streaming is a component of the Apache Spark framework that enables scalable, high-throughput, fault-tolerant processing of data streams. Apache Kafka is a scalable, high-performance, low-latency platform that allows reading and writing streams of data Continue Reading
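A rough sketch of such a pipeline, assuming the spark-cassandra-connector is on the classpath and that the broker address, topic, keyspace, and table names below are placeholders:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object KafkaToCassandra extends App {
  val spark = SparkSession.builder().appName("kafka-cassandra-pipeline").master("local[*]").getOrCreate()

  // Read a stream of records from a Kafka topic (broker and topic are placeholders).
  val stream = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
    .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

  // Write each micro-batch to Cassandra via the spark-cassandra-connector
  // (keyspace and table names are illustrative).
  val query = stream.writeStream
    .foreachBatch { (batch: DataFrame, _: Long) =>
      batch.write
        .format("org.apache.spark.sql.cassandra")
        .option("keyspace", "demo")
        .option("table", "events")
        .mode("append")
        .save()
    }
    .start()

  query.awaitTermination()
}
```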

Spark: Streaming Datasets

Reading Time: 3 minutes Spark provides us a high-level API, Dataset, which makes it easy to get type safety and securely perform manipulation in a distributed and a local environment without code changes. Also, Spark Structured Streaming, a high-level API for stream processing, allows us to stream a particular Dataset, which is nothing but a type-safe structured stream. In this blog, we will see how we can create Continue Reading
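A small sketch of a typed streaming Dataset, assuming a hypothetical `Reading` record type and an illustrative input path:

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical record type for the stream.
case class Reading(sensor: String, value: Double)

object StreamingDatasetExample extends App {
  val spark = SparkSession.builder().appName("streaming-dataset").master("local[*]").getOrCreate()
  import spark.implicits._

  // Read a file stream and turn it into a typed Dataset[Reading];
  // the schema and input path are assumptions for this sketch.
  val readings = spark.readStream
    .schema(Encoders.product[Reading].schema)
    .json("/tmp/readings")
    .as[Reading]

  val query = readings
    .filter(_.value > 100.0)   // typed, compile-checked transformation
    .writeStream
    .format("console")
    .start()

  query.awaitTermination()
}
```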

Optimizations In Spark: For BETTER OR For WORSE

Reading Time: 5 minutes This blog focuses on some of the problems faced while working with Spark SQL

Delta Lake: Schema Enforcement & Evolution

Reading Time: 4 minutes Nowadays data is constantly evolving and changing. As business problems and requirements evolve, the shape or structure of the data changes as well. When that happens, we want to be in control of how the data or schema changes. But how can we achieve this? Delta Lake has good ways to control how schema changes. With Delta Lake, users have Continue Reading
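A rough sketch of the idea, assuming the Delta Lake dependency is available and that the path and column names below are illustrative: the first write fixes the schema, and an append with a new column is rejected unless schema evolution is requested explicitly.

```scala
import org.apache.spark.sql.SparkSession

object DeltaSchemaExample extends App {
  // Assumes the Delta Lake library is on the classpath.
  val spark = SparkSession.builder()
    .appName("delta-schema")
    .master("local[*]")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
  import spark.implicits._

  val path = "/tmp/delta/events"   // illustrative path

  // First write defines the table schema; Delta enforces it on later writes.
  Seq((1, "click")).toDF("id", "event").write.format("delta").mode("overwrite").save(path)

  // A write with an extra column would be rejected unless we opt in to schema evolution.
  Seq((2, "view", "mobile")).toDF("id", "event", "device")
    .write
    .format("delta")
    .option("mergeSchema", "true")   // allow the schema to evolve
    .mode("append")
    .save(path)
}
```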

Fetching data from different sources using Spark 2.1

Spark: createDataFrame() vs toDF()

Reading Time: 2 minutes There are two different ways to create a Dataframe in Spark: first, using toDF(), and second, using createDataFrame(). In this blog we will see how we can create a Dataframe using these two methods and what the exact difference between them is. toDF() The toDF() method provides a very concise way to create a Dataframe. This method can be applied to a sequence of objects. To access Continue Reading
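A short sketch of both approaches side by side (the column names and sample rows are assumptions for illustration): toDF() infers the schema from the Scala types, while createDataFrame() takes explicit Rows plus an explicit schema.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object CreateDfVsToDf extends App {
  val spark = SparkSession.builder().appName("createdf-vs-todf").master("local[*]").getOrCreate()
  import spark.implicits._

  // toDF(): concise; column names passed directly, schema inferred from the Scala types.
  val df1 = Seq((1, "Alice"), (2, "Bob")).toDF("id", "name")

  // createDataFrame(): explicit Rows plus an explicit schema for full control over types.
  val schema = StructType(Seq(
    StructField("id", IntegerType, nullable = false),
    StructField("name", StringType, nullable = true)
  ))
  val rows = Seq(Row(1, "Alice"), Row(2, "Bob"))
  val df2 = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

  df1.show()
  df2.show()
}
```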

Streaming from Kafka to PostgreSQL through Spark Structured Streaming

Reading Time: 3 minutes Hello everyone, in this blog we are going to learn how to do structured streaming in Spark with Kafka and PostgreSQL on our local system. We will be doing all this using Scala, so without any further pause, let’s begin. Setting up the necessities first: Dependencies Set up the required dependencies for Scala, Spark, Kafka and PostgreSQL. 2. PostgreSQL setup Let’s start fresh by Continue Reading
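A rough sketch of the Kafka-to-PostgreSQL flow, assuming the broker address, topic, JDBC URL, table, and credentials below are placeholders: since Structured Streaming has no built-in JDBC sink, each micro-batch is written through the batch JDBC writer inside foreachBatch.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object KafkaToPostgres extends App {
  val spark = SparkSession.builder().appName("kafka-to-postgres").master("local[*]").getOrCreate()

  // Stream records from Kafka (broker and topic names are placeholders).
  val stream = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "transactions")
    .load()
    .selectExpr("CAST(value AS STRING) AS value")

  // Write each micro-batch to PostgreSQL via JDBC (connection details are illustrative).
  val query = stream.writeStream
    .foreachBatch { (batch: DataFrame, _: Long) =>
      batch.write
        .format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/demo")
        .option("dbtable", "transactions")
        .option("user", "postgres")
        .option("password", "postgres")
        .mode("append")
        .save()
    }
    .start()

  query.awaitTermination()
}
```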

Apache Spark: Handle Corrupt/Bad Records

Reading Time: 3 minutes Most of the time, writing ETL jobs becomes very expensive when it comes to handling corrupt records, and in such cases ETL pipelines need a good solution for them, because the larger the ETL pipeline is, the more complex it becomes to handle such bad records in between. Corrupt data includes: missing information, incomplete information, schema mismatches, and differing formats or data types. Apache Spark: Continue Reading
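One common approach is to read in PERMISSIVE mode and capture malformed rows in a dedicated column; a minimal sketch, assuming the schema, column names, and input path below are illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object CorruptRecordsExample extends App {
  val spark = SparkSession.builder().appName("corrupt-records").master("local[*]").getOrCreate()

  // The extra column receives the raw text of any record that fails to parse.
  val schema = StructType(Seq(
    StructField("id", IntegerType, nullable = true),
    StructField("name", StringType, nullable = true),
    StructField("_corrupt_record", StringType, nullable = true)
  ))

  // PERMISSIVE keeps malformed rows instead of failing the job;
  // DROPMALFORMED and FAILFAST are the stricter alternatives.
  val df = spark.read
    .schema(schema)
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .json("/tmp/input.json")   // illustrative input path

  df.filter(df("_corrupt_record").isNotNull).show(false)
}
```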