Knoldus Blogs

Deploy modes in Apache Spark

June 20, 2022 | June 21, 2022 | Apache Spark, Spark, spark, Tech Blogs

Reading Time: 2 minutes | Spark is a fast, easy-to-use open-source engine for big data processing and analysis. It has built-in modules for graph processing, machine learning, streaming, SQL, etc. The Spark execution engine supports in-memory computation and cyclic data flow, which makes it faster, and it can run in either cluster mode or standalone mode and can also access diverse Continue Reading

Getting Started with Spark 3

February 3, 2022 | February 3, 2022 | Akka, Spark, Studio-Scala, Tech Blogs | scala, Spark

Reading Time: 4 minutes | Introduction to Apache Spark: Big Data processing frameworks like Apache Spark provide an interface for programming clusters with data parallelism and fault tolerance. Apache Spark is widely used for fast processing of large datasets. It is an open-source platform built by a broad group of software developers from more than 200 companies. More than 1,000 developers have contributed to Apache Spark since 2009. Continue Reading

Understanding persistence in Apache Spark

October 2, 2020 | October 12, 2020 | Analytics, Apache Spark, Big Data and Fast Data, ML, AI and Data Engineering, Spark, Studio-Scala, Tech Blogs | Caching, distributed caching

Reading Time: 4 minutes | In this blog, we will try to understand the concept of persistence in Apache Spark in layman's terms, with scenario-based examples. Note: The scenarios are only meant to make this easier to understand. Spark Architecture. Note: Cache memory can be shared between Executors. What does persisting/caching an RDD mean? Spark RDD persistence is an optimization technique that saves the result of RDD evaluation Continue Reading

Creating Data Pipeline with Spark streaming, Kafka and Cassandra

August 24, 2020 | October 22, 2020 | Apache Kafka, Apache Spark, Big Data and Fast Data, Cassandra, MessagesAPI, Spark, Streaming, Studio-Scala | Apache Kafka, Apache Spark, Cassandra, data analysis, DataStream API

Reading Time: 3 minutes | Hi folks! In this blog, we are going to learn how to integrate Spark Structured Streaming with Kafka and Cassandra to build a simple data pipeline. Spark Structured Streaming is a component of the Apache Spark framework that enables scalable, high-throughput, fault-tolerant processing of data streams. Apache Kafka is a scalable, high-performance, low-latency platform that allows reading and writing streams of data Continue Reading

Spark Structured Streaming (Part 4) – Handling Late Data

August 18, 2020 | August 19, 2020 | Analytics, Apache Spark, Big Data and Fast Data, ML, AI and Data Engineering, Spark, Streaming, Streaming Solutions, Studio-Scala, Tech Blogs | stateful streaming, Structured Streaming, watermark

Reading Time: 3 minutes | Welcome back, folks, to this blog series on Spark Structured Streaming. This blog continues the earlier blog “Understanding Stateful Streaming” and covers handling late-arriving data in Spark Structured Streaming. So let’s get started. Handling Late Data: With window aggregates (discussed in the previous blog), Spark automatically takes care of late data. Every aggregate window is like a bucket Continue Reading

Spark: Streaming Datasets

August 18, 2020 | August 18, 2020 | Apache Spark, Big Data and Fast Data, Spark, Streaming, Studio-Scala | Apache Spark, Big Data Analytics, DataFrame, DataStream API, Spark Streaming, Spark Structured Streaming, Streaming Spark

Reading Time: 3 minutes | Spark provides a high-level API, Dataset, which makes it easy to get type safety and to safely manipulate data in both distributed and local environments without code changes. Spark Structured Streaming, a high-level API for stream processing, also allows us to stream a particular Dataset, which is nothing but a type-safe structured stream. In this blog, we will see how we can create Continue Reading

Spark Structured Streaming (Part 3) – Stateful Streaming

August 14, 2020 | August 21, 2020 | Analytics, Apache Spark, Big Data and Fast Data, ML, AI and Data Engineering, Spark, Streaming, Streaming Solutions, Studio-Scala, Tech Blogs | stateful streaming, stateful streaming scala, Structured Streaming

Reading Time: 4 minutes | Welcome back, folks, to this blog series on Spark Structured Streaming. This blog continues the earlier blog “Internals of Structured Streaming” and covers stateful streaming in Spark Structured Streaming. So let’s get started. Let’s begin with a basic understanding of what stateful stream processing is. But to understand that, let’s first understand what stateless stream processing is. In Continue Reading

Stateful Streaming in Spark

August 10, 2020 | October 12, 2020 | Apache Spark, Big Data and Fast Data, Spark, Studio-Scala, Tech Blogs | Big Data, scala, Spark, Spark Streaming, stateful streaming

Reading Time: 4 minutes | Apache Spark is a fast, general-purpose cluster computing system. In Spark, we can do both batch processing and stream processing. Its stream processing is near real-time, meaning that it processes the data in micro-batches. I discussed Spark Streaming in more detail in my previous blog. In this blog, I’ll discuss stateful streaming in Spark. So let’s start! What is Stateful Streaming? Continue Reading

Spark Structured Streaming (Part 2) – The Internals

August 9, 2020 | August 14, 2020 | Analytics, Apache Spark, Big Data and Fast Data, ML, AI and Data Engineering, Spark, Streaming, Streaming Solutions, Studio-Scala, Tech Blogs | Structured Streaming

Reading Time: 2 minutes | Welcome back, folks, to this blog series on Spark Structured Streaming. This blog continues the earlier blog “Introduction to Structured Streaming”, so I’ll start exactly where I left off in the previous blog. Structure of a Streaming Query: When we call the start() API, Spark internally translates this code into a Logical Plan (an abstract representation of what the code does), then Continue Reading

Spark Structured Streaming (Part 1) – Introduction

August 6, 2020 | October 12, 2020 | Analytics, Apache Spark, Big Data and Fast Data, ML, AI and Data Engineering, Spark, Streaming, Streaming Solutions, Studio-Scala, Tech Blogs | Structured Streaming

Reading Time: 5 minutes | In this Spark Structured Streaming series of blogs, we will take a deep look at what structured streaming is, in layman's terms. So let’s get started. Introduction: Structured streaming is a stream processing engine built on top of the Spark SQL engine that uses the Spark SQL APIs. It is fast, scalable and fault-tolerant. It provides rich, unified and high-level APIs in the Continue Reading