Knoldus Blogs

Deploy modes in Apache Spark

June 20, 2022 | June 21, 2022 | Apache Spark, spark, Spark, Tech Blogs

Reading Time: 2 minutes

Spark is an open-source engine for big data processing and analysis that is fast and easy to use. It has built-in modules for graph processing, machine learning, streaming, SQL, and more. The Spark execution engine supports in-memory computation and cyclic data flow, which makes it faster, and it can run in either cluster mode or standalone mode and can also access diverse Continue Reading

The ecosystem of Apache Spark

June 6, 2022 | June 9, 2022 | Apache Spark, spark, Tech Blogs

Reading Time: 4 minutes

Apache Spark is a powerful alternative to Hadoop MapReduce, with rich features such as machine learning, real-time stream processing, and graph computation. It is an open-source distributed cluster-computing framework designed to cover a wide range of workloads, such as batch applications, iterative algorithms, interactive queries, and streaming. Apart from supporting all these workloads in a single system, it reduces the management burden of Continue Reading

Comparing Data Streaming Frameworks | Scala

December 10, 2021 | December 10, 2021 | akka-streams, Apache Kafka, Apache Spark, Big Data and Fast Data, Studio-Scala, Tech Blogs | Akka, kafka, scala, Spark, Streaming

Reading Time: 4 minutes

In this era of technology, the amount of data is growing exponentially and every bit of data holds value. According to some reports, the number of bytes generated and stored in the world so far has already exceeded the number of stars in the sky. Since every bit is useful, it is very important to store data without losing any of it. When Continue Reading

Spark 3.0: Adaptive Query Execution (AQE)

September 30, 2021 | September 30, 2021 | Apache Spark, Studio-Scala

Reading Time: 3 minutes

Introduction: Optimization plays an important role in the success of Spark SQL, so a lot of work has been done in this direction. Before Spark 3.0, cost-based optimization was a major success: different strategies are compared by cost (based on time efficiency and estimated CPU and I/O usage), and the one that minimizes the cost is executed. But, because Continue Reading

Writing Unit Test for Apache Spark using Memory Streams

March 24, 2021 | April 19, 2021 | Apache Spark, scalatest, Studio-Scala | #scalatest, Apache Spark, In-memory computing, Streaming Spark, Unit testing Spark Streaming

Reading Time: 2 minutes

In this post, we are going to look at how we can leverage Apache Spark’s memory streams for unit testing. What is it? Apache Spark’s memory stream is a concrete streaming source, backed by in-memory data, that supports reading in micro-batch stream processing. Let’s jump into it. We will be using a memory stream to write some test data in memory as a stream. We Continue Reading

Using Spark as a Database

November 23, 2020 | November 23, 2020 | Apache Spark, Big Data and Fast Data, Database, SQL, Studio-Scala

Reading Time: 4 minutes

You must have heard that Apache Spark is a powerful distributed data processing engine. But did you know that Spark (with the help of Hive) can also act as a database? In this blog, we will learn how Apache Spark can be used as a database by creating tables in it and querying them. Introduction: Since Spark is a database in itself, we Continue Reading

Understanding persistence in Apache Spark

October 2, 2020 | October 12, 2020 | Analytics, Apache Spark, Big Data and Fast Data, ML, AI and Data Engineering, Spark, Studio-Scala, Tech Blogs | Caching, distributed caching

Reading Time: 4 minutes

In this blog, we will try to understand the concept of persistence in Apache Spark in layman’s terms, with scenario-based examples. Note: The scenarios are only meant to make the concept easier to understand. Spark Architecture. Note: Cache memory can be shared between executors. What does persisting/caching an RDD mean? Spark RDD persistence is an optimization technique that saves the result of RDD evaluation Continue Reading

Spark SQL in Delta Lake 0.7.0

September 3, 2020 | September 12, 2020 | Apache Spark, Big Data and Fast Data, Java, SQL | Analytics, Big Data, delta lake, query, Spark, sql

Reading Time: 3 minutes

Nowadays Delta Lake is a buzzword in the Big Data world, especially among Spark developers, because it resolves many issues found in the Big Data domain. Delta Lake is an open-source storage layer that brings reliability to data lakes. It provides ACID transactions and scalable metadata handling, and it unifies streaming and batch data processing. It is evolving day by day and adds cool features in every release. Continue Reading

Apache Spark’s Join Algorithms

August 31, 2020 | August 31, 2020 | Apache Spark, Studio-Scala, Tech Blogs | Apache Spark, Broadcast Join, Join operations, Join optimization, Joins in Spark, Shuffled Hash Join, Sort Merge Join

Reading Time: 4 minutes

Joins in Apache Spark are fundamental transformations, but if you are not familiar with their internal algorithms, they can become very expensive.

Creating Data Pipeline with Spark streaming, Kafka and Cassandra

August 24, 2020 | October 22, 2020 | Apache Kafka, Apache Spark, Big Data and Fast Data, Cassandra, MessagesAPI, Spark, Streaming, Studio-Scala | Apache Kafka, Apache Spark, Cassandra, data analysis, DataStream API

Reading Time: 3 minutes

Hi Folks!! In this blog, we are going to learn how to integrate Spark Structured Streaming with Kafka and Cassandra to build a simple data pipeline. Spark Structured Streaming is a component of the Apache Spark framework that enables scalable, high-throughput, fault-tolerant processing of data streams. Apache Kafka is a scalable, high-performance, low-latency platform that allows reading and writing streams of data Continue Reading

Spark Structured Streaming (Part 4) – Handling Late Data

August 18, 2020 | August 19, 2020 | Analytics, Apache Spark, Big Data and Fast Data, ML, AI and Data Engineering, Spark, Streaming, Streaming Solutions, Studio-Scala, Tech Blogs | stateful streaming, Structured Streaming, watermark

Reading Time: 3 minutes

Welcome back, folks, to this blog series on Spark Structured Streaming. This blog is a continuation of the earlier blog “Understanding Stateful Streaming”, and it covers handling late-arriving data in Spark Structured Streaming. So let’s get started. Handling Late Data: With window aggregates (discussed in the previous blog), Spark automatically takes care of late data. Every aggregate window is like a bucket Continue Reading

Spark: Streaming Datasets

August 18, 2020 | August 18, 2020 | Apache Spark, Big Data and Fast Data, Spark, Streaming, Studio-Scala | Apache Spark, Big Data Analytics, DataFrame, DataStream API, Spark Streaming, Spark Structured Streaming, Streaming Spark

Reading Time: 3 minutes

Spark provides a high-level API, Dataset, which makes it easy to get type safety and to safely manipulate data in both distributed and local environments without code changes. Also, Spark Structured Streaming, a high-level API for stream processing, allows us to stream a particular Dataset, which is nothing but a type-safe structured stream. In this blog, we will see how we can create Continue Reading

Spark Structured Streaming (Part 3) – Stateful Streaming

August 14, 2020 | August 21, 2020 | Analytics, Apache Spark, Big Data and Fast Data, ML, AI and Data Engineering, Spark, Streaming, Streaming Solutions, Studio-Scala, Tech Blogs | stateful streaming, stateful streaming scala, Structured Streaming

Reading Time: 4 minutes

Welcome back, folks, to this blog series on Spark Structured Streaming. This blog is a continuation of the earlier blog “Internals of Structured Streaming”, and it covers stateful streaming in Spark Structured Streaming. So let’s get started. Let’s start with a very basic understanding of what stateful stream processing is. But to understand that, let’s first understand what stateless stream processing is. In Continue Reading