Spark Streaming

Do you really need Spark? Think Again!

With the massive increase in big data technologies today, it is becoming very important to use the right tool for every process. The process can be anything: data ingestion, data processing, data retrieval, data storage, etc. Today we are going to focus on one of those popular big data technologies, i.e., Apache Spark. Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. Spark Continue Reading

Spark Streaming vs. Structured Streaming

Fan of Apache Spark? I am too. The reason is simple: interesting APIs to work with, fast and distributed processing, no I/O overhead (unlike MapReduce), fault tolerance, and much more. With all of this, you can do a lot in the world of Big Data and Fast Data. From “processing huge chunks of data” to “working on streaming data”, Spark works flawlessly. In this Continue Reading

Spark: RDD vs DataFrames

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations. One use of Spark SQL is to execute SQL queries. When running SQL from within another Continue Reading
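To make the difference concrete, here is a minimal sketch (class and object names are illustrative) that runs the same filter once over a plain RDD, where Spark only sees opaque objects, and once through Spark SQL, where the schema lets the Catalyst optimizer do its work:

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

object RddVsDataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rdd-vs-dataframe")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Alice", 29), Person("Bob", 35))

    // RDD API: Spark only sees opaque objects, so the filter cannot be optimized.
    val adultsRdd = spark.sparkContext.parallelize(people).filter(_.age > 30)
    adultsRdd.collect().foreach(println)

    // Spark SQL: the DataFrame's schema lets Catalyst plan and optimize the query.
    people.toDF().createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()
  }
}
```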

Is Apache Flink the future of Real-time Streaming?

In our last blog, we had a discussion about the latest version of Spark, i.e. 2.4, and the new features that it has come up with. While trying to come up with various approaches to improve our performance, we got the chance to explore one of the major contenders in the race, Apache Flink. Apache Flink is an open-source platform which is a streaming Continue Reading

Kafka And Spark Streams: The happily ever after !!

Hi everyone! Today we are going to understand a bit about using Spark Streaming to transform and transport data between Kafka topics. The demand for stream processing is increasing every day. The reason is that often, processing big volumes of data is not enough. We need real-time processing of data, especially when we need to handle continuously increasing volumes of data and also need Continue Reading
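As a taste of what this covers, here is a minimal Structured Streaming sketch, assuming a local broker at localhost:9092 and placeholder topics input-topic and output-topic, that reads records from one Kafka topic, upper-cases the value, and writes the result to another topic (the post itself may use the DStream API instead):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.upper

object KafkaToKafkaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("kafka-to-kafka")
      .getOrCreate()
    import spark.implicits._

    // Read a stream of records from the source topic.
    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "input-topic")
      .load()

    // Transform: Kafka delivers binary key/value columns, so cast and upper-case the value.
    val transformed = input
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
      .withColumn("value", upper($"value"))

    // Transport: write the transformed records to the destination topic.
    val query = transformed.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "output-topic")
      .option("checkpointLocation", "/tmp/kafka-to-kafka-checkpoint")
      .start()

    query.awaitTermination()
  }
}
```

Running this requires the spark-sql-kafka-0-10 package on the classpath; the checkpoint location is what lets the query recover its position in the source topic after a restart.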

Spark Streaming: Unit Testing DStreams

Frankly, I don’t think there’s any need to tell us, “The Developers”, about the need for proper testing, or unit testing to be precise (QAs, don’t be flattered :P). Unit test cases are the quickest way to know there’s something wrong with our code. “Unit testing is important because it is one of the earliest testing efforts performed on the code, and the earlier defects are detected, the easier Continue Reading
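One common way to exercise DStream logic without a real source is to feed known RDDs through a queue stream and collect the output of each batch. A rough sketch follows (names and timings are illustrative, and a real test would use a test framework rather than a bare assert):

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamTestSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("dstream-test")
    val ssc = new StreamingContext(conf, Seconds(1))

    val input = mutable.Queue[RDD[String]]()     // test input, one RDD per batch
    val results = mutable.ListBuffer[String]()   // collected output

    // The logic under test: upper-case every record.
    val upper = ssc.queueStream(input).map(_.toUpperCase)
    upper.foreachRDD(rdd => results ++= rdd.collect())

    ssc.start()
    input += ssc.sparkContext.parallelize(Seq("spark", "streaming"))
    Thread.sleep(2000)                           // let at least one batch run
    ssc.stop(stopSparkContext = true, stopGracefully = false)

    assert(results.contains("SPARK") && results.contains("STREAMING"))
  }
}
```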


Assimilation of Spark Streaming With Kafka

As we know, Spark is used at a wide range of organizations to process large datasets. It seems like Spark is becoming mainstream. In this blog we will talk about the assimilation of Spark Streaming with Kafka. So, let's get started. How can Kafka be integrated with Spark? Kafka provides a messaging and integration platform for Spark Streaming: it acts as the central hub for real-time streams of Continue Reading
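At a high level, the integration looks like the following sketch, which uses the spark-streaming-kafka-0-10 direct stream to subscribe to a topic and word-count the message values of each batch (broker address, topic, and group id are placeholders):

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object KafkaSparkStreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("kafka-spark-streaming")
    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "spark-demo-group",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Subscribe to the topic; Kafka acts as the source feeding each micro-batch.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Array("demo-topic"), kafkaParams))

    // Count words in the message values of every 5-second batch.
    stream.map(_.value)
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```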


What’s new in Apache Spark 2.2

Apache recently released a newer version of Spark, i.e. Apache Spark 2.2. The new version comes with improvements as well as new functionalities. The major addition to this release is Structured Streaming: it has been marked as production-ready and its experimental tag has been removed. Some of the high-level changes and improvements: production-ready Structured Streaming, expanded SQL functionalities, new Continue Reading

Basic Example for Spark Structured Streaming & Kafka Integration

The Spark Streaming integration for Kafka 0.10 is similar in design to the 0.8 Direct Stream approach. It provides simple parallelism, 1:1 correspondence between Kafka partitions and Spark partitions, and access to offsets and metadata. However, because the newer integration uses the new Kafka consumer API instead of the simple API, there are notable differences in usage. This version of the integration is marked as Continue Reading
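The "access to offsets and metadata" part looks roughly like the sketch below: each RDD produced by the direct stream can be cast to HasOffsetRanges to inspect the exact Kafka offsets behind a batch, and offsets can be committed back after processing (broker, topic, and group id are placeholders):

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}

object KafkaOffsetSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setMaster("local[2]").setAppName("kafka-offsets"), Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "offset-demo-group",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Array("demo-topic"), kafkaParams))

    stream.foreachRDD { rdd =>
      // Offset metadata for the Kafka partitions backing this batch
      // (one Spark partition per Kafka partition).
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      offsetRanges.foreach(o =>
        println(s"${o.topic} ${o.partition} ${o.fromOffset} -> ${o.untilOffset}"))

      // ... process the batch here ...

      // Commit the offsets back to Kafka once the batch has been handled.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```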

Spark Streaming vs Kafka Stream

The demand for stream processing is increasing a lot these days. The reason is that often, processing big volumes of data is not enough. Data has to be processed fast, so that a firm can react to changing business conditions in real time. Stream processing is the real-time processing of data, continuously and concurrently. It is the ideal way to process data streams or Continue Reading


Getting Started with Apache Spark

Apache Spark is an open-source big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 in UC Berkeley’s AMPLab and open-sourced in 2010, later becoming an Apache project. Spark has several advantages compared to other big data and MapReduce technologies such as Hadoop and Storm. Apache Spark is an improvement on the original Hadoop MapReduce Continue Reading

Streaming with Apache Spark Custom Receiver

Hello inquisitor! In a previous blog, we looked at the predefined stream receivers of Spark. In this blog we are going to discuss the custom receiver of Spark, so that we can source data from any source. If we want to use a custom receiver, then we should first know that we are not going to use SparkSession as the entry point; if there are Continue Reading
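A custom receiver boils down to extending Receiver and calling store() from a background thread. The following sketch (class name, host, and port are illustrative) reads lines from a socket and is plugged into a StreamingContext rather than a SparkSession:

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import org.apache.spark.streaming.{Seconds, StreamingContext}

class SocketLineReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  // Called by Spark when the receiver starts: do the work on a separate thread
  // so that onStart() returns immediately.
  override def onStart(): Unit = {
    new Thread("Socket Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  // Nothing to do here: receive() exits once isStopped becomes true.
  override def onStop(): Unit = {}

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped && line != null) {
        store(line)                 // hand the record over to Spark
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      restart("Trying to connect again")
    } catch {
      case e: java.io.IOException => restart("Error receiving data", e)
    }
  }
}

object CustomReceiverSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("custom-receiver")
    val ssc = new StreamingContext(conf, Seconds(5))   // entry point is the StreamingContext
    val lines = ssc.receiverStream(new SocketLineReceiver("localhost", 9999))
    lines.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```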
