Spark Streaming

Kafka and Spark Streams: The Happily Ever After!!

Hi everyone! Today we are going to understand a bit about using Spark Streaming to transform and transport data between Kafka topics. The demand for stream processing is increasing every day. The reason is that often, processing big volumes of data is not enough. We need real-time processing of data, especially when we need to handle continuously increasing volumes of data and also need Continue Reading
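For readers skimming this index, here is a minimal sketch of what such a pipeline can look like with Structured Streaming: read from one Kafka topic, transform the values, and write them to another topic. The broker address, the topic names and the uppercase transformation are illustrative assumptions rather than the post's code, and the spark-sql-kafka-0-10 connector is assumed to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, upper}

// A minimal sketch (not the post's code): read from one Kafka topic, transform the
// values, and write them to another topic with Structured Streaming. The broker
// address and the topic names "input-topic"/"output-topic" are placeholders.
object KafkaToKafka extends App {

  val spark = SparkSession.builder()
    .appName("kafka-to-kafka")
    .master("local[*]")
    .getOrCreate()

  val input = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "input-topic")
    .load()

  // Kafka rows carry binary key/value columns; cast them to strings and transform.
  val transformed = input
    .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
    .withColumn("value", upper(col("value")))

  // The Kafka sink expects a "value" (and optionally "key") column and a checkpoint.
  val query = transformed.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "output-topic")
    .option("checkpointLocation", "/tmp/kafka-to-kafka-checkpoint")
    .start()

  query.awaitTermination()
}
```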

Spark Streaming: Unit Testing DStreams

Frankly, I don’t think there’s any need to tell us, “The Developers”, about the need for proper testing, or unit testing to be precise (QAs, don’t be flattered :P). Unit test cases are the quickest way to know there’s something wrong with our code. “Unit testing is important because it is one of the earliest testing efforts performed on the code, and the earlier defects are detected, the easier Continue Reading
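As a flavour of what the post covers, here is a minimal, self-contained sketch (assumed setup, not the post's code) of exercising a DStream transformation in a test: feed known RDDs through queueStream, collect each micro-batch into a buffer, and assert on the result with whatever test framework you prefer.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// A minimal sketch (assumed setup, not the post's code) of unit testing a DStream
// transformation: feed known RDDs through queueStream, collect each micro-batch,
// and assert on the result.
object DStreamWordCountTest extends App {

  val conf = new SparkConf().setMaster("local[2]").setAppName("dstream-test")
  val ssc  = new StreamingContext(conf, Seconds(1))

  // The "input" stream is built from an in-memory RDD instead of a real source.
  val rddQueue = mutable.Queue(ssc.sparkContext.parallelize(Seq("spark", "kafka", "spark")))
  val words    = ssc.queueStream(rddQueue, oneAtATime = true)

  // The transformation under test: a simple word count.
  val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

  // Collect every micro-batch into a driver-side buffer so we can assert on it later.
  val results = mutable.ArrayBuffer.empty[(String, Int)]
  counts.foreachRDD(rdd => results.synchronized { results ++= rdd.collect() })

  ssc.start()
  ssc.awaitTerminationOrTimeout(5000)
  ssc.stop(stopSparkContext = true, stopGracefully = false)

  assert(results.toSet == Set(("spark", 2), ("kafka", 1)))
  println(s"Test passed: $results")
}
```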


Assimilation of Spark Streaming With Kafka

As we know, Spark is used at a wide range of organizations to process large datasets, and it seems like Spark is becoming mainstream. In this blog we will talk about the assimilation of Spark Streaming with Kafka. So, let’s get started. How can Kafka be integrated with Spark? Kafka provides a messaging and integration platform for Spark Streaming. Kafka acts as the central hub for real-time streams of Continue Reading
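For the impatient, here is a minimal sketch of the integration the post describes: a Kafka direct stream feeding a DStream word count. The broker address, topic and group id are placeholders, and the spark-streaming-kafka-0-10 artifact is assumed to be on the classpath.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// A minimal sketch (assumed broker, topic and group id; not the post's code) of the
// integration: a direct stream from Kafka feeding a DStream word count.
object KafkaSparkStreaming extends App {

  val conf = new SparkConf().setMaster("local[2]").setAppName("kafka-spark-streaming")
  val ssc  = new StreamingContext(conf, Seconds(5))

  val kafkaParams = Map[String, Object](
    "bootstrap.servers"  -> "localhost:9092",
    "key.deserializer"   -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id"           -> "spark-streaming-demo",
    "auto.offset.reset"  -> "latest"
  )

  val stream = KafkaUtils.createDirectStream[String, String](
    ssc,
    PreferConsistent,
    Subscribe[String, String](Seq("demo-topic"), kafkaParams)
  )

  // Each record is a Kafka ConsumerRecord; only its value matters for the word count.
  stream.map(_.value())
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
    .print()

  ssc.start()
  ssc.awaitTermination()
}
```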


What’s new in Apache Spark 2.2

Apache recently released a newer version of Spark, i.e. Apache Spark 2.2. The new version comes with new improvements as well as the addition of new functionalities. The major addition to this release is Structured Streaming. It has been marked as production-ready and its experimental tag has been removed. Some of the high-level changes and improvements: production-ready Structured Streaming, expanded SQL functionalities, new Continue Reading

Basic Example for Spark Structured Streaming & Kafka Integration

The Spark Streaming integration for Kafka 0.10 is similar in design to the 0.8 Direct Stream approach. It provides simple parallelism, 1:1 correspondence between Kafka partitions and Spark partitions, and access to offsets and metadata. However, because the newer integration uses the new Kafka consumer API instead of the simple API, there are notable differences in usage. This version of the integration is marked as Continue Reading
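The snippet below is a minimal sketch (assumed broker, topic and group id) of the "access to offsets and metadata" part mentioned above: with the 0.10 direct stream, each batch RDD implements HasOffsetRanges, and offsets can be committed back to Kafka through CanCommitOffsets once the batch has been processed.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// A minimal sketch (assumed broker, topic and group id) of offset access with the
// Kafka 0.10 direct stream: each batch RDD implements HasOffsetRanges, and offsets
// can be committed back to Kafka through CanCommitOffsets once the batch is done.
object OffsetAwareStream extends App {

  val ssc = new StreamingContext(
    new SparkConf().setMaster("local[2]").setAppName("offset-aware-stream"), Seconds(5))

  val kafkaParams = Map[String, Object](
    "bootstrap.servers"  -> "localhost:9092",
    "key.deserializer"   -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id"           -> "offset-demo",
    "enable.auto.commit" -> (false: java.lang.Boolean)
  )

  val stream = KafkaUtils.createDirectStream[String, String](
    ssc, PreferConsistent, Subscribe[String, String](Seq("demo-topic"), kafkaParams))

  stream.foreachRDD { rdd =>
    // One OffsetRange per Kafka partition (the 1:1 partition correspondence in action).
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    offsetRanges.foreach(o =>
      println(s"${o.topic} partition ${o.partition}: ${o.fromOffset} to ${o.untilOffset}"))

    // Process the batch, then commit its offsets back to Kafka ourselves.
    println(s"records in batch: ${rdd.count()}")
    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  }

  ssc.start()
  ssc.awaitTermination()
}
```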

Spark Streaming vs Kafka Stream

The demand for stream processing is increasing a lot these days. The reason is that often processing big volumes of data is not enough. Data has to be processed fast, so that a firm can react to changing business conditions in real time. Stream processing is the real-time processing of data, continuously and concurrently. Stream processing is the ideal platform to process data streams or Continue Reading


Getting Started with Apache Spark

Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 in UC Berkeley’s AMPLab, and open sourced in 2010 as an Apache project. Spark has several advantages compared to other big data and MapReduce technologies like Hadoop and Storm. Apache Spark is an improvement on the original Hadoop MapReduce Continue Reading

Streaming with Apache Spark Custom Receiver

Hello inquisitor. In the previous blog we looked at the predefined stream receivers of Spark. In this blog we are going to discuss the custom receiver of Spark, so that we can source data from any data source. So if we want to use a custom receiver, then we should first know that we are not going to use SparkSession as the entry point; if there are Continue Reading
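As a taste of the pattern, here is a minimal sketch (not the post's code) of a custom receiver: extend Receiver, push records into Spark with store(), and plug it in via ssc.receiverStream. The socket source, host and port here are only illustrative; the same skeleton works for any source.

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.receiver.Receiver

// A minimal sketch (not the post's code) of a custom receiver: extend Receiver,
// push records into Spark with store(), and plug it in via ssc.receiverStream.
// The socket source, host and port are only illustrative.
class SocketLineReceiver(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  // onStart must not block: spawn a background thread that reads and stores data.
  def onStart(): Unit = {
    new Thread("socket-line-receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  // Nothing to clean up here; the reading thread checks isStopped() on its own.
  def onStop(): Unit = ()

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped() && line != null) {
        store(line) // hand each record over to Spark Streaming
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      restart("Trying to connect again")
    } catch {
      case e: Exception => restart("Error receiving data", e)
    }
  }
}

object CustomReceiverApp extends App {
  // Note: the entry point here is a StreamingContext, not SparkSession.
  val ssc = new StreamingContext(
    new SparkConf().setMaster("local[2]").setAppName("custom-receiver"), Seconds(5))

  val lines = ssc.receiverStream(new SocketLineReceiver("localhost", 9999))
  lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _).print()

  ssc.start()
  ssc.awaitTermination()
}
```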

Streaming with Apache Spark 2.0

Hello geeks, we discussed Apache Spark 2.0 with Hive in an earlier blog. Now I am going to describe how we can use Spark to stream data. First, we need to understand this new Spark Streaming architecture.
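To make the new architecture concrete, here is a minimal sketch of Structured Streaming as it arrived with Spark 2.0: a streaming word count over a socket source expressed with the DataFrame/Dataset API instead of DStreams. The host and port are assumptions for local experimentation.

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch of the new streaming architecture that arrived with Spark 2.0
// (Structured Streaming): a streaming word count over a socket source, written with
// the DataFrame/Dataset API instead of DStreams. Host and port are assumptions.
object StructuredSocketWordCount extends App {

  val spark = SparkSession.builder()
    .appName("structured-socket-wordcount")
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._

  // readStream yields an unbounded DataFrame; each new line on the socket is a row.
  val lines = spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()

  val counts = lines.as[String]
    .flatMap(_.split("\\s+"))
    .groupBy("value")
    .count()

  // Streaming aggregations need the "complete" (or "update") output mode.
  val query = counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()

  query.awaitTermination()
}
```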

MeetUp on “An Overview of Spark DataFrames with Scala”

Knoldus is organizing a one-hour session on 18th Nov 2015 at 6:00 PM. The topic will be An Overview of Spark DataFrames with Scala. All of you are invited to join this session. Address: 30/29, First Floor, Above UCO Bank, Near Rajendra Place Metro Station, New Delhi, India. Please click here for more details.

Gnip using Spark Streaming: An Apache Spark Utility to pull Tweets from Gnip in realtime

We are all familiar with Gnip, Inc., which provides data from dozens of social media websites via a single API. It is also known as the Grand Central Station for the social media web. One of its popular APIs is PowerTrack, which provides Tweets from Twitter in realtime along with the ability to filter Twitter’s full firehose, giving its customers only what they are interested in. This Continue Reading

Stateful transformation on DStream in Apache Spark with an example of wordcount

Sometimes we have a use case in which we need to maintain the state of a paired DStream to use it in the next DStream. So we are taking an example of stateful wordcount with socketTextStream. As in the wordcount example, if the word “xyz” comes twice in the first DStream or window, it is reduced and its value is 2, but its state will be lost in the next DStream Continue Reading
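A minimal sketch of the idea (not the post's exact code): updateStateByKey keeps the running count alive from one micro-batch to the next, which a plain reduceByKey would not. The checkpoint directory, host and port are assumptions.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// A minimal sketch (not the post's exact code): updateStateByKey keeps the running
// count alive from one micro-batch to the next, unlike a plain reduceByKey.
// The checkpoint directory, host and port are assumptions.
object StatefulWordCount extends App {

  val ssc = new StreamingContext(
    new SparkConf().setMaster("local[2]").setAppName("stateful-wordcount"), Seconds(5))
  ssc.checkpoint("/tmp/stateful-wordcount-checkpoint") // required for stateful operations

  // newValues holds this batch's counts for a word, runningCount the state so far.
  def updateFunc(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
    Some(newValues.sum + runningCount.getOrElse(0))

  val lines  = ssc.socketTextStream("localhost", 9999)
  val counts = lines.flatMap(_.split("\\s+"))
    .map(word => (word, 1))
    .updateStateByKey[Int](updateFunc _)

  counts.print()

  ssc.start()
  ssc.awaitTermination()
}
```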

Meetup: Introduction to Spark with Scala

Knoldus organized a Meetup on Wednesday, 1 April 2015. In this Meetup, we gave a brief Introduction to Spark with Scala. Apache Spark is a fast and general engine for large-scale data processing. A wide range of organizations are using it to process large datasets. Many Spark and Scala enthusiasts attended this session and got an insight into Apache Spark. Examples shown in the above slides can be downloaded from Continue Reading
