Apache Spark

CuriosityX: RDDs – The backbone of Apache Spark

In our last blog, we looked at using Spark Streaming to transform and transport data between Kafka topics. After reading it, many readers asked us for a brief description of the RDDs we used. So this blog is dedicated entirely to RDDs in Spark. Let’s start with the very basic question that comes to mind Continue Reading

They said Spark Streaming simply means Discretized Stream

I work at a company (Knoldus Software LLP) where Apache Spark practically runs in people’s blood, meaning there are people here who are really good at it. If you ever visit our blogging page and search for Spark-related posts, you will find enough content to answer most of your Spark-related queries, starting from introductions to solutions for Continue Reading

Difference between Apache Hadoop and Apache Spark MapReduce

The term Big Data has already created a lot of hype in the business world. Hadoop and Spark are both Big Data frameworks: they provide some of the most popular tools for carrying out common Big Data tasks. In this blog, we will cover the difference between Apache Hadoop and Apache Spark MapReduce. Introduction: Spark – It is an open source Continue Reading

Assimilation of Spark Streaming With Kafka

As we know, Spark is used at a wide range of organizations to process large datasets, and it seems to be becoming mainstream. In this blog we will talk about the assimilation of Spark Streaming with Kafka. So, let’s get started. How can Kafka be integrated with Spark? Kafka provides a messaging and integration platform for Spark Streaming. Kafka acts as the central hub for real-time streams of Continue Reading

What’s new in Apache Spark 2.2

Apache recently released a newer version of Spark, i.e. Apache Spark 2.2. The new version comes with improvements as well as new functionality. The major change in this release concerns Structured Streaming, which has been marked as production ready and has had its experimental tag removed. Some of the high-level changes and improvements: production-ready Structured Streaming, expanded SQL functionality, new Continue Reading

Spark Structured Streaming: A Simple Definition

“Structured Streaming” is a term we hear quite a lot in the Apache Spark ecosystem these days, as it is being promoted as the next big thing in the scalable big data world. Although we all know that Structured Streaming means a stream carrying structured data, very few of us know what exactly it is and where we can use it. So, in this blog post Continue Reading

Getting Started with Apache Spark

Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 in UC Berkeley’s AMPLab and open sourced in 2010, later becoming an Apache project. Spark has several advantages compared to other big data and MapReduce technologies like Hadoop and Storm. Apache Spark is an improvement on the original Hadoop MapReduce Continue Reading

The Dominant APIs of Spark: Datasets, DataFrames and RDDs

While working with Spark, we often come across three APIs: DataFrames, Datasets and RDDs. In this blog I will discuss the three in terms of use case, performance and optimization. It is essential to keep in mind that seamless conversion is available between DataFrames, Datasets and RDDs. Under the hood, the RDD forms the foundation of both DataFrames and Datasets. The inception of Continue Reading

Partition-Aware Data Loading in Spark SQL

Data loading, in Spark SQL, means loading data into the memory/cache of Spark worker nodes, for which we typically write the following code:

    val connectionProperties = new Properties()
    connectionProperties.put("user", "username")
    connectionProperties.put("password", "password")
    val jdbcDF = spark.read
      .jdbc("jdbc:postgresql:dbserver", "schema.table", connectionProperties)

Here we are using the jdbc function of the DataFrameReader API of Spark SQL to load the data from the table into the Spark executors’ memory, no matter how many rows are Continue Reading
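The plain jdbc call above reads the table through a single connection into a single partition. One way to make the load partition-aware, sketched here under assumptions and not necessarily the approach the full post takes, is to hand Spark one WHERE-clause predicate per partition through the jdbc overload of DataFrameReader that accepts an Array[String]. The helper name rangePredicates, the column name id and the bounds are hypothetical, and the splitting logic is a simplified illustration, not Spark's internal algorithm.

```scala
object PartitionPredicates {

  // Hypothetical helper (not part of Spark): splits the numeric range
  // [lower, upper) of `column` into `numPartitions` non-overlapping
  // WHERE-clause predicates, one per Spark partition.
  def rangePredicates(column: String, lower: Long, upper: Long, numPartitions: Int): Array[String] = {
    require(numPartitions >= 2, "use the plain jdbc call for a single partition")
    require(upper - lower >= numPartitions, "range too small for that many partitions")
    val stride = (upper - lower) / numPartitions
    (0 until numPartitions).map { i =>
      val start = lower + i * stride
      val end = start + stride
      // The first slice also catches NULLs and values below `lower`;
      // the last slice is open-ended so values at or above `upper` are kept.
      if (i == 0) s"$column < $end OR $column IS NULL"
      else if (i == numPartitions - 1) s"$column >= $start"
      else s"$column >= $start AND $column < $end"
    }.toArray
  }

  def main(args: Array[String]): Unit = {
    val predicates = rangePredicates("id", 0L, 100L, 4)
    predicates.foreach(println)
    // Assuming a SparkSession `spark` and the `connectionProperties` defined
    // earlier, the predicates could be passed to the reader like this:
    // val jdbcDF = spark.read.jdbc("jdbc:postgresql:dbserver", "schema.table", predicates, connectionProperties)
  }
}
```

Each predicate becomes one partition of the resulting DataFrame, so the predicates need to be mutually exclusive and together cover every row; otherwise rows would be duplicated or dropped.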

Migration From Spark 1.x to Spark 2.x

Hello folks, as we know, Spark 2.0 has been released with many enhancements and new features. If you are using Spark 1.x and want to move your application to Spark 2.0, you have to take care of some changes in the API. In this blog we are going to get an overview of the common changes: Continue Reading

Spark – LDA : A Complete example of clustering algorithm for topic discovery.

In this blog we will demonstrate applying the full ML pipeline to a set of documents, in this case 10 books from the internet. So let’s start with first things first. What is clustering? Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a Continue Reading

Spark – IoT : Combining Big Data Analysis with IoT

Welcome back, folks! Time for a new gig! I think the last series, i.e. Scala – IOT, was pretty amazing and got an overwhelming response from you all, which pumped up the idea for this new web series, Spark-IOT. So let’s get started. What was the motivation? I have been active in the IoT community here, and I found Continue Reading

Streaming with Apache Spark Custom Receiver

Hello inquisitor. In the previous blog we looked at the predefined stream receivers of Spark. In this blog we are going to discuss the custom receiver of Spark, so that we can source data from any . So if we want to use a custom receiver, we should first know that we are not going to use SparkSession as the entry point; if there are Continue Reading
