They said Spark Streaming simply means Discretized Stream


I work at a company (Knoldus Software LLP) where Apache Spark practically runs in people's blood, meaning there are certain people who are really good at it. If you ever visit our blogging page and search for stuff…


Apache Hadoop vs Apache Spark


The term Big Data has already created a lot of hype in the business world. Hadoop and Spark are both Big Data frameworks – they provide some of the most popular tools used to carry out common Big Data-related tasks.…


Assimilation of Spark Streaming With Kafka


As we know, Spark is used at a wide range of organizations to process large datasets, and it seems Spark is becoming mainstream. In this blog we will talk about the integration of Kafka with Spark Streaming. So, let's get started. How Kafka…
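
For a concrete picture of what that integration looks like, here is a minimal sketch using the direct-stream approach from the spark-streaming-kafka-0-10 connector; the topic name, broker address, consumer group and 5-second batch interval are placeholder assumptions, not values from the post:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val conf = new SparkConf().setAppName("KafkaSparkStreaming").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(5))   // 5-second micro-batches (assumed)

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",          // assumed broker address
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "spark-streaming-group",            // assumed consumer group
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Direct stream: each Kafka partition maps to one Spark partition.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Array("test-topic"), kafkaParams)
    )

    stream.map(record => (record.key, record.value)).print()

    ssc.start()
    ssc.awaitTermination()

Each element of the stream is a Kafka ConsumerRecord, so the key and value are pulled out before any further transformation.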


What’s new in Apache Spark 2.2


Apache recently released a newer version of Spark, i.e. Apache Spark 2.2. The new version comes with improvements as well as new functionality. The major addition in this release is Structured Streaming. It has been marked as production…


Spark Structured Streaming: A Simple Definition


"Structured Streaming", nowadays we are hearing this term in Apache Spark ecosystem quite a lot, as it is being preached as next big thing in scalable big data world. Although, we all know that Structured Streaming means a stream having…


Getting Started with Apache Spark


Introduction: Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 in UC Berkeley’s AMPLab and open sourced in 2010, later becoming an Apache project. Spark…


The Dominant APIs of Spark: Datasets, DataFrames and RDDs


While working with Spark, we often come across three APIs: DataFrames, Datasets and RDDs. In this blog I will discuss the three in terms of use case, performance and optimization. It is essential to keep in mind that there…
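
For a quick side-by-side feel of the three APIs, here is a minimal sketch; the Person case class and the sample data are made up for illustration:

    import org.apache.spark.sql.SparkSession

    case class Person(name: String, age: Int)   // hypothetical sample type

    val spark = SparkSession.builder()
      .appName("SparkApiComparison")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Alice", 29), Person("Bob", 35))

    // RDD: low-level and compile-time typed, but no Catalyst optimization
    val rdd = spark.sparkContext.parallelize(people)
    rdd.filter(_.age > 30).collect().foreach(println)

    // DataFrame: rows with a schema, optimized by Catalyst, untyped at compile time
    val df = people.toDF()
    df.filter($"age" > 30).show()

    // Dataset: the typed API that still goes through the Catalyst optimizer
    val ds = people.toDS()
    ds.filter(_.age > 30).show()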


Partition-Aware Data Loading in Spark SQL


Data loading, in Spark SQL, means loading data into the memory/cache of the Spark worker nodes, for which we usually write the following code:

    val connectionProperties = new Properties()
    connectionProperties.put("user", "username")
    connectionProperties.put("password", "password")
    val jdbcDF = spark.read
      .jdbc("jdbc:postgresql:dbserver", "schema.table", connectionProperties)

In here we are…
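
For contrast, one common way to make that load partition-aware is the jdbc overload that takes a partition column and bounds, so Spark splits the read into parallel queries instead of a single scan; the column name, bounds and partition count below are assumptions, not values from the post:

    import java.util.Properties
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("PartitionAwareJdbcLoad")
      .master("local[*]")
      .getOrCreate()

    val connectionProperties = new Properties()
    connectionProperties.put("user", "username")
    connectionProperties.put("password", "password")

    // Spark issues one JDBC query per partition, striding over the numeric
    // "id" column (assumed) between lowerBound and upperBound.
    val jdbcDF = spark.read.jdbc(
      "jdbc:postgresql:dbserver",   // url
      "schema.table",               // table
      "id",                         // partition column (assumed numeric)
      1L,                           // lowerBound
      1000000L,                     // upperBound
      10,                           // numPartitions
      connectionProperties
    )

With this overload each worker pulls only its slice of the table, so the load is spread across the cluster rather than funnelled through a single connection.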


Migration From Spark 1.x to Spark 2.x


Hello folks, as we know, we now have the latest release of Spark 2.0, with many enhancements and new features. If you are using Spark 1.x and want to move your application to Spark 2.0, then you…
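
One of the most visible changes when migrating is the entry point: SparkSession replaces SQLContext and HiveContext. Here is a minimal before/after sketch, with the app name and input path being hypothetical:

    import org.apache.spark.sql.SparkSession

    // Spark 1.x style:
    //   val conf = new SparkConf().setAppName("MyApp")
    //   val sc = new SparkContext(conf)
    //   val sqlContext = new SQLContext(sc)

    // Spark 2.x: a single SparkSession subsumes SQLContext and HiveContext,
    // and the underlying SparkContext is still reachable from it.
    val spark = SparkSession.builder()
      .appName("MyApp")
      .master("local[*]")
      .getOrCreate()

    val sc = spark.sparkContext
    val df = spark.read.json("people.json")   // hypothetical input file
    df.show()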


Spark – LDA: A Complete example of a clustering algorithm for topic discovery.


In this blog we will demonstrate applying a full ML pipeline over a set of documents, in this case 10 books from the internet. So let's start with first things first. What…
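
To give a feel for the kind of pipeline involved, here is a minimal sketch that tokenizes the documents, removes stop words, builds term counts and fits an LDA model with spark.ml; the input path, vocabulary size, topic count and iteration count are assumptions for illustration:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.clustering.{LDA, LDAModel}
    import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer, StopWordsRemover}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("LdaTopicDiscovery")
      .master("local[*]")
      .getOrCreate()

    // Assumed layout: the books live as plain-text files under books/ (hypothetical path)
    val docs = spark.read.textFile("books/*").toDF("text")

    val tokenizer  = new RegexTokenizer().setInputCol("text").setOutputCol("tokens").setPattern("\\W+")
    val remover    = new StopWordsRemover().setInputCol("tokens").setOutputCol("filtered")
    val vectorizer = new CountVectorizer().setInputCol("filtered").setOutputCol("features").setVocabSize(10000)
    val lda        = new LDA().setK(10).setMaxIter(50)   // 10 topics, 50 iterations (assumed)

    val pipeline = new Pipeline().setStages(Array(tokenizer, remover, vectorizer, lda))
    val model    = pipeline.fit(docs)

    // Show the top 5 terms per topic; term indices refer to the CountVectorizer vocabulary.
    val ldaModel = model.stages.last.asInstanceOf[LDAModel]
    ldaModel.describeTopics(5).show(false)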