Apache Spark


Tuning a Spark Application

Having trouble optimizing your Spark application? If so, this blog will guide you through what can be optimized and which parameters should be tuned so that your Spark application gives the best performance. Spark applications can hit bottlenecks in resources such as CPU, memory, and network. We need to tune memory usage, data structures, and how RDDs need to Continue Reading
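As a taste of the knobs involved, here is a minimal sketch of setting a few common resource and serialization parameters through SparkConf; the values shown are illustrative assumptions, not recommendations:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Illustrative values only; the right settings depend on your cluster and workload.
val conf = new SparkConf()
  .setAppName("tuned-app")
  .set("spark.executor.memory", "4g")      // memory per executor
  .set("spark.executor.cores", "4")        // cores per executor
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // faster serialization
  .set("spark.default.parallelism", "200") // default partition count for RDD shuffles

val spark = SparkSession.builder().config(conf).getOrCreate()
```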

HDFS: A Conceptual View

There has been a significant boom in distributed computing over the past few years. Various components communicate with each other over the network in spite of being deployed on different physical machines. A distributed file system (DFS) is a file system with data stored on a server. The data is accessed and processed as if it were stored on the local client machine. The DFS makes it convenient to share information Continue Reading
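To get a feel for the "as if it were local" idea, here is a minimal sketch using Hadoop's FileSystem API; the namenode URI and path are hypothetical placeholders:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// The namenode URI and directory below are hypothetical placeholders.
val conf = new Configuration()
conf.set("fs.defaultFS", "hdfs://namenode:8020")

val fs = FileSystem.get(conf)

// Listing a directory looks just like working with a local file system,
// even though the blocks live on remote datanodes.
fs.listStatus(new Path("/user/data")).foreach(status => println(status.getPath))
```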

Spark: Why should we use SparkSession?

Spark 2.0 is the next major release of Apache Spark. It brings a major change in the level of abstraction of the Spark API and libraries. For those who want to make use of all the advancements in this release, this blog post discusses SparkSession. Need of SparkSession
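As a quick preview of that unified entry point, here is a minimal sketch; the application name and local master are illustrative:

```scala
import org.apache.spark.sql.SparkSession

// SparkSession gathers the older entry points behind a single builder.
val spark = SparkSession.builder()
  .appName("spark-session-demo") // illustrative name
  .master("local[*]")            // run locally for the example
  .getOrCreate()

// DataFrame and SQL work hang off the same session; sparkContext is still available.
val df = spark.range(5).toDF("id")
df.show()
spark.stop()
```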

Spark vs MapReduce: Which is better?

Both technologies are equipped with amazing features; however, with the increased need for real-time analytics, the two are giving each other tough competition. What are MapReduce and Spark? MapReduce: MapReduce is a programming model for processing huge amounts of data in a parallel and distributed manner. In this model, two tasks are undertaken, Map and Reduce, and there is a map function Continue Reading
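To make the Map and Reduce roles concrete, here is a minimal word-count sketch written with Spark's RDD API (the input path is a placeholder): the map step emits (word, 1) pairs and the reduce step sums them per word.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("word-count").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// "Map" phase: split lines into words and emit (word, 1) pairs.
// "Reduce" phase: sum the counts per word.
val counts = sc.textFile("input.txt") // placeholder path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.take(10).foreach(println)
spark.stop()
```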

Spark Structured Streaming with Elasticsearch

We have been working with streaming data for a long time, and Apache Spark makes it very convenient. Spark provides two APIs for streaming data: one is Spark Streaming, a separate library provided by Spark, and the other is Structured Streaming, which is built upon the Spark SQL library. We will discuss the trade-offs and differences between these two libraries in Continue Reading
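For a flavour of the Structured Streaming side of that comparison, here is a minimal sketch (host and port are placeholders) of the classic socket word count expressed as an ordinary DataFrame query:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("structured-socket").master("local[*]").getOrCreate()
import spark.implicits._

// Structured Streaming: the stream is just a DataFrame built on the Spark SQL engine.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost") // placeholder source
  .option("port", "9999")
  .load()

val wordCounts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()
```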


RDD: Spark’s Fault-Tolerant In-Memory Weapon

A fault-tolerant collection of elements that can be operated on in parallel: the “Resilient Distributed Dataset”, a.k.a. RDD. The RDD is the fundamental data structure of Apache Spark: an immutable collection of objects that is computed on different nodes of the cluster. Every dataset in a Spark RDD is logically partitioned across many servers so that it can be computed on Continue Reading
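As a tiny illustration of that partitioned, immutable collection, here is a minimal sketch; the data and partition count are arbitrary:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// An RDD is created once and never mutated; transformations return new RDDs.
val numbers = sc.parallelize(1 to 10, numSlices = 4) // 4 logical partitions

val squares = numbers.map(n => n * n)                // new RDD, original untouched

println(s"partitions = ${numbers.getNumPartitions}")
println(squares.collect().mkString(", "))
spark.stop()
```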


Spark Unconstructed | Deep dive into DAG

Apache Spark is all the rage these days. For people who work with Big Data, Spark is a household name. We have been using it for quite some time now, so we already know that Spark is a lightning-fast cluster computing technology that is faster than Hadoop MapReduce. But if you ask any of these Spark techies how Spark is so fast, they would give you a Continue Reading

CuriosityX: RDDs – The backbone of Apache Spark

In our last blog, we looked at using Spark Streaming to transform and transport data between Kafka topics. After reading it, many readers asked us for a brief description of the RDDs in Spark that we used. So this blog is entirely dedicated to RDDs in Spark. Let’s start with the very basic question that comes to our mind Continue Reading

Code Combat II : The Code Battle For The Vanguard Continues…

“If you can dream it, you can do it.” -Walt Disney. For some, coding is a job. For some, it is an exercise. But for us folks here at Knoldus, it’s a passion. So, in order to bring a twist to the daily work schedule, Knoldus held an overnight Hackathon competition within the organization on 18th May 2018, which presented an opportunity for every Knolder (employees Continue Reading


A quick recap of SPARK + AI SUMMIT 2018

Spark + AI Summit was the world’s largest big data event focused entirely on Apache Spark, assembling 4,000 people (the best engineers, scientists, analysts, and executives) from over 40 countries to share their knowledge and receive expert training on this open-source powerhouse. If you didn’t make it, don’t worry; we’ve got you covered! Vikas Hazrati and Ram Indukuri from Knoldus Inc. attended the Spark + AI Summit 2018, Continue Reading

Spark Stream-Stream Join

Tuning Spark on YARN

In this blog we will learn how to tune Spark on YARN in both yarn-client and yarn-cluster modes. The only requirement to get started is that you must have a Hadoop-based YARN-Spark cluster with you. In case you want to create a cluster, you can follow this blog here. 1. yarn-client mode: In client mode, the driver runs in the client process, and the application master is only used Continue Reading
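As a rough Scala-side sketch of that client-mode setup (the values are illustrative assumptions, and yarn-cluster mode is normally launched through spark-submit instead):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// yarn-client mode: the driver runs here in the client JVM, while YARN hosts
// only the application master and the executors. Values below are illustrative.
val conf = new SparkConf()
  .setAppName("yarn-client-demo")
  .setMaster("yarn")                    // client deploy mode is the default here
  .set("spark.executor.instances", "4")
  .set("spark.executor.memory", "2g")
  .set("spark.yarn.am.memory", "1g")    // memory for the YARN application master

val spark = SparkSession.builder().config(conf).getOrCreate()
```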

Structured Streaming: Philosophy behind it

In our previous blogs, Structured Streaming: What is it? and Structured Streaming: How it works?, we got to know two major points about Structured Streaming: it is a fast, scalable, fault-tolerant, end-to-end, exactly-once stream processing API that helps users build streaming applications, and it treats the live data stream as a table that is continuously appended/updated, which allows us to express our streaming computation as Continue Reading
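To see the "stream as a continuously growing table" idea in code, here is a minimal sketch using the built-in rate source; the rows-per-second and window size are arbitrary illustrative choices:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder().appName("stream-as-table").master("local[*]").getOrCreate()
import spark.implicits._

// The "rate" source emits rows (timestamp, value); conceptually this is a table
// that keeps growing, and the query below is just an ordinary aggregation on it.
val rows = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

val countsPerWindow = rows
  .groupBy(window($"timestamp", "10 seconds"))
  .count()

countsPerWindow.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()
```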

Structured Streaming: How it works?

In our previous blog post, Structured Streaming: What is it?, we got to know that Structured Streaming is a fast, scalable, fault-tolerant, end-to-end, exactly-once stream processing API that helps users build streaming applications. Now it’s time to learn how it works. So, in this blog post, we will look at the working of a structured stream via an example. Let’s take a Continue Reading
