Apache Spark

Exploring Spark Structured Streaming

Reading Time: 6 minutes Hello Spark Enthusiasts, streaming applications are growing more complex, and they are getting difficult to build with current distributed streaming engines. Why is streaming hard? Streaming computations don’t run in isolation. Data arriving out of time order is a problem for batch-style processing, and writing stream processing operations from scratch is not easy. Problems with DStreams include processing with event time, dealing with late data, and interoperating streaming Continue Reading
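The Structured Streaming API the post introduces can be sketched roughly as follows; the socket source, host, and port are illustrative assumptions, not details from the post:

```scala
import org.apache.spark.sql.SparkSession

object StructuredStreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("structured-streaming-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Read lines from a socket source (hypothetical host/port for illustration)
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // A running word count over the unbounded stream
    val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
    query.awaitTermination()
  }
}
```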

Spark Streaming vs Kafka Stream

Reading Time: 4 minutes The demand for stream processing is increasing a lot these days. The reason is that often processing big volumes of data is not enough. Data has to be processed fast, so that a firm can react to changing business conditions in real time. Stream processing is the real-time processing of data continuously and concurrently. Stream processing is the ideal platform to process data streams or Continue Reading

Streaming in Spark, Flink and Kafka

Reading Time: 7 minutes There is a lot of buzz about when to use Spark, when to use Flink, and when to use Kafka. Both Spark Streaming and Flink provide an exactly-once guarantee: every record will be processed exactly once, thereby eliminating any duplicates that might arrive. Both provide very high throughput compared to other processing systems like Storm, and the overhead of Continue Reading

Apache Spark: Reading CSV using a custom timestamp format

Reading Time: 3 minutes In this blog, we consider a situation where I wanted to read a CSV through Spark, but the CSV contains some timestamp columns. Is this going to be a problem while inferring the schema at the time of reading the CSV with Spark? Well, the answer may be no, if the CSV has the timestamp field in the specific yyyy-MM-dd hh:mm:ss format. In this particular case, the Spark CSV reader can Continue Reading
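As a rough sketch of the idea in the teaser, Spark's CSV reader accepts a timestampFormat option for non-default patterns; the file path, header layout, and pattern below are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("csv-custom-timestamp")
  .master("local[*]")
  .getOrCreate()

// timestampFormat tells the CSV reader how to parse timestamp columns that
// don't use the default yyyy-MM-dd HH:mm:ss pattern (path/pattern are placeholders)
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("timestampFormat", "dd/MM/yyyy HH:mm")
  .csv("/path/to/data.csv")

df.printSchema()
```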

Apache Spark: 3 Reasons Why You Should Not Use RDDs

Reading Time: 4 minutes Apache Spark: whenever we hear these two words, the first thing that comes to our mind is RDD, i.e., Resilient Distributed Datasets. Now, it has been more than 5 years since Apache Spark came into existence, and after its arrival a lot of things changed in the big data industry. But the major change was the dethroning of Hadoop MapReduce. I mean, Spark literally replaced MapReduce, and this Continue Reading

Dealing With Deltas In Amazon Redshift

Reading Time: 5 minutes Hi, in this blog I would like to discuss a scenario: implementing deltas in Amazon Redshift using spark-redshift. Before that, I would like to introduce Amazon Redshift, the spark-redshift library, and the integration of Spark with Redshift. It is assumed that you have a fair knowledge of programming in Apache Spark and Spark SQL. You may refer to the documentation links Continue Reading
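A minimal sketch of reading a Redshift table through the spark-redshift data source, assuming placeholder connection details (the JDBC URL, table name, and S3 temp directory below are hypothetical, not from the post):

```scala
// Requires the spark-redshift package on the classpath; all connection
// settings here are placeholders for illustration
val events = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://host:5439/db?user=user&password=pass")
  .option("dbtable", "events")
  .option("tempdir", "s3n://bucket/tmp")
  .load()

events.printSchema()
```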

Apache Spark: Handle null timestamps while reading CSV in Spark 2.0.0

Reading Time: 2 minutes Hello folks, hope you all are doing well! In this blog, I will discuss a problem I faced some days back. One thing to keep in mind is that this problem is specific to Spark version 2.0.0; in other versions it does not occur. Problem: Spark code was reading a CSV file. This particular CSV file had one timestamp column that might Continue Reading
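The post's actual fix sits behind the Continue Reading link; one common workaround for this kind of issue (an assumption on my part, not necessarily the author's solution) is to supply an explicit schema, read the fragile column as a string, and cast it afterwards:

```scala
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.col

// Hypothetical schema: reading the timestamp column as StringType first
// avoids the parser failing on empty values, then we cast explicitly
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("created_at", StringType, nullable = true)
))

val df = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("/path/to/data.csv")            // placeholder path
  .withColumn("created_at", col("created_at").cast(TimestampType))
```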

Getting Started with Apache Spark

Reading Time: 2 minutes Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. It was originally developed in 2009 in UC Berkeley’s AMPLab and open sourced in 2010; it later became an Apache project. Spark has several advantages compared to other big data and MapReduce technologies like Hadoop and Storm. Apache Spark is an improvement on the original Hadoop MapReduce Continue Reading

Introduction to Hadoop!

Reading Time: 4 minutes Here I am going to write a blog on Hadoop! “Big data is not about data! The value in big data [is in] the analytics.” – Harvard Prof. Gary King. So Hadoop came into the picture! Hadoop is an open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is part of the Apache Continue Reading

Apache Spark: Spark union adds up the partitions of input RDDs

Reading Time: 2 minutes Some days back, when I was doing a union of two pair RDDs, I found strange behavior in the number of partitions: the output RDD had a different number of partitions than the input RDDs. For example, suppose rdd1 and rdd2 each have two partitions; after a union of these RDDs I was expecting the same number of partitions for the output RDD, but the output RDD got the Continue Reading
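The behavior described can be reproduced on a local SparkContext; this sketch assumes rdd1 and rdd2 carry no partitioner, in which case union simply concatenates the partitions of its inputs:

```scala
// Two pair RDDs with two partitions each (no partitioner set)
val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b")), numSlices = 2)
val rdd2 = sc.parallelize(Seq((3, "c"), (4, "d")), numSlices = 2)

val unioned = rdd1.union(rdd2)
// Union does not merge partitions: 2 + 2 = 4 partitions, not 2
println(unioned.getNumPartitions)
```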

Reading data from different sources using Spark 2.1

Reading Time: 2 minutes Hi all, in this blog we’ll discuss fetching data from different sources using Spark 2.1, such as CSV, JSON, text, and Parquet files. So first of all, let’s discuss what’s new in Spark 2.1. In previous versions of Spark, you had to create a SparkConf and SparkContext to interact with Spark, whereas in Spark 2.1 the same effects can be achieved through SparkSession, without explicitly Continue Reading
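The unified SparkSession entry point mentioned above can be sketched like this; the file paths are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// SparkSession replaces the SparkConf/SparkContext pair from earlier versions
val spark = SparkSession.builder()
  .appName("multi-source-read")
  .master("local[*]")
  .getOrCreate()

// One reader, several formats (paths are hypothetical)
val csvDf     = spark.read.option("header", "true").csv("/data/sample.csv")
val jsonDf    = spark.read.json("/data/sample.json")
val textDf    = spark.read.text("/data/sample.txt")
val parquetDf = spark.read.parquet("/data/sample.parquet")
```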

Spark Cassandra Connector On Spark-Shell

Reading Time: 2 minutes Using the Spark-Cassandra-Connector on the Spark shell. Hi all, in this blog we will see how we can execute our Spark code against Cassandra on the Spark shell. This is very efficient for testing or learning, where we have to execute our code on the Spark shell rather than in an IDE. Here we will use Spark version 1.6.2; you can download Continue Reading
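A minimal sketch of what using the connector on the shell looks like; the package coordinates, keyspace, and table names below are assumptions for illustration:

```scala
// Launch the shell with the connector, e.g. (connector version is an
// assumption matching Spark 1.6.x):
//   ./bin/spark-shell \
//     --packages com.datastax.spark:spark-cassandra-connector_2.10:1.6.2 \
//     --conf spark.cassandra.connection.host=127.0.0.1
import com.datastax.spark.connector._

// Read a Cassandra table as an RDD (hypothetical keyspace/table)
val rows = sc.cassandraTable("test_keyspace", "words")
println(rows.count())
```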

Introduction to Structured Streaming

Reading Time: < 1 minute Hello!! Knoldus organized a half-hour session on Structured Streaming, covering the API changes, how it differs from the earlier stream computation paradigm (DStreams), and an example API demonstration. Hope you will enjoy it. Below are the slides and video from the session.